Architecture for Generic Dataset/Report Generation

Diagrams

Component Diagram

Sequence Diagram

This is a large diagram and it can be loaded into sequencediagram.org for a better visualization.

IDatasetBuilder, IDataset and DatasetExtensions

Inside the Mobilize.ReportGenerator project, there are multiple classes related to the generation of custom datasets.

To generate a custom dataset

  1. Initialize a new SimpleDatasetBuilder<D> with a data source (an IEnumerable<D>).

    1. The data source should contain all the information that is going to be used. If data from multiple "tables" or lists is required, then it will be necessary to perform Joins (System.Linq.Join) or create custom data structures that are able to hold all the necessary information.

  2. Add parameters as necessary, using the IDatasetBuilder<D>.Parameterize<P> method

    1. You can add as many parameters as you want

  3. Build the dataset with the structure you want

    1. Use the different "Build" methods that are available in the IDatasetBuilder<D> interface

  4. Apply additional transformations to the dataset, if necessary

    1. You can use the extension methods from the DatasetExtensions class.

    2. You can use the TransformData method in the ITransformableDataset<V> interface.

    3. You can add new extension methods to the DatasetExtensions class. Inside them, you will likely use the TransformData method in the ITransformableDataset<V> interface.

  5. Convert the dataset to JSON and return it to the sender of the request

    1. You should use the ToJson method in the IDataset interface.
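Step 1.1 can be sketched with plain LINQ. The example below is only an illustration: the two lists, their field names and the sample values are hypothetical and do not come from the Mobilize.ReportGenerator project. It joins two "tables" into one flat sequence that could then be handed to a SimpleDatasetBuilder<D> as its data source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical "tables": files, and the technology that owns each extension.
var files = new List<(string Name, string Extension, int Lines)>
{
    ("Program.cs", ".cs", 120),
    ("Startup.cs", ".cs", 80),
    ("Schema.sql", ".sql", 300),
};
var technologies = new List<(string Extension, string Technology)>
{
    (".cs", "C#"),
    (".sql", "SQL"),
};

// Join both lists on the extension so that every row carries all the
// information that later selectors (label, parameters, values) will need.
var dataSource = files.Join(
        technologies,
        file => file.Extension,
        tech => tech.Extension,
        (file, tech) => (file.Name, tech.Technology, file.Lines))
    .ToList();

foreach (var row in dataSource)
{
    Console.WriteLine($"{row.Technology}: {row.Name} ({row.Lines} lines)");
}
```

A custom class or record would work equally well as the row type; tuples are used here only to keep the sketch short.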

IDatasetBuilder

This interface allows for the construction of IDatasets with an arbitrary number of parameters and a custom structure (there are three predefined structures: labelled values, labelled sequences and labelled dictionaries).

There are two default implementations for this interface:

  • SimpleDatasetBuilder

  • ParametricDatasetBuilder

When designing a dataset, one should instantiate a new SimpleDatasetBuilder and use the methods provided, instead of initializing a ParametricDatasetBuilder.

public interface IDatasetBuilder<D>
{
        // To Add a parameter to the dataset
        IDatasetBuilder<D> Parameterize<P>(Func<D, P> parameterSelector);

        // To build the dataset with a 'Labelled Values' structure
        // (each label corresponds to a group in the datasource)
        ITransformableDataset<V> BuildAsLabelledValues<V>(
                Func<D, string> labelSelector,
                Func<IGrouping<string, D>, V> valueSelector);
        
        // To build the dataset with a 'Labelled Values' structure
        // (each label corresponds to an element in the datasource)
        ITransformableDataset<V> BuildAsLabelledValues<V>(
                Func<D, string> labelSelector,
                Func<D, V> valueSelector);

        // To build the dataset with a 'Labelled Sequences' structure
        // (each label corresponds to a group in the datasource)
        ITransformableDataset<IEnumerable<V>> BuildAsLabelledSequences<V>(
                Func<D, string> labelSelector,
                Func<IGrouping<string, D>, IEnumerable<V>> groupingSelector);

        // To build the dataset with a 'Labelled Sequences' structure
        // (each label corresponds to an element in the datasource)
        ITransformableDataset<IEnumerable<V>> BuildAsLabelledSequences<V>(
                Func<D, string> labelSelector,
                Func<D, IEnumerable<V>> groupingSelector);

        // To build the dataset with a 'Labelled Dictionaries' structure
        // (each label corresponds to a group in the datasource)
        ITransformableDataset<Dictionary<K, V>> BuildAsLabelledDictionaries<K, V>(
                Func<D, string> labelSelector,
                Func<IGrouping<string, D>, Dictionary<K, V>> dictionarySelector);

        // To build the dataset with a 'Labelled Dictionaries' structure
        // (each label corresponds to an element in the datasource)
        ITransformableDataset<Dictionary<K, V>> BuildAsLabelledDictionaries<K, V>(
                Func<D, string> labelSelector,
                Func<D, Dictionary<K, V>> dictionarySelector);

        // To build the dataset with a custom structure
        // (each label corresponds to a group in the datasource)
        ITransformableDataset<V> BuildAsCustomDataset<V>(
                Func<D, string> labelSelector,
                Func<IGrouping<string, D>, V> datasetSelector);
        
        // To build the dataset with a custom structure
        // (each label corresponds to an element in the datasource)
        ITransformableDataset<V> BuildAsCustomDataset<V>(
                Func<D, string> labelSelector,
                Func<D, V> datasetSelector);
}

Some of the methods that are shown in this block of code are actually extension methods, and can be found in the static class DatasetExtensions.
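The difference between "each label corresponds to a group" and "each label corresponds to an element" can be illustrated with plain LINQ. This is not the implementation of BuildAsLabelledValues; it is a minimal sketch of the two shapes, and the row type and sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data source: one row per file.
var rows = new List<(string Technology, string FileName, int Lines)>
{
    ("C#", "Program.cs", 120),
    ("C#", "Startup.cs", 80),
    ("SQL", "Schema.sql", 300),
};

// Label per group: rows that share a label are grouped first, and the
// value is computed from the whole group (an IGrouping<string, D>).
Dictionary<string, int> linesPerTechnology = rows
    .GroupBy(row => row.Technology)
    .ToDictionary(group => group.Key, group => group.Sum(row => row.Lines));

// Label per element: every row produces its own label and its own value.
Dictionary<string, int> linesPerFile = rows
    .ToDictionary(row => row.FileName, row => row.Lines);

Console.WriteLine($"C# lines: {linesPerTechnology["C#"]}");
Console.WriteLine($"Schema.sql lines: {linesPerFile["Schema.sql"]}");
```

In the per-element case the labels presumably need to be unique, since each one becomes a key of the resulting structure.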

IDataset and ITransformableDataset<V>

The ITransformableDataset interface is implemented by the objects that are created by the IDatasetBuilder. ITransformableDataset extends the IDataset interface.

The user should use the following methods provided by the ITransformableDataset:

  • TransformData to change the way in which the data inside the Dataset is structured. Examples of transformations:

    • Truncating the enumerable to only show the top 5 labels

    • Adding a label to the enumerable, which holds totalized data

    • Any other kind of transformation

  • AddMetadataBasedOnDataAndSelectedParameters to add metadata that is relevant to a combination of parameters (not to the whole dataset)

  • AddGlobalMetadata to add metadata that is relevant to the whole dataset

  • ToJson to convert the dataset to the optimal format for communication with other components
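The first two transformation examples above can be written as functions with the shape that TransformData expects (a sequence of label/value pairs in, a sequence of label/value pairs out). The sketch below assumes V is int; the helper names Top and AddTotal and the "Total" label are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Builds a transformation that keeps the 'count' labels with the highest values.
Func<IEnumerable<KeyValuePair<string, int>>, IEnumerable<KeyValuePair<string, int>>> Top(int count) =>
    data => data.OrderByDescending(pair => pair.Value).Take(count);

// Appends a "Total" label that holds the sum of all the values.
IEnumerable<KeyValuePair<string, int>> AddTotal(IEnumerable<KeyValuePair<string, int>> data)
{
    var pairs = data.ToList();
    pairs.Add(new KeyValuePair<string, int>("Total", pairs.Sum(pair => pair.Value)));
    return pairs;
}

var sample = new Dictionary<string, int> { ["C#"] = 100, ["SQL"] = 20, ["Other"] = 50 };

// Keep the two largest labels, then add the total of what is left.
var result = AddTotal(Top(2)(sample)).ToList();

foreach (var pair in result)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
```

Functions like these could be passed directly to TransformData, or wrapped in an extension method in DatasetExtensions so that they can be reused across catalogs.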

public interface ITransformableDataset<V> : IDataset
{
        // This method allows for the transformation of the datasets
        ITransformableDataset<V> TransformData(Func<IEnumerable<KeyValuePair<string, V>>, IEnumerable<KeyValuePair<string, V>>> transformation);

        // This method allows for the addition of metadata that is relevant for
        // a combination of parameters
        void AddMetadataBasedOnDataAndSelectedParameters(
            Func<IEnumerable<KeyValuePair<string, V>>, List<string>, Dictionary<string, object>> functionToCreateNewMetadata,
            List<string> parametersForData = null);
}

public interface IDataset
{
        // Holds metadata that is relevant to all the dataset
        Dictionary<string, object> GlobalMetadata { get; }
        
        // Converts the metadata from the dataset to JSON format
        JObject MetadataToJson();

        // Converts the data from the dataset to JSON format
        JObject DataToJson();

        // Converts the dataset to JSON format
        JObject ToJson();
        
}

// Extension method for IDataset (declared in a static class, not inside the
// interface): adds metadata that is relevant to the whole dataset
public static void AddGlobalMetadata(
        this IDataset dataset,
        params (string, object)[] newGlobalMetadataProperties);

DatasetExtensions

This class includes extension methods for the IDataset interface. The main purpose of this class is to allow for extension of this interface, mainly through the use of the TransformData method.

Adding extension methods to this class avoids duplicating code that is generic enough to be used with any IDataset.

Dataset Catalogs and Dataset Generator

IDatasetCatalog

The IDatasetCatalog interface must be implemented by any new catalog that is created. A name and version must be specified, and the GenerateDatasets method is the core of the catalog. The idea is that, inside this method:

  1. Multiple IDatasets must be created (using an IDatasetBuilder and the methods from IDataset).

  2. Each of these datasets should be converted to JSON format using the ToJson method.

  3. Each of these datasets in JSON format must be added to a JArray, which is then returned by the catalog.

An important consideration for error handling: if any of the datasets cannot be generated, a NullDataset must be generated in its place. A NullDataset has no data and no metadata; it only has a globalMetadata property, which contains a successfulGeneration property set to false.
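The fallback described above can be sketched as a loop that isolates each dataset in its own try/catch. To stay free of external packages, this sketch uses System.Text.Json.Nodes instead of the JObject/JArray types that the interfaces in this document use; the helper names CreateNullDataset and GenerateAll are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Hypothetical helper: the JSON shape of a NullDataset
// (no data, no metadata, only globalMetadata.successfulGeneration = false).
JsonNode CreateNullDataset() => new JsonObject
{
    ["globalMetadata"] = new JsonObject { ["successfulGeneration"] = false },
};

// Runs every dataset factory; a failure in one dataset does not abort the others.
JsonArray GenerateAll(IEnumerable<Func<JsonNode>> datasetFactories)
{
    var result = new JsonArray();
    foreach (var factory in datasetFactories)
    {
        JsonNode dataset;
        try
        {
            dataset = factory();
        }
        catch (Exception)
        {
            dataset = CreateNullDataset();
        }
        result.Add(dataset);
    }
    return result;
}

var datasets = GenerateAll(new Func<JsonNode>[]
{
    () => new JsonObject { ["globalMetadata"] = new JsonObject { ["successfulGeneration"] = true } },
    () => throw new InvalidOperationException("corrupt assessment model"),
});

Console.WriteLine(datasets.ToJsonString());
```

With this shape the output always contains one entry per requested dataset, so a consumer can match entries by position and check successfulGeneration before reading data or metadata.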

public interface IDatasetCatalog
{
    string Name { get; }

    int Version { get; }

    JArray GenerateDatasets(IAssessmentModelReader assessmentModelReader, Dictionary<string, string> configuration);
}

Dataset Generator

The dataset generator is a component that is in charge of looking for a catalog name and version, and asking the corresponding catalog to generate the datasets.

There is a method inside the dataset generator where new dataset catalogs can be included. For example, in the method there is currently only one catalog (Name = "RapidScan" and Version = 1):

private static List<IDatasetCatalog> ListAllCatalogs()
{
    return new List<IDatasetCatalog>()
    {
        new RapidScanCatalogV1()
    };
}
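How the generator picks a catalog is not shown in this document. A minimal sketch of such a lookup, assuming an exact match on both name and version, could look like the following. The tuple stands in for IDatasetCatalog (only Name and Version are modelled), and TryFindCatalog is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the list returned by ListAllCatalogs(): only the two
// properties needed for the lookup are modelled here.
var catalogs = new List<(string Name, int Version)>
{
    ("RapidScan", 1),
};

// Returns true and the matching catalog when both name and version are known.
// Assumption: names are compared case-sensitively.
bool TryFindCatalog(string name, int version, out (string Name, int Version) catalog)
{
    foreach (var candidate in catalogs)
    {
        if (candidate.Name == name && candidate.Version == version)
        {
            catalog = candidate;
            return true;
        }
    }

    catalog = default;
    return false;
}

Console.WriteLine(TryFindCatalog("RapidScan", 1, out _)); // known catalog
Console.WriteLine(TryFindCatalog("RapidScan", 2, out _)); // unknown version
```

Keeping older versions in the list is what allows a client that still requests version 1 to keep working after a version 2 is added.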

Communication with Assessment Web API

Assessment Controller

The AssessmentController will expose a method called GenerateDatasets, which receives a DatasetGenerationRequestDto and returns a JArray with the information for each dataset.

The Assessment API will not cache or store assessment models to improve performance. Doing so would require the Assessment API to tell whether two codebases are exactly the same, and to hold data that may never be used again. The difficulty of such an implementation can be assessed in the future, but at the moment it does not seem likely to be adopted.

There is also another consideration related to the privacy of the user: we should not store data without the user's consent.

DatasetGenerationRequestDto

This is the object received by the Assessment API whenever a dataset generation request is performed.

  • The compressed output folder must contain the assessment model that will be used to generate the datasets.

  • The catalog name can be any name that is used to identify the catalog of datasets. For instance, it can be RapidScan.

  • The catalog version is the version of the catalog. Older versions should not be removed, so that backwards compatibility is preserved.

  • The configuration is used by the Dataset Catalog to decide if a dataset must be included or not, and also to perform any other custom logic that the specific catalog implements.

public class DatasetGenerationRequestDto
{
        public byte[] CompressedOutputFolder { get; set; }

        public string CatalogName { get; set; }

        public int CatalogVersion { get; set; }

        public Dictionary<string, string> Configuration { get; set; }
}

If the catalog name and version are not found in the list of catalogs supported by the Dataset Generator, then the server returns a 400 Bad Request (see "No such Catalog" under Error Management).

Structure of the generated datasets

General Structure of the Output (JArray of DatasetDto)

[
    {
        "globalMetadata": {
            "successfulGeneration": true,
            // More metadata
        },
        "metadata": {
            // The structure depends on the number of parameters
        },
        "data": {
            // The structure depends on the number of parameters and type of the
            //   generated data
        }
    },
    {
        // Has the same format as the dataset above
    }
]

Examples of the JSON representation of Datasets

Example of DatasetDto

  • 1 Parameter

    • Unit of Measure

      • Files

      • Bytes

      • Lines

  • For each parameter, we have labelled values

    • Label corresponds to Technology

{
    "globalMetadata": {
        "successfulGeneration": true
    },
    "metadata": {
        "Files": {
            "customMetadataProperty": "X"
        },
        "Bytes": {
            "customMetadataProperty": "Y"
        },
        "Lines": {
            "customMetadataProperty": "Z"
        }
    },
    "data": {
        "Files": {
            "C#": 100,
            "SQL": 20,
            "Other": 50
        },
        "Bytes": {
            "C#": 250,
            "SQL": 300,
            "Other": 1000
        },
        "Lines": {
            "C#": 500,
            "SQL": 800,
            "Other": 200
        }
    }
}

Example of DatasetDto

  • 2 Parameters

    • Technology

      • C#

      • SQL

    • Unit of measure

      • Kilobytes

      • Lines

  • For each pair of parameters, we have labelled sequences

    • Label corresponds to extension of the files

    • In the sequences, we can see the top 5 values (in the selected unit of measure) for any file with that extension

{
    "globalMetadata": {
        "successfulGeneration": true
    },
    "metadata": {
        "C#": {
            "Kilobytes": {
                "customMetadataProperty": "X"
            },
            "Lines": {
                "customMetadataProperty": "Z"
            }
        },
        "SQL": {
            "Kilobytes": {
                "customMetadataProperty": "X"
            },
            "Lines": {
                "customMetadataProperty": "Z"
            }
        }
    },
    "data": {
        "C#": {
            "Kilobytes": {
                ".cs": [100, 80, 70, 60, 50],
                ".csproj": [110, 100, 80, 40, 30]
            },
            "Lines": {
                ".cs": [1000, 860, 750, 610, 500],
                ".csproj": [100, 70, 65, 64, 60]
            }
        },
        "SQL": {
            "Kilobytes": {
                ".sql": [10000, 7000, 500, 400, 300]
            },
            "Lines": {
                ".sql" : [2400, 1800, 80, 60, 50]
            }
        }
    }
}
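Reading a value out of this structure takes one lookup per parameter, followed by the label. The sketch below assumes that the nesting order matches the order in which the parameters appear in the example (Technology, then unit of measure). It parses a reduced copy of the dataset with System.Text.Json.Nodes, which is used here only to avoid external packages.

```csharp
using System;
using System.Linq;
using System.Text.Json.Nodes;

// Reduced copy of the two-parameter example above.
var dataset = JsonNode.Parse(@"{
  ""globalMetadata"": { ""successfulGeneration"": true },
  ""data"": {
    ""C#"": {
      ""Lines"": { "".cs"": [1000, 860, 750, 610, 500] }
    }
  }
}");

// Check the generation flag before touching the data.
bool succeeded = dataset["globalMetadata"]["successfulGeneration"].GetValue<bool>();

// Parameter 1 (Technology), parameter 2 (unit of measure), then the label.
JsonArray sequence = dataset["data"]["C#"]["Lines"][".cs"].AsArray();
int largest = sequence.Max(node => node.GetValue<int>());

Console.WriteLine($"succeeded={succeeded}, values={sequence.Count}, largest={largest}");
```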

Example of DatasetDto

  • 3 Parameters

    • Technology

      • C#

      • SQL

    • BinaryType

      • Binary

      • Non Binary

    • Extension

      • .sql

      • .cs

      • .csproj

      • .dll

      • .sqlbin (This extension does not exist, but we will assume this is an SQL binary extension)

  • For each triplet of parameters, we have:

    • Labelled dictionaries with

      • Content Lines

      • Comment Lines

      • Control Flow Keywords

    • Each dictionary corresponds to a file, and is labelled with the file's name

{
    "globalMetadata": {
        "successfulGeneration": true
    },
    "metadata": {
        "C#": {
        },
        "SQL": {
        }
    },
    "data": {
        "C#": {
            "Binary": {
                ".dll": {
                    "File #1": { "Content Lines": 100, "Comment Lines": 20, "Control Flow Keywords": 40 },
                    "File #2": { "Content Lines": 120, "Comment Lines": 40, "Control Flow Keywords": 30 }
                }
            },
            "Non Binary": {
                ".cs": {
                    "File #3": { "Content Lines": 100, "Comment Lines": 20, "Control Flow Keywords": 40 },
                    "File #4": { "Content Lines": 120, "Comment Lines": 40, "Control Flow Keywords": 30 }
                },
                ".csproj": {
                    "File #5": { "Content Lines": 100, "Comment Lines": 20, "Control Flow Keywords": 40 },
                    "File #6": { "Content Lines": 120, "Comment Lines": 40, "Control Flow Keywords": 30 }
                }
            }
        },
        "SQL": {
            "Binary": {
                ".sqlbin": {
                    "File #7": { "Content Lines": 100, "Comment Lines": 20, "Control Flow Keywords": 40 },
                    "File #8": { "Content Lines": 120, "Comment Lines": 40, "Control Flow Keywords": 30 }
                }
            },
            "Non Binary": {
                ".sql": {
                    "File #9": { "Content Lines": 100, "Comment Lines": 20, "Control Flow Keywords": 40 },
                    "File #10": { "Content Lines": 120, "Comment Lines": 40, "Control Flow Keywords": 30 }
                }
            }
        }
    }
}

Error Management

There are three main types of errors:

Partial Errors

A dataset was not generated because there were errors during generation; most likely related to corrupt data in the assessment model or bad logic in the dataset catalog.

In this case, the data and metadata of the dataset can be ignored and the globalMetadata property should have its "successfulGeneration" property set to false.

A mechanism to generate "NullDatasets" must be provided so that a dataset that failed to generate can be replaced by a NullDataset in the output.

The server will return a 200 OK.

No such Catalog

The catalog that was requested does not exist: either the name of the catalog is wrong or the version of the catalog is wrong.

The server will return a 400 Bad Request.

Unhandled error

The assessment controller will handle any unhandled error that occurred in the dataset generation process (including unhandled errors in the catalog).

The server will return a 500 Internal Server Error.
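The three cases map to status codes as sketched below. This is not the AssessmentController code: HandleRequest, its parameters and the response bodies are hypothetical, with catalogExists standing in for the catalog lookup and generate standing in for the call to the catalog's GenerateDatasets method.

```csharp
using System;
using System.Net;

// Hypothetical controller logic covering the three error cases.
HttpStatusCode HandleRequest(bool catalogExists, Func<string> generate, out string body)
{
    if (!catalogExists)
    {
        body = "No such catalog";
        return HttpStatusCode.BadRequest;          // 400
    }

    try
    {
        // Partial errors are already encoded inside the returned JSON
        // (successfulGeneration = false), so the response is still a 200.
        body = generate();
        return HttpStatusCode.OK;                  // 200
    }
    catch (Exception)
    {
        body = "Unhandled error during dataset generation";
        return HttpStatusCode.InternalServerError; // 500
    }
}

var ok = HandleRequest(true, () => "[]", out _);
var missing = HandleRequest(false, () => "[]", out _);
var failed = HandleRequest(true, () => throw new InvalidOperationException(), out _);

Console.WriteLine($"{(int)ok} {(int)missing} {(int)failed}"); // prints "200 400 500"
```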

Generic UI Overview

Communication between the UI components

This communication is performed the same way as it always has been:

  • There is a ChartingService that allows for communication between a UI component and a handler (DatasetsHandler)

  • There is a DatasetsHandler that allows for communication between a service (ChartingService) and the Controller.

  • The Controller handles any request sent by the DatasetsHandler and returns the result of that request, which passes through the DatasetsHandler and the ChartingService before "arriving" at the UI component that called the method from the ChartingService.

    • In this particular case, the Controller will talk to the Assessment API to generate the requested datasets with the given configuration, catalog name, catalog version and current assessment model.

  • The component is now free to generate the chart using the best suited library/framework

Chart component

A chart component has been created to expose the Chart.js library.

  • An object of type ChartInfo can be passed as a prop to this chart component.

  • This implementation is tentative since there are many charting libraries besides Chart.js.

    • Maybe this component can exist but its name should be different.

Pending Work

  • Complete TO DOs in the code for the front-end

  • Decide what to do with the Chart component in the UI.

  • Complete TO DOs in the code for the back-end

  • Handle errors of type Partial Errors and Unhandled Error

  • Add more methods to fill the metadata/globalMetadata of a dataset

  • Add more unit testing

  • Improve the internal documentation of the interfaces
