An analytic model enables machine learning by allowing you to train, score, and evaluate data sets via the analytics nodes. A different analytic model type is available for each type of analytics node.
- Identify an appropriate pipeline and path.
- Select the path where you want to create the new analytic model. Click the menu button to the right of the path and select New > Analytic Model.
- Enter a Name and optionally enter a Description.
If you enter a description, this will be displayed in a tooltip when you hover over the analytic model in the Pipelines view.
- Select the relevant Model Type. For example, if you plan to work with the Anomaly node, select the Anomaly model type. Note: You need to create a separate analytic model for each analytic node type that you plan to work with.
- Click Save.
This analytic model holds child training models that are created each time an analytics node using the Train operation executes or is tested within an analysis. These child training models are listed in the Manage Models table as they are created.
Each time you train a new child model, it will only use the data that was input to the analytics node during the execution of the analysis. Child models that are part of the same analytic model are independent of one another. This means that when re-training, you can improve your analytic model by training with different data and/or different input parameters.
- By default, Data360 DQ+ names each child model with a date and time. If you want to use a child training model to score new data sets in other analyses, you need to specify a Display Name for each entry.
- Once you have one or more child training models for a data set, you can choose one to use for scoring by selecting the model in the Scoring Model Name field.
Choosing data to use with analytic models
How you apply an analytic model depends on the type of data that you plan to train, score, and evaluate.
Training data
Regardless of which analytics node is used, training data needs input fields, that is, attributes about each record, and a Label field, which contains the answer that the model will learn to associate with those input fields.
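For example, outside of Data360 DQ+ itself, "input fields plus a Label field" can be sketched in a few lines of Python; the field names, values, and model type below are all hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: input fields (attributes of each record) plus a
# Label field holding the known answer. All field names are hypothetical.
training = pd.DataFrame({
    "amount":   [120.0, 15.5, 980.0, 42.0],
    "age_days": [3, 400, 12, 900],
    "Label":    ["fraud", "ok", "fraud", "ok"],
})

# Training teaches the model to associate the input fields with the Label.
model = DecisionTreeClassifier().fit(training[["amount", "age_days"]], training["Label"])
```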
Evaluation/validation data
If you intend to create multiple child models and compare them using the Evaluate operation, you will also need to set aside a validation data set. The validation data set also needs a "Label" field, that is, a field containing an actual value that can be compared to a predicted value. It also needs the same set of input fields that was used to generate the trained model, so that the evaluation can take values from these fields, generate predictions, compare the predictions to the actual values in the Label field, and ultimately report on the quality of your model using evaluation metrics.
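Continuing the same hypothetical sketch, an evaluation takes values from the input fields, generates predictions, and compares them to the actual values in the Label field:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Same hypothetical training set and model as in the sketch above.
training = pd.DataFrame({
    "amount":   [120.0, 15.5, 980.0, 42.0],
    "age_days": [3, 400, 12, 900],
    "Label":    ["fraud", "ok", "fraud", "ok"],
})
model = DecisionTreeClassifier().fit(training[["amount", "age_days"]], training["Label"])

# Validation data: the same input fields, plus actual values in the Label field.
validation = pd.DataFrame({
    "amount":   [300.0, 20.0],
    "age_days": [8, 650],
    "Label":    ["fraud", "ok"],
})

# Predict from the input fields, then compare predictions to the actual labels.
predictions = model.predict(validation[["amount", "age_days"]])
print(accuracy_score(validation["Label"], predictions))
```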
Scoring data
Regardless of which analytics node is used, scoring data needs the same set of input fields that were used to create the scoring model. Scoring data does not require a Label field, as the aim of scoring is to make predictions on "unlabeled" data.
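To complete the sketch, scoring data carries the same input fields but no Label field; the model supplies the prediction:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same hypothetical trained model as in the sketches above.
training = pd.DataFrame({
    "amount":   [120.0, 15.5, 980.0, 42.0],
    "age_days": [3, 400, 12, 900],
    "Label":    ["fraud", "ok", "fraud", "ok"],
})
model = DecisionTreeClassifier().fit(training[["amount", "age_days"]], training["Label"])

# Scoring data: the same input fields, but no Label column.
to_score = pd.DataFrame({
    "amount":   [75.0, 1200.0],
    "age_days": [30, 5],
})

# Scoring generates predictions for the unlabeled records.
to_score["Prediction"] = model.predict(to_score[["amount", "age_days"]])
print(to_score)
```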
Generating training and validation data sets
You can generate training and validation data sets by splitting an initial "labeled" data set, that is, a data set where the answers are known, into two subsets. For training, use a subset of approximately 60-70% of the original data set, and for validation, use the remaining 30-40%.
Currently, there is no automatic way to generate this split in Data360 DQ+; however, you can perform a split by creating an analysis such as the following example:
1) An Auto Number node gives each record in the data set a unique ID value. Note that if records already have unique IDs, this step is not necessary.
2) A Sample node takes 60% of the data from the Data Store Input node.
3) The Not In node finds IDs that are present in the entire data set but not in the 60% sample. The output of the Not In node is then pushed to the Validation Data Set, and the 60% sample is pushed to the Training Data Set.
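For comparison, the same split logic can be sketched outside the product in Python with pandas; the file name, the `id` field, and the 60% ratio below are illustrative:

```python
import pandas as pd

# The full "labeled" data set (hypothetical file name).
data = pd.read_csv("labeled_data.csv")

# Auto Number step: give each record a unique ID if it does not already have one.
data["id"] = range(len(data))

# Sample step: take 60% of the records for training.
training = data.sample(frac=0.6, random_state=42)

# Not In step: records whose IDs are absent from the 60% sample become validation data.
validation = data[~data["id"].isin(training["id"])]
```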
Training models
- Prerequisite: You have created an analytic model.
- Identify an appropriate data store input to use for training, then in the Analysis Designer, drag a Data Store Input node onto the canvas and point it to the data store that you want to use.
- Connect the Data Store Input node to your chosen analytics node. You can either connect the nodes directly, or via Enhance, Combine or Shape nodes.
- Select the analytics node and from the Properties panel, select Train in the Operation property.
- Select the Analytic Model. Note that you can only select from analytic models where the type corresponds to the type of the selected analytic node.
- Configure any additional required node properties. The properties that you are required to configure vary depending on the selected analytic node.
- To see the effects of training in the analysis, click Accept Changes or Test.
Your new Prediction Field will appear in the analytics node's sheet. You can then take this information and push it to a Data Store Output node to be used by other data stages.
After saving and executing your analysis, a child model will be stored in your analytic model, which Data360 DQ+ will name with a date and time.
Evaluating models
- Prerequisite: You have created an analytic model.
- Prerequisite: You have trained your analytic model (see Training models).
- Identify an appropriate data store input to use for validation, then in the Analysis Designer, drag a Data Store Input node onto the canvas and point it to the data store that you want to use.
- Connect the Data Store Input node to your chosen analytics node. You can either connect the nodes directly, or via Enhance, Combine or Shape nodes.
- Select the analytics node and from the Properties panel, select Evaluate in the Operation property.
- Select the Analytic Model. Note that you can only select from analytic models where the type corresponds to the type of the selected analytic node.
- From the Analytic Child Model list, choose one or more child models from your analytic model. Data360 DQ+ will rank their predictive accuracy. Understanding which child models are the most accurate can help you to determine the accuracy of any scoring that you have already performed, and which models to use for future scoring.
When an evaluation runs, it produces measures to assess the performance of your child models. The specific measures used vary for each type of analytics node.
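Data360 DQ+ selects and reports these measures for you, but as a rough indication of the kinds of measures involved, a classification-style model might be assessed with accuracy, precision, and recall, while a regression-style model might use an error measure. The values below are hypothetical:

```python
from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score

# Hypothetical actual vs. predicted values from a classification-style child model.
actual    = ["fraud", "ok", "ok", "fraud"]
predicted = ["fraud", "ok", "fraud", "fraud"]
print(accuracy_score(actual, predicted))                      # 0.75
print(precision_score(actual, predicted, pos_label="fraud"))  # ~0.67
print(recall_score(actual, predicted, pos_label="fraud"))     # 1.0

# Hypothetical actual vs. predicted values from a regression-style child model.
print(mean_squared_error([10.0, 20.0, 30.0], [12.0, 19.0, 33.0]))  # ~4.67
```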
Scoring data
Scoring applies a trained model to a new data set to generate predictions.
- Prerequisite: You have created an analytic model.
- Prerequisite: You have trained your analytic model (see Training models).
- Prerequisite: You have specified a Display Name for a child model that you want to use for scoring.
- Identify an appropriate data store input to use for scoring, then in the Analysis Designer, drag a Data Store Input node onto the canvas and point it to the data store that you want to use.
- Connect the Data Store Input node to your chosen analytics node. You can either connect the nodes directly, or via Enhance, Combine or Shape nodes.
- Select the analytics node and from the Properties panel, select Score in the Operation property.
- Configure any additional required node properties. The properties that you are required to configure vary depending on the selected analytics node.
- To see the effects of scoring in the analysis, click Accept Changes or Test.
Your new Prediction Field will appear in the analytics node's sheet. You can then take this information and push it to a Data Store Output node to be used by other data stages, for example, a data view that allows for the visualization of predictive analytics.
After saving and executing your analysis, the scored data will be pushed to the Data Store Output. The accuracy of your scoring depends on the strength of your scoring model.