The Anomaly node uses an isolation forest algorithm to detect anomalies in a data set. Each record is given an "anomaly score" and is labeled as an anomaly if the record's score is above a calculated threshold.
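To make the idea concrete, the following is a minimal sketch of the same technique using scikit-learn's IsolationForest as a stand-in for the node's engine (an assumption for illustration; the node's score scale and defaults may differ): every record receives an anomaly score, and the calculated threshold turns those scores into labels.

```python
# Minimal sketch of the isolation-forest idea, using scikit-learn as a stand-in
# for the Anomaly node's engine (an assumption; score scales and defaults may differ).
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10.0], [11.0], [9.5], [10.5], [95.0]])   # one obvious outlier

forest = IsolationForest(n_estimators=100, contamination=0.2, random_state=0)
forest.fit(X)

scores = -forest.score_samples(X)      # anomaly score: higher means more anomalous
labels = (forest.predict(X) == -1)     # True where the score is above the threshold
print(scores.round(3), labels.astype(int))
```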
Anomaly training
To perform Anomaly training, you need a data set that contains a field that you want to analyze to check for anomalies. Training an Anomaly type analytic model "teaches" the model what conforms to a "normal standard" in order to identify outliers (or anomalies) to this normal.
Anomaly example: Training
This example uses a sample from a sales data set. The object of training is to make a model for identifying anomalous sales results.
Field names:
- date = Sales date
- amt = Sales amount
| date | amt |
| --- | --- |
| 2020-04-02 | 1433.75 |
| 2020-06-18 | 16216.27 |
| 2020-02-19 | 18706.21 |
| 2020-03-12 | 15331.75 |
| 2020-07-16 | 13550.19 |
| 2020-02-05 | 24924.50 |
| 2020-03-05 | 22778.00 |
| 2020-01-05 | 2136.01 |
| 2020-03-17 | 21447.75 |
| 2020-01-09 | 54567.20 |
- Select Train in the Operation property.
- In the Input Fields property, add the amt field. This is the field that the anomaly detection will be based on.
- Select an Analytic Model. Note that you can only select from Anomaly type analytic models.
- Specify a name for the Prediction Field, for example anomalies. In the output, this field will display 1 for a record that is identified as being an anomaly, and 0 for all "normal" records.
- Run the analysis.
The Anomaly node outputs a data set containing all input fields and the Prediction Field. In this example, two records are identified as anomalous; their sales amounts are far lower than those of the other records:
| date | amt | anomalies |
| --- | --- | --- |
| 2020-04-02 | 1433.75 | 1 |
| 2020-06-18 | 16216.27 | 0 |
| 2020-02-19 | 18706.21 | 0 |
| 2020-03-12 | 15331.75 | 0 |
| 2020-07-16 | 13550.19 | 0 |
| 2020-02-05 | 24924.50 | 0 |
| 2020-03-05 | 22778.00 | 0 |
| 2020-01-05 | 2136.01 | 1 |
| 2020-03-17 | 21447.75 | 0 |
| 2020-01-09 | 54567.20 | 0 |
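For readers who prefer code, the training run above can be sketched with scikit-learn's IsolationForest. This is an assumption for illustration; the Anomaly node's engine and parameters may differ, so the flagged records may not match the table exactly.

```python
# Minimal sketch of the training example, assuming a scikit-learn-style
# isolation forest. The exact records flagged depend on the engine and its
# parameters, so the 0/1 labels may differ from the table above.
import pandas as pd
from sklearn.ensemble import IsolationForest

sales = pd.DataFrame({
    "date": ["2020-04-02", "2020-06-18", "2020-02-19", "2020-03-12", "2020-07-16",
             "2020-02-05", "2020-03-05", "2020-01-05", "2020-03-17", "2020-01-09"],
    "amt": [1433.75, 16216.27, 18706.21, 15331.75, 13550.19,
            24924.50, 22778.00, 2136.01, 21447.75, 54567.20],
})

# Train on the amt field; contamination=0.2 asks for roughly 20% of records
# (two of ten) to be flagged.
model = IsolationForest(n_estimators=100, contamination=0.2, random_state=0)
model.fit(sales[["amt"]])

# Prediction Field: 1 for records identified as anomalies, 0 for normal records.
sales["anomalies"] = (model.predict(sales[["amt"]]) == -1).astype(int)
print(sales)
```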
Anomaly evaluation and re-training
Once you have trained an Anomaly child model, you can decide whether to score with the model or to evaluate it and create other models with different parameters. The Anomaly node has no explicit Evaluate operation; however, you can use the output of training to compare historical values and identify outliers in a new analysis or a dashboard.
You can then decide whether to create a different child model by changing the specifications in the Anomaly node's properties and training again.
Anomaly scoring
Prerequisite: You have selected a child model within your analytic model to use for scoring (see Creating analytic models).
Once you have created a child model to use for scoring, you can create another analysis that uses an Anomaly node to score a new data set. The new data set must contain the same fields that were used as parameters when the scoring model was trained. During scoring, the Anomaly node will compare the values in these fields to values in the scoring model in order to identify anomalies.
Anomaly example: Scoring
This example continues the "Anomaly example: Training" from above.
- Another data set with the same parameters is used as an input to an Anomaly node in another analysis.
- Select Score in the Operation property.
- Select the same Analytic Model that was used in training.
- Specify a name for the Prediction Field, for example anomalies.
- Specify a name for the Score Field. This field shows the anomaly scores in the output.
The Anomaly node outputs a data set containing the input fields, the Prediction Field, and the Score Field. The Prediction Field shows the result of applying the threshold score to the anomaly scores displayed in the Score Field.
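Conceptually, scoring applies the already trained model to the new records, and the Prediction Field is simply the threshold applied to the Score Field. A minimal sketch, again assuming a scikit-learn-style isolation forest; new_sales and its values are hypothetical.

```python
# Minimal sketch of scoring, assuming a scikit-learn-style isolation forest.
# `train` holds the training values from the example above; `new_sales` is a
# hypothetical new data set that contains the same field (amt) used in training.
import pandas as pd
from sklearn.ensemble import IsolationForest

train = pd.DataFrame({"amt": [1433.75, 16216.27, 18706.21, 15331.75, 13550.19,
                              24924.50, 22778.00, 2136.01, 21447.75, 54567.20]})
model = IsolationForest(n_estimators=100, contamination=0.2, random_state=0).fit(train)

new_sales = pd.DataFrame({
    "date": ["2020-08-03", "2020-08-10", "2020-08-17"],
    "amt": [17250.00, 950.40, 19980.75],
})

# Score Field: the raw anomaly scores (here, higher means more anomalous).
new_sales["score"] = -model.score_samples(new_sales[["amt"]])

# Prediction Field: 1 where the score is above the model's calculated threshold.
new_sales["anomalies"] = (model.predict(new_sales[["amt"]]) == -1).astype(int)
print(new_sales)
```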
Properties
Display Name
Specify a name for the node.
The default value is Anomaly.
Model tab
Operation
Select an operation type. Choose from:
- Train
- Score
Input Fields
Click Add Field to select input fields to analyze.
Analytic Model
Select an analytic model. You can only choose from Anomaly type models.
This field is optional if the Operation property is set to Score.
Prediction Field
Enter a name for a prediction field which will be included in the output of the node.
Score Field
Enter a name for a score field which will be included in the output of the node.
Anomaly tab
Max number of Sample Records
Optionally specify the maximum number of sample records. Choose from Records or Percent. If Percent is selected, the value in the numeric field is divided by 100 to create the percent value (for example, 50 becomes 0.5).
The default value is 250 Records.
Number of Trees
Optionally specify the number of isolation trees that will be used by the anomaly detection algorithm.
The default value is 100.
Max number of Features
Optionally specify the maximum number of features. Choose from Records or Percent. If Percent is selected, the value in the numeric field is divided by 100 to create the percent value (for example, 100 becomes 1.0).
The default value is 100 Percent.
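For comparison with code-level isolation forests, these three properties correspond roughly to the usual constructor parameters. A minimal sketch, assuming scikit-learn's IsolationForest (the Anomaly node's own engine and exact semantics may differ):

```python
# Rough mapping of Anomaly tab properties onto scikit-learn's IsolationForest.
# This is an assumption for illustration; the node's engine may differ.
from sklearn.ensemble import IsolationForest

model = IsolationForest(
    max_samples=250,     # Max number of Sample Records: 250 Records (default)
    # max_samples=0.5,   # ...or a Percent value, e.g. 50 -> 50 / 100 = 0.5
    n_estimators=100,    # Number of Trees: 100 (default)
    max_features=1.0,    # Max number of Features: 100 Percent -> 100 / 100 = 1.0
    random_state=0,
)
```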
Contamination
Optionally specify a contamination value between 0 (inclusive) and 0.5 (exclusive). This is an estimate of the proportion of anomalous records in your data set and is used to calculate a threshold score.
A value of 0.1 would compute a threshold score that labels the top 10% of scored records as anomalies.
If set to 0, the threshold score is not computed and an anomaly label is not assigned. In this case, the node will execute more quickly and you can use the anomaly scores to decide how to handle the data.
The default value is 0.1.
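The relationship between the contamination value and the threshold can be sketched directly: the threshold is chosen so that roughly that fraction of the scored records falls above it. A minimal sketch, using a hypothetical array of anomaly scores (higher means more anomalous):

```python
# Minimal sketch: turning a contamination value into a threshold score.
# `scores` is a hypothetical array of anomaly scores (higher = more anomalous).
import numpy as np

scores = np.array([0.42, 0.45, 0.47, 0.48, 0.51, 0.52, 0.55, 0.58, 0.61, 0.79])
contamination = 0.1

if contamination == 0:
    labels = None  # no threshold is computed; work directly with the raw scores
else:
    # Threshold at the (1 - contamination) quantile: with 0.1, the top 10% of
    # scored records are labeled as anomalies.
    threshold = np.quantile(scores, 1 - contamination)
    labels = (scores > threshold).astype(int)

print(labels)
```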
Contamination Error Percent
Optionally specify a Contamination Error Percent value. The threshold score computation can be time-consuming, so an approximation can be applied to speed it up. The Contamination Error Percent is the error allowed in the approximation.
A value of 1 would allow the computation to be within plus or minus 1%.
A value of 0 means that an exact calculation will be used.
The specified value is converted to a percent value.
The default value is 1.