Anomaly - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 DQ+
Version: Latest
Language: English
Product name: Data360 DQ+
Title: Data360 DQ+ Help
Copyright: 2024
First publish date: 2016
Last edition: 2024-07-09
Last publication: 2024-07-09T15:09:58.774265

Note: Before using the Analytics nodes, you must first create an analytic model. See Creating analytic models.

The Anomaly node uses an isolation forest algorithm to detect anomalies in a data set. Each record is given an "anomaly score" and is labeled as an anomaly if the record's score is above a calculated threshold.
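The following is a minimal sketch of how an isolation forest can turn records into anomaly scores. It is not the node's actual implementation, which is not described here. It assumes a single numeric input field and uses the scoring formula from the general isolation forest technique, where records that are isolated by fewer random splits receive scores closer to 1. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Average path length c(n) used to normalize isolation depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(sample, rng, depth, max_depth):
    """Recursively split the sample at random points until records are isolated."""
    if len(sample) <= 1 or depth >= max_depth:
        return ("leaf", len(sample))
    low, high = min(sample), max(sample)
    if low == high:
        return ("leaf", len(sample))
    split = rng.uniform(low, high)
    left = [v for v in sample if v < split]
    right = [v for v in sample if v >= split]
    return ("node", split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def tree_path_length(tree, value, depth=0):
    """Number of splits needed to reach the leaf that holds the value."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, split, left, right = tree
    child = left if value < split else right
    return tree_path_length(child, value, depth + 1)


def anomaly_scores(values, n_trees=100, max_samples=250, seed=0):
    """Score each value; shorter average paths give scores closer to 1."""
    if len(values) < 2:
        raise ValueError("at least two values are required")
    rng = random.Random(seed)
    sample_size = min(max_samples, len(values))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(values, sample_size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    scores = []
    for value in values:
        mean_depth = sum(tree_path_length(t, value) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Hypothetical values: nine similar amounts and one extreme amount.
values = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 1000.0]
scores = anomaly_scores(values)
```

In this sketch the extreme value is separated from the rest by almost any first split, so it receives the highest score. A record is then labeled as an anomaly when its score is above the threshold.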

Anomaly training

To perform Anomaly training, you need a data set that contains the field that you want to check for anomalies. Training an Anomaly type analytic model "teaches" the model what conforms to a "normal standard", so that it can identify outliers (or anomalies) that deviate from this standard.

Anomaly example: Training

This example uses a sample from a sales data set. The objective of training is to build a model that identifies anomalous sales results.

Field names: date = Sales date and amt = Sales amount

date        amt
2020-04-02  1433.75
2020-06-18  16216.27
2020-02-19  18706.21
2020-03-12  15331.75
2020-07-16  13550.19
2020-02-05  24924.50
2020-03-05  22778.00
2020-01-05  2136.01
2020-03-17  21447.75
2020-01-09  54567.20

  1. Select Train in the Operation property.
  2. In the Input Fields property, add the amt field. This is the field that the anomaly detection will be based on.
  3. Select an Analytic Model. Note that you can only select from Anomaly type analytic models.
  4. Specify a name for the Prediction Field, for example anomalies. In the output, this field will display 1 for a record that is identified as being an anomaly, and 0 for all "normal" records.
  5. Run the analysis.

The Anomaly node outputs a data set containing all input fields and the Prediction Field. In this example, two records have been identified as anomalous. In both, the sales amount is far lower than in the other records:

date        amt       anomalies
2020-04-02  1433.75   1
2020-06-18  16216.27  0
2020-02-19  18706.21  0
2020-03-12  15331.75  0
2020-07-16  13550.19  0
2020-02-05  24924.50  0
2020-03-05  22778.00  0
2020-01-05  2136.01   1
2020-03-17  21447.75  0
2020-01-09  54567.20  0

Note: Training also creates a child model within the selected analytic model. You can use this child model later for scoring. See Creating analytic models.

Anomaly evaluation and re-training

Once you have trained an Anomaly child model, you can decide whether to score with the model, or to evaluate it and create other models with different parameters. The Anomaly node has no explicit Evaluate operation. However, you can use the output of training in a new analysis or a dashboard to compare historical values and identify outliers.

You can then decide whether you want to try to create a different child model by making different specifications in the Anomaly node's properties and training again.

Anomaly scoring

Prerequisite: You have selected a child model within your analytic model to use for scoring. See Creating analytic models.

Once you have created a child model to use for scoring, you can create another analysis that uses an Anomaly node to score a new data set. The new data set must contain the same input fields that were used when the scoring model was trained. During scoring, the Anomaly node compares the values in these fields to values in the scoring model in order to identify anomalies.
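As an illustration of the field requirement, the following sketch checks a new record for the input fields that were used in training before it is scored. The function name and the new record are hypothetical. The node performs its own validation, which is not described here.

```python
def missing_input_fields(trained_fields, record):
    """Return the trained input fields that are absent from a record."""
    return [name for name in trained_fields if name not in record]


# The training example used the amt field as its only input field.
trained_fields = ["amt"]

# Hypothetical record from a new data set to be scored.
new_record = {"date": "2020-08-01", "amt": 1200.00}

missing = missing_input_fields(trained_fields, new_record)
if missing:
    raise ValueError(f"Cannot score: missing input fields {missing}")
```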

Anomaly example: Scoring

This example continues the "Anomaly example: Training" from above.

  1. In another analysis, connect a data set that contains the same input fields to an Anomaly node.
  2. Select Score in the Operation property.
  3. Select the same Analytic Model that was used in training.
  4. Specify a name for the Prediction Field, for example anomalies.
  5. Specify a name for the Score Field. This field shows the anomaly scores in the output.

The Anomaly node outputs a data set containing the input fields, the Prediction Field, and the Score Field. The Prediction Field shows the result of applying the threshold score to the anomaly scores that are displayed in the Score Field.
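The relationship between the two output fields can be sketched as follows. This assumes that a record is labeled 1 when its score is strictly above the threshold, as described at the start of this topic. The score values, the threshold, and the field name score are hypothetical.

```python
def label_records(records, score_field, prediction_field, threshold):
    """Add a 1/0 prediction to each record by comparing its score to the threshold."""
    labeled = []
    for record in records:
        out = dict(record)  # leave the input record unchanged
        out[prediction_field] = 1 if record[score_field] > threshold else 0
        labeled.append(out)
    return labeled


# Hypothetical scored records and threshold, for illustration only.
scored = [
    {"amt": 15000.00, "score": 0.42},
    {"amt": 900.00, "score": 0.71},
]
result = label_records(scored, "score", "anomalies", threshold=0.60)
```

Here the first record is labeled 0 and the second is labeled 1, because only the second score is above 0.60.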

Properties

Display Name

Specify a name for the node.

The default value is Anomaly.

Model tab

Operation

Select an operation type. Choose from:

  • Train
  • Score

Input Fields

Click Add Field to select input fields to analyze.

Analytic Model

Select an analytic model. You can only choose from Anomaly type models.

This field is optional if the Operation property is set to Score.

Prediction Field

Enter a name for a prediction field which will be included in the output of the node.

Score Field

Enter a name for a score field which will be included in the output of the node.

Anomaly tab

Max number of Sample Records

Optionally specify the maximum number of sample records. Choose from Records or Percent. If Percent is selected, the value in the numeric field is divided by 100 to create the percent value.

The default value is 250 Records.
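The Records and Percent options can be sketched as follows. This assumes that a Percent value is applied to the number of input records and that a Records value is capped at the size of the data set. Neither assumption is confirmed here, and the function name is hypothetical.

```python
def resolve_max_samples(value, unit, n_records):
    """Convert the property value into a number of sample records."""
    if unit == "Records":
        # Assumption: the sample cannot be larger than the data set.
        return min(int(value), n_records)
    if unit == "Percent":
        # The entered value is divided by 100 to create the percent value.
        return max(1, int(n_records * value / 100))
    raise ValueError(f"Unknown unit: {unit}")


default_sample = resolve_max_samples(250, "Records", 1000)  # 250 records
percent_sample = resolve_max_samples(10, "Percent", 1000)   # 10% of 1000 records
```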

Number of Trees

Optionally specify the number of isolation trees that will be used by the anomaly detection algorithm.

The default value is 100.

Max number of Features

Optionally specify the maximum number of features. Choose from Records or Percent. If Percent is selected, the value in the numeric field is divided by 100 to create the percent value.

The default value is 100 Percent.

Contamination

Optionally specify a contamination value between 0 (inclusive) and 0.5 (exclusive). This is an estimate of the proportion of anomalous records in your data set and is used to calculate a threshold score.

A value of 0.1 would compute a threshold score that labels the 10% of records with the highest anomaly scores as anomalies.

If set to 0, the threshold score is not computed and an anomaly label is not assigned. In this case, the node will execute more quickly and you can use the anomaly scores to decide how to handle the data.

The default value is 0.1.
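One way to derive a threshold score from the contamination value is sketched below. This is an exact calculation on a hypothetical list of scores, not the node's actual method. It assumes that records with a score strictly above the threshold are labeled as anomalies.

```python
def contamination_threshold(scores, contamination):
    """Return a threshold that leaves the given share of scores above it."""
    if not 0 <= contamination < 0.5:
        raise ValueError("contamination must be in the range [0, 0.5)")
    if contamination == 0 or not scores:
        return None  # no threshold is computed and no labels are assigned
    ranked = sorted(scores, reverse=True)
    n_anomalies = int(len(ranked) * contamination)
    # Scores strictly above this value are the n_anomalies highest scores.
    return ranked[n_anomalies]


# Hypothetical anomaly scores for ten records.
scores = [0.41, 0.43, 0.45, 0.44, 0.42, 0.46, 0.47, 0.40, 0.48, 0.80]
threshold = contamination_threshold(scores, 0.1)
labels = [1 if s > threshold else 0 for s in scores]
```

With ten scores and a contamination of 0.1, the threshold is the second-highest score, so only the single highest-scoring record is labeled 1. If several records share the threshold score, fewer records than expected may be labeled.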

Contamination Error Percent

Optionally specify a Contamination Error Percent value. Computing the threshold score can be time-consuming, so an approximation can be applied to speed it up. The Contamination Error Percent is the error allowed in this approximation.

A value of 1 would allow the computation to be within plus or minus 1%.

A value of 0 means that an exact calculation will be used.

The specified value is converted to a percent value.

The default value is 1.