Classification - Data360 DQ+ - Latest

Data360 DQ+ Help

Product: Data360 DQ+ (Data360 product family, Verify portfolio)
Version: Latest
Language: English
Copyright: 2024
First published: 2016
Last updated: 2024-07-09

Note: Before using the Analytics nodes, you must first create an "Analytic Model"; see Creating analytic models.

The Classification node performs "supervised" learning algorithms to place a new observation into a preexisting, discrete category. Classification categories are derived from a training set of data that has already been labeled, or "clustered." For this reason, the Classification node is often used with the Segmentation node.

 


Classification works best when attempting to output discrete values. For example, given a labeled data set that matches customer demographics to the loan types those customers qualify for, you could apply the Classification node to a new customer data set containing only demographic information to predict which loan types those customers will qualify for.

Classification training

To perform Classification training, you need a labeled data set, that is, one that contains a field that already classifies records. The data set must also contain a set of fields, or parameters, that the analytic model can associate with classes.

Iris example: Training

This example uses a sample from the famous "Iris Flower Data Set." The object of training is to classify flower records into one of three species based on collected measurements.

| SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 4.4 | 3 | 1.3 | 0.2 | setosa |
| 5 | 3.5 | 1.3 | 0.3 | setosa |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 5.7 | 2.5 | 5 | 2 | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica |
| 6.5 | 3.2 | 5.1 | 2 | virginica |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |

This data set contains four fields, SepalLength, SepalWidth, PetalLength and PetalWidth, and a Label field, Species.

  1. To train an analytic model with this data set, add the four parameters in the Input Fields property.
  2. Specify Species in the Label Field property.
  3. Specify a Prediction Field that has the same data type as the field specified in the Label Field property, for example SpeciesPred of the string data type.
  4. Run the analysis.

The Classification node outputs a data set containing all specified Input Fields, the Label Field, the Prediction Field, and any other fields associated with the records:

| SepalLength | SepalWidth | PetalLength | PetalWidth | Species | SpeciesPred |
|---|---|---|---|---|---|
| 5.1 | 3.3 | 1.7 | 0.5 | setosa | setosa |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica | virginica |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor | versicolor |
| 4.4 | 3 | 1.3 | 0.2 | setosa | setosa |
| 5 | 3.5 | 1.3 | 0.3 | setosa | setosa |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica | virginica |
| 5.7 | 2.5 | 5 | 2 | virginica | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica | virginica |
| 6.5 | 3.2 | 5.1 | 2 | virginica | virginica |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa | setosa |
Note: Training will also create a child model within the selected analytic model. You can use this child model later for scoring; see Creating analytic models.

Classification scoring

Prerequisite: You have selected a child model within your analytic model to use for scoring, see Creating analytic models.

Once you have selected a child model to use for scoring, you can create another analysis that uses a Classification node to score an unlabeled data set, that is, to predict to which class each record within the data set belongs. The new data set must contain the same fields that were used as parameters when the scoring model was trained. During scoring, the Classification node will compare the values in these fields to values in the scoring model in order to predict a class.

Iris example: Scoring unlabeled data

This example continues the "Iris example: Training" from above.

You have an unlabeled data set that only contains values for the four parameters: SepalLength, SepalWidth, PetalLength and PetalWidth:

| SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|
| 7.7 | 2.8 | 6.7 | 2 |
| 5.2 | 3.4 | 1.4 | 0.2 |
| 6.1 | 3 | 4.9 | 1.8 |
| 5 | 3.5 | 1.3 | 0.3 |
| 5.4 | 3.9 | 1.7 | 0.4 |
| 4.9 | 2.5 | 4.5 | 1.7 |
| 5.1 | 3.5 | 1.4 | 0.3 |
| 5.8 | 2.6 | 4 | 1.2 |
| 5.9 | 3 | 4.2 | 1.5 |
| 6.9 | 3.1 | 5.4 | 2.1 |

  1. Provide this data set as an input to the Classification node.
  2. Select Score in the Operation property.
  3. Specify a Prediction Field, for example SpeciesPred.
  4. Select a Prediction Field Type, for example String.
  5. Run the analysis.

The Classification node outputs a data set containing the parameter fields and the Prediction Field:

| SepalLength | SepalWidth | PetalLength | PetalWidth | SpeciesPred |
|---|---|---|---|---|
| 7.7 | 2.8 | 6.7 | 2 | virginica |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 6.1 | 3 | 4.9 | 1.8 | virginica |
| 5 | 3.5 | 1.3 | 0.3 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.8 | 2.6 | 4 | 1.2 | versicolor |
| 5.9 | 3 | 4.2 | 1.5 | versicolor |
| 6.9 | 3.1 | 5.4 | 2.1 | virginica |
The accuracy of SpeciesPred would ultimately depend on the strength of the child model used to score, indicated by the model's evaluation metrics.
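To see what the score step does conceptually, here is a minimal stand-in written in plain Python: a nearest-neighbor lookup that labels each unlabeled row with the class of the closest labeled training row. The product's child models use Random Forest or Gradient Boosted Trees rather than nearest-neighbor, but on this small sample the lookup happens to reproduce the SpeciesPred column above.

```python
# Simplified stand-in for the "Score" operation: 1-nearest-neighbor lookup
# against the labeled training rows from the training example. Illustration
# only; the product's child models use Random Forest or Gradient Boosted Trees.
import math

train = [  # (SepalLength, SepalWidth, PetalLength, PetalWidth, Species)
    (5.1, 3.3, 1.7, 0.5, "setosa"),     (7.7, 2.6, 6.9, 2.3, "virginica"),
    (6.6, 2.9, 4.6, 1.3, "versicolor"), (4.4, 3.0, 1.3, 0.2, "setosa"),
    (5.0, 3.5, 1.3, 0.3, "setosa"),     (6.4, 2.7, 5.3, 1.9, "virginica"),
    (5.7, 2.5, 5.0, 2.0, "virginica"),  (6.4, 3.1, 5.5, 1.8, "virginica"),
    (6.5, 3.2, 5.1, 2.0, "virginica"),  (5.4, 3.4, 1.7, 0.2, "setosa"),
]

unlabeled = [
    (7.7, 2.8, 6.7, 2.0), (5.2, 3.4, 1.4, 0.2), (6.1, 3.0, 4.9, 1.8),
    (5.0, 3.5, 1.3, 0.3), (5.4, 3.9, 1.7, 0.4), (4.9, 2.5, 4.5, 1.7),
    (5.1, 3.5, 1.4, 0.3), (5.8, 2.6, 4.0, 1.2), (5.9, 3.0, 4.2, 1.5),
    (6.9, 3.1, 5.4, 2.1),
]

def score(row):
    """Predict a class by copying the label of the closest training row."""
    nearest = min(train, key=lambda t: math.dist(row, t[:4]))
    return nearest[4]

species_pred = [score(row) for row in unlabeled]
```

A real child model generalizes from the training data rather than memorizing it, which is why its accuracy must be checked with evaluation metrics rather than assumed.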

Classification evaluation

To evaluate the accuracy of an analytic model, and by extension the accuracy of scoring that is performed using that model, you can use the Classification node's Evaluate operation.

To evaluate a child training model, you need to use a validation data set as an input to a Classification node, see Generating training and validation data sets.
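The idea of a hold-out split can be sketched in a few lines. This is a hypothetical 70/30 random split, not the product's actual mechanism, which is described in Generating training and validation data sets.

```python
# Hypothetical 70/30 hold-out split of a labeled data set into training and
# validation subsets. Illustration only; see "Generating training and
# validation data sets" for the product's own mechanism.
import random

# Toy labeled rows: (record id, class label)
labeled_rows = [(i, "setosa" if i % 2 else "virginica") for i in range(100)]

random.seed(42)              # fixed seed so the split is reproducible
random.shuffle(labeled_rows)

cut = int(len(labeled_rows) * 0.7)
training_set = labeled_rows[:cut]     # used by the Train operation
validation_set = labeled_rows[cut:]   # held out for the Evaluate operation
```

Because the validation records were never seen during training, evaluating against them measures how well the child model generalizes.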

Iris example: Evaluating the child training model

To evaluate the child model created during the Iris data set training:

  1. Provide a validation data set as input to the Classification node.
  2. Select Evaluate in the Operation property.

The evaluation produces an F-Measure, as well as a number of other evaluation metrics:

| ModelDisplayName | ChildModelDisplayName | Rank | FMeasure |
|---|---|---|---|
| Classification Model | Child Model 1 | 1 | 0.986 |

Tip: After training and evaluating your first child model, you can train another one to obtain better evaluation metrics and more accurate scoring results. To retrain, you can use new data and/or different parameters as input fields. Each time you retrain using the same analytic model, another child model is created.

Properties

Display Name

Specify a name for the node.

The default value is Classification.

Model tab

Operation

Select an operation type. Choose from:

  • Train
  • Score
  • Evaluate

Input Fields

Click Add Field to select input fields to analyze.

Analytic Model

Select an analytic model. You can only choose from Classification type models.

Label Field

Enter a name for a label field which will be included in the output of the node.

Prediction Field

Enter a name for a prediction field which will be included in the output of the node.

Prediction Field Type

Select a data type for the field specified in the Prediction Field. Choose from:

  • Boolean
  • Date
  • String
  • DateTime
  • Time
  • Integer
  • Floating Point
  • Big Integer
  • Decimal
  • Currency

Classification tab

Algorithm

Select an algorithm. Choose from:

  • Random Forest
  • Gradient Boosted Tree

Automatically Tune Parameters

Select this option to tune the model's parameters automatically. When selected, you can specify an evaluation metric.

Eval Metric

Select an evaluation metric to use to rank the model when you compare it to others. This property is only visible when Automatically Tune Parameters is selected. Choose from:

  • F1 Score - Uses a result set's "precision" and "recall" to produce a value between 0 and 1 that represents the performance of a classification model:

    A result set's "precision" is the number of true positives (tp) over all positives returned (pr), or tp/pr.

    A result set's "recall" is the number of true positives (tp) over actual positives (ap) in the result set, or tp/ap.

    The F1 Score is the harmonic mean of the two: F1 = 2 × (precision × recall) / (precision + recall).

    The closer to 1, the better the performance of the classification model.

    For example, you have ten items known to be in Class A (ap = 10). After training, your model classifies seven items as Class A (pr = 7). After further investigation, however, you find that only five of these items are actually supposed to be in Class A (tp = 5).

    In this example, the precision is tp/pr, or 5/7. The recall is tp/ap, only 5/10.

    Using the formula above, the model's F1 Score would be roughly 0.59, suggesting that the model has room for improvement.

  • Profit - Helps you to find an optimal probability threshold for classification models by allowing you to specify dollar amounts for the profit gained from each true positive (True Positive Profit) and the cost incurred from each false negative (False Negative Cost).

    To get a more accurate measure of profit, you must also specify a Setup Cost which represents the cost incurred from setting up the model.

  • Accuracy - (True Positives + True Negatives) / (Total number of cases examined)

    Accuracy is the proportion of true results among the total number of cases examined. True results includes both true positives and true negatives. The closer to 1, the better the performance of the classification model.

  • Sensitivity - (True Positives) / (Total Positives in set)

    Sensitivity, also known as Recall, is the True Positive Rate, or, the number of results which tested positive and are actually positive, divided by the total number of positives in the set. The closer to 1, the better the performance of the classification model.

  • Specificity - (True Negatives) / (Total Negatives in set)

    Specificity is the True Negative Rate, or, the number of results which tested negative and are actually negative, divided by the total number of negatives in the set. The closer to 1, the better the performance of the classification model.

  • Precision - (True Positives) / (True Positives + False Positives)

    Precision is the number of True Positives divided by the number of results that tested positive. The closer to 1 (the fewer the False Positives), the better the performance of the classification model.

  • Negative Predictive Value - (True Negatives) / (True Negatives + False Negatives)

    Negative Predictive Value is the number of True Negatives divided by the number of results that tested negative. The closer to 1 (the fewer the False Negatives), the better the performance of the classification model.

  • False Positive Rate - (False Positives) / (Total Negatives in set)

    False Positive Rate is the number of False Positives divided by the total number of negatives in the set. Note that Total Negatives is equivalent to the sum of False Positives and True Negatives. The closer to 0 (the fewer the False Positives), the better the performance of the classification model.

  • False Discovery Rate - (False Positives) / (False Positives + True Positives)

    False Discovery Rate is the number of False Positives divided by the number of results that tested positive in the set. The closer to 0 (the fewer the False Positives), the better the performance of the classification model.

  • False Negative Rate - (False Negatives) / (Total Positives)

    False Negative Rate is the number of False Negatives divided by the total number of positives in the set. Note that Total Positives is equivalent to the sum of False Negatives and True Positives. The closer to 0 (the fewer the False Negatives), the better the performance of the classification model.
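The worked F1 example above can be checked in a few lines of arithmetic, using F1 = 2 × (precision × recall) / (precision + recall):

```python
# Worked F1 arithmetic from the example above: ap = 10 actual positives,
# pr = 7 returned positives, tp = 5 true positives.
tp, pr, ap = 5, 7, 10

precision = tp / pr   # 5/7 ≈ 0.714
recall = tp / ap      # 5/10 = 0.5
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 2))   # → 0.59
```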
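The remaining ratio metrics can all be computed from a single confusion matrix. The tp, fp, tn, and fn counts below are assumed for illustration, not taken from the Iris example.

```python
# The ratio metrics above, computed from one hypothetical confusion matrix.
tp, fp, tn, fn = 40, 10, 45, 5   # assumed counts, for illustration only

accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)     # a.k.a. Recall / True Positive Rate
specificity = tn / (tn + fp)     # True Negative Rate
precision   = tp / (tp + fp)
npv         = tn / (tn + fn)     # Negative Predictive Value
fpr         = fp / (fp + tn)     # False Positive Rate
fdr         = fp / (fp + tp)     # False Discovery Rate
fnr         = fn / (fn + tp)     # False Negative Rate

# Complementary pairs implied by the definitions above:
assert abs(sensitivity + fnr - 1) < 1e-12
assert abs(specificity + fpr - 1) < 1e-12
assert abs(precision + fdr - 1) < 1e-12
```

The assertions show why the "closer to 0" metrics mirror the "closer to 1" ones: each pair sums to 1 by construction.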

Number Of Trees

Specify the number of trees.

This property is not available if you have selected Automatically Tune Parameters.

Max tree depth

Select this option if you want to specify a maximum tree depth, then enter a numeric value.

Specify no: of classes

Select this option if you want to specify the number of classes, then enter a numeric value.