The Classification node applies supervised learning algorithms to place a new observation into a preexisting, discrete category. Classification categories are derived from a training set of data that has already been labeled, often by clustering, which is why the Classification node is frequently used together with the Segmentation node.
Classification is best suited to predicting discrete values. For example, given a labeled data set that matches customer demographics to the loan types those customers qualify for, you could apply the Classification node to a new customer data set containing only demographic information to predict which loan types each new customer qualifies for.
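As a purely hypothetical sketch of that loan example (the field names, loan categories, and the use of scikit-learn are all invented for illustration and are not part of the node), the workflow might look like this in Python:

```python
# Hypothetical sketch: all field names and values are invented for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Labeled history: demographics plus the loan type each customer qualified for
history = pd.DataFrame({
    "Age": [34, 51, 27, 45],
    "Income": [52000, 88000, 31000, 67000],
    "LoanType": ["auto", "mortgage", "personal", "mortgage"],
})

model = RandomForestClassifier(random_state=0)
model.fit(history[["Age", "Income"]], history["LoanType"])

# New customers with demographics only; predict the loan types they qualify for
new_customers = pd.DataFrame({"Age": [29, 48], "Income": [40000, 75000]})
print(model.predict(new_customers))
```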
Classification training
To perform Classification training, you will need a labeled data set, that is, one where there is a field that already classifies records. This data set also needs to contain a set of fields, or parameters, that the analytic model can associate with classes.
Iris example: Training
This example uses a sample from the famous "Iris Flower Data Set." The goal of training is to classify flower records into one of three species based on collected measurements.
| SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 4.4 | 3 | 1.3 | 0.2 | setosa |
| 5 | 3.5 | 1.3 | 0.3 | setosa |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 5.7 | 2.5 | 5 | 2 | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica |
| 6.5 | 3.2 | 5.1 | 2 | virginica |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |
This data set contains four fields, `SepalLength`, `SepalWidth`, `PetalLength`, and `PetalWidth`, and a Label field, `Species`.
- To train an analytic model with this data set, add the four parameters in the Input Fields property.
- Specify `Species` in the Label Field property.
- Specify a Prediction Field that has the same data type as the field specified in the Label Field property, for example `SpeciesPred` of the string data type.
- Run the analysis.
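The node performs the training internally; as a rough illustration only (scikit-learn and the specific classifier here are assumptions, not the node's actual implementation), the equivalent step could be sketched in Python:

```python
# Illustrative sketch: train a classifier on the labeled iris sample.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame(
    [[5.1, 3.3, 1.7, 0.5, "setosa"],
     [7.7, 2.6, 6.9, 2.3, "virginica"],
     [6.6, 2.9, 4.6, 1.3, "versicolor"],
     [4.4, 3.0, 1.3, 0.2, "setosa"]],
    columns=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"],
)

# Input Fields -> feature matrix; Label Field -> target vector
X = df[["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]
y = df["Species"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Prediction Field: a prediction for each record, same data type as the label
df["SpeciesPred"] = model.predict(X)
```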
The Classification node outputs a data set containing all specified Input Fields, the Label Field, the Prediction Field, and any other fields associated with the records:
| SepalLength | SepalWidth | PetalLength | PetalWidth | Species | SpeciesPred |
|---|---|---|---|---|---|
| 5.1 | 3.3 | 1.7 | 0.5 | setosa | setosa |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica | virginica |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor | versicolor |
| 4.4 | 3 | 1.3 | 0.2 | setosa | setosa |
| 5 | 3.5 | 1.3 | 0.3 | setosa | setosa |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica | virginica |
| 5.7 | 2.5 | 5 | 2 | virginica | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica | virginica |
| 6.5 | 3.2 | 5.1 | 2 | virginica | virginica |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa | setosa |
Classification scoring
Prerequisite: You have selected a child model within your analytic model to use for scoring; see Creating analytic models.
Once you have selected a child model to use for scoring, you can create another analysis that uses a Classification node to score an unlabeled data set, that is, to predict the class to which each record in the data set belongs. The new data set must contain the same fields that were used as parameters when the scoring model was trained. During scoring, the Classification node compares the values in these fields to values in the scoring model in order to predict a class.
Iris example: Scoring unlabeled data
This example continues the "Iris example: Training" from above.
You have an unlabeled data set that contains only values for the four parameters, `SepalLength`, `SepalWidth`, `PetalLength`, and `PetalWidth`:
| SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|
| 7.7 | 2.8 | 6.7 | 2 |
| 5.2 | 3.4 | 1.4 | 0.2 |
| 6.1 | 3 | 4.9 | 1.8 |
| 5 | 3.5 | 1.3 | 0.3 |
| 5.4 | 3.9 | 1.7 | 0.4 |
| 4.9 | 2.5 | 4.5 | 1.7 |
| 5.1 | 3.5 | 1.4 | 0.3 |
| 5.8 | 2.6 | 4 | 1.2 |
| 5.9 | 3 | 4.2 | 1.5 |
| 6.9 | 3.1 | 5.4 | 2.1 |
- Provide this data set as an input to the Classification node.
- Select Score in the Operation property.
- Specify a Prediction Field, for example `SpeciesPred`.
- Select a Prediction Field Type, for example String.
- Run the analysis.
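Continuing the training sketch above (again, an illustration rather than the node's implementation), scoring amounts to calling the trained model's predict method on the unlabeled rows:

```python
# Illustrative sketch: score unlabeled rows with the model trained above.
# Assumes `model` is the fitted classifier from the training sketch.
import pandas as pd

unlabeled = pd.DataFrame(
    [[7.7, 2.8, 6.7, 2.0],
     [5.2, 3.4, 1.4, 0.2],
     [6.1, 3.0, 4.9, 1.8]],
    columns=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"],
)

# The Prediction Field holds the class the model assigns to each record.
unlabeled["SpeciesPred"] = model.predict(unlabeled)
print(unlabeled)
```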
The Classification node outputs a data set containing the parameter fields and the Prediction Field:
| SepalLength | SepalWidth | PetalLength | PetalWidth | SpeciesPred |
|---|---|---|---|---|
| 7.7 | 2.8 | 6.7 | 2 | virginica |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 6.1 | 3 | 4.9 | 1.8 | virginica |
| 5 | 3.5 | 1.3 | 0.3 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.8 | 2.6 | 4 | 1.2 | versicolor |
| 5.9 | 3 | 4.2 | 1.5 | versicolor |
| 6.9 | 3.1 | 5.4 | 2.1 | virginica |
The accuracy of `SpeciesPred` ultimately depends on the strength of the child model used for scoring, as indicated by that model's evaluation metrics.
Classification evaluation
To evaluate the accuracy of an analytic model, and by extension the accuracy of scoring that is performed using that model, you can use the Classification node's Evaluate operation.
To evaluate a child training model, you need to use a validation data set as an input to a Classification node; see Generating training and validation data sets.
Iris example: Evaluating the child training model
To evaluate the child model created during the Iris data set training:
- Provide a validation data set as input to the Classification node.
- Select Evaluate in the Operation property.
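Conceptually, evaluation compares the model's predictions on the validation set against the known labels. A rough scikit-learn analogy (an assumption about how the score is computed, not the node's actual code):

```python
# Illustrative sketch: evaluate a trained classifier on a held-out
# validation set with a macro-averaged F-measure.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# F-measure over the three species on the validation set
print(f1_score(y_val, model.predict(X_val), average="macro"))
```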
The evaluation would produce an F-Measure, as well as a number of other evaluation metrics:
| ModelDisplayName | ChildModelDisplayName | Rank | FMeasure |
|---|---|---|---|
| Classification Model | Child Model 1 | 1 | 0.986 |
Properties
Display Name
Specify a name for the node.
The default value is Classification.
Model tab
Operation
Select an operation type. Choose from:
- Train
- Score
- Evaluate
Input Fields
Click Add Field to select input fields to analyze.
Analytic Model
Select an analytic model. You can only choose from Classification type models.
Label Field
Enter a name for a label field which will be included in the output of the node.
Prediction Field
Enter a name for a prediction field which will be included in the output of the node.
Prediction Field Type
Select a data type for the field specified in the Prediction Field. Choose from:
- Boolean
- Date
- String
- DateTime
- Time
- Integer
- Floating Point
- Big Integer
- Decimal
- Currency
Classification tab
Algorithm
Select an algorithm. Choose from:
- Random Forest
- Gradient Boosted Tree
Automatically Tune Parameters
Select this option if you want to specify an evaluation metric.
Eval Metric
Select an evaluation metric to use to rank the model when you compare it to others. This property is only visible when Automatically Tune Parameters is selected. Choose from:
- F1 Score - Uses a result set's "precision" and "recall" to produce a value between 0 and 1 that represents the performance of a classification model:

  A result set's "precision" is the number of true positives (`tp`) over all positives returned (`pr`), or `tp/pr`.

  A result set's "recall" is the number of true positives (`tp`) over actual positives (`ap`) in the result set, or `tp/ap`.

  The F1 Score is the harmonic mean of the two, `2 * (precision * recall) / (precision + recall)`. The closer to 1, the better the performance of the classification model.

  For example, you have ten items known to be in Class A (`ap = 10`). After training, your model classifies seven items as Class A (`pr = 7`). After further investigation, however, you find that only five of these items actually belong in Class A (`tp = 5`).

  In this example, the precision is `tp/pr`, or 5/7, and the recall is `tp/ap`, only 5/10.

  Using the formula above, the model's F1 Score would be roughly 0.59, suggesting that the model has room for improvement; a worked computation appears after this list.
- Profit - Helps you to find an optimal probability threshold for classification models by allowing you to specify dollar amounts for the profit gained from each true positive (True Positive Profit) and the cost incurred from each false negative (False Negative Cost).

  To get a more accurate measure of profit, you must also specify a Setup Cost, which represents the cost incurred by setting up the model.

- Accuracy - `(True Positives + True Negatives) / (Total number of cases examined)`

  Accuracy is the proportion of true results among the total number of cases examined. True results include both true positives and true negatives. The closer to 1, the better the performance of the classification model.

- Sensitivity - `(True Positives) / (Total Positives in set)`

  Sensitivity, also known as Recall, is the True Positive Rate: the number of results that tested positive and are actually positive, divided by the total number of positives in the set. The closer to 1, the better the performance of the classification model.

- Specificity - `(True Negatives) / (Total Negatives in set)`

  Specificity is the True Negative Rate: the number of results that tested negative and are actually negative, divided by the total number of negatives in the set. The closer to 1, the better the performance of the classification model.

- Precision - `(True Positives) / (True Positives + False Positives)`

  Precision is the number of True Positives divided by the number of results that tested positive. The closer to 1 (the fewer the False Positives), the better the performance of the classification model.

- Negative Predictive Value - `(True Negatives) / (True Negatives + False Negatives)`

  Negative Predictive Value is the number of True Negatives divided by the number of results that tested negative. The closer to 1 (the fewer the False Negatives), the better the performance of the classification model.

- False Positive Rate - `(False Positives) / (Total Negatives in set)`

  False Positive Rate is the number of False Positives divided by the total number of negatives in the set. Note that Total Negatives is the sum of False Positives and True Negatives. The closer to 0 (the fewer the False Positives), the better the performance of the classification model.

- False Discovery Rate - `(False Positives) / (False Positives + True Positives)`

  False Discovery Rate is the number of False Positives divided by the number of results that tested positive in the set. The closer to 0 (the fewer the False Positives), the better the performance of the classification model.

- False Negative Rate - `(False Negatives) / (Total Positives in set)`

  False Negative Rate is the number of False Negatives divided by the total number of positives in the set. Note that Total Positives is the sum of False Negatives and True Positives. The closer to 0 (the fewer the False Negatives), the better the performance of the classification model.
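The following Python snippet works through these formulas using the counts from the F1 Score example above (`tp = 5`, `fp = 2`, `fn = 5`); the true-negative count is not given in that example, so `tn = 8`, for 20 cases in total, is a hypothetical value added for illustration:

```python
# Worked illustration of the evaluation formulas above.
# tp, fp, fn come from the F1 Score example; tn = 8 is hypothetical.
tp, fp, fn, tn = 5, 2, 5, 8

precision = tp / (tp + fp)                                   # 5/7, ~0.714
recall    = tp / (tp + fn)                                   # 5/10 = 0.5 (Sensitivity)
f1        = 2 * (precision * recall) / (precision + recall)  # ~0.588

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # true results over all cases
specificity = tn / (tn + fp)                   # True Negative Rate
npv         = tn / (tn + fn)                   # Negative Predictive Value
fpr         = fp / (fp + tn)                   # False Positive Rate
fdr         = fp / (fp + tp)                   # False Discovery Rate
fnr         = fn / (fn + tp)                   # False Negative Rate

print(f"F1 Score: {f1:.2f}")  # F1 Score: 0.59
```

The Profit metric has no single standard formula; a plausible reading of the properties above is `tp * TruePositiveProfit - fn * FalseNegativeCost - SetupCost`, but this is an assumption rather than the node's documented computation.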
Number Of Trees
Specify the number of trees.
This property is not available if you have selected Automatically Tune Parameters.
Max tree depth
Select this option if you want to specify a maximum tree depth, then enter a numeric value.
Specify no: of classes
Select this option if you want to specify the number of classes, then enter a numeric value.
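As a rough analogy only (the node's internals are not documented here), the Algorithm, Number Of Trees, and Max tree depth properties correspond to common tree-ensemble hyperparameters such as scikit-learn's `n_estimators` and `max_depth`:

```python
# Rough analogy: how these properties map onto common scikit-learn
# hyperparameters. This is not the node's actual implementation.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

random_forest = RandomForestClassifier(
    n_estimators=100,  # Number Of Trees
    max_depth=5,       # Max tree depth
)

gradient_boosted = GradientBoostingClassifier(
    n_estimators=100,  # Number Of Trees
    max_depth=3,       # Max tree depth
)
```

In scikit-learn the number of classes is inferred from the labels; the node's Specify no: of classes option makes that count explicit.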