The Regression node runs supervised learning algorithms to find a best-fit relationship between independent and dependent variables in a data set. Once a relationship is found, the Regression model can be applied to new data to make predictions.
Regression works best when attempting to output continuous values. For example, you could apply a Regression node to a data set containing information about product pricing and order quantities to determine how pricing affects demand. Once the Regression node has determined the relationship between the input variables, you can use that relationship to make predictions.
Regression training
To perform Regression training, you will need a labeled data set, that is, one where the desired output value is already known. This data set must also contain a set of fields, or parameters, that the analytic model can associate with values in the Label Field.
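As a rough illustration of what "labeled" means here, the following Python sketch separates each record into its input values and its known label value. The record layout and the `split_features_and_label` helper are assumptions made for illustration, with field names borrowed from the loan example in this topic; the Regression node handles this itself through its Input Fields and Label Field properties.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]


def split_features_and_label(
    records: List[Record], input_fields: List[str], label_field: str
) -> Tuple[List[List[float]], List[float]]:
    """Separate each labeled record into input values and its known label."""
    features: List[List[float]] = []
    labels: List[float] = []
    for record in records:
        if label_field not in record:
            # A record without a known output value cannot be used for training.
            raise ValueError(f"record is missing label field {label_field!r}")
        features.append([record[name] for name in input_fields])
        labels.append(record[label_field])
    return features, labels


# Two records taken from the loan example in this topic.
records = [
    {"amt": 14000, "yEm": 10, "dti": 27, "inc": 68000, "intR": 12.29},
    {"amt": 25000, "yEm": 1, "dti": 20.09, "inc": 85000, "intR": 14.65},
]
features, labels = split_features_and_label(
    records, ["amt", "yEm", "dti", "inc"], "intR"
)
```

Here `features` holds the values the model learns from, and `labels` holds the values it learns to predict.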
Loan example: Training
This example uses a sample from a loan data set. The object of training is to suggest interest rates for applicants based on parameterized information about the applicant.
Field names: amt = Loan amount, yEm = Years employed, dti = Debt to income ratio, inc = Annual income, intR = Interest rate.
id | amt | yEm | dti | inc | intR |
---|---|---|---|---|---|
001 | 14000 | 10 | 27 | 68000 | 12.29 |
002 | 25000 | 1 | 20.09 | 85000 | 14.65 |
003 | 6000 | 10 | 27.8 | 85000 | 12.69 |
004 | 15600 | 1 | 13.37 | 95000 | 9.17 |
005 | 9250 | 1 | 21.76 | 70000 | 9.99 |
006 | 2500 | 1 | 14.8 | 45000 | 9.99 |
007 | 10000 | 5 | 14.97 | 93000 | 9.17 |
008 | 20000 | 1 | 11.81 | 135000 | 14.65 |
009 | 3600 | 1 | 29.69 | 29999 | 15.61 |
010 | 20150 | 10 | 27.55 | 48000 | 17.86 |
- For training, add four parameters in the Input Fields property: amt, yEm, dti, and inc.
- Specify intR in the Label Field property.
- Specify a Prediction Field that has the same data type as the field specified in the Label Field property, with an appropriate name, such as Suggested_Interest_Rate (sIR) of the Decimal data type.
- Run the analysis.
The Regression node outputs a data set containing all specified Input Fields, the Label Field, the Prediction Field, and any other fields associated with the records:
id | amt | yEm | dti | inc | intR | sIR |
---|---|---|---|---|---|---|
001 | 14000 | 10 | 27 | 68000 | 12.29 | 12.68 |
002 | 25000 | 1 | 20.09 | 85000 | 14.65 | 12.23 |
003 | 6000 | 10 | 27.8 | 85000 | 12.69 | 12.22 |
004 | 15600 | 1 | 13.37 | 95000 | 9.17 | 10.06 |
005 | 9250 | 1 | 21.76 | 70000 | 9.99 | 12.17 |
006 | 2500 | 1 | 14.8 | 45000 | 9.99 | 10.65 |
007 | 10000 | 5 | 14.97 | 93000 | 9.17 | 10.08 |
008 | 20000 | 1 | 11.81 | 135000 | 14.65 | 12.50 |
009 | 3600 | 1 | 29.69 | 29999 | 15.61 | 13.54 |
010 | 20150 | 10 | 27.55 | 48000 | 17.86 | 15.49 |
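To make the idea of fitting and then predicting concrete, the sketch below fits a one-variable least-squares line to the dti and intR values from the loan example. This is a deliberately simplified stand-in: the Regression node offers Random Forest and Gradient Boosted Tree algorithms and uses all of the Input Fields, so the numbers this sketch produces will not match the sIR column. The `fit_line` and `predict` helpers are hypothetical names.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Apply the fitted line to a new input value."""
    return slope * x + intercept


# dti (input) and intR (label) columns from the loan training data.
dti = [27, 20.09, 27.8, 13.37, 21.76, 14.8, 14.97, 11.81, 29.69, 27.55]
intR = [12.29, 14.65, 12.69, 9.17, 9.99, 9.99, 9.17, 14.65, 15.61, 17.86]

slope, intercept = fit_line(dti, intR)

# Suggest a rate for a new applicant with a debt to income ratio of 24.44.
suggested = predict(slope, intercept, 24.44)
```

A single straight line is far less flexible than a tree ensemble, but the workflow is the same: learn from labeled records, then apply what was learned to a record whose label is unknown.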
Regression evaluation
To evaluate the accuracy of an analytic model, and by extension the accuracy of scoring that is performed using that model, you can use the Regression node's Evaluate operation.
To evaluate a child training model, provide a validation data set as input to a Regression node. See Generating training and validation data sets.
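The product has its own way of producing these data sets, described in the topic referenced above. Purely to illustrate the idea, the following sketch holds back a fraction of the labeled records for validation. The `split_records` helper, the 80/20 ratio, and the fixed seed are assumptions, not product behavior.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T], validation_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle records reproducibly and split them into (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Always hold back at least one record for validation.
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Record ids from the loan example in this topic.
record_ids = ["001", "002", "003", "004", "005",
              "006", "007", "008", "009", "010"]
training_ids, validation_ids = split_records(record_ids)
```

Keeping the validation records out of training is what makes the resulting RMSE a fair estimate of how the model behaves on data it has not seen.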
Loan example: Evaluating the child training model
To evaluate the child model created during the loan data set training:
- Provide a validation data set as input to the Regression node.
- Select Evaluate in the Operation property.
The evaluation produces an RMSE:
ModelDisplayName | ChildModelDisplayName | Rank | RMSE |
---|---|---|---|
Regression Model | Child Model 1 | 1 | 3.82 |
Regression re-training and re-evaluating
After training and evaluating your first child model, you can train another one to obtain a lower RMSE and more accurate scoring results. To re-train, you can use new data, different parameters as input fields, or both. Each time you re-train using the same analytic model, another child model is produced.
Loan example: Re-training and re-evaluating
In this example, the loan data set's analytic model is re-trained with two additional parameters for each record: open_acc and msld, where open_acc = Number of credit lines open in the borrower's file and msld = Months since last delinquency.
- To re-train, edit the original analysis by adding these parameters as Input Fields in the Regression node.
- Rebuild the analysis.
By rebuilding the analysis, a new child model is created within the analytic model.
The analysis outputs a new data set, containing all six Input Fields, the Label Field, and the Prediction Field.
- To determine the effect of adding the two additional parameters to training, use the Regression node's Evaluate operation. This time, select the new child model.
The evaluation produces an RMSE, in this case a slightly improved value:
ModelDisplayName | ChildModelDisplayName | Rank | RMSE |
---|---|---|---|
Regression Model | Child Model 2 | 1 | 3.79 |
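Comparing child models comes down to comparing their RMSE values on the same validation data. A minimal sketch, using the two results reported in this topic; the dictionary layout and the `best_child_model` name are illustrative only:

```python
from typing import Dict


def best_child_model(rmse_by_model: Dict[str, float]) -> str:
    """Return the name of the child model with the lowest RMSE."""
    return min(rmse_by_model, key=lambda name: rmse_by_model[name])


# RMSE values from the two evaluations in the loan example.
rmse_by_model = {"Child Model 1": 3.82, "Child Model 2": 3.79}
best = best_child_model(rmse_by_model)
print(best)  # Child Model 2
```

The difference between 3.82 and 3.79 is small, so in practice it is worth checking whether the extra input fields are reliably available before committing to the second model.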
Regression scoring
- Prerequisite: You have selected a child model within your analytic model to use for scoring. See Creating analytic models.
Once you have selected a child model to use for scoring, you can create another analysis that uses a Regression node to score an unlabeled data set, that is, to predict values for each record. The new data set must contain the same fields that were used as parameters when the scoring model was trained. During scoring, the Regression node will compare the values in these fields to values in the scoring model.
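Before scoring, it can help to confirm that every record in the unlabeled data set carries the fields the model was trained on. The check below is a sketch only: `missing_fields` is a hypothetical helper, and the field names are those of the re-trained loan model in this topic. It tests for the presence of a field, not for a non-empty value.

```python
from typing import Dict, Iterable, List, Optional

# Input Fields used when the re-trained loan model was built.
REQUIRED_FIELDS = ["amt", "yEm", "dti", "inc", "open_acc", "msld"]


def missing_fields(
    record: Dict[str, Optional[float]], required: Iterable[str]
) -> List[str]:
    """List the required field names that are absent from a record."""
    return [name for name in required if name not in record]


# A record with every field present; msld is present but empty.
complete = {"amt": 6000, "yEm": 2, "dti": 2.98, "inc": 50000,
            "open_acc": 11, "msld": None}

# Hypothetical record that lacks the two fields added during re-training.
incomplete = {"amt": 10000, "yEm": 1, "dti": 24.44, "inc": 60000}

print(missing_fields(complete, REQUIRED_FIELDS))    # []
print(missing_fields(incomplete, REQUIRED_FIELDS))  # ['open_acc', 'msld']
```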
Loan example: Scoring
This example continues from the previous examples in this topic. In evaluation, "Child Model 2" performed slightly better, so this is the model that will be used for scoring.
You have an unlabeled data set containing the model's six Input Fields:
id | amt | yEm | dti | inc | open_acc | msld |
---|---|---|---|---|---|---|
001 | 6000 | 2 | 2.98 | 50000 | 11 | |
002 | 35000 | 10 | 14.39 | 86000 | 13 | |
003 | 10000 | 1 | 24.44 | 60000 | 10 | 59 |
004 | 25675 | 10 | 18.8 | 95000 | 21 | |
005 | 20000 | 2 | 17.18 | 200000 | 31 | |
006 | 9900 | 1 | 21.96 | 45000 | 10 | 56 |
007 | 10000 | 10 | 10.22 | 150000 | 11 | 23 |
008 | 14000 | 1 | 12.39 | 110000 | 11 | 80 |
009 | 18000 | 7 | 36.91 | 85000 | 15 | 48 |
010 | 28000 | 6 | 18.09 | 165000 | 17 | |
- Provide the unlabeled data set as an input to the Regression node.
- Select Score in the Operation property.
The following results are produced:
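id | amt | yEm | dti | inc | open_acc | msld | sIR |
---|---|---|---|---|---|---|---|
001 | 6000 | 2 | 2.98 | 50000 | 11 | | 11.21 |
002 | 35000 | 10 | 14.39 | 86000 | 13 | | 13.54 |
003 | 10000 | 1 | 24.44 | 60000 | 10 | 59 | 11.82 |
004 | 25675 | 10 | 18.8 | 95000 | 21 | | 11.14 |
005 | 20000 | 2 | 17.18 | 200000 | 31 | | 9.44 |
006 | 9900 | 1 | 21.96 | 45000 | 10 | 56 | 11.71 |
007 | 10000 | 10 | 10.22 | 150000 | 11 | 23 | 11.26 |
008 | 14000 | 1 | 12.39 | 110000 | 11 | 80 | 11.38 |
009 | 18000 | 7 | 36.91 | 85000 | 15 | 48 | 14.33 |
010 | 28000 | 6 | 18.09 | 165000 | 17 | | 9.83 |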
You can then output this data to a new data store for use in other data stages, such as a dashboard.
Regression or Recommendation: Root Mean Square Error (RMSE)
The Root Mean Square Error (RMSE) is a measure used to evaluate Regression or Recommendation models. RMSE is the square root of the mean of the squared errors between predicted values and labeled values.
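Written as a formula, with $y_i$ the labeled value and $\hat{y}_i$ the predicted value for record $i$, and $n$ the number of records (this notation is introduced here for illustration and does not appear in the product):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```

Each error is squared individually before the mean is taken, and the result is expressed in the same units as the label field.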
In general, the lower the RMSE, the better the performance of a model. What typifies a "low" RMSE depends on the range of values in the model's label field.
If there are large errors between predicted values and labeled values, they have a disproportionate effect on the RMSE because each error is squared before the mean is taken.
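The sketch below computes RMSE for the intR and sIR columns of the first training output table in this topic, then shows how a single large error inflates the result. Note that this is an RMSE over the training records themselves, so it is not comparable to the validation RMSE values reported by the Evaluate operation. The `rmse` function and the altered prediction are illustrative assumptions.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], labeled: Sequence[float]) -> float:
    """Square root of the mean of the squared prediction errors."""
    if len(predicted) != len(labeled) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted, labeled)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Labeled (intR) and predicted (sIR) values from the training output table.
intR = [12.29, 14.65, 12.69, 9.17, 9.99, 9.99, 9.17, 14.65, 15.61, 17.86]
sIR = [12.68, 12.23, 12.22, 10.06, 12.17, 10.65, 10.08, 12.50, 13.54, 15.49]

baseline = rmse(sIR, intR)  # roughly 1.66

# Hypothetical: replace one prediction with a value that is off by 10 points.
with_outlier = list(sIR)
with_outlier[0] = intR[0] + 10
inflated = rmse(with_outlier, intR)
```

Changing one prediction out of ten more than doubles the RMSE here, which is why a handful of badly predicted records can dominate the metric.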
Properties
Display Name
Specify a name for the node.
The default value is Regression.
Model tab
Operation
Select an operation type. Choose from:
- Train
- Score
- Evaluate
Input Fields
Click Add Field to select input fields to analyze.
Analytic Model
Select an analytic model. You can only choose from Regression type models.
Label Field
Enter a name for a label field which will be included in the output of the node.
Prediction Field
Enter a name for a prediction field which will be included in the output of the node.
Prediction Field Type
Select a data type for the field specified in the Prediction Field. Choose from:
- Boolean
- Date
- String
- DateTime
- Time
- Integer
- Floating Point
- Big Integer
- Decimal
- Currency
Regression tab
Algorithm
Select an algorithm. Choose from:
- Random Forest
- Gradient Boosted Tree
Automatically Tune Parameters
Select this option if you want to automatically tune parameters.
Number Of Trees
Specify the number of trees.
This property is not available if you have selected Automatically Tune Parameters.
Specify no: of classes
Select this option if you want to specify the number of classes, then enter a numeric value.
Max tree depth
Select this option if you want to specify a maximum tree depth, then enter a numeric value.