Regression - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 DQ+
Version: Latest
Language: English
Copyright: 2024
First publish date: 2016
Last edition: 2024-07-09
Last publication: 2024-07-09T15:09:58.774265

Note: Before using the Analytics nodes, you must first create an analytic model. See Creating analytic models.

The Regression node uses supervised learning algorithms to find a best-fit relationship between the independent and dependent variables in a data set. Once a relationship is found, new data can be run through the Regression model to make predictions.

Regression

Regression works best when attempting to output continuous values. For example, you could apply a Regression node to a data set containing information about product pricing and order quantities to determine how pricing affects demand. Once the Regression node has determined the relationship between the input variables, you can use that relationship to make predictions.
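To make the pricing example concrete, the following minimal sketch fits a straight line to a few price and quantity pairs, then uses the fitted line to predict demand at a new price. The data values and the fit_line function are invented for illustration. An ordinary least-squares line is used only because it is the simplest regression to show; it is not one of the algorithms offered by the Regression node.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

# Hypothetical observations: unit price and the quantity ordered at that price.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [100.0, 90.0, 80.0, 70.0, 60.0]

slope, intercept = fit_line(prices, quantities)

# Use the fitted relationship to predict demand at a price not in the data.
predicted_quantity = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted_quantity, 2))
```

A negative slope here means that demand falls as price rises, which is the kind of relationship the node is meant to capture from real data.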

Regression training

To perform Regression training, you will need a labeled data set, that is, one where the desired output value is already known. This data set must also contain a set of fields, or parameters, that the analytic model can associate with values in the Label Field.

Loan example: Training

This example uses a sample from a loan data set. The object of training is to suggest interest rates for applicants based on parameterized information about the applicant.

Field names: amt = Loan amount, yEm = Years employed, dti = Debt to income ratio, inc = Annual income, intR = Interest rate.

id   amt    yEm  dti    inc     intR
001  14000  10   27     68000   12.29
002  25000  1    20.09  85000   14.65
003  6000   10   27.8   85000   12.69
004  15600  1    13.37  95000   9.17
005  9250   1    21.76  70000   9.99
006  2500   1    14.8   45000   9.99
007  10000  5    14.97  93000   9.17
008  20000  1    11.81  135000  14.65
009  3600   1    29.69  29999   15.61
010  20150  10   27.55  48000   17.86

  1. For training, add four parameters in the Input Fields property: amt, yEm, dti, and inc.
  2. Specify intR in the Label Field property.
  3. Specify a Prediction Field that has the same data type as the field specified in the Label Field property, with an appropriate name, such as Suggested_Interest_Rate (sIR) of the Decimal data type.
  4. Run the analysis.
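The roles of the three properties in the steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, using the first three records of the sample loan data set shown above. The train function is a hypothetical placeholder that always predicts the mean label; it does not represent the algorithm the Regression node runs.

```python
from statistics import fmean

# First three records of the sample loan data set.
records = [
    {"id": "001", "amt": 14000, "yEm": 10, "dti": 27.0, "inc": 68000, "intR": 12.29},
    {"id": "002", "amt": 25000, "yEm": 1, "dti": 20.09, "inc": 85000, "intR": 14.65},
    {"id": "003", "amt": 6000, "yEm": 10, "dti": 27.8, "inc": 85000, "intR": 12.69},
]

INPUT_FIELDS = ["amt", "yEm", "dti", "inc"]  # step 1: parameters the model learns from
LABEL_FIELD = "intR"                         # step 2: the known output value
PREDICTION_FIELD = "sIR"                     # step 3: where the prediction is written

def train(rows):
    """Hypothetical placeholder learner: always predicts the mean of the labels."""
    mean_label = fmean(row[LABEL_FIELD] for row in rows)
    return lambda feature_values: mean_label

model = train(records)

# Step 4: every output record keeps its original fields and gains the prediction field.
output = [
    {**row, PREDICTION_FIELD: round(model([row[f] for f in INPUT_FIELDS]), 2)}
    for row in records
]
print(output[0])
```

The point of the sketch is the shape of the output: the id, the input fields, and the label field pass through unchanged, and one new field is appended.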

The Regression node outputs a data set containing all specified Input Fields, the Label Field, the Prediction Field, and any other fields associated with the records:

id   amt    yEm  dti    inc     intR   sIR
001  14000  10   27     68000   12.29  12.68
002  25000  1    20.09  85000   14.65  12.23
003  6000   10   27.8   85000   12.69  12.22
004  15600  1    13.37  95000   9.17   10.06
005  9250   1    21.76  70000   9.99   12.17
006  2500   1    14.8   45000   9.99   10.65
007  10000  5    14.97  93000   9.17   10.08
008  20000  1    11.81  135000  14.65  12.50
009  3600   1    29.69  29999   15.61  13.54
010  20150  10   27.55  48000   17.86  15.49

Note: Training also creates a child model within the selected analytic model. You can use this child model later for scoring. See Creating analytic models.

Regression evaluation

To evaluate the accuracy of an analytic model, and by extension the accuracy of scoring that is performed using that model, you can use the Regression node's Evaluate operation.

To evaluate a child model produced by training, use a validation data set as the input to a Regression node. See Generating training and validation data sets.
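The idea behind a validation data set is that some labeled records are held back from training so that the model is evaluated on data it has not seen. The sketch below shows one common way to do this, a seeded random split. The split_records function, the 20% fraction, and the seed are assumptions made for illustration; the product's own mechanism is the one described in Generating training and validation data sets.

```python
import random

def split_records(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold back a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Record ids 001 to 010, standing in for the ten labeled loan records.
loan_ids = [f"{i:03d}" for i in range(1, 11)]

training_ids, validation_ids = split_records(loan_ids)
print(len(training_ids), len(validation_ids))
```

Keeping the two sets disjoint is what makes the resulting RMSE a fair estimate of how the model behaves on new records.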

Loan example: Evaluating the child training model

To evaluate the child model created during the loan data set training:

  1. Provide a validation data set as input to the Regression node.
  2. Select Evaluate in the Operation property.

The evaluation produces an RMSE:

ModelDisplayName  ChildModelDisplayName  Rank  RMSE
Regression Model  Child Model 1          1     3.82

Regression re-training and re-evaluating

After training and evaluating your first child model, you can train another one to try to obtain a lower RMSE and more accurate scoring results. To re-train, you can use new data, different parameters as input fields, or both. Each time you re-train using the same analytic model, another child model is produced.

Loan example: Re-training and re-evaluating

In this example, the loan data set's analytic model is re-trained with two additional parameters for each record: open_acc and msld, where open_acc = Number of credit lines open in the borrower's file and msld = Months since last delinquency.

  1. To re-train, edit the original analysis by adding these parameters as Input Fields in the Regression node.
  2. Rebuild the analysis.

    By rebuilding the analysis, a new child model is created within the analytic model.

    The analysis outputs a new data set, containing all six Input Fields, the Label Field, and the Prediction Field.

  3. To determine the effect of adding the two additional parameters to training, use the Regression node's Evaluate operation. This time, select the new child model.

The evaluation produces an RMSE, in this case a slightly improved value:

ModelDisplayName  ChildModelDisplayName  Rank  RMSE
Regression Model  Child Model 2          1     3.79
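Choosing between child models comes down to comparing their RMSE values, which is only meaningful when both were evaluated against the same validation data. As a small sketch, using the two RMSE values reported in this topic and a hypothetical list-of-dictionaries layout:

```python
# RMSE values taken from the two evaluations in this topic.
evaluations = [
    {"child_model": "Child Model 1", "rmse": 3.82},
    {"child_model": "Child Model 2", "rmse": 3.79},
]

# The child model with the lowest RMSE is the candidate for scoring.
best = min(evaluations, key=lambda evaluation: evaluation["rmse"])
print(best["child_model"], best["rmse"])
```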

Regression scoring

  • Prerequisite: You have selected a child model within your analytic model to use for scoring. See Creating analytic models.

Once you have selected a child model to use for scoring, you can create another analysis that uses a Regression node to score an unlabeled data set, that is, to predict values for each record. The new data set must contain the same fields that were used as parameters when the scoring model was trained. During scoring, the Regression node will compare the values in these fields to values in the scoring model.
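Because the data set to be scored must contain the same fields that the model was trained on, it can be useful to check for missing fields before scoring. The following sketch uses the six field names from the re-training example; the missing_fields helper and the third sample record are hypothetical. It distinguishes a field that is absent from a field that is present but blank, since the loan example scores records whose msld value is blank.

```python
# Input fields used when the scoring model was trained (from the re-training example).
REQUIRED_FIELDS = ["amt", "yEm", "dti", "inc", "open_acc", "msld"]

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from the record."""
    return [name for name in required if name not in record]

# Field present with a value.
complete = {"id": "003", "amt": 10000, "yEm": 1, "dti": 24.44,
            "inc": 60000, "open_acc": 10, "msld": 59}
# Field present but blank: the msld column exists, the value is empty.
blank_value = {"id": "001", "amt": 6000, "yEm": 2, "dti": 2.98,
               "inc": 50000, "open_acc": 11, "msld": None}
# Hypothetical record from a data set that lacks two of the columns entirely.
no_columns = {"id": "011", "amt": 6000, "yEm": 2, "dti": 2.98, "inc": 50000}

for record in (complete, blank_value, no_columns):
    print(record["id"], missing_fields(record))
```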

Loan example: Scoring

This example continues from the previous examples in this topic. In evaluation, "Child Model 2" performed slightly better, so this is the model that will be used for scoring.

You have an unlabeled data set containing the model's six Input Fields:

id   amt    yEm  dti    inc     open_acc  msld
001  6000   2    2.98   50000   11
002  35000  10   14.39  86000   13
003  10000  1    24.44  60000   10        59
004  25675  10   18.8   95000   21
005  20000  2    17.18  200000  31
006  9900   1    21.96  45000   10        56
007  10000  10   10.22  150000  11        23
008  14000  1    12.39  110000  11        80
009  18000  7    36.91  85000   15        48
010  28000  6    18.09  165000  17

  1. Provide the unlabeled data set as an input to the Regression node.
  2. Select Score in the Operation property.

    The following results are produced:

    id   amt    yEm  dti    inc     open_acc  msld  sIR
    001  6000   2    2.98   50000   11              11.21
    002  35000  10   14.39  86000   13              13.54
    003  10000  1    24.44  60000   10        59    11.82
    004  25675  10   18.8   95000   21              11.14
    005  20000  2    17.18  200000  31              9.44
    006  9900   1    21.96  45000   10        56    11.71
    007  10000  10   10.22  150000  11        23    11.26
    008  14000  1    12.39  110000  11        80    11.38
    009  18000  7    36.91  85000   15        48    14.33
    010  28000  6    18.09  165000  17              9.83

You can then output this data to a new data store for use in other data stages, such as a dashboard.

Regression or Recommendation: Root Mean Square Error (RMSE)

The Root Mean Square Error (RMSE) is a measure used to evaluate Regression or Recommendation models. RMSE is the square root of the mean of the squared errors, where each error is the difference between a predicted value and the corresponding labeled value.
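Written as a formula, for n records with labeled values y_i and predicted values ŷ_i (symbols chosen here for illustration), each error is squared individually before the mean is taken:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```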

In general, the lower the RMSE, the better the performance of a model. What typifies a "low" RMSE depends on the range of values in the model's label field.

Large errors between predicted values and labeled values have a disproportionate effect on the RMSE, because each error is squared before the mean is taken.
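As a worked example, the sketch below computes the RMSE for the ten records in the training output of the loan example, comparing intR (labeled) with sIR (predicted). Because these are the records the model was trained on, the result is not expected to match the RMSE reported by the Evaluate operation, which uses a separate validation data set. The rmse function name is an assumption for illustration.

```python
from math import sqrt

def rmse(labels, predictions):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# intR (labeled) and sIR (predicted) from the training output table.
int_r = [12.29, 14.65, 12.69, 9.17, 9.99, 9.99, 9.17, 14.65, 15.61, 17.86]
s_ir = [12.68, 12.23, 12.22, 10.06, 12.17, 10.65, 10.08, 12.50, 13.54, 15.49]

training_rmse = rmse(int_r, s_ir)
print(round(training_rmse, 2))

# Same total absolute error (4), spread evenly versus concentrated in one record.
even_errors = rmse([0, 0, 0, 0], [1, 1, 1, 1])
one_large_error = rmse([0, 0, 0, 0], [0, 0, 0, 4])
print(even_errors, one_large_error)
```

The last two values show the magnifying effect of squaring: the same total error produces a higher RMSE when it is concentrated in a single record.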

Properties

Display Name

Specify a name for the node.

The default value is Regression.

Model tab

Operation

Select an operation type. Choose from:

  • Train
  • Score
  • Evaluate

Input Fields

Click Add Field to select input fields to analyze.

Analytic Model

Select an analytic model. You can only choose from Regression type models.

Label Field

Enter a name for a label field which will be included in the output of the node.

Prediction Field

Enter a name for a prediction field which will be included in the output of the node.

Prediction Field Type

Select a data type for the field specified in the Prediction Field. Choose from:

  • Boolean
  • Date
  • String
  • DateTime
  • Time
  • Integer
  • Floating Point
  • Big Integer
  • Decimal
  • Currency

Regression tab

Algorithm

Select an algorithm. Choose from:

  • Random Forest
  • Gradient Boosted Tree

Automatically Tune Parameters

Select this option if you want to automatically tune parameters.

Number Of Trees

Specify the number of trees.

This property is not available if you have selected Automatically Tune Parameters.

Specify no: of classes

Select this option if you want to specify the number of classes, then enter a numeric value.

Max tree depth

Select this option if you want to specify a maximum tree depth, then enter a numeric value.
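To give a feel for what Number Of Trees and Max tree depth control, here is a minimal pure-Python sketch of a random forest style regressor. It is an illustration under stated assumptions, not the product's implementation: each tree is trained on a bootstrap sample of the records, splits are chosen to minimize squared error, and the forest prediction is the average of the tree predictions. A full random forest usually also considers only a random subset of fields at each split, which is omitted here. The function names and the mapping of number_of_trees and max_tree_depth to the node's properties are assumptions.

```python
import random
from statistics import fmean

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, labels, max_depth, depth=0):
    """Return a leaf (float) or a split node (feature, threshold, left, right)."""
    if depth >= max_depth or len(set(labels)) == 1:
        return fmean(labels)
    best = None
    for feature in range(len(rows[0])):
        # Exclude the largest value so both sides of the split are non-empty.
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            cost = sse([labels[i] for i in left]) + sse([labels[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    if best is None:  # all rows are identical, so no split is possible
        return fmean(labels)
    _, feature, threshold, left, right = best
    return (
        feature,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth, depth + 1),
        build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth, depth + 1),
    )

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if row[feature] <= threshold else right
    return node

def train_forest(rows, labels, number_of_trees, max_tree_depth, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(number_of_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in sample],
                                 [labels[i] for i in sample], max_tree_depth))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return fmean([predict_tree(tree, row) for tree in forest])

# amt, yEm, dti, inc for the ten training records; intR is the label.
rows = [
    [14000, 10, 27.0, 68000], [25000, 1, 20.09, 85000], [6000, 10, 27.8, 85000],
    [15600, 1, 13.37, 95000], [9250, 1, 21.76, 70000], [2500, 1, 14.8, 45000],
    [10000, 5, 14.97, 93000], [20000, 1, 11.81, 135000], [3600, 1, 29.69, 29999],
    [20150, 10, 27.55, 48000],
]
labels = [12.29, 14.65, 12.69, 9.17, 9.99, 9.99, 9.17, 14.65, 15.61, 17.86]

forest = train_forest(rows, labels, number_of_trees=25, max_tree_depth=3)
prediction = predict_forest(forest, [6000, 2, 2.98, 50000])
print(round(prediction, 2))
```

In this sketch, more trees smooth out the variation between bootstrap samples, while a larger maximum depth lets each tree fit the training records more closely, at the risk of fitting noise.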