Decision Forest - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: Latest
Language: English
Product name: Data360 Analyze
Title: Data360 Analyze Server Help
Copyright: 2024
First publish date: 2016
Last updated: 2024-11-28
Published on: 2024-11-28T15:26:57.181000

Models data using the random forest method, identifying trends in the data by means of an "ensemble" of decision tree models.

Tip: Before working with this node, complete the prerequisite steps. See Working with the Statistical and Predictive Analytics nodes.
Note: An additional Statistical and Predictive Analytics node pack license is required to run this node. See Applying a node pack license.
Note: This node processes data in-memory. Additional RAM is required when processing large data sets.

Uses the embedded R engine to generate a "forest" of decision trees that models the relationship between a dependent variable and one or more independent variables. The node performs a regression or a classification of the data, depending on the data type of the specified dependent variable. If the dependent variable has a numeric data type, the node performs a regression analysis. If the dependent variable has a string data type, it is treated as categorical and the node performs a classification analysis. If neither the dependent variable nor the independent variables are specified, the node uses an unsupervised machine learning mode to assess the proximities among the data points and the importance of the variables.
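The mode selection described above can be sketched as follows. This is an illustrative Python sketch only, not the node's implementation: the function name `select_analysis_mode` and the type labels in `NUMERIC_TYPES` are hypothetical and were chosen for the example.

```python
from typing import Optional

# Hypothetical labels for numeric field types; the real type names may differ.
NUMERIC_TYPES = {"int", "long", "float", "double"}


def select_analysis_mode(dependent_type: Optional[str]) -> str:
    """Return the analysis mode implied by the dependent variable's data type.

    Pass None when no dependent variable is specified.
    """
    if dependent_type is None:
        return "Unsupervised"
    if dependent_type in NUMERIC_TYPES:
        return "Regression"
    if dependent_type == "string":
        return "Classification"
    # Assumption: other data types are not covered by the description above.
    raise ValueError(f"Unsupported dependent variable type: {dependent_type}")
```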

The node accepts the data to be analyzed on its input pin. You can specify a ModelName. If you do not specify a value in the ModelName property, a default name is used for the generated model.

The model to be analyzed can be defined by configuring the DependentVariable property and IndependentVariables property. Alternatively, the model can be defined by specifying the ModelFormula property.

The generated model can optionally be saved to a file. The configured ModelName is used as part of the filename of the file that contains the serialized model. The location where the saved model is to be written can also be specified. If not set, a default location is used. The node's exception behavior can be configured to specify whether an existing file will be overwritten.

The node can be configured to remove input data records that have a missing (NULL) value. By default, records with missing values are not excluded from the model and will generate an error.

The Summary output pin contains a summary of the model and includes information on:

  • The call used to generate the model.
  • The type of model that has been created: Regression, Classification or Unsupervised.
  • Number of decision trees in the forest.
  • Number of variables that were sampled for splitting at each node in the decision tree.

Additional information is also included in the model summary, depending on the type of model:

  • The value of the Mean Square Error (Regression only).
  • Percentage of the variance explained by the model (Regression only).
  • The "Out of Bag" estimate of the error rate (Classification only).
  • The confusion matrix (Classification only).
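To make the summary statistics concrete, the following Python sketch computes them from lists of actual and predicted values, using their conventional definitions. It is an assumption that the node's figures follow these definitions exactly. The function names are hypothetical, and the sketch takes the predictions as given rather than deriving them from out-of-bag samples.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts records whose
    actual class is labels[i] and whose predicted class is labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return labels, [[counts[(a, p)] for p in labels] for a in labels]


def error_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mean_square_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pct_variance_explained(actual, predicted):
    """100 * (1 - MSE / variance of the actual values)."""
    mean = sum(actual) / len(actual)
    variance = sum((a - mean) ** 2 for a in actual) / len(actual)
    return 100.0 * (1.0 - mean_square_error(actual, predicted) / variance)
```

For a classification model, `confusion_matrix` and `error_rate` would be applied to the class labels. For a regression model, `mean_square_error` and `pct_variance_explained` would be applied to the numeric values.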

The importance output pin contains measures of the importance of each variable in the model. The specific measures that are output depend on the type of model:

Regression:

  • The percentage increase in Mean Square Error.
  • The decrease in node impurity (as measured by the residual sum of squares).

Classification:

  • The mean decrease in prediction accuracy.
  • The mean decrease in the Gini index.

Unsupervised:

  • The mean decrease in prediction accuracy.
  • The mean decrease in the Gini index.
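As background for the Gini-based measure, the sketch below shows the conventional Gini impurity of a set of class labels and the decrease in impurity produced by one split. It is a minimal illustration with hypothetical function names. How the node aggregates these decreases across the trees in the forest is not described here, so the sketch covers a single split only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted
```

A variable that produces large decreases when it is used for splitting separates the classes well, which is why a larger value indicates a more important variable.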

Powered by TIBCO®.

Properties

ModelName

Optionally specify the name of the model, which is displayed in the output data. When the node is configured to write the serialized model to a file, the model name is also used as the output filename.

A model name must start with a letter and may contain any of the following:

  • letters
  • numbers
  • period character (".")
  • underscore ("_")

If not specified, the default model name "DecisionForest" is used.
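The naming rule can be expressed as a regular expression. The following Python sketch is illustrative only: the function name `resolve_model_name` is hypothetical, and it assumes that "letters" means ASCII letters and that "numbers" means the digits 0 to 9.

```python
import re
from typing import Optional

DEFAULT_MODEL_NAME = "DecisionForest"

# A letter first, then any mix of letters, digits, periods and underscores.
# Assumption: only ASCII letters and digits are permitted.
_MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def resolve_model_name(name: Optional[str]) -> str:
    """Return the model name to use, falling back to the default when unset."""
    if not name:
        return DEFAULT_MODEL_NAME
    if not _MODEL_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid model name: {name!r}")
    return name
```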

ModelFormula

Optionally specify the formula for the random forest model. For example:

dependent ~ predictor1 + predictor2 + predictor3

This property should not be set if either of the following applies:

  • unsupervised mode of the random forest model is to be used.
  • the DependentVariable/IndependentVariables properties are specified.

DependentVariable

Optionally specify the dependent variable that is to be modeled on the independent variable(s). Only one dependent variable can be specified.

This property should not be set if either of the following applies:

  • unsupervised mode of the random forest model is to be used.
  • the ModelFormula property is specified.

IndependentVariables

Optionally specify the independent variables, that is, the predictors that are to be used to model the dependent variable. Enter a comma-separated list of the fields that contain the independent variables. If the DependentVariable property is specified, at least one independent variable must also be specified.

This property should not be set if either of the following applies:

  • unsupervised mode of the random forest model is to be used.
  • the ModelFormula property is specified.
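The way the ModelFormula, DependentVariable and IndependentVariables properties combine can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the node's implementation. The function name `resolve_formula` is hypothetical, and treating IndependentVariables without DependentVariable as an error is an assumption, because that combination is not described above.

```python
from typing import Optional


def resolve_formula(model_formula: Optional[str],
                    dependent: Optional[str],
                    independents: Optional[str]) -> Optional[str]:
    """Return the model formula to use, or None for unsupervised mode."""
    # IndependentVariables is a comma-separated list of field names.
    predictors = [f.strip() for f in (independents or "").split(",") if f.strip()]

    if model_formula:
        if dependent or predictors:
            raise ValueError(
                "Set either ModelFormula or the variable properties, not both")
        return model_formula

    if dependent:
        if not predictors:
            raise ValueError("At least one independent variable is required")
        return f"{dependent} ~ {' + '.join(predictors)}"

    if predictors:
        # Assumption: this combination is rejected; the source does not cover it.
        raise ValueError("IndependentVariables set without DependentVariable")

    return None  # nothing specified: unsupervised mode
```

For example, a dependent variable of `dependent` with the independent variables `predictor1, predictor2` yields the formula `dependent ~ predictor1 + predictor2`.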

ModelOutputMode

Optionally specify whether the serialized model is written to a file on disk. This property also determines how ModelOutputField and ModelOutputDirectory behave. The default value is None.

ModelOutputField

Optionally specify the name of the output field that contains the full path of the file where the serialized model has been written. The default value is "df_ModelOutput".

ModelOutputDirectory

Optionally specify the directory where the serialized model is written when ModelOutputMode is set to File. When ModelOutputDirectory is blank, files are written to the Data360 Analyze temporary directory. Otherwise, the files are written to the specified directory, which must exist and be writable. By default, this node does not overwrite existing files. This behavior can be set on the ExceptionBehavior tab.

This property should only be filled in when ModelOutputMode is set to File.
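The directory rules can be sketched as follows. This Python sketch is illustrative only: the function name `resolve_output_directory` is hypothetical, the system temporary directory stands in for the Data360 Analyze temporary directory, and ignoring the directory when ModelOutputMode is not File is an assumption.

```python
import os
import tempfile
from typing import Optional


def resolve_output_directory(mode: str, directory: Optional[str]) -> Optional[str]:
    """Return the directory to write the serialized model to, or None."""
    if mode != "File":
        # Assumption: the directory is not used unless the mode is File.
        return None
    if not directory:
        # Stand-in for the Data360 Analyze temporary directory.
        return tempfile.gettempdir()
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory does not exist: {directory}")
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Directory is not writable: {directory}")
    return directory
```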

FileExistsBehavior

Optionally specify whether an existing serialized model file will be overwritten. Choose from:

  • Error - Generate an error and do not overwrite the file.
  • Log - Log a warning message and do not overwrite the file.
  • Ignore - Do not overwrite the file.
  • Overwrite - Overwrite the file.

The default value is Error.
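The four options can be summarized as a small decision function. This Python sketch is illustrative only: the function name `should_write` is hypothetical, and the use of the standard `logging` module and of `FileExistsError` are choices made for the example.

```python
import logging
import os


def should_write(path: str, behavior: str = "Error") -> bool:
    """Return True if the serialized model should be written to path."""
    if not os.path.exists(path):
        return True  # nothing to overwrite
    if behavior == "Overwrite":
        return True
    if behavior == "Ignore":
        return False  # leave the existing file untouched, silently
    if behavior == "Log":
        logging.warning("Model file already exists, not overwriting: %s", path)
        return False
    if behavior == "Error":
        raise FileExistsError(path)
    raise ValueError(f"Unknown FileExistsBehavior: {behavior}")
```

Only Overwrite replaces an existing file. The other three options differ in how the skipped write is reported.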

ExcludeNullValues

Optionally specify whether records containing NULL values are to be excluded. If set to True, all records from the input data set that contain a NULL value are excluded. The default value is False.
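The effect of this property can be sketched as follows. This Python sketch is illustrative only: it represents records as dictionaries and NULL as `None`, and the function name `prepare_records` is hypothetical. With the default setting, a record with a missing value raises an error, which reflects the behavior described earlier for records with missing values.

```python
from typing import Any, Dict, List


def prepare_records(records: List[Dict[str, Any]],
                    exclude_null_values: bool = False) -> List[Dict[str, Any]]:
    """Drop records containing None when requested; otherwise fail on them."""
    if exclude_null_values:
        return [r for r in records if all(v is not None for v in r.values())]
    for index, record in enumerate(records):
        if any(v is None for v in record.values()):
            raise ValueError(f"Record {index} contains a NULL value")
    return list(records)
```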

Inputs and outputs

Inputs: data.

Outputs: Summary, importance.