Linear Regression - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: Latest
Language: English
Product name: Data360 Analyze
Title: Data360 Analyze Server Help
Copyright: 2024
First publish date: 2016
Last updated: 2024-11-28
Published on: 2024-11-28T15:26:57.181000

Models data using linear regression, allowing trends in the data to be identified.

Tip: Before working with this node, there are a number of prerequisite steps, see Working with the Statistical and Predictive Analytics nodes.
Note: An additional Statistical and Predictive Analytics node pack license is required to run this node. See Applying a node pack license.
Note: This node processes data in-memory. Additional RAM will be required when processing data sets with a large volume of data.

This node uses the embedded R engine to model the relationship between a dependent (response) variable and one or more independent (explanatory) variables.

The node fits the data using a linear regression model.

The node provides two summaries of the model created for the input data together with details of the regression coefficients and residual errors.

The Summary pin contains a summary of the model and includes information on:

  • The call used to generate the model.
  • Range and quartile values for the residual errors.
  • The estimates for the coefficients of the independent variables used in the model and the estimate of the intercept, the standard error, the t-statistic value and p-value.
  • Significance code indicators for the independent variables and the intercept.
  • The residual standard error.
  • The Multiple R-squared (coefficient of determination) value and Adjusted R-squared value.
  • The F-statistic with the corresponding p-value for that test.
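As a rough illustration of how the goodness-of-fit figures listed above relate to each other, the following Python sketch computes the Multiple R-squared, Adjusted R-squared and residual standard error from a set of observed and fitted values. It is not the node's implementation (the node uses the embedded R engine); the function name `fit_statistics` and the sample values are hypothetical.

```python
from math import sqrt

def fit_statistics(observed, fitted, n_predictors):
    """Return (r_squared, adjusted_r_squared, residual_standard_error).

    Assumes the fitted values come from a least squares model that
    includes an intercept and n_predictors independent variables.
    """
    n = len(observed)
    mean_observed = sum(observed) / n
    # Residual sum of squares: variation left unexplained by the model.
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Total sum of squares: variation of the observations around their mean.
    tss = sum((o - mean_observed) ** 2 for o in observed)
    residual_df = n - n_predictors - 1
    r_squared = 1 - rss / tss
    adjusted = 1 - (1 - r_squared) * (n - 1) / residual_df
    residual_standard_error = sqrt(rss / residual_df)
    return r_squared, adjusted, residual_standard_error

# Hypothetical observations and the fitted values of a one-predictor model.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r_squared, adjusted, rse = fit_statistics(observed, fitted, n_predictors=1)
```

The Adjusted R-squared is never larger than the Multiple R-squared because it penalizes each additional independent variable.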

If the node is configured to output the serialized model to a file, the Summary pin also includes the file path to the file that contains the serialized model.

The Residuals pin contains the residual error for each observation in the input data, that is, the difference between the observed value and the fitted value derived by the model.

The Coefficients pin contains the estimated values of the coefficients of the independent variables and the intercept for the best-fit line determined by the model.
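To make the relationship between the coefficients, the fitted values and the residuals concrete, here is a minimal Python sketch of ordinary least squares for a single independent variable. It only illustrates the arithmetic; the node itself delegates the fit to the embedded R engine, and the function name and data below are hypothetical.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Fit y ~ x by ordinary least squares.

    Returns (intercept, slope, residuals), where each residual is the
    observed value minus the fitted value for that observation.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return intercept, slope, residuals

# Hypothetical input data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
intercept, slope, residuals = fit_simple_ols(x, y)
```

For a model that includes an intercept, the residuals sum to zero (up to floating-point rounding), which is a quick sanity check on any fit.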

The anovaSummary pin contains a summary table for an Analysis Of Variance performed on the regression model and includes information on:

  • Details of the Degrees of Freedom, Sum of Squares, Mean of Squares, F-value and p-value for each of the independent variables in the regression model.
  • The Degrees of Freedom, Sum of Squares and Mean of Squares for the Residuals.
  • An indication when the model may be unbalanced.
  • Significance code indicators for the independent variables.
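The quantities in the Analysis Of Variance table can be sketched as follows for the simplest case of a single independent variable. This is an illustrative Python calculation under that assumption, not the node's own code; the function name, the row labels and the sample values are hypothetical, and the p-value is omitted because it requires the F distribution.

```python
def anova_one_predictor(observed, fitted):
    """Build a small ANOVA table for a one-predictor model with an intercept."""
    n = len(observed)
    mean_observed = sum(observed) / n
    # Sum of squares explained by the predictor.
    ss_model = sum((f - mean_observed) ** 2 for f in fitted)
    # Sum of squares left in the residuals.
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    df_model, df_residual = 1, n - 2
    ms_model = ss_model / df_model
    ms_residual = ss_residual / df_residual
    return {
        "x": {"Df": df_model, "Sum Sq": ss_model, "Mean Sq": ms_model,
              "F value": ms_model / ms_residual},
        "Residuals": {"Df": df_residual, "Sum Sq": ss_residual,
                      "Mean Sq": ms_residual},
    }

# Hypothetical observations and least squares fitted values.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
table = anova_one_predictor(observed, fitted)
```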

The ModelFormula property can be used to provide fine-grain control of the specification of the variables to be included in the regression analysis.

The basic format of the model formula is:

dependent variable ~ independent variables

where the ~ indicates that the dependent variable "is modeled by" or "is modeled as a function of" the independent variables.

A simple linear regression could have a model formula of:

y ~ x

Similarly, a multiple linear regression with two independent variables could use the following formula:

y ~ x + z

Note that the operators used in a model formula do not have their normal mathematical meaning. In this case, the "+" means "include this variable".

The model formula supports the use of other operators too, for example:

y ~ x * z

which indicates that in addition to including the "x" and the "z" variable in the model, the interaction between the "x" and "z" variables should also be included in the model.
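The meaning of the "+" and "*" operators can be illustrated with a small Python sketch that expands a formula into the terms it implies. It handles only these two operators and is an illustration of the notation, not a reimplementation of the R formula parser; the function name `expand_formula` is hypothetical, the "x:z" notation for an interaction term follows the R convention, and the order of the terms may differ from the order R reports.

```python
from itertools import combinations

def expand_formula(formula):
    """Split 'y ~ a * b + c' into the response and the list of model terms."""
    response, rhs = (part.strip() for part in formula.split("~"))
    terms = []
    for summand in rhs.split("+"):
        factors = [f.strip() for f in summand.split("*")]
        # a * b stands for a, b and the interaction a:b;
        # a * b * c additionally includes a:c, b:c and a:b:c.
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                term = ":".join(combo)
                if term not in terms:
                    terms.append(term)
    return response, terms

additive = expand_formula("y ~ x + z")
interacting = expand_formula("y ~ x * z")
```

Here `additive` holds the terms `x` and `z`, while `interacting` also holds the interaction term `x:z`.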

The * operator can be useful in a range of scenarios. For instance, if "z" is a categorical variable (e.g. a field with a boolean or string data type), the operator can be used to create a model that estimates the regression coefficients for each level (category) in "z". That is, it implements a "group-by" function using the "z" variable. The coefficients generated by the node include the interaction terms for each level except the baseline factor. The lowest level of the categorical variable (in terms of its alphanumeric value) is used as the baseline factor for the interactions, and the interaction coefficients for each of the other levels are relative to the baseline coefficient. When used with the Predict Linear Regression node, the regression model can be used to predict outcomes for each level in the "group-by" variable.
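Because the interaction coefficients are expressed relative to the baseline, the fitted line for a given level is obtained by adding that level's coefficients to the baseline ones. The Python sketch below shows the arithmetic. The coefficient names ("(Intercept)", "zB", "x:zB") follow the naming convention used by R and are an assumption about how the terms appear on the Coefficients pin; the function name and the coefficient values are hypothetical.

```python
def per_level_lines(coefficients, x_name, factor_name, levels):
    """Return {level: (intercept, slope)}; levels[0] is the baseline level."""
    base_intercept = coefficients["(Intercept)"]
    base_slope = coefficients[x_name]
    lines = {levels[0]: (base_intercept, base_slope)}
    for level in levels[1:]:
        # Offsets of this level relative to the baseline level.
        intercept_offset = coefficients.get(f"{factor_name}{level}", 0.0)
        slope_offset = coefficients.get(f"{x_name}:{factor_name}{level}", 0.0)
        lines[level] = (base_intercept + intercept_offset,
                        base_slope + slope_offset)
    return lines

# Hypothetical coefficients for y ~ x * z, where z has levels A, B and C.
coefficients = {
    "(Intercept)": 1.0, "x": 2.0,
    "zB": 0.5, "x:zB": -1.0,
    "zC": -0.25, "x:zC": 0.5,
}
lines = per_level_lines(coefficients, "x", "z", ["A", "B", "C"])
```

In this example level "A" is the baseline, so its line uses the intercept and the "x" coefficient unchanged, while the lines for "B" and "C" are shifted by their respective offsets.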

Variables with alphanumeric levels are automatically considered to be categorical (factors). In the situation where a variable uses numeric values to indicate the levels (e.g. male=1, female=2) the model formula can be modified to indicate that the variable should be treated as a categorical variable when building the regression model by using the "as.factor()" function:

y ~ x * as.factor(z)

In the example given, the baseline factor level would be "1" i.e. male.
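One way to picture what treating a variable as a factor means is indicator (dummy) coding: the lowest level becomes the baseline and each remaining level gets its own 0/1 column. The Python sketch below illustrates this under the assumption that levels are ordered by simple sorting; the function name `treatment_code` and the sample values are hypothetical.

```python
def treatment_code(values):
    """Return (baseline, other_levels, rows) for a categorical variable.

    Each row has one 0/1 indicator per non-baseline level.
    """
    levels = sorted(set(values))
    baseline, other_levels = levels[0], levels[1:]
    rows = [[1 if value == level else 0 for level in other_levels]
            for value in values]
    return baseline, other_levels, rows

# Hypothetical data using the coding male=1, female=2.
baseline, other_levels, rows = treatment_code([1, 2, 2, 1])
```

With this data the baseline is 1 and there is a single indicator column for level 2, so any coefficient reported for level 2 is relative to level 1.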

Powered by TIBCO®

Properties

ModelName

Optionally specify the name of a model which is displayed on the output data. When the node is configured to write the serialized model to a file, the model name is also used as the output filename.

A model name must start with a letter and may contain any of the following:

  • letters
  • numbers
  • period character (".")
  • underscore ("_")

If not specified, a default model name is displayed on the output data.
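The naming rule above can be checked before running the node with a short regular expression. The following Python sketch assumes that "letters" means ASCII letters, which the documentation does not state; the function name is hypothetical.

```python
import re

# Starts with a letter, then any mix of letters, digits, "." and "_".
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")

def is_valid_model_name(name):
    """Return True if the name satisfies the documented ModelName rule."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None

checks = {
    "sales_model.v2": is_valid_model_name("sales_model.v2"),
    "2024model": is_valid_model_name("2024model"),
    "my-model": is_valid_model_name("my-model"),
}
```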

ModelFormula

Optionally specify the formula for the linear regression model. For example:

dependent ~ predictor1 + predictor2 + predictor3

or

dependent ~ predictor1 * predictor2 * predictor3

Do not specify this property if the DependentVariable, IndependentVariables or OmitModelConstant properties are set.

This property is case-sensitive.

DependentVariable

Specify the variable which is dependent on the independent variables.

Only one dependent variable can be input.

A value is required for this property if a model formula is not specified.

This property is case-sensitive.

IndependentVariables

Optionally specify the independent variables (i.e. the predictors) to be used in the model. Enter a comma-separated list of the fields that contain the independent variables.

If the ModelFormula property is not specified, at least one independent variable must be specified.

If the ModelFormula property is specified, this property should not be set.

This property is case-sensitive.

WeightVariable

Optionally specify the variable used to estimate a weighted least squares model.

Only one weight variable can be input.

This property is case-sensitive.
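For readers unfamiliar with weighted least squares, the Python sketch below shows the idea for a single independent variable: each observation's squared residual is multiplied by its weight before the sum is minimized. It is an illustration of the technique only, not the node's implementation; the function name and the sample data are hypothetical.

```python
def fit_weighted_line(x, y, weights):
    """Fit y ~ x by minimizing sum(w * (y - intercept - slope * x) ** 2)."""
    total_weight = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total_weight
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total_weight
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: equal weights reproduce the unweighted fit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
equal_fit = fit_weighted_line(x, y, [1.0] * 5)
# Giving the first observation zero weight removes its influence.
reduced_fit = fit_weighted_line(x, y, [0.0, 1.0, 1.0, 1.0, 1.0])
```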

OmitModelConstant

Optionally specify whether a model constant is to be excluded from the model.

If the ModelFormula property is specified, this property should not be set.

The default value is False.
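The interplay between ModelFormula and the DependentVariable, IndependentVariables and OmitModelConstant properties can be summarized in a short Python sketch that returns the formula implied by a given configuration. How the node combines these properties internally is not documented here, so this is an assumption: the sketch uses the R formula convention of appending "- 1" to drop the model constant, and it raises an error for conflicting settings, whereas the documentation only states that they should not be combined. The function name is hypothetical.

```python
def build_formula(model_formula=None, dependent=None,
                  independents=None, omit_constant=False):
    """Return the model formula implied by the property values."""
    if model_formula:
        if dependent or independents or omit_constant:
            raise ValueError("ModelFormula should not be combined with "
                             "the variable or constant properties")
        return model_formula
    if not dependent:
        raise ValueError("DependentVariable is required without a formula")
    # IndependentVariables is a comma-separated list of field names.
    names = [n.strip() for n in (independents or "").split(",") if n.strip()]
    if not names:
        raise ValueError("At least one independent variable is required")
    rhs = " + ".join(names)
    if omit_constant:
        rhs += " - 1"  # R formula convention for removing the intercept
    return f"{dependent} ~ {rhs}"

with_constant = build_formula(dependent="y", independents="x, z")
without_constant = build_formula(dependent="y", independents="x, z",
                                 omit_constant=True)
```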

ModelOutputMode

Optionally specify whether the serialized model is written to a file on disk.

This property also determines how ModelOutputField and ModelOutputDirectory behave.

The default value is None.

ModelOutputField

Optionally specify the name of the output field that contains the full path of the file where the serialized model has been written.

The default value is "lm_ModelOutput".

ModelOutputDirectory

Optionally specify the directory where the serialized model is written when ModelOutputMode is set to File. When ModelOutputDirectory is blank, files are written to the Data360 Analyze temporary directory. Otherwise, the files are written to the specified directory, which must exist and be writable. By default, this node does not overwrite existing files. This behavior can be set in the ExceptionBehavior tab.

This property should only be filled in when ModelOutputMode is set to File.

OutputAdditionalAttributes

Optionally specify whether an extended set of coefficient attributes is to be provided on the output.

If set to True:

1. The following values are output on the Summary output pin:

  • F-statistic
  • Number of Degrees of Freedom used
  • Total Degrees of Freedom
  • Multiple R squared
  • Adjusted R squared
  • p value for the model
  • AIC Equivalent Degrees of Freedom
  • AIC value

2. The following values are output on the Coefficients output pin:

  • Standard Error
  • t-value
  • p-value

The default value is False.
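The additional coefficient attributes are related to the estimates in a simple way: the t-value is the estimate divided by its standard error. The Python sketch below shows the calculation for a model with one independent variable. It is illustrative only and not the node's code; the function name and data are hypothetical, and the p-value is left out because it requires the t distribution.

```python
from math import sqrt

def coefficient_t_values(x, y):
    """Return {term: (estimate, standard_error, t_value)} for y ~ x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance on n - 2 degrees of freedom.
    variance = rss / (n - 2)
    se_slope = sqrt(variance / sxx)
    se_intercept = sqrt(variance * (1 / n + x_bar ** 2 / sxx))
    return {
        "(Intercept)": (intercept, se_intercept, intercept / se_intercept),
        "x": (slope, se_slope, slope / se_slope),
    }

# Hypothetical input data.
attributes = coefficient_t_values([1.0, 2.0, 3.0, 4.0, 5.0],
                                  [2.0, 4.0, 5.0, 4.0, 5.0])
```

For a model with a single independent variable, the square of the slope's t-value equals the model's F-statistic, which offers a convenient cross-check between the two outputs.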

FileExistsBehavior

Optionally specify whether an existing serialized model file will be overwritten. Choose from:

  • Error - Generate an error and do not overwrite the file.
  • Log - Log a warning message and do not overwrite the file.
  • Ignore - Do not overwrite the file.
  • Overwrite - Overwrite the file.

The default value is Error.
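The four FileExistsBehavior options, together with the ModelOutputDirectory rules, can be pictured with the following Python sketch. It is a hypothetical model of the documented behavior, not the node's code: the function names and the ".model" file extension are invented for illustration, and the system temporary directory stands in for the Data360 Analyze temporary directory.

```python
import logging
import os
import tempfile

def resolve_model_path(model_name, output_directory="", extension=".model"):
    """Return the file path for the serialized model."""
    # A blank directory falls back to a temporary directory (stand-in).
    directory = output_directory or tempfile.gettempdir()
    if not os.path.isdir(directory) or not os.access(directory, os.W_OK):
        raise OSError(f"Directory must exist and be writable: {directory}")
    return os.path.join(directory, model_name + extension)

def should_write(path, file_exists_behavior="Error"):
    """Decide whether the model file may be written."""
    if not os.path.exists(path):
        return True
    if file_exists_behavior == "Overwrite":
        return True
    if file_exists_behavior == "Error":
        raise FileExistsError(path)
    if file_exists_behavior == "Log":
        logging.warning("Model file exists and was not overwritten: %s", path)
        return False
    if file_exists_behavior == "Ignore":
        return False
    raise ValueError(f"Unknown FileExistsBehavior: {file_exists_behavior}")
```

Only the Overwrite option replaces an existing file; Log and Ignore both leave the file untouched and differ only in whether a warning is recorded.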

Inputs and outputs

Inputs: data.

Outputs: Summary, Residuals, Coefficients, anovaSummary.