Models data using quantile regression, allowing data trends to be identified for conditional quantiles of the response variable's distribution.
This node uses the embedded R engine to model the relationship between a dependent (response) variable and one or more independent (explanatory) variables. The node fits the data using a quantile regression model.
Quantile regression provides a flexible means of modeling data with heterogeneous conditional distributions. These distributions typically occur in ecology, the life sciences and economics, where the relationship of the independent variables with the conditional mean of the dependent variable differs from their relationship with the quantiles of the dependent variable. Linear Regression uses Ordinary Least Squares (OLS) to model the relationship between the independent variables and the conditional mean of the dependent variable, but may not provide an accurate model where the conditional distribution is heterogeneous. Under these conditions, quantile regression can provide a more comprehensive picture of the effect of the independent variables across the spectrum of the dependent variable.
Quantile regression can also be used to model the effects at the median (i.e. the 0.50 quantile), which can be more robust to outliers than a regression on the conditional mean.
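The robustness of the median can be illustrated with the check (pinball) loss that quantile regression minimizes. The following is a minimal sketch in plain Python, not the node's implementation: the function names are hypothetical, and the model is reduced to a single constant so that the minimizer is simply a sample quantile.

```python
from statistics import mean

def check_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_loss(values, candidate, tau):
    """Sum of check losses when 'candidate' is used as the fitted value."""
    return sum(check_loss(v - candidate, tau) for v in values)

def best_constant(values, tau):
    """Brute-force search over the observed values for the loss minimizer."""
    return min(sorted(values), key=lambda c: total_loss(values, c, tau))

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 500.0]

# The tau = 0.5 fit is unchanged by the outlier; the mean is not.
median_clean = best_constant(clean, 0.5)
median_outlier = best_constant(with_outlier, 0.5)
mean_clean = mean(clean)
mean_outlier = mean(with_outlier)
```

In this sketch the 0.50 fit stays at the same value when one observation is replaced by an extreme value, while the mean moves substantially.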
The node accepts the data to be analyzed on its input pin. The ModelName can be specified. If it is not set, a default name "QuantileReg" is used for the generated model.
The model to be analyzed can be defined by configuring the DependentVariable property and IndependentVariables property. Alternatively, the model can be defined by specifying the ModelFormula property. The ModelFormula property can be a Literal value or obtained from a field on the node's optional second input pin.
If categorical independent variables are being used in the model, please ensure that there are a sufficient number of observations in each category of the variable. Otherwise the node may not be able to process the data correctly, resulting in an error.
By default, the following values apply to the model:
- Quantile level is 0.50.
- Estimation method is Barrodale and Roberts.
The quantile levels to be used in the regression can be specified as a Literal value or obtained from a field on the node's optional second input pin.
The weights to be used in a weighted OLS model can be specified as a Literal value or obtained from a field on the node's optional second input pin.
The model constant can be included or omitted from the model.
The generated model can optionally be saved to a file. The configured ModelName is used as part of the filename of the file that contains the serialized model. The location where the saved model is to be written can also be specified. If not set, a default location is used. The node's exception behavior can be configured to specify whether an existing file will be overwritten.
Hypothesis statistics for the quantile coefficients can optionally be output.
There is a choice of estimation method to be used in the quantile regression. The choice of estimation method should take into consideration the size of the data set being analyzed as some methods are less efficient for large data sets.
You can specify the assumption to be made on variable distribution - i.e. whether or not the data are "independent and identically distributed" (iid).
The node can be configured to remove input data records that have a missing (NULL) value. By default, records with missing values are not excluded from the model, and their presence causes the node to generate an error.
When run, the node provides a summary of the model created at its Summary pin. The information provided depends on the number of quantiles (taus) that were analyzed in the model. By default, only one quantile level (0.50) is analyzed.
If one quantile (tau) was analyzed, the Summary pin includes information on:
- The call used to generate the model.
- The quantile level.
- The estimates for the coefficients of the independent variables used in the model and the estimate of the intercept.
- If the node was not configured to output the model test statistics, the upper and lower confidence bounds for the coefficients (for the 95% confidence interval).
- If the node was configured to output the model test statistics, the summary instead includes information on the Standard Error, the t-value and the p-value.
- The Degrees of Freedom (DF), together with the Log Likelihood (LogLik), the Akaike information criterion (AIC) and the value of the objective function at the solution (rho).
If more than one quantile (tau) was analyzed, the Summary pin includes the following information for each quantile level (tau):
- The call used to generate the model.
- The quantile level.
- The estimates for the coefficients of the independent variables used in the model and the estimate of the intercept at the specified quantile level.
- If the node was not configured to output the model test statistics, the upper and lower confidence bounds for the 95% confidence interval at the specified quantile level.
- If the node was configured to output the model test statistics, the summary instead includes information on the Standard Error, the t-value and the p-value for each quantile level (tau).
- The Degrees of Freedom (DF), together with the Log Likelihood (LogLik), the Akaike information criterion (AIC) and the value of the objective function at the solution (rho) for each quantile level (tau).
If the node is configured to output the serialized model to a file, the Summary pin also includes the file path to the file that contains the serialized model.
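The rho, LogLik and AIC values reported on the Summary pin are related to each other. The sketch below shows one common convention, the one used by R's quantreg package, in which the log likelihood is based on an asymmetric Laplace working likelihood. It is an assumption that the node reports values computed this way; the function names are hypothetical.

```python
import math

def check_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def rho(residuals, tau):
    """Value of the quantile regression objective function at the solution."""
    return sum(check_loss(r, tau) for r in residuals)

def log_likelihood(residuals, tau):
    """Asymmetric Laplace log likelihood with the scale at its estimate rho / n."""
    n = len(residuals)
    objective = rho(residuals, tau)
    if objective <= 0:
        raise ValueError("log likelihood is undefined when rho is zero")
    return n * (math.log(tau * (1 - tau)) - 1 - math.log(objective / n))

def aic(residuals, tau, df):
    """Akaike information criterion: -2 * LogLik + 2 * DF."""
    return -2 * log_likelihood(residuals, tau) + 2 * df

example_residuals = [-1.0, 0.0, 1.0, 2.0]
example_rho = rho(example_residuals, 0.5)
example_loglik = log_likelihood(example_residuals, 0.5)
example_aic = aic(example_residuals, 0.5, df=2)
```

A lower rho at a given quantile level means a smaller total check loss, which in this convention gives a higher LogLik and a lower AIC for the same number of degrees of freedom.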
The Residuals pin contains, for each observation and quantile level, the value of the residual (error) - i.e. the difference between each of the observations in the input data and the fitted values derived by the model at each quantile level. The pin also contains a record ID field where the ID aligns with the position of the record in the input data set (i.e. it will be non-contiguous where records have been omitted due to them containing NULL values).
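The structure of the Residuals output can be sketched as follows. This is an illustration only: the field names ("id", "tau", "residual"), the 1-based record numbering and the single-predictor linear model are assumptions, not the node's actual output schema.

```python
def residual_records(rows, coefficients_by_tau):
    """Return one residual record per complete observation and quantile level.

    rows: list of (x, y) pairs, where either value may be None (NULL).
    coefficients_by_tau: mapping of tau -> (intercept, slope).
    """
    records = []
    for record_id, (x, y) in enumerate(rows, start=1):  # 1-based IDs (assumed)
        if x is None or y is None:
            continue  # omitted record: its ID never appears in the output
        for tau, (intercept, slope) in sorted(coefficients_by_tau.items()):
            fitted = intercept + slope * x
            records.append({"id": record_id, "tau": tau, "residual": y - fitted})
    return records

rows = [(1.0, 2.0), (2.0, None), (3.0, 7.0)]
coefficients = {0.5: (0.0, 2.0)}
output = residual_records(rows, coefficients)
output_ids = [record["id"] for record in output]
output_residuals = [record["residual"] for record in output]
```

Because the second input record contains a NULL value, the IDs in the output skip from 1 to 3, which is the non-contiguous behavior described above.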
The information provided on the Coefficients pin depends on the number of quantiles (taus) that were analyzed in the model and whether model hypothesis test statistics were to be output. By default, hypothesis test statistics are not output.
If one quantile (tau) was analyzed and test statistics are not output, the Coefficients pin contains multiple types of records. For the quantile level, the following records are output:
- The quantile regression coefficient value.
- The upper confidence bound for the quantile regression coefficient.
- The lower confidence bound for the quantile regression coefficient.
- The Ordinary Least Squares (OLS) coefficient from the corresponding Linear Regression model.
- The upper confidence bound for the OLS coefficient.
- The lower confidence bound for the OLS coefficient.
Each record contains the following fields: the quantile level, the record type, and the coefficient/bound value for each independent variable and the intercept.
If one quantile (tau) was analyzed and test statistics are output, the Coefficients pin contains multiple types of records. For the quantile level, the following records are output:
- The quantile regression coefficient value.
- The standard error value.
- The t-test value.
- The p-value.
Each record contains the following fields: the quantile level, the record type, and the coefficient/test statistic value for each independent variable and the intercept.
If more than one quantile (tau) was analyzed and test statistics are not output, the Coefficients pin contains multiple types of records for each quantile level. For a particular quantile level, the following records are output:
- The quantile regression coefficient value.
- The upper confidence bound for the quantile regression coefficient.
- The lower confidence bound for the quantile regression coefficient.
Each record contains the following fields: the quantile level, the record type, and the coefficient/confidence bound value for each independent variable and the intercept.
If more than one quantile (tau) was analyzed and test statistics are output, the Coefficients pin contains multiple types of records for each quantile level. For a particular quantile level, the following records are output:
- The quantile regression coefficient value.
- The standard error value.
- The t-test value.
- The p-value.
Each record contains the following fields: the quantile level, the record type, and the coefficient/test statistic value for each independent variable and the intercept.
Powered by TIBCO®
Properties
ModelName
Optionally specify the name of the model, which is displayed on the output data.
When the node is configured to write the serialized model to a file, the model name is also used as the output filename.
A model name must start with a letter and may contain any of the following:
- letters
- numbers
- period character (".")
- underscore ("_")
If not specified, a default model name "QuantileReg" is displayed on the output data.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
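The naming rules above can be expressed as a simple pattern check. This is a minimal sketch rather than the node's own validation: the helper name is hypothetical, and "letters" is assumed to mean the ASCII letters A-Z and a-z.

```python
import re

# Starts with a letter, then any mix of letters, digits, "." and "_".
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")
DEFAULT_MODEL_NAME = "QuantileReg"

def resolve_model_name(name=None):
    """Return a usable model name, falling back to the documented default."""
    if name is None or name == "":
        return DEFAULT_MODEL_NAME
    if not MODEL_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid model name: {name!r}")
    return name
```

For example, "Sales_Q1.v2" passes this check, while "1model" and "my model" do not.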
ModelFormula
Optionally specify the formula for the quantile regression model.
For example: dependent ~ predictor1 + predictor2 + predictor3
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
This property should not be specified if the properties DependentVariable and IndependentVariables are set.
This property is case sensitive.
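To show how a formula relates to the DependentVariable, IndependentVariables and OmitModelConstant properties, the sketch below assembles an equivalent formula string. The helper is hypothetical and is not part of the node. The "- 1" term is the usual R formula convention for dropping the intercept; it is an assumption that the node builds its formula in this way.

```python
def build_formula(dependent, independents, omit_constant=False):
    """Build an R-style model formula from variable names.

    Names are used exactly as given, because field names are case sensitive.
    """
    dependent = dependent.strip()
    terms = [name.strip() for name in independents if name.strip()]
    if not dependent:
        raise ValueError("a dependent variable is required")
    if not terms:
        raise ValueError("at least one independent variable is required")
    right_hand_side = " + ".join(terms)
    if omit_constant:
        right_hand_side += " - 1"  # R convention for omitting the intercept
    return f"{dependent} ~ {right_hand_side}"

formula = build_formula("dependent", ["predictor1", "predictor2", "predictor3"])
no_constant = build_formula("dependent", ["predictor1"], omit_constant=True)
```

Here `formula` reproduces the example given for this property, and `no_constant` shows the same idea with the model constant omitted.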
DependentVariable
Specify the variable which is dependent on the independent variables.
Only one dependent variable can be input.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
A value is required if a model formula is not specified.
This property is case sensitive.
IndependentVariables
Optionally specify the independent variables (i.e. the predictors) to be used in the model. Enter a comma-separated list of the fields containing the independent variables.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
If the ModelFormula property is not specified, at least one independent variable must be specified.
If the ModelFormula property is specified, this property should not be set.
This property is case sensitive.
WeightVariable
Optionally specify the variable used to estimate a weighted least squares model.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
This property is case sensitive.
QuantileLevels
Optionally specify the quantile levels to be used in the quantile regression analysis.
Enter a comma-separated list of one or more quantile levels, each with a value between 0 and 1.
The default value is 0.5, corresponding to the 50th percentile (the median).
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
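A value for this property could be parsed and checked along the following lines. This is a sketch with a hypothetical helper name. It assumes that the bounds 0 and 1 are themselves not valid quantile levels, which the description above does not state explicitly.

```python
def parse_quantile_levels(text=None):
    """Parse a comma-separated list of quantile levels, defaulting to [0.5]."""
    if text is None or not text.strip():
        return [0.5]  # documented default: the median
    levels = []
    for token in text.split(","):
        value = float(token.strip())  # raises ValueError for non-numeric tokens
        if not 0.0 < value < 1.0:     # exclusive bounds are an assumption
            raise ValueError(f"quantile level out of range: {value}")
        levels.append(value)
    return levels

levels = parse_quantile_levels("0.1, 0.5,0.9")
default_levels = parse_quantile_levels()
```

With the input "0.1, 0.5,0.9" the sketch yields three levels, so the Summary and Coefficients pins would report results for each of the three quantile levels.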
OmitModelConstant
Specify whether a model constant is to be excluded from the model.
If the ModelFormula property is specified, this property should not be set.
The default value is False.
ModelOutputMode
Specify whether the serialized model is written to a file on disk.
This property also determines how ModelOutputField and ModelOutputDirectory behave.
The default value is None.
ModelOutputField
Optionally specify the name of the output field that contains the full path of the file where the serialized model has been written.
The default value is "qm_ModelOutput".
ModelOutputDirectory
Specify the directory where the serialized model is written when ModelOutputMode is set to File.
When ModelOutputDirectory is blank, files are written to the Data360 Analyze temporary directory. Otherwise, the files are written to the specified directory - the specified directory must exist and be writeable.
This node will not overwrite existing files by default. This behavior can be set in the ExceptionBehavior tab.
OutputTestStatistics
Optionally specify whether hypothesis statistics are to be output for the quantile coefficients.
If True, the node outputs the coefficient value and the following attributes for each coefficient at the specified quantiles instead of the upper and lower confidence limits:
- Standard Error
- t-value
- p-value
If False, the node outputs the coefficient value and the upper and lower confidence limits for each coefficient at the specified quantiles.
Note: When this property is set to False, the 'rank' method is used to compute the confidence intervals. Computation time increases rapidly for the rank method when the number of records exceeds 1000.
When this property is set to True, the VariableDistribution property provides additional options. See the VariableDistribution property help for further information.
The default value is False.
EstimationMethod
Optionally specify the estimation method to use for the quantile regression analysis. Choose from:
- Barrodale and Roberts - Uses a modified version of the Barrodale and Roberts algorithm (the Koenker and d'Orey simplex method). This is efficient for moderate data sizes (up to 5000 records).
- Frisch-Newton interior point - Uses the Portnoy and Koenker Interior Point method. This is computationally efficient for large data sizes.
- Frisch-Newton after Preprocessing - Uses the Portnoy and Koenker Interior Point method with preprocessing. This is suitable for very large data sizes (e.g. greater than 10^5 records).
The default value is Barrodale and Roberts.
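The size guidance above can be read as a rule of thumb for choosing a method from the number of input records. The sketch below is only an illustration of that guidance: the thresholds are taken from the descriptions above and are not hard limits, the helper name is hypothetical, and the short codes "br", "fn" and "pfn" are the method names used by the rq function in R's quantreg package. It is an assumption that the three options correspond to those codes.

```python
def suggest_estimation_method(record_count):
    """Suggest a method code from the number of records (rule of thumb only)."""
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    if record_count <= 5_000:
        return "br"   # simplex method, efficient for moderate data sizes
    if record_count <= 100_000:
        return "fn"   # Frisch-Newton interior point, for large data sizes
    return "pfn"      # Frisch-Newton after preprocessing, for very large data

small = suggest_estimation_method(1_000)
large = suggest_estimation_method(50_000)
very_large = suggest_estimation_method(1_000_000)
```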
VariableDistribution
Optionally specify whether the variables are assumed to be 'independent and identically distributed' (iid). This affects the standard errors that are computed. Choose from:
- nid - Variables are not assumed to be 'iid', i.e. they have different probability distributions or are not independent. Computes a Huber sandwich estimate using a local estimate of the sparsity.
- iid - Variables are assumed to be 'iid', i.e. each random variable has the same probability distribution as the others and all are mutually independent. Computes an estimate of the asymptotic covariance matrix as in Koenker and Bassett (1978).
The default value is nid.
This property should only be specified when the OutputTestStatistics property is set to True.
ExcludeNullValues
Optionally specify whether records containing "NULL" values are to be excluded.
If set to True, all records from the input data set that contain a "NULL" value are excluded.
The default value is False.
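The effect of this property can be sketched as a filter applied before the model is fitted. The helper name and the dictionary-based record layout are hypothetical. The error raised in the default case reflects the behavior described for missing values elsewhere in this help, not the node's actual error message.

```python
def prepare_records(records, exclude_null_values=False):
    """Return the records to be modeled, handling NULL (None) values."""
    complete = [
        record for record in records
        if all(value is not None for value in record.values())
    ]
    if len(complete) != len(records) and not exclude_null_values:
        # Default behavior: NULL values are not excluded and cause an error.
        raise ValueError("input data contains NULL values")
    return complete

data = [
    {"y": 2.0, "x": 1.0},
    {"y": None, "x": 2.0},
    {"y": 7.0, "x": 3.0},
]
kept = prepare_records(data, exclude_null_values=True)
```

With ExcludeNullValues set to True, the second record is dropped and the remaining two are used; with the default of False, the same input results in an error.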
FileExistsBehavior
Specify whether an existing serialized model file will be overwritten. Choose from:
- Error - Generate an error and do not overwrite the file.
- Log - Log a warning message and do not overwrite the file.
- Ignore - Do not overwrite the file.
- Overwrite - Overwrite the file.
The default value is Error.
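The four options can be sketched as follows. This is an illustration of the documented choices, not the node's implementation: the function name, the return value and the file name used in the example are hypothetical.

```python
import logging
import os
import tempfile
from enum import Enum

class FileExistsBehavior(Enum):
    ERROR = "Error"
    LOG = "Log"
    IGNORE = "Ignore"
    OVERWRITE = "Overwrite"

def write_model(path, payload, behavior=FileExistsBehavior.ERROR):
    """Write the serialized model bytes; return True if the file was written."""
    if os.path.exists(path):
        if behavior is FileExistsBehavior.ERROR:
            raise FileExistsError(path)
        if behavior is FileExistsBehavior.LOG:
            logging.warning("Model file %s exists; not overwriting", path)
            return False
        if behavior is FileExistsBehavior.IGNORE:
            return False
        # OVERWRITE falls through to the write below.
    with open(path, "wb") as handle:
        handle.write(payload)
    return True

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "QuantileReg.model")  # hypothetical file name
    first = write_model(target, b"v1")
    second = write_model(target, b"v2", FileExistsBehavior.IGNORE)
    third = write_model(target, b"v3", FileExistsBehavior.OVERWRITE)
    with open(target, "rb") as handle:
        final_bytes = handle.read()
```

In the example, the first write succeeds because no file exists yet, the second leaves the existing file untouched, and the third replaces its contents.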
Inputs and outputs
Inputs: data, 1 optional.
Outputs: Summary, Residuals, Coefficients.