Recommends the best number of clusters to use when performing a K-Means analysis of the data.
This node uses the embedded R engine to evaluate the input data against a range of criteria for determining the number of clusters, and proposes the best clustering scheme from the results obtained by varying the number of clusters. The input variables must have a numeric data type. You can specify the minimum and maximum number of clusters to be assessed by the node. The minimum number of clusters must be at least 2, which is also the default. The maximum number of clusters must be at least two more than the minimum (so at least 4) and at most N-1, where N is the number of observations (records) in the input data. If the maximum number of clusters is not specified, the default of 15 is used.
Depending on the input data and the configuration of the node, one or more clusters may have no observations assigned to them when the node is run, which results in an error. In this situation, it is recommended that you change the MaxClusters and/or MinClusters properties before re-running the node.
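The range rules above can be checked before the node is run. The following is a minimal sketch of those checks only; the function name `validate_cluster_range` and its messages are hypothetical and are not part of the node.

```python
def validate_cluster_range(n_records, min_clusters=2, max_clusters=15):
    """Return a list of problems with the requested range (empty if valid).

    Hypothetical helper: it mirrors the documented limits, not the node's
    internal validation.
    """
    problems = []
    if min_clusters < 2:
        problems.append("MinClusters must be at least 2")
    if max_clusters < min_clusters + 2:
        problems.append("MaxClusters must be at least MinClusters + 2")
    if max_clusters > n_records - 1:
        problems.append("MaxClusters must be at most N - 1")
    return problems


# With 100 records the defaults (2 and 15) satisfy every rule.
ok = validate_cluster_range(100)

# With only 10 records the default maximum of 15 exceeds N - 1 = 9.
too_few_records = validate_cluster_range(10)
```

Returning a list rather than raising on the first failure lets a caller report every violated rule at once.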
The K-Means algorithm is sensitive to differences in scale in the selected variables. To minimize the effects of scale on the cluster assignment, you can optionally specify that the data is to be standardized using "z-score" standardization.
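As an illustration of z-score standardization, the sketch below rescales a single variable so that it has a mean of 0 and a standard deviation of 1. It assumes the sample standard deviation (divisor n-1); this page does not state which divisor the node uses, and the function name `z_score` is hypothetical.

```python
from statistics import mean, stdev


def z_score(values):
    """Standardize one numeric variable: (value - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


# Two variables on very different scales end up directly comparable.
age = z_score([20.0, 30.0, 40.0])
income = z_score([20000.0, 30000.0, 40000.0])
```

After standardization both variables contribute on the same scale to the distances that K-Means computes, so the variable with the larger raw values no longer dominates.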
When run, the node outputs information on the best number of clusters to use.
The clusterData output pin provides the number of "votes" that each candidate number of clusters received from the assessment criteria.
If a large number of clusters is assessed, processing time may increase.
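The idea of criteria "voting" for a number of clusters can be sketched as a simple tally. This is an illustration only: the criterion names are placeholders, the function `best_cluster_count` is hypothetical, and breaking ties in favor of the smaller number of clusters is an assumption, not documented behavior of the node.

```python
from collections import Counter


def best_cluster_count(proposals):
    """Tally the cluster count proposed by each criterion.

    proposals maps a criterion name to the number of clusters it proposes.
    Returns (best_k, votes), where votes maps each k to its vote count.
    """
    votes = Counter(proposals.values())
    top = max(votes.values())
    # Assumption: ties are broken in favor of the smaller number of clusters.
    best_k = min(k for k, count in votes.items() if count == top)
    return best_k, dict(sorted(votes.items()))


# Placeholder criteria; the criteria the node actually uses are not listed here.
proposals = {
    "criterion_a": 3,
    "criterion_b": 3,
    "criterion_c": 4,
    "criterion_d": 2,
}
best_k, votes = best_cluster_count(proposals)
```

In this example `votes` holds one entry per proposed number of clusters, which is comparable in shape to the vote counts described for the clusterData output.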
Powered by TIBCO®
Properties
MinClusters
Optionally specify the minimum number of clusters to be assessed when determining the best number of clusters.
The value of this property must be a minimum of 2 and a maximum of (MaxClusters - 2). The default value is 2.
MaxClusters
Optionally specify the maximum number of clusters to be assessed when determining the best number of clusters.
The value of this property must be a minimum of (MinClusters + 2) and a maximum of (N-1) where N is the number of data records. The default value is 15.
Variables
Optionally specify the variables to be used, separated by commas. By default, all the fields from the input data are used.
TransformData
Optionally specify whether the fields in the input data are to be transformed. Choose from:
- None - Do not transform the data.
- Standardize - Apply a z-score standardization to each field in the input data.
The default value is None.
Inputs and outputs
Inputs: data.
Outputs: clusterData.