Classifies data into a specified number of clusters.
The data are partitioned into k groups such that the sum of squared distances from each point to its assigned cluster center (the "within cluster sum of squares") is minimized. The node uses the Hartigan & Wong algorithm to partition the observations.
This node uses the embedded R engine to cluster the input data with an unsupervised, iterative refinement algorithm that assigns each observation to a cluster. The algorithm randomly generates k initial cluster centers, which are used to start the cluster assignment and refinement process. Because the result of a single run depends on its starting configuration, the node runs the algorithm multiple times with different starting configurations to reduce cluster assignment errors. You can optionally specify the number of starting configurations to be used. Within each run, the algorithm iteratively refines the cluster assignment until an iteration makes no changes to the assignment or the maximum number of iterations is reached. You can optionally specify the maximum number of iterations to be performed.
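The run-and-restart procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the node's implementation: it uses the simpler Lloyd-style update (recompute means, then reassign) in place of the Hartigan & Wong algorithm, and it assumes that the run with the lowest within cluster sum of squares is the one kept. The function names and the seed parameter are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Return the index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def run_once(points, k, max_iterations, rng):
    """One run: pick k random observations as centers, then refine."""
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(max_iterations):
        # Move each center to the mean of its members (unchanged if empty).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = assign(points, centers)
        if new_labels == labels:  # no assignment changed: stop early
            break
        labels = new_labels
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss

def kmeans(points, k, max_iterations=10, start_configurations=10, seed=0):
    """Repeat run_once and keep the run with the lowest sum of squares (assumed rule)."""
    rng = random.Random(seed)
    runs = [run_once(points, k, max_iterations, rng) for _ in range(start_configurations)]
    return min(runs, key=lambda run: run[2])
```

The two default arguments mirror the documented defaults of MaxIterations and StartConfigurations. Raising either trades run time for a lower chance of stopping at a poor assignment.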
The node considers the user-specified variables when performing the cluster analysis. The variables must have a numeric data type. The data are partitioned into the specified number of clusters. The permitted number of clusters ranges from 2 to N-1, where N is the number of observations (records) in the input data.
The K-Means algorithm is sensitive to differences in scale between the selected variables: a variable with a large numeric range dominates the distance calculation. To minimize the effects of scale on the cluster assignment, you can optionally specify that the data are standardized using "z-score" standardization.
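As an illustration of z-score standardization, the sketch below rescales a single variable so that it has a mean of 0 and a standard deviation of 1. It assumes the sample standard deviation (n - 1 denominator); the documentation does not state which form the node uses. The function name is hypothetical.

```python
import statistics

def standardize(column):
    """Return the z-scores of a list of numeric values."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation: an assumption
    if sd == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(value - mean) / sd for value in column]

# Two variables on very different scales become directly comparable.
ages = standardize([20, 40, 60])
incomes = standardize([20000.0, 40000.0, 60000.0])
```

After the transformation both variables contribute equally to the distance calculation, because each is expressed in units of its own standard deviation.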
The node outputs both a summary of the clustering solution and the detailed cluster assignment for each observation.
The solutionSummary pin provides a summary that indicates the number of observations assigned to each cluster, the "co-ordinates" of the mean of each derived cluster (the "centroids"), the "within cluster sum of squares" statistics, and the ratio of the "between sum of squares" to the "within cluster sum of squares".
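The summary statistics can be reproduced from the points and their assigned cluster numbers. The sketch below is one reading of the description above, not the node's code: it takes the between sum of squares as the total sum of squares minus the combined within cluster sum of squares, and divides it by that combined within value. The function name and the dictionary keys are hypothetical.

```python
from collections import defaultdict

def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def sum_of_squares(points, center):
    """Sum of squared distances from each point to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def solution_summary(points, labels):
    """Sizes, centroids and sum-of-squares statistics for a cluster assignment."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    centroids = {label: mean_point(members) for label, members in groups.items()}
    within = {label: sum_of_squares(members, centroids[label])
              for label, members in groups.items()}
    total = sum_of_squares(points, mean_point(points))
    total_within = sum(within.values())
    between = total - total_within
    return {
        "sizes": {label: len(members) for label, members in groups.items()},
        "centroids": centroids,
        "within_ss": within,
        "between_ss": between,
        # Undefined when every point coincides with its centroid.
        "ratio": between / total_within if total_within else None,
    }
```

A larger ratio indicates clusters that are compact relative to the distance between them.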
The clusterData pin provides the details of the cluster assignment. For each observation, it contains the "co-ordinate" values of the variables, the assigned cluster number, a "Type" field set to "CLUSTER", and a "set" field set to "CLUSTERx", where "x" is the assigned cluster number.
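To make the record layout concrete, the sketch below builds rows shaped like the description above. Only the "Type" and "set" field names come from this page; the "Cluster" field name for the assigned cluster number and the function name are hypothetical.

```python
def cluster_data_rows(variable_names, points, labels):
    """Build one output row per observation."""
    rows = []
    for point, label in zip(points, labels):
        row = dict(zip(variable_names, point))  # the variable values
        row["Cluster"] = label                  # hypothetical field name
        row["Type"] = "CLUSTER"
        row["set"] = f"CLUSTER{label}"
        rows.append(row)
    return rows

rows = cluster_data_rows(["x", "y"], [(1.0, 2.0), (9.0, 8.0)], [1, 2])
```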
Powered by TIBCO®
Properties
ModelName
Optionally specify the name of a model which is displayed on the output data.
A model name must start with a letter and may contain any of the following:
- letters
- numbers
- period character (".")
- underscore ("_")
If not specified, a default model name "KMeans" is displayed on the output data.
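The naming rule can be expressed as a regular expression. The sketch below assumes that "letters" means the ASCII letters A to Z in either case, which this page does not confirm, and the function name is hypothetical.

```python
import re

# A letter first, then any mix of letters, digits, periods and underscores.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")

def resolve_model_name(name=None):
    """Return the name to display, falling back to the documented default."""
    if not name:
        return "KMeans"
    if not MODEL_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid model name: {name!r}")
    return name
```

For example, "Customer_Segments.v2" is accepted, while "2ndModel" is rejected because it starts with a digit.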
Variables
Optionally specify the variables to be used, separated by commas.
By default, all the fields from the input data are used.
ClusterCount
Specify the number of clusters.
The number must be in the range from 2 to N-1, where N is the number of observations (input data records).
A value is required for this property.
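The permitted range can be checked before running the analysis. This is a minimal sketch of the rule stated above; the function name and the error messages are hypothetical.

```python
def validate_cluster_count(cluster_count, observation_count):
    """Raise ValueError unless 2 <= cluster_count <= observation_count - 1."""
    if cluster_count is None:
        raise ValueError("ClusterCount is required")
    upper = observation_count - 1
    if not 2 <= cluster_count <= upper:
        raise ValueError(
            f"ClusterCount must be between 2 and {upper}, got {cluster_count}"
        )
    return cluster_count
```

One consequence of the rule is that the input data must contain at least 3 observations, since that is the smallest N for which the range from 2 to N-1 is not empty.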
MaxIterations
Optionally specify the maximum number of iterations that the algorithm performs in each run when assigning observations to clusters. The value must be a positive integer.
The default value is 10.
StartConfigurations
Optionally specify the number of random sets of observations that are used as initial cluster centers when performing the cluster analysis. The algorithm is run once for each set. The value must be a positive integer.
The default value is 10.
TransformData
Optionally specify whether the fields in the input data are transformed before clustering. Choose from:
- None - Do not transform the data.
- Standardize - Apply a z-score standardization to each field in the input data.
The default value is None.
Inputs and outputs
Inputs: data.
Outputs: solutionSummary, clusterData.