K-Means Clustering

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: Latest
Language: English
Copyright: 2024
First publish date: 2016
Last updated: 2024-11-28
Published on: 2024-11-28T15:26:57.181000

Classifies data into a specified number of clusters.

Tip: Before working with this node, complete the prerequisite steps described in Working with the Statistical and Predictive Analytics nodes.
Note: An additional Statistical and Predictive Analytics node pack license is required to run this node. See Applying a node pack license.
Note: This node processes data in-memory. Additional RAM will be required when processing data sets with a large volume of data.

The data are partitioned into k groups such that the "sum of squares" distance from the points to the assigned cluster centers is minimized. The node uses the Hartigan & Wong algorithm to partition the observations.
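For reference, the "sum of squares" criterion described above is commonly written as follows. The notation is assumed here for illustration and does not come from the product: x_i is an observation, C_j is the set of observations assigned to cluster j, and mu_j is the mean (centroid) of that cluster.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```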

This node uses the embedded R engine to classify the input data using an "unsupervised" iterative refinement algorithm to identify the optimum cluster assignment for each of the observations. The algorithm randomly generates k initial cluster centers which are used for the cluster assignment and refinement process. To minimize cluster assignment errors due to the initial starting configuration, the node runs the algorithm multiple times with different starting configurations. You can optionally specify the number of starting configurations to be used. Within each run, the algorithm iteratively refines the cluster assignment until no changes in cluster assignment are made in an iteration or the maximum number of iterations is reached. You can optionally specify the maximum number of iterations to be performed.
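The multi-start refinement loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the simpler Lloyd-style update rather than the Hartigan & Wong algorithm that the node uses, and the function names, the seed parameter, and the empty-cluster handling are hypothetical choices made for this sketch, not part of the node.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def run_once(points, k, max_iterations, rng):
    """One run from one random starting configuration (Lloyd-style update)."""
    # Use k randomly chosen observations as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed in this iteration
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # sketch-only choice: keep the old center if empty
                centers[j] = mean_point(members)
    total_within = sum(
        squared_distance(p, centers[lab]) for p, lab in zip(points, labels)
    )
    return total_within, labels, centers


def kmeans(points, k, max_iterations=10, start_configurations=10, seed=0):
    """Keep the run with the smallest total within-cluster sum of squares."""
    rng = random.Random(seed)  # hypothetical seed, for repeatable runs only
    best = None
    for _ in range(start_configurations):
        result = run_once(points, k, max_iterations, rng)
        if best is None or result[0] < best[0]:
            best = result
    return best
```

The defaults of 10 for max_iterations and start_configurations mirror the documented defaults of the MaxIterations and StartConfigurations properties. Keeping the lowest-scoring run is what makes several starting configurations useful: a poor random start affects only one run.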

The node considers the user-specified variables when performing the cluster analysis. The variables must have a numeric data type. The data are partitioned into the specified number of clusters. The permitted number of clusters ranges from 2 to N-1, where N is the number of observations (records) in the input data.

The K-Means algorithm is sensitive to differences in scale in the selected variables. To minimize the effects of scale on the cluster assignment, you can optionally specify that the data are to be standardized using "z-score" standardization.
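As a rough illustration of "z-score" standardization, the sketch below rescales one variable to mean 0 and standard deviation 1. It assumes the sample standard deviation (n-1 denominator), and the zero-variance handling is a choice made for this sketch. The node's exact behavior in both respects is not stated here, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardize(column):
    """Return the z-scores of a list of numeric values."""
    m = mean(column)
    s = stdev(column)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        # Sketch-only choice: a constant column carries no information.
        return [0.0 for _ in column]
    return [(x - m) / s for x in column]


# Two variables on very different scales become directly comparable.
income = standardize([20000, 40000, 60000])
age = standardize([20, 40, 60])
```

Without this step, a variable with a large numeric range (such as income above) would dominate the distance calculation and therefore the cluster assignment.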

The node outputs both a summary of the cluster assignment and the detailed assignment of each observation.

The solutionSummary pin provides a summary that indicates the number of observations assigned to each cluster, the "co-ordinates" of the mean of each of the derived clusters (the "centroids"), and statistics for the "within cluster sum of squares" and the ratio of the "between sum of squares" and "within cluster sum of squares".
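The quantities in this summary can be reproduced from a set of points and their assigned cluster numbers. The sketch below is a minimal illustration, not the node's implementation: the function name and the dictionary keys are hypothetical, and the actual field names on the solutionSummary pin may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def summarize(points, labels):
    """Per-cluster size, centroid and within-cluster sum of squares."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    grand_mean = mean_point(points)
    summary = {}
    for lab, members in sorted(clusters.items()):
        centroid = mean_point(members)
        summary[lab] = {
            "size": len(members),
            "centroid": centroid,
            "withinss": sum(squared_distance(p, centroid) for p in members),
        }

    total_within = sum(c["withinss"] for c in summary.values())
    # Between sum of squares: size-weighted spread of centroids
    # around the mean of all observations.
    between = sum(
        c["size"] * squared_distance(c["centroid"], grand_mean)
        for c in summary.values()
    )
    return summary, total_within, between
```

For any assignment, the total within-cluster and between sums of squares add up to the total sum of squares of the data around its overall mean, so a compact, well-separated solution has a small within value and a large between value.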

The clusterData pin provides the details of the cluster assignment. For each observation, it includes the "co-ordinate" values of the variables, the assigned cluster number, the "Type", which is set to "CLUSTER", and the "set", which is set to "CLUSTERx", where "x" is the assigned cluster number.

Powered by TIBCO®

Properties

ModelName

Optionally specify the name of a model which is displayed on the output data.

A model name must start with a letter and may contain any of the following:

  • letters
  • numbers
  • period character (".")
  • underscore ("_")

If not specified, a default model name "KMeans" is displayed on the output data.
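The naming rule above can be checked before running the node, for example with a regular expression. The sketch below assumes that "letters" means the ASCII letters A-Z and a-z. That is an assumption made for this illustration, and the helper name is hypothetical.

```python
import re

# Starts with a letter, then any mix of letters, digits, "." and "_".
# Assumption: only ASCII letters count as letters.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if the whole name satisfies the naming rule."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None
```

Under these assumptions, names such as "KMeans" and "kmeans_v1.2" pass, while "1model" (starts with a digit) and "my-model" (contains a hyphen) do not.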

Variables

Optionally specify the variables to be used, separated by commas.

By default, all the fields from the input data are used.

ClusterCount

Specify the number of clusters.

The number must be in the range from 2 to N-1, where N is the number of observations (input data records).

A value is required for this property.
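The permitted range can be expressed as a simple pre-check. The sketch below is illustrative only. The function name and the error message are hypothetical and do not reflect how the node reports an invalid value.

```python
def check_cluster_count(cluster_count, n_observations):
    """Raise ValueError unless 2 <= cluster_count <= n_observations - 1."""
    if not 2 <= cluster_count <= n_observations - 1:
        raise ValueError(
            f"ClusterCount must be between 2 and {n_observations - 1} "
            f"for {n_observations} observations, got {cluster_count}"
        )
    return cluster_count
```

One consequence of the rule is that the input data must contain at least 3 observations, because with fewer records no value satisfies both bounds.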

MaxIterations

Optionally specify the maximum number of iterations of the algorithm used to assign items to clusters. The value must be a positive integer.

The default value is 10.

StartConfigurations

Optionally specify the number of random sets of initial observations that are to be used as the initial cluster centers when performing the cluster analysis. The value must be a positive integer.

The default value is 10.

TransformData

Optionally specify whether the fields in the input data are to be transformed. Choose from:

  • None - Do not transform the data.
  • Standardize - Apply a z-score standardization to each field in the input data.

The default value is None.

Inputs and outputs

Inputs: data.

Outputs: solutionSummary, clusterData.