K-Means Advisor (Superseded) - Data360_Analyze - 3 - 3.12

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: 3.12
Language: English
Product name: Data360 Analyze
Title: Data360 Analyze Server Help
Copyright: 2023
First publish date: 2016

Recommends the best number of clusters to use when performing a K-Means analysis of the data.

Note: This node has been superseded by the K-Means Advisor node, which provides similar functionality. The K-Means Advisor (Superseded) node is provided for backwards compatibility; where possible, use the new K-Means Advisor node instead.
Tip: Before working with this node, complete the prerequisite steps described in Working with the Statistical and Predictive Analytics nodes.
Note: An additional Statistical and Predictive Analytics node pack license is required to run this node. See Applying a node pack license. This node processes data in-memory. Additional RAM will be required when processing data sets with a large volume of data.

This node uses the embedded R engine to evaluate the input data against a range of criteria for determining the number of clusters. It varies the number of clusters, compares the results obtained, and proposes the best clustering scheme. The input variables must have a numeric data type.

You can specify the minimum and maximum number of clusters to be proposed by the node. The minimum number of clusters must be at least 2, which is also the default value. The maximum number of clusters must be at least two more than the minimum number of clusters (i.e. at least 4) and at most N-1, where N is the number of observations (records) in the input data. If the maximum number of clusters is not specified, the default maximum of 15 is used.
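The limits on the minimum and maximum number of clusters can be checked before the node is run. The following is a minimal Python sketch of those rules only; the function name `validate_cluster_range` and its message strings are hypothetical and are not part of the product, and how the node itself reports an invalid range is not covered here.

```python
def validate_cluster_range(n_records, min_clusters=2, max_clusters=15):
    """Check a cluster range against the documented limits.

    Hypothetical helper: returns a list of problems, which is empty
    when the range satisfies all three rules.
    """
    problems = []
    # The minimum number of clusters must be at least 2.
    if min_clusters < 2:
        problems.append("MinClusters must be at least 2")
    # The maximum must be at least two more than the minimum.
    if max_clusters < min_clusters + 2:
        problems.append("MaxClusters must be at least MinClusters + 2")
    # The maximum must be at most N - 1, where N is the number of records.
    if max_clusters > n_records - 1:
        problems.append("MaxClusters must be at most N - 1")
    return problems
```

For example, under these rules the default range of 2 to 15 would not fit an input of only 10 records, because 15 is greater than N-1 = 9.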

Depending on the input data and the configuration of the node, one or more clusters may have no observations assigned to them when the node is run, which results in an error. If this happens, change the maximum number of clusters and/or the minimum number of clusters before re-running the node.

The K-Means algorithm is sensitive to differences in scale between the selected variables. To minimize the effects of scale on the cluster assignment, you can optionally standardize the data using "z-score" standardization (see the TransformData property).
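To illustrate why standardization removes the effect of scale, the sketch below applies a z-score, (value - mean) / standard deviation, to two made-up fields that differ only in their units. It is written in Python for illustration and is not the node's implementation; the use of the sample standard deviation is an assumption, as the exact convention used by the node is not stated here.

```python
from statistics import mean, stdev


def z_score(values):
    """Return the z-scores of a list of numbers.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a field with no variation")
    return [(v - m) / s for v in values]


# Made-up example fields on very different scales.
income = [20000.0, 40000.0, 60000.0]
age = [20.0, 40.0, 60.0]

income_z = z_score(income)
age_z = z_score(age)
```

After standardization both fields have a mean of 0 and a standard deviation of 1, so neither dominates the distance calculations simply because it has larger raw values.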

When run, the node outputs information on the best number of clusters to use.

The clusterData pin provides the number of "votes" that each candidate number of clusters received from the assessment criteria.

The chart pin provides a histogram of the number of votes for each number of clusters.
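The voting idea can be sketched as follows: each assessment criterion proposes a number of clusters, the proposals are counted, and the number with the most votes is recommended. This Python sketch is illustrative only. The criterion names and values are made up, the function names are hypothetical, and the tie-breaking rule (prefer the smaller number of clusters) is an assumption rather than documented behavior.

```python
from collections import Counter


def tally_votes(proposals):
    """Count how many criteria proposed each number of clusters."""
    return Counter(proposals.values())


def best_cluster_count(votes):
    """Pick the number of clusters with the most votes.

    Assumption: ties go to the smaller number of clusters.
    """
    return min(votes, key=lambda k: (-votes[k], k))


# Made-up proposals from five hypothetical criteria.
proposals = {
    "criterion_a": 3,
    "criterion_b": 3,
    "criterion_c": 4,
    "criterion_d": 2,
    "criterion_e": 3,
}

votes = tally_votes(proposals)
best = best_cluster_count(votes)

# Text rendering of the vote histogram, one row per number of clusters.
for k in sorted(votes):
    print(k, "#" * votes[k])
```

Here `votes` plays the role of the data on the clusterData pin, and the printed rows correspond to the bars of the histogram on the chart pin.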

Assessing a large number of clusters may increase the processing time.

Powered by TIBCO®

Properties

MinClusters

Optionally specify the minimum number of clusters to be assessed when determining the best number of clusters.

The value of this property must be at least 2 and at most (MaxClusters - 2).

The default value is 2.

MaxClusters

Optionally specify the maximum number of clusters to be assessed when determining the best number of clusters.

The value of this property must be at least (MinClusters + 2) and at most (N-1), where N is the number of data records.

The default value is 15.

Variables

Optionally specify the variables to be used, separated by commas.

By default, all the fields from the input data are used.

TransformData

Optionally specify whether the fields in the input data are to be transformed. Choose from:

  • None - Do not transform the data.
  • Standardize - Apply a z-score standardization to each field in the input data.

The default value is None.

Inputs and outputs

Inputs: data.

Outputs: clusterData, chart.