Calculates statistics for a multi-dimensional dataset, which may help a user to identify interesting patterns or anomalies in data which suggest areas of further analysis.
Terminology
Table 1. The following input dataset is used as an example to clarify terms used in this node.
Id Color Size Make Location Price
----------------------------------------------------------------
1 Red Large Chrysler East 26000
2 Red Large Chrysler West 46000
3 Orange Medium Chrysler Central 10000
4 Silver Compact BMW West 25000
5 Red Compact BMW South 65000
- Dimension - a non-numeric field in the input dataset.Example: Color, Size, Make and Location in TABLE 1 are dimensions that can be chosen for analysis.
- Dimension Pair - a unique pair of dimensions.Example: [Color,Size], [Color,Make] and [Make,Location].
- Dimensional Slice - a Dimensional Slice for a Dimension Pair is all records from the input dataset having the same dimension 1 value and dimension 2 value in that Dimension Pair.
Example: for Dimension Pair [Color, Size], there are four Dimensional Slices as shown in Table 2 below:
Table 2. Dimensional Slice for [Color, Size].
Dimensional Slice row ids from Table 1.
----------------------------------------------------
[Color:Red,Size:Large] 1, 2
[Color:Orange,Size:Medium] 3
[Color:Silver,Size:Compact] 4
[Color:Red,Size:Compact] 5
- Intersection Id - a unique key for a Dimensional Slice, consisting of four fields: dimension 1 name, dimension 1 value, dimension 2 name, and dimension 2 value. Example: Slice [Color:Red,Size:Large] in Table 2.
- Occurrence Count - the number of records in the input dataset for a Dimensional Slice. Example: the Occurrence Count for [Color:Red, Size:Large] is 2, and the Occurrence Count for [Color:Red,Size:Compact] is 1.
- Occurrence Mean - an Occurrence Mean is calculated for the set of records having the same values of dimension 1 name, dimension 1 value and dimension 2 name.
And therefore an Occurrence Mean can be identified by the key: [dim1_name:dim1_value,dim2_name].If the N records below are all the records in the input dataset that have dim1_name for DIM1_NAME, dim1_value for DIM1_VALUE, and dim2_name for DIM2_NAME:
the Occurrence Mean, identified by key [dim1_name:dim1_value,dim2_name], is the average of the Occurrence Counts: count1, count2, ..., countN, of these N records.Example: there are 3 unique colors for Dimension Pair [Color, Size]: Red, Orange and Silver. Three Occurrence Means are calculated, identified by keys [Color:Red,Size], [Color:Orange,Size], and [Color:Silver,Size]. The Occurrence Mean for key [Color:Red,Size] is the average of Occurrence Counts for [Color:Red, Size:Large] and [Color:Red,Size:Compact].DIM1_NAME DIM1_VALUE DIM2_NAME DIM2_VALUE OCCURRENCE_COUNT ------------------------------------------------------------------------------------------------------- dim1_name dim1_value dim2_name dim2_value1 count1 dim1_name dim1_value dim2_name dim2_value2 count2 ... dim1_name dim1_value dim2_name dim2_valueN countN
- Sample Standard Deviation - a statistical measure of spread or variability. The standard deviation is the root mean square deviation of the values from their arithmetic mean.Note that in this node, if there is only one record, the sample standard deviation is zero.
- Occurrence STD - an Occurrence STD (a sample standard deviation) is calculated for the set of records having the same values of dimension1 name, dimension1 value and dimension2 name.
And therefore, an Occurrence STD can be identified by the key: [dim1_name:dim1_value,dim2_name].If the N records below are all the records in the input dataset that have dim1_name for DIM1_NAME, dim1_value for DIM1_VALUE, and dim2_name for DIM2_NAME:
the Occurrence STD, identified by key [dim1_name:dim1_value,dim2_name], is the sample standard deviation of the Occurrence Counts: count1, count2, ..., countN, of these N records.Example: there are 3 unique colors for Dimension Pair [Color, Size]: Red, Orange and Silver. Three Occurrence STDs are calculated, identified by keys [Color:Red,Size], [Color:Orange,Size], and [Color:Silver,Size].DIM1_NAME DIM1_VALUE DIM2_NAME DIM2_VALUE OCCURRENCE_COUNT ------------------------------------------------------------------------------------------------------- dim1_name dim1_value dim2_name dim2_value1 count1 dim1_name dim1_value dim2_name dim2_value2 count2 ... dim1_name dim1_value dim2_name dim2_valueN countN
- Tally - a Tally is a numeric field in the input dataset.Example: in Table 1, Price is a tally.
- Tally Total - the sum of the Tallys in a Dimensional Slice.Example: the total for the tally Price in slice [Color:Red,Size:Large] is 72000.
- Tally Mean - a Tally Mean is similar to an Occurrence Mean, using Tally Total instead of Occurrence Count in the calculation.
- Tally STD - a Tally STD is similar to an Occurrence STD, using Tally Total instead of Occurrence Count, and Tally Mean instead of Occurrence Mean in the calculation.
- Occurrence NSTD - the number of sample standard deviations the Occurrence Count of a slice [dim1_name:dim1_value,dim2_name:dim2_value] is from the Occurrence Mean, identified by key [dim1_name:dim1_value,dim2_name].
- Tally NSTD - the number of sample standard deviations the Tally Total of a slice [dim1_name:dim1_value,dim2_name:dim2_value] is from the Tally Mean, identified by key [dim1_name:dim1_value,dim2_name].
Dimensional Analysis Node Input and Output
Input Multi-dimensional dataset which satisfies the criteria for the dataset as defined in the Dimensional Aggregator node help.
Output
Slice Statistics - statistics for each Dimensional Slice. The following output fields are the same as those defined in the output Intersections of the Dimensional Aggregator node (refer to the Dimensional Aggregator node help for details):
- INTERSECTION_ID: a unique key of a Dimensional Slice as defined above. It has a format of [DIM1_NAME:DIM1_VALUE,DIM2_NAME:DIM2_VALUE].
- OCCURRENCE_COUNT: Occurrence Count of the slice
- DIM1_NAME: dimension 1 name of the slice
- DIM1_VALUE: dimension 1 value of the slice
- DIM2_NAME: dimension 2 name of the slice
- DIM2_VALUE: dimension 2 value of the slice
- tally_TOTAL (for each tally in the property TallyFieldList): tally Total of the slice
tally1_TOTAL
tally2_TOTAL
...
tallyn_TOTAL
- OCCURRENCE_NSTD: the number of sample standard deviations this slice's Occurrence Count is away from the Occurrence Mean, identified by key [DIM1_NAME:DIM1_VALUE,DIM2_NAME]
- tally_NSTD (for each tally in the property TallyFieldList): the number of sample standard deviations its total, tally_Total, is away from the tally Mean identified by key [DIM1_NAME:DIM1_VALUE,DIM2_NAME]
tally1_NSTD
tally2_NSTD
....
tallyn_NSTD
- DIM1_NAME: name of dimension 1
- DIM2_NAME: name of dimension 2
- DIM1_VALUE: value of dimension 1
- SAMPLE_SIZE: the number of unique dimension 2 values in SetOfRecords
- OCCURRENCE_MEAN: the average of the Occurrence Counts in SetOfRecords
- OCCURRENCE_STD: the sample standard deviation of the Occurrence Counts in SetOfRecords
- tally_MEAN (for each tally in the property TallyFieldList): the average of tally Totals in SetOfRecords
- tally_STD (for each tally in the property TallyFieldList): the sample standard deviation of tally Totals in SetOfRecords
tally1_MEAN
tally1_STD
tally2_MEAN
tally2_STD
.....
tallyn_MEAN
tallyn_STD
- OCCURRENCE_MODE: the INTERSECTION_ID of the slice that has the highest Occurrence Count in SetOfRecords. e.g. [Color:Blue,Size:Large]
- OCCURRENCE_HIGHEST: the highest Occurrence Count in SetOfRecords
- OCCURRENCE_LOWEST: the lowest Occurrence Count in SetOfRecords
- tally_MODE (for each tally in the property TallyFieldList): the INTERSECTION_ID of the slice that has the highest tally Total in SetOfRecords
- tally_HIGHEST (for each tally in the property TallyFieldList): the highest tally Total in SetOfRecords
- tally_LOWEST (for each tally in the property TallyFieldList): the lowest tally Total in SetOfRecords
tally1_MODE
tally1_HIGHEST
tally1_LOWEST
tally2_MODE
tally2_HIGHEST
tally2_LOWEST
.....
tallyn_MODE
tallyn_HIGHEST
tallyn_LOWEST
Properties
DimensionFieldList
Specify a list of at least two fields from the input dataset, separated by commas. All the fields must be non-numeric.
Example: color, division
This list of fields will be used as the dimensions for analysis. Each value in each dimension is compared with each value along every other dimension.
A value is required for this property.
Refer to the Dimensional Aggregator node for more information.
TallyFieldList
Specify a list of comma-separated numeric fields from the input dataset.
The tally fields will be analyzed along with the dimension fields specified in the property DimensionFieldList.
Example: salesPrice, salesVolume
Refer to the Dimensional Aggregator node for more information.
Discriminator
Optionally specify a list of discriminators that can be applied to the output.
For this release, there is one choice of discriminator.
If no discriminator is selected, all statistics for all dimensional slices will be generated in the output pin 'Slice Statistics'.
If discriminator Standard Deviation - At or Above is selected, for each dimensional slice, if its <fieldname>_NSTD is at or above the threshold, a record will be generated in the output pin 'Slice Statistics'. The threshold is specified in the DiscriminatorArgument property, and the <fieldname> is specified in the DiscriminatorField property.
The output pin 'Slice Statistics' will be sorted in descending order on <fieldname>.
DiscriminatorField
This optional property depends on the Discriminator property. If the Discriminator property is not set, this field must also not be set. Otherwise, the DiscriminatorField property is set according to which discriminator is selected.
For discriminator Standard Deviation - At or Above: this property can be set to either OCCURRENCE, or one of the fields referenced in the TallyFieldList property.
The output pin 'Slice Statistics' will be sorted in descending order on this property.
Example: SalesPrice
The default value is OCCURRENCE.
DiscriminatorArgument
Optionally specify an argument to be applied to the discriminator selected in the Discriminator property.
For discriminator Standard Deviation - At or Above, this property must be zero or a positive integer.
The default value is 3.
DimensionTypeConversion
Since all of the dimensions need to be placed in the same output field, these need to be converted to a common output field type. Choose from:
- None
- To String
- To Unicode
If Unicode fields are input and this property is set to To String, these will attempt to be converted to a string. If this is not possible, then the node will fail. When this property is set to None, then the node will fail in any case when there are dimension fields that are not of the same type.
The default value is None.
OutputRawStatistics
Optionally specify if the output pin Histogram will include the mode, highest value, and lowest value for the Occurrence Count, and for all the Tally Totals specified in the property TallyFieldList.
The default value is False.
OutputAllPermutations
Optionally specify if the fields in the DimensionFieldList property will be permutated for analysis. If set to true, analysis will be done for both orderings of the dimension pairs.
Example:
If the DimensionFieldList property contains color, division. If the OutputAllPermutations property is set to false, analysis is done for the dimension pair [color, division]. Otherwise, analysis is done for both [color, division] and [division, color].
The default value is False.
Refer to the Dimensional Aggregator node for more information.
NullDimensionBehavior
Optionally specify the node behavior when a value in a dimension field is null. Choose from:
- Error - Forces the node to error with an appropriate error message
- Ignore - Means that the intersection will be ignored.
- Emit - Means that a dimension slice, or intersection, will be created when a null value is encountered. This means that null values are treated like all other values.
The default value is Emit.
Inputs and outputs
Inputs: data.
Outputs: Slice Statistics, Histogram.