Dimensional Analysis - Data360_Analyze - 3.12

Data360 Analyze Server Help

Product
Data360 Analyze
Version
3.12
Language
English
Portfolio
Verify
Product family
Data360
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2023
First publish date
2016

Calculates statistics for a multi-dimensional dataset, which may help a user to identify interesting patterns or anomalies in data which suggest areas of further analysis.

Terminology

Table 1. The following input dataset is used as an example to clarify terms used in this node.

Id     Color     Size         Make         Location        Price
----------------------------------------------------------------
1     Red       Large         Chrysler     East            26000
2     Red       Large         Chrysler     West            46000
3     Orange    Medium        Chrysler     Central         10000
4     Silver    Compact       BMW          West            25000
5     Red       Compact       BMW          South           65000
  • Dimension - a non-numeric field in the input dataset.Example: Color, Size, Make and Location in TABLE 1 are dimensions that can be chosen for analysis.
  • Dimension Pair - a unique pair of dimensions.Example: [Color,Size], [Color,Make] and [Make,Location].
  • Dimensional Slice - a Dimensional Slice for a Dimension Pair is all records from the input dataset having the same dimension 1 value and dimension 2 value in that Dimension Pair.

Example: for Dimension Pair [Color, Size], there are four Dimensional Slices as shown in Table 2 below:

Table 2. Dimensional Slice for [Color, Size].


Dimensional Slice                 row ids from Table 1.
----------------------------------------------------
[Color:Red,Size:Large]            1, 2
[Color:Orange,Size:Medium]        3
[Color:Silver,Size:Compact]       4
[Color:Red,Size:Compact]          5
  • Intersection Id - a unique key for a Dimensional Slice, consisting of four fields: dimension 1 name, dimension 1 value, dimension 2 name, and dimension 2 value. Example: Slice [Color:Red,Size:Large] in Table 2.
  • Occurrence Count - the number of records in the input dataset for a Dimensional Slice. Example: the Occurrence Count for [Color:Red, Size:Large] is 2, and the Occurrence Count for [Color:Red,Size:Compact] is 1.
  • Occurrence Mean - an Occurrence Mean is calculated for the set of records having the same values of dimension 1 name, dimension 1 value and dimension 2 name. And therefore an Occurrence Mean can be identified by the key: [dim1_name:dim1_value,dim2_name].If the N records below are all the records in the input dataset that have dim1_name for DIM1_NAME, dim1_value for DIM1_VALUE, and dim2_name for DIM2_NAME:
    
    DIM1_NAME       DIM1_VALUE       DIM2_NAME       DIM2_VALUE       OCCURRENCE_COUNT
    -------------------------------------------------------------------------------------------------------
    dim1_name       dim1_value       dim2_name       dim2_value1       count1
    dim1_name       dim1_value       dim2_name       dim2_value2       count2
    ...
    dim1_name       dim1_value       dim2_name       dim2_valueN       countN
    
    the Occurrence Mean, identified by key [dim1_name:dim1_value,dim2_name], is the average of the Occurrence Counts: count1, count2, ..., countN, of these N records.Example: there are 3 unique colors for Dimension Pair [Color, Size]: Red, Orange and Silver. Three Occurrence Means are calculated, identified by keys [Color:Red,Size], [Color:Orange,Size], and [Color:Silver,Size]. The Occurrence Mean for key [Color:Red,Size] is the average of Occurrence Counts for [Color:Red, Size:Large] and [Color:Red,Size:Compact].
  • Sample Standard Deviation - a statistical measure of spread or variability. The standard deviation is the root mean square deviation of the values from their arithmetic mean.Note that in this node, if there is only one record, the sample standard deviation is zero.
  • Occurrence STD - an Occurrence STD (a sample standard deviation) is calculated for the set of records having the same values of dimension1 name, dimension1 value and dimension2 name. And therefore, an Occurrence STD can be identified by the key: [dim1_name:dim1_value,dim2_name].If the N records below are all the records in the input dataset that have dim1_name for DIM1_NAME, dim1_value for DIM1_VALUE, and dim2_name for DIM2_NAME:
    
    DIM1_NAME       DIM1_VALUE       DIM2_NAME       DIM2_VALUE       OCCURRENCE_COUNT
    -------------------------------------------------------------------------------------------------------
    dim1_name       dim1_value       dim2_name       dim2_value1       count1
    dim1_name       dim1_value       dim2_name       dim2_value2       count2
    ...
    dim1_name       dim1_value       dim2_name       dim2_valueN       countN
    
    the Occurrence STD, identified by key [dim1_name:dim1_value,dim2_name], is the sample standard deviation of the Occurrence Counts: count1, count2, ..., countN, of these N records.Example: there are 3 unique colors for Dimension Pair [Color, Size]: Red, Orange and Silver. Three Occurrence STDs are calculated, identified by keys [Color:Red,Size], [Color:Orange,Size], and [Color:Silver,Size].
  • Tally - a Tally is a numeric field in the input dataset.Example: in Table 1, Price is a tally.
  • Tally Total - the sum of the Tallys in a Dimensional Slice.Example: the total for the tally Price in slice [Color:Red,Size:Large] is 72000.
  • Tally Mean - a Tally Mean is similar to an Occurrence Mean, using Tally Total instead of Occurrence Count in the calculation.
  • Tally STD - a Tally STD is similar to an Occurrence STD, using Tally Total instead of Occurrence Count, and Tally Mean instead of Occurrence Mean in the calculation.
  • Occurrence NSTD - the number of sample standard deviations the Occurrence Count of a slice [dim1_name:dim1_value,dim2_name:dim2_value] is from the Occurrence Mean, identified by key [dim1_name:dim1_value,dim2_name].
  • Tally NSTD - the number of sample standard deviations the Tally Total of a slice [dim1_name:dim1_value,dim2_name:dim2_value] is from the Tally Mean, identified by key [dim1_name:dim1_value,dim2_name].

Dimensional Analysis Node Input and Output

Input Multi-dimensional dataset which satisfies the criteria for the dataset as defined in the Dimensional Aggregator node help.

Output

Slice Statistics - statistics for each Dimensional Slice. The following output fields are the same as those defined in the output Intersections of the Dimensional Aggregator node (refer to the Dimensional Aggregator node help for details):

  • INTERSECTION_ID: a unique key of a Dimensional Slice as defined above. It has a format of [DIM1_NAME:DIM1_VALUE,DIM2_NAME:DIM2_VALUE].
  • OCCURRENCE_COUNT: Occurrence Count of the slice
  • DIM1_NAME: dimension 1 name of the slice
  • DIM1_VALUE: dimension 1 value of the slice
  • DIM2_NAME: dimension 2 name of the slice
  • DIM2_VALUE: dimension 2 value of the slice
  • tally_TOTAL (for each tally in the property TallyFieldList): tally Total of the slice
Example: For tallies tally1, tally2, ..., tallyn in the property TallyFieldList, the output contains the following fields:

	tally1_TOTAL
	tally2_TOTAL
	...
	tallyn_TOTAL
	
and the following additional fields containing the Dimensional Analysis node specific statistics:
  • OCCURRENCE_NSTD: the number of sample standard deviations this slice's Occurrence Count is away from the Occurrence Mean, identified by key [DIM1_NAME:DIM1_VALUE,DIM2_NAME]
  • tally_NSTD (for each tally in the property TallyFieldList): the number of sample standard deviations its total, tally_Total, is away from the tally Mean identified by key [DIM1_NAME:DIM1_VALUE,DIM2_NAME]
Example: For tallies tally1, tally2, ..., tallyn in the property TallyFieldList, the output contains the following fields:

     tally1_NSTD
     tally2_NSTD
     ....
     tallyn_NSTD
	
Slice Histogram - statistics for Occurrence Counts and Tally Totals in the output Slice Statistics, grouped by DIM1_NAME, DIM1_VALUE and DIM2_NAME. Data in this output pin are calculated using data in the output pin Slice Statistics. Let SetOfRecords be a set of records in the output pin Slice Statistics of the same group: the records have the same values for DIM1_NAME, DIM1_VALUE and DIM2_NAME, the following statistics are calculated. The output fields are:
  • DIM1_NAME: name of dimension 1
  • DIM2_NAME: name of dimension 2
  • DIM1_VALUE: value of dimension 1
(Note that the above three fields: DIM1_NAME, DIM2_NAME, and DIM1_VALUE form the key for this row.)
  • SAMPLE_SIZE: the number of unique dimension 2 values in SetOfRecords
  • OCCURRENCE_MEAN: the average of the Occurrence Counts in SetOfRecords
  • OCCURRENCE_STD: the sample standard deviation of the Occurrence Counts in SetOfRecords
  • tally_MEAN (for each tally in the property TallyFieldList): the average of tally Totals in SetOfRecords
  • tally_STD (for each tally in the property TallyFieldList): the sample standard deviation of tally Totals in SetOfRecords
Example: For tallies tally1, tally2, ..., tallyn in the property TallyFieldList:

	tally1_MEAN
	tally1_STD
	tally2_MEAN
	tally2_STD
	.....
	tallyn_MEAN
	tallyn_STD
If property OutputRawStatistics is set to true, the following additional statistics are available:
  • OCCURRENCE_MODE: the INTERSECTION_ID of the slice that has the highest Occurrence Count in SetOfRecords. e.g. [Color:Blue,Size:Large]
  • OCCURRENCE_HIGHEST: the highest Occurrence Count in SetOfRecords
  • OCCURRENCE_LOWEST: the lowest Occurrence Count in SetOfRecords
  • tally_MODE (for each tally in the property TallyFieldList): the INTERSECTION_ID of the slice that has the highest tally Total in SetOfRecords
  • tally_HIGHEST (for each tally in the property TallyFieldList): the highest tally Total in SetOfRecords
  • tally_LOWEST (for each tally in the property TallyFieldList): the lowest tally Total in SetOfRecords
Example: For tallies tally1, tally2, ..., tallyn in the property TallyFieldList:

	tally1_MODE
	tally1_HIGHEST
	tally1_LOWEST
	tally2_MODE
	tally2_HIGHEST
	tally2_LOWEST
	.....
	tallyn_MODE
	tallyn_HIGHEST
	tallyn_LOWEST	

Properties

DimensionFieldList

Specify a list of at least two fields from the input dataset, separated by commas. All the fields must be non-numeric.

Example: color, division

This list of fields will be used as the dimensions for analysis. Each value in each dimension is compared with each value along every other dimension.

A value is required for this property.

Refer to the Dimensional Aggregator node for more information.

TallyFieldList

Specify a list of comma-separated numeric fields from the input dataset.

The tally fields will be analyzed along with the dimension fields specified in the property DimensionFieldList.

Example: salesPrice, salesVolume

Refer to the Dimensional Aggregator node for more information.

Discriminator

Optionally specify a list of discriminators that can be applied to the output.

For this release, there is one choice of discriminator.

If no discriminator is selected, all statistics for all dimensional slices will be generated in the output pin 'Slice Statistics'.

If discriminator Standard Deviation - At or Above is selected, for each dimensional slice, if its <fieldname>_NSTD is at or above the threshold, a record will be generated in the output pin 'Slice Statistics'. The threshold is specified in the DiscriminatorArgument property, and the <fieldname> is specified in the DiscriminatorField property.

The output pin 'Slice Statistics' will be sorted in descending order on <fieldname>.

DiscriminatorField

This optional property depends on the Discriminator property. If the Discriminator property is not set, this field must also not be set. Otherwise, the DiscriminatorField property is set according to which discriminator is selected.

For discriminator Standard Deviation - At or Above: this property can be set to either OCCURRENCE, or one of the fields referenced in the TallyFieldList property.

The output pin 'Slice Statistics' will be sorted in descending order on this property.

Example: SalesPrice

The default value is OCCURRENCE.

DiscriminatorArgument

Optionally specify an argument to be applied to the discriminator selected in the Discriminator property.

For discriminator Standard Deviation - At or Above, this property must be zero or a positive integer.

The default value is 3.

DimensionTypeConversion

Since all of the dimensions need to be placed in the same output field, these need to be converted to a common output field type. Choose from:

  • None
  • To String
  • To Unicode

If Unicode fields are input and this property is set to To String, these will attempt to be converted to a string. If this is not possible, then the node will fail. When this property is set to None, then the node will fail in any case when there are dimension fields that are not of the same type.

The default value is None.

OutputRawStatistics

Optionally specify if the output pin Histogram will include the mode, highest value, and lowest value for the Occurrence Count, and for all the Tally Totals specified in the property TallyFieldList.

The default value is False.

OutputAllPermutations

Optionally specify if the fields in the DimensionFieldList property will be permutated for analysis. If set to true, analysis will be done for both orderings of the dimension pairs.

Example:

If the DimensionFieldList property contains color, division. If the OutputAllPermutations property is set to false, analysis is done for the dimension pair [color, division]. Otherwise, analysis is done for both [color, division] and [division, color].

The default value is False.

Refer to the Dimensional Aggregator node for more information.

NullDimensionBehavior

Optionally specify the node behavior when a value in a dimension field is null. Choose from:

  • Error - Forces the node to error with an appropriate error message
  • Ignore - Means that the intersection will be ignored.
  • Emit - Means that a dimension slice, or intersection, will be created when a null value is encountered. This means that null values are treated like all other values.

The default value is Emit.

Inputs and outputs

Inputs: data.

Outputs: Slice Statistics, Histogram.