Data Profiler - Data360_Analyze - 3.14

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
3.14
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016

Examines input data to determine its data type and statistical composition, and outputs a detailed JSON description that can be used for further analysis. The description includes details of the data such as its current and new data type, minimum and maximum values, the number of values that match the analysis and a confidence measure for the data field, and counts of null or blank fields.

For more information on data types, see Data types.

The output analysis from the node is provided in two formats:

  • The analysis output pin provides data profile information which you can use to analyze the profile metrics of the data set, and consists of a row for each field that the node has profiled, with the properties of each field contained in separate columns.
  • The detail output pin provides a JSON description containing the field properties that you can use for further analysis, and consists of a row for each field that the node has profiled.

The following metrics are detected:

Metric Description
Sample Count The number of records observed.
Match Count The number of records that match the detected Base (or Semantic) type.
Null Count The number of null samples.
Blank Count The number of blank samples.
regExp A Regular Expression (Java) that matches the detected Type.
Confidence The confidence in the determination of the Type.
Type The Base Type (one of Boolean, Double, Long, String, LocalDate, LocalTime, LocalDateTime, OffsetDateTime, ZoneDateTime).
Type Qualifier A modifier with regards to the Base type, for example, for Date types it will be a pattern, for Long types it might be SIGNED, for String types it might be COUNTRY.ISO-3166-2.
Min Value The minimum value observed.
Max Value The maximum value observed.
bottomK The lowest ten values.
topK The highest ten values.
Min Length The minimum length observed.
Max Length The maximum length observed.
Mean / Standard Deviation (numeric types only) The mean and standard deviation (uses Welford's algorithm).
Cardinality The cardinality of the valid set (or Maximum Cardinality if the set is larger than Maximum Cardinality).
Outlier Cardinality The cardinality of the invalid set (or Maximum Outlier Cardinality if the set is larger than Maximum Outlier Cardinality).
Leading White Space Does the observed set have leading white space? (True/False)
Trailing White Space Does the observed set have trailing white space? (True/False)
Multiline Does the observed set have leading multiline elements? (True/False)
Logical Type Does the observed set reflect a Semantic Type? (True/False)
Possible Key Does the observed set appear to be a Key field, that is, is it unique? (True/False)
Cardinality Detail Details on the valid set, including a list of elements and an occurrence count.
Outlier Detail Details on the invalid set, including a list of elements and an occurrence count.
Shape Detail Details on the shape of the observed set, including a list of elements and an occurrence count. This will collapse all numeric data types to '9' and all alphabetic data types to 'X'.
Structure Signature A SHA-1 hash that reflects the structure of the data stream.
Data Signature A SHA-1 hash that reflects the content of the data stream.

Logical Type detection

You can add your own logical types - also known as semantic types - to those detected by default by the Data Profiler node. To do this, provide a JSON specification in a file that you upload to the server, and use the path to that file as the value of the LogicalTypeDefinitions property on the node.

For example, to detect possible UKNational Insurance number values in a data field, you could use the following JSON:

When you supply input data to the Data Profiler node, types can be identified by using the JSONwhen the following criteria are met:

  • If a string field contains values that are matched by the regular expression defined in the regExpReturned property in the JSON,
  • AND, if the percentage of values for a field that are identified as the specified type is equal to or greater than the value specified by threshold. The value is given as a percentage, so in this example, a minimum of 98% of the values must be detected as a match.
  • When a match is found, the output analysis identifies the Type Qualifier for the field as the value given by the qualifier property - in this case UK.NI - and the value of the Validation column for the field is the regular expression defined in the regExp property in the JSON.

Properties

Sample Size

Optionally specify the size of the sample set of the input data to analyze.

The default is to process the entire set.

Empty String Is Null

Optionally specify whether strings that are either zero length or all whitespace are interpreted as NULL. If false, they are treated as distinct string values.

The default value is True.

AmbiguousDateResolutionMode

Choose from:

  • Day First - In case of ambiguity, assume that the day appears before the month.
  • Month First - In case of ambiguity, assume that the month appears before the day.
  • Auto - In case of ambiguity, use the information from the locale - based on the Locale property - to determine the correct order.

As an example, consider there is an input string field, with values "1/2/2009" and "1/3/2009".

If AmbiguousDateResolutionMode is set to Day First, then the determined format will be d/M/yyyy and the values will be parsed as the first of February and March in 2009.

If AmbiguousDateResolutionMode is set to Month First, then the determined format will be M/d/yyyy and the values will be parsed as the second and third day of January in 2009.

If AmbiguousDateResolutionMode is set to Auto, then in a US English (en_US) locale, the determined format will be M/d/yyyy and the values will be parsed as the second and third days of January in 2009.

If AmbiguousDateResolutionMode is set to Auto, then in a British English (en_GB) locale, the determined format will be d/M/yyyy and the values will be parsed as the first of February and March in 2009.

The default value is Auto.

Enable Default Logical Types

Enable the default set of Logical Types, which is based on the locale

The default value is True.

Length Qualifier

Indicate whether to qualify the size of the detected Regular Expression. For example, if set to True the qualifier might be expresssed as \d{3,6}, indicating a minimum length of three digits and a maximum of 6, or if set to False, it could result be expressed as \d+, indicating one or more digits.

Note: This property impacts only simple Numeric, Alpha, and AlphaNumeric fields.

The default value is True.

Detect Window

Set the size of the Detect Window to collect before attempting to determine the type. This value is the number of samples taken in order to determine the type.

The default value is 20.

Detection Threshold

Set the percentage for the Detection Threshold - typically between 60 and 100. If set to 100, then all values must be of the appropriate type.

The default value is 95.

Maximum Cardinality

Set the maximum cardinality that will be tracked. This value represents the number of discrete values for a field that will be tracked in order to analyze a field.

The default value is 12000.

Maximum Outlier Cardinality

Set the maximum outlier cardinality that will be tracked. This value represents the number of discrete values that are outliers, and which will be tracked in order to analyze a field.

The default value is100.

JSONDetail Level

Control the level of detail related to valid, invalid, and shape sets provided in the JSON output.

  • 0 - just counts.
  • 1 - first 100 rows.
  • 2 - all rows.

The default value is 1.

Locale

Optionally specify the Locale to be used for analyzing the input stream.

Locale must be specified in the format <language code>[_<country code>[_<variant code>]]

For example, to set the locale to English, the Locale property should be set to en.

For US English, the Locale property should be set to en_US.

For Austrian German, the Locale property should be set to de_AT.

The default value is the default Locale under which the server is running.

LogicalTypeDefinitions

Optionally provide the path to a JSON specification for additional Logical Types to be detected. For details, see Logical Type detection.

Bulk Mode

If set to True then the input is assumed to be a bulk description of the data to be processed.

In this mode four fields are expected:

  • Key (unicode)
  • Name (unicode)
  • Value (unicode)
  • Count (long)

For example, the first row could be Northwind.Customers.Gender,Gender,Female,10341

The default value is False.

Key Field

Optionally select a Field from the input to be used as a Pass-through field. This provides similar functionality to the Key input in Bulk mode.

Inputs and outputs

Inputs: in1.

Outputs: analysis, detail.