Examines input data to determine its data type and statistical composition, and outputs a detailed JSON description that can be used for further analysis. The description includes details such as the current and newly detected data type of each field, its minimum and maximum values, the number of values that match the detected type, a confidence measure for the field, and counts of null and blank values.
For more information on data types, see Data types.
The output analysis from the node is provided in two formats:
- The analysis output pin provides data profile information that you can use to analyze the profile metrics of the data set. It consists of a row for each field that the node has profiled, with the properties of each field in separate columns.
- The detail output pin provides a JSON description containing the field properties that you can use for further analysis, and consists of a row for each field that the node has profiled.
The following metrics are detected:
Metric | Description |
---|---|
Sample Count | The number of records observed. |
Match Count | The number of records that match the detected Base (or Semantic) type. |
Null Count | The number of null samples. |
Blank Count | The number of blank samples. |
regExp | A Regular Expression (Java) that matches the detected Type. |
Confidence | The confidence in the determination of the Type. |
Type | The Base Type (one of Boolean, Double, Long, String, LocalDate, LocalTime, LocalDateTime, OffsetDateTime, ZonedDateTime). |
Type Qualifier | A modifier of the Base Type. For example, for Date types it is a pattern, for Long types it might be SIGNED, and for String types it might be COUNTRY.ISO-3166-2. |
Min Value | The minimum value observed. |
Max Value | The maximum value observed. |
bottomK | The lowest ten values. |
topK | The highest ten values. |
Min Length | The minimum length observed. |
Max Length | The maximum length observed. |
Mean / Standard Deviation (numeric types only) | The mean and standard deviation, computed with Welford's algorithm (see the first sketch after this table). |
Cardinality | The cardinality of the valid set (or Maximum Cardinality if the set is larger than Maximum Cardinality). |
Outlier Cardinality | The cardinality of the invalid set (or Maximum Outlier Cardinality if the set is larger than Maximum Outlier Cardinality). |
Leading White Space | Does the observed set have leading white space? (True/False) |
Trailing White Space | Does the observed set have trailing white space? (True/False) |
Multiline | Does the observed set have multiline elements? (True/False) |
Logical Type | Does the observed set reflect a Semantic Type? (True/False) |
Possible Key | Does the observed set appear to be a Key field, that is, is it unique? (True/False) |
Cardinality Detail | Details on the valid set, including a list of elements and an occurrence count. |
Outlier Detail | Details on the invalid set, including a list of elements and an occurrence count. |
Shape Detail | Details on the shape of the observed set, including a list of elements and an occurrence count. Shapes collapse all numeric characters to '9' and all alphabetic characters to 'X' (see the second sketch after this table). |
Structure Signature | A SHA-1 hash that reflects the structure of the data stream. |
Data Signature | A SHA-1 hash that reflects the content of the data stream. |
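The Mean / Standard Deviation row above cites Welford's algorithm, which updates the mean and the sum of squared deviations one value at a time so that the full sample never needs to be held in memory. The following is a minimal Java sketch of that algorithm, for illustration only; it is not the node's own implementation.

public class Welford {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean

    // Incorporate one new observation.
    public void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    public double mean() { return mean; }

    // Population standard deviation; use (count - 1) for the sample form.
    public double standardDeviation() {
        return count > 0 ? Math.sqrt(m2 / count) : Double.NaN;
    }

    public static void main(String[] args) {
        Welford w = new Welford();
        for (double v : new double[] { 2, 4, 4, 4, 5, 5, 7, 9 }) {
            w.add(v);
        }
        System.out.println(w.mean() + " / " + w.standardDeviation()); // 5.0 / 2.0
    }
}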
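The Shape Detail collapse can be illustrated in the same way: digits map to '9' and alphabetic characters to 'X', and this sketch assumes that all other characters pass through unchanged.

public class Shape {
    // Collapse a value to its shape: digits become '9', letters become 'X'.
    // Assumption for this sketch: other characters are kept as-is.
    static String shapeOf(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) sb.append('9');
            else if (Character.isLetter(c)) sb.append('X');
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shapeOf("AB123456C"));  // XX999999X
        System.out.println(shapeOf("2021-06-30")); // 9999-99-99
    }
}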
Logical Type detection
You can add your own logical types - also known as semantic types - to those detected by default by the Data Profiler node. To do this, provide a JSON specification in a file that you upload to the server, and use the path to that file as the value of the LogicalTypeDefinitions property on the node.
For example, to detect possible UK National Insurance number values in a data field, you could use the following JSON:
[
    {
        "semanticType": "UK.NI",
        "pluginType": "regex",
        "validLocales": [
            {
                "localeTag": "*",
                "matchEntries": [ { "regExpReturned": "\\s*[a-zA-Z]{2}(\\s*\\d\\s*){6}[a-zA-Z]\\s*" } ]
            }
        ],
        "threshold": 98,
        "baseType": "STRING"
    }
]
When you supply input data to the Data Profiler node, a type defined in the JSON can be identified when the following criteria are met:
- A string field contains values that are matched by the regular expression defined in the regExpReturned property in the JSON.
- The percentage of values in the field that are identified as the specified type is equal to or greater than the value of the threshold property. The value is given as a percentage, so in this example a minimum of 98% of the values must be detected as a match.
When a match is found, the output analysis reports the Type Qualifier for the field as the value of the semanticType property - in this case UK.NI - and the value of the Validation column for the field is the regular expression defined in the regExp property in the JSON.
For more information, see https://github.com/tsegall/fta
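The following Java sketch illustrates this matching logic using the regExpReturned expression and threshold value from the example JSON. The sample values and class name are hypothetical; the node performs this evaluation internally.

import java.util.List;
import java.util.regex.Pattern;

public class SemanticTypeCheck {
    public static void main(String[] args) {
        // regExpReturned and threshold values taken from the JSON example above.
        Pattern ukNi = Pattern.compile("\\s*[a-zA-Z]{2}(\\s*\\d\\s*){6}[a-zA-Z]\\s*");
        double threshold = 98.0; // percentage of values that must match

        // Hypothetical sample values for a string field.
        List<String> values = List.of("QQ123456C", "AB 12 34 56 D", "not applicable");

        long matches = values.stream().filter(v -> ukNi.matcher(v).matches()).count();
        double matchPercent = 100.0 * matches / values.size();

        // The Type Qualifier is reported as UK.NI only when the match rate
        // reaches the threshold.
        System.out.printf("%.1f%% matched -> %s%n",
                matchPercent,
                matchPercent >= threshold ? "Type Qualifier UK.NI" : "no semantic type assigned");
    }
}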
Properties
Sample Size
Optionally specify the size of the sample set of the input data to analyze. The default is to process the entire set.
Empty String Is Null
Optionally specify whether strings that are either zero length or all whitespace are interpreted as NULL. If false, they are treated as distinct string values. The default value is True.
AmbiguousDateResolutionMode
Choose from:
- Day First - In case of ambiguity, assume that the day appears before the month.
- Month First - In case of ambiguity, assume that the month appears before the day.
- Auto - In case of ambiguity, use the information from the locale - based on the Locale property - to determine the correct order.
As an example, consider an input string field with the values "1/2/2009" and "1/3/2009".
If AmbiguousDateResolutionMode is set to Day First, the determined format is d/M/yyyy and the values are parsed as the first of February and the first of March 2009.
If AmbiguousDateResolutionMode is set to Month First, the determined format is M/d/yyyy and the values are parsed as the second and third of January 2009.
If AmbiguousDateResolutionMode is set to Auto in a US English (en_US) locale, the determined format is M/d/yyyy and the values are parsed as the second and third of January 2009.
If AmbiguousDateResolutionMode is set to Auto in a British English (en_GB) locale, the determined format is d/M/yyyy and the values are parsed as the first of February and the first of March 2009.
The default value is Auto.
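The effect of the three modes can be reproduced with standard Java date parsing; the d/M/yyyy and M/d/yyyy patterns below are the same formats described above, and the class name is illustrative only.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class AmbiguousDates {
    public static void main(String[] args) {
        DateTimeFormatter dayFirst = DateTimeFormatter.ofPattern("d/M/yyyy");   // Day First, or Auto in en_GB
        DateTimeFormatter monthFirst = DateTimeFormatter.ofPattern("M/d/yyyy"); // Month First, or Auto in en_US

        for (String value : new String[] { "1/2/2009", "1/3/2009" }) {
            System.out.println(value
                    + " -> Day First: " + LocalDate.parse(value, dayFirst)
                    + ", Month First: " + LocalDate.parse(value, monthFirst));
        }
        // 1/2/2009 -> Day First: 2009-02-01, Month First: 2009-01-02
        // 1/3/2009 -> Day First: 2009-03-01, Month First: 2009-01-03
    }
}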
Enable Default Logical Types
Enable the default set of Logical Types, which is based on the locale. The default value is True.
Length Qualifier
Indicate whether to qualify the size of the detected Regular Expression. For example, if set to True, the qualifier might be expressed as \d{3,6}, indicating a minimum length of three digits and a maximum of six; if set to False, it might be expressed as \d+, indicating one or more digits.
The default value is True.
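A short sketch of the difference between the qualified and unqualified expressions, using standard Java regular expressions:

import java.util.regex.Pattern;

public class LengthQualifier {
    public static void main(String[] args) {
        Pattern qualified = Pattern.compile("\\d{3,6}");  // Length Qualifier = True
        Pattern unqualified = Pattern.compile("\\d+");    // Length Qualifier = False

        // Both expressions accept a five-digit value...
        System.out.println(qualified.matcher("12345").matches());        // true
        System.out.println(unqualified.matcher("12345").matches());      // true
        // ...but only the unqualified form accepts a ten-digit value.
        System.out.println(qualified.matcher("1234567890").matches());   // false
        System.out.println(unqualified.matcher("1234567890").matches()); // true
    }
}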
Detect Window
Set the size of the Detect Window, that is, the number of samples collected before attempting to determine the type. The default value is 20.
Detection Threshold
Set the percentage for the Detection Threshold - typically between 60 and 100. If set to 100, then all values must be of the appropriate type. The default value is 95.
Maximum Cardinality
Set the maximum cardinality that will be tracked. This value is the number of discrete values tracked for each field during analysis. The default value is 12000.
Maximum Outlier Cardinality
Set the maximum outlier cardinality that will be tracked. This value is the number of discrete outlier values tracked for each field during analysis. The default value is 100.
JSON Detail Level
Control the level of detail related to valid, invalid, and shape sets provided in the JSON output.
- 0 - just counts.
- 1 - first 100 rows.
- 2 - all rows.
The default value is 1.
Locale
Optionally specify the Locale to be used for analyzing the input stream. The Locale must be specified in the format <language code>[_<country code>[_<variant code>]].
For example, to set the locale to English, the Locale property should be set to en.
For US English, the Locale property should be set to en_US.
For Austrian German, the Locale property should be set to de_AT.
The default value is the default Locale under which the server is running.
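The expected format corresponds to Java's Locale.toString() form, as the following sketch suggests; this correspondence is an assumption made here for illustration.

import java.util.Locale;

public class LocaleTags {
    public static void main(String[] args) {
        // The property value uses the <language>[_<country>[_<variant>]] form,
        // which matches Java's Locale.toString() output.
        System.out.println(Locale.forLanguageTag("en").toString());    // en
        System.out.println(Locale.forLanguageTag("en-US").toString()); // en_US
        System.out.println(Locale.forLanguageTag("de-AT").toString()); // de_AT

        // The fallback when the property is not set: the server's default Locale.
        System.out.println(Locale.getDefault().toString());
    }
}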
LogicalTypeDefinitions
Optionally provide the path to a JSON specification for additional Logical Types to be detected. For details, see Logical Type detection.
Bulk Mode
If set to True, the input is assumed to be a bulk description of the data to be processed.
In this mode, four fields are expected:
- Key (unicode)
- Name (unicode)
- Value (unicode)
- Count (long)
For example, the first row could be Northwind.Customers.Gender,Gender,Female,10341
The default value is False.
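As a sketch of the expected row shape, the example row above can be read into the four fields as follows; the class name and parsing code are illustrative only and not part of the node.

public class BulkRecord {
    final String key;   // e.g. Northwind.Customers.Gender
    final String name;  // e.g. Gender
    final String value; // e.g. Female
    final long count;   // e.g. 10341

    // Naive comma-separated parse, sufficient for the example row.
    BulkRecord(String csvRow) {
        String[] parts = csvRow.split(",");
        this.key = parts[0];
        this.name = parts[1];
        this.value = parts[2];
        this.count = Long.parseLong(parts[3]);
    }

    public static void main(String[] args) {
        BulkRecord r = new BulkRecord("Northwind.Customers.Gender,Gender,Female,10341");
        System.out.println(r.name + " = " + r.value + " (count " + r.count + ")");
    }
}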
Key Field
Optionally select a Field from the input to be used as a Pass-through field. This provides similar functionality to the Key input in Bulk mode.
Inputs and outputs
Inputs: in1.
Outputs: analysis, detail.