The Profile Data node allows you to gather information about any field in your data set. Using this node generates a sheet with fields containing useful information about each profiled field.
Properties
Value Limit
Specify how many values can be displayed in the Profile Data node's values and patterns array fields.
For the values and patterns fields, if more values are found than the Value Limit, no values will be displayed.
The default value is 10,000. This is the value that will be used if the property is left blank.
Profile Fields
Select fields from the incoming data store that you want to include in the Profile Data node output.
Click Add Field and use the Field Selector dialog to move fields between the Available and Selected columns.
Counters
Optionally create expressions to apply to individual fields. The result is a count of values that satisfy the expression.
For example, consider the following data set:
id |
value |
---|---|
001 |
100 |
002 |
125 |
003 |
150 |
004 |
175 |
005 |
200 |
If you create a Custom Counter field using the expression value < 150
, the value of the field is 2.
Semantic Type
Use the settings in the Semantic Type tab to identify semantic types for the fields in your data set. Data360 semantic types are automatically detected; you can use this option to identify your own custom semantic types. For more information about creating rules, see Semantic Type Identification Rule.
You can force Data360 DQ+ to check all fields for an applicable semantic type by selecting Force Semantic Type Identification for All Fields. When the box is ticked, all fields will be tested against the ruleset that Data360 DQ+ uses to identify system semantic types.
To add custom semantic typing rules, click Add in the Type Identification Checks panel, and complete the tabs to add a semantic type check.
Details
- Select the Rule Library that contains the Rule Group containing the semantic type rule.
- Select a Rule Group from the Rule Library. You can only select a Rule Group that has been defined as a Semantic Type Identification Rule.
- Tick the box for Execute All Rules in a Group, or select individual rules in the Rules to Execute panel.
Placeholder Mapping
For each of the Placeholder Fields applicable to the rule group, click Edit to open the Edit Placeholder Mapping panel.
The Placeholder Data Type is automatically filled in based on the rule.
Choose a Field Selection Type to determine which fields in the data set will be tested against the rule group.
- Specified Field - Select one field to be tested from the Select Incoming Fields dropdown.
- Specified Fields - Select multiple fields to be tested from the Select Incoming Fields dropdown.
- All Placeholder Data Type Fields - Test all the fields that are of the type specified in Placeholder Data Type.
- All Fields - Test all fields.
Click Accept Changes.
Output fields
The Profile Data node generates the following fields.
field
The name of the field you are profiling.
totalCount
The total number of records in the field.
uniqueCount
The number of distinct values that occur once in the field.
distinctCount
The number of records in the field that have a unique value.
duplicateCount
The number of records that have the same value as another record.
distinctPercent
The percentage of records in the field that are distinct.
uniquePercent
The percentage of records in the field that are unique.
duplicatePercent
The percentage of records in the field that have the same value as another record.
completePercent
Represents the percentage of records in the field that are not empty or not null.
blankCount
The number of records that contain only whitespace.
minLength
Computes the minimum length of non-null and non-empty string fields. The minimum length of blank values is 0.
maxLength
Computes the maximum length of non-null and non-empty string fields. The maximum length of blank values is 0.
nullCount
The number of records in the field that contain null values. Null values are those that equal ‘'
, NULLF()
, or those where the value is empty. An empty value is one that contains only spaces and has a string length of 0.
emptyCount
The number of records that are null or have a string length of 0.
max
The value of max depends on the field type.
- Numeric fields - the maximum numeric value in the field.
- String fields - max indicates the string that appears at the end of the list, when all values in the field are sorted alphabetically.
- Date fields - max indicates the most recent date in the field.
min
The value of min depends on the field type.
- Numeric fields - min indicates the minimum numeric value in the field.
- String fields - min indicates the string that appears at the beginning of the list, when all values in the field are sorted alphabetically.
- Date fields - min indicates the least recent date in the field.
semanticType
The Semantic Type of the field. The Semantic Type can be identified from tests that automatically check for Data360 semantic types.
You can also test for custom Semantic Type tests, by using Semantic Type Identification Rules that you have defined as part of a Rule Group. For more information, see Semantic Type.
regExp
A regular expression that matches the field values. For example, if the semantic type is identified as a US State, the regular expression is given as \p{Alpha}{2}
, or exactly two alphabetic characters, which is how US State abbreviations are given.
For more information about regular expression formatting, see Java Regex Help.
outlierCount
The number of records in the field that are outliers.
structureSignature
A SHA-1 hash that reflects the structure of the data stream.
dataSignature
A SHA-1 hash that reflects the contents of the data stream.
values
The values field generates a column where each record is an array containing all unique values for a profiled field, counts of each unique value, and percentages representing what percent a unique value's record count is of the total record count.
For example, consider the following data set.
field1 |
---|
one |
two |
three |
three |
four |
values Data Set
Profiling field1 would produce the following record in the values field:values = [{‘value': ‘one', ‘two', ‘three', ‘four'}, {‘totalCount': 1, 1, 2, 1}, {‘valuePercent': 20, 20, 40, 20}]
When viewing the values field for an individual profiled field, you can also use the arrow navigation buttons to view the values fields for other profiled fields.
patterns
The patterns field generates a column where each record is an array containing all unique value patterns for a profiled field, counts of each unique value pattern, percentages of each unique value pattern, and a regex for each unique value pattern.
For example, consider the following data set.
patterns Data Set
measure1 |
string2 |
---|---|
1 |
x |
10 |
xy |
100 |
xyz |
Profiling measure1 would produce the following records in the patterns field:pattern = [{‘pattern': N, NN, NNN}, {‘totalCount': 1, 1, 1}, {‘patternPercent':33.33, 33.33, 33.33}, {‘patternRegex': \d{1}, \d{2}, \d{3}]
Profiling string1 would produce the following records in the patterns field:
pattern = [{‘pattern': a, aa, aaa}, {‘totalCount': 1, 1, 1}, {‘patternPercent':33.33, 33.33, 33.33}, {‘patternRegex': \w{1}, \w{2}, \w{3}]
When viewing the pattern field for an individual profiled field, you can also use the arrow navigation buttons to view the patterns fields for other profiled fields.