Profile Data - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 DQ+
Version
Latest
Language
English
Product name
Data360 DQ+
Title
Data360 DQ+ Help
Copyright
2024
First publish date
2016
ft:lastEdition
2024-07-09
ft:lastPublication
2024-07-09T15:09:58.774265

The Profile Data node allows you to gather information about any field in your data set. Using this node generates a sheet with fields containing useful information about each profiled field.

Note: When building your analysis, remember that profile information is representative of your data sample. To obtain an actual profile of the entire data set, you need to execute your analysis.

Properties

Value Limit

Specify how many values can be displayed in the Profile Data node's values and patterns array fields.

For the values and patterns fields, if more values are found than the Value Limit, no values will be displayed.

The default value is 10,000. This is the value that will be used if the property is left blank.

Profile Fields

Select fields from the incoming data store that you want to include in the Profile Data node output.

Click Add Field and use the Field Selector dialog to move fields between the Available and Selected columns.

Counters

Optionally create expressions to apply to individual fields. The result is a count of values that satisfy the expression.

For example, consider the following data set:

id

value

001

100

002

125

003

150

004

175

005

200

If you create a Custom Counter field using the expression value < 150, the value of the field is 2.

Semantic Type

Use the settings in the Semantic Type tab to identify semantic types for the fields in your data set. Data360 semantic types are automatically detected; you can use this option to identify your own custom semantic types. For more information about creating rules, see Semantic Type Identification Rule.

You can force Data360 DQ+ to check all fields for an applicable semantic type by selecting Force Semantic Type Identification for All Fields. When the box is ticked, all fields will be tested against the ruleset that Data360 DQ+ uses to identify system semantic types.

To add custom semantic typing rules, click Add in the Type Identification Checks panel, and complete the tabs to add a semantic type check.

Details

  1. Select the Rule Library that contains the Rule Group containing the semantic type rule.
  2. Select a Rule Group from the Rule Library. You can only select a Rule Group that has been defined as a Semantic Type Identification Rule.
  3. Tick the box for Execute All Rules in a Group, or select individual rules in the Rules to Execute panel.

Placeholder Mapping

For each of the Placeholder Fields applicable to the rule group, click Edit to open the Edit Placeholder Mapping panel.

The Placeholder Data Type is automatically filled in based on the rule.

Choose a Field Selection Type to determine which fields in the data set will be tested against the rule group.

  • Specified Field - Select one field to be tested from the Select Incoming Fields dropdown.
  • Specified Fields - Select multiple fields to be tested from the Select Incoming Fields dropdown.
  • All Placeholder Data Type Fields - Test all the fields that are of the type specified in Placeholder Data Type.
  • All Fields - Test all fields.

Click Accept Changes.

Output fields

The Profile Data node generates the following fields.

field

The name of the field you are profiling.

totalCount

The total number of records in the field.

uniqueCount

The number of distinct values that occur once in the field.

distinctCount

The number of records in the field that have a unique value.

duplicateCount

The number of records that have the same value as another record.

distinctPercent

The percentage of records in the field that are distinct.

uniquePercent

The percentage of records in the field that are unique.

duplicatePercent

The percentage of records in the field that have the same value as another record.

completePercent

Represents the percentage of records in the field that are not empty or not null.

blankCount

The number of records that contain only whitespace.

minLength

Computes the minimum length of non-null and non-empty string fields. The minimum length of blank values is 0.

maxLength

Computes the maximum length of non-null and non-empty string fields. The maximum length of blank values is 0.

nullCount

The number of records in the field that contain null values. Null values are those that equal ‘', NULLF(), or those where the value is empty. An empty value is one that contains only spaces and has a string length of 0.

emptyCount

The number of records that are null or have a string length of 0.

max

The value of max depends on the field type.

  • Numeric fields - the maximum numeric value in the field.
  • String fields - max indicates the string that appears at the end of the list, when all values in the field are sorted alphabetically.
  • Date fields - max indicates the most recent date in the field.
Note: Null values are not evaluated by max profiling.

min

The value of min depends on the field type.

  • Numeric fields - min indicates the minimum numeric value in the field.
  • String fields - min indicates the string that appears at the beginning of the list, when all values in the field are sorted alphabetically.
  • Date fields - min indicates the least recent date in the field.
Note: Null values are not evaluated by min profiling.

semanticType

The Semantic Type of the field. The Semantic Type can be identified from tests that automatically check for Data360 semantic types.

You can also test for custom Semantic Type tests, by using Semantic Type Identification Rules that you have defined as part of a Rule Group. For more information, see Semantic Type.

regExp

A regular expression that matches the field values. For example, if the semantic type is identified as a US State, the regular expression is given as \p{Alpha}{2}, or exactly two alphabetic characters, which is how US State abbreviations are given.

For more information about regular expression formatting, see Java Regex Help.

outlierCount

The number of records in the field that are outliers.

structureSignature

A SHA-1 hash that reflects the structure of the data stream.

dataSignature

A SHA-1 hash that reflects the contents of the data stream.

values

The values field generates a column where each record is an array containing all unique values for a profiled field, counts of each unique value, and percentages representing what percent a unique value's record count is of the total record count.

For example, consider the following data set.

field1

one

two

three

three

four

values Data Set

Profiling field1 would produce the following record in the values field:values = [{‘value': ‘one', ‘two', ‘three', ‘four'}, {‘totalCount': 1, 1, 2, 1}, {‘valuePercent': 20, 20, 40, 20}]

When viewing the values field for an individual profiled field, you can also use the arrow navigation buttons to view the values fields for other profiled fields.

patterns

The patterns field generates a column where each record is an array containing all unique value patterns for a profiled field, counts of each unique value pattern, percentages of each unique value pattern, and a regex for each unique value pattern.

For example, consider the following data set.

patterns Data Set

measure1

string2

1

x

10

xy

100

xyz

Profiling measure1 would produce the following records in the patterns field:pattern = [{‘pattern': N, NN, NNN}, {‘totalCount': 1, 1, 1}, {‘patternPercent':33.33, 33.33, 33.33}, {‘patternRegex': \d{1}, \d{2}, \d{3}]

Profiling string1 would produce the following records in the patterns field:

pattern = [{‘pattern': a, aa, aaa}, {‘totalCount': 1, 1, 1}, {‘patternPercent':33.33, 33.33, 33.33}, {‘patternRegex': \w{1}, \w{2}, \w{3}]

When viewing the pattern field for an individual profiled field, you can also use the arrow navigation buttons to view the patterns fields for other profiled fields.