The Profiling side panel is shown when any profiling data is available, for either a business or technical asset. It is populated using the Data360 Govern DataProfiles v2.0 APIs.
The name of the asset displays as the first heading, followed by the asset type, then the categories.
All categories are expanded by default, with a maximum of five entries each. Further entries can be displayed by clicking Show more. If a category has no entries, the label is not displayed.
The Profiling panel can include these categories:
- Sample Summary
- Sample Quality
- Sample Distribution
- Top Values
- Bottom Values
- Invalid/Outliers
- Shapes
- Statistics
The sample categories are always the first categories that display and should always have data. However, there is no validation for any required fields in the API, other than profileSetDate.
Sample summary
Field | Description | Source |
---|---|---|
Effective Date | The date of the latest set of profiling information received for the asset. | profileSetDate |
Total Row Count | The total number of rows in an entire data set. | totalCount |
Sample Row Count | The number of rows that are profiled. A percentage of the total count is also displayed. | sampleCount |
Base Type | The type of data. | type |
Semantic Type | A further explanation of the type. For example, if the type is "String" the Semantic Type may be Email or Name. | typeQualifier |
Type Confidence | A percentage of how confident you can be that the profiling results are from the type specified. For example, "I'm 97.5% sure the data sampled was from a date field". Displayed as a percentage with two decimal points of precision. For example, if the value in the API is .9753, then the display in the UI is 97.53%. | confidence |
Match Detection | The number of duplicates and similar fields in the sample data. | |
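The two percentages in this table can be sketched as follows. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and returning 0.0 when the total count is zero is an assumption. Only the field names (confidence, sampleCount, totalCount) and the .9753 to 97.53% example come from the table above.

```python
def format_confidence(confidence: float) -> str:
    """Format a 0-1 confidence fraction as a percentage with two decimals."""
    return f"{confidence * 100:.2f}%"


def sample_share(sample_count: int, total_count: int) -> float:
    """Return sampleCount as a percentage of totalCount.

    Returning 0.0 for an empty data set is an assumption made for this sketch.
    """
    if total_count <= 0:
        return 0.0
    return sample_count * 100 / total_count


print(format_confidence(0.9753))  # 97.53%
print(sample_share(250, 1000))    # 25.0
```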
Semantic types
Semantic types are standardized strings of characters that help to describe the type of information particular data represents.
When the profiling side panel is displayed, a check is carried out to see if the semantic type is found in the semantic definitions. If a match is not found, the qualifier is displayed against the Semantic Type. For example, HONORIFIC_EN.
If a match is found, the name of the semantic type displays as a link to the appropriate type, for example Date. Click the link to display the Information secondary side panel, to the left of the Profiling panel. The definition details of the relevant semantic type are displayed. Click any link on the Information secondary side panel to replace the semantic type information with its details.
If the semantic type is not sent with the profiling data, the Semantic Type label is still shown, with dashes in place of a value.
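The three display cases described above (match found, no match, no semantic type sent) can be summarized in a short sketch. The lookup table, its DATE key, the function name, and the returned structure are all hypothetical; only the behavior of each case is taken from the text.

```python
from typing import Optional


def semantic_type_display(type_qualifier: Optional[str],
                          definitions: dict) -> dict:
    """Decide what the Semantic Type field shows.

    `definitions` is a hypothetical mapping of qualifier -> display name,
    standing in for the semantic definitions.
    """
    if not type_qualifier:
        # Semantic type was not sent: label stays, value is dashes.
        return {"text": "--", "link": False}
    name = definitions.get(type_qualifier)
    if name is None:
        # No matching definition: show the raw qualifier, no link.
        return {"text": type_qualifier, "link": False}
    # Match found: show the definition name as a link.
    return {"text": name, "link": True}


definitions = {"DATE": "Date"}  # illustrative entry only
print(semantic_type_display("DATE", definitions))
print(semantic_type_display("HONORIFIC_EN", definitions))
print(semantic_type_display(None, definitions))
```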
Match detection
The Match Detection field is displayed as part of the Sample Summary category, and shows the number of duplicates and similar fields in the sample data. Match detection is based on the data signature and data structure that is passed to Data360 Govern by Data360 Analyze after an asset is profiled. All profiled assets are checked for similar or duplicate entries, based on those passed fields. If two assets have the same data signature, they are classified as duplicates, but if they have the same data structure, they are regarded as similar assets.
Duplicates:
- Have a red badge, positioned to the left of the label, together with the total number found. Click the link to open the Match Detection dialog.
- If you hover your mouse over the label, a tooltip displays the number of assets detected that are of the same type and have matching data.
- If there are no duplicates, the red badge is muted and the label appears gray with no link.
Similar fields:
- Have an orange badge, positioned to the left of the label, together with the total number found. Click the link to open the Match Detection dialog.
- If you hover your mouse over the label, a tooltip displays the number of assets detected that are of the same type but with different data.
- If there are no similar fields, the orange badge is muted and the label appears gray with no link.
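As a rough sketch of the classification rule, the following compares one profiled asset against the others. The ProfiledAsset class and its attribute names are hypothetical, and treating an asset that matches on both signature and structure as a duplicate only is an assumption; the text states only that the same data signature means duplicate and the same data structure means similar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfiledAsset:
    # Hypothetical container; the real payload fields are not described here.
    name: str
    data_signature: str
    data_structure: str


def classify_matches(target: ProfiledAsset, others: list) -> tuple:
    """Return (duplicates, similar) asset names relative to `target`."""
    duplicates, similar = [], []
    for other in others:
        if other.data_signature == target.data_signature:
            duplicates.append(other.name)   # same signature: duplicate
        elif other.data_structure == target.data_structure:
            similar.append(other.name)      # same structure only: similar
    return duplicates, similar


target = ProfiledAsset("customer_email", "sig-1", "struct-a")
others = [
    ProfiledAsset("contact_email", "sig-1", "struct-a"),
    ProfiledAsset("supplier_email", "sig-2", "struct-a"),
    ProfiledAsset("order_total", "sig-3", "struct-b"),
]
print(classify_matches(target, others))
```

The lengths of the two returned lists correspond to the totals shown beside the red and orange badges.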
Sample quality
A tooltip next to each calculated percentage displays the percentage of the total. The total itself is relative: for example, the total of the sample, the total of the valid values, and so on.
Field | Description | Source |
---|---|---|
Quality bar | A single horizontal bar with a spread of counts of valid, invalid and not populated rows from the sample data. | |
Valid | The number of valid values found in the sample data, based on the Type or Semantic Type. Next to the count is a percentage. This is calculated within Data360 Govern, and equals Valid Count divided by the Sample Count. Distinct - Indicates how many of the valid values are distinct. | matchCount |
Invalid/Outliers | The number of invalid or outlier values found in the sample data. Next to the count is a percentage. This is calculated within Data360 Govern, and equals Invalid/Outliers Count divided by the Sample Count. | outlierCount |
Null/Blank | The count of either nulls or blanks found in the sample data. Next to the count is a percentage. This is calculated within Data360 Govern, and equals Not Populated Count divided by the Sample Count. | nullCount + blankCount |
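The percentage calculations in this table can be illustrated with a small sketch. It assumes the profile is available as a flat dictionary keyed by the source field names, which is an assumption about the payload shape; the function name, the output labels, and the handling of missing counts as zero are also hypothetical.

```python
def quality_breakdown(profile: dict) -> dict:
    """Return {label: (count, percent_of_sample)} for the quality rows."""
    sample = profile.get("sampleCount", 0)
    counts = {
        "Valid": profile.get("matchCount", 0),
        "Invalid/Outliers": profile.get("outlierCount", 0),
        # Not Populated Count = nullCount + blankCount
        "Null/Blank": profile.get("nullCount", 0) + profile.get("blankCount", 0),
    }
    return {
        label: (count, count * 100 / sample if sample else 0.0)
        for label, count in counts.items()
    }


profile = {"sampleCount": 200, "matchCount": 150, "outlierCount": 30,
           "nullCount": 15, "blankCount": 5}
print(quality_breakdown(profile))
# {'Valid': (150, 75.0), 'Invalid/Outliers': (30, 15.0), 'Null/Blank': (20, 10.0)}
```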
Sample distribution
The bar chart shows the distribution of samples, according to the type of data. For example, if the data is:
- Date/Time - The bar chart shows the distribution over time.
- String - The bar chart shows the distribution according to distinct string values.
- Number - The bar chart shows the range distribution, together with the standard deviation and mean for the distinct values.
- Boolean - The bar chart shows whether values are true or false.
The bar chart displays the relevant results with green bars, and also includes the invalid/outliers, if any, with a red bar and null/blank values with a gray one.
Top values, bottom values, invalid/outliers and shapes
These categories all behave in a similar way, and only display if there is data for them. Each value displays as a bar chart with the value and count.
Next to the count a percentage displays, which is a calculation of the value count divided by the sample count.
- Top Values - The values are from topK, with the count of each in cardinalityDetail.
- Bottom Values - The values are from bottomK, with the count of each in cardinalityDetail.
- Invalid/Outlier Values - Both the values and counts are in outlierDetail.
- Shapes - Both the values and counts are in shapesDetail.
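The per-value percentage, together with the five-entry limit that applies to every category, might be sketched as below. The shape of topK, cardinalityDetail, and the other source fields is not described here, so the sketch starts from plain (value, count) pairs; the function name and return structure are hypothetical.

```python
def value_rows(pairs, sample_count, limit=5):
    """Build display rows for a values category.

    `pairs` is a list of (value, count) tuples already extracted from the
    profile. Returns the rows shown by default and the number of further
    rows that would sit behind Show more.
    """
    rows = [
        (value, count, count * 100 / sample_count if sample_count else 0.0)
        for value, count in pairs
    ]
    return rows[:limit], max(0, len(rows) - limit)


pairs = [("A", 50), ("B", 25), ("C", 10), ("D", 5), ("E", 5), ("F", 5)]
visible, hidden = value_rows(pairs, sample_count=100)
print(visible[0])  # ('A', 50, 50.0)
print(hidden)      # 1
```

An empty list of pairs yields no rows, which matches the rule that a category with no entries is not displayed.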
Statistics
The following statistics are delivered through the APIs.
Label | Source |
---|---|
Null Count | nullCount |
Blank Count | blankCount |
Minimum Value | min |
Maximum Value | max |
Minimum Length | minLength |
Maximum Length | maxLength |
Mean | mean |
Standard Deviation | standardDeviation |
Multiline | multiline |
Leading Whitespace | leadingWhiteSpace |
Trailing Whitespace | trailingWhiteSpace |
Leading Zero Count | leadingZeroCount |
Validation Regular Expression | regExp |
The availability of a particular statistical value depends, in part, on the data type. For example, if the data type is boolean, only Blank Count, Null Count, and Validation Regular Expression are displayed, and only if they have a value.
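One way to picture this behavior is a filter over the label-to-source mapping in the table. This sketch assumes the profile is a flat dictionary keyed by the source names and that a missing key or a None value means "no value"; both are assumptions, as is the function name. A count of 0 is treated as a value here, which may differ from the actual panel.

```python
# Label -> source field, in the order listed in the Statistics table.
STAT_FIELDS = [
    ("Null Count", "nullCount"),
    ("Blank Count", "blankCount"),
    ("Minimum Value", "min"),
    ("Maximum Value", "max"),
    ("Minimum Length", "minLength"),
    ("Maximum Length", "maxLength"),
    ("Mean", "mean"),
    ("Standard Deviation", "standardDeviation"),
    ("Multiline", "multiline"),
    ("Leading Whitespace", "leadingWhiteSpace"),
    ("Trailing Whitespace", "trailingWhiteSpace"),
    ("Leading Zero Count", "leadingZeroCount"),
    ("Validation Regular Expression", "regExp"),
]


def visible_statistics(profile: dict) -> list:
    """Return (label, value) pairs for statistics that have a value."""
    return [
        (label, profile[key])
        for label, key in STAT_FIELDS
        if profile.get(key) is not None
    ]


# Illustrative boolean-typed profile: only three statistics carry values.
boolean_profile = {"nullCount": 0, "blankCount": 2,
                   "regExp": "^(true|false)$", "mean": None}
print(visible_statistics(boolean_profile))
```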