- Completeness: The percentage of Complete, Null, and Empty Strings detected in the column. Click the hyperlink to view the records in each categories.
- Uniqueness: These statistics are displayed here:
- Unique: Records with no duplicates in the data source.
- Non- unique: Records having duplicates in the data source.
- Distinct: A list of all records present in your data source irrespective of those being unique or non-unique record.
For example, your column contains these names:
Here:Roger Gigi Gigi Gigi Garey Elena Brad Brad
-
Roger, Garey, and Elena are unique records.
- Gigi and Brad are non-unique records.
- Roger, Gigi, Garey, Brad, and Elena are distinct records.
- Min Length and Max Length of characters in case the column data is a string
- Standard deviation (Stdev), Maximum (Max), Minimum (Min), Variance, and Average value in case the column has numerical data.
- Histogram representing the distribution of data in case the column has numerical data
- Percentile in case the column has numerical data
- Frequency of data in the column. Click the hyper-linked value on the X axis to go the list of associated records.
- String Lengths in case the column data type is string
- Scripts Distribution in case the column data type is string
- Text Pattern Frequency in case the column data type is string
- Character Categories: Graphically displays the frequencies
of Latin character types detected in the selected string column. The various
categories are:
- Casing: Upper Case, Lower Case, and Mixed Case
- Character Data Types: Alphabetic, Numeric, and Alphanumeric
- Contains Spaces: Single Space, Multiple Spaces, and Trailing or Leading Spaces
- Special Character: Contains or Does not Contain.
Note: Only the special characters defined during configuration of Character Analysis rule will be considered here.
Viewing Outliers
You can view the outliers detected in a column by clicking the Outliers tab. This tab displays a tabular structure of category-wise detected outliers in your column. Details such as Pattern, Length, Value or Frequency, Computed Range, and Occurrence are also displayed.
Viewing the Suggestions Page
This page is divided into two parts:
- A header showing the profile details - These details are displayed for the
analyzed profile:
- Profile name: Name of the profile
- Description: A brief description of the profile
- Started: The date (yyyy/mm/dd), time (hh:mm:ss), and period (AM or PM) when the profile analysis was initialized.
- Data Source: The source of the data in the profile being analyzed. It can be a physical or logical model or a direct data source connection.
- User: The user who created the profile
- Tables: Number of tables in the selected model
- Data source type: Whether the data source for the profile is a model or a database
- Records: Total number of records analyzed
- Statistics Button: Navigates you to the Data Profiling Results page. For more information, see Viewing Suggestions on the Data Profiling Results Page.
- The second section displays the name of your table along with a Recommendation button. On clicking this button, suggestion-wise Groovy Scripts for your table are displayed. On expanding the table through the icon, you can view column-wise suggestions for your table.
You can view and select suggestions for independent statistics from the Data Profiling Results page. Examine the statistics of your data and select any suggestion according to your requirements. For more information, see Viewing Suggestions on the Data Profiling Results Page.
Viewing Suggestions on the Data Profiling Results Page
Suggestions for Summary Tab
- Character Data Type: Character data types such as numeric, alphanumeric, and alphabetic detected in your data are displayed in a tabular format with their frequencies. From the list of options, you can select to Trim or Eliminate these values, or set to a default value. You have to specify this value in the text box which appears on selecting this option.
- Remove Spaces: Redundant spaces such as a
Single Space, Multiple Spaces,
and Trailing or leading Spaces present in your data are
displayed in a tabular format with their frequencies. You can choose any of
these options according to your requirement:
- Trim Spaces - This option will remove any trailing or leading spaces present in your data.
- Minimize white spaces - This option will remove redundant spaces between any two words and also remove any leading or trailing spaces
- Remove spaces - This option will remove all spaces present in your data.
- Standardize Casing: You can regularize your data by standardizing the casing of your data such as Mixed Case, Upper Case, and Lower Case through this option. You can choose to change the casing of your data to Lower case or Upper case, this will cleanse your data, improve it's uniformity, and readability.
- Remove Special Characters: You can choose to remove any redundant special characters present in your data through this option. The detected special characters are displayed under the Options section in a text box. To cleanse your data, you can select the special characters to be removed and delete the characters which you want to keep using the backspace key from the text box.
- Remove Control Characters: You can remove any redundant non-printable control characters present in your data through this option. The control characters detected in your data are displayed under the Statistics section, select the Remove Control Characters option and apply this suggestion to further cleanse your data.
Create Semantic Type: You can categorize your distinct data into a semantic type such as Gender, First Name, Last Name, Country, and Phone. Click on the statistics graph of the distinct data, a preview of your data is shown. Download this data and create semantic types through Table Management. For more information about Table Management, see Introduction to Lookup Tables.
Handling Nulls: You can set the null and empty strings present in your data to a default value. This increases the uniformity and completeness of your data. The frequencies of Complete, Null, and Empty strings are displayed in a tabular format under the Statistics section. Enter the desired default value in the Options text-box and select this suggestion.
Suggestions for Date Summary Tab
For Date Patterns, you will see this suggestion:
Standardize Date: You can regulate the formats of dates present in your data to improve the consistency of your data. Frequencies of various date formats detected in your data are displayed in a tabular format. Select the desired format from the date format drop-down placed under the Options section and select this suggestion.
Suggestions for Custom Pattern Tab
For Matched Patterns, you will see this suggestion:
- Set them to a default value. Enter the default value in the text-box provided.
- Set them to null
After selecting the desired option, select this suggestion to cleanse your data.
Suggestions for Outlier Tab
For semantic type outliers, you will see this suggestion:
Segregating Semantic Types: You can move any semantic type outliers present in a column to a separate column and improve the uniformity and clarity of your data. The frequency of various outliers detected in your data are displayed in a tabular format, you can choose to move any of the detected outliers to a separate custom column and segregate the semantic types.Select the outliers which you wish to move to a separate column, enter the custom name of the column in the text box and select this suggestion.