Profiling Rules - discovery - 23.1

Spectrum Discovery Guide

Product type
Software
Portfolio
Verify
Product family
Spectrum
Product
Spectrum > Discovery
Version
23.1
Language
English
Product name
Spectrum Discovery
Title
Spectrum Discovery Guide
Topic type
How Do I
Overview
Reference
First publish date
2007
ft:lastEdition
2024-02-07
ft:lastPublication
2024-02-07T17:21:58.768552

Profiling rules perform different types of analysis on your data. When setting up a profile, choose the profiling rules that perform the types of data analysis you are interested in.

This section describes the profiling rules supported in Spectrum Discovery.

Character Analysis

This rule identifies patterns, scripts, and character types in string fields. To configure this rule when creating a profile, click the Configure icon and add, edit, or delete special characters from the prepopulated list. When you enable this rule, you will get this information for the string fields in your data:

  • Frequency: The most-used phrases in the selected string column.
  • Scripts Distribution: The different scripts identified in the selected string column and their count.
  • String Lengths: The distribution of string lengths in the selected string column.
  • Character Categories: The types of characters in the selected column, such as letter, punctuation, and number.
  • Text Patterns: Converts data in the string column to pattern and displays the pattern, its count, and percentage occurrence. The pattern is determined using this rule:
    • Latin upper-case characters are replaced with "A"
    • Latin lower-case characters are replaced with "a"
    • Digits are replaced with "9"
    • Control characters are replaced with

Single Column Key Analysis

Detects the primary (Candidate) key based on the uniqueness and completeness threshold you configure for your profile data.

To configure the threshold values, perform these steps:
  1. Select the rule check box and click the Configure icon .
  2. In the Rule Configuration pop-up, enter the Completeness threshold and Uniqueness threshold. For example, enter 90 as completeness threshold and 80 as uniqueness threshold.
  3. Click OK.
When you run the profile with this rule selected, it will fetch the result in Candidate Key Analysis section. The details displayed are:
Note: It will be displayed only when any of the columns in the selected data has both completeness and uniqueness score above the threshold value. For example, in our case the column should have completeness above 90 and uniqueness above 80.
  • Column name: Name of the column having the complete and unique data. For example, UserID in this case.
  • Completeness: The completeness score. Example: 100
  • Uniqueness: The uniqueness score. Example: 100
If you go to the identified column, you will see the graphical representation of completeness and uniqueness and the name of the primary key (Example: Candidate Key) adjacent to the column name.

Internal Rule

Determines statistics, such as completeness, uniqueness, frequency and outliers in the data set.

Semantic Analysis

  1. User Defined: This rule detects the user-defined semantic types in the Define Glossary and uses it to profile the data set. The user-defined semantic types can find out data based on these parameters:
    1. If you defined an expression in the semantic type, it finds out its match in the data set and gets the result on the Data Profiling Results page.
    2. If you specified the data to be looked for, the user-defined semantic type looks for it in the data set and presents it on the Data Profiling Results page. For example, if Single, Married is specified in the Exists In option of the semantic type, it will fetch all the records that have these value.
  2. Advanced Transformer Look-up: This rule determines the user-defined tables in the Advanced Transformer stage and uses it to profile the data set. To run this rule, you need to install Spectrum Data Normalization and load the Advanced Transformer reference tables. The tables you configure appears in the Semantic Type tab on the Define Glossary page, which are non-editable.
    Note: You can edit the tables in Enterprise Designer.

    If you select this rule, the Data Profiling Results page displays the detected semantic types if the data matches with the table defined in the stage.

  3. Open Parser Look-up: This rule determines the user-defined tables in the Open Parser stage and uses it to profile the data set. To run this rule, you need to install Spectrum Data Normalization and load the Open Parser reference tables. The tables you configure appears in the Semantic Type tab on the Define Glossary page, which are non-editable.
    Note: You can edit the tables in Enterprise Designer.

    If you select this rule, the Data Profiling Results page displays the detected semantic types if the data matches with the table defined in the stage.

  4. Credit Card Validation Analysis: Select this rule to detect and validate credit card numbers and identify credit card numbers as JCB, VISA, Diners Club (DINERS), MasterCard, Discover, or American Express (AMEX). If you select this rule, the Data Profiling Results page displays an additional Credit Card Summary tab showing these details:
    1. Validity: The valid and invalid credit card numbers.
    2. Credit Card Distribution: Category-wise distribution of the detected credit cards
  5. Date Analysis: This rule detects and validates dates in string columns. It also identifies date patterns in the columns and their distribution. This analysis can be useful in detecting date entries in erroneous columns, for example in email data. If you select this rule, Data Profiling Results page displays an additional Date Summary tab for the string columns that have dates. This tab shows these details:
    1. Validity: The valid and invalid values.
    2. Date Patterns: The date patterns detected in the selected columns, their total count, and percentage of that pattern in the data set.
  6. Email Analysis: This rule detects and validates the email addresses and determines the distribution of email domains in the selected data column. If you select this rule, Data Profiling Results page displays an additional Email Summary tab showing these details:
    1. Validity: The valid and invalid values.
    2. Domain Distribution: The top ten email domains in the selected data column.
  7. Phone Number Analysis: Select this rule to detect and validate phone numbers and identify phone numbers as fixed line numbers, mobile numbers, or any other type of number. This rule also gives the distribution of the phone numbers by country and region. You need to configure this rule to define the default country to use when a phone number does not have a country code. If you select this rule, Data Profiling Results page displays an additional Phone Number Summary tab showing these details:
    1. Validity: The valid and invalid phone numbers.
    2. Phone Number Types: The types of phone numbers, such as mobile, land line, fixed line, VOIP, Pager, voice mail, or toll-free.
    3. Phone Numbers by Country: The country-wise distribution of the detected phone numbers.
    4. Phone Numbers by Region: The region-wise distribution of the detected phone numbers
  8. Vehicle Identification Number (VIN) Analysis: Select this rule to detect and validate vehicle identification numbers. This rule also gives the distribution of Vehicle Identification Numbers by country. If you select this rule, the Data Profiling Results page displays an additional VIN Summary tab showing these details:
    1. Validity: The valid and invalid vehicle identification numbers.
    2. VIN Country Distribution: Country-wise distribution of the detected vehicle identification numbers.
  9. Social Security Number (SSN) Analysis: Select this rule to detect and validate social security numbers. If you select this rule, the Data Profiling Results page displays an additional SSN Summary tab showing the valid and invalid social security numbers.
  10. International Bank Account Number (IBAN) Analysis: Select this rule to detect and validate international bank account numbers. This rule also gives the distribution of International Bank Account Numbers by country. If you select this rule, the Data Profiling Results page displays an additional IBAN Summary tab showing these details:
    1. Validity: The valid and invalid international bank account numbers.
    2. IBAN Country Distribution: Country-wise distribution of the detected international bank account numbers.
  11. Semantic Analysis: Select this rule to detect semantic types, such as first name, city, country, ISO country code 2 and 3, last name (family name), and states. This rule can help find values in incorrect columns, such as city names in a Country column. If you select this rule, the Data Profiling Results page displays an additional Semantic Type tab showing the detected semantic types and their frequency.
  12. U.S. Address Analysis: This rule determines the quality of your address data using the U.S. database of Spectrum Universal Addressing. To run this rule, you need to:
    1. Install the Spectrum Universal Addressing U.S. database and define it as a resource in Management Console. For more information about adding this database resource, the Administration Guide.
    2. Configure the U.S. Address Analysis rule by clicking the Configure button and entering this information:
      • US Address Coder database: Select the Spectrum Universal Addressing database resource configured in Management Console.
      • AddressLine1 field to AddressLine5 field: Map these fields to the columns of the table you are analyzing. You do not necessarily need to enter column names in all the fields. However, the more specific you are the better the matching score will be.
      • Map the columns from your table to the City, Country, USUrbanName, FirmName field, PostalCode, and StateProvince fields.

        If you select this rule, the Data Profiling Results page shows the Address Summary tab.

        • The legend below the chart shows the match score for the data along with the color coding.
        • Point anywhere in the graph area to view the match score. The scores are in ranges (0, 1-25, 26-50, 51-80, 81-99, and 100), with zero representing no match of the data to the database. The graph also shows the percentage of matching records detected (color coded).
        • Click the area in the graph to view the data that matched or did not match to the database.
  13. International Address Analysis: This rule determines the quality of your address data using the Spectrum Global Address Validation database. To run this rule, you need to:
    1. Install the Spectum Global Address Validation database and define it as a resource in Management Console. For more information, see the Administration Guide.
    2. Configure the International Address Analysis rule by clicking the Configure button , and entering this information:
      • Addressing Engine Database: Select the Global Address Validation database resource configured in the Management Console.
      • Confidence Threshold: Enter the value of confidence threshold to detect fields falling below the specified value. The default value of this field is 80.
      • Table List: Select the table on which you would like to run this rule.
      • AddressLine1 field and Country field: Map these fields to the columns in the table you are analyzing.
      • Map the columns from your table to the LastLine, City, CitySubdivision, PostalCode, State, StateSubdivision, and FirmName fields. You do not necessarily need to enter column names in all the fields. However, the more specific you are the better the matching score will be.
        If you select this rule, the Data Profiling Results page shows and Address Summary tab, which displays:
        • International Address Confidence Distribution: The match score for the data. The score is color coded. Point anywhere in the donut chart to view the score range (0, 1-25, 26-50, 51-80, 81-99, and 100), with zero representing no match of the data in the database. Click the area in the chart to preview the matching or non-matching data.
        • International Address Precision: This distribution of validation levels of addresses, such as state, house, postal code, city, city sub-division, and street.
        • International Address Mismatched Fields Distribution: The distribution of mismatched fields of addresses, such as City Subdivision, State Province, Street Name, and Postal Code are displayed here. The confidence threshold of this address is less than the value defined by you while configuring the International Address Analysis rule.

Duplicate Record Analysis

This rule identifies duplicate records within a table and helps resolve them by using the smart rule functionality. If the table has any duplicate records, those will be displayed in groups in a separate Duplicate Record Analysis tab in the results. Let us understand it with this example.

Example

This is an example of how duplicate records are determined. Let us assume that the data you choose to profile contains the below information.
FirstName Address phone
Amanda 68846 John Island 949-727-4834
Aminda 2746 Thomas Ride Suite 252 949-727-4834
Amanda 68846 Jon Island 949-727-4834
Johnathan 2747 Thoms Ridge Suite 252 949-727-4834
manda 68846 Joan Island 949-727-4834
Bill 072 Courtney Mountains 142-025-5264
Judit 3317 Bryan Umion 675-223-2832
Juddith 830 Johnson Streen 675-223-2832
Charles 200 Mia Ramp Apt. 131 237-538-6959
Judith 3317 Brayn Union 675-223-2832
Judih 3317 Brain Union 838-045-9865
Based on the above input data, the duplicates are determined internally by the system based on matching algorithms when the records are compared, and the below results are displayed in groups in the Duplicate Record Analysis tab. It further provides the capabilities to resolve the duplicates too. To learn more, see Viewing Duplicate Records in Table Details in Profiling Results.
Note: Based on the internal matching algorithms, only the fields considered for finding duplicates are highlighted in the groups and displayed at the top.
Group1
FirstName Address phone
Amanda 68846 John Island 949-727-4834
Amanda 68846 Jon Island 949-727-4834
Group2
FirstName Address phone
Juddith 830 Johnson Streen 675-223-2832
Judith 3317 Brayn Union 675-223-2832

Service Configuration

Use this to consume the default Spectrum services and the services you have configured using the different Spectrum stages you licensed for. This configuration helps you reuse the transformations designed with various Spectrum stages on your data. Let us understand it with this example.

Example: You have used the Validate Address stage and the Conditional Router stage to filter out the validated addresses. You can use this transformation to profile your input data, so that profiling runs on the validated addresses. All you need to do is expose this service in Enterprise Designer and consume it here (in Spectrum Discovery) as Service Configuration rule. Let us see, how to do this.
  1. Select the Service Configuration rule, and click the corresponding gear icon .
  2. In the Service Configuration pop-up, specify these details.
    Field Description
    Services From the drop down list, select the required service. It lists all the exposed services.
    Note: You can add multiple services by clicking the Add Services link.
    Service input fields and Source input fields Displays the list of the fields used as input in the selected service. Every Service input field has a corresponding Source input fields. Use the Source input fields drop-down list to select the required field from your input data.
    Data in the selected field will be used as input when you run profile.
    Note: The data types of input fields and source fields with which those are mapped should be the same.
    Example:
    • The Service input fields are First_Name and Last_Name.
    • In the corresponding Source input fields, you select FirstName and LastName.
    Now, when you run profiling, the data in the FirstName and LastName columns in your input sources will be used for profiling. The transformations defined in the selected service will run on this input data and those will be profiled to give you the reports.
    Service output fields Shows the output fields in the selected service. Select the fields that you want in your profile output. You can use the type-ahead search box to find out the required fields.
    Note: If the service has List data type as output, you can't use that in profiling.
  3. Click OK.
  4. When you run profiling, the output will be displayed on the Data Profiling Results page under Name_Service. Click on any of the output fields to view the related statistics.
    Note: You can use the services and obtain profiling outputs even if you do not have a license for the respective module or your license has expired. The Profile Summary page will have an additional tab to indicate the license exception.

Custom Pattern Analysis

This rule identifies any pattern in the string column. You can configure as many regular expressions as you want to match your data against. To configure this rule when creating a profile, click the Configure icon , and enter these details:

  1. Key: Name of the pattern to be identified
  2. Value: The regular expression for the pattern

    For example, if you want to identify email addresses in string columns, enter the regular expression ^(.+)@(.+)$ in the Value field and email in the Key field.

  3. To add another expression, click the Add icon and add details of the next key value pair. You can add as many expressions as you would like to match the data against.

When you enable this rule, you will get this information:

  • Validity: The values that matched at least one of the regular expression patterns in the rule.
  • Pattern Match Distribution: The distribution of records that matched the regular expressions.

Outlier Analysis

Select this rule to detect the outliers in your profile. If the profile has any frequency, semantic type, pattern, length, or other outliers, those will be displayed in a separate Outlier Analysis tab in the results.