Parsing E-mail Addresses - spectrum_quality_1 - 23.1

Spectrum Data Quality Guide

Product type
Product family
Spectrum > Quality > Spectrum Quality
Product name
Spectrum Data Quality
Spectrum Data Quality Guide
Topic type
How Do I
First publish date

This template demonstrates how to parse e-mail addresses into component parts. The parsing rule separates each token in the Email field and copies each token to three fields: Local-Part, DomainName, and DomainExtension. Local-Part represents the domain name part of the e-mail address, DomainName represents the domain name of the e-mail address, and DomainExtension represents the domain extension of the e-mail address. For example, in, "precisely" is the domain name and "com" is the domain extension.

The internet is a great source of public domain information that can aid you in your open parsing tasks. In this example, e-mail formatting information was obtained from various internet resources and was then imported into Table Management to create a table of domain values. The domain extension task that you will perform in this template activity demonstrates the usefulness of this method.

This template also demonstrates how to effectively use table data that you load into Table Management to perform table look-ups as part of your parsing tasks.

Business Scenario

You work for an insurance company that wants to do its first e-mail marketing campaign. Your database contains e-mail addresses of your customers and you have been asked to find a way to make sure that those e-mail addresses are in a valid SMTP format.

Before you create this dataflow, you will need to load a table of valid domain names extensions in Table Management so that you can look up domain name extensions as part of the validation process.

The following dataflow provides a solution to the business scenario:

Business scenario solution workflow

This dataflow template is available in Enterprise Designer. Go to File > New > Dataflow > From template and select ParseEmail. This dataflow requires Data Normalization.

In this dataflow, data is read from a file and processed through the Open Parser stage. For each data row in the input file, this dataflow will do the following:

Create a Domain Extension Table

The first task is to create an Open Parser table in Table Management that you can use to check if the domain extensions in your e-mail addresses are valid.

  1. From the Tools menu, select Table Management.
  2. In the Type list, select Open Parser.
  3. Click New.
  4. In the Add User Defined Table dialog box, type EmailDomains in the Table Name field, make sure that None is selected in the Copy from list, and then click OK.
  5. With EmailDomains displayed in the Name list, click Import.
  6. In the Import dialog box, click Browse and locate the source file for the table. The default location is: <drive>:\Program Files\Precisely\Spectrum\server\modules\coretemplates\data\ Email_Domains.txt. Table Management displays a preview of the terms contained in the import file.
  7. Click OK. Table Management imports the source files and displays a list of internet domain extensions.
  8. Click Close. The EmailDomains table is created. Now create the dataflow using the ParseEmail template.

Read from File

This stage identifies the file name, location, and layout of the file that contains the e­mail addresses you want to parse.

Open Parser

The Open Parser stage parsing grammar defines the following commands and expressions:

  • %Tokenize is set to None. When Tokenize is set to None, the parsing grammar rule must include any spaces or other token separators within its rule definition.
  • %InputField is set to parse input data from the Email_Address field.
  • %OutputFields is set to copy parsed data into three fields: Local-Part, DomainName, and DomainExtension.
  • The root expression defines the pattern of tokens being parsed:
<root> = <Local-Part>"@"<DomainName>"."<DomainExtension>;

The rule variables that define the domain must use the same names as the output fields defined in the required OutputFields command.

  • The remainder of the parsing grammar defines each of the rule variables as expressions.
<Local-Part> = (<alphanum> ".")* <alphanum> | (<alphanum> "_")* <alphanum> ;
<DomainName> = (<alphanum> ".")? <alphanum>;
<DomainExtension> = @Table("EmailDomains")* "."? @Table("EmailDomains");

The <Local-Part> variable is defined as a string of text that contains the <alphanum> variable, the period character, and another <alphanum> variable.

The <alphanum> variable definition is a regular expression that means any string of characters from A to Z, a to a, and 0-9. The <alphanum> variable is used throughout this parsing grammar and is defined once on the last line of the parsing grammar.

The parsing grammar uses a combination of regular expressions and literal characters to build a pattern for e-mail addresses. Any characters in double quotes in this parsing grammar are literal characters, the name of a table used for lookup, or a regular expression. The parsing grammar uses these special characters:

  • The "+" character means that a regular expression can occur one or more times.
  • The "?" character means that a regular expression can occur zero or one time.
  • The "|" character means that the variable has an OR condition.
  • The ";" character means end of a rule.

Use the Commands tab to explore the meaning of the other special symbols you can use in parsing grammars by hovering the mouse over the description.

To test the parsing grammar, click the Preview tab. Type the e-mail addresses shown below in the Email Address field and then click Preview.

Preview tab in Open Parse Options dialog box

You can also type other e-mail addresses to see how the input data is parsed.

You can also use the Trace feature to see a graphical representation of either the final parsing results or to step through the parsing events. Click the link in the Trace column to see the Trace Details for the data row.

Trace Details shows a matching result. Compare the tokens matched for each expression in the parsing grammar.

Tokens in Trace Details

You can also use Trace to view non-matching results. The following graphic shows a non- matching result. Compare the tokens matched for each expression in the parsing grammar. The reason that this input data ( did not match is because it did not contain all of the required tokens to match—there is no @ character separating the Local- Part token and the Domain tokens.

Non-matching tokens

Write to File

The template contains one Write to File stage. In addition to the input field, the output file contains the Local-Part, DomainName, DomainExtension, IsParsed, and ParserScore fields.