Business Data Parser - trillium_discovery - trillium_quality - 17.1

Trillium Control Center

Product type: Software
Portfolio: Verify
Product family: Trillium
Product: Trillium > Trillium Discovery; Trillium > Trillium Quality
Version: 17.1
Language: English
Product name: Trillium Quality and Discovery
Title: Trillium Control Center
Topic type: Overview, Administration, Configuration, Installation, Reference, How Do I
First publish date: 2008

The Business Data Parser (BDP) process identifies, verifies, and standardizes data other than names and addresses, using word, phrase, and pattern definitions that you create and customize to meet your business requirements. The BDP parses and standardizes data into a maximum of 500 categories (output attributes) to accommodate a wide range of industries. For example, you can use the BDP to cleanse and standardize automobile insurance data such as claim code, class, year, make, model, and account number.

The BDP enables you to perform these functions:

  • Identify words and phrases in free-form text
  • Organize unique words and phrases in up to 500 categories
  • Convert unstructured text fields of up to 10,000 characters into a series of separate, structured attributes
  • Correct misspellings and standardize data by recoding words and phrases
  • Test and preview parsing changes before making them permanent
  • Collect process statistics to identify potential issues and areas of improvement

Business Data Parser Process

Step 1: The BDP identifies each word and phrase from the input data and compares them to word and phrase definitions in the Customized Definitions table. If a word or phrase is not specified in the table, the BDP assigns it an intrinsic attribute.
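
The lookup in Step 1 can be pictured with a small Python sketch. It is an illustration only, not the BDP's implementation: the DEFINITIONS dictionary is a hypothetical stand-in for the Customized Definitions table, and the NUMERIC, ALPHA, and MIXED labels are assumed placeholders for intrinsic attributes, whose real codes are not given here. The sketch tries the longest phrase first so that a two-word phrase is not split into single words.

```python
# Hypothetical stand-in for the Customized Definitions table:
# maps an upper-cased word or phrase to (category, recode).
DEFINITIONS = {
    "FORD": ("MAKE", "FORD"),
    "CHEVY": ("MAKE", "CHEVROLET"),
    "PICK UP": ("CLASS", "PICKUP"),
}
MAX_PHRASE_WORDS = 2  # longest phrase in this toy table


def intrinsic_attribute(token):
    """Assumed fallback labels for undefined tokens (not Trillium's codes)."""
    if token.isdigit():
        return "NUMERIC"
    if token.isalpha():
        return "ALPHA"
    return "MIXED"


def identify(text):
    """Return (input_text, category_or_intrinsic, recode_or_None) per item."""
    words = text.split()
    items, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            hit = DEFINITIONS.get(phrase.upper())
            if hit:
                items.append((phrase, hit[0], hit[1]))
                i += n
                break
        else:
            # No definition found: fall back to an intrinsic attribute.
            items.append((words[i], intrinsic_attribute(words[i]), None))
            i += 1
    return items
```

Calling identify("2004 Chevy pick up") on this toy table yields one undefined numeric token followed by a MAKE item and a CLASS item.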

Step 2: When the BDP finds defined words and phrases in the table, it assigns them to output categories that you define in the Customized Definitions table (you can define up to 500 categories). Patterns are defined based on the sequence of categories on a row of data. The BDP looks up word combinations in the table and matches them to the patterns and substring patterns. If no match exists, the BDP considers this a bad pattern and writes the details to the exceptions file.
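
The pattern check in Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions: KNOWN_PATTERNS is a hypothetical pattern table, the exceptions list stands in for the exceptions file, and substring patterns are left out for brevity.

```python
# Hypothetical pattern table: each pattern is a sequence of categories.
KNOWN_PATTERNS = {
    ("YEAR", "MAKE", "MODEL"),
    ("MAKE", "MODEL"),
}


def check_pattern(row_id, categories, exceptions):
    """Return True if the row's category sequence is a known pattern.

    Rows with no matching pattern are treated as bad patterns and
    appended to `exceptions`, which stands in for the exceptions file.
    """
    pattern = tuple(categories)
    if pattern in KNOWN_PATTERNS:
        return True
    exceptions.append({"row": row_id, "pattern": " ".join(pattern)})
    return False
```

A row whose categories arrive in an undefined order, such as MODEL followed by YEAR, ends up in the exceptions list rather than in the parsed output.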

Step 3: Standardized data is assigned up to 500 output attributes with the default names BP_USER1 through BP_USER500. These attributes correspond to the output categories you define in the Customized Definitions table. You can rename these output attributes to be more meaningful and easier to work with; for example, if the attribute BP_USER4 contains product name recodes, you could change it to Recode_Product_Name.
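
As a rough illustration of the naming scheme in Step 3, the sketch below generates the default attribute names and applies a rename. The helpers default_names and rename are hypothetical and are not part of the BDP Schema Editor; only the BP_USER1 through BP_USER500 range and the Recode_Product_Name example come from the text above.

```python
MAX_CATEGORIES = 500  # limit stated in the text


def default_names():
    """Return the default names BP_USER1 through BP_USER500."""
    return [f"BP_USER{n}" for n in range(1, MAX_CATEGORIES + 1)]


def rename(names, mapping):
    """Return a copy of `names` with the renames in `mapping` applied."""
    unknown = set(mapping) - set(names)
    if unknown:
        # Refuse to rename attributes that do not exist.
        raise KeyError(f"unknown attributes: {sorted(unknown)}")
    return [mapping.get(name, name) for name in names]
```

For example, rename(default_names(), {"BP_USER4": "Recode_Product_Name"}) replaces only the fourth name and leaves the other 499 unchanged.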

For each record, the BDP generates the following three output attributes:

  • BP_USERxx. The USERxx attributes contain the standardized data.
  • BP_USERxx_DISPLAY. The DISPLAY attributes contain the original data that was standardized in BP_USERxx.
  • BP_USERxx_ORIGINAL. The ORIGINAL attributes contain the original data in the same mixed case as the input data.

Data that the parser does not recognize is assigned to the bp_misc_data (miscellaneous) attribute.
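
The sketch below shows one way to picture how these attributes and bp_misc_data could be filled for a single record. It is a hypothetical illustration, not the BDP's logic. In particular, it assumes that DISPLAY holds an upper-cased copy of the matched input text; the text above only states that ORIGINAL keeps the mixed case of the input.

```python
def build_record(items):
    """Build one output record.

    `items` is a list of (input_text, category_number_or_None, recode).
    A category number of None marks data the parser did not recognize.
    """
    record = {}
    misc = []
    for text, number, recode in items:
        if number is None:
            misc.append(text)  # unrecognized data goes to bp_misc_data
            continue
        base = f"BP_USER{number}"
        record[base] = recode  # standardized value
        # Assumption: DISPLAY is an upper-cased copy of the matched text.
        record[base + "_DISPLAY"] = text.upper()
        record[base + "_ORIGINAL"] = text  # input case preserved
    record["bp_misc_data"] = " ".join(misc)
    return record
```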

Note: If the word definition table is imported from the Discovery Library or a client system, all related BP_USERxx attribute names are automatically replaced with names in the format category_RECODE, category, and category_DATA_PRESENT, respectively. For example, for a category called "YEAR," the output attribute names are YEAR_RECODE, YEAR, and YEAR_DATA_PRESENT.
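
The renaming described in this note can be sketched as a simple mapping. The function imported_name and the categories dictionary are hypothetical. The suffix mapping follows the order given in the note, on the assumption that "respectively" pairs BP_USERxx, BP_USERxx_DISPLAY, and BP_USERxx_ORIGINAL with category_RECODE, category, and category_DATA_PRESENT.

```python
import re

# Assumed pairing, taken from the order in which the note lists the names.
SUFFIX_MAP = {"": "_RECODE", "_DISPLAY": "", "_ORIGINAL": "_DATA_PRESENT"}
_ATTRIBUTE = re.compile(r"^BP_USER(\d+)(_DISPLAY|_ORIGINAL)?$")


def imported_name(attribute, categories):
    """Map a BP_USERxx-style name to its category-based name.

    `categories` maps a user attribute number to a category name.
    Names that do not match, or have no category, are returned as-is.
    """
    match = _ATTRIBUTE.match(attribute)
    if not match:
        return attribute
    number = int(match.group(1))
    suffix = match.group(2) or ""
    category = categories.get(number)
    if category is None:
        return attribute
    return category + SUFFIX_MAP[suffix]
```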

Step 4: The customizable output schema determines which of the attributes are returned to the output. BDP output includes original data and recoded (standardized) data. The BDP passes output data to the Business Data Parser Repository (BPREPOS). The BPREPOS consists of fixed-field character data results including error codes and identification indicators. By default, the BDP output schema contains all of the attributes from the BPREPOS. Use the BDP Schema Editor to remove any unwanted attributes from the output. For information about how to view parser results, see Viewing Business Data Results.
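
The effect of the output schema in Step 4 can be illustrated with a short sketch. It assumes a record is a plain dictionary of attribute names to values and that the schema is a list of the attribute names to keep; apply_schema is a hypothetical helper, not a function of the BDP Schema Editor.

```python
def apply_schema(record, schema=None):
    """Return only the attributes named in `schema`.

    With no schema, every attribute is returned, mirroring the default
    in which the output schema contains all repository attributes.
    """
    if schema is None:
        return dict(record)
    return {name: record[name] for name in schema if name in record}
```

Removing an attribute from the schema list drops it from the output without changing the underlying record.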

Guidelines

Note the following when working with business data in the Control Center:

  • The BDP process is available in Business Data Projects only. For information about projects, see Creating a Business Data Project.
  • Run the Transformer process first. The output attributes from the Transformer are used as input to the BDP. If necessary, make modifications to Transformer attributes and rules. For example, you can remove unnecessary data from the attribute to be parsed, combine multiple data sources into one attribute name, and standardize some of the data to make it more uniform.
  • Use the BDP Schema Editor to make changes to your attributes. You can also use the Transformer Schema Editor to make changes to your input attributes.
  • Use the BDP Parser Tuner to review problems reported by the BDP and update the Customized Definitions word and pattern tables to identify input data elements.
  • Business data cannot be tuned using the external Parser Tuner.
  • If you are importing a BDP project created in an earlier Trillium version, you must run Parser Customization to regenerate the Customized Definitions table before running the BDP, because the numeric codes for the user attributes have changed.