The Customer Data Parser (CDP) processes personal and business names for China, Korea, and Taiwan in three steps.
Step 1: Token Identification
The first step is to isolate words and phrases into tokens. This is called "Token Identification." Tokens may contain one or more characters (and/or symbols) that are identifiable as a word or word/phrase element. During the initial scan, the Parser uses commas or space characters in the input attribute to determine where one token ends and the next begins.
Example:
- China
  - Input Data: 吴卓霖,广东省广州市南华路22号,135800
  - Initial Token Results: 吴卓霖 (1 name token)
- Korea
  - Input Data: 홍길동, 서울시 강남구대치동 973-2 3층, 135-280
  - Initial Token Results: 홍길동 (1 name token)
- Taiwan
  - Input Data: 鄭淑珍,台北市四維路2號3樓,106
  - Initial Token Results: 鄭淑珍 (1 name token)
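The initial scan described above can be sketched in a few lines of Python. This is only an illustration of splitting on the delimiters named in the text (commas and space characters); the function name `identify_tokens` is hypothetical, and the Parser's actual scan is not documented here.

```python
import re

# Delimiters named in the text: commas and space characters.
# In Python 3, \s also matches Unicode whitespace such as the ideographic space.
_DELIMITERS = re.compile(r"[,\s]+")


def identify_tokens(attribute: str) -> list[str]:
    """Split an input attribute into initial tokens on commas and whitespace."""
    return [tok for tok in _DELIMITERS.split(attribute) if tok]


samples = {
    "China": "吴卓霖,广东省广州市南华路22号,135800",
    "Korea": "홍길동, 서울시 강남구대치동 973-2 3층, 135-280",
    "Taiwan": "鄭淑珍,台北市四維路2號3樓,106",
}

for country, line in samples.items():
    tokens = identify_tokens(line)
    # In these samples the first token is the name; the rest belong to the address.
    print(country, tokens[0])
```

In each sample the name arrives as a single token, which matches the "1 name token" results shown in the example.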
Step 2: Table Lookup
The second step is to scan each token against the Standard and Custom Parser Definitions tables (also known as lookup or word pattern tables). This process determines which tokens are personal names and which are business names. It also identifies the surname character(s) and uncovers new tokens based on the lookup results. During this process, every word element that can be further identified as part of a name (for example, a surname or a given name) becomes a separate token.
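Example:
- China
  - Previous Results: 吴卓霖
  - New Results: 吴|卓霖
  - Reasoning: Based on surname lookup
- Korea
  - Previous Results: 홍길동
  - New Results: 홍|길동
  - Reasoning: Based on surname lookup
- Taiwan
  - Previous Results: 鄭淑珍
  - New Results: 鄭|淑珍
  - Reasoning: Based on surname and given name lookup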
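As a rough illustration of how a surname lookup can split one name token into two, the sketch below tries the longest matching prefix first. The `SURNAMES` set and the `split_name` function are hypothetical stand-ins; the real Standard and Custom Parser Definitions tables and the Parser's matching rules are not shown in this topic.

```python
# Hypothetical surname table for illustration only. The compound surname
# "欧阳" is included to show why the longest prefix is tried first.
SURNAMES = {"吴", "홍", "鄭", "欧阳"}


def split_name(token: str, surnames: set[str] = SURNAMES) -> list[str]:
    """Split a name token into [surname, given name] by longest-prefix lookup."""
    max_len = max(map(len, surnames), default=0)
    # Leave at least one character for the given name.
    for length in range(min(max_len, len(token) - 1), 0, -1):
        prefix = token[:length]
        if prefix in surnames:
            return [prefix, token[length:]]
    return [token]  # no surname found: keep the token unchanged


for name in ("吴卓霖", "홍길동", "鄭淑珍"):
    print("|".join(split_name(name)))
```

A token with no matching surname is returned unchanged, so later processing can still handle it as a single token.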
Step 3: Output
In the final step, the Parser passes along a comprehensive data block called the PREPOS (Parser Repository). The PREPOS contains the parsed data, including error codes, identification indicators, and name information. The output schema determines which of these attributes are written to the output.
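The relationship between the PREPOS and the output schema can be pictured as a simple projection: the schema names the attributes to keep. The sketch below assumes a dictionary-shaped record; the field names and the `apply_output_schema` function are invented for illustration and do not reflect the actual PREPOS layout.

```python
# Hypothetical PREPOS record; the attribute names are illustrative only.
prepos = {
    "error_code": 0,
    "name_type": "personal",
    "surname": "吴",
    "given_name": "卓霖",
}


def apply_output_schema(record: dict, schema: list[str]) -> dict:
    """Keep only the attributes named in the output schema, in schema order."""
    return {field: record[field] for field in schema if field in record}


# An output schema that returns just the name attributes.
output = apply_output_schema(prepos, ["surname", "given_name"])
print(output)
```

Attributes that are not listed in the schema stay in the PREPOS but do not appear in the output.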
See the following topics to set up and run the Customer Data Parser process:
- Schema Editor
- Input Settings
- Options
- Output Settings
- Running the Customer Data Parser
- Reviewing Output
- Tuning the Parser Rules