The TW Postal Matcher uses a four-step process to identify and match postal information.
Step 1: Initial Parsing
The first step in the Postal Matching process isolates all words and phrases by breaking up the input attribute(s) into recognizable tokens. During the initial scan, the Postal Matcher uses commas or space characters in the input attribute to determine where one token ends and the next begins.
Example
Input record: 鄭淑珍, 台北市四維路 2號 3樓, 106
Initial token results: (six tokens)
Token 1 | Token 2 | Token 3 | Token 4 | Token 5 | Token 6 |
---|---|---|---|---|---|
鄭淑珍 | 台北市 | 四維路 | 2號 | 3樓 | 106 |
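The delimiter rule described in this step can be sketched in a few lines of Python. This is only an illustration: `initial_parse` is a hypothetical helper, not part of the Postal Matcher, and it applies nothing but the comma and space rule. Because of that, it keeps 台北市四維路 together as a single token and does not reproduce the separate 台北市 and 四維路 tokens shown in the table.

```python
import re

def initial_parse(record: str) -> list[str]:
    """Split a raw input record into tokens at commas and whitespace.

    Hypothetical helper for illustration only; it is not the
    Postal Matcher's own parsing routine.
    """
    # One or more commas or whitespace characters act as a single delimiter.
    parts = re.split(r"[,\s]+", record)
    # Discard empty strings left by leading or trailing delimiters.
    return [p for p in parts if p]

tokens = initial_parse("鄭淑珍, 台北市四維路 2號 3樓, 106")
print(tokens)
```

Any splitting beyond commas and spaces would need additional rules, which this sketch deliberately leaves out.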
Step 2: Table-Based Tokenizing
After the initial tokens are created, the Postal Matcher compares each token against the Parser Definition tables to identify it. During this second pass, any token that contains elements matching Parser Definition entries is split further, so that each identified element becomes its own token.
Example
Token results of previous step: (6 tokens)
鄭淑珍, 台北市四維路 2號 3樓, 106
Token results of this step: (9 tokens)
Previous Results | New Results | Reasoning |
---|---|---|
鄭淑珍 | 鄭 / 淑珍 | Based on surname lookup |
台北市 | 台北市 | Based on L1 lookup |
四維路 | 四維路 | Based on L4 lookup |
2號 | 2 / 號 | Recognized as house number based on table lookup |
3樓 | 3 / 樓 | Recognized as floor number based on table lookup |
106 | 106 | See Step 3. |
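The table lookup can be pictured with the following minimal Python sketch. The dictionaries `SURNAMES`, `LOOKUP`, and `UNIT_SUFFIXES` are hypothetical stand-ins for Parser Definition entries, and the labels (including "given name" and "suffix") are assumptions chosen for illustration. The real tables, their format, and the matching order are not described here.

```python
import re

# Hypothetical stand-ins for Parser Definition entries (assumptions).
SURNAMES = {"鄭"}
LOOKUP = {"台北市": "L1", "四維路": "L4"}
UNIT_SUFFIXES = {"號": "house number", "樓": "floor number"}

def table_tokenize(tokens: list[str]) -> list[tuple[str, str]]:
    """Return (token, label) pairs, splitting tokens the tables can break up."""
    results = []
    for tok in tokens:
        # Pattern for digits followed by exactly one non-digit character.
        unit = re.fullmatch(r"(\d+)(\D)", tok)
        if tok in LOOKUP:
            # Whole token found in a table: keep it and record the table name.
            results.append((tok, LOOKUP[tok]))
        elif len(tok) > 1 and tok[0] in SURNAMES:
            # Leading character is a known surname: split the name in two.
            results.append((tok[0], "surname"))
            results.append((tok[1:], "given name"))
        elif unit and unit.group(2) in UNIT_SUFFIXES:
            # Number plus a known suffix such as 號 or 樓: split into two tokens.
            results.append((unit.group(1), UNIT_SUFFIXES[unit.group(2)]))
            results.append((unit.group(2), "suffix"))
        else:
            # Nothing matched: leave the token for the mask-based step.
            results.append((tok, "unknown"))
    return results

step1 = ["鄭淑珍", "台北市", "四維路", "2號", "3樓", "106"]
for token, label in table_tokenize(step1):
    print(token, label)
```

With these sample tables, the six input tokens become nine, and 106 stays labeled "unknown" so that a later step can examine it.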
Step 3: Mask-Based Data Identification
Any token that remains unknown after the table lookup process is then compared against a set of predefined masks (data shapes) in the Parser Definition table.
Example
The token 1100000 is identified as a postcode because it matches the mask entry pr_postcode: 1100000.
1100000 | pr_postcode: 1100000 | Recognized as postcode based on mask lookup. |
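One way to picture a mask lookup is to reduce each unknown token to its data shape and look that shape up in a table. The Python sketch below assumes a made-up shape notation (`N` for a digit, `A` for a letter) and a hypothetical `MASKS` table. The actual mask syntax used in the Parser Definition table is not shown in this section, so none of these names should be read as the product's format.

```python
def data_shape(token: str) -> str:
    """Reduce a token to a shape string: N for digits, A for letters.

    The N/A notation is an assumption made for this sketch.
    """
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("N")
        elif ch.isalpha():
            shape.append("A")
        else:
            # Punctuation and other characters are kept as they are.
            shape.append(ch)
    return "".join(shape)

# Hypothetical mask table: shape -> data type (assumed entries).
MASKS = {
    "NNN": "postcode",
    "NNNNNNN": "postcode",
}

def identify_by_mask(token: str) -> str:
    """Return the data type whose mask matches the token's shape."""
    return MASKS.get(data_shape(token), "unknown")

for token in ["106", "1100000", "淑珍"]:
    print(token, data_shape(token), identify_by_mask(token))
```

Matching on shape rather than on exact values lets a single entry cover every token with the same layout, which is why this kind of check suits tokens that cannot be listed in a lookup table.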
Step 4: Output to PREPOS
The Postal Matcher passes the results of the data identification process to the PREPOS program. See Analyzing the Postal Matcher Results for Asian Countries.