TOKENIZE With Modifier (PHRASE2)

Trillium Control Center

Product type: Software
Portfolio: Verify
Product family: Trillium
Product: Trillium > Trillium Discovery
Version: Latest
Language: English
Product name: Trillium Quality and Discovery
Title: Trillium Control Center
Copyright: 2024
First publish date: 2008
Last updated: 2024-10-18
Published on: 2024-10-18T15:02:04.502478

The PHRASE2 modifier attempts to improve the comparison by concatenating tokens into phrases before matching. When this modifier is specified, the Relationship Linker performs the following steps:

  1. Performs the regular TOKENIZE routine and obtains the base score.
  2. If the base score is less than 98, and there are multiple tokens to concatenate in Field 1, and there are unmatched tokens in Field 2, it attempts the PHRASE2 matching and obtains the score.
  3. If the resulting score is less than 98, and there are multiple tokens to concatenate in Field 2, and there are unmatched tokens in Field 1, it attempts the reverse PHRASE2 matching and obtains the final score.
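The three steps above can be sketched as a small control-flow function. This is an illustration only, not the Relationship Linker's implementation: `should_attempt`, `phrase2_flow`, and the `forward`/`reverse` callables are hypothetical names, and the sketch assumes that "tokens to concatenate" means the tokens still unmatched after the regular TOKENIZE routine. Only the threshold of 98 and the three conditions come from the steps.

```python
THRESHOLD = 98  # from the steps above: PHRASE2 runs only while the score is below 98


def should_attempt(score, concat_candidates, unmatched_other):
    """Return True when a PHRASE2 pass is warranted.

    concat_candidates: number of tokens available to concatenate on one side
    unmatched_other:   number of unmatched tokens on the opposite side
    """
    return score < THRESHOLD and concat_candidates >= 2 and unmatched_other >= 1


def phrase2_flow(base_score, unmatched1, unmatched2, forward, reverse):
    """Apply steps 2 and 3 to a base score obtained in step 1.

    unmatched1 / unmatched2: tokens left unmatched in Field 1 / Field 2.
    forward / reverse: hypothetical callables that perform one matching pass
    and return (new_score, new_unmatched1, new_unmatched2).
    """
    score = base_score
    # Step 2: forward pass, concatenating tokens in Field 1.
    if should_attempt(score, len(unmatched1), len(unmatched2)):
        score, unmatched1, unmatched2 = forward(score, unmatched1, unmatched2)
    # Step 3: reverse pass, concatenating tokens in Field 2.
    if should_attempt(score, len(unmatched2), len(unmatched1)):
        score, unmatched1, unmatched2 = reverse(score, unmatched1, unmatched2)
    return score


def stub_pass(score, unmatched1, unmatched2):
    # Hypothetical stand-in for a real matching pass: pretends one phrase
    # matched, adds a weight of 15, and leaves no unmatched tokens.
    return score + 15, [], []


print(phrase2_flow(64, ["Trillium", "Software"], ["TrilliumSoftware"],
                   stub_pass, stub_pass))  # 79
```

With the stub, the forward pass consumes every unmatched token, so the condition for the reverse pass fails and the score is returned unchanged after step 2.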

Each matched phrase increases the final score by a weight, which is calculated as described below.

Table 1. Scoring for TOKENIZE (Modifier = PHRASE2)

Score                  Description
Base score + Weight    Sum of the base score and weight.
                       Weight is calculated by: (Number of matches) * (Weight factor) * 3

Weight Factor

Difference in Number of Tokens in Fields    Weight Factor
0                                           5
1                                           4
2                                           3
3                                           2
4+                                          1

Note: The final score cannot exceed 98.
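The scoring table can be expressed as a short calculation. The following is a minimal sketch; the function names (`weight_factor`, `phrase2_weight`, `final_score`) are hypothetical, while the factor table, the multiplier of 3, and the cap of 98 are taken from Table 1 and the note above.

```python
MAX_SCORE = 98  # the final score cannot exceed 98


def weight_factor(token_diff):
    """Look up the weight factor for a difference in the number of tokens."""
    table = {0: 5, 1: 4, 2: 3, 3: 2}
    return table.get(token_diff, 1)  # a difference of 4 or more yields 1


def phrase2_weight(matches, token_diff):
    """Weight = (number of matches) * (weight factor) * 3."""
    return matches * weight_factor(token_diff) * 3


def final_score(base_score, matches, token_diff):
    """Base score plus weight, capped at 98."""
    return min(base_score + phrase2_weight(matches, token_diff), MAX_SCORE)


print(final_score(64, matches=1, token_diff=0))  # 79
print(final_score(90, matches=1, token_diff=0))  # 98 (capped)
```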

Example 1 - no reverse matching

Field 1: ABC Company Trillium Software

Field 2: ABC Company TrilliumSoftware

In this case, Field 1 has 4 tokens and Field 2 has 3 tokens. The Relationship Linker first performs the regular TOKENIZE routine.

Field 1              Field 2
Token 1: ABC         Token 1: ABC
Token 2: Company     Token 2: Company
Token 3: Trillium    Token 3: TrilliumSoftware
Token 4: Software

Attempted matches would be:

Field 1           Field 2             Result
ABC         vs    ABC                 Match
Company     vs    Company             Match
Trillium    vs    TrilliumSoftware    No match
Software    vs    TrilliumSoftware    No match

The matched percentage is calculated as 2 (number of matched tokens) divided by 3 (number of tokens in the shorter field, Field 2), which yields 67 after rounding. A deduction of 3 is applied for the 1 extra token in the longer field, Field 1, so the final base score is 67 - 3 = 64.
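The base score arithmetic above can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function name `base_score` is hypothetical, rounding to the nearest integer is inferred from 2/3 becoming 67, and a deduction of 3 per extra token is inferred from this single example with one extra token. The regular TOKENIZE routine may apply rules not shown here.

```python
def base_score(matched, tokens_field1, tokens_field2, deduction_per_extra=3):
    """Matched percentage over the shorter field, minus a deduction for extra tokens.

    Assumptions (not confirmed by the documentation): the percentage is rounded
    to the nearest integer, and the deduction scales with the number of extra
    tokens in the longer field.
    """
    shorter = min(tokens_field1, tokens_field2)
    extra = abs(tokens_field1 - tokens_field2)
    percentage = round(100 * matched / shorter)
    return percentage - deduction_per_extra * extra


# Field 1 has 4 tokens, Field 2 has 3 tokens, and 2 tokens matched.
print(base_score(matched=2, tokens_field1=4, tokens_field2=3))  # 64
```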

Since the base score is less than 98, and there are two tokens left to concatenate in Field 1 ("Trillium" and "Software"), and there is one unmatched token in Field 2 ("TrilliumSoftware"), the Relationship Linker attempts the PHRASE2 matching.

Field 1                    Field 2             Result
Trillium+Software    vs    TrilliumSoftware    Match

Due to the concatenation of tokens in Field 1, the number of tokens in Field 1 is 1 (TrilliumSoftware) and the number of tokens in Field 2 is 1 (TrilliumSoftware). The number of matches is 1 (TrilliumSoftware vs TrilliumSoftware), and the difference in tokens is 1 - 1 = 0, which yields a weight factor of 5. The total weight is 1 * 5 * 3 = 15. Adding this weight to the base score of 64 yields a final score of 64 + 15 = 79. In this example, there are no unmatched tokens left at this point, so the reverse matching is not attempted.
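The phrase-matching pass in this example can be illustrated as follows. This is a rough sketch, not the product's algorithm: the helper names `unmatched` and `phrase_matches` are hypothetical, and the sketch assumes exact, case-sensitive comparison and concatenation of two neighboring unmatched tokens at a time, which is enough to reproduce the example.

```python
def unmatched(tokens, other):
    """Return the tokens that have no exact counterpart in `other`."""
    remaining = list(other)
    leftover = []
    for token in tokens:
        if token in remaining:
            remaining.remove(token)  # each counterpart can be used once
        else:
            leftover.append(token)
    return leftover


def phrase_matches(concat_side, other_side):
    """Count phrases built from unmatched tokens on `concat_side` that equal
    an unmatched token on `other_side` (assumed: pairs of neighboring tokens)."""
    left = unmatched(concat_side, other_side)
    right = unmatched(other_side, concat_side)
    if len(left) < 2 or not right:
        return 0  # nothing to concatenate, or nothing to compare against
    matches = 0
    i = 0
    while i < len(left) - 1:
        phrase = left[i] + left[i + 1]
        if phrase in right:
            right.remove(phrase)
            matches += 1
            i += 2  # both tokens were consumed by the phrase
        else:
            i += 1
    return matches


field1 = "ABC Company Trillium Software".split()
field2 = "ABC Company TrilliumSoftware".split()

matches = phrase_matches(field1, field2)
print(matches)               # 1
print(64 + matches * 5 * 3)  # 79: base score 64 plus weight 1 * 5 * 3
```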

Example 2 - reverse matching

Field 1: ABC Company TrilliumSoftware

Field 2: ABC Company Trillium Software

In this case, Field 1 has 3 tokens and Field 2 has 4 tokens. The Relationship Linker first performs the regular TOKENIZE routine.

Field 1                      Field 2
Token 1: ABC                 Token 1: ABC
Token 2: Company             Token 2: Company
Token 3: TrilliumSoftware    Token 3: Trillium
                             Token 4: Software

Attempted matches would be:

Field 1                   Field 2     Result
ABC                 vs    ABC         Match
Company             vs    Company     Match
TrilliumSoftware    vs    Trillium    No match
TrilliumSoftware    vs    Software    No match

The matched percentage is calculated as 2 (number of matched tokens) divided by 3 (number of tokens in the shorter field, Field 1), which yields 67 after rounding. A deduction of 3 is applied for the 1 extra token in the longer field, Field 2, so the final base score is 67 - 3 = 64. Because Field 1 has only one unmatched token, there are not multiple tokens to concatenate in Field 1, and the forward PHRASE2 matching is skipped. The score is still less than 98, so the reverse matching is attempted.

In this case, there are two tokens to concatenate in Field 2 ("Trillium" and "Software"), and there is one unmatched token in Field 1 ("TrilliumSoftware").

Field 2                    Field 1             Result
Trillium+Software    vs    TrilliumSoftware    Match

Due to the concatenation of tokens in Field 2, the number of tokens in Field 2 is 1 (TrilliumSoftware) and the number of tokens in Field 1 is 1 (TrilliumSoftware). The number of matches is 1 (TrilliumSoftware vs TrilliumSoftware), and the difference in tokens is 1 - 1 = 0, which yields a weight factor of 5. The total weight is 1 * 5 * 3 = 15. Adding this weight to the base score of 64 yields a final score of 64 + 15 = 79.