SOUNDEX1 Routines Comparison

Product type: Software
Portfolio: Verify
Product family: Trillium
Product: Trillium > Trillium Discovery
Version: Latest
Language: English
Product name: Trillium Quality and Discovery
Title: Trillium Control Center
Copyright: 2024
First publish date: 2008
Last updated: 2024-10-18
Published on: 2024-10-18T15:02:04.502478

This section outlines the differences in the SOUNDEX1 comparison routines between the Series 7 matcher and winkey modules and their TS Quality (TSQ) counterparts. Key differences include the padding method, the handling of duplicate adjacent character values, and the treatment of spaces when evaluating characters. Updates in TSQ v17.3.0 aim to provide consistent results and address prior deficiencies.

| Change type | Description |
| --- | --- |
| Introduced in version 17.3 | A new section that provides a comparative analysis of SOUNDEX1 routines across Trillium Series versions. |

For Series 7, the matcher and winkey modules used different implementations of the SOUNDEX1 comparison routine, which yielded different results. These differences are explained below, with examples.

  1. Padding of Results with Less Than 4 Characters

    • Matcher pads with the zero character.
      • Example: The soundex result for ‘LEE’ is ‘L’; the matcher result is ‘L000’.
    • Winkey pads with the space or blank character.
      • Example: The soundex result for ‘LEE’ is ‘L’; the winkey result is ‘L   ’ (‘L’ followed by three spaces).
  2. Eliminating Duplicate Adjacent Character Values

    • Matcher eliminates duplicates after character values have been recoded.
      • Example: The matcher result for ‘PFISTER’ is ‘P236’; ‘P’ and ‘F’ are both coded to ‘1’ and identified as duplicates, so ‘F’ is eliminated from the result.
    • Winkey eliminates duplicates before character values have been recoded.
      • Example: The winkey result for ‘PFISTER’ is ‘P123’; ‘P’ and ‘F’ are both coded to ‘1’, but the duplicate value is not eliminated because the original letters ‘P’ and ‘F’ are different.
  3. Converting Vowels to Spaces and Eliminating the Spaces When Considering Adjacent Character Values

    • Matcher removes spaces before eliminating duplicates, starting after the first character value.

      Example of algorithm steps for ‘JACKSON’:

      • Characters are coded as: J = 2, A = space, C = 2, K = 2, S = 2, O = space, N = 5.
      • Spaces are removed: J = 2, C = 2, K = 2, S = 2, N = 5.
      • Eliminate duplicates: J = 2, C = 2, N = 5.
      • Result: ‘J25’.
      • Result padded with zero: ‘J250’.
    • Winkey does not remove spaces before eliminating duplicates, starting after the first character value.

      Example of algorithm steps for ‘JACKSON’:

      • Characters are coded as: J = 2, A = space, C = 2, K = 2, S = 2, O = space, N = 5.
      • No duplicates are eliminated; the first four character values are kept: J = 2, C = 2, K = 2, S = 2.
      • Result: ‘J222’.
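The three differences above can be sketched in Python. This is a minimal illustration, not the Series 7 source: the function names, the letter-to-digit table (the standard Soundex groups), and the treatment of H, W, and Y as spaces are all assumptions. The sketch models only the padding character and the point at which duplicates are eliminated; values coded as spaces are simply dropped at the end. That is enough to reproduce the LEE, PFISTER, and JACKSON results quoted above for single-word names, but it may not match the modules on other inputs.

```python
# Standard Soundex letter groups (assumed; this section does not list them).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def collapse_adjacent(items):
    """Drop every item that equals the item immediately before it."""
    kept = []
    for item in items:
        if not kept or kept[-1] != item:
            kept.append(item)
    return kept


def recode(letters):
    """Map letters to digits; uncoded letters (vowels, and H, W, Y by assumption) become a space."""
    return [CODES.get(ch, " ") for ch in letters]


def finish(first_letter, codes, pad_char):
    """Drop spaces, keep three digits after the first letter, pad to four characters."""
    digits = [c for c in codes if c != " "][:3]
    return (first_letter + "".join(digits)).ljust(4, pad_char)


def matcher_sketch(name):
    """Matcher-style: recode, then eliminate adjacent duplicate values; pad with '0'.

    Assumes a single-word name containing at least one letter.
    """
    letters = [ch for ch in name.upper() if ch.isalpha()]
    codes = collapse_adjacent(recode(letters))
    return finish(letters[0], codes[1:], "0")


def winkey_sketch(name):
    """Winkey-style: eliminate adjacent duplicate letters, then recode; pad with ' '.

    Assumes a single-word name containing at least one letter.
    """
    letters = collapse_adjacent([ch for ch in name.upper() if ch.isalpha()])
    codes = recode(letters)
    return finish(letters[0], codes[1:], " ")


for name in ("LEE", "PFISTER", "JACKSON"):
    print(name, repr(matcher_sketch(name)), repr(winkey_sketch(name)))
# LEE 'L000' 'L   '
# PFISTER 'P236' 'P123'
# JACKSON 'J250' 'J222'
```

The only structural difference between the two functions is whether `collapse_adjacent` runs on the recoded values or on the original letters, which is the distinction described in item 2.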

When upgrading to TS Quality, the matcher module was replaced with the rellink module. As part of this upgrade, the differences between the two Series 7 modules were removed so that the TS Quality modules produce consistent results. In addition, TS Quality was changed so that the first character value is evaluated when eliminating duplicate adjacent character values.

To provide continued support for the Series 7 comparison routines, the Routine Modifier (S7) was introduced. For v17.3.0, updates were made to these routines to fix prior deficiencies where results were found to be inconsistent.

Differences Between Series 7 SOUNDEX1 Comparison Routines Used for Matcher and Winkey Modules and the TSQ v17.x Rellink and Winkey Modules:

  1. For TSQ versions prior to v17.3.0, an additional difference was introduced when converting vowels to spaces and eliminating the spaces when considering adjacent character values. For TS Quality, the first character value was evaluated when considering adjacent character values.

    Example of algorithm steps for coding ‘JACKSON’ (v17.2 and below):

    • Characters are coded as: J = 2, A = space, C = 2, K = 2, S = 2, O = space, N = 5.
    • Spaces are removed: J = 2, C = 2, K = 2, S = 2, N = 5.
    • Eliminate duplicates (C = 2, K = 2, S = 2): J = 2, N = 5.
    • Result: ‘J5’.
    • Result padded with zero: ‘J500’.

    For v17.3.0, the first character value is no longer evaluated when considering adjacent character values. This change was made to align the algorithm with that of the U.S. National Archives.

    Example of algorithm steps for coding ‘JACKSON’ (v17.3.0):
    • Characters are coded as: J = 2, A = space, C = 2, K = 2, S = 2, O = space, N = 5.
    • Spaces are removed: J = 2, C = 2, K = 2, S = 2, N = 5.
    • Eliminate duplicates (K = 2, S = 2): J = 2, C = 2, N = 5.
    • Result: ‘J25’.
    • Result padded with zero: ‘J250’.
  2. For TSQ versions prior to v17.3.0, spaces between words were eliminated prior to considering adjacent character values. This was in line with the U.S. National Archives.

    Example of algorithm steps for coding ‘PAM NAILS’ (v17.2 and below):

    • Characters are coded as: P = 1, A = space, M = 5, space, N = 5, A = space, I = space, L = 4, S = 2.
    • All spaces are removed: P = 1, M = 5, N = 5, L = 4, S = 2.
    • Eliminate duplicates (N = 5): P = 1, M = 5, L = 4, S = 2.
    • Result: ‘P542’.

    For v17.3.0, spaces between words are no longer eliminated when considering adjacent character values. This change restored the significance placed on spaces between words due to their use in multi-word business names.

    Example of algorithm steps for coding ‘PAM NAILS’ (v17.3.0):
    • Characters are coded as: P = 1, A = space, M = 5, space, N = 5, A = space, I = space, L = 4, S = 2.
    • Spaces are removed within words only: P = 1, M = 5, space, N = 5, L = 4, S = 2.
    • No adjacent duplicates to remove.
    • Eliminate remaining spaces: P = 1, M = 5, N = 5, L = 4, S = 2.
    • Result: ‘P554’.
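The two TS Quality changes described above can be expressed as two switches on one routine. The following Python sketch follows the worked steps literally and is checked only against the JACKSON and PAM NAILS examples in this section; it does not attempt to reproduce every entry in the result tables. The function names, the `evaluate_first` and `keep_word_breaks` parameters, the letter-to-digit table, and the treatment of H, W, and Y as spaces are assumptions made for illustration.

```python
# Standard Soundex letter groups (assumed; this section does not list them).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}
WORD_BREAK = "|"  # hypothetical marker for a space between words


def collapse_adjacent(items):
    """Drop every item that equals the item immediately before it."""
    kept = []
    for item in items:
        if not kept or kept[-1] != item:
            kept.append(item)
    return kept


def tsq_sketch(name, evaluate_first, keep_word_breaks):
    """Recode, remove spaces, eliminate adjacent duplicates, pad with '0'.

    Assumes the name starts with a letter.
    """
    text = name.upper().strip()
    first_letter = text[0]
    first_code = CODES.get(first_letter, " ")

    # Recode everything after the first character. Uncoded letters become a
    # space; a space between words becomes WORD_BREAK so it can be kept apart.
    rest = [WORD_BREAK if ch == " " else CODES.get(ch, " ")
            for ch in text[1:] if ch.isalpha() or ch == " "]

    rest = [c for c in rest if c != " "]              # spaces within words are always removed
    if not keep_word_breaks:
        rest = [c for c in rest if c != WORD_BREAK]   # removed before duplicates are considered

    if evaluate_first:
        rest = collapse_adjacent([first_code] + rest)[1:]  # first value takes part
    else:
        rest = collapse_adjacent(rest)                     # first value is left out

    digits = [c for c in rest if c != WORD_BREAK][:3]      # eliminate remaining spaces
    return (first_letter + "".join(digits)).ljust(4, "0")


def tsq_pre_173_sketch(name):
    """Steps as described for v17.2 and below."""
    return tsq_sketch(name, evaluate_first=True, keep_word_breaks=False)


def tsq_173_sketch(name):
    """Steps as described for v17.3.0."""
    return tsq_sketch(name, evaluate_first=False, keep_word_breaks=True)


for name in ("JACKSON", "PAM NAILS"):
    print(name, tsq_pre_173_sketch(name), tsq_173_sketch(name))
# JACKSON J500 J250
# PAM NAILS P542 P554
```

Keeping the word break in the sequence until after duplicates are eliminated is what stops M = 5 and N = 5 in ‘PAM NAILS’ from being treated as adjacent.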

Tables of Sample Differences in SOUNDEX1 Routine Results Across Versions and Modules Using Different Routine Modifiers:

Table 1. Winkey Module Results

| Version/Test Data | Series 7 Winkey | TSQ 17.2 S7 Modifier Winkey | TSQ 17.3 S7 Modifier Winkey | TSQ 17.2 No S7 Modifier | TSQ 17.3 No S7 Modifier | National Archives Online |
| --- | --- | --- | --- | --- | --- | --- |
| JACKSON | J222 | J225 | J222 | J250 | J250 | J250 |
| PAM NAILS | P554 | P542 | P554 | P542 | P554 | P542 |
| CORNEY | C65 | C5 | C65 | C650 | C650 | C650 |
| LEE | L | Error | L | L000 | L000 | L000 |
| GUTIERREZ | G362 | G62 | G362 | G362 | G362 | G362 |
| PFISTER | P123 | P236 | P123 | P123 | P236 | P236 |
| ASHCRAFT | A226 | A226 | A226 | A261 | A261 | A261 |
| LLOYD | L3 | L | L3 | L430 | L300 | L300 |
| CAMPBELL | C511 | C114 | C522 | C514 | C514 | C514 |
| MCGEE | M22 | M2 | M22 | M200 | M200 | M200 |
| RIEDEMANAS | R355 | R552 | R355 | R352 | R355 | R355 |
| SCHAFER | S216 | S16 | S216 | S216 | S160 | S160 |
| SHAEFFER | S16 | S6 | S16 | S160 | S160 | S160 |

Table 2. Matcher/Rellink Module Results

| Version/Test Data | Series 7 Matcher | TSQ 17.2 S7 Modifier Rellink | TSQ 17.3 S7 Modifier Rellink | TSQ 17.2 No S7 Modifier | TSQ 17.3 No S7 Modifier | National Archives Online |
| --- | --- | --- | --- | --- | --- | --- |
| JACKSON | J250 | J5 | J250 | J250 | J250 | J250 |
| PAM NAILS | P554 | P542 | P554 | P542 | P554 | P542 |
| CORNEY | C650 | C5 | C650 | C650 | C650 | C650 |
| LEE | L000 | Error | L000 | L000 | L000 | L000 |
| GUTIERREZ | G362 | G62 | G362 | G362 | G362 | G362 |
| PFISTER | P236 | P236 | P236 | P123 | P236 | P236 |
| ASHCRAFT | A226 | A226 | A226 | A261 | A261 | A261 |
| LLOYD | L300 | L | L3000 | L430 | L300 | L300 |
| CAMPBELL | C514 | C14 | C514 | C514 | C514 | C514 |
| MCGEE | M200 | M | M200 | M200 | M200 | M200 |
| RIEDEMANAS | R355 | R552 | R355 | R352 | R355 | R355 |
| SCHAFER | S160 | S16 | S160 | S216 | S160 | S160 |
| SHAEFFER | S160 | S6 | S160 | S160 | S160 | S160 |
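
The ‘National Archives Online’ column can be cross-checked with a short sketch of the standard American Soundex rules as commonly described for the U.S. National Archives index: adjacent letters with the same code collapse to one, H and W do not separate same-coded letters, vowels (including Y) do separate them, and spaces between words are ignored. The function name `soundex_reference` is hypothetical, and the code is an independent illustration, not the routine behind any Trillium module.

```python
# Standard Soundex letter groups.
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex_reference(name):
    """Return the four-character standard Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]  # spaces are ignored
    if not letters:
        return ""
    digits = []
    # The code of the first letter still takes part in duplicate checks.
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        # A vowel has code "", so equal codes on either side of it are both kept.
        previous = code
    return (letters[0] + "".join(digits))[:4].ljust(4, "0")


# 'National Archives Online' column from the tables above.
EXPECTED = {
    "JACKSON": "J250", "PAM NAILS": "P542", "CORNEY": "C650", "LEE": "L000",
    "GUTIERREZ": "G362", "PFISTER": "P236", "ASHCRAFT": "A261", "LLOYD": "L300",
    "CAMPBELL": "C514", "MCGEE": "M200", "RIEDEMANAS": "R355",
    "SCHAFER": "S160", "SHAEFFER": "S160",
}

mismatches = {n: soundex_reference(n) for n in EXPECTED if soundex_reference(n) != EXPECTED[n]}
print(mismatches)  # {}
```

In both tables, the only row where the ‘TSQ 17.3 No S7 Modifier’ column differs from this reference is ‘PAM NAILS’, which reflects the v17.3.0 decision to keep spaces between words significant.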