How Data Matching for Deduplication (One Source Matching) Works - EnterWorks - Fusion - Match_and_Merge - Precisely_EnterWorks - 3.1

EnterWorks Fusion

Product type
Software
Portfolio
Verify
Product family
EnterWorks
Product
Precisely EnterWorks > Match and Merge
Precisely EnterWorks > Fusion
Version
3.1
Language
English
Product name
Precisely EnterWorks
Title
EnterWorks Fusion
Copyright
2023
First publish date
2007

Fusion's one source matching functionality automates the process of identifying records that represent the same real-world entity (duplicate records) and merging the identified records into a single consolidated record.

Terms

  • Match column: A column used in the matching process to determine if records are duplicates.

  • Comparison function: A comparison function compares a column's value in two records and determines the likelihood that the values match. Fusion provides several comparison functions that allow the implementation of fuzzy matching. (See Comparison Functions.)

  • Composite score: The composite score of comparing two records is computed as a sum of the column comparisons scores for the given columns, divided by the number of columns. This value is specified as a percentage ranging from 0 to 100.

  • Match cutoff: If the comparison of two values or records generates a score greater than or equal to the match cutoff value, the values or records are declared a match.

  • Survivorship function: For a set of matched records, a survivorship function is used to determine which record's value will be used in the final merged record.

Match Rules

Match rules define the conditions that must exist for records to be declared duplicates. Fusion performs data matching by running a matching specification that contains one or more match rules that are run in the order specified.

Fusion supports the following types of match rules:

  • Exact: For this type of rule, records that have equal values in the match columns are considered duplicates. The Exact type of match rule is simple, yet very powerful, particularly if you standardize your columns first.

  • Fuzzy, per column limit: This type of matching allows you to specify the match cutoff and comparison function for each match column. The score returned by the column's comparison function is compared with the column's match cutoff value. If the score is greater or equal to the match cutoff value, the column is declared to be a match. For records to be declared a match, all match columns must be a match.

  • Fuzzy, combined single limit: This type of matching allows you to match records by comparing the values in columns, without you specifying the exact set of columns that must be matched. Only one match cutoff value is specified. All pairs of records are compared, a composite score is computed for each pair of records, and record pairs that have a composite score equal to or greater than the match cutoff value are declared to be matches. You can specify the comparison functions used.

    To select the match cutoff value you should review create a Weight Histogram Table by running the match run in the determine cutoffs mode. The Weight Histogram Table contains two columns:

    • weight: The composite score.

    • c: The number of pairs that have this composite score.

Comparison functions

Fusion supports the following comparison functions:

  • equal: Exact equality. The result of the equal operation is computed as a percentage: 0 (not equal) to 100 (equal).

  • edit_distance: A way of quantifying how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. The result is computed as percentage: from 0 (not equal) to 100 (exactly equal). Fusion uses the following formula to compute the result:

    edit_distance(s1,s2) = (1 - levenshtein_edit_distance(s1, s2) / max(length(s1), length(s2))) * 100 

    where levenshtein_edit_distance is the number of insertion, deletion, and substitution edits required to transform one string into the other.

  • compare_words: Splits compared strings into words, using a space as the delimiter, then counts the number of common words. The result is computed as a percentage: from 0 (no common words) to 100 (have the same set of words). Fusion uses the following formula to compute the result:

    compare_words(s1,s2) = 100 * number_of_common_words(s1,s2) / ((length(s1) + length(s2)) / 2)
  • compare_stand_words: Functions in the same manner as compare_words, except the words are standardized using the general string standardization function: all blank and special characters are removed, and all letters are transformed to lowercase.

  • contains: Implements the string contains function. The result of the contains operation is computed as a percentage: 0 (does not contain) to 100 (contains).

To experiment with the comparison functions you can connect to the database where the Match and Merge library is uploaded and try the functions. For the SQL server-based version (sources of type MSSQLServer) use the following functions:

-- returns 100
select [dbo].[il_equal]('KPMG LLP', 'KPMG LLP') as c;
-- returns 88
select [dbo].[il_edit_distance]('KPMG LLP', 'KPMG LL') as c;
-- returns 50
select [dbo].[il_compare_words]('point one','point two');
-- returns 50
select [dbo].[il_compare_stand_words]('point one','Point$% two');
-- returns 100
select [dbo].[il_contains]('point one','one');

Consolidation Rules

Consolidation rules define how to compile a consolidated (merged) record from a set of matched (duplicate) records.

Consolidation rules consist of a list of columns called survival columns. Each column in your source table has a correlating survival column. For each survival column, you specify a survivorship function, which is the strategy used to compute the final value of the column in the merged record.

Survivorship Functions

Fusion provides the following survivorship functions you can use to select which record's value will survive to be used in consolidated table:

  • max: From the set of values, select the maximum value.

  • min: From the set of values in the matched records, select the minimum.

  • most_frequent: From the set of values in the matched records, select the value that appears most frequently.

  • concat_distinct,<delimeter>

  • source_priority,<priority table>,<source column>: Fusion allows you to prefer a primary record. This means you can specify a strategy to declare one of the duplicate records to be the primary record, then prefer its value. If the value of the column in the primary record is not null, it will be chosen as the value for the consolidated record. If it is null, the per-column survivorship function will be used to compute the consolidated value. Preferring the value from the primary record can be used to get a consistent set of values across the columns of the consolidated record. For example, if the duplicate records have different mailing addresses, you could prefer a primary record to set all the addresses to the same value. For more information, see Source Priority Survivorship Function Details.