Optimizing Matching - 23.1

Spectrum Dataflow Designer Guide

Version
23.1
Language
English
Product name
Spectrum Technology Platform
Title
Spectrum Dataflow Designer Guide
Copyright
2024
First publish date
2007
Last updated
2024-10-11
Published on
2024-10-11T22:55:47.128293

Matching is typically one of the most time-consuming operations in any data quality implementation, making it important to ensure that matching is operating as efficiently as possible. There is always a balance between matching results and performance. If every record in a file is compared to every other record, you can be quite confident that all matches will be identified. However, this approach is unsustainable as the volume of data grows. For example, given an input file of 1 million records, matching each record to every other record would require nearly 1 trillion comparisons to evaluate each match rule.

Given that most records in a file do not match, the general approach to solving this problem is to define a match key and only compare those records that have the same match key. Proper match key definition is the most critical variable affecting performance of the matching engine. To define a proper match key, you must understand how the matching engine processes records and the options that are available.

The default matching method performs an exhaustive comparison of the record in a match queue to identify the maximum number of matches. Because of this, it is often the most time consuming way to do matching. Under the default matching method, the first record in the match queue becomes the suspect record. The next record is compared, and if it matches it is written out as a duplicate. If it does not match, it is added as a suspect, and the next record is compared to the two active suspects. Consider this match queue:

Unique ID Match Key
1 123A
2 123A
3 123A
4 123A
5 123A
6 123A
7 123A
8 123A
9 123A
10 123A

First, record 2 would be compared to record 1. Assuming it does not match, record 2 would be added as a suspect. Then record 3 would be compared to records 1 and 2, and so forth. If there are no matching records, the total number of comparisons would be 45. If some records match, the number of comparisons will be less. For a match queue of a given size N, the maximum number of comparisons will be N×(N-1)÷2. When the queue size is small this is not noticeable, but as the queue size grows the impact is significant. For example, a queue size of 100 could result in 4,450 comparisons, and a queue size of 500 could result in 124,750 comparisons.

Defining an Appropriate Match Key

To define an appropriate match key, consider these points:

  • Most records do not match. Compare only records that are likely to match.
  • Only records with the same match key will be compared.
  • Performance is a key consideration:
    • The match key determines the size of the match queue.
    • For a given number of records, as the match queue size doubles, execution time doubles.
    • A "tight" match key results in better performance. A "tight" match key is one that is specific, containing more characters from possibly more fields.
    • A "loose" match key may result in more matches. A "loose" match key is one that is less specific, containing fewer characters from possibly fewer fields.

Finding a Balance Between Performance and Match Results

To find a good balance between performance and results, consider the match rule and the density of the data.

  • Consider the match rules:
    • Fields requiring an exact match could be included in the match key.
    • Build an appropriate key for the match rule. For example, for a phonetic match rule, a phonetic match key is probably appropriate.
    • A match key will often consist of parts of all the fields being matched.
    • Be aware of the effects of missing data.
  • Consider the density of the data:
    • For example, in address matching, the match key would likely be tighter if all the records are in a single town instead of a national dataset.
    • Consider the largest match queue, not just the average. Review the Match Summary report to find the largest match queue.
  • When using transactional match, the same considerations apply to the SELECT statement in Candidate Finder.

Express Match Key

In a typical file, most of the duplicate records match either exactly or nearly exactly. Defining an express match key allows the matching engine to perform an initial comparison of the express match keys to determine that two records are duplicates. This can significantly improve performance by avoiding the need to evaluate all the field level match rules.

Intraflow Match Methods

The default Intraflow Match match method compares all records having the same match key. For a match queue size of N, the default method performs anywhere from N−1 to N×(N−1) comparisons. If all records match, the number of comparisons is N−1. If no records match the number of comparisons is N×(N−1). Usually the number of comparisons is somewhere in the upper part of this range.

If performance is a priority, consider using the sliding window match method instead of the default method. The sliding window match method compares each record to the next W records (where W is the window size). For a given file size N, the sliding window method performs no more than N×W comparisons. This can lead to better performance, but some matches may be missed.