Techniques for Defining Match Keys - spectrum_quality_1 - 23.1

Spectrum Data Quality Guide

Product type
Software
Portfolio
Verify
Product family
Spectrum
Product
Spectrum > Quality > Spectrum Quality
Version
23.1
Language
English
Product name
Spectrum Data Quality
Title
Spectrum Data Quality Guide
Topic type
Overview
Reference
Tips
How Do I
First publish date
2007
ft:lastEdition
2024-03-04
ft:lastPublication
2024-03-04T22:52:13.486265

Effective and efficient matching requires the right balance between accuracy and performance. The most accurate approach to matching would be to analyze each record against all other records, but this is not practical because the number of records that would need to be processed would result in unacceptably slow performance. A better approach is to limit the number of records involved in the matching process to those that are most likely to match. You can do this by using match keys. A match key is a value created for each record using an algorithm that you define. The algorithm takes values from the record and uses it to produce a match key value, which is stored as a new field in the record.

For example, if the incoming record is:

First Name - Fred
Last Name - Mertz
Postal Code - 21114-1687
Gender Code - M

And you define a match key rule that generates a match key by combining data from the record like this:

Input Field Start Position Length
Postal Code 1 5
Postal Code 7 4
Last Name 1 5
First Name 1 5
Gender Code 1 1

Then the key would be:

211141687MertzFredM

Any records that have the same match key are placed into a match group. The matching process then compares records in the group to each other to identify matches.

To create a match key, use a Match Key Generator stage if you are matching records using Interflow Match or Intraflow Match. If you are matching records using Transactional Match, use the Candidate Finder stage to create match groups.
Note: The guidelines that follow can be applied to both Match Key Generator keys and Candidate Finder queries. In Candidate Finder, these guidelines apply to how you define the SELECT statement.

Match Group Size and Performance

The match key determines the size of the match group, and thus the performance of your dataflow. As the size of the match group doubles, execution time doubles. For example, if you define a match key that produces a group of 20 potentially-matching records, it will take twice as long to process as if you modify the match key so that the match group contains only 10 potentially-matching records. The disadvantage to "tightening" the match key rule to produce a smaller match group is that you run the risk of excluding records that do match. "Loosening" the match key rules reduces the chance of a matching record being excluded from the group, but increases group size. To find the right balance for your data it is important that you test with a variety of match key rules using a data that is representative of the data you intend to process in production.

Density

When designing a match key it is important to consider the density of the data. Density refers to the degree to which the data can be distributed across match groups. Since performance is determined by the number of comparisons the system has to perform, match keys that produce a small number of large match groups will result is slower performance than match keys that produce a large number of small match groups.

To illustrate this concept, consider a situation where you have a set of one million name and address records that you want to match. You might define a match key as the first three bytes of the postal code and the first letter of the last name. If the records are from all over the U.S., the match key would produce a good number of match groups and is likely to have acceptable performance. But if all the records are from New York, the postal codes would all begin with "100" and you would end up with, at most, only 26 match groups. This would produce large match groups containing, on average, approximately 38,000 records.

You can calculate the maximum number of comparisons performed for each match group by using the following formula:

N * (N-1) / 2

Where N is the number of records in the match group.

So if you have 26 match groups containing 38,000 records each, the maximum number of comparisons performed would be approximately 18.7 billion. Here is how this number is calculated:

First, determine the maximum number of comparisons per match group:

38,000 * (38,000-1) / 2 = 721,981,000

Then, multiply this amount by the number of match groups:

721,981,000 * 26 = 18,771,506,000

If there were instead 100 unique values for the first 3 bytes of the postal code you would have 2,600 match groups containing an average of 380 records. In this case the maximum number of comparisons would be 187 million, which is 100 times fewer. So if the records are only from New York, you might consider using the first four or even five bytes of the postal code for the match key in order to produce more match groups and reduce the number of comparisons. You may miss a few matches but the tradeoff would be greatly reduced execution time.

In reality, a match key like the one used in this example will not result in match groups of equal size because of variations in the data. For example, there will be many more people whose last name starts with "S" than with "X". Because of this, you should focus your efforts on reducing the size of the largest match groups. A match group of 100,000 records is 10 times larger than a match group of 10,000 but it will require 100 times more comparisons and will take 100 times as long. For example, say you are using five bytes of postal code and six bytes of the AddressLine1 field for your match key. On the surface that seems like a fairly fine match key. The problem is with PO Box addresses. While most of the match groups may be of an acceptable size, there would be a few very large match groups with keys like 10002PO BOX that contain a very large number of records. To break up the large match groups you could modify your match key to include the first couple of digits of the PO box number.

Aligning the Match Key with the Match Rule

To achieve the most accurate results, you should design the match key to work well with the match rule that you will use it with. This requires you to consider how the match rule is defined.
  • The match key should include any fields that the match rule requires to be an exact match.
  • The match key should use the same kind of algorithm as is used in the match rule. For example, if you are designing a match key for use with a match rule that uses a phonetic algorithm, then the match key should also use a phonetic algorithm.
  • The match key should be built using data from all the fields that are used in the match rule.
  • Consider how the match key will be affected if there is data missing from one or more of the fields used for the match key. For example, say you use middle initial as part of the match key and you have a record for John A. Smith and another for John Smith. You have configured the match rule to ignore blank values in the middle initial field, so these two records would match according to your match rule. However, since the match key uses the middle initial, the two records would end up in different match groups and would not be compared to each other, thus defeating the intent of your match rule.