Fuzzy X-Ref - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Compares two data sets with input fields that may not exactly match, but which are logically correlated and are intended to match, and also outputs left and right orphans.

The Fuzzy X-Ref node is an extension of the Fuzzy Join node and provides not only the matching sets but also the orphaned populations from each side of the match. It has all the features of the Fuzzy Join node, with two additional outputs: Left Orphan and Right Orphan. It performs a relational join on the data from two input pins, left and right, using the criteria defined in the properties to determine matches.

Six fuzzy algorithms are implemented:

  • Levenshtein Distance - Records match if the string distance (minimum number of string edits between two keys to transform one to the other) is less than the FuzzyThreshold.
  • Transposed Digit - Records match if the number of transposed characters between two keys is less than the FuzzyThreshold.
  • Soundex - Takes two input strings, and transforms each string into a phonetic representation. The two newly transformed phonetic representations are compared. Records match if the number of matched characters in the two phonetic representations is more than the FuzzyThreshold.
  • Time Distance - Records match if the absolute difference in seconds between two keys of type time is less than the FuzzyThreshold.
  • Date Distance - Records match if the absolute difference in days between to date keys is less than the FuzzyThreshold.
  • Datetime Distance - Records match if the absolute difference in seconds between two datetime keys is less than the FuzzyThreshold.
  • Numeric Distance - Records match if the absolute difference between two numeric keys is less than the FuzzyThreshold.
Notes:

Each of the algorithms above takes two keys and a threshold number, and calculates a fuzzy score (numeric "distance") of the two keys. If the fuzzy score is at or below the threshold, the two keys are considered to be a match.For example, if keys 1000 and 1002 are given to algorithm Numeric Distance, the fuzzy score is 2. If the threshold is 2 or more, these two keys match. Otherwise, the keys do not match.Note that in all other algorithms, adding to the threshold creates MORE matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates LESS matches, in general.

For more efficient node performance, it is recommended that numbers in input data are represented in numeric types instead of string types.

Input and Output

  • Input
    • Each algorithm requires two input data, left and right.
    • The keys in left data are compared against the keys in the right data based on the property criteria.
  • Output
    • Depending on various conditions, four possible types of outputs: Left Orphan, Unique Match, Multi Match and Right Orphan.
    • Left Orphan
      • All entries from the left pin that do not appear in outputs Unique Match or Multi Match.
    • Unique Match
      • This match is very similar to the Multi Match with one restriction: For all key pairs (k1, k2) where k1 comes from the left pin and k2 comes from the right pin, k1 only matches only one key, k2 from the right pin, and k2 matches only one key, k1, from the left pin.
    • Multi Match
      • For each match key pairs (k1, k2) where k1 comes from the left pin and k2 comes from the right pin, if k1 matches one or more keys (in addition to k2) from the right pin as defined by the conditions below, or k2 matches one or more keys (in addition to k1) from the left pin as defined by the conditions below, this match is considered to be multi-match.
      • The output will consist of such key pairs and their corresponding fields in the two pins.
      • Condition 1: ExactExprs and FuzzyExprs are both defined.
        • Entries in the left and right pins are grouped according to LeftExactExpr and RightExactExpr.
        • From the entries grouped above, keys from the left pin are identified by LeftFuzzyExpr while keys from the right pin are identified by RightFuzzyExpr.
        • A fuzzy score is performed on the pairs of these two sets of keys.
        • If the fuzzy score between a key-pair is at or below the threshold, such key pair is considered to be a match.
      • Condition 2: ExactExprs and FuzzyExprs are both not defined. This is an error.
      • Condition 3: ExactExprs are defined and FuzzyExprs are not defined. Contains a regular join of the entries and their corresponding fields in the left and right pins that meet the LeftExactExpr and RightExactExpr criteria.
      • Condition 4: FuzzyExpr is defined and ExactExpr is not defined.
        • Keys from the left pin are identified by LeftFuzzyExpr while keys from the right pin are identified by RightFuzzyExpr.
        • A fuzzy score is performed on the pairs of these two sets of keys.
        • If the fuzzy score between a key-pair is at or below the threshold, such key pair is considered to be a match.
    • Right Orphan
      • All entries from the right pin that do not appear in outputs Unique Match or Multi Match.

    The fuzzy node is a memory intensive node, and there are limits to the size of the input data it can handle. For fuzzy joins with only FuzzyExprs are defined, all the data for FuzzyExprs in the right input pin are loaded into memory. In all other cases, all the data for ExactExprs in the right input pin are loaded into memory.The total of FuzzyExprs or ExactExprs data has to be less than 600MB. Since either FuzzyExpr or ExactExpr of the right input will be loaded into memory, for best performance, the right input pin should contain the smaller data set. Numeric operations are more efficient than string operations. Use numeric values for FuzzyExprs or ExactExprs if possible for better performance. When disk space or processing time is a concern, it is recommended that Fuzzy Join is used instead because it does not produce left and right orphans.

Examples

  • Levenshtein Distance
    • The names "Mark Smith" and "Mack Smith" do not match exactly, but they may represent the same person whose name is entered incorrectly in one source. The fuzzy score for these names is 1. If the threshold is 1 or more, they are a match.
  • Transposed Digit
    • The keys "Margaret" and "aMrgarte" have a fuzzy score of 2: "Ma" and "aM" are transposed, and "et" and "te" are transposed. These two keys match if the threshold is 2 or more.
  • Soundex
    • The Soundex code for a name consists of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants. Similar sounding consonants share the same number so, for example, the labial B, F, P and V are all encoded as 1. Vowels can affect the coding, but are never coded directly unless they appear at the start of the name. The fuzzy score is the number of characters in the two Soundex encoded strings that are the same. The Soundex encodings for names "Jackie" and "Jacquelin" are "J200" and "J245" respectively, so the fuzzy score is 2. If the threshold is 2 or less, the names "Jackie" and "Jacquelin" match. Notes: (a) In all other algorithms, adding to the threshold creates MORE matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates LESS matches, in general. (b) Soundex algorithm is catered toward English language. For names in unicode or extended ASCII character set, Soundex considers two characters a match if they are the same.
  • Time Distance
    • The fuzzy score for keys "09:09:09" and "09:09:39" is 30 seconds. These two keys match if the threshold is 30 or more.
  • Datetime Distance
    • The distance between the keys "2010-10-10 09:09:09" and "2010-10-10 10:10:10" is 3661 seconds. If the threshold is 3661 or more, these two keys match.
  • Numeric Distance
    • The fuzzy score for the keys 10.1 and 10.01 is 0.09. If the threshold is 0.09 or more, the two keys match.

Properties

LeftExactExpr

Specify an expression (can be a list of input fields) to define which fields in the left (top) input must exactly match those in the right input.

LeftExactExpr and RightExactExpr are used together to apply eligibility constraints in joining the Left and Right pins. The fuzzy algorithm calculates the fuzzy scores of the records that meet the ExactExpr constraints.

LeftExactExpr and RightExactExpr are generally used to restrict the set of records over which the LeftFuzzyExpr and RightFuzzyExpr apply. If no LeftFuzzyExpr or RightFuzzyExpr are provided, the node operates similar to a standard Join node.

Both properties LeftExactExpr and RightExactExpr must either be defined or not defined. If defined, both properties must have the same data type.

RightExactExpr

Specify an expression (can be a list of input fields) to define which fields in the right (bottom) input must exactly match those in the left input.

LeftExactExpr and RightExactExpr are used together to apply eligibility constraints in joining the Left and Right pins. The fuzzy algorithm calculates the fuzzy scores of the records that meet the ExactExpr constraints.

LeftExactExpr and RightExactExpr are generally used to restrict the set of records over which the LeftFuzzyExpr and RightFuzzyExpr apply. If no LeftFuzzyExpr or RightFuzzyExpr are provided, the node operates similar to a standard Join node.

Both properties LeftExactExpr and RightExactExpr must either be defined or not defined. If defined, both properties must have the same data type.

FuzzyAlgorithm

Specify one of six algorithms to perform matching. Choose from:

  • Levenshtein Distance - Algorithm determines the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. Items match if the minimum number of edits is less than the specified FuzzyThreshold.
  • Transposed Digit - Algorithm that calculates the number of transposed characters between two keys.
  • Soundex - Phonetic algorithm that indexes names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. All non-alphabetical characters, digits, punctuations, and white spaces are ignored. Soundex algorithm is catered toward English language. For names in unicode or extended ASCII character set, Soundex considers two characters a match if they have the same encoding.
  • Time Distance - Algorithm that matches keys of type time. The time distance for two keys is their absolute difference in seconds.
  • Date Distance - Algorithm that matches keys of type date. The date distance for two keys is the absolute difference in days.
  • Datetime Distance - Algorithm that matches keys of type datetime timestamps. The datetime distance for two keys is the absolute difference in seconds.
  • Numeric Distance - Algorithm based on the absolute difference between two numeric keys.

FuzzyThreshold

Specify the threshold to be used to determine if two keys match. Must be set if properties LeftFuzzyExpr and RightFuzzyExpr are defined.

Each FuzzyAlgorithm computes a fuzzy score based on the comparison of the keys in FuzzyLeftExpr and FuzzyRightExpr. The FuzzyThreshold is used to determine if the two keys match based on the calculated fuzzy score.

  • Levenshtein Distance - Records match if the string distance (minimum number of string edits between two keys to transform one to the other) is less than the FuzzyThreshold.
  • Transposed Digit - Records match if the number of transposed characters between two keys is less than the FuzzyThreshold.
  • Time Distance - Records match if the absolute difference in seconds between two keys of type time is less than the FuzzyThreshold.
  • Date Distance - Records match if the absolute difference in days between two date keys is less than the FuzzyThreshold.
  • Datetime Distance - Records match if the absolute difference in seconds between two datetime keys is less than the FuzzyThreshold.
  • Numeric Distance - Records match if the absolute difference between two numeric keys is less than the FuzzyThreshold.
  • Soundex - Transforms an input string into a phonetic representation. Records match if the number of matched characters in the phonetic representation is more than the FuzzyThreshold. Note that in all other algorithms, adding to the threshold creates MORE matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates LESS matches, in general.

LeftFuzzyExpr

Specify the fields from the left input which are to match those in the right input based on the fuzzy matching criteria.

Both properties LeftFuzzyExpr and RightFuzzyExpr must be defined or not defined. If defined, both properties must have the same data type.

The LeftFuzzyExpr and RightFuzzyExpr defined the fields to be matched by the Fuzzy Algorithm. These fields are used as arguments to the Fuzzy Algorithm to determine if they match, based on the property FuzzyThreshold.

RightFuzzyExpr

Specify the fields from the right input which are to match those in the left input based on the fuzzy matching criteria.

Both properties LeftFuzzyExpr and RightFuzzyExpr must be defined or not defined. If defined, both properties must have the same data type.

The LeftFuzzyExpr and RightFuzzyExpr defined the fields to be matched by the Fuzzy Algorithm. These fields are used as arguments to the Fuzzy Algorithm to determine if they match, based on the property FuzzyThreshold.

RightPrefix

Optionally specify prefix text to add to all fields from the right input to differentiate them from the left input. Useful if there are fields in the left and right input pins that match exactly - ensuring that there are no field naming conflicts in the output.

FuzzyScoreOutputField

Optionally specify a name for the fuzzy score output field to generate an additional field in the uniquematch and the multimatch outputs that contains the fuzzy scores for the data. If not specified, the field will not be included in the output.

MaximumStringLength

Optionally specify the maximum number of characters allowed for string fields in the left and right input pins.

The default value is 4000.

Inputs and outputs

Inputs: left, right.

Outputs: leftorphan, uniquematch, multimatch, rightorphan.