Fuzzy Join - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Compares two data sets with input fields that may not exactly match, but which are logically correlated and are intended to match.

Fuzzy join algorithms and the criteria defined in the properties are used to match your data (case insensitive).

The following steps guide you through the process of configuring the Fuzzy Joinnode:

  1. If you want to restrict the set of records over which the fuzzy join is performed (optional step): In the LeftExactExpr property, add an expression (can be a list of input fields) to define which fields in the left (top) input must exactly match those in the right input. In the RightExactExpr property, add an expression (can be a list of input fields) to define which fields in the right (bottom) input must exactly match those in the left input.

    Note: If you choose to define these optional "Exact" properties, you must define both the LeftExactExpr property and the RightExactExpr property and they must have the same data type. Entries in the left and right pins are grouped according to the values specified in the LeftExactExpr and RightExactExpr properties.

    If you only define the "Exact" properties and do not define the LeftFuzzyExpr and the RightFuzzyExpr properties, a regular join occurs, see Join. If you want to perform a fuzzy join, see the following step.

  2. From the FuzzyAlgorithm property, select the algorithm that best suits your data to perform a fuzzy join on the values specified in the LeftFuzzyExpr and RightFuzzyExpr properties.

    There are seven fuzzy join algorithms to choose from, depending on your data:

    Matching algorithms

    • Levenshtein Distance: Calculates the distance in terms of the number of letter substitutions necessary to make one word match another. Records match when the distance is the same or less than the value specified in the FuzzyThreshold property.

      The names "Mark Smith" and "Mack Smith" do not match exactly, but they may represent the same person whose name is entered incorrectly in one source. The fuzzy score is 1, as one letter would need to be substituted to make a match; if the FuzzyThreshold property is 1 or more, they are matched.

    • Transposed Digit: Looks for typographical errors. Records match if the number of transposed characters between the two input fields is the same or less than the value specified in the FuzzyThreshold property.

      In the entries "12345678" and "12435678", "34" and "43" are transposed; there is a fuzzy score of 1. If the FuzzyThreshold property is 1 or more, they are matched.

    • Soundex: Matches by sound using a phonetic algorithm that is catered to the English language. Homophones are encoded to the same representation so that they can be matched despite minor differences in spelling. All non-alphabetical characters, digits, punctuations, and white spaces are ignored.

      The Soundex code for a name consists of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants. Similar sounding consonants share the same number, for example, the labial B, F, P and V are all encoded as 1. Vowels are never coded directly unless they appear at the start of the name. Records match if the number of identical characters in the two phonetic representations is the same or more than the value specified in the FuzzyThreshold property.

      For names in unicode or in extended ASCII character set, Soundex considers two characters a match if they are the same.

      Note that in all other algorithms, adding to the threshold creates MORE matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates LESS matches, in general.

      The Soundex encodings for the names "Jackie" and "Jacquelin" are "J200" and "J245", respectively. The fuzzy score is 2; if the FuzzyThreshold property is 2 or less, the names are matched.

    Distance comparisons

    • Time Distance: Looks for the absolute difference in seconds between two times. Records match if the absolute difference in seconds is the same or less than the value specified in the FuzzyThreshold property.

      The time distance between the following two times is 10 seconds: 02:45:47 and 02:45:57. If the FuzzyThreshold property is 10 or more, they are matched.

    • Date Distance: Looks for the absolute difference in days between two dates. Records match if the absolute difference in days is the same or less than the value specified in the FuzzyThreshold property.

      The date distance between the following two dates is 2 days: 2016-03-03 and 2016-03-05. If the FuzzyThreshold property is 2 or more, they are matched.

    • DateTime Distance: Looks for the absolute difference in seconds between two datetime fields. Records match if the absolute difference between two datetime values is the same or less than the value specified in the FuzzyThreshold property.

      The datetime distance between the following two values is 3661 seconds: 2010-10-10 09:09:09 and 2010-10-10 10:10:10. If the FuzzyThreshold property is 3661 or more, they are matched.

    • Numeric Distance: Looks for the absolute difference between two numbers. Records match if the absolute difference between two numbers is less than the value specified in the FuzzyThreshold property.

      There is a difference of 0.09 between the following two numbers: 10.1 and 10.01. If the FuzzyThreshold property is 0.09 or more, they are matched.

  3. Once you have selected a FuzzyAlgorithm, in the FuzzyThreshold property, enter a positive number to specify the threshold that is used to determine if two fields match.
  4. In the LeftFuzzyExpr property, type the fields from the left input which are to match those in the right input based on the fuzzy matching criteria.

    In the RightFuzzyExpr property, type the fields from the right input which are to match those in the left input based on the fuzzy matching criteria.

  5. If there are fields in the left and right input pins that match exactly, in the RightPrefix property, enter some text to add to all fields from the right input to differentiate them from the left input. This ensures that there are no field naming conflicts in the output. This property is optional.
  6. If you want to generate an additional field in the 'uniquematch' and the 'multimatch' outputs that contains the fuzzy scores for the data, in the FuzzyScoreOutputField property, enter a name for the fuzzy score output field. This property is optional; if you do not enter a name, the field will not be included in the output.

The Fuzzy Join node generates two outputs:

  • uniquematch: Data that has a one to one match.
  • multimatch: Data that has a one to many matches.

You can decide to reduce the multi matches to a single occurrence and then combine with the unique matches, or you can use only the unique matches in the remainder of your analysis.

Optimizing node performance

You can improve the efficiency of the Fuzzy Join node by ensuring that numbers in input data are represented by numeric data types instead of string types, see Data Analyzer (Deprecated). Also, it is recommended to use numeric values for FuzzyExprs or ExactExprs where possible.

The Fuzzy Join node is memory intensive. For best performance, the right input pin should contain the smaller data set.

Example

You have two data sets. The first data set lists call issue information:

ChangeDateTimedatetime Current_User_Idstring Categorystring
2013-03-09 10:40:00 BTOENJ CA03
2013-03-20 09:10:00 ILEONIJ BS03
2013-03-09 09:20:00 ILEONIJ BS03
2013-03-09 10:46:30 REPENSHEK CA03
2013-03-17 07:54:00 PEALEG CA03

The second data set lists call escalation information:

CreateDateTimedatetime User_Idstring System_IDint
2013-03-09 10:45:00 KUNDEL 4706
2013-03-20 09:14:00 NOLEDTOR 1860
2013-03-09 09:25:00 ENOKISR 2365
2013-03-09 10:50:30 NOLEDTOR 1714
2013-03-17 08:00:00 NOLEDTOR

3951

Join these two data sets together to create one data set that lists the issue and how it was escalated. When an escalation is logged, a change is made to the issues record, so the dates would coincide with one another. Therefore, these two data sets can be joined by using the ChangeDateTime field from the first data set and the CreateDateTime field from the second data set. However, there may be a slight difference of five minutes and this is where the fuzzy algorithm comes into play.
  1. Given that the fields that are being used to join on are date fields, both of which contain data and time, from the FuzzyAlgorithm property, select DateTime Distance.
  2. Given that there is a five minute difference between when the issue was changed and the escalation was created, enter a FuzzyThreshold of 300.
  3. In the LeftFuzzyExpr property, type:

    ChangeDateTime

  4. In the RightFuzzyExpr property, type:

    CreateDateTime

The uniquematch pin outputs the following:
ChangeDateTimedatetime Current_User_Idstring Categorystring CreateDateTimedatetime User_Idstring System_IDint
2013-03-20 09:10:00 ILEONIJ BS03 2013-03-20 09:14:00 NOLEDTOR 1860
2013-03-09 09:20:00 ILEONIJ BS03 2013-03-09 09:25:00 ENOKISR 2365
The multimatch pin outputs the following:
ChangeDateTimedatetime Current_User_Idstring Categorystring CreateDateTimedatetime User_Idstring System_IDint
2013-03-09 10:40:00 BTOENJ CA03 2013-03-09 10:45:00 KUNDEL 4706
2013-03-09 10:46:30 REPENSHEK CA03 2013-03-09 10:45:00 KUNDEL 4706
2013-03-09 10:46:30 REPENSHEK CA03 2013-03-09 10:50:30 NOLEDTOR 1714

Properties

LeftExactExpr

Specify an expression (can be a list of input fields) to define which fields in the left (top) input must exactly match those in the right input.

Both properties LeftExactExpr and RightExactExpr must either be defined or not defined. If defined, both properties must have the same data type. LeftExactExpr and RightExactExpr are generally used to restrict the set of records over which the LeftFuzzyExpr and RightFuzzyExpr apply. The fuzzy algorithm calculates the fuzzy scores of the records that meet the ExactExpr constraints.

RightExactExpr

Specify an expression (can be a list of input fields) to define which fields in the right (bottom) input must exactly match those in the left input.

Both properties LeftExactExpr and RightExactExpr must either be defined or not defined. If defined, both properties must have the same data type. LeftExactExpr and RightExactExpr are generally used to restrict the set of records over which the LeftFuzzyExpr and RightFuzzyExpr apply. The fuzzy algorithm calculates the fuzzy scores of the records that meet the ExactExpr constraints.

FuzzyAlgorithm

Specify a fuzzy algorithm (mandatory if properties LeftFuzzyExpr and RightFuzzyExpr are defined). A fuzzy algorithm matches two fields, one from the left pin and one from the right pin. Choose from:

  • Levenshtein Distance - Calculates the distance in terms of the number of letter substitutions necessary to make one word match another.
  • Transposed Digit - Looks for typographical errors.
  • Soundex - Matches by sound using a phonetic algorithm that is tailored to the English language. Homophones are encoded to the same representation so that they can be matched despite minor differences in spelling. All non-alphabetical characters, digits, punctuations, and white spaces are ignored. For names in unicode or extended ASCII character set, Soundex considers two characters a match if they have the same encoding. Note that in all other algorithms, adding to the threshold creates more matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates fewer matches, in general.
  • Time Distance - Looks for the absolute difference in seconds between two times.
  • Date Distance - Looks for the absolute difference in days between two dates.
  • Datetime Distance - Looks for the absolute difference in seconds between two datetime fields.
  • Numeric Distance - Looks for the absolute difference between two numbers.

FuzzyThreshold

Specify a positive number to specify the threshold that is used to determine if two fields match. Mandatory if properties LeftFuzzyExpr and RightFuzzyExpr are defined.

  • Levenshtein Distance - Records match when the distance is the same or less than the value specified in the FuzzyThreshold property.
  • Transposed Digit - Records match if the number of transposed characters between the two input fields is the same or less than the value specified in the FuzzyThreshold property.
  • Soundex - Records match if the number of identical characters in the two phonetic representations is the same or more than the value specified in the FuzzyThreshold property. Note that in all other algorithms, adding to the threshold creates MORE matches, but in Soundex, adding to the threshold causes a more detailed comparison, and creates LESS matches, in general.
  • Time Distance - Records match if the absolute difference in seconds is the same or less than the value specified in the FuzzyThreshold property.
  • Date Distance - Records match if the absolute difference in days is the same or less than the value specified in the FuzzyThreshold property.
  • Datetime Distance - Records match if the absolute difference between two datetime values is the same or less than the value specified in the FuzzyThreshold property.
  • Numeric Distance - Records match if the absolute difference between two numbers is less than the value specified in the FuzzyThreshold property.

LeftFuzzyExpr

Specify the fields from the left input which are to match those in the right input based on the fuzzy matching criteria. Both properties LeftFuzzyExpr and RightFuzzyExpr must be defined or not defined. If defined, both properties must have the same data type.

If no LeftFuzzyExpr or RightFuzzyExpr are provided, the node operates as a standard Join node.

RightFuzzyExpr

Specify the fields from the right input which are to match those in the left input based on the fuzzy matching criteria. Both properties LeftFuzzyExpr and RightFuzzyExpr must be defined or not defined. If defined, both properties must have the same data type.

If no LeftFuzzyExpr or RightFuzzyExpr are provided, the node operates as a standard Join node.

RightPrefix

Optionally specify prefix text to add to all fields from the right input to differentiate them from the left input. Useful if there are fields in the left and right input pins that match exactly - ensuring that there are no field naming conflicts in the output.

FuzzyScoreOutputField

Optionally specify a name for the fuzzy score output field to generate an additional field in the 'uniquematch' and the 'multimatch' outputs that contains the fuzzy scores for the data. If not specified, the field will not be included in the output.

MaximumStringLength

Optionally specify the maximum number of characters allowed for string fields in the left and right input pins.

The default value is 4000.

Inputs and outputs

Inputs: left, right

Outputs: uniquematch, multimatch.