Data Matching for Deduplication (One Source Matching) Specification Management - EnterWorks - Fusion - Match_and_Merge - Precisely_EnterWorks - 3.1

EnterWorks Fusion

Product type
Software
Portfolio
Verify
Product family
EnterWorks
Product
Precisely EnterWorks > Fusion
Precisely EnterWorks > Match and Merge
Version
3.1
Language
English
Product name
Precisely EnterWorks
Title
EnterWorks Fusion
Copyright
2023
First publish date
2007

This section of the documentation is still under development. Additional information can be found at: https://infolinkcloud.com/doc/index.html.

One Source Matching automates the process of identifying records that represent the same real-world entity (duplicate records) and merging the identified records into a single consolidated record.

For instructions on renaming, cloning, or deleting a matching specification, see Specification Management.

How a One Source Matching Specification Proceeds

When a matching specification is run, it proceeds as follows:

  1. Match rules are run in the order their tabs appear in the matching specification, left to right. The leftmost match rule will be run first; the rightmost will be run last. Each match rule takes the match table as an input and updates the set_id, prev_set_id, and specs columns.
  2. Manual merging and unmerging are applied.
  3. Consolidation rules are run. Consolidation takes the match table as an input and creates a new consolidated table with the name defined in the matching specification (the Consolidated table field). A set of matched records are those that have the same set_id in the match table.
  4. Manual edits of consolidated records are applied.
  5. You can review and edit the result of the previous steps in the Stewardship UI.

Match Table Overview

The match table is the input table that contains duplicate records that the matching specification will match. It will be updated with the result of the matching. The match table must have the following column:

  • data_id nvarchar(900): This is the primary key; a unique identifier that cannot be null.

When the matching specification is run, Fusion will check whether the match table has the following columns. Any missing columns will be added to the match table:

  • set_id nvarchar(900): This column is used to group records into a consolidated record. Records with the same set_id are duplicates.

  • prev_set_id nvarchar(900): This is used for debugging. It stores the results of the previous match run so that you know which duplicates were identified in the current match run.

  • match_specs nvarchar(4000): This is used to store a list of the names of matching rules that effected this record.

Creating a One Source Matching Specification

In the navigation tree, matching specifications are located in the Specifications section, in the Matching subsection. Before you can create a specification, you must have configured your data source.

To create a matching specification:

  1. In the navigation tree, right-click the Matching subsection of the Specifications section and select Create match specification.
  2. A pop-up window will appear that has the following configuration options:
    • Specification name: The name of your new specification. After the screen is refreshed, the text "(one-source)" will be appended to the end of the specification name to indicate which type of matching specification it is.

    • Select type: Select Match one source.

    • Select source: Select the data source containing the data to be matched.

  3. When you are done editing the options, click Create. Your new specification will be displayed in the current window. To edit your specification, follow the instructions outlined in Editing a One Source Matching Specification.

Editing a One Source Matching Specification

For a discussion on selecting matching specification configuration settings, see How Data Matching for Deduplication (One Source Matching) Works.

To edit a matching specification:

  1. In the application's navigation tree, click the matching specification's name. Your specification will be displayed in the current window.
  2. Edit the specification's configuration options:
    • Specification Name: This field is pre-populated and cannot be edited.

    • Source Name: This field is pre-populated and may not be able to be changed.

    • Match table: Enter the name of the input table on which you want to perform matching. You can select the table name from the dropdown menu or start typing the name of the table to filter the dropdown menu.

      If your data source is of type EnterworksSource, the dropdown may list the tables by name or it may list them by repository number. To determine the number of an EnterWorks repository, in the New UI, open the repository in the Detail Editor. The number of the repository is the digits at the end of the Detail Editor's tab's URL.

    • Consolidated table: The name of the table the system will create to contain the consolidated records compiled from matched records.

  3. At the bottom of the specification screen are a series of tabs that hold additional configuration options:
    1. Match:<match-rule-name>: Match rules specify how duplicates will be found. Define one or more match rules.
      • To add a match rule: Match rule tabs are added to the right of (after) an existing match rule tab. To add a match rule, right-click the match rule that will be run prior to the new rule and select Insert Match Run next.

      • To delete a match rule, right-click the match rule's tab and select Delete.

      The configuration options for match rules are as follows:
      • Match specification name: The name for your match rule. If you edit the name of your match rule, it may not appear on the tab immediately. If it does not, it will appear the next time the specification tab is opened.

      • Match columns:

        • To add a match column, click the Add column button. Depending on if Fusion has previously accessed the match table, the new match column's dropdown menu may or may not list the available columns. If it doesn't, start typing the name of the desired column and Fusion will display a filtered list of the available columns.

          If your source is an EnterWorks repository, you will only be able to access the relational attributes, (the attributes in the snapshot table).

        • To delete a column, click the column's Delete button on the far right of the screen.

        • To rearrange the order of the columns, drag and drop them into the desired order.

      • Filter:

      • Select type: The type of rule determines the criteria for declaring two records to be a match. For more details on the types of match rules, see How Data Matching for Deduplication (One Source Matching) Works.

        • Exact: The records must exactly match.

        • Fuzzy, per column limit: Each column's match score must exceed the column's match cutoff.

        • Fuzzy, combined single limit:The combined match score for all the columns in a record must exceed the record's match cutoff.

      • Requires manual review:

    2. Manual Matching: These tables are created automatically when you create a new matching specification. They are managed by the system. They will be used to store the results of manual editing performed in the Stewardship User Interface. You can specify the names of the tables or let Fusion create the names.
    3. Stewardship: In the Stewardship UI, if the Show summary columns only checkbox is checked, only the columns specified in this tab will be displayed. If the checkbox is not checked, all the columns will be displayed. Add the columns that you want to appear in the Stewardship UI. To change the display order of the columns, drag and drop the columns to the desired positions.
    4. Consolidation rules: Consolidation rules specify how to compile a consolidated (merged) record from a set of duplicate records. To add columns, click the Add column button and edit the configuration options:
      • Column Name: The name of the column.

      • Survivorship Function: Select one of the following:

        • max: From the set of values, select the maximum value.

        • min: From the set of values in the matched records, select the minimum.

        • most_frequent: From the set of values in the matched records, select the value that appears most frequently.

        • source_priority,<priority table>,<source column>: Use the value in the record with the highest priority. Record priority is defined by sorting records by one or more columns. The value is taken from the first record in the result of the sorting or you can also use the priority specified in an additional table. For more information, see Source Priority Survivorship Function Details.

Source Priority Survivorship Function Details

To use a source priority survivorship function in a MatchOneSource operation, use the following JSON format to specify record priority:

{"filter":<SQL-expr>,"priorityTable":<priority-table-name>,"priorityTableReferenceColumn":<column-name-that-refers-to-the-priority-table>,"priorityColumns":<SQL-expression-with-sorting-columns>,"aggregate":<aggregate-function-name>}

The records are first filtered with the filter, then joined with the priorityTable by the column priorityTableReferenceColumn of the input table and the value column of the priority table. Then the aggregate function is applied to each subgroup of records that have the same value of the sorting columns. Everywhere you use a SQL expression, use a t. qualifier to refer to a column of the input table and a p. qualifier to refer to a column of the priority table. The priority table must have a column named value that will be used in the join and a column with a priority that will be used for sorting in the priorityColumns expression.

In the example below, records are prioritized using a priority table named Source_Priority and the most recently updated records are preferred.

{"filter":null,"priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"}

In the next example we additionally require that the value in the Fax column must not be null.

{"filter":"t.[Fax] is not null","priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"}

Executing a One Source Matching Specification

Once a matching specification has been created, you can call it from a scenario using the MatchOneSource or operation, or you can run it directly by right-clicking the specification name in the navigation tree and selecting how you want it to be run.

The available execution options depend on the matching specification's type. Some options are only available when the matching specification is run as an operation.

In the list of matching execution options below:

  • This is how the execution option will appear when the matching specification's name is right-clicked.

  • (This) is the name of the execution option you would use if you include it in the JSON parameters of an operation.

The matching execution options are:

  • run match ruleā€¦: (execMatchRun) runs a particular match rule (selected from a dropdown menu) in one of the two modes:

  • run - match rules: (execMatchRuns) runs all matching rules without executing other steps such as consolidation.

  • run - consolidation: (execConsolidation) runs the consolidation step.

  • run - apply manual: (execApplyManual) Applies manual merges and unmerges.

  • run all: (execAll) run all steps of the specification: adding required columns to the input table, matching, applying manual merging and unmerging, computing consolidated records, and applying manual editing of consolidated records.

  • run all auto rules: (execAllAutoRules) The same as run all, but only auto matching rules are run.

  • run all manual rules: (execAllManualRules) The same as run all but only manual matching rules are run.

  • Remove all manual: (removeAllManual) The result of all manual manipulations performed in the Stewardship UI are stored in manual tables defined in the matching specification in the Manual Matching tab. Remove all manual deletes records from the manual tables.

  • (getPotentialMatches) This option is only available when the option is used in an operation. Fusion takes the following parameters:

    • tableName1

    • tableName2

    • targetTableName

    and uses rules from the matching specification to identify all records from tableName1 that are potential matches to records in tableName2. The resulting records are stored in targetTableName. The potential matches are obtained by joining the two tables using only exact matching columns - fuzzy matching columns are not taken into account. getPotentialMatches is usually used to optimize matching on incremental data loading.