This section of the documentation is still under development. Additional information can be found at: https://infolinkcloud.com/doc/index.html.
One Source Matching automates the process of identifying records that represent the same real-world entity (duplicate records) and merging the identified records into a single consolidated record.
For instructions on renaming, cloning, or deleting a matching specification, see Specification Management.
How a One Source Matching Specification Proceeds
When a matching specification is run, it proceeds as follows:
- Match rules are run in the order their tabs appear in the
matching specification, left to right. The leftmost match rule will be
run first; the rightmost will be run last. Each match rule
takes the match table as an input and updates the
set_id
,prev_set_id
, andspecs
columns. - Manual merging and unmerging are applied.
- Consolidation rules are run. Consolidation takes the match table
as an input and creates a new consolidated table with the name defined
in the matching specification (the Consolidated table field). A
set of matched records are those that have the same
set_id
in the match table. - Manual edits of consolidated records are applied.
- You can review and edit the result of the previous steps in the Stewardship UI.
Match Table Overview
The match table is the input table that contains duplicate records that the matching specification will match. It will be updated with the result of the matching. The match table must have the following column:
-
data_id nvarchar(900)
: This is the primary key; a unique identifier that cannot be null.
When the matching specification is run, Fusion will check whether the match table has the following columns. Any missing columns will be added to the match table:
-
set_id nvarchar(900)
: This column is used to group records into a consolidated record. Records with the sameset_id
are duplicates. -
prev_set_id nvarchar(900)
: This is used for debugging. It stores the results of the previous match run so that you know which duplicates were identified in the current match run. -
match_specs nvarchar(4000)
: This is used to store a list of the names of matching rules that effected this record.
Creating a One Source Matching Specification
In the navigation tree, matching specifications are located in the Specifications section, in the Matching subsection. Before you can create a specification, you must have configured your data source.
To create a matching specification:
- In the navigation tree, right-click the Matching subsection of the Specifications section and select Create match specification.
- A pop-up window will appear that has the following configuration
options:
-
Specification name: The name of your new specification. After the screen is refreshed, the text "(one-source)" will be appended to the end of the specification name to indicate which type of matching specification it is.
-
Select type: Select Match one source.
-
Select source: Select the data source containing the data to be matched.
-
- When you are done editing the options, click Create. Your new specification will be displayed in the current window. To edit your specification, follow the instructions outlined in Editing a One Source Matching Specification.
Editing a One Source Matching Specification
For a discussion on selecting matching specification configuration settings, see How Data Matching for Deduplication (One Source Matching) Works.
To edit a matching specification:
- In the application's navigation tree, click the matching specification's name. Your specification will be displayed in the current window.
- Edit the specification's configuration options:
-
Specification Name: This field is pre-populated and cannot be edited.
-
Source Name: This field is pre-populated and may not be able to be changed.
-
Match table: Enter the name of the input table on which you want to perform matching. You can select the table name from the dropdown menu or start typing the name of the table to filter the dropdown menu.
If your data source is of type EnterworksSource, the dropdown may list the tables by name or it may list them by repository number. To determine the number of an EnterWorks repository, in the New UI, open the repository in the Detail Editor. The number of the repository is the digits at the end of the Detail Editor's tab's URL.
-
Consolidated table: The name of the table the system will create to contain the consolidated records compiled from matched records.
-
- At the bottom of the specification screen are a series of tabs that
hold additional configuration options:
-
Match:<match-rule-name>: Match rules
specify how duplicates will be found. Define one or more match
rules.
-
To add a match rule: Match rule tabs are added to the right of (after) an existing match rule tab. To add a match rule, right-click the match rule that will be run prior to the new rule and select Insert Match Run next.
-
To delete a match rule, right-click the match rule's tab and select Delete.
-
Match specification name: The name for your match rule. If you edit the name of your match rule, it may not appear on the tab immediately. If it does not, it will appear the next time the specification tab is opened.
-
Match columns:
-
To add a match column, click the Add column button. Depending on if Fusion has previously accessed the match table, the new match column's dropdown menu may or may not list the available columns. If it doesn't, start typing the name of the desired column and Fusion will display a filtered list of the available columns.
If your source is an EnterWorks repository, you will only be able to access the relational attributes, (the attributes in the snapshot table).
-
To delete a column, click the column's Delete button on the far right of the screen.
-
To rearrange the order of the columns, drag and drop them into the desired order.
-
-
Filter:
-
Select type: The type of rule determines the criteria for declaring two records to be a match. For more details on the types of match rules, see How Data Matching for Deduplication (One Source Matching) Works.
-
Exact: The records must exactly match.
-
Fuzzy, per column limit: Each column's match score must exceed the column's match cutoff.
-
Fuzzy, combined single limit:The combined match score for all the columns in a record must exceed the record's match cutoff.
-
-
Requires manual review:
-
- Manual Matching: These tables are created automatically when you create a new matching specification. They are managed by the system. They will be used to store the results of manual editing performed in the Stewardship User Interface. You can specify the names of the tables or let Fusion create the names.
- Stewardship: In the Stewardship UI, if the Show summary columns only checkbox is checked, only the columns specified in this tab will be displayed. If the checkbox is not checked, all the columns will be displayed. Add the columns that you want to appear in the Stewardship UI. To change the display order of the columns, drag and drop the columns to the desired positions.
-
Consolidation rules: Consolidation rules specify how to
compile a consolidated (merged) record from a set of duplicate
records. To add columns, click the Add column button
and edit the configuration options:
-
Column Name: The name of the column.
-
Survivorship Function: Select one of the following:
-
max: From the set of values, select the maximum value.
-
min: From the set of values in the matched records, select the minimum.
-
most_frequent: From the set of values in the matched records, select the value that appears most frequently.
-
source_priority,<priority table>,<source column>: Use the value in the record with the highest priority. Record priority is defined by sorting records by one or more columns. The value is taken from the first record in the result of the sorting or you can also use the priority specified in an additional table. For more information, see Source Priority Survivorship Function Details.
-
-
-
Match:<match-rule-name>: Match rules
specify how duplicates will be found. Define one or more match
rules.
Source Priority Survivorship Function Details
To use a source priority survivorship function in a
MatchOneSource
operation, use the following JSON
format to specify record priority:
{"filter":<SQL-expr>,"priorityTable":<priority-table-name>,"priorityTableReferenceColumn":<column-name-that-refers-to-the-priority-table>,"priorityColumns":<SQL-expression-with-sorting-columns>,"aggregate":<aggregate-function-name>}
The records are first filtered with the filter
, then
joined with the priorityTable
by the column
priorityTableReferenceColumn
of the input table and
the value
column of the priority table. Then the
aggregate
function is applied to each subgroup of
records that have the same value of the sorting columns. Everywhere you
use a SQL expression, use a t.
qualifier to refer to a
column of the input table and a p.
qualifier to refer to
a column of the priority table. The priority table must have a column
named value
that will be used in the join and a column
with a priority that will be used for sorting in the
priorityColumns
expression.
In the example below, records are prioritized using a priority table named
Source_Priority
and the most recently updated records
are preferred.
{"filter":null,"priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"}
In the next example we additionally require that the value in the
Fax
column must not be null.
{"filter":"t.[Fax] is not null","priorityTable":"Source_Priority","priorityTableReferenceColumn":"Source_System_Name","priorityColumns":"-p.priority desc, t.[Last_Updated] desc","aggregate":"max"}
Executing a One Source Matching Specification
Once a matching specification has been created, you can call it from a
scenario using the MatchOneSource
or operation, or you
can run it directly by right-clicking the specification name in the
navigation tree and selecting how you want it to be run.
The available execution options depend on the matching specification's type. Some options are only available when the matching specification is run as an operation.
In the list of matching execution options below:
-
This is how the execution option will appear when the matching specification's name is right-clicked.
-
(
This
) is the name of the execution option you would use if you include it in the JSON parameters of an operation.
The matching execution options are:
-
run match ruleā¦: (
execMatchRun
) runs a particular match rule (selected from a dropdown menu) in one of the two modes:-
match: runs the match run completely.
-
determine cutoffs: runs only part of the match run to generate the Weight Histogram Table, which is used to determine cutoff values in the Fuzzy, combined single limit type of match rule, (see How Data Matching for Deduplication (One Source Matching) Works).
-
-
run - match rules: (
execMatchRuns
) runs all matching rules without executing other steps such as consolidation. -
run - consolidation: (
execConsolidation
) runs the consolidation step. -
run - apply manual: (
execApplyManual
) Applies manual merges and unmerges. -
run all: (
execAll
) run all steps of the specification: adding required columns to the input table, matching, applying manual merging and unmerging, computing consolidated records, and applying manual editing of consolidated records. -
run all auto rules: (
execAllAutoRules
) The same as run all, but only auto matching rules are run. -
run all manual rules: (
execAllManualRules
) The same as run all but only manual matching rules are run. -
Remove all manual: (
removeAllManual
) The result of all manual manipulations performed in the Stewardship UI are stored in manual tables defined in the matching specification in the Manual Matching tab. Remove all manual deletes records from the manual tables. -
(
getPotentialMatches
) This option is only available when the option is used in an operation. Fusion takes the following parameters:-
tableName1
-
tableName2
-
targetTableName
and uses rules from the matching specification to identify all records from
tableName1
that are potential matches to records intableName2
. The resulting records are stored intargetTableName
. The potential matches are obtained by joining the two tables using only exact matching columns - fuzzy matching columns are not taken into account.getPotentialMatches
is usually used to optimize matching on incremental data loading. -