Combines left, right and inner joins into a single node.
Joins two data sources using a comparison expression to determine if the specified data is in the left input only (first / top input pin), in the right input only (second / bottom input pin), or in both inputs. The results of each join are segregated and output to three different pins, as follows:
- First output pin (left orphans): Data that appears in the first (left) data source but not in the second (right) data source.
- Second output pin (matches): Data that appears in both data sources.
- Third output pin (right orphans): Data that appears in the second (right) data source but not in the first (left) data source.
To configure the Merge node:
- Select the related fields on which you want to base the join.Note: The fields on which you base the join must be of a similar type. You can compare the number types int, long and double against each other, and fields of string data type can be compared to Unicode fields.
Use the Match Keys Grid to select the names of the fields on which you want to base the join. You can add multiple rows if you want to base the join on more than one set of fields.
From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list.
The output records are sorted by the matching criteria. You have the option to change the sort order to Sort Descending (high to low).
For more details on these options, see Match keys grid.Tip: When comparing double type fields, ensure that the numbers are not significantly different, e.g. 2.0 and 2.001, as the values in this case will not match.
Alternatively, for more advanced techniques, select the Advanced tab, where you can use Python script to configure the node.Tip: The Merge node sorts input data by default. If the inputs are already sorted, you can improve the performance of the node by setting the optional SortLeftInput and SortRightInput properties to False.
- Select which fields you want to include in the output.
By default, all fields from both inputs are included in the output. You can modify this behavior by excluding fields in the Field List grid.
If there are any field name conflicts, where the same case-insensitive field name appears in both inputs, the node will fail. To prevent this type of error, use the Field List grid to either exclude or rename one of the fields.
Excluding a field
A simple way to resolve a field name conflict is to untick one of the conflicting field names in the Field List grid. It is common practice to exclude the right match key field, regardless of field name, to avoid duplication of values in the output.
Renaming a field
If you want to output both of the conflicting fields, you can rename one of them by typing a new name in the New Name column of the Field List grid.
Alternatively, you can type a new pattern at the data Fields or lookup Fields level to rename all fields from that input, for example
*_right.Note: If you have applied an Exclude pattern, you cannot rename the fields by pattern. If a set of fields has an existing rename pattern and is set to Exclude, the rename pattern is cleared.
For more details on adding new patterns, and including, excluding and renaming fields, see Field list.
The first data set (left) contains order transactions:
The second data set (right) contains customer information:
|1000101||MARCIA SIDONY||"10 Rose Lane, Springfield"|
|1000102||DAVE THURSTAN||"73 Gorham Street, Boston"|
|1000103||ALEXANDRA GILLIAN||"7 James Street, Florida"|
|1000104||HALEY JANNA||"78 Summer Way, Denver"|
You want to join these two data sets to see if there are any transactions that are not associated with a customer, and any customers who have not made any transactions. The input field
AccountID is common to both data sets, so you can use this field to join them (note that even if the names of the input fields did not match, if the contents of the fields are similar they can still be used to join).
- Firstly, run the left and right input data nodes and connect these to the Merge node. By first running the input nodes, this will ensure that the input data is available and allow you to select fields from a pre-populated list. Tip: If you have connected the Merge node to a node that has not yet run, the Match Keys Grid will not contain a list of available fields. In this case, you can type the name of a field and press Enter to commit the field name. When the node is executed, an error will be generated if this field does not exist.
- In the Match Keys Grid, select AccountID in both the Left Field and Right Field columns.
- The Output property group is expanded by default because the AccountID field name is exactly the same in both the left and right inputs, so you need to resolve this conflict before running the node:
- From the Field List tab of the ConfigureFields property, select matches from the Output pin menu.
- By default the node is configured to output all fields from both left and right inputs. Untick the AccountID field from the right Fields list.
- Run the Merge node.
The left orphans output pin lists the records that only appear in the first data set. In this case, it would contain one record:
The second output pin lists all matches:
|1000101||2016-01-01||Tea||MARCIA SIDONY||10 Rose Lane, Springfield|
|1000102||2016-01-01||Coffee||DAVE THURSTAN||73 Gorham Street, Boston|
|1000103||2016-01-01||Scone||ALEXANDRA GILLIAN||7 James Street, Florida|
|1000104||2016-01-02||Water||HALEY JANNA||78 Summer Way, Denver|
The right orphans output pin lists any records that only appear in the second data set, in this case, it would contain no data.
You can then continue your analysis by joining the matches data to another node, or you can investigate the left or right orphans.
The simplest way to merge your data is by using the Match Keys Gridto select the fields on which you want to base the join. Select or type a field from the left data set, then select or type a field from the right data set.
From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).
You can add multiple rows if you want to base the join on more than one field.
Alternatively, for more advanced techniques, select the Advanced tab.
A value is required for this property.
Specify which fields to include or exclude in each of the outputs.
The simplest way to specify which fields to output is to use the Field List grids. By default, the node will output all fields from both data sets.
From the Output pin menu, select left orphans, matches or right orphans:
- To exclude a field from the output, untick the field name from the list.
- To exclude all fields from the selected output pin, click the green plus icon to the left of the field title and select Exclude.
- To rename a field, type a new name next to the field in the New Name column.
- To rename all fields for a selected output pin with a generic pattern, type a pattern in the New Name box to the right of the field list title. For example, to add a prefix of "
Right_" to all field names in the right data set, type
Right_*in the right Fields New Name box. You can override a specified pattern for individual fields by typing an explicit name in the box to the right of the field name.
Alternatively, for more advanced techniques, select the Advanced tab, where you can use Python script to configure the node.
Specify the Python script to process the results of the merge operation and write output records.
This script is executed whenever a match is identified in the merge, or a record is seen which only appears on either the left or right input.
When a match is detected,
joinResult.match will evaluate to True and the matching input records from the left and right input will be provided to the script.
When a record is only seen in the left input,
joinResult.left will evaluate to True and the input record from the left input will be provided to the script.
When a record is only seen in the right input,
joinResult.right will evaluate to True and the input record from the right input will be provided to the script.
The default script passes through input records to the outputs in the following manner:
- Unmatched records from the left input will be passed through to the first output.
- Unmatched records from the right input will be passed through to the last output.
- Both the left and right input will be passed through to the middle output in case of a match.
Only the fields from these inputs which are configured to be mapped to output fields in the corresponding outputs via the ConfigureFields property will be written to the output fields.
If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g in1, out1). The inputs and outputs can also be referenced using the format:
inputs['name'] (and the equivalent format for outputs).
The input and output references in this property refer to the input and output records as opposed to the metadata in the ConfigureFields property.
Optionally specify whether the first input will be sorted on the LeftKey.
The default value is True whenever a LeftKey is specified.
Optionally specify whether the second input will be sorted on the RightKey.
The default value is True whenever a RightKey is specified.
Optionally specify whether or not the node should validate that the inputs are correctly sorted during the merge operation.
The default value is False provided that the LeftKey and RightKey are not specified or SortLeftInput and SortRightInput evaluate to True otherwise, the default value is True.
Optionally specify the maximum number of records from each input which can be held in memory.
The node only needs to hold records from both inputs in memory when there are large numbers of records on both inputs which each match the same key.
Generally, this property should not need to be set unless there are a lot of records from both inputs matching a given key and the records are too wide for 5,000 records from both inputs to be stored in memory, or if the records have a highly variable width in bytes - e.g. on most records a field is null, but occassionally is very large.
If no value is specified, the BufferSize will be dynamically calculated based on the width of the incoming records, where initially 5,000 records from each input may be loaded into memory.
Optionally specify what to do if no join key is specified. Choose from:
- Error - The node fails.
- Log - The node will continue processing. An explanatory message is written to the log.
- Ignore - The node will continue processing. Nothing is written to the log.
The default value is Error.
Example data flows
A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.
In the Directory under the
/Data360 Samples/Node Examples/ folder, you will find "Joining and Blending Data", which shows examples of how to use this node.
Inputs and outputs
Inputs: left, right.
Outputs: left orphans, matches, right orphans.