Enriches a large data set (left table) with a small data set(right table) by joining the two data sets based on common fields.
The Lookup node works in a similar way to the Join and Merge nodes and is an optimization of the inner and left joins. The Lookup node is recommended for use with a small data set on the right input as the entire right data set is loaded into memory. An advantage of using the Lookup node is that the input data sets do not need to be sorted prior to using the node. If you want to join two large data sets, you should use the Merge or Join node.
- Select the related fields on which you want to base the join.Note: The fields on which you base the join must be of a similar type. You can compare the number types int, long and double against each other, and fields of string data type can be compared to Unicode fields.
Use the Match Keys Grid to select the names of the fields on which you want to base the join. You can add multiple rows if you want to base the join on more than one set of fields.
From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list.
The output records are sorted by the matching criteria. You have the option to change the sort order to Sort Descending (high to low).
For more details on these options, see Match keys grid.
Alternatively, for more advanced techniques, select the Advanced tab, where you can use Python script to configure the node.
- Select which fields you want to include in the output.
By default, all fields from both inputs are included in the output. You can modify this behavior by excluding fields in the Field List grid.
If there are any field name conflicts, where the same case-insensitive field name appears in both inputs, the node will fail. To prevent a field name conflict error, you can use the Field List grid to either exclude or rename one of the fields.
Note: If a second output pin is added to the Lookup node, you must add a new pattern for this output in the Field List grid, see Example - Outputting left orphans to a second output pin.Excluding a field
A simple way to resolve a field name conflict is to untick one of the conflicting field names in the Field List grid. It is common practice to exclude the right match key field, regardless of field name, to avoid duplication of values in the output.
Renaming a field
If you want to output both of the fields with conflicting field names, you can rename one of them by typing a new name in the New Name column of the Field List grid.
Alternatively, you can type a new pattern at the data Fields or lookup Fields level to rename all fields from that input, for example
*_right
.Note: If you have applied an Exclude pattern, you cannot rename the fields by pattern. If a set of fields has an existing rename pattern and is set to Exclude, the rename pattern is cleared.
For more details on adding new patterns, and including, excluding and renaming fields, see Field list.
In the output, if a left orphan is found (i.e. if a record is not found in the right data set) the missing record will show as NULL in the right field.
Example
You have the following main data set containing a list of customers and locations:
Citystring | Customer_Namestring |
---|---|
Liverpool | Derek |
Leeds | Simon |
Sheffield | Alison |
Chester | David |
Liverpool | Aarti |
Liverpool | Jane |
Chester | Jerome |
Sheffield | Holly |
Sheffield | Stephen |
Liverpool | Wendy |
Leeds | Hugh |
Nottingham | Richard |
Bristol | Sarah |
You want to combine the above data set with the following, smaller data set, to add the name of the sales representative that is responsible for each customer:
citystring | sales_repstring |
---|---|
Liverpool | Jon |
Chester | Adam |
Sheffield | Paul |
Leeds | Claire |
You can combine the data sets based on the "City" and "city" fields, as the location information is common to both inputs and both of these fields contain string data.
- Run the two input nodes and drag a Lookup node onto the canvas.
- Connect the main data set to the data input pin of the Lookup node, and connect the smaller data set to the lookup input pin of the Lookup node.
- Select the Lookup node. In the Match Keys grid, select "City" for the Left Field and "city" for the Right Field.
- In the Field List grid, untick "city" from the lookup Fields list.
- Run the Lookup node. Note that by excluding the "city" field, this removes the field name conflict and allows the node to run successfully.
The output contains the following merged data set showing the location that each sales rep is responsible for. In the case of the Nottingham and Bristol records, there is no sales_rep in the lookup data set for these cities, so these non-matches show as NULL in the sales_rep column:
Citystring | Customer_Namestring | sales_repstring |
---|---|---|
Liverpool | Derek | Jon |
Leeds | Simon | Claire |
Sheffield | Alison | Paul |
Chester | David | Adam |
Liverpool | Aarti | Jon |
Liverpool | Jane | Jon |
Chester | Jerome | Adam |
Sheffield | Holly | Paul |
Sheffield | Stephen | Paul |
Liverpool | Wendy | Jon |
Leeds | Hugh | Claire |
Nottingham | Richard | NULL |
Bristol | Sarah | NULL |
Example - Outputting left orphans to a second output pin
After following the above steps, you decide to separate the non-matches (left orphans) to a second output pin. By default, if you add a second output pin and configure the output pattern for this new output, the matches will be output to the first pin and the non-matches will be output to the second pin:
- Add a new output pin on the Lookup node.
- Select the Lookup node, then click the Define tab of the Properties panel.
- Scroll to the bottom of the panel to the Outputs group.
- In the Output Name grid, type a name for the new output pin e.g. "Non-matches" and press Enter.
- Add a new pattern for the new output pin.
- Ensuring that the Lookup node is still selected, click the Configure tab of the Properties panel.
- Scroll down to the Field List grid, and select "Non-matches" from the Output pin list.
- Click Add new pattern, then select data Fields from the list of default patterns.
The default LookupOutput pin contains the following merged data set:
Citystring | Customer_Namestring | sales_repstring |
---|---|---|
Liverpool | Derek | Jon |
Leeds | Simon | Claire |
Sheffield | Alison | Paul |
Chester | David | Adam |
Liverpool | Aarti | Jon |
Liverpool | Jane | Jon |
Chester | Jerome | Adam |
Sheffield | Holly | Paul |
Sheffield | Stephen | Paul |
Liverpool | Wendy | Jon |
Leeds | Hugh | Claire |
The Non-matches output pin contains the two records for which there is no corresponding entry in the first data set:
Citystring | Customer_Namestring |
---|---|
Nottingham | Richard |
Bristol | Sarah |
Properties
Match Keys
The simplest way to merge your data is by using the Match Keys Grid to select the fields on which you want to base the join. Select or type a field from the left data set, then select or type a field from the right data set.
From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).
You can add multiple rows if you want to base the join on more than one set of fields. Alternatively, for more advanced techniques, select the Advanced tab.
ConfigureFields
Specify which fields to include in the node outputs using the Field List grid, or use the Advanced tab to specify the Python script to configure the node outputs and output fields.
This script is executed once in order to configure the output metadata prior to any records being processed. The details of any mappings from input field to output field are stored in the mapper for the output which can be used in the ProcessRecords property.
If there are no naming conflicts and no spaces in an input or output name, it can be
referenced directly using its name (e.g in1, out1). The inputs and outputs can also be
referenced using the format: inputs[index]
or inputs['name']
(and the equivalent format for outputs).
As this script is executed prior to reading any records, the inputs and outputs references refer specifically to the input and output metadata.
ProcessRecords
Specify the Python script to process the results of the lookup operation and write output records. This script is executed for each of the records from the main record.
- When a match is detected, lookupResult.match will evaluate to True and the matching input records from the main and lookup input will be provided to the script.
- When no match is detected, lookupResult.match will evaluate to False and the input record from the main input will be provided to the script.
The default script passes through input records to the outputs in the following manner:
- If the node has one output, fields from the main and lookup inputs will be passed through to the output in the case of a match. When there is no match, only fields from the main input will be passed through to the output.
- If the node has two outputs, fields from the main and lookup inputs will be passed through to the first output in the case of a match. When there is no match, fields from the main input will be passed through to the second output.
- Only the fields from the inputs which are configured to be mapped to an output field in the ConfigureFields property will be written to the output.
If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g in1, out1).
The inputs and outputs can also be referenced using the format:
inputs[index]
or inputs['name']
(and the equivalent
format for outputs). The input and output references in this property refer to the input and
output records as opposed to the metadata in the ConfigureFields property.
DuplicateLookupKeyBehavior
Optionally specify what to do if there are duplicate records in the lookup (right) data set, based on the values specified in the Match Keys property. Choose from:
- Error - The node fails when it encounters a duplicate value.
- Log - A warning is logged for the first duplicate value encountered. Records from the main input will only ever match the first of the duplicates in the lookup input.
- Ignore - The first matching lookup value is used and no warning or error is logged.
The default value is Error.
Example data flows
A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.
In the Directory under the /Data360 Samples/Node Examples/
folder, you will find "Joining and Blending Data", which shows examples of how to use this node.
Inputs and outputs
Inputs: data, lookup.
Outputs: LookupOutput, multiple optional.