Removes duplicate records from one or more inputs based on the value(s) specified in the IdentifyDuplicatesBy property.
The output data is also sorted by this node.
The simplest way to identify the fields from which you want to remove duplicate records is by using the Fields grid in the IdentifyDuplicatesBy property, as described below. Alternatively, for more advanced techniques, you can type Python script in the Advanced tab.
- Connect the Remove Duplicates node to a node that has run successfully to ensure that the input data is available.
- Select the node and view the Properties panel.
- In the IdentifyDuplicatesBy property, select a field from the drop-down menu to be used in the identification of duplicate records. You can type in the drop-down to filter the list of fields to those which contain the typed text.
- You can add multiple fields by repeating the previous step. The menu button to the right of each item in the Fields grid also allows you to:
- Sort your data in descending order (high to low). By default, your data will be sorted in ascending order (low to high).
- Sort the selected field case insensitively.
- Compare Substrings, see Removing duplicates based on a substring.
- Delete the selected field.
The IdentifyDuplicatesBy property is a multi-field picker, a property type which is found on a number of nodes. For more information on this property type, see Multi-field picker.
Examples
You have the following input data to a Remove Duplicates node:
Product_Codeunicode | Product_Nameunicode |
---|---|
15 | Tea |
2 | Coffee |
3 | Water |
15 | Tea-EarlGrey |
15 | Tea-Herbal |
15 | Tea |
Removing duplicates across all input fields
If you do not enter a value in the IdentifyDuplicatesBy property, the duplicate detection will run across all of the input fields, meaning that only records that are identical across all fields will be removed.
One instance of "15, Tea" is removed:
Product_Codeunicode | Product_Nameunicode |
---|---|
15 | Tea |
15 | Tea-EarlGrey |
15 | Tea-Herbal |
2 | Coffee |
3 | Water |
Removing duplicates based on a specified field
If you add the name of an input field in the IdentifyDuplicatesBy property, any duplicates in that field will be removed, regardless of whether the other input fields contain matching data.
- In the Fields grid of the IdentifyDuplicatesBy property,
select the
Product_Code
field. - Run the Remove Duplicates node.
Three records that have the Product_Code
of "15" are removed, despite the differences in the Product_Name
field:
Product_Codeunicode | Product_Nameunicode |
---|---|
15 | Tea |
2 | Coffee |
3 | Water |
Removing duplicates based on a substring
- In the Fields grid of the IdentifyDuplicatesBy property, select the
Product_Name
field. - In the IdentifyDuplicatesBy property, click the menu button to the right of the
Product_Name
field token and select Compare Substrings. - The Compare Substrings dialog will open. From here, you can enable or disable the option, set a start position and set an optional end position (where the first character of a string is at position 0).
- To remove all duplicate records where the
Product_Name
field contains the same first three characters, in this case, to remove all duplicate "Tea" entries, regardless of the specific type of tea:- Ensure that Compare substrings from start position is selected and leave the default setting of 0.
- In the end position box, type 2.
- Click Done.
- Run the Remove Duplicates node.
All duplicate records are removed based on the first three characters of the Product_Name
:
Product_Codeunicode | Product_Nameunicode |
---|---|
2 | Coffee |
15 | Tea |
3 | Water |
Properties
IdentifyDuplicatesBy
Select or type the names of the input fields to be used in the identification of duplicate records. From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).
If you have added multiple fields, you can drag and drop the fields to reorder the sort if needed. The order of the sort criteria determines which field the data will be sorted by first.
For advanced use cases, you can select the Advanced tab to type Python script to specify the criteria to be used to identify duplicate records, e.g. using the notation fields.<name>
separating each field reference with a comma. To sort in descending order, use the fn.desc
function.
Example: fields.FirstName, fn.desc(fields.DOB)
If no value is given then all fields in the record will be used to identify duplicate records.
MergeOnly
If the node has multiple inputs, you can optionally specify whether to only merge the inputs — sorting records across the different inputs, rather than sorting within the inputs.
Inputs and outputs
Inputs: Duplicates, multiple optional.
Outputs: Duplicates removed.