Detects duplicate records based on the values in specified fields, and separates the data into two outputs.
To configure this node:
- In the IdentifyDuplicatesBy property, select or type the names of the input fields to be used in the identification of duplicate records.
The IdentifyDuplicatesBy property is a multi-field picker, a property type that is found on a number of nodes. For more information on this property type, see Multi-field picker.
- By default, the node generates an error if duplicates are found. You can change this behavior by selecting False in the ErrorIfDuplicates property. When the ErrorIfDuplicates property is set to False, the node has two outputs:
  - The first output (single occurrence) contains all rows that have no duplicate in the specified fields.
  - The second output (multiple occurrence) contains all rows that do have duplicates in the specified fields.
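The split described above can be sketched in plain Python. This is an illustration only, not the node's implementation: the function name split_by_occurrence and the sample rows are hypothetical, and the sketch assumes that every copy of a duplicated record goes to the multiple occurrence output.

```python
from collections import Counter

def split_by_occurrence(rows, key_fields):
    """Partition rows by how often their key-field values occur in the input."""
    # Build one comparison key per row from the selected fields.
    keys = [tuple(row[f] for f in key_fields) for row in rows]
    counts = Counter(keys)
    # Keys seen exactly once go to the first list, all others to the second.
    single = [row for row, k in zip(rows, keys) if counts[k] == 1]
    multiple = [row for row, k in zip(rows, keys) if counts[k] > 1]
    return single, multiple

rows = [
    {"FirstName": "Ann", "DOB": "1990-01-01"},
    {"FirstName": "Bob", "DOB": "1985-05-05"},
    {"FirstName": "Ann", "DOB": "1990-01-01"},
]
single, multiple = split_by_occurrence(rows, ["FirstName", "DOB"])
```

Here single holds the one Bob row and multiple holds both Ann rows. The sketch does not reproduce the sorting that the node applies to its outputs.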
To remove duplicates, you can use the Remove Duplicates node.
Properties
IdentifyDuplicatesBy
Select or type the names of the input fields to be used in the identification of duplicate records. From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).
If you have added multiple fields, you can drag and drop the fields to reorder the sort if needed. The order of the sort criteria determines which field the data will be sorted by first.
For advanced use cases, you can select the Advanced tab and type a Python script that specifies the criteria used to identify duplicate records, using the notation fields.<name> and separating each field reference with a comma. To sort in descending order, use the fn.desc function.
Example: fields.FirstName, fn.desc(fields.DOB)
A value is required for this property.
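To make the ordering implied by the example expression concrete, the following standalone Python sketch sorts by FirstName ascending, then by DOB descending. It does not use the node's fields or fn objects. The sample records are hypothetical, and it is an assumption that the node orders its output in exactly this way.

```python
from operator import itemgetter

records = [
    {"FirstName": "Bob", "DOB": "1985-05-05"},
    {"FirstName": "Ann", "DOB": "1990-01-01"},
    {"FirstName": "Ann", "DOB": "1992-03-03"},
]

# Python's sort is stable, so sort by the secondary criterion first
# (DOB, descending) and then by the primary criterion (FirstName, ascending).
by_dob_desc = sorted(records, key=itemgetter("DOB"), reverse=True)
ordered = sorted(by_dob_desc, key=itemgetter("FirstName"))
```

The first field in the list is the primary sort criterion, which matches the note above that the order of the criteria determines which field the data is sorted by first.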
ErrorIfDuplicates
Optionally specify whether to generate an error if any duplicates are detected.
The default value is True.
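As a rough illustration of the two settings, the sketch below raises an error when a duplicate key is found and the flag is True, and otherwise returns the duplicated keys. The names validate_unique and DuplicatesFoundError are hypothetical and are not part of the product.

```python
class DuplicatesFoundError(Exception):
    """Hypothetical error type used only for this illustration."""

def validate_unique(rows, key_fields, error_if_duplicates=True):
    """Return the set of duplicated keys, or raise if the flag is set."""
    seen = set()
    duplicate_keys = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in seen:
            duplicate_keys.add(key)
        seen.add(key)
    if duplicate_keys and error_if_duplicates:
        raise DuplicatesFoundError(
            f"{len(duplicate_keys)} duplicated key(s) found"
        )
    return duplicate_keys

rows = [{"ID": 1}, {"ID": 2}, {"ID": 1}]
# With the flag set to False, processing continues and duplicates are reported.
found = validate_unique(rows, ["ID"], error_if_duplicates=False)
```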
Inputs and outputs
Inputs: Input to Validate.
Outputs: single occurrence, multiple occurrence.