Remove Duplicates - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Removes duplicate records from one or more inputs based on the value(s) specified in the IdentifyDuplicatesBy property.

The output data is also sorted by this node.

Tip: To check for duplicates before removing them, you can use the Duplicate Detection node.

The simplest way to identify the fields from which you want to remove duplicate records is by using the Fields grid in the IdentifyDuplicatesBy property, as described below. Alternatively, for more advanced techniques, you can type Python script in the Advanced tab.

Note: You must choose to work in either the Fields grid or the Advanced tab, as working in one will reset the other.

  1. Connect the Remove Duplicates node to a node that has run successfully to ensure that the input data is available.
  2. Select the node and view the Properties panel.
  3. In the IdentifyDuplicatesBy property, select a field from the drop-down menu to be used in the identification of duplicate records. You can type in the drop-down to filter the list of fields to those which contain the typed text.
  4. You can add multiple fields by repeating the previous step. The menu button to the right of each item in the Fields grid also allows you to:
    • Sort your data in descending order (high to low). By default, your data will be sorted in ascending order (low to high).
    • Sort the selected field case insensitively.
    • Compare Substrings, see Removing duplicates based on a substring.
    • Delete the selected field.
Tip: If you have connected the Remove Duplicates node to a node that has not yet run, the IdentifyDuplicatesBy property will not contain a list of available fields. In this case, you can type the name of a field in the IdentifyDuplicatesBy property and press Enter to commit the field name. When the node is executed, an error will be generated if this field does not exist.

The IdentifyDuplicatesBy property is a multi-field picker, a property type which is found on a number of nodes. For more information on this property type, see Multi-field picker.

Examples

You have the following input data to a Remove Duplicates node:

Product_Codeunicode Product_Nameunicode
15 Tea
2 Coffee
3 Water
15 Tea-EarlGrey
15 Tea-Herbal
15 Tea

Removing duplicates across all input fields

If you do not enter a value in the IdentifyDuplicatesBy property, the duplicate detection will run across all of the input fields, meaning that only records that are identical across all fields will be removed.

One instance of "15, Tea" is removed:

Product_Codeunicode Product_Nameunicode
15 Tea
15 Tea-EarlGrey
15 Tea-Herbal
2 Coffee
3 Water

Removing duplicates based on a specified field

If you add the name of an input field in the IdentifyDuplicatesBy property, any duplicates in that field will be removed, regardless of whether the other input fields contain matching data.

  1. In the Fields grid of the IdentifyDuplicatesBy property, select the Product_Code field.
  2. Run the Remove Duplicates node.

Three records that have the Product_Code of "15" are removed, despite the differences in the Product_Name field:

Product_Codeunicode Product_Nameunicode
15 Tea
2 Coffee
3 Water

Removing duplicates based on a substring

  1. In the Fields grid of the IdentifyDuplicatesBy property, select the Product_Name field.
  2. In the IdentifyDuplicatesBy property, click the menu button to the right of the Product_Name field token and select Compare Substrings.
  3. The Compare Substrings dialog will open. From here, you can enable or disable the option, set a start position and set an optional end position (where the first character of a string is at position 0).
  4. To remove all duplicate records where the Product_Name field contains the same first three characters, in this case, to remove all duplicate "Tea" entries, regardless of the specific type of tea:
    1. Ensure that Compare substrings from start position is selected and leave the default setting of 0.
    2. In the end position box, type 2.
    3. Click Done.
  5. Run the Remove Duplicates node.

All duplicate records are removed based on the first three characters of the Product_Name:

Product_Codeunicode Product_Nameunicode
2 Coffee
15 Tea
3 Water

Properties

IdentifyDuplicatesBy

Select or type the names of the input fields to be used in the identification of duplicate records. From the menu button to the right of the field name, you can select Case Insensitive matching, or for more advanced cases you can choose to Compare Substrings. There is also an option to Delete a selected field from the list. The output records are also sorted by these matching criteria. You have the option to change the sort order to Sort Descending (high to low).

If you have added multiple fields, you can drag and drop the fields to reorder the sort if needed. The order of the sort criteria determines which field the data will be sorted by first.

For advanced use cases, you can select the Advanced tab to type Python script to specify the criteria to be used to identify duplicate records, e.g. using the notation fields.<name> separating each field reference with a comma. To sort in descending order, use the fn.desc function.

Example: fields.FirstName, fn.desc(fields.DOB)

If no value is given then all fields in the record will be used to identify duplicate records.

MergeOnly

If the node has multiple inputs, you can optionally specify whether to only merge the inputs — sorting records across the different inputs, rather than sorting within the inputs.

Note: To use this option, the input data must already be sorted, and the fields across all inputs must match.

Inputs and outputs

Inputs: Duplicates, multiple optional.

Outputs: Duplicates removed.