Filtering Out Duplicate Records - spectrum_quality_1 - 23.1

Spectrum Data Quality Guide

Product type
Software
Portfolio
Verify
Product family
Spectrum
Product
Spectrum > Quality > Spectrum Quality
Version
23.1
Language
English
Product name
Spectrum Data Quality
Title
Spectrum Data Quality Guide
Topic type
Overview
Reference
Tips
How Do I
First publish date
2007
ft:lastEdition
2024-03-04
ft:lastPublication
2024-03-04T22:52:13.486265

The simplest way to remove duplicate records is to add a Filter stage to your dataflow after a matching stage. The Filter stage removes records from collections of duplicate records based on the settings you specify.

  1. In Enterprise Designer, create a dataflow that identifies duplicate records through matching.

    Matching is the first step in deduplication because you need to identify records that are similar, such as records that have the same account number or name. See the following topics for instructions on creating a dataflow that matches records.

    Note: You only need to build the dataflow to the point where it reads data and performs matching with an Interflow Match, Intraflow Match, or Transactional Match stage. Once you have created a dataflow to this point, continue with the following steps.
  2. Once you have defined a dataflow that reads data and matches records, drag a Filter stage to the canvas and connect it to the stage that performs the matching (Interflow Match, Intraflow Match, or Transactional Match).

    For example, if your dataflow reads data from a file and performs matching with Intraflow Match, your dataflow would look like this after adding a Filter stage:

    Filter stage in dataflow
  3. Double-click the Filter stage on the canvas.
  4. In the Group by field, select CollectionNumber.
  5. Leave the option Limit number of returned duplicate records selected and the value set to 1. These are the default settings.
  6. Decide if you want to keep the first record in each collection, or if you want to define a rule to choose which record from each collection to keep. If you want to keep the first record in each collection, skip this step. If you want to define a rule, in the rule tree, select Rules then follow these steps:
    1. Click Add Rule.

      Records in each group are evaluated to see if they meet the rules you define here. If a record meets the rule, it is the surviving record and the other records in the group are discarded.

    2. Define a rule to identify the record from each group to retain.

      Use the following options to define a rule:

      Option Description

      Field name

      Specifies the name of the dataflow field whose value you want to evaluate to determine whether to filter the record.

      Field Type

      Specifies the type of data in the field. One of the following:

      Non-Numeric
      Choose this option if the field contains non-numeric data (for example, string data).
      Numeric
      Choose this option if the field contains numeric data (for example, double, float, and so on).

      Operator

      Specifies the type of comparison you want to use to evaluate the field. One of the following:

      Contains
      Determines if the field contains the value specified. For example, "sailboat" contains the value "boat".
      Equal
      Determines if the field contains the exact value specified.
      Greater Than
      Determines if the field value is greater than the value specified. This operation only works on numeric fields.
      Greater Than Or Equal To
      Determines if the field value is greater than or equal to the value specified. This operation only works on numeric fields.
      Highest
      Compares the field's value for all the records group and determines which record has the highest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 100 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
      Is Empty
      Determines if the field contains no value.
      Is Not Empty
      Determines if the field contains any value.
      Less Than
      Determines if the field value is less than the value specified. This operation only works on numeric fields.
      Less Than Or Equal To
      Determines if the field value is less than or equal to the value specified. This operation only works on numeric fields.
      Longest
      Compares the field's value for all the records group and determines which record has the longest (in bytes) value in the field. For example, if the group contains the values "Mike" and "Michael", the record with the value "Michael" would be selected. If multiple records are tied for the longest value, one record is selected.
      Lowest
      Compares the field's value for all the records group and determines which record has the lowest value in the field. For example, if the fields in the group contain values of 10, 20, 30, and 100, the record with the field value 10 would be selected. This operation only works on numeric fields. If multiple records are tied for the longest value, one record is selected.
      Most Common
      Determines if the field value contains the value that occurs most frequently in this field among the records in the group. If two or more values are most common, no action is taken.
      Not Equal
      Determines if the field value is not the same as the value specified.

      Value type

      Specifies the type of value you want to compare to the field's value. One of the following:

      Note: This option is not available if you select the operator Highest, Lowest, or Longest.
      Field
      Choose this option if you want to compare another dataflow field's value to the field.
      String
      Choose this option if you want to compare the field to a specific value.
      Value

      Specifies the value to compare to the field's value. If you selected Field in the Field type field, select a dataflow field. If you selected String in the Value type field, type the value you want to use in the comparison.

      Note: This option is not available if you select the operator Highest, Lowest, or Longest.
    3. Click OK.

      You have now configured Filter with one rule. You can add additional rules if needed.

  7. Click OK to close the Filter Options window.
  8. Drag a sink stage onto the canvas and connect it to the Filter stage.

    For example, if you were using a Write to File sink stage your dataflow would look like this:

    Write to File in dataflow
  9. Double-click the sink stage and configure it.

    For information on configuring sink stages, see the Dataflow Designer's Guide.

You now have a dataflow that identifies matching records and removes all but one record for each group of duplicates, resulting in an output file that contains deduplicated data.