Separates the value in an attribute into tokens and returns the deduped, delimited list of tokens. It first searches the attribute value for duplicate phrases containing the maximum number of tokens, then repeats the search with one fewer token per phrase each time. This process continues until the number of tokens per phrase reaches the specified minimum.
You can use the DEDUPE function in the Transformer and the Set Selection utility.
Syntax
DEDUPE ("attribute", min, max, "separator")
where
- attribute is the attribute to be deduped.
- min is the minimum number of tokens that can comprise a phrase. Default is 1.
- max is the maximum number of tokens that can comprise a phrase. Default is 5.
- separator is the character or characters that separate tokens. Default is space (" ").
General Guidelines
- The search is case-sensitive. For example, "Car" and "car" are not duplicates. When the data is in mixed case, the DEDUPE function can be used with the UPPER or LOWER function.
- Duplicate phrases cannot extend across previously removed phrase(s) within a given record.
- Pay special attention to the logic for multi-token phrase processing outlined in Example 1. When a duplicate multi-token phrase is found, the original order of tokens in the attribute may not be maintained.
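The case-sensitivity guideline above can be illustrated outside the product with a short Python sketch. The helper `duplicated_tokens` is hypothetical and is not part of the DEDUPE function; it only reports which tokens would compare as equal, and Python's `str.lower()` stands in for the LOWER function.

```python
from collections import Counter

def duplicated_tokens(value, separator=" "):
    """Return tokens that occur more than once, compared case-sensitively.

    Hypothetical helper for illustration only; not the product's DEDUPE.
    """
    counts = Counter(t for t in value.split(separator) if t)
    return sorted(t for t, c in counts.items() if c > 1)

# "Car" and "car" differ in case, so no duplicate is detected.
print(duplicated_tokens("Car car truck"))          # []

# Folding case first (the role LOWER plays) exposes the duplicate.
print(duplicated_tokens("Car car truck".lower()))  # ['car']
```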
Guidelines for the Set Selection Utility
- The numerical values will be returned based on currently calculated precision and string formatting.
- Duplicate phrases cannot extend across boundaries of a given attribute for a given record within the set.
- The maximum length of the returned string is the length of the input attribute multiplied by the number of records in the set. For example, if the input attribute has a length of 30, there are 4 records in the set, each containing a full 30 characters of data, and no duplicates are found, the concatenation of the 4 records will be 120 characters. If this returned string is written back to a receiving attribute with a length of less than 120 characters, the returned data will be truncated to the length of the receiving attribute.
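The length arithmetic in the last guideline can be checked with a small Python sketch. The function name `returned_length` and the receiving length of 100 are assumptions chosen for illustration; they do not come from the product.

```python
def returned_length(attr_length, record_count, receiving_length):
    """Return (maximum returned length, length actually stored).

    Hypothetical helper: the maximum is the attribute length times the
    number of records; the stored value is cut to the receiving length.
    """
    max_len = attr_length * record_count
    return max_len, min(max_len, receiving_length)

# 4 records of 30 characters each, written to a 100-character attribute.
print(returned_length(30, 4, 100))  # (120, 100)
```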
Example 1 - multi-token phrases
DEDUPE(Attribute1, 1, 3, " ")
Attribute1 contains: 'one way st two way st wrong way st one way st'
The function searches from the beginning of the string for a duplicate of the longest phrase (3 tokens). Since "one way st" at the beginning has a duplicate, this phrase is added to the output, and the phrase and its duplicate are removed from the search string.
Note: If a duplicate is not found, it moves over one token and looks for a duplicate from the second token.
Output: one way st
Remaining search string: two way st wrong way st
There are no more 3-token duplicates, so it searches for 2-token phrases next. It searches for a duplicate from the beginning of the remaining string and, if none is found, moves over one token and searches from the second token. Since "way st" has a duplicate, the phrase is added to the output, and the phrase and its duplicate are removed from the search string.
Output: one way st way st
Remaining search string: two wrong
Since there are no more duplicates, the remaining tokens are added to the final output.
Final output: one way st way st two wrong
Example 2 - single token phrases
DEDUPE(Attribute1, 1, 1, " ")
Attribute1 contains: 'one way st two way st wrong way st one way st'
Since the maximum number of tokens is set to 1, only single tokens are considered. The function searches for duplicates from the beginning of the string. Since "one" has a duplicate, it is added to the output, and the token and its duplicate are removed from the search string.
Note: If a duplicate is not found, it moves over one token and looks for duplicates from the second token.
Output: one
Remaining search string: way st two way st wrong way st way st
Next, there are duplicates for "way", so the token is added to the output, and all of its occurrences are removed from the search string.
Output: one way
Remaining search string: st two st wrong st st
Next, there are duplicates for "st", so the token is added to the output, and all of its occurrences are removed from the search string.
Output: one way st
Remaining search string: two wrong
Non-duplicate tokens are added to the output as the search moves through the string. Therefore "two" is added to the output.
Output: one way st two
Finally, the remaining token is added to the output.
Final output: one way st two wrong
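The search described in both examples can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the product's implementation: the name `dedupe` is hypothetical, every character in the separator is assumed to delimit tokens, the output is joined with the first separator character, and leftover tokens are appended after all passes finish. A `GAP` placeholder marks each removed phrase so that later phrases cannot span a removal, as the general guidelines require.

```python
import re

GAP = None  # placeholder left where a phrase was removed

def dedupe(value, min_tokens=1, max_tokens=5, separator=" "):
    """Illustrative sketch of the documented search; not the real DEDUPE."""
    if min_tokens < 1 or max_tokens < min_tokens or not separator:
        raise ValueError("invalid min, max, or separator")
    # Assumption: each character in `separator` delimits tokens.
    pattern = "[" + re.escape(separator) + "]+"
    remaining = [t for t in re.split(pattern, value) if t]
    output = []
    # Longest phrases first, then one token fewer on each pass.
    for n in range(max_tokens, min_tokens - 1, -1):
        i = 0
        while i + n <= len(remaining):
            phrase = remaining[i:i + n]
            if GAP not in phrase:  # phrases may not span a removal
                # Collect later, non-overlapping occurrences of the phrase.
                matches = []
                j = i + n
                while j + n <= len(remaining):
                    if remaining[j:j + n] == phrase:
                        matches.append(j)
                        j += n
                    else:
                        j += 1
                if matches:
                    output.extend(phrase)
                    # Remove every occurrence, last first, so earlier
                    # indexes stay valid while the list shrinks.
                    for start in reversed([i] + matches):
                        remaining[start:start + n] = [GAP]
            i += 1  # move over one token
    # Assumption: leftover tokens are appended in their original order.
    output.extend(t for t in remaining if t is not GAP)
    return separator[0].join(output)

text = "one way st two way st wrong way st one way st"
print(dedupe(text, 1, 3, " "))  # one way st way st two wrong
print(dedupe(text, 1, 1, " "))  # one way st two wrong
```

Both calls reproduce the final outputs of Example 1 and Example 2. For inputs where a non-duplicate token precedes a duplicate found in the last pass, the real function may order the output differently, since the documentation does not fully specify that case.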