Tokens

Trillium Parser Tuner

Product type: Software
Portfolio: Verify
Product family: Trillium
Product: Trillium > Trillium Quality; Trillium > Trillium Discovery
Version: Latest
Language: English
Product name: Trillium Quality and Discovery
Title: Trillium Parser Tuner
Copyright: 2024
First publish date: 2008
Last updated: 2024-10-18
Published on: 2024-10-18T14:59:24.246276

A token is a string, consisting of any character, word, or phrase, that is enclosed in single quotes on the left side of the equation in the definition. For example:

'MARY' INS NAME BEG ATT=GVN-NM1,GEN=F -- the token is 'MARY'

'HOLD MAIL' STREET ATT=HOLD -- the token is 'HOLD MAIL'
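
As a rough illustration of the examples above, the following Python sketch pulls the quoted token off the left side of a definition entry. The helper name `extract_token` is hypothetical and not part of Trillium, and the sketch assumes that a token never contains an embedded single quote.

```python
import re

# Hypothetical helper, not part of Trillium: matches a single-quoted
# string at the start of a definition entry.
# Assumption: the token itself never contains a single quote.
_TOKEN_RE = re.compile(r"^\s*'([^']*)'")


def extract_token(entry: str) -> str:
    """Return the text between the first pair of single quotes."""
    match = _TOKEN_RE.match(entry)
    if match is None:
        raise ValueError(f"no quoted token at start of entry: {entry!r}")
    return match.group(1)


print(extract_token("'MARY' INS NAME BEG ATT=GVN-NM1,GEN=F"))  # MARY
print(extract_token("'HOLD MAIL' STREET ATT=HOLD"))            # HOLD MAIL
```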

Guidelines

  • The maximum number of characters for a token is 100.
  • A token cannot wrap to a second line.
  • A token may include one or more sub-tokens or masks.
  • A token or sub-token cannot start with a space. Any leading spaces are ignored. For example, '  sons' is treated as 'sons'.
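
The guidelines above could be checked before a token is added to a definition file. The Python sketch below is only an illustration: `normalize_token` is a hypothetical name, and the choice to measure the 100-character limit after leading spaces are dropped is an assumption, since the guidelines do not say which length is counted.

```python
MAX_TOKEN_LENGTH = 100  # limit stated in the guidelines


def normalize_token(token: str) -> str:
    """Apply the guidelines to the text between the single quotes.

    Hypothetical helper, not part of Trillium.
    """
    # A token cannot wrap to a second line.
    if "\n" in token or "\r" in token:
        raise ValueError("a token cannot wrap to a second line")
    # Leading spaces are ignored.
    normalized = token.lstrip(" ")
    # Assumption: the limit applies to the token after leading spaces are dropped.
    if len(normalized) > MAX_TOKEN_LENGTH:
        raise ValueError(f"token exceeds {MAX_TOKEN_LENGTH} characters")
    return normalized


print(normalize_token("  sons"))  # sons
```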

Sub-tokens

A sub-token is a string within a token. A sub-token may appear at the beginning or end of the token. For example, 'STRASSE' can be a sub-token of 'BERGENSTRASSE'. Use the following keywords in the definition to specify sub-tokens:

  • Beginning-Token (BEG-TKN). Keyword that indicates that the sub-token position is at the beginning of a token.
  • Ending-Token (END-TKN). Keyword that indicates that the sub-token position is at the end of a token.

If a pattern entry contains a sub-token, you must specify whether the sub-token should be separated from the word or attached to it.

For example, assume your data contains 'BERGENSTRASSE 12'. A definition entry might exist in this format:

'STRASSE' STREET ENDING-TOKEN ATT=STR-TYPE-S

The following pattern is required in order to separate the sub-token from the word:

'ALPHA STR-TYPE-S NUMERIC' PATTERN STREET REC='STR-NM STR-TYPE HSNO'

The following pattern is required in order to keep the sub-token attached:

'ALPHA NUMERIC' PATTERN STREET REC='STR-NM HSNO'

Note: You cannot specify one pattern to attach the sub-token and one pattern to separate the sub-token within the same file.
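
To make the separated case concrete, the Python sketch below splits a word at a beginning or ending sub-token. It is not how the parser is implemented. The function name `split_sub_token` is hypothetical, the position is passed as the plain strings "BEG-TKN" and "END-TKN", and the rule that the sub-token must be shorter than the word is an assumption.

```python
from typing import Optional, Tuple


def split_sub_token(word: str, sub_token: str, position: str) -> Optional[Tuple[str, str]]:
    """Split word into two parts, in their original order, at the sub-token.

    Hypothetical helper, not part of Trillium. Returns None when the
    sub-token is not found at the requested position.
    """
    # Assumption: a sub-token is a proper part of the word, so a word that
    # is no longer than the sub-token is not split.
    if not sub_token or len(word) <= len(sub_token):
        return None
    if position == "END-TKN" and word.endswith(sub_token):
        return word[: -len(sub_token)], sub_token
    if position == "BEG-TKN" and word.startswith(sub_token):
        return sub_token, word[len(sub_token):]
    return None


print(split_sub_token("BERGENSTRASSE", "STRASSE", "END-TKN"))  # ('BERGEN', 'STRASSE')
```

In the separated pattern, the two parts correspond to the ALPHA and STR-TYPE-S positions. In the attached pattern, the word stays whole as a single ALPHA.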

Masks

A mask is a description of a word or phrase. Masks define characters of data elements using:

  • n to represent a number (0-9).
  • a-z to represent lowercase alphabetic letters.
  • The character itself to represent any character that is not a letter or number.

For example, a mask can define any series of five numerals as a postal code, so you do not have to enter each of the 100,000 possible combinations (00000 through 99999) in the table. This mask token looks like:

'nnnnn' MASK GEOG DEF ATT=POSTCODE

Masks may include special characters if they are part of the word representation. For example, a mask for the nine-digit postal code is:

'nnnnn-nnnn' MASK GEOG DEF ATT=POSTCODE
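
The two postal code masks above can be read as simple position-by-position comparisons. The Python sketch below illustrates that reading under stated assumptions and is not the parser's actual matching logic. `mask_matches` is a hypothetical name, and the sketch handles only `n` and literal special characters. It rejects masks that contain other letters or digits because their exact matching rules are not shown in the examples here.

```python
def mask_matches(mask: str, value: str) -> bool:
    """Return True if value fits mask, comparing position by position.

    Hypothetical helper, not part of Trillium. Supports only:
      n                   -> any digit 0-9
      special characters  -> the same character in the value
    """
    # Letter classes are outside this sketch, so reject them up front.
    for mask_char in mask:
        if mask_char != "n" and mask_char.isalnum():
            raise ValueError(f"unsupported mask character: {mask_char!r}")
    if len(mask) != len(value):
        return False
    for mask_char, value_char in zip(mask, value):
        if mask_char == "n":
            if not ("0" <= value_char <= "9"):
                return False
        elif mask_char != value_char:
            return False
    return True


print(mask_matches("nnnnn", "02134"))            # True
print(mask_matches("nnnnn-nnnn", "02134-1234"))  # True
print(mask_matches("nnnnn", "0213A"))            # False
```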