This topic explains how Trillium handles UTF-8 encoding and presents solutions to issues that can arise when processing UTF-8 data.
What is UTF-8?
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. UTF-8 can represent every character in the Unicode character set and is backward compatible with ASCII.
Each character in UTF-8 is represented by one to four bytes. The ASCII characters are represented by one byte. Latin letters with diacritics and the characters of several other alphabets, including Greek and Arabic, are represented by two bytes. Three bytes are required for the rest of the Basic Multilingual Plane, which contains nearly all characters in common use, including the CJK (Chinese, Japanese, and Korean) characters. Four bytes are used for characters in the other planes of Unicode, such as various historic scripts.
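The byte counts described above can be checked with a short script. This is a minimal sketch in Python, used here only for illustration; it is not part of Trillium, and the sample characters and the helper name utf8_len are chosen for this example.

```python
# Byte length of individual characters when encoded as UTF-8.
# Characters are written as escapes so the file encoding does not matter.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with diacritic",
    "\u03b1": "Greek letter",
    "\u0639": "Arabic letter",
    "\u4e2d": "CJK ideograph",
    "\U00010348": "Gothic letter (outside the Basic Multilingual Plane)",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")
```

The script reports 1 byte for the ASCII letter, 2 bytes for the accented Latin, Greek, and Arabic letters, 3 bytes for the CJK ideograph, and 4 bytes for the Gothic letter.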
Multiplication Rule
Because UTF-8 is a variable-length encoding, Trillium treats it differently from other encodings. The Control Center converts UTF-8 data to UCS2, but when converting from UCS2 back to UTF-8 during an export, it multiplies the character length by 3 to obtain the byte length for each UTF-8 attribute in the ddx.
For example, if you have a UTF-8 attribute called Line 1 that is 10 characters long in the Control Center, Line 1 becomes 30 bytes (10 x 3) in the ddx when exported to batch or real-time. This is because TS Quality processes data in fixed-length format and, to avoid data truncation, must allow for the possibility that each UTF-8 character requires three bytes.
- The multiplication rule is applied to the user-defined UTF-8 attributes for all country projects.
- The rule is also used when you save a file in the external Schema Editor.
- The rule is not applied to the standard TSQ output attributes with UTF-8 encoding in an SAP or ZZ project.
- The rule is not applied to the attribute length in the input schema file(s) when the input data is fixed length. See Enabling UTF-8 Multiplication Rule for All Files and Projects for a workaround.
- The rule is not applied to legacy projects created prior to V13.0. See Enabling UTF-8 Multiplication Rule for All Files and Projects for a workaround.
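The arithmetic behind the multiplication rule can be sketched as follows. This is an illustration in Python only, not Trillium code; the function name utf8_export_length and the constant DEFAULT_FACTOR are hypothetical names introduced for this example.

```python
DEFAULT_FACTOR = 3  # multiplier described in this topic


def utf8_export_length(char_length: int, factor: int = DEFAULT_FACTOR) -> int:
    """Byte length reserved for a UTF-8 attribute: characters x factor."""
    return char_length * factor


# A 10-character UTF-8 attribute is exported as a 30-byte field.
print(utf8_export_length(10))  # 30

# Ten three-byte characters (here a CJK ideograph) fill the field exactly,
# so nothing is truncated.
worst_case = "\u4e2d" * 10
print(len(worst_case.encode("utf-8")))  # 30

# Ten ASCII characters use only 10 of the 30 reserved bytes.
print(len("ABCDEFGHIJ".encode("utf-8")))  # 10
```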
While the multiplication rule ensures that UTF-8 data is properly exported without truncation, it may negatively affect your project.
- Tripling the character length can consume more disk space than necessary and slow performance.
- Processes based on the character position or number of characters, such as schema redefinitions and attribute scans, may fail because the positions can become out of sync with the ddx after multiplication.
- An issue can also occur when a process setting makes a UTF-8 attribute reference a non-UTF-8 attribute or vice versa (for example, the Relationship Linker link file setting).
Solutions to the Multiplication Issues
Depending on your data and processes, you can use the following procedures to solve the issues arising from the multiplication rule.
Multiplication Factor
If you know that you have only single-byte or double-byte UTF-8 characters, you can reduce the multiplication factor from 3 to either 1 or 2 by editing the configuration file.
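One way to judge whether a smaller factor is safe is to measure the widest character in a representative sample of your data. The following Python sketch is an illustration under that assumption and is not a Trillium utility; the function name max_utf8_bytes_per_char and the sample values are hypothetical. A check on sample data cannot guarantee that future data stays within the same range.

```python
def max_utf8_bytes_per_char(values):
    """Return the largest UTF-8 byte count of any single character in values."""
    return max(
        (len(ch.encode("utf-8")) for value in values for ch in value),
        default=1,  # no characters at all: nothing needs more than one byte
    )


# Sample with ASCII, accented Latin, and Greek text: at most 2 bytes per character.
sample = ["Smith", "M\u00fcller", "\u0391\u03b8\u03ae\u03bd\u03b1"]
print(max_utf8_bytes_per_char(sample))  # 2

# Adding CJK text raises the requirement to 3 bytes per character.
print(max_utf8_bytes_per_char(sample + ["\u6771\u4eac"]))  # 3
```

A result of 1 or 2 for all UTF-8 attributes suggests that the corresponding factor would not truncate the sampled data.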
To change the multiplication factor
Schema Redefinitions
An issue arises when UTF-8 attributes are redefined as sub-attributes of the source attribute and the source attribute’s value is copied to each sub-attribute based on offset and number of characters. While this works within the Control Center, once the data is exported and the character lengths are multiplied, the redefinition will no longer work.
For example, you have an 8-character UTF-8 input attribute, Line 1. It is redefined into two UTF-8 sub-attributes in the Schema Editor: Sub 1 (offset = 0, width = 6 characters) and Sub 2 (offset = 6, width = 2 characters). The attributes have the following values.
- Line 1: ABCDEFGH
- Sub 1: ABCDEF
- Sub 2: GH
When exported, the lengths of all three attributes are multiplied by 3: Line 1 is now 24 bytes (8 x 3), Sub 1 is 18 bytes (6 x 3), and Sub 2 is 6 bytes (2 x 3). Because each of the 8 characters in this example occupies only one byte, all of them fit within the 18 bytes of Sub 1. As a result, the values of both redefined attributes end up in Sub 1 and there is no data in Sub 2.
- Line 1: ABCDEFGH
- Sub 1: ABCDEFGH
- Sub 2:
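The effect can be reproduced with a small simulation. This Python sketch is an illustration only and makes two assumptions that are not stated in this topic: the exported field is padded with spaces, and the byte offset of each sub-attribute is its character offset multiplied by the same factor. The function names to_fixed and redefine are hypothetical.

```python
def to_fixed(value: str, char_width: int, factor: int) -> bytes:
    """Encode value as UTF-8 and pad it with spaces to char_width x factor bytes."""
    return value.encode("utf-8").ljust(char_width * factor, b" ")


def redefine(record: bytes, offset_chars: int, width_chars: int, factor: int) -> str:
    """Read a sub-attribute from the record using multiplied offset and width."""
    start = offset_chars * factor
    end = start + width_chars * factor
    return record[start:end].decode("utf-8", errors="replace").rstrip()


# Factor 3: Line 1 is 24 bytes; Sub 1 covers bytes 0-17, Sub 2 covers bytes 18-23.
line1 = to_fixed("ABCDEFGH", 8, factor=3)
print(repr(redefine(line1, 0, 6, factor=3)))  # 'ABCDEFGH'
print(repr(redefine(line1, 6, 2, factor=3)))  # ''

# Factor 1: byte positions match character positions again.
line1 = to_fixed("ABCDEFGH", 8, factor=1)
print(repr(redefine(line1, 0, 6, factor=1)))  # 'ABCDEF'
print(repr(redefine(line1, 6, 2, factor=1)))  # 'GH'
```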
There are two ways to solve the issue:
- Using the Schema Editor, convert any redefined UTF-8 attributes to UCS2 in the first process (Transformer), and then convert them back to UTF-8 in the last process in the flow. See Modifying Attributes for the procedure to change the encoding of an attribute. This method is recommended.
- If there is no risk of data truncation, change the multiplication factor to 1 in the configuration file. To change the multiplication factor, see the procedure above.
Relationship Linker Link Files
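When a UTF-8 attribute is used for the "Attribute to write to Link file" (LINK_SOURCE_ID) setting, the Relationship Linker can fail with an error similar to the following:
14041E ERROR: DDL file:
<C:\project1\batch\ddl/e62_us_srtforrl_p7.ddx> DDL field(length):
<FROM_LINK>(<24>) is less than Settings file:
<C:\project1\batch\settings\e63_usrellink_p8.stx> parameter:
<LINK_SOURCE_ID> DDL field(length): <INPUT_LINE_01>(<72>).
Occurred in CMatcher::InitMatcher - (CMatcher::InitPrmVals)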
This is because the length of the link source attribute is multiplied but FROM_LINK and TO_LINK, the attributes that store the value of the link source attribute, are fixed as NOTRANS and not multiplied. To avoid this issue, make sure to use a non-UTF-8 attribute for the "Attribute to write to Link file" (LINK_SOURCE_ID) setting.
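The length comparison behind this error can be illustrated with the numbers from the message. This Python sketch is not the Relationship Linker's actual validation code; the function name link_length_ok is hypothetical, and the sketch assumes that the link source attribute is 24 characters long (72 / 3) and that a non-UTF-8 attribute occupies one byte per character.

```python
def link_length_ok(link_field_bytes: int, source_chars: int,
                   source_is_utf8: bool, factor: int = 3) -> bool:
    """Return True if the link field can hold the link source attribute."""
    # Only a UTF-8 source attribute has its length multiplied on export.
    source_bytes = source_chars * factor if source_is_utf8 else source_chars
    return link_field_bytes >= source_bytes


# UTF-8 link source: 24 characters become 72 bytes, but FROM_LINK stays at 24.
print(link_length_ok(24, 24, source_is_utf8=True))   # False

# Non-UTF-8 link source: the lengths match and the check passes.
print(link_length_ok(24, 24, source_is_utf8=False))  # True
```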
Enabling UTF-8 Multiplication Rule for All Files and Projects
Generally, the UTF-8 multiplication rule applies only to delimited files, not to fixed-length files. It is also not applied to legacy projects created prior to V13.0. To work around these limitations, you can manually enable the UTF-8 multiplication rule for all types of files and projects, as detailed in the following procedure.
To enable the UTF-8 multiplication rule for all types of files and projects