An encoding is a mapping of binary values to code positions that represent characters of data. It is also called a code page. Trillium DQ supports most encodings as input data. Once the data is passed in, Trillium uses the following mechanisms to handle the various encodings:
- The Control Center converts all data into UCS2 (2-byte Universal Character Set) and modifies the ddx (schema file) accordingly, regardless of the encodings used in the original input data, Schema Editor, and country project templates.
- In the Schema Editor, all attribute lengths are defined in number of characters, including offsets and redefinitions.
- In ddx files, all attribute lengths are defined in number of bytes, including offsets and redefinitions.
- When you export a project to batch or real-time processing, Trillium converts the encoding from UCS2 as follows:
  - For the user-defined attributes, it converts the encoding from UCS2 back to the encoding used in the original input data. The user-defined attributes are initial input attributes and any other attributes added by the user in the process flow.
  - For the standard TSQ output attributes, it converts the encoding from UCS2 back to the encoding used in the country project template. The standard TSQ output attributes are predefined attributes set by Trillium, such as PR_LINE_RULES for the Customer Data Parser.
- Most of the country project templates use NOTRANS (system default) or code pages such as CP1251 (Russian) in the ddx files. The country project templates for the ZZ (basic) countries and SAP projects use UTF8.

Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details.
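
The distinction between lengths counted in characters (Schema Editor) and lengths counted in bytes (ddx files) can be illustrated outside Trillium. The following Python sketch is not Trillium code: the attribute names and lengths are hypothetical, and Python's `utf-16-le` codec is used as a stand-in for UCS2, which it matches only for characters in the Basic Multilingual Plane. It also shows why a variable-length encoding such as UTF8 cannot be sized with a fixed bytes-per-character factor.

```python
# Illustrative only: hypothetical attributes, not a real Trillium schema.
# Lengths are given in characters, as they would be in the Schema Editor.
ATTRIBUTES = [("NAME", 10), ("CITY", 8)]

BYTES_PER_UCS2_CHAR = 2  # UCS2 stores every character in exactly 2 bytes


def byte_layout(attributes, bytes_per_char=BYTES_PER_UCS2_CHAR):
    """Convert character lengths to (name, byte_offset, byte_length) tuples."""
    layout = []
    offset = 0
    for name, char_len in attributes:
        byte_len = char_len * bytes_per_char
        layout.append((name, offset, byte_len))
        offset += byte_len
    return layout


def encoded_sizes(text):
    """Return the byte count of text in a fixed-width and a variable-width encoding."""
    return {
        "ucs2": len(text.encode("utf-16-le")),  # stand-in for UCS2 (BMP characters only)
        "utf8": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    print(byte_layout(ATTRIBUTES))
    for sample in ("Boston", "Москва", "東京"):
        print(sample, len(sample), encoded_sizes(sample))
```

In the fixed-width case the byte length is always twice the character count, so offsets can be derived by simple multiplication. In UTF8 the byte count depends on which characters are present, so the same multiplication does not hold.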
The following table lists the main character encodings used in TS Quality.
Type | Description |
---|---|
NOTRANS | NOTRANS means No Translation. The operations will be done in the default encoding for the host computer. Note: Users need to be careful that the data will not be translated into their native encoding. For example, if a data file from Greece is run on a computer in the US and both the settings files and all of the fields in the schema are set to NOTRANS, you will likely get a different result than if the same project was run in Greece. |
ASCII | American Standard Code for Information Interchange (ASCII). A 7-bit encoding for representing English characters. |
BIG5 | Traditional Chinese |
CCSID937 | Traditional Chinese, CID937 |
CP037 | EBCDIC, IBM037 |
CP1250 | Latin 2, Eastern European |
CP1251 | Cyrillic (Russian, Bulgarian, Serbian Cyrillic, etc.) |
CP1252 | Latin1 (ANSI) |
CP1253 | Greek |
CP1254 | Turkish |
CP1255 | Hebrew |
CP1256 | Arabic |
CP1257 | Baltic |
CP1258 | Vietnamese |
CP932 | Microsoft Extended Shift-JIS Japanese |
CP936 | Simplified Chinese, GBK |
CP949 | Korean |
CP950 | Traditional Chinese |
1 | 8-bit character encoding used on IBM mainframe operating systems such as z/OS. |
EBCDIC | Extended Binary Coded Decimal Interchange Code (EBCDIC) used in the mainframe environment. |
EUCCN | Simplified Chinese, Unix, GB2312, EUC-SC |
EUCJP | Japanese, Unix, EUC-JP, EUC-J, JEUC, J-EUC, EUCJ |
EUCKR | Korean, Unix, EUC-KR, KS_C_5861-1992 |
EUCTW | Traditional Chinese, Unix, CNS-11643, CNS-11643-1992 |
GB12345 | Traditional Chinese |
HZGB2312 | Simplified Chinese, HZ-GB-2312 |
IBM-83-4040, IBM-83-4242 | Japanese corporate kanji code |
ISO2022JP | Japanese, ISO-2022-JP |
ISO-8859-7 | Latin/Greek |
ISO-8859-9 | Latin-1 modification for Turkish (Latin-5) |
JEF-83-A1A1, JEF-83-4040, JEF-78-A1A1, JEF-78-4040 | Japanese corporate kanji code. Fujitsu. |
JOHAB | Korean |
KEIS-83-A1A1, KEIS-83-4040, KEIS-78-A1A1, KEIS-78-4040 | Japanese corporate kanji code. Hitachi. |
LATIN1 | ISO 8859-1 |
LATIN2 | ISO 8859-2 |
LATIN4 | Baltic |
LATIN7 | Baltic |
LATIN9 | ISO 8859-15, Latin1 + Euro symbol and accented characters |
ShiftJIS | Microsoft Extended Japanese |
UCS2 | 2-byte Universal Character Set. The encoding of Unicode as 16-bit values. |
UNICODE20:BIG-ENDIAN | Unicode with the most significant byte first. Other name: big-endian |
UNICODE20:LITTLE-ENDIAN | Unicode with the least significant byte first. Other name: little-endian |
UTF7 | 7-bit Unicode transformation format, variable-length character encoding. |
UTF8 | 8-bit Unicode transformation format; variable length character encoding for Unicode that is also compatible with ASCII. Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details. |
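
The NOTRANS caution in the table can be reproduced with a small experiment. The sketch below is plain Python, not Trillium code. It assumes, purely for illustration, that a Greek host defaults to CP1253 and a US host defaults to CP1252; actual host defaults vary by platform and configuration. The helper `read_untranslated` is hypothetical and simply decodes bytes in whatever encoding the host uses.

```python
# Illustrative only: Python codecs stand in for host default encodings.
GREEK_HOST = "cp1253"  # assumed default code page on a Greek host
US_HOST = "cp1252"     # assumed default code page on a US host


def read_untranslated(raw, host_encoding):
    """Interpret raw bytes in the host's default encoding, with no translation step."""
    return raw.decode(host_encoding)


original = "Αθήνα"                       # "Athens" in Greek
raw_bytes = original.encode(GREEK_HOST)  # bytes as written on the Greek host

on_greek_host = read_untranslated(raw_bytes, GREEK_HOST)
on_us_host = read_untranslated(raw_bytes, US_HOST)

if __name__ == "__main__":
    print(raw_bytes.hex())
    print(on_greek_host)
    print(on_us_host)
```

The bytes never change, only their interpretation: on the assumed Greek host they read back as the original Greek text, while on the assumed US host the same five bytes are taken as Latin letters. This is why a project that leaves everything at NOTRANS may produce different results on hosts with different default encodings.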