Encoding (Code Page) - trillium_discovery - trillium_quality - 17.1

First publish date

2008

An encoding, also called a code page, is a mapping of binary values to code positions that represent characters of data. Trillium DQ supports most encodings as input data. Once the data is passed in, Trillium uses the following mechanisms to handle the various encodings:

- The Control Center converts all data into UCS2 (2-byte Universal Character Set) and modifies the ddx (schema file) accordingly, regardless of the encodings used in the original input data, Schema Editor, and country project templates.
- In the Schema Editor, all attribute lengths are defined in number of characters, including offsets and redefinitions.
- In ddx files, all attribute lengths are defined in number of bytes, including offsets and redefinitions.
- When you export a project to batch or real-time processing, Trillium converts the encoding from UCS2 as follows:
  - For the user-defined attributes, it converts the encoding from UCS2 back to the encoding used in the original input data. The user-defined attributes are initial input attributes and any other attributes added by the user in the process flow.
  - For the standard TSQ output attributes, it converts the encoding from UCS2 back to the encoding used in the country project template. The standard TSQ output attributes are predefined attributes set by Trillium, such as PR_LINE_RULES for the Customer Data Parser.
- Most of the country project templates use NOTRANS (system default) or code pages such as CP1251 (Russian) in the ddx files. The country project templates for the ZZ (basic) countries and SAP projects use UTF8.

Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details.
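
The difference between character-based lengths (Schema Editor) and byte-based lengths (ddx files) can be illustrated with a short sketch. This is plain Python, not Trillium code: the helper `char_span_to_byte_span` and the sample value are hypothetical, and Python's `utf-16-le` codec stands in for UCS2 on the assumption that the data contains only Basic Multilingual Plane characters, where the two agree. The last lines suggest why a variable-length encoding such as UTF8 cannot be handled with a fixed multiplier.

```python
# Hypothetical helper, not part of Trillium: converts a character-based
# offset and length into a byte-based offset and length for a fixed-width
# 2-byte encoding such as UCS2.
UCS2_BYTES_PER_CHAR = 2


def char_span_to_byte_span(char_offset, char_length):
    """Return (byte_offset, byte_length) for a 2-byte fixed-width encoding."""
    return (char_offset * UCS2_BYTES_PER_CHAR,
            char_length * UCS2_BYTES_PER_CHAR)


# Made-up sample value: the Greek word for Athens, 5 characters.
name = "\u0391\u03b8\u03ae\u03bd\u03b1"

# Python has no codec named UCS2; utf-16-le is used as a stand-in (assumption).
ucs2_bytes = name.encode("utf-16-le")
cp1253_bytes = name.encode("cp1253")   # single-byte Greek code page
utf8_bytes = name.encode("utf-8")

print(len(name))          # 5 characters
print(len(ucs2_bytes))    # 10 bytes: always 2 per character
print(len(cp1253_bytes))  # 5 bytes: always 1 per character
print(len(utf8_bytes))    # 10 bytes: these Greek letters take 2 bytes each

# An attribute at character offset 3 with length 5 maps to bytes as follows.
print(char_span_to_byte_span(3, 5))  # (6, 10)

# Converting from the intermediate form back to the original encoding
# reproduces the original bytes.
round_trip = ucs2_bytes.decode("utf-16-le").encode("cp1253")
print(round_trip == cp1253_bytes)  # True

# In UTF-8 the byte count per character varies, so no fixed factor applies.
mixed = "A\u03b8\u20ac"  # Latin A, Greek theta, euro sign
utf8_widths = [len(ch.encode("utf-8")) for ch in mixed]
print(utf8_widths)  # [1, 2, 3]
```

With a fixed-width encoding, a byte offset can be derived from a character offset by multiplication alone. With UTF8, the offset depends on which characters precede the attribute.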

The following table lists the main character encodings used in TS Quality.

Note: Encodings may vary depending on the chosen process or Control Center tool.

Type	Description
NOTRANS	NOTRANS means No Translation. Operations are performed in the default encoding of the host computer. Note: Be aware that the data is not translated into your native encoding. For example, if a data file from Greece is run on a computer in the US, and both the settings files and all of the fields in the schema are set to NOTRANS, you will likely get a different result than if the same project were run in Greece.
ASCII	American Standard Code for Information Interchange (ASCII). A 7-bit encoding for representing English characters.
BIG5	Traditional Chinese
CCSID937	Traditional Chinese, CID937
CP037	EBCDIC, IBM037
CP1250	Latin 2, Eastern European
CP1251	Cyrillic (Russian, Bulgarian, Serbian Cyrillic, etc.)
CP1252	Latin1 (ANSI)
CP1253	Greek
CP1254	Turkish
CP1255	Hebrew
CP1256	Arabic
CP1257	Baltic
CP1258	Vietnamese
CP932	Microsoft Extended Shift-JIS Japanese
CP936	Simplified Chinese, GBK
CP949	Korean
CP950	Traditional Chinese
1	8-bit character encoding used on IBM mainframe operating systems such as z/OS.
EBCDIC	Extended Binary Coded Decimal Interchange Code (EBCDIC) used in the mainframe environment.
EUCCN	Simplified Chinese, Unix, GB2312, EUC-SC
EUCJP	Japanese, Unix, EUC-JP, EUC-J, JEUC, J-EUC, EUCJ
EUCKR	Korean, Unix, EUC-KR, KS_C_5861-1992
EUCTW	Traditional Chinese, Unix, CNS-11643, CNS-11643-1992
GB12345	Traditional Chinese
HZGB2312	Simplified Chinese, HZ-GB-2312
IBM-83-4040 IBM-83-4242	Japanese corporate kanji code
ISO2022JP	Japanese, ISO-2022-JP
ISO-8859-7	Latin/Greek
ISO-8859-9	Latin-1 modification for Turkish (Latin-5)
JEF-83-A1A1 JEF-83-4040 JEF-78-A1A1 JEF-78-4040	Japanese corporate kanji code. Fujitsu.
JOHAB	Korean
KEIS-83-A1A1 KEIS-83-4040 KEIS-78-A1A1 KEIS-78-4040	Japanese corporate kanji code. Hitachi.
LATIN1	ISO 8859-1
LATIN2	ISO 8859-2
LATIN4	Baltic
LATIN7	Baltic
LATIN9	ISO 8859-15, Latin1 + Euro symbol and accented characters
ShiftJIS	Microsoft Extended Japanese
UCS2	2-byte Universal Character Set. The encoding of Unicode as 16-bit values.
UNICODE20:BIG-ENDIAN	Unicode with the most significant byte first. Other name: big-endian
UNICODE20:LITTLE-ENDIAN	Unicode with the least significant byte first. Other name: little-endian
UTF7	7-bit Unicode transformation format, variable-length character encoding.
UTF8	8-bit Unicode transformation format; variable-length character encoding for Unicode that is also compatible with ASCII. Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details.
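
The NOTRANS caution in the table can be made concrete with a small sketch. It is plain Python rather than Trillium code, and everything in it is an assumption for illustration: the `PY_CODECS` mapping from a few table names to Python codec names, the `decode_field` helper, the sample value, and the choice of `cp1253` and `cp1252` as the default encodings of a Greek and a US host.

```python
# Hypothetical mapping from a few encoding names in the table to Python
# codec names. It is not an official Trillium mapping.
PY_CODECS = {
    "CP1252": "cp1252",
    "CP1253": "cp1253",
    "LATIN1": "latin_1",
    "UTF8": "utf_8",
}


def decode_field(raw, encoding_type, host_default):
    """Decode raw bytes; NOTRANS falls back to the host's default encoding."""
    if encoding_type == "NOTRANS":
        codec = host_default
    else:
        codec = PY_CODECS[encoding_type]
    return raw.decode(codec)


# Made-up sample: the Greek word for Athens, as stored in a Greek data file.
raw = "\u0391\u03b8\u03ae\u03bd\u03b1".encode("cp1253")

# NOTRANS on a host whose default is the Greek code page (assumed cp1253).
on_greek_host = decode_field(raw, "NOTRANS", host_default="cp1253")

# NOTRANS on a host whose default is Western European (assumed cp1252).
on_us_host = decode_field(raw, "NOTRANS", host_default="cp1252")

# Naming the encoding explicitly gives the same result on either host.
explicit = decode_field(raw, "CP1253", host_default="cp1252")

print(on_greek_host)              # Αθήνα
print(on_us_host)                 # ÁèÞíá
print(explicit == on_greek_host)  # True
```

The bytes are identical in all three calls. Only the interpretation changes, which is why the same NOTRANS project can produce different results on different hosts.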