Encoding (Code Page) - Trillium Discovery and Trillium Quality 17.1

First publish date: 2008

An encoding, also called a code page, is a mapping of binary values to code positions that represent characters of data. Trillium DQ supports most encodings for input data. Once the data is passed in, Trillium uses the following mechanisms to handle the various encodings:

  • The Control Center converts all data into UCS2 (2-byte Universal Character Set) and modifies the ddx (schema file) accordingly, regardless of the encodings used in the original input data, Schema Editor, and country project templates.
  • In the Schema Editor, all attribute lengths are defined in number of characters, including offsets and redefinitions.
  • In ddx files, all attribute lengths are defined in number of bytes, including offsets and redefinitions.
  • When you export a project to batch or real-time processing, Trillium converts the encoding from UCS2 as follows:
    • For the user-defined attributes, it converts the encoding from UCS2 back to the encoding used in the original input data. The user-defined attributes are initial input attributes and any other attributes added by the user in the process flow.
    • For the standard TSQ output attributes, it converts the encoding from UCS2 back to the encoding used in the country project template. The standard TSQ output attributes are predefined attributes set by Trillium, such as PR_LINE_RULES for the Customer Data Parser.
  • Most of the country project templates use NOTRANS (system default) or code pages such as CP1251 (Russian) in the ddx files. The country project templates for the ZZ (basic) countries and SAP projects use UTF8.
    Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details.
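
The difference between lengths counted in characters (Schema Editor) and lengths counted in bytes (ddx files), and the round trip from an input encoding to UCS2 and back, can be sketched in Python. This is an illustration only, not Trillium code: the function names byte_lengths and round_trip are hypothetical, and Python's utf-16-le codec stands in for UCS2, which assumes every character falls within the Basic Multilingual Plane.

```python
def byte_lengths(text: str) -> dict:
    """Return the character count and the byte count in several encodings."""
    return {
        "characters": len(text),
        # Assumption: utf-16-le stands in for UCS2 (same bytes for BMP characters).
        "ucs2_bytes": len(text.encode("utf-16-le")),
        "utf8_bytes": len(text.encode("utf-8")),
        "cp1251_bytes": len(text.encode("cp1251")),
    }


def round_trip(raw: bytes, source_encoding: str) -> bytes:
    """Convert input bytes to UCS2-like bytes, then back to the source encoding."""
    ucs2 = raw.decode(source_encoding).encode("utf-16-le")
    return ucs2.decode("utf-16-le").encode(source_encoding)


value = "Москва 1"  # 6 Cyrillic letters, a space, and a digit

# 8 characters: 16 bytes as UCS2, 14 bytes as UTF8, 8 bytes as CP1251.
print(byte_lengths(value))

# Converting CP1251 input to UCS2 and back yields the original bytes.
original = value.encode("cp1251")
print(round_trip(original, "cp1251") == original)  # True
```

In this sketch, an attribute of 8 characters occupies a fixed 16 bytes in UCS2, while its UTF8 size depends on which characters it contains. That variability is one reason a variable-length encoding needs separate handling.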

The following table lists the main character encodings used in TS Quality.

Note: Encodings may vary depending on the chosen process or Control Center tool.

Type                      Description
------------------------  ------------------------------------------------------
NOTRANS                   NOTRANS means No Translation. The operations will be
                          done in the default encoding for the host computer.
                          Note: Users need to be careful that the data will not
                          be translated into their native encoding. For example,
                          if a data file from Greece is run on a computer in the
                          US and both the settings files and all of the fields
                          in the schema are set to NOTRANS, you will likely get
                          a different result than if the same project was run in
                          Greece.
ASCII                     American Standard Code for Information Interchange
                          (ASCII). A 7-bit encoding for representing English
                          characters.
BIG5                      Traditional Chinese
CCSID937                  Traditional Chinese, CID937
CP037                     EBCDIC, IBM037
CP1250                    Latin 2, Eastern European
CP1251                    Cyrillic (Russian, Bulgarian, Serbian Cyrillic, etc.)
CP1252                    Latin1 (ANSI)
CP1253                    Greek
CP1254                    Turkish
CP1255                    Hebrew
CP1256                    Arabic
CP1257                    Baltic
CP1258                    Vietnamese
CP932                     Microsoft Extended Shift-JIS Japanese
CP936                     Simplified Chinese, GBK
CP949                     Korean
CP950                     Traditional Chinese
EBCDIC                    Extended Binary Coded Decimal Interchange Code
                          (EBCDIC) used in the mainframe environment. 8-bit
                          character encoding used on IBM mainframe operating
                          systems such as z/OS.
EUCCN                     Simplified Chinese, Unix, GB2312, EUC-SC
EUCJP                     Japanese, Unix, EUC-JP, EUC-J, JEUC, J-EUC, EUCJ
EUCKR                     Korean, Unix, EUC-KR, KS_C_5861-1992
EUCTW                     Traditional Chinese, Unix, CNS-11643, CNS-11643-1992
GB12345                   Traditional Chinese
HZGB2312                  Simplified Chinese, HZ-GB-2312
IBM-83-4040               Japanese corporate kanji code
IBM-83-4242
ISO2022JP                 Japanese, ISO-2022-JP
ISO-8859-7                Latin/Greek
ISO-8859-9                Latin-1 modification for Turkish (Latin-5)
JEF-83-A1A1               Japanese corporate kanji code. Fujitsu.
JEF-83-4040
JEF-78-A1A1
JEF-78-4040
JOHAB                     Korean
KEIS-83-A1A1              Japanese corporate kanji code. Hitachi.
KEIS-83-4040
KEIS-78-A1A1
KEIS-78-4040
LATIN1                    ISO 8859-1
LATIN2                    ISO 8859-2
LATIN4                    Baltic
LATIN7                    Baltic
LATIN9                    ISO 8859-15, Latin1 + Euro symbol and accented
                          characters
ShiftJIS                  Microsoft Extended Japanese
UCS2                      2-byte Universal Character Set. The encoding of
                          Unicode as 16-bit values.
UNICODE20:BIG-ENDIAN      Unicode with the most significant byte first. Other
                          name: big-endian
UNICODE20:LITTLE-ENDIAN   Unicode with the least significant byte first. Other
                          name: little-endian
UTF7                      7-bit Unicode transformation format, variable-length
                          character encoding.
UTF8                      8-bit Unicode transformation format; variable-length
                          character encoding for Unicode that is also compatible
                          with ASCII.

Note: Trillium treats UTF8 differently than other encodings due to its variable-length characteristics. See Working with UTF8 Data for details.
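
Two points from the table lend themselves to a short Python illustration: the NOTRANS caution about host default encodings, and the byte order distinction between the big-endian and little-endian Unicode types. This is a sketch under assumptions, not Trillium behavior: decode_with is a hypothetical helper, Python's cp1253 and cp1252 codecs stand in for the Greek and US host defaults, and utf-16-be and utf-16-le stand in for the two byte orders.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Interpret raw bytes under the given code page."""
    return raw.decode(encoding)


# A Greek city name as it might be stored in a CP1253 data file.
greek_bytes = "Αθήνα".encode("cp1253")

# The same bytes interpreted on two hosts with different default code pages.
on_greek_host = decode_with(greek_bytes, "cp1253")
on_us_host = decode_with(greek_bytes, "cp1252")
print(on_greek_host)                # Αθήνα
print(on_greek_host == on_us_host)  # False: same bytes, different characters

# Byte order for a 2-byte encoding of the letter "A" (U+0041).
big_endian = "A".encode("utf-16-be")     # most significant byte first
little_endian = "A".encode("utf-16-le")  # least significant byte first
print(big_endian.hex(), little_endian.hex())  # 0041 4100

# UTF8 produces the same bytes as ASCII for 7-bit characters.
print("Main St".encode("utf-8") == "Main St".encode("ascii"))  # True
```

The first part shows why a project left at the host default can give different results on machines in different locales: the bytes never change, only the code page used to read them.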