Optional Properties - Connect_ETL - 9.13

Connect ETL for Big Data Sort User Guide

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect (ETL, Sort, AppMod, Big Data)
Version: 9.13
Language: English
Product name: Connect ETL
Title: Connect ETL for Big Data Sort User Guide
Copyright: 2023
First publish date: 2003
Last updated: 2023-09-11
Published on: 2023-09-11T19:03:59.237517

The following properties can be optionally specified as appropriate when invoking the Connect for Big Data sort accelerator for MapReduce jobs:

dmx.key.length

Value <length_in_bytes>

Description Optimization to specify the key length when the key is a known fixed length.

dmx.value.length

Value <length_in_bytes>

Description Optimization to specify the value length when the value is a known fixed length.
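As a sketch of how these two optimization properties might be passed at job submission (the jar name, class name, and paths below are placeholders, not from this guide):

```shell
# Hypothetical invocation: myjob.jar, com.example.MyJob, and the paths
# are illustrative placeholders. Key is a fixed 8 bytes, value 100 bytes.
hadoop jar myjob.jar com.example.MyJob \
    -D dmx.key.length=8 \
    -D dmx.value.length=100 \
    /input /output
```

When the job's driver parses options via Hadoop's GenericOptionsParser (e.g. through ToolRunner), the -D settings must precede the application arguments.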

dmx.max.record.length

Value <length_in_bytes>

Description Specify the maximum length of records sorted by Connect for Big Data. A Connect for Big Data record is the combined key and value. If there are records longer than the specified maximum length, the MapReduce job will fail unless dmx.truncate.records is set to true. Default is 65535 bytes (64 KB). The maximum record length supported by Connect for Big Data is 16777216 bytes (16 MB). For best performance, it is recommended to set this property to a value that will hold the longest expected record, if known, rather than just specifying the maximum value allowed.

dmx.truncate.records

Value true | false

Description Specifies whether to truncate records longer than the maximum record length set via dmx.max.record.length. This property applies only when the specified maximum is greater than 65535 bytes; otherwise, it is ignored.

  • If true, records longer than the specified maximum length will be truncated and the MapReduce job will continue to run.
  • If false (default), the MapReduce job will abort when there are records longer than the specified maximum length.
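As a sketch combining the two properties above (jar name, class name, and paths are placeholders), the limit can be raised to 1 MB while tolerating oversized records:

```shell
# Hypothetical invocation: allow records up to 1048576 bytes (1 MB) and
# truncate longer ones instead of aborting the job.
hadoop jar myjob.jar com.example.MyJob \
    -D dmx.max.record.length=1048576 \
    -D dmx.truncate.records=true \
    /input /output
```

Because 1048576 is greater than 65535 bytes, dmx.truncate.records is honored here; records longer than 1 MB would be truncated rather than failing the job.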

dmx.key.binarycomparable

Value true | false

Description Set this to true when the key uses the binary comparable serializer so that Connect for Big Data can sort on the binary key. This is recommended if your key is not one of the supported types listed in Supported Sort Keys, so that the serialized key can be compared as bytes. Default is false.

dmx.key.layout

Value <key_type>[,<key_type>,…]

Description When using a multi-field key, specify the layout of the key (corresponding to the key definition in your Java map program) as a comma-separated list of supported key types. The Java key comparator specified in the MapReduce job will be ignored, and the keys will be sorted in the order of the specified fields.

Additionally, the Text key type can optionally specify the delimiter, sort order, type, and sort direction of text subfields using the following syntax:

Text{[-t<delimiter>] -k<index>[type][order] \ [-k<index>[type][order] …]}

where:

-t<delimiter> specifies the subfield delimiter, where <delimiter> is a single-byte character such as '\t', ':', or ' '; the default delimiter is tab ('\t') if not specified.

-k<index>[type][order] specifies the key order of the text subfields, where <index> is a 1-based number indicating the position of the given subfield, and the order of -k options indicates primary key, secondary key, etc. When the type is specified as n, a numeric comparison of the key will be used instead of a simple byte comparison, and when the order is specified as r, the subfield will be sorted in reverse (descending) order instead of the default ascending order. The number of subfields in a given Text field must be the same for all MR keys.

Examples

Without Text Subfields

The following layout specifies a key with float, text, and int fields:

-D dmx.key.layout="FloatWritable,Text,IntWritable"
Given the above layout and the following key/value input:
9.1,New York,10035,value5
1.4,Bronx,10461,value2 
0.9,Queens,11434,value1 
1.4,Queens,11432,value4 
1.4,Queens,11354,value3
The key/value pairs would be sorted as follows:
0.9,Queens,11434 value1
1.4,Bronx,10461 value2 
1.4,Queens,11354 value3 
1.4,Queens,11432 value4 
9.1,New York,10035 value5

With Text Subfields

The following layout specifies a key with float, text, and int fields, where the text field is delimited by commas and sorted by the 3rd subfield first, then by the 2nd numeric subfield in reverse, then by the first subfield:
-D dmx.key.layout="FloatWritable,Text{-t',' -k3 -k2nr -k1},IntWritable"
Given the above layout and the following key/value input:
17.2:Queens,75,NY:10011:value7 
15.4:Cambridge,5,MA:12142:value3 
11.8:Birmingham,114,MI:48009:value1 
15.4:New York,76,NY:10011:value4 
15.4:Chicago,20,IL:60654:value2 
15.4:New York,75,NY:10011:value5 
15.4:Queens,75,NY:10011:value6
The key/value pairs would be sorted as follows:
11.8,Birmingham,114,MI,48009 value1 
15.4,Chicago,20,IL,60654 value2 
15.4,Cambridge,5,MA,12142 value3 
15.4,New York,76,NY,10011 value4 
15.4,New York,75,NY,10011 value5 
15.4,Queens,75,NY,10011 value6 
17.2,Queens,75,NY,10011 value7

dmx.sortwork.dirs

Value <comma-separated list of directories>

Description Specifies one or more local directories, writable by MapReduce users, to override the default location(s) for writing temporary work space files in the Hadoop cluster.

  • For MapR clusters, the default location is the MapReduce job’s working directory in MapR-FS.
  • For clusters of other distributions, the default location is the directories specified in mapreduce.cluster.local.dir (or equivalent).
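A sketch of overriding the work-space location with two local directories (the paths are illustrative; they must exist on every node and be writable by MapReduce users):

```shell
# Hypothetical paths: spread Connect for Big Data sort work across two disks.
-D dmx.sortwork.dirs=/data1/dmxwork,/data2/dmxwork
```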

dmx.sortwork.compress

Value on | off | dynamic

Description Specifies whether Connect for Big Data should compress the sort work files (the codec is gzip):

  • on (default): always compress
  • dynamic: compress depending on load
  • off: never compress

dmx.hive.support

Value true | false

Description Boolean flag to indicate that Connect for Big Data sort should be used for Hive queries. Default is false.

dmx.pig.support

Value true | false

Description Boolean flag to indicate that Connect for Big Data sort should be used for Pig programs. Default is false.

dmx.map.memory

Value <number_in_MB>

Description Memory limit for a Connect for Big Data task running the map-side sort; default is max(256 MB, JVM memory - 256 MB).

dmx.reduce.memory

Value <number_in_MB>

Description Memory limit for a Connect for Big Data task running the reduce-side merge; default is max(256 MB, JVM memory - 256 MB).
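A sketch of raising both limits at submission time (the values are illustrative, not recommendations from this guide):

```shell
# Hypothetical values, in MB: give map-side sort tasks 1024 MB and
# reduce-side merge tasks 2048 MB.
-D dmx.map.memory=1024 \
-D dmx.reduce.memory=2048
```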

dmx.map.datasize.useHDFSsplits

Value true | false

Description Option to determine the Connect for Big Data mapper input data size by querying the NameNode.

  • If true, the mapper input data size is calculated from the input split size and mapper count (high overhead for large clusters).
  • If false (default), the mapper input data size is calculated as follows:
      • dmx.map.datasize, if specified (recommended when the data size is known), or
      • ¾ of dmx.map.memory, if specified, or
      • ¾ of the memory available to Connect for Big Data based on JVM settings

The final mapper data size will appear in the job log output as: DMExpress MAP data size (MB) = X

dmx.map.datasize

Value <number_in_MB>

Description Input data size for Connect for Big Data mappers; ignored if dmx.map.datasize.useHDFSsplits is true. If the data size is within ¾ of the map memory, the sort work will stay in memory, which provides optimal performance; if possible, adjust dmx.map.memory to meet this criterion.

dmx.reduce.datasize

Value <number_in_MB>

Description Input data size for Connect for Big Data reducers. If the data size is within ¾ of the reduce memory, the sort work will stay in memory, which provides optimal performance; if possible, adjust dmx.reduce.memory to meet this criterion.
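A sketch applying the ¾ rule above, with illustrative values: for an expected reducer input of 1500 MB, setting dmx.reduce.memory to at least 2000 MB keeps the data size within ¾ of the memory (1500 ≤ 0.75 × 2000), so the sort work can stay in memory:

```shell
# Hypothetical sizing, in MB: 1500 MB of reducer input fits within
# three-quarters of a 2000 MB memory limit.
-D dmx.reduce.datasize=1500 \
-D dmx.reduce.memory=2000
```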

dmx.map.sort.a

Value d | n

Description Specifies whether to apply the Connect for Big Data sort on the map side (d), or to apply the "null sort" (n), in which only partitioning, but no sorting, occurs on the map side. Default is n. If bzip2 is specified for map output compression, Connect for Big Data will sort on the map side.
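Following the same -D convention as the other properties, forcing the full sort onto the map side might look like this sketch:

```shell
# Apply the Connect for Big Data sort on the map side rather than the
# default partition-only "null sort".
-D dmx.map.sort.a=d
```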