The following properties can be optionally specified as appropriate when invoking the Connect for Big Data sort accelerator for MapReduce jobs:
dmx.key.length
Value <length_in_bytes>
Description Optimization to specify the key length when the key is a known fixed length.
dmx.value.length
Value <length_in_bytes>
Description Optimization to specify the value length when the value is a known fixed length.
dmx.max.record.length
Value <length_in_bytes>
Description Specify the maximum length of records sorted by Connect for Big Data. A Connect for Big Data record is the combined key and value. If there are records longer than the specified maximum length, the MapReduce job will fail unless dmx.truncate.records is set to true. Default is 65535 bytes (64 KB). The maximum record length supported by Connect for Big Data is 16777216 bytes (16 MB). For best performance, it is recommended to set this property to a value that will hold the longest expected record, if known, rather than just specifying the maximum value allowed.
dmx.truncate.records
Value true | false
Description Specify whether to truncate records longer than the maximum record length specified using dmx.max.record.length when the specified value is greater than 65535 bytes (otherwise, this property is ignored):
- If true, records longer than the specified maximum length will be truncated and the MapReduce job will continue to run.
- If false (default), the MapReduce job will abort when there are records longer than the specified maximum length.
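The interaction between dmx.max.record.length and dmx.truncate.records can be sketched as follows (illustrative Python, not the actual tool; it only mirrors the behavior documented above):

```python
# Illustrative sketch of dmx.max.record.length / dmx.truncate.records behavior.
# This is not Connect for Big Data code.
DEFAULT_MAX_RECORD_LENGTH = 65535  # documented 64 KB default

def process_record(record: bytes,
                   max_record_length: int = DEFAULT_MAX_RECORD_LENGTH,
                   truncate_records: bool = False) -> bytes:
    if len(record) <= max_record_length:
        return record
    if truncate_records:
        # dmx.truncate.records=true: truncate and keep running
        return record[:max_record_length]
    # dmx.truncate.records=false (default): the job aborts
    raise RuntimeError("record exceeds dmx.max.record.length; job aborts")
```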
dmx.key.binarycomparable
Value true | false
Description Set this to true when the key uses the binary comparable serializer so that Connect for Big Data can sort on the binary key. This is recommended if your key is not one of the supported types listed in Supported Sort Keys, so that the serialized key can be compared as bytes. Default is false.
dmx.key.layout
Value <key_type>[,<key_type>,…]
Description When using a multi-field key, specify the layout of the key (corresponding to the key definition in your Java map program) as a comma-separated list of supported key types. The Java key comparator specified in the MapReduce job will be ignored, and the keys will be sorted in the order of the specified fields.
Additionally, the Text key type can optionally specify the delimiter, sort order, type, and sort direction of text subfields using the following syntax:
Text{[-t<delimiter>] -k<index>[type][order] [-k<index>[type][order] …]}
where:
-t<delimiter> specifies the subfield delimiter, where <delimiter> is a single-byte character such as ['\t' | ':' | ' ' | …], and the default delimiter is tab ('\t') if not specified.
-k<index>[type][order] specifies the key order of the text subfields, where <index> is a 1-based number indicating the position of the given subfield, and the order of -k options indicates primary key, secondary key, etc. When the type is specified as n, a numeric comparison of the key will be used instead of a simple byte comparison, and when the order is specified as r, the subfield will be sorted in reverse (descending) order instead of the default ascending order. The number of subfields in a given Text field must be the same for all MR keys.
Examples
Without Text Subfields
The following layout specifies a key with float, text, and int fields:
-D dmx.key.layout="FloatWritable,Text,IntWritable"
key/value input:
9.1,New York,10035,value5
1.4,Bronx,10461,value2
0.9,Queens,11434,value1
1.4,Queens,11432,value4
1.4,Queens,11354,value3
sorted key/value output:
0.9,Queens,11434 value1
1.4,Bronx,10461 value2
1.4,Queens,11354 value3
1.4,Queens,11432 value4
9.1,New York,10035 value5
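This ordering can be checked with a small simulation (illustrative Python, not Connect for Big Data itself), treating each key as its (float, text, int) fields in layout order:

```python
# Illustrative simulation of sorting on a "FloatWritable,Text,IntWritable" key.
# This is not Connect for Big Data code; it only mirrors the documented ordering.
records = [
    "9.1,New York,10035,value5",
    "1.4,Bronx,10461,value2",
    "0.9,Queens,11434,value1",
    "1.4,Queens,11432,value4",
    "1.4,Queens,11354,value3",
]

def layout_key(record: str):
    # Fields: float key, text key, int key, value
    f, text, i, _value = record.split(",", 3)
    return (float(f), text, int(i))

sorted_records = sorted(records, key=layout_key)
```

Sorting compares the float field first, then the text field as a string, then the int field, which yields the sorted output shown above.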
With Text Subfields
The following layout specifies a key with float, text, and int fields, where the text field is delimited by commas and sorted by the 3rd subfield first, then by the 2nd numeric subfield in reverse, then by the 1st subfield:
-D dmx.key.layout="FloatWritable,Text{-t',' -k3 -k2nr -k1},IntWritable"
key/value input:
17.2:Queens,75,NY:10011:value7
15.4:Cambridge,5,MA:12142:value3
11.8:Birmingham,114,MI:48009:value1
15.4:New York,76,NY:10011:value4
15.4:Chicago,20,IL:60654:value2
15.4:New York,75,NY:10011:value5
15.4:Queens,75,NY:10011:value6
sorted key/value output:
11.8,Birmingham,114,MI,48009 value1
15.4,Chicago,20,IL,60654 value2
15.4,Cambridge,5,MA,12142 value3
15.4,New York,76,NY,10011 value4
15.4,New York,75,NY,10011 value5
15.4,Queens,75,NY,10011 value6
17.2,Queens,75,NY,10011 value7
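The subfield ordering can likewise be checked with a small simulation (illustrative Python, not Connect for Big Data itself). Here the key fields are ':'-separated as in the input above, and the text subfields are ','-separated per -t','; the key tuple mirrors -k3 (byte compare), -k2nr (numeric, reversed via negation), and -k1:

```python
# Illustrative simulation of Text{-t',' -k3 -k2nr -k1} subfield ordering.
# Not Connect for Big Data code; it only mirrors the documented comparison.
records = [
    "17.2:Queens,75,NY:10011:value7",
    "15.4:Cambridge,5,MA:12142:value3",
    "11.8:Birmingham,114,MI:48009:value1",
    "15.4:New York,76,NY:10011:value4",
    "15.4:Chicago,20,IL:60654:value2",
    "15.4:New York,75,NY:10011:value5",
    "15.4:Queens,75,NY:10011:value6",
]

def layout_key(record: str):
    f, text, i, _value = record.split(":", 3)
    sub = text.split(",")  # -t',' subfield delimiter
    # -k3: 3rd subfield, byte compare; -k2nr: 2nd subfield, numeric, descending;
    # -k1: 1st subfield, byte compare
    return (float(f), sub[2], -float(sub[1]), sub[0], int(i))

sorted_records = sorted(records, key=layout_key)
```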
dmx.sortwork.dirs
Value <comma-separated list of directories>
Description Specifies one or more local directories, writable by MapReduce users, to override the default location(s) for writing temporary work space files in the Hadoop cluster.
- For MapR clusters, the default location is the MapReduce job’s working directory in MapR-FS.
- For clusters of other distributions, the default location is the directories specified in mapreduce.cluster.local.dir (or equivalent).
dmx.sortwork.compress
Value on|off|dynamic
Description Specifies whether Connect for Big Data should compress the sort work (codec is gzip):
- always (on) (default)
- load-dependent (dynamic), or
- never (off)
dmx.hive.support
Value true | false
Description Boolean flag to indicate that Connect for Big Data Sort should be used for Hive queries; default is false.
dmx.pig.support
Value true | false
Description Boolean flag to indicate that Connect for Big Data Sort should be used for Pig programs; default is false.
dmx.map.memory
Value <number_in_MB>
Description Memory limit for Connect for Big Data task running as map sort; default is max(256MB, JVM memory – 256MB)
dmx.reduce.memory
Value <number_in_MB>
Description Memory limit for Connect for Big Data task running as reduce merge; default is max(256MB, JVM memory – 256MB)
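Both memory limits share the same default formula; a minimal sketch (illustrative Python, with memory values expressed in MB):

```python
# Illustrative: documented default Connect for Big Data task memory limit,
# max(256 MB, JVM memory - 256 MB). Not actual product code.
def default_task_memory_mb(jvm_memory_mb: int) -> int:
    return max(256, jvm_memory_mb - 256)
```

For example, a 2048 MB JVM yields a 1792 MB default, while a small JVM never drives the limit below 256 MB.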
dmx.map.datasize.useHDFSsplits
Value true | false
Description Option to determine Connect for Big Data mapper input data size by querying the name node.
- If true, mapper input data size is calculated based on input split size and mapper count (high overhead for large clusters).
- If false (default), mapper input data size is calculated as follows: dmx.map.datasize if specified (recommended if the data size is known), or ¾ of dmx.map.memory if specified, or ¾ of the available memory for Connect for Big Data based on JVM settings.
The final mapper data size will appear in the job log output as: DMExpress MAP data size (MB) = X
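The fallback order for the false (default) case can be sketched as follows (illustrative Python; the function and parameter names are hypothetical, values in MB):

```python
# Illustrative sketch of the documented mapper data size fallback when
# dmx.map.datasize.useHDFSsplits is false (default). Names are hypothetical.
def mapper_datasize_mb(map_datasize=None, map_memory=None,
                       jvm_available_mb=0):
    if map_datasize is not None:      # dmx.map.datasize, if specified
        return map_datasize
    if map_memory is not None:        # 3/4 of dmx.map.memory, if specified
        return map_memory * 3 // 4
    # otherwise 3/4 of the available memory based on JVM settings
    return jvm_available_mb * 3 // 4
```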
dmx.map.datasize
Value <number_in_MB>
Description Input data size for Connect for Big Data mappers; ignored if dmx.map.datasize.useHDFSsplits is true. If the data size is within ¾ of the map memory, sort work will stay in memory, which provides optimal performance; if possible, adjust dmx.map.memory to meet this criterion.
dmx.reduce.datasize
Value <number_in_MB>
Description Input data size for Connect for Big Data reducers. If the data size is within ¾ of the reduce memory, sort work will stay in memory, which provides optimal performance; if possible, adjust dmx.reduce.memory to meet this criterion.
dmx.map.sort.a
Value d | n
Description Specifies whether to apply the Connect for Big Data sort on the map side (d), or to apply the "null sort", meaning that only partitioning but no sorting occurs on the map side (n). Default is n. If bzip2 is specified for map output compression, Connect for Big Data will sort on the map side.