For a MapReduce job to pick up the Connect for Big Data sort accelerator, the job must be invoked with the Connect for Big Data Sort properties set in one of the following ways:
- Specifying the properties at the command line with the -D option:
hadoop jar <jar_file> \
    -D mapreduce.job.map.output.collector.class=com.syncsort.dmexpress.hadoop.DMXMapOutputCollector \
    -D mapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
    -D dmx.home.dir=<connect_install> \
    [dmx_optional_parameters] [command_options]
- Using an XML configuration file that contains the Connect for Big Data Sort property settings:
hadoop jar <jar_file> -conf <XML_file> [command_options]
where:
- <jar_file> is the jar file that defines the MapReduce job
- [dmx_optional_parameters] are any of the optional parameters shown in Optional Properties.
- <XML_file> is the configuration file containing the same properties/values that would have been specified with the -D option. See Sample Hadoop Configuration File for a sample configuration file.
- [command_options] are any application-specific options such as input and output files
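The -conf variant can be sketched as follows. This is a minimal illustration, not the product's sample file: it assumes the standard Hadoop configuration layout (a <configuration> element with one <property> name/value pair per setting), and the install directory, file name, jar name, and input/output arguments are hypothetical placeholders. The script prints the resulting command instead of running it, so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: write the sort properties to an XML file and build the matching command.
# All paths and file names below are placeholders, not product defaults.
set -eu

CONNECT_INSTALL="${CONNECT_INSTALL:-/opt/connect}"   # hypothetical install directory
CONF_FILE="${CONF_FILE:-dmx-sort.xml}"               # hypothetical file name
JOB_JAR="${JOB_JAR:-my-job.jar}"                     # hypothetical job jar

# Same three properties that the -D form passes on the command line.
cat > "$CONF_FILE" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>com.syncsort.dmexpress.hadoop.DMXMapOutputCollector</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin</value>
  </property>
  <property>
    <name>dmx.home.dir</name>
    <value>$CONNECT_INSTALL</value>
  </property>
</configuration>
EOF

# Print the command rather than running it; input_dir and output_dir
# stand in for the application-specific [command_options].
CMD="hadoop jar $JOB_JAR -conf $CONF_FILE input_dir output_dir"
echo "$CMD"
```

Any optional parameters would be added as further <property> entries in the same file.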
If the HADOOP_CLASSPATH was not set globally to point to the location of the dmxhadoop jar file as described in , then it will need to be specified with the other Connect for Big Data options as follows, where <type> is the MapReduce version, as described at the beginning of Optional Properties:
-libjars <connect_install>/lib/dmxhadoop_<type>.jar
Note that the “-” options must precede any application-specific arguments that do not begin with “-” in order to be parsed correctly. If you get a “Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.” warning, you’ll need to modify the MapReduce job to pick up the -D generic command-line options. You can do this either by using the GenericOptionsParser class or by implementing the Tool interface and running your application with the ToolRunner utility, which uses GenericOptionsParser internally.
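The ordering rule can be illustrated with a small sketch that assembles a full command line, including -libjars, and checks that every “-” option comes before the first application argument. The helper function generic_options_first is hypothetical (it is not part of Hadoop or Connect for Big Data), as are the install directory, jar name, and the mrv2 value used for <type>. The check only covers argument order; it does not replace the GenericOptionsParser or ToolRunner changes in the job itself.

```shell
#!/bin/sh
# Sketch: build the command with all "-" options ahead of the application arguments.
set -eu

CONNECT_INSTALL="${CONNECT_INSTALL:-/opt/connect}"   # hypothetical install directory
JOB_JAR="${JOB_JAR:-my-job.jar}"                     # hypothetical job jar
MR_TYPE="${MR_TYPE:-mrv2}"                           # hypothetical <type> value

# Hypothetical helper: succeeds only if every "-" option precedes
# the first application-specific argument.
generic_options_first() {
  seen_app_arg=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -D|-libjars|-conf)
        [ "$seen_app_arg" -eq 0 ] || return 1
        # These options take a value; skip it so it is not treated as an app argument.
        if [ "$#" -gt 1 ]; then shift; fi
        ;;
      -*)
        [ "$seen_app_arg" -eq 0 ] || return 1
        ;;
      *)
        seen_app_arg=1
        ;;
    esac
    shift
  done
  return 0
}

# Generic options first, then the application arguments (placeholders here).
set -- \
  -D "mapreduce.job.map.output.collector.class=com.syncsort.dmexpress.hadoop.DMXMapOutputCollector" \
  -D "mapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin" \
  -D "dmx.home.dir=$CONNECT_INSTALL" \
  -libjars "$CONNECT_INSTALL/lib/dmxhadoop_$MR_TYPE.jar" \
  input_dir output_dir

if generic_options_first "$@"; then
  # Print the command instead of running it, so no cluster is needed.
  echo hadoop jar "$JOB_JAR" "$@"
else
  echo "generic options must precede application arguments" >&2
  exit 1
fi
```

Moving input_dir ahead of any -D option makes the check fail, which mirrors the case where the options would not be parsed as generic options.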