To verify that Connect for Big Data is installed correctly, or to compare its performance
against that of TeraSort running with the native Hadoop sort, run the TeraSort application as
described earlier with the following additional settings, which invoke Connect for Big Data
sort acceleration. The settings can be omitted if they are already specified globally in the
cluster as described in Set Properties Cluster-Wide:
hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> terasort \
-D mapreduce.job.map.output.collector.class=\
com.syncsort.dmexpress.hadoop.DMXMapOutputCollector \
-D mapreduce.job.reduce.shuffle.consumer.plugin.class=\
com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
-D dmx.home.dir=<connect_install> \
-D dmx.key.length=10 \
-libjars <connect_install>/lib/dmxhadoop_<type>.jar \
<terasort_input_directory> <terasort_output_directory>
For example, assuming that Connect ETL is installed under /usr/PreciselyConnect/Connect on all
the nodes in an MRv1 cluster, run the following:
hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort \
-D mapreduce.job.map.output.collector.class=\
com.syncsort.dmexpress.hadoop.DMXMapOutputCollector \
-D mapreduce.job.reduce.shuffle.consumer.plugin.class=\
com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
-D dmx.home.dir=/usr/PreciselyConnect/Connect \
-D dmx.key.length=10 \
-libjars /usr/PreciselyConnect/Connect/lib/dmxhadoop_mrv1.jar \
input-1TB output-DMX-1TB
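When the same job is submitted repeatedly, it can help to keep the install path and JAR type in one place. The following is a minimal sketch, not part of the product: the function name connect_terasort_args and its parameters are hypothetical. It only prints the arguments that follow the examples JAR in the command above, one per line, so they can be reviewed before the job is submitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Connect): prints, one per line, the
# arguments that follow "hadoop jar <examples_jar_file>" in the command above.
# Usage: connect_terasort_args <connect_install> <type> <input_dir> <output_dir>
connect_terasort_args() {
  local connect_install="$1" jar_type="$2" input_dir="$3" output_dir="$4"
  printf '%s\n' \
    terasort \
    -D "mapreduce.job.map.output.collector.class=com.syncsort.dmexpress.hadoop.DMXMapOutputCollector" \
    -D "mapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin" \
    -D "dmx.home.dir=${connect_install}" \
    -D "dmx.key.length=10" \
    -libjars "${connect_install}/lib/dmxhadoop_${jar_type}.jar" \
    "$input_dir" "$output_dir"
}

# Possible use (bash 4 or later, for mapfile):
#   mapfile -t args < <(connect_terasort_args /usr/PreciselyConnect/Connect mrv1 input-1TB output-DMX-1TB)
#   hadoop jar "$HADOOP_MAPRED_HOME/hadoop-examples.jar" "${args[@]}"
```

Each property is emitted as a single argument, which avoids the backslash line continuations inside the property values.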
Note that the optional parameter dmx.key.length is specified because the key has a known fixed
length of 10 bytes; specifying it improves the performance of the TeraSort job. To verify that
Connect for Big Data was properly installed and invoked for sort acceleration as expected, see
Output Logs.
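For the performance comparison mentioned at the start of this section, one rough approach is to time the native run and the accelerated run on the same input. The sketch below is an illustration only: time_job is a hypothetical helper, output-native-1TB is an illustrative directory name, and a single wall-clock measurement is at best a coarse indicator on a shared cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and report its wall-clock time in whole
# seconds, using bash's built-in SECONDS counter.
# Usage: time_job <label> <command> [args...]
time_job() {
  local label="$1"; shift
  local start=$SECONDS
  local status=0
  "$@" || status=$?          # keep going so the timing line is always printed
  echo "${label}: $((SECONDS - start)) s (exit status ${status})"
  return "$status"
}

# Possible use (each run needs its own output directory):
#   time_job native hadoop jar "$HADOOP_MAPRED_HOME/hadoop-examples.jar" \
#     terasort input-1TB output-native-1TB
#   time_job connect <the accelerated hadoop jar command shown above>
```

The helper returns the exit status of the command it ran, so a failed job is not mistaken for a fast one.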