Run TeraSort with Connect for Big Data Sort - Connect_ETL - 9.13

Connect ETL for Big Data Sort User Guide

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect (ETL, Sort, AppMod, Big Data)
Version: 9.13
Language: English
Product name: Connect ETL
Title: Connect ETL for Big Data Sort User Guide
Copyright: 2023
First publish date: 2003
Last updated: 2023-09-11
Published on: 2023-09-11T19:03:59.237517

To verify that Connect for Big Data is installed correctly, or to compare its performance against TeraSort running with the native Hadoop sort, run the TeraSort application as described earlier with additional settings that invoke Connect for Big Data sort acceleration. These settings are needed on the command line only if they are not already specified globally in the cluster, as described in Set Properties Cluster-Wide:
hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> terasort \
-D mapreduce.job.map.output.collector.class=\
com.syncsort.dmexpress.hadoop.DMXMapOutputCollector \
-D mapreduce.job.reduce.shuffle.consumer.plugin.class=\
com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
-D dmx.home.dir=<connect_install> \
-D dmx.key.length=10 \
-libjars <connect_install>/lib/dmxhadoop_<type>.jar \
<terasort_input_directory> <terasort_output_directory>
For example, assuming that Connect ETL is installed under /usr/PreciselyConnect/Connect on all the nodes in an MRv1 cluster, run the following:
hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort \
-D mapreduce.job.map.output.collector.class=\
com.syncsort.dmexpress.hadoop.DMXMapOutputCollector \
-D mapreduce.job.reduce.shuffle.consumer.plugin.class=\
com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
-D dmx.home.dir=/usr/PreciselyConnect/Connect \
-D dmx.key.length=10 \
-libjars /usr/PreciselyConnect/Connect/lib/dmxhadoop_mrv1.jar \
input-1TB output-DMX-1TB

The optional parameter dmx.key.length is specified because the TeraSort key has a known fixed length of 10 bytes; supplying the length improves the performance of the TeraSort job. To verify that Connect for Big Data was installed correctly and invoked for sort acceleration as expected, see Output Logs.
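
When the same run is repeated with different install paths or input sizes, it can help to assemble the command from its variable parts rather than editing it by hand. The following is a minimal sketch, not part of the product: the function name build_terasort_cmd and its argument order are hypothetical, and it assumes that none of the paths contain whitespace. It only prints the command line built from the settings shown above; it does not invoke Hadoop.

```shell
# Sketch only: build_terasort_cmd is a hypothetical helper, not a Connect utility.
# Arguments:
#   $1  path to the Hadoop examples jar
#   $2  Connect install directory (the <connect_install> placeholder)
#   $3  jar type suffix (the <type> placeholder, for example mrv1)
#   $4  TeraSort input directory
#   $5  TeraSort output directory
# Prints the complete command on a single line without running it.
build_terasort_cmd() {
  printf 'hadoop jar %s terasort' "$1"
  # printf reuses the format string for each remaining argument,
  # so every property below is emitted with its own -D prefix.
  printf ' -D %s' \
    "mapreduce.job.map.output.collector.class=com.syncsort.dmexpress.hadoop.DMXMapOutputCollector" \
    "mapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin" \
    "dmx.home.dir=$2" \
    "dmx.key.length=10"
  printf ' -libjars %s/lib/dmxhadoop_%s.jar %s %s\n' "$2" "$3" "$4" "$5"
}
```

For instance, build_terasort_cmd "$HADOOP_MAPRED_HOME/hadoop-examples.jar" /usr/PreciselyConnect/Connect mrv1 input-1TB output-DMX-1TB prints a one-line equivalent of the MRv1 example above, which can be reviewed before it is run.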