Before you can execute the TeraSort application, you need to first run the Hadoop TeraGen application to generate the official TeraSort input data set as follows:
hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> teragen <number_of_rows> <terasort_input_directory>
where:
- HADOOP_MAPRED_HOME points to the location of the MapReduce libraries, such as /usr/lib/hadoop-0.20-mapreduce for MRv1 and /usr/lib/hadoop-mapreduce for MRv2.
- <examples_jar_file> is based on the MapReduce version:
  - MRv1: hadoop-examples.jar
  - MRv2: hadoop-mapreduce-examples.jar
- <number_of_rows> is the number of 100-byte rows to be generated. For example:
  - Specify 10000000000 to generate about 1 TB of data, useful for performance comparison.
  - Specify 1000000 to generate about 100 MB of data, enough to verify Connect for Big Data invocation.
- <terasort_input_directory> is the directory to be created in HDFS that will contain the generated data; you will specify this directory as the input when you run TeraSort.
For example, to generate about 1 TB of data in an HDFS folder named input-1TB, run the following:
hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> teragen 10000000000 input-1TB
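If you want to confirm that TeraGen produced the expected amount of data, you can inspect the input directory with the standard HDFS shell. For example, using the directory name from the command above (an optional check, not required by TeraSort):
hadoop fs -ls input-1TB
hadoop fs -du -s -h input-1TB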
If you are running TeraSort to verify Connect for Big Data invocation, generate a small data set (say, 10,000 rows) and run only with Connect for Big Data. If you are running it for performance comparison, generate the full terabyte of data and run both with and without Connect for Big Data.
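For reference, TeraSort itself is launched from the same examples JAR; a typical invocation looks like the following, where <terasort_output_directory> is an illustrative placeholder for the HDFS directory that will receive the sorted output:
hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> terasort <terasort_input_directory> <terasort_output_directory>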
Note that the total space required to actually run TeraSort on 1 TB of data is much more than 1 TB. You need space to store the input data, the output data, and the intermediate files, and the amount depends on the replication factor (typically 3) used in the cluster:
- For the input data and the output data, you need <data_size> * <replication_factor> each.
- For the intermediate files, you need <data_size>.
For instance, to run TeraSort with about 1 TB of data and a replication factor of 3, you need a total of about 7 TB: 3 TB for the input, 3 TB for the output, and 1 TB for the intermediate files.
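Before generating the full 1 TB data set, it is worth confirming that the cluster has enough free HDFS capacity. One standard way to check (shown here as an optional step) is:
hadoop fs -df -h /
or, for a per-DataNode breakdown of used and remaining capacity:
hdfs dfsadmin -report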