Generate Data for TeraSort - Connect_ETL - 9.13

Connect ETL for Big Data Sort User Guide


Before you can execute the TeraSort application, you must first run the Hadoop TeraGen application to generate the official TeraSort input data set, as follows:

hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> teragen <number_of_rows> <terasort_input_directory>

where:

  • HADOOP_MAPRED_HOME points to the location of the MapReduce libraries, such as /usr/lib/hadoop-0.20-mapreduce for MRv1 and /usr/lib/hadoop-mapreduce for MRv2.
  • <examples_jar_file> depends on the MapReduce version (see the quick check after this list):
      • MRv1: hadoop-examples.jar
      • MRv2: hadoop-mapreduce-examples.jar
  • <number_of_rows> is the number of 100-byte rows to generate. For example:
      • Specify 10000000000 to generate about 1 TB of data, useful for performance comparison.
      • Specify 1000000 to generate about 100 MB of data, enough to verify the Connect for Big Data invocation.
  • <terasort_input_directory> is the directory to be created in HDFS that will contain the generated data; you will specify this directory as the input when you run TeraSort.
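
If you are unsure which examples jar is present on your cluster, one way to check is to list the example jars under the MapReduce home directory. This is only a convenience check; the file name typically matches one of the two names above, possibly with a version suffix:

# List the installed MapReduce example jars (the exact file name varies by distribution)
ls $HADOOP_MAPRED_HOME/*examples*.jar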

For example, to generate about 1 TB of data in an HDFS folder named input-1TB, run the following:

hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> teragen 10000000000 input-1TB
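
On an MRv2 cluster that uses the default library location mentioned above, the fully resolved command would look similar to the following; the jar path shown here is an assumption, so adjust it to match your distribution:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000000 input-1TB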

If running TeraSort for Connect for Big Data verification purposes, generate a small data set (say, 10,000 rows), as shown in the example below, and run only with Connect for Big Data. If running for performance comparison purposes, generate the full terabyte of data, and run both with and without Connect for Big Data.
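
For example, a small verification data set of 10,000 rows (roughly 1 MB) can be generated as follows; the output directory name input-verify is only illustrative:

hadoop jar $HADOOP_MAPRED_HOME/<examples_jar_file> teragen 10000 input-verify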

Note that the total space required to actually run TeraSort on 1 TB of data is much more than 1 TB. You need space to store the input data, the output data, and intermediate files, and the amount depends on the replication factor (typically 3) used in the cluster:

  • For input and output data, you need <data_size> * <replication_factor> for each
  • For intermediate files, you need <data_size>

For instance, to run TeraSort on about 1 TB of data with a replication factor of 3, you need a total of about 7 TB: 3 TB for the input, 3 TB for the output, and 1 TB for intermediate files.
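
Before generating the full 1 TB data set, it can be worth confirming that the cluster has enough free HDFS capacity. One way to do this (the output format varies by Hadoop version) is:

# Show configured capacity, used, and remaining HDFS space
hdfs dfs -df -h /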