MapReduce - Connect_ETL - 9.13

Connect ETL for Big Data Sort User Guide

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect (ETL, Sort, AppMod, Big Data)
Version: 9.13
Language: English
Product name: Connect ETL
Title: Connect ETL for Big Data Sort User Guide
Copyright: 2023
First publish date: 2003
Last updated: 2023-09-11
Published on: 2023-09-11T19:03:59.237517

When a MapReduce job runs in the Hadoop cluster, a mapper is invoked for each input split (a portion of an input data file in HDFS, typically one block). Each mapper processes its data set, the intermediate data is sorted and passed to one or more reducers, and the output of the reducers is written back to HDFS.
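As a rough illustration of this flow, the following sketch simulates the map, shuffle, and reduce steps in plain Python on an in-memory word-count example. The function names and data are illustrative only; they are not part of the Hadoop API.

```python
from collections import defaultdict

# Two "input splits", each handled by its own mapper (illustrative data).
splits = ["apple banana apple", "banana cherry"]

def map_fn(value):
    # Map: emit a (word, 1) pair for every word in the split.
    return [(word, 1) for word in value.split()]

# Map phase: one mapper per split.
mapped = [pair for split in splits for pair in map_fn(split)]

# Sort/shuffle/merge phase: group all intermediate values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

def reduce_fn(key, values):
    # Reduce: collapse the value list for one key into one output record.
    return (key, sum(values))

# Reduce phase: one output record per key.
result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'apple': 2, 'banana': 2, 'cherry': 1}
```

In a real cluster the same steps run distributed across many nodes, with HDFS providing the input splits and receiving the final output.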

Figure 1. Hadoop High-Level View

Map

The Map function takes input records in the form of key/value pairs. It processes those records and outputs intermediate data in the form of key/value pairs, where the keys will be used to partition the data among the reducers:
map(in_key, in_value) → list(map_out_key, map_out_value)
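For example, in a word count job the input key might be a record's byte offset in the file and the input value a line of text. A minimal Python sketch of such a Map function (illustrative, not the Hadoop Java API) could be:

```python
def map_fn(in_key, in_value):
    """Map: emit one (word, 1) intermediate pair per word in the line.

    in_key is the record's byte offset (ignored here); in_value is one
    line of text. The output keys later drive partitioning to reducers.
    """
    return [(word, 1) for word in in_value.split()]

pairs = map_fn(0, "to be or not to be")
print(pairs)  # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```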

Sort, Combine, Shuffle, Merge

The key/value pairs of a given mapper are sorted on the map side, then optionally combined if there are duplicate keys whose values can be aggregated.

If there is more than one reducer (the reducer count can be set cluster-wide or per application), the partitioner shuffles the key/value pairs among the reducers so that all pairs with the same key go to the same reducer. On the reduce side, the pairs sharing a key are then merged into a single list of values for each key.
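These map-side and shuffle steps can be sketched in Python as follows, assuming a summing combiner and a hash partitioner (both illustrative; Hadoop's defaults differ in detail):

```python
from collections import defaultdict

def combine(sorted_pairs):
    # Combiner: aggregate the values of adjacent duplicate keys on the
    # map side, shrinking the data sent across the network.
    combined = []
    for key, value in sorted_pairs:
        if combined and combined[-1][0] == key:
            combined[-1] = (key, combined[-1][1] + value)
        else:
            combined.append((key, value))
    return combined

def partition(key, num_reducers):
    # Partitioner: all pairs with the same key map to the same reducer.
    # (Python salts str hashes per process, so the numeric partition can
    # vary between runs; same-key-same-reducer still holds within a run.)
    return hash(key) % num_reducers

map_output = [("be", 1), ("to", 1), ("be", 1), ("or", 1)]
sorted_pairs = sorted(map_output)   # sort on the map side
combined = combine(sorted_pairs)    # [('be', 2), ('or', 1), ('to', 1)]

# Shuffle: each pair goes to the reducer chosen by the partitioner.
num_reducers = 2
shuffled = defaultdict(list)
for key, value in combined:
    shuffled[partition(key, num_reducers)].append((key, value))
```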

Reduce

The Reduce function takes the merged output of the Map function, consisting of key/value_list pairs, processes it, and writes final key/value pair records to the target destination in the desired format:
reduce(map_out_key, list(map_out_value)) → list(final_out_key, final_out_value)
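Continuing the word count illustration, a Reduce function that sums each key's value list might look like this Python sketch (not Hadoop's Java Reducer API):

```python
def reduce_fn(map_out_key, values):
    # Reduce: collapse the merged value list for one key into
    # final output record(s).
    return [(map_out_key, sum(values))]

# Merged reduce-side input: one value list per key (illustrative data).
merged = {"be": [2], "to": [1, 1], "or": [1]}
final = [pair for key, values in sorted(merged.items())
         for pair in reduce_fn(key, values)]
print(final)  # [('be', 2), ('or', 1), ('to', 2)]
```

In Hadoop the final pairs would then be written back to HDFS in the job's configured output format rather than printed.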