When a MapReduce job runs on a Hadoop cluster, a mapper is invoked for each input split (a portion of an input file in HDFS, typically one block). Each mapper processes the records in its split, the intermediate key/value pairs are sorted and passed to one or more reducers, and the output of the reducers is written back to HDFS.
Figure 1. Hadoop High-Level View
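This end-to-end flow is wired together in a job driver. As a minimal sketch, assuming the newer org.apache.hadoop.mapreduce API and a word-count job (the class names WordCountDriver, WordCountMapper, and WordCountReducer are ours; the mapper and reducer are sketched in the Map and Reduce sections below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // hypothetical mapper, sketched below
        job.setCombinerClass(WordCountReducer.class); // optional map-side aggregation
        job.setReducerClass(WordCountReducer.class);  // hypothetical reducer, sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file(s) in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}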
Map
map(in_key, in_value) → list(map_out_key, map_out_value)
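For example, a word-count mapper instantiates this signature with in_key = the byte offset of a line, in_value = the line of text, and emits a (word, 1) pair for each word. A minimal sketch, again assuming the org.apache.hadoop.mapreduce API (the class name WordCountMapper is ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // in_key: byte offset of the line; in_value: the line itself.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (map_out_key, map_out_value) = (word, 1)
            }
        }
    }
}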
Sort, Combine, Shuffle, Merge
The key/value pairs emitted by a given mapper are sorted on the map side and then, optionally, combined: if a combiner is configured, values for duplicate keys are pre-aggregated locally, shrinking the intermediate data before it crosses the network.
If there is more than one reducer (the number of reduce tasks is a per-job setting, with a cluster-wide default), the partitioner assigns each key/value pair to a reducer, and the shuffle transfers the pairs across the network so that all pairs with the same key arrive at the same reducer. On the reduce side, the incoming pairs are then merged into a single list of values for each key.
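Hadoop's default partitioner hashes the key modulo the number of reduce tasks; a custom partitioner only needs to override one method. A sketch that reproduces the default hash-based behavior (the class name WordPartitioner is ours):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the
        // hash modulo the reducer count: identical keys always land on the
        // same reducer, which is what makes the reduce-side merge possible.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Registering it is a single call in the driver: job.setPartitionerClass(WordPartitioner.class).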
Reduce
reduce(map_out_key, list(map_out_value)) → list(final_out_key, final_out_value)
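Continuing the word-count example, the reducer receives each word together with the list of its 1s (an Iterable in the Java API) and sums them. A minimal sketch (the class name WordCountReducer is ours):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // list(map_out_value) arrives as an Iterable of counts for one key.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // emit (final_out_key, final_out_value) = (word, total)
        context.write(word, new IntWritable(sum));
    }
}

Because summing is commutative and associative, this same class can also serve as the combiner, which is why the driver sketch above passes it to both setCombinerClass and setReducerClass.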