Compressing the intermediate files passed between mappers and reducers can significantly improve the performance of some Hadoop jobs. To enable it, turn on map output compression and choose a compression codec by adding the following options to the hadoop job invocation:
-D mapred.compress.map.output=true
- For gzip compression (CPU-intensive, good compression rates on text
data):
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
- For Snappy compression (faster than gzip, but with lower compression
rates):
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
These compression options require installation of the corresponding Hadoop native library for the given codec. For details, see the documentation of the distribution you are using.
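As an illustration, the options above might be combined into a complete invocation as in the following sketch. It assumes a Hadoop Streaming job; the jar location, the input and output paths, and the mapper and reducer script names are hypothetical placeholders, not values taken from this documentation. The command is only printed, so it can be inspected before anything is submitted.

```shell
# Hypothetical placeholders: adjust these to your installation and job.
STREAMING_JAR="/path/to/hadoop-streaming.jar"
CODEC="org.apache.hadoop.io.compress.SnappyCodec"

# The two compression options described above.
COMPRESS_OPTS="-D mapred.compress.map.output=true -D mapred.map.output.compression.codec=${CODEC}"

# Generic -D options must come before the streaming-specific options.
CMD="hadoop jar ${STREAMING_JAR} ${COMPRESS_OPTS} -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```

To use gzip instead, set CODEC to org.apache.hadoop.io.compress.GzipCodec. On Hadoop 2 and later these property names are deprecated in favor of mapreduce.map.output.compress and mapreduce.map.output.compress.codec; the older names shown here are generally still accepted, but check the documentation of your distribution.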