During the map and reduce sort stages, if there is not enough memory to perform the sort, Connect for Big Data may spill some data to disk; this spilled data is referred to as “sortwork”. To minimize the performance impact of disk reads and writes, sortwork data is compressed using gzip. Note that the performance gain is more significant on the reduce side, where there is more data, than on the map side.
Sortwork compression is controlled by the dmx.sortwork.compress option as described in Connect for Big Data Sort Accelerator Properties, and should be set to off when the job is more CPU-bound. Alternatively, it can be set to dynamic to let Connect ETL balance the performance trade-offs.
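
As an illustration of the choice described above, the following minimal Python sketch picks a value for dmx.sortwork.compress and formats it as a Hadoop-style `-D key=value` generic option. Only the property name and the values off and dynamic come from this topic. The helper names are hypothetical, and passing the property as a `-D` option is an assumption; see Connect for Big Data Sort Accelerator Properties for the supported way to set it.

```python
from typing import List

# Property name taken from this topic.
PROPERTY = "dmx.sortwork.compress"

# Only the two values named in this topic are handled here.
ALLOWED_VALUES = ("off", "dynamic")


def choose_sortwork_compress(cpu_bound: bool) -> str:
    """Hypothetical helper: 'off' for a CPU-bound job, otherwise 'dynamic'."""
    return "off" if cpu_bound else "dynamic"


def as_generic_option(value: str) -> List[str]:
    """Format the setting as a '-D key=value' pair.

    Assumption: the property can be passed as a Hadoop generic option.
    """
    if value not in ALLOWED_VALUES:
        raise ValueError(f"unsupported value for {PROPERTY}: {value!r}")
    return ["-D", f"{PROPERTY}={value}"]


if __name__ == "__main__":
    # A job known to be CPU-bound: turn sortwork compression off.
    print(" ".join(as_generic_option(choose_sortwork_compress(cpu_bound=True))))
    # Otherwise, let the product balance the trade-offs.
    print(" ".join(as_generic_option(choose_sortwork_compress(cpu_bound=False))))
```

Whether a job counts as CPU-bound is something you would determine from your own profiling; the boolean flag here is only a placeholder for that decision.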