Amazon EMR Management Guide

Compress the Output of your Cluster

Output Data Compression

Output data compression compresses the output of your Hadoop job. If you are using TextOutputFormat, the result is a gzip'ed text file. If you are writing to SequenceFiles, the result is a SequenceFile that is compressed internally. You can enable output compression by setting the configuration setting mapred.output.compress to true.
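For example, if you build the job configuration yourself with the older mapred API, a minimal sketch of setting the property directly looks like the following; the MyJob driver class is a placeholder, not part of the original example.

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: enable job output compression by setting the property
// directly on the job configuration (older mapred API). MyJob is a
// placeholder for your own driver class.
JobConf conf = new JobConf(MyJob.class);
conf.setBoolean("mapred.output.compress", true);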

If you are running a streaming job, you can enable output compression by passing these arguments to the streaming job:

-jobconf mapred.output.compress=true

You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"

Finally, if you are writing a Custom Jar, you can enable output compression with the following line when creating your job.

FileOutputFormat.setCompressOutput(conf, true);
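If you also want to choose the output codec explicitly, the following is a hedged sketch of a complete driver built on the older mapred API. The CompressedOutputDriver class name, the gzip codec choice, and the argument-based input and output paths are illustrative assumptions, not part of the original example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CompressedOutputDriver.class);
        conf.setJobName("compressed-output-example");

        // Input and output locations are taken from the command line here;
        // replace with your own paths (for example, s3:// locations on EMR).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Enable output compression; with TextOutputFormat and GzipCodec
        // the job writes gzip'ed text files.
        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // No mapper or reducer is set, so Hadoop's identity map and reduce
        // are used; the key/value classes below match TextInputFormat output.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
    }
}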

Intermediate Data Compression

If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. Intermediate compression compresses the map output and decompresses it when it arrives on the slave node. The configuration setting is mapred.compress.map.output. You can enable this in the same way as output compression.

When writing a Custom Jar, use the following line:

conf.setCompressMapOutput(true);
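As a minimal sketch, assuming the older mapred API, the typed setter and the raw property name are interchangeable; MyJob is a placeholder for your own driver class.

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: two equivalent ways to turn on intermediate
// (map output) compression. MyJob is a placeholder driver class.
JobConf conf = new JobConf(MyJob.class);
conf.setCompressMapOutput(true);
// ...or, equivalently:
conf.setBoolean("mapred.compress.map.output", true);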

Using the Snappy Library with Amazon EMR

Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to http://code.google.com/p/snappy/.
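If you want to select Snappy explicitly in a Custom Jar rather than rely on the default, a minimal sketch using the older mapred API and the standard Hadoop SnappyCodec class follows; MyJob is again a placeholder driver class.

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: select Snappy explicitly for intermediate compression.
// On AMI 2.0 and later this is already the default, so the codec call is
// only needed to be explicit or to override another codec. MyJob is a
// placeholder for your own driver class.
JobConf conf = new JobConf(MyJob.class);
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(SnappyCodec.class);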