You can compress the output of your Hadoop job. If you are using TextOutputFormat, the result is a gzipped text file. If you are writing to SequenceFiles, the result is a SequenceFile that is compressed internally. You can enable this by setting the configuration setting mapred.output.compress to true.
If you are running a streaming job, you can enable compression by passing the appropriate arguments to the streaming job.
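The original snippet is not preserved here; as a sketch, classic Hadoop streaming accepts raw configuration settings through -jobconf, so an invocation would look roughly like the following (the jar path, bucket names, and mapper/reducer scripts are placeholders):

```shell
# Hypothetical streaming invocation; paths and script names are placeholders.
# -jobconf passes a raw Hadoop configuration setting to the job.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3://mybucket/input \
  -output s3://mybucket/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -jobconf mapred.output.compress=true
```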
You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,mapred.output.compress=true"
Finally, if you are writing a Custom Jar, you can enable output compression when creating your job.
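A minimal sketch using the classic org.apache.hadoop.mapred API (the MyJob class name is a placeholder for your own job class):

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch of job setup; MyJob is a hypothetical job class.
JobConf conf = new JobConf(MyJob.class);
conf.setBoolean("mapred.output.compress", true);

// Equivalently, the FileOutputFormat helpers set the same flag
// and also let you choose the codec explicitly:
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
```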
If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. This compresses the map output and decompresses it when it arrives on the slave node. The configuration setting is mapred.compress.map.output. You can enable this similarly to output compression.
When writing a Custom Jar, use the following command:
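The original command is not preserved here; as a sketch with the classic org.apache.hadoop.mapred API (MyJob is a placeholder), intermediate compression can be enabled like this:

```java
import org.apache.hadoop.mapred.JobConf;

// Sketch of job setup; MyJob is a hypothetical job class.
JobConf conf = new JobConf(MyJob.class);
conf.setCompressMapOutput(true);
// Equivalent to: conf.setBoolean("mapred.compress.map.output", true);
```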
Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to http://code.google.com/p/snappy/. For more information about Amazon EMR AMI versions, go to Choose a Machine Image.