Amazon EMR
Developer Guide

Run a Hadoop Application to Process Data

This documentation is for AMI versions 2.x and 3.x of Amazon EMR. For information about Amazon EMR releases 4.0.0 and above, see the Amazon EMR Release Guide. For information about managing the Amazon EMR service in 4.x releases, see the Amazon EMR Management Guide.

Amazon EMR provides several models for creating custom Hadoop applications to process data:

  • Custom JAR or Cascading: write a Java application, optionally using the Cascading Java libraries, compile it into a JAR file, and upload the JAR file to Amazon S3. Amazon EMR imports the JAR file into the cluster and uses it to process data. The JAR file must contain implementations of both the Map and Reduce functions.

  • Streaming: write separate Map and Reduce scripts in one of several scripting languages and upload them to Amazon S3. Amazon EMR imports the scripts into the cluster and uses them to process data. Instead of providing a script, you can also use built-in Hadoop classes, such as aggregate.

Regardless of which type of custom application you create, the application must provide both Map and Reduce functionality, and should adhere to Hadoop programming best practices.
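
As an illustration of the Map and Reduce functionality a streaming application provides, the following is a minimal word-count sketch in Python. It assumes the usual Hadoop streaming contract: each script reads lines from standard input and writes tab-separated key/value records to standard output, and the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line argument are hypothetical and chosen for this example only; in practice the two steps are typically kept in separate script files.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reduce step: sum the counts for each word.

    Assumes the input is already sorted by key, so that all records
    for the same word are adjacent.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention for this sketch: the first argument selects the step.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

To try the sketch locally without a cluster, the sort that happens between the two phases can be imitated with a shell pipeline, for example `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `wordcount.py` and `input.txt` are placeholder file names.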