Amazon EMR
Developer Guide

What Can You Do with Amazon EMR?

Amazon EMR simplifies running Hadoop and related big-data applications on AWS. You can use it to manage and analyze vast amounts of data. For example, a cluster can be configured to process petabytes of data.

Hadoop Programming on Amazon EMR

Developing and deploying custom Hadoop applications used to require access to a lot of hardware. Amazon EMR makes it easy to spin up a set of EC2 instances as virtual servers to run your Hadoop cluster. You can run various server configurations, such as fully loaded production servers and temporary testing servers, without having to purchase or reconfigure hardware. You can configure and deploy always-on production clusters, and you can just as easily terminate unused testing clusters after your development and testing phase is complete.
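The difference between an always-on production cluster and a temporary testing cluster can be sketched as a pair of cluster requests. This is a minimal illustration, not the guide's own procedure: the helper name build_cluster_request, the cluster names, the instance type, and the release label are all assumptions chosen for the example. The dictionary keys follow the parameter names of the boto3 EMR run_job_flow call, and the role names are the EMR default roles, which may differ in your account.

```python
def build_cluster_request(name, keep_alive, instance_count=3,
                          instance_type="m5.xlarge",
                          release_label="emr-6.15.0"):
    """Build keyword arguments for a cluster launch (hypothetical helper).

    keep_alive=True  -> cluster stays up after its steps finish (production).
    keep_alive=False -> cluster shuts down when no steps remain (testing).
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,       # assumed release; pick your own
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


# A long-running production cluster and a small throwaway test cluster.
production = build_cluster_request("prod-cluster", keep_alive=True,
                                   instance_count=10)
testing = build_cluster_request("test-cluster", keep_alive=False)
```

With boto3 installed and AWS credentials configured, a request like this could be passed as boto3.client("emr").run_job_flow(**testing), and a cluster that is no longer needed could be shut down with terminate_job_flows(JobFlowIds=[...]). Neither call is made here, so the sketch runs without an AWS account.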

Amazon EMR provides several methods of running Hadoop applications, depending on the type of program you are developing and the libraries you intend to use.

Custom JAR

Run your custom MapReduce program written in Java. Running a custom JAR gives you low-level access to the MapReduce API. You are responsible for defining and implementing the MapReduce tasks in your Java application.


Cascading

Run your application using the Cascading Java library, which provides features such as splitting and joining data streams. Using the Cascading Java library can simplify application development. With Cascading, you can still access the low-level MapReduce APIs, as you can with a custom JAR application.


Streaming

Run a Hadoop job based on Map and Reduce functions that you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, and C++.
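As an illustration of the kind of Map and Reduce functions a streaming job uses, here is a minimal word-count sketch in Python. It assumes the usual Hadoop streaming convention of tab-separated key/value lines, with the reducer receiving records sorted by key. The function names map_words, reduce_counts, and run_local are hypothetical, and run_local only simulates the map, sort, and reduce phases in one process so the logic can be tried without a cluster.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts for each word.

    Records must arrive sorted by key, which Hadoop guarantees between
    the map and reduce phases.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Simulate map -> sort -> reduce locally (for testing only)."""
    return list(reduce_counts(sorted(map_words(lines))))


result = run_local(["a b a", "B c"])
```

In an actual streaming job, the mapper and reducer would typically be two separate scripts, each reading records from standard input and printing its output records to standard output.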

Data Analysis and Processing on Amazon EMR

You can also use Amazon EMR to analyze and process data without writing MapReduce code. Several open-source applications run on top of Hadoop and make it possible to run MapReduce jobs and manipulate data using either a SQL-like syntax or a specialized language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig.

Data Storage on Amazon EMR

Distributed storage is a way to store large amounts of data over a distributed network of computers with redundancy to protect against data loss. Amazon EMR is integrated with the Hadoop Distributed File System (HDFS) and Apache HBase.

Move Data with Amazon EMR

You can use Amazon EMR to move large amounts of data in and out of databases and data stores. Because the work is distributed across the cluster, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), DynamoDB, and Apache HBase.