Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Hadoop

Apache Hadoop is an open-source Java software framework for processing massive data sets across a cluster of servers. Hadoop can run on a single server or scale to thousands of servers. It uses a programming model called MapReduce to distribute processing across multiple servers, and a distributed file system called HDFS to store data across multiple servers. Hadoop monitors the health of the servers in the cluster and can recover from the failure of one or more nodes. In this way, Hadoop provides not only increased processing and storage capacity, but also high availability.

For more information, see http://hadoop.apache.org.

MapReduce

MapReduce is a programming model for distributed computing. It simplifies the process of writing parallel distributed applications: the framework handles all of the distribution logic, and you supply only the Map and Reduce functions. The Map function maps input data to sets of key/value pairs called intermediate results. The Reduce function combines the intermediate results, applies additional processing, and produces the final output.

For more information, see http://wiki.apache.org/hadoop/HadoopMapReduce.
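As a concrete illustration, the following is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce Java API. The class names and the two command-line arguments (input and output paths) are placeholders, not part of this guide. The Map function emits an intermediate (word, 1) pair for each word in the input, and the Reduce function sums those pairs for each word to produce the final counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit an intermediate (word, 1) pair for every word in the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: combine the intermediate results for each word into a total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

When run on a cluster, Hadoop partitions the input among the map tasks and routes all intermediate pairs with the same key to the same reduce task.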

HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system for Hadoop. HDFS distributes the data it stores across servers in the cluster, storing multiple copies of data on different servers to ensure that no data is lost if an individual server fails. HDFS is ephemeral storage that is reclaimed when you terminate the cluster. HDFS is useful for caching intermediate results during MapReduce processing or as the basis of a data warehouse for long-running clusters.

For more information, see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html.
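Applications typically reach HDFS through the Hadoop FileSystem Java API. The following is a minimal sketch, assuming it runs on a node whose configuration points at the cluster's HDFS; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Pick up the cluster's file system configuration from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/example.txt"); // hypothetical path

    // Write: HDFS splits the file into blocks and replicates each
    // block across several servers in the cluster.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello from HDFS");
    }

    // Read the file back; the client fetches blocks from whichever
    // servers hold replicas.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}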

Amazon EMR extends Hadoop to add the ability to reference data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. If you store intermediate results in Amazon S3, however, be aware that the data streams between every slave node in the cluster and Amazon S3, which can exceed the limit of 200 transactions per second to Amazon S3. Most often, Amazon S3 is used to store input and output data, while intermediate results are stored in HDFS.
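In practice, this means a job driver can pass Amazon S3 URIs wherever Hadoop expects a path. The sketch below follows the common pattern just described, reading input from and writing final output to Amazon S3 while intermediate results stay on the cluster; the bucket name and key prefixes are hypothetical.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3Paths {
  static void configurePaths(Job job) throws Exception {
    // Input is read directly from Amazon S3 ("my-bucket" is a placeholder) ...
    FileInputFormat.addInputPath(job, new Path("s3://my-bucket/input/"));
    // ... and the final output is written back to Amazon S3, while
    // intermediate map output stays on the cluster's storage.
    FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/output/"));
  }
}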

Jobs and Tasks

In Hadoop, a job is a unit of work. Each job may consist of one or more tasks, and each task may be attempted one or more times until it succeeds. Amazon EMR adds a new unit of work to Hadoop, the step, which may contain one or more Hadoop jobs. For more information, see Steps.

You can submit work to your cluster in a variety of ways. For more information, see How to Send Work to a Cluster.
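For example, one way to send work to a running cluster is to add a step programmatically. The following is a sketch, assuming a recent version of the AWS SDK for Java (v1); the cluster (job flow) ID, JAR location, and arguments are placeholders.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class AddStepExample {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    // A step wraps one or more Hadoop jobs: here, a JAR plus its arguments.
    // The JAR location and arguments are hypothetical.
    HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
        .withJar("s3://my-bucket/jars/wordcount.jar")
        .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/");

    StepConfig step = new StepConfig()
        .withName("Word count")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(jarStep);

    emr.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXXX")   // your cluster (job flow) ID
        .withSteps(step));
  }
}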

Hadoop Applications

Hadoop is a popular open-source distributed computing architecture. Other open-source applications such as Hive, Pig, and HBase run on top of Hadoop and extend its functionality by adding features such as queries of data stored on a cluster and data warehouse functionality.

For more information, see Analyze Data with Hive, Process Data with Pig, and Store Data with HBase.