Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Choose the Type of Cluster to Run

Amazon EMR offers several different types of clusters, each configured for a particular type of data processing. They vary in terms of the software installed on the cluster and the technologies available. You should select the one that best supports your intended use. The following sections describe the types of clusters you can launch in Amazon EMR.

If you are new to Amazon EMR and want to learn more about the different cluster types and how they work, you can run the sample applications available in the Amazon EMR console. These provide examples of a Hive cluster, a Pig cluster, two Custom JAR clusters, and a Streaming cluster. The tutorial, Get Started: Count Words with Amazon EMR, provides a detailed walkthrough of running a Streaming cluster.

Hive Cluster

Hive clusters are preloaded with Apache Hive, an open-source data warehouse and analytic package that runs on top of Hadoop. With this type of cluster, you can load data onto your cluster and then query it using HiveQL, a SQL-like query language. Hive converts your queries into distributed map-reduce applications. This cluster type is useful if you want to use distributed processing to store and query your data without having to write a map-reduce application. This is also the cluster type to select if you want to use Amazon EMR to move data into or out of DynamoDB. For more information, see Analyze Data with Hive and Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR.
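To make the idea of converting a query into a map-reduce application more concrete, the following Python sketch models a simple aggregation as a map phase, a shuffle, and a reduce phase. It is a conceptual illustration only, not Hive's actual execution plan; the table name, column name, and function names are hypothetical.

```python
from collections import defaultdict

# Illustrative HiveQL (hypothetical table and column names):
#   SELECT page, COUNT(*) FROM visits GROUP BY page;


def map_phase(rows):
    """Emit a (key, 1) pair for every row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases over an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    visits = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
    print(run_query(visits))
```

On a real cluster the map and reduce phases run in parallel across many nodes; the sketch runs them in a single process purely to show the shape of the computation.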

Custom JAR Cluster

A custom JAR cluster runs a Java map-reduce application that you have previously compiled into a JAR file and uploaded to Amazon S3. To create the JAR file, compile your Java code against the version of Hadoop you plan to launch in the cluster. Within the application, submit Hadoop jobs using the Hadoop JobClient interface. Custom JAR clusters are preloaded with the Cascading open-source Java library, which provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. This type of cluster is best if you are a Hadoop developer and want the flexibility to create a custom map-reduce application. For more information, see Process Data with a Custom JAR Cluster.

Streaming Cluster

A Streaming cluster runs mapper and reducer scripts that you have previously uploaded to Amazon S3. The scripts can be written in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, or C++. Streaming clusters are preloaded with Hadoop Streaming, an open-source utility that ships with Apache Hadoop and passes data to and from your scripts through standard input and standard output. This type of cluster is best if you are a Hadoop developer and want to use Hadoop Streaming to quickly develop a map-reduce application. For more information, see Process Data with a Streaming Cluster.
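As an illustration of what mapper and reducer scripts can look like, the following Python sketch counts words. It assumes the usual Hadoop Streaming conventions: records arrive on standard input, key/value pairs are written to standard output separated by a tab, and the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line argument are hypothetical choices made for this example, not part of Amazon EMR.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, so that all records
    for a given word are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: pass "map" or "reduce" as the first argument.
# With no argument the script does nothing, so it can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

In practice the mapper and reducer are often kept in two separate files; they are combined here only to keep the example short. The pair can be tried locally, without a cluster, by piping a text file through the map step, a sort, and the reduce step.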

Pig Cluster

Pig clusters are preloaded with Apache Pig, an open-source Apache library that runs on top of Hadoop. With this type of cluster, you can load data onto your cluster and then transform and query it using Pig Latin, a high-level data-flow language with SQL-like operators. Pig converts your Pig Latin scripts into distributed map-reduce applications. This cluster type is useful if you want to use distributed processing to store and query your data without having to write a map-reduce application. For more information, see Process Data with Pig.

HBase Cluster

HBase clusters are preloaded with Apache HBase, an open-source, non-relational, distributed database modeled after Google's Bigtable. HBase provides a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because it caches data in memory rather than reading every request from disk. HBase is optimized for sequential write operations and is highly efficient for batch inserts, updates, and deletes. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). This cluster type is useful for creating a data warehouse. For more information, see Store Data with HBase.
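The notion of sparse, column-based storage can be sketched in a few lines of Python. This is a conceptual model only, not the HBase client API: the SparseTable class and its methods are hypothetical, and the sketch ignores cell versions, persistence, and distribution. The point it shows is that a row stores only the cells that have been written, so absent columns take no space.

```python
class SparseTable:
    """Toy model of a sparse table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write a single cell; columns never written are simply absent."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Return the cell value, or None if the cell was never written."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Yield rows in key order for start <= row key < stop."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, self._rows[key]

    def cell_count(self):
        """Count the cells actually stored across all rows."""
        return sum(len(cells) for cells in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ana")
    table.put("user2", "info:email", "b@example.com")
    print(table.get("user1", "info:email"))
    print(table.cell_count())
```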