Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch a Cluster and Submit a Streaming Step

This section covers the basics of launching an Amazon Elastic MapReduce (Amazon EMR) cluster and submitting a Streaming step. You can create a cluster using the Amazon EMR console, the CLI, or the API. Before you create your cluster, you'll need to prepare your input data. For more information, see Prepare Input Data (Optional).

A Streaming application reads input from standard input and then runs a script or executable (called a mapper) against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable (called a reducer) processes the mapper results. The results from the reducer are sent to standard output. You can chain together a series of Streaming steps, where the output of one step becomes the input of another step.
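The following is a minimal sketch of such a mapper and reducer pair, written in Python. The file names and the word-count logic are illustrative assumptions, not the sample scripts shipped with Amazon EMR; any executable that reads standard input and writes tab-delimited key/value pairs to standard output can fill either role.

  #!/usr/bin/env python
  # mapper.py -- a minimal streaming mapper (illustrative sketch).
  # Emits one "word<TAB>1" pair per word read from standard input.
  import sys

  for line in sys.stdin:
      for word in line.split():
          sys.stdout.write('%s\t1\n' % word)

  #!/usr/bin/env python
  # reducer.py -- a minimal streaming reducer (illustrative sketch).
  # Hadoop sorts the mapper output by key before invoking the reducer,
  # so all pairs for a given word arrive together; sum each run of
  # identical keys and emit one total per word.
  import sys

  current_word, count = None, 0
  for line in sys.stdin:
      word, _, value = line.rstrip('\n').partition('\t')
      if word != current_word:
          if current_word is not None:
              sys.stdout.write('%s\t%d\n' % (current_word, count))
          current_word, count = word, 0
      count += int(value)
  if current_word is not None:
      sys.stdout.write('%s\t%d\n' % (current_word, count))

Because the contract is only standard input and output, you can test a pair like this locally (for example, by piping a sample file through the mapper, sort, and then the reducer) before submitting it as a step.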

The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, and Bash.

The following example is based on the Amazon EMR Word Count Example. This example shows how to use Hadoop Streaming to count the number of times each word occurs within a text file. In this example, the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input. The mapper is a Python script that counts the number of times a word occurs in each input string and is located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py. The reducer references a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values. The output is saved to an Amazon S3 bucket you created in Prepare an Output Location (Optional).
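For a concrete sense of how a mapper cooperates with the aggregate reducer, the sketch below is modeled on what the wordSplitter.py sample does; the actual script may differ in detail. The aggregate library chooses an aggregator based on a prefix in each output key, so a key of the form LongValueSum:word asks it to sum the values emitted for that word.

  #!/usr/bin/env python
  # Aggregate-style mapper sketch (an assumption modeled on the
  # wordSplitter.py sample; the shipped script may differ). The
  # "LongValueSum:" key prefix tells the aggregate reducer to sum
  # the values emitted for each word.
  import sys

  for line in sys.stdin:
      for word in line.split():
          sys.stdout.write('LongValueSum:%s\t1\n' % word.lower())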

Create a Cluster and Submit a Streaming Step Using the Console

This example describes how to use the Amazon EMR console to create a cluster and run a Streaming step.

To create a cluster and run a Streaming step

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination. Typically, set this value to Yes when developing an application (so you can debug errors that would otherwise have terminated the cluster) and when protecting long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  4. In the Software Configuration section, verify the fields according to the following table.

    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

    The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI).

  5. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

    Master

    Accept the default instance type.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

    Core

    Accept the default instance type.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

    Task

    Accept the default instance type.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

  6. In the Security and Access section, complete the fields according to the following table.

    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    Roles configuration

    Click Default to generate the default EMR role and EC2 instance profile. If the default roles exist, they are used for your cluster. If they do not exist, they are created (assuming you have proper permissions). You may also click View policies for default roles to view the default role properties. Alternatively, if you have custom roles, you can click Custom and choose your roles. An EMR role and EC2 instance profile are required when creating a cluster using the console.

    The EMR role allows Amazon EMR to access other AWS services on your behalf. The EC2 instance profile controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.

  7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.

  8. In the Steps section, choose Streaming Program from the list and click Configure and add.

    In the Add Step dialog, specify the cluster parameters using the following table as a guide, and then click Add.

    Specify Streaming Parameters

    Mapper*

    Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where the mapper executable, such as a Python program, resides. The path value must be in the form BucketName/path/MapperExecutable.

    Reducer*

    Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where the reducer executable, such as a Python program, resides. The path value must be in the form BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For more information, go to the Aggregate library supplied by Hadoop.

    Input S3 location*

    Specify the URI where the input data resides in Amazon S3. The value must be in the form BucketName/path.

    Output S3 location*

    Specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path.

    Arguments

    Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility. For example, you can specify additional files to load into the distributed cache.
    Action on Failure

    This determines what the cluster does in response to any errors. The possible values for this setting are:

    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.

    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.

    • Continue: If the step fails, continue to the next step.

    * Required parameter

  9. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  10. When the cluster starts, the console displays the Cluster Details page.

AWS CLI or Amazon EMR CLI

These examples demonstrate how to use the AWS CLI or the Amazon EMR CLI to create a cluster and submit a Streaming step.

To create a cluster and submit a Streaming step using the AWS CLI

  • To create a cluster and submit a Streaming step using the AWS CLI, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3 \
      --steps Type=STREAMING,Name="Streaming Program",ActionOnFailure=CONTINUE,\
      Args=[--files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/streaming/output/2014-05-26]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming Program",ActionOnFailure=CONTINUE,Args=[--files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/streaming/output/2014-05-26]

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
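    Because the introduction also mentions the API, the following Python sketch is a rough equivalent of the create-cluster command above, calling the RunJobFlow API through boto3. Treat it as a hedged sketch rather than the period API client: boto3 postdates this guide's CLI examples, the application installs (Hue, Hive, Pig) are omitted for brevity, and the bucket name, key pair name, and streaming JAR path are placeholders or assumptions to verify for your AMI version.

      # Hedged sketch: launch the same cluster and streaming step via the
      # RunJobFlow API using boto3. Placeholders/assumptions: mybucket,
      # myKey, and the streaming JAR path (typical for 3.x AMIs; verify).
      import boto3

      emr = boto3.client('emr', region_name='us-east-1')
      response = emr.run_job_flow(
          Name='Test cluster',
          AmiVersion='3.3',
          ServiceRole='EMR_DefaultRole',      # the default roles created by
          JobFlowRole='EMR_EC2_DefaultRole',  # "aws emr create-default-roles"
          Instances={
              'MasterInstanceType': 'm3.xlarge',
              'SlaveInstanceType': 'm3.xlarge',
              'InstanceCount': 3,             # one master node, two core nodes
              'Ec2KeyName': 'myKey',
              # Keep the cluster waiting after the step, matching the
              # create-cluster default (no --auto-terminate).
              'KeepJobFlowAliveWhenNoSteps': True,
          },
          Steps=[{
              'Name': 'Streaming Program',
              'ActionOnFailure': 'CONTINUE',
              'HadoopJarStep': {
                  'Jar': '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
                  'Args': ['-files', 's3://elasticmapreduce/samples/wordcount/wordSplitter.py',
                           '-mapper', 'wordSplitter.py',
                           '-reducer', 'aggregate',
                           '-input', 's3://elasticmapreduce/samples/wordcount/input',
                           '-output', 's3://mybucket/streaming/output/2014-05-26'],
              },
          }],
      )
      print(response['JobFlowId'])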

To create a cluster and submit a Streaming step using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type one of the following commands. For more information, see the Command Line Interface Reference for Amazon EMR.

    Note

    The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x when using the Amazon EMR CLI.

    For Hadoop 2.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --stream --ami-version 3.3 \
      --instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
      --input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate \
      --output s3://mybucket/output/2014-01-16
    • Windows users:

      ruby elastic-mapreduce --create --stream --ami-version 3.3 --instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

    For Hadoop 1.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --stream \
      --input s3://elasticmapreduce/samples/wordcount/input \
      --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
      --reducer aggregate \
      --output s3://mybucket/output/2014-01-16
    • Windows users:

      ruby elastic-mapreduce --create --stream --input s3://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

    By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
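
    Whichever interface you use, the streaming step writes its results as part-* files under the output prefix in Amazon S3. As a quick check, the following sketch lists those objects with boto3; the bucket and prefix are the placeholders from the commands above.

      # List the word-count output the step wrote to Amazon S3.
      # "mybucket" and the prefix are placeholders from the commands above.
      import boto3

      s3 = boto3.client('s3')
      resp = s3.list_objects_v2(Bucket='mybucket', Prefix='output/2014-01-16/')
      for obj in resp.get('Contents', []):
          print(obj['Key'], obj['Size'])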