Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch a Cluster and Submit a Custom JAR Step

This section covers the basics of creating a cluster and submitting a custom JAR step in Amazon EMR. Submitting a custom JAR step enables you to write a script to process your data using the Java programming language.

The example that follows is based on the Amazon EMR sample: CloudBurst. This example demonstrates creating a cluster and submitting a step using the Amazon EMR console and the CLI.

In this example, the JAR file is located in an Amazon S3 bucket at s3://elasticmapreduce/samples/cloudburst/cloudburst.jar. All of the data processing instructions are located in the JAR file. The input data is located in the Amazon S3 bucket s3://elasticmapreduce/samples/cloudburst/input. The output is saved to an Amazon S3 bucket you create as part of Prepare an Output Location (Optional).

Launch a Cluster and Submit a Custom JAR Step Using the Console

This example describes how to use the Amazon EMR console to create a cluster and submit a custom JAR step.

To create a cluster and submit a custom JAR step using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination. Typically, set this value to Yes when you are developing an application (so that you can debug errors that would otherwise have terminated the cluster) or to protect long-running clusters and clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.
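The Logging and Debugging fields above correspond to flags on the AWS CLI create-cluster subcommand. A minimal sketch, assuming a placeholder bucket name (mybucket); the command is echoed for review rather than executed:

```shell
# Placeholder bucket name; replace with your own.
LOG_URI="s3://mybucket/emr-logs/"

# --log-uri copies cluster log files to Amazon S3; --enable-debugging
# creates the debug index used by the console (it requires --log-uri).
# The command is echoed, not run, so it can be inspected first.
cmd="aws emr create-cluster --name LoggedCluster --ami-version 3.3 \
--instance-type m1.medium --instance-count 1 \
--log-uri $LOG_URI --enable-debugging"
echo "$cmd"
```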

  4. In the Software Configuration section, verify the fields according to the following table.

    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

    The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI).

  5. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

    Master

    Accept the default instance type.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

    Core

    Accept the default instance type.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

    Task

    Accept the default instance type.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

  6. In the Security and Access section, complete the fields according to the following table.

    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    Roles configuration

    Click Default to generate the default EMR role and EC2 instance profile. If the default roles exist, they are used for your cluster. If they do not exist, they are created (assuming you have proper permissions). You may also click View policies for default roles to view the default role properties. Alternatively, if you have custom roles, you can click Custom and choose your roles. An EMR role and EC2 instance profile are required when creating a cluster using the console.

    The EMR role allows Amazon EMR to access other AWS services on your behalf. The EC2 instance profile controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.
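Once the cluster is running with a key pair assigned, you can connect to the master node over SSH. A sketch, where the key file name and the master public DNS name (shown on the console's Cluster Details page) are placeholders:

```shell
# Placeholders: substitute your own key file and the master public
# DNS name from the Cluster Details page.
KEY_FILE="mykey.pem"
MASTER_DNS="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"

# The master node accepts SSH logins as the hadoop user. The command
# is echoed for review rather than executed.
cmd="ssh -i $KEY_FILE hadoop@$MASTER_DNS"
echo "$cmd"
```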

  7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.
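Although this example needs no bootstrap actions, they can be attached at cluster creation from the AWS CLI. A sketch, where the script path is a placeholder and the command is echoed for review rather than executed:

```shell
# Placeholder script path: a bootstrap action is any script on S3
# that runs on every node before Hadoop starts. Echoed, not run.
cmd="aws emr create-cluster --name BootstrappedCluster --ami-version 3.3 \
--instance-type m1.medium --instance-count 1 \
--bootstrap-actions Path=s3://mybucket/scripts/install-extras.sh"
echo "$cmd"
```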

  8. In the Steps section, choose Custom Jar from the list and click Configure and add.

    In the Add Step dialog, enter values in the boxes using the following table as a guide, and then click Add. Be sure to replace myawsbucket with your Amazon S3 bucket name.

    Specify Custom JAR Parameters
    JAR S3 location*

    Specify the URI where your script resides in Amazon S3. The value must be in the form s3://BucketName/path/ScriptName.

    Arguments*

    Enter a list of arguments (space-separated strings) to pass to the JAR file.
    Action on Failure

    This determines what the cluster does in response to any errors. The possible values for this setting are:

    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.

    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.

    • Continue: If the step fails, continue to the next step.

    * Required parameter
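The same step fields can also be supplied to an existing cluster from the AWS CLI as a JSON document passed to the add-steps subcommand. A sketch, with the cluster ID and bucket name as placeholders and the argument list abbreviated:

```shell
# All values are placeholders based on this example; the argument
# list is abbreviated. JSON keeps long step definitions readable.
cat > step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Custom JAR Step",
    "ActionOnFailure": "CONTINUE",
    "Jar": "s3://elasticmapreduce/samples/cloudburst/cloudburst.jar",
    "Args": ["s3://elasticmapreduce/samples/cloudburst/input/s_suis.br",
             "s3://elasticmapreduce/samples/cloudburst/input/100k.br",
             "s3://mybucket/cloudburst/output", "36", "3", "0", "1"]
  }
]
EOF

# j-XXXXXXXXXXXXX is a placeholder cluster ID. Echoed for review
# rather than executed.
echo aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://./step.json
```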

  9. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  10. When the cluster starts, the console displays the Cluster Details page.
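You can also check the cluster state from the AWS CLI rather than the console. A sketch, where the cluster ID is a placeholder and the command is echoed for review rather than executed:

```shell
# j-XXXXXXXXXXXXX is a placeholder; use the ID reported when the
# cluster was created. The state moves through values such as
# STARTING, RUNNING, WAITING, and TERMINATED.
cmd="aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
--query Cluster.Status.State"
echo "$cmd"
```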

Launch a Cluster and Submit a Custom JAR Step Using the AWS CLI or the Amazon EMR CLI

To launch a cluster and submit a custom JAR step using the AWS CLI

To launch a cluster and submit a custom JAR step using the AWS CLI, type the create-cluster subcommand with the --steps parameter.

  • To launch a cluster and submit a custom JAR step, type the following command, replace myKey with the name of your EC2 key pair, and replace mybucket with your bucket name.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3 \
      --steps Type=CUSTOM_JAR,Name="Custom JAR Step",ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar,\
      Args=["s3://elasticmapreduce/samples/cloudburst/input/s_suis.br","s3://elasticmapreduce/samples/cloudburst/input/100k.br","s3://mybucket/cloudburst/output/2012-06-10","36","3","0","1","240","48","24","24","128","16"]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=CUSTOM_JAR,Name="Custom JAR Step",ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar,Args=["s3://elasticmapreduce/samples/cloudburst/input/s_suis.br","s3://elasticmapreduce/samples/cloudburst/input/100k.br","s3://mybucket/cloudburst/output/2012-06-10","36","3","0","1","240","48","24","24","128","16"]

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
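To control the node count and instance type per role explicitly, the --instance-groups parameter can replace --instance-type and --instance-count. A sketch of the same one-master, two-core layout, echoed for review rather than executed (key name and instance types are placeholders):

```shell
# Each group names a role (MASTER, CORE, or TASK), a count, and an
# instance type. Echoed, not run, so it can be inspected first.
cmd="aws emr create-cluster --name TestCluster --ami-version 3.3 \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge"
echo "$cmd"
```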

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a cluster and submit a custom JAR step using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test custom JAR" \
        --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar \
          --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br \
          --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br \
          --arg s3://mybucket/cloudburst/output \
          --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
          --arg 24 --arg 128 --arg 16
    • Windows users:

      ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3://mybucket/cloudburst/output --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

Note

The individual --arg values above could also be represented as --args followed by a comma-separated list.

The output looks similar to the following.

Created cluster JobFlowID

By default, this command launches a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch clusters that run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
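For example, the scaled-up version of the command above might look like the following sketch, which uses the legacy Amazon EMR CLI (no longer under feature development) with placeholder values; the command is echoed for review rather than executed:

```shell
# --num-instances sets the total node count and --instance-type the
# EC2 instance type for all nodes. Echoed, not run.
cmd="./elastic-mapreduce --create --name LargerCluster \
--num-instances 5 --instance-type m1.large \
--jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar"
echo "$cmd"
```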