Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Step 2: Create a Cluster

For Hive to run on Amazon EMR, you must create a cluster with Hive enabled. This sets up the necessary applications and infrastructure for Hive to connect to DynamoDB. The following procedures explain how to create an interactive Hive cluster from the AWS Management Console and the CLI.

To start a cluster using the AWS Management Console

  1. Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create Cluster.

  3. On the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster Configuration
    Field | Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Choose Yes.

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. Typically, set this value to Yes when you are developing an application (so that you can debug errors that would otherwise have terminated the cluster) or when you need to protect long-running clusters or clusters that contain data. For more information, see Protect a Cluster from Termination in the Amazon EMR Developer Guide.

    Logging

    Choose Enabled.

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files in the Amazon EMR Developer Guide.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files in the Amazon EMR Developer Guide.

    Debugging

    Choose Enabled.

    This option creates a debug log index in Amazon SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can set this option only when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  4. In the Software Configuration section, verify the fields according to the following table.

    Software Configuration
    Field | Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop in the Amazon EMR Developer Guide.

    AMI version

    Choose 2.4.2 (Hadoop 1.0.3).

    This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image in the Amazon EMR Developer Guide.

    Applications to be installed - Hive

    A default Hive version should already be selected and displayed in the list. If it does not appear, choose it from the Additional applications list.

    For more information, see Analyze Data with Hive in the Amazon EMR Developer Guide.

    Applications to be installed - Pig

    A default Pig version should already be selected and displayed in the list. If it does not appear, choose it from the Additional applications list.

    For more information, see Process Data with Pig in the Amazon EMR Developer Guide.

  5. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or fewer. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Hardware Configuration
    Field | Action
    Network

    Choose Launch into EC2-Classic.

    Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional) in the Amazon EMR Developer Guide.

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

    Master

    Choose m1.small.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups in the Amazon EMR Developer Guide.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run the master node on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

    Core

    Choose m1.small.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance type to use for core nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups in the Amazon EMR Developer Guide.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

    Task

    Choose m1.small.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance type to use for task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    For more information, see Instance Groups in the Amazon EMR Developer Guide.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional) in the Amazon EMR Developer Guide.

  6. In the Security and Access section, complete the fields according to the following table.

    Security and Access
    Field | Action
    EC2 key pair

    Choose Proceed without an EC2 key pair.

    Optionally, specify a key pair that you created previously. For more information, see Create SSH Credentials for the Master Node in the Amazon EMR Developer Guide.

    If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster in the Amazon EMR Developer Guide.

    IAM user access

    Choose No other IAM users.

    Optionally, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions in the Amazon EMR Developer Guide.

    IAM role

    Choose Proceed without role.

    This controls application access to the EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR in the Amazon EMR Developer Guide.

  7. In the Bootstrap Actions section, leave the settings unchanged. No bootstrap actions are necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional) in the Amazon EMR Developer Guide.

  8. Review your configuration and, if you are satisfied with the settings, click Create Cluster.

  9. When the cluster starts, you see the Summary pane.
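
The hardware settings chosen in step 5 add up to three nodes: one master, two core, and zero task nodes. The following shell sketch totals a planned cluster and checks it against the default limit of 20 nodes per AWS account described in the note in step 5. The function name check_node_budget and its arguments are illustrative and are not part of any Amazon EMR tool. The limit is hard-coded on the assumption that you have not requested an increase.

```shell
#!/bin/sh
# Illustrative helper (not part of the Amazon EMR CLI).
# DEFAULT_NODE_LIMIT assumes the default per-account limit of 20 nodes
# described in this guide; change it if your limit has been raised.
DEFAULT_NODE_LIMIT=20

# Usage: check_node_budget CORE_COUNT TASK_COUNT [NODES_ALREADY_RUNNING]
# Prints the node count for the new cluster (one master + core + task).
# Returns 0 if the account total stays within the limit, 1 otherwise.
check_node_budget() {
    core=$1
    task=$2
    running=${3:-0}
    new_nodes=$((1 + core + task))
    echo "$new_nodes"
    [ $((new_nodes + running)) -le "$DEFAULT_NODE_LIMIT" ]
}
```

With the tutorial values, check_node_budget 2 0 prints 3, which is the same total that the command line procedure passes to --num-instances.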


To start a cluster using a command line client

  1. Download the Amazon EMR Ruby command line client (CLI). If you downloaded the Amazon EMR CLI before 11 December 2011, you will need to download and install the latest version to get support for AMI versioning, Amazon EMR AMI version 2.0, and Hadoop 0.20.205.

  2. Install the command line client and set up your credentials. For information about how to do this, go to Command Line Interface Reference for Amazon EMR in the Amazon EMR Developer Guide.

  3. In the directory where you installed the Amazon EMR CLI, run the following command to start a new cluster. Specify your own instance type in place of m1.small and your own cluster name in place of "myJobFlowName". For more information about the Amazon EMR CLI, see Command Line Interface for Amazon EMR in the Amazon EMR Developer Guide.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --num-instances 3 \
      --instance-type m1.small \
      --name "myJobFlowName" \
      --hive-interactive --hive-versions 0.7.1.1 \
      --ami-version latest \
      --hadoop-version 0.20.205
    • Windows users:

      ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type m1.small --name "myJobFlowName" --hive-interactive --hive-versions 0.7.1.1 --ami-version latest --hadoop-version 0.20.205

    You must use the same account to create the Amazon EMR cluster that you used to store data in DynamoDB. This ensures that the credentials passed in by the CLI will match those required by DynamoDB.
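
The command in step 3 can also be assembled from shell variables, which makes it easier to substitute your own values. The sketch below only prints the command so that you can review it before running it. The function name emr_create_command and the variable names are illustrative; the flags and version strings are copied from the example above and have not been checked against other CLI releases.

```shell
#!/bin/sh
# Illustrative wrapper around the command shown in step 3.
# Variable names are assumptions; flags and versions come from that example.
CLUSTER_NAME=${CLUSTER_NAME:-myJobFlowName}
INSTANCE_TYPE=${INSTANCE_TYPE:-m1.small}
NUM_INSTANCES=${NUM_INSTANCES:-3}   # total nodes, including the master

# Print the create command on one line without executing it.
emr_create_command() {
    printf '%s' "./elastic-mapreduce --create --alive"
    printf ' --num-instances %s' "$NUM_INSTANCES"
    printf ' --instance-type %s' "$INSTANCE_TYPE"
    printf ' --name "%s"' "$CLUSTER_NAME"
    printf ' %s' "--hive-interactive --hive-versions 0.7.1.1"
    printf ' %s\n' "--ami-version latest --hadoop-version 0.20.205"
}
```

Calling emr_create_command prints the full command line. After reviewing it, you could run it from the CLI installation directory, for example by copying the printed line into your terminal.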

Note

After you create the cluster, you should wait until its status is WAITING before continuing to the next step.
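
Waiting for the WAITING status can be scripted. The following is a minimal sketch, not an official tool. It assumes that the CLI's --list option prints one line per cluster with the job flow ID in the first column and the state in the second, and that the state names in the case statement match what your CLI version reports. The EMR_CLI variable and both function names are hypothetical.

```shell
#!/bin/sh
# Illustrative polling helper; not part of the Amazon EMR CLI.
# EMR_CLI is an assumed variable naming the client installed earlier.
EMR_CLI=${EMR_CLI:-./elastic-mapreduce}

# Print the state of one job flow.
# Assumes --list prints one line per job flow with the ID in column 1
# and the state in column 2.
cluster_state() {
    $EMR_CLI --list | awk -v id="$1" '$1 == id { print $2 }'
}

# Usage: wait_for_waiting JOB_FLOW_ID [POLL_SECONDS]
# Returns 0 once the state is WAITING, 1 if the cluster ended first.
# State names below are assumptions about the CLI output.
wait_for_waiting() {
    while :; do
        state=$(cluster_state "$1")
        case $state in
            WAITING) return 0 ;;
            TERMINATED|FAILED|COMPLETED|SHUTTING_DOWN) return 1 ;;
        esac
        sleep "${2:-30}"
    done
}
```

For example, wait_for_waiting followed by the job flow ID that the create command reported blocks until the cluster is ready for the next step. The loop has no timeout, so interrupt it manually if the ID does not appear in the list.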