Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Step 2: Create a Cluster

For Hive to run on Amazon EMR, you must create a cluster with Hive enabled. This sets up the necessary applications and infrastructure for Hive to connect to Amazon DynamoDB. The following procedures explain how to create an interactive Hive cluster from the AWS Management Console and the CLI.

To start a cluster using the AWS Management Console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

    This opens the Amazon EMR console, which you can use to launch and manage clusters.

  2. Select a region from the Region drop-down box. This is the region in which you'll create the Amazon EMR cluster. To avoid cross-region data transfer charges, choose the same region that hosts your Amazon DynamoDB data. Similarly, if you are exporting data to Amazon S3, the Amazon S3 bucket should be in the same region as both the Amazon DynamoDB data and the Amazon EMR cluster.

  3. Click Create New Job Flow.

  4. On the DEFINE NEW JOB FLOW page, do the following:

    • Give your cluster a name, such as "My Cluster".

    • Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop.

    • Select the Run your own application radio button.

    • In the Choose a Job Type drop-down menu, choose Hive Program.

    Click Continue.

  5. On the SPECIFY PARAMETERS page, select the Start an Interactive Hive Session radio button.

    Hive is an open-source tool that runs on top of Hadoop to provide a way to query clusters using a simplified SQL syntax. Select an interactive session to issue commands from a terminal window.

    Later, once you’ve established a set of queries that you’d like to run on a regular basis, you can save your queries as a script in an Amazon S3 bucket and have Amazon EMR run them for you without an interactive session.

    Click Continue.

  6. On the CONFIGURE EC2 INSTANCES page, set the number and type of instances to process the data in parallel.

    In the Master Instance Group, for Instance Type, use an m1.small master node. In the Core Instance Group, for Instance Count, use the default value 2, and for Instance Type, use the default value m1.small. If you need more processing power, select larger instance types or a higher instance count.

    Click Continue.

  7. On the ADVANCED OPTIONS page, select the key pair you created earlier in the Amazon EC2 Key Pair drop-down menu.

    Leave the rest of the settings on this page at the default values. For example, Amazon VPC Subnet Id should remain set to No preference.

    Click Continue.

  8. In the Bootstrap Actions dialog:

    Select the Proceed with no Bootstrap Actions radio button.

    Click Continue.

  9. In the Review dialog:

    Review the settings for your cluster.

    Click Create Job Flow.

    Note

    When the confirmation window closes, your new cluster appears in the list of clusters in the Amazon EMR console with the status STARTING. If you do not see your cluster with the STARTING status, click Refresh to see the cluster. It takes a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster. Your cluster is ready for use when the status is WAITING.

To start a cluster using a command line client

  1. Download the Amazon EMR Ruby command line client (CLI). If you downloaded the Amazon EMR CLI before 11 December 2011, you will need to download and install the latest version to get support for AMI versioning, Amazon EMR AMI version 2.0, and Hadoop 0.20.205.

  2. Install the command line client and set up your credentials. For information about how to do this, go to Sign Up and Install the Command Line Interface in the Amazon EMR Developer Guide.

  3. Use the following syntax to start a new cluster, specifying your own values for the instance size and your own cluster name in place of "myJobFlowName".

    Run the command from the directory where you installed the Amazon EMR CLI. For more information about the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --num-instances 3 \
      --instance-type m1.small \
      --name "myJobFlowName" \
      --hive-interactive --hive-versions 0.7.1.1 \
      --ami-version latest \
      --hadoop-version 0.20.205

    • Windows users:

      ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type m1.small --name "myJobFlowName" --hive-interactive --hive-versions 0.7.1.1 --ami-version latest --hadoop-version 0.20.205

    You must use the same account to create the Amazon EMR cluster that you used to store data in Amazon DynamoDB. This ensures that the credentials passed in by the CLI will match those required by Amazon DynamoDB.
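
If you script the launch, it can help to capture the new job flow ID for later commands. The following is a minimal sketch, not part of the CLI itself: the EMR_CLI variable and the create_hive_cluster function are hypothetical names, and the sketch assumes that the CLI reports a successful launch with a line of the form "Created job flow <ID>". Check the output of your CLI version before relying on it.

```shell
#!/bin/sh
# Sketch: launch the interactive Hive cluster and print its job flow ID.
# EMR_CLI and create_hive_cluster are hypothetical names used for illustration.
EMR_CLI="${EMR_CLI:-./elastic-mapreduce}"

# Arguments: $1 = cluster name
#            $2 = instance type  (defaults to m1.small)
#            $3 = instance count (defaults to 3)
# Assumption: on success the CLI prints "Created job flow <ID>";
# sed keeps only the ID from that line and discards everything else.
create_hive_cluster() {
  $EMR_CLI --create --alive \
    --num-instances "${3:-3}" \
    --instance-type "${2:-m1.small}" \
    --name "$1" \
    --hive-interactive --hive-versions 0.7.1.1 \
    --ami-version latest \
    --hadoop-version 0.20.205 |
    sed -n 's/^Created job flow \(j-[A-Z0-9]*\).*$/\1/p'
}

# Usage:
#   JOBFLOW_ID=$(create_hive_cluster "myJobFlowName")
```

On Windows, the same function could be pointed at the Ruby entry point by setting EMR_CLI to "ruby elastic-mapreduce" in a POSIX-compatible shell.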

Note

After you create the cluster, you should wait until its status is WAITING before continuing to the next step.
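
If you launched the cluster from the command line, the wait can be automated by polling the CLI. The following is a hedged sketch rather than an official procedure: EMR_CLI, POLL_SECONDS, jobflow_state, and wait_until_waiting are hypothetical names, and the sketch assumes that each job flow line printed by the --list option begins with the job flow ID followed by its state.

```shell
#!/bin/sh
# Sketch: poll the Amazon EMR CLI until a job flow reaches the WAITING state.
# EMR_CLI, POLL_SECONDS, and both function names are hypothetical.
EMR_CLI="${EMR_CLI:-./elastic-mapreduce}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# Print the current state of the job flow whose ID is $1.
# Assumption: job flow lines in the --list output look like
#   <job flow ID>  <state>  <master DNS>  <name>
# Lines whose first field is not the ID (such as step lines) are ignored.
jobflow_state() {
  $EMR_CLI --list | awk -v id="$1" '$1 == id { print $2; exit }'
}

# Return 0 once the job flow is WAITING.
# Return 1 if it is shutting down or has already ended.
wait_until_waiting() {
  while :; do
    state=$(jobflow_state "$1")
    case "$state" in
      WAITING)
        return 0 ;;
      SHUTTING_DOWN|TERMINATED|COMPLETED|FAILED)
        echo "job flow $1 is $state" >&2
        return 1 ;;
    esac
    sleep "$POLL_SECONDS"
  done
}

# Usage (replace the placeholder with your own job flow ID):
#   wait_until_waiting j-XXXXXXXXXXXXX && echo "Cluster is ready"
```

The loop keeps polling while the job flow is still in a state such as STARTING, and it stops early on states that can no longer lead to WAITING, so a failed launch does not block the script indefinitely.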