Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch a Cluster and Submit Hive Work

This section demonstrates creating an Amazon Elastic MapReduce (Amazon EMR) cluster with Hive installed and submitting Hive work. You can install Hive on a cluster using the Amazon EMR console, the CLI, or the API. Before you create a cluster you need to prepare your input data. For more information on preparing input data, see Prepare Input Data (Optional).

For advanced information on Hive configuration options, see Analyze Data with Hive.

A cluster using Hive enables you to create a data analysis application using a SQL-like language. The example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances. This sample describes how to correlate customer click data to specific advertisements.

In this example, the Hive script is located in an Amazon S3 bucket at s3://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing instructions are located in the Hive script. The script requires additional libraries that are located in an Amazon S3 bucket at s3://elasticmapreduce/samples/hive-ads/libs. The input data is located in the Amazon S3 bucket s3://elasticmapreduce/samples/hive-ads/tables. The output is saved to an Amazon S3 bucket you create as part of Prepare an Output Location (Optional).

Launch a Cluster and Submit Hive Work Using the Amazon EMR Console

This example describes how to use the Amazon EMR console to create a cluster with Hive installed and how to submit a Hive step to the cluster.

To launch a cluster with Hive installed and to submit a Hive step

Note

Hive and Pig are selected for installation by default. You may remove either application by clicking the Remove icon (X).

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. On the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Field | Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination. Typically, you set this value to Yes when developing an application (so that you can debug errors that would otherwise have terminated the cluster) and when protecting long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  4. In the Software Configuration section, verify the fields according to the following table.

    Field | Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

    The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI).

  5. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Field | Action
    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

    Master

    Accept the default instance type.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run the master node on a Spot Instance. For more information, see (Optional) Lower Costs with Spot Instances.

    Core

    Accept the default instance type.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

    Task

    Accept the default instance type.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see (Optional) Lower Costs with Spot Instances.

  6. In the Security and Access section, complete the fields according to the following table.

    Field | Action
    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    Roles configuration

    Click Default to generate the default EMR role and EC2 instance profile. If the default roles exist, they are used for your cluster. If they do not exist, they are created (assuming you have proper permissions). You may also click View policies for default roles to view the default role properties. Alternatively, if you have custom roles, you can click Custom and choose your roles. An EMR role and EC2 instance profile are required when creating a cluster using the console.

    The EMR role allows Amazon EMR to access other AWS services on your behalf. The EC2 instance profile controls the access that applications running on the cluster's EC2 instances have to other AWS services.

    For more information, see Configure IAM Roles for Amazon EMR.

  7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.

  8. In the Steps section, choose Hive Program from the list and click Configure and add.

    In the Add Step dialog, specify the Hive program parameters using the following table as a guide, and then click Add.

    Specify Hive Parameters
    Field | Action
    Script S3 location* | Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.
    Input S3 location | Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT.
    Output S3 location | Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT.
    Arguments | Optionally, enter a list of arguments (space-separated strings) to pass to Hive.
    Action on Failure

    This determines what the cluster does in response to any errors. The possible values for this setting are:

    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.

    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.

    • Continue: If the step fails, continue to the next step.

    * Required parameter

  9. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  10. When the cluster starts, the console displays the Cluster Details page.
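Behind the scenes, the console fields above map onto the same -f/-d argument style that the CLI examples later in this topic use: the script location is passed with -f, and the input and output locations become -d variables that the script can reference as ${INPUT} and ${OUTPUT}. The following is a minimal sketch of that mapping; the bucket names and paths are placeholders, not the sample's real locations:

```shell
# Placeholder values standing in for the three console fields.
SCRIPT="s3://mybucket/path/script.q"   # Script S3 location
INPUT="s3://mybucket/input/"           # Input S3 location
OUTPUT="s3://mybucket/output/"         # Output S3 location

# The step passes the script with -f and each location as a -d variable,
# which the Hive script references as ${INPUT} and ${OUTPUT}.
ARGS="-f $SCRIPT -d INPUT=$INPUT -d OUTPUT=$OUTPUT"
echo "$ARGS"
```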

Launch a Cluster and Submit Hive Work Using the AWS CLI or the Amazon EMR CLI

These examples describe how to use the AWS CLI or the Amazon EMR CLI to create a cluster with Hive installed and how to submit Hive work to the cluster.

To launch a cluster with Hive installed and to submit a Hive step using the AWS CLI

You can install Hive on a cluster with the AWS CLI by using the create-cluster subcommand with the --applications parameter, identifying each application to install with a Name argument. To submit a Hive step when the cluster is launched, add the --steps parameter, indicate the step type with the Type argument, and provide the necessary argument string.

  • To launch a cluster and submit a Hive step, type the following command, replace myKey with the name of your EC2 key pair, and replace mybucket with your bucket name.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3 \
      --steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs]

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
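If the inline Args list becomes hard to read, the AWS CLI can also load the step definition from a JSON file via the file:// prefix, for example aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://./hive-step.json to submit the same Hive work to a cluster that is already running. The following is a sketch; the file name is arbitrary and the cluster ID above is a placeholder:

```shell
# Write the same Hive step as a JSON file that --steps can load with
# file:// (works with both create-cluster and add-steps).
cat > hive-step.json <<'EOF'
[
  {
    "Type": "HIVE",
    "Name": "Hive Program",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-f", "s3://elasticmapreduce/samples/hive-ads/libs/model-build.q",
      "-d", "INPUT=s3://elasticmapreduce/samples/hive-ads/tables",
      "-d", "OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32",
      "-d", "LIBS=s3://elasticmapreduce/samples/hive-ads/libs"
    ]
  }
]
EOF
```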

To launch a cluster with Hive installed and to submit Hive work using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test Hive" --ami-version 3.3 --hive-script \
      s3://elasticmapreduce/samples/hive-ads/libs/model-build.q \
      --args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,\
      -d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,\
      -d,OUTPUT=s3://mybucket/hive-ads/output/
    • Windows users:

      ruby elastic-mapreduce --create --name "Test Hive" --ami-version 3.3 --hive-script s3://elasticmapreduce/samples/hive-ads/libs/model-build.q --args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/

    The output looks similar to the following.

    Created cluster JobFlowID

    By default, this command launches a two-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch larger clusters. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
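For example, the launch above could be rewritten with those parameters as follows. This sketch only composes and prints the command string (the instance count and type are illustrative, and actually running it requires the deprecated Amazon EMR CLI to be installed):

```shell
# Compose a scaled-up variant of the launch command without executing it.
CMD='./elastic-mapreduce --create --name "Test Hive" --ami-version 3.3'
CMD="$CMD --num-instances 10 --instance-type m1.large"
CMD="$CMD --hive-script s3://elasticmapreduce/samples/hive-ads/libs/model-build.q"
echo "$CMD"
```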