Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch a Hive Cluster

This section covers the basics of creating a cluster using Hive in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a cluster using Hive with either the Amazon EMR console, the CLI, or the Query API. Before you create your cluster, you'll need to create objects and permissions; for more information, see Prepare Input Data (Optional).

For advanced information on Hive configuration options, see Analyze Data with Hive.

A cluster using Hive enables you to create a data analysis application using a SQL-like language. The example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances. This sample describes how to correlate customer click data with specific advertisements.

In this example, the Hive script is located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing instructions are located in the Hive script. The script requires additional libraries that are located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/hive-ads/tables. The output is saved to an Amazon S3 bucket you created as part of Prepare an Output Location (Optional).
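
The data processing in the sample's model-build.q script is defined by AWS, but a minimal, hypothetical sketch of how a Hive script can consume the LIBS, INPUT, and OUTPUT variables passed on the command line might look like the following (the table, column, and jar names here are illustrative, not taken from the sample):

```sql
-- Hypothetical sketch: ${LIBS}, ${INPUT}, and ${OUTPUT} are substituted
-- by Hive from the -d definitions before the script runs.
ADD JAR ${LIBS}/jsonserde.jar;            -- illustrative jar name

CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
  ad_id STRING,
  clicked BOOLEAN
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${INPUT}/impressions';          -- illustrative subdirectory

-- Write per-ad click rates to the output location.
INSERT OVERWRITE DIRECTORY '${OUTPUT}/click-rates'
SELECT ad_id, AVG(IF(clicked, 1.0, 0.0)) AS click_rate
FROM impressions
GROUP BY ad_id;
```

Because all processing instructions live in the script, changing the analysis only requires editing the script in Amazon S3, not reconfiguring the cluster.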

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a cluster using Hive.

To create a cluster using Hive

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster Configuration
    Field | Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Protect a Cluster from Termination. Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  4. In the Software Configuration section, verify the fields according to the following table.

    Software Configuration
    Field | Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

    The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image.

  5. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Hardware Configuration
    Field | Action
    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon Elastic Compute Cloud User Guide for Linux.

    Master

    Accept the default instance type.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Core

    Accept the default instance type.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Task

    Accept the default instance type.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

  6. In the Security and Access section, complete the fields according to the following table.

    Security and Access
    Field | Action
    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    EMR role

    Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

    Allows Amazon EMR to access other AWS services on your behalf.

    For more information, see Configure IAM Roles for Amazon EMR.

    EC2 instance profile

    You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EC2 instance profile.

    This controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.

  7. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional).

  8. In the Steps section, choose Hive Program from the list and click Configure and add.

    In the Add Step dialog, specify the cluster parameters using the following table as a guide, and then click Add.

    Specify Hive Parameters
    Field | Action
    Script S3 location*

    Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.

    Input S3 location

    Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this value is passed to the Hive script as a parameter named INPUT.

    Output S3 location

    Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this value is passed to the Hive script as a parameter named OUTPUT.

    Arguments

    Optionally, enter a list of arguments (space-separated strings) to pass to Hive.

    Action on Failure

    This determines what the cluster does in response to any errors. The possible values for this setting are:

    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.

    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.

    • Continue: If the step fails, continue to the next step.

    * Required parameter
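
The Script, Input, and Output fields above correspond to Hive's -f and -d command-line options. As a rough, hypothetical sketch (not the exact command Amazon EMR generates), the step amounts to an invocation like the following on the master node:

```shell
# Hypothetical sketch of the step's effect: the S3 locations become
# -d variable definitions, and -f names the script to run.
hive -d INPUT=s3n://elasticmapreduce/samples/hive-ads/tables \
     -d OUTPUT=s3n://myawsbucket/hive-ads/output/ \
     -f s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q
```

Inside the script, the definitions are available as ${INPUT} and ${OUTPUT}.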

  9. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  10. When the cluster starts, the console displays the Cluster Details page.

CLI

This example describes how to use the CLI to create a cluster using Hive.

To create a cluster using Hive

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test Hive" --hive-script \
      s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
      --args -d,LIBS=s3n://elasticmapreduce/samples/hive-ads/libs,\
      -d,INPUT=s3n://elasticmapreduce/samples/hive-ads/tables,\
      -d,OUTPUT=s3n://myawsbucket/hive-ads/output/
    • Windows users:

      ruby elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args -d,LIBS=s3n://elasticmapreduce/samples/hive-ads/libs,-d,INPUT=s3n://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3n://myawsbucket/hive-ads/output/

The output looks similar to the following.

Created cluster JobFlowID
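
Note how the --args value packs what would be space-separated arguments into a single comma-separated string: the CLI splits the string on commas before passing the pieces to Hive. A quick sketch of that splitting (the string below is the sample's argument list):

```python
# The EMR CLI splits the --args value on commas into individual
# arguments; this shows the resulting argument list.
args = ("-d,LIBS=s3n://elasticmapreduce/samples/hive-ads/libs,"
        "-d,INPUT=s3n://elasticmapreduce/samples/hive-ads/tables,"
        "-d,OUTPUT=s3n://myawsbucket/hive-ads/output/")
print(args.split(","))
```

This is why values containing commas cannot be passed through --args directly.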

By default, this command launches a two-node cluster using Amazon EC2 m1.small instances. Later, when your steps are running correctly on a small set of sample data, you can launch clusters with more nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
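
For example, the same job could be scaled up as follows (the instance type and count here are illustrative choices, not recommendations from this guide):

```shell
# Same sample job, but on five m1.large instances instead of the defaults.
./elastic-mapreduce --create --name "Test Hive" \
  --num-instances 5 --instance-type m1.large \
  --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
  --args -d,LIBS=s3n://elasticmapreduce/samples/hive-ads/libs,\
-d,INPUT=s3n://elasticmapreduce/samples/hive-ads/tables,\
-d,OUTPUT=s3n://myawsbucket/hive-ads/output/
```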