Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Interactive and Batch Hive Clusters

Amazon EMR enables you to run Hive scripts in two modes:

  • Interactive

  • Batch

Typically, you use interactive mode to troubleshoot your cluster and use batch mode in production.

In interactive mode, you connect over SSH as the Hadoop user to the master node of the Hadoop cluster and use the Hive command line interface to develop and run your Hive script. Interactive mode makes it easier to revise a Hive script than batch mode does. After you have successfully revised the Hive script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run production clusters.

In batch mode, you upload your Hive script to Amazon S3, and then execute it using a job flow. You can pass parameter values into your Hive script and reference resources in Amazon S3. Variables in Hive scripts use the dollar sign and curly braces, for example:

${VariableName}
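For example, a parameterized Hive script might look like the following sketch. The table name and column layout are illustrative, not part of this guide; only the ${INPUT} and ${OUTPUT} references matter here.

    -- myquery.q: ${INPUT} and ${OUTPUT} are replaced with the
    -- values passed in before the script runs.
    CREATE EXTERNAL TABLE logs (ip STRING, request STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '${INPUT}';

    INSERT OVERWRITE DIRECTORY '${OUTPUT}'
    SELECT ip, COUNT(*) FROM logs GROUP BY ip;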

In the Amazon EMR CLI, use the -d parameter to pass values into the Hive script as in the following example.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create \
    --name "Hive Cluster"  \
    --hive-script   \
    --args s3://myawsbucket/myquery.q \
    --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
  • Windows users:

    ruby elastic-mapreduce --create --name "Hive Cluster" --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

In batch mode, you can also pass parameter values into a Hive script from the Specify Parameters page of the Create a New Job Flow wizard in the Amazon EMR console. Enter the values in the Extra Args field. For example, you could enter:

-d VariableName=Value

The Amazon EMR console and Amazon EMR command line interface (CLI) support both interactive and batch modes.

Running Hive in Interactive Mode

You can run Hive in interactive mode from both the CLI and Amazon EMR console.

  • To start an interactive cluster from the command line, use the --alive option with the --create parameter so that the cluster remains active until you terminate it.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Hive cluster" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Hive cluster" --num-instances 5 --instance-type m1.large --hive-interactive

The return output is similar to the following:

Created jobflow JobFlowID

Add steps from the Amazon Elastic MapReduce (Amazon EMR) CLI, or SSH directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

You can also start an interactive cluster from the Amazon EMR console using the Create a New Job Flow wizard.

When the cluster reaches the WAITING state, you can add steps to it from the Amazon EMR CLI, or SSH directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.
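One way to check for the WAITING state is the CLI's --list option (this assumes the Amazon EMR CLI is installed and configured with your credentials):

    # List active job flows and their current states
    ./elastic-mapreduce --list --active

The output shows each active job flow ID along with its state, so you can tell when the cluster is ready to accept steps.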

Adding steps can help you test and develop Hive scripts. For example, if the script fails, you can add a new step to the cluster without having to wait for a new cluster to start. The following procedure shows you how to use the command line to add Hive as a new step to an existing cluster.

To add Hive to an existing cluster

  • Enter the following command, replacing location with the Amazon S3 bucket that contains your Hive script, and JobFlowID with the identifier of your job flow:

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID \
      --hive-script \
      --args s3://location/myquery.q \
      --args -d,INPUT=s3://location/input,-d,OUTPUT=s3://location/output
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --hive-script --args s3://location/myquery.q --args -d,INPUT=s3://location/input,-d,OUTPUT=s3://location/output

Running Hive in Batch Mode

The following procedure shows how to run Hive in batch mode from the command line. The procedure assumes that you stored the Hive script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, go to the Amazon S3 Getting Started Guide.

To create a cluster with a step that executes a Hive script

  • Enter the following command, substituting the replaceable parameters with the actual values from your job:

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create \
      --name "Hive cluster"  \
      --hive-script   \
      --args s3://myawsbucket/myquery.q \
      --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
    • Windows users:

      ruby elastic-mapreduce --create --name "Hive cluster" --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

The --args option provides arguments to the Hive script. The first --args option here specifies the location of the Hive script in Amazon S3. In the second --args option, -d provides a way to pass values (INPUT, OUTPUT) into the script. Within the Hive script, these parameters are available as ${variable}; in this example, Hive replaces ${INPUT} and ${OUTPUT} with the values you passed in. Because these variables are substituted during a preprocessing step, they can occur anywhere in the Hive script.
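Concretely, suppose the script contains a line like the following (illustrative only):

    LOCATION '${INPUT}';

Passing --args -d,INPUT=s3://myawsbucket/input causes Hive to run the line as:

    LOCATION 's3://myawsbucket/input';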

The return output is similar to the following:

Created jobflow JobFlowID