Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Add More than 256 Steps to a Cluster

Amazon EMR currently limits the number of steps in a cluster to 256. If your cluster is long-running (such as a Hive data warehouse) or complex, you may require more than 256 steps to process your data. Because the debugging option consumes additional steps, enabling it can cause you to reach the limit more quickly.

You can employ several methods to get around this limitation:

  1. Have each step submit several jobs to Hadoop. This does not allow you unlimited steps, but it is the easiest solution if you need a fixed number of steps greater than 256.

  2. Write a workflow program that runs in a step on a long-running cluster and submits jobs to Hadoop. You could have the workflow program either:

    • Listen to an Amazon SQS queue to receive information about new steps to run (a minimal sketch of this approach appears after this list).

    • Check an Amazon S3 bucket on a regular schedule for files containing information about the new steps to run.

  3. Write a workflow program that runs on an EC2 instance outside of Amazon EMR and submits jobs to your clusters using SSH.

  4. Manually connect to the master node using SSH and submit jobs to Hadoop or queries to Hive.
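
A minimal sketch of the queue-based workflow program from option 2 follows. It assumes the boto3 SDK is installed on the instance that runs it; the queue URL and message format (one complete hadoop command per message) are hypothetical choices made for illustration, not part of Amazon EMR.

    # Minimal sketch: poll an Amazon SQS queue for Hadoop commands and run each one.
    # The queue URL and message format are hypothetical; replace them with your own.
    import shlex
    import subprocess

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/emr-steps"

    sqs = boto3.client("sqs")

    while True:
        # Long-poll for up to 20 seconds so the loop does not spin on an empty queue.
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
        )
        for message in response.get("Messages", []):
            command = message["Body"]  # for example, "hadoop jar myjar.jar arg1 arg2"
            # Run the command on the master node; Hadoop schedules the job itself.
            subprocess.run(shlex.split(command), check=False)
            # Delete the message so the same job is not submitted twice.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])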

You can add more steps to a cluster by using the SSH shell to connect to the master node and submitting queries directly to the software running on the master node, such as Hive and Hadoop.

You can SSH directly into the master node using a conventional SSH connection, as outlined in View Log Files. You can also use the CLI's --ssh command line argument to pass commands to the master node without establishing a separate SSH session.

To manually submit steps to Hadoop on the master node

  • From a terminal or command-line window, call the CLI client, specifying the --ssh parameter, and set its value to the command you want to run on the master node. The CLI uses its connection to the master node to run the command.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --scp myjar.jar \
      --ssh "hadoop jar myjar.jar"
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --scp myjar.jar --ssh "hadoop jar myjar.jar"

    The preceding example uses the --scp parameter to copy the JAR file myjar.jar from your local directory to the master node of cluster JobFlowID. The example uses the --ssh parameter to command the copy of Hadoop running on the master node to run myjar.jar.

To manually submit queries to Hive on the master node

  1. If Hive is not already installed, use the following command to install it.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --hive-interactive
  2. Create a Hive script file containing the query or command to run. The following example script, my-hive.q, creates two tables, aTable and anotherTable, and copies the contents of aTable into anotherTable, replacing any existing data.

    ---- sample Hive script file: my-hive.q ----
    create table aTable (aColumn string);
    create table anotherTable like aTable;
    insert overwrite table anotherTable select * from aTable;
  3. Call the CLI client, specifying the --scp parameter to copy the Hive script to the master node and the --ssh parameter to set the command that runs the script. The CLI uses its connection to the master node and your .pem credentials file to run the command.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --scp my-hive.q \
      --ssh "hive -f my-hive.q"
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --scp my-hive.q --ssh "hive -f my-hive.q"

    The preceding example copies the script file my-hive.q to the master node of the JobFlowID cluster and then runs the query it contains against Hive.
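
The Amazon S3 variant from the earlier list works the same way for Hive: a workflow program running on the master node can check a bucket on a schedule for new script files, download each one, and run it with hive -f. The following sketch is illustrative only; it assumes the boto3 SDK is installed on the master node, and the bucket name, prefix, and polling interval are hypothetical.

    # Minimal sketch: poll an S3 prefix for new Hive scripts and run each with "hive -f".
    # The bucket, prefix, and interval are placeholders; replace them with your own.
    import os
    import subprocess
    import time

    import boto3

    BUCKET = "my-emr-scripts"
    PREFIX = "hive/pending/"
    POLL_SECONDS = 300

    s3 = boto3.client("s3")
    already_run = set()

    while True:
        listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
        for obj in listing.get("Contents", []):
            key = obj["Key"]
            if key in already_run or not key.endswith(".q"):
                continue
            local_path = "/tmp/" + os.path.basename(key)
            s3.download_file(BUCKET, key, local_path)
            # Run the downloaded script against Hive on the master node.
            subprocess.run(["hive", "-f", local_path], check=False)
            already_run.add(key)
        time.sleep(POLL_SECONDS)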

To manually submit tasks based on Python files to Hadoop while connected using SSH

  • Use the Hadoop streaming jar, as shown in the example below.

    hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
      -input s3n://elasticmapreduce/samples/wordcount/input \
      -output hdfs:///rubish/1 \
      -mapper s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
      -reducer aggregate
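
The wordSplitter.py mapper referenced above reads lines from standard input and writes key-value pairs that the built-in aggregate reducer sums. A streaming mapper of your own follows the same contract. The sketch below is a minimal word-count mapper written against that contract (the LongValueSum: prefix tells the aggregate reducer to sum the counts); it is not a copy of the sample script.

    #!/usr/bin/env python
    # Minimal Hadoop streaming mapper for use with "-reducer aggregate".
    # Reads text from stdin and emits "LongValueSum:<word>\t1" for each word;
    # the aggregate reducer sums the 1s to produce word counts.
    import sys

    def main():
        for line in sys.stdin:
            for word in line.split():
                sys.stdout.write("LongValueSum:" + word + "\t1\n")

    if __name__ == "__main__":
        main()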