Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Amazon EMR Command Line Interface Options (Deprecated)

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

The Amazon EMR command line interface (CLI) supports the following options, arranged according to function. Options that fit into more than one category are listed multiple times.

Common Options

--access-id ACCESS_ID

Sets the AWS access identifier.

Shortcut: -a ACCESS_ID

--credentials CREDENTIALS_FILE

Specifies the credentials file that contains the AWS access identifier and the AWS private key to use when contacting Amazon EMR.

Shortcut: -c CREDENTIALS_FILE

For CLI access, you need an access key ID and secret access key. Use IAM user access keys instead of AWS root account access keys. IAM lets you securely control access to AWS services and resources in your AWS account. For more information about creating access keys, see How Do I Get Security Credentials? in the AWS General Reference.

--help

Displays help information from the CLI.

Shortcut: -h

--http-proxy HTTP_PROXY

The HTTP proxy server address, in the form host[:port].

--http-proxy-user USER

The username supplied to the HTTP proxy.

--http-proxy-pass PASS

The password supplied to the HTTP proxy.

--jobflow JOB_FLOW_IDENTIFIER

Specifies the cluster with the given cluster identifier.

Shortcut: -j JOB_FLOW_IDENTIFIER

--log-uri

Specifies the Amazon S3 bucket to receive log files. Used with --create.

--private-key PRIVATE_KEY

Specifies the AWS private key to use when contacting Amazon EMR.

Shortcut: -p PRIVATE_KEY

--trace

Traces commands made to the web service.

--verbose

Turns on verbose logging of program interaction.

--version

Displays the version of the CLI.

Shortcut: -v

To archive log files to Amazon S3

  • Set the --log-uri argument when you launch the cluster and specify a location in Amazon S3. Alternatively, you can set this value in the credentials.json file that you configured for the CLI. This causes all of the clusters you launch with the CLI to archive log files to the specified Amazon S3 bucket. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface (Deprecated). The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --log-uri s3://mybucket
    • Windows users:

      ruby elastic-mapreduce --create --log-uri s3://mybucket

To aggregate logs in Amazon S3

  • Log aggregation in Hadoop 2.x compiles logs from all containers for an individual application into a single file. This option is only available on Hadoop 2.x AMIs. To enable log aggregation to Amazon S3 using the Amazon EMR CLI, you use a bootstrap action at cluster launch to enable log aggregation and to specify the bucket to store the logs.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --master-instance-type m1.xlarge \
      --slave-instance-type m1.xlarge --num-instances 1 --ami-version 3.3 \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" \
      --ssh --name "log aggregation sub-bucket name"
    • Windows users:

      ruby elastic-mapreduce --create --alive --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 1 --ami-version 3.3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" --ssh --name "log aggregation sub-bucket name"

Uncommon Options

--apps-path APPLICATION_PATH

Specifies the Amazon S3 path to the base of the Amazon EMR bucket to use, for example: s3://elasticmapreduce.

--endpoint ENDPOINT

Specifies the Amazon EMR endpoint to connect to.
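For example, the following sketch launches a cluster against a specific regional endpoint; the endpoint URL shown is illustrative, so see Regions and Endpoints in the Amazon Web Services General Reference for the endpoint for your region:

    ./elastic-mapreduce --create --alive \
    --endpoint https://elasticmapreduce.eu-west-1.amazonaws.com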

--debug

Prints stack traces when exceptions occur.

Options Common to All Step Types

--no-wait

Don't wait for the master node to start before executing SCP or SSH, or assigning an Elastic IP address.

--key-pair-file FILE_PATH

The path to the local PEM file of the Amazon EC2 key pair to set as the connection credential when you launch the cluster.

Adding and Modifying Instance Groups

--add-instance-group INSTANCE_ROLE

Adds an instance group to an existing cluster. The role may be task only.

--modify-instance-group INSTANCE_GROUP_ID

Modifies an existing instance group.

To launch an entire cluster with Spot Instances using the Amazon EMR CLI

To specify that an instance group should be launched as Spot Instances, use the --bid-price parameter. The following example shows how to create a cluster where the master, core, and task instance groups all run as Spot Instances. The cluster launches only after the requests for the master and core instances have been completely fulfilled.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Spot Cluster" \
      --instance-group master --instance-type m1.large --instance-count 1 --bid-price 0.25 \
      --instance-group core --instance-type m1.large --instance-count 4 --bid-price 0.03 \
      --instance-group task --instance-type c1.medium --instance-count 2 --bid-price 0.10
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Spot Cluster" --instance-group master --instance-type m1.large --instance-count 1 --bid-price 0.25 --instance-group core --instance-type m1.large --instance-count 4 --bid-price 0.03 --instance-group task --instance-type c1.medium --instance-count 2 --bid-price 0.10

To launch a task instance group on Spot Instances

You can launch a task instance group on Spot Instances using the --bid-price parameter, but multiple task groups are not supported. The following example shows how to create a cluster where only the task instance group uses Spot Instances. The command launches a cluster even if the request for Spot Instances cannot be fulfilled. In that case, Amazon EMR adds task nodes to the cluster if it is still running when the Spot Price falls below the bid price.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Spot Task Group" \
      --instance-group master --instance-type m1.large \
      --instance-count 1 \
      --instance-group core --instance-type m1.large \
      --instance-count 2 \
      --instance-group task --instance-type m1.large \
      --instance-count 4 --bid-price 0.03 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Spot Task Group" --instance-group master --instance-type m1.large --instance-count 1 --instance-group core --instance-type m1.large --instance-count 2 --instance-group task --instance-type m1.small --instance-count 4 --bid-price 0.03 

To add a task instance group with Spot Instances to a cluster

Using the Amazon EMR CLI, you can add a task instance group with Spot Instances, but you cannot add multiple task groups.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowId \
      --add-instance-group task  --instance-type m1.small \
      --instance-count 5 --bid-price 0.05
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowId --add-instance-group task  --instance-type m1.small --instance-count 5 --bid-price 0.05

To change the number of Spot Instances in instance groups

You can change the number of requested Spot Instances in a cluster using the --modify-instance-group and --instance-count commands. Note that you can only increase the number of core instances in your cluster, but you can either increase or decrease the number of task instances. Setting the number of task instances to zero removes all Spot Instances (but not the instance group).

  • In the directory where you installed the Amazon EMR CLI, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowId \
      --modify-instance-group task --instance-count 5
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowId --modify-instance-group task --instance-count 5

Adding JAR Steps to Job Flows

--jar JAR_FILE_LOCATION

Specifies the location of a Java archive (JAR) file. Typically, the JAR file is stored in an Amazon S3 bucket.

--main-class

Specifies the JAR file's main class. This parameter is not needed if the JAR file's manifest specifies a main class.

--args "arg1,arg2"

Specifies the arguments for the step.

To create a cluster and submit a custom JAR step

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test custom JAR" \
        --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar \
          --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br \
          --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br \
          --arg s3://mybucket/cloudburst/output \
          --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
          --arg 24 --arg 128 --arg 16
    • Windows users:

      ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3://mybucket/cloudburst/output --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

Note

The individual --arg values above could also be represented as --args followed by a comma-separated list.
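For example, a sketch of the equivalent form of the preceding Linux command using a single --args list:

    ./elastic-mapreduce --create --name "Test custom JAR" \
      --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar \
      --args "s3://elasticmapreduce/samples/cloudburst/input/s_suis.br,s3://elasticmapreduce/samples/cloudburst/input/100k.br,s3://mybucket/cloudburst/output,36,3,0,1,240,48,24,24,128,16"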

By default, this command launches a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

To create a cluster and submit a Cascading step

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test Cascading" \
      --bootstrap-action s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh \
      --jar s3://elasticmapreduce/samples/cloudfront/logprocessor.jar \
      --args "-input,s3://elasticmapreduce/samples/cloudfront/input,-start,any,-end,2010-12-27-02 300,-output,s3://mybucket/cloudfront/output/2010-12-27-02,-overallVolumeReport,-objectPopularityReport,-clientIPReport,-edgeLocationReport"
    • Windows users:

      ruby elastic-mapreduce --create --name "Test Cascading" --bootstrap-action s3://files.cascading.org/sdk/2.1/install-cascading-sdk.sh --JAR elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3://elasticmapreduce/samples/cloudfront/input,-start,any,-end,2010-12-27-02 300,-output,s3://mybucket/cloudfront/output/2010-12-27-02,-overallVolumeReport,-objectPopularityReport,-clientIPReport,-edgeLocationReport"

    Note

    The bootstrap action pre-installs the Cascading Software Development Kit on Amazon EMR. The Cascading SDK includes Cascading and Cascading-based tools such as Multitool and Load. The bootstrap action extracts the SDK and adds the available tools to the default PATH. For more information, go to http://www.cascading.org/sdk/.

To create a cluster with the Cascading Multitool

  • Create a cluster referencing the Cascading Multitool JAR file and supply the appropriate Multitool arguments as follows.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create \
      --jar s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar \
      --args [args]
    • Windows users:

      ruby elastic-mapreduce --create --jar s3://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar --args [args]

Adding JSON Steps to Job Flows

--json JSON_FILE

Adds a sequence of steps stored in the specified JSON file to the cluster.

--param VARIABLE=VALUE ARGS

Substitutes the string VARIABLE with the string VALUE in the JSON file.
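For example, the following sketch adds the steps defined in a JSON file to a running cluster, substituting a placeholder with a bucket name; the file name steps.json and the variable BUCKET are illustrative:

    ./elastic-mapreduce --jobflow j-3L7WXXXXXHO4H \
    --json steps.json --param "BUCKET=mybucket"

Every occurrence of the string BUCKET in steps.json is replaced with mybucket before the steps are added.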

Adding Streaming Steps to Job Flows

--cache FILE_LOCATION#NAME_OF_FILE_IN_CACHE

Adds an individual file to the distributed cache.

--cache-archive LOCATION#NAME_OF_ARCHIVE

Adds an archive file to the distributed cache.

--ec2-instance-ids-to-terminate INSTANCE_ID

Use with --terminate and --modify-instance-group to specify the instances in the core and task instance groups to terminate. This allows you to shrink the number of core instances by terminating specific instances of your choice rather than those chosen by Amazon EMR.
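For example, a sketch of shrinking the core group while naming the instance to remove; the cluster and instance IDs are illustrative:

    ./elastic-mapreduce --jobflow j-3L7WXXXXXHO4H \
    --modify-instance-group core --instance-count 2 \
    --ec2-instance-ids-to-terminate i-1234a5b6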

--input LOCATION_OF_INPUT_DATA

Specifies the input location for the cluster.

--instance-count INSTANCE_COUNT

Sets the count of nodes for an instance group.

--instance-type INSTANCE_TYPE

Sets the type of EC2 instance used to create nodes in an instance group.

--jobconf KEY=VALUE

Specifies jobconf arguments to pass to a streaming cluster, for example mapred.task.timeout=800000.
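For example, a sketch of raising the task timeout when creating a streaming cluster, following the streaming example shown later in this section:

    ./elastic-mapreduce --create --stream \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3://mybucket/output \
    --jobconf mapred.task.timeout=800000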

--mapper LOCATION_OF_MAPPER_CODE

The name of a Hadoop built-in class or the location of a mapper script.

--output LOCATION_OF_JOB_FLOW_OUTPUT

Specifies the output location for the cluster.

--reducer REDUCER

The name of a Hadoop built-in class or the location of a reducer script.

--stream

Used with --create and --arg to launch a streaming cluster.

Note

The --arg option must immediately follow the --stream option.

To create a cluster and submit a streaming step

  • In the directory where you installed the Amazon EMR CLI, type one of the following commands.

    Note

    The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x when using the Amazon EMR CLI.

    For Hadoop 2.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --stream --ami-version 3.3 \
      --instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
      --input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate \
      --output s3://mybucket/output/2014-01-16
    • Windows users:

      ruby elastic-mapreduce --create --stream --ami-version 3.3 --instance-type m1.large --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --input s3://elasticmapreduce/samples/wordcount/input --mapper wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

    For Hadoop 1.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --stream \
      --input s3://elasticmapreduce/samples/wordcount/input \
      --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
      --reducer aggregate \
      --output s3://mybucket/output/2014-01-16
    • Windows users:

      ruby elastic-mapreduce --create --stream --input s3://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3://mybucket/output/2014-01-16

    By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

To specify Distributed Cache files

Specify the options --cache or --cache-archive at the command line.

  • Create a cluster and add the following parameters. The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

    • To add an individual file to the Distributed Cache, specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.
    • To add an archive file to the Distributed Cache, specify --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

    Your cluster copies the files to the cache location before processing any job flow steps.

Example

The following command shows the creation of a streaming cluster and uses --cache to add one file, sample_dataset.dat, to the cache under the name sample_dataset_cached.dat. The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --arg "--files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" \
       --input s3://my_bucket/my_input \
       --output s3://my_bucket/my_output \
       --mapper my_mapper.py \
       --reducer my_reducer.py \
       --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --input s3://my_bucket/my_input \
       --output s3://my_bucket/my_output \
       --mapper s3://my_bucket/my_mapper.py \
       --reducer s3://my_bucket/my_reducer.py \
       --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper s3://my_bucket/my_mapper.py --reducer s3://my_bucket/my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

Assigning an Elastic IP Address to the Master Node

--eip ELASTIC_IP

Associates an Elastic IP address with the master node. If no Elastic IP address is specified, a new Elastic IP address is allocated and associated with the master node.

You can allocate an Elastic IP address and assign it to either a new or running cluster. After you assign an Elastic IP address to a cluster, it may take one or two minutes before the instance is available from the assigned address.

To assign an Elastic IP address to a new cluster

  • Create a cluster and add the --eip parameter. The CLI allocates an Elastic IP address and waits until the Elastic IP address is successfully assigned to the cluster. This assignment can take up to two minutes to complete.

    Note

    If you want to use a previously allocated Elastic IP address, use the --eip parameter followed by your allocated Elastic IP address. If the allocated Elastic IP address is in use by another cluster, the other cluster loses the Elastic IP address and is assigned a new dynamic IP address.

To assign an Elastic IP address to a running cluster

  1. If you do not currently have a running cluster, create a cluster.

  2. Identify your cluster:

    Your cluster must have a public DNS name before you can assign an Elastic IP address. Typically, a cluster is assigned a public DNS name one or two minutes after launching the cluster.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --list
    • Windows users:

      ruby elastic-mapreduce --list

    The output looks similar to the following.

    j-SLRI9SCLK7UC    STARTING    ec2-75-101-168-82.compute-1.amazonaws.com
    	New Job Flow  PENDING     Streaming Job

    The response includes the cluster ID and the public DNS name. You need the cluster ID to perform the next step.

  3. Allocate and assign an Elastic IP address to the cluster:

    In the directory where you installed the Amazon EMR CLI, type the following command. If you assign an Elastic IP address that is currently associated with another cluster, the other cluster is assigned a new dynamic IP address.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j JobFlowId --eip
    • Windows users:

      ruby elastic-mapreduce -j JobFlowId --eip

    This allocates an Elastic IP address and associates it with the named cluster.

    Note

    If you want to use a previously allocated Elastic IP address, include your Elastic IP address, Elastic_IP, as follows.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j JobFlowId --eip Elastic_IP
    • Windows users:

      ruby elastic-mapreduce -j JobFlowId --eip Elastic_IP

Connecting to the Master Node

--get SOURCE

Copies the specified file from the master node using SCP.

--logs

Displays the step logs for the step most recently executed.

--put SOURCE

Copies a file to the master node using SCP.

--scp FILE_TO_COPY

Copies a file from your local directory to the master node of the cluster.

--socks

Uses SSH to create a tunnel to the master node of the specified cluster. You can then use this as a SOCKS proxy to view web interfaces hosted on the master node.

--ssh COMMAND

Uses SSH to connect to the master node of the specified cluster and, optionally, run a command. This option requires that you have an SSH client, such as OpenSSH, installed on your desktop.

--to DESTINATION

Specifies the destination location when copying files to and from the master node using SCP.
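For example, the following sketch uploads a local file to the master node and then retrieves a file from it; the file names and paths are illustrative:

    ./elastic-mapreduce -j j-3L7WXXXXXHO4H --put myscript.sh --to /home/hadoop
    ./elastic-mapreduce -j j-3L7WXXXXXHO4H --get /home/hadoop/results.txt --to .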

To connect to the master node

To connect to the master node, you must first configure your credentials.json file so that the keypair value is set to the name of the key pair you used to launch the cluster, set the key-pair-file value to the full path to your private key file, set appropriate permissions on the .pem file, and install an SSH client (such as OpenSSH) on your machine. You can then open an SSH connection to the master node by issuing the following command, a handy shortcut for frequent CLI users. Replace j-3L7WXXXXXHO4H with your cluster identifier.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j j-3L7WXXXXXHO4H --ssh
    • Windows users:

      ruby elastic-mapreduce -j j-3L7WXXXXXHO4H --ssh

To create an SSH tunnel to the master node

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j j-3L7WXXXXXHO4H --socks
    • Windows users:

      ruby elastic-mapreduce -j j-3L7WXXXXXHO4H --socks

    Note

    The --socks feature is available only on the CLI version 2012-06-12 and later. To find out what version of the CLI you have, run elastic-mapreduce --version at the command line. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264.

Creating Job Flows

--alive

Used with --create to launch a cluster that continues running even after completing all its steps. Interactive clusters require this option.

--ami-version AMI_VERSION

Used with --create to specify the version of the AMI to use when launching the cluster. This setting also determines the version of Hadoop to install, because the --hadoop-version parameter is no longer supported.

In the Amazon EMR CLI, if you use the keyword latest instead of a version number for the AMI (for example --ami-version latest), the cluster is launched with the AMI listed as the "latest" AMI version—currently AMI version 2.4.2. This configuration is suitable for prototyping and testing, and is not recommended for production environments. This option is not supported by the AWS CLI, SDK, or API.

For Amazon EMR CLI version 2012-07-30 and later, the latest AMI is 2.4.2 with Hadoop 1.0.3. For Amazon EMR CLI versions 2011-12-08 to 2012-07-09, the latest AMI is 2.1.3 with Hadoop 0.20.205. For Amazon EMR CLI version 2011-12-11 and earlier, the latest AMI is 1.0.1 with Hadoop 0.18.

The default AMI is unavailable in the Asia Pacific (Sydney) region. Instead, use --ami-version latest (in the Amazon EMR CLI), fully specify the AMI, or use the major-minor version.

--availability-zone AVAILABILITY_ZONE

The Availability Zone to launch the cluster in. For more information about Availability Zones supported by Amazon EMR, see Regions and Endpoints in the Amazon Web Services General Reference.

--bid-price BID_PRICE

The bid price, in U.S. dollars, for a group of Spot Instances.

--create

Launches a new cluster.

--hadoop-version VERSION

Specifies the version of Hadoop to install.

--info INFO

Specifies additional information during cluster creation.

--instance-group INSTANCE_GROUP_TYPE

Sets the instance group type. Valid types are MASTER, CORE, and TASK.

--jobflow-role IAM_ROLE_NAME

Launches the EC2 instances of a cluster with the specified IAM role.

--service-role IAM_ROLE_NAME

Launches the Amazon EMR service with the specified IAM role.

--key-pair KEY_PAIR_PEM_FILE

The name of the Amazon EC2 key pair to set as the connection credential when you launch the cluster.

--master-instance-type INSTANCE_TYPE

The type of EC2 instances to launch as the master nodes in the cluster.

--name "JOB_FLOW_NAME"

Specifies a name for the cluster. This can only be set when the jobflow is created.

--num-instances NUMBER_OF_INSTANCES

Used with --create and --modify-instance-group to specify the number of EC2 instances in the cluster.

You can increase or decrease the number of task instances in a running cluster, and you can add a single task instance group to a running cluster. You can also increase but not decrease the number of core instances.

--plain-output

Returns the cluster identifier from the create step as simple text.
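Because only the identifier is printed, this option is useful in shell scripts; a sketch:

    JOBFLOW_ID=$(./elastic-mapreduce --create --alive --plain-output)
    echo "Launched cluster $JOBFLOW_ID"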

--region REGION

Specifies the region in which to launch the cluster.

--slave-instance-type

The type of EC2 instances to launch as the slave nodes in the cluster.

--subnet EC2-SUBNET_ID

Launches a cluster in an Amazon VPC subnet.

--visible-to-all-users BOOLEAN

Makes the instances in an existing cluster visible to all IAM users of the AWS account that launched the cluster.

--with-supported-products PRODUCT

Installs third-party software on an Amazon EMR cluster; for example, a third-party distribution of Hadoop. It accepts optional arguments for the third-party software to read and act on. It is used with --create to launch the cluster that can use the specified third-party applications. Versions 2013-03-19 and newer of the Amazon EMR CLI accept optional arguments using the --args parameter.

--with-termination-protection

Used with --create to launch the cluster with termination protection enabled.

To launch a cluster into a VPC

After your VPC is configured, you can launch Amazon EMR clusters in it by using the --subnet argument with the subnet address.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --subnet subnet-77XXXX03
    • Windows users:

      ruby elastic-mapreduce --create --alive --subnet subnet-77XXXX03

To create a long-running cluster

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Interactive Cluster" \
      --num-instances=1 --master-instance-type=m1.large --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Interactive Cluster" --num-instances=1 --master-instance-type=m1.large --hive-interactive

To specify the AMI version when creating a cluster

When creating a cluster using the CLI, add the --ami-version parameter. If you do not specify this parameter, or if you specify --ami-version latest, the most recent AMI version is used.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Static AMI Version" \
      --ami-version 2.4.8 \
      --num-instances 5 --instance-type m1.large  
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Static AMI Version" --ami-version 2.4.8 --num-instances 5 --instance-type m1.large

    The following example specifies the AMI using just the major and minor version. It will launch the cluster on the AMI that matches those specifications and has the latest patches. For example, if the most recent AMI version is 2.4.8, specifying --ami-version 2.4 would launch a cluster using AMI 2.4.8.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Major-Minor AMI Version" \
      --ami-version 2.4 \
      --num-instances 5 --instance-type m1.large  
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Major-Minor AMI Version" --ami-version 2.4 --num-instances 5 --instance-type m1.large

    The following example specifies that the cluster should be launched with the latest AMI.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Latest AMI Version" \
      --ami-version latest \
      --num-instances 5 --instance-type m1.large 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Latest AMI Version" --ami-version latest --num-instances 5 --instance-type m1.large

To view the current AMI version of a cluster

Use the --describe parameter to retrieve the AMI version on a cluster. The AMI version will be returned along with other information about the cluster.

  • In the directory where you installed the Amazon EMR CLI, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --describe --jobflow JobFlowID
    • Windows users:

      ruby elastic-mapreduce --describe --jobflow JobFlowID

To configure cluster visibility

By default, clusters created using the Amazon EMR CLI are not visible to all users. If you are adding IAM user visibility to a new cluster using the Amazon EMR CLI, add the --visible-to-all-users flag to the cluster call as shown in the following example.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
      --instance-type m1.xlarge --num-instances 2 \
      --visible-to-all-users
    • Windows users:

      ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 2 --visible-to-all-users 

    If you are adding IAM user visibility to an existing cluster, you can use the --set-visible-to-all-users option, and specify the identifier of the cluster to modify. The visibility of a running cluster can be changed only by the IAM user that created the cluster or the AWS account that owns the cluster.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --set-visible-to-all-users true --jobflow JobFlowId
    • Windows users:

      ruby elastic-mapreduce --set-visible-to-all-users true --jobflow JobFlowId

To create and use IAM roles

We recommend that you begin by creating the default roles, then modify those roles as needed. If the default roles already exist, no output is returned. For more information about default roles, see Default IAM Roles for Amazon EMR.

  1. In the directory where you installed the Amazon EMR CLI, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create-default-roles 
    • Windows users:

      ruby elastic-mapreduce --create-default-roles 
  2. To specify the default roles, type the following command. This command can also be used to specify custom roles.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test cluster" \
      --ami-version 2.4 \
      --num-instances 5 --instance-type m1.large \
      --service-role EMR_DefaultRole --jobflow-role EMR_EC2_DefaultRole
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test cluster" --ami-version 2.4 --num-instances 5 --instance-type m1.large --service-role EMR_DefaultRole --jobflow-role EMR_EC2_DefaultRole

To launch a cluster with IAM roles

Add the --service-role and --jobflow-role parameters to the command that creates the cluster, and specify the names of the IAM roles to apply to Amazon EMR and the EC2 instances in the cluster.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --num-instances 3 \
      --instance-type m1.large \
      --name "myJobFlowName" \
      --hive-interactive --hive-versions 0.8.1.6 \
      --ami-version 2.3.0 \
      --jobflow-role EMR_EC2_DefaultRole \
      --service-role EMR_DefaultRole
    • Windows users:

      ruby elastic-mapreduce --create --alive --num-instances 3 --instance-type m1.large --name "myJobFlowName" --hive-interactive --hive-versions 0.8.1.6 --ami-version 2.3.0 --jobflow-role EMR_EC2_DefaultRole --service-role EMR_DefaultRole

To set a default IAM role

If you launch most or all of your clusters with a specific IAM role, you can set that IAM role as the default for the Amazon EMR CLI, so you don't need to specify it at the command line. You can override the IAM role specified in credentials.json at any time by specifying a different IAM role at the command line as shown in the preceding procedure.

  • Add jobflow-role and service-role fields to the credentials.json file that you created when you installed the CLI. For more information about credentials.json, see Configuring Credentials.

    The following example shows the contents of a credentials.json file that causes the CLI to always launch clusters with the user-defined IAM roles, MyCustomEC2Role and MyCustomEMRRole.

    {
      "access-id": "AccessKeyID",
      "private-key": "PrivateKey",
      "key-pair": "KeyName",
      "jobflow-role": "MyCustomEC2Role",
      "service-role": "MyCustomEMRRole",
      "key-pair-file": "location of key pair file",
      "region": "Region",
      "log-uri": "location of bucket on Amazon S3"
    }

To specify a region

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --region eu-west-1
    • Windows users:

      ruby elastic-mapreduce --create --region eu-west-1

Tip

To reduce the number of parameters required each time you issue a command from the CLI, you can store information such as the region in your credentials.json file. For more information about creating a credentials.json file, see Configuring Credentials.

To launch a cluster with MapR

  • In the directory where you installed the Amazon EMR CLI, specify the MapR edition and version by passing arguments with the --args option.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
      --instance-type m1.large --num-instances 3 \
      --supported-product mapr --name m5 --args "--edition,m5,--version,3.1.1"
    • Windows users:

      ruby elastic-mapreduce --create --alive --instance-type m1.large --num-instances 3 --supported-product mapr --name m5 --args "--edition,m5,--version,3.1.1"

To reset a cluster in an ARRESTED state

Use the --modify-instance-group command to reset a cluster in the ARRESTED state. Enter the --modify-instance-group command as follows:

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --modify-instance-group InstanceGroupID \
      --instance-count COUNT
    • Windows users:

      ruby elastic-mapreduce --modify-instance-group InstanceGroupID --instance-count COUNT

    InstanceGroupID is the ID of the arrested instance group, and COUNT is the number of nodes you want in the instance group.

Tip

You do not need to change the number of nodes from the original configuration to free a running cluster. Set --instance-count to the same count as the original setting.

Using HBase Options

--backup-dir BACKUP_LOCATION

The directory where an HBase backup exists or should be created.

--backup-version VERSION_NUMBER

Specifies the version number of an existing HBase backup to restore.

--consistent

Pauses all write operations to the HBase cluster during the backup process, to ensure a consistent backup.

--full-backup-time-interval INTERVAL

An integer that specifies the period of time units to elapse between automated full backups of the HBase cluster.

--full-backup-time-unit TIME_UNIT

The unit of time to use with --full-backup-time-interval to specify how often automatically scheduled HBase backups should run. This can take any one of the following values: minutes, hours, days.

--hbase

Used to launch an HBase cluster.

--hbase-backup

Creates a one-time backup of HBase data to the location specified by --backup-dir.

--hbase-restore

Restores a backup from the location specified by --backup-dir and (optionally) the version specified by --backup-version.

--hbase-schedule-backup

Schedules an automated backup of HBase data.

--incremental-backup-time-interval TIME_INTERVAL

An integer that specifies the period of time units to elapse between automated incremental backups of the HBase cluster. Used with --hbase-schedule-backup, this parameter creates regularly scheduled incremental backups. If a full backup is scheduled at the same time as an incremental backup, only the full backup is created. Used with --incremental-backup-time-unit.

--incremental-backup-time-unit TIME_UNIT

The unit of time to use with --incremental-backup-time-interval to specify how often automatically scheduled incremental HBase backups should run. This can take any one of the following values: minutes, hours, days.

--disable-full-backups

Turns off scheduled full HBase backups by passing this flag into a call with --hbase-schedule-backup.

--disable-incremental-backups

Turns off scheduled incremental HBase backups by passing this flag into a call with --hbase-schedule-backup.

--start-time START_TIME

Specifies the time that an HBase backup schedule should start. If this is not set, the first backup begins immediately. This should be in ISO date-time format. You can use this to ensure your first data load process has completed before performing the initial backup, or to have the backup occur at a specific time each day.

To launch a cluster and install HBase

Specify the --hbase parameter when you launch a cluster using the CLI.

The following example shows how to launch a cluster running HBase from the CLI. We recommend that you run at least two instances in the HBase cluster.

The CLI implicitly launches the HBase cluster with keep alive and termination protection set.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "HBase Cluster" \
      --num-instances 3 \
      --instance-type c1.xlarge
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "HBase Cluster" --num-instances 3 --instance-type c1.xlarge

To configure HBase daemons

Add a bootstrap action, configure-hbase-daemons, when you launch the HBase cluster. You can use this bootstrap action to configure one or more daemons and set values for zookeeper-opts and hbase-master-opts, which configure the options used by the zookeeper and master node components of the HBase cluster.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons \
      --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"

Note

When you specify the arguments for this bootstrap action, you must put quotes around the --args parameter value to keep the shell from breaking the arguments up. You must also include a space character between JVM arguments; in the example above, there is a space between -Xmx1024m and -XX:GCTimeRatio=19.

To specify individual HBase site settings

Set the configure-hbase bootstrap action when you launch the HBase cluster, and specify the values within hbase-site.xml to change. The following example illustrates how to change the hbase.hregion.max.filesize setting.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
      --args -s,hbase.hregion.max.filesize=52428800
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args -s,hbase.hregion.max.filesize=52428800

To specify HBase site settings with an XML file

  1. Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.

  2. Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it will have access.

  3. Set the configure-hbase bootstrap action when you launch the HBase cluster, and pass in the location of your custom hbase-site.xml file.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
      --args --site-config-file s3://bucket/config.xml
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args --site-config-file s3://bucket/config.xml

To configure an HBase cluster for Ganglia

Launch the cluster and specify both the install-ganglia and configure-hbase-for-ganglia bootstrap actions.

Note

You can prefix the Amazon S3 bucket path with the region where your HBase cluster was launched, for example s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia. For a list of regions supported by Amazon EMR see Choose an AWS Region.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
          --bootstrap-action s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-action s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

To manually back up HBase data

Run --hbase-backup in the CLI and specify the cluster and the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA

    This example backs up data and uses the --consistent flag to enforce backup consistency. This flag causes all writes to the HBase cluster to pause during the backup.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --consistent
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA --consistent

To schedule automated backups of HBase data

Call --hbase-schedule-backup on the HBase cluster and specify the backup time interval and units. If you do not specify a start time, the first backup starts immediately. The following example creates a weekly full backup, with the first backup starting immediately.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --full-backup-time-interval 7 --full-backup-time-unit days \
    --backup-dir s3://mybucket/backups/j-ABABABABABA
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a weekly full backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --full-backup-time-interval 7 --full-backup-time-unit days \
    --backup-dir s3://mybucket/backups/j-ABABABABABA \
    --start-time 2012-06-15T20:00Z
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

The following example creates a daily incremental backup. The first incremental backup will begin immediately.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --incremental-backup-time-interval 24 \
    --incremental-backup-time-unit hours \
    --backup-dir s3://mybucket/backups/j-ABABABABABA
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a daily incremental backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --incremental-backup-time-interval 24 \
    --incremental-backup-time-unit hours \
    --backup-dir s3://mybucket/backups/j-ABABABABABA \
    --start-time 2012-06-15T20:00Z
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting immediately. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --full-backup-time-interval 7 \
    --full-backup-time-unit days \
    --incremental-backup-time-interval 24 \
    --incremental-backup-time-unit hours \
    --backup-dir s3://mybucket/backups/j-ABABABABABA
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --full-backup-time-interval 7 \
    --full-backup-time-unit days \
    --incremental-backup-time-interval 24 \
    --incremental-backup-time-unit hours \
    --backup-dir s3://mybucket/backups/j-ABABABABABA \
    --start-time 2012-06-15T20:00Z
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

Use the following command to create both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run. The --consistent flag is set, so both the incremental and full backups will pause write operations during the initial portion of the backup process to ensure data consistency.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-ABABABABABA \
    --hbase-schedule-backup \
    --full-backup-time-interval 7 \
    --full-backup-time-unit days \
    --incremental-backup-time-interval 24 \
    --incremental-backup-time-unit hours \
    --backup-dir s3://mybucket/backups/j-ABABABABABA \
    --start-time 2012-06-15T20:00Z \
    --consistent
  • Windows users:

    ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z --consistent

To turn off automated HBase backups

Call the cluster with the --hbase-schedule-backup parameter and set the --disable-full-backups or --disable-incremental-backups flag, or both flags.

  1. In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-full-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups
  2. Use the following command to turn off incremental backups.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-incremental-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-incremental-backups
  3. Use the following command to turn off both full and incremental backups.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-full-backups \
      --disable-incremental-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups --disable-incremental-backups

To restore HBase backup data to a running cluster

Run an --hbase-restore step and specify the jobflow, the backup location in Amazon S3, and (optionally) the name of the backup version. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This is the version with the name that is lexicographically greatest.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA

    This example restores the HBase cluster to the specified version of backup data stored in s3://myawsbucket/backups, overwriting any data stored in the HBase cluster.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --backup-version  20120809T031314Z
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version  20120809T031314Z

To populate a new cluster with HBase backup data

When you add --hbase-restore and --backup-dir to the --create step in the CLI, you can optionally specify --backup-version to indicate which version in the backup directory to load. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This will either be the version with the name that is lexicographically last or, if the version names are based on timestamps, the latest version.

  • In the directory where you installed the Amazon EMR CLI, type the following command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "My HBase Restored" \
      --hbase --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA 
    • Windows users:

      ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA

    This example creates a new HBase cluster and loads it with the specified version of data in s3://myawsbucket/backups/j-ABABABABABA.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "My HBase Restored" \
      --hbase --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --backup-version  20120809T031314Z
    • Windows users:

      ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version  20120809T031314Z

Using Hive Options

--hive-interactive

Used with --create to launch a cluster with Hive installed.

--hive-script HIVE_SCRIPT_LOCATION

The Hive script to run in the cluster.

--hive-site HIVE_SITE_LOCATION

Installs the configuration values in the hive-site.xml file from the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.
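For example, a sketch of launching an interactive Hive cluster with overrides from a custom file; the Amazon S3 path is illustrative:

    ./elastic-mapreduce --create --alive --name "Hive site overrides" \
    --hive-interactive \
    --hive-site=s3://mybucket/conf/hive-site.xml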

--hive-versions HIVE_VERSIONS

The Hive version or versions to load. This can be a Hive version number or "latest" to load the latest version. When you specify more than one Hive version, separate the versions with a comma.

To pass variable values into Hive steps

To pass a Hive variable value into a step using the Amazon EMR CLI, use the --args parameter with the -d flag.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
      --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib
    • Windows users:

      ruby elastic-mapreduce --hive-script --arg s3://mybucket/script.q --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

To specify the latest Hive version when creating a cluster

Use the --hive-versions option with the latest keyword.

  • In the directory where you installed the Amazon EMR CLI, type the following command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions latest 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions latest 

To specify the Hive version for a cluster that is interactive and uses a Hive script

If you have a cluster that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following example illustrates setting both the interactive and the script version of Hive to 0.7.1.2.

  • In the directory where you installed the Amazon EMR CLI, type the following command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ \
      --name "Testing m1.large AMI 1" \
      --ami-version latest \
      --instance-type m1.large --num-instances 5 \
      --hive-interactive  --hive-versions 0.7.1.2 \
      --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2
    • Windows users:

      ruby elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ --name "Testing m1.large AMI" --ami-version latest --instance-type m1.large --num-instances 5 --hive-interactive  --hive-versions 0.7.1.2 --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2 

To load multiple versions of Hive for a cluster

With this configuration, you can use any of the installed versions of Hive on the cluster.

  • In the directory where you installed the Amazon EMR CLI, type the following command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.5,0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.5,0.7.1

To call a specific version of Hive

  • Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
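
    For example, in an interactive cluster you can use SSH to connect to the master node and type the versioned command directly. The following is a minimal sketch; the script name my-script.q is a hypothetical placeholder, and it assumes the versioned command accepts the same arguments as hive.

      hive-0.7.1
      hive-0.7.1 -f my-script.q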

Note

If you load multiple versions of Hive on a cluster by specifying multiple --hive-versions options in the cluster creation call, calling hive accesses the version loaded last. If you instead load multiple versions with the comma-separated syntax in a single --hive-versions option, calling hive accesses the default version of Hive.

Note

When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

To display the Hive version

This is a useful command to call after you have upgraded to a new version of Hive to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --print-hive-version
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --print-hive-version

To launch a Hive cluster in interactive mode

  • In the directory where you installed the Amazon EMR CLI, type the following command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Hive cluster" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Hive cluster" --num-instances 5 --instance-type m1.large --hive-interactive

To launch a cluster and submit a Hive step

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test Hive" --ami-version 3.3 --hive-script \
      s3://elasticmapreduce/samples/hive-ads/libs/model-build.q \
      --args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,\
      -d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,\
      -d,OUTPUT=s3://mybucket/hive-ads/output/
    • Windows users:

      ruby elastic-mapreduce --create --name "Test Hive" --ami-version 3.3 --hive-script s3://elasticmapreduce/samples/hive-ads/libs/model-build.q --args -d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/

    By default, this command launches a two-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

To create an external Hive metastore using the Amazon EMR CLI

  • To specify the location of the configuration file using the Amazon EMR CLI, in the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
        --name "Hive cluster"    \
        --hive-interactive \
        --hive-site=s3://mybucket/hive-site.xml
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Hive cluster" --hive-interactive --hive-site=s3://mybucket/hive-site.xml

    The --hive-site parameter installs the configuration values in hive-site.xml in the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.
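
    The hive-site.xml file referenced above typically points Hive at an external metastore database. The following is a minimal sketch of such a file; the MySQL endpoint, database name, and credentials are hypothetical placeholders, while the property names are standard Hive configuration keys.

      <configuration>
        <!-- JDBC connection string for the external metastore database (hypothetical endpoint) -->
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://mydb.example.com:3306/hive?createDatabaseIfNotExist=true</value>
        </property>
        <!-- JDBC driver class used to reach the database -->
        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
        </property>
        <!-- Database credentials (hypothetical) -->
        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>hiveuser</value>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>mypassword</value>
        </property>
      </configuration>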

To interactively submit Hive jobs

In the directory where you installed the Amazon EMR CLI, type the following commands.

  1. If Hive is not already installed, type the following command to install it.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --hive-interactive
  2. Create a Hive script file containing the queries or commands to run. The following example script named my-hive.q creates two tables, aTable and anotherTable, and copies the contents of aTable to anotherTable, replacing all data.

    ---- sample Hive script file: my-hive.q ----
    create table aTable (aColumn string) ;
    create table anotherTable like aTable;
    insert overwrite table anotherTable select * from aTable;
  3. Type the following command, using the --scp parameter to copy the script from your local machine to the master node and the --ssh parameter to create an SSH connection and submit the Hive script for processing.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --scp my-hive.q \
      --ssh "hive -f my-hive.q"
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --scp my-hive.q --ssh "hive -f my-hive.q"

Using Impala Options

--impala-conf OPTIONS

Use with the --create and --impala-interactive options to provide command-line parameters for Impala to parse.

The parameters are key/value pairs in the format "key1=value1,key2=value2,…". For example, to set the Impala start-up options IMPALA_BACKEND_PORT and IMPALA_MEM_LIMIT, use the following command:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --impala-conf "IMPALA_BACKEND_PORT=22001,IMPALA_MEM_LIMIT=70%"

--impala-interactive

Use with the --create option to launch an Amazon EMR cluster with Impala installed.

--impala-output PATH

Use with the --impala-script option to store Impala script output to an Amazon S3 bucket using the syntax --impala-output s3-path.

--impala-script [SCRIPT]

Use with the --create option to add a step to a cluster to run an Impala query file stored in Amazon S3, using the syntax --impala-script s3-path. For example:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-script s3://my-bucket/script-name.sql --impala-output s3://my-bucket/ --impala-conf "IMPALA_MEM_LIMIT=50%"

When using --impala-script with --create, the --impala-version and --impala-conf options will also function. It is acceptable, but unnecessary, to use --impala-interactive and --impala-script in the same command when creating a cluster. The effect is equivalent to using --impala-script alone.

Alternatively, you can add a step to an existing cluster, but you must already have installed Impala on the cluster. For example:

./elastic-mapreduce -j cluster-id --impala-script s3://my-bucket/script-name.sql --impala-output s3://my-bucket/

If you try to use --impala-script to add a step to a cluster where Impala is not installed, you will get an error message similar to Error: Impala is not installed.

--impala-version IMPALA_VERSION

The version of Impala to be installed.
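
For example, the following command is a minimal sketch of launching an interactive Impala cluster with a specific version installed; the version number 1.2.1 is an assumption for illustration:

./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.0.2 --impala-interactive --impala-version 1.2.1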

To add Impala to a cluster

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.3 --impala-interactive --key-pair keypair-name
    • Windows users:

       ruby elastic-mapreduce --create --alive --instance-type m1.large --instance-count 3 --ami-version 3.3 --impala-interactive --key-pair keypair-name

Listing and Describing Job Flows

--active

Modifies a command to apply only to clusters in the RUNNING, STARTING, or WAITING states. Used with --list.

--all

Modifies a command to apply to all clusters, regardless of status. Used with --list, it lists all the clusters created in the last two weeks.
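
For example, the following command lists all of the clusters created in the last two weeks, regardless of their state:

./elastic-mapreduce --list --all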

--created-after=DATETIME

Lists all clusters created after the specified time and date in XML date-time format.

--created-before=DATETIME

Lists all clusters created before the specified time and date in XML date-time format.
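
For example, the following sketch lists clusters created after a given point in time; the timestamp shown assumes the XML date-time (ISO 8601) format named above:

./elastic-mapreduce --list --created-after=2014-01-16T00:00:00Z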

--describe

Returns information about the specified cluster or clusters.

--list

Lists clusters created in the last two days.

--no-steps

Prevents the CLI from listing steps when listing clusters.

--print-hive-version

Prints the version of Hive that is currently active on the cluster.

--state JOB_FLOW_STATE

Specifies the state of the cluster. The cluster state will be one of the following values: STARTING, RUNNING, WAITING, TERMINATED.

To retrieve the public DNS name of the master node

You can retrieve the public DNS name of the master node using the Amazon EMR CLI; it appears in the output of the --list command.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --list
    • Windows users:

      ruby elastic-mapreduce --list

To list clusters created in the last two days

  • Use the --list parameter with no additional arguments to display clusters created during the last two days as follows:

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --list
    • Windows users:

      ruby elastic-mapreduce --list

The response is similar to the following:

j-1YE2DN7RXJBWU   FAILED      Example Job Flow
                  CANCELLED   Custom Jar
j-3GJ4FRRNKGY97   COMPLETED   ec2-67-202-3-73.compute-1.amazonaws.com   Example cluster
j-5XXFIQS8PFNW    COMPLETED   ec2-67-202-51-30.compute-1.amazonaws.com  demo 3/24 s1
                  COMPLETED   Custom Jar 

The example response shows that three clusters were created in the last two days. The indented lines are the steps of a cluster. The information for a cluster appears in the following order: the cluster ID, the cluster state, the DNS name of the master node, and the cluster name. The information for a cluster step appears in the following order: step state and step name.

If no clusters were created in the previous two days, this command produces no output.

To list active clusters

  • Use the --list and --active parameters as follows:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce  --list --active
    • Windows users:

      ruby elastic-mapreduce  --list --active

The response lists clusters that are in the STARTING, RUNNING, or WAITING state.

To list only running or terminated clusters

  • Use the --state parameter as follows:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --list --state RUNNING  --state TERMINATED
    • Windows users:

      ruby elastic-mapreduce --list --state RUNNING  --state TERMINATED

The response lists clusters that are running or terminated.

To view information about a cluster

You can view information about a cluster using the --describe parameter with the cluster ID.

  • Use the --describe parameter with a valid cluster ID.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --describe --jobflow JobFlowID
    • Windows users:

      ruby elastic-mapreduce --describe --jobflow JobFlowID

To interactively submit Hadoop jobs

  • To interactively submit Hadoop jobs using the Amazon EMR CLI, use the --ssh parameter to create an SSH connection to the master node and set the value to the command you want to run.

    In the directory where you installed the Amazon EMR CLI, type the following command. This command uses the --scp parameter to copy the JAR file myjar.jar from your local machine to the master node of cluster JobFlowID and runs the command using an SSH connection.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --scp myjar.jar --ssh "hadoop jar myjar.jar"
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --scp myjar.jar --ssh "hadoop jar myjar.jar"

Passing Arguments to Steps

--arg ARG

Passes in a single argument value to a script or application running on the cluster.

Note

When used in a Hadoop streaming cluster, the --arg options must immediately follow the --stream option.

--args ARG1,ARG2,ARG3,...

Passes in multiple arguments, separated by commas, to a script or application running on the cluster. This is shorthand for specifying multiple --arg options. The --args option does not support escaping the comma character (,); to pass arguments that contain commas, use the --arg option, which does not treat commas as separators. The argument string may be surrounded with double quotes. In addition, you can use double quotes when passing arguments that contain whitespace characters.
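
For example, the following two sketches pass the same three arguments to a step; the JAR location and argument values are hypothetical placeholders:

./elastic-mapreduce -j JobFlowID --jar s3://mybucket/myjar.jar --args "arg1,arg2,arg3"

./elastic-mapreduce -j JobFlowID --jar s3://mybucket/myjar.jar --arg arg1 --arg arg2 --arg arg3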

Note

When used in a Hadoop streaming cluster, the --args option must immediately follow the --stream option.

--step-action

Specifies the action the cluster should take when the step finishes. This can be one of CANCEL_AND_WAIT, TERMINATE_JOB_FLOW, or CONTINUE.

--step-name

Specifies a name for a cluster step.
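
For example, the following sketch adds a JAR step with an explicit name and step action; it assumes that --step-name and --step-action apply to the step defined in the same command, and the JAR location is a hypothetical placeholder:

./elastic-mapreduce -j JobFlowID --jar s3://mybucket/myjar.jar --step-name "My custom step" --step-action CANCEL_AND_WAIT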

This section describes the methods for adding steps to a cluster using the Amazon EMR CLI. You can add steps to a running cluster only if you use the --alive parameter when you create the cluster. This parameter creates a long-running cluster by keeping the cluster active even after the completion of your steps.

To add a custom JAR step to a running cluster

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j JobFlowID \
          --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
          --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
          --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
          --arg hdfs:///cloudburst/output/1 \
          --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16
    • Windows users:

      ruby elastic-mapreduce -j JobFlowID --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg hdfs:///cloudburst/output/1 --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

      This command adds a step that downloads and runs a JAR file. The arguments are passed to the main function in the JAR file. If your JAR file does not have a manifest, specify the JAR file's main class using the --main-class option.

To add a step to run a script

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "My Development Jobflow" \
      --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
      --args "s3://mybucket/script-path/my_script.sh"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "My Development Jobflow" --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "s3://mybucket/script-path/my_script.sh"

    This cluster runs the script my_script.sh on the master node when the step is processed.

Using Pig Options

--pig-interactive

Used with --create to launch a cluster with Pig installed.

--pig-script PIG_SCRIPT_LOCATION

The Pig script to run in the cluster.

--pig-versions VERSION

Specifies the version or versions of Pig to install on the cluster. If specifying more than one version of Pig, separate the versions with commas.

To add a specific Pig version to a cluster

  • Use the --pig-versions parameter. The following command-line example creates an interactive Pig cluster running Hadoop 1.0.3 and Pig 0.11.1.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Pig" \
      --ami-version 2.3.6 \
      --num-instances 5 --instance-type m1.large \
      --pig-interactive \
      --pig-versions 0.11.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Pig" --ami-version 2.3.6 --num-instances 5 --instance-type m1.large --pig-interactive --pig-versions 0.11.1

To add the latest version of Pig to a cluster

  • Use the --pig-versions parameter with the latest keyword. The following command-line example creates an interactive Pig cluster running the latest version of Pig.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Latest Pig" \
      --ami-version 2.2 \
      --num-instances 5 --instance-type m1.large \
      --pig-interactive \
      --pig-versions latest
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Latest Pig" --ami-version 2.2 --num-instances 5 --instance-type m1.large --pig-interactive --pig-versions latest

To add multiple versions of Pig to a cluster

  • Use the --pig-versions parameter and separate the version numbers with commas. The following command-line example creates an interactive Pig cluster running Hadoop 0.20.205 with both Pig 0.9.1 and Pig 0.9.2. With this configuration, you can use either version of Pig on the cluster.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Pig" \
      --ami-version 2.0 \
      --num-instances 5 --instance-type m1.large \
      --pig-interactive \
      --pig-versions 0.9.1,0.9.2
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Pig" --ami-version 2.0 --num-instances 5 --instance-type m1.large --pig-interactive --pig-versions 0.9.1,0.9.2

If you load multiple versions of Pig on a cluster by specifying multiple --pig-versions parameters in the cluster creation call, calling pig accesses the version loaded last. If you instead load multiple versions with the comma-separated syntax in a single --pig-versions parameter, calling pig accesses the default version.

To run a specific version of Pig on a cluster

  • Add the version number to the call, for example pig-0.11.1 or pig-0.9.2. In an interactive Pig cluster, you would do this by using SSH to connect to the master node and then running a command like the following from the terminal.

    pig-0.9.2

To run Pig in interactive mode

To run Pig in interactive mode, use the --alive parameter to create a long-running cluster and the --pig-interactive parameter to install Pig.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Testing Pig" \
      --num-instances 5 --instance-type m1.large \
      --pig-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Testing Pig" --num-instances 5 --instance-type m1.large --pig-interactive

To add Pig to a cluster and submit a Pig step

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "Test Pig" \
      --pig-script s3://elasticmapreduce/samples/pig-apache/do-reports2.pig \
      --ami-version 2.0 \
      --args "-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input, \
      -p,OUTPUT=s3://mybucket/pig-apache/output"
    • Windows users:

      ruby elastic-mapreduce --create --name "Test Pig" --pig-script s3://elasticmapreduce/samples/pig-apache/do-reports2.pig --ami-version 2.0 --args "-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input, -p,OUTPUT=s3://mybucket/pig-apache/output"

By default, this command launches a single-node cluster. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

Specifying Step Actions

--enable-debugging

Used with --create to launch a cluster with debugging enabled.

--script SCRIPT_LOCATION

Specifies the location of a script. Typically, the script is stored in an Amazon S3 bucket.
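
As a minimal sketch, assuming that --script adds a step that runs the referenced script in the same way as the other step options, the following command uses a hypothetical script location:

./elastic-mapreduce -j JobFlowID --script s3://mybucket/script-path/my_script.sh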

--wait-for-steps

Causes the cluster to wait until a step has completed.

When you submit steps to a cluster using the Amazon EMR CLI, you can specify that the CLI should wait until the cluster has completed all pending steps before accepting additional commands. This can be useful, for example, if you are using a step to copy data from Amazon S3 into HDFS and need to be sure that the copy operation is complete before you run the next step in the cluster. You do this by specifying the --wait-for-steps parameter after you submit the copy step.

Note

The AWS CLI does not have an option comparable to the --wait-for-steps parameter.

The --wait-for-steps parameter does not ensure that the step completes successfully, just that it has finished running. If, as in the earlier example, you need to ensure the step was successful before submitting the next step, check the cluster status. If the step failed, the cluster is in the FAILED status.

Although you can add the --wait-for-steps parameter in the same CLI command that adds a step to the cluster, it is best to add it in a separate CLI command. This ensures that the --wait-for-steps argument is parsed and applied after the step is created.
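
For example, after the wait returns, you can check the cluster state with the --describe parameter described earlier, where JobFlowID is your cluster identifier:

./elastic-mapreduce --describe --jobflow JobFlowID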

To wait until a step completes

  • Add the --wait-for-steps parameter to the cluster. This is illustrated in the following example, where JobFlowID is the cluster identifier that Amazon EMR returned when you created the cluster. The JAR, main class, and arguments specified in the first CLI command are from the Word Count sample application; this command adds a step to the cluster. The second CLI command causes the cluster to wait until all of the currently pending steps have completed before accepting additional commands.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce -j JobFlowID \
          --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
          --main-class org.myorg.WordCount \
          --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
          --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
          --arg hdfs:///cloudburst/output/1 \
          --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

      ./elastic-mapreduce -j JobFlowID --wait-for-steps
    • Windows users:

      ruby elastic-mapreduce -j JobFlowID --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --main-class org.myorg.WordCount --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg hdfs:///cloudburst/output/1 --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

      ruby elastic-mapreduce -j JobFlowID --wait-for-steps

To enable the debugging tool

  • Use the --enable-debugging argument when you create the cluster. You must also set the --log-uri argument and specify a location in Amazon S3, because archiving the log files to Amazon S3 is a prerequisite of the debugging tool. Alternatively, you can set the --log-uri value in the credentials.json file that you configured for the CLI. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface (Deprecated). The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --enable-debugging \
           --log-uri s3://mybucket
    • Windows users:

      ruby elastic-mapreduce --create --enable-debugging --log-uri s3://mybucket

Specifying Bootstrap Actions

--bootstrap-action LOCATION_OF_bootstrap_ACTION_SCRIPT

Used with --create to specify a bootstrap action to run when the cluster launches. The location of the bootstrap action script is typically a location in Amazon S3. You can add more than one bootstrap action to a cluster.

--bootstrap-name bootstrap_NAME

Sets the name of the bootstrap action.

--args "arg1,arg2"

Specifies arguments for the bootstrap action.

To add Ganglia to a cluster using a bootstrap action

  • When you create a new cluster using the Amazon EMR CLI, specify the Ganglia bootstrap action by adding the following parameter to your cluster call:

    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
    

The following command illustrates the use of the bootstrap-action parameter when starting a new cluster. In this example, you start the Word Count sample cluster provided by Amazon EMR and launch three instances.

In the directory where you installed the Amazon EMR CLI, type the following command.

Note

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge \
    --num-instances 3 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --input s3://elasticmapreduce/samples/wordcount/input \
    --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate
  • Windows users:

    ruby elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge --num-instances 3 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 3 \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --stream \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --output s3://mybucket/output/2014-01-16 \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate
  • Windows users:

    ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --stream --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate

To set the NameNode heap size using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
        --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

To change the maximum number of map tasks using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-M,s3://mybucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
    • Windows:

      ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://mybucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

To run a command conditionally using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command. Notice that the optional arguments for the --args parameter are separated with commas.

    • Linux, Unix, and Mac OS X:

      ./elastic-mapreduce --create --alive \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
      --args "instance.isMaster=true,echo running on master node"
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"

To create a cluster with a custom bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

To read settings in instance.json with a bootstrap action

This procedure uses a run-if bootstrap action to demonstrate running the echo command so that it displays the string Running on master node only when the instance.isMaster parameter in the instance.json file evaluates to true.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "RunIf" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
      --bootstrap-name "Run only on master" \
      --args "instance.isMaster=true,echo,’Running on master node’"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "RunIf" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --bootstrap-name "Run only on master" --args "instance.isMaster=true,echo,’Running on master node’"

To modify JVM settings using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "JVM infinite reuse" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Configuring infinite JVM reuse" \
      --args "-m,mapred.job.reuse.jvm.num.tasks=-1"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "JVM infinite reuse" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Configuring infinite JVM reuse" --args "-m,mapred.job.reuse.jvm.num.tasks=-1"

Note

Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse tasks.

To disable reducer speculative execution using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Reducer speculative execution" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Disable reducer speculative execution" \
      --args "-m,mapred.reduce.tasks.speculative.execution=false"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Reducer speculative execution" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Disable reducer speculative execution" --args "-m,mapred.reduce.tasks.speculative.execution=false"

To disable intermediate compression or change the compression codec using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, type the following command. Use mapred.compress.map.output=false to disable intermediate compression. Use mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec to change the compression codec to Gzip. Both arguments are presented below.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Disable compression" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Disable compression" \
      --args "-m,mapred.compress.map.output=false" \
      --args "-m,mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Disable compression" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Disable compression" --args "-m,mapred.compress.map.output=false" --args "-m,mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

To increase the mapred.max.tracker.failures parameter using a bootstrap action

The following example shows how to launch a cluster and use a bootstrap action to set the value of mapred.max.tracker.failures to 7, instead of the default 4. This allows you to troubleshoot issues where TaskTracker nodes are being blacklisted.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Modified  mapred.max.tracker.failures" \
      --num-instances 2 --slave-instance-type  m1.large  --master-instance-type m1.large \ 
      --key-pair mykeypair --debug  --log-uri  s3://mybucket/logs \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Modified  mapred.max.tracker.failures" \
      --args "-m,mapred.max.tracker.failures=7"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Modified  mapred.max.tracker.failures" --num-instances 2 --slave-instance-type  m1.large  --master-instance-type m1.large --key-pair mykeypair --debug  --log-uri  s3://mybucket/logs --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Modified  mapred.max.tracker.failures" --args "-m,mapred.max.tracker.failures=7"

To disable S3 multipart upload using a bootstrap action

This procedure explains how to disable multipart upload using the Amazon EMR CLI. The command creates a cluster in a waiting state with multipart upload disabled.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "enable multipart upload" \
      --args "-c,fs.s3n.multipart.uploads.enabled=false"
    • Windows users:

      ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "disable multipart upload" --args "-c,fs.s3n.multipart.uploads.enabled=false"

    This cluster remains in the WAITING state until it is terminated.

Terminating Job Flows

--set-termination-protection TERMINATION_PROTECTION_STATE

Enables or disables termination protection on the specified cluster or clusters. To enable termination protection, set this value to true. To disable termination protection, set this value to false.

--terminate

Terminates the specified cluster or clusters.

To configure termination protection for a new cluster

  • To enable termination protection using the Amazon EMR CLI, specify --set-termination-protection true during the cluster creation call. If the parameter is not used, termination protection is disabled. You can also type --set-termination-protection false to disable protection. The following example shows setting termination protection on a cluster running the WordCount sample application.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    Note

    The Hadoop streaming syntax shown in the following examples is different between Hadoop 1.x and Hadoop 2.x.

    For Hadoop 2.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --ami-version 3.0.3 \
      --instance-type m1.xlarge --num-instances 2 \
      --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" \
      --input s3://elasticmapreduce/samples/wordcount/input \
      --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate \
      --set-termination-protection true
    • Windows users:

      ruby elastic-mapreduce --create --alive --ami-version 3.0.3 --instance-type m1.xlarge --num-instances 2 --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/output/2014-01-16 --mapper wordSplitter.py --reducer aggregate --set-termination-protection true

    For Hadoop 1.x, type the following command:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
      --instance-type m1.xlarge --num-instances 2 --stream \
      --input s3://elasticmapreduce/samples/wordcount/input \
      --output s3://mybucket/wordcount/output/2011-03-25 \
      --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate \
      --set-termination-protection true
    • Windows users:

      ruby elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 2 --stream --input s3://elasticmapreduce/samples/wordcount/input --output s3://mybucket/wordcount/output/2011-03-25 --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --set-termination-protection true

To configure termination protection for a running cluster

  • Set the --set-termination-protection flag to true. This is shown in the following example, where JobFlowID is the identifier of the cluster on which to enable termination protection.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --set-termination-protection true --jobflow JobFlowID
    • Windows users:

      ruby elastic-mapreduce --set-termination-protection true --jobflow JobFlowID

To terminate an unprotected cluster

To terminate an unprotected cluster using the Amazon EMR CLI, type the --terminate parameter and specify the cluster to terminate.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      ruby elastic-mapreduce --terminate JobFlowID

To terminate a protected cluster

  1. Disable termination protection by setting the --set-termination-protection parameter to false. This is shown in the following example, where JobFlowID is the identifier of the cluster on which to disable termination protection.

    elastic-mapreduce --set-termination-protection false --jobflow JobFlowID
  2. Terminate the cluster using the --terminate parameter and the cluster identifier of the cluster to terminate.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      ruby elastic-mapreduce --terminate JobFlowID

Using S3DistCp

When you call S3DistCp, you can specify options that change how it copies and compresses data. For more information about the options available for S3DistCp, see S3DistCp Options.

To add a S3DistCp step to a cluster

  • Add a step to the cluster that calls S3DistCp, passing in the parameters that specify how S3DistCp should perform the copy operation.

    The following example copies daemon logs from Amazon S3 to hdfs:///output.

    In this CLI command:

    • --jobflow specifies the cluster to add the copy step to.

    • --jar is the location of the S3DistCp JAR file.

    • --args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp Options. You can also specify the options singly, using multiple --arg parameters. Both forms are shown in examples below.

    You can use either the --args or --arg syntax to pass options into the cluster step. The --args parameter is a convenient way to pass in several --arg parameters at one time; it splits the string passed in on comma (,) characters to parse them into arguments. This syntax is shown in the following example. Note that the value passed in by --args is enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

    In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --jar \
      /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args 'S3DistCp-OptionName1,S3DistCp-OptionValue1,S3DistCp-OptionName2,S3DistCp-OptionValue2,S3DistCp-OptionName3,S3DistCp-OptionValue3'
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "S3DistCp-OptionName1,S3DistCp-OptionValue1,S3DistCp-OptionName2,S3DistCp-OptionValue2,S3DistCp-OptionName3,S3DistCp-OptionValue3"

    If the value of an S3DistCp option contains a comma, you cannot use --args; instead, use individual --arg parameters to pass in the S3DistCp option names and values. Only the --src and --dest arguments are required. Note that the option values are enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow JobFlowID --jar \
      /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --arg S3DistCp-OptionName1 --arg 'S3DistCp-OptionValue1' \
      --arg S3DistCp-OptionName2 --arg 'S3DistCp-OptionValue2' \
      --arg S3DistCp-OptionName3 --arg 'S3DistCp-OptionValue3' 
    • Windows users:

      ruby elastic-mapreduce --jobflow JobFlowID --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg "S3DistCp-OptionName1" --arg "S3DistCp-OptionValue1" --arg "S3DistCp-OptionName2" --arg "S3DistCp-OptionValue2" --arg "S3DistCp-OptionName3" --arg "S3DistCp-OptionValue3" 

Example Specify an option value that contains a comma

In this example, --srcPattern is set to '.*[a-zA-Z,]+'. The inclusion of a comma in the --srcPattern regular expression requires the use of individual --arg parameters.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
    --arg --src --arg 's3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node/' \
    --arg --dest --arg 'hdfs:///output' \
    --arg --srcPattern --arg '.*[a-zA-Z,]+'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg --s3Endpoint --arg "s3-eu-west-1.amazonaws.com" --arg --src --arg "s3://myawsbucket/logs/j-3GYXXXXXX9IOJ/node/" --arg --dest --arg "hdfs:///output" --arg --srcPattern --arg ".*[a-zA-Z,]+"

Example Copy log files from Amazon S3 to HDFS

This example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example, the --srcPattern option is used to limit the data copied to the daemon logs.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*"

Example Load Amazon CloudFront logs into HDFS

This example loads Amazon CloudFront logs into HDFS. In the process, it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until decompression of the entire file is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few large, LZO-compressed files than when processing many small, Gzip-compressed files. To split LZO files, you must index them and use the hadoop-lzo third-party library. For more information, see How to Process Compressed Files.

In the directory where you installed the Amazon EMR CLI, type the following command.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow j-3GYXXXXXX9IOK --jar \
    /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://mybucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'
  • Windows users:

    ruby elastic-mapreduce --jobflow j-3GYXXXXXX9IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args "--src,s3://mybucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess"

Consider the case in which the preceding example is run over the following CloudFront log files.

s3://mybucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://mybucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://mybucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://mybucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://mybucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz

S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.

hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo