Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

(Optional) Create Bootstrap Actions to Install Additional Software

You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change the Hadoop configuration settings.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo.

All Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters from the CLI or API.

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster.

When you use the CLI, you can pass references to bootstrap action scripts to Amazon EMR by adding the --bootstrap-action parameter when you create the cluster. The syntax for a --bootstrap-action parameter is as follows:

AWS CLI

--bootstrap-action Path=s3://mybucket/filename,Args=[arg1,arg2]

Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

--bootstrap-action s3://mybucket/filename --args "arg1,arg2"

If the bootstrap action returns a nonzero exit code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, Amazon EMR terminates the cluster. If only a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster lastStateChangeReason error code to identify failures caused by a bootstrap action.
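Because a nonzero return value marks the bootstrap action as failed, a script can check its own prerequisites and stop early with a clear message. The following is a minimal sketch of that pattern, not a script provided by Amazon EMR; the helper names require_command and check_prerequisites are hypothetical.

```shell
#!/bin/bash
# Sketch of a fail-fast pattern for a bootstrap action script.
# A nonzero exit status from the script signals a failed bootstrap action.

# require_command (hypothetical helper): succeed only if the command is on PATH.
require_command() {
  command -v "$1" >/dev/null 2>&1
}

# check_prerequisites (hypothetical helper): return 1 at the first missing
# tool, after printing a message that will appear in the bootstrap action log.
check_prerequisites() {
  local tool
  for tool in "$@"; do
    if ! require_command "$tool"; then
      echo "bootstrap: missing required command: $tool" >&2
      return 1
    fi
  done
}

# In a real bootstrap action, the script would end the run on failure, e.g.:
#   check_prerequisites tar mkdir || exit 1
```

Exiting with a specific nonzero status, rather than letting a later command fail obscurely, makes the cause easier to find in the bootstrap action logs.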

Use Predefined Bootstrap Actions

Amazon EMR provides predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Amazon EMR by using the bootstrap-action parameter.

Configure Daemons Bootstrap Action

Use this predefined bootstrap action to specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

The following table describes the valid parameters for the script. In the table, daemon can be namenode, datanode, jobtracker, tasktracker, or client for Hadoop 1.x, or namenode, datanode, resourcemanager, nodemanager, or client for Hadoop 2.x. For example, --namenode-heap-size=2048,--namenode-opts=\"-XX:GCTimeRatio=19\"

Configuration Parameter    Description
--daemon-heap-size         Sets the heap size in megabytes for the specified daemon.
--daemon-opts              Sets additional Java options for the specified daemon.
--replace                  Replaces the existing hadoop-user-env.sh file if it exists.

In Hadoop 1.x, --client-heap-size has no effect. Instead, change the client heap size using the --client-opts=\"-Xmx#####\" equivalent, where ##### is numeric.
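The parameter naming pattern in the table can be illustrated with a small shell helper that assembles the comma-separated argument list for one daemon. This is a sketch only; daemon_args is a hypothetical function, not part of the configure-daemons script, and it checks the daemon name against the names listed above.

```shell
#!/bin/bash
# daemon_args (hypothetical helper): print a comma-separated argument list
# for the configure-daemons bootstrap action.
#   $1 = daemon name, $2 = heap size in MB, $3 = optional extra JVM options
daemon_args() {
  local daemon="$1"
  local heap_mb="$2"
  local jvm_opts="${3:-}"

  # Accept only the daemon names listed for Hadoop 1.x and 2.x.
  case "$daemon" in
    namenode|datanode|jobtracker|tasktracker|resourcemanager|nodemanager|client) ;;
    *) echo "unknown daemon: $daemon" >&2; return 1 ;;
  esac

  local args="--${daemon}-heap-size=${heap_mb}"
  if [ -n "$jvm_opts" ]; then
    args="${args},--${daemon}-opts=${jvm_opts}"
  fi
  printf '%s\n' "$args"
}

# Prints: --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
daemon_args namenode 2048 "-XX:GCTimeRatio=19"
```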

The configure-daemons bootstrap action supports Hadoop 2.x with a configuration file, yarn-site.xml. Its configuration file keyword is yarn.

The following examples set the NameNode JVM heap size to 2048 MB and configure a JVM GC option for the NameNode.

To set the NameNode heap size using the AWS CLI

When using the AWS CLI to include a bootstrap action, specify the Path and Args as a comma-separated list.

  • To create a cluster and run a bootstrap action to configure the Hadoop NameNode daemon's heap size, type the following command and replace myKey with the name of your Amazon EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --applications Name=Hue Name=Hive Name=Pig \
      --instance-count 5 --instance-type m3.xlarge \
      --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-daemons,Args=["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRatio=19"]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --use-default-roles --ec2-attributes KeyName=myKey --applications Name=Hue Name=Hive Name=Pig --instance-count 5 --instance-type m3.xlarge --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-daemons,Args=["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRatio=19"]

    When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    Note

    If you have not previously created the default Amazon EMR service role and Amazon EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

    Alternatively, you can supply the bootstrap actions as JSON in a file if you have a long list of arguments or multiple bootstrap actions. For example, the JSON file configuredaemons.json would look like the following:

    [
      {
        "Path": "s3://elasticmapreduce/bootstrap-actions/configure-daemons",
        "Args": ["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRatio=19"],
        "Name": "Configure Daemons"
      }
    ]
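As a sketch of how such a file might be created and referenced, the following shell snippet writes the JSON to a temporary directory and checks that it is non-empty. The file:// prefix is the general AWS CLI mechanism for loading a parameter value from a local file; its use with --bootstrap-action here is an assumption, and the temporary directory is purely illustrative.

```shell
#!/bin/bash
# Write the bootstrap action definition shown above to a local file.
# The temporary directory is only for illustration; any local path works.
workdir="$(mktemp -d)"
config_file="$workdir/configuredaemons.json"

cat > "$config_file" <<'EOF'
[
  {
    "Path": "s3://elasticmapreduce/bootstrap-actions/configure-daemons",
    "Args": ["--namenode-heap-size=2048","--namenode-opts=-XX:GCTimeRatio=19"],
    "Name": "Configure Daemons"
  }
]
EOF

# Basic sanity check before handing the file to the CLI.
[ -s "$config_file" ] || { echo "failed to write $config_file" >&2; exit 1; }

# Assumed usage (not executed here): reference the file with file://, e.g.
#   aws emr create-cluster ... --bootstrap-action "file://$config_file"
echo "bootstrap action definition written to $config_file"
```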

To set the NameNode heap size using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
        --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

Configure Hadoop Bootstrap Action

You can use this bootstrap action to set cluster-wide Hadoop settings. The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This script provides the following command line options:

  • --keyword-config-file—Merges the existing Hadoop configuration with a user-specified XML configuration file that you upload to Amazon S3 or the local filesystem. The user-specified file can be named anything.

  • --keyword-key-value—Overrides specific key-value pairs in the Hadoop configuration files.

With both options, replace --keyword with a keyword (or use the single-character shortcut instead) that represents one of the Hadoop configuration files described in the following table. Because the single-character shortcuts can be used together in the same command, an uppercase character indicates that the shortcut refers to a configuration file and a lowercase character indicates that the shortcut refers to a key-value pair. If you specify multiple options, the later options override the earlier ones.

Configuration File Name            Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
log4j.properties                   log4j                        L                    l
core-site.xml                      core                         C                    c
hadoop-default.xml (deprecated)    default                      D                    d
hadoop-site.xml (deprecated)       site                         S                    s
hdfs-site.xml                      hdfs                         H                    h
mapred-site.xml                    mapred                       M                    m
yarn-site.xml                      yarn                         Y                    y
httpfs-site.xml                    httpfs                       T                    t
emrfs-site.xml                     emrfs                        E                    e
capacity-scheduler.xml             capacity                     Z                    z

You can provide multiple configurations for multiple instance types. For example, you may have a task group that consists of different instance types than your core group. The configure-hadoop bootstrap action provides an option, instance-type-config, which accepts the Amazon S3 URI or local path to a JSON file that specifies configurations for each instance type. That file would look something like:

{
  "m1.small":{
    "log4j":{
      "key1":"value1"
    },
    "site":{
      "key3":"value3"
    }
  },
  "m1.xlarge":{
    "yarn":{
      "lkey1":"lvalue1",
      "lkey11":"lvalue12"
    },
    "emrfs":{
      "lkey2":"lvalue2"
    },
    "site":{
      "lkey3":"lvalue3"
    }
  }
}

Note

If you do not want to set any values for a particular instance type, you should still provide a blank entry in the JSON list, e.g.

{
  "m1.small":{
  }
}

The following example shows how to use the configuration file keywords ('mapred' in this example) to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and set the maximum map tasks value to 2 in the mapred-site.xml file. The configuration file that you provide in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>4</value>
  </property>
</configuration>

The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

For Hadoop 0.18, the name and location of the configuration file is /conf/hadoop-site.xml.

The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
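The "later settings win" rule can be illustrated with a short shell sketch. It is not how the configure-hadoop script is implemented; resolve_overrides is a hypothetical helper that takes key=value arguments in the order supplied and prints the value that survives for each key.

```shell
#!/bin/bash
# resolve_overrides (hypothetical helper): given key=value arguments in the
# order they were supplied, print one line per key with the last value seen.
resolve_overrides() {
  printf '%s\n' "$@" | awk -F= '
    {
      key = $1
      value = substr($0, length(key) + 2)   # everything after the first "="
      if (!(key in last)) order[++n] = key  # remember first-seen order
      last[key] = value                     # later values replace earlier ones
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" last[order[i]] }
  '
}

# The second value for mapred.tasktracker.map.tasks.maximum replaces the first.
resolve_overrides \
  "mapred.tasktracker.map.tasks.maximum=4" \
  "mapred.userlog.retain.hours=4" \
  "mapred.tasktracker.map.tasks.maximum=2"
```

Run as written, the sketch prints mapred.tasktracker.map.tasks.maximum=2 followed by mapred.userlog.retain.hours=4.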

To change the maximum number of map tasks using the AWS CLI

When using the AWS CLI to include a bootstrap action, specify the Path and Args as a comma-separated list.

  • To launch a cluster with a bootstrap action that configures the maximum number of map tasks, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3.0 \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --applications Name=Hue Name=Hive Name=Pig \
      --instance-count 5 --instance-type m3.xlarge \
      --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-M","s3://myawsbucket/config.xml","-m","mapred.tasktracker.map.tasks.maximum=2"]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3.0 --use-default-roles --ec2-attributes KeyName=myKey --applications Name=Hue Name=Hive Name=Pig --instance-count 5 --instance-type m3.xlarge --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-M","s3://myawsbucket/config.xml","-m","mapred.tasktracker.map.tasks.maximum=2"]

    When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    Note

    If you have not previously created the default Amazon EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    Alternatively, you can provide a JSON file if you have a long list of arguments or multiple bootstrap actions. For example, a JSON file for this bootstrap action would look like this:

    [
      {
        "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
        "Args": ["-M","s3://myawsbucket/config.xml","-m","mapred.tasktracker.map.tasks.maximum=2"],
        "Name": "Configure Hadoop"
      }
    ]

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To change the maximum number of map tasks using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
    • Windows:

      ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

To provide multiple configurations using the AWS CLI

  • To launch a cluster with different instance type configurations using the AWS CLI and configure-hadoop bootstrap action, supply the instance-type-config option with the URI or path to the JSON configuration file:

    aws emr create-cluster --name "Test cluster" --ami-version 3.3.2 \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --applications Name=Hue Name=Hive Name=Pig \
    --instance-count 5 --instance-type m3.xlarge \
    --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["instance-type-config","s3://myBucket/myInstanceConfigfile.json"]

Run If Bootstrap Action

Use this predefined bootstrap action to run a command conditionally when an instance-specific value is found in the instance.json or job-flow.json file. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.

The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

The following example echoes the string "running on master node" if the node is a master.

To run a command conditionally using the AWS CLI

When using the AWS CLI to include a bootstrap action, specify the Path and Args as a comma-separated list.

  • To launch a cluster with a bootstrap action that conditionally runs a command when an instance-specific value is found in the instance.json or job-flow.json file, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --applications Name=Hue Name=Hive Name=Pig \
      --instance-count 5 --instance-type m3.xlarge \
      --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo running on master node"]
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --use-default-roles --ec2-attributes KeyName=myKey --applications Name=Hue Name=Hive Name=Pig --instance-count 5 --instance-type m3.xlarge --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo running on master node"]

    When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    Note

    If you have not previously created the default Amazon EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To run a command conditionally using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command. Notice that the optional arguments for the --args parameter are separated with commas.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create --alive \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
      --args "instance.isMaster=true,echo running on master node"
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"
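The kind of check that run-if performs can be approximated in a custom script. The following is a rough sketch, not the run-if implementation: is_master and run_if_master are hypothetical helpers, and the metadata path /mnt/var/lib/info/instance.json shown in the comment is an assumption about where instance.json is stored on the node.

```shell
#!/bin/bash
# is_master (hypothetical helper): succeed if the given instance metadata
# file contains "isMaster": true.
is_master() {
  grep -Eq '"isMaster"[[:space:]]*:[[:space:]]*true' "$1"
}

# run_if_master (hypothetical helper): run the remaining arguments as a
# command, but only when the metadata file marks this node as the master.
#   $1 = path to instance.json, $2... = command and its arguments
run_if_master() {
  local info_file="$1"
  shift
  if is_master "$info_file"; then
    "$@"
  fi
}

# Assumed usage inside a bootstrap action (path is an assumption):
#   run_if_master /mnt/var/lib/info/instance.json echo "running on master node"
```

A pattern match is used here to keep the sketch dependency-free; a JSON parser would be more robust if one is available on the node.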

Shutdown Actions

A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

Shutdown action scripts are not guaranteed to run if the node terminates with an error.
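As a minimal sketch of how a bootstrap action might register a shutdown action, the following helper writes a script into the shutdown-actions directory and marks it executable. The function name install_shutdown_action, the SHUTDOWN_DIR override (included only so the sketch can be tried off-cluster), and the log path in the usage comment are all hypothetical.

```shell
#!/bin/bash
# Directory named in the text above. The override exists only so this
# sketch can be exercised outside a cluster node.
SHUTDOWN_DIR="${SHUTDOWN_DIR:-/mnt/var/lib/instance-controller/public/shutdown-actions}"

# install_shutdown_action (hypothetical helper): read a script from stdin
# and save it as an executable file in the shutdown-actions directory.
#   $1 = file name for the shutdown action
install_shutdown_action() {
  local name="$1"
  mkdir -p "$SHUTDOWN_DIR"
  cat > "$SHUTDOWN_DIR/$name"
  chmod +x "$SHUTDOWN_DIR/$name"
}

# Assumed usage inside a bootstrap action (log path is hypothetical):
#   install_shutdown_action note-shutdown.sh <<'EOF'
#   #!/bin/bash
#   echo "node shutting down at $(date)" >> /mnt/var/log/shutdown-note.log
#   EOF
```

Because each shutdown action must complete within 60 seconds and the scripts run in parallel, it is safest to keep each one short and independent of the others.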

Use Custom Bootstrap Actions

In addition to predefined bootstrap actions, you can create a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

Add Custom Bootstrap Actions Using the AWS CLI or the Amazon EMR CLI

The following example uses a bootstrap action script to download and extract a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

The sample script looks like the following:

#!/bin/bash
set -e
wget -S -T 10 -t 5 http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/file.tar.gz
mkdir -p /home/hadoop/contents
tar -xzf file.tar.gz -C /home/hadoop/contents                         

To create a cluster with a custom bootstrap action using the AWS CLI

When using the AWS CLI to include a bootstrap action, specify the Path and Args as a comma-separated list. The following example does not use an arguments list.

  • To launch a cluster with a custom bootstrap action, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --applications Name=Hue Name=Hive Name=Pig \
      --instance-count 5 --instance-type m3.xlarge \
      --bootstrap-action Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --use-default-roles --ec2-attributes KeyName=myKey --applications Name=Hue Name=Hive Name=Pig --instance-count 5 --instance-type m3.xlarge --bootstrap-action Path="s3://elasticmapreduce/bootstrap-actions/download.sh"

    When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    Note

    If you have not previously created the default Amazon EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a cluster with a custom bootstrap action using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X:

      ./elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
    • Windows:

      ruby elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

Add Custom Bootstrap Actions Using the Console

The following example adds a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

To add a custom bootstrap action using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create Cluster.

  3. In the Bootstrap Actions section, in the Add bootstrap action field, select Custom action and then click Configure and add.

  4. In the Add Bootstrap Action dialog box, do the following:

    1. Enter the following text in the S3 location field:

      s3://elasticmapreduce/bootstrap-actions/download.sh
    2. (Optional) Enter any arguments in the Optional arguments field. Use spaces to separate the arguments.

    3. Click Add.

  5. In the Bootstrap Actions section, note the properties of the custom bootstrap action.

  6. Proceed with creating your cluster.

While the cluster's master node is running, you can connect to the master node and see the log files that the bootstrap action script generated in the /mnt/var/log/bootstrap-actions/1 directory.

Related Topics