Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Create Bootstrap Actions to Install Additional Software (Optional)

You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change the Hadoop configuration settings.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo.

All Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters from the CLI or API.

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster.

When you use the CLI, you can pass references to bootstrap action scripts to Amazon EMR by adding the --bootstrap-action parameter when you create the cluster. The syntax for a --bootstrap-action parameter is as follows:

--bootstrap-action s3://myawsbucket/FileName --args "arg1,arg2"
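
As a rough sketch of how these parameters fit together, the following shell helpers assemble one --bootstrap-action parameter from a script location and its arguments, joining the arguments with commas as in the syntax above. The function names (join_args, bootstrap_action_param, check_action_count) are hypothetical and are not part of the Amazon EMR CLI; the limit of 16 comes from the text above.

```shell
# Hypothetical helpers; not part of the Amazon EMR CLI.

# Join all arguments with commas: arg1 arg2 -> arg1,arg2
join_args() (
  IFS=,
  printf '%s\n' "$*"
)

# Print one --bootstrap-action parameter for a script and optional arguments.
bootstrap_action_param() (
  script=$1
  shift
  if [ "$#" -eq 0 ]; then
    printf '%s %s\n' '--bootstrap-action' "$script"
  else
    printf '%s %s --args "%s"\n' '--bootstrap-action' "$script" "$(join_args "$@")"
  fi
)

# Succeed only if the number of bootstrap actions is within the limit of 16.
check_action_count() {
  [ "$1" -le 16 ]
}
```

For example, bootstrap_action_param s3://myawsbucket/FileName arg1 arg2 prints the same parameter as the syntax line above.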

If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster lastStateChangeReason error code to identify failures caused by a bootstrap action.
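
Because any nonzero exit status marks the bootstrap action as failed on that instance, a custom script usually needs to stop at the first failing step. The following sketch shows one way to do that. The run_step helper is hypothetical, and the commented-out steps are placeholders rather than commands that Amazon EMR requires.

```shell
#!/bin/bash
# Hypothetical helper for a custom bootstrap action script.

# Run one setup step. On failure, report it on stderr and return the
# step's exit status so that the caller can exit nonzero.
run_step() {
  "$@"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "bootstrap step failed (exit $rc): $*" >&2
  fi
  return "$rc"
}

# Example steps (placeholders). Scripts run as the Hadoop user, so a
# step that needs root privileges goes through sudo:
#   run_step mkdir -p /home/hadoop/tools || exit 1
#   run_step sudo mkdir -p /opt/tools    || exit 1
```

Writing each step as "run_step ... || exit 1" keeps the failing command visible in the bootstrap action's error output while still returning a nonzero status to Amazon EMR.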

Using Predefined Bootstrap Actions

Amazon EMR provides predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. You pass references to predefined bootstrap action scripts to Amazon EMR by using the --bootstrap-action parameter.

Configure Daemons

Use this predefined bootstrap action to specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

The following table describes the valid parameters for the script. In the table, daemon can be namenode, datanode, jobtracker, tasktracker, or client in Hadoop 1.x, and namenode, datanode, resourcemanager, nodemanager, or client in Hadoop 2.x. For example: --namenode-heap-size=2048,--namenode-opts=\"-XX:GCTimeRatio=19\"

Configuration Parameter    Description
--daemon-heap-size         Sets the heap size in megabytes for the specified daemon.
--daemon-opts              Sets additional Java options for the specified daemon.
--replace                  Replaces the existing hadoop-user-env.sh file if it exists.

In Hadoop 1.x, --client-heap-size has no effect. Instead, change the client heap size using the --client-opts=\"-Xmx#####\" equivalent, where ##### is numeric.

The configure-daemons bootstrap action supports Hadoop 2.x with a configuration file, yarn-site.xml. Its configuration file keyword is yarn.
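
The parameter names in the table follow a fixed pattern, so the --args value can be assembled mechanically. The sketch below does this for one daemon. The daemon_args helper is hypothetical; it only checks the daemon name against the list given above and does not know which Hadoop version the cluster runs.

```shell
# Hypothetical helper; builds the --args value for configure-daemons.
# Usage: daemon_args DAEMON HEAP_MB [JVM_OPTS]
daemon_args() (
  daemon=$1
  heap_mb=$2
  jvm_opts=${3:-}
  case $daemon in
    namenode|datanode|jobtracker|tasktracker|resourcemanager|nodemanager|client) ;;
    *) echo "unknown daemon: $daemon" >&2; exit 1 ;;
  esac
  args="--${daemon}-heap-size=${heap_mb}"
  if [ -n "$jvm_opts" ]; then
    # Append the JVM options parameter only when options were given.
    args="${args},--${daemon}-opts=${jvm_opts}"
  fi
  printf '%s\n' "$args"
)
```

Calling daemon_args namenode 2048 -XX:GCTimeRatio=19 prints --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19. The helper does not handle the Hadoop 1.x client case described above, where the heap size has to be set through --client-opts instead.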

The following example sets the NameNode JVM heap size to 2048 MB and configures a JVM GC option for the NameNode.

To set the NameNode heap size using the Amazon EMR CLI

In the directory where you installed the Amazon EMR CLI, run the following command.

  • Linux, UNIX, and Mac OS X:

    ./elastic-mapreduce --create --alive \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
      --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
  • Windows:

    ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 

Configure Hadoop Settings with a Bootstrap Action

You can use this bootstrap action to set cluster-wide Hadoop settings. The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This script provides the following command line options:

  • --keyword-config-file—Merges the existing Hadoop configuration with a user-specified XML configuration file that you upload to Amazon S3 or the local filesystem. The user-specified file can be named anything.

  • --keyword-key-value—Overrides specific key-value pairs in the Hadoop configuration files.

With both options, replace --keyword with a keyword (or use the single-character shortcut instead) that represents one of the Hadoop configuration files described in the following table. The single-character shortcuts can be used together in the same command, so an uppercase character indicates that the shortcut refers to a configuration file and a lowercase character indicates that the shortcut refers to a key-value pair. If you specify multiple options, the later options override the earlier ones.

Configuration File Name          Configuration File Keyword    File Name Shortcut    Key-Value Pair Shortcut
core-site.xml                    core                          C                     c
hadoop-default.xml (deprecated)  default                       D                     d
hadoop-site.xml (deprecated)     site                          S                     s
hdfs-site.xml                    hdfs                          H                     h
mapred-site.xml                  mapred                        M                     m
yarn-site.xml                    yarn                          Y                     y
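
The table can also be read as a lookup from keyword to shortcut. The following sketch encodes it that way. The config_shortcut helper is hypothetical and covers only the six keywords listed in the table.

```shell
# Hypothetical helper; maps a configuration file keyword to its shortcut.
# Usage: config_shortcut KEYWORD file|key-value
config_shortcut() (
  case $1 in
    core)    letter=c ;;
    default) letter=d ;;
    site)    letter=s ;;
    hdfs)    letter=h ;;
    mapred)  letter=m ;;
    yarn)    letter=y ;;
    *) echo "unknown keyword: $1" >&2; exit 1 ;;
  esac
  case $2 in
    # Uppercase refers to a configuration file, lowercase to a key-value pair.
    file)      printf '%s\n' "-$(printf '%s' "$letter" | tr '[:lower:]' '[:upper:]')" ;;
    key-value) printf '%s\n' "-$letter" ;;
    *) echo "unknown kind: $2" >&2; exit 1 ;;
  esac
)
```

For instance, config_shortcut mapred file prints -M and config_shortcut mapred key-value prints -m.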

The following example uses the mapred configuration file keyword to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and to set the maximum number of map tasks to 2 in mapred-site.xml. The configuration file that you provide in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>4</value>
  </property>
</configuration>
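
A file with this shape can be generated rather than written by hand. The following sketch prints a configuration file for a list of NAME=VALUE pairs. The write_hadoop_config helper is hypothetical, it omits the optional stylesheet line, and it does not escape XML special characters in values.

```shell
# Hypothetical helper; prints a Hadoop configuration file for NAME=VALUE pairs.
# Values are written as-is, so they must not contain XML special characters.
write_hadoop_config() {
  printf '%s\n' '<?xml version="1.0"?>' '<configuration>'
  for pair in "$@"; do
    # Split each pair at the first "=" into property name and value.
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' \
      "${pair%%=*}" "${pair#*=}"
  done
  printf '%s\n' '</configuration>'
}
```

Running write_hadoop_config mapred.userlog.retain.hours=4 > config.xml would produce a local file with the same property as the example above, which you could then upload to your own Amazon S3 bucket.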

The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

For Hadoop 0.18, the configuration file is located at /conf/hadoop-site.xml.

The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
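
The override rule can be modeled locally as "the last value given for a key wins". The following sketch is only such a model, not the logic of the configure-hadoop script itself, and effective_value is a hypothetical name.

```shell
# Hypothetical model of "later settings override earlier ones".
# Usage: effective_value KEY NAME=VALUE...
# Prints the last value given for KEY; fails if KEY never appears.
effective_value() (
  key=$1
  shift
  found=1
  value=
  for pair in "$@"; do
    if [ "${pair%%=*}" = "$key" ]; then
      value=${pair#*=}
      found=0
    fi
  done
  if [ "$found" -eq 0 ]; then
    printf '%s\n' "$value"
  fi
  exit "$found"
)
```

Under this model, if mapred.tasktracker.map.tasks.maximum is set to 4 by an earlier option and to 2 by a later one, the effective value is 2.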

To change the maximum number of map tasks using the Amazon EMR CLI

In the directory where you installed the Amazon EMR CLI, run the following command.

  • Linux, UNIX, and Mac OS X:

    ./elastic-mapreduce --create \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
  • Windows:

    ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

Run If

Use this predefined bootstrap action to run a command conditionally when an instance-specific value is found in the instance.json or job-flow.json file. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.

The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

The following example echoes the string "running on master node" if the node is a master.

To run a command conditionally using the Amazon EMR CLI

In the directory where you installed the Amazon EMR CLI, run the following command. Notice that the optional arguments for the --args parameter are separated with commas.

  • Linux, UNIX, and Mac OS X:

    ./elastic-mapreduce --create --alive \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
    --args "instance.isMaster=true,echo running on master node"
  • Windows:

    ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"
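
To make the condition concrete, the following sketch imitates the check locally: it runs a command only when a JSON file contains a given key set to true. It is not the run-if script itself. The helper names are hypothetical, the pattern match is a simplification that handles only boolean true values, and the path /mnt/var/lib/info/instance.json in the usage note is an assumption about where the file lives on a node.

```shell
# Hypothetical helpers that imitate the run-if condition; not the real script.

# Succeed if JSON_FILE contains "KEY": true (simple pattern match, no parser).
json_flag_is_true() {
  grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*true" "$1"
}

# Usage: run_if_flag JSON_FILE KEY COMMAND [ARG...]
# Runs COMMAND only when the flag is true; otherwise does nothing and succeeds.
run_if_flag() (
  file=$1
  key=$2
  shift 2
  if json_flag_is_true "$file" "$key"; then
    "$@"
  fi
)
```

On a cluster node, a call such as run_if_flag /mnt/var/lib/info/instance.json isMaster echo running on master node would correspond to the example above, assuming the file is at that path.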

Shutdown Actions

A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

Shutdown action scripts are not guaranteed to run if the node terminates with an error.
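
A bootstrap action might register a shutdown action along the following lines. This is a sketch under stated assumptions: register_shutdown_action is a hypothetical helper, the optional third argument exists only so that the function can be tried outside a cluster, and the directory is assumed to be writable by the user running the bootstrap action.

```shell
# Hypothetical helper for use inside a bootstrap action script.
# Usage: register_shutdown_action NAME BODY [DIR]
# Writes an executable script NAME containing BODY into the shutdown-actions
# directory. DIR defaults to the directory documented above.
register_shutdown_action() (
  name=$1
  body=$2
  dir=${3:-/mnt/var/lib/instance-controller/public/shutdown-actions}
  mkdir -p "$dir" || exit 1
  printf '#!/bin/bash\n%s\n' "$body" > "$dir/$name" || exit 1
  chmod +x "$dir/$name"
)
```

Because each shutdown action must complete within 60 seconds and runs in parallel with the others, the body is best kept to a short, independent task.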

Using Custom Bootstrap Actions

In addition to predefined bootstrap actions, you can create a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

Running Custom Bootstrap Actions Using the CLI

The following example uses a bootstrap action script to download and extract a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

The sample script looks like the following:

#!/bin/bash
set -e
bucket=elasticmapreduce
path=samples/bootstrap-actions/file.tar.gz
wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path
mkdir -p /home/hadoop/contents
tar -C /home/hadoop/contents -xzf file.tar.gz

To create a cluster with a custom bootstrap action using the Amazon EMR CLI

In the directory where you installed the Amazon EMR CLI, run the following command.

  • Linux, UNIX, and Mac OS X:

    ./elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
  • Windows:

    ruby elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
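
A variation on the sample script, sketched below, takes the bucket, object path, and destination directory as arguments so that the same script could be reused with different --args values. The function names are hypothetical; the wget and tar options are copied from the sample script above.

```shell
#!/bin/bash
# Hypothetical, parameterized variant of the sample download script.

# Print the HTTP URL of an object in a bucket, as the sample script builds it.
s3_http_url() {
  printf 'http://%s.s3.amazonaws.com/%s\n' "$1" "$2"
}

# Usage: fetch_and_extract BUCKET PATH DEST_DIR
# Downloads the archive into the current directory and extracts it into DEST_DIR.
fetch_and_extract() (
  bucket=$1
  path=$2
  dest=$3
  wget -S -T 10 -t 5 "$(s3_http_url "$bucket" "$path")" || exit 1
  mkdir -p "$dest" || exit 1
  tar -C "$dest" -xzf "$(basename "$path")"
)

# A real script would end by calling the function with its own arguments:
#   fetch_and_extract "$@" || exit 1
```

If such a script were stored in your own bucket, the three values would be supplied through the --args parameter, separated with commas, for example --args "elasticmapreduce,samples/bootstrap-actions/file.tar.gz,/home/hadoop/contents".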

Running Custom Bootstrap Actions Using the Console

The following procedure creates a predefined word count sample cluster with a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

To create a cluster with a custom bootstrap action using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Create Cluster page, click Configure sample application.

  4. In the Configure Sample Application page, in the Select sample application field, choose the Word count sample application from the list.

  5. In the Output location field, type the path of an Amazon S3 bucket to store your output and then click Ok.

  6. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster Configuration
    Field    Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Protect a Cluster from Termination. Typically, set this value to Yes when you are developing an application (so that you can debug errors that would otherwise have terminated the cluster) or when you want to protect long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  7. In the Software Configuration section, verify the fields according to the following table.

    Software Configuration
    Field    Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose 2.4.2 (Hadoop 1.0.3).

    This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image.

  8. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Hardware Configuration
    Field    Action
    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select a Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon Elastic Compute Cloud User Guide.

    Master

    Choose m1.small.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Core

    Choose m1.small.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance type to use for core nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Task

    Choose m1.small.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance type to use for task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

  9. In the Security and Access section, complete the fields according to the following table.

    Security and Access
    Field    Action
    EC2 key pair

    Choose an Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose No other IAM users.

    Optionally, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    EC2 instance profile

    You can proceed without choosing an instance profile. If you create a cluster without selecting a specific instance profile, one will be created for you.

    This controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.

    EMR role

    Choose Proceed without role.

    Allows Amazon EMR to access other AWS services on your behalf.

    For more information, see Configure IAM Roles for Amazon EMR.

  10. In the Bootstrap Actions section, in the Add bootstrap action field, select Custom action and click Configure and add.

    In the Add Bootstrap Action dialog box, do the following:

    1. Enter the following text in the S3 location field:

      s3://elasticmapreduce/bootstrap-actions/download.sh
    2. (Optional) Enter any arguments in the Optional arguments field. Use spaces to separate the arguments.

    3. Click Add.

  11. In the Steps section, note the step that Amazon EMR configured for you by choosing the sample application.

    You do not need to change any of the settings in this section.

  12. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  13. When the cluster starts, the console displays the Cluster Details page.

While the cluster master node is running, you can connect to the master node and see the log files that the bootstrap action script generated in the /mnt/var/log/bootstrap-actions/1 directory.
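
Once connected to the master node, a small helper like the following could print the tail of each log file for a given bootstrap action. This is a sketch: show_bootstrap_logs is a hypothetical name, the default base directory is taken from the path above, and no assumption is made about which file names the directory contains.

```shell
# Hypothetical helper; prints the last lines of every log file for one
# bootstrap action. Usage: show_bootstrap_logs INDEX [BASE_DIR]
show_bootstrap_logs() (
  index=$1
  base=${2:-/mnt/var/log/bootstrap-actions}
  dir="$base/$index"
  [ -d "$dir" ] || { echo "no log directory: $dir" >&2; exit 1; }
  for f in "$dir"/*; do
    # Skip subdirectories and the unexpanded pattern of an empty directory.
    [ -f "$f" ] || continue
    printf '==> %s <==\n' "$f"
    tail -n 20 "$f"
  done
)
```

Calling show_bootstrap_logs 1 would inspect the directory for the first bootstrap action, which is the one used in this procedure.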
