Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Create Bootstrap Actions to Install Additional Software (Optional)

You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can write custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change Hadoop configuration settings.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute with root privileges if you use sudo.

Note

If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster lastStateChangeReason error code to identify failures caused by a bootstrap action.
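
Because a nonzero exit status marks the whole bootstrap action as failed, a script should stop at the first required step that fails. The following is a minimal sketch of that pattern; the function name run_steps and the sample steps are hypothetical and are not part of Amazon EMR.

```shell
#!/bin/bash
# Hypothetical helper: run each step in order and stop at the first failure,
# so that the caller can exit with a nonzero status.
run_steps() {
  local step
  for step in "$@"; do
    if ! bash -c "$step"; then
      echo "bootstrap step failed: $step" >&2
      return 1
    fi
  done
  return 0
}

# Example use inside a bootstrap script (the steps are placeholders):
#   run_steps "mkdir -p /home/hadoop/contents" "test -d /home/hadoop/contents" || exit 1
```

Steps after the failing one are never run, which keeps a half-configured node from reporting success.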

All three Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple --bootstrap-action parameters from the CLI or API.

From the CLI, you pass references to bootstrap action scripts to Amazon EMR by adding the --bootstrap-action parameter after the --create parameter. The syntax for a --bootstrap-action parameter is as follows:

--bootstrap-action "s3://myawsbucket/FileName" --args "arg1,arg2"
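
When a cluster needs several bootstrap actions, the options can be assembled programmatically while enforcing the limit of 16. This is a sketch only; bootstrap_action_options is a hypothetical helper name and the script URIs in the usage note are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print one --bootstrap-action option per script URI,
# refusing more than the 16 actions allowed per cluster.
bootstrap_action_options() {
  if [ "$#" -gt 16 ]; then
    echo "too many bootstrap actions: $# (maximum is 16)" >&2
    return 1
  fi
  local uri out=""
  for uri in "$@"; do
    out="$out --bootstrap-action \"$uri\""
  done
  # Drop the leading space added by the loop.
  echo "${out# }"
}
```

For example, calling bootstrap_action_options with s3://myawsbucket/a.sh and s3://myawsbucket/b.sh prints two --bootstrap-action options in the order given.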

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster on the Bootstrap Actions page in the Job Flow Creation Wizard.

For more information about how to reference a bootstrap action from the API, see the Amazon Elastic MapReduce API Reference.

Topics

    Using Predefined Bootstrap Actions

    Amazon EMR provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Amazon EMR by using the bootstrap-action parameter.

    You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters.

    Configure Daemons

    This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.

    The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

    The following procedure sets the NameNode JVM heap size to 2048 MB and configures a JVM GC option for the NameNode.

    To set the NameNode heap size

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --alive \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
          --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
      • Windows users:

        ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
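
The comma-separated --args value used above can be built from its two parts. The sketch below uses only the two options shown in the procedure (--namenode-heap-size and --namenode-opts); the helper name namenode_daemon_args is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build the comma-separated --args value for
# configure-daemons from a NameNode heap size (in MB) and one JVM option.
namenode_daemon_args() {
  local heap_mb="$1" jvm_opt="$2"
  # Reject an empty or non-numeric heap size.
  case "$heap_mb" in
    ''|*[!0-9]*)
      echo "heap size must be a whole number of MB: $heap_mb" >&2
      return 1
      ;;
  esac
  printf '%s\n' "--namenode-heap-size=${heap_mb},--namenode-opts=${jvm_opt}"
}
```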

    Configure Hadoop Settings with a Bootstrap Action

    This bootstrap action allows you to set cluster-wide Hadoop settings. The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This script provides two types of command line options:

    • --keyword-config-file—This option merges the existing Hadoop configuration with a user-specified XML configuration file that you upload to Amazon S3 or the local filesystem. The user-specified file can be named anything.

    • --keyword-key-value—This option allows you to override specific key value pairs in the Hadoop configuration files.

    In both options, replace the --keyword portion of the command-line with a keyword (or use the single character shortcut instead) that represents one of the five Hadoop configuration files as described in the following table. Because the single character shortcuts can be used together in the same command, an upper-case character indicates the shortcut refers to a configuration file and a lower-case character indicates the shortcut refers to a key value pair. When multiple options are used on the command-line, the later options override the earlier ones.

    Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
    hadoop-default.xml        default                      D                    d
    hadoop-site.xml           site                         S                    s
    core-site.xml             core                         C                    c
    hdfs-site.xml             hdfs                         H                    h
    mapred-site.xml           mapred                       M                    m
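
The relationship between keywords and shortcuts in the table above can be expressed as a small lookup. This is an illustrative sketch; config_shortcut and its "file" and "kv" mode names are hypothetical and are not options of the configure-hadoop script.

```shell
#!/bin/bash
# Hypothetical helper: map a configuration file keyword to its shortcut.
# Mode "file" gives the upper-case (configuration file) shortcut;
# mode "kv" gives the lower-case (key-value pair) shortcut.
config_shortcut() {
  local keyword="$1" mode="$2" letter
  case "$keyword" in
    default) letter=d ;;
    site)    letter=s ;;
    core)    letter=c ;;
    hdfs)    letter=h ;;
    mapred)  letter=m ;;
    *) echo "unknown keyword: $keyword" >&2; return 1 ;;
  esac
  case "$mode" in
    file) printf -- '-%s\n' "$(printf '%s' "$letter" | tr 'a-z' 'A-Z')" ;;
    kv)   printf -- '-%s\n' "$letter" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```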

    The following example uses the single-character shortcuts for the mapred keyword (M for a configuration file, m for a key-value pair) to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and to set the maximum number of map tasks to 2 in the mapred-site.xml file.

    To change the maximum number of map tasks

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
      • Windows users:

        ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

    The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
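
The override rule can be illustrated with a short simulation. The sketch below is not the configure-hadoop implementation; it only shows the "last value wins" behavior for a sequence of key=value settings, and the name resolve_settings is hypothetical.

```shell
#!/bin/bash
# Illustration only: resolve a sequence of key=value settings so that a later
# value for the same key replaces an earlier one. Keys are printed in the
# order in which they first appeared.
resolve_settings() {
  printf '%s\n' "$@" | awk -F= '
    {
      key = $1
      value = substr($0, length(key) + 2)   # keep any "=" inside the value
      if (!(key in seen)) { order[++n] = key; seen[key] = 1 }
      final[key] = value
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" final[order[i]] }
  '
}
```

If mapred.tasktracker.map.tasks.maximum is passed first with the value 2 and again later with the value 4, only the value 4 remains in the output.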

    Note

    The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.userlog.retain.hours</name>
        <value>4</value>
      </property>
    </configuration> 

    The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.
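
A configuration file of the kind shown in the note above can be generated from name=value pairs before you upload it to Amazon S3. This is a minimal sketch: make_hadoop_config is a hypothetical helper, it omits the optional stylesheet line, and it does not escape XML special characters, so it assumes that names and values contain none.

```shell
#!/bin/bash
# Hypothetical helper: print a Hadoop-style configuration file containing
# one <property> element per name=value argument.
make_hadoop_config() {
  local pair
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  for pair in "$@"; do
    echo '  <property>'
    echo "    <name>${pair%%=*}</name>"    # text before the first "="
    echo "    <value>${pair#*=}</value>"   # text after the first "="
    echo '  </property>'
  done
  echo '</configuration>'
}

# Example use: make_hadoop_config mapred.userlog.retain.hours=4 > config.xml
```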

    For Hadoop 0.18, the name and location of the configuration file is /conf/hadoop-site.xml. The default hadoop-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
      <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
      <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
      <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
      <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
      <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
      <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
      <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
      <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
      <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
      <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
      <property><name>dfs.namenode.handler.count</name><value>20</value></property>
      <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
      <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
      <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>io.sort.factor</name><value>40</value></property>
      <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property>
      <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
      <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
      <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
      <property><name>tasktracker.http.threads</name><value>20</value></property>
      <property><name>mapred.reduce.tasks</name><value>1</value></property>
      <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
      <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
      <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
      <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>io.file.buffer.size</name><value>65536</value></property>
      <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
      <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
      <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
      <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
      <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property>
      <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
      <property><name>io.sort.mb</name><value>150</value></property>
      <property><name>hadoop.job.history.user.location</name><value>none</value></property>
      <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
      <property><name>dfs.replication</name><value>1</value></property>
      <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
      <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
    </configuration>			

    In Hadoop 0.20, the configuration files are core-site.xml, hdfs-site.xml, and mapred-site.xml.

    The default core-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
      <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property>
      <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
      <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
      <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property>
      <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
      <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
      <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
    </configuration>

    The default hdfs-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
      <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
      <property><name>dfs.namenode.handler.count</name><value>20</value></property>
      <property><name>io.file.buffer.size</name><value>65536</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
      <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
      <property><name>dfs.replication</name><value>1</value></property>
      <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
      <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
      <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
      <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
      <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
      <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
      <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
    </configuration>

    The default mapred-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>
      <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
      <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
      <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
      <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
      <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property>
      <property><name>io.sort.factor</name><value>40</value></property>
      <property><name>mapred.reduce.tasks</name><value>1</value></property>
      <property><name>tasktracker.http.threads</name><value>20</value></property>
      <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
      <property><name>hadoop.job.history.user.location</name><value>none</value></property>
      <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
      <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
      <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
      <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
      <property><name>mapred.compress.map.output</name><value>true</value></property>
      <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
      <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
      <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
      <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property>
      <property><name>io.sort.mb</name><value>150</value></property>
    </configuration>

    Configure Memory-Intensive Workloads

    This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for clusters with memory-intensive workloads.

    Note

    The memory-intensive bootstrap action should be used only with AMI versions 1.0.1 and earlier. Using the memory-intensive bootstrap action with AMI versions 2.0.0 and later may cause your cluster to fail.
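
A launch script can guard against using this bootstrap action with an unsupported AMI version. The sketch below is an assumption-laden example: memory_intensive_supported is a hypothetical helper, it conservatively treats every version later than 1.0.1 as unsupported, and it relies on the version sort (-V) option of the GNU sort utility.

```shell
#!/bin/bash
# Hypothetical guard: succeed only when the given AMI version is 1.0.1 or
# earlier. Uses "sort -V" to compare dotted version numbers.
memory_intensive_supported() {
  local version="$1" newest
  newest=$(printf '%s\n' "$version" "1.0.1" | sort -V | tail -n 1)
  # If 1.0.1 sorts last, the given version is not newer than 1.0.1.
  [ "$newest" = "1.0.1" ]
}
```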

    The following Hadoop configuration parameters are set:

    Parameters modified in hadoop.env.sh

    • HADOOP_JOBTRACKER_HEAPSIZE

    • HADOOP_NAMENODE_HEAPSIZE

    • HADOOP_TASKTRACKER_HEAPSIZE

    • HADOOP_DATANODE_HEAPSIZE

    Parameters modified in mapred-site.xml

    • mapred.child.java.opts

    • mapred.tasktracker.map.tasks.maximum

    • mapred.tasktracker.reduce.tasks.maximum

    The bootstrap script is located at s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive.

    The default configurations for cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.

    For information about the configuration values for each supported EC2 instance type, see Hadoop Memory-Intensive Configuration Settings (AMI 1.0).

    The following procedure creates a default cluster with the memory-intensive bootstrap action. The bootstrap action modifies the Hadoop cluster configuration settings to the recommended configuration for an Amazon EC2 m1.small instance.

    To configure clusters for memory-intensive workloads

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
      • Windows users:

        ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

    Run If

    You can use this predefined bootstrap action to run a command conditionally when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.

    The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

    The following procedure echoes the string "running on master node" if the node is a master node.

    To run a command conditionally

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
        --args "instance.isMaster=true,echo running on master node"
      • Windows users:

        ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"

    Note

    You must use commas to separate commands that you specify with the --args option.
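
Conceptually, the condition in the example above is a lookup in a JSON file followed by a conditional command. The sketch below is an illustration only and is not the run-if implementation. It assumes that the file contains a boolean field written as "isMaster": true or "isMaster": false; the helper name run_if_master_is and the file path argument are hypothetical.

```shell
#!/bin/bash
# Illustration only, not the run-if implementation. Runs a command when a
# JSON file contains the boolean field "isMaster" with the wanted value.
# The file layout (for example {"isMaster": true}) is an assumption.
run_if_master_is() {
  local wanted="$1" json_file="$2"
  shift 2
  # Match "isMaster": <wanted>, followed by a non-alphanumeric character or end of line.
  if grep -Eq "\"isMaster\"[[:space:]]*:[[:space:]]*${wanted}([^[:alnum:]]|\$)" "$json_file"; then
    "$@"
  fi
}

# Example use: run_if_master_is true /path/to/instance.json echo running on master node
```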

    Shutdown Actions

    A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

    Note

    Shutdown action scripts are not guaranteed to run if the node terminates with an error.
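
A bootstrap action might register a shutdown action as in the following sketch. The helper name install_shutdown_action, the optional directory override, and the log file /tmp/shutdown-action.log are hypothetical; only the default directory comes from the description above.

```shell
#!/bin/bash
# Hypothetical helper for use inside a bootstrap action: write an executable
# shutdown script into the shutdown-actions directory. The directory can be
# overridden with a second argument so the helper can be tried on any machine.
install_shutdown_action() {
  local name="$1"
  local dir="${2:-/mnt/var/lib/instance-controller/public/shutdown-actions}"
  mkdir -p "$dir" || return 1
  # The quoted heredoc delimiter keeps $(date) literal in the generated script.
  cat > "$dir/$name" <<'EOF'
#!/bin/bash
# Keep this short: each shutdown script must complete within 60 seconds.
echo "node shutting down at $(date)" >> /tmp/shutdown-action.log
EOF
  chmod +x "$dir/$name"
}
```

Because all shutdown scripts run in parallel, a script written this way should not depend on another shutdown script having finished.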

    Using Custom Bootstrap Actions

    In addition to predefined bootstrap actions, you can write a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

    Running Custom Bootstrap Actions from the CLI

    The following procedure uses a bootstrap action script to download and extract a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

    The sample script looks like the following:

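    #!/bin/bash
    # Stop at the first command that fails so the bootstrap action reports an error.
    set -e
    # Location of the sample archive in Amazon S3.
    bucket=elasticmapreduce
    path=samples/bootstrap-actions/file.tar.gz
    # Download the archive: 10-second timeout, up to 5 tries, print server headers (-S).
    wget -S -T 10 -t 5 "http://$bucket.s3.amazonaws.com/$path"
    # Extract the archive into /home/hadoop/contents.
    mkdir -p /home/hadoop/contents
    tar -C /home/hadoop/contents -xzf file.tar.gz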

    To create a cluster with a custom bootstrap action

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --stream --alive \
        --input s3n://elasticmapreduce/samples/wordcount/input \
        --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
        --output s3n://myawsbucket \
        --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
      • Windows users:

        ruby elastic-mapreduce --create --stream --alive --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --output s3n://myawsbucket --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

    Running Custom Bootstrap Actions from the Amazon EMR Console

    The following procedure creates a predefined word count sample cluster with a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

    To create a cluster with a custom bootstrap action

    1. Start a new cluster:

      1. From the Amazon EMR console region selector, select a region.

      2. Click Create a New Job Flow.

    2. On the DEFINE JOB FLOW page, enter the following information:

      1. Enter a name in the Job Flow Name field.

        We recommend that you use a descriptive name. It does not need to be unique.

      2. Select Run a sample application.

      3. Select Word Count (Streaming) from the menu and click Continue.

    3. On the SPECIFY PARAMETERS page, replace the myawsbucket text in the Output Location field with the name of a valid Amazon S3 bucket and then click Continue.

    4. On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.

    5. On the ADVANCED OPTIONS page, accept the default parameters and click Continue.

    6. On the BOOTSTRAP ACTIONS page, select Configure your Bootstrap Actions.

      Enter the following information:

      1. Select Custom Action from the Action Type list.

      2. Enter the following text in the Amazon S3 Location field:

        s3://elasticmapreduce/bootstrap-actions/download.sh
      3. Click Continue.

    7. On the REVIEW page, review the information, correct any values as necessary, and then click Create Job Flow.

      After you click Create Job Flow your request is processed; when it succeeds, a message appears.

    8. Click Close.

      The Amazon EMR console shows the new cluster starting. Starting a new cluster may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the cluster's progress.


    While the cluster master node is running, you can connect to the master node and view the log files that the bootstrap action script generated. The log files are stored in the /mnt/var/log/bootstrap-actions/1 directory.
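
After connecting to the master node, the log files can be printed together with a short helper. This is a sketch under assumptions: show_bootstrap_logs is a hypothetical name, and the function prints whatever regular files it finds without assuming any particular log file names.

```shell
#!/bin/bash
# Hypothetical helper: print every log file found in the log directory of
# one bootstrap action. Defaults to the directory named above.
show_bootstrap_logs() {
  local dir="${1:-/mnt/var/log/bootstrap-actions/1}" file
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  for file in "$dir"/*; do
    [ -f "$file" ] || continue   # skip subdirectories and an unmatched glob
    echo "== $file =="
    cat "$file"
  done
}
```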
