| « PreviousNext » | |
![]() ![]() ![]() | Did this page help you? Yes | No | Tell us about it... |
You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can write custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change Hadoop configuration settings.
Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute
with root privileges if you use sudo.
Note
If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and
terminates the instance. If too many instances fail their bootstrap actions, then
Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to
reallocate the failed instances and continue. Use the cluster
lastStateChangeReason error code to identify failures caused by a
bootstrap action.
All three Amazon EMR interfaces support bootstrap actions. You can specify up to
16 bootstrap actions per cluster by providing multiple
--bootstrap-action parameters from the CLI or API.
From the CLI, references to bootstrap action scripts are passed to Amazon EMR by
adding the bootstrap-action parameter after the
create parameter. The syntax for a
bootstrap-action parameter is as follows:
--bootstrap-action "s3://myawsbucket/FileName" --args "arg1,arg2"
From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster on the Bootstrap Actions page in the Job Flow Creation Wizard.
For more information about how to reference a bootstrap action from the API, see the Amazon Elastic MapReduce API Reference.
Topics
Topics
Amazon EMR provides a number of predefined bootstrap action scripts that you can use to customize
Hadoop settings. This section describes the available predefined bootstrap actions.
References to predefined bootstrap action scripts are passed to Amazon EMR by using the
bootstrap-action parameter.
You can specify up to 16 bootstrap actions per cluster by providing multiple
bootstrap-action parameters.
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.
The location of the script is
s3://elasticmapreduce/bootstrap-actions/configure-daemons.
The following procedure sets the NameNode JVM heap size to 2048 MB and configures a JVM GC option for the NameNode.
To set the NameNode heap size
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \ --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
Windows users:
ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
This bootstrap action allows you to set cluster-wide Hadoop settings. The
location of the script is
s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This
script provides two types of command line options:
--keyword-config-file—This option
merges the existing Hadoop configuration with a user-specified XML
configuration file that you upload to Amazon S3 or the local filesystem. The
user-specified file can be named anything.
--keyword-key-value—This
option allows you to override specific key value pairs in the Hadoop
configuration files.
In both options, replace the --keyword portion of the command-line
with a keyword (or use the single character shortcut instead) that represents one of
the five Hadoop configuration files as described in the following table. Because the
single character shortcuts can be used together in the same command, an upper-case
character indicates the shortcut refers to a configuration file and a lower-case
character indicates the shortcut refers to a key value pair. When multiple options
are used on the command-line, the later options override the earlier ones.
| Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut |
|---|---|---|---|
| hadoop-default.xml | default | D | d |
| hadoop-site.xml | site | S | s |
| core-site.xml | core | C | c |
| hdfs-site.xml | hdfs | H | h |
| mapred-site.xml | mapred | M | m |
The following example shows how to use the configuration file keywords ('mapred' in this example) to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and set the maximum map tasks value to 2 in the mapred-site.xml file.
To change the maximum number of map tasks
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --args "--M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
Windows users:
ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
Note
The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.userlog.retain.hours</name>
<value>4</value>
</property>
</configuration> The configuration file for Hadoop 0.18 is hadoop-site.xml. In
Hadoop 0.20 and later, the old configuration file is replaced with three new files:
core-site.xml, mapred-site.xml, and
hdfs-site.xml.
For Hadoop 0.18, the name and location of the configuration file is
/conf/hadoop-site.xml. The default
hadoop-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property> <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property> <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property> <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property> <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property> <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property> <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property> <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property> <property><name>mapred.userlog.retain.hours</name><value>48</value></property> <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property> <property><name>dfs.namenode.handler.count</name><value>20</value></property> <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property> <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property> <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>io.sort.factor</name><value>40</value></property> <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property> <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property> <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property> <property><name>mapred.reduce.parallel.copies</name><value>20</value></property> <property><name>tasktracker.http.threads</name><value>20</value></property> <property><name>mapred.reduce.tasks</name><value>1</value></property> <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property> <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property> <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>io.file.buffer.size</name><value>65536</value></property> <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property> <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property> <property><name>dfs.block.size</name><value>134217728</value></property> <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property> <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property> <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property> <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property> <property><name>io.sort.mb</name><value>150</value></property> <property><name>hadoop.job.history.user.location</name><value>none</value></property> <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property> <property><name>dfs.replication</name><value>1</value></property> <property><name>mapred.job.tracker.handler.count</name><value>20</value></property> <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property> </configuration>
In Hadoop 0.20, the configuration file names and locations are
core-site.xml, hdfs-site.xml, and
mapred-site.xml.
The default core-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property> <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property> <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property> <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property> <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property> <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property> <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property> </configuration>
The default hdfs-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property> <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property> <property><name>dfs.namenode.handler.count</name><value>20</value></property> <property><name>io.file.buffer.size</name><value>65536</value></property> <property><name>dfs.block.size</name><value>134217728</value></property> <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property> <property><name>dfs.replication</name><value>1</value></property> <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property> <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property> <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property> <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property> <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property> <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property> <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property> </configuration>
The default mapred-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property> <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property> <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property> <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property> <property><name>mapred.userlog.retain.hours</name><value>48</value></property> <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property> <property><name>io.sort.factor</name><value>40</value></property> <property><name>mapred.reduce.tasks</name><value>1</value></property> <property><name>tasktracker.http.threads</name><value>20</value></property> <property><name>mapred.reduce.parallel.copies</name><value>20</value></property> <property><name>hadoop.job.history.user.location</name><value>none</value></property> <property><name>mapred.job.tracker.handler.count</name><value>20</value></property> <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property> <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property> <property><name>mapred.compress.map.output</name><value>true</value></property> <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property> <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property> <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property> <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property> <property><name>io.sort.mb</name><value>150</value></property> </configuration>
This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for clusters with memory-intensive workloads.
Note
The memory-intensive bootstrap action should be used only with AMI versions 1.0.1 and earlier. Using the memory-intensive bootstrap action with AMI versions 2.0.0 and later may cause your cluster to fail.
The following Hadoop configuration parameters are set:
Parameters modified in hadoop.env.sh
HADOOP_JOBTRACKER_HEAPSIZE
HADOOP_NAMENODE_HEAPSIZE
HADOOP_TASKTRACKER_HEAPSIZE
HADOOP_DATANODE_HEAPSIZE
Parameters modified in mapred-site.xml
mapred.child.java.opts
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
The bootstrap script is located at
s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive.
The default configurations for cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.
For information about the configuration values for each supported EC2 instance type, see Hadoop Memory-Intensive Configuration Settings (AMI 1.0) .
The following procedure creates a default cluster with the memory-intensive bootstrap action. The bootstrap action modifies the Hadoop cluster configuration settings to the recommended configuration for an Amazon EC2 m1.small instance.
To configure clusters for memory-intensive workloads
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
Windows users:
ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
You can use this predefined bootstrap action to run a command conditionally when an
instance-specific value is found in the instance.json or
job-flow.json files. The command can refer to a file in
Amazon S3 that Amazon EMR can download and execute.
The location of the script is
s3://elasticmapreduce/bootstrap-actions/run-if.
The following procedure echoes the string running on master node if the node is a master.
To run a command conditionally
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \ --args "instance.isMaster=true,echo running on master node"
Windows users:
ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"
Note
You must use commas to separate commands that you specify with the --args
option.
A bootstrap action script can create one or more shutdown
actions by writing scripts to the
/mnt/var/lib/instance-controller/public/shutdown-actions/ directory.
When a cluster is terminated, all the scripts in this directory are executed in
parallel. Each script must run and complete within 60 seconds.
Note
Shutdown action scripts are not guaranteed to run if the node terminates with an error.
Topics
In addition to predefined bootstrap actions, you can write a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.
The following procedure uses a bootstrap action script to download and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at: http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.
The sample script looks like the following:
#!/bin/bash set -e bucket=elasticmapreduce path=samples/bootstrap-actions/file.tar.gz wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path mkdir -p /home/hadoop/contents tar -C /home/hadoop/contents -xzf file.tar.gz
To create a cluster with a custom bootstrap action
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --stream --alive \ --inputs3n://elasticmapreduce/samples/wordcount/input\ --mappers3://elasticmapreduce/samples/wordcount/wordSplitter.py\ --outputs3n://myawsbucket\ --bootstrap-action"s3://elasticmapreduce/bootstrap-actions/download.sh"
Windows users:
ruby elastic-mapreduce --create --stream --alive --inputs3n://elasticmapreduce/samples/wordcount/input--mappers3://elasticmapreduce/samples/wordcount/wordSplitter.py--outputs3n://myawsbucket--bootstrap-action"s3://elasticmapreduce/bootstrap-actions/download.sh"
The following procedure creates a predefined word count sample cluster with a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.
To create a cluster with a custom bootstrap action
Start a new cluster:
From the Amazon EMR console region selector, select a region.

Click Create a New Job Flow.
On the DEFINE JOB FLOW page, enter the following information:
Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
Select Run a sample application.
Select Word Count (Streaming) from the menu and click Continue.

On the SPECIFY PARAMETERS page, replace the
myawsbucket text in the
Output Location field with the name of a valid Amazon S3
bucket and then click Continue.

On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.
On the ADVANCED OPTIONS page, accept the default parameters and click Continue.
On the BOOTSTRAP ACTIONS page, select Configure your Bootstrap Actions.
Enter the following information:
Select Custom Action from the Action Type list.
Enter the following text in the Amazon S3 Location field:
s3://elasticmapreduce/bootstrap-actions/download.sh
Click Continue.

In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.
After you click Create Job Flow your request is processed; when it succeeds, a message appears.

Click Close.
The Amazon EMR console shows the new cluster starting. Starting a new cluster may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the cluster's progress.

While the cluster master node is running, you can connect to the master node and see
the log files the that the bootstrap action script generated stored in the
/mnt/var/log/bootstrap-actions/1 directory.
Related Topics