Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Create Bootstrap Actions to Install Additional Software (Optional)

You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can write custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change Hadoop configuration settings.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute with root privileges if you use sudo.
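For example, the following is a minimal sketch of a custom bootstrap action that uses sudo for a step that requires root privileges; the directory and file names are illustrative.

    #!/bin/bash
    # Runs as the hadoop user; steps that need root privileges use sudo.
    set -e
    # Illustrative example: write a configuration file to a root-owned directory.
    sudo mkdir -p /etc/myapp
    echo "setting=value" | sudo tee /etc/myapp/myapp.conf > /dev/null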

Note

If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster lastStateChangeReason error code to identify failures caused by a bootstrap action.
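Because the exit code determines success or failure, you can control this behavior explicitly in a custom bootstrap action. The following sketch uses hypothetical command names to show the pattern.

    #!/bin/bash
    # A nonzero exit code marks the bootstrap action as failed and
    # terminates the instance.
    critical-setup-step || exit 1   # hypothetical command; failure stops the node
    # Append "|| true" to steps whose failure should not fail the bootstrap action.
    optional-setup-step || true     # hypothetical command; errors are ignored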

All three Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple --bootstrap-action parameters from the CLI or API.

From the CLI, references to bootstrap action scripts are passed to Amazon EMR by adding the bootstrap-action parameter after the create parameter. The syntax for a bootstrap-action parameter is as follows:

--bootstrap-action "s3://myawsbucket/FileName" --args "arg1,arg2"

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster on the Bootstrap Actions page in the Job Flow Creation Wizard.

For more information about how to reference a bootstrap action from the API, see the Amazon Elastic MapReduce API Reference.

Topics

    Using Predefined Bootstrap Actions

    Amazon EMR provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Amazon EMR by using the bootstrap-action parameter.

    You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters.

    Configure Daemons

    This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collector (GC) behavior.

    The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

    The following table describes the valid parameters for the script. In the table, replace daemon with namenode, datanode, jobtracker, tasktracker, or client (Hadoop 1.x), or with namenode, datanode, resourcemanager, nodemanager, or client (Hadoop 2.x). For example: --namenode-heap-size=2048,--namenode-opts=\"-XX:GCTimeRatio=19\"

    Configuration Parameter | Description
    --daemon-heap-size | Sets the heap size in megabytes for the specified daemon.
    --daemon-opts | Sets additional Java options for the specified daemon.
    --replace | Replaces the existing hadoop-user-env.sh file if it exists.

    In Hadoop 1.x, --client-heap-size has no effect. Instead, change the client heap size with the equivalent option --client-opts=\"-Xmx#####\", where ##### is a numeric value.
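    For example, the following is a sketch of setting the client heap size from the CLI; the heap value and unit suffix are illustrative.

        ./elastic-mapreduce --create --alive \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
          --args --client-opts=-Xmx2048m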

    Note

    The configure-daemons bootstrap action supports Hadoop 2.x with a new configuration file, yarn-site.xml. Its configuration file keyword is yarn.

    The following procedure sets the NameNode JVM heap size to 2048 MB and configures a JVM GC option for the NameNode.

    To set the NameNode heap size

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --alive \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
          --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 
      • Windows users:

        ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 

    Configure Hadoop Settings with a Bootstrap Action

    This bootstrap action allows you to set cluster-wide Hadoop settings. The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop. This script provides two types of command line options:

    • --keyword-config-file—This option merges the existing Hadoop configuration with a user-specified XML configuration file that you upload to Amazon S3 or the local filesystem. The user-specified file can be named anything.

    • --keyword-key-value—This option allows you to override specific key value pairs in the Hadoop configuration files.

    In both options, replace the --keyword portion of the command line with a keyword (or use the single-character shortcut instead) that represents one of the Hadoop configuration files described in the following table. Because the single-character shortcuts can be used together in the same command, an uppercase character indicates that the shortcut refers to a configuration file and a lowercase character indicates that the shortcut refers to a key-value pair. When multiple options are used on the command line, later options override earlier ones.

    Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut
    core-site.xml | core | C | c
    hadoop-default.xml (deprecated) | default | D | d
    hadoop-site.xml (deprecated) | site | S | s
    hdfs-site.xml | hdfs | H | h
    mapred-site.xml | mapred | M | m
    yarn-site.xml | yarn | Y | y

    The following example shows how to use the configuration file keywords ('mapred' in this example) to merge a user-specified configuration file (config.xml) with Hadoop's mapred-site.xml file and set the maximum map tasks value to 2 in the mapred-site.xml file.

    To change the maximum number of map tasks

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"
      • Windows users:

        ruby elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-M,s3://myawsbucket/config.xml,-m,mapred.tasktracker.map.tasks.maximum=2"

    The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
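    For example, in the following sketch the later -m option overrides the earlier one, so mapred.userlog.retain.hours ends up set to 4; the values are illustrative.

        ./elastic-mapreduce --create \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "-m,mapred.userlog.retain.hours=48,-m,mapred.userlog.retain.hours=4"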

    Note

    The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file; for example:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.userlog.retain.hours</name>
        <value>4</value>
      </property>
    </configuration> 

    The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

    For Hadoop 0.18, the configuration file is /conf/hadoop-site.xml. The default hadoop-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
      <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
      <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
      <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
      <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
      <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
      <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
      <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
      <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
      <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
      <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
      <property><name>dfs.namenode.handler.count</name><value>20</value></property>
      <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
      <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
      <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>io.sort.factor</name><value>40</value></property>
      <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property>
      <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
      <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
      <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
      <property><name>tasktracker.http.threads</name><value>20</value></property>
      <property><name>mapred.reduce.tasks</name><value>1</value></property>
      <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
      <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
      <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
      <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>io.file.buffer.size</name><value>65536</value></property>
      <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
      <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
      <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
      <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
      <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property>
      <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
      <property><name>io.sort.mb</name><value>150</value></property>
      <property><name>hadoop.job.history.user.location</name><value>none</value></property>
      <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
      <property><name>dfs.replication</name><value>1</value></property>
      <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
      <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
    </configuration>			

    In Hadoop 0.20, the configuration file names and locations are core-site.xml, hdfs-site.xml, and mapred-site.xml.

    The default core-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
      <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property>
      <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
      <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
      <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property>
      <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
      <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
      <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
      <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
      <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
    </configuration>

    The default hdfs-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
      <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
      <property><name>dfs.namenode.handler.count</name><value>20</value></property>
      <property><name>io.file.buffer.size</name><value>65536</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
      <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
      <property><name>dfs.replication</name><value>1</value></property>
      <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
      <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
      <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
      <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
      <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
      <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
      <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
    </configuration>

    The default mapred-site.xml properties are as follows.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>
      <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
      <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
      <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
      <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
      <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property>
      <property><name>io.sort.factor</name><value>40</value></property>
      <property><name>mapred.reduce.tasks</name><value>1</value></property>
      <property><name>tasktracker.http.threads</name><value>20</value></property>
      <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
      <property><name>hadoop.job.history.user.location</name><value>none</value></property>
      <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
      <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
      <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
      <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
      <property><name>mapred.compress.map.output</name><value>true</value></property>
      <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
      <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
      <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
      <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property>
      <property><name>io.sort.mb</name><value>150</value></property>
    </configuration>

    Run If

    You can use this predefined bootstrap action to run a command conditionally when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.

    The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

    The following procedure echoes the string "running on master node" if the node is a master node.

    To run a command conditionally

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
        --args "instance.isMaster=true,echo running on master node"
      • Windows users:

        ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"

    Note

    You must use commas to separate the arguments that you specify with the --args option.

    Shutdown Actions

    A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

    Note

    Shutdown action scripts are not guaranteed to run if the node terminates with an error.
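    For example, the following is a minimal sketch of a bootstrap action that registers a shutdown action; the script name and the log-copy command are illustrative.

        #!/bin/bash
        # Bootstrap action that installs a shutdown action. Scripts in this
        # directory run in parallel at cluster termination and must finish
        # within 60 seconds.
        set -e
        DIR=/mnt/var/lib/instance-controller/public/shutdown-actions
        sudo mkdir -p $DIR
        # Illustrative shutdown action: copy an application log to Amazon S3.
        # Replace the bucket and log path with your own.
        printf '%s\n' '#!/bin/bash' \
          'hadoop fs -put /mnt/var/log/myapp.log s3://myawsbucket/logs/ || true' \
          | sudo tee $DIR/save-logs.sh > /dev/null
        sudo chmod +x $DIR/save-logs.sh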

    Using Custom Bootstrap Actions

    In addition to the predefined bootstrap actions, you can write a custom script to perform your own bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

    Running Custom Bootstrap Actions from the CLI

    The following procedure uses a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

    The sample script looks like the following:

    #!/bin/bash
    set -e                  # exit immediately if any command fails
    bucket=elasticmapreduce
    path=samples/bootstrap-actions/file.tar.gz
    # Download the archive from Amazon S3 over HTTP, retrying up to 5 times.
    wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path
    mkdir -p /home/hadoop/contents
    # Extract the archive into the new directory.
    tar -C /home/hadoop/contents -xzf file.tar.gz

    To create a cluster with a custom bootstrap action

    • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
      • Windows users:

        ruby elastic-mapreduce --create --alive --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

    Running Custom Bootstrap Actions from the Amazon EMR Console

    The following procedure creates a predefined word count sample cluster with a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

    To create a cluster with a custom bootstrap action

    1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/vnext/.

    2. Click Create Cluster.

    3. In the Create Cluster page, click Configure sample application.

    4. In the Configure Sample Application page, in the Select sample application field, choose the Word count sample application from the list.

    5. In the Output location field, type the path of an Amazon S3 bucket to store your output and click OK.

    6. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

      Cluster Configuration
      Field | Action
      Cluster name

      Enter a descriptive name for your cluster.

      The name is optional, and does not need to be unique.

      Termination protection

      Choose Yes.

      Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Protect a Cluster from Termination. Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

      Logging

      Choose Enabled.

      This determines whether Amazon EMR captures detailed log data to Amazon S3.

      For more information, see View Log Files.

      Log folder S3 location

      Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field.

      When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

      For more information, see View Log Files.

      Debugging

      Choose Enabled.

      This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

    7. In the Software Configuration section, verify the fields according to the following table.

      Software Configuration
      Field | Action
      Hadoop distribution

      Choose Amazon.

      This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

      AMI version

      Choose 2.4.2 (Hadoop 1.0.3).

      This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image.

    8. In the Hardware Configuration section, verify the fields according to the following table.

      Note

      Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit will result in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

      Hardware Configuration
      Field | Action
      Network

      Choose Launch into EC2-Classic.

      Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

      EC2 Availability Zone

      Choose No preference.

      Optionally, you can launch the cluster in a specific EC2 Availability Zone.

      For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

      Master

      Choose m1.small.

      The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

      This specifies the EC2 instance type to use as the master node. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

      This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

      For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

      Request Spot Instances

      Leave this box unchecked.

      This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

      Core

      Choose m1.small.

      A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

      This specifies the EC2 instance type to use as core nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

      This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

      For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

      Count

      Choose 2.

      Request Spot Instances

      Leave this box unchecked.

      This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

      Task

      Choose m1.small.

      Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

      This specifies the EC2 instance type to use as task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

      For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

      Count

      Choose 0.

      Request Spot Instances

      Leave this box unchecked.

      This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    9. In the Security and Access section, complete the fields according to the following table.

      Security and Access
      Field | Action
      EC2 key pair

      Choose an Amazon EC2 key pair from the list.

      For more information, see Create an Amazon EC2 Key Pair and PEM File.

      If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

      Optionally, choose Proceed without an EC2 key pair.

      IAM user access

      Choose No other IAM users.

      Optionally, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

      IAM role

      Choose Proceed without role.

      This controls application access to the EC2 instances in the cluster.

      For more information, see Configure IAM Roles for Amazon EMR.

    10. In the Bootstrap Actions section, in the Add bootstrap action field, select Custom action and click Configure and add.

      Enter the following information:

      1. Enter the following text in the S3 location field:

        s3://elasticmapreduce/bootstrap-actions/download.sh
      2. Click Add.


      For more information, see Create Bootstrap Actions to Install Additional Software (Optional).

    11. In the Steps section, note the step that Amazon EMR configured for you by choosing the sample application.

      You do not need to change any of the settings in this section.

    12. Review your configuration and if you are satisfied with the settings, click Create Cluster.

    13. When the cluster starts, you see the Summary pane.


    While the cluster master node is running, you can connect to the master node and see the log files that the bootstrap action script generated in the /mnt/var/log/bootstrap-actions/1 directory.
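    For example, the following is a sketch of inspecting those logs over SSH; replace the key pair file and master node public DNS name with your own.

        # Connect to the master node as the hadoop user.
        ssh -i mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
        # On the master node, list the log files for the first bootstrap action.
        ls /mnt/var/log/bootstrap-actions/1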
