Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Task Configuration (Hadoop 1.0.3)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

  • mapred.tasktracker.map.tasks.maximum

  • mapred.tasktracker.reduce.tasks.maximum

Amazon EMR provides defaults that are entirely dependent on the EC2 instance type. The following table shows the default settings for clusters launched with AMIs after 2.4.6.

EC2 Instance Name   Mappers   Reducers
m1.small            2         1
m1.medium           2         1
m1.large            3         1
m1.xlarge           8         3
m2.xlarge           3         1
m2.2xlarge          6         2
m2.4xlarge          14        4
m3.xlarge           6         1
m3.2xlarge          12        3
c1.medium           2         1
c1.xlarge           7         2
c3.xlarge           6         1
c3.2xlarge          12        3
c3.4xlarge          24        6
c3.8xlarge          48        12
cc1.4xlarge         12        3
cc2.8xlarge         24        6
hi1.4xlarge         24        6
hs1.8xlarge         24        6
cg1.4xlarge         12        3
g2.2xlarge          8         4
i2.xlarge           12        3
i2.2xlarge          12        3
i2.4xlarge          24        6
i2.8xlarge          48        12
r3.xlarge           6         1
r3.2xlarge          12        3
r3.4xlarge          24        6
r3.8xlarge          48        12

Note

The number of default mappers is based on the memory available on each EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.
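The trade-off in this note can be sketched as simple arithmetic. The following is a minimal illustration, not an Amazon EMR rule: it assumes you want the total heap across all task slots on a node to stay the same when you add mappers. The helper name scaled_xmx_mb is hypothetical, the sample numbers are the m1.xlarge values from the tables in this section (8 mappers, 3 reducers, -Xmx768m), and chaining several "-m" pairs in one --args value is an assumption about the configure-hadoop bootstrap action.

```shell
# Hypothetical helper: shrink the per-task heap so that the total heap
# across all task slots on a node stays constant (an assumed budgeting rule).
# Usage: scaled_xmx_mb OLD_MAPPERS REDUCERS OLD_XMX_MB NEW_MAPPERS
scaled_xmx_mb() {
  local old_mappers=$1
  local reducers=$2
  local old_xmx=$3
  local new_mappers=$4
  # Total heap currently committed to task JVMs, in MB.
  local total=$(( (old_mappers + reducers) * old_xmx ))
  # Spread the same total over the new number of slots (rounds down).
  echo $(( total / (new_mappers + reducers) ))
}

# Example: raise m1.xlarge from 8 to 10 mappers.
new_xmx=$(scaled_xmx_mb 8 3 768 10)

# Print an --args value for the configure-hadoop bootstrap action.
echo "-m,mapred.tasktracker.map.tasks.maximum=10,-m,mapred.child.java.opts=-Xmx${new_xmx}m"
```

Treat the result as a starting point only; the right heap size depends on what your tasks actually allocate.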

Tasks per Job (AMI 2.3)

When your cluster runs, Hadoop creates a number of map and reduce tasks. The number of tasks determines how much work can run simultaneously on your cluster. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.
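A rough way to reason about this balance is to count scheduling rounds ("waves"). The sketch below is a simplified model that assumes every task takes about the same time and that all slots are available; the function names are hypothetical.

```shell
# waves TASKS SLOTS
# Number of scheduling rounds needed to run TASKS tasks on SLOTS slots
# (ceiling division).
waves() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# idle_slots_in_last_wave TASKS SLOTS
# Number of slots with nothing to do during the final round.
idle_slots_in_last_wave() {
  local remainder=$(( $1 % $2 ))
  if [ "$remainder" -eq 0 ]; then
    echo 0
  else
    echo $(( $2 - remainder ))
  fi
}

# Example: 25 tasks on a cluster with 12 slots.
echo "waves: $(waves 25 12), idle slots in last wave: $(idle_slots_in_last_wave 25 12)"
```

In this model, 25 tasks on 12 slots need a third wave that uses only one slot, while 24 tasks finish in two full waves.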

Amazon EMR determines the number of map tasks from the size and number of files in your input data. You configure the number of reduce tasks.

The following table describes the four parameters that influence these task counts: two apply to map tasks and two apply to reduce tasks.

Parameter                     Description
mapred.map.tasks              Target number of map tasks to run. The actual number of tasks created is sometimes different from this number.
mapred.map.tasksperslot       Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.
mapred.reduce.tasks           Number of reduce tasks to run.
mapred.reduce.tasksperslot    Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon EMR. They only take effect if mapred.*.tasks is not defined. The order of precedence is:

  1. mapred.map.tasks set by the Hadoop job

  2. mapred.map.tasks set in mapred-site.xml on the master node

  3. mapred.map.tasksperslot if neither of those is defined
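The order of precedence above can be sketched as a small lookup. This is an illustration only: the function name resolve_map_tasks is hypothetical, an empty string stands for "not set", the ratio is restricted to whole numbers because shell arithmetic is integer-only, and the slot count is assumed to be the number of task nodes multiplied by mapred.tasktracker.map.tasks.maximum.

```shell
# resolve_map_tasks JOB_VALUE CONFIG_FILE_VALUE TASKS_PER_SLOT MAP_SLOTS
# Prints the target number of map tasks according to the precedence list.
# Pass "" for a value that is not set.
resolve_map_tasks() {
  local job_value=$1
  local config_value=$2
  local per_slot=$3
  local map_slots=$4
  if [ -n "$job_value" ]; then
    # 1. mapred.map.tasks set by the Hadoop job wins.
    echo "$job_value"
  elif [ -n "$config_value" ]; then
    # 2. Otherwise use mapred.map.tasks from the master node configuration.
    echo "$config_value"
  else
    # 3. Otherwise fall back to the per-slot ratio.
    echo $(( per_slot * map_slots ))
  fi
}

# Example: nothing set explicitly, ratio of 2, and 12 map slots
# (for instance, 4 m1.large nodes with 3 map slots each).
resolve_map_tasks "" "" 2 12
```

The same logic would apply to the reduce parameters, on the assumption that they follow the same precedence.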

Task JVM Settings (AMI 2.3)

You can configure the amount of heap space for tasks, as well as other JVM options, with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value for this setting; the defaults per instance type are shown in the following table.

Amazon EC2 Instance Name   Default JVM value
m1.small                   -Xmx288m
m1.medium                  -Xmx576m
m1.large                   -Xmx864m
m1.xlarge                  -Xmx768m
c1.medium                  -Xmx288m
c1.xlarge                  -Xmx384m
m2.xlarge                  -Xmx2304m
m2.2xlarge                 -Xmx2688m
m2.4xlarge                 -Xmx2304m
cc1.4xlarge                -Xmx912m
cc2.8xlarge                -Xmx1536m
hi1.4xlarge                -Xmx2048m
hs1.8xlarge                -Xmx1536m
cg1.4xlarge                -Xmx864m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

To modify JVM reuse settings using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "JVM infinite reuse" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Configuring infinite JVM reuse" \
      --args "-m,mapred.job.reuse.jvm.num.tasks=-1"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "JVM infinite reuse" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Configuring infinite JVM reuse" --args "-m,mapred.job.reuse.jvm.num.tasks=-1"

Note

Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you can override it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 1 means do not reuse JVMs.
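To see how the reuse value amortizes JVM start-up cost, the following sketch counts JVM launches for a series of tasks. It is a simplified model, not a description of the Hadoop scheduler: it assumes all tasks belong to the same job and run one after another in a single slot, and that the reuse value is either -1 or a positive integer. The function name jvm_launches is hypothetical.

```shell
# jvm_launches TASKS REUSE
# Number of JVMs started to run TASKS sequential tasks of one job in one slot,
# where REUSE is the value of mapred.job.reuse.jvm.num.tasks.
jvm_launches() {
  local tasks=$1
  local reuse=$2
  if [ "$reuse" -eq -1 ]; then
    # Infinite reuse: one JVM serves every task (none if there are no tasks).
    if [ "$tasks" -gt 0 ]; then echo 1; else echo 0; fi
  else
    # Each JVM runs at most REUSE tasks (ceiling division).
    echo $(( (tasks + reuse - 1) / reuse ))
  fi
}

# Example: 100 small tasks with the Amazon EMR default of 20.
jvm_launches 100 20
```

In this model, a value of 1 costs one JVM start per task, which is why reuse matters most when tasks are short.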

Avoiding Cluster Slowdowns (AMI 2.3)

In a distributed environment, you will experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the job progresses, some machines complete their tasks and become free. Hadoop schedules duplicate copies of tasks that are still running on those free nodes. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs.

MapReduce algorithms are generally designed so that the processing of map tasks is idempotent. However, if you are running a job where the task execution has side effects (for example, a zero-reducer job that calls an external resource), it is important to disable speculative execution.
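The first-finisher-wins behavior can be expressed as a one-line comparison. The sketch below is a toy model with made-up durations: it takes the time a straggling attempt still needs and the time a duplicate attempt on a free node would need, and reports when the task completes. The function name effective_seconds is hypothetical.

```shell
# effective_seconds ORIGINAL_REMAINING BACKUP_DURATION
# Time until the task completes once a duplicate attempt has been started:
# whichever attempt finishes first wins, and the other is killed.
effective_seconds() {
  if [ "$1" -le "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# Example: the straggler needs 600 more seconds; a fresh attempt needs 90.
effective_seconds 600 90
```

Both attempts run the task body, which is why a task with external side effects may perform them twice.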

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.3. You can override these settings with a bootstrap action. For more information about using bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional).

Speculative Execution Parameters

Parameter                                    Default Setting
mapred.map.tasks.speculative.execution       true
mapred.reduce.tasks.speculative.execution    true

To disable reducer speculative execution using a bootstrap action

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Reducer speculative execution" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Disable reducer speculative execution" \
      --args "-m,mapred.reduce.tasks.speculative.execution=false"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Reducer speculative execution" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Disable reducer speculative execution" --args "-m,mapred.reduce.tasks.speculative.execution=false"
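For a job whose map tasks have side effects, you would disable speculative execution for mappers as well. The following sketch builds the --args value for one or both phases from the parameter names in the table above. The helper name speculation_args is hypothetical, and passing several "-m" pairs in a single comma-separated --args value is an assumption about how the configure-hadoop bootstrap action parses its arguments.

```shell
# speculation_args PHASE...
# Builds an --args value that sets mapred.PHASE.tasks.speculative.execution
# to false for each PHASE given (map, reduce, or both).
speculation_args() {
  local args=""
  local phase
  for phase in "$@"; do
    # Append a comma only when a previous pair is already present.
    args="${args:+$args,}-m,mapred.${phase}.tasks.speculative.execution=false"
  done
  echo "$args"
}

# Example: disable speculative execution for both mappers and reducers.
speculation_args map reduce
```

You could then pass the result to the command shown above, for example with --args "$(speculation_args map reduce)" on Linux, UNIX, or Mac OS X.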