Adjusting the mappers

When Hive launches a Hadoop job, the job is processed by one or more mapper tasks. Assuming that your DynamoDB table has sufficient throughput capacity, you can modify the number of mappers in the cluster, potentially improving performance.

Note

The number of mapper tasks used in a Hadoop job is influenced by input splits, where Hadoop subdivides the data into logical blocks. If Hadoop does not perform enough input splits, your write operations might not be able to consume all of the write throughput available in the DynamoDB table.
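
For example, when you write to DynamoDB with an INSERT statement, one way to encourage more input splits (and therefore more mappers) is to lower the maximum split size before running the query. The following is only a sketch: it assumes the older mapred.* parameter naming used elsewhere in this section, and the table names ddb_features and s3_features are hypothetical.

-- Allow smaller input splits so that Hadoop launches more mappers for the write.
-- The split size shown (128 MiB) is an illustration, not a recommendation.
SET mapred.max.split.size=134217728;

INSERT OVERWRITE TABLE ddb_features
SELECT * FROM s3_features;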

Increasing the number of mappers

Each mapper in an Amazon EMR cluster has a maximum read rate of 1 MiB per second. The number of mappers in a cluster depends on the size of the nodes in your cluster. (For information about node sizes and the number of mappers per node, see Task Configuration in the Amazon EMR Developer Guide.)

If your DynamoDB table has ample throughput capacity for reads, you can try increasing the number of mappers by doing one of the following:

  • Increase the size of the nodes in your cluster. For example, if your cluster is using m1.large nodes (three mappers per node), you can try upgrading to m1.xlarge nodes (eight mappers per node).

  • Increase the number of nodes in your cluster. For example, if you have a three-node cluster of m1.xlarge nodes, you have a total of 24 mappers available. If you were to double the size of the cluster, with the same type of node, you would have 48 mappers.
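
To put these numbers in perspective: with 24 mappers, each reading at up to 1 MiB per second, the cluster can read from DynamoDB at a combined rate of roughly 24 MiB per second; with 48 mappers, that ceiling roughly doubles. In either case, the actual rate is still limited by the read capacity provisioned for the table.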

You can use the AWS Management Console to manage the size or the number of nodes in your cluster. (You might need to restart the cluster for these changes to take effect.)

Another way to increase the number of mappers is to modify the mapred.tasktracker.map.tasks.maximum Hadoop configuration parameter. (This is a Hadoop parameter, not a Hive parameter; you cannot modify it interactively from the command prompt.) Increasing the value of mapred.tasktracker.map.tasks.maximum lets you add mappers without increasing the size or number of nodes. However, if you set the value too high, the cluster nodes can run out of memory.

You set the value for mapred.tasktracker.map.tasks.maximum as a bootstrap action when you first launch your Amazon EMR cluster. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software in the Amazon EMR Management Guide.

Decreasing the number of mappers

If you use the SELECT statement to select data from an external Hive table that maps to DynamoDB, the Hadoop job can use as many tasks as necessary, up to the maximum number of mappers in the cluster. In this scenario, it is possible that a long-running Hive query can consume all of the provisioned read capacity of the DynamoDB table, negatively impacting other users.

You can use the dynamodb.max.map.tasks parameter to set an upper limit for map tasks:

SET dynamodb.max.map.tasks=1

This value must be equal to or greater than 1. When Hive processes your query, the resulting Hadoop job uses no more map tasks than the value of dynamodb.max.map.tasks when reading from the DynamoDB table.
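
For example, the following sketch caps a long-running aggregation at two map tasks. The external table definition is included only to make the example self-contained; the Hive table name (ddb_features), its columns, the DynamoDB table name (Features), and the column mapping are all hypothetical.

-- Map a hypothetical Hive external table to a DynamoDB table named "Features".
CREATE EXTERNAL TABLE ddb_features (feature_id BIGINT, feature_name STRING, state_alpha STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Features",
  "dynamodb.column.mapping" = "feature_id:Id,feature_name:Name,state_alpha:State"
);

-- Cap the read at two map tasks so this query cannot consume all of the
-- table's provisioned read capacity.
SET dynamodb.max.map.tasks=2;

SELECT state_alpha, count(*)
FROM ddb_features
GROUP BY state_alpha;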