DynamoDB provisioned throughput - Amazon DynamoDB

DynamoDB provisioned throughput

When you issue HiveQL statements against the external DynamoDB table, the DynamoDBStorageHandler class makes the appropriate low-level DynamoDB API requests, which consume provisioned throughput. If there is not enough read or write capacity on the DynamoDB table, the request will be throttled, resulting in slow HiveQL performance. For this reason, you should ensure that the table has enough throughput capacity.

For example, suppose that you have provisioned 100 read capacity units for your DynamoDB table. This will let you read 409,600 bytes per second (100 × 4 KB read capacity unit size). Now suppose that the table contains 20 GB of data (21,474,836,480 bytes) and you want to use the SELECT statement to select all of the data using HiveQL. You can estimate how long the query will take to run like this:

21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours

In this scenario, the DynamoDB table is a bottleneck. It won't help to add more Amazon EMR nodes, because the Hive throughput is constrained to only 409,600 bytes per second. The only way to decrease the time required for the SELECT statement is to increase the provisioned read capacity of the DynamoDB table.

You can perform a similar calculation to estimate how long it would take to bulk-load data into a Hive external table mapped to a DynamoDB table. Determine the total number of write capacity units needed per item (less than 1KB = 1, 1-2KB = 2, etc), and multiply that by the number of items to load. This will give you the number of write capacity units required. Divide that number by the number of write capacity units that are allocated per second. This will yield the number of seconds it will take to load the table.

You should regularly monitor the CloudWatch metrics for your table. For a quick overview in the DynamoDB console, choose your table and then choose the Metrics tab. From here, you can view read and write capacity units consumed and read and write requests that have been throttled.

Read capacity

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of ProvisionedThroughputExceeded messages in the job output, you can adjust the default read rate. To do this, you can modify the dynamodb.throughput.read.percent configuration variable. You can use the SET command to set this variable at the Hive command prompt:

SET dynamodb.throughput.read.percent=1.0;

This variable persists for the current Hive session only. If you exit Hive and return to it later, dynamodb.throughput.read.percent will return to its default value.

The value of dynamodb.throughput.read.percent can be between 0.1 and 1.5, inclusively. 0.5 represents the default read rate, meaning that Hive will attempt to consume half of the read capacity of the table. If you increase the value above 0.5, Hive will increase the request rate; decreasing the value below 0.5 decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned read capacity of the table, or if your read requests are being throttled too much, try reducing dynamodb.throughput.read.percent below 0.5. If you have sufficient read capacity in the table and want more responsive HiveQL operations, you can set the value above 0.5.

Write capacity

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of ProvisionedThroughputExceeded messages in the job output, you can adjust the default write rate. To do this, you can modify the dynamodb.throughput.write.percent configuration variable. You can use the SET command to set this variable at the Hive command prompt:

SET dynamodb.throughput.write.percent=1.0;

This variable persists for the current Hive session only. If you exit Hive and return to it later, dynamodb.throughput.write.percent will return to its default value.

The value of dynamodb.throughput.write.percent can be between 0.1 and 1.5, inclusively. 0.5 represents the default write rate, meaning that Hive will attempt to consume half of the write capacity of the table. If you increase the value above 0.5, Hive will increase the request rate; decreasing the value below 0.5 decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned write capacity of the table, or if your write requests are being throttled too much, try reducing dynamodb.throughput.write.percent below 0.5. If you have sufficient capacity in the table and want more responsive HiveQL operations, you can set the value above 0.5.

When you write data to DynamoDB using Hive, ensure that the number of write capacity units is greater than the number of mappers in the cluster. For example, consider an Amazon EMR cluster consisting of 10 m1.xlarge nodes. The m1.xlarge node type provides 8 mapper tasks, so the cluster would have a total of 80 mappers (10 × 8). If your DynamoDB table has fewer than 80 write capacity units, then a Hive write operation could consume all of the write throughput for that table.

To determine the number of mappers for Amazon EMR node types, see Task Configuration in the Amazon EMR Developer Guide.

For more information on mappers, see Adjusting the mappers.