Amazon DynamoDB
Developer Guide (API Version 2012-08-10)

Guidelines for Working with Tables

This section covers some best practices for working with tables.

Design For Uniform Data Access Across Items In Your Tables

The optimal usage of a table's provisioned throughput depends on these factors:

  • The primary key selection.

  • The workload patterns on individual items.

When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions.

Total Provisioned Throughput / Partitions = Throughput Per Partition

Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the hash key values. Distributing requests across hash key values distributes the requests across partitions.

For example, if a table has a very small number of heavily accessed hash key elements, possibly even a single very heavily used hash key element, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the hash key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.

This does not mean that you must access all of the hash keys to achieve your throughput level; nor does it mean that the percentage of accessed hash keys needs to be high. However, do be aware that when your workload accesses more distinct hash keys, those requests will be spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you will utilize your throughput more efficiently as the ratio of hash keys accessed to total hash keys in a table grows.

Choosing a Hash Key

The following table compares some common hash key schemas for provisioned throughput efficiency:

Hash key value | Uniformity
User ID, where the application has many users. | Good
Status code, where there are only a few possible status codes. | Bad
Item creation date, rounded to the nearest time period (e.g. day, hour, minute). | Bad
Device ID, where each device accesses data at relatively similar intervals. | Good
Device ID, where even if there are a lot of devices being tracked, one is by far more popular than all the others. | Bad

If a single table has only a very small number of hash key values, consider distributing your write operations across more distinct hash values. In other words, structure the primary key elements to avoid one "hot" (heavily requested) hash key value that slows overall performance.

For example, consider a table with a hash and range type primary key. The hash key represents the item's creation date, rounded to the nearest day. The range key is an item identifier. On a given day, say 2014-07-09, all of the new items will be written to that same hash key value.

If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

However, if you anticipate scaling beyond a single partition, then you should architect your application so that it can use more of the table's full provisioned throughput.

Randomizing Across Multiple Hash Key Values

One way to increase the write throughput of this application would be to randomize the writes across multiple hash key values. Choose a random number from a fixed set (for example, 1 to 200) and concatenate it as a suffix to the date. This will yield hash key values such as 2014-07-09.1, 2014-07-09.2 and so on through 2014-07-09.200. Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput.

To read all of the items for a given day, you would need to obtain all of the items for each suffix. For example, you would first issue a Query request for the hash key 2014-07-09.1, then another Query for 2014-07-09.2, and so on through 2014-07-09.200. Finally, your application would need to merge the results from all of the Query requests.
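
The following is a minimal sketch (in Python, using boto3) of this random-suffix approach. The table name OrderTable, the hash key attribute Date, and the range key attribute OrderId are assumptions made for illustration; substitute the names from your own schema. Pagination of the Query results is omitted for brevity.

import random

import boto3
from boto3.dynamodb.conditions import Key

NUM_SUFFIXES = 200
table = boto3.resource('dynamodb').Table('OrderTable')

def put_order(item_date, order_id, attributes):
    # Append a random suffix (1-200) to the date so that a day's writes are
    # spread across many distinct hash key values (and therefore partitions).
    suffix = random.randint(1, NUM_SUFFIXES)
    item = {'Date': '{0}.{1}'.format(item_date, suffix), 'OrderId': order_id}
    item.update(attributes)
    table.put_item(Item=item)

def get_orders_for_day(item_date):
    # Reading a whole day back requires one Query per suffix; the application
    # then merges the results.
    results = []
    for suffix in range(1, NUM_SUFFIXES + 1):
        response = table.query(
            KeyConditionExpression=Key('Date').eq('{0}.{1}'.format(item_date, suffix)))
        results.extend(response['Items'])
    return results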

Using a Calculated Value

A randomizing strategy can greatly improve write throughput; however, it is difficult to read a specific item because you don't know which suffix value was used when writing the item. To make it easier to read individual items, you can use a different strategy: Instead of using a random number to distribute the items among partitions, use a number that you are able to calculate based upon something that's intrinsic to the item.

Continuing with our example, suppose that each item has an OrderId. Before your application writes the item to the table, it can calculate a hash key suffix based upon the order ID. The calculation should result in a number between 1 and 200 that is fairly evenly distributed across any set of order IDs.

A simple calculation would suffice, such as the product of the ASCII values of the characters in the order ID, modulo 200, plus 1. The hash key value would then be the date concatenated with the calculation result as a suffix. With this strategy, the writes are spread evenly across the hash key values, and thus across the partitions. You can easily perform a GetItem operation on a particular item, because you can calculate the hash key you need when you want to retrieve a specific OrderId value.
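
A minimal sketch of the calculated-suffix strategy follows, again in Python with boto3. The table name OrderTable and the attribute names Date and OrderId are illustrative assumptions; the suffix function implements the product-of-ASCII-values calculation described above.

import boto3

table = boto3.resource('dynamodb').Table('OrderTable')

def order_suffix(order_id):
    # Product of the ASCII values of the characters in the order ID,
    # modulo 200, plus 1; this always yields a value between 1 and 200.
    product = 1
    for ch in order_id:
        product *= ord(ch)
    return (product % 200) + 1

def hash_key_for(item_date, order_id):
    return '{0}.{1}'.format(item_date, order_suffix(order_id))

def get_order(item_date, order_id):
    # Because the suffix is derived from the order ID itself, a single GetItem
    # retrieves the item without having to query every suffix.
    response = table.get_item(
        Key={'Date': hash_key_for(item_date, order_id), 'OrderId': order_id})
    return response.get('Item')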

To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload.

Understand Partition Behavior

DynamoDB manages table partitioning for you automatically, adding new partitions as your table grows in size. You can estimate the number of partitions that DynamoDB will create for your table, and compare that estimate against your scale and access patterns. This can help you determine the best table design for your application needs.

The number of partitions in a table is based on the table's storage requirements, and also its provisioned throughput requirements.

Number of partitions required, based solely on a table's size

  • numPartitions_tableSize = tableSizeInBytes / 10 GB

The size of the table is one factor in determining the number of partitions needed. The other factor is the table's provisioned throughput requirements. For any one partition, DynamoDB can allocate a maximum of 3000 read capacity units or 1000 write capacity units.

Number of partitions required, based solely on a table's provisioned read and write throughput settings

  • numPartitions_throughput = ( readCapacityUnits / 3000 ) + ( writeCapacityUnits / 1000 )

For example, suppose that you provisioned a table with 1000 read capacity units and 500 write capacity units. In this case, numPartitions_throughput would be:

( 1000 / 3000 ) + ( 500 / 1000 ) = 0.8333

Therefore, a single partition could accommodate all of the table's provisioned throughput requirements.

However, if you provisioned 1000 read capacity units and 1000 write capacity units, then numPartitions_throughput would exceed a single partition's throughput capacity:

( 1000 / 3000 ) + ( 1000 / 1000 ) = 1.333

In this case, the table would require two partitions, each with 500 read capacity units and 500 write capacity units.

The total number of partitions allocated by DynamoDB is the larger of the two values: the number of partitions required by the table's size, or the number required by its provisioned throughput:

Total number of partitions allocated by DynamoDB

  • numPartitions_total = MAX ( numPartitions_tableSize, numPartitions_throughput )

A single partition can hold approximately 10 GB of data.  If your table size grows beyond 10 GB, DynamoDB will spread your data across additional partitions, and will also distribute your table's read and write throughput accordingly.  Therefore, as your table grows in size, less throughput will be provisioned per partition.

Suppose that you have a table that has grown to 500 GB in size. This would mean that the table now occupies approximately 50 partitions:

500 GB / 10 GB = 50 partitions

Now suppose that you have allocated 100,000 read capacity units to the table. You could determine the amount of read capacity per partition as follows:

100,000 read capacity units / 50 partitions = 2000 read capacity units per partition
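
If it helps, the partition estimate can be expressed as a short calculation. The following Python sketch applies the two formulas above to the 500 GB / 100,000 read capacity unit example; the 10 GB, 3000, and 1000 per-partition figures are the ones quoted in this section and may change in the future.

import math

def estimate_partitions(table_size_gb, read_capacity_units, write_capacity_units):
    by_size = table_size_gb / 10.0
    by_throughput = (read_capacity_units / 3000.0) + (write_capacity_units / 1000.0)
    # DynamoDB allocates the larger of the two estimates, rounded up.
    return max(1, int(math.ceil(max(by_size, by_throughput))))

partitions = estimate_partitions(500, 100000, 0)
print(partitions)               # 50
print(100000 // partitions)     # 2000 read capacity units per partition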

Note

In the future, these details of partition sizes and throughput allocation per partition may change.

Use Burst Capacity Sparingly

DynamoDB provides some flexibility in the per-partition throughput provisioning: When you are not fully utilizing a partition's throughput, DynamoDB reserves a portion of your unused capacity for later "bursts" of throughput usage. DynamoDB currently reserves up to 5 minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, this reserved throughput can be consumed very quickly — even faster than the per-second provisioned throughput capacity that you've defined for your table. However, do not design your application so that it depends on burst capacity being available at all times: DynamoDB can and does use burst capacity for background maintenance and other tasks without prior notice.

Note

In the future, these details of burst capacity may change.

Distribute Write Activity During Data Upload

There are times when you load data from other data sources into DynamoDB. Typically, DynamoDB partitions your table data on multiple servers. When uploading data to a table, you get better performance if you upload data to all the allocated servers simultaneously. For example, suppose you want to upload user messages to a DynamoDB table. You might design a table that uses a hash and range type primary key in which UserID is the hash attribute and the MessageID is the range attribute. When uploading data from your source, you might tend to read all message items for a specific user and upload these items to DynamoDB as shown in the sequence in the following table.

UserID | MessageID
U1 | 1
U1 | 2
U1 | ...
U1 | ... up to 100
U2 | 1
U2 | 2
U2 | ...
U2 | ... up to 200

The problem in this case is that you are not distributing your write requests to DynamoDB across your hash key values. You are taking one hash key at a time and uploading all its items before going to the next hash key items. Behind the scenes, DynamoDB is partitioning the data in your tables across multiple servers. To fully utilize all of the throughput capacity that has been provisioned for your tables, you need to distribute your workload across your hash keys. In this case, by directing an uneven amount of upload work toward items all with the same hash key, you may not be able to fully utilize all of the resources DynamoDB has provisioned for your table. You can distribute your upload work by uploading one item from each hash key first. Then you repeat the pattern for the next set of range keys for all the items until you upload all the data as shown in the example upload sequence in the following table:

UserID | MessageID
U1 | 1
U2 | 1
U3 | 1
... | ...
U1 | 2
U2 | 2
U3 | 2
... | ...

Every upload in this sequence uses a different hash key, keeping more DynamoDB servers busy simultaneously and improving your throughput performance.
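
The following Python sketch (boto3) illustrates this interleaved upload. The table name Messages and the shape of the input data are assumptions made for illustration; the point is simply that consecutive writes use different hash key values.

import boto3

table = boto3.resource('dynamodb').Table('Messages')

def interleaved(items_by_user):
    # items_by_user maps each UserID to its list of message items. Yield one
    # item per user in round-robin order, so that consecutive writes go to
    # different hash key values.
    iterators = [iter(messages) for messages in items_by_user.values()]
    while iterators:
        still_active = []
        for it in iterators:
            item = next(it, None)
            if item is not None:
                yield item
                still_active.append(it)
        iterators = still_active

def upload(items_by_user):
    with table.batch_writer() as batch:
        for item in interleaved(items_by_user):
            batch.put_item(Item=item)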

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with a hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the access pattern across the items is uneven. The latest customer data is the most relevant, so your application accesses the newest items most frequently; as time passes, those items are accessed less often, and eventually the oldest items are rarely accessed at all. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where the data access rate is high, request higher throughput; for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally back up these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which would essentially double the write throughput consumed because you would perform as many delete operations as put operations.
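
A minimal sketch of this one-table-per-time-period approach is shown below, using the boto3 client. The table names, key schema, and throughput values are illustrative assumptions.

import boto3

client = boto3.client('dynamodb')

def create_month_table(year_month, read_units, write_units):
    # For example, year_month '2014-07' creates a table named 'Events-2014-07'
    # with high provisioned throughput for the current period.
    client.create_table(
        TableName='Events-' + year_month,
        AttributeDefinitions=[
            {'AttributeName': 'CustomerId', 'AttributeType': 'S'},
            {'AttributeName': 'EventTime', 'AttributeType': 'S'},
        ],
        KeySchema=[
            {'AttributeName': 'CustomerId', 'KeyType': 'HASH'},
            {'AttributeName': 'EventTime', 'KeyType': 'RANGE'},
        ],
        ProvisionedThroughput={'ReadCapacityUnits': read_units,
                               'WriteCapacityUnits': write_units})

def age_out(previous_month, oldest_month):
    # Dial down the throughput on last period's table, and delete the oldest
    # table entirely (optionally after backing it up to Amazon S3).
    client.update_table(
        TableName='Events-' + previous_month,
        ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 5})
    client.delete_table(TableName='Events-' + oldest_month)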

Cache Popular Items

Some items in a table might be more popular than others. For example, consider the ProductCatalog table that is described in Example Tables and Data, and suppose that this table contains millions of different products. Some products might be very popular among customers, so those items would be consistently accessed more frequently than the others. As a result, the distribution of read activity on ProductCatalog would be highly skewed toward those popular items.

One solution would be to cache these reads at the application layer. Caching is a technique that is used in many high-throughput applications, offloading read activity on hot items to the cache rather than to the database. Your application can cache the most popular items in memory, or use a product such as ElastiCache to do the same.

Continuing with the ProductCatalog example, when a customer requests an item from that table, the application would first consult the cache to see if there is a copy of the item there. If so, it is a cache hit; otherwise, it is a cache miss. When there is a cache miss, the application would need to read the item from DynamoDB and store a copy of the item in the cache. Over time, the cache misses would decrease as the cache fills with the most popular items; applications would not need to access DynamoDB at all for these items.
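
As a rough sketch of this cache-aside pattern, the following Python example uses an in-process dictionary as the cache; a production application might use Amazon ElastiCache instead. The key attribute name Id is assumed for the ProductCatalog table.

import boto3

table = boto3.resource('dynamodb').Table('ProductCatalog')
cache = {}

def get_product(product_id):
    # Cache hit: return the copy held in memory without reading DynamoDB.
    if product_id in cache:
        return cache[product_id]
    # Cache miss: read the item from DynamoDB and remember it for next time.
    response = table.get_item(Key={'Id': product_id})
    item = response.get('Item')
    if item is not None:
        cache[product_id] = item
    return item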

A caching solution can mitigate the skewed read activity for popular items. In addition, since it reduces the amount of read activity against the table, caching can help reduce your overall costs for using DynamoDB.

Consider Workload Uniformity When Adjusting Provisioned Throughput

As the amount of data in your table grows, or as you provision additional read and write capacity, DynamoDB automatically spreads the data across multiple partitions. If your application doesn't require as much throughput, you simply decrease it using the UpdateTable operation, and pay only for the throughput that you have provisioned.

For applications that are designed for use with uniform workloads, DynamoDB's partition allocation activity is not noticeable. A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.

If you reduce the amount of provisioned throughput for your table, DynamoDB will not decrease the number of partitions. Suppose that you created a table with a much larger amount of provisioned throughput than your application actually needed, and then decreased the provisioned throughput later. In this scenario, the provisioned throughput per partition would be less than it would have been if you had initially created the table with less throughput.

For example, consider a situation where you need to bulk-load 20 million items into a DynamoDB table. Assume that each item is 1 KB in size, resulting in 20 GB of data. This bulk-loading task will require a total of 20 million write capacity units. To perform this data load within 30 minutes, you would need to set the provisioned write throughput of the table to 11,000 write capacity units.

The maximum write throughput of a partition is 1000 write capacity units (see Understand Partition Behavior); therefore, DynamoDB will create 11 partitions, each with 1000 provisioned write capacity units.

After the bulk data load, your steady-state write throughput requirements might be much lower — for example, suppose that your application only requires 200 writes per second. If you decrease the table's provisioned throughput to this level, each of the 11 partitions will have around 20 write capacity units per second provisioned. This level of per-partition provisioned throughput, combined with DynamoDB's bursting behavior, might be adequate for the application.

However, if an application will require sustained write throughput above 20 writes per second per partition, you should either: (a) design a schema that requires fewer writes per second per hash key, or (b) design the bulk data load so that it runs at a slower pace and reduces the initial throughput requirement. For example, suppose that it was acceptable to run the bulk import over 3 hours instead of just 30 minutes. In this scenario, only 1,900 write capacity units per second would need to be provisioned, rather than 11,000. As a result, DynamoDB would create only two partitions for the table.
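
The arithmetic behind this example can be sketched as follows; the 1000 write capacity units per partition figure is the one quoted in Understand Partition Behavior, and the item count and durations are taken from the example above.

import math

ITEMS = 20000000   # 20 million items at 1 KB each -> 20 million write units in total

def required_write_units_per_second(duration_seconds):
    return ITEMS / float(duration_seconds)

def partitions_for(write_capacity_units):
    # One partition supports at most 1000 write capacity units.
    return int(math.ceil(write_capacity_units / 1000.0))

print(required_write_units_per_second(30 * 60))    # ~11,111 -> provision roughly 11,000
print(required_write_units_per_second(3 * 3600))   # ~1,852  -> provision roughly 1,900
print(partitions_for(11000))                       # 11 partitions for the 30-minute load
print(partitions_for(1900))                        # 2 partitions for the 3-hour load
print(200.0 / partitions_for(11000))               # ~18 write units per partition at 200 writes/second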

Test Your Application At Scale

Many tables begin with small amounts of data, but then grow larger as applications perform write activity. This growth can occur gradually, without exceeding the provisioned throughput settings you have defined for the table. As your table grows larger, DynamoDB automatically scales your table out by distributing the data across more partitions. When this occurs, the provisioned throughput that is allocated to each resulting partition is less than that which is allocated for the original partition(s).

Suppose that your application accesses the table's data across all of the hash key values, but in a non-uniform fashion (accessing a small number of hash keys more frequently than others). Your application might perform acceptably when there is not very much data in the table. However, as the table becomes larger, there will be more partitions and less throughput per partition. You might discover that your application is throttled when it attempts to use the same non-uniform access pattern that worked in the past.

To avoid problems with "hot" keys when your table becomes larger, make sure that you test your application design at scale. Consider the ratio of storage to throughput when running at scale, and how DynamoDB will allocate partitions to the table. (For more information, see Understand Partition Behavior.)

If it isn't possible for you to generate a large amount of test data, you can create a table that has very high provisioned throughput settings. This will create a table with many partitions; you can then use UpdateTable to reduce the settings, but keep the same ratio of storage to throughput that you determined for running the application at scale. You now have a table that has the throughput-per-partition ratio that you expect after it grows to scale. Test your application against this table using a realistic workload.
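
A minimal sketch of the second step, reducing the provisioned throughput with UpdateTable after the table (and its partitions) has been created, might look like the following; the table name and capacity values are illustrative assumptions.

import boto3

client = boto3.client('dynamodb')

# The table 'TestAtScale' is assumed to have been created with very high
# provisioned throughput so that DynamoDB allocated many partitions up front.
# Reducing the settings afterwards does not reduce the number of partitions.
client.update_table(
    TableName='TestAtScale',
    ProvisionedThroughput={'ReadCapacityUnits': 1000, 'WriteCapacityUnits': 500})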

Tables that store time series data can grow in an unbounded manner, and can cause slower application performance over time. With time series data, applications typically read and write the most recent items in the table more frequently than older items. If you can remove older time series data from your real-time table, and archive that data elsewhere, you can maintain a high ratio of throughput per partition.

For best practices with time series data, see Understand Access Patterns for Time Series Data.