|« PreviousNext »|
|Did this page help you? Yes | No | Tell us about it...|
This section covers some best practices for working with tables.
Provisioned throughput is dependent on the primary key selection, and the workload patterns on individual items. When storing data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions.
Total provisioned throughput/partitions = throughput per partition.
Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the hash key values. Distributing requests across hash key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed hash key elements, possibly even a single very heavily used hash key element, traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning disproportionately focused on one or a few partitions, the operations will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, build tables where the hash key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
This behavior does not imply that you need to access all of the hash keys, or even that the percentage of accessed hash keys needs to be high to achieve your throughput level. But, be aware that when your workload accesses more distinct hash keys, those operations are spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you utilize throughput more efficiently as the ratio of hash keys accessed to total hash keys in a table grows.
The following table compares some common hash key schema for provisioned throughput efficiency.
|Hash key value||Efficiency|
User ID, where the application has many users.
|Status code, where there are only a few possible status codes.||Bad|
|Device ID, where even if there are a lot of devices being tracked, one is by far more popular than all the others.||Bad|
When the number of hash key values in a single table is very few, consider distributing your write operations across more distinct hash values. In other words, consider the primary key elements to avoid one "hot" (heavily requested) hash key value that slows overall performance.
For example, consider a composite primary hash and range key table where the hash key
represents a device ID, and where device ID "D17" is particularly heavily requested. To
increase the read and write throughput for this "hot" hash key, pick a random number
chosen from a fixed set (for example 1 to 200) and concatenate it with the device ID (so
D17.200). Due to
randomization, writes for device ID "D17" are spread evenly across the multiple hash key
values, yielding better parallelism and higher overall throughput.
This strategy greatly improves the write throughput, but reads for a specific item
become harder since you don't know which of the 200 keys contains the item. You can
improve this strategy to get better read characteristics: instead of choosing a
completely random number, choose a number that you are able to calculate from something
intrinsic to the item. For example, if the item represents a person that has the device,
calculate the hash key suffix from their name, or user ID. This calculation should
compute a number between 1 and 200 that is fairly evenly distributed given any set of
names (or user IDs.) A simple calculation generally suffices (such as, the product of
the ASCII values for the letters in the person's name modulo 200 + 1). Now, the writes
are spread evenly across the hash keys (and thus partitions). And
you can easily perform a get operation, because you can determine the hash key you need
when you want to retrieve a specific "device owner" value. Query operations still need
to run against all
D17.x keys, and your application needs some logic on the
client side to merge all of the query results for each hash key (200 in this case). But,
the schema avoids having one "hot" hash key taking all of the workload.
There are times when you load data from other data sources into DynamoDB. Typically, DynamoDB partitions your table data on multiple servers. When uploading data to a table, you get better performance if you upload data to all the allocated servers simultaneously. For example, suppose you want to upload user messages to a DynamoDB table. You might design a table that uses a hash and range type primary key in which UserID is the hash attribute and the MessageID is the range attribute. When uploading data from your source, you might tend to read all message items for a specific user and upload these items to DynamoDB as shown in the sequence in the following table.
|U1||... up to 100|
|U2||... up to 200|
The problem in this case is that you are not distributing your write requests to DynamoDB across your hash key values. You are taking one hash key at a time and uploading all its items before going to the next hash key items. Behind the scenes, DynamoDB is partitioning the data in your tables across multiple servers. To fully utilize all of the throughput capacity that has been provisioned for your tables, you need to distribute your workload across your hash keys. In this case, by directing an uneven amount of upload work toward items all with the same hash key, you may not be able to fully utilize all of the resources DynamoDB has provisioned for your table. You can distribute your upload work by uploading one item from each hash key first. Then you repeat the pattern for the next set of range keys for all the items until you upload all the data as shown in the example upload sequence in the following table:
Every upload in this sequence uses a different hash key, keeping more DynamoDB servers busy simultaneously and improving your throughput performance.
For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.
Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.
So storing "hot" items in one table with higher throughput and "cold" items with reduced throughput requirements help you save on resources. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
Some items in a table might be more popular than others. For example, consider the ProductCatalog table that is described in Example Tables and Data, and suppose that this table contains millions of different products. Some products might be very popular among customers, so those items would be consistently accessed more frequently than the others. As a result, the distribution of read activity on ProductCatalog would be highly skewed toward those popular items.
One solution would be to cache these reads at the application layer. Caching is a technique that is used in many high-throughput applications, offloading read activity on hot items to the cache rather than to the database. Your application can cache the most popular items in memory, or use a product such as ElastiCache to do the same.
Continuing with the ProductCatalog example, when a customer requests an item from that table, the application would first consult the cache to see if there is a copy of the item there. If so, it is a cache hit; otherwise, it is a cache miss. When there is a cache miss, the application would need to read the item from DynamoDB and store a copy of the item in the cache. Over time, the cache misses would decrease as the cache fills with the most popular items; applications would not need to access DynamoDB at all for these items.
A caching solution can mitigate the skewed read activity for popular items. In addition, since it reduces the amount of read activity against the table, caching can help reduce your overall costs for using DynamoDB.