Key concepts - Best Practices for Migrating from RDBMS to Amazon DynamoDB

Key concepts

As described in the previous section, DynamoDB organizes data into tables consisting of items. Each item in a DynamoDB table can define an arbitrary set of attributes, but all items in the table must define a primary key that uniquely identifies the item. This key must contain an attribute known as the partition key and optionally an attribute called the sort key. The following figure shows the structure of a DynamoDB table that defines both a partition and sort key.

DynamoDB Table with partition and sort keys

If an item can be uniquely identified by a single attribute value, then this attribute can function as the partition key. In other cases, an item may be uniquely identified by two values. In this case, the primary key will be defined as a composite of the partition key and the sort key. The following figure demonstrates this concept.

Example of partition and sort keys

RDBMS tables relating media files with the codec used to transcode them can be modeled as a single table in DynamoDB using a primary key consisting of a partition and sort key. Note how the data is de-normalized in the DynamoDB table. This is a common practice when migrating data from an RDBMS to a NoSQL database, and will be discussed in more detail later in this whitepaper.
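The composite-key idea above can be sketched in a few lines of plain Python: a table behaves like a map keyed by the (partition key, sort key) pair, so two items may share a partition key as long as their sort keys differ. The media/codec attribute names below are hypothetical, not taken from the whitepaper's figure.

```python
# Sketch: a composite primary key modeled as a (partition key, sort key) tuple.
# Attribute names (filename, codec, bitrate_kbps) are illustrative only.
table = {}

def put_item(partition_key, sort_key, attributes):
    """Store an item; the (partition_key, sort_key) pair uniquely identifies it."""
    table[(partition_key, sort_key)] = attributes

# Same partition key, different sort keys: two distinct, denormalized items.
put_item("video123.mp4", "h264", {"bitrate_kbps": 3500})
put_item("video123.mp4", "vp9", {"bitrate_kbps": 2800})

print(table[("video123.mp4", "h264")]["bitrate_kbps"])  # 3500
print(len(table))                                       # 2
```

Note how each item carries its own copy of any shared attributes instead of joining to a second table, which is the denormalization described above.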

The ideal partition key will contain a large number of distinct values uniformly distributed across the items in the table. A user ID is a good example of an attribute that tends to be uniformly distributed across items in a table. Attributes that would be modeled as lookup values or enumerations in an RDBMS tend to make poor partition keys, because certain values may occur much more frequently than others. These concepts are shown in Table 2. Notice how the counts of user_id are uniform, whereas the counts of status_code are not. If status_code is used as a partition key in a DynamoDB table, the value that occurs most frequently will be stored on a single partition, which means that most reads and writes will hit that single partition. This is called a hot partition, and it negatively impacts performance.

Table 2 – Uniform and non-uniform distribution of potential key values

select user_id, count(*) as total from user_preferences group by user_id

user_id total
8a9642f7-5155-4138-bb63-870cd45d7e19 1
31667c72-86c5-4afb-82a1-a988bfe34d49 1
693f8265-b0d2-40f1-add0-bbe2e8650c08 1

select status_code, count(*) as total from status_code sc, log l where l.status_code_id = sc.status_code_id group by status_code

status_code total
400 125000
403 250
500 1000
505 2
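The skew in Table 2 can be made concrete with a small sketch: hash each key value to one of a fixed number of partitions (roughly as DynamoDB hashes partition keys to physical partitions) and compare how uniform and non-uniform keys spread. The partition count and key values are illustrative assumptions.

```python
import hashlib
from collections import Counter

def partition_for(key, num_partitions=4):
    """Hash a partition key value to one of num_partitions partitions,
    loosely mimicking how DynamoDB distributes keys across partitions."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Uniform keys (like user IDs) spread roughly evenly across partitions...
uniform = Counter(partition_for(f"user-{i}") for i in range(10_000))

# ...while a skewed key (like status_code, dominated by 400s) piles onto one.
skewed = Counter(partition_for(code) for code in [400] * 9_000 + [500] * 1_000)

print(max(uniform.values()) / 10_000)  # close to 0.25 with 4 partitions
print(max(skewed.values()) / 10_000)   # at least 0.9: a hot partition
```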

Items can be fetched from a table using the primary key. Often, it is useful to be able to fetch items using a different set of values than the partition and sort keys. DynamoDB supports these operations through local and global secondary indexes. A local secondary index uses the same partition key as the table, but a different attribute as the sort key. The following figure shows how a local secondary index is defined on a table. A global secondary index can use any scalar attribute as the partition or sort key. An important difference between the two index types is that a local secondary index can only be created at the time of table creation and remains until the table is deleted, while a global secondary index can be created and deleted at any time. Fetching items through a secondary index is done using the query interface defined in the DynamoDB API.

A local secondary index
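Conceptually, a global secondary index is a second lookup structure keyed by a different attribute than the table's primary key. The sketch below models this with plain dictionaries; the table contents and the codec_vendor index attribute are hypothetical.

```python
# Sketch: a global secondary index as a second lookup keyed by a different
# attribute. Item shapes and the codec_vendor attribute are illustrative.
table = {
    ("video123.mp4", "h264"): {"codec_vendor": "acme", "bitrate_kbps": 3500},
    ("video456.mp4", "h264"): {"codec_vendor": "acme", "bitrate_kbps": 2000},
}

# Build the index: map each index key value to the primary keys of matching items.
gsi_by_vendor = {}
for primary_key, item in table.items():
    gsi_by_vendor.setdefault(item["codec_vendor"], []).append(primary_key)

# A query on the index returns all items sharing the index partition key value.
matches = [table[pk] for pk in gsi_by_vendor["acme"]]
print(len(matches))  # 2
```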

Adding secondary indexes consumes additional storage and write capacity, so, as with any database, it is important to limit the number of indexes you define for a table. This requires understanding the data access requirements of any application that uses DynamoDB for persistent storage. In addition, global secondary indexes require that attribute values be projected into the index: when an index is created, a subset of attributes from the parent table needs to be selected for inclusion in the index. When an item is queried using a global secondary index, the only attributes populated in the returned item are those that have been projected into the index. The following figure demonstrates this concept.

Create a global secondary index on a table
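Projection can be sketched as filtering an item down to the chosen attribute set when it is written into the index: a query against the index can only return what was projected. The attribute names and projection list below are hypothetical.

```python
# Sketch: attribute projection into a GSI. Only projected attributes are
# materialized in the index, so only they come back from index queries.
item = {"filename": "video123.mp4", "codec": "h264",
        "bitrate_kbps": 3500, "codec_vendor": "acme"}

PROJECTED = {"filename", "codec", "codec_vendor"}  # hypothetical projection set

index_entry = {k: v for k, v in item.items() if k in PROJECTED}
print("bitrate_kbps" in index_entry)  # False: not projected, so not returned
print(index_entry["codec_vendor"])    # acme
```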

The original partition and sort key attributes are automatically projected into the global secondary index. Reads on global secondary indexes are always eventually consistent, whereas local secondary indexes support eventual or strong consistency. Finally, both local and global secondary indexes consume capacity for reads and writes to the index. This means that when an item is inserted or updated in the main table, secondary indexes consume capacity to update the index. The only exceptions are cases where the item isn't written to an index because the attributes that are part of the index's primary key are not present in the item (refer to Sparse Indexes), or where an item modification isn't reflected in the index because the changed attributes aren't projected into the index.
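The sparse-index exception can be illustrated in one line: only items that actually define the index's key attribute appear in the index at all. The item shapes and the escalated_at attribute are hypothetical.

```python
# Sketch of a sparse index: only items carrying the index key attribute
# are written to the index. Attribute names are illustrative.
items = [
    {"user_id": "u1", "ticket_id": "t1", "escalated_at": "2024-01-05"},
    {"user_id": "u2", "ticket_id": "t2"},  # no escalated_at: skipped by the index
]

sparse_index = [it for it in items if "escalated_at" in it]
print(len(sparse_index))  # 1
```

An item without the key attribute also consumes no write capacity on that index, which is what makes sparse indexes cheap for flagging a small subset of items.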

DynamoDB allows for specifying the capacity mode for each table. With the on-demand capacity mode, which is suitable for workloads that are less predictable, the service takes care of managing capacity for you, and you only pay for what you consume. With provisioned capacity mode you are required to specify the table’s read and write capacity and you pay based on the provisioned capacity.

Whenever an item is read from or written to a DynamoDB table or index, the capacity required to perform the operation is expressed as read capacity units (RCUs) or write capacity units (WCUs). One RCU represents one strongly consistent read per second, or two eventually consistent reads per second, for items up to 4KB in combined size. Transactional read requests require two RCUs to perform one read per second for items up to 4KB. One WCU represents one write per second for an item up to 1KB in size. Transactional write requests require two WCUs to perform one write per second for items up to 1KB. This means that fetching one or more items with a total size of 8KB in a single strongly consistent read consumes two RCUs, and a regular (non-transactional) insert of an 8KB item consumes eight WCUs.
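The RCU/WCU arithmetic above can be captured in two small helper functions; these are a sketch of the rates stated in the text, not a billing calculator.

```python
import math

def read_capacity_units(item_kb, consistency="eventual"):
    """Capacity to read an item once, per the rates in the text: 1 RCU per
    4KB strongly consistent, half that eventually consistent, double
    for transactional reads."""
    units = math.ceil(item_kb / 4)
    return {"strong": units, "eventual": units / 2, "transactional": units * 2}[consistency]

def write_capacity_units(item_kb, transactional=False):
    """1 WCU per 1KB written (rounded up); transactional writes cost double."""
    units = math.ceil(item_kb)
    return units * 2 if transactional else units

print(read_capacity_units(8, "strong"))  # 2, matching the 8KB read example
print(write_capacity_units(8))           # 8, matching the 8KB insert example
```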

With provisioned capacity mode you choose the number of RCUs and WCUs the table supports. If your application requires that 1,000 4KB items be written per second, then the table's provisioned write capacity would need to be a minimum of 4,000 WCUs. When insufficient read or write capacity is provisioned on a table, the DynamoDB service throttles the read and write operations, which can result in poor performance and, in some cases, throttling exceptions in the client application. For this reason, it is important to understand an application's I/O requirements when designing the tables. However, both read and write capacity can be altered dynamically on an existing table. If an application suddenly experiences a spike in usage that results in throttling, the provisioned capacity can be increased to handle the new workload; similarly, if load decreases, the provisioned capacity can be reduced. This dynamic change in the table's read or write capacity can be achieved through a simple API call, or automatically through DynamoDB auto scaling. In addition, you can change the table's capacity mode once every 24 hours. This ability to dynamically alter the I/O characteristics of a table is a key differentiator between DynamoDB and a traditional RDBMS, in which I/O throughput is fixed by the underlying hardware the database engine runs on. This means that in many cases DynamoDB can be much more cost-effective than a traditional RDBMS, which is usually provisioned for peak consumption and stays underutilized the majority of the time.
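The provisioning arithmetic in the example above is just throughput multiplied by per-write cost; a minimal sketch:

```python
import math

def minimum_provisioned_wcus(writes_per_second, item_kb):
    """Minimum WCUs to provision: each non-transactional write consumes
    1 WCU per 1KB of item size, rounded up."""
    return writes_per_second * math.ceil(item_kb)

print(minimum_provisioned_wcus(1000, 4))  # 4000, matching the sizing example
```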