Data Models - Comparing the Use of Amazon DynamoDB and Apache HBase for NoSQL

Data Models

Amazon DynamoDB is a key/value as well as a document store and Apache HBase is a key/value store. For a meaningful comparison of Amazon DynamoDB with Apache HBase as a NoSQL data store, this document focuses on the key/value data model for Amazon DynamoDB.

Amazon DynamoDB and Apache HBase are both designed to deliver significant performance benefits through low latency and high throughput. To achieve this goal, key/value stores and document stores have simpler and less constrained data models than traditional relational databases. Although the fundamental data model building blocks are similar in Amazon DynamoDB and Apache HBase, each database uses distinct terminology to describe its specific data model.

At a high level, a database is a collection of tables, and each table is a collection of rows. A row can contain one or more columns. NoSQL database tables typically do not require a formal schema, except for a mandatory primary key that uniquely identifies each row. The following table illustrates the high-level concept of a NoSQL database.

Table 2: High-Level NoSQL Database Table Representation

Table
| Row | Primary Key | Column 1 |

Columnar databases store each column separately, so that aggregate operations over one column of an entire table are significantly quicker than they are with the traditional row storage model.

From a comparative standpoint, a row in Amazon DynamoDB is referred to as an item, and each item can have any number of attributes. An attribute comprises a key and a value, and is commonly referred to as a name-value pair. An Amazon DynamoDB table can have an unlimited number of items indexed by primary key, as shown in the following example.

Table 3: High-Level Representation of Amazon DynamoDB Table

Table
| Item 1 | Primary Key | Attribute 1 | Attribute 2 | Attribute 3 | Attribute …n |
| Item 2 | Primary Key | Attribute 1 |             | Attribute 3 |              |
| Item n | Primary Key |             | Attribute 2 | Attribute 3 |              |

Amazon DynamoDB defines two types of primary keys: a simple primary key with one attribute called a partition key (Table 4) and a composite primary key with two attributes (Table 5).

Table 4: Amazon DynamoDB Simple Primary Key (Partition Key)

Table
| Item | Partition Key | Attribute 1 | Attribute 2 | Attribute 3 | Attribute …n |

Table 5: Amazon DynamoDB Composite Primary Key (Partition & Sort Key)

Table
| Item | Partition Key | Sort Key | Attribute 1 | Attribute 2 | Attribute 3 | Attribute …n |

A JSON representation of the item in Table 5, with additional nested attributes, is given below:

{
  "Partition Key": "Value",
  "Sort Key": "Value",
  "Attribute 1": "Value",
  "Attribute 2": "Value",
  "Attribute 3": [
    {
      "Attribute 4": "Value",
      "Attribute 5": "Value"
    },
    {
      "Attribute 4": "Value",
      "Attribute 5": "Value"
    }
  ]
}
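As a rough illustration of how such a nested item can be represented in code, the sketch below converts plain Python values into the typed attribute-value form used by DynamoDB's low-level API, where each value is tagged with a type descriptor such as S (string), N (number), L (list), or M (map). The helper name to_attribute_value is hypothetical and covers only a few types; AWS SDKs provide their own serializers, which this sketch does not reproduce.

```python
def to_attribute_value(value):
    """Tag a plain Python value with a DynamoDB-style type descriptor (partial sketch)."""
    if isinstance(value, bool):          # check bool before int: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, int):
        return {"N": str(value)}         # numbers are transmitted as strings
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


# The item from Table 5, with the nested list of maps under "Attribute 3".
item = {
    "Partition Key": "Value",
    "Sort Key": "Value",
    "Attribute 1": "Value",
    "Attribute 2": "Value",
    "Attribute 3": [
        {"Attribute 4": "Value", "Attribute 5": "Value"},
        {"Attribute 4": "Value", "Attribute 5": "Value"},
    ],
}

typed_item = {name: to_attribute_value(value) for name, value in item.items()}
```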

In Amazon DynamoDB, a single-attribute primary key (a partition key) is useful for quick reads and writes of data. For example, PersonId serves as the partition key in the following Person table.

Table 6: Example Person Amazon DynamoDB Table

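Person Table
| Item   | PersonId (Partition Key) | FirstName | LastName | Zipcode | Gender |
|--------|--------------------------|-----------|----------|---------|--------|
| Item 1 | 1001                     | Fname-1   | Lname-1  | 00000   |        |
| Item 2 | 1002                     | Fname-2   | Lname-2  |         | M      |
| Item 3 | 2002                     | Fname-3   | Lname-3  | 10000   | F      |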

A composite key in Amazon DynamoDB is indexed as a partition key and a sort key. This multi-part key maintains a hierarchy between the first and second element values. Holding the partition key element constant facilitates searches across the sort key element to retrieve items quickly for a given partition key. In the following GameScores table, the composite partition-sort key is a combination of PersonId (partition key) and GameId (sort key).

Table 7: Example GameScores Amazon DynamoDB Table

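GameScores Table
| Item   | PersonId (Partition Key) | GameId (Sort Key) | TopScore | TopScoreDate        | Wins | Losses |
|--------|--------------------------|-------------------|----------|---------------------|------|--------|
| Item 1 | 1001                     | Game01            | 67453    | 2013-12-09:17:24:31 | 73   | 21     |
| Item 2 | 1001                     | Game02            | 98567    | 2013-12-11:14:14:37 | 98   | 27     |
| Item 3 | 1002                     | Game01            | 43876    | 2013-12-15:19:24:39 | 12   | 23     |
| Item 4 | 2002                     | Game02            | 65689    | 2013-10-01:17:14:41 | 23   | 54     |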
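The access pattern described for composite keys (hold the partition key constant, then search across the sort key) can be sketched with an in-memory model. This is only an illustration of the idea, not how DynamoDB is implemented; the class CompositeKeyTable and its methods are hypothetical names.

```python
from bisect import bisect_left, bisect_right


class CompositeKeyTable:
    """Toy table keyed by (partition key, sort key); items are kept sorted by sort key."""

    def __init__(self):
        # partition key -> list of (sort_key, attributes), ordered by sort_key
        self._partitions = {}

    def put_item(self, partition_key, sort_key, attributes):
        rows = self._partitions.setdefault(partition_key, [])
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sort_key)
        if i < len(rows) and rows[i][0] == sort_key:
            rows[i] = (sort_key, attributes)      # same primary key: overwrite
        else:
            rows.insert(i, (sort_key, attributes))

    def query(self, partition_key, sort_key_from=None, sort_key_to=None):
        """Return items for one partition key, optionally bounded by an inclusive sort key range."""
        rows = self._partitions.get(partition_key, [])
        keys = [k for k, _ in rows]
        lo = 0 if sort_key_from is None else bisect_left(keys, sort_key_from)
        hi = len(keys) if sort_key_to is None else bisect_right(keys, sort_key_to)
        return rows[lo:hi]


# A subset of the GameScores attributes from Table 7.
game_scores = CompositeKeyTable()
game_scores.put_item(1001, "Game02", {"TopScore": 98567, "Wins": 98, "Losses": 27})
game_scores.put_item(1001, "Game01", {"TopScore": 67453, "Wins": 73, "Losses": 21})
game_scores.put_item(1002, "Game01", {"TopScore": 43876, "Wins": 12, "Losses": 23})
game_scores.put_item(2002, "Game02", {"TopScore": 65689, "Wins": 23, "Losses": 54})

# All games for person 1001, returned in sort key order.
sort_keys_for_1001 = [sort_key for sort_key, _ in game_scores.query(1001)]
```

Keeping each partition's items ordered by sort key is what makes the range lookup a pair of binary searches rather than a scan of the whole table.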

The partition key of an item is also known as its hash attribute, and the sort key as its range attribute. The term hash attribute arises from an internal hash function that takes the value of the partition key as input; the output of that function determines the partition (physical storage node) where the item is stored. The term range attribute derives from the way DynamoDB stores items with the same partition key together, in sorted order by sort key value.
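The role of the hash attribute can be illustrated with a small sketch. DynamoDB's actual hash function and partition layout are internal and not described here, so the use of MD5, the fixed partition count, and the function name partition_for are all assumptions made purely for illustration.

```python
import hashlib


def partition_for(partition_key, num_partitions):
    """Map a partition key value to a partition index (illustrative only)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every item with the same partition key lands on the same partition,
# regardless of its sort key, because only the partition key is hashed.
placements = {person_id: partition_for(person_id, 4) for person_id in (1001, 1002, 2002)}
```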

Although there is no explicit limit on the number of attributes associated with an individual item in an Amazon DynamoDB table, there are restrictions on the aggregate size of an item or payload, including all attribute names and values. A small payload can potentially improve performance and reduce costs because it requires fewer resources to process. For information on how to handle items that exceed the maximum item size, see Best Practices for Storing Large Items and Attributes.
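To make the idea of aggregate item size concrete, the following sketch estimates the payload of an item whose attributes are all strings by summing the UTF-8 byte lengths of attribute names and values. It is a simplification: other data types are sized differently, and the 400 KB figure used for MAX_ITEM_SIZE_BYTES is taken as an assumption from the DynamoDB documentation at the time of writing and should be verified there. The function name estimate_item_size is hypothetical.

```python
MAX_ITEM_SIZE_BYTES = 400 * 1024  # assumed limit; confirm against current DynamoDB documentation


def estimate_item_size(item):
    """Approximate the size of an item with string-only attributes, in bytes."""
    total = 0
    for name, value in item.items():
        # Both the attribute name and its value count toward the item size.
        total += len(name.encode("utf-8")) + len(value.encode("utf-8"))
    return total


person = {"PersonId": "1001", "FirstName": "Fname-1"}
person_size = estimate_item_size(person)
fits = person_size <= MAX_ITEM_SIZE_BYTES
```

Because attribute names count toward the total, shorter names reduce the payload of every item in the table.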

In Apache HBase, the most basic unit is a column. One or more columns form a row. Each row is addressed uniquely by a primary key referred to as a row key. A row in Apache HBase can have millions of columns. Each column can have multiple versions with each distinct value contained in a separate cell.
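The relationship between rows, columns, versions, and cells can be sketched as a nested mapping in which each (row key, column) pair holds several timestamped values. This is a conceptual model only, not the HBase client API; VersionedTable and its methods are hypothetical names, and the second value stored below is invented for the example.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> {timestamp: value}; each entry is one cell."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # Writing with a new timestamp adds a version instead of overwriting.
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, max_versions=1):
        """Return up to max_versions cells as (timestamp, value), newest first."""
        cells = self._rows.get(row_key, {}).get(column, {})
        newest_first = sorted(cells.items(), reverse=True)
        return newest_first[:max_versions]


table = VersionedTable()
table.put("1001", "personal_info:lastname", "Lname-1", timestamp=1)
table.put("1001", "personal_info:lastname", "Lname-1-updated", timestamp=2)

latest = table.get("1001", "personal_info:lastname")
history = table.get("1001", "personal_info:lastname", max_versions=2)
```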

One fundamental modeling concept in Apache HBase is that of a column family. A column family is a container for grouping sets of related data together within one table, as shown in the following example.

Table 8: Apache HBase Row Representation

Table
Column families: Column Family 1, Column Family 2, Column Family 3
| row | row key | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |

Apache HBase groups columns with the same general access patterns and size characteristics into column families to form a basic unit of separation. For example, in the following Person table, you can group personal data into one column family called personal_info and the statistical data into a demographic column family. Any other columns in the table would be grouped accordingly as well, as shown in the following example.

Table 9: Example Person Table in Apache HBase

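Person Table
Column families: personal_info, demographic
| Row   | Row Key | firstname | lastname | zipcode | gender |
|-------|---------|-----------|----------|---------|--------|
| row 1 | 1001    | Fname-1   | Lname-1  | 00000   |        |
| row 2 | 1002    | Fname-2   | Lname-2  |         | M      |
| row 3 | 2002    | Fname-3   | Lname-3  | 10000   | F      |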

Columns are addressed as a combination of the column family name and the column qualifier expressed as family:qualifier. All members of a column family have the same prefix. In the preceding example, the firstname and lastname column qualifiers can be referenced as personal_info:firstname and personal_info:lastname, respectively.
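The family:qualifier convention can be shown with a short sketch that splits a column name into its two parts and groups a row's columns by family. The helper names split_column and columns_by_family are hypothetical; the sample data is row 2 of the Person table, with gender placed in the demographic family.

```python
def split_column(name):
    """Split 'family:qualifier' into its two parts at the first colon."""
    family, separator, qualifier = name.partition(":")
    if not separator or not family:
        raise ValueError(f"expected family:qualifier, got {name!r}")
    return family, qualifier


def columns_by_family(row):
    """Group a row's columns under their column family name."""
    grouped = {}
    for name, value in row.items():
        family, qualifier = split_column(name)
        grouped.setdefault(family, {})[qualifier] = value
    return grouped


row_2 = {
    "personal_info:firstname": "Fname-2",
    "personal_info:lastname": "Lname-2",
    "demographic:gender": "M",
}
grouped = columns_by_family(row_2)
```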

Column families allow you to fetch only those columns that are required by a query. All members of a column family are physically stored together on a disk. This means that optimization features, such as performance tunings, compression encodings, and so on, can be scoped at the column family level.

In the following Apache HBase GameScores table, the row key is a combination of user and game identifiers. A row key can consist of multiple parts concatenated to provide an immutable way of referring to entities. From an Apache HBase modeling perspective, the resulting table is tall-narrow, because it has few columns relative to the number of rows, as shown in the following example.

Table 10: Tall-Narrow GameScores Apache HBase Table

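GameScores Table
Column families: top_scores (score, date), metrics (wins, loses)
| Row   | Row Key     | score | date                | wins | loses |
|-------|-------------|-------|---------------------|------|-------|
| row 1 | 1001-game01 | 67453 | 2013-12-09:17:24:31 | 73   | 21    |
| row 2 | 1001-game02 | 98567 | 2013-12-11:14:14:37 | 98   | 27    |
| row 3 | 1002-game01 | 43876 | 2013-12-15:19:24:39 | 12   | 23    |
| row 4 | 2002-game02 | 65689 | 2013-10-01:17:14:41 | 23   | 54    |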
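Because Apache HBase keeps rows sorted lexicographically by row key, a concatenated key such as the one above lets all rows for one user be read with a prefix scan. The sketch below models that behavior over a sorted list of row keys; it is not the HBase client API, and make_row_key and prefix_scan are hypothetical helper names.

```python
from bisect import bisect_left


def make_row_key(person_id, game_id):
    """Concatenate the two identifiers into a single row key."""
    return f"{person_id}-{game_id}"


def prefix_scan(sorted_row_keys, prefix):
    """Return every row key that starts with prefix, using the sort order to stop early."""
    start = bisect_left(sorted_row_keys, prefix)
    matches = []
    for key in sorted_row_keys[start:]:
        if not key.startswith(prefix):
            break                      # sorted order guarantees no later key can match
        matches.append(key)
    return matches


row_keys = sorted(
    make_row_key(person_id, game_id)
    for person_id, game_id in [(1001, "game01"), (1001, "game02"), (1002, "game01"), (2002, "game02")]
)
rows_for_1001 = prefix_scan(row_keys, "1001-")
```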

Alternatively, you can model the game identifier as a column qualifier in Apache HBase. This approach facilitates precise column lookups and supports usage of filters to read data. The result is a flat-wide table with few rows relative to the number of columns. This concept of a flat-wide Apache HBase table is shown in the following table.

Table 11: Flat-Wide GameScores Apache HBase Table

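GameScores Table
Column families: top_scores (gameId, score, top_score_date), metrics (gameId, wins, loses)
| Row   | Row Key | gameId | score | top_score_date      | gameId | wins | loses |
|-------|---------|--------|-------|---------------------|--------|------|-------|
| row 1 | 1001    | game01 | 67453 | 2013-12-09:17:24:31 | game01 | 73   | 21    |
|       |         | game02 | 98567 | 2013-12-11:14:14:37 | game02 | 98   | 27    |
| row 2 | 1002    | game01 | 43876 | 2013-12-15:19:24:39 | game01 | 12   | 23    |
| row 3 | 2002    | game02 | 65689 | 2013-10-01:17:14:41 | game02 | 23   | 54    |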
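The difference between the two layouts can be sketched by pivoting tall-narrow rows into flat-wide rows, moving the game identifier out of the row key and into the column qualifier. The qualifier naming scheme used here (for example game01_score) and the function name to_flat_wide are assumptions chosen for the example, not a convention prescribed by Apache HBase.

```python
def to_flat_wide(tall_rows):
    """Pivot rows keyed by 'person-game' into one row per person with per-game columns."""
    wide_rows = {}
    for row_key, columns in tall_rows.items():
        person_id, game_id = row_key.split("-", 1)
        wide_row = wide_rows.setdefault(person_id, {})
        for column, value in columns.items():
            family, qualifier = column.split(":", 1)
            # The game identifier becomes part of the column qualifier.
            wide_row[f"{family}:{game_id}_{qualifier}"] = value
    return wide_rows


tall_rows = {
    "1001-game01": {"top_scores:score": 67453, "metrics:wins": 73, "metrics:loses": 21},
    "1001-game02": {"top_scores:score": 98567, "metrics:wins": 98, "metrics:loses": 27},
    "1002-game01": {"top_scores:score": 43876, "metrics:wins": 12, "metrics:loses": 23},
    "2002-game02": {"top_scores:score": 65689, "metrics:wins": 23, "metrics:loses": 54},
}
wide_rows = to_flat_wide(tall_rows)
```

Four tall rows collapse into three wide rows, and the row for user 1001 gains one set of columns per game played.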

For performance reasons, it is important to keep the number of column families in your Apache HBase schema low. Using more than three column families can potentially degrade performance. The recommended best practice is to use a single column family in your schemas, and to introduce a second or third column family only if data access is limited to one column family at a time. Note that Apache HBase does not impose any restrictions on row size.