Indexing
In general, data is indexed using a primary key for fast retrieval in both Amazon DynamoDB and Apache HBase. Secondary indexes extend the basic indexing functionality and provide an alternate query path in addition to queries against the primary key.
Amazon DynamoDB supports two kinds of secondary indexes on a table that already implements a partition and sort key:
- Global secondary index: An index with a partition key and an optional sort key that can be different from those on the table.
- Local secondary index: An index that has the same partition key as the table, but a different sort key.
You can define one or more global secondary indexes and one or more local secondary indexes per table. For documents, you can create a local secondary index or global secondary index on any top-level JSON element.
In the example GameScores table introduced in the preceding section, you can define LeaderBoardIndex as a global secondary index as follows:
Table 13: Example Global Secondary Index in Amazon DynamoDB (LeaderBoardIndex)
GameId (Index Partition Key) | TopScore (Index Sort Key) | PersonId (Attribute 1) |
---|---|---|
Game01 | 98567 | 1001 |
Game02 | 43876 | 1001 |
Game01 | 65689 | 1002 |
Game02 | 67453 | 2002 |
The LeaderBoardIndex shown in Table 13 defines GameId as its partition key and TopScore as its sort key. The index key does not need to contain any of the key attributes from the source table. However, the table’s primary key attributes are always projected into the global secondary index. In this example, PersonId is automatically projected, or copied, into the index.
With LeaderBoardIndex defined, you can obtain a list of top scores for a specific game by querying the index. The output is ordered by TopScore, the index sort key. You can also choose to project additional attributes from the source table into the index.
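As a rough sketch, the index in Table 13 and a top-scores query could be expressed as request parameters for the low-level DynamoDB API. The attribute types, the throughput values, and the helper name `top_scores_query` are assumptions made for illustration. The block only builds dictionaries and does not call AWS.

```python
# Sketch only: builds request parameters; nothing is sent to AWS.
# Assumed attribute types: GameId is a string (S), TopScore is a number (N).

leaderboard_index = {
    "IndexName": "LeaderBoardIndex",
    "KeySchema": [
        {"AttributeName": "GameId", "KeyType": "HASH"},     # index partition key
        {"AttributeName": "TopScore", "KeyType": "RANGE"},  # index sort key
    ],
    # KEYS_ONLY projects the index key plus the table's primary key attributes,
    # which is how PersonId ends up in the index.
    "Projection": {"ProjectionType": "KEYS_ONLY"},
    # Hypothetical capacity values; a global secondary index has its own throughput.
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}

# Parameters for adding the index to the existing GameScores table (UpdateTable).
add_index_params = {
    "TableName": "GameScores",
    "AttributeDefinitions": [
        {"AttributeName": "GameId", "AttributeType": "S"},
        {"AttributeName": "TopScore", "AttributeType": "N"},
    ],
    "GlobalSecondaryIndexUpdates": [{"Create": leaderboard_index}],
}


def top_scores_query(game_id, limit=10):
    """Build Query parameters that return the highest scores for one game."""
    return {
        "TableName": "GameScores",
        "IndexName": "LeaderBoardIndex",
        "KeyConditionExpression": "GameId = :game",
        "ExpressionAttributeValues": {":game": {"S": game_id}},
        "ScanIndexForward": False,  # descending order of the sort key (TopScore)
        "Limit": limit,
    }
```

With the boto3 client, these dictionaries would typically be passed as `client.update_table(**add_index_params)` and `client.query(**top_scores_query("Game01"))`.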
A local secondary index, on the other hand, organizes data by the index sort key. It provides an alternate query path for efficiently accessing data using a different sort key.
You can define PersonTopScoresIndex as a local secondary index for the example GameScores table introduced in the preceding section. The index contains the same partition key, PersonId, as the source table and defines TopScoreDate as its new sort key. The old sort key value from the source table (in this example, GameId) is automatically projected or copied into the index, but it is not a part of the index key, as shown in the following table.
Table 14: Local Secondary Index in Amazon DynamoDB (PersonTopScoresIndex)
PersonId (Index Partition Key) | TopScoreDate (Index Sort Key, new) | GameId (Attribute 1, old sort key) | TopScore (Attribute 2, optional projected attribute) |
---|---|---|---|
1001 | 2013-12-09:17:24:31 | Game01 | 67453 |
1001 | 2013-12-11:14:14:37 | Game02 | 98567 |
1002 | 2013-12-15:19:24:39 | Game01 | 43876 |
2002 | 2013-10-01:17:14:41 | Game02 | 65689 |
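Because a local secondary index must be declared when the table is created, its definition travels with the table definition. The following sketch shows one way the CreateTable parameters for GameScores and PersonTopScoresIndex might look. The attribute types and capacity values are assumptions, and nothing is sent to AWS.

```python
# Sketch only: CreateTable parameters; nothing is sent to AWS.
# Assumed types: PersonId is a number (N); GameId and TopScoreDate are strings (S).

create_table_params = {
    "TableName": "GameScores",
    "AttributeDefinitions": [
        {"AttributeName": "PersonId", "AttributeType": "N"},
        {"AttributeName": "GameId", "AttributeType": "S"},
        {"AttributeName": "TopScoreDate", "AttributeType": "S"},
    ],
    # Table primary key: PersonId (partition key) + GameId (sort key).
    "KeySchema": [
        {"AttributeName": "PersonId", "KeyType": "HASH"},
        {"AttributeName": "GameId", "KeyType": "RANGE"},
    ],
    "LocalSecondaryIndexes": [
        {
            "IndexName": "PersonTopScoresIndex",
            # Same partition key as the table, different sort key.
            "KeySchema": [
                {"AttributeName": "PersonId", "KeyType": "HASH"},
                {"AttributeName": "TopScoreDate", "KeyType": "RANGE"},
            ],
            # GameId is projected automatically; TopScore is projected by choice.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["TopScore"],
            },
            # No throughput settings here: the index shares the table's capacity.
        }
    ],
    # Hypothetical capacity values for the table.
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}
```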
A local secondary index is a sparse index. An index will only have an item if the index sort key attribute has a value.
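The sparse behavior can be illustrated with a small in-memory model. This is not DynamoDB code. The function name `build_sparse_index` and the item without a TopScoreDate are hypothetical and exist only to show which items would appear in the index.

```python
def build_sparse_index(items, sort_key):
    """Keep only items that carry the index sort key, ordered by (PersonId, sort key)."""
    indexed = [item for item in items if item.get(sort_key) is not None]
    return sorted(indexed, key=lambda item: (item["PersonId"], item[sort_key]))


items = [
    {"PersonId": 1001, "GameId": "Game01", "TopScoreDate": "2013-12-09:17:24:31"},
    {"PersonId": 1001, "GameId": "Game03"},  # hypothetical item with no TopScoreDate
    {"PersonId": 1002, "GameId": "Game01", "TopScoreDate": "2013-12-15:19:24:39"},
]

# Only the two items that have a TopScoreDate value appear in the index.
index = build_sparse_index(items, "TopScoreDate")
```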
With local secondary indexes, any group of items that have the same partition key value in a table and all their associated local secondary indexes form an item collection. There is a size restriction on item collections in a DynamoDB table. For more information, see Item Collection Size Limit.
The main difference between a global secondary index and a local secondary index is that a global secondary index defines a completely new partition key and an optional sort key on a table, whereas a local secondary index keeps the table's partition key. You can define any attribute as the partition key for the global secondary index as long as its data type is scalar rather than a multi-value set.
Additional differences between global and local secondary indexes are summarized in the following table.
Table 15: Global and local secondary indexes
Characteristic | Global Secondary Indexes | Local Secondary Indexes |
---|---|---|
Creation | Can be created for existing tables (online indexing supported) | Only at table creation time (online indexing not supported) |
Primary Key Values | Need not be unique | Must be unique |
Partition Key | Different from primary table | Same as primary table |
Sort Key | Optional | Required (different from primary table) |
Provisioned Throughput | Independent from primary table | Dependent on primary table |
Writes | Asynchronous | Synchronous |
For more information on global and local secondary indexes in Amazon DynamoDB, see Improving Data Access with Secondary Indexes.
In Apache HBase, all rows are always sorted lexicographically by row key. The sort is byte-ordered. This means that each row key is compared on a binary level, byte by byte, from left to right. Row keys are always unique and act as the primary index in Apache HBase.
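A practical consequence of byte-ordered sorting is that numeric suffixes in row keys do not sort numerically unless they are padded to a fixed width. The following sketch uses Python byte strings, which compare byte by byte from left to right, to illustrate the effect. The key format `row<number>` and the helper `padded_key` are assumptions made for the example.

```python
# Python compares bytes objects byte by byte, left to right, so sorted()
# reproduces the lexicographic ordering described above.
raw_keys = [b"row2", b"row10", b"row1"]
sorted_raw = sorted(raw_keys)
# b"row10" sorts before b"row2" because the byte for "1" is smaller than "2".


def padded_key(n, width=4):
    """Build a row key with a zero-padded, fixed-width numeric suffix."""
    return f"row{n:0{width}d}".encode("ascii")


# With fixed-width padding, byte order and numeric order agree.
sorted_padded = sorted(padded_key(n) for n in (2, 10, 1))
```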
Although Apache HBase does not provide built-in secondary indexes as Amazon DynamoDB does, you can implement custom secondary indexes to serve as alternate query paths by using these techniques:
- Create an index in another table: You can maintain a secondary table that is periodically updated. Depending on the load strategy, the risk with this method is that the secondary index can fall out of sync with the main table. You can mitigate this risk by building the secondary index while publishing data to the cluster and writing to the index table concurrently.
- Use the coprocessor framework: You can use the coprocessor framework to implement custom secondary indexes. Coprocessors are comparable to triggers and stored procedures in a relational database management system (RDBMS).
- Use Apache Phoenix: Apache Phoenix acts as a front end to Apache HBase, converts standard SQL into native HBase scans and queries, and supports secondary indexing.
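The first technique, an index kept in a separate table and written alongside the main table, can be sketched with an in-memory model. This is plain Python rather than HBase client code. The class name `IndexedTable`, the `\x00` separator, and the composite index row key layout (indexed value, separator, main row key) are assumptions chosen for illustration.

```python
class IndexedTable:
    """In-memory model of a main table plus a secondary index table."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.main = {}   # main row key -> {column: value}
        self.index = {}  # index row key -> main row key

    def _index_key(self, value, row_key):
        # Composite key: rows with the same indexed value sort next to each other.
        return f"{value}\x00{row_key}"

    def put(self, row_key, columns):
        """Write the row and keep the index table in step with it."""
        old = self.main.get(row_key)
        if old is not None and self.indexed_column in old:
            # Remove the stale index entry before writing the new one.
            self.index.pop(self._index_key(old[self.indexed_column], row_key), None)
        self.main[row_key] = dict(columns)
        if self.indexed_column in columns:
            key = self._index_key(columns[self.indexed_column], row_key)
            self.index[key] = row_key

    def find(self, value):
        """Prefix scan over the index table; returns matching main row keys."""
        prefix = f"{value}\x00"
        return [rk for key, rk in sorted(self.index.items()) if key.startswith(prefix)]


scores = IndexedTable("GameId")
scores.put("1001", {"GameId": "Game01", "TopScore": 98567})
scores.put("1002", {"GameId": "Game01", "TopScore": 65689})
matching_rows = scores.find("Game01")
```

In this model both writes happen inside a single method call. In a real cluster the write to the main table and the write to the index table are separate operations, which is where the out-of-sync risk described above comes from.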
In summary, both Amazon DynamoDB and Apache HBase define data models that allow efficient storage of data to optimize query performance. Amazon DynamoDB imposes a restriction on its item size to allow efficient processing and reduce costs. Apache HBase uses the concept of column families to provide data locality for more efficient read operations.
Amazon DynamoDB supports both scalar and multi-valued sets to accommodate a wide range of unstructured datasets. Similarly, Apache HBase stores its key/value pairs as arbitrary arrays of bytes, giving it the flexibility to store any data type.
Amazon DynamoDB supports built-in secondary indexes and automatically updates and synchronizes all indexes with their parent tables. With Apache HBase, you can implement and manage custom secondary indexes yourself.
From a data model perspective, you can choose Amazon DynamoDB if your item size is relatively small. Although Amazon DynamoDB provides a number of options to overcome row size restrictions, Apache HBase is better equipped to handle large, complex payloads with minimal restrictions.