Partitioning

Amazon DynamoDB stores three geographically distributed replicas of each table to enable high availability and data durability within a region. Data is automatically partitioned, primarily by hashing the partition key. As throughput and data size increase, Amazon DynamoDB automatically repartitions and reallocates data across more nodes.
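As a minimal sketch of how the partition key figures into a table definition, the following AWS SDK for Java v2 snippet creates a table keyed on a single partition (HASH) key; the table name, attribute name, and capacity figures here are hypothetical:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

public class CreateOrdersTable {
    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();

        // The partition (HASH) key determines item placement: DynamoDB
        // hashes its value to choose the partition an item lands on.
        dynamoDb.createTable(CreateTableRequest.builder()
                .tableName("Orders")                      // hypothetical table name
                .attributeDefinitions(AttributeDefinition.builder()
                        .attributeName("CustomerId")      // hypothetical attribute
                        .attributeType(ScalarAttributeType.S)
                        .build())
                .keySchema(KeySchemaElement.builder()
                        .attributeName("CustomerId")      // the partition key
                        .keyType(KeyType.HASH)
                        .build())
                .provisionedThroughput(ProvisionedThroughput.builder()
                        .readCapacityUnits(100L)
                        .writeCapacityUnits(100L)
                        .build())
                .build());
    }
}
```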

Partitions in Amazon DynamoDB are fully independent, resulting in a shared-nothing cluster. However, provisioned throughput is divided evenly across the partitions.
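To make the even division concrete, the short sketch below uses hypothetical capacity figures; the partition count is managed internally by DynamoDB and is not exposed through the API:

```java
public class PartitionThroughput {
    public static void main(String[] args) {
        // Hypothetical figures: a table provisioned with 10,000 RCUs and
        // 3,000 WCUs that DynamoDB has split into 5 partitions.
        long readCapacityUnits = 10_000;
        long writeCapacityUnits = 3_000;
        int partitions = 5;

        // With even division, each partition serves only its share
        // of the provisioned totals.
        System.out.printf("Per-partition RCUs: %d%n", readCapacityUnits / partitions);
        System.out.printf("Per-partition WCUs: %d%n", writeCapacityUnits / partitions);
    }
}
```

One practical consequence of this division is that a single heavily accessed partition key can be throttled even while the table as a whole has unused capacity.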

A region is the basic unit of scalability and load balancing in Apache HBase. Region splitting and subsequent load balancing follow this sequence of events (a pre-splitting sketch appears after the list):

  1. Initially, there is only one region for a table. As more data is added, the system monitors the region's size to ensure that the configured maximum (hbase.hregion.max.filesize) is not exceeded.

  2. If the region size exceeds the configured limit, the system dynamically splits the region into two at the row key in the middle of the region, creating two roughly equal halves.

  3. The HBase master then schedules the new regions to be moved to other region servers for load balancing, if required.
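Automatic splitting can also be complemented by pre-splitting a table at creation time, which defines region boundaries up front instead of waiting for size-triggered splits. A minimal sketch with the HBase 2.x Java client follows; the table name, column family, and split keys are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("orders"))  // hypothetical table
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();

            // Pre-split into four regions at explicit row-key boundaries,
            // rather than relying solely on size-triggered automatic splits.
            byte[][] splitKeys = {
                    Bytes.toBytes("g"),
                    Bytes.toBytes("n"),
                    Bytes.toBytes("t")
            };
            admin.createTable(table, splitKeys);
        }
    }
}
```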

Behind the scenes, Apache ZooKeeper tracks all activities that take place during a region split and maintains the state of the region in case of server failure. Apache HBase regions are equivalent to the range partitions used in RDBMS sharding. Regions can be spread across many physical servers, distributing the load and allowing the cluster to scale.
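To observe this range partitioning directly, the client API can list each region's row-key range and the server hosting it. A minimal sketch, again with the HBase 2.x Java client and a hypothetical table name:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class ShowRegionLayout {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator =
                     connection.getRegionLocator(TableName.valueOf("orders"))) {

            // Each region covers a contiguous row-key range [startKey, endKey)
            // and is hosted by exactly one region server at a time.
            for (HRegionLocation location : locator.getAllRegionLocations()) {
                System.out.printf("region=%s start=%s end=%s server=%s%n",
                        location.getRegion().getRegionNameAsString(),
                        Bytes.toStringBinary(location.getRegion().getStartKey()),
                        Bytes.toStringBinary(location.getRegion().getEndKey()),
                        location.getServerName());
            }
        }
    }
}
```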

In summary, because Amazon DynamoDB is a managed service, its architectural details are abstracted away so that you can focus on your application.

With the self-managed Apache HBase deployment model, it is crucial to understand the underlying architectural details to maximize scalability and performance. AWS lets you offload much of the Apache HBase administrative overhead by launching your cluster on Amazon EMR.