Apache HBase Architecture Overview

Apache HBase is typically deployed on top of HDFS. Apache ZooKeeper is a critical component for maintaining configuration information and managing the entire Apache HBase cluster.

The three major Apache HBase components are the following:

  • Client API — Provides programmatic access to Data Manipulation Language (DML) for performing CRUD operations on HBase tables.

  • Region servers — HBase tables are split into regions and are served by region servers.

  • Master server — Responsible for monitoring all region server instances in the cluster and is the interface for all metadata changes.

Apache HBase stores data in indexed store files called HFiles on HDFS. The store files are sequences of blocks with a block index stored at the end for fast lookups. The store files provide an API to access specific values as well as to scan ranges of values, given a start and end key.
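The store file layout described above can be illustrated with a minimal sketch. It is a toy model, not the HFile format: the class name `StoreFileSketch`, the `block_size` parameter, and the choice to index each block by its first key are assumptions made for illustration, and the end key of a scan is treated as exclusive.

```python
import bisect


class StoreFileSketch:
    """Toy model of an indexed store file: sorted blocks plus a block index."""

    def __init__(self, sorted_items, block_size=2):
        # sorted_items: list of (key, value) pairs, already sorted by key.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Hypothetical index layout: the first key of each block.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        # Binary-search the index for the only block that could hold the key.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None

    def scan(self, start_key, end_key):
        # Yield pairs with start_key <= key < end_key, in key order.
        pos = max(bisect.bisect_right(self.index, start_key) - 1, 0)
        for block in self.blocks[pos:]:
            for k, v in block:
                if k >= end_key:
                    return
                if k >= start_key:
                    yield k, v
```

The index lets a point lookup touch a single block instead of reading the whole file, and a range scan starts at the first block that can contain the start key.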

During a write operation, data is first written to a commit log called a write-ahead log (WAL) and then placed in memory in a structure called the Memstore. When the size of the Memstore exceeds a configured maximum, it is flushed to disk as an HFile. Each flush from a Memstore creates a new HFile. As the number of HFiles builds up, a compaction process merges them into fewer, larger files.
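The write path above can be sketched as a small in-memory model. Everything here is a simplifying assumption: the class `RegionStoreSketch` and its thresholds are hypothetical, the entry count stands in for the Memstore's byte size, and clearing the log after a flush is a stand-in for however the real system retires log entries it no longer needs.

```python
class RegionStoreSketch:
    """Toy write path: log append, Memstore insert, flush, and compaction."""

    def __init__(self, flush_threshold=3, compaction_threshold=3):
        self.wal = []          # append-only commit log (stand-in for the WAL)
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold
        self.compaction_threshold = compaction_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log the edit first
        self.memstore[key] = value      # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Each flush creates a new sorted, immutable file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []   # assumption: flushed edits no longer need the log
        if len(self.hfiles) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        merged = {}
        for hfile in self.hfiles:   # oldest first, so newer values win
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())]
```

With `flush_threshold=2` and `compaction_threshold=2`, four puts produce two flushes, and the second flush triggers a compaction that leaves a single merged file.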

A read operation is essentially a merge of the data stored in the Memstores and in the HFiles. The WAL is never used in a read operation; it exists only for recovery if a server crashes before the in-memory data has been written to disk.
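As a rough illustration of the read path and of log-based recovery, the sketch below treats the Memstore and each HFile as plain dictionaries. The function names `read` and `recover_memstore` are hypothetical, and "newest source wins" is a simplification; it ignores cell timestamps and versions.

```python
def read(key, memstore, hfiles):
    """Return the newest value for key across the Memstore and the HFiles."""
    if key in memstore:              # the newest data lives in memory
        return memstore[key]
    for hfile in reversed(hfiles):   # then check files from newest to oldest
        if key in hfile:
            return hfile[key]
    return None


def recover_memstore(wal):
    """Rebuild lost in-memory state by replaying the log in write order."""
    memstore = {}
    for key, value in wal:
        memstore[key] = value
    return memstore
```

Note that `read` never consults the log. The log is only replayed by `recover_memstore`, which models what happens after a crash that loses unflushed data.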

Each region in Apache HBase contains one store per column family. A region holds a contiguous range of rows stored together. Regions can be merged to reduce the number of store files. A store file that grows beyond the configured maximum store file size can trigger a region split.
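A region split can be pictured with the following sketch. It is an assumption-laden simplification: the function `split_region` is hypothetical, the row count stands in for store file size, and the split point is simply the middle row rather than whatever key the real system selects.

```python
def split_region(rows, max_rows):
    """Split a sorted list of (row_key, value) pairs into two contiguous
    halves when it holds more than max_rows entries; otherwise keep it whole."""
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    return [rows[:mid], rows[mid:]]
```

Because the input is sorted by row key, each resulting half is itself a contiguous key range, so both daughters still satisfy the definition of a region.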

A region server can serve multiple regions. Each region is mapped to exactly one region server. Region servers handle reads and writes and keep data in memory until enough has accumulated to warrant a flush. Clients communicate directly with region servers for all data-related operations.

The master server is responsible for monitoring and assigning regions to region servers and uses Apache ZooKeeper to facilitate this task. Apache ZooKeeper also serves as a registry for region servers and a bootstrap location for region discovery.
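To illustrate how a client could go from a row key to the region server that owns it, the sketch below searches a sorted region map. The map contents, the server names, and the function `locate` are hypothetical; in a real cluster the client obtains this mapping through the discovery mechanism rather than from a hard-coded list.

```python
import bisect

# Hypothetical region map: (start_key, server), sorted by start_key.
# The first region starts at the empty key, so every row key falls somewhere.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs1")]


def locate(row_key, regions=REGIONS):
    """Return the server for the region whose key range contains row_key."""
    starts = [start for start, _ in regions]
    pos = bisect.bisect_right(starts, row_key) - 1
    return regions[pos][1]
```

Here `rs1` serves two regions while each region maps to exactly one server, and once a client knows the owner it can send reads and writes to that server directly.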

The master server is also responsible for handling critical functions such as load balancing of regions across region servers, region server failover, and completing region splits, but it is not part of the actual data storage or retrieval path.
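Region server failover combined with a simple balancing rule might look like the sketch below. The function `reassign_on_failure` and the "least-loaded survivor" policy are assumptions chosen for illustration; they are not the balancing algorithm HBase uses.

```python
def reassign_on_failure(assignments, failed_server):
    """Move each region of failed_server to the least-loaded surviving server.

    assignments maps a server name to its list of region names.
    Returns a new mapping and leaves the input untouched.
    """
    survivors = {s: list(r) for s, r in assignments.items()
                 if s != failed_server}
    if not survivors:
        raise RuntimeError("no region servers left to take over")
    for region in assignments.get(failed_server, []):
        # Ties are broken by server name to keep the result deterministic.
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(region)
    return survivors
```

Only the assignment metadata changes in this model, which mirrors the point that the master coordinates regions without sitting in the data path.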

You can run Apache HBase in a multi-master environment. In multi-master mode, all masters compete to run the cluster, and only one of them acts as the active master at any time. If the active master shuts down, the remaining masters contend to take over the master role.
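The contention for the active role can be modeled with a toy election. The class `MasterElectionSketch` is hypothetical, and promoting the longest-waiting standby is an assumption made to keep the example deterministic; in a real cluster the outcome depends on which contender claims the role first.

```python
class MasterElectionSketch:
    """Toy stand-in for master election: the first claimant becomes active."""

    def __init__(self):
        self.active = None
        self.standbys = []

    def register(self, master):
        # Every master tries to claim the active role; only the first succeeds.
        if self.active is None:
            self.active = master
        else:
            self.standbys.append(master)

    def shutdown(self, master):
        if master == self.active:
            # The remaining masters contend; here the longest-waiting one wins.
            self.active = self.standbys.pop(0) if self.standbys else None
        elif master in self.standbys:
            self.standbys.remove(master)
```

Registering `m1`, `m2`, and `m3` in that order makes `m1` active, and shutting down `m1` promotes `m2` while `m3` stays on standby.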