Amazon EMR Release Guide

Amazon S3 Storage Mode for HBase

When you run HBase on Amazon EMR version 5.2.0 or later, you can enable Amazon S3 storage mode, which offers the following advantages:

  • The HBase root directory is stored in Amazon S3, including store files (HFiles) and table metadata. This data is persistent outside of the cluster, available across Amazon EC2 Availability Zones, and you don't need to recover using snapshots or other methods.

  • With store files in Amazon S3, you can size your Amazon EMR cluster for your compute requirements instead of sizing for data requirements with 3x replication in HDFS.

The following illustration shows the HBase components relevant to Amazon S3 storage mode.


[Figure: S3 storage mode for HBase architecture]

Enabling Amazon S3 Storage Mode for HBase

You can enable Amazon S3 storage mode using the Amazon EMR console, the AWS CLI, or the Amazon EMR API. The configuration is an option during cluster creation. When you use the console, you choose the setting using Advanced options. When you use the AWS CLI, use the --configurations option to provide a configuration object. Properties of the configuration object specify the storage mode and the root directory location in Amazon S3. The Amazon S3 location that you specify should be in the same region as your Amazon EMR cluster, and only one active cluster at a time can use the same HBase root directory in Amazon S3. For console steps and a detailed create-cluster example using the AWS CLI, see Creating a Cluster with HBase Using the Console. An example configuration object is shown in the following JSON snippet.

Important

We strongly recommend that you use EMRFS consistent view when you enable Amazon S3 storage mode for production workloads. Not using consistent view may result in performance impacts for specific operations. For more information about configuring consistent view, see Consistent View in the Amazon EMR Management Guide.
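If you enable consistent view, one minimal configuration sketch uses the emrfs-site classification, as shown below. This is an illustrative fragment, not a complete configuration; consistent view also relies on a DynamoDB metadata store, which EMRFS creates by default.

```json
{
  "Classification": "emrfs-site",
  "Properties": {
    "fs.s3.consistent": "true"
  }
}
```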

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://my-bucket/my-hbase-rootdir"
    }
  },
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  }
]
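As a sketch of how the configuration object can be supplied on the command line, the JSON above can be saved to a file and passed to create-cluster with the --configurations option. The cluster name, instance settings, and bucket name below are hypothetical; adjust them for your account and region.

```shell
#!/bin/sh
# Save the configuration object shown above to a file (bucket name is hypothetical).
cat > myConfig.json <<'EOF'
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://my-bucket/my-hbase-rootdir"
    }
  },
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  }
]
EOF

# Launch the cluster (requires the AWS CLI and valid credentials).
if command -v aws >/dev/null 2>&1; then
  aws emr create-cluster --name "my-hbase-cluster" \
    --release-label emr-5.2.0 \
    --applications Name=HBase \
    --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles \
    --configurations file://./myConfig.json
else
  echo "aws CLI not found; cluster not created"
fi
```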

Operational Considerations

HBase region servers use BlockCache to store data reads in memory and BucketCache to store data reads on local disk. In addition, region servers use MemStore to store data writes in memory, and use write-ahead logs to store data writes in HDFS before the data is written to store files in Amazon S3. The read performance of your cluster relates to how often a record can be retrieved from the in-memory or on-disk caches. A cache miss results in the record being read from the store file in Amazon S3, which has significantly higher latency and higher standard deviation than reading from HDFS. In addition, the maximum request rates for Amazon S3 are lower than what can be achieved from the local cache, so caching data may be important for read-heavy workloads. For more information about Amazon S3 performance, see Performance Optimization in the Amazon Simple Storage Service Developer Guide.

To improve performance, we recommend that you cache as much of your dataset as possible in EC2 instance storage. Because the BucketCache uses the region server's EC2 instance storage, you can choose an instance type with a sufficient instance store and add Amazon EBS storage to accommodate the required cache size. You can also increase the BucketCache size on attached instance stores and EBS volumes using the hbase.bucketcache.size property. The default setting is 8,192 MB.
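For example, to raise the BucketCache to 24 GB per region server, you might supply the following hbase-site fragment. The value is illustrative, not a recommendation; size it to your dataset and the instance store and EBS capacity actually attached to your region servers.

```json
{
  "Classification": "hbase-site",
  "Properties": {
    "hbase.bucketcache.size": "24576"
  }
}
```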

For writes, the frequency of MemStore flushes and the number of store files during minor and major compactions can contribute significantly to an increase in region server response times. For optimal performance, consider increasing the size of the MemStore flush and HRegion block multiplier, which increases the elapsed time between major compactions. Also, in some cases, you may get better performance using larger file block sizes (but less than 5 GB) to trigger Amazon S3 multipart upload functionality in EMRFS. Additionally, HBase compactions and region servers perform optimally when fewer store files need to be compacted.
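As an illustrative sketch, doubling the MemStore flush size to 256 MB and raising the block multiplier could be configured as follows. The values are hypothetical starting points, not tuned recommendations; validate them against your own write workload.

```json
{
  "Classification": "hbase-site",
  "Properties": {
    "hbase.hregion.memstore.flush.size": "268435456",
    "hbase.hregion.memstore.block.multiplier": "6"
  }
}
```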

Tables can take a significant amount of time to drop on Amazon S3 because large directories need to be renamed. Consider disabling tables instead of dropping.
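For example, from the HBase shell (the table name is hypothetical):

```
disable 'my_table'   # data remains in Amazon S3; avoids the slow S3 renames of a drop
enable 'my_table'    # re-enable later if the table is needed again
```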

There is an HBase cleaner process that cleans up old WAL files and archived HFiles. Cleaner operation can affect query performance when running heavy workloads, so we recommend you enable the cleaner only during off-peak times. The cleaner has the following HBase shell commands:

  • cleaner_enabled queries whether the cleaner is enabled.

  • cleaner_run manually runs the cleaner to remove files.

  • cleaner_switch enables or disables the cleaner and returns the previous state of the cleaner. For example, cleaner_switch true enables the cleaner.
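A typical session in the HBase shell might look like the following, disabling the cleaner during peak hours and running it manually off-peak:

```
cleaner_enabled        # check whether the cleaner is currently enabled
cleaner_switch false   # disable during peak hours; returns the previous state
cleaner_switch true    # re-enable during off-peak times
cleaner_run            # trigger a cleanup pass manually
```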

HBase Properties for Amazon S3 Storage Mode Performance Tuning

The following parameters can be adjusted to tune the performance of your workload when you use Amazon S3 storage mode.

  • hbase.bucketcache.size (default: 8,192)

    The amount of disk space, in MB, reserved on region server Amazon EC2 instance stores and EBS volumes for BucketCache storage. The setting applies to all region server instances. Larger BucketCache sizes generally correspond to improved performance.

  • hbase.hregion.memstore.flush.size (default: 134217728)

    The maximum size, in bytes, that a single MemStore can reach before it is flushed to store files in Amazon S3.

  • hbase.hregion.memstore.block.multiplier (default: 4)

    A multiplier that determines the MemStore upper limit at which updates are blocked. If the MemStore exceeds hbase.hregion.memstore.flush.size multiplied by this value, updates are blocked. MemStore flushes and compactions may occur to unblock updates.

  • hbase.hstore.blockingStoreFiles (default: 10)

    The maximum number of store files that can exist in a store before updates are blocked.

  • hbase.hregion.max.filesize (default: 10737418240)

    The maximum combined size of HFiles, in bytes, that can exist in a region before the region is split.

Shutting Down and Restoring a Cluster Without Data Loss

To shut down an Amazon EMR cluster without losing data that hasn't been written to Amazon S3, the MemStore cache needs to flush to Amazon S3 to write new store files. To do this, you can run a shell script provided on the EMR cluster. You can either add it as a step or run it directly using the on-cluster CLI. The script disables all HBase tables, which causes the MemStore in each region server to flush to Amazon S3. If the script completes successfully, the data persists in Amazon S3 and the cluster can be terminated.

The following step configuration can be used when you add a step to the cluster. For more information, see Work with Steps Using the CLI and Console in the Amazon EMR Management Guide.

Name="Disable all tables",Jar="command-runner.jar",Args=["/bin/bash","/usr/lib/hbase/bin/disable_all_tables.sh"]
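As a sketch, the step configuration above can also be submitted with the AWS CLI add-steps command. The cluster ID below is a placeholder; replace it with your own (for example, from aws emr list-clusters).

```shell
#!/bin/sh
# Step definition equivalent to the step configuration shown above.
cat > steps.json <<'EOF'
[
  {
    "Name": "Disable all tables",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "CONTINUE",
    "Args": ["/bin/bash", "/usr/lib/hbase/bin/disable_all_tables.sh"]
  }
]
EOF

# Submit the step (requires the AWS CLI; j-XXXXXXXXXXXXX is a placeholder cluster ID).
if command -v aws >/dev/null 2>&1; then
  aws emr add-steps --cluster-id "j-XXXXXXXXXXXXX" --steps file://./steps.json
else
  echo "aws CLI not found; step not submitted"
fi
```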

Alternatively, you can run the following bash command directly.

bash /usr/lib/hbase/bin/disable_all_tables.sh

To restart a cluster with the same HBase data, specify the same Amazon S3 location as the previous cluster either in the AWS Management Console or using the hbase.rootdir configuration property.