Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Back Up and Restore HBase

Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. Once you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster prepopulated with backed-up data.

During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data will not be included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included.
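The ordering problem described above can be illustrated with a toy model. This is only a sketch: the node names, the event tuples, and the run_backup function are hypothetical and do not reflect how HBase or Amazon EMR implement backups. It assumes each node is polled once and that writes may arrive between polls.

```python
def run_backup(events):
    """Replay a timeline of writes and node polls (toy model, not EMR code)."""
    live = {}     # node -> writes currently stored on that node
    archive = {}  # node -> snapshot captured when the node was polled
    for event in events:
        if event[0] == "write":
            _, node, value = event
            live.setdefault(node, []).append(value)
        elif event[0] == "poll":
            _, node = event
            archive[node] = list(live.get(node, []))
    return live, archive


# w1 is written before w2, but it goes to a node that was already polled.
events = [
    ("poll", "node-a"),
    ("write", "node-a", "w1"),
    ("write", "node-b", "w2"),
    ("poll", "node-b"),
]
live, archive = run_backup(events)
print(archive)  # {'node-a': [], 'node-b': ['w2']}
```

In this timeline the archive holds the later write (w2) but not the earlier one (w1), which is the kind of inconsistency the paragraph above describes.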

If you require a consistent backup, you must pause writes to HBase during the initial phase of the backup process, when data is synchronized across nodes. You can do this by specifying the --consistent parameter when requesting a backup. With this parameter, writes during this period are queued and executed as soon as the synchronization completes. You can also schedule recurring backups, which resolve inconsistencies over time, because data that is missed on one backup pass is backed up on the following pass.
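As a rough illustration of queued writes, the following toy model holds writes that arrive during synchronization and applies them once it completes. The event names (sync_start, sync_end) and the function are hypothetical, chosen only for this sketch; they are not part of HBase or Amazon EMR.

```python
def run_consistent_backup(events):
    """Toy model: writes arriving during synchronization are queued."""
    live, archive, queue = {}, {}, []
    syncing = False
    for event in events:
        kind = event[0]
        if kind == "sync_start":
            syncing = True
        elif kind == "sync_end":
            syncing = False
            # Apply the queued writes in arrival order.
            for node, value in queue:
                live.setdefault(node, []).append(value)
            queue.clear()
        elif kind == "write":
            _, node, value = event
            if syncing:
                queue.append((node, value))
            else:
                live.setdefault(node, []).append(value)
        elif kind == "poll":
            _, node = event
            archive[node] = list(live.get(node, []))
    return live, archive


events = [
    ("write", "node-a", "w0"),
    ("sync_start",),
    ("poll", "node-a"),
    ("write", "node-a", "w1"),  # queued
    ("write", "node-b", "w2"),  # queued
    ("poll", "node-b"),
    ("sync_end",),
]
live, archive = run_consistent_backup(events)
print(archive)  # {'node-a': ['w0'], 'node-b': []}
print(live)     # {'node-a': ['w0', 'w1'], 'node-b': ['w2']}
```

The archive matches the state of every node at the moment synchronization began, and no write is lost: the queued writes appear in the live data afterward.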

When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory, for example, s3://mybucket/backups/j-3AEXXXXXX16F2. This ensures that any future incremental backups reference the correct HBase cluster.
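One way to follow this convention in a script is to derive the directory from the cluster ID. The helper below is a minimal sketch; the function name and the default backups prefix are assumptions taken from the example path above.

```python
def backup_dir(bucket, cluster_id, prefix="backups"):
    """Build a per-cluster backup location such as s3://mybucket/backups/<cluster-id>."""
    if not cluster_id:
        raise ValueError("cluster_id must not be empty")
    return f"s3://{bucket}/{prefix}/{cluster_id}"


print(backup_dir("mybucket", "j-3AEXXXXXX16F2"))
# s3://mybucket/backups/j-3AEXXXXXX16F2
```

Because the cluster ID is part of the path, two clusters that back up to the same bucket do not share a directory.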

When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files.

The HBase backup process uses S3DistCp for the copy operation, which has certain limitations regarding temporary file storage space. For more information, see Distributed Copy Using S3DistCp.

Back Up and Restore HBase Using the Console

The console provides the ability to launch a new cluster and populate it with data from a previous HBase backup. It also gives you the ability to schedule periodic incremental backups of HBase data. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, do manual backups, and schedule automated full backups, is available using the CLI. For more information, see Back Up and Restore HBase Using the AWS CLI.

To populate a new cluster with archived HBase data using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Software Configuration section, in the Additional Applications field, choose HBase and click Configure and add.

  4. In the Add Application dialog box, check Restore From Backup. For more information, see Install HBase on an Amazon EMR Cluster.

  5. In the Backup Location field, specify the location of the backup you wish to load into the new HBase cluster. This should be an Amazon S3 URL of the form s3://myawsbucket/backups/.

  6. In the Backup Version field, optionally specify the name of the backup version to load. If you do not set a value for Backup Version, Amazon EMR loads the latest backup in the specified location.

  7. Click Add and proceed to create the cluster as described in Plan an Amazon EMR Cluster.

To schedule automated backups of HBase data using the console

  1. In the Software Configuration section, in the Additional Applications field, choose HBase and click Configure and add.

  2. Click Schedule Regular Backups.

  3. Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, when data is synchronized across nodes. Any write operations paused in this way are placed in a queue and resume when synchronization completes.

  4. Set how often backups should occur by entering a number for Backup Frequency and selecting Days, Hours, or Minutes from the drop-down box. The first automated backup that runs is a full backup. After that, Amazon EMR saves incremental backups based on the schedule you specify.

  5. Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.

  6. Specify when the first backup should occur by setting a value for Backup Start Time. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2013-09-26T20:00Z sets the start time to September 26, 2013 at 8 p.m. UTC.

  7. Click Add.

  8. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster.
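The ISO start time used for Backup Start Time can be produced from a local time with a short script. This is a sketch only: the function name is hypothetical, and the format string is inferred from the 2013-09-26T20:00Z example above.

```python
from datetime import datetime, timedelta, timezone


def backup_start_time(moment):
    """Format a timezone-aware datetime as UTC, for example 2013-09-26T20:00Z."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


# 1 p.m. at a fixed UTC-7 offset is 8 p.m. UTC.
utc_minus_7 = timezone(timedelta(hours=-7))
print(backup_start_time(datetime(2013, 9, 26, 13, 0, tzinfo=utc_minus_7)))
# 2013-09-26T20:00Z
```

Rejecting naive datetimes avoids silently interpreting the value in the local time zone of the machine that runs the script.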

Back Up and Restore HBase Using the AWS CLI

Running HBase on Amazon EMR provides many ways to back up your data: you can create full or incremental backups, run backups manually, and schedule automatic backups.

Using the AWS CLI, you can create HBase backups, restore HBase data from backup when creating an Amazon EMR cluster, schedule HBase backups, restore HBase from backup data in Amazon S3, and disable HBase backups.

To manually create an HBase backup using the AWS CLI

To create an HBase backup, type the create-hbase-backup subcommand with the --dir parameter to identify the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. The name is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

  • Type the following command to back up HBase data to s3://mybucket/backups/j-3AEXXXXXX16F2, with the timestamp as the version name. Replace j-3AEXXXXXX16F2 with the cluster ID and replace mybucket with your Amazon S3 bucket name. This backup does not pause writes to HBase and, as such, may be inconsistent.

    aws emr create-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2

    Type the following command to back up data and use the --consistent parameter to enforce backup consistency. This flag pauses writes to HBase during the initial synchronization phase of the backup:

    aws emr create-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --consistent

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
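The YYYYMMDDTHHMMSSZ version name can be generated and parsed with standard date formatting. The sketch below assumes the trailing Z means the timestamp is in UTC; the function names are hypothetical.

```python
from datetime import datetime, timezone

# strftime/strptime pattern for names such as 20120809T031314Z
VERSION_FORMAT = "%Y%m%dT%H%M%SZ"


def version_name(launched_at):
    """Return the version name for a timezone-aware launch time."""
    return launched_at.astimezone(timezone.utc).strftime(VERSION_FORMAT)


def parse_version_name(name):
    """Convert a version name back into a UTC datetime."""
    return datetime.strptime(name, VERSION_FORMAT).replace(tzinfo=timezone.utc)


launched = datetime(2012, 8, 9, 3, 13, 14, tzinfo=timezone.utc)
print(version_name(launched))                  # 20120809T031314Z
print(parse_version_name("20120809T031314Z"))  # 2012-08-09 03:13:14+00:00
```

Parsing the name is useful when a script needs to decide which backups are older than a given date before deleting them.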

To schedule automated backups of HBase data using the AWS CLI

To schedule HBase backups, type the schedule-hbase-backup subcommand with the --type, --dir, --interval, and --unit parameters, as shown in the following commands. If you do not specify a start time, the first backup starts immediately. Use the --consistent parameter to pause write operations to HBase during the initial synchronization phase of the backup process.

  • To create a consistent weekly full backup, with the first backup starting immediately, type the following command. Replace j-3AEXXXXXX16F2 with the cluster ID and replace mybucket with your Amazon S3 bucket name.

    aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type full --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --interval 7 --unit days --consistent

    To create a consistent weekly full backup, with the first backup starting on 15 June 2014 at 8 p.m. UTC, type:

    aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type full --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --interval 7 --unit days --start-time 2014-06-15T20:00Z --consistent

    To create a consistent daily incremental backup with the first backup beginning immediately, type:

    aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type incremental --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --interval 24 --unit hours --consistent

    To create a consistent daily incremental backup, with the first backup starting on 15 June 2014 at 8 p.m. UTC, type:

    aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type incremental --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --interval 24 --unit hours --start-time 2014-06-15T20:00Z --consistent

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
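When schedules are created from a script, it can help to assemble the argument list in one place and validate it before running anything. The helper below is a sketch: the function name is hypothetical, it only builds the argument list without running the AWS CLI, and the accepted units are limited to those named in this section (assumed here to be minutes, hours, and days).

```python
def schedule_backup_args(cluster_id, backup_type, directory, interval, unit,
                         start_time=None, consistent=False):
    """Build the argument list for aws emr schedule-hbase-backup."""
    if backup_type not in ("full", "incremental"):
        raise ValueError("backup_type must be 'full' or 'incremental'")
    if unit not in ("minutes", "hours", "days"):  # assumed set of units
        raise ValueError("unit must be 'minutes', 'hours', or 'days'")
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    args = [
        "aws", "emr", "schedule-hbase-backup",
        "--cluster-id", cluster_id,
        "--type", backup_type,
        "--dir", directory,
        "--interval", str(interval),
        "--unit", unit,
    ]
    if start_time is not None:
        args += ["--start-time", start_time]
    if consistent:
        args.append("--consistent")
    return args


weekly_full = schedule_backup_args(
    "j-3AEXXXXXX16F2", "full", "s3://mybucket/backups/j-3AEXXXXXX16F2",
    7, "days", start_time="2014-06-15T20:00Z", consistent=True)
print(" ".join(weekly_full))
```

The printed line matches the weekly full backup command shown above. The list could be passed to subprocess.run on a machine where the AWS CLI is installed and configured.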

To disable HBase backups using the AWS CLI

To disable HBase backups, type the disable-hbase-backups subcommand with the --cluster-id parameter. You can retrieve the cluster ID using the console or the list-clusters subcommand. When disabling backups, identify the backup type to disable: --full or --incremental.

  • Type the following command to disable full backups. Replace j-3AEXXXXXX16F2 with your cluster ID.

    aws emr disable-hbase-backups --cluster-id j-3AEXXXXXX16F2 --full

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To restore HBase backup data to a running cluster using the AWS CLI

To restore HBase backup data to a running cluster, type the restore-from-hbase-backup subcommand with the --cluster-id parameter. You must provide the backup directory and, optionally, the backup version, which names an existing backup to restore. Backup versions are in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If the backup version is not specified, Amazon EMR uses the latest backup, as determined by lexicographical order.

  • To restore HBase backup data to a running cluster, type the following command. Replace j-3AEXXXXXX16F2 with your cluster ID and replace mybucket with your Amazon S3 bucket name.

    aws emr restore-from-hbase-backup --cluster-id j-3AEXXXXXX16F2 --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --backup-version 20120809T031314Z

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
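Because version names are fixed-width timestamps, lexicographical order and chronological order coincide, so the latest backup is simply the largest string. The sketch below shows that selection on a hypothetical list of version names; it does not query Amazon S3.

```python
def latest_backup_version(versions):
    """Return the version name that sorts last lexicographically."""
    if not versions:
        raise ValueError("no backup versions found")
    return max(versions)


versions = ["20120809T031314Z", "20120810T010000Z", "20111231T235959Z"]
print(latest_backup_version(versions))  # 20120810T010000Z
```

This equivalence holds only while every name uses the same YYYYMMDDTHHMMSSZ layout. A name in a different format could sort ahead of or behind the timestamps regardless of when it was created.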

To populate a new cluster with HBase backup data using the AWS CLI

To populate a new cluster with HBase backup data, type the create-cluster subcommand with the --restore-from-hbase-backup parameter. You must provide the backup directory and, optionally, the backup version, which names an existing backup to restore. Backup versions are in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If the backup version is not specified, Amazon EMR uses the latest backup, as determined by lexicographical order.

  • Type the following command to create a cluster with HBase installed and to load HBase with the backup data in s3://mybucket/backups/j-3AEXXXXXX16F2. Replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
      --applications Name=Hue Name=Hive Name=Pig Name=HBase \
      --restore-from-hbase-backup Dir=s3://mybucket/backups/j-3AEXXXXXX16F2,BackupVersion=20120809T031314Z \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type c1.xlarge --instance-count 3 --termination-protected
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig Name=HBase --restore-from-hbase-backup Dir=s3://mybucket/backups/j-3AEXXXXXX16F2,BackupVersion=20120809T031314Z --use-default-roles --ec2-attributes KeyName=myKey --instance-type c1.xlarge --instance-count 3 --termination-protected

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
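Two details of the create-cluster command above can be sketched in code: the value passed to --restore-from-hbase-backup and the way an instance count is divided between master and core nodes. Both helper names are hypothetical, and the code only builds values; it does not call the AWS CLI.

```python
def restore_option(directory, backup_version=None):
    """Build the Dir=...,BackupVersion=... value; the version is optional."""
    value = f"Dir={directory}"
    if backup_version is not None:
        value += f",BackupVersion={backup_version}"
    return value


def node_layout(instance_count):
    """One master node; every remaining instance is a core node."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {"master": 1, "core": instance_count - 1}


print(restore_option("s3://mybucket/backups/j-3AEXXXXXX16F2", "20120809T031314Z"))
# Dir=s3://mybucket/backups/j-3AEXXXXXX16F2,BackupVersion=20120809T031314Z
print(node_layout(3))
# {'master': 1, 'core': 2}
```

With --instance-count 3, as in the command above, the cluster has one master node and two core nodes.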