Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Back Up and Restore HBase

Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. Once you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster prepopulated with backed-up data.

During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data will not be included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included.
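The timing problem described above can be made concrete with a toy model. The following sketch is an illustration only and does not reflect how Amazon EMR actually polls nodes; the function name in_backup and the integer "ticks" are hypothetical. It assumes that a write is captured only if it reaches its node before that node is polled.

```shell
# Toy model of an inconsistent backup (illustration only, not the EMR implementation).
# Assumption: a write is captured only if it reaches its node before the node is polled.

# in_backup WRITE_TIME NODE_POLL_TIME
# Prints "included" or "missed". Times are hypothetical integer ticks.
in_backup() {
  if [ "$1" -lt "$2" ]; then
    echo "included"
  else
    echo "missed"
  fi
}

# Node A is polled at tick 1 and node B at tick 4.
# An earlier write (tick 2) goes to node A; a later write (tick 3) goes to node B.
echo "write at tick 2 to node A: $(in_backup 2 1)"
echo "write at tick 3 to node B: $(in_backup 3 4)"
```

In this model the earlier write is missed because node A was already polled, while the later write is included because node B had not yet been polled.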

If you require a consistent backup, you must pause writes to HBase during the initial portion of the backup process, when data is synchronized across nodes. To do this, specify the --consistent flag when you request a backup. With this flag set, writes that arrive during this period are queued and then executed as soon as synchronization completes. You can also schedule recurring backups, which resolve inconsistencies over time, because data that is missed on one backup pass is backed up on the following pass.

When you back up HBase data, specify a different backup directory for each cluster. An easy way to do this is to include the cluster identifier in the backup directory path, for example, s3://mybucket/backups/j-ABABABABABA. This ensures that any future incremental backups reference the correct HBase cluster.
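As a minimal sketch of this naming convention, the following shell helper builds a per-cluster backup path. The function name backup_dir_for and the fixed "backups" prefix are assumptions made for illustration; any layout that keeps each cluster in its own directory serves the same purpose.

```shell
# backup_dir_for BUCKET CLUSTER_ID
# Prints an Amazon S3 backup path that embeds the cluster identifier,
# so that each cluster writes to its own directory.
# The helper name and the "backups" prefix are hypothetical.
backup_dir_for() {
  printf 's3://%s/backups/%s\n' "$1" "$2"
}

backup_dir_for mybucket j-ABABABABABA
```

The printed value can then be passed wherever a backup directory is expected.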

When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files.

The HBase backup process uses S3DistCp for the copy operation, which has certain limitations regarding temporary file storage space. For more information, see Distributed Copy Using S3DistCp.

Back Up and Restore HBase Using the Console

The console provides the ability to launch a new HBase cluster and populate it with data from a previous backup of an HBase cluster. It also gives you the ability to schedule periodic incremental backups of a new HBase cluster. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, perform manual backups, and schedule automated full backups, is available using the Amazon EMR CLI. For more information, see Back Up and Restore HBase Using the CLI.

To populate a new HBase cluster with archived data using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/vnext/.

  2. Click Create Cluster.

  3. In the Software Configuration section, in the Additional Applications field, choose HBase and click Configure and add.

  4. In the Add Application dialog box, select Restore From Backup. For more information, see Launch an HBase Cluster on Amazon EMR.

  5. In the Backup Location field, specify the location of the backup you want to load into the new HBase cluster. This should be an Amazon S3 URL of the form s3://myawsbucket/backups/.

  6. In the Backup Version field, optionally enter the name of the backup version to load. If you do not set a value for Backup Version, Amazon EMR loads the latest backup in the specified location.

  7. Click Add and proceed to create the cluster as described in Plan an Amazon EMR Cluster.

To schedule automated backups of HBase data using the console

  1. On the Create Cluster page, in the Software Configuration section, click Configure.

  2. Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, when data is synchronized across nodes. Write operations paused in this way are placed in a queue and resume when synchronization completes.

  3. Set how often backups should occur by entering a number for Backup Frequency and selecting Days, Hours, or Minutes from the drop-down box. The first automated backup that runs is a full backup; after that, Amazon EMR saves incremental backups based on the schedule you specify.

  4. Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.

  5. Specify when the first backup should occur by setting a value for Backup Start Time. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z sets the start time to June 15, 2012 at 8 p.m. UTC.

  6. Click Add Application.

  7. Click Done and proceed to create the cluster as described in Plan an Amazon EMR Cluster.
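The start time format used in the procedure above can be checked or generated from a shell. The following sketch is not part of the Amazon EMR tooling; the function name is_start_time is hypothetical, and the pattern covers only the shape shown in the example (YYYY-MM-DDTHH:MMZ) plus the literal value now. Amazon EMR may accept other ISO forms that this check would reject.

```shell
# is_start_time VALUE
# Succeeds if VALUE is "now" or has the shape YYYY-MM-DDTHH:MMZ,
# for example 2012-06-15T20:00Z. Only the shape of the string is checked,
# not whether the date actually exists. The helper name is hypothetical.
is_start_time() {
  [ "$1" = "now" ] && return 0
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}Z$'
}

# Print the current time in UTC in the same shape, using the standard date utility.
date -u +%Y-%m-%dT%H:%MZ
```

For example, is_start_time 2012-06-15T20:00Z succeeds, while is_start_time "June 15, 2012" fails.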

Back Up and Restore HBase Using the CLI

Running HBase on Amazon EMR provides many ways to back up your data: you can create full or incremental backups, run backups manually, and schedule automatic backups. The following table lists all the flags and parameters you can set to back up HBase data. The examples that follow the table show how to use these flags and parameters to back up data in various ways.

Parameter
    Description

--backup-dir
    The directory where a backup exists or should be created.

--backup-version
    (Optional) Specifies the version number of an existing backup to restore. If the backup version is not specified in a restore operation, Amazon EMR uses the latest backup, as determined by lexicographical order. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z.

--consistent
    (Optional) Pauses all write operations to the HBase cluster during the backup process, to ensure a consistent backup.

--disable-full-backups
    Turn off scheduled full backups by passing this flag into a call with --hbase-schedule-backup.

--disable-incremental-backups
    Turn off scheduled incremental backups by passing this flag into a call with --hbase-schedule-backup.

--full-backup-time-interval
    An integer that specifies the period of time units to elapse between automated full backups of the HBase cluster. Used with --hbase-schedule-backup this parameter creates regularly scheduled full backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --full-backup-time-unit.

--full-backup-time-unit
    The unit of time to use with --full-backup-time-interval to specify how often automatically scheduled backups should run. This can take any one of the following values: minutes, hours, days.

--hbase-backup
    Create a one-time backup of HBase data to the location specified by --backup-dir.

--hbase-restore
    Restore a backup from the location specified by --backup-dir and (optionally) the version specified by --backup-version.

--hbase-schedule-backup
    Schedule an automated backup of HBase data. This can set an incremental backup, a full backup, or both, depending on the flags used to set the intervals and time units. The first backup in the schedule begins immediately unless a value is specified by --start-time.

--incremental-backup-time-interval
    An integer that specifies the period of time units to elapse between automated incremental backups of the HBase cluster. Used with --hbase-schedule-backup this parameter creates regularly scheduled incremental backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --incremental-backup-time-unit.

--incremental-backup-time-unit
    The unit of time to use with --incremental-backup-time-interval to specify how often automatically scheduled incremental backups should run. This can take any one of the following values: minutes, hours, days.

--start-time
    (Optional) Specifies the time that a backup schedule should start. If this is not set, the first backup begins immediately. This should be in ISO date-time format. You can use this to ensure your first data load process has completed before performing the initial backup or to have the backup occur at a specific time each day.

To manually back up HBase data

  • Run --hbase-backup in the CLI and specify the cluster and the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

    The following example backs up the HBase data to s3://myawsbucket/backups/j-ABABABABABA, with the timestamp as the version name. This backup does not pause writes to the HBase cluster and, as such, might be inconsistent.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA

    The following example backs up data and uses the --consistent flag to enforce backup consistency. This flag causes all writes to the HBase cluster to pause during the initial synchronization portion of the backup.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --consistent
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup --backup-dir s3://myawsbucket/backups/j-ABABABABABA --consistent
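To illustrate the version name format described in this procedure, the following sketch prints the current UTC time in the YYYYMMDDTHHMMSSZ shape and checks whether a given name has that shape. It only mimics the format; the exact moment at which Amazon EMR samples the clock is not specified here, and the function names version_name_now and looks_like_version are hypothetical.

```shell
# version_name_now
# Prints the current UTC time as YYYYMMDDTHHMMSSZ, the same shape as
# the example version name 20120809T031314Z. The helper name is hypothetical.
version_name_now() {
  date -u +%Y%m%dT%H%M%SZ
}

# looks_like_version NAME
# Succeeds if NAME has the YYYYMMDDTHHMMSSZ shape (digits are not range-checked).
looks_like_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{8}T[0-9]{6}Z$'
}

version_name_now
```

A check such as looks_like_version can help distinguish timestamp-based version names from location names you chose yourself.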

To schedule automated backups of HBase data

  1. Call --hbase-schedule-backup on the HBase cluster and specify the backup time interval and units. If you do not specify a start time, the first backup starts immediately. The following example creates a weekly full backup, with the first backup starting immediately.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --full-backup-time-interval 7 --full-backup-time-unit days \
      --backup-dir s3://mybucket/backups/j-ABABABABABA
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA

    The following example creates a weekly full backup, with the first backup starting on June 15, 2012 at 8 p.m. UTC.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --full-backup-time-interval 7 --full-backup-time-unit days \
      --backup-dir s3://mybucket/backups/j-ABABABABABA \
      --start-time 2012-06-15T20:00Z
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

    The following example creates a daily incremental backup. The first incremental backup will begin immediately.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --incremental-backup-time-interval 24 \
      --incremental-backup-time-unit hours \
      --backup-dir s3://mybucket/backups/j-ABABABABABA 
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA

    The following example creates a daily incremental backup, with the first backup starting on June 15, 2012 at 8 p.m. UTC.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --incremental-backup-time-interval 24 \
      --incremental-backup-time-unit hours \
      --backup-dir s3://mybucket/backups/j-ABABABABABA \
      --start-time 2012-06-15T20:00Z
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

    The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting immediately. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --full-backup-time-interval 7 \
      --full-backup-time-unit days \
      --incremental-backup-time-interval 24 \
      --incremental-backup-time-unit hours \
      --backup-dir s3://mybucket/backups/j-ABABABABABA 
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA 

    The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --full-backup-time-interval 7 \
      --full-backup-time-unit days \
      --incremental-backup-time-interval 24 \
      --incremental-backup-time-unit hours \
      --backup-dir s3://mybucket/backups/j-ABABABABABA \
      --start-time 2012-06-15T20:00Z
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z

    The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run. The --consistent flag is set, so both the incremental and full backups will pause write operations during the initial portion of the backup process to ensure data consistency.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup \
      --full-backup-time-interval 7 \
      --full-backup-time-unit days \
      --incremental-backup-time-interval 24 \
      --incremental-backup-time-unit hours \
      --backup-dir s3://mybucket/backups/j-ABABABABABA \
      --start-time 2012-06-15T20:00Z \
      --consistent
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://mybucket/backups/j-ABABABABABA --start-time 2012-06-15T20:00Z --consistent
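The rule that a full backup takes precedence when both backups fall at the same time can be sketched with a small model. The following is a simplified illustration, not the Amazon EMR scheduler: it assumes a full backup every 7 days (168 hours) and an incremental backup every 24 hours, both counted from the same start time, and the function name backup_at_hour is hypothetical.

```shell
# backup_at_hour HOURS_SINCE_START
# Simplified model of a schedule with a full backup every 7 days (168 hours)
# and an incremental backup every 24 hours, both counted from the same start.
# Prints "full", "incremental", or "none" for the given hour.
backup_at_hour() {
  if [ $(( $1 % 168 )) -eq 0 ]; then
    echo "full"          # a coinciding incremental backup is skipped
  elif [ $(( $1 % 24 )) -eq 0 ]; then
    echo "incremental"
  else
    echo "none"
  fi
}

# Print the first eight days of the modeled schedule, one line per day.
day=0
while [ "$day" -le 7 ]; do
  echo "day $day: $(backup_at_hour $(( day * 24 )))"
  day=$(( day + 1 ))
done
```

In this model, days 0 and 7 show only a full backup, and days 1 through 6 show an incremental backup.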

To turn off automated backups

  • Run the CLI against the cluster with the --hbase-schedule-backup parameter and set the --disable-full-backups flag, the --disable-incremental-backups flag, or both. The following example turns off full backups.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-full-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups

    The following example turns off incremental backups.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-incremental-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-incremental-backups

    The following example turns off both full and incremental backups.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA \
      --hbase-schedule-backup --disable-full-backups \
      --disable-incremental-backups
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-schedule-backup --disable-full-backups --disable-incremental-backups

To restore data to a running HBase cluster

  • Run an --hbase-restore step and specify the jobflow, the backup location in Amazon S3, and (optionally) the name of the backup version. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This is the version with the name that is lexicographically greatest.

    The following example restores the HBase cluster to the latest version of backup data stored in s3://myawsbucket/backups/j-ABABABABABA, overwriting any data stored in the HBase cluster.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA

    The following example restores the HBase cluster to the specified version of backup data stored in s3://myawsbucket/backups/j-ABABABABABA, overwriting any data stored in the HBase cluster.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --backup-version 20120809T031314Z
    • Windows users:

      ruby elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version 20120809T031314Z
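To show what "lexicographically greatest" means for version names, the following sketch picks the last name in plain byte order from a list. It is an illustration of the ordering only and does not query Amazon S3; the function name latest_version and the sample version names other than 20120809T031314Z are hypothetical.

```shell
# latest_version
# Reads backup version names from standard input, one per line, and prints
# the lexicographically greatest one. LC_ALL=C forces plain byte-order sorting.
# The helper name is hypothetical.
latest_version() {
  LC_ALL=C sort | tail -n 1
}

# With timestamp-based names, the lexicographically greatest name is also the most recent.
printf '%s\n' 20120809T031314Z 20120810T000000Z 20120101T120000Z | latest_version
```

Because the names follow the YYYYMMDDTHHMMSSZ format, sorting them as text gives the same order as sorting them by time.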

To populate a new HBase cluster with archived data

  • Add --hbase-restore and --backup-dir to the --create step in the CLI.

    You can optionally specify --backup-version to indicate which version in the backup directory to load. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This will either be the version with the name that is lexicographically last or, if the version names are based on timestamps, the latest version.

    The following example creates a new HBase cluster and loads it with the latest version of data in s3://myawsbucket/backups/j-ABABABABABA.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "My HBase Restored" \
      --hbase --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA 
    • Windows users:

      ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA 

    This example creates a new HBase cluster and loads it with the specified version of data in s3://myawsbucket/backups/j-ABABABABABA.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --name "My HBase Restored" \
      --hbase --hbase-restore \
      --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
      --backup-version 20120809T031314Z
    • Windows users:

      ruby elastic-mapreduce --create --name "My HBase Restored" --hbase --hbase-restore --backup-dir s3://myawsbucket/backups/j-ABABABABABA --backup-version 20120809T031314Z
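The optional nature of --backup-version can be sketched with a small dry-run helper that prints the command line instead of running it. This is an illustration only: the function name restore_cluster_cmd is hypothetical, the cluster name is the one used in the examples above, and the helper uses only flags that appear in this section.

```shell
# restore_cluster_cmd BACKUP_DIR [BACKUP_VERSION]
# Prints (does not run) a CLI command that creates a new HBase cluster
# from a backup. --backup-version is appended only when a version is given.
# The helper name is hypothetical; the flags are those shown in this section.
restore_cluster_cmd() {
  cmd="./elastic-mapreduce --create --name \"My HBase Restored\" --hbase --hbase-restore --backup-dir $1"
  if [ -n "${2:-}" ]; then
    cmd="$cmd --backup-version $2"
  fi
  printf '%s\n' "$cmd"
}

restore_cluster_cmd s3://myawsbucket/backups/j-ABABABABABA
restore_cluster_cmd s3://myawsbucket/backups/j-ABABABABABA 20120809T031314Z
```

Printing the command first makes it easy to review which version will be requested before you launch the cluster.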