Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Install HBase on an Amazon EMR Cluster

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud: easy scaling, low cost, pay-as-you-go pricing, and ease of use. The Amazon EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase.

The following procedure shows how to launch an HBase cluster with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase.

Note

HBase configuration can only be done at launch time.

To launch a cluster and install HBase using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. On the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster Configuration
    Field / Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination. Typically, set this value to Yes only when developing an application (so you can debug errors that would have otherwise terminated the cluster) and to protect long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console creates it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.

  4. In the Software Configuration section, verify the fields according to the following table.

    Software Configuration
    Field / Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.x AMI or the latest Hadoop 1.x AMI from the list.

    The AMI you choose determines the specific version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose an Amazon Machine Image (AMI).

  5. Under the Additional Applications list, choose HBase and click Configure and add.

  6. In the Add Application section, indicate whether you want to pre-load the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase.

    Field / Action
    Restore from backup

    Specify whether to pre-load the HBase cluster with data stored in Amazon S3.

    Backup location

    Specify the URI where the backup to restore from resides in Amazon S3.

    Backup version

    Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.

    Schedule Regular Backups

    Specify whether to schedule automatic incremental backups. The first backup is a full backup that creates a baseline for future incremental backups.

    Consistent backup

    Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, in which data is synchronized across nodes. Paused write operations are placed in a queue and resume when synchronization completes.

    Backup frequency

    The number of Days/Hours/Minutes between scheduled backups.

    Backup location

    The Amazon S3 URI where backups will be stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.

    Backup start time

    Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z sets the start time to June 15, 2012 at 8 p.m. UTC.

  7. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters, the total number of nodes running for both clusters must be 20 or fewer. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Hardware Configuration
    Field / Action
    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets in the Amazon VPC User Guide.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

    Master

    Accept the default instance type.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance type to use for the master node.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Core

    Accept the default instance type.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes.

    The default instance type is m1.medium for Hadoop 2.x. This instance type is suitable for testing, development, and light workloads.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Task

    Accept the default instance type.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes.

    For more information on instance types supported by Amazon EMR, see Virtual Server Configurations. For more information on Amazon EMR instance groups, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

  8. In the Security and Access section, complete the fields according to the following table.

    Security and Access
    Field / Action
    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    EMR role

    Accept the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

    This role allows Amazon EMR to access other AWS services on your behalf.

    For more information, see Configure IAM Roles for Amazon EMR.

    EC2 instance profile

    You can proceed without choosing an instance profile by accepting the default option - No roles found. Alternatively, click Create Default Role > Create Role to generate a default EMR role.

    This controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.

  9. Leave the Bootstrap Actions section unchanged; no bootstrap actions are necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional).

  10. In the Steps section, you do not need to change any of these settings.

  11. Review your configuration and, if you are satisfied with the settings, click Create Cluster.

  12. When the cluster starts, the console displays the Cluster Details page.
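
The Backup start time field in step 6 accepts either now or a date and time in ISO format, such as 2012-06-15T20:00Z. As a minimal sketch, the following shell helpers produce a UTC timestamp in that shape and check a candidate value before you paste it into the console. The function names are hypothetical, and the pattern mirrors only the single example given in the table; the console may accept other ISO variants.

```shell
# Hypothetical helpers; not part of Amazon EMR or the AWS CLI.

# Print the current UTC time in the same shape as the example 2012-06-15T20:00Z.
backup_start_now_utc() {
  date -u +%Y-%m-%dT%H:%MZ
}

# Succeed if the value is "now" or has the shape YYYY-MM-DDTHH:MMZ.
# Assumption: only the form shown in the example is checked, and the
# check is purely textual (it does not reject dates such as 2012-13-45).
is_backup_start_time() {
  if [ "$1" = "now" ]; then
    return 0
  fi
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}Z$'
}

is_backup_start_time "2012-06-15T20:00Z" && echo "accepted"
is_backup_start_time "June 15, 2012 8pm" || echo "rejected"
```

The check is deliberately strict so that a free-form date is caught before it reaches the console form.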

To launch a cluster and install HBase using the AWS CLI

You can install HBase on a cluster using the AWS CLI by typing the create-cluster subcommand with the --applications parameter. When using the --applications parameter, you identify the application you want to install via the Name argument.

  • To install HBase when a cluster is launched, type the following command:

    aws emr create-cluster --applications Name=string --ami-version string \
    --instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
    --no-auto-terminate --termination-protected

    For example, to install only HBase on your cluster, type:

    aws emr create-cluster --applications Name=HBase --ami-version 3.1.1 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c1.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=c1.xlarge \
    --no-auto-terminate --termination-protected

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
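
Because the default limit is 20 nodes per AWS account, it can help to total the InstanceCount values before running create-cluster. The following is a rough sketch, not an AWS CLI feature: the helper name and the NODE_LIMIT variable are hypothetical, the value 20 assumes no limit increase has been granted, and the check covers only the cluster being launched, not nodes already running in other clusters.

```shell
# Hypothetical helper; not part of the AWS CLI.
# Sum every InstanceCount=N value found in --instance-groups style arguments.
total_instance_count() {
  total=0
  for group in "$@"; do
    # Pull the number that follows InstanceCount= out of one group definition.
    count=$(printf '%s\n' "$group" | sed -n 's/.*InstanceCount=\([0-9][0-9]*\).*/\1/p')
    total=$((total + ${count:-0}))
  done
  echo "$total"
}

# Default per-account limit stated in the console procedure.
# Assumption: no limit increase has been granted for the account.
NODE_LIMIT=20

# The same instance groups as the example command above.
requested=$(total_instance_count \
  "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c1.xlarge" \
  "InstanceGroupType=CORE,InstanceCount=2,InstanceType=c1.xlarge")

if [ "$requested" -le "$NODE_LIMIT" ]; then
  echo "OK: $requested of $NODE_LIMIT nodes requested"
else
  echo "Too many nodes: $requested exceeds $NODE_LIMIT" >&2
fi
```

For the example instance groups, the total is one master node plus two core nodes.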

To launch a cluster and install HBase using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Specify the --hbase parameter when you launch a cluster using the CLI.

    The following example shows how to launch a cluster running HBase from the CLI. We recommend that you run at least two instances in the HBase cluster. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.

    The CLI implicitly launches the HBase cluster with keep alive and termination protection set.

    In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "HBase Cluster" \
      --num-instances 2 \
      --instance-type c1.xlarge
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "HBase Cluster" --num-instances 2 --instance-type c1.xlarge
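
Because --instance-type must be one of the instance types listed above, a wrapper script could check the value before calling the Amazon EMR CLI. The following is a minimal sketch under that assumption: the function name is hypothetical, the list is copied from this section and may not match what a given CLI version accepts, and regional availability is not checked.

```shell
# Hypothetical helper; not part of the Amazon EMR CLI.
# Succeed only if the argument is one of the instance types this section
# lists as valid for --instance-type when launching HBase.
is_hbase_instance_type() {
  case "$1" in
    m1.large|m1.xlarge|c1.xlarge|m2.2xlarge|m2.4xlarge|cc2.8xlarge|hi1.4xlarge|hs1.8xlarge)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

instance_type="c1.xlarge"
if is_hbase_instance_type "$instance_type"; then
  echo "$instance_type is in the supported list"
else
  echo "$instance_type is not in the supported list" >&2
fi
```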