Amazon EMR
Developer Guide

Launch the Cluster

The next step is to launch the cluster. This tutorial provides the steps to launch a long-running cluster using either the Amazon EMR console or the AWS CLI. Choose the method that best meets your needs. When you launch the cluster, Amazon EMR provisions EC2 instances (virtual servers) to perform the computation. These instances are launched from an Amazon Machine Image (AMI) that has been customized for Amazon EMR and has Hadoop and other big data applications preinstalled.

To add Impala to a cluster using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster.

  3. On the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Field Action
    Cluster name

    Enter a descriptive name for your cluster or leave the default name "My cluster."

    The name is optional, and does not need to be unique.

    Termination protection

    Leave the default option selected: Yes.

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Managing Cluster Termination. Typically, you set this value to Yes when developing an application (so you can debug errors that would have otherwise terminated the cluster), to protect long-running clusters, or to preserve data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Type or browse to an Amazon S3 path to store your debug logs if you enabled logging in the previous field. You may also allow the console to generate an Amazon S3 path for you. If the log folder does not exist, the Amazon EMR console creates it.

    When Amazon S3 log archiving is enabled, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

  4. In the Software Configuration section, verify the fields according to the following table.

    Field Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose the latest Hadoop 2.4.0 AMI.

    This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. Impala requires an Amazon EMR AMI that has Hadoop 2.x or newer. For more information, see Choose an Amazon Machine Image (AMI).

  5. Under the Additional Applications list, choose Impala and click Configure and add.

    Note

    If you do not see Impala in the list, ensure that you have chosen a Hadoop 2.4.0 AMI.

  6. In the Add Application section, use the following table for guidance on making your selections.

    Field Action
    Version

    Choose the Impala version from the list, such as 1.2.4.

    Arguments

    Optionally, specify command line arguments for Impala to execute. For examples of Impala command line arguments, see the --impala-conf section at AWS EMR Command Line Interface Options (Deprecated).

  7. Click Add.

  8. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or fewer. Exceeding this limit results in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Field Action
    Network

    Choose Launch into EC2-Classic.

    Optionally, choose a VPC subnet identifier from the list to launch the cluster in an Amazon VPC. For more information, see Plan and Configure Networking.

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific Amazon EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide for Linux Instances.

    Master

    Choose m3.xlarge. This specifies the EC2 instance type to use for the master node.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    For more information, see Create a Cluster with Instance Fleets or Uniform Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run the master node on a Spot Instance. For more information, see When Should You Use Spot Instances?.

    Core

    Choose m1.large. This specifies the EC2 instance type to use for the core nodes.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    For more information, see Create a Cluster with Instance Fleets or Uniform Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see When Should You Use Spot Instances?.

    Task

    Accept the default option. This specifies the EC2 instance types to use as task nodes.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    For more information, see Create a Cluster with Instance Fleets or Uniform Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0. You do not use task nodes for this tutorial.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see When Should You Use Spot Instances?.

  9. In the Security and Access section, complete the fields according to the following table.

    Field Action
    EC2 key pair

    Choose your Amazon EC2 key pair from the list.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH.

    IAM user access

    Choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions Using IAM Roles.

    Alternatively, choose No other IAM users to restrict access to the current IAM user.

    Roles configuration

    Choose Default to generate the default Amazon EMR role and Amazon EC2 instance profile. If the default roles exist, they are used for your cluster. If they do not exist, they are created (assuming you have proper permissions). You may also choose View policies for default roles to view the default role properties. Alternatively, if you have custom roles, you can choose Custom and choose your roles. An Amazon EMR role and Amazon EC2 instance profile are required when creating a cluster using the console.

    The role allows Amazon EMR to access other AWS services on your behalf. The Amazon EC2 instance profile controls application access to the Amazon EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR and Applications.

  10. In the Bootstrap Actions section, leave the defaults. No bootstrap actions are necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.

  11. In the Steps section, leave the default settings unchanged.

  12. Review your configuration and, if you are satisfied with the settings, choose Create Cluster.

  13. When the cluster starts, the console displays the Cluster Details page.
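
For readers who want to script the console choices above, the following is a minimal sketch of how they might map onto aws emr create-cluster options. It assumes the --termination-protected, --log-uri, and --instance-groups options of the AWS CLI, and it reuses the AMI version and application list from the CLI example in this tutorial. The bucket path s3://mybucket/logs/ and the variables RUN_AWS and OTHER_NODES are hypothetical names introduced only for illustration. The node-limit check is simple arithmetic based on the note in step 8, not a query against your account. By default the script only prints the command.

```shell
#!/bin/sh
# Sketch only: CLI counterpart of the console settings in the procedure above.
# Hypothetical placeholders: myKey, s3://mybucket/logs/, RUN_AWS, OTHER_NODES.

MASTER_COUNT=1                    # there is always one master node
CORE_COUNT=2                      # core node count chosen in step 8
TASK_COUNT=0                      # this tutorial uses no task nodes
NODE_LIMIT=20                     # default per-account limit from the note
OTHER_NODES="${OTHER_NODES:-0}"   # nodes already running in other clusters

TOTAL_NODES=$((MASTER_COUNT + CORE_COUNT + TASK_COUNT))

# Arithmetic check against the default limit (illustrative, not authoritative).
if [ $((TOTAL_NODES + OTHER_NODES)) -gt "$NODE_LIMIT" ]; then
  echo "Would exceed the default limit of $NODE_LIMIT nodes" >&2
  LIMIT_OK=no
else
  LIMIT_OK=yes
fi

# Build the command as positional parameters so quoting is preserved.
# No TASK group is listed because TASK_COUNT is 0.
set -- aws emr create-cluster \
  --name "My cluster" \
  --ami-version 3.3 \
  --applications Name=Hue Name=Hive Name=Pig Name=Impala \
  --termination-protected \
  --log-uri "s3://mybucket/logs/" \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
  "InstanceGroupType=MASTER,InstanceCount=${MASTER_COUNT},InstanceType=m3.xlarge" \
  "InstanceGroupType=CORE,InstanceCount=${CORE_COUNT},InstanceType=m1.large"

# Print the command; run it only when RUN_AWS=1 and the limit check passed.
printf '%s ' "$@"
printf '\n'
if [ "$LIMIT_OK" = yes ] && [ "${RUN_AWS:-0}" = 1 ]; then
  "$@"
fi
```

Running the script with RUN_AWS=1 set in the environment would submit the request. Leaving it unset lets you review the generated command first.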

To add Impala to a cluster using the AWS CLI

To add Impala to a cluster using the AWS CLI, type the create-cluster subcommand with the --applications parameter.

  • To install Impala on a cluster, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig Name=Impala \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig Name=Impala --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
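
To make the instance-count behavior described above concrete, here is a small sketch that expands an instance type and count into the equivalent --instance-groups arguments: one master node, with the remaining instances as core nodes of the same type. The function name expand_instance_count is hypothetical, and the sketch assumes the shorthand syntax accepted by the --instance-groups option. It only prints text and does not call AWS.

```shell
#!/bin/sh
# Sketch only: show how --instance-type/--instance-count map to instance groups.
# expand_instance_count is a hypothetical helper, not part of the AWS CLI.

# Usage: expand_instance_count INSTANCE_TYPE INSTANCE_COUNT
expand_instance_count() {
  eic_type=$1
  eic_count=$2
  if [ "$eic_count" -lt 1 ]; then
    echo "instance count must be at least 1" >&2
    return 1
  fi
  # One instance is always the master; the rest become core nodes.
  eic_core=$((eic_count - 1))
  printf 'InstanceGroupType=MASTER,InstanceCount=1,InstanceType=%s' "$eic_type"
  if [ "$eic_core" -gt 0 ]; then
    printf ' InstanceGroupType=CORE,InstanceCount=%s,InstanceType=%s' "$eic_core" "$eic_type"
  fi
  printf '\n'
}

# The tutorial's command uses m3.xlarge with a count of 3:
expand_instance_count m3.xlarge 3
```

For the tutorial's values this prints a MASTER group with one m3.xlarge instance followed by a CORE group with two m3.xlarge instances, which you could pass after --instance-groups in place of the --instance-type and --instance-count options.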