Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Select an Amazon VPC Subnet for the Cluster (Optional)

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables and network gateways.

The reasons to launch your cluster into a VPC include the following:

  • Processing sensitive data

    Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides.

  • Accessing resources on an internal network

    If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database.

For more information about Amazon VPC, see the Amazon VPC User Guide.

Clusters in a VPC

There are two platforms on which you can launch the EC2 instances of your cluster: EC2-Classic and EC2-VPC. In EC2-Classic, your instances run in a single, flat network that you share with other customers. In EC2-VPC, your instances run in a virtual private cloud (VPC) that's logically isolated to your AWS account. Your AWS account is capable of launching clusters into either the EC2-Classic or EC2-VPC platform, or only into the EC2-VPC platform, on a region-by-region basis. For more information about EC2-VPC, see Amazon Virtual Private Cloud (Amazon VPC).

Because access to and from the AWS cloud is a requirement of the cluster, you must connect an Internet gateway to the subnet hosting the cluster. If your application has components you do not want connected to the Internet gateway, you can launch those components in a private subnet you create within your VPC.

Clusters running in a VPC use two security groups, ElasticMapReduce-master and ElasticMapReduce-slave, which control access to the master and slave nodes. Both the slave and master nodes connect to Amazon S3 through the Internet gateway.

Note

When you launch instances in a VPC using Amazon EMR, several Elastic IP addresses are loaned to you by the system at no cost; however, these addresses are not visible because they are not created using your account.

The following diagram shows how an Amazon EMR cluster runs in a VPC. The cluster is launched within a subnet. The cluster is able to connect to other AWS resources, such as Amazon S3 buckets, through the Internet gateway.

[Figure: Cluster on a VPC]

The following diagram shows how to set up a VPC so that a cluster in the VPC can access resources in your own network, such as an Oracle database.

[Figure: Set up a VPC and cluster to access local VPN resources]

Setting Up a VPC to Host Clusters

Before you can launch clusters on a VPC, you must create a VPC, a subnet, and an Internet gateway. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters using the Amazon VPC console.

To create a subnet to run Amazon EMR clusters

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. In the navigation bar, select the region where you'll be running your cluster.

  3. Create a VPC by clicking Start VPC Wizard.

  4. Choose the VPC configuration by selecting one of the following options:

    • VPC with a Single Public Subnet - Select this option if the data used in the cluster is available on the Internet (for example, in Amazon S3 or Amazon RDS).

    • VPC with Public and Private subnets and Hardware VPN Access - Select this option if the data used in the cluster is stored in your network (for example, an Oracle database).

  5. Confirm the VPC settings.

    Configuring VPC settings
    • To work with Amazon EMR, the VPC must have both an Internet gateway and a subnet.

    • Use a private IP address space for your VPC to ensure proper DNS hostname resolution; otherwise, you may experience Amazon EMR cluster failures. The private IP address ranges are the following:

      • 10.0.0.0 - 10.255.255.255

      • 172.16.0.0 - 172.31.255.255

      • 192.168.0.0 - 192.168.255.255

    • Verify that Enable DNS hostnames is checked. You have the option to enable DNS hostnames when you create the VPC. To change the DNS hostnames setting, select your VPC in the VPC list, then click Edit in the details pane. To create a DNS entry that does not include a domain name, you need to create a DHCP options set, and then associate it with your VPC. You cannot edit the domain name using the console after the DHCP options set has been created.

      For more information, see Using DNS with Your VPC.

    • It is a best practice with Hadoop and related applications to ensure resolution of the fully qualified domain name (FQDN) for nodes. To ensure proper DNS resolution, configure a VPC that includes a DHCP options set whose parameters are set to the following values:

      • domain-name = ec2.internal

        Use ec2.internal if your region is us-east-1. For other regions, use region-name.compute.internal. For example, in us-west-2, use us-west-2.compute.internal. For the AWS GovCloud (US) Region, use us-gov-west-1.compute.internal.

      • domain-name-servers = AmazonProvidedDNS

      For more information, see DHCP Options Sets in the Amazon VPC User Guide.

  6. Click Create VPC. A dialog box confirms that the VPC has been successfully created. Click Close.
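The address-range and domain-name guidance in step 5 can be checked before you create the VPC. The following is a minimal Python sketch that uses only the standard library. The function names `is_private_vpc_cidr` and `expected_domain_name` are hypothetical helpers for illustration and are not part of any AWS tool.

```python
import ipaddress

# The three private IPv4 ranges listed in step 5, written as CIDR blocks.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]


def is_private_vpc_cidr(cidr):
    """Return True if the CIDR block lies entirely inside one private range."""
    network = ipaddress.ip_network(cidr, strict=False)
    if network.version != 4:
        return False
    return any(network.subnet_of(private) for private in PRIVATE_RANGES)


def expected_domain_name(region):
    """Return the DHCP options set domain-name value described in step 5."""
    if region == "us-east-1":
        return "ec2.internal"
    return region + ".compute.internal"


if __name__ == "__main__":
    print(is_private_vpc_cidr("10.0.0.0/16"))
    print(expected_domain_name("us-west-2"))
```

A block such as 172.32.0.0/16 falls just outside the 172.16.0.0 - 172.31.255.255 range, so the first helper rejects it.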

After the VPC is created, go to the Subnets page and note the identifier of one of the subnets of your VPC. You'll use this information when you launch the Amazon EMR cluster into the VPC.

Launching Clusters into a VPC

After you have a subnet that is configured to host Amazon EMR clusters, you launch clusters into that subnet by specifying the subnet identifier during cluster creation.

If the subnet does not have an Internet gateway, the cluster creation fails with the error: Subnet not correctly configured, missing route to an Internet gateway.
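You can look for a missing Internet gateway route before you launch. The sketch below inspects one route table in the shape returned by the Amazon EC2 DescribeRouteTables action. It assumes that a usable route is an active default route (0.0.0.0/0) whose target is an Internet gateway (an ID starting with igw-). The exact check that Amazon EMR performs is not described here, so treat this as an approximation. The function name is hypothetical.

```python
def has_internet_gateway_route(route_table):
    """Return True if the route table has an active default route to an igw-* target.

    `route_table` is one element of the "RouteTables" list in a
    DescribeRouteTables response.
    """
    for route in route_table.get("Routes", []):
        gateway_id = route.get("GatewayId") or ""
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        is_active = route.get("State", "active") == "active"
        if is_default and is_active and gateway_id.startswith("igw-"):
            return True
    return False


if __name__ == "__main__":
    # Sample data for illustration only; the IDs are placeholders.
    table = {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-EXAMPLE", "State": "active"},
        ]
    }
    print(has_internet_gateway_route(table))
```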

When the cluster is launched, Amazon EMR adds two security groups to the VPC: ElasticMapReduce-slave and ElasticMapReduce-master. By default, the ElasticMapReduce-master security group allows inbound SSH connections. The ElasticMapReduce-slave group does not. If you require SSH access for slave (core and task) nodes, you can add a rule to the ElasticMapReduce-slave security group. For more information about modifying security group rules, see Adding Rules to a Security Group in the Amazon EC2 User Guide for Linux Instances.
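As an illustration of adding an SSH rule to the ElasticMapReduce-slave security group, the following sketch builds the rule in the IpPermissions format accepted by the Amazon EC2 AuthorizeSecurityGroupIngress action. Using the boto3 SDK is an assumption of this example and is not covered in this guide. The function names, the group ID, and the source CIDR block are hypothetical placeholders.

```python
import ipaddress


def ssh_ingress_rule(source_cidr):
    """Build an inbound TCP port 22 rule for the given source CIDR block."""
    # Validate the CIDR early; raises ValueError if it is malformed.
    ipaddress.ip_network(source_cidr, strict=False)
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": source_cidr}],
    }


def allow_ssh_to_slave_group(group_id, source_cidr):
    """Apply the rule with boto3 (assumed to be installed and configured)."""
    import boto3  # imported here so that building the rule has no dependencies

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_rule(source_cidr)],
    )


if __name__ == "__main__":
    # Prints the rule only; call allow_ssh_to_slave_group() to apply it.
    print(ssh_ingress_rule("203.0.113.0/24"))
```

Restricting the source range to the addresses you actually connect from is generally safer than opening port 22 to 0.0.0.0/0.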

To manage the cluster on a VPC, Amazon EMR attaches a network device to the master node and manages it through this device. You can view this device using the Amazon EC2 API DescribeInstances. If you disconnect this device, the cluster will fail.
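To see the network devices attached to the master node, you can read the NetworkInterfaces list in a DescribeInstances response. The sketch below is a hypothetical helper that lists every interface on one instance. It does not identify which interface Amazon EMR uses for management, because this guide does not say how to distinguish it.

```python
def list_network_interfaces(describe_instances_response, instance_id):
    """Return (interface ID, device index) pairs for one instance.

    `describe_instances_response` is the dictionary returned by the
    Amazon EC2 DescribeInstances action.
    """
    interfaces = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("InstanceId") != instance_id:
                continue
            for eni in instance.get("NetworkInterfaces", []):
                attachment = eni.get("Attachment", {})
                interfaces.append(
                    (eni.get("NetworkInterfaceId"), attachment.get("DeviceIndex"))
                )
    return interfaces


if __name__ == "__main__":
    # Sample response for illustration only; the IDs are placeholders.
    response = {
        "Reservations": [{
            "Instances": [{
                "InstanceId": "i-EXAMPLE",
                "NetworkInterfaces": [
                    {"NetworkInterfaceId": "eni-EXAMPLE0", "Attachment": {"DeviceIndex": 0}},
                    {"NetworkInterfaceId": "eni-EXAMPLE1", "Attachment": {"DeviceIndex": 1}},
                ],
            }]
        }]
    }
    print(list_network_interfaces(response, "i-EXAMPLE"))
```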

After the cluster is created, it is able to access AWS services to connect to data stores, such as Amazon S3.

To launch a cluster into a VPC using the Amazon EMR console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Hardware Configuration section, in the Network field, choose the ID of a VPC network you created previously. When you choose a VPC, the EC2 Availability Zone field becomes an EC2 Subnet field.

    For more information, see What is Amazon VPC?.

  4. In the EC2 Subnet field, choose the ID of a subnet that you created previously.

  5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster.

To launch a cluster into a VPC using the AWS CLI

  • After your VPC is configured, you can launch Amazon EMR clusters in it by using the create-cluster subcommand with the --ec2-attributes parameter. Use the --ec2-attributes parameter to specify the VPC subnet for your cluster. This is illustrated in the following example, which creates a long-running cluster in the specified subnet:

    aws emr create-cluster --ami-version string --no-auto-terminate --ec2-attributes SubnetId=subnet-string \
    --instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer \
    InstanceGroupType=string,InstanceType=string,InstanceCount=integer

    For example:

    aws emr create-cluster --ami-version 3.2.0 --no-auto-terminate --ec2-attributes SubnetId=subnet-77XXXX03 \
    --instance-groups InstanceGroupType=MASTER,InstanceType=m1.large,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m1.large,InstanceCount=1

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
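The same launch can be expressed through the RunJobFlow API action. The following sketch builds a request that mirrors the AWS CLI example: a long-running cluster with one master and one core m1.large instance in the given subnet. Using the boto3 SDK is an assumption of this example. The function names and the cluster name are hypothetical, and the subnet ID is the placeholder used in this guide.

```python
def build_run_job_flow_request(name, ami_version, subnet_id, instance_type="m1.large"):
    """Build RunJobFlow parameters for a long-running cluster in a VPC subnet."""
    return {
        "Name": name,
        "AmiVersion": ami_version,
        "Instances": {
            # Ec2SubnetId plays the role of --ec2-attributes SubnetId=...
            "Ec2SubnetId": subnet_id,
            # Keeping the cluster alive corresponds to --no-auto-terminate.
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": instance_type, "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": instance_type, "InstanceCount": 1},
            ],
        },
    }


def launch_cluster(request):
    """Submit the request with boto3 (assumed to be installed and configured)."""
    import boto3  # imported here so that building the request has no dependencies

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Prints the request only; call launch_cluster() to create the cluster.
    print(build_run_job_flow_request("vpc-cluster-example", "3.2.0", "subnet-77XXXX03"))
```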

To launch a cluster into a VPC using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • After your VPC is configured, you can launch Amazon EMR clusters in it by using the --subnet argument with the subnet identifier. This is illustrated in the following example, which creates a long-running cluster in the specified subnet. In the directory where you installed the Amazon EMR CLI, type the following command.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --subnet subnet-77XXXX03
    • Windows users:

      ruby elastic-mapreduce --create --alive --subnet subnet-77XXXX03

Restricting Permissions to a VPC Using IAM

When you launch a cluster into a VPC, you can use AWS Identity and Access Management (IAM) to control access to clusters and restrict actions using policies, just as you would with clusters launched into EC2-Classic. For more information about how IAM works with Amazon EMR, see Using IAM.

You can also use IAM to control who can create and administer subnets. For more information about administering policies and actions in Amazon EC2 and Amazon VPC, see IAM Policies for Amazon EC2 in the Amazon EC2 User Guide for Linux Instances.

By default, all IAM users can see all of the subnets for the account, and any user can launch a cluster in any subnet.

You can restrict who can administer subnets while still allowing users to launch clusters into them. To do so, create one user account that has permissions to create and configure subnets, and a second user account that can launch clusters but can’t modify Amazon VPC settings.

To allow users to launch clusters in a VPC without the ability to modify the VPC

  1. Create the VPC and launch Amazon EMR into a subnet of that VPC using an account with permissions to administer Amazon VPC and Amazon EMR.

  2. Create a second user account with permissions to call the RunJobFlow, DescribeJobFlows, TerminateJobFlows, and AddJobFlowSteps actions in the Amazon EMR API. You should also create an IAM policy that allows this user to launch EC2 instances. An example of this is shown below.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateSecurityGroup",
            "ec2:CreateTags",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeInstances",
            "ec2:DescribeKeyPairs",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeRouteTables",
            "ec2:ModifyImageAttribute",
            "ec2:ModifyInstanceAttribute",
            "ec2:RequestSpotInstances",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Effect": "Allow",
          "Resource": "*"
        },
        {
          "Action": [
            "elasticmapreduce:AddInstanceGroups",
            "elasticmapreduce:AddJobFlowSteps",
            "elasticmapreduce:DescribeJobFlows",
            "elasticmapreduce:ModifyInstanceGroups",
            "elasticmapreduce:RunJobFlow",
            "elasticmapreduce:TerminateJobFlows"
          ],
          "Effect": "Allow",
          "Resource": "*"
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }
    

    Users with the IAM permissions set above are able to launch clusters within the VPC subnet, but are not able to change the VPC configuration.

    Note

    You should be cautious when granting ec2:TerminateInstances permissions because this action gives the recipient the ability to shut down any EC2 instance in the account, including those outside of Amazon EMR.
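Before attaching a policy like the one in step 2 to the second user, you may want to confirm that it does not allow any VPC administration actions. The following sketch scans the Allow statements of a policy document for a short list of Amazon EC2 actions that change VPC configuration. The list is illustrative and not exhaustive, the function name is hypothetical, and the check ignores Deny statements, NotAction elements, and conditions, so it is not a substitute for a full policy evaluation.

```python
import fnmatch
import json

# Illustrative (not exhaustive) actions that modify VPC configuration.
VPC_ADMIN_ACTIONS = [
    "ec2:CreateVpc",
    "ec2:DeleteVpc",
    "ec2:ModifyVpcAttribute",
    "ec2:CreateSubnet",
    "ec2:DeleteSubnet",
    "ec2:CreateRoute",
    "ec2:DeleteRoute",
    "ec2:AttachInternetGateway",
    "ec2:DetachInternetGateway",
]


def allowed_vpc_admin_actions(policy):
    """Return the sorted VPC admin actions matched by any Allow statement."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    found = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        patterns = statement.get("Action", [])
        if isinstance(patterns, str):
            patterns = [patterns]
        for pattern in patterns:
            for action in VPC_ADMIN_ACTIONS:
                # Action names are compared case-insensitively; "*" is a wildcard.
                if fnmatch.fnmatchcase(action.lower(), pattern.lower()):
                    found.add(action)
    return sorted(found)


if __name__ == "__main__":
    policy = json.loads(
        '{"Version": "2012-10-17", "Statement": [{"Action": '
        '["ec2:RunInstances", "ec2:DescribeSubnets"], '
        '"Effect": "Allow", "Resource": "*"}]}'
    )
    print(allowed_vpc_admin_actions(policy))
```

An empty result means none of the listed actions are allowed by the statements examined. A wildcard action such as ec2:* would match every entry in the list.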