Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Select an Amazon VPC Subnet for the Cluster (Optional)

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a private area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables and network gateways. For more information about Amazon VPC, see the Amazon Virtual Private Cloud User Guide.

There are two platforms on which you can launch the EC2 instances of your cluster: EC2-Classic and EC2-VPC. In EC2-Classic, your instances run in a single, flat network that you share with other customers. In EC2-VPC, your instances run in a virtual private cloud (VPC) that's logically isolated to your AWS account. Your AWS account is capable of launching clusters into either the EC2-Classic or EC2-VPC platform, or only into the EC2-VPC platform, on a region-by-region basis. For more information about EC2-VPC, see Amazon Virtual Private Cloud (Amazon VPC).

Reasons to launch your cluster on Amazon VPC include:

  • Processing sensitive data

    Launching a cluster on Amazon VPC is similar to launching the cluster on a private network with additional tools, such as routing tables and Network ACLs, for defining who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster on Amazon VPC provides.

  • Accessing resources on an internal network

    If your data store is located on a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster on an Amazon VPC and connect to your data center through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database on a private VPN, launching your cluster on an Amazon VPC connected to that VPN makes it possible for the cluster to access the Oracle database.

The following diagram illustrates how an Amazon EMR cluster runs on a VPC. The cluster is launched within a VPC subnet. Through the Internet gateway the cluster is able to contact resources on the AWS cloud such as Amazon S3 buckets.

Cluster on a VPC

Because access to and from the AWS cloud is a requirement of the cluster, you must connect an Internet gateway to the VPC subnet hosting the cluster. If your application has components you do not want connected to the Internet gateway, you can launch those components in other subnets you create within your VPC. In addition, because the cluster needs to access the AWS cloud, you cannot use Network Address Translation (NAT) when you are running Amazon EMR on a VPC.

Amazon EMR running on a VPC uses two security groups, ElasticMapReduce-master and ElasticMapReduce-slave, which control access to the master and slave nodes. Both the slave and master nodes connect to Amazon S3 through the Internet gateway.

Note

When you launch instances in a VPC using Amazon EMR, several Elastic IP addresses are loaned to you by the system at no cost; however, these addresses are not visible because they are not created using your account.

The following diagram shows how to set up a VPC in order for the cluster to access resources on a local VPN.

Set up a VPC and cluster to access local VPN resources

Note

For an Amazon EMR cluster to run inside a VPC, it must be able to connect to the AWS Cloud through an Internet gateway. You cannot use Network Address Translation (NAT) with the cluster.

Restricting Permissions with IAM on a VPC

When you launch a cluster on a VPC, you can use IAM to control access to clusters and restrict actions via policies just as you would with clusters launched on the AWS cloud. For more information about how IAM works with Amazon EMR, see Using IAM.

You can also use IAM to control who can create and administer VPC subnets. For more information about administering policies and actions, see Configuring User Permissions in Using IAM.

By default, all IAM users can see all of the VPC subnets for the account, and any user can launch a cluster in any subnet.

You can restrict who is able to administer the VPC subnet while still allowing users to launch clusters into VPC subnets. To do so, create one user account that has permissions to create and configure VPC subnets and a second user account that can launch clusters but can’t modify Amazon VPC settings.

To allow users to launch clusters in a VPC without the ability to modify the VPC

  1. Create the VPC and launch Amazon EMR into a subnet of that VPC using an account with permissions to administer Amazon VPC and Amazon EMR.

  2. Create a second user account with permissions to call the RunJobFlow, DescribeJobFlows, TerminateJobFlows, and AddJobFlowSteps actions in the Amazon EMR API. You should also create an IAM policy that allows this user to launch EC2 instances. An example of this is shown below.

    {
      "Statement": [
        {
          "Action": [
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateSecurityGroup",
            "ec2:CreateTags",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeInstances",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:ModifyImageAttribute",
            "ec2:ModifyInstanceAttribute",
            "ec2:RequestSpotInstances",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Effect": "Allow",
          "Resource": "*"
        },
        {
          "Action": [
            "elasticmapreduce:AddInstanceGroups",
            "elasticmapreduce:AddJobFlowSteps",
            "elasticmapreduce:DescribeJobFlows",
            "elasticmapreduce:ModifyInstanceGroups",
            "elasticmapreduce:RunJobFlow",
            "elasticmapreduce:TerminateJobFlows"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }

    Users with the IAM permissions set above are able to launch clusters within the VPC subnet, but are not able to change the VPC configuration.

    Note

    You should be cautious when granting ec2:TerminateInstances permissions because this action gives the recipient the ability to shut down any EC2 instance in the account, including those outside of Amazon EMR.
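
One way to sanity-check a policy like the one in step 2 is to confirm that none of its allowed actions would let the user change the VPC itself. The following Python sketch does that locally, with no AWS calls. The action lists are copied from the example policy; the `VPC_ADMIN_ACTIONS` list and both helper functions are illustrative assumptions, and the list is not an exhaustive inventory of VPC administration actions.

```python
import json
from fnmatch import fnmatchcase

# The example policy from step 2, expressed as a Python dict.
LAUNCH_POLICY = {
    "Statement": [
        {
            "Action": [
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CancelSpotInstanceRequests",
                "ec2:CreateSecurityGroup",
                "ec2:CreateTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:ModifyImageAttribute",
                "ec2:ModifyInstanceAttribute",
                "ec2:RequestSpotInstances",
                "ec2:RunInstances",
                "ec2:TerminateInstances",
            ],
            "Effect": "Allow",
            "Resource": "*",
        },
        {
            "Action": [
                "elasticmapreduce:AddInstanceGroups",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:DescribeJobFlows",
                "elasticmapreduce:ModifyInstanceGroups",
                "elasticmapreduce:RunJobFlow",
                "elasticmapreduce:TerminateJobFlows",
            ],
            "Effect": "Allow",
            "Resource": "*",
        },
    ]
}

# Assumption: a small, non-exhaustive sample of actions that modify a VPC.
VPC_ADMIN_ACTIONS = [
    "ec2:CreateVpc",
    "ec2:DeleteVpc",
    "ec2:CreateSubnet",
    "ec2:DeleteSubnet",
    "ec2:AttachInternetGateway",
    "ec2:CreateRoute",
]


def allowed_actions(policy):
    """Collect every action (or wildcard pattern) from Allow statements."""
    actions = []
    for statement in policy["Statement"]:
        if statement.get("Effect") == "Allow":
            value = statement["Action"]
            actions.extend([value] if isinstance(value, str) else value)
    return actions


def grants_vpc_admin(policy):
    """Return True if any allowed pattern matches a sampled VPC admin action."""
    patterns = allowed_actions(policy)
    return any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in patterns
        for action in VPC_ADMIN_ACTIONS
    )


if __name__ == "__main__":
    print(json.dumps(LAUNCH_POLICY, indent=2))
    print("Grants VPC administration:", grants_vpc_admin(LAUNCH_POLICY))
```

The check only looks at Allow statements and treats `*` as a wildcard, so a broader policy such as `"ec2:*"` would be flagged. It does not evaluate Deny statements or resource restrictions.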

Setting up a VPC to Host Clusters

Before you can launch clusters on a VPC, you must create a VPC, a VPC subnet, and an Internet gateway. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters using the Amazon VPC console.

To create a VPC subnet to run Amazon EMR clusters

  1. Sign in to the AWS Management Console and open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. Create a VPC by clicking Get started creating a VPC. Make sure that the Region box is set to the same region where you'll be running your cluster. In this example, we're creating a VPC in the US East (N. Virginia) region.

    Creating a VPC

  3. Choose the VPC configuration by selecting one of the radio buttons.

    If the data used in the cluster is available on the Internet (e.g., Amazon S3 or Amazon RDS), select VPC with a Single Public Subnet Only.

    Choosing a VPC configuration for a single public subnet

    If the data used in the cluster is stored locally (e.g., an Oracle database), select VPC with Public and Private subnets and Hardware VPN Access.

    Choosing a VPC configuration for multiple subnets

  4. Confirm the VPC settings. To work with Amazon EMR, the VPC must have both an Internet Gateway and a subnet.

    In a new VPC, the default DHCP configuration includes a domain name (for example, ec2.internal); therefore, you must click Edit VPC IP CIDR Block and select the DNS Hostnames check box. For a VPC configured under EC2-Classic that has a domain name in the DHCP table, either set the option Enable DNS hostname support for instances launched in this VPC under DNS Settings, or delete the domain name from the DHCP configuration of the associated VPC. For more information, see Updating DNS Support for Your VPC.

    Configuring VPC settings

  5. A dialog box confirms that the Amazon VPC was successfully created. Click Close.

    VPC creation confirmation

You cannot use Network Address Translation (NAT) when you are using Amazon EMR on a VPC.

After you've created a VPC, you need to locate its subnet identifier; you'll use this value to launch the Amazon EMR cluster on the VPC.

To find the VPC subnet identifier

  • Click on Subnets in the navigation menu of the Amazon VPC console. The right pane displays information about the VPC, including its subnet identifier.

    VPC subnet identifier

Launching Clusters on a VPC

After you have a VPC subnet that is configured to host Amazon EMR clusters, launching clusters on that VPC subnet is as simple as specifying the subnet identifier during the cluster creation.

If the VPC subnet does not have an Internet gateway, the cluster creation fails with the error: “Subnet not correctly configured, missing route to an Internet gateway.”
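
Before launching, it can help to check a subnet's route table for a route to an Internet gateway. The Python sketch below runs that check against a dictionary shaped like one entry of the Amazon EC2 DescribeRouteTables response (`Routes`, `DestinationCidrBlock`, `GatewayId`). It assumes the requirement amounts to a default route (0.0.0.0/0) whose target ID starts with `igw-`; the exact validation that Amazon EMR performs is not described in this guide, and the sample gateway ID is made up.

```python
def has_internet_gateway_route(route_table):
    """Return True if the table has a default route targeting an Internet gateway.

    Assumption: Internet gateway IDs start with "igw-" and the default
    route uses the destination 0.0.0.0/0.
    """
    for route in route_table.get("Routes", []):
        gateway_id = route.get("GatewayId", "")
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        if is_default and gateway_id.startswith("igw-"):
            return True
    return False


# Hypothetical route tables, for illustration only.
public_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-1a2b3c4d"},
    ]
}
private_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    ]
}

if __name__ == "__main__":
    print(has_internet_gateway_route(public_table))
    print(has_internet_gateway_route(private_table))
```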

When the cluster is launched, Amazon EMR adds two security groups to the VPC: ElasticMapReduce-slave and ElasticMapReduce-master. By default, the ElasticMapReduce-master security group does not allow inbound SSH connections. If you require this functionality, you can add it to the security group.
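
If you decide to allow inbound SSH to the master node, the rule you add to ElasticMapReduce-master is a TCP rule for port 22 limited to a source address range. The Python sketch below only builds and validates such a rule as a dictionary. The layout mirrors the IpPermissions structure accepted by the Amazon EC2 AuthorizeSecurityGroupIngress action; the helper name and the sample CIDR block (a documentation-only range) are assumptions, and sending the request to Amazon EC2 is left out.

```python
import ipaddress


def ssh_ingress_rule(source_cidr):
    """Build an inbound TCP/22 rule for the given IPv4 CIDR block.

    Raises ValueError if source_cidr is not a valid IPv4 network
    (for example, if host bits are set).
    """
    network = ipaddress.IPv4Network(source_cidr, strict=True)
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": str(network)}],
    }


if __name__ == "__main__":
    # 203.0.113.0/24 is reserved for documentation; substitute your own range.
    print(ssh_ingress_rule("203.0.113.0/24"))
```

Restricting the source range to the addresses you actually connect from keeps the master node from accepting SSH connections from the whole Internet.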

To manage the cluster on a VPC, Amazon EMR attaches a network device to the master node and manages it through this device. You can view this device using the Amazon EC2 API DescribeInstances. If you disconnect this device, the cluster will fail.

After the cluster is created, it is able to access AWS services to connect to data stores, such as Amazon S3.

Note

Amazon VPC currently does not support CC1 instances. Thus, you cannot specify a cc1.4xlarge instance type for nodes of a cluster launched on a VPC.

To launch a cluster on a VPC using the Amazon EMR console

  1. In the Amazon EMR console, click Create New Job Flow.

    Creating a new cluster
  2. Follow the instructions in the Create a New Job Flow wizard, selecting options that match the cluster you want to launch.

  3. When you reach the ADVANCED OPTIONS page, choose the VPC subnet you created previously from the Amazon VPC Subnet Id box. If you have not created a subnet, click Create a VPC underneath the drop-down box to open the Amazon VPC console and create a VPC and subnet.

    Choosing a VPC subnet
  4. Continue through the Create a New Job Flow wizard until it is complete. The cluster is launched within the subnet specified in Step 3.

To launch a cluster on a VPC using the CLI

  • After your VPC is configured, you can launch Amazon EMR clusters on it by passing the subnet identifier to the --subnet argument. The following example creates a long-running cluster on the specified VPC subnet. In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --subnet subnet-identifier
    • Windows users:

      ruby elastic-mapreduce --create --alive --subnet subnet-identifier
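
If you script cluster launches, a small wrapper can assemble the same command line and catch an obviously malformed subnet identifier before the CLI runs. The Python sketch below builds the argument list for either platform using only the options shown above. The function name and the identifier pattern (`subnet-` followed by lowercase hexadecimal characters) are assumptions, and the sample identifier is made up.

```python
import re

# Assumption: subnet identifiers look like "subnet-" plus hexadecimal characters.
SUBNET_ID_PATTERN = re.compile(r"subnet-[0-9a-f]+")


def build_create_command(subnet_id, windows=False):
    """Return the argument list that creates a long-running cluster on a subnet."""
    if not SUBNET_ID_PATTERN.fullmatch(subnet_id):
        raise ValueError(f"unexpected subnet identifier: {subnet_id!r}")
    # On Windows the CLI script is run through the Ruby interpreter.
    launcher = ["ruby", "elastic-mapreduce"] if windows else ["./elastic-mapreduce"]
    return launcher + ["--create", "--alive", "--subnet", subnet_id]


if __name__ == "__main__":
    print(" ".join(build_create_command("subnet-1a2b3c4d")))
    print(" ".join(build_create_command("subnet-1a2b3c4d", windows=True)))
```

The returned list can be passed to `subprocess.run` from the directory where the Amazon EMR CLI is installed.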

To launch a cluster on a VPC using the API

  • After your VPC is configured, you can launch Amazon EMR clusters on it by providing the VPC subnet identifier as the value for Ec2SubnetId, an optional String parameter on the JobFlowInstancesConfig structure.

    https://elasticmapreduce.amazonaws.com?
                Operation=RunJobFlow&
                Name=MyJobFlowName&
                LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir&
                Instances.MasterInstanceType=m1.small&
                Instances.SlaveInstanceType=m1.small&
                Instances.InstanceCount=4&
                Instances.Ec2KeyName=myec2keyname&
                Instances.Placement.AvailabilityZone=us-east-1a&
                Instances.KeepJobFlowAliveWhenNoSteps=true&
                Instances.Ec2SubnetId=subnet-identifier&
                Steps.member.1.Name=MyStepName&
                Steps.member.1.ActionOnFailure=CONTINUE&
                Steps.member.1.HadoopJarStep.Jar=MyJarFile&
                Steps.member.1.HadoopJarStep.MainClass=MyMainClass&
                Steps.member.1.HadoopJarStep.Args.member.1=arg1&
                Steps.member.1.HadoopJarStep.Args.member.2=arg2&
                AWSAccessKeyId=AWS Access Key ID&
                SignatureVersion=2&
                SignatureMethod=HmacSHA256&
                Timestamp=2009-01-28T21%3A48%3A32.000Z&
                Signature=calculated value
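
The Signature value in the request above is calculated from the other parameters. The Python sketch below shows one way to assemble a reduced RunJobFlow query that includes Instances.Ec2SubnetId and to sign it with Signature Version 2 (HMAC-SHA256 over the HTTP verb, host, path, and sorted query string, then Base64-encoded). The function names, the reduced parameter set, and the sample credentials are assumptions for illustration; the sketch builds the URL but does not send it.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

HOST = "elasticmapreduce.amazonaws.com"
UNRESERVED = "-_.~"  # characters left unencoded, per RFC 3986


def canonical_query(params):
    """Sort parameters by name and percent-encode names and values."""
    pairs = sorted(params.items())
    return "&".join(
        f"{quote(name, safe=UNRESERVED)}={quote(str(value), safe=UNRESERVED)}"
        for name, value in pairs
    )


def sign_v2(params, secret_key, host=HOST, path="/"):
    """Return the Base64-encoded HMAC-SHA256 signature for a GET request."""
    string_to_sign = "\n".join(["GET", host, path, canonical_query(params)])
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def build_run_job_flow_url(subnet_id, access_key, secret_key, timestamp):
    """Build a signed RunJobFlow URL that launches the cluster in subnet_id."""
    params = {
        "Operation": "RunJobFlow",
        "Name": "MyJobFlowName",
        "Instances.MasterInstanceType": "m1.small",
        "Instances.SlaveInstanceType": "m1.small",
        "Instances.InstanceCount": "4",
        "Instances.KeepJobFlowAliveWhenNoSteps": "true",
        "Instances.Ec2SubnetId": subnet_id,
        "AWSAccessKeyId": access_key,
        "SignatureVersion": "2",
        "SignatureMethod": "HmacSHA256",
        "Timestamp": timestamp,
    }
    signature = quote(sign_v2(params, secret_key), safe=UNRESERVED)
    return f"https://{HOST}/?{canonical_query(params)}&Signature={signature}"


if __name__ == "__main__":
    # Placeholder credentials and subnet identifier; replace with your own.
    print(build_run_job_flow_url(
        "subnet-1a2b3c4d", "AKIDEXAMPLE", "example-secret-key",
        "2009-01-28T21:48:32.000Z",
    ))
```

Because the signature covers every parameter, changing the subnet identifier after signing invalidates the request; rebuild and re-sign the URL instead.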