Select an Amazon VPC Subnet for the Cluster

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables, and network gateways.

The reasons to launch your cluster into a VPC include the following:

  • Processing sensitive data

    Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides. Furthermore, you can choose to launch your resources into a private subnet where none of those resources has direct Internet connectivity.

  • Accessing resources on an internal network

    If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database.

Public and private subnets

You can launch EMR clusters in both public and private VPC subnets. This means you do not need Internet connectivity to run an EMR cluster; however, you may need to configure network address translation (NAT) and VPN gateways to access services or resources located outside of the VPC, for example in a corporate intranet or public AWS service endpoints like AWS Key Management Service.

Important

Amazon EMR supports launching clusters in private subnets only in release version 4.2 or later.

For more information about Amazon VPC, see the Amazon VPC User Guide.

Clusters in a VPC

There are two platforms on which you can launch the EC2 instances of your cluster: EC2-Classic or EC2-VPC. In EC2-Classic, your instances run in a single, flat network that you share with other customers. In EC2-VPC, your instances run in a VPC that's logically isolated to your AWS account. Your AWS account is capable of launching clusters into either the EC2-Classic or EC2-VPC platform, or only into the EC2-VPC platform, on a region-by-region basis. For more information about EC2-VPC, see Amazon Virtual Private Cloud (Amazon VPC).

When launching an EMR cluster within a VPC, you can launch it within either a public or private subnet. The required configuration differs depending on the subnet type you choose for your cluster.

Public Subnets

EMR clusters in a public subnet require a connected Internet gateway. This is because Amazon EMR clusters must access AWS services and Amazon EMR. If a service, such as Amazon S3, provides the ability to create a VPC endpoint, you can access those services using the endpoint instead of accessing a public endpoint through an Internet gateway. Additionally, Amazon EMR cannot communicate with clusters in public subnets through a network address translation (NAT) device. An Internet gateway is required for this purpose but you can still use a NAT instance or gateway for other traffic in more complex scenarios.

If you have additional AWS resources that you do not want connected to the Internet gateway, you can launch those components in a private subnet that you create within your VPC.
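A common way to tell whether a subnet is public in the sense used here is to look for a default route that targets an Internet gateway. The following Python sketch illustrates that check. It assumes route entries shaped like the Routes list returned by the Amazon EC2 DescribeRouteTables action (keys DestinationCidrBlock and GatewayId); the function name is_public_subnet and the sample route tables are hypothetical.

```python
def is_public_subnet(routes):
    """Return True if any default route targets an Internet gateway.

    `routes` is a list of dicts shaped like the Routes entries returned by
    the EC2 DescribeRouteTables action (assumed shape, trimmed for brevity).
    """
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        targets_igw = route.get("GatewayId", "").startswith("igw-")
        if is_default and targets_igw:
            return True
    return False


# Hypothetical route tables, for illustration only.
public_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example"},
]
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example"},
]

print(is_public_subnet(public_routes))   # True
print(is_public_subnet(private_routes))  # False
```

The second table routes Internet-bound traffic through a NAT device instead of an Internet gateway, so the sketch classifies it as private.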

Clusters running in a public subnet use two security groups, ElasticMapReduce-master and ElasticMapReduce-slave, which control access to the master and slave instance groups, respectively.

Public Subnet Security Groups

Security Group Name | Description | Open Inbound Ports | Open Outbound Ports
ElasticMapReduce-master | Security group for master instance groups of clusters in a public subnet. | TCP: 0-65535, 8443, 22; UDP: 0-65535 | All
ElasticMapReduce-slave | Security group for slave instance groups (containing core and task nodes) of clusters in a public subnet. | TCP: 0-65535; UDP: 0-65535 | All


The master instance group contains the master node while a slave group contains both task and core nodes of the cluster. All instances in a cluster connect to Amazon S3 through either a VPC endpoint or Internet gateway. Other AWS services which do not currently support VPC endpoints use only an Internet gateway.

The following diagram shows how an Amazon EMR cluster runs in a VPC using a public subnet. The cluster is able to connect to other AWS resources, such as Amazon S3 buckets, through the Internet gateway.

Cluster on a VPC

The following diagram shows how to set up a VPC so that a cluster in the VPC can access resources in your own network, such as an Oracle database.

Set up a VPC and cluster to access local VPN resources

Private Subnets

Private subnets allow you to launch AWS resources without requiring the subnet to have an attached Internet gateway. This might be useful, for example, in an application that uses these private resources in the back end. Those resources can then initiate outbound traffic using a NAT instance located in another subnet that has an Internet gateway attached. For more information about this scenario, see Scenario 2: VPC with Public and Private Subnets (NAT).

Important

Amazon EMR supports launching clusters in private subnets only in release version 4.2 or later.

The following are differences from public subnets:

  • To access AWS services that do not provide a VPC endpoint, you still must use a NAT instance or an Internet gateway. Currently, the only service supported with a VPC endpoint is Amazon S3.

  • At a minimum, you must provide a route to the Amazon EMR service logs bucket and the Amazon Linux repository in Amazon S3. See Minimum Amazon S3 Policy for Private Subnet.

  • If you use EMRFS features, you need to have an Amazon S3 VPC endpoint and a route from your private subnet to DynamoDB.

  • Debugging only works if you provide a route from your private subnet to a public Amazon SQS endpoint.

  • Creating a private subnet configuration with a NAT instance or gateway in a public subnet is only supported using the AWS Management Console. The easiest way to add and configure NAT instances and Amazon S3 VPC endpoints for EMR clusters is to use the VPC Subnets List page in the Amazon EMR console. To configure NAT gateways, follow the procedures outlined in the section called NAT Gateways in the Amazon Virtual Private Cloud User Guide.

  • You cannot change a subnet with an existing EMR cluster from public to private or vice versa. To locate an EMR cluster within a private subnet, the cluster must be started in that private subnet.
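The connectivity requirements in the list above can be summarized as a small checklist function. This is a minimal Python sketch, not an Amazon EMR API: the function name required_connectivity and its two flags are hypothetical, and the returned strings simply restate the requirements listed above.

```python
def required_connectivity(use_emrfs=False, use_debugging=False):
    """List the connectivity a private-subnet cluster needs, per the list above."""
    # Baseline needed by every cluster in a private subnet.
    needs = [
        "route to the Amazon EMR service logs bucket in Amazon S3",
        "route to the Amazon Linux repository in Amazon S3",
    ]
    if use_emrfs:
        needs.append("Amazon S3 VPC endpoint")
        needs.append("route from the private subnet to DynamoDB")
    if use_debugging:
        needs.append("route from the private subnet to a public Amazon SQS endpoint")
    return needs


for item in required_connectivity(use_emrfs=True, use_debugging=True):
    print("-", item)
```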

Amazon EMR creates different security groups for the cluster(s) in a private subnet: ElasticMapReduce-Master-Private, ElasticMapReduce-Slave-Private, and ElasticMapReduce-ServiceAccess.

Private Subnet Security Groups

Security Group Name | Description | Open Inbound Ports | Open Outbound Ports
ElasticMapReduce-Master-Private | Security group for master instance groups of clusters in a private subnet. | TCP: 0-65535, 8443; UDP: 0-65535 | All
ElasticMapReduce-Slave-Private | Security group for slave instance groups (containing core and task nodes) of clusters in a private subnet. | TCP: 0-65535, 8443; UDP: 0-65535 | All
ElasticMapReduce-ServiceAccess | Security group for Amazon EMR-managed ENI resources used to allow communication from the web service to the cluster. The ENI is owned by you but managed by Amazon EMR. | N/A | 8443

For a complete listing of the security group rules for your cluster, choose the Security groups for Master and Security groups for Core & Task links on the Cluster Details page of the Amazon EMR console.

The following image shows how an EMR cluster is configured within a private subnet. The only communication outside the subnet is to Amazon EMR.

Launch an EMR cluster in a private subnet

The following image shows a sample configuration for an EMR cluster within a private subnet connected to a NAT instance residing in a public subnet.

Private subnet with NAT

Setting Up a VPC to Host Clusters

Before you can launch clusters in a VPC, you must create a VPC and a subnet. For public subnets, you must also create an Internet gateway, attach it to the VPC, and route the subnet's Internet-bound traffic to it. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters.

To create a subnet to run Amazon EMR clusters

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. In the navigation bar, select the region in which to run your cluster.

  3. Choose Start VPC Wizard.

  4. Choose the VPC configuration by selecting one of the following options:

    • VPC with a Single Public Subnet—Select this option if the data used in the cluster is available on the Internet (for example, in Amazon S3 or Amazon RDS).

    • VPC with Public and Private subnets and Hardware VPN Access—Select this option if you wish to use a private subnet or if data for your application is stored in your own network (for example, in an Oracle database). This option also allows you to include public subnets within the same VPC as private subnets.

  5. Confirm the VPC settings. The images show both the single public subnet scenario and the public and private subnets scenario.

    Configuring VPC settings with public subnet
    Configuring VPC settings with public and private subnets
    • To work with Amazon EMR, the VPC with a public subnet must have both an Internet gateway and a subnet.

      For a private subnet in a VPC, your master and slave nodes must at least have a route to Amazon EMR through the ENI. In the console, this is automatically configured for you.

    • Use a private IP address space for your VPC to ensure proper DNS hostname resolution; otherwise, you may experience Amazon EMR cluster failures. This includes the following IP address ranges:

      • 10.0.0.0 - 10.255.255.255

      • 172.16.0.0 - 172.31.255.255

      • 192.168.0.0 - 192.168.255.255

    • Choose Use a NAT instance instead and select options as appropriate.

    • Optionally choose to Add endpoints for S3 to your subnets.

    • Verify that Enable DNS hostnames is checked. You have the option to enable DNS hostnames when you create the VPC. To change the setting of DNS hostnames, select your VPC in the VPC list, then choose Edit in the details pane. To create a DNS entry that does not include a domain name, create a DHCP options set and associate it with your VPC. You cannot edit the domain name using the console after the DHCP options set has been created.

      For more information, see Using DNS with Your VPC.

    • It is a best practice with Hadoop and related applications to ensure resolution of the fully qualified domain name (FQDN) for nodes. To ensure proper DNS resolution, configure a VPC that includes a DHCP options set whose parameters are set to the following values:

      • domain-name = ec2.internal

        Use ec2.internal if your region is US East (N. Virginia). For other regions, use region-name.compute.internal. For example, in us-west-2, use us-west-2.compute.internal. For the AWS GovCloud (US) region, use us-gov-west-1.compute.internal.

      • domain-name-servers = AmazonProvidedDNS

      For more information, see DHCP Options Sets in the Amazon VPC User Guide.

  6. Choose Create VPC. If you are creating a NAT instance, it may take a few minutes for this to complete.
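Before confirming the VPC settings in step 5, it can help to check that the CIDR block you plan to use falls inside one of the private address ranges listed there. The following Python sketch uses the standard ipaddress module; the function name is_private_vpc_cidr and the sample CIDR blocks are hypothetical.

```python
import ipaddress

# The three private address ranges listed in step 5, written as CIDR blocks.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]


def is_private_vpc_cidr(cidr):
    """Return True if the CIDR block lies entirely inside one private range."""
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(private) for private in PRIVATE_RANGES)


print(is_private_vpc_cidr("10.0.0.0/16"))    # True
print(is_private_vpc_cidr("172.32.0.0/16"))  # False
print(is_private_vpc_cidr("52.0.0.0/16"))    # False
```

The check requires the whole block to sit inside a single private range, so a block that only partly overlaps one is rejected.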

After the VPC is created, go to the Subnets page and note the identifier of one of the subnets of your VPC. You'll use this information when you launch the EMR cluster into the VPC.
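The domain-name guidance for the DHCP options set in step 5 reduces to one special case and one pattern. The following Python sketch expresses that rule; the function name dhcp_domain_name is hypothetical, and the sketch assumes that us-east-1 is the region code for US East (N. Virginia).

```python
def dhcp_domain_name(region):
    """Return the domain-name value for a DHCP options set in this region."""
    # US East (N. Virginia) is the one special case named in step 5.
    if region == "us-east-1":
        return "ec2.internal"
    # Every other region follows the region-name.compute.internal pattern.
    return f"{region}.compute.internal"


print(dhcp_domain_name("us-east-1"))      # ec2.internal
print(dhcp_domain_name("us-west-2"))      # us-west-2.compute.internal
print(dhcp_domain_name("us-gov-west-1"))  # us-gov-west-1.compute.internal
```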

Launching Clusters into a VPC

After you have a subnet that is configured to host Amazon EMR clusters, launch the cluster in that subnet by specifying the associated subnet identifier when creating the cluster.

Note

Amazon EMR supports private subnets in release versions 4.2 and above.

When the cluster is launched, Amazon EMR adds security groups based on whether the cluster is launching into a public or private VPC subnet. All security groups allow ingress at port 8443 to communicate with the Amazon EMR service, but IP address ranges vary for public and private subnets. Amazon EMR manages all of these security groups, and may need to add IP addresses to the AWS range over time.

In public subnets, Amazon EMR creates ElasticMapReduce-slave and ElasticMapReduce-master for the slave and master instance groups, respectively. By default, the ElasticMapReduce-master security group allows inbound SSH connections while the ElasticMapReduce-slave group does not. Both master and slave security groups allow inbound traffic on port 8443 from the AWS public IP range. If you require SSH access for slave (core and task) nodes, you can add a rule to the ElasticMapReduce-slave security group or use SSH agent forwarding.

Other security groups and rules are required when launching clusters in a private subnet. This ensures that the service can still manage those resources while they are private. The additional security groups are ElasticMapReduce-Master-Private and ElasticMapReduce-Slave-Private. The security group for the ENI is of the form ElasticMapReduce-ServiceAccess. Inbound traffic on port 8443 is open to allow contact to the Amazon EMR web service. Outbound traffic on ports 80 and 443 should be allowed so that the cluster can communicate back to the service. Furthermore, inbound and outbound ephemeral ports should be open in your network ACLs.
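The security groups named in the preceding paragraphs can be kept in a small lookup, for example to verify what a launched cluster should have. This Python sketch is illustrative only: the SECURITY_GROUPS mapping and the expected_security_groups function are hypothetical names, and the group names are the ones given in this section.

```python
# Security groups Amazon EMR creates, keyed by subnet type (from this section).
SECURITY_GROUPS = {
    "public": [
        "ElasticMapReduce-master",
        "ElasticMapReduce-slave",
    ],
    "private": [
        "ElasticMapReduce-Master-Private",
        "ElasticMapReduce-Slave-Private",
        "ElasticMapReduce-ServiceAccess",
    ],
}


def expected_security_groups(subnet_type):
    """Return the security group names expected for 'public' or 'private'."""
    try:
        return list(SECURITY_GROUPS[subnet_type])
    except KeyError:
        raise ValueError(f"unknown subnet type: {subnet_type!r}") from None


print(expected_security_groups("private"))
```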

For more information about modifying security group rules, see Adding Rules to a Security Group in the Amazon EC2 User Guide for Linux Instances. For more information about connecting to instances in your VPC, see Securely connect to Linux instances running in a private Amazon VPC.

To manage the cluster on a VPC, Amazon EMR attaches a network device to the master node and manages it through this device. You can view this device using the Amazon EC2 API action DescribeInstances. If you modify this device in any way, the cluster may fail.

To launch a cluster into a VPC using the Amazon EMR console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster.

  3. Choose Go to advanced options.

  4. In the Hardware Configuration section, for Network, select the ID of a VPC network that you created previously.

  5. For EC2 Subnet, select the ID of a subnet that you created previously.

    1. If your private subnet is properly configured with NAT instance and S3 endpoint options, it will say (EMR Ready) above the subnet name(s) and identifier(s).

    2. If your private subnet does not have a NAT instance and/or S3 endpoint, you can configure this by choosing Add S3 endpoint and NAT instance, Add S3 endpoint, or Add NAT instance. Select the desired options for your NAT instance and S3 endpoint and choose Configure.

      Important

      To create a NAT instance from the Amazon EMR console, you need the ec2:CreateRoute, ec2:RevokeSecurityGroupEgress, ec2:AuthorizeSecurityGroupEgress, cloudformation:DescribeStackEvents, and cloudformation:CreateStack permissions.

      Note

      There is an additional cost for launching an EC2 instance for your NAT device.

  6. Proceed with creating the cluster.
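The permissions listed in the Important note under step 5 can be compared against the actions a user has been granted. The following Python sketch does an exact-name comparison only; it does not expand wildcards or evaluate IAM policies, and the names REQUIRED_ACTIONS and missing_actions are hypothetical.

```python
# Actions the console needs to create a NAT instance (from the note in step 5).
REQUIRED_ACTIONS = {
    "ec2:CreateRoute",
    "ec2:RevokeSecurityGroupEgress",
    "ec2:AuthorizeSecurityGroupEgress",
    "cloudformation:DescribeStackEvents",
    "cloudformation:CreateStack",
}


def missing_actions(granted):
    """Return the required actions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_ACTIONS - set(granted))


# Hypothetical set of granted actions, for illustration only.
granted = ["ec2:CreateRoute", "cloudformation:CreateStack"]
for action in missing_actions(granted):
    print("missing:", action)
```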

To launch a cluster into a VPC using the AWS CLI

Note

The AWS CLI does not provide a way to create a NAT instance automatically and connect it to your private subnet. However, to create an S3 endpoint in your subnet, you can use the Amazon VPC CLI commands. Use the console to create NAT instances and launch clusters in a private subnet.

After your VPC is configured, you can launch EMR clusters in it by using the create-cluster subcommand with the --ec2-attributes parameter. Use the --ec2-attributes parameter to specify the VPC subnet for your cluster.

  • To create a cluster in a specific subnet, type the following command, replace myKey with the name of your EC2 key pair, and replace 77XXXX03 with your subnet ID.

    aws emr create-cluster --name "Test cluster" --release-label emr-4.2.0 --applications Name=Hadoop Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey,SubnetId=subnet-77XXXX03 --instance-type m3.xlarge --instance-count 3

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default Amazon EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.
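If you generate the command from a script, one option is to assemble the argument list programmatically. The following Python sketch builds the same create-cluster command shown above and restates the node split described for --instance-count. The function names build_create_cluster_command and node_split are hypothetical; the sketch only prints the command and does not call AWS.

```python
import shlex


def build_create_cluster_command(key_name, subnet_id, instance_count=3):
    """Return the argument list for the create-cluster command shown above."""
    return [
        "aws", "emr", "create-cluster",
        "--name", "Test cluster",
        "--release-label", "emr-4.2.0",
        "--applications", "Name=Hadoop", "Name=Hive", "Name=Pig",
        "--use-default-roles",
        # The subnet is selected through the SubnetId key of --ec2-attributes.
        "--ec2-attributes", f"KeyName={key_name},SubnetId={subnet_id}",
        "--instance-type", "m3.xlarge",
        "--instance-count", str(instance_count),
    ]


def node_split(instance_count):
    """Without --instance-groups: one master node, the rest core nodes."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {"master": 1, "core": instance_count - 1}


args = build_create_cluster_command("myKey", "subnet-77XXXX03")
print(shlex.join(args))  # shell-quoted command line
print(node_split(3))     # {'master': 1, 'core': 2}
```

Keeping the arguments as a list avoids quoting mistakes with values such as the cluster name, which contains a space.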

For more information about using Amazon EMR commands in the AWS CLI, see the AWS CLI.

Restricting Permissions to a VPC Using IAM

When you launch a cluster into a VPC, you can use AWS Identity and Access Management (IAM) to control access to clusters and restrict actions using policies, just as you would with clusters launched into EC2-Classic. For more information about IAM, see the IAM User Guide.

You can also use IAM to control who can create and administer subnets. For more information about administering policies and actions in Amazon EC2 and Amazon VPC, see IAM Policies for Amazon EC2 in the Amazon EC2 User Guide for Linux Instances.

By default, all IAM users can see all of the subnets for the account, and any user can launch a cluster in any subnet.

You can limit access to the ability to administer the subnet, while still allowing users to launch clusters into subnets. To do so, create one user account that has permissions to create and configure subnets and a second user account that can launch clusters but which can’t modify Amazon VPC settings.
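One way to approach the second user account is to attach an explicit Deny statement for subnet administration alongside that user's Amazon EMR permissions. The following Python sketch prints such a policy fragment as JSON. It is an assumption for illustration, not a policy from this guide: the function name, the Sid, and the selection of Amazon EC2 actions are hypothetical choices, and the list is not a complete inventory of Amazon VPC administration actions.

```python
import json


def deny_subnet_admin_policy():
    """Build an illustrative IAM policy that denies subnet administration."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySubnetAdministration",  # hypothetical Sid
                "Effect": "Deny",
                # Illustrative selection of EC2 actions; extend as needed.
                "Action": [
                    "ec2:CreateSubnet",
                    "ec2:DeleteSubnet",
                    "ec2:ModifySubnetAttribute",
                ],
                "Resource": "*",
            }
        ],
    }


print(json.dumps(deny_subnet_admin_policy(), indent=2))
```

Because the fragment only denies actions, the user still needs separate Allow permissions to launch clusters.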

Minimum Amazon S3 Policy for Private Subnet

For private subnets, at a minimum you must provide the ability for Amazon EMR to access Amazon Linux repositories and Amazon EMR service support log buckets. The following policy provides these permissions:

{
   "Version": "2008-10-17",
   "Statement": [
       {
           "Sid": "AmazonLinuxAMIRepositoryAccess",
           "Effect": "Allow",
           "Principal": "*",
           "Action": "s3:GetObject",
           "Resource": [
               "arn:aws:s3:::packages.*.amazonaws.com/*",
               "arn:aws:s3:::repo.*.amazonaws.com/*"
           ]
       },
       {
           "Sid": "AccessToEMRLogBucketsForSupport",
           "Effect": "Allow",
           "Principal": "*",
           "Action": [
               "s3:Put*",
               "s3:Get*",
               "s3:Create*",
               "s3:Abort*",
               "s3:List*"
           ],
           "Resource": [
               "arn:aws:s3:::aws157-logs-prod-us-east-1/*",
               "arn:aws:s3:::aws157-logs-prod/*"
           ]
       }
   ]
}
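To see which requests the wildcard patterns in this policy cover, the following Python sketch applies a simplified matcher to the two statements above. It is an approximation for illustration, not the IAM policy evaluation logic: it handles only Allow statements and uses fnmatch-style matching. The function names and the sample object ARNs are hypothetical.

```python
from fnmatch import fnmatchcase

# The two statements from the policy above, trimmed to the fields the sketch uses.
POLICY = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::packages.*.amazonaws.com/*",
                "arn:aws:s3:::repo.*.amazonaws.com/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:Put*", "s3:Get*", "s3:Create*", "s3:Abort*", "s3:List*"],
            "Resource": [
                "arn:aws:s3:::aws157-logs-prod-us-east-1/*",
                "arn:aws:s3:::aws157-logs-prod/*",
            ],
        },
    ]
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource_arn):
    """Simplified check: does any Allow statement match action and resource?"""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        resource_ok = any(
            fnmatchcase(resource_arn, p) for p in _as_list(statement["Resource"])
        )
        if action_ok and resource_ok:
            return True
    return False


# Hypothetical object ARNs, for illustration only.
print(is_allowed(POLICY, "s3:GetObject",
                 "arn:aws:s3:::packages.us-west-2.amazonaws.com/example.rpm"))  # True
print(is_allowed(POLICY, "s3:DeleteObject",
                 "arn:aws:s3:::aws157-logs-prod/example.log"))                  # False
```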