Plan and Configure Networking
Depending on your account and region, you may be able to choose between two network platforms for your cluster: EC2-Classic and EC2-VPC. In EC2-Classic, your instances run in a single, flat network that you share with other customers. EC2-Classic is available only with certain accounts in certain regions. For more information, see Amazon EC2 and Amazon VPC in the Amazon EC2 User Guide for Linux Instances.
In EC2-VPC, your cluster uses Amazon Virtual Private Cloud (Amazon VPC), and EC2 instances run in a VPC that is logically isolated within your AWS account. With Amazon VPC, you provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network and control aspects such as private IP address ranges, subnets, routing tables, and network gateways.
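As a rough illustration of how a VPC's private IP address range relates to its subnets, the following sketch splits one address block into smaller subnet blocks. It uses only the Python standard library and makes no AWS calls. The 10.0.0.0/16 range and the names public_subnet and private_subnet are assumptions chosen for the example. Whether a subnet is actually public or private depends on its routing, not on its address block.

```python
import ipaddress

# Hypothetical VPC address range (an RFC 1918 private block), chosen for illustration.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 blocks and take the first two as example subnets.
subnet_blocks = list(vpc_cidr.subnets(new_prefix=24))
public_subnet, private_subnet = subnet_blocks[0], subnet_blocks[1]

print(public_subnet)        # 10.0.0.0/24
print(private_subnet)       # 10.0.1.0/24
print(vpc_cidr.is_private)  # True
```

Both example subnets draw from the same private range. The labels only reflect how you might later route them, for instance through an Internet gateway or a NAT device.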
VPC offers the following capabilities:
Processing sensitive data
Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides. Furthermore, you can choose to launch your resources into a private subnet where none of those resources has direct Internet connectivity.
Accessing resources on an internal network
If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database.
Public and private subnets
You can launch EMR clusters in both public and private VPC subnets. This means that you do not need Internet connectivity to run an EMR cluster. However, you may need to configure network address translation (NAT) and VPN gateways to access services or resources located outside of the VPC, such as resources in a corporate intranet or public AWS service endpoints like AWS Key Management Service.
Amazon EMR supports launching clusters in private subnets only in release 4.2 or later.
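The release requirement above can be checked before a cluster request is sent. The following is a minimal sketch, assuming release labels of the form emr-4.2.0. The helper names supports_private_subnets and build_cluster_request, and the subnet ID used in the usage note, are hypothetical. The returned dictionary uses the Name, ReleaseLabel, and Instances.Ec2SubnetId fields of the EMR RunJobFlow request. Other fields a real request needs, such as instance types and roles, are omitted.

```python
def supports_private_subnets(release_label):
    """Return True if a release label such as 'emr-4.2.0' is 4.2 or later."""
    prefix = "emr-"
    version = release_label[len(prefix):] if release_label.startswith(prefix) else release_label
    # Compare numerically so that, for example, 4.10 sorts after 4.2.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (4, 2)


def build_cluster_request(name, release_label, subnet_id, subnet_is_private):
    """Build a partial RunJobFlow-style request that targets one VPC subnet."""
    if subnet_is_private and not supports_private_subnets(release_label):
        raise ValueError(
            "Private subnets require release 4.2 or later, got " + release_label
        )
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Ec2SubnetId selects the VPC subnet that the cluster launches into.
        "Instances": {"Ec2SubnetId": subnet_id},
    }
```

For example, build_cluster_request("demo", "emr-4.2.0", "subnet-0abc", True) returns a dictionary, while the same call with "emr-4.1.0" raises ValueError. The caller is expected to know whether the subnet is private. The sketch does not inspect route tables.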
For more information about Amazon VPC, see the Amazon VPC User Guide.
Private subnets in a VPC
Public subnets in a VPC
General VPC information