Launch clusters into a VPC
After you have a subnet that is configured to host Amazon EMR clusters, launch the cluster in that subnet by specifying the associated subnet identifier when creating the cluster.
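As a minimal sketch of specifying the subnet identifier at creation time, the following Python builds the arguments for the Amazon EMR RunJobFlow API, where the subnet is passed as Instances.Ec2SubnetId. The helper names build_cluster_request and launch_cluster, the cluster name, the release label, the instance types and counts, and the default role names are illustrative assumptions, not values prescribed by this section.

```python
def build_cluster_request(subnet_id, name="example-cluster", release_label="emr-6.15.0"):
    """Build keyword arguments for RunJobFlow with the cluster pinned to one subnet.

    All values other than the subnet identifier are placeholder assumptions.
    """
    if not subnet_id.startswith("subnet-"):
        raise ValueError(f"not a subnet identifier: {subnet_id!r}")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Instances": {
            # The subnet that is configured to host Amazon EMR clusters.
            "Ec2SubnetId": subnet_id,
            "InstanceGroups": [
                # The API uses the role name MASTER for the primary node.
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names; substitute the roles used in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(subnet_id):
    """Call RunJobFlow with the request above; requires boto3 and AWS credentials."""
    import boto3  # imported here so the request builder works without boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_cluster_request(subnet_id))
    return response["JobFlowId"]
```

Keeping the request builder separate from the API call makes it possible to inspect or log the subnet setting before anything is launched.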
Note
Amazon EMR supports private subnets in release versions 4.2 and above.
When the cluster is launched, Amazon EMR adds security groups based on whether the cluster is launching into a private or public VPC subnet. All security groups allow ingress at port 8443 to communicate with the Amazon EMR service, but IP address ranges vary for public and private subnets. Amazon EMR manages all of these security groups and may need to add IP addresses to the AWS range over time. For more information, see Control network traffic with security groups.
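To inspect the port 8443 ingress rule without changing anything, a read-only check along the following lines may help. It is a sketch that assumes a security group dictionary shaped like one entry of the Amazon EC2 DescribeSecurityGroups response (IpPermissions, IpProtocol, FromPort, ToPort, IpRanges). The function name ingress_cidrs_for_port is hypothetical. Because Amazon EMR manages these groups, the code only reads rules and never modifies them.

```python
def ingress_cidrs_for_port(security_group, port):
    """Return the sorted IPv4 CIDR ranges allowed to reach `port` over TCP.

    `security_group` is one element of the "SecurityGroups" list returned by
    DescribeSecurityGroups. Rules that reference other security groups or
    prefix lists instead of CIDR ranges are not reported by this sketch.
    """
    cidrs = set()
    for permission in security_group.get("IpPermissions", []):
        protocol = permission.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            covers_port = True
        elif protocol == "tcp":
            from_port = permission.get("FromPort")
            to_port = permission.get("ToPort")
            covers_port = (
                from_port is not None
                and to_port is not None
                and from_port <= port <= to_port
            )
        else:
            covers_port = False
        if covers_port:
            for ip_range in permission.get("IpRanges", []):
                cidrs.add(ip_range["CidrIp"])
    return sorted(cidrs)
```

An empty result means that no CIDR-based rule covers the port. It does not by itself show that the port is closed, because rules can also reference other security groups.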
To manage the cluster on a VPC, Amazon EMR attaches a network device to the primary node and manages it through this device. You can view this device using the Amazon EC2 API action DescribeInstances. If you modify this device in any way, the cluster may fail.
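The following sketch shows one way to view the network devices reported by DescribeInstances without touching them. It assumes a response dictionary in the shape returned by the Amazon EC2 DescribeInstances action (Reservations, Instances, NetworkInterfaces). The function name list_network_interfaces is hypothetical. This section does not say how to tell the device that Amazon EMR attached apart from the others, so the code lists every interface and does not guess.

```python
def list_network_interfaces(describe_instances_response):
    """Flatten a DescribeInstances response into one record per network interface.

    This is read-only: it never calls an API that modifies an interface.
    """
    interfaces = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            for eni in instance.get("NetworkInterfaces", []):
                interfaces.append({
                    "InstanceId": instance.get("InstanceId"),
                    "NetworkInterfaceId": eni.get("NetworkInterfaceId"),
                    "SubnetId": eni.get("SubnetId"),
                    "PrivateIpAddress": eni.get("PrivateIpAddress"),
                    "Description": eni.get("Description", ""),
                })
    return interfaces


# Example of obtaining the response (requires boto3 and AWS credentials;
# the instance ID below is a placeholder):
#   import boto3
#   response = boto3.client("ec2").describe_instances(InstanceIds=["i-0123456789abcdef0"])
#   for record in list_network_interfaces(response):
#       print(record)
```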
Ensuring available IP addresses for an EMR cluster on EC2
To ensure that a subnet with enough free IP addresses is available when you launch, the EC2 subnet selection checks IP availability. The creation process uses a subnet with the necessary number of IP addresses to launch core, primary, and task nodes as required, even if only core nodes are created initially. During creation, Amazon EMR checks the number of IP addresses required to launch primary and task nodes, and separately calculates the number of IP addresses needed to launch core nodes. Amazon EMR automatically determines the minimum number of primary and task instances or nodes required.
Important
If no subnets in the VPC have enough available IPs to accommodate essential nodes, an error is returned and the cluster isn't created.
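As an illustration of the availability check described above, the following sketch runs a comparable pre-check on the client side. It is not the selection logic that Amazon EMR itself uses. It assumes subnet dictionaries shaped like entries of the Amazon EC2 DescribeSubnets response (SubnetId, AvailableIpAddressCount). The function name pick_subnet, the caller-supplied required_ips value, and the rule of preferring the subnet with the most free addresses are assumptions made for this example.

```python
def pick_subnet(subnets, required_ips):
    """Return the ID of a subnet with at least `required_ips` free addresses.

    `subnets` is the "Subnets" list from a DescribeSubnets response.
    Raises RuntimeError when no subnet qualifies, mirroring the case in
    which the cluster isn't created.
    """
    candidates = [s for s in subnets if s["AvailableIpAddressCount"] >= required_ips]
    if not candidates:
        raise RuntimeError(
            f"no subnet has {required_ips} available IP addresses"
        )
    # Assumed tie-break: prefer the subnet with the most free addresses,
    # which leaves more room for nodes that launch later.
    best = max(candidates, key=lambda s: s["AvailableIpAddressCount"])
    return best["SubnetId"]
```

Because the available address count can change between this check and the actual launch, a result from pick_subnet is only a hint. The launch itself can still fail if other clusters consume addresses in the meantime.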
In most deployments, core, primary, and task nodes launch at different times. Multiple clusters can also share a subnet. In these cases, IP address availability can fluctuate, and subsequent task node launches, for instance, can be limited by the available IP addresses.