Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server infrastructure to run the Hadoop cluster. Amazon EMR defines the concept of instance groups, which are collections of EC2 instances that perform a set of roles defined by the distributed applications that are installed on your cluster. Generally, these groups are organized as master and slave groups. There are three types of instance groups: master, core (slave), and task (slave).
Each Amazon EMR cluster can include up to 50 instance groups: one master instance group that contains one master node, a core instance group containing one or more core nodes, and up to 48 optional task instance groups, which can contain any number of task nodes.
If the cluster is run on a single node, then that instance is simultaneously a master and a core node. For clusters running on more than one node, one instance is the master node and the remaining are core or task nodes.
For more information about instance groups, see Manually Resizing a Running Cluster.
Master Instance Group
The master instance group manages the cluster and typically runs master components of the distributed applications that are installed on your cluster. For example, it runs the YARN ResourceManager service to manage resources for applications and the HDFS NameNode service. It also tracks the status of jobs submitted to the cluster, and monitors the health of the instance groups. To monitor the progress of the cluster, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop or Spark publishes to a web server running on the master node. For more information, see View Log Files.
During a Hadoop MapReduce or Spark job, components on core and task nodes process the data, transfer the output to Amazon S3 or HDFS, and provide status metadata back to the master node. In the case of a single node cluster, all components are run on the master node.
Core Instance Group
The core instance group contains all of the core nodes of a cluster. A core node is an EC2 instance in the cluster that runs tasks and stores data as part of the Hadoop Distributed File System (HDFS) by running the DataNode daemon. For example, a core node runs YARN NodeManager daemons and run Hadoop MapReduce tasks and Spark executors.
Core nodes are managed by the master node. Because core nodes store data in HDFS, you can remove them by issuing a resize request for that instance group. The "shrink" operation will attempt to gracefully reduce the number of instances in that group, for example, when no running YARN jobs or applications are present.
Removing HDFS daemons from a running node runs the risk of losing data.
For more information about core instance groups, see Manually Resizing a Running Cluster.
Task Instance Group
Task instance groups contain the task nodes in your cluster. Task instance groups are optional. You can add task groups when you start the cluster, or you can add task groups to a running cluster. Task nodes do not run the DataNode daemon or store data in HDFS. You can add and remove task nodes to adjust the number of EC2 instances in your cluster, increasing capacity to handle peak loads and decreasing it later.
Task nodes are managed by the master node. While a cluster is running you can increase and decrease the number of task nodes in your task groups, and you can add up to 48 additional task groups. When you add task groups to your cluster, you can create groups that leverage multiple instance types. For example, if you create a task group that leverages Spot Instances, and there are not enough Spot Instances available for a particular instance type, you can create an additional task group that leverages a different instance type with available Spot Instance capacity.
Creating multiple task groups that leverage Spot Instances can provide you with potential cost savings. For more information, see (Optional) Lower Costs with Spot Instances.
For more information about task instance groups, see Manually Resizing a Running Cluster.