Replacing unhealthy nodes

Amazon EMR periodically uses the NodeManager health checker service in Apache Hadoop to monitor the status of core nodes in your Amazon EMR on Amazon EC2 clusters. If a node is not functioning optimally, the health checker reports that node to the Amazon EMR controller. The Amazon EMR controller adds the node to a denylist, which prevents the node from receiving new YARN applications until its status improves. A common reason for a node becoming unhealthy is disk overutilization. For more information about identifying and recovering unhealthy nodes, see Resource errors.
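
As a rough illustration of how you might spot such nodes yourself, the following sketch asks YARN for the nodes it currently reports in the UNHEALTHY state. It assumes that you run it where the Hadoop yarn client is installed and on the PATH (for example, after connecting to the cluster's primary node). The function name list_unhealthy_nodes is hypothetical and not part of Amazon EMR.

```shell
# Sketch only: list nodes that YARN reports in the UNHEALTHY state.
# Assumes the Hadoop `yarn` client is installed and on the PATH
# (for example, on the cluster's primary node).
# list_unhealthy_nodes is a hypothetical helper name.
list_unhealthy_nodes() {
  yarn node -list -states UNHEALTHY
}
```

After you define the function in your shell session, run list_unhealthy_nodes with no arguments.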

You can choose whether Amazon EMR should terminate unhealthy nodes or keep them in the cluster. If you turn off unhealthy node replacement, the unhealthy nodes stay in the denylist and continue to count towards cluster capacity. You can still connect to your Amazon EC2 core instance for configuration and recovery, and you can resize your cluster to add capacity. Note that when unhealthy node replacement is on, Amazon EMR replaces unhealthy nodes even if termination protection is on.

If unhealthy node replacement is on, Amazon EMR will terminate the unhealthy core node and provision a new instance based on the number of instances in the instance group or the target capacity for instance fleets. If multiple or all core nodes are unhealthy for more than 45 minutes, Amazon EMR will gracefully replace the nodes.

Important

To avoid the possibility of permanently losing HDFS data as Amazon EMR gracefully replaces an unhealthy core instance, we recommend that you always back up your data.

Amazon EMR publishes Amazon CloudWatch Events for unhealthy node replacement, so you can keep track of what's happening with your unhealthy core instances. For more information, see unhealthy node replacement events.

Default node replacement and termination protection settings

Unhealthy node replacement is available for all Amazon EMR releases, but the default settings depend on the release label you choose. You can change any of these settings by configuring unhealthy node replacement when you create a new cluster or by editing the cluster configuration at any time.

If you're creating a single-node cluster or a high-availability cluster that runs Amazon EMR release 7.0 or lower, the default unhealthy node replacement setting depends on termination protection:

  • Enabling termination protection disables unhealthy node replacement.

  • Disabling termination protection enables unhealthy node replacement.
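
Because the defaults interact, it can help to check which settings a cluster actually has. The following sketch prints both values with describe-cluster. It assumes an AWS CLI version recent enough to return the UnhealthyNodeReplacement field in the cluster description. The function name show_protection_settings is hypothetical, and the caller supplies the cluster ID.

```shell
# Sketch only: print the termination protection and unhealthy node
# replacement settings of a cluster.
# show_protection_settings is a hypothetical helper name.
# Assumes the AWS CLI is configured and returns the
# UnhealthyNodeReplacement field in the cluster description.
show_protection_settings() {
  cluster_id="$1"
  aws emr describe-cluster \
    --cluster-id "$cluster_id" \
    --query 'Cluster.[TerminationProtected, UnhealthyNodeReplacement]' \
    --output text
}
```

For example, show_protection_settings j-3KVTXXXXXX7UG prints the termination protection value followed by the unhealthy node replacement value.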

Configuring unhealthy node replacement when you launch a cluster

You can enable or disable unhealthy node replacement when you launch a cluster using the console, the AWS CLI, or the API.

The default unhealthy node replacement setting depends on how you launch the cluster:

  • Amazon EMR console: unhealthy node replacement is enabled by default.

  • AWS CLI aws emr create-cluster: unhealthy node replacement is enabled by default unless you specify --no-unhealthy-node-replacement.

  • Amazon EMR RunJobFlow API command: unhealthy node replacement is enabled by default. To override the default, explicitly set the UnhealthyNodeReplacement Boolean value to True or False.

Console
To turn unhealthy node replacement on or off when you create a cluster with the console
  1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster.

  3. For EMR release version, choose the Amazon EMR release label you want.

  4. Under Cluster termination and node replacement, make sure that Unhealthy node replacement (recommended) is pre-selected, or clear the selection to turn it off.

  5. Choose any other options that apply to your cluster.

  6. To launch your cluster, choose Create cluster.

AWS CLI
To turn unhealthy node replacement on or off when you create a cluster using the AWS CLI
  • With the AWS CLI, use the create-cluster command with the --unhealthy-node-replacement parameter to launch a cluster with unhealthy node replacement enabled. Unhealthy node replacement is on by default.

    The following example creates a cluster with unhealthy node replacement enabled:

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

    aws emr create-cluster --name "SampleCluster" --release-label emr-7.1.0 \
      --applications Name=Hadoop Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
      --instance-count 3 --unhealthy-node-replacement

    For more information about using Amazon EMR commands in the AWS CLI, see Amazon EMR AWS CLI commands.
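
To launch a cluster with unhealthy node replacement turned off instead, you can pass --no-unhealthy-node-replacement. The following sketch wraps that call in a function. The function name create_cluster_without_replacement is hypothetical. The release label, applications, key pair name (myKey), and instance settings mirror the preceding example and are placeholders to adjust for your account.

```shell
# Sketch only: launch a cluster with unhealthy node replacement turned off.
# create_cluster_without_replacement is a hypothetical helper name.
# The key pair (myKey), release label, and instance settings are
# placeholders copied from the enabled example.
create_cluster_without_replacement() {
  cluster_name="$1"
  aws emr create-cluster --name "$cluster_name" --release-label emr-7.1.0 \
    --applications Name=Hadoop Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
    --instance-count 3 --no-unhealthy-node-replacement
}
```

For example, create_cluster_without_replacement "SampleCluster" launches the same cluster as the preceding example, but unhealthy nodes stay in the cluster.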

Configuring unhealthy node replacement in a running cluster

You can turn unhealthy node replacement on or off for a running cluster using the console, the AWS CLI, or the API.

Console
To turn unhealthy node replacement on or off for a running cluster with the console
  1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Under EMR on EC2 in the left navigation pane, choose Clusters, and select the cluster that you want to update.

  3. On the Properties tab on the cluster details page, find Cluster termination and node replacement and select Edit.

  4. Select or clear the unhealthy node replacement check box to turn the feature on or off. Then select Save changes to confirm.

AWS CLI
To turn unhealthy node replacement on or off for a running cluster using the AWS CLI
  • To turn on unhealthy node replacement for a running cluster with the AWS CLI, use the modify-cluster-attributes command with the --unhealthy-node-replacement parameter. To turn it off, use the --no-unhealthy-node-replacement parameter.

    The following example turns on unhealthy node replacement on the cluster with ID j-3KVTXXXXXX7UG:

    aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --unhealthy-node-replacement

    The following example turns off unhealthy node replacement on the same cluster:

    aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-unhealthy-node-replacement
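
The two commands can be combined into a small helper that changes the setting and then reads it back. This is a minimal sketch rather than an official tool. The function name set_unhealthy_node_replacement and its on/off argument are hypothetical, and the read-back step assumes that describe-cluster returns the UnhealthyNodeReplacement field.

```shell
# Sketch only: turn unhealthy node replacement on or off for a running
# cluster, then read the setting back.
# set_unhealthy_node_replacement is a hypothetical helper name.
# Usage: set_unhealthy_node_replacement <cluster-id> <on|off>
set_unhealthy_node_replacement() {
  cluster_id="$1"
  case "$2" in
    on)  flag="--unhealthy-node-replacement" ;;
    off) flag="--no-unhealthy-node-replacement" ;;
    *)   echo "usage: set_unhealthy_node_replacement <cluster-id> <on|off>" >&2
         return 2 ;;
  esac
  # Read the setting back only if the change was accepted.
  aws emr modify-cluster-attributes --cluster-id "$cluster_id" "$flag" &&
    aws emr describe-cluster --cluster-id "$cluster_id" \
      --query 'Cluster.UnhealthyNodeReplacement' --output text
}
```

For example, set_unhealthy_node_replacement j-3KVTXXXXXX7UG off turns the feature off and then prints the value that the cluster reports.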