Amazon EMR
Management Guide

Blacklisted Nodes

The NodeManager daemon is responsible for launching and managing containers on core and task nodes. The containers are allocated to the NodeManager daemon by the ResourceManager daemon that runs on the master node. The ResourceManager monitors the NodeManager node through a heartbeat.

There are a couple of situations in which the ResourceManager daemon blacklists a NodeManager, removing it from the pool of nodes available to process tasks:

  • If the NodeManager has not sent a heartbeat to the ResourceManager daemon in the past 10 minutes (60000 milliseconds). This time period can be configured using the yarn.nm.liveness-monitor.expiry-interval-ms configuration setting. For more information about changing Yarn configuration settings, see Configuring Applications in the Amazon EMR Release Guide.

  • NodeManager checks the health of the disks determined by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. The checks include permissions and free disk space (< 90%). If a disk fails the check, the NodeManager stops using that particular disk but still reports the node status as healthy. If a number of disks fail the check, the node is reported as unhealthy to the ResourceManager and new containers are not assigned to the node.

The application master can also blacklist a NodeManager node if it has more than three failed tasks. You can change this to a higher value using the mapreduce.job.maxtaskfailures.per.tracker configuration parameter. Other configuration settings you might change control how many times to attempt a task before marking it as failed: mapreduce.map.max.attempts for map tasks and mapreduce.reduce.maxattempts for reduce tasks. For more information about changing configuration settings, see Configuring Applications in the Amazon EMR Release Guide.