Automatically repair nodes in EKS clusters - Amazon EKS


This topic details the EKS automatic node repair behavior and how to configure it to meet your requirements. EKS automatic node repair is enabled by default in EKS Auto Mode, and can be used with EKS managed node groups and Karpenter.

The table below summarizes the default EKS automatic node repair actions, which apply to EKS Auto Mode, EKS managed node groups, and Karpenter. When using EKS Auto Mode or Karpenter, all AcceleratedHardwareReady repair actions are Replace; only EKS managed node groups support Reboot as a repair action.

For a detailed list of node health issues detected by the EKS node monitoring agent and their corresponding node repair actions, see Detect node health issues with the EKS node monitoring agent.

AcceleratedHardwareReady
  Description: Indicates whether accelerated hardware (GPU, Neuron) on the node is functioning correctly.
  Repair after: 10m
  Repair action(s): Replace or Reboot

ContainerRuntimeReady
  Description: Indicates whether the container runtime (containerd, etc.) is functioning correctly and able to run containers.
  Repair after: 30m
  Repair action(s): Replace

DiskPressure
  Description: A standard Kubernetes condition indicating the node is experiencing disk pressure (low disk space or high I/O).
  Repair after: N/A
  Repair action(s): None

KernelReady
  Description: Indicates whether the kernel is functioning correctly without critical errors, panics, or resource exhaustion.
  Repair after: 30m
  Repair action(s): Replace

MemoryPressure
  Description: A standard Kubernetes condition indicating the node is experiencing memory pressure (low available memory).
  Repair after: N/A
  Repair action(s): None

NetworkingReady
  Description: Indicates whether the node's networking stack is functioning correctly (interfaces, routing, connectivity).
  Repair after: 30m
  Repair action(s): Replace

StorageReady
  Description: Indicates whether the node's storage subsystem is functioning correctly (disks, filesystems, I/O).
  Repair after: 30m
  Repair action(s): Replace

Ready
  Description: The standard Kubernetes condition indicating the node is healthy and ready to accept pods.
  Repair after: 30m
  Repair action(s): Replace
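The conditions above are surfaced in each node's status once the node monitoring agent is reporting them. As a quick check, you can list a node's condition types and statuses with kubectl (the node name below is a placeholder):

```shell
# List the reported condition types and their statuses for one node.
# Replace my-node with a real node name from `kubectl get nodes`.
kubectl get node my-node \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```

A node that is healthy shows status True for the Ready-style conditions and False for the pressure conditions.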

EKS automatic node repair actions are disabled in the following scenarios by default. In-progress node repair actions continue in each scenario. See Configure automatic node repair for how to override these default settings.

EKS managed node groups

  • The node group has more than five nodes and more than 20% of the nodes in the node group are unhealthy.

  • A zonal shift is triggered for your cluster through the Application Recovery Controller (ARC).

EKS Auto Mode and Karpenter

  • More than 20% of the nodes in the NodePool are unhealthy.

  • For standalone NodeClaims, more than 20% of the nodes in the cluster are unhealthy.

Configure automatic node repair

Automatic node repair cannot be configured when using EKS Auto Mode; it is always enabled, with the same default settings as Karpenter.

Karpenter

To use automatic node repair with Karpenter, enable the feature gate NodeRepair=true. You can enable the feature gates through the --feature-gates CLI option or the FEATURE_GATES environment variable in the Karpenter deployment. For more information, see the Karpenter documentation.
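As a minimal sketch of the environment-variable approach, assuming Karpenter runs as a Deployment named karpenter in the kube-system namespace (both are assumptions; adjust to match your installation):

```shell
# Enable the NodeRepair feature gate on the Karpenter controller.
# Deployment name and namespace are assumed; adjust for your install.
kubectl set env -n kube-system deployment/karpenter \
  FEATURE_GATES="NodeRepair=true"
```

This triggers a rollout of the Karpenter controller with the feature gate enabled.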

Managed node groups

You can enable automatic node repair when creating new EKS managed node groups or by updating existing EKS managed node groups.
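For example, to enable automatic node repair on an existing node group with default settings, you can pass a node repair configuration to the AWS CLI (the cluster and node group names below are placeholders):

```shell
# Enable automatic node repair on an existing managed node group.
# Cluster and node group names are placeholders.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --node-repair-config '{"enabled": true}'
```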

When using EKS managed node groups, you can control node auto repair behavior with the following settings.

To control when node auto repair stops taking action, set a threshold based on the number of unhealthy nodes in the node group. Set either the absolute count or percentage, but not both.

maxUnhealthyNodeThresholdCount
  The absolute number of unhealthy nodes above which node auto repair stops. Use this to limit the scope of repairs.

maxUnhealthyNodeThresholdPercentage
  The percentage of unhealthy nodes above which node auto repair stops (0-100).

To control how many nodes repair at the same time, you can configure repair parallelism. As with the unhealthy node threshold, set either the absolute count or percentage, but not both.

maxParallelNodesRepairedCount
  The maximum number of nodes to repair concurrently.

maxParallelNodesRepairedPercentage
  The maximum percentage of unhealthy nodes to repair concurrently (0-100).

With nodeRepairConfigOverrides, you can customize repair behavior for specific conditions. Use this when you need different repair actions or wait times for different issue types.

Each override requires all of the following fields:

nodeMonitoringCondition
  The node condition type reported by the node monitoring agent. For example: AcceleratedHardwareReady, NetworkingReady, StorageReady, KernelReady.

nodeUnhealthyReason
  The specific reason code for the unhealthy condition. For example: NvidiaXID31Error, IPAMDNotRunning.

minRepairWaitTimeMins
  The minimum time in minutes that the condition must persist before the node becomes eligible for repair. Use this to avoid repairing nodes for temporary issues.

repairAction
  The action to take when conditions are met. Valid values: Replace (terminate and replace the node), Reboot (reboot the node), or NoAction (no repair actions).

The following AWS CLI example creates a node group with custom repair settings.

aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --node-role arn:aws:iam::111122223333:role/NodeRole \
  --subnets subnet-0123456789abcdef0 \
  --node-repair-config '{
    "enabled": true,
    "maxUnhealthyNodeThresholdPercentage": 10,
    "maxParallelNodesRepairedCount": 3,
    "nodeRepairConfigOverrides": [
      {
        "nodeMonitoringCondition": "AcceleratedHardwareReady",
        "nodeUnhealthyReason": "NvidiaXID64Error",
        "minRepairWaitTimeMins": 5,
        "repairAction": "Replace"
      },
      {
        "nodeMonitoringCondition": "AcceleratedHardwareReady",
        "nodeUnhealthyReason": "NvidiaXID31Error",
        "minRepairWaitTimeMins": 15,
        "repairAction": "NoAction"
      }
    ]
  }'

This configuration does the following:

  • Enables node auto repair

  • Stops repair actions when more than 10% of nodes are unhealthy

  • Repairs up to 3 nodes at a time

  • Overrides XID 64 errors (GPU memory remapping failure) to replace the node after 5 minutes. The default is reboot after 10 minutes.

  • Overrides XID 31 errors (GPU memory page fault) to take no action. The default is reboot after 10 minutes.
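To confirm the settings took effect, you can read the repair configuration back with the AWS CLI; the query path below assumes the configuration is returned in the nodegroup's nodeRepairConfig field, and the cluster and node group names are placeholders:

```shell
# Inspect the repair configuration of a managed node group.
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.nodeRepairConfig'
```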