Basic Node Repair Configuration Enhanced Node Repair Configuration Complete Configuration Examples CLI Reference Configuration Reference Further information

Support Node Repair Configuration for EKS Managed Nodegroups

EKS Managed Nodegroups supports Node Repair, where the health of managed nodes are monitored, and unhealthy worker nodes are replaced or rebooted in response. eksctl now provides comprehensive configuration options for fine-grained control over node repair behavior.

Basic Node Repair Configuration

Using CLI flags

To create a cluster with a managed nodegroup using basic node repair, pass the --enable-node-repair flag:


eksctl create cluster --enable-node-repair

To create a managed nodegroup with node repair on an existing cluster:


eksctl create nodegroup --cluster=<clusterName> --enable-node-repair

Using configuration files


# basic-node-repair.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-node-repair-cluster
  region: us-west-2

managedNodeGroups:
- name: ng-1
  nodeRepairConfig:
    enabled: true


eksctl create cluster -f basic-node-repair.yaml

Enhanced Node Repair Configuration

Threshold Configuration

You can configure when node repair actions will stop using either percentage or count-based thresholds. Note: You cannot use both percentage and count thresholds at the same time.

CLI flags for thresholds


# Percentage-based threshold - repair stops when 20% of nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-percentage=20

# Count-based threshold - repair stops when 5 nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-count=5

Configuration file for thresholds


managedNodeGroups:
- name: threshold-ng
  nodeRepairConfig:
    enabled: true
    # Stop repair actions when 20% of nodes are unhealthy
    maxUnhealthyNodeThresholdPercentage: 20
    # Alternative: stop repair actions when 3 nodes are unhealthy
    # maxUnhealthyNodeThresholdCount: 3
    # Note: Cannot use both percentage and count thresholds simultaneously

Parallel Repair Limits

Control the maximum number of nodes that can be repaired concurrently or in parallel. This gives you finer-grained control over the pace of node replacements. Note: You cannot use both percentage and count limits at the same time.

CLI flags for parallel limits


# Percentage-based parallel limits - repair at most 15% of unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-percentage=15

# Count-based parallel limits - repair at most 2 unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-count=2

Configuration file for parallel limits


managedNodeGroups:
- name: parallel-ng
  nodeRepairConfig:
    enabled: true
    # Repair at most 15% of unhealthy nodes in parallel
    maxParallelNodesRepairedPercentage: 15
    # Alternative: repair at most 2 unhealthy nodes in parallel
    # maxParallelNodesRepairedCount: 2
    # Note: Cannot use both percentage and count limits simultaneously

Custom Repair Overrides

Specify granular overrides for specific repair actions. These overrides control the repair action and the repair delay time before a node is considered eligible for repair. If you use this, you must specify all the values for each override.


managedNodeGroups:
- name: custom-repair-ng
  instanceType: g4dn.xlarge  # GPU instances
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 25
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # Handle GPU-related failures with immediate termination
      - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
        nodeUnhealthyReason: "NvidiaXID13Error"
        minRepairWaitTimeMins: 10
        repairAction: "Terminate"
      # Handle network issues with restart after waiting
      - nodeMonitoringCondition: "NetworkNotReady"
        nodeUnhealthyReason: "InterfaceNotUp"
        minRepairWaitTimeMins: 20
        repairAction: "Restart"

Complete Configuration Examples

For a comprehensive example with all configuration options, see examples/44-node-repair.yaml.

Example 1: Basic repair with percentage thresholds


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-repair-cluster
  region: us-west-2

managedNodeGroups:
- name: basic-ng
  instanceType: m5.large
  desiredCapacity: 3
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 20
    maxParallelNodesRepairedPercentage: 15

Example 2: Conservative repair for critical workloads


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: critical-workload-cluster
  region: us-west-2

managedNodeGroups:
- name: critical-ng
  instanceType: c5.2xlarge
  desiredCapacity: 6
  nodeRepairConfig:
    enabled: true
    # Very conservative settings
    maxUnhealthyNodeThresholdPercentage: 10
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # Wait longer before taking action on critical workloads
      - nodeMonitoringCondition: "NetworkNotReady"
        nodeUnhealthyReason: "InterfaceNotUp"
        minRepairWaitTimeMins: 45
        repairAction: "Restart"

Example 3: GPU workload with specialized repair


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gpu-workload-cluster
  region: us-west-2

managedNodeGroups:
- name: gpu-ng
  instanceType: g4dn.xlarge
  desiredCapacity: 4
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 25
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # GPU failures require immediate termination
      - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
        nodeUnhealthyReason: "NvidiaXID13Error"
        minRepairWaitTimeMins: 5
        repairAction: "Terminate"

CLI Reference

Node Repair Flags

Flag	Description	Example
`--enable-node-repair`	Enable automatic node repair	`--enable-node-repair`
`--node-repair-max-unhealthy-percentage`	Maximum percentage of unhealthy nodes before repair	`--node-repair-max-unhealthy-percentage=20`
`--node-repair-max-unhealthy-count`	Maximum count of unhealthy nodes before repair	`--node-repair-max-unhealthy-count=5`
`--node-repair-max-parallel-percentage`	Maximum percentage of nodes to repair in parallel	`--node-repair-max-parallel-percentage=15`
`--node-repair-max-parallel-count`	Maximum count of nodes to repair in parallel	`--node-repair-max-parallel-count=2`

Note: Node repair config overrides are only supported through YAML configuration files due to their complexity.

Configuration Reference

nodeRepairConfig

Field	Type	Description	Constraints	Example
`enabled`	boolean	Enable/disable node repair	-	`true`
`maxUnhealthyNodeThresholdPercentage`	integer	Percentage threshold of unhealthy nodes, above which node auto repair actions will stop	Cannot be used with `maxUnhealthyNodeThresholdCount`	`20`
`maxUnhealthyNodeThresholdCount`	integer	Count threshold of unhealthy nodes, above which node auto repair actions will stop	Cannot be used with `maxUnhealthyNodeThresholdPercentage`	`5`
`maxParallelNodesRepairedPercentage`	integer	Maximum percentage of unhealthy nodes that can be repaired concurrently or in parallel	Cannot be used with `maxParallelNodesRepairedCount`	`15`
`maxParallelNodesRepairedCount`	integer	Maximum count of unhealthy nodes that can be repaired concurrently or in parallel	Cannot be used with `maxParallelNodesRepairedPercentage`	`2`
`nodeRepairConfigOverrides`	array	Granular overrides for specific repair actions controlling repair action and delay time	All values must be specified for each override	See examples above

nodeRepairConfigOverrides

Field	Type	Description	Valid Values
`nodeMonitoringCondition`	string	Unhealthy condition reported by the node monitoring agent that this override applies to	`"AcceleratedInstanceNotReady"`, `"NetworkNotReady"`
`nodeUnhealthyReason`	string	Reason reported by the node monitoring agent that this override applies to	`"NvidiaXID13Error"`, `"InterfaceNotUp"`
`minRepairWaitTimeMins`	integer	Minimum time in minutes to wait before attempting to repair a node with the specified condition and reason	Any positive integer
`repairAction`	string	Repair action to take for nodes when all of the specified conditions are met	`"Terminate"`, `"Restart"`, `"NoAction"`

Further information

EKS Managed Nodegroup Node Health

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

EKS Hybrid Nodes

Networking