
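Support for Node Repair Config for EKS Managed Nodegroups

EKS Managed Nodegroups support node repair: the health of managed nodes is monitored, and unhealthy worker nodes are replaced or rebooted as needed. eksctl now provides comprehensive configuration options for fine-grained control over node repair behavior.

Basic Node Repair Configuration

Using CLI flags

To create a cluster with a managed nodegroup that uses basic node repair, pass the --enable-node-repair flag: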


eksctl create cluster --enable-node-repair

To create a managed nodegroup with node repair on an existing cluster:


eksctl create nodegroup --cluster=<clusterName> --enable-node-repair

Using configuration files


# basic-node-repair.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-node-repair-cluster
  region: us-west-2

managedNodeGroups:
- name: ng-1
  nodeRepairConfig:
    enabled: true


eksctl create cluster -f basic-node-repair.yaml

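Enhanced Node Repair Configuration

Threshold Configuration

You can configure when node repair actions stop by using either a percentage-based or a count-based threshold. Note: you cannot use both a percentage threshold and a count threshold at the same time.

CLI flags for thresholds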


# Percentage-based threshold - repair stops when 20% of nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-percentage=20

# Count-based threshold - repair stops when 5 nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-count=5

Configuration file for thresholds


managedNodeGroups:
- name: threshold-ng
  nodeRepairConfig:
    enabled: true
    # Stop repair actions when 20% of nodes are unhealthy
    maxUnhealthyNodeThresholdPercentage: 20
    # Alternative: stop repair actions when 3 nodes are unhealthy
    # maxUnhealthyNodeThresholdCount: 3
    # Note: Cannot use both percentage and count thresholds simultaneously

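Parallel Repair Limits

Control the maximum number of nodes that can be repaired concurrently or in parallel. This gives you finer-grained control over the pace of node replacements. Note: you cannot use both a percentage limit and a count limit at the same time.

CLI flags for parallel limits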


# Percentage-based parallel limits - repair at most 15% of unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-percentage=15

# Count-based parallel limits - repair at most 2 unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-count=2

Configuration file for parallel limits


managedNodeGroups:
- name: parallel-ng
  nodeRepairConfig:
    enabled: true
    # Repair at most 15% of unhealthy nodes in parallel
    maxParallelNodesRepairedPercentage: 15
    # Alternative: repair at most 2 unhealthy nodes in parallel
    # maxParallelNodesRepairedCount: 2
    # Note: Cannot use both percentage and count limits simultaneously
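
The two mutual-exclusion rules above (percentage versus count, for both the unhealthy threshold and the parallel limit) can be checked before a config file is handed to eksctl. The following is a minimal sketch, not part of eksctl: the helper name find_exclusive_conflicts is hypothetical, and it assumes the nodeRepairConfig section has already been loaded into a plain Python mapping.

```python
from typing import Any, List, Mapping

# Field pairs that the text above describes as mutually exclusive.
EXCLUSIVE_PAIRS = [
    ("maxUnhealthyNodeThresholdPercentage", "maxUnhealthyNodeThresholdCount"),
    ("maxParallelNodesRepairedPercentage", "maxParallelNodesRepairedCount"),
]


def find_exclusive_conflicts(node_repair_config: Mapping[str, Any]) -> List[str]:
    """Return one message for each pair in which both fields are set."""
    conflicts = []
    for percentage_field, count_field in EXCLUSIVE_PAIRS:
        both_set = (
            node_repair_config.get(percentage_field) is not None
            and node_repair_config.get(count_field) is not None
        )
        if both_set:
            conflicts.append(
                f"{percentage_field} and {count_field} cannot both be set"
            )
    return conflicts


if __name__ == "__main__":
    # Hypothetical config that sets both threshold styles at once.
    invalid = {
        "enabled": True,
        "maxUnhealthyNodeThresholdPercentage": 20,
        "maxUnhealthyNodeThresholdCount": 3,
    }
    for message in find_exclusive_conflicts(invalid):
        print(message)
```

Mixing a percentage threshold with a count-based parallel limit is not flagged, because the two pairs are checked independently; the examples in this document combine them in exactly that way.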

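Custom Repair Overrides

Specify granular overrides for specific repair actions. These overrides control the repair action and the repair delay time before a node is considered eligible for repair. If you use this option, you must specify all the values for each override.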


managedNodeGroups:
- name: custom-repair-ng
  instanceType: g4dn.xlarge  # GPU instances
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 25
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # Handle GPU-related failures by terminating the node after a 10-minute wait
      - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
        nodeUnhealthyReason: "NvidiaXID13Error"
        minRepairWaitTimeMins: 10
        repairAction: "Terminate"
      # Handle network issues with restart after waiting
      - nodeMonitoringCondition: "NetworkNotReady"
        nodeUnhealthyReason: "InterfaceNotUp"
        minRepairWaitTimeMins: 20
        repairAction: "Restart"
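
Because every override must carry all of its values, an incomplete entry is easy to catch before the file reaches eksctl. The sketch below is an illustration only: missing_override_fields is a hypothetical helper, and it assumes each entry of nodeRepairConfigOverrides has been loaded into a Python mapping. The four field names are the ones used in the YAML above.

```python
from typing import Any, List, Mapping

# The four values that each override in the example above specifies.
REQUIRED_OVERRIDE_FIELDS = (
    "nodeMonitoringCondition",
    "nodeUnhealthyReason",
    "minRepairWaitTimeMins",
    "repairAction",
)


def missing_override_fields(override: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in one override."""
    missing = []
    for field in REQUIRED_OVERRIDE_FIELDS:
        value = override.get(field)
        if value is None or value == "":
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Hypothetical override that forgets the wait time and the action.
    incomplete = {
        "nodeMonitoringCondition": "NetworkNotReady",
        "nodeUnhealthyReason": "InterfaceNotUp",
    }
    print(missing_override_fields(incomplete))
```

The check only verifies presence. It does not decide whether a given condition, reason, or action is accepted by EKS.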

Complete Configuration Examples

For a comprehensive example with all configuration options, see examples/44-node-repair.yaml.

Example 1: Basic repair with percentage thresholds


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-repair-cluster
  region: us-west-2

managedNodeGroups:
- name: basic-ng
  instanceType: m5.large
  desiredCapacity: 3
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 20
    maxParallelNodesRepairedPercentage: 15

Example 2: Conservative repair for critical workloads


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: critical-workload-cluster
  region: us-west-2

managedNodeGroups:
- name: critical-ng
  instanceType: c5.2xlarge
  desiredCapacity: 6
  nodeRepairConfig:
    enabled: true
    # Very conservative settings
    maxUnhealthyNodeThresholdPercentage: 10
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # Wait longer before taking action on critical workloads
      - nodeMonitoringCondition: "NetworkNotReady"
        nodeUnhealthyReason: "InterfaceNotUp"
        minRepairWaitTimeMins: 45
        repairAction: "Restart"

Example 3: GPU workload with specialized repair


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gpu-workload-cluster
  region: us-west-2

managedNodeGroups:
- name: gpu-ng
  instanceType: g4dn.xlarge
  desiredCapacity: 4
  nodeRepairConfig:
    enabled: true
    maxUnhealthyNodeThresholdPercentage: 25
    maxParallelNodesRepairedCount: 1
    nodeRepairConfigOverrides:
      # GPU failures: terminate the node after a short 5-minute wait
      - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
        nodeUnhealthyReason: "NvidiaXID13Error"
        minRepairWaitTimeMins: 5
        repairAction: "Terminate"

CLI Reference

Node Repair Flags

Flag	Description	Example
`--enable-node-repair`	Enables automatic node repair	`--enable-node-repair`
`--node-repair-max-unhealthy-percentage`	Maximum percentage of unhealthy nodes before repair actions stop	`--node-repair-max-unhealthy-percentage=20`
`--node-repair-max-unhealthy-count`	Maximum number of unhealthy nodes before repair actions stop	`--node-repair-max-unhealthy-count=5`
`--node-repair-max-parallel-percentage`	Maximum percentage of nodes to repair in parallel	`--node-repair-max-parallel-percentage=15`
`--node-repair-max-parallel-count`	Maximum number of nodes to repair in parallel	`--node-repair-max-parallel-count=2`

Note: Because of their complexity, node repair config overrides are supported only through YAML configuration files.
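
When eksctl is invoked from a script, the flags in the table above can be assembled programmatically. The following is a rough sketch rather than anything eksctl provides: node_repair_flags and its parameter names are hypothetical, and it assumes a threshold flag and a parallel-limit flag can be combined in one invocation, as the equivalent YAML fields are in the examples.

```python
from typing import List, Optional


def node_repair_flags(
    max_unhealthy_percentage: Optional[int] = None,
    max_unhealthy_count: Optional[int] = None,
    max_parallel_percentage: Optional[int] = None,
    max_parallel_count: Optional[int] = None,
) -> List[str]:
    """Build the node repair flags for an eksctl command line."""
    # Percentage and count variants of the same setting are mutually exclusive.
    if max_unhealthy_percentage is not None and max_unhealthy_count is not None:
        raise ValueError("set either the unhealthy percentage or the count, not both")
    if max_parallel_percentage is not None and max_parallel_count is not None:
        raise ValueError("set either the parallel percentage or the count, not both")

    flags = ["--enable-node-repair"]
    options = [
        ("--node-repair-max-unhealthy-percentage", max_unhealthy_percentage),
        ("--node-repair-max-unhealthy-count", max_unhealthy_count),
        ("--node-repair-max-parallel-percentage", max_parallel_percentage),
        ("--node-repair-max-parallel-count", max_parallel_count),
    ]
    for flag, value in options:
        if value is not None:
            flags.append(f"{flag}={value}")
    return flags


if __name__ == "__main__":
    command = ["eksctl", "create", "cluster"] + node_repair_flags(
        max_unhealthy_percentage=20, max_parallel_count=2
    )
    # Print the command instead of running it; pass the list to subprocess.run to execute.
    print(" ".join(command))
```

Overrides are deliberately left out of the helper, since they are available only through a configuration file.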

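Configuration Reference

nodeRepairConfig

Field	Type	Description	Constraints	Example
`enabled`	boolean	Enables or disables node repair	-	`true`
`maxUnhealthyNodeThresholdPercentage`	integer	Percentage threshold of unhealthy nodes above which automatic node repair actions stop	Cannot be used with `maxUnhealthyNodeThresholdCount`	`20`
`maxUnhealthyNodeThresholdCount`	integer	Count threshold of unhealthy nodes above which automatic node repair actions stop	Cannot be used with `maxUnhealthyNodeThresholdPercentage`	`5`
`maxParallelNodesRepairedPercentage`	integer	Maximum percentage of unhealthy nodes that can be repaired concurrently or in parallel	Cannot be used with `maxParallelNodesRepairedCount`	`15`
`maxParallelNodesRepairedCount`	integer	Maximum number of unhealthy nodes that can be repaired concurrently or in parallel	Cannot be used with `maxParallelNodesRepairedPercentage`	`2`
`nodeRepairConfigOverrides`	array	Granular overrides for specific repair actions, controlling the repair action and the repair delay time	All values must be specified for each override	See the examples above

nodeRepairConfigOverrides

Field	Type	Description	Valid values
`nodeMonitoringCondition`	string	Unhealthy condition reported by the node monitoring agent that this override applies to	`"AcceleratedInstanceNotReady"`, `"NetworkNotReady"`
`nodeUnhealthyReason`	string	Reason reported by the node monitoring agent that this override applies to	`"NvidiaXID13Error"`, `"InterfaceNotUp"`
`minRepairWaitTimeMins`	integer	Minimum time, in minutes, to wait before attempting to repair a node with the specified condition and reason	Any positive integer
`repairAction`	string	Action to take on the node when all of the specified conditions are met	`"Terminate"`, `"Restart"`, `"NoAction"`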

Further Information

EKS Managed Nodegroup Node Health
