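Using the training operator to run jobs

To run a job with kubectl, create a job.yaml file that specifies the job, then run kubectl apply -f job.yaml to submit it. In this YAML file, you can use the logMonitoringConfiguration parameter to define automated monitoring rules. These rules analyze the log output of the distributed training job to detect problems and recover from them.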


apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens on
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, treat the job as hung. A loss that is not a decimal (e.g. nan) does not match the pattern, so it counts toward a hang as well.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If the next checkpoint upload to S3 doesn't happen within 10 minutes, treat the job as hung.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
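
Before submitting the job, it can help to check each logPattern against a log line from your own training script. The following is a minimal sketch, assuming the operator's regex engine treats these patterns the same way Python's re module does (an assumption, not something this page states). The sample log lines are hypothetical. In the YAML file the patterns are double-quoted strings, so \\ stands for a single backslash; here they are written as Python raw strings with single backslashes.

```python
import re

# Patterns from job.yaml after YAML unescaping ("\\d" in YAML becomes \d).
HANG_PATTERN = r".*\[Epoch 0 Batch \d+.*'training_loss_step': (\d+(\.\d+)?).*"
THROUGHPUT_PATTERN = r".*\[Epoch 0 Batch \d+.*'samples\/sec': (\d+(\.\d+)?).*"


def extract_metric(pattern, line):
    """Return the first capture group as a float, or None if the line does not match."""
    match = re.match(pattern, line)
    return float(match.group(1)) if match else None


# Hypothetical log lines; replace them with lines from your own training output.
good_line = "[Epoch 0 Batch 12] {'training_loss_step': 1.25, 'samples/sec': 92.5}"
nan_line = "[Epoch 0 Batch 13] {'training_loss_step': nan, 'samples/sec': 92.5}"

print(extract_metric(HANG_PATTERN, good_line))        # loss value from a normal line
print(extract_metric(HANG_PATTERN, nan_line))         # a nan loss does not match
print(extract_metric(THROUGHPUT_PATTERN, good_line))  # throughput value
```

A line whose loss is nan returns None here, which is why such lines count as missing for the JobHangingDetection rule.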

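If you want to use the log monitoring options, make sure that your training script sends its logs to sys.stdout. The HyperPod elastic agent monitors the training logs in sys.stdout, which are saved in /tmp/hyperpod/. You can use the following call to send training logs to sys.stdout.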


logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
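
A self-contained version of that call, with the imports it needs, might look like the following sketch. The helper names configure_logging and log_training_step and the message wording are illustrative assumptions, not part of HyperPod. The force=True argument (Python 3.8 and later) replaces any handlers that a library installed earlier, so the records really do reach stdout.

```python
import logging
import sys

LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"


def configure_logging(stream=None):
    """Send all log records at INFO and above to stdout (or to `stream` if given)."""
    logging.basicConfig(
        format=LOG_FORMAT,
        level=logging.INFO,
        stream=sys.stdout if stream is None else stream,
        force=True,  # Python 3.8+: replace handlers installed earlier
    )


def log_training_step(logger, epoch, batch, loss):
    """Emit one progress line; the wording is a made-up example."""
    logger.info("Train Epoch: %d [batch %d] loss=%.4f", epoch, batch, loss)


configure_logging()
log_training_step(logging.getLogger("train"), epoch=5, batch=0, loss=0.847)
```

Whatever wording you choose for progress lines, keep it stable, because the logPattern rules match against it.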

The following table describes all the possible log monitoring configurations:

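Parameter	Usage
jobMaxRetryCount	Maximum number of restarts at the process level.
restartPolicy: numRestartBeforeFullJobRestart	Maximum number of restarts at the process level before the operator restarts at the job level.
restartPolicy: evalPeriodSeconds	The period over which the restart limit is evaluated, in seconds.
restartPolicy: maxFullJobRestarts	Maximum number of full job restarts before the job fails.
cleanPodPolicy	Specifies the pods that the operator should clean up. Accepted values are `All`, `OnlyComplete`, and `None`.
logMonitoringConfiguration	Log monitoring rules for detecting slow and hanging jobs.
expectedRecurringFrequencyInSeconds	Time interval between two consecutive logPattern matches after which the rule evaluates to HANGING. If not specified, there is no time limit between consecutive logPattern matches.
expectedStartCutOffInSeconds	Time allowed for the first logPattern match, after which the rule evaluates to HANGING. If not specified, there is no time limit for the first logPattern match.
logPattern	Regular expression that identifies the log lines that the rule applies to while the rule is active.
metricEvaluationDataPoints	Number of consecutive times a rule must evaluate to SLOW before the job is marked as SLOW. If not specified, the default is 1.
metricThreshold	Threshold for the value that logPattern extracts with a capture group. If not specified, no metric evaluation is performed.
operator	The inequality to apply to the monitoring configuration. Accepted values are `gt`, `gteq`, `lt`, `lteq`, and `eq`.
stopPattern	Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active.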
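
The constraints in this table can be checked locally before a job is submitted. The following is a minimal sketch of such a check for a single logMonitoringConfiguration entry, represented as a Python dict. The function validate_rule is hypothetical, and the rules it enforces (logPattern is required, and metricThreshold needs both a capture group and a valid operator) are inferred from the table rather than taken from the operator's own validation.

```python
import re

ALLOWED_OPERATORS = {"gt", "gteq", "lt", "lteq", "eq"}


def validate_rule(rule):
    """Return a list of problems found in one logMonitoringConfiguration entry."""
    problems = []
    pattern = rule.get("logPattern")
    if not pattern:
        # Assumption: every rule needs a logPattern to match against.
        problems.append("logPattern is missing")
        return problems
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        problems.append(f"logPattern is not a valid regex: {exc}")
        return problems
    if "metricThreshold" in rule:
        # The threshold is compared with a value extracted by a capture group.
        if compiled.groups == 0:
            problems.append("metricThreshold needs a capture group in logPattern")
        if rule.get("operator") not in ALLOWED_OPERATORS:
            problems.append("operator must be one of gt, gteq, lt, lteq, eq")
    return problems


good_rule = {
    "name": "LowThroughputDetection",
    "logPattern": r".*'samples\/sec': (\d+(\.\d+)?).*",
    "metricThreshold": 80,
    "operator": "lteq",
    "metricEvaluationDataPoints": 25,
}
bad_rule = {"name": "TFLOPs", "logPattern": ".*TFLOPs.*", "metricThreshold": 100, "operator": "less"}

print(validate_rule(good_rule))
print(validate_rule(bad_rule))
```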

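For more training resiliency, specify a spare node configuration. If your job fails, the operator works with Kueue to continue running the job on nodes that were reserved in advance. Spare node configurations require Kueue, so a job submitted with spare nodes on a cluster where Kueue is not installed fails. The following example is a sample job.yaml file that contains a spare node configuration.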



apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"

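Monitoring

Amazon SageMaker HyperPod observability integrates with Amazon Managed Grafana and Amazon Managed Service for Prometheus, so you can set up monitoring to collect metrics and feed them into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor in your job.yaml file when you run jobs with kubectl.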


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s

The following are the events that the training operator emits. You can feed them into Amazon Managed Service for Prometheus to monitor your training jobs.

Event	Description
hyperpod_training_operator_jobs_created_total	Total number of jobs that the training operator has run
hyperpod_training_operator_jobs_restart_laten	Current job restart latency
hyperpod_training_operator_jobs_fault_detection_	Fault detection latency
hyperpod_training_operator_jobs_deleted_total	Total number of deleted jobs
hyperpod_training_operator_jobs_success_t	Total number of completed jobs
hyperpod_training_operator_jobs_failed_total	Total number of failed jobs
hyperpod_training_operator_jobs_restarted_total	Total number of automatically restarted jobs
Sample docker configuration

The following is a sample docker file that you can run with the hyperpodrun command.


export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}

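Sample log monitoring configurations

Job hang detection

To detect hanging jobs, use the following configuration. It uses the following parameters:

expectedStartCutOffInSeconds: how long the monitor waits for the first matching log line
expectedRecurringFrequencyInSeconds: the maximum interval to wait for the next matching log line

With these settings, the log monitor expects to see a log line that matches the regular expression .*Train Epoch.* within 60 seconds after the training job starts. After the first match, the monitor expects a matching log line every 10 seconds. If the first match doesn't appear within 60 seconds, or a later match doesn't appear within 10 seconds of the previous one, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.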


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "JobStartGracePeriod"
      # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
      logPattern: ".*Train Epoch.*"
      expectedStartCutOffInSeconds: 60
    - name: "JobHangingDetection"
      logPattern: ".*Train Epoch.*"
      expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
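
The timing logic described above can be sketched as a small function. This is an illustration of the two parameters' semantics as this page describes them, not the agent's implementation. The function is_hanging is hypothetical, and it assumes that all times are measured in seconds from the start of the training job.

```python
def is_hanging(match_times, now, start_cutoff=None, recurring=None):
    """Return True if a rule would be considered HANGING at time `now`.

    match_times: sorted times (seconds since job start) at which logPattern matched.
    start_cutoff: expectedStartCutOffInSeconds, or None for no limit.
    recurring: expectedRecurringFrequencyInSeconds, or None for no limit.
    """
    seen = [t for t in match_times if t <= now]
    if not seen:
        # No match yet: only the start cutoff can trigger.
        return start_cutoff is not None and now > start_cutoff
    if start_cutoff is not None and seen[0] > start_cutoff:
        return True  # The first match arrived too late
    if recurring is None:
        return False
    # Check every gap between consecutive matches, then the gap up to `now`.
    points = seen + [now]
    return any(later - earlier > recurring for earlier, later in zip(points, points[1:]))


# JobStartGracePeriod: no match within the first 60 seconds
print(is_hanging([], now=61, start_cutoff=60))
# JobHangingDetection: last match at 38 s, nothing since, checked at 50 s
print(is_hanging([30, 38], now=50, recurring=10))
```

With only expectedRecurringFrequencyInSeconds set, a job that has not produced its first match is never flagged, which is why the sample pairs that rule with a separate start cutoff rule.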

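Training loss spike

The following monitoring configuration matches training logs with the pattern xxx training_loss_step xx. It uses the parameter metricEvaluationDataPoints, which lets you specify how many data points must cross the threshold before the operator restarts the job. If the training loss value is greater than 2.0 for 5 consecutive data points, the operator restarts the job.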


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
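
To see where such a rule would trigger on a given log, you can replay the log lines through the same pattern. The following is a minimal sketch, assuming Python's re module behaves like the operator's regex engine for this pattern and that lines which don't match are simply ignored (both are assumptions). The function first_restart_index and the log lines are hypothetical.

```python
import re

# logPattern from the configuration, after YAML unescaping
LOSS_PATTERN = re.compile(r".*training_loss_step (\d+(?:\.\d+)?).*")


def first_restart_index(log_lines, threshold=2.0, data_points=5):
    """Return the index of the line that completes the streak, or None if none does."""
    streak = 0
    for index, line in enumerate(log_lines):
        match = LOSS_PATTERN.match(line)
        if match is None:
            continue  # Assumption: non-matching lines do not affect the streak
        if float(match.group(1)) > threshold:  # operator: "gt"
            streak += 1
            if streak >= data_points:
                return index
        else:
            streak = 0  # The data points must be consecutive
    return None


losses = [1.0, 2.5, 2.6, 1.9, 2.1, 2.2, 2.3, 2.4, 2.5]
lines = [f"step {i} training_loss_step {loss}" for i, loss in enumerate(losses)]
print(first_restart_index(lines))
```

In this made-up log, the dip to 1.9 resets the count, so the rule triggers only after the five high values that follow it.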

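Low TFLOPs detection

The following monitoring configuration matches training logs with the pattern xx TFLOPs xx, which it expects every five seconds. If TFLOPs is less than 100 for 5 data points, the operator restarts the training job.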


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5
      metricThreshold: 100       # if TFLOPs is less than 100 for 5 data points, restart the job
      operator: "lt"
      metricEvaluationDataPoints: 5
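
The capture group in this pattern is (.+), which accepts any text, so it is worth checking what it extracts from your own log lines. The following is a minimal sketch, assuming Python's re module matches this pattern the same way the operator does (an assumption). The sample log line, the helper parse_tflops, and the stricter alternative pattern are hypothetical.

```python
import re

# logPattern from the configuration
TFLOPS_PATTERN = re.compile(r".* (.+)TFLOPs.*")
# Hypothetical stricter alternative that captures only the number
STRICT_PATTERN = re.compile(r".* (\d+(?:\.\d+)?)\s*TFLOPs.*")


def parse_tflops(line, pattern=TFLOPS_PATTERN):
    """Return the captured value as a float, or None if it is absent or not numeric."""
    match = pattern.match(line)
    if match is None:
        return None
    try:
        return float(match.group(1))  # Python's float() tolerates surrounding whitespace
    except ValueError:
        return None


line = "Training model, speed: 95.2 TFLOPs..."
print(repr(TFLOPS_PATTERN.match(line).group(1)))  # the capture includes the trailing space
print(parse_tflops(line))
print(repr(STRICT_PATTERN.match(line).group(1)))
```

How the operator converts the captured text to a number is not stated on this page, so a pattern that captures only digits may be the safer choice.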
