Using the training operator to run jobs

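To run a job with kubectl, create a job.yaml file that specifies the job specification, then run kubectl apply -f job.yaml to submit the job. In this YAML file, you can specify custom configurations in the logMonitoringConfiguration argument to define automated monitoring rules. These rules analyze the log output of the distributed training job to detect problems and recover from them.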


apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, consider the job hung. A non-numeric loss (e.g. nan) does not match the pattern, so it counts toward the same 5-minute window.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If next checkpoint s3 upload doesn't happen within 10 mins, mark it hang.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
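
Before submitting a job, it can help to check that the regular expressions in logMonitoringConfiguration match the lines your script actually prints. The following is a minimal sketch that uses Python's re module. The sample log lines are made up for illustration, and the operator's regex engine may differ from Python's in edge cases. Inside a double-quoted YAML string, `\\` denotes a single backslash, so the patterns are written here as Python raw strings with single backslashes.

```python
import re

# Patterns copied from the job.yaml above, with YAML's "\\" unescaped to "\".
PATTERNS = {
    "JobHangingDetection": r".*\[Epoch 0 Batch \d+.*'training_loss_step': (\d+(\.\d+)?).*",
    "LowThroughputDetection": r".*\[Epoch 0 Batch \d+.*'samples\/sec': (\d+(\.\d+)?).*",
}

# Hypothetical log lines; replace them with lines from your own training script.
SAMPLE_LINES = [
    "[Epoch 0 Batch 12] {'training_loss_step': 1.234, 'samples/sec': 95.5}",
    "[Epoch 0 Batch 13] {'training_loss_step': nan, 'samples/sec': 91.0}",
]


def extract_metric(pattern, line):
    """Return the first capture group as a float, or None if the line does not match."""
    match = re.search(pattern, line)
    return float(match.group(1)) if match else None


for rule_name, pattern in PATTERNS.items():
    for line in SAMPLE_LINES:
        print(rule_name, "->", extract_metric(pattern, line))
```

In this sketch the line with a nan loss does not match the JobHangingDetection pattern at all, which is why a run of such lines would look like a hang to that rule.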

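If you want to use the log monitoring options, make sure that your training script emits its logs to sys.stdout. The HyperPod elastic agent monitors the training logs in sys.stdout, which are saved at /tmp/hyperpod/. You can use the following command to emit training logs.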


logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
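
For instance, a training loop might emit one progress line per logging step so that a rule such as `.*Train Epoch.*` has something to match. The following is a minimal sketch: `format_progress` and its arguments are hypothetical names, the loss values are made up, and the line layout only mirrors the sample log line shown in the job hang detection example on this page.

```python
import logging
import sys

# Send training logs to sys.stdout so the HyperPod elastic agent can monitor them.
logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    level=logging.INFO,
    stream=sys.stdout,
)
logger = logging.getLogger(__name__)


def format_progress(epoch, seen, total, loss):
    """Build a progress message (hypothetical helper, not part of HyperPod)."""
    percent = 100.0 * seen / total
    return f"Train Epoch: {epoch} [{seen}/{total} ({percent:.0f}%)]\tloss={loss:.4f}"


# Emit one line per logging step; a real loop would compute the loss itself.
for step, loss in enumerate([0.8470, 0.7912, 0.7405]):  # made-up loss values
    logger.info(format_progress(epoch=5, seen=step * 64, total=60000, loss=loss))
```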

The following table describes all the possible log monitoring configurations:

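Parameter	Usage
jobMaxRetryCount	Maximum number of restarts at the process level.
restartPolicy: numRestartBeforeFullJobRestart	Maximum number of restarts at the process level before the operator restarts at the job level.
restartPolicy: evalPeriodSeconds	The period over which the restart limit is evaluated, in seconds.
restartPolicy: maxFullJobRestarts	Maximum number of full job restarts before the job fails.
cleanPodPolicy	Specifies the pods that the operator should clean. Accepted values are `All`, `OnlyComplete`, and `None`.
logMonitoringConfiguration	The log monitoring rules for slow and hanging job detection.
expectedRecurringFrequencyInSeconds	Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, no time constraint exists between consecutive LogPattern matches.
expectedStartCutOffInSeconds	Time to the first LogPattern match after which the rule evaluates to HANGING. If not specified, no time constraint exists for the first LogPattern match.
logPattern	Regular expression that identifies the log lines that the rule applies to while the rule is active.
metricEvaluationDataPoints	Number of consecutive times a rule must evaluate to SLOW before the job is marked as SLOW. If not specified, the default is 1.
metricThreshold	Threshold for the value extracted by LogPattern with a capturing group. If not specified, metric evaluation is not performed.
operator	The inequality to apply to the monitoring configuration. Accepted values are `gt`, `gteq`, `lt`, `lteq`, and `eq`.
stopPattern	Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active.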
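
To make the interaction of metricThreshold, operator, and metricEvaluationDataPoints concrete, the following sketch shows one plausible reading of the table. It is not the operator's implementation. It assumes that the comparison describes the unhealthy condition (as the comments in the sample job.yaml suggest) and that a healthy data point resets the consecutive count. The function name `is_slow` is hypothetical.

```python
import operator

# Map the accepted `operator` values to Python comparison functions.
COMPARATORS = {
    "gt": operator.gt,
    "gteq": operator.ge,
    "lt": operator.lt,
    "lteq": operator.le,
    "eq": operator.eq,
}


def is_slow(values, metric_threshold, op, metric_evaluation_data_points=1):
    """Return True once `metric_evaluation_data_points` consecutive values
    satisfy `value <op> metric_threshold` (assumed semantics)."""
    compare = COMPARATORS[op]
    streak = 0
    for value in values:
        # A value that does not breach the threshold resets the streak.
        streak = streak + 1 if compare(value, metric_threshold) else 0
        if streak >= metric_evaluation_data_points:
            return True
    return False


# Two readings at or below 80 samples/sec, then a recovery: never 3 in a row.
print(is_slow([95, 78, 79, 90, 85], 80, "lteq", 3))  # False
# Three consecutive readings at or below 80 trip the rule.
print(is_slow([95, 80, 79, 78, 85], 80, "lteq", 3))  # True
```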

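For additional training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to use nodes reserved in advance so that the job can continue running. Spare node configurations require Kueue, so if you submit a job with spare nodes but Kueue is not installed, the job fails. The following example is a sample job.yaml file that contains spare node configurations.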



apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"

Monitoring

Amazon SageMaker HyperPod integrates observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus, so you can set up monitoring to collect metrics and feed them into these observability tools.

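Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor in your job.yaml file when you run jobs with kubectl.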


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s

The following are the events that the training operator emits. You can feed them into Amazon Managed Service for Prometheus to monitor your training jobs.

Event	Description
hyperpod_training_operator_jobs_created_total	Total number of jobs that the training operator has run
hyperpod_training_operator_jobs_restart_latency	Current job restart latency
hyperpod_training_operator_jobs_fault_detection_latency	Fault detection latency
hyperpod_training_operator_jobs_deleted_total	Total number of deleted jobs
hyperpod_training_operator_jobs_successful_total	Total number of completed jobs
hyperpod_training_operator_jobs_failed_total	Total number of failed jobs
hyperpod_training_operator_jobs_restarted_total	Total number of automatically restarted jobs
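
When checking a scrape by hand, it can be useful to pull only the training operator's series out of the /metrics output. The following is a rough sketch, not a full Prometheus text-format parser: it assumes each sample line has the form `name{labels} value` with no trailing timestamp, and the scrape text shown is made up for illustration. The function name `parse_operator_metrics` is hypothetical.

```python
def parse_operator_metrics(text, prefix="hyperpod_training_operator_"):
    """Sum sample values per metric name from Prometheus text-format output.

    Assumes each sample line is `name{labels} value` with no trailing timestamp.
    """
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(None, 1)
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name.startswith(prefix):
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Made-up scrape output, for illustration only.
SAMPLE = """\
# TYPE hyperpod_training_operator_jobs_created_total counter
hyperpod_training_operator_jobs_created_total 12
hyperpod_training_operator_jobs_failed_total 2
go_goroutines 41
"""

print(parse_operator_metrics(SAMPLE))
```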

Sample docker configuration

The following is a sample docker file that you can run with the hyperpodrun command.


export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}

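Sample log monitoring configurations

Job hang detection

To detect hanging jobs, use the following configuration. It uses the following parameters:

expectedStartCutOffInSeconds – how long the monitor should wait before expecting the first logs
expectedRecurringFrequencyInSeconds – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line that matches the regex pattern .*Train Epoch.* within 60 seconds after the training job starts. After the first appearance, the monitor expects to see a matching log line every 10 seconds. If the first log line doesn't appear within 60 seconds, or subsequent log lines don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.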


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "JobStartGracePeriod"
      # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
      logPattern: ".*Train Epoch.*"
      expectedStartCutOffInSeconds: 60
    - name: "JobHangingDetection"
      logPattern: ".*Train Epoch.*"
      expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
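
The timing logic described above can be sketched as follows. This is an illustration of the documented behavior under stated assumptions, not the elastic agent's actual code: times are seconds since job start, the recurring-frequency check is assumed to apply only after the first match, and the function name `is_hanging` is hypothetical.

```python
def is_hanging(match_times, now, start_cutoff=None, recurring_frequency=None):
    """Evaluate one rule at time `now` (seconds since job start).

    `match_times` holds the ascending times at which logPattern matched.
    A limit of None means that no time constraint applies.
    """
    # Start cutoff: the first match (or the current time, if none yet) is too late.
    first_seen = match_times[0] if match_times else now
    if start_cutoff is not None and first_seen > start_cutoff:
        return True
    if recurring_frequency is None or not match_times:
        return False
    # Recurring frequency: any gap between matches, or since the last match, is too long.
    points = list(match_times) + [now]
    return any(later - earlier > recurring_frequency
               for earlier, later in zip(points, points[1:]))


# JobStartGracePeriod: no match yet, 61 seconds after start, cutoff of 60 seconds.
print(is_hanging([], now=61, start_cutoff=60))                      # True
# JobHangingDetection: last match at 14 s, now 30 s, expected every 10 seconds.
print(is_hanging([5, 14], now=30, recurring_frequency=10))          # True
print(is_hanging([5, 14, 23], now=30, recurring_frequency=10))      # False
```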

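Training loss spike

The following monitoring configuration matches training logs with the pattern xxx training_loss_step xx. It uses the parameter metricEvaluationDataPoints, which lets you specify how many data points must breach the threshold before the operator restarts the job. If the training loss value is greater than 2.0 for 5 data points, the operator restarts the job.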


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job

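Low TFLOPs detection

The following monitoring configuration expects training logs with the pattern xx TFLOPs xx every five seconds. If TFLOPs is less than 100 for 5 data points, the operator restarts the training job.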


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
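
It is worth checking what the capture group in a pattern like this actually extracts. The following sketch runs the pattern above through Python's re module against a hypothetical log line shaped like the one in the YAML comment. With this input the captured text carries a trailing space. Python's float() tolerates that, but whether the operator's own parser does is not stated in the source, so the stricter pattern shown is only a suggested alternative, not part of the documented configuration.

```python
import re

# Pattern from the configuration above (no backslashes, so no YAML unescaping needed).
TFLOPS_PATTERN = r".* (.+)TFLOPs.*"

# Hypothetical log line following the shape in the YAML comment.
line = "Training model, speed: 95.2 TFLOPs..."

match = re.search(TFLOPS_PATTERN, line)
captured = match.group(1)
print(repr(captured))   # '95.2 ' (note the trailing space)
print(float(captured))  # 95.2, because float() ignores surrounding whitespace

# A stricter alternative (an assumption, not from the source) captures digits only.
STRICT_PATTERN = r".* (\d+(?:\.\d+)?)\s*TFLOPs.*"
print(repr(re.search(STRICT_PATTERN, line).group(1)))  # '95.2'
```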
