Using the training operator to run jobs

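To run a job with kubectl, create a job.yaml file that specifies the job and submit it with kubectl apply -f job.yaml. In this YAML file, you can set custom configurations in the logMonitoringConfiguration argument to define automated monitoring rules. These rules analyze the log output of the distributed training job to detect problems and recover from them.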


apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # The port that HyperPodElasticAgent listens on
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, consider the job hung. A non-decimal loss (e.g. nan) does not match the pattern, so it counts toward the same limit.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If the next checkpoint upload to S3 doesn't happen within 10 minutes, mark the job as hung.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # If throughput stays at or below the threshold for 25 data points, kill the job

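To use the log monitoring option, make sure that your training script emits its logs to sys.stdout. The HyperPod elastic agent monitors the sys.stdout training logs, which are saved under /tmp/hyperpod/. You can use the following command to emit training logs.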


logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
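
As a minimal sketch of how this might look inside a training script, the following configures logging as shown above and emits one line per batch. The logger name, the format_batch and log_batch helpers, and the message format are hypothetical; the format was chosen only so that it would match a logPattern such as .*Train Epoch.*.

```python
import logging
import sys

# Send INFO-and-above records to stdout so the log monitor can see them.
logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    level=logging.INFO,
    stream=sys.stdout,
    force=True,  # replace handlers that were configured earlier in the process
)
logger = logging.getLogger("train")  # hypothetical logger name


def format_batch(epoch: int, batch: int, loss: float) -> str:
    """Build the per-batch message (hypothetical format)."""
    return f"Train Epoch: {epoch} [batch {batch}] loss={loss:.4f}"


def log_batch(epoch: int, batch: int, loss: float) -> str:
    """Log one batch to stdout and return the message that was logged."""
    message = format_batch(epoch, batch, loss)
    logger.info(message)
    return message


if __name__ == "__main__":
    log_batch(5, 0, 0.847)
```

Keeping the message format in one helper makes it easier to keep the training script and the logPattern in job.yaml in sync.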

The following table describes all of the possible log monitoring configurations.

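Parameter	Usage
jobMaxRetryCount	Maximum number of restarts at the process level.
restartPolicy: numRestartBeforeFullJobRestart	Maximum number of restarts at the process level before the operator restarts at the job level.
restartPolicy: evalPeriodSeconds	The period, in seconds, over which the restart limit is evaluated.
restartPolicy: maxFullJobRestarts	Maximum number of full job restarts before the job fails.
cleanPodPolicy	Specifies the pods that the operator should clean up. Accepted values are `All`, `OnlyComplete`, and `None`.
logMonitoringConfiguration	The log monitoring rules for detecting slow and hanging jobs.
expectedRecurringFrequencyInSeconds	Time interval between two consecutive logPattern matches after which the rule evaluates to HANGING. If not specified, there is no time constraint between consecutive logPattern matches.
expectedStartCutOffInSeconds	Time to the first logPattern match after which the rule evaluates to HANGING. If not specified, there is no time constraint for the first logPattern match.
logPattern	Regular expression that identifies the log lines that the rule applies to while the rule is active.
metricEvaluationDataPoints	Number of consecutive times a rule must evaluate to SLOW before the job is marked as SLOW. If not specified, the default is 1.
metricThreshold	Threshold for the value extracted by logPattern with a capturing group. If not specified, no metric evaluation is performed.
operator	The inequality to apply to the monitoring configuration. Accepted values are `gt`, `gteq`, `lt`, `lteq`, and `eq`.
stopPattern	Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active.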
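
To make the interaction between metricThreshold, operator, and metricEvaluationDataPoints concrete, here is a small simulation based only on the descriptions in the table. It is not the training operator's implementation: the function name evaluate_metric_rule, the "OK" return value, and the choice to reset the streak on a healthy data point are assumptions made for illustration.

```python
import operator as op

# Map the accepted `operator` values to Python comparisons.
COMPARATORS = {
    "gt": op.gt,
    "gteq": op.ge,
    "lt": op.lt,
    "lteq": op.le,
    "eq": op.eq,
}


def evaluate_metric_rule(values, threshold, operator, data_points=1):
    """Return "SLOW" once `data_points` consecutive values satisfy the
    comparison against `threshold`; otherwise return "OK"."""
    compare = COMPARATORS[operator]
    consecutive = 0
    for value in values:
        if compare(value, threshold):
            consecutive += 1
            if consecutive >= data_points:
                return "SLOW"
        else:
            consecutive = 0  # assumption: a healthy data point resets the streak
    return "OK"


# Mirrors the LowThroughputDetection rule: 80 samples/sec, "lteq", 25 points.
status = evaluate_metric_rule([70.0] * 25, threshold=80, operator="lteq", data_points=25)
```

Under these assumptions, 25 readings of 70 samples/sec in a row trip the rule, while a single reading above 80 in the middle of the run starts the count over.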

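For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to continue running the job on nodes that were reserved in advance. Spare node configurations require Kueue, so a job submitted with spare nodes fails if Kueue is not installed. The following example is a sample job.yaml file that contains a spare node configuration.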



apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"

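Monitoring

Amazon SageMaker HyperPod is integrated with Amazon Managed Grafana and Amazon Managed Service for Prometheus for observability, so you can set up monitoring to collect metrics and feed them into these tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor in your job.yaml file when you run jobs with kubectl.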


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s

The training operator emits the following events, which you can feed to Amazon Managed Service for Prometheus to monitor your training jobs.

Event	Description
hyperpod_training_operator_jobs_created_total	Total number of jobs that the training operator has run
hyperpod_training_operator_jobs_restart_latency	Current job restart latency
hyperpod_training_operator_jobs_fault_detection_latency	Fault detection latency
hyperpod_training_operator_jobs_deleted_total	Total number of deleted jobs
hyperpod_training_operator_jobs_successful_total	Total number of completed jobs
hyperpod_training_operator_jobs_failed_total	Total number of failed jobs
hyperpod_training_operator_jobs_restarted_total	Total number of automatically restarted jobs

Sample Docker configuration

The following is a sample Docker configuration that you can run with the hyperpodrun command.


export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}

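Sample log monitoring configurations

Job hang detection

To detect hung jobs, use the following configuration. It relies on these parameters:

expectedStartCutOffInSeconds - how long the monitor should wait before it expects the first log line
expectedRecurringFrequencyInSeconds - the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line that matches the regex pattern .*Train Epoch.* within 60 seconds after the training job starts. After the first match, the monitor expects to see a matching log line every 10 seconds. If the first log line doesn't appear within 60 seconds, or if subsequent log lines don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.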


runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
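
The timing rules described above can be sketched as a small simulation. This is an illustration of the documented semantics, not the elastic agent's code: the function name evaluate_hang_rule, the "OK" return value, and the exact boundary handling (strictly greater than the limit) are assumptions.

```python
def evaluate_hang_rule(match_times, now, start_cutoff=None, recurring_frequency=None):
    """Evaluate one rule against the times at which logPattern matched.

    match_times: sorted list of seconds since job start for each match.
    now: current time in seconds since job start.
    Returns "HANGING" or "OK".
    """
    if not match_times:
        # No match yet: only the start cutoff can apply.
        if start_cutoff is not None and now > start_cutoff:
            return "HANGING"
        return "OK"
    if start_cutoff is not None and match_times[0] > start_cutoff:
        return "HANGING"
    if recurring_frequency is not None:
        # Check every gap between matches, plus the gap since the last match.
        points = list(match_times) + [now]
        for earlier, later in zip(points, points[1:]):
            if later - earlier > recurring_frequency:
                return "HANGING"
    return "OK"


# JobStartGracePeriod: no match 61 seconds into the job.
start_status = evaluate_hang_rule([], now=61, start_cutoff=60)
# JobHangingDetection: last match 12 seconds ago with a 10-second limit.
recurring_status = evaluate_hang_rule([30, 38], now=50, recurring_frequency=10)
```

In this sketch a rule with only expectedRecurringFrequencyInSeconds never fires before the first match, which follows the table's statement that no time constraint applies to the first match when expectedStartCutOffInSeconds is omitted.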

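Training loss spike

The following monitoring configuration matches training logs with the pattern xxx training_loss_step xx. It uses the metricEvaluationDataPoints parameter, which lets you specify how many data points must cross the threshold before the operator restarts the job. Here, the operator restarts the job if the training loss is greater than 2.0 for 5 data points.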


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
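
Before deploying a rule like this, it can help to check locally what the pattern captures. The sketch below uses Python's re module as a stand-in, since the regex engine used by the operator is not stated here; the extract_loss helper and the sample log lines are made up for illustration. Note that the YAML double-quoted "\\d" denotes the regex escape \d.

```python
import re

# Same pattern as in the configuration, written as a Python raw string.
LOSS_PATTERN = re.compile(r".*training_loss_step (\d+(?:\.\d+)?).*")


def extract_loss(line):
    """Return the captured loss as a float, or None if the line does not match."""
    match = LOSS_PATTERN.match(line)
    return float(match.group(1)) if match else None


sample_lines = [  # made-up log lines
    "step 10 training_loss_step 5.0",
    "step 11 training_loss_step 1.25",
    "step 12 training_loss_step nan",
]
losses = [extract_loss(line) for line in sample_lines]

# Apply the rule's comparison (operator "gt", metricThreshold 2.0).
above_threshold = [loss is not None and loss > 2.0 for loss in losses]
```

The line with nan does not match the pattern at all, so in this sketch it yields no data point; catching it would need a separate rule such as one based on expectedRecurringFrequencyInSeconds.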

Low TFLOPs detection

The following monitoring configuration expects training logs with the pattern xx TFLOPs xx every five seconds. If TFLOPs is less than 100 for 5 data points, the operator restarts the training job.


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
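
The capture group in this pattern is broad, so it is worth checking what it actually extracts from your own log lines. The sketch below again uses Python's re module as a stand-in for the operator's regex engine, which is an assumption; the sample line and the stricter STRICT_PATTERN are hypothetical and not part of the documented configuration.

```python
import re

TFLOPS_PATTERN = re.compile(r".* (.+)TFLOPs.*")  # pattern from the configuration
# Hypothetical stricter alternative that captures only the number.
STRICT_PATTERN = re.compile(r".* (\d+(?:\.\d+)?) ?TFLOPs.*")

line = "Training model, speed: 95.2 TFLOPs..."  # made-up sample line

captured = TFLOPS_PATTERN.match(line).group(1)
strict_captured = STRICT_PATTERN.match(line).group(1)

print(repr(captured))         # the broad group includes the trailing space
print(repr(strict_captured))  # the strict group holds only the digits
```

With a space between the number and the unit, the broad group captures the number plus a trailing space. Whether that matters depends on how the captured text is parsed, which is not described here, so a numeric-only capture group is a cautious choice.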
