Configure retry policies

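Although SageMaker Pipelines provides a robust, automated way to orchestrate machine learning workflows, steps can fail when a pipeline runs. To handle such failures gracefully and improve pipeline reliability, you can configure retry policies that define how and when a step is automatically retried after an exception. A retry policy lets you specify the exception types to retry, the maximum number of retry attempts, the interval between retries, and the backoff rate by which that interval increases. The following sections show how to configure retry policies for a training step in a pipeline, both in JSON and with the SageMaker Python SDK.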

The following is an example of a training step with a retry policy.


{
    "Steps": [
        {
            "Name": "MyTrainingStep",
            "Type": "Training",
            "RetryPolicies": [
                {
                    "ExceptionType": [
                        "SageMaker.JOB_INTERNAL_ERROR",
                        "SageMaker.CAPACITY_ERROR"
                    ],
                    "IntervalSeconds": 1,
                    "BackoffRate": 2,
                    "MaxAttempts": 5
                }
            ]
        }
    ]
}
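To make the effect of these fields concrete, the following is a minimal sketch that parses the policy above and computes how long each retry would wait. It assumes that `MaxAttempts` counts retries and that the wait before retry n is `IntervalSeconds * BackoffRate^(n-1)`; that geometric schedule is one reading of "backoff rate", not a confirmed description of the service's internal timing. The helper `wait_schedule` is hypothetical and is not part of any SDK.

```python
import json

# Retry policy copied from the pipeline definition example above.
definition = json.loads("""
{
    "Steps": [
        {
            "Name": "MyTrainingStep",
            "Type": "Training",
            "RetryPolicies": [
                {
                    "ExceptionType": [
                        "SageMaker.JOB_INTERNAL_ERROR",
                        "SageMaker.CAPACITY_ERROR"
                    ],
                    "IntervalSeconds": 1,
                    "BackoffRate": 2,
                    "MaxAttempts": 5
                }
            ]
        }
    ]
}
""")


def wait_schedule(policy):
    """Hypothetical helper: seconds waited before each retry.

    Assumes geometric backoff, which is an illustration only.
    """
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]


policy = definition["Steps"][0]["RetryPolicies"][0]
waits = wait_schedule(policy)
print(waits)       # [1, 2, 4, 8, 16]
print(sum(waits))  # 31
```

Under these assumptions, the five retries add at most 31 seconds of waiting on top of the time the training job itself takes.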

The following example shows how to build a TrainingStep with retry policies using the SageMaker Python SDK.


from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.retry import (
    StepRetryPolicy,
    StepExceptionTypeEnum,
    SageMakerJobExceptionTypeEnum,
    SageMakerJobStepRetryPolicy,
)

step_train = TrainingStep(
    name="MyTrainingStep",
    # xxx: other TrainingStep arguments go here
    retry_policies=[
        # Override the default retry behavior for service faults and throttling.
        StepRetryPolicy(
            exception_types=[
                StepExceptionTypeEnum.SERVICE_FAULT,
                StepExceptionTypeEnum.THROTTLING,
            ],
            expire_after_mins=5,
            interval_seconds=10,
            backoff_rate=2.0,
        ),
        # Retry when a resource limit quota is exceeded.
        SageMakerJobStepRetryPolicy(
            exception_types=[SageMakerJobExceptionTypeEnum.RESOURCE_LIMIT],
            expire_after_mins=120,
            interval_seconds=60,
            backoff_rate=2.0,
        ),
        # Retry when the job fails because of a transient error
        # or an EC2 insufficient capacity error (ICE).
        SageMakerJobStepRetryPolicy(
            failure_reason_types=[
                SageMakerJobExceptionTypeEnum.INTERNAL_ERROR,
                SageMakerJobExceptionTypeEnum.CAPACITY_ERROR,
            ],
            max_attempts=10,
            interval_seconds=30,
            backoff_rate=2.0,
        ),
    ]
)
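Two of the policies above use `expire_after_mins` instead of `max_attempts`, so the number of retries depends on how quickly the waits grow. The following sketch gives a rough estimate of how many retries fit in the expiry window. It assumes geometric backoff and counts only the waiting time between attempts, ignoring how long each job attempt runs; both are simplifying assumptions, and the helper `retries_within_window` is hypothetical rather than an SDK function.

```python
def retries_within_window(interval_seconds, backoff_rate, expire_after_mins):
    """Hypothetical helper: count retries whose cumulative wait fits in the window.

    Assumes the wait before each retry grows by backoff_rate and that only
    waiting time counts against the window (job run time is ignored).
    """
    if interval_seconds <= 0 or backoff_rate < 1:
        raise ValueError("this sketch expects interval_seconds > 0 and backoff_rate >= 1")

    budget = expire_after_mins * 60  # window length in seconds
    elapsed = 0.0
    wait = float(interval_seconds)
    retries = 0
    while elapsed + wait <= budget:
        elapsed += wait
        retries += 1
        wait *= backoff_rate
    return retries


# Values taken from the first two policies in the SDK example above.
print(retries_within_window(10, 2.0, 5))    # 4
print(retries_within_window(60, 2.0, 120))  # 6
```

Because real attempts also consume time, the actual number of retries within the window can be lower than this estimate.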

For more information on configuring retry behavior for certain step types, see Amazon SageMaker Pipelines - Retry Policy in the Amazon SageMaker Python SDK documentation.
