Step 2: Launch and debug training jobs using the SageMaker Python SDK

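To configure a SageMaker estimator with SageMaker Debugger, use the Amazon SageMaker Python SDK and specify the Debugger-specific parameters. To make full use of the debugging functionality, you need to configure three parameters: debugger_hook_config, tensorboard_output_config, and rules.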

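Important

Before you construct the estimator and run its fit method to launch a training job, make sure that you have adapted your training script by following the instructions in Step 1: Adapt your training script to register a hook.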

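Construct a SageMaker estimator with Debugger-specific parameters

The code examples in this section show how to construct a SageMaker estimator with the Debugger-specific parameters.

Note

The following code examples are templates for constructing the SageMaker framework estimators and are not directly executable. In particular, DebuggerHookConfig(...) and rule_configs.built_in_rule() are placeholders. Proceed to the next sections to configure the Debugger-specific parameters.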

PyTorch


# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py38",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

TensorFlow


# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, ProfilerRule, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

MXNet


# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

XGBoost


# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Generic estimator


# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

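Configure the following parameters to activate SageMaker Debugger:

debugger_hook_config (a DebuggerHookConfig object): Required to activate the hook that you registered in the adapted training script in Step 1: Adapt your training script to register a hook. It configures the SageMaker training launcher (estimator) to collect output tensors from your training job and to save the tensors to your secured S3 bucket or to your local machine. To learn how to configure the debugger_hook_config parameter, see Configure SageMaker Debugger to save tensors.
rules (a list of Rule objects): Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. The built-in rules are logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the rules parameter, see Configure Debugger built-in rules. To find the complete list of built-in rules for debugging output tensors, see Debugger rule. If you want to create your own logic to detect training issues, see Create Debugger custom rules for training job analysis.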

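Note
The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.
tensorboard_output_config (a TensorBoardOutputConfig object): Configure SageMaker Debugger to collect output tensors in a TensorBoard-compatible format and save them to the S3 output path specified in the TensorBoardOutputConfig object. To learn more, see Visualize Amazon SageMaker Debugger output tensors in TensorBoard.

Note
tensorboard_output_config must be configured together with the debugger_hook_config parameter, which also requires you to adapt your training script by adding the sagemaker-debugger hook.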
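
The two constraints stated in the notes above can be checked before a job is launched. The following is a minimal sketch, not part of the SageMaker Python SDK: check_debugger_kwargs is a hypothetical helper that inspects a plain dictionary of estimator keyword arguments. It assumes that local mode is selected with the instance types "local" or "local_gpu", and it treats a missing, None, or False debugger_hook_config as "not configured", which is a simplification of how the SDK handles defaults.

```python
from typing import Any, Dict, List

# Instance types that select SageMaker local mode (assumption of this sketch).
LOCAL_INSTANCE_TYPES = {"local", "local_gpu"}


def check_debugger_kwargs(estimator_kwargs: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in estimator keyword arguments.

    Hypothetical helper: it only encodes the two notes from this page and
    does not call the SageMaker Python SDK.
    """
    problems = []

    # tensorboard_output_config must be configured with debugger_hook_config.
    if estimator_kwargs.get("tensorboard_output_config") and not estimator_kwargs.get(
        "debugger_hook_config"
    ):
        problems.append(
            "tensorboard_output_config is set but debugger_hook_config is not"
        )

    # Built-in rules run only on SageMaker training instances, not in local mode.
    if estimator_kwargs.get("rules") and (
        estimator_kwargs.get("instance_type") in LOCAL_INSTANCE_TYPES
    ):
        problems.append("rules are set but instance_type selects local mode")

    return problems


if __name__ == "__main__":
    # Stand-in values; a real call would pass the SDK configuration objects.
    kwargs = {
        "instance_type": "local",
        "rules": ["placeholder-rule"],
        "tensorboard_output_config": "placeholder-config",
    }
    for problem in check_debugger_kwargs(kwargs):
        print("Problem:", problem)
```

An empty list means that neither constraint is violated. The check is deliberately independent of the SDK, so it can run in a notebook cell before the estimator is constructed.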

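Note

SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the default S3 bucket URI in your account has the format s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. SageMaker Debugger creates two subfolders: debug-output and rule-output. If you add the tensorboard_output_config parameter, you will also find a tensorboard-output folder.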
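
To make the URI layout in the note concrete, the following sketch assembles the expected subfolder URIs from their parts. debugger_output_uris is a hypothetical helper written for illustration. It follows the URI format exactly as given in the note and does not query S3, so treat its output as the expected layout rather than a guarantee of what a particular job writes.

```python
from typing import Dict


def debugger_output_uris(
    region: str,
    account_id: str,
    base_job_name: str,
    with_tensorboard: bool = False,
) -> Dict[str, str]:
    """Build the default-bucket URIs of the Debugger subfolders.

    Hypothetical helper: it mirrors the format
    s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<subfolder>/
    described in the note and performs no AWS calls.
    """
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")

    root = f"s3://sagemaker-{region}-{account_id}/{base_job_name}"

    # Debugger always creates these two subfolders.
    subfolders = ["debug-output", "rule-output"]
    # This one appears only when tensorboard_output_config is set.
    if with_tensorboard:
        subfolders.append("tensorboard-output")

    return {name: f"{root}/{name}/" for name in subfolders}


if __name__ == "__main__":
    # Example values only; substitute your own Region, account ID, and job name.
    uris = debugger_output_uris(
        "us-east-1", "111122223333", "debugger-demo", with_tensorboard=True
    )
    for name, uri in uris.items():
        print(f"{name}: {uri}")
```

With the example values, the debug-output entry is s3://sagemaker-us-east-1-111122223333/debugger-demo/debug-output/.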

See the following topics for more detailed examples of how to configure the Debugger-specific parameters.

Topics
