SageMaker Debugger basic profiling is on by default and monitors resource utilization
metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O
wait time, of all SageMaker training jobs submitted using the Amazon SageMaker Python SDK.
To access the resource utilization metrics dashboard of your training job, open the Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments.
If you want to activate the rules that automatically detect system resource utilization
problems, add the rules parameter to the estimator object.
Important
To use the latest SageMaker Debugger features, you need to upgrade the SageMaker Python SDK and
the SMDebug client library. In your IPython kernel, Jupyter Notebook, or
JupyterLab environment, run the following code to install the latest versions of the
libraries and restart the kernel.
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
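After the kernel restarts, it can help to confirm which versions were actually installed. The following is a minimal sketch that uses only the Python standard library. The helper name installed_version is hypothetical and is not part of either library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two libraries upgraded in the previous step.
for name in ("sagemaker", "smdebug"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", the upgrade command likely ran against a different Python environment than the one backing the current kernel.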
Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK
To adjust the basic profiling configuration (profiler_config) or add the
profiler rules (rules), choose one of the tabs to get the template for
setting up a SageMaker AI estimator. In the subsequent pages, you can find more information
about how to configure the two parameters.
Note
The following code examples are not directly executable. Proceed to the next sections to learn how to configure each parameter.
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session = boto3.session.Session()
region = session.region_name

profiler_config = ProfilerConfig(...)
rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator = PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
The following provides brief descriptions of the parameters.
- profiler_config – Configure Debugger to collect system metrics and framework metrics from your training job and save them to your secured S3 bucket URI or local machine. You can set how frequently the system metrics are collected. To learn how to configure the profiler_config parameter, see Configure settings for basic profiling of system resource utilization and Estimator configuration for framework profiling.
- rules – Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in parallel. The rules run on processing containers and automatically analyze your training job to find computational and operational performance issues. The ProfilerReport rule is the most integrated rule: it runs all built-in profiling rules and saves the profiling results as a report in your secured S3 bucket. Make sure that your training job has access to this S3 bucket. To learn how to configure the rules parameter, see Use built-in profiler rules managed by Amazon SageMaker Debugger.
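As an illustration of the two parameters described above, the placeholders in the template might be filled in as follows. This is a sketch only, assuming version 2 of the SageMaker Python SDK is installed. The 500-millisecond sampling interval is an arbitrary choice made for this example, not a recommendation from this guide.

```python
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

# Assumption for illustration: sample system metrics every 500 milliseconds.
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

# Activate the ProfilerReport rule, which runs the built-in profiling rules
# and saves the results as a report.
rules = [
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
```

Both objects can then be passed to the estimator through the profiler_config and rules arguments, as in the template.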
Note
Debugger securely saves output data in subfolders of your default S3 bucket. For
example, the format of the default S3 bucket URI is
s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/.
There are three subfolders created by Debugger: debug-output,
profiler-output, and rule-output. You can also
retrieve the default S3 bucket URIs using the SageMaker AI estimator
class methods.
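To make the URI format above concrete, the following sketch assembles the three default subfolder URIs from their parts. The function name default_debugger_uris and the sample Region, account ID, and job name are hypothetical values chosen for illustration. The sketch only mirrors the stated format, so the actual path for a given training job may differ.

```python
DEBUGGER_SUBFOLDERS = ("debug-output", "profiler-output", "rule-output")


def default_debugger_uris(region, account_id, base_job_name):
    """Build the default Debugger S3 URIs following the format described above."""
    if not (len(account_id) == 12 and account_id.isdigit()):
        raise ValueError("account_id must be a 12-digit string")
    prefix = f"s3://sagemaker-{region}-{account_id}/{base_job_name}"
    # One URI per subfolder that Debugger creates.
    return {name: f"{prefix}/{name}/" for name in DEBUGGER_SUBFOLDERS}


# Hypothetical sample values, for illustration only.
uris = default_debugger_uris("us-west-2", "111122223333", "debugger-profiling-demo")
for name, uri in uris.items():
    print(name, uri)
```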
See the following topics to find out how to configure the Debugger-specific parameters in detail.