Amazon SageMaker Debugger
Use Amazon SageMaker Debugger to debug, monitor, and profile training jobs in real time; detect non-converging conditions; eliminate bottlenecks to optimize resource utilization; shorten training time; and reduce the cost of your machine learning models.
Amazon SageMaker Debugger Features
A machine learning (ML) training job can have problems such as system bottlenecks, overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.
SageMaker Debugger profiles and debugs training jobs to help resolve such problems and improve your ML model's compute resource utilization and performance. Debugger offers tools to send alerts when training anomalies are found, take actions against the problems, and identify their root causes by visualizing collected metrics and tensors.
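One way to take action against a detected problem is an AWS Lambda function that reacts to a training job status event and stops the job. The sketch below is illustrative only: it assumes the event's `detail` field mirrors the `DescribeTrainingJob` response (including `DebugRuleEvaluationStatuses`), and the helper name `rules_with_issues` and the `sagemaker_client` parameter are hypothetical names introduced for this example.

```python
def rules_with_issues(event):
    """Return the names of Debugger rules that reported IssuesFound.

    Assumption: event["detail"] carries DebugRuleEvaluationStatuses in the
    same shape as the DescribeTrainingJob response.
    """
    detail = event.get("detail") or {}
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    return [
        status.get("RuleConfigurationName")
        for status in statuses
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


def lambda_handler(event, context, sagemaker_client=None):
    """Stop the training job if any Debugger rule found an issue.

    sagemaker_client is a hypothetical injection point so the handler can be
    exercised without AWS credentials; Lambda itself passes only
    (event, context).
    """
    detail = event.get("detail") or {}
    job_name = detail.get("TrainingJobName")
    flagged = rules_with_issues(event)

    # Only a job that is still running can be stopped.
    still_running = detail.get("TrainingJobStatus") == "InProgress"
    if not job_name or not flagged or not still_running:
        return {"stopped": False, "rules": flagged}

    if sagemaker_client is None:
        import boto3  # imported lazily so the pure logic has no AWS dependency

        sagemaker_client = boto3.client("sagemaker")

    sagemaker_client.stop_training_job(TrainingJobName=job_name)
    return {"stopped": True, "rules": flagged}
```

Keeping the rule-status check in a separate pure function makes the decision logic easy to verify with a hand-written event before wiring the handler to a real event rule.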
SageMaker Debugger supports Apache MXNet, TensorFlow, PyTorch, and XGBoost. For more information about available frameworks and versions, see Supported Frameworks and Algorithms.

The high-level Debugger workflow is as follows:
1. Configure a SageMaker training job with Debugger.
   - Configure using the SageMaker Estimator API (for Python SDK).
   - Configure using the SageMaker CreateTrainingJob request (for Boto3 or CLI).
   - Configure custom training containers with Debugger.
2. Start a training job and monitor training issues in real time.
3. Get alerts and take prompt actions against the training issues.
   - Receive texts and emails and stop training jobs when training issues are found using Debugger Built-in Actions for Rules.
   - Set up your own actions using Amazon CloudWatch Events and AWS Lambda.
4. Receive training reports, suggestions to fix the issues, and insights into your training jobs.
5. Explore deep analysis of the training issues and bottlenecks.
   - For profiling training jobs, see Analyze Data Using the SMDebug Client Library.
   - For debugging model parameters, see Visualize Debugger Output Tensors in TensorBoard.
6. Fix the issues, considering the suggestions provided by Debugger, and repeat steps 1–5 until you optimize your model and achieve target accuracy.
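As a rough illustration of the first workflow step when using the CreateTrainingJob request, the following sketch builds only the Debugger-related fields of the request as a plain dictionary. The function name `debugger_request_fields`, the bucket path, and the `save_interval` value are assumptions made for this example, and the rule evaluator image is left as a placeholder because its URI is region-specific.

```python
# Placeholder, not a real URI: the Debugger rule evaluator image is
# region-specific and must be looked up for your region.
RULE_EVALUATOR_IMAGE = "<debugger-rules-image-uri-for-your-region>"


def debugger_request_fields(s3_output_path, rule_names):
    """Build the Debugger portion of a CreateTrainingJob request.

    rule_names are built-in rule names such as "VanishingGradient" or
    "Overfit". The result is meant to be merged into the full request
    alongside the usual training job fields.
    """
    return {
        # Where the Debugger hook saves output tensors, and which tensor
        # collections it saves.
        "DebugHookConfig": {
            "S3OutputPath": s3_output_path,
            "CollectionConfigurations": [
                {
                    "CollectionName": "losses",
                    # Illustrative value; parameter values are strings.
                    "CollectionParameters": {"save_interval": "500"},
                },
                {"CollectionName": "gradients"},
            ],
        },
        # One rule configuration per built-in rule to evaluate.
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": name,
                "RuleEvaluatorImage": RULE_EVALUATOR_IMAGE,
                "RuleParameters": {"rule_to_invoke": name},
            }
            for name in rule_names
        ],
    }


fields = debugger_request_fields(
    "s3://example-bucket/debugger-output",  # hypothetical bucket
    ["VanishingGradient", "Overfit"],
)
```

With Boto3, these fields would be passed together with the rest of the training job definition, for example `boto3.client("sagemaker").create_training_job(**base_request, **fields)`, where `base_request` is a hypothetical dictionary holding the job name, algorithm specification, role, and resource settings.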
The SageMaker Debugger developer guide walks you through the following topics.
Topics
- Supported Frameworks and Algorithms
- Amazon SageMaker Debugger Architecture
- Get Started with Debugger Tutorials
- Configure Debugger Using Amazon SageMaker Python SDK
- Configure Debugger Using Amazon SageMaker API
- List of Debugger Built-in Rules
- Create Debugger Custom Rules for Training Job Analysis
- Use Debugger with Custom Training Containers
- Action on Amazon SageMaker Debugger Rules
- Amazon SageMaker Debugger in Amazon SageMaker Studio
- SageMaker Debugger Interactive Reports
- Analyze Data Using the SMDebug Client Library
- Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard
- Best Practices for Amazon SageMaker Debugger
- Amazon SageMaker Debugger Advanced Topics and Reference Documentation