Amazon SageMaker Debugger

Use Amazon SageMaker Debugger to debug, monitor, and profile training jobs in real time, detect non-converging conditions, optimize resource utilization by eliminating bottlenecks, and reduce the training time and cost of your machine learning models.

New Debugger Features

SageMaker Debugger profiles and debugs your training jobs to improve both the compute resource utilization and the prediction quality of your machine learning models. You can now achieve target accuracy faster using the following new Debugger features:


Figure: Overview of how Amazon SageMaker Debugger works.

Debugger Features

Debugger provides the following features:

To see a list of SageMaker machine learning frameworks and algorithms that Debugger supports, see Supported Frameworks and Algorithms.

To learn more about the architecture of Debugger and how it works, see Debugger Architecture and Best Practices.

For Debugger tutorials and notebook examples, see Get Started with Debugger Tutorials.

Supported Frameworks and Algorithms

The following table shows SageMaker machine learning frameworks and algorithms supported by Debugger.

| SageMaker frameworks and algorithms | Performance optimization: Monitoring system bottlenecks | Performance optimization: Profiling framework operations | Model optimization: Debugging model parameters |
|---|---|---|---|
| TensorFlow | All AWS deep learning containers | AWS TensorFlow deep learning containers 2.3.1 or later | AWS TensorFlow deep learning containers 1.15.4 or later |
| PyTorch | All AWS deep learning containers | AWS PyTorch deep learning containers 1.6.0 or later | AWS PyTorch deep learning containers 1.5.0 or later |
| MXNet | All AWS deep learning containers | - | AWS MXNet deep learning containers 1.6.0 or later |
| XGBoost | 1.0-1, 1.2-1 | - | 1.0-1, 1.2-1 |
| SageMaker generic estimator | SageMaker Debugger built-in algorithms using image URIs; custom training containers (with the AWS deep learning container images, public Docker images, or your own Docker images) | - | Custom training containers (TensorFlow, PyTorch, MXNet, and XGBoost training scripts with a manual hook registration) |

If the framework or algorithm that you want to train and debug is not listed in the table, go to the AWS Discussion Forum and leave feedback on SageMaker Debugger.
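The minimum versions in the table can be checked programmatically. The following is a minimal sketch, assuming that "or later" means a simple numeric comparison of dotted version strings and that a "-" entry means the feature is not listed for that framework. The names `MIN_VERSIONS`, `parse_version`, and `is_supported` are hypothetical helpers for illustration and are not part of the SageMaker Python SDK or SMDebug. XGBoost and the generic estimator are left out because the table lists them by discrete versions or container type rather than by a minimum version.

```python
from typing import Dict, Optional, Tuple

Version = Tuple[int, ...]

# Minimum AWS deep learning container versions transcribed from the table.
# None stands for a "-" entry (feature not listed for that framework).
MIN_VERSIONS: Dict[str, Dict[str, Optional[Version]]] = {
    "tensorflow": {"profiling": (2, 3, 1), "debugging": (1, 15, 4)},
    "pytorch": {"profiling": (1, 6, 0), "debugging": (1, 5, 0)},
    "mxnet": {"profiling": None, "debugging": (1, 6, 0)},
}


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as '2.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_supported(framework: str, feature: str, version: str) -> bool:
    """Return True if the table lists the feature for this framework version."""
    minimum = MIN_VERSIONS[framework.lower()][feature]
    if minimum is None:
        return False
    return parse_version(version) >= minimum


print(is_supported("TensorFlow", "profiling", "2.2.0"))  # False
print(is_supported("TensorFlow", "debugging", "2.2.0"))  # True
print(is_supported("MXNet", "profiling", "1.8.0"))       # False
```

Tuples compare element by element, so `(2, 2, 0)` sorts below `(2, 3, 1)` without any string handling. Version strings with suffixes (for example, release candidates) would need a more careful parser.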

Use Debugger with Custom Training Containers

Bring your training containers to SageMaker and gain insights into your training jobs using Debugger. Maximize your work efficiency by optimizing your model on Amazon EC2 instances using the Debugger monitoring, profiling, and debugging features.

For more information about how to push your training container to Amazon Elastic Container Registry (Amazon ECR) and debug your training jobs, see Use Debugger with Custom Training Containers.

Debugger Open Source GitHub Repositories

Debugger APIs are provided through the SageMaker Python SDK and are designed to construct Debugger hook and rule configurations for the SageMaker CreateTrainingJob and DescribeTrainingJob API operations. The SMDebug client library provides tools to register hooks and to access the training data through its trial feature. It supports the machine learning frameworks TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6 and later.
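As a rough illustration of what hook and rule configurations look like at the API level, the following sketch builds only the Debugger-related fields of a CreateTrainingJob request as a plain Python dictionary. The field names (`DebugHookConfig`, `DebugRuleConfigurations`, and their members) follow the CreateTrainingJob request syntax. The helper name `build_debugger_fields`, the S3 path, the `losses` collection with its `save_interval` value, and the rule evaluator image URI are placeholders assumed for this example. A real request also needs the usual training job fields such as the algorithm specification, IAM role, and resource configuration.

```python
from typing import Any, Dict, List


def build_debugger_fields(
    s3_output_path: str,
    rule_evaluator_image: str,
    rule_names: List[str],
) -> Dict[str, Any]:
    """Return only the Debugger-related part of a CreateTrainingJob request.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    return {
        # Where the hook saves the output tensors collected during training.
        "DebugHookConfig": {
            "S3OutputPath": s3_output_path,
            "CollectionConfigurations": [
                # Collection parameters are passed as string-to-string pairs.
                # The collection and interval chosen here are assumptions.
                {
                    "CollectionName": "losses",
                    "CollectionParameters": {"save_interval": "100"},
                },
            ],
        },
        # One entry per rule that analyzes the saved tensors.
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": name,
                "RuleEvaluatorImage": rule_evaluator_image,
                "RuleParameters": {"rule_to_invoke": name},
            }
            for name in rule_names
        ],
    }


fields = build_debugger_fields(
    s3_output_path="s3://amzn-s3-demo-bucket/debugger-output",  # placeholder
    rule_evaluator_image="<region-specific-debugger-rules-image-uri>",  # placeholder
    rule_names=["VanishingGradient", "LossNotDecreasing"],
)
print(fields["DebugRuleConfigurations"][0]["RuleParameters"])
```

In practice the SageMaker Python SDK assembles these fields for you from its Debugger hook and rule classes. With a low-level client such as boto3, a dictionary like this could be merged into the keyword arguments of `create_training_job` alongside the other required fields.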

To go directly to resources for the Debugger and SMDebug API operations, see the following links:

If you use the Java SDK to conduct SageMaker training jobs and want to configure Debugger APIs, see the following references:

This SageMaker Debugger developer guide walks you through the following topics.