Debug and Profile Training Jobs Using Amazon SageMaker Debugger

Use Amazon SageMaker Debugger to debug, profile, and monitor training jobs in real time. With Debugger you can detect non-converging conditions, eliminate bottlenecks to optimize resource utilization, shorten training time, and reduce the cost of training your machine learning models.

Amazon SageMaker Debugger Features

A machine learning (ML) training job can have problems such as system bottlenecks, overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger profiles and debugs training jobs to help resolve such problems and improve your ML model's compute resource utilization and performance. Debugger provides tools that send alerts when training anomalies are found, take actions against the problems, and help you identify their root causes by visualizing the collected metrics and tensors.
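To make the idea of a training anomaly concrete, the following is a minimal, standard-library sketch of two checks of the kind described above: one for a loss that has stopped improving and one for vanishing gradients. It is an illustration only and not how Debugger implements its rules; the function names, the window size, and the thresholds are all assumptions chosen for the example.

```python
from statistics import fmean


def loss_not_decreasing(losses, window=3, min_improvement=1e-3):
    """Return True when the recent loss has stopped improving.

    Compares the mean of the last `window` losses with the mean of the
    `window` losses before them. `window` and `min_improvement` are
    hypothetical defaults, not values taken from Debugger.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    previous = fmean(losses[-2 * window:-window])
    recent = fmean(losses[-window:])
    return previous - recent < min_improvement


def vanishing_gradient(gradients, threshold=1e-7):
    """Return True when the mean absolute gradient falls below `threshold`.

    `threshold` is a hypothetical default chosen for illustration.
    """
    if not gradients:
        return False
    return fmean([abs(g) for g in gradients]) < threshold
```

A monitoring loop could call checks like these on the values recorded at each step and raise an alert, or stop the job, as soon as one of them returns True.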

SageMaker Debugger supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks. For more information about available frameworks and versions supported by SageMaker Debugger, see Supported Frameworks and Algorithms.

Figure: Overview of how Amazon SageMaker Debugger works.

The high-level Debugger workflow is as follows:

  1. Modify your training script with the sagemaker-debugger Python SDK if needed.

  2. Configure a SageMaker training job with SageMaker Debugger.

  3. Start a training job and monitor training issues in real time.

  4. Get alerts and take prompt actions against the training issues.

  5. Receive training reports, suggestions to fix the issues, and insights into your training jobs.

  6. Explore deep analysis of the training issues and bottlenecks.

  7. Fix the issues, considering the suggestions provided by Debugger, and repeat steps 1–5 until you optimize your model and achieve target accuracy.
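Steps 2 and 3 of the workflow above might look roughly like the following sketch, which uses the SageMaker Python SDK to attach a few built-in Debugger rules to a PyTorch estimator. The helper name `build_estimator`, the default script name, and the instance type are assumptions made for illustration; the role ARN, framework version, and Python version are left as parameters so that you can supply values Debugger supports in your account.

```python
def build_estimator(role_arn, framework_version, py_version,
                    entry_point="train.py", instance_type="ml.p3.2xlarge"):
    """Build a PyTorch estimator with Debugger rules attached.

    `entry_point` and `instance_type` defaults are placeholder assumptions.
    """
    # Imported inside the function so that this sketch can be loaded
    # even where the SageMaker Python SDK is not installed.
    from sagemaker.debugger import ProfilerRule, Rule, rule_configs
    from sagemaker.pytorch import PyTorch

    rules = [
        # Built-in debugging rules for common training problems.
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.overfit()),
        # Built-in profiling rule that produces a profiling report.
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    ]

    return PyTorch(
        entry_point=entry_point,
        role=role_arn,
        instance_count=1,
        instance_type=instance_type,
        framework_version=framework_version,
        py_version=py_version,
        rules=rules,
    )
```

Calling `fit()` on the returned estimator starts the training job with the attached rules. Creating the estimator requires AWS credentials and a configured Region, so treat this as a starting point under those assumptions and not a complete setup.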

The SageMaker Debugger developer guide walks you through the following topics.