
SageMaker Debugger Profiling Report

For any SageMaker training job, the Debugger ProfilerReport rule invokes all of the monitoring and profiling rules and aggregates the rule analysis into a comprehensive report. Following this guide, you can download the report using the Amazon SageMaker Python SDK or the S3 console and learn how to interpret the profiling results.

Download a Debugger Profiling Report

Download the Debugger profiling report while your training job is running or after the job has finished using the Amazon SageMaker Python SDK and AWS Command Line Interface (CLI).

Tip

You can also download the report with one click and no additional scripting through the SageMaker Studio Debugger insights dashboard. To find out how to download the report from Studio, see Open Amazon SageMaker Debugger Insights Dashboard.

Download using SageMaker Python SDK and AWS CLI
  1. Check the current job's default S3 output base URI.

    estimator.output_path
  2. Check the current job name.

    estimator.latest_training_job.job_name
  3. The Debugger profiling report is stored under <default-s3-output-base-uri>/<training-job-name>/rule-output. Configure the rule output path as follows:

    rule_output_path = estimator.output_path.rstrip("/") + "/" + estimator.latest_training_job.job_name + "/rule-output"
  4. To check if the report is generated, list directories and files recursively under the rule_output_path using aws s3 ls with the --recursive option.

    ! aws s3 ls {rule_output_path} --recursive

    This should return a complete list of files under an autogenerated folder named ProfilerReport-1234567890. The folder name combines the string ProfilerReport with a unique 10-digit tag based on the Unix timestamp at which the ProfilerReport rule was initiated.

    
    [Image: An example of rule output]

    The profiler-report.html file is the profiling report autogenerated by Debugger. The remaining files are the built-in rule analysis components stored in JSON, plus a Jupyter notebook used to aggregate them into the report.

  5. Download the files recursively using aws s3 cp. The following command saves all of the rule output files to the ProfilerReport-1234567890 folder under the current working directory.

    ! aws s3 cp {rule_output_path} ./ --recursive

    Tip

    If you are using a Jupyter notebook server, run !pwd to double-check the current working directory.

  6. Under the /ProfilerReport-1234567890/profiler-output directory, open profiler-report.html. If using JupyterLab, choose Trust HTML to see the autogenerated Debugger profiling report.

    
    [Image: An example of rule output]
  7. Open the profiler-report.ipynb file to explore how the report is generated. You can also customize and extend the profiling report using the Jupyter notebook file.
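The path construction in steps 1 through 5 can be sketched in plain Python. This is a minimal sketch: the bucket name and job name below are hypothetical placeholders standing in for `estimator.output_path` and `estimator.latest_training_job.job_name`.

```python
# Sketch of steps 1-5, with hypothetical values standing in for the
# estimator's output path and training job name.
output_path = "s3://sagemaker-us-west-2-111122223333"  # estimator.output_path (hypothetical)
job_name = "my-training-job-2023-01-01-00-00-00-000"   # estimator.latest_training_job.job_name (hypothetical)

# Join the pieces robustly, whether or not output_path ends with a slash.
rule_output_path = output_path.rstrip("/") + "/" + job_name + "/rule-output"

# The same commands the steps above run with `!` in a notebook cell.
ls_command = f"aws s3 ls {rule_output_path} --recursive"
cp_command = f"aws s3 cp {rule_output_path} ./ --recursive"

print(rule_output_path)
```

Running the `ls` command should then list the ProfilerReport-1234567890 folder contents described in step 4.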

Download using Amazon S3 Console
  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base S3 bucket name should be in the following format: sagemaker-<region>-111122223333. Look up the base S3 bucket through the Find bucket by name field.

    
    [Image: An example of the rule output S3 bucket URI]
  3. In the base S3 bucket, look up the training job name by specifying your job name prefix into the Find objects by prefix input field. Choose the training job name.

    
    [Image: An example of the rule output S3 bucket URI]
  4. In the training job's S3 bucket, there are three subfolders for training data collected by Debugger: debug-output/, profiler-output/, and rule-output/. Choose rule-output/.

    
    [Image: An example of the rule output S3 bucket URI]
  5. In the rule-output/ folder, choose ProfilerReport-1234567890, and then choose the profiler-output/ folder. The profiler-output/ folder contains profiler-report.html (the autogenerated profiling report in HTML), profiler-report.ipynb (a Jupyter notebook with the scripts used to generate the report), and a profiler-report/ folder (the rule analysis JSON files used as components of the report).

  6. Select the profiler-report.html file, choose Actions, and Download.

    
    [Image: An example of the rule output S3 bucket URI]
  7. Open the downloaded profiler-report.html file in a web browser.

Note

If you started your training job without configuring the Debugger-specific parameters, Debugger generates the report based only on the system monitoring rules because the Debugger parameters are not configured to save framework metrics. To enable framework metrics profiling and receive an extended Debugger profiling report, configure the profiler_config parameter when constructing or updating SageMaker estimators.

To learn how to configure the profiler_config parameter before starting a training job, see Configure Debugger Framework Profiling.

To update the current training job and enable framework metrics profiling, see Update Debugger Framework Profiling Configuration.
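As a minimal sketch of the configuration described above, the following sets profiler_config on an estimator using the SageMaker Python SDK (v2). The entry point script, IAM role, instance type, and framework versions are hypothetical placeholders; substitute your own.

```python
# A minimal sketch of enabling framework metrics profiling, assuming the
# SageMaker Python SDK (v2) is installed and configured with AWS credentials.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,           # sample system metrics every 500 ms
    framework_profile_params=FrameworkProfile(),  # enable framework metrics with default settings
)

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
)
```

With framework_profile_params set, Debugger collects framework metrics in addition to system metrics, and the generated report includes the framework sections described below.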

Debugger Profiling Report Walkthrough

This section walks you through the Debugger profiling report section by section. The profiling report is generated based on the built-in rules for monitoring and profiling. The report shows result plots only for the rules that found issues.

Training Job Summary

At the beginning of the report, Debugger provides a summary of your training job. In this section, you can review the durations and timestamps of the different training phases.


[Image: An example of the Debugger profiling report]

The summary table contains the following information:

  • start_time – The exact time when the training job started.

  • end_time – The exact time when the training job finished.

  • job_duration_in_seconds – The total training time from the start_time to the end_time.

  • training_loop_start – The exact time when the first step of the first epoch started.

  • training_loop_end – The exact time when the last step of the last epoch finished.

  • training_loop_duration_in_seconds – The total time between the training loop start time and the training loop end time.

  • initialization_in_seconds – Time spent on initializing the training job. The initialization phase covers the period from the start_time to the training_loop_start time. The initialization time is spent on compiling the training script, starting the training script, creating and initializing the model, initiating EC2 instances, and downloading training data.

  • finalization_in_seconds – Time spent on finalizing the training job, such as finishing the model training, updating the model artifacts, and closing the EC2 instances. The finalization phase covers the period from the training_loop_end time to the end_time.

  • initialization (%) – The percentage of time spent on initialization over the total job_duration_in_seconds.

  • training loop (%) – The percentage of time spent on the training loop over the total job_duration_in_seconds.

  • finalization (%) – The percentage of time spent on finalization over the total job_duration_in_seconds.
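The relationships among the summary fields above can be sketched with plain Python; the timestamps below are hypothetical stand-ins for the values Debugger records.

```python
# Sketch of how the summary durations and percentages relate,
# using hypothetical timestamps.
from datetime import datetime

start_time = datetime(2023, 1, 1, 10, 0, 0)           # hypothetical
training_loop_start = datetime(2023, 1, 1, 10, 5, 0)  # hypothetical
training_loop_end = datetime(2023, 1, 1, 11, 55, 0)   # hypothetical
end_time = datetime(2023, 1, 1, 12, 0, 0)             # hypothetical

job_duration_in_seconds = (end_time - start_time).total_seconds()
initialization_in_seconds = (training_loop_start - start_time).total_seconds()
training_loop_duration_in_seconds = (training_loop_end - training_loop_start).total_seconds()
finalization_in_seconds = (end_time - training_loop_end).total_seconds()

# The three percentages are each taken over the total job duration,
# so together they account for 100% of the job.
initialization_pct = 100 * initialization_in_seconds / job_duration_in_seconds
training_loop_pct = 100 * training_loop_duration_in_seconds / job_duration_in_seconds
finalization_pct = 100 * finalization_in_seconds / job_duration_in_seconds
```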

System Usage Statistics

In this section, you can see an overview of system utilization statistics.


[Image: An example of the Debugger profiling report]

The Debugger profiling report includes the following information:

  • node – Lists the names of the nodes. If you use distributed training on multiple nodes (multiple EC2 instances), the node names are in the format algo-n.

  • metric – The system metrics collected by Debugger: CPU, GPU, CPU memory, GPU memory, I/O, and Network metrics.

  • unit – The unit of the system metrics.

  • max – The maximum value of each system metric.

  • p99 – The 99th percentile of each system utilization metric.

  • p95 – The 95th percentile of each system utilization metric.

  • p50 – The 50th percentile (median) of each system utilization metric.

  • min – The minimum value of each system metric.
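A rough sketch of the statistics in this table, computed with a simple nearest-rank percentile over hypothetical GPU utilization samples (Debugger's own aggregation may differ in method and precision):

```python
# Summary statistics over hypothetical GPU utilization samples,
# using the nearest-rank percentile method.
import math

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

gpu_util = [88, 91, 95, 40, 97, 93, 90, 5, 92, 96]  # hypothetical samples (%)

row = {
    "metric": "GPU utilization",
    "unit": "percent",
    "max": max(gpu_util),
    "p99": percentile(gpu_util, 99),
    "p95": percentile(gpu_util, 95),
    "p50": percentile(gpu_util, 50),
    "min": min(gpu_util),
}
```

A wide gap between p50 and min (as in this sample, where a few readings drop near zero) is the kind of pattern the utilization rules flag.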

Framework metrics summary

In this section, the following pie charts show the breakdown of framework operations on CPUs and GPUs.


[Image: An example of the Debugger profiling report]

Each of the pie charts analyzes the collected framework metrics in various aspects as follows:

  • Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on different training phases.

  • Ratio between forward and backward pass – Shows the ratio between time durations spent on forward and backward pass in the training loop.

  • Ratio between CPU/GPU operators – Shows the ratio between time spent on operators running on CPU or GPU, such as convolutional operators.

  • General metrics recorded in framework – Shows the ratio between time spent on major framework metrics, such as data loading, forward and backward pass.

Overview: CPU Operators

This section provides detailed information about the CPU operators. The table shows the percentage of time and the absolute cumulative time spent on the most frequently called CPU operators.


[Image: An example of the Debugger profiling report]

Overview: GPU Operators

This section provides detailed information about the GPU operators. The table shows the percentage of time and the absolute cumulative time spent on the most frequently called GPU operators.


[Image: An example of the Debugger profiling report]

Rules Summary

In this section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and suggestions.


[Image: An example of the Debugger profiling report]

Analyzing the Training Loop – Step Durations

In this section, you can find detailed statistics of step durations on each GPU core of each node. Debugger evaluates the mean, maximum, p99, p95, p50, and minimum values of step durations, and detects step outliers. The following histogram shows the step durations captured on different worker nodes and GPUs. You can enable or disable the histogram of each worker by choosing the legends on the right side. You can check if there is a particular GPU that's causing step duration outliers.


[Image: An example of the Debugger profiling report]
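Step-outlier detection can be sketched as follows, assuming the common rule of flagging steps whose duration exceeds the mean plus three standard deviations (the step durations below are hypothetical):

```python
# A sketch of step-outlier detection over hypothetical step durations,
# flagging steps longer than mean + 3 * standard deviation.
import statistics

step_durations = [0.10] * 10 + [0.11] * 9 + [0.95]  # hypothetical, in seconds

mean = statistics.mean(step_durations)
stdev = statistics.pstdev(step_durations)
threshold = mean + 3 * stdev

# Collect (step index, duration) pairs that exceed the threshold.
outliers = [(i, d) for i, d in enumerate(step_durations) if d > threshold]
```

Here the single slow step stands out well past the threshold; in a real job, such a step often corresponds to checkpointing, evaluation, or a data-loading stall.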

GPU Utilization Analysis

This section shows detailed statistics about GPU core utilization based on the LowGPUUtilization rule. It also summarizes the GPU utilization statistics (mean, p95, and p5) to determine whether the training job is underutilizing GPUs.

Batch Size

This section shows detailed statistics of total CPU utilization, individual GPU utilizations, and GPU memory footprints. The BatchSize rule determines whether you need to change the batch size to better utilize the GPUs. You can check whether the batch size is too small, resulting in underutilization, or too large, causing overutilization and out-of-memory issues. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.


[Image: An example of the Debugger profiling report]

CPU Bottlenecks

In this section, you can drill down into the CPU bottlenecks that the CPUBottleneck rule detected from your training job. The rule checks if the CPU utilization is above cpu_threshold (90% by default) and also if the GPU utilization is below gpu_threshold (10% by default).


[Image: An example of the Debugger profiling report]

The pie charts show the following information:

  • Low GPU usage caused by CPU bottlenecks – Shows the ratio of data points between the ones with GPU utilization above and below the threshold and the ones that match the CPU bottleneck criteria.

  • Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on different training phases.

  • Ratio between forward and backward pass – Shows the ratio between time durations spent on forward and backward pass in the training loop.

  • Ratio between CPU/GPU operators – Shows the ratio between time durations spent on GPUs and CPUs by Python operators, such as data loader processes and forward and backward pass operators.

  • General metrics recorded in framework – Shows major framework metrics and the ratio between time durations spent on the metrics.
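The bottleneck criteria described above (CPU utilization above cpu_threshold while GPU utilization is below gpu_threshold) can be sketched like this; the sampled utilization pairs are hypothetical:

```python
# A sketch of the CPUBottleneck criteria: a data point counts as a
# bottleneck when CPU utilization is above cpu_threshold while GPU
# utilization is below gpu_threshold. Sample values are hypothetical.
cpu_threshold = 90  # percent, the rule's default
gpu_threshold = 10  # percent, the rule's default

# (cpu_util, gpu_util) pairs sampled over time (hypothetical)
samples = [(95, 5), (50, 80), (92, 8), (30, 90), (97, 3)]

bottlenecks = [
    (cpu, gpu)
    for cpu, gpu in samples
    if cpu > cpu_threshold and gpu < gpu_threshold
]
```

A high ratio of such data points to the total suggests the GPUs are starved by CPU-bound work, such as preprocessing in the data loader.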

I/O Bottlenecks

In this section, you can find a summary of I/O bottlenecks.

LoadBalancing in Multi-GPU Training

In this section, you can identify workload balancing issues across GPUs.


[Image: An example of the Debugger profiling report]

GPU Memory Analysis

In this section, you can analyze the GPU memory utilization collected by the GPUMemoryIncrease rule. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.


[Image: An example of the Debugger profiling report]