Monitor AWS compute resource utilization in Amazon SageMaker Studio Classic

To track compute resource utilization of your training job, use the monitoring tools offered by Amazon SageMaker Debugger.

For any training job you run in SageMaker using the SageMaker Python SDK, Debugger collects basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time, every 500 milliseconds by default. To see the dashboard of resource utilization metrics for your training job, use the SageMaker Debugger UI in SageMaker Studio Experiments.

Deep learning operations and steps can run at millisecond intervals. Compared to Amazon CloudWatch metrics, which are collected at 1-second intervals, Debugger provides finer granularity: it can sample resource utilization metrics at intervals as short as 100 milliseconds (0.1 second), so you can examine the metrics at the level of an individual operation or step.
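To illustrate why sampling granularity matters, the following minimal sketch compares a 100-millisecond view of a utilization trace with a single 1-second data point. The sample values are hypothetical, and averaging over the second is an assumption made for illustration only. It is not a description of how CloudWatch or Debugger aggregates metrics.

```python
from statistics import mean

# Hypothetical GPU utilization samples (%), one every 100 ms, covering 1 second.
# Indices 3-5 represent a 300 ms stall, for example a slow data-loading step.
samples_100ms = [95, 95, 95, 10, 10, 10, 95, 95, 95, 95]


def coarse_view(samples):
    """Collapse all samples into one averaged data point (illustrative assumption)."""
    return mean(samples)


def stalled_indices(samples, threshold=50):
    """Return the indices of samples that fall below the utilization threshold."""
    return [i for i, value in enumerate(samples) if value < threshold]


one_second_point = coarse_view(samples_100ms)
stall = stalled_indices(samples_100ms)

print(one_second_point)  # 69.5: the stall is blurred into a moderate average
print(stall)             # [3, 4, 5]: the 100 ms view pinpoints the stalled samples
```

In this toy trace, the single averaged point suggests moderately healthy utilization, while the finer-grained samples show exactly when utilization dropped.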

To change the metric collection interval, add a profiling configuration parameter to your training job launcher. For example, if you're using the SageMaker Python SDK, pass the profiler_config parameter when you create an estimator object. To learn how to adjust the resource utilization metric collection interval, see Code template for configuring a SageMaker estimator object with the SageMaker Debugger Python modules in the SageMaker Python SDK, and then see Configure settings for basic profiling of system resource utilization.
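As a minimal sketch of the profiler_config approach described above, the following helper builds a ProfilerConfig object with a chosen interval. The helper names (validate_interval, make_profiler_config) and the SUPPORTED_INTERVALS_MS tuple are hypothetical and introduced only for this example. The list of accepted values reflects the SageMaker Python SDK documentation as we understand it, so verify it against the SDK version you use.

```python
# Interval values (ms) that the SDK documentation lists for
# system_monitor_interval_millis. Treat this list as an assumption to verify.
SUPPORTED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def validate_interval(interval_ms):
    """Return interval_ms unchanged, or raise ValueError if it is not supported."""
    if interval_ms not in SUPPORTED_INTERVALS_MS:
        raise ValueError(
            f"interval_ms must be one of {SUPPORTED_INTERVALS_MS}, got {interval_ms}"
        )
    return interval_ms


def make_profiler_config(interval_ms=500):
    """Build a ProfilerConfig that samples system metrics every interval_ms."""
    # Imported lazily so the validation helper works without the SDK installed.
    from sagemaker.debugger import ProfilerConfig

    return ProfilerConfig(
        system_monitor_interval_millis=validate_interval(interval_ms)
    )


# Usage sketch (requires the SageMaker Python SDK, AWS credentials, and the
# estimator arguments of your own training job, which are elided here):
#
#   estimator = PyTorch(..., profiler_config=make_profiler_config(100))
#   estimator.fit(...)
```

Validating the interval before constructing the configuration surfaces an unsupported value when the launcher script runs, rather than when the training job is submitted.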

Additionally, you can add issue-detection tools, called built-in profiling rules, that SageMaker Debugger provides. The built-in profiling rules analyze the resource utilization metrics and detect computational performance issues. For more information, see Configure built-in profiler rules managed by Amazon SageMaker Debugger. You can view rule analysis results in the SageMaker Debugger UI in SageMaker Studio Experiments or in the SageMaker Debugger Profiling Report. You can also create custom profiling rules using the SageMaker Python SDK.
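The following sketch shows one way built-in profiling rules might be attached to an estimator with the SageMaker Python SDK. The helper name make_profiler_rules is hypothetical, and the three rules chosen here are only examples. Estimator arguments are elided because they depend on your training job.

```python
def make_profiler_rules():
    """Return a list of built-in profiler rules to pass to an estimator."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from sagemaker.debugger import ProfilerRule, rule_configs

    return [
        # Aggregates the results of the built-in rules into a profiling report.
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
        # Flags periods of low GPU utilization.
        ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
        # Flags periods where the CPU limits training throughput.
        ProfilerRule.sagemaker(rule_configs.CPUBottleneck()),
    ]


# Usage sketch (requires the SageMaker Python SDK and AWS credentials):
#
#   estimator = PyTorch(..., rules=make_profiler_rules())
#   estimator.fit(...)
```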

To learn more about monitoring functionalities provided by SageMaker Debugger, see the following topics.