MLCOST-23: Enable debugging and logging - Machine Learning Lens

Record enough logs and metrics to capture the runtime and resource consumption of your ML workloads, then analyze them to identify areas for improvement. Monitor compute and data storage consumption. Instrument the machine learning code, and use debugging tools to capture metrics at runtime.
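As one way to picture instrumenting ML code, the sketch below wraps a pipeline step in a context manager that records wall-clock time and peak Python memory, then writes both to a log line. It uses only the Python standard library (Python 3.9 or later for `tracemalloc.reset_peak`). The names `track_step`, `metrics`, and the logger name are hypothetical, chosen for illustration. Note that `tracemalloc` only sees memory allocated by the Python interpreter, so native or GPU memory would need a different tool.

```python
import logging
import time
import tracemalloc
from contextlib import contextmanager

# Hypothetical logger name; route it to your log collector as needed.
logger = logging.getLogger("ml.instrumentation")


@contextmanager
def track_step(name, metrics):
    """Record wall-clock seconds and peak traced Python memory for one step."""
    # Only start (and later stop) tracing if nobody else already started it.
    started_tracing = not tracemalloc.is_tracing()
    if started_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # measure the peak for this step only
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        if started_tracing:
            tracemalloc.stop()
        metrics[name] = {"seconds": elapsed, "peak_bytes": peak_bytes}
        logger.info(
            "step=%s seconds=%.4f peak_bytes=%d", name, elapsed, peak_bytes
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    collected = {}
    with track_step("feature_engineering", collected):
        squares = [i * i for i in range(100_000)]  # stand-in for real work
    print(collected)
```

Collecting the numbers in a plain dictionary keeps the helper independent of any particular metrics backend, so the same values can later be forwarded to whichever monitoring service you use.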

Implementation plan

  • Use Amazon SageMaker Debugger - Amazon SageMaker Debugger captures the state of a training job at periodic intervals. It provides visibility into the ML training process by monitoring, recording, and analyzing the captured data, and it supports interactive exploration of that data. Debugger can also alert you to errors detected during training. For example, it can automatically detect commonly occurring problems, such as gradient values becoming too large or too small.

  • Use Amazon CloudWatch - Logs generated by Amazon SageMaker during training are sent to Amazon CloudWatch Logs. Use an AWS KMS key to encrypt log data ingested by Amazon CloudWatch Logs.
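The two items above could be wired up roughly as follows. This is a minimal sketch, not a reference implementation, and it rests on several assumptions: the SageMaker Python SDK v2 (`sagemaker.debugger`) and `boto3` are installed, AWS credentials are configured, the training log group `/aws/sagemaker/TrainingJobs` already exists, and the KMS key policy allows CloudWatch Logs to use the key. The function names are hypothetical. Both SDKs are imported lazily so the parameter helper can be exercised without them.

```python
"""Sketch: Debugger rules plus KMS encryption for SageMaker training logs."""

# Log group that SageMaker training jobs write to in CloudWatch Logs.
TRAINING_LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def build_debugger_rules():
    """Return built-in Debugger rules for common gradient problems."""
    # Assumption: SageMaker Python SDK v2; imported lazily on purpose.
    from sagemaker.debugger import Rule, rule_configs

    return [
        # Flags gradients that shrink toward zero.
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        # Flags tensors that contain non-finite values (inf or NaN).
        Rule.sagemaker(rule_configs.exploding_tensor()),
    ]


def build_kms_association(log_group, kms_key_arn):
    """Build keyword arguments for the CloudWatch Logs associate_kms_key call."""
    # CloudWatch Logs expects the key's full ARN rather than an alias or bare ID.
    if not kms_key_arn.startswith("arn:"):
        raise ValueError("kms_key_arn must be a full KMS key ARN")
    return {"logGroupName": log_group, "kmsKeyId": kms_key_arn}


def encrypt_training_logs(kms_key_arn, log_group=TRAINING_LOG_GROUP):
    """Associate a KMS key with the training log group (calls AWS)."""
    import boto3  # lazy import: only needed when actually calling AWS

    logs = boto3.client("logs")
    logs.associate_kms_key(**build_kms_association(log_group, kms_key_arn))
```

In this sketch, the list returned by `build_debugger_rules()` would be passed as the `rules` argument when constructing a SageMaker estimator, and `encrypt_training_logs` would be called once with the ARN of your own key. Encryption applies to log data ingested after the key is associated.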
