MLCOST-14: Use managed training capabilities - Machine Learning Lens

Machine learning model training can be an iterative, compute-intensive, and time-consuming process. Instead of training in the notebook itself, which might be running on a small instance, consider offloading training to a managed cluster of compute resources that includes both CPUs and GPUs.

Implementation plan

  • Use Amazon SageMaker managed training capabilities - Amazon SageMaker reduces the time and cost to train and tune ML models without the need to manage infrastructure. With SageMaker, you can train and tune ML models using built-in tools to manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the utilization of system resources such as GPUs, CPUs, and network bandwidth. SageMaker can automatically scale infrastructure up or down based on your training job requirements, from one GPU to thousands, or from terabytes to petabytes of storage. SageMaker also offers the highest-performing ML compute infrastructure currently available, including Amazon EC2 P4d instances, which can reduce ML training costs by up to 60% compared with previous generations. Because you pay only for what you use, you can manage your training costs more effectively.

  • Use the Amazon SageMaker Training Compiler - The Amazon SageMaker Training Compiler can accelerate deep learning (DL) model training by up to 50% through graph- and kernel-level optimizations that make more efficient use of GPUs. You can also add either data parallelism or model parallelism to your training script with a few lines of code, and the SageMaker distributed training libraries will automatically split models and training datasets across GPU instances to help you complete distributed training faster.

  • Use Amazon SageMaker managed Spot training - Amazon SageMaker makes it easy to train machine learning models using managed Amazon EC2 Spot Instances. Managed Spot training can reduce the cost of training models by up to 90% compared with On-Demand Instances. SageMaker manages Spot interruptions on your behalf. You can specify which training jobs use Spot Instances, along with a stopping condition that sets how long SageMaker waits for a job to run using Spot Instances. Metrics and logs generated during training runs are available in Amazon CloudWatch.
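
The managed Spot training settings described in the last item can be sketched as plain Python. This is a minimal illustration rather than a definitive implementation: the field names (EnableManagedSpotTraining, StoppingCondition, CheckpointConfig) follow the SageMaker CreateTrainingJob request as we understand it, the helper names and the S3 bucket are hypothetical, and the savings formula assumes the TrainingTimeInSeconds and BillableTimeInSeconds values reported for a completed training job.

```python
from typing import Any, Dict


def spot_training_settings(
    max_runtime_s: int,
    max_wait_s: int,
    checkpoint_s3_uri: str,
) -> Dict[str, Any]:
    """Build the Spot-related fields of a training job request (hypothetical helper).

    The returned dict is meant to be merged into a full CreateTrainingJob
    request; it is not a complete request on its own.
    """
    if max_runtime_s <= 0:
        raise ValueError("max_runtime_s must be positive")
    # The wait time covers both waiting for Spot capacity and running the job,
    # so it must be at least as long as the maximum runtime.
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        # Checkpoints let an interrupted job resume instead of restarting.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }


def spot_savings_percent(training_time_s: int, billable_time_s: int) -> float:
    """Percentage saved relative to On-Demand, from the job's reported times."""
    if training_time_s <= 0:
        raise ValueError("training_time_s must be positive")
    return (1 - billable_time_s / training_time_s) * 100


if __name__ == "__main__":
    # Hypothetical values: 1 hour max runtime, 2 hours max wait.
    settings = spot_training_settings(
        3600, 7200, "s3://example-bucket/checkpoints/"
    )
    print(settings["StoppingCondition"])
    print(f"{spot_savings_percent(3600, 1080):.0f}% saved")
```

Keeping the validation in a small helper surfaces a wait time that is shorter than the maximum runtime before the job is submitted. The savings function can be applied to the times reported for a finished job to check how much managed Spot training actually saved.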
