MLCOST-15: Use distributed training

When the algorithm supports it, enable distributed training to shorten training time. Use multiple instances in a training cluster, and use managed services to help ensure that all training instances are shut down automatically when training completes.

Implementation plan

  • Use Amazon SageMaker distributed training libraries - The distributed training libraries in Amazon SageMaker automatically split large deep learning models and training datasets across AWS GPU instances, in a fraction of the time it takes to do so manually. SageMaker achieves these efficiencies through two techniques: data parallelism and model parallelism. Data parallelism splits a large dataset across instances so that the shards are trained on concurrently, which improves training speed, while model parallelism splits a model that is too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training. Both modes are requested through the training job configuration, as shown in the sketch after this list.
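
The sketch below shows one way to request each mode through the distribution parameter of the SageMaker Python SDK's PyTorch estimator. The entry-point script (train.py), IAM role ARN, S3 data location, instance type and count, and framework versions are illustrative assumptions, not values prescribed by this guide; adjust them to your environment.

```python
# A minimal sketch, not taken from this guide, of enabling SageMaker distributed
# training through the SageMaker Python SDK's PyTorch estimator. The script name,
# role ARN, S3 URI, instance type/count, and framework versions are placeholders.
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder IAM role

# Data parallelism: each GPU holds a full model replica and trains on a shard
# of every batch, using the SageMaker distributed data parallel library.
data_parallel_estimator = PyTorch(
    entry_point="train.py",            # assumed training script
    role=role,
    instance_count=2,                  # multiple instances form the training cluster
    instance_type="ml.p4d.24xlarge",
    framework_version="1.13.1",
    py_version="py39",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Model parallelism: a model too large for one GPU is partitioned across GPUs
# with the SageMaker model parallel library, launched over MPI.
model_parallel_estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    framework_version="1.13.1",
    py_version="py39",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 2, "ddp": True},
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},  # 8 GPUs per ml.p4d.24xlarge
    },
)

# fit() starts a managed training job; SageMaker provisions the cluster and
# automatically shuts down all training instances when the job completes.
data_parallel_estimator.fit({"training": "s3://amzn-s3-demo-bucket/training-data/"})
```

Because the training job is fully managed, SageMaker provisions the cluster only for the duration of the job and terminates the training instances when it finishes, which supports the cost guidance above.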
