MLCOST-15: Use distributed training
When an algorithm supports it, enable distributed training across multiple instances in a training cluster to reduce training time. Use managed services to help ensure all training instances are automatically shut down when training completes.
Implementation plan
- Use Amazon SageMaker distributed training libraries - The distributed training libraries in Amazon SageMaker automatically split large deep learning models and training datasets across AWS GPU instances in a fraction of the time it takes to do so manually. SageMaker achieves these efficiencies through two techniques: data parallelism and model parallelism. Model parallelism splits models that are too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training; data parallelism splits large datasets into shards so that multiple GPUs train on them concurrently, improving training speed.
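To illustrate what data parallelism does with the dataset, the sketch below shards a dataset into near-equal contiguous slices, one per worker, mirroring the partitioning a data-parallel training library performs so each GPU trains on a distinct slice concurrently. The function name and logic are illustrative, not the SageMaker API:

```python
def shard_dataset(samples, num_workers):
    """Split samples into near-equal contiguous shards, one per worker.

    Illustrative only: mimics the dataset partitioning a data-parallel
    library performs so each GPU trains on its own slice concurrently.
    """
    base, extra = divmod(len(samples), num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        # The first `extra` workers each take one additional sample.
        size = base + (1 if rank < extra else 0)
        shards.append(samples[start:start + size])
        start += size
    return shards

# Example: 10 samples across 4 workers -> shard sizes 3, 3, 2, 2
shards = shard_dataset(list(range(10)), 4)
print([len(s) for s in shards])
```

In SageMaker itself you do not write this logic by hand; the library handles sharding when distributed training is enabled on the estimator (for example, via its `distribution` argument in the SageMaker Python SDK).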
Documents
Blogs
- New – Data Parallelism Library in Amazon SageMaker Simplifies Training on Large Datasets
- How Latent Space used the Amazon SageMaker model parallelism library to push the frontiers of large-scale transformers
- The science behind Amazon SageMaker's distributed-training engines
- Amazon SageMaker XGBoost now offers fully distributed GPU training