MLCOST-09: Select optimal computing instance size - Machine Learning Lens

Right-size the training instances according to the ML algorithm used, for maximum efficiency and cost reduction. Use debugging capabilities to understand which resources training actually needs. Simple models might not train faster on larger instances because they cannot take advantage of the additional compute resources. They might even train more slowly because of high GPU communication overhead. Start with smaller instances and scale up as necessary.
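As a rough illustration of why a larger instance is not automatically cheaper, the sketch below compares the total cost of the same training job on several candidate configurations. The configuration labels, hourly prices, and training durations are made-up placeholders, not real AWS prices or measurements. In practice, you would substitute the durations you observe and the current published prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """One training run on a candidate instance configuration (all values hypothetical)."""
    label: str               # placeholder name, not a real instance type
    instance_count: int      # number of instances used for the run
    hourly_price_usd: float  # placeholder price per instance-hour
    training_seconds: float  # measured wall-clock training time

    @property
    def cost_usd(self) -> float:
        # Total cost = instances x price per hour x hours of training.
        return self.instance_count * self.hourly_price_usd * self.training_seconds / 3600.0


def cheapest(trials):
    """Return the trial with the lowest total cost."""
    return min(trials, key=lambda t: t.cost_usd)


# Illustrative numbers only: the large and GPU options finish sooner but cost more overall.
trials = [
    TrialResult("small-cpu", 1, 0.25, 7200),
    TrialResult("large-cpu", 1, 1.00, 3600),
    TrialResult("single-gpu", 1, 4.00, 900),
]

for t in trials:
    print(f"{t.label}: {t.training_seconds / 60:.0f} min, ${t.cost_usd:.2f}")
print("Lowest cost:", cheapest(trials).label)
```

With these placeholder values, the fastest configuration is not the cheapest one. Cost alone is rarely the only criterion: if time to result matters, you might weigh duration alongside cost rather than simply taking the minimum.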

Implementation plan

  • Use Amazon SageMaker Experiments - Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Machine learning workloads can use either a CPU or a GPU instance. Select an instance type from those available, depending on the needs of your ML algorithm. Experiment with both CPU and GPU instances to learn which gives you the best cost configuration. Amazon SageMaker lets you use a single instance or a distributed cluster of GPU instances. Use Amazon SageMaker Experiments to evaluate the alternatives and identify the instance size that gives the best outcome. Because pricing is broken down by time and resources, you can optimize the cost of Amazon SageMaker and pay only for what you need.

  • Use Amazon SageMaker Debugger - Amazon SageMaker Debugger automatically monitors the utilization of system resources, such as GPUs, CPUs, network, and memory, and profiles your training jobs to collect detailed ML framework metrics. You can inspect all resource metrics visually through SageMaker Studio. If a resource is under-utilized, take corrective action to optimize cost.
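The under-utilization check described above can be sketched in a few lines. This is not the SageMaker Debugger API. It assumes you have already exported utilization percentages per resource into plain Python lists, and the sample values and the 30% threshold are arbitrary illustrative choices.

```python
from statistics import mean


def underutilized_resources(samples, threshold_pct=30.0):
    """Return {resource: average utilization} for resources averaging below the threshold.

    samples maps a resource name to a list of utilization percentages (0-100).
    threshold_pct is a hypothetical cut-off; tune it for your own workload.
    """
    flagged = {}
    for resource, values in samples.items():
        if not values:
            continue  # no data collected for this resource
        average = mean(values)
        if average < threshold_pct:
            flagged[resource] = average
    return flagged


# Made-up samples: the GPU sits mostly idle while the CPU is busy.
samples = {
    "gpu": [12.0, 18.0, 15.0, 15.0],
    "cpu": [70.0, 80.0, 75.0, 75.0],
    "memory": [40.0, 50.0, 45.0, 45.0],
}

flagged = underutilized_resources(samples)
for resource, average in flagged.items():
    print(f"{resource} averages {average:.0f}% utilization; consider a smaller or different instance")
```

A persistently idle GPU in a profile like this one suggests the job may run at lower cost on a CPU instance or a smaller GPU instance. An average can hide short bursts, so it is worth looking at the full time series before changing instance size.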
