MLSUS-10: Use efficient model tuning methods

Implement an efficient strategy for optimizing hyperparameter values to minimize the resources required to complete model training. Avoid a brute-force strategy wherever possible, because it tests hyperparameter combinations without regard for the compute consumed: an exhaustive grid search over four hyperparameters with ten candidate values each, for example, launches 10,000 training jobs.

Implementation plan

  • Adopt a sustainable tuning job strategy - Prefer Hyperband or Bayesian search over random search, and avoid grid search. Bayesian search makes intelligent guesses about the next set of hyperparameter values to evaluate based on the results of prior trials. It typically requires about ten times fewer jobs than random search, and therefore about ten times less compute, to find the best hyperparameters. SageMaker Automatic Model Tuning also supports Hyperband, a search strategy that can find the optimal set of hyperparameters up to three times faster than Bayesian search for large-scale models such as deep neural networks that address computer vision problems. You select the strategy when you configure the tuning job, as shown in the sketch after this list.

  • Limit the maximum number of concurrent training jobs - Running many training jobs concurrently within a tuning job gets more work done quickly. However, with the Bayesian optimization strategy, a tuning job improves only through successive rounds of experiments, so running one training job at a time typically achieves the best results with the least compute.

  • Carefully choose the number of hyperparameters and their ranges - You get better results and use fewer compute resources by limiting your search to a few parameters with small ranges of values. If you know that a hyperparameter is log-scaled, such as a learning rate, specify logarithmic scaling for its range to further improve the optimization.
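
The following sketch, written with the SageMaker Python SDK, shows how the three recommendations above come together in a single tuning job configuration. The training image, execution role, S3 path, hyperparameter names, objective metric, and metric regex are illustrative assumptions; substitute values from your own training setup.

from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Hypothetical training container and execution role.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Keep the search space small: a few parameters with narrow ranges.
# Logarithmic scaling suits parameters such as the learning rate,
# whose useful values span several orders of magnitude.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-1, scaling_type="Logarithmic"),
    "batch_size": IntegerParameter(32, 256),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9.]+)"}
    ],
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",   # or "Hyperband" for large deep learning models
    max_jobs=20,           # cap on the total number of training jobs
    max_parallel_jobs=1,   # sequential trials let Bayesian search learn from each result
)

tuner.fit({"train": "s3://<your-bucket>/train"})

With strategy set to "Bayesian" and max_parallel_jobs set to 1, every new trial can build on the results of all completed trials. For a large deep learning model, setting strategy to "Hyperband" instead lets the tuner stop underperforming trials early and reallocate their budget.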
