Sustainability pillar – Best practices

The sustainability pillar focuses on environmental impacts, especially energy consumption and efficiency, because they are important levers architects can use to drive direct reductions in resource usage. This section includes best practices to consider while developing models, covering training, tuning, and model performance evaluation.

Related best practices

  • Tradeoff analysis on custom versus pre-trained models (MLCOST-04) - Consider whether the workload needs to be developed as a custom model. Many workloads can use the managed AWS AI services. Using these services means that you won’t need the associated resources to collect, store, and process data and to prepare, train, tune, and deploy an ML model. If adopting a fully managed AI service is not appropriate, evaluate whether you can use pre-existing datasets, algorithms, or models. AWS Marketplace offers over 1,400 ML-related assets that customers can subscribe to. You can also fine-tune an existing model, starting from a pre-trained model such as those available on Hugging Face or SageMaker JumpStart (a minimal fine-tuning sketch follows this list). Using pre-trained models from third parties can reduce the resources you need for data preparation and model training.

  • Enable debugging and logging (MLCOST-23) - A debugger such as SageMaker Debugger can identify training problems like system bottlenecks, overfitting, and saturated activation functions. It provides built-in rules like LowGPUUtilization or Overfit that monitor your workload and can automatically stop a training job as soon as an issue is detected (such as a bug or a job that fails to converge); see the Debugger sketch after this list. SageMaker Debugger also provides profiling capabilities to detect under-utilization of system resources and help right-size your environment. This helps avoid unnecessary carbon emissions.

  • Select optimal computing instance size (MLCOST-09) - Use SageMaker Studio to switch instance types on the fly based on your needs (for example, use a low-power instance type for exploratory data analysis, then switch to a GPU instance only when prototyping neural network code). Right-size your training jobs with Amazon CloudWatch metrics that monitor resources such as CPU, GPU, memory, and disk utilization; see the CloudWatch sketch after this list.

  • Select local training for small scale experiments (MLCOST-11) and Start training with small datasets (MLCOST-17) - Experiment with smaller datasets in your development notebook (see the local mode sketch after this list). This approach allows you to iterate quickly with limited carbon emissions.

  • Stop resources when not in use (MLCOST-16) - When building your model, use lifecycle configuration scripts to automatically stop idle SageMaker notebook instances (see the lifecycle configuration sketch after this list). If you are using SageMaker Studio, install the auto-shutdown Jupyter extension to detect and stop idle resources. Use the fully managed training process provided by SageMaker to automatically launch training instances and shut them down as soon as the training job is complete. This minimizes idle compute resources and thus limits the environmental impact of your training job.
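
For MLCOST-04, the following is a minimal sketch of starting from a pre-trained model rather than training from scratch. It assumes the Hugging Face transformers library is installed; the model ID and the choice to freeze the encoder are illustrative, not prescriptive.

```python
# Minimal sketch: reuse a pre-trained model instead of training from scratch.
# Assumes the `transformers` library is installed; the model ID is an
# illustrative placeholder.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # hypothetical choice of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # only the new classification head starts untrained
)

# Optionally freeze the pre-trained encoder so only the small head is trained,
# which further reduces compute (and therefore energy) per training step.
for param in model.base_model.parameters():
    param.requires_grad = False
```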
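
For MLCOST-23, here is a minimal sketch of attaching SageMaker Debugger built-in rules to a training job so that it stops automatically when an issue is detected. The image URI, role, and S3 path are placeholders to replace with your own values.

```python
# Minimal sketch: Debugger rules that stop a wasteful training job early.
from sagemaker.debugger import ProfilerRule, Rule, rule_configs
from sagemaker.estimator import Estimator

# Stop the training job automatically when a debugging rule fires.
actions = rule_configs.ActionList(rule_configs.StopTraining())

rules = [
    # Debugging rule: stop if the model starts to overfit.
    Rule.sagemaker(rule_configs.overfit(), actions=actions),
    # Profiling rules: flag idle GPUs and produce a right-sizing report.
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role="<execution-role-arn>",       # placeholder
    instance_count=1,
    instance_type="ml.g4dn.xlarge",    # illustrative GPU instance type
    rules=rules,
)
estimator.fit({"training": "s3://<bucket>/train"})  # placeholder input path
```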
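
For MLCOST-09, the sketch below pulls utilization statistics for a training job from CloudWatch with boto3 to inform right-sizing. SageMaker publishes training-job system metrics under the /aws/sagemaker/TrainingJobs namespace with a Host dimension; the job name and time window are placeholders.

```python
# Minimal sketch: inspect training-job utilization to right-size instances.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
job_name = "my-training-job"  # placeholder training job name

# GPUUtilization is only emitted for GPU instance types.
for metric in ("CPUUtilization", "MemoryUtilization", "GPUUtilization"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="/aws/sagemaker/TrainingJobs",
        MetricName=metric,
        # Host values take the form "<job-name>/algo-<instance-number>".
        Dimensions=[{"Name": "Host", "Value": f"{job_name}/algo-1"}],
        StartTime=datetime.utcnow() - timedelta(hours=6),  # placeholder window
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Average"], point["Maximum"])
```

Consistently low averages suggest the job would fit on a smaller instance; consistently saturated metrics suggest a bottleneck worth profiling.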
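
For MLCOST-11 and MLCOST-17, the following sketch uses SageMaker local mode to iterate on a small data sample before committing to a full-scale training job. It assumes Docker and the sagemaker[local] extra are available; the entry-point script, role, and file paths are placeholders.

```python
# Minimal sketch: iterate locally on a small sample before full training.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",            # placeholder training script
    framework_version="1.2-1",
    instance_type="local",             # run the container on this machine
    instance_count=1,
    role="<execution-role-arn>",       # placeholder
    hyperparameters={"max_iter": 10},  # hypothetical quick-iteration setting
)

# Train on a small sample first; file:// inputs stay on the local disk.
estimator.fit({"train": "file://./data/train_sample.csv"})  # placeholder path
```

Once the code and a small sample converge as expected, the same estimator can be pointed at the full S3 dataset and a remote instance type.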
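
For MLCOST-16, this sketch registers a notebook instance lifecycle configuration whose on-start script schedules the auto-stop-idle sample from the aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples repository. The configuration name and idle timeout are placeholders, and the script URL should be verified against the repository before use.

```python
# Minimal sketch: lifecycle configuration that stops an idle notebook instance.
import base64

import boto3

on_start = """#!/bin/bash
set -e
# Placeholder idle timeout: stop after 60 minutes of inactivity.
IDLE_TIME=3600
# Fetch the auto-stop-idle sample script (verify the URL before use).
wget -O /home/ec2-user/autostop.py https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py
# Check for idleness every 5 minutes.
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python3 /home/ec2-user/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -
"""

sagemaker = boto3.client("sagemaker")
sagemaker.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="auto-stop-idle",  # placeholder name
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)
```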