GENSUS01-BP01 Implement auto scaling and serverless architectures to optimize resource utilization
Adopt efficient and sustainable AI/ML practices to minimize resource usage, reduce costs, and lower environmental impact. Use serverless architectures, auto scaling, and specialized hardware to optimize resource utilization. This approach enhances performance efficiency, aligns with cost optimization, and supports sustainability goals. Implementing these practices enables responsible and economical deployment of generative AI workloads and promotes effective scaling without unnecessary resource waste.
Desired outcome: After implementing this best practice, customers can improve the elasticity of their generative AI workloads and benefit from the efficiencies of scale of the AWS Cloud.
Benefits of establishing this best practice: Optimize resource utilization - Minimize environmental impact by maximizing the efficiency of generative AI resources.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
Adopting serverless architectures and auto scaling capabilities helps verify that resources are provisioned and consumed only when they are needed. This approach minimizes idle consumption and reduces the associated environmental impact. While training jobs may need to run overnight, notebook and ML development instances that are not in use can be shut down, either by configuring an idle time-out or by scheduling shutdowns. You can further improve the efficiency of your workload's resource utilization by using AWS managed services and managed offerings.
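As one way to implement the scheduling approach, the following sketch shows an AWS Lambda handler, invoked by a nightly Amazon EventBridge schedule, that stops any SageMaker notebook instances still running after hours. The function name, the schedule, and the stop-everything policy are illustrative assumptions; adapt the filtering to your own tagging and working-hours conventions.

```python
import boto3

sagemaker = boto3.client("sagemaker")


def lambda_handler(event, context):
    """Stop all in-service SageMaker notebook instances (assumed nightly schedule)."""
    stopped = []
    paginator = sagemaker.get_paginator("list_notebook_instances")
    for page in paginator.paginate(StatusEquals="InService"):
        for notebook in page["NotebookInstances"]:
            name = notebook["NotebookInstanceName"]
            # Add tag- or name-based exclusions here if some instances must stay up.
            sagemaker.stop_notebook_instance(NotebookInstanceName=name)
            stopped.append(name)
    return {"stopped": stopped}
```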
Amazon Bedrock and Amazon Q are fully managed services, which means that AWS handles infrastructure management, scaling, and maintenance. As a result, users can focus on model development rather than infrastructure management. Similarly, Amazon SageMaker AI Inference Recommender helps optimize the deployment of machine learning models by automating load testing. It assists in selecting the best instance type by considering factors like instance count, container parameters, and model optimizations. The tool provides recommendations for both real-time and serverless inference endpoints, which helps you deploy models with the best performance at the lowest resource consumption.
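As a rough sketch of how an Inference Recommender job can be started and read with boto3, assuming a model package has already been registered in the SageMaker Model Registry (the job name, role ARN, and model package ARN below are placeholders):

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Start a default (automated load testing) recommendation job.
sagemaker.create_inference_recommendations_job(
    JobName="genai-endpoint-recommendation",  # placeholder name
    JobType="Default",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:model-package/example/1"  # placeholder
        ),
    },
)

# After the job completes, review the recommended instance types and their metrics.
result = sagemaker.describe_inference_recommendations_job(
    JobName="genai-endpoint-recommendation"
)
for rec in result.get("InferenceRecommendations", []):
    print(rec["EndpointConfiguration"]["InstanceType"], rec["Metrics"])
```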
For hosting and running generative AI models efficiently, consider AWS Inferentia-based Amazon EC2 instances (such as Inf2). These instances deliver some of the highest compute power and accelerator memory among EC2 instance families, which is crucial for handling large language models and other generative AI workloads. Inferentia-based instances support scale-out distributed inference to optimize compute consumption, and their improved performance per watt translates to more efficient use of resources. By integrating these AWS services and features, organizations can achieve a more sustainable and cost-effective approach to generative AI workloads.
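A minimal sketch of hosting a model on an Inferentia2-backed SageMaker real-time endpoint follows, assuming a Neuron-compatible container image and a compiled model artifact are already available; all names, ARNs, and URIs are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Register the model (placeholder role, image, and artifact location).
sm.create_model(
    ModelName="llm-inf2-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<neuron-compatible-image>",
        "ModelDataUrl": "s3://example-bucket/compiled-model.tar.gz",
    },
)

# Host it on an AWS Inferentia2 instance type.
sm.create_endpoint_config(
    EndpointConfigName="llm-inf2-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "llm-inf2-model",
        "InstanceType": "ml.inf2.xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="llm-inf2-endpoint",
    EndpointConfigName="llm-inf2-config",
)
```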
Implementation steps
- Adopt serverless or fully-managed architectures.
  - Use Amazon Bedrock for generative AI tasks to alleviate server management overhead.
  - Use Amazon Q for business-related AI applications to streamline operations.
  - Use Amazon SageMaker AI Serverless Inference for on-demand ML inference without managing servers (a serverless endpoint configuration sketch follows this list).
- Configure auto scaling capabilities.
  - Set up auto scaling for Amazon SageMaker AI endpoints to handle varying loads efficiently (a target-tracking policy sketch follows this list).
  - Set up EC2 Auto Scaling for custom ML infrastructure to match resource allocation with demand.
- Optimize ML development environments.
  - For SageMaker AI notebook instances, configure an idle time-out to release resources when not in use.
  - For ML development instances, schedule automatic shutdown of unused instances to conserve resources.
- Use SageMaker AI Inference Recommender.
  - Conduct automated load testing to assess model deployments under various loads.
  - Select optimal instance types based on recommendations for cost-effectiveness and performance.
  - Consider both real-time and serverless inference endpoints.
- Implement efficient model hosting.
  - For model deployments, consider AWS Inferentia-based EC2 instances for enhanced performance and efficiency.
  - For large models, scale out and distribute the load across multiple instances.
- Perform continuous monitoring and optimization.
  - Use Amazon CloudWatch to track resource metrics and identify optimization opportunities (a metrics query sketch follows this list).
  - Track token lengths of prompts and model responses to measure utilization.
  - Identify idle time periods and scale down or suspend inference endpoints.
  - Set up SageMaker AI Model Monitor to continuously monitor model performance and data quality.
- Educate your team on sustainable AI practices.
  - Provide training to foster a culture of sustainability.
  - Encourage the use of pre-trained models to reduce training time and resource consumption.
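The following sketch shows how a SageMaker AI Serverless Inference endpoint configuration could be defined with boto3 so that capacity scales to zero when the endpoint is idle. The configuration and model names are hypothetical, and memory size and concurrency should be tuned to your model.

```python
import boto3

sm = boto3.client("sagemaker")

# Serverless inference: no instance type to manage, capacity follows demand.
sm.create_endpoint_config(
    EndpointConfigName="genai-serverless-config",  # hypothetical name
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-registered-model",        # assumes the model already exists
        "ServerlessConfig": {
            "MemorySizeInMB": 4096,                # allowed values: 1024-6144 in 1 GB steps
            "MaxConcurrency": 10,
        },
    }],
)

# The endpoint itself is then created from this configuration.
sm.create_endpoint(
    EndpointName="genai-serverless-endpoint",
    EndpointConfigName="genai-serverless-config",
)
```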
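For the auto scaling step, a target-tracking policy on a provisioned SageMaker endpoint variant might look like the following sketch; the endpoint and variant names, capacity limits, and target value are assumptions to adjust for your workload.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names.
resource_id = "endpoint/genai-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance so capacity follows demand.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance, tune for your model
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```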
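For the monitoring step, the following sketch pulls hourly invocation counts for an endpoint from Amazon CloudWatch; hours with no invocations are candidates for scaling in or suspending the endpoint. The endpoint and variant names are placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum endpoint invocations per hour over the last day.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "genai-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=end - timedelta(days=1),
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)

# Hours with a sum of zero indicate idle periods worth acting on.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```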
Resources
Related practices:
Related guides, videos, and documentation:
Related examples:
Related tools: