GENCOST02-BP02 Optimize resource consumption to minimize hosting costs

Hosting a foundation model for inference requires myriad choices, all of which affect cost. These cost dimensions can be optimized to reduce cost while meeting performance goals.

Desired outcome: When implemented, this best practice describes a relationship between cost and performance contextualized in self-hosted foundation model hosting.

Benefits of establishing this best practice:

Measure overall efficiency - It is helpful to understand inference and hosting costs associated with the performance requirements of foundation model.
Stop spending money on undifferentiated heavy lifting - More often than not, it is beneficial to opt for a managed or serverless hosting paradigm, due to the intractability of the total cost of ownership for foundation model hosting.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Self-hosted model infrastructure should be optimized based on the model used and the workload's usage pattern. Customers self-hosting models should also consider optimizing the model's hosting infrastructure. Consider right-sizing the inference endpoint to the smallest instance available that allows you to meet performance goals. In some scenarios, it may be appropriate to shut down the hosting instance and restart it during relevant hours. This is particularly useful for workloads with predictable usage patterns. You may also consider purchasing Amazon EC2 Reserved Instances or Savings Plans to further reduce the cost of a hosted model endpoint. Before committing to compute reservation, consider Amazon SageMaker AI Inference Recommender to evaluate if you are using the ideal inference endpoint type, generation, and size.

Implementation steps

Identify the nature of the demand for this workload.
Deploy selected foundation model on acceptable infrastructure, even if it may be over-provisioned.
Establish an inference or demand profile for the hosted workload.
Optimize the hosting infrastructure in accordance with the workload's demands, and select the most cost optimized infrastructure that meets performance requirements.

Resources

Related practices:

Related guides, videos, and documentation:

Related examples:

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

GENCOST02-BP01 Balance cost and performance when selecting inference paradigms

Cost-aware prompting