GENCOST02-BP02 Optimize resource consumption to minimize hosting costs - Generative AI Lens

Hosting a foundation model for inference involves many choices, such as instance type, size, and running hours, and each one affects cost. You can tune these cost dimensions to reduce spend while still meeting performance goals.

Desired outcome: When implemented, this best practice establishes a clear relationship between cost and performance for self-hosted foundation models.

Benefits of establishing this best practice:

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Optimize self-hosted model infrastructure for both the model in use and the workload's usage pattern. Right-size the inference endpoint to the smallest available instance that still meets your performance goals. For workloads with predictable usage patterns, it may be appropriate to shut down the hosting instance outside the hours it is needed and restart it when demand returns. You can also purchase Amazon EC2 Reserved Instances or Savings Plans to further reduce the cost of a hosted model endpoint. Before committing to a compute reservation, consider using Amazon SageMaker AI Inference Recommender to evaluate whether you are using the ideal inference endpoint type, generation, and size.
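The trade-off between scheduling an endpoint and committing to a reservation can be estimated with simple arithmetic. The following is a minimal sketch, not a pricing tool: the hourly rate and the discount are hypothetical placeholders, and it assumes a simplified model in which on-demand capacity is billed only while the endpoint runs, while a commitment is billed for every hour of the month whether or not the endpoint is running. Check current pricing and the terms of your purchase option before relying on numbers like these.

```python
HOURS_PER_WEEK = 24 * 7
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def scheduled_on_demand_cost(hourly_rate_usd: float,
                             active_hours_per_week: float,
                             instances: int = 1) -> float:
    """Monthly cost when the endpoint is billed only for the hours it runs."""
    if not 0 <= active_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("active_hours_per_week must be between 0 and 168")
    utilization = active_hours_per_week / HOURS_PER_WEEK
    return hourly_rate_usd * instances * HOURS_PER_MONTH * utilization


def committed_cost(hourly_rate_usd: float,
                   discount: float,
                   instances: int = 1) -> float:
    """Monthly cost under a commitment that is paid for every hour.

    `discount` is a fraction of the on-demand rate (hypothetical value).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return hourly_rate_usd * (1 - discount) * instances * HOURS_PER_MONTH


def break_even_hours_per_week(discount: float) -> float:
    """Weekly running hours above which the commitment becomes cheaper."""
    return HOURS_PER_WEEK * (1 - discount)


if __name__ == "__main__":
    rate, discount = 2.00, 0.30  # hypothetical on-demand rate and discount
    for hours in (60, 120, 168):
        scheduled = scheduled_on_demand_cost(rate, hours)
        committed = committed_cost(rate, discount)
        cheaper = "scheduled on-demand" if scheduled < committed else "commitment"
        print(f"{hours:>3} h/week: scheduled=${scheduled:,.2f} "
              f"committed=${committed:,.2f} -> {cheaper}")
```

Under these assumptions, the break-even point depends only on the discount: with a 30% discount, an endpoint that runs fewer than about 117.6 hours per week costs less on a schedule than under a commitment. This is one reason to establish the usage pattern before purchasing a reservation.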

Implementation steps

  1. Identify the nature of the demand for this workload, for example whether it is steady, bursty, or limited to predictable hours.

  2. Deploy the selected foundation model on infrastructure that meets performance requirements, even if it is initially over-provisioned.

  3. Establish an inference or demand profile for the hosted workload.

  4. Optimize the hosting infrastructure in accordance with the workload's demands, and select the most cost-effective infrastructure that meets performance requirements.
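Steps 3 and 4 can be sketched as a small selection routine: summarize the observed demand, then pick the cheapest benchmarked configuration that meets the latency target. This is an illustrative sketch only. The candidate names, prices, throughput figures, and the 20% headroom factor are all hypothetical; in practice the per-instance throughput and latency would come from your own load tests or from a tool such as Amazon SageMaker AI Inference Recommender.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One benchmarked hosting option (all values are hypothetical)."""
    name: str
    hourly_rate_usd: float
    max_rps_per_instance: float  # sustained requests/second from a load test
    p95_latency_ms: float        # latency measured at that throughput


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cheapest_fit(candidates: Sequence[Candidate],
                 demand_rps: Sequence[float],
                 latency_target_ms: float,
                 headroom: float = 1.2) -> Optional[Tuple[Candidate, int, float]]:
    """Return (candidate, instance_count, hourly_cost) or None if nothing fits.

    Capacity is sized for the 95th-percentile demand plus headroom.
    """
    required_rps = percentile(demand_rps, 95) * headroom
    best = None
    for cand in candidates:
        if cand.p95_latency_ms > latency_target_ms:
            continue  # fails the performance requirement
        count = max(1, math.ceil(required_rps / cand.max_rps_per_instance))
        cost = count * cand.hourly_rate_usd
        if best is None or cost < best[2]:
            best = (cand, count, cost)
    return best


if __name__ == "__main__":
    # Hypothetical demand profile: requests/second sampled over a day.
    demand = [2, 3, 5, 8, 13, 21, 34, 21, 13, 8]
    options = [
        Candidate("small-accelerator", 1.0, 10, 400),
        Candidate("medium-accelerator", 2.5, 30, 250),
        Candidate("large-accelerator", 6.0, 80, 150),
    ]
    choice = cheapest_fit(options, demand, latency_target_ms=300)
    if choice is None:
        print("No candidate meets the latency target")
    else:
        cand, count, cost = choice
        print(f"{count} x {cand.name} at ${cost:.2f}/hour")
```

Sizing to a high percentile rather than the absolute peak is a design choice that trades a small risk of saturation for lower cost. Workloads with strict latency requirements may prefer to size for the peak or rely on automatic scaling instead.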

Resources

Related practices:

Related guides, videos, and documentation:

Related examples: