
GENCOST01-BP01 Right-size model selection to optimize inference costs

Foundation model costs vary greatly across model providers, model families and sizes, and hosting paradigms. Cost can be a valuable factor when selecting models. Understand the models available to you, as well as the requirements of your workload, to make an informed, cost-aware decision.

Desired outcome: When implemented, this best practice helps you manage spend on foundation model inference without guessing at capacity requirements.

Benefits of establishing this best practice: Measure overall efficiency - Understanding the inference and hosting costs associated with a foundation model's performance requirements helps you measure the overall efficiency of your workload.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Foundation models have several cost dimensions, some of which change depending on the hosting paradigm (managed or self-hosted). Typically, managed models charge for consumption measured in input and output tokens, while self-hosted models incur traditional infrastructure costs.

For managed models hosted on Amazon Bedrock, pricing per input and output token varies by model. Newer and larger models often cost more per token than older or smaller ones. Self-hosted models on Amazon EC2 or Amazon SageMaker AI inference endpoints charge based on instance uptime, as well as additional storage and network costs.
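
A back-of-the-envelope break-even calculation can make the comparison between these pricing paradigms concrete. The following sketch uses placeholder prices (none are published rates) to show how request volume drives the managed versus self-hosted decision:

```python
# Break-even between managed (per-token) and self-hosted (per-hour)
# pricing. All prices are assumed placeholders; substitute current
# rates for the models and instances you evaluate.

INPUT_PRICE_PER_1K = 0.003      # USD per 1K input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015     # USD per 1K output tokens (assumed)
INSTANCE_PRICE_PER_HOUR = 7.50  # USD per hour for a GPU instance (assumed)
HOURS_PER_MONTH = 730

def managed_monthly_cost(requests, in_tokens=800, out_tokens=300):
    """Token-metered cost for a managed model."""
    per_request = (in_tokens / 1000) * INPUT_PRICE_PER_1K \
                  + (out_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return requests * per_request

def self_hosted_monthly_cost():
    """Uptime-metered cost for an always-on endpoint (storage and
    network charges are excluded for simplicity)."""
    return HOURS_PER_MONTH * INSTANCE_PRICE_PER_HOUR

for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} req/mo: managed ${managed_monthly_cost(volume):>9,.0f}"
          f" vs self-hosted ${self_hosted_monthly_cost():,.0f}")
```

With these assumed numbers, managed pricing wins at low volume and self-hosted wins at sustained high volume; the crossover point shifts with your actual token counts and rates.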

When optimizing for cost, consider testing with a smaller model first, and gradually increase model size and capability until you find an acceptable model. The criteria for an acceptable model will vary with the workload's use case. Starting with the smallest model improves your chances of selecting the model with the most cost-effective token pricing. For self-hosted models, optimize the infrastructure based on the model used and the workload's usage pattern.
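
This "start small, step up" approach can be expressed as a simple evaluation loop. The sketch below assumes the Amazon Bedrock Converse API via boto3; the candidate model IDs are examples, and passes_acceptance_criteria() is a hypothetical hook you would replace with your own evaluation dataset and scoring logic:

```python
import boto3

# Evaluate candidates from smallest to largest and stop at the first
# model that clears the quality bar. Model IDs are examples only.
bedrock = boto3.client("bedrock-runtime")

CANDIDATES = [                    # ordered smallest/cheapest first
    "amazon.nova-micro-v1:0",
    "amazon.nova-lite-v1:0",
    "amazon.nova-pro-v1:0",
]

def passes_acceptance_criteria(prompt: str, answer: str) -> bool:
    """Placeholder scoring hook: substitute task-specific evaluation."""
    return len(answer.strip()) > 0

def meets_quality_bar(model_id: str, test_prompts: list[str]) -> bool:
    for prompt in test_prompts:
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        answer = response["output"]["message"]["content"][0]["text"]
        if not passes_acceptance_criteria(prompt, answer):
            return False
    return True

def select_model(test_prompts: list[str]) -> str:
    for model_id in CANDIDATES:
        if meets_quality_bar(model_id, test_prompts):
            return model_id       # first acceptable model is the cheapest
    raise RuntimeError("No candidate met the quality bar")
```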

Right-sizing is an ongoing activity. As newer models become available, workload needs change, and prompting and orchestration are refined, evaluate smaller, more cost-effective models against your workload's needs to continually optimize.

Additionally, consider decomposing your workload and routing each inference request to a different sized model based on its specific needs. Route less complicated inference requests to smaller, more cost-effective models, and assess output quality to verify it remains high across requests of varying complexity. For managed models hosted on Amazon Bedrock, consider intelligent prompt routing for dynamic routing between models in the same model family. Alternatively, weigh the benefits of developing a custom prompt routing layer.
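
As a sketch of what a custom routing layer might look like, the following example routes requests using a toy complexity heuristic. The model IDs and keyword rules are illustrative assumptions; a production router would more likely use a classifier model or Amazon Bedrock intelligent prompt routing:

```python
import boto3

# Minimal custom router: a cheap heuristic sends simple requests to a
# small model and everything else to a larger one. Model IDs and the
# complexity heuristic below are illustrative assumptions.
bedrock = boto3.client("bedrock-runtime")

SMALL_MODEL = "amazon.nova-micro-v1:0"   # cost-effective default
LARGE_MODEL = "amazon.nova-pro-v1:0"     # reserved for complex requests

def estimate_complexity(prompt: str) -> str:
    """Toy heuristic: long or multi-step prompts count as 'complex'."""
    multi_step = any(kw in prompt.lower()
                     for kw in ("step by step", "analyze", "compare"))
    return "complex" if multi_step or len(prompt) > 1500 else "simple"

def route_and_invoke(prompt: str) -> str:
    model_id = (LARGE_MODEL if estimate_complexity(prompt) == "complex"
                else SMALL_MODEL)
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```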

Implementation steps

  1. Identify the minimum performance requirements for a foundation model.

  2. Determine which available models meet that minimum performance bar.

  3. Select the most cost-efficient model based on the prioritized cost dimensions (like hosting paradigm or model size); a minimal selection sketch follows this list.

  4. Continuously evaluate model selection to validate that the highest performance is being achieved at the lowest possible price point.
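
Steps 2 and 3 reduce to a filter-then-minimize over your evaluation results. A minimal sketch, assuming hypothetical quality scores and blended per-1K-token prices:

```python
# Keep the models that meet the minimum performance bar (step 2), then
# choose the cheapest (step 3). All scores and prices are assumed.
candidates = {
    # model id: (quality_score, blended USD per 1K tokens)
    "model-small":  (0.78, 0.0005),
    "model-medium": (0.86, 0.003),
    "model-large":  (0.91, 0.015),
}

MIN_QUALITY = 0.85  # minimum performance bar from step 1

eligible = {m: v for m, v in candidates.items() if v[0] >= MIN_QUALITY}
best = min(eligible, key=lambda m: eligible[m][1])
print(f"Selected {best}: score {eligible[best][0]},"
      f" ${eligible[best][1]}/1K tokens")
```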
