GENPERF01-BP02 Collect performance metrics from generative AI workloads

Foundation model performance on a given task can be measured in many ways. When you select foundation models for generative AI workloads, measure each candidate model's performance and track how it changes over time.

Desired outcome: When implemented, your organization improves its ability to evaluate model performance.

Benefits of establishing this best practice: Experiment more often - Testing model performance assists in the selection of foundation models for generative AI workloads.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Consider introducing a centralized logging and monitoring solution for generative AI workloads. For example, Amazon CloudWatch integrates directly with other AWS services such as Amazon Bedrock, the Amazon Q family of services, and Amazon SageMaker AI Inference Endpoints. By configuring Amazon CloudWatch or a similar tool, you can collect performance metrics from model endpoints. Use these metrics to develop and prioritize a roadmap of improvements to your generative AI solutions.
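As a minimal sketch of collecting endpoint metrics, the following builds a CloudWatch query for Amazon Bedrock invocation latency. It assumes the AWS/Bedrock namespace, the InvocationLatency metric, and the ModelId dimension that Amazon Bedrock publishes to CloudWatch. The function names, the one-hour window, and the five-minute period are illustrative choices, not part of this guidance.

```python
from datetime import datetime, timedelta, timezone


def build_latency_query(model_id, hours=1, period_seconds=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    The window length and period are assumptions; tune them to your workload.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_latency(cloudwatch, model_id, **query_options):
    """Query CloudWatch and return datapoints sorted oldest first.

    `cloudwatch` is any object exposing get_metric_statistics,
    for example boto3.client("cloudwatch").
    """
    query = build_latency_query(model_id, **query_options)
    response = cloudwatch.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

With AWS credentials and a Region configured, calling fetch_latency(boto3.client("cloudwatch"), model_id) with the model ID your workload invokes returns one datapoint per period. Keeping the query builder separate from the client call makes the query easy to inspect and test without AWS access.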

Performance metrics should also be collected by applications and services that interact with model endpoints and other generative AI services. Collect metrics and application traces pertaining to the flow of information, rather than just a specific piece of the workflow. Use Amazon CloudWatch or similar to determine how your entire application performs when interacting with generative AI solutions. This can help you triage performance concerns faster and improve resolution times.
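One way to capture the whole flow rather than a single step is to time each stage of a request and publish the timings as custom metrics. The sketch below is a hypothetical helper, not an AWS or OpenLLMetry API: the class name, metric names, dimension names, and the "support-chat" workflow are all assumptions, and the sleep calls stand in for real retrieval and model calls.

```python
import time
from contextlib import contextmanager


class WorkflowTimer:
    """Record how long each stage of one request takes, in milliseconds."""

    def __init__(self, workflow_name):
        self.workflow_name = workflow_name
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raises an exception.
            self.stages[name] = (time.perf_counter() - start) * 1000.0

    def to_metric_data(self):
        """Format the timings as MetricData entries for put_metric_data."""
        data = [
            {
                "MetricName": "StageLatency",
                "Dimensions": [
                    {"Name": "Workflow", "Value": self.workflow_name},
                    {"Name": "Stage", "Value": name},
                ],
                "Value": millis,
                "Unit": "Milliseconds",
            }
            for name, millis in self.stages.items()
        ]
        # One extra entry for the summed latency of all recorded stages.
        data.append(
            {
                "MetricName": "TotalLatency",
                "Dimensions": [{"Name": "Workflow", "Value": self.workflow_name}],
                "Value": sum(self.stages.values()),
                "Unit": "Milliseconds",
            }
        )
        return data


timer = WorkflowTimer("support-chat")
with timer.stage("retrieval"):
    time.sleep(0.01)  # stand-in for a knowledge base or vector store lookup
with timer.stage("model_invocation"):
    time.sleep(0.02)  # stand-in for the call to the model endpoint
metric_data = timer.to_metric_data()
```

The resulting list can be passed to the CloudWatch put_metric_data call as its MetricData argument, under a custom namespace of your choosing. Per-stage dimensions make it possible to see whether a slow response comes from the model endpoint or from the application code around it.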

Implementation steps

1. Identify and collect CloudWatch metrics.
   - Implement a trace framework like OpenLLMetry to capture additional metrics.
2. Establish reasonable alarm thresholds, and set alerts to go off when those thresholds are breached.
3. Determine the remediation action for the alarm.
   - Infrastructure alarms may require horizontal scaling to remediate any issues.
   - Model alarms may inform a re-examination of the model selection process.
4. Automate resolution actions where possible.
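The alarm steps above can be sketched as a CloudWatch alarm definition. This example assumes an alarm on the p90 of the Amazon Bedrock InvocationLatency metric (reported in milliseconds) that notifies an Amazon SNS topic. The threshold, evaluation window, alarm name, and topic ARN are placeholders to replace with values that suit your workload.

```python
def build_latency_alarm(alarm_name, model_id, threshold_ms, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when p90 invocation latency stays above threshold_ms
    for three consecutive five-minute periods (an assumed window).
    """
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": f"p90 invocation latency above {threshold_ms} ms",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "ExtendedStatistic": "p90",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no invocations should not trigger the alarm.
        "TreatMissingData": "notBreaching",
        # The SNS topic can notify operators or trigger automated remediation.
        "AlarmActions": [sns_topic_arn],
    }


def create_latency_alarm(cloudwatch, **alarm_options):
    """Create or update the alarm.

    `cloudwatch` is any object exposing put_metric_alarm,
    for example boto3.client("cloudwatch").
    """
    cloudwatch.put_metric_alarm(**build_latency_alarm(**alarm_options))
```

Routing the alarm to an SNS topic keeps the remediation choice open: a subscriber can page an operator, or it can invoke automation such as a scaling action for infrastructure alarms.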

Resources

Related practices:

Related guides, videos, and documentation:

Monitor the health and performance of Amazon Bedrock

Related examples:

Monitoring Generative AI application using Amazon Bedrock and Amazon CloudWatch integration

Related tools:

Traceloop OpenLLMetry
