
Evaluate the performance of Amazon Bedrock resources


Use Amazon Bedrock evaluations to assess the performance and effectiveness of Amazon Bedrock models and knowledge bases, as well as models and Retrieval Augmented Generation (RAG) sources outside of Amazon Bedrock. Amazon Bedrock can compute performance metrics such as the semantic robustness of a model and the correctness of a knowledge base in retrieving information and generating responses. For model evaluations, you can also use a team of human workers to rate responses and provide input for the evaluation.

Automatic evaluations, including evaluations that use Large Language Models (LLMs), produce computed scores and metrics that help you assess the effectiveness of a model or knowledge base. Human-based evaluations use a team of people who provide their ratings and preferences in relation to specific metrics.

Overview: Automatic model evaluation jobs

Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.
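To make the custom-dataset option concrete, the following sketch builds a small prompt dataset in JSON Lines format (one JSON object per line), which is how evaluation datasets are typically supplied. The field names shown ("prompt", "referenceResponse", "category") are illustrative; check the current Amazon Bedrock documentation for the exact schema your task type requires.

```python
import json

# Illustrative records for a custom prompt dataset. Each record pairs a
# prompt with a reference response the evaluation can score against.
records = [
    {
        "prompt": "Summarize the main benefit of serverless computing.",
        "referenceResponse": "Serverless computing removes the need to manage servers.",
        "category": "summarization",
    },
    {
        "prompt": "What does RAG stand for?",
        "referenceResponse": "Retrieval Augmented Generation.",
        "category": "open-ended-qa",
    },
]

def to_jsonl(rows):
    """Serialize one JSON object per line, the usual dataset file layout."""
    return "\n".join(json.dumps(r) for r in rows)

jsonl = to_jsonl(records)
print(jsonl.splitlines()[0])
```

You would upload the resulting file to Amazon S3 and reference its location when you create the evaluation job.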

Overview: Model evaluation jobs that use human workers

Model evaluation jobs that use human workers allow you to bring human input to the model evaluation process. The workers can be employees of your company or a group of subject-matter experts from your industry.

Overview: Model evaluation jobs that use a judge model

Model evaluation jobs that use a judge model allow you to quickly evaluate a model's responses by using a second LLM as a judge. The judge model scores each response and provides an explanation for its score.
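The judge-model pattern can be sketched without any Bedrock-specific machinery: a second model receives the original prompt and the candidate response, and returns a score plus an explanation. This is a hand-rolled illustration of the general technique, not the Bedrock implementation, and `call_judge_model` is a hypothetical stand-in for whatever inference call you would actually make.

```python
# Prompt template asking the judge model for a score and an explanation.
JUDGE_TEMPLATE = (
    "Rate the following response to the prompt on a scale of 1-5 for "
    "helpfulness, then explain your rating.\n"
    "Prompt: {prompt}\nResponse: {response}\n"
    "Answer as: score=<1-5>; explanation=<text>"
)

def call_judge_model(judge_prompt: str) -> str:
    # Stubbed for illustration; replace with a real model invocation.
    return "score=4; explanation=Accurate and concise, but omits an example."

def judge(prompt: str, response: str) -> tuple[int, str]:
    """Ask the judge model to rate a response; parse score and explanation."""
    raw = call_judge_model(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    score_part, explanation_part = raw.split(";", 1)
    score = int(score_part.split("=")[1])
    explanation = explanation_part.split("=", 1)[1].strip()
    return score, explanation

score, why = judge("What is RAG?", "Retrieval Augmented Generation.")
print(score, why)
```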

Overview of RAG evaluations that use Large Language Models (LLMs)

LLM-based evaluations compute performance metrics for the knowledge base. The metrics reveal whether a RAG source or Amazon Bedrock knowledge base is able to retrieve highly relevant information and generate useful, appropriate responses. You provide a dataset containing the prompts or user queries used to evaluate how a knowledge base retrieves information and generates responses for those queries. The dataset must also include ground truth (the expected retrieved texts and responses for the queries) so that the evaluation can check whether your knowledge base's output aligns with what's expected.
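As an illustration, a single RAG evaluation record might pair a user query with a ground-truth reference response, one JSON object per line of the dataset file. The nested structure below is a simplified sketch of the conversation-turn shape; consult the Amazon Bedrock documentation for the exact dataset schema your job type expects.

```python
import json

# Illustrative only: one dataset record pairing a user query with the
# expected ("ground truth") response the evaluation compares against.
record = {
    "conversationTurns": [
        {
            "prompt": {"content": [{"text": "What is our refund window?"}]},
            "referenceResponses": [
                {"content": [{"text": "Refunds are accepted within 30 days of purchase."}]}
            ],
        }
    ]
}

# Each record occupies one line of the JSON Lines dataset file.
line = json.dumps(record)
print(line[:60])
```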

Use the following topic to learn more about creating your first model evaluation job.

Model evaluation jobs support using the following types of Amazon Bedrock models:

  • Foundation models

  • Amazon Bedrock Marketplace models

  • Customized foundation models

  • Imported foundation models

  • Prompt routers

  • Models for which you have purchased Provisioned Throughput
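Putting the pieces together, a first automatic model evaluation job can be started with the AWS SDK for Python (boto3) via the bedrock CreateEvaluationJob API. The sketch below builds the request payload; the role ARN, S3 bucket, dataset, metric names, and model identifier are placeholders or examples you must adapt, and the exact nested configuration shape should be verified against the current API reference.

```python
# Request payload for bedrock CreateEvaluationJob (sketch). All ARNs and
# the S3 URI below are placeholders; the dataset, metric, and model names
# are examples to be replaced with your own choices.
request = {
    "jobName": "my-first-model-eval",
    "roleArn": "arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {"name": "Builtin.Gigaword"},  # a built-in dataset
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    "inferenceConfig": {
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    "outputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},  # placeholder
}

# To actually submit the job (requires AWS credentials and permissions):
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_evaluation_job(**request)
# print(response["jobArn"])
```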

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.