Evaluate the performance of Amazon Bedrock resources
Note
Model evaluation jobs that use a judge model and Amazon Bedrock Knowledge Bases evaluation jobs are in preview.
Use Amazon Bedrock evaluations to evaluate the performance and effectiveness of Amazon Bedrock models and knowledge bases. Amazon Bedrock can compute performance metrics, such as the semantic robustness of a model and the correctness of a knowledge base in retrieving information and generating responses. For model evaluations, you can also use a team of human workers to rate model responses and provide input for the evaluation.
Automatic evaluations, including evaluations that use Large Language Models (LLMs), produce computed scores and metrics that help you assess the effectiveness of a model or knowledge base. Human-based evaluations use a team of people who provide their ratings and preferences for selected metrics.
Overview: Automatic model evaluation jobs
Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.
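If you work with the API rather than the console, the following is a minimal sketch of starting an automatic model evaluation job with the CreateEvaluationJob operation through boto3. The job name, IAM role ARN, S3 locations, task type, metric names, and model ID are placeholders, and the exact request shape should be verified against the current Amazon Bedrock API reference.

```python
import boto3

# Sketch only: start an automatic model evaluation job with CreateEvaluationJob.
# All names, ARNs, S3 URIs, and the model ID below are placeholders; verify the
# request fields against the current Amazon Bedrock API reference.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="my-automatic-eval-job",                                # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",       # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-custom-prompts",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/input/prompts.jsonl"
                        },
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            # Some model entries also accept inference parameters (a JSON string
            # of settings such as temperature); omitted here for brevity.
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},
)
print(response["jobArn"])
```

Each line of a custom prompt dataset is a JSON object that contains at least a prompt and, for metrics that compare against a reference, an expected response; check the dataset format requirements in the documentation before you build one.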
Overview: Model evaluation jobs that use human workers
Model evaluation jobs that use human workers allow you to bring human input to the model evaluation process. The workers can be employees of your company or subject-matter experts from your industry.
Overview: Model evaluation jobs that use a judge model
Model evaluation jobs that use a judge model allow you to quickly evaluate a model's responses by using a second LLM as an evaluator. The judge model scores each response and provides an explanation.
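As a rough illustration, a judge-model job can reuse the automated configuration shown earlier with an evaluator model added to it. Because this feature is in preview, the evaluatorModelConfig and bedrockEvaluatorModels field names, the task type, and the metric names below are assumptions; confirm them against the latest CreateEvaluationJob request syntax.

```python
# Sketch only: evaluation configuration for a judge-model job. Field names,
# task type, and metric names are assumptions; verify against the current
# CreateEvaluationJob request syntax before use.
judge_evaluation_config = {
    "automated": {
        # Same shape as the datasetMetricConfigs in the automatic job sketch above.
        "datasetMetricConfigs": [
            {
                "taskType": "General",
                "dataset": {
                    "name": "my-custom-prompts",
                    "datasetLocation": {"s3Uri": "s3://amzn-s3-demo-bucket/input/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
            }
        ],
        # The evaluator (judge) model that scores and explains each response.
        "evaluatorModelConfig": {
            "bedrockEvaluatorModels": [
                {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
            ]
        },
    }
}
```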
Overview: Knowledge base evaluations that use Large Language Models (LLMs)
LLM-based evaluations compute performance metrics for the knowledge base. The metrics show whether a knowledge base can retrieve highly relevant information and generate useful, appropriate responses. You provide a dataset that contains the prompts or user queries for evaluating how a knowledge base retrieves information and generates responses for those queries. The dataset must also include ground truth, that is, the expected retrieved texts and responses for the queries, so that the evaluation can check how closely your knowledge base's results align with what's expected.
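For illustration, the following sketch writes a single record of such a dataset as JSON Lines. The field names used here (conversationTurns, prompt, referenceResponses) reflect one plausible shape and are assumptions; confirm the required schema in the knowledge base evaluation documentation before building a dataset.

```python
import json

# Illustrative only: one record of a knowledge base evaluation dataset, pairing
# a user query with its ground-truth reference response. The field names and
# nesting are assumptions -- confirm the required schema in the Amazon Bedrock
# Knowledge Bases evaluation documentation.
record = {
    "conversationTurns": [
        {
            "prompt": {"content": [{"text": "What is the product return window?"}]},
            "referenceResponses": [
                {"content": [{"text": "Customers can return products within 30 days of delivery."}]}
            ],
        }
    ]
}

# Evaluation datasets are JSON Lines: one JSON object per line, stored in S3.
with open("kb_eval_dataset.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```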
Use the following topics to learn more about creating your first model evaluation job.
To create a model evaluation job, you must have access to at least one Amazon Bedrock model. Model evaluation jobs support the following types of models (see the access-check sketch after the list):
- Foundation models
- Amazon Bedrock Marketplace models
- Customized foundation models
- Imported foundation models
- Prompt routers
- Models for which you have purchased Provisioned Throughput
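Before creating a job, you can confirm which models are available to your account with a quick AWS SDK call. The following is a minimal sketch using boto3; the Region is a placeholder.

```python
import boto3

# Quick access check (sketch): list the foundation models available in your
# account and Region before creating an evaluation job. The Region is a
# placeholder; substitute the one you plan to run the job in.
bedrock = boto3.client("bedrock", region_name="us-east-1")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["modelId"], model.get("inferenceTypesSupported", []))
```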
Topics
- Supported Regions and models for model evaluation
- Creating an automatic model evaluation job in Amazon Bedrock
- Creating a model evaluation job that uses human workers in Amazon Bedrock
- Creating a model evaluation job that uses an LLM as a judge
- Choose the best performing knowledge base using Amazon Bedrock evaluations
- Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets
- Review model evaluation job reports and metrics in Amazon Bedrock
- Data management and encryption in Amazon Bedrock evaluation jobs
- CloudTrail management events in model evaluation jobs