Model evaluation

Amazon Bedrock supports model evaluation jobs. Use the results of a model evaluation job to compare model outputs and choose the model best suited for your downstream generative AI applications.

Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization.

In automatic model evaluation jobs, you can evaluate a model's performance using either built-in prompt datasets or your own custom prompt dataset. For model evaluation jobs that use human workers, you must provide your own prompt dataset.
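A custom prompt dataset is supplied as a JSON Lines file, one prompt record per line. The sketch below illustrates the general shape; the key names (prompt, referenceResponse, category) follow the Bedrock custom dataset format, but check the dataset documentation for the exact schema your task type requires.

```python
import json

# Illustrative custom prompt dataset for an automatic evaluation job.
# "prompt" is the model input; "referenceResponse" is the ground-truth
# answer used for accuracy-style metrics; "category" is an optional
# grouping label. Key names assumed from the Bedrock dataset format.
records = [
    {"prompt": "What is the capital of France?",
     "referenceResponse": "Paris",
     "category": "Geography"},
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "referenceResponse": "A fox jumps over a dog.",
     "category": "Summarization"},
]

# Write one JSON object per line (JSON Lines format).
with open("custom_prompts.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Sanity-check: every line parses back and carries a "prompt" key.
with open("custom_prompts.jsonl") as f:
    lines = [json.loads(line) for line in f]
assert all("prompt" in r for r in lines)
print(len(lines), "records written")
```

You would then upload the file to an Amazon S3 bucket and reference its S3 URI when creating the evaluation job.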

You can choose to create either an automatic model evaluation job or a model evaluation job that uses a human workforce.

Overview: Automatic model evaluation jobs

Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.
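An automatic job can be started programmatically through the Bedrock control-plane API. The following is a minimal sketch of a CreateEvaluationJob request, assuming the boto3 bedrock client; the role ARN, S3 URI, model identifier, and metric and dataset names are placeholders, not values you can copy verbatim.

```python
# Sketch of an automatic model evaluation job request. The structure
# follows the Bedrock CreateEvaluationJob API; all ARNs, bucket names,
# and identifiers below are illustrative placeholders.
request = {
    "jobName": "my-auto-eval-job",
    "roleArn": "arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {"name": "Gigaword"},  # a built-in dataset
                    "metricNames": ["Builtin.Accuracy"],
                }
            ]
        }
    },
    "inferenceConfig": {
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}
        ]
    },
    "outputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},
}

# To submit the job (requires AWS credentials and permissions):
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_evaluation_job(**request)
# print(response["jobArn"])
print(request["jobName"])
```

To use your own dataset instead of a built-in one, the dataset entry would point at an S3 URI rather than a built-in dataset name.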

Overview: Model evaluation jobs that use human workers

Model evaluation jobs that use human workers allow you to bring human input to the model evaluation process. These workers can be employees of your company or a group of subject-matter experts from your industry.

The following topics describe the available model evaluation tasks, and the kinds of metrics you can use. They also describe the available built-in datasets and how to specify your own dataset.