Model evaluation tasks - Amazon Bedrock

Model evaluation tasks

In a model evaluation job, an evaluation task (taskType) is a task you want the model to perform based on information in your prompts. You can choose one task type per model evaluation job.

See the following topics to learn more about each task type. Each topic also includes a list of the available built-in datasets and their corresponding metrics, which can be used only in automatic model evaluation jobs.

The following table summarizes the available task types, built-in datasets, and computed metrics for each task type.

Available built-in datasets for automatic model evaluation jobs in Amazon Bedrock
| Task type | Metric | Built-in datasets | Computed metric |
|---|---|---|---|
| General text generation | Accuracy | TREX | Real world knowledge (RWK) score |
| General text generation | Robustness | BOLD, TREX, WikiText2 | Word error rate |
| General text generation | Toxicity | RealToxicityPrompts, BOLD | Toxicity |
| Text summarization | Accuracy | Gigaword | BERTScore |
| Text summarization | Toxicity | Gigaword | Toxicity |
| Text summarization | Robustness | Gigaword | BERTScore and deltaBERTScore |
| Question and answer | Accuracy | BoolQ, NaturalQuestions, TriviaQA | NLP-F1 |
| Question and answer | Robustness | BoolQ, NaturalQuestions, TriviaQA | F1 and deltaF1 |
| Question and answer | Toxicity | BoolQ, NaturalQuestions, TriviaQA | Toxicity |
| Text classification | Accuracy | Women's Ecommerce Clothing Reviews | Accuracy (Binary accuracy from classification_accuracy_score) |
| Text classification | Robustness | Women's Ecommerce Clothing Reviews | classification_accuracy_score and delta_classification_accuracy_score |
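
When planning an automatic model evaluation job, it can help to check a task type, metric, and dataset combination against the table before you submit anything. The following Python sketch transcribes the table into a lookup structure. It is an illustration only: the names BUILT_IN, datasets_for, computed_metric, and is_supported are hypothetical helpers, not part of any Amazon Bedrock API or SDK, and the strings are the display labels from the table rather than API identifiers.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup transcribed from the table above.
# Keys are (task type, metric); values are (built-in datasets, computed metric).
# These are display labels from the table, not API identifiers.
QA_DATASETS = ["BoolQ", "NaturalQuestions", "TriviaQA"]
REVIEWS = ["Women's Ecommerce Clothing Reviews"]

BUILT_IN: Dict[Tuple[str, str], Tuple[List[str], str]] = {
    ("General text generation", "Accuracy"): (["TREX"], "Real world knowledge (RWK) score"),
    ("General text generation", "Robustness"): (["BOLD", "TREX", "WikiText2"], "Word error rate"),
    ("General text generation", "Toxicity"): (["RealToxicityPrompts", "BOLD"], "Toxicity"),
    ("Text summarization", "Accuracy"): (["Gigaword"], "BERTScore"),
    ("Text summarization", "Toxicity"): (["Gigaword"], "Toxicity"),
    ("Text summarization", "Robustness"): (["Gigaword"], "BERTScore and deltaBERTScore"),
    ("Question and answer", "Accuracy"): (QA_DATASETS, "NLP-F1"),
    ("Question and answer", "Robustness"): (QA_DATASETS, "F1 and deltaF1"),
    ("Question and answer", "Toxicity"): (QA_DATASETS, "Toxicity"),
    ("Text classification", "Accuracy"): (
        REVIEWS,
        "Accuracy (Binary accuracy from classification_accuracy_score)",
    ),
    ("Text classification", "Robustness"): (
        REVIEWS,
        "classification_accuracy_score and delta_classification_accuracy_score",
    ),
}


def datasets_for(task_type: str, metric: str) -> List[str]:
    """Return the built-in datasets listed for a task type and metric."""
    if (task_type, metric) not in BUILT_IN:
        raise ValueError(f"No built-in datasets listed for {task_type!r} / {metric!r}")
    return list(BUILT_IN[(task_type, metric)][0])


def computed_metric(task_type: str, metric: str) -> str:
    """Return the computed metric listed for a task type and metric."""
    if (task_type, metric) not in BUILT_IN:
        raise ValueError(f"No computed metric listed for {task_type!r} / {metric!r}")
    return BUILT_IN[(task_type, metric)][1]


def is_supported(task_type: str, metric: str, dataset: str) -> bool:
    """Check whether the table lists this dataset for the task type and metric."""
    entry = BUILT_IN.get((task_type, metric))
    return entry is not None and dataset in entry[0]


if __name__ == "__main__":
    print(datasets_for("Question and answer", "Accuracy"))
    print(computed_metric("Text summarization", "Robustness"))
    print(is_supported("Text classification", "Toxicity", REVIEWS[0]))
```

For example, is_supported("Text classification", "Toxicity", "Women's Ecommerce Clothing Reviews") evaluates to False in this sketch, because the table lists no Toxicity row for text classification.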