Available benchmark tasks

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code package, see sample-Nova-lighteval-custom-task.

The following industry-standard benchmarks are supported. You can specify any of them in the eval_task parameter:

| Benchmark | Modality | Description | Metrics | Strategy | Subtask Available |
| --- | --- | --- | --- | --- | --- |
| mmlu | Text | Massive Multitask Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes |
| mmlu_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs_cot | Yes |
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs_cot | No |
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes |
| strong_reject | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes |
| IFEval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No |
| gen_qa | Text | Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen_qa | No |
| llm_judge | Text | LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No |
| humaneval | Text | HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models. | pass@1 | zs | No |
| mm_llm_judge | Multi-modal (image) | Behaves the same as the text-based llm_judge benchmark, but additionally supports image inference. | all | judge | No |
| rubric_llm_judge | Text | Rubric Judge – An enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions. | all | judge | No |
| aime_2024 | Text | AIME 2024 – American Invitational Mathematics Examination problems that test advanced mathematical reasoning and problem solving. | exact_match | zs_cot | No |
| calendar_scheduling | Text | Natural Plan – Calendar scheduling task that tests planning abilities for scheduling meetings across multiple days and people. | exact_match | fs | No |
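As a rough illustration of how the columns in the preceding table fit together, the sketch below collects each benchmark's strategy, metric, and subtask availability into a plain Python dictionary and assembles a per-run configuration. The dictionary layout and the build_eval_config helper are illustrative assumptions only, not the actual recipe schema or API of the SageMaker AI evaluation feature; the task names, strategies, and metrics come directly from the table.

```python
# Illustrative sketch only: the layout below is an assumption, not the
# official SageMaker AI evaluation recipe schema. Task names, strategies,
# and metrics are taken from the benchmark table above.
from typing import Optional

SUPPORTED_TASKS = {
    # eval_task: (strategy, metric, has_subtasks)
    "mmlu": ("zs_cot", "accuracy", True),
    "mmlu_pro": ("zs_cot", "accuracy", False),
    "bbh": ("zs_cot", "accuracy", True),
    "gpqa": ("zs_cot", "accuracy", False),
    "math": ("zs_cot", "exact_match", True),
    "strong_reject": ("zs", "deflection", True),
    "IFEval": ("zs", "accuracy", False),
    "gen_qa": ("gen_qa", "all", False),
    "llm_judge": ("judge", "all", False),
    "humaneval": ("zs", "pass@1", False),
    "mm_llm_judge": ("judge", "all", False),
    "rubric_llm_judge": ("judge", "all", False),
    "aime_2024": ("zs_cot", "exact_match", False),
    "calendar_scheduling": ("fs", "exact_match", False),
}


def build_eval_config(eval_task: str, subtask: Optional[str] = None) -> dict:
    """Return an illustrative configuration dict for one benchmark run."""
    if eval_task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported eval_task: {eval_task!r}")
    strategy, metric, has_subtasks = SUPPORTED_TASKS[eval_task]
    if subtask and not has_subtasks:
        raise ValueError(f"{eval_task!r} does not accept a subtask")
    config = {"eval_task": eval_task, "strategy": strategy, "metric": metric}
    if subtask:
        config["subtask"] = subtask
    return config


if __name__ == "__main__":
    # Example: an mmlu run restricted to a single subtask.
    print(build_eval_config("mmlu", subtask="college_mathematics"))
```

Passing a subtask only makes sense for benchmarks whose Subtask Available column reads Yes; the lists of valid subtask names follow below.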

The following mmlu subtasks are available:

MMLU_SUBTASKS = [ "abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_medicine", "college_physics", "computer_security", "conceptual_physics", "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic", "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_european_history", "high_school_geography", "high_school_government_and_politics", "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics", "high_school_physics", "high_school_psychology", "high_school_statistics", "high_school_us_history", "high_school_world_history", "human_aging", "human_sexuality", "international_law", "jurisprudence", "logical_fallacies", "machine_learning", "management", "marketing", "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios", "nutrition", "philosophy", "prehistory", "professional_accounting", "professional_law", "professional_medicine", "professional_psychology", "public_relations", "security_studies", "sociology", "us_foreign_policy", "virology", "world_religions" ]

The following bbh subtasks are available:

BBH_SUBTASKS = [ "boolean_expressions", "causal_judgement", "date_understanding", "disambiguation_qa", "dyck_languages", "formal_fallacies", "geometric_shapes", "hyperbaton", "logical_deduction_five_objects", "logical_deduction_seven_objects", "logical_deduction_three_objects", "movie_recommendation", "multistep_arithmetic_two", "navigate", "object_counting", "penguins_in_a_table", "reasoning_about_colored_objects", "ruin_names", "salient_translation_error_detection", "snarks", "sports_understanding", "temporal_sequences", "tracking_shuffled_objects_five_objects", "tracking_shuffled_objects_seven_objects", "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting" ]

The following math subtasks are available:

MATH_SUBTASKS = [ "algebra", "counting_and_probability", "geometry", "intermediate_algebra", "number_theory", "prealgebra", "precalculus", ]