Evaluation types and task submission
Benchmarking with standardized datasets
Use the benchmark evaluation type to assess model quality across standardized benchmark datasets, including popular datasets such as MMLU and BBH.
| Benchmark | Custom dataset supported | Modality | Description | Metric | Strategy | Subtasks available |
|---|---|---|---|---|---|---|
| mmlu | No | Text | Multitask language understanding – tests knowledge across 57 subjects. | accuracy | zs_cot | Yes |
| mmlu_pro | No | Text | MMLU – professional subset – focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | No | Text | Advanced reasoning tasks – a collection of challenging problems that test higher-order cognitive and problem-solving skills. | accuracy | fs_cot | Yes |
| gpqa | No | Text | General physics question answering – assesses understanding of physics concepts and the ability to solve related problems. | accuracy | zs_cot | No |
| math | No | Text | Mathematical problem solving – measures mathematical reasoning across topics such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes |
| strong_reject | No | Text | Quality-control task – tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes |
| ifeval | No | Text | Instruction-following evaluation – measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | No |
For details about BYOD formats, see Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks.
Available subtasks
The following lists the subtasks available for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), StrongReject, and MATH. These subtasks let you evaluate model performance on specific capabilities and knowledge areas.
MMLU subtasks
```python
MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions",
]
```
BBH subtasks
```python
BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting",
]
```
MATH subtasks
```python
MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry", "intermediate_algebra",
    "number_theory", "prealgebra", "precalculus",
]
```
StrongReject subtasks
```python
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", "gcg_transfer_universal_attacks",
    "combination_3", "combination_2", "few_shot_json", "dev_mode_v2",
    "dev_mode_with_rant", "wikipedia_with_title", "distractors", "wikipedia",
    "style_injection_json", "style_injection_short", "refusal_suppression",
    "prefix_injection", "distractors_negated", "poems", "base64",
    "base64_raw", "base64_input_only", "base64_output_only",
    "evil_confidant", "aim", "rot_13", "disemvowel", "auto_obfuscation",
    "auto_payload_splitting", "pair", "pap_authority_endorsement",
    "pap_evidence_based_persuasion", "pap_expert_endorsement",
    "pap_logical_appeal", "pap_misrepresentation",
]
```
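As an illustration, subtask names from the lists above can be used to scope a benchmark run to a single knowledge area. The following is a minimal sketch; the configuration field names (`task`, `subtask`, `metric`, `strategy`) are assumptions for illustration, not the documented job schema.

```python
# Hypothetical sketch: scoping a benchmark run to one of the subtasks listed
# above. The config field names ("task", "subtask", "metric", "strategy")
# are illustrative assumptions, not the documented job schema.

MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry", "intermediate_algebra",
    "number_theory", "prealgebra", "precalculus",
]

def build_benchmark_config(benchmark, subtask=None):
    """Build a minimal benchmark task configuration fragment."""
    if subtask is not None and subtask not in MATH_SUBTASKS:
        raise ValueError(f"unknown MATH subtask: {subtask}")
    config = {"task": benchmark, "metric": "exact_match", "strategy": "zs_cot"}
    if subtask is not None:
        config["subtask"] = subtask
    return config

print(build_benchmark_config("math", "algebra"))
```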
Submitting a benchmark task
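As a rough illustration of what a benchmark submission involves, the sketch below assembles a request payload. Every field name in it is an illustrative assumption; consult the create_evaluation_job API reference for the authoritative request schema.

```python
# Hypothetical sketch of assembling a benchmark evaluation request. All
# field names below are illustrative assumptions, not the real schema.

def build_benchmark_request(job_name, model_id, benchmark, output_s3_uri):
    return {
        "jobName": job_name,
        "modelId": model_id,
        "evaluationConfig": {
            "taskType": "benchmark",  # assumed field
            "benchmark": benchmark,   # e.g. "mmlu", "bbh", "math"
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }

request = build_benchmark_request(
    "my-mmlu-eval", "my-model-id", "mmlu", "s3://amzn-s3-demo-bucket/output/"
)
# An actual submission would pass this request to the create_evaluation_job
# API through an SDK client with valid AWS credentials, e.g.:
# client.create_evaluation_job(**request)
```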
LLM-as-a-judge (LLMAJ) evaluation
Use LLM-as-a-judge (LLMAJ) evaluation to grade your target model's responses with another frontier model. You can start an evaluation job that uses an Amazon Bedrock model as the judge by calling the create_evaluation_job API.
For details about supported judge models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
You can define the evaluation using two different metric formats:

- Built-in metrics: use Amazon Bedrock's built-in metrics to analyze the quality of model inference responses. For details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html
- Custom metrics: define your own metrics in the Bedrock evaluation custom-metric format to analyze the quality of model inference responses against your own instructions. For details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
Submitting a built-in metric LLMAJ task
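A minimal sketch of a built-in metric LLMAJ request follows. The field names and the metric identifier are illustrative assumptions; see the Amazon Bedrock create_evaluation_job documentation for the exact schema and the list of built-in metric names.

```python
# Hypothetical sketch of an LLM-as-a-judge (LLMAJ) request using a built-in
# metric. Field names and the metric identifier are illustrative assumptions.

def build_llmaj_builtin_request(job_name, target_model_id, judge_model_id,
                                dataset_s3_uri, metric_names):
    return {
        "jobName": job_name,
        "evaluationConfig": {
            "targetModelId": target_model_id,    # model being graded
            "evaluatorModelId": judge_model_id,  # frontier model acting as judge
            "metricNames": metric_names,         # assumed built-in metric names
            "datasetS3Uri": dataset_s3_uri,
        },
    }

request = build_llmaj_builtin_request(
    "my-llmaj-eval", "target-model-id", "judge-model-id",
    "s3://amzn-s3-demo-bucket/prompts.jsonl", ["Builtin.Correctness"],
)
```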
Submitting a custom metric LLMAJ task
Define your custom metric(s):
```python
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```
For details, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
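The custom metric definition can then be attached to an evaluation request. In the sketch below, the customMetricDefinition structure follows the format shown above, while the surrounding field names are illustrative assumptions rather than the documented schema.

```python
# Hypothetical sketch of attaching a custom metric definition to an
# evaluation request. Surrounding field names are illustrative assumptions.

custom_metric = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "...",  # full judge instructions as defined above
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}},
        ],
    }
}

def build_llmaj_custom_request(job_name, judge_model_id, custom_metrics):
    return {
        "jobName": job_name,
        "evaluationConfig": {
            "evaluatorModelId": judge_model_id,  # assumed field name
            "customMetrics": custom_metrics,     # assumed field name
        },
    }

request = build_llmaj_custom_request(
    "my-custom-metric-eval", "judge-model-id", [custom_metric]
)
```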
Custom scorers
Define your own custom scorer function to start an evaluation job. The system provides two built-in scorers: Prime Math and Prime Code. You can also use your own scorer function, either by pasting your scorer function code directly or by bringing your own Lambda function definition via its associated ARN. By default, both scorer types produce evaluation results containing standard metrics such as F1 score, ROUGE, and BLEU.
For details about built-in and custom scorers and their respective requirements and contracts, see Evaluating with default and custom scorers.
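To make the idea concrete, here is a minimal scorer sketch. The `(prediction, reference) -> float` contract is an illustrative assumption; see Evaluating with default and custom scorers for the actual function contract.

```python
# Hypothetical custom scorer sketch. The (prediction, reference) -> float
# contract is an illustrative assumption, not the documented interface.

def exact_match_scorer(prediction, reference):
    """Return 1.0 when the normalized prediction matches the reference."""
    def normalize(text):
        # Case-fold and collapse whitespace before comparing.
        return " ".join(text.strip().lower().split())
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0
```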
Registering your dataset
To bring your own dataset, register it as a SageMaker Hub content dataset.
Submitting a built-in scorer task
Submitting a custom scorer task
Define a custom reward function. For details, see Custom scorers (bring your own metrics).
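A reward function might look like the following sketch, which grades a math response by its final numeric answer. The `(prompt, response, ground_truth) -> float` signature is an illustrative assumption, not the documented contract.

```python
import re

# Hypothetical custom reward function sketch; the signature
# (prompt, response, ground_truth) -> float is an illustrative assumption.

def math_reward(prompt, response, ground_truth):
    """Reward 1.0 when the last number in the response equals the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0
```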
Register your custom reward function