Evaluation types and job submission - Amazon SageMaker AI

This page is a machine translation of the English version. If there is any ambiguity or inconsistency, the English version prevails.

Evaluation types and job submission

Benchmarking with standardized datasets

Use the benchmark evaluation type to evaluate model quality across standardized benchmark datasets, including popular datasets such as MMLU and BBH.

Benchmark | Modality | Description | Metric | Strategy | Subtasks available
--- | --- | --- | --- | --- | ---
mmlu | Text | Multi-task language understanding – tests knowledge across 57 subjects. | correctness | zs_cot | Yes
mmlu_pro | Text | MMLU – professional subset – focuses on professional domains such as law, medicine, accounting, and engineering. | correctness | zs_cot | No
bbh | Text | Advanced reasoning tasks – a collection of challenging problems that test higher-order cognitive and problem-solving skills. | correctness | fs_cot | Yes
gpqa | Text | General physics question answering – evaluates understanding of physics concepts and the ability to solve related problems. | correctness | zs_cot | No
math | Text | Mathematical problem solving – measures mathematical reasoning across topics such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes
strong_reject | Text | Quality-control task – tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes
ifeval | Text | Instruction-following evaluation – measures how accurately the model follows given instructions and completes tasks to specification. | correctness | zs | No

For more information about the BYOD format, see Dataset formats supported for Bring-Your-Own-Dataset (BYOD) jobs.

Available subtasks

The following lists the subtasks available for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), StrongReject, and MATH. These subtasks let you evaluate model performance on specific capabilities and knowledge areas.

MMLU subtasks

MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions"
]

BBH subtasks

BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting"
]

MATH subtasks

MATH_SUBTASKS = [ "algebra", "counting_and_probability", "geometry", "intermediate_algebra", "number_theory", "prealgebra", "precalculus" ]

StrongReject subtasks

STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", "gcg_transfer_universal_attacks",
    "combination_3", "combination_2", "few_shot_json", "dev_mode_v2",
    "dev_mode_with_rant", "wikipedia_with_title", "distractors", "wikipedia",
    "style_injection_json", "style_injection_short", "refusal_suppression",
    "prefix_injection", "distractors_negated", "poems", "base64",
    "base64_raw", "base64_input_only", "base64_output_only",
    "evil_confidant", "aim", "rot_13", "disemvowel", "auto_obfuscation",
    "auto_payload_splitting", "pair", "pap_authority_endorsement",
    "pap_evidence_based_persuasion", "pap_expert_endorsement",
    "pap_logical_appeal", "pap_misrepresentation"
]
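The subtask lists above can be used to validate a selection locally before submitting a job. The helper below is an illustrative sketch, not part of the SageMaker SDK; it reuses the MATH_SUBTASKS list from above.

```python
# Illustrative helper (not an SDK function): check requested subtask names
# against a benchmark's published subtask list before submitting a job.
MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry",
    "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
]

def validate_subtasks(requested, available):
    """Return the requested subtasks unchanged, raising on any unknown name."""
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f"Unknown subtasks: {unknown}")
    return list(requested)
```

Validating up front turns a typo into an immediate local error instead of a failed evaluation job.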

Submitting a benchmark job

SageMaker Studio
Minimal configuration for benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html
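Each evaluator above takes a model package ARN in the form shown. As a local convenience (not an SDK function), that string can be assembled from its components:

```python
# Illustrative helper (not part of the SDK): assemble the model-package ARN
# that the evaluators above expect from its individual components.
def model_package_arn(region, account_id, package_name, version):
    return (f"arn:aws:sagemaker:{region}:{account_id}:"
            f"model-package/{package_name}/{version}")
```

For example, `model_package_arn("us-west-2", "123456789012", "my-model", "1")` yields `arn:aws:sagemaker:us-west-2:123456789012:model-package/my-model/1`.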

Large language model as a judge (LLMAJ) evaluation

Use LLM-as-a-judge (LLMAJ) evaluation to grade a target model's responses with another frontier model. You can start an evaluation job that uses an Amazon Bedrock model as the judge by calling the create_evaluation_job API.

For more information about supported judge models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

You can define evaluations using two different metric formats:

Submitting a built-in metric LLMAJ job

SageMaker Studio
Minimal configuration for LLMAJ benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html

Submitting a custom metric LLMAJ job

Define your custom metric(s):

{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
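Before submitting, it can help to sanity-check the metric payload locally. The function below is an illustrative check only, not an official schema validator; it assumes the {{prompt}} and {{prediction}} placeholders shown above are required, which is an assumption based on the example rather than a documented rule.

```python
# Illustrative sanity check for a customMetricDefinition payload; the linked
# Bedrock documentation defines the authoritative schema.
def check_custom_metric(payload):
    defn = payload["customMetricDefinition"]
    assert isinstance(defn["name"], str) and defn["name"]
    # Assumption: judge instructions should reference both template placeholders.
    assert "{{prompt}}" in defn["instructions"]
    assert "{{prediction}}" in defn["instructions"]
    for entry in defn["ratingScale"]:
        assert "definition" in entry and "floatValue" in entry["value"]
    return True

sample = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "Rate it.\nPrompt: {{prompt}}\nResponse: {{prediction}}",
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}},
        ],
    }
}
```

A missing top-level key (for example, ratingScale) surfaces as a KeyError before the job is ever submitted.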

SageMaker Studio
Upload your custom metric through Custom metrics > Add custom metric
SageMaker Python SDK
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()

Custom scorers

Define your own custom scorer function to start an evaluation job. Two built-in scorers are provided: Prime Math and Prime Code. You can also use your own scorer function: either paste your scorer function code directly, or bring your own Lambda function definition by referencing its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements/contracts, see Evaluate using default and custom scorers.
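For intuition about one of the standard metrics mentioned above, here is a minimal token-level F1 sketch. It is illustrative only; the evaluation service computes its own metrics and may tokenize differently.

```python
# Token-level F1: harmonic mean of precision and recall over token overlap
# between a prediction and a reference. Illustrative only.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat sat down")` gives precision 1.0 and recall 0.75, hence F1 of 6/7.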

Register your dataset

Bring your own dataset by registering it as SageMaker Hub content.

SageMaker Studio

In Studio, upload your dataset on the dedicated Datasets page.

A registered evaluation dataset in SageMaker Studio
SageMaker Python SDK

In the SageMaker Python SDK, create and register the dataset programmatically.

from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
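The JSONL file referenced above might be produced locally along these lines before being uploaded to S3. The record schema here (a single "prompt" field, echoing the {{prompt}} placeholder used by LLMAJ metrics) is an assumption for illustration only; see the BYOD dataset format documentation for the actual contract.

```python
# Illustrative only: write a small JSONL dataset locally before uploading it
# to S3 and registering it. The "prompt" field is an assumed schema, not a
# documented contract.
import json

records = [
    {"prompt": "What is 2 + 2?"},
    {"prompt": "Name a prime number greater than 10."},
]

with open("dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Each line of the file is one standalone JSON object, which is what the .jsonl extension implies.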

Submitting a built-in scorer job

SageMaker Studio
Select a built-in scorer from Code Execution or Math Answer
SageMaker Python SDK
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator_builtin.evaluate()

Select the built-in metric from BuiltInMetric.PRIME_MATH or BuiltInMetric.PRIME_CODE.

Submitting a custom scorer job

Define a custom reward function. For more information, see Custom scorers (bring your own metric).
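As a rough sketch of what such a reward function might look like: the handler below follows the usual Lambda convention, but the event fields ("prediction", "ground_truth") and the return shape are assumptions for illustration; the actual contract is described in the documentation linked above.

```python
# Hypothetical reward-function sketch. The event fields and return shape are
# assumptions; consult "Custom scorers (bring your own metric)" for the
# actual handler contract.
def lambda_handler(event, context):
    prediction = event.get("prediction", "").strip().lower()
    ground_truth = event.get("ground_truth", "").strip().lower()
    # Exact-match reward: 1.0 when the normalized strings agree, else 0.0.
    return {"score": 1.0 if prediction == ground_truth else 0.0}
```

A reward function like this keeps the scoring logic deterministic, which makes evaluation results reproducible across runs.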

Register the custom reward function

SageMaker Studio
Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function
Submit a custom scorer evaluation job that references the registered reward function under Custom scorers > Custom metrics
SageMaker Python SDK
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)
from sagemaker.train.evaluate import CustomScorerEvaluator

evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()