Evaluation types and job submission - Amazon SageMaker AI

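Benchmarking with standardized datasets

Use the Benchmark evaluation type to evaluate the quality of your model against standardized benchmark datasets, including popular datasets such as MMLU and BBH.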

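| Benchmark | Custom dataset supported | Modality | Description | Metrics | Strategy | Subtask available |
|---|---|---|---|---|---|---|
| mmlu | | Text | Multi-task Language Understanding: tests knowledge across 57 subjects. | accuracy | zs_cot | |
| mmlu_pro | | Text | MMLU (professional subset): focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | |
| bbh | | Text | Advanced reasoning tasks: a collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | |
| gpqa | | Text | Graduate-Level Google-Proof Q&A: assesses comprehension of science concepts, including physics, and related problem-solving abilities. | accuracy | zs_cot | |
| math | | Text | Mathematical problem solving: measures mathematical reasoning across topics such as algebra, calculus, and word problems. | exact_match | zs_cot | |
| strong_reject | | Text | Quality-control task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | |
| ifeval | | Text | Instruction-following evaluation: measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | |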
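
The accuracy and exact_match metrics named in the table can be illustrated with a small sketch. This is a generic illustration, not the scoring code used by the evaluation service: the normalization rule (lowercasing and collapsing whitespace) and the helper names are assumptions made for this example.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are identical, otherwise 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of items whose prediction matches its reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Hypothetical multiple-choice answers: three of the four match.
preds = ["B", "c", "A", "D"]
refs = ["B", "C", "D", "D"]
print(accuracy(preds, refs))  # 0.75
```

A real benchmark harness typically also extracts the final answer from a chain-of-thought response before comparing, which this sketch leaves out.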

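For more information about BYOD formats, see Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks.

Available subtasks

The following lists the subtasks available for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and StrongReject. These subtasks let you assess your model's performance on specific capabilities and knowledge areas.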

MMLU subtasks

MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions",
]

BBH subtasks

BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting",
]

MATH subtasks

MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry",
    "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
]

StrongReject subtasks

STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", "gcg_transfer_universal_attacks",
    "combination_3", "combination_2", "few_shot_json", "dev_mode_v2",
    "dev_mode_with_rant", "wikipedia_with_title", "distractors", "wikipedia",
    "style_injection_json", "style_injection_short", "refusal_suppression",
    "prefix_injection", "distractors_negated", "poems", "base64",
    "base64_raw", "base64_input_only", "base64_output_only", "evil_confidant",
    "aim", "rot_13", "disemvowel", "auto_obfuscation",
    "auto_payload_splitting", "pair", "pap_authority_endorsement",
    "pap_evidence_based_persuasion", "pap_expert_endorsement",
    "pap_logical_appeal", "pap_misrepresentation",
]
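
Because subtask names are plain strings, a stray space or capital letter is easy to miss. The sketch below checks a requested name against the published list before a job is submitted. The `resolve_subtask` helper and the `SUBTASKS` mapping are hypothetical names introduced for this example, and the BBH entry is deliberately truncated. How a subtask is actually passed to the evaluator is not covered on this page.

```python
# Local copy of two subtask lists so the sketch is self-contained.
SUBTASKS = {
    "math": [
        "algebra", "counting_and_probability", "geometry",
        "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
    ],
    # Truncated on purpose; use the full BBH list in practice.
    "bbh": ["boolean_expressions", "navigate", "word_sorting"],
}


def resolve_subtask(benchmark: str, subtask: str) -> str:
    """Return the canonical subtask name, or raise if it is not recognized."""
    known = SUBTASKS.get(benchmark)
    if known is None:
        raise KeyError(f"no subtask list recorded for benchmark {benchmark!r}")
    cleaned = subtask.strip().lower()
    if cleaned not in known:
        raise ValueError(f"{subtask!r} is not a {benchmark} subtask")
    return cleaned


print(resolve_subtask("math", " Algebra "))  # algebra
```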

Submit your benchmark job

SageMaker Studio
Minimal configuration for benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

# Look up the available benchmark identifiers
Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

# Submit the evaluation job
execution = evaluator.evaluate()

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html
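
The SDK example uses placeholders such as `<region>` and `<bucket-name>` that must be replaced before submission. The sketch below is a local pre-flight check for leftover placeholders and obviously malformed values. Both regular expressions are loose approximations assumed for this example, not the validation that SageMaker performs, and `find_config_problems` is a hypothetical helper.

```python
import re

# Loose, assumed patterns: good enough to catch unreplaced placeholders.
MODEL_PACKAGE_ARN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:model-package/[^/]+/\d+$"
)
S3_URI = re.compile(r"^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9](/.*)?$")


def find_config_problems(model_arn: str, s3_output_path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not MODEL_PACKAGE_ARN.match(model_arn):
        problems.append(f"model does not look like a model package ARN: {model_arn}")
    if not S3_URI.match(s3_output_path):
        problems.append(f"s3_output_path does not look like an S3 URI: {s3_output_path}")
    return problems


# Hypothetical values used only to exercise the check.
ok = find_config_problems(
    "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model/1",
    "s3://my-bucket/evals/",
)
bad = find_config_problems(
    "arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    "s3://<bucket-name>/<prefix>/",
)
print(ok)        # []
print(len(bad))  # 2
```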

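Large Language Model as a Judge (LLMAJ) evaluation

Use LLM-as-a-judge (LLMAJ) evaluation to have another frontier model score the responses of your target model. You can use Amazon Bedrock models as judges by launching the evaluation job with the create_evaluation_job API.

For more information about supported judge models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

You can define the evaluation using two different metric formats: built-in metrics and custom metrics.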

Submit a built-in metrics LLMAJ job

SageMaker Studio
Minimal configuration for LLMAJ benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Create an LLM-as-a-judge evaluator that scores with built-in metrics
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

# Submit the evaluation job
execution = evaluator.evaluate()

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html

Submit a custom metrics LLMAJ job

Define your custom metric:

custom_metric_dict = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
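
The instructions in the custom metric contain the placeholders `{{prompt}}` and `{{prediction}}`, which stand for the dataset prompt and the model response. The sketch below shows one way to preview the judge prompt locally by filling those placeholders. The substitution itself is performed by the evaluation service, so the `render` helper and the sample values here are hypothetical and only mimic that step.

```python
import re

# Matches double-brace placeholders such as {{prompt}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder, failing loudly on a missing value."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


# Last two lines of the instructions shown above.
template = "Prompt: {{prompt}}\nResponse: {{prediction}}"
rendered = render(
    template,
    {"prompt": "How do I reset my password?", "prediction": "Happy to help!"},
)
print(rendered)
```

Previewing the rendered text this way makes it easier to spot a misspelled placeholder before the judge model ever sees the prompt.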

SageMaker Studio
Upload custom metrics through Custom metrics > Add custom metrics
SageMaker Python SDK
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# Custom metric definition: judge instructions plus the rating scale
custom_metric_dict = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

# Pass the definition to the evaluator instead of builtin_metrics
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics=custom_metric_dict,
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
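
The ratingScale in the custom metric maps each label the judge can return to a numeric value. The sketch below shows how such labels could be turned into a single mean score. How the service aggregates judge ratings is not described on this page, so the averaging step, the helper names, and the sample labels are all assumptions for illustration.

```python
# Rating scale copied from the custom metric definition.
rating_scale = [
    {"definition": "Good", "value": {"floatValue": 1}},
    {"definition": "Poor", "value": {"floatValue": 0}},
]


def score_lookup(scale: list[dict]) -> dict[str, float]:
    """Build a label -> numeric value mapping from a ratingScale list."""
    return {entry["definition"]: float(entry["value"]["floatValue"]) for entry in scale}


def mean_score(labels: list[str], scale: list[dict]) -> float:
    """Average the numeric values of the judge labels (assumed aggregation)."""
    if not labels:
        raise ValueError("at least one judge label is required")
    lookup = score_lookup(scale)
    return sum(lookup[label] for label in labels) / len(labels)


# Hypothetical judge verdicts for four responses.
judge_labels = ["Good", "Good", "Poor", "Good"]
print(mean_score(judge_labels, rating_scale))  # 0.75
```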

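Custom scorers

Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers: Prime Math and Prime Code. You can also bring your own scorer function, either by copying your scorer function code directly or by referencing your own Lambda function definition through its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements and contracts, see Evaluate with preset and custom scorers.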
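
As an illustration of the standard metrics mentioned above, the sketch below computes a token-overlap F1 score between a prediction and a reference. It follows the common question-answering definition of F1 with simple whitespace tokenization, which is an assumption for this example. The evaluation service may tokenize and normalize differently.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty side scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat sat"))  # 1.0
print(round(token_f1("the cat", "the cat sat"), 3))  # 0.8
```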

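Register your dataset

Bring your own dataset for custom scorers by registering it as a SageMaker Hub content dataset.

SageMaker Studio

In Studio, upload your dataset using the dedicated Datasets page.

Registered evaluation dataset in SageMaker Studio
SageMaker Python SDK

In the SageMaker Python SDK, register your dataset by calling DataSet.create with a name and the S3 location of the dataset file.

from sagemaker.ai_registry.dataset import DataSet

# Register the JSONL file in S3 as a dataset
dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)

# Refresh the local object with the latest registration state
dataset.refresh()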
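
Since the dataset is a JSONL file, a quick local check that every line parses as a JSON object can catch formatting problems before the file is uploaded and registered. The sketch below uses only the standard library. The `check_jsonl` helper and the `prompt` and `reference` field names in the demo are hypothetical, so consult the BYOD format documentation for the fields your task actually requires.

```python
import json
import os
import tempfile


def check_jsonl(path: str) -> int:
    """Return the number of records, raising ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            count += 1
    return count


# Demo with a temporary file holding two hypothetical records.
records = [
    {"prompt": "What is 2 + 2?", "reference": "4"},
    {"prompt": "Name a prime number.", "reference": "7"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

record_count = check_jsonl(sample_path)
os.remove(sample_path)
print(record_count)  # 2
```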

Submit a built-in scorer job

SageMaker Studio
Select Code executions or Math answers for built-in custom scoring
SageMaker Python SDK
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

# Look up the built-in scorer identifiers
BuiltInMetric = get_builtin_metrics()

# Create an evaluator that uses the built-in Prime Math scorer
evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

# Submit the evaluation job
execution = evaluator_builtin.evaluate()

Select BuiltInMetric.PRIME_MATH or BuiltInMetric.PRIME_CODE for built-in scoring.

Submit a custom scorer job

Define a custom reward function. For more information, see Custom scorers (bring your own metrics).
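
To make the idea of a custom reward function concrete, the sketch below shows a minimal Lambda-style handler that scores each sample by strict string equality. The event and response shapes used here (`samples`, `id`, `prediction`, `reference`, `scores`) are hypothetical and chosen only for this example. The actual input and output contract is defined in the custom scorer documentation referenced above and must be followed instead.

```python
def score_sample(prediction: str, reference: str) -> float:
    """Reward 1.0 for an exact match after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def lambda_handler(event, context):
    """Score every sample in the (hypothetical) event payload."""
    results = []
    for sample in event.get("samples", []):
        results.append(
            {
                "id": sample["id"],
                "score": score_sample(sample["prediction"], sample["reference"]),
            }
        )
    return {"scores": results}


# Local invocation with a made-up payload; Lambda supplies a real context object.
demo_event = {
    "samples": [
        {"id": "a", "prediction": "42 ", "reference": "42"},
        {"id": "b", "prediction": "41", "reference": "42"},
    ]
}
demo_result = lambda_handler(demo_event, None)
print(demo_result)
```

Keeping the per-sample logic in its own function, as with `score_sample` here, makes the reward rule easy to unit test without invoking Lambda.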

Register the custom reward function

SageMaker Studio
Navigate to SageMaker Studio > Assets > Evaluator > Create evaluator > Create reward function
Submit the Custom Scorer evaluation job, referencing the registered preset reward function, under Custom Scorer > Custom metrics
SageMaker Python SDK
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION
from sagemaker.train.evaluate import CustomScorerEvaluator

# Register the local reward function file as an evaluator
reward_evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)

# Create a custom scorer evaluator that uses the registered reward function
evaluator = CustomScorerEvaluator(
    evaluator=reward_evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

# Submit the evaluation job
execution = evaluator.evaluate()