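Evaluating your trained model

An evaluation recipe is a YAML configuration file that defines how your Amazon Nova model evaluation job runs. With this recipe, you can assess the performance of a base or trained model against common benchmarks or your own custom datasets. Metrics can be stored in Amazon S3 or TensorBoard. The evaluation provides quantitative metrics that help you assess model performance across various tasks and decide whether further customization is needed.

Model evaluation is an offline process in which models are tested against fixed benchmarks with predefined answers. Models are not assessed in real time or against live user interactions. For real-time evaluation, you can evaluate the model after it is deployed to Amazon Bedrock by calling the Amazon Bedrock runtime APIs.

Available benchmark tasks

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code package, see sample-Nova-lighteval-custom-task.

The following is a list of the supported industry-standard benchmarks. You can specify the following benchmarks in the eval_task parameter: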

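Benchmark	Modality	Description	Metrics	Strategy	Subtask available
mmlu	Text	Multi-task Language Understanding: tests knowledge across 57 subjects.	accuracy	zs_cot	Yes
mmlu_pro	Text	MMLU Professional Subset: focuses on professional domains such as law, medicine, accounting, and engineering.	accuracy	zs_cot	No
bbh	Text	Advanced Reasoning Tasks: a collection of challenging problems that test higher-level cognitive and problem-solving skills.	accuracy	zs_cot	Yes
gpqa	Text	General Physics Question Answering: assesses comprehension of physics concepts and related problem-solving abilities.	accuracy	zs_cot	No
math	Text	Mathematical Problem Solving: measures mathematical reasoning across topics including algebra, calculus, and word problems.	exact_match	zs_cot	Yes
strong_reject	Text	Quality-Control Task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content.	deflection	zs	Yes
ifeval	Text	Instruction-Following Evaluation: gauges how accurately a model follows given instructions and completes tasks to specification.	accuracy	zs	No
gen_qa	Text	Custom Dataset Evaluation: lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU.	all	gen_qa	No
mmmu	Multi-modal	Massive Multidiscipline Multimodal Understanding (MMMU): a college-level benchmark comprising multiple-choice and open-ended questions from 30 disciplines.	accuracy	zs_cot	Yes
llm_judge	Text	LLM-as-a-Judge Preference Comparison: uses a Nova Judge model to determine the preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A.	all	judge	No
humaneval	Text	HumanEval: a benchmark dataset designed to evaluate the code generation capabilities of large language models.	pass@1	zs	No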
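
The benchmark table can be turned into a small lookup that catches mismatched recipe values before a job is submitted. The following is a minimal sketch, not part of the Nova tooling: the dictionary is transcribed by hand from the table, the spelling of the task and metric identifiers (for example pass@1) is assumed to match what the recipes accept, and check_evaluation is a hypothetical helper name.

```python
# Benchmark table transcribed from this page: task -> (strategy, metric, has_subtasks).
# The identifier spellings are assumptions; confirm them against your recipe files.
BENCHMARKS = {
    "mmlu": ("zs_cot", "accuracy", True),
    "mmlu_pro": ("zs_cot", "accuracy", False),
    "bbh": ("zs_cot", "accuracy", True),
    "gpqa": ("zs_cot", "accuracy", False),
    "math": ("zs_cot", "exact_match", True),
    "strong_reject": ("zs", "deflection", True),
    "ifeval": ("zs", "accuracy", False),
    "gen_qa": ("gen_qa", "all", False),
    "mmmu": ("zs_cot", "accuracy", True),
    "llm_judge": ("judge", "all", False),
    "humaneval": ("zs", "pass@1", False),
}


def check_evaluation(section):
    """Return a list of problems found in an `evaluation` recipe section (a dict)."""
    task = section.get("task")
    if task not in BENCHMARKS:
        return [f"unknown task: {task!r}"]
    strategy, metric, has_subtasks = BENCHMARKS[task]
    problems = []
    if section.get("strategy") != strategy:
        problems.append(f"{task} expects strategy {strategy!r}")
    if section.get("metric") != metric:
        problems.append(f"{task} expects metric {metric!r}")
    if "subtask" in section and not has_subtasks:
        problems.append(f"{task} has no subtasks; remove the subtask key")
    return problems
```

An empty list means the section is consistent with the table; each string in a non-empty list names one mismatch.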

The following mmlu subtasks are available:


MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]

The following bbh subtasks are available:


BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]

The following math subtasks are available:


MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]

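Understanding the recipe parameters

Run configuration

The following is a general run configuration and an explanation of the parameters involved.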


run:
  name: eval_job_name 
  model_type: amazon.nova-micro-v1:0:128k 
  model_name_or_path: nova-micro/prod 
  replicas: 1 
  data_s3_path: ""
  output_s3_path: s3://output_path

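name: (Required) A descriptive name for your evaluation job. This helps identify your job in the AWS console.
model_type: (Required) Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
- amazon.nova-micro-v1:0:128k
- amazon.nova-lite-v1:0:300k
- amazon.nova-pro-v1:0:300k
model_name_or_path: (Required) The path to the base model or the S3 path for the post-trained checkpoint. Options include:
- nova-micro/prod
- nova-lite/prod
- nova-pro/prod
- (S3 path for the post-trained checkpoint) s3://<escrow bucket>/<job id>/outputs/checkpoints
replicas: (Required) The number of compute instances to use for distributed training. You must set this value to 1 because multi-node is not supported.
data_s3_path: (Required) The S3 path to the input dataset. Leave this parameter empty unless you are using the bring your own dataset or LLM as a Judge recipe.
output_s3_path: (Required) The S3 path where output evaluation artifacts are stored. Note that the output S3 bucket must be created by the same account that creates the job.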
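
The constraints above can be checked programmatically before a job is submitted. The following is a minimal sketch that assumes the run section has already been loaded into a Python dictionary (for example with a YAML parser); check_run is a hypothetical helper, not part of any AWS SDK.

```python
# Model variants listed on this page.
MODEL_TYPES = {
    "amazon.nova-micro-v1:0:128k",
    "amazon.nova-lite-v1:0:300k",
    "amazon.nova-pro-v1:0:300k",
}
REQUIRED_KEYS = (
    "name",
    "model_type",
    "model_name_or_path",
    "replicas",
    "data_s3_path",
    "output_s3_path",
)


def check_run(run):
    """Return a list of problems found in a `run` recipe section (a dict)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in run]
    if run.get("model_type") not in MODEL_TYPES:
        problems.append("model_type is not one of the listed Nova variants")
    if run.get("replicas") != 1:
        problems.append("replicas must be 1 (multi-node is not supported)")
    if not str(run.get("output_s3_path", "")).startswith("s3://"):
        problems.append("output_s3_path must be an S3 URI")
    return problems
```

The example run section shown earlier passes this check with an empty list.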

Evaluation configuration

The following is a model evaluation configuration and an explanation of the parameters involved.


evaluation:
  task: mmlu 
  strategy: zs_cot 
  subtask: mathematics
  metric: accuracy

task: (Required) Specifies the evaluation benchmark or task to use.

Supported task list:
- mmlu
- mmlu_pro
- bbh
- gpqa
- math
- strong_reject
- gen_qa
- ifeval
- mmmu
- llm_judge
- humaneval
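strategy: (Required) Defines the evaluation approach:
- zs_cot: Zero-shot Chain-of-Thought. An approach to prompting large language models that encourages step-by-step reasoning without requiring explicit examples.
- zs: Zero-shot. An approach to solving a problem without any prior training examples.
- gen_qa: A strategy specific to bring your own dataset recipes.
- judge: A strategy specific to Amazon Nova LLM as a Judge recipes.
subtask: (Optional and removable) Specifies a specific subtask for certain evaluation tasks. Remove this key from your recipe if your task does not have any subtasks.
metric: (Required) The evaluation metric to use.
- accuracy: The percentage of correct answers.
- exact_match: (For the math benchmark) Returns the rate at which the predicted strings exactly match their references.
- deflection: (For the strong_reject benchmark) Returns the relative deflection to the base model and the difference in significance metrics.
- pass@1: (For the humaneval benchmark) Measures the percentage of cases where the model's highest-confidence prediction matches the correct answer.
- all: Returns the following metrics: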
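  - For gen_qa and the bring your own dataset benchmark, returns the following metrics:
    
    rouge1: Measures the overlap of unigrams (single words) between the generated and reference text.
    
    rouge2: Measures the overlap of bigrams (two consecutive words) between the generated and reference text.
    
    rougeL: Measures the longest common subsequence between the texts, allowing for gaps in the matching.
    
    exact_match: Binary score (0 or 1) indicating whether the generated text matches the reference text exactly, character by character.
    
    quasi_exact_match: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
    
    f1_score: Harmonic mean of precision and recall, measuring word overlap between the predicted and reference answers.
    
    f1_score_quasi: Similar to f1_score but with more lenient matching, using normalized text comparison that ignores minor differences.
    
    bleu: Measures the precision of n-gram matches between the generated and reference text, commonly used in translation evaluation.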
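  - For llm_judge and the bring your own dataset benchmark, returns the following metrics:
    
    a_scores: Number of wins for response_A across the forward and backward evaluation passes.
    
    a_scores_stderr: Standard error of the response_A scores across pairwise judgements.
    
    b_scores: Number of wins for response_B across the forward and backward evaluation passes.
    
    b_scores_stderr: Standard error of the response_B scores across pairwise judgements.
    
    ties: Number of judgements in which response_A and response_B are evaluated as equal.
    
    ties_stderr: Standard error of ties across pairwise judgements.
    
    inference_error: Number of judgements that could not be properly evaluated.
    
    inference_error_stderr: Standard error of inference errors across judgements.
    
    score: Aggregate score for response_B based on wins from both the forward and backward passes.
    
    score_stderr: Standard error of the aggregate score across pairwise judgements.
    
    winrate: The probability that response_B is preferred over response_A, calculated using Bradley-Terry probability.
    
    lower_rate: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.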
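
To make the difference between exact_match, quasi_exact_match, and f1_score concrete, the following sketch implements one plausible reading of those definitions. It is an illustration only: the normalization rules (lowercasing, stripping ASCII punctuation, collapsing white space) and the whitespace tokenization are assumptions, and the evaluation container may score differently.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse white space (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1 if the two strings are identical character by character, else 0."""
    return int(prediction == reference)


def quasi_exact_match(prediction, reference):
    """1 if the two strings are identical after normalization, else 0."""
    return int(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "Of dry." against the reference "of dry" fails exact_match but passes quasi_exact_match under these assumed rules.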

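Inference configuration

The following is an inference configuration and an explanation of the parameters involved. All parameters are optional.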


inference:
  max_new_tokens: 200 
  top_k: -1 
  top_p: 1.0 
  temperature: 0

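max_new_tokens: The maximum number of tokens to generate. This must be an integer.
top_k: The number of highest-probability tokens to consider. This must be an integer.
top_p: The cumulative probability threshold for token sampling. This must be a float between 0.0 and 1.0, inclusive.
temperature: The randomness in token selection. Larger values introduce more randomness. Use 0 to make the results deterministic. This value must be a float with a minimum value of 0.

Note that for humaneval, we recommend the following inference configuration: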
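
The following sketch shows how these three sampling parameters commonly interact in text generation. It is a generic illustration over a toy list of logits, not the Nova inference implementation; in particular, treating a non-positive top_k (such as the -1 in the example above) as "no top-k limit" is an assumption.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_k=-1, top_p=1.0, rng=random):
    """Pick a token index from `logits` using temperature, top-k, and top-p filtering."""
    if temperature == 0:
        # Greedy decoding: always take the most likely token, so results are deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits (shifted by the max for numerical stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # top_k: keep only the k most likely tokens; a non-positive value disables the limit here.
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    indices = [index for _, index in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(indices, weights=probs, k=1)[0]
```

With temperature set to 0 the function never consults the random number generator, which matches the note that 0 makes results deterministic.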


inference:
  top_k: 1
  max_new_tokens: 1600
  temperature: 0.0

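Evaluation recipe examples

Amazon Nova provides four types of evaluation recipes, which are available in the HyperPod recipes GitHub repository.

General text benchmark recipes

These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format xxx_general_text_benchmark_eval.yaml.

General multi-modal benchmark recipes

These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of multi-modal benchmarks. They are provided in the format xxx_general_multi_modal_benchmark_eval.yaml.

Bring your own dataset benchmark recipes

These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format xxx_bring_your_own_dataset_eval.yaml.

The following are the bring your own dataset requirements:

File format requirements
- You must include a single gen_qa.jsonl file containing the evaluation examples.
- Your dataset must be uploaded to an S3 location where the SageMaker AI training job can access it.
- The file must follow the schema format required for a general Q&A dataset.
Schema format requirements: each line in the JSONL file must be a JSON object with the following fields:
- query: (Required) A string containing the question or instruction that needs an answer
- response: (Required) A string containing the expected model output
- system: (Optional) A string containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query

The following are example entries for a bring your own dataset file: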


{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
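
Because each line of the JSONL file must be a single JSON object, the entries above (shown expanded for readability) each belong on one line in the actual file. The following is a minimal sketch of a local schema check you could run before uploading; validate_gen_qa_lines is a hypothetical helper, and it checks only the field rules stated above, not the context length limit.

```python
import json

REQUIRED_FIELDS = ("query", "response")
OPTIONAL_FIELDS = ("system",)


def validate_gen_qa_lines(lines):
    """Return (line_number, message) pairs for lines that break the schema."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"not valid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "line is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                errors.append((number, f"missing or non-string field: {field}"))
        for field in OPTIONAL_FIELDS:
            if field in record and not isinstance(record[field], str):
                errors.append((number, f"non-string field: {field}"))
    return errors
```

A typical call passes an open file object, for example `validate_gen_qa_lines(open("gen_qa.jsonl", encoding="utf-8"))`; an empty list means every non-blank line satisfied the schema.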

To use your custom dataset, modify your evaluation recipe with the following required fields. Do not change any of their values:


evaluation:
  task: gen_qa 
  strategy: gen_qa 
  metric: all

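The following limitations apply:

- Only one JSONL file is allowed per evaluation.
- The file must strictly follow the defined schema.
- Context length limit: for each sample in the dataset, the context length (including the system and query prompts) should be less than 3.5k.

Amazon Nova LLM as a Judge benchmark recipes

Amazon Nova LLM as a Judge is a model evaluation feature that enables customers to compare the quality of responses from one model with the responses of a baseline model on a custom dataset. It takes a dataset containing prompts, baseline responses, and challenger responses, and uses a Nova Judge model to run pairwise comparisons and report a win rate metric based on Bradley-Terry probability.

The recipes are provided in the format xxx_llm_judge_eval.yaml.

The following are the LLM as a Judge requirements:

File format requirements
- Include a single llm_judge.jsonl file containing the evaluation examples. The file name must be llm_judge.jsonl.
- Your dataset must be uploaded to an S3 location that the SageMaker AI HyperPod RIG can access.
- The file must follow the schema format required for the llm_judge.jsonl dataset.
- The input dataset should ensure that all records have a context length of less than 12k.
Schema format requirements: each line in the JSONL file must be a JSON object with the following fields:
- prompt: (Required) A string containing the prompt for the generated response.
- response_A: A string containing the baseline response.
- response_B: A string containing the alternative response to be compared with the baseline response.

The following are example entries for an LLM as a Judge file: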


{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}

To use your custom dataset, modify your evaluation recipe with the following required fields. Do not change any of their values:


evaluation:
  task: llm_judge
  strategy: judge
  metric: all

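The following limitations apply:

- Only one JSONL file is allowed per evaluation.
- The file must strictly follow the defined schema.
- Amazon Nova Judge models are the same across all model family specifications (that is, Lite, Micro, and Pro).
- Custom judge models are not supported at this time.
- Context length limit: for each sample in the dataset, the context length (including the system and query prompts) should be less than 7k.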
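
For intuition about the winrate and lower_rate metrics that the LLM as a Judge recipe reports, the following sketch computes a two-model Bradley-Terry win probability and a bootstrap lower bound from a list of verdicts. It is an assumption-laden illustration: with only two models the Bradley-Terry maximum-likelihood estimate reduces to B's share of the decided comparisons, ties are ignored here, and the service's own handling of ties, forward and backward passes, and resampling is not documented on this page.

```python
import random


def bradley_terry_winrate(a_wins, b_wins):
    """Two-model Bradley-Terry estimate of P(B preferred over A), ignoring ties."""
    decided = a_wins + b_wins
    if decided == 0:
        return 0.5  # no decided comparisons: no evidence either way
    return b_wins / decided


def bootstrap_lower_bound(outcomes, samples=1000, seed=0):
    """2.5th percentile of the win rate over bootstrap resamples.

    `outcomes` is a list of "A" / "B" verdicts, one per decided comparison.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(samples):
        # Resample the verdicts with replacement, keeping the original sample size.
        resample = [rng.choice(outcomes) for _ in outcomes]
        b_wins = resample.count("B")
        rates.append(bradley_terry_winrate(len(resample) - b_wins, b_wins))
    rates.sort()
    return rates[int(0.025 * (len(rates) - 1))]
```

A lower bound above 0.5 suggests that response_B is preferred even after accounting for sampling noise in a small dataset.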

Start an evaluation job

The following provides a suggested evaluation instance type and model type configuration:


# Install Dependencies (Helm - https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh  

# Install the HyperPod CLI
git clone --recurse-submodules https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
git checkout -b release_v2
pip install .

# Verify the installation
hyperpod --help

# Connect to a HyperPod Cluster
hyperpod connect-cluster --cluster-name cluster-name


# Submit the Job using the recipe for eval
# Namespace by default should be kubeflow
hyperpod start-job [--namespace namespace] --recipe evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval --override-parameters \
'{
    "instance_type": "p5d.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<custom-run-name>",
    "recipes.run.model_type": "<model_type>",
    "recipes.run.model_name_or_path": "<model name or fine-tuned checkpoint S3 URI>",
    "recipes.run.data_s3_path": "<full S3 path, including the file name, to the input data; only for gen_qa and llm_judge>"
}'

# List jobs
hyperpod list-jobs [--namespace namespace] [--all-namespaces]

# Getting Job details
hyperpod get-job --job-name job-name [--namespace namespace] [--verbose]

# Listing Pods
hyperpod list-pods --job-name job-name --namespace namespace

# Cancel Job
hyperpod cancel-job --job-name job-name [--namespace namespace]
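
Hand-written JSON inside a shell string is easy to get wrong (a missing colon or a trailing comma makes the override invalid). One way to avoid that is to generate the command from a dictionary. The following sketch uses only the start-job flags shown above; build_start_job_command and the run name are hypothetical, and the override values are illustrative.

```python
import json
import shlex


def build_start_job_command(recipe, overrides, namespace=None):
    """Return a `hyperpod start-job` command line with safely quoted overrides."""
    args = ["hyperpod", "start-job"]
    if namespace:
        args += ["--namespace", namespace]
    # json.dumps guarantees valid JSON; shlex.quote protects it from the shell.
    args += ["--recipe", recipe, "--override-parameters", json.dumps(overrides)]
    return " ".join(shlex.quote(arg) for arg in args)


command = build_start_job_command(
    "evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval",
    {
        "instance_type": "p5d.48xlarge",
        "recipes.run.name": "my-eval-run",  # hypothetical run name
        "recipes.run.model_type": "amazon.nova-micro-v1:0:128k",
        "recipes.run.model_name_or_path": "nova-micro/prod",
    },
    namespace="kubeflow",
)
print(command)
```

The printed string can be pasted into a terminal or passed to a shell as-is.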

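You should also be able to view the job status through the Amazon EKS cluster console.

Access and visualize your results

After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Based on the output_s3_path defined in the recipe (for example, s3://output_path/), the output structure is as follows: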


job_name/
├── eval-result/
│    └── job-name/
│         ├── results_[timestamp].json
│         └── details/
│              └── model/
│                  └── execution-date-time/
│                             └──details_task_name_#_datetime.parquet
└── eval-tensorboard-result/
    └── job-name/
        └── eval/
            └── events.out.tfevents.[timestamp]

Metrics results are stored in the specified S3 output location s3://output_path/job_name/eval-result/result-timestamp.json.

TensorBoard results are stored in the S3 path s3://output_path/job_name/eval-tensorboard-result/eval/event.out.tfevents.epoch+ip.

Inference outputs for strong_reject and gen_qa are stored in the S3 path s3://output_path/job_name/eval-result/details/model/taskname.parquet.
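
Once the output prefix has been copied to a local directory (for example with aws s3 sync), the metrics file can be located and loaded with a few lines of Python. The following is a minimal sketch: load_latest_results is a hypothetical helper, it assumes the results_[timestamp].json naming shown in the tree above, and it assumes that timestamps sort correctly as plain strings.

```python
import json
from pathlib import Path


def load_latest_results(local_root, job_name):
    """Load the newest results_*.json under <local_root>/<job_name>/eval-result/."""
    result_dir = Path(local_root) / job_name / "eval-result"
    # Sort by file name; this picks the latest file only if timestamps sort as strings.
    candidates = sorted(result_dir.rglob("results_*.json"), key=lambda p: p.name)
    if not candidates:
        raise FileNotFoundError(f"no results_*.json under {result_dir}")
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

The function returns the parsed JSON as-is; the layout of the metrics inside the file is not described on this page, so inspect the returned dictionary before relying on specific keys.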

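To visualize your evaluation metrics in TensorBoard, complete the following steps:

1. Navigate to SageMaker AI TensorBoard.
2. Select S3 folders.
3. Add your S3 folder path, for example s3://output_path/job-name/eval-tensorboard-result/eval.
4. Wait for synchronization to complete.

Time series, scalars, and text visualizations are available.

We recommend that you follow these best practices:

- Organize your output paths by model and benchmark type.
- Maintain consistent naming conventions for easy tracking.
- Save extracted results in a secure location.
- Monitor the TensorBoard sync status to confirm that data loads successfully.

You can find HyperPod job error logs in the CloudWatch log group /aws/sagemaker/Clusters/cluster-id.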
