Evaluation Types and Job Submission - Amazon SageMaker AI


Evaluation Types and Job Submission

Benchmarking with standard datasets

Use the Benchmark evaluation type to evaluate your model's quality across standard benchmark datasets, including popular datasets such as MMLU and BBH.

Benchmark | Custom Dataset | Supported Modality | Description | Metrics | Strategy | Subtasks available
--- | --- | --- | --- | --- | --- | ---
mmlu | No | Text | Multi-task Language Understanding - Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes
mmlu_pro | No | Text | MMLU - Professional Subset - Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No
bbh | No | Text | Advanced Reasoning Tasks - A collection of challenging problems that test higher-order cognitive and problem-solving skills. | accuracy | fs_cot | Yes
gpqa | No | Text | General Physics Question Answering - Assesses understanding of physics concepts and related problem-solving abilities. | accuracy | zs_cot | No
math | No | Text | Mathematical Problem Solving - Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes
strong_reject | No | Text | Quality-Control Task - Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes
ifeval | No | Text | Instruction-Following Evaluation - Measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | No

For more information about the BYOD format, see Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Jobs.
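A BYOD dataset is a JSONL file with one JSON object per line, as in the SDK examples below. As a rough illustration only (the prompt/response field names here are assumptions, not the authoritative schema; see the format reference above), such a file can be produced like this:

import json

# Illustrative only: the field names are assumptions, not the
# authoritative BYOD schema; see the format reference linked above.
records = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name a primary color.", "response": "Red"},
]

with open("my_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")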

Available Subtasks

The following are the subtasks available for model evaluation across several domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and StrongReject. These subtasks let you assess your model's performance on specific capabilities and knowledge areas.

MMLU Subtasks

MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions"
]

BBH Subtasks

BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting"
]

MATH Subtasks

MATH_SUBTASKS = [ "algebra", "counting_and_probability", "geometry", "intermediate_algebra", "number_theory", "prealgebra", "precalculus" ]

StrongReject Subtasks

STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", "gcg_transfer_universal_attacks",
    "combination_3", "combination_2", "few_shot_json", "dev_mode_v2",
    "dev_mode_with_rant", "wikipedia_with_title", "distractors", "wikipedia",
    "style_injection_json", "style_injection_short", "refusal_suppression",
    "prefix_injection", "distractors_negated", "poems", "base64",
    "base64_raw", "base64_input_only", "base64_output_only",
    "evil_confidant", "aim", "rot_13", "disemvowel", "auto_obfuscation",
    "auto_payload_splitting", "pair", "pap_authority_endorsement",
    "pap_evidence_based_persuasion", "pap_expert_endorsement",
    "pap_logical_appeal", "pap_misrepresentation"
]
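If you plan to target specific subtasks, a quick client-side check against the lists above catches typos before you submit a job. A minimal sketch:

# Validate a requested set of subtasks against the published MMLU list.
requested = ["anatomy", "college_biology", "virology"]

unknown = [task for task in requested if task not in MMLU_SUBTASKS]
if unknown:
    raise ValueError(f"Unknown MMLU subtasks: {unknown}")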

Submit your benchmark job

SageMaker Studio
Minimal configuration for benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()
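To run a different benchmark from the table above, swap the benchmark argument (for example, Benchmark.BBH or Benchmark.GPQA, assuming get_benchmarks() exposes the benchmarks under those attribute names; confirm the exact names in the SDK documentation linked below).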

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html

Large Language Model as a Judge (LLMAJ) Evaluation

Use LLM-as-a-judge (LLMAJ) evaluation to leverage another frontier model to grade your target model's responses. You can use Amazon Bedrock models as judges by calling the create_evaluation_job API to launch an evaluation job.

For more information about supported judge models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

You can use two different metric formats to define an evaluation:

Submit a built-in metrics LLMAJ job

SageMaker Studio
Minimal configuration for LLMAJ benchmarking through SageMaker Studio
SageMaker Python SDK
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html

Submit a custom metrics LLMAJ job

Define your custom metric:

{ "customMetricDefinition": { "name": "PositiveSentiment", "instructions": ( "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. " "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n" "Consider the following:\n" "- Does the response have a positive, encouraging tone?\n" "- Is the response helpful and constructive?\n" "- Does it avoid negative language or criticism?\n\n" "Rate on this scale:\n" "- Good: Response has positive sentiment\n" "- Poor: Response lacks positive sentiment\n\n" "Here is the actual task:\n" "Prompt: {{prompt}}\n" "Response: {{prediction}}" ), "ratingScale": [ {"definition": "Good", "value": {"floatValue": 1}}, {"definition": "Poor", "value": {"floatValue": 0}} ] } }

For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
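If you keep the definition in a standalone JSON file (a hypothetical positive_sentiment_metric.json here, with the instructions flattened into a single string), you can load and sanity-check it before submitting the job:

import json

# Hypothetical local file holding the JSON form of the definition above.
with open("positive_sentiment_metric.json") as f:
    custom_metric_dict = json.load(f)

# Basic shape check before the job is submitted.
assert "customMetricDefinition" in custom_metric_dict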

SageMaker Studio
Upload the custom metric through Custom metrics > Add custom metric
SageMaker Python SDK
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics=custom_metric_dict,  # the definition from above
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()

Custom Scorers

Define your own custom scorer function to launch evaluation jobs. The system provides two built-in scorers: Prime math and Prime code. You can also bring your own scorer function, either by pasting the scorer function code directly or by bringing your own Lambda function definition referenced by its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1, ROUGE, and BLEU scores.

For more information about built-in and custom scorers and their respective requirements and contracts, see Evaluation with Preset and Custom Scorers.

Register your dataset

Bring your own dataset for custom scorers by registering it as a SageMaker Hub Content Dataset.

SageMaker Studio

In Studio, upload your dataset using the Custom datasets page.

Registered evaluation dataset in SageMaker Studio
SageMaker Python SDK

With the SageMaker Python SDK, register your dataset programmatically:

from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
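Once registered, the dataset can be referenced by its Hub content ARN (the arn:aws:sagemaker:...:hub-content/... form used in the built-in scorer example below) rather than by its raw S3 URI.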

Submit a built-in scorer job

SageMaker Studio
Choose from Code execution or Math answers for built-in custom scoring
SageMaker Python SDK
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator_builtin.evaluate()

Choose BuiltInMetric.PRIME_MATH or BuiltInMetric.PRIME_CODE for the built-in scorer.

Submit a custom scorer job

Define a custom reward function. For more information, see Custom Scorers (Bring Your Own Metric).
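The exact input and output contract for reward functions is described in Evaluation with Preset and Custom Scorers. As an illustrative sketch only (the event and return shapes below are assumptions, not the documented contract), a custom_lambda_function.py might look like:

# custom_lambda_function.py: illustrative sketch only.
# The event and return shapes below are assumptions; the documented
# contract is in "Evaluation with Preset and Custom Scorers".

def lambda_handler(event, context):
    # Assumed input shape: {"prompt": "...", "prediction": "..."}
    prediction = event.get("prediction", "")

    # Toy reward rule: favor non-empty answers of reasonable length.
    score = 1.0 if 0 < len(prediction.strip()) <= 500 else 0.0

    # Assumed output shape: a numeric score for the record.
    return {"score": score}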

Register the custom reward function

SageMaker Studio
Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function
Submit a Custom Scorer evaluation job that references the registered reward function under Custom Scorer > Custom metrics
SageMaker Python SDK
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION
from sagemaker.train.evaluate import CustomScorerEvaluator

# Register the reward function as an evaluator asset
reward_function = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)

# Reference the registered reward function in a custom scorer evaluation
evaluator = CustomScorerEvaluator(
    evaluator=reward_function,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
execution = evaluator.evaluate()
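Whichever evaluator type you use, results are written under the s3_output_path you configured. One way to spot-check the generated objects (standard boto3, reusing the placeholder bucket and prefix from above):

import boto3

s3 = boto3.client("s3")

# List evaluation artifacts written under the configured output prefix.
response = s3.list_objects_v2(Bucket="<bucket-name>", Prefix="<prefix>/")
for obj in response.get("Contents", []):
    print(obj["Key"])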