Use custom prompt dataset for model evaluation in Amazon Bedrock
You can use a custom prompt dataset in model evaluation jobs.
Custom prompt datasets must be stored in Amazon S3, use the JSON Lines format, and use the .jsonl file extension. When you upload the dataset to Amazon S3, make sure that you update the Cross-Origin Resource Sharing (CORS) configuration on the S3 bucket. To learn more about the required CORS permissions, see Required Cross Origin Resource Sharing (CORS) permission on S3 buckets.
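As a sketch of what that setup step can look like, the following builds a permissive CORS rule set and shows (commented out) how it could be applied with boto3. The exact rule values required by Amazon Bedrock are documented on the linked page; the values and bucket name below are illustrative assumptions, not the definitive configuration.

```python
import json

# Illustrative CORS rule set for the S3 bucket that holds your prompt dataset.
# Confirm the exact required values against the linked CORS documentation.
cors_configuration = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
        }
    ]
}

# Applying it requires boto3 and credentials with s3:PutBucketCors permission:
# import boto3
# boto3.client("s3").put_bucket_cors(
#     Bucket="amzn-s3-demo-bucket",  # hypothetical bucket name
#     CORSConfiguration=cors_configuration,
# )

print(json.dumps(cors_configuration, indent=2))
```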
Requirements for custom prompt datasets used in automatic model evaluation jobs
In automatic model evaluation jobs, you can use one custom prompt dataset for each metric that you select in the model evaluation job. Custom datasets use the JSON Lines format (.jsonl), and each line must be a valid JSON object. Your dataset can contain up to 1,000 prompts per automatic evaluation job.
You must use the following keys in a custom dataset.
- prompt – required. Indicates the input for the following tasks:
  - The prompt that your model should respond to, in general text generation.
  - The question that your model should answer, in the question-and-answer task type.
  - The text that your model should summarize, in the text summarization task.
  - The text that your model should classify, in classification tasks.
- referenceResponse – required. Indicates the ground truth response against which your model is evaluated, for the following task types:
  - The answer for all prompts in question-and-answer tasks.
  - The answer for accuracy and robustness evaluations.
- category – (optional) generates evaluation scores reported for each category.
As an example, accuracy evaluation requires both a question to ask and an answer to check the model response against. In this case, use the prompt key with the question as its value, and the referenceResponse key with the answer as its value, as follows.
{ "prompt": "Bobigny is the capital of", "referenceResponse": "Seine-Saint-Denis", "category": "Capitals" }
The previous example is a single line of a JSON Lines input file that is sent to your model as an inference request. The model is invoked for every such record in your JSON Lines dataset. The following example input is for a question-and-answer task that uses the optional category key for evaluation.
{"prompt":"Aurillac is the capital of", "category":"Capitals", "referenceResponse":"Cantal"}
{"prompt":"Bamiyan city is the capital of", "category":"Capitals", "referenceResponse":"Bamiyan Province"}
{"prompt":"Sokhumi is the capital of", "category":"Capitals", "referenceResponse":"Abkhazia"}
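The format rules above can be checked before you upload the file. This is a minimal sketch that validates only the requirements stated here (valid JSON per line, a required prompt key, and the per-job prompt limit); the function name and sample records are hypothetical.

```python
import json

MAX_PROMPTS = 1000  # limit per automatic evaluation job


def validate_prompt_dataset(lines):
    """Check that each line is a valid JSON object with a 'prompt' key
    and that the dataset stays within the per-job prompt limit."""
    records = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # raises ValueError on malformed JSON
        if "prompt" not in record:
            raise ValueError(f"line {n}: missing required 'prompt' key")
        records.append(record)
    if len(records) > MAX_PROMPTS:
        raise ValueError(
            f"{len(records)} prompts exceeds the limit of {MAX_PROMPTS}"
        )
    return records


sample = [
    '{"prompt":"Aurillac is the capital of", "category":"Capitals", "referenceResponse":"Cantal"}',
    '{"prompt":"Sokhumi is the capital of", "category":"Capitals", "referenceResponse":"Abkhazia"}',
]
records = validate_prompt_dataset(sample)
print(len(records))  # 2
```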
To learn more about the format requirements for model evaluation jobs that use human workers, see Requirements for custom prompt datasets in model evaluation jobs that use human workers.
Requirements for custom prompt datasets in model evaluation jobs that use human workers
In the JSON Lines format, each line is a valid JSON object. A prompt dataset can have a maximum of 1,000 prompts per model evaluation job.
A valid prompt entry must contain the prompt key. Both category and referenceResponse are optional. Use the category key to label your prompt with a specific category that you can use to filter the results when you review them in the model evaluation report card. Use the referenceResponse key to specify the ground truth response that your workers can reference during the evaluation.
In the worker UI, the values that you specify for prompt and referenceResponse are visible to your human workers.
The following is an example custom dataset that contains 6 inputs and uses the JSON line format.
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
The following example is a single entry, expanded for clarity.
{
    "prompt": "What is high intensity interval training?",
    "category": "Fitness",
    "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods."
}
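One way to produce a dataset file in this format is to serialize one compact JSON object per line. The following is a minimal sketch; the output file name and the records are hypothetical examples, not values required by the service.

```python
import json

# Hypothetical example records following the key schema described above.
records = [
    {
        "prompt": "What is high intensity interval training?",
        "category": "Fitness",
        "referenceResponse": "Short, intense exercise bursts followed by brief recovery periods.",
    },
    {
        "prompt": "Bobigny is the capital of",
        "category": "Capitals",
        "referenceResponse": "Seine-Saint-Denis",
    },
]

# JSON Lines: one compact JSON object per line, newline-terminated.
with open("custom-prompts.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The resulting file can then be uploaded to the S3 bucket configured for the evaluation job.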