Use custom prompt dataset for model evaluation in Amazon Bedrock

You can use a custom prompt dataset in model evaluation jobs.

Custom prompt datasets must be stored in Amazon S3, use the JSON line format, and use the .jsonl file extension. When you upload the dataset to Amazon S3, make sure that you update the Cross Origin Resource Sharing (CORS) configuration on the S3 bucket. To learn more about the required CORS permissions, see Required Cross Origin Resource Sharing (CORS) permission on S3 buckets.
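As a minimal sketch of producing a dataset in this format (the file name and records below are illustrative, not required values), you can write one JSON object per line with Python's standard json module:

```python
import json

# Illustrative records; the file name and prompt text are examples only.
records = [
    {"prompt": "Bobigny is the capital of", "referenceResponse": "Seine-Saint-Denis"},
    {"prompt": "Aurillac is the capital of", "referenceResponse": "Cantal"},
]

with open("custom-dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # One JSON object per line, with no embedded newlines,
        # as the .jsonl format requires.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

You can then upload the resulting .jsonl file to your S3 bucket with your usual tooling.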

Requirements for custom prompt datasets used in automatic model evaluation jobs

In automatic model evaluation jobs, you can use a custom prompt dataset for each metric that you select in the model evaluation job. Custom datasets use the JSON line format (.jsonl), and each line must be a valid JSON object. Your dataset can contain up to 1000 prompts per automatic evaluation job.

You must use the following keys in a custom dataset.

  • prompt – required to indicate the input for the following tasks:

    • The prompt that your model should respond to, in general text generation.

    • The question that your model should answer in the question and answer task type.

    • The text that your model should summarize in the text summarization task.

    • The text that your model should classify in classification tasks.

  • referenceResponse – required to indicate the ground truth response against which your model is evaluated for the following task types:

    • The answer for all prompts in question and answer tasks.

    • The answer used in all accuracy and robustness evaluations.

  • category – (optional) Generates evaluation scores that are reported separately for each category.

As an example, accuracy evaluation requires both a question to ask and an answer to check the model response against. In this case, use the prompt key with the question as its value, and the referenceResponse key with the answer as its value, as follows.

{ "prompt": "Bobigny is the capital of", "referenceResponse": "Seine-Saint-Denis", "category": "Capitals" }

The previous example is a single line of a JSON line input file that is sent to your model as an inference request. The model is invoked once for each such record in your JSON line dataset. The following example input is for a question answer task that uses the optional category key for evaluation.

{"prompt":"Aurillac is the capital of", "category":"Capitals", "referenceResponse":"Cantal"}
{"prompt":"Bamiyan city is the capital of", "category":"Capitals", "referenceResponse":"Bamiyan Province"}
{"prompt":"Sokhumi is the capital of", "category":"Capitals", "referenceResponse":"Abkhazia"}
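As an illustrative local check (not part of the Bedrock API), the rules above — valid JSON on every line, the required prompt and referenceResponse keys, and the 1000-prompt limit — can be verified before you upload the dataset:

```python
import json

def validate_automatic_dataset(lines, max_prompts=1000):
    """Validate dataset lines for an automatic evaluation job.

    Assumes 'prompt' and 'referenceResponse' are required and
    'category' is optional, per the key requirements above.
    """
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {lineno}: invalid JSON ({err})")
        for key in ("prompt", "referenceResponse"):
            if key not in record:
                raise ValueError(f"line {lineno}: missing required key {key!r}")
        count += 1
    if count > max_prompts:
        raise ValueError(f"{count} prompts exceeds the {max_prompts}-prompt limit")
    return count

sample = [
    '{"prompt": "Aurillac is the capital of", "category": "Capitals", "referenceResponse": "Cantal"}',
    '{"prompt": "Sokhumi is the capital of", "category": "Capitals", "referenceResponse": "Abkhazia"}',
]
print(validate_automatic_dataset(sample))  # 2
```

In practice you would pass the lines of your .jsonl file (for example, `open(path).read().splitlines()`) instead of the in-memory sample.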

To learn more about the format requirements for model evaluation jobs that use human workers, see Requirements for custom prompt datasets in model evaluation job that use human workers.

Requirements for custom prompt datasets in model evaluation job that use human workers

In the JSON line format, each line is a valid JSON object. A prompt dataset can have a maximum of 1000 prompts per model evaluation job.

A valid prompt entry must contain the prompt key. Both category and referenceResponse are optional. Use the category key to label your prompt with a specific category that you can use to filter the results when reviewing them in the model evaluation report card. Use the referenceResponse key to specify the ground truth response that your workers can reference during the evaluation.

In the worker UI, the values that you specify for prompt and referenceResponse are visible to your human workers.

The following is an example custom dataset that contains 6 inputs and uses the JSON line format.

{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}

The following example is a single entry, expanded for clarity.

{ "prompt": "What is high intensity interval training?", "category": "Fitness", "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods." }
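Because only prompt is required for human-worker jobs, a local check can be looser than the one for automatic jobs. The hypothetical helper below verifies the required key and tallies the optional category labels you can later use to filter the report card:

```python
import json
from collections import Counter

def summarize_worker_dataset(lines):
    """Check that each record has the required 'prompt' key and count
    entries per optional 'category' label (illustrative helper only)."""
    categories = Counter()
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        if "prompt" not in record:
            raise ValueError(f"line {lineno}: missing required key 'prompt'")
        # Records without a category are grouped under a placeholder label.
        categories[record.get("category", "(uncategorized)")] += 1
    return dict(categories)

sample = [
    '{"prompt": "What is high intensity interval training?", "category": "Fitness"}',
    '{"prompt": "Describe a balanced diet."}',
]
print(summarize_worker_dataset(sample))  # {'Fitness': 1, '(uncategorized)': 1}
```

The per-category counts can help you confirm that each category you expect to filter on in the model evaluation report card is actually present in the dataset.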