
Creating model evaluation jobs

The following examples show you how to create a model evaluation job using the Amazon Bedrock console, the AWS CLI, or the SDK for Python.

Automatic model evaluation jobs

The following examples demonstrate how to create an automatic model evaluation job. All automatic model evaluation jobs require that you create an IAM service role. To learn more about the IAM requirements for setting up a model evaluation job, see Service role requirements for model evaluation jobs.

Amazon Bedrock console

Use the following procedure to create a model evaluation job using the Amazon Bedrock console. To successfully complete this procedure, make sure that your IAM user, group, or role has sufficient permissions to access the console. To learn more, see Required permissions to create a model evaluation job using the Amazon Bedrock console.

Also, any custom prompt datasets that you want to specify in the model evaluation job must have the required CORS permissions added to the Amazon S3 bucket. To learn more about adding the required CORS permissions, see Required Cross Origin Resource Sharing (CORS) permission on S3 buckets.
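The authoritative CORS rule is described in the linked topic; as a sketch, the bucket configuration typically looks like the following JSON (treat the exact values as an assumption and confirm them against that topic):

```json
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["Access-Control-Allow-Origin"]
    }
]
```

A rule like this can be attached with `aws s3api put-bucket-cors --bucket your-bucket --cors-configuration file://cors.json`.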

To create an automatic model evaluation job
  1. Open the Amazon Bedrock console: https://console.aws.amazon.com/bedrock/

  2. In the navigation pane, choose Model evaluation.

  3. In the Build an evaluation card, under Automatic choose Create automatic evaluation.

  4. On the Create automatic evaluation page, provide the following information:

    1. Evaluation name — Give the model evaluation job a name that describes the job. This name is shown in your model evaluation job list. The name must be unique in your AWS account in an AWS Region.

    2. Description (Optional) — Provide an optional description.

    3. Models — Choose the model you want to use in the model evaluation job.

      To learn more about available models and accessing them in Amazon Bedrock, see Manage access to Amazon Bedrock foundation models.

    4. (Optional) To change the inference configuration, choose update.

      Changing the inference configuration changes the responses generated by the selected model. To learn more about the available inference parameters, see Inference parameters for foundation models.

    5. Task type — Choose the type of task you want the model to attempt to perform during the model evaluation job.

    6. Metrics and datasets — The list of available metrics and built-in prompt datasets changes based on the task you select. You can choose from the list of Available built-in datasets or you can choose Use your own prompt dataset. If you choose to use your own prompt dataset, enter the exact S3 URI of your prompt dataset file or choose Browse S3 to search for your prompt dataset.

    7. Evaluation results — Specify the S3 URI of the directory where you want the results saved. Choose Browse S3 to search for a location in Amazon S3.

    8. (Optional) To enable the use of a customer managed key, choose Customize encryption settings (advanced). Then, provide the ARN of the AWS KMS key you want to use.

    9. Amazon Bedrock IAM role — Choose Use an existing role to use an IAM service role that already has the required permissions, or choose Create a new role to create a new IAM service role.

  5. Then, choose Create.

Once your job starts, the status changes to In progress. Once the status changes to Completed, you can view the job's report card.
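If you choose Use your own prompt dataset in step 6, the dataset is a JSON Lines file with one prompt object per line. The `prompt`, `referenceResponse`, and `category` field names below follow the custom prompt dataset format described elsewhere in this guide; the sample records themselves are invented for this sketch.

```python
import json

# Illustrative records for a question-answer task; the field names follow
# the custom prompt dataset format, the contents are made up.
records = [
    {"prompt": "What is the capital of France?",
     "referenceResponse": "Paris",
     "category": "Geography"},
    {"prompt": "Is the sky blue on a clear day?",
     "referenceResponse": "Yes",
     "category": "Nature"},
]

# Write one JSON object per line (JSON Lines), as the service expects.
with open("custom-dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Sanity-check that every line parses back and carries a prompt.
with open("custom-dataset.jsonl") as f:
    lines = [json.loads(line) for line in f]
assert all("prompt" in r for r in lines)
```

Upload the resulting file to Amazon S3 and reference its S3 URI in the job configuration.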

SDK for Python

The following code example demonstrates how to create an automatic model evaluation job using the SDK for Python.

import boto3

client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="api-auto-job-titan",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/role-name",
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams": "{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://model-evaluations/outputs/"
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {"name": "Builtin.BoolQ"},
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"]
                }
            ]
        }
    }
)

print(job_request)
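After create_evaluation_job returns, you can poll the job's status with get_evaluation_job until it finishes. The sketch below assumes Completed (mentioned in this guide), Failed, and Stopped are the terminal status values; confirm the full list against the GetEvaluationJob API reference.

```python
import time

# Statuses treated as terminal in this sketch; confirm against the
# GetEvaluationJob API reference for the authoritative list.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def is_terminal(status):
    """Return True once a job can no longer change state."""
    return status in TERMINAL_STATUSES

def wait_for_job(job_arn, poll_seconds=60):
    """Poll an evaluation job until it reaches a terminal status."""
    import boto3  # imported here so is_terminal stays dependency-free
    client = boto3.client("bedrock")
    while True:
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

For example, `wait_for_job(job_request["jobArn"])` blocks until the job above is done.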
AWS CLI

In the AWS CLI, you can use the help command to see which parameters of create-evaluation-job are required and which are optional.

aws bedrock create-evaluation-job help
aws bedrock create-evaluation-job \
    --job-name 'automatic-eval-job-cli-001' \
    --role-arn 'arn:aws:iam::111122223333:role/role-name' \
    --evaluation-config '{"automated": {"datasetMetricConfigs": [{"taskType": "QuestionAndAnswer","dataset": {"name": "Builtin.BoolQ"},"metricNames": ["Builtin.Accuracy","Builtin.Robustness"]}]}}' \
    --inference-config '{"models": [{"bedrockModel": {"modelIdentifier":"arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1","inferenceParams":"{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"}}]}' \
    --output-data-config '{"s3Uri":"s3://automatic-eval-jobs/outputs"}'

Human-based model evaluation jobs

When you create a human-based model evaluation job outside of the Amazon Bedrock console, you need to create an Amazon SageMaker flow definition ARN.

The flow definition is where a model evaluation job's workflow is defined. It defines the worker interface, the work team you want assigned to the task, and the connection to Amazon Bedrock.

For model evaluation jobs started using Amazon Bedrock API operations, you must create a flow definition ARN using the AWS CLI or a supported AWS SDK. To learn more about how flow definitions work, and creating them programmatically, see Create a Human Review Workflow (API) in the SageMaker Developer Guide.

In the CreateFlowDefinition request, you must specify AWS/Bedrock/Evaluation as the input to AwsManagedHumanLoopRequestSource. The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.

The following is an example request using the AWS CLI. In the request, the HumanTaskUiArn is a SageMaker owned ARN. In the ARN, you can only modify the AWS Region.

aws sagemaker create-flow-definition --cli-input-json \
'{
  "FlowDefinitionName": "human-evaluation-task01",
  "HumanLoopRequestSource": {
    "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
  },
  "HumanLoopConfig": {
    "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
    "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
    "TaskTitle": "Human review tasks",
    "TaskDescription": "Provide a real good answer",
    "TaskCount": 1,
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "TaskTimeLimitInSeconds": 3600,
    "TaskKeywords": ["foo"]
  },
  "OutputConfig": {
    "S3OutputPath": "s3://your-output-bucket"
  },
  "RoleArn": "arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
}'
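The same request can be issued from Python with boto3's SageMaker client. The helper below only assembles the request dictionary, using the same placeholder names and account IDs as the CLI example above; the create_flow_definition call itself is shown commented out because it requires live AWS credentials.

```python
def build_flow_definition_request(region, account_id, workteam_name,
                                  output_bucket, role_arn):
    """Assemble a CreateFlowDefinition request for a Bedrock human evaluation.

    Mirrors the CLI example above; all names are placeholders.
    """
    return {
        "FlowDefinitionName": "human-evaluation-task01",
        "HumanLoopRequestSource": {
            # Required value for Amazon Bedrock model evaluation jobs.
            "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
        },
        "HumanLoopConfig": {
            "WorkteamArn": f"arn:aws:sagemaker:{region}:{account_id}:workteam/private-crowd/{workteam_name}",
            # SageMaker-owned task UI; only the Region may be changed.
            "HumanTaskUiArn": f"arn:aws:sagemaker:{region}:394669845002:human-task-ui/Evaluation",
            "TaskTitle": "Human review tasks",
            "TaskDescription": "Provide a real good answer",
            "TaskCount": 1,
            "TaskAvailabilityLifetimeInSeconds": 864000,
            "TaskTimeLimitInSeconds": 3600,
            "TaskKeywords": ["foo"],
        },
        "OutputConfig": {"S3OutputPath": f"s3://{output_bucket}"},
        "RoleArn": role_arn,
    }

request = build_flow_definition_request(
    region="us-west-2",
    account_id="111122223333",
    workteam_name="my-workteam",
    output_bucket="your-output-bucket",
    role_arn="arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn",
)

# import boto3
# sagemaker = boto3.client("sagemaker")
# response = sagemaker.create_flow_definition(**request)
```

The response of create_flow_definition contains the FlowDefinitionArn to pass to the evaluation job.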

After creating your flow definition ARN, use the following examples to create a human-based model evaluation job using the AWS CLI or a supported AWS SDK.

Amazon Bedrock console

Use the following procedure to create a model evaluation job using the Amazon Bedrock console. To successfully complete this procedure, make sure that your IAM user, group, or role has sufficient permissions to access the console. To learn more, see Required permissions to create a model evaluation job using the Amazon Bedrock console.

To create a model evaluation job that uses human workers
  1. Open the Amazon Bedrock console: https://console.aws.amazon.com/bedrock/

  2. In the navigation pane, choose Model evaluation.

  3. In the Build an evaluation card, under Human, choose Create human-based evaluation.

  4. On the Create human-based evaluation page, provide the following information:

    1. Evaluation name — Give the model evaluation job a name that describes the job. This name is shown in your model evaluation job list. The name must be unique in your AWS account in an AWS Region.

    2. Description (Optional) — Provide an optional description.

    3. Models — Choose the model you want to use in the model evaluation job.

      To learn more about available models and accessing them in Amazon Bedrock, see Manage access to Amazon Bedrock foundation models.

    4. (Optional) To change the inference configuration, choose update.

      Changing the inference configuration changes the responses generated by the selected model. To learn more about the available inference parameters, see Inference parameters for foundation models.

    5. Task type — Choose the type of task you want the model to attempt to perform during the model evaluation job.

    6. Metrics and datasets — The list of available metrics and built-in prompt datasets changes based on the task you select. You can choose from the list of Available built-in datasets or you can choose Use your own prompt dataset. If you choose to use your own prompt dataset, enter the exact S3 URI of your prompt dataset file or choose Browse S3 to search for your prompt dataset.

    7. Evaluation results — Specify the S3 URI of the directory where you want the results of your model evaluation job saved. Choose Browse S3 to search for a location in Amazon S3.

    8. (Optional) To enable the use of a customer managed key, choose Customize encryption settings (advanced). Then, provide the ARN of the AWS KMS key you want to use.

    9. Amazon Bedrock IAM role — Choose Use an existing role to use an IAM service role that already has the required permissions, or choose Create a new role to create a new IAM service role.

  5. Then, choose Create.

Once your job starts, the status changes to In progress. Once the status changes to Completed, you can view the job's report card.

SDK for Python

The following code example demonstrates how to create a model evaluation job that uses human workers using the SDK for Python.

import boto3

client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="111122223333-job-01",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/example-human-eval-api-role",
    inferenceConfig={
        # You must specify an array of models
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams": "{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
                }
            },
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{\"temperature\":\"0.25\",\"top_p\":\"0.25\",\"max_tokens_to_sample\":\"256\",\"top_k\":\"1\"}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://job-bucket/outputs/"
    },
    evaluationConfig={
        "human": {
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/example-workflow-arn",
                "instructions": "some human eval instruction"
            },
            "customMetrics": [
                {
                    "name": "IndividualLikertScale",
                    "description": "testing",
                    "ratingMethod": "IndividualLikertScale"
                }
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "Custom_Dataset1",
                        "datasetLocation": {
                            "s3Uri": "s3://job-bucket/custom-datasets/custom-trex.jsonl"
                        }
                    },
                    "metricNames": ["IndividualLikertScale"]
                }
            ]
        }
    }
)

print(job_request)
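In both SDK examples, inferenceParams is a JSON-encoded string (with every value passed as a string), which is easy to mis-escape by hand. The helper below is a convenience sketch, not part of any AWS SDK, that builds one entry for the models array:

```python
import json

def bedrock_model_config(model_identifier, **params):
    """Build one entry for the inferenceConfig "models" array.

    inferenceParams is a JSON *string*, not a nested object; every value
    is converted to a string to match the examples above. Hypothetical
    helper for illustration only.
    """
    return {
        "bedrockModel": {
            "modelIdentifier": model_identifier,
            # json.dumps handles the quoting and escaping that is easy
            # to get wrong when writing the string by hand.
            "inferenceParams": json.dumps({k: str(v) for k, v in params.items()}),
        }
    }

config = bedrock_model_config(
    "anthropic.claude-v2",
    temperature=0.25, top_p=0.25, max_tokens_to_sample=256, top_k=1,
)

# The embedded string round-trips back to a dictionary.
params = json.loads(config["bedrockModel"]["inferenceParams"])
```

A list of such entries can be passed directly as the "models" value of inferenceConfig.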