Creating model evaluation jobs
The following examples show you how to create a model evaluation job using the Amazon Bedrock console, AWS CLI, or SDK for Python.
Automatic model evaluation jobs
The following examples demonstrate how to create an automatic model evaluation
job. All automatic model evaluation jobs require that you create an IAM service
role. To learn more about the IAM requirements for setting up a model
evaluation job, see Service role requirements for model evaluation jobs.
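If you want to create the service role programmatically, the following is a minimal sketch using the SDK for Python. It assumes the bedrock.amazonaws.com service principal; the role name is hypothetical, and you still need to attach the permissions policies described in Service role requirements for model evaluation jobs.
import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets Amazon Bedrock assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

# The role name is hypothetical; attach your permissions policies afterwards.
role = iam.create_role(
    RoleName="model-evaluation-service-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

print(role['Role']['Arn'])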
- Amazon Bedrock console
-
Use the following procedure to create a model evaluation job using
the Amazon Bedrock console. To successfully complete this procedure, make sure
that your IAM user, group, or role has sufficient permissions
to access the console. To learn more, see Required permissions to create a model evaluation job using the Amazon Bedrock console.
Also, any custom prompt datasets that you want to specify in the
model evaluation job must have the required CORS permissions added
to the Amazon S3 bucket. To learn more about adding the required CORS
permissions, see Required Cross Origin Resource Sharing (CORS) permission on S3 buckets. A scripted sketch for adding this configuration follows the procedure below.
To create an automatic model evaluation job
-
Open the Amazon Bedrock console: https://console.aws.amazon.com/bedrock/
-
In the navigation pane, choose Model
evaluation.
-
In the Build an evaluation card,
under Automatic, choose Create
automatic evaluation.
-
On the Create automatic evaluation
page, provide the following information:
-
Evaluation name — Give
the model evaluation job a name that describes the
job. This name is shown in your model evaluation job
list. The name must be unique in
your AWS account
in
an AWS Region.
-
Description (Optional)
— Provide an optional description.
-
Models — Choose the
model you want to use in the model evaluation
job.
To learn more about available
models
and accessing them in Amazon Bedrock, see
Manage access to Amazon Bedrock foundation models.
-
(Optional) To change the inference configuration,
choose update.
Changing the inference configuration changes the
responses generated by the selected
model.
To learn more about the available inference
parameters, see Inference parameters for foundation models.
-
Task type — Choose
the type of task you want the model to attempt to
perform during the model evaluation job.
-
Metrics and datasets —
The list of available metrics and built-in prompt
datasets changes based on the task you select. You
can choose from the list of Available
built-in datasets, or you can choose
Use your own prompt dataset.
If you choose to use your own prompt dataset, enter
the exact S3 URI of your prompt dataset file or
choose Browse S3 to
search for your prompt dataset.
-
Evaluation results
— Specify the S3 URI of the directory where
you want the results saved. Choose Browse S3 to search for a
location in Amazon S3.
-
(Optional) To enable the use of a customer managed key,
choose Customize encryption settings
(advanced). Then, provide the ARN of the AWS KMS key
you want to use.
-
Amazon Bedrock IAM role —
Choose Use an existing
role to use an IAM service role that
already has the required permissions, or choose
Create a new role
to create a new IAM service role.
-
Then, choose Create.
Once your job has started, the status changes to In
progress. Once the status changes to
Completed, you can view the
job's report card.
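If you still need to add the CORS configuration mentioned at the start of this procedure, the following is a minimal sketch using the SDK for Python. The bucket name is hypothetical, and the rule shown is an assumption; confirm the exact rule on the Required Cross Origin Resource Sharing (CORS) permission on S3 buckets page.
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket; apply the CORS rule the console requires.
s3.put_bucket_cors(
    Bucket="your-prompt-dataset-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"]
            }
        ]
    }
)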
- SDK for Python
-
The following code example demonstrates how to create an automatic model evaluation job using the SDK for Python.
import boto3

client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="api-auto-job-titan",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/role-name",
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams": "{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://model-evaluations/outputs/"
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "Builtin.BoolQ"
                    },
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness"
                    ]
                }
            ]
        }
    }
)

print(job_request)
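To wait for the job to finish programmatically, you can poll its status with the GetEvaluationJob operation. The following is a minimal sketch that reuses the client and job_request from the example above.
import time

# The jobArn comes from the create_evaluation_job response above.
job_arn = job_request["jobArn"]

# Poll until the job leaves the InProgress state.
while True:
    job = client.get_evaluation_job(jobIdentifier=job_arn)
    if job["status"] != "InProgress":
        break
    time.sleep(60)

print(job["status"])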
- AWS CLI
-
In the AWS CLI, you can use the help
command to see which parameters are required and which are optional
when specifying create-evaluation-job.
aws bedrock create-evaluation-job help
aws bedrock create-evaluation-job \
    --job-name 'automatic-eval-job-cli-001' \
    --role-arn 'arn:aws:iam::111122223333:role/role-name' \
    --evaluation-config '{"automated": {"datasetMetricConfigs": [{"taskType": "QuestionAndAnswer","dataset": {"name": "Builtin.BoolQ"},"metricNames": ["Builtin.Accuracy","Builtin.Robustness"]}]}}' \
    --inference-config '{"models": [{"bedrockModel": {"modelIdentifier":"arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1","inferenceParams":"{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"}}]}' \
    --output-data-config '{"s3Uri":"s3://automatic-eval-jobs/outputs"}'
Human-based model evaluation jobs
When you create a human-based model evaluation job outside of the Amazon Bedrock
console, you need to create an Amazon SageMaker flow definition ARN.
The flow definition is where a model evaluation job's workflow is defined.
It defines the worker interface and the work team you
want assigned to the task, and it connects to Amazon Bedrock.
For model evaluation jobs started using Amazon Bedrock API operations, you must
create a flow definition ARN using the AWS CLI or a supported AWS SDK. To
learn more about how flow definitions work, and how to create them programmatically,
see Create a Human Review Workflow (API) in the SageMaker Developer Guide.
In the CreateFlowDefinition request,
you must specify AWS/Bedrock/Evaluation
as the input to the AwsManagedHumanLoopRequestSource.
The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.
The following is an example request using the AWS CLI. In the request, the HumanTaskUiArn
is a SageMaker owned ARN. In the ARN, you can only modify the AWS Region.
aws sagemaker create-flow-definition --cli-input-json '
{
    "FlowDefinitionName": "human-evaluation-task01",
    "HumanLoopRequestSource": {
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    "HumanLoopConfig": {
        "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
        "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
        "TaskTitle": "Human review tasks",
        "TaskDescription": "Provide a real good answer",
        "TaskCount": 1,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "TaskTimeLimitInSeconds": 3600,
        "TaskKeywords": [
            "foo"
        ]
    },
    "OutputConfig": {
        "S3OutputPath": "s3://your-output-bucket"
    },
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
}'
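If you prefer a supported AWS SDK over the AWS CLI, the following is an equivalent sketch using the SDK for Python (boto3). It mirrors the CLI request above and uses the same sample values; replace AWS Region with your Region.
import boto3

sagemaker = boto3.client('sagemaker')

response = sagemaker.create_flow_definition(
    FlowDefinitionName="human-evaluation-task01",
    HumanLoopRequestSource={
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    HumanLoopConfig={
        # Replace AWS Region in these ARNs with your Region.
        "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
        "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
        "TaskTitle": "Human review tasks",
        "TaskDescription": "Provide a real good answer",
        "TaskCount": 1,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "TaskTimeLimitInSeconds": 3600,
        "TaskKeywords": ["foo"]
    },
    OutputConfig={
        "S3OutputPath": "s3://your-output-bucket"
    },
    RoleArn="arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
)

# The flow definition ARN to use in your model evaluation job.
print(response['FlowDefinitionArn'])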
After creating your flow definition ARN, use the following examples to create a human-based model evaluation job using the AWS CLI or a supported AWS SDK.
- Amazon Bedrock console
-
Use the following procedure to create a model evaluation job using
the Amazon Bedrock console. To successfully complete this procedure, make sure
that your IAM user, group, or role has sufficient permissions
to access the console. To learn more, see Required permissions to create a model evaluation job using the Amazon Bedrock console.
To create a model evaluation job that uses human workers
-
Open the Amazon Bedrock console: https://console.aws.amazon.com/bedrock/
-
In the navigation pane, choose Model
evaluation.
-
In the Build an evaluation card,
under Human, choose Create
human-based evaluation.
-
On the Create human-based evaluation
page, provide the following information:
-
Evaluation name — Give
the model evaluation job a name that describes the
job. This name is shown in your model evaluation job
list. The name must be unique in
your AWS account
in
an AWS Region.
-
Description (Optional)
— Provide an optional description.
-
Models — Choose the
model you want to use in the model evaluation
job.
To learn more about available
models
and accessing them in Amazon Bedrock, see
Manage access to Amazon Bedrock foundation models.
-
(Optional) To change the inference configuration,
choose update.
Changing the inference configuration changes the
responses generated by the selected
model.
To learn more about the available inference
parameters, see Inference parameters for foundation models.
-
Task type — Choose
the type of task you want the model to attempt to
perform during the model evaluation job.
-
Metrics and datasets —
The list of available metrics and built-in prompt
datasets changes based on the task you select. You
can choose from the list of Available
built-in datasets, or you can choose
Use your own prompt dataset.
If you choose to use your own prompt dataset, enter
the exact S3 URI of your prompt dataset file
or choose Browse S3
to search for your prompt dataset.
-
Evaluation results —
Specify the S3 URI of the directory where you want the
results of your model evaluation job saved.
Choose Browse S3 to
search for a location in Amazon S3.
-
(Optional) To enable the use of a customer managed key,
choose Customize encryption settings
(advanced). Then, provide the ARN of the AWS KMS key
you want to use.
-
Amazon Bedrock IAM
role — Choose
Use an existing
role to use an
IAM service role that
already has the required permissions,
or choose Create a new
role to create a new IAM service role.
-
Then, choose Create.
Once your job has started, the status changes to In
progress. Once the status changes to
Completed, you can view the job's
report card.
- SDK for Python
-
The following code example demonstrates how to create a model evaluation job that uses human workers with the SDK for Python.
import boto3

client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="111122223333-job-01",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/example-human-eval-api-role",
    inferenceConfig={
        # You must specify an array of models.
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams": "{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
                }
            },
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{\"temperature\":\"0.25\",\"top_p\":\"0.25\",\"max_tokens_to_sample\":\"256\",\"top_k\":\"1\"}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://job-bucket/outputs/"
    },
    evaluationConfig={
        "human": {
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/example-workflow-arn",
                "instructions": "some human eval instruction"
            },
            "customMetrics": [
                {
                    "name": "IndividualLikertScale",
                    "description": "testing",
                    "ratingMethod": "IndividualLikertScale"
                }
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "Custom_Dataset1",
                        "datasetLocation": {
                            "s3Uri": "s3://job-bucket/custom-datasets/custom-trex.jsonl"
                        }
                    },
                    "metricNames": [
                        "IndividualLikertScale"
                    ]
                }
            ]
        }
    }
)

print(job_request)
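The job above points to a custom prompt dataset (custom-trex.jsonl). The following is a minimal sketch of writing such a JSONL file, assuming the prompt, referenceResponse, and category keys; the record contents are hypothetical, so check the prompt dataset requirements for your task type.
import json

# Hypothetical records; each line of the dataset is one JSON object.
records = [
    {
        "prompt": "Summarize the following text: ...",
        "referenceResponse": "A reference summary to compare against.",
        "category": "news"
    }
]

with open("custom-dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")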