Custom code-based evaluator
Custom code-based evaluators let you use your own AWS Lambda function to programmatically evaluate agent performance, instead of using an LLM as a judge. This gives you full control over the evaluation logic — you can implement deterministic checks, call external APIs, run regex matching, compute custom metrics, or apply any business-specific rules.
Prerequisites
To use custom code-based evaluators, you need:
- An AWS Lambda function deployed in the same Region as your AgentCore Evaluations resources.
- An IAM execution role that grants the AgentCore Evaluations service permission to invoke your Lambda function.
- A Lambda function that returns a JSON response conforming to the schema described in Response schema.
IAM permissions
Your service execution role needs the following additional permission to invoke Lambda functions for code-based evaluation:
```json
{
  "Sid": "LambdaInvokeStatement",
  "Effect": "Allow",
  "Action": [
    "lambda:InvokeFunction",
    "lambda:GetFunction"
  ],
  "Resource": "arn:aws:lambda:region:account-id:function:function-name"
}
```
Lambda function contract
Note
The maximum runtime timeout for the Lambda function is 5 minutes (300 seconds). The maximum input payload size sent to the Lambda function is 6 MB.
Input schema
Your Lambda function receives a JSON payload with the following structure:
```json
{
  "schemaVersion": "1.0",
  "evaluatorId": "my-evaluator-abc1234567",
  "evaluatorName": "MyCodeEvaluator",
  "evaluationLevel": "TRACE",
  "evaluationInput": {
    "sessionSpans": [...]
  },
  "evaluationTarget": {
    "traceIds": ["trace123"],
    "spanIds": ["span123"]
  }
}
```
| Field | Type | Description |
|---|---|---|
| `schemaVersion` | String | Schema version of the payload. Currently `1.0`. |
| `evaluatorId` | String | The ID of the code-based evaluator. |
| `evaluatorName` | String | The name of the code-based evaluator. |
| `evaluationLevel` | String | The evaluation level: `TRACE`, `TOOL_CALL`, or `SESSION`. |
| `evaluationInput` | Object | Contains the session spans for evaluation. |
| `sessionSpans` | List | The session spans to evaluate. May be truncated if the original payload exceeds 6 MB. |
| `evaluationTarget` | Object | Identifies the specific traces or spans to evaluate. For session-level evaluators, this field is not set. |
| `traceIds` | List | The trace IDs of the evaluation target. Present for trace-level and tool-level evaluations. |
| `spanIds` | List | The span IDs of the evaluation target. Present for tool-level evaluations. |
Response schema
Your Lambda function must return a JSON object matching one of two formats:
Success response
```json
{
  "label": "PASS",
  "value": 1.0,
  "explanation": "All validation checks passed."
}
```
| Field | Required | Type | Description |
|---|---|---|---|
| `label` | Yes | String | A categorical label for the evaluation result (for example, "PASS", "FAIL", "Good", "Poor"). |
| `value` | No | Number | A numeric score (for example, 0.0 to 1.0). |
| `explanation` | No | String | A human-readable explanation of the evaluation result. |
Error response
```json
{
  "errorCode": "VALIDATION_FAILED",
  "errorMessage": "Input spans missing required tool call attributes."
}
```
| Field | Required | Type | Description |
|---|---|---|---|
| `errorCode` | Yes | String | A code identifying the error. |
| `errorMessage` | Yes | String | A human-readable description of the error. |
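The contract above can be sketched as a minimal handler. This is an illustration only: the pass/fail rule (counting spans that have a non-empty `name` attribute) is a hypothetical placeholder for your own business logic, and the span field names are assumptions about your trace data.

```python
def lambda_handler(event, context):
    """Code-based evaluator sketch: returns a success or error response
    conforming to the schemas described above."""
    spans = event.get("evaluationInput", {}).get("sessionSpans", [])
    if not spans:
        # Error response, per the error schema.
        return {
            "errorCode": "VALIDATION_FAILED",
            "errorMessage": "evaluationInput.sessionSpans is empty or missing.",
        }

    # Hypothetical deterministic check: fraction of spans with a non-empty name.
    named = [s for s in spans if s.get("name")]
    score = len(named) / len(spans)

    # Success response, per the success schema.
    return {
        "label": "PASS" if score == 1.0 else "FAIL",
        "value": score,
        "explanation": f"{len(named)} of {len(spans)} spans have a name.",
    }
```

Because the check is deterministic, the same input always yields the same score, which makes results easy to reproduce in tests.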
Create a code-based evaluator
Use the CreateEvaluator API to create a code-based evaluator by specifying a Lambda function ARN and an optional timeout.
Required parameters: a unique evaluator name, an evaluation level (`TRACE`, `TOOL_CALL`, or `SESSION`), and a code-based evaluator configuration containing the Lambda ARN.
Code-based evaluator configuration:
```json
{
  "codeBased": {
    "lambdaConfig": {
      "lambdaArn": "arn:aws:lambda:region:account-id:function:function-name",
      "lambdaTimeoutInSeconds": 60
    }
  }
}
```
| Field | Required | Default | Description |
|---|---|---|---|
| `lambdaArn` | Yes | — | The ARN of the Lambda function to invoke. |
| `lambdaTimeoutInSeconds` | No | 60 | Timeout in seconds for the Lambda invocation (1–300). |
The following code samples demonstrate how to create code-based evaluators using different development approaches.
Example
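A CreateEvaluator request can be sketched as the payload below. The top-level `evaluatorConfig` key, the boto3 service name, and the snake_case operation name are assumptions (check the AgentCore Evaluations SDK reference); the nested configuration follows the schema shown above.

```python
# Request payload for CreateEvaluator (field names follow this guide;
# "evaluatorConfig" is a hypothetical wrapper key).
create_evaluator_request = {
    "evaluatorName": "MyCodeEvaluator",
    "evaluationLevel": "TRACE",
    "evaluatorConfig": {
        "codeBased": {
            "lambdaConfig": {
                "lambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:my-evaluator",
                "lambdaTimeoutInSeconds": 60,
            }
        }
    },
}

# Hypothetical client and operation names -- verify against the SDK docs:
# import boto3
# client = boto3.client("bedrock-agentcore")
# response = client.create_evaluator(**create_evaluator_request)
```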
Run on-demand evaluation with a code-based evaluator
After you create the evaluator, use it with the Evaluate API the same way you would use any other evaluator. The service handles Lambda invocation, parallel fan-out, and result mapping automatically.
Example
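A session-level invocation can be sketched as the request below. The parameter names match the Evaluate call shown later in this section; the sample span record and the boto3 client name are assumptions.

```python
# Hypothetical exported spans for the session under evaluation.
session_span_logs = [
    {"traceId": "trace-id-1", "spanId": "span-id-1", "name": "agent.invoke"},
]

# Evaluate request body; a session-level evaluator scores the whole
# session, so no evaluationTarget is supplied.
evaluate_request = {
    "evaluatorId": "code-based-evaluator-id",
    "evaluationInput": {"sessionSpans": session_span_logs},
}

# Hypothetical client name -- verify against the SDK docs:
# import boto3
# client = boto3.client("bedrock-agentcore")
# response = client.evaluate(**evaluate_request)
```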
Using evaluation targets
You can target specific traces or spans, just like with LLM-based evaluators:
```python
# Trace-level evaluation
response = client.evaluate(
    evaluatorId="code-based-evaluator-id",
    evaluationInput={"sessionSpans": session_span_logs},
    evaluationTarget={"traceIds": ["trace-id-1", "trace-id-2"]},
)

# Tool-level evaluation
response = client.evaluate(
    evaluatorId="code-based-evaluator-id",
    evaluationInput={"sessionSpans": session_span_logs},
    evaluationTarget={"spanIds": ["span-id-1", "span-id-2"]},
)
```