Custom Code Evaluator

Custom code evaluators let you write your own evaluation logic in Python, JavaScript, or TypeScript. Your code has access to the application inputs, outputs, and the full execution trace (spans, latency, token usage, costs).

Self-hosted deployments only

On self-hosted Agenta, custom evaluator code runs server-side. By default, it runs in a restricted Python sandbox with no filesystem, network, or host access. Operators can change the runner with the AGENTA_SERVICES_CODE_SANDBOX_RUNNER environment variable: local runs code without a sandbox (use only with trusted authors), and daytona runs it in an isolated remote sandbox. See environment configuration. Agenta Cloud is unaffected: it isolates evaluator execution.

Function signature

Your code must define an evaluate function with the following signature:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    ...  # your evaluation logic goes here

Parameters

Parameter	Type	Description
`inputs`	`Dict[str, Any]`	In batch evaluation: the testcase data (all columns). In online evaluation: the application's input from the trace.
`outputs`	`Any`	The application's output (string or dict).
`trace`	`Dict[str, Any]`	The full execution trace with spans, metrics (latency, token counts, costs), and child spans.
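
Because `outputs` can be either a string or a dict, an evaluator may need to normalize it before comparing. The following is a minimal sketch, assuming a dict output can reasonably be compared through its JSON serialization. The helper `output_text` and the `expected` column name are hypothetical and not part of Agenta:

```python
import json
from typing import Any, Dict


def output_text(outputs: Any) -> str:
    """Return the output as text (hypothetical helper, not an Agenta API)."""
    if isinstance(outputs, str):
        return outputs
    # Assumption: non-string outputs (e.g. dicts) can be serialized to JSON.
    return json.dumps(outputs, sort_keys=True, default=str)


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    # "expected" is an assumed testset column name used for illustration.
    expected = str(inputs.get("expected", ""))
    return 1.0 if expected and expected in output_text(outputs) else 0.0
```

Normalizing in one place keeps the scoring logic identical for string and dict outputs.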

Return value

The function should return one of:

float (0.0 to 1.0) — a score where 0.0 is worst and 1.0 is best
bool — True maps to 1.0, False to 0.0
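
As an illustration of the bool return type, the sketch below passes when the output is non-empty and within a length limit. The 280-character limit is an arbitrary assumption chosen for the example:

```python
from typing import Any, Dict


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> bool:
    # A bool result: True is recorded as 1.0, False as 0.0.
    text = outputs if isinstance(outputs, str) else str(outputs)
    # Assumption: a 280-character limit, chosen only for illustration.
    return 0 < len(text.strip()) <= 280
```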

Examples

Exact match

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if outputs == inputs.get("correct_answer"):
        return 1.0
    return 0.0

Latency check

Use trace data to check whether the response was generated within a time budget:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if not trace or not trace.get("spans"):
        return 0.0

    # Take the first span as the root span.
    root = list(trace["spans"].values())[0]
    ag = root.get("attributes", {}).get("ag", {})
    duration = ag.get("metrics", {}).get("unit", {}).get("duration", {}).get("total")

    # Treat a missing duration like a missing trace.
    if duration is None:
        return 0.0

    # Fail if the response took more than 5 seconds.
    if duration > 5.0:
        return 0.0
    return 1.0

Token budget check

Verify the application stayed within a token budget:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if not trace or not trace.get("spans"):
        return 0.5

    root = list(trace["spans"].values())[0]
    ag = root.get("attributes", {}).get("ag", {})
    tokens = ag.get("metrics", {}).get("unit", {}).get("tokens", {})
    total_tokens = tokens.get("total", 0)

    max_tokens = 500
    if total_tokens <= max_tokens:
        return 1.0
    elif total_tokens <= max_tokens * 1.5:
        return 0.5
    return 0.0

Accessing ground truth

In batch evaluation, testcase columns are available directly in inputs. If your testset has a correct_answer column, access it as inputs["correct_answer"] or inputs.get("correct_answer").

You do not need to configure a separate "correct answer key" — just read the column name directly from inputs.
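
A slightly more forgiving variant of the exact-match evaluator might look like the sketch below. It assumes the testset column is named `correct_answer`, and that a missing column should score 0.0 rather than raise:

```python
from typing import Any, Dict


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    # Read the ground truth straight from the testcase columns.
    expected = inputs.get("correct_answer")
    if expected is None:
        # Assumption: a missing column scores 0.0 rather than raising.
        return 0.0

    # Compare case-insensitively, ignoring surrounding whitespace.
    actual = str(outputs).strip().lower()
    return 1.0 if actual == str(expected).strip().lower() else 0.0
```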

Accessing trace data

The trace parameter contains the full OpenTelemetry trace serialized as a dict. The structure looks like:

{
    "spans": {
        "<span_id>": {
            "name": "my_app",
            "start_time": "2025-01-15T10:30:00Z",
            "end_time": "2025-01-15T10:30:02.5Z",
            "status_code": "OK",
            "attributes": {
                "ag": {
                    "data": {
                        "inputs": {"country": "France"},
                        "outputs": "The capital is Paris"
                    },
                    "metrics": {
                        "unit": {
                            "costs": {"total": 0.001},
                            "tokens": {"prompt": 50, "completion": 20, "total": 70},
                            "duration": {"total": 2.5}
                        }
                    }
                }
            },
            "children": [...]
        }
    }
}

Useful paths:

Data	Path
Root span	`list(trace["spans"].values())[0]`
App inputs	`root["attributes"]["ag"]["data"]["inputs"]`
App outputs	`root["attributes"]["ag"]["data"]["outputs"]`
Latency (seconds)	`root["attributes"]["ag"]["metrics"]["unit"]["duration"]["total"]`
Token counts	`root["attributes"]["ag"]["metrics"]["unit"]["tokens"]`
Costs	`root["attributes"]["ag"]["metrics"]["unit"]["costs"]["total"]`
Child spans	`root["children"]`
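
To inspect more than the root span, an evaluator can walk the span tree. The sketch below checks whether a span with a given name appears anywhere in the trace. It assumes each entry in `children` has the same shape as the root span. The helper `walk_spans` and the span name `retriever` are hypothetical:

```python
from typing import Any, Dict, Iterator


def walk_spans(span: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a span and all its descendants, depth first (hypothetical helper)."""
    yield span
    # Assumption: "children" is a list of spans shaped like the root span.
    for child in span.get("children") or []:
        yield from walk_spans(child)


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if not trace or not trace.get("spans"):
        return 0.0

    root = list(trace["spans"].values())[0]
    names = {span.get("name") for span in walk_spans(root)}
    # "retriever" is a hypothetical span name used for illustration.
    return 1.0 if "retriever" in names else 0.0
```

The same traversal could be reused to aggregate per-span metrics, provided the child spans carry them.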

JavaScript and TypeScript

The same interface is available in JavaScript and TypeScript:

JavaScript:

function evaluate(inputs, outputs, trace) {
  return outputs === inputs.correct_answer ? 1.0 : 0.0
}

TypeScript:

function evaluate(
  inputs: Record<string, any>,
  outputs: any,
  trace: Record<string, any>
): number {
  return outputs === inputs.correct_answer ? 1.0 : 0.0
}

Legacy interface (v1)

Existing evaluators created before this update use the legacy 4-parameter interface:

def evaluate(app_params, inputs, output, correct_answer) -> float:

These continue to work unchanged.

If you want to migrate an existing evaluator to the new interface, update the function signature to (inputs, outputs, trace) and access ground truth directly from inputs instead of the correct_answer parameter.
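
A migration might look like the sketch below, which shows a simple legacy evaluator next to its migrated form. The legacy function is renamed to `evaluate_legacy` only so both fit in one snippet, and the column name `correct_answer` is an assumption about the testset:

```python
from typing import Any, Dict


# Legacy (v1) form, renamed here only so both versions fit in one snippet.
def evaluate_legacy(app_params, inputs, output, correct_answer) -> float:
    return 1.0 if output == correct_answer else 0.0


# Migrated form: ground truth now comes from the testcase columns.
def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    # Assumption: the testset column is named "correct_answer".
    correct_answer = inputs.get("correct_answer")
    return 1.0 if outputs == correct_answer else 0.0
```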
