Evaluation, safety, and governanceW1017 min read

Agent Evaluation and Benchmarking

Why agent evaluation is harder than NLP eval. Trajectory evaluation, process vs outcome metrics. Major benchmarks: GAIA, AgentBench, WebArena, SWE-bench, ToolBench. LLM-as-judge with reliability checks. Cost and latency analysis.

Core conceptsTrajectory evaluationLLM-as-judgeCost metrics

01Learning Objectives

By the end of this lecture, students will be able to:

Explain why evaluating AI agents is fundamentally more challenging than evaluating standard NLP models.
Design task-based evaluation protocols that measure both success rate and completion quality.
Implement process-based evaluation that assesses reasoning quality, tool use efficiency, and intermediate steps.
Describe and compare major agent benchmarks: SWE-bench, GAIA, AgentBench, WebArena, and HumanEval.
Apply LLM-as-judge techniques for automated evaluation of agent outputs.
Design human evaluation protocols with appropriate annotation guidelines and inter-rater reliability measures.
Incorporate cost and latency metrics into agent evaluation for practical deployment readiness.
Build a simple automated evaluation harness in Python.

021. Why Evaluating Agents Is Hard

The Evaluation Gap

Evaluating traditional NLP models is relatively straightforward: given an input, compare the model's output to a gold-standard reference. Accuracy, F1 score, BLEU, and ROUGE provide clear, quantitative metrics. Agent evaluation is fundamentally different, and this difference is one of the most underappreciated challenges in the field.

Consider the analogy of evaluating a student. Evaluating a multiple-choice exam is easy: compare answers to the key. Evaluating an essay is harder: you need rubrics, judgment, and possibly multiple graders. Evaluating a student's performance in a semester-long research project is hardest of all: you must assess not just the final report, but the research process, the choices made along the way, the ability to handle setbacks, and the quality of reasoning. Agent evaluation is like evaluating the research project -- and we are often trying to do it with the tools designed for multiple-choice exams.

Key Insight: The fundamental challenge of agent evaluation is that agents are not functions that map inputs to outputs. They are processes that interact with environments over time, making sequences of decisions with uncertain outcomes. Evaluating a process is fundamentally harder than evaluating a function.

Sources of Difficulty

Non-Determinism

LLM-based agents are stochastic. The same agent with the same input may produce different outputs across runs due to:

Sampling temperature: Even small temperatures introduce variability.
Tool return variability: Web searches, API calls, and database queries may return different results at different times.
Path dependence: Early random choices cascade through subsequent reasoning steps.

This means single-run evaluation is unreliable. Statistical evaluation across multiple runs is necessary, increasing cost.

python

def evaluate_with_variance(
    agent_fn, test_case: dict, num_runs: int = 5
) -> dict:
    """Run an agent multiple times and report variance.

    Args:
        agent_fn: The agent function to evaluate.
        test_case: Dict with 'input' and 'expected_output' keys.
        num_runs: Number of independent runs.

    Returns:
        Dict with mean score, standard deviation, and individual results.
    """
    scores = []
    outputs = []

    for run in range(num_runs):
        output = agent_fn(test_case["input"])
        score = compute_score(output, test_case["expected_output"])
        scores.append(score)
        outputs.append(output)

    return {
        "mean_score": sum(scores) / len(scores),
        "std_dev": (sum((s - sum(scores)/len(scores))**2 for s in scores) / len(scores)) ** 0.5,
        "min_score": min(scores),
        "max_score": max(scores),
        "num_runs": num_runs,
        "all_scores": scores,
    }

Complex, Multi-Step Behaviors

Agents do not just produce text; they take actions across multiple steps. A coding agent might:

Read the problem description.
Search for relevant documentation.
Write initial code.
Run the code and observe an error.
Debug and fix the code.
Run tests to verify.

Evaluating only the final output misses important information about the quality of the process. Two agents might produce the same final code, but one might have done it in 3 steps while the other needed 15, including several failed attempts.

Open-Ended Outputs

Many agent tasks have multiple valid solutions. "Write a Python function to sort a list" has dozens of correct implementations. "Plan a week-long trip to Japan" has millions of reasonable answers. There is no single gold-standard output to compare against.

Environment Interaction

Agents interact with external environments (web browsers, code interpreters, databases, APIs). These interactions are:

Difficult to reproduce: Web content changes, APIs evolve, databases are updated.
Expensive to run: Real API calls cost money and take time.
Hard to sandbox: Agents with real-world access can cause real-world effects.

Emergent and Unexpected Behaviors

Agents may exhibit behaviors not anticipated by the evaluation protocol:

Finding creative solutions the evaluator did not consider.
Exploiting loopholes in the evaluation setup.
Producing correct results through incorrect reasoning.
Appearing to succeed while actually taking harmful shortcuts.

032. Task-Based Evaluation

Task-based evaluation answers the most practical question: "Does the agent actually get things done?" It is the bottom line of agent evaluation. An agent with beautiful reasoning traces and elegant tool use that fails to produce the correct answer is still a failing agent.

Success Rate

The most fundamental metric: did the agent complete the task? This is the "pass/fail" grade of agent evaluation.

python

def task_success_rate(
    results: list[dict],
) -> dict:
    """Compute task-level success metrics.

    Args:
        results: List of dicts with 'task_id', 'completed', and 'correct' fields.

    Returns:
        Dict with success metrics.
    """
    n = len(results)
    completed = sum(1 for r in results if r["completed"])
    correct = sum(1 for r in results if r["correct"])

    return {
        "total_tasks": n,
        "completion_rate": completed / n if n > 0 else 0.0,
        "success_rate": correct / n if n > 0 else 0.0,
        "completed_but_incorrect": completed - correct,
    }

However, binary success/failure loses important nuance. An agent that completes 80% of a task correctly is different from one that fails entirely.

Completion Quality

Graduated scoring captures partial success:

python

from dataclasses import dataclass


@dataclass
class TaskResult:
    """Result of a single task evaluation."""

    task_id: str
    score: float          # 0.0 to 1.0 (partial credit)
    completed: bool       # Did the agent declare completion?
    steps_taken: int      # Number of steps to reach the result
    errors_encountered: int
    tools_used: list[str]
    wall_time_seconds: float
    tokens_consumed: int
    notes: str = ""


def evaluate_task_quality(
    result: TaskResult,
    max_steps: int = 20,
    max_time: float = 300.0,
) -> dict:
    """Evaluate task completion quality with multiple dimensions.

    Combines correctness with efficiency metrics.
    """
    # Correctness (primary metric)
    correctness = result.score

    # Efficiency: penalize excessive steps
    step_efficiency = max(0.0, 1.0 - (result.steps_taken / max_steps))

    # Time efficiency
    time_efficiency = max(0.0, 1.0 - (result.wall_time_seconds / max_time))

    # Error rate (lower is better)
    error_rate = result.errors_encountered / max(result.steps_taken, 1)

    # Combined quality score (weighted)
    quality_score = (
        0.6 * correctness +
        0.2 * step_efficiency +
        0.1 * time_efficiency +
        0.1 * (1 - error_rate)
    )

    return {
        "task_id": result.task_id,
        "correctness": correctness,
        "step_efficiency": step_efficiency,
        "time_efficiency": time_efficiency,
        "error_rate": error_rate,
        "quality_score": quality_score,
    }

Rubric-Based Evaluation

For open-ended tasks, define a rubric with specific criteria:

python

CODING_TASK_RUBRIC = {
    "correctness": {
        "description": "Does the code produce the correct output?",
        "weight": 0.40,
        "levels": {
            1.0: "All test cases pass",
            0.75: "Most test cases pass (>80%)",
            0.5: "Some test cases pass (40-80%)",
            0.25: "Few test cases pass (<40%)",
            0.0: "No test cases pass or code does not run",
        },
    },
    "code_quality": {
        "description": "Is the code clean, readable, and well-structured?",
        "weight": 0.20,
        "levels": {
            1.0: "Excellent: clean, well-documented, good variable names",
            0.75: "Good: mostly clean with minor issues",
            0.5: "Acceptable: works but has style issues",
            0.25: "Poor: hard to read, bad structure",
            0.0: "Very poor: incomprehensible",
        },
    },
    "efficiency": {
        "description": "Is the solution reasonably efficient?",
        "weight": 0.15,
        "levels": {
            1.0: "Optimal time and space complexity",
            0.75: "Near-optimal",
            0.5: "Acceptable for the problem size",
            0.25: "Unnecessarily slow or memory-intensive",
            0.0: "Unacceptably inefficient",
        },
    },
    "error_handling": {
        "description": "Does the code handle edge cases and errors?",
        "weight": 0.15,
        "levels": {
            1.0: "Comprehensive error handling",
            0.5: "Basic error handling",
            0.0: "No error handling",
        },
    },
    "process_quality": {
        "description": "Was the agent's problem-solving process reasonable?",
        "weight": 0.10,
        "levels": {
            1.0: "Systematic: understood, planned, implemented, tested",
            0.5: "Somewhat systematic but with unnecessary steps",
            0.0: "Chaotic: random trial and error",
        },
    },
}


def rubric_evaluate(scores: dict[str, float], rubric: dict) -> float:
    """Compute a weighted score from rubric evaluations.

    Args:
        scores: Dict mapping rubric criterion names to scores (0.0-1.0).
        rubric: The rubric definition with criteria and weights.

    Returns:
        Weighted total score.
    """
    total = 0.0
    total_weight = 0.0
    for criterion, config in rubric.items():
        if criterion in scores:
            total += scores[criterion] * config["weight"]
            total_weight += config["weight"]
    return total / total_weight if total_weight > 0 else 0.0

043. Process-Based Evaluation

Beyond Final Outputs

Task-based evaluation tells you whether the agent succeeded. Process-based evaluation tells you why it succeeded or failed. This distinction matters enormously for improving agents.

The analogy is grading a math exam. If a student gets the right answer through wrong reasoning (e.g., two errors that cancel out), they will fail on the next problem where the errors do not cancel. Similarly, an agent that gets the right answer through a flawed process (lucky search results, coincidental tool outputs) will fail on slightly different tasks.

Process-based evaluation examines how the agent arrived at its answer, not just the answer itself. This is crucial because:

A correct final answer achieved through flawed reasoning may not generalize.
An incorrect final answer with good reasoning may need only minor adjustments.
Understanding the process enables targeted improvements.

Reasoning Quality

Evaluate the quality of the agent's chain of thought:

python

REASONING_EVALUATION_PROMPT = """Evaluate the quality of the following agent's
reasoning process. Score each dimension from 1-5.

Agent's reasoning trace:
{trace}

Correct answer: {correct_answer}

Dimensions to evaluate:
1. **Logical Coherence** (1-5): Does each step follow logically from the previous?
2. **Relevance** (1-5): Are the reasoning steps relevant to the problem?
3. **Completeness** (1-5): Does the reasoning address all aspects of the problem?
4. **Efficiency** (1-5): Is the reasoning concise without unnecessary tangents?
5. **Self-Correction** (1-5): When errors occur, does the agent recognize and fix them?

For each dimension, provide a score and brief justification.
"""


def evaluate_reasoning(
    trace: str, correct_answer: str, llm_call
) -> dict:
    """Evaluate reasoning quality using an LLM judge.

    Args:
        trace: The agent's full reasoning trace.
        correct_answer: The known correct answer.
        llm_call: Function to call the judge LLM.

    Returns:
        Dict with scores for each reasoning dimension.
    """
    response = llm_call(
        prompt=REASONING_EVALUATION_PROMPT.format(
            trace=trace, correct_answer=correct_answer
        )
    )
    return parse_dimension_scores(response)


def parse_dimension_scores(response: str) -> dict:
    """Parse dimension scores from judge response."""
    dimensions = [
        "logical_coherence",
        "relevance",
        "completeness",
        "efficiency",
        "self_correction",
    ]
    scores = {}
    for dim in dimensions:
        # Look for patterns like "Logical Coherence: 4/5" or "Score: 4"
        for line in response.split("\n"):
            dim_readable = dim.replace("_", " ")
            if dim_readable.lower() in line.lower() and any(c.isdigit() for c in line):
                digits = [int(c) for c in line if c.isdigit()]
                if digits:
                    scores[dim] = digits[0] / 5.0  # Normalize to 0-1
                    break
        if dim not in scores:
            scores[dim] = 0.5  # Default if parsing fails
    return scores

Tool Use Efficiency

How effectively does the agent use its available tools?

python

def evaluate_tool_use(
    tool_calls: list[dict], optimal_calls: list[dict] | None = None
) -> dict:
    """Evaluate the efficiency and appropriateness of tool usage.

    Args:
        tool_calls: List of actual tool calls the agent made.
        optimal_calls: List of ideal tool calls (if known).

    Returns:
        Dict with tool use metrics.
    """
    total_calls = len(tool_calls)

    # Count redundant calls (same tool with same/similar arguments)
    seen = set()
    redundant = 0
    for call in tool_calls:
        key = f"{call['tool']}:{call.get('args', '')}"
        if key in seen:
            redundant += 1
        seen.add(key)

    # Count failed calls
    failed = sum(1 for call in tool_calls if call.get("status") == "error")

    # Tool diversity (number of unique tools used)
    unique_tools = len({call["tool"] for call in tool_calls})

    metrics = {
        "total_calls": total_calls,
        "unique_tools_used": unique_tools,
        "redundant_calls": redundant,
        "failed_calls": failed,
        "redundancy_rate": redundant / total_calls if total_calls > 0 else 0.0,
        "failure_rate": failed / total_calls if total_calls > 0 else 0.0,
    }

    # Compare to optimal if available
    if optimal_calls is not None:
        optimal_count = len(optimal_calls)
        metrics["optimal_calls"] = optimal_count
        metrics["call_overhead"] = (total_calls - optimal_count) / optimal_count if optimal_count > 0 else 0.0

    return metrics

Trajectory Analysis

Examine the complete trajectory of agent actions:

python

def analyze_trajectory(
    trajectory: list[dict],
) -> dict:
    """Analyze an agent's complete action trajectory.

    Args:
        trajectory: List of actions, each with 'type', 'content', and 'timestamp'.

    Returns:
        Dict with trajectory analysis metrics.
    """
    total_steps = len(trajectory)

    # Categorize steps
    reasoning_steps = [s for s in trajectory if s["type"] == "thought"]
    action_steps = [s for s in trajectory if s["type"] == "action"]
    observation_steps = [s for s in trajectory if s["type"] == "observation"]

    # Detect backtracking (similar actions repeated)
    backtrack_count = 0
    for i in range(1, len(action_steps)):
        if action_steps[i]["content"] == action_steps[i-1]["content"]:
            backtrack_count += 1

    # Compute timing if timestamps available
    if trajectory and "timestamp" in trajectory[0]:
        from datetime import datetime
        start = datetime.fromisoformat(trajectory[0]["timestamp"])
        end = datetime.fromisoformat(trajectory[-1]["timestamp"])
        duration = (end - start).total_seconds()
    else:
        duration = None

    return {
        "total_steps": total_steps,
        "reasoning_steps": len(reasoning_steps),
        "action_steps": len(action_steps),
        "observation_steps": len(observation_steps),
        "reasoning_to_action_ratio": (
            len(reasoning_steps) / len(action_steps) if action_steps else 0
        ),
        "backtrack_count": backtrack_count,
        "duration_seconds": duration,
    }

054. Major Agent Benchmarks

Benchmarks serve a critical role in the AI ecosystem: they provide standardized, reproducible evaluations that allow comparison across different agents, models, and approaches. Without benchmarks, every team evaluates on their own test cases, making it impossible to compare results across papers.

However, benchmarks also have a dangerous side. When a benchmark becomes the primary target, teams optimize for that specific benchmark rather than for general capability -- this is Goodhart's Law in action. Keep this tension in mind as we survey the major benchmarks.

Key Insight: Benchmarks measure specific capabilities, not general intelligence. An agent that achieves 100% on SWE-bench might fail spectacularly at web browsing tasks. Always evaluate on benchmarks that match your deployment scenario, and treat benchmark scores as one signal among many, not the definitive measure of agent quality.

SWE-bench

Paper: Jimenez et al. (2024), "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?"

What it evaluates: The ability of agents to resolve real GitHub issues in popular Python repositories (Django, Flask, scikit-learn, sympy, etc.).

How it works:

Each task is a real GitHub issue with a corresponding pull request that fixes it.
The agent is given the issue description and the repository at the commit before the fix.
The agent must modify the code to resolve the issue.
Success is measured by whether the repository's test suite passes after the agent's modifications.

Key characteristics:

2,294 task instances from 12 Python repositories.
Requires understanding codebases of 10,000+ lines.
Tests are objective: either they pass or they do not.
SWE-bench Lite is a curated subset of 300 easier instances.

State of the art (as of early 2025): Top agents resolve approximately 40-50% of SWE-bench Lite tasks.

GAIA

Paper: Mialon et al. (2023), "GAIA: A Benchmark for General AI Assistants."

What it evaluates: General-purpose assistant capabilities requiring multi-step reasoning, tool use, and web browsing.

How it works:

466 questions of increasing difficulty (Levels 1, 2, 3).
Questions are designed so that humans can answer them easily but AI systems need tools (calculator, web search, file reading).
Each question has a single, unambiguous correct answer.
Level 1: Requires 1-2 steps. Level 3: Requires 10+ steps with complex reasoning.

Example (Level 2): "What was the closing stock price of Apple on the day the first iPhone was announced?" (Requires knowing the announcement date and looking up historical stock data.)

Key insight: Even with tool access, achieving human-level performance on GAIA remains an open challenge.

AgentBench

Paper: Liu et al. (2024), "AgentBench: Evaluating LLMs as Agents."

What it evaluates: Agent capabilities across 8 different environments:

Operating System (bash commands)
Database (SQL queries)
Knowledge Graph (SPARQL queries)
Digital Card Game
Lateral Thinking Puzzles
House-Holding (ALFWorld)
Web Shopping (WebShop)
Web Browsing

Key insight: Performance varies dramatically across environments. An agent that excels at coding may struggle with web navigation.

WebArena

Paper: Zhou et al. (2024), "WebArena: A Realistic Web Environment for Building Autonomous Agents."

What it evaluates: The ability of agents to complete realistic tasks on real websites.

How it works:

Self-hosted replicas of real websites (Reddit, GitLab, shopping sites, CMS platforms, maps).
812 tasks across the hosted websites.
Tasks require navigation, form filling, information retrieval, and multi-step interactions.
Evaluation checks the final state of the environment (e.g., was the post actually created? was the item added to the cart?).

Example task: "Post a comment on the latest discussion thread in the Machine Learning subreddit saying 'Great insight, thanks for sharing!'"

HumanEval

Paper: Chen et al. (2021), "Evaluating Large Language Models Trained on Code."

What it evaluates: Code generation from docstrings.

How it works:

164 Python programming problems with function signatures and docstrings.
The model generates the function body.
Evaluation is pass@k: the probability that at least one of k generated solutions passes all test cases.

Why it matters for agents: HumanEval tests the code generation capability that underpins coding agents. While it evaluates generation rather than agency, it is a foundational benchmark for code-capable agents.

Comparison of Benchmarks

Benchmark	Domain	Tasks	Evaluation	Requires Tools
SWE-bench	Software engineering	2,294	Test suite pass/fail	Code editor, terminal
GAIA	General assistant	466	Exact answer match	Web, calculator, files
AgentBench	8 environments	1,000+	Environment-specific	Various
WebArena	Web browsing	812	Environment state	Web browser
HumanEval	Code generation	164	Test case pass@k	None (pure generation)

065. LLM-as-Judge

The Concept

When gold-standard references are unavailable or when evaluating open-ended outputs, another LLM can serve as the evaluator (judge). This approach was formalized by Zheng et al. (2023) in their work on MT-Bench and Chatbot Arena.

The idea is both appealing and unsettling: using an AI to evaluate another AI. The appeal is scalability -- you can evaluate thousands of outputs automatically. The concern is circularity -- if the judge has the same biases as the agent being evaluated, those biases will be invisible to the evaluation. It is like asking a student to grade their own exam.

Despite these concerns, LLM-as-judge has become the dominant approach for evaluating open-ended agent outputs because the alternative (human evaluation) is too slow and expensive for most development workflows. The key is to understand and mitigate the known biases.

Common Misconception: "LLM-as-judge is unreliable." Research shows that strong LLM judges (like GPT-4) correlate well with human judgment on most evaluation dimensions. The correlation is not perfect, but it is good enough to be useful for development and iteration. The best practice is to use LLM-as-judge for rapid iteration and human evaluation for final validation.

Basic LLM-as-Judge

python

JUDGE_PROMPT = """You are an expert judge evaluating the quality of an AI agent's
response to a task.

Task: {task}

Agent's response:
{response}

Evaluate on a scale of 1-10 across these dimensions:
1. **Correctness**: Is the response factually accurate and logically sound?
2. **Completeness**: Does the response fully address all aspects of the task?
3. **Clarity**: Is the response well-organized and easy to understand?
4. **Usefulness**: Would this response be practically helpful to the user?

For each dimension, provide:
- Score (1-10)
- Brief justification

Finally, provide an overall score (1-10).
"""


def llm_as_judge(
    task: str,
    response: str,
    llm_call,
    judge_model: str = "gpt-4",
) -> dict:
    """Use an LLM to evaluate an agent's response.

    Args:
        task: The original task description.
        response: The agent's response to evaluate.
        llm_call: Function to call the judge LLM.
        judge_model: Model to use as judge.

    Returns:
        Dict with dimension scores and overall score.
    """
    prompt = JUDGE_PROMPT.format(task=task, response=response)
    judgment = llm_call(model=judge_model, prompt=prompt)
    return parse_judgment(judgment)


def parse_judgment(judgment: str) -> dict:
    """Parse a judge's response into structured scores."""
    dimensions = ["correctness", "completeness", "clarity", "usefulness"]
    scores = {}

    for dim in dimensions:
        for line in judgment.split("\n"):
            if dim.lower() in line.lower() and any(c.isdigit() for c in line):
                digits = [int(c) for c in line if c.isdigit()]
                if digits:
                    scores[dim] = min(digits[0], 10) / 10.0  # Normalize to 0-1
                    break

    # Extract overall score
    for line in judgment.split("\n"):
        if "overall" in line.lower() and any(c.isdigit() for c in line):
            digits = [int(c) for c in line if c.isdigit()]
            if digits:
                scores["overall"] = min(digits[0], 10) / 10.0
                break

    if "overall" not in scores:
        scores["overall"] = sum(scores.values()) / len(scores) if scores else 0.5

    scores["raw_judgment"] = judgment
    return scores

Pairwise Comparison

Instead of absolute scoring (which is subjective), compare two responses head-to-head:

python

PAIRWISE_PROMPT = """Compare the following two responses to the same task.

Task: {task}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Consider correctness, completeness, clarity, and
usefulness.

Respond with:
- "A" if Response A is clearly better
- "B" if Response B is clearly better
- "TIE" if they are roughly equivalent

Then explain your reasoning in 2-3 sentences."""


def pairwise_compare(
    task: str,
    response_a: str,
    response_b: str,
    llm_call,
    num_comparisons: int = 3,
) -> dict:
    """Compare two responses with position debiasing.

    Runs the comparison multiple times, swapping positions to reduce
    position bias (LLMs tend to prefer the first response).

    Args:
        task: The original task.
        response_a: First response.
        response_b: Second response.
        llm_call: Function to call the judge LLM.
        num_comparisons: Number of comparisons (should be odd).

    Returns:
        Dict with the winner and comparison details.
    """
    a_wins = 0
    b_wins = 0
    ties = 0

    for i in range(num_comparisons):
        # Alternate positions to reduce position bias
        if i % 2 == 0:
            first, second = response_a, response_b
            label_first, label_second = "A", "B"
        else:
            first, second = response_b, response_a
            label_first, label_second = "B", "A"

        prompt = PAIRWISE_PROMPT.format(
            task=task, response_a=first, response_b=second
        )
        result = llm_call(prompt=prompt).strip()

        if result.startswith("A"):
            if label_first == "A":
                a_wins += 1
            else:
                b_wins += 1
        elif result.startswith("B"):
            if label_second == "A":
                a_wins += 1
            else:
                b_wins += 1
        else:
            ties += 1

    if a_wins > b_wins:
        winner = "A"
    elif b_wins > a_wins:
        winner = "B"
    else:
        winner = "TIE"

    return {
        "winner": winner,
        "a_wins": a_wins,
        "b_wins": b_wins,
        "ties": ties,
        "comparisons": num_comparisons,
    }

Known Biases of LLM Judges

LLM-as-judge has several documented biases (Zheng et al., 2023):

Position bias: Tendency to prefer the response presented first.
Verbosity bias: Tendency to prefer longer, more detailed responses.
Self-enhancement bias: Models tend to rate their own outputs higher.
Style bias: Preference for certain writing styles (e.g., bullet points, structured formatting).

Mitigations:

Swap positions and average results (as shown above).
Use a different model as judge than the model being evaluated.
Combine LLM judgment with objective metrics.
Report confidence intervals, not just point estimates.

076. Automated Evaluation Pipelines

End-to-End Evaluation Pipeline

Interactive · Agent Evaluation Pipeline

Evaluation funnel

From breadth to depth

Agent evaluation starts wide and cheap, ends narrow and costly. Click each tier to see what passes through.

Broad coverage

10,000 cases

Cheap automated metrics across the whole set. Catches regressions early.

← BreadthDepth →

A production evaluation pipeline should be automated, reproducible, and comprehensive:

python

"""
An automated evaluation pipeline for AI agents.

This pipeline:
1. Loads test cases from a dataset
2. Runs the agent on each test case (with retry and timeout)
3. Evaluates results using multiple metrics
4. Generates a comprehensive report
"""

import json
import time
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path


@dataclass
class TestCase:
    """A single test case for agent evaluation."""

    id: str
    input: str
    expected_output: str | None = None
    metadata: dict = field(default_factory=dict)
    difficulty: str = "medium"
    category: str = "general"


@dataclass
class EvaluationResult:
    """Result of evaluating an agent on a test case."""

    test_case_id: str
    agent_output: str
    scores: dict[str, float]
    trajectory: list[dict]
    wall_time_seconds: float
    tokens_used: int
    error: str | None = None


class EvaluationPipeline:
    """Automated evaluation pipeline for AI agents.

    Supports:
    - Multiple evaluation metrics (exact match, LLM judge, custom)
    - Retry logic with timeout
    - Parallel evaluation (optional)
    - Comprehensive reporting
    """

    def __init__(
        self,
        agent_fn,
        evaluators: dict[str, callable],
        timeout_seconds: float = 300.0,
        max_retries: int = 2,
    ):
        self.agent_fn = agent_fn
        self.evaluators = evaluators
        self.timeout_seconds = timeout_seconds
        self.max_retries = max_retries
        self.results: list[EvaluationResult] = []

    def load_test_cases(self, path: str) -> list[TestCase]:
        """Load test cases from a JSON file.

        Expected format:
        [
            {
                "id": "test_001",
                "input": "...",
                "expected_output": "...",
                "difficulty": "easy",
                "category": "coding"
            },
            ...
        ]
        """
        with open(path) as f:
            data = json.load(f)
        return [TestCase(**case) for case in data]

    def run_agent(self, test_case: TestCase) -> dict:
        """Run the agent on a single test case with timing and error handling."""
        start_time = time.time()
        trajectory = []
        error = None
        output = ""

        for attempt in range(self.max_retries + 1):
            try:
                result = self.agent_fn(
                    test_case.input,
                    timeout=self.timeout_seconds,
                )
                output = result.get("output", "")
                trajectory = result.get("trajectory", [])
                break
            except TimeoutError:
                error = f"Timeout after {self.timeout_seconds}s (attempt {attempt + 1})"
            except Exception as e:
                error = f"Error: {str(e)} (attempt {attempt + 1})"

        wall_time = time.time() - start_time

        return {
            "output": output,
            "trajectory": trajectory,
            "wall_time": wall_time,
            "tokens_used": result.get("tokens_used", 0) if not error else 0,
            "error": error,
        }

    def evaluate_single(self, test_case: TestCase) -> EvaluationResult:
        """Run and evaluate a single test case."""
        # Run the agent
        run_result = self.run_agent(test_case)

        # Apply all evaluators
        scores = {}
        for eval_name, eval_fn in self.evaluators.items():
            try:
                score = eval_fn(
                    input=test_case.input,
                    output=run_result["output"],
                    expected=test_case.expected_output,
                    trajectory=run_result["trajectory"],
                )
                scores[eval_name] = score
            except Exception as e:
                scores[eval_name] = 0.0
                print(f"  Evaluator '{eval_name}' failed: {e}")

        return EvaluationResult(
            test_case_id=test_case.id,
            agent_output=run_result["output"],
            scores=scores,
            trajectory=run_result["trajectory"],
            wall_time_seconds=run_result["wall_time"],
            tokens_used=run_result["tokens_used"],
            error=run_result["error"],
        )

    def run_evaluation(self, test_cases: list[TestCase]) -> dict:
        """Run the full evaluation pipeline.

        Args:
            test_cases: List of test cases to evaluate.

        Returns:
            Comprehensive evaluation report.
        """
        print(f"Starting evaluation: {len(test_cases)} test cases")
        print(f"Evaluators: {list(self.evaluators.keys())}")
        print("=" * 60)

        self.results = []
        for i, test_case in enumerate(test_cases):
            print(f"[{i+1}/{len(test_cases)}] Evaluating {test_case.id}...", end=" ")
            result = self.evaluate_single(test_case)
            self.results.append(result)

            if result.error:
                print(f"ERROR: {result.error}")
            else:
                score_str = ", ".join(
                    f"{k}: {v:.2f}" for k, v in result.scores.items()
                )
                print(f"Scores: {score_str} ({result.wall_time_seconds:.1f}s)")

        report = self.generate_report(test_cases)
        return report

    def generate_report(self, test_cases: list[TestCase]) -> dict:
        """Generate a comprehensive evaluation report."""
        if not self.results:
            return {"error": "No results to report."}

        # Overall metrics
        all_scores = {}
        for eval_name in self.evaluators:
            scores = [r.scores.get(eval_name, 0.0) for r in self.results]
            all_scores[eval_name] = {
                "mean": sum(scores) / len(scores),
                "min": min(scores),
                "max": max(scores),
                "std": (sum((s - sum(scores)/len(scores))**2 for s in scores) / len(scores)) ** 0.5,
            }

        # Error rate
        errors = [r for r in self.results if r.error]

        # Timing
        times = [r.wall_time_seconds for r in self.results]
        tokens = [r.tokens_used for r in self.results]

        # Per-category breakdown
        categories = {}
        for tc, result in zip(test_cases, self.results):
            cat = tc.category
            if cat not in categories:
                categories[cat] = []
            categories[cat].append(result)

        category_scores = {}
        for cat, cat_results in categories.items():
            for eval_name in self.evaluators:
                scores = [r.scores.get(eval_name, 0.0) for r in cat_results]
                key = f"{cat}/{eval_name}"
                category_scores[key] = sum(scores) / len(scores) if scores else 0.0

        # Per-difficulty breakdown
        difficulties = {}
        for tc, result in zip(test_cases, self.results):
            diff = tc.difficulty
            if diff not in difficulties:
                difficulties[diff] = []
            difficulties[diff].append(result)

        difficulty_scores = {}
        for diff, diff_results in difficulties.items():
            for eval_name in self.evaluators:
                scores = [r.scores.get(eval_name, 0.0) for r in diff_results]
                key = f"{diff}/{eval_name}"
                difficulty_scores[key] = sum(scores) / len(scores) if scores else 0.0

        report = {
            "timestamp": datetime.now().isoformat(),
            "total_test_cases": len(self.results),
            "overall_scores": all_scores,
            "error_count": len(errors),
            "error_rate": len(errors) / len(self.results),
            "timing": {
                "mean_seconds": sum(times) / len(times),
                "total_seconds": sum(times),
                "max_seconds": max(times),
            },
            "tokens": {
                "mean": sum(tokens) / len(tokens) if tokens else 0,
                "total": sum(tokens),
            },
            "by_category": category_scores,
            "by_difficulty": difficulty_scores,
        }

        return report

    def save_report(self, report: dict, path: str) -> None:
        """Save the evaluation report to a JSON file."""
        with open(path, "w") as f:
            json.dump(report, f, indent=2)
        print(f"Report saved to {path}")

087. Human Evaluation

When Human Evaluation Is Necessary

LLM-as-judge and automated metrics have limitations. Human evaluation remains necessary for:

Novel domains where no benchmark exists.
Subjective quality assessment (creativity, tone, persuasiveness).
Safety evaluation (detecting harmful outputs that automated checks miss).
Calibrating automated metrics (ensuring they correlate with human judgment).

Annotation Protocol Design

A well-designed annotation protocol includes:

python

ANNOTATION_GUIDELINES = """
## Agent Response Evaluation Guidelines

### Task
You will evaluate AI agent responses to user tasks. For each response,
rate the following dimensions on a 1-5 scale.

### Dimensions

**1. Task Completion (1-5)**
- 5: Fully completes the task with no omissions
- 4: Mostly complete with minor omissions
- 3: Partially complete; addresses the main point but misses aspects
- 2: Minimally addresses the task
- 1: Does not address the task at all

**2. Accuracy (1-5)**
- 5: All information is accurate
- 4: Nearly all accurate, one minor error
- 3: Mostly accurate, some errors
- 2: Multiple significant errors
- 1: Predominantly inaccurate

**3. Helpfulness (1-5)**
- 5: Exceptionally helpful; would fully satisfy the user
- 4: Helpful; addresses the user's needs well
- 3: Somewhat helpful; provides partial value
- 2: Minimally helpful
- 1: Not helpful at all

### Instructions
- Read the task description carefully before evaluating.
- Evaluate the response independently (do not compare to other responses).
- Provide a brief justification (1-2 sentences) for each score.
- If you are unsure, err on the side of a lower score.
- Flag any responses that contain harmful, biased, or inappropriate content.
"""

Inter-Rater Reliability

When multiple annotators evaluate the same outputs, their agreement must be measured:

python

def cohens_kappa(rater1: list[int], rater2: list[int]) -> float:
    """Compute Cohen's Kappa for inter-rater reliability.

    Cohen's Kappa measures agreement between two raters beyond
    what would be expected by chance.

    Interpretation:
    - < 0.00: Poor agreement
    - 0.00-0.20: Slight agreement
    - 0.21-0.40: Fair agreement
    - 0.41-0.60: Moderate agreement
    - 0.61-0.80: Substantial agreement
    - 0.81-1.00: Almost perfect agreement

    Args:
        rater1: List of ratings from rater 1.
        rater2: List of ratings from rater 2.

    Returns:
        Cohen's Kappa coefficient.
    """
    assert len(rater1) == len(rater2), "Raters must evaluate the same items"

    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))

    # Observed agreement
    agreement = sum(1 for a, b in zip(rater1, rater2) if a == b)
    p_observed = agreement / n

    # Expected agreement by chance
    p_expected = 0.0
    for cat in categories:
        p1 = sum(1 for r in rater1 if r == cat) / n
        p2 = sum(1 for r in rater2 if r == cat) / n
        p_expected += p1 * p2

    # Kappa
    if p_expected == 1.0:
        return 1.0  # Perfect agreement
    return (p_observed - p_expected) / (1 - p_expected)


def krippendorffs_alpha(ratings: list[list[int | None]], level: str = "ordinal") -> float:
    """Compute Krippendorff's Alpha for multiple raters.

    Krippendorff's Alpha supports:
    - Any number of raters
    - Missing data (None values)
    - Different measurement levels (nominal, ordinal, interval, ratio)

    This is a simplified implementation for nominal data.

    Args:
        ratings: Matrix where ratings[i][j] is rater i's rating for item j.
                 None indicates a missing rating.
        level: Measurement level ("nominal" supported in this implementation).

    Returns:
        Krippendorff's Alpha coefficient.
    """
    n_raters = len(ratings)
    n_items = len(ratings[0])

    # Collect observed disagreement
    observed_disagreement = 0.0
    n_pairs = 0

    for j in range(n_items):
        item_ratings = [ratings[i][j] for i in range(n_raters) if ratings[i][j] is not None]
        m = len(item_ratings)
        if m < 2:
            continue
        for a_idx in range(m):
            for b_idx in range(a_idx + 1, m):
                if item_ratings[a_idx] != item_ratings[b_idx]:
                    observed_disagreement += 1
                n_pairs += 1

    if n_pairs == 0:
        return 1.0

    d_observed = observed_disagreement / n_pairs

    # Expected disagreement
    all_ratings = [r for rater in ratings for item_ratings in zip(*ratings) for r in [rater[ratings[0].index(item_ratings[0])]] if r is not None]
    all_valid = [r for rater in ratings for r in rater if r is not None]
    n_total = len(all_valid)
    value_counts = {}
    for v in all_valid:
        value_counts[v] = value_counts.get(v, 0) + 1

    d_expected = 0.0
    values = list(value_counts.keys())
    for i, v1 in enumerate(values):
        for j, v2 in enumerate(values):
            if v1 != v2:
                d_expected += value_counts[v1] * value_counts[v2]
    d_expected /= (n_total * (n_total - 1))

    if d_expected == 0:
        return 1.0

    return 1.0 - (d_observed / d_expected)

098. Cost and Latency Metrics

Why Cost and Latency Matter

There is a saying in engineering: "Fast, cheap, good -- pick two." Agent evaluation traditionally focuses on "good" (accuracy, quality) while ignoring "fast" (latency) and "cheap" (cost). In production, all three matter.

Key Insight: Cost and latency are not secondary metrics -- they are first-class evaluation criteria. An agent that achieves 95% accuracy but takes 5 minutes and costs $2 per interaction may be less practical than one that achieves 85% accuracy in 10 seconds for$ 0.05. The right trade-off depends on the use case, but you cannot make that trade-off if you do not measure all three dimensions.

Academic benchmarks focus on task success, but production deployments must also consider:

Cost: How much does each agent interaction cost in terms of API calls, tokens, and compute?
Latency: How long does the user wait for a response?
Throughput: How many concurrent requests can the system handle?

An agent that achieves 95% accuracy but takes 5 minutes and costs $2 per interaction may be less practical than one that achieves 85% accuracy in 10 seconds for$ 0.05.

Cost Tracking

python

# Approximate pricing (as of early 2025)
MODEL_PRICING = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},     # per million tokens
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
}


@dataclass
class CostMetrics:
    """Track costs for an agent interaction."""

    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: int = 0
    tool_cost: float = 0.0  # External API costs

    @property
    def llm_cost(self) -> float:
        """Compute LLM API cost in USD."""
        pricing = MODEL_PRICING.get(self.model, {"input": 0, "output": 0})
        input_cost = (self.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (self.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    @property
    def total_cost(self) -> float:
        return self.llm_cost + self.tool_cost

    def __repr__(self) -> str:
        return (
            f"CostMetrics(model={self.model}, "
            f"tokens={self.input_tokens}+{self.output_tokens}, "
            f"llm_cost=${self.llm_cost:.4f}, "
            f"total=${self.total_cost:.4f})"
        )


def cost_adjusted_score(
    task_score: float,
    cost: float,
    max_acceptable_cost: float = 1.0,
    cost_penalty_weight: float = 0.2,
) -> float:
    """Compute a cost-adjusted performance score.

    Penalizes agents that are expensive relative to their accuracy.

    Args:
        task_score: Raw task performance score (0-1).
        cost: Cost of the interaction in USD.
        max_acceptable_cost: Cost threshold above which penalty increases.
        cost_penalty_weight: How much to weight cost penalty (0-1).

    Returns:
        Adjusted score.
    """
    cost_ratio = min(cost / max_acceptable_cost, 2.0)  # Cap at 2x
    cost_penalty = cost_ratio * cost_penalty_weight
    return max(0.0, task_score - cost_penalty)

Latency Analysis

python

@dataclass
class LatencyMetrics:
    """Track latency for an agent interaction."""

    time_to_first_token: float = 0.0     # Seconds until first output
    time_to_completion: float = 0.0       # Total interaction time
    llm_call_times: list[float] = field(default_factory=list)
    tool_call_times: list[float] = field(default_factory=list)

    @property
    def total_llm_time(self) -> float:
        return sum(self.llm_call_times)

    @property
    def total_tool_time(self) -> float:
        return sum(self.tool_call_times)

    @property
    def overhead_time(self) -> float:
        """Time spent on orchestration (not LLM or tools)."""
        return self.time_to_completion - self.total_llm_time - self.total_tool_time

    def summary(self) -> dict:
        return {
            "total_seconds": self.time_to_completion,
            "time_to_first_token": self.time_to_first_token,
            "llm_time": self.total_llm_time,
            "tool_time": self.total_tool_time,
            "overhead_time": self.overhead_time,
            "num_llm_calls": len(self.llm_call_times),
            "num_tool_calls": len(self.tool_call_times),
            "avg_llm_call_time": (
                self.total_llm_time / len(self.llm_call_times)
                if self.llm_call_times else 0
            ),
        }

The Cost-Quality Pareto Frontier

When comparing agents, plot cost vs. quality to find the Pareto frontier: the set of agents where no other agent is both cheaper and better. This is one of the most powerful tools for making deployment decisions.

The concept is borrowed from economics. Imagine plotting every available agent on a graph with cost on the x-axis and quality on the y-axis. The Pareto frontier is the "outer boundary" of points where you cannot improve quality without increasing cost, or reduce cost without sacrificing quality. Agents on the frontier are the rational choices; agents below the frontier are dominated (there exists another agent that is both better and cheaper).

Try It Yourself: Imagine you have five agents: A (quality=0.95, cost= $1.50), B (quality=0.90, cost=$ 0.80), C (quality=0.85, cost= $0.40), D (quality=0.80, cost=$ 0.30), E (quality=0.70, cost=$0.90). Which agents are on the Pareto frontier? Which are dominated? (Hint: E is dominated by both C and D.)

python

def find_pareto_frontier(
    agents: list[dict],
) -> list[dict]:
    """Find the Pareto-optimal agents (cost vs. quality).

    An agent is Pareto-optimal if no other agent is both cheaper
    and higher quality.

    Args:
        agents: List of dicts with 'name', 'quality', and 'cost' keys.

    Returns:
        List of Pareto-optimal agents.
    """
    # Sort by cost (ascending)
    sorted_agents = sorted(agents, key=lambda a: a["cost"])

    pareto = []
    best_quality = -1

    for agent in sorted_agents:
        if agent["quality"] > best_quality:
            pareto.append(agent)
            best_quality = agent["quality"]

    return pareto

109. Evaluation Frameworks and Best Practices

Best Practices for Agent Evaluation

Define clear success criteria before building the agent. Vague goals lead to vague evaluation.
Use multiple evaluation methods. No single metric captures all aspects of agent quality. Combine:
- Automated metrics (exact match, pass@k, BLEU) for efficiency.
- LLM-as-judge for open-ended assessment.
- Human evaluation for final validation.
Report variance, not just means. Run evaluations multiple times and report standard deviations and confidence intervals.
Evaluate on held-out data. Ensure the agent has not seen the test cases during development.
Test edge cases and adversarial inputs. Agents often fail on unusual inputs, contradictory instructions, or attempts at manipulation.
Track cost and latency alongside accuracy. A 2% accuracy improvement that triples cost may not be worthwhile.
Version your evaluations. Track which version of the agent was evaluated with which version of the test suite.
Include failure analysis. Understanding why an agent fails is more valuable than knowing how often it fails.

Evaluation Anti-Patterns

Evaluating on the training distribution only: Test on diverse, out-of-distribution inputs.
Cherry-picking examples: Always report aggregate metrics, not selected successes.
Ignoring the cost axis: An infinitely expensive agent is not useful.
Evaluating once: Single-run results are unreliable for stochastic agents.
Conflating capability with reliability: An agent that can solve a task 30% of the time is different from one that solves it 95% of the time.

1110. Practical Example: Building an Automated Evaluation Harness

Let us build a complete evaluation harness that can assess an agent across multiple dimensions.

python

"""
A complete automated evaluation harness for AI agents.

This harness:
1. Loads test cases from a JSON file
2. Runs the agent on each test case
3. Applies multiple evaluation metrics
4. Generates a detailed report with breakdowns

Requirements:
    pip install sentence-transformers
"""

import json
import time
from dataclasses import dataclass, field
from datetime import datetime


# ── Evaluator Functions ────────────────────────────────────────────

def exact_match_evaluator(
    input: str, output: str, expected: str | None, **kwargs
) -> float:
    """Check if output exactly matches expected output.

    Case-insensitive, whitespace-normalized comparison.
    """
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_evaluator(
    input: str, output: str, expected: str | None, **kwargs
) -> float:
    """Check if the expected answer is contained in the output."""
    if expected is None:
        return 0.0
    return 1.0 if expected.strip().lower() in output.strip().lower() else 0.0


def semantic_similarity_evaluator(
    input: str, output: str, expected: str | None, **kwargs
) -> float:
    """Compute semantic similarity between output and expected.

    Uses sentence-transformers for embedding-based comparison.
    """
    if expected is None:
        return 0.0

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Use a cached model in production
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([output, expected])
    similarity = float(np.dot(embeddings[0], embeddings[1]) / (
        np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]) + 1e-8
    ))
    return max(0.0, similarity)  # Clamp negative similarities


def step_efficiency_evaluator(
    input: str, output: str, expected: str | None,
    trajectory: list[dict] | None = None, **kwargs
) -> float:
    """Evaluate how efficiently the agent reached its answer.

    Penalizes excessive steps, errors, and backtracking.
    """
    if not trajectory:
        return 0.5  # Unknown efficiency

    n_steps = len(trajectory)
    n_errors = sum(1 for step in trajectory if step.get("status") == "error")
    n_retries = sum(
        1 for i in range(1, len(trajectory))
        if trajectory[i].get("action") == trajectory[i-1].get("action")
    )

    # Score: penalize for excessive steps, errors, and retries
    step_penalty = max(0, n_steps - 5) * 0.05  # Penalty after 5 steps
    error_penalty = n_errors * 0.15
    retry_penalty = n_retries * 0.10

    score = max(0.0, 1.0 - step_penalty - error_penalty - retry_penalty)
    return score


# ── Evaluation Harness ─────────────────────────────────────────────

class AgentEvaluationHarness:
    """A comprehensive evaluation harness for AI agents.

    Features:
    - Multiple evaluator support
    - Detailed per-test-case results
    - Aggregate statistics with breakdowns by category and difficulty
    - Reproducible evaluation with seeded runs
    """

    def __init__(self, agent_fn):
        self.agent_fn = agent_fn
        self.evaluators = {
            "exact_match": exact_match_evaluator,
            "contains": contains_evaluator,
            "efficiency": step_efficiency_evaluator,
        }
        self.results = []

    def add_evaluator(self, name: str, evaluator_fn) -> None:
        """Register a custom evaluator."""
        self.evaluators[name] = evaluator_fn

    def run(
        self,
        test_cases: list[dict],
        num_runs: int = 1,
        verbose: bool = True,
    ) -> dict:
        """Run the evaluation harness.

        Args:
            test_cases: List of test case dicts.
            num_runs: Number of times to run each test case (for variance).
            verbose: Whether to print progress.

        Returns:
            Comprehensive evaluation report.
        """
        if verbose:
            print(f"Evaluation started: {len(test_cases)} cases, {num_runs} run(s) each")
            print(f"Evaluators: {list(self.evaluators.keys())}")
            print("=" * 70)

        self.results = []

        for i, test_case in enumerate(test_cases):
            case_id = test_case.get("id", f"case_{i}")
            if verbose:
                print(f"\n[{i+1}/{len(test_cases)}] {case_id}: {test_case['input'][:60]}...")

            run_results = []
            for run in range(num_runs):
                # Run the agent
                start = time.time()
                try:
                    agent_result = self.agent_fn(test_case["input"])
                    output = agent_result if isinstance(agent_result, str) else agent_result.get("output", "")
                    trajectory = agent_result.get("trajectory", []) if isinstance(agent_result, dict) else []
                    error = None
                except Exception as e:
                    output = ""
                    trajectory = []
                    error = str(e)
                elapsed = time.time() - start

                # Apply evaluators
                scores = {}
                for eval_name, eval_fn in self.evaluators.items():
                    try:
                        score = eval_fn(
                            input=test_case["input"],
                            output=output,
                            expected=test_case.get("expected_output"),
                            trajectory=trajectory,
                        )
                        scores[eval_name] = score
                    except Exception as e:
                        scores[eval_name] = 0.0

                run_results.append({
                    "output": output,
                    "scores": scores,
                    "wall_time": elapsed,
                    "error": error,
                })

            # Aggregate across runs
            avg_scores = {}
            for eval_name in self.evaluators:
                run_scores = [r["scores"].get(eval_name, 0.0) for r in run_results]
                avg_scores[eval_name] = {
                    "mean": sum(run_scores) / len(run_scores),
                    "std": (sum((s - sum(run_scores)/len(run_scores))**2 for s in run_scores) / len(run_scores)) ** 0.5 if num_runs > 1 else 0.0,
                }

            case_result = {
                "test_case_id": case_id,
                "category": test_case.get("category", "general"),
                "difficulty": test_case.get("difficulty", "medium"),
                "scores": avg_scores,
                "num_runs": num_runs,
                "avg_wall_time": sum(r["wall_time"] for r in run_results) / num_runs,
                "error_rate": sum(1 for r in run_results if r["error"]) / num_runs,
                "sample_output": run_results[0]["output"][:300],
            }
            self.results.append(case_result)

            if verbose:
                score_str = " | ".join(
                    f"{k}: {v['mean']:.2f}" for k, v in avg_scores.items()
                )
                print(f"  Scores: {score_str} ({case_result['avg_wall_time']:.1f}s)")

        # Generate report
        report = self._generate_report()

        if verbose:
            print("\n" + "=" * 70)
            print("EVALUATION SUMMARY")
            print("=" * 70)
            self._print_report(report)

        return report

    def _generate_report(self) -> dict:
        """Generate the evaluation report."""
        # Overall scores
        overall = {}
        for eval_name in self.evaluators:
            means = [r["scores"][eval_name]["mean"] for r in self.results]
            overall[eval_name] = {
                "mean": sum(means) / len(means),
                "std": (sum((m - sum(means)/len(means))**2 for m in means) / len(means)) ** 0.5,
                "min": min(means),
                "max": max(means),
            }

        # By category
        by_category = {}
        for result in self.results:
            cat = result["category"]
            if cat not in by_category:
                by_category[cat] = []
            by_category[cat].append(result)

        category_scores = {}
        for cat, cat_results in by_category.items():
            category_scores[cat] = {}
            for eval_name in self.evaluators:
                means = [r["scores"][eval_name]["mean"] for r in cat_results]
                category_scores[cat][eval_name] = sum(means) / len(means)
            category_scores[cat]["count"] = len(cat_results)

        # By difficulty
        by_difficulty = {}
        for result in self.results:
            diff = result["difficulty"]
            if diff not in by_difficulty:
                by_difficulty[diff] = []
            by_difficulty[diff].append(result)

        difficulty_scores = {}
        for diff, diff_results in by_difficulty.items():
            difficulty_scores[diff] = {}
            for eval_name in self.evaluators:
                means = [r["scores"][eval_name]["mean"] for r in diff_results]
                difficulty_scores[diff][eval_name] = sum(means) / len(means)
            difficulty_scores[diff]["count"] = len(diff_results)

        # Timing
        times = [r["avg_wall_time"] for r in self.results]

        return {
            "timestamp": datetime.now().isoformat(),
            "total_cases": len(self.results),
            "overall_scores": overall,
            "by_category": category_scores,
            "by_difficulty": difficulty_scores,
            "timing": {
                "mean_seconds": sum(times) / len(times),
                "total_seconds": sum(times),
                "max_seconds": max(times),
            },
            "error_rate": sum(r["error_rate"] for r in self.results) / len(self.results),
        }

    def _print_report(self, report: dict) -> None:
        """Pretty-print the evaluation report."""
        print(f"\nTotal test cases: {report['total_cases']}")
        print(f"Error rate: {report['error_rate']:.1%}")
        print(f"Mean time per case: {report['timing']['mean_seconds']:.1f}s")

        print("\nOverall Scores:")
        for metric, stats in report["overall_scores"].items():
            print(f"  {metric}: {stats['mean']:.3f} (std: {stats['std']:.3f}, range: {stats['min']:.2f}-{stats['max']:.2f})")

        if report["by_category"]:
            print("\nBy Category:")
            for cat, scores in report["by_category"].items():
                count = scores.pop("count", "?")
                score_str = ", ".join(f"{k}: {v:.2f}" for k, v in scores.items())
                print(f"  {cat} (n={count}): {score_str}")

        if report["by_difficulty"]:
            print("\nBy Difficulty:")
            for diff, scores in report["by_difficulty"].items():
                count = scores.pop("count", "?")
                score_str = ", ".join(f"{k}: {v:.2f}" for k, v in scores.items())
                print(f"  {diff} (n={count}): {score_str}")


# ── Usage Example ─────────────────────────────────────────────────

def main():
    """Demonstrate the evaluation harness."""

    # A simple mock agent for demonstration
    def mock_agent(input_text: str) -> dict:
        """A mock agent that returns predefined answers."""
        answers = {
            "capital of france": "The capital of France is Paris.",
            "2 + 2": "4",
            "sort": "def sort_list(lst): return sorted(lst)",
        }

        output = "I don't know."
        for key, answer in answers.items():
            if key in input_text.lower():
                output = answer
                break

        return {
            "output": output,
            "trajectory": [
                {"action": "think", "content": "Processing query..."},
                {"action": "respond", "content": output},
            ],
        }

    # Define test cases
    test_cases = [
        {
            "id": "geo_001",
            "input": "What is the capital of France?",
            "expected_output": "Paris",
            "category": "geography",
            "difficulty": "easy",
        },
        {
            "id": "math_001",
            "input": "What is 2 + 2?",
            "expected_output": "4",
            "category": "math",
            "difficulty": "easy",
        },
        {
            "id": "code_001",
            "input": "Write a Python function to sort a list.",
            "expected_output": "def sort_list(lst): return sorted(lst)",
            "category": "coding",
            "difficulty": "medium",
        },
        {
            "id": "geo_002",
            "input": "What is the capital of Bhutan?",
            "expected_output": "Thimphu",
            "category": "geography",
            "difficulty": "hard",
        },
        {
            "id": "reason_001",
            "input": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
            "expected_output": "No",
            "category": "reasoning",
            "difficulty": "hard",
        },
    ]

    # Create and run harness
    harness = AgentEvaluationHarness(mock_agent)
    report = harness.run(test_cases, num_runs=1, verbose=True)

    # Save report
    print(f"\nFull report:")
    print(json.dumps(report, indent=2, default=str))


if __name__ == "__main__":
    main()

Expected Output

text

Evaluation started: 5 cases, 1 run(s) each
Evaluators: ['exact_match', 'contains', 'efficiency']
======================================================================

[1/5] geo_001: What is the capital of France?...
  Scores: exact_match: 0.00 | contains: 1.00 | efficiency: 1.00 (0.0s)

[2/5] math_001: What is 2 + 2?...
  Scores: exact_match: 1.00 | contains: 1.00 | efficiency: 1.00 (0.0s)

[3/5] code_001: Write a Python function to sort a list....
  Scores: exact_match: 1.00 | contains: 1.00 | efficiency: 1.00 (0.0s)

[4/5] geo_002: What is the capital of Bhutan?...
  Scores: exact_match: 0.00 | contains: 0.00 | efficiency: 1.00 (0.0s)

[5/5] reason_001: If all roses are flowers and some flowers fade quickly, ...
  Scores: exact_match: 0.00 | contains: 0.00 | efficiency: 1.00 (0.0s)

======================================================================
EVALUATION SUMMARY
======================================================================

Total test cases: 5
Error rate: 0.0%
Mean time per case: 0.0s

Overall Scores:
  exact_match: 0.400 (std: 0.490, range: 0.00-1.00)
  contains: 0.600 (std: 0.490, range: 0.00-1.00)
  efficiency: 1.000 (std: 0.000, range: 1.00-1.00)

By Category:
  geography (n=2): exact_match: 0.00, contains: 0.50
  math (n=1): exact_match: 1.00, contains: 1.00
  coding (n=1): exact_match: 1.00, contains: 1.00
  reasoning (n=1): exact_match: 0.00, contains: 0.00

By Difficulty:
  easy (n=2): exact_match: 0.50, contains: 1.00
  medium (n=1): exact_match: 1.00, contains: 1.00
  hard (n=2): exact_match: 0.00, contains: 0.00

12Discussion Questions

Goodhart's Law in agent evaluation: "When a measure becomes a target, it ceases to be a good measure." How does this apply to agent benchmarks? Can agents learn to game specific benchmarks? Hint: consider how SWE-bench agents might learn to focus on test-passing rather than actually fixing the underlying issue. What would a "truly solved" GitHub issue look like versus one that merely passes the test suite?
The evaluation gap: Current benchmarks test specific capabilities, but real-world agent deployment involves unpredictable interactions. How can we evaluate agents for the "long tail" of real-world scenarios? Hint: consider stress testing with adversarial inputs, evaluating on out-of-distribution tasks, and monitoring production performance over time rather than relying solely on pre-deployment benchmarks.
Evaluating safety: How should we evaluate whether an agent is safe? Is the absence of harmful behavior in testing sufficient, or do we need stronger guarantees? Hint: consider the difference between "we tested 1000 cases and none were harmful" and "we can prove the agent will never take harmful actions." What level of assurance is appropriate for different risk levels?
Human vs. LLM judges: Under what circumstances should we prefer human evaluation over LLM-as-judge? When is LLM-as-judge sufficient? Hint: consider the cost-accuracy trade-off, the domain expertise required, and whether the evaluation requires understanding nuances that LLMs might miss (cultural context, emotional tone, safety concerns).
Cost-quality trade-offs: Is it ethical to deploy a cheaper, less accurate agent when a more expensive, more accurate one exists? How should we think about cost-quality trade-offs in high-stakes domains? Hint: consider the difference between a customer service chatbot (where a 5% error rate is annoying but not harmful) and a medical diagnosis assistant (where a 5% error rate could be dangerous).
Benchmark saturation: What should the community do when agents achieve near-perfect scores on existing benchmarks? How do we design benchmarks that remain challenging as capabilities improve? Hint: look at the history of chess AI and Go AI -- once benchmarks were "solved," the focus shifted to more complex domains. What is the equivalent trajectory for agent benchmarks?

13Summary and Key Takeaways

Agent evaluation is fundamentally harder than model evaluation due to non-determinism, multi-step behaviors, open-ended outputs, and environment interaction. Single metrics are insufficient.
Task-based evaluation measures what matters most (did the agent succeed?) but should be complemented with quality metrics, rubrics, and process evaluation for deeper insight.
Process-based evaluation reveals how the agent works, not just whether it works. Reasoning quality, tool use efficiency, and trajectory analysis provide actionable insights for improvement.
Benchmarks provide standardized comparison but each has limitations. SWE-bench tests coding, GAIA tests general assistance, WebArena tests web interaction. Use benchmarks that match your deployment scenario.
LLM-as-judge enables scalable evaluation of open-ended outputs but has known biases. Position swapping, different judge models, and calibration against human evaluation mitigate these biases.
Human evaluation remains the gold standard for subjective quality and safety. Well-designed annotation protocols with inter-rater reliability measures ensure consistency.
Cost and latency are first-class evaluation metrics for production agents. The Pareto frontier of cost vs. quality guides practical deployment decisions.
Evaluation should be automated, reproducible, and comprehensive. An evaluation harness that combines multiple metrics, tracks variance, and generates detailed reports is essential infrastructure for agent development.

14References

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations (ICLR).
Mialon, G., Dessi, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raber, R., Roziere, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., & Scialom, T. (2023). GAIA: A Benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., & Tang, J. (2024). AgentBench: Evaluating LLMs as Agents. International Conference on Learning Representations (ICLR).
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. International Conference on Learning Representations (ICLR).
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chanez, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36.
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations.
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 36.

Part of "Agentic AI: Foundations, Architectures, and Applications" (CC BY-SA 4.0).