Evaluation, safety, and governance · W12 · 51 min read

Human-Agent Interaction and Oversight

Three paradigms: human-in-the-loop, human-on-the-loop, fully autonomous. Matching autonomy to risk and reversibility. Approval workflow design balancing safety and efficiency. Trust calibration, transparency, feedback channels.

Core concepts: Autonomy spectrum, Trust calibration, Approval workflows

Duration: 2 hours lecture + 1 hour lab
Prerequisites: Weeks 1-11 (foundations through safety and alignment)

Learning Objectives

By the end of this lecture, students will be able to:

Distinguish between human-in-the-loop, human-on-the-loop, and fully autonomous paradigms
Map agent autonomy to a levels framework and justify the appropriate level for a given use case
Design approval workflows that balance safety with efficiency
Apply transparency and explainability principles to agent systems
Reason about trust calibration between humans and agents
Design user experiences for agentic interfaces
Implement feedback mechanisms that improve agent performance over time

1. Human-in-the-Loop vs. Human-on-the-Loop vs. Fully Autonomous

1.1 Why Human Oversight Matters: The Central Tension

Last week, we discussed safety, alignment, and guardrails -- the technical mechanisms that constrain agent behavior. This week, we turn to a complementary and equally important topic: the role of the human in agentic systems.

Here is the central tension: we build agents to automate work, but the most powerful safety mechanism is human judgment. Every step toward greater autonomy is a step away from human oversight. How do we get the benefits of automation without losing the benefits of human judgment?

This is not a new question. Aviation faced it decades ago when autopilot systems became capable enough to fly planes without pilot intervention. The answer aviation arrived at was not "remove the pilot" or "never use autopilot," but rather a carefully designed system of shared responsibility where automation handles routine tasks and humans handle exceptions, emergencies, and judgment calls. The aviation industry's decades of experience with human-automation interaction provides invaluable lessons for agentic AI.

To understand the design space, we need to start with three fundamental paradigms for human-agent interaction.

1.2 The Three Paradigms

Human-in-the-loop (HITL). The human is actively involved in every decision cycle. The agent proposes actions, and the human approves or rejects each one before execution. The agent cannot act without human confirmation.

text

User request --> Agent plans --> Agent proposes action --> Human approves --> Action executes
                                                       --> Human rejects  --> Agent replans

Think of this like a new employee on their first week. They come to you with every decision: "I think we should respond to this customer by offering a 10% discount. Should I send this email?" You review each action, provide feedback, and gradually build trust.

HITL provides maximum safety because every action is reviewed before execution. But it also provides minimum efficiency because the agent is constantly waiting for human approval. The human becomes the bottleneck.
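The HITL cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not a production design: `propose`, `approve`, and `execute` are hypothetical callables standing in for the agent's planner, the human reviewer, and the tool runtime.

```python
from typing import Callable, Optional

History = list[tuple[str, bool]]  # (proposed action, was it approved?)


def hitl_loop(
    propose: Callable[[History], Optional[str]],
    approve: Callable[[str], bool],
    execute: Callable[[str], None],
    max_rounds: int = 5,
) -> History:
    """Propose one action at a time; nothing executes without approval."""
    history: History = []
    for _ in range(max_rounds):
        action = propose(history)   # planner sees earlier rejections and can replan
        if action is None:          # planner has nothing further to propose
            break
        approved = approve(action)  # the human gate: blocks until a decision
        if approved:
            execute(action)
        history.append((action, approved))
    return history
```

The bottleneck is visible in the structure: every iteration waits on `approve`, so throughput is bounded by the reviewer, not the agent.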

Human-on-the-loop (HOTL). The agent acts autonomously, but a human monitors its behavior and can intervene when needed. The human sets policies and boundaries, and the agent operates within them. The human reviews summaries and can override at any time.

text

User request --> Agent plans --> Agent acts autonomously --> Human monitors
                                                         --> Human intervenes if needed

This is like a trusted employee who handles their own workflow but provides regular updates to their manager. The manager can step in at any time but usually does not need to. The employee makes routine decisions independently and escalates unusual or high-stakes decisions.

HOTL balances safety and efficiency. The agent can act quickly on routine tasks while the human focuses on exceptions and high-level oversight. The challenge is designing the monitoring system so the human actually catches problems (more on this later).

Fully autonomous. The agent operates without human oversight during execution. Humans define the initial goal and constraints, then the agent runs to completion. Human involvement is limited to post-hoc review of results.

This is like a freelancer you hire for a project. You define the requirements, they deliver the result, and you review it. You have no visibility into the process while it is running.

Full autonomy provides maximum efficiency but minimum safety during execution. It is appropriate only when the risks of failure are low, the agent is well-tested for the specific task, and adequate post-hoc review is in place.

1.3 Tradeoffs

Paradigm	Safety	Efficiency	User Burden	Scalability
Human-in-the-loop	Highest	Lowest	High	Low
Human-on-the-loop	Medium	Medium	Medium	Medium
Fully autonomous	Lowest	Highest	Low	High

These tradeoffs are fundamental and cannot be entirely eliminated. You can optimize within each paradigm (designing better approval workflows, smarter monitoring dashboards), but you cannot have maximum safety and maximum efficiency simultaneously. This is an engineering tradeoff that must be made deliberately for each use case.

1.4 When to Use Each

HITL is appropriate when:

Actions are irreversible (sending emails, making purchases, deploying code to production)
Errors have high consequences (medical, legal, financial decisions)
Trust in the agent has not been established (new deployment, new task type)
Regulatory requirements mandate human oversight (healthcare, criminal justice)
The agent is operating in a domain where it has shown poor reliability

HOTL is appropriate when:

Actions are mostly reversible or low-stakes
The agent has been validated on similar tasks with a good track record
High throughput is needed but safety cannot be ignored entirely
The human can effectively monitor aggregate behavior (dashboards, summaries)
The cost of occasional errors is manageable

Fully autonomous is appropriate when:

Tasks are well-defined, routine, and low-risk
The agent has extensive track record on this exact type of task
Speed is critical and human oversight would be a bottleneck
Adequate post-hoc auditing is in place to catch errors after the fact
The consequences of failure are bounded and recoverable

1.5 Hybrid Approaches: The Real-World Pattern

In practice, most production systems use hybrid approaches where the level of oversight varies based on the specific action being taken.

This per-action granularity provides the best balance of safety and efficiency. Reading data is safe (no side effects), so it can be autonomous. Writing data has consequences but is often reversible (version control, backups), so monitoring is sufficient. Deleting data and sending external communications are potentially irreversible, so they require explicit approval.

This is exactly the pattern we see in production tools like Claude Code: reading files and searching are autonomous, but writing files and running commands require user approval (which can be configured based on the user's trust level and the specific operation).

Key Insight: The right paradigm is not a global choice. It is a per-action choice that depends on the reversibility, consequence, and risk of each specific action. Design your system so that the oversight level can be configured at the action level, not just at the system level.
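One way to make the oversight level configurable per action is a lookup table keyed by action category. The sketch below is illustrative only: the category names (`read`, `write`, `delete`, `send_external`) are assumptions chosen to match the examples in this section, and a real system would define its own.

```python
from enum import Enum


class Oversight(Enum):
    AUTONOMOUS = "autonomous"         # execute and log
    MONITORED = "human_on_the_loop"   # execute and surface to a monitor
    APPROVAL = "human_in_the_loop"    # block until a human approves


# Hypothetical action categories mapped to oversight levels.
POLICY = {
    "read": Oversight.AUTONOMOUS,
    "write": Oversight.MONITORED,
    "delete": Oversight.APPROVAL,
    "send_external": Oversight.APPROVAL,
}


def oversight_for(action_kind: str) -> Oversight:
    """Look up the oversight level; unknown kinds get the strictest level."""
    return POLICY.get(action_kind, Oversight.APPROVAL)
```

Falling back to `APPROVAL` for unrecognized action kinds keeps the table safe when new tools are added before the policy is updated.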

1.6 Real-World Case Study: How Claude Code Implements Per-Action Oversight

Claude Code provides one of the best real-world examples of per-action oversight granularity. When you use Claude Code, the following rules apply by default:

Autonomous (no approval): Reading files, searching codebases, listing directories. These are read-only operations with no side effects.
Human-in-the-loop (approval required): Writing files, running terminal commands, creating commits. These modify the environment and are potentially irreversible.
Configurable escalation: Users can grant blanket permission for specific operations (e.g., "allow all write operations to .test files") based on their trust level.

This demonstrates a practical principle: the default should be safe, and the user should opt into risk. A new user who has never worked with the agent gets maximum safety. An experienced user who trusts the agent on certain operations can explicitly reduce oversight for those operations.

The lesson for your own agent designs is clear: start with restrictive defaults and allow users to relax them, never the reverse.
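The "restrictive by default, relax by opt-in" principle can be sketched as a permission store that starts empty and only ever grows through explicit user grants. This is a hypothetical illustration, not Claude Code's actual configuration format; the `Permissions` class and the glob-pattern grants are assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Permissions:
    # (operation, glob pattern) pairs the user has explicitly allowed
    grants: set[tuple[str, str]] = field(default_factory=set)

    def allow(self, operation: str, pattern: str) -> None:
        """Record an explicit user opt-in for one operation and path pattern."""
        self.grants.add((operation, pattern))

    def needs_approval(self, operation: str, target: str) -> bool:
        if operation == "read":
            return False  # read-only operations have no side effects
        # Everything else requires approval unless a grant covers it
        return not any(
            op == operation and fnmatch(target, pattern)
            for op, pattern in self.grants
        )


perms = Permissions()
perms.allow("write", "*_test.py")
print(perms.needs_approval("write", "src/app_test.py"))  # covered by the grant
print(perms.needs_approval("write", "src/app.py"))       # still gated
```

There is deliberately no method for removing the approval requirement globally; each relaxation names a specific operation and scope.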

Try It Yourself: Classify Oversight Levels

For each of the following agent actions, decide whether it should be autonomous, human-on-the-loop, or human-in-the-loop. Justify your choice:

An agent reading a customer's account information to answer their question
An agent updating a customer's email address in the database
An agent sending a password reset email to a customer
An agent refunding a payment of $500
An agent deleting a customer's account
An agent drafting an internal summary of the day's support tickets
An agent posting a response on social media on behalf of the company

2. Levels of Agent Autonomy

2.1 Drawing from Self-Driving Car Levels

The SAE International standard defines six levels of driving automation (SAE J3016). This framework has proven remarkably useful as an analogy for agentic AI because it captures the same spectrum from "human does everything" to "machine does everything" with clearly defined intermediate stages.

The driving analogy is instructive because the automotive industry has spent decades thinking about how to safely increase automation. They have learned painful lessons about what happens when automation is poorly matched to human capabilities (most notably, the "Level 3 problem" where the human is supposed to be ready to take over but is not actually paying attention). These lessons apply directly to agentic AI.

Let us map the six levels to agent systems:

[Interactive figure: Levels of Agent Autonomy. An autonomy spectrum running from maximum safety to maximum autonomy, showing who decides at each step. The right level depends on task risk and how reversible the action is; there is no universal pick. Example shown: L3 (HOTL, live oversight), where the agent acts autonomously and the human watches the live trace, ready to interrupt, as with a coding assistant with a reviewer in the room.]

Level 0 -- No Automation (Traditional Software). The human performs all tasks. Software provides information but makes no decisions. Examples: a search engine that returns results for the human to act on, a database query tool that returns data for the human to interpret, or a spell checker that underlines errors but does not fix them.

The human is fully responsible for all decisions and actions. The software is purely a tool, with no autonomy whatsoever.

Level 1 -- Assistance (Autocomplete/Suggestions). The agent handles a single subtask under human control. The human drives all decision-making. Example: code autocomplete, email auto-reply suggestions, spell-check with auto-correct. The human accepts or rejects each suggestion.

The key characteristic of Level 1 is that the agent's contribution is atomic: a single suggestion, completion, or correction. The human evaluates each one independently.

Level 2 -- Partial Automation (Copilot). The agent can handle multiple subtasks simultaneously, but the human must remain engaged and monitor continuously. The human can take over at any time. Example: GitHub Copilot generating multi-line code blocks, a writing assistant that drafts paragraphs, a research agent that pulls together information from multiple sources but presents it for human review.

The key characteristic of Level 2 is that the agent produces multi-step output, but the human is still actively supervising every step. If the human looks away, the agent should not proceed.

Level 3 -- Conditional Automation (Supervised Agent). The agent handles complete tasks autonomously but within a defined domain. The human must be available to take over when the agent encounters situations outside its domain. Example: a customer service agent that handles routine queries but escalates complex ones; a coding agent that implements straightforward features but asks for help with architectural decisions.

Level 3 is where things get interesting and dangerous. The agent is autonomous enough that the human might stop paying attention (because the agent is handling things well), but not autonomous enough to handle every situation. This creates the Level 3 problem: the human is nominally "in the loop" but is actually disengaged and unprepared to take over when the agent encounters its limits.

The automotive industry has grappled with this exact problem. Tesla's "Full Self-Driving" is classified as a Level 2 driver-assistance system, yet it handles enough of the driving task that keeping the human driver engaged is one of the central safety concerns. The same challenge applies to agentic AI.

Level 4 -- High Automation (Monitored Agent). The agent can handle complete tasks, including most edge cases, without human involvement. It operates within its domain without needing human fallback, but it does not operate outside that domain. Example: a coding agent that can implement features, write tests, debug issues, and handle common errors, requesting human help only for ambiguous requirements or novel architectural questions.

The key difference between Level 3 and Level 4 is that a Level 4 agent can handle its own failures. When something unexpected happens, it can recover or fail gracefully rather than requiring the human to take over. The human reviews outcomes rather than supervising the process.

Level 5 -- Full Automation (Autonomous Agent). The agent can handle any task within its broad domain without human involvement, including novel situations. No human oversight required during operation. This level remains largely aspirational for general-purpose agents as of 2026, though narrow-domain agents (like automated trading systems with well-defined rules) may approach it.

2.2 Applying the Framework

When deciding the autonomy level for a specific agent deployment, consider five key factors:

Task criticality: How severe are the consequences of errors? A hobby project can tolerate Level 4, because a bug costs little. Medical software should be Level 2 at most.
Task predictability: How well-defined and routine is the task? Formatting code (highly predictable) can be more autonomous than designing system architecture (highly unpredictable).
Agent maturity: How extensively has the agent been tested on this type of task? A new agent deployment should start at lower autonomy levels regardless of theoretical capability.
Recovery cost: How expensive is it to recover from an error? If every action is backed by version control, recovery is cheap and higher autonomy is acceptable. If actions are irreversible, lower autonomy is needed.
Regulatory requirements: Are there legal mandates for human oversight? Some domains (healthcare, financial services, criminal justice) have explicit requirements.
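The five factors can be combined in many ways. The sketch below uses a "weakest link" rule, where the most restrictive factor sets the ceiling. Both the 1-5 scoring scales and the rule itself are assumptions made for illustration; they are not part of the SAE framework or any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    criticality: int      # 1 (hobby project) .. 5 (safety-critical)
    predictability: int   # 1 (novel every time) .. 5 (fully routine)
    maturity: int         # 1 (untested) .. 5 (long track record)
    recovery_cost: int    # 1 (cheap rollback) .. 5 (irreversible)
    regulatory_cap: Optional[int] = None  # highest level permitted, if mandated


def max_autonomy_level(d: Deployment) -> int:
    """Return the highest autonomy level (0-5) that every factor permits."""
    caps = [
        6 - d.criticality,    # higher stakes lower the ceiling
        d.predictability,     # unpredictable tasks lower the ceiling
        d.maturity,           # new deployments start low
        6 - d.recovery_cost,  # costly recovery lowers the ceiling
    ]
    if d.regulatory_cap is not None:
        caps.append(d.regulatory_cap)
    return max(0, min(caps))


hobby = Deployment(criticality=1, predictability=4, maturity=4, recovery_cost=1)
print(max_autonomy_level(hobby))
```

Taking the minimum rather than an average reflects the point made under agent maturity: a strong score on one factor should not compensate for a weak score on another.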

2.3 The Level 3 Problem in Agentic AI

The "Level 3 problem" deserves special attention because it is where most production agentic systems operate today. At Level 3, the agent is autonomous enough to handle routine cases but expects the human to take over for difficult ones. The problem is that the human, having watched the agent succeed repeatedly, is not mentally prepared to take over.

In agentic AI, the Level 3 problem manifests as follows: a customer service agent handles 95% of inquiries correctly. The human supervisor stops reading the agent's responses because they are almost always fine. The 5% of cases that need human intervention are precisely the unusual, complex cases that require the most attention, but the supervisor is the least prepared for them because they have been disengaged.

The implication is clear: if your system operates at Level 3, you need explicit mechanisms to keep the human engaged. Periodic manual handling of cases (even easy ones), random spot-checks, and mandatory review of a sample of "successful" cases all help maintain human readiness.
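Mandatory review of a sample of "successful" cases can be sketched as follows. The case format (a dict with an `escalated` flag) and the fixed sample size are assumptions for the example; the point is only that the review queue always contains some cases the agent believed it handled correctly.

```python
import random


def select_for_review(cases, sample_size=10, rng=None):
    """Return every escalated case plus a random sample of handled ones."""
    rng = rng or random.Random()
    escalated = [c for c in cases if c["escalated"]]
    handled = [c for c in cases if not c["escalated"]]
    # Spot-check cases the agent closed itself, so the supervisor
    # keeps reading agent output even when nothing was escalated.
    spot_checks = rng.sample(handled, min(sample_size, len(handled)))
    return escalated + spot_checks
```

Passing a seeded `random.Random` makes the selection reproducible for audits; in day-to-day operation the sample should be unpredictable to the reviewer.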

2.4 Progressive Autonomy

A practical pattern is progressive autonomy: start with high oversight and gradually reduce it as trust is established through demonstrated performance.

This mirrors how organizations onboard human employees: new hires have more oversight, and autonomy is earned through demonstrated competence. Nobody gives a new employee full administrative access on day one.

Progressive autonomy also provides a natural data collection mechanism. During the high-oversight phases, you accumulate labeled data about what the agent does right and wrong, which informs the decision about when to increase autonomy.

Key Insight: The right autonomy level is not a permanent choice. It should evolve over time based on demonstrated performance. Design your system to support gradual increases (and decreases) in autonomy as the agent's track record changes.
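A minimal sketch of progressive autonomy, assuming success can be scored per task as a boolean. The thresholds (promote at 98%, demote below 90%) and the window size are illustrative placeholders, not recommendations.

```python
from collections import deque


class AutonomyController:
    """Raise or lower the autonomy level based on a rolling success rate."""

    def __init__(self, level=1, min_level=0, max_level=4, window=50,
                 promote_at=0.98, demote_at=0.90):
        self.level = level
        self.min_level = min_level
        self.max_level = max_level
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.outcomes = deque(maxlen=window)

    def record(self, success: bool) -> int:
        """Record one task outcome and return the (possibly updated) level."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.outcomes.maxlen:
            return self.level  # not enough evidence yet
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate >= self.promote_at and self.level < self.max_level:
            self.level += 1
            self.outcomes.clear()  # each level must be earned from scratch
        elif rate < self.demote_at and self.level > self.min_level:
            self.level -= 1
            self.outcomes.clear()
        return self.level
```

Clearing the window after each change means a promotion is never based on evidence gathered under a different oversight regime.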

3. Designing Effective Approval Workflows

3.1 The Approval Bottleneck Problem

Naive HITL systems create a painful bottleneck:

Agents wait idle for human approval, wasting compute and time
Humans face "approval fatigue" from constant interruptions, leading to rubber-stamping
The latency defeats the purpose of automation (if every action takes 5 minutes of human review, why use an agent at all?)

The approval bottleneck is one of the main reasons organizations move too quickly to full autonomy: the HITL experience is so painful that people skip past HOTL and go straight to "let the agent do whatever it wants." This is dangerous. The solution is not to remove oversight but to design better oversight.

3.2 Design Patterns for Approval Workflows

3.2.1 Batched Approvals

Instead of approving each action individually, batch related actions together:

text

Agent: "I plan to take the following 5 actions to resolve this customer issue:
  1. Look up customer account #12345
  2. Check order history for the last 30 days
  3. Identify the disputed charge
  4. Calculate the refund amount
  5. Issue the refund

Approve all / Approve with modifications / Reject"

This gives the human a complete picture and reduces the number of approval decisions from 5 to 1. The human can evaluate the entire workflow at once, which is both faster and more effective than evaluating each step in isolation (because the human can see whether the steps make sense together).
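The batched prompt above can be produced and resolved with a small amount of code. This is a sketch under stated assumptions: the `Step` type, the decision strings, and the `keep` parameter are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    reversible: bool = True


def render_batch(steps: list[Step]) -> str:
    """Format all steps as a single approval request."""
    lines = [f"I plan to take the following {len(steps)} actions:"]
    for i, step in enumerate(steps, start=1):
        flag = "" if step.reversible else "  [irreversible]"
        lines.append(f"  {i}. {step.description}{flag}")
    lines.append("Approve all / Approve with modifications / Reject")
    return "\n".join(lines)


def apply_decision(steps, decision, keep=None):
    """Return the steps cleared to run; `keep` lists 1-based step numbers."""
    if decision == "approve_all":
        return list(steps)
    if decision == "approve_with_modifications":
        return [s for i, s in enumerate(steps, start=1) if i in (keep or [])]
    return []  # "reject" and any unrecognized decision run nothing
```

Marking irreversible steps in the rendered batch directs the reviewer's attention to the one or two lines that matter most.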

3.2.2 Plan-Level Approval

The agent presents its entire plan before executing any actions. The human approves the plan, and then the agent executes without further interruption (unless something unexpected happens).

This pattern is used by GitHub Copilot Workspace and similar tools. The human approves the intent (what the agent plans to do), and the agent handles the execution (how to do it). This is efficient because the human focuses on the high-level decisions where their judgment adds the most value, while the agent handles the mechanical execution where its speed adds the most value.

3.2.3 Exception-Based Approval

Define what "normal" looks like and only request approval for exceptions:

Normal actions: Execute automatically (logged for audit)
Unusual actions: Flag for human review before execution
Prohibited actions: Block immediately and notify human

The challenge is defining "normal," "unusual," and "prohibited." This can be done through:

Explicit rules (an allowlist of normal actions)
Statistical baselines (any action that deviates significantly from historical patterns)
Risk scoring (any action above a certain risk score requires approval)

This pattern works well for high-volume systems where human review of every action is impractical. The human's attention is focused on the exceptions where it is most needed.
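The three ways of defining "normal" can be combined in one classifier. The sketch below uses explicit lists first and a statistical baseline (a z-score against historical amounts) as the fallback. The tool names, the lists, and the threshold of 3 standard deviations are all hypothetical.

```python
from statistics import mean, stdev

ALLOWLIST = {"lookup_account", "check_order_history"}  # hypothetical tool names
BLOCKLIST = {"delete_account"}


def classify(tool: str, amount: float, history: list[float],
             z_threshold: float = 3.0) -> str:
    """Return "normal", "unusual", or "prohibited" for a proposed action."""
    if tool in BLOCKLIST:
        return "prohibited"   # block immediately and notify a human
    if tool in ALLOWLIST:
        return "normal"       # execute automatically, log for audit
    if len(history) < 2:
        return "unusual"      # no baseline yet: ask a human
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return "normal" if amount == mu else "unusual"
    z = abs(amount - mu) / sigma
    return "unusual" if z > z_threshold else "normal"


past_refunds = [20.0, 25.0, 30.0, 22.0, 28.0]
print(classify("issue_refund", 26.0, past_refunds))
print(classify("issue_refund", 500.0, past_refunds))
```

Treating a missing baseline as "unusual" follows the safe-default principle: when the system cannot tell whether an action is normal, it asks.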

3.2.4 Tiered Approval

Different approvers handle different risk levels.

This mirrors how organizations handle financial approvals: a team lead can approve expenses up to $1,000, a manager up to $10,000, and the CFO above that. The principle is the same: escalate oversight as risk increases.
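The expense-approval analogy translates directly into a routing function. The dollar limits come from the example above; the role names and the function itself are illustrative assumptions.

```python
# (upper limit in dollars, role allowed to approve up to that limit)
TIERS = [(1_000, "team_lead"), (10_000, "manager")]
TOP_TIER = "cfo"  # approves anything above the last limit


def required_approver(amount: float) -> str:
    """Return the lowest-ranked role allowed to approve this amount."""
    for limit, role in TIERS:
        if amount <= limit:
            return role
    return TOP_TIER
```

For agent actions, `amount` would be replaced by whatever risk measure the system uses, such as a risk score or the number of records affected.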

3.3 Approval UX Design Principles

The user experience of the approval workflow determines whether humans will engage thoughtfully or rubber-stamp everything. These principles are essential:

Provide context: Show the agent's reasoning, not just its proposed action. "I want to issue a $50 refund" is less helpful than "The customer was charged twice for order #1234. The duplicate charge was $50. I recommend issuing a refund for the duplicate charge."
Make the default safe: If the human ignores the approval request (walks away from their desk, gets distracted), the safe option should be the default. Typically this means "do not proceed." An approval that times out should not auto-approve.
Minimize cognitive load: Present information hierarchically: summary first, details available on demand. The human should be able to make a quick decision for routine cases and drill into details for unusual ones.
Enable quick decisions: For routine approvals that match established patterns, provide one-click approval buttons. But balance this with anti-rubber-stamping measures (see below).
Support undo: Even after approval, provide a window for the human to reverse the decision. "Undo this action within 30 seconds" gives the human a safety net for hasty approvals.
Track approval patterns: Identify actions that are always approved (candidates for automation, reducing the human's workload) and always rejected (candidates for policy updates, preventing the agent from proposing them). This data drives continuous improvement of the approval workflow.
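Tracking approval patterns can start as a simple aggregation over the decision log. In this sketch the log format (pairs of action type and a boolean) and the minimum sample size are assumptions; the output is a list of candidates for a human to consider, not an automatic policy change.

```python
from collections import Counter


def suggest_policy_changes(log, min_samples=20):
    """log: iterable of (action_type, approved) pairs."""
    totals, approved = Counter(), Counter()
    for action_type, ok in log:
        totals[action_type] += 1
        approved[action_type] += int(ok)
    automate, prohibit = [], []
    for action_type, n in totals.items():
        if n < min_samples:
            continue  # too few decisions to draw a conclusion
        if approved[action_type] == n:
            automate.append(action_type)   # always approved
        elif approved[action_type] == 0:
            prohibit.append(action_type)   # always rejected
    return {"automate": sorted(automate), "prohibit": sorted(prohibit)}
```

An always-approved action type may also be a sign of rubber-stamping, so it is worth checking reviewer behavior before automating it.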

3.4 Anti-Rubber-Stamping Measures

Approval fatigue is a serious problem. When humans review hundreds of agent proposals per day, they start approving everything without reading. Several techniques combat this:

Randomized challenges: Occasionally insert obviously wrong proposals to test whether the human is paying attention. If they approve a clearly incorrect action, alert them and require a slower review mode.

Variable detail requirements: Randomly require the human to summarize why they approved a specific action. This forces engagement beyond clicking a button.

Approval rate monitoring: Track each human reviewer's approval rate. If it approaches 100%, they may be rubber-stamping. Flag this for their manager.

Cooling periods: After approving many actions in quick succession, insert a mandatory pause: "You have approved 20 actions in the last 5 minutes. Take a moment to review the next one carefully."
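A cooling period needs only a sliding window of approval timestamps. The sketch below assumes the caller supplies the current time (for example from `time.monotonic()`); the defaults mirror the "20 actions in the last 5 minutes" example, and the class name is hypothetical.

```python
from collections import deque


class CoolingPeriod:
    """Signal a pause when approvals arrive too quickly."""

    def __init__(self, max_approvals=20, window_seconds=300.0):
        self.max_approvals = max_approvals
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record_approval(self, now: float) -> bool:
        """Record an approval at time `now`; return True if a pause is due."""
        self.timestamps.append(now)
        # Drop approvals that have fallen out of the sliding window
        while now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_approvals
```

Taking `now` as a parameter keeps the class easy to test and avoids depending on wall-clock time, which can jump.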

Key Insight: The goal of an approval workflow is not just to get a human signature on each action. It is to ensure that a human actually evaluates each action. If your approval workflow results in rubber-stamping, it provides a false sense of safety while adding delay. Design for genuine engagement, not just process compliance.

3.5 Complete Example: An Approval Workflow System in Python

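The following code implements an approval workflow system built around risk-tiered routing, with approval expiry, a two-person rule for critical actions, and rubber-stamping detection. It does not implement batched or exception-based approval; those patterns would be layered on top. Study it carefully; each component maps to a design pattern or principle from this section.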

python

"""
Complete approval workflow system for agentic AI.

This module implements risk-tiered approval with expiry, a two-person
rule for critical actions, and anti-rubber-stamping measures. It is
designed to be integrated into any agent system where human oversight
is required. Batched and exception-based approval are not implemented here.
"""

import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from collections import deque


class RiskLevel(Enum):
    """Risk levels determine which approval path is taken."""
    LOW = "low"           # Automated approval (logged)
    MEDIUM = "medium"     # Single approver required
    HIGH = "high"         # Senior approver required
    CRITICAL = "critical" # Two-person approval required


class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"
    NEEDS_ESCALATION = "needs_escalation"


@dataclass
class ProposedAction:
    """An action proposed by the agent that may require approval."""
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    tool_name: str = ""
    parameters: dict = field(default_factory=dict)
    reasoning: str = ""
    risk_level: RiskLevel = RiskLevel.LOW
    estimated_impact: str = ""
    rollback_possible: bool = True
    created_at: float = field(default_factory=time.time)


@dataclass
class ApprovalDecision:
    """A human's decision on a proposed action."""
    action_id: str = ""
    approver_id: str = ""
    status: ApprovalStatus = ApprovalStatus.PENDING
    reason: str = ""
    decided_at: float = field(default_factory=time.time)
    time_spent_seconds: float = 0.0  # How long the approver took


class ApprovalWorkflow:
    """
    Manages the approval process for agent actions.

    This class implements tiered approval, expiration, a two-person
    rule, and anti-rubber-stamping measures.

    Usage:
        workflow = ApprovalWorkflow()

        # Agent proposes an action
        action = ProposedAction(
            tool_name="send_email",
            parameters={"to": "customer@example.com", "body": "..."},
            reasoning="Customer requested a refund confirmation",
            risk_level=RiskLevel.MEDIUM,
        )
        workflow.submit(action)

        # Human reviews and decides
        decision = workflow.decide(
            action_id=action.action_id,
            approver_id="reviewer_jane",
            status=ApprovalStatus.APPROVED,
            reason="Looks correct, standard refund confirmation",
        )

        # Check if the action can proceed
        if workflow.can_execute(action.action_id):
            execute_action(action)
    """

    def __init__(self, approval_timeout_seconds: float = 300.0):
        self.pending: dict[str, ProposedAction] = {}
        self.decisions: dict[str, list[ApprovalDecision]] = {}
        self.approval_timeout = approval_timeout_seconds
        self.approver_stats: dict[str, dict] = {}  # Track per-approver stats

    def submit(self, action: ProposedAction) -> ApprovalStatus:
        """
        Submit an action for approval.

        Low-risk actions are auto-approved (but logged).
        All other actions enter the pending queue.
        """
        if action.risk_level == RiskLevel.LOW:
            # Auto-approve low-risk actions, but log them
            auto_decision = ApprovalDecision(
                action_id=action.action_id,
                approver_id="system_auto_approve",
                status=ApprovalStatus.APPROVED,
                reason="Auto-approved: low risk",
            )
            self.decisions[action.action_id] = [auto_decision]
            return ApprovalStatus.APPROVED

        self.pending[action.action_id] = action
        self.decisions[action.action_id] = []
        return ApprovalStatus.PENDING

    def decide(self, action_id: str, approver_id: str,
               status: ApprovalStatus, reason: str) -> ApprovalDecision:
        """
        Record a human's approval decision.

        Tracks timing for anti-rubber-stamping analysis.
        """
        action = self.pending.get(action_id)
        if action is None:
            raise ValueError(f"No pending action with id {action_id}")

        # Approximate review time as the time elapsed since the action was
        # created (this includes any time it sat unseen in the queue)
        time_spent = time.time() - action.created_at

        decision = ApprovalDecision(
            action_id=action_id,
            approver_id=approver_id,
            status=status,
            reason=reason,
            time_spent_seconds=time_spent,
        )
        self.decisions[action_id].append(decision)

        # Update approver statistics for rubber-stamping detection
        self._update_approver_stats(approver_id, decision)

        return decision

    def can_execute(self, action_id: str) -> bool:
        """
        Check whether an action has sufficient approvals to execute.

        Critical actions require approvals from two distinct approvers.
        Expired actions cannot be executed.
        Any rejection blocks execution, even if someone else approved.
        """
        decisions = self.decisions.get(action_id, [])
        action = self.pending.get(action_id)
        if action is None:
            # Not in the pending queue: executable only if auto-approved
            return any(d.status == ApprovalStatus.APPROVED for d in decisions)

        # Check for expiration
        if time.time() - action.created_at > self.approval_timeout:
            return False  # Safe default: expired actions do NOT execute

        # A single rejection vetoes the action
        if any(d.status == ApprovalStatus.REJECTED for d in decisions):
            return False

        approvals = [d for d in decisions if d.status == ApprovalStatus.APPROVED]

        if action.risk_level == RiskLevel.CRITICAL:
            # Two-person rule: need two distinct approvers
            unique_approvers = {d.approver_id for d in approvals}
            return len(unique_approvers) >= 2

        return len(approvals) >= 1

    def _update_approver_stats(self, approver_id: str,
                                decision: ApprovalDecision):
        """Track approval patterns for rubber-stamping detection."""
        if approver_id not in self.approver_stats:
            self.approver_stats[approver_id] = {
                "total": 0,
                "approved": 0,
                "avg_time_seconds": 0.0,
                "recent_times": deque(maxlen=20),
            }

        stats = self.approver_stats[approver_id]
        stats["total"] += 1
        if decision.status == ApprovalStatus.APPROVED:
            stats["approved"] += 1
        stats["recent_times"].append(decision.time_spent_seconds)
        stats["avg_time_seconds"] = (
            sum(stats["recent_times"]) / len(stats["recent_times"])
        )

    def check_rubber_stamping(self, approver_id: str) -> dict:
        """
        Analyze whether an approver may be rubber-stamping.

        Returns a report with warning indicators.
        A high approval rate combined with very short review times
        suggests the approver is not actually reading the proposals.
        """
        stats = self.approver_stats.get(approver_id)
        if stats is None or stats["total"] < 10:
            return {"warning": False, "reason": "Insufficient data"}

        approval_rate = stats["approved"] / stats["total"]
        avg_time = stats["avg_time_seconds"]

        warnings = []
        if approval_rate > 0.95:
            warnings.append(
                f"Approval rate is {approval_rate:.0%} "
                f"(approved {stats['approved']} of {stats['total']})"
            )
        if avg_time < 3.0:
            warnings.append(
                f"Average review time is {avg_time:.1f}s "
                f"(may indicate insufficient review)"
            )

        return {
            "warning": len(warnings) > 0,
            "approval_rate": approval_rate,
            "avg_review_time_seconds": avg_time,
            "warnings": warnings,
        }

Let us walk through the key design decisions in this code:

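Risk-based routing (submit method): Low-risk actions are auto-approved, reducing the human's workload. All other actions enter the pending queue. This is the entry point of the tiered approval pattern. Note that the code does not check approver seniority for HIGH-risk actions; any single approver satisfies them, so a role check would need to be added to enforce the "senior approver" tier.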
Safe defaults (can_execute method): Expired actions cannot be executed. If the human walks away from their desk, the agent stops. This implements the "make the default safe" principle.
Two-person rule (can_execute for CRITICAL): Critical actions require two distinct approvers, preventing a single point of failure in the oversight chain.
Rubber-stamping detection (check_rubber_stamping method): The system tracks each approver's approval rate and average review time. An approver who approves 98% of actions in under 3 seconds per action is almost certainly not reading the proposals.
Timing data (time_spent_seconds): Recording how long each review takes provides data for both rubber-stamping detection and workflow optimization.

3.6 UX Mockup: The Approval Interface

What does a well-designed approval interface look like? Here is a text-based mockup of an approval screen that implements the design principles from Section 3.3:

Notice the design principles in action:

Context first: The "Why" section explains the agent's reasoning before showing the action.
Progressive disclosure: The full email is collapsed by default, reducing cognitive load for routine cases.
Track record: "94% approved by human reviewers" helps calibrate expectations.
Multiple options: The reviewer can approve, reject, edit, or escalate.
Self-awareness data: The footer shows the reviewer's own statistics, encouraging mindful engagement.
Expiration timer: Creates appropriate urgency without auto-approving.

Try It Yourself: Design an Approval Interface

Design an approval interface for one of the following scenarios. Sketch it out (text mockup is fine) and identify which design principles you applied:

A medical triage agent that recommends patient priority levels in an emergency room
A content moderation agent that proposes removing user posts on a social media platform
A financial trading agent that proposes executing stock trades above $10,000

Consider: What information does the human need to make a good decision? How do you prevent rubber-stamping? What is the safe default if the human does not respond?

054. Transparency and Explainability in Agent Actions

4.1 Why Transparency Matters for Agents

Transparency in agentic AI goes far beyond traditional model explainability (why did the classifier predict "cat" instead of "dog"?). For an agent, we need to explain a much richer set of decisions:

Why this plan? Why did the agent choose this sequence of actions over alternatives?
Why this tool? Why did the agent select this particular tool for this step?
Why these parameters? Why did the agent use these specific arguments?
What was the reasoning? What information did the agent consider in its decision?
What was rejected? What alternatives did the agent consider and discard?
How confident is the agent? Does the agent think this is the right approach, or is it uncertain?

Without answers to these questions, the human cannot make informed oversight decisions. Approving an agent action you do not understand is not meaningful oversight; it is rubber-stamping with extra steps.

4.2 Levels of Transparency

We can think of transparency as operating at four progressively deeper levels:

Level 1 -- Action transparency: The user can see what the agent did.

text

Agent executed: search("python error handling best practices")
Agent executed: read_file("/src/main.py")
Agent executed: write_file("/src/main.py", <modified content>)

This is the minimum viable transparency. The user knows which tools were called and with what parameters. It is like a transaction log: useful for audit but does not explain why.

Level 2 -- Reasoning transparency: The user can see why the agent did it.

text

"I searched for Python error handling best practices because the current
code uses bare except clauses, which is an anti-pattern. I then modified
main.py to use specific exception types."

Reasoning transparency answers the "why" question. It connects actions to rationale, allowing the user to evaluate whether the reasoning is sound.

Level 3 -- Alternative transparency: The user can see what else the agent considered.

text

"I considered three approaches:
  A) Add specific exception types (chosen -- most Pythonic)
  B) Add a generic logging wrapper (rejected -- masks error types)
  C) Refactor to avoid try/except where possible (rejected -- too invasive)"

Alternative transparency reveals the decision space. By showing what was rejected and why, it gives the user confidence that the agent considered multiple options and chose wisely. It also makes it easy for the user to say "actually, I prefer option C."

Level 4 -- Uncertainty transparency: The user can see how confident the agent is.

text

"I am 90% confident that ValueError is the right exception type here,
but there's a possibility that TypeError would be more appropriate.
I'd recommend reviewing line 42 carefully."

Uncertainty transparency is the deepest level. It tells the user where to focus their review effort: areas where the agent is uncertain deserve more scrutiny than areas where it is confident.

4.3 Chain-of-Thought as Transparency

One practical advantage of LLM-based agents is that chain-of-thought reasoning provides a natural form of transparency. The agent's reasoning is expressed in natural language that humans can read and evaluate. This is a significant advantage over traditional ML models, where explaining a decision requires specialized interpretation techniques.

However, there are important caveats, as highlighted by Turpin et al. (2023):

Chain-of-thought may not reflect actual reasoning. The model might generate plausible-sounding reasoning that does not correspond to the actual computation that produced the answer. This is known as "unfaithful reasoning." The model's output says "I chose A because of X, Y, and Z," but the actual neural computation that selected A did not involve X, Y, or Z.

Post-hoc rationalization. The model might decide on an action first and then generate reasoning to justify it, rather than reasoning first and then deciding. This is the computational equivalent of a human who makes a gut decision and then constructs a rational argument to support it.

Selective reporting. The model might omit relevant considerations from its stated reasoning. It might mention the factors that support its decision while ignoring factors that argue against it.

Despite these limitations, chain-of-thought remains the most practical transparency mechanism for current agents. The key is to treat it as one signal among many, not as ground truth about the agent's decision process. Combine it with action logging, output validation, and human review for a more complete picture.
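One cheap way to treat chain-of-thought as one signal among many is to compare what the agent says it did against the action log. The sketch below is only an illustration: `cross_check`, the `claimed_tools` list (assumed to have been extracted from the agent's explanation), and the log format are hypothetical. It checks consistency between the two sources; it does not measure whether the reasoning is faithful.

```python
def cross_check(claimed_tools: list[str], logged_tools: list[str]) -> dict:
    """Compare tools named in the agent's explanation with tools actually called."""
    claimed, logged = set(claimed_tools), set(logged_tools)
    unreported = sorted(logged - claimed)    # called but never mentioned
    unperformed = sorted(claimed - logged)   # mentioned but never called
    return {
        "consistent": not unreported and not unperformed,
        "unreported_actions": unreported,
        "claimed_but_not_logged": unperformed,
    }


# Hypothetical example: the explanation omits a file read that the log shows.
result = cross_check(
    claimed_tools=["search", "write_file"],
    logged_tools=["search", "read_file", "write_file"],
)
print(result)
```

A mismatch does not prove the agent is being deceptive, but it is a reasonable trigger for routing the action to closer human review.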

4.4 Design Pattern: The Transparent Agent

python

"""
Design pattern for a transparent agent that provides
multi-level explanations of its actions.
"""

from dataclasses import dataclass


@dataclass
class AgentAction:
    """A single action with full transparency metadata."""
    tool_name: str
    parameters: dict
    reasoning: str                    # Why this action?
    alternatives_considered: list[str] # What else was considered?
    confidence: float                  # 0.0 to 1.0
    risks: list[str]                   # What could go wrong?


@dataclass
class AgentPlan:
    """A complete plan with full transparency metadata."""
    goal: str
    steps: list[AgentAction]
    overall_reasoning: str
    estimated_duration: str
    rollback_plan: str                 # How to undo if things go wrong


def present_plan_to_user(plan: AgentPlan) -> str:
    """
    Format an agent plan for human review.

    Uses progressive disclosure: summary first, details on demand.
    This function generates the human-facing presentation of the
    plan, suitable for an approval workflow.
    """
    output = []
    output.append(f"## Goal: {plan.goal}")
    output.append(f"**Approach**: {plan.overall_reasoning}")
    output.append(f"**Estimated time**: {plan.estimated_duration}")
    output.append(f"**Rollback plan**: {plan.rollback_plan}")
    output.append("")
    output.append("### Steps:")

    for i, step in enumerate(plan.steps, 1):
        # Map numeric confidence to human-readable label
        confidence_indicator = (
            "high" if step.confidence > 0.8
            else "medium" if step.confidence > 0.5
            else "low"
        )
        output.append(
            f"\n**Step {i}**: `{step.tool_name}` "
            f"(confidence: {confidence_indicator})"
        )
        output.append(f"  - **Why**: {step.reasoning}")

        if step.alternatives_considered:
            output.append(
                f"  - **Alternatives considered**: "
                f"{', '.join(step.alternatives_considered)}"
            )

        if step.risks:
            output.append(f"  - **Risks**: {', '.join(step.risks)}")

    return "\n".join(output)

The key design decisions in this code:

Every action carries its own reasoning, alternatives, confidence, and risks. This means the transparency metadata travels with the action, not in a separate log.
The present_plan_to_user function implements progressive disclosure: it shows the high-level summary (goal, approach, rollback plan) before diving into step-by-step details.
Confidence is translated from a number to a human-readable label because users respond better to "medium confidence" than to "0.62."

065. Trust Calibration: When to Trust and When to Verify

5.1 The Trust Problem

Trust calibration is one of the most important and least understood aspects of human-agent interaction. It refers to the alignment between a user's confidence in an agent and the agent's actual reliability. Two failure modes exist, and both are costly:

Over-trust (automation complacency). The user trusts the agent more than its capabilities warrant. This leads to uncritical acceptance of agent outputs and failure to catch errors. Over-trust is especially dangerous because the errors that slip through are precisely the ones that required human judgment: the agent got the easy cases right (building the user's confidence) and failed on the hard cases (which the overconfident user did not review carefully).

Over-trust develops naturally when an agent performs well most of the time. After seeing the agent succeed 50 times in a row, the human stops checking the 51st output. But the 51st output might be the one that is wrong.

Under-trust (automation aversion). The user trusts the agent less than its capabilities warrant. This leads to unnecessary verification of correct agent outputs, negating the efficiency benefits of automation. Under-trust often develops after a dramatic failure: the agent makes one bad mistake, and the user never trusts it again, even though it is reliable 99% of the time.

Under-trust is costly because it transforms the agent from a productivity tool into a source of busy work. If the human re-does every task the agent completes, they are doing the work twice.

The goal is calibrated trust: the user's confidence tracks the agent's actual reliability. The user trusts the agent on tasks where it is reliable and verifies on tasks where it is less reliable.

5.2 Factors Affecting Trust

Research on human-automation trust, summarized in the influential review by Lee and See (2004), identifies three dimensions:

Performance-based trust. Based on the agent's track record. Does it usually produce correct results? Performance-based trust is the most rational form: it is based on evidence. But it is also slow to build and fast to destroy (one dramatic failure can override many successes).

Process-based trust. Based on understanding how the agent works. Can the user follow the agent's reasoning? Process-based trust is why transparency matters: a user who can see and evaluate the agent's reasoning has a basis for trusting it beyond just track record. It is also why chain-of-thought reasoning is so valuable for trust building.

Purpose-based trust. Based on the agent's perceived intent. Does the user believe the agent is trying to help them? Purpose-based trust is why framing matters: an agent that says "I am trying to help you with X" establishes its intent clearly.

For LLM-based agents, purpose-based trust is generally high (users believe the agent is trying to help), process-based trust is often low (users do not understand how LLMs work and cannot reliably evaluate chain-of-thought reasoning), and performance-based trust varies based on experience.

5.3 Strategies for Appropriate Trust Calibration

5.3.1 Communicate Uncertainty

Agents should explicitly state when they are uncertain. This is perhaps the single most important design pattern for trust calibration:

text

High confidence:  "I'll update the configuration file."
Medium confidence: "I believe the issue is in the database connection, but I'm not certain."
Low confidence:   "I'm not sure how to approach this. Here are two possible strategies..."

When an agent communicates uncertainty, the user knows where to focus their review effort. High-confidence actions can be reviewed quickly; low-confidence actions deserve careful scrutiny. Without uncertainty communication, the user must treat every action with the same level of scrutiny, which is both exhausting and inefficient.
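As a minimal sketch of this pattern, the function below wraps a statement in wording that matches a numeric confidence. The name `hedge` and the exact phrases are assumptions for illustration; the 0.8 and 0.5 cutoffs simply reuse the thresholds from the code in Section 4.4.

```python
def hedge(statement: str, confidence: float) -> str:
    """Prefix a statement with wording that matches the agent's confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence > 0.8:
        # High confidence: state the action plainly.
        return statement
    if confidence > 0.5:
        # Medium confidence: flag the uncertainty explicitly.
        return f"I believe this is right, but I'm not certain: {statement}"
    # Low confidence: ask for careful review.
    return f"I'm not sure about this, please review carefully: {statement}"


print(hedge("I'll update the configuration file.", 0.95))
print(hedge("The issue is in the database connection.", 0.6))
print(hedge("Restarting the worker may fix the problem.", 0.3))
```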

5.3.2 Show Track Record

Display the agent's historical accuracy for similar tasks:

text

"This type of task (code refactoring) has a 94% success rate
based on the last 200 executions in your organization."

Track record information helps users calibrate their expectations. If the agent has a 94% success rate on this type of task, the user knows to expect roughly 1 failure in 17 attempts and can plan their review strategy accordingly.
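A message like the one above can be generated from a simple outcome history. This is a sketch under assumptions: `track_record_message` is a hypothetical helper, and the history is taken to be a list of booleans (True for success) for one task type.

```python
def track_record_message(task_type: str, outcomes: list[bool]) -> str:
    """Summarize historical success for one task type in plain language."""
    total = len(outcomes)
    if total == 0:
        return f"No track record yet for this type of task ({task_type})."
    successes = sum(outcomes)
    failures = total - successes
    message = (
        f"This type of task ({task_type}) has a {successes / total:.0%} "
        f"success rate based on the last {total} executions."
    )
    if failures:
        # Translate the rate into an intuitive "1 in N" frequency.
        message += f" Expect roughly 1 failure in {round(total / failures)} attempts."
    return message


# Hypothetical history: 188 successes and 12 failures out of 200 runs.
history = [True] * 188 + [False] * 12
print(track_record_message("code refactoring", history))
```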

5.3.3 Progressive Disclosure of Competence

Let users discover the agent's capabilities gradually rather than presenting a long list of features. Start with simple tasks and expand scope as trust builds. This mirrors how trust develops in human relationships: you do not share your deepest secrets with someone you just met.

A practical implementation of progressive disclosure looks like this:

text

Week 1: "I can help you search your codebase and answer questions about it."
Week 2: "I can also suggest code changes. Want me to try?"
Week 3: "Based on our work together, I can now handle straightforward
         bug fixes from start to finish. I'll always show you the
         changes before applying them."
Week 4: "I've successfully completed 15 bug fixes with zero rejections.
         Would you like to enable auto-apply for low-risk changes?"

Each step increases capability and autonomy, but only after the previous step has been validated by the user's experience. The agent earns trust through demonstrated competence, not through self-assertion.

Key Insight: Progressive disclosure of competence is the trust-building equivalent of progressive autonomy from Section 2.3. Both start conservative and increase gradually based on demonstrated success. The difference is that progressive autonomy is about what the agent does, while progressive disclosure is about what the user knows the agent can do.
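The week-by-week progression above can be approximated with a rule that offers the next capability only once the current one has a clean record. The tier names, the `TrackRecord` type, and the thresholds (15 accepted, 0 rejected, echoing the Week 4 message) are all hypothetical choices for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical capability tiers, ordered from least to most autonomous.
TIERS = [
    "answer_questions",
    "suggest_changes",
    "fix_bugs_with_review",
    "auto_apply_low_risk",
]


@dataclass
class TrackRecord:
    accepted: int = 0
    rejected: int = 0


def next_tier_offer(
    current_tier: int,
    record: TrackRecord,
    min_accepted: int = 15,
    max_rejected: int = 0,
) -> Optional[str]:
    """Return the next tier to offer the user, or None if it is not yet earned."""
    if current_tier + 1 >= len(TIERS):
        return None  # already at the highest tier
    if record.accepted >= min_accepted and record.rejected <= max_rejected:
        return TIERS[current_tier + 1]
    return None


print(next_tier_offer(2, TrackRecord(accepted=15, rejected=0)))
```

The function only produces an offer. Enabling the new tier remains the user's decision, which keeps the agent from promoting itself.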

5.3.4 Honest Failure Reporting

When an agent fails, it should clearly report:

What it was trying to do
What went wrong
What it has already done (for rollback purposes)
What it recommends as a next step

Never hide failures. An agent that silently fails or pretends to succeed destroys trust far more than one that fails transparently. If the user discovers later that the agent failed silently, they will never trust it again (and rightly so). Honest failure reporting, while painful in the moment, builds long-term trust by demonstrating that the agent is a reliable reporter of its own state.
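The four items above map naturally onto a small record type. The following is a sketch, not a prescribed format: `FailureReport`, its field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureReport:
    goal: str                      # what the agent was trying to do
    error: str                     # what went wrong
    recommendation: str            # suggested next step
    completed_steps: list[str] = field(default_factory=list)  # already done, for rollback

    def render(self) -> str:
        """Format the report so that nothing the agent already changed is hidden."""
        done = "\n".join(f"  - {step}" for step in self.completed_steps)
        if not done:
            done = "  - (nothing was changed)"
        return (
            f"Task failed: {self.goal}\n"
            f"What went wrong: {self.error}\n"
            f"Already done (may need rollback):\n{done}\n"
            f"Recommended next step: {self.recommendation}"
        )


# Hypothetical failure during a two-step migration.
report = FailureReport(
    goal="Migrate the user table to the new schema",
    error="Step 2 failed: foreign key constraint violation",
    recommendation="Roll back step 1, then fix orphaned rows before retrying",
    completed_steps=["Added column users.email_verified"],
)
print(report.render())
```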

5.4 The Verification Cost Tradeoff

Every time a human verifies an agent's work, there is a cost (the human's time and attention). Trust calibration helps optimize this:

text

Expected cost = P(error) * (1 - P(verify)) * Cost(undetected_error) + P(verify) * Cost(verification)

Optimal verification rate minimizes expected cost.

For high-stakes, low-frequency decisions: verify every time (the cost of an undetected error far exceeds the cost of verification). For low-stakes, high-frequency decisions: sample and verify periodically (spot-checking). For medium-stakes: verify based on the agent's expressed confidence.

This framework makes trust calibration concrete and actionable. It is not about "trusting" or "not trusting" the agent in an absolute sense; it is about allocating verification effort where it has the highest expected value.

For example, consider a coding agent with the following error rates:

Code formatting changes: 0.1% error rate, cost of error: low (easily fixed)
Bug fixes: 5% error rate, cost of error: medium (may introduce new bugs)
Security-sensitive changes: 15% error rate, cost of error: high (vulnerabilities)

The optimal strategy is: do not verify formatting changes (the expected cost of missing an error is tiny), spot-check bug fixes (verify every 5th or 10th one, plus any that seem unusual), and verify every security-sensitive change (the cost of missing an error is too high to skip). This is much more efficient than verifying everything or verifying nothing, and it allocates human attention to where it has the greatest impact.
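To make the tradeoff concrete, here is a small worked sketch. It assumes that a review always catches the error, so an error goes undetected only when the action is not reviewed. The error rates come from the example above; the cost figures are invented, in arbitrary units, purely for illustration.

```python
def expected_cost(
    p_error: float, cost_undetected: float, cost_verify: float, p_verify: float
) -> float:
    """Expected cost per action when a fraction p_verify of actions is reviewed."""
    undetected = p_error * (1 - p_verify) * cost_undetected
    review = p_verify * cost_verify
    return undetected + review


# Error rates from the example; costs are hypothetical.
CATEGORIES = {
    "formatting": {"p_error": 0.001, "cost_undetected": 10, "cost_verify": 2},
    "bug_fix": {"p_error": 0.05, "cost_undetected": 100, "cost_verify": 5},
    "security": {"p_error": 0.15, "cost_undetected": 10_000, "cost_verify": 20},
}

for name, costs in CATEGORIES.items():
    never = expected_cost(p_verify=0.0, **costs)
    always = expected_cost(p_verify=1.0, **costs)
    print(f"{name}: never verify = {never:.2f}, always verify = {always:.2f}")
```

With these made-up numbers, skipping review is far cheaper for formatting, always reviewing is far cheaper for security changes, and bug fixes sit close to break-even. The break-even zone is where spot-checking is a sensible compromise, since sampling also tells you whether the assumed error rate still holds.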

Key Insight: Trust is not a binary property ("I trust this agent" or "I do not trust this agent"). It is a calibration problem: matching your verification effort to the agent's actual reliability on each specific type of task. Over-trust sacrifices safety; under-trust wastes time. Calibrated trust optimizes both.

076. Collaborative Workflows: Agent as Assistant vs. Agent as Colleague

6.1 Two Collaboration Models

As agents become more capable, the nature of human-agent collaboration is evolving from a master-servant relationship to something more peer-like. Understanding these two models helps you design the right interaction patterns.

Agent as assistant. The human drives. The agent helps with specific subtasks when asked. The human maintains the overall plan and context. This is the model used by most current AI tools (ChatGPT, Copilot, Claude in standard chat mode).

The assistant model is comfortable and familiar because it mirrors existing tool-use patterns. The human is always in control, and the agent is passive until asked. But it also means the human must manage context, remember to ask for help, and coordinate all the pieces.

Agent as colleague. The agent and human work as peers on a shared task. The agent may independently pursue subtasks, contribute ideas, and push back on the human's approach. The human and agent negotiate strategy.

The colleague model is emerging in practice, particularly in software engineering (Claude Code operating as a pair programming partner that proactively suggests improvements) and research (agents that independently explore literature and surface relevant findings).

6.2 Characteristics of Each Model

Dimension	Assistant Model	Colleague Model
Initiative	Human drives, agent responds	Both can initiate
Planning	Human plans, agent executes	Shared planning
Context	Human maintains context	Shared context
Disagreement	Agent defers to human	Agent can push back
Accountability	Human accountable	Shared accountability (harder)
Cognitive load	Lower for agent, higher for human	More balanced

6.3 Practical Colleague-Model Patterns

Three colleague-model patterns are already visible in software engineering work:

Pair programming with AI. The human and agent take turns writing code. The agent might implement a function, the human reviews and refines, the agent writes tests, the human modifies the tests, and so on. The key difference from the assistant model is that the agent takes initiative: it might notice a potential bug the human did not ask about and flag it proactively.

Review and critique. The agent reviews the human's work and provides constructive feedback: not just corrections but also suggestions for alternative approaches, identification of missing edge cases, and architectural concerns. This requires the agent to have "opinions" (or at least, to surface considerations the human might have missed).

Proactive contribution. The agent notices issues or opportunities the human has not asked about: "I noticed the function you just wrote has a potential race condition. Would you like me to address it?" This is the hallmark of the colleague model: the agent acts on its own judgment, not just on explicit instructions.

6.4 Communication Protocols

For effective human-agent collaboration, explicit communication protocols help:

Status updates. Agent proactively reports what it is doing and what it plans to do next. Without status updates, the human cannot maintain a mental model of the agent's progress.

Check-ins. At natural breakpoints, the agent summarizes progress and asks whether to continue. This gives the human regular opportunities to redirect or correct without having to interrupt.

Handoffs. When the agent encounters something outside its capability, it clearly hands off to the human with full context: what it was trying to do, where it got stuck, and what information the human needs to continue.

Clarification requests. Instead of guessing, the agent asks targeted questions when requirements are ambiguous. Good clarification requests are specific ("Should the refund be applied to the original payment method or as store credit?"), not vague ("What should I do?").
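These four protocols can be represented as typed messages so that the interface knows which ones block on a human reply. The sketch below is one possible shape, with hypothetical names (`MessageKind`, `AgentMessage`, `status`, `handoff`) and an invented example scenario.

```python
from dataclasses import dataclass
from enum import Enum


class MessageKind(Enum):
    STATUS = "status"
    CHECK_IN = "check-in"
    HANDOFF = "handoff"
    CLARIFICATION = "clarification"


@dataclass
class AgentMessage:
    kind: MessageKind
    body: str
    needs_reply: bool  # True when the agent must wait for the human


def status(doing: str, next_step: str) -> AgentMessage:
    """Routine update: informs the human but does not block the agent."""
    return AgentMessage(MessageKind.STATUS, f"Now: {doing}. Next: {next_step}.", False)


def handoff(goal: str, blocker: str, needed: str) -> AgentMessage:
    """Handoff with full context: goal, where the agent got stuck, what is needed."""
    body = (
        f"I was trying to: {goal}\n"
        f"Where I got stuck: {blocker}\n"
        f"What I need from you: {needed}"
    )
    return AgentMessage(MessageKind.HANDOFF, body, True)


# Hypothetical scenario: the agent lacks credentials for a deployment step.
msg = handoff(
    goal="Deploy the hotfix to staging",
    blocker="The deploy step requires credentials I do not have",
    needed="Run the deploy step, or grant a scoped token",
)
print(msg.body)
```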

Try It Yourself: Design a Communication Protocol

You are building a coding agent that operates in the "colleague model." Design a communication protocol for the following scenario:

The agent is working on a feature implementation. Midway through, it discovers that the database schema needs to change, which is outside the scope of its original task.

Write out the exact messages the agent should send to the human developer at each stage:

The initial status update when it starts work
The notification when it discovers the schema issue
The handoff message with full context
The follow-up message after the human resolves the schema issue

For each message, explain why you included the information you did and what you left out.

087. UX Design for Agentic Interfaces

7.1 The UX Challenge

Agentic interfaces pose unique UX challenges that go beyond traditional software design:

Non-deterministic behavior: The agent might do something different each time, making the interface less predictable than traditional software
Long-running tasks: Tasks may take minutes or hours, with uncertain completion times
Invisible work: Much of the agent's work (reasoning, searching, planning) is invisible to the user
Need for intervention: The user needs to understand the agent's state well enough to intervene at appropriate points

Traditional software UX assumes that the user is in control and the software is deterministic. Agentic UX must handle a fundamentally different situation where the agent has its own "agency" and the user must oversee rather than direct.

7.2 Core UX Patterns

7.2.1 The Activity Feed

Show a chronological feed of agent actions, similar to a developer terminal or activity log:

This pattern provides: real-time visibility into agent activity, natural breakpoints for intervention, and a scrollable history for review. It is familiar from terminal and CI/CD interfaces, reducing the learning curve.

The activity feed is the most important UX pattern for agentic interfaces because it addresses the "invisible work" problem: the user can see what the agent is doing at every moment.

7.2.2 The Plan View

Show the agent's overall plan with progress tracking:

The plan view provides high-level orientation: where is the agent in its overall task? How much is left? The user can modify the plan (add steps, remove steps, reorder) or cancel entirely. This gives the user control over the strategy while the agent handles the tactics.

7.2.3 The Diff View

For agents that modify files or data, show changes in a familiar diff format:

The diff view is essential for code-modifying agents. It leverages a familiar interface (code diffs are universal in software development) and allows precise review of what the agent proposes to change. The "Explain Why" button connects the diff to the agent's reasoning, providing Level 2 transparency.

7.2.4 The Confidence Indicator

Visually communicate the agent's confidence in its actions:

Confidence indicators help users calibrate their trust on a per-action basis. A high-confidence recommendation can be accepted quickly; a low-confidence one deserves careful review. Critically, the indicator shows where the agent is less confident, focusing the user's attention on the areas most likely to need human judgment.
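As a text-only sketch of a confidence indicator, the function below renders a bar together with a percentage and a label, so the same information is available without relying on the visual bar alone. The name `confidence_bar`, the bar characters, and the reuse of the 0.8 and 0.5 cutoffs from Section 4.4 are assumptions for illustration.

```python
def confidence_bar(confidence: float, width: int = 10) -> str:
    """Render confidence as a text bar plus an explicit percentage and label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    filled = round(confidence * width)
    label = "high" if confidence > 0.8 else "medium" if confidence > 0.5 else "low"
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {confidence:.0%} confident ({label})"


print(confidence_bar(0.9))
print(confidence_bar(0.62))
print(confidence_bar(0.2))
```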

7.3 Design Principles for Agentic UX

Drawing from Nielsen's usability heuristics (1994) and Amershi et al.'s guidelines for human-AI interaction (2019), adapted for agentic systems:

Visibility of agent state: Always show what the agent is currently doing, what it plans to do, and what it has done. Never leave the user wondering "what is the agent doing right now?" Uncertainty about agent state is the fastest way to erode trust.
User control and freedom: Provide clear mechanisms to pause, resume, cancel, undo, and modify agent behavior at any point. The user should always feel in control, even when the agent is acting autonomously.
Consistency and standards: Use familiar UI patterns (diffs, activity feeds, progress bars) rather than inventing novel interfaces. Agentic UX is challenging enough without adding interface learning curves.
Error prevention and recovery: Design the interface to make it easy to catch agent errors before they take effect. Show proposed changes before applying them. Provide undo for applied changes.
Recognition over recall: Show the agent's available capabilities rather than requiring the user to remember commands or prompts. A discoverable interface reduces the barrier to effective agent use.
Flexibility and efficiency: Support both novice users (guided, high-oversight mode) and expert users (streamlined, lower-oversight mode). The same agent might need different UX for different user skill levels.

7.4 Accessibility in Agentic Interfaces

Accessibility is not an afterthought; it is a fundamental design requirement. Agentic interfaces must be usable by people with diverse abilities, including visual, auditory, motor, and cognitive differences. The interactive and dynamic nature of agentic UIs creates specific accessibility challenges that go beyond standard web accessibility.

Screen reader compatibility. Activity feeds and real-time status updates must be announced to screen readers using ARIA live regions. When the agent takes an action, the screen reader should announce "Agent is now searching for authentication files" rather than silently updating the visual display. Use aria-live="polite" for routine updates and aria-live="assertive" for critical events that require immediate attention (like an approval request or error).

html

<!-- Example: Accessible agent activity feed -->
<div role="log" aria-live="polite" aria-label="Agent activity feed">
  <div role="status">Agent is reading src/middleware/auth.ts</div>
  <div role="status">Agent found potential issue in token validation</div>
</div>

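<!-- Approval request: use assertive because it requires action -->
<div role="alert" aria-live="assertive">
  <!-- The id gives aria-describedby on the buttons a real target -->
  <p id="approval-context">Agent is requesting approval to modify 3 files.</p>
  <button aria-describedby="approval-context">Approve</button>
  <button aria-describedby="approval-context">Reject</button>
</div>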

Keyboard navigation. Every action in the agentic interface (approve, reject, pause, resume, cancel, drill into details) must be accessible via keyboard. The approval buttons should be focusable with Tab and activatable with Enter or Space. The activity feed should be navigable with arrow keys.

Cognitive accessibility. Agentic interfaces can be overwhelming for users with cognitive differences. Key strategies include:

Clear, consistent layout that does not change unexpectedly
Simple language in agent status messages (avoid jargon)
Adjustable information density (let users choose between "simple" and "detailed" views)
Pause functionality that stops the agent's activity, giving the user time to process

Motor accessibility. Users with limited motor control may need more time to interact with approval interfaces. Ensure that approval timeouts are generous and configurable, and that "approve" buttons are large enough click targets (at least 44x44 CSS pixels, the target size in WCAG 2.1 Success Criterion 2.5.5, Level AAA).

Alternative output formats. Agent reasoning that is presented visually (confidence bars, plan diagrams) should have text alternatives. A confidence bar showing 85% should also say "85% confident" in text accessible to screen readers.

Key Insight: Agentic interfaces amplify accessibility challenges because they are dynamic, non-deterministic, and time-sensitive. A user who struggles with a rapidly updating activity feed will have a worse experience with an agentic tool than with traditional software. Building accessible agentic interfaces requires intentional design from the start, not a retroactive fix.

7.5 Common Misconceptions About Agentic UX

Misconception: "Chat is the best interface for agents." Chat is the simplest interface, not the best. For many tasks, a structured interface (plan view, diff view, dashboard) provides better oversight than a scrolling conversation. Chat forces information into a linear stream, but agent behavior is often hierarchical (plans within plans) and parallel (multiple actions at once). Use chat for communication and structured views for oversight.

Misconception: "More information means better oversight." Showing every detail of every agent action actually reduces oversight quality by overwhelming the human. The activity feed from Section 7.2.1 works because it uses progressive disclosure: summaries by default, details on demand. A wall of text is not transparency; it is noise.

Misconception: "Users will read the agent's reasoning." Most users skip the reasoning most of the time. Design your interface assuming that users will read only the summary and look at the action. The reasoning should be available but not required for routine approvals. Save detailed reasoning for high-risk or unusual actions where you need to force engagement.

098. Feedback Mechanisms: Corrections, Preferences, Demonstrations

8.1 Why Feedback Matters

An agent that never improves from experience is a static tool. Feedback mechanisms transform agents from static tools into learning systems that get better over time. Effective feedback loops allow:

Correction of specific errors (the agent learns not to repeat mistakes)
Communication of preferences and style (the agent learns the user's way of working)
Teaching of domain-specific knowledge (the agent learns the user's domain)
Calibration of behavior over time (the agent learns what level of detail, formality, and caution the user prefers)

8.2 Types of Feedback

8.2.1 Explicit Corrections

The user directly corrects the agent's output:

text

Agent: "I set the timeout to 30 seconds."
User:  "Change that to 60 seconds. Our API is slow."
Agent: "Updated to 60 seconds. I'll use 60s as the default for this API going forward."

The key challenge is generalization: the agent should learn from the correction, not just apply it to this one instance. Should it use 60 seconds for all APIs? Just this one API? All slow APIs? The generalization scope is often ambiguous, and getting it wrong can cause new problems.

A good practice is for the agent to explicitly state what it has generalized: "I'll use 60s as the default for the payments API going forward. Should I apply this to other APIs as well?" This makes the generalization visible and correctable.

8.2.2 Preference Signals

Thumbs up/down, ratings, or choices between alternatives:

text

Agent presents two code implementations:
  Option A: Using recursion
  Option B: Using iteration

User selects Option B.

Agent learns: "This user prefers iterative solutions over recursive ones."

Preference signals are low-effort for the user but provide less information than explicit corrections. They tell you what the user prefers but not why. The agent might generalize incorrectly: perhaps the user chose Option B not because they prefer iteration in general, but because this specific problem is better suited to iteration.

8.2.3 Demonstrations

The user shows the agent how to perform a task:

text

User: "Watch how I handle this type of customer complaint..."
[User demonstrates the workflow step by step]
Agent: "I understand. For complaints about delivery delays:
  1. Acknowledge the frustration
  2. Look up the order status
  3. Offer a specific resolution (refund or expedited reshipping)
  4. Follow up within 24 hours"

Demonstrations are high-effort but provide rich information about desired behavior. They are most effective for teaching complex workflows where the sequence of steps and the decision criteria are hard to describe in words.

8.2.4 Implicit Feedback

Behavioral signals that indicate satisfaction or dissatisfaction without the user explicitly providing feedback:

User accepts the agent's output without modification (positive)
User modifies the output significantly (negative)
User abandons the task (strongly negative)
User asks the agent to redo the task (negative)
User uses the output immediately (positive)

Implicit feedback is abundant (every interaction generates it) but noisy (the user might modify the output for reasons unrelated to quality, such as adding context the agent did not have).
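As a rough illustration, the signals listed above can be derived by comparing what the agent produced with what the user finally kept. This is a minimal sketch: the function name, the labels, and the 0.9 similarity threshold are assumptions chosen for the example, and text similarity says nothing about why the user made an edit.

```python
from difflib import SequenceMatcher
from typing import Optional


def implicit_signal(
    agent_output: str,
    final_text: Optional[str],
    minor_edit_threshold: float = 0.9,
) -> str:
    """
    Classify an interaction from the text the user ended up with.

    final_text is None when the user abandoned the task.
    The threshold is an illustrative assumption, not a tuned value.
    """
    if final_text is None:
        return "abandoned"       # strongly negative
    similarity = SequenceMatcher(None, agent_output, final_text).ratio()
    if similarity == 1.0:
        return "accepted"        # positive: used without modification
    if similarity >= minor_edit_threshold:
        return "minor_edit"      # weak signal: could be added context
    return "major_edit"          # negative: substantial rewrite


print(implicit_signal("timeout = 30 seconds", "timeout = 60 seconds"))
```

Because the signal is noisy, a label like `minor_edit` is better treated as weak evidence to aggregate over many interactions than as a verdict on a single output.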

8.3 Implementing Feedback Loops

python

"""
Feedback collection and application system for agents.

This module shows how to collect user feedback, generalize it into
reusable rules, and inject those rules into the agent's context
to personalize its behavior over time.
"""

from dataclasses import dataclass
from enum import Enum
from datetime import datetime


class FeedbackType(Enum):
    CORRECTION = "correction"
    PREFERENCE = "preference"
    DEMONSTRATION = "demonstration"
    RATING = "rating"
    IMPLICIT = "implicit"


@dataclass
class FeedbackRecord:
    feedback_type: FeedbackType
    timestamp: datetime
    task_context: str
    agent_output: str
    user_feedback: str
    generalized_rule: str | None = None


class FeedbackStore:
    """
    Stores and retrieves feedback for agent personalization.

    Feedback is stored as natural language rules that can be
    injected into the agent's context at inference time. This
    is a simple but effective approach: the agent's system prompt
    includes a section like "User Preferences" that contains
    learned rules from past feedback.
    """

    def __init__(self):
        self.records: list[FeedbackRecord] = []
        self.rules: list[str] = []

    def add_correction(self, context: str, agent_output: str, correction: str):
        """
        Record a user correction and attempt to generalize.

        In a production system, you would use an LLM to generalize
        the correction into a reusable rule. For example:
          Context: "Setting timeout for payments API"
          Agent output: "30 seconds"
          Correction: "60 seconds"
          Generalized rule: "For the payments API, use 60s timeout"
        """
        record = FeedbackRecord(
            feedback_type=FeedbackType.CORRECTION,
            timestamp=datetime.now(),
            task_context=context,
            agent_output=agent_output,
            user_feedback=correction,
        )
        self.records.append(record)

        # In practice, you might use an LLM to generalize the correction
        # into a reusable rule. Here we store it directly.
        rule = (
            f"When handling '{context}': "
            f"user prefers '{correction}' over '{agent_output}'"
        )
        record.generalized_rule = rule
        self.rules.append(rule)

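    def add_preference(self, context: str, chosen: str, rejected: str):
        """
        Record a preference between two options.

        The derived rule is scoped to this context so that a single
        choice is not over-generalized into a global preference.
        """
        record = FeedbackRecord(
            feedback_type=FeedbackType.PREFERENCE,
            timestamp=datetime.now(),
            task_context=context,
            agent_output=f"Chosen: {chosen}, Rejected: {rejected}",
            user_feedback=f"Preferred: {chosen}",
        )
        self.records.append(record)

        # Without a rule, the preference would be stored but would never
        # reach the prompt built by build_personalization_prompt().
        rule = (
            f"When handling '{context}': "
            f"user prefers '{chosen}' over '{rejected}'"
        )
        record.generalized_rule = rule
        self.rules.append(rule)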

    def get_relevant_rules(
        self, current_context: str, max_rules: int = 5
    ) -> list[str]:
        """
        Retrieve rules relevant to the current context.

        In a production system, this would use semantic similarity
        (e.g., embedding-based retrieval) to find the most relevant
        rules. Here we use a simplified recency-based approach that
        ignores current_context and returns the newest rules.
        """
        if max_rules <= 0:
            # Guard: self.rules[-0:] would return every rule.
            return []
        return self.rules[-max_rules:]

    def build_personalization_prompt(self, current_context: str) -> str:
        """
        Build a prompt section containing relevant user preferences
        to inject into the agent's context.

        This is the bridge between feedback collection and behavior
        change: learned rules become part of the agent's instructions.
        """
        rules = self.get_relevant_rules(current_context)
        if not rules:
            return ""

        lines = ["## User Preferences (learned from feedback)"]
        for rule in rules:
            lines.append(f"- {rule}")

        return "\n".join(lines)

The architecture is simple but powerful: feedback becomes rules, rules become part of the prompt, and the prompt shapes behavior. This in-context learning approach does not require retraining the model; it works by augmenting the system prompt with learned preferences.
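The recency-based retrieval in the listing ignores the current context. As a hedged sketch of context-aware retrieval, the function below ranks rules by word overlap (Jaccard similarity) with the current task. Word overlap is a crude stand-in for the embedding-based retrieval the docstring mentions, and `rank_rules` is a hypothetical helper, not part of the module above.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_rules(
    rules: list[str], current_context: str, max_rules: int = 5
) -> list[str]:
    """
    Return up to max_rules rules that share words with the context,
    ordered by Jaccard similarity. Rules with no overlap are dropped.
    """
    context_tokens = _tokens(current_context)
    scored = []
    for position, rule in enumerate(rules):
        rule_tokens = _tokens(rule)
        union = context_tokens | rule_tokens
        score = len(context_tokens & rule_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, position, rule))
    # Highest overlap first; among ties, the more recently added rule wins.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [rule for _, _, rule in scored[:max_rules]]


rules = [
    "For the payments API, use 60s timeout",
    "Prefer iterative solutions over recursion",
    "Use formal tone in emails",
]
print(rank_rules(rules, "Setting timeout for payments API"))
```

Dropping zero-overlap rules keeps unrelated preferences out of the prompt, at the cost of missing rules that are relevant but phrased with different words.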

8.4 RLHF and Preference Learning for Agents

The feedback mechanisms described above operate at the application level: they change the agent's behavior by modifying its context (system prompt), not its underlying model. But there is a deeper level of feedback that operates on the model itself: Reinforcement Learning from Human Feedback (RLHF).

RLHF, as applied by Ouyang et al. (2022) to train InstructGPT, works in three stages:

Supervised fine-tuning (SFT): Train the model on high-quality demonstrations of desired behavior.
Reward model training: Collect human comparisons ("output A is better than output B") and train a reward model to predict human preferences.
Reinforcement learning: Use the reward model to further fine-tune the SFT model with PPO (Proximal Policy Optimization) or similar algorithms.
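To make the second stage concrete, the reward model is typically trained with a pairwise loss: the negative log-sigmoid of the reward margin between the preferred and the rejected output. The sketch below computes that loss for a single comparison from two scalar rewards. The function name is hypothetical, and a real implementation would operate on batches of model outputs rather than plain floats.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """
    Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output well above the rejected one, and large when it ranks them
    the wrong way around.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(pairwise_reward_loss(2.0, 0.0))  # correct ranking: low loss
print(pairwise_reward_loss(0.0, 2.0))  # wrong ranking: high loss
```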

For agentic systems, RLHF is more complex than for simple chatbots because the "output" is not just text but a sequence of actions:

Action-level RLHF. Instead of comparing two text outputs, compare two action trajectories: "The agent took path A (search, read file, edit) vs. path B (search, search again, edit). Which trajectory was better?" This requires humans to evaluate multi-step plans, which is harder and more time-consuming than evaluating single outputs.

Outcome-based feedback. Instead of evaluating each action, evaluate the final outcome. Did the task get completed? Was the user satisfied? This is easier for humans (judging an outcome is simpler than judging a process) but provides less granular signal for learning.

Constitutional AI for agents. Anthropic's Constitutional AI (Bai et al., 2022) approach, where the model critiques and revises its own outputs based on a set of principles, can be extended to agent actions. The agent proposes a plan, a separate evaluation pass checks the plan against principles ("Does this plan respect user privacy? Does it avoid unnecessary actions? Is it the simplest path to the goal?"), and the plan is revised before execution.

Direct Preference Optimization (DPO). Rafailov et al. (2023) introduced DPO as a simpler alternative to RLHF that avoids the need for a separate reward model and RL training. For agents, DPO could be applied to pairs of action trajectories, directly optimizing the model to prefer better trajectories without the complexity of PPO.
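For a single preference pair, the DPO objective compares how much the policy has shifted probability toward the preferred output, relative to a frozen reference model. The sketch below applies it to trajectories by summing per-step log-probabilities. Treating a trajectory's log-probability as the sum over its steps, the function names, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def trajectory_logprob(step_logprobs: list[float]) -> float:
    """Log-probability of a whole trajectory as the sum over its steps."""
    return sum(step_logprobs)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """
    DPO loss for one (chosen, rejected) pair of trajectory log-probs:

        -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])

    beta controls how strongly the policy is pushed away from the
    reference model; 0.1 here is only an illustrative default.
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Path A (search, read file, edit) was preferred over path B.
path_a = trajectory_logprob([-0.2, -0.5, -0.3])
path_b = trajectory_logprob([-0.2, -0.9, -0.4])
print(dpo_loss(path_a, path_b, ref_chosen=-1.2, ref_rejected=-1.3))
```

No reward model appears anywhere in the computation, which is the simplification DPO offers over the three-stage pipeline.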

Key Insight: RLHF and preference learning are how foundation models are trained to be helpful and harmless. But for agent-specific behavior (tool selection, plan quality, escalation decisions), application-level feedback (stored as rules in the prompt) is often more practical than model-level training. Most organizations cannot retrain the foundation model but can customize the system prompt. The feedback store pattern from Section 8.3 is the practical approach for most teams.

8.5 Challenges in Feedback-Based Learning

Feedback sparsity: Users provide feedback rarely. Most agent outputs receive no explicit feedback. This means each piece of feedback must be generalized carefully because you cannot afford to waste it.
Conflicting feedback: Different users (or the same user at different times) may provide contradictory feedback. The agent might learn "use formal language" from one user and "be casual" from another. Multi-user systems need per-user preference profiles.
Generalization errors: A correction for a specific case may be incorrectly generalized. The user corrects a timeout value for one API, and the agent generalizes it to all APIs. Over-generalization is as harmful as under-generalization.
Feedback latency: The user may not realize the output was wrong until much later. By then, the context is lost, making it harder to associate feedback with the specific action that caused the problem.
Reward hacking: The agent may learn to optimize for positive feedback rather than actual quality. For example, it might produce outputs that look good (well-formatted, confident-sounding) but are subtly incorrect, because such outputs receive positive implicit feedback (the user does not notice the error and accepts the output).
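The conflicting-feedback problem suggests keeping preferences per user and recording contradictions instead of silently overwriting them. The following is a minimal sketch under that assumption. `PreferenceProfile` and `ProfileStore` are hypothetical names, and "latest value wins" is one possible policy among several.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PreferenceProfile:
    """Preferences for one user, plus a log of contradictory updates."""
    values: dict[str, str] = field(default_factory=dict)
    conflicts: list[tuple[str, str, str]] = field(default_factory=list)

    def set(self, key: str, value: str) -> None:
        old = self.values.get(key)
        if old is not None and old != value:
            # Keep (key, old, new) so the change can be reviewed or
            # confirmed with the user rather than lost.
            self.conflicts.append((key, old, value))
        self.values[key] = value  # policy assumption: latest value wins


class ProfileStore:
    """Keeps one profile per user so preferences never leak across users."""

    def __init__(self) -> None:
        self._profiles: defaultdict[str, PreferenceProfile] = defaultdict(
            PreferenceProfile
        )

    def record(self, user_id: str, key: str, value: str) -> None:
        self._profiles[user_id].set(key, value)

    def get(self, user_id: str, key: str) -> Optional[str]:
        profile = self._profiles.get(user_id)
        return profile.values.get(key) if profile else None

    def conflicts(self, user_id: str) -> list[tuple[str, str, str]]:
        profile = self._profiles.get(user_id)
        return list(profile.conflicts) if profile else []


store = ProfileStore()
store.record("alice", "tone", "formal")
store.record("bob", "tone", "casual")
store.record("alice", "tone", "casual")  # contradicts Alice's earlier feedback
print(store.get("alice", "tone"), store.conflicts("alice"))
```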

9. The Role of Human Oversight in High-Stakes Domains

9.1 Domain-Specific Requirements

Different domains have different oversight requirements, shaped by regulation, tradition, and the consequences of error:

Healthcare. Medical diagnosis and treatment recommendations require physician oversight. The EU AI Act classifies many healthcare AI systems (for example, AI that is part of a regulated medical device) as high-risk, which brings mandatory human oversight requirements. Agents in healthcare should present findings and recommendations, never make final decisions. The physician retains full decision authority.

Legal. Legal research agents can find relevant cases and statutes, but legal conclusions must come from licensed attorneys. Attorney-client privilege and duty-of-care obligations require human involvement. An agent that provides legal advice without attorney review could create malpractice liability.

Finance. Algorithmic trading systems have long operated with varying degrees of autonomy, but regulatory frameworks (MiFID II in the EU, SEC regulations in the US) require human oversight mechanisms, kill switches, and audit trails. Regulations adopted after the 2010 "flash crash" specifically address the risks of autonomous trading systems.

Education. AI tutoring systems can adapt to student needs, but educational decisions (grading, advancement, intervention) should involve human educators who understand the student's full context. A tutoring agent that accelerates a student through material they have not truly mastered can be counterproductive.

Criminal justice. Risk assessment tools used in bail, sentencing, and parole decisions have been extensively criticized for bias and lack of transparency. Dressel and Farid (2018) showed that simple models matched the accuracy of commercial tools like COMPAS, questioning the need for complex, opaque systems. Human judges must retain full decision authority, with AI serving only as one input among many.

9.2 The Automation Paradox

Ironically, the more reliable an automated system becomes, the harder it is for humans to maintain effective oversight. This is called the "automation paradox" or "irony of automation" (Bainbridge, 1983):

When the system works well 99% of the time, the human's attention lapses
The rare failures are precisely the cases where human intervention is most needed
But the human is least prepared to intervene because they have not been actively engaged
Skill degradation occurs: the human loses the ability to perform the task manually

This paradox is well-documented across multiple industries, with sobering real-world examples:

Aviation: Air France Flight 447 (2009). When the Airbus A330's pitot tubes iced over during a flight from Rio de Janeiro to Paris, the autopilot disengaged. The pilots, who had been monitoring the autopilot for hours, were suddenly required to hand-fly the aircraft in a complex situation. Disoriented and undertrained for manual high-altitude flying, they made a series of errors that led to a fatal stall. The aircraft had functioned perfectly for thousands of flights; the pilots' manual flying skills had atrophied because they so rarely needed them.

Automotive: Tesla Autopilot incidents (2016-present). Multiple fatal crashes have occurred when Tesla's Autopilot system encountered situations it could not handle (a crossing truck, a highway barrier, a stopped emergency vehicle), and the driver, who had been passively monitoring, failed to take over in time. The pattern is consistent: the driver stops paying attention because the system works well most of the time, and then the rare failure catches them unprepared. Crash investigations and media reports have described drivers who were distracted by phones or videos, asleep, or not even in the driver's seat while the system was engaged.

Aviation/Software: Boeing 737 MAX MCAS (2018-2019). The MCAS (Maneuvering Characteristics Augmentation System) was designed to automatically push the nose down in certain conditions. When a faulty angle-of-attack sensor triggered MCAS erroneously, pilots fought the system but were unable to override it effectively. Two crashes killed 346 people. A critical design failure was that the automation was designed to override the pilot's inputs, and the pilots were not adequately trained on how MCAS worked or how to disable it. The system's designers assumed the pilots would understand what was happening and take corrective action, but in practice, the pilots were confused by a system they barely knew existed.

Healthcare: Alert fatigue in clinical decision support. Hospitals that implemented automated drug interaction alerts found that physicians received so many alerts (most of which were clinically insignificant) that they began dismissing them all. Studies have reported override rates of up to 96%. When a genuinely dangerous interaction was flagged, it was dismissed along with the noise. This is the approval fatigue problem from Section 3.4 playing out in a life-or-death domain.

These examples share a common structure: a reliable automated system leads to human disengagement, which leads to catastrophic failure when the automation encounters its limits. This is not a technology problem; it is a human factors problem. The technology works as designed; the human-technology interaction fails.

For agentic AI, this means:

High-autonomy agents require deliberate strategies to keep humans engaged
Periodic manual task execution maintains human competence (practice)
Surprise testing (injecting known errors for the human to catch) maintains vigilance
Clear escalation triggers keep the human in the loop at critical moments
The interface should prevent the human from completely disengaging (not just allowing monitoring but requiring periodic engagement)
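Surprise testing can be sketched as planting known-bad "canary" items in the review queue and measuring how many the reviewer rejects. Everything below is illustrative: `ReviewItem`, `inject_canaries`, and `canary_catch_rate` are hypothetical names, and the injection rate is a parameter the lecture does not specify.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    item_id: str
    description: str
    is_canary: bool = False  # known-bad item planted to test vigilance


def inject_canaries(
    items: list[ReviewItem],
    canaries: list[ReviewItem],
    rate: float,
    rng: random.Random,
) -> list[ReviewItem]:
    """
    Build a review queue in which each real item is followed by a
    canary with probability `rate`, until the canaries run out.
    """
    queue: list[ReviewItem] = []
    pool = list(canaries)
    for item in items:
        queue.append(item)
        if pool and rng.random() < rate:
            queue.append(pool.pop())
    return queue


def canary_catch_rate(
    queue: list[ReviewItem], rejected_ids: set[str]
) -> Optional[float]:
    """Fraction of planted canaries the reviewer rejected (None if none planted)."""
    canary_ids = [item.item_id for item in queue if item.is_canary]
    if not canary_ids:
        return None
    caught = sum(1 for canary_id in canary_ids if canary_id in rejected_ids)
    return caught / len(canary_ids)


real = [ReviewItem(f"a{i}", f"routine action {i}") for i in range(20)]
planted = [ReviewItem("c1", "delete all backups", is_canary=True)]
queue = inject_canaries(real, planted, rate=0.1, rng=random.Random(7))
print(len(queue), canary_catch_rate(queue, rejected_ids={"c1"}))
```

A falling catch rate is an early warning that approvals are turning into rubber stamps. Canary items must never reach execution, so the code that carries out approved actions needs to filter on `is_canary`.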

9.3 Designing for Effective Oversight

Effective oversight requires more than just putting a human in the loop. The human must be equipped to make good decisions:

Appropriate information density: Too much information overwhelms; too little is insufficient. Show summaries with drill-down capability. The right amount depends on the task and the human's expertise.
Meaningful intervention points: Do not just alert the human; give them enough context and options to make a good decision quickly. "The agent wants to delete 5,000 records. Approve/Reject" is worse than "The agent wants to delete 5,000 customer records created before 2020 as part of the data retention cleanup. This matches the approved cleanup policy for inactive accounts. Here are 10 sample records for review. Approve/Reject/Review More."
Feedback on oversight quality: Track how often human overrides improve outcomes. If overrides consistently make things worse, the human may need training. If overrides consistently make things better, the agent's behavior may need improvement.
Shared mental model: The human needs to understand the agent's capabilities, limitations, and current state well enough to make informed oversight decisions. Building this shared mental model requires ongoing communication, not just a one-time training session.
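The "feedback on oversight quality" point can be sketched as a small tracker that records whether each human override turned out better than the agent's original proposal. The names, the minimum sample size, and the 0.3 and 0.7 thresholds below are assumptions for illustration. How "improved" gets judged (for example, by later review) is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass
class OverrideOutcome:
    action_id: str
    improved: bool  # judged later: was the override better than the agent's plan?


def override_quality(
    outcomes: list[OverrideOutcome],
    min_samples: int = 20,
    low: float = 0.3,
    high: float = 0.7,
) -> str:
    """
    Summarize what the override history suggests.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if len(outcomes) < min_samples:
        return "insufficient_data"
    improved_rate = sum(o.improved for o in outcomes) / len(outcomes)
    if improved_rate >= high:
        # Overrides usually help: the agent's behavior needs work.
        return "agent_needs_improvement"
    if improved_rate <= low:
        # Overrides usually hurt: the reviewer may need training or context.
        return "reviewer_needs_support"
    return "mixed"


history = [OverrideOutcome(f"act-{i}", improved=(i % 5 != 0)) for i in range(25)]
print(override_quality(history))
```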

Key Insight: The automation paradox means that the better your agent works, the worse your human oversight becomes, unless you deliberately design against this tendency. Effective oversight is not a natural byproduct of putting a human in the loop; it is an engineering challenge that requires specific design attention.

10. Common Misconceptions

Misconception: "Human-in-the-loop means the system is safe." Putting a human in the loop only makes the system safe if the human can and does effectively evaluate each decision. A human who rubber-stamps every approval, or who lacks the expertise to evaluate the agent's proposals, provides no safety benefit. The system has the cost of human oversight (delay, labor) without the benefit (error catching). Safety comes from effective oversight, not merely from the presence of a human.

Misconception: "More automation is always better." The optimal level of automation depends on the specific context. For some tasks, full automation is optimal. For others, human judgment adds irreplaceable value. The goal is not to automate everything but to automate the right things at the right level. A customer service agent that handles routine queries autonomously and escalates complex ones achieves better outcomes than either full automation or full human handling.

Misconception: "Trust should be binary -- either trust the agent or do not." As Section 5 discussed, trust should be calibrated per task, per context, and per agent capability. An agent can be highly trusted for code formatting and not trusted at all for security-sensitive decisions. Treating trust as binary leads to either over-trust (trusting the agent on everything because it is good at some things) or under-trust (not trusting it at all because it once failed).

Misconception: "Users prefer full control." Research consistently shows that users prefer the right amount of control for the task. For low-stakes, repetitive tasks, users actively prefer automation (they do not want to click "approve" 500 times). For high-stakes, novel tasks, they want control. The key is matching control to stakes, not maximizing control.

Misconception: "The automation paradox can be solved with training." Training helps, but it is not sufficient. The automation paradox is driven by fundamental human cognitive tendencies (attention lapses during prolonged monitoring), not by lack of knowledge. You can train pilots extensively, but if they have not hand-flown in months, their skills will still have degraded. The solution requires system design (periodic engagement requirements, varied tasks, surprise testing), not just training.

Misconception: "Feedback from users will naturally improve the agent over time." Feedback only improves the agent if it is collected systematically, generalized correctly, and applied appropriately. Unstructured feedback ("this is wrong") without context is hard to act on. Feedback that is collected but never analyzed provides no benefit. And feedback that is over-generalized can make the agent worse. Feedback systems require deliberate design, not just a thumbs-up/thumbs-down button.

11. Discussion Questions

The autonomy dilemma: Organizations want agents to be autonomous (for efficiency) but also safe (requiring oversight). How do you design a system where increasing autonomy and increasing safety are not in tension? Is this possible, or is there a fundamental tradeoff?

Hint: Consider how progressive autonomy and exception-based approval can increase both autonomy (for routine cases) and safety (by focusing human attention on exceptions). Also consider how better monitoring might enable higher autonomy with maintained safety.
Approval fatigue: In a HITL system processing thousands of agent actions per day, humans may start rubber-stamping approvals. How can you design the approval workflow to keep humans engaged and thoughtful? Consider insights from psychology and behavioral economics.

Hint: Think about variable-ratio reinforcement (unpredictable challenges maintain engagement), progressive disclosure (showing just enough information), and social accountability (knowing that your approval rate is tracked).
Trust transfer: If you trust an agent on one type of task (e.g., code formatting), should that trust transfer to a different type of task (e.g., security review)? How should the interface communicate task-specific trust levels?

Hint: Trust should generally NOT transfer across task types, because the agent's reliability varies by task. But users naturally generalize trust. How do you design an interface that communicates "I am reliable at formatting but less reliable at security review"?
The colleague model in practice: Consider a software development team where one "member" is an AI agent. How does this change team dynamics? What protocols would you establish for agent-human collaboration? How do you handle disagreements between the agent and a human team member?

Hint: Think about how teams currently handle disagreements between human members. How would code review work if one reviewer is an agent? What happens when the agent and a human disagree about an architectural decision?
Oversight in adversarial settings: How do you design human oversight for agents operating in adversarial environments (e.g., cybersecurity)? The attacker may try to manipulate the agent to generate actions that look reasonable to the human overseer but are actually harmful.

Hint: This is a variant of the indirect prompt injection problem from last week, but applied to the oversight interface. What if the attacker crafts inputs designed not just to fool the agent, but to fool the human reviewing the agent's actions?

12. Summary and Key Takeaways

Three paradigms govern human-agent interaction: human-in-the-loop (every action approved), human-on-the-loop (monitoring with intervention), and fully autonomous. Most production systems use hybrid approaches with per-action granularity, adjusting oversight based on the reversibility and risk of each specific action.
Autonomy levels (analogous to self-driving levels 0-5) provide a framework for deciding how much independence to give an agent. The choice depends on task criticality, predictability, agent maturity, recovery cost, and regulatory requirements. Progressive autonomy (starting with heavy oversight and reducing it as the agent demonstrates reliability) is a practical pattern.
Approval workflows must balance safety with efficiency. Key patterns include batched approvals, plan-level approval, exception-based approval, and tiered approval. Good UX design and anti-rubber-stamping measures prevent approval fatigue while maintaining genuine oversight.
Transparency operates at four levels: action transparency (what), reasoning transparency (why), alternative transparency (what else was considered), and uncertainty transparency (how confident). Chain-of-thought reasoning provides transparency but may not be fully faithful to the model's actual decision process.
Trust calibration aims to align user confidence with agent reliability. Both over-trust (complacency) and under-trust (aversion) are costly. Communicating uncertainty, showing track records, progressive disclosure of competence, and honest failure reporting support calibrated trust.
Two collaboration models -- assistant and colleague -- offer different tradeoffs. The colleague model (emerging in software engineering) enables richer collaboration but raises new challenges around initiative, disagreement, and shared accountability.
UX design for agents must address non-deterministic behavior, long-running tasks, invisible work, and the need for intervention. Core patterns include activity feeds (what is happening now), plan views (what is the overall strategy), diff views (what specific changes are proposed), and confidence indicators (how certain is the agent).
Feedback mechanisms -- corrections, preferences, demonstrations, and implicit signals -- enable agents to improve over time. The main challenges are feedback sparsity, conflicting feedback, generalization errors, and the risk of reward hacking. A practical approach stores feedback as natural language rules injected into the agent's context.
High-stakes domains require careful oversight design that accounts for the automation paradox: as systems become more reliable, human oversight becomes harder to maintain, precisely when the rare failures most need human intervention. Real-world incidents (Air France 447, Tesla Autopilot crashes, Boeing 737 MAX) demonstrate the catastrophic consequences of this paradox. Deliberate design is required to keep humans engaged.
Accessibility in agentic interfaces is a fundamental requirement, not an afterthought. Dynamic, real-time, and time-sensitive interfaces create specific challenges for users with visual, motor, cognitive, or auditory differences. Screen reader compatibility, keyboard navigation, adjustable information density, and generous timeouts are essential design elements.
RLHF and preference learning provide mechanisms for agents to improve from human feedback at both the model level (training the underlying LLM) and the application level (storing learned rules in the system prompt). For most teams, application-level feedback (stored as natural language rules injected into the prompt) is more practical than retraining the foundation model.
Common misconceptions about human oversight can lead to poorly designed systems. Human-in-the-loop is not automatically safe; more automation is not always better; trust should not be binary; feedback does not automatically improve agents; and the automation paradox cannot be solved by training alone. Recognizing these misconceptions is the first step toward avoiding them.

13. References

Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., ... & Horvitz, E. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-13.
Bainbridge, L. (1983). Ironies of automation. Automatica, 19(6), 775-779.
Dressel, J., & Farid, H. (2018). The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), eaao5580.
Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50-80.
Nielsen, J. (1994). 10 usability heuristics for user interface design. Nielsen Norman Group.
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics -- Part A, 30(3), 286-297.
SAE International. (2021). SAE J3016: Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles.
Shneiderman, B. (2022). Human-Centered AI. Oxford University Press.
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Endsley, M. R. (2017). From here to autonomy: Lessons learned from human-automation research. Human Factors, 59(1), 5-27.
National Transportation Safety Board. (2019). Assumptions used in the safety assessment process and the effects of multiple alerts and indications on pilot performance. Aviation Safety Recommendation Report NTSB/ASR-19/01.

These lecture notes are part of the Agentic AI course. Licensed under CC BY 4.0.