Systems | W13 | 47 min read

Agentic Workflows in Software Engineering

From autocomplete to closed-loop coding agents. Tools landscape (Claude Code, Cursor, Copilot, Devin, SWE-agent). Integration in CI/CD, SWE-bench results, where agents shine and where they break. Code review and security review of AI-generated code.

Core concepts: Closed-loop iteration, SWE-bench, Code review

Duration: 2 hours lecture + 1 hour lab
Prerequisites: Weeks 1-12 (foundations through human-agent interaction)

01 Learning Objectives

By the end of this lecture, students will be able to:

Describe the landscape of AI coding agents and their capabilities
Explain how agentic workflows differ from simple code completion
Design effective prompts and workflows for coding agents on non-trivial tasks
Identify where agents fit into CI/CD pipelines and the software development lifecycle
Critically evaluate the limitations and risks of agent-assisted software engineering
Apply best practices for code review, testing, and security when working with AI agents
Analyze case studies of agentic software engineering in real organizations
Compare agent development frameworks and select the appropriate one for a given use case
Explain why agent observability and evaluation are critical for production deployments
Describe how computer use and browser agents extend automation beyond APIs

02 1. AI Agents for Code Generation and Review

1.1 Why This Matters: Software Engineering is Being Transformed

We have spent the past twelve weeks building a deep understanding of agentic AI: the foundations, the tools, the memory systems, the planning mechanisms, the safety measures, and the human interaction patterns. This week, we turn to one of the most immediately impactful applications of all that theory: agentic software engineering.

Software engineering is arguably the domain where agentic AI has made the most progress and had the most impact. There is a good reason for this: writing code is a task that is amenable to automation in ways that many other knowledge work tasks are not. Code has clear correctness criteria (it compiles, tests pass, it produces the right output), it is highly structured (following syntax rules and established patterns), and there is an enormous corpus of training data (open-source code on GitHub). Moreover, the feedback loop is fast: you can run the code and immediately see whether it works.

But the transition from "AI that suggests code completions" to "AI that autonomously solves software engineering tasks" is not just a quantitative improvement. It is a qualitative shift that changes the nature of the developer's job. Understanding this shift, its possibilities and its pitfalls, is essential for anyone who will work in software engineering in the coming years.

1.2 From Autocomplete to Autonomous Coding: A Brief History

The evolution of AI-assisted programming has moved through four distinct stages, each representing a fundamental increase in capability:

Stage 1 -- Statistical autocomplete (pre-2021). Tools like IntelliSense and early TabNine used statistical models (n-gram models, simple neural networks) to predict the next few tokens. The suggestions were limited to completing the current line or small snippets. The developer was firmly in control, and the tool's contribution was limited to saving keystrokes.

Think of Stage 1 like predictive text on your phone: it guesses the next word based on what you have typed so far. Useful, but not creative. It could not write a sentence you were not already writing.

Stage 2 -- Neural code completion (2021-2023). GitHub Copilot (powered by OpenAI Codex, itself based on GPT-3) and Amazon CodeWhisperer introduced transformer-based code generation. These tools could generate multi-line code blocks, entire functions, and boilerplate. But they were fundamentally reactive: they completed what the developer was already writing. The developer wrote a function signature and docstring; the model generated the body.

Stage 2 was like having a very fast typist sitting next to you who could turn your outlines into prose. You still had to know what you wanted to write, but the mechanical effort of writing it was reduced.

Stage 3 -- Conversational code generation (2023-2024). ChatGPT, Claude, and similar systems enabled developers to describe what they wanted in natural language and receive code in response. This was a shift from completion to creation. A developer could say "write a REST API for user management with authentication" and receive a working implementation. But the interaction was still single-turn or short multi-turn: the developer asked, the model responded, and there was limited feedback or iteration.

Stage 3 was like hiring a freelance developer: you described what you wanted, they delivered code, and you reviewed it. But they did not have access to your codebase, your tests, or your deployment pipeline.

Stage 4 -- Agentic coding (2024-present). Coding agents like Claude Code, Devin, Cursor Agent Mode, and GitHub Copilot Workspace can autonomously execute multi-step software engineering tasks. They read codebases, understand project structure, write code, run tests, debug failures, and iterate. The developer's role shifts from "writing code" to "specifying intent and reviewing results."

Stage 4 is like having a junior developer on your team: you give them a task, they figure out the approach, write the code, run the tests, fix the failures, and come back to you with a completed implementation for review. They can ask you questions when they are stuck, and they incorporate your feedback as they iterate.

Key Insight: The key difference between Stage 3 and Stage 4 is the closed loop. A Stage 3 system generates code and gives it to you. A Stage 4 system generates code, runs it, observes the errors, fixes them, runs it again, and iterates until the tests pass. This closed loop between action and observation is exactly what makes an "agent" as we defined it in Week 1.
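The closed loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular tool is implemented: `closed_loop`, `propose`, and `run_tests` are hypothetical names, and the "model" is a toy stand-in that returns a wrong answer first and a corrected one once it receives failure feedback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RunResult:
    passed: bool
    log: str  # what the agent "observes" after running the code


def closed_loop(
    propose: Callable[[str, str], str],     # (task, feedback) -> candidate code
    run_tests: Callable[[str], RunResult],  # candidate code -> test outcome
    task: str,
    max_iters: int = 5,
) -> Tuple[Optional[str], int]:
    """Generate, run, observe, and retry until tests pass or the budget runs out."""
    feedback = ""
    for attempt in range(1, max_iters + 1):
        candidate = propose(task, feedback)  # act
        result = run_tests(candidate)        # observe
        if result.passed:
            return candidate, attempt
        feedback = result.log                # errors feed the next attempt
    return None, max_iters


def toy_propose(task: str, feedback: str) -> str:
    # Stand-in for a model call: the first attempt is wrong, the retry is right.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_run_tests(code: str) -> RunResult:
    namespace: dict = {}
    exec(code, namespace)  # toy only: real agents run code in a sandbox
    got = namespace["add"](2, 3)
    if got == 5:
        return RunResult(True, "ok")
    return RunResult(False, f"add(2, 3) returned {got}, expected 5")


code, attempts = closed_loop(toy_propose, toy_run_tests, "implement add(a, b)")
print(attempts)
```

A Stage 3 system corresponds to calling `propose` once and stopping. The loop, and the `feedback` that flows back into it, is the part that Stage 4 adds.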

1.3 What Makes a Coding Agent "Agentic"?

A coding agent is not just a code generator. It has five distinguishing characteristics that correspond to the agent components we studied throughout this course:

Tool access (Weeks 5-6): Can read files, write files, run commands, execute tests, browse documentation, search codebases. These are the agent's "hands."
Environment awareness (Week 7): Understands the project structure, dependencies, frameworks, coding conventions, and existing patterns. This is the agent's "memory" of the project context.
Planning (Weeks 8-9): Can break a task into steps and execute them sequentially. A task like "fix this bug" requires: understanding the bug, finding the relevant code, diagnosing the cause, implementing the fix, writing a test, and verifying the fix.
Feedback loops (Week 1): Can run code, observe errors, and iterate on fixes. This observe-think-act cycle is the core of agency.
Memory (Week 7): Maintains context across a multi-step task (and sometimes across sessions). The agent remembers what files it has read, what approaches it has tried, and what feedback the user has given.

1.4 The Landscape of Coding Agents (as of 2026)

The coding agent landscape has diversified rapidly. Here is an overview of the major tools:

Agent	Type	Key Features
Claude Code (Anthropic)	CLI-based agentic coding	Full codebase understanding, terminal access, agentic loop
GitHub Copilot (GitHub/OpenAI)	IDE-integrated + Workspace	Code completion, chat, workspace for multi-file changes
Cursor (Cursor Inc.)	IDE with agent mode	AI-first IDE, agent mode for autonomous multi-step tasks
Devin (Cognition Labs)	Fully autonomous agent	Browser-based, sets up environments, writes and deploys code
Windsurf (Codeium)	IDE with agentic features	Cascade mode for multi-file edits
Aider (Open source)	CLI pair programming	Git-aware, multi-file editing, multiple LLM providers
OpenHands (Open source)	Autonomous coding agent	Web browsing, code, commands in sandboxed environment
SWE-agent (Princeton NLP)	Research agent	Designed for SWE-bench, resolves GitHub issues

1.5 Benchmarking Coding Agents

SWE-bench (Jimenez et al., 2024) is the standard benchmark for evaluating coding agents. It consists of real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, and others), paired with test cases that validate the fix. The benchmark is significant because it uses real bugs from real projects, not synthetic puzzles.

SWE-bench Verified is a human-validated subset of 500 problems that eliminates ambiguous or poorly specified tasks. As of early 2026, the top-performing agents solve approximately 50-60% of these problems autonomously.

To put this in perspective: these are real bugs that human developers encountered, reported, and eventually fixed. Solving 50-60% of them autonomously is remarkable. But it also means that 40-50% of these benchmark problems remain beyond the capabilities of current agents, and the unsolved problems tend to be the harder ones requiring deeper understanding.

Key Insight: Coding agents are powerful assistants but not autonomous replacements for developers. They excel at well-defined, medium-complexity tasks (fixing a specific bug, adding a feature with clear requirements, writing tests for existing code). They struggle with deep architectural understanding, novel algorithms, ambiguous requirements, and cross-cutting concerns that span the entire system.

1.6 SWE-bench Scores: A Quantitative Comparison

The following table provides quantitative benchmarks for major coding agents as of early 2026. These numbers change rapidly as models and agent architectures improve, but they provide a useful snapshot of relative capabilities.

Agent / System	SWE-bench Verified (%)	SWE-bench Full (%)	Notes
Claude Code (Claude 3.5 Sonnet)	~49%	~33%	CLI-based, full terminal access
Claude Code (Claude 3.5 Opus)	~55%	~38%	Improved reasoning model
OpenHands + Claude 3.5	~53%	~36%	Open-source, sandboxed environment
Devin	~48%	~31%	First autonomous agent to gain wide attention
SWE-agent + GPT-4	~23%	~12%	Original research baseline
SWE-agent + Claude 3.5	~40%	~27%	Same architecture, better model
Aider + Claude 3.5	~45%	~26%	Lightweight CLI tool
Cursor Agent Mode	~42% (estimated)	N/A	IDE-integrated, limited public benchmarks
Amazon Q Developer	~35% (estimated)	N/A	AWS-integrated

Several patterns emerge from these numbers:

Model quality matters enormously. The same agent architecture (SWE-agent) jumps from 12% to 27% on SWE-bench Full just by switching from GPT-4 to Claude 3.5. The underlying model is the single largest factor in agent performance.
Agent architecture matters too, but less. Different agent architectures with the same model differ by roughly 5-15 percentage points. Tool design, planning strategy, and error recovery all contribute, but the model provides the floor.
Verified vs. Full. SWE-bench Verified scores are consistently higher than Full scores because the Verified subset removes ambiguous or poorly specified problems. The gap tells you how much real-world noise affects performance.
The frontier is ~55%. As of early 2026, the best agents solve roughly half of well-specified real-world bugs. This means that for every bug the agent fixes, there is roughly one it cannot fix. Human developers remain essential.
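The comparisons in the list above are simple differences in percentage points, and they can be recomputed from the table. The sketch below copies the approximate table values; treating the "Claude 3.5 Sonnet" and "Claude 3.5" rows as the same model is an assumption made here for illustration.

```python
# Approximate scores (percent) copied from the table above.
# Tuple order: (SWE-bench Verified, SWE-bench Full).
claude_35_rows = {
    "Claude Code": (49, 33),
    "OpenHands": (53, 36),
    "SWE-agent": (40, 27),
    "Aider": (45, 26),
}
swe_agent_gpt4 = (23, 12)

verified = [v for v, _ in claude_35_rows.values()]
full = [f for _, f in claude_35_rows.values()]

# Spread across architectures when the model is held (roughly) constant.
architecture_spread = (max(verified) - min(verified), max(full) - min(full))

# Gain from swapping the model while holding the architecture constant.
swe_agent_claude = claude_35_rows["SWE-agent"]
model_swap_gain = (
    swe_agent_claude[0] - swe_agent_gpt4[0],
    swe_agent_claude[1] - swe_agent_gpt4[1],
)

# Verified minus Full, per agent.
verified_full_gap = {name: v - f for name, (v, f) in claude_35_rows.items()}

print("architecture spread (Verified, Full):", architecture_spread)
print("model swap gain (Verified, Full):", model_swap_gain)
print("Verified - Full gap:", verified_full_gap)
```

On these numbers the model swap is worth 17 and 15 points, against an architecture spread of 13 and 10 points, so the two effects are closer in size than the prose alone might suggest.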

Key Insight: When evaluating coding agents, do not rely on marketing materials. Look at SWE-bench scores (especially Verified), but also test the agent on your own codebase. Performance on Python-heavy open-source repositories does not necessarily predict performance on your TypeScript/React/PostgreSQL stack.

Try It Yourself: Evaluate an Agent

If you have access to any coding agent (Claude Code, Cursor, Copilot, or even a free-tier chatbot), try this experiment:

Pick a small open-source project with good test coverage
Find a closed issue with a clear bug report
Ask the agent to fix the bug, giving it only the bug report (not the fix)
Compare the agent's fix to the actual human fix

Note: Where did the agent succeed? Where did it struggle? What information did it need that it did not have?

03 2. The Agent-Developer Workflow: Pair Programming with AI

2.1 Workflow Models

How developers work with coding agents varies significantly based on the task, the agent's capability, and the developer's preference. Four main models have emerged:

Model 1: Agent as Junior Developer

The human architect designs the system, then delegates implementation of specific components to the agent. The human reviews the output, provides feedback, and the agent iterates.

This model works well when:

The task is well-specified (clear input, clear output, clear constraints)
There are existing patterns to follow (the agent can imitate what is already in the codebase)
The human can effectively review the output (the task is within the human's domain expertise)

Think of it like delegating to an intern: you give them a clear brief, they do the work, you review it, and you provide corrections. The human's role is specification and quality assurance.

Model 2: Ping-Pong Pair Programming

The human and agent alternate contributions. The human writes a function signature, the agent implements the body. The agent writes a test, the human refines it. This tight collaboration loop keeps both parties engaged.

Example session:

text

Human: [Writes function signature and docstring]
Agent: [Implements the function body]
Human: [Reviews, suggests an edge case the agent missed]
Agent: [Adds edge case handling and writes tests]
Human: [Runs tests, spots a performance issue]
Agent: [Optimizes the implementation]
Human: [Approves the final version]

This model works well for exploratory development where the requirements emerge through the process of writing code. It keeps the human engaged (reducing the automation paradox from Week 12) while leveraging the agent's speed.

Model 3: Agent-First, Human-Review

The agent takes a complete task from start to finish. The human reviews the final result (or intermediate checkpoints). This is the most efficient model for well-defined tasks but requires the human to be a careful and thorough reviewer.

The risk of this model is that the human becomes a rubber stamp (the automation paradox again). If the agent usually produces good code, the human may stop reviewing carefully, and the one time the agent produces bad code is the one time the human does not catch it.

Model 4: Multi-Agent with Human Orchestration

Multiple specialized agents handle different aspects of a task: one for implementation, one for testing, one for documentation, one for security review. The human orchestrates the overall workflow.

This model is emerging but not yet mainstream. It leverages the multi-agent patterns we studied in Week 10 and is most useful for large, complex tasks where specialization adds value.

2.2 Effective Communication with Coding Agents

The quality of agent output depends heavily on the quality of the prompt. This is where the prompting techniques from earlier weeks pay off directly. Key principles:

Be specific about requirements. Instead of "add error handling," say "add try-except blocks for database connection errors, with retry logic (3 attempts, exponential backoff) and structured logging using the existing logger."

The difference is dramatic. The vague prompt leaves the agent to guess what kind of error handling you want, which logging framework to use, and whether to retry. The specific prompt eliminates guesswork and produces output that matches your needs.
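To make the specific prompt concrete, here is a minimal sketch of what an agent might produce for it. The names `call_with_retry` and `flaky_query` are hypothetical, the built-in `ConnectionError` stands in for whatever exception the real database driver raises, and `logging.getLogger("db")` stands in for "the existing logger". It reads "3 attempts" as three calls in total, which means two waits (1s, then 2s).

```python
import logging
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("db")  # stand-in for the project's existing logger


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempts, exc)
                raise  # do not swallow the final failure
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            logger.warning(
                "attempt %d/%d failed, retrying in %.1fs: %s",
                attempt, attempts, delay, exc,
            )
            sleep(delay)
    raise AssertionError("unreachable")


# Usage: an operation that fails twice, then succeeds.
calls: List[int] = []
waits: List[float] = []


def flaky_query() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("database unreachable")
    return "row"


result = call_with_retry(flaky_query, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a design choice for testability: the retry schedule can be checked without the test actually waiting.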

Provide context. Point the agent to relevant files, patterns, and documentation. "Follow the pattern in src/services/user_service.py for database access." This is especially important because agents (unlike human developers) do not have years of accumulated context about your codebase. Every piece of context you provide reduces the chance of the agent making a wrong assumption.

State constraints explicitly. "Do not modify the public API. Keep backward compatibility. Use only standard library dependencies." Constraints are things the agent should NOT do, and they are just as important as what it should do. Without explicit constraints, the agent might choose a solution that is optimal for the problem as stated but happens to break backward compatibility in a way you cannot accept.

Describe the 'why'. "We need this because the current implementation fails silently when the database is unreachable, causing data loss." The "why" helps the agent make judgment calls. If it understands that the goal is to prevent data loss, it will make different design choices than if it thinks the goal is to add logging.

Define done. "The task is complete when all existing tests pass, new tests cover the error scenarios, and the code follows our lint rules." Without a clear definition of done, the agent does not know when to stop iterating.

2.3 Walkthrough: Using Claude Code to Solve a Real Task

Let us walk through a complete, realistic example of using Claude Code to add a feature to an existing project. This walkthrough illustrates the agent-developer workflow in practice, including the back-and-forth that characterizes effective agent-assisted development.

The task: Add rate limiting to an existing Express.js API.

Step 1: Set the context. The developer opens Claude Code in the project directory and provides a clear, specific prompt:

text

Developer: "Add rate limiting to our API. Requirements:
- 100 requests per minute per IP address
- Return 429 Too Many Requests when the limit is exceeded
- Use a sliding window algorithm
- Store rate limit data in Redis (we already have a Redis connection
  in src/lib/redis.ts)
- Apply to all routes under /api/v2/
- Do not modify existing middleware; add new middleware
- Follow the patterns in src/middleware/ for code style"

Notice how the prompt follows the principles from Section 2.2: it states specific requirements, sets explicit constraints ("do not modify existing middleware"), and points to context ("patterns in src/middleware/"). The "why" is left implicit (protecting the API from abuse).

Step 2: The agent explores the codebase. Claude Code begins by reading relevant files to understand the project:

text

Agent: [reads src/middleware/auth.ts]        -- Understands middleware patterns
Agent: [reads src/lib/redis.ts]              -- Understands Redis connection
Agent: [reads src/app.ts]                    -- Understands middleware registration
Agent: [reads package.json]                  -- Checks existing dependencies
Agent: [searches for "rate" across codebase] -- Checks for existing rate limiting

This exploration phase takes seconds and gives the agent the context it needs. A human developer would do the same thing (look at existing code to understand patterns) but would take longer.

Step 3: The agent proposes and implements. Based on its exploration, the agent creates the implementation:

text

Agent: "I'll create a new rate limiting middleware following the pattern
in src/middleware/auth.ts. I'll use a Redis-backed sliding window
counter. Here's my plan:

1. Create src/middleware/rate-limit.ts (the middleware)
2. Add rate limit configuration to src/config.ts
3. Register the middleware in src/app.ts for /api/v2/ routes
4. Add tests in src/middleware/__tests__/rate-limit.test.ts

Let me implement this."

The agent writes the code, creates the files, and shows the developer the changes.

Step 4: The developer reviews and iterates. The developer notices an issue:

text

Developer: "This looks good, but you're using a fixed window, not a
sliding window. A fixed window resets at the start of each minute,
which allows burst traffic at the boundary. Use a sliding window
with Redis sorted sets instead."

This is the kind of correction that distinguishes the agent-developer workflow from autonomous coding. The developer's domain knowledge (understanding the difference between fixed and sliding window rate limiting) catches a subtle error that the agent's tests would not have caught, because both algorithms pass the basic tests.
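The boundary-burst problem the developer describes is easy to reproduce. The sketch below is an in-memory illustration, not the walkthrough's Redis implementation: the class names are hypothetical, a single client is tracked rather than one counter per IP, and the current time is passed in explicitly so the behavior is deterministic. A Redis version would typically keep the timestamps in a sorted set and trim old entries by score.

```python
from collections import deque
from typing import Deque, Optional


class FixedWindowLimiter:
    """Counts requests per calendar window; the count resets at each boundary."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self.window_id: Optional[int] = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            self.window_id, self.count = window_id, 0  # new window, reset
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Keeps timestamps of accepted requests from the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self.hits: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_s
        while self.hits and self.hits[0] <= cutoff:
            self.hits.popleft()  # drop requests older than the window
        if len(self.hits) >= self.limit:
            return False
        self.hits.append(now)
        return True


def accepted_in_burst(limiter) -> int:
    # 100 requests at t=59s, then 100 more at t=61s: 200 requests in 2 seconds.
    times = [59.0] * 100 + [61.0] * 100
    return sum(limiter.allow(t) for t in times)


print(accepted_in_burst(FixedWindowLimiter(limit=100)))    # 200
print(accepted_in_burst(SlidingWindowLimiter(limit=100)))  # 100
```

Both limiters would pass a basic test such as "the 101st request within one minute is rejected", which is why the agent's test suite did not expose the difference.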

Step 5: The agent iterates. Claude Code revises the implementation based on the feedback, runs the tests, and presents the updated version.

Step 6: Final verification. The developer reviews the final implementation, runs the full test suite, and approves the changes.

text

Developer: "Looks good. Run the tests and commit."
Agent: [runs npm test -- all pass]
Agent: [creates commit: "feat(api): add sliding window rate limiting
        for /api/v2/ routes"]

What made this effective:

The developer provided specific, well-structured requirements
The agent explored the codebase before writing code
The developer caught a subtle algorithmic issue the agent missed
The iteration was quick (minutes, not hours)
The agent handled the mechanical work (writing boilerplate, setting up tests, creating the commit) while the developer focused on correctness

What could have gone wrong:

If the developer had not specified "sliding window," the agent's fixed window implementation would have met the prompt as written: technically working, suboptimal, and with no stated requirement to check it against
If the developer had not reviewed the algorithm, the subtle difference would have reached production
If the developer had said "add rate limiting" without context, the agent might have chosen a completely different approach (e.g., an in-memory rate limiter that does not work across server instances)

2.4 Prompt Engineering for Coding Agents: Before and After

The quality of the developer's prompt dramatically affects the quality of the agent's output. Here are three before/after examples that illustrate effective prompt engineering for coding agents:

Example 1: Adding error handling

Bad prompt:

text

"Add error handling to the database module."

Good prompt:

text

"Add error handling to src/db/queries.ts. Specifically:
- Wrap all database calls in try-catch blocks
- On connection errors (ECONNREFUSED, ETIMEDOUT), retry up to 3 times
  with exponential backoff (1s, 2s, 4s)
- On query errors, log the error with our existing logger
  (import from src/lib/logger) and throw a custom DatabaseError
  (create it in src/errors.ts following the pattern of ApiError)
- Do not catch and swallow errors silently
- Add unit tests for the retry logic"

The bad prompt leaves everything to the agent's judgment. The good prompt specifies the error types, retry strategy, logging mechanism, error class pattern, and testing expectations.

Example 2: Fixing a bug

Bad prompt:

text

"The search is broken. Fix it."

Good prompt:

text

"Users report that searching for 'café' returns no results, but
searching for 'cafe' works. The search endpoint is in
src/api/search.ts, and it queries Elasticsearch. I suspect the
issue is with Unicode normalization or accent handling. Check the
Elasticsearch mapping in src/config/elasticsearch.ts and the
query builder in src/lib/search-query.ts. The fix should handle
all common accented characters, not just this one case."

The bad prompt gives no information about the symptom, location, or expected behavior. The good prompt provides a specific reproduction case, likely cause, relevant files, and guidance on the scope of the fix.
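The suspected cause in the good prompt can be demonstrated without Elasticsearch. The sketch below uses only Python's standard `unicodedata` module. `fold_accents` is a hypothetical helper, and in the scenario above the real fix would more likely live in the search index's analyzer configuration than in application code.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Normalize, drop combining marks, and casefold for accent-insensitive matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


# The same visible word can be stored as two different code point sequences.
composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(fold_accents("Café") == fold_accents("cafe"))          # True
```

If the indexed text and the query are normalized differently, or only one side has accents folded, an exact-match lookup fails in the way the bug report describes.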

Example 3: Implementing a feature

Bad prompt:

text

"Add authentication."

Good prompt:

text

"Add JWT authentication to the Express API. Use the existing User
model in src/models/user.ts. Requirements:
- POST /auth/login accepts email + password, returns JWT
- JWT expires in 1 hour, refresh token expires in 7 days
- Store refresh tokens in Redis (connection in src/lib/redis.ts)
- Protect all /api/v2/ routes with auth middleware
- Use bcrypt for password hashing (already in package.json)
- Follow the middleware pattern in src/middleware/auth-example.ts
- Do NOT use passport.js (we want minimal dependencies)"

The pattern is consistent: specific requirements, explicit constraints, context pointers, and a clear definition of done produce dramatically better agent output.

04 3. Automated Testing with Agents

3.1 Agents as Test Writers

One of the most immediately valuable applications of coding agents is automated test generation. Writing tests is often perceived as tedious by developers, which means test coverage is frequently inadequate. Agents, which do not experience tedium, can dramatically increase test coverage. Agents can:

Generate unit tests from function signatures and docstrings
Create edge case tests by analyzing code paths and boundary conditions
Write integration tests that exercise API endpoints or workflows
Generate property-based tests using frameworks like Hypothesis
Create regression tests from bug reports (ensuring the bug does not recur)

3.2 Test-Driven Development with Agents

An interesting workflow adapts the traditional TDD cycle by delegating the test-writing and implementation steps to the agent:

text

1. Human writes requirements in natural language
2. Agent generates tests from requirements (the "red" phase)
3. Agent implements code to pass the tests (the "green" phase)
4. Human reviews both tests and code
5. Agent refactors while keeping tests green (the "refactor" phase)

This is sometimes called "Spec-Driven Development" when working with agents: the human writes the specification, the agent writes the tests and implementation. The human's role becomes specification and review, which are arguably the highest-value parts of software development.

The key advantage of this workflow is that the tests serve as a machine-verifiable specification. Instead of arguing about whether the code "does the right thing," you can point to specific test cases and say "does it pass these tests?"
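As a small illustration of tests acting as a machine-verifiable specification, the sketch below turns one natural-language requirement into a table of cases. The requirement, the `is_valid_username` function, and the cases are all hypothetical examples, written so the file can be run directly or collected by pytest.

```python
import re

# Requirement (hypothetical): "A username is 3-20 characters long, uses only
# letters, digits, and underscores, and does not start with a digit."

SPEC_CASES = [
    ("alice", True),
    ("a_b_3", True),
    ("ab", False),        # too short
    ("x" * 21, False),    # too long
    ("3cats", False),     # starts with a digit
    ("bad-name", False),  # disallowed character
]

# One leading non-digit character plus 2-19 more gives a total length of 3-20.
USERNAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,19}")


def is_valid_username(name: str) -> bool:
    return USERNAME_RE.fullmatch(name) is not None


def test_username_spec() -> None:
    for name, expected in SPEC_CASES:
        assert is_valid_username(name) is expected, name


if __name__ == "__main__":
    test_username_spec()
    print("spec satisfied")
```

A disagreement about behavior then becomes a disagreement about a row in `SPEC_CASES`, which the human reviews in step 4 alongside the implementation.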

3.3 Example: Agent-Generated Test Suite

Here is an example of prompting a coding agent to generate a comprehensive test suite. Pay attention to how the agent covers not just the happy path but also edge cases, error conditions, and concurrency.

python

"""
Prompt to the agent:
"Write a comprehensive test suite for the following TaskQueue class.
Cover normal operation, edge cases, error handling, and concurrency.
Use pytest. Include both unit tests and integration-style tests."
"""

# The class to test (provided to the agent as context):
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    payload: dict
    priority: int = 0
    created_at: float = field(default_factory=time.time)


class TaskQueue:
    """Thread-safe priority task queue with size limits."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._queue: list[Task] = []
        self._lock = threading.Lock()
        self._not_empty = threading.Condition(self._lock)
        self._not_full = threading.Condition(self._lock)

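    def put(self, task: Task, timeout: float | None = None) -> bool:
        """Add a task; return False if the queue is still full after `timeout`."""
        with self._not_full:
            # wait_for() re-checks the predicate after every wakeup, so a
            # producer never appends to a queue another thread just refilled.
            has_room = self._not_full.wait_for(
                lambda: len(self._queue) < self.max_size, timeout=timeout
            )
            if not has_room:
                return False
            self._queue.append(task)
            # list.sort() is stable, so equal priorities keep FIFO order.
            self._queue.sort(key=lambda t: -t.priority)
            self._not_empty.notify()
            return True

    def get(self, timeout: float | None = None) -> Task | None:
        """Remove and return the highest-priority task, or None on timeout."""
        with self._not_empty:
            # Same pattern: with several consumers, a woken thread may find
            # the queue already drained, so the emptiness check must repeat.
            has_item = self._not_empty.wait_for(
                lambda: len(self._queue) > 0, timeout=timeout
            )
            if not has_item:
                return None
            task = self._queue.pop(0)
            self._not_full.notify()
            return task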

    def size(self) -> int:
        with self._lock:
            return len(self._queue)

    def is_empty(self) -> bool:
        return self.size() == 0


# --- Agent-generated test suite below ---

import pytest


class TestTaskQueueBasic:
    """Basic functionality tests."""

    def test_put_and_get_single_task(self):
        """Verify that a task can be added and retrieved."""
        queue = TaskQueue()
        task = Task(task_id="1", payload={"action": "test"})
        assert queue.put(task) is True
        result = queue.get()
        assert result is not None
        assert result.task_id == "1"

    def test_fifo_order_same_priority(self):
        """Tasks with the same priority should come out in FIFO order."""
        queue = TaskQueue()
        for i in range(5):
            queue.put(Task(task_id=str(i), payload={}, priority=0))
        ids = [queue.get().task_id for _ in range(5)]
        assert ids == ["0", "1", "2", "3", "4"]

    def test_priority_ordering(self):
        """Higher priority tasks should come out first."""
        queue = TaskQueue()
        queue.put(Task(task_id="low", payload={}, priority=1))
        queue.put(Task(task_id="high", payload={}, priority=10))
        queue.put(Task(task_id="mid", payload={}, priority=5))

        assert queue.get().task_id == "high"
        assert queue.get().task_id == "mid"
        assert queue.get().task_id == "low"

    def test_size_tracking(self):
        """Size should accurately reflect queue contents."""
        queue = TaskQueue()
        assert queue.size() == 0
        assert queue.is_empty() is True

        queue.put(Task(task_id="1", payload={}))
        assert queue.size() == 1
        assert queue.is_empty() is False

        queue.get()
        assert queue.size() == 0

    def test_max_size_enforcement(self):
        """Queue should reject tasks when full (with timeout)."""
        queue = TaskQueue(max_size=2)
        assert queue.put(Task(task_id="1", payload={})) is True
        assert queue.put(Task(task_id="2", payload={})) is True
        # Third put should timeout since queue is full
        assert queue.put(Task(task_id="3", payload={}), timeout=0.1) is False


class TestTaskQueueConcurrency:
    """Thread safety tests."""

    def test_concurrent_put_and_get(self):
        """Multiple producers and consumers should not lose tasks."""
        queue = TaskQueue(max_size=1000)
        results = []
        errors = []

        def producer(start_id, count):
            try:
                for i in range(count):
                    queue.put(Task(task_id=f"p{start_id}-{i}", payload={}))
            except Exception as e:
                errors.append(e)

        def consumer(count):
            try:
                for _ in range(count):
                    task = queue.get(timeout=5.0)
                    if task:
                        results.append(task.task_id)
            except Exception as e:
                errors.append(e)

        n_items = 100
        producers = [
            threading.Thread(target=producer, args=(i, n_items))
            for i in range(4)
        ]
        consumers = [
            threading.Thread(target=consumer, args=(n_items,))
            for _ in range(4)
        ]

        for t in producers + consumers:
            t.start()
        for t in producers + consumers:
            t.join(timeout=10)

        assert len(errors) == 0
        assert len(results) == 4 * n_items

    def test_get_blocks_until_item_available(self):
        """get() should block until a task is put."""
        queue = TaskQueue()

        def delayed_put():
            time.sleep(0.2)
            queue.put(Task(task_id="delayed", payload={}))

        threading.Thread(target=delayed_put).start()
        task = queue.get(timeout=2.0)
        assert task is not None
        assert task.task_id == "delayed"


class TestTaskQueueEdgeCases:
    """Edge case tests."""

    def test_get_from_empty_queue_with_timeout(self):
        """get() on an empty queue with timeout should return None."""
        queue = TaskQueue()
        result = queue.get(timeout=0.1)
        assert result is None

    def test_large_payload(self):
        """Queue should handle tasks with large payloads."""
        queue = TaskQueue()
        big_payload = {"data": "x" * 1_000_000}
        queue.put(Task(task_id="big", payload=big_payload))
        task = queue.get()
        assert len(task.payload["data"]) == 1_000_000

    def test_negative_priority(self):
        """Negative priorities should be handled correctly."""
        queue = TaskQueue()
        queue.put(Task(task_id="neg", payload={}, priority=-5))
        queue.put(Task(task_id="pos", payload={}, priority=5))
        assert queue.get().task_id == "pos"
        assert queue.get().task_id == "neg"

Notice how the agent-generated tests cover multiple dimensions: basic functionality, ordering guarantees, size constraints, concurrency safety, blocking behavior, and edge cases. A human reviewer should verify completeness and add domain-specific scenarios that the agent might have missed.

4. CI/CD Integration: Agents in the Development Pipeline

4.1 Where Agents Fit in CI/CD

Agents can participate at multiple stages of the development pipeline, not just in code writing:

Pre-commit. Agents review code before it enters the repository, catching issues early. This can include style checks, security scanning, and logical review that goes beyond what static analysis tools can do.

Pull request review. Agents automatically review pull requests, providing comments on code quality, potential bugs, missing tests, and documentation gaps. Tools like CodeRabbit and GitHub Copilot PR review automate this. The agent reads the diff, understands the context of the changes, and provides feedback that is specific to the codebase.

Automated fixes. When CI detects a failing test or linting error, an agent can automatically create a fix PR. This closes the loop between detection and resolution, reducing the time a build stays broken.

Release management. Agents can draft release notes, update changelogs, and verify that all release criteria are met. These are mechanical tasks that are well-suited to automation.

4.2 Example: Agent in a GitHub Actions Workflow

yaml

# .github/workflows/agent-review.yml
name: Agent Code Review

on:
  pull_request:
    types: [opened, synchronize]

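jobs:
  agent-review:
    runs-on: ubuntu-latest
    # Posting review comments requires write access to pull requests
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get changed files
        id: changes
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          echo "files=$(git diff --name-only "origin/${BASE_REF}...HEAD" | tr '\n' ' ')" >> "$GITHUB_OUTPUT"

      - name: Run agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass values through the environment instead of interpolating
          # them into the script, so unusual file names cannot inject shell code
          CHANGED_FILES: ${{ steps.changes.outputs.files }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          # Agent reviews each changed file and posts comments
          python scripts/agent_review.py \
            --files "$CHANGED_FILES" \
            --pr-number "$PR_NUMBER"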

This workflow runs whenever a pull request is opened or updated, identifies the changed files, and runs an agent-powered review that posts comments directly on the PR. The agent can provide feedback that is more contextual than traditional linting tools because it reads each change in the context of the surrounding code rather than matching fixed rules.
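The workflow calls scripts/agent_review.py, which the example does not show. A minimal sketch of what such a script could look like follows. Everything in it is an assumption for illustration: the function names are hypothetical, and the model call and file reader are passed in as plain callables so the sketch does not depend on any particular SDK or on the GitHub API.

```python
import argparse
from typing import Callable


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two flags the workflow passes to the script."""
    parser = argparse.ArgumentParser(description="Agent-powered PR review (sketch)")
    parser.add_argument("--files", required=True, help="Space-separated changed files")
    parser.add_argument("--pr-number", required=True, type=int)
    return parser.parse_args(argv)


def review_files(
    files: list[str],
    read_file: Callable[[str], str],
    ask_model: Callable[[str], str],
) -> dict[str, str]:
    """Ask the model for feedback on each file; returns {path: feedback}.

    read_file and ask_model are injected (hypothetical) so that a real
    model client and a real file reader can be swapped in later.
    """
    comments = {}
    for path in files:
        prompt = (
            "Review this file for bugs, missing tests, and unclear code.\n\n"
            f"File: {path}\n\n{read_file(path)}"
        )
        comments[path] = ask_model(prompt)
    return comments
```

Splitting the file list on whitespace mirrors how the workflow joins names with spaces, so paths that contain spaces would need a different separator. Posting the resulting comments to the PR is left out of the sketch.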

4.3 Continuous Integration with Agent Feedback Loops

A powerful pattern is the agent-in-the-loop CI where the agent actively participates in the CI process:

Developer pushes code
CI runs tests; some fail
Agent analyzes test failures (reads error messages, traces through code)
Agent proposes fixes
If fixes are straightforward, agent creates a commit
If fixes are complex, agent creates a detailed issue or comment explaining the problem
Developer reviews and approves

This can reduce the turnaround time for fixing CI failures from hours (waiting for a developer to investigate) to minutes.

4.4 Example: Agent-Powered Auto-Fix Pipeline

The following Python script demonstrates how an agent can be integrated into a CI pipeline to automatically analyze and fix test failures:

python

"""
CI integration script that uses a coding agent to analyze
and propose fixes for test failures.

This script is designed to run as a CI step after test execution.
It reads the test output, asks an agent to diagnose the failure,
and creates a fix PR if the agent is confident in its solution.
"""

import subprocess


def get_test_failures(test_output: str) -> list[dict]:
    """Parse test output to extract failure information."""
    failures = []
    # Simplified parser -- real implementation would handle
    # pytest, jest, etc. output formats
    for line in test_output.split("\n"):
        # "FAIL" also matches "FAILED", so one check covers both markers
        if "FAIL" in line:
            failures.append({
                "test": line.strip(),
                # Full output is attached so the agent can see tracebacks
                "output": test_output,
            })
    return failures


def build_agent_prompt(failures: list[dict], changed_files: list[str]) -> str:
    """
    Build a prompt for the coding agent that includes:
    - The test failures and their output
    - The list of files changed in this PR
    - Instructions for diagnosing and fixing
    """
    prompt = "The following tests are failing after recent changes:\n\n"

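    for f in failures:
        prompt += f"**Failed test**: {f['test']}\n"

    # Every failure carries the same full test output, so include it once
    if failures:
        prompt += f"\n**Test output**:\n{failures[0]['output']}\n"

    prompt += f"\n**Changed files**: {', '.join(changed_files)}\n\n"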
    prompt += (
        "Please:\n"
        "1. Read the failing test files to understand what they expect\n"
        "2. Read the changed files to understand what was modified\n"
        "3. Diagnose why the tests are failing\n"
        "4. If the fix is straightforward (< 10 lines changed), "
        "implement it\n"
        "5. If the fix is complex, explain the issue and suggest "
        "an approach\n"
        "6. Run the tests to verify your fix\n"
    )
    return prompt


def create_fix_pr(branch_name: str, commit_message: str):
    """Create a PR with the agent's fix."""
    subprocess.run(["git", "checkout", "-b", branch_name], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", commit_message], check=True)
    subprocess.run(["git", "push", "-u", "origin", branch_name], check=True)
    subprocess.run([
        "gh", "pr", "create",
        "--title", f"fix: {commit_message}",
        "--body", "Automated fix generated by CI agent. "
                  "Please review carefully before merging.",
        "--label", "agent-generated",
    ], check=True)

The key design decisions:

Threshold for auto-fix: The prompt asks the agent to implement only straightforward fixes (< 10 lines changed). Complex fixes get a diagnosis but not a PR. This keeps the agent from making large, hard-to-review changes. Because the limit lives in the prompt, it is a request rather than a guarantee.
Labeling: The PR is labeled "agent-generated" so reviewers know to scrutinize it carefully.
Explicit instructions: The prompt tells the agent exactly what steps to follow, reducing the chance of an unhelpful response.
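The script states the 10-line threshold only in the prompt, so nothing stops the agent from exceeding it. One way to enforce it, sketched here under assumptions, is to measure the agent's actual diff before deciding whether to open a PR. The function names and the "open_pr" / "comment" labels are hypothetical; the input is the tab-separated text printed by `git diff --numstat`.

```python
from typing import Optional


def changed_line_count(numstat_output: str) -> Optional[int]:
    """Total added + deleted lines, or None if a binary file changed.

    Expects `git diff --numstat` output: "<added>\t<deleted>\t<path>" per line.
    Binary files are reported with "-" in place of the counts.
    """
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            return None
        total += int(added) + int(deleted)
    return total


def choose_action(numstat_output: str, max_lines: int = 10) -> str:
    """Return "open_pr" for small diffs and "comment" for everything else."""
    changed = changed_line_count(numstat_output)
    if changed is None or changed == 0:
        return "comment"  # binary change or no change: leave it to a human
    return "open_pr" if changed < max_lines else "comment"
```

A CI step could call `choose_action` on the diff and only invoke `create_fix_pr` when the result is "open_pr", falling back to a PR comment with the diagnosis otherwise.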

5. Code Refactoring and Migration Agents

5.1 Why Agents Excel at Refactoring

Refactoring is a particularly good fit for coding agents because of several properties:

Mechanical: Many refactoring operations follow well-defined patterns (rename, extract method, move class)
Testable: Refactoring should not change behavior, so existing tests validate the result
Tedious: Humans find large-scale refactoring boring and error-prone (this is where agents shine: they do not get bored, although they can still make the mistakes described in Section 5.4)
Context-heavy: Agents can track all references to a symbol across a large codebase, something humans struggle to do manually

5.2 Common Agent-Assisted Refactoring Tasks

Rename and restructure. Rename variables, functions, classes, and modules consistently across the entire codebase. Move code between files and update all imports. This sounds trivial, but in a large codebase with hundreds of files, a rename can be surprisingly error-prone.

API migration. When a library releases a new API version, agents can update all call sites. For example, migrating from an older HTTP client library to a newer one, updating function signatures, changing parameter names, and adapting to new return types.

Pattern application. Apply a consistent pattern across the codebase: add error handling to all database calls, convert callbacks to async/await, add logging to all API endpoints, or add type annotations to an untyped codebase.

Dependency updates. Update dependencies and fix breaking changes. The agent reads the changelog, identifies affected code, and applies necessary modifications.

5.3 Case Study: Large-Scale TypeScript Migration

A common real-world task is migrating a JavaScript codebase to TypeScript:

text

Task: Convert src/utils/*.js to TypeScript

Agent approach:
1. Read each .js file and analyze the code
2. Infer types from usage patterns, JSDoc comments, and runtime checks
3. Rename .js to .ts
4. Add type annotations
5. Fix type errors reported by the compiler
6. Run existing tests to verify behavior is unchanged
7. Iterate on type errors until the build succeeds

An agent can handle straightforward files (pure utility functions with clear types) almost perfectly. Complex files (those using dynamic patterns, runtime type construction, or heavy metaprogramming) require human assistance. The practical workflow is often: let the agent handle the 80% of files that are straightforward, then have a human handle the remaining 20%.
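Steps 5 to 7 of the approach above form a closed loop: check, fix, check again. A minimal sketch of that loop follows, with the compiler check and the agent's fix step passed in as callables. Both callables and the function name are assumptions for illustration; in a real setup `check` might wrap a compiler run such as `tsc --noEmit` and `fix` might hand the errors to the agent.

```python
from typing import Callable


def iterate_until_clean(
    check: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Run check/fix rounds until no errors remain or the budget runs out.

    Returns True if the final check is clean, False otherwise.
    """
    for _ in range(max_rounds):
        errors = check()
        if not errors:
            return True
        fix(errors)  # let the agent attempt a fix for the reported errors
    # Budget exhausted: report whether the last fix happened to succeed
    return not check()
```

The round limit matters: without it, an agent that keeps producing the same broken fix would loop indefinitely. Files that hit the limit are natural candidates for the human-handled 20%.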

5.4 When Refactoring Goes Wrong

Agents can make refactoring mistakes that are subtle and hard to catch:

Behavioral changes. The agent changes the code's behavior while claiming to "only refactor." For example, extracting a method that changes the order of operations, or renaming a variable that was used in a string template (the template breaks but the code compiles). Always run the full test suite after any refactoring.

Incomplete renaming. The agent renames a function in 95% of call sites but misses some in less-traveled code paths (error handlers, configuration files, migration scripts). In dynamically typed languages, or wherever the name appears in strings or configuration, this passes static checks but crashes at runtime.

Style inconsistency. The agent applies modern idioms to the refactored code but leaves untouched code in the old style, creating a jarring inconsistency. This is not a correctness issue but it makes the codebase harder to read.

Over-abstraction. The agent introduces abstractions (interfaces, factory patterns, dependency injection) that add complexity without proportional benefit. The refactored code is "cleaner" by some abstract measure but harder to understand and debug.

Key Insight: Refactoring is one of the best use cases for coding agents because it is mechanical, testable, and tedious for humans. But it is also one of the riskiest because refactoring bugs are subtle (the code compiles and most tests pass) and can lurk undetected for weeks. Comprehensive test coverage before refactoring is essential; if you do not have it, write the tests first.
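One inexpensive guard against incomplete renaming is to scan the repository for whole-word occurrences of the old name after the agent finishes, including file types a compiler never sees. The following is a sketch under assumptions: the function names and the default list of suffixes are hypothetical and would need adapting to the project.

```python
import re
from pathlib import Path


def leftover_references(text: str, old_name: str) -> list[int]:
    """Return 1-based line numbers where old_name still appears as a whole word."""
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


def scan_tree(
    root: Path,
    old_name: str,
    suffixes: tuple[str, ...] = (".py", ".yaml", ".yml", ".json", ".sql", ".md"),
) -> dict[str, list[int]]:
    """Map each file under root to the lines that still mention old_name."""
    hits = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            lines = leftover_references(path.read_text(errors="ignore"), old_name)
            if lines:
                hits[str(path)] = lines
    return hits
```

The word-boundary match means a search for `get_user` does not flag `get_user_by_id`, but it does flag the name inside string literals and comments, which is where missed renames tend to hide.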

6. Documentation Generation

6.1 Types of Documentation Agents Can Generate

API documentation. From function signatures, docstrings, and usage examples, agents can generate comprehensive API docs. This works well because the source of truth (the code) is available to the agent.

Architecture documentation. By analyzing imports, dependencies, and file structure, agents can generate architecture diagrams and explanations. The agent can identify the major components, their dependencies, and their communication patterns.

README generation. Agents can create project READMEs from the codebase, including setup instructions, usage examples, and contribution guidelines.

Inline comments. Agents can add explanatory comments to complex code sections, making the codebase more accessible to new contributors.

Changelog generation. From git history and PR descriptions, agents can compile structured changelogs that summarize what changed between releases.

6.2 Quality Considerations

Agent-generated documentation has known pitfalls that you should watch for:

Stating the obvious. Agents often generate comments that restate the code rather than explaining intent, for example the comment "# Increment counter by 1" placed above the line counter += 1. Good documentation explains why, not what.

Hallucinated details. The agent might describe behavior that the code does not actually implement. This is especially dangerous in documentation because readers trust documentation as authoritative.

Staleness. Documentation generated from code becomes stale when the code changes. If you generate docs once and never update them, they will eventually be misleading.

Missing context. Agents describe what the code does but may miss why it was written that way. Business context, historical decisions, and trade-off rationale are often not in the code.

Best practice. Use agents to draft documentation, then have a human review for accuracy, relevance, and completeness. Integrate documentation generation into CI so it stays current with the code.
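A CI check for stale documentation can start very small. The sketch below flags function parameters that the docstring never mentions, which is one common symptom of docs drifting from code. The helper name and the sample function are hypothetical, and the word match is deliberately crude: it catches missing parameters, not wrong descriptions.

```python
import inspect
import re


def undocumented_params(func) -> list[str]:
    """Return parameter names that never appear as a word in func's docstring."""
    doc = inspect.getdoc(func) or ""
    names = [
        name
        for name in inspect.signature(func).parameters
        if name not in ("self", "cls")
    ]
    return [name for name in names if not re.search(rf"\b{re.escape(name)}\b", doc)]


def fetch_page(url, timeout, retries=3):
    """Fetch a page.

    Args:
        url: Address to fetch.
        timeout: Seconds to wait before giving up.
    """


# `retries` was added to the signature but the docstring was not updated
print(undocumented_params(fetch_page))
```

Running a check like this on every build turns staleness from something a reader discovers into something the pipeline reports.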

6.3 Example: Documentation Generation Prompt

Here is an effective prompt for generating API documentation with a coding agent:

text

"Generate API documentation for all endpoints in src/api/v2/.
For each endpoint, include:

1. HTTP method and path
2. Description of what it does (one sentence)
3. Request parameters (path params, query params, request body)
4. Response format (with example JSON)
5. Error responses (status codes and when they occur)
6. Authentication requirements
7. Rate limiting (if applicable)
8. A curl example

Use the existing endpoint implementations as the source of truth.
Do NOT invent parameters or responses that are not in the code.
Format as Markdown. Follow the style in docs/api/v1-reference.md."

The prompt is specific about the format, explicitly warns against hallucination ("Do NOT invent parameters"), and points to an existing style reference. This produces documentation that is useful and accurate.

6.4 When NOT to Use Agents for Documentation

Some documentation types are poorly suited to agent generation:

Architecture Decision Records (ADRs): These capture why decisions were made, which requires human context the agent does not have.
Onboarding guides: These need to reflect the actual experience of learning the codebase, not just its structure.
Runbooks for incidents: These need to capture the human judgment and institutional knowledge that comes from handling real incidents.
Strategic or vision documents: These require understanding the business context and future direction.

For these types, the agent can help with formatting and editing, but the content must come from humans.

7. Debugging and Issue Resolution Agents

7.1 The Debugging Workflow

Debugging is where agentic capabilities truly shine because it requires exactly the kind of multi-step, multi-file reasoning that agents are good at. A systematic debugging workflow for an agent looks like:

Reproduce: Read the bug report, understand the expected vs. actual behavior
Locate: Search the codebase for relevant code, analyze stack traces, identify the likely location of the bug
Diagnose: Understand why the code behaves incorrectly, tracing the logic through multiple functions if needed
Fix: Implement the correction
Verify: Write a test that captures the bug and confirm the fix resolves it
Prevent: Suggest improvements to prevent similar bugs in the future

7.2 When Agents Struggle with Debugging

Agents are less effective when:

The bug involves complex state interactions across many components
The issue is environmental (works on one machine but not another)
The bug requires understanding external systems (third-party APIs, hardware)
The root cause is in a dependency, not in the project's own code
Reproducing the bug requires specific timing or concurrency conditions

7.3 Example: Prompting an Agent for a Non-Trivial Bug

text

Prompt: "Users report that our API occasionally returns stale data.
The endpoint is GET /api/v2/products/:id. Our stack is:
- Next.js API routes
- Redis cache (TTL: 5 minutes)
- PostgreSQL database
- The issue happens about 1 in 20 requests
- It only happens after a product update via PUT /api/v2/products/:id

Investigate this bug. Check the cache invalidation logic,
the database query, and the API route handler.
Look for race conditions or cache coherence issues."

A good coding agent will:

Read the GET and PUT route handlers
Trace the cache invalidation logic in the PUT handler
Identify if there is a window between the database update and cache invalidation
Check for race conditions when concurrent read and write requests arrive
Propose a fix (likely: invalidate cache before returning from the PUT handler, or use a cache-aside pattern with version tracking)

Multi-file, multi-system investigations like this play to an agent's strengths. The agent can read many files, build a working model of the data flow, and trace logic that would take a human significantly longer to follow.
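To make the race concrete, here is an in-memory sketch of one interleaving that produces stale data, together with the version-tracking idea mentioned above. It is an illustration under assumptions, not the application's actual code: the class and method names are hypothetical, plain dictionaries stand in for Redis and PostgreSQL, and the steps are replayed by hand rather than with real threads.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: str
    version: int


class VersionedCache:
    """Cache stand-in that refuses fills older than the last invalidation."""

    def __init__(self):
        self._entries: dict[str, Row] = {}
        self._min_version: dict[str, int] = {}

    def get(self, key: str):
        return self._entries.get(key)

    def invalidate(self, key: str, version: int) -> None:
        # Drop the entry and remember the oldest version still acceptable
        self._entries.pop(key, None)
        self._min_version[key] = max(version, self._min_version.get(key, 0))

    def set_if_current(self, key: str, row: Row) -> bool:
        # Reject fills computed from a row that predates the invalidation
        if row.version < self._min_version.get(key, 0):
            return False
        self._entries[key] = row
        return True


# Replay the problematic interleaving step by step
db = {"p1": Row("old price", version=1)}
cache = VersionedCache()

stale_read = db["p1"]                    # GET: cache miss, reads version 1
db["p1"] = Row("new price", version=2)   # PUT: updates the database...
cache.invalidate("p1", version=2)        # ...and invalidates the cache
accepted = cache.set_if_current("p1", stale_read)  # GET: late cache fill
```

With a plain `set`, the late fill would store the old price until the TTL expired. Here the fill is rejected, so the next GET misses the cache and reads the current row. A real cache would also need to expire the remembered versions, which the sketch omits.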

7.4 Debugging Anti-Patterns with Agents

While agents are powerful debuggers, certain patterns lead to poor results:

Anti-pattern: "Just fix it." Giving the agent a vague bug description and expecting it to find and fix the issue. Without specific symptoms, reproduction steps, and context, the agent will either guess (often incorrectly) or spend many tool calls exploring irrelevant code.

Anti-pattern: "Fix all the bugs." Asking the agent to fix multiple unrelated bugs at once. Each bug requires focused investigation, and mixing them leads to confused reasoning and incomplete fixes. Fix bugs one at a time.

Anti-pattern: "Make the error go away." The agent might suppress the error (catching and ignoring the exception) rather than fixing the root cause. Always specify that you want the root cause fixed, not the symptom hidden. Review agent fixes to ensure they address the underlying issue.

Anti-pattern: "The agent said it fixed it, so it must be fixed." Always verify the fix by running the relevant tests and manually testing the reproduction case. Agents can produce fixes that look correct, pass the existing tests, and still do not actually fix the bug (because the test coverage was insufficient to begin with).
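The difference between hiding a symptom and fixing a root cause is easy to see in code. In this small illustration (both functions are hypothetical), the first version makes the error disappear and the second handles the actual cause.

```python
def average_suppressed(values):
    """Symptom hidden: any failure silently becomes 0.0."""
    try:
        return sum(values) / len(values)
    except Exception:
        # Swallows the empty-list case, but also type errors and anything else
        return 0.0


def average_fixed(values):
    """Root cause handled: empty input is an explicit, reported condition."""
    if not values:
        raise ValueError("average_fixed() requires at least one value")
    return sum(values) / len(values)
```

The suppressed version also returns 0.0 for a list of strings, so a data-quality problem upstream now looks like a legitimate average. When reviewing an agent's fix, a broad `except` that returns a default value is a signal to ask what the underlying cause was.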

Key Insight: The quality of debugging with a coding agent is directly proportional to the quality of the bug report you provide. A well-written bug report (specific symptoms, reproduction steps, expected vs. actual behavior, relevant files, environment details) is worth more than a sophisticated agent. The best debugging workflow is: human writes a thorough bug report, agent investigates and proposes a fix, human verifies the fix. This combines human domain knowledge with agent speed and thoroughness.

Try It Yourself: Debugging Challenge

Here is a bug report for a hypothetical application. Practice writing a prompt that you would give to a coding agent:

Bug: "Users report that pagination on the search results page shows the wrong total count. The first page says 'Showing 1-10 of 47 results' but clicking 'Next' eventually leads to page 3 which only has 2 results (total 22 results, not 47). The count seems to match the number of results before our last deployment that added search filters."

Write a prompt for the agent that includes:

The specific symptom and reproduction steps
Where to look first (based on your analysis of the bug report)
What to check (cache invalidation? filter application in count query?)
What the fix should accomplish (not just "fix it" but what correct behavior looks like)

8. Limitations and Risks

8.1 Code Quality Concerns

Subtle bugs. Agents can produce code that looks correct but contains subtle logical errors, especially in edge cases. The code compiles, tests pass (if tests are insufficient), but the behavior is wrong in specific scenarios. This is the most dangerous failure mode because it is invisible.

Non-idiomatic code. Agents trained on diverse codebases may produce code that works but does not follow the project's conventions or the language community's best practices. The code is correct but does not "feel right" to experienced developers on the team.

Over-engineering. Agents may add unnecessary abstractions, patterns, or dependencies when a simpler solution would suffice. Asked to add a configuration file, the agent might create an entire configuration management system with environment variable support, file watching, and schema validation when a simple dictionary would do.

Outdated patterns. Agents' training data has a cutoff date. They may use deprecated APIs or outdated practices. This is especially relevant in fast-moving ecosystems like JavaScript/TypeScript, where best practices change rapidly.

8.2 Security Vulnerabilities

Pearce et al. (2022) studied the security of code generated by GitHub Copilot and found that approximately 40% of generated code contained security vulnerabilities in scenarios specifically designed to elicit security-sensitive code. Common issues include:

SQL injection through string concatenation instead of parameterized queries
Missing input validation (trusting user input)
Insecure use of cryptographic functions (using MD5 for password hashing)
Path traversal vulnerabilities (not sanitizing file paths)
Hardcoded credentials or secrets in code

Mitigation. Always run security scanning tools (SAST/DAST) on agent-generated code. Treat agent output with the same scrutiny as code from an untrusted contributor. Never assume that because an AI wrote it, it is secure.
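The first item in the list, SQL injection through string concatenation, is worth seeing side by side with its fix. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table, function names, and payload are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])


def find_user_unsafe(name: str):
    # Vulnerable: user input is spliced directly into the SQL text
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str):
    # Parameterized: the value is passed separately from the SQL text
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_user_unsafe(payload)))  # every row matches the injected OR clause
print(len(find_user_safe(payload)))    # no user has this literal name
```

Both functions look equally plausible in a diff, which is why reviewers should search agent-generated code for SQL built with f-strings, `+`, or `.format()`.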

8.3 Over-Reliance and Skill Atrophy

A significant concern is that developers who rely heavily on coding agents may experience skill atrophy:

Reduced ability to write code from scratch (always starting with agent output)
Diminished understanding of underlying algorithms and data structures
Weakened debugging skills (relying on the agent to debug)
Decreased code comprehension (accepting agent output without fully understanding it)

This mirrors the automation paradox discussed in Week 12: as the tool becomes more capable, the human's ability to verify and correct its output may decline. A developer who cannot write code without an agent is in a precarious position: they cannot evaluate whether the agent's code is good because they lack the skill to write an alternative.

Mitigation. Deliberately practice writing code without agent assistance. Maintain fundamental skills through kata, code challenges, or periodic "no-AI" development days. Some teams implement "no-AI Fridays" where all coding is done manually. Others require that developers can explain every line of agent-generated code in their PRs, which forces understanding.

8.4 The "Works on My Machine" Problem

Coding agents can produce code that works in the agent's context but fails in different environments. This happens because:

The agent may assume specific versions of dependencies that are not pinned
The agent may write code that relies on the specific operating system or file system layout
The agent may hard-code paths or configurations that are environment-specific
The agent may generate code that works with the test data it has seen but fails on production data

Mitigation. Run agent-generated code in CI (which uses a clean environment) before merging. Use Docker or similar containerization to ensure consistent environments. Require that agent-generated code passes linting, type checking, and tests in CI, not just locally.

8.5 Common Misconceptions About Coding Agents

Misconception: "The agent understands my codebase." The agent reads files and forms a temporary working model of the code. It does not "understand" the codebase in the way a human developer who has worked on it for months does. It misses unwritten conventions, tribal knowledge, and the historical reasons behind design decisions. This is why CLAUDE.md and .cursorrules files are so important: they capture context that the agent would otherwise miss.

Misconception: "If the tests pass, the code is correct." Tests verify that specific scenarios work, not that the code is correct in general. An agent that writes both code and tests may have consistent blind spots: it misses the same edge cases in both. This is why independent test review (Section 15) and mutation testing are important.

Misconception: "Agents will replace junior developers." More accurately, agents shift the junior developer's role from writing boilerplate to reviewing agent output, writing specifications, and learning system design. A junior developer who can effectively direct and review an agent's work is more productive than one who writes all code manually. The skill set changes; the role does not disappear.

Misconception: "More context always helps." There is a sweet spot for context. Too little context leads to wrong assumptions. Too much context (dumping the entire codebase into the prompt) can actually degrade performance because the model struggles to identify the relevant information among the noise. Provide targeted context: the specific files that are relevant, the patterns to follow, and the constraints to respect.

Misconception: "The best agent is the most autonomous one." For production work, the best agent is the one that produces the best outcomes, which often means one that asks clarifying questions, escalates when uncertain, and lets the human make the hard decisions. A fully autonomous agent that produces wrong code 20% of the time is less useful than a semi-autonomous one that asks for guidance 20% of the time and produces correct code 98% of the time.
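The "if the tests pass, the code is correct" misconception can be demonstrated with a hand-made mutation test. The idea is to introduce a small deliberate bug (a mutant) and see whether the test suite notices. All names below are hypothetical, and the mutant is written by hand; mutation-testing tools generate such variants automatically.

```python
def is_adult(age: int) -> bool:
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    # Mutation: ">=" changed to ">"
    return age > 18


def weak_test(fn) -> bool:
    """Checks only values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_test(fn) -> bool:
    """Adds the boundary case that distinguishes ">=" from ">"."""
    return weak_test(fn) and fn(18) is True
```

The weak test passes for both the original and the mutant, so it would not catch an off-by-one bug in this function. The strong test passes for the original and fails for the mutant. If an agent wrote both `is_adult` and `weak_test`, a green test run says little about the boundary behavior.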

8.6 Intellectual Property and Licensing

Code generated by AI agents raises IP questions that remain largely unresolved:

Is agent-generated code copyrightable? (Varies by jurisdiction; the US Copyright Office says pure AI output is not)
If the agent was trained on open-source code, does the generated code inherit any licenses? (Unclear; several lawsuits are pending)
Can the developer claim sole authorship of agent-assisted code? (Ethically questionable if the agent did most of the work)

The safe approach is to treat agent-generated code as if it were written by a colleague: review it, understand it, take responsibility for it, and be prepared to explain and defend every line.

9. Case Studies of Agentic Software Engineering in Practice

9.1 Case Study: SWE-agent (Princeton NLP Group)

Yang et al. (2024) introduced SWE-agent, a system designed to resolve real GitHub issues autonomously. The key insight was the concept of the Agent-Computer Interface (ACI): custom commands designed for how LLM agents interact with code, rather than raw shell access.

Rather than relying only on raw shell commands like grep, cat, and sed, SWE-agent adds purpose-built commands such as open, search_dir, find_file, and edit. These are semantically clearer and less error-prone for the model. The agent thinks about what to do, takes an action, observes the result, and repeats.

SWE-agent resolved 12.47% of issues on the full SWE-bench test set (using GPT-4 Turbo), establishing a baseline that subsequent systems improved upon significantly.

Lesson: The interface between the agent and its tools matters enormously. Well-designed tools that match the agent's "cognitive style" outperform raw, low-level interfaces. This echoes what we learned in Weeks 5-6 about tool design.
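To give a feel for what an agent-oriented interface looks like, here is a small sketch in the spirit of the ACI idea. It is not SWE-agent's implementation: the class and its methods are hypothetical. It illustrates two design choices: showing files in bounded, line-numbered windows, and refusing edits that would leave a Python file syntactically invalid.

```python
import ast


class FileWindow:
    """Shows a file in fixed-size, line-numbered windows and guards edits."""

    def __init__(self, text: str, window: int = 100):
        self.lines = text.splitlines()
        self.window = window

    def open(self, start: int = 1) -> str:
        """Return up to `window` numbered lines starting at 1-based `start`."""
        chunk = self.lines[start - 1 : start - 1 + self.window]
        return "\n".join(f"{start + i}: {line}" for i, line in enumerate(chunk))

    def edit(self, start: int, end: int, replacement: str) -> str:
        """Replace lines start..end (inclusive); reject edits that break syntax."""
        candidate = (
            self.lines[: start - 1] + replacement.splitlines() + self.lines[end:]
        )
        try:
            ast.parse("\n".join(candidate))
        except SyntaxError as exc:
            # Feedback the model can act on, with the file left untouched
            return f"Edit rejected: {exc.msg} (line {exc.lineno}). File unchanged."
        self.lines = candidate
        return "Edit applied."
```

Compared with a raw `sed` call, the agent gets line numbers it can refer to in the next step and an immediate, specific message when an edit is malformed, instead of discovering the breakage several actions later.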

9.2 Case Study: Copilot Workspace (GitHub)

GitHub Copilot Workspace represents a "plan and execute" approach:

The developer describes a task (or selects a GitHub issue)
The system analyzes the codebase and generates a plan (which files to change and why)
The developer reviews and edits the plan
The system implements the plan across all affected files
The developer reviews the changes and can iterate

This is a Level 3 autonomy approach (conditional automation from Week 12): the agent does the work, but the human approves the plan and reviews the result.

Lesson: Showing the plan before execution builds trust and catches misunderstandings early. Being able to modify the plan before the agent starts working gives the developer a cheap point at which to correct course. This is the plan-level approval pattern from Week 12 in action.

9.3 Case Study: Claude Code (Anthropic)

Claude Code operates as a CLI tool that combines an agentic loop with full terminal access. Key aspects:

Direct file system and terminal access: The agent can read, write, search, and execute commands directly
Project-level context: Uses CLAUDE.md files for project-specific instructions and conventions
Permission system: Asks for approval before executing potentially dangerous operations
Extended thinking: Uses chain-of-thought reasoning to plan complex operations

In practice, teams report that Claude Code is most effective for multi-file refactoring tasks, understanding and navigating unfamiliar codebases, implementing well-specified features in established projects, writing comprehensive test suites, and debugging complex issues with clear reproduction steps.

Lesson: Combining code generation with terminal access and file system awareness makes the agent significantly more capable than chat-based code generation alone. The agent does not just generate code; it interacts with the full development environment.

9.4 Case Study: Real-World Team Adoption Patterns

Organizations adopting coding agents typically go through three phases:

Phase 1: Individual exploration (weeks 1-4). Individual developers experiment with coding agents for personal tasks: writing tests, fixing small bugs, generating boilerplate. Usage is ad hoc and uncoordinated. Some developers become enthusiastic advocates; others remain skeptical.

Phase 2: Team standardization (months 1-3). The team establishes shared practices: which agent to use, project configuration files (CLAUDE.md, .cursorrules), review standards for agent-generated code, and policies on which tasks are appropriate for agent assistance. During this phase, teams often discover that the biggest productivity gains come not from the fastest coders but from developers who are best at specifying requirements and reviewing output.

Phase 3: Pipeline integration (months 3-6). The agent is integrated into CI/CD: automated PR reviews, test generation on new code, fix suggestions for failing builds, and documentation generation. At this phase, the agent becomes part of the team's infrastructure, not just an individual tool.

Teams that skip Phase 2 (jumping from individual experimentation to pipeline integration without establishing shared practices) often encounter problems: inconsistent code quality, security vulnerabilities from unreviewed agent code, and resistance from team members who feel the agent is being imposed on them.

Key Insight: Adopting coding agents is as much an organizational challenge as a technical one. The technology is the easy part; changing team workflows, establishing review standards, and building shared confidence in the tool are the hard parts.

Try It Yourself: Adoption Plan

Draft a 3-month adoption plan for introducing a coding agent (your choice) to a team of 5 developers working on a web application. Your plan should include:

Which tasks you would start with (low-risk, high-visibility wins)
What configuration files you would create (CLAUDE.md or equivalent)
What review standards you would establish for agent-generated code
How you would measure success (metrics)
What risks you would watch for and how you would mitigate them

10. Modern Coding Agents: A Closer Look (2025-2026)

10.1 Claude Code (Anthropic)

Claude Code is a terminal-based agentic coding assistant. Unlike IDE plugins, it operates as a standalone CLI process.

Architecture: Claude Code follows an observe-think-act loop. It reads the current state of the codebase (observe), reasons about what to do next using extended thinking (think), and then takes an action such as editing a file or running a command (act). This loop continues until the task is complete.

Key features:

CLAUDE.md configuration: Projects include a CLAUDE.md file with project-specific instructions that shape the agent's behavior across sessions.
Permission system: Asks for explicit user approval before destructive operations.
Hooks: Pre- and post-action hooks enforce custom policies.
MCP (Model Context Protocol) servers: Connect to external tools and data sources through a standardized interface.
Sub-agents: Spawn sub-agents for parallelizable tasks.
Git-native: Deep integration with git for staging, committing, and diffing.

10.2 Cursor

Cursor is an AI-first IDE (a fork of VS Code) with AI at every level:

Tab completion: Context-aware autocomplete that predicts multi-line edits.
Cmd+K inline editing: Select code and describe a change in natural language.
Agentic chat mode (Composer Agent): Full agentic mode with multi-file edits and terminal commands.
.cursorrules: Project configuration for agent conventions.
Codebase indexing: Semantic search across the entire project.

10.3 Windsurf (Codeium)

Windsurf is another AI-first IDE:

Cascade: An agentic flow engine for multi-step, multi-file tasks.
Supercomplete: Predicts the developer's next several actions.
Real-time awareness: Tracks IDE activity as implicit context.

10.4 GitHub Copilot

Copilot has evolved significantly:

Agent mode (2025): Makes multi-file edits and runs terminal commands in VS Code.
Copilot Workspace: Plan-and-execute workflows from GitHub issues.
Copilot for PRs: Automated review comments and suggested fixes.
Copilot Extensions: Third-party integrations.

10.5 Devin (Cognition Labs)

Devin operates in its own sandboxed cloud environment:

Full environment: Browser, shell, and code editor in a cloud VM.
Long-running tasks: Designed for tasks taking hours.
Slack/issue integration: Assigned tasks via Slack or GitHub issues.

10.6 OpenHands (formerly OpenDevin)

OpenHands is open-source:

Sandboxed Docker environment: The agent operates inside a container.
Multiple agent implementations: Supports different architectures.
Web browsing: Can browse documentation during work.
SWE-bench competitive: Consistently ranks among top performers.

10.7 SWE-agent (Princeton NLP)

SWE-agent's primary contribution is the Agent-Computer Interface (ACI): custom commands optimized for LLM agent interaction with code. The insight that tool design matters as much as model capability has influenced many subsequent coding agents.
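
One way to see why interface design matters: instead of dumping a whole file into the context, an ACI-style viewer shows a small numbered window and offers navigation commands. The sketch below is written in that spirit; the `FileViewer` class and its method names are illustrative assumptions, not SWE-agent's actual code.

```python
class FileViewer:
    """Shows a fixed-size window of a file with line numbers,
    instead of printing the entire file."""

    def __init__(self, text: str, window: int = 5):
        self.lines = text.splitlines()
        self.window = window
        self.top = 0  # zero-based index of the first visible line

    def view(self) -> str:
        end = min(self.top + self.window, len(self.lines))
        body = [f"{i + 1}: {self.lines[i]}" for i in range(self.top, end)]
        header = f"[lines {self.top + 1}-{end} of {len(self.lines)}]"
        return "\n".join([header, *body])

    def scroll_down(self) -> str:
        max_top = max(len(self.lines) - self.window, 0)
        self.top = min(self.top + self.window, max_top)
        return self.view()

    def goto(self, line: int) -> str:
        """Move the window so it starts at the given 1-based line, clamped to the file."""
        max_top = max(len(self.lines) - self.window, 0)
        self.top = min(max(line - 1, 0), max_top)
        return self.view()


source = "\n".join(f"line {n}" for n in range(1, 13))  # 12-line toy file
viewer = FileViewer(source, window=5)
first = viewer.view()
second = viewer.scroll_down()
```

The header line tells the model where it is in the file, which is the kind of compact feedback a raw `cat` does not provide.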

11. Agent Frameworks for Developers

11.1 OpenAI Agents SDK

A lightweight framework for building agentic applications:

Agent: A configured LLM with instructions, tools, and guardrails.
Handoffs: Transfer control between specialized agents.
Guardrails: Input/output validators running in parallel.
Tracing: Built-in execution logging.

python

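from agents import Agent, Runner, function_tool, handoff

# Placeholder tools: a real system would call the billing backend here.
@function_tool
def lookup_invoice(order_id: str) -> str:
    """Return the invoice status for an order."""
    return f"Invoice for order {order_id}: paid"

@function_tool
def process_refund(order_id: str) -> str:
    """Issue a refund for an order."""
    return f"Refund issued for order {order_id}"

billing_agent = Agent(
    name="Billing Agent",
    instructions="Handle billing inquiries. Be concise.",
    tools=[lookup_invoice, process_refund],
)

# The triage agent can hand the conversation off to the billing agent.
triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right department.",
    handoffs=[handoff(billing_agent)],
)

# Requires OPENAI_API_KEY to be set in the environment.
result = Runner.run_sync(triage_agent, "I need a refund for order #1234")
print(result.final_output)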

11.2 Claude Agent SDK (Anthropic)

Supports multi-turn conversations, tool use, and structured output with extended thinking and MCP integration.

11.3 LangGraph

Graph-based agent workflows with explicit transitions, persistence, and human-in-the-loop support:

python

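from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    plan: str
    approved: bool

# Placeholder node functions; each returns a partial state update.
def plan_step(state: AgentState) -> dict: return {"plan": "draft plan"}
def execute_step(state: AgentState) -> dict: return {"plan": state["plan"] + " (executed)"}
def human_review_step(state: AgentState) -> dict: return {"approved": True}
def should_continue(state: AgentState) -> str:
    return "approve" if state["approved"] else "iterate"

graph = StateGraph(AgentState)
graph.add_node("plan", plan_step)
graph.add_node("execute", execute_step)
graph.add_node("review", human_review_step)
graph.add_edge(START, "plan")  # entry point, required before compiling
graph.add_edge("plan", "execute")
graph.add_edge("execute", "review")
graph.add_conditional_edges("review", should_continue, {"iterate": "plan", "approve": END})
app = graph.compile()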

11.4 CrewAI

Role-playing approach with specialized agents:

python

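from crewai import Agent, Task, Crew

researcher = Agent(
    role="Technical Researcher",
    goal="Find the best architecture for the given requirements",
    backstory="You are a senior architect with 15 years of experience.",
    # tools=[web_search, read_docs],  # attach tool objects defined elsewhere
)

developer = Agent(
    role="Senior Developer",
    goal="Implement the architecture recommended by the researcher",
    backstory="You are a detail-oriented developer who writes clean code.",
    # tools=[file_write, run_tests],  # attach tool objects defined elsewhere
)

# Placeholder tasks: each task needs a description, an expected output, and an agent.
research_task = Task(
    description="Recommend an architecture for the given requirements.",
    expected_output="A short architecture recommendation with rationale.",
    agent=researcher,
)
impl_task = Task(
    description="Implement the recommended architecture.",
    expected_output="Working code with passing tests.",
    agent=developer,
)

crew = Crew(agents=[researcher, developer], tasks=[research_task, impl_task])
result = crew.kickoff()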

11.5 Semantic Kernel (Microsoft)

Enterprise-focused with Azure integration, plugin architecture, and multi-language support (Python, C#, Java).

11.6 Framework Comparison

Framework	Language	Multi-Agent	Persistence	Human-in-Loop	Best For
OpenAI Agents SDK	Python, TS	Yes (handoffs)	No (external)	Limited	Conversational agents
Claude Agent SDK	Python, TS	Via orchestration	No (external)	Via tools	Tool-heavy agents
LangGraph	Python, JS	Yes (graph nodes)	Yes (built-in)	Yes (built-in)	Complex workflows
CrewAI	Python	Yes (crews)	Limited	Limited	Role-based prototyping
Semantic Kernel	Python, C#, Java	Yes (plugins)	Yes	Yes	Enterprise/Azure

12. Agent Observability and Evaluation

12.1 Why Observability Matters

Agent workflows are inherently complex. Without observability, you are flying blind:

Multi-step execution: An agent might make 20-50 tool calls per task. Any step can go wrong.
Non-determinism: The same prompt can lead to different execution paths.
Cost tracking: Without observability, costs can spiral.
Quality assurance: Teams need to verify agent output quality over time.
Debugging loops: Agents can get stuck repeating the same failing action. Traces make these loops visible so that a monitor or a human can detect and interrupt them.
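
Two of the points above, cost tracking and loop detection, can be illustrated with a very small trace recorder. This is a sketch under simple assumptions: the `Trace` class, its `max_repeats` threshold, and the per-call costs are hypothetical, and real observability tools record far more (timings, token counts, nested spans).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Records every tool call so cost and repetition can be inspected."""
    max_repeats: int = 3
    calls: list[tuple[str, str]] = field(default_factory=list)
    cost_usd: float = 0.0

    def record(self, tool: str, args: str, cost_usd: float) -> None:
        self.calls.append((tool, args))
        self.cost_usd += cost_usd

    def looping(self) -> bool:
        """True if any identical (tool, args) call occurred more than max_repeats times."""
        counts = Counter(self.calls)
        return any(n > self.max_repeats for n in counts.values())


trace = Trace(max_repeats=3)
trace.record("read_file", "src/app.py", 0.002)
for _ in range(4):  # the agent keeps retrying the same failing command
    trace.record("run_tests", "tests/test_app.py", 0.01)
```

An agent runner could check `trace.looping()` after each step and stop or escalate to a human when it returns `True`.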

12.2 Key Tools

LangSmith (LangChain): Tracing, evaluation, and monitoring with hierarchical trace views.

Langfuse (open-source): Self-hostable with traces, scores, prompt management, and cost tracking.

Braintrust: Evaluation framework with CI integration.

12.3 SWE-bench as an Evaluation Standard

SWE-bench provides 2,294 real-world tasks from 12 Python repositories. SWE-bench Verified (500 problems) is the gold standard. As of early 2026, top agents solve 50-60% of Verified tasks.
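
It helps to know what "solving" a task means. In SWE-bench, each task specifies tests that should go from failing to passing after the patch, plus tests that must keep passing. The sketch below shows that criterion and the resulting resolve rate; the function names and the three sample instances are made up for illustration and this is not the official evaluation harness.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """An instance counts as resolved only if the target tests now pass
    and no previously passing test regressed."""
    required = fail_to_pass + pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Three hypothetical instances with made-up test names.
outcomes = [
    is_resolved({"test_bug": True, "test_old": True}, ["test_bug"], ["test_old"]),
    is_resolved({"test_bug": True, "test_old": False}, ["test_bug"], ["test_old"]),  # regression
    is_resolved({"test_bug": False, "test_old": True}, ["test_bug"], ["test_old"]),  # not fixed
]
rate = resolve_rate(outcomes)
```

Note that a patch that fixes the bug but breaks an existing test does not count, which is why headline percentages are stricter than "the agent produced a plausible patch."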

Other benchmarks: HumanEval/MBPP (function-level), WebArena (web-based tasks), GAIA (general assistant), TAU-bench (customer service).

13. Computer Use and Browser Agents

13.1 Beyond APIs: Visual Interaction

A frontier development is agents that interact with GUIs the same way a human would: by looking at the screen and using mouse and keyboard actions.

Claude Computer Use (Anthropic) takes screenshots, analyzes visual content, and issues mouse/keyboard actions. This enables automation of tasks for which no API exists.

Use cases in software engineering:

GUI testing (verifying visual rendering)
Cross-browser testing
Legacy system interaction (systems with only a GUI)
End-to-end workflow testing

Limitations: Slower than API calls. Less reliable than structured tool use. Minor UI changes can break workflows.

13.2 Browser Agents

Browser Use: Open-source library for browser navigation.
Playwright MCP: MCP server exposing Playwright capabilities.
Stagehand (Browserbase): AI-native browser automation combining visual understanding with DOM access.

13.3 WebArena Benchmark

WebArena (Zhou et al., 2024) evaluates web-browsing agents on realistic tasks in self-hosted web applications. Best agents solve roughly 30-40% of tasks, indicating significant room for improvement.

The gap between API-based tool use (~50-60% on SWE-bench) and visual/browser-based interaction (~30-40% on WebArena) is significant. The two benchmarks measure different tasks, so the numbers are not directly comparable, but the gap is consistent with the limitations noted above: for current agents, structured, API-based interaction is more reliable than visual interaction. The practical implication: prefer API-based integrations whenever one exists, and reserve browser agents for tasks where no API is available rather than treating them as a general-purpose tool.

Use cases where browser agents shine despite their limitations:

Testing that a deployed web application renders correctly
Interacting with legacy systems that have no API
Automated form filling for systems you do not control
Verifying that a UI change looks correct in multiple browsers
Scraping data from websites that block automated access
End-to-end smoke tests that simulate real user journeys

14. MCP Servers for Development Workflows

14.1 What MCP Brings to Coding Agents

The Model Context Protocol (MCP), which we studied in Week 4, has particular significance for coding agents. MCP servers provide a standardized way to give agents access to development tools, databases, and services without building custom integrations for each one.

The key insight is that MCP servers turn development infrastructure into agent-accessible tools. Instead of the agent needing to know how to execute raw SQL, it can use a database MCP server that provides structured tools like query, list_tables, and describe_schema. Instead of parsing git output, it can use a git MCP server with tools like get_diff, create_branch, and list_commits.

14.2 Common MCP Servers for Development

Filesystem MCP server. Provides structured file operations (read, write, list, search) with built-in safety features like path restrictions (the agent can only access files within the project directory) and permission controls (read-only mode for sensitive directories).

Git MCP server. Exposes git operations as tools: git_status, git_diff, git_commit, git_log, git_branch. This is safer than giving the agent raw shell access to git because the MCP server can enforce policies (e.g., no force pushes, no commits to main).

Database MCP server. Provides read-only or controlled access to databases. The agent can query schemas, run SELECT queries, and inspect data without the risk of accidentally running DELETE or DROP statements. This is critical for debugging data-related issues.

Playwright/Browser MCP server. Enables the agent to interact with web pages programmatically: navigate, click, fill forms, take screenshots. This is useful for testing web applications, verifying deployments, and debugging UI issues.

Documentation MCP server. Provides access to project documentation, API references, and internal knowledge bases. The agent can search documentation to find relevant information without having to browse the web.
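
The database case above can be sketched without any MCP machinery, just to show what "structured, read-only tools" means in code. This is a minimal illustration using SQLite from the standard library; the `ReadOnlyDatabaseTools` class and its method names are assumptions, and a real server would wrap functions like these in an MCP SDK.

```python
import sqlite3


class ReadOnlyDatabaseTools:
    """Structured, read-only tools of the kind a database server might expose."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        # Defense in depth: SQLite itself rejects writes on this connection.
        self.conn.execute("PRAGMA query_only = ON")

    def list_tables(self) -> list[str]:
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]

    def query(self, sql: str, limit: int = 100) -> list[tuple]:
        """Run a SELECT statement and return at most `limit` rows."""
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("only SELECT statements are allowed")
        return self.conn.execute(sql).fetchmany(limit)


# Toy database standing in for the project's development database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.commit()

tools = ReadOnlyDatabaseTools(conn)
tables = tools.list_tables()
rows = tools.query("SELECT id, email FROM users")
```

The prefix check alone is easy to bypass, which is why the sketch also sets `query_only` at the connection level; a production setup would additionally use a database role with read-only privileges.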

14.3 Example: Configuring MCP for a Coding Workflow

json

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem",
               "/path/to/project"],
      "env": {}
    },
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres",
               "postgresql://dev:password@localhost:5432/myapp"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}

With this configuration, a coding agent has structured access to the filesystem, the project's database, and GitHub. It can read code, query data to understand the system's state, and interact with pull requests and issues, all through standardized, controlled interfaces.

Key Insight: MCP servers are to coding agents what IDEs are to developers: they provide structured access to the development environment. Raw shell access is powerful but dangerous and error-prone. MCP servers expose a controlled subset of those capabilities with better safety guarantees, clearer semantics, and easier auditability.
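
The safety argument rests on the server being able to check each request before running it. A sketch of such a check for git, using the two example policies mentioned earlier (no force pushes, no commits to main), might look like this. The function `check_git_policy` and the `PROTECTED_BRANCHES` set are hypothetical and not taken from any existing git MCP server.

```python
PROTECTED_BRANCHES = {"main", "master"}


def check_git_policy(args: list[str], current_branch: str) -> tuple[bool, str]:
    """Validate a git invocation (the arguments after `git`) against team policy.

    Returns (allowed, reason).
    """
    if not args:
        return False, "empty command"
    subcommand, rest = args[0], args[1:]
    if subcommand == "push" and any(
        a == "-f" or a.startswith("--force") for a in rest
    ):
        return False, "force pushes are not allowed"
    if subcommand == "commit" and current_branch in PROTECTED_BRANCHES:
        return False, f"direct commits to {current_branch} are not allowed"
    return True, "ok"
```

Because the check operates on a parsed argument list rather than a shell string, there is no quoting or command-chaining to worry about, which is part of what makes structured tools easier to audit.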

15. Testing Strategies When Working with Coding Agents

15.1 The Testing Challenge

When an agent writes code, how do you verify it is correct? The challenge is more subtle than it appears. If you ask the agent to write both the code and the tests, the tests may pass but miss important scenarios (the agent has the same blind spots in both). If you write all the tests yourself, you give up part of the productivity benefit. A balanced approach is needed.

15.2 Strategies

Strategy 1: Human-written tests, agent-written implementation (TDD with agents). The human writes test cases that capture the requirements. The agent writes code to pass those tests. This is the most reliable approach because the tests represent the human's understanding of correct behavior, and the agent's job is purely mechanical: make the tests pass.

This is the ideal workflow for well-specified features. The human's time is spent on the highest-value activity (defining correct behavior) and the agent handles the lower-value activity (writing the implementation).
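
A small example of this division of labor, assuming a hypothetical `slugify` requirement: the human writes the test first, and the agent's implementation is accepted only if the test passes.

```python
import re


# Written by the human first: these assertions are the specification.
def test_slugify() -> None:
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  multiple   spaces  ") == "multiple-spaces"
    assert slugify("already-slugged") == "already-slugged"
    assert slugify("") == ""


# Written by the agent afterwards, with the goal of passing the test above.
def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


test_slugify()
```

The edge cases (extra whitespace, empty input) are in the test because the human thought of them; the agent does not get to decide what counts as correct.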

Strategy 2: Agent-written tests, human review. The agent generates tests from the code or requirements. The human reviews the tests for completeness and correctness. This is faster than Strategy 1 but relies on the human catching gaps in test coverage.

When reviewing agent-generated tests, look for:

Missing edge cases (empty inputs, null values, boundary conditions)
Missing error cases (what happens when the database is down?)
Tests that only test the happy path
Tests that test implementation details rather than behavior (brittle tests)
Tests that always pass regardless of the code (tautological tests)
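
The last two items are the easiest to miss in review. The sketch below contrasts a test that merely mirrors the implementation with one that pins concrete behavior; `apply_discount` is a made-up function used only for illustration.

```python
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def tautological_test() -> None:
    # Recomputes the expected value with the implementation's own formula,
    # so a wrong formula copied into both places would still pass.
    price, percent = 80.0, 25.0
    assert apply_discount(price, percent) == round(price * (1 - percent / 100), 2)


def behavioral_test() -> None:
    # Pins concrete expected values, including boundaries and the error case.
    assert apply_discount(80.0, 25.0) == 60.0
    assert apply_discount(80.0, 0.0) == 80.0
    assert apply_discount(80.0, 100.0) == 0.0
    try:
        apply_discount(80.0, 120.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")


tautological_test()
behavioral_test()
```

Both tests pass here, which is exactly the problem: only the second one would fail if the discount logic were wrong.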

Strategy 3: Mutation testing. After the agent generates code and tests, run a mutation testing tool (like mutmut for Python or Stryker for JavaScript) that introduces small changes ("mutations") to the code and checks whether the tests catch them. If a mutation survives (the tests still pass despite the code being changed), the test suite has a gap.

This is the most rigorous approach and is particularly valuable for agent-generated code because it objectively measures test quality without relying on human judgment.
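
Mutation testing tools automate the generation of mutants, but the idea can be shown by hand with a single one. In this sketch the functions `is_adult`, `is_adult_mutant`, and the two test suites are invented for illustration.

```python
def is_adult(age: int) -> bool:
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutation: ">=" changed to ">"


def weak_suite(fn) -> bool:
    """Happy-path checks only; returns True if all checks pass."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary values, which is where the mutant differs."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


mutant_survives_weak = weak_suite(is_adult_mutant)      # tests still pass: a gap
mutant_survives_strong = strong_suite(is_adult_mutant)  # tests fail: mutant killed
```

A surviving mutant points directly at the missing test case, here the boundary at 18.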

Strategy 4: Dual-agent testing. Use one agent to write the implementation and a different agent (or the same agent in a separate session) to write the tests. Because the second agent does not share context with the first, it may catch different issues. This is the software engineering equivalent of "independent verification" in safety-critical systems.

15.3 A Practical Testing Workflow

text

1. Human writes requirements (natural language specification)
2. Agent A generates implementation
3. Agent B generates tests from the requirements (NOT from the code)
4. Run tests against implementation
5. Human reviews any failures:
   - Is the test wrong? (Adjust the test)
   - Is the implementation wrong? (Ask Agent A to fix)
6. Run mutation testing to verify test quality
7. Human reviews final code and tests

This workflow combines the speed of agent-generated code with the reliability of independent testing and human oversight.

16. Best Practices for Working with Coding Agents

16.1 Review Practices

Review agent code as critically as human code. Do not assume it is correct because an AI wrote it.
Check edge cases. Agents often handle the happy path well but miss edge cases.
Verify security. Run security scanning tools on all agent-generated code.
Understand before accepting. If you do not understand what the code does, do not accept it.
Run the tests. Never merge agent-generated code without running the full test suite.

16.2 Prompting Practices

Provide full context. Include relevant files, constraints, and conventions.
Be specific about requirements. Ambiguity leads to incorrect assumptions.
Define acceptance criteria. What does "done" look like?
Iterate. Treat agent output as a first draft, not a final product.
Use project files. Configure CLAUDE.md, .cursorrules, or equivalent.

16.3 Organizational Practices

Establish policies. Define which tasks agents can handle autonomously.
Track agent contributions. Know which code was agent-generated for auditing.
Maintain skills. Ensure developers regularly write code without agent assistance.
Share patterns. Document effective prompts and workflows for your team.
Measure impact. Track velocity, defect rate, and code review time.
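
Tracking contributions and measuring impact go together: once pull requests are labeled as agent-generated or not, the two groups can be compared. The sketch below assumes a hypothetical `PullRequest` record and made-up sample data; the numbers illustrate the calculation and are not results from any study.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    agent_generated: bool
    review_hours: float
    caused_defect: bool  # linked to a bug report after merge


def summarize(prs: list[PullRequest], agent_generated: bool) -> dict[str, float]:
    """Defect rate and average review time for one group of pull requests."""
    group = [pr for pr in prs if pr.agent_generated == agent_generated]
    if not group:
        return {"count": 0, "defect_rate": 0.0, "avg_review_hours": 0.0}
    return {
        "count": len(group),
        "defect_rate": sum(pr.caused_defect for pr in group) / len(group),
        "avg_review_hours": sum(pr.review_hours for pr in group) / len(group),
    }


# Made-up sample data for illustration only.
prs = [
    PullRequest(True, 1.0, False),
    PullRequest(True, 2.0, True),
    PullRequest(True, 1.5, False),
    PullRequest(True, 1.5, False),
    PullRequest(False, 2.0, False),
    PullRequest(False, 3.0, False),
]
agent_stats = summarize(prs, agent_generated=True)
human_stats = summarize(prs, agent_generated=False)
```

With samples this small the comparison means little; the point is that the comparison is only possible if agent-generated changes are labeled at the time they are made.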

17. Discussion Questions

The deskilling debate: Will widespread use of coding agents lead to a generation of developers who cannot program without AI assistance? Is this a problem, or is it similar to how calculators changed mathematics education? Consider that mathematicians today are not expected to do long division by hand.

Hint: Think about which skills become more important (specification, review, architecture, testing strategy) and which become less important (typing speed, syntax memorization, boilerplate generation).
Code ownership: If an agent writes 80% of a codebase, who is the author? Who is responsible for bugs? How does this affect code review culture?
The 10x developer myth: Some claim that coding agents make every developer a "10x developer." Critique this claim. What skills become more valuable when agents handle implementation?

Hint: If the bottleneck shifts from implementation to specification and review, then the developers who are best at understanding requirements and evaluating code quality become the most valuable, not those who type the fastest.
Open source implications: How do coding agents affect the open-source ecosystem? If agents can generate code that closely resembles existing open-source code, what are the licensing implications?
Competitive dynamics: As coding agents become more capable, what happens to the demand for junior developers? How should computer science education adapt?

18. Summary and Key Takeaways

Coding agents represent Stage 4 of AI-assisted programming, with a closed loop of action and observation that distinguishes them from simpler code generation tools.
The landscape is diverse: From IDE-integrated tools (Copilot, Cursor) to CLI agents (Claude Code, Aider) to fully autonomous systems (Devin, OpenHands). Each offers different tradeoffs between autonomy and control.
Effective agent-developer workflows range from "agent as junior developer" to "ping-pong pair programming" to "agent-first, human-review." The right model depends on the task, the agent's capability, and the developer's experience.
Agents excel at specific tasks: test generation, refactoring, documentation, debugging well-defined issues, and implementing features with clear specifications. They struggle with novel algorithms, deep architectural decisions, and cross-cutting concerns.
CI/CD integration enables agents to participate throughout the development lifecycle: reviewing PRs, fixing CI failures, and assisting with deployments.
Security is a real concern: Studies show that AI-generated code frequently contains vulnerabilities. Security scanning, careful review, and "trust but verify" are essential.
Over-reliance risks include skill atrophy, uncritical acceptance of agent output, and reduced code comprehension. Teams should deliberately maintain human coding skills.
Agent frameworks (OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI, Semantic Kernel) enable developers to build custom agents, each with different tradeoffs.
Observability is non-negotiable: Tools like LangSmith, Langfuse, and Braintrust provide the tracing, evaluation, and monitoring needed for production reliability.
Computer use and browser agents represent a new frontier where agents interact with GUIs visually, enabling automation of tasks for which no API exists.
Best practices center on treating agent output as a first draft: review critically, verify security, run tests, and ensure understanding before accepting.
Prompt engineering for coding agents dramatically affects output quality. Specific requirements, explicit constraints, context pointers, and clear definitions of done produce much better results than vague instructions.
MCP servers provide standardized, safe access to development infrastructure (filesystems, databases, git, browsers) that is superior to raw shell access for most agent operations.
Testing strategies for agent-generated code require independent verification. Human-written tests with agent implementations, mutation testing, and dual-agent testing all help ensure correctness beyond what the agent's own tests provide.
Team adoption follows a predictable pattern (individual exploration, team standardization, pipeline integration) and succeeds when the organizational challenges receive as much attention as the technical ones.

19. References

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).
Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2022). Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. 2022 IEEE Symposium on Security and Privacy (SP), 754-768.
Yang, J., Jimenez, C. E., Wettig, A., Liber, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793.
Vaithilingam, P., Zhang, T., & Glassman, E. L. (2022). Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. CHI Conference on Human Factors in Computing Systems Extended Abstracts, 1-7.
Barke, S., James, M. B., & Polikarpova, N. (2023). Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1), 85-111.
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., ... & Neubig, G. (2024). WebArena: A realistic web environment for building autonomous agents. Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).
OpenAI (2025). Agents SDK: A lightweight framework for building agentic applications. OpenAI Documentation. https://openai.github.io/openai-agents-python/
Anthropic (2025). Claude Code: An agentic coding tool. Anthropic Documentation. https://docs.anthropic.com/en/docs/claude-code
Anthropic (2025). Model Context Protocol (MCP). Anthropic Documentation. https://modelcontextprotocol.io/

These lecture notes are part of the Agentic AI course. Licensed under CC BY 4.0.