Foundations · W02 · 50 min read

LLMs as Reasoning Engines

Transformer architecture deep-dive: attention, positional encoding, scaling laws. Emergent reasoning from in-context learning and instruction tuning. Limitations (hallucination, context bounds, latency, cost) that motivate agent-style augmentation.

Core concepts: Transformer attention, in-context learning, scaling laws

Learning Objectives

By the end of this lecture, students will be able to:

Explain the Transformer architecture, including the self-attention mechanism and its computational properties.
Describe how pre-training, instruction tuning, and RLHF shape model behavior.
Analyze the relationship between model scale and emergent capabilities.
Evaluate the reasoning capabilities and limitations of current LLMs.
Compare major LLM families (GPT, Claude, Gemini, Llama, Mistral) in terms of architecture, capabilities, and trade-offs.
Explain inference-time compute and its importance for agentic behavior.
Make basic API calls to an LLM with system, user, and assistant messages.

1. Transformer Architecture Recap

1.1 Why Transformers Matter for Agents

Every modern LLM-based agent runs on a Transformer. Understanding the architecture is not merely academic; it directly affects agent design decisions. Context window size, attention patterns, and computational costs all constrain what agents can do.

Consider these practical questions that arise when building agents:

"Why does my agent lose track of instructions in long conversations?" (Answer: attention patterns and the "lost in the middle" phenomenon.)
"Why does it cost more when my agent reads a large file?" (Answer: attention is quadratic in sequence length.)
"Why does the same prompt sometimes produce different results?" (Answer: autoregressive generation with temperature sampling.)

To answer these questions, we need to understand the engine under the hood.

The Transformer was introduced by Vaswani et al. (2017) in the paper "Attention Is All You Need." It replaced recurrent neural networks (RNNs) as the dominant architecture for sequence modeling, primarily because:

Parallelism: Unlike RNNs, which process tokens one at a time (the output of position 5 depends on position 4, which depends on position 3...), Transformers process all positions in a sequence simultaneously. This makes training dramatically faster on GPUs.
Long-range dependencies: The attention mechanism can directly connect any two positions, regardless of distance. In an RNN, information about the first word must pass through every intermediate word to reach the last word, getting diluted along the way.
Scalability: Transformers scale efficiently to billions of parameters and trillions of training tokens. The architecture has proven remarkably robust: the same basic design works at 100M parameters and at 1T parameters.

1.2 The Attention Mechanism

The core innovation of the Transformer is self-attention. To understand it, let us start with an analogy.

Imagine you are reading a sentence: "The cat sat on the mat because it was tired." When you read the word "it," your brain automatically figures out that "it" refers to "the cat," not "the mat." You do this by considering all the words in the sentence and determining which ones are most relevant to understanding "it."

Self-attention does something similar. Given a sequence of token embeddings, self-attention computes a weighted combination of all tokens for each position, where the weights reflect the relevance of each token to the current one.

Mathematically:

Given input matrix $X \in \mathbb{R}^{n \times d}$ (n tokens, d dimensions):

Compute queries, keys, and values:
- $Q = X W_Q$ , $K = X W_K$ , $V = X W_V$
- Where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$
Compute the attention output from the scaled scores: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The $\frac{1}{\sqrt{d_k}}$ scaling factor prevents the dot products from becoming too large, which would push the softmax into regions with vanishing gradients. Without this scaling, as the dimensionality grows, the dot products grow in magnitude and the softmax would produce nearly one-hot outputs, making gradients very small.

Intuition behind Q, K, V: Think of attention like a library search system:

Query (Q): "What am I looking for?" Each token generates a query describing what information it needs.
Key (K): "What do I contain?" Each token generates a key describing what information it offers.
Value (V): "Here is my actual content." Each token generates a value containing the information to be transmitted.

The attention score between two tokens is the dot product of the query of the first token and the key of the second token. A high score means "the second token is relevant to what the first token is looking for." Scores are not symmetric, because queries and keys use different projections. The output for each token is a weighted average of all values, weighted by these relevance scores after the softmax.

Key Insight: Attention is a soft lookup table. Unlike a hash table (which returns exactly one result), attention retrieves a weighted blend of all entries, allowing the model to combine information from multiple relevant positions simultaneously.
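The formula above can be sketched in a few lines of plain Python. This is a minimal illustration for one head, assuming the queries, keys, and values have already been projected (the $W_Q$, $W_K$, $W_V$ multiplications are skipped); the function names and the toy vectors are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q and K are lists of n vectors of length d_k; V is a list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how much position i
    draws from position j.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Query-key dot products, scaled by 1 / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors: the "soft lookup".
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input (illustrative): two tokens whose queries match their own keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of `weights` sums to 1, and each output is a blend of both value vectors rather than a single retrieved entry, which is the "soft lookup table" behavior described above.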

1.3 Multi-Head Attention

Rather than computing a single attention function, the Transformer uses multi-head attention: multiple attention functions running in parallel, each with different learned projection matrices.

text

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

where head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i)

Why multiple heads? Think of it this way: when you read a sentence, you simultaneously track multiple types of relationships. One part of your brain tracks who is doing what (subject-verb relationships). Another tracks descriptive modifiers (adjective-noun relationships). Another tracks temporal order (what happened first, second, third).

Similarly, different attention heads can specialize in different types of relationships:

One head might track syntactic dependencies (subject-verb agreement)
Another might track semantic relationships (coreference, like "it" referring to "the cat")
Another might track positional patterns (attending to nearby tokens)
Another might track long-range dependencies (connecting a pronoun to its antecedent paragraphs earlier)

Research has shown that these specializations emerge naturally during training; they are not manually programmed.

Try It Yourself: Consider the sentence "The professor who taught the class also wrote the textbook." Think about the different types of relationships a model needs to track: Who taught? What was taught? Who wrote? What was written? How are "professor," "taught," and "wrote" connected? Each of these relationships could be captured by a different attention head.

1.4 The Full Transformer Block

A Transformer block consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network, with a residual connection and a normalization layer around each sublayer.

[Interactive figure: Transformer Decoder Block Architecture, tracing the path from input tokens to a next-token probability. For the multi-head attention layer, where every token looks at every other token and combines information by weight, the figure lists the equation softmax(QKᵀ / √d) · V and the tensor shape L × d → L × d. Example input: "The agent decides what to do"; example output: next token "tomorrow" with p = 0.42.]

The Feed-Forward Network (FFN) is a two-layer MLP applied independently to each position:

$\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2$

Modern models often use SwiGLU or GELU activation functions instead of ReLU, and may use RMSNorm instead of LayerNorm. These are engineering refinements that improve training stability and performance.

Residual connections ( $x + \text{sublayer}(x)$ ) are critical: they allow gradients to flow directly through the network, enabling training of very deep models (dozens or hundreds of layers). Without residual connections, gradients tend to vanish or explode in deep networks, making training unstable or impractical.

An intuitive way to think about residual connections: each layer refines the representation rather than replacing it. The input passes through unchanged, and the layer adds a correction. This is like editing a document: you start with the original text and make incremental changes, rather than rewriting from scratch at each step.

The interaction between attention and FFN is important for understanding what the model learns:

Attention layers move information between positions (they let tokens communicate with each other).
FFN layers transform information within each position (they process the combined information from attention into new representations).

Research suggests that the FFN layers store factual knowledge ("Paris is the capital of France") while attention layers handle relational reasoning ("the capital of the country where the Eiffel Tower is located").

1.5 A Concrete Example: How Attention Processes an Agent Prompt

To make the architecture concrete, let us trace how attention processes a typical agent prompt. Consider this simplified prompt:

text

System: You are a helpful assistant with calculator access.
User: What is 15% of 240?

When the model processes this, each token in "What is 15% of 240?" attends to every previous token. The attention mechanism allows the model to:

Connect "15%" to "of 240" to understand the mathematical relationship.
Connect "calculator access" from the system prompt to the mathematical question, activating the model's tool-calling behavior.
Connect "helpful assistant" to the overall generation strategy, influencing tone and format.

All of this happens simultaneously across multiple attention heads. One head might focus on the mathematical structure, another on the system instructions, and another on the conversational context. The combined output of all heads gives the model a rich understanding of what to do next.

1.6 Decoder-Only Architecture

Modern LLMs (GPT, Claude, Llama, Mistral) use a decoder-only architecture. Unlike the original encoder-decoder Transformer (designed for translation, where the encoder reads the input and the decoder generates the output), decoder-only models:

Use causal (masked) attention: Each token can only attend to itself and previous tokens, not future ones.
Generate text autoregressively: One token at a time, left to right.
Use the same architecture for both "understanding" and "generation."

Why causal masking? During training, the model learns to predict the next token given all previous tokens. If it could see future tokens, the task would be trivial (just copy the next token). The causal mask ensures that each position can only use information from the past, making the prediction task meaningful.
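A minimal sketch of the causal mask, in plain Python. The function name and the all-zero score matrix are illustrative assumptions. Real implementations usually set the masked scores to negative infinity before the softmax; dropping the future positions from the softmax, as done here, gives the same weights.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of an n x n score matrix under a causal mask.

    scores[i][j] is the raw score of query position i against key position j.
    Future positions (j > i) receive zero weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # position i sees positions 0..i only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# With identical scores, each position spreads its attention evenly over
# itself and the tokens before it.
uniform_scores = [[0.0] * 4 for _ in range(4)]
causal_weights = causal_attention_weights(uniform_scores)
for row in causal_weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, while the last token attends to all four positions. The upper-right triangle of the weight matrix is exactly zero, which is what keeps next-token prediction from being trivial.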

This is critical for understanding agent behavior: the model processes the entire prompt (system message + conversation history + tool results) and then generates the next token, one at a time. When an agent "decides" to call a tool, it is generating a sequence of tokens that happen to encode a tool call. There is no separate "decision module"; the tool call emerges from the same autoregressive process that generates ordinary text.

Key Insight: An LLM agent's "decision" to call a tool is not fundamentally different from its "decision" to write the next word of a sentence. Both are the result of autoregressive token generation conditioned on the context. This is both the power and the limitation of LLM agents: they can make remarkably flexible decisions, but those decisions are shaped by the statistical patterns in the training data.

Common Misconception: "The LLM understands the tool and decides to use it." More accurately, the LLM has been trained on text that describes tool usage, and it generates text that follows those patterns. Whether this constitutes "understanding" is a philosophical question; what matters for engineering is that it works reliably enough for practical agent tasks.

1.7 Positional Encoding

Since self-attention on its own is insensitive to token order (permuting the input tokens simply permutes the outputs, so the attention score between "cat" and "sat" is the same regardless of their positions), Transformers need positional information. Modern approaches include:

Rotary Position Embeddings (RoPE) (Su et al., 2024): Used in Llama, Mistral, and many recent models. Encodes relative position through rotation of query/key vectors. The key idea is that the dot product between a query at position $m$ and a key at position $n$ depends only on the content and the relative distance $m - n$ , not the absolute positions.
ALiBi (Attention with Linear Biases) (Press et al., 2022): Adds a bias to each attention score that decreases linearly with the distance between tokens. Nearby tokens are barely penalized; farther tokens receive a larger penalty.
Learned absolute positional embeddings: Used in original GPT models. Each position gets a learned embedding vector that is added to the token embedding.

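RoPE has become dominant in recent open models, in part because it encodes relative position directly in the attention scores and can be adapted to longer contexts with techniques such as position interpolation or rescaling the rotation frequencies. Out of the box, extrapolation beyond the training length is limited, so long-context models typically rely on these extensions. This matters for agents, which often process long conversations with many tool call results.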
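The relative-position property of RoPE can be checked numerically. The sketch below is a simplification under stated assumptions: it uses a single two-dimensional query/key pair and one rotation frequency `theta` (real models rotate many dimension pairs at different frequencies), and the helper names are illustrative.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of a query rotated for position m and a key rotated for position n."""
    q_rot = rotate(q, m, theta)
    k_rot = rotate(k, n, theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

q, k = (1.0, 2.0), (0.5, -1.0)
near_start = rope_score(q, k, m=3, n=1)      # relative distance 2
far_along = rope_score(q, k, m=103, n=101)   # same distance, 100 positions later
shifted = rope_score(q, k, m=3, n=2)         # relative distance 1

print(near_start, far_along, shifted)
```

`near_start` and `far_along` agree up to floating-point error because only $m - n$ enters the score, while `shifted` differs because the relative distance changed.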

1.8 Scale of Modern Models

Model	Parameters	Training Tokens	Context Window
GPT-4 (2023)	~1.8T (rumored MoE)	~13T (rumored)	8K-32K at launch; 128K (GPT-4 Turbo)
Claude 3.5 Sonnet (2024)	Undisclosed	Undisclosed	200K
Llama 3.1 405B (2024)	405B	15T	128K
Mistral Large 2 (2024)	~123B	Undisclosed	128K
Gemini 1.5 Pro (2024)	Undisclosed (MoE)	Undisclosed	1M-2M

To put these numbers in perspective:

405 billion parameters: If each parameter were a 1 mm grain of sand, the pile would fill several hundred cubic meters, on the order of dozens of dump-truck loads. This model "knows" things because its parameters encode statistical patterns from trillions of words of text.
15 trillion training tokens: If you read at a speed of one word per second, 24/7, it would take you about 475,000 years to read 15 trillion tokens. The model absorbs this in a few months of GPU training.
200K context window: About 150,000 words, equivalent to a 500-page novel. An agent can keep an entire book, or an entire conversation with hundreds of tool calls, in its working memory.

The trend is clear: more parameters, more training data, and larger context windows. Each of these dimensions enables more capable agents.

2. From Language Models to Reasoning

2.1 What Does "Reasoning" Mean?

In the context of LLMs, "reasoning" refers to the ability to:

Draw logical conclusions from premises
Solve multi-step problems
Transfer knowledge across domains
Plan sequences of actions
Identify and correct errors

Whether LLMs truly "reason" or merely simulate reasoning through sophisticated pattern matching is an active philosophical and scientific debate. For agent design, what matters is practical capability: can the model reliably produce correct outputs for reasoning-intensive tasks?

This is worth dwelling on, because it affects how you build agents. If you believe the model truly reasons, you might trust it with complex planning. If you believe it is pattern-matching, you might add more verification steps, external tools, and human checkpoints. The pragmatic approach (which we take in this course) is: assume the model might be wrong, design accordingly, but take advantage of the impressive capabilities it does have.

2.2 Emergent Capabilities

One of the most fascinating phenomena in modern AI is that larger language models do not just do the same things better; they develop qualitatively new capabilities that smaller models simply cannot perform. These are called emergent capabilities.

Emergent capabilities are abilities that appear in larger models but are absent or unreliable in smaller ones. Wei et al. (2022a) documented several:

Arithmetic: Models below a certain scale cannot reliably add three-digit numbers. Above that threshold, accuracy jumps dramatically. This makes sense: reliable multi-digit arithmetic requires tracking carries across positions, which needs sufficient model capacity.
Chain-of-thought reasoning: Smaller models do not benefit from "let's think step by step" prompts; larger models do. This suggests that the ability to productively use generated intermediate steps requires a minimum level of capability.
Code generation: Producing correct, executable code requires substantial model capacity. The model needs to understand syntax, semantics, logic, and the conventions of a particular language simultaneously.
Instruction following: The ability to follow complex, multi-part instructions improves with scale. A small model might follow "write a poem" but fail at "write a poem about autumn in the style of Emily Dickinson, using exactly four stanzas of four lines each."

However, Schaeffer et al. (2024) argued that some apparent emergent abilities are artifacts of the evaluation metrics used. When using metrics that show a sharp transition (like exact-match accuracy), the transition appears sudden. When using smoother metrics (like partial credit), capabilities improve more gradually. The true picture is nuanced: capabilities generally improve gradually with scale, but the practical utility can appear to jump when accuracy crosses a usability threshold.

Key Insight: From an agent design perspective, the important question is not "at what scale does reasoning emerge?" but "at what scale is reasoning reliable enough for my use case?" An agent that needs to make 20 correct decisions in a row requires much higher per-decision accuracy than an agent that needs to make one correct decision.
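The arithmetic behind this insight is short. The sketch below assumes that each decision succeeds independently with the same probability, which is a simplification (real agent errors are often correlated); the function names are illustrative.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_accuracy ** steps

def required_step_accuracy(target, steps):
    """Per-step accuracy needed for the whole chain to succeed with probability `target`."""
    return target ** (1.0 / steps)

for accuracy in (0.90, 0.95, 0.99):
    print(f"{accuracy:.2f} per step -> {chain_success(accuracy, 20):.3f} over 20 steps")

print(f"needed per step for 90% over 20 steps: {required_step_accuracy(0.90, 20):.4f}")
```

Under this assumption, 95% per-decision accuracy gives only about a 36% chance of completing 20 decisions without an error, and reaching 90% end-to-end requires roughly 99.5% per decision.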

2.3 The Scaling Hypothesis

The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) describe a predictable relationship between model performance and three factors:

Number of parameters (N)
Dataset size (D)
Compute budget (C)

Kaplan et al. (2020) originally found that performance improves as a power law with each of these factors, and that the relationship is remarkably smooth across many orders of magnitude. This means you can predict how well a larger model will perform before you train it.

Hoffmann et al. (2022) refined this with the "Chinchilla" scaling law, showing that for a given compute budget, there is an optimal balance between model size and data size. Their key finding: many existing models were undertrained, meaning they needed more data relative to their parameter count. For example, the original GPT-3 (175B parameters, 300B training tokens) was significantly undertrained by Chinchilla standards, which would suggest training a smaller model on more data.
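The GPT-3 example can be made concrete with two widely used approximations, neither of which is stated in this lecture and both of which should be read as assumptions: training compute of roughly $C \approx 6ND$ FLOPs, and a compute-optimal ratio of roughly 20 training tokens per parameter. The constants and function names below are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: C is approximately 6 * N * D
TOKENS_PER_PARAM = 20       # assumption: rough Chinchilla-style rule of thumb

def training_flops(params, tokens):
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal(flops):
    """Split a compute budget into (params, tokens) with tokens = 20 * params."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

gpt3_params, gpt3_tokens = 175e9, 300e9
gpt3_flops = training_flops(gpt3_params, gpt3_tokens)
opt_params, opt_tokens = compute_optimal(gpt3_flops)

print(f"GPT-3 tokens per parameter: {gpt3_tokens / gpt3_params:.1f}")
print(f"Same compute, rebalanced: {opt_params / 1e9:.0f}B params, {opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions, GPT-3 saw fewer than 2 tokens per parameter, and the same compute budget would be better spent on a model of roughly 51B parameters trained on about 1.0T tokens.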

Implications for agent design:

Larger models are generally more capable agents, but the relationship is not linear. Doubling the model size does not double the agent's capability.
Training data quality matters as much as quantity. A model trained on curated, high-quality text outperforms one trained on raw web scrapes.
Compute costs at inference time also scale with model size, creating practical constraints for agent deployments. An agent that makes 20 calls to a 70B model is much cheaper than one that makes 20 calls to a 405B model.

3. Training Pipeline: Pre-training, Instruction Tuning, and RLHF

Understanding the training pipeline is essential for agent builders because each stage shapes different aspects of the model's behavior. When your agent behaves in a certain way (helpful, cautious, verbose, sycophantic), you can often trace it back to a specific training stage.

3.1 Pre-training

The foundation of every LLM is pre-training: predicting the next token on a massive corpus of text. This is one of the most elegant ideas in modern AI: by simply learning to predict what word comes next, the model acquires an astonishing range of capabilities including grammar, facts, reasoning, coding, and even some common sense.

text

Training objective: Minimize cross-entropy loss

L = -sum(log P(token_t | token_1, ..., token_{t-1}))

In plain English: for every position in the training text, the model tries to predict what comes next. The loss function penalizes the model more when it assigns low probability to the correct next token. Over trillions of tokens, the model learns the statistical structure of language, from grammar and syntax to facts and reasoning patterns.
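As a small numeric sketch of this loss: given the probability the model assigned to the correct next token at each position, the loss is the negative sum of their logarithms. The probabilities below are made up for illustration, and the function name is not from any library.

```python
import math

def next_token_loss(correct_token_probs):
    """Cross-entropy summed over positions: -sum(log P(correct token))."""
    return -sum(math.log(p) for p in correct_token_probs)

# Hypothetical probabilities assigned to the correct token at three positions.
confident = [0.90, 0.80, 0.95]
one_bad_guess = [0.90, 0.05, 0.95]

print(f"confident:     {next_token_loss(confident):.3f}")
print(f"one bad guess: {next_token_loss(one_bad_guess):.3f}")
```

A single position where the correct token received only 5% probability dominates the total, which is the sense in which the loss "penalizes the model more" for low-probability predictions. In practice the sum is usually averaged over tokens.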

To see why next-token prediction is so powerful, consider what the model needs to learn to predict well in different contexts:

"The capital of France is [Paris]" requires learning factual knowledge.
"def fibonacci(n): if n <= 1: return n; return [fibonacci(n-1)]" requires learning programming.
"If all dogs are mammals and Rex is a dog, then Rex is a [mammal]" requires learning logic.
"She felt sad because her friend [moved] away" requires learning common sense.

All of these capabilities emerge from a single training objective. This is remarkable and is the reason why LLMs are so versatile as agent brains.

Pre-training data typically includes:

Web pages (Common Crawl, filtered and deduplicated)
Books and academic papers
Code repositories (GitHub)
Wikipedia and encyclopedic content
Conversational data (forums, Q&A sites)

Pre-training produces a model that can complete text. But a text completion engine is not yet an agent; it can generate plausible continuations, but it does not reliably follow instructions or refuse harmful requests.

To illustrate the gap: a pre-trained model given the prompt "The capital of France is" will likely continue with "Paris." But given "What is the capital of France? Please answer in one word," it might continue with "The capital of France is Paris. France, officially the French Republic, is a country primarily located in Western Europe..." because it learned to generate Wikipedia-style text, not to follow instructions.

3.2 Instruction Tuning (Supervised Fine-Tuning)

Instruction tuning (also called SFT, Supervised Fine-Tuning) teaches the model to follow instructions. The training data consists of (instruction, response) pairs:

text

Instruction: "Explain quantum entanglement to a 10-year-old."
Response: "Imagine you have two magic coins..."

Key datasets and approaches:

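FLAN (Wei et al., 2022b): Instruction-tuned on about 60 NLP datasets rephrased as natural-language instructions. The follow-up Flan collection (Chung et al., 2022) scaled this to 1,836 tasks across many categories.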
InstructGPT (Ouyang et al., 2022): Combined SFT with RLHF. Showed that instruction tuning with human feedback dramatically improved user satisfaction.
Alpaca (Taori et al., 2023): Self-instruct approach using GPT-generated data, demonstrating that synthetic instruction data can be effective.

After instruction tuning, the model follows instructions more reliably, responds in helpful formats, and begins to exhibit the conversational behavior we associate with AI assistants.

Key Insight: Instruction tuning is what makes LLMs usable as agent brains. Without it, you would need to carefully craft prompts that look like text completions rather than instructions. With it, you can simply tell the agent what to do in natural language. This is why instruction tuning is sometimes called the most important step in the pipeline from an application perspective.

3.3 RLHF: Reinforcement Learning from Human Feedback

RLHF refines the model's behavior based on human preferences. The process:

Step 1: Collect comparison data

Present human raters with a prompt and two model responses.
The rater selects the better response.
This is much easier than writing perfect responses from scratch; humans are better at judging than generating.

Step 2: Train a reward model

A separate model learns to predict human preferences.
Given a (prompt, response) pair, it outputs a scalar reward.
This automates the human judgment: instead of asking humans for every comparison, we train a model to approximate their preferences.
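One common way to train such a reward model is a pairwise (Bradley-Terry style) loss on the comparison data from Step 1. The lecture does not specify the loss, so treat this choice as an assumption; the scalar rewards below are made-up numbers.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))   # reward model agrees with the rater: small loss
print(preference_loss(0.0, 0.0))   # reward model is indifferent
print(preference_loss(0.0, 2.0))   # reward model disagrees with the rater: large loss
```

Minimizing this loss pushes the scalar reward of the response the rater preferred above the reward of the response the rater rejected.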

Step 3: Optimize the policy with RL

Use Proximal Policy Optimization (PPO) or similar to fine-tune the LLM.
The objective: maximize the reward model's score while staying close to the SFT model (to avoid reward hacking).

text

Objective: max E[R(prompt, response)] - beta * KL(policy || SFT_policy)

The KL divergence penalty is crucial: without it, the model would find "shortcuts" that exploit the reward model's weaknesses without actually being helpful. For example, a reward model trained on human preferences might give high scores to verbose, confident responses; without the KL penalty, the model would learn to always be verbose and confident, even when brevity and uncertainty would be more appropriate.
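A toy calculation shows how the penalty changes which response the optimizer prefers. It uses a single-sample estimate of the KL term, the difference between the log probability of the response under the current policy and under the SFT model; all numbers are invented for illustration, and `beta = 0.1` is an arbitrary choice.

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Per-response objective: reward minus beta * (log pi - log pi_SFT)."""
    return reward - beta * (logp_policy - logp_sft)

# A response the SFT model would also plausibly produce.
ordinary = kl_penalized_reward(reward=3.0, logp_policy=-20.0, logp_sft=-22.0)

# A response that games the reward model but is very unlikely under the SFT model.
exploit = kl_penalized_reward(reward=5.0, logp_policy=-10.0, logp_sft=-60.0)

print(ordinary, exploit)
```

The exploit earns the higher raw reward, but once the drift from the SFT model is charged against it, the ordinary response scores higher.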

RLHF effects relevant to agents:

Models become more helpful and less likely to refuse valid requests.
Models learn to say "I don't know" rather than fabricate answers (sometimes).
Models develop safety behaviors (refusing harmful requests).
However, RLHF can also make models overly cautious or sycophantic. A model might agree with the user even when the user is wrong, because human raters tended to prefer agreeable responses.

3.4 RLAIF and Constitutional AI

Constitutional AI (Bai et al., 2022) replaces human feedback with AI feedback:

Generate responses to harmful prompts.
Ask the model to critique its own response based on a set of principles (the "constitution"). For example: "Is this response harmful? Does it help with illegal activities?"
Ask the model to revise its response.
Train on the revised responses.

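In a second phase, the model's own preference judgments between pairs of responses, guided by the same principles, replace human comparisons when training the preference model used for RL. This is known as RLAIF (Reinforcement Learning from AI Feedback). It is more scalable than human feedback (you do not need to pay thousands of human raters) but introduces the risk of reinforcing the model's own biases.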

Anthropic uses Constitutional AI as a core part of Claude's training. The "constitution" includes principles like "be helpful," "be honest," and "be harmless." This is intended to shape Claude's style: helpful but cautious, direct about uncertainty, and willing to push back on incorrect premises.

3.5 Direct Preference Optimization (DPO)

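Rafailov et al. (2023) introduced DPO as a simpler alternative to RLHF. Instead of training a separate reward model and then using RL, DPO directly optimizes the language model on preference data using a classification-like loss. The key insight is that, under the KL-regularized RLHF objective, the reward can be rewritten in terms of the log-probability ratio between the optimal policy and the reference (SFT) policy. Substituting this into the preference model gives a loss that depends only on the log probabilities of the chosen and rejected responses under the policy and the reference model.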

DPO has become increasingly popular because it is simpler to implement, more stable during training, and often produces comparable results to full RLHF pipelines. It eliminates the need for: (a) training a separate reward model, (b) running RL optimization, and (c) dealing with the instability of PPO training.
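A minimal sketch of the DPO loss for a single preference pair, written in plain Python on made-up sequence log probabilities. The argument names and `beta = 0.1` are illustrative assumptions; a real implementation would compute the log probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y_w) - log pi_ref(y_w)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l) - log pi_ref(y_l)
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the loss starts at log(2).
at_start = dpo_loss(-30.0, -35.0, -30.0, -35.0)

# Policy has raised the chosen response and lowered the rejected one.
after_update = dpo_loss(-25.0, -45.0, -30.0, -35.0)

print(at_start, after_update)
```

No reward model or sampling loop appears anywhere: the loss is an ordinary supervised objective over pairs, which is where the simplicity and stability described above come from.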

3.6 Post-Training Summary

Each stage builds on the previous one. For agent applications, all three stages matter: pre-training provides the knowledge base, instruction tuning provides the ability to follow complex instructions, and alignment ensures the agent behaves safely and helpfully.

Try It Yourself: Think about how these training stages affect a coding agent. Pre-training gives the model knowledge of programming languages and patterns. Instruction tuning teaches it to follow coding instructions ("write a function that..."). RLHF teaches it to write clean, well-documented code (because human raters prefer that) and to refuse to write malicious code. Can you think of a scenario where each stage's contribution is essential?

4. Reasoning Capabilities

4.1 Arithmetic and Mathematical Reasoning

LLMs can perform arithmetic, but their reliability depends on the complexity. Understanding these limitations is critical for agent design, because agents that make calculations need to know when to trust the LLM and when to delegate to a tool.

Task	Typical Accuracy (GPT-4 class)
Single-digit addition	~100%
Multi-digit addition (5+ digits)	~90-95%
Multiplication (3+ digits)	~70-85%
Word problems (grade school)	~90-95% with CoT
Competition math (AMC/AIME)	~30-60%
Research-level math	Low, highly variable

Why LLMs struggle with arithmetic: LLMs process numbers as tokens, not as mathematical objects. The number "1,234,567" might be tokenized as ["1,", "234", ",567"] or similar, breaking the place-value structure that makes arithmetic systematic. The model must learn to "simulate" arithmetic from patterns in text, rather than executing it algorithmically.

Why agents need tools for math: Even a 95% accuracy rate is unacceptable for reliable computation. Consider an agent making a financial calculation: 5% error rate means 1 in 20 calculations is wrong. If your agent calculates loan payments, investment returns, or tax obligations, a 5% error rate is catastrophic. This is why agent architectures include calculator tools: not because LLMs cannot do math, but because they cannot do it reliably enough.

Key Insight: A good agent knows its own limitations. When faced with "What is 7 * 8?", a well-designed agent should answer directly (it reliably knows this). When faced with "What is 14,273 * 89,651?", it should use a calculator tool. The system prompt and tool descriptions should guide this behavior.
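A calculator tool of the kind described here can be very small. The sketch below is a hypothetical implementation, not the API of any agent framework: it parses the expression with Python's `ast` module and evaluates only numbers and the four basic operators, so model-generated input cannot execute arbitrary code.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def calculate(expression):
    """Evaluate +, -, *, / arithmetic exactly; reject anything else."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    # Assumption: commas are thousands separators, as in "14,273".
    cleaned = expression.replace(",", "")
    return evaluate(ast.parse(cleaned, mode="eval").body)

print(calculate("7 * 8"))
print(calculate("14,273 * 89,651"))
```

Python integers have arbitrary precision, so the large product is exact rather than approximately right, which is the reliability gap the tool exists to close.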

4.2 Logical Reasoning

Understanding how LLMs handle different types of logical reasoning is important for agent design because it tells you which kinds of reasoning you can trust the model with and which you need to verify through tools or external systems.

LLMs handle several types of logical reasoning with varying reliability:

Deductive reasoning (given premises, derive conclusions):

text

All mammals breathe air.
Whales are mammals.
Therefore, whales breathe air.

LLMs handle simple syllogisms well but struggle with longer chains of reasoning or when irrelevant premises are included. Adding irrelevant information ("Birds also breathe air. Some fish can breathe air too.") can confuse the model, even though logically it should not affect the conclusion.

Inductive reasoning (generalize from examples):

text

Observation: Every swan I've seen is white.
Hypothesis: All swans are white.

LLMs can identify patterns but may over-generalize. They also struggle with knowing when inductive reasoning is appropriate versus when more data is needed.

Abductive reasoning (find the best explanation):

text

The grass is wet. It probably rained.
(But it could also be sprinklers, morning dew, or a broken pipe.)

LLMs are surprisingly good at this type of everyday reasoning, likely because their training data is full of narratives that explain observations. This is valuable for agents: when a tool call fails, the agent needs to reason about why it failed and what to try next.

4.3 Commonsense Reasoning

LLMs have absorbed vast commonsense knowledge from their training data:

text

Q: "If I put a glass of water in the freezer overnight, what happens?"
A: "The water freezes and turns to ice. The glass might crack if
    the water expands enough."

This kind of commonsense reasoning is critical for agents that interact with the real world:

A coding agent needs to know that you should test code before deploying it.
A research agent needs to know that a 2019 paper cannot cite a 2023 paper.
A scheduling agent needs to know that people generally do not want meetings at 3 AM.

The remarkable thing about LLM commonsense reasoning is its breadth. No one programmed "water expands when it freezes" into the model; it learned this from text. The challenge is that commonsense knowledge is also where hallucinations are hardest to detect: the model may state something with the same confidence whether it is a genuine fact or a plausible-sounding fabrication.

4.4 Planning

Planning is perhaps the most important reasoning capability for agents, and also one of the hardest to get right. It involves:

Understanding the goal state
Identifying the current state
Generating a sequence of actions to get from current to goal
Anticipating obstacles and alternatives

LLMs show mixed planning abilities:

Simple plans (3-5 steps, familiar domains): Generally reliable. "To make a sandwich: get bread, add filling, close sandwich." The model has seen thousands of such plans in its training data.
Complex plans (10+ steps, novel domains): Frequently flawed; they may miss dependencies, generate impossible steps, or fail to account for constraints. "Design a migration plan for moving 200 microservices from AWS to GCP" is likely to have errors.
Long-horizon planning: Particularly challenging; agents tend to lose track of the overall goal during extended execution. After 15 steps, the agent may forget what it was originally trying to accomplish.

This is why many agent architectures separate planning from execution (we will cover this in Week 5). The planner creates a high-level strategy, and the executor carries out each step, with periodic check-ins to make sure the plan still makes sense.
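One way to make such check-ins concrete is to have the planner emit steps with explicit dependencies and validate them before anything is executed. The sketch below is a minimal illustration, not the architecture covered in Week 5: the `plan` dictionary format and the `order_plan` helper are assumptions made for this example. It uses the standard library's `graphlib` module to reject plans with undefined or circular dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def order_plan(plan: dict[str, list[str]]) -> list[str]:
    """Return an executable step order, or raise ValueError if the plan is flawed.

    `plan` maps each step name to the steps it depends on (hypothetical format).
    """
    # A dependency that is not itself a step means the plan is missing a step.
    for step, deps in plan.items():
        missing = [d for d in deps if d not in plan]
        if missing:
            raise ValueError(f"step {step!r} depends on undefined steps: {missing}")
    try:
        # static_order() yields every step after all of its dependencies.
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        raise ValueError(f"plan has circular dependencies: {exc.args[1]}") from exc


plan = {
    "get bread": [],
    "add filling": ["get bread"],
    "close sandwich": ["add filling"],
}
print(order_plan(plan))  # ['get bread', 'add filling', 'close sandwich']
```

A check like this catches only structural problems. Whether each step is actually feasible still has to be verified during execution.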

4.5 Code Generation and Debugging

Code generation deserves special mention because it is the most practically useful reasoning capability for coding agents, the most deployed agent type today.

LLMs can generate working code from natural language descriptions, debug code by analyzing error messages, refactor code for readability or performance, write tests for existing code, and explain what code does in natural language.

However, code generation has characteristic failure modes:

Plausible but incorrect code: The code looks right and may even compile, but has a subtle logic error. This is the code equivalent of factual hallucination.
Outdated APIs: The model may use deprecated functions or old library syntax from its training data.
Security vulnerabilities: The model may generate code that works but has SQL injection risks, hardcoded credentials, or other issues.
Overfitting to examples: If the prompt resembles a common tutorial example, the model may produce the tutorial solution rather than the requested variation.

For coding agents, the implication is clear: always run generated code, always run the tests, and always review security-critical changes. The agent architecture should include verification steps, not just generation steps.
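A verification step can be as simple as running the generated code together with its tests in a separate process. The following is a rough sketch under stated assumptions: `run_with_tests` is a hypothetical helper, the `candidate` string stands in for model output, and a subprocess provides a timeout and crash isolation but is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_tests(generated_code: str, test_code: str,
                   timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run generated code plus its tests in a separate Python process.

    Returns (passed, combined_output).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} seconds"
    # A non-zero exit code means an exception or a failed assertion.
    return proc.returncode == 0, proc.stdout + proc.stderr


# Stand-in for model output with a subtle off-by-one error in range().
candidate = "def total(n):\n    return sum(range(n))\n"
tests = "assert total(3) == 6, 'total(3) should be 1 + 2 + 3'\n"

passed, output = run_with_tests(candidate, tests)
print("passed:", passed)  # passed: False
```

The failure output can then be fed back to the model as the input for a debugging step.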

Try It Yourself: Ask an LLM to plan a complex task, like "Organize a surprise birthday party for 30 people at a restaurant." Examine the plan for: (a) missing steps, (b) dependency errors (steps that depend on information not yet available), (c) unrealistic assumptions. How would you fix the plan? What would an agent architecture need to handle these issues automatically?

065. Limitations

Understanding the limitations of LLMs is not pessimism; it is essential engineering knowledge. Every limitation implies a design decision: if the model might hallucinate, add verification. If the context window is limited, add memory management. If the model is sensitive to prompt phrasing, add robustness testing.

A useful mental model: think of each limitation as a "failure mode" that your agent architecture must handle. Just as a mechanical engineer designs bridges to withstand specific failure modes (wind, load, earthquakes), an agent engineer designs systems to withstand specific LLM failure modes (hallucination, context overflow, prompt sensitivity). The architectures we will study in Week 5 are, in many ways, engineering responses to the limitations described below.

5.1 Hallucinations

Hallucination is the generation of plausible-sounding but factually incorrect content. Types include:

Factual hallucination: Stating incorrect facts ("The Eiffel Tower was built in 1910"). The model generates text that sounds right but is wrong. This is particularly dangerous because the model is equally confident in true and false statements.
Fabricated citations: Inventing papers, authors, or publication venues that do not exist. "According to Smith et al. (2023) in Nature..." where no such paper exists. This is a well-documented problem that has led to embarrassing incidents in legal briefs and academic papers.
Logical hallucination: Making reasoning errors while presenting the chain of thought confidently. The model might correctly set up a math problem and then make an arithmetic error, or correctly identify premises and then draw an invalid conclusion.
Contextual hallucination: Contradicting information provided in the conversation. You tell the model "The meeting is on Tuesday" and later it says "As we discussed, the Wednesday meeting..."

Why this matters for agents: An agent that hallucinates a function name that does not exist will produce code that crashes. An agent that hallucinates a citation undermines research integrity. An agent that hallucinates a file path will fail to read the file. Agent architectures must include verification mechanisms: tool use (to check facts), code execution (to verify code works), and self-reflection (to catch errors).

Mitigation strategies:

Retrieval-Augmented Generation (RAG): Ground responses in retrieved documents. Instead of relying on the model's memory, provide the relevant information in the context.
Tool use: Use search engines, databases, and APIs to verify claims. "Let me check that..." should be a natural part of agent behavior.
Self-consistency checks: Generate multiple responses and check for agreement. If 5 independent reasoning chains produce different answers, the model is uncertain.
Confidence calibration: Ask the model to rate its confidence (imperfect but useful). Research shows that LLMs are poorly calibrated by default but can be improved with appropriate prompting.
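
The self-consistency check above can be sketched as a majority vote over several sampled answers. This is a minimal illustration, assuming the final answers have already been extracted as strings; `self_consistency` and the `samples` list are hypothetical stand-ins, not part of any library.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so "47" and "47 " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Stand-in for five independently sampled final answers to the same question.
samples = ["47", "47", "43", "47 ", "47"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # 47 0.8
```

An agent might treat low agreement as a signal to call a tool or ask the user rather than act on the majority answer.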

5.2 Context Window Limitations

Even models with 200K token context windows face limitations:

Lost in the middle (Liu et al., 2024): A striking finding. Models tend to pay more attention to information at the beginning and end of the context, potentially missing important information in the middle. In experiments, accuracy on information retrieval tasks dropped significantly when the relevant information was placed in the middle of a long context, compared to the beginning or end.
Computational cost: Attention is quadratic in sequence length ($O(n^2)$), making very long contexts expensive. Processing a 200K token context requires roughly 400x more attention computation in total than a 10K token context (20x the length, squared), which works out to about 20x more per token.
Information density: Filling the context with irrelevant information degrades performance even if the relevant information is present. More context is not always better.
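
The quadratic scaling can be checked with a few lines of arithmetic. This sketch assumes attention cost is exactly proportional to $n^2$ in total (and therefore to $n$ per token) and ignores the parts of the model whose cost grows linearly with sequence length; `attention_cost_ratio` is a name chosen for this example.

```python
def attention_cost_ratio(long_ctx: int, short_ctx: int) -> tuple[float, float]:
    """Ratio of self-attention work between two context lengths.

    Assumes total cost proportional to n^2, hence per-token cost proportional to n.
    """
    total_ratio = (long_ctx / short_ctx) ** 2
    per_token_ratio = long_ctx / short_ctx
    return total_ratio, per_token_ratio


total, per_token = attention_cost_ratio(200_000, 10_000)
print(f"total attention work: {total:.0f}x, per token: {per_token:.0f}x")
# total attention work: 400x, per token: 20x
```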

Agent implications: Memory management is crucial. Agents must decide what to keep in context, what to summarize, and what to store in external memory. This is analogous to how a researcher decides what notes to keep on their desk versus filing away for later. We will cover memory management strategies in Week 7.

Key Insight: A common mistake in agent design is dumping everything into the context ("just give it all the information and let it figure it out"). This fails for two reasons: (1) it is expensive, and (2) the model performs worse with irrelevant context noise. Good agents are selective about what they include in context, just as good researchers are selective about what they read.

5.3 Knowledge Cutoffs

LLMs have a training data cutoff date. They do not know about events, publications, or software updates that occurred after training. For agents, this means:

They may suggest deprecated APIs or outdated library versions. (For example, calling openai.ChatCompletion.create(), which the OpenAI Python SDK removed in version 1.0 in favor of client.chat.completions.create().)
They may not know about recent security vulnerabilities.
They may reference organizations, people, or projects inaccurately if things have changed.

Solution: Tool use. Web search, documentation retrieval, and API queries provide current information. A well-designed agent checks its knowledge against current sources for anything time-sensitive.

5.4 Sensitivity to Prompt Formulation

The same question phrased differently can produce different answers. This is problematic for agents because:

Tool descriptions must be carefully crafted.
System prompts require iterative refinement.
Small changes in conversation history can alter agent behavior.

For example, "Calculate the total cost" and "What is the sum of all costs?" might lead to different tool calls or different calculation approaches, even though they mean the same thing. This fragility is why prompt engineering (Week 3) is a critical skill for agent builders.

5.5 Sycophancy

RLHF-trained models sometimes agree with the user even when the user is wrong. This is called sycophancy (Perez et al., 2023). The mechanism is straightforward: during RLHF training, human raters tend to prefer responses that agree with them, so the model learns to agree.

For agents, sycophancy can manifest as:

Going along with a flawed plan rather than pushing back. ("User: Let's delete the production database and recreate it. Agent: Great idea! Let me do that.")
Confirming incorrect assumptions from the user. ("User: The bug is in file X, right? Agent: Yes, it's definitely in file X." Even though the bug is actually in file Y.)
Not flagging errors in user-provided information.

This is particularly dangerous for agents because they act on these agreements. A sycophantic chatbot gives a wrong answer; a sycophantic agent takes a wrong action.

076. Key Model Families

Understanding the major model families is important for practical agent design because each family has distinct strengths and weaknesses. The choice of model affects cost, capability, latency, and deployment options. In this section, we survey the landscape as of early 2026.

Common Misconception: "There is a single 'best' model." In reality, the best model depends on the specific task, budget, latency requirements, and deployment constraints. A model that excels at code generation might underperform at creative writing. A model that is excellent for English might struggle with other languages. Agent designers often use multiple models within a single system.

6.1 GPT Series (OpenAI)

Model	Release	Key Features
GPT-3 (2020)	Jun 2020	175B parameters, demonstrated few-shot learning
GPT-3.5 (2022)	Nov 2022	ChatGPT, instruction-tuned, dialogue-optimized
GPT-4 (2023)	Mar 2023	Multimodal, dramatically improved reasoning
GPT-4 Turbo (2023)	Nov 2023 (preview)	128K context, cheaper, function calling
GPT-4o (2024)	May 2024	Omni-model, native multimodal, faster
o1 / o3 (2024-2025)	2024-2025	Reasoning models with chain-of-thought

Characteristics: Strong general reasoning, excellent function calling support, well-established API ecosystem. The o1/o3 series introduced dedicated reasoning models that use inference-time compute for complex problems. GPT models have the most mature function-calling interface, making them popular for agent applications.

6.2 Claude Series (Anthropic)

Model	Release	Key Features
Claude 1 (2023)	Mar 2023	Focus on safety, long context
Claude 2 (2023)	Jul 2023	Improved reasoning, 100K context
Claude 3 Haiku/Sonnet/Opus (2024)	Mar 2024	Tiered model family, strong coding
Claude 3.5 Sonnet (2024)	Jun 2024	Best coding performance, computer use
Claude 4 (2025)	2025	Extended thinking, agentic capabilities

Characteristics: Strong safety alignment (Constitutional AI), excellent at following complex instructions, long context windows, strong coding ability. Claude's extended thinking feature enables inference-time reasoning similar to o1. Claude models tend to be more willing to say "I'm not sure" and less sycophantic than some alternatives.

6.3 Gemini Series (Google DeepMind)

Model	Release	Key Features
Gemini 1.0 (2023)	Dec 2023	Multimodal from the ground up
Gemini 1.5 Pro (2024)	Feb 2024	1M-2M token context window
Gemini 2.0 Flash (2025)	2025	Fast, multimodal, tool-use native

Characteristics: Extremely long context windows (up to 2M tokens), native multimodal capabilities (text, image, audio, video), deep integration with Google services. The long context makes Gemini particularly interesting for agents that need to process large amounts of information, like analyzing an entire codebase or a collection of documents.

6.4 Open-Source Models

Llama Series (Meta):

Model	Parameters	Key Features
Llama 2 (2023)	7B, 13B, 70B	First widely-adopted open-weight model
Llama 3 (2024)	8B, 70B	Competitive with GPT-3.5
Llama 3.1 (2024)	8B, 70B, 405B	Competitive with GPT-4 class
Llama 4 (2025)	Various	Mixture of Experts architecture

Mistral (Mistral AI):

Model	Parameters	Key Features
Mistral 7B (2023)	7B	Exceeded Llama 2 13B on benchmarks
Mixtral 8x7B (2023)	~47B total (~13B active per token)	Sparse MoE, efficient inference
Mistral Large 2 (2024)	123B	Competitive with frontier models

Characteristics of open models: Can be self-hosted (data privacy), fine-tuned for specific domains, no API costs at inference time. However, they generally lag behind frontier closed models in raw capability, especially for complex reasoning and tool use. For agent applications, open models are attractive when: data cannot leave your infrastructure, you need very low-latency inference, or you want to fine-tune for a specific domain.

6.5 Choosing a Model for Agent Applications

Consideration	Recommendation
Maximum capability	Claude Opus/Sonnet, GPT-4o, o3
Fast, cheap iterations	Claude Haiku, GPT-4o-mini, Gemini Flash
Long context	Gemini 1.5 Pro (1-2M), Claude (200K)
Self-hosted / privacy	Llama 3.1 70/405B, Mistral Large
Complex reasoning	o3, Claude with extended thinking
Tool use / function calling	GPT-4o, Claude 3.5+ (both excellent)
Code generation	Claude 3.5 Sonnet+, GPT-4o

In practice, many agent systems use model routing: a fast, cheap model for simple decisions, and a powerful, expensive model for complex reasoning steps. We will discuss this strategy in detail in Section 7.5.5.

Try It Yourself: If you have access to multiple LLM APIs, try the same agent task (e.g., "Search for information about X and then calculate Y") with different models. Compare: (1) How well each model follows tool-calling instructions. (2) The quality of reasoning in the generated thoughts. (3) The cost per task. (4) The latency per task. Document your findings.

087. Inference-Time Compute

This section covers one of the most important recent developments in LLM technology. Inference-time compute, the ability for models to "think harder" on difficult problems, fundamentally changes the capabilities available to agent designers. If you only remember one thing from this section, remember the model routing strategy in Section 7.5.5, because it has the most practical impact on agent cost and performance.

7.1 The Idea

Traditional LLMs spend a fixed amount of compute per token at inference time. But some problems are harder than others: a simple greeting needs less thought than a complex math proof.

Think about how you solve problems. When someone asks "What is 2 + 3?", you answer instantly. When someone asks "What is the optimal strategy for a complex negotiation scenario?", you pause, consider multiple angles, evaluate trade-offs, and then respond. You spend more cognitive effort on harder problems.

Inference-time compute (also called "test-time compute" or "thinking tokens") allows the model to do the same: spend more computation on harder problems. This is the key innovation behind OpenAI's o1/o3 models and Anthropic's extended thinking.

7.2 How It Works

The model generates a chain of thought before producing its final answer. This chain of thought is the model "reasoning" through the problem:

text

User: What is the 15th prime number?

Model (internal reasoning):
"Let me list the prime numbers:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47
That's 15 primes. The 15th is 47."

Model (output): The 15th prime number is 47.

The crucial insight is that this reasoning is happening at inference time, not during training. The model is allocating more compute to harder problems by generating more tokens. Each thinking token is a unit of computation that helps the model reach a better answer.
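
A reasoning trace like the one above is still generated text, so an agent with code execution can check it cheaply instead of trusting it. The sketch below is one possible check; `nth_prime` is a helper written for this example and uses plain trial division, which is adequate only for small inputs.

```python
def nth_prime(n: int) -> int:
    """Return the n-th prime (1-indexed) by trial division; fine for small n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    primes: list[int] = []
    candidate = 2
    while len(primes) < n:
        # Only primes up to sqrt(candidate) need to be tested as divisors.
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes[-1]


print(nth_prime(15))  # 47
```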

7.3 Why This Matters for Agents

Inference-time compute fundamentally changes agent capabilities:

Better planning: Agents can think through complex plans before executing them. Instead of jumping into action, the agent can generate and evaluate multiple plans in its thinking space.
Error detection: The model can catch its own mistakes during the thinking process. "Wait, that calculation does not look right, let me redo it."
Complex tool use: Multi-step tool use benefits from explicit reasoning about which tools to call and in what order. "I need the weather first, then I can decide whether to recommend indoor or outdoor activities."
Reduced hallucination: More thinking tokens generally means more accurate outputs, because the model has more opportunity to self-correct.

7.4 The Compute-Performance Trade-off

More thinking tokens means:

Higher latency (the model takes longer to respond)
Higher cost (you pay per token, including thinking tokens)
Better accuracy on complex tasks
Diminishing returns on simple tasks

For agent design, this creates an important optimization question: how much should the agent "think" before acting? Too little thinking leads to errors; too much thinking wastes time and money. The answer depends on the consequences of errors: an agent booking a flight should think carefully (mistakes are expensive to fix), while an agent formatting a table can be fast and loose (mistakes are easily corrected).

7.5 Extended Thinking and Reasoning Models

The concept of inference-time compute has evolved from an academic idea into a family of production models specifically designed to "think before they answer." These are often called reasoning models, and they represent a paradigm shift in how LLMs approach complex problems.

7.5.1 OpenAI's o1 and o3 Models

OpenAI's o1 (released September 2024) was the first widely available reasoning model. It generates an internal chain of thought before producing a response, spending significantly more compute on harder problems.

o3 (announced December 2024, released April 2025) extended this approach with improved reasoning capabilities and the ability to "think" for variable durations depending on problem difficulty.

Key characteristics:

The model generates hidden reasoning tokens that the user does not see in the final output (though token counts reflect them). You pay for these tokens but do not see them.
On math benchmarks (AIME, MATH), o1/o3 dramatically outperformed standard GPT-4 class models.
The reasoning process is not a fixed template; the model learns during training when and how much to reason.
OpenAI introduced a reasoning_effort parameter (low, medium, high) that lets users control the trade-off between thinking depth and latency/cost.

7.5.2 Anthropic's Extended Thinking

Claude's extended thinking (introduced with Claude 3.7 Sonnet in February 2025 and expanded in Claude 4) takes a different approach: the thinking trace is visible to the developer (though typically hidden from end users).

Key characteristics:

The model generates a thinking block before the response, containing its step-by-step reasoning.
Developers can inspect the thinking trace for debugging and transparency. This is a significant advantage for agent development: you can see why the agent made a particular decision.
Extended thinking can be enabled or disabled per request, and a budget_tokens parameter controls the maximum number of thinking tokens.
Particularly effective for multi-step reasoning, code generation, and complex analysis.

python

# Example: Claude extended thinking API call
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Up to 10K tokens for thinking
    },
    messages=[{
        "role": "user",
        "content": "Design a database schema for a multi-tenant SaaS platform with row-level security."
    }]
)

# The response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"[Response]: {block.text[:200]}...")

Let us walk through this code:

The thinking parameter enables extended thinking with a budget of 10,000 tokens. The model can use up to this many tokens for its internal reasoning.
The response contains multiple content blocks. A thinking block contains the model's internal reasoning; a text block contains the final response.
By inspecting the thinking block, you can understand the model's reasoning process, which is invaluable for debugging agent behavior.

7.5.3 DeepSeek R1: Open-Source Reasoning

DeepSeek R1 (January 2025) demonstrated that reasoning capabilities are not exclusive to closed-source models. It is an open-weight reasoning model that generates explicit chain-of-thought traces.

Key characteristics:

Fully open weights, allowing self-hosting and fine-tuning.
Trained using a combination of supervised fine-tuning on reasoning traces and reinforcement learning.
The thinking process is visible in the output (wrapped in <think> tags).
Performance on math and coding benchmarks approached o1-level results at a fraction of the cost.
Distilled versions (1.5B, 7B, 8B, 14B, 32B, 70B) make reasoning accessible on smaller hardware.

DeepSeek R1 is significant because it proved that the reasoning paradigm can be replicated outside the major labs, opening up inference-time compute to the broader open-source community.

7.5.4 Implications for Agent Design

Reasoning models change the calculus of agent architecture in several important ways:

1. Better planning with fewer iterations: A reasoning model can often produce a correct multi-step plan in a single call, whereas a standard model might need several ReAct iterations to converge on the same plan. This can reduce the total number of LLM calls in an agent pipeline.

2. More reliable tool selection: When an agent needs to choose between multiple tools or decide on the order of tool calls, a reasoning model is less likely to make errors because it explicitly reasons through the options before committing.

3. Reduced need for external scaffolding: Some agent architectures (Reflexion, LATS) exist to compensate for the model's reasoning limitations. With stronger reasoning models, simpler architectures may suffice for tasks that previously required complex scaffolding.

4. Higher cost and latency per call: Reasoning models can cost 5-20x more per call than standard models and take 5-30x longer. For an agent making 10-20 calls per task, this cost adds up quickly.
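
To see how quickly this adds up, the arithmetic can be sketched as follows. The per-call price is a hypothetical placeholder, and the call count and multiplier are taken from the upper ends of the ranges quoted above; `pipeline_cost` is a helper written for this example.

```python
def pipeline_cost(standard_calls: int, reasoning_calls: int,
                  cost_per_standard_call: float, reasoning_multiplier: float) -> float:
    """Total cost of one agent task, with reasoning calls priced as a multiple."""
    reasoning_call_cost = cost_per_standard_call * reasoning_multiplier
    return standard_calls * cost_per_standard_call + reasoning_calls * reasoning_call_cost


CALLS = 20        # upper end of the 10-20 calls per task mentioned above
BASE_COST = 0.01  # hypothetical dollars per standard call
MULTIPLIER = 20   # upper end of the 5-20x range

all_standard = pipeline_cost(CALLS, 0, BASE_COST, MULTIPLIER)
all_reasoning = pipeline_cost(0, CALLS, BASE_COST, MULTIPLIER)
mixed = pipeline_cost(CALLS - 2, 2, BASE_COST, MULTIPLIER)  # reasoning for 2 calls only

print(f"all standard:  ${all_standard:.2f}")
print(f"all reasoning: ${all_reasoning:.2f}")
print(f"mixed:         ${mixed:.2f}")
```

Under these assumptions, reserving the reasoning model for two of the twenty calls costs $0.58 per task, compared with $4.00 when every call uses it.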

7.5.5 The Model Routing Strategy

In practice, the most effective agent systems use model routing: different models for different steps in the agent pipeline.

Interactive figure: Model Routing Strategy for Agent Pipelines. A router decides which tier handles each query, balancing cost, latency, and quality: a small model (fast and cheap), a mid model (balanced), or a large model (highest quality). Example: a trivial greeting is routed to the small model (about $0.0001, 120 ms), because a short question with no context does not need more than the cheap model.

This is analogous to how a company operates: the CEO (expensive, strategic) makes high-level decisions, managers (moderate cost) handle coordination, and interns (cheap, fast) handle routine tasks. You do not pay the CEO to sort mail.

Step Type	Model Class	Examples	Cost per 1M tokens (approx.)
Planning, verification	Reasoning	o3, Claude + thinking, DeepSeek R1	$10-60
General reasoning	Standard frontier	GPT-4o, Claude Sonnet	$3-15
Simple decisions, extraction	Fast/cheap	GPT-4o-mini, Claude Haiku, Gemini Flash	$0.10-1.00

7.5.6 When to Use Reasoning Models in Agent Pipelines

Use reasoning models for:

Initial planning of complex tasks (research, multi-file code changes)
Decisions with high consequence (deploying code, sending emails, financial calculations)
Tasks requiring multi-step logical deduction
Error analysis and recovery after failures
Synthesizing information from multiple sources

Use standard/fast models for:

Simple tool calls (file reads, searches, API calls)
Classification and routing decisions
Format conversion and data extraction
Repetitive sub-tasks within a loop
Intermediate steps where errors are cheaply recoverable
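
The two lists above can be turned into a simple lookup-based router. This is a sketch under stated assumptions, not a production design: the `Tier` enum, the step-type labels, and the `route` function are all hypothetical names, and a real router would map each tier to a concrete model identifier.

```python
from enum import Enum


class Tier(Enum):
    REASONING = "reasoning"
    STANDARD = "standard"
    FAST = "fast"


# Hypothetical step-type labels, grouped following the lists above.
ROUTES = {
    "planning": Tier.REASONING,
    "verification": Tier.REASONING,
    "error_analysis": Tier.REASONING,
    "general_reasoning": Tier.STANDARD,
    "tool_call": Tier.FAST,
    "classification": Tier.FAST,
    "extraction": Tier.FAST,
}


def route(step_type: str, high_consequence: bool = False) -> Tier:
    """Pick a model tier for one pipeline step."""
    # High-consequence decisions always get the reasoning tier.
    if high_consequence:
        return Tier.REASONING
    # Unknown step types fall back to the middle tier.
    return ROUTES.get(step_type, Tier.STANDARD)


print(route("tool_call"))                         # Tier.FAST
print(route("tool_call", high_consequence=True))  # Tier.REASONING
```

Defaulting unknown steps to the middle tier is a design choice for this sketch. A cost-sensitive system might default to the fast tier and escalate on failure instead.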

7.5.7 The Frontier Is Moving Fast

The reasoning model landscape is evolving rapidly (as of early 2026):

OpenAI continues to iterate on the o-series, with o3-mini offering reasoning at reduced cost.
Anthropic has integrated extended thinking deeply into Claude, making it available across model tiers.
Google DeepMind has introduced reasoning capabilities in Gemini 2.0 models.
Open-source: DeepSeek R1, Qwen QwQ, and other open models have brought reasoning to self-hosted deployments.

The trend is clear: inference-time compute is becoming a standard feature rather than a premium offering, and the cost of reasoning is decreasing over time. For agent designers, this means that the model routing strategies of today may simplify in the future as reasoning becomes cheaper and faster.

098. Code Example: Making LLM API Calls

8.1 Basic API Call with OpenAI

python

"""
Basic LLM API call demonstrating system/user/assistant messages.
This is the fundamental building block of every LLM-based agent.
"""

from openai import OpenAI

client = OpenAI()  # Requires OPENAI_API_KEY environment variable

# The three message roles:
# - system: Sets the behavior and persona of the assistant.
#   Think of it as the agent's "job description." It is processed
#   first and strongly influences all subsequent behavior.
#
# - user: The human's input. In an agent loop, tool results are
#   also fed back to the model (this API uses a separate "tool"
#   role for them).
#
# - assistant: Previous responses from the model (for multi-turn).
#   Including these maintains conversational continuity.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert software architect. "
                "Provide concise, technically accurate responses. "
                "When discussing trade-offs, present both sides."
            )
        },
        {
            "role": "user",
            "content": "Should I use microservices or a monolith for a new startup?"
        }
    ],
    temperature=0.7,     # Controls randomness (lower = more predictable, higher = more varied)
    max_tokens=1024,     # Maximum length of response
)

print(response.choices[0].message.content)

Understanding temperature: This parameter controls the randomness of the output.

temperature=0.0: The model effectively always picks the most likely next token. Outputs are highly consistent, although providers do not guarantee bit-for-bit identical results across calls. Use this for agent tasks where you need reliability (tool calls, structured output, factual answers).
temperature=0.7: A good balance between creativity and coherence. Use for general conversation and explanations.
temperature=1.0: Samples from the model's unmodified probability distribution. Use for creative writing or brainstorming. (The OpenAI API accepts values up to 2.0, but output tends to become incoherent well before that.)

For agent applications, you almost always want temperature=0.0 for action-generating steps (tool calls, decision-making) and can use higher temperatures for user-facing text generation.
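
Mechanically, temperature divides the model's logits before the softmax that turns them into token probabilities. The sketch below illustrates the effect on a made-up three-token distribution; the `logits` values are invented for the example, and treating temperature 0 as a pure argmax is the usual limiting-case convention rather than a statement about any provider's implementation.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # Limiting case: all probability mass on the highest logit (greedy decoding).
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, which is why they suit tool calls and structured output.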

8.2 API Call with Anthropic (Claude)

python

"""
API call using the Anthropic Python SDK.
"""

import anthropic

client = anthropic.Anthropic()  # Requires ANTHROPIC_API_KEY

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=(
        "You are an expert software architect. "
        "Provide concise, technically accurate responses."
    ),
    messages=[
        {
            "role": "user",
            "content": "Should I use microservices or a monolith for a new startup?"
        }
    ]
)

print(message.content[0].text)

Notice the differences from the OpenAI API:

The system prompt is a separate parameter, not a message in the list.
The response is in message.content[0].text rather than response.choices[0].message.content.
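These differences are one reason provider-agnostic abstraction layers are valuable. Standardization protocols like MCP (Week 4) address a related problem: they standardize how tools and data sources are exposed to models, independent of the provider.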
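
As a small illustration of bridging the two message formats, the sketch below converts an OpenAI-style message list into the keyword arguments the Anthropic call above expects. `to_anthropic_params` is a hypothetical helper written for this example, and it handles only plain-text message content.

```python
def to_anthropic_params(messages: list[dict]) -> dict:
    """Split OpenAI-style messages into Anthropic-style `system` and `messages`."""
    # Anthropic takes the system prompt as a separate parameter.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    params: dict = {"messages": chat}
    if system_parts:
        params["system"] = "\n\n".join(system_parts)
    return params


openai_style = [
    {"role": "system", "content": "You are an expert software architect."},
    {"role": "user", "content": "Should I use microservices or a monolith?"},
]
params = to_anthropic_params(openai_style)
print(params["system"])
print(params["messages"])
```

The result could be passed as `client.messages.create(model=..., max_tokens=..., **params)`.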

8.3 Multi-Turn Conversation

python

"""
Multi-turn conversation demonstrating how context accumulates.
This is the foundation of agent memory within a single session.
"""

from openai import OpenAI

client = OpenAI()

def chat(messages: list[dict], user_input: str) -> str:
    """Send a message and get a response, maintaining conversation history."""
    # Add the user's message to the conversation history
    messages.append({"role": "user", "content": user_input})

    # Send the ENTIRE conversation history to the model
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7,
    )

    # Extract the assistant's response
    assistant_message = response.choices[0].message.content

    # Add the assistant's response to the history for future turns
    messages.append({"role": "assistant", "content": assistant_message})

    return assistant_message


# Initialize conversation with a system message
messages = [
    {
        "role": "system",
        "content": "You are a helpful Python tutor. Teach concepts step by step."
    }
]

# Simulate a multi-turn conversation
response1 = chat(messages, "What are Python decorators?")
print(f"Turn 1: {response1}\n")

response2 = chat(messages, "Can you show me a simple example?")
print(f"Turn 2: {response2}\n")

response3 = chat(messages, "How would I use that in a web framework?")
print(f"Turn 3: {response3}\n")

# At this point, `messages` contains the full conversation history.
# The model has access to all previous turns when generating each response.
print(f"Total messages in history: {len(messages)}")

This code demonstrates a critical point about LLM-based agents: the model has no memory between API calls. Every call sends the entire conversation history. This means:

The agent's "memory" is whatever you include in the messages list.
Context grows with each turn, increasing cost and eventually hitting the window limit.
You (the developer) control what the agent remembers by controlling what goes into messages.

Key Insight: This stateless-per-call design is both a limitation and a feature. It means agents have no hidden state (everything is in the messages), which makes debugging easier. But it also means you must actively manage memory as conversations grow long.
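
One simple way to manage a growing history is to keep the system message and drop the oldest turns once a token budget is exceeded. The sketch below is an assumption-laden illustration: `trim_history` and `approx_tokens` are hypothetical helpers, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(approx_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards so the newest turns are kept first.
    for m in reversed(rest):
        cost = approx_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are a tutor."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, budget=25)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Dropping old turns loses information. Summarizing them or moving them to external memory are alternatives with different trade-offs.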

8.4 Structured Output

For agent applications, we often need the model to produce structured output that can be parsed programmatically. A tool call is structured output; a decision between options is structured output; extracted data is structured output.

python

"""
Getting structured JSON output from an LLM.
Critical for tool selection and agent decision-making.
"""

import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You analyze sentiment in text. "
                "Always respond with a JSON object containing: "
                "sentiment (positive/negative/neutral), "
                "confidence (0-1), "
                "key_phrases (list of important phrases)."
            )
        },
        {
            "role": "user",
            "content": "The new restaurant downtown is amazing! The food is incredible but the wait times are a bit long."
        }
    ],
    response_format={"type": "json_object"},  # Force JSON output
    temperature=0.0,  # Low temperature for consistent structured output
)

result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))

# Example output (exact values will vary between runs and models):
# {
#   "sentiment": "positive",
#   "confidence": 0.78,
#   "key_phrases": ["amazing", "incredible", "wait times a bit long"]
# }

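The response_format={"type": "json_object"} parameter (JSON mode) constrains the model to output syntactically valid JSON, provided the prompt itself asks for JSON as the system message does here. This is essential for agent reliability: without it, the model might sometimes output JSON and sometimes output narrative text, breaking your parsing code. JSON mode does not guarantee that the object contains the fields you asked for, so validate the parsed result before acting on it.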
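
Because the parsed object may still be missing fields or contain unexpected values, it is worth checking before the agent acts on it. The following is a minimal sketch of such a check for the sentiment format used above; `parse_sentiment` is a hypothetical helper, and a schema library could replace the hand-written checks.

```python
import json


def parse_sentiment(raw: str) -> dict:
    """Parse and validate the sentiment JSON described in the system prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("sentiment") not in {"positive", "negative", "neutral"}:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    phrases = data.get("key_phrases")
    if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
        raise ValueError("key_phrases must be a list of strings")
    return data


raw = '{"sentiment": "positive", "confidence": 0.78, "key_phrases": ["amazing"]}'
print(parse_sentiment(raw)["sentiment"])  # positive
```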

8.5 Streaming Responses

For agent UIs, streaming provides a better user experience: the user sees the response being generated in real time rather than waiting for the entire response.

python

"""
Streaming API call — essential for responsive agent interfaces.
"""

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how TCP/IP works."}
    ],
    stream=True,  # Enable streaming
)

# Process tokens as they arrive
for chunk in stream:
    # Some chunks carry no choices or no text (for example, the final chunk)
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()  # Final newline

Streaming is particularly important for agent interfaces because agent operations can take a long time (multiple tool calls, extended thinking). Without streaming, the user stares at a blank screen for 10-30 seconds wondering if anything is happening. With streaming, they see the agent's thoughts forming in real time, which builds trust and allows early intervention if the agent is going in the wrong direction.
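An agent usually needs the complete reply as well, so that it can be appended to the message history for the next call. The sketch below shows one way to display fragments and collect them at the same time. To stay runnable without network access it uses `fake_stream`, a hypothetical stand-in that yields objects shaped like the chunks in the example above; `collect_stream` is likewise an illustrative name.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Stand-in for an API stream: yields objects shaped like streamed chunks."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream, on_text=None) -> str:
    """Forward each text fragment to on_text and return the full reply."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text is None:
            continue  # skip chunks that carry no text
        if on_text is not None:
            on_text(text)
        parts.append(text)
    return "".join(parts)


reply = collect_stream(
    fake_stream(["TCP ", None, "splits data ", "into segments."]),
    on_text=lambda t: print(t, end="", flush=True),
)
print()  # Final newline

# The assembled reply can now be stored in the conversation history.
messages = [{"role": "assistant", "content": reply}]
```

Passing the display step in as a callback keeps the collection logic independent of the UI, so the same function could feed a terminal, a web socket, or a log.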

8.6 Token Counting and Cost Awareness

python

"""
Understanding token usage — critical for agent cost management.
"""

import tiktoken  # OpenAI's tokenizer library

# Load the tokenizer for GPT-4o
encoding = tiktoken.encoding_for_model("gpt-4o")

text = "The quick brown fox jumps over the lazy dog."
tokens = encoding.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded tokens: {[encoding.decode([t]) for t in tokens]}")

# Cost estimation (approximate, as of 2025)
# GPT-4o: ~$2.50 per 1M input tokens, ~$10 per 1M output tokens
# Claude 3.5 Sonnet: ~$3 per 1M input tokens, ~$15 per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int,
                  model: str = "gpt-4o") -> float:
    """Estimate API cost for a given token count."""
    rates = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    }
    rate = rates.get(model, rates["gpt-4o"])
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# Agent cost estimation
# A typical agent might make 5-20 LLM calls per task
# Each call: ~2000 input tokens, ~500 output tokens
calls_per_task = 10
input_per_call = 2000
output_per_call = 500

cost = calls_per_task * estimate_cost(input_per_call, output_per_call)
print(f"\nEstimated cost per agent task (~{calls_per_task} calls): ${cost:.4f}")
print(f"Estimated cost for 1000 tasks: ${cost * 1000:.2f}")

Why cost awareness matters: An agent that makes 20 API calls per task at $0.05 per call costs $1.00 per task. Run that agent 10,000 times per day, and you are spending $10,000/day on API costs alone. Cost optimization is not premature optimization for agents; it is a business requirement. The model routing strategy from Section 7.5.5 can reduce costs by 5-10x by using cheap models for simple steps.
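To make the routing claim concrete, here is a small worked estimate using the approximate rates from the listing above. The 9-to-1 split between the cheap and the expensive model is an assumption chosen purely for illustration, as are the names `call_cost` and `task_cost`; real savings depend on how many steps a cheap model can actually handle.

```python
# Rates in USD per 1M tokens, copied from the estimate above (approximate).
RATES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call to the given model."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


def task_cost(plan: dict[str, int], input_tokens: int = 2000,
              output_tokens: int = 500) -> float:
    """Cost of one task, given how many calls are routed to each model."""
    return sum(calls * call_cost(model, input_tokens, output_tokens)
               for model, calls in plan.items())


baseline = task_cost({"gpt-4o": 10})                 # every call on the large model
routed = task_cost({"gpt-4o": 1, "gpt-4o-mini": 9})  # assumed routing split

print(f"All GPT-4o: ${baseline:.4f} per task")
print(f"Routed:     ${routed:.4f} per task")
print(f"Savings:    {baseline / routed:.1f}x")
```

The saving is dominated by how few calls stay on the expensive model, so the quality of the router's decisions matters more than the exact price list.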

Try It Yourself: Estimate the cost of running an agent that: (1) reads a 500-line source file (~5000 tokens), (2) makes 5 reasoning calls with the file in context, and (3) generates a 200-line edit (~2000 tokens). Compare the cost using GPT-4o, GPT-4o-mini, and a reasoning model like o3. How would model routing reduce the total cost?

9. The LLM as an Agent Brain: Putting It Together

We have now covered the architecture, training, capabilities, limitations, and model families. Let us step back and synthesize what all of this means for someone building an AI agent.

9.1 What Makes an LLM Suitable for Agency?

Not all language models make good agent brains. A model that excels at creative writing might be terrible at following structured instructions. A model that is great at factual Q&A might struggle with multi-step planning. The key capabilities for agency are:

Instruction following: The model must reliably follow complex, multi-part instructions. An agent prompt might contain 20 different rules; the model needs to follow all of them simultaneously.
Tool use: The model must understand tool descriptions and generate correct tool calls. This requires understanding function signatures, parameter types, and when each tool is appropriate.
Structured output: The model must produce parseable output (JSON, function calls). A single missing bracket in a JSON response can crash the agent.
Self-correction: The model should be able to recognize and fix its own errors. When a tool call fails, the agent should reason about why and try a different approach.
Long context handling: Agent conversations often grow long; the model must track information across the full context. Forgetting an instruction from the system prompt after 50 turns is a common failure mode.
Low hallucination rate: Agent decisions have consequences; fabricating information leads to failures. An agent that hallucinates a file path wastes a tool call; one that hallucinates a critical fact leads to wrong conclusions.
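The self-correction item above can be sketched as a loop that feeds tool errors back to the model so its next attempt can differ. Everything here is a stand-in: `stub_propose` plays the role of the LLM, `read_file` is a toy tool over an in-memory dictionary, and `run_with_recovery` is an illustrative name rather than an established pattern from any library.

```python
from typing import Callable


def run_with_recovery(propose: Callable[[list[str]], str],
                      tool: Callable[[str], str],
                      max_attempts: int = 3) -> str:
    """Call the tool with the proposed argument, feeding errors back on failure."""
    errors: list[str] = []
    for _ in range(max_attempts):
        argument = propose(errors)  # the model sees every earlier failure
        try:
            return tool(argument)
        except Exception as exc:
            errors.append(f"{argument!r} failed: {exc}")
    raise RuntimeError(f"gave up after {max_attempts} attempts: {errors}")


# Toy file system so the example runs without touching the disk.
FILES = {"src/main.py": "print('hello')"}


def read_file(path: str) -> str:
    """Toy tool: raises if the path does not exist."""
    if path not in FILES:
        raise FileNotFoundError(path)
    return FILES[path]


def stub_propose(errors: list[str]) -> str:
    """Hypothetical stand-in for the LLM: guesses wrong first, then adapts."""
    return "main.py" if not errors else "src/main.py"


print(run_with_recovery(stub_propose, read_file))
```

The attempt limit matters as much as the retry itself: without it, a model that keeps proposing the same failing call would loop indefinitely and keep spending tokens.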

9.2 The Agent Capability Matrix

Here is a practical way to evaluate whether a model is suitable for your agent application:

Capability	How to Test	Minimum Bar for Agents
Instruction following	Give 5 complex multi-part instructions; check compliance	>90% compliance rate
Tool calling	Provide 10 scenarios with tools; check correct tool selection	>95% correct selection
JSON output	Request JSON 20 times; check validity	>98% valid JSON
Error recovery	Simulate 5 tool failures; check adaptation	Agent should try alternatives
Context tracking	Place key info early; ask about it after 20 turns	Should recall accurately
Hallucination rate	Ask 20 factual questions; verify answers	<10% hallucination rate

If a model fails these basic tests, it will produce an unreliable agent regardless of how good your architecture is. Test before you build.
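One row of the matrix, JSON validity, can be automated in a few lines. The sketch below measures the fraction of replies that parse as JSON. `stub_model` is a hypothetical stand-in for a real LLM call and is written to fail on one prompt on purpose; in practice you would pass a function that wraps your API client.

```python
import json
from typing import Callable


def json_validity_rate(model: Callable[[str], str], prompts: list[str]) -> float:
    """Fraction of prompts for which the model's reply parses as JSON."""
    valid = 0
    for prompt in prompts:
        try:
            json.loads(model(prompt))
            valid += 1
        except json.JSONDecodeError:
            pass  # count as invalid and move on
    return valid / len(prompts)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; fails on one prompt on purpose."""
    if "bad" in prompt:
        return "Sure! Here is the JSON you asked for: {oops"
    return json.dumps({"echo": prompt})


prompts = [f"case {i}" for i in range(19)] + ["bad case"]
rate = json_validity_rate(stub_model, prompts)

print(f"Valid JSON: {rate:.0%} (bar: >98%)")
print("PASS" if rate > 0.98 else "FAIL")
```

Note that with only 20 samples a bar of >98% allows no failures at all, since 19 out of 20 is 95%. A larger sample gives a more meaningful estimate of the true rate.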

9.3 The Reasoning-Action Connection

The key insight that connects this week's material to the rest of the course is this:

Key Insight: An agent's ability to act effectively in the world is bounded by its ability to reason. The LLM's reasoning capability is the fundamental bottleneck for agent performance.

If the model cannot reason about which tool to use, the agent will call the wrong tools. If the model cannot plan a multi-step strategy, the agent will take locally reasonable but globally suboptimal actions. If the model hallucinates, the agent will act on false premises.

This is why the next three weeks are structured the way they are:

Week 3 (Prompting Strategies): How to enhance the model's reasoning through prompt design. Better prompts lead to better reasoning, which leads to better agent behavior.
Week 4 (Tool Use and MCP): How to extend the agent's capabilities through tools. Tools compensate for the model's limitations (calculation, current information, actions).
Week 5 (Agent Architectures): How to structure the reasoning-action loop for different types of tasks. Architecture determines how the model's reasoning is channeled into effective behavior.

10. Discussion Questions

Reasoning or pattern matching? When an LLM solves a math problem, is it truly "reasoning" or performing sophisticated pattern matching? Does the distinction matter for agent design? Why or why not?

Starting point: Consider that humans also learn math through pattern recognition (memorizing multiplication tables, recognizing problem types). Is human reasoning fundamentally different from sophisticated pattern matching?

Scaling limits: The scaling laws suggest that more compute leads to better performance. But are there problems that no amount of scaling will solve? What fundamental limitations might persist regardless of model size?

Starting point: Consider tasks that require access to private information the model was not trained on, tasks that require physical interaction with the world, or tasks where the correct answer depends on values and preferences rather than facts.

Open vs. closed models for agents: What are the trade-offs of building agents on open-weight models (Llama, Mistral) vs. closed API models (GPT-4, Claude)? Consider factors like capability, cost, privacy, customizability, and reliability.

Starting point: A hospital building an agent for medical records needs data privacy (favoring open models). A startup building a coding agent needs maximum capability (favoring closed models). What about an agent for a government agency?

Inference-time compute allocation: If you were designing an agent that had a fixed budget of $10 per task, how would you allocate inference-time compute? Would you prefer many cheap calls or a few expensive, high-reasoning calls?

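Starting point: Consider the model routing strategy. Using the approximate rates from Section 8.6 and a call of ~2,000 input and ~500 output tokens, $10 buys roughly 1,000 GPT-4o calls or more than 16,000 GPT-4o-mini calls, and typically fewer calls to a reasoning model such as o3, which generates extra reasoning tokens on each call. What combination would be most effective?

The training data problem: LLMs are trained on internet text, which contains errors, biases, and outdated information. How does this affect agent reliability, and what architectural solutions can mitigate it?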

Starting point: Consider RAG (retrieving current information), tool use (verifying facts), and self-consistency (generating multiple answers and checking agreement).

11. Summary and Key Takeaways

The Transformer architecture is the foundation of all modern LLMs. Self-attention enables parallel processing and long-range dependencies, but comes with quadratic computational cost. Understanding the architecture helps explain agent behavior.
LLMs acquire capabilities through a multi-stage training pipeline: pre-training (language and knowledge), instruction tuning (following directions), and alignment (safety and helpfulness via RLHF/DPO). Each stage shapes the model's behavior as an agent.
Emergent capabilities appear as models scale, including arithmetic, code generation, and complex instruction following. However, the relationship between scale and capability is nuanced, and reliability thresholds matter more than benchmark scores for agent applications.
LLMs have fundamental limitations that directly affect agent design: hallucinations, context window constraints, knowledge cutoffs, prompt sensitivity, and sycophancy. Good agent architecture compensates for these limitations.
Major model families (GPT, Claude, Gemini, Llama, Mistral) offer different trade-offs in capability, cost, context length, and deployment options. Agent designers must choose models based on their specific requirements, and often use multiple models (model routing) for optimal cost-performance.
Inference-time compute is a paradigm shift: models that "think longer" on hard problems achieve dramatically better results, which is particularly valuable for complex agent tasks like planning and error analysis.
The LLM's reasoning capability is the fundamental bottleneck for agent performance. Weeks 3-5 will explore how to augment and scaffold this reasoning through prompting, tools, and architectures.

12. Practical Exercise

This exercise is designed to give you hands-on experience with the concepts from this lecture. The goal is not just to run code, but to develop intuition about model capabilities and limitations that will inform your agent design decisions throughout the rest of the course.

Explore Model Capabilities: Using the API call examples from Section 8, conduct the following experiments:

Reasoning comparison: Give the same multi-step reasoning problem to two different models (e.g., GPT-4o-mini and GPT-4o, or a small and large Llama model). Compare the quality of their reasoning chains and final answers. Document specific differences: where does the smaller model fail?
Temperature sweep: For a given prompt, generate responses at temperatures 0.0, 0.3, 0.7, and 1.0. Document how the responses change in terms of accuracy, creativity, and consistency. Run each temperature 3 times to observe variability.
Context sensitivity: Create a 10-turn conversation and observe how well the model tracks information from early turns. Try placing a key fact in turn 2 and asking about it in turn 10. Does the model remember? What about turn 50 (if you simulate a long conversation)?
Structured output reliability: Ask the model to produce JSON output for 20 different inputs. Measure the rate of valid JSON responses and the rate of correct field values. Compare with and without response_format={"type": "json_object"}.
Cost analysis: For each experiment, track the token usage and compute the cost. How much would each experiment cost at scale (1000 runs)?

Deliverable: A Jupyter notebook with your code, outputs, and a 1-page analysis of your findings. Focus on practical implications: what do your findings mean for someone building an agent?

References

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., ... & Sifre, L. (2022). Training compute-optimal large language models. In Advances in Neural Information Processing Systems (NeurIPS).
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS).
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., ... & Kaplan, J. (2023). Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023.
Press, O., Smith, N. A., & Lewis, M. (2022). Train short, test long: Attention with linear biases enables input length generalization. In Proceedings of the International Conference on Learning Representations (ICLR).
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS).
Schaeffer, R., Miranda, B., & Koyejo, S. (2024). Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS).
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., & Liu, Y. (2024). RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568, 127063.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., ... & Zhou, D. (2022a). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS).
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., ... & Le, Q. V. (2022b). Finetuned language models are zero-shot learners. In Proceedings of the International Conference on Learning Representations (ICLR).

Part of "Agentic AI: Foundations, Architectures, and Applications" (CC BY-SA 4.0).