Fran Rodrigo
Architectures · W06 · 31 min read

Memory Systems

Closing the statefulness gap. Working memory, long-term vector stores, episodic logs, semantic memory. Memory management strategies (summarisation, decay, prioritisation). Case study: Generative Agents (Park et al., 2023).

Core concepts: Working memory, Vector retrieval, Episodic memory

01 Learning Objectives

By the end of this lecture, students will be able to:

  1. Explain why memory is essential for building capable AI agents and identify the limitations of stateless LLM interactions.
  2. Distinguish between short-term, working, long-term, episodic, and semantic memory in the context of AI agents.
  3. Compare implementation approaches for agent memory, including in-context memory, vector databases, key-value stores, and graph databases.
  4. Implement a basic embedding-based memory system in Python that supports storage and retrieval of past experiences.
  5. Analyze memory management strategies such as summarization, forgetting, and prioritization, and explain when each is appropriate.
  6. Critically evaluate the memory architecture of the "Generative Agents" system (Park et al., 2023).

02 1. Why Memory Matters: Statefulness in a Stateless World

The Fundamental Problem

Large Language Models are, at their core, stateless functions. Given an input sequence of tokens, they produce an output sequence. They have no inherent mechanism to remember previous interactions, accumulate knowledge over time, or learn from past mistakes. Every API call starts from scratch.

Think about what this means in practice. Imagine you hire a brilliant consultant, but every morning they wake up with complete amnesia. They have all their skills and general knowledge, but they remember nothing about your company, your previous meetings, or the project you have been working on together. Every day, you would have to re-explain everything from scratch. That is the situation we face with stateless LLMs.

This is a serious limitation for building agents. Consider a simple scenario: you ask an AI assistant to help you plan a vacation. In the first message, you mention you are allergic to shellfish. Ten messages later, it suggests a seafood restaurant. Without memory, the agent has no way to retain and apply information from earlier in the conversation, let alone from previous conversations.

Key Insight: The statefulness gap is one of the most important differences between human cognition and LLM-based systems. Human cognition is deeply stateful -- we carry forward context from every interaction, building mental models of the people we talk to, the tasks we are working on, and the world around us. Bridging this gap is one of the central challenges in building effective AI agents.

A Historical Perspective

The importance of memory in AI is not a new insight. Early AI systems in the 1950s and 1960s used explicit symbol tables and databases to maintain state. Expert systems in the 1980s had "working memory" as a core component of their architecture (the OPS5 production system, for example, had an explicit working memory that stored facts about the current problem). What changed with LLMs is that the reasoning engine itself has no memory -- it must be added externally.

The challenge is analogous to building stateful web applications on top of HTTP. HTTP is a stateless protocol: each request is independent. Web developers solved this with cookies, sessions, and databases. Similarly, we need to build memory infrastructure around stateless LLMs to create stateful agents.

Why Agents Need Memory More Than Chatbots

A simple chatbot can often get by with just the conversation history in its context window. An agent, however, needs memory for several additional reasons:

  • Multi-step task execution: Agents work on tasks that span many steps. They need to remember what they have already done, what they planned to do, and what intermediate results they obtained. A coding agent that forgets it already installed a dependency will waste time (and tokens) reinstalling it.
  • Tool interaction tracking: When an agent calls APIs, queries databases, or runs code, it needs to remember the results and use them in subsequent reasoning. If an agent searches the web and finds relevant information, it needs to carry that forward.
  • Learning from experience: A truly capable agent should improve over time, remembering which approaches worked and which failed for similar tasks. The first time an agent encounters a specific API error, it may spend many steps debugging. The second time, it should recall the solution immediately.
  • Personalization: Agents that work with users repeatedly need to remember user preferences, past requests, and established context. A personal assistant that cannot remember your dietary restrictions, meeting preferences, or communication style is significantly less useful.
  • Cross-session continuity: Unlike a single conversation, agent tasks may span multiple sessions. The agent needs to pick up where it left off. Consider a research agent that takes days to complete a literature review -- it cannot start over each session.

The Context Window Constraint

Modern LLMs have context windows ranging from 4K to over 1M tokens. One might think that a sufficiently large context window eliminates the need for explicit memory systems. This is a common misconception, and it is incorrect for several reasons:

  1. Cost scales linearly (or worse) with context length. Stuffing everything into the context window is expensive. If you are paying $10 per million input tokens and your agent accumulates 500K tokens of history, that is $5 per interaction just for context.
  2. Retrieval accuracy degrades with context length. The "lost in the middle" phenomenon (Liu et al., 2024) shows that LLMs struggle to use information placed in the middle of long contexts. Information at the beginning and end is used effectively, but information in the middle is often ignored.
  3. Context windows are still finite. Even a 1M token window will eventually be exhausted by a long-running agent. A software engineering agent working on a large codebase can easily exceed any context window.
  4. Not all information is equally relevant. A well-designed memory system retrieves only the information most pertinent to the current step, improving both efficiency and accuracy. Dumping everything into the context is like handing someone an entire library when they need one specific fact.
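
The cost argument in point 1 can be made concrete with a few lines of arithmetic. The sketch below assumes a flat price per million input tokens and that the full history is resent on every call. The function names and the 5,000-tokens-per-turn figure are illustrative assumptions, not any provider's pricing API.

```python
def context_cost_per_call(history_tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of sending `history_tokens` input tokens in one call."""
    return history_tokens / 1_000_000 * price_per_million_usd


def cumulative_cost(tokens_per_turn: int, turns: int, price_per_million_usd: float) -> float:
    """Total input cost when the whole history is resent on every turn.

    Turn k resends k * tokens_per_turn tokens, so the total grows
    quadratically with the number of turns.
    """
    total_tokens = sum(k * tokens_per_turn for k in range(1, turns + 1))
    return total_tokens / 1_000_000 * price_per_million_usd


# 500K tokens of history at 10 USD per million input tokens
print(context_cost_per_call(500_000, 10.0))

# Hypothetical agent: 100 turns, each adding 5,000 tokens to the history
print(cumulative_cost(5_000, 100, 10.0))
```

The second figure is the one that tends to surprise people: the per-call price is linear in context length, but a long-running agent that resends its history pays a bill that grows with the square of the number of turns.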

Key Insight: A large context window is a buffer, not a memory system. Memory requires selective storage, efficient retrieval, and intelligent management -- capabilities that a raw context window does not provide.

Try It Yourself: Experiencing the Memory Problem

Before we dive into solutions, try this thought experiment. Imagine you are an agent with no memory beyond your current context window. A user gives you the following sequence of instructions across separate conversations:

  • Session 1: "I am working on a project called Phoenix. We use Python 3.11 and PostgreSQL."
  • Session 2: "Continue working on the Phoenix project."

In Session 2, you have no idea what "Phoenix" refers to, what technology stack it uses, or what progress was made. This is the problem we are solving.


03 2. Short-Term Memory: Conversation Context and Scratchpads

Conversation Context as Memory

The simplest form of agent memory is the conversation history itself. Every message exchanged between the user, the agent, and its tools forms a chronological record that the LLM can reference.

text
User: Find me papers about media bias detection published after 2022.
Agent: [calls search tool] I found 15 papers. Here are the top 5...
User: Which of those use transformer-based models?
Agent: [references previous results] Of the 15 papers I found, 8 use transformer-based models...

In this exchange, the agent's ability to answer the follow-up question depends entirely on the previous messages being in its context window. The pronoun "those" in the second message has no meaning without the context of the first exchange.

This is the same principle behind how human short-term memory works in conversation. When someone tells you their name at a party, you hold it in your mind for the next few minutes. If the conversation shifts topics for ten minutes and then they ask "So, do you remember my name?", you might have lost it. Conversation context in LLMs works similarly: recent information is readily accessible, but older information is used less reliably, or truncated outright, as the context fills up with newer information.

Limitations of raw conversation context:

  • Grows linearly with interaction length, eventually exceeding the context window.
  • Contains a lot of noise (failed attempts, verbose tool outputs, pleasantries, system messages).
  • No prioritization: recent information is not inherently weighted higher than older information (though positional effects exist in how LLMs attend to context).
  • No structure: everything is a flat sequence of messages, making it hard to find specific information.

Scratchpads

A scratchpad is an explicit working space where an agent can write down intermediate results, plans, and notes to itself. Unlike conversation history, which is a passive record of everything that happened, a scratchpad is actively managed by the agent -- it chooses what to write and can update or delete entries.

Think of a scratchpad like the notepad a detective uses during an investigation. They do not write down every word of every conversation. They write down the key facts, the leads to follow, and the current hypotheses. The notepad is curated and organized, unlike a raw transcript.

python
class Scratchpad:
    """A simple scratchpad for agent intermediate reasoning.

    The scratchpad provides a structured way for agents to maintain
    key information across reasoning steps. Unlike raw conversation
    history, the scratchpad is actively curated -- the agent decides
    what to record and can update entries as understanding evolves.
    """

    def __init__(self):
        self.entries = []

    def write(self, key: str, value: str) -> None:
        """Write or update a scratchpad entry.

        If an entry with the given key already exists, it is updated
        in place. This allows the agent to refine its notes as it
        learns more -- for example, updating a 'current_hypothesis'
        entry as new evidence comes in.
        """
        # Update existing entry if key exists
        for entry in self.entries:
            if entry["key"] == key:
                entry["value"] = value
                return
        self.entries.append({"key": key, "value": value})

    def read(self, key: str) -> str | None:
        """Read a scratchpad entry by key."""
        for entry in self.entries:
            if entry["key"] == key:
                return entry["value"]
        return None

    def delete(self, key: str) -> bool:
        """Remove a scratchpad entry when it is no longer needed."""
        for i, entry in enumerate(self.entries):
            if entry["key"] == key:
                self.entries.pop(i)
                return True
        return False

    def to_prompt(self) -> str:
        """Format scratchpad contents for inclusion in the LLM prompt.

        This method is called before each LLM invocation to inject
        the scratchpad contents into the prompt, giving the agent
        access to its curated notes.
        """
        if not self.entries:
            return "Scratchpad: (empty)"
        lines = ["Scratchpad:"]
        for entry in self.entries:
            lines.append(f"  - {entry['key']}: {entry['value']}")
        return "\n".join(lines)

Let us walk through how an agent might use this scratchpad during a research task:

  1. User asks: "Find the most cited paper on media bias detection from 2023."
  2. Agent searches, finds 10 papers. Writes to scratchpad: search_results: "Found 10 papers. Top candidates: Smith2023 (45 citations), Jones2023 (38 citations), Lee2023 (52 citations)"
  3. Agent needs to verify citation counts. Writes: current_task: "Verify citation count for Lee2023 on Google Scholar"
  4. After verification, updates: answer: "Lee2023 with 52 citations is the most cited"
  5. The scratchpad now contains a clean summary of the agent's findings, not the entire search history.
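
The walkthrough above can be sketched in code. To keep the example self-contained it uses a hypothetical dict-backed variant, `DictScratchpad`, which mirrors the `write`, `delete`, and `to_prompt` methods of the class above but is not part of the original design. The entry values are the illustrative ones from the steps.

```python
class DictScratchpad:
    """Hypothetical dict-backed variant of the scratchpad (lookups by key are O(1))."""

    def __init__(self) -> None:
        self.entries: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Insert a new entry or overwrite the existing one
        self.entries[key] = value

    def delete(self, key: str) -> bool:
        # Return True only if the key was present
        return self.entries.pop(key, None) is not None

    def to_prompt(self) -> str:
        if not self.entries:
            return "Scratchpad: (empty)"
        lines = ["Scratchpad:"]
        lines += [f"  - {key}: {value}" for key, value in self.entries.items()]
        return "\n".join(lines)


pad = DictScratchpad()
pad.write("search_results", "Top candidates: Smith2023 (45), Jones2023 (38), Lee2023 (52)")
pad.write("current_task", "Verify citation count for Lee2023 on Google Scholar")
pad.write("answer", "Lee2023 with 52 citations is the most cited")
pad.delete("current_task")  # verification is done, so the stale note is dropped
print(pad.to_prompt())
```

A dict preserves insertion order in Python 3.7+, so the prompt still lists entries in the order they were first written.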

The scratchpad pattern is closely related to chain-of-thought prompting (Wei et al., 2022), in which the model writes out intermediate reasoning before answering, and it appears in agent frameworks where the agent explicitly maintains a "notepad" of key facts and intermediate results.

Context Window Management

When conversation history grows too long, agents need strategies to manage it. This is not a theoretical problem -- it happens in practice within the first few minutes of a complex agent task.

  • Sliding window: Keep only the most recent N messages. This is the simplest approach. It is like a security camera that only keeps the last 24 hours of footage -- easy to implement but you lose potentially important early context. If the user stated an important preference in message #3 and you are now on message #50, that preference is gone.

  • Summarization: Periodically summarize older messages into a compact representation. This preserves key information but may lose details. It is like reading a book summary instead of the full book -- you get the main points but miss the nuances.

  • Selective retention: Keep messages that contain key decisions, user preferences, or important results, while discarding routine exchanges. This is more sophisticated but requires the ability to judge which messages are important.
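
The first and third strategies can be sketched in a few lines. This is a minimal illustration, assuming messages are dicts with `role` and `content` keys. The `pinned` flag is an assumption of this sketch: something upstream (a rule, a classifier, or the agent itself) has to decide which messages matter.

```python
def sliding_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep only the most recent `max_messages` messages."""
    if max_messages <= 0:
        return []
    return messages[-max_messages:]


def selective_retention(messages: list[dict], max_recent: int) -> list[dict]:
    """Keep every message flagged as pinned, plus the most recent ones.

    The original ordering of the messages is preserved.
    """
    recent_start = max(len(messages) - max_recent, 0)
    return [
        msg
        for i, msg in enumerate(messages)
        if msg.get("pinned", False) or i >= recent_start
    ]


history = [{"role": "user", "content": f"message {i}"} for i in range(1, 51)]
history[2]["content"] = "I am allergic to shellfish."  # message #3
history[2]["pinned"] = True

window = sliding_window(history, 10)
kept = selective_retention(history, 10)
print(len(window), len(kept))
print(kept[0]["content"])
```

With a window of 10, the sliding window drops the preference from message #3, while selective retention keeps it alongside the 10 most recent messages.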

python
from typing import Callable


def summarize_context(
    messages: list[dict],
    model: str = "gpt-4",
    *,
    llm_call: Callable[..., str],
) -> str:
    """Summarize older messages to compress the conversation context.

    This function takes a list of messages and produces a compact
    summary that preserves the essential information. In practice,
    this is called when the conversation history approaches the
    context window limit.

    The key challenge is deciding what to preserve. The prompt
    instructs the LLM to focus on facts, decisions, and preferences
    rather than conversational pleasantries or failed attempts.

    `llm_call` is a placeholder for whatever client function your
    stack provides. It is assumed to accept `model` and `prompt`
    keyword arguments and to return the generated text.
    """
    instructions = (
        "Summarize the following conversation, preserving all key facts, "
        "decisions, user preferences, and task progress. Omit small talk, "
        "failed attempts that were later corrected, and verbose tool outputs. "
        "Be concise but complete:\n\n"
    )
    # Render the messages as a plain "role: content" transcript
    transcript = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

    # Call the LLM to generate a summary
    return llm_call(model=model, prompt=instructions + transcript)

Common Misconception: "Summarization is lossless." It is not. Every summarization step loses some information. The question is whether the lost information matters. In practice, a good summarization strategy loses mostly noise (verbose tool outputs, failed attempts) and retains mostly signal (key facts, decisions, preferences). But important details can be lost, which is why critical information should also be stored in long-term memory.
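
One common way to limit the damage is to summarize only the older part of the history and keep the most recent messages verbatim. The sketch below assumes a `summarize` callable is supplied by the caller. The `placeholder_summarizer` used here is a stand-in for a real LLM call and exists only to make the example runnable.

```python
from typing import Callable


def compress_history(
    messages: list[dict],
    keep_recent: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Replace all but the last `keep_recent` messages with one summary message."""
    split = max(len(messages) - max(keep_recent, 0), 0)
    older, recent = messages[:split], messages[split:]
    if not older:
        # Nothing old enough to compress: return the history unchanged
        return list(messages)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + recent


def placeholder_summarizer(older: list[dict]) -> str:
    """Stand-in for an LLM summarization call (assumption of this sketch)."""
    return f"{len(older)} earlier messages omitted"


history = [{"role": "user", "content": f"message {i}"} for i in range(1, 21)]
compressed = compress_history(history, keep_recent=4, summarize=placeholder_summarizer)
print(len(compressed))
print(compressed[0]["content"])
```

The recent messages pass through untouched, so only the older portion is exposed to summarization loss.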


04 3. Working Memory: Maintaining Task State Across Steps

What Is Working Memory?

In cognitive psychology, working memory refers to the system that holds and manipulates information needed for ongoing cognitive tasks. It is not just a passive buffer -- it is an active workspace where information is combined, transformed, and used for reasoning. When you do mental arithmetic (what is 37 times 12?), you are using working memory to hold intermediate results (37 times 10 is 370, 37 times 2 is 74, 370 plus 74 is 444).

For AI agents, working memory serves an analogous function: it maintains the current task state, including the plan, progress, and any relevant intermediate data. It is the difference between an agent that stumbles through a task step by step (without remembering the big picture) and one that maintains a clear view of where it is, where it has been, and where it is going.

Working memory is distinct from short-term memory in that it is structured and task-oriented rather than simply being a buffer of recent inputs. Conversation history is a chronological log; working memory is an organized representation of the current task state.

An Analogy: The Surgeon's Mental Model

Consider a surgeon performing a complex operation. They do not just react to what they see at each moment. They maintain a mental model of:

  • What they are trying to achieve (the goal)
  • What they have already done (the completed steps)
  • What they need to do next (the plan)
  • What has gone wrong and how they adapted (error handling)
  • The current state of the patient (intermediate results)

This is working memory in action. An AI agent needs the same kind of structured state tracking.

Task State Representation

A well-designed working memory for an agent might include:

python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    BLOCKED = "blocked"


@dataclass
class TaskState:
    """Represents the current state of an agent's task execution.

    This dataclass captures everything the agent needs to know about
    its current task: the goal, the plan, where it is in execution,
    what results it has obtained, and any errors encountered.

    By serializing this state into the prompt (via to_prompt()), the
    agent gets a clear picture of its progress at each step, rather
    than having to piece it together from conversation history.
    """

    goal: str                                           # What are we trying to achieve?
    plan: list[str] = field(default_factory=list)       # The step-by-step plan
    current_step: int = 0                               # Which step are we on?
    intermediate_results: dict[str, str] = field(default_factory=dict)  # Results so far
    status: TaskStatus = TaskStatus.PENDING              # Overall status
    errors: list[str] = field(default_factory=list)     # Errors encountered
    context_variables: dict[str, str] = field(default_factory=dict)  # Key facts

    def advance(self, result: str) -> None:
        """Record the result of the current step and advance to the next.

        This method is called after each successful step execution.
        It stores the result keyed by the step name and moves the
        current_step pointer forward. The task is marked IN_PROGRESS
        while steps remain and COMPLETED once all steps are done.
        """
        step_name = self.plan[self.current_step] if self.current_step < len(self.plan) else "unknown"
        self.intermediate_results[step_name] = result
        self.current_step += 1
        if self.current_step >= len(self.plan):
            self.status = TaskStatus.COMPLETED
        else:
            # Without this, the status would stay PENDING until the last step
            self.status = TaskStatus.IN_PROGRESS

    def record_error(self, error: str) -> None:
        """Record an error encountered during execution.

        Errors are accumulated rather than overwritten, creating a
        log that helps the agent (and developers) understand what
        went wrong during execution.
        """
        self.errors.append(error)

    def to_prompt(self) -> str:
        """Serialize task state for inclusion in the agent prompt.

        This is the critical method: it converts the structured task
        state into a text format that the LLM can understand. The
        format clearly shows which steps are done, which is next,
        and what results have been obtained.
        """
        lines = [
            f"Goal: {self.goal}",
            f"Status: {self.status.value}",
            f"Plan ({self.current_step}/{len(self.plan)} steps completed):",
        ]
        for i, step in enumerate(self.plan):
            marker = "[done]" if i < self.current_step else "[next]" if i == self.current_step else "[ ]"
            result = self.intermediate_results.get(step, "")
            result_str = f" -> {result}" if result else ""
            lines.append(f"  {marker} {step}{result_str}")
        if self.errors:
            lines.append("Errors encountered:")
            for err in self.errors:
                lines.append(f"  - {err}")
        return "\n".join(lines)

Let us trace through an example. Suppose an agent is tasked with "Analyze the sentiment of customer reviews for ProductX." After planning and executing two of four steps, the to_prompt() output might look like:

text
Goal: Analyze the sentiment of customer reviews for ProductX
Status: in_progress
Plan (2/4 steps completed):
  [done] Fetch customer reviews from the database -> Retrieved 1,247 reviews
  [done] Clean and preprocess the review text -> Cleaned 1,247 reviews, removed 23 duplicates
  [next] Run sentiment analysis on preprocessed reviews
  [ ] Generate summary report with visualizations

This gives the LLM a clear, structured view of the task state, far better than parsing through dozens of conversation messages.

Working Memory in Practice

The ReAct framework (Yao et al., 2023) implicitly uses working memory through its interleaved reasoning and action traces. At each step, the agent has access to all previous thoughts, actions, and observations, which collectively form its working memory. The "thought" step is particularly important -- it is where the agent reasons about what it knows, what it needs to do next, and what obstacles it faces.
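
A ReAct-style trace can be represented as a simple list of thought, action, and observation triples that is re-rendered into the prompt at every step. The sketch below is a hypothetical container, not code from the paper. The class names, the `Search[...]` action string, and the sample observation are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ReActStep:
    thought: str      # the agent's reasoning before acting
    action: str       # the tool call it chose
    observation: str  # what the tool returned


@dataclass
class ReActTrace:
    """Hypothetical container for an interleaved reasoning/action trace."""

    question: str
    steps: list[ReActStep] = field(default_factory=list)

    def add(self, thought: str, action: str, observation: str) -> None:
        self.steps.append(ReActStep(thought, action, observation))

    def to_prompt(self) -> str:
        """Render the whole trace, ending with a cue for the next thought."""
        lines = [f"Question: {self.question}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Thought {i}: {step.thought}")
            lines.append(f"Action {i}: {step.action}")
            lines.append(f"Observation {i}: {step.observation}")
        lines.append(f"Thought {len(self.steps) + 1}:")
        return "\n".join(lines)


trace = ReActTrace(question="Which 2023 paper on media bias detection is most cited?")
trace.add(
    thought="I should search for candidate papers first.",
    action="Search[media bias detection 2023]",
    observation="Found 10 papers.",
)
print(trace.to_prompt())
```

Because the full trace is replayed on every call, this form of working memory grows with each step and is subject to the same context limits discussed earlier.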

More sophisticated systems like Voyager (Wang et al., 2023) maintain explicit working memory that includes:

  • The current objective
  • A skill library of previously learned behaviors
  • Environmental state (e.g., inventory in Minecraft)
  • Feedback from recent attempts

Try It Yourself: Design a working memory structure for a customer service agent. What fields would you include? Think about: the customer's issue, their emotional state, actions already taken, company policies that apply, and escalation status.


05 4. Long-Term Memory: Persistent Knowledge Stores

The Need for Persistence

Short-term and working memory are ephemeral: they exist only for the duration of a task or session. When the agent shuts down (or the API call ends), they are gone. Long-term memory persists across sessions, allowing the agent to accumulate knowledge, learn from experience, and maintain continuity.

The analogy here is straightforward: short-term memory is like RAM, and long-term memory is like a hard drive. RAM is fast but volatile (lost when power is off). A hard drive is slower but persistent. Just as computers need both, agents need both types of memory.

Long-term memory addresses questions like:

  • "What did this user ask me about last week?"
  • "What approach worked when I tried this type of task before?"
  • "What are the key facts I have learned about this domain?"
  • "What are this user's preferences and constraints?"

Implementation Approaches

There are several ways to implement long-term memory, each with different strengths. The choice depends on the type of information being stored and how it needs to be retrieved.

1. File-Based Storage

The simplest approach: write memories to files and read them back. This is the approach used by tools like Claude Code's CLAUDE.md files, where project context is stored in markdown files that are loaded at the start of each session.

python
import json
from pathlib import Path
from datetime import datetime


class FileMemory:
    """Simple file-based long-term memory.

    This approach stores memories as JSON Lines (one JSON object per
    line), which makes it easy to append new memories without
    rewriting the entire file. It is suitable for small-scale
    applications and prototyping.

    Think of this as a simple diary: entries are written chronologically
    and can be searched by keyword. It does not scale well (searching
    requires reading the entire file) but it is easy to understand,
    debug, and implement.
    """

    def __init__(self, memory_dir: str):
        self.memory_dir = Path(memory_dir)
        self.memory_dir.mkdir(parents=True, exist_ok=True)
        self.memory_file = self.memory_dir / "memories.jsonl"

    def store(self, content: str, metadata: dict | None = None) -> None:
        """Store a memory entry.

        Each entry is a single line of JSON, containing the content,
        a timestamp, and optional metadata. The JSONL format allows
        efficient appending without loading the entire file.
        """
        entry = {
            "content": content,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {},
        }
        with open(self.memory_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def retrieve_all(self) -> list[dict]:
        """Retrieve all stored memories."""
        if not self.memory_file.exists():
            return []
        memories = []
        with open(self.memory_file) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines so json.loads does not raise
                    memories.append(json.loads(line))
        return memories

    def search(self, query: str) -> list[dict]:
        """Simple keyword-based search over memories.

        This is a brute-force approach: it loads all memories and
        checks each one for keyword matches. For small collections
        (hundreds of memories), this is fine. For larger collections,
        you would need indexing (see vector databases below).
        """
        query_lower = query.lower()
        results = []
        for memory in self.retrieve_all():
            if query_lower in memory["content"].lower():
                results.append(memory)
        return results

This approach is easy to implement but does not scale and lacks semantic search capabilities. If a user's preference is stored as "I prefer vegetarian food" and you search for "dietary restrictions," the keyword search will miss it. This is where vector databases become essential.

2. Vector Database Storage

Vector databases store memories as embeddings and support similarity search, enabling semantic retrieval. This is the most common approach in modern agent systems. The core idea: convert text into a numerical vector (embedding) that captures its meaning, then find similar vectors using mathematical distance.

The analogy is a library. In a traditional library, you find books by subject headings and keywords (like keyword search). A vector database is more like a librarian who understands what you mean: if you ask for "books about eating healthily," they would also suggest books filed under "nutrition," "diet," and "wellness" -- concepts that are semantically related even though the words are different.

Popular vector databases include:

  • Chroma: Lightweight, embedded, good for prototyping. Runs in-process with your application.
  • Pinecone: Managed cloud service, scales well. Good for production but requires an account.
  • Weaviate: Open-source, supports hybrid search (combining vector and keyword search).
  • Qdrant: Open-source, high performance, written in Rust.
  • pgvector: PostgreSQL extension, good for teams already using Postgres. Keeps everything in one database.
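
The store-then-search mechanics can be sketched without any database. The example below is a toy, in-memory illustration using only the standard library. `toy_embed` is a deliberate stand-in that produces sparse word-count vectors: it shows how cosine similarity ranking works, but it does not capture meaning the way a real embedding model does. All class and function names here are assumptions of the sketch.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorMemory:
    """Brute-force similarity search over stored (text, vector) pairs."""

    def __init__(self, embed=toy_embed) -> None:
        self.embed = embed
        self.items: list[tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self.items.append((text, self.embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Return the k stored texts most similar to the query."""
        query_vec = self.embed(query)
        scored = [(cosine(query_vec, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


memory = ToyVectorMemory()
memory.store("The user prefers vegetarian food.")
memory.store("Project Phoenix uses Python 3.11 and PostgreSQL.")
memory.store("The user's timezone is Europe/Madrid.")

results = memory.search("Which database does Phoenix use?", k=1)
print(results[0][1])
```

The query matches the Phoenix memory only because both contain the word "phoenix". A query such as "dietary restrictions" would score zero against the vegetarian memory with this toy embedding, which is exactly the gap a learned embedding model is meant to close. A production system would also replace the linear scan with an approximate nearest-neighbor index.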

3. Key-Value Stores

For structured memories (user preferences, configuration, facts), a key-value store can be more appropriate than a vector database. When you know exactly what you are looking for (e.g., "what is this user's preferred programming language?"), a key-value lookup is faster and more reliable than semantic search.

python
import sqlite3
from datetime import datetime


class KeyValueMemory:
    """Key-value store for structured agent memories.

    This is ideal for storing discrete facts that have a clear key:
    - User preferences: ("preferred_language", "Python")
    - Configuration: ("max_retries", "3")
    - Named facts: ("user_timezone", "Europe/Madrid")

    We use SQLite for persistence, which gives us ACID transactions
    and SQL querying without requiring a separate database server.
    The 'category' field allows grouping related memories.
    """

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS memories (
                key TEXT PRIMARY KEY,
                value TEXT NOT NULL,
                category TEXT,
                updated_at TEXT
            )"""
        )
        self.conn.commit()

    def set(self, key: str, value: str, category: str = "general") -> None:
        """Store or update a key-value pair.

        INSERT OR REPLACE ensures that if the key already exists,
        its value is updated rather than creating a duplicate.
        This is important because facts change: a user might update
        their preferred programming language from Python to Rust.
        """
        self.conn.execute(
            "INSERT OR REPLACE INTO memories (key, value, category, updated_at) VALUES (?, ?, ?, ?)",
            (key, value, category, datetime.now().isoformat()),
        )
        self.conn.commit()

    def get(self, key: str) -> str | None:
        """Retrieve a value by its exact key."""
        row = self.conn.execute("SELECT value FROM memories WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def get_by_category(self, category: str) -> dict[str, str]:
        """Retrieve all key-value pairs in a category.

        This is useful for loading all user preferences at once,
        or all facts about a specific project.
        """
        rows = self.conn.execute(
            "SELECT key, value FROM memories WHERE category = ?", (category,)
        ).fetchall()
        return {key: value for key, value in rows}

4. Graph Databases

For relational knowledge (entities, relationships, ontologies), graph databases like Neo4j provide powerful query capabilities. The key advantage of graph databases is that they naturally represent relationships between entities:

text
(User:Person {name: "Alice"}) -[:PREFERS]-> (Diet:Preference {type: "vegetarian"})
(User:Person {name: "Alice"}) -[:WORKS_AT]-> (Company:Organization {name: "Acme Corp"})
(Task:Task {id: "t1"}) -[:DEPENDS_ON]-> (Task:Task {id: "t2"})

Graph databases excel when memory involves complex relationships between entities. For example, if an agent is managing a project with multiple people, tasks, and dependencies, a graph database can answer questions like "which tasks assigned to Alice depend on tasks assigned to Bob?" with a single query. Trying to answer this with a flat key-value store or a vector database would be much harder.
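
To see what such a query does, here is the same question answered over a tiny in-memory graph in plain Python. This is an illustrative sketch, not Neo4j or Cypher. The task IDs, the owner named Carol, and the function name are made up for the example.

```python
# Two edge types, stored as plain dicts:
#   assigned_to: task -> person
#   depends_on:  task -> list of tasks it directly depends on
assigned_to = {"t1": "Alice", "t2": "Bob", "t3": "Alice", "t4": "Carol"}
depends_on = {"t1": ["t2"], "t3": ["t4"], "t4": ["t2"]}


def cross_owner_dependencies(
    owner: str,
    dependency_owner: str,
    assigned_to: dict[str, str],
    depends_on: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (task, dependency) pairs where a task of `owner`
    directly depends on a task of `dependency_owner`."""
    pairs = []
    for task, dependencies in depends_on.items():
        if assigned_to.get(task) != owner:
            continue
        for dep in dependencies:
            if assigned_to.get(dep) == dependency_owner:
                pairs.append((task, dep))
    return pairs


print(cross_owner_dependencies("Alice", "Bob", assigned_to, depends_on))
```

Only direct dependencies are followed here, so `t3` (which reaches Bob's `t2` through Carol's `t4`) is not reported. Following chains of arbitrary length is where a graph query language earns its keep, since hand-written traversal code grows quickly.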

However, graph databases add significant architectural complexity. You need to design a schema, learn a query language (like Cypher for Neo4j), and manage the database. For most agent applications, the simpler approaches (key-value + vector) are sufficient.

Key Insight: In practice, most production agent systems use a hybrid approach: a key-value store for structured facts (user preferences, configuration), a vector database for unstructured knowledge (past conversations, documents), and the conversation context for immediate state. Each type of information goes in the storage system best suited to it.
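
A rough sketch of how the three sources might be combined into a single prompt follows. It assumes the structured facts have already been loaded from a key-value store and the relevant memories have already been retrieved from a vector store. The function name, section labels, and sample data are hypothetical.

```python
def build_context(
    facts: dict[str, str],
    retrieved: list[str],
    recent_messages: list[dict],
    max_retrieved: int = 3,
) -> str:
    """Assemble prompt context from structured facts, retrieved memories,
    and the recent conversation. Empty sources are skipped."""
    sections = []
    if facts:
        body = "\n".join(f"- {key}: {value}" for key, value in facts.items())
        sections.append("Known facts:\n" + body)
    if retrieved:
        body = "\n".join(f"- {memory}" for memory in retrieved[:max_retrieved])
        sections.append("Relevant memories:\n" + body)
    if recent_messages:
        body = "\n".join(f"{m['role']}: {m['content']}" for m in recent_messages)
        sections.append("Recent conversation:\n" + body)
    return "\n\n".join(sections)


context = build_context(
    facts={"preferred_language": "Python", "user_timezone": "Europe/Madrid"},
    retrieved=["Project Phoenix uses Python 3.11 and PostgreSQL."],
    recent_messages=[{"role": "user", "content": "Continue working on the Phoenix project."}],
)
print(context)
```

Capping the number of retrieved memories keeps the prompt focused on what is most pertinent to the current step.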


06 5. Episodic Memory: Remembering Past Experiences

What Is Episodic Memory?

Inspired by the cognitive science concept, episodic memory for AI agents refers to the storage and retrieval of specific past experiences or episodes. Each episode captures what happened, when it happened, and the context in which it happened.

In human cognition, episodic memory is what allows you to remember specific events: "last Tuesday, I tried the new Italian restaurant on Oak Street and the pasta was excellent." This is different from semantic memory (knowing that "pasta is an Italian dish") because it is tied to a specific time, place, and context.

For AI agents, episodic memory answers questions like:

  • "The last time I tried to parse a PDF, what approach did I use?"
  • "What happened when I called this API with those parameters?"
  • "How did the user react when I suggested this approach before?"
  • "The last time a database migration failed, what was the root cause?"

The power of episodic memory is that it enables learning from experience. Instead of approaching every task from scratch using only general knowledge, an agent can recall what worked (and what did not) in similar situations.

Structure of an Episode

What information should an episode capture? Think about what a doctor writes in a patient's medical record after a visit: what the problem was (context), what they did (action), what happened (outcome), and whether it helped (success). An agent's episodic memory should capture similar information:

python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Episode:
    """A single episodic memory entry.

    Each episode is a self-contained record of something that happened.
    It captures not just the action taken, but the full context: what
    was the situation, what was tried, what resulted, and whether it
    was successful. This rich structure enables the agent to learn
    nuanced lessons from experience.
    """

    description: str              # What happened (natural language summary)
    timestamp: datetime           # When it happened
    context: dict[str, str]       # Surrounding context (task, project, user, etc.)
    action_taken: str             # What the agent did
    outcome: str                  # What resulted from the action
    success: bool                 # Whether the outcome was positive
    tags: list[str] = field(default_factory=list)  # Categorization tags
    importance: float = 0.5       # Importance score (0.0 to 1.0)

    def to_text(self) -> str:
        """Convert episode to natural language for LLM consumption.

        This method creates a readable narrative from the structured
        fields. The LLM can understand and reason about this text
        representation when deciding how to handle similar situations.
        """
        success_str = "successfully" if self.success else "unsuccessfully"
        return (
            f"On {self.timestamp.strftime('%Y-%m-%d')}, while working on "
            f"{self.context.get('task', 'unknown task')}, I {success_str} "
            f"{self.action_taken}. The outcome was: {self.outcome}"
        )

Episodic Memory in Generative Agents

The landmark paper "Generative Agents: Interactive Simulacra of Human Behavior" (Park et al., 2023) introduced a compelling architecture for episodic memory in AI agents. This paper is worth studying in detail because it demonstrates how memory can give rise to remarkably human-like behavior.

In the Generative Agents system, 25 AI agents live in a simulated town (inspired by The Sims). Each agent has a name, occupation, relationships, and daily routines. What makes the system remarkable is not the individual agents' reasoning, but the emergent social behaviors that arise from their memory systems. Agents formed friendships, organized parties, spread gossip, and even ran for mayor -- all emerging from the interaction between memory, retrieval, and planning.

The memory system works as follows:

  1. Observations are recorded as memory stream entries, each with a natural-language description, a creation timestamp, and a most-recent-access timestamp. Everything the agent perceives is recorded: "Isabella Rodriguez is setting up the cafe," "Maria Lopez said she is interested in running for mayor."

  2. Retrieval uses a scoring function that combines three factors:

    • Recency: Recently accessed memories score higher. The paper applies exponential decay over the hours since a memory was last retrieved, so a memory touched one hour ago outranks one from last week, all else being equal.
    • Importance: Memories rated as more important score higher; the rating is obtained by asking the LLM to score each memory when it is created. "Isabella's father died" is more important than "Isabella ate breakfast."
    • Relevance: Memories semantically similar to the current query score higher. When deciding what to do at a party, memories about previous social events are more relevant than memories about grocery shopping.

  3. Reflection periodically synthesizes episodic memories into higher-level insights. After accumulating many observations about a neighbor's behavior, the agent might generate the reflection: "Maria Lopez seems to be very interested in community leadership." Each reflection is stored in the memory stream with pointers to the memories it was derived from, so it can be retrieved in future queries like any other memory.

The retrieval scoring function from the paper:

text
score(memory) = alpha * recency(memory) + beta * importance(memory) + gamma * relevance(memory, query)

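Where alpha, beta, and gamma are weighting parameters that balance the three factors. In the paper, all three weights are set to 1, and each component is min-max normalized to the range [0, 1] before the scores are combined.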

Key Insight: The Generative Agents system shows that sophisticated behavior can emerge from relatively simple memory mechanisms. The agents have no explicit social-reasoning module; their social behavior emerges from remembering interactions, reflecting on patterns, and using those reflections to guide future behavior. In this architecture, memory is the foundation on which the emergent behavior rests.

Implementing Episodic Retrieval

Let us implement the three scoring functions from the Generative Agents paper:

python
import math
from datetime import datetime


def compute_recency_score(
    memory_timestamp: datetime, current_time: datetime, decay_factor: float = 0.995
) -> float:
    """Compute recency score with exponential decay.

    More recent memories get higher scores. The decay_factor controls
    how quickly older memories lose relevance.

    With the default decay_factor of 0.995:
    - 1 hour old: score = 0.995^1 = 0.995 (very recent, high score)
    - 24 hours old: score = 0.995^24 = 0.887 (still quite relevant)
    - 1 week old: score = 0.995^168 = 0.431 (moderately relevant)
    - 1 month old: score = 0.995^720 = 0.027 (mostly forgotten)

    This mirrors how human memory works: recent events are vivid,
    while older events fade unless they are particularly important
    or frequently recalled.
    """
    hours_elapsed = (current_time - memory_timestamp).total_seconds() / 3600.0
    return math.pow(decay_factor, hours_elapsed)


def compute_relevance_score(
    memory_embedding: list[float], query_embedding: list[float]
) -> float:
    """Compute cosine similarity between memory and query embeddings.

    Cosine similarity measures how similar two vectors are in terms
    of their direction (ignoring magnitude). It ranges from -1
    (opposite directions) to 1 (same direction), with 0 meaning
    no relationship.

    For text embeddings, cosine similarity captures semantic
    relatedness: "dog" and "puppy" will have high similarity,
    while "dog" and "algebra" will have low similarity.
    """
    dot_product = sum(a * b for a, b in zip(memory_embedding, query_embedding))
    norm_a = math.sqrt(sum(a * a for a in memory_embedding))
    norm_b = math.sqrt(sum(b * b for b in query_embedding))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product / (norm_a * norm_b)


def retrieve_episodes(
    episodes: list[dict],
    query_embedding: list[float],
    current_time: datetime,
    alpha: float = 1.0,   # Weight for recency
    beta: float = 1.0,    # Weight for importance
    gamma: float = 1.0,   # Weight for relevance
    top_k: int = 5,
) -> list[dict]:
    """Retrieve the most relevant episodes using the Generative Agents scoring.

    This function implements the core retrieval algorithm from the
    Generative Agents paper. Each memory is scored on three dimensions
    (recency, importance, relevance), and the top-k highest-scoring
    memories are returned.

    The alpha, beta, and gamma weights allow you to tune the balance:
    - High alpha: Prefer recent memories (good for fast-changing contexts)
    - High beta: Prefer important memories (good for critical decisions)
    - High gamma: Prefer relevant memories (good for specific queries)
    """
    scored = []
    for episode in episodes:
        recency = compute_recency_score(episode["timestamp"], current_time)
        importance = episode["importance"]
        relevance = compute_relevance_score(episode["embedding"], query_embedding)
        score = alpha * recency + beta * importance + gamma * relevance
        scored.append((score, episode))

    scored.sort(key=lambda x: x[0], reverse=True)
    return [episode for _, episode in scored[:top_k]]

Try It Yourself: Consider an agent that has 100 episodic memories. A query comes in that is moderately relevant to a very old but very important memory, and highly relevant to a recent but unimportant memory. Which memory should be retrieved? Experiment with different alpha/beta/gamma values to see how the scoring changes. There is no single right answer -- it depends on the use case.
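
One way to explore this exercise is to score the two competing memories by hand. The sketch below uses made-up recency, importance, and relevance values (assumptions for illustration, not figures from the paper) and a hypothetical helper `rank` to show how the winner can flip as the weights change.

```python
# Component scores are illustrative assumptions, each already in [0, 1].
MEMORIES = {
    "old_important": {"recency": 0.05, "importance": 0.9, "relevance": 0.5},
    "recent_trivial": {"recency": 0.95, "importance": 0.1, "relevance": 0.9},
}


def combined_score(parts, alpha, beta, gamma):
    """Weighted sum of recency, importance, and relevance."""
    return (
        alpha * parts["recency"]
        + beta * parts["importance"]
        + gamma * parts["relevance"]
    )


def rank(memories, alpha=1.0, beta=1.0, gamma=1.0):
    """Return memory names ordered from highest to lowest combined score."""
    return sorted(
        memories,
        key=lambda name: combined_score(memories[name], alpha, beta, gamma),
        reverse=True,
    )


# Try equal weights, an importance-heavy setting, and a recency-heavy setting.
for weights in [(1.0, 1.0, 1.0), (0.2, 2.0, 1.0), (2.0, 0.2, 1.0)]:
    print(weights, rank(MEMORIES, *weights))
```

With these sample values, equal weights favor the recent memory (1.95 versus 1.45), while raising beta and lowering alpha moves the old, important memory to the top (2.31 versus 1.29).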


076. Semantic Memory: Facts and Knowledge Retrieval

What Is Semantic Memory?

While episodic memory stores specific experiences ("last Tuesday I tried the new restaurant"), semantic memory stores general knowledge and facts ("Paris is the capital of France"). In cognitive science, semantic memory is your knowledge of the world independent of when or where you learned it. You know that water boils at 100 degrees Celsius, but you probably do not remember the specific lesson where you learned this.

For AI agents, semantic memory stores:

  • Domain knowledge relevant to the agent's tasks
  • User preferences and profiles
  • Learned rules and heuristics
  • Facts extracted from documents or interactions

Semantic Memory vs. Parametric Knowledge

LLMs already contain vast semantic knowledge in their parameters (parametric knowledge). GPT-4 "knows" that Paris is the capital of France because this fact appeared in its training data. So why do we need an explicit semantic memory system?

The purpose of an explicit semantic memory system is to:

  1. Supplement parametric knowledge with domain-specific or up-to-date information. Your company's internal API documentation is not in GPT-4's training data.
  2. Override parametric knowledge when the LLM's training data is outdated or incorrect. If a company changes its CEO, the LLM's parametric knowledge is stale until it is retrained.
  3. Personalize knowledge to the specific user or deployment context. The LLM knows general best practices for Python, but your team has specific coding conventions.
  4. Audit and explain where specific knowledge comes from. When an agent makes a claim, you can trace it to a specific document or interaction rather than an opaque model parameter.

Common Misconception: "LLMs already know everything, so we do not need semantic memory." LLMs know a lot, but their knowledge is frozen at training time, may contain errors, and lacks information about your specific context. Semantic memory is the bridge between general knowledge and specific, up-to-date, personalized knowledge.

Implementing Semantic Memory

A practical semantic memory system stores facts as structured entries with source attribution. We use subject-predicate-object triples, which is the same structure used in knowledge graphs and the semantic web:

python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SemanticFact:
    """A single fact in semantic memory.

    Facts are stored as subject-predicate-object triples, a format
    borrowed from knowledge representation theory. This structure
    makes it easy to query ("what do we know about X?") and update
    ("the CEO of Acme is now Jane, not John").

    The confidence and source fields support reasoning about fact
    reliability: a fact from an official document has higher
    confidence than a fact inferred from casual conversation.
    """

    subject: str              # What the fact is about (e.g., "Acme Corp")
    predicate: str            # The relationship or property (e.g., "CEO is")
    object: str               # The value or target (e.g., "Jane Smith")
    confidence: float         # How confident we are (0.0 to 1.0)
    source: str               # Where this fact came from
    last_verified: datetime   # When this was last confirmed

    def to_triple(self) -> str:
        return f"({self.subject}, {self.predicate}, {self.object})"

    def to_natural_language(self) -> str:
        return f"{self.subject} {self.predicate} {self.object} (confidence: {self.confidence:.0%})"


class SemanticMemory:
    """A simple semantic memory store using subject-predicate-object triples.

    This implementation stores facts in memory (for simplicity) and
    supports querying by subject and/or predicate. In production,
    you would back this with a database (SQL for structured queries,
    or a graph database for relationship traversal).

    A key design decision: when a new fact is added with the same
    subject and predicate as an existing fact, the old fact is
    replaced. This ensures the agent always uses the most current
    information. For example, if the user's preferred language
    changes from Python to Rust, the old preference is overwritten.
    """

    def __init__(self):
        self.facts: list[SemanticFact] = []

    def add_fact(self, fact: SemanticFact) -> None:
        """Add a fact, replacing any existing fact with same subject and predicate."""
        self.facts = [
            f for f in self.facts
            if not (f.subject == fact.subject and f.predicate == fact.predicate)
        ]
        self.facts.append(fact)

    def query(self, subject: str | None = None, predicate: str | None = None) -> list[SemanticFact]:
        """Query facts by subject and/or predicate.

        Examples:
        - query(subject="Alice") returns all facts about Alice
        - query(predicate="works at") returns all employment facts
        - query(subject="Alice", predicate="prefers") returns Alice's preferences
        """
        results = self.facts
        if subject:
            results = [f for f in results if f.subject.lower() == subject.lower()]
        if predicate:
            results = [f for f in results if f.predicate.lower() == predicate.lower()]
        return sorted(results, key=lambda f: f.confidence, reverse=True)

    def to_prompt(self, subject: str | None = None) -> str:
        """Format relevant facts for inclusion in the LLM prompt."""
        facts = self.query(subject=subject) if subject else self.facts
        if not facts:
            return "No relevant facts in memory."
        lines = ["Known facts:"]
        for fact in facts:
            lines.append(f"  - {fact.to_natural_language()}")
        return "\n".join(lines)

Knowledge Extraction

An important capability is automatically extracting facts from conversations and documents. Rather than requiring a human to manually add facts to semantic memory, the agent can extract them:

python
EXTRACTION_PROMPT = """Extract factual information from the following text as
subject-predicate-object triples. Only extract facts that are explicitly stated,
not inferences or opinions.

Text: {text}

Format each fact as:
- Subject: ...
- Predicate: ...
- Object: ...
- Confidence: high/medium/low
"""

For example, given the text "Alice works at Acme Corp as a senior engineer. She prefers Python for backend development," the LLM would extract:

  • (Alice, works at, Acme Corp) [confidence: high]
  • (Alice, role is, senior engineer) [confidence: high]
  • (Alice, prefers for backend, Python) [confidence: high]
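
The LLM's reply still has to be turned back into structured facts. Below is a rough parsing sketch that assumes the model follows the `- Subject: ...` format from the prompt exactly. The function name `parse_extracted_facts`, the numeric values assigned to high/medium/low, and the fallback of 0.5 for unknown labels are assumptions for illustration; real model output is often messier and would need more defensive handling.

```python
import re

# Assumed mapping from the prompt's confidence labels to numeric scores.
CONFIDENCE_LEVELS = {"high": 0.9, "medium": 0.6, "low": 0.3}
REQUIRED_FIELDS = {"subject", "predicate", "object", "confidence"}

FIELD_PATTERN = re.compile(
    r"^\s*-\s*(Subject|Predicate|Object|Confidence)\s*:\s*(.+?)\s*$",
    re.IGNORECASE,
)


def parse_extracted_facts(llm_output: str) -> list[dict]:
    """Parse '- Field: value' lines into a list of fact dictionaries.

    A new fact starts at each 'Subject' line. Facts missing any of the
    four fields are skipped rather than guessed.
    """
    facts = []
    current: dict = {}
    for line in llm_output.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # Ignore anything that is not a recognized field line.
        key, value = match.group(1).lower(), match.group(2)
        if key == "subject" and current:
            facts.append(current)
            current = {}
        current[key] = value
    if current:
        facts.append(current)

    complete = []
    for fact in facts:
        if not REQUIRED_FIELDS.issubset(fact):
            continue
        label = fact["confidence"].lower()
        # Unknown labels fall back to 0.5 (an arbitrary neutral default).
        fact["confidence"] = CONFIDENCE_LEVELS.get(label, 0.5)
        complete.append(fact)
    return complete


sample = """- Subject: Alice
- Predicate: works at
- Object: Acme Corp
- Confidence: high
- Subject: Alice
- Predicate: prefers for backend
- Object: Python
- Confidence: medium
"""
for fact in parse_extracted_facts(sample):
    print(fact)
```

Each resulting dictionary carries the fields needed to construct a fact record, with the textual confidence label converted into a number on the 0.0 to 1.0 scale.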

087. Implementation Approaches: A Comparative Analysis

In-Context Memory

How it works: All memory is placed directly into the LLM's context window as part of the prompt.

Advantages:

  • Simplest to implement: no infrastructure, no databases, no embedding models.
  • No external dependencies: works with any LLM.
  • The LLM can reason over all available memory simultaneously -- it sees everything at once.

Disadvantages:

  • Limited by context window size: as memory grows, you hit the wall.
  • Cost increases linearly with memory size: you pay for every token in the context.
  • Retrieval accuracy degrades in long contexts, especially for information placed in the middle of the prompt (Liu et al., 2024): the "lost in the middle" problem.

Best for: Short tasks, small memory footprints, prototyping, simple chatbots.
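
As a rough illustration of in-context memory, the sketch below concatenates stored notes into the prompt and drops the oldest ones once a token budget is exceeded. The helper name `build_prompt`, the 4-characters-per-token estimate, and the budget values are assumptions for illustration, not a prescribed design.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def build_prompt(memories: list[str], question: str, max_memory_tokens: int = 50) -> str:
    """Place as many of the most recent memories as fit into the prompt.

    `memories` is assumed to be ordered oldest to newest.
    """
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest memories are considered first.
    for note in reversed(memories):
        cost = estimate_tokens(note)
        if used + cost > max_memory_tokens:
            break
        kept.append(note)
        used += cost
    kept.reverse()  # Restore chronological order for readability.

    memory_block = "\n".join(f"- {note}" for note in kept) or "- (none)"
    return f"Known context:\n{memory_block}\n\nUser question: {question}"


notes = [
    "User is allergic to shellfish.",
    "User is planning a trip to Lisbon in May.",
    "User prefers walking tours over bus tours.",
]
print(build_prompt(notes, "Where should we eat on the first night?", max_memory_tokens=20))
```

With a budget of 20 estimated tokens, the two newest notes fit and the oldest one (the shellfish allergy) is dropped. This is the "hitting the wall" failure mode in miniature: the simplest truncation policy discards exactly the fact that matters for the question.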

Vector Databases

How it works: Memories are embedded into vector representations and stored in a specialized database. Retrieval is performed via similarity search.

python
# Example using chromadb
import chromadb

# Create a client and collection
client = chromadb.Client()
collection = client.create_collection(name="agent_memory")

# Store a memory -- ChromaDB automatically generates embeddings
collection.add(
    documents=["The user prefers Python over JavaScript for data analysis tasks."],
    metadatas=[{"type": "preference", "timestamp": "2024-01-15"}],
    ids=["mem_001"],
)

# Retrieve relevant memories -- note that the query is semantically
# matched, not just keyword matched. "programming language" would
# also retrieve this memory even though those exact words are not in it.
results = collection.query(
    query_texts=["What programming language should I use?"],
    n_results=3,
)

Advantages:

  • Semantic search: finds conceptually related memories, not just keyword matches.
  • Scales to millions of memories efficiently.
  • Sub-linear retrieval time: approximate nearest neighbor (ANN) indexes avoid comparing the query against every stored vector.

Disadvantages:

  • Requires an embedding model (adds complexity and latency).
  • Approximate search may miss relevant results (typically 95-99% recall).
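  • Limited structured querying: metadata filters cover simple cases such as "all memories of type preference," but joins, aggregations, and complex predicates are better served by a key-value store or a database.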

Best for: Large memory stores, semantic retrieval, RAG-based systems, episodic memory.

Key-Value Stores

How it works: Memories are stored as key-value pairs, often with additional metadata for filtering.

Advantages:

  • Fast exact lookups: O(1) time to retrieve a specific memory.
  • Structured and predictable: you always get exactly what you asked for.
  • Easy to update and delete: changing a preference is a simple update.

Disadvantages:

  • No semantic search: you must know the exact key.
  • Not suitable for open-ended queries like "what do I know about this topic?"
  • Flat structure: no support for complex relationships.

Best for: User preferences, configuration, structured facts, session state.

Graph Databases

How it works: Memories are stored as nodes and edges in a graph, capturing entities and their relationships.

Advantages:

  • Rich relational queries: "find all tasks assigned to people in the engineering team."
  • Natural representation of knowledge: entities and relationships map directly to the graph.
  • Supports multi-hop reasoning: "who manages the person who wrote this code?"

Disadvantages:

  • Complex to set up and maintain: requires schema design and a dedicated database.
  • Query language learning curve: Cypher (Neo4j) or SPARQL are not trivial.
  • Overkill for simple memory needs: most agents do not need multi-hop reasoning.

Best for: Complex domains with many entity relationships, knowledge graphs, organizational data.

Comparison Table

| Feature | In-Context | Vector DB | Key-Value | Graph DB |
| --- | --- | --- | --- | --- |
| Semantic search | Via LLM | Native | No | Limited |
| Scalability | Low | High | High | Medium |
| Setup complexity | None | Low | Low | High |
| Structured queries | No | Limited | Yes | Yes |
| Relationship modeling | Via LLM | No | No | Native |
| Cost at scale | High | Low | Low | Medium |
| Best for | Prototyping | Unstructured retrieval | Structured facts | Relational knowledge |

098. Memory Management: Summarization, Forgetting, and Prioritization

The Memory Management Problem

As an agent accumulates memories, it faces the same challenges humans do: too much information, limited processing capacity, and the need to distinguish important memories from trivial ones. One frequently cited estimate puts the information a person is exposed to at about 34 GB per day, yet only a tiny fraction of it is retained. Agents face a similar selection problem.

Without memory management, the memory store grows without bound. Retrieval becomes slower, more expensive, and less accurate (because irrelevant memories dilute the relevant ones). Memory management strategies address this by deciding what to keep, what to compress, and what to discard.

Summarization

Summarization compresses multiple memories into a more compact representation. This is used at two levels:

Conversation summarization: Compressing older parts of a conversation to fit within the context window. This is the most common use case -- when a conversation exceeds the context limit, older messages are summarized.

python
def progressive_summarization(
    messages: list[dict], max_context_tokens: int = 4000
) -> list[dict]:
    """Progressively summarize older messages to fit context budget.

    Keeps recent messages verbatim and summarizes older ones. This
    approach preserves detail where it matters most (recent context)
    while retaining the gist of older interactions.

    The recursive implementation handles cases where even after
    summarization, the result is still too long: it summarizes
    again until the budget is met or no further shrinking is possible.
    """
    # Estimate tokens (rough: 1 token per 4 characters)
    def estimate_tokens(msgs):
        return sum(len(m["content"]) // 4 for m in msgs)

    if estimate_tokens(messages) <= max_context_tokens:
        return messages  # Everything fits, no summarization needed
    if len(messages) < 2:
        return messages  # Nothing older to fold into a summary

    # Split: older half gets summarized, recent half stays verbatim
    midpoint = len(messages) // 2
    older = messages[:midpoint]
    recent = messages[midpoint:]

    # summarize_context is assumed to be an LLM-backed helper defined
    # elsewhere that returns a short text summary of the given messages.
    summary = summarize_context(older)
    summarized_msg = {"role": "system", "content": f"Summary of earlier conversation: {summary}"}

    result = [summarized_msg] + recent

    # Recurse only while the message count is still shrinking; otherwise the
    # recent messages alone exceed the budget and recursion would never end.
    if estimate_tokens(result) > max_context_tokens and len(result) < len(messages):
        return progressive_summarization(result, max_context_tokens)
    return result

Reflection-based summarization: As seen in Generative Agents (Park et al., 2023), periodically synthesizing memories into higher-level observations. This is not just compression -- it is abstraction. The agent looks at many specific memories and generates general insights.

python
REFLECTION_PROMPT = """Given the following recent observations, generate 3-5
higher-level insights or patterns. These should be generalizations that could
be useful in future situations. Focus on patterns, preferences, and lessons learned.

Recent observations:
{observations}

Higher-level insights:"""

For example, given observations like "API call to ServiceX failed with timeout," "API call to ServiceX failed with 503," and "API call to ServiceX succeeded after retry," the reflection might generate: "ServiceX is unreliable and often requires retries. Always implement retry logic when calling ServiceX."
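
Reflection also needs a trigger. Park et al. describe generating reflections once the summed importance of recently perceived events crosses a threshold. The sketch below follows that idea on the 0.0 to 1.0 importance scale used in these notes; the class name `ReflectionTrigger`, the threshold value, and the shortened prompt template are assumptions for illustration.

```python
# Shortened stand-in for the reflection prompt shown above.
REFLECTION_TEMPLATE = (
    "Given the following recent observations, generate 3-5 higher-level "
    "insights or patterns.\n\nRecent observations:\n{observations}\n\n"
    "Higher-level insights:"
)


class ReflectionTrigger:
    """Accumulate observation importance and signal when to reflect."""

    def __init__(self, threshold: float = 2.0) -> None:
        self.threshold = threshold  # Assumed value; tune per application.
        self.pending: list[tuple[str, float]] = []

    def observe(self, text: str, importance: float) -> bool:
        """Record an observation; return True when reflection is due."""
        self.pending.append((text, importance))
        return sum(score for _, score in self.pending) >= self.threshold

    def build_prompt(self) -> str:
        """Format pending observations into the prompt and reset the buffer."""
        lines = "\n".join(f"- {text}" for text, _ in self.pending)
        self.pending = []
        return REFLECTION_TEMPLATE.format(observations=lines)


trigger = ReflectionTrigger(threshold=2.0)
events = [
    ("API call to ServiceX failed with timeout", 0.8),
    ("API call to ServiceX failed with 503", 0.8),
    ("API call to ServiceX succeeded after retry", 0.6),
]
for text, importance in events:
    if trigger.observe(text, importance):
        # In a real agent this prompt would be sent to the LLM.
        print(trigger.build_prompt())
```

Here the third observation pushes the accumulated importance past the threshold, so the prompt is built once and contains all three ServiceX observations.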

Forgetting

Not all memories should be kept forever. Forgetting is an important memory management strategy, not a flaw. In human cognition, forgetting serves a vital function: it prevents information overload and ensures that the most relevant information is readily accessible. An agent that remembers everything equally struggles to find the signal in the noise.

Strategies for forgetting:

  • Time-based decay: Memories that have not been accessed for a long time are gradually removed. This is the simplest approach and mirrors the exponential decay in the Generative Agents retrieval function.
  • Relevance-based pruning: Memories that are never retrieved (low relevance to queries) are candidates for removal. If a memory has never been useful, it probably never will be.
  • Redundancy elimination: When a new memory supersedes an old one, the old one can be removed. If we learn "the CEO is Jane" and we already have "the CEO is John," the old memory is obsolete.
  • Capacity-based eviction: When memory reaches a size limit, the least important memories are removed. This resembles cache eviction policies from operating systems such as LRU (Least Recently Used), except that memories are ranked by importance rather than by last access time.

python
from datetime import datetime, timedelta


def forget_old_memories(
    memories: list[dict], max_age_days: int = 90, min_importance: float = 0.3
) -> list[dict]:
    """Remove memories that are old and unimportant.

    High-importance memories are kept regardless of age -- these are
    the "core memories" that define the agent's knowledge (like a
    user's dietary restrictions or critical project constraints).

    Low-importance memories are forgotten after max_age_days, unless
    they have been recently accessed (which would update their
    access timestamp and make them "recent" again).
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    retained = []
    forgotten_count = 0
    for memory in memories:
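        # Prefer the last-access time when present so that recently used
        # memories count as "recent", as the docstring describes.
        is_recent = memory.get("last_accessed", memory["timestamp"]) > cutoff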
        is_important = memory.get("importance", 0.5) >= min_importance
        if is_recent or is_important:
            retained.append(memory)
        else:
            forgotten_count += 1
    if forgotten_count > 0:
        print(f"Forgot {forgotten_count} old, low-importance memories.")
    return retained
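
Capacity-based eviction can be sketched in a few lines as well. The version below keeps the `capacity` highest-importance memories and breaks ties in favor of the more recently accessed one; the function name `evict_to_capacity` and the `last_accessed` key are assumptions for illustration.

```python
import heapq
from datetime import datetime


def evict_to_capacity(memories: list[dict], capacity: int) -> list[dict]:
    """Keep at most `capacity` memories, preferring important, recently used ones."""
    if capacity <= 0:
        return []
    if len(memories) <= capacity:
        return list(memories)
    # Rank by importance first, then by last access time as a tie-breaker.
    return heapq.nlargest(
        capacity,
        memories,
        key=lambda m: (m.get("importance", 0.5), m["last_accessed"]),
    )


store = [
    {"content": "User is allergic to shellfish", "importance": 0.9,
     "last_accessed": datetime(2024, 1, 5)},
    {"content": "User said good morning", "importance": 0.1,
     "last_accessed": datetime(2024, 3, 1)},
    {"content": "Deploy failed due to missing env var", "importance": 0.7,
     "last_accessed": datetime(2024, 2, 10)},
]
for memory in evict_to_capacity(store, capacity=2):
    print(memory["content"])
```

The greeting is evicted even though it was accessed most recently, which is the practical difference between importance-ranked eviction and a pure recency policy.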

Prioritization

When retrieving memories, the agent must decide which are most relevant. The Generative Agents approach (combining recency, importance, and relevance) is one strategy. Others include:

  • Frequency-based: Memories that are accessed frequently are likely more important. If the agent keeps retrieving a particular fact, it is probably central to the current work.
  • Emotional salience: In human-facing agents, memories associated with strong user emotions (frustration, satisfaction) may be prioritized. If a user was very frustrated by a previous experience, the agent should remember that.
  • Task relevance: Memories tagged with the current task type receive a boost. When working on a Python project, Python-related memories are more relevant than JavaScript-related ones.
  • Contradiction detection: Memories that contradict each other flag an update need. If one memory says "the API endpoint is /v1/users" and another says "the API endpoint is /v2/users," the agent should investigate.
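
Several of these signals can be layered on top of a base retrieval score. The sketch below adds a frequency boost and a task-match boost; the function name `prioritized_score`, the logarithmic damping, and the boost sizes are all assumptions chosen for illustration rather than tuned values.

```python
import math


def prioritized_score(
    base_score: float,
    access_count: int,
    memory_tags: set[str],
    current_task: str,
    frequency_weight: float = 0.1,
    task_boost: float = 0.25,
) -> float:
    """Adjust a base retrieval score with frequency and task-relevance signals."""
    # log1p dampens the effect so a heavily accessed memory cannot dominate.
    frequency_bonus = frequency_weight * math.log1p(access_count)
    task_bonus = task_boost if current_task in memory_tags else 0.0
    return base_score + frequency_bonus + task_bonus


# Sample candidates with made-up base scores from an earlier retrieval step.
candidates = [
    {"content": "Use pytest fixtures for DB setup", "base": 1.2,
     "access_count": 10, "tags": {"python", "testing"}},
    {"content": "Prefer const over let in JS", "base": 1.4,
     "access_count": 0, "tags": {"javascript"}},
]
ranked = sorted(
    candidates,
    key=lambda m: prioritized_score(m["base"], m["access_count"], m["tags"], "python"),
    reverse=True,
)
for memory in ranked:
    print(memory["content"])
```

While working on a Python task, the frequently accessed Python memory overtakes the JavaScript memory despite its lower base score. Keeping the boosts additive and small means the base retrieval score still does most of the work.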

109. Practical Example: Building a Memory System with Embedding-Based Retrieval

Let us build a complete, working memory system that an agent can use for storing and retrieving memories using embeddings. We will walk through every component and explain the design decisions.

Full Implementation

python
"""
A complete embedding-based memory system for AI agents.

This implementation uses sentence-transformers for embeddings and
numpy for similarity computation. In production, you would use a
vector database (Chroma, Qdrant, pgvector) for scalability, but
this implementation makes every step transparent for learning.

Requirements:
    pip install sentence-transformers numpy
"""

import json
import math
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class Memory:
    """A single memory entry.

    This dataclass represents everything we store about a memory:
    - id: A unique identifier for retrieval and deletion.
    - content: The text of the memory itself.
    - timestamp: When the memory was created (for recency scoring).
    - memory_type: A category label ("episodic", "semantic", "preference").
    - importance: How important this memory is (0.0 to 1.0).
    - metadata: Arbitrary additional data (project name, task type, etc.).
    - access_count: How often this memory has been retrieved.
    - last_accessed: When this memory was last retrieved.
    - embedding: The vector representation of the content.
    """

    id: str
    content: str
    timestamp: str
    memory_type: str  # "episodic", "semantic", "preference"
    importance: float  # 0.0 to 1.0
    metadata: dict = field(default_factory=dict)
    access_count: int = 0
    last_accessed: str = ""
    embedding: list[float] = field(default_factory=list)


class AgentMemorySystem:
    """A complete memory system with embedding-based retrieval.

    This system supports:
    - Storing memories with automatic embedding generation
    - Semantic retrieval using cosine similarity
    - Importance-based scoring (Generative Agents style)
    - Recency-based scoring (exponential decay)
    - Persistence to disk (JSON format)
    - Access tracking (for frequency-based prioritization)

    Architecture overview:
    1. When a memory is stored, we generate an embedding vector using
       a sentence-transformer model. This vector captures the semantic
       meaning of the text.
    2. When a query comes in, we generate an embedding for the query
       and compare it to all stored memory embeddings using cosine
       similarity.
    3. The final score combines recency, importance, and semantic
       relevance (following Park et al., 2023).
    4. The top-k scoring memories are returned.
    """

    def __init__(
        self,
        persist_path: str | None = None,
        embedding_model: str = "all-MiniLM-L6-v2",
        decay_rate: float = 0.995,
    ):
        self.persist_path = Path(persist_path) if persist_path else None
        self.decay_rate = decay_rate
        self.memories: list[Memory] = []

        # Load the embedding model.
        # all-MiniLM-L6-v2 produces 384-dimensional embeddings and is
        # fast enough for real-time use. For higher quality, consider
        # BGE-large-en-v1.5 (1024 dimensions) or a similar model.
        print(f"Loading embedding model: {embedding_model}")
        self.encoder = SentenceTransformer(embedding_model)

        # Load persisted memories if available
        if self.persist_path and self.persist_path.exists():
            self._load()

    def store(
        self,
        content: str,
        memory_type: str = "episodic",
        importance: float = 0.5,
        metadata: dict | None = None,
    ) -> str:
        """Store a new memory with its embedding.

        This is the write path of the memory system. When a new memory
        is stored:
        1. A unique ID is generated.
        2. The content is embedded using the sentence-transformer model.
        3. The memory is appended to the in-memory list.
        4. If persistence is configured, the entire list is saved to disk.

        Args:
            content: The text content of the memory.
            memory_type: Category of memory (episodic, semantic, preference).
            importance: Importance score from 0.0 to 1.0.
            metadata: Additional metadata to store with the memory.

        Returns:
            The ID of the stored memory.
        """
        memory_id = str(uuid.uuid4())[:8]
        now = datetime.now().isoformat()

        # Generate embedding -- this is the key step that enables
        # semantic retrieval later. The embedding captures the
        # meaning of the content as a vector of numbers.
        embedding = self.encoder.encode(content).tolist()

        memory = Memory(
            id=memory_id,
            content=content,
            timestamp=now,
            memory_type=memory_type,
            importance=importance,
            metadata=metadata or {},
            access_count=0,
            last_accessed=now,
            embedding=embedding,
        )
        self.memories.append(memory)

        # Persist if configured
        if self.persist_path:
            self._save()

        return memory_id

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        memory_type: str | None = None,
        alpha: float = 1.0,   # Weight for recency
        beta: float = 1.0,    # Weight for importance
        gamma: float = 1.0,   # Weight for relevance
    ) -> list[dict]:
        """Retrieve the most relevant memories for a query.

        Uses a weighted combination of recency, importance, and
        semantic relevance (adapted from Park et al., 2023).

        The scoring formula is:
            score = alpha * recency + beta * importance + gamma * relevance

        where:
        - recency is exponential decay based on hours since creation
          (Park et al. decay from the time of last access instead)
        - importance is the stored importance value (0-1)
        - relevance is cosine similarity between query and memory embeddings
          (range -1 to 1; Park et al. min-max normalize each term to [0, 1])

        Args:
            query: The query text to search for.
            top_k: Number of memories to return.
            memory_type: Optional filter by memory type.
            alpha: Weight for recency in the scoring function.
            beta: Weight for importance in the scoring function.
            gamma: Weight for semantic relevance in the scoring function.

        Returns:
            List of memory dicts sorted by combined score, highest first.
        """
        if not self.memories:
            return []

        # Filter by type if specified
        candidates = self.memories
        if memory_type:
            candidates = [m for m in candidates if m.memory_type == memory_type]

        if not candidates:
            return []

        # Compute query embedding -- same model used for storage
        query_embedding = self.encoder.encode(query)
        now = datetime.now()

        scored_memories = []
        for memory in candidates:
            # Recency score: exponential decay based on hours elapsed
            mem_time = datetime.fromisoformat(memory.timestamp)
            hours_elapsed = (now - mem_time).total_seconds() / 3600.0
            recency = math.pow(self.decay_rate, hours_elapsed)

            # Importance score: directly from the memory
            importance = memory.importance

            # Relevance score: cosine similarity between embeddings
            mem_embedding = np.array(memory.embedding)
            q_embedding = np.array(query_embedding)
            cosine_sim = np.dot(mem_embedding, q_embedding) / (
                np.linalg.norm(mem_embedding) * np.linalg.norm(q_embedding) + 1e-8
            )
            relevance = float(cosine_sim)

            # Combined score using the Generative Agents formula
            score = alpha * recency + beta * importance + gamma * relevance

            scored_memories.append({
                "id": memory.id,
                "content": memory.content,
                "memory_type": memory.memory_type,
                "importance": memory.importance,
                "timestamp": memory.timestamp,
                "metadata": memory.metadata,
                "scores": {
                    "recency": round(recency, 4),
                    "importance": round(importance, 4),
                    "relevance": round(relevance, 4),
                    "total": round(score, 4),
                },
            })

        # Sort by total score descending
        scored_memories.sort(key=lambda x: x["scores"]["total"], reverse=True)

        # Update access counts for retrieved memories -- this enables
        # frequency-based analysis later
        top_ids = {m["id"] for m in scored_memories[:top_k]}
        for memory in self.memories:
            if memory.id in top_ids:
                memory.access_count += 1
                memory.last_accessed = now.isoformat()

        return scored_memories[:top_k]

    def format_for_prompt(self, memories: list[dict]) -> str:
        """Format retrieved memories for inclusion in an LLM prompt.

        This method transforms the structured memory data into a
        text format that can be injected into a prompt. It includes
        the memory type and relevance score so the LLM can see what
        kind of memory each item is and how closely it matches the query.
        """
        if not memories:
            return "No relevant memories found."

        lines = ["Relevant memories:"]
        for i, mem in enumerate(memories, 1):
            lines.append(
                f"  [{i}] ({mem['memory_type']}, relevance: {mem['scores']['relevance']:.2f}) "
                f"{mem['content']}"
            )
        return "\n".join(lines)

    def get_stats(self) -> dict:
        """Return statistics about the memory system."""
        type_counts = {}
        for memory in self.memories:
            type_counts[memory.memory_type] = type_counts.get(memory.memory_type, 0) + 1
        return {
            "total_memories": len(self.memories),
            "by_type": type_counts,
            "avg_importance": (
                sum(m.importance for m in self.memories) / len(self.memories)
                if self.memories else 0
            ),
        }

    def _save(self) -> None:
        """Persist memories to disk as JSON."""
        data = [asdict(m) for m in self.memories]
        with open(self.persist_path, "w") as f:
            json.dump(data, f, indent=2)

    def _load(self) -> None:
        """Load memories from disk."""
        with open(self.persist_path) as f:
            data = json.load(f)
        self.memories = [Memory(**entry) for entry in data]


# -- Usage Example ---------------------------------------------------

def main():
    """Demonstrate the memory system."""
    # Initialize memory system
    memory = AgentMemorySystem(persist_path="agent_memories.json")

    # Store various types of memories
    memory.store(
        "The user prefers concise explanations with code examples.",
        memory_type="preference",
        importance=0.8,
    )
    memory.store(
        "Successfully fixed a bug in the data pipeline by adding null checks.",
        memory_type="episodic",
        importance=0.6,
        metadata={"task": "bug_fix", "project": "data_pipeline"},
    )
    memory.store(
        "Python's asyncio library is used for concurrent I/O operations.",
        memory_type="semantic",
        importance=0.4,
    )
    memory.store(
        "The user's project uses PostgreSQL with pgvector for embeddings.",
        memory_type="semantic",
        importance=0.7,
        metadata={"project": "main_app"},
    )
    memory.store(
        "Previous attempt to use Redis for caching failed due to memory limits.",
        memory_type="episodic",
        importance=0.5,
        metadata={"task": "caching", "outcome": "failed"},
    )

    # Retrieve memories for a query
    print("=" * 60)
    print("Query: 'How should I store embeddings in the database?'")
    print("=" * 60)
    results = memory.retrieve("How should I store embeddings in the database?", top_k=3)
    print(memory.format_for_prompt(results))

    print()
    print("=" * 60)
    print("Query: 'What went wrong in past tasks?'")
    print("=" * 60)
    results = memory.retrieve("What went wrong in past tasks?", top_k=3, memory_type="episodic")
    print(memory.format_for_prompt(results))

    print()
    print("Memory stats:", memory.get_stats())


if __name__ == "__main__":
    main()

Expected Output

text
============================================================
Query: 'How should I store embeddings in the database?'
============================================================
Relevant memories:
  [1] (semantic, relevance: 0.72) The user's project uses PostgreSQL with pgvector for embeddings.
  [2] (semantic, relevance: 0.34) Python's asyncio library is used for concurrent I/O operations.
  [3] (episodic, relevance: 0.28) Previous attempt to use Redis for caching failed due to memory limits.

============================================================
Query: 'What went wrong in past tasks?'
============================================================
Relevant memories:
  [1] (episodic, relevance: 0.51) Previous attempt to use Redis for caching failed due to memory limits.
  [2] (episodic, relevance: 0.29) Successfully fixed a bug in the data pipeline by adding null checks.

Memory stats: {'total_memories': 5, 'by_type': {'preference': 1, 'episodic': 2, 'semantic': 2}, 'avg_importance': 0.6}

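Notice how the first query retrieves the pgvector memory first (semantically relevant to "embeddings in the database") even though the query never mentions "pgvector" or "PostgreSQL." This is the power of semantic search. The second query, filtered to episodic memories only, retrieves the failure experience first, which is exactly what you would want when asking about past problems. The relevance values shown are illustrative: exact numbers depend on the embedding model, and because results are ranked by the combined score (recency + importance + relevance), a high-importance memory can outrank a more relevant one, so the order you see may differ.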
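To make the ranking arithmetic concrete, here is a minimal sketch that recomputes the combined score outside the class. The helper name combined_score and the two example memories are hypothetical, and decay_rate=0.995 is an assumption (it is the decay factor reported by Park et al.; AgentMemorySystem reads its own value from self.decay_rate).

```python
import math


def combined_score(
    hours_elapsed: float,
    importance: float,
    relevance: float,
    decay_rate: float = 0.995,  # assumed value, not taken from the class
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Weighted sum of recency, importance, and relevance."""
    recency = math.pow(decay_rate, hours_elapsed)
    return alpha * recency + beta * importance + gamma * relevance


# Two hypothetical memories, both stored just now (recency = 1.0).
more_relevant = combined_score(0, importance=0.3, relevance=0.6)
more_important = combined_score(0, importance=0.9, relevance=0.2)
print(round(more_relevant, 2), round(more_important, 2))  # 1.9 2.1

# Raising gamma makes relevance dominate and flips the order.
relevance_heavy_a = combined_score(0, 0.3, 0.6, gamma=3.0)
relevance_heavy_b = combined_score(0, 0.9, 0.2, gamma=3.0)
print(relevance_heavy_a > relevance_heavy_b)  # True

# Recency after one week (168 hours) at the assumed decay rate.
week_old_recency = math.pow(0.995, 168)
```

With equal weights, the less relevant but more important memory wins. If retrieval should follow semantic similarity more closely, one option is to raise gamma relative to alpha and beta.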

Try It Yourself: Extend this implementation with a forget() method that removes memories older than 90 days with importance below 0.3. Then add a reflect() method that uses an LLM to generate higher-level insights from the stored episodic memories.


1110. Advanced Topics

Memory Architectures in State-of-the-Art Systems

Several recent systems showcase sophisticated memory architectures:

MemGPT (Packer et al., 2023): Treats LLM context management as an operating system problem, with a virtual memory hierarchy that pages information in and out of the context window. Just as an operating system manages physical RAM by paging data to and from disk, MemGPT manages the LLM's context window by paging information to and from external storage. The system uses function calls to manage its own memory: when the context gets full, it writes less-important information to "disk" (an external database) and reads in more relevant information. This enables conversations that extend far beyond the context window, with the agent maintaining coherent long-term context.
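The paging idea can be illustrated with a toy sketch. This is not MemGPT's implementation: the class PagedContext, its word-count token estimate, the oldest-first eviction policy, and the keyword search are all simplifying assumptions. In MemGPT the model itself decides what to move through function calls.

```python
from collections import deque


class PagedContext:
    """Toy context manager: a bounded main context plus an unbounded archive."""

    def __init__(self, max_tokens: int = 50) -> None:
        self.max_tokens = max_tokens
        self.context: deque[str] = deque()  # what the LLM would see
        self.archive: list[str] = []        # stands in for external storage

    @staticmethod
    def _tokens(text: str) -> int:
        # Crude token estimate: whitespace-separated words.
        return len(text.split())

    def _used(self) -> int:
        return sum(self._tokens(m) for m in self.context)

    def append(self, message: str) -> None:
        """Add a message, paging out the oldest ones while over budget."""
        self.context.append(message)
        while self._used() > self.max_tokens and len(self.context) > 1:
            self.archive.append(self.context.popleft())

    def search_archive(self, keyword: str) -> list[str]:
        """Return archived messages containing the keyword (naive search)."""
        return [m for m in self.archive if keyword.lower() in m.lower()]


ctx = PagedContext(max_tokens=8)
ctx.append("user is allergic to shellfish")  # 5 words
ctx.append("user likes hiking trips")        # 9 words total, oldest is evicted
ctx.append("plan a dinner")                  # 7 words total, fits

print(list(ctx.context))   # ['user likes hiking trips', 'plan a dinner']
print(ctx.search_archive("shellfish"))  # ['user is allergic to shellfish']
```

The allergy note has left the context window but is still recoverable from the archive, which is the property that lets a conversation outgrow the window.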

Voyager (Wang et al., 2023): An LLM-powered agent for Minecraft that maintains a skill library -- a form of procedural memory where successful action sequences are stored as reusable JavaScript functions. When the agent discovers how to build a wooden pickaxe, it saves that skill as a function. Later, when it needs a pickaxe, it can recall and execute the saved skill instead of figuring it out from scratch. This is analogous to how humans develop motor memory: the first time you tie your shoes, it requires conscious effort. After practice, it becomes automatic -- stored as procedural memory rather than declarative knowledge.
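A skill library of this kind can be sketched in a few lines of Python. Everything here is hypothetical and simplified: Voyager stores JavaScript programs and retrieves them by embedding similarity, whereas this sketch stores Python callables and ranks them by word overlap between the query and each skill description.

```python
from typing import Callable


class SkillLibrary:
    """Toy procedural memory: named, described, reusable functions."""

    def __init__(self) -> None:
        self._skills: dict[str, tuple[str, Callable[..., object]]] = {}

    def add(self, name: str, description: str, fn: Callable[..., object]) -> None:
        self._skills[name] = (description, fn)

    def find(self, query: str, top_k: int = 3) -> list[str]:
        """Rank skills by how many words the query shares with the description."""
        query_words = set(query.lower().split())
        scored = []
        for name, (description, _) in self._skills.items():
            overlap = len(query_words & set(description.lower().split()))
            if overlap:
                scored.append((overlap, name))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [name for _, name in scored[:top_k]]

    def run(self, name: str, *args, **kwargs):
        _, fn = self._skills[name]
        return fn(*args, **kwargs)


def craft_wooden_pickaxe(inventory: dict[str, int]) -> dict[str, int]:
    """Hypothetical skill: consume 3 planks and 2 sticks, yield 1 pickaxe."""
    if inventory.get("planks", 0) < 3 or inventory.get("sticks", 0) < 2:
        raise ValueError("not enough materials")
    updated = dict(inventory)
    updated["planks"] -= 3
    updated["sticks"] -= 2
    updated["wooden_pickaxe"] = updated.get("wooden_pickaxe", 0) + 1
    return updated


library = SkillLibrary()
library.add(
    "craft_wooden_pickaxe",
    "craft a wooden pickaxe from planks and sticks",
    craft_wooden_pickaxe,
)

matches = library.find("need a wooden pickaxe")
result = library.run(matches[0], {"planks": 4, "sticks": 2})
print(matches)  # ['craft_wooden_pickaxe']
print(result)   # {'planks': 1, 'sticks': 0, 'wooden_pickaxe': 1}
```

Once a skill is saved, later tasks pay only the cost of a lookup and a call rather than rediscovering the action sequence.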

Cognitive Architectures for Language Agents (CoALA) (Sumers et al., 2024): A framework that organizes agent memory along lines drawn from human cognition: working memory (the agent's current context) plus three kinds of long-term memory, namely episodic memory (past experiences), semantic memory (general knowledge), and procedural memory (skills, held in the LLM's weights and the agent's code). This framework provides a principled way to design memory systems for language agents, drawing on decades of cognitive science research. The key contribution is a taxonomy of memory designs and a set of design principles for choosing between them.

Memory and Privacy

An important consideration in agent memory systems is privacy. The more an agent remembers, the more sensitive data it potentially stores.

  • What should be remembered? Agents should have clear policies about what information they store, especially personal data. A medical assistant should remember that a patient is allergic to penicillin, but should it remember every complaint the patient mentioned in passing?
  • Right to be forgotten: Users should be able to request deletion of their stored memories. This is not just good practice -- regulations such as the GDPR (Article 17, the right to erasure) require it in many circumstances.
  • Access control: In multi-user systems, memories must be properly scoped to prevent information leakage. An agent's memories about User A must not be accessible when it is serving User B.
  • Encryption: Sensitive memories should be encrypted at rest. If the memory database is compromised, the data should be unreadable without the encryption key.
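Two of these points, access control and deletion on request, can be sketched with a store that keeps one bucket per user. The class ScopedMemoryStore and its method names are hypothetical, and keyword matching stands in for real retrieval. Encryption at rest is deliberately left out, since it depends on the storage backend and key management in use.

```python
from collections import defaultdict


class ScopedMemoryStore:
    """Toy store that keeps each user's memories in a separate bucket."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = defaultdict(list)

    def store(self, user_id: str, content: str) -> None:
        self._by_user[user_id].append(content)

    def retrieve(self, user_id: str, keyword: str) -> list[str]:
        """Search only the requesting user's bucket (access control)."""
        own = self._by_user.get(user_id, [])
        return [m for m in own if keyword.lower() in m.lower()]

    def forget_user(self, user_id: str) -> int:
        """Delete every memory for a user; return how many were removed."""
        return len(self._by_user.pop(user_id, []))


store = ScopedMemoryStore()
store.store("user_a", "Allergic to penicillin")
store.store("user_b", "Prefers morning appointments")

print(store.retrieve("user_a", "penicillin"))  # ['Allergic to penicillin']
print(store.retrieve("user_b", "penicillin"))  # []
print(store.forget_user("user_a"))             # 1
print(store.retrieve("user_a", "penicillin"))  # []
```

Because every read takes a user_id and never searches across buckets, a query made while serving one user cannot surface another user's memories.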

Memory Consistency

When memories conflict, the agent needs a resolution strategy:

  • Recency bias: Prefer the most recent memory (assumes more recent information is more accurate). This is usually a reasonable default.
  • Source credibility: Prefer memories from more reliable sources. A fact from an official document outweighs a fact from casual conversation.
  • Confidence weighting: Prefer memories with higher confidence scores.
  • Explicit contradiction detection: Flag conflicting memories for human review rather than silently choosing one.
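The strategies above can be combined in a small resolver. This is a sketch under stated assumptions: the Fact dataclass, its credibility field, and the rule "credibility first, recency as tiebreaker" are illustrative choices, not a standard algorithm.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Fact:
    value: str
    timestamp: datetime
    credibility: float  # hypothetical 0-1 reliability score for the source


def resolve(facts: list[Fact]) -> tuple[Fact, bool]:
    """Return (winner, needs_review) for facts about the same subject.

    The winner has the highest source credibility; ties go to the most
    recent fact. needs_review is True when the stored values disagree.
    """
    winner = max(facts, key=lambda f: (f.credibility, f.timestamp))
    needs_review = len({f.value for f in facts}) > 1
    return winner, needs_review


facts = [
    Fact("PostgreSQL", datetime(2024, 1, 15), credibility=0.9),  # project docs
    Fact("MySQL", datetime(2024, 3, 1), credibility=0.5),        # casual chat
]
winner, needs_review = resolve(facts)
print(winner.value, needs_review)  # PostgreSQL True
```

The older but more credible fact wins here, and the disagreement is still flagged so that a human or a later verification step can confirm it.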

Key Insight: Memory consistency is an unsolved problem in agent systems. In practice, most systems use recency as the tiebreaker, but this can lead to errors when outdated information is mistakenly re-entered. Robust contradiction detection and resolution remain active research areas.


12Discussion Questions

  1. Memory and identity: If an agent's memories are erased, is it still the "same" agent? How does memory contribute to the notion of agent identity or continuity? Consider: if you lost all your memories, would you still be "you"? What does this tell us about the relationship between memory and identity in AI systems?

  2. The right to forget: Should AI agents have a "right to be forgotten" mechanism where users can request deletion of specific memories? What are the implications for agent capability versus user privacy? Consider the tension: an agent that remembers everything is more capable, but a user who cannot control what is remembered may feel surveilled.

  3. Memory manipulation: If memories shape an agent's behavior, what are the risks of adversarial memory injection (deliberately feeding an agent false memories to manipulate its future behavior)? How could you design defenses against this? Hint: consider source attribution, confidence scoring, and consistency checking.

  4. Cognitive fidelity: How closely should AI agent memory systems mirror human memory? Are there cases where non-human memory architectures (e.g., perfect recall with no forgetting) would be more useful? Hint: consider that human forgetting sometimes causes errors but also serves useful functions like generalization and noise filtering.

  5. Memory as a moat: In commercial AI systems, the accumulated memories of user interactions could be a significant competitive advantage. What are the ethical implications of this? Should users own their agent's memories and be able to port them to a different system?

  6. Emergent behavior from memory: The Generative Agents paper (Park et al., 2023) observed emergent social behaviors arising from memory and reflection. What other emergent behaviors might arise from sophisticated memory systems? Could negative behaviors (grudges, biases, manipulation) also emerge?


13Summary and Key Takeaways

  1. Memory transforms stateless LLMs into stateful agents. Without memory, agents cannot maintain context, learn from experience, or personalize their behavior. Memory is arguably the single most important infrastructure component for building capable agents.

  2. Multiple types of memory serve different purposes:

    • Short-term memory (conversation context) for immediate interaction.
    • Working memory (task state) for multi-step execution.
    • Episodic memory for learning from past experiences.
    • Semantic memory for accumulating knowledge.

  3. No single implementation fits all needs. In-context memory is simple but limited; vector databases enable semantic search; key-value stores suit structured data; graph databases capture relationships. Most production systems use a hybrid approach.

  4. The Generative Agents scoring function (recency + importance + relevance) provides a principled framework for memory retrieval that balances multiple factors. It is a good starting point for any memory system.

  5. Memory management is essential. Summarization, forgetting, and prioritization prevent memory systems from becoming unwieldy and ensure the most relevant information is surfaced. Remembering everything is not the goal -- remembering the right things is.

  6. Memory raises ethical questions about privacy, data ownership, and the potential for manipulation. These are not theoretical concerns -- they must be addressed in any production deployment.


14References

  1. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST).

  2. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 35.

  3. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).

  4. Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.

  5. Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.

  6. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics (TACL), 12.

  7. Sumers, T. R., Yao, S., Narasimhan, K., & Griffiths, T. L. (2024). Cognitive Architectures for Language Agents. Transactions on Machine Learning Research (TMLR).

  8. Beurer-Kellner, L., Fischer, M., & Vechev, M. (2023). Prompting Is Programming: A Query Language for Large Language Models. Proceedings of the ACM on Programming Languages (PLDI), 7.


Part of "Agentic AI: Foundations, Architectures, and Applications" (CC BY-SA 4.0).