Architectures · W07 · 25 min read

Retrieval-Augmented Generation for Agents

Full RAG pipeline: chunking strategies, embeddings, hybrid retrieval, re-ranking, advanced patterns (iterative, self-RAG, corrective RAG, agentic RAG). Evaluation with retrieval recall and faithfulness. Standard RAG vs agentic RAG.

Core concepts: Embeddings, Re-ranking, Agentic RAG

Learning Objectives

By the end of this lecture, students will be able to:

Explain the core motivation behind Retrieval-Augmented Generation (RAG) and its advantages over relying solely on parametric knowledge.
Design and implement an indexing pipeline covering document chunking, embedding generation, and vector storage.
Compare dense, sparse, and hybrid retrieval strategies and select the appropriate one for a given use case.
Describe advanced RAG techniques including re-ranking, query transformation, and multi-step retrieval.
Distinguish between standard RAG and agentic RAG, where the agent autonomously decides when and what to retrieve.
Evaluate RAG systems using appropriate metrics for faithfulness, relevance, and completeness.
Build a working RAG pipeline in Python using embeddings and a vector store.

1. RAG Fundamentals

The Problem RAG Solves

Large Language Models have a fundamental limitation: their knowledge is frozen at training time. An LLM trained on data up to January 2024 knows nothing about events after that date. Moreover, LLMs have no access to private, proprietary, or domain-specific data unless it happened to be in the training corpus.

Consider an analogy. An LLM is like a person who read millions of books years ago and remembers most of what they read, but has not read anything since. They can discuss history, science, and literature with impressive fluency, but they cannot tell you what happened yesterday, what your company's internal policies say, or what is in the document you wrote last week. Their knowledge is vast but static.

This leads to several concrete problems:

Hallucination: When asked about information not in their training data, LLMs may generate plausible-sounding but incorrect answers rather than admitting ignorance. Ask a model about a paper published after its training cutoff and it may fabricate a convincing but entirely fictional citation.
Staleness: Information changes over time. Training data becomes outdated. A model trained in 2023 does not know about API changes, new regulations, or leadership changes that happened in 2024.
Lack of provenance: LLMs cannot cite sources for their claims because they do not "know" where their knowledge came from. The knowledge is diffused across billions of parameters, not traceable to specific documents.
No access to private data: Enterprise documents, personal notes, and proprietary databases are not available to the model. This is the most critical limitation for business applications.

Key Insight: The fundamental problem RAG solves is not that LLMs are unintelligent -- it is that they are uninformed about anything outside their training data. RAG gives LLMs the ability to look things up, transforming them from closed-book exam takers into open-book researchers.

The RAG Solution

Retrieval-Augmented Generation, introduced by Lewis et al. (2020), addresses these limitations by combining a retrieval component with a generation component. The idea is beautifully simple:

Retrieve: Given a query, search an external knowledge base to find relevant documents or passages.
Augment: Insert the retrieved information into the LLM's prompt as additional context.
Generate: The LLM generates its response conditioned on both the query and the retrieved context.
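The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: `retrieve` uses naive word overlap as a stand-in for a real retriever, and `generate` is a hypothetical callable standing in for whatever LLM client you use.

```python
from typing import Callable


def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(passage.lower().split())), passage)
        for passage in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:top_k] if score > 0]


def augment(query: str, passages: list[str]) -> str:
    """Insert the retrieved passages into the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def rag_answer(
    query: str, knowledge_base: list[str], generate: Callable[[str], str]
) -> str:
    """Retrieve, augment, then generate."""
    passages = retrieve(query, knowledge_base)
    return generate(augment(query, passages))


kb = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
]
# A fake 'LLM' that echoes its prompt, so the example runs offline.
print(rag_answer("What is the refund window?", kb, generate=lambda prompt: prompt))
```

Swapping in a real retriever or a real model only changes `retrieve` and `generate`; the augment step in the middle stays the same.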

[Interactive figure: RAG Pipeline: Indexing and Retrieval. From document to answer: index, retrieve, rerank, and augment generation. The offline indexing steps (starting with chunking, which splits documents into semantically coherent pieces) run ahead of time; the online steps run on every query, from query to answer.]

The analogy here is a student taking an open-book exam versus a closed-book exam. In a closed-book exam (standard LLM), the student must rely entirely on what they memorized. In an open-book exam (RAG), the student can look up information in their notes and textbooks. The student still needs intelligence to understand the question, find the right information, and synthesize a coherent answer, but they are no longer limited by the gaps in their memory.

The original RAG paper (Lewis et al., 2020) proposed two variants:

RAG-Sequence: The same retrieved documents are used to generate the entire output sequence. Think of this as looking up a reference and then writing your entire answer based on it.
RAG-Token: Different documents can be retrieved and used for different tokens in the output. Think of this as consulting different references for different parts of your answer.

In practice, most modern RAG systems use the sequence variant (or a simplified version of it) because it is simpler and works well enough for most applications.
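For reference, the two variants differ only in where the sum over retrieved documents sits. In the notation of Lewis et al. (2020), with x the input, y the output of length N, z a retrieved document, p_η the retriever, and p_θ the generator:

```latex
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```

In RAG-Sequence the sum is outside the product, so one document supports the whole sequence; in RAG-Token the sum is inside, so each token can draw on a different document.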

Why RAG Matters for Agents

For AI agents, RAG is not just about answering questions. It enables a fundamentally different mode of operation:

Grounded actions: An agent can look up documentation before calling an API, reducing errors. Instead of guessing at API parameters, it retrieves the actual specification.
Dynamic knowledge: An agent can retrieve up-to-date information rather than relying on training data. A financial agent can look up today's stock prices rather than hallucinating outdated numbers.
Domain specialization: By retrieving from domain-specific corpora, a general-purpose agent can perform specialized tasks. The same agent can answer questions about tax law, medical guidelines, or software documentation -- whatever corpus it has access to.
Transparency: Retrieved sources can be cited, making the agent's reasoning auditable. A user can verify the agent's claims by checking the cited sources.

Common Misconception: "RAG is just a fancy search engine." No. RAG combines search with reasoning. A search engine returns documents; RAG reads those documents, synthesizes the information, and generates a coherent answer that directly addresses the user's question. The generation step is what makes RAG powerful.

2. The Indexing Pipeline

Before retrieval can happen, documents must be processed and stored in a searchable format. This is the indexing pipeline -- the offline preparation step that happens before any queries are processed. Getting the indexing pipeline right is critical: garbage in, garbage out. If documents are poorly chunked or poorly embedded, no amount of retrieval sophistication will produce good results.

Step 1: Document Loading

Documents come in many formats: PDFs, web pages, Markdown files, database records, emails, code files, and more. The first step is normalizing them into plain text.

python

from pathlib import Path


def load_text_file(path: str) -> str:
    """Load a plain text or markdown file."""
    return Path(path).read_text(encoding="utf-8")


def load_documents(directory: str, extensions: tuple = (".txt", ".md")) -> list[dict]:
    """Load all documents from a directory.

    This simple loader handles text and markdown files. In production,
    you would use specialized loaders for different formats:
    - PDF: pypdf (the successor to PyPDF2), pdfplumber, or unstructured
    - DOCX: python-docx
    - HTML: BeautifulSoup
    - Code: tree-sitter for syntax-aware parsing
    """
    documents = []
    for path in Path(directory).rglob("*"):
        if path.suffix in extensions:
            documents.append({
                "content": path.read_text(encoding="utf-8"),
                "source": str(path),
                "filename": path.name,
            })
    return documents

For production systems, libraries like LangChain and LlamaIndex provide document loaders for dozens of formats (PDF, DOCX, HTML, Notion, Confluence, Google Docs, etc.). The key challenge with document loading is preserving structure: a PDF table, a code block, or a bulleted list all have structure that is lost when converted to plain text. Sophisticated loaders attempt to preserve this structure, which improves downstream chunking and retrieval quality.

Step 2: Chunking

Documents are typically too long to embed as a single vector or to fit in an LLM's context as a single piece of retrieved context. Chunking splits documents into smaller, semantically coherent pieces.

Why not embed entire documents? Two reasons. First, embedding models have a maximum input length (often 256-512 tokens for small open-source models, though some newer models accept several thousand). Second, even if they could handle longer inputs, the embedding of a long document would be a "blurred average" of all the topics in that document, making it hard to match against specific queries. A 50-page report about company finances might discuss revenue, expenses, headcount, strategy, and risks. A single embedding for the whole document would be a vague blend of all these topics. Chunking allows each topic to have its own embedding, making retrieval more precise.

The analogy is an index at the back of a textbook. The index does not have one entry for the entire book -- it has entries for specific topics on specific pages. Chunking creates the equivalent of those specific index entries.

Chunking Strategies

Fixed-size chunking: Split text into chunks of N characters or tokens with optional overlap.

python

def fixed_size_chunks(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    The overlap parameter is important: without it, a sentence that
    happens to fall right at a chunk boundary would be split in half,
    with the first part in one chunk and the second in the next.
    With overlap, text near a boundary appears in both chunks, so a
    boundary sentence no longer than the overlap is normally kept
    whole in at least one of them.

    Args:
        text: The input text to chunk.
        chunk_size: Maximum number of characters per chunk.
        overlap: Number of characters to overlap between chunks.
            Must be smaller than chunk_size.

    Returns:
        List of text chunks.
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # Last chunk reached; avoid emitting a duplicate tail
        start = end - overlap
    return [c for c in chunks if c]  # Remove empty chunks

Sentence-based chunking: Split at sentence boundaries, grouping sentences until the chunk reaches a target size. This avoids the problem of cutting sentences in half.

python

import re


def sentence_based_chunks(
    text: str, max_chunk_size: int = 500
) -> list[str]:
    """Split text into chunks at sentence boundaries.

    This preserves sentence integrity -- no sentence is ever split
    across two chunks. Sentences are grouped together until the
    chunk reaches the target size, then a new chunk begins.

    The trade-off: chunk sizes are variable. Many chunks end up
    shorter than max_chunk_size, and a single sentence longer than
    max_chunk_size becomes its own oversized chunk.
    """
    # Simple sentence splitting (production systems use spaCy or nltk)
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence) + 1  # +1 for the joining space
        if current_size + sentence_size > max_chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_size = 0
        current_chunk.append(sentence)
        current_size += sentence_size

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

Recursive character splitting: Split by paragraphs first, then by sentences, then by words if needed. This is the approach used by LangChain's RecursiveCharacterTextSplitter. The idea is to use the most natural boundary possible: paragraphs are better than sentences, which are better than arbitrary character positions.
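The recursive idea can be sketched without any library. This is a simplified illustration and not LangChain's implementation: the function name, the separator list, and the merging rule are assumptions chosen for brevity, and separators that fall on a chunk boundary are dropped.

```python
def recursive_split(
    text: str,
    max_size: int = 200,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only on oversized pieces."""
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard character cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_size:
            current = candidate  # Keep merging small pieces together.
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            # This piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = "Short intro paragraph.\n\n" + "word " * 100
for chunk in recursive_split(doc, max_size=80):
    print(len(chunk), repr(chunk[:30]))
```

The short paragraph survives as its own chunk because the paragraph separator is tried first; only the long paragraph is broken down further.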

Semantic chunking: Use an embedding model to find natural breakpoints where the topic changes. Compute embeddings for each sentence, then split where the similarity between consecutive sentences drops below a threshold. This produces the most coherent chunks but is computationally expensive.
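A minimal sketch of the breakpoint rule follows. To keep it runnable offline, `toy_embed` is a bag-of-words stand-in and not a real embedding model; in practice you would pass a function backed by a sentence-embedding model. The threshold of 0.2 is an arbitrary assumption that would need tuning.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in embedding: word-count vector (NOT a real embedding model)."""
    return dict(Counter(re.findall(r"[a-z]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Start a new chunk where consecutive-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])  # Similarity dropped: treat as topic change.
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Revenue grew in the third quarter.",
    "Revenue growth in the quarter came from subscriptions.",
    "Headcount rose after new engineering hires.",
    "Engineering hires increased headcount in Berlin.",
]
for chunk in semantic_chunks(sentences):
    print("-", chunk)
```

The cost mentioned above comes from `embed`: every sentence in the corpus is embedded once just to decide where to cut, before any chunk is embedded for indexing.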

Document-structure-aware chunking: For structured documents (HTML, Markdown, code), use the document's own structure (headers, sections, functions) to define chunk boundaries. This is usually the best approach when document structure is available -- a Markdown file with ## Headers provides natural chunk boundaries.
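For Markdown, a structure-aware splitter can be as simple as cutting at heading lines. The sketch below is illustrative: the function name and the `(preamble)` label are assumptions, and it does not handle `#` lines inside fenced code blocks, which a real implementation would need to skip.

```python
import re


def markdown_section_chunks(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading-delimited section."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks: list[dict] = []
    title, body = "(preamble)", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": title, "content": text})

    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "intro text\n# Setup\nInstall it.\n## Usage\nRun it.\n"
for chunk in markdown_section_chunks(doc):
    print(chunk)
```

Keeping the heading alongside each chunk is a deliberate choice: it can be stored as metadata or prepended to the chunk text so that the chunk still makes sense on its own.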

Choosing a Chunking Strategy

Strategy	Pros	Cons	Best For
Fixed-size	Simple, predictable	May split mid-sentence	Prototyping
Sentence-based	Preserves sentence integrity	Chunk sizes vary	General text
Recursive	Good balance of coherence and size	More complex	Production systems
Semantic	Most coherent chunks	Expensive, slow	High-quality retrieval
Structure-aware	Respects document organization	Format-dependent	Structured documents

Chunk Size Considerations

Chunk size is a critical hyperparameter -- one of the most impactful decisions you make in a RAG system. Think of it as the granularity of your index:

Too small (< 100 tokens): Chunks lack sufficient context. A chunk saying "The answer is 42" is useless without knowing what question was being answered. Retrieved chunks may be fragments that do not make sense on their own.
Too large (> 1000 tokens): Chunks may contain multiple topics, reducing retrieval precision. When you search for "revenue growth," you get back a chunk that also discusses headcount, expenses, and strategy. This dilutes the relevant information and wastes context window space.
Sweet spot (200-500 tokens): Generally works well for most applications. This is enough to contain a complete thought or paragraph while remaining focused on a single topic.

Key Insight: Chunk size should be tuned empirically for your specific data and use case. There is no universal optimal value. Some teams run systematic experiments varying chunk size from 100 to 1000 tokens and measure retrieval quality (precision and recall) to find the sweet spot for their data.

Step 3: Embedding Generation

Each chunk is converted into a dense vector representation using an embedding model. This is the step that enables semantic search: similar concepts get similar vectors, regardless of the exact words used.

python

from sentence_transformers import SentenceTransformer


def generate_embeddings(
    chunks: list[str], model_name: str = "all-MiniLM-L6-v2"
) -> list[list[float]]:
    """Generate embeddings for a list of text chunks.

    Each chunk is converted into a fixed-size vector (e.g., 384
    dimensions for MiniLM) that captures its semantic meaning.
    Similar texts get similar vectors, enabling similarity search.

    Args:
        chunks: List of text strings to embed.
        model_name: Name of the sentence-transformers model to use.

    Returns:
        List of embedding vectors.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, show_progress_bar=True)
    return embeddings.tolist()

Choosing an Embedding Model

The choice of embedding model significantly affects retrieval quality. This choice is often more impactful than the choice of LLM for generation. Key considerations:

Dimensionality: Higher dimensions capture more nuance but require more storage and compute. Common values: 384 (MiniLM), 768 (BERT-base), 1024 (large models), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large).
Training objective: Models trained on retrieval tasks (e.g., with contrastive learning on query-document pairs) generally outperform models trained on other objectives.
Domain specificity: General-purpose models may underperform on specialized domains (legal, medical, scientific). Fine-tuned models can help.
Multilingual support: If your corpus is multilingual, use a multilingual embedding model.

Popular embedding models (as of 2024-2025):

Model	Dimensions	Open Source	Notes
all-MiniLM-L6-v2	384	Yes	Fast, good for prototyping
BGE-large-en-v1.5	1024	Yes	Strong performance on MTEB
E5-mistral-7b-instruct	4096	Yes	Instruction-tuned, state-of-the-art
text-embedding-3-small	1536	No (OpenAI)	Good quality-to-cost ratio
text-embedding-3-large	3072	No (OpenAI)	Highest quality from OpenAI
Cohere embed-v3	1024	No (Cohere)	Supports compression

The MTEB (Massive Text Embedding Benchmark) leaderboard (Muennighoff et al., 2023) provides comprehensive comparisons of embedding models across many tasks. Always check the leaderboard for the latest models before choosing.

Try It Yourself: Take the same sentence and embed it with two different models (e.g., MiniLM and BGE-large). Then embed a semantically similar sentence and a semantically different one. Compare the cosine similarities. You will see that better models produce larger gaps between similar and dissimilar pairs.

Step 4: Vector Storage

The embeddings and their associated text chunks are stored in a vector database. This is where the indexed data lives and where retrieval queries are executed.

python

import chromadb


def build_index(
    chunks: list[str],
    sources: list[str],
    collection_name: str = "documents",
) -> chromadb.Collection:
    """Build a vector index from text chunks.

    ChromaDB handles embedding generation internally using its
    default model, so we only need to provide the text chunks.
    The 'cosine' distance metric is used for similarity search,
    which measures the angle between vectors (direction similarity)
    rather than their absolute distance.

    Args:
        chunks: List of text chunks.
        sources: List of source identifiers (one per chunk).
        collection_name: Name for the vector collection.

    Returns:
        A ChromaDB collection with the indexed documents.
    """
    client = chromadb.Client()
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # Use cosine similarity
    )

    # ChromaDB generates embeddings automatically using its default model
    collection.add(
        documents=chunks,
        metadatas=[{"source": src} for src in sources],
        ids=[f"chunk_{i}" for i in range(len(chunks))],
    )

    return collection

3. Vector Similarity Search

Cosine Similarity

The most common similarity metric for embeddings. It measures the cosine of the angle between two vectors, ranging from -1 (opposite directions) to 1 (same direction).

Imagine two arrows pointing from the origin. Cosine similarity measures how much they point in the same direction, regardless of how long they are. A short arrow and a long arrow pointing in the same direction have a cosine similarity of 1, exactly as two arrows of equal length would.

python

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors.

    The formula: cos(theta) = (A . B) / (|A| * |B|)

    Where A . B is the dot product (sum of element-wise products)
    and |A| is the magnitude (Euclidean norm) of vector A.
    """
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

Properties:

Invariant to vector magnitude (only direction matters). This is important because embedding models may produce vectors of different magnitudes for different inputs, but we care about semantic direction, not magnitude.
Works well for normalized embeddings. Many embedding models output unit-length vectors or are commonly used with a normalization step.
Most embedding models are trained with cosine similarity in mind.

Other Distance Metrics

Euclidean (L2) distance: Measures the straight-line distance between vectors. Sensitive to magnitude. Two vectors that point in the same direction but have very different magnitudes will have a large Euclidean distance despite being semantically similar.
Dot product (inner product): Identical to cosine similarity when vectors are normalized to unit length. Faster to compute because it skips the normalization step.
Manhattan (L1) distance: Sum of absolute differences. More robust to outliers.

In practice, cosine similarity and dot product are used in the vast majority of RAG systems. When vectors are normalized to unit length, the two give exactly the same ranking.
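The relationship between these metrics is easy to check numerically. The sketch below uses two small hand-picked vectors (an assumption for illustration) and shows that, after normalization, the dot product equals the cosine similarity and the squared Euclidean distance equals 2 - 2 * cosine.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(a: list[float]) -> list[float]:
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [8.0, 6.0]  # Lengths 5 and 10: same-ish direction, different magnitude.
a_hat, b_hat = normalize(a), normalize(b)

cos_ab = cosine(a, b)
raw_dot = dot(a, b)  # Depends on magnitude.
unit_dot = dot(a_hat, b_hat)  # Matches the cosine once vectors are unit length.
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print("cosine:", round(cos_ab, 4))
print("raw dot product:", round(raw_dot, 4))
print("dot product of unit vectors:", round(unit_dot, 4))
print("squared L2 of unit vectors:", round(squared_l2, 4), "vs 2 - 2*cos:", round(2 - 2 * cos_ab, 4))
```

Because squared L2 distance on unit vectors is a decreasing function of cosine similarity, all three metrics rank neighbors identically once embeddings are normalized.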

Approximate Nearest Neighbors (ANN)

Exact nearest neighbor search has O(n) complexity per query -- you must compare the query against every stored vector. For a collection of 1 million documents, that is 1 million similarity computations per query. This cost grows linearly with the collection and becomes too slow for interactive applications at larger scales or higher query volumes.
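A brute-force search makes the linear cost concrete. This is a minimal pure-Python sketch with hypothetical toy vectors; real systems would vectorize the loop, but the number of comparisons per query is the same.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query: list[float], vectors: list[list[float]], k: int = 3
) -> list[tuple[int, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query, vector), index) for index, vector in enumerate(vectors))
    best = heapq.nlargest(k, scored)  # One pass over all n vectors.
    return [(index, score) for score, index in best]


stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
for index, score in exact_top_k([1.0, 0.0], stored, k=2):
    print(index, round(score, 4))
```

Every query touches every vector, so doubling the collection doubles the work. That is the cost ANN indexes are designed to avoid.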

Approximate Nearest Neighbor algorithms trade a small amount of accuracy for dramatically better performance. Instead of guaranteeing the absolute best matches, they find matches that are very likely to be the best (typically 95-99% of the time).

HNSW (Hierarchical Navigable Small World): The most popular ANN algorithm, used by Chroma, Qdrant, and pgvector. It builds a multi-layer graph where higher layers provide coarse navigation and lower layers provide fine-grained search. Think of it like zooming into a map: the top layer is the country level, the next is the region, then the city, then the neighborhood. Each layer narrows the search space. Typical recall: 95-99% at 10-100x speedup over brute force.

IVF (Inverted File Index): Clusters vectors into groups and only searches the nearest clusters. Available in FAISS (which also offers HNSW). It is like organizing a library into sections: when you are looking for a science book, you only search the science section, not the entire library. Fast, but it typically reaches lower recall than HNSW at comparable speed, because a true neighbor can sit in a cluster that is not searched.

Product Quantization (PQ): Compresses vectors by splitting them into sub-vectors and quantizing each. Dramatically reduces memory usage at the cost of some accuracy. Useful when you have millions of vectors and limited memory.
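To make the IVF idea concrete, here is a toy inverted-file index in pure Python. It is a sketch under simplifying assumptions: the class name `ToyIVF` is hypothetical, and the centroids are supplied by hand, whereas a real IVF index learns them from the data (typically with k-means).

```python
import math


class ToyIVF:
    """Toy inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids: list[list[float]]):
        self.centroids = centroids
        self.lists: dict[int, list[tuple[str, list[float]]]] = {
            i: [] for i in range(len(centroids))
        }

    def _nearest_centroids(self, vector: list[float], n: int) -> list[int]:
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: math.dist(vector, self.centroids[c]),
        )
        return order[:n]

    def add(self, vec_id: str, vector: list[float]) -> None:
        cluster = self._nearest_centroids(vector, 1)[0]
        self.lists[cluster].append((vec_id, vector))

    def search(self, query: list[float], top_k: int = 1, nprobe: int = 1) -> list[str]:
        """Search only the nprobe clusters whose centroids are closest to the query."""
        candidates = [
            item
            for cluster in self._nearest_centroids(query, nprobe)
            for item in self.lists[cluster]
        ]
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:top_k]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 0.0]])
index.add("a", [1.0, 0.0])  # Lands in cluster 0.
index.add("b", [5.5, 0.0])  # Lands in cluster 1.

query = [4.9, 0.0]  # Slightly closer to centroid 0, but nearest to "b".
print(index.search(query, nprobe=1))  # Probes cluster 0 only.
print(index.search(query, nprobe=2))  # Probes both clusters.
```

With `nprobe=1` the search misses the true nearest neighbor because it sits just across a cluster boundary. Raising `nprobe` recovers it at the cost of scanning more vectors, which is the accuracy versus speed trade-off of IVF.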

python

# Example: FAISS with IVF + PQ for large-scale search
import faiss
import numpy as np


def build_faiss_index(
    embeddings: np.ndarray, nlist: int = 100, m: int = 8
) -> faiss.Index:
    """Build a FAISS index with IVF and Product Quantization.

    This combines two techniques:
    - IVF (Inverted File): Clusters vectors into nlist groups.
      At query time, only the nearest clusters are searched
      (index.nprobe of them; the FAISS default is 1).
    - PQ (Product Quantization): Splits each vector into m
      sub-vectors and stores each as an 8-bit code, so a vector
      code takes m bytes. A 384-dim float32 vector (1536 bytes)
      with m=8 shrinks to 8 bytes, at some cost in accuracy.

    Together, they enable searching millions of vectors in
    milliseconds using modest hardware.

    Args:
        embeddings: Matrix of embeddings (n_docs x dim).
        nlist: Number of clusters for IVF.
        m: Number of sub-quantizers for PQ. dim must be divisible by m.

    Returns:
        Trained FAISS index.
    """
    # FAISS expects C-contiguous float32 arrays
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    dim = embeddings.shape[1]
    if dim % m != 0:
        raise ValueError(f"dim ({dim}) must be divisible by m ({m})")

    # Create the index: IVF with PQ compression (8 bits per sub-vector code)
    quantizer = faiss.IndexFlatL2(dim)
    index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, 8)

    # Train the index on the data -- IVF needs to learn cluster
    # centers, and PQ needs to learn quantization codebooks
    index.train(embeddings)
    index.add(embeddings)

    return index


def search_faiss(
    index: faiss.Index, query_embedding: np.ndarray, top_k: int = 5
) -> tuple[np.ndarray, np.ndarray]:
    """Search the FAISS index for nearest neighbors.

    Returns:
        Tuple of (distances, indices) arrays.
    """
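    # FAISS expects a 2D float32 array of shape (n_queries, dim)
    query = np.ascontiguousarray(query_embedding, dtype=np.float32).reshape(1, -1)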
    distances, indices = index.search(query, top_k)
    return distances[0], indices[0]

Common Misconception: "Approximate search means inaccurate results." In practice, ANN algorithms find the true nearest neighbors 95-99% of the time. The 1-5% of cases where they miss are typically near-ties where the missed result was almost as relevant as the returned result. For RAG applications, this level of accuracy is more than sufficient.

4. Retrieval Strategies

Dense Retrieval

Dense retrieval uses learned vector representations (embeddings) for both queries and documents. Similarity is computed in the embedding space. This is the "standard" approach we have been discussing.

Advantages:

Captures semantic meaning beyond exact word matches. "Dog" and "canine" will have similar embeddings even though they share no characters.
Works across languages with multilingual models. A query in English can retrieve documents in Spanish.
Handles paraphrases naturally. "How do I fix this bug?" and "What is the solution to this error?" will match similar documents.

Disadvantages:

Requires an embedding model (adds latency and cost).
May miss exact keyword matches that matter. If a user searches for "error ERR-4032," dense retrieval might match documents about errors in general rather than the specific error code.
Embedding quality depends heavily on the model and domain. A model trained on web text may not work well for legal documents.

Sparse Retrieval

Sparse retrieval uses traditional information retrieval techniques based on term frequency and inverse document frequency. The key insight: a word that appears frequently in a document but rarely across the corpus is a strong signal of relevance.
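The intuition can be checked with plain TF-IDF on a toy corpus. Note that this is the simpler precursor to BM25, not BM25 itself, and the three sentences are hypothetical examples.

```python
import math
from collections import Counter

corpus = [
    "the service returned error ERR-4032 after the timeout",
    "the service restarted after the deploy",
    "the deploy finished without an error",
]
tokenized = [doc.lower().split() for doc in corpus]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Inverse document frequency: 0 for terms in every document, higher for rare terms."""
    if doc_freq[term] == 0:
        return 0.0  # Term not in the corpus at all.
    return math.log(len(corpus) / doc_freq[term])


def tf_idf(term: str, doc_index: int) -> float:
    """Term frequency in one document, weighted by corpus-wide rarity."""
    tf = tokenized[doc_index].count(term)
    return tf * idf(term)


for term in ("the", "error", "err-4032"):
    print(term, round(tf_idf(term, 0), 4))
```

"the" appears twice in the first document but in every document overall, so its weight is zero. "err-4032" appears once but nowhere else, so it carries the highest weight. BM25 refines this scheme with term-frequency saturation and document-length normalization.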

BM25 is the most widely used sparse retrieval algorithm. It has been a standard ranking function in search systems for decades (it is the default scoring function in Lucene and Elasticsearch) and remains competitive even in the age of neural retrieval.

python

import math
from collections import Counter


class BM25:
    """A simple BM25 implementation for sparse retrieval.

    BM25 (Best Matching 25) ranks documents based on:
    1. Term Frequency (TF): How often does the query term appear
       in this document? More occurrences = more relevant.
    2. Inverse Document Frequency (IDF): How rare is this term
       across all documents? Rare terms are more informative.
    3. Document Length Normalization: Longer documents naturally
       contain more terms, so we normalize for length.

    The name "BM25" comes from it being the 25th in a series of
    ranking functions explored by Robertson et al. in the 1990s.

    Parameters:
    - k1 controls term frequency saturation. Higher k1 means
      additional term occurrences have more impact. Default: 1.5.
    - b controls document length normalization. b=1 means full
      normalization, b=0 means no normalization. Default: 0.75.
    """

    def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.documents = documents
        self.doc_count = len(documents)

        # Tokenize and compute statistics
        self.doc_tokens = [doc.lower().split() for doc in documents]
        self.doc_lengths = [len(tokens) for tokens in self.doc_tokens]
        # max(..., 1) avoids a ZeroDivisionError on an empty corpus
        self.avg_doc_length = sum(self.doc_lengths) / max(self.doc_count, 1)

        # Compute document frequency for each term
        self.df = {}
        for tokens in self.doc_tokens:
            for token in set(tokens):  # set() to count each term once per doc
                self.df[token] = self.df.get(token, 0) + 1

    def _idf(self, term: str) -> float:
        """Compute inverse document frequency for a term.

        IDF is high for rare terms and low for common terms.
        The formula includes smoothing (+0.5) to avoid division
        by zero and the +1 to ensure IDF is always positive.
        """
        df = self.df.get(term, 0)
        return math.log((self.doc_count - df + 0.5) / (df + 0.5) + 1)

    def score(self, query: str, doc_index: int) -> float:
        """Compute BM25 score for a query-document pair.

        The score is the sum of each query term's contribution:
        IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * dl/avgdl))

        where TF is term frequency, dl is document length, and
        avgdl is average document length.
        """
        query_tokens = query.lower().split()
        doc_tokens = self.doc_tokens[doc_index]
        doc_length = self.doc_lengths[doc_index]
        tf_counts = Counter(doc_tokens)

        score = 0.0
        for term in query_tokens:
            if term not in tf_counts:
                continue
            tf = tf_counts[term]
            idf = self._idf(term)
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (
                1 - self.b + self.b * doc_length / self.avg_doc_length
            )
            score += idf * numerator / denominator

        return score

    def search(self, query: str, top_k: int = 5) -> list[tuple[int, float]]:
        """Search for the most relevant documents.

        Returns:
            List of (document_index, score) tuples sorted by score descending.
        """
        scores = [(i, self.score(query, i)) for i in range(self.doc_count)]
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:top_k]

Advantages:

Excels at exact keyword matching. Searching for "ERR-4032" finds documents containing exactly that string.
No embedding model needed -- faster setup and no GPU requirements.
Fast, interpretable, and well-understood after decades of research.
Handles rare terms and proper nouns well. Dense models often struggle with entity names and codes they have not seen in training.

Disadvantages:

No semantic understanding ("dog" and "canine" are completely different terms).
Sensitive to vocabulary mismatch. If the document says "automobile" and the query says "car," BM25 will not match them.
No cross-language capability.

Hybrid Retrieval

Hybrid retrieval combines dense and sparse retrieval to get the best of both worlds. The key insight: dense retrieval captures semantics while sparse retrieval captures exact matches. Neither alone is perfect, but together they cover each other's weaknesses.

A common approach is Reciprocal Rank Fusion (RRF), which combines rankings from multiple retrieval systems:

python

def reciprocal_rank_fusion(
    rankings: list[list[tuple[str, float]]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Combine multiple rankings using Reciprocal Rank Fusion.

    RRF is elegant in its simplicity: it assigns a score of
    1/(k + rank) to each document in each ranking, then sums
    the scores across rankings.

    The k parameter (default 60) is a smoothing constant that
    prevents top-ranked documents from dominating. With k=0, the
    #1 result from a single system would score 1.0, as much as a
    document ranked #2 by both systems (0.5 + 0.5). With k=60,
    agreement between systems counts for much more.

    Example: If document X is ranked #1 by dense retrieval and
    #3 by sparse retrieval:
    - Dense score: 1/(60+1) = 0.0164
    - Sparse score: 1/(60+3) = 0.0159
    - Total: 0.0323

    If document Y is ranked #2 by both:
    - Dense score: 1/(60+2) = 0.0161
    - Sparse score: 1/(60+2) = 0.0161
    - Total: 0.0323

    The two documents score almost identically (X is ahead by less
    than 0.00001), which makes intuitive sense: being good in both
    systems is about as valuable as being great in one and decent
    in the other.

    Args:
        rankings: List of ranked lists, each containing (doc_id, score) tuples.
        k: Constant to prevent high scores for top-ranked documents (default: 60).

    Returns:
        Fused ranking as a list of (doc_id, fused_score) tuples.
    """
    fused_scores: dict[str, float] = {}

    for ranking in rankings:
        for rank, (doc_id, _) in enumerate(ranking):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1.0 / (k + rank + 1)

    # Sort by fused score descending
    fused = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return fused

Hybrid retrieval is particularly effective because it captures both semantic similarity (from dense retrieval) and exact keyword matches (from sparse retrieval). In many benchmarks, hybrid retrieval outperforms either approach alone, with gains commonly in the 5-15% range on standard information retrieval metrics, though the margin depends on the dataset and the models used.
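The fusion arithmetic is small enough to verify by hand. The sketch below is a compact RRF over ranked lists of document IDs; the IDs W, X, Y, and Z and the two rankings are hypothetical.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return fused


dense = ["X", "Y", "Z"]   # X is #1, Y is #2 in dense retrieval.
sparse = ["W", "Y", "X"]  # Y is #2, X is #3 in sparse retrieval.

scores = rrf([dense, sparse])
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 6))
```

X (ranked #1 and #3) and Y (ranked #2 twice) end up within about 0.00001 of each other, and both score roughly twice as high as W and Z, which each appear in only one ranking. Appearing in both lists matters more than the exact position in either.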

Key Insight: In production RAG systems, hybrid retrieval should be your default approach. It costs slightly more than either approach alone (you run two retrieval passes), but the quality improvement is substantial and consistent. Think of it as wearing both a belt and suspenders -- redundancy in retrieval is a feature, not a bug.

5. Advanced RAG Techniques

Re-ranking

The initial retrieval step optimizes for recall (finding all relevant documents). Because it must cover the whole corpus, it relies on fast but imprecise methods (approximate nearest neighbors, BM25). A re-ranker then optimizes for precision by scoring each retrieved document more carefully.

Cross-encoder re-rankers process the query and document together through a transformer, producing a relevance score. Unlike bi-encoders (which embed query and document separately), cross-encoders see both texts simultaneously and can capture fine-grained interactions between them. This is much more accurate but too expensive for the full corpus -- running a cross-encoder over a million documents would take hours.

The two-stage approach (fast retrieval + precise re-ranking) is like a hiring process: resume screening (fast, imprecise) narrows the pool from thousands to dozens, then interviews (slow, precise) identify the best candidates from that smaller pool.

python

from sentence_transformers import CrossEncoder


def rerank_documents(
    query: str,
    documents: list[str],
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Re-rank retrieved documents using a cross-encoder.

    The cross-encoder processes each (query, document) pair through
    a transformer model, producing a relevance score. This is
    significantly more accurate than bi-encoder similarity because
    the model can attend to interactions between query terms and
    document terms.

    Typical pipeline:
    1. Initial retrieval: Get 20-50 candidates (fast, approximate)
    2. Re-ranking: Score all candidates with cross-encoder (slow, precise)
    3. Return top-k after re-ranking

    Args:
        query: The search query.
        documents: List of candidate documents from initial retrieval.
        model_name: Cross-encoder model to use.
        top_k: Number of top documents to return.

    Returns:
        List of (document, score) tuples sorted by relevance.
    """
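    if not documents:
        return []

    # Note: loading the model on every call is slow. In a real pipeline,
    # construct the CrossEncoder once and reuse it across queries.
    model = CrossEncoder(model_name)
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)

    # Convert NumPy scores to plain floats to match the declared return type
    scored_docs = [(doc, float(score)) for doc, score in zip(documents, scores)]
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return scored_docs[:top_k]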
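The cost argument behind the two-stage design can be illustrated without downloading a model. The sketch below is an assumption-laden toy: `word_overlap` stands in for the fast first-stage retriever and `counted_precise` stands in for a cross-encoder, recording every call so we can see how often the expensive scorer runs. Neither is a real relevance model.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 3,
    top_k: int = 2,
) -> list[str]:
    """Narrow the corpus with a cheap scorer, then re-rank the survivors."""
    # Stage 1: the cheap scorer sees every document (optimizes for recall).
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: the expensive scorer only sees the shortlist (optimizes for precision).
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)
    return reranked[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Stand-in first-stage scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


precise_calls: list[str] = []  # records each document the expensive scorer saw


def counted_precise(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: overlap normalized by document length."""
    precise_calls.append(doc)
    return word_overlap(query, doc) / len(doc.split())


corpus = [
    "solar wind particles cause the aurora",
    "the aurora is visible near the poles in winter and attracts many tourists",
    "bread recipes for beginners",
    "wind turbines generate power",
    "what to pack for a winter trip",
]
top = two_stage_search("what causes the aurora", corpus, word_overlap, counted_precise)
print(top, len(precise_calls))
```

The expensive scorer runs once per shortlisted candidate rather than once per document in the corpus. That is the whole point of retrieving first and re-ranking second.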

Query Transformation

Sometimes the user's query is not well-suited for direct retrieval. The user might ask a high-level question when the relevant documents contain low-level details, or they might use different terminology than the documents. Query transformation techniques improve retrieval by modifying the query before searching.

HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query, then use its embedding for retrieval. The hypothesis is often closer in embedding space to the actual document than the question is.

This is a clever insight. Consider the question "What causes aurora borealis?" The embedding of this question will be in "question space." But the documents that answer this question are in "answer space" -- they contain statements like "Charged particles from the solar wind interact with Earth's magnetic field..." HyDE bridges this gap by first generating a hypothetical answer, then searching for documents similar to that answer.

python

def hyde_retrieval(query: str, llm_call, retriever) -> list[str]:
    """Hypothetical Document Embedding retrieval.

    1. Ask the LLM to generate a hypothetical answer.
    2. Use the hypothetical answer's embedding for retrieval.
    3. Return the retrieved real documents.

    The hypothetical answer does not need to be correct -- it just
    needs to be in the right semantic neighborhood. Even an
    inaccurate hypothesis will use the right terminology and
    concepts, making it a better retrieval query than the original
    question.

    Reference: Gao et al., 2023 - "Precise Zero-Shot Dense Retrieval
    without Relevance Labels" (ACL 2023).
    """
    # Step 1: Generate hypothetical answer
    hypothesis = llm_call(
        prompt=f"Write a short paragraph that would answer this question: {query}"
    )

    # Step 2: Use the hypothetical answer for retrieval
    results = retriever.search(hypothesis, top_k=5)

    return results

Query decomposition: Break a complex query into simpler sub-queries and retrieve for each. This is particularly useful for multi-faceted questions.

python

DECOMPOSE_PROMPT = """Break the following complex question into 2-4 simpler
sub-questions that, when answered together, would answer the original question.

Original question: {query}

Sub-questions:
1."""


def decompose_and_retrieve(query: str, llm_call, retriever) -> list[str]:
    """Decompose a complex query and retrieve for each sub-query.

    Example: "Compare the performance and cost of GPT-4 and Claude 3.5"
    might decompose into:
    1. "What is the performance of GPT-4 on standard benchmarks?"
    2. "What is the performance of Claude 3.5 on standard benchmarks?"
    3. "What is the pricing of GPT-4?"
    4. "What is the pricing of Claude 3.5?"

    Each sub-query retrieves different documents, and together they
    provide the information needed to answer the original question.
    """
    # Generate sub-questions
    response = llm_call(prompt=DECOMPOSE_PROMPT.format(query=query))
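    # The prompt ends with "1.", so the first line of the reply is unnumbered
    # and later lines look like "2. ...". Strip that numbering before searching.
    sub_queries = []
    for line in response.split("\n"):
        head, sep, tail = line.strip().partition(". ")
        text = tail if sep and head.isdigit() else line
        if text.strip():
            sub_queries.append(text.strip())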

    # Retrieve for each sub-query, deduplicating results
    all_results = []
    seen = set()
    for sub_query in sub_queries:
        results = retriever.search(sub_query, top_k=3)
        for doc in results:
            if doc not in seen:
                all_results.append(doc)
                seen.add(doc)

    return all_results

Step-back prompting: Ask a more general version of the question to retrieve broader context. For example, "What was the GDP growth rate of Vietnam in Q3 2024?" might step back to "What is the recent economic performance of Vietnam?" to retrieve documents that provide context beyond the specific number.
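Step-back prompting can be sketched in the same style as the other query transformations. The version below is a minimal illustration under stated assumptions: `STEP_BACK_PROMPT`, `step_back_retrieval`, the toy `OverlapRetriever`, and the canned `stub_llm` are all hypothetical names introduced here, and the retriever is assumed to expose the same `search(query, top_k=...)` interface used elsewhere in this section.

```python
STEP_BACK_PROMPT = (
    "Rewrite the question below as a more general question about the same "
    "topic. Respond with only the rewritten question.\n\nQuestion: {query}"
)


def step_back_retrieval(query: str, llm_call, retriever, top_k: int = 3) -> list[str]:
    """Retrieve for both the original query and a broader 'step-back' query."""
    broader_query = llm_call(prompt=STEP_BACK_PROMPT.format(query=query)).strip()

    # Specific results first, then broader context; drop duplicates, keep order.
    merged: list[str] = []
    for q in (query, broader_query):
        for doc in retriever.search(q, top_k=top_k):
            if doc not in merged:
                merged.append(doc)
    return merged


class OverlapRetriever:
    """Toy retriever (assumption): ranks documents by shared lowercase words."""

    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, top_k: int = 3) -> list[str]:
        q_words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(q_words & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def stub_llm(prompt: str) -> str:
    """Canned reply standing in for a real LLM call."""
    return "What is the recent economic performance of Vietnam?"


docs = [
    "GDP growth in Q3 2024 was strong",
    "Vietnam economic performance has been driven by exports",
    "Recipe for pho",
]
merged = step_back_retrieval(
    "What was the GDP growth rate of Vietnam in Q3 2024?",
    stub_llm,
    OverlapRetriever(docs),
    top_k=1,
)
print(merged)
```

Keeping the original query's results first preserves the specific answer, while the step-back results add surrounding context the narrow query would have missed.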

Multi-Step Retrieval

For complex questions, a single retrieval step may not find all the information needed. Multi-step retrieval uses the results of one retrieval to inform the next, iteratively gathering information until the agent has enough to answer.

python

def iterative_retrieval(
    query: str, llm_call, retriever, max_steps: int = 3
) -> list[str]:
    """Iteratively retrieve and refine.

    At each step, the LLM examines current results and generates
    a follow-up query to fill in gaps. This mimics how a human
    researcher works: you read initial sources, identify gaps in
    your understanding, then search for more information to fill
    those gaps.

    The process terminates when the LLM determines it has enough
    information or when the maximum number of steps is reached.
    """
    all_retrieved = []

    current_query = query
    for step in range(max_steps):
        # Retrieve, skipping chunks already collected in earlier steps
        results = retriever.search(current_query, top_k=3)
        for doc in results:
            if doc not in all_retrieved:
                all_retrieved.append(doc)

        # Ask LLM if we have enough information
        context = "\n".join(all_retrieved)
        follow_up = llm_call(
            prompt=(
                f"Original question: {query}\n\n"
                f"Information retrieved so far:\n{context}\n\n"
                "If this is enough to answer the question, respond with exactly "
                "'SUFFICIENT'. Otherwise, respond with only a follow-up search "
                "query that targets the missing information."
            )
        )

        # Match on the start of the reply: a substring test would also fire on
        # "INSUFFICIENT" and end the loop early.
        if follow_up.strip().upper().startswith("SUFFICIENT"):
            break
        current_query = follow_up.strip()

    return all_retrieved

Try It Yourself: Think of a complex question that would benefit from multi-step retrieval. For example: "How does the regulatory environment for AI in the EU compare to the US, and what are the implications for companies operating in both markets?" What sub-queries would you need? What information from the first retrieval step would inform your second query?

076. Agentic RAG: Letting the Agent Decide

From Passive to Active Retrieval

In standard RAG, retrieval is triggered automatically for every query -- a fixed pipeline where every input goes through the same retrieve-then-generate process. In agentic RAG, the agent has retrieval as one of its available tools and decides when and what to retrieve based on its assessment of what information it needs.

This is a significant shift: the agent becomes an active participant in the information-gathering process rather than a passive recipient of retrieved context. It is the difference between a student who reads every page of the textbook before answering a question (standard RAG) and one who first considers what they already know, identifies gaps, and looks up only what they need (agentic RAG).

Why Agentic RAG?

Not every query needs retrieval. Simple greetings ("Hi!"), arithmetic ("What is 2+2?"), or well-known facts ("What is the capital of France?") do not benefit from retrieval. Retrieving anyway wastes time and may introduce noise -- retrieved documents about France's political system could confuse a simple geography question.
The agent knows what it does not know. After an initial attempt at answering, the agent can identify specific knowledge gaps and retrieve targeted information. This is more efficient than blindly retrieving on every query.
Multi-source retrieval. An agent may have access to multiple knowledge bases (technical docs, company wiki, research papers) and need to decide which one to query. A question about vacation policy should go to the HR knowledge base, not the technical documentation.
Iterative refinement. The agent can retrieve, assess whether the results are sufficient, and retrieve again if needed. This handles complex queries that require information from multiple documents.

Implementing Agentic RAG

python

AGENT_SYSTEM_PROMPT = """You are a helpful assistant with access to a knowledge base.
You have the following tools available:

1. search_knowledge_base(query: str) -> list[str]
   Search the knowledge base for relevant information.

2. answer(response: str) -> None
   Provide your final answer to the user.

Guidelines:
- Only search the knowledge base when you need information you don't
  already know or when accuracy is critical.
- You may search multiple times with different queries.
- Always cite the source when using retrieved information.
- If the knowledge base doesn't have relevant information, say so
  and answer based on your general knowledge (with a caveat).

Think step by step about whether you need to search before answering.
"""


class AgenticRAG:
    """An agent that decides when and what to retrieve.

    Unlike standard RAG (which always retrieves), this agent
    uses retrieval as a tool -- it decides whether to search,
    what to search for, and when it has enough information to
    answer. This is more efficient and often more accurate.
    """

    def __init__(self, llm_call, retriever):
        self.llm_call = llm_call
        self.retriever = retriever
        self.retrieved_context = []

    def search_knowledge_base(self, query: str) -> list[str]:
        """Tool: search the knowledge base."""
        results = self.retriever.search(query, top_k=3)
        self.retrieved_context.extend(results)
        return results

    def process_query(self, user_query: str) -> str:
        """Process a user query with optional retrieval.

        The agent reasons about whether it needs to search,
        formulates search queries if needed, and synthesizes
        the results into a final answer.
        """
        messages = [
            {"role": "system", "content": AGENT_SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ]

        # Agent reasoning loop (max 5 iterations to prevent runaway)
        for _ in range(5):
            response = self.llm_call(messages=messages)

            if "search_knowledge_base" in response:
                # Agent decided to search -- extract query and execute
                search_query = self._extract_search_query(response)
                results = self.search_knowledge_base(search_query)
                messages.append({"role": "assistant", "content": response})
                messages.append({
                    "role": "tool",
                    "content": "Search results:\n" + "\n---\n".join(results),
                })
            elif "answer(" in response:
                # Agent is ready to give a final answer
                return self._extract_answer(response)
            else:
                return response

        return "I was unable to complete the query within the step limit."

    def _extract_search_query(self, response: str) -> str:
        """Extract the search query from an agent's tool call."""
        start = response.index("search_knowledge_base(") + len("search_knowledge_base(")
        end = response.index(")", start)
        return response[start:end].strip("\"'")

    def _extract_answer(self, response: str) -> str:
        """Extract the answer from an agent's tool call."""
        start = response.index("answer(") + len("answer(")
        end = response.rindex(")")
        return response[start:end].strip("\"'")
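The string-based extraction above is deliberately simple, and it has two weak spots: it cuts the query at the first closing parenthesis, and a plain substring check treats any mention of the tool name as a call. A slightly sturdier parser, sketched here with a regular expression, is one way to handle both. The names `SEARCH_CALL` and `parse_search_call` are illustrative assumptions; production systems usually rely on the structured tool-calling interface of their LLM provider instead of parsing free text.

```python
import re
from typing import Optional

# Matches search_knowledge_base("...") or search_knowledge_base(query='...'),
# capturing the quoted argument even when it contains parentheses.
SEARCH_CALL = re.compile(
    r"""search_knowledge_base\(\s*(?:query\s*=\s*)?(["'])(?P<query>.*?)\1\s*\)""",
    re.DOTALL,
)


def parse_search_call(response: str) -> Optional[str]:
    """Return the search query if the response contains a tool call, else None."""
    match = SEARCH_CALL.search(response)
    return match.group("query") if match else None


# A real call whose query contains parentheses is captured whole.
print(parse_search_call('Let me check. search_knowledge_base("refund policy (EU)")'))
# Merely mentioning the tool name is not treated as a call.
print(parse_search_call("I could use search_knowledge_base, but I already know this."))
```

Returning `None` when no well-formed call is found lets the agent loop fall through to its final-answer branch instead of searching for a garbled query.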

Router-Based Agentic RAG

A more sophisticated approach uses a router to direct queries to the appropriate retrieval source:

python

ROUTER_PROMPT = """Given the user's question, decide which knowledge source to query.
Available sources:
- technical_docs: API documentation, code references, technical specifications
- company_wiki: Company policies, procedures, organizational information
- research_papers: Academic papers and research findings
- none: No retrieval needed (use general knowledge)

Question: {query}

Respond with ONLY the source name."""


def route_query(query: str, llm_call) -> str:
    """Route a query to the appropriate knowledge source.

    This is like a librarian who directs you to the right section
    of the library based on your question, rather than making you
    search the entire library every time.
    """
    response = llm_call(prompt=ROUTER_PROMPT.format(query=query))
    return response.strip().lower()
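An LLM does not always follow the "respond with ONLY the source name" instruction, so it is worth validating the router's reply before dispatching on it. The sketch below shows one possible approach; `KNOWN_SOURCES`, `normalize_route`, and `retrieve_routed` are hypothetical helpers, and falling back to `"none"` (no retrieval) for unrecognized replies is a design assumption you may want to replace with "search everything".

```python
KNOWN_SOURCES = ("technical_docs", "company_wiki", "research_papers")


def normalize_route(raw_response: str, default: str = "none") -> str:
    """Map a free-form router reply onto a known source name."""
    cleaned = raw_response.strip().strip("\"'`.").lower()
    if cleaned in KNOWN_SOURCES or cleaned == "none":
        return cleaned
    # The model added extra words: accept the first known source it mentions.
    lowered = raw_response.lower()
    for source in KNOWN_SOURCES:
        if source in lowered:
            return source
    return default


def retrieve_routed(query: str, route: str, retrievers: dict, top_k: int = 3) -> list[str]:
    """Search the retriever registered for `route`; return [] for 'none' or unknown routes."""
    retriever = retrievers.get(route)
    if retriever is None:
        return []
    return retriever.search(query, top_k=top_k)


print(normalize_route(" Company_Wiki\n"))
print(normalize_route("I would use technical_docs."))
print(normalize_route("hr_portal"))
```

Here `retrievers` is assumed to be a dict mapping each source name to an object with the same `search(query, top_k=...)` method used throughout this lecture.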

Key Insight: Agentic RAG represents a maturation of the RAG paradigm. Standard RAG treats retrieval as a fixed preprocessing step. Agentic RAG treats retrieval as a tool the agent can use strategically. This is analogous to the difference between a student who always reads the textbook before answering (even for questions they know) and one who strategically consults references only when needed.

087. Self-RAG: Self-Reflective Retrieval

The Self-RAG Framework

Self-RAG (Asai et al., 2024) introduces a framework where the language model itself decides when to retrieve and then critically evaluates the retrieved information. The key innovation is the use of special reflection tokens that the model generates to assess its own retrieval and generation process.

Think of Self-RAG as an agent with a built-in fact-checker. After generating each part of its response, the agent asks itself: "Did I need to look this up? Was what I found relevant? Is my answer actually supported by the evidence?"

The Self-RAG process:

Retrieve decision: The model generates a token indicating whether retrieval is needed for the current segment.
Retrieval: If needed, relevant passages are retrieved.
Relevance assessment: The model generates a token indicating whether each retrieved passage is relevant.
Generation: The model generates a response segment using the relevant passages.
Support assessment: The model generates a token indicating whether the generated text is supported by the retrieved evidence.
Utility assessment: The model evaluates the overall utility of the generated response.

text

Input: "What causes aurora borealis?"

Step 1: [Retrieve: Yes]  -- Model decides retrieval is needed
Step 2: Retrieved: "Aurora borealis occurs when charged particles from
        the Sun interact with Earth's magnetic field..."
Step 3: [Relevant: Yes]  -- Model confirms passage is relevant
Step 4: "Aurora borealis is caused by charged particles from the Sun
        colliding with gases in Earth's atmosphere..."
Step 5: [Supported: Fully]  -- Model confirms response is grounded
Step 6: [Useful: 5]  -- Model rates utility highly

Advantages of Self-RAG

Adaptive retrieval: Avoids unnecessary retrieval for easy questions, saving time and cost.
Quality control: The model evaluates its own output for faithfulness, catching hallucinations before they reach the user.
Reduced hallucination: By checking whether generated text is supported by evidence, Self-RAG reduces unsupported claims.

Common Misconception: "Self-RAG requires training a custom model." While the original paper trains the reflection tokens into the model, the key ideas (decide whether to retrieve, evaluate relevance, check for support) can be approximated with prompting techniques using any LLM. The conceptual framework is more important than the exact implementation.
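A prompting-only approximation of these steps might look like the sketch below. It is not the method from the Self-RAG paper, which trains reflection tokens into the model; here each reflection step is replaced by a separate yes/no prompt. The function name, prompts, and stub objects are all assumptions made for illustration.

```python
def _is_yes(reply: str) -> bool:
    return reply.strip().lower().startswith("yes")


def prompted_self_rag(question: str, llm_call, retriever, top_k: int = 3) -> dict:
    """Approximate Self-RAG's reflection steps with separate yes/no prompts."""
    # Retrieve decision
    needs_retrieval = _is_yes(llm_call(prompt=(
        "Does answering this question require looking up external "
        f"information? Answer Yes or No.\n\nQuestion: {question}"
    )))

    # Retrieval + relevance assessment
    passages: list[str] = []
    if needs_retrieval:
        for passage in retriever.search(question, top_k=top_k):
            relevant = _is_yes(llm_call(prompt=(
                "Is this passage relevant to the question? Answer Yes or No."
                f"\n\nQuestion: {question}\n\nPassage: {passage}"
            )))
            if relevant:
                passages.append(passage)

    # Generation
    context = "\n".join(passages) if passages else "(no retrieved context)"
    answer = llm_call(prompt=f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:")

    # Support assessment (only meaningful when there is evidence to check against)
    supported = None
    if passages:
        supported = _is_yes(llm_call(prompt=(
            "Is the answer fully supported by the context? Answer Yes or No."
            f"\n\nContext:\n{context}\n\nAnswer: {answer}"
        )))

    return {"answer": answer, "passages": passages, "supported": supported}


class StubRetriever:
    """Toy retriever returning one relevant and one irrelevant passage."""

    def search(self, query: str, top_k: int = 3) -> list[str]:
        return [
            "Charged particles from the solar wind interact with Earth's magnetic field.",
            "Sourdough bread needs a long fermentation.",
        ][:top_k]


def stub_llm(prompt: str) -> str:
    """Canned replies standing in for a real LLM."""
    if prompt.startswith("Is this passage relevant"):
        return "Yes" if "solar wind" in prompt else "No"
    if prompt.startswith("Context:"):
        return "Charged particles from the solar wind cause the aurora."
    return "Yes"  # retrieve decision and support check


result = prompted_self_rag("What causes aurora borealis?", stub_llm, StubRetriever())
print(result)
```

The trade-off is cost: this approximation makes several LLM calls per answer, whereas the trained model emits its reflection tokens in a single generation pass.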

098. RAG vs. Fine-Tuning: Trade-offs for Agents

When adapting an LLM to a specific domain, there are two main approaches: RAG (retrieving relevant information at inference time) and fine-tuning (updating the model's weights on domain-specific data). Understanding the trade-offs is essential for making good architectural decisions.

RAG Advantages

No training required: Set up an index and start querying. Fine-tuning requires training infrastructure, data preparation, and evaluation.
Up-to-date knowledge: New documents can be indexed immediately. Fine-tuning requires retraining to incorporate new information.
Provenance: Retrieved sources can be cited. Fine-tuned knowledge is opaque.
Lower cost: No GPU training infrastructure needed.
Privacy: Sensitive data stays in the retrieval system, not baked into model weights that could be leaked through adversarial prompting.

Fine-Tuning Advantages

Deeper knowledge integration: The model "knows" the domain rather than reading about it at inference time. It can use domain knowledge implicitly in its reasoning.
Lower latency: No retrieval step at inference time.
Better style adaptation: Fine-tuning can change the model's writing style, tone, and formatting. RAG cannot change how the model writes, only what information it has access to.
Smaller context usage: No context window space consumed by retrieved passages.

When to Use Each

Scenario	Recommended Approach
Knowledge changes frequently	RAG
Need to cite sources	RAG
Small domain corpus (<1000 docs)	RAG
Need specific output format/style	Fine-tuning
Latency is critical	Fine-tuning
Large, stable domain corpus	Fine-tuning (or both)
Production agent with diverse tasks	RAG + fine-tuning

The Hybrid Approach

In practice, the best results often come from combining both approaches:

Fine-tune the model on domain-specific data to learn the domain's concepts, terminology, and style.
Use RAG at inference time to provide up-to-date, specific information.

This is sometimes called "RAG over a fine-tuned model." The fine-tuned model is better at understanding and reasoning about domain-specific content, while RAG ensures it has access to the latest information.

109. Evaluation of RAG Systems

Why RAG Evaluation Is Challenging

RAG systems have multiple points of failure, and a failure at any point can produce a bad result:

Retrieval failure: The relevant documents are not retrieved (wrong chunks, poor embeddings, bad query).
Context integration failure: The documents are retrieved but the LLM ignores or misinterprets them.
Generation failure: The LLM generates incorrect information despite having the right context (hallucination from within the context).

You must evaluate each component separately and together to understand where failures occur.

Key Metrics

Retrieval Metrics

Recall@k: Fraction of relevant documents that appear in the top k results. High recall means you are finding most of the relevant information.
Precision@k: Fraction of top k results that are relevant. High precision means you are not polluting the context with irrelevant information.
Mean Reciprocal Rank (MRR): Average, over all test queries, of 1/rank of the first relevant result. Measures how quickly you find something relevant.
Normalized Discounted Cumulative Gain (nDCG): Weighted metric that accounts for the position of relevant results. Results at the top matter more than results at the bottom.

python

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Compute recall at k: fraction of relevant docs found in top k results.

    Example: If there are 4 relevant documents and 3 of them appear
    in the top 10 results, recall@10 = 3/4 = 0.75.
    """
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Compute precision at k: fraction of top k results that are relevant.

    Example: If 3 out of 10 top results are relevant,
    precision@10 = 3/10 = 0.3.
    """
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / k if k > 0 else 0.0


def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Compute the reciprocal rank for a single query.

    If the first relevant result is at position 3, the reciprocal
    rank is 1/3 = 0.333. Averaging this value over all test queries
    gives the Mean Reciprocal Rank (MRR); a higher MRR means relevant
    results appear earlier in the ranking.
    """
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
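nDCG is listed among the retrieval metrics but has no implementation here, and MRR is by definition an average across queries. The sketch below covers both under a simplifying assumption: relevance is binary (a document is either relevant or not), so each relevant hit contributes a gain of 1 discounted by the log of its position. The function names are illustrative.

```python
import math


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result for one query (0.0 if none)."""
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)


def ndcg_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents packed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Relevant documents at positions 2 and 4 score lower than a perfect ranking.
print(ndcg_at_k(["a", "b", "c", "d"], {"b", "d"}, k=4))
print(ndcg_at_k(["b", "d", "a"], {"b", "d"}, k=3))
print(mean_reciprocal_rank([(["a", "b"], {"b"}), (["c"], {"c"}), (["x"], {"y"})]))
```

With graded relevance labels (for example 0 to 3), you would replace the constant gain of 1 with a gain derived from the label; the discounting stays the same.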

Generation Metrics

Faithfulness: Does the generated answer accurately reflect the retrieved context? (No hallucination beyond what the context says.)
Answer relevance: Does the answer actually address the question?
Context relevance: Are the retrieved passages relevant to the question?
Completeness: Does the answer cover all aspects of the question?

The RAGAS Framework

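RAGAS (Retrieval Augmented Generation Assessment) by Es et al. (2024) provides an automated evaluation framework built around the four metrics below. It is useful because it uses an LLM as the judge, so most of the evaluation needs no human-written ground-truth answers; context recall is the usual exception, since it is typically measured against a reference answer: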

Faithfulness: Measures whether claims in the answer are supported by the context.
Answer relevance: Measures whether the answer addresses the question.
Context precision: Measures whether the retrieved context is relevant to the question.
Context recall: Measures whether the retrieved context contains the information needed to answer the question.

python

def evaluate_faithfulness(
    question: str, answer: str, context: str, llm_call
) -> float:
    """Evaluate faithfulness of an answer to its context.

    Simplified version of the RAGAS faithfulness metric.

    The process:
    1. Extract all factual claims from the answer.
    2. Check each claim against the retrieved context.
    3. Faithfulness = (supported claims) / (total claims).

    A faithfulness of 1.0 means every claim in the answer is
    supported by the context. A faithfulness of 0.5 means half
    the claims are unsupported (hallucinated).
    """
    # Step 1: Extract claims from the answer
    claims_response = llm_call(
        prompt=f"List all factual claims made in this answer:\n\n{answer}\n\nClaims:"
    )
    claims = [c.strip() for c in claims_response.split("\n") if c.strip()]

    if not claims:
        return 1.0  # No claims to verify

    # Step 2: Check each claim against the context
    supported = 0
    for claim in claims:
        verdict = llm_call(
            prompt=(
                f"Is the following claim supported by the context?\n\n"
                f"Claim: {claim}\n\n"
                f"Context: {context}\n\n"
                f"Answer only 'Yes' or 'No'."
            )
        )
        # Check the start of the reply so words that merely contain "yes" do not count
        if verdict.strip().lower().startswith("yes"):
            supported += 1

    return supported / len(claims)
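The same LLM-as-judge pattern extends to the retrieval side. The sketch below estimates context precision as the fraction of retrieved chunks the judge considers useful for the question. This is a simplified, unranked variant written for illustration and should not be read as the exact RAGAS formula; `evaluate_context_precision` and the keyword-based `stub_judge` are assumptions.

```python
def evaluate_context_precision(
    question: str, contexts: list[str], llm_call
) -> float:
    """Fraction of retrieved chunks that an LLM judge marks as useful."""
    if not contexts:
        return 0.0  # Nothing retrieved, so nothing relevant was retrieved

    relevant = 0
    for chunk in contexts:
        verdict = llm_call(
            prompt=(
                "Is the following passage useful for answering the question?\n\n"
                f"Question: {question}\n\n"
                f"Passage: {chunk}\n\n"
                "Answer only 'Yes' or 'No'."
            )
        )
        if verdict.strip().lower().startswith("yes"):
            relevant += 1

    return relevant / len(contexts)


def stub_judge(prompt: str) -> str:
    """Canned judge standing in for a real LLM: keys on one keyword."""
    return "Yes" if "Lewis" in prompt else "No"


score = evaluate_context_precision(
    "Who introduced RAG?",
    [
        "RAG was introduced by Lewis et al. in 2020.",
        "HNSW is an approximate nearest neighbor algorithm.",
        "Lewis et al. combined retrieval with generation.",
        "Chroma is a vector database.",
    ],
    stub_judge,
)
print(score)
```

A low score here alongside high faithfulness suggests the generator is coping with noisy context, which is a cue to tune retrieval (chunk size, re-ranking, `top_k`) and leave the prompt alone.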

Try It Yourself: Build a simple RAG evaluation pipeline. Create 10 question-answer pairs with known answers, index the source documents, retrieve and generate answers, then compute faithfulness and relevance. Experiment with different chunk sizes and embedding models to see how they affect the metrics.

1110. Practical Example: Building a Simple RAG Pipeline

Let us build a complete RAG pipeline from scratch, walking through every component.

python

"""
A complete RAG pipeline implementation.

This example demonstrates building a RAG system from document loading
through retrieval and generation. Each component is implemented
transparently so you can see exactly what happens at each step.

Requirements:
    pip install sentence-transformers chromadb openai
"""

import os
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder


class SimpleRAGPipeline:
    """A complete RAG pipeline with indexing, retrieval, and generation.

    This pipeline supports:
    - Document loading and chunking
    - Embedding generation and vector storage
    - Dense retrieval with optional re-ranking
    - Context-augmented generation

    The pipeline follows the standard RAG architecture:
    1. OFFLINE: Documents -> Chunks -> Embeddings -> Vector Store
    2. ONLINE:  Query -> Retrieve -> Augment Prompt -> Generate Answer
    """

    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        chunk_size: int = 500,
        chunk_overlap: int = 50,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

        # Initialize models
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)

        # Initialize vector store (in-memory for this example)
        self.client = chromadb.Client()
        self.collection = self.client.create_collection(
            name="rag_docs",
            metadata={"hnsw:space": "cosine"},
        )
        self.doc_count = 0

    def chunk_text(self, text: str, source: str) -> list[dict]:
        """Split text into overlapping chunks with metadata."""
        chunks = []
        start = 0
        chunk_idx = 0
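        # Guard against a non-positive step, which would loop forever
        step = self.chunk_size - self.chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append({
                    "text": chunk_text,
                    "source": source,
                    "chunk_index": chunk_idx,
                })
                chunk_idx += 1
            if end >= len(text):
                break  # Last chunk reached; skip a tail that only repeats the overlap
            start += step
        return chunks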

    def index_documents(self, documents: list[dict]) -> int:
        """Index a list of documents.

        This is the OFFLINE phase: documents are chunked, embedded,
        and stored in the vector database. This only needs to happen
        once (or when documents are updated).

        Args:
            documents: List of dicts with 'content' and 'source' keys.

        Returns:
            Number of chunks indexed.
        """
        all_chunks = []
        for doc in documents:
            chunks = self.chunk_text(doc["content"], doc["source"])
            all_chunks.extend(chunks)

        if not all_chunks:
            return 0

        # Embed with self.embedder so the configured embedding_model is actually
        # used; without explicit embeddings, ChromaDB falls back to its own
        # default embedding function.
        texts = [c["text"] for c in all_chunks]
        self.collection.add(
            documents=texts,
            embeddings=self.embedder.encode(texts).tolist(),
            metadatas=[{"source": c["source"], "chunk_index": c["chunk_index"]} for c in all_chunks],
            ids=[f"doc_{self.doc_count + i}" for i in range(len(all_chunks))],
        )
        self.doc_count += len(all_chunks)
        return len(all_chunks)

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        use_reranking: bool = True,
        initial_k: int = 20,
    ) -> list[dict]:
        """Retrieve relevant chunks for a query.

        This is the ONLINE retrieval phase. For each query:
        1. Search the vector store for initial_k candidates (fast).
        2. Optionally re-rank with a cross-encoder (precise).
        3. Return the top_k results.

        Args:
            query: The search query.
            top_k: Number of results to return.
            use_reranking: Whether to apply cross-encoder re-ranking.
            initial_k: Number of candidates for re-ranking.

        Returns:
            List of relevant chunks with scores.
        """
        # Nothing indexed yet: there is nothing to search
        if self.doc_count == 0:
            return []

        # Initial retrieval (fast, approximate), embedding the query with the
        # same model that was used at indexing time
        k = initial_k if use_reranking else top_k
        results = self.collection.query(
            query_embeddings=self.embedder.encode([query]).tolist(),
            n_results=min(k, self.doc_count),
        )

        if not results["documents"][0]:
            return []

        chunks = []
        for i, (doc, metadata, distance) in enumerate(zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )):
            chunks.append({
                "text": doc,
                "source": metadata["source"],
                "initial_score": 1 - distance,  # Convert distance to similarity
            })

        # Re-rank if requested (slow, precise)
        if use_reranking and len(chunks) > top_k:
            pairs = [(query, chunk["text"]) for chunk in chunks]
            rerank_scores = self.reranker.predict(pairs)
            for chunk, score in zip(chunks, rerank_scores):
                chunk["rerank_score"] = float(score)
            chunks.sort(key=lambda x: x["rerank_score"], reverse=True)

        return chunks[:top_k]

    def build_prompt(
        self, query: str, retrieved_chunks: list[dict]
    ) -> str:
        """Build the augmented prompt with retrieved context.

        This is the AUGMENT phase: we construct a prompt that includes
        both the user's question and the retrieved context. The prompt
        instructs the LLM to answer based on the context and to
        acknowledge when the context is insufficient.

        Args:
            query: The user's question.
            retrieved_chunks: List of retrieved text chunks.

        Returns:
            The complete prompt string for the LLM.
        """
        context_parts = []
        for i, chunk in enumerate(retrieved_chunks, 1):
            context_parts.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")

        context = "\n\n".join(context_parts)

        prompt = f"""Answer the following question based on the provided context.
If the context does not contain enough information to answer the question,
say so clearly and explain what information is missing.

Context:
{context}

Question: {query}

Answer:"""
        return prompt

    def query(
        self,
        question: str,
        llm_call,
        top_k: int = 5,
        use_reranking: bool = True,
    ) -> dict:
        """Execute a full RAG query: retrieve, augment, generate.

        This is the complete ONLINE pipeline:
        1. RETRIEVE: Find relevant chunks in the vector store.
        2. AUGMENT: Build a prompt with the retrieved context.
        3. GENERATE: Call the LLM to produce an answer.

        Args:
            question: The user's question.
            llm_call: A callable that takes a prompt string and returns a response.
            top_k: Number of chunks to retrieve.
            use_reranking: Whether to use cross-encoder re-ranking.

        Returns:
            Dict with 'answer', 'sources', 'prompt', and
            'num_chunks_retrieved' keys.
        """
        # Retrieve
        chunks = self.retrieve(question, top_k=top_k, use_reranking=use_reranking)

        # Augment
        prompt = self.build_prompt(question, chunks)

        # Generate
        answer = llm_call(prompt=prompt)

        return {
            "answer": answer,
            "sources": [{"source": c["source"], "text": c["text"][:200]} for c in chunks],
            "prompt": prompt,
            "num_chunks_retrieved": len(chunks),
        }


# -- Usage Example ---------------------------------------------------

def main():
    """Demonstrate the RAG pipeline."""

    # Initialize the pipeline
    rag = SimpleRAGPipeline(chunk_size=300, chunk_overlap=50)

    # Sample documents
    documents = [
        {
            "content": (
                "Retrieval-Augmented Generation (RAG) is a technique that combines "
                "information retrieval with text generation. It was introduced by "
                "Lewis et al. in 2020. The key idea is to retrieve relevant documents "
                "from an external knowledge base and use them as additional context for "
                "the language model. This allows the model to access up-to-date "
                "information and reduce hallucinations. RAG has become a fundamental "
                "building block for modern AI applications."
            ),
            "source": "rag_overview.txt",
        },
        {
            "content": (
                "Vector databases are specialized databases designed to store and "
                "query high-dimensional vectors efficiently. They use approximate "
                "nearest neighbor (ANN) algorithms like HNSW to enable fast similarity "
                "search. Popular vector databases include Chroma, Pinecone, Weaviate, "
                "and Qdrant. They are essential infrastructure for RAG systems, as they "
                "store the embeddings of document chunks and enable semantic search."
            ),
            "source": "vector_databases.txt",
        },
        {
            "content": (
                "Fine-tuning is the process of updating a pre-trained model's weights "
                "on a smaller, task-specific dataset. Unlike RAG, which adds information "
                "at inference time, fine-tuning bakes knowledge into the model's "
                "parameters. Fine-tuning is better for adapting the model's style and "
                "behavior, while RAG is better for providing up-to-date factual "
                "information. Many production systems use both approaches together."
            ),
            "source": "fine_tuning.txt",
        },
    ]

    # Index documents
    num_chunks = rag.index_documents(documents)
    print(f"Indexed {num_chunks} chunks from {len(documents)} documents.\n")

    # Simulate an LLM call (returns a canned answer and ignores the prompt)
    def mock_llm_call(prompt: str) -> str:
        return (
            "Based on the provided context, RAG (Retrieval-Augmented Generation) "
            "works by retrieving relevant documents from an external knowledge base "
            "and using them as additional context for the language model. This was "
            "introduced by Lewis et al. in 2020. The retrieved documents are stored "
            "as embeddings in vector databases, which use approximate nearest neighbor "
            "algorithms for efficient similarity search."
        )

    # Execute a query
    result = rag.query(
        "How does RAG work?",
        llm_call=mock_llm_call,
        top_k=3,
        use_reranking=False,
    )

    print("Question: How does RAG work?\n")
    print(f"Answer: {result['answer']}\n")
    print("Sources used:")
    for src in result["sources"]:
        print(f"  - {src['source']}: {src['text'][:100]}...")


if __name__ == "__main__":
    main()
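
The `sources` list returned by `query` can also be used to score the retrieval step on its own. The following is a minimal sketch, not part of the pipeline above: it computes document-level precision@k and recall@k for one query. The function name `precision_recall_at_k`, the hand-labelled `gold` set, and the sample `sources` list are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def precision_recall_at_k(
    retrieved_sources: Sequence[str],
    relevant_sources: Iterable[str],
    k: int,
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    retrieved_sources: source names in ranked order (best first).
    relevant_sources: hand-labelled sources that should be retrieved.
    """
    relevant = set(relevant_sources)
    # Several chunks can come from the same document, so deduplicate
    # while preserving rank order before cutting off at k.
    top_k = list(dict.fromkeys(retrieved_sources))[:k]
    hits = sum(1 for source in top_k if source in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data shaped like result["sources"] from query().
sources = [
    {"source": "rag_overview.txt", "text": "..."},
    {"source": "vector_databases.txt", "text": "..."},
    {"source": "fine_tuning.txt", "text": "..."},
]
retrieved = [s["source"] for s in sources]
gold = {"rag_overview.txt"}  # hypothetical relevance label for this query

precision, recall = precision_recall_at_k(retrieved, gold, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

In practice these scores would be averaged over a labelled set of queries. A single query, as here, only shows the mechanics.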

12Discussion Questions

RAG vs. longer context windows: As context windows grow to 1M+ tokens, will RAG become obsolete? What are the fundamental advantages of RAG that survive arbitrarily large context windows? Hint: consider cost, retrieval precision, and the "lost in the middle" problem. Even with infinite context, you still need to decide what information to include.
Chunking as a bottleneck: Many RAG failures trace back to poor chunking. How might future systems eliminate the need for chunking entirely? Hint: consider late-interaction models like ColBERT, which represent documents as sets of token embeddings rather than a single vector.
The faithfulness problem: How can we ensure that an LLM does not hallucinate information that contradicts its retrieved context? Is Self-RAG sufficient, or do we need fundamentally different approaches? Hint: consider the difference between "the context does not support this claim" and "the context contradicts this claim."
Agentic vs. automatic retrieval: In what scenarios would you prefer an agent to decide when to retrieve versus always retrieving? What are the failure modes of each approach? Hint: agents might wrongly decide they do not need to retrieve, while automatic retrieval wastes resources on simple queries.
Multi-modal RAG: How would you extend a RAG system to handle images, tables, and diagrams in addition to text? What additional challenges does this introduce? Hint: consider how you would embed a table or a diagram, and how you would include non-text content in the LLM prompt.
Adversarial RAG: If an attacker can inject documents into a RAG system's knowledge base, how could they manipulate the system's outputs? What defenses exist? Hint: consider prompt injection via retrieved documents, where the document itself contains instructions like "Ignore all previous instructions and..."

13Summary and Key Takeaways

RAG bridges the gap between parametric and non-parametric knowledge. It allows LLMs to access external, up-to-date, and domain-specific information without retraining.
The indexing pipeline is critical. Chunking strategy, embedding model selection, and vector database configuration all significantly affect retrieval quality. Invest time in getting the indexing right.
Hybrid retrieval outperforms pure dense or pure sparse retrieval in most scenarios. Combining semantic search with keyword matching captures both types of relevance and should be your default approach.
Advanced techniques add significant value. Re-ranking, query transformation (especially HyDE), and multi-step retrieval each improve retrieval quality, and they can be combined.
Agentic RAG represents a paradigm shift. Moving from automatic retrieval to agent-controlled retrieval enables more intelligent, efficient, and adaptive information gathering.
Evaluation must cover the full pipeline. Retrieval metrics (recall, precision) and generation metrics (faithfulness, relevance) are both essential for understanding system performance.
RAG and fine-tuning are complementary, not competing approaches. Production systems often benefit from combining both.
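
To make the hybrid-retrieval takeaway concrete, here is a minimal sketch of one common way to merge a dense ranking with a sparse one: Reciprocal Rank Fusion (RRF). RRF is used here as an illustrative assumption and is not necessarily the fusion method used elsewhere in this lecture. The chunk IDs and the two input rankings are hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used smoothing constant.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ranked chunk IDs from a dense retriever and a BM25 retriever.
dense_ranking = ["c1", "c2", "c3"]
sparse_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.4f}")
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales. Chunks found by both retrievers (`c1` and `c3` here) rise above chunks found by only one.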

14References

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 33.
Gao, L., Ma, X., Lin, J., & Callan, J. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. International Conference on Learning Representations (ICLR).
Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations.
Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Part of "Agentic AI: Foundations, Architectures, and Applications" (CC BY-SA 4.0).