
Module 7 — RAG Pipelines

Retrieval-Augmented Generation as an Engineering System


What This Module Covers#

| Group | Topic | Key Skill |
| --- | --- | --- |
| 1 | Why RAG Exists | Understand the fundamental problem RAG solves |
| 2 | RAG Architecture | Design component-based RAG systems |
| 3 | Building RAG Pipelines | Implement end-to-end retrieval and generation |
| 4 | Failure Modes & Guardrails | Handle RAG-specific failure cases |
| 5 | Production RAG | Deploy evaluable, auditable RAG systems |


Learning Objectives#

By the end of this module, you will be able to:

  1. Explain why RAG is necessary for real-world LLM applications

  2. Design RAG pipelines with clear component boundaries

  3. Implement retrieval, prompt construction, and generation

  4. Handle failure modes including near-misses and low-confidence retrieval

  5. Evaluate both retrieval quality and generation faithfulness

  6. Apply RAG patterns to enterprise scenarios


Prerequisites#

This module builds directly on:

| Module | Concepts Used Here |
| --- | --- |
| Module 3 | LLM behavior, hallucination patterns |
| Module 4 | How models learn patterns, not truth |
| Module 5 | Embeddings, vector similarity, FAISS retrieval |
| Module 6 | LLM API clients, retries, structured output |

Module 7 is where everything comes together.


Setup#

Run this cell to install dependencies and configure the environment.

!pip -q install sentence-transformers faiss-cpu requests

import numpy as np
import requests
import json
import time
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict, Optional

# Load embedding model (same as Module 5)
model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"Embedding model loaded: all-MiniLM-L6-v2")
print(f"Embedding dimension: 384")
print("Setup complete!")

LLM Gateway Configuration#

Configure your LLM endpoint. Choose one option:

| Option | When to Use | Setup |
| --- | --- | --- |
| Pinggy Tunnel | Running Ollama locally | Start tunnel, paste URL |
| JBChat Server | Classroom setting | Get API key from instructor |

Option A: Pinggy Tunnel (Local Ollama)#

# Terminal 1: Start Ollama
OLLAMA_HOST=0.0.0.0 ollama serve

# Terminal 2: Start Pinggy tunnel
ssh -p 443 -R0:localhost:11434 -L4300:localhost:4300 a.pinggy.io

Option B: JBChat Server#

Get the API key from your instructor.

# ------ OPTION A: Pinggy Tunnel (for local Ollama) ------
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None

# ------ OPTION B: JBChat Server (classroom) ------
LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "<provided-by-instructor>"  # Get from instructor

DEFAULT_MODEL = "llama3.1:8b"

print(f"LLM endpoint: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")

Group 1: Why RAG Exists#

The fundamental problem RAG solves

| Section | Topic |
| --- | --- |
| 7.1 | The Knowledge Gap Problem |
| 7.2 | What RAG Actually Does |
| 7.3 | RAG vs Fine-Tuning vs Prompting |


7.1 The Knowledge Gap Problem#

LLMs have a fundamental limitation that no amount of prompting can fix:

LLMs have no access to your data at the moment they answer a question.

What LLMs Know#

| Knowledge Type | Available? | Example |
| --- | --- | --- |
| Training data (pre-cutoff) | ✅ Yes | “What is Python?” |
| Recent events (post-cutoff) | ❌ No | “What happened yesterday?” |
| Your internal documents | ❌ No | “What’s our refund policy?” |
| Your database records | ❌ No | “What’s customer #12345’s status?” |
| Private company data | ❌ No | “What were Q3 sales?” |

The Hallucination Risk#

When asked about information they don’t have, LLMs don’t say “I don’t know.”

They confidently fabricate plausible-sounding answers.

This is not a bug—it’s how language models work. They generate probable continuations of text, whether or not those continuations are factually correct.

# Demonstration: LLMs will answer questions about data they've never seen

# Imagine asking an LLM about your company's internal policy
hypothetical_question = "What is Acme Corp's work-from-home policy?"

# Without RAG, the LLM might respond:
hypothetical_hallucination = """
Acme Corp allows employees to work from home up to 3 days per week.
Employees must be available during core hours (10am-4pm) and attend
all mandatory team meetings in person.
"""

print("Question:", hypothetical_question)
print("\nPotential LLM response (hallucinated):")
print(hypothetical_hallucination)
print("⚠️  This sounds authoritative but is completely fabricated!")
print("   The LLM has never seen Acme Corp's actual policy.")

7.2 What RAG Actually Does#

RAG = Retrieval-Augmented Generation

RAG solves the knowledge gap by providing relevant information at runtime:

┌─────────────────────────────────────────────────────────────┐
│                      RAG Pipeline                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   User Question ─────► Retriever ─────► Relevant Chunks     │
│                            │                   │            │
│                            │                   ▼            │
│                            │           Prompt Builder       │
│                            │                   │            │
│                            │                   ▼            │
│                            │           LLM Generation       │
│                            │                   │            │
│                            │                   ▼            │
│                            └──────────► Grounded Answer     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

RAG Does NOT:#

| Misconception | Reality |
| --- | --- |
| Make the model smarter | Model is unchanged |
| Retrain the model | No training occurs |
| Eliminate hallucinations | Reduces risk, doesn’t eliminate |
| Guarantee correctness | Still requires validation |

RAG DOES:#

| Capability | Benefit |
| --- | --- |
| Provide runtime evidence | Answers based on actual data |
| Ground generation | Model cites provided context |
| Enable auditability | Can trace answer to source |
| Support updates | New data available immediately |


7.3 RAG vs Fine-Tuning vs Prompting#

RAG is one of several approaches to customizing LLM behavior:

| Approach | What It Does | When to Use | Limitations |
| --- | --- | --- | --- |
| Prompting | Provides instructions in context | General behavior guidance | No access to external data |
| Fine-tuning | Modifies model weights | Teaching new skills/patterns | Expensive, data goes stale |
| RAG | Retrieves relevant data at runtime | Grounding in specific knowledge | Retrieval quality matters |

When RAG is the Right Choice#

Use RAG when:

  • You need answers grounded in specific documents

  • Data changes frequently

  • You need to cite sources

  • You need auditability

Don’t use RAG when:

  • Teaching the model a new task format

  • The knowledge is general/public

  • Real-time retrieval is too slow

Enterprise Reality#

Most enterprise LLM applications require RAG.

Fine-tuning teaches how to respond. RAG provides what to respond about.


Group 2: RAG Architecture#

Designing component-based RAG systems

| Section | Topic |
| --- | --- |
| 7.4 | RAG as an Architectural Pattern |
| 7.5 | The Four Core Components |
| 7.6 | Data Flow and Dependencies |


7.4 RAG as an Architectural Pattern#

RAG is not a single function or library—it’s an architectural pattern.

Key Insight#

RAG is a pipeline of composable components, each testable independently.

This matters because:

| Principle | Benefit |
| --- | --- |
| Separation of concerns | Each component has one job |
| Independent testing | Debug retrieval separately from generation |
| Swappable parts | Change embedding model without changing LLM |
| Clear failure attribution | Know which component failed |

Anti-Pattern: The Monolithic RAG Function#

# ❌ BAD: Everything in one function
def answer_question(query):
    # Embed, retrieve, build prompt, call LLM, parse response...
    # 200 lines of tangled logic
    pass

Pattern: Component-Based RAG#

# ✅ GOOD: Clear component boundaries
query_embedding = embedder.encode(query)
chunks = retriever.search(query_embedding, k=5)
prompt = prompt_builder.build(chunks, query)
response = generator.generate(prompt)
answer = validator.validate(response)

7.5 The Four Core Components#

Every RAG system has these components (even if combined):

1. Retriever#

| Responsibility | Implementation |
| --- | --- |
| Find relevant chunks | Vector similarity search |
| Return ranked results | Top-k with scores |
| Preserve metadata | Source, page, timestamp |

2. Prompt Builder#

| Responsibility | Implementation |
| --- | --- |
| Structure the prompt | Template with placeholders |
| Inject retrieved context | Format chunks clearly |
| Constrain the model | Instructions for grounded answers |

3. Generator#

| Responsibility | Implementation |
| --- | --- |
| Call the LLM API | HTTP client with retries |
| Handle failures | Timeout, rate limits |
| Return response | Raw text or structured |

4. Validator#

| Responsibility | Implementation |
| --- | --- |
| Check the answer against the evidence | Guardrails, faithfulness checks |
| Decide whether to refuse | Score thresholds, refusal messages |
| Return the final answer | Validated response or refusal |
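
These components map directly onto small Python interfaces. The implementations in Group 3 subclass Retriever, PromptBuilder, and Generator and pass around RetrievedChunk objects; the cell below is a minimal sketch of those definitions (the exact signatures are inferred from how the later code uses them, since the original defining cell is not shown).

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    """One retrieved piece of evidence with its similarity score and provenance."""
    text: str
    score: float
    source: str

class Retriever(ABC):
    """Finds the most relevant chunks for a query."""
    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]: ...

class PromptBuilder(ABC):
    """Combines retrieved chunks and the question into an LLM prompt."""
    @abstractmethod
    def build(self, chunks: List[RetrievedChunk], question: str) -> str: ...

class Generator(ABC):
    """Calls the LLM with the constructed prompt and returns its response text."""
    @abstractmethod
    def generate(self, prompt: str, temperature: float = 0.1) -> str: ...

class Validator(ABC):
    """Checks the response against the evidence and decides whether to refuse."""
    @abstractmethod
    def validate(self, response: str, chunks: List[RetrievedChunk]) -> str: ...

print("Component interfaces defined")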


7.6 Data Flow and Dependencies#

Understanding data flow helps debug RAG systems:

┌───────────────────────────────────────────────────────────────────────┐
│                        RAG Data Flow                                  │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│   [User Query]                                                        │
│        │                                                              │
│        ▼                                                              │
│   ┌─────────────┐     ┌─────────────────┐                            │
│   │  Embedder   │────►│  Query Vector   │                            │
│   └─────────────┘     └────────┬────────┘                            │
│                                │                                      │
│                                ▼                                      │
│   ┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐   │
│   │ Vector DB   │────►│   Retriever     │────►│ Retrieved Chunks │   │
│   └─────────────┘     └─────────────────┘     └────────┬─────────┘   │
│                                                        │             │
│                                                        ▼             │
│   [User Query] ──────────────────────────►    ┌───────────────┐      │
│                                               │ Prompt Builder│      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               ┌───────────────┐      │
│                                               │  RAG Prompt   │      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               ┌───────────────┐      │
│                                               │   Generator   │      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               [Grounded Answer]      │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Dependency Matrix#

| Component | Depends On | Produces |
| --- | --- | --- |
| Embedder | Query text | Query vector |
| Retriever | Query vector, Vector DB | Ranked chunks |
| Prompt Builder | Chunks, Query | RAG prompt string |
| Generator | RAG prompt | LLM response |
| Validator | LLM response, Chunks | Final answer |


Group 3: Building RAG Pipelines#

Hands-on implementation

| Section | Topic |
| --- | --- |
| 7.7 | Setting Up the Knowledge Base |
| 7.8 | Implementing the Retriever |
| 7.9 | Building Evidence-First Prompts |
| 7.10 | Connecting to the Generator |
| 7.11 | The Complete RAG Pipeline |


7.7 Setting Up the Knowledge Base#

A RAG system needs a knowledge base—documents to retrieve from.

For this module, we’ll use a corpus about central banking and interest rates (same domain as Module 5).

# Knowledge base: documents about monetary policy
# In production, this would come from a database, files, or API

knowledge_base = [
    {
        "id": "doc_001",
        "text": "The central bank raised interest rates by 25 basis points to combat inflation. This decision was made after reviewing economic indicators showing persistent price increases across multiple sectors.",
        "source": "monetary_policy_report_q3.pdf"
    },
    {
        "id": "doc_002",
        "text": "Higher borrowing costs are expected to slow consumer spending and reduce inflationary pressure. The central bank indicated further rate increases may follow if inflation remains elevated.",
        "source": "monetary_policy_report_q3.pdf"
    },
    {
        "id": "doc_003",
        "text": "Mortgage rates have risen to their highest level in two decades, causing a significant slowdown in the housing market. Home sales declined 15% compared to the previous quarter.",
        "source": "housing_market_analysis.pdf"
    },
    {
        "id": "doc_004",
        "text": "The Federal Reserve's dual mandate requires balancing maximum employment with price stability. Current policy prioritizes inflation control over employment growth.",
        "source": "fed_policy_overview.pdf"
    },
    {
        "id": "doc_005",
        "text": "Bank earnings improved as net interest margins widened due to higher rates. Financial sector stocks outperformed the broader market this quarter.",
        "source": "quarterly_earnings_summary.pdf"
    },
    {
        "id": "doc_006",
        "text": "Small businesses report difficulty accessing credit as lending standards tighten. The cost of business loans has increased substantially since rate hikes began.",
        "source": "small_business_survey.pdf"
    },
    {
        "id": "doc_007",
        "text": "International markets reacted strongly to the rate decision, with currency fluctuations affecting trade balances. Emerging markets face capital outflow pressures.",
        "source": "global_markets_report.pdf"
    },
    {
        "id": "doc_008",
        "text": "The championship football match ended in a dramatic penalty shootout. The home team secured victory after their goalkeeper saved three consecutive penalties.",
        "source": "sports_news.pdf"
    }
]

print(f"Knowledge base loaded: {len(knowledge_base)} documents")
print("\nSources:")
for source in set(doc['source'] for doc in knowledge_base):
    count = sum(1 for doc in knowledge_base if doc['source'] == source)
    print(f"  - {source}: {count} document(s)")
# Create vector index from knowledge base
# This is the "offline" step - done once when documents change

# Extract texts and generate embeddings
texts = [doc["text"] for doc in knowledge_base]
doc_embeddings = model.encode(texts, normalize_embeddings=True)

# Create FAISS index (inner product = cosine similarity for normalized vectors)
dimension = doc_embeddings.shape[1]  # 384
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings.astype('float32'))

print(f"FAISS index created:")
print(f"  - Dimension: {dimension}")
print(f"  - Documents indexed: {index.ntotal}")

7.8 Implementing the Retriever#

The retriever’s job: find the most relevant chunks for a query.

Key Decisions#

| Parameter | Trade-off |
| --- | --- |
| k (number of results) | More context vs. more noise |
| Score threshold | Precision vs. recall |
| Metadata filtering | Targeted vs. comprehensive |

class FAISSRetriever(Retriever):
    """Retriever using FAISS vector index."""
    
    def __init__(self, index, documents, embedding_model):
        self.index = index
        self.documents = documents
        self.model = embedding_model
    
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
        """Retrieve top-k relevant chunks for the query."""
        # Encode query
        query_vec = self.model.encode(
            query, 
            normalize_embeddings=True,
            convert_to_numpy=True
        ).astype('float32').reshape(1, -1)
        
        # Search index
        scores, indices = self.index.search(query_vec, k)
        
        # Build result list
        results = []
        for score, idx in zip(scores[0], indices[0]):
            doc = self.documents[idx]
            results.append(RetrievedChunk(
                text=doc["text"],
                score=float(score),
                source=doc["source"]
            ))
        
        return results

# Create retriever instance
retriever = FAISSRetriever(index, knowledge_base, model)
print("Retriever created successfully")
# Test the retriever
test_query = "Why did the central bank raise interest rates?"

chunks = retriever.retrieve(test_query, k=3)

print(f"Query: {test_query}")
print(f"\nTop {len(chunks)} retrieved chunks:")
print("=" * 70)
for i, chunk in enumerate(chunks, 1):
    print(f"\n{i}. [Score: {chunk.score:.3f}] Source: {chunk.source}")
    print(f"   {chunk.text[:100]}...")

7.9 Building Evidence-First Prompts#

The prompt is where retrieval meets generation. A well-structured RAG prompt puts the evidence first, constrains the model to that evidence, and gives it explicit permission to refuse.

Evidence-First Prompting Principles#

| Principle | Implementation |
| --- | --- |
| Context before question | Retrieved evidence appears first |
| Explicit grounding instruction | “Answer ONLY based on the provided context” |
| Refusal permission | “If the context doesn’t contain the answer, say so” |
| Clear structure | Labeled sections: CONTEXT, QUESTION, ANSWER |

class RAGPromptBuilder(PromptBuilder):
    """Builds evidence-first RAG prompts."""
    
    TEMPLATE = """You are a helpful assistant that answers questions based ONLY on the provided context.

IMPORTANT RULES:
1. Answer ONLY using information from the CONTEXT below
2. If the context does not contain enough information to answer, say "I don't have enough information to answer this question."
3. Do not use any prior knowledge or make assumptions
4. Keep your answer concise and directly relevant to the question

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
    
    def build(self, chunks: List[RetrievedChunk], question: str) -> str:
        """Build the RAG prompt from chunks and question."""
        # Format context from retrieved chunks
        context_parts = []
        for i, chunk in enumerate(chunks, 1):
            context_parts.append(f"[{i}] {chunk.text}")
        
        context = "\n\n".join(context_parts)
        
        # Build final prompt
        return self.TEMPLATE.format(
            context=context,
            question=question
        )

# Create prompt builder
prompt_builder = RAGPromptBuilder()
print("Prompt builder created")
# Test the prompt builder
rag_prompt = prompt_builder.build(chunks, test_query)

print("Generated RAG Prompt:")
print("=" * 70)
print(rag_prompt)
print("=" * 70)

7.10 Connecting to the Generator#

The generator calls the LLM API. We’ll reuse patterns from Module 6.

class LLMGenerator(Generator):
    """Generator that calls an LLM API."""
    
    def __init__(self, base_url: str, api_key: Optional[str] = None, model: str = "llama3.1:8b"):
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
    
    def generate(self, prompt: str, temperature: float = 0.1) -> str:
        """Generate response from LLM."""
        headers = {
            "Content-Type": "application/json",
            "ngrok-skip-browser-warning": "true"
        }
        
        # Determine endpoint based on whether we have an API key
        use_jbchat = self.api_key and self.api_key != "<provided-by-instructor>"
        
        if use_jbchat:
            headers["X-API-Key"] = self.api_key
            endpoint = f"{self.base_url}/chat/direct"
            payload = {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
                "stream": False
            }
        else:
            endpoint = f"{self.base_url}/api/chat"
            payload = {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "stream": False
            }
        
        try:
            response = requests.post(
                endpoint, 
                headers=headers, 
                json=payload, 
                timeout=60
            )
            response.raise_for_status()
            return response.json()["message"]["content"]
        except Exception as e:
            return f"[Generation Error: {e}]"

# Create generator
generator = LLMGenerator(LLM_BASE_URL, LLM_API_KEY, DEFAULT_MODEL)
print(f"Generator created: {LLM_BASE_URL}")

7.11 The Complete RAG Pipeline#

Now we combine all components into a complete pipeline.

class RAGPipeline:
    """Complete RAG pipeline combining retrieval and generation."""
    
    def __init__(self, retriever: Retriever, prompt_builder: PromptBuilder, generator: Generator):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
    
    def answer(self, question: str, k: int = 3, verbose: bool = False) -> dict:
        """Answer a question using RAG.
        
        Returns dict with:
        - answer: The generated response
        - chunks: Retrieved chunks used
        - prompt: The constructed prompt
        """
        # Step 1: Retrieve relevant chunks
        chunks = self.retriever.retrieve(question, k=k)
        
        if verbose:
            print(f"Retrieved {len(chunks)} chunks:")
            for i, c in enumerate(chunks, 1):
                print(f"  {i}. [{c.score:.3f}] {c.text[:50]}...")
            print()
        
        # Step 2: Build prompt
        prompt = self.prompt_builder.build(chunks, question)
        
        if verbose:
            print("Prompt built. Calling LLM...")
        
        # Step 3: Generate answer
        answer = self.generator.generate(prompt)
        
        return {
            "answer": answer,
            "chunks": chunks,
            "prompt": prompt
        }

# Create the complete pipeline
rag = RAGPipeline(retriever, prompt_builder, generator)
print("RAG pipeline created!")
# Test the complete pipeline
question = "Why did the central bank raise interest rates?"

print(f"Question: {question}")
print("=" * 70)

result = rag.answer(question, k=3, verbose=True)

print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
# Test with a question NOT in the knowledge base
# A good RAG system should refuse to answer

question_out_of_scope = "What is the weather forecast for tomorrow?"

print(f"Question: {question_out_of_scope}")
print("(This question is NOT covered by our knowledge base)")
print("=" * 70)

result = rag.answer(question_out_of_scope, k=3, verbose=True)

print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
print("\n✅ A well-designed RAG system should refuse or indicate uncertainty")
print("   when the retrieved context doesn't support an answer.")

Group 4: Failure Modes & Guardrails#

What can go wrong in RAG systems

| Section | Topic |
| --- | --- |
| 7.12 | RAG-Specific Failure Modes |
| 7.13 | The Near-Miss Problem |
| 7.14 | Implementing Guardrails |
| 7.15 | When to Refuse |


7.12 RAG-Specific Failure Modes#

RAG reduces hallucination risk but introduces new failure modes:

| Failure Mode | Description | Impact |
| --- | --- | --- |
| Wrong but similar chunks | Retrieval returns plausible but incorrect context | Grounded hallucination |
| Missing relevant chunks | Best evidence not retrieved | Incomplete answer |
| Conflicting evidence | Multiple chunks contradict each other | Confused response |
| Context overflow | Too many chunks, model loses focus | Noise in answer |
| Stale data | Knowledge base not updated | Outdated information |
| Citation hallucination | Model cites sources that don’t exist | False attribution |

Key Insight#

RAG shifts the risk from “model makes up facts” to “retrieval returns wrong evidence.”

Both are problems, but retrieval failures are usually more controllable: they can be measured, logged, and fixed at the component level.


7.13 The Near-Miss Problem#

The most dangerous failure in RAG: near-misses.

What is a Near-Miss?#

A chunk that is:

  • Semantically similar to the query (high retrieval score)

  • Factually different from what’s needed

Example#

| Query | Retrieved Chunk | Problem |
| --- | --- | --- |
| “What is Apple’s revenue?” | “Apple reported Q2 revenue…” | Wrong quarter |
| “What is the refund policy?” | “Our 2022 refund policy states…” | Outdated policy |
| “What did the CEO say about AI?” | “The CTO commented on AI…” | Wrong person |

Why Near-Misses are Dangerous#

  1. High confidence: Model thinks it has good evidence

  2. Plausible output: Answer sounds correct

  3. Hard to detect: No obvious error signal

  4. User trust: Grounded answers seem authoritative

# Demonstrate near-miss retrieval
# Our knowledge base has sports content (the football doc)
# that could be a near-miss for unrelated queries

query_finance = "What happened in the championship game?"

# This will retrieve the football doc even though our KB is mostly finance
chunks = retriever.retrieve(query_finance, k=3)

print(f"Query: {query_finance}")
print("\nRetrieved chunks:")
for i, chunk in enumerate(chunks, 1):
    print(f"\n{i}. [Score: {chunk.score:.3f}]")
    print(f"   {chunk.text[:80]}...")

print("\n⚠️  Notice: The sports document (doc_008) is retrieved")
print("   because 'championship' matches, even though our KB")
print("   is primarily about monetary policy.")

7.14 Implementing Guardrails#

Guardrails protect RAG systems from failure modes:

| Guardrail | What It Does | When to Use |
| --- | --- | --- |
| Score threshold | Reject low-confidence retrieval | Always |
| Chunk count validation | Ensure minimum evidence | Critical queries |
| Source validation | Verify chunks come from trusted sources | Regulated domains |
| Response length check | Detect overly brief/long answers | Quality control |
| Faithfulness check | Verify the answer uses the context | High-stakes answers |

class RAGPipelineWithGuardrails:
    """RAG pipeline with configurable guardrails."""
    
    def __init__(
        self, 
        retriever: Retriever, 
        prompt_builder: PromptBuilder, 
        generator: Generator,
        min_score: float = 0.3,
        min_chunks: int = 1
    ):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
        self.min_score = min_score
        self.min_chunks = min_chunks
    
    def answer(self, question: str, k: int = 3) -> dict:
        """Answer with guardrails applied."""
        # Step 1: Retrieve
        all_chunks = self.retriever.retrieve(question, k=k)
        
        # Guardrail 1: Filter by score threshold
        valid_chunks = [
            c for c in all_chunks 
            if c.score >= self.min_score
        ]
        
        # Guardrail 2: Check minimum chunk count
        if len(valid_chunks) < self.min_chunks:
            return {
                "answer": "I don't have enough relevant information to answer this question confidently.",
                "chunks": all_chunks,
                "refused": True,
                "reason": f"Only {len(valid_chunks)} chunks above threshold {self.min_score}"
            }
        
        # Step 2: Build prompt with valid chunks only
        prompt = self.prompt_builder.build(valid_chunks, question)
        
        # Step 3: Generate
        answer = self.generator.generate(prompt)
        
        return {
            "answer": answer,
            "chunks": valid_chunks,
            "refused": False
        }

# Create pipeline with guardrails
rag_guarded = RAGPipelineWithGuardrails(
    retriever, 
    prompt_builder, 
    generator,
    min_score=0.4,  # Require 40% similarity
    min_chunks=2    # Require at least 2 relevant chunks
)
print("Guarded RAG pipeline created")
print(f"  - Min score threshold: 0.4")
print(f"  - Min chunks required: 2")
# Test guardrails with a well-covered question
print("Test 1: Question well-covered by knowledge base")
print("=" * 70)

result = rag_guarded.answer("Why did the central bank raise rates?")
print(f"Refused: {result['refused']}")
print(f"Chunks used: {len(result['chunks'])}")
print(f"\nAnswer: {result['answer'][:200]}...")
# Test guardrails with an out-of-scope question
print("Test 2: Question NOT covered by knowledge base")
print("=" * 70)

result = rag_guarded.answer("What is the best programming language?")
print(f"Refused: {result['refused']}")
if result['refused']:
    print(f"Reason: {result['reason']}")
print(f"\nAnswer: {result['answer']}")

print("\n✅ Guardrails prevent the system from hallucinating")
print("   when retrieval doesn't find relevant evidence.")

7.15 When to Refuse#

Refusal is a feature, not a failure.

When RAG Should Refuse#

| Condition | Action |
| --- | --- |
| No chunks above score threshold | Refuse |
| Chunks are from wrong domain | Refuse |
| Query asks for speculation | Refuse |
| Conflicting evidence | Acknowledge uncertainty |

Refusal Patterns#

| Pattern | Example Response |
| --- | --- |
| No information | “I don’t have information about that topic.” |
| Low confidence | “Based on limited evidence, I cannot confidently answer.” |
| Out of scope | “This question is outside my knowledge base.” |
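
These conditions and patterns can be wired into the validation step before an answer is returned. A minimal sketch, reusing the RetrievedChunk and retriever objects defined earlier; the 0.4 threshold and the exact wording are illustrative choices, not fixed rules.

def refusal_message(chunks: List[RetrievedChunk], min_score: float = 0.4) -> Optional[str]:
    """Return a refusal string when the evidence is too weak, otherwise None."""
    if not chunks:
        return "I don't have information about that topic."
    if max(c.score for c in chunks) < min_score:
        return "Based on limited evidence, I cannot confidently answer."
    return None  # evidence looks strong enough; proceed to generation

# No evidence at all -> "no information" pattern
print(refusal_message([]))

# Off-topic query against a monetary-policy corpus -> likely the low-confidence pattern
weak_chunks = retriever.retrieve("What is the best programming language?", k=3)
print(refusal_message(weak_chunks))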

Enterprise Reality#

In regulated environments, a wrong answer is far more costly than no answer.

Banks, healthcare, legal: refusal is risk management.


Group 5: Production RAG#

Deploying evaluable, auditable RAG systems

| Section | Topic |
| --- | --- |
| 7.16 | Evaluating Retrieval Quality |
| 7.17 | Evaluating Generation Faithfulness |
| 7.18 | Caching and Performance |
| 7.19 | Observability and Audit Trails |
| 7.20 | RAG as a Platform Capability |


7.16 Evaluating Retrieval Quality#

RAG quality starts with retrieval quality. Poor retrieval = poor answers.

Retrieval Metrics#

| Metric | What It Measures | How to Compute |
| --- | --- | --- |
| Precision@k | Relevant chunks in top-k | Relevant retrieved / k |
| Recall@k | Coverage of all relevant chunks | Relevant retrieved / total relevant |
| MRR | Position of first relevant chunk | 1 / rank of first relevant, averaged over queries |
| NDCG | Ranking quality weighted by position | Position-discounted gain, normalized by the ideal ranking |
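
Given relevance judgments for a set of test queries, the first three metrics are a few lines of code each. A minimal sketch; the retrieved and relevant document IDs in the example are hypothetical labels you would create for your own evaluation set, not outputs of our retriever.

def precision_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids: List[str], relevant_ids: set) -> float:
    """1 / rank of the first relevant document (0 if none retrieved). MRR averages this over queries."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical example: two documents are labeled relevant for one test query
retrieved = ["doc_003", "doc_001", "doc_005"]
relevant = {"doc_001", "doc_002"}
print(f"Precision@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # 1 of 3 -> 0.33
print(f"Recall@3:    {recall_at_k(retrieved, relevant, 3):.2f}")     # 1 of 2 -> 0.50
print(f"RR:          {reciprocal_rank(retrieved, relevant):.2f}")    # first hit at rank 2 -> 0.50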

Practical Evaluation#

For most applications, simple checks work:

  1. Does the top chunk answer the question?

  2. Are retrieved chunks from appropriate sources?

  3. Is retrieval score reasonable?

def evaluate_retrieval(query: str, expected_keywords: list, k: int = 3):
    """Simple retrieval evaluation."""
    chunks = retriever.retrieve(query, k=k)
    
    print(f"Query: {query}")
    print(f"Expected keywords: {expected_keywords}")
    print("\nRetrieved chunks:")
    
    hits = 0
    for i, chunk in enumerate(chunks, 1):
        text_lower = chunk.text.lower()
        matched = [kw for kw in expected_keywords if kw.lower() in text_lower]
        
        status = "✅" if matched else "❌"
        hits += 1 if matched else 0
        
        print(f"  {i}. [{chunk.score:.3f}] {status} Keywords: {matched}")
        print(f"     {chunk.text[:60]}...")
    
    precision = hits / k
    print(f"\nPrecision@{k}: {precision:.1%}")
    return precision

# Evaluate retrieval for a test query
evaluate_retrieval(
    "Why did interest rates increase?",
    ["interest", "rate", "inflation", "central bank"]
)

7.17 Evaluating Generation Faithfulness#

Faithfulness: Does the answer only use information from the provided context?

Faithfulness Checks#

| Check | Question |
| --- | --- |
| Grounding | Can every claim be traced to context? |
| No hallucination | Does the answer avoid inventing facts? |
| Appropriate refusal | Does it refuse when context is insufficient? |

Manual Evaluation Template#

For each answer, ask:

  1. Is this answer supported by the retrieved chunks? (Yes/No/Partial)

  2. Does the answer add information not in the chunks? (Yes/No)

  3. Is the answer’s confidence appropriate? (Yes/No)

def check_faithfulness_simple(answer: str, chunks: List[RetrievedChunk]) -> dict:
    """Simple faithfulness heuristics."""
    # Combine all chunk text
    context_text = " ".join(c.text.lower() for c in chunks)
    answer_lower = answer.lower()
    
    # Check for common hallucination signals
    hallucination_phrases = [
        "i think", "probably", "might be", "i believe",
        "generally speaking", "in my opinion", "typically"
    ]
    
    found_phrases = [p for p in hallucination_phrases if p in answer_lower]
    
    # Check if answer is appropriately uncertain when needed
    uncertainty_phrases = [
        "don't have", "cannot", "no information", 
        "not mentioned", "unclear"
    ]
    shows_uncertainty = any(p in answer_lower for p in uncertainty_phrases)
    
    return {
        "potential_hallucination_signals": found_phrases,
        "shows_uncertainty": shows_uncertainty,
        "answer_length": len(answer),
        "context_length": len(context_text)
    }

# Test on a RAG response
result = rag.answer("What is the current inflation rate?")

print("Question: What is the current inflation rate?")
print(f"\nAnswer: {result['answer']}")
print("\nFaithfulness analysis:")
analysis = check_faithfulness_simple(result['answer'], result['chunks'])
for key, value in analysis.items():
    print(f"  {key}: {value}")

7.18 Caching and Performance#

Production RAG systems need performance optimization.

What to Cache#

| Component | Cache Strategy | Invalidation |
| --- | --- | --- |
| Document embeddings | Precompute, persist | On document change |
| Query embeddings | LRU cache | Time-based |
| Retrieval results | Query hash → chunks | On index update |
| LLM responses | Prompt hash → answer | Careful—may go stale |

Latency Breakdown#

Typical RAG latency:

| Step | Typical Time |
| --- | --- |
| Query embedding | 50-100ms |
| Vector search | 10-50ms |
| LLM generation | 500-3000ms |
| Total | ~600-3000ms |

LLM generation dominates. Cache carefully.

import hashlib

class CachedRetriever:
    """Retriever with query caching."""
    
    def __init__(self, base_retriever: Retriever, cache_size: int = 100):
        self.base = base_retriever
        self.cache = {}
        self.cache_size = cache_size
        self.hits = 0
        self.misses = 0
    
    def _cache_key(self, query: str, k: int) -> str:
        return hashlib.md5(f"{query}:{k}".encode()).hexdigest()
    
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
        key = self._cache_key(query, k)
        
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        
        self.misses += 1
        result = self.base.retrieve(query, k)
        
        # Simple cache with size limit
        if len(self.cache) >= self.cache_size:
            # Remove oldest entry (simple strategy)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        
        self.cache[key] = result
        return result
    
    def stats(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {"hits": self.hits, "misses": self.misses, "hit_rate": hit_rate}

# Demo caching
cached_retriever = CachedRetriever(retriever)

# First query - cache miss
cached_retriever.retrieve("interest rates", k=3)
print(f"After first query: {cached_retriever.stats()}")

# Same query - cache hit
cached_retriever.retrieve("interest rates", k=3)
print(f"After same query:  {cached_retriever.stats()}")

# Different query - cache miss
cached_retriever.retrieve("mortgage rates", k=3)
print(f"After new query:   {cached_retriever.stats()}")

7.19 Observability and Audit Trails#

Production RAG requires comprehensive logging for:

| Purpose | What to Log |
| --- | --- |
| Debugging | Query, chunks, prompt, response |
| Quality monitoring | Retrieval scores, response latency |
| Compliance | User ID, timestamp, sources cited |
| Improvement | Failed queries, low-confidence responses |

Audit Trail Structure#

{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "user-123",
  "query": "original question",
  "retrieval": {
    "chunk_ids": ["doc_001", "doc_002"],
    "scores": [0.85, 0.72],
    "latency_ms": 45
  },
  "generation": {
    "model": "llama3.1:8b",
    "prompt_tokens": 450,
    "response_tokens": 120,
    "latency_ms": 1200
  },
  "response": "final answer",
  "refused": false
}
import uuid
from datetime import datetime

class AuditableRAGPipeline:
    """RAG pipeline with audit logging."""
    
    def __init__(self, retriever, prompt_builder, generator):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
        self.audit_log = []
    
    def answer(self, question: str, user_id: str = "anonymous", k: int = 3) -> dict:
        request_id = str(uuid.uuid4())
        start_time = time.time()
        
        # Retrieval
        retrieval_start = time.time()
        chunks = self.retriever.retrieve(question, k=k)
        retrieval_ms = (time.time() - retrieval_start) * 1000
        
        # Prompt building
        prompt = self.prompt_builder.build(chunks, question)
        
        # Generation
        generation_start = time.time()
        answer = self.generator.generate(prompt)
        generation_ms = (time.time() - generation_start) * 1000
        
        total_ms = (time.time() - start_time) * 1000
        
        # Build audit record
        audit_record = {
            "request_id": request_id,
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "query": question,
            "retrieval": {
                "chunk_sources": [c.source for c in chunks],
                "scores": [c.score for c in chunks],
                "latency_ms": round(retrieval_ms, 1)
            },
            "generation": {
                "latency_ms": round(generation_ms, 1)
            },
            "total_latency_ms": round(total_ms, 1),
            "response_length": len(answer)
        }
        
        self.audit_log.append(audit_record)
        
        return {
            "answer": answer,
            "request_id": request_id,
            "chunks": chunks
        }
    
    def get_audit_log(self):
        return self.audit_log

# Create auditable pipeline
auditable_rag = AuditableRAGPipeline(retriever, prompt_builder, generator)

# Make a query
result = auditable_rag.answer(
    "What impact did rate hikes have on mortgages?",
    user_id="user-42"
)

print("Answer:", result["answer"][:100], "...")
print(f"\nRequest ID: {result['request_id']}")
print("\nAudit Record:")
print(json.dumps(auditable_rag.audit_log[-1], indent=2))

7.20 RAG as a Platform Capability#

In enterprise settings, RAG becomes a platform—not a one-off feature.

Platform Characteristics#

| Aspect | Implementation |
| --- | --- |
| Multi-tenant | Different knowledge bases per team/product |
| Swappable components | Change LLM without rebuilding |
| Configurable guardrails | Different thresholds per use case |
| Centralized logging | Unified audit across all RAG apps |
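
In code, "platform" often means configuration-driven assembly: each use case gets its own corpus, model, and guardrail thresholds while the components stay shared. A minimal sketch reusing the guarded pipeline from Section 7.14; the tenant names and settings are illustrative assumptions, and a real platform would also select a per-tenant retriever and index.

# Hypothetical per-use-case policy settings
TENANT_CONFIG = {
    "support-bot":     {"min_score": 0.35, "min_chunks": 1},
    "legal-assistant": {"min_score": 0.55, "min_chunks": 2},
}

def build_pipeline_for(tenant: str) -> RAGPipelineWithGuardrails:
    """Assemble a guarded pipeline from a tenant's policy configuration."""
    cfg = TENANT_CONFIG[tenant]
    return RAGPipelineWithGuardrails(
        retriever,        # a real platform would use the tenant's own index here
        prompt_builder,
        generator,
        min_score=cfg["min_score"],
        min_chunks=cfg["min_chunks"],
    )

legal_rag = build_pipeline_for("legal-assistant")
print(f"legal-assistant pipeline: min_score={legal_rag.min_score}, min_chunks={legal_rag.min_chunks}")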

Evolution Path#

Prototype RAG          Production RAG         Platform RAG
     │                      │                      │
     │ Single corpus        │ Multiple corpora     │ Self-service corpora
     │ One model            │ Model selection      │ Model marketplace
     │ No guardrails        │ Fixed guardrails     │ Configurable policies
     │ No logging           │ Basic logging        │ Full observability
     ▼                      ▼                      ▼

Key Insight#

RAG at scale is about governance, not just generation.

Who can access which knowledge? What gets logged? How do we audit?


Module Summary#

Key Takeaways#

| Concept | Remember |
| --- | --- |
| Why RAG | LLMs have no runtime access to your data |
| RAG Architecture | Component-based: Retriever → Prompt Builder → Generator → Validator |
| Near-misses | Most dangerous failure: semantically similar but factually different |
| Guardrails | Score thresholds and refusal are features, not failures |
| Evaluation | Measure retrieval quality and generation faithfulness separately |
| Production | Logging, caching, and audit trails are mandatory |

The RAG Mental Model#

RAG is how we turn LLMs from storytellers into assistants grounded in evidence.

It is an architectural discipline, not a prompt trick.

What’s Next#

You now have all the components to build production AI systems:

  • Module 5: Embeddings and retrieval

  • Module 6: LLM API engineering

  • Module 7: RAG pipelines

The assessment will test your ability to combine these into a working system.


Practice Exercises#

Exercise 1: Adjust Retrieval Parameters#

Modify the retriever to use k=5 instead of k=3. How does this affect answer quality?

Exercise 2: Custom Guardrails#

Create a guardrail that refuses to answer if retrieved chunks come from more than 2 different sources (potential conflicting evidence).

Exercise 3: Evaluate Your RAG#

Write 5 test questions and manually evaluate:

  1. Retrieval precision (are the right chunks retrieved?)

  2. Generation faithfulness (does the answer use only the context?)

Exercise 4: Add a New Document#

Add a new document to the knowledge base about cryptocurrency regulation. Test that queries about crypto now return relevant results.