
Module 7 — RAG Pipelines

Retrieval-Augmented Generation as an Engineering System


What This Module Covers#

| Group | Topic | Key Skill |
| --- | --- | --- |
| 1 | Why RAG Exists | Understand the fundamental problem RAG solves |
| 2 | RAG Architecture | Design component-based RAG systems |
| 3 | Building RAG Pipelines | Implement end-to-end retrieval and generation |
| 4 | Failure Modes & Guardrails | Handle RAG-specific failure cases |
| 5 | Production RAG | Deploy evaluable, auditable RAG systems |


Learning Objectives#

By the end of this module, you will be able to:

  1. Explain why RAG is necessary for real-world LLM applications

  2. Design RAG pipelines with clear component boundaries

  3. Implement retrieval, prompt construction, and generation

  4. Handle failure modes including near-misses and low-confidence retrieval

  5. Evaluate both retrieval quality and generation faithfulness

  6. Apply RAG patterns to enterprise scenarios


Prerequisites#

This module builds directly on:

| Module | Concepts Used Here |
| --- | --- |
| Module 3 | LLM behavior, hallucination patterns |
| Module 4 | How models learn patterns, not truth |
| Module 5 | Embeddings, vector similarity, FAISS retrieval |
| Module 6 | LLM API clients, retries, structured output |

Module 7 is where everything comes together.


Setup#

Run this cell to install dependencies and configure the environment.

!pip -q install sentence-transformers faiss-cpu requests

import numpy as np
import requests
import json
import time
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict, Optional

# Load embedding model (same as Module 5)
model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"Embedding model loaded: all-MiniLM-L6-v2")
print(f"Embedding dimension: 384")
print("Setup complete!")

LLM Gateway Configuration#

Configure your LLM endpoint. Choose one option:

| Option | When to Use | Setup |
| --- | --- | --- |
| Pinggy Tunnel | Running Ollama locally | Start tunnel, paste URL |
| JBChat Server | Classroom setting | Get API key from instructor |

Option A: Pinggy Tunnel (Local Ollama)#

# Terminal 1: Start Ollama
OLLAMA_HOST=0.0.0.0 ollama serve

# Terminal 2: Start Pinggy tunnel
ssh -p 443 -R0:localhost:11434 -L4300:localhost:4300 a.pinggy.io

Option B: JBChat Server#

Get the API key from your instructor.

# ------ OPTION A: Pinggy Tunnel (for local Ollama) ------
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None

# ------ OPTION B: JBChat Server (classroom) ------
LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "<provided-by-instructor>"  # Get from instructor

DEFAULT_MODEL = "llama3.1:8b"

print(f"LLM endpoint: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")

Group 1: Why RAG Exists#

The fundamental problem RAG solves

| Section | Topic |
| --- | --- |
| 7.1 | The Knowledge Gap Problem |
| 7.2 | What RAG Actually Does |
| 7.3 | RAG vs Fine-Tuning vs Prompting |


7.1 The Knowledge Gap Problem#

LLMs have a fundamental limitation that no amount of prompting can fix:

LLMs have no access to your data at the moment they answer a question.

What LLMs Know#

| Knowledge Type | Available? | Example |
| --- | --- | --- |
| Training data (pre-cutoff) | ✅ Yes | “What is Python?” |
| Recent events (post-cutoff) | ❌ No | “What happened yesterday?” |
| Your internal documents | ❌ No | “What’s our refund policy?” |
| Your database records | ❌ No | “What’s customer #12345’s status?” |
| Private company data | ❌ No | “What were Q3 sales?” |

The Hallucination Risk#

When asked about information they don’t have, LLMs don’t say “I don’t know.”

They confidently fabricate plausible-sounding answers.

This is not a bug—it’s how language models work. They generate probable continuations of text, whether or not those continuations are factually correct.

# Demonstration: LLMs will answer questions about data they've never seen

# Imagine asking an LLM about your company's internal policy
hypothetical_question = "What is Acme Corp's work-from-home policy?"

# Without RAG, the LLM might respond:
hypothetical_hallucination = """
Acme Corp allows employees to work from home up to 3 days per week.
Employees must be available during core hours (10am-4pm) and attend
all mandatory team meetings in person.
"""

print("Question:", hypothetical_question)
print("\nPotential LLM response (hallucinated):")
print(hypothetical_hallucination)
print("⚠️  This sounds authoritative but is completely fabricated!")
print("   The LLM has never seen Acme Corp's actual policy.")

7.2 What RAG Actually Does#

RAG = Retrieval-Augmented Generation

RAG solves the knowledge gap by providing relevant information at runtime:

┌─────────────────────────────────────────────────────────────┐
│                      RAG Pipeline                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   User Question ─────► Retriever ─────► Relevant Chunks     │
│                            │                   │            │
│                            │                   ▼            │
│                            │           Prompt Builder       │
│                            │                   │            │
│                            │                   ▼            │
│                            │           LLM Generation       │
│                            │                   │            │
│                            │                   ▼            │
│                            └──────────► Grounded Answer     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

RAG Does NOT:#

| Misconception | Reality |
| --- | --- |
| Make the model smarter | Model is unchanged |
| Retrain the model | No training occurs |
| Eliminate hallucinations | Reduces risk, doesn’t eliminate |
| Guarantee correctness | Still requires validation |

RAG DOES:#

| Capability | Benefit |
| --- | --- |
| Provide runtime evidence | Answers based on actual data |
| Ground generation | Model cites provided context |
| Enable auditability | Can trace answer to source |
| Support updates | New data available immediately |


7.3 RAG vs Fine-Tuning vs Prompting#

RAG is one of several approaches to customizing LLM behavior:

| Approach | What It Does | When to Use | Limitations |
| --- | --- | --- | --- |
| Prompting | Provides instructions in context | General behavior guidance | No access to external data |
| Fine-tuning | Modifies model weights | Teaching new skills/patterns | Expensive, data goes stale |
| RAG | Retrieves relevant data at runtime | Grounding in specific knowledge | Retrieval quality matters |

When RAG is the Right Choice#

Use RAG when:

  • You need answers grounded in specific documents

  • Data changes frequently

  • You need to cite sources

  • You need auditability

Don’t use RAG when:

  • Teaching the model a new task format

  • The knowledge is general/public

  • Real-time retrieval is too slow

Enterprise Reality#

Most enterprise LLM applications require RAG.

Fine-tuning teaches how to respond. RAG provides what to respond about.


Group 2: RAG Architecture#

Designing component-based RAG systems

| Section | Topic |
| --- | --- |
| 7.4 | RAG as an Architectural Pattern |
| 7.5 | The Four Core Components |
| 7.6 | Data Flow and Dependencies |


7.4 RAG as an Architectural Pattern#

RAG is not a single function or library—it’s an architectural pattern.

Key Insight#

RAG is a pipeline of composable components, each testable independently.

This matters because:

| Principle | Benefit |
| --- | --- |
| Separation of concerns | Each component has one job |
| Independent testing | Debug retrieval separately from generation |
| Swappable parts | Change embedding model without changing LLM |
| Clear failure attribution | Know which component failed |

Anti-Pattern: The Monolithic RAG Function#

# ❌ BAD: Everything in one function
def answer_question(query):
    # Embed, retrieve, build prompt, call LLM, parse response...
    # 200 lines of tangled logic
    pass

Pattern: Component-Based RAG#

# ✅ GOOD: Clear component boundaries
query_embedding = embedder.encode(query)
chunks = retriever.search(query_embedding, k=5)
prompt = prompt_builder.build(chunks, query)
response = generator.generate(prompt)
answer = validator.validate(response)

7.5 The Four Core Components#

Every RAG system has these components (even if combined):

1. Retriever#

| Responsibility | Implementation |
| --- | --- |
| Find relevant chunks | Vector similarity search |
| Return ranked results | Top-k with scores |
| Preserve metadata | Source, page, timestamp |

2. Prompt Builder#

| Responsibility | Implementation |
| --- | --- |
| Structure the prompt | Template with placeholders |
| Inject retrieved context | Format chunks clearly |
| Constrain the model | Instructions for grounded answers |

3. Generator#

| Responsibility | Implementation |
| --- | --- |
| Call the LLM API | HTTP client with retries |
| Handle failures | Timeout, rate limits |
| Return response | Raw text or structured |

4. Validator#

| Responsibility | Implementation |
| --- | --- |
| Check the answer against the evidence | Guardrails, faithfulness checks |
| Decide whether to refuse | Score thresholds, refusal messages |
| Return the final answer | Validated response or refusal |
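
These components map directly onto small Python interfaces. The implementations in Group 3 subclass Retriever, PromptBuilder, and Generator and pass around RetrievedChunk objects; the cell below is a minimal sketch of those definitions (the exact signatures are inferred from how the later code uses them, since the original defining cell is not shown).

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    """One retrieved piece of evidence with its similarity score and provenance."""
    text: str
    score: float
    source: str

class Retriever(ABC):
    """Finds the most relevant chunks for a query."""
    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]: ...

class PromptBuilder(ABC):
    """Combines retrieved chunks and the question into an LLM prompt."""
    @abstractmethod
    def build(self, chunks: List[RetrievedChunk], question: str) -> str: ...

class Generator(ABC):
    """Calls the LLM with the constructed prompt and returns its response text."""
    @abstractmethod
    def generate(self, prompt: str, temperature: float = 0.1) -> str: ...

class Validator(ABC):
    """Checks the response against the evidence and decides whether to refuse."""
    @abstractmethod
    def validate(self, response: str, chunks: List[RetrievedChunk]) -> str: ...

print("Component interfaces defined")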


7.6 Data Flow and Dependencies#

Understanding data flow helps debug RAG systems:

┌───────────────────────────────────────────────────────────────────────┐
│                        RAG Data Flow                                  │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│   [User Query]                                                        │
│        │                                                              │
│        ▼                                                              │
│   ┌─────────────┐     ┌─────────────────┐                            │
│   │  Embedder   │────►│  Query Vector   │                            │
│   └─────────────┘     └────────┬────────┘                            │
│                                │                                      │
│                                ▼                                      │
│   ┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐   │
│   │ Vector DB   │────►│   Retriever     │────►│ Retrieved Chunks │   │
│   └─────────────┘     └─────────────────┘     └────────┬─────────┘   │
│                                                        │             │
│                                                        ▼             │
│   [User Query] ──────────────────────────►    ┌───────────────┐      │
│                                               │ Prompt Builder│      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               ┌───────────────┐      │
│                                               │  RAG Prompt   │      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               ┌───────────────┐      │
│                                               │   Generator   │      │
│                                               └───────┬───────┘      │
│                                                       │              │
│                                                       ▼              │
│                                               [Grounded Answer]      │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Dependency Matrix#

| Component | Depends On | Produces |
| --- | --- | --- |
| Embedder | Query text | Query vector |
| Retriever | Query vector, Vector DB | Ranked chunks |
| Prompt Builder | Chunks, Query | RAG prompt string |
| Generator | RAG prompt | LLM response |
| Validator | LLM response, Chunks | Final answer |


Group 3: Building RAG Pipelines#

Hands-on implementation

| Section | Topic |
| --- | --- |
| 7.7 | Setting Up the Knowledge Base |
| 7.8 | Implementing the Retriever |
| 7.9 | Building Evidence-First Prompts |
| 7.10 | Connecting to the Generator |
| 7.11 | The Complete RAG Pipeline |


7.7 Setting Up the Knowledge Base#

A RAG system needs a knowledge base—documents to retrieve from.

For this module, we’ll use a corpus about central banking and interest rates (same domain as Module 5).

# Knowledge base: documents about monetary policy
# In production, this would come from a database, files, or API

knowledge_base = [
    {
        "id": "doc_001",
        "text": "The central bank raised interest rates by 25 basis points to combat inflation. This decision was made after reviewing economic indicators showing persistent price increases across multiple sectors.",
        "source": "monetary_policy_report_q3.pdf"
    },
    {
        "id": "doc_002",
        "text": "Higher borrowing costs are expected to slow consumer spending and reduce inflationary pressure. The central bank indicated further rate increases may follow if inflation remains elevated.",
        "source": "monetary_policy_report_q3.pdf"
    },
    {
        "id": "doc_003",
        "text": "Mortgage rates have risen to their highest level in two decades, causing a significant slowdown in the housing market. Home sales declined 15% compared to the previous quarter.",
        "source": "housing_market_analysis.pdf"
    },
    {
        "id": "doc_004",
        "text": "The Federal Reserve's dual mandate requires balancing maximum employment with price stability. Current policy prioritizes inflation control over employment growth.",
        "source": "fed_policy_overview.pdf"
    },
    {
        "id": "doc_005",
        "text": "Bank earnings improved as net interest margins widened due to higher rates. Financial sector stocks outperformed the broader market this quarter.",
        "source": "quarterly_earnings_summary.pdf"
    },
    {
        "id": "doc_006",
        "text": "Small businesses report difficulty accessing credit as lending standards tighten. The cost of business loans has increased substantially since rate hikes began.",
        "source": "small_business_survey.pdf"
    },
    {
        "id": "doc_007",
        "text": "International markets reacted strongly to the rate decision, with currency fluctuations affecting trade balances. Emerging markets face capital outflow pressures.",
        "source": "global_markets_report.pdf"
    },
    {
        "id": "doc_008",
        "text": "The championship football match ended in a dramatic penalty shootout. The home team secured victory after their goalkeeper saved three consecutive penalties.",
        "source": "sports_news.pdf"
    }
]

print(f"Knowledge base loaded: {len(knowledge_base)} documents")
print("\nSources:")
for source in set(doc['source'] for doc in knowledge_base):
    count = sum(1 for doc in knowledge_base if doc['source'] == source)
    print(f"  - {source}: {count} document(s)")
# Create vector index from knowledge base
# This is the "offline" step - done once when documents change

# Extract texts and generate embeddings
texts = [doc["text"] for doc in knowledge_base]
doc_embeddings = model.encode(texts, normalize_embeddings=True)

# Create FAISS index (inner product = cosine similarity for normalized vectors)
dimension = doc_embeddings.shape[1]  # 384
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings.astype('float32'))

print(f"FAISS index created:")
print(f"  - Dimension: {dimension}")
print(f"  - Documents indexed: {index.ntotal}")

7.8 Implementing the Retriever#

The retriever’s job: find the most relevant chunks for a query.

Key Decisions#

| Parameter | Trade-off |
| --- | --- |
| k (number of results) | More context vs. more noise |
| Score threshold | Precision vs. recall |
| Metadata filtering | Targeted vs. comprehensive |

class FAISSRetriever(Retriever):
    """Retriever using FAISS vector index."""
    
    def __init__(self, index, documents, embedding_model):
        self.index = index
        self.documents = documents
        self.model = embedding_model
    
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
        """Retrieve top-k relevant chunks for the query."""
        # Encode query
        query_vec = self.model.encode(
            query, 
            normalize_embeddings=True,
            convert_to_numpy=True
        ).astype('float32').reshape(1, -1)
        
        # Search index
        scores, indices = self.index.search(query_vec, k)
        
        # Build result list
        results = []
        for score, idx in zip(scores[0], indices[0]):
            doc = self.documents[idx]
            results.append(RetrievedChunk(
                text=doc["text"],
                score=float(score),
                source=doc["source"]
            ))
        
        return results

# Create retriever instance
retriever = FAISSRetriever(index, knowledge_base, model)
print("Retriever created successfully")
# Test the retriever
test_query = "Why did the central bank raise interest rates?"

chunks = retriever.retrieve(test_query, k=3)

print(f"Query: {test_query}")
print(f"\nTop {len(chunks)} retrieved chunks:")
print("=" * 70)
for i, chunk in enumerate(chunks, 1):
    print(f"\n{i}. [Score: {chunk.score:.3f}] Source: {chunk.source}")
    print(f"   {chunk.text[:100]}...")

7.9 Building Evidence-First Prompts#

The prompt is where retrieval meets generation. A well-structured RAG prompt puts the evidence first, constrains the model to that evidence, and gives it explicit permission to refuse.

Evidence-First Prompting Principles#

| Principle | Implementation |
| --- | --- |
| Context before question | Retrieved evidence appears first |
| Explicit grounding instruction | “Answer ONLY based on the provided context” |
| Refusal permission | “If the context doesn’t contain the answer, say so” |
| Clear structure | Labeled sections: CONTEXT, QUESTION, ANSWER |

class RAGPromptBuilder(PromptBuilder):
    """Builds evidence-first RAG prompts."""
    
    TEMPLATE = """You are a helpful assistant that answers questions based ONLY on the provided context.

IMPORTANT RULES:
1. Answer ONLY using information from the CONTEXT below
2. If the context does not contain enough information to answer, say "I don't have enough information to answer this question."
3. Do not use any prior knowledge or make assumptions
4. Keep your answer concise and directly relevant to the question

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
    
    def build(self, chunks: List[RetrievedChunk], question: str) -> str:
        """Build the RAG prompt from chunks and question."""
        # Format context from retrieved chunks
        context_parts = []
        for i, chunk in enumerate(chunks, 1):
            context_parts.append(f"[{i}] {chunk.text}")
        
        context = "\n\n".join(context_parts)
        
        # Build final prompt
        return self.TEMPLATE.format(
            context=context,
            question=question
        )

# Create prompt builder
prompt_builder = RAGPromptBuilder()
print("Prompt builder created")
# Test the prompt builder
rag_prompt = prompt_builder.build(chunks, test_query)

print("Generated RAG Prompt:")
print("=" * 70)
print(rag_prompt)
print("=" * 70)

7.10 Connecting to the Generator#

The generator calls the LLM API. We’ll reuse patterns from Module 6.

class LLMGenerator(Generator):
    """Generator that calls an LLM API."""
    
    def __init__(self, base_url: str, api_key: Optional[str] = None, model: str = "llama3.1:8b"):
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
    
    def generate(self, prompt: str, temperature: float = 0.1) -> str:
        """Generate response from LLM."""
        headers = {
            "Content-Type": "application/json",
            "ngrok-skip-browser-warning": "true"
        }
        
        # Determine endpoint based on whether we have an API key
        use_jbchat = self.api_key and self.api_key != "<provided-by-instructor>"
        
        if use_jbchat:
            headers["X-API-Key"] = self.api_key
            endpoint = f"{self.base_url}/chat/direct"
            payload = {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
                "stream": False
            }
        else:
            endpoint = f"{self.base_url}/api/chat"
            payload = {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "stream": False
            }
        
        try:
            response = requests.post(
                endpoint, 
                headers=headers, 
                json=payload, 
                timeout=60
            )
            response.raise_for_status()
            return response.json()["message"]["content"]
        except Exception as e:
            return f"[Generation Error: {e}]"

# Create generator
generator = LLMGenerator(LLM_BASE_URL, LLM_API_KEY, DEFAULT_MODEL)
print(f"Generator created: {LLM_BASE_URL}")

7.11 The Complete RAG Pipeline#

Now we combine all components into a complete pipeline.

class RAGPipeline:
    """Complete RAG pipeline combining retrieval and generation."""
    
    def __init__(self, retriever: Retriever, prompt_builder: PromptBuilder, generator: Generator):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
    
    def answer(self, question: str, k: int = 3, verbose: bool = False) -> dict:
        """Answer a question using RAG.
        
        Returns dict with:
        - answer: The generated response
        - chunks: Retrieved chunks used
        - prompt: The constructed prompt
        """
        # Step 1: Retrieve relevant chunks
        chunks = self.retriever.retrieve(question, k=k)
        
        if verbose:
            print(f"Retrieved {len(chunks)} chunks:")
            for i, c in enumerate(chunks, 1):
                print(f"  {i}. [{c.score:.3f}] {c.text[:50]}...")
            print()
        
        # Step 2: Build prompt
        prompt = self.prompt_builder.build(chunks, question)
        
        if verbose:
            print("Prompt built. Calling LLM...")
        
        # Step 3: Generate answer
        answer = self.generator.generate(prompt)
        
        return {
            "answer": answer,
            "chunks": chunks,
            "prompt": prompt
        }

# Create the complete pipeline
rag = RAGPipeline(retriever, prompt_builder, generator)
print("RAG pipeline created!")
# Test the complete pipeline
question = "Why did the central bank raise interest rates?"

print(f"Question: {question}")
print("=" * 70)

result = rag.answer(question, k=3, verbose=True)

print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
# Test with a question NOT in the knowledge base
# A good RAG system should refuse to answer

question_out_of_scope = "What is the weather forecast for tomorrow?"

print(f"Question: {question_out_of_scope}")
print("(This question is NOT covered by our knowledge base)")
print("=" * 70)

result = rag.answer(question_out_of_scope, k=3, verbose=True)

print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
print("\n✅ A well-designed RAG system should refuse or indicate uncertainty")
print("   when the retrieved context doesn't support an answer.")

Group 4: Failure Modes & Guardrails#

What can go wrong in RAG systems

| Section | Topic |
| --- | --- |
| 7.12 | RAG-Specific Failure Modes |
| 7.13 | The Near-Miss Problem |
| 7.14 | Implementing Guardrails |
| 7.15 | When to Refuse |


7.12 RAG-Specific Failure Modes#

RAG reduces hallucination risk but introduces new failure modes:

| Failure Mode | Description | Impact |
| --- | --- | --- |
| Wrong but similar chunks | Retrieval returns plausible but incorrect context | Grounded hallucination |
| Missing relevant chunks | Best evidence not retrieved | Incomplete answer |
| Conflicting evidence | Multiple chunks contradict each other | Confused response |
| Context overflow | Too many chunks, model loses focus | Noise in answer |
| Stale data | Knowledge base not updated | Outdated information |
| Citation hallucination | Model cites sources that don’t exist | False attribution |

Key Insight#

RAG shifts the risk from “model makes up facts” to “retrieval returns wrong evidence.”

Both are problems, but retrieval failures are usually more controllable: they can be measured, logged, and fixed at the component level.


7.13 The Near-Miss Problem#

The most dangerous failure in RAG: near-misses.

What is a Near-Miss?#

A chunk that is:

  • Semantically similar to the query (high retrieval score)

  • Factually different from what’s needed

Example#

| Query | Retrieved Chunk | Problem |
| --- | --- | --- |
| “What is Apple’s revenue?” | “Apple reported Q2 revenue…” | Wrong quarter |
| “What is the refund policy?” | “Our 2022 refund policy states…” | Outdated policy |
| “What did the CEO say about AI?” | “The CTO commented on AI…” | Wrong person |

Why Near-Misses are Dangerous#

  1. High confidence: Model thinks it has good evidence

  2. Plausible output: Answer sounds correct

  3. Hard to detect: No obvious error signal

  4. User trust: Grounded answers seem authoritative

# Demonstrate near-miss retrieval
# Our knowledge base has sports content (the football doc)
# that could be a near-miss for unrelated queries

query_finance = "What happened in the championship game?"

# This will retrieve the football doc even though our KB is mostly finance
chunks = retriever.retrieve(query_finance, k=3)

print(f"Query: {query_finance}")
print("\nRetrieved chunks:")
for i, chunk in enumerate(chunks, 1):
    print(f"\n{i}. [Score: {chunk.score:.3f}]")
    print(f"   {chunk.text[:80]}...")

print("\n⚠️  Notice: The sports document (doc_008) is retrieved")
print("   because 'championship' matches, even though our KB")
print("   is primarily about monetary policy.")

7.14 Implementing Guardrails#

Guardrails protect RAG systems from failure modes:

| Guardrail | What It Does | When to Use |
| --- | --- | --- |
| Score threshold | Reject low-confidence retrieval | Always |
| Chunk count validation | Ensure minimum evidence | Critical queries |
| Source validation | Verify chunks come from trusted sources | Regulated domains |
| Response length check | Detect overly brief/long answers | Quality control |
| Faithfulness check | Verify the answer uses the context | High-stakes answers |

class RAGPipelineWithGuardrails:
    """RAG pipeline with configurable guardrails."""
    
    def __init__(
        self, 
        retriever: Retriever, 
        prompt_builder: PromptBuilder, 
        generator: Generator,
        min_score: float = 0.3,
        min_chunks: int = 1
    ):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
        self.min_score = min_score
        self.min_chunks = min_chunks
    
    def answer(self, question: str, k: int = 3) -> dict:
        """Answer with guardrails applied."""
        # Step 1: Retrieve
        all_chunks = self.retriever.retrieve(question, k=k)
        
        # Guardrail 1: Filter by score threshold
        valid_chunks = [
            c for c in all_chunks 
            if c.score >= self.min_score
        ]
        
        # Guardrail 2: Check minimum chunk count
        if len(valid_chunks) < self.min_chunks:
            return {
                "answer": "I don't have enough relevant information to answer this question confidently.",
                "chunks": all_chunks,
                "refused": True,
                "reason": f"Only {len(valid_chunks)} chunks above threshold {self.min_score}"
            }
        
        # Step 2: Build prompt with valid chunks only
        prompt = self.prompt_builder.build(valid_chunks, question)
        
        # Step 3: Generate
        answer = self.generator.generate(prompt)
        
        return {
            "answer": answer,
            "chunks": valid_chunks,
            "refused": False
        }

# Create pipeline with guardrails
rag_guarded = RAGPipelineWithGuardrails(
    retriever, 
    prompt_builder, 
    generator,
    min_score=0.4,  # Require 40% similarity
    min_chunks=2    # Require at least 2 relevant chunks
)
print("Guarded RAG pipeline created")
print(f"  - Min score threshold: 0.4")
print(f"  - Min chunks required: 2")
# Test guardrails with a well-covered question
print("Test 1: Question well-covered by knowledge base")
print("=" * 70)

result = rag_guarded.answer("Why did the central bank raise rates?")
print(f"Refused: {result['refused']}")
print(f"Chunks used: {len(result['chunks'])}")
print(f"\nAnswer: {result['answer'][:200]}...")
# Test guardrails with an out-of-scope question
print("Test 2: Question NOT covered by knowledge base")
print("=" * 70)

result = rag_guarded.answer("What is the best programming language?")
print(f"Refused: {result['refused']}")
if result['refused']:
    print(f"Reason: {result['reason']}")
print(f"\nAnswer: {result['answer']}")

print("\n✅ Guardrails prevent the system from hallucinating")
print("   when retrieval doesn't find relevant evidence.")

7.15 When to Refuse#

Refusal is a feature, not a failure.

When RAG Should Refuse#

| Condition | Action |
| --- | --- |
| No chunks above score threshold | Refuse |
| Chunks are from wrong domain | Refuse |
| Query asks for speculation | Refuse |
| Conflicting evidence | Acknowledge uncertainty |

Refusal Patterns#

| Pattern | Example Response |
| --- | --- |
| No information | “I don’t have information about that topic.” |
| Low confidence | “Based on limited evidence, I cannot confidently answer.” |
| Out of scope | “This question is outside my knowledge base.” |
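
These conditions and patterns can be wired into the validation step before an answer is returned. A minimal sketch, reusing the RetrievedChunk and retriever objects defined earlier; the 0.4 threshold and the exact wording are illustrative choices, not fixed rules.

def refusal_message(chunks: List[RetrievedChunk], min_score: float = 0.4) -> Optional[str]:
    """Return a refusal string when the evidence is too weak, otherwise None."""
    if not chunks:
        return "I don't have information about that topic."
    if max(c.score for c in chunks) < min_score:
        return "Based on limited evidence, I cannot confidently answer."
    return None  # evidence looks strong enough; proceed to generation

# No evidence at all -> "no information" pattern
print(refusal_message([]))

# Off-topic query against a monetary-policy corpus -> likely the low-confidence pattern
weak_chunks = retriever.retrieve("What is the best programming language?", k=3)
print(refusal_message(weak_chunks))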

Enterprise Reality#

In regulated environments, a wrong answer is far more costly than no answer.

Banks, healthcare, legal: refusal is risk management.


Group 5: Production RAG#

Deploying evaluable, auditable RAG systems

| Section | Topic |
| --- | --- |
| 7.16 | Evaluating Retrieval Quality |
| 7.17 | Evaluating Generation Faithfulness |
| 7.18 | Caching and Performance |
| 7.19 | Observability and Audit Trails |
| 7.20 | RAG as a Platform Capability |


7.16 Evaluating Retrieval Quality#

RAG quality starts with retrieval quality. Poor retrieval = poor answers.

Retrieval Metrics#

| Metric | What It Measures | How to Compute |
| --- | --- | --- |
| Precision@k | Relevant chunks in top-k | Relevant retrieved / k |
| Recall@k | Coverage of all relevant chunks | Relevant retrieved / total relevant |
| MRR | Position of first relevant chunk | 1 / rank of first relevant, averaged over queries |
| NDCG | Ranking quality weighted by position | Position-discounted gain, normalized by the ideal ranking |
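
Given relevance judgments for a set of test queries, the first three metrics are a few lines of code each. A minimal sketch; the retrieved and relevant document IDs in the example are hypothetical labels you would create for your own evaluation set, not outputs of our retriever.

def precision_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids: List[str], relevant_ids: set) -> float:
    """1 / rank of the first relevant document (0 if none retrieved). MRR averages this over queries."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical example: two documents are labeled relevant for one test query
retrieved = ["doc_003", "doc_001", "doc_005"]
relevant = {"doc_001", "doc_002"}
print(f"Precision@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # 1 of 3 -> 0.33
print(f"Recall@3:    {recall_at_k(retrieved, relevant, 3):.2f}")     # 1 of 2 -> 0.50
print(f"RR:          {reciprocal_rank(retrieved, relevant):.2f}")    # first hit at rank 2 -> 0.50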

Practical Evaluation#

For most applications, simple checks work:

  1. Does the top chunk answer the question?

  2. Are retrieved chunks from appropriate sources?

  3. Is retrieval score reasonable?

def evaluate_retrieval(query: str, expected_keywords: list, k: int = 3):
    """Simple retrieval evaluation."""
    chunks = retriever.retrieve(query, k=k)
    
    print(f"Query: {query}")
    print(f"Expected keywords: {expected_keywords}")
    print("\nRetrieved chunks:")
    
    hits = 0
    for i, chunk in enumerate(chunks, 1):
        text_lower = chunk.text.lower()
        matched = [kw for kw in expected_keywords if kw.lower() in text_lower]
        
        status = "✅" if matched else "❌"
        hits += 1 if matched else 0
        
        print(f"  {i}. [{chunk.score:.3f}] {status} Keywords: {matched}")
        print(f"     {chunk.text[:60]}...")
    
    precision = hits / k
    print(f"\nPrecision@{k}: {precision:.1%}")
    return precision

# Evaluate retrieval for a test query
evaluate_retrieval(
    "Why did interest rates increase?",
    ["interest", "rate", "inflation", "central bank"]
)

7.17 Evaluating Generation Faithfulness#

Faithfulness: Does the answer only use information from the provided context?

Faithfulness Checks#

| Check | Question |
| --- | --- |
| Grounding | Can every claim be traced to context? |
| No hallucination | Does the answer avoid inventing facts? |
| Appropriate refusal | Does it refuse when context is insufficient? |

Manual Evaluation Template#

For each answer, ask:

  1. Is this answer supported by the retrieved chunks? (Yes/No/Partial)

  2. Does the answer add information not in the chunks? (Yes/No)

  3. Is the answer’s confidence appropriate? (Yes/No)

def check_faithfulness_simple(answer: str, chunks: List[RetrievedChunk]) -> dict:
    """Simple faithfulness heuristics."""
    # Combine all chunk text
    context_text = " ".join(c.text.lower() for c in chunks)
    answer_lower = answer.lower()
    
    # Check for common hallucination signals
    hallucination_phrases = [
        "i think", "probably", "might be", "i believe",
        "generally speaking", "in my opinion", "typically"
    ]
    
    found_phrases = [p for p in hallucination_phrases if p in answer_lower]
    
    # Check if answer is appropriately uncertain when needed
    uncertainty_phrases = [
        "don't have", "cannot", "no information", 
        "not mentioned", "unclear"
    ]
    shows_uncertainty = any(p in answer_lower for p in uncertainty_phrases)
    
    return {
        "potential_hallucination_signals": found_phrases,
        "shows_uncertainty": shows_uncertainty,
        "answer_length": len(answer),
        "context_length": len(context_text)
    }

# Test on a RAG response
result = rag.answer("What is the current inflation rate?")

print("Question: What is the current inflation rate?")
print(f"\nAnswer: {result['answer']}")
print("\nFaithfulness analysis:")
analysis = check_faithfulness_simple(result['answer'], result['chunks'])
for key, value in analysis.items():
    print(f"  {key}: {value}")

7.18 Caching and Performance#

Production RAG systems need performance optimization.

What to Cache#

| Component | Cache Strategy | Invalidation |
| --- | --- | --- |
| Document embeddings | Precompute, persist | On document change |
| Query embeddings | LRU cache | Time-based |
| Retrieval results | Query hash → chunks | On index update |
| LLM responses | Prompt hash → answer | Careful—may go stale |

Latency Breakdown#

Typical RAG latency:

| Step | Typical Time |
| --- | --- |
| Query embedding | 50-100ms |
| Vector search | 10-50ms |
| LLM generation | 500-3000ms |
| Total | ~600-3000ms |

LLM generation dominates. Cache carefully.

import hashlib

class CachedRetriever:
    """Retriever with query caching."""
    
    def __init__(self, base_retriever: Retriever, cache_size: int = 100):
        self.base = base_retriever
        self.cache = {}
        self.cache_size = cache_size
        self.hits = 0
        self.misses = 0
    
    def _cache_key(self, query: str, k: int) -> str:
        return hashlib.md5(f"{query}:{k}".encode()).hexdigest()
    
    def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
        key = self._cache_key(query, k)
        
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        
        self.misses += 1
        result = self.base.retrieve(query, k)
        
        # Simple cache with size limit
        if len(self.cache) >= self.cache_size:
            # Remove oldest entry (simple strategy)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        
        self.cache[key] = result
        return result
    
    def stats(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {"hits": self.hits, "misses": self.misses, "hit_rate": hit_rate}

# Demo caching
cached_retriever = CachedRetriever(retriever)

# First query - cache miss
cached_retriever.retrieve("interest rates", k=3)
print(f"After first query: {cached_retriever.stats()}")

# Same query - cache hit
cached_retriever.retrieve("interest rates", k=3)
print(f"After same query:  {cached_retriever.stats()}")

# Different query - cache miss
cached_retriever.retrieve("mortgage rates", k=3)
print(f"After new query:   {cached_retriever.stats()}")

7.19 Observability and Audit Trails#

Production RAG requires comprehensive logging for:

| Purpose | What to Log |
| --- | --- |
| Debugging | Query, chunks, prompt, response |
| Quality monitoring | Retrieval scores, response latency |
| Compliance | User ID, timestamp, sources cited |
| Improvement | Failed queries, low-confidence responses |

Audit Trail Structure#

{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "user-123",
  "query": "original question",
  "retrieval": {
    "chunk_ids": ["doc_001", "doc_002"],
    "scores": [0.85, 0.72],
    "latency_ms": 45
  },
  "generation": {
    "model": "llama3.1:8b",
    "prompt_tokens": 450,
    "response_tokens": 120,
    "latency_ms": 1200
  },
  "response": "final answer",
  "refused": false
}
import uuid
from datetime import datetime

class AuditableRAGPipeline:
    """RAG pipeline with audit logging."""
    
    def __init__(self, retriever, prompt_builder, generator):
        self.retriever = retriever
        self.prompt_builder = prompt_builder
        self.generator = generator
        self.audit_log = []
    
    def answer(self, question: str, user_id: str = "anonymous", k: int = 3) -> dict:
        request_id = str(uuid.uuid4())
        start_time = time.time()
        
        # Retrieval
        retrieval_start = time.time()
        chunks = self.retriever.retrieve(question, k=k)
        retrieval_ms = (time.time() - retrieval_start) * 1000
        
        # Prompt building
        prompt = self.prompt_builder.build(chunks, question)
        
        # Generation
        generation_start = time.time()
        answer = self.generator.generate(prompt)
        generation_ms = (time.time() - generation_start) * 1000
        
        total_ms = (time.time() - start_time) * 1000
        
        # Build audit record
        audit_record = {
            "request_id": request_id,
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "query": question,
            "retrieval": {
                "chunk_sources": [c.source for c in chunks],
                "scores": [c.score for c in chunks],
                "latency_ms": round(retrieval_ms, 1)
            },
            "generation": {
                "latency_ms": round(generation_ms, 1)
            },
            "total_latency_ms": round(total_ms, 1),
            "response_length": len(answer)
        }
        
        self.audit_log.append(audit_record)
        
        return {
            "answer": answer,
            "request_id": request_id,
            "chunks": chunks
        }
    
    def get_audit_log(self):
        return self.audit_log

# Create auditable pipeline
auditable_rag = AuditableRAGPipeline(retriever, prompt_builder, generator)

# Make a query
result = auditable_rag.answer(
    "What impact did rate hikes have on mortgages?",
    user_id="user-42"
)

print("Answer:", result["answer"][:100], "...")
print(f"\nRequest ID: {result['request_id']}")
print("\nAudit Record:")
print(json.dumps(auditable_rag.audit_log[-1], indent=2))

7.20 RAG as a Platform Capability#

In enterprise settings, RAG becomes a platform—not a one-off feature.

Platform Characteristics#

| Aspect | Implementation |
| --- | --- |
| Multi-tenant | Different knowledge bases per team/product |
| Swappable components | Change LLM without rebuilding |
| Configurable guardrails | Different thresholds per use case |
| Centralized logging | Unified audit across all RAG apps |
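
In code, "platform" often means configuration-driven assembly: each use case gets its own corpus, model, and guardrail thresholds while the components stay shared. A minimal sketch reusing the guarded pipeline from Section 7.14; the tenant names and settings are illustrative assumptions, and a real platform would also select a per-tenant retriever and index.

# Hypothetical per-use-case policy settings
TENANT_CONFIG = {
    "support-bot":     {"min_score": 0.35, "min_chunks": 1},
    "legal-assistant": {"min_score": 0.55, "min_chunks": 2},
}

def build_pipeline_for(tenant: str) -> RAGPipelineWithGuardrails:
    """Assemble a guarded pipeline from a tenant's policy configuration."""
    cfg = TENANT_CONFIG[tenant]
    return RAGPipelineWithGuardrails(
        retriever,        # a real platform would use the tenant's own index here
        prompt_builder,
        generator,
        min_score=cfg["min_score"],
        min_chunks=cfg["min_chunks"],
    )

legal_rag = build_pipeline_for("legal-assistant")
print(f"legal-assistant pipeline: min_score={legal_rag.min_score}, min_chunks={legal_rag.min_chunks}")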

Evolution Path#

Prototype RAG          Production RAG         Platform RAG
     │                      │                      │
     │ Single corpus        │ Multiple corpora     │ Self-service corpora
     │ One model            │ Model selection      │ Model marketplace
     │ No guardrails        │ Fixed guardrails     │ Configurable policies
     │ No logging           │ Basic logging        │ Full observability
     ▼                      ▼                      ▼

Key Insight#

RAG at scale is about governance, not just generation.

Who can access which knowledge? What gets logged? How do we audit?


Module Summary#

Key Takeaways#

| Concept | Remember |
| --- | --- |
| Why RAG | LLMs have no runtime access to your data |
| RAG Architecture | Component-based: Retriever → Prompt Builder → Generator → Validator |
| Near-misses | Most dangerous failure: semantically similar but factually different |
| Guardrails | Score thresholds and refusal are features, not failures |
| Evaluation | Measure retrieval quality and generation faithfulness separately |
| Production | Logging, caching, and audit trails are mandatory |

The RAG Mental Model#

RAG is how we turn LLMs from storytellers into assistants grounded in evidence.

It is an architectural discipline, not a prompt trick.

What’s Next#

You now have all the components to build production AI systems:

  • Module 5: Embeddings and retrieval

  • Module 6: LLM API engineering

  • Module 7: RAG pipelines

The assessment will test your ability to combine these into a working system.


Practice Exercises#

Exercise 1: Adjust Retrieval Parameters#

Modify the retriever to use k=5 instead of k=3. How does this affect answer quality?

Exercise 2: Custom Guardrails#

Create a guardrail that refuses to answer if retrieved chunks come from more than 2 different sources (potential conflicting evidence).

Exercise 3: Evaluate Your RAG#

Write 5 test questions and manually evaluate:

  1. Retrieval precision (are the right chunks retrieved?)

  2. Generation faithfulness (does the answer use only the context?)

Exercise 4: Add a New Document#

Add a new document to the knowledge base about cryptocurrency regulation. Test that queries about crypto now return relevant results.