Module 7 — RAG Pipelines
Retrieval-Augmented Generation as an Engineering System
What This Module Covers#
| Group | Topic | Key Skill |
|---|---|---|
| 1 | Why RAG Exists | Understand the fundamental problem RAG solves |
| 2 | RAG Architecture | Design component-based RAG systems |
| 3 | Building RAG Pipelines | Implement end-to-end retrieval and generation |
| 4 | Failure Modes & Guardrails | Handle RAG-specific failure cases |
| 5 | Production RAG | Deploy evaluable, auditable RAG systems |
Learning Objectives#
By the end of this module, you will be able to:
Explain why RAG is necessary for real-world LLM applications
Design RAG pipelines with clear component boundaries
Implement retrieval, prompt construction, and generation
Handle failure modes including near-misses and low-confidence retrieval
Evaluate both retrieval quality and generation faithfulness
Apply RAG patterns to enterprise scenarios
Prerequisites#
This module builds directly on:
| Module | Concepts Used Here |
|---|---|
| Module 3 | LLM behavior, hallucination patterns |
| Module 4 | How models learn patterns, not truth |
| Module 5 | Embeddings, vector similarity, FAISS retrieval |
| Module 6 | LLM API clients, retries, structured output |
Module 7 is where everything comes together.
Setup#
Run this cell to install dependencies and configure the environment.
!pip -q install sentence-transformers faiss-cpu requests
import numpy as np
import requests
import json
import time
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict, Optional
# Load embedding model (same as Module 5)
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Embedding model loaded: all-MiniLM-L6-v2")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print("Setup complete!")
LLM Gateway Configuration#
Configure your LLM endpoint. Choose one option:
| Option | When to Use | Setup |
|---|---|---|
| Pinggy Tunnel | Running Ollama locally | Start tunnel, paste URL |
| JBChat Server | Classroom setting | Get API key from instructor |
Option A: Pinggy Tunnel (Local Ollama)#
# Terminal 1: Start Ollama
OLLAMA_HOST=0.0.0.0 ollama serve
# Terminal 2: Start Pinggy tunnel
ssh -p 443 -R0:localhost:11434 -L4300:localhost:4300 a.pinggy.io
Option B: JBChat Server#
Get the API key from your instructor.
# ------ OPTION A: Pinggy Tunnel (for local Ollama) ------
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None
# ------ OPTION B: JBChat Server (classroom) ------
LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "<provided-by-instructor>" # Get from instructor
DEFAULT_MODEL = "llama3.1:8b"
print(f"LLM endpoint: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")
Group 1: Why RAG Exists#
The fundamental problem RAG solves
| Section | Topic |
|---|---|
| 7.1 | The Knowledge Gap Problem |
| 7.2 | What RAG Actually Does |
| 7.3 | RAG vs Fine-Tuning vs Prompting |
7.1 The Knowledge Gap Problem#
LLMs have a fundamental limitation that no amount of prompting can fix:
LLMs have no access to your data at the moment they answer a question.
What LLMs Know#
| Knowledge Type | Available? | Example |
|---|---|---|
| Training data (pre-cutoff) | ✅ Yes | “What is Python?” |
| Recent events (post-cutoff) | ❌ No | “What happened yesterday?” |
| Your internal documents | ❌ No | “What’s our refund policy?” |
| Your database records | ❌ No | “What’s customer #12345’s status?” |
| Private company data | ❌ No | “What were Q3 sales?” |
The Hallucination Risk#
When asked about information they don’t have, LLMs don’t say “I don’t know.”
They confidently fabricate plausible-sounding answers.
This is not a bug—it’s how language models work. They generate probable continuations of text, whether or not those continuations are factually correct.
# Demonstration: LLMs will answer questions about data they've never seen
# Imagine asking an LLM about your company's internal policy
hypothetical_question = "What is Acme Corp's work-from-home policy?"
# Without RAG, the LLM might respond:
hypothetical_hallucination = """
Acme Corp allows employees to work from home up to 3 days per week.
Employees must be available during core hours (10am-4pm) and attend
all mandatory team meetings in person.
"""
print("Question:", hypothetical_question)
print("\nPotential LLM response (hallucinated):")
print(hypothetical_hallucination)
print("⚠️ This sounds authoritative but is completely fabricated!")
print(" The LLM has never seen Acme Corp's actual policy.")
7.2 What RAG Actually Does#
RAG = Retrieval-Augmented Generation
RAG solves the knowledge gap by providing relevant information at runtime:
┌─────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ User Question ─────► Retriever ─────► Relevant Chunks │
│ │ │ │
│ │ ▼ │
│ │ Prompt Builder │
│ │ │ │
│ │ ▼ │
│ │ LLM Generation │
│ │ │ │
│ │ ▼ │
│ └──────────► Grounded Answer │
│ │
└─────────────────────────────────────────────────────────────┘
RAG Does NOT:#
| Misconception | Reality |
|---|---|
| Make the model smarter | Model is unchanged |
| Retrain the model | No training occurs |
| Eliminate hallucinations | Reduces risk, doesn’t eliminate |
| Guarantee correctness | Still requires validation |
RAG DOES:#
| Capability | Benefit |
|---|---|
| Provide runtime evidence | Answers based on actual data |
| Ground generation | Model cites provided context |
| Enable auditability | Can trace answer to source |
| Support updates | New data available immediately |
7.3 RAG vs Fine-Tuning vs Prompting#
RAG is one of several approaches to customizing LLM behavior:
| Approach | What It Does | When to Use | Limitations |
|---|---|---|---|
| Prompting | Provides instructions in context | General behavior guidance | No access to external data |
| Fine-tuning | Modifies model weights | Teaching new skills/patterns | Expensive, data goes stale |
| RAG | Retrieves relevant data at runtime | Grounding in specific knowledge | Retrieval quality matters |
When RAG is the Right Choice#
✅ Use RAG when:
You need answers grounded in specific documents
Data changes frequently
You need to cite sources
You need auditability
❌ Don’t use RAG when:
Teaching the model a new task format
The knowledge is general/public
Real-time retrieval is too slow
Enterprise Reality#
Most enterprise LLM applications require RAG.
Fine-tuning teaches how to respond. RAG provides what to respond about.
Group 2: RAG Architecture#
Designing component-based RAG systems
| Section | Topic |
|---|---|
| 7.4 | RAG as an Architectural Pattern |
| 7.5 | The Four Core Components |
| 7.6 | Data Flow and Dependencies |
7.4 RAG as an Architectural Pattern#
RAG is not a single function or library—it’s an architectural pattern.
Key Insight#
RAG is a pipeline of composable components, each testable independently.
This matters because:
Principle |
Benefit |
|---|---|
Separation of concerns |
Each component has one job |
Independent testing |
Debug retrieval separate from generation |
Swappable parts |
Change embedding model without changing LLM |
Clear failure attribution |
Know which component failed |
Anti-Pattern: The Monolithic RAG Function#
# ❌ BAD: Everything in one function
def answer_question(query):
# Embed, retrieve, build prompt, call LLM, parse response...
# 200 lines of tangled logic
pass
Pattern: Component-Based RAG#
# ✅ GOOD: Clear component boundaries
query_embedding = embedder.encode(query)
chunks = retriever.search(query_embedding, k=5)
prompt = prompt_builder.build(chunks, query)
response = generator.generate(prompt)
answer = validator.validate(response)
7.5 The Four Core Components#
Every RAG system has these components (even if combined):
1. Retriever#
| Responsibility | Implementation |
|---|---|
| Find relevant chunks | Vector similarity search |
| Return ranked results | Top-k with scores |
| Preserve metadata | Source, page, timestamp |
2. Prompt Builder#
| Responsibility | Implementation |
|---|---|
| Structure the prompt | Template with placeholders |
| Inject retrieved context | Format chunks clearly |
| Constrain the model | Instructions for grounded answers |
3. Generator#
| Responsibility | Implementation |
|---|---|
| Call the LLM API | HTTP client with retries |
| Handle failures | Timeout, rate limits |
| Return response | Raw text or structured |
4. Validator (Optional but Recommended)#
| Responsibility | Implementation |
|---|---|
| Check response quality | Length, format, content |
| Detect hallucination signals | Claims not in context |
| Trigger fallback | Refusal or retry |
# Component interfaces (contracts)
# These define what each component must do
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple
@dataclass
class RetrievedChunk:
"""A chunk retrieved from the knowledge base."""
text: str
score: float
source: str = "unknown"
class Retriever(ABC):
"""Interface for retrieval components."""
@abstractmethod
def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
pass
class PromptBuilder(ABC):
"""Interface for prompt construction."""
@abstractmethod
def build(self, chunks: List[RetrievedChunk], question: str) -> str:
pass
class Generator(ABC):
"""Interface for LLM generation."""
@abstractmethod
def generate(self, prompt: str) -> str:
pass
print("Component interfaces defined:")
print(" - RetrievedChunk: data class for retrieved content")
print(" - Retriever: find relevant chunks")
print(" - PromptBuilder: construct RAG prompt")
print(" - Generator: call LLM and return response")
7.6 Data Flow and Dependencies#
Understanding data flow helps debug RAG systems:
┌───────────────────────────────────────────────────────────────────────┐
│ RAG Data Flow │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ [User Query] │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ Embedder │────►│ Query Vector │ │
│ └─────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Vector DB │────►│ Retriever │────►│ Retrieved Chunks │ │
│ └─────────────┘ └─────────────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ [User Query] ──────────────────────────► ┌───────────────┐ │
│ │ Prompt Builder│ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ RAG Prompt │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Generator │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ [Grounded Answer] │
│ │
└───────────────────────────────────────────────────────────────────────┘
Dependency Matrix#
| Component | Depends On | Produces |
|---|---|---|
| Embedder | Query text | Query vector |
| Retriever | Query vector, Vector DB | Ranked chunks |
| Prompt Builder | Chunks, Query | RAG prompt string |
| Generator | RAG prompt | LLM response |
| Validator | LLM response, Chunks | Final answer |
Group 3: Building RAG Pipelines#
Hands-on implementation
| Section | Topic |
|---|---|
| 7.7 | Setting Up the Knowledge Base |
| 7.8 | Implementing the Retriever |
| 7.9 | Building Evidence-First Prompts |
| 7.10 | Connecting to the Generator |
| 7.11 | The Complete RAG Pipeline |
7.7 Setting Up the Knowledge Base#
A RAG system needs a knowledge base—documents to retrieve from.
For this module, we’ll use a corpus about central banking and interest rates (same domain as Module 5).
# Knowledge base: documents about monetary policy
# In production, this would come from a database, files, or API
knowledge_base = [
{
"id": "doc_001",
"text": "The central bank raised interest rates by 25 basis points to combat inflation. This decision was made after reviewing economic indicators showing persistent price increases across multiple sectors.",
"source": "monetary_policy_report_q3.pdf"
},
{
"id": "doc_002",
"text": "Higher borrowing costs are expected to slow consumer spending and reduce inflationary pressure. The central bank indicated further rate increases may follow if inflation remains elevated.",
"source": "monetary_policy_report_q3.pdf"
},
{
"id": "doc_003",
"text": "Mortgage rates have risen to their highest level in two decades, causing a significant slowdown in the housing market. Home sales declined 15% compared to the previous quarter.",
"source": "housing_market_analysis.pdf"
},
{
"id": "doc_004",
"text": "The Federal Reserve's dual mandate requires balancing maximum employment with price stability. Current policy prioritizes inflation control over employment growth.",
"source": "fed_policy_overview.pdf"
},
{
"id": "doc_005",
"text": "Bank earnings improved as net interest margins widened due to higher rates. Financial sector stocks outperformed the broader market this quarter.",
"source": "quarterly_earnings_summary.pdf"
},
{
"id": "doc_006",
"text": "Small businesses report difficulty accessing credit as lending standards tighten. The cost of business loans has increased substantially since rate hikes began.",
"source": "small_business_survey.pdf"
},
{
"id": "doc_007",
"text": "International markets reacted strongly to the rate decision, with currency fluctuations affecting trade balances. Emerging markets face capital outflow pressures.",
"source": "global_markets_report.pdf"
},
{
"id": "doc_008",
"text": "The championship football match ended in a dramatic penalty shootout. The home team secured victory after their goalkeeper saved three consecutive penalties.",
"source": "sports_news.pdf"
}
]
print(f"Knowledge base loaded: {len(knowledge_base)} documents")
print("\nSources:")
for source in set(doc['source'] for doc in knowledge_base):
count = sum(1 for doc in knowledge_base if doc['source'] == source)
print(f" - {source}: {count} document(s)")
# Create vector index from knowledge base
# This is the "offline" step - done once when documents change
# Extract texts and generate embeddings
texts = [doc["text"] for doc in knowledge_base]
doc_embeddings = model.encode(texts, normalize_embeddings=True)
# Create FAISS index (inner product = cosine similarity for normalized vectors)
dimension = doc_embeddings.shape[1] # 384
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings.astype('float32'))
print(f"FAISS index created:")
print(f" - Dimension: {dimension}")
print(f" - Documents indexed: {index.ntotal}")
7.8 Implementing the Retriever#
The retriever’s job: find the most relevant chunks for a query.
Key Decisions#
| Parameter | Trade-off |
|---|---|
| k (number of results) | More context vs. more noise |
| Score threshold | Precision vs. recall |
| Metadata filtering | Targeted vs. comprehensive |
class FAISSRetriever(Retriever):
"""Retriever using FAISS vector index."""
def __init__(self, index, documents, embedding_model):
self.index = index
self.documents = documents
self.model = embedding_model
def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
"""Retrieve top-k relevant chunks for the query."""
# Encode query
query_vec = self.model.encode(
query,
normalize_embeddings=True,
convert_to_numpy=True
).astype('float32').reshape(1, -1)
# Search index
scores, indices = self.index.search(query_vec, k)
# Build result list
results = []
for score, idx in zip(scores[0], indices[0]):
doc = self.documents[idx]
results.append(RetrievedChunk(
text=doc["text"],
score=float(score),
source=doc["source"]
))
return results
# Create retriever instance
retriever = FAISSRetriever(index, knowledge_base, model)
print("Retriever created successfully")
# Test the retriever
test_query = "Why did the central bank raise interest rates?"
chunks = retriever.retrieve(test_query, k=3)
print(f"Query: {test_query}")
print(f"\nTop {len(chunks)} retrieved chunks:")
print("=" * 70)
for i, chunk in enumerate(chunks, 1):
print(f"\n{i}. [Score: {chunk.score:.3f}] Source: {chunk.source}")
print(f" {chunk.text[:100]}...")
7.9 Building Evidence-First Prompts#
The prompt is where retrieval meets generation. A well-structured RAG prompt follows the principles below:
Evidence-First Prompting Principles#
| Principle | Implementation |
|---|---|
| Context before question | Retrieved evidence appears first |
| Explicit grounding instruction | “Answer ONLY based on the provided context” |
| Refusal permission | “If the context doesn’t contain the answer, say so” |
| Clear structure | Labeled sections: CONTEXT, QUESTION, ANSWER |
class RAGPromptBuilder(PromptBuilder):
"""Builds evidence-first RAG prompts."""
TEMPLATE = """You are a helpful assistant that answers questions based ONLY on the provided context.
IMPORTANT RULES:
1. Answer ONLY using information from the CONTEXT below
2. If the context does not contain enough information to answer, say "I don't have enough information to answer this question."
3. Do not use any prior knowledge or make assumptions
4. Keep your answer concise and directly relevant to the question
CONTEXT:
{context}
QUESTION: {question}
ANSWER:"""
def build(self, chunks: List[RetrievedChunk], question: str) -> str:
"""Build the RAG prompt from chunks and question."""
# Format context from retrieved chunks
context_parts = []
for i, chunk in enumerate(chunks, 1):
context_parts.append(f"[{i}] {chunk.text}")
context = "\n\n".join(context_parts)
# Build final prompt
return self.TEMPLATE.format(
context=context,
question=question
)
# Create prompt builder
prompt_builder = RAGPromptBuilder()
print("Prompt builder created")
# Test the prompt builder
rag_prompt = prompt_builder.build(chunks, test_query)
print("Generated RAG Prompt:")
print("=" * 70)
print(rag_prompt)
print("=" * 70)
7.10 Connecting to the Generator#
The generator calls the LLM API. We’ll reuse patterns from Module 6.
class LLMGenerator(Generator):
"""Generator that calls an LLM API."""
def __init__(self, base_url: str, api_key: Optional[str] = None, model: str = "llama3.1:8b"):
self.base_url = base_url
self.api_key = api_key
self.model = model
def generate(self, prompt: str, temperature: float = 0.1) -> str:
"""Generate response from LLM."""
headers = {
"Content-Type": "application/json",
"ngrok-skip-browser-warning": "true"
}
# Determine endpoint based on whether we have an API key
use_jbchat = self.api_key and self.api_key != "<provided-by-instructor>"
if use_jbchat:
headers["X-API-Key"] = self.api_key
endpoint = f"{self.base_url}/chat/direct"
payload = {
"model": self.model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"stream": False
}
else:
endpoint = f"{self.base_url}/api/chat"
payload = {
"model": self.model,
"messages": [{"role": "user", "content": prompt}],
"stream": False
}
try:
response = requests.post(
endpoint,
headers=headers,
json=payload,
timeout=60
)
response.raise_for_status()
return response.json()["message"]["content"]
except Exception as e:
return f"[Generation Error: {e}]"
# Create generator
generator = LLMGenerator(LLM_BASE_URL, LLM_API_KEY, DEFAULT_MODEL)
print(f"Generator created: {LLM_BASE_URL}")
7.11 The Complete RAG Pipeline#
Now we combine all components into a complete pipeline.
class RAGPipeline:
"""Complete RAG pipeline combining retrieval and generation."""
def __init__(self, retriever: Retriever, prompt_builder: PromptBuilder, generator: Generator):
self.retriever = retriever
self.prompt_builder = prompt_builder
self.generator = generator
def answer(self, question: str, k: int = 3, verbose: bool = False) -> dict:
"""Answer a question using RAG.
Returns dict with:
- answer: The generated response
- chunks: Retrieved chunks used
- prompt: The constructed prompt
"""
# Step 1: Retrieve relevant chunks
chunks = self.retriever.retrieve(question, k=k)
if verbose:
print(f"Retrieved {len(chunks)} chunks:")
for i, c in enumerate(chunks, 1):
print(f" {i}. [{c.score:.3f}] {c.text[:50]}...")
print()
# Step 2: Build prompt
prompt = self.prompt_builder.build(chunks, question)
if verbose:
print("Prompt built. Calling LLM...")
# Step 3: Generate answer
answer = self.generator.generate(prompt)
return {
"answer": answer,
"chunks": chunks,
"prompt": prompt
}
# Create the complete pipeline
rag = RAGPipeline(retriever, prompt_builder, generator)
print("RAG pipeline created!")
# Test the complete pipeline
question = "Why did the central bank raise interest rates?"
print(f"Question: {question}")
print("=" * 70)
result = rag.answer(question, k=3, verbose=True)
print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
# Test with a question NOT in the knowledge base
# A good RAG system should refuse to answer
question_out_of_scope = "What is the weather forecast for tomorrow?"
print(f"Question: {question_out_of_scope}")
print("(This question is NOT covered by our knowledge base)")
print("=" * 70)
result = rag.answer(question_out_of_scope, k=3, verbose=True)
print("\nRAG Answer:")
print("=" * 70)
print(result["answer"])
print("=" * 70)
print("\n✅ A well-designed RAG system should refuse or indicate uncertainty")
print(" when the retrieved context doesn't support an answer.")
Group 4: Failure Modes & Guardrails#
What can go wrong in RAG systems
| Section | Topic |
|---|---|
| 7.12 | RAG-Specific Failure Modes |
| 7.13 | The Near-Miss Problem |
| 7.14 | Implementing Guardrails |
| 7.15 | When to Refuse |
7.12 RAG-Specific Failure Modes#
RAG reduces hallucination risk but introduces new failure modes:
| Failure Mode | Description | Impact |
|---|---|---|
| Wrong but similar chunks | Retrieval returns plausible but incorrect context | Grounded hallucination |
| Missing relevant chunks | Best evidence not retrieved | Incomplete answer |
| Conflicting evidence | Multiple chunks contradict each other | Confused response |
| Context overflow | Too many chunks, model loses focus | Noise in answer |
| Stale data | Knowledge base not updated | Outdated information |
| Citation hallucination | Model cites sources that don’t exist | False attribution |
Key Insight#
RAG shifts the risk from “model makes up facts” to “retrieval returns wrong evidence.”
Both are problems, but retrieval errors are easier to detect, measure, and fix than free-form hallucination.
7.13 The Near-Miss Problem#
The most dangerous failure in RAG: near-misses.
What is a Near-Miss?#
A chunk that is:
Semantically similar to the query (high retrieval score)
Factually different from what’s needed
Example#
| Query | Retrieved Chunk | Problem |
|---|---|---|
| “What is Apple’s revenue?” | “Apple reported Q2 revenue…” | Wrong quarter |
| “What is the refund policy?” | “Our 2022 refund policy states…” | Outdated policy |
| “What did the CEO say about AI?” | “The CTO commented on AI…” | Wrong person |
Why Near-Misses are Dangerous#
High confidence: Model thinks it has good evidence
Plausible output: Answer sounds correct
Hard to detect: No obvious error signal
User trust: Grounded answers seem authoritative
# Demonstrate near-miss retrieval
# Our knowledge base has sports content (the football doc)
# that could be a near-miss for unrelated queries
query_finance = "What happened in the championship game?"
# This will retrieve the football doc even though our KB is mostly finance
chunks = retriever.retrieve(query_finance, k=3)
print(f"Query: {query_finance}")
print("\nRetrieved chunks:")
for i, chunk in enumerate(chunks, 1):
print(f"\n{i}. [Score: {chunk.score:.3f}]")
print(f" {chunk.text[:80]}...")
print("\n⚠️ Notice: The sports document (doc_008) is retrieved")
print(" because 'championship' matches, even though our KB")
print(" is primarily about monetary policy.")
7.14 Implementing Guardrails#
Guardrails protect RAG systems from failure modes:
| Guardrail | What It Does | When to Use |
|---|---|---|
| Score threshold | Reject low-confidence retrieval | Always |
| Chunk count validation | Ensure minimum evidence | Critical queries |
| Source validation | Verify chunks from trusted sources | Regulated domains |
| Response length check | Detect overly brief/long answers | Quality control |
| Faithfulness check | Verify answer uses context | High-stakes answers |
class RAGPipelineWithGuardrails:
"""RAG pipeline with configurable guardrails."""
def __init__(
self,
retriever: Retriever,
prompt_builder: PromptBuilder,
generator: Generator,
min_score: float = 0.3,
min_chunks: int = 1
):
self.retriever = retriever
self.prompt_builder = prompt_builder
self.generator = generator
self.min_score = min_score
self.min_chunks = min_chunks
def answer(self, question: str, k: int = 3) -> dict:
"""Answer with guardrails applied."""
# Step 1: Retrieve
all_chunks = self.retriever.retrieve(question, k=k)
# Guardrail 1: Filter by score threshold
valid_chunks = [
c for c in all_chunks
if c.score >= self.min_score
]
# Guardrail 2: Check minimum chunk count
if len(valid_chunks) < self.min_chunks:
return {
"answer": "I don't have enough relevant information to answer this question confidently.",
"chunks": all_chunks,
"refused": True,
"reason": f"Only {len(valid_chunks)} chunks above threshold {self.min_score}"
}
# Step 2: Build prompt with valid chunks only
prompt = self.prompt_builder.build(valid_chunks, question)
# Step 3: Generate
answer = self.generator.generate(prompt)
return {
"answer": answer,
"chunks": valid_chunks,
"refused": False
}
# Create pipeline with guardrails
rag_guarded = RAGPipelineWithGuardrails(
retriever,
prompt_builder,
generator,
min_score=0.4, # Require 40% similarity
min_chunks=2 # Require at least 2 relevant chunks
)
print("Guarded RAG pipeline created")
print(f" - Min score threshold: 0.4")
print(f" - Min chunks required: 2")
# Test guardrails with a well-covered question
print("Test 1: Question well-covered by knowledge base")
print("=" * 70)
result = rag_guarded.answer("Why did the central bank raise rates?")
print(f"Refused: {result['refused']}")
print(f"Chunks used: {len(result['chunks'])}")
print(f"\nAnswer: {result['answer'][:200]}...")
# Test guardrails with an out-of-scope question
print("Test 2: Question NOT covered by knowledge base")
print("=" * 70)
result = rag_guarded.answer("What is the best programming language?")
print(f"Refused: {result['refused']}")
if result['refused']:
print(f"Reason: {result['reason']}")
print(f"\nAnswer: {result['answer']}")
print("\n✅ Guardrails prevent the system from hallucinating")
print(" when retrieval doesn't find relevant evidence.")
7.15 When to Refuse#
Refusal is a feature, not a failure.
When RAG Should Refuse#
Condition |
Action |
|---|---|
No chunks above score threshold |
Refuse |
Chunks are from wrong domain |
Refuse |
Query asks for speculation |
Refuse |
Conflicting evidence |
Acknowledge uncertainty |
Refusal Patterns#
| Pattern | Example Response |
|---|---|
| No information | “I don’t have information about that topic.” |
| Low confidence | “Based on limited evidence, I cannot confidently answer.” |
| Out of scope | “This question is outside my knowledge base.” |
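As a sketch of how these patterns could be wired into a pipeline, the helper below maps retrieval conditions to refusal messages. The specific checks, thresholds, and messages are illustrative, not a prescribed API; the guarded pipeline from Section 7.14 already covers the score-threshold case.
def refusal_message(chunks: List[RetrievedChunk],
                    min_score: float = 0.4,
                    trusted_sources: Optional[set] = None) -> Optional[str]:
    """Return a refusal string if retrieval does not support answering, else None."""
    if not chunks or max(c.score for c in chunks) < min_score:
        return "I don't have information about that topic."
    if trusted_sources is not None and not any(c.source in trusted_sources for c in chunks):
        return "This question is outside my knowledge base."
    if sum(1 for c in chunks if c.score >= min_score) < 2:
        return "Based on limited evidence, I cannot confidently answer."
    return None  # enough evidence: proceed to generation
# Example: a sports query against a finance-only list of trusted sources
sports_chunks = retriever.retrieve("Who won the championship match?", k=3)
print(refusal_message(sports_chunks, trusted_sources={"monetary_policy_report_q3.pdf",
                                                      "fed_policy_overview.pdf"}))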
Enterprise Reality#
In regulated environments, a wrong answer is far more costly than no answer.
Banks, healthcare, legal: refusal is risk management.
Group 5: Production RAG#
Deploying evaluable, auditable RAG systems
Section |
Topic |
|---|---|
7.16 |
Evaluating Retrieval Quality |
7.17 |
Evaluating Generation Faithfulness |
7.18 |
Caching and Performance |
7.19 |
Observability and Audit Trails |
7.20 |
RAG as a Platform Capability |
7.16 Evaluating Retrieval Quality#
RAG quality starts with retrieval quality. Poor retrieval = poor answers.
Retrieval Metrics#
| Metric | What It Measures | How to Compute |
|---|---|---|
| Precision@k | Relevant chunks in top-k | Relevant retrieved / k |
| Recall@k | Coverage of all relevant chunks | Retrieved relevant / Total relevant |
| MRR | Position of first relevant chunk | 1 / rank of first relevant |
| NDCG | Ranking quality weighted by position | Discounted relevance gain vs. the ideal ordering |
Practical Evaluation#
For most applications, simple checks work:
Does the top chunk answer the question?
Are retrieved chunks from appropriate sources?
Is retrieval score reasonable?
def evaluate_retrieval(query: str, expected_keywords: list, k: int = 3):
"""Simple retrieval evaluation."""
chunks = retriever.retrieve(query, k=k)
print(f"Query: {query}")
print(f"Expected keywords: {expected_keywords}")
print("\nRetrieved chunks:")
hits = 0
for i, chunk in enumerate(chunks, 1):
text_lower = chunk.text.lower()
matched = [kw for kw in expected_keywords if kw.lower() in text_lower]
status = "✅" if matched else "❌"
hits += 1 if matched else 0
print(f" {i}. [{chunk.score:.3f}] {status} Keywords: {matched}")
print(f" {chunk.text[:60]}...")
precision = hits / k
print(f"\nPrecision@{k}: {precision:.1%}")
return precision
# Evaluate retrieval for a test query
evaluate_retrieval(
"Why did interest rates increase?",
["interest", "rate", "inflation", "central bank"]
)
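Precision@k above only needs keywords; MRR from the metrics table needs labeled relevance judgments instead. A minimal sketch, assuming a small hand-labeled set of queries mapped to relevant document ids (the labels below are illustrative, not ground truth supplied by the module):
# Illustrative relevance labels: query -> ids of documents considered relevant
labeled_queries = {
    "Why did interest rates increase?": {"doc_001", "doc_002"},
    "How is the housing market doing?": {"doc_003"},
    "How did banks perform this quarter?": {"doc_005"},
}
def mean_reciprocal_rank(labeled: dict, k: int = 3) -> float:
    """MRR: average over queries of 1 / rank of the first relevant retrieved chunk."""
    reciprocal_ranks = []
    for query, relevant_ids in labeled.items():
        rr = 0.0
        for rank, chunk in enumerate(retriever.retrieve(query, k=k), 1):
            # Map the chunk back to its document id via its (unique) text
            doc_id = next(d["id"] for d in knowledge_base if d["text"] == chunk.text)
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"MRR@3 over {len(labeled_queries)} labeled queries: {mean_reciprocal_rank(labeled_queries):.2f}")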
7.17 Evaluating Generation Faithfulness#
Faithfulness: Does the answer only use information from the provided context?
Faithfulness Checks#
| Check | Question |
|---|---|
| Grounding | Can every claim be traced to context? |
| No hallucination | Does the answer avoid inventing facts? |
| Appropriate refusal | Does it refuse when context is insufficient? |
Manual Evaluation Template#
For each answer, ask:
Is this answer supported by the retrieved chunks? (Yes/No/Partial)
Does the answer add information not in the chunks? (Yes/No)
Is the answer’s confidence appropriate? (Yes/No)
def check_faithfulness_simple(answer: str, chunks: List[RetrievedChunk]) -> dict:
"""Simple faithfulness heuristics."""
# Combine all chunk text
context_text = " ".join(c.text.lower() for c in chunks)
answer_lower = answer.lower()
# Check for common hallucination signals
hallucination_phrases = [
"i think", "probably", "might be", "i believe",
"generally speaking", "in my opinion", "typically"
]
found_phrases = [p for p in hallucination_phrases if p in answer_lower]
# Check if answer is appropriately uncertain when needed
uncertainty_phrases = [
"don't have", "cannot", "no information",
"not mentioned", "unclear"
]
shows_uncertainty = any(p in answer_lower for p in uncertainty_phrases)
return {
"potential_hallucination_signals": found_phrases,
"shows_uncertainty": shows_uncertainty,
"answer_length": len(answer),
"context_length": len(context_text)
}
# Test on a RAG response
result = rag.answer("What is the current inflation rate?")
print("Question: What is the current inflation rate?")
print(f"\nAnswer: {result['answer']}")
print("\nFaithfulness analysis:")
analysis = check_faithfulness_simple(result['answer'], result['chunks'])
for key, value in analysis.items():
print(f" {key}: {value}")
7.18 Caching and Performance#
Production RAG systems need performance optimization.
What to Cache#
| Component | Cache Strategy | Invalidation |
|---|---|---|
| Document embeddings | Precompute, persist | On document change |
| Query embeddings | LRU cache | Time-based |
| Retrieval results | Query hash → chunks | On index update |
| LLM responses | Prompt hash → answer | Careful: may go stale |
Latency Breakdown#
Typical RAG latency:
| Step | Typical Time |
|---|---|
| Query embedding | 50-100ms |
| Vector search | 10-50ms |
| LLM generation | 500-3000ms |
| Total | 600-3000ms |
LLM generation dominates. Cache carefully.
import hashlib
class CachedRetriever:
"""Retriever with query caching."""
def __init__(self, base_retriever: Retriever, cache_size: int = 100):
self.base = base_retriever
self.cache = {}
self.cache_size = cache_size
self.hits = 0
self.misses = 0
def _cache_key(self, query: str, k: int) -> str:
return hashlib.md5(f"{query}:{k}".encode()).hexdigest()
def retrieve(self, query: str, k: int = 3) -> List[RetrievedChunk]:
key = self._cache_key(query, k)
if key in self.cache:
self.hits += 1
return self.cache[key]
self.misses += 1
result = self.base.retrieve(query, k)
# Simple cache with size limit
if len(self.cache) >= self.cache_size:
# Remove oldest entry (simple strategy)
oldest_key = next(iter(self.cache))
del self.cache[oldest_key]
self.cache[key] = result
return result
def stats(self):
total = self.hits + self.misses
hit_rate = self.hits / total if total > 0 else 0
return {"hits": self.hits, "misses": self.misses, "hit_rate": hit_rate}
# Demo caching
cached_retriever = CachedRetriever(retriever)
# First query - cache miss
cached_retriever.retrieve("interest rates", k=3)
print(f"After first query: {cached_retriever.stats()}")
# Same query - cache hit
cached_retriever.retrieve("interest rates", k=3)
print(f"After same query: {cached_retriever.stats()}")
# Different query - cache miss
cached_retriever.retrieve("mortgage rates", k=3)
print(f"After new query: {cached_retriever.stats()}")
7.19 Observability and Audit Trails#
Production RAG requires comprehensive logging for:
| Purpose | What to Log |
|---|---|
| Debugging | Query, chunks, prompt, response |
| Quality monitoring | Retrieval scores, response latency |
| Compliance | User ID, timestamp, sources cited |
| Improvement | Failed queries, low-confidence responses |
Audit Trail Structure#
{
"request_id": "uuid",
"timestamp": "ISO-8601",
"user_id": "user-123",
"query": "original question",
"retrieval": {
"chunk_ids": ["doc_001", "doc_002"],
"scores": [0.85, 0.72],
"latency_ms": 45
},
"generation": {
"model": "llama3.1:8b",
"prompt_tokens": 450,
"response_tokens": 120,
"latency_ms": 1200
},
"response": "final answer",
"refused": false
}
import uuid
from datetime import datetime
class AuditableRAGPipeline:
"""RAG pipeline with audit logging."""
def __init__(self, retriever, prompt_builder, generator):
self.retriever = retriever
self.prompt_builder = prompt_builder
self.generator = generator
self.audit_log = []
def answer(self, question: str, user_id: str = "anonymous", k: int = 3) -> dict:
request_id = str(uuid.uuid4())
start_time = time.time()
# Retrieval
retrieval_start = time.time()
chunks = self.retriever.retrieve(question, k=k)
retrieval_ms = (time.time() - retrieval_start) * 1000
# Prompt building
prompt = self.prompt_builder.build(chunks, question)
# Generation
generation_start = time.time()
answer = self.generator.generate(prompt)
generation_ms = (time.time() - generation_start) * 1000
total_ms = (time.time() - start_time) * 1000
# Build audit record
audit_record = {
"request_id": request_id,
"timestamp": datetime.utcnow().isoformat(),
"user_id": user_id,
"query": question,
"retrieval": {
"chunk_sources": [c.source for c in chunks],
"scores": [c.score for c in chunks],
"latency_ms": round(retrieval_ms, 1)
},
"generation": {
"latency_ms": round(generation_ms, 1)
},
"total_latency_ms": round(total_ms, 1),
"response_length": len(answer)
}
self.audit_log.append(audit_record)
return {
"answer": answer,
"request_id": request_id,
"chunks": chunks
}
def get_audit_log(self):
return self.audit_log
# Create auditable pipeline
auditable_rag = AuditableRAGPipeline(retriever, prompt_builder, generator)
# Make a query
result = auditable_rag.answer(
"What impact did rate hikes have on mortgages?",
user_id="user-42"
)
print("Answer:", result["answer"][:100], "...")
print(f"\nRequest ID: {result['request_id']}")
print("\nAudit Record:")
print(json.dumps(auditable_rag.audit_log[-1], indent=2))
7.20 RAG as a Platform Capability#
In enterprise settings, RAG becomes a platform—not a one-off feature.
Platform Characteristics#
| Aspect | Implementation |
|---|---|
| Multi-tenant | Different knowledge bases per team/product |
| Swappable components | Change LLM without rebuilding |
| Configurable guardrails | Different thresholds per use case |
| Centralized logging | Unified audit across all RAG apps |
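As a minimal sketch of what "configurable per tenant" can look like in code, the snippet below reuses the components built earlier in this module. The TenantConfig fields and the idea of passing each tenant's own index and documents are illustrative; a real platform would also resolve the corpus, credentials, and logging sink per tenant.
from dataclasses import dataclass
@dataclass
class TenantConfig:
    """Illustrative per-tenant settings for a shared RAG platform."""
    corpus_name: str
    llm_model: str = "llama3.1:8b"
    min_score: float = 0.4
    min_chunks: int = 2
def build_tenant_pipeline(cfg: TenantConfig, tenant_index, tenant_docs) -> RAGPipelineWithGuardrails:
    """Assemble a guarded pipeline from one tenant's configuration."""
    tenant_retriever = FAISSRetriever(tenant_index, tenant_docs, model)  # shared embedding model
    tenant_generator = LLMGenerator(LLM_BASE_URL, LLM_API_KEY, cfg.llm_model)
    return RAGPipelineWithGuardrails(tenant_retriever, RAGPromptBuilder(), tenant_generator,
                                     min_score=cfg.min_score, min_chunks=cfg.min_chunks)
# Two tenants share the platform code but get different policies
# (both reuse the demo corpus here; real tenants would have their own indexes)
finance_rag = build_tenant_pipeline(TenantConfig("finance-docs", min_score=0.5), index, knowledge_base)
support_rag = build_tenant_pipeline(TenantConfig("support-kb", min_score=0.3), index, knowledge_base)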
Evolution Path#
| | Prototype RAG | Production RAG | Platform RAG |
|---|---|---|---|
| Corpora | Single corpus | Multiple corpora | Self-service corpora |
| Models | One model | Model selection | Model marketplace |
| Guardrails | No guardrails | Fixed guardrails | Configurable policies |
| Observability | No logging | Basic logging | Full observability |
Key Insight#
RAG at scale is about governance, not just generation.
Who can access which knowledge? What gets logged? How do we audit?
Module Summary#
Key Takeaways#
| Concept | Remember |
|---|---|
| Why RAG | LLMs have no runtime access to your data |
| RAG Architecture | Component-based: Retriever → Prompt → Generator → Validator |
| Near-misses | Most dangerous failure: semantically similar but factually different |
| Guardrails | Score thresholds and refusal are features, not failures |
| Evaluation | Measure retrieval quality and generation faithfulness separately |
| Production | Logging, caching, audit trails are mandatory |
The RAG Mental Model#
RAG is how we turn LLMs from storytellers into assistants grounded in evidence.
It is an architectural discipline, not a prompt trick.
What’s Next#
You now have all the components to build production AI systems:
Module 5: Embeddings and retrieval
Module 6: LLM API engineering
Module 7: RAG pipelines
The assessment will test your ability to combine these into a working system.
Practice Exercises#
Exercise 1: Adjust Retrieval Parameters#
Modify the retriever to use k=5 instead of k=3. How does this affect answer quality?
Exercise 2: Custom Guardrails#
Create a guardrail that refuses to answer if retrieved chunks come from more than 2 different sources (potential conflicting evidence).
Exercise 3: Evaluate Your RAG#
Write 5 test questions and manually evaluate:
Retrieval precision (are the right chunks retrieved?)
Generation faithfulness (does the answer use only the context?)
Exercise 4: Add a New Document#
Add a new document to the knowledge base about cryptocurrency regulation. Test that queries about crypto now return relevant results.