# Module 7 Resources — RAG Pipelines
Additional resources to deepen your understanding of Retrieval-Augmented Generation.
## Core Concepts

### RAG Fundamentals

| Resource | Description |
|---|---|
| Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" | The original RAG paper from Facebook AI |
| | When to use RAG vs fine-tuning |
| OpenAI Cookbook | OpenAI’s guide to building RAG with embeddings |
| LangChain retrieval documentation | Conceptual overview of retrieval in LangChain |
### Retrieval and Vector Search

| Resource | Description |
|---|---|
| FAISS | Facebook’s efficient similarity search library |
| | Comprehensive vector database learning resources |
| Sentence Transformers | Sentence Transformers documentation |
| Chroma | Open-source embedding database |
## Implementation

### RAG Frameworks

| Framework | Description | Use Case |
|---|---|---|
| LangChain | Comprehensive LLM application framework | Full-featured RAG pipelines |
| LlamaIndex | Data framework for LLM applications | Document indexing and retrieval |
| Haystack | End-to-end NLP framework | Production RAG systems |
| Semantic Kernel | Microsoft’s AI orchestration framework | Enterprise integration |
### Vector Databases

| Database | Type | Key Features |
|---|---|---|
| Pinecone | Managed | Serverless, scalable, metadata filtering |
| Weaviate | Open Source | GraphQL API, hybrid search |
| Qdrant | Open Source | Rust-based, filtering, payload storage |
| Milvus | Open Source | Distributed, GPU acceleration |
| Chroma | Open Source | Lightweight, embedded, Python-native |
| FAISS | Library | Local, efficient, research-grade |
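As a concrete example, here is a minimal sketch of indexing and querying with Chroma, the embedded option above; the collection name and documents are illustrative, and Chroma's default embedding function handles vectorization.

```python
# Minimal sketch: embedded vector store with Chroma (names are illustrative).
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient for disk storage
collection = client.create_collection(name="policy_docs")

# Chroma embeds the documents with its default embedding function
collection.add(
    documents=["Refunds are processed within 14 days.", "Support is available 24/7."],
    ids=["doc_001", "doc_002"],
)

results = collection.query(query_texts=["How long do refunds take?"], n_results=1)
print(results["ids"], results["distances"])
```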
### Embedding Models

| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| all-MiniLM-L6-v2 | Sentence Transformers | 384 | Fast, general purpose |
| text-embedding-3-small | OpenAI | 1536 | High quality, API-based |
| text-embedding-3-large | OpenAI | 3072 | Highest quality, longer context |
| multilingual-e5-large | Microsoft | 1024 | Multilingual, instruction-tuned |
| bge-large-en-v1.5 | BAAI | 1024 | Strong benchmark performance |
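The Dimensions column matters because your vector index must match the dimensionality the model produces. A quick way to confirm it before sizing an index, as a minimal sketch using the sentence-transformers package:

```python
# Sketch: confirm an embedding model's output dimension before sizing the index.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384

vec = model.encode("example sentence")
print(vec.shape)  # (384,) -- the FAISS / vector-DB index must use this size
```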
## Production Patterns

### Prompt Engineering for RAG

| Resource | Focus |
|---|---|
| | Evidence-first prompting techniques |
| Anthropic documentation | Claude-specific RAG prompts |
| | General prompt engineering |
### Chunking Strategies

| Strategy | When to Use |
|---|---|
| Fixed-size chunks | Simple documents, consistent structure |
| Semantic chunking | Documents with clear topic boundaries |
| Recursive splitting | Nested document structures |
| Sentence-based | When context boundaries matter |
```python
# Example: Recursive text splitter (LangChain style)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # target characters per chunk
    chunk_overlap=50,     # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
```
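For the first strategy in the table, no library is required; below is a minimal sketch of fixed-size chunking with overlap (the function name and sizes are illustrative).

```python
# Sketch: fixed-size chunking with overlap (sizes are illustrative).
def chunk_fixed(text, chunk_size=500, overlap=50):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Example: a 1200-character document yields chunks starting at 0, 450, and 900
# print([len(c) for c in chunk_fixed("x" * 1200)])  # [500, 500, 300]
```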
### Evaluation Metrics

| Metric | What It Measures | Use Case |
|---|---|---|
| Precision@k | Relevance of retrieved chunks | Retrieval quality |
| Recall@k | Coverage of relevant chunks | Completeness |
| MRR | Rank of first relevant result | Ranking quality |
| Faithfulness | Answer uses only context | Generation quality |
| Answer Relevancy | Answer addresses the question | End-to-end quality |
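A minimal sketch of the three retrieval metrics above, assuming the retrieved results and the relevant set are given as chunk IDs for a single query (the example IDs are illustrative):

```python
# Sketch: retrieval metrics for a single query, using chunk IDs.
def precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, c in enumerate(retrieved, start=1):
        if c in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_001", "doc_007", "doc_002"]
relevant = {"doc_002", "doc_005"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.33
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(mrr(retrieved, relevant))                  # 0.33
```

Faithfulness and answer relevancy require judging generated text against the context and question, typically with an LLM-as-judge or a framework such as RAGAS.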
## Guardrails and Safety

### Guardrail Libraries

| Library | Purpose |
|---|---|
| Guardrails AI | Structured output validation |
| NeMo Guardrails | NVIDIA’s conversational safety toolkit |
| | Prompt injection detection |
### Common Guardrails

```python
# Example guardrail configuration
guardrails = {
    "min_retrieval_score": 0.4,       # discard chunks below this similarity score
    "min_chunks_required": 2,         # refuse if fewer chunks pass the filter
    "max_chunks": 5,                  # cap the context size
    "allowed_sources": ["policy_docs", "faq"],
    "blocked_topics": ["competitor_info"],
    "max_response_length": 500,
}
```
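One way a configuration like this might be enforced; this is a sketch only, where the chunk fields `score` and `source` and the helper `apply_guardrails` are assumptions rather than any specific library's API:

```python
# Sketch: enforcing the guardrail config above on retrieved chunks and the answer.
def apply_guardrails(chunks, answer, config):
    # Keep only chunks that clear the score threshold and come from allowed sources
    kept = [
        c for c in chunks
        if c.score >= config["min_retrieval_score"]
        and c.source in config["allowed_sources"]
    ][: config["max_chunks"]]

    if len(kept) < config["min_chunks_required"]:
        return {"refused": True, "reason": "insufficient supporting context"}

    # Crude topic block: refuse if a blocked topic string appears in the answer
    if any(topic in answer.lower() for topic in config["blocked_topics"]):
        return {"refused": True, "reason": "blocked topic detected"}

    return {"refused": False, "answer": answer[: config["max_response_length"]]}
```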
## Observability and Monitoring

### Tracing Tools

| Tool | Description |
|---|---|
| LangSmith | LangChain’s tracing and debugging platform |
| | ML experiment tracking |
| | LLM observability |
| | LLM request logging and analytics |
### What to Log

```json
{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "user-123",
  "query": "original question",
  "retrieval": {
    "chunk_ids": ["doc_001", "doc_002"],
    "scores": [0.85, 0.72],
    "sources": ["policy.pdf", "faq.md"],
    "latency_ms": 45
  },
  "generation": {
    "model": "gpt-4",
    "prompt_tokens": 450,
    "response_tokens": 120,
    "latency_ms": 1200
  },
  "response": "final answer",
  "refused": false,
  "guardrails_triggered": []
}
```
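A minimal sketch of emitting a record in this shape; the helper name is illustrative, and a real system would ship the record to a log pipeline rather than print it:

```python
# Sketch: emit a structured log record matching the schema above.
import json
import uuid
from datetime import datetime, timezone

def log_rag_request(query, retrieval, generation, response, refused, guardrails_triggered, user_id):
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieval": retrieval,      # dict: chunk_ids, scores, sources, latency_ms
        "generation": generation,    # dict: model, token counts, latency_ms
        "response": response,
        "refused": refused,
        "guardrails_triggered": guardrails_triggered,
    }
    print(json.dumps(record))  # replace with your log shipper
    return record
```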
## Quick Reference

### Minimal RAG Pipeline

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Setup
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["doc1...", "doc2...", "doc3..."]

# 2. Index documents
embeddings = model.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine for normalized vectors
index.add(embeddings.astype("float32"))

# 3. Retrieve
query = "user question"
query_vec = model.encode(query, normalize_embeddings=True)
scores, indices = index.search(query_vec.reshape(1, -1).astype("float32"), k=3)

# 4. Build prompt
context = "\n".join([documents[i] for i in indices[0]])
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# 5. Generate (call your LLM)
# answer = llm.generate(prompt)
```
### RAG Prompt Template

```python
RAG_PROMPT = """You are a helpful assistant that answers questions based ONLY on the provided context.

IMPORTANT RULES:
1. Answer ONLY using information from the CONTEXT below
2. If the context does not contain enough information, say "I don't have enough information to answer this question."
3. Do not use any prior knowledge or make assumptions
4. Keep your answer concise and directly relevant to the question

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
```
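Filling the template is a single `str.format` call; the context and question values below are placeholders:

```python
# Usage: substitute retrieved context and the user question into the template.
prompt = RAG_PROMPT.format(
    context="Refunds are processed within 14 days of the return being received.",
    question="How long do refunds take?",
)
# answer = llm.generate(prompt)  # call your LLM of choice
```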
### Guardrail Implementation

```python
def rag_with_guardrails(query, retriever, generator, min_score=0.4, min_chunks=2):
    # Retrieve
    chunks = retriever.retrieve(query, k=5)

    # Filter by score
    valid_chunks = [c for c in chunks if c.score >= min_score]

    # Check minimum chunks
    if len(valid_chunks) < min_chunks:
        return {
            "answer": "I don't have enough relevant information.",
            "refused": True,
            "reason": f"Only {len(valid_chunks)} chunks above threshold"
        }

    # Generate
    prompt = build_prompt(valid_chunks, query)
    answer = generator.generate(prompt)

    return {"answer": answer, "refused": False, "chunks": valid_chunks}
```
## Further Reading

### Research Papers

| Paper | Topic |
|---|---|
| REALM (Guu et al., 2020) | Retrieval-augmented language model pre-training |
| Atlas (Izacard et al., 2022) | Few-shot learning via retrieval-augmented LMs |
| Self-RAG (Asai et al., 2023) | Self-reflective RAG |
| RAPTOR (Sarthi et al., 2024) | Recursive summarization for tree-organized retrieval |
### Blog Posts and Tutorials

| Resource | Author |
|---|---|
| | Anyscale |
| | Pinecone |
| | LangChain |
| | RAGAS |
## Module 7 Checklist

Before moving to the assessment, ensure you can:

- [ ] Explain why RAG is necessary for enterprise LLM applications
- [ ] Design a RAG pipeline with clear component boundaries
- [ ] Implement retrieval using FAISS and sentence transformers
- [ ] Build evidence-first prompts that constrain LLM responses
- [ ] Identify and handle RAG-specific failure modes (near-misses, etc.)
- [ ] Implement guardrails for score thresholds and refusal behavior
- [ ] Evaluate retrieval quality using Precision@k
- [ ] Evaluate generation faithfulness
- [ ] Design audit trails for production RAG systems
- [ ] Explain the trade-offs in caching and performance optimization