# Module 7 Resources — RAG Pipelines

Additional resources to deepen your understanding of Retrieval-Augmented Generation.


## Core Concepts

### RAG Fundamentals

| Resource | Description |
| --- | --- |
| Retrieval-Augmented Generation (Lewis et al., 2020) | The original RAG paper from Facebook AI |
| RAG vs Fine-tuning (Anthropic) | When to use RAG versus fine-tuning |
| Building RAG Applications (OpenAI) | OpenAI’s guide to building RAG with embeddings |
| LangChain RAG Concepts | Conceptual overview of retrieval in LangChain |


## Implementation

### RAG Frameworks

| Framework | Description | Use Case |
| --- | --- | --- |
| LangChain | Comprehensive LLM application framework | Full-featured RAG pipelines |
| LlamaIndex | Data framework for LLM applications | Document indexing and retrieval |
| Haystack | End-to-end NLP framework | Production RAG systems |
| Semantic Kernel | Microsoft’s AI orchestration framework | Enterprise integration |

### Vector Databases

| Database | Type | Key Features |
| --- | --- | --- |
| Pinecone | Managed | Serverless, scalable, metadata filtering |
| Weaviate | Open Source | GraphQL API, hybrid search |
| Qdrant | Open Source | Rust-based, filtering, payload storage |
| Milvus | Open Source | Distributed, GPU acceleration |
| ChromaDB | Open Source | Lightweight, embedded, Python-native |
| FAISS | Library | Local, efficient, research-grade |

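For a feel of the lighter-weight end of this table, ChromaDB can embed, index, and query documents in a few lines. A minimal sketch, assuming the default embedding function; the document text and IDs are made up:

```python
import chromadb

client = chromadb.Client()  # in-memory instance; use PersistentClient for disk
collection = client.create_collection("docs")

# Chroma embeds these with its default embedding function
collection.add(
    documents=["Refunds are processed within 14 days.",
               "Support is available 9am-5pm on weekdays."],
    ids=["doc_001", "doc_002"],
)

results = collection.query(query_texts=["How long do refunds take?"], n_results=1)
print(results["documents"])  # [['Refunds are processed within 14 days.']]
```
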
### Embedding Models

| Model | Provider | Dimensions | Best For |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | Sentence Transformers | 384 | Fast, general purpose |
| text-embedding-3-small | OpenAI | 1536 | High quality, API-based |
| text-embedding-3-large | OpenAI | 3072 | Highest quality; supports reduced output dimensions |
| e5-large-v2 | Microsoft | 1024 | Strong English retrieval; expects "query:"/"passage:" prefixes |
| bge-large-en-v1.5 | BAAI | 1024 | Strong benchmark performance |

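The dimension column is easy to verify locally for the open models; a quick sentence-transformers check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("hello world")
print(vec.shape)  # (384,), matching the table above
```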

## Production Patterns

### Prompt Engineering for RAG

| Resource | Focus |
| --- | --- |
| Prompting Guide - RAG | Evidence-first prompting techniques |
| Anthropic Prompt Engineering | Claude-specific RAG prompts |
| OpenAI Best Practices | General prompt engineering |

### Chunking Strategies

| Strategy | When to Use |
| --- | --- |
| Fixed-size chunks | Simple documents with consistent structure |
| Semantic chunking | Documents with clear topic boundaries |
| Recursive splitting | Nested document structures |
| Sentence-based | When preserving sentence boundaries matters |

```python
# Example: recursive text splitter (LangChain style)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # target characters per chunk
    chunk_overlap=50,     # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
```

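With a splitter configured as above, chunking is a single call; `policy.txt` here is a hypothetical source file:

```python
text = open("policy.txt").read()   # hypothetical document
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first 80 chars: {chunks[0][:80]}")
```
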
### Evaluation Metrics

| Metric | What It Measures | Use Case |
| --- | --- | --- |
| Precision@k | Fraction of the top-k retrieved chunks that are relevant | Retrieval quality |
| Recall@k | Fraction of all relevant chunks found in the top k | Completeness |
| MRR | Reciprocal rank of the first relevant result | Ranking quality |
| Faithfulness | Whether the answer uses only the provided context | Generation quality |
| Answer Relevancy | Whether the answer addresses the question | End-to-end quality |

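The first three metrics can be computed directly from ranked retriever output and a set of ground-truth relevant IDs. A minimal sketch with made-up data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k."""
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, c in enumerate(retrieved, start=1):
        if c in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_003", "doc_001", "doc_007"]   # ranked retriever output
relevant = {"doc_001", "doc_002"}               # ground truth
print(precision_at_k(retrieved, relevant, k=3))  # 0.33...
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(mrr(retrieved, relevant))                  # 0.5
```

Faithfulness and answer relevancy are typically judged by an LLM or a framework such as Ragas rather than computed in closed form.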

## Guardrails and Safety

### Guardrail Libraries

| Library | Purpose |
| --- | --- |
| Guardrails AI | Structured output validation |
| NeMo Guardrails | NVIDIA’s conversational safety toolkit |
| Rebuff | Prompt injection detection |

### Common Guardrails

```python
# Example guardrail configuration
guardrails = {
    "min_retrieval_score": 0.4,        # drop chunks scoring below this
    "min_chunks_required": 2,          # refuse if too little evidence survives
    "max_chunks": 5,                   # cap the context passed to the LLM
    "allowed_sources": ["policy_docs", "faq"],
    "blocked_topics": ["competitor_info"],
    "max_response_length": 500,
}
```

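A configuration like this only has effect if something enforces it. A minimal sketch of the retrieval-side checks, assuming each chunk exposes `score` and `source` attributes (hypothetical retriever output):

```python
def apply_retrieval_guardrails(chunks, cfg):
    """Keep chunks passing the score and source checks, capped at max_chunks."""
    kept = [
        c for c in chunks
        if c.score >= cfg["min_retrieval_score"]
        and c.source in cfg["allowed_sources"]
    ]
    return kept[:cfg["max_chunks"]]
```

The remaining keys (blocked topics, response length) apply on the generation side; the `rag_with_guardrails` function in the Quick Reference shows the refusal path.
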
## Observability and Monitoring

### Tracing Tools

| Tool | Description |
| --- | --- |
| LangSmith | LangChain’s tracing and debugging platform |
| Weights & Biases | ML experiment tracking |
| Arize Phoenix | LLM observability |
| Helicone | LLM request logging and analytics |

### What to Log

```json
{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "user-123",
  "query": "original question",
  "retrieval": {
    "chunk_ids": ["doc_001", "doc_002"],
    "scores": [0.85, 0.72],
    "sources": ["policy.pdf", "faq.md"],
    "latency_ms": 45
  },
  "generation": {
    "model": "gpt-4",
    "prompt_tokens": 450,
    "response_tokens": 120,
    "latency_ms": 1200
  },
  "response": "final answer",
  "refused": false,
  "guardrails_triggered": []
}
```

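One way to produce records in this shape is to assemble them per request and append to a JSON-lines file. A minimal sketch; the field contents are placeholders matching the example above:

```python
import json
import time
import uuid

def log_rag_request(query, retrieval, generation, response,
                    refused=False, triggered=None, path="rag_log.jsonl"):
    """Append one structured RAG trace record as a JSON line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "query": query,
        "retrieval": retrieval,     # dict: chunk_ids, scores, sources, latency_ms
        "generation": generation,   # dict: model, token counts, latency_ms
        "response": response,
        "refused": refused,
        "guardrails_triggered": triggered or [],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```
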
## Quick Reference

### Minimal RAG Pipeline

```python
from sentence_transformers import SentenceTransformer
import faiss

# 1. Setup
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["doc1...", "doc2...", "doc3..."]

# 2. Index documents (normalized embeddings + inner product = cosine similarity)
embeddings = model.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype("float32"))

# 3. Retrieve
query = "user question"
query_vec = model.encode(query, normalize_embeddings=True)
scores, indices = index.search(query_vec.reshape(1, -1).astype("float32"), k=3)

# 4. Build prompt
context = "\n".join([documents[i] for i in indices[0]])
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# 5. Generate (call your LLM)
# answer = llm.generate(prompt)
```

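Step 5 depends on your provider. As one example, through the OpenAI Python SDK it might look like this; the model name is an arbitrary choice:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],  # prompt from step 4
)
answer = completion.choices[0].message.content
```
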
### RAG Prompt Template

```python
RAG_PROMPT = """You are a helpful assistant that answers questions based ONLY on the provided context.

IMPORTANT RULES:
1. Answer ONLY using information from the CONTEXT below
2. If the context does not contain enough information, say "I don't have enough information to answer this question."
3. Do not use any prior knowledge or make assumptions
4. Keep your answer concise and directly relevant to the question

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
```

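Filling the template is a plain `str.format` call; the context string here is made up:

```python
context = "Refunds are processed within 14 days of the return being received."
prompt = RAG_PROMPT.format(context=context, question="How long do refunds take?")
```
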
### Guardrail Implementation

```python
def rag_with_guardrails(query, retriever, generator, min_score=0.4, min_chunks=2):
    """Retrieve, filter weak evidence, and refuse rather than guess."""
    # Retrieve candidate chunks
    chunks = retriever.retrieve(query, k=5)

    # Filter by similarity score
    valid_chunks = [c for c in chunks if c.score >= min_score]

    # Refuse if there is not enough evidence
    if len(valid_chunks) < min_chunks:
        return {
            "answer": "I don't have enough relevant information.",
            "refused": True,
            "reason": f"Only {len(valid_chunks)} chunks above threshold",
        }

    # Generate from the surviving context (build_prompt as in the template above)
    prompt = build_prompt(valid_chunks, query)
    answer = generator.generate(prompt)

    return {"answer": answer, "refused": False, "chunks": valid_chunks}
```

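Any retriever/generator pair exposing these methods will work (the interfaces are assumptions, not a specific library's API); the returned dict maps directly onto the log format shown earlier:

```python
result = rag_with_guardrails("What is the refund policy?", retriever, generator)
if result["refused"]:
    print("Refused:", result["reason"])
else:
    print(result["answer"])
```
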
## Further Reading

### Research Papers

| Paper | Topic |
| --- | --- |
| REALM (Guu et al., 2020) | Retrieval-augmented language model pre-training |
| Atlas (Izacard et al., 2022) | Few-shot learning via retrieval-augmented LMs |
| Self-RAG (Asai et al., 2023) | Self-reflective RAG |
| RAPTOR (Sarthi et al., 2024) | Recursive summarization for tree-organized retrieval |

### Blog Posts and Tutorials


## Module 7 Checklist

Before moving to the assessment, ensure you can:

- Explain why RAG is necessary for enterprise LLM applications
- Design a RAG pipeline with clear component boundaries
- Implement retrieval using FAISS and sentence transformers
- Build evidence-first prompts that constrain LLM responses
- Identify and handle RAG-specific failure modes (e.g., near-miss retrieval)
- Implement guardrails for score thresholds and refusal behavior
- Evaluate retrieval quality using Precision@k
- Evaluate generation faithfulness
- Design audit trails for production RAG systems
- Explain the trade-offs in caching and performance optimization