Module 5 — Embeddings & Vector Databases
CodeVision Academy
Overview#
If Module 4 explains how models learn, Module 5 explains how models remember.
This module introduces embeddings and vector databases, the foundations of:
- semantic search
- retrieval systems
- Retrieval-Augmented Generation (RAG)
The entire module runs in Google Colab (CPU-only) and requires no server access.
One Big Idea to Remember#
Embeddings turn meaning into numbers, so computers can measure similarity.
Learning Objectives#
By the end of this module, you will be able to:
- Explain what embeddings represent (and what they do not)
- Explain how neural networks produce embeddings
- Generate embeddings for text using a lightweight model
- Explain similarity geometrically
- Implement semantic search end-to-end
- Explain why vector databases exist
- Understand chunking, metadata, and retrieval quality
- Explain how embeddings + retrieval enable RAG
Before You Start: Hugging Face Token Setup#
This notebook downloads models from Hugging Face. To avoid rate limits and warnings, you need your own free token.
Quick Setup (2-3 minutes)#
1. Go to your Hugging Face account settings and open the Access Tokens page (huggingface.co/settings/tokens)
2. Click New token → Name it anything → Select Read access → Create
3. Copy the token (you won’t see it again)
4. In Google Colab, click the Key icon in the left sidebar
5. Add a secret named exactly `HF_TOKEN` with your token as the value
6. Turn ON “Notebook access” → Restart the runtime
Important rules:
- Do NOT hard-code tokens in notebooks
- Do NOT share your token
- Do NOT upload tokens to GitHub
Run the cell below to verify your setup:
```python
# Verify HF Token is set up correctly
import os

# Try to get token from Colab secrets first, then environment
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    os.environ['HF_TOKEN'] = hf_token
    print("HF_TOKEN loaded from Colab secrets")
except Exception:
    if 'HF_TOKEN' in os.environ:
        print("HF_TOKEN found in environment")
    else:
        print("WARNING: HF_TOKEN not found!")
        print("Please follow the setup instructions above.")
        print("You may see download warnings without it.")
```
Setup#
Install the required packages. This takes about 1-2 minutes.
```python
!pip -q install sentence-transformers scikit-learn faiss-cpu
```
Group 1 — What Are Embeddings?#
Before we write code, let’s build intuition about what embeddings are and why they matter.
5.1 What Is an Embedding?#
An embedding is a list of numbers (a vector) that represents the meaning of something.
Think of it like coordinates on a map:
- Paris and Lyon are closer together (both in France)
- Paris and Tokyo are far apart (different continents)

Embeddings work the same way for meaning:

- “dog” and “puppy” → vectors close together
- “dog” and “economics” → vectors far apart
```
Traditional approach:            Embedding approach:
  "bank" = "bank"                  "bank" = [0.12, -0.45, 0.78, ...]
  (just text, no meaning)          (captures context and meaning)
```
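To make this concrete, here is a tiny sketch with hand-picked 3-dimensional vectors. Real embeddings have hundreds of dimensions and come from a trained model; the numbers below are purely illustrative:

```python
import numpy as np

# Toy, hand-picked vectors -- purely illustrative, NOT real model output
dog       = np.array([0.8, 0.1, 0.0])
puppy     = np.array([0.7, 0.2, 0.1])
economics = np.array([0.0, 0.1, 0.9])

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"dog vs puppy:     {cosine(dog, puppy):.2f}")      # close to 1.0 -> similar
print(f"dog vs economics: {cosine(dog, economics):.2f}")  # close to 0.0 -> unrelated
```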
Why This Matters#
| Without Embeddings | With Embeddings |
|---|---|
| Search for exact words only | Search for similar meaning |
| “car” won’t find “automobile” | “car” finds “automobile”, “vehicle” |
| Keyword matching | Semantic understanding |
5.2 What Embeddings Are NOT#
Embeddings are powerful, but they have important limitations:
| Embeddings ARE | Embeddings are NOT |
|---|---|
| Statistical patterns | Truth or facts |
| Learned from training data | A knowledge database |
| Good at similarity | Good at reasoning |
| Context-dependent | Universal definitions |
Critical Insight for Enterprise#
Embeddings reflect the biases in their training data:
- If training data associates “nurse” with “female”, the embedding will too
- Domain-specific language may not be well-represented
- Recent events or proprietary terms won’t be captured
Key takeaway: Embeddings are useful approximations, not ground truth.
5.3 How Neural Networks Create Embeddings#
Remember from Module 4: neural networks learn by adjusting weights to reduce error.
Embedding models are trained on tasks like:
- “These two sentences mean the same thing” (similarity)
- “This sentence follows that sentence” (context)
- “This word fits in this blank” (language modeling)
Through training, the network learns to place similar meanings close together:
```
BEFORE TRAINING                AFTER TRAINING
(random positions)             (meaningful positions)

  dog •        • cat             dog •  • puppy
         • puppy                        • cat
  car •
         • economics             car •  • vehicle
  • vehicle
                                 economics •
```
The final layer of the network (before the output) contains the embedding—a compressed representation of meaning.
5.4 Dimensionality: Why Hundreds of Numbers?#
An embedding might have 384, 768, or even 1536 dimensions. Why so many?
Each dimension captures a different aspect of meaning:
| Dimension | Might capture… |
|---|---|
| #1 | Formality (casual ↔ formal) |
| #2 | Sentiment (negative ↔ positive) |
| #3 | Topic (finance ↔ sports) |
| #47 | Tense (past ↔ future) |
| … | Hundreds more subtle features |
Note: We don’t actually know what each dimension means! The network learns these features automatically during training. This is called a latent space.
Practical Tradeoffs#
| Dimension Size | Pros | Cons |
|---|---|---|
| Small (384) | Fast, less memory | Less nuance |
| Large (1536) | More detail | Slower, more storage |
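A quick back-of-the-envelope calculation shows why dimension size matters at scale. This sketch assumes raw float32 storage (4 bytes per value) and ignores index overhead:

```python
# Rough storage estimate: num_vectors * dimensions * 4 bytes (float32)
def storage_gb(num_vectors: int, dims: int) -> float:
    return num_vectors * dims * 4 / 1e9

for dims in (384, 768, 1536):
    print(f"{dims:>4} dims, 1M documents: ~{storage_gb(1_000_000, dims):.1f} GB")
# 384 -> ~1.5 GB, 768 -> ~3.1 GB, 1536 -> ~6.1 GB (vectors only, before index overhead)
```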
5.5 Enterprise Relevance#
Embeddings power many enterprise applications:
| Use Case | How Embeddings Help |
|---|---|
| Document Search | Find relevant policies even with different wording |
| Customer Support | Match queries to similar past tickets |
| Compliance | Flag documents similar to known violations |
| Recommendation | “Customers who viewed X also viewed Y” |
| Duplicate Detection | Find near-duplicate records |
| RAG (Module 5 focus) | Ground LLM answers in real documents |
Why This Matters for Banking#
In regulated industries, embeddings enable:
- Searching regulations by intent, not just keywords
- Detecting similar fraud patterns
- Matching customer queries to approved responses
- Auditable retrieval for compliance
Group 2 — Generating Embeddings#
Now let’s create embeddings with real code.
5.6 Choosing an Embedding Model#
There are many embedding models available. For this module, we use all-MiniLM-L6-v2:
| Property | Value | Why it matters |
|---|---|---|
| Size | 80MB | Fits in Colab memory |
| Dimensions | 384 | Good balance of quality/speed |
| Speed | Fast | Works on CPU |
| Quality | Good | Top performer for its size |
Other Popular Models#
| Model | Dimensions | Best for |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | General purpose, fast |
| all-mpnet-base-v2 | 768 | Higher quality, slower |
| OpenAI text-embedding-3-small | 1536 | API-based, high quality |
| BGE, E5, GTE | Various | Multilingual, specialized |
```python
# Load the embedding model
# This downloads the model (~80MB) on first run
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

print("Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
```
5.7 Building a Corpus#
A corpus is the collection of documents you want to search over.
In a real system, this might be:
- Company policies
- Knowledge base articles
- Customer support history
- Regulatory documents
For this demo, we’ll use a small set of sentences about finance and sports:
```python
# Our corpus: documents we want to search
corpus = [
    "Interest rates were increased by the central bank to control inflation.",
    "The bank raised rates after inflation surprised to the upside.",
    "Quarterly earnings improved as net interest margin widened.",
    "The Federal Reserve announced a 25 basis point rate hike.",
    "Mortgage rates have reached their highest level in 20 years.",
    "Football is a popular sport in Europe.",
    "A goal was scored in the final minute of the match.",
    "The team won the championship after a penalty shootout.",
]

print(f"Corpus size: {len(corpus)} documents")
for i, doc in enumerate(corpus):
    print(f"  [{i}] {doc[:60]}..." if len(doc) > 60 else f"  [{i}] {doc}")
```
5.8 Encoding the Corpus#
Encoding means converting text into embeddings (vectors).
The normalize_embeddings=True option ensures all vectors have length 1, which:
- Makes cosine similarity equal to dot product (faster!)
- Ensures fair comparison regardless of text length
```python
import numpy as np

# Convert all documents to embeddings
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

print(f"Corpus embeddings shape: {corpus_embeddings.shape}")
print(f"  - {corpus_embeddings.shape[0]} documents")
print(f"  - {corpus_embeddings.shape[1]} dimensions per embedding")
print()
print(f"First embedding (first 10 values): {corpus_embeddings[0][:10]}")
print(f"Vector length (should be ~1.0): {np.linalg.norm(corpus_embeddings[0]):.4f}")
```
5.9 Encoding a Query#
When a user asks a question, we encode it the same way.
The query embedding can then be compared to all corpus embeddings to find the most relevant documents.
```python
# A user's question
query = "Why did the central bank raise interest rates?"

# Encode the query
query_embedding = model.encode([query], normalize_embeddings=True)[0]

print(f"Query: '{query}'")
print(f"Query embedding shape: {query_embedding.shape}")
print(f"First 10 values: {query_embedding[:10]}")
```
Group 3 — Similarity and Retrieval#
Now we have vectors for our corpus and query. How do we find the most similar documents?
5.10 Measuring Similarity#
Cosine similarity measures the angle between two vectors:
- 1.0 = identical direction (same meaning)
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction (opposite meaning)
```
          Similar (cos ≈ 0.9)
        ↗
      ↗
Query →
      ↘
        ↘
          Unrelated (cos ≈ 0.1)
```
Why Cosine Similarity?#
| Metric | Pros | Cons |
|---|---|---|
| Cosine similarity | Scale-independent, standard for text | Ignores magnitude |
| Euclidean distance | Intuitive | Affected by vector length |
| Dot product | Fast (with normalized vectors) | Requires normalization |
```python
from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarity between query and ALL corpus documents
similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]

print(f"Query: '{query}'")
print("\nSimilarity scores:")
print("-" * 70)
for doc, score in zip(corpus, similarities):
    # Visual indicator of relevance
    bar = "*" * int(score * 20)
    print(f"[{score:.3f}] {bar:20s} {doc[:50]}...")
```
Interpreting the Results#
Notice how:
- Finance-related sentences score high (0.5-0.8)
- Sports sentences score low (0.1-0.2)
- The model understands “central bank” and “interest rates” are related to “Federal Reserve” and “rate hike”
This is semantic search — finding meaning, not just matching keywords!
5.11 Ranking and Top-K Retrieval#
In practice, we don’t return all documents. We return the top K most relevant.
This is the core of semantic search:
1. Encode the query
2. Calculate similarity to all documents
3. Sort by similarity
4. Return top K results
```python
def semantic_search(query, corpus, corpus_embeddings, model, k=3):
    """Perform semantic search and return top-k results."""
    # Encode query
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    # Calculate similarities
    scores = cosine_similarity([query_emb], corpus_embeddings)[0]
    # Get top-k indices (highest scores first)
    top_indices = np.argsort(scores)[::-1][:k]
    # Return results
    results = []
    for idx in top_indices:
        results.append({
            'rank': len(results) + 1,
            'score': scores[idx],
            'document': corpus[idx]
        })
    return results

# Test it!
query = "Why did the central bank raise interest rates?"
results = semantic_search(query, corpus, corpus_embeddings, model, k=3)

print(f"Query: '{query}'\n")
print("Top 3 results:")
print("=" * 70)
for r in results:
    print(f"#{r['rank']} (score: {r['score']:.3f})")
    print(f"   {r['document']}")
    print()
```
Try Different Queries#
Experiment with the search to see how it handles different types of queries:
```python
# Try these different queries and see what comes back
test_queries = [
    "What happened to mortgage costs?",
    "Tell me about the football game",
    "monetary policy decisions",  # Different words, same concept!
    "Who won the sports competition?"
]

for q in test_queries:
    results = semantic_search(q, corpus, corpus_embeddings, model, k=2)
    print(f"Query: '{q}'")
    for r in results:
        print(f"  [{r['score']:.3f}] {r['document'][:50]}...")
    print()
```
5.12 Failure Modes: When Embeddings Go Wrong#
Embeddings are powerful but not perfect. Understanding their limitations is crucial for enterprise use.
Common Failure Modes#
| Failure Mode | Example | Mitigation |
|---|---|---|
| Domain mismatch | General model doesn’t understand legal jargon | Fine-tune on domain data |
| Ambiguity | “bank” (financial vs river) | Add context, use metadata |
| Negation | “not interested in rates” matches rate documents | Use reranking or hybrid search |
| Length mismatch | Short query vs long document | Chunk documents appropriately |
| Recency | Model doesn’t know recent terms | Update model or use hybrid search |
Important Enterprise Considerations#
- Similarity ≠ Correctness: A document can be similar but contain wrong information
- No reasoning: Embeddings don’t understand logic or causation
- Threshold sensitivity: Choosing the right similarity cutoff is tricky (see the sketch below)
- Adversarial inputs: Carefully crafted queries can retrieve inappropriate content
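To see why thresholds are tricky, here is a minimal sketch that filters `semantic_search` results by a similarity cutoff. The 0.4 cutoff and the query are arbitrary illustrations, not recommended settings; in practice the threshold must be tuned per model and corpus:

```python
# Filter results by a similarity cutoff (0.4 is arbitrary -- tune per model and corpus)
THRESHOLD = 0.4

query = "How do penalty shootouts work?"
results = semantic_search(query, corpus, corpus_embeddings, model, k=5)

kept = [r for r in results if r['score'] >= THRESHOLD]
print(f"Query: '{query}'")
print(f"Results above threshold {THRESHOLD}: {len(kept)} of {len(results)}")
for r in kept:
    print(f"  [{r['score']:.3f}] {r['document'][:50]}")
```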
```python
# Demonstration: Negation doesn't work well
print("Failure mode: Negation")
print("=" * 50)

q1 = "interest rate increases"
q2 = "NOT about interest rates"  # Should match different docs, but won't!

for q in [q1, q2]:
    results = semantic_search(q, corpus, corpus_embeddings, model, k=2)
    print(f"\nQuery: '{q}'")
    for r in results:
        print(f"  [{r['score']:.3f}] {r['document'][:45]}...")

print("\n" + "=" * 50)
print("Notice: Both queries return similar results!")
print("The model focuses on 'interest rates', ignoring 'NOT'.")
```
Group 4 — Vector Databases and Scaling#
What happens when you have millions of documents?
5.13 Why Vector Databases Exist#
Our simple approach (compare query to ALL documents) doesn’t scale:
| Corpus Size | Comparisons per Query | Time (estimate) |
|---|---|---|
| 1,000 | 1,000 | 1 ms |
| 1,000,000 | 1,000,000 | 1 second |
| 1,000,000,000 | 1,000,000,000 | 17 minutes |
Vector databases solve this with approximate nearest neighbor (ANN) algorithms:
- Trade perfect accuracy for massive speed gains
- Find 95% of correct results in 1% of the time
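You can feel the brute-force cost directly. The sketch below times exact search over random vectors; the absolute numbers depend on your machine (and the largest size needs roughly 1.5 GB of RAM), but the linear growth is the point:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
dims = 384
query_vec = rng.standard_normal(dims, dtype=np.float32)

for n in (1_000, 100_000, 1_000_000):
    docs = rng.standard_normal((n, dims), dtype=np.float32)   # ~1.5 GB at n = 1,000,000
    start = time.perf_counter()
    scores = docs @ query_vec                # compare the query against every vector
    top = np.argsort(scores)[::-1][:5]       # keep the 5 best matches
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{n:>9,} vectors: {elapsed_ms:6.1f} ms")
```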
Popular Vector Databases#
| Database | Type | Best For |
|---|---|---|
| FAISS | Library | Local/embedded use |
| Pinecone | Managed service | Production, serverless |
| Weaviate | Open source | Self-hosted, GraphQL |
| ChromaDB | Lightweight | Prototyping, local dev |
| pgvector | PostgreSQL extension | Existing Postgres users |
5.14 FAISS: Fast Similarity Search#
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search.
We’ll use IndexFlatIP (Inner Product, exact search) for this demo. Production systems use approximate indexes like IndexIVFFlat or IndexHNSW.
```
Without an index:                    With an ANN index:
  Query → compare against             Query → check ~100 candidates
          ALL 1,000,000 docs                  (nearly the same quality)
  Slow, O(n)                           Fast, roughly O(log n)
```
```python
import faiss

# Get embedding dimension
d = corpus_embeddings.shape[1]  # 384 for our model

# Create a FAISS index (Inner Product for normalized vectors = cosine similarity)
index = faiss.IndexFlatIP(d)

# Add our corpus embeddings to the index
index.add(corpus_embeddings.astype('float32'))

print("FAISS index created!")
print(f"  - Dimension: {d}")
print(f"  - Vectors indexed: {index.ntotal}")
```
5.15 Searching with FAISS#
Now we can search using the index. The search() method returns:
- `D`: Distances (similarities) to the top K matches
- `I`: Indices of the top K matches
```python
# Search with FAISS
query = "What is the Federal Reserve doing about inflation?"
query_emb = model.encode([query], normalize_embeddings=True).astype('float32')

k = 3  # Return top 3 results
D, I = index.search(query_emb, k)

print(f"Query: '{query}'\n")
print("FAISS Results:")
print("=" * 70)
for rank, (score, idx) in enumerate(zip(D[0], I[0]), 1):
    print(f"#{rank} (score: {score:.3f})")
    print(f"   {corpus[idx]}")
    print()
```
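Section 5.14 mentioned approximate indexes. Here is a minimal sketch of an HNSW index over the same embeddings; the `M=32` connectivity parameter is an illustrative default, not a tuned value, and on a corpus this small the results should match the exact index:

```python
# Approximate search with HNSW (graph-based ANN).
# Larger M -> better recall but more memory; inner product works because our vectors are normalized.
hnsw_index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw_index.add(corpus_embeddings.astype('float32'))

D_approx, I_approx = hnsw_index.search(query_emb, k)

print("HNSW (approximate) results:")
for rank, (score, idx) in enumerate(zip(D_approx[0], I_approx[0]), 1):
    print(f"#{rank} (score: {score:.3f}) {corpus[idx][:60]}")
```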
Group 5 — From Retrieval to RAG#
Now we connect everything back to LLMs and enterprise applications.
5.16 What is RAG?#
RAG (Retrieval-Augmented Generation) combines:
1. Retrieval: Find relevant documents using embeddings
2. Augmentation: Add those documents to the LLM prompt
3. Generation: LLM generates an answer using the context
```
WITHOUT RAG:
  User Question → LLM → Answer (might hallucinate)

WITH RAG:
  User Question → Embedding → Vector Search → Retrieved Docs
                                                   ↓
                      [Question + Docs] → LLM → Grounded Answer
```
Why RAG Matters#
| Problem | How RAG Helps |
|---|---|
| Hallucinations | LLM can only use provided facts |
| Outdated knowledge | Retrieve from current documents |
| Proprietary data | Search your own knowledge base |
| Auditability | You can show which documents were used |
| Cost | Retrieval is cheaper than fine-tuning |
5.17 Building a Simple RAG Pipeline#
Let’s build a simple RAG system. We’ll simulate the LLM part, but the retrieval is real.
```python
def rag_retrieve(question, index, corpus, model, k=3):
    """Retrieve relevant documents for RAG."""
    # Encode the question
    q_emb = model.encode([question], normalize_embeddings=True).astype('float32')
    # Search the index
    D, I = index.search(q_emb, k)
    # Collect retrieved documents
    retrieved = []
    for score, idx in zip(D[0], I[0]):
        retrieved.append({
            'document': corpus[idx],
            'score': float(score)
        })
    return retrieved

def build_rag_prompt(question, retrieved_docs):
    """Build a prompt for the LLM with retrieved context."""
    context = "\n".join([f"- {doc['document']}" for doc in retrieved_docs])
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    return prompt

# Demo the RAG pipeline
question = "What actions has the central bank taken regarding interest rates?"

# Step 1: Retrieve
retrieved = rag_retrieve(question, index, corpus, model, k=3)

print("STEP 1: RETRIEVAL")
print("=" * 60)
print(f"Question: {question}\n")
print("Retrieved documents:")
for i, doc in enumerate(retrieved, 1):
    print(f"  {i}. [{doc['score']:.3f}] {doc['document']}")

# Step 2: Build prompt
prompt = build_rag_prompt(question, retrieved)

print("\n" + "=" * 60)
print("STEP 2: RAG PROMPT (would be sent to LLM)")
print("=" * 60)
print(prompt)
```
5.18 Completing the RAG Pipeline with an LLM#
Now let’s send the prompt to an actual LLM to generate a grounded answer.
LLM Gateway Configuration#
We use the same LLM gateway from Module 3. You have two options:
| Option | Model | Why |
|---|---|---|
| A: Local Ollama | `phi3:mini` | Lightweight (2.2GB), runs on most laptops |
| B: JBChat Server | `llama3.1:8b` | Higher quality answers, see production-grade RAG |
Option A: Local Ollama#
If running Jupyter locally: Use http://localhost:11434 directly.
If running in Google Colab: You must expose Ollama via a tunnel (Colab cannot reach your localhost).
Pinggy Setup (required for Colab):
1. Open a terminal on your local machine
2. Make sure Ollama is running: `ollama serve`
3. Start the tunnel: `ssh -p 443 -R0:localhost:11434 a.pinggy.io`
4. Copy the HTTPS URL it gives you (e.g., `https://xyz-abc.a.pinggy.io`)
5. Use that URL in the config below
Option B: Server Gateway (JBChat)#
If you cannot run Ollama locally, use the course server:
- URL: `https://jbchat.jonbowden.com.ngrok.app`
- Requires API key from instructor
- Model: `llama3.1:8b` (better quality)
Configure below:#
```python
# ===== LLM GATEWAY CONFIGURATION =====
# Same setup as Module 3

# ------ OPTION A: Local Ollama ------
# If running Jupyter LOCALLY, use localhost:
LLM_BASE_URL = "http://localhost:11434"
# If running in COLAB, use your Pinggy tunnel URL instead:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"

LLM_API_KEY = None           # No API key → uses Ollama /api/chat endpoint
DEFAULT_MODEL = "phi3:mini"  # Lightweight, runs on most laptops

# ------ OPTION B: Server Gateway (JBChat) ------
# Uncomment these 3 lines to use the course server:
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"
# DEFAULT_MODEL = "llama3.1:8b"  # Higher quality model on server
```
```python
import requests

def call_llm(
    prompt: str,
    model: str = None,
    temperature: float = 0.0,
    max_tokens: int = 256,
    base_url: str = None,
    api_key: str = None,
    timeout: tuple = (10, 120)
) -> str:
    """
    Canonical LLM call - same as Module 3.

    Auto-detects endpoint mode:
    - If API key is set → JBChat gateway (/chat/direct)
    - If no API key → Direct Ollama (/api/chat)
    """
    # Use defaults if not specified
    if base_url is None:
        base_url = LLM_BASE_URL
    if model is None:
        model = DEFAULT_MODEL
    if api_key is None:
        api_key = LLM_API_KEY if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else None

    use_jbchat = api_key is not None

    headers = {
        "Content-Type": "application/json",
        "ngrok-skip-browser-warning": "true",
        "Bypass-Tunnel-Reminder": "true",
    }
    if api_key:
        headers["X-API-Key"] = api_key

    if use_jbchat:
        endpoint = f"{base_url.rstrip('/')}/chat/direct"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": False
        }
    else:
        endpoint = f"{base_url.rstrip('/')}/api/chat"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": temperature},
            "stream": False
        }

    resp = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
    resp.raise_for_status()
    data = resp.json()
    return data["message"]["content"]

# Smoke test
try:
    mode = "JBChat" if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else "Ollama"
    print(f"Mode: {mode} | Model: {DEFAULT_MODEL} | URL: {LLM_BASE_URL}")
    out = call_llm("Say 'LLM connected' in exactly 3 words.", temperature=0.0)
    print(f"Response: {out[:100]}")
except Exception as e:
    print(f"Connection error: {e}")
    print("\nIf using Colab, make sure you've set up the Pinggy tunnel (see instructions above)")
```
Step 3: Send to LLM and Get a Grounded Answer#
Now we complete the RAG pipeline by sending the prompt to the LLM gateway configured above:
```python
# Complete RAG pipeline with LLM generation
def rag_query(question, index, corpus, model, k=3):
    """Complete RAG pipeline: retrieve + generate."""
    # Step 1: Retrieve relevant documents
    retrieved = rag_retrieve(question, index, corpus, model, k)
    # Step 2: Build the prompt
    prompt = build_rag_prompt(question, retrieved)
    # Step 3: Send to LLM via call_llm()
    answer = call_llm(prompt, temperature=0.0, max_tokens=200)
    return {
        'question': question,
        'retrieved': retrieved,
        'answer': answer
    }

# Run the complete RAG pipeline
question = "What actions has the central bank taken regarding interest rates?"
result = rag_query(question, index, corpus, model, k=3)

print("COMPLETE RAG PIPELINE")
print("=" * 70)
print(f"\nQuestion: {result['question']}\n")
print("Retrieved Documents:")
for i, doc in enumerate(result['retrieved'], 1):
    print(f"  {i}. [{doc['score']:.3f}] {doc['document']}")
print("\n" + "-" * 70)
print("LLM ANSWER (grounded in retrieved documents):")
print("-" * 70)
print(result['answer'])
```
Try More Questions#
See how RAG grounds the LLM’s answers in the retrieved documents:
```python
# Try different questions
test_questions = [
    "What happened to mortgage rates?",
    "Who won the football match?",
    "How did inflation affect bank earnings?",
]

for q in test_questions:
    result = rag_query(q, index, corpus, model, k=2)
    print(f"Q: {q}")
    print(f"A: {result['answer'][:200]}...")
    print("-" * 50)
    print()
```
5.19 RAG Best Practices#
Building effective RAG systems requires attention to several factors:
Chunking Strategy#
Documents are usually too long to embed directly. You need to split them into chunks.
| Strategy | Pros | Cons |
|---|---|---|
| Fixed size (e.g., 500 tokens) | Simple, predictable | May split mid-sentence |
| Sentence-based | Natural boundaries | Variable sizes |
| Paragraph-based | Preserves context | May be too large |
| Semantic chunking | Best quality | More complex |
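As a minimal illustration, here is a fixed-size chunker with overlap. The 200-character size and 50-character overlap are arbitrary demo values; real systems usually chunk by tokens rather than characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split text into fixed-size character chunks with some overlap between them."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

long_doc = " ".join(corpus)   # stand-in for a long document
chunks = chunk_text(long_doc)
print(f"{len(long_doc)} characters -> {len(chunks)} chunks")
print(f"First chunk: {chunks[0][:80]}...")
```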
Metadata Filtering#
Add metadata to enable filtering before or after search:
- Document type (policy, FAQ, procedure)
- Date (for recency)
- Department (for access control)
- Confidence scores
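A minimal sketch of metadata filtering, using hypothetical `doc_type` tags for our demo corpus (finance for the first five documents, sports for the last three) and filtering before the similarity ranking:

```python
# Hypothetical metadata for the demo corpus: tag each document with a type
metadata = [{"doc_type": "finance"}] * 5 + [{"doc_type": "sports"}] * 3

def filtered_search(query, doc_type, k=3):
    """Keep only documents whose metadata matches, then rank those by similarity."""
    allowed = [i for i, m in enumerate(metadata) if m["doc_type"] == doc_type]
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_embeddings[allowed] @ q_emb      # cosine == dot product for normalized vectors
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), corpus[allowed[i]]) for i in order]

for score, doc in filtered_search("rate decisions", doc_type="finance", k=2):
    print(f"[{score:.3f}] {doc[:60]}")
```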
Reranking#
First-stage retrieval prioritizes recall (finding all relevant docs). Reranking improves precision:
1. Retrieve top 20-50 candidates
2. Use a more expensive model to rerank
3. Return top 3-5 to the LLM
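One common way to rerank is a cross-encoder from the sentence-transformers library, which scores each (query, document) pair jointly. A sketch; the model name is a widely used public checkpoint and downloads on first run:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder: slower than embedding search, but more precise for the final ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Why did the central bank raise interest rates?"
candidates = semantic_search(query, corpus, corpus_embeddings, model, k=5)   # first stage: recall

pairs = [(query, c['document']) for c in candidates]
rerank_scores = reranker.predict(pairs)

# Second stage: keep the top 3 by cross-encoder score
reranked = sorted(zip(rerank_scores, candidates), key=lambda x: x[0], reverse=True)[:3]
for score, c in reranked:
    print(f"[{score:.2f}] {c['document'][:60]}")
```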
Hybrid Search#
Combine embedding search with keyword search:
- Embeddings: semantic understanding
- Keywords: exact matches (product codes, names)
- Weighted combination of both scores
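A toy sketch of hybrid scoring: blend the embedding similarity with a simple keyword-overlap signal. The 0.7/0.3 weights are arbitrary, and production systems typically use BM25 rather than raw word overlap for the keyword side:

```python
def keyword_score(query, doc):
    """Toy keyword signal: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def hybrid_search(query, k=3, alpha=0.7):
    """Blend embedding similarity (weight alpha) with keyword overlap (weight 1 - alpha)."""
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    sem = corpus_embeddings @ q_emb
    combined = [(alpha * float(sem[i]) + (1 - alpha) * keyword_score(query, doc), doc)
                for i, doc in enumerate(corpus)]
    return sorted(combined, reverse=True)[:k]

for score, doc in hybrid_search("Federal Reserve rate hike", k=3):
    print(f"[{score:.3f}] {doc[:60]}")
```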
Module Summary#
Key Concepts#
| Concept | What It Means |
|---|---|
| Embedding | Vector representation of meaning |
| Similarity | Cosine of angle between vectors |
| Corpus | Collection of documents to search |
| Vector Database | Efficient storage and search for embeddings |
| RAG | Retrieval + LLM for grounded answers |
The RAG Pipeline#
```
1. PREPARE (once):
   Documents → Chunk → Embed → Store in Vector DB

2. QUERY (each request):
   Question → Embed → Search → Retrieve → Build Prompt → LLM → Answer
```
Enterprise Implications#
- Embeddings enable semantic search beyond keywords
- RAG grounds LLM answers in your actual documents
- Vector databases scale to millions of documents
- Retrieval is auditable — you can explain why an answer was given
- This is how enterprise AI assistants work responsibly
Next Steps#
- Quiz — Test your understanding
- Assessment — Apply these concepts
- Resources — Further reading and tools