Module 5 — Embeddings & Vector Databases
CodeVision Academy
Overview#
If Module 4 explains how models learn, Module 5 explains how models remember.
This module introduces embeddings and vector databases, the foundations of:
- semantic search
- retrieval systems
- Retrieval-Augmented Generation (RAG)
The entire module runs in Google Colab (CPU-only) and requires no server access.
One Big Idea to Remember#
Embeddings turn meaning into numbers, so computers can measure similarity.
Learning Objectives#
By the end of this module, you will be able to:
- Explain what embeddings represent (and what they do not)
- Explain how neural networks produce embeddings
- Generate embeddings for text using a lightweight model
- Explain similarity geometrically
- Implement semantic search end-to-end
- Explain why vector databases exist
- Understand chunking, metadata, and retrieval quality
- Explain how embeddings + retrieval enable RAG
Before You Start: Hugging Face Token Setup#
This notebook downloads models from Hugging Face. To avoid rate limits and warnings, you need your own free token.
Quick Setup (2-3 minutes)#
1. Go to your Hugging Face account settings and open the Access Tokens page (huggingface.co/settings/tokens)
2. Click New token → Name it anything → Select Read access → Create
3. Copy the token (you won’t see it again)
4. In Google Colab, click the Key icon in the left sidebar
5. Add a secret named exactly `HF_TOKEN` with your token as the value
6. Turn ON “Notebook access” → Restart the runtime
Important rules:
- Do NOT hard-code tokens in notebooks
- Do NOT share your token
- Do NOT upload tokens to GitHub
Run the cell below to verify your setup:
```python
# Verify HF Token is set up correctly
import os

# Try to get token from Colab secrets first, then environment
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    os.environ['HF_TOKEN'] = hf_token
    print("HF_TOKEN loaded from Colab secrets")
except Exception:
    if 'HF_TOKEN' in os.environ:
        print("HF_TOKEN found in environment")
    else:
        print("WARNING: HF_TOKEN not found!")
        print("Please follow the setup instructions above.")
        print("You may see download warnings without it.")
```
Setup#
Install the required packages. This takes about 1-2 minutes.
```python
!pip -q install sentence-transformers scikit-learn faiss-cpu
```
Group 1 — What Are Embeddings?#
Before we write code, let’s build intuition about what embeddings are and why they matter.
5.1 What Is an Embedding?#
An embedding is a list of numbers (a vector) that represents the meaning of something.
Think of it like coordinates on a map:
- Paris and Lyon are closer together (both in France)
- Paris and Tokyo are far apart (different continents)

Embeddings work the same way for meaning:

- “dog” and “puppy” → vectors close together
- “dog” and “economics” → vectors far apart
```
Traditional approach:            Embedding approach:
  "bank" = "bank"                  "bank" = [0.12, -0.45, 0.78, ...]
  (just text, no meaning)          (captures context and meaning)
```
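To make this concrete, here is a tiny sketch with hand-picked 3-dimensional vectors. Real embeddings have hundreds of dimensions and come from a trained model; the numbers below are purely illustrative:

```python
import numpy as np

# Toy, hand-picked vectors -- purely illustrative, NOT real model output
dog       = np.array([0.8, 0.1, 0.0])
puppy     = np.array([0.7, 0.2, 0.1])
economics = np.array([0.0, 0.1, 0.9])

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"dog vs puppy:     {cosine(dog, puppy):.2f}")      # close to 1.0 -> similar
print(f"dog vs economics: {cosine(dog, economics):.2f}")  # close to 0.0 -> unrelated
```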
Why This Matters#
| Without Embeddings | With Embeddings |
|---|---|
| Search for exact words only | Search for similar meaning |
| “car” won’t find “automobile” | “car” finds “automobile”, “vehicle” |
| Keyword matching | Semantic understanding |
5.2 What Embeddings Are NOT#
Embeddings are powerful, but they have important limitations:
| Embeddings ARE | Embeddings are NOT |
|---|---|
| Statistical patterns | Truth or facts |
| Learned from training data | A knowledge database |
| Good at similarity | Good at reasoning |
| Context-dependent | Universal definitions |
Critical Insight for Enterprise#
Embeddings reflect the biases in their training data:
- If training data associates “nurse” with “female”, the embedding will too
- Domain-specific language may not be well-represented
- Recent events or proprietary terms won’t be captured
Key takeaway: Embeddings are useful approximations, not ground truth.
5.3 How Neural Networks Create Embeddings#
Remember from Module 4: neural networks learn by adjusting weights to reduce error.
Embedding models are trained on tasks like:
- “These two sentences mean the same thing” (similarity)
- “This sentence follows that sentence” (context)
- “This word fits in this blank” (language modeling)
Through training, the network learns to place similar meanings close together:
```
BEFORE TRAINING                AFTER TRAINING
(random positions)             (meaningful positions)

  dog •        • cat             dog •  • puppy
         • puppy                        • cat
  car •
         • economics             car •  • vehicle
  • vehicle
                                 economics •
```
The final layer of the network (before the output) contains the embedding—a compressed representation of meaning.
5.4 Dimensionality: Why Hundreds of Numbers?#
An embedding might have 384, 768, or even 1536 dimensions. Why so many?
Each dimension captures a different aspect of meaning:
| Dimension | Might capture… |
|---|---|
| #1 | Formality (casual ↔ formal) |
| #2 | Sentiment (negative ↔ positive) |
| #3 | Topic (finance ↔ sports) |
| #47 | Tense (past ↔ future) |
| … | Hundreds more subtle features |
Note: We don’t actually know what each dimension means! The network learns these features automatically during training. This is called a latent space.
Practical Tradeoffs#
| Dimension Size | Pros | Cons |
|---|---|---|
| Small (384) | Fast, less memory | Less nuance |
| Large (1536) | More detail | Slower, more storage |
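A quick back-of-the-envelope calculation shows why dimension size matters at scale. This sketch assumes raw float32 storage (4 bytes per value) and ignores index overhead:

```python
# Rough storage estimate: num_vectors * dimensions * 4 bytes (float32)
def storage_gb(num_vectors: int, dims: int) -> float:
    return num_vectors * dims * 4 / 1e9

for dims in (384, 768, 1536):
    print(f"{dims:>4} dims, 1M documents: ~{storage_gb(1_000_000, dims):.1f} GB")
# 384 -> ~1.5 GB, 768 -> ~3.1 GB, 1536 -> ~6.1 GB (vectors only, before index overhead)
```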
5.5 Enterprise Relevance#
Embeddings power many enterprise applications:
| Use Case | How Embeddings Help |
|---|---|
| Document Search | Find relevant policies even with different wording |
| Customer Support | Match queries to similar past tickets |
| Compliance | Flag documents similar to known violations |
| Recommendation | “Customers who viewed X also viewed Y” |
| Duplicate Detection | Find near-duplicate records |
| RAG (Module 5 focus) | Ground LLM answers in real documents |
Why This Matters for Banking#
In regulated industries, embeddings enable:
- Searching regulations by intent, not just keywords
- Detecting similar fraud patterns
- Matching customer queries to approved responses
- Auditable retrieval for compliance
Group 2 — Generating Embeddings#
Now let’s create embeddings with real code.
5.6 Choosing an Embedding Model#
There are many embedding models available. For this module, we use all-MiniLM-L6-v2:
| Property | Value | Why it matters |
|---|---|---|
| Size | 80MB | Fits in Colab memory |
| Dimensions | 384 | Good balance of quality/speed |
| Speed | Fast | Works on CPU |
| Quality | Good | Top performer for its size |
Other Popular Models#
| Model | Dimensions | Best for |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | General purpose, fast |
| all-mpnet-base-v2 | 768 | Higher quality, slower |
| OpenAI text-embedding-3-small | 1536 | API-based, high quality |
| BGE, E5, GTE | Various | Multilingual, specialized |
```python
# Load the embedding model
# This downloads the model (~80MB) on first run
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

print("Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
```
5.7 Building a Corpus#
A corpus is the collection of documents you want to search over.
In a real system, this might be:
- Company policies
- Knowledge base articles
- Customer support history
- Regulatory documents
For this demo, we’ll use a small set of sentences about finance and sports:
```python
# Our corpus: documents we want to search
corpus = [
    "Interest rates were increased by the central bank to control inflation.",
    "The bank raised rates after inflation surprised to the upside.",
    "Quarterly earnings improved as net interest margin widened.",
    "The Federal Reserve announced a 25 basis point rate hike.",
    "Mortgage rates have reached their highest level in 20 years.",
    "Football is a popular sport in Europe.",
    "A goal was scored in the final minute of the match.",
    "The team won the championship after a penalty shootout.",
]

print(f"Corpus size: {len(corpus)} documents")
for i, doc in enumerate(corpus):
    print(f"  [{i}] {doc[:60]}..." if len(doc) > 60 else f"  [{i}] {doc}")
```
5.8 Encoding the Corpus#
Encoding means converting text into embeddings (vectors).
The normalize_embeddings=True option ensures all vectors have length 1, which:
- Makes cosine similarity equal to dot product (faster!)
- Ensures fair comparison regardless of text length
```python
import numpy as np

# Convert all documents to embeddings
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

print(f"Corpus embeddings shape: {corpus_embeddings.shape}")
print(f"  - {corpus_embeddings.shape[0]} documents")
print(f"  - {corpus_embeddings.shape[1]} dimensions per embedding")
print()
print(f"First embedding (first 10 values): {corpus_embeddings[0][:10]}")
print(f"Vector length (should be ~1.0): {np.linalg.norm(corpus_embeddings[0]):.4f}")
```
5.9 Encoding a Query#
When a user asks a question, we encode it the same way.
The query embedding can then be compared to all corpus embeddings to find the most relevant documents.
```python
# A user's question
query = "Why did the central bank raise interest rates?"

# Encode the query
query_embedding = model.encode([query], normalize_embeddings=True)[0]

print(f"Query: '{query}'")
print(f"Query embedding shape: {query_embedding.shape}")
print(f"First 10 values: {query_embedding[:10]}")
```
Group 3 — Similarity and Retrieval#
Now we have vectors for our corpus and query. How do we find the most similar documents?
5.10 Measuring Similarity#
Cosine similarity measures the angle between two vectors:
- 1.0 = identical direction (same meaning)
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction (opposite meaning)
```
          Similar (cos ≈ 0.9)
        ↗
      ↗
Query →
      ↘
        ↘
          Unrelated (cos ≈ 0.1)
```
Why Cosine Similarity?#
| Metric | Pros | Cons |
|---|---|---|
| Cosine similarity | Scale-independent, standard for text | Ignores magnitude |
| Euclidean distance | Intuitive | Affected by vector length |
| Dot product | Fast (with normalized vectors) | Requires normalization |
```python
from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarity between query and ALL corpus documents
similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]

print(f"Query: '{query}'")
print("\nSimilarity scores:")
print("-" * 70)
for doc, score in zip(corpus, similarities):
    # Visual indicator of relevance
    bar = "*" * int(score * 20)
    print(f"[{score:.3f}] {bar:20s} {doc[:50]}...")
```
Interpreting the Results#
Notice how:
- Finance-related sentences score high (0.5-0.8)
- Sports sentences score low (0.1-0.2)
- The model understands “central bank” and “interest rates” are related to “Federal Reserve” and “rate hike”
This is semantic search — finding meaning, not just matching keywords!
5.11 Ranking and Top-K Retrieval#
In practice, we don’t return all documents. We return the top K most relevant.
This is the core of semantic search:
1. Encode the query
2. Calculate similarity to all documents
3. Sort by similarity
4. Return top K results
```python
def semantic_search(query, corpus, corpus_embeddings, model, k=3):
    """Perform semantic search and return top-k results."""
    # Encode query
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    # Calculate similarities
    scores = cosine_similarity([query_emb], corpus_embeddings)[0]
    # Get top-k indices (highest scores first)
    top_indices = np.argsort(scores)[::-1][:k]
    # Return results
    results = []
    for idx in top_indices:
        results.append({
            'rank': len(results) + 1,
            'score': scores[idx],
            'document': corpus[idx]
        })
    return results

# Test it!
query = "Why did the central bank raise interest rates?"
results = semantic_search(query, corpus, corpus_embeddings, model, k=3)

print(f"Query: '{query}'\n")
print("Top 3 results:")
print("=" * 70)
for r in results:
    print(f"#{r['rank']} (score: {r['score']:.3f})")
    print(f"   {r['document']}")
    print()
```
Try Different Queries#
Experiment with the search to see how it handles different types of queries:
```python
# Try these different queries and see what comes back
test_queries = [
    "What happened to mortgage costs?",
    "Tell me about the football game",
    "monetary policy decisions",  # Different words, same concept!
    "Who won the sports competition?"
]

for q in test_queries:
    results = semantic_search(q, corpus, corpus_embeddings, model, k=2)
    print(f"Query: '{q}'")
    for r in results:
        print(f"  [{r['score']:.3f}] {r['document'][:50]}...")
    print()
```
5.12 Failure Modes: When Embeddings Go Wrong#
Embeddings are powerful but not perfect. Understanding their limitations is crucial for enterprise use.
Common Failure Modes#
| Failure Mode | Example | Mitigation |
|---|---|---|
| Domain mismatch | General model doesn’t understand legal jargon | Fine-tune on domain data |
| Ambiguity | “bank” (financial vs river) | Add context, use metadata |
| Negation | “not interested in rates” matches rate documents | Use reranking or hybrid search |
| Length mismatch | Short query vs long document | Chunk documents appropriately |
| Recency | Model doesn’t know recent terms | Update model or use hybrid search |
Important Enterprise Considerations#
- Similarity ≠ Correctness: A document can be similar but contain wrong information
- No reasoning: Embeddings don’t understand logic or causation
- Threshold sensitivity: Choosing the right similarity cutoff is tricky (see the sketch below)
- Adversarial inputs: Carefully crafted queries can retrieve inappropriate content
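To see why thresholds are tricky, here is a minimal sketch that filters `semantic_search` results by a similarity cutoff. The 0.4 cutoff and the query are arbitrary illustrations, not recommended settings; in practice the threshold must be tuned per model and corpus:

```python
# Filter results by a similarity cutoff (0.4 is arbitrary -- tune per model and corpus)
THRESHOLD = 0.4

query = "How do penalty shootouts work?"
results = semantic_search(query, corpus, corpus_embeddings, model, k=5)

kept = [r for r in results if r['score'] >= THRESHOLD]
print(f"Query: '{query}'")
print(f"Results above threshold {THRESHOLD}: {len(kept)} of {len(results)}")
for r in kept:
    print(f"  [{r['score']:.3f}] {r['document'][:50]}")
```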
```python
# Demonstration: Negation doesn't work well
print("Failure mode: Negation")
print("=" * 50)

q1 = "interest rate increases"
q2 = "NOT about interest rates"  # Should match different docs, but won't!

for q in [q1, q2]:
    results = semantic_search(q, corpus, corpus_embeddings, model, k=2)
    print(f"\nQuery: '{q}'")
    for r in results:
        print(f"  [{r['score']:.3f}] {r['document'][:45]}...")

print("\n" + "=" * 50)
print("Notice: Both queries return similar results!")
print("The model focuses on 'interest rates', ignoring 'NOT'.")
```
Group 4 — Vector Databases and Scaling#
What happens when you have millions of documents?
5.13 Why Vector Databases Exist#
Our simple approach (compare query to ALL documents) doesn’t scale:
| Corpus Size | Comparisons per Query | Time (estimate) |
|---|---|---|
| 1,000 | 1,000 | 1 ms |
| 1,000,000 | 1,000,000 | 1 second |
| 1,000,000,000 | 1,000,000,000 | 17 minutes |
Vector databases solve this with approximate nearest neighbor (ANN) algorithms:
- Trade perfect accuracy for massive speed gains
- Find 95% of correct results in 1% of the time
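You can feel the brute-force cost directly. The sketch below times exact search over random vectors; the absolute numbers depend on your machine (and the largest size needs roughly 1.5 GB of RAM), but the linear growth is the point:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
dims = 384
query_vec = rng.standard_normal(dims, dtype=np.float32)

for n in (1_000, 100_000, 1_000_000):
    docs = rng.standard_normal((n, dims), dtype=np.float32)   # ~1.5 GB at n = 1,000,000
    start = time.perf_counter()
    scores = docs @ query_vec                # compare the query against every vector
    top = np.argsort(scores)[::-1][:5]       # keep the 5 best matches
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{n:>9,} vectors: {elapsed_ms:6.1f} ms")
```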
Popular Vector Databases#
| Database | Type | Best For |
|---|---|---|
| FAISS | Library | Local/embedded use |
| Pinecone | Managed service | Production, serverless |
| Weaviate | Open source | Self-hosted, GraphQL |
| ChromaDB | Lightweight | Prototyping, local dev |
| pgvector | PostgreSQL extension | Existing Postgres users |
5.14 FAISS: Fast Similarity Search#
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search.
We’ll use IndexFlatIP (Inner Product, exact search) for this demo. Production systems use approximate indexes like IndexIVFFlat or IndexHNSW.
```
Without an index:                    With an ANN index:
  Query → compare against             Query → check ~100 candidates
          ALL 1,000,000 docs                  (nearly the same quality)
  Slow, O(n)                           Fast, roughly O(log n)
```
```python
import faiss

# Get embedding dimension
d = corpus_embeddings.shape[1]  # 384 for our model

# Create a FAISS index (Inner Product for normalized vectors = cosine similarity)
index = faiss.IndexFlatIP(d)

# Add our corpus embeddings to the index
index.add(corpus_embeddings.astype('float32'))

print("FAISS index created!")
print(f"  - Dimension: {d}")
print(f"  - Vectors indexed: {index.ntotal}")
```
5.15 Searching with FAISS#
Now we can search using the index. The search() method returns:
- `D`: Distances (similarities) to the top K matches
- `I`: Indices of the top K matches
```python
# Search with FAISS
query = "What is the Federal Reserve doing about inflation?"
query_emb = model.encode([query], normalize_embeddings=True).astype('float32')

k = 3  # Return top 3 results
D, I = index.search(query_emb, k)

print(f"Query: '{query}'\n")
print("FAISS Results:")
print("=" * 70)
for rank, (score, idx) in enumerate(zip(D[0], I[0]), 1):
    print(f"#{rank} (score: {score:.3f})")
    print(f"   {corpus[idx]}")
    print()
```
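Section 5.14 mentioned approximate indexes. Here is a minimal sketch of an HNSW index over the same embeddings; the `M=32` connectivity parameter is an illustrative default, not a tuned value, and on a corpus this small the results should match the exact index:

```python
# Approximate search with HNSW (graph-based ANN).
# Larger M -> better recall but more memory; inner product works because our vectors are normalized.
hnsw_index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw_index.add(corpus_embeddings.astype('float32'))

D_approx, I_approx = hnsw_index.search(query_emb, k)

print("HNSW (approximate) results:")
for rank, (score, idx) in enumerate(zip(D_approx[0], I_approx[0]), 1):
    print(f"#{rank} (score: {score:.3f}) {corpus[idx][:60]}")
```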
Group 5 — From Retrieval to RAG#
Now we connect everything back to LLMs and enterprise applications.
5.16 What is RAG?#
RAG (Retrieval-Augmented Generation) combines:
1. Retrieval: Find relevant documents using embeddings
2. Augmentation: Add those documents to the LLM prompt
3. Generation: LLM generates an answer using the context
```
WITHOUT RAG:
  User Question → LLM → Answer (might hallucinate)

WITH RAG:
  User Question → Embedding → Vector Search → Retrieved Docs
                                                   ↓
                      [Question + Docs] → LLM → Grounded Answer
```
Why RAG Matters#
| Problem | How RAG Helps |
|---|---|
| Hallucinations | LLM can only use provided facts |
| Outdated knowledge | Retrieve from current documents |
| Proprietary data | Search your own knowledge base |
| Auditability | You can show which documents were used |
| Cost | Retrieval is cheaper than fine-tuning |
5.17 Building a Simple RAG Pipeline#
Let’s build a simple RAG system. We’ll simulate the LLM part, but the retrieval is real.
```python
def rag_retrieve(question, index, corpus, model, k=3):
    """Retrieve relevant documents for RAG."""
    # Encode the question
    q_emb = model.encode([question], normalize_embeddings=True).astype('float32')
    # Search the index
    D, I = index.search(q_emb, k)
    # Collect retrieved documents
    retrieved = []
    for score, idx in zip(D[0], I[0]):
        retrieved.append({
            'document': corpus[idx],
            'score': float(score)
        })
    return retrieved

def build_rag_prompt(question, retrieved_docs):
    """Build a prompt for the LLM with retrieved context."""
    context = "\n".join([f"- {doc['document']}" for doc in retrieved_docs])
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    return prompt

# Demo the RAG pipeline
question = "What actions has the central bank taken regarding interest rates?"

# Step 1: Retrieve
retrieved = rag_retrieve(question, index, corpus, model, k=3)

print("STEP 1: RETRIEVAL")
print("=" * 60)
print(f"Question: {question}\n")
print("Retrieved documents:")
for i, doc in enumerate(retrieved, 1):
    print(f"  {i}. [{doc['score']:.3f}] {doc['document']}")

# Step 2: Build prompt
prompt = build_rag_prompt(question, retrieved)

print("\n" + "=" * 60)
print("STEP 2: RAG PROMPT (would be sent to LLM)")
print("=" * 60)
print(prompt)
```
5.18 Completing the RAG Pipeline with an LLM#
Now let’s send the prompt to an actual LLM to generate a grounded answer.
LLM Gateway Configuration#
We use the same LLM gateway from Module 3. You have two options:
| Option | Model | Why |
|---|---|---|
| A: Local Ollama | `phi3:mini` | Lightweight (2.2GB), runs on most laptops |
| B: JBChat Server | `llama3.1:8b` | Higher quality answers, see production-grade RAG |
Option A: Local Ollama#
If running Jupyter locally: Use http://localhost:11434 directly.
If running in Google Colab: You must expose Ollama via a tunnel (Colab cannot reach your localhost).
Pinggy Setup (required for Colab):
1. Open a terminal on your local machine
2. Make sure Ollama is running: `ollama serve`
3. Start the tunnel: `ssh -p 443 -R0:localhost:11434 a.pinggy.io`
4. Copy the HTTPS URL it gives you (e.g., `https://xyz-abc.a.pinggy.io`)
5. Use that URL in the config below
Option B: Server Gateway (JBChat)#
If you cannot run Ollama locally, use the course server:
- URL: `https://jbchat.jonbowden.com.ngrok.app`
- Requires API key from instructor
- Model: `llama3.1:8b` (better quality)
Configure below:#
```python
# ===== LLM GATEWAY CONFIGURATION =====
# Same setup as Module 3

# ------ OPTION A: Local Ollama ------
# If running Jupyter LOCALLY, use localhost:
LLM_BASE_URL = "http://localhost:11434"
# If running in COLAB, use your Pinggy tunnel URL instead:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"

LLM_API_KEY = None           # No API key → uses Ollama /api/chat endpoint
DEFAULT_MODEL = "phi3:mini"  # Lightweight, runs on most laptops

# ------ OPTION B: Server Gateway (JBChat) ------
# Uncomment these 3 lines to use the course server:
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"
# DEFAULT_MODEL = "llama3.1:8b"  # Higher quality model on server
```
```python
import requests

def call_llm(
    prompt: str,
    model: str = None,
    temperature: float = 0.0,
    max_tokens: int = 256,
    base_url: str = None,
    api_key: str = None,
    timeout: tuple = (10, 120)
) -> str:
    """
    Canonical LLM call - same as Module 3.

    Auto-detects endpoint mode:
    - If API key is set → JBChat gateway (/chat/direct)
    - If no API key → Direct Ollama (/api/chat)
    """
    # Use defaults if not specified
    if base_url is None:
        base_url = LLM_BASE_URL
    if model is None:
        model = DEFAULT_MODEL
    if api_key is None:
        api_key = LLM_API_KEY if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else None

    use_jbchat = api_key is not None

    headers = {
        "Content-Type": "application/json",
        "ngrok-skip-browser-warning": "true",
        "Bypass-Tunnel-Reminder": "true",
    }
    if api_key:
        headers["X-API-Key"] = api_key

    if use_jbchat:
        endpoint = f"{base_url.rstrip('/')}/chat/direct"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": False
        }
    else:
        endpoint = f"{base_url.rstrip('/')}/api/chat"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": temperature},
            "stream": False
        }

    resp = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
    resp.raise_for_status()
    data = resp.json()
    return data["message"]["content"]

# Smoke test
try:
    mode = "JBChat" if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else "Ollama"
    print(f"Mode: {mode} | Model: {DEFAULT_MODEL} | URL: {LLM_BASE_URL}")
    out = call_llm("Say 'LLM connected' in exactly 3 words.", temperature=0.0)
    print(f"Response: {out[:100]}")
except Exception as e:
    print(f"Connection error: {e}")
    print("\nIf using Colab, make sure you've set up the Pinggy tunnel (see instructions above)")
```
Step 3: Send to LLM and Get a Grounded Answer#
Now we complete the RAG pipeline by sending the prompt to the LLM gateway configured above:
```python
# Complete RAG pipeline with LLM generation
def rag_query(question, index, corpus, model, k=3):
    """Complete RAG pipeline: retrieve + generate."""
    # Step 1: Retrieve relevant documents
    retrieved = rag_retrieve(question, index, corpus, model, k)
    # Step 2: Build the prompt
    prompt = build_rag_prompt(question, retrieved)
    # Step 3: Send to LLM via call_llm()
    answer = call_llm(prompt, temperature=0.0, max_tokens=200)
    return {
        'question': question,
        'retrieved': retrieved,
        'answer': answer
    }

# Run the complete RAG pipeline
question = "What actions has the central bank taken regarding interest rates?"
result = rag_query(question, index, corpus, model, k=3)

print("COMPLETE RAG PIPELINE")
print("=" * 70)
print(f"\nQuestion: {result['question']}\n")
print("Retrieved Documents:")
for i, doc in enumerate(result['retrieved'], 1):
    print(f"  {i}. [{doc['score']:.3f}] {doc['document']}")
print("\n" + "-" * 70)
print("LLM ANSWER (grounded in retrieved documents):")
print("-" * 70)
print(result['answer'])
```
Try More Questions#
See how RAG grounds the LLM’s answers in the retrieved documents:
```python
# Try different questions
test_questions = [
    "What happened to mortgage rates?",
    "Who won the football match?",
    "How did inflation affect bank earnings?",
]

for q in test_questions:
    result = rag_query(q, index, corpus, model, k=2)
    print(f"Q: {q}")
    print(f"A: {result['answer'][:200]}...")
    print("-" * 50)
    print()
```
5.19 RAG Best Practices#
Building effective RAG systems requires attention to several factors:
Chunking Strategy#
Documents are usually too long to embed directly. You need to split them into chunks.
| Strategy | Pros | Cons |
|---|---|---|
| Fixed size (e.g., 500 tokens) | Simple, predictable | May split mid-sentence |
| Sentence-based | Natural boundaries | Variable sizes |
| Paragraph-based | Preserves context | May be too large |
| Semantic chunking | Best quality | More complex |
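As a minimal illustration, here is a fixed-size chunker with overlap. The 200-character size and 50-character overlap are arbitrary demo values; real systems usually chunk by tokens rather than characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split text into fixed-size character chunks with some overlap between them."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

long_doc = " ".join(corpus)   # stand-in for a long document
chunks = chunk_text(long_doc)
print(f"{len(long_doc)} characters -> {len(chunks)} chunks")
print(f"First chunk: {chunks[0][:80]}...")
```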
Metadata Filtering#
Add metadata to enable filtering before or after search:
- Document type (policy, FAQ, procedure)
- Date (for recency)
- Department (for access control)
- Confidence scores
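A minimal sketch of metadata filtering, using hypothetical `doc_type` tags for our demo corpus (finance for the first five documents, sports for the last three) and filtering before the similarity ranking:

```python
# Hypothetical metadata for the demo corpus: tag each document with a type
metadata = [{"doc_type": "finance"}] * 5 + [{"doc_type": "sports"}] * 3

def filtered_search(query, doc_type, k=3):
    """Keep only documents whose metadata matches, then rank those by similarity."""
    allowed = [i for i, m in enumerate(metadata) if m["doc_type"] == doc_type]
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_embeddings[allowed] @ q_emb      # cosine == dot product for normalized vectors
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), corpus[allowed[i]]) for i in order]

for score, doc in filtered_search("rate decisions", doc_type="finance", k=2):
    print(f"[{score:.3f}] {doc[:60]}")
```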
Reranking#
First-stage retrieval prioritizes recall (finding all relevant docs). Reranking improves precision:
1. Retrieve top 20-50 candidates
2. Use a more expensive model to rerank
3. Return top 3-5 to the LLM
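One common way to rerank is a cross-encoder from the sentence-transformers library, which scores each (query, document) pair jointly. A sketch; the model name is a widely used public checkpoint and downloads on first run:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder: slower than embedding search, but more precise for the final ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Why did the central bank raise interest rates?"
candidates = semantic_search(query, corpus, corpus_embeddings, model, k=5)   # first stage: recall

pairs = [(query, c['document']) for c in candidates]
rerank_scores = reranker.predict(pairs)

# Second stage: keep the top 3 by cross-encoder score
reranked = sorted(zip(rerank_scores, candidates), key=lambda x: x[0], reverse=True)[:3]
for score, c in reranked:
    print(f"[{score:.2f}] {c['document'][:60]}")
```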
Hybrid Search#
Combine embedding search with keyword search:
- Embeddings: semantic understanding
- Keywords: exact matches (product codes, names)
- Weighted combination of both scores
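A toy sketch of hybrid scoring: blend the embedding similarity with a simple keyword-overlap signal. The 0.7/0.3 weights are arbitrary, and production systems typically use BM25 rather than raw word overlap for the keyword side:

```python
def keyword_score(query, doc):
    """Toy keyword signal: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def hybrid_search(query, k=3, alpha=0.7):
    """Blend embedding similarity (weight alpha) with keyword overlap (weight 1 - alpha)."""
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    sem = corpus_embeddings @ q_emb
    combined = [(alpha * float(sem[i]) + (1 - alpha) * keyword_score(query, doc), doc)
                for i, doc in enumerate(corpus)]
    return sorted(combined, reverse=True)[:k]

for score, doc in hybrid_search("Federal Reserve rate hike", k=3):
    print(f"[{score:.3f}] {doc[:60]}")
```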
Module Summary#
Key Concepts#
| Concept | What It Means |
|---|---|
| Embedding | Vector representation of meaning |
| Similarity | Cosine of angle between vectors |
| Corpus | Collection of documents to search |
| Vector Database | Efficient storage and search for embeddings |
| RAG | Retrieval + LLM for grounded answers |
The RAG Pipeline#
```
1. PREPARE (once):
   Documents → Chunk → Embed → Store in Vector DB

2. QUERY (each request):
   Question → Embed → Search → Retrieve → Build Prompt → LLM → Answer
```
Enterprise Implications#
- Embeddings enable semantic search beyond keywords
- RAG grounds LLM answers in your actual documents
- Vector databases scale to millions of documents
- Retrieval is auditable — you can explain why an answer was given
- This is how enterprise AI assistants work responsibly
Next Steps#
- Quiz — Test your understanding
- Assessment — Apply these concepts
- Resources — Further reading and tools