Content
This notebook explains Machine Learning (ML) and Deep Learning (DL) in a simple, intuitive way.
You do not need advanced math.
The goal is to understand ideas, not formulas.
📺 Watch first: Hiker in the Fog — ML Analogy Video (recommended)
One Big Idea to Remember
Machine learning means adjusting numbers to make predictions less wrong.
Companion Resources
Hiker’s Cheat Sheet — Maps analogy terms to technical terms
Knowledge Checks — Test your understanding
Part 0 — The AI Family Tree: AI → ML → DL → LLMs
Before diving in, let’s understand how these terms relate:
┌─────────────────────────────────────────────────────────────┐
│  ARTIFICIAL INTELLIGENCE (AI)                               │
│  Any system that mimics human-like intelligence             │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  MACHINE LEARNING (ML)                              │   │
│   │  AI that learns patterns from data                  │   │
│   │                                                     │   │
│   │   ┌─────────────────────────────────────────────┐   │   │
│   │   │  DEEP LEARNING (DL)                         │   │   │
│   │   │  ML using neural networks with many layers  │   │   │
│   │   │                                             │   │   │
│   │   │   ┌─────────────────────────────────────┐   │   │   │
│   │   │   │  LLMs (Large Language Models)       │   │   │   │
│   │   │   │  DL models trained on text          │   │   │   │
│   │   │   │  Examples: GPT, Claude, Llama       │   │   │   │
│   │   │   └─────────────────────────────────────┘   │   │   │
│   │   └─────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
Key Relationships
| Term | What it is | Example |
|---|---|---|
| AI | Broad field of intelligent systems | Chess engines, Siri, self-driving cars |
| ML | Subset of AI that learns from data | Spam filters, recommendation systems |
| DL | Subset of ML using neural networks | Image recognition, speech-to-text |
| LLMs | DL models for language (Generative AI) | ChatGPT, Claude, code assistants |
Why This Matters in Enterprise
In banking and enterprise settings, understanding this hierarchy helps you:
Choose the right tool: Not every problem needs an LLM
Understand limitations: Each layer inherits limitations from the ones above
Manage risk: LLMs add language-specific risks (hallucinations) on top of ML risks (overfitting)
Communicate clearly: Executives often confuse these terms
Part 1 — The Hiker in the Fog
Imagine a hiker standing on a mountain covered in thick fog.
The hiker cannot see far.
The hiker does not know where the lowest point is.
The hiker can only feel whether the ground goes up or down.
The hiker’s goal is simple:
Reach the lowest point.
This is how machine learning works:
Start with wrong guesses
Make small changes
Slowly improve
What the model does NOT know
It does not know the global optimum
It does not know whether a better solution exists elsewhere
It only reacts to local feedback (loss and gradient)
| Story | Meaning |
|---|---|
| Hiker | The model |
| Height | How wrong the model is |
| Fog | Not knowing the right answer |
| Step | Small change to the model |
| Lowest point | Best possible model |
Part 2 — What Is a Model?
A model is a rule that turns inputs into outputs.
Example:
Input: hours studied
Output: exam score
def predict(hours, weight, bias):
    return weight * hours + bias

weight = 1.0
bias = 0.0
print("Prediction for 5 hours of study:", predict(5, weight, bias))
Prediction for 5 hours of study: 5.0
Part 3 — Weights
Weights are numbers inside the model.
They control predictions
They start as guesses
Learning means changing them
Let’s see how different weights change predictions:
# Same input, different weights = different predictions
hours = 5

# Try different weights
for weight in [1.0, 5.0, 10.0, 15.0]:
    prediction = weight * hours
    print(f"Weight = {weight:4.1f} → Prediction = {prediction:5.1f}")

print("\nThe RIGHT weight depends on the actual data!")
print("If students who study 5 hours score ~75, weight ≈ 15 is best.")
Weight = 1.0 → Prediction = 5.0
Weight = 5.0 → Prediction = 25.0
Weight = 10.0 → Prediction = 50.0
Weight = 15.0 → Prediction = 75.0
The RIGHT weight depends on the actual data!
If students who study 5 hours score ~75, weight ≈ 15 is best.
Part 4 — Loss
Loss tells us how wrong a prediction is.
Big loss = very wrong
Small loss = almost right
A common loss for numeric predictions is squared error: (predicted - actual)²
# Calculate loss for different predictions
actual_score = 80
print("If actual score is 80:")
print("-" * 40)
for predicted in [60, 70, 75, 80, 85]:
    loss = (predicted - actual_score) ** 2
    print(f"Predicted: {predicted} → Loss: {loss:4d} {'← Perfect!' if loss == 0 else ''}")
If actual score is 80:
----------------------------------------
Predicted: 60 → Loss: 400
Predicted: 70 → Loss: 100
Predicted: 75 → Loss: 25
Predicted: 80 → Loss: 0 ← Perfect!
Predicted: 85 → Loss: 25
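With more than one example, a common choice is to average the squared errors, which gives the mean squared error (MSE). The sketch below assumes a weight-only model; the helper name `mean_squared_error` and the three data points are made up for illustration.

```python
# Hypothetical helper: average squared error of a weight-only model
# over a list of (hours, actual_score) pairs.
def mean_squared_error(weight, data):
    total = 0.0
    for hours, actual in data:
        predicted = weight * hours
        total += (predicted - actual) ** 2
    return total / len(data)

# Illustrative data (assumed): every score is exactly 15 x hours
data = [(2, 30), (4, 60), (5, 75)]

for weight in [5.0, 10.0, 15.0]:
    print(f"weight = {weight:4.1f} -> MSE = {mean_squared_error(weight, data):7.1f}")
```

The closer the weight is to 15, the smaller the average loss becomes, reaching zero when every prediction matches.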
Part 5 — Learning by Small Steps
The model changes its weights a little at a time.
If the loss gets smaller, the change was good. If the loss gets bigger, try a different direction.
If you’ve watched the “Hiker in the Fog” video for this module, this is exactly what’s happening: the model can’t see the best solution, only the slope right under its feet.
# Gradient Descent: Finding the best weight step by step
# Goal: predict exam scores from hours studied

# Our "training data" - one student
actual_hours = 5
actual_score = 75

# Start with a wrong guess
weight = 1.0
learning_rate = 0.01  # Small steps! (0.1 would overshoot badly)

print("Gradient Descent in Action")
print("=" * 60)
print(f"Goal: Find weight so that {actual_hours} hours → {actual_score} points")
print(f"Perfect weight would be: {actual_score/actual_hours} (since {actual_score}/{actual_hours} = {actual_score/actual_hours})")
print(f"Starting weight: {weight} (way too low!)")
print()

for step in range(8):
    # 1. Make prediction with current weight
    prediction = weight * actual_hours

    # 2. Calculate loss (how wrong are we?)
    loss = (prediction - actual_score) ** 2

    # 3. Calculate gradient (which direction to go, and how steep)
    #    Negative gradient means we need to INCREASE the weight
    gradient = 2 * (prediction - actual_score) * actual_hours

    # 4. Update weight (take a small step in the right direction)
    old_weight = weight
    weight = weight - learning_rate * gradient

    direction = "↑" if weight > old_weight else "↓"
    print(f"Step {step}: pred={prediction:5.1f}, loss={loss:8.1f}, weight {old_weight:.2f}→{weight:.2f} {direction}")

print()
print(f"Final weight: {weight:.2f} (target was {actual_score/actual_hours})")
print(f"Final prediction: {weight * actual_hours:.1f} (target was {actual_score})")
print("✓ Loss decreased at every step - the model improved!")
Gradient Descent in Action
============================================================
Goal: Find weight so that 5 hours → 75 points
Perfect weight would be: 15.0 (since 75/5 = 15.0)
Starting weight: 1.0 (way too low!)
Step 0: pred= 5.0, loss= 4900.0, weight 1.00→8.00 ↑
Step 1: pred= 40.0, loss= 1225.0, weight 8.00→11.50 ↑
Step 2: pred= 57.5, loss= 306.2, weight 11.50→13.25 ↑
Step 3: pred= 66.2, loss= 76.6, weight 13.25→14.12 ↑
Step 4: pred= 70.6, loss= 19.1, weight 14.12→14.56 ↑
Step 5: pred= 72.8, loss= 4.8, weight 14.56→14.78 ↑
Step 6: pred= 73.9, loss= 1.2, weight 14.78→14.89 ↑
Step 7: pred= 74.5, loss= 0.3, weight 14.89→14.95 ↑
Final weight: 14.95 (target was 15.0)
Final prediction: 74.7 (target was 75)
✓ Loss decreased at every step - the model improved!
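Where does the gradient formula `2 * (prediction - actual_score) * actual_hours` come from? It is the slope of the squared-error loss with respect to the weight. One way to convince yourself is to measure the slope numerically by nudging the weight a tiny amount in each direction. The sketch below does that; the function names are hypothetical and the step size `eps` is an arbitrary small number.

```python
# Squared-error loss for one training example: 5 hours -> 75 points
hours, actual_score = 5, 75

def loss(weight):
    return (weight * hours - actual_score) ** 2

def analytic_gradient(weight):
    # Slope from calculus: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2 * (weight * hours - actual_score) * hours

def numeric_gradient(weight, eps=1e-6):
    # Central difference: measure the slope by nudging the weight both ways
    return (loss(weight + eps) - loss(weight - eps)) / (2 * eps)

for w in [1.0, 8.0, 15.0]:
    print(f"w={w:5.1f}  analytic={analytic_gradient(w):9.2f}  numeric={numeric_gradient(w):9.2f}")
```

The two columns should agree closely, and the slope is zero at a weight of 15, the bottom of the valley.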
Backpropagation: How Errors Flow Backward
In the example above, we had one weight. But real networks have millions of weights across many layers. How do we know which weights to change?
Backpropagation (“backward propagation of errors”):
Forward pass: Input flows through the network → prediction
Calculate loss: Compare prediction to actual answer
Backward pass: Error signal flows backward through each layer
Update weights: Each weight gets adjusted based on how much it contributed to the error
FORWARD PASS (make prediction):
Input → Layer 1 → Layer 2 → Layer 3 → Prediction
BACKWARD PASS (assign blame):
← Layer 1 ← Layer 2 ← Layer 3 ← Loss
(how much did each layer contribute to the error?)
Key insight: Weights that contributed more to the error get changed more.
This is why frameworks like PyTorch are valuable—they compute backpropagation automatically!
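To make the backward pass concrete, here is a hand-written sketch for the smallest possible "deep" network: two layers with one weight each and no activation function. The starting weights and learning rate are arbitrary assumptions; real frameworks derive these gradient lines automatically.

```python
# Tiny two-layer "network" with one weight per layer and no activation:
#   hidden = w1 * x,  prediction = w2 * hidden,  loss = (prediction - y)^2
def forward_backward(x, y, w1, w2):
    # Forward pass: keep intermediate values, the backward pass reuses them
    hidden = w1 * x
    prediction = w2 * hidden
    loss = (prediction - y) ** 2

    # Backward pass: chain rule, starting at the loss and moving toward the input
    d_prediction = 2 * (prediction - y)   # dLoss/dPrediction
    d_w2 = d_prediction * hidden          # blame assigned to the layer 2 weight
    d_hidden = d_prediction * w2          # error signal passed back to layer 1
    d_w1 = d_hidden * x                   # blame assigned to the layer 1 weight
    return loss, d_w1, d_w2

x, y = 5.0, 75.0          # one training example: 5 hours -> 75 points
w1, w2 = 1.0, 2.0         # arbitrary starting weights (assumed)
learning_rate = 0.001     # assumed; chosen small enough to stay stable here

for step in range(5):
    loss, d_w1, d_w2 = forward_backward(x, y, w1, w2)
    w1 -= learning_rate * d_w1
    w2 -= learning_rate * d_w2
    print(f"step {step}: loss={loss:8.2f}, w1={w1:.3f}, w2={w2:.3f}")
```

Notice that the gradient for `w1` can only be computed after the error signal has passed back through layer 2, which is why the pass runs backward.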
Convergence: When Training Stops Improving
Convergence means the model has stopped improving—the loss has settled to a stable value.
But convergence has an important limitation:
Loss
 │
 │╲                        ╱
 │ ╲        ╱╲            ╱
 │  ╲      ╱  ╲          ╱
 │   ╲    ╱    ╲        ╱
 │    ╲  ╱      ╲      ╱
 │     ╲╱        ╲    ╱
 │      •         ╲  ╱
 │  Local minimum  ╲╱
 │                  ★  Global minimum
 └──────────────────────────── Weights
| Term | Meaning | Implication |
|---|---|---|
| Local minimum | A low point with higher points on both sides | Model might get “stuck” here |
| Global minimum | The absolute lowest point | What we ideally want |
| Convergence | Loss stopped decreasing | Does NOT mean we found the best solution! |
Why this matters for enterprise:
A “converged” model might still be suboptimal
Different random starting weights can lead to different final models
This is why ML teams train multiple models and compare them
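The effect of the starting point can be seen on a toy loss curve with two valleys. The curve below is an assumption chosen purely for illustration (real loss surfaces have many dimensions), and `descend` is a hypothetical helper that runs plain gradient descent.

```python
# Illustrative loss curve with two valleys (an assumption for this sketch):
#   loss(w) = w^4 - 4w^2 + w
def loss(w):
    return w**4 - 4 * w**2 + w

def gradient(w):
    # Derivative of the loss above
    return 4 * w**3 - 8 * w + 1

def descend(start, learning_rate=0.01, steps=500):
    # Plain gradient descent from a given starting weight
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

for start in [-2.0, 2.0]:
    w = descend(start)
    print(f"start={start:5.1f} -> converged to w={w:6.3f}, loss={loss(w):6.3f}")
```

Both runs converge, but only the run that starts on the left reaches the deeper valley. The run that starts on the right settles in the shallower one and has no way of knowing a better solution exists.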
Part 6 — Training
Training means:
Make a prediction (forward pass)
Measure how wrong it is (loss)
Calculate gradients (backward pass)
Adjust weights
Repeat
One full pass through all training data is called an epoch.
Multiple epochs = multiple passes through the same data, refining the model each time.
Part 7 — Common Problems and Mitigations
Overfitting
The model memorizes the training data instead of learning general patterns.
Signs of overfitting:
Training loss is very low
Validation/test loss is much higher
Model fails on new data it hasn’t seen
Underfitting
The model is too simple and fails to capture the patterns in the data.
Signs of underfitting:
Both training and test loss are high
Model makes poor predictions on everything
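The signs listed above can be turned into a crude rule-of-thumb check. This is only a sketch: the function name `diagnose_fit` and both thresholds are assumptions, and sensible values depend entirely on the problem and the scale of the loss.

```python
# Hypothetical rule-of-thumb check. Both thresholds are assumptions and
# would need tuning for a real problem and loss scale.
def diagnose_fit(train_loss, val_loss, high_loss=100.0, gap_ratio=2.0):
    if train_loss > high_loss and val_loss > high_loss:
        return "underfitting"      # poor on training data AND on new data
    if val_loss > gap_ratio * train_loss:
        return "overfitting"       # fine on training data, much worse on new data
    return "reasonable fit"

print(diagnose_fit(train_loss=2.0, val_loss=150.0))
print(diagnose_fit(train_loss=400.0, val_loss=420.0))
print(diagnose_fit(train_loss=20.0, val_loss=25.0))
```

In practice teams look at the full training and validation loss curves rather than a single pair of numbers, but the comparison being made is the same.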
Mitigations
| Problem | Mitigation | How it helps |
|---|---|---|
| Overfitting | More training data | Harder to memorize larger datasets |
| Overfitting | Regularization (L1/L2) | Penalizes large weights, forces simplicity |
| Overfitting | Dropout | Randomly ignores neurons during training |
| Overfitting | Early stopping | Stop training when validation loss stops improving |
| Overfitting | Data augmentation | Create variations of training data |
| Underfitting | More complex model | Add more layers/neurons |
| Underfitting | Train longer | More epochs to learn patterns |
| Underfitting | Better features | Provide more relevant input data |
| Underfitting | Reduce regularization | Allow model more flexibility |
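Early stopping is one of the easiest mitigations to sketch in code. The version below uses a simple "patience" rule: stop once the validation loss has failed to improve for a set number of epochs in a row. The validation losses are made-up numbers standing in for a real training run, and the function name is hypothetical.

```python
# Early stopping sketch. The validation losses are made-up numbers standing in
# for what a real training run would report after each epoch.
validation_losses = [90.0, 60.0, 45.0, 40.0, 41.0, 43.0, 47.0, 52.0]

def early_stopping_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch) using a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch   # no improvement for `patience` epochs
    return len(losses) - 1, best_epoch     # ran out of epochs without triggering

stop_epoch, best_epoch = early_stopping_epoch(validation_losses)
print(f"Stopped at epoch {stop_epoch}; best weights were from epoch {best_epoch}")
```

A real training loop would also save a copy of the weights at the best epoch so they can be restored after stopping.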
Why This Matters in Financial Models
In banking and finance, overfitting is particularly dangerous:
A model might appear to predict market movements perfectly on historical data
But fail completely when deployed on new, real-world data
This can lead to significant financial losses
Regulators (such as the Fed and the PRA) expect independent model validation, which includes checking how a model performs on data it has not seen
# Complete Training Loop with Multiple Data Points

# Training data: hours studied → exam scores
training_data = [
    (1, 20),  # 1 hour → 20 points
    (2, 35),  # 2 hours → 35 points
    (3, 50),  # 3 hours → 50 points
    (5, 75),  # 5 hours → 75 points
    (7, 90),  # 7 hours → 90 points
]

# Initialize weight
weight = 0.0
learning_rate = 0.01

print("Training over 3 epochs (3 passes through all data)")
print("=" * 55)

for epoch in range(3):
    total_loss = 0
    for hours, actual in training_data:
        # Forward pass: make prediction
        prediction = weight * hours

        # Calculate loss
        loss = (prediction - actual) ** 2
        total_loss += loss

        # Backward pass: calculate gradient and update
        gradient = 2 * (prediction - actual) * hours
        weight = weight - learning_rate * gradient

    avg_loss = total_loss / len(training_data)
    print(f"Epoch {epoch + 1}: avg_loss = {avg_loss:8.1f}, weight = {weight:.2f}")

print(f"\nFinal model: score = {weight:.1f} × hours")
print(f"Prediction for 4 hours: {weight * 4:.0f} points")
Training over 3 epochs (3 passes through all data)
=======================================================
Epoch 1: avg_loss = 1366.2, weight = 12.79
Epoch 2: avg_loss = 78.3, weight = 12.89
Epoch 3: avg_loss = 76.8, weight = 12.89
Final model: score = 12.9 × hours
Prediction for 4 hours: 52 points
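The loop above learns only a weight, so the average loss levels off well above zero. The model from Part 2 also had a bias, and the same recipe can learn both numbers. The sketch below is a variation, not the loop above: it averages the gradient over all examples before each update (full-batch gradient descent), and the epoch count is an assumption chosen so the numbers settle.

```python
# Same training data as the loop above: (hours studied, exam score)
training_data = [(1, 20), (2, 35), (3, 50), (5, 75), (7, 90)]

weight, bias = 0.0, 0.0
learning_rate = 0.01
n = len(training_data)

for epoch in range(5000):
    # Average the gradients over the whole dataset (full-batch gradient descent)
    grad_w = sum(2 * (weight * h + bias - s) * h for h, s in training_data) / n
    grad_b = sum(2 * (weight * h + bias - s) for h, s in training_data) / n
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

mse = sum((weight * h + bias - s) ** 2 for h, s in training_data) / n
print(f"Learned: score = {weight:.2f} x hours + {bias:.2f} (MSE = {mse:.2f})")
```

Giving the model a second adjustable number lets the line shift up or down as well as tilt, so it can sit closer to the data.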
import matplotlib.pyplot as plt
# Demonstrate overfitting vs good fit
# Training data (what the model sees)
train_hours = [1, 2, 3, 5, 7]
train_scores = [20, 35, 50, 75, 90]
# Test data (new, unseen data)
test_hours = [4, 6]
test_scores = [62, 82] # Actual scores
# Good model: simple linear fit (generalizes well)
good_weight = 12.5
# Overfit model: memorized exact training points with complex formula
# (simulated - in reality this would be a high-degree polynomial)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot 1: Good Fit
ax1 = axes[0]
ax1.scatter(train_hours, train_scores, color='blue', s=100, label='Training data', zorder=5)
ax1.scatter(test_hours, test_scores, color='green', s=100, marker='s', label='Test data (unseen)', zorder=5)
x_line = range(0, 9)
y_line = [good_weight * x for x in x_line]
ax1.plot(x_line, y_line, 'b-', linewidth=2, label=f'Model: {good_weight}×hours')
ax1.set_xlabel('Hours Studied')
ax1.set_ylabel('Exam Score')
ax1.set_title('GOOD FIT\n(Generalizes to new data)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 8)
ax1.set_ylim(0, 100)
# Plot 2: Overfitting
ax2 = axes[1]
ax2.scatter(train_hours, train_scores, color='blue', s=100, label='Training data', zorder=5)
ax2.scatter(test_hours, test_scores, color='green', s=100, marker='s', label='Test data (unseen)', zorder=5)
# Straight segments through every training point stand in for a wiggly, overfit curve
ax2.plot(train_hours, train_scores, 'r-', linewidth=2, label='Overfit model')
# Simulated overfit prediction at 4 hours (illustrative; not computed from the line above)
ax2.scatter([4], [45], color='red', s=100, marker='x', label='Bad prediction!', zorder=5)
ax2.set_xlabel('Hours Studied')
ax2.set_ylabel('Exam Score')
ax2.set_title('OVERFITTING\n(Memorized training, fails on new data)')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 8)
ax2.set_ylim(0, 100)
plt.tight_layout()
plt.show()
print("Key insight: Overfitting = perfect on training, poor on new data")
Part 8 — Deep Learning
Deep Learning uses many simple models together.
Each small part is called a neuron.
Together, they can learn complex patterns.
Neural Network Architecture: Layers
A neural network organizes neurons into layers:
INPUT LAYER          HIDDEN LAYER(S)          OUTPUT LAYER
     ○                      ○
     ○   ──────────────►    ○   ──────────────►    ○
     ○                      ○
                            ○
| Layer | Purpose | Example |
|---|---|---|
| Input Layer | Receives raw data | Pixel values, hours studied, sensor readings |
| Hidden Layer(s) | Learns patterns | Combines inputs in useful ways |
| Output Layer | Produces final answer | Classification, score prediction |
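“Deep” Learning = Many Hidden Layers
As a rough guide, 1-2 hidden layers = “shallow” network
10+ hidden layers = “deep” network
Large language models go much further: GPT-3 has 96 layers. (GPT-4's layer count has not been published; the often-quoted ~120 is an unofficial estimate.)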
A Single Neuron
A neuron does three things:
Multiply inputs by weights
Add a bias
Apply an activation function (adds non-linearity)
import matplotlib.pyplot as plt

# A single neuron with ReLU activation
def relu(x):
    """ReLU: if negative, output 0. Otherwise, output x."""
    return max(0, x)

def neuron(inputs, weights, bias):
    """A single neuron: weighted sum + bias + activation"""
    # Step 1: weighted sum
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    # Step 2: add bias
    with_bias = weighted_sum + bias
    # Step 3: apply activation (ReLU)
    output = relu(with_bias)
    return output

# Example: 2 inputs (like hours studied, hours slept)
inputs = [5, 8]    # 5 hours studied, 8 hours slept
weights = [10, 5]  # studying matters more than sleep
bias = -20

result = neuron(inputs, weights, bias)
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Bias: {bias}")
print(f"Weighted sum: {inputs[0]}×{weights[0]} + {inputs[1]}×{weights[1]} = {sum(i*w for i,w in zip(inputs, weights))}")
print(f"With bias: {sum(i*w for i,w in zip(inputs, weights))} + {bias} = {sum(i*w for i,w in zip(inputs, weights)) + bias}")
print(f"After ReLU: {result}")

# Visualize ReLU
print("\n--- ReLU Activation Function ---")
x_vals = list(range(-5, 6))
y_vals = [relu(x) for x in x_vals]

plt.figure(figsize=(8, 3))
plt.plot(x_vals, y_vals, 'b-', linewidth=2)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('ReLU: Negative → 0, Positive → unchanged')
plt.grid(True, alpha=0.3)
plt.show()
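A layer is simply several neurons that all look at the same inputs, and a network is layers feeding into one another. The sketch below stacks a three-neuron hidden layer and a one-neuron output layer. The `layer` helper is hypothetical, and every weight and bias is a made-up value for illustration (an untrained network would start with random ones).

```python
def relu(x):
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # Weighted sum + bias, then ReLU
    return relu(sum(i * w for i, w in zip(inputs, weights)) + bias)

def layer(inputs, weight_rows, biases):
    # One output per neuron; every neuron sees the same inputs
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# All weights and biases below are made-up values for illustration
inputs = [5, 8]                                   # hours studied, hours slept
hidden = layer(inputs, [[10, 5], [-3, 1], [2, -4]], [-20, 0, 1])
output = layer(hidden, [[1.0, 0.5, 2.0]], [5.0])  # output layer reads the hidden layer

print("Hidden layer outputs:", hidden)
print("Network output:", output)
```

Two of the three hidden neurons output zero here because ReLU switches them off for this input. That on/off behaviour is what lets a network respond differently to different inputs.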
Part 9 — Training vs Using a Model
Training: weights change
Using the model: weights stay the same
# Training vs Inference: The Two Phases

class SimpleModel:
    def __init__(self):
        self.weight = 1.0  # Start with a guess

    def predict(self, hours):
        return self.weight * hours

    def train_step(self, hours, actual_score, learning_rate=0.01):
        """Training: weights CHANGE"""
        prediction = self.predict(hours)
        gradient = 2 * (prediction - actual_score) * hours
        self.weight = self.weight - learning_rate * gradient
        return prediction

# Create model
model = SimpleModel()

print("=" * 50)
print("PHASE 1: TRAINING (weights change)")
print("=" * 50)

training_examples = [(3, 45), (5, 75), (7, 105)]
for hours, score in training_examples:
    old_weight = model.weight
    pred = model.train_step(hours, score)
    print(f"Input: {hours}h → Actual: {score}, Predicted: {pred:.0f}")
    print(f"  Weight changed: {old_weight:.2f} → {model.weight:.2f}")

print()
print("=" * 50)
print("PHASE 2: INFERENCE (weights frozen)")
print("=" * 50)
print(f"Final trained weight: {model.weight:.2f}")
print()

# Now use the trained model (no more training)
for hours in [1, 4, 6, 10]:
    prediction = model.predict(hours)
    print(f"Inference: {hours} hours → predicted score: {prediction:.0f}")
Part 10 — LLMs: How Language Models Work (and Why They Hallucinate)
Now that you understand ML/DL fundamentals, let’s see how they apply to Large Language Models (LLMs) like GPT and Claude.
How LLMs Are Trained
LLMs learn through self-supervised learning on massive text datasets:
Training data: Billions of documents from the internet, books, code
Task: Predict the next word (token) given previous words
Process: Same gradient descent we learned—adjust weights to reduce prediction error
Input: "The capital of France is ___"
Target: "Paris"
Model predicts → Calculates loss → Backpropagation → Updates weights
Next-Token Prediction
LLMs don’t “understand” like humans. They learn statistical patterns:
| What LLM sees | What it learns |
|---|---|
| “The sky is ___” | “blue” often follows |
| “Once upon a ___” | “time” often follows |
| “SELECT * FROM ___” | Table names often follow |
Key insight: LLMs are pattern completion machines. They predict what text typically follows, based on training data.
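The idea of "predict what typically follows" can be shown with a far simpler model than an LLM: count which word follows each word in some text, then predict the most frequent follower. This is a deliberately crude stand-in. The corpus is made up, and real LLMs use neural networks over long contexts rather than raw counts of word pairs.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real LLMs train on vastly more text and use
# neural networks rather than raw counts.
corpus = "the sky is blue . the sea is blue . the grass is green .".split()

# Count which word follows each word
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the most frequent follower and its estimated probability."""
    counts = next_word_counts[word]
    follower, count = counts.most_common(1)[0]
    return follower, count / sum(counts.values())

word, probability = predict_next("is")
print(f"After 'is', the most likely next word is '{word}' (p = {probability:.2f})")
```

The model says "blue" because that is what followed "is" most often in its training text, not because it knows anything about skies or seas.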
Why Hallucinations Are Expected
Hallucination = LLM produces confident-sounding but incorrect information.
This isn’t a bug—it’s a direct consequence of how LLMs work:
| ML Concept | LLM Behavior |
|---|---|
| Pattern completion | Generates plausible-sounding text even when facts are wrong |
| Training data bias | Repeats errors or biases present in training data |
| No fact-checking | Model doesn’t verify claims—just predicts likely text |
| Over-generalisation | Applies patterns to situations where they don’t apply |
Example: If asked about a fictional event, an LLM might generate a detailed, confident-sounding description—because that’s what text about events typically looks like.
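One mechanical reason for this: a language model's final step converts scores for candidate tokens into probabilities that always sum to 1, so some continuation is always the most likely one. The sketch below is a simplified illustration. The candidate tokens and their scores are made-up numbers, not output from any real model.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result always sums to 1
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for candidate next tokens after a prompt about an event
# the model has never seen. The numbers are illustrative only.
candidates = ["1987", "1992", "2003", "unknown"]
scores = [2.1, 1.9, 1.7, 0.3]

probabilities = softmax(scores)
for token, p in zip(candidates, probabilities):
    print(f"{token:>8}: {p:.2f}")

best = candidates[probabilities.index(max(probabilities))]
print("Model outputs:", best)
```

Nothing in this step checks whether the chosen token is true. It is only the most probable continuation given the patterns the model has learned.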
Why This Matters for Enterprise
In banking and enterprise settings, hallucinations are a serious risk:
Compliance: LLM might cite non-existent regulations
Financial advice: LLM might invent statistics or market data
Legal: LLM might fabricate case law (this has happened!)
Reputation: Incorrect information damages trust
The Solution: Grounding and RAG
RAG (Retrieval-Augmented Generation) mitigates hallucinations by:
Retrieving relevant documents from a trusted knowledge base
Providing these documents as context to the LLM
Grounding the response in actual evidence
WITHOUT RAG:
User question → LLM → Potentially hallucinated answer
WITH RAG:
User question → Search knowledge base → Retrieve relevant docs
→ LLM + docs → Answer grounded in evidence
Think of it as giving the hiker (model) a map and signposts, rather than relying only on memory of terrain from past walks.
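The retrieve-then-prompt flow can be sketched without any LLM at all. In the toy version below, the documents, the word-overlap scoring rule and the prompt template are all hypothetical. Production systems typically retrieve with embedding-based search and then send the assembled prompt to a model.

```python
# Toy retrieval step. The documents, the scoring rule and the prompt template
# are all hypothetical; production systems typically use embedding search.
knowledge_base = [
    "Policy 12: Wire transfers above 10,000 USD require a second approver.",
    "Policy 7: Customer passwords must be rotated every 90 days.",
    "Policy 3: Branch offices close at 17:00 on weekdays.",
]

def tokenize(text):
    # Lowercase and strip a few punctuation marks, then split into a set of words
    for mark in ".:?":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(question, documents, top_k=1):
    # Rank documents by how many words they share with the question
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    context = "\n".join(f"- {d}" for d in documents)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not there, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

question = "Do wire transfers require a second approver?"
docs = retrieve(question, knowledge_base)
print(build_prompt(question, docs))
```

The printed prompt is what would be sent to the LLM: the question plus the evidence it should rely on, rather than the question alone.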
Key Takeaways
LLMs are pattern completion machines, not knowledge databases
Hallucinations are expected behavior, not bugs
Training data quality directly affects output quality
Enterprise use requires grounding (RAG) and human oversight
Never trust LLM outputs for facts without verification
Bonus: PyTorch Preview
In real ML projects, you use frameworks like PyTorch or TensorFlow/Keras.
They do the same things we did above, but:
Handle gradients automatically (no manual math!)
Run on GPUs for speed
Provide building blocks for complex models
Here’s what our training loop looks like in PyTorch:
# PyTorch version of our training loop
# (This is what real ML code looks like!)
try:
    import torch
    import torch.nn as nn

    # Training data as PyTorch tensors
    X = torch.tensor([[1.0], [2.0], [3.0], [5.0], [7.0]])  # hours
    y = torch.tensor([[20.0], [35.0], [50.0], [75.0], [90.0]])  # scores

    # Define a simple model (1 input → 1 output)
    model = nn.Linear(1, 1)

    # Loss function and optimizer (same concepts!)
    loss_fn = nn.MSELoss()  # Mean Squared Error
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # Gradient Descent

    print("PyTorch Training Loop")
    print("=" * 40)

    # Training loop - same 4 steps!
    for epoch in range(100):
        # 1. Forward pass (predict)
        predictions = model(X)

        # 2. Calculate loss
        loss = loss_fn(predictions, y)

        # 3. Backward pass (gradients calculated automatically!)
        optimizer.zero_grad()
        loss.backward()

        # 4. Update weights
        optimizer.step()

        if epoch % 20 == 0:
            print(f"Epoch {epoch:3d}: loss = {loss.item():.2f}")

    # Show learned parameters
    weight = model.weight.item()
    bias = model.bias.item()
    print(f"\nLearned: score = {weight:.1f} × hours + {bias:.1f}")

    # Inference
    with torch.no_grad():  # No gradients needed for inference
        test_hours = torch.tensor([[4.0]])
        prediction = model(test_hours)
        print(f"Prediction for 4 hours: {prediction.item():.0f} points")

except ImportError:
    print("PyTorch not installed in this environment.")
    print("This is just a preview - the same concepts apply!")
    print()
    print("Key PyTorch concepts:")
    print(" - torch.tensor() → data containers")
    print(" - nn.Linear() → a layer with weights")
    print(" - loss.backward() → automatic gradient calculation")
    print(" - optimizer.step() → update weights")
Part 11 — Final Summary
The Core ML Loop
Machine learning works like this:
Start with random guesses (weights)
Make predictions
Measure how wrong they are (loss)
Adjust weights using gradients (backpropagation)
Repeat until convergence
Machine learning is learning by gradual improvement.
Key Concepts Covered
| Concept | What it means |
|---|---|
| AI → ML → DL → LLM | Nested hierarchy of technologies |
| Weights & Biases | The learnable numbers in a model |
| Loss Function | Measures how wrong predictions are |
| Gradient Descent | Method to minimize loss |
| Backpropagation | How errors flow backward to update weights |
| Convergence | When training stabilizes (but may be local minimum) |
| Overfitting | Memorizing training data instead of learning |
| Layers | Input → Hidden → Output structure |
| Activation Functions | Add non-linearity (e.g., ReLU) |
| Hallucinations | LLMs generating plausible but false information |
| RAG | Grounding LLM outputs with retrieved evidence |
Enterprise Implications
Understanding these concepts helps you:
Evaluate ML/AI vendors critically
Identify risks in AI-powered systems
Communicate with technical teams
Make informed decisions about AI adoption
Ensure compliance with regulatory requirements
End-of-Module Resources
Hiker’s Cheat Sheet — Quick reference
Knowledge Checks — Test yourself