# Content

## Overview
This module introduces **Large Language Models (LLMs)** from an engineering and enterprise perspective.
It is **code-first**, grounded in Python, and builds directly on:

- **Module 1:** Python fundamentals (functions, JSON, notebooks)
- **Module 2:** Data work with Pandas and visualisation

You will learn how LLMs work, how to call them from Python, how they fail, and how to use them safely in regulated environments such as banking and financial services.

### Supported LLM access methods (choose one)
- **Local laptop LLM** ‚Äî run a lightweight model using **Ollama** on your PC.
- **Remote CodeVision LLM API** ‚Äî an Ollama-compatible `/api/generate` endpoint provided by the course admin.

Both options use the same request shape. Your code should work by changing **one** base URL.

## Learning objectives
By the end of this module, you will be able to:

1. Explain what an LLM is (and what it is not)
2. Explain tokens, context windows, and training vs inference
3. Call an LLM from Python via HTTP API (local or remote)
4. Control determinism using temperature
5. Force structured output (JSON) and validate it
6. Recognise hallucinations and common failure modes
7. Apply LLMs safely in a small data pipeline
8. Explain why LLMs alone are insufficient for enterprise use, and why grounding (RAG) helps (Module 5)

## Setup ‚Äî LLM Gateway Configuration

### Why Run Your Own LLM?

Before connecting to any API, we **strongly recommend** setting up a local LLM on your machine. Here's why:

üéì **Learning Value:**
- See exactly how LLM inference works ‚Äî no black box
- Understand latency, memory usage, and model loading firsthand
- Debug issues locally before blaming "the API"
- Build intuition about model sizes, speed, and quality trade-offs

üîí **Enterprise Mindset:**
- Data never leaves your machine ‚Äî critical for sensitive workloads
- No API keys to manage or rotate
- No rate limits or usage costs
- Full control over model versions and updates

üíº **Career Advantage:**
- "I've run LLMs locally" sets you apart in interviews
- Prepares you for on-premise deployments in regulated industries
- Understanding the full stack makes you a better engineer

---

### Option A: Local LLM (Recommended ‚Äî Try This First!)

Run **Ollama** on your laptop. It's surprisingly easy and works on Windows, Mac, and Linux.

üì∫ **Video Tutorial:** [Running Ollama Locally and Accessing It from Google Colab via Pinggy](https://youtu.be/8WKUWnpyxBQ) ‚Äî watch this walkthrough before starting!

**Setup steps:**
1. **Install Ollama:** Download from [ollama.ai](https://ollama.ai) (2-minute install)
2. **Pull a model:** Open terminal and run:
   ```bash
   ollama pull phi3:mini
   ```
3. **Start the server:**
   ```bash
   ollama serve
   ```
4. **Expose via tunnel** (for HTTPS access from Colab/remote notebooks):
   ```bash
   ssh -p 443 -R0:localhost:11434 a.pinggy.io
   ```
5. **Configure below:**
   ```python
   LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
   LLM_API_KEY = None
   ```

**Note:** If running Jupyter locally, you can skip the tunnel and use `http://localhost:11434` directly.

---

### Option B: Server-Side Gateway (Fallback)

If you cannot run Ollama locally (e.g., Chromebook, restricted laptop), use the course gateway.

**Setup:**
```python
LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "your-api-key-here"  # Provided by instructor
```

This option is convenient but you miss the learning experience of running your own model.

---

### Comparison

| Aspect | Local Ollama | Server Gateway |
|--------|--------------|----------------|
| **Learning value** | ‚≠ê‚≠ê‚≠ê High | ‚≠ê Low |
| **Setup effort** | 5-10 minutes | Instant |
| **Data privacy** | 100% local | Shared server |
| **Cost** | Free forever | API key required |
| **Offline use** | ‚úÖ Yes | ‚ùå No |
| **Speed** | Depends on your hardware | Consistent |

---

### Configuration Cell

**Set your URL and API key below. The code auto-detects the endpoint:**

In [None]:
# ===== LLM GATEWAY CONFIGURATION =====
# Try Option A first! Only use Option B if you can't run Ollama locally.

# ------ OPTION A: Local Ollama (Recommended) ------
# If running Jupyter locally, use localhost directly:
LLM_BASE_URL = "http://localhost:11434"
LLM_API_KEY = None  # No API key ‚Üí uses Ollama /api/chat endpoint

# If using Colab/remote notebook, use your pinggy tunnel URL:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None

# ------ OPTION B: Server Gateway (Fallback) ------
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"  # API key ‚Üí uses /chat/direct

# ------ Model configuration ------
DEFAULT_MODEL = "phi3:mini"      # Recommended for this module
# DEFAULT_MODEL = "llama3.2:1b"  # Alternative smaller model

## Canonical LLM Caller ‚Äî Single Source of Truth

All examples in this module use a single helper function: `call_llm()`. This function:

- **Auto-detects** the correct endpoint based on whether an API key is set
- No need to manually switch modes ‚Äî just set `LLM_BASE_URL` and `LLM_API_KEY`
- Returns the response text directly (not raw JSON)

| API Key | Endpoint Used | Use Case |
|---------|---------------|----------|
| Set | `/chat/direct` | Server-side gateway |
| `None` | `/api/chat` | Local Ollama (direct or via tunnel) |

**Important:** All examples must use this function. No direct `requests.post()` calls elsewhere.

In [None]:
import requests

def call_llm(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.0,
    max_tokens: int = 256,
    base_url: str = LLM_BASE_URL,
    api_key: str | None = None,
    timeout: tuple = (10, 120)
) -> str:
    """
    Canonical LLM call for Module 3.
    Auto-detects endpoint mode:
      - If API key is set ‚Üí JBChat gateway (/chat/direct)
      - If no API key ‚Üí Direct Ollama (/api/chat)
    """
    # Resolve API key
    if api_key is None:
        api_key = LLM_API_KEY if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else None

    # Auto-detect mode: API key present = jbchat, no API key = ollama
    use_jbchat = api_key is not None

    headers = {
        "Content-Type": "application/json",
        "ngrok-skip-browser-warning": "true",
        "Bypass-Tunnel-Reminder": "true",
    }
    
    if api_key:
        headers["X-API-Key"] = api_key

    if use_jbchat:
        # JBChat gateway /chat/direct endpoint
        endpoint = f"{base_url.rstrip('/')}/chat/direct"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": False
        }
    else:
        # Direct Ollama /api/chat endpoint
        endpoint = f"{base_url.rstrip('/')}/api/chat"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": temperature},
            "stream": False
        }

    resp = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
    resp.raise_for_status()
    data = resp.json()

    return data["message"]["content"]

# Smoke test
try:
    mode = "JBChat" if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else "Ollama"
    print(f"Mode: {mode} | URL: {LLM_BASE_URL}")
    out = call_llm("In one sentence, define inflation for a banking audience.", temperature=0.0)
    print(out[:400])
except Exception as e:
    print(f"Connection error: {e}")

# Section 3.1 ‚Äî What is a Large Language Model?

An LLM is best understood as a **next-token prediction engine**. It generates text that is statistically likely, not text that is guaranteed true.

**Enterprise mindset:** treat LLM output as **untrusted** unless validated.

In [None]:
prompt = "Complete: 'Interest rates are rising because'"
print(call_llm(prompt, temperature=0.7)[:300])

# Section 3.2 ‚Äî Tokens: How LLMs see text

LLMs operate on **tokens** (subword pieces), not words. Tokenisation affects context limits and truncation.

### Token Counting and Server-Side Processing

Key concepts for enterprise use:

- **Tokens are counted server-side** ‚Äî the LLM gateway tracks usage
- **`max_tokens` limits output**, not input ‚Äî you control response length
- **Long inputs increase:**
  - Latency (more to process)
  - Truncation risk (may hit context limit)
  - Timeout probability (especially with small models)
- **Small models exaggerate these effects** ‚Äî useful for learning, but plan for larger models in production

Practical implication: keep prompts concise and plan for chunking on long documents.

# Section 3.3 ‚Äî Training vs inference

- **Training**: offline learning of model parameters from huge datasets.
- **Inference**: runtime generation when you call the model endpoint.

This module focuses on inference.

In [None]:
resp = call_llm("Explain training vs inference in 2 bullet points.", temperature=0.0)
print(resp)

# Section 3.4 ‚Äî LLMs as services (APIs)

Treat the LLM like any other service: send JSON request, receive JSON response. This builds on your JSON and requests skills.

In [None]:
resp = call_llm("Say hello.")
print(f"Response type: {type(resp)}")
print(f"Response: {resp}")

# Section 3.5 ‚Äî Prompt structure: role, task, constraints

With single-prompt endpoints, simulate roles by placing behaviour rules first, task second, constraints last.

This reduces ambiguity and improves reliability.

In [None]:
system = "You are a cautious banking analyst. Do not speculate. If unsure, say 'Insufficient information'."
task = "Summarise for an executive: FX volatility increased due to rate differentials."
constraints = "Return exactly 2 bullet points. Max 20 words each."
prompt = f"SYSTEM:\n{system}\n\nTASK:\n{task}\n\nCONSTRAINTS:\n{constraints}"
print(call_llm(prompt, temperature=0.0))

# Section 3.6 ‚Äî Temperature and determinism

Temperature controls randomness. Low temperature (0.0‚Äì0.2) is preferred in regulated workflows for consistency.

In [None]:
prompt = "Explain what a context window is in 2 sentences."
low = call_llm(prompt, temperature=0.0)
high = call_llm(prompt, temperature=0.8)
print("Temp 0.0:\n", low)
print("\nTemp 0.8:\n", high)

# Section 3.7 ‚Äî Hallucinations (confident but wrong)

Hallucinations occur because the model optimises for **plausible text** rather than **verified truth**. The model will confidently generate answers even when the question refers to something that does not exist.

**Teaching goal:** *Hallucination is not random error ‚Äî it is plausible continuation beating truthful uncertainty.*

Never treat confident language as evidence. Always verify claims against trusted sources.

In [None]:
# Hallucination demonstration: asking about a paper that does not exist
prompt = """Explain the key ideas from the 2019 paper
"Temporal Diffusion Graph Transformers for Quantum Finance"
by Liu and Henderson, published at NeurIPS."""

response = call_llm(prompt, temperature=0.0, max_tokens=512)
print("LLM Response:")
print(response)

print("\n" + "="*60)
print("ANALYSIS: Why this is a hallucination")
print("="*60)
print("""
This paper DOES NOT EXIST. The model's response demonstrates classic hallucination patterns:

1. PAPER SUBSTITUTION: The model invents plausible-sounding content based on
   keywords (transformers, finance, quantum). It may cite real papers or
   concepts that are unrelated.

2. CONFIDENT SPECULATION: Watch for phrases like "likely", "would probably",
   "typically involves" ‚Äî these mask uncertainty as knowledge.

3. GENERIC ML BOILERPLATE: The response uses standard ML vocabulary
   (attention mechanisms, embeddings, architectures) that sounds authoritative
   but is not grounded in any real paper.

4. DOMAIN DRIFT: The model may conflate "quantum finance" (a real niche field)
   with generic finance ML, producing plausible but wrong explanations.

5. HEDGING ADMISSION: Sometimes the model adds "if such a paper exists" or
   similar ‚Äî but still provides fabricated details anyway.

KEY LESSON: Confident language ‚â† truthful content. Always verify against
authoritative sources (actual paper, official database, domain expert).
""")

# Section 3.11 ‚Äî Defensive parsing and validation

Models may return invalid JSON for several reasons:
- **Markdown wrapping** ‚Äî response wrapped in ` ```json ... ``` ` blocks
- **Trailing text** ‚Äî explanatory text after the JSON
- **Malformed structure** ‚Äî missing quotes, trailing commas, etc.
- **Empty response** ‚Äî timeout or model failure

The `safe_json_loads()` function below handles these cases:
1. Strips markdown code block wrappers
2. Attempts JSON parsing
3. Returns a tuple: `(success, result_or_error)`

**Enterprise pattern:** Always wrap JSON parsing in try/except and have a fallback strategy.

In [None]:
# Moderate example: a reasonably long input (not extreme)
# Note: We use a moderate size to avoid destabilising small models
policy_text = ("This is a paragraph from a banking policy document covering risk management. " * 50)
prompt = f"Summarise in 3 bullets:\n{policy_text}"

try:
    response = call_llm(prompt, temperature=0.0, max_tokens=256)
    print("Summary:")
    print(response[:600])
except Exception as e:
    print(f"Request failed (expected for very long inputs): {e}")
    print("In production, you would chunk the input or use a larger model.")

# Section 3.9 ‚Äî Prompt hygiene: common mistakes and fixes

Avoid vague asks, missing constraints, and multi-task prompts. Prefer clear audience, format, and uncertainty policy.

In [None]:
bad = "Tell me about interest rates."
good = "Explain interest rates to a new bank analyst in 3 bullets, <= 18 words each. No speculation."
print("BAD:\n", call_llm(bad, temperature=0.0))
print("\nGOOD:\n", call_llm(good, temperature=0.0))

# Section 3.10 ‚Äî Structured output: why JSON matters

JSON output enables deterministic parsing, validation, and automation. This builds directly on Module 1.

### Common Issue: Markdown-Wrapped JSON

LLMs often return JSON wrapped in markdown code blocks:

```
```json
{"key": "value"}
```
```

This causes `json.loads()` to fail! You must **strip the markdown wrapper** before parsing.

The `strip_markdown_json()` helper function below handles this:
1. Detects if the response starts with ` ```json ` or ` ``` `
2. Extracts the content between the backticks
3. Returns clean JSON ready for parsing

**Always use this pattern when parsing LLM JSON output.**

In [None]:
import json
import re

def strip_markdown_json(s: str) -> str:
    """
    Strip markdown code block wrappers from LLM JSON output.
    
    LLMs often return JSON wrapped in markdown:
        ```json
        {"key": "value"}
        ```
    
    This function extracts the raw JSON for parsing.
    """
    s = s.strip()
    # Pattern matches ```json or ``` at start, and ``` at end
    pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
    match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return s

prompt = (
"Return ONLY valid JSON with keys: summary (string), risks (array of exactly 3 strings). "
"No extra text. Use double quotes. "
"Text: Banks face credit risk, market risk, and operational risk."
)
raw = call_llm(prompt, temperature=0.0)
print("Raw response:")
print(raw)

# IMPORTANT: LLMs often wrap JSON in markdown code blocks - strip before parsing
cleaned = strip_markdown_json(raw)
print("\nCleaned for parsing:")
print(cleaned)

try:
    data = json.loads(cleaned)
    print("\nParsed successfully:")
    print(data)
except json.JSONDecodeError as e:
    print(f"\nInvalid JSON - do not proceed: {e}")
    print("In production, you would retry or fail gracefully here.")

# Section 3.11 ‚Äî Defensive parsing and validation

Models sometimes return invalid JSON. Handle this safely: parse, validate, retry or fail clearly.

In [None]:
import json
import re

def strip_markdown_json(s: str) -> str:
    """
    Strip markdown code block wrappers from LLM JSON output.
    
    LLMs often return JSON wrapped in markdown:
        ```json
        {"key": "value"}
        ```
    
    This function extracts the raw JSON for parsing.
    """
    s = s.strip()
    # Pattern matches ```json or ``` at start, and ``` at end
    pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
    match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return s

def safe_json_loads(s: str) -> tuple:
    """
    Attempt to parse JSON safely, handling markdown-wrapped responses.
    Returns (success, result_or_error).
    """
    # First, strip any markdown code block wrappers
    cleaned = strip_markdown_json(s)
    try:
        return True, json.loads(cleaned)
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

# Example: LLM may return JSON wrapped in markdown code blocks
raw = call_llm('Return JSON only: {"a": 1}', temperature=0.0)
print(f"Raw response: {raw!r}")

ok, parsed = safe_json_loads(raw)
print(f"Parse successful: {ok}")
if ok:
    print(f"Parsed data: {parsed}")
else:
    print(f"Error: {parsed}")

# Section 3.12 ‚Äî Text validators: length, bullets, vocabulary

Not all tasks need JSON. You can validate text using deterministic rules like bullet count and max length.

In [None]:
text = call_llm("Return exactly 3 bullet points about liquidity risk.", temperature=0.0)
bullets = [ln for ln in text.splitlines() if ln.strip().startswith(("-", "*"))]
print("Bullet count:", len(bullets))
print(text)

# Section 3.13 ‚Äî LLMs inside a Pandas pipeline

LLMs can augment data pipelines by generating summaries or tags. Start small and validate outputs.

In [None]:
import pandas as pd
df = pd.DataFrame({
    "id": [1,2,3],
    "text": [
        "Credit risk is the possibility of loss from borrower default.",
        "Market risk comes from adverse movements in interest rates and FX.",
        "Operational risk arises from process, people, or system failures."
    ]
})
def summarise_row(t: str) -> str:
    prompt = f"Summarise in 10 words or fewer: {t}"
    return call_llm(prompt, temperature=0.0).strip()
df["summary"] = df["text"].apply(summarise_row)
df

# Section 3.14 ‚Äî Cost/latency mindset: caching

LLM calls are slow compared to normal functions. Use caching for repeated prompts.

In [None]:
_cache = {}
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
    key = (prompt, temperature, DEFAULT_MODEL, LLM_BASE_URL)
    if key in _cache:
        return _cache[key]
    out = call_llm(prompt, temperature=temperature).strip()
    _cache[key] = out
    return out
p = "Summarise: Banks face credit and market risk."
print(cached_llm(p, 0.0))
print(cached_llm(p, 0.0))

# Section 3.15 ‚Äî Local vs remote endpoint trade-offs

Local: simple, private, predictable. Remote: centrally managed, potentially faster, requires network/access control. Your code should work for both by switching LLM_BASE_URL.

# Section 3.16 ‚Äî Enterprise constraints: auditability and compliance

Log prompts (or hashes), parameters, model, and output metadata for auditability. Avoid sending sensitive data to unapproved endpoints.

In [None]:
import hashlib, time
def audit_meta(prompt: str, response_text: str, model: str, temperature: float) -> dict:
    return {
        "ts": time.time(),
        "model": model,
        "temperature": temperature,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response_text.encode()).hexdigest(),
        "response_len": len(response_text),
    }
p = "Summarise operational risk in 12 words."
resp = call_llm(p, temperature=0.0)
meta = audit_meta(p, resp, DEFAULT_MODEL, 0.0)
meta

# Section 3.16b ‚Äî API Gateway Security: Preventing Misuse

When exposing LLM services via an API gateway (like ngrok), security is critical. Unsecured endpoints can be:

- **Abused for free compute** ‚Äî attackers use your LLM for their own purposes
- **Used for prompt injection attacks** ‚Äî malicious prompts extracting sensitive data
- **Overwhelmed by DoS** ‚Äî excessive requests crashing your service
- **Scraped for model outputs** ‚Äî competitors harvesting your model's responses

### Security Measures for LLM Gateways

| Layer | Measure | Purpose |
|-------|---------|---------|
| **Authentication** | API keys (`X-API-Key` header) | Identify and authorise callers |
| **Rate Limiting** | Requests per minute/hour per key | Prevent abuse and DoS |
| **Input Validation** | Max prompt length, blocked patterns | Prevent injection attacks |
| **Output Filtering** | Sanitise responses, remove PII | Prevent data leakage |
| **Logging & Monitoring** | Track usage per key, alert on anomalies | Detect and respond to abuse |
| **Token Quotas** | Max tokens per key per day | Control costs and fair usage |

### Implementation Approaches

1. **API Gateway Layer** (AWS API Gateway, Kong, nginx):
   - Rate limiting and throttling
   - API key validation
   - Request/response logging

2. **Application Layer**:
   - Input sanitisation before LLM call
   - Output filtering after LLM response
   - User-specific quotas in database

3. **Network Layer**:
   - IP allowlisting for internal use
   - TLS/HTTPS only (ngrok provides this)
   - VPN for sensitive deployments

### Enterprise Best Practice

In production, **never expose raw LLM endpoints publicly**. Always:

- Wrap in an authenticated API layer
- Log all requests with user identity
- Set hard limits on usage per user/key
- Monitor for anomalous patterns (unusual prompts, high volume)
- Have an incident response plan for detected abuse

# Section 3.17 ‚Äî Evaluation without using another LLM

Prefer deterministic checks: schema validation, key checks, length constraints, bullet counts. Avoid 'LLM judging LLM' as your only control.

# Section 3.18 ‚Äî Safety patterns: uncertainty and fallbacks

Include an uncertainty policy: if unsure, say 'Insufficient information'. Build fallbacks when validation fails.

In [None]:
system = "If you are unsure, respond exactly: Insufficient information. Do not guess."
task = "What is the exact USD/GBP rate at 09:31 UTC yesterday?"
prompt = f"{system}\n\n{task}"
print(call_llm(prompt, temperature=0.0))

# Section 3.19 ‚Äî Why LLMs alone are not enough

LLMs have hallucinations, context limits, and no grounding in your internal data by default. This motivates grounding and retrieval techniques.

# Section 3.20 ‚Äî Preparing for Module 5 (Grounding / RAG)

Mental model: LLM = language engine; RAG = evidence + memory. RAG reduces hallucinations by supplying trusted context.

## Practice exercises (ungraded)
1. Force JSON output for a classification task and parse it.
2. Demonstrate one hallucination and explain why it happened.
3. Enrich a small DataFrame with LLM-generated summaries and add a cache.
4. Add a validator enforcing: exactly 3 bullets and <= 20 words each.

## Module summary
- LLMs generate **probabilistic text**, not guaranteed truth.
- Treat outputs as **untrusted** unless validated.
- Use **low temperature** for consistency and auditability.
- Prefer **structured outputs (JSON)** for automation.
- Design for **failures, retries, and fallbacks**.