
Overview#

This module introduces Large Language Models (LLMs) from an engineering and enterprise perspective. It is code-first, grounded in Python, and builds directly on:

  • Module 1: Python fundamentals (functions, JSON, notebooks)

  • Module 2: Data work with Pandas and visualisation

You will learn how LLMs work, how to call them from Python, how they fail, and how to use them safely in regulated environments such as banking and financial services.

Supported LLM access methods (choose one)#

  • Local laptop LLM — run a lightweight model using Ollama on your PC.

  • Remote CodeVision LLM API — an Ollama-compatible chat endpoint provided by the course admin.

Both options accept the same chat-style request and are wrapped by a single helper function; your code switches between them by changing LLM_BASE_URL (and, for the remote gateway, setting LLM_API_KEY).

Learning objectives#

By the end of this module, you will be able to:

  1. Explain what an LLM is (and what it is not)

  2. Explain tokens, context windows, and training vs inference

  3. Call an LLM from Python via HTTP API (local or remote)

  4. Control determinism using temperature

  5. Force structured output (JSON) and validate it

  6. Recognise hallucinations and common failure modes

  7. Apply LLMs safely in a small data pipeline

  8. Explain why LLMs alone are insufficient for enterprise use, and why grounding (RAG) helps (Module 5)

Setup — LLM Gateway Configuration#

Why Run Your Own LLM?#

Before connecting to any API, we strongly recommend setting up a local LLM on your machine. Here’s why:

🎓 Learning Value:

  • See exactly how LLM inference works — no black box

  • Understand latency, memory usage, and model loading firsthand

  • Debug issues locally before blaming “the API”

  • Build intuition about model sizes, speed, and quality trade-offs

🔒 Enterprise Mindset:

  • Data never leaves your machine — critical for sensitive workloads

  • No API keys to manage or rotate

  • No rate limits or usage costs

  • Full control over model versions and updates

💼 Career Advantage:

  • “I’ve run LLMs locally” sets you apart in interviews

  • Prepares you for on-premise deployments in regulated industries

  • Understanding the full stack makes you a better engineer
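
If you do set up Ollama locally, it is worth confirming the server is reachable before running any cells in this module. A minimal check, assuming Ollama's default port (11434); the /api/tags endpoint lists the models you have pulled:

import requests

# Quick health check for a local Ollama server on the default port.
# /api/tags lists the models you have already pulled; an empty list means you
# still need to pull one (for example phi3:mini) before running this module.
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running. Installed models:", models or "none yet")
except requests.RequestException as e:
    print(f"Ollama does not appear to be running: {e}")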



Option B: Server-Side Gateway (Fallback)#

If you cannot run Ollama locally (e.g., Chromebook, restricted laptop), use the course gateway.

Setup:

LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "your-api-key-here"  # Provided by instructor

This option is convenient but you miss the learning experience of running your own model.


Comparison#

| Aspect | Local Ollama | Server Gateway |
|---|---|---|
| Learning value | ⭐⭐⭐ High | ⭐ Low |
| Setup effort | 5-10 minutes | Instant |
| Data privacy | 100% local | Shared server |
| Cost | Free forever | API key required |
| Offline use | ✅ Yes | ❌ No |
| Speed | Depends on your hardware | Consistent |


Configuration Cell#

Set your URL and API key below. The code auto-detects the endpoint:

# ===== LLM GATEWAY CONFIGURATION =====
# Try Option A first! Only use Option B if you can't run Ollama locally.

# ------ OPTION A: Local Ollama (Recommended) ------
# If running Jupyter locally, use localhost directly:
LLM_BASE_URL = "http://localhost:11434"
LLM_API_KEY = None  # No API key → uses Ollama /api/chat endpoint

# If using Colab/remote notebook, use your pinggy tunnel URL:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None

# ------ OPTION B: Server Gateway (Fallback) ------
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"  # API key → uses /chat/direct

# ------ Model configuration ------
DEFAULT_MODEL = "phi3:mini"      # Recommended for this module
# DEFAULT_MODEL = "llama3.2:1b"  # Alternative smaller model

Canonical LLM Caller — Single Source of Truth#

All examples in this module use a single helper function: call_llm(). This function:

  • Auto-detects the correct endpoint based on whether an API key is set

  • Requires no manual mode switching — just set LLM_BASE_URL and LLM_API_KEY

  • Returns the response text directly (not raw JSON)

| API Key | Endpoint Used | Use Case |
|---|---|---|
| Set | /chat/direct | Server-side gateway |
| None | /api/chat | Local Ollama (direct or via tunnel) |

Important: All examples must use this function. No direct requests.post() calls elsewhere.

import requests

def call_llm(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.0,
    max_tokens: int = 256,
    base_url: str = LLM_BASE_URL,
    api_key: str | None = None,
    timeout: tuple = (10, 120)
) -> str:
    """
    Canonical LLM call for Module 3.
    Auto-detects endpoint mode:
      - If API key is set → JBChat gateway (/chat/direct)
      - If no API key → Direct Ollama (/api/chat)
    """
    # Resolve API key
    if api_key is None:
        api_key = LLM_API_KEY if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else None

    # Auto-detect mode: API key present = jbchat, no API key = ollama
    use_jbchat = api_key is not None

    headers = {
        "Content-Type": "application/json",
        "ngrok-skip-browser-warning": "true",
        "Bypass-Tunnel-Reminder": "true",
    }
    
    if api_key:
        headers["X-API-Key"] = api_key

    if use_jbchat:
        # JBChat gateway /chat/direct endpoint
        endpoint = f"{base_url.rstrip('/')}/chat/direct"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": False
        }
    else:
        # Direct Ollama /api/chat endpoint
        endpoint = f"{base_url.rstrip('/')}/api/chat"
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": temperature},
            "stream": False
        }

    resp = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
    resp.raise_for_status()
    data = resp.json()

    return data["message"]["content"]

# Smoke test
try:
    mode = "JBChat" if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else "Ollama"
    print(f"Mode: {mode} | URL: {LLM_BASE_URL}")
    out = call_llm("In one sentence, define inflation for a banking audience.", temperature=0.0)
    print(out[:400])
except Exception as e:
    print(f"Connection error: {e}")

Section 3.1 — What is a Large Language Model?#

An LLM is best understood as a next-token prediction engine. It generates text that is statistically likely, not text that is guaranteed true.

Enterprise mindset: treat LLM output as untrusted unless validated.

prompt = "Complete: 'Interest rates are rising because'"
print(call_llm(prompt, temperature=0.7)[:300])

Section 3.2 — Tokens: How LLMs see text#

LLMs operate on tokens (subword pieces), not words. Tokenisation affects context limits and truncation.

Token Counting and Server-Side Processing#

Key concepts for enterprise use:

  • Tokens are counted server-side — the LLM gateway tracks usage

  • max_tokens limits output, not input — you control response length

  • Long inputs increase:

    • Latency (more to process)

    • Truncation risk (may hit context limit)

    • Timeout probability (especially with small models)

  • Small models exaggerate these effects — useful for learning, but plan for larger models in production

Practical implication: keep prompts concise and plan for chunking on long documents.
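
A rough rule of thumb of about 4 characters per token for English prose is enough to decide when an input needs chunking. A minimal sketch using that heuristic (the estimate_tokens() helper and the 2,000-token threshold below are illustrative, not model-specific):

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English prose).
    Real tokenisers vary by model; use this only to decide when to chunk."""
    return max(1, len(text) // 4)

sample_text = "This sentence is repeated to simulate a long policy document. " * 200
approx = estimate_tokens(sample_text)
print(f"Approximate input tokens: {approx}")

# Small local models may have context windows of only a few thousand tokens;
# leave headroom for the response (max_tokens) when deciding whether to chunk.
if approx > 2000:
    print("Input is large: plan to chunk it or summarise sections separately.")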

Section 3.3 — Training vs inference#

  • Training: offline learning of model parameters from huge datasets.

  • Inference: runtime generation when you call the model endpoint.

This module focuses on inference.

resp = call_llm("Explain training vs inference in 2 bullet points.", temperature=0.0)
print(resp)

Section 3.4 — LLMs as services (APIs)#

Treat the LLM like any other service: send JSON request, receive JSON response. This builds on your JSON and requests skills.

resp = call_llm("Say hello.")
print(f"Response type: {type(resp)}")
print(f"Response: {resp}")

Section 3.5 — Prompt structure: role, task, constraints#

With single-prompt endpoints, simulate roles by placing behaviour rules first, task second, constraints last.

This reduces ambiguity and improves reliability.

system = "You are a cautious banking analyst. Do not speculate. If unsure, say 'Insufficient information'."
task = "Summarise for an executive: FX volatility increased due to rate differentials."
constraints = "Return exactly 2 bullet points. Max 20 words each."
prompt = f"SYSTEM:\n{system}\n\nTASK:\n{task}\n\nCONSTRAINTS:\n{constraints}"
print(call_llm(prompt, temperature=0.0))

Section 3.6 — Temperature and determinism#

Temperature controls randomness. Low temperature (0.0–0.2) is preferred in regulated workflows for consistency.

prompt = "Explain what a context window is in 2 sentences."
low = call_llm(prompt, temperature=0.0)
high = call_llm(prompt, temperature=0.8)
print("Temp 0.0:\n", low)
print("\nTemp 0.8:\n", high)

Section 3.7 — Hallucinations (confident but wrong)#

Hallucinations occur because the model optimises for plausible text rather than verified truth. The model will confidently generate answers even when the question refers to something that does not exist.

Teaching goal: Hallucination is not random error — it is plausible continuation beating truthful uncertainty.

Never treat confident language as evidence. Always verify claims against trusted sources.

# Hallucination demonstration: asking about a paper that does not exist
prompt = """Explain the key ideas from the 2019 paper
"Temporal Diffusion Graph Transformers for Quantum Finance"
by Liu and Henderson, published at NeurIPS."""

response = call_llm(prompt, temperature=0.0, max_tokens=512)
print("LLM Response:")
print(response)

print("\n" + "="*60)
print("ANALYSIS: Why this is a hallucination")
print("="*60)
print("""
This paper DOES NOT EXIST. The model's response demonstrates classic hallucination patterns:

1. PAPER SUBSTITUTION: The model invents plausible-sounding content based on
   keywords (transformers, finance, quantum). It may cite real papers or
   concepts that are unrelated.

2. CONFIDENT SPECULATION: Watch for phrases like "likely", "would probably",
   "typically involves" — these mask uncertainty as knowledge.

3. GENERIC ML BOILERPLATE: The response uses standard ML vocabulary
   (attention mechanisms, embeddings, architectures) that sounds authoritative
   but is not grounded in any real paper.

4. DOMAIN DRIFT: The model may conflate "quantum finance" (a real niche field)
   with generic finance ML, producing plausible but wrong explanations.

5. HEDGING ADMISSION: Sometimes the model adds "if such a paper exists" or
   similar — but still provides fabricated details anyway.

KEY LESSON: Confident language ≠ truthful content. Always verify against
authoritative sources (actual paper, official database, domain expert).
""")

Section 3.8 — Long inputs and context limits#

Long inputs increase latency, truncation risk, and timeout probability (see Section 3.2), and small models exaggerate all three effects. The example below uses a moderately long input so the call stays stable; with very long documents you would chunk the input or move to a larger model.

# Moderate example: a reasonably long input (not extreme)
# Note: We use a moderate size to avoid destabilising small models
policy_text = ("This is a paragraph from a banking policy document covering risk management. " * 50)
prompt = f"Summarise in 3 bullets:\n{policy_text}"

try:
    response = call_llm(prompt, temperature=0.0, max_tokens=256)
    print("Summary:")
    print(response[:600])
except Exception as e:
    print(f"Request failed (expected for very long inputs): {e}")
    print("In production, you would chunk the input or use a larger model.")

Section 3.9 — Prompt hygiene: common mistakes and fixes#

Avoid vague asks, missing constraints, and multi-task prompts. Prefer clear audience, format, and uncertainty policy.

bad = "Tell me about interest rates."
good = "Explain interest rates to a new bank analyst in 3 bullets, <= 18 words each. No speculation."
print("BAD:\n", call_llm(bad, temperature=0.0))
print("\nGOOD:\n", call_llm(good, temperature=0.0))

Section 3.10 — Structured output: why JSON matters#

JSON output enables deterministic parsing, validation, and automation. This builds directly on Module 1.

Common Issue: Markdown-Wrapped JSON#

LLMs often return JSON wrapped in markdown code blocks:

```json
{"key": "value"}
```

This causes json.loads() to fail! You must strip the markdown wrapper before parsing.

The strip_markdown_json() helper function below handles this:

  1. Detects whether the response starts with ```json or ```

  2. Extracts the content between the backticks

  3. Returns clean JSON ready for parsing

Always use this pattern when parsing LLM JSON output.

import json
import re

def strip_markdown_json(s: str) -> str:
    """
    Strip markdown code block wrappers from LLM JSON output.
    
    LLMs often return JSON wrapped in markdown:
        ```json
        {"key": "value"}
        ```
    
    This function extracts the raw JSON for parsing.
    """
    s = s.strip()
    # Pattern matches ```json or ``` at start, and ``` at end
    pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
    match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return s

prompt = (
"Return ONLY valid JSON with keys: summary (string), risks (array of exactly 3 strings). "
"No extra text. Use double quotes. "
"Text: Banks face credit risk, market risk, and operational risk."
)
raw = call_llm(prompt, temperature=0.0)
print("Raw response:")
print(raw)

# IMPORTANT: LLMs often wrap JSON in markdown code blocks - strip before parsing
cleaned = strip_markdown_json(raw)
print("\nCleaned for parsing:")
print(cleaned)

try:
    data = json.loads(cleaned)
    print("\nParsed successfully:")
    print(data)
except json.JSONDecodeError as e:
    print(f"\nInvalid JSON - do not proceed: {e}")
    print("In production, you would retry or fail gracefully here.")

Section 3.11 — Defensive parsing and validation#

Models may return invalid JSON for several reasons:

  • Markdown wrapping — response wrapped in ```json ... ``` blocks

  • Trailing text — explanatory text after the JSON

  • Malformed structure — missing quotes, trailing commas, etc.

  • Empty response — timeout or model failure

Handle this safely: parse, validate, then retry or fail clearly. The safe_json_loads() function below handles these cases:

  1. Strips markdown code block wrappers

  2. Attempts JSON parsing

  3. Returns a tuple: (success, result_or_error)

Enterprise pattern: Always wrap JSON parsing in try/except and have a fallback strategy.

import json
import re

def strip_markdown_json(s: str) -> str:
    """
    Strip markdown code block wrappers from LLM JSON output.
    
    LLMs often return JSON wrapped in markdown:
        ```json
        {"key": "value"}
        ```
    
    This function extracts the raw JSON for parsing.
    """
    s = s.strip()
    # Pattern matches ```json or ``` at start, and ``` at end
    pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
    match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return s

def safe_json_loads(s: str) -> tuple:
    """
    Attempt to parse JSON safely, handling markdown-wrapped responses.
    Returns (success, result_or_error).
    """
    # First, strip any markdown code block wrappers
    cleaned = strip_markdown_json(s)
    try:
        return True, json.loads(cleaned)
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

# Example: LLM may return JSON wrapped in markdown code blocks
raw = call_llm('Return JSON only: {"a": 1}', temperature=0.0)
print(f"Raw response: {raw!r}")

ok, parsed = safe_json_loads(raw)
print(f"Parse successful: {ok}")
if ok:
    print(f"Parsed data: {parsed}")
else:
    print(f"Error: {parsed}")

Section 3.12 — Text validators: length, bullets, vocabulary#

Not all tasks need JSON. You can validate text using deterministic rules like bullet count and max length.

text = call_llm("Return exactly 3 bullet points about liquidity risk.", temperature=0.0)
bullets = [ln for ln in text.splitlines() if ln.strip().startswith(("-", "*"))]
print("Bullet count:", len(bullets))
print(text)
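
The same idea extends to word limits per bullet (this is essentially practice exercise 4 below). A sketch of a combined validator; the validate_bullets() helper is illustrative and assumes the "-" or "*" bullet markers checked above:

def validate_bullets(text: str, expected_bullets: int = 3, max_words: int = 20) -> tuple:
    """Deterministic text validator: exact bullet count plus a per-bullet word limit.
    Returns (passed, reason)."""
    found = [ln.strip() for ln in text.splitlines() if ln.strip().startswith(("-", "*"))]
    if len(found) != expected_bullets:
        return False, f"Expected {expected_bullets} bullets, got {len(found)}"
    for b in found:
        words = b.lstrip("-* ").split()
        if len(words) > max_words:
            return False, f"Bullet exceeds {max_words} words: {b!r}"
    return True, "OK"

ok, reason = validate_bullets(text)
print(ok, "-", reason)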

Section 3.13 — LLMs inside a Pandas pipeline#

LLMs can augment data pipelines by generating summaries or tags. Start small and validate outputs.

import pandas as pd
df = pd.DataFrame({
    "id": [1,2,3],
    "text": [
        "Credit risk is the possibility of loss from borrower default.",
        "Market risk comes from adverse movements in interest rates and FX.",
        "Operational risk arises from process, people, or system failures."
    ]
})
def summarise_row(t: str) -> str:
    prompt = f"Summarise in 10 words or fewer: {t}"
    return call_llm(prompt, temperature=0.0).strip()
df["summary"] = df["text"].apply(summarise_row)
df

Section 3.14 — Cost/latency mindset: caching#

LLM calls are slow compared to normal functions. Use caching for repeated prompts.

_cache = {}
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
    key = (prompt, temperature, DEFAULT_MODEL, LLM_BASE_URL)
    if key in _cache:
        return _cache[key]
    out = call_llm(prompt, temperature=temperature).strip()
    _cache[key] = out
    return out
p = "Summarise: Banks face credit and market risk."
print(cached_llm(p, 0.0))
print(cached_llm(p, 0.0))

Section 3.15 — Local vs remote endpoint trade-offs#

Local: simple, private, predictable. Remote: centrally managed, potentially faster, requires network/access control. Your code should work for both by switching LLM_BASE_URL.

Section 3.16 — Enterprise constraints: auditability and compliance#

Log prompts (or hashes), parameters, model, and output metadata for auditability. Avoid sending sensitive data to unapproved endpoints.

import hashlib, time
def audit_meta(prompt: str, response_text: str, model: str, temperature: float) -> dict:
    return {
        "ts": time.time(),
        "model": model,
        "temperature": temperature,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response_text.encode()).hexdigest(),
        "response_len": len(response_text),
    }
p = "Summarise operational risk in 12 words."
resp = call_llm(p, temperature=0.0)
meta = audit_meta(p, resp, DEFAULT_MODEL, 0.0)
meta

Section 3.16b — API Gateway Security: Preventing Misuse#

When exposing LLM services via an API gateway (like ngrok), security is critical. Unsecured endpoints can be:

  • Abused for free compute — attackers use your LLM for their own purposes

  • Used for prompt injection attacks — malicious prompts extracting sensitive data

  • Overwhelmed by DoS — excessive requests crashing your service

  • Scraped for model outputs — competitors harvesting your model’s responses

Security Measures for LLM Gateways#

| Layer | Measure | Purpose |
|---|---|---|
| Authentication | API keys (X-API-Key header) | Identify and authorise callers |
| Rate Limiting | Requests per minute/hour per key | Prevent abuse and DoS |
| Input Validation | Max prompt length, blocked patterns | Prevent injection attacks |
| Output Filtering | Sanitise responses, remove PII | Prevent data leakage |
| Logging & Monitoring | Track usage per key, alert on anomalies | Detect and respond to abuse |
| Token Quotas | Max tokens per key per day | Control costs and fair usage |

Implementation Approaches#

  1. API Gateway Layer (AWS API Gateway, Kong, nginx):

    • Rate limiting and throttling

    • API key validation

    • Request/response logging

  2. Application Layer (see the sketch after this list):

    • Input sanitisation before LLM call

    • Output filtering after LLM response

    • User-specific quotas in database

  3. Network Layer:

    • IP allowlisting for internal use

    • TLS/HTTPS only (ngrok provides this)

    • VPN for sensitive deployments
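
The application-layer measures above can be sketched as a thin guard in front of call_llm(). The limits, in-memory counter, and quota numbers below are illustrative placeholders, not the course gateway's actual implementation:

from collections import defaultdict

# Illustrative application-layer guard in front of call_llm().
# MAX_PROMPT_CHARS, DAILY_REQUEST_QUOTA and the in-memory counter are
# placeholders; a real gateway would enforce these per authenticated API key
# and persist usage in a database.
MAX_PROMPT_CHARS = 4000
DAILY_REQUEST_QUOTA = 200
_usage = defaultdict(int)  # api_key -> number of requests today

def guarded_llm_call(api_key: str, prompt: str, **kwargs) -> str:
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt too long: rejected before reaching the model")
    if _usage[api_key] >= DAILY_REQUEST_QUOTA:
        raise PermissionError("Daily request quota exceeded for this API key")
    _usage[api_key] += 1
    return call_llm(prompt, **kwargs)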

Enterprise Best Practice#

In production, never expose raw LLM endpoints publicly. Always:

  • Wrap in an authenticated API layer

  • Log all requests with user identity

  • Set hard limits on usage per user/key

  • Monitor for anomalous patterns (unusual prompts, high volume)

  • Have an incident response plan for detected abuse

Section 3.17 — Evaluation without using another LLM#

Prefer deterministic checks: schema validation, key checks, length constraints, bullet counts. Avoid ‘LLM judging LLM’ as your only control.
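
As a concrete example, the JSON task from Section 3.10 can be checked entirely with deterministic rules (required keys, types, and lengths), with no second model involved. A sketch; the check_summary_schema() helper is illustrative:

def check_summary_schema(data: dict) -> list:
    """Deterministic checks for the Section 3.10 schema: keys, types, lengths."""
    problems = []
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary must be a non-empty string")
    risks = data.get("risks")
    if not isinstance(risks, list) or len(risks) != 3:
        problems.append("risks must be a list of exactly 3 strings")
    elif not all(isinstance(r, str) and r.strip() for r in risks):
        problems.append("every risk must be a non-empty string")
    return problems

issues = check_summary_schema(
    {"summary": "Banks face several major risk types.", "risks": ["credit", "market", "operational"]}
)
print("PASS" if not issues else issues)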

Section 3.18 — Safety patterns: uncertainty and fallbacks#

Include an uncertainty policy: if unsure, say ‘Insufficient information’. Build fallbacks when validation fails.

system = "If you are unsure, respond exactly: Insufficient information. Do not guess."
task = "What is the exact USD/GBP rate at 09:31 UTC yesterday?"
prompt = f"{system}\n\n{task}"
print(call_llm(prompt, temperature=0.0))

Section 3.19 — Why LLMs alone are not enough#

LLMs have hallucinations, context limits, and no grounding in your internal data by default. This motivates grounding and retrieval techniques.

Section 3.20 — Preparing for Module 5 (Grounding / RAG)#

Mental model: LLM = language engine; RAG = evidence + memory. RAG reduces hallucinations by supplying trusted context.

Practice exercises (ungraded)#

  1. Force JSON output for a classification task and parse it.

  2. Demonstrate one hallucination and explain why it happened.

  3. Enrich a small DataFrame with LLM-generated summaries and add a cache.

  4. Add a validator enforcing: exactly 3 bullets and <= 20 words each.

Module summary#

  • LLMs generate probabilistic text, not guaranteed truth.

  • Treat outputs as untrusted unless validated.

  • Use low temperature for consistency and auditability.

  • Prefer structured outputs (JSON) for automation.

  • Design for failures, retries, and fallbacks.