Content#
Overview#
This module introduces Large Language Models (LLMs) from an engineering and enterprise perspective. It is code-first, grounded in Python, and builds directly on:
Module 1: Python fundamentals (functions, JSON, notebooks)
Module 2: Data work with Pandas and visualisation
You will learn how LLMs work, how to call them from Python, how they fail, and how to use them safely in regulated environments such as banking and financial services.
Supported LLM access methods (choose one)#
Local laptop LLM — run a lightweight model using Ollama on your PC.
Remote CodeVision LLM API — an Ollama-compatible /api/generate endpoint provided by the course admin.
Both options use the same request shape. Your code should work by changing one base URL.
Learning objectives#
By the end of this module, you will be able to:
Explain what an LLM is (and what it is not)
Explain tokens, context windows, and training vs inference
Call an LLM from Python via HTTP API (local or remote)
Control determinism using temperature
Force structured output (JSON) and validate it
Recognise hallucinations and common failure modes
Apply LLMs safely in a small data pipeline
Explain why LLMs alone are insufficient for enterprise use, and why grounding (RAG) helps (Module 5)
Setup — LLM Gateway Configuration#
Why Run Your Own LLM?#
Before connecting to any API, we strongly recommend setting up a local LLM on your machine. Here’s why:
🎓 Learning Value:
See exactly how LLM inference works — no black box
Understand latency, memory usage, and model loading firsthand
Debug issues locally before blaming “the API”
Build intuition about model sizes, speed, and quality trade-offs
🔒 Enterprise Mindset:
Data never leaves your machine — critical for sensitive workloads
No API keys to manage or rotate
No rate limits or usage costs
Full control over model versions and updates
💼 Career Advantage:
“I’ve run LLMs locally” sets you apart in interviews
Prepares you for on-premise deployments in regulated industries
Understanding the full stack makes you a better engineer
Option A: Local LLM (Recommended — Try This First!)#
Run Ollama on your laptop. It’s surprisingly easy and works on Windows, Mac, and Linux.
📺 Video Tutorial: Running Ollama Locally and Accessing It from Google Colab via Pinggy — watch this walkthrough before starting!
Setup steps:
Install Ollama: Download from ollama.ai (2-minute install)
Pull a model: Open terminal and run:
ollama pull phi3:mini
Start the server:
ollama serve
Expose via tunnel (for HTTPS access from Colab/remote notebooks):
ssh -p 443 -R0:localhost:11434 a.pinggy.io
Configure below:
LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io" LLM_API_KEY = None
Note: If running Jupyter locally, you can skip the tunnel and use http://localhost:11434 directly.
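Before going further, it can help to confirm the server is actually reachable. Ollama exposes a model-listing endpoint at /api/tags, so a quick check from Python (a sketch; adjust the URL if you are using a tunnel) looks like this:
import requests

# Quick reachability check against a local Ollama server (use your tunnel URL if remote)
try:
    r = requests.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    models = [m["name"] for m in r.json().get("models", [])]
    print("Ollama is reachable. Installed models:", models)
except Exception as e:
    print(f"Ollama not reachable yet: {e}")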
Option B: Server-Side Gateway (Fallback)#
If you cannot run Ollama locally (e.g., Chromebook, restricted laptop), use the course gateway.
Setup:
LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
LLM_API_KEY = "your-api-key-here" # Provided by instructor
This option is convenient but you miss the learning experience of running your own model.
Comparison#
| Aspect | Local Ollama | Server Gateway |
|---|---|---|
| Learning value | ⭐⭐⭐ High | ⭐ Low |
| Setup effort | 5-10 minutes | Instant |
| Data privacy | 100% local | Shared server |
| Cost | Free forever | API key required |
| Offline use | ✅ Yes | ❌ No |
| Speed | Depends on your hardware | Consistent |
Configuration Cell#
Set your URL and API key below. The code auto-detects the endpoint:
# ===== LLM GATEWAY CONFIGURATION =====
# Try Option A first! Only use Option B if you can't run Ollama locally.
# ------ OPTION A: Local Ollama (Recommended) ------
# If running Jupyter locally, use localhost directly:
LLM_BASE_URL = "http://localhost:11434"
LLM_API_KEY = None # No API key → uses Ollama /api/chat endpoint
# If using Colab/remote notebook, use your pinggy tunnel URL:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
# LLM_API_KEY = None
# ------ OPTION B: Server Gateway (Fallback) ------
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>" # API key → uses /chat/direct
# ------ Model configuration ------
DEFAULT_MODEL = "phi3:mini" # Recommended for this module
# DEFAULT_MODEL = "llama3.2:1b" # Alternative smaller model
Canonical LLM Caller — Single Source of Truth#
All examples in this module use a single helper function: call_llm(). This function:
Auto-detects the correct endpoint based on whether an API key is set
No need to manually switch modes — just set LLM_BASE_URL and LLM_API_KEY
Returns the response text directly (not raw JSON)
| API Key | Endpoint Used | Use Case |
|---|---|---|
| Set | /chat/direct | Server-side gateway (JBChat) |
| Not set | /api/chat | Local Ollama (direct or via tunnel) |
Important: All examples must use this function. No direct requests.post() calls elsewhere.
import requests
def call_llm(
prompt: str,
model: str = DEFAULT_MODEL,
temperature: float = 0.0,
max_tokens: int = 256,
base_url: str = LLM_BASE_URL,
api_key: str | None = None,
timeout: tuple = (10, 120)
) -> str:
"""
Canonical LLM call for Module 3.
Auto-detects endpoint mode:
- If API key is set → JBChat gateway (/chat/direct)
- If no API key → Direct Ollama (/api/chat)
"""
# Resolve API key
if api_key is None:
api_key = LLM_API_KEY if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else None
# Auto-detect mode: API key present = jbchat, no API key = ollama
use_jbchat = api_key is not None
headers = {
"Content-Type": "application/json",
"ngrok-skip-browser-warning": "true",
"Bypass-Tunnel-Reminder": "true",
}
if api_key:
headers["X-API-Key"] = api_key
if use_jbchat:
# JBChat gateway /chat/direct endpoint
endpoint = f"{base_url.rstrip('/')}/chat/direct"
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"max_tokens": max_tokens,
"stream": False
}
else:
# Direct Ollama /api/chat endpoint
endpoint = f"{base_url.rstrip('/')}/api/chat"
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"options": {"temperature": temperature},
"stream": False
}
resp = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
resp.raise_for_status()
data = resp.json()
return data["message"]["content"]
# Smoke test
try:
mode = "JBChat" if (LLM_API_KEY and LLM_API_KEY != "<provided-by-instructor>") else "Ollama"
print(f"Mode: {mode} | URL: {LLM_BASE_URL}")
out = call_llm("In one sentence, define inflation for a banking audience.", temperature=0.0)
print(out[:400])
except Exception as e:
print(f"Connection error: {e}")
Mode: Ollama | URL: http://localhost:11434
Connection error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f473adabaf0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Section 3.1 — What is a Large Language Model?#
An LLM is best understood as a next-token prediction engine. It generates text that is statistically likely, not text that is guaranteed true.
Enterprise mindset: treat LLM output as untrusted unless validated.
prompt = "Complete: 'Interest rates are rising because'"
print(call_llm(prompt, temperature=0.7)[:300])
ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat (Connection refused)
Section 3.2 — Tokens: How LLMs see text#
LLMs operate on tokens (subword pieces), not words. Tokenisation affects context limits and truncation.
Token Counting and Server-Side Processing#
Key concepts for enterprise use:
Tokens are counted server-side — the LLM gateway tracks usage
max_tokens limits output, not input — you control response length
Long inputs increase:
Latency (more to process)
Truncation risk (may hit context limit)
Timeout probability (especially with small models)
Small models exaggerate these effects — useful for learning, but plan for larger models in production
Practical implication: keep prompts concise and plan for chunking on long documents.
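To build intuition for prompt size before sending anything, you can use the rough rule of thumb of about four characters per token for English text. The sketch below uses that approximation only; it is not the model's real tokeniser.
# Rough token estimate: ~4 characters per token (approximation, not the model's tokeniser)
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

short_prompt = "Summarise credit risk in one sentence."
long_prompt = "This paragraph is from a long banking policy document. " * 200

for name, p in [("short", short_prompt), ("long", long_prompt)]:
    print(f"{name}: ~{estimate_tokens(p)} tokens ({len(p)} characters)")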
Section 3.3 — Training vs inference#
Training: offline learning of model parameters from huge datasets.
Inference: runtime generation when you call the model endpoint.
This module focuses on inference.
resp = call_llm("Explain training vs inference in 2 bullet points.", temperature=0.0)
print(resp)
Section 3.4 — LLMs as services (APIs)#
Treat the LLM like any other service: send JSON request, receive JSON response. This builds on your JSON and requests skills.
resp = call_llm("Say hello.")
print(f"Response type: {type(resp)}")
print(f"Response: {resp}")
Section 3.5 — Prompt structure: role, task, constraints#
With single-prompt endpoints, simulate roles by placing behaviour rules first, task second, constraints last.
This reduces ambiguity and improves reliability.
system = "You are a cautious banking analyst. Do not speculate. If unsure, say 'Insufficient information'."
task = "Summarise for an executive: FX volatility increased due to rate differentials."
constraints = "Return exactly 2 bullet points. Max 20 words each."
prompt = f"SYSTEM:\n{system}\n\nTASK:\n{task}\n\nCONSTRAINTS:\n{constraints}"
print(call_llm(prompt, temperature=0.0))
Section 3.6 — Temperature and determinism#
Temperature controls randomness. Low temperature (0.0–0.2) is preferred in regulated workflows for consistency.
prompt = "Explain what a context window is in 2 sentences."
low = call_llm(prompt, temperature=0.0)
high = call_llm(prompt, temperature=0.8)
print("Temp 0.0:\n", low)
print("\nTemp 0.8:\n", high)
Section 3.7 — Hallucinations (confident but wrong)#
Hallucinations occur because the model optimises for plausible text rather than verified truth. The model will confidently generate answers even when the question refers to something that does not exist.
Teaching goal: Hallucination is not random error — it is plausible continuation beating truthful uncertainty.
Never treat confident language as evidence. Always verify claims against trusted sources.
# Hallucination demonstration: asking about a paper that does not exist
prompt = """Explain the key ideas from the 2019 paper
"Temporal Diffusion Graph Transformers for Quantum Finance"
by Liu and Henderson, published at NeurIPS."""
response = call_llm(prompt, temperature=0.0, max_tokens=512)
print("LLM Response:")
print(response)
print("\n" + "="*60)
print("ANALYSIS: Why this is a hallucination")
print("="*60)
print("""
This paper DOES NOT EXIST. The model's response demonstrates classic hallucination patterns:
1. PAPER SUBSTITUTION: The model invents plausible-sounding content based on
keywords (transformers, finance, quantum). It may cite real papers or
concepts that are unrelated.
2. CONFIDENT SPECULATION: Watch for phrases like "likely", "would probably",
"typically involves" — these mask uncertainty as knowledge.
3. GENERIC ML BOILERPLATE: The response uses standard ML vocabulary
(attention mechanisms, embeddings, architectures) that sounds authoritative
but is not grounded in any real paper.
4. DOMAIN DRIFT: The model may conflate "quantum finance" (a real niche field)
with generic finance ML, producing plausible but wrong explanations.
5. HEDGING ADMISSION: Sometimes the model adds "if such a paper exists" or
similar — but still provides fabricated details anyway.
KEY LESSON: Confident language ≠ truthful content. Always verify against
authoritative sources (actual paper, official database, domain expert).
""")
Section 3.8 — Long inputs and context limits#
As discussed in Section 3.2, long prompts increase latency, truncation risk, and the chance of timeouts, especially with small models. Keep prompts concise, chunk long documents, or move to a larger model in production. The example below uses a moderately long input so it does not destabilise a small local model.
# Moderate example: a reasonably long input (not extreme)
# Note: We use a moderate size to avoid destabilising small models
policy_text = ("This is a paragraph from a banking policy document covering risk management. " * 50)
prompt = f"Summarise in 3 bullets:\n{policy_text}"
try:
response = call_llm(prompt, temperature=0.0, max_tokens=256)
print("Summary:")
print(response[:600])
except Exception as e:
print(f"Request failed (expected for very long inputs): {e}")
print("In production, you would chunk the input or use a larger model.")
Section 3.9 — Prompt hygiene: common mistakes and fixes#
Avoid vague asks, missing constraints, and multi-task prompts. Prefer clear audience, format, and uncertainty policy.
bad = "Tell me about interest rates."
good = "Explain interest rates to a new bank analyst in 3 bullets, <= 18 words each. No speculation."
print("BAD:\n", call_llm(bad, temperature=0.0))
print("\nGOOD:\n", call_llm(good, temperature=0.0))
Section 3.10 — Structured output: why JSON matters#
JSON output enables deterministic parsing, validation, and automation. This builds directly on Module 1.
Common Issue: Markdown-Wrapped JSON#
LLMs often return JSON wrapped in markdown code blocks:
```json
{"key": "value"}
```
This causes `json.loads()` to fail! You must **strip the markdown wrapper** before parsing.
The `strip_markdown_json()` helper function below handles this:
1. Detects if the response starts with ` ```json ` or ` ``` `
2. Extracts the content between the backticks
3. Returns clean JSON ready for parsing
**Always use this pattern when parsing LLM JSON output.**
import json
import re
def strip_markdown_json(s: str) -> str:
"""
Strip markdown code block wrappers from LLM JSON output.
LLMs often return JSON wrapped in markdown:
```json
{"key": "value"}
```
This function extracts the raw JSON for parsing.
"""
s = s.strip()
# Pattern matches ```json or ``` at start, and ``` at end
pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
if match:
return match.group(1).strip()
return s
prompt = (
"Return ONLY valid JSON with keys: summary (string), risks (array of exactly 3 strings). "
"No extra text. Use double quotes. "
"Text: Banks face credit risk, market risk, and operational risk."
)
raw = call_llm(prompt, temperature=0.0)
print("Raw response:")
print(raw)
# IMPORTANT: LLMs often wrap JSON in markdown code blocks - strip before parsing
cleaned = strip_markdown_json(raw)
print("\nCleaned for parsing:")
print(cleaned)
try:
data = json.loads(cleaned)
print("\nParsed successfully:")
print(data)
except json.JSONDecodeError as e:
print(f"\nInvalid JSON - do not proceed: {e}")
print("In production, you would retry or fail gracefully here.")
Section 3.11 — Defensive parsing and validation#
Models may return invalid JSON for several reasons:
Markdown wrapping — response wrapped in ```json ... ``` blocks
Trailing text — explanatory text after the JSON
Malformed structure — missing quotes, trailing commas, etc.
Empty response — timeout or model failure
The safe_json_loads() function below handles these cases: it strips markdown code block wrappers, attempts JSON parsing, and returns a tuple (success, result_or_error).
Enterprise pattern: always wrap JSON parsing in try/except and have a fallback strategy; parse, validate, then retry or fail clearly.
import json
import re
def strip_markdown_json(s: str) -> str:
"""
Strip markdown code block wrappers from LLM JSON output.
LLMs often return JSON wrapped in markdown:
```json
{"key": "value"}
```
This function extracts the raw JSON for parsing.
"""
s = s.strip()
# Pattern matches ```json or ``` at start, and ``` at end
pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
match = re.match(pattern, s, re.DOTALL | re.IGNORECASE)
if match:
return match.group(1).strip()
return s
def safe_json_loads(s: str) -> tuple:
"""
Attempt to parse JSON safely, handling markdown-wrapped responses.
Returns (success, result_or_error).
"""
# First, strip any markdown code block wrappers
cleaned = strip_markdown_json(s)
try:
return True, json.loads(cleaned)
except Exception as e:
return False, f"{type(e).__name__}: {e}"
# Example: LLM may return JSON wrapped in markdown code blocks
raw = call_llm('Return JSON only: {"a": 1}', temperature=0.0)
print(f"Raw response: {raw!r}")
ok, parsed = safe_json_loads(raw)
print(f"Parse successful: {ok}")
if ok:
print(f"Parsed data: {parsed}")
else:
print(f"Error: {parsed}")
Section 3.12 — Text validators: length, bullets, vocabulary#
Not all tasks need JSON. You can validate text using deterministic rules like bullet count and max length.
text = call_llm("Return exactly 3 bullet points about liquidity risk.", temperature=0.0)
bullets = [ln for ln in text.splitlines() if ln.strip().startswith(("-", "*"))]
print("Bullet count:", len(bullets))
print(text)
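The same deterministic approach extends to length rules. As an illustrative sketch (this helper is not part of the original cell), you could also flag any bullet that exceeds a word limit:
# Illustrative validator: flag bullets longer than a word limit (reuses `text` from above)
def bullets_within_limit(text: str, max_words: int = 20) -> bool:
    bullets = [ln.strip() for ln in text.splitlines() if ln.strip().startswith(("-", "*"))]
    return all(len(b.lstrip("-* ").split()) <= max_words for b in bullets)

print("All bullets within 20 words:", bullets_within_limit(text))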
Section 3.13 — LLMs inside a Pandas pipeline#
LLMs can augment data pipelines by generating summaries or tags. Start small and validate outputs.
import pandas as pd
df = pd.DataFrame({
"id": [1,2,3],
"text": [
"Credit risk is the possibility of loss from borrower default.",
"Market risk comes from adverse movements in interest rates and FX.",
"Operational risk arises from process, people, or system failures."
]
})
def summarise_row(t: str) -> str:
prompt = f"Summarise in 10 words or fewer: {t}"
return call_llm(prompt, temperature=0.0).strip()
df["summary"] = df["text"].apply(summarise_row)
df
Section 3.14 — Cost/latency mindset: caching#
LLM calls are slow compared to normal functions. Use caching for repeated prompts.
_cache = {}
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
key = (prompt, temperature, DEFAULT_MODEL, LLM_BASE_URL)
if key in _cache:
return _cache[key]
out = call_llm(prompt, temperature=temperature).strip()
_cache[key] = out
return out
p = "Summarise: Banks face credit and market risk."
print(cached_llm(p, 0.0))
print(cached_llm(p, 0.0))
Section 3.15 — Local vs remote endpoint trade-offs#
Local: simple, private, predictable. Remote: centrally managed, potentially faster, requires network/access control. Your code should work for both by switching LLM_BASE_URL.
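To make the trade-off concrete, you can time the same small request against whichever endpoint is configured. The sketch below assumes call_llm and a reachable server; the measured number will vary with your hardware and network.
import time

# Measure end-to-end latency for one small request against the configured endpoint
start = time.perf_counter()
_ = call_llm("Reply with the single word: ok", temperature=0.0, max_tokens=5)
elapsed = time.perf_counter() - start
print(f"Endpoint: {LLM_BASE_URL} | round-trip latency: {elapsed:.2f}s")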
Section 3.16 — Enterprise constraints: auditability and compliance#
Log prompts (or hashes), parameters, model, and output metadata for auditability. Avoid sending sensitive data to unapproved endpoints.
import hashlib, time
def audit_meta(prompt: str, response_text: str, model: str, temperature: float) -> dict:
return {
"ts": time.time(),
"model": model,
"temperature": temperature,
"prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
"response_sha256": hashlib.sha256(response_text.encode()).hexdigest(),
"response_len": len(response_text),
}
p = "Summarise operational risk in 12 words."
resp = call_llm(p, temperature=0.0)
meta = audit_meta(p, resp, DEFAULT_MODEL, 0.0)
meta
Section 3.16b — API Gateway Security: Preventing Misuse#
When exposing LLM services via an API gateway (like ngrok), security is critical. Unsecured endpoints can be:
Abused for free compute — attackers use your LLM for their own purposes
Used for prompt injection attacks — malicious prompts extracting sensitive data
Overwhelmed by DoS — excessive requests crashing your service
Scraped for model outputs — competitors harvesting your model’s responses
Security Measures for LLM Gateways#
| Layer | Measure | Purpose |
|---|---|---|
| Authentication | API keys (X-API-Key header) | Identify and authorise callers |
| Rate Limiting | Requests per minute/hour per key | Prevent abuse and DoS |
| Input Validation | Max prompt length, blocked patterns | Prevent injection attacks |
| Output Filtering | Sanitise responses, remove PII | Prevent data leakage |
| Logging & Monitoring | Track usage per key, alert on anomalies | Detect and respond to abuse |
| Token Quotas | Max tokens per key per day | Control costs and fair usage |
Implementation Approaches#
API Gateway Layer (AWS API Gateway, Kong, nginx):
Rate limiting and throttling
API key validation
Request/response logging
Application Layer:
Input sanitisation before LLM call
Output filtering after LLM response
User-specific quotas in database
Network Layer:
IP allowlisting for internal use
TLS/HTTPS only (ngrok provides this)
VPN for sensitive deployments
Enterprise Best Practice#
In production, never expose raw LLM endpoints publicly. Always:
Wrap in an authenticated API layer
Log all requests with user identity
Set hard limits on usage per user/key
Monitor for anomalous patterns (unusual prompts, high volume)
Have an incident response plan for detected abuse
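As a minimal, purely illustrative sketch of the application-layer measures above (the limit values, key id, and helper name are invented for teaching, not a production design), a guarded wrapper around call_llm might look like this:
# Illustrative guardrails only: real deployments enforce these at the gateway and in a database
MAX_PROMPT_CHARS = 4000                   # arbitrary teaching value for input validation
_requests_per_key: dict[str, int] = {}    # in-memory per-key counter (stand-in for real quotas)

def guarded_llm_call(prompt: str, api_key_id: str, max_requests: int = 50) -> str:
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt too long: rejected by input validation.")
    used = _requests_per_key.get(api_key_id, 0)
    if used >= max_requests:
        raise RuntimeError(f"Quota exceeded for key '{api_key_id}'.")
    _requests_per_key[api_key_id] = used + 1
    return call_llm(prompt, temperature=0.0)

print(guarded_llm_call("Define operational risk in one sentence.", api_key_id="student-01"))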
Section 3.17 — Evaluation without using another LLM#
Prefer deterministic checks: schema validation, key checks, length constraints, bullet counts. Avoid ‘LLM judging LLM’ as your only control.
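For example, the JSON task from Section 3.10 can be evaluated with plain schema and length checks and no second model; this sketch assumes the safe_json_loads helper defined in Section 3.11.
# Deterministic evaluation: schema, type, and length checks; no LLM judging an LLM
def evaluate_risk_json(raw: str) -> dict:
    ok, data = safe_json_loads(raw)
    checks = {"valid_json": ok}
    if ok and isinstance(data, dict):
        checks["has_summary"] = isinstance(data.get("summary"), str)
        risks = data.get("risks")
        checks["risks_is_list_of_3"] = isinstance(risks, list) and len(risks) == 3
        checks["summary_under_300_chars"] = len(data.get("summary", "")) <= 300
    return checks

raw = call_llm(
    "Return ONLY valid JSON with keys: summary (string), risks (array of exactly 3 strings). "
    "Text: Banks face credit risk, market risk, and operational risk.",
    temperature=0.0,
)
print(evaluate_risk_json(raw))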
Section 3.18 — Safety patterns: uncertainty and fallbacks#
Include an uncertainty policy: if unsure, say ‘Insufficient information’. Build fallbacks when validation fails.
system = "If you are unsure, respond exactly: Insufficient information. Do not guess."
task = "What is the exact USD/GBP rate at 09:31 UTC yesterday?"
prompt = f"{system}\n\n{task}"
print(call_llm(prompt, temperature=0.0))
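A simple fallback pattern (shown here as a sketch, not a prescribed implementation) is to retry once when a deterministic validator rejects the output, then return a safe default instead of unvalidated text:
# Retry once on validation failure, then fall back to a safe default
def llm_with_fallback(prompt: str, validator, retries: int = 1,
                      default: str = "Insufficient information") -> str:
    for _ in range(retries + 1):
        out = call_llm(prompt, temperature=0.0)
        if validator(out):
            return out
    return default

result = llm_with_fallback(
    "Return exactly 3 bullet points about settlement risk.",
    validator=lambda t: sum(ln.strip().startswith(("-", "*")) for ln in t.splitlines()) == 3,
)
print(result)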
Section 3.19 — Why LLMs alone are not enough#
LLMs have hallucinations, context limits, and no grounding in your internal data by default. This motivates grounding and retrieval techniques.
Section 3.20 — Preparing for Module 5 (Grounding / RAG)#
Mental model: LLM = language engine; RAG = evidence + memory. RAG reduces hallucinations by supplying trusted context.
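As a small preview (a hand-built sketch, not the Module 5 implementation; the policy sentence is invented for illustration), grounding simply means putting trusted text into the prompt and instructing the model to answer only from it:
# Manual grounding preview: supply trusted context and restrict the answer to it
context = "Policy excerpt: Retail credit applications above GBP 25,000 require a second approver."
question = "When does a retail credit application need a second approver?"
grounded_prompt = (
    "Answer ONLY using the context below. If the context does not contain the answer, "
    "respond exactly: Insufficient information.\n\n"
    f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
)
print(call_llm(grounded_prompt, temperature=0.0))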
Practice exercises (ungraded)#
Force JSON output for a classification task and parse it.
Demonstrate one hallucination and explain why it happened.
Enrich a small DataFrame with LLM-generated summaries and add a cache.
Add a validator enforcing: exactly 3 bullets and <= 20 words each.
Module summary#
LLMs generate probabilistic text, not guaranteed truth.
Treat outputs as untrusted unless validated.
Use low temperature for consistency and auditability.
Prefer structured outputs (JSON) for automation.
Design for failures, retries, and fallbacks.