Module 6 — LLM APIs (Python)
CodeVision Academy
Overview#
If Module 5 taught you how to find relevant information, Module 6 teaches you how to reliably call the AI that uses it.
This module marks the transition from using AI to engineering AI systems.
Up to now, you’ve called LLMs casually—paste a prompt, get a response. That works for demos. It does not work for production systems that must:
Handle failures gracefully
Validate outputs before using them
Control costs and latency
Pass audits and compliance reviews
One Big Idea to Remember#
An LLM is not a function call. It is a remote, rate-limited, probabilistic service. Failure is normal. Correctness is engineered.
Learning Objectives#
By the end of this module, you will be able to:
Explain why LLMs must be treated as external services, not functions
Build a reusable Python client class for LLM APIs
Implement proper error handling with timeouts and retries
Enforce structured JSON outputs and validate responses
Write tests for LLM-integrated code without calling live models
Implement logging for cost tracking and auditability
Apply defensive programming patterns for non-deterministic systems
Prepare code cleanly for RAG integration
Before You Start: LLM Gateway Configuration#
This module requires access to an LLM API. You have two options:
| Option | Model | Best For |
|---|---|---|
| A: Local Ollama | phi3:mini | Running locally, learning API patterns |
| B: JBChat Server | llama3.1:8b | Higher quality, Colab users |
Option A: Local Ollama#
If running Jupyter locally: Use http://localhost:11434 directly.
If running in Google Colab: You must expose Ollama via a tunnel.
Pinggy Setup (required for Colab):
Open a terminal on your local machine
Make sure Ollama is running: ollama serve
Start the tunnel: ssh -p 443 -R0:localhost:11434 a.pinggy.io
Copy the HTTPS URL (e.g., https://xyz-abc.a.pinggy.io)
Use that URL in the config below
Option B: Server Gateway (JBChat)#
If you cannot run Ollama locally:
URL: https://jbchat.jonbowden.com.ngrok.app
Requires API key from instructor
Model: llama3.1:8b
Configure below:#
# ===== LLM GATEWAY CONFIGURATION =====
# ------ OPTION A: Local Ollama ------
LLM_BASE_URL = "http://localhost:11434"
# For Colab with Pinggy tunnel:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
LLM_API_KEY = None # No API key = Ollama mode
DEFAULT_MODEL = "phi3:mini"
# ------ OPTION B: Server Gateway (JBChat) ------
# Uncomment these 3 lines to use the course server:
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"
# DEFAULT_MODEL = "llama3.1:8b"
print(f"Configured: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")
print(f"Mode: {'JBChat' if LLM_API_KEY else 'Ollama'}")
Configured: http://localhost:11434
Model: phi3:mini
Mode: Ollama
Group 1 — The Service Mindset#
Before we write code, we need to understand why LLM integration is fundamentally different from calling a local function.
6.1 From Model Calls to Service Contracts#
When you call a local function, you expect:
Instant response
Deterministic output
No network failures
No rate limits
When you call an LLM API, you face:
| Challenge | Reality |
|---|---|
| Latency | 1-30+ seconds per call |
| Availability | Services go down, networks fail |
| Rate limits | Too many calls = blocked |
| Non-determinism | Same input can yield different outputs |
| Cost | Every token costs money |
| Output format | No guarantee of structure |
The Mindset Shift#
WRONG MENTAL MODEL:

result = llm(prompt)
use(result)

RIGHT MENTAL MODEL:

try:
    result = llm_with_retry(prompt)
    validated = parse_and_validate(result)
    use(validated)
except LLMError:
    handle_gracefully()
Enterprise Implications#
In production systems, you must design for:
Graceful degradation — What happens when the LLM is down?
Timeout budgets — How long can users wait?
Fallback strategies — Can you use cached responses?
Cost controls — How do you prevent runaway API bills?
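As a rough sketch of the first three points, the helper below tries the live call, then a cached answer, then a static message. The names answer_with_fallback, llm_call, and response_cache are illustrative only and are not part of the client built later in this module.

# Minimal sketch of graceful degradation (illustrative names, not the course client)
def answer_with_fallback(prompt: str, llm_call, response_cache: dict) -> str:
    try:
        return llm_call(prompt)  # primary path: the live LLM
    except Exception:
        if prompt in response_cache:
            return response_cache[prompt]  # fallback 1: a previously cached answer
        return "The assistant is temporarily unavailable. Please try again later."  # fallback 2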
6.2 Anatomy of an LLM API Request#
Every LLM API call is fundamentally JSON over HTTP. Understanding the structure helps you debug issues and optimize performance.
Request Components#
| Component | Purpose | Example |
|---|---|---|
| Endpoint | Where to send the request | /api/chat |
| Headers | Authentication, content type | Content-Type: application/json |
| Model | Which model to use | phi3:mini |
| Messages | The conversation/prompt | [{"role": "user", "content": "..."}] |
| Temperature | Randomness (0=deterministic) | 0.0 |
| Max tokens | Output length limit | 100 |
Standard Payload Structure#
# A typical LLM API payload
payload = {
"model": "phi3:mini",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain inflation in one sentence."}
],
"temperature": 0.0, # Deterministic
"max_tokens": 100 # Limit response length
}
import json
print("Request payload:")
print(json.dumps(payload, indent=2))
Request payload:
{
"model": "phi3:mini",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain inflation in one sentence."
}
],
"temperature": 0.0,
"max_tokens": 100
}
Response Structure#
The response also follows a standard structure:
{
"model": "phi3:mini",
"message": {
"role": "assistant",
"content": "Inflation is the rate at which prices rise over time."
},
"done": true,
"total_duration": 1234567890
}
Different providers have slightly different response formats, but the core pattern is the same.
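For comparison, an OpenAI-style response nests the message inside a choices list. The example below is illustrative only (field names vary by provider and version); the client built in Section 6.6 checks for both shapes when extracting the content.

# Illustrative OpenAI-style response shape (field names vary by provider)
openai_style_response = {
    "model": "example-model",
    "choices": [
        {"message": {"role": "assistant", "content": "Inflation is the rate at which prices rise over time."}}
    ]
}

def extract_content(data: dict) -> str:
    """Pull the assistant text out of either an Ollama-style or OpenAI-style response."""
    if "message" in data:  # Ollama-style
        return data["message"]["content"]
    return data["choices"][0]["message"]["content"]  # OpenAI-style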
6.3 Configuration Discipline#
Never hardcode configuration. This is a fundamental principle for maintainable systems.
Why Configuration Matters#
| Hardcoded | Configurable |
|---|---|
| Change requires code edit | Change via environment |
| Secrets in source code | Secrets in secure storage |
| Same settings everywhere | Dev/staging/prod can differ |
| Hard to test | Easy to mock |
The Configuration Pattern#
import os
# Configuration from environment (with fallbacks)
class LLMConfig:
"""Centralized LLM configuration."""
BASE_URL = os.getenv("LLM_BASE_URL", LLM_BASE_URL)
API_KEY = os.getenv("LLM_API_KEY", LLM_API_KEY)
MODEL = os.getenv("LLM_MODEL", DEFAULT_MODEL)
# Operational defaults
DEFAULT_TEMPERATURE = 0.0
DEFAULT_MAX_TOKENS = 256
DEFAULT_TIMEOUT = (5, 60) # (connect, read) in seconds
MAX_RETRIES = 3
print(f"Config loaded:")
print(f" BASE_URL: {LLMConfig.BASE_URL}")
print(f" MODEL: {LLMConfig.MODEL}")
print(f" API_KEY: {'***' if LLMConfig.API_KEY else 'None (Ollama mode)'}")
Config loaded:
BASE_URL: http://localhost:11434
MODEL: phi3:mini
API_KEY: None (Ollama mode)
Configuration Best Practices#
Use environment variables for anything that varies by environment
Provide sensible defaults for development
Never commit secrets to version control
Validate configuration at startup, not at first use
Document required variables in README or setup scripts
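A minimal sketch of the "validate at startup" rule, assuming the LLMConfig class defined above: fail fast with a clear message instead of a confusing network error on the first call. validate_config is an illustrative helper, not part of the course client.

# Fail fast if configuration is obviously broken (illustrative helper)
def validate_config(config) -> None:
    if not config.BASE_URL or not config.BASE_URL.startswith(("http://", "https://")):
        raise RuntimeError(f"LLM_BASE_URL looks invalid: {config.BASE_URL!r}")
    if not config.MODEL:
        raise RuntimeError("LLM_MODEL is not set")
    # API_KEY may legitimately be None (Ollama mode), so it is not checked here.

validate_config(LLMConfig)  # run once at startup, not at first use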
Group 2 — Building a Robust Client#
Now we build a reusable client class that encapsulates all the complexity of LLM communication.
6.4 Client Class Rationale#
Why wrap LLM calls in a class instead of simple functions?
| Approach | Pros | Cons |
|---|---|---|
| Raw requests | Simple, direct | Repeated code, no encapsulation |
| Functions | Reusable | State management is awkward |
| Client class | Encapsulated, testable, extensible | Slightly more setup |
What a Good Client Provides#
Encapsulation — Hide HTTP details from business logic
Configuration — Centralized settings management
Retry logic — Automatic handling of transient failures
Logging — Consistent audit trail
Testability — Easy to mock for unit tests
import requests
import time
import json
from typing import Optional, Dict, Any, List
class LLMClient:
"""
A robust client for LLM API interactions.
Handles:
- Configuration management
- Request construction
- Error handling
- Retry logic
- Response parsing
"""
def __init__(
self,
base_url: str,
api_key: Optional[str] = None,
model: str = "phi3:mini",
timeout: tuple = (5, 60)
):
"""
Initialize the LLM client.
Args:
base_url: API endpoint base URL
api_key: Optional API key (None for Ollama)
model: Model identifier
timeout: (connect_timeout, read_timeout) in seconds
"""
self.base_url = base_url.rstrip('/')
self.api_key = api_key
self.model = model
self.timeout = timeout
# Detect mode based on API key
self._use_jbchat = api_key is not None
def __repr__(self):
mode = "JBChat" if self._use_jbchat else "Ollama"
return f"LLMClient(mode={mode}, model={self.model})"
# Create a client instance
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
print(f"Client created: {client}")
Client created: LLMClient(mode=Ollama, model=phi3:mini)
6.5 Request Construction#
Building requests correctly is crucial. Different APIs have different formats, so we encapsulate this complexity.
Key Considerations#
| Aspect | Why It Matters |
|---|---|
| Headers | Authentication, content negotiation |
| Endpoint | Different APIs use different paths |
| Payload format | Ollama vs OpenAI vs others differ |
| Timeout tuning | Connect fast, allow long reads |
# Add request construction methods to our client
class LLMClient(LLMClient): # Extending previous definition
def _build_headers(self) -> Dict[str, str]:
"""Build HTTP headers for the request."""
headers = {
"Content-Type": "application/json",
# Bypass tunnel browser warnings
"ngrok-skip-browser-warning": "true",
"Bypass-Tunnel-Reminder": "true",
}
if self.api_key:
headers["X-API-Key"] = self.api_key
return headers
def _build_payload(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
system_prompt: Optional[str] = None
) -> Dict[str, Any]:
"""Build the request payload."""
# Build messages list
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": prompt})
if self._use_jbchat:
# JBChat/OpenAI format
return {
"model": self.model,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
"stream": False
}
else:
# Ollama format
return {
"model": self.model,
"messages": messages,
"options": {"temperature": temperature},
"stream": False
}
def _get_endpoint(self) -> str:
"""Get the correct API endpoint."""
if self._use_jbchat:
return f"{self.base_url}/chat/direct"
else:
return f"{self.base_url}/api/chat"
# Recreate client with new methods
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
# Show what a request looks like
print("Endpoint:", client._get_endpoint())
print("\nHeaders:", json.dumps(client._build_headers(), indent=2))
print("\nPayload:", json.dumps(client._build_payload("Hello"), indent=2))
Endpoint: http://localhost:11434/api/chat
Headers: {
"Content-Type": "application/json",
"ngrok-skip-browser-warning": "true",
"Bypass-Tunnel-Reminder": "true"
}
Payload: {
"model": "phi3:mini",
"messages": [
{
"role": "user",
"content": "Hello"
}
],
"options": {
"temperature": 0.0
},
"stream": false
}
6.6 Making Safe API Calls#
The actual API call must handle many potential failures:
| Failure Type | Cause | Handling |
|---|---|---|
| Connection timeout | Network issues, server down | Retry with backoff |
| Read timeout | Slow response, overloaded server | Increase timeout or retry |
| HTTP 429 | Rate limited | Back off, then retry |
| HTTP 500 | Server error | Retry with backoff |
| HTTP 401/403 | Auth failure | Don’t retry, fix config |
| Invalid JSON | Malformed response | Log and raise |
class LLMClient(LLMClient): # Extending again
def chat(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
system_prompt: Optional[str] = None
) -> str:
"""
Send a chat request and return the response content.
Args:
prompt: User message
temperature: Randomness (0.0 = deterministic)
max_tokens: Maximum response length
system_prompt: Optional system message
Returns:
The assistant's response text
Raises:
requests.exceptions.RequestException: On network/HTTP errors
ValueError: On invalid response format
"""
response = requests.post(
self._get_endpoint(),
headers=self._build_headers(),
json=self._build_payload(prompt, temperature, max_tokens, system_prompt),
timeout=self.timeout
)
# Raise exception for HTTP errors (4xx, 5xx)
response.raise_for_status()
# Parse response
data = response.json()
# Extract content (handle different response formats)
if "message" in data and "content" in data["message"]:
return data["message"]["content"]
elif "choices" in data: # OpenAI format
return data["choices"][0]["message"]["content"]
else:
raise ValueError(f"Unexpected response format: {data}")
# Recreate and test
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
# Test the client
try:
response = client.chat("Say 'API connected' in exactly two words.")
print(f"Response: {response}")
except Exception as e:
print(f"Error: {e}")
print("\nMake sure your LLM server is running!")
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6b92159030>: Failed to establish a new connection: [Errno 111] Connection refused'))
Make sure your LLM server is running!
6.7 Failure as the Default Assumption#
In distributed systems, the question is not if things will fail, but when and how often.
Types of Failures#
| Type | Example | Frequency |
|---|---|---|
| Transient | Network hiccup, brief overload | Common (retry helps) |
| Persistent | Server down, config error | Less common (retry won’t help) |
| Partial | Slow response, truncated output | Common (timeout/validate) |
| Silent | Wrong answer, hallucination | Common (validation needed) |
The Defensive Mindset#
# WRONG: Assume success
result = client.chat(prompt)
use(result)
# RIGHT: Assume failure, verify success
try:
result = client.chat(prompt)
validated = validate(result)
use(validated)
except TransientError:
retry()
except PermanentError:
fallback()
# Demonstrating what failures look like
import requests
def demonstrate_failures():
    """Show common LLM API failure modes."""
    print("Common LLM API Failures:")
    print("=" * 50)
    # 1. Connection refused (server not running)
    # Use a valid port that should have nothing listening on it
    # (ports above 65535 are invalid and raise InvalidURL instead of ConnectionError).
    try:
        requests.post("http://localhost:9999/api/chat", timeout=1)
    except requests.exceptions.ConnectionError:
        print("1. ConnectionError: Server not reachable")
    # 2. Timeout (the tiny timeout forces a failure; a flaky network may surface
    #    this as a ConnectionError instead, so catch both)
    try:
        requests.get("https://httpstat.us/200?sleep=5000", timeout=0.1)
    except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        print("2. Timeout: Server too slow")
    # 3. HTTP errors (described rather than triggered)
    print("3. HTTP 429: Rate limit exceeded (too many requests)")
    print("4. HTTP 500: Internal server error")
    print("5. HTTP 401: Authentication failed")
    print("\n" + "=" * 50)
    print("All of these require proper handling!")
demonstrate_failures()
Common LLM API Failures:
==================================================
1. ConnectionError: Server not reachable
2. Timeout: Server too slow
3. HTTP 429: Rate limit exceeded (too many requests)
4. HTTP 500: Internal server error
5. HTTP 401: Authentication failed

==================================================
All of these require proper handling!
6.8 Retry with Exponential Backoff#
Exponential backoff is the standard pattern for handling transient failures:
Try the operation
If it fails, wait a short time and retry
If it fails again, wait longer (exponentially)
After N retries, give up
Why Exponential?#
| Attempt | Wait Time | Cumulative |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 1s | 1s |
| 3 | 2s | 3s |
| 4 | 4s | 7s |
| 5 | 8s | 15s |
This gives the server time to recover while not waiting forever.
import time
from typing import Callable, TypeVar
T = TypeVar('T')
def retry_with_backoff(
fn: Callable[[], T],
max_retries: int = 3,
base_delay: float = 1.0,
max_delay: float = 30.0,
retryable_exceptions: tuple = (requests.exceptions.RequestException,)
) -> T:
"""
Execute a function with exponential backoff retry.
Args:
fn: Function to execute (no arguments)
max_retries: Maximum number of retry attempts
base_delay: Initial delay in seconds
max_delay: Maximum delay between retries
retryable_exceptions: Tuple of exceptions that trigger retry
Returns:
Result of fn() on success
Raises:
The last exception if all retries fail
"""
last_exception = None
for attempt in range(max_retries + 1):
try:
return fn()
except retryable_exceptions as e:
last_exception = e
if attempt == max_retries:
break # Don't sleep after last attempt
# Calculate delay with exponential backoff
delay = min(base_delay * (2 ** attempt), max_delay)
print(f" Attempt {attempt + 1} failed: {e}")
print(f" Retrying in {delay:.1f}s...")
time.sleep(delay)
raise last_exception
# Add retry method to client
class LLMClient(LLMClient):
def chat_with_retry(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
max_retries: int = 3
) -> str:
"""Chat with automatic retry on transient failures."""
return retry_with_backoff(
fn=lambda: self.chat(prompt, temperature, max_tokens),
max_retries=max_retries
)
# Recreate client
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
print("Retry function defined.")
print("\nExample usage:")
print(' result = client.chat_with_retry("Your prompt here")')
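To see the backoff behaviour without a live server, you can hand retry_with_backoff a deliberately flaky function. flaky_call below is a throwaway stub invented for this demonstration; it fails twice with a retryable error and then succeeds.

# Demonstration stub: fails twice, then succeeds (no server required)
attempts = {"count": 0}

def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise requests.exceptions.ConnectionError("simulated transient failure")
    return f"succeeded on attempt {attempts['count']}"

# base_delay is kept small so the demo finishes quickly
print(retry_with_backoff(flaky_call, max_retries=3, base_delay=0.1))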
Group 3 — Structured Output and Validation#
Getting a response is only half the battle. The response must be usable.
6.9 The Necessity of Structured Output#
LLMs naturally produce free-form text. That’s great for chatbots. It’s terrible for software systems.
The Problem#
# You asked for a summary
response = "Here's a summary of the document. The main points are..."
# How do you extract the actual summary programmatically?
# What if the format changes?
# What if there's extra text?
The Solution: JSON#
# Ask for JSON
response = '{"summary": "The main points are...", "confidence": 0.85}'
# Now you can parse and use it reliably
data = json.loads(response)
summary = data["summary"]
Why JSON?#
| Format | Pros | Cons |
|---|---|---|
| Free text | Natural, flexible | Hard to parse, unreliable |
| JSON | Structured, parseable, typed | LLM may not comply |
| XML | Structured, handles nesting | Verbose, harder for LLMs |
| YAML | Readable, structured | Whitespace-sensitive |
6.10 JSON Enforcement in Prompts#
The key to getting JSON output is explicit instruction and schema specification.
Prompt Patterns for JSON#
| Pattern | Reliability |
|---|---|
| “Return JSON” | Low |
| “Return ONLY valid JSON: {schema}” | Medium |
| “Return ONLY valid JSON. No other text. Schema: {schema}” | High |
| System prompt + user prompt + schema | Highest |
# Template for JSON-enforced prompts
def build_json_prompt(task: str, schema: dict, data: str = None) -> str:
"""
Build a prompt that requests JSON output.
Args:
task: What to do
schema: Expected JSON structure
data: Optional data to process
"""
schema_str = json.dumps(schema, indent=2)
prompt = f"""Task: {task}
Return ONLY valid JSON matching this schema:
{schema_str}
Rules:
- Return ONLY the JSON object
- No markdown, no explanations, no extra text
- All fields are required
"""
if data:
prompt += f"\nData to process:\n{data}\n"
prompt += "\nJSON:"
return prompt
# Example: Sentiment analysis with structured output
schema = {
"sentiment": "positive | negative | neutral",
"confidence": "float between 0 and 1",
"key_phrases": ["list", "of", "phrases"]
}
prompt = build_json_prompt(
task="Analyze the sentiment of the following text",
schema=schema,
data="The product exceeded my expectations. Great value!"
)
print("JSON-enforced prompt:")
print("=" * 50)
print(prompt)
# Test JSON output with the LLM
try:
response = client.chat(prompt, temperature=0.0)
print("LLM Response:")
print(response)
# Try to parse it
parsed = json.loads(response)
print("\nParsed successfully!")
print(f"Sentiment: {parsed.get('sentiment')}")
print(f"Confidence: {parsed.get('confidence')}")
except json.JSONDecodeError as e:
print(f"JSON parsing failed: {e}")
print("The LLM did not return valid JSON.")
except Exception as e:
print(f"Error: {e}")
6.11 Validation Before Use#
Even when the LLM returns valid JSON, you must validate it before using it in your application.
Validation Layers#
| Layer | Checks | Example |
|---|---|---|
| Syntax | Is it valid JSON? | json.loads() |
| Schema | Are required fields present? | Check keys exist |
| Types | Are values the right type? | isinstance() checks |
| Values | Are values in valid ranges? | Business logic |
| Semantic | Does it make sense? | Domain validation |
from typing import Any
class ValidationError(Exception):
"""Raised when LLM output fails validation."""
pass
def parse_json_response(text: str) -> dict:
"""
Parse JSON from LLM response.
Handles common issues:
- Markdown code blocks
- Leading/trailing whitespace
- Common escape issues
"""
# Clean up common LLM output artifacts
text = text.strip()
# Remove markdown code blocks if present
if text.startswith("```json"):
text = text[7:]
if text.startswith("```"):
text = text[3:]
if text.endswith("```"):
text = text[:-3]
text = text.strip()
try:
return json.loads(text)
except json.JSONDecodeError as e:
raise ValidationError(f"Invalid JSON: {e}")
def validate_schema(data: dict, required_fields: list, field_types: dict = None) -> dict:
"""
Validate that data matches expected schema.
Args:
data: Parsed JSON data
required_fields: List of required field names
field_types: Optional dict of field_name -> expected_type
"""
# Check required fields
missing = [f for f in required_fields if f not in data]
if missing:
raise ValidationError(f"Missing required fields: {missing}")
# Check types if specified
if field_types:
for field, expected_type in field_types.items():
if field in data and not isinstance(data[field], expected_type):
                actual = type(data[field]).__name__
                # expected_type may be a tuple of types (e.g. (int, float)), which has no __name__
                expected = getattr(expected_type, "__name__", str(expected_type))
raise ValidationError(
f"Field '{field}' has wrong type: expected {expected}, got {actual}"
)
return data
# Example validation
test_responses = [
'{"sentiment": "positive", "confidence": 0.9}', # Valid
'{"sentiment": "positive"}', # Missing confidence
'{"sentiment": "positive", "confidence": "high"}', # Wrong type
'Here is the JSON: {"sentiment": "positive"}', # Extra text
]
print("Validation Examples:")
print("=" * 50)
for response in test_responses:
print(f"\nInput: {response[:50]}...")
try:
data = parse_json_response(response)
validated = validate_schema(
data,
required_fields=["sentiment", "confidence"],
field_types={"confidence": (int, float)}
)
print(f" ✓ Valid: {validated}")
except ValidationError as e:
print(f" ✗ Invalid: {e}")
Group 4 — Testing and Determinism#
How do you test code that calls a non-deterministic external service?
6.12 Testing Without Live Models#
Never call live LLMs in unit tests. This is a fundamental principle.
Why Not?#
| Problem | Consequence |
|---|---|
| Cost | Every test run costs money |
| Speed | Tests take seconds instead of milliseconds |
| Flakiness | Non-deterministic = random failures |
| Availability | Tests fail when API is down |
| Rate limits | CI/CD can hit rate limits |
The Solution: Mocking#
Replace the LLM call with a predictable fake:
# Production code
result = client.chat(prompt) # Calls real LLM
# Test code
client.chat = lambda p: '{"result": "mocked"}' # Returns fixed value
result = client.chat(prompt) # Uses mock
# Testing the JSON parser (no LLM needed!)
def test_parse_json_response():
"""Test JSON parsing without any LLM calls."""
# Test 1: Valid JSON
result = parse_json_response('{"key": "value"}')
assert result == {"key": "value"}, "Basic JSON parsing failed"
# Test 2: With markdown code block
result = parse_json_response('```json\n{"key": "value"}\n```')
assert result == {"key": "value"}, "Markdown cleanup failed"
# Test 3: With whitespace
result = parse_json_response(' \n{"key": "value"}\n ')
assert result == {"key": "value"}, "Whitespace handling failed"
# Test 4: Invalid JSON should raise
try:
parse_json_response('not json')
assert False, "Should have raised ValidationError"
except ValidationError:
pass # Expected
print("✓ All JSON parser tests passed!")
def test_validate_schema():
"""Test schema validation without any LLM calls."""
# Test 1: Valid data
data = {"name": "Alice", "age": 30}
result = validate_schema(data, required_fields=["name", "age"])
assert result == data
# Test 2: Missing field
try:
validate_schema({"name": "Alice"}, required_fields=["name", "age"])
assert False, "Should have raised ValidationError"
except ValidationError as e:
assert "age" in str(e)
# Test 3: Wrong type
try:
validate_schema(
{"name": "Alice", "age": "thirty"},
required_fields=["name", "age"],
field_types={"age": int}
)
assert False, "Should have raised ValidationError"
except ValidationError as e:
assert "wrong type" in str(e)
print("✓ All schema validation tests passed!")
# Run the tests
test_parse_json_response()
test_validate_schema()
# Testing with a mock client
class MockLLMClient:
"""A mock client for testing LLM-integrated code."""
def __init__(self, responses: dict = None):
"""
Args:
responses: Dict mapping prompt substrings to responses
"""
self.responses = responses or {}
self.calls = [] # Track calls for verification
def chat(self, prompt: str, **kwargs) -> str:
"""Return a mocked response based on the prompt."""
self.calls.append({"prompt": prompt, **kwargs})
# Find matching response
for key, response in self.responses.items():
if key.lower() in prompt.lower():
return response
# Default response
return '{"status": "mocked"}'
# Example: Testing a function that uses LLM
def analyze_sentiment(client, text: str) -> str:
"""Analyze sentiment using LLM."""
prompt = f"Analyze sentiment: {text}\nReturn JSON: {{\"sentiment\": \"...\"}}"
response = client.chat(prompt)
data = parse_json_response(response)
return data.get("sentiment", "unknown")
# Test with mock
def test_analyze_sentiment():
mock_client = MockLLMClient(responses={
"great": '{"sentiment": "positive"}',
"terrible": '{"sentiment": "negative"}',
"okay": '{"sentiment": "neutral"}',
})
assert analyze_sentiment(mock_client, "This is great!") == "positive"
assert analyze_sentiment(mock_client, "This is terrible!") == "negative"
assert analyze_sentiment(mock_client, "It's okay.") == "neutral"
# Verify the client was called correctly
assert len(mock_client.calls) == 3
print("✓ Sentiment analysis tests passed (using mock)!")
test_analyze_sentiment()
6.13 Achieving Determinism#
Even with mocks, you need strategies for making LLM behavior more predictable.
Techniques for Determinism#
| Technique | How | Effectiveness |
|---|---|---|
| Temperature 0 | Set temperature=0.0 | High (but not perfect) |
| Seed parameter | Some APIs support a fixed seed | Medium (provider-dependent) |
| Constrained output | JSON mode, function calling | High |
| Caching | Cache responses by prompt hash | Perfect (for repeated calls) |
| Mocking | Replace with fake in tests | Perfect (for tests) |
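Where the provider supports it, a fixed seed can be combined with temperature 0. The payload below sketches the idea for an Ollama-style request; treat the seed option as provider-dependent and check your API's documentation, since it is not wired into the course client.

# Provider-dependent sketch: a fixed seed plus temperature 0 for more repeatable output
seeded_payload = {
    "model": "phi3:mini",
    "messages": [{"role": "user", "content": "Explain inflation in one sentence."}],
    "options": {"temperature": 0.0, "seed": 42},  # "seed" support varies by provider
    "stream": False,
}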
import hashlib
from typing import Optional
class CachingLLMClient:
"""LLM client with response caching for determinism."""
def __init__(self, client: LLMClient):
self.client = client
self.cache = {}
self.cache_hits = 0
self.cache_misses = 0
def _cache_key(self, prompt: str, temperature: float, max_tokens: int) -> str:
"""Generate a cache key from request parameters."""
key_data = f"{prompt}:{temperature}:{max_tokens}"
return hashlib.sha256(key_data.encode()).hexdigest()[:16]
def chat(self, prompt: str, temperature: float = 0.0, max_tokens: int = 256) -> str:
"""Chat with caching."""
key = self._cache_key(prompt, temperature, max_tokens)
if key in self.cache:
self.cache_hits += 1
return self.cache[key]
self.cache_misses += 1
response = self.client.chat(prompt, temperature, max_tokens)
self.cache[key] = response
return response
def stats(self) -> dict:
"""Return cache statistics."""
total = self.cache_hits + self.cache_misses
hit_rate = self.cache_hits / total if total > 0 else 0
return {
"hits": self.cache_hits,
"misses": self.cache_misses,
"hit_rate": f"{hit_rate:.1%}",
"cached_responses": len(self.cache)
}
print("CachingLLMClient defined.")
print("\nBenefits:")
print(" - Same prompt always returns same response")
print(" - Reduces API costs")
print(" - Faster repeated calls")
print(" - Useful for development and testing")
Group 5 — Production Concerns#
Beyond correctness, production systems need logging, cost control, and auditability.
6.14 Cost and Audit Logging#
Every LLM call should be logged for:
| Purpose | What to Log |
|---|---|
| Cost tracking | Tokens used, model, timestamp |
| Debugging | Prompt, response, errors |
| Compliance | User ID, request ID, inputs/outputs |
| Performance | Latency, retry count |
| Security | Suspicious patterns, PII detection |
import logging
from datetime import datetime
from dataclasses import dataclass
from typing import Optional
import uuid
@dataclass
class LLMCallLog:
"""Structured log entry for an LLM call."""
request_id: str
timestamp: str
model: str
prompt_preview: str # First N chars of prompt
response_preview: str # First N chars of response
latency_ms: float
success: bool
error: Optional[str] = None
def to_dict(self) -> dict:
return {
"request_id": self.request_id,
"timestamp": self.timestamp,
"model": self.model,
"prompt_preview": self.prompt_preview,
"response_preview": self.response_preview,
"latency_ms": self.latency_ms,
"success": self.success,
"error": self.error
}
class LoggingLLMClient:
"""LLM client with structured logging."""
def __init__(self, client: LLMClient, preview_length: int = 100):
self.client = client
self.preview_length = preview_length
self.logs: list[LLMCallLog] = []
def chat(self, prompt: str, **kwargs) -> str:
"""Chat with logging."""
request_id = str(uuid.uuid4())[:8]
start_time = datetime.now()
try:
response = self.client.chat(prompt, **kwargs)
latency = (datetime.now() - start_time).total_seconds() * 1000
log = LLMCallLog(
request_id=request_id,
timestamp=start_time.isoformat(),
model=self.client.model,
prompt_preview=prompt[:self.preview_length],
response_preview=response[:self.preview_length],
latency_ms=round(latency, 2),
success=True
)
self.logs.append(log)
return response
except Exception as e:
latency = (datetime.now() - start_time).total_seconds() * 1000
log = LLMCallLog(
request_id=request_id,
timestamp=start_time.isoformat(),
model=self.client.model,
prompt_preview=prompt[:self.preview_length],
response_preview="",
latency_ms=round(latency, 2),
success=False,
error=str(e)
)
self.logs.append(log)
raise
def get_logs(self) -> list[dict]:
"""Return all logs as dictionaries."""
return [log.to_dict() for log in self.logs]
def summary(self) -> dict:
"""Return summary statistics."""
if not self.logs:
return {"total_calls": 0}
successful = [l for l in self.logs if l.success]
failed = [l for l in self.logs if not l.success]
latencies = [l.latency_ms for l in successful]
return {
"total_calls": len(self.logs),
"successful": len(successful),
"failed": len(failed),
"avg_latency_ms": round(sum(latencies) / len(latencies), 2) if latencies else 0,
"max_latency_ms": round(max(latencies), 2) if latencies else 0
}
print("LoggingLLMClient defined.")
print("\nFeatures:")
print(" - Structured log entries")
print(" - Request ID tracking")
print(" - Latency measurement")
print(" - Error capture")
print(" - Summary statistics")
# Demonstrate logging (using mock to avoid real API calls)
# Create a mock for demonstration
mock = MockLLMClient(responses={
"hello": '{"greeting": "Hello!"}',
"weather": '{"forecast": "Sunny"}',
})
mock.model = "mock-model" # Add model attribute
# Wrap with logging
logged_client = LoggingLLMClient(mock)
# Make some calls
logged_client.chat("Hello, how are you?")
logged_client.chat("What's the weather like?")
logged_client.chat("Hello again!")
# View logs
print("Call Logs:")
print("=" * 60)
for log in logged_client.get_logs():
print(f"[{log['request_id']}] {log['prompt_preview'][:30]}... -> {log['latency_ms']}ms")
print("\nSummary:")
print(logged_client.summary())
6.15 Preparing for RAG#
Everything in this module prepares you for building RAG (Retrieval-Augmented Generation) systems.
The RAG Pattern (Recap from Module 5)#
User Question
↓
Embed Question (Module 5)
↓
Vector Search (Module 5)
↓
Build Prompt with Context (Module 6)
↓
Call LLM with Retry (Module 6)
↓
Validate Response (Module 6)
↓
Return to User
What You Now Know#
| Module 5 Skills | Module 6 Skills |
|---|---|
| Generate embeddings | Build robust API clients |
| Semantic search | Handle failures gracefully |
| Vector databases | Enforce structured output |
| Chunking strategies | Validate before using |
| Retrieval evaluation | Test without live models |
Next Steps#
Combining Modules 5 and 6, you can now:
Index documents with embeddings
Retrieve relevant context for any question
Build prompts that ground LLM answers in facts
Call LLMs reliably with proper error handling
Validate and log everything for compliance
# A complete RAG-ready LLM client
def build_rag_prompt(question: str, context_docs: list[str]) -> str:
"""
Build a RAG prompt with retrieved context.
Args:
question: User's question
context_docs: Retrieved relevant documents
"""
context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context_docs)])
return f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information."
Context:
{context}
Question: {question}
Provide your answer in JSON format:
{{
"answer": "your answer here",
"sources_used": [1, 2],
"confidence": "high|medium|low"
}}
JSON:"""
# Example
context = [
"The company's Q3 revenue was $4.2 billion, up 15% YoY.",
"Operating margin improved to 23% from 21% last year.",
"The CEO announced plans to expand into Asian markets."
]
prompt = build_rag_prompt(
question="What was the company's revenue in Q3?",
context_docs=context
)
print("RAG Prompt Example:")
print("=" * 60)
print(prompt)
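Putting the pieces together, the sketch below outlines what a RAG answer function can look like; retrieve_context stands in for the Module 5 vector search and is a placeholder argument here, not a function defined in this course.

# End-to-end sketch: retrieve -> build prompt -> call with retry -> validate
def answer_question(client, question: str, retrieve_context) -> dict:
    docs = retrieve_context(question)           # Module 5: vector search (placeholder)
    prompt = build_rag_prompt(question, docs)   # Module 6: grounded prompt
    response = client.chat_with_retry(prompt)   # Module 6: robust call
    data = parse_json_response(response)        # Module 6: parse
    return validate_schema(                     # Module 6: validate before use
        data,
        required_fields=["answer", "sources_used", "confidence"],
    )

# Example wiring, reusing the fixed context list above in place of a real retriever:
# result = answer_question(client, "What was the company's revenue in Q3?", lambda q: context)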
Module Summary#
Key Concepts#
| Concept | What It Means |
|---|---|
| Service mindset | LLMs are external services, not functions |
| Defensive programming | Assume failure, verify success |
| Exponential backoff | Wait longer between each retry |
| Structured output | Request JSON, validate before use |
| Mocking | Test without live LLM calls |
| Audit logging | Track every call for cost/compliance |
The Production LLM Client Checklist#
Configuration externalized (no hardcoded secrets)
Proper timeouts set (connect and read)
Retry logic with exponential backoff
Structured output requested (JSON)
Response validation before use
Error handling for all failure modes
Logging for debugging and audit
Tests that don’t call live APIs
Enterprise Implications#
| Concern | Solution |
|---|---|
| Cost control | Logging, caching, token limits |
| Reliability | Retries, fallbacks, timeouts |
| Compliance | Audit logs, input/output validation |
| Security | No hardcoded secrets, PII detection |
| Testability | Mocking, dependency injection |
What’s Next#
Quiz — Test your understanding of LLM API patterns
Assessment — Build and test your own LLM client
Module 7 — Putting it all together with a complete RAG system