Module 6 — LLM APIs (Python)

CodeVision Academy

Overview#

If Module 5 taught you how to find relevant information, Module 6 teaches you how to reliably call the AI that uses it.

This module marks the transition from using AI to engineering AI systems.

Up to now, you’ve called LLMs casually—paste a prompt, get a response. That works for demos. It does not work for production systems that must:

  • Handle failures gracefully

  • Validate outputs before using them

  • Control costs and latency

  • Pass audits and compliance reviews


One Big Idea to Remember#

An LLM is not a function call. It is a remote, rate-limited, probabilistic service. Failure is normal. Correctness is engineered.


Learning Objectives#

By the end of this module, you will be able to:

  1. Explain why LLMs must be treated as external services, not functions

  2. Build a reusable Python client class for LLM APIs

  3. Implement proper error handling with timeouts and retries

  4. Enforce structured JSON outputs and validate responses

  5. Write tests for LLM-integrated code without calling live models

  6. Implement logging for cost tracking and auditability

  7. Apply defensive programming patterns for non-deterministic systems

  8. Prepare code cleanly for RAG integration


Before You Start: LLM Gateway Configuration#

This module requires access to an LLM API. You have two options:

| Option | Model | Best For |
|---|---|---|
| A: Local Ollama | phi3:mini | Running locally, learning API patterns |
| B: JBChat Server | llama3.1:8b | Higher quality, Colab users |


Option A: Local Ollama#

If running Jupyter locally: Use http://localhost:11434 directly.

If running in Google Colab: You must expose Ollama via a tunnel.

Pinggy Setup (required for Colab):

  1. Open a terminal on your local machine

  2. Make sure Ollama is running: ollama serve

  3. Start the tunnel:

    ssh -p 443 -R0:localhost:11434 a.pinggy.io
    
  4. Copy the HTTPS URL (e.g., https://xyz-abc.a.pinggy.io)

  5. Use that URL in the config below


Option B: Server Gateway (JBChat)#

If you cannot run Ollama locally:

  • URL: https://jbchat.jonbowden.com.ngrok.app

  • Requires API key from instructor

  • Model: llama3.1:8b


Configure below:#

# ===== LLM GATEWAY CONFIGURATION =====

# ------ OPTION A: Local Ollama ------
LLM_BASE_URL = "http://localhost:11434"
# For Colab with Pinggy tunnel:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"

LLM_API_KEY = None  # No API key = Ollama mode
DEFAULT_MODEL = "phi3:mini"

# ------ OPTION B: Server Gateway (JBChat) ------
# Uncomment these 3 lines to use the course server:
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"
# DEFAULT_MODEL = "llama3.1:8b"

print(f"Configured: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")
print(f"Mode: {'JBChat' if LLM_API_KEY else 'Ollama'}")
Configured: http://localhost:11434
Model: phi3:mini
Mode: Ollama

Group 1 — The Service Mindset#

Before we write code, we need to understand why LLM integration is fundamentally different from calling a local function.

6.1 From Model Calls to Service Contracts#

When you call a local function, you expect:

  • Instant response

  • Deterministic output

  • No network failures

  • No rate limits

When you call an LLM API, you face:

| Challenge | Reality |
|---|---|
| Latency | 1-30+ seconds per call |
| Availability | Services go down, networks fail |
| Rate limits | Too many calls = blocked |
| Non-determinism | Same input can yield different outputs |
| Cost | Every token costs money |
| Output format | No guarantee of structure |

The Mindset Shift#

WRONG MENTAL MODEL:           RIGHT MENTAL MODEL:

result = llm(prompt)          try:
use(result)                       result = llm_with_retry(prompt)
                                  validated = parse_and_validate(result)
                                  use(validated)
                              except LLMError:
                                  handle_gracefully()

Enterprise Implications#

In production systems, you must design for:

  • Graceful degradation — What happens when the LLM is down?

  • Timeout budgets — How long can users wait?

  • Fallback strategies — Can you use cached responses? (see the sketch after this list)

  • Cost controls — How do you prevent runaway API bills?
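
A minimal sketch of the fallback idea: wrap the call and fall back to a cached or canned answer when the service is unreachable. The call_llm and cached_answer callables here are hypothetical stand-ins, not part of the client built later in this module.

import requests

def answer_with_fallback(call_llm, cached_answer, prompt: str) -> str:
    """Try the live LLM first; degrade to a cached/canned answer on failure."""
    try:
        return call_llm(prompt)                      # normal path
    except requests.exceptions.RequestException:
        return cached_answer(prompt)                 # degraded path

# Usage sketch with stand-in callables (no real API involved):
reply = answer_with_fallback(
    call_llm=lambda p: "Live answer",
    cached_answer=lambda p: "Cached answer (service unavailable)",
    prompt="Explain inflation in one sentence.",
)
print(reply)  # prints "Live answer"; the cached path is used only when the call raises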

6.2 Anatomy of an LLM API Request#

Every LLM API call is fundamentally JSON over HTTP. Understanding the structure helps you debug issues and optimize performance.

Request Components#

| Component | Purpose | Example |
|---|---|---|
| Endpoint | Where to send the request | /api/chat, /v1/completions |
| Headers | Authentication, content type | Authorization: Bearer xxx |
| Model | Which model to use | phi3:mini, gpt-4 |
| Messages | The conversation/prompt | [{"role": "user", "content": "..."}] |
| Temperature | Randomness (0 = deterministic) | 0.0 to 1.0 |
| Max tokens | Output length limit | 256, 1024 |

Standard Payload Structure#

# A typical LLM API payload
payload = {
    "model": "phi3:mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain inflation in one sentence."}
    ],
    "temperature": 0.0,  # Deterministic
    "max_tokens": 100    # Limit response length
}

import json
print("Request payload:")
print(json.dumps(payload, indent=2))
Request payload:
{
  "model": "phi3:mini",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Explain inflation in one sentence."
    }
  ],
  "temperature": 0.0,
  "max_tokens": 100
}

Response Structure#

The response also follows a standard structure:

{
  "model": "phi3:mini",
  "message": {
    "role": "assistant",
    "content": "Inflation is the rate at which prices rise over time."
  },
  "done": true,
  "total_duration": 1234567890
}

Different providers have slightly different response formats, but the core pattern is the same.
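
For example, extracting the text from the Ollama-style response above is a dictionary lookup; a small helper (a sketch covering the two layouts used in this module, not any provider's full API) can also handle the OpenAI-style choices layout:

def extract_content(data: dict) -> str:
    """Pull the assistant text out of a chat response dictionary."""
    if "message" in data:                          # Ollama-style response
        return data["message"]["content"]
    if "choices" in data:                          # OpenAI-style response
        return data["choices"][0]["message"]["content"]
    raise ValueError(f"Unexpected response format: {data}")

example = {
    "model": "phi3:mini",
    "message": {
        "role": "assistant",
        "content": "Inflation is the rate at which prices rise over time."
    },
    "done": True
}
print(extract_content(example))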

6.3 Configuration Discipline#

Never hardcode configuration. This is a fundamental principle for maintainable systems.

Why Configuration Matters#

| Hardcoded | Configurable |
|---|---|
| Change requires code edit | Change via environment |
| Secrets in source code | Secrets in secure storage |
| Same settings everywhere | Dev/staging/prod can differ |
| Hard to test | Easy to mock |

The Configuration Pattern#

import os

# Configuration from environment (with fallbacks)
class LLMConfig:
    """Centralized LLM configuration."""
    
    BASE_URL = os.getenv("LLM_BASE_URL", LLM_BASE_URL)
    API_KEY = os.getenv("LLM_API_KEY", LLM_API_KEY)
    MODEL = os.getenv("LLM_MODEL", DEFAULT_MODEL)
    
    # Operational defaults
    DEFAULT_TEMPERATURE = 0.0
    DEFAULT_MAX_TOKENS = 256
    DEFAULT_TIMEOUT = (5, 60)  # (connect, read) in seconds
    MAX_RETRIES = 3

print(f"Config loaded:")
print(f"  BASE_URL: {LLMConfig.BASE_URL}")
print(f"  MODEL: {LLMConfig.MODEL}")
print(f"  API_KEY: {'***' if LLMConfig.API_KEY else 'None (Ollama mode)'}")
Config loaded:
  BASE_URL: http://localhost:11434
  MODEL: phi3:mini
  API_KEY: None (Ollama mode)

Configuration Best Practices#

  1. Use environment variables for anything that varies by environment

  2. Provide sensible defaults for development

  3. Never commit secrets to version control

  4. Validate configuration at startup, not at first use (see the sketch after this list)

  5. Document required variables in README or setup scripts
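
A minimal sketch of point 4, startup validation against the LLMConfig class defined above (the specific checks are illustrative assumptions, not a complete policy):

def validate_config(config=LLMConfig) -> None:
    """Fail fast at startup if essential settings are missing or malformed."""
    if not config.BASE_URL or not config.BASE_URL.startswith(("http://", "https://")):
        raise ValueError(f"LLM_BASE_URL looks invalid: {config.BASE_URL!r}")
    if not config.MODEL:
        raise ValueError("LLM_MODEL must not be empty")
    connect_timeout, read_timeout = config.DEFAULT_TIMEOUT
    if connect_timeout <= 0 or read_timeout <= 0:
        raise ValueError("Timeouts must be positive")

validate_config()
print("Configuration OK")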


Group 2 — Building a Robust Client#

Now we build a reusable client class that encapsulates all the complexity of LLM communication.

6.4 Client Class Rationale#

Why wrap LLM calls in a class instead of simple functions?

| Approach | Pros | Cons |
|---|---|---|
| Raw requests | Simple, direct | Repeated code, no encapsulation |
| Functions | Reusable | State management is awkward |
| Client class | Encapsulated, testable, extensible | Slightly more setup |

What a Good Client Provides#

  • Encapsulation — Hide HTTP details from business logic

  • Configuration — Centralized settings management

  • Retry logic — Automatic handling of transient failures

  • Logging — Consistent audit trail

  • Testability — Easy to mock for unit tests

import requests
import time
import json
from typing import Optional, Dict, Any, List

class LLMClient:
    """
    A robust client for LLM API interactions.
    
    Handles:
    - Configuration management
    - Request construction
    - Error handling
    - Retry logic
    - Response parsing
    """
    
    def __init__(
        self,
        base_url: str,
        api_key: Optional[str] = None,
        model: str = "phi3:mini",
        timeout: tuple = (5, 60)
    ):
        """
        Initialize the LLM client.
        
        Args:
            base_url: API endpoint base URL
            api_key: Optional API key (None for Ollama)
            model: Model identifier
            timeout: (connect_timeout, read_timeout) in seconds
        """
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.model = model
        self.timeout = timeout
        
        # Detect mode based on API key
        self._use_jbchat = api_key is not None
    
    def __repr__(self):
        mode = "JBChat" if self._use_jbchat else "Ollama"
        return f"LLMClient(mode={mode}, model={self.model})"

# Create a client instance
client = LLMClient(
    base_url=LLMConfig.BASE_URL,
    api_key=LLMConfig.API_KEY,
    model=LLMConfig.MODEL
)
print(f"Client created: {client}")
Client created: LLMClient(mode=Ollama, model=phi3:mini)

6.5 Request Construction#

Building requests correctly is crucial. Different APIs have different formats, so we encapsulate this complexity.

Key Considerations#

| Aspect | Why It Matters |
|---|---|
| Headers | Authentication, content negotiation |
| Endpoint | Different APIs use different paths |
| Payload format | Ollama vs OpenAI vs others differ |
| Timeout tuning | Connect fast, allow long reads |

# Add request construction methods to our client

class LLMClient(LLMClient):  # Extending previous definition
    
    def _build_headers(self) -> Dict[str, str]:
        """Build HTTP headers for the request."""
        headers = {
            "Content-Type": "application/json",
            # Bypass tunnel browser warnings
            "ngrok-skip-browser-warning": "true",
            "Bypass-Tunnel-Reminder": "true",
        }
        
        if self.api_key:
            headers["X-API-Key"] = self.api_key
        
        return headers
    
    def _build_payload(
        self,
        prompt: str,
        temperature: float = 0.0,
        max_tokens: int = 256,
        system_prompt: Optional[str] = None
    ) -> Dict[str, Any]:
        """Build the request payload."""
        
        # Build messages list
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})
        
        if self._use_jbchat:
            # JBChat/OpenAI format
            return {
                "model": self.model,
                "messages": messages,
                "temperature": temperature,
                "max_tokens": max_tokens,
                "stream": False
            }
        else:
            # Ollama format
            return {
                "model": self.model,
                "messages": messages,
                "options": {"temperature": temperature},
                "stream": False
            }
    
    def _get_endpoint(self) -> str:
        """Get the correct API endpoint."""
        if self._use_jbchat:
            return f"{self.base_url}/chat/direct"
        else:
            return f"{self.base_url}/api/chat"

# Recreate client with new methods
client = LLMClient(
    base_url=LLMConfig.BASE_URL,
    api_key=LLMConfig.API_KEY,
    model=LLMConfig.MODEL
)

# Show what a request looks like
print("Endpoint:", client._get_endpoint())
print("\nHeaders:", json.dumps(client._build_headers(), indent=2))
print("\nPayload:", json.dumps(client._build_payload("Hello"), indent=2))
Endpoint: http://localhost:11434/api/chat

Headers: {
  "Content-Type": "application/json",
  "ngrok-skip-browser-warning": "true",
  "Bypass-Tunnel-Reminder": "true"
}

Payload: {
  "model": "phi3:mini",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "options": {
    "temperature": 0.0
  },
  "stream": false
}

6.6 Making Safe API Calls#

The actual API call must handle many potential failures:

| Failure Type | Cause | Handling |
|---|---|---|
| Connection timeout | Network issues, server down | Retry with backoff |
| Read timeout | Slow response, overloaded server | Increase timeout or retry |
| HTTP 429 | Rate limited | Back off, then retry |
| HTTP 500 | Server error | Retry with backoff |
| HTTP 401/403 | Auth failure | Don’t retry, fix config |
| Invalid JSON | Malformed response | Log and raise |

class LLMClient(LLMClient):  # Extending again
    
    def chat(
        self,
        prompt: str,
        temperature: float = 0.0,
        max_tokens: int = 256,
        system_prompt: Optional[str] = None
    ) -> str:
        """
        Send a chat request and return the response content.
        
        Args:
            prompt: User message
            temperature: Randomness (0.0 = deterministic)
            max_tokens: Maximum response length
            system_prompt: Optional system message
            
        Returns:
            The assistant's response text
            
        Raises:
            requests.exceptions.RequestException: On network/HTTP errors
            ValueError: On invalid response format
        """
        response = requests.post(
            self._get_endpoint(),
            headers=self._build_headers(),
            json=self._build_payload(prompt, temperature, max_tokens, system_prompt),
            timeout=self.timeout
        )
        
        # Raise exception for HTTP errors (4xx, 5xx)
        response.raise_for_status()
        
        # Parse response
        data = response.json()
        
        # Extract content (handle different response formats)
        if "message" in data and "content" in data["message"]:
            return data["message"]["content"]
        elif "choices" in data:  # OpenAI format
            return data["choices"][0]["message"]["content"]
        else:
            raise ValueError(f"Unexpected response format: {data}")

# Recreate and test
client = LLMClient(
    base_url=LLMConfig.BASE_URL,
    api_key=LLMConfig.API_KEY,
    model=LLMConfig.MODEL
)

# Test the client
try:
    response = client.chat("Say 'API connected' in exactly two words.")
    print(f"Response: {response}")
except Exception as e:
    print(f"Error: {e}")
    print("\nMake sure your LLM server is running!")
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6b92159030>: Failed to establish a new connection: [Errno 111] Connection refused'))

Make sure your LLM server is running!

6.7 Failure as the Default Assumption#

In distributed systems, the question is not if things will fail, but when and how often.

Types of Failures#

| Type | Example | Frequency |
|---|---|---|
| Transient | Network hiccup, brief overload | Common (retry helps) |
| Persistent | Server down, config error | Less common (retry won’t help) |
| Partial | Slow response, truncated output | Common (timeout/validate) |
| Silent | Wrong answer, hallucination | Common (validation needed) |

The Defensive Mindset#

# WRONG: Assume success
result = client.chat(prompt)
use(result)

# RIGHT: Assume failure, verify success
try:
    result = client.chat(prompt)
    validated = validate(result)
    use(validated)
except TransientError:
    retry()
except PermanentError:
    fallback()
# Demonstrating what failures look like
import requests

def demonstrate_failures():
    """Show common LLM API failure modes."""

    print("Common LLM API Failures:")
    print("=" * 50)

    # 1. Connection refused (assumes nothing is listening on this port)
    try:
        requests.post("http://localhost:9999/api/chat", timeout=1)
    except requests.exceptions.ConnectionError:
        print("1. ConnectionError: Server not reachable")

    # 2. Timeout (the server responds more slowly than our deadline)
    try:
        requests.get("https://httpstat.us/200?sleep=5000", timeout=0.1)
    except requests.exceptions.Timeout:
        print("2. Timeout: Server too slow")

    # 3. HTTP errors (described, not triggered)
    print("3. HTTP 429: Rate limit exceeded (too many requests)")
    print("4. HTTP 500: Internal server error")
    print("5. HTTP 401: Authentication failed")

    print("\n" + "=" * 50)
    print("All of these require proper handling!")

demonstrate_failures()

6.8 Retry with Exponential Backoff#

Exponential backoff is the standard pattern for handling transient failures:

  1. Try the operation

  2. If it fails, wait a short time and retry

  3. If it fails again, wait longer (exponentially)

  4. After N retries, give up

Why Exponential?#

| Attempt | Wait Time | Cumulative |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 1s | 1s |
| 3 | 2s | 3s |
| 4 | 4s | 7s |
| 5 | 8s | 15s |

This gives the server time to recover while not waiting forever.
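
The waits in the table follow a simple doubling rule, which the retry helper below implements: the first attempt runs immediately, and retry n waits base_delay * 2**(n-1) seconds, capped at a maximum. A quick reproduction of the table:

base_delay, max_delay = 1.0, 30.0

for attempt in range(1, 6):
    if attempt == 1:
        wait = 0.0                                          # first try is immediate
    else:
        wait = min(base_delay * 2 ** (attempt - 2), max_delay)
    print(f"Attempt {attempt}: wait {wait:.0f}s before calling")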

import time
from typing import Callable, TypeVar

T = TypeVar('T')

def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_exceptions: tuple = (requests.exceptions.RequestException,)
) -> T:
    """
    Execute a function with exponential backoff retry.
    
    Args:
        fn: Function to execute (no arguments)
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay between retries
        retryable_exceptions: Tuple of exceptions that trigger retry
        
    Returns:
        Result of fn() on success
        
    Raises:
        The last exception if all retries fail
    """
    last_exception = None
    
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable_exceptions as e:
            last_exception = e
            
            if attempt == max_retries:
                break  # Don't sleep after last attempt
                
            # Calculate delay with exponential backoff
            delay = min(base_delay * (2 ** attempt), max_delay)
            print(f"  Attempt {attempt + 1} failed: {e}")
            print(f"  Retrying in {delay:.1f}s...")
            time.sleep(delay)
    
    raise last_exception

# Add retry method to client
class LLMClient(LLMClient):
    
    def chat_with_retry(
        self,
        prompt: str,
        temperature: float = 0.0,
        max_tokens: int = 256,
        max_retries: int = 3
    ) -> str:
        """Chat with automatic retry on transient failures."""
        return retry_with_backoff(
            fn=lambda: self.chat(prompt, temperature, max_tokens),
            max_retries=max_retries
        )

# Recreate client
client = LLMClient(
    base_url=LLMConfig.BASE_URL,
    api_key=LLMConfig.API_KEY,
    model=LLMConfig.MODEL
)

print("Retry function defined.")
print("\nExample usage:")
print('  result = client.chat_with_retry("Your prompt here")')

Group 3 — Structured Output and Validation#

Getting a response is only half the battle. The response must be usable.

6.9 The Necessity of Structured Output#

LLMs naturally produce free-form text. That’s great for chatbots. It’s terrible for software systems.

The Problem#

# You asked for a summary
response = "Here's a summary of the document. The main points are..."

# How do you extract the actual summary programmatically?
# What if the format changes?
# What if there's extra text?

The Solution: JSON#

# Ask for JSON
response = '{"summary": "The main points are...", "confidence": 0.85}'

# Now you can parse and use it reliably
data = json.loads(response)
summary = data["summary"]

Why JSON?#

| Format | Pros | Cons |
|---|---|---|
| Free text | Natural, flexible | Hard to parse, unreliable |
| JSON | Structured, parseable, typed | LLM may not comply |
| XML | Structured, handles nesting | Verbose, harder for LLMs |
| YAML | Readable, structured | Whitespace-sensitive |

6.10 JSON Enforcement in Prompts#

The key to getting JSON output is explicit instruction and schema specification.

Prompt Patterns for JSON#

| Pattern | Reliability |
|---|---|
| “Return JSON” | Low |
| “Return ONLY valid JSON: {schema}” | Medium |
| “Return ONLY valid JSON. No other text. Schema: {schema}” | High |
| System prompt + user prompt + schema | Highest |

# Template for JSON-enforced prompts

def build_json_prompt(task: str, schema: dict, data: str = None) -> str:
    """
    Build a prompt that requests JSON output.
    
    Args:
        task: What to do
        schema: Expected JSON structure
        data: Optional data to process
    """
    schema_str = json.dumps(schema, indent=2)
    
    prompt = f"""Task: {task}

Return ONLY valid JSON matching this schema:
{schema_str}

Rules:
- Return ONLY the JSON object
- No markdown, no explanations, no extra text
- All fields are required
"""
    
    if data:
        prompt += f"\nData to process:\n{data}\n"
    
    prompt += "\nJSON:"
    
    return prompt

# Example: Sentiment analysis with structured output
schema = {
    "sentiment": "positive | negative | neutral",
    "confidence": "float between 0 and 1",
    "key_phrases": ["list", "of", "phrases"]
}

prompt = build_json_prompt(
    task="Analyze the sentiment of the following text",
    schema=schema,
    data="The product exceeded my expectations. Great value!"
)

print("JSON-enforced prompt:")
print("=" * 50)
print(prompt)
# Test JSON output with the LLM
try:
    response = client.chat(prompt, temperature=0.0)
    print("LLM Response:")
    print(response)
    
    # Try to parse it
    parsed = json.loads(response)
    print("\nParsed successfully!")
    print(f"Sentiment: {parsed.get('sentiment')}")
    print(f"Confidence: {parsed.get('confidence')}")
except json.JSONDecodeError as e:
    print(f"JSON parsing failed: {e}")
    print("The LLM did not return valid JSON.")
except Exception as e:
    print(f"Error: {e}")

6.11 Validation Before Use#

Even when the LLM returns valid JSON, you must validate it before using it in your application.

Validation Layers#

| Layer | Checks | Example |
|---|---|---|
| Syntax | Is it valid JSON? | json.loads() |
| Schema | Are required fields present? | Check keys exist |
| Types | Are values the right type? | isinstance() |
| Values | Are values in valid ranges? | Business logic |
| Semantic | Does it make sense? | Domain validation |

from typing import Any

class ValidationError(Exception):
    """Raised when LLM output fails validation."""
    pass

def parse_json_response(text: str) -> dict:
    """
    Parse JSON from LLM response.
    
    Handles common issues:
    - Markdown code blocks
    - Leading/trailing whitespace
    - Common escape issues
    """
    # Clean up common LLM output artifacts
    text = text.strip()
    
    # Remove markdown code blocks if present
    if text.startswith("```json"):
        text = text[7:]
    if text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]
    
    text = text.strip()
    
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValidationError(f"Invalid JSON: {e}")

def validate_schema(data: dict, required_fields: list, field_types: dict = None) -> dict:
    """
    Validate that data matches expected schema.
    
    Args:
        data: Parsed JSON data
        required_fields: List of required field names
        field_types: Optional dict of field_name -> expected_type
    """
    # Check required fields
    missing = [f for f in required_fields if f not in data]
    if missing:
        raise ValidationError(f"Missing required fields: {missing}")
    
    # Check types if specified (a single type or a tuple of accepted types)
    if field_types:
        for field, expected_type in field_types.items():
            if field in data and not isinstance(data[field], expected_type):
                actual = type(data[field]).__name__
                if isinstance(expected_type, tuple):
                    expected = " or ".join(t.__name__ for t in expected_type)
                else:
                    expected = expected_type.__name__
                raise ValidationError(
                    f"Field '{field}' has wrong type: expected {expected}, got {actual}"
                )
    
    return data

# Example validation
test_responses = [
    '{"sentiment": "positive", "confidence": 0.9}',  # Valid
    '{"sentiment": "positive"}',  # Missing confidence
    '{"sentiment": "positive", "confidence": "high"}',  # Wrong type
    'Here is the JSON: {"sentiment": "positive"}',  # Extra text
]

print("Validation Examples:")
print("=" * 50)

for response in test_responses:
    print(f"\nInput: {response[:50]}...")
    try:
        data = parse_json_response(response)
        validated = validate_schema(
            data,
            required_fields=["sentiment", "confidence"],
            field_types={"confidence": (int, float)}
        )
        print(f"  ✓ Valid: {validated}")
    except ValidationError as e:
        print(f"  ✗ Invalid: {e}")

Group 4 — Testing and Determinism#

How do you test code that calls a non-deterministic external service?

6.12 Testing Without Live Models#

Never call live LLMs in unit tests. This is a fundamental principle.

Why Not?#

| Problem | Consequence |
|---|---|
| Cost | Every test run costs money |
| Speed | Tests take seconds instead of milliseconds |
| Flakiness | Non-deterministic = random failures |
| Availability | Tests fail when the API is down |
| Rate limits | CI/CD can hit rate limits |

The Solution: Mocking#

Replace the LLM call with a predictable fake:

# Production code
result = client.chat(prompt)  # Calls real LLM

# Test code
client.chat = lambda p: '{"result": "mocked"}'  # Returns fixed value
result = client.chat(prompt)  # Uses mock
# Testing the JSON parser (no LLM needed!)

def test_parse_json_response():
    """Test JSON parsing without any LLM calls."""
    
    # Test 1: Valid JSON
    result = parse_json_response('{"key": "value"}')
    assert result == {"key": "value"}, "Basic JSON parsing failed"
    
    # Test 2: With markdown code block
    result = parse_json_response('```json\n{"key": "value"}\n```')
    assert result == {"key": "value"}, "Markdown cleanup failed"
    
    # Test 3: With whitespace
    result = parse_json_response('  \n{"key": "value"}\n  ')
    assert result == {"key": "value"}, "Whitespace handling failed"
    
    # Test 4: Invalid JSON should raise
    try:
        parse_json_response('not json')
        assert False, "Should have raised ValidationError"
    except ValidationError:
        pass  # Expected
    
    print("✓ All JSON parser tests passed!")

def test_validate_schema():
    """Test schema validation without any LLM calls."""
    
    # Test 1: Valid data
    data = {"name": "Alice", "age": 30}
    result = validate_schema(data, required_fields=["name", "age"])
    assert result == data
    
    # Test 2: Missing field
    try:
        validate_schema({"name": "Alice"}, required_fields=["name", "age"])
        assert False, "Should have raised ValidationError"
    except ValidationError as e:
        assert "age" in str(e)
    
    # Test 3: Wrong type
    try:
        validate_schema(
            {"name": "Alice", "age": "thirty"},
            required_fields=["name", "age"],
            field_types={"age": int}
        )
        assert False, "Should have raised ValidationError"
    except ValidationError as e:
        assert "wrong type" in str(e)
    
    print("✓ All schema validation tests passed!")

# Run the tests
test_parse_json_response()
test_validate_schema()
# Testing with a mock client

class MockLLMClient:
    """A mock client for testing LLM-integrated code."""
    
    def __init__(self, responses: dict = None):
        """
        Args:
            responses: Dict mapping prompt substrings to responses
        """
        self.responses = responses or {}
        self.calls = []  # Track calls for verification
    
    def chat(self, prompt: str, **kwargs) -> str:
        """Return a mocked response based on the prompt."""
        self.calls.append({"prompt": prompt, **kwargs})
        
        # Find matching response
        for key, response in self.responses.items():
            if key.lower() in prompt.lower():
                return response
        
        # Default response
        return '{"status": "mocked"}'

# Example: Testing a function that uses LLM
def analyze_sentiment(client, text: str) -> str:
    """Analyze sentiment using LLM."""
    prompt = f"Analyze sentiment: {text}\nReturn JSON: {{\"sentiment\": \"...\"}}"
    response = client.chat(prompt)
    data = parse_json_response(response)
    return data.get("sentiment", "unknown")

# Test with mock
def test_analyze_sentiment():
    mock_client = MockLLMClient(responses={
        "great": '{"sentiment": "positive"}',
        "terrible": '{"sentiment": "negative"}',
        "okay": '{"sentiment": "neutral"}',
    })
    
    assert analyze_sentiment(mock_client, "This is great!") == "positive"
    assert analyze_sentiment(mock_client, "This is terrible!") == "negative"
    assert analyze_sentiment(mock_client, "It's okay.") == "neutral"
    
    # Verify the client was called correctly
    assert len(mock_client.calls) == 3
    
    print("✓ Sentiment analysis tests passed (using mock)!")

test_analyze_sentiment()
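
The hand-rolled MockLLMClient keeps the mechanics visible. Python's standard library offers the same capability via unittest.mock; here is a short sketch against the analyze_sentiment function above (the fixed JSON string is just an illustrative value):

from unittest.mock import MagicMock

# Stand-in for LLMClient: chat() always returns a fixed, parseable response.
fake_client = MagicMock()
fake_client.chat.return_value = '{"sentiment": "positive"}'

assert analyze_sentiment(fake_client, "This is great!") == "positive"
fake_client.chat.assert_called_once()
print("✓ analyze_sentiment also works against a MagicMock")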

6.13 Achieving Determinism#

Even with mocks, you need strategies for making LLM behavior more predictable.

Techniques for Determinism#

| Technique | How | Effectiveness |
|---|---|---|
| Temperature 0 | Set temperature=0.0 | High (but not perfect) |
| Seed parameter | Some APIs support seed=42 | Medium (provider-dependent) |
| Constrained output | JSON mode, function calling | High |
| Caching | Cache responses by prompt hash | Perfect (for repeated calls) |
| Mocking | Replace with fake in tests | Perfect (for tests) |
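
For the seed row above, some backends accept a sampling seed in the request. The payload below sketches how that might look for an Ollama-style request; whether a seed is honored is provider- and model-dependent, so treat the "seed" option as an assumption to verify against your gateway's documentation.

# Hypothetical payload illustrating a seed option (support varies by backend)
payload_with_seed = {
    "model": DEFAULT_MODEL,
    "messages": [{"role": "user", "content": "Explain inflation in one sentence."}],
    "options": {"temperature": 0.0, "seed": 42},
    "stream": False
}
print(json.dumps(payload_with_seed, indent=2))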

import hashlib
from typing import Optional

class CachingLLMClient:
    """LLM client with response caching for determinism."""
    
    def __init__(self, client: LLMClient):
        self.client = client
        self.cache = {}
        self.cache_hits = 0
        self.cache_misses = 0
    
    def _cache_key(self, prompt: str, temperature: float, max_tokens: int) -> str:
        """Generate a cache key from request parameters."""
        key_data = f"{prompt}:{temperature}:{max_tokens}"
        return hashlib.sha256(key_data.encode()).hexdigest()[:16]
    
    def chat(self, prompt: str, temperature: float = 0.0, max_tokens: int = 256) -> str:
        """Chat with caching."""
        key = self._cache_key(prompt, temperature, max_tokens)
        
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        
        self.cache_misses += 1
        response = self.client.chat(prompt, temperature, max_tokens)
        self.cache[key] = response
        return response
    
    def stats(self) -> dict:
        """Return cache statistics."""
        total = self.cache_hits + self.cache_misses
        hit_rate = self.cache_hits / total if total > 0 else 0
        return {
            "hits": self.cache_hits,
            "misses": self.cache_misses,
            "hit_rate": f"{hit_rate:.1%}",
            "cached_responses": len(self.cache)
        }

print("CachingLLMClient defined.")
print("\nBenefits:")
print("  - Same prompt always returns same response")
print("  - Reduces API costs")
print("  - Faster repeated calls")
print("  - Useful for development and testing")

Group 5 — Production Concerns#

Beyond correctness, production systems need logging, cost control, and auditability.

6.14 Cost and Audit Logging#

Every LLM call should be logged for:

| Purpose | What to Log |
|---|---|
| Cost tracking | Tokens used, model, timestamp |
| Debugging | Prompt, response, errors |
| Compliance | User ID, request ID, inputs/outputs |
| Performance | Latency, retry count |
| Security | Suspicious patterns, PII detection |

import logging
from datetime import datetime
from dataclasses import dataclass
from typing import Optional
import uuid

@dataclass
class LLMCallLog:
    """Structured log entry for an LLM call."""
    request_id: str
    timestamp: str
    model: str
    prompt_preview: str  # First N chars of prompt
    response_preview: str  # First N chars of response
    latency_ms: float
    success: bool
    error: Optional[str] = None
    
    def to_dict(self) -> dict:
        return {
            "request_id": self.request_id,
            "timestamp": self.timestamp,
            "model": self.model,
            "prompt_preview": self.prompt_preview,
            "response_preview": self.response_preview,
            "latency_ms": self.latency_ms,
            "success": self.success,
            "error": self.error
        }

class LoggingLLMClient:
    """LLM client with structured logging."""
    
    def __init__(self, client: LLMClient, preview_length: int = 100):
        self.client = client
        self.preview_length = preview_length
        self.logs: list[LLMCallLog] = []
    
    def chat(self, prompt: str, **kwargs) -> str:
        """Chat with logging."""
        request_id = str(uuid.uuid4())[:8]
        start_time = datetime.now()
        
        try:
            response = self.client.chat(prompt, **kwargs)
            latency = (datetime.now() - start_time).total_seconds() * 1000
            
            log = LLMCallLog(
                request_id=request_id,
                timestamp=start_time.isoformat(),
                model=self.client.model,
                prompt_preview=prompt[:self.preview_length],
                response_preview=response[:self.preview_length],
                latency_ms=round(latency, 2),
                success=True
            )
            self.logs.append(log)
            
            return response
            
        except Exception as e:
            latency = (datetime.now() - start_time).total_seconds() * 1000
            
            log = LLMCallLog(
                request_id=request_id,
                timestamp=start_time.isoformat(),
                model=self.client.model,
                prompt_preview=prompt[:self.preview_length],
                response_preview="",
                latency_ms=round(latency, 2),
                success=False,
                error=str(e)
            )
            self.logs.append(log)
            raise
    
    def get_logs(self) -> list[dict]:
        """Return all logs as dictionaries."""
        return [log.to_dict() for log in self.logs]
    
    def summary(self) -> dict:
        """Return summary statistics."""
        if not self.logs:
            return {"total_calls": 0}
        
        successful = [l for l in self.logs if l.success]
        failed = [l for l in self.logs if not l.success]
        latencies = [l.latency_ms for l in successful]
        
        return {
            "total_calls": len(self.logs),
            "successful": len(successful),
            "failed": len(failed),
            "avg_latency_ms": round(sum(latencies) / len(latencies), 2) if latencies else 0,
            "max_latency_ms": round(max(latencies), 2) if latencies else 0
        }

print("LoggingLLMClient defined.")
print("\nFeatures:")
print("  - Structured log entries")
print("  - Request ID tracking")
print("  - Latency measurement")
print("  - Error capture")
print("  - Summary statistics")
# Demonstrate logging (using mock to avoid real API calls)

# Create a mock for demonstration
mock = MockLLMClient(responses={
    "hello": '{"greeting": "Hello!"}',
    "weather": '{"forecast": "Sunny"}',
})
mock.model = "mock-model"  # Add model attribute

# Wrap with logging
logged_client = LoggingLLMClient(mock)

# Make some calls
logged_client.chat("Hello, how are you?")
logged_client.chat("What's the weather like?")
logged_client.chat("Hello again!")

# View logs
print("Call Logs:")
print("=" * 60)
for log in logged_client.get_logs():
    print(f"[{log['request_id']}] {log['prompt_preview'][:30]}... -> {log['latency_ms']}ms")

print("\nSummary:")
print(logged_client.summary())

6.15 Preparing for RAG#

Everything in this module prepares you for building RAG (Retrieval-Augmented Generation) systems.

The RAG Pattern (Recap from Module 5)#

User Question
     ↓
Embed Question (Module 5)
     ↓
Vector Search (Module 5)
     ↓
Build Prompt with Context (Module 6)
     ↓
Call LLM with Retry (Module 6)
     ↓
Validate Response (Module 6)
     ↓
Return to User

What You Now Know#

| Module 5 Skills | Module 6 Skills |
|---|---|
| Generate embeddings | Build robust API clients |
| Semantic search | Handle failures gracefully |
| Vector databases | Enforce structured output |
| Chunking strategies | Validate before using |
| Retrieval evaluation | Test without live models |

Next Steps#

Combining Modules 5 and 6, you can now:

  1. Index documents with embeddings

  2. Retrieve relevant context for any question

  3. Build prompts that ground LLM answers in facts

  4. Call LLMs reliably with proper error handling

  5. Validate and log everything for compliance

# A complete RAG-ready LLM client

def build_rag_prompt(question: str, context_docs: list[str]) -> str:
    """
    Build a RAG prompt with retrieved context.
    
    Args:
        question: User's question
        context_docs: Retrieved relevant documents
    """
    context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context_docs)])
    
    return f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information."

Context:
{context}

Question: {question}

Provide your answer in JSON format:
{{
    "answer": "your answer here",
    "sources_used": [1, 2],
    "confidence": "high|medium|low"
}}

JSON:"""

# Example
context = [
    "The company's Q3 revenue was $4.2 billion, up 15% YoY.",
    "Operating margin improved to 23% from 21% last year.",
    "The CEO announced plans to expand into Asian markets."
]

prompt = build_rag_prompt(
    question="What was the company's revenue in Q3?",
    context_docs=context
)

print("RAG Prompt Example:")
print("=" * 60)
print(prompt)
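
Putting the module's pieces together, the call itself looks roughly like this. It is a sketch that assumes the LLM server from the configuration section is reachable, and it reuses chat_with_retry, parse_json_response, and validate_schema defined earlier.

try:
    raw = client.chat_with_retry(prompt, max_retries=2)
    result = validate_schema(
        parse_json_response(raw),
        required_fields=["answer", "sources_used", "confidence"]
    )
    print("Answer:", result["answer"])
    print("Sources used:", result["sources_used"])
except requests.exceptions.RequestException as e:
    print(f"LLM call failed after retries: {e}")
except ValidationError as e:
    print(f"LLM returned an unusable response: {e}")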

Module Summary#

Key Concepts#

| Concept | What It Means |
|---|---|
| Service mindset | LLMs are external services, not functions |
| Defensive programming | Assume failure, verify success |
| Exponential backoff | Wait longer between each retry |
| Structured output | Request JSON, validate before use |
| Mocking | Test without live LLM calls |
| Audit logging | Track every call for cost/compliance |

The Production LLM Client Checklist#

  • Configuration externalized (no hardcoded secrets)

  • Proper timeouts set (connect and read)

  • Retry logic with exponential backoff

  • Structured output requested (JSON)

  • Response validation before use

  • Error handling for all failure modes

  • Logging for debugging and audit

  • Tests that don’t call live APIs

Enterprise Implications#

| Concern | Solution |
|---|---|
| Cost control | Logging, caching, token limits |
| Reliability | Retries, fallbacks, timeouts |
| Compliance | Audit logs, input/output validation |
| Security | No hardcoded secrets, PII detection |
| Testability | Mocking, dependency injection |


What’s Next#

  1. Quiz — Test your understanding of LLM API patterns

  2. Assessment — Build and test your own LLM client

  3. Module 7 — Putting it all together with a complete RAG system