Module 6 — LLM APIs (Python)
CodeVision Academy
Overview#
If Module 5 taught you how to find relevant information, Module 6 teaches you how to reliably call the AI that uses it.
This module marks the transition from using AI to engineering AI systems.
Up to now, you’ve called LLMs casually—paste a prompt, get a response. That works for demos. It does not work for production systems that must:
Handle failures gracefully
Validate outputs before using them
Control costs and latency
Pass audits and compliance reviews
One Big Idea to Remember#
An LLM is not a function call. It is a remote, rate-limited, probabilistic service. Failure is normal. Correctness is engineered.
Learning Objectives#
By the end of this module, you will be able to:
Explain why LLMs must be treated as external services, not functions
Build a reusable Python client class for LLM APIs
Implement proper error handling with timeouts and retries
Enforce structured JSON outputs and validate responses
Write tests for LLM-integrated code without calling live models
Implement logging for cost tracking and auditability
Apply defensive programming patterns for non-deterministic systems
Prepare code cleanly for RAG integration
Before You Start: LLM Gateway Configuration#
This module requires access to an LLM API. You have two options:
| Option | Model | Best For |
|---|---|---|
| A: Local Ollama | phi3:mini | Running locally, learning API patterns |
| B: JBChat Server | llama3.1:8b | Higher quality, Colab users |
Option A: Local Ollama#
If running Jupyter locally: Use http://localhost:11434 directly.
If running in Google Colab: You must expose Ollama via a tunnel.
Pinggy Setup (required for Colab):
Open a terminal on your local machine
Make sure Ollama is running: ollama serve
Start the tunnel: ssh -p 443 -R0:localhost:11434 a.pinggy.io
Copy the HTTPS URL (e.g., https://xyz-abc.a.pinggy.io)
Use that URL in the config below
Option B: Server Gateway (JBChat)#
If you cannot run Ollama locally:
URL: https://jbchat.jonbowden.com.ngrok.app
Requires API key from instructor
Model: llama3.1:8b
Configure below:#
# ===== LLM GATEWAY CONFIGURATION =====
# ------ OPTION A: Local Ollama ------
LLM_BASE_URL = "http://localhost:11434"
# For Colab with Pinggy tunnel:
# LLM_BASE_URL = "https://your-pinggy-url.a.pinggy.io"
LLM_API_KEY = None # No API key = Ollama mode
DEFAULT_MODEL = "phi3:mini"
# ------ OPTION B: Server Gateway (JBChat) ------
# Uncomment these 3 lines to use the course server:
# LLM_BASE_URL = "https://jbchat.jonbowden.com.ngrok.app"
# LLM_API_KEY = "<provided-by-instructor>"
# DEFAULT_MODEL = "llama3.1:8b"
print(f"Configured: {LLM_BASE_URL}")
print(f"Model: {DEFAULT_MODEL}")
print(f"Mode: {'JBChat' if LLM_API_KEY else 'Ollama'}")
Configured: http://localhost:11434
Model: phi3:mini
Mode: Ollama
Group 1 — The Service Mindset#
Before we write code, we need to understand why LLM integration is fundamentally different from calling a local function.
6.1 From Model Calls to Service Contracts#
When you call a local function, you expect:
Instant response
Deterministic output
No network failures
No rate limits
When you call an LLM API, you face:
| Challenge | Reality |
|---|---|
| Latency | 1-30+ seconds per call |
| Availability | Services go down, networks fail |
| Rate limits | Too many calls = blocked |
| Non-determinism | Same input can yield different outputs |
| Cost | Every token costs money |
| Output format | No guarantee of structure |
The Mindset Shift#
WRONG MENTAL MODEL:

result = llm(prompt)
use(result)

RIGHT MENTAL MODEL:

try:
    result = llm_with_retry(prompt)
    validated = parse_and_validate(result)
    use(validated)
except LLMError:
    handle_gracefully()
Enterprise Implications#
In production systems, you must design for:
Graceful degradation — What happens when the LLM is down?
Timeout budgets — How long can users wait?
Fallback strategies — Can you use cached responses?
Cost controls — How do you prevent runaway API bills?
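As a rough sketch of the first three points, the helper below tries the live call, then a cached answer, then a static message. The names answer_with_fallback, llm_call, and response_cache are illustrative only and are not part of the client built later in this module.

# Minimal sketch of graceful degradation (illustrative names, not the course client)
def answer_with_fallback(prompt: str, llm_call, response_cache: dict) -> str:
    try:
        return llm_call(prompt)  # primary path: the live LLM
    except Exception:
        if prompt in response_cache:
            return response_cache[prompt]  # fallback 1: a previously cached answer
        return "The assistant is temporarily unavailable. Please try again later."  # fallback 2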
6.2 Anatomy of an LLM API Request#
Every LLM API call is fundamentally JSON over HTTP. Understanding the structure helps you debug issues and optimize performance.
Request Components#
| Component | Purpose | Example |
|---|---|---|
| Endpoint | Where to send the request | /api/chat |
| Headers | Authentication, content type | Content-Type: application/json |
| Model | Which model to use | phi3:mini |
| Messages | The conversation/prompt | [{"role": "user", "content": "..."}] |
| Temperature | Randomness (0=deterministic) | 0.0 |
| Max tokens | Output length limit | 100 |
Standard Payload Structure#
# A typical LLM API payload
payload = {
"model": "phi3:mini",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain inflation in one sentence."}
],
"temperature": 0.0, # Deterministic
"max_tokens": 100 # Limit response length
}
import json
print("Request payload:")
print(json.dumps(payload, indent=2))
Request payload:
{
"model": "phi3:mini",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain inflation in one sentence."
}
],
"temperature": 0.0,
"max_tokens": 100
}
Response Structure#
The response also follows a standard structure:
{
"model": "phi3:mini",
"message": {
"role": "assistant",
"content": "Inflation is the rate at which prices rise over time."
},
"done": true,
"total_duration": 1234567890
}
Different providers have slightly different response formats, but the core pattern is the same.
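For comparison, an OpenAI-style response nests the message inside a choices list. The example below is illustrative only (field names vary by provider and version); the client built in Section 6.6 checks for both shapes when extracting the content.

# Illustrative OpenAI-style response shape (field names vary by provider)
openai_style_response = {
    "model": "example-model",
    "choices": [
        {"message": {"role": "assistant", "content": "Inflation is the rate at which prices rise over time."}}
    ]
}

def extract_content(data: dict) -> str:
    """Pull the assistant text out of either an Ollama-style or OpenAI-style response."""
    if "message" in data:  # Ollama-style
        return data["message"]["content"]
    return data["choices"][0]["message"]["content"]  # OpenAI-style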
6.3 Configuration Discipline#
Never hardcode configuration. This is a fundamental principle for maintainable systems.
Why Configuration Matters#
| Hardcoded | Configurable |
|---|---|
| Change requires code edit | Change via environment |
| Secrets in source code | Secrets in secure storage |
| Same settings everywhere | Dev/staging/prod can differ |
| Hard to test | Easy to mock |
The Configuration Pattern#
import os
# Configuration from environment (with fallbacks)
class LLMConfig:
"""Centralized LLM configuration."""
BASE_URL = os.getenv("LLM_BASE_URL", LLM_BASE_URL)
API_KEY = os.getenv("LLM_API_KEY", LLM_API_KEY)
MODEL = os.getenv("LLM_MODEL", DEFAULT_MODEL)
# Operational defaults
DEFAULT_TEMPERATURE = 0.0
DEFAULT_MAX_TOKENS = 256
DEFAULT_TIMEOUT = (5, 60) # (connect, read) in seconds
MAX_RETRIES = 3
print(f"Config loaded:")
print(f" BASE_URL: {LLMConfig.BASE_URL}")
print(f" MODEL: {LLMConfig.MODEL}")
print(f" API_KEY: {'***' if LLMConfig.API_KEY else 'None (Ollama mode)'}")
Config loaded:
BASE_URL: http://localhost:11434
MODEL: phi3:mini
API_KEY: None (Ollama mode)
Configuration Best Practices#
Use environment variables for anything that varies by environment
Provide sensible defaults for development
Never commit secrets to version control
Validate configuration at startup, not at first use
Document required variables in README or setup scripts
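A minimal sketch of the "validate at startup" rule, assuming the LLMConfig class defined above: fail fast with a clear message instead of a confusing network error on the first call. validate_config is an illustrative helper, not part of the course client.

# Fail fast if configuration is obviously broken (illustrative helper)
def validate_config(config) -> None:
    if not config.BASE_URL or not config.BASE_URL.startswith(("http://", "https://")):
        raise RuntimeError(f"LLM_BASE_URL looks invalid: {config.BASE_URL!r}")
    if not config.MODEL:
        raise RuntimeError("LLM_MODEL is not set")
    # API_KEY may legitimately be None (Ollama mode), so it is not checked here.

validate_config(LLMConfig)  # run once at startup, not at first use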
Group 2 — Building a Robust Client#
Now we build a reusable client class that encapsulates all the complexity of LLM communication.
6.4 Client Class Rationale#
Why wrap LLM calls in a class instead of simple functions?
| Approach | Pros | Cons |
|---|---|---|
| Raw requests | Simple, direct | Repeated code, no encapsulation |
| Functions | Reusable | State management is awkward |
| Client class | Encapsulated, testable, extensible | Slightly more setup |
What a Good Client Provides#
Encapsulation — Hide HTTP details from business logic
Configuration — Centralized settings management
Retry logic — Automatic handling of transient failures
Logging — Consistent audit trail
Testability — Easy to mock for unit tests
import requests
import time
import json
from typing import Optional, Dict, Any, List
class LLMClient:
"""
A robust client for LLM API interactions.
Handles:
- Configuration management
- Request construction
- Error handling
- Retry logic
- Response parsing
"""
def __init__(
self,
base_url: str,
api_key: Optional[str] = None,
model: str = "phi3:mini",
timeout: tuple = (5, 60)
):
"""
Initialize the LLM client.
Args:
base_url: API endpoint base URL
api_key: Optional API key (None for Ollama)
model: Model identifier
timeout: (connect_timeout, read_timeout) in seconds
"""
self.base_url = base_url.rstrip('/')
self.api_key = api_key
self.model = model
self.timeout = timeout
# Detect mode based on API key
self._use_jbchat = api_key is not None
def __repr__(self):
mode = "JBChat" if self._use_jbchat else "Ollama"
return f"LLMClient(mode={mode}, model={self.model})"
# Create a client instance
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
print(f"Client created: {client}")
Client created: LLMClient(mode=Ollama, model=phi3:mini)
6.5 Request Construction#
Building requests correctly is crucial. Different APIs have different formats, so we encapsulate this complexity.
Key Considerations#
| Aspect | Why It Matters |
|---|---|
| Headers | Authentication, content negotiation |
| Endpoint | Different APIs use different paths |
| Payload format | Ollama vs OpenAI vs others differ |
| Timeout tuning | Connect fast, allow long reads |
# Add request construction methods to our client
class LLMClient(LLMClient): # Extending previous definition
def _build_headers(self) -> Dict[str, str]:
"""Build HTTP headers for the request."""
headers = {
"Content-Type": "application/json",
# Bypass tunnel browser warnings
"ngrok-skip-browser-warning": "true",
"Bypass-Tunnel-Reminder": "true",
}
if self.api_key:
headers["X-API-Key"] = self.api_key
return headers
def _build_payload(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
system_prompt: Optional[str] = None
) -> Dict[str, Any]:
"""Build the request payload."""
# Build messages list
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": prompt})
if self._use_jbchat:
# JBChat/OpenAI format
return {
"model": self.model,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
"stream": False
}
else:
# Ollama format
return {
"model": self.model,
"messages": messages,
"options": {"temperature": temperature},
"stream": False
}
def _get_endpoint(self) -> str:
"""Get the correct API endpoint."""
if self._use_jbchat:
return f"{self.base_url}/chat/direct"
else:
return f"{self.base_url}/api/chat"
# Recreate client with new methods
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
# Show what a request looks like
print("Endpoint:", client._get_endpoint())
print("\nHeaders:", json.dumps(client._build_headers(), indent=2))
print("\nPayload:", json.dumps(client._build_payload("Hello"), indent=2))
Endpoint: http://localhost:11434/api/chat
Headers: {
"Content-Type": "application/json",
"ngrok-skip-browser-warning": "true",
"Bypass-Tunnel-Reminder": "true"
}
Payload: {
"model": "phi3:mini",
"messages": [
{
"role": "user",
"content": "Hello"
}
],
"options": {
"temperature": 0.0
},
"stream": false
}
6.6 Making Safe API Calls#
The actual API call must handle many potential failures:
| Failure Type | Cause | Handling |
|---|---|---|
| Connection timeout | Network issues, server down | Retry with backoff |
| Read timeout | Slow response, overloaded server | Increase timeout or retry |
| HTTP 429 | Rate limited | Back off, then retry |
| HTTP 500 | Server error | Retry with backoff |
| HTTP 401/403 | Auth failure | Don’t retry, fix config |
| Invalid JSON | Malformed response | Log and raise |
class LLMClient(LLMClient): # Extending again
def chat(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
system_prompt: Optional[str] = None
) -> str:
"""
Send a chat request and return the response content.
Args:
prompt: User message
temperature: Randomness (0.0 = deterministic)
max_tokens: Maximum response length
system_prompt: Optional system message
Returns:
The assistant's response text
Raises:
requests.exceptions.RequestException: On network/HTTP errors
ValueError: On invalid response format
"""
response = requests.post(
self._get_endpoint(),
headers=self._build_headers(),
json=self._build_payload(prompt, temperature, max_tokens, system_prompt),
timeout=self.timeout
)
# Raise exception for HTTP errors (4xx, 5xx)
response.raise_for_status()
# Parse response
data = response.json()
# Extract content (handle different response formats)
if "message" in data and "content" in data["message"]:
return data["message"]["content"]
elif "choices" in data: # OpenAI format
return data["choices"][0]["message"]["content"]
else:
raise ValueError(f"Unexpected response format: {data}")
# Recreate and test
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
# Test the client
try:
response = client.chat("Say 'API connected' in exactly two words.")
print(f"Response: {response}")
except Exception as e:
print(f"Error: {e}")
print("\nMake sure your LLM server is running!")
Error: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6b92159030>: Failed to establish a new connection: [Errno 111] Connection refused'))
Make sure your LLM server is running!
6.7 Failure as the Default Assumption#
In distributed systems, the question is not if things will fail, but when and how often.
Types of Failures#
| Type | Example | Frequency |
|---|---|---|
| Transient | Network hiccup, brief overload | Common (retry helps) |
| Persistent | Server down, config error | Less common (retry won’t help) |
| Partial | Slow response, truncated output | Common (timeout/validate) |
| Silent | Wrong answer, hallucination | Common (validation needed) |
The Defensive Mindset#
# WRONG: Assume success
result = client.chat(prompt)
use(result)
# RIGHT: Assume failure, verify success
try:
result = client.chat(prompt)
validated = validate(result)
use(validated)
except TransientError:
retry()
except PermanentError:
fallback()
# Demonstrating what failures look like
import requests
def demonstrate_failures():
    """Show common LLM API failure modes."""
    print("Common LLM API Failures:")
    print("=" * 50)
    # 1. Connection refused (server not running)
    # Use a valid port that should have nothing listening on it
    # (ports above 65535 are invalid and raise InvalidURL instead of ConnectionError).
    try:
        requests.post("http://localhost:9999/api/chat", timeout=1)
    except requests.exceptions.ConnectionError:
        print("1. ConnectionError: Server not reachable")
    # 2. Timeout (the tiny timeout forces a failure; a flaky network may surface
    #    this as a ConnectionError instead, so catch both)
    try:
        requests.get("https://httpstat.us/200?sleep=5000", timeout=0.1)
    except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        print("2. Timeout: Server too slow")
    # 3. HTTP errors (described rather than triggered)
    print("3. HTTP 429: Rate limit exceeded (too many requests)")
    print("4. HTTP 500: Internal server error")
    print("5. HTTP 401: Authentication failed")
    print("\n" + "=" * 50)
    print("All of these require proper handling!")
demonstrate_failures()
Common LLM API Failures:
==================================================
1. ConnectionError: Server not reachable
2. Timeout: Server too slow
3. HTTP 429: Rate limit exceeded (too many requests)
4. HTTP 500: Internal server error
5. HTTP 401: Authentication failed

==================================================
All of these require proper handling!
6.8 Retry with Exponential Backoff#
Exponential backoff is the standard pattern for handling transient failures:
Try the operation
If it fails, wait a short time and retry
If it fails again, wait longer (exponentially)
After N retries, give up
Why Exponential?#
| Attempt | Wait Time | Cumulative |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 1s | 1s |
| 3 | 2s | 3s |
| 4 | 4s | 7s |
| 5 | 8s | 15s |
This gives the server time to recover while not waiting forever.
import time
from typing import Callable, TypeVar
T = TypeVar('T')
def retry_with_backoff(
fn: Callable[[], T],
max_retries: int = 3,
base_delay: float = 1.0,
max_delay: float = 30.0,
retryable_exceptions: tuple = (requests.exceptions.RequestException,)
) -> T:
"""
Execute a function with exponential backoff retry.
Args:
fn: Function to execute (no arguments)
max_retries: Maximum number of retry attempts
base_delay: Initial delay in seconds
max_delay: Maximum delay between retries
retryable_exceptions: Tuple of exceptions that trigger retry
Returns:
Result of fn() on success
Raises:
The last exception if all retries fail
"""
last_exception = None
for attempt in range(max_retries + 1):
try:
return fn()
except retryable_exceptions as e:
last_exception = e
if attempt == max_retries:
break # Don't sleep after last attempt
# Calculate delay with exponential backoff
delay = min(base_delay * (2 ** attempt), max_delay)
print(f" Attempt {attempt + 1} failed: {e}")
print(f" Retrying in {delay:.1f}s...")
time.sleep(delay)
raise last_exception
# Add retry method to client
class LLMClient(LLMClient):
def chat_with_retry(
self,
prompt: str,
temperature: float = 0.0,
max_tokens: int = 256,
max_retries: int = 3
) -> str:
"""Chat with automatic retry on transient failures."""
return retry_with_backoff(
fn=lambda: self.chat(prompt, temperature, max_tokens),
max_retries=max_retries
)
# Recreate client
client = LLMClient(
base_url=LLMConfig.BASE_URL,
api_key=LLMConfig.API_KEY,
model=LLMConfig.MODEL
)
print("Retry function defined.")
print("\nExample usage:")
print(' result = client.chat_with_retry("Your prompt here")')
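To see the backoff behaviour without a live server, you can hand retry_with_backoff a deliberately flaky function. flaky_call below is a throwaway stub invented for this demonstration; it fails twice with a retryable error and then succeeds.

# Demonstration stub: fails twice, then succeeds (no server required)
attempts = {"count": 0}

def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise requests.exceptions.ConnectionError("simulated transient failure")
    return f"succeeded on attempt {attempts['count']}"

# base_delay is kept small so the demo finishes quickly
print(retry_with_backoff(flaky_call, max_retries=3, base_delay=0.1))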
Group 3 — Structured Output and Validation#
Getting a response is only half the battle. The response must be usable.
6.9 The Necessity of Structured Output#
LLMs naturally produce free-form text. That’s great for chatbots. It’s terrible for software systems.
The Problem#
# You asked for a summary
response = "Here's a summary of the document. The main points are..."
# How do you extract the actual summary programmatically?
# What if the format changes?
# What if there's extra text?
The Solution: JSON#
# Ask for JSON
response = '{"summary": "The main points are...", "confidence": 0.85}'
# Now you can parse and use it reliably
data = json.loads(response)
summary = data["summary"]
Why JSON?#
| Format | Pros | Cons |
|---|---|---|
| Free text | Natural, flexible | Hard to parse, unreliable |
| JSON | Structured, parseable, typed | LLM may not comply |
| XML | Structured, handles nesting | Verbose, harder for LLMs |
| YAML | Readable, structured | Whitespace-sensitive |
6.10 JSON Enforcement in Prompts#
The key to getting JSON output is explicit instruction and schema specification.
Prompt Patterns for JSON#
| Pattern | Reliability |
|---|---|
| “Return JSON” | Low |
| “Return ONLY valid JSON: {schema}” | Medium |
| “Return ONLY valid JSON. No other text. Schema: {schema}” | High |
| System prompt + user prompt + schema | Highest |
# Template for JSON-enforced prompts
def build_json_prompt(task: str, schema: dict, data: str = None) -> str:
"""
Build a prompt that requests JSON output.
Args:
task: What to do
schema: Expected JSON structure
data: Optional data to process
"""
schema_str = json.dumps(schema, indent=2)
prompt = f"""Task: {task}
Return ONLY valid JSON matching this schema:
{schema_str}
Rules:
- Return ONLY the JSON object
- No markdown, no explanations, no extra text
- All fields are required
"""
if data:
prompt += f"\nData to process:\n{data}\n"
prompt += "\nJSON:"
return prompt
# Example: Sentiment analysis with structured output
schema = {
"sentiment": "positive | negative | neutral",
"confidence": "float between 0 and 1",
"key_phrases": ["list", "of", "phrases"]
}
prompt = build_json_prompt(
task="Analyze the sentiment of the following text",
schema=schema,
data="The product exceeded my expectations. Great value!"
)
print("JSON-enforced prompt:")
print("=" * 50)
print(prompt)
# Test JSON output with the LLM
try:
response = client.chat(prompt, temperature=0.0)
print("LLM Response:")
print(response)
# Try to parse it
parsed = json.loads(response)
print("\nParsed successfully!")
print(f"Sentiment: {parsed.get('sentiment')}")
print(f"Confidence: {parsed.get('confidence')}")
except json.JSONDecodeError as e:
print(f"JSON parsing failed: {e}")
print("The LLM did not return valid JSON.")
except Exception as e:
print(f"Error: {e}")
6.11 Validation Before Use#
Even when the LLM returns valid JSON, you must validate it before using it in your application.
Validation Layers#
| Layer | Checks | Example |
|---|---|---|
| Syntax | Is it valid JSON? | json.loads() |
| Schema | Are required fields present? | Check keys exist |
| Types | Are values the right type? | isinstance() checks |
| Values | Are values in valid ranges? | Business logic |
| Semantic | Does it make sense? | Domain validation |
from typing import Any
class ValidationError(Exception):
"""Raised when LLM output fails validation."""
pass
def parse_json_response(text: str) -> dict:
"""
Parse JSON from LLM response.
Handles common issues:
- Markdown code blocks
- Leading/trailing whitespace
- Common escape issues
"""
# Clean up common LLM output artifacts
text = text.strip()
# Remove markdown code blocks if present
if text.startswith("```json"):
text = text[7:]
if text.startswith("```"):
text = text[3:]
if text.endswith("```"):
text = text[:-3]
text = text.strip()
try:
return json.loads(text)
except json.JSONDecodeError as e:
raise ValidationError(f"Invalid JSON: {e}")
def validate_schema(data: dict, required_fields: list, field_types: dict = None) -> dict:
"""
Validate that data matches expected schema.
Args:
data: Parsed JSON data
required_fields: List of required field names
field_types: Optional dict of field_name -> expected_type
"""
# Check required fields
missing = [f for f in required_fields if f not in data]
if missing:
raise ValidationError(f"Missing required fields: {missing}")
# Check types if specified
if field_types:
for field, expected_type in field_types.items():
if field in data and not isinstance(data[field], expected_type):
                actual = type(data[field]).__name__
                # expected_type may be a tuple of types (e.g. (int, float)), which has no __name__
                expected = getattr(expected_type, "__name__", str(expected_type))
raise ValidationError(
f"Field '{field}' has wrong type: expected {expected}, got {actual}"
)
return data
# Example validation
test_responses = [
'{"sentiment": "positive", "confidence": 0.9}', # Valid
'{"sentiment": "positive"}', # Missing confidence
'{"sentiment": "positive", "confidence": "high"}', # Wrong type
'Here is the JSON: {"sentiment": "positive"}', # Extra text
]
print("Validation Examples:")
print("=" * 50)
for response in test_responses:
print(f"\nInput: {response[:50]}...")
try:
data = parse_json_response(response)
validated = validate_schema(
data,
required_fields=["sentiment", "confidence"],
field_types={"confidence": (int, float)}
)
print(f" ✓ Valid: {validated}")
except ValidationError as e:
print(f" ✗ Invalid: {e}")
Group 4 — Testing and Determinism#
How do you test code that calls a non-deterministic external service?
6.12 Testing Without Live Models#
Never call live LLMs in unit tests. This is a fundamental principle.
Why Not?#
| Problem | Consequence |
|---|---|
| Cost | Every test run costs money |
| Speed | Tests take seconds instead of milliseconds |
| Flakiness | Non-deterministic = random failures |
| Availability | Tests fail when API is down |
| Rate limits | CI/CD can hit rate limits |
The Solution: Mocking#
Replace the LLM call with a predictable fake:
# Production code
result = client.chat(prompt) # Calls real LLM
# Test code
client.chat = lambda p: '{"result": "mocked"}' # Returns fixed value
result = client.chat(prompt) # Uses mock
# Testing the JSON parser (no LLM needed!)
def test_parse_json_response():
"""Test JSON parsing without any LLM calls."""
# Test 1: Valid JSON
result = parse_json_response('{"key": "value"}')
assert result == {"key": "value"}, "Basic JSON parsing failed"
# Test 2: With markdown code block
result = parse_json_response('```json\n{"key": "value"}\n```')
assert result == {"key": "value"}, "Markdown cleanup failed"
# Test 3: With whitespace
result = parse_json_response(' \n{"key": "value"}\n ')
assert result == {"key": "value"}, "Whitespace handling failed"
# Test 4: Invalid JSON should raise
try:
parse_json_response('not json')
assert False, "Should have raised ValidationError"
except ValidationError:
pass # Expected
print("✓ All JSON parser tests passed!")
def test_validate_schema():
"""Test schema validation without any LLM calls."""
# Test 1: Valid data
data = {"name": "Alice", "age": 30}
result = validate_schema(data, required_fields=["name", "age"])
assert result == data
# Test 2: Missing field
try:
validate_schema({"name": "Alice"}, required_fields=["name", "age"])
assert False, "Should have raised ValidationError"
except ValidationError as e:
assert "age" in str(e)
# Test 3: Wrong type
try:
validate_schema(
{"name": "Alice", "age": "thirty"},
required_fields=["name", "age"],
field_types={"age": int}
)
assert False, "Should have raised ValidationError"
except ValidationError as e:
assert "wrong type" in str(e)
print("✓ All schema validation tests passed!")
# Run the tests
test_parse_json_response()
test_validate_schema()
# Testing with a mock client
class MockLLMClient:
"""A mock client for testing LLM-integrated code."""
def __init__(self, responses: dict = None):
"""
Args:
responses: Dict mapping prompt substrings to responses
"""
self.responses = responses or {}
self.calls = [] # Track calls for verification
def chat(self, prompt: str, **kwargs) -> str:
"""Return a mocked response based on the prompt."""
self.calls.append({"prompt": prompt, **kwargs})
# Find matching response
for key, response in self.responses.items():
if key.lower() in prompt.lower():
return response
# Default response
return '{"status": "mocked"}'
# Example: Testing a function that uses LLM
def analyze_sentiment(client, text: str) -> str:
"""Analyze sentiment using LLM."""
prompt = f"Analyze sentiment: {text}\nReturn JSON: {{\"sentiment\": \"...\"}}"
response = client.chat(prompt)
data = parse_json_response(response)
return data.get("sentiment", "unknown")
# Test with mock
def test_analyze_sentiment():
mock_client = MockLLMClient(responses={
"great": '{"sentiment": "positive"}',
"terrible": '{"sentiment": "negative"}',
"okay": '{"sentiment": "neutral"}',
})
assert analyze_sentiment(mock_client, "This is great!") == "positive"
assert analyze_sentiment(mock_client, "This is terrible!") == "negative"
assert analyze_sentiment(mock_client, "It's okay.") == "neutral"
# Verify the client was called correctly
assert len(mock_client.calls) == 3
print("✓ Sentiment analysis tests passed (using mock)!")
test_analyze_sentiment()
6.13 Achieving Determinism#
Even with mocks, you need strategies for making LLM behavior more predictable.
Techniques for Determinism#
| Technique | How | Effectiveness |
|---|---|---|
| Temperature 0 | Set temperature=0.0 | High (but not perfect) |
| Seed parameter | Some APIs support a fixed seed | Medium (provider-dependent) |
| Constrained output | JSON mode, function calling | High |
| Caching | Cache responses by prompt hash | Perfect (for repeated calls) |
| Mocking | Replace with fake in tests | Perfect (for tests) |
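Where the provider supports it, a fixed seed can be combined with temperature 0. The payload below sketches the idea for an Ollama-style request; treat the seed option as provider-dependent and check your API's documentation, since it is not wired into the course client.

# Provider-dependent sketch: a fixed seed plus temperature 0 for more repeatable output
seeded_payload = {
    "model": "phi3:mini",
    "messages": [{"role": "user", "content": "Explain inflation in one sentence."}],
    "options": {"temperature": 0.0, "seed": 42},  # "seed" support varies by provider
    "stream": False,
}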
import hashlib
from typing import Optional
class CachingLLMClient:
"""LLM client with response caching for determinism."""
def __init__(self, client: LLMClient):
self.client = client
self.cache = {}
self.cache_hits = 0
self.cache_misses = 0
def _cache_key(self, prompt: str, temperature: float, max_tokens: int) -> str:
"""Generate a cache key from request parameters."""
key_data = f"{prompt}:{temperature}:{max_tokens}"
return hashlib.sha256(key_data.encode()).hexdigest()[:16]
def chat(self, prompt: str, temperature: float = 0.0, max_tokens: int = 256) -> str:
"""Chat with caching."""
key = self._cache_key(prompt, temperature, max_tokens)
if key in self.cache:
self.cache_hits += 1
return self.cache[key]
self.cache_misses += 1
response = self.client.chat(prompt, temperature, max_tokens)
self.cache[key] = response
return response
def stats(self) -> dict:
"""Return cache statistics."""
total = self.cache_hits + self.cache_misses
hit_rate = self.cache_hits / total if total > 0 else 0
return {
"hits": self.cache_hits,
"misses": self.cache_misses,
"hit_rate": f"{hit_rate:.1%}",
"cached_responses": len(self.cache)
}
print("CachingLLMClient defined.")
print("\nBenefits:")
print(" - Same prompt always returns same response")
print(" - Reduces API costs")
print(" - Faster repeated calls")
print(" - Useful for development and testing")
Group 5 — Production Concerns#
Beyond correctness, production systems need logging, cost control, and auditability.
6.14 Cost and Audit Logging#
Every LLM call should be logged for:
| Purpose | What to Log |
|---|---|
| Cost tracking | Tokens used, model, timestamp |
| Debugging | Prompt, response, errors |
| Compliance | User ID, request ID, inputs/outputs |
| Performance | Latency, retry count |
| Security | Suspicious patterns, PII detection |
import logging
from datetime import datetime
from dataclasses import dataclass
from typing import Optional
import uuid
@dataclass
class LLMCallLog:
"""Structured log entry for an LLM call."""
request_id: str
timestamp: str
model: str
prompt_preview: str # First N chars of prompt
response_preview: str # First N chars of response
latency_ms: float
success: bool
error: Optional[str] = None
def to_dict(self) -> dict:
return {
"request_id": self.request_id,
"timestamp": self.timestamp,
"model": self.model,
"prompt_preview": self.prompt_preview,
"response_preview": self.response_preview,
"latency_ms": self.latency_ms,
"success": self.success,
"error": self.error
}
class LoggingLLMClient:
"""LLM client with structured logging."""
def __init__(self, client: LLMClient, preview_length: int = 100):
self.client = client
self.preview_length = preview_length
self.logs: list[LLMCallLog] = []
def chat(self, prompt: str, **kwargs) -> str:
"""Chat with logging."""
request_id = str(uuid.uuid4())[:8]
start_time = datetime.now()
try:
response = self.client.chat(prompt, **kwargs)
latency = (datetime.now() - start_time).total_seconds() * 1000
log = LLMCallLog(
request_id=request_id,
timestamp=start_time.isoformat(),
model=self.client.model,
prompt_preview=prompt[:self.preview_length],
response_preview=response[:self.preview_length],
latency_ms=round(latency, 2),
success=True
)
self.logs.append(log)
return response
except Exception as e:
latency = (datetime.now() - start_time).total_seconds() * 1000
log = LLMCallLog(
request_id=request_id,
timestamp=start_time.isoformat(),
model=self.client.model,
prompt_preview=prompt[:self.preview_length],
response_preview="",
latency_ms=round(latency, 2),
success=False,
error=str(e)
)
self.logs.append(log)
raise
def get_logs(self) -> list[dict]:
"""Return all logs as dictionaries."""
return [log.to_dict() for log in self.logs]
def summary(self) -> dict:
"""Return summary statistics."""
if not self.logs:
return {"total_calls": 0}
successful = [l for l in self.logs if l.success]
failed = [l for l in self.logs if not l.success]
latencies = [l.latency_ms for l in successful]
return {
"total_calls": len(self.logs),
"successful": len(successful),
"failed": len(failed),
"avg_latency_ms": round(sum(latencies) / len(latencies), 2) if latencies else 0,
"max_latency_ms": round(max(latencies), 2) if latencies else 0
}
print("LoggingLLMClient defined.")
print("\nFeatures:")
print(" - Structured log entries")
print(" - Request ID tracking")
print(" - Latency measurement")
print(" - Error capture")
print(" - Summary statistics")
# Demonstrate logging (using mock to avoid real API calls)
# Create a mock for demonstration
mock = MockLLMClient(responses={
"hello": '{"greeting": "Hello!"}',
"weather": '{"forecast": "Sunny"}',
})
mock.model = "mock-model" # Add model attribute
# Wrap with logging
logged_client = LoggingLLMClient(mock)
# Make some calls
logged_client.chat("Hello, how are you?")
logged_client.chat("What's the weather like?")
logged_client.chat("Hello again!")
# View logs
print("Call Logs:")
print("=" * 60)
for log in logged_client.get_logs():
print(f"[{log['request_id']}] {log['prompt_preview'][:30]}... -> {log['latency_ms']}ms")
print("\nSummary:")
print(logged_client.summary())
6.15 Preparing for RAG#
Everything in this module prepares you for building RAG (Retrieval-Augmented Generation) systems.
The RAG Pattern (Recap from Module 5)#
User Question
↓
Embed Question (Module 5)
↓
Vector Search (Module 5)
↓
Build Prompt with Context (Module 6)
↓
Call LLM with Retry (Module 6)
↓
Validate Response (Module 6)
↓
Return to User
What You Now Know#
| Module 5 Skills | Module 6 Skills |
|---|---|
| Generate embeddings | Build robust API clients |
| Semantic search | Handle failures gracefully |
| Vector databases | Enforce structured output |
| Chunking strategies | Validate before using |
| Retrieval evaluation | Test without live models |
Next Steps#
Combining Modules 5 and 6, you can now:
Index documents with embeddings
Retrieve relevant context for any question
Build prompts that ground LLM answers in facts
Call LLMs reliably with proper error handling
Validate and log everything for compliance
# A complete RAG-ready LLM client
def build_rag_prompt(question: str, context_docs: list[str]) -> str:
"""
Build a RAG prompt with retrieved context.
Args:
question: User's question
context_docs: Retrieved relevant documents
"""
context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context_docs)])
return f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information."
Context:
{context}
Question: {question}
Provide your answer in JSON format:
{{
"answer": "your answer here",
"sources_used": [1, 2],
"confidence": "high|medium|low"
}}
JSON:"""
# Example
context = [
"The company's Q3 revenue was $4.2 billion, up 15% YoY.",
"Operating margin improved to 23% from 21% last year.",
"The CEO announced plans to expand into Asian markets."
]
prompt = build_rag_prompt(
question="What was the company's revenue in Q3?",
context_docs=context
)
print("RAG Prompt Example:")
print("=" * 60)
print(prompt)
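Putting the pieces together, the sketch below outlines what a RAG answer function can look like; retrieve_context stands in for the Module 5 vector search and is a placeholder argument here, not a function defined in this course.

# End-to-end sketch: retrieve -> build prompt -> call with retry -> validate
def answer_question(client, question: str, retrieve_context) -> dict:
    docs = retrieve_context(question)           # Module 5: vector search (placeholder)
    prompt = build_rag_prompt(question, docs)   # Module 6: grounded prompt
    response = client.chat_with_retry(prompt)   # Module 6: robust call
    data = parse_json_response(response)        # Module 6: parse
    return validate_schema(                     # Module 6: validate before use
        data,
        required_fields=["answer", "sources_used", "confidence"],
    )

# Example wiring, reusing the fixed context list above in place of a real retriever:
# result = answer_question(client, "What was the company's revenue in Q3?", lambda q: context)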
Module Summary#
Key Concepts#
| Concept | What It Means |
|---|---|
| Service mindset | LLMs are external services, not functions |
| Defensive programming | Assume failure, verify success |
| Exponential backoff | Wait longer between each retry |
| Structured output | Request JSON, validate before use |
| Mocking | Test without live LLM calls |
| Audit logging | Track every call for cost/compliance |
The Production LLM Client Checklist#
Configuration externalized (no hardcoded secrets)
Proper timeouts set (connect and read)
Retry logic with exponential backoff
Structured output requested (JSON)
Response validation before use
Error handling for all failure modes
Logging for debugging and audit
Tests that don’t call live APIs
Enterprise Implications#
| Concern | Solution |
|---|---|
| Cost control | Logging, caching, token limits |
| Reliability | Retries, fallbacks, timeouts |
| Compliance | Audit logs, input/output validation |
| Security | No hardcoded secrets, PII detection |
| Testability | Mocking, dependency injection |
What’s Next#
Quiz — Test your understanding of LLM API patterns
Assessment — Build and test your own LLM client
Module 7 — Putting it all together with a complete RAG system