LLM API Design: Streaming, Structured Output, Error Handling, Rate Limits
Introduction
Designing APIs that wrap large language models requires handling concerns that traditional REST APIs do not face: streaming token-by-token responses, enforcing structured output schemas, managing unpredictable latency, and protecting against runaway costs. This guide covers the four critical pillars of LLM API design with production-ready patterns.
Streaming Responses
Streaming is the standard way to return LLM outputs. Instead of waiting for the full response, the client receives tokens as they are generated:
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from anthropic import AsyncAnthropic

app = FastAPI()
client = AsyncAnthropic(api_key="sk-ant-...")

async def generate_stream(prompt: str):
    # Stream the model response and forward each text delta as an SSE event.
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for chunk in stream:
            if chunk.type == "content_block_delta":
                yield f"data: {chunk.delta.text}\n\n"

@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    return StreamingResponse(
        generate_stream(body["prompt"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",   # stop proxies from caching the stream
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",     # disable nginx response buffering
        },
    )
The Server-Sent Events (SSE) protocol is the most broadly compatible streaming format. Each `data:` line carries one token chunk, and a blank line terminates the event. Browser clients consume the stream with `fetch` and a `ReadableStream` reader; the native `EventSource` API only issues GET requests, so it cannot call the POST endpoint above.
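For non-browser consumers, the same stream can be read line by line over HTTP. A minimal sketch using `httpx`, assuming the `/chat` endpoint above is served at `http://localhost:8000`:

import httpx

async def consume_stream(prompt: str) -> str:
    # Read the SSE stream progressively and reassemble the full completion.
    chunks: list[str] = []
    async with httpx.AsyncClient(timeout=None) as http:
        async with http.stream("POST", "http://localhost:8000/chat", json={"prompt": prompt}) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: "):
                    chunks.append(line[len("data: "):])
    return "".join(chunks)

In a real client you would render each chunk as it arrives rather than joining them at the end; the accumulation here is just to keep the sketch short.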
Structured Output
Raw LLM text is unreliable for programmatic consumption. Use structured output modes to enforce JSON schemas:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ExtractedEntity(BaseModel):
    name: str
    type: str
    confidence: float
    source_text: str

class ExtractionResult(BaseModel):
    entities: list[ExtractedEntity]
    summary: str
    language: str

user_text = "Apple was founded by Steve Jobs and Steve Wozniak in Cupertino in 1976."

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract entities from the text."},
        {"role": "user", "content": user_text},
    ],
    response_format=ExtractionResult,  # the SDK converts the Pydantic model to a JSON schema
)

result: ExtractionResult = response.choices[0].message.parsed
When the API does not support native structured output, use a two-step approach: request JSON in the prompt, then validate and re-request on failure:
import json
from pydantic import BaseModel, ValidationError

def safe_structured_generate(prompt: str, schema: type[BaseModel], max_retries: int = 3):
    # call_llm() and clean_json() are provider-specific helpers (a plain completion
    # call and markdown-fence stripping, respectively) and are not shown here.
    for attempt in range(max_retries):
        raw = call_llm(
            prompt
            + "\n\nRespond in valid JSON matching this schema: "
            + json.dumps(schema.model_json_schema())
        )
        try:
            parsed = json.loads(clean_json(raw))
            return schema.model_validate(parsed)
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise
            # Feed the validation error back so the model can correct itself.
            prompt += f"\n\nPrevious attempt failed: {e}. Please fix the JSON."
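Either path yields a validated Pydantic object. A brief usage sketch, assuming `call_llm` and `clean_json` are wired up to your provider and reusing the `ExtractionResult` schema above:

result = safe_structured_generate(
    "Extract entities from: Apple was founded by Steve Jobs in Cupertino.",
    ExtractionResult,
)
print(result.summary, [e.name for e in result.entities])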
Error Handling
LLM APIs fail in distinctive ways. Build a retry strategy around each failure mode:
import time
from anthropic import Anthropic, APIStatusError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

client = Anthropic()  # synchronous client for this example

class RateLimitError(Exception): pass
class ContextWindowExceeded(Exception): pass
class ContentFilterError(Exception): pass

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(RateLimitError),
)
def call_with_retry(prompt: str) -> str:
    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    except APIStatusError as e:
        if e.status_code == 429:
            # Honor the server's retry-after hint before tenacity's backoff kicks in.
            retry_after = int(e.response.headers.get("retry-after", 5))
            time.sleep(retry_after)
            raise RateLimitError from e
        elif e.status_code == 400 and "context_length_exceeded" in str(e):
            raise ContextWindowExceeded from e
        elif e.status_code == 400 and "content_filter" in str(e):
            raise ContentFilterError from e
        raise
Each error type deserves a different handler: rate limits get exponential backoff, context windows trigger input truncation, and content filter errors should be logged and escalated.
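One way to wire those handlers together at the call site is a simple dispatch over the exceptions defined above. A minimal sketch, in which `truncate_to_fit` and `escalate_to_review` are hypothetical helpers standing in for your own truncation and review logic:

import logging
from fastapi import HTTPException

logger = logging.getLogger(__name__)

def handle_chat(prompt: str) -> str:
    try:
        return call_with_retry(prompt)
    except ContextWindowExceeded:
        # Trim the input and try once more instead of failing outright.
        return call_with_retry(truncate_to_fit(prompt))
    except ContentFilterError as exc:
        # Not retryable: record it and surface it to a human reviewer.
        logger.warning("content filter triggered: %s", exc)
        escalate_to_review(prompt)
        raise
    except RateLimitError:
        # Tenacity's retries are already exhausted here; tell the caller to back off.
        raise HTTPException(status_code=429, detail="Upstream rate limit, try again later")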
Rate Limiting
Protect your API from abuse and cost spikes with layered rate limiting:
from fastapi import HTTPException
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self):
        self.tokens_per_second = 10     # token-bucket refill rate (request layer)
        self.burst_limit = 20           # maximum bucket size
        self.cost_per_token = 0.000003  # USD per model token
        self.daily_budget = 10.0        # USD per user per day
        self.buckets = defaultdict(lambda: {"tokens": float(self.burst_limit), "last": time.monotonic()})
        self.user_usage = defaultdict(float)  # reset daily in a real deployment

    async def check(self, user_id: str, estimated_tokens: int):
        # Layer 1: per-user token bucket on request rate.
        bucket, now = self.buckets[user_id], time.monotonic()
        bucket["tokens"] = min(self.burst_limit, bucket["tokens"] + (now - bucket["last"]) * self.tokens_per_second)
        bucket["last"] = now
        if bucket["tokens"] < 1:
            raise HTTPException(status_code=429, detail="Too many requests")
        bucket["tokens"] -= 1
        # Layer 2: per-user daily cost budget.
        cost = estimated_tokens * self.cost_per_token
        if self.user_usage[user_id] + cost > self.daily_budget:
            raise HTTPException(status_code=429, detail="Daily budget exceeded")
        self.user_usage[user_id] += cost

    def get_usage(self, user_id: str) -> dict:
        return {"cost": self.user_usage[user_id], "budget": self.daily_budget}
rate_limiter = RateLimiter()

@app.post("/chat")
async def chat(request: Request):
    user_id = request.headers.get("X-User-Id", "anonymous")
    body = await request.json()
    # Rough estimate: ~2 tokens per word of input, plus the requested output budget.
    estimated = len(body["prompt"].split()) * 2 + int(body.get("max_tokens", 1024))
    await rate_limiter.check(user_id, estimated)
    # generate_response() is the upstream model call, e.g. the retry wrapper above.
    return await generate_response(body["prompt"])
Conclusion
Designing LLM APIs requires balancing responsiveness with cost control. Stream responses for user experience, enforce structured output for programmatic reliability, implement retry logic calibrated to each error type, and gate access with rate and budget limits. These four patterns form the foundation of any production LLM service.