An LLM Gateway is a single API in front of multiple model providers (OpenAI, Anthropic, etc). One endpoint for clients, smart routing behind it: rate limiting, automatic failover, caching, async jobs. Think Stripe but for LLM calls.
Real components used: Node.js stateless gateway, AWS ALB, PostgreSQL primary + read replicas, Redis for rate limits / cache / streams, and OpenAI + Anthropic upstreams. Patterns: signed JWT auth, per-provider token-bucket rate limiting, circuit breakers, retry with jittered backoff, request hedging, cache-aside with TTL eviction, and at-least-once async workers with DLQ.
LLM gateways are everywhere in production AI apps now. Anyone building an AI product on top of frontier models hits the same set of problems and ends up with the same architecture.
| Service | Owner | Notes |
|---|---|---|
| OpenRouter | Independent | One API, 200+ models, automatic failover and price routing. |
| Portkey | Portkey AI | Production AI Gateway with observability, caching, retries, prompt versioning. |
| LiteLLM | BerriAI (OSS) | Self-host the same patterns. 100+ providers behind one OpenAI-compatible API. |
| Helicone | Helicone | Proxy in front of OpenAI for caching, rate limits, observability. |
| Cloudflare AI Gateway | Cloudflare | Edge-deployed gateway with caching, retries, analytics. |
| AWS Bedrock | Amazon | Bedrock is a managed multi-provider gateway in disguise (Anthropic, Mistral, Meta, etc). |
| Vercel AI SDK | Vercel | Client-side adapter but ships server-side gateway patterns for failover and streaming. |
| Internal (most AI startups) | Anyone with $$ on LLM bills | If you spend $50K+/mo on LLM APIs, you build one. |
Three flows to simulate: a happy-path streaming call, a failover when OpenAI is down, and a 429 on user quota exhausted.
/v1/chat/completions, /v1/embeddings, etc).

| Metric | Calculation | Result |
|---|---|---|
| Calls/sec (avg) | 20K ÷ 86,400 | 0.23 / sec |
| Calls/sec (peak, 10× spike) | 0.23 × 10 | ~2.3 / sec |
| Cache hit savings | 20% × 20K | 4K upstream calls saved/day |
| Outbound bandwidth peak | 2.3 × 5KB × 8 | ~92 Kbps |
| Postgres audit writes/sec | 0.23 (one per call) | 0.23/s (trivial) |
| Redis cache memory | 4K entries × ~6KB | ~24 MB |
| Cost @ $0.005 / call | 20K × $0.005 × 30 | $3,000/mo upstream |
| Cost with 20% cache | $3,000 × 0.8 | $2,400/mo (saves $7.2K/yr) |
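The arithmetic in the table can be reproduced in a few lines. This is a minimal sketch: the inputs (20K calls/day, 10× peak, 5KB responses, 20% hit rate, $0.005 per call) are read off the table, and the variable names are illustrative.

```typescript
// Inputs read off the table above; names are illustrative.
const callsPerDay = 20_000;
const peakMultiplier = 10;
const responseKB = 5;
const cacheHitRate = 0.2;
const costPerCallUsd = 0.005;

const avgRps = callsPerDay / 86_400;            // seconds per day
const peakRps = avgRps * peakMultiplier;
const peakKbps = peakRps * responseKB * 8;      // KB/s -> Kbps
const savedCallsPerDay = callsPerDay * cacheHitRate;
const monthlyCostUsd = callsPerDay * costPerCallUsd * 30;
const monthlyCostCachedUsd = monthlyCostUsd * (1 - cacheHitRate);
const yearlySavingsUsd = (monthlyCostUsd - monthlyCostCachedUsd) * 12;

console.log({
  avgRps: avgRps.toFixed(2),
  peakRps: peakRps.toFixed(1),
  peakKbps: peakKbps.toFixed(1),
  savedCallsPerDay,
  monthlyCostUsd,
  monthlyCostCachedUsd,
  yearlySavingsUsd,
});
```

The unrounded peak bandwidth comes out slightly above the table's ~92 Kbps because the table multiplies the already-rounded 2.3 / sec.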
Stateless API nodes need stateless auth. Signed JWTs. Each user's SDK or app gets a short-lived access token. The gateway verifies the signature using cached JWKS, so there is no database hit on the hot path.
// Header
Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6...
// Payload (decoded)
{
"sub": "user_42",
"org": "acme",
"tier": "pro",
"iat": 1715846400,
"exp": 1715850000, // 1h validity
"scope": "chat embed"
}
// Verification on the hot path
const jwks = await jwksCache.get(token.kid); // cached, refreshes in background
const claims = verify(token, jwks); // ~0.3ms, pure CPU
if (claims.exp < now()) throw 401;

Two distinct rate-limit concerns: per-user quotas (your product's business rules) and per-upstream budgets (don't exceed what OpenAI lets you do).
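One way to picture the two concerns together is a request that must pass both a user bucket and a provider bucket. The sketch below is an in-memory stand-in for the Redis buckets, not the gateway's actual code: `TokenBuckets`, `limitFor`, and `tryTakeAll` are hypothetical names, and a single process is assumed (the Lua script is what makes the check atomic across replicas).

```typescript
// In-memory stand-in for the Redis buckets; all names are illustrative.
type Limit = { capacity: number; refillPerMs: number };
interface Bucket { tokens: number; ts: number }

class TokenBuckets {
  private buckets = new Map<string, Bucket>();
  constructor(private limitFor: (key: string) => Limit) {}

  // Refill the bucket for `key` up to `now`; a new bucket starts full.
  private refill(key: string, now: number): Bucket {
    const { capacity, refillPerMs } = this.limitFor(key);
    const b = this.buckets.get(key) ?? { tokens: capacity, ts: now };
    b.tokens = Math.min(capacity, b.tokens + (now - b.ts) * refillPerMs);
    b.ts = now;
    this.buckets.set(key, b);
    return b;
  }

  // Take one token from every bucket, or from none if any bucket is empty.
  tryTakeAll(keys: string[], now: number): boolean {
    const refilled = keys.map((k) => this.refill(k, now));
    if (refilled.some((b) => b.tokens < 1)) return false;
    refilled.forEach((b) => { b.tokens -= 1; });
    return true;
  }
}

// Example limits (assumed): tight per-user quota, generous provider budget.
const limiter = new TokenBuckets((key) =>
  key.startsWith("bucket:user:")
    ? { capacity: 2, refillPerMs: 0.001 }
    : { capacity: 100, refillPerMs: 1 }
);
const keys = ["bucket:user:42", "bucket:provider:openai"];
console.log(limiter.tryTakeAll(keys, 0)); // user quota and provider budget both have room
```

Checking both buckets before deducting from either avoids charging the user's quota for a request that the provider budget then rejects.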
-- Lua script for atomic check-and-decrement.
-- Returns 1 if allowed, 0 if rate-limited.
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill = tonumber(ARGV[2]) -- tokens per ms
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity
local ts = tonumber(bucket[2]) or now
-- Refill based on time elapsed.
local elapsed = now - ts
tokens = math.min(capacity, tokens + elapsed * refill)
if tokens < 1 then
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
return 0
end
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, 60000)
return 1

Bucket keys follow bucket:<scope>:<id>, e.g. bucket:user:42, bucket:provider:openai.
Rejected requests return 429 Too Many Requests with a Retry-After header (e.g. Retry-After: 12) so SDKs back off correctly.

Without a breaker, when OpenAI starts failing, your gateway burns timeout budget on every request (5s × 2.3 req/s × N minutes). With one, you detect failure fast and skip the dead provider entirely. Failover routes take over and users barely notice.
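The failover path can be sketched as a loop over providers in priority order, where an open breaker or an upstream error simply moves on to the next one. The `Breaker` and `Provider` shapes and the `completeWithFailover` name are assumptions for illustration; any breaker exposing a `call` method like the `CircuitBreaker` class in this section would fit.

```typescript
// Hypothetical minimal shapes for illustration.
interface Breaker {
  call<T>(fn: () => Promise<T>): Promise<T>;
}
interface Provider {
  name: string;
  breaker: Breaker;
  complete(prompt: string): Promise<string>;
}

// Try providers in order; skip any whose breaker is open or whose call fails.
async function completeWithFailover(providers: Provider[], prompt: string): Promise<string> {
  let lastError: unknown = new Error("no providers configured");
  for (const p of providers) {
    try {
      return await p.breaker.call(() => p.complete(prompt));
    } catch (err) {
      lastError = err; // breaker open or upstream failure: fall through to the next provider
    }
  }
  throw lastError;
}
```

Because an open breaker throws immediately, a dead primary costs almost no latency before the secondary is tried.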
Wraps every outbound call. Trips OPEN at 50% failure ratio, recovers via a half-open probe after 4000ms.
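type State = "CLOSED" | "OPEN" | "HALF_OPEN";
// Thresholds from the description above; the window values are assumed for illustration.
const FAIL_RATIO = 0.5; // trip at a 50% failure ratio
const OPEN_TIMEOUT_MS = 4000; // wait before the half-open probe
const WINDOW_MS = 10_000; // assumed: how long a failure stays in the window
const WINDOW_SIZE = 20; // assumed: expected calls per window
const now = () => Date.now();
class BreakerOpenError extends Error {}
class CircuitBreaker {
  state: State = "CLOSED";
  failures: number[] = []; // timestamps within window
  openedAt = 0;
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (now() - this.openedAt >= OPEN_TIMEOUT_MS) {
        this.state = "HALF_OPEN"; // let a probe through
      } else {
        throw new BreakerOpenError(); // fail fast, no upstream call
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
  onSuccess() {
    if (this.state === "HALF_OPEN") {
      this.state = "CLOSED";
      this.failures = [];
    }
  }
  onFailure() {
    if (this.state === "HALF_OPEN") {
      // Probe failed: reopen and restart the timeout.
      this.state = "OPEN";
      this.openedAt = now();
      return;
    }
    this.failures.push(now());
    this.failures = this.failures.filter(t => now() - t < WINDOW_MS);
    // Approximates the failure ratio as recent failures over the expected call count.
    if (this.failures.length / WINDOW_SIZE >= FAIL_RATIO) {
      this.state = "OPEN";
      this.openedAt = now();
    }
  }
}

Transient errors (network blips, 503s) deserve a retry. But naive retries cause thundering herds. When a flaky upstream finally recovers, every client that was backing off fires its retry at the same instant, crushing it again.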
The fix: jitter. Spread retries randomly within the backoff window. The AWS recommendation is full jitter or decorrelated jitter.
Simulating 50 clients retrying after a shared upstream blip. Base 200ms, doubled per attempt, capped at 6400ms. Watch for the thundering-herd spike under "no jitter".
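The three strategies differ only in how the delay is computed. A minimal sketch using the same base and cap as above; the function names and the injectable `rng` parameter are illustrative, and the decorrelated variant follows the commonly cited form min(cap, random(base, previous × 3)).

```typescript
type Rng = () => number; // returns a value in [0, 1)
const BASE_MS = 200;
const CAP_MS = 6400;

// No jitter: every client computes the same delay, so retries arrive together.
function noJitter(attempt: number): number {
  return Math.min(CAP_MS, BASE_MS * 2 ** attempt);
}

// Full jitter: uniform in [0, min(CAP_MS, BASE_MS * 2^attempt)).
function fullJitter(attempt: number, rng: Rng = Math.random): number {
  return rng() * Math.min(CAP_MS, BASE_MS * 2 ** attempt);
}

// Decorrelated jitter: uniform in [BASE_MS, 3 * previous delay), capped at CAP_MS.
// Start with prevDelayMs = BASE_MS and feed each result into the next call.
function decorrelatedJitter(prevDelayMs: number, rng: Rng = Math.random): number {
  return Math.min(CAP_MS, BASE_MS + rng() * (prevDelayMs * 3 - BASE_MS));
}

// 50 clients on their third retry: identical delays vs. a spread.
const herd = Array.from({ length: 50 }, () => noJitter(2));
const spread = Array.from({ length: 50 }, () => fullJitter(2));
console.log(new Set(herd).size, new Set(spread).size);
```

With no jitter all 50 clients land on a single delay value; with full jitter they scatter across the window.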
async function callWithRetry(fn: () => Promise<Response>): Promise<Response> {
const MAX_ATTEMPTS = 4;
const BASE = 200; // ms
const CAP = 6400;
for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
try {
const res = await fn();
if (res.status < 500 && res.status !== 429) return res;
if (attempt === MAX_ATTEMPTS - 1) return res;
} catch (err) {
if (attempt === MAX_ATTEMPTS - 1) throw err;
}
// Full jitter: random(0, base * 2^attempt) capped at CAP.
const expCap = Math.min(CAP, BASE * Math.pow(2, attempt));
const delay = Math.random() * expCap;
await sleep(delay);
}
throw new Error("unreachable");
}

Retry helps with errors. Hedging helps with slow successes. If your primary call takes longer than usual (the long tail), fire the same request at a backup provider. Whichever returns first wins. The loser is cancelled.
Fire the primary, start a timer at p95, race a secondary if the primary is late. Winner returns to the user, loser is cancelled. p99 latency drops without adding load on the happy path.
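The race described above can be sketched as follows. This is an illustration under assumptions, not the gateway's code: `hedged` and `Call` are hypothetical names, each call is assumed to honor an `AbortSignal`, and `hedgeAfterMs` stands in for the primary's observed p95.

```typescript
type Call<T> = (signal: AbortSignal) => Promise<T>;

// Start `primary`; if it has not settled after `hedgeAfterMs`, start `secondary` too.
// The first success wins and the other call is aborted. Rejects only if both fail.
function hedged<T>(primary: Call<T>, secondary: Call<T>, hedgeAfterMs: number): Promise<T> {
  const primaryCtl = new AbortController();
  const secondaryCtl = new AbortController();
  return new Promise<T>((resolve, reject) => {
    let failures = 0;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const win = (value: T) => {
      clearTimeout(timer);   // secondary never starts if the primary was fast
      primaryCtl.abort();    // cancel the loser (harmless for the winner)
      secondaryCtl.abort();
      resolve(value);
    };
    const lose = (err: unknown) => {
      failures += 1;
      if (failures === 2) reject(err); // both attempts failed
    };
    primary(primaryCtl.signal).then(win, lose);
    timer = setTimeout(() => {
      secondary(secondaryCtl.signal).then(win, lose);
    }, hedgeAfterMs);
  });
}
```

On the happy path the timer is cleared before it fires, so the backup provider sees no extra load. Note that a primary that fails fast still waits for the timer before the secondary starts; a real gateway might start it immediately in that case.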
Cancel the loser with AbortController. Crucial for not double-billing the user.

LLM calls are expensive. Same prompt + same model + same parameters → same output, at least when sampling is deterministic (temperature 0). Hash the request, cache the response.
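The hashing step needs a stable serialization first, otherwise two requests that differ only in key order get different keys. Here is one possible shape for a `canonicalHash` helper, assuming Node's built-in `node:crypto` module and SHA-256; `canonicalize` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Recursively sort object keys so logically equal requests serialize identically.
// Array order is preserved because message order changes the prompt.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const k of Object.keys(obj).sort()) out[k] = canonicalize(obj[k]);
    return out;
  }
  return value;
}

// Hex SHA-256 of the canonical JSON form of the request.
function canonicalHash(req: unknown): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(req))).digest("hex");
}
```

A cryptographic hash is used here because a key collision would serve one request's response to another.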
async function getCachedCompletion(req: ChatRequest): Promise<ChatResponse> {
  // 1. Cache only when deterministic (temp=0); sampled output should not be replayed.
  if (req.temperature !== 0) return callUpstream(req);
  // 2. Compute a stable cache key from (model, messages, temperature, ...).
  const key = canonicalHash(req);
  // 3. Cache lookup.
  const cached = await redis.get(`llm:${key}`);
  if (cached) {
    metrics.inc("llm.cache_hit");
    return JSON.parse(cached);
  }
  // 4. Miss. Call upstream.
  const response = await callUpstream(req);
  // 5. Populate the cache (TTL = 1h by default, longer for embeddings).
  await redis.setex(`llm:${key}`, 3600, JSON.stringify(response));
  return response;
}
// On invalidation. SCAN is cursor-based, so iterate until the cursor returns to "0"
// (ioredis-style call signature assumed).
async function bustCache(prefix: string) {
  let cursor = "0";
  do {
    const [next, keys] = await redis.scan(cursor, "MATCH", `llm:${prefix}*`, "COUNT", 100);
    if (keys.length) await redis.del(...keys);
    cursor = next;
  } while (cursor !== "0");
}

Some calls are too long for the HTTP request path: embedding 10K documents, fine-tune polling, image generation, summarizing a 200-page PDF. Offload to a queue. Return a job ID immediately. A worker pulls from the queue, runs the work, and stores the result for polling.
XADD jobs by producer · XREADGROUP by 3 workers · at-least-once delivery · retry on failure · DLQ after 3 attempts.
After 3 failed attempts, a job moves to jobs:dlq for human review. Set up a Prometheus alert on xlen(jobs:dlq) > 0.

| Option | Verdict | Why |
|---|---|---|
| DB-as-queue (Postgres LISTEN/NOTIFY, SKIP LOCKED) | ~ | Fine to ~5K msg/s, then connection pool starves. Good for v1 only. |
| Redis Streams | ✓ | You already run Redis. Consumer groups, at-least-once, DLQ. Up to ~50K msg/s. |
| Kafka | ~ | Overkill at this scale. Adds operational complexity. Pick if you already run Kafka or need replay history. |
| SQS | ~ | Reasonable on AWS. Loses Redis Streams' XPENDING visibility but cheaper at low volume. |
// Producer side (API)
await redis.xadd(
"jobs",
"*", // auto-id
"type", "embed-batch",
"user_id", userId,
"payload", JSON.stringify(payload),
);
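// Consumer side (worker). Reply shapes below assume an ioredis-style client.
while (true) {
  const reply = await redis.xreadgroup(
    "GROUP", "workers", `worker-${id}`,
    "COUNT", 10, "BLOCK", 2000,
    "STREAMS", "jobs", ">"
  );
  if (!reply) continue; // BLOCK timed out with no new messages
  // reply is [[streamName, [[entryId, fields], ...]], ...]
  for (const [, entries] of reply) {
    for (const [entryId, fields] of entries) {
      try {
        await handle(fields); // your business logic
        await redis.xack("jobs", "workers", entryId); // mark done
      } catch (err) {
        // Unacked entries stay in the pending list (visible via XPENDING) until a
        // worker reclaims them with XCLAIM/XAUTOCLAIM. After N attempts,
        // a janitor moves it to jobs:dlq and acks the original.
        logger.error({ entryId, err }, "job failed");
      }
    }
  }
}
// DLQ janitor (runs periodically)
const MAX_ATTEMPTS = 3;
// Extended XPENDING returns [entryId, consumer, idleMs, deliveryCount] per entry.
const stuck = await redis.xpending("jobs", "workers", "-", "+", 100);
for (const [entryId, , , deliveryCount] of stuck) {
  if (deliveryCount >= MAX_ATTEMPTS) {
    // XPENDING does not return the payload, so read the entry back first.
    const [entry] = await redis.xrange("jobs", entryId, entryId);
    if (entry) await redis.xadd("jobs:dlq", "*", "original_id", entryId, ...entry[1]);
    await redis.xack("jobs", "workers", entryId);
  }
}

From the resume bullet: kept p99 under 1.2s through multi-minute upstream 5xx incidents. Here is how each piece contributes: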
| Pattern | What it protects against |
|---|---|
| Circuit breaker | Burning latency budget on a dead upstream. Fail-fast. |
| Failover | Provider-wide outages. Anthropic picks up while OpenAI is down. |
| Hedging | The long tail (one slow request among many). Pulls p99 down toward typical latency. |
| Retry + jitter | Transient 5xx and network blips, without a thundering herd on recovery. |
| Cache-aside | Hot keys during burst traffic. Cuts upstream spend roughly in proportion to the hit rate (20% in the estimate above). |
| Per-provider token buckets | Upstream throttling you back; your gateway smooths it. |
| Async via Streams + DLQ | Long-running jobs. Survives worker crashes. Dead jobs surfaced for humans. |
| Stateless API + ALB | Replica crashes. ALB drains the bad node, others keep serving. |
| PG primary + read replicas | Read traffic for audit/analytics. Primary stays focused on writes. |
XLEN(jobs:dlq) > 0.

A CDN caches static responses for many users. LLM responses are per-request (different prompts) and often streamed. CDN edge caching is the wrong shape. The cache-aside layer in Redis is the right shape: hash on the exact request body, cache per-tenant. Cloudflare AI Gateway is the exception: it runs the gateway itself at the edge, which is a different design.
When your upstream is rate-limit-constrained (already throttling you). Hedging doubles the rate, you trip the upstream's limits, both calls fail. Solution: only hedge when the upstream's remaining budget is comfortable.
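A hedge guard along these lines could be as small as a headroom check. This is a sketch: `UpstreamBudget`, `shouldHedge`, and the 30% default threshold are all assumptions, and the budget could come from the provider token bucket or from the upstream's rate-limit response headers.

```typescript
// Hypothetical view of an upstream's rate-limit budget.
interface UpstreamBudget {
  remaining: number; // requests still allowed in the current window
  limit: number;     // total requests allowed per window
}

// Hedge only when at least `minHeadroom` of the budget is still unused.
// The 0.3 default is an illustrative threshold, not a recommendation.
function shouldHedge(budget: UpstreamBudget, minHeadroom = 0.3): boolean {
  if (budget.limit <= 0) return false;
  return budget.remaining / budget.limit >= minHeadroom;
}
```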
The cache key is a hash so it doesn't leak content directly. The cached value might. Three options: don't cache PII-tagged routes; encrypt cached payloads with a per-tenant key; namespace cache by user so eviction is per-user. We picked the third (namespace) plus the first for sensitive routes.
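The combination of the first and third options can be sketched as a key builder that refuses sensitive routes and namespaces everything else by user. The route names and the key layout here are hypothetical.

```typescript
// Hypothetical list of routes that must never be cached.
const SENSITIVE_ROUTES = new Set(["/v1/chat/medical", "/v1/chat/support"]);

// Returns the Redis key for this request, or null when the route must not be cached.
function cacheKeyFor(route: string, userId: string, requestHash: string): string | null {
  if (SENSITIVE_ROUTES.has(route)) return null;
  return `llm:user:${userId}:${requestHash}`;
}
```

With this layout, evicting one user's entries is a prefix delete on `llm:user:<id>:`.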
A global breaker means every replica sees the same state via Redis. Adds a round-trip per call, and one slow Redis read becomes a global single point of failure. Per-replica breakers converge to the same decision within seconds, are independent, and add zero latency.
Charge the user for the winning call only. Track the losing call internally for cost accounting (your bill from the loser still arrives). The hedge cost is gateway overhead, paid by the platform, not the customer.
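In accounting terms this is a split of per-attempt costs by outcome. A minimal sketch, with `Attempt` and `splitHedgeCost` as hypothetical names:

```typescript
interface Attempt {
  provider: string;
  costUsd: number;
  won: boolean; // true for the call whose response was returned to the user
}

// The winner's cost is billed to the user; losers are recorded as platform overhead.
function splitHedgeCost(attempts: Attempt[]): { userChargeUsd: number; platformOverheadUsd: number } {
  let userChargeUsd = 0;
  let platformOverheadUsd = 0;
  for (const a of attempts) {
    if (a.won) userChargeUsd += a.costUsd;
    else platformOverheadUsd += a.costUsd;
  }
  return { userChargeUsd, platformOverheadUsd };
}
```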
Drain mode on the ALB: stop sending new connections to the replica being replaced, let in-flight streams finish, then terminate. Set the target group's deregistration delay (the connection draining timeout) ≥ your longest streaming response (typically 60s for LLM completions).
Each provider has an adapter that maps the gateway's normalized request to the provider's API and the provider's response back. Add the adapter file, declare it in the routing config, add a circuit breaker entry. Zero changes to handler code. That is the whole point of the gateway abstraction.
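The adapter boundary might look like the sketch below. Everything in it is an assumption for illustration: the normalized request and response shapes, the `registry`, and the made-up `exampleai` provider with its payload format. The transport is injected as `send` so the handler never touches provider specifics.

```typescript
// Normalized shapes (assumed); a real gateway would carry more fields.
interface GatewayRequest { model: string; messages: { role: string; content: string }[] }
interface GatewayResponse { text: string; provider: string }

interface ProviderAdapter {
  name: string;
  toUpstream(req: GatewayRequest): unknown;     // gateway request -> provider payload
  fromUpstream(raw: unknown): GatewayResponse;  // provider payload -> gateway response
}

const registry = new Map<string, ProviderAdapter>();
function register(adapter: ProviderAdapter): void {
  registry.set(adapter.name, adapter);
}

type Send = (payload: unknown) => Promise<unknown>;

// Handler code stays the same no matter how many adapters are registered.
async function handle(provider: string, req: GatewayRequest, send: Send): Promise<GatewayResponse> {
  const adapter = registry.get(provider);
  if (!adapter) throw new Error(`unknown provider: ${provider}`);
  return adapter.fromUpstream(await send(adapter.toUpstream(req)));
}

// Adding a provider is one register() call. "exampleai" and its payload are made up.
register({
  name: "exampleai",
  toUpstream: (req) => ({ engine: req.model, prompt: req.messages.map((m) => m.content).join("\n") }),
  fromUpstream: (raw) => ({ text: (raw as { output: string }).output, provider: "exampleai" }),
});
```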