Multi-Provider LLM Gateway · System Design



Hard, Distributed Systems, Reliability, AI Infra, Caching · OpenAI, Anthropic, Cloudflare, Stripe

An LLM Gateway is a single API in front of multiple model providers (OpenAI, Anthropic, etc). One endpoint for clients, smart routing behind it: rate limiting, automatic failover, caching, async jobs. Think Stripe but for LLM calls.

Real components used: Node.js stateless gateway, AWS ALB, PostgreSQL primary + read replicas, Redis for rate limits / cache / streams, and OpenAI + Anthropic upstreams. Patterns: signed JWT auth, per-provider token-bucket rate limiting, circuit breakers, retry with jittered backoff, request hedging, cache-aside with TTL eviction, and at-least-once async workers with DLQ.

Where This Is Actually Used

LLM gateways are everywhere in production AI apps now. Anyone building an AI product on top of frontier models hits the same set of problems and ends up with the same architecture.

Real services running this design

Service	Owner	Notes
`OpenRouter`	Independent	One API, 200+ models, automatic failover and price routing.
`Portkey`	Portkey AI	Production AI Gateway with observability, caching, retries, prompt versioning.
`LiteLLM`	BerriAI (OSS)	Self-host the same patterns. 100+ providers behind one OpenAI-compatible API.
`Helicone`	Helicone	Proxy in front of OpenAI for caching, rate limits, observability.
`Cloudflare AI Gateway`	Cloudflare	Edge-deployed gateway with caching, retries, analytics.
`AWS Bedrock`	Amazon	Bedrock is a managed multi-provider gateway in disguise (Anthropic, Mistral, Meta, etc).
`Vercel AI SDK`	Vercel	Client-side adapter but ships server-side gateway patterns for failover and streaming.
Internal (most AI startups)	Anyone with $$ on LLM bills	If you spend $50K+/mo on LLM APIs, you build one.

Why teams build one

Reliability: OpenAI had multi-hour outages in 2024. Without failover, your app is down. With Anthropic as a hot standby, p99 stays under 1.2s.
Cost control: route cheap requests to cheaper models. Cache duplicates. Saves 30–70% on bills.
Rate-limit hiding: upstream has per-org limits. The gateway smooths bursts via token buckets and queues.
Per-user quotas: charge usage-based. Without a gateway you have no way to enforce per-user budgets.
Observability: one trace pipeline, one billing aggregator, one place to debug bad prompts.
Vendor flexibility: swap providers without changing app code.

Architecture · Trace a Real Request

Three flows are worth tracing: a happy-path streaming call, a failover when OpenAI is down, and a 429 when a user's quota is exhausted.

Layers: EDGE · GATEWAY · STATE · PROVIDERS

Component	Role
Client	App / SDK
ALB	Load Balancer
Node.js API	Stateless Gateway
Redis	Limits · Cache · Streams
PostgreSQL	Primary · Source of Truth
PG Replicas	Read Fan-out
OpenAI	LLM Provider
Anthropic	LLM Provider
Worker	Async Consumer

Requirements

Functional

OpenAI-compatible API surface (/v1/chat/completions, /v1/embeddings, etc).
Streaming via SSE: relay tokens to the client as they arrive upstream.
Route by model, by user policy, by cost, by upstream health.
Automatic failover between providers (OpenAI ↔ Anthropic).
Per-user and per-organization rate limiting + spend budgets.
Async jobs (embeddings batches, long completions) via a queue.
Full audit log of every request, including tokens, cost, latency.
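
The routing requirement (by model, user policy, cost, and upstream health) can be sketched as a pure selection function. This is a minimal sketch, not the gateway's actual router: the `Route` and `Policy` shapes, the `alias` field, and the price numbers are all assumptions made for illustration.

```typescript
// Hypothetical route-table entry; field names are illustrative.
interface Route {
  alias: string;            // client-facing model name the gateway exposes
  provider: string;         // e.g. "openai", "anthropic"
  costPer1kTokens: number;  // assumed blended price in USD
  healthy: boolean;         // would be fed by that provider's circuit breaker
}

// Hypothetical per-user or per-org policy.
interface Policy {
  allowedProviders: string[];
  maxCostPer1kTokens: number;
}

// Returns candidate routes for one request, cheapest first.
// The first element is the primary; the rest are failover targets in order.
function selectRoutes(alias: string, routes: Route[], policy: Policy): Route[] {
  return routes
    .filter(r => r.alias === alias)
    .filter(r => r.healthy)
    .filter(r => policy.allowedProviders.includes(r.provider))
    .filter(r => r.costPer1kTokens <= policy.maxCostPer1kTokens)
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens);
}
```

An empty result means no route satisfies the policy, which the handler would presumably surface as an error rather than silently picking a disallowed provider.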

Non-Functional

Availability ≥ 99.95%: ~22 min/month downtime. Survives any single upstream outage.
Gateway overhead < 10ms p99. The gateway itself adds negligible latency.
User-facing p99 < 1.2s even through multi-minute upstream 5xx incidents.
Horizontal scalability: stateless API nodes, scale-out by adding replicas.
Cost-efficiency: cache absorbs duplicate requests; cheap routes used when possible.

Back-of-Envelope Estimation

Assumptions

1,000 active users, average 20 LLM calls/day each → 20K calls/day.
Average request: 1 KB in, 5 KB out (streaming completion).
Average upstream latency: 800ms median, 2.5s p99.
~20% of calls are cacheable (same prompt / same model).

Metric	Calculation	Result
Calls/sec (avg)	20K ÷ 86,400	0.23 / sec
Calls/sec (peak, 10× spike)	0.23 × 10	~2.3 / sec
Cache hit savings	20% × 20K	4K upstream calls saved/day
Outbound bandwidth peak	2.3 × 5KB × 8	~92 Kbps
Postgres audit writes/sec	0.23 (one per call)	0.23/s (trivial)
Redis cache memory	4K entries × ~6KB	~24 MB
Cost @ $0.005 / call	20K × $0.005 × 30	$3,000/mo upstream
Cost with 20% cache	$3,000 × 0.8	$2,400/mo (saves $7.2K/yr)

Key insight: the gateway barely needs scale at this user count. The interesting engineering is not capacity — it is reliability and cost. Make the system ride out upstream incidents without paging anyone.
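
The table's arithmetic can be reproduced in a few lines, which makes it easy to rerun with different assumptions. This is only a sketch: the `Assumptions` field names are invented here, and the values are the ones stated above.

```typescript
interface Assumptions {
  users: number;
  callsPerUserPerDay: number;
  responseKB: number;      // average response size
  cacheableRatio: number;  // fraction of calls that can be served from cache
  cacheEntryKB: number;    // request + response stored per cache entry
  costPerCallUsd: number;
  peakMultiplier: number;
}

function estimate(a: Assumptions) {
  const callsPerDay = a.users * a.callsPerUserPerDay;
  const avgRps = callsPerDay / 86400;
  const peakRps = avgRps * a.peakMultiplier;
  const savedCallsPerDay = callsPerDay * a.cacheableRatio;
  const monthlyCostUsd = callsPerDay * a.costPerCallUsd * 30;
  const monthlyCostCachedUsd = monthlyCostUsd * (1 - a.cacheableRatio);
  return {
    callsPerDay,
    avgRps,
    peakRps,
    peakEgressKbps: peakRps * a.responseKB * 8,
    savedCallsPerDay,
    cacheMemoryMB: (savedCallsPerDay * a.cacheEntryKB) / 1000,
    monthlyCostUsd,
    monthlyCostCachedUsd,
    yearlySavingsUsd: (monthlyCostUsd - monthlyCostCachedUsd) * 12,
  };
}

const result = estimate({
  users: 1000,
  callsPerUserPerDay: 20,
  responseKB: 5,
  cacheableRatio: 0.2,
  cacheEntryKB: 6,
  costPerCallUsd: 0.005,
  peakMultiplier: 10,
});
console.log(result);
```

Unrounded, the peak egress comes out slightly above the table's 92 Kbps because the table rounds the peak rate to 2.3 calls/sec before multiplying.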

Auth · Signed JWT

Stateless API nodes need stateless auth: signed JWTs. Each user's SDK or app gets a short-lived access token. The gateway verifies the signature using a cached JWKS, so there is no database hit on the hot path.

// Header
Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6...

// Payload (decoded)
{
  "sub": "user_42",
  "org": "acme",
  "tier": "pro",
  "iat": 1715846400,
  "exp": 1715850000,   // 1h validity
  "scope": "chat embed"
}

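// Verification on the hot path
// (decodeHeader, verify, nowSeconds, and HttpError are illustrative helpers)
const { kid } = decodeHeader(token);           // kid lives in the JWT header
const key = await jwksCache.get(kid);          // cached, refreshes in background
const claims = verify(token, key);             // ~0.3ms, pure CPU
if (claims.exp <= nowSeconds()) throw new HttpError(401);  // exp is in seconds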

Short TTL (60 minutes). Compromised tokens auto-expire.
JWKS cached in process memory with background refresh. Zero DB calls per request.
kid in header lets you rotate signing keys without breaking active tokens.
Refresh tokens stored in Postgres. One DB hit per refresh, not per request.
Revocation: the short TTL bounds how long a stolen token stays valid. For an instant kill, check a Redis denylist of token IDs (this needs a jti claim in the payload).
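
Once the signature is verified, the remaining checks are plain claim comparisons. The sketch below assumes a `jti` claim (not present in the sample payload above) and models the Redis denylist as an in-memory `Set`; `checkClaims` and `AuthResult` are hypothetical names.

```typescript
interface Claims {
  sub: string;
  org: string;
  tier: string;
  iat: number;    // seconds since epoch
  exp: number;    // seconds since epoch
  scope: string;  // space-separated scopes, e.g. "chat embed"
  jti?: string;   // assumed token ID, needed only for the denylist
}

type AuthResult =
  | { ok: true }
  | { ok: false; status: 401 | 403; reason: string };

// Runs after signature verification. The denylist is a Set here;
// in the gateway it would be a Redis lookup keyed by jti.
function checkClaims(
  claims: Claims,
  requiredScope: string,
  nowSec: number,
  denylist: ReadonlySet<string>,
): AuthResult {
  if (claims.exp <= nowSec) {
    return { ok: false, status: 401, reason: "token expired" };
  }
  if (claims.jti !== undefined && denylist.has(claims.jti)) {
    return { ok: false, status: 401, reason: "token revoked" };
  }
  if (!claims.scope.split(" ").includes(requiredScope)) {
    return { ok: false, status: 403, reason: "missing scope" };
  }
  return { ok: true };
}
```

Returning 403 for a missing scope and 401 for an expired or revoked token is a common convention, not something the source specifies.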

Rate Limiting · Per-Provider Token Bucket

Two distinct rate-limit concerns: per-user quotas (your product's business rules) and per-upstream budgets (don't exceed what OpenAI lets you do).

Token bucket in Redis

-- Lua script for atomic check-and-decrement.
-- Returns 1 if allowed, 0 if rate-limited.

local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])  -- tokens per ms
local now      = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity
local ts     = tonumber(bucket[2]) or now

-- Refill based on time elapsed.
local elapsed = now - ts
tokens = math.min(capacity, tokens + elapsed * refill)

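local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

-- Persist state on both paths and refresh the TTL so idle buckets expire.
-- HSET with multiple fields replaces the deprecated HMSET.
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, 60000)
return allowed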

Atomic in Redis via Lua. No race between check and decrement.
Bucket key pattern: bucket:<scope>:<id> e.g. bucket:user:42, bucket:provider:openai.
Capacity + refill rate vary by tier. Pro = 100/min, free = 10/min.
Reject with 429 Too Many Requests + Retry-After: 12 header so SDKs back off correctly.
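
The same refill arithmetic can be mirrored in TypeScript, for example to unit-test tier settings or to derive the `Retry-After` value. This is an in-memory sketch only: it is not atomic across replicas the way the Lua script is, and `takeToken` and its return shape are hypothetical.

```typescript
interface Bucket {
  tokens: number;
  ts: number; // last update, ms since epoch
}

interface TakeResult {
  bucket: Bucket;        // state to persist
  allowed: boolean;
  retryAfterSec: number; // 0 when allowed
}

// Mirrors the Lua script: refill by elapsed time, then try to spend one token.
function takeToken(
  bucket: Bucket | undefined,
  capacity: number,
  refillPerMs: number,
  nowMs: number,
): TakeResult {
  const prev = bucket ?? { tokens: capacity, ts: nowMs };
  const elapsed = Math.max(0, nowMs - prev.ts); // guard against clock skew
  const tokens = Math.min(capacity, prev.tokens + elapsed * refillPerMs);

  if (tokens < 1) {
    // Time until one whole token is available, rounded up to whole seconds.
    const waitMs = (1 - tokens) / refillPerMs;
    return {
      bucket: { tokens, ts: nowMs },
      allowed: false,
      retryAfterSec: Math.max(1, Math.ceil(waitMs / 1000)),
    };
  }
  return { bucket: { tokens: tokens - 1, ts: nowMs }, allowed: true, retryAfterSec: 0 };
}
```

For the free tier above (10/min), that would be `capacity = 10` and `refillPerMs = 10 / 60000`.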

Circuit Breaker · Fail Fast When Upstream Is Down

Without a breaker, when OpenAI starts failing your gateway burns timeout budget on every request (5s × 2.3 req/s × N minutes). With one, you detect failure fast and skip the dead provider entirely. Failover routes take over and users barely notice.

Circuit Breaker · 3-State Machine

Wraps every outbound call. Trips OPEN at 50% failure ratio, recovers via a half-open probe after 4000ms.

States

CLOSED: All requests pass through. Track the failure ratio over the recent window.
HALF_OPEN: Allow ONE probe to test whether the upstream is back. Reject the rest.
OPEN: Fail fast. Skip the upstream call and save its budget.

The failure ratio is computed over the most recent 10 requests. When it reaches 50% the breaker trips OPEN; after 4000ms it transitions to HALF_OPEN and lets one probe through.

Implementation sketch

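type State = "CLOSED" | "OPEN" | "HALF_OPEN";

const WINDOW_SIZE = 10;        // evaluate the most recent 10 calls
const FAIL_RATIO = 0.5;        // trip at 50% failures
const OPEN_TIMEOUT_MS = 4000;  // wait before allowing a probe
const now = () => Date.now();

class BreakerOpenError extends Error {}

class CircuitBreaker {
  state: State = "CLOSED";
  outcomes: boolean[] = [];   // rolling window, true = success
  openedAt = 0;
  probeInFlight = false;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (now() - this.openedAt < OPEN_TIMEOUT_MS) throw new BreakerOpenError();
      this.state = "HALF_OPEN";
    }
    if (this.state === "HALF_OPEN") {
      // Allow exactly one probe; reject everything else until it settles.
      if (this.probeInFlight) throw new BreakerOpenError();
      this.probeInFlight = true;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    if (this.state === "HALF_OPEN") {
      this.state = "CLOSED";
      this.probeInFlight = false;
      this.outcomes = [];
      return;
    }
    this.record(true);
  }

  onFailure() {
    if (this.state === "HALF_OPEN") {
      this.trip();
      return;
    }
    this.record(false);
    const failures = this.outcomes.filter(ok => !ok).length;
    // Dividing by WINDOW_SIZE (not outcomes.length) means at least 5 failures
    // are needed, so a single early failure cannot trip the breaker.
    if (failures / WINDOW_SIZE >= FAIL_RATIO) this.trip();
  }

  private record(ok: boolean) {
    this.outcomes.push(ok);
    if (this.outcomes.length > WINDOW_SIZE) this.outcomes.shift();
  }

  private trip() {
    this.state = "OPEN";
    this.openedAt = now();
    this.probeInFlight = false;
    this.outcomes = [];
  }
}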

One breaker per upstream, per gateway replica. Don't share state across replicas; you want each node to make its own probe decision. Cross-replica gossip would be over-engineering.
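
With one breaker per upstream, failover reduces to walking an ordered provider list and skipping any provider whose breaker is open. The sketch below assumes a minimal `Breaker` interface (`allow` / `record`) rather than the class above, so it stays self-contained; `callWithFailover` and `Upstream` are hypothetical names.

```typescript
// Minimal breaker surface assumed for this sketch.
interface Breaker {
  allow(): boolean;               // false while the breaker is OPEN
  record(success: boolean): void; // feed the outcome back
}

interface Upstream<T> {
  name: string;
  breaker: Breaker;
  call: () => Promise<T>;
}

class AllUpstreamsFailedError extends Error {}

// Tries upstreams in order. A provider with an open breaker costs nothing:
// it is skipped without making a network call.
async function callWithFailover<T>(
  upstreams: Upstream<T>[],
): Promise<{ provider: string; value: T }> {
  const errors: string[] = [];
  for (const u of upstreams) {
    if (!u.breaker.allow()) {
      errors.push(`${u.name}: breaker open`);
      continue;
    }
    try {
      const value = await u.call();
      u.breaker.record(true);
      return { provider: u.name, value };
    } catch (err) {
      u.breaker.record(false);
      errors.push(`${u.name}: ${String(err)}`);
    }
  }
  throw new AllUpstreamsFailedError(errors.join("; "));
}
```

The ordering of `upstreams` is where routing policy plugs in: primary first, standby providers after it.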

Retry with Jittered Backoff

Transient errors (network blips, 503s) deserve a retry. But naive retries cause thundering herds. When a flaky upstream finally recovers, every client that was backing off fires their retry at the same instant, crushing it again.

The fix: jitter. Spread retries randomly within the backoff window. The AWS recommendation is full jitter or decorrelated jitter.

Retry with Exponential Backoff + Jitter

Consider 50 clients retrying after a shared upstream blip, with a 200ms base delay, doubled per attempt and capped at 6400ms. With no jitter, every client retries at the same instants and the upstream sees a thundering-herd spike on each attempt.

With jitter, retries are smeared across time and the upstream sees a gentle wave instead of a wall. Decorrelated jitter, one of the variants in the AWS guidance, derives each delay from the previous delay rather than from the attempt count, which avoids both clustering and excessive waits.

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function callWithRetry(fn: () => Promise<Response>): Promise<Response> {
  const MAX_ATTEMPTS = 4;
  const BASE = 200;   // ms
  const CAP  = 6400;

  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      const res = await fn();
      if (res.status < 500 && res.status !== 429) return res;
      if (attempt === MAX_ATTEMPTS - 1) return res;
    } catch (err) {
      if (attempt === MAX_ATTEMPTS - 1) throw err;
    }
    // Full jitter: random(0, base * 2^attempt) capped at CAP.
    const expCap = Math.min(CAP, BASE * Math.pow(2, attempt));
    const delay  = Math.random() * expCap;
    await sleep(delay);
  }
  throw new Error("unreachable");
}
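
The code above uses full jitter. For comparison, here is a sketch of the decorrelated variant, following the formula from the AWS write-up: `delay = min(cap, random_between(base, previous_delay * 3))`. The function names and the injectable `rng` parameter are assumptions added to make the schedule testable.

```typescript
// Decorrelated jitter: each delay is drawn between the base and 3x the previous delay.
function decorrelatedJitter(
  prevDelayMs: number,
  baseMs: number,
  capMs: number,
  rng: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, prevDelayMs * 3);
  const delay = baseMs + rng() * (upper - baseMs);
  return Math.min(capMs, delay);
}

// Builds the sequence of delays for a number of retries, seeding with the base delay.
function delaySchedule(
  retries: number,
  baseMs: number,
  capMs: number,
  rng: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  let prev = baseMs;
  for (let i = 0; i < retries; i++) {
    prev = decorrelatedJitter(prev, baseMs, capMs, rng);
    delays.push(prev);
  }
  return delays;
}
```

To use it in `callWithRetry`, the loop would carry the previous delay between attempts instead of recomputing from the attempt number.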

Request Hedging · Cut Tail Latency

Retry helps with errors. Hedging helps with slow successes. If your primary call takes longer than usual (the long tail), fire the same request at a backup provider. Whichever returns first wins. The loser is cancelled.


Fire the primary, start a timer at p95, race a secondary if the primary is late. Winner returns to the user, loser is cancelled. p99 latency drops without adding load on the happy path.

Example scenario: primary (OpenAI) latency 2400ms, hedge delay (p95) 1200ms, secondary (Anthropic) latency 700ms. The hedge fires at 1200ms and returns at 1900ms, 500ms before the primary would have finished.

Cost trade-off: hedging at p95 means roughly 5% of requests fire a duplicate call. You pay for ~1.05× the upstream cost in exchange for cutting p99 by 30–70%. For LLM APIs at $10/M tokens, that is usually a great trade for any user-facing path.

Set the hedge delay at the p95 latency of the primary. Below p95 → primary almost always wins, no extra cost. Above p95 → hedge fires, p99 collapses toward the secondary's median.
Cancel the loser with an AbortController. This stops the losing provider from generating (and billing you for) tokens nobody will read.
Only hedge requests without side effects. Plain LLM completions qualify (same input → equivalent output), so this is safe.
Track cost: record every hedge that fires so you can confirm the duplicate rate stays near the expected ~5%.
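
The steps in this list can be sketched as a small race helper. This is one possible shape, not the gateway's actual code: `hedged` and `Attempt` are hypothetical names, and each attempt is assumed to honor the `AbortSignal` it receives (as `fetch` does). A failure from whichever call settles first is rethrown and left to the retry and failover layers.

```typescript
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

// Starts the primary immediately and the secondary only if the primary
// has not settled within hedgeDelayMs. First to settle wins; the other is aborted.
async function hedged<T>(
  primary: Attempt<T>,
  secondary: Attempt<T>,
  hedgeDelayMs: number,
): Promise<{ winner: "primary" | "hedge"; value: T }> {
  const primaryCtl = new AbortController();
  const hedgeCtl = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const primaryRun = primary(primaryCtl.signal).then(value => ({
    winner: "primary" as const,
    value,
  }));

  const hedgeRun = new Promise<{ winner: "hedge"; value: T }>((resolve, reject) => {
    timer = setTimeout(() => {
      secondary(hedgeCtl.signal).then(
        value => resolve({ winner: "hedge", value }),
        reject,
      );
    }, hedgeDelayMs);
  });

  try {
    return await Promise.race([primaryRun, hedgeRun]);
  } finally {
    clearTimeout(timer);  // the hedge never fires if the primary was fast
    primaryCtl.abort();   // aborting an already-settled call is a no-op
    hedgeCtl.abort();
  }
}
```

With the scenario above (primary 2400ms, hedge delay 1200ms, secondary 700ms) the hedge wins; with a fast primary the secondary is never called at all.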

Cache-Aside Layer · Save 30–70% of Upstream Bills

LLM calls are expensive. Same prompt + same model + same parameters → same output. Hash the request, cache the response.

async function getCachedCompletion(req: ChatRequest): Promise<ChatResponse> {
  // 1. Only deterministic requests are cacheable; everything else goes upstream.
  if (req.temperature !== 0) return callUpstream(req);

  // 2. Compute a stable cache key from (model, messages, parameters).
  const key = canonicalHash(req);

  // 3. Cache lookup.
  const cached = await redis.get(`llm:${key}`);
  if (cached) {
    metrics.inc("llm.cache_hit");
    return JSON.parse(cached);
  }

  // 4. Miss. Call upstream.
  const response = await callUpstream(req);

  // 5. Populate the cache (TTL = 1h by default, longer for embeddings).
  await redis.setex(`llm:${key}`, 3600, JSON.stringify(response));
  return response;
}

// On invalidation: SCAN is cursor-based, so iterate until the cursor returns to "0".
async function bustCache(prefix: string) {
  let cursor = "0";
  do {
    const [next, keys] = await redis.scan(cursor, "MATCH", `llm:${prefix}*`, "COUNT", 500);
    if (keys.length) await redis.del(...keys);
    cursor = next;
  } while (cursor !== "0");
}

Only cache when temperature = 0 (deterministic). Non-deterministic calls are uncacheable by nature.
Canonical hashing: sort message keys, normalize whitespace, version the schema. Same logical input → same key.
TTL eviction. Cache entries expire on their own. No manual cleanup.
Versioned invalidation when prompt templates change: namespace keys by template version and bump the version. Old entries age out via TTL.
Cache stampede: first request on a hot miss takes a single-flight lock. Concurrent callers wait for the same upstream call.
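
Two of these points, canonical keys and single-flight, can be sketched together. Everything here is illustrative: `CACHE_SCHEMA_VERSION`, `canonicalKey`, and `singleFlight` are hypothetical names, the key is left as a canonical string (a real gateway would presumably hash it, for example with SHA-256), and the lock is per process only.

```typescript
const CACHE_SCHEMA_VERSION = "v1"; // bump to invalidate every existing key

// Recursively sorts object keys and collapses whitespace in strings,
// so logically equal requests serialize identically.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const k of Object.keys(obj).sort()) out[k] = canonicalize(obj[k]);
    return out;
  }
  if (typeof value === "string") return value.trim().replace(/\s+/g, " ");
  return value;
}

function canonicalKey(req: unknown): string {
  return `${CACHE_SCHEMA_VERSION}:${JSON.stringify(canonicalize(req))}`;
}

// Single-flight: concurrent callers for the same key share one in-flight promise.
const inFlight = new Map<string, Promise<unknown>>();

function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Collapsing whitespace is an aggressive choice: it raises the hit rate but treats prompts that differ only in spacing (such as indented code) as equal, so some routes may want to skip that step. Deduplicating across replicas would need a shared lock in Redis rather than this in-memory map.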

Real numbers from the resume bullet: 68% DB QPS cut, hot-endpoint p99 from 480ms → 90ms. That is the classic cache-aside win. The trick is picking the right canonicalization so your hit rate is as high as it should be.

Async Workers · Redis Streams + DLQ

Some calls are too long for the HTTP request path: embedding 10K documents, fine-tune polling, image generation, summarizing a 200-page PDF. Offload to a queue. Return a job ID immediately. Worker pulls from the queue, runs the work, stores the result for polling.

Redis Streams · Consumer Group + DLQ

XADD jobs by producer · XREADGROUP by 3 workers · at-least-once delivery · retry on failure · DLQ after 3 attempts.


Why this matters: at-least-once delivery means a job may be processed more than once if a worker crashes mid-flight. Make your handlers idempotent (key on job ID, dedupe writes). Failures retry automatically up to 3 attempts, then land in jobs:dlq for human review. Set up a Prometheus alert on xlen(jobs:dlq) > 0.

Why Redis Streams (vs Kafka or DB-as-queue)

Option	Verdict	Why
DB-as-queue (Postgres LISTEN/NOTIFY, SKIP LOCKED)	~	Fine to ~5K msg/s, then connection pool starves. Good for v1 only.
Redis Streams	✓	You already run Redis. Consumer groups, at-least-once, DLQ. Up to ~50K msg/s.
Kafka	~	Overkill at this scale. Adds operational complexity. Pick if you already run Kafka or need replay history.
SQS	~	Reasonable on AWS. Loses Redis Streams' XPENDING visibility but cheaper at low volume.

// Producer side (API)
await redis.xadd(
  "jobs",
  "*",                              // auto-id
  "type", "embed-batch",
  "user_id", userId,
  "payload", JSON.stringify(payload),
);

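// Consumer side (worker)
while (true) {
  const reply = await redis.xreadgroup(
    "GROUP", "workers", `worker-${id}`,
    "COUNT", 10, "BLOCK", 2000,
    "STREAMS", "jobs", ">"
  );
  if (!reply) continue;  // BLOCK timed out, nothing new

  // Reply shape: [[streamName, [[entryId, [field, value, ...]], ...]]]
  for (const [, entries] of reply) {
    for (const [entryId, fields] of entries) {
      try {
        await handle(fields);                            // your business logic
        await redis.xack("jobs", "workers", entryId);    // mark done
      } catch (err) {
        // No ack: the entry stays in the pending list (visible via XPENDING)
        // until a worker reclaims it with XAUTOCLAIM, which bumps its delivery
        // count, or the janitor below moves it to jobs:dlq.
        logger.error({ entryId, err }, "job failed");
      }
    }
  }
}

// DLQ janitor (runs periodically)
// Extended XPENDING returns [entryId, consumer, idleMs, deliveryCount] per entry.
// The IDLE filter needs Redis 6.2+.
const stuck = await redis.xpending(
  "jobs", "workers", "IDLE", 60000, "-", "+", 100,  // only entries idle for 60s+
);
for (const [entryId, , , deliveryCount] of stuck) {
  if (deliveryCount >= MAX_ATTEMPTS) {
    // XPENDING carries no payload, so fetch the original entry first.
    const [entry] = await redis.xrange("jobs", entryId, entryId);
    if (entry) await redis.xadd("jobs:dlq", "*", "original_id", entryId, ...entry[1]);
    await redis.xack("jobs", "workers", entryId);
  }
}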

Idempotency is non-negotiable. At-least-once delivery means a job may be handled twice if a worker crashes after the upstream succeeded but before the ACK. Make handlers idempotent: key on the job's logical ID, write results to a table with a UNIQUE constraint, fail benignly on retry.
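
A minimal sketch of that idea follows, with an in-memory `ResultStore` standing in for a Postgres table that has a UNIQUE constraint on the job ID. The class and the `handleOnce` helper are hypothetical; `insertIfAbsent` plays the role of `INSERT ... ON CONFLICT DO NOTHING`.

```typescript
interface Job {
  id: string;      // logical job ID, stable across redeliveries
  payload: string;
}

// Stand-in for a results table with UNIQUE(job_id).
class ResultStore {
  private rows = new Map<string, string>();

  has(jobId: string): boolean {
    return this.rows.has(jobId);
  }

  // Returns true if the row was inserted, false if it already existed.
  insertIfAbsent(jobId: string, result: string): boolean {
    if (this.rows.has(jobId)) return false;
    this.rows.set(jobId, result);
    return true;
  }
}

// Safe to call any number of times for the same job.
async function handleOnce(
  job: Job,
  store: ResultStore,
  work: (job: Job) => Promise<string>,
): Promise<"processed" | "duplicate"> {
  // Fast path: a previous delivery already finished this job.
  if (store.has(job.id)) return "duplicate";
  const result = await work(job);
  // Two workers may race to this point; the unique insert picks one winner.
  return store.insertIfAbsent(job.id, result) ? "processed" : "duplicate";
}
```

Either return value should be followed by an XACK, since in both cases the result is durably stored.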

Reliability · Surviving Real Outages

From the resume bullet: kept p99 under 1.2s through multi-minute upstream 5xx incidents. Here is how each piece contributes:

Pattern	What it protects against
Circuit breaker	Burning latency budget on a dead upstream. Fail-fast.
Failover	Provider-wide outages. Anthropic picks up while OpenAI is down.
Hedging	The long tail (one slow request among many). p99 collapses toward the secondary's median.
Retry + jitter	Transient 5xx and network blips, without a thundering herd on recovery.
Cache-aside	Duplicate requests and hot keys during burst traffic. Also cuts upstream spend.
Per-provider token buckets	Upstream throttling. The gateway smooths bursts before the provider pushes back.
Async via Streams + DLQ	Long-running jobs. Survives worker crashes. Dead jobs surfaced for humans.
Stateless API + ALB	Replica crashes. ALB drains the bad node, others keep serving.
PG primary + read replicas	Read traffic for audit/analytics. Primary stays focused on writes.

Observability bare minimum

Per-route: request count, latency histogram (p50/p95/p99), success/error rate.
Per-upstream: same, broken out by provider. Auto-alert on 5xx rate > 5%.
Per-user: token usage and spend. Alert on anomalies (probable bot or compromised key).
Circuit breaker state changes logged as events. Page on prolonged OPEN.
DLQ depth: alert when XLEN(jobs:dlq) > 0.

Interview Follow-ups

Q1: Why not put the LLM gateway behind a CDN?

A CDN caches static responses for many users. LLM responses are per-request (different prompts) and often streamed. CDN edge caching is the wrong shape. The cache-aside layer in Redis is the right shape: hash on the exact request body, cache per-tenant. Cloudflare AI Gateway is the exception: it runs the gateway itself at the edge, which is a different design.

Q2: When does hedging hurt?

When your upstream is rate-limit-constrained (already throttling you). Hedging doubles the rate, you trip the upstream's limits, both calls fail. Solution: only hedge when the upstream's remaining budget is comfortable.

Q3: How would you handle PII in the cache?

The cache key is a hash so it doesn't leak content directly. The cached value might. Three options: don't cache PII-tagged routes; encrypt cached payloads with a per-tenant key; namespace cache by user so eviction is per-user. We picked the third (namespace) plus the first for sensitive routes.

Q4: Why per-replica circuit breakers, not a global one?

A global breaker means every replica sees the same state via Redis. Adds a round-trip per call, and one slow Redis read becomes a global single point of failure. Per-replica breakers converge to the same decision within seconds, are independent, and add zero latency.

Q5: How do you bill users when a request hedges?

Charge the user for the winning call only. Track the losing call internally for cost accounting (your bill from the loser still arrives). The hedge cost is gateway overhead, paid by the platform, not the customer.

Q6: How do you keep streaming connections alive across deploys?

Drain mode on the ALB: stop sending new connections to the replica being replaced, let in-flight streams finish, then terminate. Set the target group's deregistration delay (the connection draining timeout) ≥ your longest streaming response (typically 60s for LLM completions).

Q7: How would you support a new provider tomorrow?

Each provider has an adapter that maps the gateway's normalized request to the provider's API and the provider's response back. Add the adapter file, declare it in the routing config, add a circuit breaker entry. Zero changes to handler code. That is the whole point of the gateway abstraction.
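
An adapter can be as small as two functions plus a registry entry. The sketch below is deliberately simplified: the `ProviderAdapter` interface and `NormalizedRequest` shape are assumptions, and the field mappings cover only the basics of each provider's chat API (OpenAI keeps system prompts in `messages`; Anthropic takes a top-level `system` field).

```typescript
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Gateway-internal request shape (hypothetical).
interface NormalizedRequest {
  model: string;
  messages: Message[];
  maxTokens: number;
}

interface ProviderAdapter {
  name: string;
  toProviderBody(req: NormalizedRequest): Record<string, unknown>;
  extractText(body: any): string;
}

const openaiAdapter: ProviderAdapter = {
  name: "openai",
  toProviderBody: req => ({
    model: req.model,
    messages: req.messages,
    max_tokens: req.maxTokens,
  }),
  extractText: body => body.choices[0].message.content,
};

const anthropicAdapter: ProviderAdapter = {
  name: "anthropic",
  toProviderBody: req => ({
    model: req.model,
    // System prompts move out of the message list into a top-level field.
    system: req.messages
      .filter(m => m.role === "system")
      .map(m => m.content)
      .join("\n"),
    messages: req.messages.filter(m => m.role !== "system"),
    max_tokens: req.maxTokens,
  }),
  extractText: body => body.content[0].text,
};

// Adding a provider means adding one entry here; handlers only see the interface.
const adapters = new Map<string, ProviderAdapter>([
  [openaiAdapter.name, openaiAdapter],
  [anthropicAdapter.name, anthropicAdapter],
]);
```

A production adapter would also need to map streaming events, error codes, and usage fields, which this sketch leaves out.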
