The KV cache miss your load balancer caused

The prefill for a 6,000-token enterprise system prompt on Qwen3-32B takes about 4.3 seconds cold. With the prefix already in the KV cache, the second request for that same prefix takes 0.6 seconds. You lose 3.7 seconds every time a request lands on a pod that doesn’t hold the cache.

In an eight-pod cluster with round-robin routing, that miss rate is 87.5%. Most of your cluster’s prefill compute is re-deriving attention for tokens you already computed — on a different pod, five milliseconds ago.

llm-d is a Kubernetes inference scheduler that makes routing KV-cache-aware. On 16 H100s running Qwen3-32B, it moved p90 TTFT from 92 seconds to 0.54 seconds with no hardware changes. 170x. The rest of this post is the explanation.


# The cost of a KV cache miss

A transformer computes attention over the full context at every forward pass. For each new output token, it needs the key and value projections for every prior token in the sequence. The KV cache stores those projections so they don’t have to be recomputed.

For a model with $n_L$ layers, $n_h$ attention heads (after GQA/MQA), and head dimension $d_h$, caching a prefix of $L$ tokens costs:

$$\text{KV memory} = 2 \cdot n_L \cdot n_h \cdot d_h \cdot L \cdot \text{bytes\_per\_element}$$

The factor of 2 covers keys and values. In practice, vLLM splits KV memory into 128-token blocks and manages them with a paged allocator; the block is the unit of both storage and cache lookup.
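To put numbers on the formula, here is a minimal sketch. The model dimensions are assumptions for illustration (64 layers, 8 KV heads, head dimension 128, 2-byte bf16 elements, roughly the shape of a Qwen3-32B-class model); they are not taken from the benchmark write-up, so check them against your model's config before relying on the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_element: int = 2) -> int:
    """KV memory = 2 * n_L * n_h * d_h * L * bytes_per_element."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element


# Assumed dimensions (illustrative, not from the benchmark).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

per_token = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens=1)
prefix = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens=6000)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{prefix / 2**30:.2f} GiB for a 6,000-token prefix")
```

Under those assumptions a single cached system prompt occupies on the order of a gigabyte or more of HBM, which is why the cache cannot simply be replicated to every pod.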

The computation savings on a cache hit are direct: if a request carries a prefix of $L_p$ cached tokens and $L_n$ new tokens, the fraction of prefill work that disappears is:

$$\eta = \frac{L_p}{L_p + L_n}$$

With a 6,000-token system prompt and a 1,200-token query, $\eta = 6000 / 7200 \approx 0.83$. Eighty-three percent of the prefill computation is free if the cache hits. Anthropic prices this directly: \$3.00 per million uncached input tokens vs. \$0.30 cached, a 10x cost difference that reflects the compute ratio.
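The ratio is easy to evaluate for other prompt shapes. A small sketch follows; the second and third scenarios are hypothetical and only there to show how quickly the benefit shrinks when the shared prefix is short.

```python
def cached_fraction(cached_tokens: int, new_tokens: int) -> float:
    """eta = L_p / (L_p + L_n): share of prompt tokens whose KV is reused."""
    total = cached_tokens + new_tokens
    if total == 0:
        raise ValueError("request has no tokens")
    return cached_tokens / total


scenarios = {
    "6,000-token prefix, 1,200-token query": (6000, 1200),  # from the text
    "500-token prefix, 1,200-token query": (500, 1200),     # hypothetical
    "no shared prefix": (0, 1200),                           # hypothetical
}

results = {name: cached_fraction(*shape) for name, shape in scenarios.items()}
for name, eta in results.items():
    print(f"{name}: eta = {eta:.2f}")
```

This counts tokens, so it is a first-order estimate: it treats prefill work as proportional to token count and ignores that attention cost grows with position in the sequence.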


# Why naive routing destroys cache locality

vLLM’s prefix cache is in-process. Each pod maintains its own KV block index: a hash map from a 64-bit hash of each 128-token block’s content to a GPU memory address. A cache hit requires that the pod serving the request holds the exact prefix in its local cache.
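A rough sketch of that per-pod lookup is below. It is not vLLM's code: the function names, the SHA-256-truncated-to-64-bits hash, and the choice to chain each block's hash with its parent (so that a matching hash implies the whole prefix matches) are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 128  # tokens per block, as described above


def block_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Hash every full block, mixing in the parent block's hash."""
    hashes: List[int] = []
    parent = 0
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        h = hashlib.sha256()
        h.update(parent.to_bytes(8, "big"))
        for t in block:
            h.update(int(t).to_bytes(4, "big"))
        parent = int.from_bytes(h.digest()[:8], "big")  # keep 64 bits
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(hashes: Sequence[int], local_index: Dict[int, int]) -> int:
    """Count leading blocks found in this pod's index; stop at the first miss."""
    n = 0
    for h in hashes:
        if h not in local_index:
            break
        n += 1
    return n


# Stand-in token ids: a 6,000-token system prompt and a 1,200-token query.
system_prompt = list(range(6000))
request = system_prompt + [7] * 1200

# Hypothetical pod index: block hash -> GPU block slot.
pod_index = {h: slot for slot, h in enumerate(block_hashes(system_prompt))}

request_hashes = block_hashes(request)
hit = cached_prefix_blocks(request_hashes, pod_index)
print(hit, "of", len(request_hashes), "blocks cached")
```

Only full blocks are hashed, so the tail of the system prompt that shares a block with the start of the query is recomputed. A pod that has never seen the prefix has an empty index and matches zero blocks, no matter how warm its neighbors are.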

In a cluster of $k$ pods with round-robin routing, a prefix cached on exactly one pod gets a hit with probability:

$$P(\text{hit} \mid \text{round-robin}) = \frac{1}{k}$$

For $k = 8$: 12.5%. The other 87.5% of requests recompute from scratch. With cache-aware routing that steers each request to the pod with the highest prefix match:

$$P(\text{hit} \mid \text{precise}) \approx 1.0$$

provided the cluster has enough total KV capacity for the working set. At 73% utilization in the benchmark below, it does.

Expected savings shift from $0.125 \times 0.83 \approx 10\%$ to $1.0 \times 0.83 = 83\%$ of prefill work per request, an 8x difference in expectation. The measured TTFT improvement is 170x because of queueing amplification: a cache miss triggers a full prefill, which holds GPU memory longer, which reduces available batch slots, which delays other requests, which fills the queue. The system cascades. The vLLM wait queue in the benchmark averaged 27 requests under random routing and 0.1 requests under precise routing; the TTFT gap beyond that 8x comes from queueing, not raw compute.
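The expectation arithmetic can be checked in a few lines. This sketch assumes, as the text does, that the prefix lives on exactly one pod and that precise routing always finds it.

```python
def expected_prefill_savings(hit_probability: float, eta: float) -> float:
    """Expected share of prefill work skipped per request."""
    return hit_probability * eta


k = 8                              # pods
eta = 6000 / (6000 + 1200)         # cached fraction on a hit

round_robin = expected_prefill_savings(1 / k, eta)
precise = expected_prefill_savings(1.0, eta)

print(f"round-robin: {round_robin:.1%}")
print(f"precise:     {precise:.1%}")
print(f"ratio:       {precise / round_robin:.0f}x")
```

The ratio is just $k$: the compute argument alone never predicts more than a factor equal to the pod count, which is why the measured gap has to be explained by queueing.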


# What llm-d does

llm-d adds a routing layer above vLLM on Kubernetes. Three components:

InferencePool is a Kubernetes CRD grouping pods that serve the same model, in effect a “KV-aware Service.” It is the unit of routing policy and is being standardized in the Kubernetes Gateway API Inference Extension project.

Proxy is a standard L7 proxy (Envoy, Istio, or cloud-managed ALB) that handles connection management and TLS. It delegates the routing decision to the EPP via Envoy’s ext-proc external processing protocol.

Endpoint Picker (EPP) is the scheduler. It runs a filter → score → pick pipeline over candidate pods, using real-time pod metrics and KV cache state, and returns the selected pod address to the Proxy.
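The filter → score → pick shape can be sketched as follows. Everything here is hypothetical: the `Pod` fields, the queue threshold, and the weights are illustrative choices, not llm-d's plugin API or its defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    queue_depth: int           # waiting requests reported by the pod
    cache_hit_fraction: float  # share of this request's blocks already resident


def pick_pod(pods: List[Pod], max_queue: int = 32,
             cache_weight: float = 2.0, load_weight: float = 1.0) -> Pod:
    if not pods:
        raise ValueError("no candidate pods")

    # Filter: drop saturated pods, but never filter down to nothing.
    candidates = [p for p in pods if p.queue_depth < max_queue] or pods

    # Score: reward cache affinity, penalize queueing.
    def score(p: Pod) -> float:
        headroom = 1.0 - min(p.queue_depth, max_queue) / max_queue
        return cache_weight * p.cache_hit_fraction + load_weight * headroom

    # Pick: highest score wins; ties go to the shorter queue.
    return max(candidates, key=lambda p: (score(p), -p.queue_depth))


pods = [
    Pod("pod-a", queue_depth=3, cache_hit_fraction=0.0),
    Pod("pod-b", queue_depth=5, cache_hit_fraction=0.82),
    Pod("pod-c", queue_depth=40, cache_hit_fraction=0.82),  # saturated
]
chosen = pick_pod(pods)
print(chosen.name)
```

Here `pod-c` holds the prefix but is filtered out for being saturated, and `pod-b` beats the idle `pod-a` because the cache term outweighs a slightly longer queue. How those terms are balanced in a real deployment is a tuning decision.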

The EPP’s cache awareness comes from a continuous event stream. vLLM emits KVEvents, one event per block create or evict, and the EPP consumes them to maintain a KV-Block Index: a map from block hash to the set of pods holding that block and the memory tier (GPU HBM, CPU DRAM, or disk). When a request arrives, the EPP tokenizes its prefix, hashes each 128-token block, and queries the index: what fraction of this request’s KV state is already resident on each candidate pod?
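A minimal sketch of such an index is below. The class, its method names, and the event shape are hypothetical and do not reproduce the actual KVEvents message format; the sketch also ignores the memory tier when scoring, which a real scheduler might weight.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


class KVBlockIndex:
    """block hash -> set of (pod, tier) pairs holding that block."""

    def __init__(self) -> None:
        self._holders: Dict[int, Set[Tuple[str, str]]] = defaultdict(set)

    def on_stored(self, pod: str, block_hash: int, tier: str = "gpu") -> None:
        self._holders[block_hash].add((pod, tier))

    def on_removed(self, pod: str, block_hash: int, tier: str = "gpu") -> None:
        holders = self._holders.get(block_hash)
        if holders is None:
            return
        holders.discard((pod, tier))
        if not holders:
            del self._holders[block_hash]

    def prefix_scores(self, block_hashes: Sequence[int],
                      pods: Sequence[str]) -> Dict[str, float]:
        """Fraction of the request's blocks each pod holds as an unbroken prefix."""
        scores: Dict[str, float] = {}
        for pod in pods:
            matched = 0
            for h in block_hashes:
                if not any(p == pod for p, _tier in self._holders.get(h, ())):
                    break
                matched += 1
            scores[pod] = matched / len(block_hashes) if block_hashes else 0.0
        return scores


index = KVBlockIndex()
for h in (11, 22, 33):
    index.on_stored("pod-a", h)
index.on_stored("pod-b", 11)
index.on_removed("pod-a", 33)  # pod-a evicts its third block

scores = index.prefix_scores([11, 22, 33, 44], ["pod-a", "pod-b", "pod-c"])
print(scores)
```

The score counts only the unbroken leading run of blocks, because a cached block is useful to prefill only if everything before it is cached too.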

The metadata overhead is negligible. Managing the full KV cache of an 8× H200 DeepSeek R1 cluster — 365 GB of KV VRAM — requires 339 KB of index state on the scheduler side. Data-to-metadata ratio: over 1,000,000:1. The scheduler has a complete real-time map of every cached block across the cluster for essentially nothing.


# Precise vs. approximate

The EPP ships two scheduling modes:

Approximate builds a routing history: if past requests with this prefix hash went to pod A, steer future ones there. No KVEvents stream required; works with any vLLM version. Cost: the index is a guess. Pods evict blocks under memory pressure without the scheduler knowing, so affinity decisions can become stale.

Precise consumes the live KVEvents stream for an exact real-time view. The scheduler knows which blocks are resident where and computes a true cache affinity score per pod. Cost: vLLM must expose the KVEvents API (current vLLM releases do), and the EPP maintains more state.
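The difference shows up the moment a pod evicts. The toy routers below are hypothetical stand-ins for the two modes, reduced to a single prefix hash, and only illustrate how a routing history goes stale while an event-fed index does not.

```python
from typing import Dict, Set


class ApproximateRouter:
    """Remembers where each prefix was last sent; never hears about evictions."""

    def __init__(self) -> None:
        self._last_pod: Dict[int, str] = {}

    def route(self, prefix_hash: int, fallback_pod: str) -> str:
        pod = self._last_pod.get(prefix_hash, fallback_pod)
        self._last_pod[prefix_hash] = pod
        return pod


class PreciseRouter:
    """Tracks residency from store/evict events reported by the pods."""

    def __init__(self) -> None:
        self._resident: Dict[int, Set[str]] = {}

    def on_stored(self, pod: str, prefix_hash: int) -> None:
        self._resident.setdefault(prefix_hash, set()).add(pod)

    def on_evicted(self, pod: str, prefix_hash: int) -> None:
        self._resident.get(prefix_hash, set()).discard(pod)

    def route(self, prefix_hash: int, fallback_pod: str) -> str:
        holders = self._resident.get(prefix_hash)
        # min() only keeps the example deterministic; any holder would do.
        return min(holders) if holders else fallback_pod


approx, precise = ApproximateRouter(), PreciseRouter()
PREFIX = 42  # stand-in prefix hash

# First request: nobody holds the prefix, so both fall back to pod-a.
approx.route(PREFIX, fallback_pod="pod-a")
precise.route(PREFIX, fallback_pod="pod-a")
precise.on_stored("pod-a", PREFIX)

# pod-a evicts the blocks under memory pressure; only precise mode is told.
precise.on_evicted("pod-a", PREFIX)

# Next request: pod-b is now the least-loaded fallback.
stale_choice = approx.route(PREFIX, fallback_pod="pod-b")
exact_choice = precise.route(PREFIX, fallback_pod="pod-b")
print(stale_choice, exact_choice)
```

The approximate router keeps sending traffic to `pod-a` for a cache entry that no longer exists, while the precise router knows the request is a miss everywhere and is free to place it by load.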

The benchmark: 8 pods, 16 H100s, Qwen3-32B TP=2, 307,328-token KV cache per pod. Workload: 150 enterprise customers × 5 users each, 6,000-token system prompts, 1,200-token queries, 3–60 QPS. Total KV demand: 73% of cluster capacity.
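One way to arrive at the 73% figure is sketched below. It assumes the working set is one cached system prompt per customer plus one cached query per user; that accounting is an inference from the numbers above, not something the benchmark description states.

```python
customers = 150
users_per_customer = 5
system_prompt_tokens = 6_000
query_tokens = 1_200
pods = 8
kv_tokens_per_pod = 307_328

# Assumption: one system prompt per customer, one query per user.
demand = (customers * system_prompt_tokens
          + customers * users_per_customer * query_tokens)
capacity = pods * kv_tokens_per_pod
utilization = demand / capacity

print(f"demand      = {demand:,} tokens")
print(f"capacity    = {capacity:,} tokens")
print(f"utilization = {utilization:.1%}")
```

The point of the exercise is that the working set fits only in aggregate: no single pod can hold it, so the hit rate depends on each request reaching the pod that holds its share.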

| Scheduler | Throughput (tok/s) | TTFT p90 (s) | TTFT mean (s) | Queue depth (mean) |
|---|---|---|---|---|
| precise | 8730 | 0.54 | 0.30 | 0.1 |
| approximate | 6944 | 31.1 | 13.3 | 8.1 |
| load-aware | 4429 | 94.9 | 47.0 | 28.9 |
| random | 4429 | 92.6 | 45.3 | 27.3 |

Source: llm-d v0.5 release — Precise Inference Scheduling

Load-aware and random are nearly identical: when you don’t know where the cache is, how you distribute load doesn’t matter much. Approximate cuts the mean queue depth to 8 (from 28) by guessing right most of the time, but collapses at the p90 tail when guesses go stale. Precise holds at 0.54s because the scheduler never guesses.

The gap between approximate and precise — 31s vs. 0.54s TTFT — is not a marginal improvement. It is a different operating regime.


# What this maps to

This is a cache locality problem of the kind backend engineers have been solving for twenty years: hot tier, cold tier, miss penalty, routing policy. The KV cache is the hot tier. Every miss is a full recompute. The router is the only component that can enforce locality, because neither the model server (which doesn’t know about other pods’ caches) nor the Kubernetes Service (which doesn’t know about LLM semantics) can make the decision.

The architecture is composable: EPP plugs into Envoy’s ext-proc protocol, which makes it a drop-in addition to any existing Kubernetes networking stack. No changes to the model serving containers; no specialized networking fabric required for the routing layer itself (disaggregated prefill/decode needs RDMA, but that’s a separate feature).

For engineers who’ve spent time on Redis cache-aside patterns, consistent hashing for cache affinity, or the latency math of tiered storage — the concepts transfer directly. The penalty for a miss is just measured in seconds instead of microseconds.


llm-d is Apache 2.0, CNCF Sandbox (March 2026). Source: github.com/llm-d/llm-d. Benchmarks reproduced from the v0.5 release post at llm-d.ai. Numbers above are from the precise-scheduling benchmark on the public benchmark platform at prism.llm-d.ai.