Animesh Kumar Mishra

Data compression math: what each database actually stores on disk

2025-05-13T18:30:00.000Z

A transaction_status column holds three values: APPROVED, DECLINED, PENDING. In a careless schema each row stores a VARCHAR(20), call it 8 bytes on average. At 10 million rows that is 80 MB. For a 70/20/10 split, Shannon’s entropy bound puts the minimum at about 1.16 bits per symbol, which is roughly 1.45 MB. The gap is about 55x, and it exists entirely because of the choices the storage engine makes (or doesn’t make) between serialisation and disk.

This post is the math behind those choices across four storage architectures.

#The floor: Shannon entropy

For a random variable $X$ with $n$ outcomes and probabilities $p_1, \ldots, p_n$:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \quad \text{bits/symbol}

For the status column with distribution APPROVED=70%, DECLINED=20%, PENDING=10%:

H = -(0.70 \log_2 0.70 + 0.20 \log_2 0.20 + 0.10 \log_2 0.10) \approx 1.16 \text{ bits/symbol}

No lossless compressor beats this on average when rows are independent draws from that distribution. Everything below is about how close each engine gets, and what it gives up to get there.
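As a quick check on the arithmetic above, the entropy and the resulting storage floor can be computed directly. This is a minimal sketch: the function name is illustrative, and the 8-byte raw size is the rough average assumed earlier in this post.

```python
import math

def shannon_entropy_bits(probabilities):
    """Entropy in bits per symbol for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Distribution from the text: APPROVED, DECLINED, PENDING.
status_probs = [0.70, 0.20, 0.10]
h = shannon_entropy_bits(status_probs)

rows = 10_000_000
floor_bytes = h * rows / 8   # entropy floor for the whole column
raw_bytes = 8 * rows         # assumed 8-byte average VARCHAR payload

print(f"H = {h:.2f} bits/symbol")
print(f"floor = {floor_bytes / 1e6:.2f} MB, raw = {raw_bytes / 1e6:.0f} MB")
print(f"gap = {raw_bytes / floor_bytes:.0f}x")
```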

#RDBMS (PostgreSQL)

PostgreSQL stores rows in 8 KB heap pages. A row with a long text or JSONB column triggers TOAST (The Oversized-Attribute Storage Technique) once its size crosses the toast_tuple_target threshold (default: 2 KB).

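TOAST first tries to compress the attribute in place and moves it to a separate TOAST table only if the row is still too large. Two algorithms are selectable for TOAST (the default_toast_compression setting); ZSTD is listed for comparison because PostgreSQL 15 added it for WAL and base-backup compression, not for TOAST:

Algorithm	Compression ratio (typical text)	Decompression throughput
pglz (built in; still the default)	2.0–2.5x	~500 MB/s
LZ4 (TOAST, PG 14+)	2.0–3.0x	~4 GB/s
ZSTD (WAL and backups, PG 15+)	3.0–5.0x	~1.5 GB/s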

LZ4 is the right default for any latency-sensitive path. ZSTD wins on bulk analytics where decompression throughput is not the bottleneck.

Dictionary encoding changes the picture entirely for low-cardinality columns. PostgreSQL does not do it natively for row-oriented heaps, but Citus columnar and Parquet-backed foreign tables do. The status column with three values needs $\lceil \log_2 3 \rceil = 2$ bits per row in a dictionary scheme, versus 8 bytes raw. On 10M rows that is 2.5 MB instead of 80 MB:

\text{savings} = \frac{8 \times 8 \text{ bits} - 2 \text{ bits}}{8 \times 8 \text{ bits}} \approx 96.9\%

This is why columnar formats compress enum-like columns near-perfectly even before applying a secondary byte-level compressor on top.
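A minimal sketch of dictionary encoding with bit-packed codes, to show where the 2 bits per row come from. The function names and the little-endian bit layout are assumptions of this sketch; real columnar formats use their own page layouts and usually add run-length encoding on top.

```python
import math

def pack_codes(codes, bits):
    """Pack small integer codes into bytes, `bits` bits per code."""
    out = bytearray()
    acc = 0      # bit accumulator
    n_acc = 0    # number of valid bits currently in acc
    for c in codes:
        acc |= c << n_acc
        n_acc += bits
        while n_acc >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            n_acc -= 8
    if n_acc:
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_codes(data, bits, count):
    """Inverse of pack_codes: read `count` codes of `bits` bits each."""
    mask = (1 << bits) - 1
    codes, acc, n_acc = [], 0, 0
    it = iter(data)
    for _ in range(count):
        while n_acc < bits:
            acc |= next(it) << n_acc
            n_acc += 8
        codes.append(acc & mask)
        acc >>= bits
        n_acc -= bits
    return codes

def dict_encode(values):
    """Return (dictionary, packed codes, bits per code)."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    bits = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, pack_codes([code_of[v] for v in values], bits), bits

def dict_decode(dictionary, data, bits, count):
    return [dictionary[c] for c in unpack_codes(data, bits, count)]

column = ["APPROVED"] * 700 + ["DECLINED"] * 200 + ["PENDING"] * 100
dictionary, packed, bits = dict_encode(column)
print(bits, len(packed))  # bits per code, packed size in bytes
```

The decoder needs the row count because the last byte may be only partly used.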

#ScyllaDB

ScyllaDB compresses at the SSTable chunk level. Each SSTable is a sequence of fixed-size chunks; each chunk is compressed independently. The default chunk size is 4 KB.

Chunk-level independence has a cost: every single-row read must decompress the full chunk that contains it.

\text{read amplification (rows)} = \frac{\text{chunk\_size\_bytes}}{\text{avg\_row\_size\_bytes}}

At chunk = 4 KB and avg row = 128 B: 32 rows decompressed per point read. Increasing the chunk to 64 KB improves compression ratio by 15–25% (longer runs → better LZ back-references) but raises read amplification to 512x.
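The read-amplification formula is easy to tabulate. This sketch assumes the chunk size refers to uncompressed bytes and that rows are uniformly sized; the function name is illustrative.

```python
def rows_per_chunk(chunk_size_bytes, avg_row_size_bytes):
    """Rows that must be decompressed to serve one point read."""
    return chunk_size_bytes / avg_row_size_bytes

avg_row = 128  # bytes, as in the example above
for chunk_kb in (4, 16, 64):
    amp = rows_per_chunk(chunk_kb * 1024, avg_row)
    print(f"chunk = {chunk_kb:>2} KB -> {amp:.0f} rows decompressed per point read")
```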

Supported algorithms and typical ratios on event-stream data:

Algorithm	Ratio (time-series)	Ratio (UUID-heavy)	Notes
LZ4	3–5x	1.5–2x	Default; lowest CPU cost
Snappy	2.5–4x	1.4–1.8x	Slightly lower ratio than LZ4
ZSTD	4–8x	2–3x	Best ratio; higher CPU on write path
Deflate	3–6x	2–2.5x	Slowest; avoid unless storage-constrained

The write path keeps data uncompressed in the memtable (RAM) and compresses only on SSTable flush. On the read side, ScyllaDB bypasses the Linux page cache and serves hot data from its own row cache, which holds decompressed rows; Cassandra leans on the OS page cache, which holds SSTable chunks in compressed form. At high read concurrency the cache hit rate dominates latency far more than the compression ratio.

#Graph databases

A graph $G = (V, E)$ needs to store adjacency. Three representations:

Adjacency matrix: $|V|^2$ bits. At $|V| = 1\text{M}$ nodes: $10^{12}$ bits = 125 GB (about 116 GiB) even for a bit-packed binary matrix. Viable only for dense graphs.

Adjacency list: $O(|V| + |E|)$ with pointer overhead per node — typically 24–48 bytes per node entry plus 4–8 bytes per edge.

Compressed Sparse Row (CSR): two flat arrays.

row_ptr[0..V]: V+1 integers; row_ptr[i] is the index into col_idx where node i’s neighbors begin
col_idx[0..E-1]: E integers; the actual neighbor IDs

\text{CSR size} = (|V| + 1 + |E|) \times 4 \text{ bytes (int32)}

For $|V| = 1\text{M}$, $|E| = 10\text{M}$: $(1{,}000{,}001 + 10{,}000{,}000) \times 4 = 44 \text{ MB}$, versus 125 GB for the adjacency matrix. CSR is what graph analytics engines (GraphX, cuGraph, igraph) use internally.
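The two-array layout can be sketched in a few lines. This is an illustrative builder for a directed edge list, not the code any of the engines above use; the function names are made up for the example.

```python
def build_csr(num_nodes, edges):
    """Build CSR arrays from a list of directed (src, dst) edges."""
    row_ptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1            # count out-degree per node
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]     # prefix sum -> start offsets
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1]             # slicing copies: next free slot per node
    for src, dst in sorted(edges):       # sorting keeps each neighbor list ordered
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, node):
    return col_idx[row_ptr[node]:row_ptr[node + 1]]

def csr_size_bytes(num_nodes, num_edges, int_bytes=4):
    return (num_nodes + 1 + num_edges) * int_bytes

row_ptr, col_idx = build_csr(4, [(0, 2), (0, 1), (1, 2), (3, 0)])
print(row_ptr, col_idx, neighbors(row_ptr, col_idx, 0))
print(csr_size_bytes(1_000_000, 10_000_000))
```

A node with no outgoing edges costs only its one row_ptr entry, which is why the format scales with $|V| + |E|$ rather than $|V|^2$.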

Neo4j uses a different approach, index-free adjacency: each node record stores a pointer to its first relationship record, and each relationship record is a node in a doubly-linked list. This trades memory compactness for O(degree) traversal without an index lookup. Compared with CSR, Neo4j’s native format uses more bytes per edge but achieves sub-millisecond hop traversal because no secondary index is touched.

Delta encoding on sorted neighbor lists further reduces CSR storage. If neighbors of node $i$ are sorted, store differences rather than absolute IDs:

\delta_j = \text{neighbor}_j - \text{neighbor}_{j-1}

For a social graph where the average node ID delta is small (local community structure), varint-encoding $\delta_j$ reduces the col_idx array by 60–70% over fixed int32. This is what WebGraph and similar large-scale graph compression formats do.
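A minimal sketch of delta plus varint encoding for one sorted neighbor list. It uses a LEB128-style unsigned varint and stores the first ID as a delta from zero; both are assumptions of this sketch, and formats such as WebGraph use more elaborate codes.

```python
def encode_varint(n):
    """Unsigned varint: 7 payload bits per byte, high bit set when more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_neighbors(sorted_ids):
    """Delta-encode a sorted ID list, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for nid in sorted_ids:
        out += encode_varint(nid - prev)
        prev = nid
    return bytes(out)

def decode_neighbors(data):
    ids, prev, cur, shift = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                   # continuation byte: keep accumulating
        else:
            prev += cur                  # delta complete: add to running ID
            ids.append(prev)
            cur, shift = 0, 0
    return ids

ids = [1_000_000, 1_000_003, 1_000_010, 1_000_200]
encoded = encode_neighbors(ids)
print(len(encoded), "bytes versus", 4 * len(ids), "bytes as int32")
```

The saving depends entirely on how small the deltas are, so it only pays off when node IDs are assigned with some locality.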

#Vector databases

This is where compression math gets structurally different. The compression is lossy — you trade exact distance preservation for storage and compute savings.

Baseline: float32

A $d$ -dimensional embedding stored as float32: $d \times 4$ bytes.

Model	Dimensions	Bytes/vector	10M vectors
text-embedding-3-small	1536	6,144 B	61.4 GB
text-embedding-3-large	3072	12,288 B	122.9 GB
Cohere Embed v3	1024	4,096 B	41.0 GB

These figures are before the HNSW index. Add roughly $M \times 8$ bytes per vector for the graph layer (M=16 is common), so another 1.3 GB for 10M vectors.

Scalar quantization (SQ8)

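Map each float32 dimension to uint8 via a per-dimension affine transform (the superscript $(d)$ below indexes a single dimension, and $\min_d$, $\max_d$ are taken over the corpus for that dimension):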

x_q^{(d)} = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x^{(d)} - \min_d}{\max_d - \min_d} \times 255\right),\, 0,\, 255\right)

Store $\min_d$ and $\max_d$ per dimension (2 floats × $d$ = negligible). Each vector: $d$ bytes. Compression: 4x. Recall@10 degradation on MSMARCO-style benchmarks: typically 0.5–1.5%.
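The affine transform above, as a dependency-free sketch. Function names are illustrative, and the guard for constant dimensions is an addition of this sketch; a production implementation would vectorise this and may clip outliers before fitting the range.

```python
def sq8_fit(vectors):
    """Per-dimension min and max over the corpus."""
    dims = range(len(vectors[0]))
    lo = [min(v[j] for v in vectors) for j in dims]
    hi = [max(v[j] for v in vectors) for j in dims]
    return lo, hi

def sq8_encode(vec, lo, hi):
    """Quantize one vector to one byte per dimension."""
    codes = []
    for x, a, b in zip(vec, lo, hi):
        scale = (b - a) or 1.0               # guard against constant dimensions
        q = round((x - a) / scale * 255)
        codes.append(min(255, max(0, q)))    # clamp to the uint8 range
    return bytes(codes)

def sq8_decode(codes, lo, hi):
    """Approximate reconstruction from the stored codes."""
    return [a + c / 255 * ((b - a) or 1.0) for c, a, b in zip(codes, lo, hi)]

corpus = [[0.0, -1.0, 5.0], [1.0, 1.0, 5.0], [0.25, 0.3, 5.0]]
lo, hi = sq8_fit(corpus)
codes = sq8_encode(corpus[2], lo, hi)
print(list(codes), sq8_decode(codes, lo, hi))
```

For in-range values the reconstruction error per dimension is at most half a quantization step, $(\max_d - \min_d)/510$.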

Product quantization (PQ)

Split the $d$-dimensional space into $m$ sub-spaces of $d/m$ dimensions. For each sub-space, learn $k$ centroids offline via k-means. At indexing time, store for each vector the index of its nearest centroid in each sub-space: $\log_2 k$ bits per sub-space.

With $k = 256$ (8 bits per sub-space), each vector stores $m$ bytes:

\text{compression ratio} = \frac{d \times 4}{m}

For $d = 1536$ , $m = 96$ : ratio = $\frac{6144}{96} = 64\times$ . The 10M float32 corpus shrinks from 61.4 GB to 960 MB.

Distance at query time uses asymmetric distance computation (ADC): precompute a lookup table of distances from each query sub-vector to all $k$ centroids in its sub-space ($m \times k$ entries), then approximate each database vector’s distance as a sum of $m$ table lookups. The scan does no multiply-accumulate against stored vectors, just table reads and additions.

\|q - x\|^2 \approx \sum_{i=1}^{m} \text{dist\_table}[i][\text{code}[i][x]]

PQ recall degrades more than SQ8 — typically 3–8% at Recall@10 for $m = d/16$ . The tradeoff is navigated by adjusting $m$ (more sub-spaces = better recall, larger index).
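A small sketch of PQ encoding and the ADC lookup. It assumes the codebooks already exist (the k-means training step is omitted) and uses toy sizes, $d = 4$, $m = 2$, $k = 2$; all function names are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Cut a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def pq_encode(vec, codebooks):
    """codebooks[i] is the centroid list for sub-space i; returns one code per sub-space."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(subs, codebooks)]

def adc_table(query, codebooks):
    """m x k table of squared distances from each query sub-vector to each centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, c) for c in cb] for sub, cb in zip(subs, codebooks)]

def adc_distance(table, codes):
    """Approximate squared distance: one table lookup per sub-space."""
    return sum(table[i][code] for i, code in enumerate(codes))

# Toy codebooks standing in for trained k-means centroids.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
x = [0.9, 1.1, 0.1, 0.8]
codes = pq_encode(x, codebooks)
query = [0.0, 0.0, 0.0, 0.0]
print(codes, adc_distance(adc_table(query, codebooks), codes))
```

The ADC value equals the exact squared distance from the query to the reconstructed vector (the concatenation of the chosen centroids), so all of the error comes from replacing $x$ with its reconstruction.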

Binary quantization (BQ)

x_b^{(d)} = \mathbb{1}[x^{(d)} > 0] \in \{0, 1\}

Storage: $d/8$ bytes per vector. 32x compression. Distance: Hamming distance via POPCNT, which CPUs execute in a single instruction per 64-bit word.

\text{hamming}(q_b, x_b) = \text{popcount}(q_b \oplus x_b)

BQ works well only for embedding models with approximately symmetric dimension distributions (Matryoshka representations, Cohere Embed v3 trained with BQ in mind). On general-purpose embeddings it loses 5–15% Recall@10. Qdrant, Weaviate, and Vespa all support it; Pinecone does not as of this writing.
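Sign quantization and Hamming distance fit in a few lines. This sketch packs the bits into a single Python integer and counts set bits in pure Python; a real engine would pack into 64-bit words and use the hardware popcount. Function names are illustrative.

```python
def bq_encode(vec):
    """Pack sign bits into one int: bit j is 1 when vec[j] > 0."""
    bits = 0
    for j, x in enumerate(vec):
        if x > 0:
            bits |= 1 << j
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

q = bq_encode([0.3, -0.1, 0.7, -0.9])
x = bq_encode([0.2, 0.4, -0.5, -0.1])
print(hamming(q, x))
```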

#Summary

Engine	Primary technique	Typical ratio	Lossy?
PostgreSQL (row)	LZ4 / ZSTD on TOAST values	2–5x	No
PostgreSQL (columnar)	Dictionary + LZ4/ZSTD	10–50x on low-cardinality	No
ScyllaDB	Chunk-level LZ4/ZSTD on SSTables	3–8x	No
Graph (CSR + delta)	Delta-varint encoding	3–5x over adjacency list	No
Vector (SQ8)	Per-dimension affine quantization	4x	Yes (< 2% recall loss)
Vector (PQ)	Sub-space centroid codes	16–128x	Yes (3–8% recall loss)
Vector (BQ)	Sign quantization + Hamming	32x	Yes (model-dependent)

The RDBMS and ScyllaDB cases are lossless — the data that comes out matches what went in, the compressor is just exploiting redundancy in the byte stream. Vector quantization is structurally different: you are approximating the metric space, and the approximation quality is measurable (Recall@k) and tunable. Getting the tuning right requires knowing both your distance distribution and your recall SLO before you pick $m$ or the quantization scheme.

Primary sources: Zstd compression levels, ScyllaDB Compression documentation, Product Quantization — Jégou et al. 2011, Qdrant quantization docs.

Math is substrate

2025-04-20T18:30:00.000Z

There is a view of mathematics as a tool — a language humans invented to describe patterns they observed. Useful, precise, but ultimately a map drawn by mapmakers. The territory is physical; the map is symbolic.

I don’t hold that view. I think the map is the territory.

#The hierarchy

Physics is applied mathematics. Every physical law is a mathematical statement: $F = ma$ , $E = mc^2$ , $i\hbar \partial\psi/\partial t = \hat{H}\psi$ . Remove the math and there is no physics — just hand-waving about objects moving and energy changing. The math is not a description of the physics; the math is the physics, made legible.

Chemistry is applied physics. Molecular geometry is quantum mechanical probability distributions. The bond angles in water ( $104.5°$ ) are not arbitrary — they are the solution to a Schrödinger equation for the electron configuration of oxygen. The Gibbs free energy $\Delta G = \Delta H - T\Delta S$ decides which reactions happen. Thermodynamics is just statistical mechanics applied to many-body quantum systems. Chemistry is physics at the molecular scale, which is mathematics at the molecular scale.

Biology is applied chemistry. The double helix is stable because of hydrogen bond energetics, which are electrostatic, which are quantum mechanical, which are mathematical. Protein folding — the problem that occupied structural biology for fifty years — is an optimisation problem over an energy landscape defined by physical forces that are defined by mathematics. Evolution is a search algorithm over genotype space. The logistic growth equation $dN/dt = rN(1 - N/K)$ emerges from the same conservation-of-resources math as any constrained optimisation.

Life is emergent biology. Consciousness, culture, language, cities — all of it is patterns running on biological substrate, which is chemical substrate, which is physical substrate, which is mathematical substrate.

#What this means practically

If you accept this view, then learning mathematics is not a career investment. It is literacy. An engineer who cannot read mathematics is in the same position as an engineer who cannot read — they can do a great deal by pattern-matching and imitation, but they cannot go to first principles when imitation fails.

Most engineering failures I have seen were not failures of implementation. They were failures of model. Someone built the wrong thing, correctly. The wrong model was usually not a software architecture mistake; it was a mathematical mistake — an assumption that did not hold, a distribution that was not stationary, a latency budget that did not add up.

The habit I am trying to build on this blog: before writing the code, write the equation. If you cannot write the equation, you do not understand the system.

This is a note, not an argument. I am not claiming to have proved anything. I am reporting what working in production data systems for fourteen years has made me believe.

Inside a BNPL fraud score: the 300 ms budget and where it goes

2025-03-07T18:30:00.000Z

A BNPL checkout approval has one constraint that shapes every architectural decision: the latency budget. The user tapped “Pay Later.” Their thumb is already moving toward the confirm button. Somewhere between 150 and 300 milliseconds, a spinner stops feeling like “loading” and starts feeling like “something is wrong.”

Inside that window the risk engine must fetch a feature vector, score it, apply rule vetoes, and commit a decision your compliance team can defend twelve months later.

#Where the 300 ms goes

Step	Budget	Bottleneck
Feature fetch (Redis)	5–12 ms	Network RTT + serialisation
Feature fetch (Postgres)	8–25 ms	Index scan + connection pool
Model inference	2–8 ms	Serialised feature vector
Rule engine	< 1 ms	In-memory
Audit write (async)	off critical path	Kafka producer
Network + overhead	10–20 ms	Service mesh, TLS
Total p99 budget	< 300 ms

The model is not the bottleneck. The feature fetch is.

#The feature staleness tradeoff

A Redis lookup costs ~1–3 ms and returns a feature vector that is only as fresh as the upstream stream processor. If the user made a transaction 4 seconds ago that would change their velocity feature, and the Kafka consumer lag is 6 seconds, the model scores a stale vector.

A Postgres lookup costs 8–25 ms and returns fresh data — but at p99, under connection pool pressure, it can spike to 80 ms and breach the SLO.

The decision: which features are read from Redis (fast, slightly stale) and which from Postgres (slow, fresh) is the primary engineering decision in a real-time risk system. It is not a data science decision. It is a latency-vs-staleness tradeoff that only the engineer who has seen both p99s can make correctly.

The math is not complicated:

\text{expected loss from staleness} = P(\text{fraud} \mid \text{stale feature}) \times \text{loss per fraud} - P(\text{fraud} \mid \text{fresh feature}) \times \text{loss per fraud}

If the staleness window is 6 seconds and the fraud velocity feature changes meaningfully in 6 seconds for fewer than 0.1% of sessions, the expected loss is smaller than the p99 latency cost of going to Postgres.
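The expected-loss expression can be evaluated with placeholder numbers. Every input below is hypothetical, chosen only to show the shape of the calculation; the decomposition of the stale fraud rate into "share of sessions affected" times "extra fraud probability when affected" is one possible way to bound it, not a formula from a specific system.

```python
def staleness_loss_per_session(p_fraud_stale, p_fraud_fresh, loss_per_fraud):
    """Expected extra loss per session from scoring on stale features."""
    return (p_fraud_stale - p_fraud_fresh) * loss_per_fraud

# All inputs are hypothetical placeholders, not figures from a real portfolio.
p_fresh = 0.0020                 # fraud rate when features are fresh
affected_share = 0.001           # sessions whose velocity feature moved within the lag window
extra_fraud_if_affected = 0.05   # added fraud probability for those sessions
p_stale = p_fresh + affected_share * extra_fraud_if_affected
loss_per_fraud = 200.0           # currency units

extra = staleness_loss_per_session(p_stale, p_fresh, loss_per_fraud)
print(f"expected extra loss per session: {extra:.4f}")
```

That per-session figure is what gets weighed against the cost of the slower Postgres read for the same feature.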

#What the audit trail requires

Every decision must be auditable: which model version, which features, which rules fired, what the score was, what the outcome was. This sounds obvious until you realise that if you store the feature vector at decision time, you own it — and if a regulator asks why a user was declined eighteen months later, you need to reconstruct the exact state of the world at that moment.

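The audit write is off the critical path: an asynchronous Kafka producer with at-least-once delivery, so the request thread never waits on the broker. The consumer writes to an append-only Postgres table. Because at-least-once delivery can produce duplicates, the consumer needs an idempotency key such as the decision ID. The feature snapshot is stored as JSONB alongside the decision.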

This is a working note. A longer post with the full data model, the rule engine architecture, and the false-positive economics is in progress.

ScyllaDB vs Cassandra: what the p99 actually looks like at fintech scale

2025-02-11T18:30:00.000Z

We ran Apache Cassandra in production for two years before migrating the user-identity lookup path to ScyllaDB. The decision was not made from a benchmark blog post. It was made after watching a p99 read latency of 180 ms on a 3-node Cassandra cluster serve a path that had a 50 ms SLO.

This post is a working note on what we measured, why Cassandra behaved the way it did, and what changed after the migration. A longer post with the full LSM-tree internals and compaction math is in progress.

#The problem in one number

A user-identity lookup on the BNPL approval path had a budget of 50 ms. Cassandra was hitting a p99 of 180 ms under load, 3.6x the budget, despite the cluster being at roughly 30% CPU utilisation.

The symptom was GC pauses. Cassandra’s JVM heap was collecting under read pressure, and the stop-the-world pauses were showing up directly in the tail latency.

ScyllaDB is a C++ reimplementation of the Cassandra storage engine with a shard-per-core architecture and no JVM. The GC pause problem is structurally absent.

#What the migration changed

Metric	Cassandra (3-node)	ScyllaDB (3-node)
p50 read latency	4 ms	1.8 ms
p99 read latency	180 ms	9 ms
p99.9 read latency	340 ms	22 ms
CPU at peak QPS	31%	18%

Same hardware, same data model, same replication factor. The p99 improvement is 20x. The p99.9 improvement is 15x.

#Why Cassandra’s p99 drifts at load

Cassandra’s read path merges data from the memtable and potentially multiple SSTables (after compaction, ideally one — but compaction is async and never perfectly caught up). Each SSTable read involves a bloom filter check, a partition index lookup, and a block read from disk or page cache.

Under concurrent read pressure, the JVM heap fills with bloom filter and index structures. When the GC fires — even a minor collection — every in-flight read on that node pauses. The pause duration is proportional to heap pressure, which is proportional to read concurrency.

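The math is simple: if GC fires every $T$ seconds and pauses for $\Delta t$ milliseconds, any request in flight during that window takes up to $\Delta t$ ms extra. The share of requests that overlap a pause is roughly the pause length plus the request’s own service time, divided by $T$ (all in the same units). Once that share exceeds 1%, the p99 is made up of requests that hit a pause; once it exceeds 0.1%, so is the p99.9.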
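A rough way to see why GC pauses dominate the tail: assuming pauses are strictly periodic and request arrivals are uniform in time, the share of requests that overlap a pause is (pause length + service time) / period. The numbers below are hypothetical, not measurements from the cluster described here.

```python
def fraction_hitting_pause(period_s, pause_ms, service_ms):
    """Share of requests whose lifetime overlaps a stop-the-world pause.

    Assumes strictly periodic pauses and arrivals uniform in time.
    """
    window_ms = pause_ms + service_ms
    return min(1.0, window_ms / (period_s * 1000))

# Hypothetical inputs: a 150 ms pause every 5 s, 4 ms normal service time.
share = fraction_hitting_pause(period_s=5, pause_ms=150, service_ms=4)
print(f"{share:.2%} of requests overlap a pause")
print("p99 is pause-dominated:", share > 0.01)
print("p99.9 is pause-dominated:", share > 0.001)
```

Under these assumptions the median barely moves while the tail percentiles jump, which matches the pattern of a healthy p50 next to a p99 far over budget.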

ScyllaDB eliminates this by managing memory natively in C++ (no garbage collector) and running one Seastar reactor per core with cooperative scheduling. Tail latency is bounded by I/O, not by the runtime.

Full post coming: LSM-tree compaction strategies, bloom filter false-positive rates, and the exact data model we used. Numbers above are from a production environment, redacted for specifics.

The KV cache miss your load balancer caused

2025-01-14T18:30:00.000Z

The prefill for a 6,000-token enterprise system prompt on Qwen3-32B takes about 4.3 seconds cold. With KV cache, the second request for that same prefix takes 0.6 seconds. You lose 3.7 seconds every time a request lands on a pod that doesn’t hold the cache.

In an eight-pod cluster with round-robin routing, that miss rate is 87.5%. Most of your cluster’s prefill compute is re-deriving attention for tokens you already computed — on a different pod, five milliseconds ago.

llm-d is a Kubernetes inference scheduler that makes routing KV-cache-aware. On 16 H100s running Qwen3-32B, it moved p90 TTFT from 92 seconds to 0.54 seconds with no hardware changes. 170x. The rest of this post is the explanation.

#The cost of a KV cache miss

A transformer computes attention over the full context at every forward pass. For each new output token, it needs the key and value projections for every prior token in the sequence. The KV cache stores those projections so they don’t have to be recomputed.

For a model with $n_L$ layers, $n_h$ key/value heads (the head count after GQA/MQA grouping), and head dimension $d_h$, caching a prefix of $L$ tokens costs:

\text{KV memory} = 2 \cdot n_L \cdot n_h \cdot d_h \cdot L \cdot \text{bytes\_per\_element}

The factor of 2 is keys and values. In practice, vLLM manages KV memory as a paged allocator over fixed-size token blocks (128 tokens throughout this post; the block size is a vLLM setting), and the block is the unit of both storage and cache lookup.

The computation savings on a cache hit are direct: if a request carries a prefix of $L_p$ cached tokens and $L_n$ new tokens, the fraction of prefill work that disappears is:

\eta = \frac{L_p}{L_p + L_n}

With a 6,000-token system prompt and 1,200-token query: $\eta = 6000 / 7200 = 0.83$ . Eighty-three percent of the prefill computation is free if the cache hits. Anthropic prices this directly: $3.00 per million uncached input tokens vs. $0.30 cached — a 10x cost difference that is a direct readout of the compute ratio.
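Both formulas in this section are one-liners. In the sketch below the model shape (64 layers, 8 KV heads, head dimension 128, 2 bytes per element) is an illustrative assumption and should be checked against the actual model config before relying on the byte counts; the function names are made up for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_element=2):
    """KV memory for a prefix: 2 (keys and values) x layers x heads x dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_element

def prefill_fraction_saved(cached_tokens, new_tokens):
    """Share of prefill work skipped when the cached prefix hits."""
    return cached_tokens / (cached_tokens + new_tokens)

# Assumed shape, for illustration only.
per_token = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, tokens=1)
prefix = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, tokens=6_000)
eta = prefill_fraction_saved(cached_tokens=6_000, new_tokens=1_200)

print(f"{per_token / 1024:.0f} KiB per token, {prefix / 1e9:.2f} GB for the 6,000-token prefix")
print(f"eta = {eta:.2f}")
```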

#Why naive routing destroys cache locality

vLLM’s prefix cache is in-process. Each pod maintains its own KV block index: a hash map from a 64-bit hash of each 128-token block’s content to a GPU memory address. A cache hit requires that the pod serving the request holds the exact prefix in its local cache.

In a cluster of $k$ pods with round-robin routing, a prefix cached on exactly one pod gets a hit with probability:

P(\text{hit} \mid \text{round-robin}) = \frac{1}{k}

For $k = 8$ : 12.5%. Every other request recomputes from scratch. With cache-aware routing that steers each request to the pod with the highest prefix match:

P(\text{hit} \mid \text{precise}) \approx 1.0

provided the cluster has enough total KV capacity for the working set. At 73% utilization in the benchmark below, it does.

Expected savings shift from $0.125 \times 0.83 = 10\%$ to $1.0 \times 0.83 = 83\%$ of prefill work per request, an 8x difference in expectation. The measured TTFT improvement is 170x because of queueing amplification: a cache miss causes a full prefill run, which holds GPU memory longer, which reduces available batch slots, which delays other requests, which fills the queue. The system cascades. The vLLM wait queue in the benchmark averaged 27 requests under random routing and 0.1 requests under precise routing, so most of the difference in TTFT is driven by queueing, not raw compute.
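The hit-rate argument can be checked with a toy round-robin loop. This sketch keeps the simplification used above: the prefix lives on exactly one pod, and a miss does not populate the cache anywhere else. Function names are illustrative.

```python
def round_robin_hit_rate(num_pods, num_requests, cached_pod=0):
    """Fraction of requests that land on the one pod holding the prefix."""
    hits = sum(1 for r in range(num_requests) if r % num_pods == cached_pod)
    return hits / num_requests

def expected_prefill_saved(hit_rate, eta):
    """Expected share of prefill work skipped per request."""
    return hit_rate * eta

eta = 6_000 / 7_200
rr = round_robin_hit_rate(num_pods=8, num_requests=8_000)
print(f"round-robin: hit rate {rr:.3f}, saved {expected_prefill_saved(rr, eta):.1%}")
print(f"cache-aware: hit rate 1.000, saved {expected_prefill_saved(1.0, eta):.1%}")
```

This captures only the compute side; the queueing cascade that turns the 8x gap into the measured TTFT difference is not modelled here.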

#What llm-d does

llm-d adds a routing layer above vLLM on Kubernetes. Three components:

InferencePool is a Kubernetes CRD grouping pods that serve the same model — a “KV-aware Service.” It is the unit of routing policy and is being standardized in the Kubernetes Gateway API Inference Extension SIG.

Proxy is a standard L7 proxy (Envoy, Istio, or cloud-managed ALB) that handles connection management and TLS. It delegates the routing decision to the EPP via Envoy’s ext-proc external processing protocol.

Endpoint Picker (EPP) is the scheduler. It runs a filter → score → pick pipeline over candidate pods, using real-time pod metrics and KV cache state, and returns the selected pod address to the Proxy.

The EPP’s cache awareness comes from a continuous event stream: vLLM emits a KVEvents stream — one event per block create or evict — and the EPP consumes it to maintain a KV-Block Index: a map from block hash to the set of pods holding that block and the memory tier (GPU HBM, CPU DRAM, or disk). When a request arrives, the EPP tokenizes its prefix, hashes each 128-token block, and queries the index: what fraction of this request’s KV state is already resident on each candidate pod?
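The index lookup described above can be sketched as a map from block hash to the set of pods holding that block. Everything here is hypothetical and written for illustration: the class and method names are not llm-d's API, the chained SHA-256 hashing is an assumption of this sketch, and memory tiers are left out.

```python
import hashlib
from collections import defaultdict

BLOCK_TOKENS = 128  # block size used in the text

def block_hashes(token_ids, block_tokens=BLOCK_TOKENS):
    """Chained 64-bit hashes of each full block, so a hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - block_tokens + 1, block_tokens):
        block = token_ids[start:start + block_tokens]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()[:8]
        hashes.append(digest)
        prev = digest
    return hashes

class KVBlockIndex:
    """Hypothetical scheduler-side index: block hash -> pods holding that block."""

    def __init__(self):
        self.pods_by_block = defaultdict(set)

    def on_block_stored(self, pod, block_hash):
        self.pods_by_block[block_hash].add(pod)

    def on_block_removed(self, pod, block_hash):
        self.pods_by_block[block_hash].discard(pod)

    def prefix_match(self, hashes, pod):
        """Number of leading request blocks already resident on `pod`."""
        matched = 0
        for h in hashes:
            if pod not in self.pods_by_block.get(h, ()):
                break
            matched += 1
        return matched

    def pick(self, hashes, pods):
        """Choose the pod with the longest resident prefix."""
        return max(pods, key=lambda p: self.prefix_match(hashes, p))

index = KVBlockIndex()
system_prompt = list(range(256))                     # two full blocks
for h in block_hashes(system_prompt):
    index.on_block_stored("pod-a", h)

request = block_hashes(system_prompt + list(range(1000, 1128)))
print(index.pick(request, ["pod-b", "pod-a"]))
```

A real scorer would combine this prefix match with load signals such as queue depth rather than picking on cache affinity alone.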

The metadata overhead is negligible. Managing the full KV cache of an 8× H200 DeepSeek R1 cluster — 365 GB of KV VRAM — requires 339 KB of index state on the scheduler side. Data-to-metadata ratio: over 1,000,000:1. The scheduler has a complete real-time map of every cached block across the cluster for essentially nothing.

#Precise vs. approximate

The EPP ships two scheduling modes:

Approximate builds a routing history: if past requests with this prefix hash went to pod A, steer future ones there. No KVEvents stream required; works with any vLLM version. Cost: the index is a guess. Pods evict blocks under memory pressure without the scheduler knowing, so affinity decisions can become stale.

Precise consumes the live KVEvents stream for an exact real-time view. The scheduler knows which blocks are resident where and computes a true cache affinity score per pod. Cost: vLLM must support the KVEvents API (supported in current vLLM), and the EPP maintains more state.

The benchmark: 8 pods, 16 H100s, Qwen3-32B TP=2, 307,328-token KV cache per pod. Workload: 150 enterprise customers × 5 users each, 6,000-token system prompts, 1,200-token queries, 3–60 QPS. Total KV demand: 73% of cluster capacity.
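The 73% figure can be reproduced from the stated workload under one reading of it: each customer's system prompt is held once and each user's query once. That reading is an assumption of this sketch, not something the benchmark description spells out.

```python
customers, users_per_customer = 150, 5
system_prompt_tokens, query_tokens = 6_000, 1_200
pods, kv_tokens_per_pod = 8, 307_328

# Assumed working set: one system prompt per customer plus one query per user.
demand = (customers * system_prompt_tokens
          + customers * users_per_customer * query_tokens)
capacity = pods * kv_tokens_per_pod
utilization = demand / capacity

print(f"demand = {demand:,} tokens, capacity = {capacity:,} tokens")
print(f"utilization = {utilization:.1%}")
```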

Scheduler	Throughput (tok/s)	TTFT p90 (s)	TTFT mean (s)	Queue depth (mean)
precise	8730	0.54	0.30	0.1
approximate	6944	31.1	13.3	8.1
load-aware	4429	94.9	47.0	28.9
random	4429	92.6	45.3	27.3

Source: llm-d v0.5 release — Precise Inference Scheduling

Load-aware and random are nearly identical: when you don’t know where the cache is, how you distribute load doesn’t matter much. Approximate gets the mean queue depth to 8 (from 28) by making better guesses most of the time, but collapses at the p90 tail when guesses go stale. Precise holds at 0.54s because the scheduler never guesses.

The gap between approximate and precise — 31s vs. 0.54s TTFT — is not a marginal improvement. It is a different operating regime.

#What this maps to

This is a cache locality problem of the kind backend engineers have been solving for twenty years: hot tier, cold tier, miss penalty, routing policy. The KV cache is the hot tier. Every miss is a full recompute. The router is the only component that can enforce locality, because neither the model server (which doesn’t know about other pods’ caches) nor the Kubernetes Service (which doesn’t know about LLM semantics) can make the decision.

The architecture is composable: EPP plugs into Envoy’s ext-proc protocol, which makes it a drop-in addition to any existing Kubernetes networking stack. No changes to the model serving containers; no specialized networking fabric required for the routing layer itself (disaggregated prefill/decode needs RDMA, but that’s a separate feature).

For engineers who’ve spent time on Redis cache-aside patterns, consistent hashing for cache affinity, or the latency math of tiered storage — the concepts transfer directly. The penalty for a miss is just measured in seconds instead of microseconds.

llm-d is Apache 2.0, CNCF Sandbox (March 2026). Source: github.com/llm-d/llm-d. Benchmarks reproduced from the v0.5 release post at llm-d.ai. Numbers above are from the precise-scheduling benchmark on the public benchmark platform at prism.llm-d.ai.