Data compression math: what each database actually stores on disk

A transaction_status column holds three values: APPROVED, DECLINED, PENDING. In a careless schema each row stores a VARCHAR(20) — call it 8 bytes average. At 10 million rows that is 80 MB. Shannon’s entropy bound says the minimum is 1.16 bits per symbol. That is 1.5 MB. The gap is 53x, and it exists entirely because of the choices the storage engine makes (or doesn’t make) between serialisation and disk.

This post is the math behind those choices across four storage architectures.


#The floor: Shannon entropy

For a random variable XX with nn outcomes and probabilities p1,,pnp_1, \ldots, p_n:

H(X)=i=1npilog2pibits/symbolH(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \quad \text{bits/symbol}

For the status column with distribution APPROVED=70%, DECLINED=20%, PENDING=10%:

H=(0.70log20.70+0.20log20.20+0.10log20.10)1.16 bits/symbolH = -(0.70 \log_2 0.70 + 0.20 \log_2 0.20 + 0.10 \log_2 0.10) \approx 1.16 \text{ bits/symbol}

No lossless compressor beats this. Everything below is about how close each engine gets, and what it gives up to get there.


#RDBMS (PostgreSQL)

PostgreSQL stores rows in 8 KB heap pages. A row with a long text or JSONB column triggers TOAST (The Oversized-Attribute Storage Technique) once its size crosses the toast_tuple_target threshold (default: 2 KB).

TOAST compresses the attribute inline before spilling it to a separate table. Three algorithms are available:

Algorithm Compression ratio (typical text) Decompression throughput
pglz (default, PG <14) 2.0–2.5x ~500 MB/s
LZ4 (PG 14+) 2.0–3.0x ~4 GB/s
ZSTD (PG 15+) 3.0–5.0x ~1.5 GB/s

LZ4 is the right default for any latency-sensitive path. ZSTD wins on bulk analytics where decompression throughput is not the bottleneck.

Dictionary encoding — which PostgreSQL does not do natively for row-oriented heaps but Citus columnar and any Parquet-backed foreign table does — changes the picture entirely for low-cardinality columns. The status column with three values needs log23=2\lceil \log_2 3 \rceil = 2 bits per row in a dictionary scheme, versus 8 bytes raw. On 10M rows:

savings=8×8 bits2 bits8×8 bits96.9%\text{savings} = \frac{8 \times 8 \text{ bits} - 2 \text{ bits}}{8 \times 8 \text{ bits}} \approx 96.9\%

This is why columnar formats compress enum-like columns near-perfectly even before applying a secondary byte-level compressor on top.


#ScyllaDB

ScyllaDB compresses at the SSTable chunk level. Each SSTable is a sequence of fixed-size chunks; each chunk is compressed independently. The default chunk size is 4 KB.

Chunk-level independence has a cost: every single-row read must decompress the full chunk that contains it.

read amplification (rows)=chunk_size_bytesavg_row_size_bytes\text{read amplification (rows)} = \frac{\text{chunk\_size\_bytes}}{\text{avg\_row\_size\_bytes}}

At chunk = 4 KB and avg row = 128 B: 32 rows decompressed per point read. Increasing the chunk to 64 KB improves compression ratio by 15–25% (longer runs → better LZ back-references) but raises read amplification to 512x.

Supported algorithms and typical ratios on event-stream data:

Algorithm Ratio (time-series) Ratio (UUID-heavy) Notes
LZ4 3–5x 1.5–2x Default; lowest CPU cost
Snappy 2.5–4x 1.4–1.8x Slightly lower ratio than LZ4
ZSTD 4–8x 2–3x Best ratio; higher CPU on write path
Deflate 3–6x 2–2.5x Slowest; avoid unless storage-constrained

The write path keeps data uncompressed in the memtable (RAM) and compresses only on SSTable flush. Reads access the page cache; Scylla caches decompressed chunks by default (unlike Cassandra which can cache compressed). At high read concurrency the cache hit rate dominates latency far more than the compression ratio.


#Graph databases

A graph G=(V,E)G = (V, E) needs to store adjacency. Three representations:

Adjacency matrix: V2|V|^2 bits. At V=1M|V| = 1\text{M} nodes: 101210^{12} bits = 122 GB even for a bit-packed binary matrix. Viable only for dense graphs.

Adjacency list: O(V+E)O(|V| + |E|) with pointer overhead per node — typically 24–48 bytes per node entry plus 4–8 bytes per edge.

Compressed Sparse Row (CSR): two flat arrays.

  • row_ptr[0..V]: V+1 integers; row_ptr[i] is the index into col_idx where node i’s neighbors begin
  • col_idx[0..E-1]: E integers; the actual neighbor IDs

CSR size=(V+1+E)×4 bytes (int32)\text{CSR size} = (|V| + 1 + |E|) \times 4 \text{ bytes (int32)}

For V=1M|V| = 1\text{M}, E=10M|E| = 10\text{M}: (1,000,001+10,000,000)×4=44 MB(1{,}000{,}001 + 10{,}000{,}000) \times 4 = 44 \text{ MB}, versus 122 GB for the adjacency matrix. CSR is what graph analytics engines (GraphX, cuGraph, igraph) use internally.

Neo4j uses a different approach: index-free adjacency — each node record stores a pointer to its first relationship record, and each relationship record is a doubly-linked list node. This trades memory compactness for O(degree) traversal without an index lookup. Compressed against CSR, Neo4j’s native format uses more bytes per edge but achieves sub-millisecond hop traversal because no secondary index is touched.

Delta encoding on sorted neighbor lists further reduces CSR storage. If neighbors of node ii are sorted, store differences rather than absolute IDs:

δj=neighborjneighborj1\delta_j = \text{neighbor}_j - \text{neighbor}_{j-1}

For a social graph where average node ID delta is small (local community structure), varints encoding δj\delta_j reduces the col_idx array by 60–70% over fixed int32. This is what WebGraph and similar large-scale graph compression formats do.


#Vector databases

This is where compression math gets structurally different. The compression is lossy — you trade exact distance preservation for storage and compute savings.

Baseline: float32

A dd-dimensional embedding stored as float32: d×4d \times 4 bytes.

Model Dimensions Bytes/vector 10M vectors
text-embedding-3-small 1536 6,144 B 61.4 GB
text-embedding-3-large 3072 12,288 B 122.9 GB
Cohere Embed v3 1024 4,096 B 41.0 GB

Before the HNSW index. Add roughly M×8M \times 8 bytes per vector for the graph layer (M=16 is common), so another 1.3 GB for 10M vectors.

Scalar quantization (SQ8)

Map each float32 dimension to uint8 via per-dimension affine transform:

xq(d)=clamp ⁣(round ⁣(x(d)mindmaxdmind×255),0,255)x_q^{(d)} = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x^{(d)} - \min_d}{\max_d - \min_d} \times 255\right),\, 0,\, 255\right)

Store mind\min_d and maxd\max_d per dimension (2 floats × dd = negligible). Each vector: dd bytes. Compression: 4x. Recall@10 degradation on MSMARCO-style benchmarks: typically 0.5–1.5%.

Product quantization (PQ)

Split the dd-dimensional space into mm sub-spaces of d/md/m dimensions. For each sub-space, learn kk centroids offline via k-means. At inference, store the centroid index per sub-space: log2k\log_2 k bits.

With k=256k = 256 (8 bits per sub-space), each vector stores mm bytes:

compression ratio=d×4m\text{compression ratio} = \frac{d \times 4}{m}

For d=1536d = 1536, m=96m = 96: ratio = 614496=64×\frac{6144}{96} = 64\times. The 10M float32 corpus shrinks from 61.4 GB to 960 MB.

Distance at query time uses asymmetric distance computation (ADC): precompute a lookup table of distances from the query sub-vector to all kk centroids in each sub-space (m×km \times k entries), then approximate each database vector’s distance as a sum of mm table lookups — no multiply-accumulate, just integer adds and memory reads.

qx2i=1mdist_table[i][code[i][x]]\|q - x\|^2 \approx \sum_{i=1}^{m} \text{dist\_table}[i][\text{code}[i][x]]

PQ recall degrades more than SQ8 — typically 3–8% at Recall@10 for m=d/16m = d/16. The tradeoff is navigated by adjusting mm (more sub-spaces = better recall, larger index).

Binary quantization (BQ)

xb(d)=1[x(d)>0]{0,1}x_b^{(d)} = \mathbb{1}[x^{(d)} > 0] \in \{0, 1\}

Storage: d/8d/8 bytes per vector. 32x compression. Distance: Hamming distance via POPCNT, which CPUs execute in a single instruction per 64-bit word.

hamming(qb,xb)=popcount(qbxb)\text{hamming}(q_b, x_b) = \text{popcount}(q_b \oplus x_b)

BQ works well only for embedding models with approximately symmetric dimension distributions (Matryoshka representations, Cohere Embed v3 trained with BQ in mind). On general-purpose embeddings it loses 5–15% Recall@10. Qdrant, Weaviate, and Vespa all support it; Pinecone does not as of this writing.


#Summary

Engine Primary technique Typical ratio Lossy?
PostgreSQL (row) LZ4 / ZSTD on TOAST values 2–5x No
PostgreSQL (columnar) Dictionary + LZ4/ZSTD 10–50x on low-cardinality No
ScyllaDB Chunk-level LZ4/ZSTD on SSTables 3–8x No
Graph (CSR + delta) Delta-varint encoding 3–5x over adjacency list No
Vector (SQ8) Per-dimension affine quantization 4x Yes (< 2% recall loss)
Vector (PQ) Sub-space centroid codes 16–128x Yes (3–8% recall loss)
Vector (BQ) Sign quantization + Hamming 32x Yes (model-dependent)

The RDBMS and ScyllaDB cases are lossless — the data that comes out matches what went in, the compressor is just exploiting redundancy in the byte stream. Vector quantization is structurally different: you are approximating the metric space, and the approximation quality is measurable (Recall@k) and tunable. Getting the tuning right requires knowing both your distance distribution and your recall SLO before you pick mm or the quantization scheme.


Primary sources: Zstd compression levels, ScyllaDB Compression documentation, Product Quantization — Jégou et al. 2011, Qdrant quantization docs.