Data compression math: what each database actually stores on disk
A transaction_status column holds three values: APPROVED, DECLINED, PENDING. In a careless schema each row stores a VARCHAR(20) — call it 8 bytes average. At 10 million rows that is 80 MB. Shannon’s entropy bound says the minimum is 1.16 bits per symbol. That is 1.5 MB. The gap is 53x, and it exists entirely because of the choices the storage engine makes (or doesn’t make) between serialisation and disk.
This post is the math behind those choices across four storage architectures.
#The floor: Shannon entropy
For a random variable with outcomes and probabilities :
For the status column with distribution APPROVED=70%, DECLINED=20%, PENDING=10%:
No lossless compressor beats this. Everything below is about how close each engine gets, and what it gives up to get there.
#RDBMS (PostgreSQL)
PostgreSQL stores rows in 8 KB heap pages. A row with a long text or JSONB column triggers TOAST (The Oversized-Attribute Storage Technique) once its size crosses the toast_tuple_target threshold (default: 2 KB).
TOAST compresses the attribute inline before spilling it to a separate table. Three algorithms are available:
| Algorithm | Compression ratio (typical text) | Decompression throughput |
|---|---|---|
| pglz (default, PG <14) | 2.0–2.5x | ~500 MB/s |
| LZ4 (PG 14+) | 2.0–3.0x | ~4 GB/s |
| ZSTD (PG 15+) | 3.0–5.0x | ~1.5 GB/s |
LZ4 is the right default for any latency-sensitive path. ZSTD wins on bulk analytics where decompression throughput is not the bottleneck.
Dictionary encoding — which PostgreSQL does not do natively for row-oriented heaps but Citus columnar and any Parquet-backed foreign table does — changes the picture entirely for low-cardinality columns. The status column with three values needs bits per row in a dictionary scheme, versus 8 bytes raw. On 10M rows:
This is why columnar formats compress enum-like columns near-perfectly even before applying a secondary byte-level compressor on top.
#ScyllaDB
ScyllaDB compresses at the SSTable chunk level. Each SSTable is a sequence of fixed-size chunks; each chunk is compressed independently. The default chunk size is 4 KB.
Chunk-level independence has a cost: every single-row read must decompress the full chunk that contains it.
At chunk = 4 KB and avg row = 128 B: 32 rows decompressed per point read. Increasing the chunk to 64 KB improves compression ratio by 15–25% (longer runs → better LZ back-references) but raises read amplification to 512x.
Supported algorithms and typical ratios on event-stream data:
| Algorithm | Ratio (time-series) | Ratio (UUID-heavy) | Notes |
|---|---|---|---|
| LZ4 | 3–5x | 1.5–2x | Default; lowest CPU cost |
| Snappy | 2.5–4x | 1.4–1.8x | Slightly lower ratio than LZ4 |
| ZSTD | 4–8x | 2–3x | Best ratio; higher CPU on write path |
| Deflate | 3–6x | 2–2.5x | Slowest; avoid unless storage-constrained |
The write path keeps data uncompressed in the memtable (RAM) and compresses only on SSTable flush. Reads access the page cache; Scylla caches decompressed chunks by default (unlike Cassandra which can cache compressed). At high read concurrency the cache hit rate dominates latency far more than the compression ratio.
#Graph databases
A graph needs to store adjacency. Three representations:
Adjacency matrix: bits. At nodes: bits = 122 GB even for a bit-packed binary matrix. Viable only for dense graphs.
Adjacency list: with pointer overhead per node — typically 24–48 bytes per node entry plus 4–8 bytes per edge.
Compressed Sparse Row (CSR): two flat arrays.
row_ptr[0..V]: V+1 integers;row_ptr[i]is the index intocol_idxwhere node i’s neighbors begincol_idx[0..E-1]: E integers; the actual neighbor IDs
For , : , versus 122 GB for the adjacency matrix. CSR is what graph analytics engines (GraphX, cuGraph, igraph) use internally.
Neo4j uses a different approach: index-free adjacency — each node record stores a pointer to its first relationship record, and each relationship record is a doubly-linked list node. This trades memory compactness for O(degree) traversal without an index lookup. Compressed against CSR, Neo4j’s native format uses more bytes per edge but achieves sub-millisecond hop traversal because no secondary index is touched.
Delta encoding on sorted neighbor lists further reduces CSR storage. If neighbors of node are sorted, store differences rather than absolute IDs:
For a social graph where average node ID delta is small (local community structure), varints encoding reduces the col_idx array by 60–70% over fixed int32. This is what WebGraph and similar large-scale graph compression formats do.
#Vector databases
This is where compression math gets structurally different. The compression is lossy — you trade exact distance preservation for storage and compute savings.
Baseline: float32
A -dimensional embedding stored as float32: bytes.
| Model | Dimensions | Bytes/vector | 10M vectors |
|---|---|---|---|
| text-embedding-3-small | 1536 | 6,144 B | 61.4 GB |
| text-embedding-3-large | 3072 | 12,288 B | 122.9 GB |
| Cohere Embed v3 | 1024 | 4,096 B | 41.0 GB |
Before the HNSW index. Add roughly bytes per vector for the graph layer (M=16 is common), so another 1.3 GB for 10M vectors.
Scalar quantization (SQ8)
Map each float32 dimension to uint8 via per-dimension affine transform:
Store and per dimension (2 floats × = negligible). Each vector: bytes. Compression: 4x. Recall@10 degradation on MSMARCO-style benchmarks: typically 0.5–1.5%.
Product quantization (PQ)
Split the -dimensional space into sub-spaces of dimensions. For each sub-space, learn centroids offline via k-means. At inference, store the centroid index per sub-space: bits.
With (8 bits per sub-space), each vector stores bytes:
For , : ratio = . The 10M float32 corpus shrinks from 61.4 GB to 960 MB.
Distance at query time uses asymmetric distance computation (ADC): precompute a lookup table of distances from the query sub-vector to all centroids in each sub-space ( entries), then approximate each database vector’s distance as a sum of table lookups — no multiply-accumulate, just integer adds and memory reads.
PQ recall degrades more than SQ8 — typically 3–8% at Recall@10 for . The tradeoff is navigated by adjusting (more sub-spaces = better recall, larger index).
Binary quantization (BQ)
Storage: bytes per vector. 32x compression. Distance: Hamming distance via POPCNT, which CPUs execute in a single instruction per 64-bit word.
BQ works well only for embedding models with approximately symmetric dimension distributions (Matryoshka representations, Cohere Embed v3 trained with BQ in mind). On general-purpose embeddings it loses 5–15% Recall@10. Qdrant, Weaviate, and Vespa all support it; Pinecone does not as of this writing.
#Summary
| Engine | Primary technique | Typical ratio | Lossy? |
|---|---|---|---|
| PostgreSQL (row) | LZ4 / ZSTD on TOAST values | 2–5x | No |
| PostgreSQL (columnar) | Dictionary + LZ4/ZSTD | 10–50x on low-cardinality | No |
| ScyllaDB | Chunk-level LZ4/ZSTD on SSTables | 3–8x | No |
| Graph (CSR + delta) | Delta-varint encoding | 3–5x over adjacency list | No |
| Vector (SQ8) | Per-dimension affine quantization | 4x | Yes (< 2% recall loss) |
| Vector (PQ) | Sub-space centroid codes | 16–128x | Yes (3–8% recall loss) |
| Vector (BQ) | Sign quantization + Hamming | 32x | Yes (model-dependent) |
The RDBMS and ScyllaDB cases are lossless — the data that comes out matches what went in, the compressor is just exploiting redundancy in the byte stream. Vector quantization is structurally different: you are approximating the metric space, and the approximation quality is measurable (Recall@k) and tunable. Getting the tuning right requires knowing both your distance distribution and your recall SLO before you pick or the quantization scheme.
Primary sources: Zstd compression levels, ScyllaDB Compression documentation, Product Quantization — Jégou et al. 2011, Qdrant quantization docs.