Product Quantization and DiskANN

In short

A billion 768-dim float32 vectors take 3 TB of RAM. No commodity box holds that, and sharding multiplies cost. The fix is compression. Product Quantization (PQ) splits each vector into M sub-vectors of d/M dim, runs k-means with K = 256 centroids on each sub-position, and stores each vector as just M bytes — one centroid ID per sub-vector. A 768-dim vector at M = 8 becomes 8 bytes, a 384× shrink. Distances are computed by asymmetric distance computation (ADC): at query time you pre-compute a tiny M × 256 lookup table of distances from the query's sub-vectors to every codebook centroid, and approximate the distance to any stored vector as the sum of M table reads. No float multiplies; just integer indexing into a 1 KB table that fits in L1 cache. Recall lands around 90–95 % at PQ8 — recovered to 99 % by re-ranking the top-100 candidates with the original full vectors loaded from disk. DiskANN (Microsoft, 2019) bolts PQ onto an HNSW-like graph stored on SSD: PQ-compressed vectors live in RAM and drive the graph walk; full vectors live on disk and are loaded only for the final re-rank. The result: 1 billion vectors, single machine, 64 GB RAM, sub-100 ms p99. This is how Microsoft Bing's vector search runs, and the standard recipe (with FAISS's IVF-PQ) at billion-scale across the industry.

The previous chapter on IVF ended with an honest admission: IVF is a speed index, not a space index. You still hold every vector in RAM at full precision. For 100 million 768-dim float32 vectors that is 300 GB; for a billion, 3 TB. No single machine on AWS or GCP — short of the most exotic memory-optimised SKUs — has that much RAM, and the ones that do cost more per hour than a small Indian startup spends on engineers per day. You cannot make N smaller (that defeats having data) and you cannot make d smaller without hurting embedding quality. The only remaining lever is the bytes per vector. This chapter is about that lever. Product Quantization compresses each vector by 10 to 100 times with a small, controllable accuracy loss. Combined with IVF you get IVF-PQ — the canonical FAISS recipe past 100 M vectors. Combined with an on-disk graph you get DiskANN — Microsoft's open-source library that puts a billion vectors on a single laptop-class machine. Both are built on the same compression idea, derived below from first principles.

The thesis: 3 TB does not fit, so we shrink the vectors

A 768-dim float32 vector is 3072 bytes. Multiply by a billion: 3 TB. RAM costs ~₹500 per GB at retail, so 3 TB is ₹15 lakh just for the chips, ignoring the chassis, the network, the operations team, and the fact that no off-the-shelf box even accepts that much. Even at 100 M vectors you are already at 300 GB — past the 256 GB ceiling of most general-purpose cloud instances. The two ways out are shard across machines (which doubles latency, multiplies cost, and adds a coordinator) or compress each vector (which adds CPU and loses a little accuracy). Compression wins on cost and latency by an order of magnitude — when it works. Product Quantization is the technique that makes it work. The intuition: a 768-dim vector is not 768 independent random numbers. It is a point in a learned manifold where adjacent dimensions are correlated and the data clusters into a finite number of regimes. If you partition the dimensions into M = 8 groups of 96 and run k-means inside each group, you discover that across the corpus there are only ~256 meaningfully distinct patterns per group. So instead of storing 96 floats per group (384 bytes), you store one byte: "this sub-vector looks like centroid #173 from group 4's codebook". Why this works: k-means on the sub-vectors compresses the per-group entropy down to 8 bits because real embedding sub-spaces are heavily structured — the effective number of distinguishable sub-vector configurations is far less than 2^(96·32) would suggest. The dimensions inside a group are correlated; the groups across positions are roughly independent. PQ exploits exactly this structure.

Phase 1: training — `M` independent k-means runs

Pick M, the number of sub-vectors. Conventional choices are 8, 16, 32, or 64 — divisors of d that give round byte counts. Pick K, the centroids per sub-position. Almost everyone picks K = 256 because that fits in 8 bits and 8 bits is what hardware loves. Training is just M independent k-means runs. For each sub-position m ∈ {0, …, M-1}:

Take the m-th slice of every vector — columns m·(d/M) through (m+1)·(d/M) - 1.
Run k-means with K = 256 clusters on those N slices in d/M-dimensional space.
Store the resulting 256 centroids as the codebook for sub-position m: C_m ∈ ℝ^(256 × d/M). After training you have M codebooks, each holding 256 sub-vector centroids. Total codebook size: M · 256 · (d/M) · 4 = 256 · d · 4 bytes — for d = 768, that is ~786 KB. The codebooks fit comfortably in L2 cache on any modern CPU. Now encode every corpus vector. For each vector v:
Split v into M sub-vectors v^(0), v^(1), …, v^(M-1).
For each sub-position m, find the index i_m ∈ {0, …, 255} of the centroid in C_m closest to v^(m).
Replace v with the byte string (i_0, i_1, …, i_{M-1}) — M bytes total. That is the entire encoding. A 768-dim float32 vector (3072 bytes) becomes 8 bytes at M = 8. Why exactly M bytes: each sub-vector position contributes log_2(K) = 8 bits, and there are M positions. The codebooks are amortised across the entire corpus — the per-vector cost is just the M indices.

The vector is sliced into 8 chunks of 96 dimensions. Each chunk is quantised independently — k-means trained once on the corpus per sub-position. The output is 8 byte-sized centroid IDs. Reconstruction (lossy) is concatenating the centroids back. The 3072 → 8 byte compression is what makes a billion vectors fit in 8 GB.

A few practical points the textbook leaves out. **`M` controls the speed-accuracy trade.** Larger `M` means more sub-vectors, smaller per-sub-vector dim, less information lost per quantisation, more bytes per vector. Typical: `M = 8` for 768-dim → 8 bytes per vector, ~92 % recall@10. `M = 16` → 16 bytes, ~96 %. `M = 32` → 32 bytes, ~98 %. Why diminishing returns: at `M = d` you are quantising one float per code, which is just 8-bit scalar quantisation — the gain from "products" disappears. The product structure pays off when sub-vectors carry meaningful joint structure. **`d` should be a multiple of `M`.** If not, pad with zeros. FAISS does this for you. **Train on a representative sample.** k-means on a billion sub-vectors is itself a multi-hour job and unnecessary. 100 K to 1 M sample vectors yield codebooks statistically indistinguishable from full-corpus training. **The codebook is your secret weapon.** Two corpora with the same dimension cannot share codebooks — they live in different parts of the vector space. Train per-corpus, never reuse. ## Phase 2: querying — asymmetric distance computation You now have `N · M` bytes of compressed vectors and `M` codebooks. A query `q ∈ ℝ^d` arrives. How do you compute `dist(q, v)` for every compressed `v` without decompressing? The trick is **asymmetric distance computation** (ADC), introduced by Jégou et al. in 2011. The query stays at full precision — it is just one vector, the cost of keeping it uncompressed is negligible. Only the corpus is compressed. The distance is approximated as:

\hat{d}(q, v) = \sum_{m=0}^{M-1} \| q^{(m)} - C_m[i_m] \|^2

where i_m is the centroid ID stored for sub-position m of v. Why this is fast: the inner term \| q^{(m)} - C_m[i_m] \|^2 does not depend on v — only on the query and the codebook. So for each query, precompute it once for every (m, centroid) pair and stash in a table. Then for any compressed vector v, the distance is just M table lookups and a sum. The precomputed table is M × 256 floats — for M = 8, that is 8 KB, fits in L1 cache. Building it costs M · 256 · (d/M) multiplies = 256 · d multiplies = ~200 K floating-point ops for d = 768. Once. Per query. Then for every compressed vector in the corpus, the distance is computed by:

def adc_distance(pq_code, lookup_table):
    # pq_code: bytes, length M
    # lookup_table: float32[M][256]
    return sum(lookup_table[m][pq_code[m]] for m in range(M))

For M = 8, that is 8 array reads and 7 adds per vector. No multiplications, no square roots, no cache misses (the table is in L1). On modern hardware this is well under 10 nanoseconds per vector — so scanning a billion compressed vectors takes around 10 seconds on a single core, without any indexing at all. With IVF on top to skip 99 % of clusters, you scan ~10 M vectors per query, taking ~100 ms.

The query is decomposed once and used to build a `M × 256` table of pre-computed sub-distances. Then for every compressed vector in the corpus, the distance is the sum of `M` indexed reads from that table — no float multiplies, all data hot in L1. This is what makes PQ fast at scale; a billion table reads is a few seconds, a billion full distance computations is many minutes.

The "asymmetric" name is precise: **the query stays uncompressed, the corpus is compressed.** The alternative is **symmetric** distance computation (SDC) where you also quantise the query — that is faster (the table is shared across queries) but less accurate. Practically nobody uses SDC; ADC is what every production system runs. ## The recall trade-off — and re-ranking to recover it Distances under PQ are approximate. Two failure modes: **False ranking.** Two vectors `v₁` and `v₂` are equidistant under PQ but `v₁` is the true nearest. The ranker picks `v₂`. Recall drops at the boundary. **Compression collapse.** When `M` is too small relative to `d`, multiple distinct vectors collapse to the same code. PQ8 on 768-dim vectors gives `2^64 ≈ 10^19` distinct codes — far more than any corpus needs in raw count, but the codes are not uniformly used. In practice ~10⁵ codes carry the bulk of mass. Empirically, PQ8 on text embeddings yields **90–93 % recall@10**, PQ16 yields **94–96 %**, PQ32 yields **97–98 %**. For most retrieval-augmented-generation use cases that is enough. When it is not, the standard fix is **re-ranking**: 1. Use PQ to retrieve the top-`R` candidates (`R = 100` typically), where `R > k`. 2. Load the **full** vectors for those `R` candidates (from disk or a separate uncompressed cache). 3. Compute exact distances on those `R` vectors. 4. Return the top-`k` by exact distance. This recovers ~99 % recall@10 because the true top-10 almost always sits in the PQ top-100 — PQ rarely loses recall by ranking the true nearest below position 100, only by ranking it below position 10. Why re-ranking works so well: PQ noise is roughly proportional to the true distance — the top-50 candidates under PQ contain almost all the top-10 true neighbours, just in scrambled order. A second pass with exact distances unscrambles them at trivial cost (you compute 100 exact distances, not a billion). The re-rank cost: 100 exact distance computations × 768 dims = ~80 K float ops, ~0.1 ms. The cost of loading 100 full vectors from SSD: 100 × 3 KB = 300 KB, which at NVMe random-read latency (~50 µs per IO, or ~5 µs per IO with batching) is a few milliseconds. **Total query: PQ scan ~10 ms + re-rank IO ~3 ms + re-rank compute ~0.1 ms ≈ 15 ms.** Sub-100 ms p99 with comfortable headroom. ## DiskANN: the graph on SSD PQ alone solves the **memory** problem — a billion vectors fit in 8 GB at PQ8. But you still need an **index** to skip most of the corpus per query; scanning all billion compressed vectors is 10 seconds, far too slow for interactive search. The previous chapters covered the two main candidates: [IVF](/wiki/ivf-inverted-file-indexes-for-vectors) (cluster-based) and [HNSW](/wiki/hnsw-hierarchical-navigable-small-world-graphs) (graph-based). HNSW gives the best recall/latency curve but requires the **graph** to be in RAM. At a billion vectors with 32 neighbours per node × 4 bytes per neighbour ID = 128 bytes per node, the graph alone is 128 GB — too big for our 64 GB target. HNSW falls over at billion scale on a single machine. **DiskANN** ([Subramanya et al., NeurIPS 2019](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)) is Microsoft Research's answer. The recipe: 1. Build an **HNSW-like graph** (called Vamana — DiskANN's specific graph construction) with controlled out-degree. 2. Store the graph **on SSD**, with each node's full vector and neighbour list packed into a single SSD page. 3. Keep the **PQ-compressed vectors in RAM** (8 GB for a billion at PQ8). 4. **Query** by greedy graph walk: at each step, evaluate PQ distances to the current node's neighbours (using the in-RAM PQ codes), pick the best, load that node's page from SSD to get the next neighbour list. Repeat until convergence. 5. **Re-rank** the top candidates using their full vectors (already loaded as part of the SSD pages visited). The genius is the **two-tier distance**. The PQ codes drive the graph walk — fast, in-RAM, approximate — and the full vectors loaded as a side-effect of page IO drive the final re-ranking. Each query loads only a few hundred SSD pages (the path through the graph from entry point to target), so SSD IO is bounded.

Two storage tiers, two distance functions. The PQ codes for every vector live in RAM (8 GB for a billion); they drive the graph walk by scoring candidate neighbours fast. The actual graph and full vectors live on SSD; pages are loaded only along the walk's path. The full vectors loaded as a side-effect feed the final re-rank, recovering 99 % recall.

A few things DiskANN gets specifically right that earlier on-disk graph attempts got wrong. **The Vamana graph is built for SSD layout.** Standard HNSW has highly variable out-degrees and unpredictable hop counts; that hurts SSD because each random page load is ~50 µs and you cannot afford 100 of them per query. Vamana caps out-degree, prunes long edges, and its construction algorithm explicitly minimises hop count. Result: most queries converge in 5–20 hops. **SSD pages bundle vector + neighbour list.** When you load a page to read the neighbours, you get the full vector "for free". This means the same IOs that drive the graph walk also feed the re-ranking step — no extra reads. **PQ codes are co-resident with the graph index.** The PQ table and the codes are all in RAM; only the actual full vectors and adjacency lists go to SSD. This keeps the hot path in cache. The result, as published in the NeurIPS 2019 paper: **1 billion 96-dim float vectors, 64 GB RAM, sub-5 ms p99 latency, recall@10 = 95 %.** Bing's vector search runs on this. ANN-Benchmarks consistently puts DiskANN at the Pareto frontier for billion-scale single-machine indexes. ## Putting it all together — IVF-PQ at billion scale DiskANN is one combination of PQ with an index. The other industry-standard combination is **IVF-PQ** — IVF for the coarse partitioning, PQ to compress the per-cluster vectors. FAISS's `IndexIVFPQ` is the canonical implementation; see the original [Jégou-Douze-Schmid paper](https://hal.inria.fr/inria-00514462v2/document) for the full derivation. The recipe: 1. Train IVF with `K = √N` centroids — say 65 K centroids for 1 B vectors. 2. For each vector, subtract its coarse centroid (the **residual**), then PQ-encode the residual. Why the residual: the residuals are small-magnitude vectors clustered around zero, so PQ codebooks trained on residuals capture variance better than codebooks trained on raw vectors. This is the asymmetric distance computation with residual coding (ADC-R) trick from the original paper. 3. Store the IVF inverted lists, each entry being `(vector_id, M-byte PQ code)`. 4. At query time: pick the top-`nprobe` clusters by centroid distance, then for each cluster scan the PQ codes using ADC, finally re-rank top-`R` with the full vectors. Index sizes for 1 B × 768-dim float32 with `K = 65 K`, `M = 8`: - **Flat:** `1 B × 3072 B = 3 TB` (impossible). - **IVF only:** same 3 TB plus 200 MB of centroids. Still impossible. - **IVF-PQ8:** `200 MB centroids + 1 B × (4 B id + 8 B PQ code) = 12 GB`. Fits in RAM. - **IVF-PQ8 with full-vector re-rank cache on SSD:** 12 GB RAM + 3 TB SSD. Sub-100 ms p99. :::example **Sizing a 1 B-vector Indian e-commerce search index** Imagine you run an Indian e-commerce platform — a billion product cards across electronics, fashion, kirana, books, the works — and you want a single semantic search index over all of them. Each product description is embedded with a 768-dim model. **Without compression.** `1 B × 768 × 4 B = 3 TB`. No single AWS instance has 3 TB of RAM at any reasonable cost. The cheapest path is **sharding** across ~50 boxes (each holding ~60 GB of vectors), with a coordinator that fans out queries and merges results. Per-query latency: ~50 ms intra-shard + ~30 ms network + ~20 ms merge ≈ 100 ms. Cost: 50 × `r6i.4xlarge` (128 GB RAM each) at ~₹85/hour = ~₹4250/hour ≈ ₹3 crore/year. Plus the engineering cost of running a 50-box sharded index. **With PQ8.** `1 B × 8 B = 8 GB` of compressed vectors + ~1 GB of IVF centroids and metadata = **9 GB total**. Fits comfortably in 64 GB of RAM with room for the OS, JVM heap, query buffers, and a re-rank cache. Single box: one `r6i.2xlarge` (64 GB) at ~₹17/hour = ~₹1.5 lakh/year. **A 200× cost reduction**, and you have replaced a 50-box distributed system with a single Python process. **Per query:** - IVF coarse search: scan 65 K centroids · 768 dims = 50 M float ops ≈ 20 ms (or much less if using BLAS). - Pick top `nprobe = 32` clusters out of 65 K — that is `32/65000 ≈ 0.05 %` of the corpus, ~500 K vectors. - ADC scan over 500 K PQ codes at 8 ns per vector ≈ 4 ms. - Re-rank top 100 with full vectors loaded from SSD: ~3 ms IO + 0.1 ms compute. - **Total: ~10 ms p50, ~30 ms p99.** **Recall.** With PQ8 alone the top-10 result list has ~92 % recall against the brute-force gold standard. With re-rank to top-100 then exact-distance to top-10, recall climbs to ~99 %. For e-commerce search this is functionally indistinguishable from exact — and certainly indistinguishable from the noise in user click data. **The decision.** Ship IVF-PQ8 with `K = 65 K`, `nprobe = 32`, `M = 8`, and a re-rank-100 stage. One box, ₹1.5 lakh/year, 30 ms p99, 99 % recall. The IVF coordinator that the unsharded version would have needed becomes one less component to operate, monitor, and pay AWS for. Why this works: the savings come from the PQ compression making the corpus fit in RAM, not from the IVF clustering — IVF is "merely" the speed index. PQ is the cost-killer. This is why IVF-PQ is the canonical recipe past 100 M vectors and why DiskANN is the alternative when you want to push past a billion or when re-rank IO budget is tight. ::: ## Where PQ and DiskANN live in the wild **[FAISS](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#pq-the-pq-quantizer)** — the canonical implementation. `IndexPQ` (pure PQ, no coarse index), `IndexIVFPQ` (IVF + PQ, the standard recipe), `IndexIVFPQR` (with residual re-ranking), `IndexHNSWPQ` (HNSW graph + PQ-compressed vectors). The single most production-tested similarity-search library on Earth. **[DiskANN (Microsoft, GitHub)](https://github.com/microsoft/DiskANN)** — the Vamana-on-SSD reference implementation, MIT-licensed, in C++. Used by Bing, available as a backend in Milvus and Vespa. The repository contains the SSD-optimised search code and a training script that handles billion-scale indexes on commodity hardware. **[Milvus](https://milvus.io/docs/index.md)** — vector database that exposes `IVF_PQ`, `IVF_SQ8`, and `DISKANN` as index types. The `DISKANN` index is a Milvus wrapper around Microsoft's library; widely used by Indian fintech and e-commerce startups for product search and KYC document matching. **[Vespa](https://docs.vespa.ai/en/approximate-nn-hnsw.html)** — Yahoo's open-source search platform. Defaults to HNSW but supports a custom PQ implementation for memory-constrained deployments. **[Pinecone](https://www.pinecone.io/learn/series/faiss/product-quantization/)** — managed vector DB; uses PQ-style compression internally, with proprietary tweaks. Their engineering blog has a clear vendor-neutral PQ explainer. **ScaNN (Google)** — uses **anisotropic vector quantisation**, a more sophisticated cousin of PQ tuned for inner-product (recommendation) workloads. ScaNN's quantisation pays attention to *which* dimensions of the residual matter most for ranking, where PQ treats all dimensions equally. **OPQ (Optimised Product Quantisation)** — a 2013 refinement by Ge et al. that learns a rotation `R` of the input space *before* PQ, so that the sub-vector axes are better aligned with the data's natural variance directions. Recall improves 1–3 % at zero query-time cost. FAISS's `IndexPreTransform(OPQMatrix(d, M), IndexIVFPQ(...))` ships this for free; almost everyone running IVF-PQ at scale should turn it on. The pattern across all of these: **compress aggressively, walk a small index, re-rank the top candidates exactly.** That triplet — compression + ANN index + re-rank — is the only way billion-scale similarity search works on commodity hardware in 2026, and PQ is the load-bearing piece. ## When to reach for PQ (and when not to) **Reach for PQ when:** - You have ≥ 10⁸ vectors and uncompressed storage is hurting (RAM cost, sharding overhead, network). - You can tolerate a 5–10 % recall drop on raw retrieval, recovered by a re-rank pass. - Your queries can afford the precompute step (negligible — sub-millisecond per query). **Reach for DiskANN when:** - You have ≥ 10⁹ vectors and even PQ-compressed memory is uncomfortable (8 GB at PQ8 for a billion, ~80 GB for ten billion). - You have NVMe SSDs available with low-latency random reads. - You can build the index offline (Vamana construction is slower than HNSW build). **Reach for plain IVF or HNSW instead when:** - You have < 10⁸ vectors and the full-precision index fits in RAM. - You need maximum recall with no re-rank tolerance. - Index build time is a hard constraint (PQ training is fast but adds a step). **Skip quantisation entirely when:** - You have < 10⁶ vectors. The compression ratio is wasted; full precision is cheap. - Your downstream re-rank uses a much more expensive cross-encoder anyway. The PQ approximation is dominated by the cross-encoder cost. The decision is mostly about scale. Below 100 M, plain IVF or HNSW. Between 100 M and 1 B, IVF-PQ. Above 1 B (or below the memory budget IVF-PQ needs), DiskANN. ## What you have built You started this chapter with a memory wall: a billion vectors do not fit in commodity RAM at full precision. You end it with two complementary tools that knock the wall down. **Product Quantization** compresses each vector by 100–400× by exploiting per-sub-position structure with a tiny per-position k-means codebook; distances become a `M × 256` lookup-table sum, faster than any float arithmetic. **DiskANN** combines PQ with a graph index laid out for SSD pages, putting a billion vectors on a single 64 GB box at sub-100 ms latency. The next chapter takes the search problem in a different direction — [filtered search](/wiki/filtered-search-the-pre-post-filter-tradeoff), where the user wants "nearest vectors **and** in-stock **and** under ₹5000". The compression and indexing tricks from chapters 152–155 stay; what changes is how to combine them with the structured filters that real e-commerce queries always have. ## References 1. Jégou, Douze, Schmid — [Product Quantization for Nearest Neighbor Search](https://hal.inria.fr/inria-00514462v2/document) (2011). The foundational PQ paper. Sections 3 and 4 derive ADC and IVFADC; everything since is variations on this. 2. Ge, He, Ke, Sun — [Optimized Product Quantization](https://www.kaiminghe.com/publications/pami13opq.pdf) (PAMI 2013). Adds a learned rotation before PQ, lifting recall by 1–3 % at no query-time cost. The default in modern FAISS pipelines. 3. Subramanya, Devvrit, Simhadri, Krishnaswamy, Kadekodi — [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) (NeurIPS 2019). The Microsoft Research paper. Vamana graph construction, SSD page layout, and the billion-vector benchmark. 4. [Microsoft DiskANN — GitHub](https://github.com/microsoft/DiskANN). The MIT-licensed reference implementation in C++. Includes scripts for billion-scale index builds and the SSD-optimised search runtime. 5. [Pinecone — Product Quantization](https://www.pinecone.io/learn/series/faiss/product-quantization/). Vendor-flavoured but well-illustrated walkthrough of PQ encoding, ADC, and the IVF-PQ combination. 6. Johnson, Douze, Jégou — [Billion-scale similarity search with GPUs](https://arxiv.org/abs/1702.08734) (2017). The FAISS paper. Section 4 covers GPU-accelerated IVF-PQ at billion scale, the engineering basis for almost every large-scale vector index in production today.

The thesis: 3 TB does not fit, so we shrink the vectors

Phase 1: training — M independent k-means runs

Phase 1: training — `M` independent k-means runs