Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A billion 768-dim float32 vectors take 3 TB of RAM — no commodity box holds that. Product Quantization splits each vector into M sub-vectors, k-means-quantises each one to a 1-byte centroid ID, and shrinks 3072 bytes to 8 (a 384× squeeze) while preserving rank order via an M × 256 lookup-table sum. DiskANN bolts PQ onto an SSD-resident graph: PQ codes in RAM drive the walk, full vectors loaded on path drive the re-rank, and a billion vectors fit on one 64 GB box at sub-100 ms p99.

A billion 768-dim float32 vectors weigh 3 TB. No commodity box holds that, sharding 50-ways is expensive, and you cannot shrink N or d without losing data or quality. The only lever left is the bytes per vector — and Product Quantization pulls it hard enough to put a billion vectors on a single laptop-class machine. This chapter is about that lever. Product Quantization compresses each vector by 10 to 100 times with a small, controllable accuracy loss. Combined with IVF you get IVF-PQ — the canonical FAISS recipe past 100 M vectors. Combined with an on-disk graph you get DiskANN — Vistron's open-source library that puts a billion vectors on a single laptop-class machine. Both are built on the same compression idea, derived below from first principles.

The thesis: 3 TB does not fit, so we shrink the vectors

A 768-dim float32 vector is 3072 bytes. Multiply by a billion: 3 TB. RAM costs ~₹500 per GB at retail, so 3 TB is ₹15 lakh just for the chips, ignoring the chassis, the network, the operations team, and the fact that no off-the-shelf box even accepts that much. Even at 100 M vectors you are already at 300 GB — past the 256 GB ceiling of most general-purpose cloud instances. The two ways out are shard across machines (which doubles latency, multiplies cost, and adds a coordinator) or compress each vector (which adds CPU and loses a little accuracy). Compression wins on cost and latency by an order of magnitude — when it works. Product Quantization is the technique that makes it work. The intuition: a 768-dim vector is not 768 independent random numbers. It is a point in a learned manifold where adjacent dimensions are correlated and the data clusters into a finite number of regimes. If you partition the dimensions into M = 8 groups of 96 and run k-means inside each group, you discover that across the corpus there are only ~256 meaningfully distinct patterns per group. So instead of storing 96 floats per group (384 bytes), you store one byte: "this sub-vector looks like centroid #173 from group 4's codebook". Why this works: k-means on the sub-vectors compresses the per-group entropy down to 8 bits because real embedding sub-spaces are heavily structured — the effective number of distinguishable sub-vector configurations is far less than 2^(96·32) would suggest. The dimensions inside a group are correlated; the groups across positions are roughly independent. PQ exploits exactly this structure.

Phase 1: training — M independent k-means runs

Pick M, the number of sub-vectors. Conventional choices are 8, 16, 32, or 64 — divisors of d that give round byte counts. Pick K, the centroids per sub-position. Almost everyone picks K = 256 because that fits in 8 bits and 8 bits is what hardware loves. Training is just M independent k-means runs. For each sub-position m ∈ {0, …, M-1}:

  1. Take the m-th slice of every vector — columns m·(d/M) through (m+1)·(d/M) - 1.
  2. Run k-means with K = 256 clusters on those N slices in d/M-dimensional space.
  3. Store the resulting 256 centroids as the codebook for sub-position m: C_m ∈ ℝ^(256 × d/M). After training you have M codebooks, each holding 256 sub-vector centroids. Total codebook size: M · 256 · (d/M) · 4 = 256 · d · 4 bytes — for d = 768, that is ~786 KB. The codebooks fit comfortably in L2 cache on any modern CPU. Now encode every corpus vector. For each vector v:
  4. Split v into M sub-vectors v^(0), v^(1), …, v^(M-1).
  5. For each sub-position m, find the index i_m ∈ {0, …, 255} of the centroid in C_m closest to v^(m).
  6. Replace v with the byte string (i_0, i_1, …, i_{M-1})M bytes total. That is the entire encoding. A 768-dim float32 vector (3072 bytes) becomes 8 bytes at M = 8. Why exactly M bytes: each sub-vector position contributes log_2(K) = 8 bits, and there are M positions. The codebooks are amortised across the entire corpus — the per-vector cost is just the M indices.
PQ encoding: split, cluster, store M centroid IDsA diagram showing a 768-dim vector being split into 8 sub-vectors of 96 dims each. Each sub-vector is quantised to one of 256 centroids in its codebook, producing one byte. The 8 bytes form the compressed representation. A label notes the 384× compression ratio.PQ encoding: 768-dim float32 vector → 8 bytes (384× compression)original vector v ∈ ℝ^768 (3072 bytes)[0.31, -0.87, …, 0.04 |0.55, 0.12, …, -0.19 |… (6 more 96-dim slices) …| -0.42, 0.73, …]split into M = 8 sub-vectors of 96 dims eachcodebookC_0256 × 96codebookC_1256 × 96codebookC_2256 × 96codebookC_3256 × 96codebookC_4256 × 96codebookC_5256 × 96codebookC_6256 × 96codebookC_7256 × 96find nearest centroid in each codebook (k-means trained once, offline)1734221978813120215PQ code: 8 bytes per vector (one centroid ID per sub-position)3072 bytes → 8 bytes · 384× compression(8 bits per sub-vector = log₂(256) centroid choices)
The vector is sliced into 8 chunks of 96 dimensions. Each chunk is quantised independently — k-means trained once on the corpus per sub-position. The output is 8 byte-sized centroid IDs. Reconstruction (lossy) is concatenating the centroids back. The 3072 → 8 byte compression is what makes a billion vectors fit in 8 GB.
A few practical points the textbook leaves out. **`M` controls the speed-accuracy trade.** Larger `M` means more sub-vectors, smaller per-sub-vector dim, less information lost per quantisation, more bytes per vector. Typical: `M = 8` for 768-dim → 8 bytes per vector, ~92 % recall@10. `M = 16` → 16 bytes, ~96 %. `M = 32` → 32 bytes, ~98 %. Why diminishing returns: at `M = d` you are quantising one float per code, which is just 8-bit scalar quantisation — the gain from "products" disappears. The product structure pays off when sub-vectors carry meaningful joint structure. **`d` should be a multiple of `M`.** If not, pad with zeros. FAISS does this for you. **Train on a representative sample.** k-means on a billion sub-vectors is itself a multi-hour job and unnecessary. 100 K to 1 M sample vectors yield codebooks statistically indistinguishable from full-corpus training. **The codebook is your secret weapon.** Two corpora with the same dimension cannot share codebooks — they live in different parts of the vector space. Train per-corpus, never reuse. ## Phase 2: querying — asymmetric distance computation You now have `N · M` bytes of compressed vectors and `M` codebooks. A query `q ∈ ℝ^d` arrives. How do you compute `dist(q, v)` for every compressed `v` without decompressing? The trick is **asymmetric distance computation** (ADC), introduced by Jégou et al. in 2011. The query stays at full precision — it is just one vector, the cost of keeping it uncompressed is negligible. Only the corpus is compressed. The distance is approximated as:
\hat{d}(q, v) = \sum_{m=0}^{M-1} \| q^{(m)} - C_m[i_m] \|^2

where i_m is the centroid ID stored for sub-position m of v. Why this is fast: the inner term \| q^{(m)} - C_m[i_m] \|^2 does not depend on v — only on the query and the codebook. So for each query, precompute it once for every (m, centroid) pair and stash in a table. Then for any compressed vector v, the distance is just M table lookups and a sum. The precomputed table is M × 256 floats — for M = 8, that is 8 KB, fits in L1 cache. Building it costs M · 256 · (d/M) multiplies = 256 · d multiplies = ~200 K floating-point ops for d = 768. Once. Per query. Then for every compressed vector in the corpus, the distance is computed by:

def adc_distance(pq_code, lookup_table):
    # pq_code: bytes, length M
    # lookup_table: float32[M][256]
    return sum(lookup_table[m][pq_code[m]] for m in range(M))

For M = 8, that is 8 array reads and 7 adds per vector. No multiplications, no square roots, no cache misses (the table is in L1). On modern hardware this is well under 10 nanoseconds per vector — so scanning a billion compressed vectors takes around 10 seconds on a single core, without any indexing at all. With IVF on top to skip 99 % of clusters, you scan ~10 M vectors per query, taking ~100 ms.

PQ asymmetric distance computationA diagram showing the query vector split into 8 sub-vectors. For each sub-position, the distance from the query sub-vector to all 256 centroids is precomputed once, building an M × 256 lookup table. To compute the distance to any stored compressed vector, sum M table lookups indexed by the vector's centroid IDs. The bottom box shows: 8 lookups + 7 adds per vector, no multiplies needed during scan.PQ distance: precompute M×256 LUT once, then M lookups per vectorquery q ∈ ℝ^768 (full precision, kept in RAM)[q^(0): 96 floats | q^(1): 96 floats | … | q^(7): 96 floats]Step 1 (once per query): build the lookup tableLookup Table LUT[m][c] = ‖q^(m) − C_m[c]‖²M = 8 rows × 256 columns = 2048 floats = 8 KB (fits in L1)m=0: [d_0,0 d_0,1 d_0,2 … d_0,255]m=1: [d_1,0 d_1,1 d_1,2 … d_1,255]Step 2 (per stored vector): sum M lookups indexed by its PQ codestored vector v's PQ code (8 bytes)1734221978813120215d̂(q,v) = LUT[0][173] + LUT[1][42]+ LUT[2][219] + LUT[3][7]+ LUT[4][88] + LUT[5][131]+ LUT[6][202] + LUT[7][15]Per-vector cost· 8 array reads (LUT in L1 cache)· 7 floating-point adds· 0 multiplies, 0 sqrt· ~ 8 ns per vector on a modern CPUvs flat L2: ~3000 ns/vector≈ 400× faster scan
The query is decomposed once and used to build a `M × 256` table of pre-computed sub-distances. Then for every compressed vector in the corpus, the distance is the sum of `M` indexed reads from that table — no float multiplies, all data hot in L1. This is what makes PQ fast at scale; a billion table reads is a few seconds, a billion full distance computations is many minutes.
The "asymmetric" name is precise: **the query stays uncompressed, the corpus is compressed.** The alternative is **symmetric** distance computation (SDC) where you also quantise the query — that is faster (the table is shared across queries) but less accurate. Practically nobody uses SDC; ADC is what every production system runs. ## The recall trade-off — and re-ranking to recover it Distances under PQ are approximate. Two failure modes: **False ranking.** Two vectors `v₁` and `v₂` are equidistant under PQ but `v₁` is the true nearest. The ranker picks `v₂`. Recall drops at the boundary. **Compression collapse.** When `M` is too small relative to `d`, multiple distinct vectors collapse to the same code. PQ8 on 768-dim vectors gives `2^64 ≈ 10^19` distinct codes — far more than any corpus needs in raw count, but the codes are not uniformly used. In practice ~10⁵ codes carry the bulk of mass. Empirically, PQ8 on text embeddings yields **90–93 % recall@10**, PQ16 yields **94–96 %**, PQ32 yields **97–98 %**. For most retrieval-augmented-generation use cases that is enough. When it is not, the standard fix is **re-ranking**: 1. Use PQ to retrieve the top-`R` candidates (`R = 100` typically), where `R > k`. 2. Load the **full** vectors for those `R` candidates (from disk or a separate uncompressed cache). 3. Compute exact distances on those `R` vectors. 4. Return the top-`k` by exact distance. This recovers ~99 % recall@10 because the true top-10 almost always sits in the PQ top-100 — PQ rarely loses recall by ranking the true nearest below position 100, only by ranking it below position 10. Why re-ranking works so well: PQ noise is roughly proportional to the true distance — the top-50 candidates under PQ contain almost all the top-10 true neighbours, just in scrambled order. A second pass with exact distances unscrambles them at trivial cost (you compute 100 exact distances, not a billion). The re-rank cost: 100 exact distance computations × 768 dims = ~80 K float ops, ~0.1 ms. The cost of loading 100 full vectors from SSD: 100 × 3 KB = 300 KB, which at NVMe random-read latency (~50 µs per IO, or ~5 µs per IO with batching) is a few milliseconds. **Total query: PQ scan ~10 ms + re-rank IO ~3 ms + re-rank compute ~0.1 ms ≈ 15 ms.** Sub-100 ms p99 with comfortable headroom. ## DiskANN: the graph on SSD PQ alone solves the **memory** problem — a billion vectors fit in 8 GB at PQ8. But you still need an **index** to skip most of the corpus per query; scanning all billion compressed vectors is 10 seconds, far too slow for interactive search. The previous chapters covered the two main candidates: [IVF](/wiki/ivf-inverted-file-indexes-for-vectors) (cluster-based) and [HNSW](/wiki/hnsw-hierarchical-navigable-small-world-graphs) (graph-based). HNSW gives the best recall/latency curve but requires the **graph** to be in RAM. At a billion vectors with 32 neighbours per node × 4 bytes per neighbour ID = 128 bytes per node, the graph alone is 128 GB — too big for our 64 GB target. HNSW falls over at billion scale on a single machine. **DiskANN** ([Subramanya et al., NeurIPS 2019](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)) is Vistron Research's answer. The recipe: 1. Build an **HNSW-like graph** (called Vamana — DiskANN's specific graph construction) with controlled out-degree. 2. Store the graph **on SSD**, with each node's full vector and neighbour list packed into a single SSD page. 3. Keep the **PQ-compressed vectors in RAM** (8 GB for a billion at PQ8). 4. **Query** by greedy graph walk: at each step, evaluate PQ distances to the current node's neighbours (using the in-RAM PQ codes), pick the best, load that node's page from SSD to get the next neighbour list. Repeat until convergence. 5. **Re-rank** the top candidates using their full vectors (already loaded as part of the SSD pages visited). The genius is the **two-tier distance**. The PQ codes drive the graph walk — fast, in-RAM, approximate — and the full vectors loaded as a side-effect of page IO drive the final re-ranking. Each query loads only a few hundred SSD pages (the path through the graph from entry point to target), so SSD IO is bounded.
DiskANN: PQ in RAM, full vectors and graph on SSDA diagram showing two tiers. Top tier (RAM): PQ-compressed codes for all N vectors, ~8 GB at PQ8 for a billion. Bottom tier (SSD): pages each containing one node's full vector and its neighbour IDs. Query walks the graph using PQ distances to pick neighbours; loads a few SSD pages along the path; re-ranks the top candidates using the full vectors loaded.DiskANN: PQ-driven graph walk on SSD, full-vector re-rankRAM (~ 8 GB for 1 B vectors at PQ8)Compressed PQ codes for every vector — used to score neighbours during graph walkv₀8Bv₁8Bv₂8Bv₃8Bv_{i}8Bv_{N-1}8BSSD (~ 3 TB for 1 B vectors uncompressed)Vamana graph: each page = one node's full vector (3 KB) + ~32 neighbour IDs (128 B)node v₀full vec: 768fnbrs: [v₃,v₇,…]node v₃full vec: 768fnbrs: [v₀,v_{12},…]node v_{12}full vec: 768fnbrs: [v₃,v_{31},…]node v_{31} (target)full vec: 768fnbrs: [v_{12},…]Query walk: load v₀ page, score its PQ neighbours, jump to v₃, repeat — ~5 to 20 page loads per queryEach NVMe page load: ~50 µs · 20 hops ≈ 1 ms IO total+ PQ scan ~10 ms + re-rank ~3 ms = ~ 15 ms p50, < 100 ms p99 on 1 B vectors
Two storage tiers, two distance functions. The PQ codes for every vector live in RAM (8 GB for a billion); they drive the graph walk by scoring candidate neighbours fast. The actual graph and full vectors live on SSD; pages are loaded only along the walk's path. The full vectors loaded as a side-effect feed the final re-rank, recovering 99 % recall.
A few things DiskANN gets specifically right that earlier on-disk graph attempts got wrong. **The Vamana graph is built for SSD layout.** Standard HNSW has highly variable out-degrees and unpredictable hop counts; that hurts SSD because each random page load is ~50 µs and you cannot afford 100 of them per query. Vamana caps out-degree, prunes long edges, and its construction algorithm explicitly minimises hop count. Result: most queries converge in 5–20 hops. **SSD pages bundle vector + neighbour list.** When you load a page to read the neighbours, you get the full vector "for free". This means the same IOs that drive the graph walk also feed the re-ranking step — no extra reads. **PQ codes are co-resident with the graph index.** The PQ table and the codes are all in RAM; only the actual full vectors and adjacency lists go to SSD. This keeps the hot path in cache. The result, as published in the NeurIPS 2019 paper: **1 billion 96-dim float vectors, 64 GB RAM, sub-5 ms p99 latency, recall@10 = 95 %.** Bing's vector search runs on this. ANN-Benchmarks consistently puts DiskANN at the Pareto frontier for billion-scale single-machine indexes. ## Putting it all together — IVF-PQ at billion scale DiskANN is one combination of PQ with an index. The other industry-standard combination is **IVF-PQ** — IVF for the coarse partitioning, PQ to compress the per-cluster vectors. FAISS's `IndexIVFPQ` is the canonical implementation; see the original [Jégou-Douze-Schmid paper](https://hal.inria.fr/inria-00514462v2/document) for the full derivation. The recipe: 1. Train IVF with `K = √N` centroids — say 65 K centroids for 1 B vectors. 2. For each vector, subtract its coarse centroid (the **residual**), then PQ-encode the residual. Why the residual: the residuals are small-magnitude vectors clustered around zero, so PQ codebooks trained on residuals capture variance better than codebooks trained on raw vectors. This is the asymmetric distance computation with residual coding (ADC-R) trick from the original paper. 3. Store the IVF inverted lists, each entry being `(vector_id, M-byte PQ code)`. 4. At query time: pick the top-`nprobe` clusters by centroid distance, then for each cluster scan the PQ codes using ADC, finally re-rank top-`R` with the full vectors. Index sizes for 1 B × 768-dim float32 with `K = 65 K`, `M = 8`: - **Flat:** `1 B × 3072 B = 3 TB` (impossible). - **IVF only:** same 3 TB plus 200 MB of centroids. Still impossible. - **IVF-PQ8:** `200 MB centroids + 1 B × (4 B id + 8 B PQ code) = 12 GB`. Fits in RAM. - **IVF-PQ8 with full-vector re-rank cache on SSD:** 12 GB RAM + 3 TB SSD. Sub-100 ms p99. :::example **Sizing a 1 B-vector Indian e-commerce search index** Imagine you run an Indian e-commerce platform — a billion product cards across electronics, fashion, kirana, books, the works — and you want a single semantic search index over all of them. Each product description is embedded with a 768-dim model. **Without compression.** `1 B × 768 × 4 B = 3 TB`. No single AWS instance has 3 TB of RAM at any reasonable cost. The cheapest path is **sharding** across ~50 boxes (each holding ~60 GB of vectors), with a coordinator that fans out queries and merges results. Per-query latency: ~50 ms intra-shard + ~30 ms network + ~20 ms merge ≈ 100 ms. Cost: 50 × `r6i.4xlarge` (128 GB RAM each) at ~₹85/hour = ~₹4250/hour ≈ ₹3 crore/year. Plus the engineering cost of running a 50-box sharded index. **With PQ8.** `1 B × 8 B = 8 GB` of compressed vectors + ~1 GB of IVF centroids and metadata = **9 GB total**. Fits comfortably in 64 GB of RAM with room for the OS, JVM heap, query buffers, and a re-rank cache. Single box: one `r6i.2xlarge` (64 GB) at ~₹17/hour = ~₹1.5 lakh/year. **A 200× cost reduction**, and you have replaced a 50-box distributed system with a single Python process. **Per query:** - IVF coarse search: scan 65 K centroids · 768 dims = 50 M float ops ≈ 20 ms (or much less if using BLAS). - Pick top `nprobe = 32` clusters out of 65 K — that is `32/65000 ≈ 0.05 %` of the corpus, ~500 K vectors. - ADC scan over 500 K PQ codes at 8 ns per vector ≈ 4 ms. - Re-rank top 100 with full vectors loaded from SSD: ~3 ms IO + 0.1 ms compute. - **Total: ~10 ms p50, ~30 ms p99.** **Recall.** With PQ8 alone the top-10 result list has ~92 % recall against the brute-force gold standard. With re-rank to top-100 then exact-distance to top-10, recall climbs to ~99 %. For e-commerce search this is functionally indistinguishable from exact — and certainly indistinguishable from the noise in user click data. **The decision.** Ship IVF-PQ8 with `K = 65 K`, `nprobe = 32`, `M = 8`, and a re-rank-100 stage. One box, ₹1.5 lakh/year, 30 ms p99, 99 % recall. The IVF coordinator that the unsharded version would have needed becomes one less component to operate, monitor, and pay AWS for. Why this works: the savings come from the PQ compression making the corpus fit in RAM, not from the IVF clustering — IVF is "merely" the speed index. PQ is the cost-killer. This is why IVF-PQ is the canonical recipe past 100 M vectors and why DiskANN is the alternative when you want to push past a billion or when re-rank IO budget is tight. ::: ## Where PQ and DiskANN live in the wild **[FAISS](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#pq-the-pq-quantizer)** — the canonical implementation. `IndexPQ` (pure PQ, no coarse index), `IndexIVFPQ` (IVF + PQ, the standard recipe), `IndexIVFPQR` (with residual re-ranking), `IndexHNSWPQ` (HNSW graph + PQ-compressed vectors). The single most production-tested similarity-search library on Earth. **[DiskANN (Vistron, GitHub)](https://github.com/microsoft/DiskANN)** — the Vamana-on-SSD reference implementation, MIT-licensed, in C++. Used by Bing, available as a backend in Milvus and Vespa. The repository contains the SSD-optimised search code and a training script that handles billion-scale indexes on commodity hardware. **[Milvus](https://milvus.io/docs/index.md)** — vector database that exposes `IVF_PQ`, `IVF_SQ8`, and `DISKANN` as index types. The `DISKANN` index is a Milvus wrapper around Vistron's library; widely used by Indian fintech and e-commerce startups for product search and KYC document matching. **[Vespa](https://docs.vespa.ai/en/approximate-nn-hnsw.html)** — Hooyay's open-source search platform. Defaults to HNSW but supports a custom PQ implementation for memory-constrained deployments. **[Pinecone](https://www.pinecone.io/learn/series/faiss/product-quantization/)** — managed vector DB; uses PQ-style compression internally, with proprietary tweaks. Their engineering blog has a clear vendor-neutral PQ explainer. **ScaNN (Querion)** — uses **anisotropic vector quantisation**, a more sophisticated cousin of PQ tuned for inner-product (recommendation) workloads. ScaNN's quantisation pays attention to *which* dimensions of the residual matter most for ranking, where PQ treats all dimensions equally. **OPQ (Optimised Product Quantisation)** — a 2013 refinement by Ge et al. that learns a rotation `R` of the input space *before* PQ, so that the sub-vector axes are better aligned with the data's natural variance directions. Recall improves 1–3 % at zero query-time cost. FAISS's `IndexPreTransform(OPQMatrix(d, M), IndexIVFPQ(...))` ships this for free; almost everyone running IVF-PQ at scale should turn it on. The pattern across all of these: **compress aggressively, walk a small index, re-rank the top candidates exactly.** That triplet — compression + ANN index + re-rank — is the only way billion-scale similarity search works on commodity hardware in 2026, and PQ is the load-bearing piece. ## When to reach for PQ (and when not to) **Reach for PQ when:** - You have ≥ 10⁸ vectors and uncompressed storage is hurting (RAM cost, sharding overhead, network). - You can tolerate a 5–10 % recall drop on raw retrieval, recovered by a re-rank pass. - Your queries can afford the precompute step (negligible — sub-millisecond per query). **Reach for DiskANN when:** - You have ≥ 10⁹ vectors and even PQ-compressed memory is uncomfortable (8 GB at PQ8 for a billion, ~80 GB for ten billion). - You have NVMe SSDs available with low-latency random reads. - You can build the index offline (Vamana construction is slower than HNSW build). **Reach for plain IVF or HNSW instead when:** - You have < 10⁸ vectors and the full-precision index fits in RAM. - You need maximum recall with no re-rank tolerance. - Index build time is a hard constraint (PQ training is fast but adds a step). **Skip quantisation entirely when:** - You have < 10⁶ vectors. The compression ratio is wasted; full precision is cheap. - Your downstream re-rank uses a much more expensive cross-encoder anyway. The PQ approximation is dominated by the cross-encoder cost. The decision is mostly about scale. Below 100 M, plain IVF or HNSW. Between 100 M and 1 B, IVF-PQ. Above 1 B (or below the memory budget IVF-PQ needs), DiskANN. ## Common confusions
  • "PQ is just like JPEG — lossy compression of vectors." Misleading. JPEG-style compression discards perceptually irrelevant frequencies and reconstructs an approximate image. PQ replaces each sub-vector with the single closest centroid from a tiny learned codebook; the goal is not to reconstruct the vector well but to preserve relative distances to a query. The reconstruction error is irrelevant; the distance error is what counts, and PQ is engineered to keep distance error small even when reconstruction error is large.

  • "Increasing M always improves recall — so just use M = 64." No, two things bend the curve. First, recall flattens past M = 32 for typical 768-dim text embeddings; the marginal sub-vector adds nothing because the residual structure is already captured. Second, the per-query lookup table grows as M × 256, and the per-vector ADC scan does M reads; doubling M doubles scan time. The right answer is M = 8 for prototypes, M = 16 for production retrieval-augmented generation, M = 32 only when you have benchmarked recall and confirmed the lift.

  • "DiskANN is just HNSW with the graph spilled to disk." Wrong on two counts. First, DiskANN uses the Vamana graph, which is built differently from HNSW — bounded out-degree, explicit long-edge pruning, and a construction algorithm that minimises hop count to suit SSD page-load latency. Second, DiskANN's two-tier distance (PQ in RAM driving the walk, full vectors on SSD driving the re-rank) is the load-bearing trick; a naive "HNSW on SSD" without PQ-driven walking would do thousands of random page loads per query and miss latency targets by 100×.

  • "PQ training requires the full corpus." A common over-engineering trap. K-means converges to statistically equivalent codebooks on a 100 K–1 M sample as on the full billion. FAISS's training defaults sample to ~256 K vectors. Training on the whole corpus turns a 5-minute job into a multi-hour one with no recall benefit.

  • "Re-ranking with full vectors defeats the point of compression — you still need 3 TB of full vectors somewhere." Yes, but on SSD, not RAM. NVMe storage is ₹10 per GB versus ₹500 per GB for DRAM — a 50× cost gap. The compressed in-RAM index is what makes the system fast (no disk IO on the hot scan path); the on-disk full vectors are what make it accurate (loaded for ~100 candidates per query, ~3 ms of NVMe IO). Storage tier separation is the design, not a workaround.

  • "PQ codebooks transfer between corpora with the same embedding model." No. The codebooks are k-means centroids fitted to the distribution of sub-vectors in your corpus. A codebook trained on Indian e-commerce product descriptions does not match a codebook trained on Wikipedia articles, even if both use the same bge-base 768-dim embedder — the regions of the 768-dim space that get populated are different. Train per-corpus; never reuse codebooks across deployments.

Going deeper

The chapter so far has covered the canonical PQ pipeline and DiskANN. This section pulls on four threads JEE-Advanced-curious readers will want: the precise error model that makes ADC work, the OPQ rotation trick, what changes with binary quantisation, and the latency arithmetic when you push to ten billion vectors.

The error model — why ADC distances are unbiased estimators

ADC approximates the true distance by replacing each sub-vector v^(m) with its centroid C_m[i_m]. The approximation error per sub-position is e^(m) = v^(m) - C_m[i_m], the residual after quantisation. K-means minimises the expected squared residual norm, so E[‖e^(m)‖²] is small and E[e^(m)] is roughly zero (centroids sit at sub-vector cluster means).

Expand the squared distance:

\|q - v\|^2 = \sum_m \|q^{(m)} - v^{(m)}\|^2 = \sum_m \|q^{(m)} - C_m[i_m] - e^{(m)}\|^2

Expanding the inner term:

\|q^{(m)} - C_m[i_m]\|^2 - 2 \cdot \langle q^{(m)} - C_m[i_m],\, e^{(m)}\rangle + \|e^{(m)}\|^2

ADC drops the cross term and the residual norm — it computes only the first piece. Why dropping these terms is acceptable: the cross term has expectation zero across vectors (the residuals are isotropic in the cluster's neighbourhood), so it averages out across the corpus. The residual-norm term is the same for every vector in a given centroid bucket, so it shifts all candidates' distances by approximately the same constant — and shifts that affect every candidate equally do not change the ranking, only the absolute distances. ADC is therefore an unbiased estimator of the rank, which is all retrieval cares about.

The bias of ADC's distance is O(‖e^(m)‖²) — the residual energy — and this is what k-means makes small. The variance is O(‖q^(m)‖ · ‖e^(m)‖) from the cross term — the geometry that controls re-rank effectiveness. When ‖e^(m)‖ is small (well-trained codebook, sufficient M), both terms are tame and PQ ranking matches exact ranking on the top candidates.

OPQ — rotate before you split

PQ's hidden assumption: the data's variance is roughly balanced across the d/M dimensions inside each sub-vector group. If sub-position 0 happens to contain dimensions 0–95 and those 96 dimensions carry 50 % of the embedding's total variance while the remaining 672 dimensions split the other 50 % across them, then the sub-position 0 codebook is overworked and its centroids fight to cover too much variance, while the other sub-positions waste codebook capacity.

Optimised PQ (Ge et al. 2013) fixes this by learning a rotation matrix R ∈ ℝ^(d×d) such that, after applying R, the variance is uniformly distributed across all d dimensions. Then PQ on R · v works against balanced sub-spaces and the per-codebook quantisation error drops by 1–3 % on standard benchmarks.

The rotation is free at query time: rotate the query once, then proceed. The rotation matrix is d × d × 4 = 2.4 MB for d = 768, fits in cache. FAISS's IndexPreTransform(OPQMatrix(d, M), IndexIVFPQ(...)) ships this. Why this matters in practice: a 2 % recall lift at zero query-time cost is the kind of free win that compounds at scale — when you are running PQ8 at 92 % recall, OPQ pushes you to 94 %, and that is the difference between needing a re-rank pass and not.

Binary quantisation — the cheaper cousin

If 8 bits per sub-vector is too generous, drop to 1 bit per dimension: store 1 if the dimension is positive, 0 if negative. A 768-dim vector becomes 96 bytes (a 32× compression, less aggressive than PQ but with no codebooks). Distance becomes Hamming distance — XOR + popcount, executable as one SIMD instruction per 64 dims, blindingly fast.

Binary quantisation hurts recall more than PQ at the same compression ratio (binary at 96 bytes ~ 80 % recall@10; PQ16 at 16 bytes ~ 96 % recall@10), but it is simpler — no training step, no codebook management, no ADC table — and the per-vector scan is even faster than PQ. Most production systems combine the two: binary quantisation as a first-pass filter to get from N=1B down to ~10K candidates, then PQ ADC over those 10K, then exact re-rank on the top 100. Three tiers, three accuracy/speed budgets. This is what Microsoft Bing's vector search, Tunestack's recommendation system, and Cohere's reranker pipelines all do under the hood.

Ten billion vectors — what changes

PQ8 at ten billion vectors is 80 GB of compressed codes — past the 64 GB ceiling of a commodity box. Three options:

  1. Move to PQ4 with K = 16. Each sub-vector is 4 bits, total 4 bytes per vector at M = 8. Recall drops to ~85 %; re-rank lifts it to ~96 %. Total RAM: 40 GB. Compute cost: same as PQ8 because the per-vector scan is bounded by memory bandwidth, not arithmetic.

  2. Two-level quantisation. First a coarse 256-cluster IVF, then PQ16 within each cluster. The IVF inverted lists are stored compactly and most clusters are visited rarely; cold clusters can be paged out to SSD. This is FAISS's IndexIVFPQ with nprobe tuned aggressively — the in-RAM working set is nprobe / K · 80 GB, perhaps 5–10 GB for typical settings.

  3. Distribute across two boxes. Two r6i.4xlarge (128 GB each) running DiskANN, with a coordinator that routes by ID hash. Per-query latency rises from ~15 ms to ~30 ms (one network hop), throughput halves per box, but you have headroom to grow to 50 B without re-architecting.

The Indian context: Aadhaar is roughly 1.4 billion enrolments. If we hypothetically built a face-embedding-based search across all of them, we would land in regime (1) — PQ4 on a 64 GB box would work, with re-rank on SSD, at ~50 ms p99. UIDAI uses a different (proprietary, not vector-based) deduplication system, but the math says vector search at that scale is in reach of off-the-shelf hardware in 2026 — which is why startups like SARVAM and Krutrim are building open-source embedding-based search for Indic-language corpora.

The compaction interaction

PQ training is a batch operation. When new vectors are inserted, you encode them with the existing codebook — which was trained on the historical distribution. Over time, the data drifts (new product categories on BharatBazaar, new slang on Indian Chirpline, new genres in Bollywood music recommendations) and the codebook stops being optimal. Recall slowly degrades.

The fix: periodic re-training. Every few weeks, sample 1 M vectors from the current corpus, retrain the codebooks, re-encode the entire corpus. The re-encoding is O(N) but cheap — about 30 minutes for a billion vectors on a single box, because it is just M k-means assignments per vector. Production systems schedule this as a low-priority background job and atomically swap the codebook + re-encoded corpus when ready. Why this matters: agents I have seen forget that codebooks have a half-life. The first deployment ships with a beautiful PQ index trained on a clean snapshot; six months later, recall has dropped 5 % and nobody knows why. The answer is always: retrain the codebook. Calendar it.

What you have built

You started this chapter with a memory wall: a billion vectors do not fit in commodity RAM at full precision. You end it with two complementary tools that knock the wall down. Product Quantization compresses each vector by 100–400× by exploiting per-sub-position structure with a tiny per-position k-means codebook; distances become a M × 256 lookup-table sum, faster than any float arithmetic. DiskANN combines PQ with a graph index laid out for SSD pages, putting a billion vectors on a single 64 GB box at sub-100 ms latency. The next chapter takes the search problem in a different direction — filtered search, where the user wants "nearest vectors and in-stock and under ₹5000". The compression and indexing tricks from chapters 152–155 stay; what changes is how to combine them with the structured filters that real e-commerce queries always have.

References

  1. Jégou, Douze, Schmid — Product Quantization for Nearest Neighbor Search (2011). The foundational PQ paper. Sections 3 and 4 derive ADC and IVFADC; everything since is variations on this.
  2. Ge, He, Ke, Sun — Optimized Product Quantization (PAMI 2013). Adds a learned rotation before PQ, lifting recall by 1–3 % at no query-time cost. The default in modern FAISS pipelines.
  3. Subramanya, Devvrit, Simhadri, Krishnaswamy, Kadekodi — DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node (NeurIPS 2019). The Vistron Research paper. Vamana graph construction, SSD page layout, and the billion-vector benchmark.
  4. Vistron DiskANN — GitHub. The MIT-licensed reference implementation in C++. Includes scripts for billion-scale index builds and the SSD-optimised search runtime.
  5. Pinecone — Product Quantization. Vendor-flavoured but well-illustrated walkthrough of PQ encoding, ADC, and the IVF-PQ combination.
  6. Johnson, Douze, Jégou — Billion-scale similarity search with GPUs (2017). The FAISS paper. Section 4 covers GPU-accelerated IVF-PQ at billion scale, the engineering basis for almost every large-scale vector index in production today.