In short

Flat search compares the query vector to every vector in the corpus — O(N·d) per query, somewhere between unpleasant and impossible past a million rows. IVF (Inverted File index) is the first ANN technique anyone reaches for, and the one that still ships under the hood at Pinecone, Milvus, and half of FAISS deployments. The recipe has two phases. Train once: run k-means on a sample of your vectors to find K cluster centroids — typically K ≈ √N, so 1000 centroids for 1 M vectors — then assign every vector to its nearest centroid and store an inverted list cluster_id → [vector_ids]. Query forever: compute the distance from the query to each of the K centroids, pick the top nprobe closest centroids (nprobe = 1 is fastest, nprobe = K reduces to flat search), then run a flat search only inside those few clusters. Per query you do roughly K + nprobe · N/K distance computations; setting K = √N minimises this and gives you O(√N) per query — for one million vectors, ~2000 comparisons instead of 1,000,000, a 500× speedup. The cost is recall: if the true nearest neighbour sits in a cluster whose centroid was not in your top-nprobe, you miss it. Practical nprobe values are 5 to 20, giving 90–98 % recall at 50–200× speedup over flat. IVF is what you build when HNSW is overkill and PQ is overcomplicated, and it remains the right answer for tens of millions of vectors on a single machine.

The previous chapter ended with an honest admission: flat search is a brick wall. At 100 k vectors it is fine. At 1 M it is sluggish. At 100 M it is a server farm. The reason is not subtle — every query touches every vector, every vector costs d multiplies and a square root, and you cannot make either of those numbers smaller without either lowering d (which hurts quality) or shrinking N (which defeats the purpose of having data). The only way out is to stop looking at most of the vectors.

This is the move that every approximate-nearest-neighbour technique makes. The differences between IVF, HNSW, ScaNN, and Annoy are all about which vectors you decide to skip and how you decide to skip them. IVF makes the simplest, most defensible choice: carve the space into geographic regions, and at query time only look in the region the query lives in. The rest of this chapter is an unpacking of that one sentence.

The thesis: search the neighbourhood, ignore the rest of the city

Picture every vector in your corpus as a pin on a map. With 1 M pins scattered across the globe and somebody asking "what is the closest pin to this coordinate in Bengaluru?", you do not check every pin in São Paulo. You check the pins in Karnataka. Maybe Tamil Nadu. Definitely not Greenland.

IVF formalises this intuition. You pre-partition vector space into K regions — Voronoi cells around K learned centroids — and you keep a lookup table that says, for each cell, exactly which vector IDs landed inside it. At query time, you find the cell the query falls into and brute-force only its inhabitants. The "inverted file" name is a borrow from the classical IR inverted index — there, term → list_of_doc_ids; here, cluster_id → list_of_vector_ids. Same data structure, different keys. Why "inverted": you are inverting the natural mapping. The natural map is vector_id → cluster_id (one row per vector). The useful map for query time is the inverse — given a cluster, list its members — and storing that explicitly is what makes a query fast.
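
To make "inverted" concrete, here is a minimal sketch — with made-up IDs — of the two mappings: the forward assignment that training produces, and the inverted lists that query time actually uses.

```python
# forward map, one entry per vector: vector_id -> cluster_id
assignments = {0: 2, 1: 0, 2: 3, 3: 0, 4: 1}

# inverted map, one entry per cluster: cluster_id -> [vector_ids]
inverted_lists = {}
for vector_id, cluster_id in assignments.items():
    inverted_lists.setdefault(cluster_id, []).append(vector_id)

# inverted_lists == {2: [0], 0: [1, 3], 3: [2], 1: [4]}
```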

The whole technique fits in two phases.

Phase 1: training — k-means partitions the space

You start with a corpus of N vectors in ℝ^d. You pick a number K of clusters (we will derive K = √N shortly). You run k-means on the vectors — or on a random sample of, say, 100 k of them, since k-means on the full billion is itself a project — and out come K centroids c_1, c_2, …, c_K, each a vector in ℝ^d. Then you do one final pass over the full corpus: for each vector v, find the nearest centroid c_i, and append v's ID to inverted list i.
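
For the mechanics, here is a from-scratch sketch of the training phase in plain NumPy — Lloyd's k-means on a sample, then one full assignment pass to build the inverted lists. It is a toy for modest N and K (the un-chunked distance matrix gets large at 1 M × 1000); FAISS does the real thing later in the chapter, and every name here is illustrative.

```python
import numpy as np

def pairwise_sq_l2(a, b):
    # squared L2 distance between every row of a and every row of b
    return (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]

def train_ivf(xb, K, n_iter=20, sample_size=100_000, seed=0):
    rng = np.random.default_rng(seed)
    sample = xb[rng.choice(len(xb), size=min(sample_size, len(xb)), replace=False)]
    centroids = sample[rng.choice(len(sample), size=K, replace=False)].copy()
    for _ in range(n_iter):                      # Lloyd's algorithm
        assign = pairwise_sq_l2(sample, centroids).argmin(axis=1)
        for k in range(K):                       # move each centroid to the mean of its members
            members = sample[assign == k]
            if len(members) > 0:                 # empty clusters keep their old centroid
                centroids[k] = members.mean(axis=0)
    # final pass over the FULL corpus: assign every vector, build the inverted lists
    # (for large N you would do this pass in chunks)
    assign = pairwise_sq_l2(xb, centroids).argmin(axis=1)
    inverted_lists = {k: np.flatnonzero(assign == k) for k in range(K)}
    return centroids, inverted_lists
```

The two return values are exactly the two artefacts counted next.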

That is the entire training phase. Two artefacts come out: the centroid table (K vectors of dimension d) and the inverted lists (K lists, totalling N entries between them). For 1 M vectors with K = 1000, the centroid table is 1000 × 768 × 4 bytes ≈ 3 MB, and the inverted lists hold integer IDs (plus pointers to the original vectors), 1 M × 4 bytes = 4 MB of metadata. Both fit comfortably in a CPU's last-level cache.

[Figure: IVF training — k-means partitions N vectors into K Voronoi cells, each with a centroid; alongside, an inverted-list table maps cluster_id → [vector_ids], one list per cluster, N entries across K lists.]
K-means runs once, offline. Each vector gets a single cluster ID; the inverted lists store the membership. With `K = √N` and 1 M vectors, you end up with 1000 lists of about 1000 vectors each. The centroids and lists are the entire index — no graphs, no trees, just a flat lookup table.

A few practical points the textbook leaves out.

You do not need the exact k-means optimum. k-means is non-convex; you are not going to find the global minimum, and you do not need to. Run Lloyd's algorithm for 20 iterations on a sample of 100 k vectors and call it done. The clusters are a rough partitioning — IVF's accuracy depends on nprobe far more than on how polished the centroids are. Why: an extra k-means iteration only nudges cluster boundaries by a few percent, while bumping nprobe from 5 to 10 roughly halves the miss rate (see the rule of thumb below). The latter is the dial you should be turning.

Use the same metric for k-means as for queries. If you intend to search by cosine similarity, train k-means on L2-normalised vectors (because cosine on unit vectors is monotonic with L2). If you intend to search by inner product, that is a separate beast — IP-IVF needs a clustering objective that respects inner-product, which FAISS handles via IndexIVFFlat with METRIC_INNER_PRODUCT.
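
As a sketch of the two set-ups in FAISS (dimensions and cluster counts here are illustrative):

```python
import faiss

d, K = 768, 1000

# cosine: L2-normalise the vectors (faiss.normalize_L2), then a plain L2 index works —
# on unit vectors, ranking by L2 distance equals ranking by cosine similarity
cos_quantizer = faiss.IndexFlatL2(d)
cosine_ivf = faiss.IndexIVFFlat(cos_quantizer, d, K, faiss.METRIC_L2)

# inner product: the quantizer and the index both use the inner-product metric
ip_quantizer = faiss.IndexFlatIP(d)
ip_ivf = faiss.IndexIVFFlat(ip_quantizer, d, K, faiss.METRIC_INNER_PRODUCT)
```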

Train on a representative sample, not the full corpus. Running k-means on a billion vectors is itself a multi-hour job; it is also unnecessary. The centroids you learn from a uniformly random sample of 100 k–1 M vectors will be statistically indistinguishable from the centroids you would have learned from the full set. FAISS's train() method assumes you will pass it a sample.
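
In code that is one extra line before train(); xb and ivf below stand in for the corpus array and the IndexIVFFlat built in the full script later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = xb[rng.choice(len(xb), size=min(100_000, len(xb)), replace=False)]
ivf.train(sample)   # k-means sees only the sample
ivf.add(xb)         # every vector in the corpus still gets assigned and indexed
```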

Phase 2: querying — find the right neighbourhood, then brute-force

A query vector q arrives. The procedure is mechanical:

  1. Coarse search. Compute dist(q, c_i) for every centroid c_i. There are only K of them, so this is K · d multiplies — fast. Pick the nprobe centroids with the smallest distance.
  2. Fine search. For each of those nprobe centroids, iterate through its inverted list and compute dist(q, v) for every v in the list.
  3. Merge. Combine the candidate distances into a top-k heap, return the k closest.

That is the whole query path. The total number of distance computations is

\text{cost}(q) = \underbrace{K}_{\text{centroid scan}} + \underbrace{\text{nprobe} \cdot \frac{N}{K}}_{\text{fine scan}}

assuming clusters are roughly balanced (each holds about N/K vectors). For K = 1000, N = 1 M, nprobe = 10, you get 1000 + 10 · 1000 = 11,000 comparisons per query — versus 1,000,000 for flat. A 90× speedup. Push nprobe down to 1 and you get a 500× speedup at the cost of recall; push it up to 50 and you get a 20× speedup with near-flat-search recall.

[Figure: IVF query — the query q is compared to all K centroids (step 1), then brute-forced only inside the top-nprobe cells (step 2); cost ≈ K + nprobe·N/K instead of N.]
The query first picks its neighbourhood (which centroid is closest), then brute-forces only inside it. With four clusters and `nprobe = 1`, three quarters of the corpus is never touched. With `nprobe = 2`, you would also search the second-closest cluster — slightly slower, noticeably better recall near boundaries.
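
As a from-scratch sketch, here is the same three-step query path in plain NumPy, reusing the centroids and inverted_lists produced by the toy training function earlier (names are illustrative):

```python
import numpy as np

def ivf_search(q, centroids, inverted_lists, xb, k=10, nprobe=10):
    # step 1: coarse search — distances from q to all K centroids
    coarse = ((centroids - q) ** 2).sum(axis=1)
    probe = np.argsort(coarse)[:nprobe]                      # the nprobe closest cells
    # step 2: fine search — exact distances, but only inside the probed lists
    candidate_ids = np.concatenate([inverted_lists[c] for c in probe])
    fine = ((xb[candidate_ids] - q) ** 2).sum(axis=1)
    # step 3: merge — keep the k closest candidates
    top = np.argsort(fine)[:k]
    return candidate_ids[top], fine[top]
```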

The recall trade-off — and why it has a name

Look at the diagram one more time. The query star sits near the boundary between cluster 1 (orange) and cluster 3 (red). With nprobe = 1 you only search cluster 1. But what if q's true nearest neighbour is in cluster 3, just across the boundary, closer to q than anything in cluster 1?

You miss it. IVF returns the best candidate it found inside the searched clusters, which by definition is not the global nearest. Your recall — the fraction of true top-k neighbours you actually retrieve — drops below 100 %.

This is not a bug. It is the bargain. Approximate nearest neighbour search trades correctness for speed; IVF is the cleanest, most predictable way to make that trade.

[Figure: the IVF speed-vs-recall trade-off — left panel: per-query cost K + nprobe·N/K versus K, minimised at K = √N; right panel: recall versus nprobe, rising steeply at small nprobe and approaching 100 % as nprobe → K (flat search); an inset shows a query whose true nearest neighbour lies in a neighbouring cell.]
Two knobs, one trade-off. Choose `K ≈ √N` to minimise the per-query cost (left). Then choose `nprobe` to land at the recall you need (right). Production systems typically pick `nprobe` between 5 and 20 — high enough for ≥ 90 % recall, low enough that the 100× speedup over flat is preserved.

A useful way to think about the recall failure: the boundary between any two Voronoi cells is, locally, a hyperplane. A query that lands within a few percent of the cluster radius from this hyperplane has a meaningful probability that its true nearest neighbour is on the other side. The fraction of queries near such boundaries grows with dimension (this is part of the curse of dimensionality) — which is why IVF on 768-d vectors needs a higher nprobe than IVF on 64-d vectors to hit the same recall. Rule of thumb: at 768 dimensions, nprobe = 5 typically gives you 80–85 % recall, nprobe = 10 gives 92–95 %, nprobe = 20 gives 97–98 %.

Why K ≈ √N is the right number of clusters

Take the cost expression and treat K as the variable:

f(K) = K + \text{nprobe} \cdot \frac{N}{K}

For nprobe = 1, this is K + N/K. Take the derivative and set it to zero:

f'(K) = 1 - \frac{N}{K^2} = 0 \quad\Longrightarrow\quad K^2 = N \quad\Longrightarrow\quad K = \sqrt{N}

At that value, f(K) = 2√N. Why this is satisfying: with K = √N the centroid scan and the in-cluster scan cost the same — neither dominates. If K is too small, clusters are huge and the in-cluster scan dominates. If K is too large, you spend all your time scanning centroids. The minimum is exactly when both terms are balanced.
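
A quick numerical check of the same claim, for N = 1 M and nprobe = 1:

```python
N = 1_000_000
for K in (100, 316, 1_000, 3_162, 10_000):   # values around sqrt(N) = 1000
    print(f"K = {K:>6}   cost = {K + N / K:>10,.0f}")
# the cost bottoms out at K = 1000, where both terms equal sqrt(N) and cost = 2·sqrt(N) = 2000
```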

For nprobe > 1, the optimum shifts to K = √(nprobe · N). In practice, you pick K once at index time (you cannot change it without re-clustering) and tune nprobe per query — so the convention is to pick K for nprobe = 1 and accept that higher nprobe settings sit slightly off the optimum. At nprobe = 10 that costs roughly 75 % more distance computations than the truly optimal K — still a ~90× speedup over flat — and it is not worth the operational complexity of re-clustering or maintaining a separate index per nprobe setting.

A few concrete numbers, all assuming 768-dim vectors and nprobe = 10:

| N | √N | typical K | comparisons (nprobe = 10) | flat baseline | speedup |
|------|--------|-----------|------------------------------|---------------|---------|
| 10 k | 100 | 100 | 100 + 10·100 ≈ 1.1 k | 10 k | 9× |
| 100 k | 316 | 256 | 256 + 10·391 ≈ 4.2 k | 100 k | 24× |
| 1 M | 1000 | 1024 | 1024 + 10·977 ≈ 10.8 k | 1 M | 93× |
| 10 M | 3162 | 4096 | 4096 + 10·2441 ≈ 28.5 k | 10 M | 350× |
| 100 M | 10 000 | 16 384 | 16 384 + 10·6104 ≈ 77.4 k | 100 M | 1290× |
| 1 B | 31 623 | 65 536 | 65 536 + 10·15 259 ≈ 218 k | 1 B | 4587× |

The speedup grows as √N. At a billion vectors, IVF alone is ~5000× faster than flat. At that scale you typically combine it with Product Quantization (giving IVF-PQ, the canonical FAISS large-scale recipe) to also compress the in-cluster vectors and squeeze more vectors per byte of RAM.

Building IVF with FAISS — a complete worked path

FAISS (Facebook AI Similarity Search) is the canonical implementation of IVF and the reference everyone else copies from. Here is a complete script that builds an IVF index over 100 k vectors, queries it at varying nprobe, and measures recall against a brute-force baseline.

```python
import numpy as np
import faiss
import time

# 1. Generate synthetic data: 100k vectors in 768d (Sentence-BERT scale)
d = 768
N = 100_000
n_queries = 100
np.random.seed(42)

xb = np.random.randn(N, d).astype('float32')        # corpus
xq = np.random.randn(n_queries, d).astype('float32') # queries
# L2-normalise (so L2 distance ↔ cosine similarity)
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

# 2. Ground truth via flat search (the brute-force baseline)
flat = faiss.IndexFlatL2(d)
flat.add(xb)
t0 = time.time()
gt_dists, gt_ids = flat.search(xq, k=10)
flat_time = (time.time() - t0) * 1000 / n_queries
print(f"Flat: {flat_time:.2f} ms/query")

# 3. Build IVF: K = √N ≈ 316, round up to 320
K = 320
quantizer = faiss.IndexFlatL2(d)                     # used inside k-means
ivf = faiss.IndexIVFFlat(quantizer, d, K, faiss.METRIC_L2)
ivf.train(xb)        # runs k-means; needs samples ≥ K (FAISS warns < 39·K)
ivf.add(xb)          # assigns each vector to its nearest centroid

# 4. Sweep nprobe and measure recall@10 + latency
for nprobe in [1, 5, 10, 20, 50]:
    ivf.nprobe = nprobe
    t0 = time.time()
    _, ivf_ids = ivf.search(xq, k=10)
    ivf_time = (time.time() - t0) * 1000 / n_queries

    # recall@10 = fraction of true neighbours retrieved
    correct = 0
    for i in range(n_queries):
        correct += len(set(gt_ids[i]).intersection(set(ivf_ids[i])))
    recall = correct / (n_queries * 10)

    speedup = flat_time / ivf_time
    print(f"nprobe={nprobe:3d}  {ivf_time:5.2f} ms/q  "
          f"recall@10={recall:.2%}  speedup={speedup:.1f}×")
```

Typical output on a modern laptop:

```
Flat:         42.10 ms/query
nprobe=  1     0.31 ms/q  recall@10=58.30%  speedup=135.8×
nprobe=  5     0.94 ms/q  recall@10=88.00%  speedup= 44.8×
nprobe= 10     1.62 ms/q  recall@10=94.20%  speedup= 26.0×
nprobe= 20     2.91 ms/q  recall@10=97.50%  speedup= 14.5×
nprobe= 50     6.74 ms/q  recall@10=99.40%  speedup=  6.2×
```

Two things to notice. First, the recall numbers on synthetic Gaussian data are pessimistic. Real embeddings are clustered, not uniform — the same nprobe values typically yield 5–10 % higher recall on real data because true neighbours tend to share a cluster. Second, latency drops faster than recall rises, which is why operating at nprobe = 5 or 10 is the sweet spot. Why the curve is convex: as nprobe grows, you keep adding clusters whose centroids are progressively farther from the query, and the marginal probability that they contain a true neighbour shrinks. The first few clusters give most of the recall; the tail is mostly wasted work.

**A worked sizing — Indian e-commerce product search**

Say you run a marketplace with 1 million products. Each product description is embedded with text-embedding-3-small (1536 dimensions, but reduced to 768 via FAISS's IndexPreTransform for memory) and updated nightly. Customers search ~5 queries per second per app instance, 10 instances → 50 QPS sustained. The product team wants p99 latency under 50 ms and recall@10 ≥ 90 %.

Flat search: 1 M × 768 × 4 bytes = 3 GB index (fits in RAM). Per query: 1 M · 768 multiplies ≈ 30 ms on a single core, ~70 ms p99 with GC noise. At 50 QPS sustained you saturate the cores. Marginal but possible — until the catalogue grows.

IVF: Pick K = √(1 M) = 1000. Train k-means on 100 k sampled vectors (~30 s). Index size: the same 3 GB for vectors + 3 MB for centroids + 4 MB for inverted lists. Per query at nprobe = 10: 1000 centroid distances + 10 · 1000 = 10,000 in-cluster distances = 11,000 total — about 0.4 ms per query, a 75× speedup over flat. At 50 QPS sustained, total CPU time is 50 × 0.4 = 20 ms per second — 2 % of a single core. Headroom for 10× the traffic.
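
A sketch of how that index might be assembled in FAISS — a PCA reduction wrapped around the IVF index via IndexPreTransform. The array names and synthetic data are placeholders standing in for the real catalogue.

```python
import numpy as np
import faiss

d_in, d_out, K = 1536, 768, 1000

# placeholder embeddings standing in for the catalogue (real catalogue: 1 M rows)
product_vectors = np.random.randn(100_000, d_in).astype("float32")
train_sample = product_vectors[:50_000]

pca = faiss.PCAMatrix(d_in, d_out)                 # learned 1536 -> 768 projection
quantizer = faiss.IndexFlatL2(d_out)
ivf = faiss.IndexIVFFlat(quantizer, d_out, K, faiss.METRIC_L2)
index = faiss.IndexPreTransform(pca, ivf)          # apply PCA, then route into the IVF index

index.train(train_sample)                          # trains the PCA matrix, then k-means
index.add(product_vectors)                         # adds the (reduced) catalogue
faiss.extract_index_ivf(index).nprobe = 10         # nprobe lives on the wrapped IVF index
```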

Recall measurement. Hold out 1000 real customer queries with their (manually labelled or flat-search-derived) gold-standard top-10. Run IVF at nprobe ∈ {1, 5, 10, 20}, plot recall. On real product embeddings you typically see 75 % at nprobe=1, 92 % at nprobe=10, 97 % at nprobe=20. The team picks nprobe = 10: 92 % recall, 0.4 ms latency. Decision: ship with K=1000, nprobe=10.

The one production caveat. If "best-selling running shoes" must always retrieve the bestseller, audit the long tail: queries near cluster boundaries are where IVF leaks. The fix is not to abandon IVF — it is to either (a) bump nprobe to 20 for the queries you can identify as boundary cases, or (b) add a small re-ranking layer that pulls a few extra candidates and rescores with a more expensive model. This is the pattern Pinecone and Weaviate use under the hood.

What IVF gets wrong, and what fixes it

Memory still scales with N. IVF is a speed index, not a space index. You still hold every vector in RAM (or on disk) at full precision. For 100 M × 768d float32, that is 300 GB — past the comfortable RAM ceiling of a single box. The fix is to compress the vectors inside each inverted list with Product Quantization, giving you IVF-PQ, which can shrink each vector to 8–32 bytes and is what FAISS recommends past 100 M scale.
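
A sketch of the compressed variant in FAISS: IndexIVFPQ stores each vector inside its inverted list as a short PQ code instead of the full float32 row. The data and parameter choices here are illustrative.

```python
import numpy as np
import faiss

d, K = 768, 1000
m, nbits = 16, 8                      # 16 sub-quantizers × 8 bits = 16 bytes per vector

xb = np.random.randn(100_000, d).astype("float32")   # stand-in corpus

quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, K, m, nbits)
ivfpq.train(xb)                       # coarse k-means + the PQ codebooks
ivfpq.add(xb)
ivfpq.nprobe = 10
# per vector: 16 bytes of PQ code (+ a 4–8 byte id) instead of 768 × 4 = 3072 bytes of float32
```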

Cluster sizes are uneven. k-means has no balance constraint; some clusters end up with 5× the average size, others with a fifth of it. The fat clusters slow down queries that land in them; the thin ones contribute little to recall. Research variants (SPFresh and balanced-k-means formulations, for example) rebalance clusters at training or maintenance time; for most use cases, though, the imbalance is tolerable.

You cannot insert vectors without re-clustering — sort of. Strictly speaking, you can — assign new vectors to their nearest existing centroid and append. But over time the centroids drift away from the actual data distribution, and recall degrades. A common pattern is: rebuild k-means weekly or monthly on a fresh sample, swap the index atomically, drop the old one. Pinecone, Vespa, and Milvus all run some version of this background re-training.

No graph structure means no shortcut for "find me a path through similar items". This is where HNSW (the next chapter) wins — its graph traversal naturally supports queries like "find the 1000th-nearest neighbour" or "expand from this neighbour" cheaply. IVF can do these too, but each step is a fresh full query.

Where IVF lives in the wild

FAISS — the canonical implementation, open-sourced by Facebook AI Research in 2017 alongside Johnson, Douze, and Jégou's billion-scale similarity search paper. FAISS supports IndexIVFFlat (full vectors), IndexIVFPQ (quantised), IndexIVFScalarQuantizer (8-bit per component), and GPU variants. The single most production-tested ANN library on Earth; if you need IVF, start here.

Milvus — a vector database that wraps FAISS (and HNSWlib, and DiskANN) as backends. Used by eBay, Roblox, Shopee, and a long list of Indian fintech startups. Milvus's "IVF_FLAT" and "IVF_PQ" indexes are FAISS under the hood with cluster management, sharding, and a SQL-ish query layer on top.

Pinecone — managed vector DB that uses an IVF-style coarse partitioning under the hood (combined with quantisation and proprietary tweaks). Their public engineering blog has good explainers if you want a vendor-neutral perspective.

Vespa — Yahoo's open-source search platform. Defaults to HNSW for vectors but supports IVF too; chosen when memory is tight and a few percent recall loss is fine.

Annoy (Spotify) — not IVF; it uses random projection trees instead. Different algorithm, similar quality/speed trade-off, much smaller community. It is worth a mention because you will see it benchmarked alongside IVF.

ScaNN (Google) — Google's anisotropic-quantisation-based ANN library. Beats IVF on benchmarks but is harder to deploy. ScaNN's coarse step is essentially IVF; its fine step is a Google-specific quantisation scheme.

The pattern across all of these is the same — coarse-then-fine. Pick a small set of candidates with a cheap approximate filter, then refine with the exact distance computation on that small set. IVF is the cleanest, oldest expression of that pattern. HNSW is the graph-based version. ScaNN, DiskANN, and SPANN are recent refinements. They all stand on the IVF idea.

When to reach for IVF (and when not to)

Reach for IVF when:

- the corpus is roughly a million to a few tens of millions of vectors and fits in RAM at full precision on one machine;
- a few percent of recall loss is an acceptable price for a 50–200× speedup over flat;
- you want an index that is cheap to build and has exactly two knobs to tune (K at index time, nprobe at query time).

Reach for HNSW instead when:

- you need the highest recall at the lowest latency and memory is not the constraint;
- the workload needs incremental inserts or graph-style expansion queries (the next chapter).

Reach for IVF-PQ or DiskANN when:

- the vectors no longer fit in RAM at full precision — hundreds of millions to billions of vectors on one machine.

Use flat search when:

- the corpus is under ~100 k vectors, you need exact results, or you are still prototyping and want a ground-truth baseline.

The decision rarely needs more than these four bullets. The rest is parameter tuning.

What you have built

You started this chapter with a brick wall — flat search dies past 100 k vectors. You end it with a partition-and-prune index that scales as O(√N), costs two knobs to tune (K at index time, nprobe at query time), and ships in every production vector database from FAISS to Milvus to Pinecone. The mental model is simple enough to draw on a napkin: cluster the vectors, search only the relevant cluster(s), accept a small recall loss for a large speed gain. The next two chapters take this same coarse-then-fine idea further — HNSW replaces the flat partitioning with a navigable graph, and Product Quantization compresses the vectors inside the lists so a billion of them fit on one machine. By the end of Build 19, the brick wall will be a memory.

References

  1. Johnson, Douze, Jégou — Billion-scale similarity search with GPUs (2017). The FAISS paper. Section 3 is the canonical description of IVF and its GPU implementation.
  2. Jégou, Douze, Schmid — Product Quantization for Nearest Neighbor Search (2011). The foundational IVFADC paper that introduced "inverted file with asymmetric distance computation" — the IVF + PQ pairing that became FAISS's flagship.
  3. FAISS wiki: index types and parameters — the practitioner's reference for IndexIVFFlat, IndexIVFPQ, and the dozen variants.
  4. Aumüller, Bernhardsson, Faithfull — ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms (2020). Side-by-side recall/QPS curves for IVF, HNSW, Annoy, ScaNN on a dozen datasets — the right place to look when picking between them.
  5. Pinecone — Vector Indexes (IVF section). Vendor-flavoured but well-illustrated walkthrough of IVF and its trade-offs.
  6. Babenko, Lempitsky — The Inverted Multi-Index (2012). Extends IVF by clustering each half of the vector independently — same idea, finer partition.