Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Flat search is the brute-force vector index: store every embedding in a dense matrix and, on every query, dot-product it against all N rows to extract the top-K. Cost is O(N·d) per query and recall is 100% by definition — but past a million 1536-dim vectors a single CPU core needs hundreds of milliseconds per query, and the index becomes bandwidth-bound, not compute-bound. Flat is the right answer for small catalogues, low query rates, exact-recall requirements, or filter-dominated queries; everything else needs an approximate index, and flat is the ground truth those indexes are measured against.

A user types wireless earbuds with good bass under 3000 rupees, and your search service has a million product descriptions sitting on disk as 1536-dimensional float32 vectors. In under a hundred milliseconds, you must return the ten whose embeddings are closest to the query. How exactly does it find those ten?

The answer this chapter gives is the boring one: look at all of them. Compute the similarity between the query and every single stored vector, sort by score, return the top ten. There is no tree to descend, no graph to walk, no probabilistic skip — just one matrix-vector multiply and one partial sort. This is flat search, and the fact that it is dumb is exactly its strength. It is the only algorithm in this build that returns the guaranteed correct top-K every time, with no approximation, no tunable knob that trades recall for speed. Every other index in chapters 153–157 is faster than flat, and every one of them is judged by how close its results come to what flat would have returned. Flat is the ground truth.

It is also infeasible past a certain scale, and the rest of the chapter is the careful accounting of what that scale is, why exactly it breaks, and the genuinely large set of cases where you should just use flat anyway and stop over-engineering.

The algorithm in one paragraph

Store all N vectors as rows of a contiguous float32 matrix X of shape (N, d). On a query q ∈ ℝ^d:

  1. Compute the score of q against every row of X. For inner product, this is s = X @ q, a single matrix-vector product producing an N-long score array. For L2, you compute ||X − q||² row-wise. For cosine, either pre-normalise everything so it reduces to inner product, or normalise on the fly.
  2. Partial-sort the score array to extract the indices of the top-K largest (for similarity) or smallest (for distance) values. NumPy's argpartition(s, K) does this in O(N) rather than O(N log N).
  3. Return those K row indices, optionally with their scores.

That is the entire algorithm. There is no "build" phase beyond loading the matrix into memory; there is no auxiliary structure; there is no parameter to tune.

Flat search: query vector compared against every stored vectorDiagram of flat search. On the left, a single query vector q of dimension d. In the centre, a matrix X of shape N by d, with each row representing one stored vector. Arrows show q being dot-producted with every row of X, producing N scalar scores on the right. Those scores feed into a partial-sort box that extracts the top K. The total cost is labelled O(N times d) per query.Flat search: one query vs every stored vectorquery qdim dqd × 1stored matrix XN × d (every vector)x₁x₂x₃x₄x_Nscores sN scalars: q·xᵢ0.810.420.930.110.77argpartition→ top K indicesO(N) selection+ O(K log K) sortCost per query: O(N · d) multiplications + adds, plus O(N) sort
Flat search reduces to one matrix-vector multiplication. The query q is dot-producted with every row of the stored matrix X, producing N scalar scores. A partial sort (NumPy's argpartition) extracts the top K indices in linear time. There is no auxiliary index, no tree, no graph — just brute-force linear algebra. It is exact by construction.

The math: where the operations come from

Let N be the number of stored vectors and d be their dimension. A single inner-product score q · xᵢ requires d multiplications and d − 1 additions — call it 2d flops. There are N such scores per query. Total: roughly 2N·d floating-point operations per query, plus an O(N) selection of the top-K and an O(K log K) final sort of those K. The selection and sort are negligible: at K = 10 and N = 10⁶, the partial sort is about 10⁶ comparisons, while the dot products are 3 × 10⁹ flops.

Plug in the numbers people actually deploy:

N d flops/query single-core time at 10 GFLOPS memory
10,000 384 7.7 M 0.8 ms 15 MB
100,000 768 153 M 15 ms 300 MB
1,000,000 1536 3.1 G 310 ms 6 GB
10,000,000 1536 30.7 G 3 s 60 GB
100,000,000 1536 307 G 30 s 600 GB
1,000,000,000 1536 3.1 T 5 minutes 6 TB

Why these numbers feel like a wall: every row above grows latency and footprint by exactly 10× because both N and the per-row cost scale linearly. There is no algorithmic constant you can tune away. A 10× bigger catalogue is a 10× slower flat query, full stop. By a billion vectors you are doing five-minute scans and storing six terabytes — neither of which is a serving system. This is the cliff that ANN was invented to climb down.

Why it's actually memory-bandwidth bound, not compute-bound

The single-core "10 GFLOPS" number above is generous — and even when you reach it, you are usually not limited by arithmetic but by how fast bytes move from RAM into the CPU. Each dot product q · xᵢ reads d floats of xᵢ from memory and does 2d flops on them — an arithmetic intensity of about 0.5 flops per byte. Modern CPUs have arithmetic intensity ceilings of ~30 flops/byte before you saturate memory bandwidth (the roofline model); flat search lives at the bandwidth-bound bottom of the roofline.

Concretely: a 1M × 1536 float32 matrix is 6 GB. DDR5 desktop RAM streams about 50 GB/s; a single channel gives ~10–25 GB/s. So a single sequential scan of the matrix takes ~120–600 ms just to read the bytes, before counting any arithmetic. Why this matters for engineering decisions: doubling your CPU's clock speed will not double flat-search throughput. The CPU spends most of its cycles waiting for the next cache line. The wins come from (a) shrinking the matrix — quantisation drops 4 bytes/coordinate to 1 byte and quadruples throughput, (b) keeping it in L3 cache for hot queries, or (c) moving it to a GPU with 1 TB/s of memory bandwidth.

The cost of flat search at scaleA vertical bar chart showing the memory and latency cost of flat search for one million vectors of 1536 dimensions in float32. The first bar shows storage: one million times 1536 times four bytes equals six gigabytes. A second bar shows the bandwidth-bound query latency: at ten gigabytes per second of memory bandwidth, a six-gigabyte sequential scan takes about six hundred milliseconds; at fifty gigabytes per second it takes one hundred and twenty milliseconds; on a GPU at one terabyte per second it takes about six milliseconds. A third panel notes that arithmetic intensity is about half a flop per byte, placing flat search at the memory-bound bottom of the roofline.The cost: 1M × 1536d float32 = 6 GB; one query = one full scanStorageN · d · 4 bytes6 GB1M × 1536 × 4no overhead10M → 60 GB1B → 6 TBLatency per querybandwidth-boundCPU 1ch10 GB/s600msCPU 4ch50 GB/s120msGPU1 TB/s~6msArithmetic intensityflops per byte read~0.5 flops/byte2d flops · d bytes⁻¹·4⁻¹roofline floor: bandwidth-boundFlat search reads every byte of the index on every query. Bandwidth, not flops, sets the latency.→ Doubling clock speed barely helps. Shrinking the matrix or moving to GPU does.
For a million 1536-dim float32 vectors, the index alone is 6 GB. A single query reads all 6 GB through the CPU memory subsystem; even on a desktop with quad-channel DDR5, that is ~120 ms before any arithmetic. A modern GPU with 1 TB/s of HBM bandwidth runs the same scan in ~6 ms. This is why FAISS's GPU-Flat backend is the production go-to for exact search up to tens of millions of vectors — you have not changed the algorithm, only moved it to the right silicon.

Optimisations within flat — the algorithm stays exact

You can make flat search 10–100× faster without giving up exactness. Three techniques, in order of how much you usually get from each:

SIMD vectorisation. Modern CPUs can do 8 single-precision multiply-adds per instruction (AVX2: 8 floats; AVX-512: 16 floats). NumPy on top of OpenBLAS or Intel MKL already uses this — when you write X @ q, you get SIMD for free. Why this matters: a hand-written loop in pure Python over range(N) runs at ~10 MFLOPS; the same computation through np.dot runs at 5–20 GFLOPS, a 1000× speedup with no code change. Always go through BLAS. If you write the loop in C++, use _mm256_fmadd_ps (AVX2) or _mm512_fmadd_ps (AVX-512). FAISS's IndexFlatIP drops to BLAS for the matmul and runs about as fast as your CPU permits.

GPU offload. Move X to GPU memory once (it costs the upload), then every query is a single cuBLAS gemv (matrix-vector multiply). H100s have 3 TB/s of HBM3 bandwidth, so a 6 GB matrix is scanned in 2 ms; even a consumer RTX 4090 at 1 TB/s does it in 6 ms. FAISS's IndexFlatIPGPU and the more recent IndexFlatGpu routinely sustain 50,000–200,000 QPS on tens of millions of vectors per GPU, exact recall, no approximation.

Quantisation (preview of chapter 155). Storing each coordinate as int8 rather than float32 cuts the matrix to a quarter of its size — and since you are bandwidth-bound, that is roughly a 4× speedup. Scalar quantisation (clip to [−1, 1], multiply by 127, round) preserves recall to within a percent or two for typical text embeddings. This blurs the line between "flat" and "compressed flat", but the algorithm is still exact-on-the-quantised-vectors.

What you cannot do within flat: skip vectors. Every approximate index works by avoiding looking at most of X. Once you start skipping, you have left flat search for one of the algorithms in chapters 153–157.

When flat IS the right answer

The temptation, especially after reading this chapter, is to reach for HNSW or IVF the moment you hear the word "vector search". Resist it. Building, tuning, monitoring, and re-indexing an ANN structure is a real engineering tax — index parameters interact with recall in non-obvious ways, you need to re-tune when your data distribution shifts, and you lose the ability to give an exact-recall guarantee. There are four scenarios where flat is genuinely the right tool in 2026:

When flat search is the right answerA four-quadrant diagram listing the situations in which flat search is preferable to ANN. The top-left quadrant is small index: fewer than ten thousand vectors, where the build cost of an ANN index dominates. The top-right quadrant is low query rate: even if a query takes one second, it is fine for once-a-day analytics or batch jobs. The bottom-left quadrant is exact recall required: medical, legal, or security searches where missing the right answer is not acceptable. The bottom-right quadrant is filter-dominated queries: when a metadata filter reduces the candidate pool to thousands before vector comparison, flat over those thousands is faster than ANN over millions.When flat is the right answer (not all use cases need ANN)1. Small indexN < 10,000 vectors• Build cost of HNSW / IVF dominates• Flat scan is microseconds• Example: in-app FAQ search2. Low query rateQPS < 1; latency budget > 1s• Once-a-day analytics, batch dedup• Internal tools, ad-hoc queries• Example: nightly duplicate-detection3. Exact recall required100% recall, no approximation• Medical: drug interaction search• Legal: clause/precedent retrieval• Security: malware fingerprinting4. Filters dominatemetadata filter cuts N → thousands• "in category X, in stock, price < Y"• Post-filter pool: ~5K vectors• Flat scan of 5K = sub-millisecondIf you are in any of these quadrants, do not build an ANN index. Use flat.
The fact that ANN exists does not mean you have to use it. Each of these four situations is common in production, and each is one where flat is faster, simpler, more accurate, or all three. Particularly underrated is the fourth quadrant: a well-designed metadata filter often cuts a billion-vector catalogue down to a few thousand candidates, at which point a flat scan over those thousands beats any ANN structure both in speed and in correctness.

Small index (N < 10K). Building an HNSW index over 5,000 vectors takes longer than running 5,000 flat queries against them. The auxiliary structure of HNSW (graph edges, layer assignments, entry points) often costs more memory than the raw vectors themselves at small N. Flat is the right answer for in-app FAQ search, internal documentation search, a startup's first thousand customers — anything where you can fit the entire index in 100 MB.

Low query rate. If your service runs one query per minute and you can budget a full second per query, you do not need an index. Nightly batch jobs — finding duplicate user accounts in a 10M-row table by embedding usernames and clustering, deduplicating product listings on BharatBazaar, ad-hoc analyst queries on a corpus of court judgments — all live here. Why this matters: ANN indexes have non-trivial build times (HNSW on 10M vectors takes 30 minutes) and require periodic rebuilds when data shifts. For a query you run twelve times a day, the build amortisation never pays off.

Exact recall required. Medical search ("find drugs whose mechanism vector is closest to this molecule"), legal precedent retrieval, security/forensics ("find malware samples whose static-analysis vector matches this sample") cannot afford to miss the right answer because the index decided to skip its neighbourhood. Approximate indexes typically run at 95–99% recall@10 — the right answer is missed 1–5% of the time. For most product search that is fine; for a misdiagnosis it is not. Flat guarantees 100% recall by definition.

Filter-dominated queries. Real product search is rarely "find products like this query" — it is "find products like this query that are in category X, in stock, priced under Y, ship to pincode Z". If the metadata filter applies first and reduces the candidate set to a few thousand SKUs, a flat scan of those thousands is sub-millisecond and beats any pre-built ANN index that has to navigate around the filter. This is the pre-filter strategy, covered properly in chapter 156. Pinecone, Weaviate, and pgvector all special-case "small filtered set → use flat".

Why we eventually need ANN

The flip side of all the above. The cases where flat is wrong:

  • Hundreds of millions of vectors at interactive latency. A 100M × 1536d index is 600 GB; even on a multi-GPU box, a flat scan is 1+ second. At 1B vectors and 1536d, the index is 6 TB, no commodity machine fits it in RAM, and a flat scan is minutes.
  • High query rate (1000+ QPS). Even at 10M × 768d (~30 GB), a CPU flat query is ~300 ms, so 1000 QPS needs 300 cores doing nothing else. ANN brings the per-query cost to 1–5 ms, so a single core handles 200+ QPS and a small box handles thousands.
  • Multi-tenant search-as-a-service. Hosted vector DBs (Pinecone serverless, Weaviate Cloud) cannot afford to keep every tenant's full matrix in RAM; they need an index that is small (PQ), cache-friendly (HNSW), or shardable (IVF) to multiplex thousands of tenants on the same hardware.

The next four chapters build, in order, the four indexes that solve these regimes: IVF (skip clusters), HNSW (walk a graph), PQ and DiskANN (compress vectors and stay on SSD), and filtered search (combine metadata + vector). Each of them is benchmarked against flat as the ground truth for recall.

Feel the wall: a NumPy benchmark

The most important thing this chapter can give you is the lived experience of watching latency grow linearly. Run the example below, then change N to 1M and run it again.

Flat search in 20 lines of NumPy

You are a backend engineer at an Indian e-commerce startup with 100,000 product descriptions, each embedded with a 768-dim Sentence-BERT model. You want to know how long flat search actually takes on your laptop before you spend a week setting up FAISS.

import numpy as np
import time

# --- Simulate the index ---------------------------------------------------
np.random.seed(42)
N, d, K = 100_000, 768, 10
X = np.random.randn(N, d).astype(np.float32)
# Pre-normalise so inner product == cosine
X /= np.linalg.norm(X, axis=1, keepdims=True)

print(f"Index: {N:,} vectors × {d}-dim float32")
print(f"Memory: {X.nbytes / 1e6:.1f} MB")

# --- Single query (cold) --------------------------------------------------
q = np.random.randn(d).astype(np.float32)
q /= np.linalg.norm(q)

t0 = time.perf_counter()
scores = X @ q                              # (N,) inner products
top_k_idx = np.argpartition(-scores, K)[:K] # O(N) selection
top_k_idx = top_k_idx[np.argsort(-scores[top_k_idx])]  # final sort
elapsed = (time.perf_counter() - t0) * 1000

print(f"Single query: {elapsed:.1f} ms, top-{K} indices = {top_k_idx[:3]}...")

# --- Batched timing for stable measurement --------------------------------
Q = 100
queries = np.random.randn(Q, d).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

t0 = time.perf_counter()
all_scores = queries @ X.T                  # (Q, N)
all_top_k = np.argpartition(-all_scores, K, axis=1)[:, :K]
elapsed = (time.perf_counter() - t0) * 1000

print(f"Batched {Q} queries: {elapsed:.1f} ms total → {elapsed/Q:.2f} ms each")
print(f"Throughput: {Q / (elapsed/1000):.0f} QPS on a single CPU")

Typical output on an M1 MacBook (laptop, single core through Accelerate BLAS):

Index: 100,000 vectors × 768-dim float32
Memory: 307.2 MB
Single query: 28.4 ms, top-10 indices = [42771 91038 14523]...
Batched 100 queries: 1840 ms total → 18.4 ms each
Throughput: 54 QPS on a single CPU

At 100K × 768 you are getting ~30 ms per cold query, ~18 ms when the matrix is hot in cache. That is fine for a low-traffic admin tool; it is not fine for a public search endpoint at 1000 QPS, where you would need 18 cores doing nothing but vector math just to keep up.

Now scale up. Change N to 1,000,000 and re-run. The single-query time grows to ~280 ms, the memory to 3 GB, the QPS to about 5. At 5,000,000 products you are at 1.5 GB/s of bandwidth and ~1.5 seconds per query — your search box now feels broken. The wall is real, and the slope is exactly linear in N · d.

This is the moment in production where you stop tuning NumPy and start reading chapter 153. Why this benchmark is honest: NumPy's @ operator dispatches to OpenBLAS / Accelerate / MKL — the same BLAS that FAISS-Flat uses internally. The single-core flat-search performance you measure here is within ~20% of what a tuned C++ implementation would give you. The 100× speedups come from algorithmic change (skip vectors), not from squeezing more out of flat.

Practical recipe: flat in production

If you decide flat is the right answer, here is the short checklist:

  1. Use float32, not float64. Half the memory, double the bandwidth, identical recall for embeddings. (Sentence-BERT, OpenAI, Cohere all return float32 by default.)
  2. Pre-normalise vectors at index time. Then inner product is cosine, and you save the per-query normalisation. One X /= np.linalg.norm(X, axis=1, keepdims=True) at load time.
  3. Store the matrix contiguously. np.ascontiguousarray(X) before serving; row-major layout means a query reads memory sequentially, which is what the prefetcher is good at.
  4. Use argpartition, not argsort. Top-K extraction is O(N) not O(N log N). At N = 10⁶ and K = 10, this is a 6× difference in the sort step.
  5. Batch your queries. queries @ X.T is faster per-query than a Python loop calling X @ q, because BLAS reuses the read of X across multiple queries. If your serving layer can collect 8–32 queries before issuing the matmul, you get 4–10× more throughput.
  6. For >1M vectors, use FAISS. faiss.IndexFlatIP(d) wraps BLAS with a slightly more cache-friendly access pattern and gives you GPU-Flat for free if you swap to faiss.GpuIndexFlatIP. There is no point reimplementing this — FAISS is what every production system reaches for.

Common confusions

  • "Flat search is always slow — that's why we have ANN." Flat is slow at scale. At N = 5,000 it is faster than HNSW because HNSW's graph traversal has constant per-hop overhead that dominates when there are only thousands of vectors. The crossover where ANN starts to win is somewhere between 50K and 500K vectors depending on dimension, hardware, and recall target. For the BharatBazaar support-FAQ embedding index of 4,000 questions, np.dot is the right answer. Reaching for HNSW there is over-engineering.

  • "Cosine similarity and inner product require different code paths." They do not, if you pre-normalise. After X /= np.linalg.norm(X, axis=1, keepdims=True), every row of X is a unit vector, and q · xᵢ = cos(q, xᵢ) exactly. You write X @ q either way. The L2 distance also reduces to inner product on normalised vectors via ||q − x||² = 2 − 2(q · x), which preserves rank ordering. So in practice, "cosine flat" and "inner-product flat" are the same matmul; only the input preprocessing changes.

  • "GPU-Flat is approximate because it uses float16." No. FAISS GPU-Flat keeps float32 by default and gives bit-exact results matching CPU-Flat. There is an optional useFloat16=true flag for memory savings on consumer GPUs, which can introduce sub-percent ranking changes for near-tied scores — but the default is exact. Confusion comes from the fact that training deep models often uses float16/bfloat16; serving a flat index does not.

  • "argpartition returns the top K in sorted order." It does not. np.argpartition(s, K)[:K] returns the indices of the K largest values in arbitrary order within that prefix. If your downstream code (a re-ranker, a UI displaying ranks) needs them sorted, you must do top_k_idx[np.argsort(-scores[top_k_idx])] afterward. The benefit of argpartition is that the unsorted top-K is O(N) instead of O(N log N) — you only pay O(K log K) to sort the K survivors.

  • "NumPy is slow; I should write the matmul in C++ for production." A NumPy X @ q running through Accelerate (on macOS), MKL (on Intel), or OpenBLAS (everywhere else) is within 10–20% of a hand-tuned AVX-512 implementation. The 100× speedups people quote for "C++ vector search" come from either (a) skipping vectors via ANN, or (b) running on a GPU. The matmul itself is not the bottleneck NumPy can win.

  • "My index fits in RAM, so memory bandwidth doesn't matter." It absolutely does. "Fits in RAM" means the bytes are addressable; it does not mean they are in L1/L2/L3 cache. A 6 GB matrix dwarfs every CPU's L3 (typically 16–96 MB), so each query streams the full 6 GB from DRAM through the cache hierarchy on every scan. The bandwidth limit (50 GB/s on a desktop) is the real ceiling, not the RAM size. You only escape this by shrinking the matrix to fit in cache (quantisation) or by skipping reads (ANN).

Going deeper

If you just wanted to know what flat search is and when to use it, you have it: a brute-force matmul, exact by construction, bandwidth-bound past a million vectors. The rest of this section connects flat search to the real systems and benchmarks you will encounter when you actually deploy a vector database in 2026.

What FAISS-Flat actually does under the hood

faiss.IndexFlatIP(d) is not just numpy.dot in C++. It does three things NumPy does not:

  1. Block-tiled BLAS calls. Rather than X @ q for each query, FAISS-Flat batches queries into blocks of 64–256 and issues cgemm (single-precision matrix-matrix multiply) per block. The matrix-matrix kernel has higher arithmetic intensity than matrix-vector, so it pulls slightly closer to the CPU's compute roof (~30 GFLOPS vs ~10 GFLOPS for gemv). On 1M × 768d, this is ~2× speedup over a naive for q in queries: X @ q loop.

  2. A heap-based selection that fuses with the matmul. FAISS keeps a per-query priority queue while computing scores, so it never materialises the full (N,) score array on the heap — it discards each score immediately after comparing to the queue's smallest element. This saves both memory and the second pass argpartition would do. For K = 10 and N = 10⁷, the memory saving is 40 MB per query.

  3. Normalised L2 → inner-product equivalence. IndexFlatL2 does ||q − x||² directly using the BLAS identity ||q||² + ||x||² − 2(q · x), which lets it fuse the L2 computation with a single gemm. The ||x||² terms are precomputed once at index time.

Reading the FAISS-Flat C++ source is genuinely instructive — most of the 200 lines are bookkeeping; the heart is one sgemm_ call.

The GPU-Flat regime: why exact search came back

Around 2018–2020, the conventional wisdom was that any production vector index above 100K rows must be approximate. The reason was simply that GPU memory bandwidth was 200–600 GB/s and a tens-of-millions-of-vectors flat scan took ~100 ms — too slow for interactive serving. By 2024, the H100 ships with 3 TB/s of HBM3, and a 6 GB matrix is now scanned in ~2 ms; an A100 at 2 TB/s does it in 3 ms. Exact search has come back for indexes of tens of millions of vectors. Pinecone's hosted "exact" tier, Weaviate's flat index type, and pgvector's no-index queries all bet on the fact that GPU bandwidth has outrun what most catalogues need.

The arithmetic: at 3 TB/s of HBM bandwidth and a 6 GB index, you sustain ~500 single-query scans per second per GPU, or ~50,000 batched queries per second when the matmul is matrix-matrix instead of matrix-vector. For a PaisaBridge-scale merchant catalogue (~5M products) across thousands of merchants, two H100s exactly serve a public search endpoint at 1000 QPS without an ANN structure. The ops overhead of "no index to rebuild, no recall to monitor" is genuinely worth more than the silicon.

Why this matters for your stack choice in 2026: if your catalogue is under 50M vectors and you have one GPU, default to flat. Add ANN only when the GPU itself is the bottleneck or when you must serve from CPU. The "always use HNSW" advice from 2020 blog posts is out of date.

How IndexFlat interacts with metadata filters

Real product search is "vector similarity and category=earbuds and price<3000 and in_stock=true". Flat handles this naturally: you scan the matrix, but you mask out scores whose row indices fail the metadata predicate. With NumPy:

mask = (categories == "earbuds") & (prices < 3000) & in_stock
scores = X @ q
scores[~mask] = -np.inf
top_k = np.argpartition(-scores, K)[:K]

Because the filter is applied after the scan, you pay for every dot product even though most are discarded. ANN indexes either degrade (HNSW gets stuck in graph regions where no candidate passes the filter) or have to special-case filtering. Flat does not care: the cost is O(N·d) regardless of selectivity. For high-selectivity filters (cuts to <1% of the catalogue), you can flip the order: filter first, then scan only the surviving rows. That is the pre-filter path covered in chapter 156.

Flat as the recall ground truth — how every benchmark uses it

When ANN-Benchmarks reports that HNSW achieves 99.2% recall at QPS 5,000, the "recall" denominator is the top-K returned by flat search on the same query set. Every paper that proposes a new ANN index — IVF (Sivic & Zisserman 2003), HNSW (Malkov & Yashunin 2018), DiskANN (Subramanya et al. 2019), ScaNN (Guo et al. 2020) — measures itself against flat. There is no other ground truth, because no other algorithm gives an exact answer. This is why IndexFlatIP is the first index any serious vector DB ships and the one against which all others are evaluated.

In practice, when you build a production ANN system, you should keep a 1–5% sample of queries running against a flat index in shadow mode, and measure recall@10 against the live ANN. If recall drops below your SLO (typically 95%), you know your ANN parameters need re-tuning — usually because the data distribution has drifted. This shadow-flat pattern is what Pinecone, Weaviate, and the in-house systems at BharatBazaar and Myntra all use to monitor ANN health.

The CPU vs GPU vs disk decision matrix

For a given catalogue size and query rate, where should flat live? A rough table from production tunings (numbers from FAISS-Flat benchmarks on commodity 2024 hardware):

Catalogue QPS target Best home for flat
< 100K vectors any CPU, single core, NumPy or pgvector no-index
100K – 5M < 50 QPS CPU with batching, FAISS IndexFlatIP
100K – 5M 50 – 5000 QPS one GPU, GpuIndexFlatIP
5M – 100M < 1000 QPS one GPU
5M – 100M > 1000 QPS sharded GPU or move to ANN
> 100M any move to PQ + IVF or DiskANN

There is no row where flat lives on disk — flat must be RAM-resident (or VRAM-resident) because every query reads the entire index. The moment your index does not fit, you are forced to a compressed or skip-based structure. That cliff is exactly where chapter 155 — DiskANN takes over.

What's next

You have the floor. Every chapter from here on is about how to skip vectors safely:

  • Chapter 153 — IVF: cluster the vectors into nlist cells; on a query, compute the query's cell and only scan a few neighbouring cells. ~10× speedup, ~95% recall.
  • Chapter 154 — HNSW: build a navigable small-world graph; on a query, greedy-walk from a top-layer entry point down to the nearest neighbour. ~100× speedup, ~98% recall, but RAM-heavy.
  • Chapter 155 — PQ and DiskANN: compress each vector from 1536 floats to 16 bytes via product quantisation; do the search on the compressed representation. 32× memory reduction with controlled recall loss. DiskANN puts the index on SSD and serves billion-vector indexes from a single machine.
  • Chapter 156 — Filtered search: combine metadata filters (category = "earbuds") with vector search. The interaction between the filter and the index is surprisingly subtle and is where most production systems lose recall silently.
  • Chapter 157 — Real systems: Pinecone, Weaviate, Milvus, Qdrant, pgvector, Elasticsearch dense_vector. Same algorithms, different operational tradeoffs.

Each of those chapters will, in its evaluation section, compare its results to what a flat scan would have returned. Flat is not the algorithm you ship at scale, but it is the algorithm against which everything else is measured.

References

  1. FAISS — Flat indexes documentation. The canonical brute-force implementation.
  2. FAISS — Faiss on the GPU. GPU-Flat and benchmark numbers for tens of millions of vectors.
  3. Pinecone — Serverless architecture overview. When the platform falls back to exact search vs. ANN.
  4. Williams & Patterson — The Roofline Model. Why flat search is bandwidth-bound.
  5. Weaviate — Vector index concepts. When to choose flat vs. HNSW in practice.
  6. Erik Bernhardsson — ANN-Benchmarks. The canonical reference for comparing flat against every ANN algorithm on standard datasets.