Flat Search and Why It Doesn't Scale

In short

Flat search — also called brute-force k-NN or exact search — is the dumbest, most honest vector index imaginable. You store every embedding in a single dense matrix X ∈ ℝ^{N×d}. On every query vector q ∈ ℝ^d you compute s = X @ q (or ||X − q||² for L2), partial-sort the resulting N scores, and return the top-K indices. Cost per query: O(N·d) floating-point operations and exactly one sequential scan over the whole index. Recall: 100%, by definition — there is no approximation. Memory: N · d · 4 bytes for float32 (a million 1536-dim vectors = 6 GB just for the data, no overhead). Latency, single-core CPU, ~10 GFLOPS sustained: about 150 ms for a million 1536-dim query. SIMD (AVX-512) buys you 4–8×, and a single GPU buys you another 10–100× via cuBLAS — FAISS GPU-Flat routinely hits 50,000 QPS on tens of millions of vectors. Flat is the right answer when (a) N < 10,000 and the index-build overhead of HNSW or IVF is not worth it, (b) you need guaranteed exact recall (medical, legal, security), (c) your query rate is low enough that even a one-second scan is fine (offline analytics, batch dedup), or (d) a metadata filter cuts the candidate pool to a few thousand before the vector comparison runs. It is the wrong answer the moment you have ten million vectors served at a thousand queries per second — at that point you need IVF, HNSW, or product quantization, each of which trades a sliver of recall for orders of magnitude in speed and footprint. This chapter is the floor everything else is measured against.

You finished chapter 151 holding three things: a notion of an embedding, a choice of similarity metric, and a healthy fear of high-dimensional geometry. Now you have a million product descriptions sitting on disk as 1536-dimensional float32 vectors, and a user types wireless earbuds with good bass under 3000 rupees. Your search service must, in under a hundred milliseconds, return the ten products whose embeddings are closest to the query embedding. How exactly does it find those ten?

The answer this chapter gives is the boring one: look at all of them. Compute the similarity between the query and every single stored vector, sort by score, return the top ten. There is no tree to descend, no graph to walk, no probabilistic skip — just one matrix-vector multiply and one partial sort. This is flat search, and the fact that it is dumb is exactly its strength. It is the only algorithm in this build that returns the guaranteed correct top-K every time, with no approximation, no tunable knob that trades recall for speed. Every other index in chapters 153–157 is faster than flat, and every one of them is judged by how close its results come to what flat would have returned. Flat is the ground truth.

It is also infeasible past a certain scale, and the rest of the chapter is the careful accounting of what that scale is, why exactly it breaks, and the genuinely large set of cases where you should just use flat anyway and stop over-engineering.

The algorithm in one paragraph

Store all N vectors as rows of a contiguous float32 matrix X of shape (N, d). On a query q ∈ ℝ^d:

Compute the score of q against every row of X. For inner product, this is s = X @ q, a single matrix-vector product producing an N-long score array. For L2, you compute ||X − q||² row-wise. For cosine, either pre-normalise everything so it reduces to inner product, or normalise on the fly.
Partial-sort the score array to extract the indices of the top-K largest (for similarity) or smallest (for distance) values. NumPy's argpartition(s, K) does this in O(N) rather than O(N log N).
Return those K row indices, optionally with their scores.

That is the entire algorithm. There is no "build" phase beyond loading the matrix into memory; there is no auxiliary structure; there is no parameter to tune.

Flat search reduces to one matrix-vector multiplication. The query q is dot-producted with every row of the stored matrix X, producing N scalar scores. A partial sort (NumPy's argpartition) extracts the top K indices in linear time. There is no auxiliary index, no tree, no graph — just brute-force linear algebra. It is exact by construction.

The math: where the operations come from

Let N be the number of stored vectors and d be their dimension. A single inner-product score q · xᵢ requires d multiplications and d − 1 additions — call it 2d flops. There are N such scores per query. Total: roughly 2N·d floating-point operations per query, plus an O(N) selection of the top-K and an O(K log K) final sort of those K. The selection and sort are negligible: at K = 10 and N = 10⁶, the partial sort is about 10⁶ comparisons, while the dot products are 3 × 10⁹ flops.

Plug in the numbers people actually deploy:

`N`	`d`	flops/query	single-core time at 10 GFLOPS	memory
10,000	384	7.7 M	0.8 ms	15 MB
100,000	768	153 M	15 ms	300 MB
1,000,000	1536	3.1 G	310 ms	6 GB
10,000,000	1536	30.7 G	3 s	60 GB
100,000,000	1536	307 G	30 s	600 GB
1,000,000,000	1536	3.1 T	5 minutes	6 TB

Why these numbers feel like a wall: every row above grows latency and footprint by exactly 10× because both N and the per-row cost scale linearly. There is no algorithmic constant you can tune away. A 10× bigger catalogue is a 10× slower flat query, full stop. By a billion vectors you are doing five-minute scans and storing six terabytes — neither of which is a serving system. This is the cliff that ANN was invented to climb down.

Why it's actually memory-bandwidth bound, not compute-bound

The single-core "10 GFLOPS" number above is generous — and even when you reach it, you are usually not limited by arithmetic but by how fast bytes move from RAM into the CPU. Each dot product q · xᵢ reads d floats of xᵢ from memory and does 2d flops on them — an arithmetic intensity of about 0.5 flops per byte. Modern CPUs have arithmetic intensity ceilings of ~30 flops/byte before you saturate memory bandwidth (the roofline model); flat search lives at the bandwidth-bound bottom of the roofline.

Concretely: a 1M × 1536 float32 matrix is 6 GB. DDR5 desktop RAM streams about 50 GB/s; a single channel gives ~10–25 GB/s. So a single sequential scan of the matrix takes ~120–600 ms just to read the bytes, before counting any arithmetic. Why this matters for engineering decisions: doubling your CPU's clock speed will not double flat-search throughput. The CPU spends most of its cycles waiting for the next cache line. The wins come from (a) shrinking the matrix — quantisation drops 4 bytes/coordinate to 1 byte and quadruples throughput, (b) keeping it in L3 cache for hot queries, or (c) moving it to a GPU with 1 TB/s of memory bandwidth.

For a million 1536-dim float32 vectors, the index alone is 6 GB. A single query reads all 6 GB through the CPU memory subsystem; even on a desktop with quad-channel DDR5, that is ~120 ms before any arithmetic. A modern GPU with 1 TB/s of HBM bandwidth runs the same scan in ~6 ms. This is why FAISS's GPU-Flat backend is the production go-to for exact search up to tens of millions of vectors — you have not changed the algorithm, only moved it to the right silicon.

Optimisations within flat — the algorithm stays exact

You can make flat search 10–100× faster without giving up exactness. Three techniques, in order of how much you usually get from each:

SIMD vectorisation. Modern CPUs can do 8 single-precision multiply-adds per instruction (AVX2: 8 floats; AVX-512: 16 floats). NumPy on top of OpenBLAS or Intel MKL already uses this — when you write X @ q, you get SIMD for free. Why this matters: a hand-written loop in pure Python over range(N) runs at ~10 MFLOPS; the same computation through np.dot runs at 5–20 GFLOPS, a 1000× speedup with no code change. Always go through BLAS. If you write the loop in C++, use _mm256_fmadd_ps (AVX2) or _mm512_fmadd_ps (AVX-512). FAISS's IndexFlatIP drops to BLAS for the matmul and runs about as fast as your CPU permits.

GPU offload. Move X to GPU memory once (it costs the upload), then every query is a single cuBLAS gemv (matrix-vector multiply). H100s have 3 TB/s of HBM3 bandwidth, so a 6 GB matrix is scanned in 2 ms; even a consumer RTX 4090 at 1 TB/s does it in 6 ms. FAISS's IndexFlatIPGPU and the more recent IndexFlatGpu routinely sustain 50,000–200,000 QPS on tens of millions of vectors per GPU, exact recall, no approximation.

Quantisation (preview of chapter 155). Storing each coordinate as int8 rather than float32 cuts the matrix to a quarter of its size — and since you are bandwidth-bound, that is roughly a 4× speedup. Scalar quantisation (clip to [−1, 1], multiply by 127, round) preserves recall to within a percent or two for typical text embeddings. This blurs the line between "flat" and "compressed flat", but the algorithm is still exact-on-the-quantised-vectors.

What you cannot do within flat: skip vectors. Every approximate index works by avoiding looking at most of X. Once you start skipping, you have left flat search for one of the algorithms in chapters 153–157.

When flat IS the right answer

The temptation, especially after reading this chapter, is to reach for HNSW or IVF the moment you hear the word "vector search". Resist it. Building, tuning, monitoring, and re-indexing an ANN structure is a real engineering tax — index parameters interact with recall in non-obvious ways, you need to re-tune when your data distribution shifts, and you lose the ability to give an exact-recall guarantee. There are four scenarios where flat is genuinely the right tool in 2026:

The fact that ANN exists does not mean you have to use it. Each of these four situations is common in production, and each is one where flat is faster, simpler, more accurate, or all three. Particularly underrated is the fourth quadrant: a well-designed metadata filter often cuts a billion-vector catalogue down to a few thousand candidates, at which point a flat scan over those thousands beats any ANN structure both in speed and in correctness.

Small index (N < 10K). Building an HNSW index over 5,000 vectors takes longer than running 5,000 flat queries against them. The auxiliary structure of HNSW (graph edges, layer assignments, entry points) often costs more memory than the raw vectors themselves at small N. Flat is the right answer for in-app FAQ search, internal documentation search, a startup's first thousand customers — anything where you can fit the entire index in 100 MB.

Low query rate. If your service runs one query per minute and you can budget a full second per query, you do not need an index. Nightly batch jobs — finding duplicate user accounts in a 10M-row table by embedding usernames and clustering, deduplicating product listings on Flipkart, ad-hoc analyst queries on a corpus of court judgments — all live here. Why this matters: ANN indexes have non-trivial build times (HNSW on 10M vectors takes 30 minutes) and require periodic rebuilds when data shifts. For a query you run twelve times a day, the build amortisation never pays off.

Exact recall required. Medical search ("find drugs whose mechanism vector is closest to this molecule"), legal precedent retrieval, security/forensics ("find malware samples whose static-analysis vector matches this sample") cannot afford to miss the right answer because the index decided to skip its neighbourhood. Approximate indexes typically run at 95–99% recall@10 — the right answer is missed 1–5% of the time. For most product search that is fine; for a misdiagnosis it is not. Flat guarantees 100% recall by definition.

Filter-dominated queries. Real product search is rarely "find products like this query" — it is "find products like this query that are in category X, in stock, priced under Y, ship to pincode Z". If the metadata filter applies first and reduces the candidate set to a few thousand SKUs, a flat scan of those thousands is sub-millisecond and beats any pre-built ANN index that has to navigate around the filter. This is the pre-filter strategy, covered properly in chapter 156. Pinecone, Weaviate, and pgvector all special-case "small filtered set → use flat".

Why we eventually need ANN

The flip side of all the above. The cases where flat is wrong:

Hundreds of millions of vectors at interactive latency. A 100M × 1536d index is 600 GB; even on a multi-GPU box, a flat scan is 1+ second. At 1B vectors and 1536d, the index is 6 TB, no commodity machine fits it in RAM, and a flat scan is minutes.
High query rate (1000+ QPS). Even at 10M × 768d (~30 GB), a CPU flat query is ~300 ms, so 1000 QPS needs 300 cores doing nothing else. ANN brings the per-query cost to 1–5 ms, so a single core handles 200+ QPS and a small box handles thousands.
Multi-tenant search-as-a-service. Hosted vector DBs (Pinecone serverless, Weaviate Cloud) cannot afford to keep every tenant's full matrix in RAM; they need an index that is small (PQ), cache-friendly (HNSW), or shardable (IVF) to multiplex thousands of tenants on the same hardware.

The next four chapters build, in order, the four indexes that solve these regimes: IVF (skip clusters), HNSW (walk a graph), PQ and DiskANN (compress vectors and stay on SSD), and filtered search (combine metadata + vector). Each of them is benchmarked against flat as the ground truth for recall.

Feel the wall: a NumPy benchmark

The most important thing this chapter can give you is the lived experience of watching latency grow linearly. Run the example below, then change N to 1M and run it again.

Flat search in 20 lines of NumPy

You are a backend engineer at an Indian e-commerce startup with 100,000 product descriptions, each embedded with a 768-dim Sentence-BERT model. You want to know how long flat search actually takes on your laptop before you spend a week setting up FAISS.

import numpy as np
import time

# --- Simulate the index ---------------------------------------------------
np.random.seed(42)
N, d, K = 100_000, 768, 10
X = np.random.randn(N, d).astype(np.float32)
# Pre-normalise so inner product == cosine
X /= np.linalg.norm(X, axis=1, keepdims=True)

print(f"Index: {N:,} vectors × {d}-dim float32")
print(f"Memory: {X.nbytes / 1e6:.1f} MB")

# --- Single query (cold) --------------------------------------------------
q = np.random.randn(d).astype(np.float32)
q /= np.linalg.norm(q)

t0 = time.perf_counter()
scores = X @ q                              # (N,) inner products
top_k_idx = np.argpartition(-scores, K)[:K] # O(N) selection
top_k_idx = top_k_idx[np.argsort(-scores[top_k_idx])]  # final sort
elapsed = (time.perf_counter() - t0) * 1000

print(f"Single query: {elapsed:.1f} ms, top-{K} indices = {top_k_idx[:3]}...")

# --- Batched timing for stable measurement --------------------------------
Q = 100
queries = np.random.randn(Q, d).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

t0 = time.perf_counter()
all_scores = queries @ X.T                  # (Q, N)
all_top_k = np.argpartition(-all_scores, K, axis=1)[:, :K]
elapsed = (time.perf_counter() - t0) * 1000

print(f"Batched {Q} queries: {elapsed:.1f} ms total → {elapsed/Q:.2f} ms each")
print(f"Throughput: {Q / (elapsed/1000):.0f} QPS on a single CPU")

Typical output on an M1 MacBook (laptop, single core through Accelerate BLAS):

Index: 100,000 vectors × 768-dim float32
Memory: 307.2 MB
Single query: 28.4 ms, top-10 indices = [42771 91038 14523]...
Batched 100 queries: 1840 ms total → 18.4 ms each
Throughput: 54 QPS on a single CPU

At 100K × 768 you are getting ~30 ms per cold query, ~18 ms when the matrix is hot in cache. That is fine for a low-traffic admin tool; it is not fine for a public search endpoint at 1000 QPS, where you would need 18 cores doing nothing but vector math just to keep up.

Now scale up. Change N to 1,000,000 and re-run. The single-query time grows to ~280 ms, the memory to 3 GB, the QPS to about 5. At 5,000,000 products you are at 1.5 GB/s of bandwidth and ~1.5 seconds per query — your search box now feels broken. The wall is real, and the slope is exactly linear in N · d.

This is the moment in production where you stop tuning NumPy and start reading chapter 153. Why this benchmark is honest: NumPy's @ operator dispatches to OpenBLAS / Accelerate / MKL — the same BLAS that FAISS-Flat uses internally. The single-core flat-search performance you measure here is within ~20% of what a tuned C++ implementation would give you. The 100× speedups come from algorithmic change (skip vectors), not from squeezing more out of flat.

Practical recipe: flat in production

If you decide flat is the right answer, here is the short checklist:

Use float32, not float64. Half the memory, double the bandwidth, identical recall for embeddings. (Sentence-BERT, OpenAI, Cohere all return float32 by default.)
Pre-normalise vectors at index time. Then inner product is cosine, and you save the per-query normalisation. One X /= np.linalg.norm(X, axis=1, keepdims=True) at load time.
Store the matrix contiguously. np.ascontiguousarray(X) before serving; row-major layout means a query reads memory sequentially, which is what the prefetcher is good at.
Use argpartition, not argsort. Top-K extraction is O(N) not O(N log N). At N = 10⁶ and K = 10, this is a 6× difference in the sort step.
Batch your queries. queries @ X.T is faster per-query than a Python loop calling X @ q, because BLAS reuses the read of X across multiple queries. If your serving layer can collect 8–32 queries before issuing the matmul, you get 4–10× more throughput.
For >1M vectors, use FAISS. faiss.IndexFlatIP(d) wraps BLAS with a slightly more cache-friendly access pattern and gives you GPU-Flat for free if you swap to faiss.GpuIndexFlatIP. There is no point reimplementing this — FAISS is what every production system reaches for.

What's next

You have the floor. Every chapter from here on is about how to skip vectors safely:

Chapter 153 — IVF: cluster the vectors into nlist cells; on a query, compute the query's cell and only scan a few neighbouring cells. ~10× speedup, ~95% recall.
Chapter 154 — HNSW: build a navigable small-world graph; on a query, greedy-walk from a top-layer entry point down to the nearest neighbour. ~100× speedup, ~98% recall, but RAM-heavy.
Chapter 155 — PQ and DiskANN: compress each vector from 1536 floats to 16 bytes via product quantisation; do the search on the compressed representation. 32× memory reduction with controlled recall loss. DiskANN puts the index on SSD and serves billion-vector indexes from a single machine.
Chapter 156 — Filtered search: combine metadata filters (category = "earbuds") with vector search. The interaction between the filter and the index is surprisingly subtle and is where most production systems lose recall silently.
Chapter 157 — Real systems: Pinecone, Weaviate, Milvus, Qdrant, pgvector, Elasticsearch dense_vector. Same algorithms, different operational tradeoffs.

Each of those chapters will, in its evaluation section, compare its results to what a flat scan would have returned. Flat is not the algorithm you ship at scale, but it is the algorithm against which everything else is measured.

References

FAISS — Flat indexes documentation. The canonical brute-force implementation.
FAISS — Faiss on the GPU. GPU-Flat and benchmark numbers for tens of millions of vectors.
Pinecone — Serverless architecture overview. When the platform falls back to exact search vs. ANN.
Williams & Patterson — The Roofline Model. Why flat search is bandwidth-bound.
Weaviate — Vector index concepts. When to choose flat vs. HNSW in practice.
Erik Bernhardsson — ANN-Benchmarks. The canonical reference for comparing flat against every ANN algorithm on standard datasets.