Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
HNSW replaces IVF's partitioning with a stack of graphs — sparse "highway" layers on top, the full graph at the bottom — so a query greedy-hops downward in O(log N) instead of O(√N). Three knobs drive it: M (edges per node), efConstruction (build effort), and efSearch (query beam). It is the default index in every serious vector database for one reason: 95–99 % recall at 1–10 ms per query, paid for in build time and memory.
You can do better than IVF's O(√N). Stop partitioning the space and start navigating it: build a graph in which each vector is wired to its nearest neighbours, then walk it greedily. If each hop halves the remaining distance, you arrive in log₂ N steps — thirty hops on a billion vectors, each one a handful of dot products.
The thesis: a navigable graph beats a partition
Imagine you are standing in Mumbai, told to find the bookshop in India that most resembles Blossom in Bengaluru. With IVF, you would pre-partition the country into states, find the nearest state to your query (probably Karnataka), and scan every shop in it. With HNSW, the country is wired up as a road network — every shop is connected to a few nearby shops, plus a long-distance highway connects a sparse subset across regions. You start on the national highway, hop along until you are roughly in the right zone, then take state highways, then district roads, then walk to the actual shop. At every level of the hierarchy, a greedy step toward the target makes meaningful progress because the local connectivity guarantees you can always find a closer neighbour.
That is the small-world property — and the hierarchy on top is what makes it logarithmic. A flat small-world graph with no hierarchy works, but its query cost is O(√N) (Kleinberg's classical result). Stacking layers, where the top layer holds only a tiny fraction of nodes wired up sparsely, gives the long-distance highways for free. You go far in few hops at the top, then refine. Yury Malkov and Dmitry Yashunin's 2016 paper formalised this and shipped it as hnswlib; within three years it was the default index in every commercial vector DB.
Three artefacts come out of an HNSW build: the graph itself (N nodes, each with up to M outbound edges per layer it appears in), a layer assignment (each node has a maximum layer, drawn from a geometric distribution), and an entry point (the single highest-layer node, used as the start of every query). That is it. No centroids, no inverted lists, no clusters — just a graph and a place to start.
Why a hierarchy makes the query logarithmic
Take the simplest version of the navigation: at each layer, do greedy nearest-neighbour search. Start at some node, look at its outbound edges, move to whichever neighbour is closest to the query, repeat until no neighbour is closer than the current node. That terminal node is your local minimum at this layer. Then you drop one layer down and use that local minimum as your new starting point.
The genius is in the layer counts. Layer 0 has N nodes. Layer 1 has N/M (where M is the typical degree). Layer 2 has N/M². The top layer has roughly one node. So the number of layers is log_M(N) — for N = 10⁹ and M = 16, that is about 7 layers. Why this matters: at each layer, greedy search visits some bounded number of nodes (think: a beam of size efSearch once you are at the bottom, and a beam of 1 above that). Each layer costs O(M · efSearch) distance computations. Multiply by the number of layers, log_M N, and you get O(M · efSearch · log N) per query — logarithmic in N.
The other way to see it: at the top layer, the average distance between any two nodes is large, so each greedy step covers a lot of ground. As you drop layers, the graph densifies, the steps shrink, and you get progressively closer to your true target. It is the same intuition as binary search — coarse cuts at the top, fine cuts at the bottom — but applied to navigation in ℝ^d.
The query path: greedy descent then beam expansion
A query vector q arrives. The algorithm is:
- Start at the entry point (the single node at the top layer).
- At each layer above 0, do greedy search with beam size 1: visit the current node's neighbours, pick the closest to
q, repeat until no neighbour is closer. This terminal node becomes the entry point for the next layer down. - At layer 0, switch to beam search with beam size
efSearch. Maintain a candidate set (theefSearchnearest nodes seen so far) and a frontier (nodes whose neighbours you have not yet expanded). At each step, pop the closest unexpanded node from the frontier, expand its neighbours, insert any improvements into the candidate set. - Return the top-
kfrom the final candidate set.
efSearch is the only knob you turn at query time. With efSearch = k, the search is purely greedy — fast but recall-poor. With efSearch = 200, the search is a wide beam — slower but much higher recall. The relation is monotonic: more beam, more recall, more cost. Production sits around 50–200.
Construction: insert one vector at a time, with a coin flip for the level
There is no "training" phase the way IVF has k-means. You build HNSW one vector at a time. For each new vector v:
- Roll the level. Pick a maximum level
Lforvfrom the geometric distribution:L = floor(-ln(uniform(0,1)) · 1/ln(M)). Most vectors getL = 0. About1/MgetL ≥ 1. About1/M²getL ≥ 2. And so on. The first vector you ever insert becomes the entry point at the highest layer it rolled. - Greedy descend from the current entry point through layers above
L, just like a query, ending up with an entry candidate at layerL. - At each layer from
Ldown to0, run beam search with sizeefConstructionto find theefConstructionnearest existing nodes at that layer. Pick the topMof those (using a heuristic that balances proximity with neighbourhood diversity — see below) and create bidirectional edges to them. - Prune any neighbour that now has more than
M_maxoutbound edges (typicallyM_max = Mfor layers ≥ 1,M_max = 2Mfor layer 0). The pruning uses the same diversity heuristic. - Update the entry point if
Lis greater than the current top-layer node's level.
The diversity heuristic — Malkov & Yashunin's "select neighbors heuristic" — is the subtle part. When you have efConstruction candidate neighbours and need to keep M, you do not just take the M nearest. You take the nearest, then iterate through the rest in distance order: include each candidate only if it is closer to the query than to any already-selected neighbour. Why diversity matters: if you only kept the M closest, all your edges would point in the same direction (the densest cluster), and the graph would lose its long-range connectivity. The diversity heuristic ensures edges fan out — a node connects to nearby clusters in different directions, preserving the small-world property and preventing the greedy search from getting stuck in local minima.
The three knobs and what they cost
Three parameters control the entire algorithm. Tune them in this order.
M (max edges per node per layer above 0) — the structural knob. Typical values: 8, 16, 32. Larger M means each node has more out-edges, which gives the graph more redundancy and better recall, but also costs more memory (each edge is a 4-byte int ID; layer 0 stores up to 2M edges per node, higher layers up to M). For 768-d float32 vectors, the per-vector storage is 768 · 4 = 3072 bytes; the per-vector graph storage at M = 16 is roughly (2·16 + 16 + 16/16 + …) · 4 ≈ 200 bytes — about 7 % overhead. At M = 32, doubled. A useful starting point: M = 16 for 768-d, M = 32 for very high-dimensional or low-recall-tolerance applications.
efConstruction (build-time beam size) — the build-quality knob. Typical: 100–200. Larger means a better graph (each new node finds truly its M nearest neighbours, not just M decent ones), but build time scales linearly with efConstruction. Why this matters: a poorly built HNSW graph never recovers — there is no rebalancing pass. If efConstruction is too small, some node never gets connected to its true neighbours, and queries that should land on it instead drift past. Spend the build time once.
efSearch (query-time beam size) — the recall knob. Typical: 50–200. Larger means more nodes visited at the bottom layer, more candidates surfaced, higher recall, slower query. This is the dial you turn per request — production systems often expose it directly to the caller (Pinecone calls it top_k_extra, Weaviate calls it ef).
A useful mental model: M is the index's static quality, set once. efConstruction is the index's learned quality, paid once. efSearch is the query's adaptive quality, paid per call. Tune M based on memory budget, efConstruction based on build-time budget, efSearch per query based on the latency-vs-recall trade-off you can tolerate.
Building HNSW with hnswlib
hnswlib is the canonical implementation, written by the original authors. It is faster than FAISS's HNSW for most workloads (FAISS still wins on very large indexes thanks to GPU support, but on a single CPU box, hnswlib is the reference). Here is a complete script that builds an HNSW index over 100 k vectors, queries it at varying efSearch, and measures recall against flat search.
import numpy as np
import hnswlib
import time
# 1. Generate synthetic data: 100k vectors in 768d (Sentence-BERT scale)
d = 768
N = 100_000
n_queries = 100
np.random.seed(42)
xb = np.random.randn(N, d).astype('float32')
xq = np.random.randn(n_queries, d).astype('float32')
# L2-normalise so inner-product = cosine similarity
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
xq /= np.linalg.norm(xq, axis=1, keepdims=True)
# 2. Ground truth via brute-force (so we can compute recall)
t0 = time.time()
gt = np.argsort(-(xq @ xb.T), axis=1)[:, :10] # top-10 by cosine
flat_time = (time.time() - t0) * 1000 / n_queries
print(f"Flat: {flat_time:.2f} ms/query (NumPy GEMM, all 100k)")
# 3. Build HNSW: M = 16, efConstruction = 200
index = hnswlib.Index(space='cosine', dim=d)
index.init_index(max_elements=N, M=16, ef_construction=200, random_seed=42)
t0 = time.time()
index.add_items(xb, ids=np.arange(N))
build_time = time.time() - t0
print(f"Build: {build_time:.1f} s for {N} vectors (M=16, efConstruction=200)")
# 4. Sweep efSearch and measure recall@10 + latency
for ef in [10, 50, 100, 200, 400]:
index.set_ef(ef)
t0 = time.time()
ids, _ = index.knn_query(xq, k=10)
hnsw_time = (time.time() - t0) * 1000 / n_queries
correct = sum(len(set(gt[i]).intersection(set(ids[i]))) for i in range(n_queries))
recall = correct / (n_queries * 10)
speedup = flat_time / hnsw_time
print(f"efSearch={ef:3d} {hnsw_time:5.2f} ms/q "
f"recall@10={recall:.2%} speedup={speedup:.1f}×")
Typical output on a modern laptop:
Flat: 38.5 ms/query (NumPy GEMM, all 100k)
Build: 47.2 s for 100000 vectors (M=16, efConstruction=200)
efSearch= 10 0.18 ms/q recall@10=78.20% speedup=213.9×
efSearch= 50 0.41 ms/q recall@10=96.30% speedup= 93.9×
efSearch=100 0.72 ms/q recall@10=98.70% speedup= 53.5×
efSearch=200 1.31 ms/q recall@10=99.50% speedup= 29.4×
efSearch=400 2.48 ms/q recall@10=99.80% speedup= 15.5×
Compare with the IVF numbers from the previous chapter: IVF at nprobe=10 hit 94 % recall in 1.6 ms/q, while HNSW at efSearch=100 hits 98.7 % recall in 0.7 ms/q. HNSW dominates IVF on the recall-vs-latency Pareto frontier for any application that cares about recall above ~92 %. The price is build time (47 s for HNSW vs ~3 s for IVF on the same data) and a graph that does not love updates.
**A worked sizing — Indian e-commerce product search at 5 M scale**
You are scaling the same product search you sized for IVF in the previous chapter, but the catalogue has grown. 5 million products, each embedded as a 768-dim float32 vector. p99 latency target: 50 ms. Recall@10 target: ≥ 95 %. Sustained 200 QPS across 4 app instances (50 QPS each).
Raw vector storage: 5 M × 768 × 4 bytes = 15 GB. Fits in a 32 GB RAM box.
HNSW graph overhead at M = 16: layer 0 stores up to 2M = 32 edges per node; higher layers store up to M = 16 per node, but the count of nodes shrinks geometrically. Total edge count is approximately N · 2M · sum_{ℓ≥0} (1/M)^ℓ ≈ N · 2M · M/(M−1) ≈ 2.13 · N · M. For our numbers: 5 M · 32 · 4 bytes ≈ 640 MB for layer-0 edges alone, and roughly half that for the upper layers — call it ~1 GB of graph metadata. Total memory: ~16 GB raw + ~1 GB graph + ~1 GB internal hnswlib bookkeeping (locks, levels, IDs) ≈ 18 GB. Fits comfortably in 32 GB. (The 30 GB number you sometimes see is for M = 32 or for indexes that store internal copies of the vectors with extra alignment.)
Build time: at efConstruction = 200, hnswlib processes roughly 2000–4000 vectors per second per CPU core on this data (768-d, cosine). With 8 cores, the full build is 5 M / (3000 · 8) ≈ 200 s — about 4 minutes. This is rebuild-from-scratch time; you do this nightly during a maintenance window.
Query at efSearch = 100: per-query distance computations are roughly M · efSearch · log_M(N) = 16 · 100 · log_16(5 M) ≈ 16 · 100 · 5.5 ≈ 9000. Each comparison is a 768-d dot product = ~770 FLOPs. On a modern AVX2 CPU at ~30 GFLOPS effective, that is 9000 · 770 / 30e9 ≈ 0.23 ms of pure compute, plus cache misses and graph-pointer chasing — call it ~5 ms p50, ~10 ms p99. Comfortably under the 50 ms target. By comparison, IVF at the same scale would do ~50 ms (more clusters, larger lists), and flat search would be 1500 ms — unusable.
Recall: at efSearch = 100, real product embeddings (which are clustered, not uniform) typically give 97–99 % recall@10. Decision: ship with M = 16, efConstruction = 200, efSearch = 100.
The operational caveats. First, HNSW does not love deletions — hnswlib supports mark_deleted (a tombstone), but the graph never reclaims those nodes; rebuilds happen weekly to compact. Second, inserts are append-only; new products go in fine, but if you bulk-rewrite embeddings (say, after switching to a new embedding model), you rebuild the index from scratch. Third, the index is not natively shardable — you either run one giant index per box or shard at the application layer (e.g. by category) and merge top-k across shards. Pinecone, Weaviate, and Milvus all do exactly this under the hood.
Why HNSW works: the small-world property in two paragraphs
A graph is a small world if the average shortest path between any two nodes is O(log N) — much shorter than N would suggest for random scattering. Real social networks have this property; so does the internet's link graph; so does any network whose edges include both short-range connections (to nearby neighbours) and long-range connections (to far-away nodes). This is the structural insight behind Watts and Strogatz's 1998 paper and the "six degrees of separation" result.
HNSW manufactures the small-world property deliberately. The bottom layer's M nearest-neighbour edges are short-range — they connect nearby vectors. But the diversity heuristic ensures some of those edges go to distant clusters, providing long-range shortcuts. The upper layers explicitly add long-range structure: a node at layer 2 is one of N/M² nodes spread across the entire vector space, and its edges cover huge distances. Together, these two ingredients give the graph a navigable small-world structure: greedy search is guaranteed (with high probability) to find the true nearest neighbour in O(log N) hops. Why this matters operationally: it means the cost of HNSW does not blow up with N the way IVF's does. Going from 1 M to 1 B vectors at the same recall level multiplies query cost by log_M(1000) ≈ 2.5×. The same scaling for IVF would be √1000 ≈ 32×.
Where HNSW lives in the wild
hnswlib — Malkov & Yashunin's reference implementation. Header-only C++ with Python bindings. Used directly by Weaviate, Qdrant, and many bespoke services. The right starting point if you want HNSW without a vector DB on top.
FAISS HNSW — the IndexHNSWFlat and IndexHNSWSQ (with scalar quantisation) variants. Slightly slower than hnswlib on equivalent CPU loads but plays well with FAISS's other index types — you can wrap HNSW around the centroid lookup of an IVF-PQ index to make the coarse step itself logarithmic.
Pinecone — managed vector DB whose default index is HNSW with proprietary tweaks (mostly around sharding, replication, and metadata filters). Their public engineering blog has the clearest non-academic walkthrough of HNSW intuition.
Weaviate — open-source vector DB; HNSW is the default; tunable per-class. Weaviate's HNSW deep-dive blog walks through the parameter trade-offs in production.
Qdrant — Rust-based vector DB, HNSW index with payload filtering pushed into the graph traversal (so filters do not require post-scan). One of the fastest production HNSW implementations.
Milvus — wraps both hnswlib and FAISS as backends. The HNSW index is FAISS's; the HNSW_SQ and HNSW_PQ variants combine HNSW with quantisation for memory savings.
The pattern is: every serious vector DB ships HNSW. The differences are in metadata filtering, sharding, and persistence — not in the core algorithm. If you want to pick one to learn deeply, start with hnswlib because it has no other moving parts.
When to reach for HNSW (and when not to)
Reach for HNSW when:
- Recall ≥ 95 % is required and IVF cannot get you there at acceptable latency.
- Query latency matters more than build time or memory.
- The dataset is mostly read-heavy with batch updates (nightly, hourly).
- You have ≤ 100 M vectors on a single box, or are willing to shard.
Reach for IVF instead when:
- Build time matters more than query latency (frequent index rebuilds).
- 90–95 % recall is sufficient.
- You want a simpler index with two debuggable knobs.
Reach for IVF-PQ or DiskANN when:
- You have ≥ 100 M vectors and cannot afford the full-precision RAM bill.
- A few percent recall loss is acceptable for 8–32× memory savings.
Use flat search when:
- You have < 100 k vectors and exact top-k is required.
The decision boundary between IVF and HNSW is, in practice, recall: under 92 %, IVF is simpler and cheaper; above 95 %, HNSW dominates. The middle range (92–95 %) is application-specific — measure both on your real data before committing.
Common confusions
-
"HNSW gives you exact nearest neighbours." It does not. HNSW is approximate — it can miss the true top-k whenever greedy descent gets trapped in a local minimum or the bottom-layer beam terminates early. At
efSearch = 100you typically see 97–99 % recall@10, which means roughly 1–3 of every 100 nearest-neighbour ids you return are wrong. The diversity heuristic and the layer hierarchy reduce the failure rate but never zero it. If you need exact top-k, run flat search; if you need near-exact, raiseefSearchuntil recall plateaus. -
"Bigger
Malways means better recall." Up to a point.Mcontrols how many out-edges each node carries; once the graph is dense enough that greedy navigation almost always finds a closer neighbour, raisingMfurther mostly burns memory and slows the bottom-layer beam (each step expandsMneighbours). For 768-d sentence embeddings,M = 16is the sweet spot;M = 32helps a little on adversarial data but doubles graph memory. PastM ≈ 64you are hurting yourself. -
"
efConstructionandefSearchare the same thing." They are the same algorithm (beam search at layer 0) used at two different times.efConstructionis paid once when the index is built and decides how good each new node'sMneighbour links will be.efSearchis paid per query and decides how wide the beam is when answering a request. You can re-tuneefSearchat any moment without rebuilding; you cannot changeefConstructionwithout throwing the index away. -
"Deletion is free — just call
mark_deleted." The vector becomes invisible to results, but the graph still routes through it. Every search that lands on a deleted node still has to expand its neighbours and walk past it, so query cost slowly degrades as deleted nodes accumulate. Production HNSW deployments rebuild from scratch weekly — Pinecone, Weaviate, and Qdrant all schedule compaction jobs for exactly this reason. If your workload is delete-heavy, HNSW is the wrong index. -
"HNSW is shardable like any other index." It is not natively. The graph is a single connected structure with one entry point — split it across two machines and a query that should cross machines instead gets stuck on whichever side it landed on. Production sharding works by partitioning data ahead of time (by category, tenant, or hash) and merging top-k from each shard at the application layer, accepting some recall loss at boundaries. Milvus and Weaviate handle this for you; if you roll your own HNSW with hnswlib, you handle it yourself.
-
"HNSW replaces IVF everywhere." Not when build time matters. IVF rebuilds in seconds; HNSW takes minutes-to-hours at the same scale. For workloads that re-embed nightly (think a BharatBazaar catalogue where prices and titles drive embeddings, or a fresh-content recommender), the rebuild cost dominates and IVF often wins on total cost-of-ownership even at slightly worse recall.
Going deeper
The diversity heuristic in detail. When inserting node v, the candidate set has efConstruction nearest neighbours but you keep only M. The naive choice — take the M closest — produces a graph where every edge points into the densest cluster, the long-range structure collapses, and greedy search stalls in local minima. Malkov & Yashunin's heuristic instead iterates through candidates in distance order and accepts a candidate c only if dist(c, v) < dist(c, n) for every already-selected neighbour n. In other words: include c only if it is closer to v than to any neighbour you have already picked. This forces the chosen M neighbours to occupy distinct "directions" around v, preserving the small-world long edges. Section 4.3 of the paper has the pseudocode and a proof sketch; hnswlib implements it as getNeighborsByHeuristic2.
Filtered search — why HNSW struggles with metadata predicates. Every production vector DB needs to answer "find me the top-10 most similar products that are in stock and ship to Coimbatore". The naive approach — fetch top-1000 from HNSW, then filter — fails when the predicate is selective (say, 5 % of products), because too many of your top-10 candidates were filtered out and recall collapses. Three real fixes are in production: (i) post-filter with adaptive efSearch (Pinecone): raise efSearch proportionally to the filter selectivity. (ii) pre-filter with payload index (Qdrant): walk the HNSW graph but at each step skip nodes whose payload does not match the filter, expanding deeper to compensate. (iii) separate per-filter indexes (Weaviate's class-per-filter pattern): build one HNSW per major filter combination. None of these are free; filtered ANN is an active research area in 2026 (see ACORN, Filtered-DiskANN).
Combining HNSW with quantisation. A 768-d float32 vector is 3 KB; at 1 B vectors that is 3 TB just for the data, dwarfing any graph overhead. The fix is to compress vectors with Product Quantization — splitting each vector into 8 sub-vectors of 96 dimensions each, k-means-clustering each sub-space into 256 codes, and storing each vector as 8 bytes (a 384× compression). The HNSW graph still stores integer IDs and edge lists but the per-node distance computation now runs on quantised codes. FAISS's IndexHNSWPQ and Vistron's DiskANN both use this combination; recall drops by 1–3 % but RAM usage falls by an order of magnitude. The trade-off: PQ-distance computation is asymmetric (the query stays full precision, the database vectors are codes), and the table-lookup pattern is cache-unfriendly enough that throughput degrades even though memory shrinks. DiskANN goes further and stores the graph on SSD, achieving billion-vector indexes on a single box — the architecture that makes ANN at BharatRail-tatkal scale economically viable.
The Vamana algorithm — DiskANN's alternative. Vistron Research's DiskANN uses a different graph algorithm called Vamana, not HNSW. The differences: Vamana is a flat graph (no hierarchy), it uses a different pruning rule (α-RNG with α > 1), and it is designed to be SSD-resident. For pure RAM workloads HNSW typically still wins on recall-vs-latency, but for hundred-million- to billion-scale indexes that cannot fit in RAM, Vamana's flatter structure tolerates SSD random reads better. The Subramanya et al. 2019 paper is the reference; Vistron open-sourced DiskANN in 2021.
HNSW failure modes in production. Three real ones the docs do not warn you about. (i) Highly clustered data (e.g. duplicate products in a Myntra catalogue with near-identical embeddings) creates "bridges" — single nodes that all queries route through — and search latency spikes when those bridges are cold in cache. (ii) Adversarial queries (a query vector engineered to be far from any cluster) fall back to near-flat-search performance because greedy descent stalls; this is exploitable as a DoS vector if queries are user-controlled. (iii) Non-stationary data — if the data distribution shifts (new fashion season, new product category) but you keep adding to the same HNSW, the early entry-point nodes become unrepresentative and recall drifts down. Periodic rebuild fixes all three; this is why every managed vector DB schedules them.
What you have built
You started this chapter with IVF in the O(√N) slot and the question of whether O(log N) was reachable for ANN. You end with HNSW: a hierarchical graph that scales logarithmically, a build that is one-vector-at-a-time but one-shot, three knobs (M, efConstruction, efSearch) that map cleanly to memory, build time, and per-query recall. The mental model is a road network — highways at the top, district roads at the bottom, greedy navigation at every level. The next chapter takes the other axis seriously: Product Quantization compresses each vector from 3 KB to 32 bytes, making billion-scale HNSW or IVF affordable on a single machine. Together with HNSW, PQ closes the gap between "small enough to fit in RAM" and "every embedding you will ever produce".
References
- Malkov, Yashunin — Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016/2018). The original HNSW paper. Section 4 is the algorithm; Appendix A is the diversity heuristic.
- hnswlib — reference implementation. Header-only C++ with Python bindings, by the original authors. The implementation every other library copies from.
- FAISS HNSW documentation. Practitioner reference for
IndexHNSWFlat,IndexHNSWSQ, andIndexHNSWPQ. - Pinecone — HNSW explainer. Vendor-flavoured but well-illustrated walkthrough of HNSW with side-by-side comparisons against IVF and ScaNN.
- Weaviate — HNSW deep dive. Production-focused notes on
efConstruction,M, and operational pitfalls (deletions, rebuilds, filtered search). - Watts, Strogatz — Collective dynamics of 'small-world' networks (1998). The foundational paper on small-world graphs; the structural property HNSW exploits.