HNSW: Hierarchical Navigable Small World Graphs

In short

IVF was the first honest answer to "flat search dies past a million vectors" — partition the space into √N Voronoi cells, search only the few that matter, get an O(√N) query at the cost of a few percent recall. HNSW (Hierarchical Navigable Small World) is the next move: instead of partitioning, build a graph where every vector is a node connected to its nearest neighbours, and stack copies of that graph in layers — a sparse "highway" graph at the top, progressively denser layers below, the full graph with all N nodes at the bottom. A query enters at the top entry point, greedy-hops toward the nearest neighbour at that layer, drops down to the next denser layer, refines, drops again, and at the bottom layer expands a small beam (efSearch) to lock in the top-k. Each hop halves the remaining distance — geometrically — which is why the whole thing is O(log N) per query. Three knobs run the show. M (typically 8–32) is the maximum number of edges per node per layer; bigger M raises recall but inflates memory. efConstruction (typically 100–200) is how hard the build searches when wiring each new node — bigger means a better graph and a slower build. efSearch (typically 50–200) is how hard the query searches at the bottom layer — bigger means higher recall and slower queries. In practice HNSW hits 95–99 % recall at 1–10 ms per query on tens of millions of 768-dimensional vectors, beating IVF by 5–10× on the recall-vs-latency Pareto frontier. The cost is build time (10–100× slower than IVF) and memory (the graph adds roughly 1–2× the vector storage). Every serious vector database in 2026 — Pinecone, Weaviate, Qdrant, Milvus, FAISS — runs HNSW as the default index. This chapter builds it from the small-world property up.

The previous chapter ended with IVF sitting in the O(√N) slot — fast enough for tens of millions of vectors, simple enough to debug, but brittle near cluster boundaries and costly to push past 99 % recall. The natural next question is whether you can do better than √N. The answer is yes, and the move is to stop partitioning and start navigating. Build a graph. Walk it. Each hop should bring you measurably closer to the query. If each hop halves the remaining distance, you arrive in log₂ N steps — for a billion vectors, thirty hops, each one a handful of dot products. That is HNSW, and the rest of this chapter is an unpacking of why it works.

The thesis: a navigable graph beats a partition

Imagine you are standing in Mumbai, told to find the bookshop in India that most resembles Blossom in Bengaluru. With IVF, you would pre-partition the country into states, find the nearest state to your query (probably Karnataka), and scan every shop in it. With HNSW, the country is wired up as a road network — every shop is connected to a few nearby shops, plus a long-distance highway connects a sparse subset across regions. You start on the national highway, hop along until you are roughly in the right zone, then take state highways, then district roads, then walk to the actual shop. At every level of the hierarchy, a greedy step toward the target makes meaningful progress because the local connectivity guarantees you can always find a closer neighbour.

That is the small-world property — and the hierarchy on top is what makes it logarithmic. A flat small-world graph with no hierarchy works, but its query cost is O(√N) (Kleinberg's classical result). Stacking layers, where the top layer holds only a tiny fraction of nodes wired up sparsely, gives the long-distance highways for free. You go far in few hops at the top, then refine. Yury Malkov and Dmitry Yashunin's 2016 paper formalised this and shipped it as hnswlib; within three years it was the default index in every commercial vector DB.

Three artefacts come out of an HNSW build: the graph itself (N nodes, each with up to M outbound edges per layer it appears in), a layer assignment (each node has a maximum layer, drawn from a geometric distribution), and an entry point (the single highest-layer node, used as the start of every query). That is it. No centroids, no inverted lists, no clusters — just a graph and a place to start.

Each vector is assigned a maximum layer at insert time, drawn from a geometric distribution with parameter `1/ln(M)`. A node at layer `L` exists in *every* layer from `0` up to `L`. The expected count at layer `ℓ` is `N · (1/M)^ℓ` — so layer 0 has all `N` nodes, layer 1 has about `N/M`, layer 2 has `N/M²`, and so on. The top is a sparse highway; the bottom is the full graph. Dashed vertical lines connect the same vector across layers.

Why a hierarchy makes the query logarithmic

Take the simplest version of the navigation: at each layer, do greedy nearest-neighbour search. Start at some node, look at its outbound edges, move to whichever neighbour is closest to the query, repeat until no neighbour is closer than the current node. That terminal node is your local minimum at this layer. Then you drop one layer down and use that local minimum as your new starting point.

The genius is in the layer counts. Layer 0 has N nodes. Layer 1 has N/M (where M is the typical degree). Layer 2 has N/M². The top layer has roughly one node. So the number of layers is log_M(N) — for N = 10⁹ and M = 16, that is about 7 layers. Why this matters: at each layer, greedy search visits some bounded number of nodes (think: a beam of size efSearch once you are at the bottom, and a beam of 1 above that). Each layer costs O(M · efSearch) distance computations. Multiply by the number of layers, log_M N, and you get O(M · efSearch · log N) per query — logarithmic in N.

The other way to see it: at the top layer, the average distance between any two nodes is large, so each greedy step covers a lot of ground. As you drop layers, the graph densifies, the steps shrink, and you get progressively closer to your true target. It is the same intuition as binary search — coarse cuts at the top, fine cuts at the bottom — but applied to navigation in ℝ^d.

The query path: greedy descent then beam expansion

A query vector q arrives. The algorithm is:

Start at the entry point (the single node at the top layer).
At each layer above 0, do greedy search with beam size 1: visit the current node's neighbours, pick the closest to q, repeat until no neighbour is closer. This terminal node becomes the entry point for the next layer down.
At layer 0, switch to beam search with beam size efSearch. Maintain a candidate set (the efSearch nearest nodes seen so far) and a frontier (nodes whose neighbours you have not yet expanded). At each step, pop the closest unexpanded node from the frontier, expand its neighbours, insert any improvements into the candidate set.
Return the top-k from the final candidate set.

efSearch is the only knob you turn at query time. With efSearch = k, the search is purely greedy — fast but recall-poor. With efSearch = 200, the search is a wide beam — slower but much higher recall. The relation is monotonic: more beam, more recall, more cost. Production sits around 50–200.

Each layer narrows the search. The greedy hop with beam=1 above layer 0 costs constant work per layer; the only place you spend real CPU is the bottom-layer beam search of size `efSearch`. Total per-query cost is `O(M · efSearch · log N)` distance computations. For 10 M vectors with `M = 16`, `efSearch = 100`, that is roughly `16 · 100 · 23 ≈ 37 k` comparisons — about 5 ms on a single core.

Construction: insert one vector at a time, with a coin flip for the level

There is no "training" phase the way IVF has k-means. You build HNSW one vector at a time. For each new vector v:

Roll the level. Pick a maximum level L for v from the geometric distribution: L = floor(-ln(uniform(0,1)) · 1/ln(M)). Most vectors get L = 0. About 1/M get L ≥ 1. About 1/M² get L ≥ 2. And so on. The first vector you ever insert becomes the entry point at the highest layer it rolled.
Greedy descend from the current entry point through layers above L, just like a query, ending up with an entry candidate at layer L.
At each layer from L down to 0, run beam search with size efConstruction to find the efConstruction nearest existing nodes at that layer. Pick the top M of those (using a heuristic that balances proximity with neighbourhood diversity — see below) and create bidirectional edges to them.
Prune any neighbour that now has more than M_max outbound edges (typically M_max = M for layers ≥ 1, M_max = 2M for layer 0). The pruning uses the same diversity heuristic.
Update the entry point if L is greater than the current top-layer node's level.

The diversity heuristic — Malkov & Yashunin's "select neighbors heuristic" — is the subtle part. When you have efConstruction candidate neighbours and need to keep M, you do not just take the M nearest. You take the nearest, then iterate through the rest in distance order: include each candidate only if it is closer to the query than to any already-selected neighbour. Why diversity matters: if you only kept the M closest, all your edges would point in the same direction (the densest cluster), and the graph would lose its long-range connectivity. The diversity heuristic ensures edges fan out — a node connects to nearby clusters in different directions, preserving the small-world property and preventing the greedy search from getting stuck in local minima.

Construction is online — one vector at a time, no batch step. The level-roll is what makes the hierarchy: a geometric distribution with parameter `1/ln(M)` produces about `N/M^ℓ` nodes at layer `ℓ`. The first inserted node becomes the temporary entry point; later inserts may overtake it if their rolled level is higher. Build cost is `O(N · log N)` total, but with a large constant — `efConstruction = 200` typically dominates.

The three knobs and what they cost

Three parameters control the entire algorithm. Tune them in this order.

M (max edges per node per layer above 0) — the structural knob. Typical values: 8, 16, 32. Larger M means each node has more out-edges, which gives the graph more redundancy and better recall, but also costs more memory (each edge is a 4-byte int ID; layer 0 stores up to 2M edges per node, higher layers up to M). For 768-d float32 vectors, the per-vector storage is 768 · 4 = 3072 bytes; the per-vector graph storage at M = 16 is roughly (2·16 + 16 + 16/16 + …) · 4 ≈ 200 bytes — about 7 % overhead. At M = 32, doubled. A useful starting point: M = 16 for 768-d, M = 32 for very high-dimensional or low-recall-tolerance applications.

efConstruction (build-time beam size) — the build-quality knob. Typical: 100–200. Larger means a better graph (each new node finds truly its M nearest neighbours, not just M decent ones), but build time scales linearly with efConstruction. Why this matters: a poorly built HNSW graph never recovers — there is no rebalancing pass. If efConstruction is too small, some node never gets connected to its true neighbours, and queries that should land on it instead drift past. Spend the build time once.

efSearch (query-time beam size) — the recall knob. Typical: 50–200. Larger means more nodes visited at the bottom layer, more candidates surfaced, higher recall, slower query. This is the dial you turn per request — production systems often expose it directly to the caller (Pinecone calls it top_k_extra, Weaviate calls it ef).

A useful mental model: M is the index's static quality, set once. efConstruction is the index's learned quality, paid once. efSearch is the query's adaptive quality, paid per call. Tune M based on memory budget, efConstruction based on build-time budget, efSearch per query based on the latency-vs-recall trade-off you can tolerate.

Building HNSW with hnswlib

hnswlib is the canonical implementation, written by the original authors. It is faster than FAISS's HNSW for most workloads (FAISS still wins on very large indexes thanks to GPU support, but on a single CPU box, hnswlib is the reference). Here is a complete script that builds an HNSW index over 100 k vectors, queries it at varying efSearch, and measures recall against flat search.

import numpy as np
import hnswlib
import time

# 1. Generate synthetic data: 100k vectors in 768d (Sentence-BERT scale)
d = 768
N = 100_000
n_queries = 100
np.random.seed(42)

xb = np.random.randn(N, d).astype('float32')
xq = np.random.randn(n_queries, d).astype('float32')
# L2-normalise so inner-product = cosine similarity
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
xq /= np.linalg.norm(xq, axis=1, keepdims=True)

# 2. Ground truth via brute-force (so we can compute recall)
t0 = time.time()
gt = np.argsort(-(xq @ xb.T), axis=1)[:, :10]   # top-10 by cosine
flat_time = (time.time() - t0) * 1000 / n_queries
print(f"Flat: {flat_time:.2f} ms/query (NumPy GEMM, all 100k)")

# 3. Build HNSW: M = 16, efConstruction = 200
index = hnswlib.Index(space='cosine', dim=d)
index.init_index(max_elements=N, M=16, ef_construction=200, random_seed=42)

t0 = time.time()
index.add_items(xb, ids=np.arange(N))
build_time = time.time() - t0
print(f"Build: {build_time:.1f} s for {N} vectors (M=16, efConstruction=200)")

# 4. Sweep efSearch and measure recall@10 + latency
for ef in [10, 50, 100, 200, 400]:
    index.set_ef(ef)
    t0 = time.time()
    ids, _ = index.knn_query(xq, k=10)
    hnsw_time = (time.time() - t0) * 1000 / n_queries

    correct = sum(len(set(gt[i]).intersection(set(ids[i]))) for i in range(n_queries))
    recall = correct / (n_queries * 10)
    speedup = flat_time / hnsw_time
    print(f"efSearch={ef:3d}  {hnsw_time:5.2f} ms/q  "
          f"recall@10={recall:.2%}  speedup={speedup:.1f}×")

Typical output on a modern laptop:

Flat:        38.5 ms/query (NumPy GEMM, all 100k)
Build: 47.2 s for 100000 vectors (M=16, efConstruction=200)
efSearch= 10  0.18 ms/q  recall@10=78.20%  speedup=213.9×
efSearch= 50  0.41 ms/q  recall@10=96.30%  speedup= 93.9×
efSearch=100  0.72 ms/q  recall@10=98.70%  speedup= 53.5×
efSearch=200  1.31 ms/q  recall@10=99.50%  speedup= 29.4×
efSearch=400  2.48 ms/q  recall@10=99.80%  speedup= 15.5×

Compare with the IVF numbers from the previous chapter: IVF at nprobe=10 hit 94 % recall in 1.6 ms/q, while HNSW at efSearch=100 hits 98.7 % recall in 0.7 ms/q. HNSW dominates IVF on the recall-vs-latency Pareto frontier for any application that cares about recall above ~92 %. The price is build time (47 s for HNSW vs ~3 s for IVF on the same data) and a graph that does not love updates.

**A worked sizing — Indian e-commerce product search at 5 M scale**

You are scaling the same product search you sized for IVF in the previous chapter, but the catalogue has grown. 5 million products, each embedded as a 768-dim float32 vector. p99 latency target: 50 ms. Recall@10 target: ≥ 95 %. Sustained 200 QPS across 4 app instances (50 QPS each).

Raw vector storage: 5 M × 768 × 4 bytes = 15 GB. Fits in a 32 GB RAM box.

HNSW graph overhead at M = 16: layer 0 stores up to 2M = 32 edges per node; higher layers store up to M = 16 per node, but the count of nodes shrinks geometrically. Total edge count is approximately N · 2M · sum_{ℓ≥0} (1/M)^ℓ ≈ N · 2M · M/(M−1) ≈ 2.13 · N · M. For our numbers: 5 M · 32 · 4 bytes ≈ 640 MB for layer-0 edges alone, and roughly half that for the upper layers — call it ~1 GB of graph metadata. Total memory: ~16 GB raw + ~1 GB graph + ~1 GB internal hnswlib bookkeeping (locks, levels, IDs) ≈ 18 GB. Fits comfortably in 32 GB. (The 30 GB number you sometimes see is for M = 32 or for indexes that store internal copies of the vectors with extra alignment.)

Build time: at efConstruction = 200, hnswlib processes roughly 2000–4000 vectors per second per CPU core on this data (768-d, cosine). With 8 cores, the full build is 5 M / (3000 · 8) ≈ 200 s — about 4 minutes. This is rebuild-from-scratch time; you do this nightly during a maintenance window.

Query at efSearch = 100: per-query distance computations are roughly M · efSearch · log_M(N) = 16 · 100 · log_16(5 M) ≈ 16 · 100 · 5.5 ≈ 9000. Each comparison is a 768-d dot product = ~770 FLOPs. On a modern AVX2 CPU at ~30 GFLOPS effective, that is 9000 · 770 / 30e9 ≈ 0.23 ms of pure compute, plus cache misses and graph-pointer chasing — call it ~5 ms p50, ~10 ms p99. Comfortably under the 50 ms target. By comparison, IVF at the same scale would do ~50 ms (more clusters, larger lists), and flat search would be 1500 ms — unusable.

Recall: at efSearch = 100, real product embeddings (which are clustered, not uniform) typically give 97–99 % recall@10. Decision: ship with M = 16, efConstruction = 200, efSearch = 100.

The operational caveats. First, HNSW does not love deletions — hnswlib supports mark_deleted (a tombstone), but the graph never reclaims those nodes; rebuilds happen weekly to compact. Second, inserts are append-only; new products go in fine, but if you bulk-rewrite embeddings (say, after switching to a new embedding model), you rebuild the index from scratch. Third, the index is not natively shardable — you either run one giant index per box or shard at the application layer (e.g. by category) and merge top-k across shards. Pinecone, Weaviate, and Milvus all do exactly this under the hood.

Why HNSW works: the small-world property in two paragraphs

A graph is a small world if the average shortest path between any two nodes is O(log N) — much shorter than N would suggest for random scattering. Real social networks have this property; so does the internet's link graph; so does any network whose edges include both short-range connections (to nearby neighbours) and long-range connections (to far-away nodes). This is the structural insight behind Watts and Strogatz's 1998 paper and the "six degrees of separation" result.

HNSW manufactures the small-world property deliberately. The bottom layer's M nearest-neighbour edges are short-range — they connect nearby vectors. But the diversity heuristic ensures some of those edges go to distant clusters, providing long-range shortcuts. The upper layers explicitly add long-range structure: a node at layer 2 is one of N/M² nodes spread across the entire vector space, and its edges cover huge distances. Together, these two ingredients give the graph a navigable small-world structure: greedy search is guaranteed (with high probability) to find the true nearest neighbour in O(log N) hops. Why this matters operationally: it means the cost of HNSW does not blow up with N the way IVF's does. Going from 1 M to 1 B vectors at the same recall level multiplies query cost by log_M(1000) ≈ 2.5×. The same scaling for IVF would be √1000 ≈ 32×.

Where HNSW lives in the wild

hnswlib — Malkov & Yashunin's reference implementation. Header-only C++ with Python bindings. Used directly by Weaviate, Qdrant, and many bespoke services. The right starting point if you want HNSW without a vector DB on top.

FAISS HNSW — the IndexHNSWFlat and IndexHNSWSQ (with scalar quantisation) variants. Slightly slower than hnswlib on equivalent CPU loads but plays well with FAISS's other index types — you can wrap HNSW around the centroid lookup of an IVF-PQ index to make the coarse step itself logarithmic.

Pinecone — managed vector DB whose default index is HNSW with proprietary tweaks (mostly around sharding, replication, and metadata filters). Their public engineering blog has the clearest non-academic walkthrough of HNSW intuition.

Weaviate — open-source vector DB; HNSW is the default; tunable per-class. Weaviate's HNSW deep-dive blog walks through the parameter trade-offs in production.

Qdrant — Rust-based vector DB, HNSW index with payload filtering pushed into the graph traversal (so filters do not require post-scan). One of the fastest production HNSW implementations.

Milvus — wraps both hnswlib and FAISS as backends. The HNSW index is FAISS's; the HNSW_SQ and HNSW_PQ variants combine HNSW with quantisation for memory savings.

The pattern is: every serious vector DB ships HNSW. The differences are in metadata filtering, sharding, and persistence — not in the core algorithm. If you want to pick one to learn deeply, start with hnswlib because it has no other moving parts.

When to reach for HNSW (and when not to)

Reach for HNSW when:

Recall ≥ 95 % is required and IVF cannot get you there at acceptable latency.
Query latency matters more than build time or memory.
The dataset is mostly read-heavy with batch updates (nightly, hourly).
You have ≤ 100 M vectors on a single box, or are willing to shard.

Reach for IVF instead when:

Build time matters more than query latency (frequent index rebuilds).
90–95 % recall is sufficient.
You want a simpler index with two debuggable knobs.

Reach for IVF-PQ or DiskANN when:

You have ≥ 100 M vectors and cannot afford the full-precision RAM bill.
A few percent recall loss is acceptable for 8–32× memory savings.

Use flat search when:

You have < 100 k vectors and exact top-k is required.

The decision boundary between IVF and HNSW is, in practice, recall: under 92 %, IVF is simpler and cheaper; above 95 %, HNSW dominates. The middle range (92–95 %) is application-specific — measure both on your real data before committing.

What you have built

You started this chapter with IVF in the O(√N) slot and the question of whether O(log N) was reachable for ANN. You end with HNSW: a hierarchical graph that scales logarithmically, a build that is one-vector-at-a-time but one-shot, three knobs (M, efConstruction, efSearch) that map cleanly to memory, build time, and per-query recall. The mental model is a road network — highways at the top, district roads at the bottom, greedy navigation at every level. The next chapter takes the other axis seriously: Product Quantization compresses each vector from 3 KB to 32 bytes, making billion-scale HNSW or IVF affordable on a single machine. Together with HNSW, PQ closes the gap between "small enough to fit in RAM" and "every embedding you will ever produce".

References

Malkov, Yashunin — Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016/2018). The original HNSW paper. Section 4 is the algorithm; Appendix A is the diversity heuristic.
hnswlib — reference implementation. Header-only C++ with Python bindings, by the original authors. The implementation every other library copies from.
FAISS HNSW documentation. Practitioner reference for IndexHNSWFlat, IndexHNSWSQ, and IndexHNSWPQ.
Pinecone — HNSW explainer. Vendor-flavoured but well-illustrated walkthrough of HNSW with side-by-side comparisons against IVF and ScaNN.
Weaviate — HNSW deep dive. Production-focused notes on efConstruction, M, and operational pitfalls (deletions, rebuilds, filtered search).
Watts, Strogatz — Collective dynamics of 'small-world' networks (1998). The foundational paper on small-world graphs; the structural property HNSW exploits.