In short

HNSW and IVF give you O(log N) and O(√N) nearest-neighbour search respectively — but only for the unfiltered case. Real queries almost always carry a predicate alongside the vector: price < ₹2000, in_stock = true, category = footwear, tenant_id = 'acme'. The naive composition is to ANN-search first then drop non-matching results (post-filter), or filter first and ANN-search the survivors (pre-filter). Both break in different regimes. Post-filter quietly returns too few results when the filter is selective: ask for the top-100 vectors, keep only those matching a 1 %-selectivity filter, and you expect about 1 result — often 0. Pre-filter shatters the HNSW graph: removing 95 % of nodes destroys the local connectivity that makes greedy navigation work, and the search either misses neighbours or degrades to brute force over the survivors. The fix modern systems use is inline filtering (Qdrant, Milvus, Weaviate's filter-first-then-search) — during graph traversal, the algorithm still visits filter-failing nodes but only collects filter-passing ones, so the search can route through non-matching detours and the graph stays navigable. Pinecone takes a different route: a separate metadata index pre-filters into a candidate set, then ANN runs only on the survivors. The decision rule by selectivity is sharp. Under 1 % match: pre-filter is mandatory; the survivor set is small enough that flat search is competitive. 1–30 % match: inline filtering wins; the graph still has enough connectivity, and per-hop filter checks are cheap. Over 30 % match: post-filter is fine — the unfiltered ANN already returns mostly matching results, and dropping a few is harmless. This chapter derives those thresholds, walks through each system's choice, and ends with the bridge into hybrid retrieval that combines BM25, vectors, and filters in one pipeline.

The previous chapters built flat search, IVF, HNSW, and product quantization — each a way to make "find the k nearest vectors to q" run in sub-linear time. Every one of those algorithms assumes you want to search the entire corpus. In production, you almost never do. A user on a shopping site searches "running shoes" and the application silently appends WHERE price < 2000 AND in_stock = true AND category = 'footwear'. A multi-tenant SaaS application appends WHERE tenant_id = 'acme' to every query, and a tenant might own 0.01 % of the rows. A RAG system over a corporate wiki filters on WHERE department = 'engineering' AND visibility = 'internal'. The vector search and the filter must compose, and how they compose decides whether you get the right answer in 5 ms, the right answer in 500 ms, or zero answers at all.

The thesis: filter and ANN do not commute

The first instinct when designing a filtered vector search is the same instinct any SQL programmer has: do them independently and intersect. Run the ANN search to get the top-k nearest, run the filter to get the matching set, return the intersection. This works perfectly when the filter is loose — say, 80 % of the corpus matches — because the unfiltered top-k is overwhelmingly likely to be inside the matching set. It fails catastrophically when the filter is tight. If only 1 % of the corpus matches the filter and you ask the ANN for the top-100, you expect 1 result. Maybe 0. The filter and the ANN do not commute, and the failure mode is silent: you get fewer results than you asked for, but no error.
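The arithmetic behind that failure is worth a few lines. A minimal simulation — assuming, as a worst-but-typical case, that filter membership is independent of vector similarity, so each top-K candidate matches with probability equal to the selectivity — shows both the expected count and how often the intersection comes back completely empty:

import numpy as np

rng = np.random.default_rng(0)
selectivity, K, trials = 0.01, 100, 10_000

# Assumption: filter membership is independent of similarity, so each of the
# top-K candidates matches with probability = selectivity.
matches = rng.random((trials, K)) < selectivity
per_query = matches.sum(axis=1)

print(f"expected matches per query: {K * selectivity:.1f}")         # 1.0
print(f"simulated mean:             {per_query.mean():.2f}")        # ~1.0
print(f"queries with zero results:  {(per_query == 0).mean():.0%}") # ~37% (≈ 1/e)

Roughly a third of queries return nothing at all even though the expected count is one — exactly the silent failure described above.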

The second instinct is to flip it: filter first, ANN-search the survivors. This works when the survivor set is small enough to scan flat — but the moment it is large and lives inside an ANN index built on the full corpus, you have a different problem. HNSW's O(log N) cost depends on the small-world property of the full graph. If you remove 95 % of the nodes, the graph fragments, the entry point may be in a removed component, and greedy navigation either gets stuck in a local minimum surrounded by removed neighbours or has to backtrack through filter-checking jumps. Why HNSW is fragile to selective filters: each node has roughly M = 16 outbound edges. If only 5 % of nodes pass the filter, each node has on average 0.05 · 16 = 0.8 filter-passing neighbours. Greedy descent through a graph where most steps lead to filter-failing dead-ends is no longer logarithmic — it is closer to a random walk over the filter-passing subgraph, with O(N) worst-case behaviour.

The honest framing is that filter selectivity is the regime parameter. Three strategies cover the spectrum: pre-filter when the survivor set is small, inline filter when it is medium, post-filter when it is large. The rest of this chapter unpacks each.

Strategy 1 — Pre-filter

The pre-filter strategy applies the predicate first, materialises the matching subset of vectors (or their IDs), and then runs vector search against that subset. The vector search may be flat (if the subset is small enough), or it may be a separately maintained ANN index over just the matching subset, or it may be the full ANN index restricted to scan only matching IDs.

[Figure: pre-filter — the filter predicate (price < ₹2000) reduces the full corpus (10 M) to a survivor set (~500K, ~5 %), and the ANN search for the top-k near q runs only over those survivors.]
Pre-filter: the predicate is applied first against a metadata index. The survivor set (here 500K of 10 M, ~5 %) is then searched with whatever ANN strategy fits — flat scan if small, IVF or HNSW over the survivors if large. The result is guaranteed to satisfy the filter, and recall is whatever the ANN over the survivor set delivers.

The strategy works beautifully when the filter is selective. If the survivor set is 10K, just do flat search over it: ten thousand dot products in 768 dimensions takes about 2 ms on a single core, no index needed. If the survivor set is 500K, IVF or even HNSW over the survivors gives sub-10 ms. The cost model is clean: pre-filter latency = (metadata-index lookup) + (vector search over S survivors), where S scales with selectivity.

The strategy collapses when the filter is loose. If 80 % of the corpus survives, the survivor set is 8 M vectors and you need an HNSW over it — but you do not have one pre-built. You either build it on the fly (impossibly slow) or you fall back to scanning the full HNSW with a post-filter applied at the leaves. The middle case is also painful: a 30 % filter leaves you with 3 M vectors to search, no pre-built index, and flat scan is slow.

There is also a fragmentation problem when pre-filter is implemented by pruning the HNSW graph at query time (which is what naive implementations do — walk the graph, but skip non-matching nodes). Removing 95 % of the nodes from an HNSW graph leaves the remaining 5 % poorly connected. Greedy descent gets stuck. Why graph fragmentation hurts: HNSW's correctness argument relies on every node having M neighbours that span different directions in ℝ^d. If 19 of 20 neighbours fail the filter, you have 1 surviving edge per node — effectively a sparse linked list, not a navigable graph. The greedy walk can no longer make progress in the right direction; it can only follow whichever filter-passing edge happens to exist.

Pinecone's solution is to maintain a separate metadata index — a postings list keyed by metadata values — and do the pre-filter purely against that index, materialising a candidate ID set, then running ANN restricted to those IDs (with a flat fallback when the candidate set is small). This sidesteps the graph-fragmentation issue entirely at the cost of more memory and a trickier consistency story between the two indexes.
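For concreteness, this is roughly what that looks like from the client side. The sketch below is illustrative, not authoritative: the index name, field names, and credential are placeholders, q is the query embedding from the running example, and the Mongo-style operators ($eq, $lt) are the metadata-filter syntax Pinecone documents.

# Sketch of a filtered query against a Pinecone-style index.
# "products", the field names, and the API key are placeholders.
from pinecone import Pinecone

pc = Pinecone(api_key="...")          # elided credential
pc_index = pc.Index("products")

results = pc_index.query(
    vector=q.tolist(),                # the "running shoes" embedding
    top_k=10,
    filter={                          # evaluated against the metadata index,
        "category": {"$eq": "footwear"},   # not against the ANN graph
        "price": {"$lt": 2000},
        "in_stock": {"$eq": True},
    },
    include_metadata=True,
)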

Strategy 2 — Post-filter

Post-filter is the simplest strategy and the most-often-wrong. Run the unfiltered ANN search to get the top-K nearest vectors, then drop any that fail the predicate. The remaining set, of size at most K, is your answer.

[Figure: post-filter — ANN returns the top-K = 100 over the full corpus, non-matching candidates are dropped, and with a tight filter only a couple survive — far fewer than the requested k = 10.]
Post-filter: ANN runs over the full corpus and returns top-`K`. The filter is then applied to the candidate set. If the filter retains 2 % of the corpus and `K=100`, you expect about 2 results — even though the user asked for `k=10`. The fix is to over-fetch by `1/selectivity`, but that requires knowing the selectivity in advance and balloons latency for tight filters.

The strategy is fast — it costs exactly one unfiltered ANN search plus a constant filter check per candidate. It is also correct when the filter is loose. If 80 % of the corpus matches, the unfiltered top-100 is 80 matching candidates on average, and your top-10 is reliably populated. If 50 % matches, you get about 50, still safe. The trouble starts below ~10 % selectivity. At 5 % match, top-100 yields ~5 matching results — borderline. At 1 % match, top-100 yields ~1 result, and your user sees an empty page.

The naive remedy is to over-fetch: ask for K = k / selectivity. If selectivity is 1 % and you want k = 10, fetch top-1000 then drop. This works in principle, but it has two problems. First, you need a reliable selectivity estimate per query — and selectivity depends on the predicate, which is unknown at index time. Second, ANN cost grows roughly linearly with K (the beam width efSearch controls it), so over-fetching by 100× makes the search 100× slower. At that point you have lost the benefit of ANN entirely; flat-scan over the (also-large) full corpus would be comparable.

Elasticsearch's vector search uses post-filter for the overall pipeline (run kNN, then apply remaining filters), and pre-filter for keyword filters that already have inverted-index support. Why this hybrid in Elasticsearch: the keyword and metadata filters live in the inverted index that Elasticsearch was already built around. Restricting the kNN search to a docID bitset materialised from the inverted index is essentially free — the cost is dominated by the kNN itself. But more complex post-kNN filters (script-based, range-based on non-indexed fields) cannot use the bitset trick and fall back to true post-filter, with all its recall hazards.
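A sketch of the two paths through the Elasticsearch search API — index and field names are placeholders, and the Python-client call mirrors the JSON body. A filter inside the knn clause is the bitset pre-filter; post_filter is applied to whatever the kNN already returned, with the recall hazards described above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

# Pushed-down filter: resolved to a docID bitset from the inverted index,
# so the kNN search only considers documents in that bitset.
pre_filtered = es.search(index="products", knn={
    "field": "embedding",
    "query_vector": q.tolist(),
    "k": 10,
    "num_candidates": 100,
    "filter": {"bool": {"must": [
        {"term": {"category": "footwear"}},
        {"range": {"price": {"lt": 2000}}},
    ]}},
})

# True post-filter: kNN runs unfiltered, then post_filter drops
# non-matching hits from whatever the top-k happened to contain.
post_filtered = es.search(index="products", knn={
    "field": "embedding",
    "query_vector": q.tolist(),
    "k": 10,
    "num_candidates": 100,
}, post_filter={"term": {"category": "footwear"}})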

Strategy 3 — Inline filter (filter-aware HNSW)

Inline filtering — what Qdrant calls "filterable HNSW" and Milvus calls "filtered search" — embeds the filter into the graph traversal itself. The algorithm walks the HNSW graph as usual, but at each node-expansion step it checks the filter against the candidate neighbour. Filter-failing neighbours are still examined (their distance is computed and they are added to the visited set so the algorithm does not revisit them) but they are not added to the result set, and the search continues through them to find filter-passing nodes downstream.

[Figure: inline filter — the search walks the full HNSW graph from the entry point toward q, collecting filter-passing nodes into the result while routing through filter-failing nodes, which are visited but skipped.]
Inline filter: the search walks the full HNSW graph, but at each node it consults the filter. Filter-passing nodes (blue) go into the result candidate set; filter-failing nodes (grey, crossed out) are visited so the algorithm can route through them to find downstream matches. Recall is preserved because the graph stays connected; the cost is the per-hop filter check plus the wasted distance computation on filter-failing nodes.

The brilliance of inline filtering is that it preserves the small-world property of the graph. The graph itself is unchanged; only the collection policy changes. Why this works where pre-filter pruning fails: pre-filter pruning removes failing nodes from the graph at query time, leaving the survivors disconnected. Inline filtering keeps every node in the graph and only filters the result set. The greedy descent can therefore route through filter-failing nodes to reach filter-passing ones — and because the graph was built over the full corpus, the connectivity assumed by the small-world property is intact.

The cost is per-hop overhead. Each node visited now requires a filter evaluation — typically a hashtable lookup or bitmap-AND into a precomputed filter mask. If the filter is selective, the algorithm visits more nodes per hit (you traverse 5 grey ones to reach 1 blue), so the beam (efSearch) needs to be larger to find k matching results. Qdrant's documentation suggests scaling ef by roughly 1/selectivity for very selective filters, which makes inline filtering converge to the cost of pre-filter as selectivity drops below ~1 %.

The other subtlety is that filter-aware HNSW needs the filter to be fast to evaluate — ideally a bitmap-AND that can be done in a few CPU cycles. Qdrant maintains compressed roaring bitmaps for each indexable payload field; Milvus uses a similar scheme. Filters that require complex evaluation (a regex, a script) cannot run inline because the per-hop cost would dominate the actual ANN work.
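A minimal sketch of that precomputed-mask idea, using dense numpy boolean arrays in place of compressed roaring bitmaps; the .npy file names and the category code are assumptions for illustration.

import numpy as np

# Illustrative per-field columns, one entry per vector ID (aligned with HNSW labels).
prices = np.load('prices.npy')             # assumed precomputed column, ₹ per product
in_stock = np.load('in_stock.npy')         # bool per product
category_ids = np.load('categories.npy')   # int category code per product
FOOTWEAR = 7                               # assumed category code

# One bitmap-AND per query; production engines use compressed roaring
# bitmaps instead of dense numpy arrays, but the idea is the same.
mask = (prices < 2000) & in_stock & (category_ids == FOOTWEAR)

def passes(label: int) -> bool:
    # The per-hop check the inline traversal performs: a single array lookup.
    return bool(mask[label])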

What each system actually does

Production vector databases have converged on a small set of choices, each with its own justification:

Qdrant — inline filtering ("filterable HNSW"): the full graph is walked, a precomputed roaring-bitmap payload index is consulted at every hop, and only matching nodes enter the result set.

Milvus — inline filtered search: the filter expression compiles to a bitmap mask that the traversal consults per node.

Weaviate — filter-first-then-search: the filter is resolved first into an allow-list of matching IDs, and the HNSW traversal collects only allow-listed nodes.

Pinecone — pre-filter via a separate metadata index: the predicate materialises a candidate ID set, and ANN runs restricted to those IDs, with a flat fallback when the set is small.

Elasticsearch — pre-filter for predicates the inverted index can turn into a docID bitset; true post-filter for everything else (scripts, ranges on non-indexed fields).

The takeaway is that no system uses one strategy exclusively. The optimiser picks the strategy per query, sometimes per filter clause, based on estimated selectivity and the kind of index available.

Tuning by selectivity

The decision rule that emerges from cost-modelling each strategy is sharp:

Below ~1 % selectivity — pre-filter. The survivor set is small enough that a flat scan (or a small sub-index) beats everything else, and it is the only strategy that reliably returns k results.

Roughly 1–30 % selectivity — inline filter. The full graph still has enough filter-passing connectivity, and the per-hop bitmap check is cheap relative to the distance computations.

Above ~30 % selectivity — post-filter. The unfiltered top-K is already mostly matching; a modest over-fetch covers the rest, and no extra index machinery is needed.

The cost-model derivation is simple. Pre-filter cost is roughly O(S · d) for a flat scan over the survivors, where S = selectivity · N is the survivor count (or O(log S) if an index over the survivors already exists). Post-filter cost is O((k/selectivity) · log N), because you must over-fetch by 1/selectivity to expect k matches. Inline-filter cost is O((1/selectivity) · log N · M), where the 1/selectivity factor is the over-traversal needed to find k matching nodes. Plotting these against selectivity gives crossover points near 1 % and 30 % for typical parameter ranges — the same thresholds that production systems use as defaults.
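To make that concrete, here is a small sketch that applies the chapter's thresholds and evaluates the rough cost formulas above. The constants (N, k, M) are illustrative assumptions, not measurements, and the post-filter figure assumes a perfect selectivity estimate for its over-fetch factor.

import numpy as np

# Illustrative constants (assumptions, not measurements)
N = 10_000_000   # corpus size
k = 10           # results the user wants
M = 16           # HNSW out-degree

def choose_strategy(selectivity):
    """The decision rule from this chapter: thresholds at ~1 % and ~30 %."""
    if selectivity < 0.01:
        return 'pre-filter'
    if selectivity < 0.30:
        return 'inline filter'
    return 'post-filter'

def rough_cost(strategy, selectivity):
    """Very rough work estimates (in 'distance computations'), using the cost
    model sketched above; post-filter assumes a perfect selectivity estimate."""
    S = selectivity * N
    if strategy == 'pre-filter':
        return S                                   # flat scan over survivors
    if strategy == 'post-filter':
        return (k / selectivity) * np.log2(N)      # over-fetch K = k / selectivity
    return (1 / selectivity) * np.log2(N) * M      # inline over-traversal

for sel in [0.001, 0.01, 0.05, 0.30, 0.80]:
    s = choose_strategy(sel)
    print(f"selectivity {sel:6.1%} -> {s:13s} (~{rough_cost(s, sel):,.0f} distance computations)")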

Real Python: the three strategies

Here is each strategy as a small Python sketch against an hnswlib index (a lightweight open-source implementation of HNSW).

import numpy as np
import hnswlib

# Setup: 10M vectors, 768 dims (e.g., MiniLM embeddings of product descriptions)
index = hnswlib.Index(space='cosine', dim=768)
index.load_index('products.hnsw')
metadata = load_product_metadata()  # dict[id] -> {price, in_stock, category}

q = np.array([...])  # query vector for "running shoes"
predicate = lambda m: m['price'] < 2000 and m['in_stock'] and m['category'] == 'footwear'

# Strategy 1: PRE-FILTER
def pre_filter_search(q, k=10):
    # 1. Apply filter via metadata index (a postings list keyed by category, etc.)
    candidate_ids = metadata_index.lookup(category='footwear',
                                          price_lt=2000,
                                          in_stock=True)
    # 2. Flat search over survivors (or a sub-HNSW)
    survivor_vecs = vector_store.fetch(candidate_ids)  # assumed ID -> embedding store
    distances = 1 - survivor_vecs @ q  # cosine, assuming unit-normalised embeddings
    top_k_idx = np.argpartition(distances, k)[:k]
    top_k_idx = top_k_idx[np.argsort(distances[top_k_idx])]  # rank the k survivors
    return [candidate_ids[i] for i in top_k_idx]

# Strategy 2: POST-FILTER
def post_filter_search(q, k=10, K=100):
    # Over-fetch K, then filter
    ids, dists = index.knn_query(q, k=K)
    survivors = [i for i in ids[0] if predicate(metadata[i])]
    return survivors[:k]  # may be < k!

# Strategy 3: INLINE FILTER (hnswlib's filter callback)
def inline_filter_search(q, k=10):
    # hnswlib accepts a filter function called during graph traversal
    # The graph is walked normally; only filter-passing nodes go into the result
    ids, dists = index.knn_query(q, k=k,
                                 filter=lambda label: predicate(metadata[label]))
    return ids[0]

The inline-filter case is one extra parameter to knn_query. Under the hood, hnswlib checks the predicate before adding a candidate to the result heap, so the graph traversal proceeds through filter-failing nodes to find filter-passing ones downstream. Qdrant's API is similar — pass a Filter object alongside the vector — and Milvus uses an expr parameter that compiles to a bitmap mask consulted during search.
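For comparison, a sketch of the same filtered query through qdrant-client; the collection name and payload field names are assumptions, and the Filter / FieldCondition / Range objects are the client's standard payload-filter types.

# Sketch of the equivalent filtered query via qdrant-client.
# "products" and the payload field names are assumptions for this example.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient("localhost", port=6333)

hits = client.search(
    collection_name="products",
    query_vector=q.tolist(),
    query_filter=Filter(must=[
        FieldCondition(key="category", match=MatchValue(value="footwear")),
        FieldCondition(key="price", range=Range(lt=2000)),
        FieldCondition(key="in_stock", match=MatchValue(value=True)),
    ]),
    limit=10,
)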

Indian e-commerce product search: comparing the three strategies

You operate an e-commerce site with 10 million product listings. Each listing has a 768-dimensional embedding from a fine-tuned MiniLM, plus structured metadata: price (₹), in_stock (bool), category (enum), brand, rating. The query is "running shoes" + filter price < 2000 AND in_stock = true AND category = 'footwear'. Selectivity is moderate: about 5 % of products match (footwear is one of 20 categories but over-represented at roughly a tenth of listings; sub-₹2000 covers about 60 % of footwear, and ~80 % of that is in stock → 0.10 × 0.6 × 0.8 ≈ 5 %). User wants top-10. Single 32-core box, embeddings in RAM.

Strategy 1 — pre-filter. Build the metadata index (postings list per category, B-tree per price, bitmap per in_stock). Filter intersection: ~500K matching IDs in 0.5 ms. Run flat search over 500K vectors: 500K × 768 dims × 4 bytes ≈ 1.5 GB scan, ~7 ms with AVX-512 dot products. Total: ~8 ms, full recall (flat search is exact).

Strategy 2 — post-filter, K = 100. Run unfiltered HNSW with efSearch = 100: ~3 ms returns the top-100 products by cosine. Filter: ~5 of 100 match (5 % selectivity). User sees 5 results, asked for 10. To rescue this, you would have to over-fetch with K = 200 (10 ms, ~10 results, borderline) or K = 1000 (40 ms, ~50 results, safe). At K = 1000 you have lost the recall-vs-latency advantage of HNSW — you are doing more than ten times the work of the K = 100 search and landing well above both alternatives.

Strategy 3 — inline filter (Qdrant-style). HNSW search walks the full graph, but the result set only collects filter-passing nodes. With selectivity 5 %, the algorithm visits roughly 1/0.05 = 20 nodes per match — so to find 10 matches it visits ~200 graph nodes. With efSearch = 200: ~12 ms, full recall, exactly 10 matching results.

Verdict. At 5 % selectivity, both pre-filter (8 ms) and inline (12 ms) give correct results; post-filter is silently broken. Pre-filter is faster only because you maintain a separate metadata index — the engineering cost is real (sync, double memory, index updates on writes). Inline is what most production systems pick by default in this selectivity range. If selectivity drops to 0.1 % (a tenant filter in multi-tenant SaaS), pre-filter becomes the only viable choice; if selectivity rises to 50 % (only the in_stock = true filter), post-filter is fine and is the cheapest. Selectivity is the regime parameter — measure it and design accordingly.

A practical note: estimating selectivity at query time is its own engineering problem. Most systems maintain per-field histograms or HyperLogLog sketches and combine them with the same independence assumption that costs SQL optimisers their reputation. When the assumption fails (correlated predicates), the planner picks the wrong strategy and you eat the latency. Most engines give you an escape hatch for the cases where you know better than their estimator — Qdrant, for example, exposes an exact search parameter that forces a full scan over the filtered set instead of trusting the planner.
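When no histogram exists, a crude but serviceable fallback is to evaluate the predicate over a uniform sample of the metadata — a sketch that reuses the metadata dict and predicate from the code above:

import random

def estimate_selectivity(metadata, predicate, sample_size=10_000, seed=42):
    # Sample IDs uniformly and measure the fraction that pass the predicate.
    # With 10_000 samples the estimate is within roughly ±1 % absolute at
    # 95 % confidence — plenty to pick a regime.
    rng = random.Random(seed)
    ids = rng.sample(list(metadata.keys()), sample_size)
    return sum(predicate(metadata[i]) for i in ids) / sample_size

sel = estimate_selectivity(metadata, predicate)
strategy = 'pre-filter' if sel < 0.01 else 'inline' if sel < 0.30 else 'post-filter'
print(f"estimated selectivity {sel:.1%} -> {strategy}")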

Going deeper

For JEE-level engineers

The filter-aware HNSW literature — sometimes called filtered ANN or predicate ANN — has matured rapidly since 2023. The "Filtered Approximate Nearest Neighbor Search" survey by Patella et al. catalogues over a dozen variants: graph augmentation (add edges that connect filter-passing components), labelled-edge HNSW (each edge tagged with which filter values its endpoints satisfy), and ACORN (a predicate-agnostic approach that builds a denser graph and skips filter-failing neighbours at traversal time).

Graph augmentation: edges across filter components

The fundamental fragility of pre-filter pruning is that removing nodes can disconnect the graph. The fix is to add edges at build time that specifically span filter components — so even after pruning, the surviving subgraph is connected. The recipe: at insert time, in addition to the M nearest neighbours, add M' extra edges to the nearest node of each likely filter value. The build is more expensive (you need to know the filterable fields in advance) and the graph is denser, but pre-filter recall stays high even at 1 % selectivity. Pinecone's "namespace" feature reaches the same goal from the other direction — each namespace is effectively its own index, so connectivity within a namespace is guaranteed by construction.

Labelled-edge HNSW

A more general version: tag every edge with a bitmap of which filter values its two endpoints both satisfy. At query time, the algorithm only follows edges whose bitmap intersects the query's filter mask. This combines the connectivity of the full graph with the recall guarantee of pre-filter, at the cost of edge-storage overhead (each edge now carries a small bitmap). Variants of this are in research prototypes; production systems have not adopted it because the build cost and storage overhead outweigh the recall improvement for typical workloads.

ACORN: predicate-agnostic filtered search

ACORN (Patel et al., SIGMOD 2024) attacks the problem without any advance knowledge of the predicates. The graph is built denser than a standard HNSW — each node keeps an expanded neighbour list — and at query time the traversal considers only filter-passing neighbours, reaching out to neighbours-of-neighbours when the immediate neighbourhood has been mostly filtered away. The build is more expensive and the index larger, but no query-log analysis or per-filter sub-index is needed, and recall holds up even for predicates the index has never seen. Weaviate added an ACORN-style filter strategy in late 2024; the other open-source engines still rely on the inline-filter approach described above.

Bridge to hybrid retrieval

Filtered search is the second of the three composition problems in real-world retrieval. The first is pure vector search (covered in the HNSW and IVF chapters). The third — and the subject of the next chapter — is hybrid retrieval: combining BM25 keyword scores with vector similarity scores, and filters, in a single ranked list. The natural composition is to run BM25 with its filter-pushdown into the inverted index (free intersection), run vector search with the same filter via inline filtering, then fuse the two ranked lists with a method like reciprocal rank fusion. The hybrid retrieval chapter builds out the fusion algorithms and shows how Weaviate, Vespa, and OpenSearch run all three signals through a single pipeline. Filtered search is a prerequisite — without it, you cannot push the filter down past the ANN layer, and the whole hybrid pipeline collapses to post-filter at the fusion step, with all the recall hazards from this chapter.
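As a preview of that fusion step, here is a minimal reciprocal-rank-fusion sketch — the standard 1/(60 + rank) formula — over the two ranked ID lists that a filtered BM25 leg and the filtered vector leg would return. bm25_ids is an assumed output of the keyword side; the vector side reuses inline_filter_search from the sketches above.

def reciprocal_rank_fusion(ranked_lists, k_const=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k_const + rank(d))."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_const + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_ids is assumed to come from the keyword leg (an inverted-index query
# with the same filter pushed down); the vector leg is the inline-filtered HNSW.
vector_ids = list(inline_filter_search(q, k=50))
fused = reciprocal_rank_fusion([bm25_ids, vector_ids])[:10]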

References

  1. Patella, M., et al. Filtered Approximate Nearest Neighbor Search: A Survey. arXiv:2310.04803, 2024.
  2. Pinecone — Metadata filtering.
  3. Weaviate — Filtering with hybrid search.
  4. Milvus — Filtered search.
  5. Qdrant — Filtering payload.
  6. Malkov, Y., Yashunin, D. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320, 2016.
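  7. Patel, L., Kraft, P., Guestrin, C., Zaharia, M. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. SIGMOD 2024.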