In short
Lexical retrieval (BM25, chapter 146) and dense vector retrieval each fail in characteristic ways. BM25 cannot tell that car and automobile are the same concept — it only knows about literal token overlap, so a query for "car insurance" misses every document that calls it "automobile insurance". A neural embedding model (text-embedding-3-large, bge-large, e5) maps both phrases into nearly identical 1024-dimensional vectors and retrieves them together — but the same model treats Samsung Galaxy S23 as a slightly fuzzy version of "Samsung phone", and a query for that exact SKU surfaces the S22, the S24, third-party cases, and an article about Korean phone manufacturers before it surfaces the actual S23 product page. Hybrid retrieval runs both indexes in parallel and fuses the two ranked lists. The simplest and most robust fusion is Reciprocal Rank Fusion: \text{score}(d) = \sum_i 1/(k + \text{rank}_i(d)), summed across each retrieval method, with k = 60 (the constant from the original Cormack–Clarke–Büttcher 2009 paper). RRF needs no per-system score normalisation, has one nearly-untunable knob, and on standard benchmarks (BEIR, MSMARCO) gives a 5–15% nDCG@10 improvement over either retriever alone. The other piece of the chapter is filters — structured predicates like category:electronics or price < 50000 that have nothing to do with relevance scoring but everything to do with whether the result is shippable. Filters can be applied pre-retrieval (fast, restricts the candidate set both indexes search over, but requires that both indexes know about the filter fields) or post-retrieval (flexible, retrieve top-k from both then drop non-matching, but can wreck recall when the filter is selective). Production systems — Elasticsearch 8.x's rrf retriever, Vespa's native hybrid scoring, Pinecone's hybrid mode, Weaviate, Qdrant — all expose this same pattern, varying mostly in whether they fuse at score level or rank level. 
This chapter implements RRF in 30 lines of Python over a Flipkart-style "running shoes for marathon" query, shows the BM25 list and the vector list and the fused list side by side, then derives why pre-filtering wins for selective filters and post-filtering wins for permissive ones. The next chapter (151) opens Build 19 and shows you how the vector index that hybrid retrieval depends on actually works.
You finished Build 18 with a working distributed lexical search engine — inverted indexes, BM25 ranking, sharding, scatter-gather. A query for samsung phone returns the right Samsung phone product pages within 50 ms across a billion documents. Done.
Except a user types affordable Korean smartphone and gets nothing.
The Samsung Galaxy A14 is, by every reasonable definition, an affordable Korean smartphone. The product page calls it a "budget Samsung 5G phone" — three words none of which appear in the query. BM25 sees zero token overlap and ranks it nowhere. Meanwhile a different user types the exact SKU Samsung Galaxy S23 Ultra 256GB Phantom Black, and a pure dense-vector retriever happily returns the S22 Ultra, the S24 Ultra, a third-party case for the S23, and a news article about Samsung's flagship line — all "semantically near" but none the actual product page the user asked for.
Each retriever fails where the other succeeds. The fix, by 2026 thoroughly mainstream, is to run both and fuse the results. This is hybrid retrieval, and it closes Build 18 because it is the natural endpoint of the lexical retrieval stack and the natural entry point into Build 19's vector indexes.
The thesis: two retrievers, two failure modes
Lexical retrieval — BM25 over an inverted index — scores documents by literal token overlap with the query, weighted by IDF and length-normalised. Its strength is exactness. If you type a part number, a brand name, a person's name, a chemical formula, or any other rare token, BM25 returns the documents that contain that exact token. Its weakness is vocabulary mismatch. If your query and the relevant document use different words for the same concept — car versus automobile, marathon versus long-distance running, falooda versus rose milk dessert — BM25 returns nothing. The IDF that makes BM25 strong on rare tokens makes it brittle on synonymy.
Dense retrieval — cosine similarity over neural embedding vectors — scores documents by semantic similarity in a learned vector space. Its strength is exactly the BM25 weakness: a query and a document about the same concept land near each other in the 768-or-1024-dimensional embedding space even when they share zero tokens. Its weakness is exactly the BM25 strength: the embedding model's training data taught it that "Samsung Galaxy S22" and "Samsung Galaxy S23" are very similar phrases, because they are very similar phrases — but to a user who specifically wants the S23, that "similarity" is wrong. Dense retrievers also fail on out-of-vocabulary tokens — newly released SKUs, scientific terms not in the training corpus, code identifiers, IPv6 addresses, anything the embedding model has never seen and therefore embeds with little signal.
The empirical evidence is consistent across benchmarks. On BEIR (Thakur et al. 2021, an 18-dataset zero-shot retrieval benchmark), pure dense retrievers like DPR or Contriever beat BM25 on some datasets (FiQA, ArguAna) and lose to BM25 on others (Touché-2020, FEVER, BioASQ). No single retriever wins everywhere. Hybrid retrievers — combining BM25 with a dense retriever — beat both individually on every BEIR dataset measured. The improvement averages 5–15% in nDCG@10 over the better of the two individual retrievers, with the largest gains on datasets that mix vocabulary-mismatch queries with rare-token queries (which is, in practice, every real-world search log). Hybrid is not a marginal win; it is the dominant strategy.
The hybrid pipeline
A hybrid retrieval system is structurally simple: query goes to two indexes in parallel, each returns its top-k candidates, a fusion step combines them into a single ranked list, optional filters either pre-restrict the candidate set or post-prune the results, and an optional re-ranker applies a more expensive model to the top of the fused list. The complexity hides in the fusion step.
Reciprocal Rank Fusion: the default for a reason
You have two ranked lists, L_{\text{BM25}} and L_{\text{vec}}. Each is a list of (document, score) pairs sorted by score descending. You want to produce a single combined ranked list. The naive approach — add the two scores together — fails immediately because the two scores are not on the same scale. BM25 scores are unbounded positive reals that depend on document length, IDF magnitudes, and the number of query terms; cosine similarity is bounded in [-1, 1] but in practice for a trained model lives mostly in [0.2, 0.9]. Adding 17.3 to 0.71 is meaningless arithmetic.
You could normalise — divide each score by the max score in its list, or apply a min-max rescale — but normalisation is fragile. The max score in a list of 100 candidates fluctuates per query, so the same document can be normalised to 0.85 in one query and 0.40 in another depending on what else happened to be in the list. That per-query instability leaks directly into the fused ranking.
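A tiny numeric sketch makes the instability concrete — the raw scores below are invented, but the mechanism is general: the same raw BM25 score normalises very differently depending on the rest of the list.

```python
# Why per-query min-max normalisation is unstable: the same raw BM25
# score of 8.0 lands at very different normalised values depending on
# what else ranked for the query. Scores below are invented.
def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

query_a = [17.3, 8.0, 2.1]  # one dominant hit inflates the max
query_b = [9.0, 8.0, 2.1]   # same document's score, no dominant hit

print(round(min_max(query_a)[1], 3))  # 0.388
print(round(min_max(query_b)[1], 3))  # 0.855
```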
Reciprocal Rank Fusion sidesteps the score-normalisation problem entirely by ignoring scores and using only ranks:

\text{score}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}

where \text{rank}_i(d) is the 1-indexed position of d in retriever i's ranked list (or \infty if d is not in that list, giving a contribution of zero), and k is a constant — Cormack et al.'s original 2009 paper used k = 60, and that value has stuck because it works.
Why 1/(k + \text{rank}) and not just 1/\text{rank}: if you used 1/\text{rank}, the gap between rank 1 and rank 2 would be a huge 1/1 - 1/2 = 0.5, while the gap between rank 99 and rank 100 would be a tiny 0.0001. A document at rank 1 in one retriever and absent in the other would always beat a document at rank 2 in both — too aggressive. Adding a constant k to the denominator damps the early-rank dominance. With k = 60, rank 1 contributes 1/61 \approx 0.0164 and rank 2 contributes 1/62 \approx 0.0161 — a 2% gap, not a 100% gap. A document at rank 5 in both retrievers (2 \cdot 1/65 \approx 0.0308) now beats a document at rank 1 in only one (1/61 \approx 0.0164). The constant tunes how strongly the system rewards consensus across retrievers versus brilliance from any single retriever.
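The arithmetic in the paragraph above can be checked directly; a quick sketch:

```python
# How k damps early-rank dominance in 1/(k + rank).
def contribution(rank: int, k: int) -> float:
    return 1.0 / (k + rank)

# With k = 60, rank 5 in both retrievers beats rank 1 in just one:
consensus = 2 * contribution(5, 60)  # 2/65 ≈ 0.0308
solo_top = contribution(1, 60)       # 1/61 ≈ 0.0164
print(consensus > solo_top)          # True

# With k = 0 (plain 1/rank) the solo rank-1 hit dominates instead:
print(2 * contribution(5, 0) > contribution(1, 0))  # False
```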
RRF has three properties that explain its dominance:
- No score normalisation. It works on ranks, which are dimensionless. You can fuse BM25 and dense vectors and TF-IDF and a custom rule-based scorer all together with no per-system calibration.
- One nearly-untunable knob. k = 60 works for almost every workload. Lin et al. (2021) tested k \in [10, 100] on MSMARCO and found nDCG@10 changes by less than 1% across that range. You can ship the default and forget it.
- Naturally rewards consensus. A document that appears in the top-10 of both retrievers gets a higher RRF score than one that appears at rank 1 in just one retriever. This is the right intuition: agreement across independent signals is stronger evidence than a brilliant guess from one signal.
The two main alternatives have trade-offs:
Linear interpolation computes \text{score}(d) = \alpha \cdot \tilde{s}_{\text{BM25}}(d) + (1 - \alpha) \cdot \tilde{s}_{\text{vec}}(d) where \tilde{s} are normalised scores and \alpha \in [0, 1] is tuned per workload. This can be slightly more accurate than RRF when you have a labelled tuning set to fit \alpha on, but requires (a) a normalisation scheme and (b) per-domain \alpha tuning. Pinecone's hybrid mode uses this with \alpha exposed as a query parameter.
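A minimal sketch of linear interpolation, assuming per-query min-max normalisation; the doc IDs and raw scores are invented for illustration.

```python
# Linear-interpolation fusion with per-query min-max normalisation.
def linear_fuse(bm25: dict, vec: dict, alpha: float = 0.5):
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {d: (s - lo) / span for d, s in scores.items()}
    nb, nv = norm(bm25), norm(vec)
    fused = {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nv.get(d, 0.0)
             for d in set(nb) | set(nv)}
    return sorted(fused.items(), key=lambda x: -x[1])

bm25_scores = {"P1": 17.3, "P2": 9.1, "P3": 4.0}   # unbounded BM25 scale
vec_scores = {"P2": 0.82, "P3": 0.71, "P4": 0.55}  # cosine scale
fused = linear_fuse(bm25_scores, vec_scores, alpha=0.5)
print(fused[0][0])  # P2 — present in both channels, top after fusion
```

Note the fragility the paragraph warns about: the normalised value of every document depends on the per-query min and max, which is exactly why \alpha needs tuning on a labelled set.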
Cascaded re-ranking runs BM25 first as a cheap recall stage to get top-1000 candidates, then re-ranks those 1000 with a more expensive model — either a dense retriever's cosine score or a cross-encoder that scores (query, document) pairs jointly. This is not strictly hybrid retrieval — only one retriever sees the full corpus — but it is the standard production pattern for systems where the expensive model is too slow to run on the full corpus. Web search and most RAG retrievers use cascaded re-ranking with BM25 (or a dense retriever) as the cheap first stage.
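A sketch of the cascade shape — `bm25_search` and `score_pair` below are placeholder callables standing in for a real index and a real cross-encoder, not any library's API:

```python
# Two-stage cascade: cheap recall stage, expensive precision stage.
def cascade(query, bm25_search, score_pair, recall_k=1000, final_k=10):
    candidates = bm25_search(query, recall_k)  # cheap: full-corpus BM25
    reranked = sorted(candidates, key=lambda doc: -score_pair(query, doc))
    return reranked[:final_k]                  # expensive model sees only top recall_k

# Toy stand-ins to show the flow:
fake_index = lambda q, k: ["d1", "d2", "d3"][:k]
relevance = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(cascade("q", fake_index, lambda q, d: relevance[d],
              recall_k=3, final_k=2))  # ['d2', 'd3']
```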
In practice: if you have no labels, ship RRF with k = 60. If you have labels and care about every fraction of nDCG, fit linear interpolation. If your re-ranker is expensive (a 400M-parameter cross-encoder), use cascading.
Implementing RRF in 30 lines
```python
# rrf.py — reciprocal rank fusion of BM25 and dense vector retrievers
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """
    rankings: a list of ranked doc-id lists, one per retriever.
              Each list is sorted best-first.
    k: the RRF constant (60 is the Cormack et al. default).
    Returns: list of (doc_id, rrf_score) sorted by score descending.
    """
    score = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-indexed
            score[doc_id] += 1.0 / (k + rank)
    return sorted(score.items(), key=lambda x: -x[1])

# Imagine two retrievers returned these top-10 lists for the
# query "running shoes for marathon" on a Flipkart-like catalogue:
bm25_results = [
    "P3", "P1", "P9", "P7", "P5", "P12", "P14", "P2", "P8", "P21",
]  # all contain literal "running shoes"
vec_results = [
    "P5", "P3", "P11", "P1", "P15", "P7", "P22", "P9", "P30", "P2",
]  # semantic match: "marathon footwear", "training sneakers", etc.

fused = rrf([bm25_results, vec_results], k=60)
for doc, s in fused[:10]:
    print(f"{doc} rrf_score={s:.5f}")
```
Output:

```
P3 rrf_score=0.03252    # rank 1 BM25, rank 2 vec → 1/61 + 1/62
P5 rrf_score=0.03178    # rank 5 BM25, rank 1 vec → 1/65 + 1/61
P1 rrf_score=0.03175    # rank 2 BM25, rank 4 vec → 1/62 + 1/64
P7 rrf_score=0.03078    # rank 4 BM25, rank 6 vec
P9 rrf_score=0.03058    # rank 3 BM25, rank 8 vec
P2 rrf_score=0.02899    # rank 8 BM25, rank 10 vec
P11 rrf_score=0.01587   # only in vec list (rank 3)
P15 rrf_score=0.01538   # only in vec list (rank 5)
P12 rrf_score=0.01515   # only in BM25 list (rank 6)
P14 rrf_score=0.01493   # only in BM25 list (rank 7)
```
P3 wins because it appears near the top of both lists — the consensus signal RRF rewards. P5 is the top vector hit but only rank 5 in BM25, so it lands second. Documents that appear in only one list (P11, P12, P14, P15) cluster below the consensus documents because they only contribute one 1/(k + \text{rank}) term.
Worked example: "running shoes for marathon" on an Indian e-commerce catalogue
Setup. A Flipkart-style catalogue with 5 candidate products. The user types running shoes for marathon. The lexical retriever and the dense retriever each return their top-5.
| ID | Title | BM25 rank | Vec rank |
|---|---|---|---|
| P1 | "Asics Gel-Kayano 30 Running Shoes Men Cushioned" | 1 | 4 |
| P2 | "Nike Pegasus 41 Marathon Training Sneakers" | 4 | 1 |
| P3 | "Hoka Clifton 9 Long-Distance Footwear Lightweight" | — | 2 |
| P4 | "Adidas Adizero Boston 12 Running Shoes Marathon" | 2 | 3 |
| P5 | "Puma Velocity Nitro 3 Running Shoes Daily Trainer" | 3 | — |
BM25 alone ranks: P1 > P4 > P5 > P2. P3 has zero token overlap with the query (no "running", no "shoes", no "marathon" — its title says "long-distance footwear"), so BM25 never retrieves it. P5 ranks third because it contains both "running shoes" tokens but lacks "marathon".
Vector alone ranks: P2 > P3 > P4 > P1. P3 is rank 2 because the embedding model maps "long-distance footwear" near "marathon shoes" in vector space. P5 is missing because the model decided "daily trainer" is semantically closer to "casual sneakers" than to "marathon", which is a defensible but arguable call. P1 drops to rank 4 because "cushioned" pulls its embedding slightly off-axis from a marathon-running query.
RRF with k = 60:
| ID | BM25 rank | 1/(60+r) | Vec rank | 1/(60+r) | RRF total |
|---|---|---|---|---|---|
| P1 | 1 | 0.01639 | 4 | 0.01563 | 0.03202 |
| P2 | 4 | 0.01563 | 1 | 0.01639 | 0.03202 |
| P4 | 2 | 0.01613 | 3 | 0.01587 | 0.03200 |
| P3 | — | 0 | 2 | 0.01613 | 0.01613 |
| P5 | 3 | 0.01587 | — | 0 | 0.01587 |
Fused ranking: P1 ≈ P2 > P4 > P3 ≈ P5. Why this is the right answer: P4 (Adidas Adizero Boston Marathon) is genuinely the best match — explicit marathon shoe, both retrievers ranked it in the top 4, RRF surfaces it. P1 and P2 tie just above it because each is rank 1 in one retriever. P3 (Hoka Clifton) is the semantic-only catch — BM25 missed it entirely, but the vector retriever found it and RRF includes it in the top-4. P5 (Puma daily trainer) is the lexical-only catch — vector dropped it because the model judged "daily trainer" as off-topic, but BM25 included it and RRF keeps it as a reasonable backup. A pure-BM25 system would have missed P3; a pure-vector system would have missed P5; hybrid covers both.
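The totals in the table can be re-derived mechanically from the ranks alone; a self-contained check:

```python
# Re-derive the worked-example RRF totals from the table's ranks.
k = 60
ranks = {  # doc_id: (bm25_rank, vec_rank); None = not retrieved
    "P1": (1, 4), "P2": (4, 1), "P3": (None, 2),
    "P4": (2, 3), "P5": (3, None),
}

def rrf_total(pair):
    return sum(1.0 / (k + r) for r in pair if r is not None)

totals = {d: rrf_total(p) for d, p in ranks.items()}
for d, t in sorted(totals.items(), key=lambda x: -x[1]):
    print(f"{d} {t:.5f}")  # P1 and P2 tie at 0.03202, P4 at 0.03200
```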
A real ranking would also weight by business signals — stock availability, conversion rate, the seller's rating, recency of inventory — but those layer on top of the retrieval ranking; they do not replace it. The retrieval question is "which 50 of the million products are even candidates?" and hybrid is the answer.
Filters: pre or post
Filters are structured predicates. They are not relevance scoring. The user types running shoes for marathon under ₹5000 in stock and Prime-eligible, and under ₹5000, in stock, Prime-eligible are filters. They have nothing to do with text similarity; they have everything to do with whether the result is shippable and affordable. The BM25 and vector retrievers return documents that match the query; the filters then decide which of those documents the user is allowed to see.
Two architectural choices, each with sharp trade-offs.
Pre-filter applies the predicate at the index level. The inverted index has a posting for each category value, so category=shoes becomes another postings-list intersection — fast. The vector index supports a metadata filter — Pinecone, Qdrant, Weaviate, and Vespa all expose this; the index only considers vectors whose metadata matches. The retrievers then run BM25 and ANN search only over the filtered subset. Pre-filter is the right default when filters are part of the standard schema (category, price range, region, language) and known at index time. Why this is the natural choice: the candidate set is smaller, both retrievers do less work, and recall is preserved — every relevant filter-passing document is a candidate. The trade-off is that the vector index's ANN structure (HNSW, IVF) can degrade when a metadata filter is very selective, because the graph traversal has to skip many neighbours; this is a real engineering problem, addressed in chapter 153 on vector indexing strategies.
Post-filter retrieves unfiltered, then drops non-matching from the final result list. This is the right choice when the predicate is dynamic (a user-defined SQL-style filter the index has not seen before), or when the filter is permissive (matches most of the candidates anyway, so pre-filtering does not save much work). The catastrophic failure mode is when the filter is highly selective: if you retrieve top-100 and the filter matches only 2 of those, you return 2 results when the user wanted 10. The relevant 8 documents do exist in the corpus — you just never asked the retriever to look past rank 100. The fix is to over-retrieve (fetch top-1000 instead of top-100) but this only postpones the failure; for very selective filters the only correct answer is pre-filtering.
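A toy sketch of the failure mode, assuming a corpus where 1 in 100 documents passes the filter and the ranked list is simply document order (a stand-in for any relevance ranking):

```python
# Post-filter recall failure with a selective predicate.
corpus = [{"id": i, "in_stock": i % 100 == 0} for i in range(100_000)]

def retrieve_top(k):
    return corpus[:k]  # stand-in for a real ranked retrieval

def post_filter(k, want=10):
    survivors = [d for d in retrieve_top(k) if d["in_stock"]]
    return survivors[:want]

# top-100 retrieval leaves 1 of the 10 results the user wanted;
# over-retrieving to top-1000 recovers them, at 10x the fetch cost.
print(len(post_filter(100)), len(post_filter(1000)))  # 1 10
```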
A common pattern in production is hybrid filtering: apply cheap, well-indexed filters pre-retrieval (category, region), and apply expensive or dynamic filters post-retrieval (requires custom permission check, must match a complex predicate the index does not understand). Elasticsearch's bool query supports both via filter clauses on the retrieval side and post_filter clauses on the result side.
What production systems ship
By 2026 every major retrieval system exposes hybrid as a first-class query mode; the differences are mostly in the fusion math and how filters interact.
- Elasticsearch 8.x ships an `rrf` retriever that takes a `standard` retriever (BM25 over `text` fields) and a `knn` retriever (cosine over `dense_vector` fields) and fuses them with k = 60. Filters apply pre-retrieval to both branches via the standard `filter` clause. This is the closest production system to the textbook RRF pattern.
- Vespa treats hybrid as a first-class ranking expression: a single `rank-profile` can combine `bm25(field)` and `closeness(vector)` with arbitrary arithmetic. Vespa exposes both score-blending and RRF via configuration. Vespa was hybrid-native years before the others caught up.
- Pinecone added hybrid mode in 2023 using sparse-dense vectors: BM25 is encoded as a sparse vector over the vocabulary, the dense vector is the neural embedding, and the score is a convex combination \alpha \cdot \text{sparse} + (1-\alpha) \cdot \text{dense} with \alpha exposed per query. Filters are pre-applied via metadata.
- Weaviate ships a `hybrid` query operator that fuses BM25 and vector search using either RRF or a convex combination, with \alpha tunable per query. Filters via the `where` clause are pre-applied.
- Qdrant added hybrid in 2024 with sparse-dense vectors and a `Query.fusion` API exposing RRF and DBSF (Distribution-Based Score Fusion). Filters are pre-applied via the `must`/`must_not` payload index.
- OpenSearch has a hybrid search plugin with RRF and normalisation-based fusion, mirroring the Elasticsearch API.
- MongoDB Atlas Search added a `$rankFusion` aggregation stage in 2024 for combining text and vector pipelines.
The convergence is striking — five years ago none of these systems supported hybrid as a first-class operation; today all of them do, and the fusion algorithm of choice is RRF with k = 60.
Going deeper
Four threads worth pulling: score-level versus rank-level fusion mathematics, learned fusion (LambdaMART over retriever scores as features), the SPLADE family (sparse neural retrievers that try to be both lexical and semantic in one model), and why hybrid degrades for very short queries.
Learned fusion
RRF is unsupervised — it works without any training data. If you have a labelled relevance dataset (queries with ground-truth relevant documents), you can do better. Learning-to-rank treats each retriever's score as a feature and trains a gradient-boosted ranker (LambdaMART, LightGBM) to combine them, optionally with additional features like query length, candidate document length, click-through rate, and field-specific BM25 scores (BM25 over title, BM25 over body). This is the production pattern at Google, Bing, Yandex, and most large-scale e-commerce search teams. The trade-off is that you need (a) clicks or labels, (b) a feature pipeline, (c) periodic retraining as data drifts. RRF wins for most systems that lack one of those.
Why hybrid degrades for very short queries
A one-token query like samsung is purely lexical — there is no semantic context for the embedding model to do useful work with. The vector retriever returns documents whose embeddings are near the embedding of "samsung", which is mostly other documents that mention Samsung — the same set BM25 already returns. The vector channel adds noise rather than signal. For very short queries, dropping back to pure BM25 is often the right move; some systems detect query length and switch fusion strategies dynamically.
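A sketch of the dynamic switch; the 2-token threshold here is an illustrative assumption, not a benchmarked value — real systems tune it against their own query logs.

```python
# Hypothetical routing rule: pure BM25 for very short queries,
# hybrid RRF otherwise. Threshold is an assumption for illustration.
def choose_strategy(query: str) -> str:
    n_tokens = len(query.split())
    return "bm25_only" if n_tokens <= 2 else "hybrid_rrf"

print(choose_strategy("samsung"))                     # bm25_only
print(choose_strategy("running shoes for marathon"))  # hybrid_rrf
```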
SPLADE: sparse neural retrievers
A separate research direction, popularised by SPLADE (Formal et al. 2021), trains a transformer to output a sparse vector over the BERT vocabulary (30k dimensions) that acts like a learned IDF-weighted bag of words. SPLADE retrievers can be served with the same inverted-index infrastructure as BM25 (each non-zero dimension is a "token" in a learned vocabulary), get most of the semantic-matching benefit of dense retrievers, and naturally compose with BM25 via plain score addition. SPLADE-v2 and BGE-M3 are the leading 2025-era sparse-neural retrievers.
Filters and ANN graph degradation
When a vector index is HNSW or IVF and you apply a selective metadata filter, the graph traversal or inverted-list scan has to skip many neighbours that fail the filter. For very selective filters (matching <1% of vectors), naïve filter-during-search can degrade ANN recall by 10× because the graph was built assuming dense connectivity that the filter destroys. The fix is either (a) rebuild the graph per filter category — only practical for low-cardinality filter fields — or (b) increase the ANN search-list size to compensate, accepting higher latency. Chapter 153 on vector indexes covers this in detail.
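Option (b) has a simple back-of-envelope shape: to collect the same number of filter-passing candidates, the ANN search list must grow roughly inversely with filter selectivity. A sketch, under that rough assumption:

```python
# Back-of-envelope: if a filter keeps only `selectivity` of the corpus,
# a filter-during-search traversal must visit roughly 1/selectivity
# times more nodes to gather the same number of passing candidates.
def required_search_list(base_ef: int, selectivity: float) -> int:
    return round(base_ef / selectivity)

print(required_search_list(100, 0.5))   # 200 — mild filter, mild cost
print(required_search_list(100, 0.01))  # 10000 — latency blows up
```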
Wrapping up Build 18, opening Build 19
Build 18 took you from raw text to a distributed, ranked, filter-aware search engine: tokenization, stemming, inverted indexes, postings compression, BM25, phrase queries, distributed scatter-gather, and now hybrid retrieval that fuses BM25 with a dense vector retriever. Every piece is implementable from first principles and every piece is what a real production system actually ships.
Hybrid retrieval is also the bridge to Build 19. The vector retriever in this chapter was a black box — query and corpus get embedded into 1024-dim vectors, top-k by cosine similarity comes back. How that top-k comes back fast — without scanning every vector in the corpus — is the entire subject of vector indexing. Naïve cosine similarity over a billion vectors at 1024 dimensions means roughly 4 TB of vector data scanned per query; HNSW does it in under 10 ms. The next chapter (151) opens Build 19 with the foundation: what embeddings are, what similarity metrics make sense in high-dimensional space, and the curse of dimensionality that makes naïve approaches fail. Then chapter 152 and chapter 153 build the two index families — IVF and HNSW — that make hybrid retrieval feasible on production-scale corpora.
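The 4 TB figure above is plain arithmetic:

```python
# Brute-force cosine scan cost for the corpus described above.
n_vectors = 1_000_000_000  # one billion documents
dims = 1024                # embedding dimensionality
bytes_per_float = 4        # float32
scan_bytes = n_vectors * dims * bytes_per_float
print(scan_bytes / 1e12)   # 4.096 — about 4 TB read per query
```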
You have completed the lexical retrieval stack. The next stack is geometric.
References
- Lin, J., Nogueira, R. & Yates, A. (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. Synthesis Lectures on Human Language Technologies, Morgan & Claypool. The standard reference for combining lexical and dense retrieval; chapter 5 covers RRF and hybrid systems.
- Cormack, G., Clarke, C. & Büttcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR. The original RRF paper, source of the k = 60 default.
- Pinecone — Hybrid search with sparse-dense vectors. Pinecone documentation.
- Elastic — Hybrid search with RRF and dense vectors in Elasticsearch 8.x. Elasticsearch engineering blog.
- Vespa — Hybrid retrieval and ranking. Vespa documentation.
- Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets. The benchmark that established hybrid retrieval as the default winning strategy across diverse domains.