In short
Once your inverted index can answer which documents contain samsung phone, the next question is which of those documents the user actually wants. Returning them in docID order is how you build a search box that nobody uses. The classical answer is TF-IDF: score each document by summing, over the query terms, term frequency (how often the word appears in this doc) times inverse document frequency (how rare the word is across the corpus). TF rewards "this doc is about the term"; IDF kills filler words like "the" whose IDF collapses to zero because they appear everywhere. TF-IDF is correct in spirit but has two structural defects: it does not normalise for document length (a 500-word product page beats a 50-word one mechanically because it has more chances to repeat any word), and TF grows without bound (a doc with samsung 100 times is not 100× more relevant than one with samsung once). BM25 — Stephen Robertson's Best Match 25, derived in the 1990s from the probabilistic relevance framework — fixes both with two knobs: k_1 \approx 1.2 caps TF with a saturating curve, and b \approx 0.75 scales the length penalty. The result consistently won TREC benchmarks, became the Elasticsearch default in 5.0 (2016), and is what Lucene, Solr, Tantivy, Quickwit, and OpenSearch all ship today. Postgres ts_rank_cd is a cousin. The modern semantic-search stack — Vespa, Weaviate, the retrievers behind RAG pipelines — runs BM25 in parallel with dense vector similarity and fuses the two. This chapter implements both scorers in 70 lines of Python, runs them over a five-doc Flipkart-style corpus, and shows you the exact moment BM25's length normalisation flips the ranking.
You have built an inverted index (ch. 143), tokenized and stemmed your text (ch. 144), and compressed the postings lists (ch. 145). A user types samsung phone into your Flipkart-clone, and your index returns the postings list intersection: a thousand product pages that contain both words.
Now what?
Returning them in insertion order — [doc7, doc42, doc88, ...] — is a search engine in the same way that a phone book in random order is a phone book. The interesting work is ranking: of the thousand matching documents, which ten should appear first? This chapter is about the function that answers that question. The classical answer, taught in every information retrieval textbook since the 1970s, is TF-IDF. The better answer, default in production since Elasticsearch 5.0 in 2016, is BM25. We will derive the first, expose its structural flaws, then derive the second as the principled fix.
The thesis: relevance is a score, not a yes/no
A boolean inverted index treats matching as binary. Either a document contains the query terms, or it does not. That is fine for a SQL WHERE col LIKE '%samsung%' lookup, but it is the wrong model for human search. A user typing samsung phone has a continuous notion of how good each match is — a product page about the Samsung Galaxy S25 is a great match, a blog post mentioning that the author owns a Samsung phone in passing is a worse match, and a forum thread about "phones I will never buy, including Samsung" is technically a match but useless.
A scoring function assigns a real number \text{score}(q, d) to each query-document pair. You sort by score, take the top k, and return them. The entire question of search ranking is: what scoring function? Why: the inverted index gives you the candidate set in milliseconds, but choosing among the candidates is what determines whether the user finds what they wanted on result one or scrolls to result fifty.
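Mechanically, the ranking step itself is tiny. A minimal sketch, in which score is a placeholder for whatever function the rest of this chapter builds:

import heapq

# Ranking reduced to its skeleton: score every candidate, keep the best k.
# `score(query, doc)` is a stand-in; TF-IDF and BM25 below are two ways to fill it in.
def top_k(query, candidates, score, k=10):
    return heapq.nlargest(k, candidates, key=lambda doc: score(query, doc))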
The earliest answer, and still the right place to start, is term frequency × inverse document frequency.
TF-IDF in three lines
Two intuitions, multiplied together.
TF (term frequency). How often does the query term appear in this document? More is better — a document that says samsung four times is probably more about Samsung than one that says it once. The simplest version is just the raw count, \text{tf}(t, d).
IDF (inverse document frequency). How rare is the term across the entire corpus? Rarer terms carry more signal. The word the appears in essentially every English document — it tells you nothing about which document the user wants. The word neural might appear in 100 of a billion documents — finding it in this document is strong evidence. The classical formula is
\text{idf}(t) = \log\frac{N}{\text{df}(t)},

where N is the total number of documents and \text{df}(t) is the number of documents containing t. Why the log: raw N/\text{df} would explode for rare terms (a term in 1 of a billion docs would have IDF a billion); the logarithm tames the scale so that "rare" and "very rare" terms differ by a small constant rather than orders of magnitude.
Score. Sum over query terms:
\text{score}(q, d) = \sum_{t \in q} \text{tf}(t, d) \cdot \text{idf}(t)

That is the whole formula. Three lines, fifty years of practice.
The dominant property of TF-IDF, the thing that makes it work at all, is the IDF kill of stop words. Why: in a corpus where the appears in 999 million of a billion documents, \log(10^9 / (9.99 \times 10^8)) \approx 0.001 — essentially zero. Even if the appears 50 times in a document, its contribution to the score is 50 × 0.001 = 0.05, swamped by any rare query term. This is why classical search engines often did not even need a stop-word list: TF-IDF eliminated stop words automatically.
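The same arithmetic in code, with made-up document frequencies mirroring the figures above:

import math

# Made-up corpus statistics: IDF silences "the" and amplifies "neural".
N = 1_000_000_000
df = {"the": 999_000_000, "neural": 100}
for term, d in df.items():
    print(f"{term:>7}: idf = {math.log(N / d):.4f}")
# the    : idf = 0.0010   -> 50 occurrences contribute only ~0.05 to a score
# neural : idf = 16.1181  -> a single occurrence dominates the score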
TF-IDF works. So why isn't it the answer?
Two structural defects, both visible the moment you try TF-IDF on real corpora with documents of varying length.
Defect 1: no length normalisation
A document about Samsung phones that is 1000 words long has, on average, ten times as many opportunities to mention samsung as a 100-word document about the same topic. Raw TF rewards length mechanically, not topicality. Why this matters: a long, padded SEO blog page that mentions "samsung phone" four times beats a tight, focused product description that mentions it three times — even if the latter is a much better answer to the user's query.
Defect 2: unbounded TF
Linear TF says a document with samsung 100 times is 100× more relevant than one with samsung once. That is not how human relevance works. The first mention establishes that the document is about Samsung; the second confirms it; by the tenth mention you have learned essentially all you are going to learn about whether this document is on-topic. The 100th mention should add nearly nothing.
Both defects pull in the same direction: TF-IDF systematically prefers long, repetitive documents over short, focused ones. The fix has to do two things at once — penalise length, and saturate TF.
BM25: the function that won
Stephen Robertson and Karen Spärck Jones derived BM25 in the 1990s from the probabilistic relevance framework — a model that asks, given a query and a document, what is the probability that this document is relevant? The full derivation is several pages of Bayesian algebra (see Robertson's 2009 monograph in the references); the result is a small, tractable formula with two knobs.
For each term t in the query:
\text{score}(t, d) = \text{idf}(t) \cdot \frac{\text{tf}(t, d) \cdot (k_1 + 1)}{\text{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

and the document score is the sum over query terms. Three things changed from TF-IDF:
- TF saturates. Look at the numerator \text{tf} \cdot (k_1 + 1) versus the denominator \text{tf} + k_1 \cdot (\ldots). As \text{tf} \to \infty, the ratio tends to k_1 + 1 — a finite limit. Why: doubling TF from 10 to 20 changes the score very little, because both numerator and denominator are dominated by tf at that point; the early mentions ramp the score up steeply and later mentions have diminishing returns.
- Length is normalised. The denominator carries a factor (1 - b + b \cdot |d|/\text{avgdl}), where |d| is the length of this document and \text{avgdl} is the average length across the corpus. A longer-than-average document inflates the denominator and shrinks the score. Why: this is exactly the correction the long-versus-short example above demanded — a long doc with TF=5 is penalised relative to a short doc with TF=4 because the length term is in the denominator.
- Two tunable knobs. k_1 controls how fast TF saturates (k_1 = 0 would make TF binary; k_1 \to \infty would recover linear TF). b controls how aggressively length matters (b = 0 disables length normalisation entirely; b = 1 applies it fully). The defaults that Lucene and Elasticsearch ship are k_1 = 1.2 and b = 0.75, derived empirically from TREC benchmarks across many corpora.
The IDF term in BM25 is usually a slightly modified form to avoid pathologies for terms that appear in more than half the corpus:
\text{idf}_{\text{BM25}}(t) = \log\!\left(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} + 1\right)

The +1 added to the ratio inside the log keeps the argument above 1, so the result stays non-negative even when \text{df}(t) > N/2. Why this matters in practice: without it, very common query terms could contribute negative scores and drive a document below documents that did not contain them at all — clearly wrong.
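Before the full implementation, a small numeric sketch shows the two pieces in isolation: the saturating TF component and the length factor it is divided by (the values below are illustrative):

# How the BM25 TF component saturates, and how the length factor drags it down.
K1, B = 1.2, 0.75

def tf_component(tf, dl_over_avgdl=1.0, k1=K1, b=B):
    length_factor = 1 - b + b * dl_over_avgdl
    return tf * (k1 + 1) / (tf + k1 * length_factor)

for tf in (1, 2, 5, 10, 100):
    print(f"tf={tf:>3}  avg-length doc: {tf_component(tf):.2f}  "
          f"5x-avg doc: {tf_component(tf, 5.0):.2f}")
# tf=  1  avg-length doc: 1.00  5x-avg doc: 0.38
# tf=  2  avg-length doc: 1.38  5x-avg doc: 0.65
# tf=  5  avg-length doc: 1.77  5x-avg doc: 1.12
# tf= 10  avg-length doc: 1.96  5x-avg doc: 1.49
# tf=100  avg-length doc: 2.17  5x-avg doc: 2.10   (both creeping toward k1 + 1 = 2.2)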
Implementing both, side by side
Real Python, no scikit-learn, no Lucene. The point is that the formulas above are computable in 70 lines.
# bm25_vs_tfidf.py — implement both scorers and compare
import math
from collections import Counter

# Five Indian e-commerce product descriptions
docs = {
    "D1": "samsung galaxy s25 smartphone samsung phone unlocked 5G india",
    "D2": ("buy phones online flipkart amazon best deals smartphones samsung "
           "phone covers cases accessories chargers earphones samsung galaxy "
           "samsung note samsung tab samsung wearables many phone models "
           "available pan india free delivery cash on delivery option phone "
           "warranty included emi options no cost emi credit card debit card "
           "upi payment phone returns easy refund process samsung official "
           "store online retailer best price phone deals daily"),
    "D3": ("apple iphone 15 pro phone unlocked 128gb refurbished phone phone "
           "original box phone accessories phone charger included one year "
           "warranty"),
    "D4": "oneplus 12 phone snapdragon 8 gen 3 amoled display fast charging",
    "D5": "samsung tv qled 55 inch 4k smart television no phone capability",
}

def tokenize(text):
    return [w for w in text.lower().split() if w.isalnum()]

tokens = {d: tokenize(t) for d, t in docs.items()}
N = len(docs)
avgdl = sum(len(t) for t in tokens.values()) / N

# Document frequency: how many docs contain each term?
df = Counter()
for toks in tokens.values():
    for term in set(toks):
        df[term] += 1

def idf_classic(t):
    return math.log(N / df[t]) if df[t] else 0.0

def idf_bm25(t):
    # Robertson's smoothed IDF
    return math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)

def tfidf(query, doc):
    toks = tokens[doc]
    counts = Counter(toks)
    return sum(counts[t] * idf_classic(t) for t in query if t in df)

def bm25(query, doc, k1=1.2, b=0.75):
    toks = tokens[doc]
    counts = Counter(toks)
    dl = len(toks)
    score = 0.0
    for t in query:
        if t not in df:
            continue
        tf = counts[t]
        num = tf * (k1 + 1)
        den = tf + k1 * (1 - b + b * dl / avgdl)
        score += idf_bm25(t) * num / den
    return score

query = ["samsung", "phone"]
print(f"{'doc':<5} {'len':>4} {'tf-samsung':>10} {'tf-phone':>9} "
      f"{'TF-IDF':>8} {'BM25':>6}")
for d in docs:
    dl = len(tokens[d])
    cs = Counter(tokens[d])
    print(f"{d:<5} {dl:>4} {cs['samsung']:>10} {cs['phone']:>9} "
          f"{tfidf(query, d):>8.2f} {bm25(query, d):>6.2f}")
Running this against the five documents prints (rounded):
doc len tf-samsung tf-phone TF-IDF BM25
D1 10 2 1 2.20 2.60
D2 78 5 7 8.66 1.94
D3 19 0 4 2.04 1.74
D4 9 0 1 0.51 0.66
D5 12 1 1 1.43 1.91
Read those two right-most columns carefully. TF-IDF ranks D2 first by a wide margin — it has the most repetitions of both query terms, and that wins on raw TF. BM25 ranks D1 first — D1 is a tight, focused product page where two of every ten words are query terms, and BM25's length normalisation rewards that density. D2's length penalty (1 - 0.75 + 0.75 \cdot 78/25.6) \approx 2.54 inflates its denominator enough to drag the score below D1's. That flip is the entire reason BM25 won.
Worked example: "samsung phone" on Flipkart
Setup. Three candidate documents from a product search. Average document length in the corpus is \text{avgdl} = 100 words. The query is samsung phone. Term statistics: N = 1\,000\,000 documents total; samsung appears in \text{df} = 50\,000 documents; phone appears in \text{df} = 200\,000 documents.
| Doc | Length | TF(samsung) | TF(phone) |
|---|---|---|---|
| D1 | 50 | 2 | 1 |
| D2 | 500 | 5 | 3 |
| D3 | 50 | 1 | 5 |
TF-IDF. Using \text{idf}(t) = \log(N/\text{df}(t)):
- \text{idf}(\text{samsung}) = \log(10^6 / 50\,000) = \log 20 \approx 3.00
- \text{idf}(\text{phone}) = \log(10^6 / 200\,000) = \log 5 \approx 1.61
Scores:
- D1: 2 \cdot 3.00 + 1 \cdot 1.61 = 6.00 + 1.61 = 7.61
- D2: 5 \cdot 3.00 + 3 \cdot 1.61 = 15.00 + 4.83 = 19.83 ← TF-IDF says D2 wins
- D3: 1 \cdot 3.00 + 5 \cdot 1.61 = 3.00 + 8.05 = 11.05
TF-IDF picks D2 — the long, padded document with the most raw mentions. But D2 is 10× longer than D1 and has only 2.5× as many samsung mentions. A user asking for samsung phone probably wants D1 (short, focused on Samsung) or D3 (short, focused on phones).
BM25 with k_1 = 1.2, b = 0.75, smoothed IDF:
- \text{idf}_{\text{BM25}}(\text{samsung}) = \log((10^6 - 50\,000 + 0.5)/(50\,000 + 0.5) + 1) \approx \log 20 \approx 3.00
- \text{idf}_{\text{BM25}}(\text{phone}) \approx \log 5 \approx 1.61
Length factor for each doc, L(d) = 1 - b + b \cdot |d|/\text{avgdl}:
- L(\text{D1}) = 0.25 + 0.75 \cdot 50/100 = 0.625
- L(\text{D2}) = 0.25 + 0.75 \cdot 500/100 = 4.00
- L(\text{D3}) = 0.25 + 0.75 \cdot 50/100 = 0.625
For each (term, doc) compute \text{idf} \cdot \text{tf} \cdot (k_1+1) / (\text{tf} + k_1 \cdot L(d)):
- D1: samsung 3.00 \cdot 2 \cdot 2.2 / (2 + 1.2 \cdot 0.625) = 13.20 / 2.75 = 4.80; phone 1.61 \cdot 1 \cdot 2.2 / (1 + 1.2 \cdot 0.625) = 3.54 / 1.75 = 2.02; total 6.82
- D2: samsung 3.00 \cdot 5 \cdot 2.2 / (5 + 1.2 \cdot 4.00) = 33.00 / 9.80 = 3.37; phone 1.61 \cdot 3 \cdot 2.2 / (3 + 1.2 \cdot 4.00) = 10.63 / 7.80 = 1.36; total 4.73
- D3: samsung 3.00 \cdot 1 \cdot 2.2 / (1 + 1.2 \cdot 0.625) = 6.60 / 1.75 = 3.77; phone 1.61 \cdot 5 \cdot 2.2 / (5 + 1.2 \cdot 0.625) = 17.71 / 5.75 = 3.08; total 6.85
BM25 ranks: D3 (6.85) ≈ D1 (6.82) > D2 (4.73). The two short, focused documents tie at the top; the long padded one drops to last. Why this is the right answer: D2's length factor of 4.00 multiplies the saturation term k_1 \cdot L, dragging every TF contribution down; meanwhile D2's higher raw TF cannot compensate because the numerator \text{tf} \cdot (k_1+1) saturates as TF grows.
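If you would rather not trust the hand arithmetic, a few lines of Python reproduce it from the statistics in the setup (the tiny differences from the totals above come from having rounded the IDFs to 3.00 and 1.61):

import math

# Recompute the worked example: N = 1,000,000, avgdl = 100, query "samsung phone".
N, AVGDL, K1, B = 1_000_000, 100, 1.2, 0.75
df = {"samsung": 50_000, "phone": 200_000}
docs = {"D1": {"len": 50,  "samsung": 2, "phone": 1},
        "D2": {"len": 500, "samsung": 5, "phone": 3},
        "D3": {"len": 50,  "samsung": 1, "phone": 5}}

def idf(t):
    return math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)

for name, d in docs.items():
    L = 1 - B + B * d["len"] / AVGDL
    score = sum(idf(t) * d[t] * (K1 + 1) / (d[t] + K1 * L) for t in df)
    print(name, round(score, 2))   # D1 6.82, D2 4.72, D3 6.84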
A real Flipkart query would also use position-aware scoring (a phrase match for "samsung phone" beats two scattered occurrences — see chapter 147) and field weighting (a hit in the title beats a hit in the description), but the core lift over TF-IDF comes from BM25's length normalisation alone.
Why BM25 won
Lucene shipped a TF-IDF variant as its default scoring function for the first decade and a half of its life. In version 6.0 (2016) the default flipped to BM25, and Elasticsearch 5.0, built on Lucene 6, shipped the new default the same year. Solr followed. Tantivy, Quickwit, and OpenSearch ship BM25 as the default, and Vespa exposes it as a first-class ranking feature. Why?
It wins on benchmarks. Across the TREC ad-hoc retrieval tracks of the 1990s and 2000s — the closest thing IR has to a standardised test — BM25 consistently outperformed TF-IDF and its many variants on both precision-at-k and mean average precision. The Anserini paper (Yang et al. 2017) reproduces these results on modern hardware and confirms that BM25 remains a strong baseline even against neural retrievers.
It handles document length naturally. Almost every real corpus has documents of wildly varying length: news articles versus tweets, long product descriptions versus one-line listings, Wikipedia stubs versus full articles. TF-IDF's failure mode on length is severe and well-documented. BM25's b parameter gives operators a single tunable knob to handle this.
Its parameters are interpretable. k_1 and b are not free parameters in a black-box neural net — they have direct semantic meaning, defaults that work across most corpora, and ranges that make sense to tune. An operator who notices that long documents are over-ranked can raise b from 0.75 to 0.9; one who notices repeated terms are over-counted can drop k_1 from 1.2 to 0.8.
It composes with everything. BM25 is a per-term scorer. You can sum it across query terms, weight different fields differently (BM25F, the multi-field extension), or combine it with other signals (recency, PageRank, click-through rate). Modern semantic search stacks — Vespa, Weaviate, Qdrant, OpenSearch, plus the retrievers behind RAG pipelines like LangChain's EnsembleRetriever — typically run BM25 in parallel with a dense vector similarity score (cosine on neural embeddings) and combine the two with reciprocal rank fusion or a learned linear combination. BM25 catches exact lexical matches; vectors catch semantic matches; together they beat either alone.
Where you find BM25 in production
A short tour of the codebases that ship BM25 today:
- Lucene (and therefore Elasticsearch, OpenSearch, Solr) — `BM25Similarity` is the default `Similarity` class since Lucene 6.0 / Elasticsearch 5.0. The settings k_1 and b are exposed per-field in Elasticsearch's `index.similarity` config.
- Tantivy (Rust) and Quickwit (built on Tantivy) — BM25 is the default; the `bm25_weight` API exposes the formula directly.
- MeiliSearch — uses a BM25 variant combined with a custom proximity-and-typo scoring stack; the BM25 component dominates the lexical part.
- PostgreSQL — `ts_rank_cd` is a BM25-like ranking on the GIN-indexed `tsvector` column; not strictly BM25, but it uses the same TF-saturation and length-normalisation ideas.
- Vespa — supports both `nativeRank` and `bm25` as first-class ranking expressions, with explicit k_1 and b tuning.
- Anserini (academic IR toolkit, on top of Lucene) — used as the BM25 baseline in essentially every paper comparing dense retrievers to lexical ones.
Going deeper
The derivation of BM25 from the probabilistic relevance framework, the BM25F extension for multi-field documents, BM25+ (a small fix for very-long-document corner cases), and how learning-to-rank systems use BM25 scores as input features to gradient-boosted ranking trees.
From the probabilistic relevance framework to BM25
Robertson and Spärck Jones's starting point was: given a query q and a document d, what is P(R = 1 \mid q, d), the probability that d is relevant? Under the binary independence model — terms occur independently, presence/absence is binary — this reduces to a sum of log-odds over query terms. Adding a Poisson model for term frequencies (terms within a document follow a 2-Poisson mixture: one Poisson for "elite" docs about the term, another for "non-elite" docs) and approximating the resulting log-likelihood with a saturating function yields exactly the BM25 form. The full derivation is in Robertson and Zaragoza's The Probabilistic Relevance Framework: BM25 and Beyond — about 60 pages of careful Bayesian reasoning whose endpoint is the formula at the top of this chapter.
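Compressed to its endpoints, in this chapter's notation (a sketch, not the full argument): the binary independence model gives each matched term a Robertson–Spärck Jones log-odds weight, which without relevance judgements reduces to the BM25 IDF,

w_t \approx \log\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5},

and approximating the 2-Poisson ("eliteness") frequency model with a saturating function of TF, with the length correction folded into the denominator, recovers

\text{score}(q, d) \approx \sum_{t \in q} w_t \cdot \frac{\text{tf}(t, d) \cdot (k_1 + 1)}{\text{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}.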
BM25F: per-field weighting
A real document has a title, a body, anchor text, headings, etc. A query term in the title matters more than the same term in the body. BM25F handles this by computing TF and length per field, weighting each field's TF by a per-field weight w_f, summing into a pseudo-document TF and pseudo-length, and then applying the BM25 formula once. Elasticsearch's combined_fields query takes this BM25F-style approach; multi_match with best_fields or most_fields instead scores each field separately and merges the per-field scores.
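A minimal sketch of that pseudo-document construction; the field weights and statistics below are illustrative assumptions (production BM25F implementations usually also keep a per-field b):

# Simplified BM25F sketch: weight per-field TF and length into a pseudo-document,
# then apply plain BM25 once.
K1, B = 1.2, 0.75
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def bm25f_term(idf, tf_per_field, len_per_field, avg_len_per_field):
    tf = sum(FIELD_WEIGHTS[f] * tf_per_field.get(f, 0) for f in FIELD_WEIGHTS)
    dl = sum(FIELD_WEIGHTS[f] * len_per_field.get(f, 0) for f in FIELD_WEIGHTS)
    avgdl = sum(FIELD_WEIGHTS[f] * avg_len_per_field[f] for f in FIELD_WEIGHTS)
    norm = 1 - B + B * dl / avgdl
    return idf * tf * (K1 + 1) / (tf + K1 * norm)

# One hit in the title outweighs one hit in the body:
print(bm25f_term(3.0, {"title": 1}, {"title": 6, "body": 180}, {"title": 6, "body": 180}))
print(bm25f_term(3.0, {"body": 1},  {"title": 6, "body": 180}, {"title": 6, "body": 180}))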
BM25+ and the lower-bound problem
For very long documents, the BM25 score can fall below what an unmatched document would receive — clearly wrong. Yuanhua Lv and ChengXiang Zhai proposed BM25+ in 2011: add a small constant \delta (typically 1.0) to the saturated TF term, ensuring every matched document gets at least \delta \cdot \text{idf} per query term it contains. The fix matters at the tails of long-document corpora (e.g., legal documents, full-text scientific papers).
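In code, the change is one added constant; a sketch of the per-term contribution with the usual \delta = 1.0:

# BM25+ per-term score: the "+ delta" floor guarantees every matched document
# earns at least delta * idf for the term, no matter how long the document is.
def bm25_plus_term(idf, tf, dl, avgdl, k1=1.2, b=0.75, delta=1.0):
    norm = 1 - b + b * dl / avgdl
    return idf * (tf * (k1 + 1) / (tf + k1 * norm) + delta)

# One match in a 20x-average-length document: vanilla BM25 nearly vanishes,
# BM25+ keeps an idf-sized floor.
print(bm25_plus_term(3.0, tf=1, dl=2000, avgdl=100, delta=0.0))  # ≈ 0.34
print(bm25_plus_term(3.0, tf=1, dl=2000, avgdl=100, delta=1.0))  # ≈ 3.34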
Learning-to-rank: BM25 as a feature
In learning-to-rank systems (LambdaMART, LightGBM-based rankers, neural rankers like ColBERT and SPLADE), BM25 is rarely the final ranker — it is a feature. Typical production ranker for web search:
- Retrieve top 1000 documents by BM25 (lexical recall).
- Re-rank with a neural cross-encoder (semantic precision) using BM25 score as one of several features.
This pipeline is why BM25 remains foundational even in the neural era: it is the recall engine that gets the candidate set down to a thousand documents fast enough for an expensive neural ranker to score.
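As a skeleton, the pipeline looks like this; bm25_top_k and rerank_score are hypothetical stand-ins for your lexical retriever and your re-ranking model:

# Two-stage retrieval: cheap lexical recall, then expensive re-ranking of the survivors.
def search(query, bm25_top_k, rerank_score, k_recall=1000, k_final=10):
    candidates = bm25_top_k(query, k=k_recall)   # stage 1: BM25 recall
    reranked = sorted(candidates,                # stage 2: neural / LTR precision
                      key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:k_final]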
Hybrid lexical + dense retrieval
The state of the art in 2026 for open-domain question answering and RAG retrievers is almost always hybrid. BM25 catches exact-keyword matches (product SKUs, person names, technical terms) that dense embeddings miss; dense vectors catch semantic matches ("car" ↔ "automobile") that BM25 misses. The combination is done either at score level (weighted sum) or at rank level (reciprocal rank fusion: \sum_i 1/(k + \text{rank}_i(d)) across retrievers). Elasticsearch (BM25 queries combined with kNN vector search and fused with reciprocal rank fusion), OpenSearch's hybrid query, and Vespa's first-class tensor ranking all expose this pattern in a single request.
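Reciprocal rank fusion itself is only a few lines. A sketch, assuming each retriever hands back an ordered list of document IDs (k = 60 is the constant used in the original RRF paper):

from collections import defaultdict

# Reciprocal rank fusion: merge the ranked lists from BM25 and a dense retriever.
def rrf(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["D1", "D5", "D2"]      # lexical ranking
dense_hits = ["D5", "D3", "D1"]      # embedding-similarity ranking
print(rrf([bm25_hits, dense_hits]))  # ['D5', 'D1', 'D3', 'D2']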
Wrapping up
TF-IDF was the right answer for thirty years and is still a perfectly serviceable baseline. BM25 is a principled fix for its two structural defects — TF that grows without bound, and zero awareness of document length — derived from a probabilistic model and tuned by two interpretable knobs. Every modern search system you are likely to encounter in production ships BM25 as the default lexical scorer, and even the dense-vector retrievers that came after it run BM25 in parallel because the two retrieval modes catch different kinds of relevance.
The next chapter (147) addresses one obvious gap left here: BM25 treats samsung phone as two independent terms. A document where samsung and phone appear next to each other is a better match than one that mentions samsung in paragraph one and phone in paragraph nine. Positional indexes record where each term appears in each document, so phrase queries and proximity scoring become possible.
References
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
- Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval, Chapter 6: Scoring, term weighting and the vector space model. Cambridge University Press.
- Apache Lucene — BM25Similarity API documentation.
- Yang, P., Fang, H. & Lin, J. (2017). Anserini: Enabling the Use of Lucene for Information Retrieval Research. SIGIR.
- Elastic — Practical BM25 — Part 2: The BM25 Algorithm and its Variables. Elasticsearch engineering blog.
- Lv, Y. & Zhai, C. (2011). Lower-bounding term frequency normalization (BM25+). CIKM.