In short

A plain inverted index (chapter 143) maps term → [docIDs]. That is enough for samsung AND smartphone, but it cannot answer "reliance jio" as a phrase, because it does not know whether jio appeared right after reliance or eight words later in an unrelated sentence. The fix is the positional index: each posting becomes (docID, [positions]), where positions are the word offsets at which the term occurs in that doc. The phrase-query algorithm runs in two stages — first intersect docIDs (the cheap Boolean filter from chapter 143), then for each surviving doc, walk the positional lists looking for an offset where pos(term₂) equals pos(term₁) + 1, generalising to pos(tᵢ) = pos(t₁) + (i−1) for any phrase length. Proximity queries ("machine learning"~5 — phrase within five words) use exactly the same algorithm with a looser distance bound. Carrying positions costs 3-5x the storage of a docID-only index, so every production engine offers a knob: Lucene's index_options: docs skips positions for fields you never phrase-search; many engines maintain both a position-less and a position-aware index and pick at query time. Bigram pre-indexes for very common two-word phrases ("new york", "machine learning") collapse a phrase intersection into a single lookup. Real systems — Elasticsearch, Solr, Tantivy, MeiliSearch — all default to positional indexes precisely because phrase search is one of the cheapest relevance wins available: it lifts "apple inc" above accidental apple ... inc co-occurrences without touching the ranking function.

You shipped samsung AND smartphone in chapter 143. It works. Then a user types "reliance jio" — with the quotes — into your news search. They want stories where those two words sit next to each other, not stories where the word reliance appears in paragraph one and the word jio happens to appear in paragraph six because the article rambles about every Indian telecom.

Your inverted index cannot help. Both terms have postings lists; intersecting them tells you the doc contains both words but not how far apart. You need to know the position of each occurrence inside each document.

This chapter adds that one piece — positions — and watches it change what your search engine can do.

The structural change: postings carry positions

In chapter 143 each posting was just a docID. The structure was:

samsung → [doc1, doc3, doc7, doc12, ...]

A positional index extends each posting with the list of word offsets where the term occurs in that document:

samsung → [(doc1, [3, 18, 42]), (doc3, [7]), (doc7, [12, 80]), ...]

The [3, 18, 42] for doc1 means: in document 1, the word samsung appears at word position 3 (the fourth word, zero-indexed), and again at positions 18 and 42. The list is sorted ascending — the indexer emits positions in document order, so this property is free.
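
As a sketch of how an indexer records them (assuming the same lowercase-and-split tokenization the worked example below uses; the function name is illustrative):

import re

def tokenize_with_positions(text: str) -> list[tuple[int, str]]:
    """Lowercase, extract runs of word characters, and pair each
    token with its zero-indexed word offset."""
    return list(enumerate(re.findall(r"\w+", text.lower())))

# tokenize_with_positions("Reliance Jio launched a new plan")
# -> [(0, 'reliance'), (1, 'jio'), (2, 'launched'), (3, 'a'), (4, 'new'), (5, 'plan')]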

[Figure: From plain postings to positional postings. Left panel, plain inverted index: samsung → [doc1, doc3, doc7] — tells you which docs contain the term, cannot answer "samsung galaxy" as a phrase. Right panel, positional inverted index: samsung → [(doc1, [3, 18, 42]), (doc3, [7]), (doc7, [12, 80])] — tells you which docs and at which word positions. Per term, per doc: the sorted list of word offsets; ~3-5x storage cost, enables phrase and proximity queries.]
The only change is the right-hand side: each posting stops being just a docID and becomes a `(docID, positions[])` pair. The dictionary side, the sort order by docID, and the merge primitives from chapter 143 are unchanged.

This one structural change unlocks two new query types: phrase queries ("reliance jio" — exact adjacency) and proximity queries ("machine learning"~5 — within five words). Everything else in this chapter is the algorithm that uses it.

The phrase-query algorithm

Given the phrase "t₁ t₂" (two terms, in order), you want documents where t₂ appears at the position immediately after t₁. The algorithm is two stages, and the first stage is what you already wrote.

Stage 1 — Boolean intersect by docID. Look up the docID-only postings for both terms (or, equivalently, project the docID out of each positional posting). Run the linear-time merge from chapter 143. Most documents drop out here. Why this stage matters: documents that do not contain both terms cannot possibly contain the phrase, and this is the cheap test — the Boolean intersect is exactly the operation you tuned in chapters 143 and 145, and it shrinks the candidate set by orders of magnitude before the expensive positional check runs.

Stage 2 — Positional alignment. For each surviving doc, you have two sorted lists of positions: one for t₁, one for t₂. Walk them with the familiar two-pointer merge, but the equality test is pos₂ == pos₁ + 1 rather than pos₂ == pos₁. If you find any aligned pair, the doc is a match.

For longer phrases "t₁ t₂ t₃ ... tₖ" the rule generalises: anchor on the position list of t₁, and for each candidate position p, check that t₂ has p+1, t₃ has p+2, …, tₖ has p+(k−1). Each check is O(log m) with binary search on the second list, or O(m₁ + m₂) with a synchronised walk; in practice engines use the walk because position lists for a single doc are short.
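
The binary-search variant is a few lines with the standard library — a minimal sketch of the O(log m) membership test (the helper name is illustrative):

from bisect import bisect_left

def has_position(sorted_positions: list[int], target: int) -> bool:
    """O(log m) test: does `target` occur in this sorted position list?"""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target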

[Figure: Phrase query "samsung galaxy" — the two-stage algorithm. Stage 1, Boolean intersect by docID: samsung [doc1, doc3, doc7] ∩ galaxy [doc1, doc7, doc9] → survivors [doc1, doc7] (linear merge, O(|a|+|b|)). Stage 2, positional alignment, check pos(galaxy) == pos(samsung) + 1: doc1 has samsung @ [3, 18, 42] and galaxy @ [4, 50] — p=3 gives 4, MATCH; doc7 has samsung @ [12, 80] and galaxy @ [30, 81] — p=12 fails, p=80 gives 81, MATCH. Result: [doc1, doc7]. Cost: O(|surviving docs| × |avg position list|); stage 1 prunes aggressively, stage 2 stays cheap.]
Stage 1 reuses the docID intersect from chapter 143; stage 2 walks two short position lists per surviving doc looking for `pos₂ == pos₁ + 1`. For `k`-word phrases the same idea generalises: each `tᵢ` must appear at `pos₁ + (i−1)`.

A reference implementation in about thirty lines:

def phrase_search(idx, phrase: list[str]) -> list[int]:
    """Find docs containing `phrase` as consecutive words."""
    if not phrase:
        return []
    # Terms missing from the index can never match any document.
    if any(term not in idx.postings for term in phrase):
        return []
    # Stage 1 — intersect docIDs across all terms in the phrase.
    candidate_docs = set(idx.postings[phrase[0]].keys())
    for term in phrase[1:]:
        candidate_docs &= set(idx.postings[term].keys())

    # Stage 2 — for each surviving doc, check positional alignment.
    matches = []
    for doc in sorted(candidate_docs):
        positions = [idx.postings[term][doc] for term in phrase]
        # Anchor on positions of the first term; term i of the phrase
        # must then occur at p + i. Build each membership set once per
        # doc, not once per anchor position.
        rest = [set(plist) for plist in positions[1:]]
        for p in positions[0]:
            if all(p + i + 1 in rest_set for i, rest_set in enumerate(rest)):
                matches.append(doc)
                break
    return matches

In production this set-membership check is replaced by a synchronised pointer walk (the same merge primitive — sorted lists make it O(m) rather than O(m log m)), and the candidate-doc intersect runs through the docID-only projection of the postings rather than building Python sets. The shape of the algorithm is identical.
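
A minimal sketch of that synchronised walk for the two-term adjacency case — one pass over two sorted lists, no sets built:

def adjacent(pos1: list[int], pos2: list[int]) -> bool:
    """True if some p in pos1 has p + 1 in pos2.
    Both lists are sorted ascending, so one pass suffices."""
    i = j = 0
    while i < len(pos1) and j < len(pos2):
        if pos2[j] == pos1[i] + 1:
            return True
        if pos2[j] < pos1[i] + 1:
            j += 1      # pos2 is behind the current target; advance it
        else:
            i += 1      # pos2 overshot; try the next anchor position
    return False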

Worked example: "reliance jio" over three Indian news snippets

You are indexing news for an Indian financial site. Three documents:

docID  text
D1     "Reliance Jio launched a new plan"
D2     "Reliance announced a partnership with Bharti, Jio is the leader"
D3     "Jio Reliance is a top brand"

A user queries "reliance jio" — with quotes, as a phrase. The user wants D1, and only D1.

Tokenize and record positions. Lowercase, split on whitespace and punctuation, assign zero-indexed positions per doc:

  • D1: reliance=0, jio=1, launched=2, a=3, new=4, plan=5
  • D2: reliance=0, announced=1, a=2, partnership=3, with=4, bharti=5, jio=6, is=7, the=8, leader=9
  • D3: jio=0, reliance=1, is=2, a=3, top=4, brand=5

Build positional postings for the two terms we need:

reliance → [(D1, [0]), (D2, [0]), (D3, [1])]
jio      → [(D1, [1]), (D2, [6]), (D3, [0])]

Stage 1 — Boolean intersect. Both terms appear in all three docs. {D1, D2, D3} ∩ {D1, D2, D3} = {D1, D2, D3}. Nothing drops out yet — every doc mentions both reliance and jio. Why this stage didn't help here: with only three docs, the corpus is too small for the Boolean filter to prune. On a real corpus of millions of news articles, most docs would not contain both rare terms and the candidate set would shrink to a handful.

Stage 2 — Positional alignment. For each surviving doc, check whether any position of reliance is immediately followed by jio:

  • D1: reliance at [0]. Try p=0: is jio at position 1? jio's positions in D1 are [1]. Yes — match.
  • D2: reliance at [0]. Try p=0: is jio at position 1? jio's positions in D2 are [6]. No. No more positions for reliance. No match.
  • D3: reliance at [1]. Try p=1: is jio at position 2? jio's positions in D3 are [0]. No. No match.

Final result: [D1]. Why D2 fails: reliance and jio are eight words apart — a coincidence of co-occurrence, not a phrase. The user would be misled if your engine returned this. Why D3 fails: the words appear in the wrong order; "reliance jio" is a directional phrase, and "jio reliance" is a different phrase entirely. Phrase search captures word order, which is what makes "Apple Inc" different from "an Inc apple".
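
The driver below instantiates a PositionalIndex the chapter has not spelled out. A minimal sketch that matches the postings shape phrase_search expects (term → docID → sorted position list) — illustrative, not any particular library's API:

import re
from collections import defaultdict

class PositionalIndex:
    """Toy positional index: postings[term][docID] = sorted positions."""
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {docID: [positions]}
        self.docs = {}                      # docID -> original text

    def add(self, doc_id: int, text: str):
        self.docs[doc_id] = text
        # Positions are emitted in document order, so each list is sorted for free.
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[term].setdefault(doc_id, []).append(pos)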

idx = PositionalIndex()
idx.add(0, "Reliance Jio launched a new plan")
idx.add(1, "Reliance announced a partnership with Bharti, Jio is the leader")
idx.add(2, "Jio Reliance is a top brand")

result = phrase_search(idx, ["reliance", "jio"])
print(result)            # [0]   — only D1
print([idx.docs[d] for d in result])
# ['Reliance Jio launched a new plan']

That is the entire mechanism. The only thing that changed from chapter 143 is the postings carry positions and the second-stage check is pos₂ == pos₁ + 1 instead of nothing.

The cost: positional indexes are 3-5x larger

There is no free lunch. Adding positions inflates the index substantially. A docID-only postings list for a term that appears in D documents stores D integers. The positional version stores those D integers plus a position list per doc — on average D × tf̄ extra integers, where tf̄ is the average term frequency per matching doc. For typical English prose, tf̄ is roughly 2-4 for content terms, and the position integers are often larger in magnitude than the delta-encoded docID gaps: positions range over the document length rather than the corpus size, but on long docs they still average in the thousands, while the gaps in a frequent term's docID list stay small. The empirical rule from Manning, Raghavan and Schütze's Introduction to Information Retrieval (chapter 2) is that a positional index is roughly 3-5x the size of a docID-only index, and roughly 30-50% of the original (uncompressed) text.
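
A back-of-envelope check of that multiplier, using the count model above with illustrative numbers:

D = 1_000_000                       # docs containing the term
tf_bar = 3                          # tf̄: average occurrences per matching doc
docid_only_ints = D                 # one integer per posting
positional_ints = D * (1 + tf_bar)  # docID plus tf̄ positions per posting
print(positional_ints / docid_only_ints)   # 4.0 — squarely in the 3-5x band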

[Figure: Storage cost for a hypothetical 100 GB news corpus. Original text: 100 GB. DocID-only index, compressed: 12 GB (~12%). Positional index, uncompressed: 50 GB (~50%). Positional index, compressed: 35 GB (~35%). Compression (delta + VarByte / PFOR-delta) helps but cannot collapse the 3-5x gap. Many engines maintain both a position-less index for term queries and a position-aware index for phrase queries; the query planner picks at runtime, trading ~30-40% extra storage for ~10x faster term queries.]
Indicative numbers for a typical English news corpus. The positional index is 3-5x larger than the docID-only index; compression helps but cannot close the gap, because the position integers themselves carry information the docID list does not.

That cost is why every serious engine offers a per-field knob to disable positions. In Lucene (and therefore Elasticsearch), the field setting index_options takes one of:

  • docs — docIDs only: Boolean filtering works; frequency-based ranking and phrases do not
  • freqs — docIDs plus term frequencies: ranked retrieval works; phrase queries do not
  • positions — docIDs, frequencies and word positions: phrase and proximity queries work (the default for text fields)
  • offsets — everything above plus character offsets, used to speed up highlighting

If you have a field you only ever filter on (a SKU code, a category enum, a status flag), set index_options: docs and you save a lot of disk. If you have a field that is the body of a news article and users will phrase-search it, you keep positions. Why this knob exists: positional storage is wasted bytes for fields that no phrase query will ever touch, and search clusters routinely run with terabyte-scale indexes where 3-5x matters operationally — both for disk and for the working set that has to sit in page cache.
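
In mapping terms, the knob looks like this — a sketch written as the dict you would hand to the Elasticsearch Python client (the index_options values are real; the field names are illustrative):

mapping = {
    "mappings": {
        "properties": {
            "sku":      {"type": "keyword"},              # exact filter, never analysed
            "category": {"type": "text",
                         "index_options": "docs"},        # filter-only: positions skipped
            "body":     {"type": "text",
                         "index_options": "positions"},   # phrase-searchable (the default)
        }
    }
}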

Some engines go further and maintain both indexes: a small position-less index for cheap term and Boolean queries, and a larger positional index used only when the query parser sees quotes or a proximity operator. Query latency for the common case (term and Boolean) drops, at the cost of extra storage. This is the hybrid postings strategy used in Anserini and in Elasticsearch's index_phrases field option, which pre-builds a separate two-shingle index for the most common phrase queries.

Proximity queries: same algorithm, looser bound

Once you have positions, proximity queries come almost free. Lucene syntax "machine learning"~5 means "the words machine and learning within 5 word positions of each other, in any order". The algorithm is the phrase algorithm with one substitution: instead of testing pos₂ == pos₁ + 1, you test |pos₂ − pos₁| ≤ 5.
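
A minimal sketch of that relaxed test for two terms — the same synchronised walk with the adjacency check swapped for a distance bound:

def within_slop(pos1: list[int], pos2: list[int], k: int) -> bool:
    """True if any pair of positions is within k of each other (any order).
    Both lists sorted ascending; single pass."""
    i = j = 0
    while i < len(pos1) and j < len(pos2):
        if abs(pos1[i] - pos2[j]) <= k:
            return True
        # Advance whichever pointer trails — the only move that can
        # bring the current pair closer together.
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return False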

For longer phrases, Lucene defines the slop ~k as the number of edit operations (swaps, insertions) needed to turn the matched span into the queried phrase, and the algorithm becomes: for each anchor position p of t₁, find the smallest window containing one occurrence of each subsequent term such that the total slop is ≤ k. This is more involved but still polynomial — and still gated by the same Boolean intersect that prunes most docs before the position walk starts.

Proximity scoring is also where you can be cleverer than binary match/no-match. A document where the phrase terms are adjacent should score higher than one where they are eight words apart, even if both technically match "machine learning"~10. Modern BM25 implementations (BM25F, BM25+ proximity) include a position-based bonus that rewards tighter spans, computed from the same positional postings you already need to load.
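
One illustrative way to turn span tightness into a score bonus from the same position lists — not a standard formula, just the shape of the idea:

def proximity_bonus(pos1: list[int], pos2: list[int]) -> float:
    """Bonus of 1.0 for adjacent occurrences, decaying with the minimum
    distance between any occurrence of term 1 and any of term 2."""
    i = j = 0
    best = float("inf")
    while i < len(pos1) and j < len(pos2):
        best = min(best, abs(pos1[i] - pos2[j]))
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return 0.0 if best == float("inf") else 1.0 / max(best, 1)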

Optimisations real engines use

The naive algorithm above is correct and is roughly what Lucene runs. Four optimisations make it fast in practice.

Bigram (and shingle) pre-indexes. Some phrases dominate query logs — "new york", "machine learning", "reliance jio", "ipl 2026". If you know up front that these will be phrase-searched millions of times, you can pre-compute a separate index keyed by consecutive word pairs (bigrams). A query for "machine learning" then becomes a single dictionary lookup on the bigram machine_learning, returning the docIDs directly — no positional alignment needed at query time. The cost is paid at index time and in extra storage. Elasticsearch's index_phrases: true does exactly this for two-shingle phrases on a field. Why this works: the long tail of phrase queries gets handled by the general positional algorithm, while a small set of common bigrams covers a large fraction of phrase-query traffic, and shifting work from query time to index time is the central trade in search engineering.
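
A sketch of the pre-index itself — every consecutive word pair becomes a single pseudo-term, so the phrase query collapses to one dictionary lookup (names are illustrative):

from collections import defaultdict

def build_bigram_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """docs maps docID -> token list; returns bigram pseudo-term -> docIDs."""
    bigrams: defaultdict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[w1 + "_" + w2].add(doc_id)
    return bigrams

# Query time: bigram_idx["machine_learning"] is the complete answer —
# no positional walk, at the cost of a much larger term dictionary.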

Skip pointers inside positional lists. Chapter 145 introduced skip pointers on docID postings to make intersection sub-linear when one list is much shorter than the other. The same idea applies inside the positional list of a single doc: when checking whether t₂ has position p+1, you can skip ahead in t₂'s position list rather than walking it linearly. For docs with very high term frequency (tf > 100), this matters; for typical content, the position list is short enough that linear scan beats the skip overhead.

Hybrid (position-less + position-aware) postings. Already mentioned: the engine keeps two indexes and the planner picks. Implementation cost is moderate (you maintain two index files), storage cost is the docID-only index size on top of the positional index (typically +20-30%), and the win is large for query mixes dominated by Boolean filters with occasional phrase queries.

Phrase caches. Common phrase results ("machine learning", "work from home") get cached at the result-set level, just like ordinary query result caches. This is a layer above the index, but phrase queries are particularly amenable to caching because they repeat in query logs far more often than free-form term queries — users quote popular phrases, not novel ones.
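
A toy version of such a cache over phrase_search, using the standard library — a real engine caches per shard and invalidates on segment changes, all of which this sketch ignores:

from functools import lru_cache

def make_cached_phrase_search(idx):
    @lru_cache(maxsize=10_000)
    def cached(phrase: tuple[str, ...]) -> tuple[int, ...]:
        return tuple(phrase_search(idx, list(phrase)))
    return cached

# search = make_cached_phrase_search(idx)
# search(("machine", "learning"))   # first call computes; repeats are O(1)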

How real systems use this

Apache Lucene stores positions in a separate file (.pos) within each segment, alongside docIDs (.doc) and the term dictionary (.tim). The default field configuration is index_options: positions, meaning every text field is phrase-searchable out of the box. You can downgrade to freqs or docs per field, and the Lucene file format simply omits the .pos file for fields that do not need it.

Elasticsearch inherits this directly. A text field defaults to positional; a keyword field is non-analyzed and never positional (it is an exact-match identifier, not a body of text). The index_phrases: true option adds a shingle pre-index for fast common-phrase lookup; index_options: offsets adds character offsets for highlighting. Tuning these per field is a routine operational task in clusters that have grown large enough for storage to bite.

Tantivy (the Rust library that powers Quickwit's log search) makes the same trade — positions are on by default for text fields, off for non-analysed fields, controllable per field. MeiliSearch's instant-search architecture pre-computes more aggressive structures (proximity scoring is baked into the default ranking rules, with up to 8-shingle pre-indexes for typo-tolerant prefix matching).

PostgreSQL's full-text search via tsvector and the GIN index does store positions by default — the tsvector data type is literally a sorted list of (lexeme, [positions]) tuples. The <-> operator is the phrase-distance test. And like Lucene's index_options: docs, you can drop positions if you only need Boolean retrieval — to_tsvector('english', body) keeps them; the strip() function removes them.

Why this matters more than it looks

Phrase search is one of the highest-leverage relevance features in a lexical search engine, and it costs you a small constant factor in storage and a tiny fraction of query latency. Without it:

  • "reliance jio" silently degrades to reliance AND jio, surfacing docs where the terms sit paragraphs apart
  • word order disappears — "jio reliance" and "reliance jio" return identical results
  • the quotes in the search box become a lie: the UI accepts them, the engine ignores them

The phrase-query feature is the difference between a search that "kind of works" and one that respects the user's stated intent when they type quotes. Every modern search UI surfaces quotes as a first-class operator, and every modern engine implements them via positional indexes. The cost is paid; the benefit is the search engine being usable for the queries users actually issue.

Build 18 has now given you the docID-only index (chapter 143), real tokenization (chapter 144), compressed and skip-pointer-equipped postings (chapter 145), BM25 ranking (chapter 146), and now phrase-and-proximity search via positions. The remaining chapters wire segments and merge policies (chapter 148) and distribute the whole machine across shards (chapter 149).

Going deeper

A few directions where the positional index gets more interesting than the basic algorithm.

Span queries and the SpanNear formalism

Lucene generalises phrase and proximity queries into a uniform algebra called span queries. A SpanTermQuery matches a single term and yields the spans (start position, end position) where it occurs. A SpanNearQuery takes a list of sub-spans and a slop, and yields the spans where the children appear within the allowed distance. Span queries compose: you can ask for (SpanNear "apple" "inc" slop=0) AND_NOT (SpanTerm "fruit") to find Apple-the-company without Apple-the-fruit. The implementation reuses positional postings — there is no new on-disk structure — but the query operators expose a much richer language than plain phrase search.

Why position lists do not need to be aligned across terms

A subtle point: when you are checking phrase alignment for doc D, the position list of t₁ and the position list of t₂ are sorted independently. They are not paired. The algorithm walks t₁'s list and for each anchor position p does a membership test on t₂'s list for p+1. The membership test can be a binary search (O(log m₂)) or a synchronised pointer walk (O(m₁ + m₂) total). Lucene uses the synchronised walk because position lists for a single doc are typically tens of integers, and a linear scan beats binary search on small inputs due to cache behaviour.

Cross-field phrase queries

What about phrasing across fields? "narendra modi" should match a doc where narendra is the first_name field and modi is the last_name field. Naively, no — positional postings are per-field, so cross-field phrase queries require either (a) duplicating content into a combined field at index time (Elasticsearch's copy_to), or (b) a more complex query that knows about field boundaries. Most engines take option (a) because it makes phrase search a uniform operation; the storage cost is acceptable.
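
A sketch of option (a) in mapping terms, again as a dict for the Elasticsearch client — copy_to is the real option; the field names are illustrative:

mapping = {
    "mappings": {
        "properties": {
            "first_name": {"type": "text", "copy_to": "full_name"},
            "last_name":  {"type": "text", "copy_to": "full_name"},
            "full_name":  {"type": "text"},   # positional; phrase queries run here
        }
    }
}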

Positional information in probabilistic ranking models

Some advanced ranking models (Markov Random Field models, the family of dependency-parser-based scorers) use positional information to compute term-pair dependencies at scoring time, not just at filter time. The cost is more positional reads per scored doc, but the benefit is rankings that account for how natural the term co-occurrence is — "machine learning" adjacent should score higher than "machine ... learning" separated, even if both match a slop-3 proximity query. These models live above the index but feed off the same positional postings.

References

  1. Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Chapter 2: The term vocabulary and postings lists, section 2.4 (positional postings).
  2. Apache Lucene. PostingsFormat and Lucene99 codec — .pos and .pay files. lucene.apache.org/core/9_10_0/core/org/apache/lucene/codecs/lucene99/package-summary.html
  3. Elastic. Match phrase query and index_phrases option. elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query-phrase
  4. PostgreSQL. Full Text Search — tsvector, phrase distance operator <->. postgresql.org/docs/current/textsearch-intro.html
  5. Yang, Fang, Lin. Anserini: Reproducible Ranking Baselines Using Lucene. SIGIR 2017. arxiv.org/abs/1705.04707
  6. Quickwit / Tantivy. Field options and indexing modes. github.com/quickwit-oss/tantivy