Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
The inverted index maps each term to a sorted list of document IDs that contain it, turning search from an O(N) scan of the corpus into a single dictionary lookup plus a linear-time merge of two postings lists. AND intersects, OR unions, NOT subtracts — and every production search engine, from Lucene to Postgres's GIN, is built on exactly this structure.
A user types two words into a search box over a billion product descriptions, and you have fifty milliseconds to return the ten best matches. How do you not lose?
This is the question search engines exist to answer, and the answer has been the same since Gerard Salton's SMART system in the 1960s and Doug Cutting's Lucene in 1999: you do not search the documents at query time. You pre-build a data structure that already knows which documents contain which words, and at query time you do nothing but a dictionary lookup and a list intersection. That data structure is the inverted index, and once you have built one, the rest of search engineering — tokenization rules, ranking functions, distributed shards — is layers on top.
This chapter opens Build 18. By the end of it you will have a working inverted index in Python, you will understand why every production search engine on Earth picks this same structure, and you will be ready for the four chapters that turn it into a real search engine.
Why the obvious approach fails
The first data structure that comes to mind for "store some documents and search them later" is just a list of documents, where each document is a list (or set) of its words. Call this the forward index — for each document, you can quickly answer "what words does this document contain?".
The forward index is a perfectly good way to retrieve a document if you already know its ID — documents[42] returns its content. Why: that is the operation a forward index is built for; one hash lookup per document. But it is the wrong structure for finding a document by its content. To answer "which documents contain fox?", you have no choice but to open doc1, scan its word list, then doc2, then doc3, and so on through every document in the corpus. With N documents averaging L words each, that is O(N · L) work per query.
For 100 documents this is fine. For 100 million product listings on BharatBazaar, it is impossible. Even if each scan takes one microsecond, you are looking at 100 seconds per query. Why: the work scales with corpus size, but the user's patience does not — a search box that takes a minute is a search box that nobody uses.
Inversion: the trick that changes everything
Flip the relationship. Instead of doc → list of terms, build term → list of docs. Now a query for fox is a single dictionary lookup that returns the postings list [doc1] directly. No scan. The whole corpus could be a trillion documents and the lookup would still be one hash probe.
This is the entire idea. Everything else in this chapter is about building and querying this structure carefully.
A few things to notice in the diagram:
- The keys are terms, not documents. The dictionary side of the structure is sized by the vocabulary of the corpus (number of distinct words) — typically hundreds of thousands to a few million for a large English corpus, far smaller than the corpus itself. Why: vocabulary growth follows Heaps's law, roughly
V ≈ K · N^βwithβ ≈ 0.5, so a billion documents has perhaps a few million distinct words, not a billion. - Each value is a sorted list of docIDs. Sorting matters a lot — we will see in a moment why.
- Common words appear in many postings lists. "the" appears in almost every English document; "samsung" in a small fraction. The skew is what later chapters exploit for compression and ranking.
Building the index in three phases
There are three things to do, in order, when you build an inverted index from a corpus of documents.
Phase 1: Tokenize
Turn each document from a blob of text into a list of normalised terms. The blob "The Samsung Galaxy S25 smartphone." becomes the term list ["the", "samsung", "galaxy", "s25", "smartphone"]. Tokenization is much more subtle than text.lower().split() — handling Unicode, punctuation, Hindi matras, hyphenation, URLs, and email addresses correctly is a small subfield of its own — and we will spend the entirety of chapter 144 on it. For now, take it as a function tokenize(text) → List[str].
Phase 2: Build postings
For each (term, docID) pair you encounter, append docID to that term's postings list. A dict[str, list[int]] is the natural in-memory representation. Why: terms repeat heavily across documents, so the dictionary's hash collisions cluster on common words and a defaultdict(list) is the simplest correct structure.
After indexing all documents, sort each postings list by docID. Sorting is what makes intersection (the dominant query operation) linear-time instead of quadratic.
Phase 3: Persist
In a toy implementation you keep the index in RAM. In a real one (Lucene, Tantivy, Postgres's GIN), you write it to disk in two parts:
- a term dictionary that maps each term to the byte offset of its postings list (typically a B-tree, FST, or sorted array with a binary-searchable header), and
- the postings file, a contiguous run of compressed postings lists.
We will sketch persistence in chapter 145 alongside compression, where it becomes interesting. For this chapter, RAM is fine.
A real Python implementation
Sixty lines, no dependencies. The design splits cleanly into the three phases above and the four query types below.
# inverted_index.py — a minimal but real inverted index
from collections import defaultdict
from typing import Iterable
def tokenize(text: str) -> list[str]:
"""Phase 1 — placeholder. Real tokenization comes in ch.144."""
out = []
word = []
for ch in text.lower():
if ch.isalnum():
word.append(ch)
elif word:
out.append("".join(word)); word = []
if word: out.append("".join(word))
return out
class InvertedIndex:
def __init__(self):
# term -> sorted list of docIDs containing the term
self.postings: dict[str, list[int]] = defaultdict(list)
# docID -> original text (for displaying results)
self.docs: dict[int, str] = {}
self._next_id = 0
def add(self, text: str) -> int:
"""Phase 2 — tokenize and append postings."""
doc_id = self._next_id; self._next_id += 1
self.docs[doc_id] = text
seen = set()
for term in tokenize(text):
if term in seen: continue # one posting per (term, doc)
seen.add(term)
self.postings[term].append(doc_id)
return doc_id
def finalize(self):
"""Phase 2 finish — ensure each postings list is sorted by docID.
With our incremental adds this is already true, but a real builder
merges segments and must sort explicitly."""
for term in self.postings:
self.postings[term].sort()
# ----- Query operators -----
def lookup(self, term: str) -> list[int]:
return self.postings.get(term.lower(), [])
@staticmethod
def intersect(a: list[int], b: list[int]) -> list[int]:
"""AND: linear-time merge of two sorted postings lists."""
i = j = 0; out = []
while i < len(a) and j < len(b):
if a[i] == b[j]: out.append(a[i]); i += 1; j += 1
elif a[i] < b[j]: i += 1
else: j += 1
return out
@staticmethod
def union(a: list[int], b: list[int]) -> list[int]:
"""OR: linear-time merge with deduplication."""
i = j = 0; out = []
while i < len(a) and j < len(b):
if a[i] == b[j]: out.append(a[i]); i += 1; j += 1
elif a[i] < b[j]: out.append(a[i]); i += 1
else: out.append(b[j]); j += 1
out.extend(a[i:]); out.extend(b[j:])
return out
def difference(self, a: list[int]) -> list[int]:
"""NOT: subtract from the universe of all docIDs."""
all_docs = list(range(self._next_id))
sa = set(a)
return [d for d in all_docs if d not in sa]
That is the whole inverted index. Forty-odd lines of substance. The asymptotic complexity of every operation matches what a production engine achieves to within a constant factor:
| Operation | Cost | Why |
|---|---|---|
add(doc) |
O(L) where `L = |
doc |
lookup(term) |
O(1) + `O( |
postings |
intersect |
`O( | a |
union |
`O( | a |
difference |
O(N) |
scan all docs (later we do better with iterators) |
The intersect routine is the heart of search engineering. Read it once more: two pointers walk forward through two sorted lists; equal docIDs are output and both advance, the smaller advances alone. Why this is linear: each pointer can advance at most |a| + |b| times in total before falling off the end, so total work is bounded by the sum of input sizes. This is the same merge logic that drives external sort, log-structured merges in LSM trees, and the relational nested-loop-join's sort-merge variant — the same primitive showing up in three different layers of the database.
Worked example: an Indian e-commerce search
**Indexing five product descriptions and querying `samsung AND smartphone`**
You are building search for a small Indian e-commerce catalogue. Five products to index:
| docID | description |
|---|---|
| 0 | "Samsung Galaxy S25 smartphone with 256GB storage" |
| 1 | "Apple iPhone 16 smartphone Pro Max" |
| 2 | "Samsung 55-inch QLED 4K smart TV" |
| 3 | "OnePlus 13 smartphone with Snapdragon processor" |
| 4 | "Samsung Galaxy Tab S10 tablet 11-inch display" |
Step 1 — tokenize each document. Lowercase, split on non-alphanumeric:
- doc 0 →
["samsung", "galaxy", "s25", "smartphone", "with", "256gb", "storage"] - doc 1 →
["apple", "iphone", "16", "smartphone", "pro", "max"] - doc 2 →
["samsung", "55", "inch", "qled", "4k", "smart", "tv"] - doc 3 →
["oneplus", "13", "smartphone", "with", "snapdragon", "processor"] - doc 4 →
["samsung", "galaxy", "tab", "s10", "tablet", "11", "inch", "display"]
Step 2 — build postings. For each (term, docID) pair, append docID to the term's list. After all five inserts (deduplicating per document so we do not list a doc twice for one term):
postings["samsung"] = [0, 2, 4]
postings["galaxy"] = [0, 4]
postings["smartphone"] = [0, 1, 3]
postings["with"] = [0, 3]
postings["smart"] = [2]
postings["tv"] = [2]
postings["apple"] = [1]
postings["iphone"] = [1]
postings["oneplus"] = [3]
postings["tab"] = [4]
postings["inch"] = [2, 4]
... (15 more terms)
Notice each list is already sorted because we inserted in docID order. Why: incremental indexing in monotonically increasing docID order keeps lists sorted for free; segmented builders that merge multiple in-flight indexes must sort explicitly.
Step 3 — answer the query samsung AND smartphone.
Look up both postings lists:
samsung→[0, 2, 4]smartphone→[0, 1, 3]
Run the two-pointer intersection. i walks samsung, j walks smartphone:
The result is [0] — only the Samsung Galaxy S25 contains both samsung and smartphone. Why doc 2 (Samsung TV) is excluded: the TV is "samsung" but not "smartphone", so the AND fails. Why doc 1 (Apple iPhone) is excluded: it is "smartphone" but not "samsung". This is the user's intent.
idx = InvertedIndex()
for desc in [
"Samsung Galaxy S25 smartphone with 256GB storage",
"Apple iPhone 16 smartphone Pro Max",
"Samsung 55-inch QLED 4K smart TV",
"OnePlus 13 smartphone with Snapdragon processor",
"Samsung Galaxy Tab S10 tablet 11-inch display",
]:
idx.add(desc)
idx.finalize()
samsung = idx.lookup("samsung") # [0, 2, 4]
smartphone = idx.lookup("smartphone") # [0, 1, 3]
result = InvertedIndex.intersect(samsung, smartphone) # [0]
print([idx.docs[d] for d in result])
# ['Samsung Galaxy S25 smartphone with 256GB storage']
Run it. It works. You have a search engine in seventy lines.
The four query operators, completed
Single-term and AND are the two queries you have just built. Three more round out the basics:
OR query (samsung OR oneplus): union the two postings lists. Same two-pointer merge as intersect, but instead of only emitting on equality, you emit whichever pointer is smaller and advance it. Equal elements are emitted once. The result samsung [0,2,4] ∪ oneplus [3] = [0, 2, 3, 4]. Why this works: union of two sorted lists is itself sorted, and a merge produces a sorted output without buffering — exactly the same primitive used in a merge sort's combine step.
NOT query (smartphone AND NOT samsung): take the postings list for smartphone and remove any docID also in samsung. Implemented as intersect_complement(a, b) — walk both pointers, output a[i] only when it is strictly less than b[j]. Lucene calls this "MUST_NOT". smartphone [0,1,3] − samsung [0,2,4] = [1, 3] (Pearle and OnePlus, but not the Samsung phone).
Phrase query ("galaxy s25"): a plain inverted index cannot tell you whether galaxy and s25 appear next to each other in the document — only that both appear. To answer phrase queries you need a positional index, where each posting carries not just a docID but also the list of word positions within that document. We build it in chapter 147; the structure becomes term → [(docID, [pos1, pos2, ...]), ...].
Real systems built on this idea
Every production search engine you have heard of is, at its core, the data structure you just built — wrapped in carefully engineered tokenizers, compressed postings, sharded across machines, and ranked by BM25.
- Apache Lucene is the open-source Java library that implements all of this. Its file format (
.tim,.doc,.pos,.payfiles in a Lucene segment) maps directly onto term dictionary, doc-only postings, positional postings, and per-document payloads. Why this matters to know: Elasticsearch and OpenSearch are servers wrapped around Lucene; Solr is a different server wrapped around Lucene. Understanding Lucene is understanding the bulk of search infrastructure deployed in production today. - Tantivy is the same architecture rewritten in Rust and used by Quickwit for log search. It is faster on a single core because it avoids JVM overhead.
- MeiliSearch is also Rust, optimised for typo-tolerant instant search; the inverted index is augmented with prefix and n-gram indexes.
- PostgreSQL's GIN index ("Generalized Inverted Index") implements exactly this structure inside a relational database. When you write
WHERE to_tsvector(body) @@ to_tsquery('samsung & smartphone'), Postgres looks up two postings lists and intersects them. The plumbing differs (it lives inside Postgres's storage manager and integrates with MVCC), but the algorithm is what you just wrote.
The point is that the inverted index is not a Lucene-specific artefact. It is the universal data structure for "find documents by term", and it shows up wherever that operation matters.
What Build 18 fills in next
You now have the spine. The next four chapters dress it.
- Chapter 144 — Tokenization, stemming, stopwords, language analyzers replaces the toy
text.lower().split()with real Unicode-aware tokenization, Porter and Krovetz stemmers, stopword lists, and language-specific analyzers (English, Hindi, Tamil). This is whererunningandrunscollapse torun, and where you stop indexing the word "the" three trillion times. - Chapter 145 — Postings lists, skip pointers, compression turns each
list[int]into a delta-encoded, variable-byte (or PFOR-delta) compressed run with periodic skip pointers. A million-docID postings list shrinks from 4 MB to 200-500 KB and supportsO(√n)skip during intersection. - Chapter 146 — TF-IDF derived, then BM25 and why it won moves from "which docs match" to "which docs match best". You will derive term-frequency × inverse-document-frequency from first principles, then see why Robertson's BM25 (with its length normalisation and saturation function) beat it everywhere and remains the lexical baseline that even modern neural rankers compare against.
- Chapter 147 — Phrase queries and positional indexes adds positions to each posting so that
"galaxy s25"(in quotes) matches only documents wheres25appears immediately aftergalaxy. - Chapter 148 wires segments and merge policies (the LSM-tree-shaped storage that Lucene actually uses on disk).
- Chapter 149 scatters queries across shards and gathers the ranked results.
By the end of Build 18 you will have not a toy but the blueprint of a real search engine, and the code you wrote in this chapter will be recognisable inside Lucene, Tantivy, and Postgres GIN.
Common confusions
- "An inverted index is just a hash map from word to documents." It is, in memory, in toy code. In a real engine it is a sorted on-disk structure: a term dictionary (often an FST or sorted block-array, not a hash table) pointing to compressed postings runs in a separate file. Lucene picks sorted because it needs prefix scans (
samsung*), range queries on numeric terms, and small RAM footprint per index — none of which a hash table gives you. The mental model "dict[term] → list[int]" is correct as a starting point and wrong about every constant factor in production. - "Sorting the postings by relevance would make queries faster." It would make single-term top-k faster but break the linear-time merge for AND/OR. Modern engines keep postings sorted by docID and compute relevance after the Boolean step, often using Block-Max WAND to skip docIDs whose maximum possible score cannot reach the current top-k threshold. The docID-sort is load-bearing — change it and intersection becomes quadratic.
- "The index is bigger than the corpus, so it is wasteful." An uncompressed positional index can be roughly 1× to 1.5× the corpus size, but a docID-only compressed index (chapter 145) is typically 10–20% of the corpus. The space pays for itself in the first hour of query traffic on any non-trivial site — every query you do not scan the corpus for is a query that finishes in microseconds instead of seconds.
- "
SELECT ... WHERE body LIKE '%samsung%'does the same thing in Postgres." It does not.LIKE '%...%'cannot use a B-tree (the leading wildcard prevents prefix matching) and forces a sequential scan of every row. Full-text search in Postgres uses the GIN inverted index built overto_tsvector(body);WHERE to_tsvector(body) @@ to_tsquery('samsung & smartphone')is what triggers the postings-list intersection you wrote above. The two queries look superficially similar and have completely different runtime characteristics. - "Tokenization is just
text.lower().split()." This works for English ASCII demos and breaks immediately on real input. URLs (https://flipkart.com/p/12345) split into garbage; Hindi matras (रामायण) get mangled by naive whitespace splitting; emoji and Tamil compound characters confuselower(). Lucene ships an entire pipeline (StandardTokenizer+ filters) per language. Chapter 144 walks through what a real tokenizer does and why getting it wrong silently kills recall. - "An inverted index handles fuzzy / typo-tolerant search." A plain inverted index does not —
samsng(typo) is a different term fromsamsungand yields an empty postings list. Fuzzy search needs an additional structure: an n-gram index (MeiliSearch), a Levenshtein automaton (Lucene's FuzzyQuery), or a phonetic index (SoundEx, Metaphone). The inverted index is the foundation, but typo tolerance is bolted on top of it, not free.
Going deeper
A few directions for the curious before chapter 144 begins.
Why "inverted"?
The name is older than computing. In a printed book's index at the back, each term lists the page numbers where it appears — that is the inverted index. The body of the book itself, with words flowing in document order, is the forward index. The terminology was lifted directly into IR by Salton's group in the 1960s.
How big is the index relative to the corpus?
A rough rule for English text: an uncompressed positional inverted index is roughly the same size as the original corpus, sometimes 1.5×. Without positions, it is 20-40% of the corpus. With aggressive compression (chapter 145), the docID-only index shrinks to perhaps 10-15%. Why so small even with positions: most terms are rare and have very short postings lists, while a handful of common terms (the, a, of) have very long lists that compress beautifully because consecutive docIDs are dense and small after delta-coding.
Why sorted by docID and not by relevance?
Sorted-by-docID makes intersection and union linear-time, which is what you need to evaluate Boolean filters fast. Sorted-by-relevance would make those operations quadratic. Modern engines compute relevance after the Boolean step: they intersect the postings to get the candidate set, then score each candidate. Some advanced techniques (WAND, Block-Max-WAND in Anserini-style engines [1]) keep an additional max-score-per-block on each postings list so that the scorer can skip blocks that cannot reach the top-k threshold — but the underlying storage is still sorted by docID.
Why not just use a regex search on the corpus?
Tools like grep and ripgrep do this. They are excellent for occasional searches in source code (a few gigabytes), where building an index is overkill. For a static corpus that is queried millions of times, the index pays for itself in the first hour: every query you avoid scanning the whole corpus is a query you served in microseconds instead of seconds. The trade-off is the build cost (one pass over the corpus) and the index storage (10-50% of corpus size).
References
- Yang, Fang, Lin. Anserini: Reproducible Ranking Baselines Using Lucene. SIGIR 2017. arxiv.org/abs/1705.04707
- Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Chapter 1: Boolean retrieval (free)
- Apache Lucene. Index file formats. lucene.apache.org/core/9_10_0/core/org/apache/lucene/codecs/lucene99/package-summary.html
- Wikipedia. Inverted index. en.wikipedia.org/wiki/Inverted_index
- PostgreSQL. GIN indexes. postgresql.org/docs/current/gin.html
- Quickwit. Tantivy: a full-text search engine library inspired by Apache Lucene. github.com/quickwit-oss/tantivy