The Inverted Index from Scratch

In short

You are building a search box. A user types samsung smartphone into Flipkart, Amazon, or your own product catalogue, and within fifty milliseconds you must rank the matching products out of a hundred million listings. The naive approach — store each document as a list of its words and scan every document on every query — is O(N) per query and will not survive past a thousand documents on a laptop, let alone a billion in production. The data structure that makes web-scale search feasible is the inverted index: a dictionary keyed by term, where each value is a sorted list of document IDs that contain that term. Looking up samsung returns the postings list [3, 17, 42, 88, …] directly — no scan, just a hash or B-tree lookup. An AND query intersects two sorted postings lists in linear merge time; an OR query unions them; a NOT query subtracts. This chapter implements a real inverted index in about sixty lines of Python, indexes five product descriptions for an Indian e-commerce site, walks through the query samsung AND smartphone step by step, and previews what Build 18 fills in next: tokenization rules (ch. 144), postings compression and skip pointers (ch. 145), and the BM25 ranking function that won search (ch. 146). Every modern search engine — Lucene, Elasticsearch, Solr, Tantivy, MeiliSearch, even Postgres's built-in full-text search via the GIN index — is, at its core, an inverted index.

You have a billion product descriptions, blog posts, log lines, or tweets sitting on disk. A user types two words into a search box. You have fifty milliseconds to come back with the ten best matches. How do you not lose?

This is the question search engines exist to answer, and the answer has been the same since Gerard Salton's SMART system in the 1960s and Doug Cutting's Lucene in 1999: you do not search the documents at query time. You pre-build a data structure that already knows which documents contain which words, and at query time you do nothing but a dictionary lookup and a list intersection. That data structure is the inverted index, and once you have built one, the rest of search engineering — tokenization rules, ranking functions, distributed shards — is layers on top.

This chapter opens Build 18. By the end of it you will have a working inverted index in Python, you will understand why every production search engine on Earth picks this same structure, and you will be ready for the four chapters that turn it into a real search engine.

Why the obvious approach fails

The first data structure that comes to mind for "store some documents and search them later" is just a list of documents, where each document is a list (or set) of its words. Call this the forward index — for each document, you can quickly answer "what words does this document contain?".

The forward index maps each document to the words it contains. It is the natural way to *store* documents (it is essentially what your filesystem already gives you), but it is the wrong structure for *searching* them. A query for "fox" must open every document and check.

The forward index is a perfectly good way to retrieve a document if you already know its ID — documents[42] returns its content. Why: that is the operation a forward index is built for; one hash lookup per document. But it is the wrong structure for finding a document by its content. To answer "which documents contain fox?", you have no choice but to open doc1, scan its word list, then doc2, then doc3, and so on through every document in the corpus. With N documents averaging L words each, that is O(N · L) work per query.

For 100 documents this is fine. For 100 million product listings on Flipkart, it is impossible. Even if each scan takes one microsecond, you are looking at 100 seconds per query. Why: the work scales with corpus size, but the user's patience does not — a search box that takes a minute is a search box that nobody uses.

Inversion: the trick that changes everything

Flip the relationship. Instead of doc → list of terms, build term → list of docs. Now a query for fox is a single dictionary lookup that returns the postings list [doc1] directly. No scan. The whole corpus could be a trillion documents and the lookup would still be one hash probe.

The inverted index maps each term to the sorted list of documents that contain it. A query for "fox" is one dictionary lookup followed by reading a (usually short) postings list. The corpus could be the entire web and this lookup would still take microseconds.

This is the entire idea. Everything else in this chapter is about building and querying this structure carefully.

A few things to notice in the diagram:

The keys are terms, not documents. The dictionary side of the structure is sized by the vocabulary of the corpus (number of distinct words) — typically hundreds of thousands to a few million for a large English corpus, far smaller than the corpus itself. Why: vocabulary growth follows Heaps's law, roughly V ≈ K · N^β with β ≈ 0.5, so a billion documents has perhaps a few million distinct words, not a billion.
Each value is a sorted list of docIDs. Sorting matters a lot — we will see in a moment why.
Common words appear in many postings lists. "the" appears in almost every English document; "samsung" in a small fraction. The skew is what later chapters exploit for compression and ranking.

Building the index in three phases

There are three things to do, in order, when you build an inverted index from a corpus of documents.

Phase 1: Tokenize

Turn each document from a blob of text into a list of normalised terms. The blob "The Samsung Galaxy S25 smartphone." becomes the term list ["the", "samsung", "galaxy", "s25", "smartphone"]. Tokenization is much more subtle than text.lower().split() — handling Unicode, punctuation, Hindi matras, hyphenation, URLs, and email addresses correctly is a small subfield of its own — and we will spend the entirety of chapter 144 on it. For now, take it as a function tokenize(text) → List[str].

Phase 2: Build postings

For each (term, docID) pair you encounter, append docID to that term's postings list. A dict[str, list[int]] is the natural in-memory representation. Why: terms repeat heavily across documents, so the dictionary's hash collisions cluster on common words and a defaultdict(list) is the simplest correct structure.

After indexing all documents, sort each postings list by docID. Sorting is what makes intersection (the dominant query operation) linear-time instead of quadratic.

Phase 3: Persist

In a toy implementation you keep the index in RAM. In a real one (Lucene, Tantivy, Postgres's GIN), you write it to disk in two parts:

a term dictionary that maps each term to the byte offset of its postings list (typically a B-tree, FST, or sorted array with a binary-searchable header), and
the postings file, a contiguous run of compressed postings lists.

We will sketch persistence in chapter 145 alongside compression, where it becomes interesting. For this chapter, RAM is fine.

A real Python implementation

Sixty lines, no dependencies. The design splits cleanly into the three phases above and the four query types below.

# inverted_index.py — a minimal but real inverted index
from collections import defaultdict
from typing import Iterable

def tokenize(text: str) -> list[str]:
    """Phase 1 — placeholder. Real tokenization comes in ch.144."""
    out = []
    word = []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            out.append("".join(word)); word = []
    if word: out.append("".join(word))
    return out

class InvertedIndex:
    def __init__(self):
        # term -> sorted list of docIDs containing the term
        self.postings: dict[str, list[int]] = defaultdict(list)
        # docID -> original text (for displaying results)
        self.docs: dict[int, str] = {}
        self._next_id = 0

    def add(self, text: str) -> int:
        """Phase 2 — tokenize and append postings."""
        doc_id = self._next_id; self._next_id += 1
        self.docs[doc_id] = text
        seen = set()
        for term in tokenize(text):
            if term in seen: continue   # one posting per (term, doc)
            seen.add(term)
            self.postings[term].append(doc_id)
        return doc_id

    def finalize(self):
        """Phase 2 finish — ensure each postings list is sorted by docID.
        With our incremental adds this is already true, but a real builder
        merges segments and must sort explicitly."""
        for term in self.postings:
            self.postings[term].sort()

    # ----- Query operators -----
    def lookup(self, term: str) -> list[int]:
        return self.postings.get(term.lower(), [])

    @staticmethod
    def intersect(a: list[int], b: list[int]) -> list[int]:
        """AND: linear-time merge of two sorted postings lists."""
        i = j = 0; out = []
        while i < len(a) and j < len(b):
            if   a[i] == b[j]: out.append(a[i]); i += 1; j += 1
            elif a[i] <  b[j]: i += 1
            else:              j += 1
        return out

    @staticmethod
    def union(a: list[int], b: list[int]) -> list[int]:
        """OR: linear-time merge with deduplication."""
        i = j = 0; out = []
        while i < len(a) and j < len(b):
            if   a[i] == b[j]: out.append(a[i]); i += 1; j += 1
            elif a[i] <  b[j]: out.append(a[i]); i += 1
            else:              out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])
        return out

    def difference(self, a: list[int]) -> list[int]:
        """NOT: subtract from the universe of all docIDs."""
        all_docs = list(range(self._next_id))
        sa = set(a)
        return [d for d in all_docs if d not in sa]

That is the whole inverted index. Forty-odd lines of substance. The asymptotic complexity of every operation matches what a production engine achieves to within a constant factor:

Operation	Cost	Why
`add(doc)`	`O(L)` where `L =	doc
`lookup(term)`	`O(1)` + `O(	postings
`intersect`	`O(	a
`union`	`O(	a
`difference`	`O(N)`	scan all docs (later we do better with iterators)

The intersect routine is the heart of search engineering. Read it once more: two pointers walk forward through two sorted lists; equal docIDs are output and both advance, the smaller advances alone. Why this is linear: each pointer can advance at most |a| + |b| times in total before falling off the end, so total work is bounded by the sum of input sizes. This is the same merge logic that drives external sort, log-structured merges in LSM trees, and the relational nested-loop-join's sort-merge variant — the same primitive showing up in three different layers of the database.

Worked example: an Indian e-commerce search

**Indexing five product descriptions and querying `samsung AND smartphone`**

You are building search for a small Indian e-commerce catalogue. Five products to index:

docID	description
0	"Samsung Galaxy S25 smartphone with 256GB storage"
1	"Apple iPhone 16 smartphone Pro Max"
2	"Samsung 55-inch QLED 4K smart TV"
3	"OnePlus 13 smartphone with Snapdragon processor"
4	"Samsung Galaxy Tab S10 tablet 11-inch display"

Step 1 — tokenize each document. Lowercase, split on non-alphanumeric:

doc 0 → ["samsung", "galaxy", "s25", "smartphone", "with", "256gb", "storage"]
doc 1 → ["apple", "iphone", "16", "smartphone", "pro", "max"]
doc 2 → ["samsung", "55", "inch", "qled", "4k", "smart", "tv"]
doc 3 → ["oneplus", "13", "smartphone", "with", "snapdragon", "processor"]
doc 4 → ["samsung", "galaxy", "tab", "s10", "tablet", "11", "inch", "display"]

Step 2 — build postings. For each (term, docID) pair, append docID to the term's list. After all five inserts (deduplicating per document so we do not list a doc twice for one term):

postings["samsung"]    = [0, 2, 4]
postings["galaxy"]     = [0, 4]
postings["smartphone"] = [0, 1, 3]
postings["with"]       = [0, 3]
postings["smart"]      = [2]
postings["tv"]         = [2]
postings["apple"]      = [1]
postings["iphone"]     = [1]
postings["oneplus"]    = [3]
postings["tab"]        = [4]
postings["inch"]       = [2, 4]
... (15 more terms)

Notice each list is already sorted because we inserted in docID order. Why: incremental indexing in monotonically increasing docID order keeps lists sorted for free; segmented builders that merge multiple in-flight indexes must sort explicitly.

Step 3 — answer the query samsung AND smartphone.

Look up both postings lists:

samsung → [0, 2, 4]
smartphone → [0, 1, 3]

Run the two-pointer intersection. i walks samsung, j walks smartphone:

The two-pointer merge for `intersect([0,2,4], [0,1,3])`. Equal elements are output and both pointers advance; the smaller advances alone. Five comparisons for two three-element lists; in general `O(|a|+|b|)`.

The result is [0] — only the Samsung Galaxy S25 contains both samsung and smartphone. Why doc 2 (Samsung TV) is excluded: the TV is "samsung" but not "smartphone", so the AND fails. Why doc 1 (Apple iPhone) is excluded: it is "smartphone" but not "samsung". This is the user's intent.

idx = InvertedIndex()
for desc in [
    "Samsung Galaxy S25 smartphone with 256GB storage",
    "Apple iPhone 16 smartphone Pro Max",
    "Samsung 55-inch QLED 4K smart TV",
    "OnePlus 13 smartphone with Snapdragon processor",
    "Samsung Galaxy Tab S10 tablet 11-inch display",
]:
    idx.add(desc)
idx.finalize()

samsung    = idx.lookup("samsung")     # [0, 2, 4]
smartphone = idx.lookup("smartphone")  # [0, 1, 3]
result     = InvertedIndex.intersect(samsung, smartphone)  # [0]
print([idx.docs[d] for d in result])
# ['Samsung Galaxy S25 smartphone with 256GB storage']

Run it. It works. You have a search engine in seventy lines.

The four query operators, completed

Single-term and AND are the two queries you have just built. Three more round out the basics:

OR query (samsung OR oneplus): union the two postings lists. Same two-pointer merge as intersect, but instead of only emitting on equality, you emit whichever pointer is smaller and advance it. Equal elements are emitted once. The result samsung [0,2,4] ∪ oneplus [3] = [0, 2, 3, 4]. Why this works: union of two sorted lists is itself sorted, and a merge produces a sorted output without buffering — exactly the same primitive used in a merge sort's combine step.

NOT query (smartphone AND NOT samsung): take the postings list for smartphone and remove any docID also in samsung. Implemented as intersect_complement(a, b) — walk both pointers, output a[i] only when it is strictly less than b[j]. Lucene calls this "MUST_NOT". smartphone [0,1,3] − samsung [0,2,4] = [1, 3] (Apple and OnePlus, but not the Samsung phone).

Phrase query ("galaxy s25"): a plain inverted index cannot tell you whether galaxy and s25 appear next to each other in the document — only that both appear. To answer phrase queries you need a positional index, where each posting carries not just a docID but also the list of word positions within that document. We build it in chapter 147; the structure becomes term → [(docID, [pos1, pos2, ...]), ...].

The postings list for "samsung" might contain millions of docIDs in a real corpus. Two invariants make it efficient: (1) it is sorted by docID, so binary search and merge work; (2) successive docIDs are close together, which compression in chapter 145 will exploit.

Real systems built on this idea

Every production search engine you have heard of is, at its core, the data structure you just built — wrapped in carefully engineered tokenizers, compressed postings, sharded across machines, and ranked by BM25.

Apache Lucene is the open-source Java library that implements all of this. Its file format (.tim, .doc, .pos, .pay files in a Lucene segment) maps directly onto term dictionary, doc-only postings, positional postings, and per-document payloads. Why this matters to know: Elasticsearch and OpenSearch are servers wrapped around Lucene; Solr is a different server wrapped around Lucene. Understanding Lucene is understanding the bulk of search infrastructure deployed in production today.
Tantivy is the same architecture rewritten in Rust and used by Quickwit for log search. It is faster on a single core because it avoids JVM overhead.
MeiliSearch is also Rust, optimised for typo-tolerant instant search; the inverted index is augmented with prefix and n-gram indexes.
PostgreSQL's GIN index ("Generalized Inverted Index") implements exactly this structure inside a relational database. When you write WHERE to_tsvector(body) @@ to_tsquery('samsung & smartphone'), Postgres looks up two postings lists and intersects them. The plumbing differs (it lives inside Postgres's storage manager and integrates with MVCC), but the algorithm is what you just wrote.

The point is that the inverted index is not a Lucene-specific artefact. It is the universal data structure for "find documents by term", and it shows up wherever that operation matters.

What Build 18 fills in next

You now have the spine. The next four chapters dress it.

Chapter 144 — Tokenization, stemming, stopwords, language analyzers replaces the toy text.lower().split() with real Unicode-aware tokenization, Porter and Krovetz stemmers, stopword lists, and language-specific analyzers (English, Hindi, Tamil). This is where running and runs collapse to run, and where you stop indexing the word "the" three trillion times.
Chapter 145 — Postings lists, skip pointers, compression turns each list[int] into a delta-encoded, variable-byte (or PFOR-delta) compressed run with periodic skip pointers. A million-docID postings list shrinks from 4 MB to 200-500 KB and supports O(√n) skip during intersection.
Chapter 146 — TF-IDF derived, then BM25 and why it won moves from "which docs match" to "which docs match best". You will derive term-frequency × inverse-document-frequency from first principles, then see why Robertson's BM25 (with its length normalisation and saturation function) beat it everywhere and remains the lexical baseline that even modern neural rankers compare against.
Chapter 147 — Phrase queries and positional indexes adds positions to each posting so that "galaxy s25" (in quotes) matches only documents where s25 appears immediately after galaxy.
Chapter 148 wires segments and merge policies (the LSM-tree-shaped storage that Lucene actually uses on disk).
Chapter 149 scatters queries across shards and gathers the ranked results.

By the end of Build 18 you will have not a toy but the blueprint of a real search engine, and the code you wrote in this chapter will be recognisable inside Lucene, Tantivy, and Postgres GIN.

Going deeper

A few directions for the curious before chapter 144 begins.

Why "inverted"?

The name is older than computing. In a printed book's index at the back, each term lists the page numbers where it appears — that is the inverted index. The body of the book itself, with words flowing in document order, is the forward index. The terminology was lifted directly into IR by Salton's group in the 1960s.

How big is the index relative to the corpus?

A rough rule for English text: an uncompressed positional inverted index is roughly the same size as the original corpus, sometimes 1.5×. Without positions, it is 20-40% of the corpus. With aggressive compression (chapter 145), the docID-only index shrinks to perhaps 10-15%. Why so small even with positions: most terms are rare and have very short postings lists, while a handful of common terms (the, a, of) have very long lists that compress beautifully because consecutive docIDs are dense and small after delta-coding.

Why sorted by docID and not by relevance?

Sorted-by-docID makes intersection and union linear-time, which is what you need to evaluate Boolean filters fast. Sorted-by-relevance would make those operations quadratic. Modern engines compute relevance after the Boolean step: they intersect the postings to get the candidate set, then score each candidate. Some advanced techniques (WAND, Block-Max-WAND in Anserini-style engines [1]) keep an additional max-score-per-block on each postings list so that the scorer can skip blocks that cannot reach the top-k threshold — but the underlying storage is still sorted by docID.

Why not just use a regex search on the corpus?

Tools like grep and ripgrep do this. They are excellent for occasional searches in source code (a few gigabytes), where building an index is overkill. For a static corpus that is queried millions of times, the index pays for itself in the first hour: every query you avoid scanning the whole corpus is a query you served in microseconds instead of seconds. The trade-off is the build cost (one pass over the corpus) and the index storage (10-50% of corpus size).

References

Yang, Fang, Lin. Anserini: Reproducible Ranking Baselines Using Lucene. SIGIR 2017. arxiv.org/abs/1705.04707
Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Chapter 1: Boolean retrieval (free)
Apache Lucene. Index file formats. lucene.apache.org/core/9_10_0/core/org/apache/lucene/codecs/lucene99/package-summary.html
Wikipedia. Inverted index. en.wikipedia.org/wiki/Inverted_index
PostgreSQL. GIN indexes. postgresql.org/docs/current/gin.html
Quickwit. Tantivy: a full-text search engine library inspired by Apache Lucene. github.com/quickwit-oss/tantivy