Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

An inverted index only matches when the indexed term and the query term are byte-for-byte identical, but real users type samsung phone against documents that say Samsung Phones, and real Hindi users type रिलायंस against articles that say रिलायंस ने. The analyzer is the deterministic pipeline — character filter, tokenizer, then token filters like lowercase, stopwords, and stemming — that maps both sides to the same canonical form. Stemming uses fast heuristic suffix-stripping (Snowball); lemmatisation looks up dictionary forms and is slower but more accurate. Every language needs its own pipeline, and the exact same analyzer must run at index time and query time or nothing matches.

In chapter 143 you built an inverted index. You wrote tokenize(text) as a placeholder that returned text.lower().split(). That placeholder works on a few test sentences and fails on every real document you will ever encounter. This chapter is what you put inside tokenize() — and it is, by some distance, the most language-dependent and culturally-loaded component of any search engine.

The thesis: matching requires normalisation

Suppose you index this sentence:

"Samsung's new Galaxy phones — café-quality cameras, ₹89,999."

A user searches for samsung phone. Should it match? Of course it should. But naively, the index contains Samsung's (with an apostrophe and capital S), phones (plural), and the query contains samsung (lowercase) and phone (singular). Byte for byte, nothing matches. Why: a hash table — which is what the term dictionary of an inverted index is — only retrieves a value if the lookup key is exactly equal to the stored key, bit for bit. "Samsung's" and "samsung" are different strings; one lookup will not find the other.

The fix is to run both the indexed text and the query text through the same deterministic transformation pipeline. That pipeline is the analyzer. After analysis the document tokens become roughly [samsung, new, galaxi, phone, cafe, qualiti, camera, 89999] and the query tokens become [samsung, phone] — and now the lookups hit. The analyzer's job is to make the two sides agree on a canonical form for every word.
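The miss and the fix can be sketched with a plain Python dict standing in for the term dictionary. The `canonical` helper is a hypothetical placeholder (lowercasing plus crude possessive and plural stripping), far simpler than a real analyzer:

```python
import re

def raw_tokens(text: str) -> list[str]:
    """Split on whitespace only, with no normalisation at all."""
    return text.split()

def canonical(token: str) -> str:
    """Toy canonical form: lowercase, drop a possessive 's, drop a plural s."""
    token = token.lower()
    token = re.sub(r"'s$", "", token)
    token = re.sub(r"(?<=[a-z]{3})s$", "", token)
    return token

doc = "Samsung's new Galaxy phones"
query = "samsung phone"

# Term dictionary keyed by the raw strings: exact-match lookups miss.
raw_index = {tok: [1] for tok in raw_tokens(doc)}
raw_hits = [tok for tok in raw_tokens(query) if tok in raw_index]

# Same dictionary keyed by canonical forms, with the query canonicalised too.
canon_index = {canonical(tok): [1] for tok in raw_tokens(doc)}
canon_hits = [canonical(tok) for tok in raw_tokens(query)
              if canonical(tok) in canon_index]

print(raw_hits)    # []
print(canon_hits)  # ['samsung', 'phone']
```

Nothing about the dictionary changed between the two runs; only the keys on both sides were mapped to the same form.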

This is why a search engine is, in practice, much more an exercise in linguistics and Unicode handling than in algorithms. The lookup structure underneath is simple. The pipeline on top is a small NLP system.

The standard analyzer pipeline

Lucene — the open-source library underneath Elasticsearch, Solr, OpenSearch, and dozens of other systems — defines an analyzer as a fixed three-stage pipeline. Every modern search engine borrows this structure, even if the names differ.

[Figure: standard analyzer pipeline. Raw text → whitespace split → lowercase → strip punctuation → strip accents → tokens into postings. In the example, "Samsung's", "Galaxy", "Phones" and "café" end up as samsungs, galaxy, phones, cafe, and the standalone dash token is dropped. The same pipeline runs at index time and at query time, which is what makes the two sides comparable.]
The standard analyzer pipeline. Each box is a deterministic function on the token stream of the previous stage. Note that this is the *normalisation* part — stemming, stopwording and synonym expansion are extra token filters that slot in before the final output.

The three stages, in Lucene's vocabulary, are:

  1. Character filter: runs on the raw text before tokenisation. Decodes HTML entities (&amp; → &), normalises Unicode (NFC), strips control characters. Often a no-op for plain text but essential for indexing scraped HTML or user-submitted markdown. Why: tokenisation rules assume a clean Unicode string; if a document body still contains &nbsp; or smart quotes, every downstream stage will mistreat them.

  2. Tokenizer: splits the cleaned text into a stream of tokens. The standard tokenizer in Lucene follows the Unicode Text Segmentation rules (UAX #29), which roughly say "split on whitespace and most punctuation, keep CJK ideographs as individual tokens"; the sibling UAX29URLEmailTokenizer additionally keeps URLs and email addresses as single tokens. Whitespace alone is too crude (don't. stays one token, trailing period included); regex-based splitting is the typical hand-rolled implementation.

  3. Token filters — a chain of functions that transform the token stream. Each filter receives a stream and emits a (possibly different) stream. The order matters. A common chain looks like:

    • Lowercase: Samsung → samsung. Trivial but essential. Why: case carries no semantic information for nearly all search queries; users type lowercase 95% of the time and capitalise inconsistently when they don't.
    • ASCII folding: café → cafe, naïve → naive, Zürich → zurich. Maps Unicode characters to their ASCII near-equivalents. Useful for European languages, harmful for Hindi or Chinese (where there is no ASCII equivalent). Most analyzers turn this on selectively.
    • Stopword removal: drop tokens that appear in a fixed list (the, and, is, a, of, to...). Reduces index size by 30-50% in English, has no effect on common queries, and breaks rare phrase queries (to be or not to be becomes empty). Most modern systems leave stopwords in the index but downweight them in BM25.
    • Stemming: reduce inflected forms to a root. running, runs, ran → mostly run. Discussed in detail below.
    • Synonym expansion: at index or query time, expand phone to phone, mobile, handset so that a query using one word finds documents using another. Maintained as a static dictionary or learned from query logs.
    • N-gram generation: produce sub-token grams for autocomplete or fuzzy matching. samsung → sa, sam, sams, samsu, samsun, samsung (edge n-grams). The index grows ~4×, but you can match prefixes in a single lookup.

Notice what stays constant: the same analyzer must run at both index time and query time. If you stem running → run when indexing the document but not when processing the query running, the postings list for running is empty and you find nothing. This is the single most common mistake people make when wiring up Lucene by hand.
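A minimal sketch of that mistake, using a tiny in-memory index. The `STEMS` table is a hard-coded stand-in for a real stemmer, and all function names here are illustrative:

```python
from collections import defaultdict

# Hard-coded stand-in for a stemmer; a real one derives these by rule.
STEMS = {"running": "run", "runs": "run", "shoes": "shoe", "cases": "case"}

def with_stemming(text: str) -> list[str]:
    return [STEMS.get(tok, tok) for tok in text.lower().split()]

def without_stemming(text: str) -> list[str]:
    return text.lower().split()

def build_index(docs: dict[int, str], analyzer) -> dict[str, set[int]]:
    """Map every analysed term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index

def search(index, query: str, analyzer) -> set[int]:
    """AND query: intersect the postings of every analysed query term."""
    postings = [index.get(term, set()) for term in analyzer(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "running shoes", 2: "phone cases"}
index = build_index(docs, with_stemming)   # index stores 'run', not 'running'

print(search(index, "running", with_stemming))     # {1}
print(search(index, "running", without_stemming))  # set()
```

The second search fails silently: no error is raised, the postings list for the unstemmed key is simply absent.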

Stemming versus lemmatisation

Stemming is the workhorse of search; lemmatisation is its slower, fussier cousin. Both reduce a word to a canonical form, but they differ in approach and accuracy.

[Figure: stemming versus lemmatisation. Left panel, Stemming (Snowball), heuristic suffix-stripping rules: running and runs converge on run, while ran stays ran because the rules miss irregular forms. Right panel, Lemmatisation (dictionary), looks up the canonical lemma: better, best and good all converge on good. Stemmer: ~50 ns per word, ~85% accurate. Lemmatiser: ~50 µs per word, ~98% accurate.]
Stemming applies fast heuristic rules and trips on irregular forms. Lemmatisation consults a dictionary and gets the right answer, including for irregular comparatives like "better → good". For most search workloads, stemming is good enough and three orders of magnitude faster.

Stemming is a set of language-specific heuristic rules that chop off suffixes. The Porter stemmer (1980), and its successor Snowball / Porter2 (Martin Porter, 2001), is the de facto standard for English. Snowball's rules look like "if the word ends in -ing and the stem before it has at least one vowel, drop the -ing". Apply that to running and you get runn — then a doubled-consonant rule trims it to run. Apply it to ran and no rule fires, because ran does not end in any of the suffixes Snowball knows about. The stemmer never asks "is ran a form of run?" — it only knows about suffixes.

The output of a stemmer is not necessarily a real word. running becomes run, but relations becomes relat, argument becomes argument, arguing becomes argu. The stems are index keys, not display strings. As long as the same word always stems to the same thing, the index works.

Lemmatisation is the dictionary-based alternative. A lemmatiser knows that ran is the past tense of run, that better is the comparative of good, that mice is the plural of mouse. It returns the lemma — the actual headword you would look up in a dictionary. Implementing one requires a morphological database (WordNet for English, Hindi WordNet for Hindi) and part-of-speech tagging — because saw could be a noun (the tool) or the past tense of see, and the lemma differs.

The trade-off:

| | Stemmer | Lemmatiser |
|---|---|---|
| Speed | ~50 ns/word (a few rules) | ~50 µs/word (dictionary lookup + POS tag) |
| Accuracy on inflections | ~85% (misses irregulars) | ~98% (knows all forms) |
| Output | not necessarily a real word | always a dictionary headword |
| Memory | tiny (rule table) | hundreds of MB (dictionary) |
| Used by | Lucene, Elasticsearch, Solr, almost every search engine | spaCy, NLTK for NLP tasks |

For search, stemming wins almost every time. Speed matters because you stem at index time across billions of tokens, and you stem at query time on every keystroke. The 13% accuracy loss matters less because BM25's bag-of-words scoring is already lossy — losing a few rare forms barely changes ranking.
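The difference in approach can be sketched in a few lines. Both functions below are toys: `toy_stem` knows two suffix rules and `LEMMAS` is a hypothetical six-entry table, whereas real stemmers have dozens of rules and real lemmatisers use a full morphological dictionary plus part-of-speech tagging:

```python
# Hypothetical lemma table standing in for a morphological dictionary.
LEMMAS = {"ran": "run", "running": "run", "runs": "run",
          "better": "good", "best": "good", "mice": "mouse"}

def toy_stem(word: str) -> str:
    """Suffix rules only: never asks whether two words are related."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant left behind by -ing ("runn" -> "run").
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def toy_lemmatise(word: str) -> str:
    """Dictionary lookup: irregular forms resolve, unknown words pass through."""
    return LEMMAS.get(word, word)

for w in ["running", "runs", "ran", "better", "mice"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatise(w)}")
```

The stemmer handles running and runs without any table at all, and leaves ran, better and mice untouched; the lemmatiser resolves all five but only because someone wrote the entries down.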

Stopwords: the case for and against

Stopwords are very common, low-information words: the, and, is, a, of, to, in, that, it, for, with. In English they make up ~30% of token occurrences in typical prose but contribute essentially nothing to whether one document is more relevant than another to a topical query. Removing them from the index used to be standard practice: it cut index size by 30-50% and sped up queries.

Modern systems are more careful. The query to be or not to be consists entirely of stopwords. If you removed them, the famous Hamlet line is invisible. The query the who (the band) becomes empty. So most contemporary engines either:

  • Keep stopwords in the index, but downweight them through IDF (explained with BM25 in a later chapter: common words have low IDF, so they contribute little to the score automatically), or
  • Use adaptive stopwords: words that are common in this corpus (e.g. bank in a corpus of banking documents) get treated as stopwords for that corpus, while truly rare-elsewhere words get the full weight.

Lucene's English analyzer, for compatibility, still removes a fixed list of 33 English stopwords by default. You can disable this. Most production deployments do.
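To see why keeping stopwords is cheap in scoring terms, here is a small sketch of the two strategies listed above. The IDF formula is the common BM25 form; the corpus size, document frequencies and the 0.8 threshold are made-up numbers for illustration:

```python
import math

def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """IDF in the form commonly used by BM25: rarer terms get larger weights."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def adaptive_stopwords(doc_freqs: dict[str, int], num_docs: int,
                       threshold: float = 0.8) -> set[str]:
    """Treat any term that appears in more than `threshold` of documents as a stopword."""
    return {term for term, df in doc_freqs.items() if df / num_docs > threshold}

# Hypothetical corpus statistics: one million documents.
N = 1_000_000
doc_freqs = {"the": 990_000, "be": 700_000, "hamlet": 120}

for term, df in doc_freqs.items():
    print(f"{term:7} idf={bm25_idf(N, df):.3f}")

print(adaptive_stopwords(doc_freqs, N))  # {'the'}
```

With IDF in place, a term that appears almost everywhere contributes a weight close to zero, so the query to be or not to be still retrieves something without the common words dominating the ranking.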

Language analyzers

The pipeline above describes English. Each language needs its own pipeline because every language has different rules for tokenisation, normalisation, and morphology.

[Figure: language-specific analyzer chains in Lucene. English (Snowball): tokenize → lowercase → English stopwords → Snowball (Porter2) stemmer. Hindi (Indic Snowball): tokenize → Indic normalise (matras such as ी, halant ्) → Hindi stopwords (ने, की) → Indic stemmer. Tamil (Tamil analyzer): tokenize → Tamil normalise → Tamil stopwords → agglutinative stemmer. Each Indian language gets its own pipeline; Chinese needs jieba or ICU because words have no spaces. A multilingual index either picks one analyzer per field or uses ICU as a generic Unicode fallback.]
Three different language analyzers, all wired into the same Lucene framework but with completely different filter chains and stopword lists. Indic Snowball understands that the matra "ी" attached to a base character is part of inflection, not a separate token.

English uses the Snowball (Porter2) stemmer, the standard English stopword list, and the standard Unicode tokenizer. The tokens come out lowercase ASCII.

Hindi and other Indic languages need a special tokenizer that understands the Devanagari script. Naive whitespace splitting works for word boundaries (Devanagari does use spaces between words), but normalisation has to handle the matra characters, vowel signs like ी (long i) and ु (short u) attached to consonants, and the halant (्) that suppresses the inherent vowel. The Lucene HindiAnalyzer ships with a Hindi stopword list (का, की, के, ने, से, में, पर...) and an Indic Snowball variant that knows about Hindi noun and verb suffixes (-ों plural, -ता agentive, -कर conjunctive).
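One concrete reason Latin-style folding cannot be reused for Devanagari: the usual "decompose and drop combining marks" trick removes the halant. The sketch below uses only Python's standard `unicodedata` module; `fold` is an illustrative helper, not a Lucene API:

```python
import unicodedata

def fold(token: str) -> str:
    """Accent stripping as commonly done for Latin text: NFKD, then drop
    every character with a non-zero canonical combining class."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

plan = "\u092a\u094d\u0932\u093e\u0928"   # प्लान, written with a halant (U+094D)

# Inspect each code point: the halant has combining class 9,
# while the vowel sign AA (U+093E) has combining class 0.
for ch in plan:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"category={unicodedata.category(ch)} ccc={unicodedata.combining(ch)}")

print(fold("caf\u00e9"))      # cafe
print(fold(plan) == plan)     # False: the halant was stripped
```

For café the fold is harmless. For प्लान it silently produces a different word, which is why Indic pipelines use script-aware normalisation instead.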

Tamil gets its own TamilAnalyzer in Lucene because Tamil is morphologically agglutinative: a single Tamil word can encode subject, object, tense, and politeness. The stemmer has to peel back many more layers than the English Snowball, and the stopword list is different.

Chinese, Japanese, Korean require an entirely different tokenisation strategy because written Chinese has no spaces between words. The two practical approaches are dictionary-based segmentation (Lucene's SmartChineseAnalyzer, the jieba library) which uses a probabilistic model to find the most likely word boundaries, and ICU-based segmentation which uses ICU's boundary analysis for languages it knows. Both produce noisy boundaries — search quality on CJK is meaningfully harder than on space-delimited languages.

A typical multilingual setup in Elasticsearch creates one indexed field per language (title.en analyzed by english, title.hi analyzed by hindi, title.ta analyzed by tamil) and routes the query to the right field based on detected language.
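As a rough sketch, the mapping and the routing step might look like this when written as Python dicts. The analyzer names are the ones used in the text (check that your Elasticsearch version ships each of them), and `detect_field` is a hypothetical script-based router, far cruder than a real language detector:

```python
# Multi-field mapping: one sub-field of "title" per language.
MAPPING = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "en": {"type": "text", "analyzer": "english"},
                    "hi": {"type": "text", "analyzer": "hindi"},
                    "ta": {"type": "text", "analyzer": "tamil"},  # name as given in the text
                },
            }
        }
    }
}

def detect_field(query: str) -> str:
    """Hypothetical router: choose a sub-field from the script of the query."""
    for ch in query:
        if "\u0900" <= ch <= "\u097f":   # Devanagari block
            return "title.hi"
        if "\u0b80" <= ch <= "\u0bff":   # Tamil block
            return "title.ta"
    return "title.en"

def build_query(query: str) -> dict:
    """Build a match query against the sub-field for the detected language."""
    return {"query": {"match": {detect_field(query): query}}}

print(build_query("samsung phone"))
print(build_query("\u0930\u093f\u0932\u093e\u092f\u0902\u0938"))  # रिलायंस
```

Script detection is enough to separate Hindi and Tamil from English, but it cannot tell Hindi from Marathi or English from French; those cases need a statistical language detector.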

Building it in Python

Here is the standard analyzer pipeline implemented in roughly forty lines of Python, using a regular expression for the tokenizer, NLTK for the stopword list, and PyStemmer for the Snowball implementation.

import html
import re
import unicodedata

import nltk
from nltk.corpus import stopwords
import Stemmer  # PyStemmer: C implementation of Snowball

nltk.download('stopwords', quiet=True)  # fetch the stopword corpus on first run
EN_STOPS = set(stopwords.words('english'))
EN_STEMMER = Stemmer.Stemmer('english')

def char_filter(text: str) -> str:
    text = html.unescape(text)                  # decode HTML entities (&amp; → &)
    text = unicodedata.normalize('NFC', text)   # canonical Unicode form
    return text

def tokenize(text: str) -> list[str]:
    return re.findall(r"[\w]+(?:['-][\w]+)*", text, re.UNICODE)

def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def ascii_fold(tokens: list[str]) -> list[str]:
    out = []
    for t in tokens:
        decomposed = unicodedata.normalize('NFKD', t)
        ascii_form = ''.join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ascii_form)
    return out

def remove_stopwords(tokens: list[str], stops: set[str]) -> list[str]:
    return [t for t in tokens if t not in stops]

def stem(tokens: list[str], stemmer) -> list[str]:
    return stemmer.stemWords(tokens)

def english_analyzer(text: str) -> list[str]:
    text = char_filter(text)
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = ascii_fold(tokens)
    tokens = remove_stopwords(tokens, EN_STOPS)
    tokens = stem(tokens, EN_STEMMER)
    return tokens

Each function is a transformation on the token stream. They compose. To swap in a different analyzer — say one without stopword removal — you just rewrite english_analyzer to skip that step. Why: this is the same separation Lucene enforces in code; analyzers are constructed from a tokenizer plus a list of token filters, and you can recombine them freely.
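One way to make that recombination explicit is to build analyzers from a tokenizer and a list of filters, rather than hard-coding the call sequence. This is a sketch of the idea, not Lucene's API; `make_analyzer` and the three-word stopword set are illustrative:

```python
import re
from functools import reduce
from typing import Callable

TokenFilter = Callable[[list[str]], list[str]]

def make_analyzer(tokenizer: Callable[[str], list[str]],
                  filters: list[TokenFilter]) -> Callable[[str], list[str]]:
    """Return an analyzer that runs the tokenizer, then each filter in order."""
    def analyze(text: str) -> list[str]:
        return reduce(lambda tokens, f: f(tokens), filters, tokenizer(text))
    return analyze

def word_tokenizer(text: str) -> list[str]:
    return re.findall(r"\w+", text)

def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def drop_stops(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in {"the", "a", "of"}]

# Two analyzers that differ only in their filter list.
keep_stops = make_analyzer(word_tokenizer, [lowercase])
no_stops = make_analyzer(word_tokenizer, [lowercase, drop_stops])

print(keep_stops("The Who"))  # ['the', 'who']
print(no_stops("The Who"))    # ['who']
```

Because the filter list is data, the order dependency is visible: putting `drop_stops` before `lowercase` would let "The" slip through.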

For Hindi, the structure is identical but the stopword list and stemmer change:

HI_STOPS = {'का', 'की', 'के', 'को', 'ने', 'से', 'में', 'पर',
            'और', 'है', 'हैं', 'था', 'थी', 'थे', 'यह', 'वह',
            'एक', 'भी', 'तो', 'जो', 'कि', 'या', 'इस'}

def hindi_normalize(text: str) -> str:
    # NFC already gives nukta letters one canonical spelling: the precomposed
    # code points (e.g. U+0958 क़) are rewritten as base letter + nukta.
    return unicodedata.normalize('NFC', text)

def hindi_stem(token: str) -> str:
    # Indic Snowball-style suffix stripping, simplified
    SUFFIXES = ['ों', 'ें', 'ाओं', 'ाएं', 'कर', 'ता', 'ती', 'ते', 'ना', 'ने', 'ी', 'े', 'ा']
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 1:
            return token[:-len(suf)]
    return token

def hindi_analyzer(text: str) -> list[str]:
    text = hindi_normalize(text)
    tokens = re.findall(r'[ऀ-ॿ]+', text)  # Devanagari range
    tokens = [t for t in tokens if t not in HI_STOPS]
    tokens = [hindi_stem(t) for t in tokens]
    return tokens

The real Lucene HindiStemmer is more rigorous (it uses a longer suffix table and consults vowel patterns), but this captures the shape of the algorithm.

An Indian news search engine

You are building a search engine for an Indian news aggregator. The corpus contains articles in English, Hindi, and Tamil. To get a feel for what your analyzer does to text, run three sample headlines through the matching language analyzer and look at the output tokens — these are the strings that go into the inverted index.

import nltk
nltk.download('stopwords', quiet=True)

# Article 1: English
en_article = "BharatGroup announces new AakashTel plans for rural India"
print("EN tokens:", english_analyzer(en_article))
# → ['bharatgroup', 'announc', 'new', 'aakashtel', 'plan', 'rural', 'india']

# Article 2: Hindi
hi_article = "रिलायंस ने नया जिओ प्लान घोषित किया"
print("HI tokens:", hindi_analyzer(hi_article))
# → ['रिलायंस', 'नय', 'जिओ', 'प्लान', 'घोषित', 'किय']
# (the simplified suffix list also clips नया to नय)

# Article 3: Mixed (English brand inside Hindi text — common in Indian news)
mixed = "AakashTel का नया प्लान 5G के लिए"
# Need to detect script per token and route to the right analyzer
def mixed_analyzer(text: str) -> list[str]:
    out = []
    for token in re.findall(r'\S+', text):
        if re.search(r'[ऀ-ॿ]', token):
            out.extend(hindi_analyzer(token))
        else:
            out.extend(english_analyzer(token))
    return out

print("MIX tokens:", mixed_analyzer(mixed))
# → ['aakashtel', 'नय', 'प्लान', '5g', 'लिए']

Notice three things in the output. First, announces became announc and plans became plan (Snowball stripped the plural -s and then the trailing -e), while BharatGroup and AakashTel were only lowercased, because no suffix rule applies to them. Second, for was dropped because it is in NLTK's English stopword list. Third, in the mixed example, 5G came through as the single token 5g, because the tokenizer keeps mixed digits and letters together.

In the Hindi output, ने was dropped as a stopword and किया lost its suffix to become किय (the verb stem). रिलायंस survived unchanged because it does not end in any of the Hindi suffixes. Why: stemming is suffix-driven; loanwords from English do not match Hindi morphological patterns, so they pass through.

Now imagine a user types रिलायंस के प्लान. The query analyzer runs the same Hindi pipeline and produces ['रिलायंस', 'प्लान'] (because के is a stopword). Both terms are looked up in the inverted index, the postings lists are intersected, and Article 2 comes back as a hit. Without analysis, the query would have been three different strings and Article 2 would not have matched — even though it is exactly what the user wanted.

Trade-offs to make explicitly

Every analyzer choice is a trade-off between precision (the fraction of returned results that are relevant) and recall (the fraction of relevant documents that are returned).

Aggressive stemming raises recall (running also matches documents that say runs or run) but lowers precision: university and universe both stem to univers, conflating unrelated meanings. The Lancaster stemmer, more aggressive than Snowball, is famous for producing nonsense conflations like organisation → organ.
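A short worked example of the two measures, with made-up document IDs. Suppose four documents are truly about universities, a light stemmer returns two of them, and an aggressive stemmer returns all four plus four documents about the universe:

```python
def precision_recall(returned: set[int], relevant: set[int]) -> tuple[float, float]:
    """Precision = hits / returned, recall = hits / relevant."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}                           # truly about universities
light_stemming = {1, 2}                           # misses inflected variants
aggressive_stemming = {1, 2, 3, 4, 7, 8, 9, 10}   # also pulls in "universe" documents

print(precision_recall(light_stemming, relevant))       # (1.0, 0.5)
print(precision_recall(aggressive_stemming, relevant))  # (0.5, 1.0)
```

Neither configuration is better in the abstract; which one to prefer depends on whether missing a relevant document or showing an irrelevant one costs more for the workload.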

Aggressive stopword removal speeds up indexing and shrinks the index, but breaks any query that depends on common words. Vitamin A becomes vitamin because a is a stopword. IT industry becomes industri. Modern systems mostly leave stopwords in.

Synonym expansion at index time doubles or triples the index size (every phone token also indexes mobile, handset) but makes the query side trivial. Synonym expansion at query time keeps the index small but slows every query (each query token expands to many lookups). Most production systems do query-time expansion.
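Query-time expansion can be sketched as "OR within a synonym group, AND across groups". The synonym table and the four-term index below are hypothetical:

```python
SYNONYMS = {"phone": ["mobile", "handset"]}   # hypothetical synonym dictionary

# Tiny inverted index: term -> set of document IDs.
index = {"samsung": {1, 2}, "phone": {1}, "handset": {2}, "mobile": {3}}

def expand(tokens: list[str]) -> list[list[str]]:
    """Turn each query token into a group holding the token and its synonyms."""
    return [[tok, *SYNONYMS.get(tok, [])] for tok in tokens]

def search(groups: list[list[str]]) -> set[int]:
    """Union the postings inside each group, then intersect across groups."""
    result = None
    for group in groups:
        docs = set().union(*(index.get(term, set()) for term in group))
        result = docs if result is None else result & docs
    return result if result is not None else set()

plain = [[tok] for tok in ["samsung", "phone"]]
expanded = expand(["samsung", "phone"])

# Each line prints the matching documents and the number of term lookups.
print(search(plain), sum(len(g) for g in plain))        # {1} 2
print(search(expanded), sum(len(g) for g in expanded))  # {1, 2} 4
```

The expanded query finds the document that says handset, at the cost of twice as many postings lookups; the index itself is untouched, so the synonym table can change without a reindex.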

Edge n-grams for autocomplete grow the index by ~4× and make prefix search instant. They are essential if you want a "search-as-you-type" experience but pure overhead if you do not.

The right answer is workload-specific, which is why Lucene exposes every dial.

Real systems

Elasticsearch ships about 35 built-in language analyzers — English, Arabic, Bengali, Chinese (via SmartCN or ICU), Hindi, Tamil, Telugu, and many more — and exposes them as "analyzer": "hindi" in field mappings. You can also build custom analyzers by composing a char_filter, tokenizer, and filter chain in JSON, and the engine wires up the pipeline at startup.

MeiliSearch takes a more opinionated approach — it auto-detects language and applies sensible defaults, with no per-field configuration required. This costs flexibility but pays back in default quality for non-experts.

Tantivy (a Lucene-inspired search library written in Rust, used by Quickwit) reimplements the tokenizer-and-filter framework with the same shape, exposing a small set of built-in stemmers (Snowball for the major European languages) and letting users plug in custom token filters.

Anserini is a Java toolkit built on Lucene specifically for IR research; it is the reference implementation people benchmark against in the TREC competitions, and a good place to look if you want to see "the same analyzer the academic IR community uses".

For pure tokenisation needs, the ICU project's BreakIterator is the highest-quality general-purpose Unicode tokenizer — it is what Lucene's ICUTokenizer wraps, and what Chrome and Firefox use for word selection on double-click.

Common confusions

  • "Tokenisation just means text.split()." Whitespace splitting fails on punctuation (don't. stays glued to its trailing period), on URLs (https://flipkart.com should stay one token, but once you add punctuation splitting to fix the first problem it shatters into garbage), on Devanagari with attached matras, and entirely on Chinese where there are no spaces between words. The Lucene standard tokenizer follows UAX #29, a 60-page Unicode specification, and even that is wrong for Chinese, Japanese, and Thai. Real tokenisation is the hardest stage of the pipeline, not the easiest.

  • "Stemming and lemmatisation are the same thing, one is just slower." They produce different outputs and serve different jobs. The Snowball stemmer maps ran to ran (the rule does not fire) — it keeps ran and run as separate index keys, so a query for running will not find a document that says ran. A lemmatiser maps both to run. For search, that recall miss matters; we accept it because stemming is three orders of magnitude faster and the BM25 ranker partially absorbs the loss. They are not interchangeable — if your problem is grammatical analysis (spaCy, parsing), use a lemmatiser; if it is bag-of-words search, use a stemmer.

  • "Removing stopwords always speeds up search and improves quality." Stopword removal breaks every query that depends on common words: the who (the band), to be or not to be (the Hamlet line), Vitamin A, IT industry. It also breaks phrase queries because the position offsets shift. Modern engines (Elasticsearch since ~2015, Lucene's recommendation since the same era) keep stopwords in the index and let IDF, covered with BM25 in a later chapter, downweight them automatically. Stopword removal is a 1990s optimisation that paid back when index storage was expensive. It is a quality regression on modern hardware.

  • "Indexed text and query text go through different pipelines." They must go through the exact same analyzer in 99% of cases. If you stem running → run at index time but skip stemming on the query running, the postings list for running is empty and you find nothing. Elasticsearch applies the field's analyzer to queries by default; the search_analyzer override exists only for the edge-n-gram autocomplete case. People wiring up Lucene by hand for the first time hit this bug constantly. It manifests as "search works for some words but not others", because some words are stem-invariant (jio, india) and some are not (announces, running).

  • "Indic Snowball is just English Snowball with a Devanagari character set." The two algorithms are unrelated. English Snowball strips suffixes like -ing, -ed, -s, -ly. Indic Snowball strips entirely different morphology — -ों (oblique plural), -कर (conjunctive), -ता (agentive), and has rules about halant and matra combinations that have no English analogue. Cloning the English analyzer and feeding it Hindi gives unusable garbage; the Lucene HindiAnalyzer is a separate codebase.

  • "You can switch analyzers without rebuilding the index." No. The strings stored in the inverted index are the output of the analyzer (announc, not announces). If you change analyzer, the new analyzer's output for a query will not match the old analyzer's output sitting in the index. Production migrations require a full reindex and usually a dual-write period. This makes analyzer choice a long-lived schema decision, not a runtime knob.

Going deeper: the analyzer is part of your schema

The single most important operational property of an analyzer is that it is part of your index schema, not part of your query layer. If you change the analyzer — switch from English Snowball to Lancaster, add a new stopword, change ASCII folding to Unicode folding — the strings stored in the existing inverted index no longer match what the new analyzer would produce. You cannot reanalyze a query against a stale index. The entire index must be rebuilt.

This makes analyzer choice a long-lived decision in the same way that picking a primary key is. Production teams that have to migrate analyzers — usually because they realised some query class was failing systematically — generally do it via a dual-write scheme: build the new index alongside the old, send each query to both, compare results, and cut over only when the new index is demonstrably better. This can take weeks for a large corpus.

A second consequence: the analyzer is expensive at write time. Snowball is fast (tens of nanoseconds per word), but ICU-based Chinese tokenisation is slow (milliseconds per document) and lemmatisation can be slower still. Indexing throughput is often analyzer-bound, not disk-bound. If you are indexing billions of documents, every microsecond per token matters.

A third consequence relevant to multitenant systems: if you let users configure their own analyzers, every customer effectively has a different schema, and the operator burden of managing dozens of pipeline variants becomes nontrivial. Most hosted search services restrict customers to picking from a fixed menu.

Beyond stemming: BPE and learned tokenisers

The deep-learning era brought a new family of tokenisers — byte-pair encoding (BPE), WordPiece, SentencePiece — which learn a sub-word vocabulary from the corpus rather than relying on hand-coded rules. running might be tokenised as [run, ##ning]; unbelievable as [un, ##believ, ##able]. The advantages are language-agnostic operation and graceful handling of rare words; the disadvantages are loss of morphological transparency and worse fit for inverted-index lookup (you would have to tokenise queries the same way and search for sub-word sequences, which postings-list intersection is not designed for).
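The splits shown above can be reproduced with the greedy longest-match-first procedure that WordPiece-style tokenisers apply at inference time. This sketch covers only that lookup step, not how the vocabulary is learned, and the six-entry `VOCAB` is hypothetical:

```python
# Hypothetical vocabulary; "##" marks a piece that continues a word.
VOCAB = {"run", "##ning", "un", "##believ", "##able", "[UNK]"}

def wordpiece(word: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]   # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("running"))       # ['run', '##ning']
print(wordpiece("unbelievable"))  # ['un', '##believ', '##able']
print(wordpiece("samsung"))       # ['[UNK]']
```

Note that run and ##ning are separate keys: a query for run would not match a document tokenised as ##ning, which is part of why these pieces sit awkwardly in an inverted index.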

Most search systems still use rule-based stemmers. BPE-style tokenisers dominate in retrieval systems built around dense vector search (where the tokens are inputs to a neural encoder, not direct index keys), which is the world covered later in Build 18.

Why "the same analyzer at index and query time" actually fails sometimes

There is one case where you deliberately use different analyzers at index and query time: edge n-gram autocomplete. At index time, samsung is indexed as [s, sa, sam, sams, samsu, samsun, samsung]. At query time, the user typing sams should match — but if you ran the n-gram filter at query time too, the query would expand to [s, sa, sam, sams] and you would do four lookups for what should be one. So you index with n-grams and query without them, using a separate analyzer for the search side.

This is the only standard exception, and it is so common that Elasticsearch lets you specify analyzer and search_analyzer separately on every field (in raw Lucene you pass a different Analyzer to the query parser). If you find yourself needing more than this exception, you are probably doing something wrong.
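A minimal sketch of the asymmetric setup, with edge n-grams on the index side only. The function names are illustrative and the two-document corpus is made up; the OR-style lookup mirrors how a plain match query combines terms:

```python
from collections import defaultdict

def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    """All prefixes of the token between min_gram and max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_analyzer(text: str) -> list[str]:
    """Index side: lowercase, split, then expand every token into edge n-grams."""
    return [gram for tok in text.lower().split() for gram in edge_ngrams(tok)]

def search_analyzer(text: str) -> list[str]:
    """Search side: lowercase and split only, no n-gram expansion."""
    return text.lower().split()

docs = {1: "Samsung Galaxy", 2: "Sony Xperia"}
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in index_analyzer(text):
        index[term].add(doc_id)

def lookup(query: str, analyzer) -> tuple[list[str], set[int]]:
    """OR together the postings of every analysed query term."""
    terms = analyzer(query)
    hits = set().union(*(index.get(t, set()) for t in terms))
    return terms, hits

print(lookup("sams", search_analyzer))  # (['sams'], {1})
print(lookup("sams", index_analyzer))   # (['s', 'sa', 'sam', 'sams'], {1, 2})
```

Running the n-gram filter on the query side costs four lookups instead of one and, in this sketch, also drags in the Sony document through the shared prefix s.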

What's next

You now have a function that turns raw text into a list of canonical tokens, and you have an inverted index that maps tokens to docID postings lists. The next chapter compresses those postings lists — a billion-document corpus can have postings lists with hundreds of millions of entries each, and storing them as plain int64 arrays would consume terabytes. Skip pointers, delta encoding, and variable-byte compression typically cut the size by 5-10× while making queries faster, not slower. From there, BM25 turns presence-or-absence matches into ranked relevance scores, and you have the core of every modern search engine.

References