Tokenization, Stemming, Stopwords, Language Analyzers

In short

The inverted index of the previous chapter maps a term to the list of documents that contain it — but it can only do that if the terms in the index and the terms in the query match exactly, byte for byte. Real users type samsung phone when the document says Samsung Phones. Real Hindi users type रिलायंस when the article says रिलायंस ने. Real document text contains HTML entities, mixed case, accents, hyphens, and punctuation. The component that turns raw text into the canonical, comparable tokens that go into the postings lists — and the same component that processes the query at search time — is the analyzer. A standard analyzer is a pipeline of three stages: a character filter (Unicode normalisation, HTML entity decoding), a tokenizer (split into tokens, usually on whitespace and punctuation, but Chinese and Japanese need real segmentation), and a chain of token filters (lowercase, ASCII fold, remove stopwords, stem, expand synonyms, optionally generate n-grams). Stemming chops suffixes by heuristic — running → run, shoes → shoe — fast but crude; lemmatisation looks up the dictionary form — ran → run, better → good — slow but accurate. Different languages need different analyzers: English uses Snowball/Porter2, Hindi uses an Indic Snowball variant that knows about matras and halant, Tamil has its own analyzer for its agglutinative morphology, Chinese uses ICU or jieba because words do not have spaces. Lucene ships about thirty language analyzers; Elasticsearch wraps them; MeiliSearch and Tantivy reimplement the popular ones. The same analyzer must be used at index time and query time, or nothing matches.

In chapter 143 you built an inverted index. You wrote tokenize(text) as a placeholder that returned text.lower().split(). That placeholder works on a few test sentences and fails on every real document you will ever encounter. This chapter is what you put inside tokenize() — and it is, by some distance, the most language-dependent and culturally-loaded component of any search engine.

The thesis: matching requires normalisation

Suppose you index this sentence:

"Samsung's new Galaxy phones — café-quality cameras, ₹89,999."

A user searches for samsung phone. Should it match? Of course it should. But naively, the index contains Samsung's (with an apostrophe and capital S), phones (plural), and the query contains samsung (lowercase) and phone (singular). Byte for byte, nothing matches. Why: a hash table — which is what the term dictionary of an inverted index is — only retrieves a value if the lookup key is exactly equal to the stored key, bit for bit. "Samsung's" and "samsung" are different strings; one lookup will not find the other.

The fix is to run both the indexed text and the query text through the same deterministic transformation pipeline. That pipeline is the analyzer. After analysis the document tokens become roughly [samsung, new, galaxi, phone, cafe, qualiti, camera, 89999] and the query tokens become [samsung, phone] — and now the lookups hit. The analyzer's job is to make the two sides agree on a canonical form for every word.

This is why a search engine is, in practice, much more an exercise in linguistics and Unicode handling than in algorithms. The B-tree underneath is small. The pipeline on top is a small NLP system.

The standard analyzer pipeline

Lucene — the open-source library underneath Elasticsearch, Solr, OpenSearch, and dozens of other systems — defines an analyzer as a fixed three-stage pipeline. Every modern search engine borrows this structure, even if the names differ.

The standard analyzer pipeline. Each box is a deterministic function on the token stream of the previous stage. Note that this is the *normalisation* part — stemming, stopwording and synonym expansion are extra token filters that slot in before the final output.

The three stages, in Lucene's vocabulary, are:

Character filter — runs on the raw text before tokenisation. Decodes HTML entities (& → &), normalises Unicode (NFC), strips control characters. Often a no-op for plain text but essential for indexing scraped HTML or user-submitted markdown. Why: tokenisation rules assume a clean Unicode string; if a document body still contains   or smart quotes, every downstream stage will mistreat them.
Tokenizer — splits the cleaned text into a stream of tokens. The standard tokenizer in Lucene follows the Unicode Text Segmentation rules (UAX #29) which roughly says "split on whitespace and most punctuation, keep CJK characters as individual tokens, recognise URLs and emails as single tokens". Whitespace alone is too crude (don't. becomes don't.); regex-based splitting is the typical implementation.
Token filters — a chain of functions that transform the token stream. Each filter receives a stream and emits a (possibly different) stream. The order matters. A common chain looks like:
- Lowercase: Samsung → samsung. Trivial but essential. Why: case carries no semantic information for nearly all search queries; users type lowercase 95% of the time and capitalise inconsistently when they don't.
- ASCII folding: café → cafe, naïve → naive, Zürich → zurich. Maps Unicode characters to their ASCII near-equivalents. Useful for European languages, harmful for Hindi or Chinese (where there is no ASCII equivalent). Most analyzers turn this on selectively.
- Stopword removal: drop tokens that appear in a fixed list (the, and, is, a, of, to...). Reduces index size by 30-50% in English, has no effect on common queries, and breaks rare phrase queries (to be or not to be becomes empty). Most modern systems leave stopwords in the index but downweight them in BM25.
- Stemming: reduce inflected forms to a root. running, runs, ran → mostly run. Discussed in detail below.
- Synonym expansion: at index or query time, replace phone with phone, mobile, handset so queries find each other. Maintained as a static dictionary or learned from query logs.
- N-gram generation: produce sub-token grams for autocomplete or fuzzy matching. samsung → sa, sam, sams, samsu, samsun, samsung (edge n-grams). The index grows ~4×, but you can match prefixes in a single lookup.

Notice what stays constant: the same analyzer must run at both index time and query time. If you stem running → run when indexing the document but not when processing the query running, the postings list for running is empty and you find nothing. This is the single most common mistake people make when wiring up Lucene by hand.

Stemming versus lemmatisation

Stemming is the workhorse of search; lemmatisation is its slower, fussier cousin. Both reduce a word to a canonical form, but they differ in approach and accuracy.

Stemming applies fast heuristic rules and trips on irregular forms. Lemmatisation consults a dictionary and gets the right answer, including for irregular comparatives like "better → good". For most search workloads, stemming is good enough and three orders of magnitude faster.

Stemming is a set of language-specific heuristic rules that chop off suffixes. The Porter stemmer (1980), and its successor Snowball / Porter2 (Martin Porter, 2001), is the de facto standard for English. Snowball's rules look like "if the word ends in -ing and the stem before it has at least one vowel, drop the -ing". Apply that to running and you get runn — then a doubled-consonant rule trims it to run. Apply it to ran and no rule fires, because ran does not end in any of the suffixes Snowball knows about. The stemmer never asks "is ran a form of run?" — it only knows about suffixes.

The output of a stemmer is not necessarily a real word. running becomes run, but relations becomes relat, argument becomes argument, arguing becomes argu. The stems are index keys, not display strings. As long as the same word always stems to the same thing, the index works.

Lemmatisation is the dictionary-based alternative. A lemmatiser knows that ran is the past tense of run, that better is the comparative of good, that mice is the plural of mouse. It returns the lemma — the actual headword you would look up in a dictionary. Implementing one requires a morphological database (WordNet for English, Hindi WordNet for Hindi) and part-of-speech tagging — because saw could be a noun (the tool) or the past tense of see, and the lemma differs.

The trade-off:

	Stemmer	Lemmatiser
Speed	~50 ns/word (a few rules)	~50 µs/word (dictionary lookup + POS tag)
Accuracy on inflections	~85% (misses irregulars)	~98% (knows all forms)
Output	not necessarily a real word	always a dictionary headword
Memory	tiny (rule table)	hundreds of MB (dictionary)
Used by	Lucene, Elasticsearch, Solr, almost every search engine	spaCy, NLTK for NLP tasks

For search, stemming wins almost every time. Speed matters because you stem at index time across billions of tokens, and you stem at query time on every keystroke. The 13% accuracy loss matters less because BM25's bag-of-words scoring is already lossy — losing a few rare forms barely changes ranking.

Stopwords: the case for and against

Stopwords are very common, low-information words: the, and, is, a, of, to, in, that, it, for, with. In English they make up ~30% of token occurrences in typical prose but contribute essentially nothing to whether one document is more relevant than another to a topical query. Removing them from the index used to be standard practice — it cut index size in half and sped up queries.

Modern systems are more careful. The query to be or not to be consists entirely of stopwords. If you removed them, the famous Hamlet line is invisible. The query the who (the band) becomes empty. So most contemporary engines either:

Keep stopwords in the index, but downweight them through IDF (which the next chapter explains — common words have low IDF, so they contribute little to the score automatically), or
Use adaptive stopwords: words that are common in this corpus (e.g. bank in a corpus of banking documents) get treated as stopwords for that corpus, while truly rare-elsewhere words get the full weight.

Lucene's English analyzer, for compatibility, still removes a fixed list of 33 English stopwords by default. You can disable this. Most production deployments do.

Language analyzers

The pipeline above describes English. Each language needs its own pipeline because every language has different rules for tokenisation, normalisation, and morphology.

Three different language analyzers, all wired into the same Lucene framework but with completely different filter chains and stopword lists. Indic Snowball understands that the matra "ी" attached to a base character is part of inflection, not a separate token.

English uses the Snowball (Porter2) stemmer, the standard English stopword list, and the standard Unicode tokenizer. The tokens come out lowercase ASCII.

Hindi and other Indic languages need a special tokenizer that understands the Devanagari script. Naive whitespace splitting works for word boundaries (Devanagari does use spaces between words), but normalisation has to handle the matra characters — vowel signs like ी (long i) and ु (short u) attached to consonants — and the halant ् that suppresses the inherent vowel. The Lucene HindiAnalyzer ships with a Hindi stopword list (का, की, के, ने, से, में, पर...) and an Indic Snowball variant that knows about Hindi noun and verb suffixes (-ों plural, -ता agentive, -कर conjunctive).

Tamil has Lucene's separate TamilAnalyzer because Tamil is morphologically agglutinative — a single Tamil word can encode subject, object, tense, and politeness. The stemmer has to peel back many more layers than the English Snowball, and the stopword list is different.

Chinese, Japanese, Korean require an entirely different tokenisation strategy because written Chinese has no spaces between words. The two practical approaches are dictionary-based segmentation (Lucene's SmartChineseAnalyzer, the jieba library) which uses a probabilistic model to find the most likely word boundaries, and ICU-based segmentation which uses ICU's boundary analysis for languages it knows. Both produce noisy boundaries — search quality on CJK is meaningfully harder than on space-delimited languages.

A typical multilingual setup in Elasticsearch creates one indexed field per language — title.en analyzed by english, title.hi analyzed by hindi, title.ta analyzed by tamil — and routes the query to the right field based on detected language.

Building it in Python

Here is the standard analyzer pipeline implemented in roughly forty lines of Python, using NLTK for the tokenizer and stopwords and PyStemmer for the Snowball implementation.

import re
import unicodedata
from nltk.corpus import stopwords
import Stemmer  # PyStemmer — C implementation of Snowball

EN_STOPS = set(stopwords.words('english'))
EN_STEMMER = Stemmer.Stemmer('english')

def char_filter(text: str) -> str:
    text = unicodedata.normalize('NFC', text)
    text = re.sub(r'&[a-z]+;', ' ', text)        # strip HTML entities
    return text

def tokenize(text: str) -> list[str]:
    return re.findall(r"[\w]+(?:['-][\w]+)*", text, re.UNICODE)

def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def ascii_fold(tokens: list[str]) -> list[str]:
    out = []
    for t in tokens:
        decomposed = unicodedata.normalize('NFKD', t)
        ascii_form = ''.join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ascii_form)
    return out

def remove_stopwords(tokens: list[str], stops: set[str]) -> list[str]:
    return [t for t in tokens if t not in stops]

def stem(tokens: list[str], stemmer) -> list[str]:
    return stemmer.stemWords(tokens)

def english_analyzer(text: str) -> list[str]:
    text = char_filter(text)
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = ascii_fold(tokens)
    tokens = remove_stopwords(tokens, EN_STOPS)
    tokens = stem(tokens, EN_STEMMER)
    return tokens

Each function is a transformation on the token stream. They compose. To swap in a different analyzer — say one without stopword removal — you just rewrite english_analyzer to skip that step. Why: this is the same separation Lucene enforces in code; analyzers are constructed from a tokenizer plus a list of token filters, and you can recombine them freely.

For Hindi, the structure is identical but the stopword list and stemmer change:

HI_STOPS = {'का', 'की', 'के', 'को', 'ने', 'से', 'में', 'पर',
            'और', 'है', 'हैं', 'था', 'थी', 'थे', 'यह', 'वह',
            'एक', 'भी', 'तो', 'जो', 'कि', 'या', 'पर', 'इस'}

def hindi_normalize(text: str) -> str:
    text = unicodedata.normalize('NFC', text)
    # Collapse nukta variants: क़ → क + nukta as canonical
    return text

def hindi_stem(token: str) -> str:
    # Indic Snowball-style suffix stripping, simplified
    SUFFIXES = ['ों', 'ें', 'ाओं', 'ाएं', 'कर', 'ता', 'ती', 'ते', 'ना', 'ने', 'ी', 'े', 'ा']
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 1:
            return token[:-len(suf)]
    return token

def hindi_analyzer(text: str) -> list[str]:
    text = hindi_normalize(text)
    tokens = re.findall(r'[ऀ-ॿ]+', text)  # Devanagari range
    tokens = [t for t in tokens if t not in HI_STOPS]
    tokens = [hindi_stem(t) for t in tokens]
    return tokens

The real Lucene HindiStemmer is more rigorous (it uses a longer suffix table and consults vowel patterns), but this captures the shape of the algorithm.

An Indian news search engine

You are building a search engine for an Indian news aggregator. The corpus contains articles in English, Hindi, and Tamil. To get a feel for what your analyzer does to text, run three sample headlines through the matching language analyzer and look at the output tokens — these are the strings that go into the inverted index.

import nltk
nltk.download('stopwords', quiet=True)

# Article 1: English
en_article = "Reliance announces new Jio plans for rural India"
print("EN tokens:", english_analyzer(en_article))
# → ['relianc', 'announc', 'new', 'jio', 'plan', 'rural', 'india']

# Article 2: Hindi
hi_article = "रिलायंस ने नया जिओ प्लान घोषित किया"
print("HI tokens:", hindi_analyzer(hi_article))
# → ['रिलायंस', 'नया', 'जिओ', 'प्लान', 'घोषित', 'किय']

# Article 3: Mixed (English brand inside Hindi text — common in Indian news)
mixed = "Jio का नया प्लान 5G के लिए"
# Need to detect script per token and route to the right analyzer
def mixed_analyzer(text: str) -> list[str]:
    out = []
    for token in re.findall(r'\S+', text):
        if re.search(r'[ऀ-ॿ]', token):
            out.extend(hindi_analyzer(token))
        else:
            out.extend(english_analyzer(token))
    return out

print("MIX tokens:", mixed_analyzer(mixed))
# → ['jio', 'नया', 'प्लान', '5g']

Notice three things in the English output. First, announces became announc and Reliance became relianc — Snowball's -e and -es rules fired. Second, for was dropped because it is in NLTK's English stopword list. Third, 5G survived intact because the tokenizer treats 5g as a single token (digits and letters mixed).

In the Hindi output, ने was dropped as a stopword and किया lost its ा suffix to become किय (the verb stem). रिलायंस survived unchanged because it does not end in any of the Hindi suffixes. Why: stemming is suffix-driven; loanwords from English do not match Hindi morphological patterns, so they pass through.

Now imagine a user types रिलायंस के प्लान. The query analyzer runs the same Hindi pipeline and produces ['रिलायंस', 'प्लान'] (because के is a stopword). Both terms are looked up in the inverted index, the postings lists are intersected, and Article 2 comes back as a hit. Without analysis, the query would have been three different strings and Article 2 would not have matched — even though it is exactly what the user wanted.

Trade-offs to make explicitly

Every analyzer choice is a trade-off between precision (the fraction of returned results that are relevant) and recall (the fraction of relevant documents that are returned).

Aggressive stemming raises recall — running matches documents about runs, runner, ran — but lowers precision — university and universe both stem to univers, conflating unrelated meanings. The Lancaster stemmer, more aggressive than Snowball, is famous for producing nonsense conflations like organisation → organ.

Aggressive stopword removal speeds up indexing and shrinks the index, but breaks any query that depends on common words. Vitamin A becomes vitamin because a is a stopword. IT industry becomes industri. Modern systems mostly leave stopwords in.

Synonym expansion at index time doubles or triples the index size (every phone token also indexes mobile, handset) but makes the query side trivial. Synonym expansion at query time keeps the index small but slows every query (each query token expands to many lookups). Most production systems do query-time expansion.

Edge n-grams for autocomplete grow the index by ~4× and make prefix search instant. They are essential if you want a "search-as-you-type" experience but pure overhead if you do not.

The right answer is workload-specific, which is why Lucene exposes every dial.

Real systems

Elasticsearch ships about 35 built-in language analyzers — English, Arabic, Bengali, Chinese (via SmartCN or ICU), Hindi, Tamil, Telugu, and many more — and exposes them as "analyzer": "hindi" in field mappings. You can also build custom analyzers by composing a char_filter, tokenizer, and filter chain in JSON, and the engine wires up the pipeline at startup.

MeiliSearch takes a more opinionated approach — it auto-detects language and applies sensible defaults, with no per-field configuration required. This costs flexibility but pays back in default quality for non-experts.

Tantivy (a Rust port of Lucene used by Quickwit) reimplements the tokenizer-and-filter framework with the same shape, exposing a small set of built-in stemmers (Snowball for the major European languages) and letting users plug in custom token filters.

Anserini is a Java toolkit built on Lucene specifically for IR research; it is the reference implementation people benchmark against in the TREC competitions, and a good place to look if you want to see "the same analyzer the academic IR community uses".

For pure tokenisation needs, the ICU project's BreakIterator is the highest-quality general-purpose Unicode tokenizer — it is what Lucene's ICUTokenizer wraps, and what Chrome and Firefox use for word selection on double-click.

Going deeper: the analyzer is part of your schema

The single most important operational property of an analyzer is that it is part of your index schema, not part of your query layer. If you change the analyzer — switch from English Snowball to Lancaster, add a new stopword, change ASCII folding to Unicode folding — the strings stored in the existing inverted index no longer match what the new analyzer would produce. You cannot reanalyze a query against a stale index. The entire index must be rebuilt.

This makes analyzer choice a long-lived decision in the same way that picking a primary key is. Production teams that have to migrate analyzers — usually because they realised some query class was failing systematically — generally do it via a dual-write scheme: build the new index alongside the old, send each query to both, compare results, and cut over only when the new index is demonstrably better. This can take weeks for a large corpus.

A second consequence: the analyzer is expensive at write time. Snowball is fast (microseconds per word), but ICU-based Chinese tokenisation is slow (milliseconds per document) and lemmatisation can be slower still. Indexing throughput is often analyzer-bound, not disk-bound. If you are indexing billions of documents, every microsecond per token matters.

A third consequence relevant to multitenant systems: if you let users configure their own analyzers, every customer effectively has a different schema, and the operator burden of managing dozens of pipeline variants becomes nontrivial. Most hosted search services restrict customers to picking from a fixed menu.

Beyond stemming: BPE and learned tokenisers

The deep-learning era brought a new family of tokenisers — byte-pair encoding (BPE), WordPiece, SentencePiece — which learn a sub-word vocabulary from the corpus rather than relying on hand-coded rules. running might be tokenised as [run, ##ning]; unbelievable as [un, ##believ, ##able]. The advantages are language-agnostic operation and graceful handling of rare words; the disadvantages are loss of morphological transparency and worse fit for inverted-index lookup (you would have to tokenise queries the same way and search for sub-word sequences, which postings-list intersection is not designed for).

Most search systems still use rule-based stemmers. BPE-style tokenisers dominate in retrieval systems built around dense vector search (where the tokens are inputs to a neural encoder, not direct index keys), which is the world covered later in Build 18.

Why "the same analyzer at index and query time" actually fails sometimes

There is one case where you deliberately use different analyzers at index and query time: edge n-gram autocomplete. At index time, samsung is indexed as [s, sa, sam, sams, samsu, samsun, samsung]. At query time, the user typing sams should match — but if you ran the n-gram filter at query time too, the query would expand to [s, sa, sam, sams] and you would do four lookups for what should be one. So you index with n-grams and query without them, using a separate analyzer for the search side.

This is the only standard exception, and it is so common that Lucene and Elasticsearch let you specify analyzer and search_analyzer separately on every field. If you find yourself needing more than this exception, you are probably doing something wrong.

What's next

You now have a function that turns raw text into a list of canonical tokens, and you have an inverted index that maps tokens to docID postings lists. The next chapter compresses those postings lists — a billion-document corpus can have postings lists with hundreds of millions of entries each, and storing them as plain int64 arrays would consume terabytes. Skip pointers, delta encoding, and variable-byte compression typically cut the size by 5-10× while making queries faster, not slower. From there, BM25 turns presence-or-absence matches into ranked relevance scores, and you have the core of every modern search engine.

The thesis: matching requires normalisation

The standard analyzer pipeline

Stemming versus lemmatisation

Stopwords: the case for and against

Language analyzers

Building it in Python

Trade-offs to make explicitly

Real systems

Going deeper: the analyzer is part of your schema

Beyond stemming: BPE and learned tokenisers

Why "the same analyzer at index and query time" actually fails sometimes

What's next

References