In short

You are running an Elasticsearch cluster behind a marketplace search box. New product listings stream in all day. The first instinct of someone coming from a B-tree world is to imagine a single big inverted index that you patch in place — append the new docID to each affected postings list, splice in the new term entries, fsync. Lucene refuses to do this. Instead it writes every batch of new documents as a new immutable segment: a complete, self-contained mini-inverted-index, sealed once and never touched again. Deletions don't rewrite postings either; they just flip a bit in a deleted-doc bitmap. This is the same idea as an LSM tree from Build 11, applied to inverted indexes. The cost is that segments multiply — a refresh interval of one second creates 86,400 segments per day per shard, each opened on every query. The fix is a merge policy: a background thread that picks groups of similar-sized segments, reads them, writes a single larger merged segment, and atomically swaps it in. Lucene's default is TieredMergePolicy — it sorts segments by size into tiers and merges a tier whenever it gets too crowded, keeping the steady-state count near a configurable target (≈30 segments per shard by default). The win is enormous: a worked example at the end of this chapter shows a billion-document index dropping from 5 seconds per query (100,000 small segments) to 50 milliseconds (50 large segments) — a 100× speedup from the merge policy alone. Knowing how this knob works is the difference between a cluster that scales and one that you keep restarting at 3am.

You have read chapter 143 and chapter 145. You know what an inverted index is and how its postings lists are compressed on disk. The natural next question, the one nobody asks until production breaks, is: how do you update one?

A user uploads a new product listing. A blog publishes a new article. A log line lands in your ingestion pipeline. The new document contains a hundred terms, and each of those terms now needs a new entry appended to its postings list. If the index lives on disk as one big file, this is catastrophe — every term's postings list is somewhere in the middle of that file, and inserting bytes in the middle of a file means rewriting everything after the insertion point. A single new document could trigger gigabytes of disk I/O.

So Lucene — the indexing library that powers Elasticsearch, Solr, and OpenSearch, and whose design Tantivy reimplements in Rust — does something else. It never updates the index in place. It writes a new index, every time, and lets a background thread clean up.

The thesis: indexes are immutable, writes create new ones

A Lucene index is not a single file. It is a directory full of segments. Each segment is a complete, self-contained inverted index for some subset of the documents — its own term dictionary, its own postings file, its own stored-fields file, its own deleted-doc bitmap. A search over the whole index is a search over every segment, with the results merged.

When you add a new document, Lucene does not touch any existing segment. Why: existing segments are sealed — opening one of them for write would force readers to lock, would invalidate OS page cache for that file, and would make the postings-list-merge-on-update problem from the previous paragraph appear all over again. The new document goes into an in-memory buffer. When the buffer fills (or a refresh interval elapses, or you call flush()), Lucene writes the buffer's contents as a brand-new segment file and adds it to the directory.
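
In plain Lucene the buffer-then-flush path in this paragraph is a few lines against the IndexWriter API. A minimal sketch — the index path and field name are illustrative, not anything Elasticsearch-specific:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexOneDoc {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/products-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new TextField("title", "cotton kurta size m", Field.Store.YES));

            writer.addDocument(doc); // in-memory buffer only; no disk segment yet
            writer.commit();         // flush: new segment written and fsync'd,
                                     // segments_N manifest updated atomically
        }
    }
}
```

addDocument alone touches only memory; it is the commit (or a flush) that seals the buffer into a segment.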

[Figure: Lucene segment lifecycle — an in-memory buffer (docs 7891–7894) → "refresh" → one immutable segment on disk (seg_42, 8 MB) → "more writes" → many small segments (s40–s48) → "merge" → larger merged segments (seg_M1, seg_M2, ~70 MB each); a background merge thread reclaims deleted docs and combines segments.]
The Lucene segment lifecycle: documents accumulate in an in-memory buffer; on every refresh, the buffer is sealed into a new immutable segment file. Repeated writes produce many small segments, which a background merge thread eventually combines into fewer, larger ones.

A delete is even more interesting. Lucene does not go into the relevant segments and remove the docID from every postings list — that would require rewriting the segment, defeating immutability. It just sets a bit in a per-segment deleted-doc bitmap: a parallel file that says "docID 7842 in this segment is dead, ignore it at query time." The postings still contain the docID. The stored fields still contain the document. The disk bytes are wasted until merge time, when the merge skips deleted docs and reclaims the space. Why: this is the same idea as a tombstone in an LSM tree — make the write cheap by deferring the cleanup, then reclaim during the next compaction. Lucene's "merge" is structurally identical to LevelDB's "compaction".

Updates are deletes plus inserts. There is no in-place update operation in Lucene. index.update(doc) is "mark the old version as deleted, write a new version into the buffer." This is why an Elasticsearch update is no cheaper than an Elasticsearch insert.
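
Both operations go through the same Term-addressed calls on the writer. Continuing the sketch above — the "id" field and newVersionOfDoc are assumptions standing in for however your documents are keyed:

```java
import org.apache.lucene.index.Term;

// Delete: nothing is rewritten. The matching docID is marked dead in the
// per-segment deleted-doc bitmap; postings and stored fields stay on disk.
writer.deleteDocuments(new Term("id", "7842"));

// Update: literally delete-then-insert, packaged as one call.
writer.updateDocument(new Term("id", "7842"), newVersionOfDoc);
```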

Why immutability won

Three reasons, in order of importance.

Lock-free readers. A query that opens segment 42 holds a file descriptor on a file that will never change. No reader needs a lock; no writer needs to coordinate with readers. The only synchronisation point in the entire engine is the moment a new segment is added to the directory — a single atomic update of a manifest file (Lucene calls it segments_N). Everything else is reading immutable bytes. Why: locking in a search engine is brutally expensive — a single inverted-index update would block thousands of concurrent reads — and immutability eliminates the need for it entirely.

Cache-friendly. A hot segment file gets pulled into the OS page cache once and stays there. Queries hit the cache, never the disk. With a mutable index, every write would invalidate the affected pages and force a fresh disk read on the next query. Why: modern Elasticsearch tuning advice — leave half the box's RAM for the OS page cache — only works because segments are immutable; otherwise the cache hit rate would collapse under write traffic.

Sequential disk I/O. Writing a new segment is a pure append-only sequential write of one large file. No random seeks, no in-place updates. On spinning disks this was the difference between feasible and infeasible; on NVMe SSDs it is still the difference between fast and slow, and it eliminates write amplification beyond the merges themselves.

There is also a subtler benefit: crash safety becomes trivial. The segments_N manifest is the single source of truth for which segments are live. A crash mid-segment-write leaves a partial file on disk that no segments_N references; Lucene just deletes it on startup. Compare this to recovering a half-mutated B-tree page (the pain that motivated the entire ARIES protocol in Build 7).

The lifecycle of a single document

Walk through a single index_doc call from start to finish.

  1. The client sends a document. Elasticsearch receives JSON, routes it to the correct shard, and hands it to the local Lucene IndexWriter.
  2. The writer tokenises and adds to the in-memory buffer. The buffer is structured: per-term postings lists are accumulated in a hash map (Lucene's TermsHashPerField), and stored fields are written into a column-store-like buffer.
  3. The translog is appended. Elasticsearch wraps Lucene with a translog (transaction log, fsync'd on every request by default) so an unrefreshed buffer survives crashes. Lucene by itself has no translog; you add one if you need durability.
  4. A refresh fires. Default in Elasticsearch is refresh_interval: 1s, but you can dial it up to 30s (heavy ingest, log indexing) or force a per-request refresh (?refresh=true) when writes must be searchable immediately. The buffer is serialised to a new segment file on disk; the file's term dictionary is built; postings are compressed; the file is fsync'd; the segments_N manifest is updated atomically; the buffer is cleared.
  5. The new segment is opened by the readers. Existing in-flight queries continue against the old segment list (immutability!). New queries see the updated segment list and include the new segment.
  6. Eventually, a merge picks it up. The background merge scheduler decides this segment, plus a handful of similar-sized siblings, should become one larger segment. The merge reads them, writes the combined output, atomically updates segments_N, and deletes the originals — but only after every in-flight query that was reading them has completed.
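
Steps 4 and 5 are visible in plain Lucene through the near-real-time reader API. A sketch of the refresh-and-swap dance that Elasticsearch's refresh_interval automates, reusing the writer from earlier:

```java
import org.apache.lucene.index.DirectoryReader;

// A near-real-time reader: a fixed snapshot of the current segment list.
DirectoryReader reader = DirectoryReader.open(writer);

writer.addDocument(doc); // buffered in memory; invisible to `reader`

// "Refresh": request a reader over the updated segment list.
// Returns null if nothing has changed since `reader` was opened.
DirectoryReader newer = DirectoryReader.openIfChanged(reader, writer);
if (newer != null) {
    reader.close(); // old snapshot released once its in-flight queries finish
    reader = newer; // new queries see the freshly flushed segment
}
```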

This is the entire write path. Notice what is not in it: any operation that modifies an existing file. Every byte written to disk is written exactly once, then either read forever or garbage-collected by a merge.

The small-segment problem

If immutability is so wonderful, why merge at all? Why not just keep accumulating segments forever?

Because every query has to visit every segment. A search for samsung smartphone is not "look up samsung in the inverted index" — it is "look up samsung in segment 1's inverted index, then segment 2's, then segment 3's, then segment N's, and union the results." Each segment is its own self-contained index, and there is no global postings list across segments.

[Figure: Why many small segments are slow — the query "samsung" fans out to 100 tiny segments (s1–s100), each costing an open/seek/read/decode, for 100 file opens per query and a ~5 s p99; the same query against 10 larger segments (S1–S10) costs 10 file opens and ~50 ms p99.]
The small-segment problem: a single query against 100 small segments must open, seek, decode, and merge results from 100 files — even if the postings lists themselves are tiny. Merging the same data into 10 larger segments collapses the per-query overhead by an order of magnitude.

The cost per segment is mostly fixed: open the file, read the segment header, look up the term in that segment's term dictionary, seek to the postings, decode the first block. Even if the segment contains zero matches for samsung, you paid all that overhead just to find out. With a hundred segments you pay it a hundred times. With a hundred thousand segments you pay it a hundred thousand times — and your query, which should have been fifty milliseconds, is now five seconds.

There is also a query-side cost: scoring. BM25 (chapter 146) needs global term statistics — total documents in the corpus, document frequency for each term — to compute IDF. With many segments these stats are computed per-segment then combined, which is correct but adds CPU. And for top-k retrieval, the more segments you have, the more candidate hits you have to merge into a single ranked list.

The conclusion forces itself: you need to bound the segment count. That is what merge policies are for.

Merge policies

A merge policy is the strategy that picks which segments to combine, when, and into what. Lucene ships several; the default since version 4 is TieredMergePolicy.

TieredMergePolicy (the default)

Group segments by size into tiers — roughly, each tier contains segments that are within a constant factor of each other in size. When a tier accumulates more than segments_per_tier segments (default 10), pick the cheapest-to-merge subset of them and combine them into a single segment that lands in the next tier up. Repeat as needed.

[Figure: TieredMergePolicy — tier 0 (~5 MB segments) overflows past segments_per_tier (12 > 10) and merges ten segments into one in tier 1 (~50 MB); tier 1 eventually overflows into tier 2 (~500 MB segments), the steady state. A merge fires whenever a tier holds more than segments_per_tier segments.]
TieredMergePolicy groups segments into size tiers. New segments enter tier 0; when a tier exceeds segments_per_tier, the policy picks a cheap subset and merges them into the next tier up. The result is a steady-state segment count that grows logarithmically with corpus size, not linearly.

The policy is greedy and cost-based: at each step it considers candidate merges, scores each one by (merged_size + skew_penalty) / (sum of input sizes - reclaimable_deletes), and picks the lowest-scoring one. Why: the score rewards merges that combine many roughly-equal-sized segments (low skew), reclaim a lot of deleted-doc bytes, and produce a moderate-sized output — penalising the runaway "merge everything into one giant segment" path that would explode write amplification.

Two parameters dominate behaviour:

  • segments_per_tier (default 10) — how many same-tier segments may accumulate before a merge fires. Lower it for fewer segments and faster queries, at the cost of more merge I/O.
  • max_merged_segment (default 5 GB in Elasticsearch) — the ceiling on a merged segment's size. Segments at or near it are left alone, which caps the cost of any single merge.
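
On the Lucene side both knobs are setters on TieredMergePolicy. A configuration sketch with the values discussed above written out — defaults restated, not a tuning recommendation:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

TieredMergePolicy policy = new TieredMergePolicy();
policy.setSegmentsPerTier(10.0);          // a merge fires when a tier exceeds this
policy.setMaxMergedSegmentMB(5 * 1024.0); // ~5 GB ceiling on a merged segment
policy.setFloorSegmentMB(2.0);            // treat tiny segments as this size so
                                          // they are consolidated quickly

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
        .setMergePolicy(policy);
```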

LogByteSizeMergePolicy (the predecessor)

Lucene's older default. Sorts segments by byte size on a logarithmic scale and merges segments that fall in the same log bucket. Simpler than tiered merging, less flexible — tiered merging gives more knobs to balance write amplification against read latency, which matters at production scale. Still available for compatibility and for workloads where its predictability is preferred.

NoMergePolicy

What it sounds like: never merge. Useful for the very specific case of a frozen, read-only index where you have already done a forceMerge(1) and you want to guarantee no further changes. Almost never the right choice for a live index.

Trade-offs: the merge dial

Every merge tuning decision is the same trade-off, expressed in different units:

  • Merge more aggressively → fewer, larger segments → faster queries, paid for in disk write bandwidth (every byte is rewritten each time its segment climbs a tier) and background CPU.
  • Merge more lazily → cheaper ingest and lower write amplification, paid for in segment count and query latency.

For a write-heavy log indexing workload (think Elasticsearch ingesting Nginx logs from a hundred containers), you usually want lazier merging — query latency is less critical than ingest throughput, and indexes are often time-sliced anyway (one index per day). For a search-heavy workload (a marketplace product index queried a thousand times a second), you want more aggressive merging.

There is one more knob worth knowing: index.merge.scheduler.max_thread_count. Merging is parallel — multiple merges can run simultaneously — but each merge thread is contending for the same disk. On a single SSD, two or three threads is the sweet spot; on spinning disks, one. Setting this too high will starve queries of disk bandwidth.
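
Elasticsearch's setting is a thin wrapper around Lucene's ConcurrentMergeScheduler. Continuing the configuration sketch, with the single-SSD thread count from above:

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;

ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
// maxMergeCount = merges allowed to be pending; maxThreadCount = merges
// actually running at once. Ingest stalls if the pending queue overflows.
scheduler.setMaxMergesAndThreads(/* maxMergeCount */ 6, /* maxThreadCount */ 2);

config.setMergeScheduler(scheduler);
```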

Merge during query: the safety property

What happens if a merge completes while a query is reading the segments being merged? The query had a snapshot of the segment list at its start; the merge wants to delete the old segments and replace them with the new merged one. If the merge deletes them mid-query, the query crashes.

Lucene's solution is reference counting on segments. When a query opens, it acquires a reference on every segment in its snapshot. The merge can produce its new segment and update segments_N immediately — but it cannot delete the old segment files until their reference count drops to zero. The old files linger on disk for a few seconds (the duration of the longest in-flight query) and are then garbage-collected.
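
The mechanism is plain reference counting. A conceptual sketch — not Lucene's actual classes (internally this bookkeeping lives in the reader machinery and IndexFileDeleter), just the shape of the idea:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual only: one counter per segment's set of files.
final class SegmentFiles {
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = live in segments_N

    void incRef() { refs.incrementAndGet(); }    // a query snapshot pins the segment

    void decRef() {                              // a query finishes, or a merge
        if (refs.decrementAndGet() == 0) {       // drops the segment from segments_N
            deleteFilesFromDisk();               // only now is deletion safe
        }
    }

    private void deleteFilesFromDisk() { /* unlink the segment's files */ }
}
```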

This is why an Elasticsearch index briefly takes more disk than its on-paper size: a recently-completed merge has both the new merged segment and the old soon-to-be-deleted segments on disk simultaneously. It is also why the rule-of-thumb disk budget for an Elasticsearch shard is ~2× the steady-state index size — you need headroom for the largest possible in-flight merge.

Force merge

Sometimes you want to override the policy and merge explicitly. The most common case: a daily log index that becomes read-only at midnight. After midnight there are no more writes, so the segment count will never naturally come down — but every search of yesterday's logs is paying the small-segment overhead. The fix is _forcemerge?max_num_segments=1: tell Elasticsearch to merge the entire shard down to one segment.

forceMerge is a heavyweight operation. It will rewrite every byte of the shard. On a 50 GB shard that is 50 GB of disk write and read. Run it during off-peak, never on indexes that are still being written to (you will fight the ingest path the whole time and produce one massive segment that the regular policy will then refuse to touch), and never thoughtlessly schedule it as a recurring cron — it is meant for the "this index is now frozen, optimise it once" use case.
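
Underneath the _forcemerge endpoint is a single Lucene call. With the writer from earlier:

```java
writer.forceMerge(1); // merge the whole index down to one segment — blocks until
                      // done, rewriting every live byte of the shard on the way
writer.commit();      // publish the new single-segment segments_N
```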

The benefit is real: a fully merged read-only index is the fastest possible Lucene index. Stored fields are contiguous, deleted-doc bitmaps have been reclaimed, term dictionaries are minimal. For an index that will be queried millions of times before being deleted (the analytics dashboard reading last week's logs), a one-time forceMerge(1) pays for itself many times over.

Worked example: the 100× speedup

A marketplace search index, before and after merge tuning

Imagine a Bengaluru-based fashion marketplace that has indexed one billion product listings, reviews, and user-generated content snippets into a single Elasticsearch shard over thirty days. Refresh interval is the default 1s, so writes have been creating a new segment every second. Total segments after a month, with no merging at all: 30 × 86,400 ≈ 2.6 million segments. Even a baseline merge policy that only partially kept up could still leave roughly 100,000 small segments by month-end if the tuning was off.

A query for cotton kurta size m arrives. With 100,000 segments, the per-segment overhead is roughly:

  • 100 µs to open the file (or hit the OS page cache)
  • 50 µs to look up the query's three terms in that segment's term dictionary
  • 20 µs to read and decode the (usually empty) postings block

That is roughly 170 µs per segment, paid 100,000 times: 17 seconds of pure overhead, before scoring or top-k merging. Even amortising file opens via the OS page cache and parallelising across CPU cores, the realistic p99 is around 5 seconds. The user has long since closed the tab.

Now switch to a sensibly-tuned TieredMergePolicy with segments_per_tier: 10 and a periodic background merge. After thirty days the shard converges to roughly 50 segments at the top tier (each ~20 GB) plus a tail of smaller ones — call it 50 effective segments. Same per-segment overhead: 170 µs × 50 = 8.5 ms of overhead. Add 30 ms for the actual postings reads, intersection, and BM25 scoring across the live segments, and you are at ~50 ms p99 end-to-end.

Metric                      No merging (~100K segments)    Tiered merge (~50 segments)
Segments per shard          100,000                        50
File opens per query        100,000                        50
Per-query overhead          ~5 s                           ~50 ms
Speedup                     1× (baseline)                  ~100×
Disk write amplification    ~1× (bytes written once)       ~3-5×
Steady-state CPU (merge)    ~0%                            ~5% of one core

The trade-off is visible in the bottom rows: the merge-tuned cluster spends 3-5× more disk write bandwidth (each byte ends up rewritten a few times as it climbs through tiers) and burns a small amount of CPU on background merges. In exchange it gives you 100× faster queries — the single highest-leverage knob on the cluster. For a fashion marketplace where queries outnumber writes by 1000:1, this trade is not even close.

The same arithmetic applies, with different absolute numbers, to logging clusters at HotStar, recommendation indexes at Flipkart, and the search box on every Indian government portal that runs OpenSearch underneath. The default merge policy is good. Knowing which knob to turn when it is not is the difference between a Lucene operator and a Lucene engineer.

Real systems

Every Lucene-derived search engine — Elasticsearch, Solr, OpenSearch — inherits this design wholesale; Tantivy reimplements the same segment model in Rust.

Even non-Lucene full-text engines that compete on different axes — Meilisearch, Typesense, Vespa — have converged on segment-based storage with background merges, because the underlying constraints (immutable files for cache friendliness, bounded segment counts for query speed) are universal.

Going deeper

For senior engineers

The merge policy interacts with three other systems in ways that are easy to miss the first time, and matter enormously at scale: the translog, soft deletes, and cross-cluster replication (CCR).

Translog and durability

Lucene's IndexWriter has its own commit mechanism: commit() fsyncs the index and writes a new segments_N manifest. Before commit, a crash loses anything in the in-memory buffer. Elasticsearch wraps Lucene with a translog that is fsync'd on every request (configurable to async for higher throughput at the cost of bounded data loss). On crash, Elasticsearch replays the translog into a fresh in-memory buffer and reaches a consistent state. The merge policy is oblivious to the translog — merges are pure reads of immutable segment files — but the translog size grows with refresh interval. Setting refresh_interval: -1 for bulk loads is a famous tuning trick: it disables refresh entirely, lets the translog grow, and avoids creating thousands of tiny segments during the load.
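
The plain-Lucene analogue of the bulk-load trick is simply a bigger indexing buffer, so flushes — and hence initial segments — are fewer and larger. A sketch; 512 MB is an illustrative figure, not a recommendation:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

IndexWriterConfig bulkLoadConfig = new IndexWriterConfig(new StandardAnalyzer())
        .setRAMBufferSizeMB(512.0) // flush on size only, and rarely: fewer,
                                   // larger segments for the merge policy to eat
        .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
```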

Soft deletes and the version map

Modern Lucene supports soft deletes: instead of immediately marking a doc as dead in the deleted-doc bitmap, it writes a "deletion record" as a new document in a new segment. This sounds wasteful but enables several features — point-in-time queries, change data capture, cross-cluster replication — by making the deletion history queryable. Merges become more sophisticated: a doc is fully reclaimable only if its soft-delete record is older than every snapshot or replication subscription that might still reference it. Elasticsearch's index.soft_deletes.retention_lease is the knob that controls this.
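
In Lucene the feature is opt-in: you name a soft-deletes field and wrap your merge policy so that retained history survives merges. A sketch — the field name and the keep-everything retention query are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SoftDeletesRetentionMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.search.MatchAllDocsQuery;

IndexWriterConfig softDeleteConfig = new IndexWriterConfig(new StandardAnalyzer())
        .setSoftDeletesField("__soft_deletes")
        .setMergePolicy(new SoftDeletesRetentionMergePolicy(
                "__soft_deletes",
                MatchAllDocsQuery::new, // retention query: soft-deleted docs still
                                        // matching it must be kept through merges
                new TieredMergePolicy()));
```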

Per-shard tuning at scale

A production Elasticsearch cluster has hundreds or thousands of shards, each running its own merge policy independently. The merge scheduler is node-local: it sees the merges across all shards on a node and shares disk bandwidth across them. A single hot shard with constant ingest can starve the others; conversely, an idle shard donates I/O to its neighbours. This is why per-shard sizing matters — a few balanced ~30 GB shards is much better than one 300 GB monster shard, both for query parallelism and for keeping any single merge from monopolising the node.

When merges go wrong

Two failure modes worth knowing. The first is merge backlog: ingest creates new segments faster than the policy can merge them, and segment count grows unboundedly. Diagnosed by _cat/segments showing a swelling tier 0; fixed by raising index.merge.scheduler.max_thread_count (if the disk has bandwidth to spare), lowering refresh_interval, or increasing index.merge.policy.floor_segment to consolidate tiny segments faster. The second is merge stall: a single huge merge hogs all the merge threads for hours, blocking smaller useful merges. Diagnosed by _tasks?actions=*merge* showing a long-running merge; fixed only by waiting it out (cancelling a merge wastes all the work done so far). Tuning max_merged_segment lower prevents this from happening again.

What you have learned

Lucene never mutates an index in place. Writes accumulate in memory and are flushed as immutable segments; deletes flip bits in a per-segment bitmap; updates are delete-plus-insert. Immutability buys lock-free readers, page-cache friendliness, sequential I/O, and trivial crash recovery — at the cost of segment proliferation, which a merge policy (TieredMergePolicy by default) keeps bounded by trading write amplification for query latency. forceMerge(1) is the one-time optimisation for indexes that have stopped changing.

Chapter 149 takes this single-shard story to a cluster: how Elasticsearch fans a query out across shards, gathers their per-shard top-k results, and reduces them into a global top-k.

References

  1. Apache Lucene IndexWriter documentation — the canonical reference for the segment lifecycle, refresh, commit, and force-merge semantics.
  2. Lucene MergePolicy and TieredMergePolicy API — the source-level documentation of the cost-based scoring rule and the parameters discussed above.
  3. Elastic blog: "Lucene's handling of deleted documents" — accessible deep dive into the deleted-doc bitmap, soft deletes, and how merges reclaim space.
  4. Lucene in Action, 2nd edition (Manning) — book-length treatment by the Lucene committers, with a chapter on segments and merges that remains the best long-form explanation.
  5. Apache Solr merge policy reference — covers the same policies via Solr's configuration syntax, including operational guidance for merge scheduling.
  6. OpenSearch index settings: merge — the AWS-fork equivalent, with the same parameters under slightly different setting names.