In short

You finished chapter 143 with a working inverted index where every term maps to a Python list[int] of document IDs. That worked for five products. It does not work for a billion. At scale, postings lists are the dominant storage cost in a search engine — for an index of a billion English documents, the term the alone has a postings list approaching a billion docIDs, and even a moderately common term like samsung might have 50 million. Stored as raw 4-byte integers that is 200 MB for one term. Multiply by a million distinct terms and the index is bigger than the documents themselves. Two orthogonal techniques rescue this. Compression shrinks each posting from 4 bytes to roughly 1 byte typical: first by delta-encoding (store gaps between consecutive docIDs, not docIDs themselves — gaps are small), then by encoding each gap with variable-byte (1-5 bytes per integer, small ones get 1) or Frame-of-Reference bit-packing in 128-doc blocks (Lucene's choice; SIMD-decodable; ~5× compression). Skip pointers speed up intersection — every √N postings, store a (skip-to-doc, byte-offset) pair so that an AND between a 10-million list and a 100-million list can leap forward in the long list instead of stepping one docID at a time. Together they turn postings storage from terabytes into tens-to-hundreds of gigabytes and intersection from minutes into milliseconds. This chapter implements delta + VByte in Python, measures the compression ratio, walks through the skip-aware intersection, and surveys what Lucene, Elasticsearch, Quickwit, and the FastPFor library actually do.

You have an inverted index. For every term in your vocabulary, there is a sorted list of integer docIDs. The data structure is correct. It is also, at scale, ruinous: a Lucene shard for a billion-document log corpus would consume terabytes of postings storage if you wrote each docID as a 32-bit integer, and every query would drag those terabytes through the page cache.

Two observations rescue you, and almost every production search engine on Earth combines them.

The first is that postings are not arbitrary integers. They are sorted, and consecutive entries differ by small amounts. That means the gap between docIDs is usually tiny — often single digits — and integers that are usually tiny are exactly the integers that compression algorithms love. The second is that an AND query between two postings lists of very different lengths does not need to look at every entry in the long list. If you have signposts at every √N positions, you can leap.

This chapter builds both. By the end you will have a Python implementation of delta + variable-byte encoding that compresses a realistic postings list by 3-4×, you will understand why Lucene chose Frame-of-Reference bit-packing instead of variable-byte, and you will be able to walk through the skip-aware intersection that lets samsung AND phone leap over most of the 100-million-entry list instead of stepping through it one docID at a time.

The storage problem, in numbers

Think of a real index. Suppose you are building search over a billion product descriptions for an Indian e-commerce catalogue (Flipkart-scale). The vocabulary, by Heaps's law, is around 10 million distinct terms. The very common terms have very long postings lists; the rare terms have very short ones; the distribution follows Zipf's law (a few terms dominate).

A uniform estimate is harsh enough on its own. If even 1% of those 10 million terms have postings lists with 10 million docIDs each, you are looking at:

100,000 terms × 10,000,000 docs × 4 bytes = 4 TB

Four terabytes of postings, and that is before positional information, payloads, or term frequencies. Why this is a crisis: 4 TB does not fit in RAM on any single machine, so every query goes to disk; disk bandwidth is roughly 1 GB/s on NVMe, so an AND that touches two 200 MB postings lists takes 400 ms before any logic runs. A user typing a query expects the results in 50 ms.

Compression is not a nice-to-have. It is the difference between feasible and infeasible.

Observation 1: docIDs are sorted, so encode the gaps

Look at a real chunk of a postings list:

[Figure: raw postings list, 4 bytes per docID — 4, 7, 19, 21, 33, 56, 78, 92, 100, 153, 199, … Each entry occupies 4 bytes: 11 docIDs = 44 bytes for this slice; 50 M docs = 200 MB for one term.]
The docIDs in this slice happen to be small, but fixed-width storage spends 4 bytes on every one of them regardless — and further along the list the IDs grow toward a billion and genuinely need all 4. For a popular term in a billion-doc index that adds up to hundreds of megabytes.

Now compute the gaps between consecutive docIDs: [4, 7-4, 19-7, 21-19, 33-21, 56-33, 78-56, 92-78, 100-92, 153-100, 199-153, ...] = [4, 3, 12, 2, 12, 23, 22, 14, 8, 53, 46, ...]. The numbers are smaller — most fit in a single byte. This is delta encoding.

[Figure: the same slice after delta encoding — gaps 4, 3, 12, 2, 12, 23, 22, 14, 8, 53, 46, … All 11 gaps fit in 1 byte each: 11 bytes for this slice, roughly 4× smaller, and the original docIDs come back via cumulative sum.]
Delta encoding does not compress on its own — the integers are still integers — but it transforms the *distribution* of values to be heavily skewed toward small numbers. Variable-byte or bit-packing on top then converts that skew into bytes saved.

Two facts make delta encoding a free win.

First, you can recover the original docIDs by a single pass of cumulative sum: docID[i] = sum(deltas[0..i]). Why this is cheap: cumulative sum is a single sequential read with one addition per element — about the most cache-friendly loop there is — and on modern CPUs the prefix sum can be vectorised so that several deltas decode per cycle.

Second, the deltas are much smaller than the docIDs. For a postings list of N docIDs over a corpus of size D, the average gap is D/N. For samsung with 50 million postings in a billion-doc corpus, the average gap is 20 — fits in a single byte. For the with 800 million postings, the average gap is barely above 1.
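
To make both facts concrete, here is a minimal round trip on the slice above — plain Python, no dependencies:

# delta_demo.py — gaps and cumulative-sum recovery for the example slice
doc_ids = [4, 7, 19, 21, 33, 56, 78, 92, 100, 153, 199]

# delta encode: the first entry is kept as-is (its gap from an implicit 0)
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)   # [4, 3, 12, 2, 12, 23, 22, 14, 8, 53, 46]

# decode: a running sum restores the original docIDs
restored, running = [], 0
for g in gaps:
    running += g
    restored.append(running)
assert restored == doc_ids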

The savings come when you pair delta with a variable-length integer encoding that spends fewer bytes on small integers — which is the next observation.

Observation 2: small integers deserve fewer bytes — Variable-Byte (VByte)

The simplest variable-length integer encoding is VByte (also called VInt in Lucene): each byte uses 7 bits for the value and 1 bit as a continuation flag. If the high bit is 0, this is the last byte of the integer; if 1, more follow.

value 5      → 0000 0101                                   (1 byte)
value 130    → 1000 0010  0000 0001                        (2 bytes)
value 200000 → 1100 0000  1001 1010  0000 1100             (3 bytes)

(The low-order 7 bits go in the first byte and higher groups follow — the same order the implementation later in this chapter emits.)

Numbers from 0 to 127 take 1 byte; 128 to 16383 take 2; 16384 to ~2 M take 3; up to 32-bit integers take 5. Why this matches reality so well: combined with delta encoding, almost every gap in a postings list for a moderately-frequent term fits in 1 byte, occasionally spilling to 2. The expected size per docID drops from 4 bytes to roughly 1.1.

Decoding is straightforward: read bytes until you find one with the high bit clear, accumulating 7 bits per byte. The cost is one branch per byte, which limits VByte's speed on modern CPUs — it cannot easily be SIMD-vectorised because each byte's interpretation depends on its predecessor's continuation bit. That's why Lucene moved on to FOR/PFOR for the dense common case, but VByte remains the standard for heterogeneous data and for the tail of a postings list (the last block, where bit-packing's overhead is wasteful).

Observation 3: bit-pack a block of similar-sized integers — FOR and PFOR

VByte is per-integer. Frame-of-Reference (FOR) is per-block. The idea: take 128 consecutive deltas, find the maximum, compute bits_needed = ceil(log2(max + 1)), and store all 128 deltas in exactly bits_needed bits each, packed densely into the output.

If the 128 deltas are all ≤ 31, then bits_needed = 5, and 128 deltas fit in 128 × 5 / 8 = 80 bytes — vs 512 bytes raw. Why this is faster than VByte: the block has a fixed bit-width (decoded once from a small block header), so each integer extracts via a constant shift-and-mask with no data-dependent branches. SIMD instructions can unpack 8 or 16 integers per cycle. Lucene's BlockPackedReader does exactly this.
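
As a concrete, unoptimised illustration of that packing arithmetic — the function names below are invented for this sketch, and real implementations pack with word-level shifts rather than one big Python integer:

# for_block.py — Frame-of-Reference packing for one block of deltas (illustrative sketch)

def for_pack(deltas: list[int]) -> tuple[int, bytes]:
    """Pack a block of non-negative deltas at a single fixed bit width."""
    bits = max(d.bit_length() for d in deltas) or 1   # bits_needed = ceil(log2(max + 1))
    acc = 0
    for i, d in enumerate(deltas):
        acc |= d << (i * bits)                        # place each delta in its fixed-width slot
    n_bytes = (len(deltas) * bits + 7) // 8
    return bits, acc.to_bytes(n_bytes, "little")

def for_unpack(bits: int, packed: bytes, count: int) -> list[int]:
    """Constant shift-and-mask extraction — no data-dependent branches."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

block = [5, 17, 2, 30, 9, 12, 1, 25] * 16             # 128 deltas, all <= 31
bits, packed = for_pack(block)
print(bits, len(packed))                               # 5 bits per delta, 80 bytes
assert for_unpack(bits, packed, len(block)) == block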

The catch is outliers. If 127 of your deltas are ≤ 31 but one delta is 1000, FOR demands 10 bits per integer for the entire block — wasting 5 bits on every other delta. PFOR (Patched Frame of Reference) fixes this: pick a bits_needed that covers, say, 90% of the deltas, then store the 10% outliers in a separate exception list with their positions. The block decompresses with the small bit-width; an exception-fixup pass overrides the outlier slots after the main unpack. PFOR-Delta combines this with delta encoding and is the workhorse of modern search engines, with typical compression of 4-6× on real postings lists.
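
The exception mechanism can be sketched the same way. The 90% coverage cut-off and the (position, value) exception layout below are illustrative choices, not the exact format of any engine:

# pfor_block.py — pick a bit width that covers most deltas, patch the outliers

def pfor_split(deltas: list[int], coverage: float = 0.9) -> tuple[int, list[int], list[tuple[int, int]]]:
    """Choose the smallest bit width covering `coverage` of the block; return
    (bits, regular values with outliers zeroed, exception (position, value) list)."""
    widths = sorted(d.bit_length() for d in deltas)
    bits = max(1, widths[int(coverage * (len(deltas) - 1))])
    regular, exceptions = [], []
    for pos, d in enumerate(deltas):
        if d.bit_length() > bits:
            exceptions.append((pos, d))   # stored separately, fixed up after the main unpack
            regular.append(0)
        else:
            regular.append(d)
    return bits, regular, exceptions

deltas = [3, 7, 1, 12, 5, 9, 1000, 4]     # one outlier would force 10-bit FOR
bits, regular, exceptions = pfor_split(deltas)
print(bits, exceptions)                   # 4 bits for the block, [(6, 1000)] patched later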

SIMD-PFOR (FastPFor by Lemire and Boytsov) implements PFOR with explicit SIMD intrinsics, achieving decompression speeds of multiple billions of integers per second on a single core. Quickwit's Tantivy uses a Rust port of this idea.

Lucene's choice: 128-doc FOR blocks, 32-doc tail

Lucene 8 and later store postings in blocks of 128 docs, FOR-bit-packed. The very last block of a postings list (which is unlikely to be exactly 128 docs) is stored as VByte instead, because bit-packing an undersized tail wastes header overhead. The block format is, roughly:

[block 0: header (bits-per-doc) | 128 packed deltas]
[block 1: header | 128 packed deltas]
...
[tail: VByte-encoded remainder]

Each block is a unit of decompression — you cannot read just docID #57 without unpacking the whole block, but unpacking 128 docs is a few microseconds. Skip pointers (next section) point at block boundaries, so a skip lands on a clean block to decompress.

Elasticsearch and OpenSearch are Lucene under the hood, so they inherit this format byte-for-byte. Solr, the same. Quickwit's Tantivy implements the same idea in Rust with SIMD primitives that the JVM cannot easily generate.

Skip pointers: leap forward in the longer list

Compression solves storage. Skip pointers solve a time problem in the AND query.

Recall the merge from chapter 143: two sorted lists, two pointers, advance whichever is smaller. The cost is O(|a| + |b|). That is fine when the two lists are roughly the same length — say, both 10 million docs. But what if a has 10 million docs and b has 100 million? You walk every one of the 100 million docs in b, even though only a tiny fraction will match. Most of that walking is wasted.

Skip pointers add a sparse second-level index inside each postings list. At every √N positions (about 7000 entries for a 50-million-doc list), you store a (docID-at-this-position, byte-offset-of-this-position) pair. Now during intersection, when the long list's current docID is much smaller than the short list's current docID, you check the skip pointer: if the next skip lands at a docID that is still ≤ the target, jump there and skip the intervening 7000 entries.

[Figure: skip pointers as a sparse second level over the postings list 4, 7, 19, 21, 33, 56, 78, 92, 100, 153, 199 — one skip every √N = 4 positions, each storing a (target docID, byte offset) pair, e.g. (33, offset 12) from position 0 and (100, offset 28) from position 4. During intersection, if the next skip's docID is still ≤ the target, jump there and pass 4 entries in O(1).]
Skip pointers form a sparse second-level index over the postings list. Each pointer carries a docID (so you know whether the jump overshoots) and a byte offset (so you know where to land in the compressed stream). For a postings list of length N, a skip every √N positions is the classic default — too many skips waste space, too few waste time.

The skip-aware intersection algorithm is a small modification of the two-pointer merge:

def intersect_with_skip(a, b):
    # a and b are postings objects exposing has_skip / skip_target / skip_to
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            emit(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # try to skip ahead in a
            if a.has_skip(i) and a.skip_target(i) <= b[j]:
                i = a.skip_to(i)          # jump via the skip pointer
            else:
                i += 1
        else:
            # symmetric for b
            if b.has_skip(j) and b.skip_target(j) <= a[i]:
                j = b.skip_to(j)
            else:
                j += 1

The crucial test is skip_target(i) <= b[j]: only jump if the skip lands at or before the value you are looking for. Why this matters: if the skip overshoots, you would skip past the matching docID, breaking correctness. Reading the skip's stored docID (the cheap part of the skip pointer) is what makes the test possible without decompressing the full block.

How much does this buy? For two lists of size M and N with M ≪ N, the naive merge does Θ(M + N) work. The skip-aware version can never do asymptotically more — both pointers still only move forward — and each target from the short list costs at most one block of √N step-by-step moves plus a few skip hops, so the positions touched are bounded by roughly min(N, M·√N). The big wins come when the short list is much smaller than the number of skip blocks, or when its docIDs are bursty, so that long stretches of the long list contain no candidate match and are leapt over in O(1) per skip. On real-world postings lists (Zipfian term distributions, docIDs clustered by ingest order) Lucene's measurements put the speedup from skip pointers at roughly 5-10×.
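
Here is a self-contained Python sketch of the same idea over plain, uncompressed docID lists — the skip table layout and helper names below are invented for illustration, not Lucene's actual structures:

# skip_intersect.py — skip pointers every ~sqrt(N) positions, skip-aware AND
import math

def build_skips(postings: list[int]) -> dict[int, tuple[int, int]]:
    """Map position -> (target docID, target position), one skip every ~sqrt(N) entries."""
    step = max(2, math.isqrt(len(postings)))
    return {i: (postings[i + step], i + step)
            for i in range(0, len(postings) - step, step)}

def intersect_with_skip(a, b):
    skips_a, skips_b = build_skips(a), build_skips(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # jump in a only if the skip target does not overshoot b[j]
            if i in skips_a and skips_a[i][0] <= b[j]:
                i = skips_a[i][1]
            else:
                i += 1
        else:
            if j in skips_b and skips_b[j][0] <= a[i]:
                j = skips_b[j][1]
            else:
                j += 1
    return out

short = list(range(0, 1_000_000, 97))    # sparse list
long  = list(range(0, 1_000_000, 3))     # dense list
assert intersect_with_skip(short, long) == sorted(set(short) & set(long))

Because skips are stored only at multiples of the step, a jump always lands on a position that itself carries a skip, so the pointer can chain hops until the next skip would overshoot.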

A real Python implementation: delta + VByte with compression measurement

About forty lines, no dependencies. Build it, measure it, see the compression ratio.

# postings_compress.py — delta + VByte encoding for a postings list

def vbyte_encode(n: int) -> bytes:
    """Encode an unsigned integer using variable-byte (1-5 bytes)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)   # 7 value bits + continuation
        n >>= 7
    out.append(n & 0x7F)                 # final byte: high bit clear
    return bytes(out)

def vbyte_decode_stream(buf: bytes) -> list[int]:
    """Decode all integers in a VByte-encoded buffer."""
    out, n, shift = [], 0, 0
    for byte in buf:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            out.append(n)
            n, shift = 0, 0
    return out

def encode_postings(doc_ids: list[int]) -> bytes:
    """Delta encode then VByte encode. Caller guarantees doc_ids is sorted."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        out.extend(vbyte_encode(gap))
        prev = d
    return bytes(out)

def decode_postings(buf: bytes) -> list[int]:
    """Reverse: VByte decode then cumulative sum."""
    gaps = vbyte_decode_stream(buf)
    out, cum = [], 0
    for g in gaps:
        cum += g
        out.append(cum)
    return out

Now measure on a realistic postings list. Simulate a term that appears in 1% of a 10-million-doc corpus — about 100,000 docs, randomly scattered:

import random

random.seed(42)
N_DOCS, RATE = 10_000_000, 0.01
postings = sorted(random.sample(range(N_DOCS), int(N_DOCS * RATE)))

raw_size = len(postings) * 4
encoded  = encode_postings(postings)
ratio    = raw_size / len(encoded)

print(f"postings count : {len(postings):>10,}")
print(f"raw 4-byte size: {raw_size:>10,} bytes")
print(f"compressed size: {len(encoded):>10,} bytes")
print(f"compression    : {ratio:>10.2f}×")

# round-trip check
assert decode_postings(encoded) == postings
print("round-trip OK")

Running this prints numbers close to these:

postings count :    100,000
raw 4-byte size:    400,000 bytes
compressed size:   ~128,000 bytes
compression    :       ~3.1×
round-trip OK

A bit over 3× compression from the simplest possible encoder. Why this matches theory: the average gap is 10^7 / 10^5 = 100, which fits in 1 byte (VByte's 1-byte range is 0-127), but the gaps are roughly geometrically distributed around that mean, so a bit over a quarter of them exceed 127 and spill to 2 bytes, putting the average near 1.3 bytes per docID. For very common terms the average gap is close to 1, almost every gap fits in 1 byte, and compression approaches VByte's 4× ceiling. For very rare terms the gaps grow, VByte spends 2-3 bytes each, and compression drops.

A real Lucene index on the same data would do better — into the 4-6× range typical of PFOR-Delta — because bit-packing exploits the uniformity of gap sizes within a 128-doc block. VByte can't, because it works one integer at a time.

Worked example: searching a 1-billion-document index

Compressing the postings for samsung AND phone on a billion-doc product catalogue

Imagine you are building search for an Indian e-commerce site with 1 billion product listings (Flipkart, Amazon, Meesho combined-scale). Two terms in the query samsung AND phone:

term        postings count    raw size (4 B per docID)    average gap
samsung         50,000,000    200 MB                      20
phone          100,000,000    400 MB                      10

Together that is 600 MB of raw postings to stream from disk — at NVMe speeds (~1 GB/s) about 600 ms before any logic runs. Way over the user's 50 ms budget.

Step 1 — Delta + VByte compression.

Average gap of 20 for samsung → 1 byte per docID after VByte (since 20 < 128). Encoded size ≈ 50 MB. A 4× win.

Average gap of 10 for phone → 1 byte per docID. Encoded size ≈ 100 MB. Same 4× win.

Combined disk read drops from 600 MB to 150 MB → 150 ms instead of 600 ms. Better, still not enough.

Step 2 — Add skip pointers.

For samsung (50 M postings), add a skip pointer every √N ≈ 7000 positions → about 7000 skips. Each skip is, say, 8 bytes (4 for docID, 4 for offset) → 56 KB of skip data. Negligible.

For phone (100 M postings), add a skip every 10000 positions → 10000 skips × 8 B = 80 KB of skip data.

Total skip-pointer overhead is well under a megabyte — irrelevant compared to the 150 MB of compressed postings.
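
The arithmetic of steps 1 and 2 fits in a few lines of Python if you want to play with the assumptions (the 1-byte-per-gap and 8-bytes-per-skip figures below are the estimates used above, not measurements):

# envelope.py — rough sizes for the samsung AND phone example
import math

CORPUS = 1_000_000_000
terms = {"samsung": 50_000_000, "phone": 100_000_000}

for term, n in terms.items():
    raw_mb   = n * 4 / 1e6                 # fixed 4-byte docIDs
    avg_gap  = CORPUS // n
    vbyte_mb = n * 1 / 1e6                 # avg gap < 128 -> ~1 byte per gap
    n_skips  = n // math.isqrt(n)          # one skip every sqrt(N) positions
    skip_kb  = n_skips * 8 / 1e3           # 8 bytes per (docID, offset) pair
    print(f"{term:8s} gap≈{avg_gap:3d}  raw {raw_mb:5.0f} MB  "
          f"vbyte ≈{vbyte_mb:4.0f} MB  skips ≈{skip_kb:3.0f} KB")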

Step 3 — Skip-aware intersection.

Without skips: walk both lists fully → 50 M + 100 M = 150 M docID comparisons. At ~100 M comparisons/second in real C++ this takes ~1.5 seconds.

With skips: the algorithm still walks all of samsung (50 M steps), but in phone it leaps over whole stretches that contain no samsung docIDs instead of stepping through them. Taking the rough 5-10× speedup that skip pointers deliver on skewed intersections in practice, intersection time drops from ~1.5 s to roughly 150 ms.

Step 4 — Combine the savings.

Read 150 MB compressed postings: 150 ms (network or NVMe; in practice OS page cache makes hot postings effectively free).

Decompress and intersect with skips: 150 ms.

Total: ~300 ms, of which the 150 ms read disappears for hot terms whose postings already sit in the OS page cache. A query that took roughly two seconds naive now finishes in a few hundred milliseconds cold, and about half that warm.

That is the gap between infeasible and the production search engine. Compression buys disk and network bandwidth; skip pointers buy CPU. Both are necessary; neither is sufficient.

Real systems: what production search engines actually use

Each engine makes a slightly different choice in the compression-vs-decode-speed trade-off, and the choices reveal what the engine was optimised for. Lucene — and therefore Elasticsearch, OpenSearch, and Solr — stores delta-encoded postings in 128-doc FOR-packed blocks with a VByte tail and multi-level skip lists, a format the JVM's JIT can decode quickly [1]. Quickwit's Tantivy implements the same block structure in Rust with explicit SIMD bit-packing in the spirit of FastPFor [6], and FastPFor itself is the C++ reference library for SIMD-PFOR, decoding billions of integers per second per core [2][4].

The pattern across all of them: delta encoding is universal, and the only real choice is how aggressively to bit-pack on top — with the answer depending on whether your hot loop is JIT'd Java, native Rust with SIMD, or C++ with intrinsics.

Going deeper

A few directions for the curious before chapter 146 derives BM25.

Roaring Bitmaps for very dense postings

For the most common terms (e.g. the with a billion postings in a billion-doc index — i.e. nearly every doc), the gap between docIDs averages 1 and bit-packing approaches 1 bit per docID. At that point a different structure dominates: a Roaring Bitmap, which represents dense ranges as raw bitmaps (1 bit per docID) and sparse ranges as sorted arrays. In the Lucene ecosystem, Roaring-style bitmaps show up in filter caches and other cached docID sets rather than in the primary postings format [5]. The RoaringBitmap library is the reference implementation; it powers Druid, Pinot, and ClickHouse.

Multi-level skip lists

Lucene's skip pointers are not a flat √N array but a multi-level skip list — like the skip list data structure from chapter 19. Level 0 has skips every 128 docs; level 1 every 128 × 8 = 1024; level 2 every 1024 × 8; and so on up to a max level. Why multi-level: a single √N level wastes time when the target docID is very far ahead — you would still walk √N skips to find the right block. With multiple levels, you do O(log N) skip-pointer hops to find the target block, then one decompression.
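
A quick back-of-envelope shows why a handful of levels is enough. Assuming the block size of 128 and multiplier of 8 described above, the skip entries per level for a 50-million-doc postings list fall off geometrically:

# skip_levels.py — entries per skip level for a 50 M-doc postings list
N, BLOCK, MULT = 50_000_000, 128, 8

entries, level = N // BLOCK, 0
while entries >= 1:
    print(f"level {level}: {entries:,} skip entries")
    entries //= MULT
    level += 1

Seven levels, and locating the right 128-doc block costs a few pointer hops down the levels rather than a linear walk along one flat skip array.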

Why not just gzip the postings?

General-purpose compression like gzip or zstd achieves higher ratios on raw integer arrays — sometimes 8-10× — but decompresses far more slowly: on the order of hundreds of MB/s for gzip and roughly a GB/s for zstd, versus the multiple GB/s of decoded integers that FOR/PFOR reaches. For postings, decompression sits on the query critical path, so search engines pick formats that are mediocre compressors but blazing decoders. Cold archive systems that do not need fast random access can use zstd — and some hybrid systems do, with zstd as a slow-tier postings format.

The Variable-Byte family

VByte has many variants. Group VByte (also known as Varint-GB) packs the byte-lengths of four integers into a single header byte, two bits each, followed by the value bytes — faster to decode because the header says up front how many bytes each integer occupies, removing the per-byte branch. Varint-G8IU (Stepanov et al., 2011) is a SIMD-decodable variant. The trade-offs are subtle — see Lemire and Boytsov's survey [2] for benchmarks.

What about positions, frequencies, and payloads?

This chapter only compressed docIDs. A real index also stores, for each (term, doc) posting, the term frequency (how often the term occurs in that doc — chapter 146 needs it for ranking), the positions at which it occurs (needed for phrase and proximity queries), and often character offsets and payloads.

Each of these gets its own delta + bit-packed stream in Lucene's format. Positions are themselves delta-coded within a doc (positions in a doc are sorted, and gaps are small) and benefit from the same techniques.

What chapter 146 fills in next

You can now store and intersect postings at scale. The next chapter, chapter 146, adds the missing piece: ranking. So far an AND query returns all matching docs in docID order. A real search engine ranks them — the most relevant first. We will derive TF-IDF from first principles, then see why Robertson's BM25 (with its length normalisation and saturation function) beat it everywhere and remains the lexical baseline that even modern neural rankers compare against. The compressed postings you built here carry one extra integer per doc (the term frequency) that BM25 will turn into a relevance score.

References

  1. Apache Lucene. Lucene99 codec — index file formats. lucene.apache.org/core/9_10_0/core/org/apache/lucene/codecs/lucene99/package-summary.html
  2. Lemire, Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 2015. arxiv.org/abs/1209.2137
  3. Pibiri, Venturini. Techniques for inverted index compression. ACM Computing Surveys, 2020. arxiv.org/abs/1908.10598
  4. Lemire, Boytsov, et al. FastPFor: Fast integer compression library. github.com/lemire/FastPFor
  5. Elastic. Frame of Reference and Roaring Bitmaps. elastic.co/blog/frame-of-reference-and-roaring-bitmaps
  6. Quickwit / Tantivy. Rust full-text search engine library. github.com/quickwit-oss/tantivy