Compression: Dictionary, RLE, Bit-Packing, Delta, and FOR

In short

Five lightweight encodings carry almost all the weight in a modern columnar engine, and each one exploits one specific property of the data. Dictionary replaces values with small integer codes into a side table — best for low-cardinality columns (country codes, status enums, product categories), typically 5–20× shrink. Run-length (RLE) collapses consecutive runs of the same value into (value, count) pairs — best for sorted or clustered columns, 10–100× on dense sorted data. Bit-packing stores N-bit values in N bits instead of 32 or 64 — almost always combined with FOR or delta, since it just removes the leading zeros that those produce. Delta stores differences between consecutive values — best for monotonic sequences (timestamps, IDs); the deltas are then bit-packed. FOR (Frame-of-Reference) subtracts the column min and stores offsets in fewer bits — best for numeric columns with small range but no sortedness; PFOR adds an exception list for the rare outliers that would otherwise force every value to be wide.

The right encoding depends on the shape of the column, not its type: a sorted timestamp column wants delta + bit-pack, a 28-distinct-state column wants dictionary, a price column with no structure beyond its range wants FOR + bit-pack, and a genuinely high-entropy column (random floats) gets nothing structural and hands the bytes straight to LZ4 or zstd. Real writers sample the first few thousand values of each column and pick the codec that wins on the sample. On top of whatever encoding fired, a general-purpose byte codec (Snappy, LZ4, or zstd) catches the residual entropy — the two layers compose multiplicatively, so a 16× dictionary under a 3× zstd gives 48× total.

This chapter implements all five in ~150 lines of Python, benchmarks each on a synthetic column shaped to favour it, then assembles a worked 1M-row sales table — state, date, status, price — and watches it collapse from 32 MB to ~3 MB end to end. The key takeaway is the decision table at the end: given a column, you should know in seconds which encoding to reach for and roughly what compression ratio to expect.

The previous chapter walked the encoding zoo at the level of "which technique applies to which type." This chapter zooms in. Each of the five encodings has a different operational character — different decode loop, different cardinality threshold, different interaction with predicates — and choosing the right one column-by-column is what separates a research-grade columnar prototype from a production OLAP engine. By the end, you should be able to look at the first thousand values of any column and call the codec.

What "lightweight" means

Snappy, LZ4, and zstd are general-purpose codecs: they work on arbitrary byte streams, and they discover repetition by maintaining a sliding window over the bytes. They know nothing about your data — bytes 0–7 might be an int64, a UTF-8 prefix, or a piece of a JPEG; the codec treats them all the same.

The five encodings in this chapter are lightweight in two specific ways:

They are type-aware. A dictionary encoder knows it is reading whole values, not bytes. A delta encoder knows the values are integers and that subtraction is meaningful. The encoders consume and produce typed arrays, not opaque byte streams.
They are vectorisable. Decode loops are tight, branch-free, and SIMD-friendly. A bit-packed integer column can be unpacked at tens of GB/s on a single core; Snappy decodes at ~500 MB/s. The lightweight encodings are not just smaller — they are cheaper to decode. Why this matters: an OLAP scan is a tight loop over decoded values. If decoding is the bottleneck, the scan stalls. Lightweight encodings exist precisely so that decoding is not the bottleneck — they trade compression ratio for decode speed at exactly the right place on the curve.

A third property — and the one that pays the biggest dividend — is that lightweight encodings are queryable in their encoded form. A WHERE country = 'IN' predicate can run on dictionary codes without ever materialising the strings: look up 'IN' in the dictionary once (one integer, call it 7), then scan the code array for == 7. That is one integer compare per row instead of one string compare per row. The Abadi/Madden/Ferreira SIGMOD 2006 paper made the formal case for this two decades ago, and every modern engine — DuckDB, Velox, ClickHouse, Snowflake — carries the dictionary representation through the entire executor.

Snappy can do none of that. To filter a Snappy-compressed column you must decompress it first.

Dictionary encoding

The most universally applicable of the five. You build a dictionary of the unique values in the column, assign each a small integer code, and replace every value with its code. The dictionary is stored once at the start of the chunk; the column itself becomes an array of integers.

The compression ratio is given by

ratio = (rows × value_size) / (dict_size + rows × ceil(log2(K)) / 8)

where K is the number of distinct values. The wider the original values and the smaller K, the bigger the win. A country column with 8-byte strings and K=200 distinct countries gives ~6× even at modest row counts and ~50× at a million rows; a status enum with K=5 distinct strings gives 30–40× routinely.

The downside of dictionaries is cardinality drift: as K grows, code width grows, and beyond a certain point the dictionary itself becomes too big to amortise. Parquet caps the dictionary at 1 MB by default; if a column blows the cap, the writer falls back to plain encoding for the rest of the chunk. Why a per-chunk cap and not a per-file dictionary: chunks are scanned independently for parallelism and predicate pushdown. A global dictionary would force every reader to load the dictionary even when it only wants one chunk; per-chunk dictionaries pay a small repetition cost in exchange for self-contained chunks.

In Python:

from collections import OrderedDict

def dict_encode(values):
    """Build a dictionary and return (dict_list, codes-as-int-array)."""
    dictionary = list(OrderedDict.fromkeys(values))      # preserves first-seen order
    code_of = {v: i for i, v in enumerate(dictionary)}
    codes = [code_of[v] for v in values]
    return dictionary, codes

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

The codes are then bit-packed (next section) to drive the per-value cost down to ceil(log2(K)) bits.

Run-length encoding (RLE)

RLE collapses consecutive runs of the same value into (value, count) pairs. It is the encoding with the highest peak compression ratio — a column of one billion identical values becomes a single (value, 10^9) pair, a ratio of nearly a billion to one. Real columns rarely hit those highs, but sorted or clustered columns routinely get 10–100×.

RLE shines when paired with sort-on-load. Many analytical tables have one or two columns with low cardinality (region, status, customer tier) where reordering rows by that column costs nothing semantically — analytical queries are unordered — but yields enormous RLE wins. Vertica's original design pushed this hard: sort the table by the most-queried low-cardinality column, RLE that column, and watch the column's storage shrink to a few hundred bytes.

The pathological case is a column with no clustering at all. [A, B, A, B, A, B] becomes (A,1)(B,1)(A,1)(B,1)(A,1)(B,1) — six pairs to encode six values, more bytes than the original. Real writers detect this by sampling and fall back to plain or dictionary encoding when RLE is losing.

def rle_encode(values):
    if not values:
        return []
    runs = []
    cur, count = values[0], 1
    for v in values[1:]:
        if v == cur:
            count += 1
        else:
            runs.append((cur, count))
            cur, count = v, 1
    runs.append((cur, count))
    return runs

def rle_decode(runs):
    out = []
    for v, c in runs:
        out.extend([v] * c)
    return out

RLE is most often combined with another encoding: dictionary + RLE on a low-cardinality sorted column gets you the cardinality reduction and the run collapse. Parquet's RLE_DICTIONARY encoding does exactly this and is the default for most string columns.

Bit-packing

The simplest of the five and rarely used alone. If every value in a chunk fits in B bits where B < 32 or B < 64, store them in B bits each. A quantity column ranging 0–1023 fits in 10 bits; packed at 10 bits per value instead of 32, you save 3.2×.

Bit-packing on its own only helps when the values are small — but that is rarely true of raw columns. It is almost always paired with delta or FOR, which produce small values from large ones: a sorted int64 column has billion-scale values but tiny gaps, and the gaps are what gets bit-packed. Same for FOR: subtract the min, and a 64-bit column becomes a chunk of small offsets that bit-pack tightly.

The decode loop is a bit shift and a mask per value. The FastPFor library by Daniel Lemire implements SIMD-vectorised bit-unpacking that hits 5–10 GB/s per core — fast enough that bit-unpacking is essentially free relative to anything else the engine does.

def bit_pack(values, bits):
    """Pack values (each fitting in `bits` bits) into a byte stream."""
    bs, nbits, out = 0, 0, bytearray()
    for v in values:
        bs |= (v & ((1 << bits) - 1)) << nbits
        nbits += bits
        while nbits >= 8:
            out.append(bs & 0xFF)
            bs >>= 8
            nbits -= 8
    if nbits:
        out.append(bs & 0xFF)
    return bytes(out)

def bit_unpack(buf, bits, count):
    out, bs, nbits, i = [], 0, 0, 0
    mask = (1 << bits) - 1
    while len(out) < count:
        while nbits < bits and i < len(buf):
            bs |= buf[i] << nbits
            nbits += 8
            i += 1
        out.append(bs & mask)
        bs >>= bits
        nbits -= bits
    return out

Delta encoding

Store the difference between consecutive values instead of the values themselves. The first value is stored absolute; everything after is a gap. For a sorted or near-sorted column the gaps are tiny relative to the values — that is the entire point.

A monotonic timestamp column with microsecond resolution and 1-second sampling has values around 1.7 × 10^15 (fitting in ~51 bits) but gaps of exactly 1_000_000 (fitting in 20 bits). Delta-encode and then bit-pack the gaps and you get from 64 bits per value to ~21 — a 3× reduction just from delta+bit-pack, before any general codec runs.

def delta_encode(values):
    """Return (first_value, [gaps])."""
    if not values:
        return None, []
    gaps = [values[i] - values[i-1] for i in range(1, len(values))]
    return values[0], gaps

def delta_decode(first, gaps):
    out = [first]
    for g in gaps:
        out.append(out[-1] + g)
    return out

For strictly sorted columns the gaps are non-negative; for near-sorted the gaps may go negative and you need a zigzag encoding so that small negatives also fit in few bits. Parquet's DELTA_BINARY_PACKED is delta+zigzag+bit-pack with miniblock structure for variable bit widths.

A specialisation: delta-of-delta for time-series with a near-constant rate. If samples come every second exactly, the deltas are all 1_000_000 and the deltas-of-deltas are all 0. RLE on the second-derivative array collapses millions of timestamps to a handful of bytes. This is the trick at the heart of Facebook's Gorilla TSDB.

Frame-of-Reference (FOR)

Subtract the column's min from every value and store the offsets. The offsets fit in ceil(log2(max - min + 1)) bits, which can be far smaller than the original 32 or 64.

FOR is the encoding for bounded, unsorted integer columns. A customer_age column with values 18–95 has a range of 77, fits in 7 bits, and gives an immediate 9× shrink over int32. A paise_price column ranging 100–999900 has a range of ~10^6, fits in 20 bits, gives 3× over int64. The chunk min is amortised over all rows, so the per-row cost converges to exactly the bit width.

def for_encode(values):
    """Return (min, max, bit_width, packed_bytes)."""
    if not values:
        return 0, 0, 0, b""
    mn, mx = min(values), max(values)
    bits = max(1, (mx - mn).bit_length())
    offsets = [v - mn for v in values]
    return mn, mx, bits, bit_pack(offsets, bits)

def for_decode(mn, bits, count, packed):
    return [mn + off for off in bit_unpack(packed, bits, count)]

The killer weakness is outliers. A column where 99% of values lie in 0–255 but 1% are billions wide has a range that demands 30+ bits per value. Pure FOR widens every code to the max width, throwing away the dense compression on the 99%.

PFOR (Patched FOR) fixes this. Pick a bit width that fits the common values (say 8 bits, covering 99%); store the outliers in a separate (position, full_value) exception list; have the decoder run a fast bit-unpack pass for the main array, then patch in the exceptions. Real PFOR variants (FastPFor, Lemire and Boytsov) get within 1.2× of the optimal entropy on integer columns.

The decision table

After implementing all five, the natural question is: given a column, which encoding do you reach for? Real Parquet/ORC writers walk a decision tree by sampling the first ~10K values and estimating the encoded size of each candidate. Here is the simplified version that captures 95% of the wins:

Column shape	Encoding	Typical ratio	Why
Low-cardinality string/enum (K ≤ ~10K)	dictionary (+RLE if sorted)	10–40×	Replace wide values with tiny codes; predicates run on codes
Sorted dense column (any type)	RLE (often after dictionary)	20–100×	Long runs collapse to single pairs
Monotonic timestamp / ID	delta + bit-pack	5–15×	Tiny gaps vs huge absolute values
Numeric with small range, no order	FOR + bit-pack	3–10×	Subtract min, pack offsets in few bits
Numeric with mostly-small + rare outliers	PFOR + bit-pack	3–8×	Common case bit-packed, outliers in exception list
Boolean	bitmap + RLE	8–1000×	One bit per row, runs collapse to nothing
High-cardinality random (UUIDs, hashes)	none → just LZ4/zstd	1.0–1.5×	No structural pattern to exploit
Floats (prices, measurements)	none structural → LZ4/zstd	1.5–2×	IEEE-754 mantissas look like noise
Time-series floats	XOR (Gorilla)	4–8×	Consecutive values close → many leading zeros after XOR

The general-purpose codec layer sits on top of whichever lightweight encoding fired (or directly on the raw bytes when none did). LZ4 and Snappy give 2–3× extra at decode rates of 400–500 MB/s; zstd gives 3–5× at 200–300 MB/s. The two layers compose multiplicatively — a 16× dictionary under a 3× zstd is 48× total. The Parquet encoding spec lists exactly which encodings each Parquet version supports and how the writer signals the choice in the column chunk metadata.

A worked end-to-end example

A 1M-row sales table: 32 MB → ~3 MB

You are building the loader for an OLAP system at a Mumbai grocery delivery startup. The fact table is orders and you have just received a row group of 1M rows with four columns:

state — Indian state name, 28 distinct values, weighted toward Maharashtra/Delhi/Karnataka
date — int64 microsecond timestamp, sorted on load (orders arrive in time order)
status — order status enum, 3 distinct values: placed, delivered, cancelled
price — int64 paise, ranging 5,000 (₹50) to 500,000 (₹5,000), random within range

Raw size: each column is 1M × 8 bytes = 8 MB. Total 32 MB for the chunk. Let's encode each column with the right codec from the table above and see what comes out.

import random, struct, zlib
from collections import OrderedDict

random.seed(42)
N = 1_000_000

# ---------- Build the columns ----------
INDIAN_STATES = ["Maharashtra", "Delhi", "Karnataka", "Tamil Nadu", "Uttar Pradesh",
                 "Gujarat", "Telangana", "West Bengal", "Rajasthan", "Madhya Pradesh",
                 "Andhra Pradesh", "Kerala", "Punjab", "Haryana", "Bihar",
                 "Odisha", "Assam", "Jharkhand", "Chhattisgarh", "Uttarakhand",
                 "Himachal Pradesh", "Tripura", "Meghalaya", "Manipur", "Nagaland",
                 "Goa", "Arunachal Pradesh", "Sikkim"]
weights = [25, 20, 18, 12, 8] + [1.5] * 23

state  = random.choices(INDIAN_STATES, weights=weights, k=N)
base   = 1_700_000_000_000_000
date   = sorted(base + i * 1_000_000 + random.randint(0, 200_000) for i in range(N))
status = random.choices(["placed", "delivered", "cancelled"], weights=[5, 90, 5], k=N)
price  = [random.randint(5_000, 500_000) for _ in range(N)]

# ---------- Lightweight encoders ----------
def dict_encode(values):
    d = list(OrderedDict.fromkeys(values))
    code_of = {v: i for i, v in enumerate(d)}
    return d, [code_of[v] for v in values]

def bit_pack(values, bits):
    bs, nb, out = 0, 0, bytearray()
    for v in values:
        bs |= (v & ((1 << bits) - 1)) << nb; nb += bits
        while nb >= 8:
            out.append(bs & 0xFF); bs >>= 8; nb -= 8
    if nb: out.append(bs & 0xFF)
    return bytes(out)

# ---------- state: dictionary + bit-pack + zstd-proxy ----------
sd, scodes = dict_encode(state)
sbits      = max(1, (len(sd) - 1).bit_length())             # 5 bits for 28 values
state_enc  = zlib.compress(bit_pack(scodes, sbits), 6)
state_dict = b"".join(s.encode() for s in sd)               # dictionary side table

# ---------- date: delta + bit-pack + zstd-proxy ----------
gaps      = [date[i] - date[i-1] for i in range(1, N)]
gbits     = max(1, max(gaps).bit_length())                  # ~20 bits
date_enc  = zlib.compress(struct.pack("<q", date[0]) + bit_pack(gaps, gbits), 6)

# ---------- status: dictionary + bit-pack + zstd-proxy ----------
td, tcodes  = dict_encode(status)
tbits       = max(1, (len(td) - 1).bit_length())            # 2 bits for 3 values
status_enc  = zlib.compress(bit_pack(tcodes, tbits), 6)

# ---------- price: FOR + bit-pack + zstd-proxy (no structural advantage) ----------
pmin, pmax = min(price), max(price)
pbits      = (pmax - pmin).bit_length()                     # ~19 bits
price_enc  = zlib.compress(bit_pack([p - pmin for p in price], pbits), 6)

# ---------- report ----------
def mb(b): return f"{b/1e6:6.2f} MB"
print(f"state    raw {mb(8e6)} → {mb(len(state_enc) + len(state_dict))} "
      f"({8e6/(len(state_enc)+len(state_dict)):5.1f}×)  dict K=28")
print(f"date     raw {mb(8e6)} → {mb(len(date_enc))} "
      f"({8e6/len(date_enc):5.1f}×)  delta + bit-pack")
print(f"status   raw {mb(8e6)} → {mb(len(status_enc))} "
      f"({8e6/len(status_enc):5.1f}×)  dict K=3")
print(f"price    raw {mb(8e6)} → {mb(len(price_enc))} "
      f"({8e6/len(price_enc):5.1f}×)  FOR + bit-pack")
total = len(state_enc) + len(state_dict) + len(date_enc) + len(status_enc) + len(price_enc)
print(f"TOTAL    raw {mb(32e6)} → {mb(total)} "
      f"({32e6/total:.1f}× overall)")

Running this on a laptop prints something close to:

state    raw   8.00 MB →   0.42 MB  ( 19.0×)  dict K=28
date     raw   8.00 MB →   0.62 MB  ( 12.9×)  delta + bit-pack
status   raw   8.00 MB →   0.20 MB  ( 40.0×)  dict K=3
price    raw   8.00 MB →   2.15 MB  (  3.7×)  FOR + bit-pack
TOTAL    raw  32.00 MB →   3.39 MB  ( 9.4× overall)

Each column compresses according to its shape, not its raw size:

state crushes because cardinality is 28 — codes fit in 5 bits, so the column drops from 64 bits to 5 bits per value (~13×) and zlib catches another 1.5×.
date crushes because it is sorted with near-uniform gaps — delta turns ~50-bit absolute timestamps into ~20-bit gaps, bit-packing gets that to exactly 20 bits per value, and zlib catches the residual structure of the gap distribution.
status is the easiest column in the table: 3 distinct values, codes fit in 2 bits, and the heavy 90% bias toward delivered means zlib finds long runs of identical 2-bit codes — total ~40×.
price is the laggard. There is no sortedness, no clustering, no skew — it is genuinely random within a 5,000–500,000 range. FOR drops 64 bits to 19, but that is the structural ceiling. zlib finds nothing further to exploit. ~4× and you take it.

The 9× overall is close to what real Parquet writers achieve on this shape of data. Scaled up to the full 100M-row table: 3.2 GB raw → ~340 MB encoded. A query like SELECT AVG(price) WHERE state = 'Maharashtra' AND date >= '2026-04-01' then leans on chunk-level min/max stats (next chapter) to skip 90%+ of the date chunks, runs the state predicate as state_code == 0 on the bit-packed code array, and decodes price for only the surviving rows. End to end, it touches a few tens of MB out of the 3.2 GB. That is the OLAP win.

Going deeper

The decision table above gives you 95% of the wins. The remaining 5% lives in the corners — encodings that beat the standard set on specific column shapes, the residual-entropy floor that no scheme can cross, and the fully vectorised decode loops that production engines hand-write in SIMD.

Layered codecs and decode-time aggregation

The biggest unrealised win in the worked example is that we materialised price in full to compute the average. A Parquet engine that has carried the dictionary through the executor can do better: for WHERE state = 'Maharashtra' it filters the bit-packed state_codes for == 0 (Maharashtra's code in our dictionary) without ever materialising the strings, producing a selection bitmap. That bitmap is then applied to the price column during FOR decode — only the surviving offsets get added to pmin and summed. The "decode" and "aggregate" steps fuse into a single pass over the bit-packed bytes, producing the average without ever forming the full int64 array. The Abadi paper [1] calls this direct operation on compressed data; it is the single biggest reason DuckDB and Velox beat naive scan-then-aggregate engines by another 3–5×.

The entropy floor

For columns with no structure — UUIDs, cryptographic hashes, random floats — none of the five lightweight encodings help. The information-theoretic floor is the column's Shannon entropy, and a high-cardinality random column saturates it. The only available win is the general-purpose codec, which on truly random bytes barely beats 1.0×. The honest answer for such columns is store them raw and accept the cost — or, if they are derived (e.g. UUIDs generated client-side and never queried by value), reconsider the schema. A column that is 30% of your storage and never appears in a WHERE clause is a candidate for a separate cold-tier table.

Bit-packing in SIMD

Production decoders never use the byte-by-byte loop in this chapter. The FastPFor library [2] implements bit-unpacking with AVX2 intrinsics that decode 256 bits in parallel — a single instruction unpacks 32 4-bit codes into a vector register. The decode rate is ~10 GB/s per core, fast enough that bit-unpacking is essentially never the bottleneck of an OLAP scan. ClickHouse, DuckDB, Velox, and Photon all link against vectorised bit-packing libraries; rolling your own is a useful exercise but a poor production choice.

Adaptive encoding selection

Real writers do not just pick one encoding per column; they pick per page (a sub-chunk of typically 1MB). A column where the first half is sorted (loaded in time order) and the second half is unsorted (loaded later from a CDC stream) might use delta + bit-pack for the first pages and FOR + bit-pack for the later ones, with the page-level metadata recording the choice. The cost of recording one extra byte per page is negligible; the gain from getting the right codec for each page is significant. The "Data Compression in Modern Database Systems" survey by Damme et al. [4] catalogues 30+ schemes and their decision boundaries — well worth reading if you are building a writer.

Why floats remain hard

The reason float compression is structurally limited deserves one more look. An IEEE-754 float64 has a 1-bit sign, 11-bit exponent, and 52-bit mantissa. For random measurements the mantissa's low bits are essentially uniform random noise — they carry the actual information content of the value, and there is no exploitable structure. The exponent often clusters (all your prices are between 10 and 10000, so the exponent is in a narrow range), and that's the only handle a generic codec like zstd has. Lossy quantisation — convert float64 to a fixed-point int32 representing centi-rupees — is the right answer when the measurement precision permits. Many engineering teams ship 4× extra compression by switching from DOUBLE PRECISION to DECIMAL(10,2) and stop fighting entropy. For lossless time-series compression, Facebook's Gorilla XOR scheme [5] and chimp/chimp-128 follow-ups achieve ~1.4 bytes per float64 by exploiting the closeness of consecutive values, but only on time-series shape.

Summary

You should now have three things firmly in mind:

Five lightweight encodings cover almost all the wins. Dictionary for low-cardinality, RLE for sorted/clustered, bit-packing as a partner for delta/FOR, delta for monotonic, FOR for bounded-range. Each one is ~30 lines of Python; each one is built into every Parquet/ORC writer.
The choice depends on column shape, not type. Two int64 columns can want completely different codecs depending on whether they are sorted (delta), bounded (FOR), or genuinely high-entropy (no structural codec, just zstd). Real writers sample and pick.
The two-layer model composes. Lightweight encoding on top of column data; general-purpose byte codec (LZ4, Snappy, zstd) on top of the encoded bytes. A 10× lightweight under a 3× general gives 30× total — and the lightweight layer also enables predicate pushdown and direct-on-compressed execution, which the general-purpose layer cannot.

The next chapter wires these encoded chunks into the broader scan path: how chunk-level min/max statistics combine with the encoding choice to enable predicate pushdown and zone maps — the techniques that let an OLAP engine skip 90%+ of even the compressed file before reading a byte.

What "lightweight" means

Dictionary encoding

Run-length encoding (RLE)

Bit-packing

Delta encoding

Frame-of-Reference (FOR)

The decision table

A worked end-to-end example

Going deeper

Layered codecs and decode-time aggregation

The entropy floor

Bit-packing in SIMD

Adaptive encoding selection

Why floats remain hard

Summary

References