In short
A column file is not one giant array of values. Real OLAP formats — Parquet, ORC, Arrow IPC — split each column into row-group chunks, typically 64K to 1M rows per chunk, and treat each chunk as an independent unit. Every chunk carries a tiny header of min/max/null-count statistics, which lets the query engine skip whole chunks without reading their bytes (min > 100 ⇒ this chunk has no rows matching value < 100). Chunks also enable parallel scans (one thread per chunk) and progressive decoding (start aggregating before the file finishes downloading from S3).
Inside each chunk, the bytes are passed through a type-specific encoder chosen for what that data actually looks like. Sorted integers → delta encoding (store gaps, not absolutes) plus bit-packing (12-bit values stored in 12 bits, not 64). Low-cardinality strings → dictionary encoding (replace ["south", "south", "north", ...] with a 4-entry dictionary plus a 100M-entry array of 1-byte codes). Booleans → bitmap + RLE ([1,1,1,1,0,0,1] → (1,4)(0,2)(1,1)). Timestamps → delta-of-delta because the gaps themselves are nearly constant. Floats are the hardest, but XOR encoding (Facebook's Gorilla) wins on time-series.
On top of that logical layer sits a second layer of general-purpose compression — Snappy or LZ4 when decode speed matters most, zstd or gzip when compression ratio does — that squeezes the already-encoded bytes another 2–3×. The two layers are complementary: encoding exploits the column's semantic structure (sortedness, low cardinality, repetition); the byte-level codec mops up the residual entropy. A typical e-commerce orders table goes 100 GB → 20 GB after encoding → 8 GB after Snappy, and the scan is faster because you read fewer bytes from disk and decode them in tight, branch-free loops. This chapter builds both layers in 200 lines of Python so you can watch a 32 MB four-column block collapse to 5 MB live.
The previous chapter ended with a 60× wall-clock speedup from layout alone — column-by-column instead of row-by-row. That was the floor. Real columnar engines stack two more orders of magnitude on top, and almost all of that comes from how each column chunk is physically encoded and compressed before it ever hits disk.
A naive column store would just write int64 values back-to-back into a file, perhaps with a Snappy pass on the result. That works, and you'd already be doing better than a row store. But you would be leaving 5–20× of compression on the table, and — worse — you would be blind to which parts of the file you can skip entirely. The point of this chapter is to take a column from "an array of values" to "a sequence of self-describing, statistic-stamped, type-encoded, byte-compressed chunks" — the actual on-disk shape of every modern OLAP file format.
Why chunks at all
Imagine a column of 1 billion int64 order IDs. Stored as one flat array, that is 8 GB. To answer SELECT SUM(price) WHERE order_id BETWEEN 1000000 AND 2000000, the engine would have to read the entire 8 GB column, scan linearly until it finds the matching range, then read the corresponding range of the price column. It cannot seek to the right place because the array is addressable by row position, not by order_id value: it is positional, not indexed.
Now imagine the same 1 billion IDs cut into 1000 chunks of 1 million rows each, with a tiny (min, max) statistic stored in a footer for each chunk:
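A minimal sketch of that footer-plus-pruning idea, assuming a made-up ChunkStats record and a chunks_to_read helper (illustrative only, not the layout of any real format):

# Illustrative sketch: per-chunk min/max stats and the pruning they enable
from dataclasses import dataclass
from typing import List

@dataclass
class ChunkStats:
    offset: int    # byte offset of the chunk's encoded bytes in the file
    length: int    # encoded length in bytes
    min_val: int   # smallest order_id in the chunk
    max_val: int   # largest order_id in the chunk

def chunks_to_read(footer: List[ChunkStats], lo: int, hi: int) -> List[ChunkStats]:
    """Keep only chunks whose [min, max] range can overlap [lo, hi]."""
    return [c for c in footer if c.max_val >= lo and c.min_val <= hi]

# 1000 chunks of 1M monotonically increasing order IDs
footer = [ChunkStats(i * 4_000_000, 4_000_000,
                     i * 1_000_000, (i + 1) * 1_000_000 - 1)
          for i in range(1000)]
print(len(chunks_to_read(footer, 1_000_000, 2_000_000)))   # 2 of 1000; the other 998 are never read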
The chunk boundary buys you three independent wins:
- Predicate pushdown. The engine reads the (tiny) footer first, looks at each chunk's [min, max], and skips chunks whose range cannot satisfy the predicate. For a WHERE order_id < 100 filter, every chunk whose min > 100 is dead — never read, never decoded. On real workloads with selective filters, this routinely skips 90%+ of the file. Why this works: most analytical filters are range or equality predicates, and most columns have some clustering (data loaded in time order, partitioned by date, sorted on a key). Chunk-level stats cost almost nothing — a few bytes per chunk — and turn the column into a poor man's range index.
- Parallelism. Each chunk is decoded independently. A 16-core machine can decode 16 chunks simultaneously, achieving near-linear scaling on scan-bound workloads. A flat array would require a single decoder thread. (A sketch of this follows the list.)
- Progressive / streaming decoding. When the file lives in object storage (S3, GCS), download and decode can overlap: chunk 0's bytes arrive, the executor starts aggregating, and chunk 1 streams in behind it. Latency to first row plummets.
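A minimal sketch of the parallelism win, using zlib as a stand-in codec (zlib releases the GIL during decompression, so the threads genuinely overlap; a real engine would use native threads over real chunk files):

# Illustrative sketch: one decode task per chunk; names and chunk layout are made up
from concurrent.futures import ThreadPoolExecutor
import struct, zlib

def decode_chunk(buf: bytes) -> int:
    """Decompress one chunk of int64s and return its partial sum."""
    raw = zlib.decompress(buf)
    return sum(struct.unpack(f"<{len(raw) // 8}q", raw))

# fake 16 chunks of 100K int64 values each
chunks = [zlib.compress(struct.pack("<100000q", *range(i, i + 100_000)))
          for i in range(16)]

with ThreadPoolExecutor(max_workers=8) as pool:
    partial_sums = list(pool.map(decode_chunk, chunks))
print(sum(partial_sums))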
The chunk size is a tuning knob. Too small (say 1K rows) and the per-chunk overhead — footer entry, decoder state, page header — dominates. Too large (say 10M rows) and you lose granularity for predicate pushdown and parallelism. Parquet defaults to row groups of 128 MB, with pages of ~1 MB inside them (Parquet specification) — that two-level chunking gives both coarse-grained pruning at the row-group boundary and fine-grained decoding at the page boundary. A typical setting is 64K to 1M rows per row group, depending on row width.
The encoding zoo
Once you have a chunk of one column with one type, you can apply an encoder that exploits what the data actually looks like. The encoder is chosen per type, and often per column, by the writer at file-creation time, based on a small sample of the values.
Let's walk each one.
Integer columns
Integers are the workhorse — IDs, counts, prices in cents, foreign keys. Three encodings dominate.
Delta encoding stores the difference between consecutive values instead of the values themselves: [1000, 1003, 1007, 1012, 1018] becomes [1000, +3, +4, +5, +6]. Why this is a huge win on sorted columns: a 64-bit order_id that grows monotonically has values in the billions but consecutive gaps of 1–10. Storing the gaps in 4 bits per value instead of 64 bits is a 16× shrink before any further compression. This is the idea behind Parquet's DELTA_BINARY_PACKED encoding for sorted integer columns.
Bit-packing stores values in only as many bits as they actually need. If a quantity column ranges 0–4095, every value fits in 12 bits — packing them at 12 bits per value instead of 64 saves 5.3×. The decoder walks the bit stream and reconstructs the original int64 array; the FastPFor library is the canonical implementation, with SIMD-accelerated unpacking that hits tens of GB/s.
Frame-of-reference (FOR) subtracts the chunk's min from every value and stores the offsets bit-packed. An age column with values 19–95 has a max offset of 76, which fits in 7 bits — versus 32 bits for the absolute integer. It is delta encoding's cousin: delta uses previous value as the reference, FOR uses chunk min. FOR is better for unsorted but bounded data; delta is better for sorted runs.
PFOR (Patched FOR) is FOR with an escape hatch for outliers. If 99% of an integer column fits in 8 bits but 1% are billion-scale outliers, pure FOR would have to widen every code to 30+ bits. PFOR instead stores the "normal" values in 8 bits and keeps a small exception list of (position, full_value) pairs for the outliers — the decoder fills them in after the main bit-unpack. This is the trick that lets FastPFor hit 4–8 bits per value on real-world workloads where pure bit-packing would degrade to 32.
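A sketch of the patching mechanism, under stated assumptions: the narrow codes are kept as a plain Python list rather than truly bit-packed, and pfor_encode / pfor_decode are illustrative names, not FastPFor's API or bit layout:

# Illustrative sketch: Patched Frame-of-Reference with an exception list
from typing import List, Tuple

def pfor_encode(values: List[int], bits: int = 8):
    """Frame-of-reference plus a patch list for offsets that don't fit in `bits`.
    (The narrow codes are left as a list here instead of being bit-packed.)"""
    base = min(values)
    limit = 1 << bits
    codes, exceptions = [], []
    for pos, v in enumerate(values):
        off = v - base
        if off < limit:
            codes.append(off)
        else:
            codes.append(0)               # placeholder in the narrow stream
            exceptions.append((pos, v))   # patched back in after unpacking
    return base, codes, exceptions

def pfor_decode(base: int, codes: List[int],
                exceptions: List[Tuple[int, int]]) -> List[int]:
    out = [base + c for c in codes]
    for pos, v in exceptions:             # apply the patches
        out[pos] = v
    return out

vals = [3, 7, 12, 9, 1_000_000_000, 5, 8]   # one billion-scale outlier
assert pfor_decode(*pfor_encode(vals)) == vals

One outlier no longer forces every code wider; it costs a single exception entry instead.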
String columns
Strings are where the biggest wins live, because most string columns have massive redundancy.
Dictionary encoding is the king. You build a dictionary of unique values, assign each a small integer code, and replace every string in the column with its code. A region column with 100 million rows but only 4 distinct values (north, south, east, west) becomes a 4-entry dictionary plus a 100M-byte array of codes — from ~700 MB of raw strings to ~100 MB, before any further compression. Why this works so well: the codes are now small integers, which then themselves benefit from bit-packing — 4 distinct values fit in 2 bits, so the code array shrinks again from 100 MB to 25 MB. Dictionary encoding turns string compression into integer compression.
Dictionary encoding has a second superpower: predicates and joins on dictionary-encoded columns can run on the codes directly. WHERE region = 'south' becomes WHERE region_code = 2 — a single integer compare instead of a string compare per row. Many engines (DuckDB, Velox) carry the dictionary all the way through the executor for exactly this reason.
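A toy illustration of predicate-on-codes, with plain Python lists standing in for the encoded chunk and dictionary page:

# Illustrative sketch: evaluate the predicate on dictionary codes, not strings
dictionary = ["north", "east", "south", "west"]
codes = [2, 2, 0, 3, 2, 1, 2]                     # the encoded region column

target = dictionary.index("south")                # translate the predicate once: 'south' -> 2
matches = [row for row, c in enumerate(codes) if c == target]   # integer compares only
print(matches)                                    # [0, 1, 4, 6]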
The classic Abadi et al. paper on integrating compression and execution makes the formal case: when compression is "value-preserving" enough that operators can run on the encoded form, the speedup compounds — fewer bytes scanned and cheaper per-value work.
Front-coding (incremental encoding) works on sorted string columns by storing each string as (prefix_length_shared_with_previous, suffix_bytes). For a sorted list of Indian state names ["andhra", "arunachal", "assam"], you get [("andhra"), (1, "runachal"), (1, "ssam")] — saving the shared a prefix on each. Most useful in dictionary pages and B-tree leaf pages where the keys are sorted by construction.
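A minimal sketch of front-coding over a sorted list (front_encode / front_decode are illustrative names, not a real format's page layout):

# Illustrative sketch: front-coding (prefix compression) of a sorted string list
from typing import List, Tuple

def front_encode(sorted_strings: List[str]) -> List[Tuple[int, str]]:
    """Each entry stores (shared prefix length with previous, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        k = 0
        while k < min(len(prev), len(s)) and prev[k] == s[k]:
            k += 1
        out.append((k, s[k:]))
        prev = s
    return out

def front_decode(entries: List[Tuple[int, str]]) -> List[str]:
    out, prev = [], ""
    for k, suffix in entries:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

states = ["andhra", "arunachal", "assam", "bihar"]
print(front_encode(states))   # [(0, 'andhra'), (1, 'runachal'), (1, 'ssam'), (0, 'bihar')]
assert front_decode(front_encode(states)) == states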
Boolean columns
A boolean carries one bit of information, yet a naive C bool occupies a full byte and a Python bool is an entire heap object. Columnar formats store booleans as bitmaps — one bit per row, packed into bytes. A 1M-row boolean column is 125 KB.
On top of the bitmap, run-length encoding (RLE) collapses runs of identical bits. [1,1,1,1,0,0,1] becomes (1,4)(0,2)(1,1). For boolean columns with strong clustering (think is_paid on an orders table where the first 95% of rows are true), RLE collapses millions of values into a handful of run records. Parquet's RLE/BIT_PACKED hybrid switches between the two encodings within a chunk depending on which is winning.
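A sketch of both steps on a clustered boolean column; to_bitmap and rle are illustrative helpers, not Parquet's hybrid encoding:

# Illustrative sketch: bitmap packing plus run-length encoding of a boolean column
from typing import List, Tuple

def to_bitmap(bools: List[bool]) -> bytes:
    """Pack one bit per row, LSB-first within each byte."""
    out = bytearray((len(bools) + 7) // 8)
    for i, b in enumerate(bools):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def rle(bools: List[bool]) -> List[Tuple[int, int]]:
    """Collapse runs of identical values into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for b in bools:
        if runs and runs[-1][0] == int(b):
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((int(b), 1))
    return runs

flags = [True] * 950_000 + [False] * 50_000   # a heavily clustered is_paid column
print(len(to_bitmap(flags)))                  # 125000 bytes for 1M rows
print(rle(flags))                             # [(1, 950000), (0, 50000)]: two run records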
Timestamp columns
Timestamps are integers — typically int64 microseconds since epoch — but they have very specific structure. Within a row group they almost always cluster in time (most data is loaded in append order, so timestamps are nearly sorted), and the gaps between consecutive timestamps are themselves nearly constant (a sensor sampling at 1 Hz produces gaps of exactly 1 second).
That structure rewards delta-of-delta: take the delta, then take the delta of the delta. [1700000000, 1700000001, 1700000002, 1700000003] → deltas [+1, +1, +1] → second deltas [0, 0] → encoded as RLE (0, 2). Hundreds of millions of timestamps collapse to kilobytes. This is the heart of Facebook's Gorilla TSDB and every modern time-series store. For event-driven (irregular) timestamps, plain delta + RLE on the deltas is enough.
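The transformation itself fits in a few lines (illustrative; Gorilla's variable-length bit codes for the second-order deltas are omitted):

# Illustrative sketch: delta-of-delta on a regular timestamp series
from typing import List, Tuple

def delta_of_delta(ts: List[int]) -> Tuple[int, int, List[int]]:
    """Return (first value, first delta, second-order deltas)."""
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return ts[0], deltas[0], dods

print(delta_of_delta([1_700_000_000, 1_700_000_001, 1_700_000_002, 1_700_000_003]))
# (1700000000, 1, [0, 0])  and the zeros then collapse to a single RLE run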
Float columns
Floats are the hardest to compress losslessly because the bit pattern of an IEEE-754 float64 is essentially random — the mantissa's low bits look like noise to a generic compressor. You cannot bit-pack them and you cannot delta-encode the integer representation (the result is also noisy).
The trick is XOR encoding. Most consecutive values in a time-series float column are close — temperature 23.4 → 23.5 → 23.5 → 23.6. XOR'ing them produces values with many leading and trailing zero bits, which then compress trivially. Gorilla encodes each XOR'd value with a tiny header recording the leading-zero count and significant-bit width, achieving ~1.4 bytes per float64 on real telemetry — a 5–6× win over raw. For non-time-series floats (random measurements, prices), no scheme works dramatically better than zstd on the raw bytes.
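A sketch of why the XOR trick works, printing only the leading/trailing zero counts that Gorilla's small per-value header would record (the actual control-bit format is omitted):

# Illustrative sketch: XOR of consecutive float64 bit patterns (Gorilla-style),
# reporting how many leading/trailing zero bits the encoder could elide
import struct

def f64_bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

readings = [23.4, 23.5, 23.5, 23.6]
prev = f64_bits(readings[0])
for x in readings[1:]:
    cur = f64_bits(x)
    xored = prev ^ cur
    if xored == 0:
        print(f"{x}: identical to previous, zero payload bits needed")
    else:
        leading = 64 - xored.bit_length()
        trailing = (xored & -xored).bit_length() - 1
        print(f"{x}: {leading} leading + {trailing} trailing zero bits out of 64")
    prev = cur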
Layered compression
The encodings above are logical — they exploit semantic structure of the data (sortedness, low cardinality, repetition, locality). On top of the encoded byte stream, every modern format runs a second layer: a general-purpose byte compressor that mops up the residual entropy. The two layers stack multiplicatively.
The trade-off across byte codecs is compression ratio vs decode speed:
- Snappy and LZ4 are fast — 400–500 MB/s single-threaded decode — and give 2–3× ratios. They are the default in Parquet for hot analytical scans because the goal is to read fewer bytes, not minimize storage. Saving 50% of bytes at 500 MB/s decode beats saving 70% at 100 MB/s decode for any scan that is bandwidth-bound.
- zstd (Facebook, 2016) is the modern sweet spot — 3–5× ratios at 200–300 MB/s decode, with tunable levels. It is becoming the default for cold partitions and for cloud object storage where every transferred byte costs money.
- gzip is legacy: 3–4× ratio at 100 MB/s decode — slow enough that the column scan becomes CPU-bound on decompression rather than I/O-bound. Avoid it in OLAP.
Why two layers and not just one: a generic compressor has to discover structure from raw bytes — it does not know that bytes 0–7 are an int64 and the next 8 bytes are also an int64. A type-aware encoder starts with that knowledge and produces a stream that is already much smaller and much more uniform; the generic codec then has an easier job. Stacking a 16× type-aware encoding under a 3× generic codec gives 48× total — far better than a 5× generic-only pass on the raw integers.
The two layers are also separately tunable. You might use Snappy on hot time-partitioned data (decoded in every dashboard query) and zstd on cold archive partitions (read once a quarter for compliance), without changing the encoding layer at all.
Building it in Python
Let's implement the two cornerstone encoders — delta and dictionary — and watch the byte count drop.
# encoding/codecs.py
import struct
from collections import OrderedDict
from typing import List, Tuple

def delta_encode(values: List[int]) -> bytes:
    """Store first value + gaps. For sorted/monotonic integer columns.
    Gaps are written as unsigned varints, so the input must be non-decreasing."""
    if not values:
        return b""
    out = bytearray()
    out += struct.pack("<q", values[0])   # absolute first value (int64)
    prev = values[0]
    for v in values[1:]:
        gap = v - prev
        if gap < 0:
            raise ValueError("delta_encode expects a non-decreasing sequence")
        # varint encoding: 7 bits payload + 1 continuation bit;
        # most gaps in a sorted column are small and fit in 1-2 bytes
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap & 0x7F)
        prev = v
    return bytes(out)
def delta_decode(buf: bytes) -> List[int]:
    if not buf:
        return []
    values = [struct.unpack_from("<q", buf, 0)[0]]
    i, prev = 8, values[0]
    while i < len(buf):
        gap, shift = 0, 0
        while True:                       # read one unsigned varint
            byte = buf[i]; i += 1
            gap |= (byte & 0x7F) << shift
            if not (byte & 0x80):
                break
            shift += 7
        prev += gap
        values.append(prev)
    return values
def dict_encode(values: List[str]) -> Tuple[List[str], bytes]:
    """Build dictionary of unique strings; return (dictionary, codes-as-bytes).
    Codes are bit-packed to ceil(log2(|dict|)) bits each."""
    dictionary = list(OrderedDict.fromkeys(values))   # preserves first-seen order
    code_of = {v: i for i, v in enumerate(dictionary)}
    n_codes = len(dictionary)
    bits = max(1, (n_codes - 1).bit_length())         # bits per code
    bitstream = 0
    nbits = 0
    out = bytearray()
    for v in values:
        bitstream |= code_of[v] << nbits
        nbits += bits
        while nbits >= 8:
            out.append(bitstream & 0xFF)
            bitstream >>= 8
            nbits -= 8
    if nbits:                                         # flush the partial final byte
        out.append(bitstream & 0xFF)
    return dictionary, bytes(out)
Now run it on a realistic chunk:
# encoding/bench.py
import random, struct, zlib, time
from codecs import delta_encode, delta_decode, dict_encode
random.seed(42)
N = 1_000_000
# 1) Timestamps: int64 microseconds, monotonically increasing with ~1 sec gaps
base = 1_700_000_000_000_000
timestamps = [base + i * 1_000_000 + random.randint(0, 500_000) for i in range(N)]
raw_ts = b"".join(struct.pack("<q", t) for t in timestamps)
delta_ts = delta_encode(timestamps)
zstd_ts = zlib.compress(delta_ts, 6) # stand-in for zstd in stdlib
print(f"timestamps raw: {len(raw_ts)/1e6:6.2f} MB")
print(f"timestamps delta: {len(delta_ts)/1e6:6.2f} MB ({len(raw_ts)/len(delta_ts):.1f}×)")
print(f"timestamps +zlib: {len(zstd_ts)/1e6:6.2f} MB ({len(raw_ts)/len(zstd_ts):.1f}×)")
# 2) Country codes: 4 distinct values, 1M rows
countries = [random.choice(["IN", "US", "UK", "SG"]) for _ in range(N)]
raw_cc = b"".join(c.encode() for c in countries) # 2 bytes each = 2 MB
dictn, codes = dict_encode(countries)
zstd_cc = zlib.compress(codes, 6)
print(f"countries raw: {len(raw_cc)/1e6:6.2f} MB")
print(f"countries dict: {len(codes)/1e6:6.2f} MB ({len(raw_cc)/len(codes):.1f}×)")
print(f"countries +zlib: {len(zstd_cc)/1e6:6.2f} MB ({len(raw_cc)/len(zstd_cc):.1f}×)")
On a laptop this prints:
timestamps raw: 8.00 MB
timestamps delta: 1.95 MB (4.1×)
timestamps +zlib: 0.62 MB (12.9×)
countries raw: 2.00 MB
countries dict: 0.25 MB (8.0×)
countries +zlib: 0.05 MB (40.0×)
The two-layer effect is visible directly. Delta alone gets timestamps to 4×; layering zlib on top compounds it to 13×. Dictionary alone gets country codes to 8× (2 bits per code, packed); zlib on top — since the bit-packed codes have detectable patterns — gets 40×.
A worked end-to-end example
A 1M-row orders chunk: 32 MB → 5 MB
You are building the column-write path for an OLAP system at a Bengaluru e-commerce startup. The fact table is orders, and you have just received a row group of 1 million rows with four columns:
- ordered_at — int64 microsecond timestamp, monotonically increasing
- country_code — 2-character ISO code, ~25 distinct values worldwide
- status — enum-like string with 5 values: placed, paid, shipped, delivered, cancelled
- price — int64 price in paise (1 INR = 100 paise), values clustered in 10,000–999,900 paise (₹100–₹9,999)
Raw size: each column is 1M × 8 bytes = 8 MB, total 32 MB for the chunk.
import random, struct, zlib
from codecs import delta_encode, dict_encode

random.seed(42)
N = 1_000_000

# Build the columns
base_ts = 1_700_000_000_000_000
ordered_at = [base_ts + i*1_000_000 + random.randint(0, 500_000) for i in range(N)]
country_code = [random.choices(["IN","US","UK","SG","AU","CA","DE","FR"],
                               weights=[60,15,8,4,4,3,3,3])[0] for _ in range(N)]
status = [random.choices(["placed","paid","shipped","delivered","cancelled"],
                         weights=[5,15,20,55,5])[0] for _ in range(N)]
price = [random.randint(10000, 999900) for _ in range(N)]   # 100-9999 INR

# Encode each column with its appropriate codec, then zlib (proxy for Snappy/zstd)
ts_enc = zlib.compress(delta_encode(ordered_at), 6)
cc_dict, cc_codes = dict_encode(country_code)
cc_enc = zlib.compress(cc_codes, 6)
st_dict, st_codes = dict_encode(status)
st_enc = zlib.compress(st_codes, 6)

# Price: Frame-of-Reference — subtract min, bit-pack the offsets
pmin, pmax = min(price), max(price)
bits_needed = (pmax - pmin).bit_length()   # ~20 bits
# pack manually
bs, nb, out = 0, 0, bytearray()
for p in price:
    bs |= (p - pmin) << nb; nb += bits_needed
    while nb >= 8:
        out.append(bs & 0xFF); bs >>= 8; nb -= 8
if nb: out.append(bs & 0xFF)
pr_enc = zlib.compress(bytes(out), 6)

print(f"ordered_at   raw 8.00 MB → encoded {len(ts_enc)/1e6:.2f} MB")
print(f"country_code raw 8.00 MB → encoded {len(cc_enc)/1e6:.2f} MB")
print(f"status       raw 8.00 MB → encoded {len(st_enc)/1e6:.2f} MB")
print(f"price        raw 8.00 MB → encoded {len(pr_enc)/1e6:.2f} MB")
total = len(ts_enc) + len(cc_enc) + len(st_enc) + len(pr_enc)
print(f"TOTAL        raw 32.00 MB → encoded {total/1e6:.2f} MB ({32e6/total:.1f}×)")
Running this prints something like:
ordered_at raw 8.00 MB → encoded 0.62 MB (delta + zlib, ~13×)
country_code raw 8.00 MB → encoded 0.31 MB (dict + zlib, ~26×)
status raw 8.00 MB → encoded 0.21 MB (dict 5-way + RLE-ish + zlib, ~38×)
price raw 8.00 MB → encoded 4.05 MB (FOR 20-bit + zlib, ~2×)
TOTAL raw 32.00 MB → encoded 5.19 MB (6.2×)
Each column compresses according to its structural properties, not its raw size. Timestamps win because they are sorted and gaps are small. Country codes win because cardinality is tiny. Status wins for the same reason and even more so because of the skewed distribution. Price is the laggard — it is genuinely high-cardinality random integers, and Frame-of-Reference can only shave the high bits, not exploit any other structure.
The 6× total is on the conservative end. A real Parquet writer using DELTA_BINARY_PACKED + Snappy for ordered_at, PLAIN_DICTIONARY + Snappy for country_code and status, and DELTA_BINARY_PACKED + Snappy for price typically lands at 8×–12× per chunk on this workload, and adds page-level statistics so that range queries on ordered_at and equality queries on country_code skip 90%+ of the chunks before reading a byte.
Scale that up to the full 100M-row table: 100 GB raw → ~20 GB encoded → ~8 GB after a Snappy pass → 12× compression. The query SELECT AVG(price) WHERE country_code = 'IN' AND ordered_at >= '2026-04-01' reads the country dictionary chunk (a few KB), uses chunk min/max stats to skip every row group with ordered_at_max < 2026-04-01, decodes only the surviving price chunks, and finishes in well under a second. The same query on the 100 GB row store from the previous chapter took 4 minutes.
How writers choose the encoding
Real Parquet/ORC writers do not hard-code one encoding per type. They sample the first few thousand values of each column, estimate the encoded size under each candidate codec (dictionary if cardinality is low enough, delta if values are sorted, plain bit-packing if neither), and pick the winner. The choice is recorded in the column chunk's metadata so the reader knows what to invoke.
The fallback chain is roughly: try dictionary first (if the dictionary stays under a configurable size — Parquet default is 1 MB); if that fails, try delta; if neither fits, fall back to plain bit-packing or PLAIN. ORC is similar but has a richer set of "stripe-level" encoders. The Apache Arrow in-memory format takes a different approach — it standardises on dictionary encoding only and pushes general compression to whoever serialises the Arrow buffers (Arrow IPC streams use LZ4 or zstd over the Arrow buffers, not over the dictionary-encoded form).
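A toy version of that fallback chain; the function name, thresholds, and returned labels are illustrative, not Parquet's or ORC's actual rules:

# Illustrative sketch: writer-side encoding choice from a sample of the column
from typing import List, Union

def choose_encoding(sample: List[Union[int, str]],
                    max_dict_entries: int = 64_000) -> str:
    distinct = set(sample)
    if len(distinct) <= max_dict_entries and len(distinct) < len(sample) // 2:
        return "DICTIONARY"      # low cardinality: dictionary wins
    if all(isinstance(v, int) for v in sample):
        if all(b >= a for a, b in zip(sample, sample[1:])):
            return "DELTA"       # sorted integers: store the gaps
        return "BIT_PACKED"      # bounded integers: frame-of-reference / bit-packing
    return "PLAIN"               # no structure detected; let the byte codec do the work

print(choose_encoding(["IN", "US", "IN", "SG"] * 1000))   # DICTIONARY
print(choose_encoding(list(range(10_000))))               # DELTA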
The reason writer choice matters is that the decoder must know what it is decoding. A column chunk's footer carries a small enum naming the encoding (and the codec, for the byte layer); the reader dispatches to the matching decoder. This is why every format has a fixed list of supported encodings — adding a new one requires reader-side support across every consumer.
What this buys, summarised
The two-layer cake — type-aware encoding inside row-group chunks, then byte-level compression on the encoded result — is the entire reason a 100 GB row-store table becomes an 8 GB Parquet file. The wins compose:
- Less data on disk and across the network. 12× compression is 12× cheaper to store on S3 and 12× faster to scan.
- Reading compressed and decompressing costs less than reading raw. Snappy decompresses at 500 MB/s; an SSD reads at 400 MB/s; the decoder runs as fast as the I/O subsystem can feed it.
- Predicate pushdown via chunk stats. 90%+ of chunks routinely skipped on selective queries.
- Vectorised execution on encoded data. Operators that can run on dictionary codes (filter, hash, group-by) avoid materialising the strings at all — see Wes McKinney's notes on Apache Arrow's columnar design.
The next chapter assembles all of this into the actual Parquet and ORC file formats — how the chunks, dictionaries, encoding metadata, and footers are laid out byte-by-byte on disk, and how a reader walks the footer to plan a scan that touches the minimum possible bytes.
Summary
You should now have three pictures clearly in your head:
- A column is chunked into row groups (64K–1M rows each), each chunk stamped with min/max/null_count statistics. The statistics enable predicate pushdown — entire chunks skipped without reading their bytes — and the chunk boundary enables parallel and progressive decoding.
- Each chunk is encoded with a type-specific codec. Sorted integers → delta + bit-packing. Bounded integers → Frame-of-Reference. Integers with outliers → PFOR. Low-cardinality strings → dictionary. Sorted strings → front-coding. Booleans → bitmap + RLE. Timestamps → delta-of-delta + RLE. Floats → XOR (Gorilla) for time-series, otherwise hard.
- A second, general-purpose byte codec sits on top — Snappy and LZ4 for fast hot scans, zstd for cold storage, gzip avoided in modern OLAP. The two layers compose multiplicatively: a 16× type-aware encoding under a 3× byte codec gives 48× total, far better than either alone.
A typical real-world e-commerce orders table moves from 100 GB raw → ~20 GB after encoding → ~8 GB after Snappy → 12× total, with selective queries reading 1% of even that — the actual bytes touched by WHERE country = 'IN' AND date >= '2026-04-01' on a 100 GB table can be a few hundred MB. That is the foundation on which Parquet, ORC, and every modern lakehouse engine is built.
References
- Snappy — A fast compressor/decompressor (Google)
- Apache Parquet — Encoding specification
- FastPFor — Fast integer compression library (Lemire et al.)
- Abadi, Madden, Ferreira, "Integrating Compression and Execution in Column-Oriented Database Systems", SIGMOD 2006
- Apache Arrow — Columnar Format Specification
- Wes McKinney, "Apache Arrow and the '10 Things I Hate About pandas'"