Z-ordering and data skipping

A Zerodha analyst at 11:45 a.m. on a market-open Monday runs SELECT AVG(price) FROM trades WHERE symbol = 'RELIANCE' AND ts > now() - interval '5 min'. The trades table has 2.4 trillion rows across 187,000 Parquet files. The query plan claims it will scan 41 GB; the actual answer is in 280 KB worth of rows. The whole job of the metadata layer — manifest files with min-max stats per column per file — is to make sure the engine reads the 280 KB, not the 41 GB. Whether that works or not depends on a single decision the writer made hours earlier: how it ordered rows on disk. Random order kills pruning. Sort order helps one column. Z-ordering helps several columns at once, with bounded loss.

Data skipping prunes files before reading them, using min-max statistics that the writer recorded per column per file in the manifest. The pruning only works when files have narrow ranges on the column you filter by — which requires the writer to have clustered rows by that column. Sorting handles one column perfectly; for queries that filter on two or more columns, Z-ordering interleaves the bits of those columns so each file ends up with narrow ranges on all of them. The cost is a re-clustering compaction; the payoff is 10–100× scan reduction on multi-dimensional filter queries.

Why min-max stats prune nothing on random data

Open one Parquet file written by a streaming job with no clustering. Its manifest entry shows the min and max of every column across all rows in the file: min_ts = 09:30:00.001, max_ts = 15:29:59.998, min_symbol = "AABB", max_symbol = "ZYDU". A query filter like ts BETWEEN 09:35 AND 09:40 cannot prune this file — the file's [min_ts, max_ts] range overlaps the query's range. So the engine reads the file and scans every row group inside.

Now imagine the writer had pre-sorted rows globally by ts before splitting them into files. File 0 covers 09:30:00–09:45:00, file 1 covers 09:45:00–10:00:00, etc. The same query now touches one file. That is data skipping working: the manifest's min-max metadata answers "could this file possibly contain a matching row?" with no for 99.5% of files, and the engine reads only the rest.

[Figure: sorted data prunes; random data does not. Two horizontal strips of seven files each. Random order: every file's stats span the whole day — each of files 0–6 covers [09:30, 15:30] — so the query ts BETWEEN 09:35 AND 09:40 must scan all seven files. Sorted by ts: files cover narrow, non-overlapping ranges ([09:30, 09:45], [09:45, 10:00], [10:00, 10:15], …) and the query overlaps file 0 only, so the engine reads 1 of 7 files — a 7× scan reduction in this toy example, 100× in production with thousands of files.]
The same files, the same data, the same query. The only difference is the order in which the writer emitted rows before splitting them into files. Random order makes every file's min-max range maximally wide; sorted order makes them narrow and non-overlapping. Pruning works only on the second.

The mechanism is unglamorous: the engine reads the manifest, looks at each file's [min_ts, max_ts], checks whether the predicate range intersects, and skips the file entirely if not. Iceberg, Delta, and Hudi all do this. The footer of every Parquet file also has per-row-group min-max stats, so a second-level prune happens inside the file at row-group granularity. Both levels need narrow ranges to work. Random row order destroys both.
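The check itself is small enough to sketch. A minimal model of manifest-level pruning — the manifest entries and file names below are invented for illustration, and zero-padded HH:MM strings stand in for real timestamps:

```python
# Minimal model of manifest-level min-max pruning (file stats are invented).
manifest = [
    {"path": "f0.parquet", "min_ts": "09:30", "max_ts": "09:45"},
    {"path": "f1.parquet", "min_ts": "09:45", "max_ts": "10:00"},
    {"path": "f2.parquet", "min_ts": "10:00", "max_ts": "15:30"},
]

def may_contain(f, lo, hi):
    # A file is skipped only when its range is disjoint from the predicate's.
    return not (f["max_ts"] < lo or f["min_ts"] > hi)

def prune(manifest, lo, hi):
    return [f["path"] for f in manifest if may_contain(f, lo, hi)]

print(prune(manifest, "09:35", "09:40"))   # only f0 survives the prune
```

Note the asymmetry: `may_contain` can return True for a file that holds no matching rows (the engine finds that out when it reads the file), but it never returns False for a file that does — pruning is conservative by construction.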

This is why "we have 1 PB in our lakehouse and queries still take 12 minutes" is almost always a clustering problem and almost never a compute problem. The engine is reading every byte because the metadata cannot tell it which bytes to skip.

What goes wrong with sort, and what Z-order fixes

Sorting by one column gives perfect pruning on that column and destroys pruning on every other column. Sort trades by ts. Now WHERE ts > '11:00' prunes brilliantly. But WHERE symbol = 'RELIANCE' prunes nothing, because every file (covering one minute of trades) contains all symbols that traded in that minute — including RELIANCE.

You can sort by a compound key — ORDER BY symbol, ts — and get good pruning on symbol, decent pruning on ts within a symbol group. But this only helps the leading column of the sort and degrades fast. A query like WHERE ts > '11:00' AND venue = 'NSE' gets nothing useful from ORDER BY symbol, ts.

Z-ordering is the answer. Take the bits of column A and column B and interleave them. The first bit of the Z-value comes from A's high bit, the second bit from B's high bit, the third from A's next bit, and so on. Sort by the resulting Z-value. The result: nearby Z-values correspond to rows that are nearby in both A and B. Files written in Z-value order have narrow ranges on both columns simultaneously — not as narrow as a single-column sort would give for the dominant column, but narrow on all the Z-ordered columns at once.

The geometric intuition is space-filling curves. A 2D plane visited in Z-order zigzags through quadrants, then recursively through sub-quadrants within each. Every contiguous chunk of the curve corresponds to a small bounding box in the original space (up to some slop at quadrant boundaries), so chunking the curve into files yields files whose 2D ranges are small bounding boxes — small on both dimensions.

[Figure: Z-order curve over a 2D space. A 16×16 grid — column A (e.g. timestamp) against column B (e.g. symbol_id) — threaded by the recursive Z-shaped curve, which is split into four files emitted along the curve. Each file covers one quadrant: file 0 has A ∈ [0,7], B ∈ [0,7]; file 1 has A ∈ [8,15], B ∈ [0,7]; file 2 has A ∈ [0,7], B ∈ [8,15]; file 3 has A ∈ [8,15], B ∈ [8,15]. A query on A ∈ [2,4], B ∈ [10,13] overlaps only file 2, pruning 3 of 4 files.]
The Z-order curve threads the 2D space recursively: each contiguous segment of the curve sits inside a square bounding box. Splitting the curve into four equal-length segments gives four files, each covering a quadrant. A query that filters on both A and B intersects at most one quadrant in this toy example. In production, with thousands of files and 3–4 Z-order columns, queries often prune 95% of files even when filtering on a non-leading column.
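The quadrant claim in the figure can be checked directly: walk every cell of a 16×16 grid in Z-value order, cut the visit sequence into four equal "files", and inspect each file's bounding box. A self-contained sketch (pure illustration, no lakehouse API involved):

```python
def interleave2(a: int, b: int, bits: int = 4) -> int:
    # Same bit interleaving as the main example; 4 bits covers 0..15.
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

# Visit every (a, b) cell of a 16x16 grid in Z-value order.
cells = sorted(((a, b) for a in range(16) for b in range(16)),
               key=lambda c: interleave2(c[0], c[1]))

# Cut the curve into 4 files of 64 cells; print each file's bounding box.
for fid in range(4):
    chunk = cells[fid * 64:(fid + 1) * 64]
    a_rng = (min(c[0] for c in chunk), max(c[0] for c in chunk))
    b_rng = (min(c[1] for c in chunk), max(c[1] for c in chunk))
    print(f"file {fid}: A in {a_rng}, B in {b_rng}")
```

Each printed file lands exactly on one quadrant — A and B both confined to either [0,7] or [8,15] — which is the figure's claim in executable form.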

The price: Z-order is not as good on a single column as a pure sort would be. If 99% of your queries filter on ts only, sort by ts. Z-order earns its keep when you have a small set of columns (say 2–4) that show up together in many queries.

Computing a Z-value, in code

Z-order interleaving is a five-line trick. Here's the algorithm and a worked execution.

# zorder.py — interleave bits to produce a Z-value, then cluster files by it.
# Demonstrates: bit interleaving, narrow per-file ranges, query pruning.

import zlib

def interleave2(a: int, b: int, bits: int = 16) -> int:
    """Interleave bits of a and b into a Z-value. a contributes even-position bits."""
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

def zvalue(row: dict) -> int:
    # Map each Z-order column to a non-negative int. Real systems hash strings
    # with a stable function; Python's built-in hash() is salted per process
    # and would cluster differently on every run.
    a = row["minute_of_day"]                        # 0..959  (16 hours * 60)
    b = zlib.crc32(row["symbol"].encode()) % 1024   # 10 bits, deterministic
    return interleave2(a, b)

# 12 sample trades from a Zerodha-like tick stream
rows = [
    {"minute_of_day":   5, "symbol": "RELIANCE", "price": 2810.50},
    {"minute_of_day": 305, "symbol": "TCS",      "price": 4120.75},
    {"minute_of_day":   8, "symbol": "INFY",     "price": 1640.20},
    {"minute_of_day": 308, "symbol": "RELIANCE", "price": 2812.05},
    {"minute_of_day":  12, "symbol": "HDFCBANK", "price": 1502.10},
    {"minute_of_day": 312, "symbol": "TCS",      "price": 4119.90},
    {"minute_of_day":  15, "symbol": "RELIANCE", "price": 2811.20},
    {"minute_of_day": 315, "symbol": "INFY",     "price": 1641.80},
    {"minute_of_day":  20, "symbol": "HDFCBANK", "price": 1502.55},
    {"minute_of_day": 320, "symbol": "TCS",      "price": 4118.30},
    {"minute_of_day":  25, "symbol": "INFY",     "price": 1642.00},
    {"minute_of_day": 325, "symbol": "RELIANCE", "price": 2813.10},
]
sorted_rows = sorted(rows, key=zvalue)

# Split into 3 files of 4 rows each (the way a writer would).
for fid in range(3):
    chunk = sorted_rows[fid*4:(fid+1)*4]
    mins = (min(r["minute_of_day"] for r in chunk), min(r["symbol"] for r in chunk))
    maxs = (max(r["minute_of_day"] for r in chunk), max(r["symbol"] for r in chunk))
    print(f"file {fid}: minute=[{mins[0]:>3},{maxs[0]:>3}]  symbol=[{mins[1]:>9},{maxs[1]:>9}]")

# Representative output (the exact file boundaries depend on the symbol hash):
# file 0: minute=[  5, 20]  symbol=[ HDFCBANK, RELIANCE]
# file 1: minute=[ 12,305]  symbol=[ HDFCBANK,      TCS]
# file 2: minute=[ 15,325]  symbol=[     INFY, RELIANCE]
#
# Compare the same data sorted purely by minute_of_day:
# file 0: minute=[  5, 15]  symbol=[ HDFCBANK, RELIANCE]  <- narrow on minute, wide on symbol
# file 1: minute=[ 20,308]  symbol=[ HDFCBANK,      TCS]  <- wide on both (straddles the morning/afternoon boundary)
# file 2: minute=[312,325]  symbol=[     INFY,      TCS]  <- narrow on minute, wide on symbol

Walk the load-bearing lines. interleave2 places bit i of a at even position 2i and bit i of b at odd position 2i + 1, so the high-order bits of both columns land in the high-order bits of the Z-value — which is what makes nearby Z-values nearby in both columns. zvalue first maps each clustering column to a non-negative integer: the timestamp is already a small int (minute_of_day), and the symbol string is hashed into a 10-bit bucket. sorted(rows, key=zvalue) is the re-clustering step, and the chunk loop is the writer splitting the sorted stream into files; the per-file min/max it prints is exactly the metadata that lands in the manifest.

The runtime cost of the rewrite itself is non-trivial: it is a full read-write of the affected partitions. Why teams still pay the cost: a one-time 30-minute compaction at 11 p.m. that turns every subsequent business-hours dashboard query from 12 minutes into 8 seconds is a 90× operational improvement at a 0.05% capacity cost. The compaction runs at a time when the cluster is idle anyway. The savings compound across every analyst, every BI dashboard, every API caller for the entire next day.

Picking columns to Z-order on, in production

Z-ordering on the wrong columns is worse than not Z-ordering at all — you pay the rewrite cost and get no pruning. Three rules production teams use:

1. Z-order columns must show up in WHERE clauses, not just SELECT lists. Run EXPLAIN or pull the predicate frequencies from your query log for the past week. If column customer_segment shows up in 73% of WHERE clauses for the table, it's a Z-order candidate. If email_address shows up only in SELECT lists, it isn't — pruning can't help columns the engine doesn't filter on.

2. Cardinality matters. Aim for medium cardinality. A column with 2 distinct values (is_premium) is best handled by partitioning, not Z-ordering — there's no locality to preserve, the column is binary. A column with 2 billion distinct values (request_id, transaction_id) has so much entropy that Z-ordering doesn't really cluster it — every file ends up touching the full ID range. Sweet spot: 100s to 100k distinct values. symbol_id (a few thousand traded NSE symbols), merchant_id (a few lakh active merchants on Razorpay), region_id, event_type — all good Z-order candidates.

3. Two to four columns, no more. The "Z" in Z-order interleaves bits across all the input columns. With two columns, each one contributes 50% of the bit positions. With ten columns, each contributes 10% — meaning the per-file range on any given column gets very wide because few high-order bits of any one column made it into the Z-value's high-order bits. Industry convention: pick the two columns that show up together in the most queries. Add a third only if a clearly-distinct query pattern motivates it. A fourth almost never earns its keep.
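Rules 1 and 2 can be mechanised against a query log. A hedged sketch — the column stats, frequencies, and thresholds below are all invented for illustration; real teams pull these numbers from their engine's query-history tables:

```python
# Hypothetical predicate-frequency + cardinality screen for Z-order candidates.
# Every number here is invented for illustration.
stats = {
    # column: (fraction of WHERE clauses it appears in, distinct values)
    "category_id":  (0.64, 4_500),
    "warehouse_id": (0.64, 1_200),
    "event_date":   (0.86, 730),
    "seller_id":    (0.22, 900_000),
    "is_premium":   (0.40, 2),
    "request_id":   (0.05, 2_000_000_000),
}

def zorder_candidates(stats, min_freq=0.2, lo_card=100, hi_card=100_000):
    # Rule 1: must be filtered on often. Rule 2: medium cardinality only.
    picks = [(freq, col) for col, (freq, card) in stats.items()
             if freq >= min_freq and lo_card <= card <= hi_card]
    # Rule 3: keep at most the top few, most-filtered first.
    return [col for freq, col in sorted(picks, reverse=True)][:4]

print(zorder_candidates(stats))
```

With these numbers the screen keeps event_date, category_id, and warehouse_id; seller_id fails the cardinality cut and is_premium is binary (better served by partitioning), matching rule 2.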

A Flipkart catalogue team in 2025 ran this exercise on their product_events table. The query log showed predicates on category_id, warehouse_id, and event_date in 64% of queries; on seller_id and event_date in 22%; everything else in 14%. They Z-ordered on (category_id, warehouse_id, event_date). The 22% seller queries got slightly worse pruning than before (seller_id had been the leading sort key and no longer was). The 64% main pattern improved from 4-minute scans to 6-second scans. Net: median dashboard latency dropped from 2.8 s to 0.9 s, and the seller analytics team got a separate Z-ordered copy of the same partitions to serve their pattern. The lesson is that one Z-order layout cannot serve every workload — you optimise for the dominant pattern and either accept worse performance on the rest or maintain a second clustering.

When Z-order doesn't help

Z-ordering is not free, and three patterns make it actively wrong.

1. Highly-skewed data. If 90% of trades are on RELIANCE and TCS, Z-ordering by symbol gives you huge files for those two symbols and tiny files for everything else. The huge files have wide ranges anyway because they each contain millions of rows of the same symbol. The fix is partitioning by symbol (separate folders for each symbol) plus Z-ordering by (minute_of_day, venue) within each partition. Iceberg supports both layered together; this is the standard production layout for tick data.

2. High write-rate streaming tables. Z-ordering requires reading and rewriting all the data in a partition. If the partition gets 50 GB of new data every hour, the rewrite cost is dominant and you'll never catch up. The fix: Z-order only the partitions older than 24 hours (the "warm" tier), and accept that the most-recent partition (the "hot" tier) is unsorted but small enough that a full scan is fast. This is a nearly-universal lakehouse pattern — hot stays unclustered, warm gets Z-ordered nightly, cold gets Z-ordered weekly.

3. Queries that scan the whole table anyway. A SELECT COUNT(*) or a GROUP BY symbol, COUNT(*) reads every file. Z-ordering doesn't help; the engine isn't pruning anything. Z-order optimises for selective queries — predicates that match a small fraction of rows. If your workload is dashboards that paginate over everything, Z-order is the wrong tool; you want pre-aggregated materialised views instead (chapter 14 of the curriculum).
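The hot/warm/cold split from point 2 reduces to a predicate on partition age. A sketch of the selection logic — the date-partition model and the 1-day/7-day thresholds are illustrative, not any engine's API:

```python
from datetime import date, timedelta

def partitions_to_zorder(partitions, today, warm_after=1, cold_after=7):
    """Pick which date partitions each tier's compaction job should rewrite.

    partitions: list of date objects, one per partition.
    Hot (< warm_after days old): never rewritten — still receiving writes.
    Warm (warm_after..cold_after days): rewritten by the nightly job.
    Cold (>= cold_after days): rewritten by the weekly job.
    """
    warm = [p for p in partitions
            if warm_after <= (today - p).days < cold_after]
    cold = [p for p in partitions if (today - p).days >= cold_after]
    return warm, cold

today = date(2026, 4, 25)
parts = [today - timedelta(days=d) for d in range(10)]   # 10 daily partitions
warm, cold = partitions_to_zorder(parts, today)
print(len(warm), len(cold))   # today's hot partition is in neither list
```

The point of the sketch is the exclusion: today's partition never appears in either output list, which is exactly the `WHERE event_date < current_date` guard in the Cred story below.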

A Cred analytics team learned the second lesson the hard way in early 2026: they enabled OPTIMIZE ... ZORDER BY (user_id, event_type) as a daily job on their app_events table. The job took 4 hours every night, more than half of which was rewriting the current day's partition (which was still being written to). Their nightly ingestion latency went up by 90 minutes. The fix: change the OPTIMIZE clause to WHERE event_date < current_date so it skipped the live partition. Daily job dropped to 25 minutes, ingestion latency recovered, and the cluster bill dropped by ₹84,000/month.

Common confusions

Going deeper

How the engine actually evaluates the predicate against manifest stats

A query like WHERE region = 'south' AND ts > '2026-04-25' is broken into per-column predicates. For each candidate file, the planner checks each predicate against the file's [min, max] for that column. The file is definitely-not-matching if any predicate's range is disjoint from the file's range — that's the prune. The file is maybe-matching if all predicates' ranges overlap — the planner queues the file for read. The planner does not try to prove the file definitely matches; it only tries to prove it definitely doesn't. This is the Iceberg Expressions.bind(...) and Delta DataSkippingReader.constructDataFilters(...) code path. Z-order's whole job is to make the "definitely-not-matching" answer correct for as many files as possible. The deeper trick is bloom filters and column-level zone maps for high-cardinality equality predicates (WHERE merchant_id = 'M-93021') — Z-order's min-max can only narrow the range; a bloom filter can prove the exact value is absent. Iceberg supports per-column bloom filters via write.parquet.bloom-filter-enabled.column.merchant_id = true; combine bloom filters with Z-order for the strongest pruning on equality-heavy workloads.
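The three-valued logic — "definitely not" versus "maybe" — fits in a few lines. A sketch over invented file stats (the real code paths are the Iceberg and Delta classes named above; this is just the planner's decision, with equality predicates modelled as degenerate ranges):

```python
def file_may_match(file_stats, predicates):
    """Return False only when we can PROVE the file has no matching row.

    file_stats: {column: (min, max)} from the manifest.
    predicates: {column: (lo, hi)}; an equality test is the range (v, v).
    """
    for col, (lo, hi) in predicates.items():
        fmin, fmax = file_stats[col]
        if fmax < lo or fmin > hi:   # disjoint on any one column => prune
            return False
    return True                       # every column overlaps => maybe-matching

stats = {"region": ("north", "south"), "ts": ("2026-04-20", "2026-04-24")}
query = {"region": ("south", "south"), "ts": ("2026-04-25", "9999-12-31")}
print(file_may_match(stats, query))   # False: ts range ends before the predicate starts
```

One disjoint column is enough to prune, no matter how well the other columns overlap — which is why clustering even one frequently-filtered column into narrow ranges pays off immediately.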

The Morton encoding magic-number trick

The naive Z-value computation iterates once per bit, which is fine pedagogically but slow on hot loops. Production implementations use a parallel bit-spread: x = (x | (x << 8)) & 0x00FF00FF; x = (x | (x << 4)) & 0x0F0F0F0F; x = (x | (x << 2)) & 0x33333333; x = (x | (x << 1)) & 0x55555555; — four shift-mask steps that spread a 16-bit value across the even bit positions of a 32-bit word (spreading a full int32 into an int64 needs one more step and 64-bit masks). OR the result with y spread the same way and shifted into the odd bit positions, and you have the Z-value. This is the same magic that game engines use for Morton codes in BVH traversal. Read Hacker's Delight §7.5 (Henry Warren) for the full derivation. On x86 the whole spread also collapses to a single pdep (BMI2) instruction; performance-critical implementations call the intrinsic directly rather than hoping the compiler recognises the pattern.
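The spread is easy to verify against the bit loop. In Python, with the 16-bit-to-32-bit masks the text shows:

```python
def spread16(x: int) -> int:
    # Spread the low 16 bits of x so they occupy the even bit positions.
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton2(a: int, b: int) -> int:
    # a takes the even bit positions, b the odd ones.
    return spread16(a) | (spread16(b) << 1)

def interleave2(a: int, b: int, bits: int = 16) -> int:
    # The naive one-bit-at-a-time version, for cross-checking.
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

# The two computations agree on every input pair sampled here.
assert all(morton2(a, b) == interleave2(a, b)
           for a in range(0, 1024, 37) for b in range(0, 1024, 41))
```

Four shift-mask steps versus a 16-iteration loop: same result, a fraction of the instructions, and no data-dependent branches.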

Why Hilbert curve wins at the boundary

Z-order has discontinuities at quadrant boundaries: two points adjacent on the curve can be far apart in 2D space when the curve "jumps" from one quadrant to the next. A file that straddles such a boundary ends up with a wide bounding box. Hilbert curve has no such jumps — it's continuous, so a file's bounding box is always tight. The empirical gap is 5–15% better pruning rate, with the gap larger when files are smaller (more boundary crossings per file). Databricks' Hilbert blog post shows the comparison. The reason most lakehouses still default to Z is that Hilbert encoding is harder to vectorise and the speed-up vanishes on writes; the read-side gain is real but small.
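The discontinuity claim is checkable in a few lines: convert every curve position back to (x, y) and measure the largest step between consecutive points. The Hilbert conversion below is the standard distance-to-coordinate algorithm (as given on Wikipedia); the Z conversion just deinterleaves bits:

```python
def hilbert_d2xy(n: int, d: int):
    """Convert distance d along the Hilbert curve to (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant as needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def z_d2xy(d: int):
    """Deinterleave a Z-value back into (x, y)."""
    x = y = 0
    for i in range(8):
        x |= ((d >> (2 * i)) & 1) << i
        y |= ((d >> (2 * i + 1)) & 1) << i
    return x, y

def max_step(points):
    # Largest Manhattan distance between consecutive points on the curve.
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

hilbert = [hilbert_d2xy(16, d) for d in range(256)]
zorder = [z_d2xy(d) for d in range(256)]
print(max_step(hilbert), max_step(zorder))   # 1 vs a multi-cell jump
```

Every Hilbert step moves to an adjacent cell; Z-order's worst step leaps across the grid at the halfway point — exactly the boundary jump that widens a straddling file's bounding box.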

Z-order under concurrent writes

The previous chapter (optimistic concurrency) showed that compaction conflicts with itself when two compactions select overlapping file sets. Z-order rewrites are compactions, so they are subject to the same protocol. The standard pattern: scope each Z-order job to a single partition and run them serially per partition (or parallel across partitions). Iceberg's rewriteDataFiles action does this automatically — it groups files by partition and executes rewrites one partition-at-a-time within a single job. The CAS protocol from chapter 90 is what keeps a Z-order rewrite from corrupting concurrent appends: the appends commit small manifest deltas; the Z-order rewrite commits a much larger swap. Both succeed via the CAS; if one races the other, the loser retries against the new parent.
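The retry loop behind that protocol is small. A toy model of the race — the `Catalog` class is an in-memory stand-in for the catalog's compare-and-swap pointer, not any real client API:

```python
class Catalog:
    """Toy catalog: the table pointer advances only via compare-and-swap."""
    def __init__(self):
        self.snapshot = 0

    def cas(self, expected: int, new: int) -> bool:
        if self.snapshot != expected:
            return False      # someone committed first; the caller must retry
        self.snapshot = new
        return True

def commit(catalog, make_snapshot, max_retries=5):
    # Re-read the parent and rebuild the snapshot on every attempt.
    for _ in range(max_retries):
        parent = catalog.snapshot
        if catalog.cas(parent, make_snapshot(parent)):
            return True
    return False

cat = Catalog()
stale_parent = cat.snapshot          # a Z-order rewrite reads its parent...
commit(cat, lambda p: p + 1)         # ...then an append commits first
assert not cat.cas(stale_parent, 99) # the rewrite's stale CAS fails
assert commit(cat, lambda p: p + 1)  # it retries against the new parent and wins
```

The loser never overwrites the winner; it just pays one more round of reading the new parent and rebuilding its manifest swap — which is why scoping each Z-order job to a single partition (fewer overlapping file sets, fewer races) keeps retry counts low.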

Where this leads next

The next chapter, /wiki/copy-on-write-vs-merge-on-read-iceberg-vs-hudi, is about row-level update strategies. Both CoW and MoR layouts depend on the same min-max manifest stats and the same Z-order clustering choices that this chapter covered — but they differ in how often a Z-order rewrite is forced (CoW writes new files on every update; MoR defers rewrites). Choosing between them changes the frequency and cost of the OPTIMIZE-Z-ORDER cycle, not the mechanism.

After that the build moves into /wiki/time-travel-and-snapshot-expiration (which interacts with Z-order: every OPTIMIZE creates new snapshots that retain old files for time-travel, increasing storage cost) and then to query engines (Trino, Spark, Dremio, DuckDB) that all consume the same manifest-and-Z-order metadata layer.

Build 14 will revisit clustering from the real-time analytics angle — ClickHouse and Pinot have their own segment-level clustering primitives (ORDER BY in ClickHouse's MergeTree, sorted segments in Pinot) that solve the same problem at the OLAP-engine layer. The mental model from this chapter ports directly: writer clusters rows by predicate locality, reader prunes blocks before scanning.

References