Z-ordering and data skipping
A Zerodha analyst at 11:45 a.m. on a market-open Monday runs SELECT AVG(price) FROM trades WHERE symbol = 'RELIANCE' AND ts > now() - interval '5 min'. The trades table has 2.4 trillion rows across 187,000 Parquet files. The query plan claims it will scan 41 GB; the actual answer is in 280 KB worth of rows. The whole job of the metadata layer — manifest files with min-max stats per column per file — is to make sure the engine reads the 280 KB, not the 41 GB. Whether that works or not depends on a single decision the writer made hours earlier: how it ordered rows on disk. Random order kills pruning. Sort order helps one column. Z-ordering helps several columns at once, with bounded loss.
Data skipping prunes files before reading them, using min-max statistics that the writer recorded per column per file in the manifest. The pruning only works when files have narrow ranges on the column you filter by — which requires the writer to have clustered rows by that column. Sorting handles one column perfectly; for queries that filter on two or more columns, Z-ordering interleaves the bits of those columns so each file ends up with narrow ranges on all of them. The cost is a re-clustering compaction; the payoff is 10–100× scan reduction on multi-dimensional filter queries.
Why min-max stats prune nothing on random data
Open one Parquet file written by a streaming job with no clustering. Its manifest entry shows the min and max of every column across all rows in the file: min_ts = 09:30:00.001, max_ts = 15:29:59.998, min_symbol = "AABB", max_symbol = "ZYDU". A query filter like ts BETWEEN 09:35 AND 09:40 cannot prune this file — the file's [min_ts, max_ts] range overlaps the query's range. So the engine reads the file and scans every row group inside.
Now imagine the writer had pre-sorted rows globally by ts before splitting them into files. File 0 covers 09:30:00–09:45:00, file 1 covers 09:45:00–10:00:00, etc. The same query now touches one file. That is data skipping working: the manifest's min-max metadata answers "could this file possibly contain a matching row?" with no for 99.5% of files, and the engine reads only the rest.
The mechanism is unglamorous: the engine reads the manifest, looks at each file's [min_ts, max_ts], checks whether the predicate range intersects, and skips the file entirely if not. Iceberg, Delta, and Hudi all do this. The footer of every Parquet file also has per-row-group min-max stats, so a second-level prune happens inside the file at row-group granularity. Both levels need narrow ranges to work. Random row order destroys both.
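The manifest-level prune can be simulated end to end in a few lines. A minimal sketch, with invented file counts, row counts, and a 1-hour predicate window, contrasting a writer that emits rows in arrival order with one that pre-sorts by timestamp:

```python
import random

random.seed(7)

# 10,000 synthetic trade timestamps, in seconds since midnight.
ts = [random.randrange(86_400) for _ in range(10_000)]

def manifest(col_values, n_files=100):
    """Chunk rows into fixed-size files; record per-file (min, max) — the manifest stats."""
    per_file = len(col_values) // n_files
    return [(min(col_values[i * per_file:(i + 1) * per_file]),
             max(col_values[i * per_file:(i + 1) * per_file]))
            for i in range(n_files)]

def files_scanned(stats, lo, hi):
    # A file survives pruning iff its [min, max] could overlap the predicate range.
    return sum(1 for mn, mx in stats if not (mx < lo or mn > hi))

random_layout = manifest(ts)          # writer emitted rows in arrival order
sorted_layout = manifest(sorted(ts))  # writer pre-sorted by timestamp

lo, hi = 39_600, 43_200               # WHERE ts BETWEEN 11:00 AND 12:00
print(files_scanned(random_layout, lo, hi), "of 100 files scanned with random layout")
print(files_scanned(sorted_layout, lo, hi), "of 100 files scanned with sorted layout")
```

With the random layout nearly every file's range overlaps the window, so nearly every file is scanned; with the sorted layout only the handful of files covering that hour survive the check.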
This is why "we have 1 PB in our lakehouse and queries still take 12 minutes" is almost always a clustering problem and almost never a compute problem. The engine is reading every byte because the metadata cannot tell it which bytes to skip.
What goes wrong with sort, and what Z-order fixes
Sorting by one column gives perfect pruning on that column and destroys pruning on every other column. Sort trades by ts. Now WHERE ts > '11:00' prunes brilliantly. But WHERE symbol = 'RELIANCE' prunes nothing, because every file (covering one minute of trades) contains all symbols that traded in that minute — including RELIANCE.
You can sort by a compound key — ORDER BY symbol, ts — and get good pruning on symbol, decent pruning on ts within a symbol group. But this only helps the leading column of the sort and degrades fast. A query like WHERE ts > '11:00' AND venue = 'NSE' gets nothing useful from ORDER BY symbol, ts.
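The compound key's blind spot is easy to make concrete. A sketch with an invented toy table (50 symbols, 960 minutes, one row per pair) sorted by (symbol, minute): the leading column prunes to one file, while a minute-only filter prunes nothing.

```python
symbols = [f"SYM{i:03d}" for i in range(50)]
# One row per (symbol, minute): every symbol trades every minute of a 960-minute day.
rows = sorted((s, m) for s in symbols for m in range(960))   # ORDER BY symbol, minute

files = [rows[i:i + 4_800] for i in range(0, len(rows), 4_800)]   # 10 files, 5 symbols each
stats = [((f[0][0], f[-1][0]),                                    # symbol [min, max]
          (min(m for _, m in f), max(m for _, m in f)))           # minute [min, max]
         for f in files]

# WHERE symbol = 'SYM007': the leading sort column prunes to a single file.
hit_symbol = sum(1 for (smin, smax), _ in stats if smin <= "SYM007" <= smax)
# WHERE minute BETWEEN 600 AND 605: every file spans minutes [0, 959]; nothing prunes.
hit_minute = sum(1 for _, (mmin, mmax) in stats if not (mmax < 600 or mmin > 605))
print(hit_symbol, "file(s) scanned for the symbol filter;", hit_minute, "for the minute filter")
# -> 1 file(s) scanned for the symbol filter; 10 for the minute filter
```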
Z-ordering is the answer. Take the bits of column A and column B and interleave them. The first bit of the Z-value comes from A's high bit, the second bit from B's high bit, the third from A's next bit, and so on. Sort by the resulting Z-value. The result: nearby Z-values correspond to rows that are nearby in both A and B. Files written in Z-value order have narrow ranges on both columns simultaneously — not as narrow as a single-column sort would give for the dominant column, but narrow on all the Z-ordered columns at once.
The geometric intuition is space-filling curves. A 2D plane visited in Z-order zigzags through quadrants, recursively sub-quadrants. Every contiguous chunk of the curve corresponds to a small bounding box in the original space, so chunking the curve into files yields files whose 2D ranges are small bounding boxes — small on both dimensions.
The price: Z-order is not as good on a single column as a pure sort would be. If 99% of your queries filter on ts only, sort by ts. Z-order earns its keep when you have a small set of columns (say 2–4) that show up together in many queries.
Computing a Z-value, in code
Z-order interleaving is a five-line trick. Here's the algorithm and a worked execution.
```python
# zorder.py — interleave bits to produce a Z-value, then cluster files by it.
# Demonstrates: bit interleaving, narrow per-file ranges, query pruning.

def interleave2(a: int, b: int, bits: int = 16) -> int:
    """Interleave bits of a and b into a Z-value. a contributes even-position bits."""
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

def zvalue(row: dict) -> int:
    # Map each Z-order column to a non-negative int. Real systems hash strings
    # to ints with a deterministic hash; Python's built-in hash() is salted per
    # process, so the exact grouping shown below varies from run to run.
    a = row["minute_of_day"]               # 0..959 (16 hours of trading * 60)
    b = abs(hash(row["symbol"])) % 1024    # 10 bits
    return interleave2(a, b)

# 12 sample trades from a Zerodha-like tick stream
rows = [
    {"minute_of_day": 5,   "symbol": "RELIANCE", "price": 2810.50},
    {"minute_of_day": 305, "symbol": "TCS",      "price": 4120.75},
    {"minute_of_day": 8,   "symbol": "INFY",     "price": 1640.20},
    {"minute_of_day": 308, "symbol": "RELIANCE", "price": 2812.05},
    {"minute_of_day": 12,  "symbol": "HDFCBANK", "price": 1502.10},
    {"minute_of_day": 312, "symbol": "TCS",      "price": 4119.90},
    {"minute_of_day": 15,  "symbol": "RELIANCE", "price": 2811.20},
    {"minute_of_day": 315, "symbol": "INFY",     "price": 1641.80},
    {"minute_of_day": 20,  "symbol": "HDFCBANK", "price": 1502.55},
    {"minute_of_day": 320, "symbol": "TCS",      "price": 4118.30},
    {"minute_of_day": 25,  "symbol": "INFY",     "price": 1642.00},
    {"minute_of_day": 325, "symbol": "RELIANCE", "price": 2813.10},
]

sorted_rows = sorted(rows, key=zvalue)

# Split into 3 files of 4 rows each (the way a writer would).
for fid in range(3):
    chunk = sorted_rows[fid * 4:(fid + 1) * 4]
    mins = (min(r["minute_of_day"] for r in chunk), min(r["symbol"] for r in chunk))
    maxs = (max(r["minute_of_day"] for r in chunk), max(r["symbol"] for r in chunk))
    print(f"file {fid}: minute=[{mins[0]:>3},{maxs[0]:>3}] symbol=[{mins[1]:>9},{maxs[1]:>9}]")

# One sample run (exact grouping varies with the hash seed):
# file 0: minute=[  5, 20] symbol=[ HDFCBANK, RELIANCE]
# file 1: minute=[ 12,305] symbol=[ HDFCBANK,      TCS]
# file 2: minute=[ 15,325] symbol=[     INFY, RELIANCE]
#
# Compare the same data sorted purely by minute_of_day (3 files of 4 rows):
# file 0: minute=[  5, 15] symbol=[ HDFCBANK, RELIANCE]  <- narrow on minute, wide on symbol
# file 1: minute=[ 20,308] symbol=[ HDFCBANK,      TCS]  <- both wide (boundary spans morning + afternoon)
# file 2: minute=[312,325] symbol=[     INFY,      TCS]  <- narrow on minute, wide on symbol
```
Walk the load-bearing lines:
- `z |= ((a >> i) & 1) << (2 * i)` — this is the entire interleave. Bit `i` of `a` becomes bit `2*i` of the Z-value; bit `i` of `b` becomes bit `2*i + 1`. The high-order bits of `z` mix the high-order bits of `a` and `b`, which is what gives the Z-curve its recursive quadrant structure. Production implementations (Delta's `OptimizeZOrderBy`, Iceberg's `ZOrderUDF`) use a magic-number bit-twiddling version (`_pdep_u32` on x86, the Morton-encoding trick on ARM) that runs in a handful of instructions instead of a 16-iteration loop. The semantics are identical.
- `b = abs(hash(row["symbol"])) % 1024` — real Z-ordering needs each input column to be a non-negative integer. Strings get hashed; floats get bit-cast or quantised; timestamps become epoch-millis; categoricals become dense IDs. Why hashing strings is fine for Z-order but disastrous for sort: Z-order makes no claim about preserving lexicographic order on any single column — it preserves locality. Two symbols that hash to nearby integers will land in the same Z-region, even if their alphabetic order is far apart. So `RELIANCE` and `INFY` may sit next to each other in Z-space; that is exactly what we want for filter-and-skip, even though it makes "list symbols alphabetically" require a separate index.
- `sorted_rows = sorted(rows, key=zvalue)` — once each row has a Z-value, the writer's job is just `ORDER BY zvalue` and emit fixed-size files along that order. Iceberg's `rewriteDataFiles` action with `sort-by zorder(minute_of_day, symbol)` does exactly this on a Spark cluster, in parallel, by partition.
- `mins = (min(r["minute_of_day"] for r in chunk), ...)` — this is the manifest stat that the engine will use later. After the rewrite, every file has narrow `[min, max]` ranges on all the Z-order columns. The pruning power scales with how narrow these ranges are.
- The two output blocks show the trade-off concretely. Z-order on (minute, symbol) gives all three files a moderate range on both dimensions. Sorting by minute alone gives two files perfectly narrow minute ranges and one file (the boundary file) a useless one — but every file is wide on symbol, so a `WHERE symbol = X` query prunes nothing. Z-order trades a little single-dimension precision for usable pruning in either dimension.
The runtime cost of the rewrite itself is non-trivial: it is a full read-write of the affected partitions. Why teams still pay the cost: a one-time 30-minute compaction at 11 p.m. that turns every subsequent business-hours dashboard query from 12 minutes into 8 seconds is a 90× operational improvement at a 0.05% capacity cost. The compaction runs at a time when the cluster is idle anyway. The savings compound across every analyst, every BI dashboard, every API caller for the entire next day.
Picking columns to Z-order on, in production
Z-ordering on the wrong columns is worse than not Z-ordering at all — you pay the rewrite cost and get no pruning. Three rules production teams use:
1. Z-order columns must show up in WHERE clauses, not just SELECT lists. Run EXPLAIN or pull the predicate frequencies from your query log for the past week. If column customer_segment shows up in 73% of WHERE clauses for the table, it's a Z-order candidate. If email_address shows up only in SELECT lists, it isn't — pruning can't help columns the engine doesn't filter on.
2. Cardinality matters. Aim for medium cardinality. A column with 2 distinct values (is_premium) is best handled by partitioning, not Z-ordering — there's no locality to preserve, the column is binary. A column with 2 billion distinct values (request_id, transaction_id) has so much entropy that Z-ordering doesn't really cluster it — every file ends up touching the full ID range. Sweet spot: 100s to 100k distinct values. symbol_id (a few thousand traded NSE symbols), merchant_id (a few lakh active merchants on Razorpay), region_id, event_type — all good Z-order candidates.
3. Two to four columns, no more. The "Z" in Z-order interleaves bits across all the input columns. With two columns, each one contributes 50% of the bit positions. With ten columns, each contributes 10% — meaning the per-file range on any given column gets very wide because few high-order bits of any one column made it into the Z-value's high-order bits. Industry convention: pick the two columns that show up together in the most queries. Add a third only if a clearly-distinct query pattern motivates it. A fourth almost never earns its keep.
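The bit-dilution argument in rule 3 can be measured directly. A sketch, assuming 10-bit columns and uniform random data, that generalises the interleave to N columns and reports how wide each file's range on the first column becomes after a Z-sort:

```python
import random

def interleave_n(values, bits=10):
    """Round-robin interleave: bit i of column c lands at Z-bit i*len(values) + c."""
    z, n = 0, len(values)
    for i in range(bits):
        for c, v in enumerate(values):
            z |= ((v >> i) & 1) << (i * n + c)
    return z

def avg_file_range(n_cols, n_rows=8_192, per_file=256):
    """Average per-file [min, max] width on column 0 after Z-sorting on n_cols columns."""
    rnd = random.Random(n_cols)  # deterministic per column count
    data = [[rnd.randrange(1024) for _ in range(n_cols)] for _ in range(n_rows)]
    data.sort(key=interleave_n)
    widths = [max(v[0] for v in data[i:i + per_file]) - min(v[0] for v in data[i:i + per_file])
              for i in range(0, n_rows, per_file)]
    return sum(widths) / len(widths)

for n in (1, 2, 4, 8):
    print(f"{n} column(s): avg per-file range on column 0 = {avg_file_range(n):.0f} of 1023")
```

With one column (a pure sort) each file covers a thin slice; with two, a moderate slice; by eight columns the per-file range on any one column approaches the full domain, and min-max pruning on that column is effectively gone.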
A Flipkart catalogue team in 2025 ran this exercise on their product_events table. Query log showed WHERE category_id AND warehouse_id AND event_date in 64% of queries; WHERE seller_id AND event_date in 22%; everything else in 14%. They Z-ordered on (category_id, warehouse_id, event_date). The 22% seller-only queries got slightly worse pruning than before (because seller_id was no longer the leading sort key, which it had been). The 64% main pattern improved from 4-minute scans to 6-second scans. Net: median dashboard latency dropped from 2.8 s to 0.9 s; the seller analytics team got a separate Z-ordered copy of the same partitions to handle their pattern. The lesson is that one Z-order layout cannot serve every workload — you optimise for the dominant pattern and either accept worse performance on the rest, or maintain a second clustering.
When Z-order doesn't help
Z-ordering is not free, and three patterns make it actively wrong.
1. Highly-skewed data. If 90% of trades are on RELIANCE and TCS, Z-ordering by symbol gives you huge files for those two symbols and tiny files for everything else. The huge files have wide ranges anyway because they each contain millions of rows of the same symbol. The fix is partitioning by symbol (separate folders for each symbol) plus Z-ordering by (minute_of_day, venue) within each partition. Iceberg supports both layered together; this is the standard production layout for tick data.
2. High write-rate streaming tables. Z-ordering requires reading and rewriting all the data in a partition. If the partition gets 50 GB of new data every hour, the rewrite cost is dominant and you'll never catch up. The fix: Z-order only the partitions older than 24 hours (the "warm" tier), and accept that the most-recent partition (the "hot" tier) is unsorted but small enough that a full scan is fast. This is a nearly-universal lakehouse pattern — hot stays unclustered, warm gets Z-ordered nightly, cold gets Z-ordered weekly.
3. Queries that scan the whole table anyway. A SELECT COUNT(*) or a GROUP BY symbol, COUNT(*) reads every file. Z-ordering doesn't help; the engine isn't pruning anything. Z-order optimises for selective queries — predicates that match a small fraction of rows. If your workload is dashboards that paginate over everything, Z-order is the wrong tool; you want pre-aggregated materialised views instead (chapter 14 of the curriculum).
A Cred analytics team learned the second lesson the hard way in early 2026: they enabled OPTIMIZE ... ZORDER BY (user_id, event_type) as a daily job on their app_events table. The job took 4 hours every night, more than half of which was rewriting the current day's partition (which was still being written to). Their nightly ingestion latency went up by 90 minutes. The fix: change the OPTIMIZE clause to WHERE event_date < current_date so it skipped the live partition. Daily job dropped to 25 minutes, ingestion latency recovered, and the cluster bill dropped by ₹84,000/month.
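The partition-age guard that fixed the Cred job generalises into the hot/warm/cold schedule from pattern 2. A minimal sketch — the tier boundaries (1 day, 7 days) and the function name are invented; a real job would feed the selected partitions to OPTIMIZE or rewriteDataFiles:

```python
from datetime import date, timedelta

def partitions_to_recluster(partitions, today, weekly_run=False):
    """Hot (< 1 day old): skip, still being written. Warm (1-6 days): every nightly run.
    Cold (>= 7 days): only on the weekly run."""
    warm = [p for p in partitions if timedelta(days=1) <= today - p < timedelta(days=7)]
    cold = [p for p in partitions if today - p >= timedelta(days=7)]
    return warm + (cold if weekly_run else [])

today = date(2026, 4, 25)
partitions = [today - timedelta(days=d) for d in range(10)]

nightly = partitions_to_recluster(partitions, today)
print(len(nightly), "partitions in the nightly job")   # 6 warm; today's hot partition skipped
```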
Common confusions
- "Z-order is the same as multi-column sort." Multi-column sort `ORDER BY a, b` gives perfect pruning on `a` and pruning on `b` only within an `a`-group. Z-order gives moderate pruning on both, simultaneously, regardless of which one a query filters on. The use cases are different: sort wins when one column dominates the workload; Z-order wins when two or three columns show up together in mixed patterns.
- "Z-order replaces partitioning." No — they compose. Partitioning splits files into directories by a low-cardinality column (`event_date`, `region`); Z-order clusters rows within each partition. Production layouts almost always use both. Skipping partitioning and just Z-ordering everything works only at small scale; at hundreds of TB you need partition-level pruning before file-level pruning even gets to run.
- "Manifest stats include every column automatically." They include every column the writer recorded stats for, which is configurable. Iceberg's default is the first 100 columns; Delta's is `dataSkippingNumIndexedCols = 32`. If your filter column is past the index limit, `EXPLAIN` will show a full scan even after Z-ordering. The fix is either reordering columns or raising the limit; both are cheap.
- "Z-order is the same as the Hilbert curve." Hilbert is another space-filling curve with marginally better locality (no quadrant-boundary discontinuities). Databricks added Hilbert support to Delta in 2023; the prune-rate improvement vs. Z is 5–15% in benchmarks. Hilbert is harder to compute (no neat bit-twiddle) and the practical difference is small enough that most teams stay on Z-order. Use Hilbert only if you've measured Z and have a benchmark proving the gap matters.
- "Once Z-ordered, you don't need to re-cluster." Wrong. Every new ingestion writes new unsorted files into the partition. After enough new data, the partition's effective clustering degrades back toward random. Re-run OPTIMIZE on a schedule — daily for hot partitions, weekly for warm. Skipping re-clustering is the most common reason "Z-order didn't help us in production".
- "Bigger files are always better for Z-order." No — there is a sweet spot. Files of 128–512 MB are the production target. Smaller files mean more manifest entries and metadata overhead; larger files mean each file's range is wider (more rows = wider min-max), which hurts pruning. The defaults in Iceberg (`write.target-file-size-bytes = 512 MB`) and Delta (`spark.databricks.delta.optimize.maxFileSize = 1 GB`) reflect this measurement.
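The degradation described in the re-clustering bullet above is easy to simulate. A sketch with invented file and row counts, using single-column clustering as a stand-in for Z-order to keep it short:

```python
import random

rnd = random.Random(9)

def prune_fraction(files, lo, hi):
    """Fraction of files the engine can skip for a predicate range [lo, hi]."""
    stats = [(min(f), max(f)) for f in files]
    return sum(1 for mn, mx in stats if mx < lo or mn > hi) / len(stats)

# Freshly compacted partition: 20 files, each a narrow slice of the minute domain.
day = sorted(rnd.randrange(960) for _ in range(4_000))
clustered = [day[i:i + 200] for i in range(0, 4_000, 200)]
print(f"after OPTIMIZE: {prune_fraction(clustered, 600, 605):.0%} of files pruned")

# A week of streaming appends: each new file spans the whole day, unclustered.
appends = [[rnd.randrange(960) for _ in range(200)] for _ in range(60)]
print(f"a week later:   {prune_fraction(clustered + appends, 600, 605):.0%} of files pruned")
```

The freshly compacted layout prunes almost everything; once the unclustered appends outnumber the clustered files, the overall prune rate collapses toward what it would be on random data, which is why the OPTIMIZE has to recur on a schedule.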
Going deeper
How the engine actually evaluates the predicate against manifest stats
A query like WHERE region = 'south' AND ts > '2026-04-25' is broken into per-column predicates. For each candidate file, the planner checks each predicate against the file's [min, max] for that column. The file is definitely-not-matching if any predicate's range is disjoint from the file's range — that's the prune. The file is maybe-matching if all predicates' ranges overlap — the planner queues the file for read. The planner does not try to prove the file definitely matches; it only tries to prove it definitely doesn't. This is the Iceberg Expressions.bind(...) and Delta DataSkippingReader.constructDataFilters(...) code path. Z-order's whole job is to make the "definitely-not-matching" answer correct for as many files as possible.

The deeper trick is bloom filters and column-level zone maps for high-cardinality equality predicates (WHERE merchant_id = 'M-93021') — Z-order's min-max can only narrow the range; a bloom filter can prove the exact value is absent. Iceberg supports per-column bloom filters via write.parquet.bloom-filter-enabled.column.merchant_id = true; combine bloom filters with Z-order for the strongest pruning on equality-heavy workloads.
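The planner loop reduces to a few lines. A sketch with invented file names and stats; equality predicates are expressed as degenerate ranges, and ISO date strings compare correctly as plain strings:

```python
def maybe_matches(file_stats: dict, predicates: dict) -> bool:
    """True unless some predicate's range is provably disjoint from the file's [min, max].

    file_stats: column -> (min, max); predicates: column -> (lo, hi), inclusive.
    Proves 'definitely not matching'; never tries to prove 'definitely matching'.
    """
    for col, (lo, hi) in predicates.items():
        if col not in file_stats:
            continue                  # no stats recorded -> cannot prune on this column
        mn, mx = file_stats[col]
        if mx < lo or mn > hi:
            return False              # disjoint on one column -> prune the whole file
    return True

files = {
    "f0.parquet": {"region": ("east", "north"), "ts": ("2026-04-20", "2026-04-24")},
    "f1.parquet": {"region": ("south", "west"), "ts": ("2026-04-20", "2026-04-24")},
    "f2.parquet": {"region": ("south", "south"), "ts": ("2026-04-26", "2026-04-27")},
}
# WHERE region = 'south' AND ts > '2026-04-25' (equality as a degenerate range).
preds = {"region": ("south", "south"), "ts": ("2026-04-25", "9999-12-31")}
scanned = [name for name, stats in files.items() if maybe_matches(stats, preds)]
print(scanned)  # -> ['f2.parquet']: f0 fails on region, f1 fails on ts
```

A bloom-filter check would slot in as one more "definitely absent" test for equality predicates, after the min-max pass.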
The Morton encoding magic-number trick
The naive Z-value computation iterates once per bit, which is fine pedagogically but slow on hot loops. Production implementations use a parallel bit-spread: x = (x | (x << 8)) & 0x00FF00FF; x = (x | (x << 4)) & 0x0F0F0F0F; x = (x | (x << 2)) & 0x33333333; x = (x | (x << 1)) & 0x55555555; — four shift-and-mask steps to spread a 16-bit value across every other bit of an int32 (a 64-bit version adds one more step with wider masks). OR the result with y shifted similarly into the alternate bit positions, and you have the Z-value. This is the same magic that game engines use for Morton codes in BVH traversal. Read Hacker's Delight §7.5 (Henry Warren) for the full derivation. Compilers can sometimes recognise the pattern and emit pdep (a BMI2 instruction) on x86, collapsing the whole spread into a single instruction.
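The same spread is easy to write and check in Python. A sketch using exactly the masks quoted above, verified against the loop version from earlier in the chapter:

```python
def spread16(x: int) -> int:
    """Spread a 16-bit int so its bits occupy the even positions of a 32-bit int."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton2(a: int, b: int) -> int:
    """Z-value via parallel bit-spread: a on even bit positions, b on odd."""
    return spread16(a) | (spread16(b) << 1)

def interleave2_loop(a: int, b: int, bits: int = 16) -> int:
    """The one-bit-per-iteration reference version."""
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

assert all(morton2(a, b) == interleave2_loop(a, b)
           for a in (0, 1, 0x1234, 0xFFFF) for b in (0, 7, 0xABCD, 0xFFFF))
print(hex(morton2(0xFFFF, 0)))  # -> 0x55555555: a's bits fill every even position
```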
Why Hilbert curve wins at the boundary
Z-order has discontinuities at quadrant boundaries: two points adjacent on the curve can be far apart in 2D space when the curve "jumps" from one quadrant to the next. A file that straddles such a boundary ends up with a wide bounding box. Hilbert curve has no such jumps — it's continuous, so a file's bounding box is always tight. The empirical gap is 5–15% better pruning rate, with the gap larger when files are smaller (more boundary crossings per file). Databricks' Hilbert blog post shows the comparison. The reason most lakehouses still default to Z is that Hilbert encoding is harder to vectorise and the speed-up vanishes on writes; the read-side gain is real but small.
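The discontinuity is visible with a few lines of deinterleaving: walk consecutive Z-values through a 16×16 grid and measure the Manhattan distance between the 2D points they decode to.

```python
def deinterleave2(z: int) -> tuple[int, int]:
    """Recover (x, y) from a Z-value: x from even bit positions, y from odd."""
    x = y = 0
    for i in range(16):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

# Step along the curve and record each step's 2D jump.
jumps = []
for z in range(255):
    (x0, y0), (x1, y1) = deinterleave2(z), deinterleave2(z + 1)
    jumps.append((abs(x1 - x0) + abs(y1 - y0), z))

worst, at = max(jumps)
print(f"worst step: {worst} cells, between z={at} and z={at + 1}")
# -> worst step: 16 cells, between z=127 and z=128
# z=127 decodes to (15, 7); z=128 decodes to (0, 8): the curve leaps across
# the half-grid boundary, so a file straddling that step gets a wide bounding box.
```

Most steps move one cell; the jumps cluster at power-of-two boundaries, which is exactly where Hilbert's continuous traversal wins.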
Z-order under concurrent writes
The previous chapter (optimistic concurrency) showed that compaction conflicts with itself when two compactions select overlapping file sets. Z-order rewrites are compactions, so they are subject to the same protocol. The standard pattern: scope each Z-order job to a single partition and run them serially per partition (or parallel across partitions). Iceberg's rewriteDataFiles action does this automatically — it groups files by partition and executes rewrites one partition-at-a-time within a single job. The CAS protocol from chapter 90 is what keeps a Z-order rewrite from corrupting concurrent appends: the appends commit small manifest deltas; the Z-order rewrite commits a much larger swap. Both succeed via the CAS; if one races the other, the loser retries against the new parent.
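A toy version of that race, assuming a single versioned table pointer with compare-and-swap semantics. All names are invented, and real table formats validate much more on retry (for example, that the files the rewrite wants to remove still exist in the new parent):

```python
class Table:
    """Toy metadata pointer: a file set plus a version, updated only via CAS."""
    def __init__(self):
        self.version, self.files = 0, {"f0", "f1", "f2"}

    def cas_commit(self, expected_version: int, remove: set, add: set) -> bool:
        if expected_version != self.version:
            return False                      # parent moved: caller must rebase and retry
        self.files = (self.files - remove) | add
        self.version += 1
        return True

t = Table()
parent = t.version

# The streaming append commits first: a small manifest delta on the shared parent.
assert t.cas_commit(parent, remove=set(), add={"f3"})

# The Z-order rewrite prepared against the old parent: its CAS fails...
assert not t.cas_commit(parent, remove={"f0", "f1", "f2"}, add={"z0", "z1"})
# ...so it rebases onto the new version and retries the (much larger) swap.
assert t.cas_commit(t.version, remove={"f0", "f1", "f2"}, add={"z0", "z1"})
print(sorted(t.files))  # -> ['f3', 'z0', 'z1']: the append survived the rewrite
```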
Where this leads next
The next chapter, /wiki/copy-on-write-vs-merge-on-read-iceberg-vs-hudi, is about row-level update strategies. Both CoW and MoR layouts depend on the same min-max manifest stats and the same Z-order clustering choices that this chapter covered — but they differ in how often a Z-order rewrite is forced (CoW writes new files on every update; MoR defers rewrites). Choosing between them changes the frequency and cost of the OPTIMIZE-Z-ORDER cycle, not the mechanism.
After that the build moves into /wiki/time-travel-and-snapshot-expiration (which interacts with Z-order: every OPTIMIZE creates new snapshots that retain old files for time-travel, increasing storage cost) and then to query engines (Trino, Spark, Dremio, DuckDB) that all consume the same manifest-and-Z-order metadata layer.
Build 14 will revisit clustering from the real-time analytics angle — ClickHouse and Pinot have their own segment-level clustering primitives (ORDER BY in ClickHouse's MergeTree, sorted segments in Pinot) that solve the same problem at the OLAP-engine layer. The mental model from this chapter ports directly: writer clusters rows by predicate locality, reader prunes blocks before scanning.
References
- Z-Order Curve — Wikipedia — the mathematical foundation; clear pictures of the recursive Z structure and Morton encoding.
- Databricks — Processing Petabytes of Data in Seconds with Databricks Delta — the original Z-order announcement for Delta; benchmarks against single-column sort.
- Apache Iceberg — Maintenance: rewriteDataFiles with Z-order — the production API for triggering Z-order rewrites; pairs with the strategy parameters.
- Hacker's Delight, 2nd ed. — Henry Warren — §7.5 covers Morton encoding via parallel bit-spreading; the magic numbers that make Z-value computation fast.
- Databricks — Hilbert Curves blog — empirical comparison of Z-order vs Hilbert on Delta tables.
- Apache Parquet — Statistics and Predicate Pushdown — the file-format level where row-group min-max stats live; the second tier of pruning after manifest-level skipping.
- /wiki/concurrent-writers-optimistic-concurrency-serializability — the previous chapter; Z-order rewrites use the optimistic-CAS protocol from there.
- /wiki/manifest-files-and-the-commit-protocol — the manifest layer that holds the min-max stats this chapter exploits.