In short

Predicate pushdown is the contract between the query optimizer and the storage layer: the optimizer takes the WHERE clause, expresses it as a list of column predicates, and the storage layer uses cheap pre-computed metadata to prove that whole chunks of the file cannot contain a matching row — and then never reads them. Two metadata structures do almost all the work: zone maps (per-chunk min/max/null_count for each column) and bloom filters (per-chunk probabilistic membership bitmaps). Zone maps win for range predicates on clustered columns; bloom filters win for equality predicates on high-cardinality columns where min/max spans the whole domain.

A real Parquet read on a partitioned, sorted 100 GB orders table runs four nested levels of skipping: file-level via Hive partitions, row-group-level via zone maps, page-level via column indexes, and final residual filtering on the few decoded pages that survive. A query like SELECT SUM(amount) WHERE date BETWEEN '2024-04-01' AND '2024-04-30' AND state = 'KA' typically shrinks from 100 GB on disk to ~1.2 GB of surviving row groups, an 80× reduction, and column projection then cuts the bytes actually read to well under 1 GB, all without anyone writing an index by hand. The same primitives drive Snowflake's micro-partitions, Apache Iceberg's manifest pruning, Delta Lake's data skipping, and ClickHouse's skipping indexes — different names, same mechanics.

This chapter is the read-path companion to the Parquet and ORC file format chapter: that one was about what bytes sit where, this one is about how the reader proves which bytes it does not need to touch.

In the previous chapter you saw that a Parquet file ends with a footer carrying per-row-group, per-column statistics. That sentence is easy to skim past. The point of this chapter is that those few KB of statistics are doing almost all of the performance work in your OLAP stack. The CPU-side vectorised execution gets the headlines, but the real reason a 100 GB Athena query finishes in 8 seconds for ₹0.05 is that the reader proved 99 GB of the file irrelevant before issuing the first decode.

The pushdown idea is a deal between two layers that historically did not talk to each other:

Predicate pushdown is the protocol by which the optimizer hands the predicates down to the storage layer so the storage layer can lie about how much data exists. Why "lie": from the engine's perspective, the storage layer behaves as if pruned chunks simply do not exist — the reader returns zero rows for them as if they had been physically deleted. The bytes are still on disk; the storage layer just proved you do not need them.

The big idea: prove, do not scan

Almost every modern OLAP storage engine is built on a single observation. If you can store a tiny summary of each data chunk that fits cheaply in RAM, and the summary is monotone enough that you can use it to rule out whether the chunk contains any matching row, you can skip the chunk entirely. You do not have to find the matching rows; you only have to prove they cannot be there.

A zone map is the cheapest possible such summary: just the minimum and maximum value of each column in the chunk, plus a null count. For a chunk of 100,000 rows of a DATE column, that is two 4-byte integers and one count — twelve bytes to summarise 400 KB of data. For a query WHERE date > '2024-04-01', the reader compares the predicate to max(date) in each chunk; any chunk whose max(date) <= '2024-04-01' is provably empty for this query and is skipped without touching the column data.

That is it. That is the entire idea. Everything else in this chapter is variations on the theme.

Zone maps, drawn

Zone map skipping: WHERE date > '2024-04-01'

Per-row-group zone maps (read once from footer, ~50 KB total):

  RG 0   date.min='2023-12-01'   date.max='2024-01-31'   rows=250K   bytes=128MB
  RG 1   date.min='2024-02-01'   date.max='2024-03-15'   rows=250K   bytes=128MB
  RG 2   date.min='2024-03-16'   date.max='2024-04-30'   rows=250K   bytes=128MB
  RG 3   date.min='2024-05-01'   date.max='2024-06-15'   rows=250K   bytes=128MB

Compare predicate to each row group's max(date):

  RG 0   max='2024-01-31' <= '2024-04-01'  → SKIP entirely (128 MB never read)
  RG 1   max='2024-03-15' <= '2024-04-01'  → SKIP entirely (128 MB never read)
  RG 2   max='2024-04-30' >  '2024-04-01'  → READ + residual filter (overlaps)
  RG 3   min='2024-05-01' >  '2024-04-01'  → READ all rows (full hit, no filter)

Bytes read: 50 KB footer + 2 × 128 MB column chunks = ~256 MB instead of 512 MB
RG 3 is a "full hit" — every row matches, so the engine skips the residual predicate evaluation

Notice the third state in the figure: full hit. When the predicate covers the chunk's entire [min, max] range (here, min(date) > '2024-04-01'), every row in the chunk matches and the engine can skip the per-row predicate evaluation too — it simply emits all decoded rows. This is a small but meaningful CPU win on top of the I/O win. Why full-hit detection matters: a vectorised filter loop over a 100K-row chunk costs a few hundred microseconds even when it produces all-true; skipping it on the chunks where every row matches recovers that time and lets the executor go straight to aggregation.
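To make the three outcomes concrete, here is a minimal sketch of the per-chunk decision for WHERE date > '2024-04-01'. It is illustrative Python, not any engine's real API; the ZoneMap class and the prune() helper are invented for this example.

# zone_map_prune.py: illustrative only; ZoneMap and prune() are not a real engine API.
from dataclasses import dataclass
from datetime import date

@dataclass
class ZoneMap:
    min_val: date
    max_val: date

def prune(zm: ZoneMap, value: date) -> str:
    """Classify a chunk for the predicate  x > value  using only its zone map."""
    if zm.max_val <= value:
        return "SKIP"       # provably no matching row: never read the chunk
    if zm.min_val > value:
        return "FULL HIT"   # provably every row matches: decode, skip the residual filter
    return "PARTIAL"        # overlap: decode and run the residual filter per row

row_groups = [
    ZoneMap(date(2023, 12, 1), date(2024, 1, 31)),   # RG 0
    ZoneMap(date(2024, 2, 1),  date(2024, 3, 15)),   # RG 1
    ZoneMap(date(2024, 3, 16), date(2024, 4, 30)),   # RG 2
    ZoneMap(date(2024, 5, 1),  date(2024, 6, 15)),   # RG 3
]
for i, zm in enumerate(row_groups):
    print(f"RG {i}: {prune(zm, date(2024, 4, 1))}")   # SKIP, SKIP, PARTIAL, FULL HIT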

The four operator families

Different SQL operators interact with zone maps differently. The taxonomy below is the mental model every storage engineer uses when deciding what statistics to maintain.

Range predicates (>, <, >=, <=, BETWEEN). These are the natural clients of zone maps. Given min and max for a chunk, you can decide range overlap in two integer comparisons. For WHERE x > 100, skip if max(x) <= 100. For WHERE x BETWEEN 50 AND 150, skip if max(x) < 50 OR min(x) > 150. Effectiveness depends entirely on data clustering: a column sorted by x produces tight per-chunk ranges and prunes aggressively; a randomly shuffled column produces ranges spanning the whole domain and prunes nothing.

Equality predicates (=, IN). Min/max can prune equality, but only when the predicate value falls outside [min, max]. For a low-cardinality column like country_code (200 distinct values), most chunks will contain most countries, so WHERE country = 'IN' will match almost every chunk's range. For a high-cardinality column like user_id (100 million distinct values), zone maps are essentially useless — every chunk's [min(user_id), max(user_id)] covers the value you are searching for unless the data is sorted by user_id. Bloom filters are the answer for this case (next section).

Negation predicates (!=, NOT IN, NOT LIKE). These almost never benefit from zone maps. A chunk with min='IN', max='IN' provably contains no rows where country != 'IN', but a chunk with any range wider than that single value cannot be ruled out — country != 'IN' could match any of the other rows. In practice, negation predicates are usually evaluated as residual filters after the chunk is decoded, with no skipping at all. Why negation is hopeless: zone maps prove a chunk's value range. Negation asks whether the chunk contains anything outside a predicate value, and any non-degenerate range almost certainly does. The set-theoretic asymmetry is fundamental — you cannot cheaply prove non-membership the way you can prove out-of-range.

Pattern predicates (LIKE 'foo%', LIKE '%foo', LIKE '%foo%'). The prefix form LIKE 'foo%' is a range in disguise: name LIKE 'foo%' is equivalent to name >= 'foo' AND name < 'fop', so zone maps can prune it, and prune well when the data is clustered by name. Suffix and infix patterns (%foo, %foo%) collapse to no useful range and need full-text indexes — they cannot be pushed down via zone maps alone. ClickHouse's tokenbf_v1 skipping index handles these via token bloom filters; Parquet has no native equivalent.
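The prefix-to-range rewrite is mechanical enough to show in a few lines. This is an illustrative sketch that assumes plain ASCII values; prefix_to_range and zone_map_can_skip are invented names, not part of any reader.

# like_prefix_range.py: rewrite  name LIKE 'foo%'  into a range for zone-map pruning.
# Illustrative sketch; assumes ASCII string values.

def prefix_to_range(prefix: str):
    """Return (lower, upper) such that  lower <= name < upper  matches LIKE 'prefix%'."""
    trimmed = prefix.rstrip("\x7f")          # drop trailing chars that cannot be incremented
    if not trimmed:
        return prefix, None                  # no finite upper bound exists
    upper = trimmed[:-1] + chr(ord(trimmed[-1]) + 1)
    return prefix, upper

def zone_map_can_skip(col_min: str, col_max: str, prefix: str) -> bool:
    lower, upper = prefix_to_range(prefix)
    # Skip when the chunk's [min, max] lies entirely outside [lower, upper).
    return col_max < lower or (upper is not None and col_min >= upper)

print(prefix_to_range("foo"))                        # ('foo', 'fop')
print(zone_map_can_skip("alpha", "delta", "foo"))    # True: whole chunk sorts before 'foo'
print(zone_map_can_skip("fondue", "fort", "foo"))    # False: range overlaps, must read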

Bloom filters: the equality fallback

For the high-cardinality equality case, both Parquet and ORC support an optional bloom filter per column chunk. A bloom filter is a fixed-size bitmap (say, 8192 bits) plus a small set of hash functions. To insert a value, hash it k times and set the corresponding bits. To test membership, hash and check those k bits — if any is zero, the value is definitely not in the set; if all are one, the value is probably in the set (with a tunable false-positive rate, typically 1%).

Bloom filter: WHERE user_id = 7142893 on a high-cardinality column

Per-row-group bloom filters for user_id (~125 KB each at 10 bits per distinct value):

  RG 0  bits set by user_ids in this chunk: 0010110100110001011010010001…

hash(7142893) with k = 7 hash functions → bits {314, 5917, 7402, …} → check all k bits

  RG 0  bit 5917 = 0    → DEFINITELY NOT present → SKIP (no false negatives)
  RG 1  all k bits = 1  → PROBABLY present (~1% false positive) → READ, then verify
  RG 2  bit 314  = 0    → DEFINITELY NOT present → SKIP
  RG 3  bit 7402 = 0    → DEFINITELY NOT present → SKIP

Bloom filter math (10 bits per value, 7 hash functions):
  size = 10 × num_distinct_values bits
  false_positive_rate ≈ 1%
  for 100K user_ids/chunk: ~125 KB per chunk per column with bloom enabled

No false negatives = bloom filter never wrongly skips a chunk that contains the value
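A toy implementation makes the one-sided error obvious. The sketch below is a from-scratch illustration with invented names, not Parquet's actual split-block bloom filter format.

# toy_bloom.py: illustrative bloom filter, not Parquet's split-block format.
import hashlib
import math

class ToyBloom:
    def __init__(self, expected_values: int, bits_per_value: int = 10):
        self.m = expected_values * bits_per_value             # total bits (~1M here)
        self.k = max(1, round(bits_per_value * math.log(2)))  # optimal hash count (7)
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, value: int):
        for seed in range(self.k):
            h = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.m

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: int) -> bool:
        # False: provably absent, safe to skip the chunk (no false negatives).
        # True:  probably present (~1% of absent values land here by accident).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

bloom = ToyBloom(expected_values=100_000)
for uid in range(100_000):                  # one row group's worth of user_ids
    bloom.add(uid)
print(bloom.might_contain(7_142))           # True: actually present
print(bloom.might_contain(7_142_893))       # almost certainly False: chunk can be skipped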

Two properties make bloom filters perfect for the equality-skip use case:

  1. No false negatives. If the bloom filter says "not present", the value is provably absent and the chunk can be skipped with zero risk of missing rows. This matches the zone map contract — only ever skip provably empty chunks.
  2. Tunable false positives. A 1% false-positive rate means ~99% of non-matching chunks get correctly skipped; the 1% that are wrongly admitted get read and the residual predicate evaluation catches them. The cost is ~10 bits of metadata per distinct value, which on a 100K-row chunk is ~125 KB — small enough to keep in the file footer.

The Parquet specification added bloom filter support in 2018 (PARQUET-41); modern writers (recent PyArrow, Spark, Trino) can write them, usually as an opt-in, per-column setting rather than a blanket default. Why bloom filters are not always on: they cost storage (1–5% of file size on a column with bloom enabled) and CPU at write time. For a low-cardinality column where zone maps are already selective, the bloom filter adds cost without adding much skip — so engines turn them on selectively.
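The "10 bits per value, 7 hash functions" rule of thumb falls out of the standard bloom filter sizing formulas, which the following few lines reproduce (the helper name is mine).

# bloom_sizing.py: where "10 bits per value, 7 hash functions" comes from.
import math

def bloom_size(n: int, p: float):
    """n = distinct values per chunk, p = target false-positive rate."""
    bits_per_value = -math.log(p) / (math.log(2) ** 2)   # m/n, independent of n
    k = round(bits_per_value * math.log(2))              # optimal number of hash functions
    total_kb = bits_per_value * n / 8 / 1024
    return bits_per_value, k, total_kb

bpv, k, kb = bloom_size(n=100_000, p=0.01)
print(f"{bpv:.1f} bits/value, k = {k}, ~{kb:.0f} KB per 100K-value chunk")
# 9.6 bits/value, k = 7, ~117 KB (the chapter rounds this to 10 bits and ~125 KB)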

The chain of skipping

In a real Parquet read on object storage, four nested levels of skipping execute in order, each one cheaper than the next:

Read path with predicate pushdown: query → planner → reader → decoder

1. SQL query → optimizer extracts predicates
   WHERE date BETWEEN '2024-04-01' AND '2024-04-30' AND state = 'KA'
2. File-level pruning (Hive partitions, Iceberg manifests)
   100 GB → 8 GB (skip 11 of 12 monthly partitions, no footer reads needed)
3. Row-group pruning (zone maps + bloom filters in footer)
   8 GB → 1.2 GB (zone map skips by date, bloom filter skips by state='KA')
4. Page pruning (column index / page index, sub-row-group)
   1.2 GB → 0.9 GB (within surviving row groups, skip pages whose stats miss)
5. Column projection
   read only the column chunks for date, state, amount (3 of 50 columns)
6. Decompress + decode + residual filter
   Snappy decompress → dictionary decode → vectorised WHERE → SUM

Net I/O: ~70 MB pulled from S3 to answer a query against a 100 GB table.
Each level uses smaller, more local metadata than the level above; pages are read only after every cheaper filter has run.

The ordering matters. Each level uses metadata smaller than the level below, so the work at the cheap level eliminates candidates that the expensive level would otherwise have to consider:

  1. File-level (Hive partition pruning): discards entire S3 prefixes by listing directory names like year=2024/month=03/. The optimizer never even fetches the footers of skipped files. For a date-partitioned table, this is the single largest skip — typically 90%+ of files.

  2. Row-group level (zone maps + bloom filters): for each surviving file, fetch the footer and walk row groups. Zone maps prune ranges; bloom filters prune equality on high-cardinality columns. Typical skip: 50–90% of row groups in the surviving files.

  3. Page level (column index): Parquet's column index, added in 2019, stores per-page min/max within each column chunk. After surviving the row-group filter, the reader checks page stats and skips individual 1 MB pages. Typical extra skip: 10–30%.

  4. Column projection: only read the column chunks the query references. Independent of predicates — purely about which columns appear in SELECT and WHERE. For a wide 50-column table where the query touches 3 columns, this is a 17× reduction on top of everything above.

  5. Decompress + decode + residual filter: only now do bytes get pulled into memory, decompressed, and dictionary-decoded. The residual predicate (the one that could not be pushed down further) runs as a vectorised filter over the decoded columnar arrays.

By the time the CPU evaluates the actual WHERE clause, the storage layer has already discarded ~99% of the original bytes. The CPU only sees the data it cannot prove uninteresting.

Real Python: pushing predicates with PyArrow

Predicate pushdown is not a thing you have to enable — every modern Parquet reader does it as long as you give it predicates in the right form. The PyArrow API is the cleanest illustration.

# pushdown_demo.py
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
from datetime import date, timedelta
import random

random.seed(42)

# 1. Build an orders table sorted by date — 1M rows.
N = 1_000_000
start = date(2024, 1, 1)
dates  = [start + timedelta(days=i // 5_000) for i in range(N)]   # date-clustered
states = random.choices(["KA", "MH", "TN", "DL", "WB", "GJ", "UP"],
                        weights=[5, 4, 3, 3, 2, 2, 1], k=N)
amounts = [round(random.uniform(50, 9999), 2) for _ in range(N)]
user_ids = [random.randint(1, 5_000_000) for _ in range(N)]       # high-cardinality

table = pa.table({
    "date":     pa.array(dates,    type=pa.date32()),
    "state":    pa.array(states,   type=pa.string()),
    "amount":   pa.array(amounts,  type=pa.float64()),
    "user_id":  pa.array(user_ids, type=pa.int64()),
})

# 2. Write with zone maps (always on) AND bloom filters on user_id.
pq.write_table(
    table, "orders.parquet",
    row_group_size=100_000,                       # 10 row groups
    compression="snappy",
    write_statistics=True,                         # zone maps in footer
    bloom_filter_columns=["user_id"],              # opt-in bloom filter
)

# 3. Inspect the per-row-group statistics WITHOUT reading column data.
pf = pq.ParquetFile("orders.parquet")
for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    date_st  = rg.column(0).statistics
    state_st = rg.column(1).statistics
    print(f"RG {i}: rows={rg.num_rows:>6}  "
          f"date=[{date_st.min}, {date_st.max}]  "
          f"state=[{state_st.min}, {state_st.max}]")

# 4. Predicate pushdown: filters= triggers footer-based row-group skipping.
filtered = pq.read_table(
    "orders.parquet",
    columns=["amount", "state", "date"],          # column projection
    filters=[("date",  ">=", date(2024, 4, 1)),   # zone-map skip
             ("date",  "<=", date(2024, 4, 30)),
             ("state", "=",  "KA")],              # zone-map (low-cardinality)
)
print(f"matched rows: {filtered.num_rows}   "
      f"sum(amount) = {pc.sum(filtered['amount']).as_py():.2f}")

# 5. High-cardinality equality — bloom filter does the work.
one_user = pq.read_table(
    "orders.parquet",
    filters=[("user_id", "=", 7_142_893)],
)
print(f"orders for user 7142893: {one_user.num_rows}")

Two lines in this script are doing almost all the storage work. write_statistics=True (the default) writes zone maps into the footer; bloom_filter_columns=["user_id"] adds a per-row-group bloom filter for the high-cardinality column. The reader picks them up automatically — there is no separate pushdown switch to toggle. Why this is the API surface: PyArrow's contract is that filters= is a hint that the reader is free to either push down or evaluate as a residual; correctness is guaranteed either way. So the same code works on a Parquet file with zero statistics (slow but correct) and one with full zone maps + bloom filters (fast and correct).
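To see what the reader is doing with those statistics, you can replay the row-group pruning by hand. The sketch below assumes the orders.parquet written above and performs in explicit Python what filters= does internally for the date range.

# manual_pruning.py: replay row-group zone-map pruning by hand.
# Assumes orders.parquet from pushdown_demo.py above.
from datetime import date
import pyarrow.parquet as pq

lo, hi = date(2024, 4, 1), date(2024, 4, 30)
pf = pq.ParquetFile("orders.parquet")
date_idx = pf.schema_arrow.get_field_index("date")

survivors = []
for i in range(pf.num_row_groups):
    st = pf.metadata.row_group(i).column(date_idx).statistics   # min/max exposed as dates
    if st.max < lo or st.min > hi:        # zone-map test: provably no matching row
        continue                          # skipped: this row group is never decoded
    survivors.append(i)

print(f"decoding {len(survivors)} of {pf.num_row_groups} row groups")
for i in survivors:
    chunk = pf.read_row_group(i, columns=["date", "amount"])    # column projection
    # the residual filter and the SUM would run here on the decoded arrays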

The same pattern works for every engine that consumes Parquet: Spark's spark.read.parquet(...).filter(...), Trino's WHERE clause, DuckDB's SELECT ... FROM 'orders.parquet' WHERE ..., Polars' pl.scan_parquet(...).filter(...). They all push the predicate down to the same statistics in the same footer.
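For instance, here is a Polars sketch of the same scan (assuming Polars is installed and the orders.parquet written above exists); the plan printed by explain() should show the filter folded into the Parquet scan node rather than applied as a separate step.

# polars_pushdown.py: the same predicates through a different engine (sketch).
from datetime import date
import polars as pl

query = (
    pl.scan_parquet("orders.parquet")                 # lazy: nothing is read yet
      .filter(
          pl.col("date").is_between(date(2024, 4, 1), date(2024, 4, 30))
          & (pl.col("state") == "KA")
      )
      .select(pl.col("amount").sum())
)
print(query.explain())    # the predicate should appear inside the Parquet scan node
print(query.collect())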

Worked example: 100 GB Indian orders scan

The Bengaluru e-commerce daily report

You are a data engineer at a Bengaluru e-commerce company. The fact table orders is 100 GB of Parquet on S3, partitioned Hive-style by year=2024/month=01/, with each partition holding ~8 GB across 60 files of ~130 MB each. Each file has 8 row groups of 16 MB. The data was written sorted by date, then by state. Bloom filters are enabled on user_id and state.

Schema (50 columns abbreviated):

order_id BIGINT, user_id BIGINT, product_id BIGINT,
country VARCHAR, state VARCHAR, city VARCHAR, pin_code VARCHAR,
date DATE, ordered_at TIMESTAMP, delivered_at TIMESTAMP,
amount DECIMAL, discount DECIMAL, tax DECIMAL, /* …38 more… */

The marketing team asks for the April 2024 Karnataka revenue:

SELECT SUM(amount)
FROM orders
WHERE date BETWEEN DATE '2024-04-01' AND DATE '2024-04-30'
  AND state = 'KA';

Step 1: Hive partition pruning (free). Trino reads the S3 prefix listing for year=2024/month=04/ and discards the other 11 months. 100 GB → 8 GB, no footer reads required. This is the cheapest skip possible — no file is even opened.

Step 2: Footer reads. For each of the 60 April files, the engine fetches the last 64 KB in parallel (~30 ms each). It now has zone maps for every row group. Total: ~4 MB of footers, ~150 ms wall-clock.

Step 3: Row-group zone-map filter on date. The April partition contains dates April 1 through April 30. Within April, the data is sorted by date, so the 8 row groups per file each cover ~3.75 days. Since the predicate covers all of April, every row group's date range falls inside it — no skipping from date here (it has already been done by partition pruning). 8 GB unchanged.

Step 4: Bloom filter on state = 'KA'. state is a 28-value column, and even though the data is sorted by date and then state, a row group's [min(state), max(state)] range spans many values wherever the group straddles a day boundary or covers several state runs, so zone maps on state are only weakly selective. The bloom filter gives a definite per-row-group answer. Karnataka is ~18% of orders nationally, and the date-then-state clustering concentrates Karnataka rows into roughly half of the row groups (counting the boundary groups that hold a bit of everything); the other ~50% contain no Karnataka row at all, the bloom filter proves it, and they are skipped. 8 GB → ~4 GB of surviving row groups. Why clustering is doing the real work: a row group holds 100K rows, and with an 18% national share the chance that a randomly shuffled row group contains zero Karnataka orders is (0.82)^100000 ≈ 0; without the sort, every row group would contain at least one Karnataka order and the bloom filter could skip nothing.

Step 5: Column projection. The query references only amount, state, date. The engine issues range GETs for those three column chunks in the surviving row groups. With ~250 surviving row groups × 3 columns × ~120 KB per chunk after Snappy = ~90 MB of column data pulled from S3.

Step 6: Decompress, decode, residual filter, sum. Snappy decompresses the 90 MB into ~225 MB of dictionary-decoded columnar arrays. The vectorised filter state == 'KA' AND date BETWEEN ... AND ... runs in ~80 ms on a single Trino worker, producing ~4.5M matching rows. The aggregation runs in ~20 ms.

Net I/O: ~4 MB footers + ~90 MB columns ≈ 94 MB out of 100 GB. The query runs in ~5–8 seconds on a small Trino cluster.

Layer                        Bytes after this step   Reduction so far
Original table               100 GB
Partition pruning (month)    8 GB                    12.5×
Bloom filter on state        4 GB                    25×
Column projection (3/50)     90 MB (compressed)      ~1100×

For comparison, a row-store CSV scan of the same 100 GB at S3 throughput of ~400 MB/s per worker would take ~250 seconds with no pushdown possible at all — the engine would have to read every byte to evaluate the predicate. Pushdown is ~30× wall-clock faster and ~1000× cheaper in S3 GET-byte cost.

How real systems implement this

The vocabulary changes by vendor but the mechanics are identical.

Snowflake's micro-partitions — each ~16 MB compressed columnar block is a "micro-partition" with auto-maintained per-column min/max, distinct count, and null count. Snowflake calls the skip step partition pruning but it is exactly zone-map filtering on a finer grain than Hive partitions. Their published numbers show typical pruning of 70–95% of micro-partitions on well-clustered tables; the automatic clustering service exists specifically to keep min/max ranges tight as data drifts.

Apache Iceberg's manifest pruning — Iceberg adds a metadata layer above Parquet. Each table version is a snapshot pointing to manifest files, and each manifest entry lists per-data-file statistics (min, max, null count, lower/upper bound for each column). Before opening any data file, the planner reads the manifests and discards files whose stats cannot match. This is a file-level zone map and it is what makes Iceberg's WHERE pushdown work even on object storage with no Hive partitioning.

Delta Lake's data skipping — Delta adds a transaction log (_delta_log/) containing per-file min/max statistics for the first N columns of each file (default 32). The query optimizer reads the log, filters file paths, and only opens the surviving files. Conceptually identical to Iceberg manifests.

ClickHouse's skipping indexes — for each MergeTree part, ClickHouse stores per-granule indexes (default granule = 8192 rows). The supported types are minmax (zone map), set (small explicit value set for low-cardinality), bloom_filter (full-fat bloom), and tokenbf_v1 / ngrambf_v1 (token bloom for LIKE patterns). The granule is much finer than a Parquet row group, which is why ClickHouse can skip aggressively even on partially clustered data.

The convergence is striking. Four independent storage engineering teams, each thinking they were inventing something new, all landed on the same primitive: per-chunk min/max, optional bloom filter, applied as a pre-decode filter using predicates pushed down from the optimizer. There is no other way to make object-store-resident OLAP fast.

When pushdown fails — and what to do

Three failure modes recur in production:

1. Unsorted data. Zone maps require clustering. If a billion rows of state are written in random order across 1000 row groups, every row group will have min(state) = 'AN' and max(state) = 'WB', and WHERE state = 'KA' will skip exactly nothing. The fix is to ORDER BY the columns you filter on at write time, or use Iceberg's sort_order table property, or run a periodic OPTIMIZE ... ZORDER BY (state) in Delta. Snowflake's auto-clustering does this transparently.
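In PyArrow terms, the fix is one sort before the write. A sketch, reusing the table from the demo script and clustering on the column the troublesome predicate filters by (state, then date); the file name is illustrative.

# sort_then_write.py: make zone maps selective by clustering on the filter columns.
# Assumes the `table` built in pushdown_demo.py is in scope.
import pyarrow.parquet as pq

clustered = table.sort_by([("state", "ascending"), ("date", "ascending")])
pq.write_table(clustered, "orders_clustered.parquet", row_group_size=100_000)

pf = pq.ParquetFile("orders_clustered.parquet")
for i in range(pf.num_row_groups):
    st = pf.metadata.row_group(i).column(1).statistics      # column 1 = state
    print(f"RG {i}: state range [{st.min}, {st.max}]")
# Each row group now spans only one or two states, so WHERE state = 'KA'
# can skip every row group whose [min, max] excludes 'KA'.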

2. Wrong stat granularity. If your row groups are too large (1 GB+), even tight min/max ranges become coarse — a single row group might cover a whole month even after sorting, and a query for one day cannot skip it. If row groups are too small (1 MB), the per-chunk metadata overhead dominates and footers become huge. The Parquet sweet spot is 64–256 MB row groups; Snowflake auto-tunes to ~16 MB micro-partitions; ClickHouse's 8192-row granule is finer because it relies on a different read path (mmap-friendly local disk, not S3).

3. Predicates the engine cannot push down. WHERE LOWER(state) = 'ka' defeats statistics — the stats are on state, not on LOWER(state). Ditto WHERE CAST(date AS VARCHAR) = '2024-04-01', WHERE state || country = 'KAIN', and any UDF call. The fix is to write the predicate against the raw column or to materialise the derived column at write time with its own statistics.
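The materialise-at-write-time fix looks like this in PyArrow. A sketch, again reusing the table from the demo script; the derived column name state_lower and the file name are chosen for illustration.

# materialize_derived.py: store the derived value as its own column with its own stats.
# Assumes the `table` built in pushdown_demo.py is in scope.
import pyarrow.compute as pc
import pyarrow.parquet as pq

# WHERE LOWER(state) = 'ka' cannot use the statistics on `state`,
# so materialise the lower-cased value at write time.
with_derived = table.append_column("state_lower", pc.utf8_lower(table["state"]))
pq.write_table(with_derived, "orders_derived.parquet", row_group_size=100_000)

# The pushdown-friendly form of the query filters on the stored column.
ka = pq.read_table("orders_derived.parquet",
                   columns=["amount"],
                   filters=[("state_lower", "=", "ka")])
print(ka.num_rows)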

Production data engineering at this layer is mostly: understand which predicates push down, write the data sorted on the columns you filter by, and verify with EXPLAIN that the planner is actually emitting the pushdown nodes you expect.

Summary

You should now be able to explain the entire OLAP read-path performance story:

  1. Zone maps are tiny per-chunk min/max statistics that let the storage layer prove a chunk cannot contain matching rows for a range predicate, and skip it without decoding.
  2. Bloom filters complement zone maps for high-cardinality equality predicates where min/max is too loose to be selective; they have no false negatives so they are safe for skipping.
  3. The four operator families interact differently with statistics: range (loves zone maps), equality (zone maps for low-cardinality, blooms for high), negation (almost no help), pattern (prefix only, via range encoding).
  4. The read path runs four nested levels of skipping in increasing cost order: file (partitions/manifests), row group (zone maps + blooms), page (column index), then column projection — and only then are bytes decompressed and decoded.
  5. Every modern OLAP system — Parquet/Iceberg/Delta on object stores, Snowflake micro-partitions, ClickHouse skipping indexes — is built on the same primitive. The vocabulary changes; the math does not.

The combined effect on a sorted 100 GB table answering a typical analytical query is ~90 MB of actual reads — a thousand-fold reduction that no amount of CPU speed could match. This is why SELECT SUM(amount) WHERE date BETWEEN ... AND state = 'KA' runs for ₹0.05 on Athena instead of ₹50 on a row-store warehouse, and it is why every Indian data team that ever ran out of Postgres budget converged on this exact architecture.

The next chapter (OLAP join algorithms: broadcast and shuffle) takes the surviving rows that pushdown produced and shows how the engine joins them across distributed workers without re-reading the world.

References

  1. Apache Parquet — File Format Specification (statistics, column index, bloom filter)
  2. Snowflake blog — "Automatic Query Optimization: No Tuning Required" (micro-partition pruning)
  3. Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016
  4. Apache Iceberg — Table Spec, Manifest Files and Metrics
  5. ClickHouse documentation — Data Skipping Indexes (minmax, set, bloom_filter, tokenbf)
  6. Lang et al., "Performance-Optimal Filtering: Bloom Overtakes Cuckoo at High Throughput", VLDB 2019