Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
Parquet and ORC store tables column-by-column with a statistics-rich footer at the end, so a reader can fetch the last 64 KB, learn the byte offset and min/max of every column chunk, and skip 80%+ of the file before reading any data. That is predicate pushdown: a WHERE country = 'IN' query on a 100 GB sorted Parquet table touches ~1 GB and finishes in seconds instead of minutes. Pick by ecosystem (Parquet for Spark/Athena/Trino, ORC for Hive-native shops); the read pattern is the same.
In the previous two chapters you saw why row storage loses on analytical scans and why a column-by-column layout wins by 50–200×. That argument was about layout in the abstract. This chapter makes it concrete: it walks through the two file formats — Apache Parquet and Apache ORC — that ship that layout to disk, byte by byte, in the form that every cloud data warehouse and lakehouse actually consumes.
If you have ever run CREATE TABLE foo STORED AS PARQUET in Hive or Spark, dropped a Parquet file on S3 and queried it from Athena, or pip install pyarrow and called read_table(...), you have used these formats. Most engineers use them as black boxes. By the end of this chapter you will know exactly what is inside the box, where the bytes live, and how the reader uses the metadata to read 5% of the file and answer your query.
Why a file format, not just a column store
A column store on disk is almost just "one file per column." That works fine for a single-machine engine like ClickHouse, which controls the storage layout end to end. It breaks down the moment you want to:
- Store a table on cheap object storage (S3, GCS, Azure Blob), where listing and opening files costs API calls and latency
- Have multiple engines (Spark today, Trino tomorrow, DuckDB on your laptop next week) read the same data without re-encoding
- Carry schema with the data so a reader who has never seen the file can still decode it
- Support predicate pushdown so a query that only wants Indian orders does not have to read American ones
- Survive partial corruption (a file that's half-written should be detectable, not silently parsed as garbage)
A flat "one file per column" layout does none of those well. Why one-file-per-column fails on object storage: every column read becomes a separate GET request with its own ~30 ms round-trip latency, and a 50-column table query becomes 50 separate network round-trips before any byte of data arrives — worse than the row-store baseline.
The answer is to package columns into a self-describing file that bundles many columns and many rows together, with metadata at the end describing what is inside. That is Parquet. ORC is the same idea with slightly different choices. Both were born around 2013 in the Hadoop ecosystem and are now the de facto standards for analytical files on object storage.
Parquet, top to bottom
A Parquet file has three pieces, in order: a 4-byte magic header, a sequence of row groups, and a footer that ends with the footer length and another 4-byte magic.
Three things to notice about this layout:
-
The footer is at the end, not the beginning. A Parquet writer builds row groups as it streams data in; it cannot write the footer until it knows how many row groups there are, what their byte offsets ended up being, and what the per-column statistics are. So the footer goes at the end. The reader handles this by seeking to
file_size - 8, reading the 4-byte footer length, then reading the footer itself. Why footer-at-end works on object storage: S3 supportsRange: bytes=requests, so the reader can fetch just the last 64 KB of a 10 GB file in one round-trip, parse the footer, and only then issue further range requests for the column chunks it actually needs. -
Row groups are independently readable. If the table is partitioned across multiple Spark or Trino workers, each worker can be assigned a subset of row groups and read them in parallel without coordinating with the others. The row-group boundary is the unit of parallelism.
-
Column chunks within a row group are independently readable. A query that wants only
priceandcountryreads only those two column chunks per row group, skipping the bytes for the other 48 columns entirely. The footer's per-column-chunk byte offsets make this a single seek.
What lives inside a column chunk
A column chunk is itself broken into pages — typically 1 MB each — for finer-grained reads and for compression boundaries. There are three page types:
- Dictionary page: appears once at the start if the column is dictionary-encoded. Holds the unique values (e.g.
["IN", "US", "UK", "DE", ...]forcountry). - Data page: holds the actual encoded, compressed values for some range of rows in the chunk. A column chunk has many data pages.
- Index page (column index, offset index in newer Parquet): per-page min/max for sub-row-group skipping.
Each page has a small header (Thrift-encoded) that says how many values it holds, what encoding it uses, what compression codec was applied, and the uncompressed and compressed sizes. The reader can therefore stream-decode pages without random access within the chunk.
The default codec is Snappy (fast, ~3× compression); newer pipelines use Zstd (slower, ~5× compression). Parquet also supports gzip, lz4, and uncompressed.
ORC, top to bottom
ORC (Optimized Row Columnar) was born at Hortonworks around 2013 to replace Hive's earlier RCFile format. It is conceptually identical to Parquet but with three differences worth knowing:
The differences from Parquet:
- Stripes have their own footer. Each stripe is a self-contained mini-file with
index data + row data + stripe footer. A reader that has the stripe's byte range can decode it without consulting the file footer. This makes ORC slightly more robust to truncation and slightly easier for streaming readers. - Row-level indexes within stripes. ORC writes a small index every 10,000 rows (the row group in ORC terminology — confusingly, it is not a Parquet row group) with min/max per column. This lets the reader skip not just whole stripes but also 10K-row subranges within a stripe. Parquet recently added an analogous column index / page index feature; for years it was an ORC advantage.
- PostScript at the very end. The file ends with a one-byte length pointing to a PostScript that says how the footer is compressed and how big it is. The reader reads the last byte, then the PostScript, then the footer.
For most workloads the choice of Parquet vs ORC is religious rather than performance-driven. Parquet is the default in the Spark and Arrow ecosystems and is what every Indian data team I have ever worked with uses on AWS S3. ORC is more common in Hive-native shops and at companies that started on Hortonworks (now Cloudera) before Spark took over.
The read path, step by step
The single most important thing to internalise about both formats is the order in which the reader touches bytes. It is not front-to-back. It is footer-first.
For a query SELECT SUM(price) FROM orders WHERE country = 'IN' AND date > '2024-01-01' against a 100 GB Parquet file on S3:
- Open the file, do not read any data.
s3:HeadObjectto get the file size. - Seek to the last 64 KB with a single
s3:GetObjectrange request. This usually contains the entire footer plus the 8 trailer bytes. Why fetch 64 KB blindly: a Thrift-encoded footer for a 100 GB Parquet file with 800 row groups and 50 columns is typically 30–200 KB; fetching 64 KB optimistically usually catches the whole thing in one round-trip; if it does not, the reader fetches a larger range on the second try. - Parse the footer. Now you have the schema, the byte offset of every row group, and per-row-group, per-column statistics including
min(country),max(country),min(date),max(date). - Apply predicate pushdown. Walk the row-group list. For each row group, check: is
'IN'betweenmin(country)andmax(country)? Ismax(date) > '2024-01-01'? If either answer is no, this row group cannot contain any matching row — skip it entirely. - Issue range reads for the column chunks of
price,country, anddatein the surviving row groups. MultipleGETrequests, one per (row group, column). - Decompress and decode the pages, apply the actual filter on the decoded
countrycolumn, sum the survivingpricevalues.
This is the entire read path. Step 4 is where the performance win happens.
Predicate pushdown, drawn
Two important caveats about how well pushdown actually works:
- It depends on data layout. Statistics-based skipping only works if values are clustered. If you write rows in random date order, every row group will have
min='2020-01-01'andmax='2024-12-31'and pushdown will skip exactly nothing. This is why production pipelines sort or partition by the columns they expect to filter on. Why sorting matters so much: a row group is the smallest skip unit. If a row group's[min, max]range overlaps the predicate, the reader must read the whole row group even if only a single matching row lives in it. Sorting tightens the per-row-group ranges and makes pushdown effective. - High-cardinality equality filters need bloom filters. Min/max stats are useless for
WHERE user_id = 12345because almost every row group's[min, max]covers that value. Both Parquet and ORC support optional bloom filters at column-chunk granularity for these cases.
Real Python: write, inspect, push down
pyarrow is the canonical reference Parquet implementation outside the JVM. Here is the whole life cycle of a Parquet file in ~40 lines.
# parquet_demo.py
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
from datetime import date, timedelta
import random
random.seed(42)
# 1. Build a small orders table — 1 million rows, sorted by date.
N = 1_000_000
start = date(2024, 1, 1)
dates = [start + timedelta(days=i // 5_000) for i in range(N)] # ~5K rows per day
countries = random.choices(["IN", "US", "UK", "DE", "JP"],
weights=[5, 3, 2, 2, 1], k=N)
prices = [round(random.uniform(100, 9999), 2) for _ in range(N)]
table = pa.table({
"date": pa.array(dates, type=pa.date32()),
"country": pa.array(countries, type=pa.string()),
"price": pa.array(prices, type=pa.float64()),
"order_id": pa.array(range(N), type=pa.int64()),
})
# 2. Write to Parquet — control the row group size to see pushdown work.
pq.write_table(table, "orders.parquet",
row_group_size=100_000, # 10 row groups for a 1M-row file
compression="snappy")
# 3. Inspect the footer without reading any column data.
pf = pq.ParquetFile("orders.parquet")
print("schema: ", pf.schema_arrow)
print("num_row_groups:", pf.num_row_groups)
for i in range(pf.num_row_groups):
rg = pf.metadata.row_group(i)
date_stats = rg.column(0).statistics # 'date' is column 0
print(f" RG {i}: rows={rg.num_rows:>6} "
f"date.min={date_stats.min} date.max={date_stats.max}")
# 4. Predicate pushdown — pyarrow uses footer stats to skip row groups.
# Filter: country == 'IN' AND date > 2024-04-01
filtered = pq.read_table(
"orders.parquet",
columns=["price", "country", "date"], # column projection
filters=[("country", "=", "IN"), # row-group skipping
("date", ">", date(2024, 4, 1))],
)
print(f"matched rows: {filtered.num_rows} "
f"sum(price) = {pc.sum(filtered['price']).as_py():.2f}")
Two things this script demonstrates that you cannot see in marketing material:
- The footer iteration loop (step 3) reads the file metadata only. You can run it on a 100 GB file in milliseconds — no column data is touched.
- The
filters=argument in step 4 is what triggers row-group skipping. Without it,pyarrowwould read every row group, decode all three columns, then filter in memory — correct but slow. With it,pyarrowconsults the per-row-group statistics in the footer and skips entire chunks.
The same pattern works for any engine that consumes Parquet: Spark's spark.read.parquet(...).filter(...), Trino's WHERE clause, DuckDB's SELECT ... FROM 'orders.parquet' WHERE ... — they all push the predicate down to the same statistics in the same footer.
A worked example: 100 GB Indian e-commerce scan
The BharatBazaar-shaped query
You are an analyst at a Bengaluru e-commerce company. The fact table orders is 100 GB of Parquet on S3, partitioned by year=2024/month=01/... Hive-style, with each partition holding ~8 GB across 60 files of ~130 MB each. Each file has 8 row groups of 16 MB. The data was written sorted by date, then by country.
Schema (50 columns, abbreviated):
order_id BIGINT, customer_id BIGINT, product_id BIGINT,
country VARCHAR, state VARCHAR, city VARCHAR, pin_code VARCHAR,
date DATE, ordered_at TIMESTAMP, delivered_at TIMESTAMP,
price DECIMAL, discount DECIMAL, tax DECIMAL, shipping DECIMAL,
quantity INT, /* …36 more columns… */
The marketing team runs the daily India-revenue query:
SELECT SUM(price)
FROM orders
WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-06-30'
AND country = 'IN';
Step 1: Hive partition pruning (free). Athena/Trino reads the S3 prefix listing for year=2024/ and discards year=2023/, year=2025/. Bytes saved: 66 GB. Files now in scope: ~360 files (Jan–Jun 2024 partitions), ~50 GB total.
Step 2: Parquet footer reads. For each of the 360 files, the engine fetches the last 64 KB (one S3 GET each, ~30 ms each, parallelised — ~1 s total). It now knows the byte offset and per-column statistics of every row group in every file. Bytes read so far: ~25 MB of footers.
Step 3: Row-group skipping by country. The data was written sorted by date then country, so within each row group, country ranges are tight — typically [IN, IN] or [IN, US] since IN is the most common value and clusters at the start. About 60% of row groups have min(country) = max(country) = 'IN' (full hit), 25% straddle (must read), and 15% have min(country) > 'IN' (skip). Skipped row groups: ~430 of 2880. Date stats are all-positive within the Jan–Jun partitions, so no further skip from date.
Step 4: Column projection. Of the 50 columns, the query references only price, country, and date. The engine issues range GETs for those three column chunks in each surviving row group. Each column chunk is roughly 16 MB / 50 = 320 KB raw, ~120 KB after Snappy compression. With ~2450 surviving row groups × 3 columns × 120 KB = ~880 MB of column data.
Step 5: Decode and filter. The engine decompresses ~880 MB of Snappy data into ~2.2 GB of dictionary-decoded columnar arrays, runs a vectorised country == 'IN' filter (the straddling row groups need this; the [IN, IN] chunks pass it trivially), and a vectorised sum on the surviving prices. CPU time: ~3 s on a single Trino worker, ~0.3 s on 10 workers.
Total bytes pulled from S3: ~25 MB footer + ~880 MB columns ≈ 0.9 GB out of 100 GB. The query runs in roughly 8–12 seconds on a small Trino cluster.
Compare to the naive baseline of reading all 100 GB of equivalent row-stored data at S3 throughput of ~400 MB/s per worker: ~250 seconds, plus the CPU cost of parsing every byte of every column. The Parquet reader is ~25× faster wall-clock and ~110× cheaper in S3 GET-byte cost.
This is not a benchmark from a vendor. It is approximately what you will measure if you run the same query in Athena over a real partitioned Parquet table the size of a mid-size Indian D2C company's order history.
Indian context: this is the entire stack
Every Indian analytics company you have heard of — BharatBazaar, BhojanBox, ZaikaApp, KreditClub, PaisaBridge, BharatResell — runs essentially the same offline data stack:
- Storage: AWS S3 (sometimes GCS, rarely Azure), one bucket per environment, partitioned Hive-style by date and tenant
- Files: Parquet, Snappy-compressed, ~128–512 MB per file, written by Spark or Flink jobs running on EMR or self-managed Kubernetes
- Catalog: AWS Glue (a hosted Hive Metastore) or Apache Iceberg metadata
- Query engine: Trino (formerly Presto) or AWS Athena (which is hosted Trino), with Spark for heavier batch jobs and dbt sitting on top for the analytics models
- BI: Looker, Metabase, or Superset reading the Trino/Athena layer
The reason this stack exists in this exact shape is the predicate-pushdown story you just walked through. Each layer adds one more form of skipping:
- Hive partitioning skips whole directories without reading any file
- Iceberg/Glue manifests skip whole files without reading footers
- Parquet footers skip whole row groups without reading column data
- Parquet column projection reads only the columns the query references
- Page indexes / bloom filters skip pages within a column chunk
By the time the bytes actually reach the CPU, the query has already discarded 99% of the table. That is what makes a sub-rupee-per-query Athena bill possible against a multi-terabyte dataset, and it is why every Indian data team that has migrated off Postgres-as-warehouse has converged on the same architecture.
Common pitfalls
Three production failure modes show up repeatedly:
-
Tiny files. Parquet's overhead per file (footer, magic bytes, stats serialisation, S3 GET round-trip) is fixed; on a folder of 10,000 files of 1 MB each, the metadata cost dominates. Aim for ~128–512 MB per file. If you are writing from Spark, use
coalesceorrepartitionto avoid the small-files problem; if you have already written tiny files, run a compaction job. -
Unsorted data. Without sorting on the columns you filter by, row-group statistics become useless.
[min, max]for an unsorted column spans almost the full domain, so pushdown cannot eliminate anything. The fix is toORDER BYyour filter columns when writing, or to use Iceberg'ssort_ordertable property. -
Wrong row-group size. Too small (<10 MB) means too much metadata overhead and weak compression. Too large (>1 GB) means you cannot parallelise reads finely. The default of 128 MB is right for almost everyone; only deviate if you have measured a problem.
Common confusions
-
"Parquet is a compression format, like gzip." It is not. Parquet is a file format — a layout — that organises rows into column chunks with a metadata footer. Compression is one stage of how a column chunk's pages are stored, but it is the layout, statistics and footer-first read pattern that produce the 25× speedups, not the bytes-saved-per-page. A Parquet file with
compression="none"is still dramatically faster than a CSV for analytical queries. -
"Parquet stores data in rows; ORC stores data in columns." Both store data in columns. The confusion comes from ORC's name ("Optimized Row Columnar") and from older blog posts comparing ORC to RCFile. The marketing-grade summary is: Parquet and ORC are columnar twins; pick by ecosystem, not by row-vs-column.
-
"Predicate pushdown means the database engine evaluates the WHERE clause." No — pushdown means the reader skips entire row groups based on min/max statistics in the footer, before any column data is read. The engine still applies the actual filter on the data that survives, because a row group whose
[min, max]overlaps the predicate may still contain non-matching rows. Pushdown is a coarse-grained skip; the fine-grained filter still runs. -
"More row groups means better pushdown." Only up to a point. Smaller row groups give tighter
[min, max]ranges and finer skipping — but each row group has fixed metadata overhead in the footer, weaker per-chunk compression (less data to learn the dictionary from), and more S3 GETs at read time. The sweet spot is roughly 100K–1M rows per row group; below 10K is almost always wrong. -
"Sorting only matters if I use ORDER BY in my query." Sorting at write time is what gives you tight per-row-group
[min, max]and therefore effective pushdown at read time. Your queries do not need anORDER BYto benefit; they just need to filter on the column the table was sorted by when written. If you write randomly and read withWHERE date > '2024-01-01', pushdown will skip nothing because every row group's date range covers everything. -
"Parquet schema evolution means I can rename columns freely." It does not. Parquet matches columns by name in the footer's schema, so renaming a column in the writer breaks readers that expect the old name. Adding nullable columns and reordering them is safe; renaming, type-changing, or removing columns will produce read errors or wrong results. Iceberg sits on top of Parquet partly to solve this — it tracks columns by stable IDs, not names.
Going deeper
If you stop reading here, you have the working model: footer-first reads, row-group skipping by min/max, column projection, sorting matters. The rest of this section connects to the design decisions you will meet in production.
Why the footer is Thrift in Parquet and Protobuf in ORC
Parquet's FileMetaData is encoded in Apache Thrift compact protocol. ORC uses Protocol Buffers. Both are roughly equivalent — variable-length integer encoding, schema-driven, language-neutral — and the choice was historical: Parquet came out of Chirpline and Cloudera, where Thrift was the lingua franca; ORC came out of Hortonworks, which had standardised on Protobuf.
The practical consequence is that to inspect a Parquet footer from the command line you need parquet-tools meta or pyarrow, not protoc. Conversely orc-tools meta works on ORC. Both are about 50–200 KB for a typical 1 GB file, regardless of the encoding choice.
The dictionary page is doing most of the work
A Snappy-compressed Parquet column chunk for a low-cardinality column like country is typically 5–20× smaller than the raw data, but Snappy is not the hero — the dictionary encoding is. The writer scans the column, builds a dictionary {0: "IN", 1: "US", 2: "UK", 3: "DE", 4: "JP"}, and writes one dictionary page followed by data pages of indices into the dictionary. The indices then bit-pack into 3 bits per row (since 5 unique values fit in 3 bits). Snappy on top of that is mostly compressing the bit-packed stream further.
For a column where the dictionary fills up before reaching the page limit (typically ~1 million unique values), Parquet falls back to plain encoding — the raw bytes — and lets the codec do the work. This is why dictionary encoding shines on string columns with low cardinality (country, status, category) and quietly disables itself on high-cardinality string columns (order_id, email).
Page indexes — the new "row-group skipping inside a row group"
Modern Parquet (post-2020, when most engines caught up) writes an optional column index and offset index in the footer. These are per-page summaries: for each data page in each column chunk, the column index stores min, max, null_count, and the offset index stores the byte offset of the page in the file.
This lets a reader skip entire pages within a row group, not just whole row groups. For a 128 MB row group split into 128 pages of 1 MB each, this can mean reading only 5 MB after page-level skipping where you previously had to read the full 128 MB. The catch: page indexes only help if pages are sorted or the data has natural locality. Spark and Trino have written page indexes by default since ~2022; Athena enabled them in 2023.
ORC has had this feature from the start, in the form of its row-group index (every 10K rows) inside each stripe.
Why column families do not exist in Parquet
If you read about Cassandra, HBase or BigTable, you will hear about column families — groups of columns that are co-located on disk and share a compaction policy. Parquet has no such concept. All columns in a row group are just laid out one after another in the same file.
The reason is that Parquet's unit of skipping is the column chunk, and the unit of compaction is the file (handled by the table format above, like Iceberg or Delta Lake). The "co-locate hot columns" argument that motivates column families in OLTP-adjacent stores is replaced in OLAP by the simpler "the reader will skip whatever it does not need." If a query touches country and price out of 50 columns, the reader naturally reads only those two — no co-location needed because the seek-and-skip pattern already does the work.
Why DuckDB and ClickHouse have their own native formats
If Parquet is so good, why does DuckDB store its tables in .duckdb files and ClickHouse in MergeTree files? Because Parquet was designed for exchange — many engines reading the same file — not for the engine's own tightest loop. A native format can:
- Skip the footer indirection (the engine knows its own files were just written, no metadata roundtrip)
- Use engine-specific encodings (ClickHouse's
LowCardinalityandCodecchains) - Optimise for the engine's vectorised execution model (variable batch sizes, native types matching CPU SIMD)
- Avoid the Thrift/Protobuf parsing overhead at the start of every read
The pattern in 2026 is: native format for hot data the engine owns, Parquet for cold data and cross-engine exchange. DuckDB will read your Parquet file blazingly fast, but its CREATE TABLE writes its own format because that is faster still.
What is wrong with Parquet (the next-gen formats)
Parquet has been around for over a decade and shows its age in three places:
- Random access within a column chunk is awkward. Page boundaries are byte offsets, not logical row offsets, so seeking to "row 1,234,567" requires walking pages. Newer formats (Lance, Nimble) make per-row random access a first-class operation for vector and ML workloads.
- No native nested-vector type. Parquet handles
LIST<INT>via Dremel-style repetition/definition levels, which are correct but expensive for the embedding-vector and tensor cases that dominate modern ML feature stores. - Encoding is a small fixed menu. Parquet supports plain, dictionary, RLE, delta, and a few combinations. Newer formats (Vortex from Databricks, ALP from CWI) include adaptive encodings that beat Parquet on real workloads by 1.5–3× while staying cache-friendly for vectorised decode.
None of these are killing Parquet — its inertia is enormous, and "good enough plus universal" beats "better but isolated." But the next decade's interesting columnar work is happening in projects like Lance, Nimble (Meta), and Vortex.
Summary
You should now be able to draw both file formats from memory and explain how a query reads them:
- Parquet = magic header + row groups (each a column-chunk-per-column block) + footer with schema, offsets, and stats + footer length + magic trailer. Reader reads the footer first.
- ORC = same idea with stripes instead of row groups, plus per-stripe footers and per-10K-row indexes for finer skipping.
- Predicate pushdown turns a
WHEREclause into a row-group-skipping decision via per-column min/max in the footer, frequently eliminating 80%+ of the file before any data byte is read. - Column projection reads only the column chunks the query references, frequently eliminating another 80%+ of the surviving bytes.
- The combined effect on a sorted 100 GB table is typically ~1 GB of actual reads, and that is why your Trino-on-S3 query bill stays sane.
Together with the column store argument from chapter 119 and the encoding tricks of chapter 120, this gives you the full storage-layer picture for the columnar OLAP world. The next chapter (vectorised execution) takes the bytes you just learned how to read and processes them with the matching CPU-side trick.
References
- Apache Parquet — File Format Specification
- Apache ORC — File Format Specification (v1)
- Apache Arrow blog — "Reading and Writing Apache Parquet Files at Scale"
- Cloudera Engineering blog — "ORC File: Better Compression, Better Performance"
- Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016
- Mehrotra & Gulati, "Modern Big Data Processing with Apache Spark", O'Reilly