In short
Parquet and ORC are open binary file formats that store tables column-by-column on disk, with statistics-rich metadata at the end of every file so that a reader can decide what to skip before reading any actual data. A Parquet file is a sequence of row groups (typically 128 MB to 1 GB each) where every row group holds one column chunk per column; the file ends with a footer that lists the schema, the byte offset of every column chunk, and per-chunk min/max/null-count statistics. ORC has the same shape with different names — stripes instead of row groups, plus a per-stripe row-level index for finer-grained skipping.
The reader-side magic is predicate pushdown: for SELECT SUM(price) WHERE country = 'IN' AND date > '2024-01-01', the engine reads the footer (a few KB at the end of the file), walks the row-group list, checks each row group's country and date statistics, and discards any row group where 'IN' falls outside the [min(country), max(country)] range or where max(date) <= '2024-01-01'. On a 100 GB orders table where Indian orders cluster in a few row groups, the reader ends up touching ~5 GB instead of 100 GB. Twelve seconds instead of four minutes — and you wrote no extra code.
This is the format every cloud query engine on earth speaks. AWS Athena, Trino, Presto, Spark, Dremio, DuckDB and ClickHouse all read Parquet natively; Hive, Trino, Spark and Snowflake (via Iceberg) read ORC. Flipkart's data lake on S3, every Indian analytics startup with dbt and Athena, and most of the FAANG-scale offline pipelines you have ever heard of are physically just folders full of Parquet files plus a metadata layer. This chapter walks the bytes end to end so you understand exactly what those files contain and why they are so fast.
In the previous two chapters you saw why row storage loses on analytical scans and why a column-by-column layout wins by 50–200×. That argument was about layout in the abstract. This chapter makes it concrete: it walks through the two file formats — Apache Parquet and Apache ORC — that ship that layout to disk, byte by byte, in the form that every cloud data warehouse and lakehouse actually consumes.
If you have ever run CREATE TABLE foo STORED AS PARQUET in Hive or Spark, dropped a Parquet file on S3 and queried it from Athena, or run pip install pyarrow and called read_table(...), you have used these formats. Most engineers use them as black boxes. By the end of this chapter you will know exactly what is inside the box, where the bytes live, and how the reader uses the metadata to read 5% of the file and answer your query.
Why a file format, not just a column store
A column store on disk is almost just "one file per column." That works fine for a single-machine engine like ClickHouse, which controls the storage layout end to end. It breaks down the moment you want to:
- Store a table on cheap object storage (S3, GCS, Azure Blob), where listing and opening files costs API calls and latency
- Have multiple engines (Spark today, Trino tomorrow, DuckDB on your laptop next week) read the same data without re-encoding
- Carry schema with the data so a reader who has never seen the file can still decode it
- Support predicate pushdown so a query that only wants Indian orders does not have to read American ones
- Survive partial corruption (a file that's half-written should be detectable, not silently parsed as garbage)
A flat "one file per column" layout does none of those well. Why one-file-per-column fails on object storage: every column read becomes a separate GET request with its own ~30 ms round-trip latency, and a 50-column table query becomes 50 separate network round-trips before any byte of data arrives — worse than the row-store baseline.
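The latency arithmetic behind that claim can be made explicit. This is a back-of-envelope sketch; the ~30 ms round-trip figure is the text's assumption, not a measurement:

```python
# Back-of-envelope: why one-file-per-column hurts on object storage.
# Assumed numbers from the text: ~30 ms per S3 GET round-trip, 50 columns.
GET_LATENCY_MS = 30
COLUMNS = 50

# One file per column: every column is its own object, so a full-width
# scan pays one round-trip per column before any data byte arrives.
per_column_files_ms = COLUMNS * GET_LATENCY_MS   # serial worst case

# Packaged format (Parquet): one footer fetch, then range reads that can
# all be issued concurrently against the same object.
packaged_ms = GET_LATENCY_MS + GET_LATENCY_MS    # footer, then batched ranges

print(per_column_files_ms)  # 1500 ms of pure latency before any data
print(packaged_ms)          # 60 ms
```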
The answer is to package columns into a self-describing file that bundles many columns and many rows together, with metadata at the end describing what is inside. That is Parquet. ORC is the same idea with slightly different choices. Both were born around 2013 in the Hadoop ecosystem and are now the de facto standards for analytical files on object storage.
Parquet, top to bottom
A Parquet file has three pieces, in order: a 4-byte magic header, a sequence of row groups, and a footer that ends with the footer length and another 4-byte magic.
Three things to notice about this layout:
- The footer is at the end, not the beginning. A Parquet writer builds row groups as it streams data in; it cannot write the footer until it knows how many row groups there are, what their byte offsets ended up being, and what the per-column statistics are. So the footer goes at the end. The reader handles this by seeking to `file_size - 8`, reading the 4-byte footer length, then reading the footer itself. Why footer-at-end works on object storage: S3 supports `Range: bytes=` requests, so the reader can fetch just the last 64 KB of a 10 GB file in one round-trip, parse the footer, and only then issue further range requests for the column chunks it actually needs.
- Row groups are independently readable. If the table is partitioned across multiple Spark or Trino workers, each worker can be assigned a subset of row groups and read them in parallel without coordinating with the others. The row-group boundary is the unit of parallelism.
- Column chunks within a row group are independently readable. A query that wants only `price` and `country` reads only those two column chunks per row group, skipping the bytes for the other 48 columns entirely. The footer's per-column-chunk byte offsets make this a single seek.
What lives inside a column chunk
A column chunk is itself broken into pages — typically 1 MB each — for finer-grained reads and for compression boundaries. There are three page types:
- Dictionary page: appears once at the start if the column is dictionary-encoded. Holds the unique values (e.g. `["IN", "US", "UK", "DE", ...]` for `country`).
- Data page: holds the actual encoded, compressed values for some range of rows in the chunk. A column chunk has many data pages.
- Index page (column index, offset index in newer Parquet): per-page min/max for sub-row-group skipping.
Each page has a small header (Thrift-encoded) that says how many values it holds, what encoding it uses, what compression codec was applied, and the uncompressed and compressed sizes. The reader can therefore stream-decode pages without random access within the chunk.
The default codec is Snappy (fast, ~3× compression); newer pipelines use Zstd (slower, ~5× compression). Parquet also supports gzip, lz4, and uncompressed.
ORC, top to bottom
ORC (Optimized Row Columnar) was born at Hortonworks around 2013 to replace Hive's earlier RCFile format. It is conceptually identical to Parquet but with three differences worth knowing:
- Stripes have their own footer. Each stripe is a self-contained mini-file with index data + row data + stripe footer. A reader that has the stripe's byte range can decode it without consulting the file footer. This makes ORC slightly more robust to truncation and slightly easier for streaming readers.
- Row-level indexes within stripes. ORC writes a small index every 10,000 rows (the "row group" in ORC terminology — confusingly, it is not a Parquet row group) with min/max per column. This lets the reader skip not just whole stripes but also 10K-row subranges within a stripe. Parquet recently added an analogous column index / page index feature; for years it was an ORC advantage.
- PostScript at the very end. The file ends with a one-byte length pointing to a PostScript that says how the footer is compressed and how big it is. The reader reads the last byte, then the PostScript, then the footer.
For most workloads the choice of Parquet vs ORC is religious rather than performance-driven. Parquet is the default in the Spark and Arrow ecosystems and is what every Indian data team I have ever worked with uses on AWS S3. ORC is more common in Hive-native shops and at companies that started on Hortonworks (now Cloudera) before Spark took over.
The read path, step by step
The single most important thing to internalise about both formats is the order in which the reader touches bytes. It is not front-to-back. It is footer-first.
For a query SELECT SUM(price) FROM orders WHERE country = 'IN' AND date > '2024-01-01' against a 100 GB Parquet file on S3:
1. Open the file; do not read any data. `s3:HeadObject` to get the file size.
2. Seek to the last 64 KB with a single `s3:GetObject` range request. This usually contains the entire footer plus the 8 trailer bytes. Why fetch 64 KB blindly: a Thrift-encoded footer for a 100 GB Parquet file with 800 row groups and 50 columns is typically 30–200 KB; fetching 64 KB optimistically often catches the whole thing in one round-trip; if it does not, the reader fetches a larger range on the second try.
3. Parse the footer. Now you have the schema, the byte offset of every row group, and per-row-group, per-column statistics including `min(country)`, `max(country)`, `min(date)`, `max(date)`.
4. Apply predicate pushdown. Walk the row-group list. For each row group, check: is `'IN'` between `min(country)` and `max(country)`? Is `max(date) > '2024-01-01'`? If either answer is no, this row group cannot contain any matching row — skip it entirely.
5. Issue range reads for the column chunks of `price`, `country`, and `date` in the surviving row groups. Multiple `GET` requests, one per (row group, column).
6. Decompress and decode the pages, apply the actual filter on the decoded `country` column, sum the surviving `price` values.
This is the entire read path. Step 4 is where the performance win happens.
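The step-4 decision can be sketched as a pure function over footer statistics. This is a hypothetical helper, not any engine's API; note that ISO-format date strings compare correctly as plain strings:

```python
# Can this row group be skipped, given its footer statistics?
# stats is a hypothetical dict of per-column min/max values.
def can_skip(stats, country_eq, date_gt):
    # Equality predicate: 'IN' must fall inside [min, max] to possibly match.
    if country_eq < stats["country_min"] or country_eq > stats["country_max"]:
        return True
    # Range predicate date > X: if even the max is <= X, nothing can match.
    if stats["date_max"] <= date_gt:
        return True
    return False

rg = {"country_min": "JP", "country_max": "US",
      "date_min": "2024-01-01", "date_max": "2024-06-30"}
print(can_skip(rg, "IN", "2024-01-01"))  # True: 'IN' < min(country), skip it
```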
Predicate pushdown, drawn
Two important caveats about how well pushdown actually works:
- It depends on data layout. Statistics-based skipping only works if values are clustered. If you write rows in random date order, every row group will have `min='2020-01-01'` and `max='2024-12-31'` and pushdown will skip exactly nothing. This is why production pipelines sort or partition by the columns they expect to filter on. Why sorting matters so much: a row group is the smallest skip unit. If a row group's `[min, max]` range overlaps the predicate, the reader must read the whole row group even if only a single matching row lives in it. Sorting tightens the per-row-group ranges and makes pushdown effective.
- High-cardinality equality filters need bloom filters. Min/max stats are useless for `WHERE user_id = 12345` because almost every row group's `[min, max]` covers that value. Both Parquet and ORC support optional bloom filters at column-chunk granularity for these cases.
Real Python: write, inspect, push down
pyarrow is the canonical reference Parquet implementation outside the JVM. Here is the whole life cycle of a Parquet file in ~40 lines.
```python
# parquet_demo.py
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
from datetime import date, timedelta
import random

random.seed(42)

# 1. Build a small orders table — 1 million rows, sorted by date.
N = 1_000_000
start = date(2024, 1, 1)
dates = [start + timedelta(days=i // 5_000) for i in range(N)]  # ~5K rows per day
countries = random.choices(["IN", "US", "UK", "DE", "JP"],
                           weights=[5, 3, 2, 2, 1], k=N)
prices = [round(random.uniform(100, 9999), 2) for _ in range(N)]

table = pa.table({
    "date": pa.array(dates, type=pa.date32()),
    "country": pa.array(countries, type=pa.string()),
    "price": pa.array(prices, type=pa.float64()),
    "order_id": pa.array(range(N), type=pa.int64()),
})

# 2. Write to Parquet — control the row group size to see pushdown work.
pq.write_table(table, "orders.parquet",
               row_group_size=100_000,   # 10 row groups for a 1M-row file
               compression="snappy")

# 3. Inspect the footer without reading any column data.
pf = pq.ParquetFile("orders.parquet")
print("schema: ", pf.schema_arrow)
print("num_row_groups:", pf.num_row_groups)
for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    date_stats = rg.column(0).statistics   # 'date' is column 0
    print(f"  RG {i}: rows={rg.num_rows:>6} "
          f"date.min={date_stats.min} date.max={date_stats.max}")

# 4. Predicate pushdown — pyarrow uses footer stats to skip row groups.
#    Filter: country == 'IN' AND date > 2024-04-01
filtered = pq.read_table(
    "orders.parquet",
    columns=["price", "country", "date"],   # column projection
    filters=[("country", "=", "IN"),        # row-group skipping
             ("date", ">", date(2024, 4, 1))],
)
print(f"matched rows: {filtered.num_rows} "
      f"sum(price) = {pc.sum(filtered['price']).as_py():.2f}")
```
Two things this script demonstrates that you cannot see in marketing material:
- The footer iteration loop (step 3) reads the file metadata only. You can run it on a 100 GB file in milliseconds — no column data is touched.
- The `filters=` argument in step 4 is what triggers row-group skipping. Without it, `pyarrow` would read every row group, decode all three columns, then filter in memory — correct but slow. With it, `pyarrow` consults the per-row-group statistics in the footer and skips entire chunks.
The same pattern works for any engine that consumes Parquet: Spark's spark.read.parquet(...).filter(...), Trino's WHERE clause, DuckDB's SELECT ... FROM 'orders.parquet' WHERE ... — they all push the predicate down to the same statistics in the same footer.
A worked example: 100 GB Indian e-commerce scan
The Flipkart-shaped query
You are an analyst at a Bengaluru e-commerce company. The fact table orders is 100 GB of Parquet on S3, partitioned by year=2024/month=01/... Hive-style, with each partition holding ~8 GB across 60 files of ~130 MB each. Each file has 8 row groups of 16 MB. The data was written sorted by date, then by country.
Schema (50 columns, abbreviated):
```sql
order_id    BIGINT, customer_id BIGINT, product_id   BIGINT,
country     VARCHAR, state      VARCHAR, city        VARCHAR, pin_code VARCHAR,
date        DATE,   ordered_at  TIMESTAMP, delivered_at TIMESTAMP,
price       DECIMAL, discount   DECIMAL, tax         DECIMAL, shipping DECIMAL,
quantity    INT,    /* …36 more columns… */
```
The marketing team runs the daily India-revenue query:
```sql
SELECT SUM(price)
FROM orders
WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-06-30'
  AND country = 'IN';
```
Step 1: Hive partition pruning (free). Athena/Trino reads the S3 prefix listing, keeps the year=2024/month=01/ through year=2024/month=06/ prefixes, and discards every other partition without opening a single file. Bytes saved: ~50 GB. Files now in scope: ~360 files (Jan–Jun 2024, 6 partitions × 60 files), ~50 GB of the original 100 GB.
Step 2: Parquet footer reads. For each of the 360 files, the engine fetches the last 64 KB (one S3 GET each, ~30 ms each, parallelised — ~1 s total). It now knows the byte offset and per-column statistics of every row group in every file. Bytes read so far: ~25 MB of footers.
Step 3: Row-group skipping by country. The data was written sorted by date then country, so within each row group, country ranges are tight — typically [IN, IN] or [IN, US], since IN is the most common value and clusters at the start of each day's rows. About 60% of row groups have min(country) = max(country) = 'IN' (full hit), 25% straddle the boundary (must read), and 15% have min(country) > 'IN' (skip). Skipped row groups: ~430 of 2880. Every surviving row group's date range already falls inside Jan–Jun, so the date predicate skips nothing further.
Step 4: Column projection. Of the 50 columns, the query references only price, country, and date. The engine issues range GETs for those three column chunks in each surviving row group. Each column chunk is roughly 16 MB / 50 = 320 KB raw, ~120 KB after Snappy compression. With ~2450 surviving row groups × 3 columns × 120 KB = ~880 MB of column data.
Step 5: Decode and filter. The engine decompresses ~880 MB of Snappy data into ~2.2 GB of dictionary-decoded columnar arrays, runs a vectorised country == 'IN' filter (the straddling row groups need this; the [IN, IN] chunks pass it trivially), and a vectorised sum on the surviving prices. CPU time: ~3 s on a single Trino worker, ~0.3 s on 10 workers.
Total bytes pulled from S3: ~25 MB footer + ~880 MB columns ≈ 0.9 GB out of 100 GB. The query runs in roughly 8–12 seconds on a small Trino cluster.
Compare to the naive baseline of reading all 100 GB of equivalent row-stored data at S3 throughput of ~400 MB/s per worker: ~250 seconds, plus the CPU cost of parsing every byte of every column. The Parquet reader is ~25× faster wall-clock and ~110× cheaper in S3 GET-byte cost.
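The back-of-envelope totals above can be reproduced directly. All the inputs below are the worked example's assumed figures, not measurements:

```python
# Reproduce the worked example's arithmetic (assumed figures from the text).
TOTAL_GB = 100
surviving_row_groups = 2_880 - 430   # after country-based skipping
chunk_kb_compressed = 120            # ~320 KB raw, ~2.7x Snappy compression
columns_read = 3                     # price, country, date

column_bytes_mb = surviving_row_groups * columns_read * chunk_kb_compressed / 1024
footer_mb = 25
total_read_gb = (column_bytes_mb + footer_mb) / 1024

print(f"{column_bytes_mb:.0f} MB of column data")   # ~860 MB
print(f"{total_read_gb:.2f} GB read of {TOTAL_GB} GB "
      f"({TOTAL_GB / total_read_gb:.0f}x fewer bytes)")
```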
This is not a benchmark from a vendor. It is approximately what you will measure if you run the same query in Athena over a real partitioned Parquet table the size of a mid-size Indian D2C company's order history.
Indian context: this is the entire stack
Every Indian analytics company you have heard of — Flipkart, Swiggy, Zomato, Cred, Razorpay, Meesho — runs essentially the same offline data stack:
- Storage: AWS S3 (sometimes GCS, rarely Azure), one bucket per environment, partitioned Hive-style by date and tenant
- Files: Parquet, Snappy-compressed, ~128–512 MB per file, written by Spark or Flink jobs running on EMR or self-managed Kubernetes
- Catalog: AWS Glue (a hosted Hive Metastore) or Apache Iceberg metadata
- Query engine: Trino (formerly Presto) or AWS Athena (which is hosted Trino), with Spark for heavier batch jobs and dbt sitting on top for the analytics models
- BI: Looker, Metabase, or Superset reading the Trino/Athena layer
The reason this stack exists in this exact shape is the predicate-pushdown story you just walked through. Each layer adds one more form of skipping:
- Hive partitioning skips whole directories without reading any file
- Iceberg/Glue manifests skip whole files without reading footers
- Parquet footers skip whole row groups without reading column data
- Parquet column projection reads only the columns the query references
- Page indexes / bloom filters skip pages within a column chunk
By the time the bytes actually reach the CPU, the query has already discarded 99% of the table. That is what makes a sub-rupee-per-query Athena bill possible against a multi-terabyte dataset, and it is why every Indian data team that has migrated off Postgres-as-warehouse has converged on the same architecture.
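The first layer of that stack is worth seeing in miniature: Hive-style partition pruning is nothing more than filtering object keys by the key-value pairs embedded in their paths. The object keys and helper below are hypothetical, for illustration only:

```python
# Hive-style partition pruning is plain key filtering -- no file is opened.
# Hypothetical object keys for the orders table:
keys = [
    "orders/year=2023/month=12/part-000.parquet",
    "orders/year=2024/month=01/part-000.parquet",
    "orders/year=2024/month=06/part-000.parquet",
    "orders/year=2024/month=07/part-000.parquet",
]

def parse_partition(key):
    # Pull year=.../month=... segments out of the path.
    parts = dict(seg.split("=") for seg in key.split("/") if "=" in seg)
    return int(parts["year"]), int(parts["month"])

# Predicate: date BETWEEN 2024-01-01 AND 2024-06-30
in_scope = [k for k in keys
            if (2024, 1) <= parse_partition(k) <= (2024, 6)]
print(in_scope)  # only the Jan and Jun 2024 files survive
```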
Common pitfalls
Three production failure modes show up repeatedly:
- Tiny files. Parquet's overhead per file (footer, magic bytes, stats serialisation, S3 GET round-trip) is fixed; on a folder of 10,000 files of 1 MB each, the metadata cost dominates. Aim for ~128–512 MB per file. If you are writing from Spark, use `coalesce` or `repartition` to avoid the small-files problem; if you have already written tiny files, run a compaction job.
- Unsorted data. Without sorting on the columns you filter by, row-group statistics become useless. `[min, max]` for an unsorted column spans almost the full domain, so pushdown cannot eliminate anything. The fix is to `ORDER BY` your filter columns when writing, or to use Iceberg's `sort_order` table property.
- Wrong row-group size. Too small (<10 MB) means too much metadata overhead and weak compression. Too large (>1 GB) means you cannot parallelise reads finely. The default of 128 MB is right for almost everyone; only deviate if you have measured a problem.
Summary
You should now be able to draw both file formats from memory and explain how a query reads them:
- Parquet = magic header + row groups (each a column-chunk-per-column block) + footer with schema, offsets, and stats + footer length + magic trailer. Reader reads the footer first.
- ORC = same idea with stripes instead of row groups, plus per-stripe footers and per-10K-row indexes for finer skipping.
- Predicate pushdown turns a `WHERE` clause into a row-group-skipping decision via per-column min/max in the footer, frequently eliminating 80%+ of the file before any data byte is read.
- Column projection reads only the column chunks the query references, frequently eliminating another 80%+ of the surviving bytes.
- The combined effect on a sorted 100 GB table is typically ~1 GB of actual reads, and that is why your Trino-on-S3 query bill stays sane.
Together with the column store argument from chapter 119 and the encoding tricks of chapter 120, this gives you the full storage-layer picture for the columnar OLAP world. The next chapter (vectorised execution) takes the bytes you just learned how to read and processes them with the matching CPU-side trick.
References
- Apache Parquet — File Format Specification
- Apache ORC — File Format Specification (v1)
- Apache Arrow blog — "Reading and Writing Apache Parquet Files at Scale"
- Cloudera Engineering blog — "ORC File: Better Compression, Better Performance"
- Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016
- Mehrotra & Gulati, "Modern Big Data Processing with Apache Spark", O'Reilly