Parquet end-to-end: what you write, what you get back
Karan opens his laptop in a Bengaluru co-working space, runs the writer from yesterday's chapter, and stares at a 1.8 MB file called shipments.parquet. He knows it holds 2 lakh rows and that the warehouse query he is about to run will read three columns out of six and return in under 100 ms. He does not yet know how the reader will find those three columns without scanning the whole file, why it can skip half of the file's 200 row groups based on a WHERE created_at > ... clause without reading them, or why the bytes at the end of the file matter more than the bytes at the start. This chapter answers all three questions by opening the file, byte by byte, with no library hiding the work.
A Parquet file is read backwards: the reader seeks to the last 8 bytes, finds the footer offset, parses the schema and per-row-group statistics, and only then issues targeted reads for the column chunks the query needs. Row groups are independently readable; column chunks within them carry min/max/null counts that make predicate pushdown possible without decoding a single page. Once you can name the four nested layers — file, row group, column chunk, page — every Build 6 tuning lever (row-group size, page size, encoding, compression) maps to one of them.
The four layers, top to bottom
A Parquet file is a Russian doll with exactly four shells. Naming them in order — outer to inner — is the single most useful mental upgrade a data engineer makes about columnar storage.
The four layers, in human language: the file is what ls sees. Inside the file are row groups — chunks of, say, 1000 rows worth of data, written together and queryable independently. Inside each row group are column chunks — one per column, holding the encoded values for those 1000 rows. Inside each column chunk are pages — typically 8–64 KB each, the smallest unit the reader actually decompresses. Why four layers, not two: the file boundary lets you copy/move/delete one logical dataset; the row-group boundary lets queries skip irrelevant ranges; the column-chunk boundary lets queries read only the columns they need; the page boundary lets the decoder amortise its overhead while keeping memory bounded. Each boundary is a separate trade-off, and Parquet exposes them as four separate tuning knobs because the producer needs them independently.
The footer is not a layer — it is the index. It lives at the end of the file, holds the schema, the per-row-group offsets and lengths, and the per-column-chunk statistics, and is what the reader fetches first. The bytes at the start of the file (PAR1, four bytes of magic) and the bytes at the end of the file (PAR1 again, plus the four bytes immediately before it telling you the footer's length) bracket the whole structure.
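The bracket can be verified without a real Parquet file. The sketch below builds a fake blob with the same tail layout — the footer bytes are a stand-in, not real Thrift — and recovers the footer from the last 8 bytes alone:

```python
import struct

# Fake byte layout with the same bracketing Parquet uses:
# [PAR1][...data...][footer][footer_len: 4 B little-endian][PAR1]
footer = b"{pretend-thrift-footer}"
data = b"\x00" * 64                      # stand-in for row-group bytes
blob = b"PAR1" + data + footer + struct.pack("<I", len(footer)) + b"PAR1"

# A reader recovers the footer from the tail alone:
assert blob[:4] == b"PAR1" and blob[-4:] == b"PAR1"
footer_len = struct.unpack("<I", blob[-8:-4])[0]
recovered = blob[-8 - footer_len:-8]
assert recovered == footer
print(f"footer is {footer_len} bytes, found without touching the data region")
```

Nothing about the data region was read — which is exactly the property the next section exploits.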
Why the reader reads the file backwards
Open a 200 MB Parquet file from S3 with a WHERE created_at > '2026-04-20' clause. A row-store reader would have to scan the entire file. The Parquet reader does this:
- Seek to file size − 8 bytes. Read the last 8 bytes: 4 bytes of footer length (little-endian uint32) and 4 bytes of magic (PAR1).
- Seek to file size − 8 − footer_length. Read the footer. It is a Thrift-encoded blob holding the schema, the row-group index, and the per-chunk statistics. Typical size: a few KB.
- Walk the row-group list, applying the predicate. For each row group, read the created_at chunk's min and max from the footer. If max < '2026-04-20', skip the entire row group — without reading a single byte of its data. This is predicate pushdown via statistics, and it is the single most important read-side optimisation Parquet enables.
- For surviving row groups, fetch only the column chunks the query projects. A query that selects three columns from a forty-column table reads 7.5% of the file's data bytes — minus whatever the row-group skipping already eliminated.
- Within each surviving column chunk, decode pages on demand. The page is the unit of decompression and decoding, and the reader pulls only the pages it needs.
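The row-group skip in step three is a pure function over footer statistics. A minimal sketch, with a hypothetical RowGroupStats record standing in for the footer's per-chunk stats:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:        # hypothetical: one column's footer stats per row group
    min_val: int
    max_val: int

def surviving_row_groups(stats, predicate_min):
    """Keep only row groups whose max could satisfy `col > predicate_min`."""
    return [i for i, s in enumerate(stats) if s.max_val > predicate_min]

# 100 row groups of 1000 sequential values each, like created_at in a sorted file.
stats = [RowGroupStats(i * 1000, i * 1000 + 999) for i in range(100)]
keep = surviving_row_groups(stats, predicate_min=98_500)
print(keep)   # → [98, 99]: only the last two row groups can contain matching rows
```

Note that the decision uses nothing but the footer — the 98 skipped row groups are never read at all.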
Why backwards instead of forwards: Parquet writes the data first, then writes the footer at the end, because the footer must contain the offsets of every row group — and the writer doesn't know those offsets until it has finished writing the row groups. The format had to choose: either write the index first (and seek back to patch it once data is written, which breaks streaming writes) or write the index last (and tell the reader to seek backwards). The second option won. Every major columnar format that lives on object storage — Parquet and ORC alike — writes its index last for the same reason.
This footer-first dance is also why Parquet is friendly to S3 and other object stores. The reader needs two HTTP range requests to get the footer (one for the last 8 bytes, one for the footer body), then issues targeted byte-range reads for exactly the column chunks the query needs. A 200 MB file with a query that hits one column in one row group might pull 80 KB total over the network. The same query against a row-store dump would download all 200 MB. The economic difference at lakehouse scale is the entire reason S3 + Parquet beat HDFS + sequence files commercially.
Open the file by hand
The code below opens a Parquet file with no high-level reader, walks the footer, and prints the four-layer structure. Two-thirds of it is plumbing; the part to internalise is the seek pattern.
# parquet_x_ray.py — read the footer, list every row group, every column chunk.
# Run after creating shipments.parquet from the previous chapter's writer.
import struct

from pyarrow import parquet as pq  # only used for footer parsing.

PATH = "/tmp/shipments.parquet"

with open(PATH, "rb") as f:
    f.seek(0, 2); file_size = f.tell()   # how many bytes total.
    f.seek(file_size - 8)                # last 8 bytes: footer-len + "PAR1".
    footer_len = struct.unpack("<I", f.read(4))[0]
    magic = f.read(4)
    assert magic == b"PAR1", f"not a parquet file (got {magic!r})"
    print(f"file size = {file_size:,} B   footer len = {footer_len:,} B")
    f.seek(file_size - 8 - footer_len)   # the footer body lives here.
    footer_bytes = f.read(footer_len)

# Parse the footer (Thrift-encoded). pyarrow exposes this as ParquetFile.metadata.
md = pq.ParquetFile(PATH).metadata
print(f"schema: {md.schema}")
print(f"num_rows = {md.num_rows:,}   num_row_groups = {md.num_row_groups}")

# Walk the index: for each row group, for each column chunk, print stats.
for rg_i in range(md.num_row_groups):
    rg = md.row_group(rg_i)
    print(f"\n  RG{rg_i:>3}  rows={rg.num_rows:>5}  bytes={rg.total_byte_size:>7,}")
    for c_i in range(rg.num_columns):
        col = rg.column(c_i)
        stats = col.statistics
        mn = stats.min if stats and stats.has_min_max else "·"
        mx = stats.max if stats and stats.has_min_max else "·"
        nulls = stats.null_count if stats else "?"
        print(f"    {col.path_in_schema:<14} "
              f"enc={','.join(str(e) for e in col.encodings):<24} "
              f"compr={col.compression:<6} "
              f"size={col.total_compressed_size:>6,} "
              f"min={mn!s:<12} max={mx!s:<12} nulls={nulls}")
# Sample run on the file written in the previous chapter:
file size = 1,872,400 B   footer len = 14,820 B
schema: <pyarrow._parquet.ParquetSchema object>
required group field_id=-1 schema {
  optional int64 shipment_id;
  optional int32 customer_id;
  optional int32 city_id;
  optional binary status (String);
  optional int32 weight_grams;
  optional int64 created_at;
}
num_rows = 200,000   num_row_groups = 200

  RG  0  rows= 1000  bytes= 14,712
    shipment_id    enc=PLAIN_DICTIONARY,RLE  compr=SNAPPY  size= 3,612  min=9000000     max=9000999     nulls=0
    customer_id    enc=PLAIN,RLE             compr=SNAPPY  size= 3,820  min=1024        max=199877      nulls=0
    city_id        enc=PLAIN_DICTIONARY,RLE  compr=SNAPPY  size=   412  min=1           max=24          nulls=0
    status         enc=PLAIN_DICTIONARY,RLE  compr=SNAPPY  size=   298  min=BOOKED      max=RTO         nulls=0
    weight_grams   enc=DELTA_BINARY_PACKED   compr=SNAPPY  size= 3,140  min=51          max=24987       nulls=0
    created_at     enc=DELTA_BINARY_PACKED   compr=SNAPPY  size= 3,430  min=1745500000  max=1745500999  nulls=0

  RG  1  rows= 1000  bytes= 14,648
    ... six chunks again, fresh stats, fresh dictionary ...
  RG  2  ...
  ...
  RG199  rows= 1000  bytes= 14,701
Five lines do the real work; the rest is pretty-printing.
f.seek(file_size - 8); struct.unpack("<I", f.read(4))[0] is the entry-point trick — read the last 4 bytes (little-endian uint32) to learn the footer's length. This is why a Parquet reader can begin work with one HTTP range request to S3 even before it knows what the file contains.
f.seek(file_size - 8 - footer_len); f.read(footer_len) is the second seek — pull the actual footer. Why this two-step rather than one big read: on S3 the reader learns the file size cheaply (from a HEAD request or the listing), but it cannot know the footer's length until it has read the trailing 8 bytes, and downloading "the last 1 MB" to be safe is wasteful for files where the footer is 14 KB. Two small range requests cost ~80 ms on cold S3; one oversized blind request costs more.
md.row_group(rg_i) walks the index. Notice no data has been read yet — the reader knows row count, byte size, and per-column statistics for every row group, with the actual data still untouched on disk. This is the layer where predicate pushdown lives. On this file, whose created_at values run sequentially at 1000 per row group, a query with WHERE created_at > 1_745_600_000 can prove from the per-RG min/max that the first 100 row groups — half the file — contain nothing matching, and skip them without a single read.
col.statistics.min / col.statistics.max are the entire reason this format is fast for analytical workloads. A row store has no equivalent — it can build B-tree indexes per column, but those indexes are external metadata; here the index is interleaved with the data, written once, free at query time. For high-cardinality predicates (like customer_id IN (41337, 82014, ...)) min/max is too coarse; that is where the optional bloom filter column metadata kicks in, providing per-chunk membership tests in 1–2% of the column's bytes.
col.encodings is the encoder catalogue actually used. PLAIN_DICTIONARY,RLE for status means: build a 5-entry dictionary mapping each status string to a small integer code, then run-length-encode the codes (because consecutive DELIVERED values collapse). DELTA_BINARY_PACKED for created_at means: store the differences between consecutive values in the minimum number of bits required. Why the encoder choice is per-chunk, not per-column: the writer profiles each chunk's data on its way through the buffer. If the first row group has only 5 distinct status values it dictionary-encodes; if a later row group somehow has 50 lakh distinct values (corrupted ingest, schema drift), the writer falls back to plain encoding for that chunk only. The format is adaptive at the chunk granularity, which is what makes it survive real, messy data.
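The dictionary-plus-RLE idea reads well in miniature. This toy encoder shows the two stages — it is not Parquet's bit-packed hybrid layout, just the shape of the transform:

```python
def dict_rle_encode(values):
    """Toy dictionary + run-length encoding: map strings to small integer codes,
    then collapse consecutive runs of the same code into (code, run_length)."""
    dictionary = sorted(set(values))            # e.g. the 5 status strings
    code = {v: i for i, v in enumerate(dictionary)}
    runs = []                                   # (code, run_length) pairs
    for v in values:
        c = code[v]
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return dictionary, runs

statuses = ["DELIVERED"] * 4 + ["RTO"] * 2 + ["DELIVERED"] * 3
dictionary, runs = dict_rle_encode(statuses)
print(dictionary)   # → ['DELIVERED', 'RTO']
print(runs)         # → [(0, 4), (1, 2), (0, 3)]
```

Nine string values collapse to a 2-entry dictionary plus 3 runs — the same effect that shrank the status chunk to 298 bytes in the sample footer.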
The reader's seek path, end to end
The footer-first walk is one half of reading; the data-side seek path is the other. Once the reader has parsed the footer and decided which row groups and which column chunks to read, it issues byte-range reads straight at the bytes it wants. A query like SELECT AVG(weight_grams) WHERE status = 'DELIVERED' against a 200-row-group, 6-column file does roughly this: two tail reads for the footer; a footer walk checking each row group's status min/max (BOOKED–RTO brackets DELIVERED, so stats skip nothing here — a preview of a confusion addressed below); then, per row group, two targeted reads — the status chunk and the weight_grams chunk — decoding pages to find the DELIVERED rows and average their weights. The other four columns are never fetched.
This is the whole win: the reader does work proportional to the columns and row groups it needs, not to the size of the file. Doubling the table by adding more rows costs the reader exactly nothing on a query that hits the same selective predicate; doubling it by adding more columns costs the reader exactly nothing on a query that doesn't project the new columns. That asymmetric cost structure is what "columnar" actually buys.
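Concretely, the data-side plan reduces to a list of (offset, length) byte ranges derived from the footer. The footer dict below is a hypothetical stand-in for the parsed metadata, with made-up offsets:

```python
def plan_reads(footer, projected_columns, keep_row_groups):
    """Turn footer metadata into the exact byte ranges a reader would fetch.
    `footer` is a hypothetical dict: row group -> column -> (offset, length)."""
    return [footer[rg][col]
            for rg in keep_row_groups
            for col in projected_columns]

# Two row groups survive the predicate; the query projects two of six columns.
footer = {
    0: {"status": (4, 298), "weight_grams": (302, 3140)},
    1: {"status": (14716, 301), "weight_grams": (15017, 3122)},
}
ranges = plan_reads(footer, ["status", "weight_grams"], keep_row_groups=[0, 1])
total = sum(length for _, length in ranges)
print(f"{len(ranges)} range reads, {total:,} data bytes out of the whole file")
```

Adding rows grows the footer dict but not this query's plan if the new row groups fail the predicate; adding columns changes nothing unless they are projected — the asymmetry described above, made literal.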
Where the format pushes back on the producer
Three knobs the writer turns are footer-visible and matter a lot in production.
Row-group size. Default Parquet writers use 128 MB or 1 million rows, whichever comes first. Spark uses 128 MB. PyArrow defaults to 1 million rows. Larger row groups compress better (more values per encoding pass) and have fewer per-RG overheads; smaller row groups give finer-grained predicate pushdown and lower memory at write time. Razorpay's settlement pipeline writes at 256 MB row groups for archival storage and 64 MB row groups for fresh-day data, because fresh-day queries are selective and benefit from the finer skipping.
Page size. Default 8 KB to 1 MB depending on writer. Pages are the unit of decompression — every page header has its own type, encoding, and (optionally) statistics. Smaller pages mean more skip granularity within a chunk; larger pages mean better Snappy/Zstd ratios. The page index (a Parquet 2.x addition, unpacked under Going deeper) records per-page min/max statistics, which lets readers skip within a column chunk — a feature that matters most for sorted columns like event_time.
Compression codec. Snappy (fast, ratio ~2–3×), Gzip (slow write, ratio ~3–5×), Zstd (modern default at level 3, fast and ~4× ratio), LZ4 (very fast, ratio ~2×). Flipkart's catalogue Parquet uses Zstd-3 because the read side runs across thousands of cores and CPU is cheaper than the storage delta; an event-firehose pipeline at PhonePe uses Snappy because the writer is the bottleneck and the file lifetime is 7 days.
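Snappy, Zstd and LZ4 are not in the Python standard library, so this sketch uses zlib levels as a stand-in for the same speed-versus-ratio knob, on repetitive Parquet-ish data:

```python
import time
import zlib

# Repetitive payload: long runs of a few distinct status strings.
payload = (b"DELIVERED," * 700 + b"RTO," * 100 + b"BOOKED," * 200) * 50

for level in (1, 9):                 # fast-and-loose vs slow-and-tight
    t0 = time.perf_counter()
    out = zlib.compress(payload, level)
    ms = (time.perf_counter() - t0) * 1000
    print(f"level {level}: {len(payload):,} -> {len(out):,} B in {ms:.1f} ms")
```

The trade is the same one the codec choice makes: pay more writer CPU for fewer stored (and transferred) bytes, and pick based on which side of the pipeline is the bottleneck.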
Encoding selection. Parquet writers pick encodings adaptively per chunk — PLAIN_DICTIONARY until the dictionary exceeds a threshold (default 1 MB), then a fallback to PLAIN. DELTA_BINARY_PACKED for sorted integers and timestamps, DELTA_LENGTH_BYTE_ARRAY for variable-length strings with similar lengths, BYTE_STREAM_SPLIT (Parquet 2.8+) for floats. The writer cannot perfectly predict which encoding will win — the only honest test is to write a sample, read its footer, and compare bytes. Production teams at Zerodha and Razorpay run nightly tooling that samples a fraction of yesterday's files and reports per-column-chunk byte breakdowns, so they catch regressions when an upstream change (e.g. a varchar becomes text) ruins the encoding.
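The DELTA_BINARY_PACKED intuition — store deltas in the minimum number of bits — can be sanity-checked in a few lines. This is a simplification: the real encoding works in blocks and mini-blocks with a per-block minimum delta and zigzag signs.

```python
def delta_bits(values):
    """Bits per value needed to store consecutive deltas, vs. 64 bits raw.
    Simplified: real DELTA_BINARY_PACKED subtracts a per-block minimum delta."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    spread = max(deltas) - min(deltas)
    return max(1, spread.bit_length())

# Timestamps arriving ~1 apart, like created_at in the sample file.
created_at = [1_745_500_000 + i for i in range(1000)]
print(delta_bits(created_at))   # → 1: one bit per value instead of 64
```

A 64× shrink before compression even runs — which is why sorted integer and timestamp columns are the best case for this encoding.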
The producer's responsibility is to tune these knobs against the actual query workload — and the only way to tune them is to read the footer of files in production and watch how the row-group skip rate, the column projection rate, and the bytes-read-per-query metric move. Why the writer cannot pick perfectly without measurement: the optimal row-group size depends on the query selectivity distribution — a knob upstream of the writer that the writer cannot see. A pipeline that writes the same data with the same code at two companies (one running point lookups, one running full scans) wants two different row-group sizes. The format gives you the levers; it cannot give you the right setting.
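A back-of-envelope estimator makes the dependency on selectivity concrete. Every number here is illustrative, not a recommendation:

```python
def bytes_read(file_mb, rg_mb, projected_frac, rg_survival_frac):
    """Estimate data bytes a query reads: surviving row groups x projected columns.
    Illustrative model only — ignores footer, page boundaries, and coalescing."""
    n_rg = max(1, round(file_mb / rg_mb))
    surviving = max(1, round(n_rg * rg_survival_frac))
    return surviving * rg_mb * projected_frac

# Selective query: ~1% of row groups survive, 3 of 40 columns projected.
for rg_mb in (64, 256):
    mb = bytes_read(file_mb=10_240, rg_mb=rg_mb, projected_frac=3 / 40,
                    rg_survival_frac=0.01)
    print(f"{rg_mb} MB row groups -> ~{mb:.0f} MB read")
```

On this model, the 64 MB layout reads roughly half the bytes of the 256 MB one for the selective query — while a full-scan workload (survival 100%) reads the same bytes either way and prefers the bigger groups for their compression. Same writer, two right answers.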
Common confusions
- "Parquet is just CSV with compression." No — CSV is row-oriented and has no per-column encoding, no per-row-group statistics, no footer index, no random-access seek path. A CSV reader must scan the whole file; a Parquet reader fetches a tiny footer and then issues targeted reads. The format difference is not "compressed" — it is "indexed and column-laid-out".
- "Bigger row groups are always better because compression." False above the working-set boundary. A 1 GB row group on a query that selects 50 rows means the reader must decompress at least one full page (8–64 KB) of every projected column to extract those 50 rows, and predicate pushdown can only skip at the row-group boundary — so a 1 GB row group is wasted granularity. Aim for the row-group size where most queries skip 90%+ of row groups by stats. That's typically 64–256 MB.
- "Predicate pushdown works for any column." It only works for columns that have meaningful min/max ordering or where the data happens to cluster naturally. A customer_id column with random ints will have min and max close to the global range in every row group, so the predicate customer_id = 47318 skips zero row groups. This is why partitioning (covered next chapter) and sort order (Z-order, Iceberg sort orders) exist — they make stats actually selective.
- "The footer is just metadata I can ignore." The footer is the index. Without it the file is a heap of bytes — a corrupted footer (truncated upload to S3, killed write process) makes the file unreadable even though the data is intact. Production pipelines fsync the footer, verify the trailing magic bytes, and re-write rather than append on retry, exactly because the footer is load-bearing.
- "Reading 5% of a Parquet file means 5% latency." Closer to true on local NVMe; not true on S3, where each range read has 30–80 ms of overhead. A query that needs 200 small range reads pays 200 × 50 ms = 10 seconds in latency before the first byte is decoded. Production readers (Spark, Trino) coalesce nearby ranges into single requests. This is why small files are an anti-pattern — covered in the small-files chapter.
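The coalescing step can be sketched as: merge byte ranges whose gap costs less than another round trip would. A minimal version of what Spark- and Trino-style readers do before hitting S3 (the max_gap threshold is illustrative):

```python
def coalesce(ranges, max_gap):
    """Merge (offset, length) byte ranges separated by <= max_gap wasted bytes.
    Downloading a small gap is cheaper than paying another request's latency."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, off + length - prev_off)  # extend previous
        else:
            merged.append((off, length))
    return merged

chunks = [(0, 100), (150, 100), (10_000, 100)]   # two near ranges, one far
print(coalesce(chunks, max_gap=512))             # → [(0, 250), (10000, 100)]
```

Three logical reads become two requests; at 50 ms per S3 round trip the 50 wasted bytes in the merged gap are free by comparison.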
Going deeper
Repetition and definition levels — the Dremel idea that Parquet inherited
Parquet stores nested data (lists, maps, nested structs) using two integer streams per column: a repetition level (which level of nesting did this value start a new occurrence at?) and a definition level (how deep into the optional/required hierarchy is this value defined?). The Dremel paper (Melnik et al., VLDB 2010) introduced the encoding so a record.tags[].name field flattens into a single column with two side streams that encode the structure. The benefit is that nested data costs the same on read as flat data — the columnar projection still works, the statistics still work, the encoding still works. The cost is that the writer has to compute these levels for every value, which is non-trivial for deeply nested schemas like Avro-from-Kafka with 5 levels of nesting. Most production teams flatten nested data before Parquet to avoid the complexity.
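A simplified single-level illustration of the shredding — one optional repeated field, like tags[] — shows what the two streams record. The real algorithm generalises to arbitrary nesting depth; this sketch handles only the one-level case:

```python
def shred(records):
    """Flatten a list-of-lists column into (value, repetition, definition) triples.
    Simplified one-level case: rep 0 starts a new record, rep 1 continues a list;
    def 1 means the value is present, def 0 marks an empty/absent list."""
    out = []
    for tags in records:
        if not tags:
            out.append((None, 0, 0))       # the record exists, its list does not
            continue
        for i, v in enumerate(tags):
            out.append((v, 0 if i == 0 else 1, 1))
    return out

print(shred([["a", "b"], [], ["c"]]))
# → [('a', 0, 1), ('b', 1, 1), (None, 0, 0), ('c', 0, 1)]
```

The values land in one flat column; the two small integer streams are all that is needed to rebuild the record boundaries and the empty list on read.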
Bloom filters: when min/max isn't selective enough
For high-cardinality predicates (customer_id IN (47318, 89244, 119003)) min/max statistics are useless because every row group's range covers the global span. Parquet 1.12+ supports per-column-chunk bloom filters stored in the footer. The bloom filter is sized to a target false-positive rate (typically 1%) and adds 8–16 bits per distinct value to the chunk's bytes. The reader checks the filter for membership before fetching the chunk, skipping chunks whose filters say "definitely not present". On a 200 GB Iceberg table at Cred, enabling bloom filters on transaction_id cut point-lookup latency from 6 seconds to 200 ms because the reader could skip 99.7% of files.
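The skip decision a bloom filter enables can be shown with a toy filter. This is not Parquet's split-block bloom filter (which is keyed on xxHash and laid out for SIMD probes); only the "definitely absent" guarantee is the same:

```python
import hashlib

class ToyBloom:
    """Toy bloom filter: no false negatives, rare false positives.
    A reader skips the chunk whenever might_contain() returns False."""
    def __init__(self, bits=4096, hashes=4):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap |= 1 << p

    def might_contain(self, key):
        return all(self.bitmap >> p & 1 for p in self._positions(key))

chunk = ToyBloom()
for txn_id in range(9_000_000, 9_000_100):   # ids present in this column chunk
    chunk.add(txn_id)

print(chunk.might_contain(9_000_050))   # → True: never a false negative, fetch it
print(chunk.might_contain(47_318))      # almost certainly False: skip the chunk
```

The point lookup pays a few KB of filter bytes per chunk to avoid fetching megabytes of data — the trade behind the 6 s-to-200 ms number above.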
How Iceberg and Delta layer on top of Parquet without changing it
Iceberg and Delta Lake do not fork Parquet. They keep using Parquet as the data layer and add a manifest layer above it — a list of Parquet files plus per-file statistics (which is just the rolled-up footer stats from each file) plus a snapshot log that records which files belong to which version of the table. The footer-first read pattern is unchanged; the table format adds a metadata layer that lets the reader decide which Parquet files to open before opening any of them. The Build 12 lakehouse chapters unpack this; for now, recognise that the Parquet footer is the foundation that Iceberg/Delta extend, not replace.
Why some teams write ORC instead of Parquet
ORC is the other major columnar format, originally from Hortonworks, dominant in Hive-era stacks. It has slightly better compression on average (its stripe-based layout matches well with Hive's split policies), supports ACID semantics natively (Hive 3+), and has a more compact statistics block. Parquet won the open-source mindshare race because of Spark's adoption and the broader ecosystem. ORC is still in production at LinkedIn, Yahoo, and several Indian telcos. The footer concept is the same; the byte-level grammar differs.
What "page index" buys you over plain min/max
Parquet 2.0 added a separate page index structure to the footer — per-page (not just per-chunk) min/max/null statistics. With page-level stats, a reader can skip within a column chunk: a 64 MB chunk holding 8000 pages can have 99.9% of pages skipped if the data is sorted on the predicate column. Trino, Spark 3.2+, and DuckDB all read page indexes; older readers ignore them and fall back to chunk-level stats. The cost is footer bloat — each page adds 16–24 bytes of index — so files with 1000+ pages per column chunk see footers grow from a few KB to a few hundred KB. For most workloads the trade is overwhelmingly positive; for pipelines with very small chunks or files written by old toolchains, it's a non-issue.
Where this leads next
The next chapter, /wiki/partitioning-strategies-date-hash-composite, zooms out one level: instead of one Parquet file with 200 row groups, you typically have a directory tree of Parquet files partitioned by date or hash, and the reader applies predicate pushdown at both the partition level and the row-group level. After that, /wiki/small-file-problem-and-the-compaction-job covers what happens when the streaming writer leaves a million 200 KB files instead of a thousand 200 MB files — and how compaction jobs reshape those files back into healthy row groups without breaking concurrent readers.
If you take one mental model from this chapter, take this: the footer is the file's index, and the index is what makes columnar storage cheap to read. Every Build 6 trade-off — partitioning, compaction, Z-order, bloom filters, page size — is a decision about how aggressively you want to populate that index and how much you want to pay at write time to do so.
References
- Apache Parquet file format specification — the on-disk format definition: row groups, column chunks, pages, Thrift footer schema.
- Dremel: Interactive Analysis of Web-Scale Datasets — Melnik et al., VLDB 2010. The paper that introduced the column-storage encoding scheme Parquet adopted, including repetition and definition levels for nested data.
- Parquet column statistics, page index, and bloom filters — official docs covering the index structures the reader uses for predicate pushdown.
- Apache Arrow Parquet C++ implementation — the reference reader/writer; reading the source is the fastest way to internalise the Thrift footer parsing.
- How to choose Parquet row group and page sizes — the Parquet committee's tuning guidance with concrete numbers.
- Iceberg table spec — manifest and snapshot files — shows how the table format layers on top of Parquet without changing it.
- /wiki/row-oriented-vs-columnar-revisited-from-the-writers-side — the previous chapter, the rotation that produced this file.
- /wiki/partitioning-strategies-date-hash-composite — the next chapter, layout one level above the file boundary.