Parquet end-to-end: what you write, what you get back

Karan opens his laptop in a Bengaluru co-working space, runs the writer from yesterday's chapter, and stares at a 1.8 MB file called shipments.parquet. He knows it holds 2 lakh rows and that the warehouse query he is about to run will read three columns out of six and return in under 100 ms. He does not yet know how the reader will find those three columns without scanning the whole file, why it can skip most of the file's 200 row groups based on a WHERE created_at > ... clause without reading them, or why the bytes at the end of the file matter more than the bytes at the start. This chapter answers all three questions by opening the file, byte by byte, with no library hiding the work.

A Parquet file is read backwards: the reader seeks to the last 8 bytes, finds the footer offset, parses the schema and per-row-group statistics, and only then issues targeted reads for the column chunks the query needs. Row groups are independently readable; column chunks within them carry min/max/null counts that make predicate pushdown possible without decoding a single page. Once you can name the four nested layers — file, row group, column chunk, page — every Build 6 tuning lever (row-group size, page size, encoding, compression) maps to one of them.

The four layers, top to bottom

A Parquet file is a Russian doll with exactly four shells. Naming them in order — outer to inner — is the single most useful mental upgrade a data engineer makes about columnar storage.

[Figure: the four nested layers of a Parquet file — the file boundary on the outside, row groups inside it, one column chunk per column inside each row group, and encoded pages inside each chunk. The footer at the end of the file holds the index that points to every row group and column chunk; magic bytes "PAR1" open and close the file, and the 4 bytes before the trailing magic give the footer's length. For shipments.parquet: 200 row groups of 1000 rows each, six column chunks per row group, contiguous on disk, with per-chunk statistics living in the footer.]
A Parquet file is a sequence of independently-readable row groups, each containing one column chunk per column, each chunk made of one or more encoded pages. The footer at the end of the file is the index — without it the file is a heap of bytes; with it the file is queryable.

The four layers, in human language: the file is what ls sees. Inside the file are row groups — chunks of, say, 1000 rows worth of data, written together and queryable independently. Inside each row group are column chunks — one per column, holding the encoded values for those 1000 rows. Inside each column chunk are pages — typically 8–64 KB each, the smallest unit the reader actually decompresses. Why four layers, not two: the file boundary lets you copy/move/delete one logical dataset; the row-group boundary lets queries skip irrelevant ranges; the column-chunk boundary lets queries read only the columns they need; the page boundary lets the decoder amortise its overhead while keeping memory bounded. Each boundary is a separate trade-off, and Parquet exposes them as four separate tuning knobs because the producer needs them independently.
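The nesting is easier to hold in your head as code than as prose. The sketch below models the four layers as plain Python objects — the class names and byte counts are illustrative stand-ins, not the real Thrift structures:

```python
# The four layers as plain Python objects (illustrative names and sizes,
# not the real Thrift metadata structures).
from dataclasses import dataclass

@dataclass
class Page:                 # smallest unit the reader decompresses
    encoded_bytes: int

@dataclass
class ColumnChunk:          # one per column, per row group
    column: str
    pages: list

@dataclass
class RowGroup:             # independently readable slice of rows
    num_rows: int
    chunks: list

@dataclass
class ParquetFileModel:     # what ls sees
    row_groups: list

f = ParquetFileModel(row_groups=[
    RowGroup(num_rows=1000, chunks=[
        ColumnChunk("status",     pages=[Page(298)]),
        ColumnChunk("created_at", pages=[Page(1100), Page(1150), Page(1180)]),
    ]),
])

# Projecting one column touches only that chunk's pages, nothing else:
bytes_read = sum(p.encoded_bytes for p in f.row_groups[0].chunks[1].pages)
print(bytes_read)   # 3430
```

Each boundary in the model corresponds to one tuning knob: file placement, row-group size, column projection, page size.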

The footer is not a layer — it is the index. It lives at the end of the file, holds the schema, the per-row-group offsets and lengths, and the per-column-chunk statistics, and is what the reader fetches first. The bytes at the start of the file (PAR1, four bytes of magic) and the bytes at the end of the file (PAR1 again, plus the four bytes immediately before it telling you the footer's length) bracket the whole structure.
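The bracket structure can be demonstrated with a structurally minimal shell — a real footer is a Thrift blob, so the one below is a stand-in, but the magic bytes and the length field sit exactly where the text describes:

```python
# A structurally minimal Parquet-shaped shell. The data and footer are
# stand-in blobs; only the bracket layout (magic, footer, length, magic)
# mirrors the real format.
import struct

MAGIC  = b"PAR1"
data   = b"column chunk bytes go here"   # stand-in for row-group data
footer = b"thrift-encoded index blob"    # stand-in for the real footer

shell = MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC

# The reader's bracket check: magic at both ends, footer length in the
# 4 bytes just before the trailing magic.
assert shell[:4] == MAGIC and shell[-4:] == MAGIC
footer_len = struct.unpack("<I", shell[-8:-4])[0]
recovered  = shell[-8 - footer_len:-8]
print(footer_len, recovered == footer)
```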

Why the reader reads the file backwards

Open a 200 MB Parquet file from S3 with a WHERE created_at > '2026-04-20' clause. A row-store reader would have to scan the entire file. The Parquet reader does this:

  1. Seek to file size − 8 bytes. Read the last 8 bytes: 4 bytes of footer length (little-endian uint32) and 4 bytes of magic (PAR1).
  2. Seek to file size − 8 − footer_length. Read the footer. It is a Thrift-encoded blob holding the schema, the row-group index, and the per-chunk statistics. Typical size: a few KB.
  3. Walk the row-group list, applying the predicate. For each row group, read the created_at chunk's min and max from the footer. If max < '2026-04-20', skip the entire row group — without reading a single byte of its data. This is predicate pushdown via statistics, and it is the single most important read-side optimisation Parquet enables.
  4. For surviving row groups, fetch only the column chunks the query projects. A query that selects three columns from a forty-column table reads 7.5% of the file's data bytes — minus whatever the row-group skipping already eliminated.
  5. Within each surviving column chunk, decode pages on demand. The page is the unit of decompression and decoding, and the reader pulls only the pages it needs.
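Steps 3 and 4 can be sketched as a pure planning pass over footer statistics. The dicts below are illustrative stand-ins for what pyarrow exposes as metadata.row_group(i).column(j).statistics — no data bytes are touched:

```python
# Steps 3–4 as a sketch: decide which row groups and which column chunks
# to read from footer statistics alone. The stats are illustrative.
row_groups = [
    {"created_at": (1_745_500_000, 1_745_500_999)},   # RG0 min/max
    {"created_at": (1_745_501_000, 1_745_501_999)},   # RG1
    {"created_at": (1_745_502_000, 1_745_502_999)},   # RG2
]
cutoff    = 1_745_501_500          # WHERE created_at > cutoff
projected = ["created_at", "weight_grams"]

# Skip any row group whose max can't possibly satisfy the predicate.
surviving = [i for i, rg in enumerate(row_groups)
             if rg["created_at"][1] > cutoff]

# For survivors, plan reads only for the projected columns' chunks.
reads = [(i, col) for i in surviving for col in projected]
print(surviving)   # [1, 2] — RG0 is ruled out without reading its data
```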

Why backwards instead of forwards: Parquet writes the data first, then writes the footer at the end, because the footer must contain the offsets of every row group — and the writer doesn't know those offsets until it has finished writing the row groups. The format had to choose: either write the index first (and seek back to patch it once data is written, which breaks streaming writes) or write the index last (and tell the reader to seek backwards). The second option won. The major columnar formats that live on object storage — Parquet and ORC alike — write their index last for the same reason.

This footer-first dance is also why Parquet is friendly to S3 and other object stores. The reader needs two HTTP range requests to get the footer (one for the last 8 bytes, one for the footer body), then issues targeted byte-range reads for exactly the column chunks the query needs. A 200 MB file with a query that hits one column in one row group might pull 80 KB total over the network. The same query against a row-store dump would download all 200 MB. The economic difference at lakehouse scale is the entire reason S3 + Parquet beat HDFS + sequence files commercially.
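The two-request dance is easy to simulate. In the sketch below, range_get stands in for an HTTP GET with a Range header — it just slices a local bytes object, but the request pattern and the byte counts are the point:

```python
# Simulating the footer fetch as two byte-range requests against a
# 1 MB "object". range_get stands in for an HTTP Range GET.
import struct

def range_get(obj: bytes, start: int, length: int) -> bytes:
    return obj[start:start + length]

footer = b"x" * 14_000                                  # ~14 KB footer blob
obj = (b"PAR1" + b"d" * 1_000_000 + footer
       + struct.pack("<I", len(footer)) + b"PAR1")

tail = range_get(obj, len(obj) - 8, 8)                  # request 1: last 8 B
footer_len = struct.unpack("<I", tail[:4])[0]
body = range_get(obj, len(obj) - 8 - footer_len, footer_len)  # request 2

fetched = 8 + footer_len
print(f"fetched {fetched:,} of {len(obj):,} bytes")     # ~1.4% of the object
```

Everything after this point is targeted range reads for exactly the chunks the query plans to decode.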

Open the file by hand

The code below opens a Parquet file with no high-level reader, walks the footer, and prints the four-layer structure. Two-thirds of it is plumbing; the part to internalise is the seek pattern.

# parquet_x_ray.py — read the footer, list every row group, every column chunk.
# Run after creating shipments.parquet from the previous chapter's writer.
import struct
from pyarrow import parquet as pq          # only used for footer parsing.

PATH = "/tmp/shipments.parquet"

with open(PATH, "rb") as f:
    f.seek(0, 2); file_size = f.tell()      # how many bytes total.
    f.seek(file_size - 8)                   # last 8 bytes: footer-len + "PAR1".
    footer_len = struct.unpack("<I", f.read(4))[0]
    magic      = f.read(4)
    assert magic == b"PAR1", f"not a parquet file (got {magic!r})"
    print(f"file size = {file_size:,} B   footer len = {footer_len:,} B")

    f.seek(file_size - 8 - footer_len)      # the footer body lives here.
    footer_bytes = f.read(footer_len)       # raw Thrift bytes; parsed below via pyarrow.

# Parse the footer (Thrift-encoded). pyarrow exposes this as ParquetFile.metadata.
md = pq.ParquetFile(PATH).metadata
print(f"schema: {md.schema}")
print(f"num_rows = {md.num_rows:,}   num_row_groups = {md.num_row_groups}")

# Walk the index: for each row group, for each column chunk, print stats.
for rg_i in range(md.num_row_groups):
    rg = md.row_group(rg_i)
    print(f"\n  RG{rg_i:>3}  rows={rg.num_rows:>5}  bytes={rg.total_byte_size:>7,}")
    for c_i in range(rg.num_columns):
        col   = rg.column(c_i)
        stats = col.statistics
        mn    = stats.min if stats and stats.has_min_max else "·"
        mx    = stats.max if stats and stats.has_min_max else "·"
        nulls = stats.null_count if stats else "?"
        print(f"    {col.path_in_schema:<14} "
              f"enc={','.join(str(e) for e in col.encodings):<24} "
              f"compr={col.compression:<6} "
              f"size={col.total_compressed_size:>6} "
              f"min={mn!s:<12} max={mx!s:<12} nulls={nulls}")
# Sample run on the file written in the previous chapter:
file size = 1,872,400 B   footer len = 14,820 B
schema: <pyarrow._parquet.ParquetSchema object>
required group field_id=-1 schema {
  optional int64 shipment_id;
  optional int32 customer_id;
  optional int32 city_id;
  optional binary status (String);
  optional int32 weight_grams;
  optional int64 created_at;
}
num_rows = 200,000   num_row_groups = 200

  RG  0  rows= 1000  bytes= 14,712
    shipment_id    enc=PLAIN_DICTIONARY,RLE   compr=SNAPPY size=  3,612 min=9000000     max=9000999     nulls=0
    customer_id    enc=PLAIN,RLE              compr=SNAPPY size=  3,820 min=1024        max=199877      nulls=0
    city_id        enc=PLAIN_DICTIONARY,RLE   compr=SNAPPY size=    412 min=1           max=24          nulls=0
    status         enc=PLAIN_DICTIONARY,RLE   compr=SNAPPY size=    298 min=BOOKED      max=RTO         nulls=0
    weight_grams   enc=DELTA_BINARY_PACKED    compr=SNAPPY size=  3,140 min=51          max=24987       nulls=0
    created_at     enc=DELTA_BINARY_PACKED    compr=SNAPPY size=  3,430 min=1745500000  max=1745500999  nulls=0

  RG  1  rows= 1000  bytes= 14,648
    ... six chunks again, fresh stats, fresh dictionary, ...
  RG  2  ...
  ...
  RG199  rows= 1000  bytes= 14,701

Five lines do the real work; the rest is pretty-printing.

f.seek(file_size - 8); struct.unpack("<I", f.read(4))[0] is the entry-point trick — read the last 4 bytes (little-endian uint32) to learn the footer's length. This is why a Parquet reader can begin work with one HTTP range request to S3 even before it knows what the file contains.

f.seek(file_size - 8 - footer_len); f.read(footer_len) is the second seek — pull the actual footer. Why this two-step rather than one big read: the reader cannot know the footer's length until it has read the last 8 bytes, and blindly downloading "the last 1 MB" to be safe is wasteful for files where the footer is 14 KB. Two range requests cost ~80 ms on cold S3; one oversized blind request costs more.
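A common middle ground — sketched here under the assumption the reader tolerates a small over-fetch — is to grab a fixed-size tail in one request and fall back to a second request only when the footer turns out to be larger:

```python
# Speculative tail fetch: one range request covers the common case,
# a second fires only for unusually large footers. Sizes illustrative.
import struct

def read_footer(obj: bytes, guess: int = 64 * 1024) -> bytes:
    tail = obj[max(0, len(obj) - guess):]          # one range request
    assert tail[-4:] == b"PAR1", "not a parquet file"
    footer_len = struct.unpack("<I", tail[-8:-4])[0]
    if footer_len + 8 <= len(tail):                # footer already in hand
        return tail[-8 - footer_len:-8]
    start = len(obj) - 8 - footer_len              # rare second request
    return obj[start:start + footer_len]

footer = b"f" * 14_820
obj = b"PAR1" + b"d" * 500_000 + footer + struct.pack("<I", 14_820) + b"PAR1"
print(read_footer(obj) == footer)   # one range request sufficed
```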

md.row_group(rg_i) walks the index. Notice no data has been read yet — the reader knows row count, byte size, and per-column statistics for every row group, with the actual data still untouched on disk. This is the layer where predicate pushdown lives. A query with WHERE created_at > 1_745_500_500 can prove from the per-RG min/max that 50% of the row groups have nothing matching, and skip them without a single read.

col.statistics.min / col.statistics.max are the entire reason this format is fast for analytical workloads. A row store has no equivalent — it can build B-tree indexes per column, but those indexes are external metadata; here the index is interleaved with the data, written once, free at query time. For high-cardinality predicates (like customer_id IN (41337, 82014, ...)) min/max is too coarse; that is where the optional bloom filter column metadata kicks in, providing per-chunk membership tests in 1–2% of the column's bytes.

col.encodings is the encoder catalogue actually used. PLAIN_DICTIONARY,RLE for status means: build a 5-entry dictionary mapping each status string to a small integer code, then run-length-encode the codes (because consecutive DELIVERED values collapse). DELTA_BINARY_PACKED for created_at means: store the differences between consecutive values in the minimum number of bits required. Why the encoder choice is per-chunk, not per-column: the writer profiles each chunk's data on its way through the buffer. If the first row group has only 5 distinct status values it dictionary-encodes; if a later row group somehow has 50 lakh distinct values (corrupted ingest, schema drift), the writer falls back to plain encoding for that chunk only. The format is adaptive at the chunk granularity, which is what makes it survive real, messy data.
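Toy versions of the two encodings make the byte savings concrete. These mimic only the ideas — real Parquet bit-packs dictionary codes and run lengths far more tightly:

```python
# Toy dictionary+RLE and delta encoders, mimicking the ideas behind
# PLAIN_DICTIONARY/RLE and DELTA_BINARY_PACKED (not the real bit layouts).
def dict_rle(values):
    dictionary = sorted(set(values))              # e.g. 5 status strings
    codes = [dictionary.index(v) for v in values] # strings -> small ints
    runs, prev, count = [], codes[0], 1           # run-length-encode codes
    for c in codes[1:]:
        if c == prev:
            count += 1
        else:
            runs.append((prev, count)); prev, count = c, 1
    runs.append((prev, count))
    return dictionary, runs

def delta(values):
    # Store the first value, then only the gaps between neighbours.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

statuses = ["DELIVERED"] * 4 + ["RTO"] + ["DELIVERED"] * 3
print(dict_rle(statuses))   # (['DELIVERED', 'RTO'], [(0, 4), (1, 1), (0, 3)])
print(delta([1_745_500_000, 1_745_500_003, 1_745_500_007]))  # [1745500000, 3, 4]
```

Eight strings collapse to a two-entry dictionary plus three runs; three 8-byte timestamps collapse to one base value plus two tiny gaps.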

The reader's seek path, end to end

The footer-first walk is one half of reading; the data-side seek path is the other. Once the reader has parsed the footer and decided which row groups and which column chunks to read, it issues byte-range reads straight at the bytes it wants. A query like SELECT AVG(weight_grams) WHERE status = 'DELIVERED' against a 200-row-group, 6-column file does roughly this:

[Figure: the reader's seek path for SELECT AVG(weight_grams) WHERE status='DELIVERED'. Step 1 — fetch the footer: 2 range reads (last 8 B for the footer length, then the ~14 KB footer body), ~80 ms on cold S3. Step 2 — apply the predicate via statistics: walk the 200 row-group entries and check each status chunk's min/max. Step 3 — issue parallel range reads for the surviving chunks only: RG0.status (298 B), RG0.weight (3,140 B), RG1.status, RG1.weight, and so on across the row groups — ~688 KB fetched out of the 1,872 KB file (37%), with 4 columns never touched. Step 4 — decode pages on demand, project rows where status == DELIVERED, sum weight; Snappy decompression and RLE/dict/delta decoding happen page by page, never whole-chunk in one go. Result: ~80,000 matching rows summed in 60–90 ms on warm cache.]
The reader fetches the footer, walks the row-group index, then issues targeted byte-range reads only for the column chunks the query needs. Four out of six columns are never read; row groups whose statistics rule out a match are skipped entirely. The page-level decoding stays bounded in memory because each page is 8–64 KB.

This is the whole win: the reader does work proportional to the columns and row groups it needs, not to the size of the file. Doubling the table by adding more rows costs the reader exactly nothing on a query that hits the same selective predicate; doubling it by adding more columns costs the reader exactly nothing on a query that doesn't project the new columns. That asymmetric cost structure is what "columnar" actually buys.

Where the format pushes back on the producer

Three knobs the writer turns are footer-visible and matter a lot in production.

Row-group size. Defaults differ by writer: Spark (via parquet-mr) targets 128 MB per row group; PyArrow defaults to roughly 1 million rows. Larger row groups compress better (more values per encoding pass) and have fewer per-RG overheads; smaller row groups give finer-grained predicate pushdown and lower memory at write time. Razorpay's settlement pipeline writes at 256 MB row groups for archival storage and 64 MB row groups for fresh-day data, because fresh-day queries are selective and benefit from the finer skipping.

Page size. Defaults range from 8 KB to 1 MB depending on the writer. Pages are the unit of decompression — every page header has its own type, encoding, and (optionally) statistics. Smaller pages mean more skip granularity within a chunk; larger pages mean better Snappy/Zstd ratios. Newer revisions of the format add a page index with per-page min/max statistics, which lets readers skip within a column chunk — a feature that matters most for sorted columns like event_time.

Compression codec. Snappy (fast, ratio ~2–3×), Gzip (slow write, ratio ~3–5×), Zstd (modern default at level 3, fast and ~4× ratio), LZ4 (very fast, ratio ~2×). Flipkart's catalogue Parquet uses Zstd-3 because the read side runs across thousands of cores and CPU is cheaper than the storage delta; an event-firehose pipeline at PhonePe uses Snappy because the writer is the bottleneck and the file lifetime is 7 days.

Encoding selection. Parquet writers pick encodings adaptively per chunk — PLAIN_DICTIONARY until the dictionary exceeds a threshold (default 1 MB), then a fallback to PLAIN. DELTA_BINARY_PACKED for sorted integers and timestamps, DELTA_LENGTH_BYTE_ARRAY for variable-length strings with similar lengths, BYTE_STREAM_SPLIT (Parquet 2.8+) for floats. The writer cannot perfectly predict which encoding will win — the only honest test is to write a sample, read its footer, and compare bytes. Production teams at Zerodha and Razorpay run nightly tooling that samples a fraction of yesterday's files and reports per-column-chunk byte breakdowns, so they catch regressions when an upstream change (e.g. a varchar becomes text) ruins the encoding.

The producer's responsibility is to tune these knobs against the actual query workload — and the only way to tune them is to read the footer of files in production and watch how the row-group skip rate, the column projection rate, and the bytes-read-per-query metric move. Why the writer cannot pick perfectly without measurement: the optimal row-group size depends on the query selectivity distribution — a knob upstream of the writer that the writer cannot see. A pipeline that writes the same data with the same code at two companies (one running point lookups, one running full scans) wants two different row-group sizes. The format gives you the levers; it cannot give you the right setting.

Going deeper

Repetition and definition levels — the Dremel idea that Parquet inherited

Parquet stores nested data (lists, maps, nested structs) using two integer streams per column: a repetition level (which level of nesting did this value start a new occurrence at?) and a definition level (how deep into the optional/required hierarchy is this value defined?). The Dremel paper (Melnik et al., VLDB 2010) introduced the encoding so a record.tags[].name field flattens into a single column with two side streams that encode the structure. The benefit is that nested data costs the same on read as flat data — the columnar projection still works, the statistics still work, the encoding still works. The cost is that the writer has to compute these levels for every value, which is non-trivial for deeply nested schemas like Avro-from-Kafka with 5 levels of nesting. Most production teams flatten nested data before Parquet to avoid the complexity.
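The definition-level idea can be shown by hand for the simplest case — a single flat optional column, where the maximum definition level is 1 (value present) and 0 means null. This is a sketch of the concept only; real Parquet generalises the levels across arbitrarily nested paths:

```python
# Definition levels for one flat optional column: 1 = value defined,
# 0 = null. Nulls never enter the value stream; the levels stream
# records where they were.
def def_levels(values):
    return [1 if v is not None else 0 for v in values]

def reassemble(levels, non_null_stream):
    it = iter(non_null_stream)
    return [next(it) if lvl == 1 else None for lvl in levels]

vals   = [410, None, 732, None, None, 88]
levels = def_levels(vals)
stream = [v for v in vals if v is not None]    # only real values are encoded
print(levels, stream)                          # [1, 0, 1, 0, 0, 1] [410, 732, 88]
print(reassemble(levels, stream) == vals)      # True — structure fully recovered
```

The levels stream is small integers with long runs, so it RLE-encodes almost to nothing — which is why nullable columns cost barely more than required ones.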

Bloom filters: when min/max isn't selective enough

For high-cardinality predicates (customer_id IN (47318, 89244, 119003)) min/max statistics are useless because every row group's range covers the global span. Parquet 1.12+ supports per-column-chunk bloom filters, written into the file body and referenced from the footer. The bloom filter is sized to a target false-positive rate (typically 1%) and adds 8–16 bits per distinct value to the chunk's bytes. The reader checks the filter for membership before fetching the chunk, skipping chunks whose filters say "definitely not present". On a 200 GB Iceberg table at Cred, enabling bloom filters on transaction_id cut point-lookup latency from 6 seconds to 200 ms because the reader could skip 99.7% of files.
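A toy bloom filter shows the contract — "definitely not present" or "maybe present". The sizing and hashing below are simplified stand-ins; Parquet's real filters are split-block bloom filters built on xxHash:

```python
# Toy bloom filter with Parquet's contract: might_contain() never returns
# False for a present item, and rarely returns True for an absent one.
# Hashing and sizing are simplified, not Parquet's split-block design.
import hashlib

class Bloom:
    def __init__(self, bits=1 << 16, k=4):
        self.bits, self.k, self.v = bits, k, 0   # v is the bit array as an int

    def _positions(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(h[4*i:4*i+4], "little") % self.bits
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.v |= 1 << p

    def might_contain(self, item):
        return all((self.v >> p) & 1 for p in self._positions(item))

f = Bloom()
for cid in range(41_000, 42_000):       # ids present in this chunk
    f.add(cid)
print(f.might_contain(41_337))          # True  — must fetch the chunk
print(f.might_contain(9))               # almost certainly False — skip it
```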

How Iceberg and Delta layer on top of Parquet without changing it

Iceberg and Delta Lake do not fork Parquet. They keep using Parquet as the data layer and add a manifest layer above it — a list of Parquet files plus per-file statistics (which is just the rolled-up footer stats from each file) plus a snapshot log that records which files belong to which version of the table. The footer-first read pattern is unchanged; the table format adds a metadata layer that lets the reader decide which Parquet files to open before opening any of them. The Build 12 lakehouse chapters unpack this; for now, recognise that the Parquet footer is the foundation that Iceberg/Delta extend, not replace.

Why some teams write ORC instead of Parquet

ORC is the other major columnar format, originally from Hortonworks, dominant in Hive-era stacks. It has slightly better compression on average (its stripe-based layout matches well with Hive's split policies), supports ACID semantics natively (Hive 3+), and has a more compact statistics block. Parquet won the open-source mindshare race because of Spark's adoption and the broader ecosystem. ORC is still in production at LinkedIn, Yahoo, and several Indian telcos. The footer concept is the same; the byte-level grammar differs.

What "page index" buys you over plain min/max

A later revision of the Parquet format added a separate page-index structure to the footer — per-page (not just per-chunk) min/max/null statistics. With page-level stats, a reader can skip within a column chunk: a 64 MB chunk holding 8000 pages can have 99.9% of pages skipped if the data is sorted on the predicate column. Trino, Spark 3.2+, and DuckDB all read page indexes; older readers ignore them and fall back to chunk-level stats. The cost is footer bloat — each page adds 16–24 bytes of index — so files with 1000+ pages per column chunk see footers grow from a few KB to a few hundred KB. For most workloads the trade is overwhelmingly positive; for pipelines with very small chunks or files written by old toolchains, it's a non-issue.
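The "99.9% skipped" arithmetic falls straight out of a sorted column, sketched here with illustrative per-page min/max pairs like those the page index stores:

```python
# Page-level skipping on a sorted column. Each (min, max) pair stands in
# for one page's entry in the page index; 8000 pages, sorted values.
page_minmax = [(i * 1000, i * 1000 + 999) for i in range(8000)]

lo = 7_990_000                     # WHERE event_time >= lo
to_read = [i for i, (mn, mx) in enumerate(page_minmax) if mx >= lo]
print(len(to_read), "of", len(page_minmax), "pages read")   # 10 of 8000
```

With only chunk-level stats, all 8000 pages of that chunk would be decoded; the page index turns the same predicate into ten page reads.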

Where this leads next

The next chapter, /wiki/partitioning-strategies-date-hash-composite, zooms out one level: instead of one Parquet file with 200 row groups, you typically have a directory tree of Parquet files partitioned by date or hash, and the reader applies predicate pushdown at both the partition level and the row-group level. After that, /wiki/small-file-problem-and-the-compaction-job covers what happens when the streaming writer leaves a million 200 KB files instead of a thousand 200 MB files — and how compaction jobs reshape those files back into healthy row groups without breaking concurrent readers.

If you take one mental model from this chapter, take this: the footer is the file's index, and the index is what makes columnar storage cheap to read. Every Build 6 trade-off — partitioning, compaction, Z-order, bloom filters, page size — is a decision about how aggressively you want to populate that index and how much you want to pay at write time to do so.

References