Manifest files and the commit protocol

A query against a Flipkart product-catalogue table on Iceberg never reads the S3 prefix the data lives in. It reads a small JSON file that names a slightly larger Avro file that names the Parquet files. That tree — root pointer, metadata file, manifest list, manifest, data files — is the table. Swapping the root pointer is the transaction. The previous chapter ended with the observation that this design survived eventual consistency because it asks for almost nothing from the storage layer; this chapter is the close-up of how the asking works.

The lakehouse commit protocol is a tree of immutable metadata files plus one atomic compare-and-swap on a root pointer held in a catalog. A writer creates new data files, writes a new manifest naming them, writes a new metadata snapshot, and asks the catalog "swap the table pointer from version N to version N+1, only if it is still N". One CAS decides the commit. Every reader sees either fully N or fully N+1, never a half state.

What the table actually is

When a data engineer in Bengaluru types SELECT count(*) FROM flipkart.orders, Trino does not look at s3://flipkart-warehouse/orders/. It asks the Iceberg catalog (Hive Metastore, AWS Glue, Polaris, Unity, Nessie, REST) for the current metadata location — a single string like s3://flipkart-warehouse/orders/metadata/00042-3a8f.metadata.json. That JSON is the snapshot of the table at version 42. From there, Trino walks down a four-level tree to find which Parquet files to scan.

[Figure: The four-level metadata tree of an Iceberg table. A vertical stack: the catalog row holding the table pointer (flipkart.orders → s3://.../00042-3a8f.metadata.json); below it, metadata.json with schema, partition spec, snapshot list, and current snapshot id; below that, the manifest-list (Avro) with one row per manifest and partition-range stats; below that, the manifests (manifest-A.avro for state=KA, manifest-B.avro for state=MH, manifest-C.avro for state=DL), each naming many Parquet data files (part-0001.parquet … part-0512.parquet), immutable once written.]
Four levels: catalog row → metadata.json → manifest-list → manifests → Parquet files. Every level except the catalog row is an immutable file in object storage. A commit appends a new metadata.json and asks the catalog to swap one string. The Parquet files at the bottom are never rewritten in place — compaction and GDPR delete write new files and a new manifest tree.

The catalog row is the only mutable cell in the entire tree. Every file below it — metadata.json, manifest-list, manifest, Parquet — is written once and never modified. Why immutability matters here: because no file is ever rewritten, no reader ever races a writer for the contents of a single file. A reader who picked up 00042-3a8f.metadata.json at the start of their query will read consistently from the version-42 snapshot for as long as the query takes, even if the writer has moved on to versions 43, 44, 45. Snapshot isolation is free; you do not need MVCC machinery on top of object storage to get it.

The four levels are not redundant. Each one exists to make a particular query cheap. The metadata.json holds schema and snapshot history — small, fits in a single GET. The manifest-list holds per-manifest partition ranges, so a query with a WHERE state = 'KA' predicate can skip every manifest whose range doesn't include KA without ever opening it. Each manifest holds per-Parquet-file column stats (min, max, null count), so the same predicate can skip data files inside the manifests it does open. By the time Trino actually opens a Parquet file, it has often already pruned away 99% of the table with a few small Avro reads.
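A minimal sketch of that two-level pruning, assuming each manifest-list row carries a min/max range per partition column (the field names below are simplified stand-ins, not the real Avro schema):

```python
# prune_sketch.py -- illustrative manifest pruning. Field names like
# "state_min"/"state_max" are simplified stand-ins for the real Avro schema.

def prune_manifests(manifest_list, column, value):
    """Keep only manifests whose [min, max] partition range could hold value."""
    return [m for m in manifest_list
            if m[f"{column}_min"] <= value <= m[f"{column}_max"]]

manifest_list = [
    {"path": "manifest-A.avro", "state_min": "AP", "state_max": "AS"},
    {"path": "manifest-B.avro", "state_min": "KA", "state_max": "KL"},
    {"path": "manifest-C.avro", "state_min": "MH", "state_max": "MP"},
]

# WHERE state = 'KA' discards A and C without ever opening them.
survivors = prune_manifests(manifest_list, "state", "KA")
print([m["path"] for m in survivors])   # ['manifest-B.avro']
```

The same shape repeats one level down: each manifest row carries per-file min/max for every column, and the identical range test drops data files instead of manifests.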

The commit, step by step

A commit is six steps. Five of them are PUT-new-key operations on object storage — strongly consistent on every storage backend, even pre-2020 S3. The sixth is the only operation that has to be atomic, and it lives in the catalog, not in the storage layer.

# manifest_commit.py — a minimal lakehouse commit, end to end.
# Models the Iceberg-style commit protocol: 5 PUT-new-keys + 1 catalog CAS.

import json, time, uuid, threading

class CatalogCASError(Exception): pass

class Catalog:
    """A toy metastore. The only mutable cell in the system."""
    def __init__(self):
        self.tables = {}                # name -> metadata_location
        self.lock = threading.Lock()    # makes CAS atomic locally
    def cas_pointer(self, table, expected, new_location):
        with self.lock:
            current = self.tables.get(table)
            if current != expected:
                raise CatalogCASError(f"expected {expected}, was {current}")
            self.tables[table] = new_location
            return new_location

class ObjectStore:
    """Toy S3. PUT-new-key is always strong; we never overwrite a key."""
    def __init__(self): self.bytes = {}
    def put_new(self, key, data):
        if key in self.bytes: raise Exception(f"refusing to overwrite {key}")
        self.bytes[key] = data
        return f"s3://wh/{key}"

def commit_append(table, new_data_files, catalog, store):
    expected = catalog.tables.get(table)
    # Strip the "s3://wh/" prefix that put_new added, to recover the store key.
    parent_meta = (json.loads(store.bytes[expected[len("s3://wh/"):]])
                   if expected else
                   {"snapshots": [], "current_snapshot_id": None, "schema": {}})

    # Step 1: write data files (already done by writers; we get their paths)
    # Step 2: write a manifest (Avro in real Iceberg; JSON here for clarity)
    manifest_key = f"meta/manifest-{uuid.uuid4().hex[:8]}.json"
    manifest = {"data_files": [{"path": p, "row_count": 1000} for p in new_data_files]}
    manifest_loc = store.put_new(manifest_key, json.dumps(manifest))

    # Step 3: write a manifest-list referencing the new manifest + parent's manifests
    parent_manifests = parent_meta.get("manifest_list", [])
    manifest_list_key = f"meta/manifest-list-{uuid.uuid4().hex[:8]}.json"
    manifest_list = parent_manifests + [{"manifest": manifest_loc, "added": True}]
    list_loc = store.put_new(manifest_list_key, json.dumps(manifest_list))

    # Step 4: write a new metadata.json (snapshot N+1)
    snap_id = time.time_ns()  # nanosecond timestamp, so back-to-back commits get distinct ids
    new_meta = dict(parent_meta)
    new_meta["snapshots"] = parent_meta["snapshots"] + [{"id": snap_id, "list": list_loc}]
    new_meta["current_snapshot_id"] = snap_id
    new_meta["manifest_list"] = manifest_list
    meta_key = f"meta/v{len(new_meta['snapshots']):05d}-{uuid.uuid4().hex[:6]}.metadata.json"
    meta_loc = store.put_new(meta_key, json.dumps(new_meta))

    # Step 5: catalog CAS — the one atomic operation
    catalog.cas_pointer(table, expected, meta_loc)
    return snap_id, meta_loc

# --- demo ---
cat, st = Catalog(), ObjectStore()
sid1, _ = commit_append("orders", ["data/part-001.parquet", "data/part-002.parquet"], cat, st)
sid2, _ = commit_append("orders", ["data/part-003.parquet"], cat, st)
print(f"committed snapshots {sid1} -> {sid2}")
print(f"current pointer: {cat.tables['orders']}")
print(f"object store has {len(st.bytes)} immutable files")

# Sample run (snapshot ids and hex suffixes vary between runs):
# committed snapshots 1745613000123 -> 1745613000456
# current pointer: s3://wh/meta/v00002-7f2c8a.metadata.json
# object store has 6 immutable files

The shape of the commit is everything, and three lines carry the load. put_new refuses to overwrite an existing key, so every file below the catalog row is written exactly once and stays immutable. expected is captured before any metadata file is written, so the final CAS can only succeed if no other writer landed a commit in the meantime. And all the expensive work (data files, manifest, manifest-list, metadata.json) happens before the CAS, so losing the race costs a cheap metadata rebuild, never a re-run of the data writes.

What snapshot isolation feels like in production

The Iceberg snapshot is not just metadata trivia. It is the contract that lets a Razorpay analyst running a 90-second SELECT sum(amount) FROM payments WHERE day = '2026-04-25' against a table ingesting 14,000 events per second rely on a result computed from a single consistent snapshot. A naive design — point at a folder of Parquet files, scan it — would let the writer add a new file partway through the read, splicing partial data into the analyst's sum. With Iceberg, the analyst's query opens 00042-3a8f.metadata.json once at the start, reads the manifest tree from there, and is locked to version 42 for as long as the query runs. The writer ingesting at 14k/sec moves on to versions 43, 44, 45 — all visible to new queries, invisible to the in-flight one.

[Figure: Snapshot isolation across concurrent writers and a long-running reader. A timeline: a reader opens at t=0 holding snapshot 42 and runs for 90 seconds; writer A's CAS lands snapshot 43 at t=20s; writer B's CAS, based on parent=43, lands snapshot 44 at t=55s. Both commits are visible to new queries but invisible to the in-flight reader, which stays on snapshot 42 throughout. Each writer's CAS uses expected = its parent; writers serialise on the catalog pointer, not on the reader.]
The reader's view is fixed at t=0 by reading the catalog pointer once, opening the metadata.json it names, and never consulting the catalog again. Writers race each other through the catalog, not the reader. This is why a 90s analytical scan and 14k-events-per-second ingest can coexist on the same table without coordination.

What happens if writer A and writer B both try to commit snapshot 43? Whoever reaches the catalog first wins the CAS; the other gets CatalogCASError. The loser must re-read the current pointer (now 43, written by the winner), rebuild a manifest tree whose parent is 43, and retry the CAS to land snapshot 44. This is optimistic concurrency control, and it works because the contended window is tiny: writers spend their compute time on the data files (the expensive part) before trying the cheap CAS at the very end.
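The loser's path can be sketched as a small retry loop. This is illustrative, not Iceberg's actual API: the catalog, the CAS, and build_metadata (standing in for the cheap manifest-list and metadata.json rewrite) are toy stand-ins.

```python
# retry_sketch.py -- the optimistic-retry shape for a losing writer.
# Catalog and CAS are minimal stand-ins; build_metadata models the cheap
# rebuild (manifest-list + metadata.json), not the expensive data writes.

class CASError(Exception): pass

class Catalog:
    def __init__(self): self.pointer = "v1"
    def cas(self, expected, new):
        if self.pointer != expected:
            raise CASError(f"expected {expected}, was {self.pointer}")
        self.pointer = new

def commit_with_retry(catalog, build_metadata, max_attempts=4):
    for _ in range(max_attempts):
        parent = catalog.pointer             # re-read the freshest pointer
        new_meta = build_metadata(parent)    # rebuild against that parent
        try:
            catalog.cas(parent, new_meta)    # the one atomic step
            return new_meta
        except CASError:
            continue                         # lost the race: rebuild, retry
    raise CASError(f"gave up after {max_attempts} attempts")

cat = Catalog()
cat.cas("v1", "v2")   # a rival writer lands its commit first
result = commit_with_retry(cat, lambda parent: parent + "+mine")
print(result)         # v2+mine: the retry rebuilt on top of the rival's commit
```

The data files written before the loop are never touched inside it; only the metadata rebuild repeats, which is why a lost race is cheap.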

In practice at scale, retry rates stay below 1% on append-heavy tables because appends are commutative: writer A's append and writer B's append can both land if the second one rebuilds its manifest list to include A's new manifest. Why this rebuilding is cheap: the manifest list is a small Avro file (often a few KB) and writing the new metadata.json is a single PUT. The Parquet data files written before the CAS are not invalidated — those files are still good, and the rebuilt manifest still points at them. Optimistic retry costs one Avro write and one JSON write per loser, never the cost of re-running the actual ETL.

Conflicts that do abort the retry are non-commutative ones: a delete and an append targeting the same row, two compactions selecting overlapping file sets, two schema evolutions disagreeing on column type. Iceberg detects these by re-checking constraints after each retry — if a delete still applies to a row that a concurrent commit removed, fail loudly rather than silently re-apply. The retry budget is finite (default 4 attempts) precisely because non-commutative conflicts will repeat until something larger breaks them apart.
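A sketch of what that re-check looks like for a delete that lost a race, heavily simplified from Iceberg's actual conflict validation:

```python
# conflict_check_sketch.py -- illustrative non-commutative conflict detection,
# heavily simplified. A delete that lost the CAS race re-checks its target
# files against the winner's snapshot before retrying; if a target vanished,
# it aborts loudly instead of silently re-applying.

def delete_can_retry(planned_delete_files, new_parent_files):
    """True only if every file the delete targets still exists in the parent."""
    return set(planned_delete_files) <= set(new_parent_files)

# The winning commit compacted part-001.parquet away.
parent_after_winner = {"part-002.parquet", "part-003.parquet"}

print(delete_can_retry(["part-002.parquet"], parent_after_winner))   # True
print(delete_can_retry(["part-001.parquet"], parent_after_winner))   # False
```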

What gets thorny: cross-table commits and the GC problem

The protocol is clean for single-table operations. Two thorny patterns force the design to reach beyond it.

Cross-table atomicity does not exist in plain Iceberg. If a Swiggy data-engineering job updates restaurants and menus and needs both updates visible together, the natural commit shape is two CASes against two catalog rows — and the system can crash between them. Three solutions are in production: (1) use a catalog that supports multi-row transactions (Snowflake's catalog, Databricks Unity Catalog, Polaris in some configurations); (2) merge the two updates into a single metadata.json with a custom branch (Iceberg branches, since v1.4); (3) accept the partial-failure mode and design a reconciliation pass. Most teams pick (3) and don't tell their executives.

Garbage collection is the other corner. Every commit leaves the old metadata.json and old manifest files on S3 forever, by default. Every overwrite of a Parquet row writes a new Parquet file and leaves the old one behind. Without GC, a table that ingests a TB a day grows by a TB even if rows are net-zero. Iceberg ships an expire_snapshots operation that removes snapshots older than a configurable threshold from the metadata.json, but the actual Parquet files referenced only by expired snapshots still need a separate remove_orphan_files pass — which has to run with a strict time gate, because a query that started before GC ran could still hold a reference to a file you are about to delete. Why this is an open production wound: setting the time gate too short causes "snapshot not found" errors mid-query; too long means you keep paying S3 for petabytes of data nobody can read. Most large Iceberg deployments at Indian companies run GC with a 7-day retention by default, costing roughly 15–20% extra storage compared to optimal — about ₹4 lakh per month on a 100 TB warehouse — as the price of safety. Cleaning that up is what production engineers spend years on.
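The time gate can be sketched in a few lines. Everything here is illustrative: assume we can list all files, the set any live snapshot references, and each file's creation time, all of which simplify the real remove_orphan_files procedure.

```python
# gc_sketch.py -- illustrative orphan sweep with a time gate. Assumes we can
# list every file, the set referenced by live snapshots, and creation times;
# all three are simplifications of the real remove_orphan_files procedure.
import time

def orphans_safe_to_delete(all_files, referenced, created_at, now, gate_seconds):
    """Files unreferenced by any live snapshot AND older than the gate.

    The gate protects in-flight queries: a query that resolved an old
    metadata.json before GC ran may still need a file that no current
    snapshot references."""
    return [f for f in all_files
            if f not in referenced and now - created_at[f] > gate_seconds]

now = time.time()
all_files = ["part-001.parquet", "part-002.parquet", "part-003.parquet"]
referenced = {"part-003.parquet"}                   # held by live snapshots
created_at = {"part-001.parquet": now - 8 * 86400,  # 8 days old, orphaned
              "part-002.parquet": now - 3600,       # 1 hour old, orphaned
              "part-003.parquet": now - 86400}

doomed = orphans_safe_to_delete(all_files, referenced, created_at, now, 7 * 86400)
print(doomed)   # ['part-001.parquet'] -- part-002 is orphaned but too young
```

Both failure modes from the prose fall out of the gate parameter: shrink it and part-002 gets deleted under a running query; grow it and part-001 keeps costing S3 money.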

Read paths: how a query walks the tree

The writer's view is the commit protocol. The reader's view is its mirror — five reads, in the same shape as the writer's five PUTs, but all from cold cache the first time and all parallelisable from the manifest level down. Walking the tree on a real query at Cred is informative: a 12 TB transactions table partitioned by state and month answers SELECT sum(amount) FROM transactions WHERE state = 'KA' AND month = '2026-04' by reading roughly 8 KB of metadata before any Parquet file is opened.

The query planner — Trino, Spark, Snowflake, DuckDB, whichever — does this:

  1. Catalog lookup. Single GET on the catalog row: transactions → s3://cred-warehouse/transactions/metadata/00872-1f4c.metadata.json. Sub-millisecond on a hot catalog.
  2. Read the metadata.json. One S3 GET, ~3 KB, returns schema and the current_snapshot_id plus its manifest_list location.
  3. Read the manifest-list. One S3 GET, often <2 KB, returns one row per manifest with partition-range stats {"state_min": "AP", "state_max": "AS"}, {"state_min": "KA", "state_max": "KL"}, etc. The planner discards every manifest whose range can't include state = 'KA' — typically 80% of them.
  4. Read the surviving manifests in parallel. Each manifest, ~50 KB, lists data files with per-file column min/max. A second predicate-prune drops files whose month_min..month_max doesn't include 2026-04.
  5. Open only the surviving Parquet files. Often 30–50 files instead of 50,000.
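The five steps above, minus the actual Parquet reads, fit in a toy planner. Every structure is a JSON stand-in for the real catalog row and Avro files.

```python
# read_path_sketch.py -- the reader's walk down the tree, mirroring the
# writer's PUTs. All structures are JSON stand-ins for the real formats.

catalog = {"transactions": "meta/v00872.json"}       # step 1: catalog row

store = {                                            # toy object storage
    "meta/v00872.json": {"snapshot": 872,            # step 2: metadata.json
                         "manifest_list": "meta/ml-872.json"},
    "meta/ml-872.json": [                            # step 3: manifest-list
        {"manifest": "meta/m-A.json", "state_min": "AP", "state_max": "AS"},
        {"manifest": "meta/m-B.json", "state_min": "KA", "state_max": "KL"},
    ],
    "meta/m-A.json": [{"path": "part-001.parquet", "state": "AP"}],
    "meta/m-B.json": [{"path": "part-044.parquet", "state": "KA"},
                      {"path": "part-045.parquet", "state": "KL"}],
}

def plan_scan(table, state):
    meta = store[catalog[table]]                     # steps 1-2: two GETs
    survivors = [m for m in store[meta["manifest_list"]]
                 if m["state_min"] <= state <= m["state_max"]]  # step 3 prune
    return [f["path"]                                # step 4: per-file prune
            for m in survivors
            for f in store[m["manifest"]]
            if f["state"] == state]                  # step 5 opens only these

print(plan_scan("transactions", "KA"))   # ['part-044.parquet']
```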

Why metadata pruning matters more than the Parquet read itself: a cold-S3 Parquet GET costs roughly 100ms to first byte, plus the columnar projection work, so 50,000 sequential Parquet GETs is 5,000 seconds of wall clock. The manifest-list-and-manifest pruning costs ~100ms total and removes 99% of those Parquet GETs. The bulk of a lakehouse query's speed comes from never opening files, not from how fast you read them.

This is also why a badly-partitioned table feels slow even on small data — the manifest stats only help if your WHERE predicate matches the partition keys. A query that filters on a non-partition column (amount > 50000) sees min/max stats per file but those bounds are wide on a randomly-distributed column, so few files prune. Z-ordering and clustering — covered later in Build 12 — are the techniques that make non-partition predicates prunable too.

A subtle consequence: a Trino query holding a stale metadata pointer continues to read consistently. If the catalog has moved on to snapshot 873 while a 30-second query is still pinned to 872, the data files referenced by 872's manifest tree are still on S3, untouched. That property is what lets a long analytics scan and a fast streaming ingest share one table without coordination — the cost is paid in deferred storage cleanup, which the GC discussion above already covered.
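A toy demonstration of that pinning, using the same put-new-key discipline as the commit code: the reader resolves the catalog pointer once and never again.

```python
# stale_reader_sketch.py -- a pinned reader survives later commits, because
# old metadata files are never rewritten; only gated GC ever removes them.

store, catalog = {}, {}

def commit(table, version, files):
    key = f"meta/v{version}.json"
    store[key] = {"version": version, "files": files}  # PUT-new-key, immutable
    catalog[table] = key                               # the one mutable cell

commit("orders", 42, ["part-001.parquet"])
pinned = catalog["orders"]   # the reader resolves the pointer once, at t=0

# Ingest keeps committing while the long query runs.
commit("orders", 43, ["part-001.parquet", "part-002.parquet"])
commit("orders", 44, ["part-001.parquet", "part-002.parquet", "part-003.parquet"])

print(store[pinned]["version"])            # 42: the in-flight query's view
print(store[catalog["orders"]]["version"]) # 44: what new queries see
```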

Common confusions

Going deeper

Iceberg v1 vs v2: where the protocol grew teeth

The v1 spec had no row-level deletes: removing a row meant rewriting the data files that contained it. v2 added delete files, both positional ("row 4 of part-001.parquet is deleted") and equality ("rows where order_id = 7821 are deleted across the table"), to support row-level streaming MERGE without rewriting whole files. The commit protocol shape did not change: a delete write is still PUTs of new files (the delete files), a new manifest, a new metadata.json, a CAS. What changed was the manifest schema (delete-file references) and the reader logic (apply deletes during scan). Reading the v1-vs-v2 diff is the cleanest way to internalise that the protocol is the durable abstraction; the file formats inside it evolve.

Delta's transaction log vs Iceberg's metadata.json

Delta Lake takes a different shape: instead of a single metadata.json per snapshot, every commit appends a JSON file to a _delta_log/ directory — 00000000000000000042.json, 00000000000000000043.json. A reader reconstructs the current state by reading the log in order. Periodically, Delta writes a checkpoint (Parquet file summarising the log up to version N) so readers don't have to replay 100,000 commits. The two shapes — Iceberg's snapshot-tree and Delta's log-with-checkpoints — are formally equivalent but operationally different. Delta's append-and-checkpoint is closer to a write-ahead-log database; Iceberg's snapshot-tree is closer to a copy-on-write filesystem. Choosing between them in 2026 is mostly about which engine your team already runs (Spark/Databricks pulls toward Delta; Trino/Snowflake pulls toward Iceberg).
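A sketch of Delta-style reconstruction, with a made-up action schema; the real _delta_log actions (add, remove, metaData, protocol) and checkpoint format are richer.

```python
# delta_replay_sketch.py -- reconstructing table state from a Delta-style log.
# The action schema and checkpoint layout here are illustrative only.

log = {
    10: {"checkpoint": {"files": {"a.parquet", "b.parquet"}}},  # summary to v10
    11: {"add": "c.parquet"},
    12: {"remove": "a.parquet"},
    13: {"add": "d.parquet"},
}

def reconstruct(log):
    """Start from the newest checkpoint, replay only the commits after it."""
    cp = max(v for v, action in log.items() if "checkpoint" in action)
    files = set(log[cp]["checkpoint"]["files"])
    for v in sorted(v for v in log if v > cp):
        if "add" in log[v]:    files.add(log[v]["add"])
        if "remove" in log[v]: files.discard(log[v]["remove"])
    return files

print(sorted(reconstruct(log)))   # ['b.parquet', 'c.parquet', 'd.parquet']
```

The checkpoint is what keeps replay O(commits since checkpoint) instead of O(all commits), the log-structured counterpart of Iceberg writing a fresh metadata.json every commit.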

Why the catalog-CAS design beat client-coordinated approaches

Early lakehouse designs tried client-coordinated commits: each writer reads the current pointer, writes its files, then asks all other writers via a coordination service (Zookeeper, etcd) for permission to commit. The catalog-CAS design — Netflix's contribution to Iceberg in 2018 — won because (a) the catalog already exists, (b) client-coordinated approaches scale poorly past a few dozen writers, and (c) CAS is the smallest-possible coordination primitive. Reading Ryan Blue's 2018 Iceberg design notes alongside the same year's Delta paper from Databricks shows two teams independently converging on "single atomic CAS on a metadata pointer" as the durable answer.

The one-writer-per-table myth that survived from 2017

A persistent rumour, often repeated, says "Iceberg only supports one writer per table". This was true of Hive ACID in 2017, never of Iceberg. The optimistic-CAS design supports arbitrarily many concurrent writers; the limit is contention, not protocol. A Zerodha team running 40 concurrent Spark jobs against the same tick_archive table sees retry rates of 2–4% during peak ingest hours, and zero data correctness issues across a year of operation. The myth lingers because naive writers (every job a full table-scan-and-rewrite) do conflict; partition-aware writers and append-only writers almost never do.

Branches and tags: Git semantics for tables

Iceberg v1.4 (late 2023) added named branches and tags as first-class catalog entities. A branch is just an alternate pointer alongside main (staging, audit-2026-q1, experimental-rebucket), each pointing at its own metadata.json. A tag is an immutable named pointer to a specific snapshot. The protocol shape doesn't change: a commit to a branch is still five PUTs and a CAS, just on a different catalog row. What this enables operationally is meaningful — a team at PhonePe can fork a staging branch off production, run a destructive backfill experiment, validate it, and either merge back or throw the whole branch away. The pattern is direct from Git, and the implementation cost was small precisely because the underlying commit protocol had nothing branch-specific in it.
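A branch fits the toy catalog from earlier with one change: the ref key becomes (table, branch) instead of the table name alone. A sketch, not the real ref layout — Iceberg keeps branch and tag refs inside the table metadata, and some catalogs (Nessie) model them natively.

```python
# branch_sketch.py -- a branch as one more ref row. Toy model only; the CAS
# shape is unchanged regardless of where the refs actually live.

refs = {("orders", "main"): "meta/v42.json"}

def cas_ref(table, branch, expected, new_location):
    """Same compare-and-swap as before, keyed by (table, branch)."""
    key = (table, branch)
    if refs.get(key) != expected:
        raise RuntimeError(f"expected {expected}, was {refs.get(key)}")
    refs[key] = new_location

# Fork staging off main, land a risky backfill there; main never moves.
refs[("orders", "staging")] = refs[("orders", "main")]
cas_ref("orders", "staging", "meta/v42.json", "meta/v43-experiment.json")

print(refs[("orders", "main")])      # meta/v42.json -- production untouched
print(refs[("orders", "staging")])   # meta/v43-experiment.json
```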

Why 5 PUTs and not 1

A natural question: why not put schema, manifests, file list, and snapshot history all in one big metadata.json and commit a single PUT? The answer is read amplification. A commit-time saving of four PUTs would force every reader to parse a metadata file that grows linearly with table history — gigabytes after a year of hourly commits. Splitting metadata across the four-level tree means each level can be cached independently: metadata.json is small and re-read on every query, manifest-list is per-snapshot and cached during query planning, manifests are larger and parallel-read, Parquet stats are inline. The cost paid at write time (five PUTs ~= 200ms aggregate) buys per-query latency that scales with the active manifest count, not with total table history. Optimising the writer at the cost of the reader is the classic wrong tradeoff for analytical workloads.

Where this leads next

After this chapter the lakehouse build is into the writer-side guts — compaction, copy-on-write vs merge-on-read, Z-ordering — all of which are operations whose commit shape is exactly what this chapter laid down. Once you can read the commit code, the rest of Build 12 is mostly variations on the same five-PUTs-and-a-CAS theme.

References