Manifest files and the commit protocol
A query against a Flipkart product-catalogue table on Iceberg never reads the S3 prefix the data lives in. It reads a small JSON file that names a slightly larger Avro file that names the Parquet files. That tree — root pointer, metadata file, manifest list, manifest, data files — is the table. Swapping the root pointer is the transaction. The previous chapter ended with the observation that this design survived eventual consistency because it asks for almost nothing from the storage layer; this chapter is the close-up of how the asking works.
The lakehouse commit protocol is a tree of immutable metadata files plus one atomic compare-and-swap on a root pointer held in a catalog. A writer creates new data files, writes a new manifest naming them, writes a new metadata snapshot, and asks the catalog "swap the table pointer from version N to version N+1, only if it is still N". One CAS decides the commit. Every reader sees either fully N or fully N+1, never a half state.
What the table actually is
When a data engineer in Bengaluru types SELECT count(*) FROM flipkart.orders, Trino does not look at s3://flipkart-warehouse/orders/. It asks the Iceberg catalog (Hive Metastore, AWS Glue, Polaris, Unity, Nessie, REST) for the current metadata location — a single string like s3://flipkart-warehouse/orders/metadata/00042-3a8f.metadata.json. That JSON is the snapshot of the table at version 42. From there, Trino walks down a four-level tree to find which Parquet files to scan.
The catalog row is the only mutable cell in the entire tree. Every file below it — metadata.json, manifest-list, manifest, Parquet — is written once and never modified. Why immutability matters here: because no file is ever rewritten, no reader ever races a writer for the contents of a single file. A reader who picked up 00042-3a8f.metadata.json at the start of their query will read consistently from the version-42 snapshot for as long as the query takes, even if the writer has moved on to versions 43, 44, 45. Snapshot isolation is free; you do not need MVCC machinery on top of object storage to get it.
The four levels are not redundant. Each one exists to make a particular query cheap. The metadata.json holds schema and snapshot history — small, fits in a single GET. The manifest-list holds per-manifest partition ranges, so a query with a WHERE state = 'KA' predicate can skip every manifest whose range doesn't include KA without ever opening it. Each manifest holds per-Parquet-file column stats (min, max, null count), so the same predicate can skip data files inside the manifests it does open. By the time Trino actually opens a Parquet file, it has often already pruned away 99% of the table with a few small Avro reads.
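The two pruning levels can be sketched in a few lines. The stats values below are hypothetical, and real manifests carry these bounds as Avro fields rather than Python dicts:

```python
def prune_manifests(manifest_list, value):
    """Level 1: skip any manifest whose partition range cannot contain value."""
    return [m for m in manifest_list if m["min"] <= value <= m["max"]]

def prune_data_files(manifest, column, value):
    """Level 2: skip any data file whose column min/max cannot contain value."""
    return [f for f in manifest["data_files"]
            if f["stats"][column]["min"] <= value <= f["stats"][column]["max"]]

# Hypothetical stats for a table partitioned by state:
manifest_list = [
    {"path": "m1.avro", "min": "AP", "max": "AS"},   # Andhra..Assam range
    {"path": "m2.avro", "min": "KA", "max": "KL"},   # Karnataka..Kerala range
]
surviving = prune_manifests(manifest_list, "KA")     # m1.avro is never opened

manifest = {"data_files": [
    {"path": "p1.parquet", "stats": {"month": {"min": "2026-03", "max": "2026-03"}}},
    {"path": "p2.parquet", "stats": {"month": {"min": "2026-04", "max": "2026-05"}}},
]}
files = prune_data_files(manifest, "month", "2026-04")  # only p2.parquet survives
print([m["path"] for m in surviving], [f["path"] for f in files])
```

Both levels are the same interval test; what changes is the granularity of the thing being skipped (a whole manifest versus one data file).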
The commit, step by step
A commit is six steps. Five of them are PUT-new-key operations on object storage — strongly consistent on every storage backend, even pre-2020 S3. The sixth is the only operation that has to be atomic, and it lives in the catalog, not in the storage layer.
# manifest_commit.py — a minimal lakehouse commit, end to end.
# Models the Iceberg-style commit protocol: 5 PUT-new-keys + 1 catalog CAS.
import json, time, uuid, threading

class CatalogCASError(Exception): pass

class Catalog:
    """A toy metastore. The only mutable cell in the system."""
    def __init__(self):
        self.tables = {}              # name -> metadata_location
        self.lock = threading.Lock()  # makes CAS atomic locally

    def cas_pointer(self, table, expected, new_location):
        with self.lock:
            current = self.tables.get(table)
            if current != expected:
                raise CatalogCASError(f"expected {expected}, was {current}")
            self.tables[table] = new_location
            return new_location

class ObjectStore:
    """Toy S3. PUT-new-key is always strong; we never overwrite a key."""
    def __init__(self): self.bytes = {}

    def put_new(self, key, data):
        if key in self.bytes: raise Exception(f"refusing to overwrite {key}")
        self.bytes[key] = data
        return f"s3://wh/{key}"

def commit_append(table, new_data_files, catalog, store):
    expected = catalog.tables.get(table)
    # Resolve the parent snapshot: strip the s3://wh/ prefix to get the store key.
    parent_meta = json.loads(store.bytes[expected[len("s3://wh/"):]]) if expected else {
        "snapshots": [], "current_snapshot_id": None, "schema": {}}
    # Step 1: write data files (already done by writers; we get their paths)
    # Step 2: write a manifest (Avro in real Iceberg; JSON here for clarity)
    manifest_key = f"meta/manifest-{uuid.uuid4().hex[:8]}.json"
    manifest = {"data_files": [{"path": p, "row_count": 1000} for p in new_data_files]}
    manifest_loc = store.put_new(manifest_key, json.dumps(manifest))
    # Step 3: write a manifest-list referencing the new manifest + parent's manifests
    parent_manifests = parent_meta.get("manifest_list", [])
    manifest_list_key = f"meta/manifest-list-{uuid.uuid4().hex[:8]}.json"
    manifest_list = parent_manifests + [{"manifest": manifest_loc, "added": True}]
    list_loc = store.put_new(manifest_list_key, json.dumps(manifest_list))
    # Step 4: write a new metadata.json (snapshot N+1)
    snap_id = int(time.time() * 1000)
    new_meta = dict(parent_meta)
    new_meta["snapshots"] = parent_meta["snapshots"] + [{"id": snap_id, "list": list_loc}]
    new_meta["current_snapshot_id"] = snap_id
    new_meta["manifest_list"] = manifest_list
    meta_key = f"meta/v{len(new_meta['snapshots']):05d}-{uuid.uuid4().hex[:6]}.metadata.json"
    meta_loc = store.put_new(meta_key, json.dumps(new_meta))
    # Step 5: catalog CAS — the one atomic operation
    catalog.cas_pointer(table, expected, meta_loc)
    return snap_id, meta_loc

# --- demo ---
cat, st = Catalog(), ObjectStore()
sid1, _ = commit_append("orders", ["data/part-001.parquet", "data/part-002.parquet"], cat, st)
sid2, _ = commit_append("orders", ["data/part-003.parquet"], cat, st)
print(f"committed snapshots {sid1} -> {sid2}")
print(f"current pointer: {cat.tables['orders']}")
print(f"object store has {len(st.bytes)} immutable files")
# Sample run:
# committed snapshots 1745613000123 -> 1745613000456
# current pointer: s3://wh/meta/v00002-7f2c8a.metadata.json
# object store has 6 immutable files
The shape of the commit is everything. Walk through the load-bearing lines:
- `store.put_new(...)` everywhere. Five PUTs, all to fresh keys, never overwriting. Each individual call is strongly consistent on every object store on the planet — pre-2020 S3, GCS, Azure Blob, MinIO. Why this design avoids the eventual-consistency bug class entirely: there is no `LIST` of a directory to discover what was written, and there is no overwrite of an existing key. The previous chapter showed that those were the two failure modes; this commit protocol uses neither.
- `catalog.cas_pointer(table, expected, new_location)`. The compare-and-swap is the only place where atomicity is required. It says "swap the table pointer from `expected` to `new_location`, but only if the pointer is still `expected`". If a concurrent writer committed between when we read `expected` and now, the CAS fails — we have to rebuild the manifest tree against the new parent and retry.
- `new_meta["snapshots"] = parent_meta["snapshots"] + [...]`. The new metadata.json includes the entire snapshot history. Why this matters for time travel: a query with `FOR VERSION AS OF 1745613000123` walks back through `snapshots` to find that snapshot id, opens its `manifest_list`, and reads the table as it was at that moment. Time travel costs nothing extra because every old metadata.json is still on S3, immutable.
- `manifest_list = parent_manifests + [{"manifest": manifest_loc, "added": True}]`. An append-only commit appends a new manifest entry to the list of existing ones. A delete-by-rewrite commit (compaction, GDPR delete) would replace manifest entries — same protocol shape, different list contents. The commit protocol is one shape regardless of operation type.
- `if key in self.bytes: raise`. The toy store refuses overwrites. Real Iceberg writers achieve the same property by including a UUID in every metadata key, so collisions are impossible in practice. The protocol cannot tolerate an overwrite, so it never asks for one.
What snapshot isolation feels like in production
The Iceberg snapshot is not just metadata trivia. It is the contract that a Razorpay analyst running a 90-second SELECT sum(amount) FROM payments WHERE day = '2026-04-25' against a table that ingests 14,000 events per second can rely on a sum computed against exactly one version of the table. A naive design — point at a folder of Parquet files, scan it — would let the writer add a new file partway through the read, splicing partial data into the analyst's sum. With Iceberg, the analyst's query opens 00042-3a8f.metadata.json once at the start, reads the manifest tree from there, and is locked to version 42 for as long as the query runs. The writer ingesting at 14k/sec moves on to versions 43, 44, 45 — all visible to new queries, invisible to the in-flight one.
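That pinning behaviour fits in a few lines. Plain dicts stand in for the catalog row and the immutable store here; the names are illustrative, not a real client API:

```python
# Toy demo: a reader resolves the catalog pointer once and stays pinned to that
# snapshot, even as the writer commits new versions behind it.
store = {}                      # immutable files: key -> metadata content
catalog = {"payments": None}    # the single mutable cell

def commit(version, files):
    key = f"meta/v{version:05d}.metadata.json"
    store[key] = {"version": version, "data_files": files}
    catalog["payments"] = key   # the pointer swap (atomic in this toy)

commit(42, ["part-001.parquet"])
pinned = catalog["payments"]    # long query starts: pointer resolved ONCE
commit(43, ["part-001.parquet", "part-002.parquet"])
commit(44, ["part-001.parquet", "part-002.parquet", "part-003.parquet"])

# The in-flight query still reads version 42's file set; new queries see 44.
assert store[pinned]["version"] == 42
assert store[catalog["payments"]]["version"] == 44
```

The reader never re-resolves the pointer mid-query, and no file it holds is ever rewritten; those two facts together are the whole isolation mechanism.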
What happens if writer A and writer B both try to commit snapshot 43? Whoever wins the CAS first wins; the other gets CatalogCASError. The loser must re-read the current pointer (now 43, written by the winner), rebuild a manifest tree where the parent is 43, and retry CAS to land snapshot 44. This is optimistic concurrency control — it works because contention on the catalog row is much smaller than contention on data, and writers spend their compute time on the data files (the expensive part) before trying the cheap CAS at the very end.
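The loser's retry loop can be sketched as follows. The `cas`, `build`, and version-string names are toy stand-ins, not Iceberg API:

```python
# Sketch of the optimistic retry loop a losing writer runs after CatalogCASError.
class CASError(Exception):
    pass

pointer = {"orders": "v42"}   # toy catalog row

def cas(table, expected, new):
    if pointer[table] != expected:
        raise CASError(pointer[table])
    pointer[table] = new

def commit_with_retry(table, build_metadata, attempts=4):
    for _ in range(attempts):
        parent = pointer[table]            # re-read the current root
        candidate = build_metadata(parent) # cheap: new manifest list + metadata.json
        try:
            cas(table, parent, candidate)  # the one atomic step
            return candidate
        except CASError:
            continue                       # lost the race; rebuild against new parent
    raise RuntimeError("retry budget exhausted")

# Simulate writer B winning the race just before writer A's first CAS:
raced = {"done": False}
def build(parent):
    if not raced["done"]:
        pointer["orders"] = "v43-by-B"     # B commits while A is still building
        raced["done"] = True
        return "v43-by-A"                  # A's first build, stale parent v42
    return parent + "+A"                   # A's rebuild on top of B's commit

result = commit_with_retry("orders", build)
print(result)  # first CAS fails, retry succeeds: v43-by-B+A
```

A's first CAS fails because the pointer moved; the second attempt builds against B's commit and lands cleanly, which is exactly the append-on-append case the next paragraph calls commutative.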
In practice at scale, retry rates stay below 1% on append-heavy tables because appends are commutative: writer A's append and writer B's append can both land if the second one rebuilds its manifest list to include A's new manifest. Why this rebuilding is cheap: the manifest list is a small Avro file (often a few KB) and writing the new metadata.json is a single PUT. The Parquet data files written before the CAS are not invalidated — those files are still good, and the rebuilt manifest still points at them. Optimistic retry costs one Avro write and one JSON write per loser, never the cost of re-running the actual ETL.
Conflicts that do abort the retry are non-commutative ones: a delete and an append targeting the same row, two compactions selecting overlapping file sets, two schema evolutions disagreeing on column type. Iceberg detects these by re-checking constraints after each retry — if a delete still applies to a row that a concurrent commit removed, fail loudly rather than silently re-apply. The retry budget is finite (default 4 attempts) precisely because non-commutative conflicts will repeat until something larger breaks them apart.
What gets thorny: cross-table commits and the GC problem
The protocol is clean for single-table operations. Two thorny patterns force the design to reach beyond it.
Cross-table atomicity does not exist in plain Iceberg. If a Swiggy data-engineering job updates restaurants and menus and needs both updates visible together, the natural commit shape is two CASes against two catalog rows — and the system can crash between them. Three solutions are in production: (1) use a catalog that supports multi-row transactions (Snowflake's catalog, Databricks Unity Catalog, Polaris in some configurations); (2) merge the two updates into a single metadata.json with a custom branch (Iceberg branches, since v1.4); (3) accept the partial-failure mode and design a reconciliation pass. Most teams pick (3) and don't tell their executives.
Garbage collection is the other corner. Every commit leaves the old metadata.json and old manifest files on S3 forever, by default. Every overwrite of a Parquet row writes a new Parquet file and leaves the old one behind. Without GC, a table that ingests a TB a day grows by a TB even if rows are net-zero. Iceberg ships an expire_snapshots operation that removes snapshots older than a configurable threshold from the metadata.json, but the actual Parquet files referenced only by expired snapshots still need a separate remove_orphan_files pass — which has to run with a strict time gate, because a query that started before GC ran could still hold a reference to a file you are about to delete. Why this is an open production wound: setting the time gate too short causes "snapshot not found" errors mid-query; too long means you keep paying S3 for petabytes of data nobody can read. Most large Iceberg deployments at Indian companies run GC with a 7-day retention by default, costing roughly 15–20% extra storage compared to optimal — about ₹4 lakh per month on a 100 TB warehouse — as the price of safety. Cleaning that up is what production engineers spend years on.
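The time-gate logic itself is simple; the pain is choosing the gate. A sketch under toy assumptions (plain Python listings, not the real remove_orphan_files procedure):

```python
# Sketch of the two-condition check behind orphan cleanup: a file is removable
# only if no live snapshot references it AND it is older than the time gate,
# so no query that started before GC ran can still be holding it.
import time

def orphan_candidates(all_files, referenced, mtimes, now, gate_seconds):
    return [f for f in all_files
            if f not in referenced and now - mtimes[f] > gate_seconds]

now = time.time()
all_files = ["part-001.parquet", "part-old.parquet", "part-fresh.parquet"]
referenced = {"part-001.parquet"}                 # union of live snapshots' manifests
mtimes = {"part-001.parquet": now - 90_000,
          "part-old.parquet": now - 8 * 86_400,   # 8 days old, unreferenced
          "part-fresh.parquet": now - 3_600}      # 1 hour old: maybe a pending commit
gate = 7 * 86_400                                 # the 7-day retention from the text

doomed = orphan_candidates(all_files, referenced, mtimes, now, gate)
print(doomed)  # only part-old.parquet clears both conditions
```

Note that part-fresh.parquet is unreferenced too, possibly because its writer's CAS has not landed yet; the gate is what keeps GC from destroying an in-flight commit.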
Read paths: how a query walks the tree
The writer's view is the commit protocol. The reader's view is its mirror — five reads, in the same shape as the writer's five PUTs, but all from cold cache the first time and all parallelisable from the manifest level down. Walking the tree on a real query at Cred is informative: a 12 TB transactions table partitioned by state and month answers SELECT sum(amount) FROM transactions WHERE state = 'KA' AND month = '2026-04' by reading roughly 8 KB of metadata before any Parquet file is opened.
The query planner — Trino, Spark, Snowflake, DuckDB, whichever — does this:
- Catalog lookup. Single GET on the catalog row: `transactions → s3://cred-warehouse/transactions/metadata/00872-1f4c.metadata.json`. Sub-millisecond on a hot catalog.
- Read the metadata.json. One S3 GET, ~3 KB, returns schema and the `current_snapshot_id` plus its `manifest_list` location.
- Read the manifest-list. One S3 GET, often <2 KB, returns one row per manifest with partition-range stats: `{"state_min": "AP", "state_max": "AS"}`, `{"state_min": "KA", "state_max": "KL"}`, etc. The planner discards every manifest whose range can't include `state = 'KA'` — typically 80% of them.
- Read the surviving manifests in parallel. Each manifest, ~50 KB, lists data files with per-file column min/max. A second predicate-prune drops files whose `month_min..month_max` doesn't include `2026-04`.
- Open only the surviving Parquet files. Often 30–50 files instead of 50,000.
Why metadata pruning is more important than the Parquet read itself: a Parquet GET is paid in seconds (cold-S3 latency to first byte ~100ms, plus columnar projection cost). 50,000 Parquet GETs is 5,000 seconds of wall clock. The manifest-list-and-manifest pruning costs ~100ms total and removes 99% of those Parquet GETs. The bulk of a lakehouse query's speed comes from never opening files, not from how fast you read them.
This is also why a badly-partitioned table feels slow even on small data — the manifest stats only help if your WHERE predicate matches the partition keys. A query that filters on a non-partition column (amount > 50000) sees min/max stats per file but those bounds are wide on a randomly-distributed column, so few files prune. Z-ordering and clustering — covered later in Build 12 — are the techniques that make non-partition predicates prunable too.
A subtle consequence: a Trino query holding a stale metadata pointer continues to read consistently. If the catalog has moved on to snapshot 873 while a 30-second query is still pinned to 872, the data files referenced by 872's manifest tree are still on S3, untouched. That property is what lets a long analytics scan and a fast streaming ingest share one table without coordination — the cost is paid in deferred storage cleanup, which the GC discussion above already covered.
Common confusions
- "Iceberg metadata.json is the same as Hive's `_metadata` file." No. Hive's `_metadata` was a single per-partition Parquet stats file, opened by readers as a hint. Iceberg's metadata.json is the root of truth for which Parquet files belong to the table at all. A Parquet file present in S3 but not named in any manifest is invisible to Iceberg queries.
- "Snapshot isolation means readers block writers." No — neither side blocks. Readers open the metadata once and run independently. Writers race the catalog CAS and retry on conflict. There are no locks, no transactions in the SQL sense, no "wait for the reader to finish".
- "The manifest list and the manifest are the same thing." Different files. The manifest list is one Avro per snapshot, with one row per manifest. A manifest is one Avro with one row per data file. Two levels — and skipping the manifest-list level was a real Iceberg-v1 perf bug fixed in v2.
- "You can edit a Parquet file in place to fix a typo." You cannot — Iceberg, Delta, and Hudi all forbid in-place edits. The fix is a `MERGE` that writes a new Parquet with the corrected row, a new manifest, and a new metadata.json. The old typo'd file stays on S3 until expired-snapshots GC catches it.
- "The catalog only stores the table pointer." Most catalogs store more — schema, partition spec, table properties, table-level ACLs. The minimum the protocol requires is the pointer; the production reality is a full schema-and-policy registry.
- "Two writers can both succeed on the same metadata version." Only one can — that is the entire point of the CAS. The loser sees `CatalogCASError`, reads the new pointer, rebuilds, and retries. The mechanism is identical to compare-and-swap on a CPU cache line, just at petabyte scale.
Going deeper
Iceberg v1 vs v2: where the protocol grew teeth
The v1 spec required only positional deletes — a delete file said "row 4 of part-001.parquet is deleted". v2 added equality deletes — "rows where order_id = 7821 are deleted across the table" — to support row-level streaming MERGE without rewriting whole files. The commit protocol shape did not change: a delete write is still PUTs of new Parquet (delete files), a new manifest, a new metadata.json, a CAS. What changed was the manifest schema (delete-file references) and the reader logic (apply equality deletes during scan). Reading the v1-vs-v2 diff is the cleanest way to internalise that the protocol is the durable abstraction; the file formats inside it evolve.
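The reader-side difference between the two delete styles can be sketched with simplified row dicts standing in for Parquet pages (real delete files are themselves Parquet/Avro, applied per data file during the scan):

```python
# Sketch: applying v1 positional and v2 equality deletes during a scan.
def scan(rows_by_file, positional_deletes, equality_deletes):
    live = []
    for path, rows in rows_by_file.items():
        dead = {pos for (p, pos) in positional_deletes if p == path}
        for pos, row in enumerate(rows):
            if pos in dead:
                continue   # v1 style: "row N of file F is deleted"
            if any(all(row[k] == v for k, v in eq.items()) for eq in equality_deletes):
                continue   # v2 style: "rows where order_id = X are deleted"
            live.append(row)
    return live

rows_by_file = {"part-001.parquet": [
    {"order_id": 7820}, {"order_id": 7821}, {"order_id": 7822}]}
live = scan(rows_by_file,
            positional_deletes=[("part-001.parquet", 0)],  # delete row 0 of that file
            equality_deletes=[{"order_id": 7821}])         # delete by key, any file
print(live)  # only order_id 7822 survives both delete styles
```

A positional delete needs the writer to know exactly which file and row it is killing; an equality delete does not, which is why it is the one that makes streaming MERGE cheap.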
Delta's transaction log vs Iceberg's metadata.json
Delta Lake takes a different shape: instead of a single metadata.json per snapshot, every commit appends a JSON file to a _delta_log/ directory — 00000000000000000042.json, 00000000000000000043.json. A reader reconstructs the current state by reading the log in order. Periodically, Delta writes a checkpoint (Parquet file summarising the log up to version N) so readers don't have to replay 100,000 commits. The two shapes — Iceberg's snapshot-tree and Delta's log-with-checkpoints — are formally equivalent but operationally different. Delta's append-and-checkpoint is closer to a write-ahead-log database; Iceberg's snapshot-tree is closer to a copy-on-write filesystem. Choosing between them in 2026 is mostly about which engine your team already runs (Spark/Databricks pulls toward Delta; Trino/Snowflake pulls toward Iceberg).
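A toy replay of a Delta-style log makes the contrast concrete. The add/remove actions here are simplified; real _delta_log entries carry richer action types and fields:

```python
# Sketch: reconstructing current table state by folding a Delta-style log.
def replay(log_entries, checkpoint=None):
    """Fold add/remove actions, in commit order, into the live file set."""
    live = set(checkpoint or [])
    for entry in log_entries:
        for action in entry:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live

log = [
    [{"add": "part-001.parquet"}],                                  # ...042.json
    [{"add": "part-002.parquet"}],                                  # ...043.json
    [{"remove": "part-001.parquet"}, {"add": "part-003.parquet"}],  # a compaction
]
state = replay(log)
print(sorted(state))  # part-002 and part-003 are live; part-001 was compacted away
```

The `checkpoint` argument is where the formal equivalence shows: a checkpoint is just a pre-folded prefix of the log, exactly as an Iceberg metadata.json is a pre-folded view of all prior commits.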
Why the catalog-CAS design beat client-coordinated approaches
Early lakehouse designs tried client-coordinated commits: each writer reads the current pointer, writes its files, then asks all other writers via a coordination service (Zookeeper, etcd) for permission to commit. The catalog-CAS design — Netflix's contribution to Iceberg in 2018 — won because (a) the catalog already exists, (b) client-coordinated approaches scale poorly past a few dozen writers, and (c) CAS is the smallest-possible coordination primitive. Reading Ryan Blue's 2018 Iceberg design notes alongside the same year's Delta paper from Databricks shows two teams independently converging on "single atomic CAS on a metadata pointer" as the durable answer.
The one-writer-per-table myth that survived from 2017
A persistent rumour says "Iceberg only supports one writer per table". This was true of Hive ACID in 2017, never of Iceberg. The optimistic-CAS design supports arbitrarily many concurrent writers; the limit is contention, not protocol. A Zerodha team running 40 concurrent Spark jobs against the same tick_archive table sees retry rates of 2–4% during peak ingest hours, and zero data correctness issues across a year of operation. The myth lingers because naive writers (every job a full table-scan-and-rewrite) do conflict; partition-aware writers and append-only writers almost never do.
Branches and tags: Git semantics for tables
Iceberg v1.4 (late 2023) added named branches and tags as first-class catalog entities. A branch is just an alternate pointer alongside main — staging, audit-2026-q1, experimental-rebucket — each pointing at its own metadata.json. A tag is an immutable named pointer to a specific snapshot. The protocol shape doesn't change: a commit to a branch is still five PUTs and a CAS, just on a different catalog row. What this enables operationally is meaningful — a team at PhonePe can fork a staging branch off production, run a destructive backfill experiment, validate it, and either merge back or throw the whole branch away. The pattern is direct from Git, and the implementation cost was small precisely because the underlying commit protocol had nothing branch-specific in it.
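Because a branch is only another pointer, the mechanics fit in a few lines. The (table, branch) keying below is illustrative; real catalogs model refs inside the table metadata rather than as separate rows:

```python
# Sketch: a branch commit is the same CAS, aimed at a different pointer.
pointers = {("orders", "main"): "v00042.metadata.json"}

def cas(table, branch, expected, new):
    key = (table, branch)
    if pointers.get(key) != expected:
        raise RuntimeError("lost the race")
    pointers[key] = new

# Fork a staging branch off main, then commit a backfill to staging only:
pointers[("orders", "staging")] = pointers[("orders", "main")]
cas("orders", "staging", "v00042.metadata.json", "v00043-backfill.metadata.json")

assert pointers[("orders", "main")] == "v00042.metadata.json"       # untouched
assert pointers[("orders", "staging")] == "v00043-backfill.metadata.json"
# Discarding the experiment is one pointer delete; merging back is one CAS on main.
```

Nothing branch-specific appears in the commit path itself, which is the point the paragraph above makes about the implementation cost being small.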
Why 5 PUTs and not 1
A natural question: why not put schema, manifests, file list, and snapshot history all in one big metadata.json and commit a single PUT? The answer is read amplification. A commit-time saving of four PUTs would force every reader to parse a metadata file that grows linearly with table history — gigabytes after a year of hourly commits. Splitting metadata across the four-level tree means each level can be cached independently: metadata.json is small and re-read on every query, manifest-list is per-snapshot and cached during query planning, manifests are larger and parallel-read, Parquet stats are inline. The cost paid at write time (five PUTs ~= 200ms aggregate) buys per-query latency that scales with the active manifest count, not with total table history. Optimising the writer at the cost of the reader is the classic wrong tradeoff for analytical workloads.
Where this leads next
- /wiki/concurrent-writers-without-stepping-on-each-other — the next chapter, which takes the optimistic-CAS retry loop introduced here and walks through how Iceberg, Delta, and Hudi differ on conflict detection and resolution.
- /wiki/eventual-consistency-on-s3-and-what-it-breaks — the previous chapter, the storage-layer reasons this protocol shape exists.
- /wiki/object-storage-as-a-primary-store-s3-as-a-database — two chapters back, the four-give-four-refuse table the manifest tree compensates for.
After this chapter the lakehouse build is into the writer-side guts — compaction, copy-on-write vs merge-on-read, Z-ordering — all of which are operations whose commit shape is exactly what this chapter laid down. Once you can read the commit code, the rest of Build 12 is mostly variations on the same five-PUTs-and-a-CAS theme.
References
- Apache Iceberg Specification v2, §4 Commit Protocol — the binding contract; read the section on conflict detection alongside the source `BaseTransaction.commitOperation` for the canonical implementation.
- Ryan Blue — Iceberg's Approach to Atomic Updates (Netflix tech blog, 2018) — the original design rationale; explains why one CAS on a pointer beats every coordinated-writer alternative.
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (VLDB 2020) — Databricks' formal write-up of the log-and-checkpoint variant; pairs with the Iceberg spec for a side-by-side read.
- Apache Hudi Concepts — Timeline and File Layout — the third major lakehouse format; the timeline-of-instants design is a third shape worth knowing.
- Iceberg Java Source — `BaseTable.refresh()` and `SnapshotProducer.commit()` — the commit code; ~600 lines that implement everything in this chapter for real, including the retry loop.
- Subramanian Krishnan — Iceberg at Adobe Experience Platform (2022) — production scale-out story; useful for the GC-cost numbers and concurrent-writer retry-rate observations.
- /wiki/eventual-consistency-on-s3-and-what-it-breaks — the storage-layer chapter that motivates why the commit protocol asks so little of the substrate.
- /wiki/concurrent-writers-without-stepping-on-each-other — next chapter, optimistic-concurrency variants and conflict-detection strategies.