Iceberg, Delta, Hudi from the producer's perspective
Aditi's team at a Bengaluru-based payments company is migrating off a 90-TB warehouse into S3. Three vendors have already pitched her: one says "use Iceberg, it's the open standard"; another says "use Delta, the integration with Spark is unmatched"; the third says "use Hudi, it was built for upserts at scale". She does not need a feature checklist — she needs to know which of these three formats will fail her worst when a streaming writer crashes mid-commit, when a GDPR delete request arrives at 2 a.m., when two batch jobs try to write to the same partition, and when, six months from now, she wants to switch query engines without rewriting the data.
Iceberg, Delta, and Hudi all sit on top of the same Parquet files; what differs is the metadata tree that tracks which files are part of which snapshot, how concurrent writers coordinate, and how row-level deletes land before they hit the base files. Iceberg has the cleanest spec and the broadest engine support; Delta has the deepest Spark integration; Hudi was designed first around streaming upserts and shows it. Pick on the basis of your write pattern (append-mostly vs upsert-heavy vs delete-heavy), not the marketing.
What "table format" actually means above Parquet
A bare directory of Parquet files is not a table. It is a directory of files that happen to have related schemas. A table format adds a metadata layer on top — a tree of pointers that says "these files are the table at version N, those files are the table at version N+1, and here is how to atomically move from one to the other". The metadata layer is what gives you snapshot isolation, time travel, schema evolution, and concurrent writers without manual locking.
All three formats use the same underlying Parquet for the row data. None of them changed the data file. What they changed is the catalog pointer, the manifest tree, and the commit protocol.
Why the catalog-pointer layer is the design's keystone: the only operation that needs to be atomic for snapshot isolation is the pointer flip from "the table is version N" to "the table is version N+1". Everything below it (rewriting manifests, writing data files) can take minutes; the pointer flip takes microseconds. Iceberg moved this atomic operation into a real catalog (Glue, Nessie); Delta inherited it from HDFS-style filesystem rename semantics, which are fragile on S3 — which is why Delta on S3 needs DynamoDB or a LogStore for real concurrency.
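This split — slow preparation below, one tiny atomic flip on top — can be sketched with a toy single-file catalog. The filenames and layout here are illustrative, not any format's real one:

```python
import json
import os
import tempfile

# Toy catalog: one small file naming the current snapshot. Everything
# below the pointer is immutable, so a reader that resolved version N
# keeps a consistent view while a writer prepares version N+1.
CATALOG = os.path.join(tempfile.mkdtemp(), "catalog.json")

def flip_pointer(snapshot_id: int, metadata_path: str) -> None:
    # All the slow work (data files, manifests) happens before this call.
    # The write-then-rename below is the only step that must be atomic.
    tmp = CATALOG + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"snapshot_id": snapshot_id, "metadata_path": metadata_path}, f)
    os.rename(tmp, CATALOG)

def current_snapshot() -> int:
    with open(CATALOG) as f:
        return json.load(f)["snapshot_id"]

flip_pointer(1, "metadata/snap-1.json")
assert current_snapshot() == 1
flip_pointer(2, "metadata/snap-2.json")  # version N -> N+1 in one rename
assert current_snapshot() == 2
```

Swap the file for a Glue or Nessie table record and you have the shape of Iceberg's commit; swap it for the next numbered file in `_delta_log/` and you have Delta's.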
Where the three formats actually differ for a producer
If you only write append-only, daily-partitioned data, the three formats look identical. The differences appear the moment you do any of: concurrent writers, row-level deletes, schema evolution, or upserts. Below is a table of what a producer actually has to think about, and how each format handles it.
| Concern | Iceberg | Delta | Hudi |
|---|---|---|---|
| Concurrent writers (same table, different partitions) | Optimistic concurrency at catalog level — works | Same, requires DynamoDB on S3 | Same, with timeline-based ordering |
| Concurrent writers (same partition) | Conflict-detected at commit; one retries | Conflict-detected at commit; one retries | Hudi's design point — file-group-level merge |
| Row-level deletes | Position deletes (v2 spec) or equality deletes; writer-side cheap | Deletion vectors (recent) or rewrite-the-file (older) | Built around upsert-and-delete via merge logs |
| Schema evolution | Field-IDs, safe rename, drop, add (the cleanest spec of the three) | Column add safe; rename requires column-mapping mode | Add safe; reorder/rename has gotchas |
| Partition evolution | First-class — change spec on existing table | Generated columns (Delta 2+); not as flexible | Static-ish; bucketing changes are painful |
| Upserts | Yes via MERGE; rewrites whole files (CoW) | Yes via MERGE; CoW or DV | The native operation; Merge-on-Read is default |
| Time travel | Snapshot-id or timestamp; cheap until expiry | Version or timestamp | Instant-time-based |
| Engine support | Spark, Trino, Flink, Snowflake, BigQuery, DuckDB, Athena | Spark-first; Trino/Athena adequate | Spark-first; Flink solid; Trino limited |
The three rows that matter most in a real production decision are the delete model, the concurrent-writers-on-the-same-partition behaviour, and the engine portability. Everything else is a feature checkbox.
How each format handles a row-level delete (the operation that exposes the design)
A user requests their data deleted under DPDPA — a single user-id whose rows are scattered across 200 partitions. Watch what each format does.
Iceberg (v2 spec, position-deletes path). The writer scans the table for rows matching user_id = 'XYZ', generates a position-delete file per data file that contains a hit, and commits these as new manifest entries pointing at the original data files plus the position-deletes. Reads at a snapshot after this commit apply the position-deletes during scan. Compaction later rewrites the data files without those rows and drops the position-delete files. The base files are not touched at delete time; the cost is paid at compaction or read time.
Delta (deletion-vectors path, post 2.4). Identical idea — the writer emits a deletion vector (a roaring bitmap of row positions to skip) per affected data file, and commits a new transaction. Older Delta versions had to rewrite the entire data file at delete time, which made deletes expensive on cold partitions. Deletion vectors made Delta's delete path competitive with Iceberg's.
Hudi (Merge-on-Read, the default). The writer appends a delete record to the partition's log file (one log file per file group). Reads merge the log against the base file at query time. Compaction (which Hudi treats as a separate operation from clustering) periodically applies the log to the base file. The cost model is the inverse of Iceberg's — cheap writes, slower reads until compaction.
Why the formats converged on "don't rewrite the base file": rewriting a 256 MB Parquet file to remove one row writes 256 MB out for one row of work. At billion-row tables with thousands of GDPR requests per quarter, this is unworkable. All three formats independently arrived at the same answer — encode the delete as a side artefact, and apply it lazily. The differences are in the artefact's representation (delete-file, bitmap, log-record), not the strategy.
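The shared strategy reduces to a few lines. This is a deliberately minimal sketch in plain Python, with the delete artefact reduced to a set of row positions; each format's on-disk encoding differs, as noted above:

```python
# Base file contents (stand-in for rows inside a 256 MB Parquet file).
base_rows = [("U1", 10), ("U2", 20), ("U3", 30), ("U2", 40)]

# Delete artefact: row positions to skip in this base file. An Iceberg
# position-delete file, a Delta deletion vector, or a Hudi delete log
# record, depending on format — same idea, different encoding.
deleted_positions = {1, 3}  # the two U2 rows

# Read path: apply the artefact during the scan; the base file is untouched.
visible = [row for pos, row in enumerate(base_rows) if pos not in deleted_positions]
assert visible == [("U1", 10), ("U3", 30)]
```

Compaction is then "materialise `visible`, write it as a new base file, drop the artefact" — which is why all three formats can treat compaction as a deferred, schedulable cost.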
What a writer actually executes — three flavours of the same commit
Below is a Python script that performs the same logical operation — appending a batch of payments rows into a partitioned table — in pseudo-Iceberg, pseudo-Delta, and pseudo-Hudi style. The point is to see the commit-protocol shape, not the SDK ergonomics.
```python
# write_batch.py — append a batch of rows to a partitioned table.
# Same logical write, three different metadata footprints.
import json, os, time, uuid
import pyarrow as pa
import pyarrow.parquet as pq

ROOT = "/tmp/lake/payments"
PARTITION = "dt=2026-04-25"

ROWS = pa.table({
    "txn_id": pa.array([f"T{i:06d}" for i in range(1000)]),
    "merchant": pa.array(["razorpay-test"] * 1000),
    "amount_inr": pa.array([100 + i for i in range(1000)]),
    "user_id": pa.array([f"U{(i % 800):04d}" for i in range(1000)]),
})

def write_data_file(path):
    pq.write_table(ROWS, path, compression="snappy")
    return os.path.getsize(path)

# 1) Iceberg-style commit ----------------------------------------------------
def commit_iceberg():
    snap_id = int(time.time() * 1000)
    data_uuid = uuid.uuid4().hex
    data_path = f"{ROOT}/data/{PARTITION}/{data_uuid}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    nbytes = write_data_file(data_path)
    manifest_entry = {
        "data_file": data_path,
        "partition": PARTITION,
        "record_count": ROWS.num_rows,
        "file_size_bytes": nbytes,
        "column_stats": {"amount_inr": {"min": 100, "max": 1099}},
    }
    manifest_path = f"{ROOT}/metadata/manifest-{snap_id}.json"
    os.makedirs(os.path.dirname(manifest_path), exist_ok=True)
    with open(manifest_path, "w") as f:
        json.dump([manifest_entry], f)
    snapshot = {"snapshot_id": snap_id, "manifest_list": [manifest_path],
                "parent_id": _read_catalog().get("snapshot_id")}
    snap_path = f"{ROOT}/metadata/snap-{snap_id}.json"
    with open(snap_path, "w") as f:
        json.dump(snapshot, f)
    # The atomic step: catalog pointer flip.
    _write_catalog({"snapshot_id": snap_id, "metadata_path": snap_path})
    return snap_id

# 2) Delta-style commit ------------------------------------------------------
def commit_delta():
    log_dir = f"{ROOT}/_delta_log"
    os.makedirs(log_dir, exist_ok=True)
    versions = sorted(int(f.split(".")[0]) for f in os.listdir(log_dir)
                      if f.endswith(".json"))
    next_v = (versions[-1] + 1) if versions else 0
    data_uuid = uuid.uuid4().hex
    data_path = f"{ROOT}/{PARTITION}/{data_uuid}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    nbytes = write_data_file(data_path)
    actions = [{"add": {"path": data_path, "size": nbytes,
                        "partitionValues": {"dt": "2026-04-25"},
                        "stats": json.dumps({"numRecords": ROWS.num_rows})}}]
    log_path = f"{log_dir}/{next_v:020d}.json"
    # Atomic rename — the moment the version becomes visible.
    tmp = log_path + ".tmp"
    with open(tmp, "w") as f:
        for a in actions:
            f.write(json.dumps(a) + "\n")
    os.rename(tmp, log_path)
    return next_v

# 3) Hudi-style commit -------------------------------------------------------
def commit_hudi():
    instant = time.strftime("%Y%m%d%H%M%S")
    timeline = f"{ROOT}/.hoodie"
    os.makedirs(timeline, exist_ok=True)
    inflight = f"{timeline}/{instant}.commit.inflight"
    with open(inflight, "w") as f:
        f.write("{}")  # mark in-flight first
    fg = uuid.uuid4().hex[:8]
    data_path = f"{ROOT}/{PARTITION}/{fg}_{instant}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    write_data_file(data_path)
    completed = f"{timeline}/{instant}.commit"
    os.rename(inflight, completed)  # atomic state transition
    return instant

def _read_catalog():
    p = f"{ROOT}/_catalog.json"
    return json.load(open(p)) if os.path.exists(p) else {}

def _write_catalog(d):
    p = f"{ROOT}/_catalog.json"
    tmp = p + ".tmp"
    with open(tmp, "w") as f:
        json.dump(d, f)
    os.rename(tmp, p)

if __name__ == "__main__":
    print("iceberg:", commit_iceberg())
    print("delta:  ", commit_delta())
    print("hudi:   ", commit_hudi())
```

Sample run:

```
iceberg: 1714056234123
delta:   0
hudi:    20260425143514
```
The lines that matter: `_write_catalog({...})` in `commit_iceberg` is the atomic pointer flip — every operation before it could have been retried; once this line returns, the table is at the new snapshot. `os.rename(tmp, log_path)` in `commit_delta` is the equivalent — Delta's atomicity rides on the filesystem rename being atomic, which is why naive Delta-on-S3 needs an external coordinator (DynamoDB, LogStore): S3 PUT is not a rename. `os.rename(inflight, completed)` in `commit_hudi` is Hudi's two-phase commit — the inflight marker is written first, then renamed to "complete", giving the timeline a clear in-progress state for crash recovery.

Why all three end on a rename: a rename within a single filesystem (or a pointer swap in the catalog) is the cheapest atomic-swap primitive available. The whole table-format design is built on the assumption that you can flip exactly one pointer atomically; the rest of the metadata is allowed to be sloppy.
The shape difference: Iceberg keeps the manifest list separate from the snapshot file, so a snapshot read needs two indirections. Delta keeps everything in _delta_log/ as JSON commits, so a snapshot read scans the directory listing. Hudi separates the timeline (state transitions) from the data, so a snapshot read consults the timeline file. Each design is self-consistent; none is "wrong".
Concurrent writers — where the formats reveal their personality
When two writers commit to the same partition at the same time, all three formats use optimistic concurrency: each writer prepares its commit, then attempts the atomic pointer flip; if it loses the race, it re-reads the new state and retries. The differences are in what counts as a conflict.
- Iceberg uses serializable isolation by default. Two appends to the same partition do not conflict (the partition is append-only). Two MERGEs that touch the same files do conflict; one retries. The conflict detector reads the manifest tree of the latest snapshot and checks whether the files this writer planned to read or rewrite are still present.
- Delta uses serializable isolation. Two writers compute their actions; the loser re-reads the log from its base version forward, replays its action against the new state, and retries the commit. On S3 without a LogStore, two writers can both succeed at writing `00000000000000000007.json` because S3 historically lacked an atomic PUT-if-absent; this is the classic Delta-on-S3 corruption story, and the fix is DynamoDB or the open-source LogStore.
- Hudi uses MVCC with timeline-based ordering. Each writer claims an instant time and writes to its own file groups within the partition. Conflicts at the file-group level are detected during the post-write commit. Hudi's design lets multi-writer scenarios work better when writers naturally land in different file groups (e.g. partitioned by user-id range) — but two writers updating the same `user_id` will still conflict.
Aditi's payments table receives a streaming writer (every 30 s) and a daily backfill job (overnight). On Iceberg and Delta, this is a non-issue if they don't touch the same partition. If the backfill needs to overwrite yesterday's partition while the stream is still flushing files for today, there is no conflict. Hudi handles the same scenario via separate file groups.
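The optimistic-concurrency loop the three formats share can be sketched with a toy in-memory catalog. The class, method names, and the very coarse conflict rule (any concurrent commit loses the race and retries) are simplifications, not any format's real API:

```python
import threading

class ToyCatalog:
    """In-memory stand-in for the catalog's compare-and-swap pointer flip."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = []                        # the table's current file list

    def try_commit(self, expected_version, new_files):
        # Atomic pointer flip: succeeds only if nobody committed in between.
        with self._lock:
            if self.version != expected_version:
                return False                   # lost the race
            self.files = self.files + new_files
            self.version += 1
            return True

def commit_with_retries(catalog, new_files, max_attempts=5):
    for _ in range(max_attempts):
        base = catalog.version                 # 1) read current state
        # 2) prepare data files and manifests against `base` (the slow part)
        if catalog.try_commit(base, new_files):  # 3) attempt the flip
            return catalog.version
        # 4) lost the race: re-read the new state, re-check for real
        #    conflicts (same files touched?), then retry
    raise RuntimeError("commit failed after retries")

cat = ToyCatalog()
commit_with_retries(cat, ["s3://lake/data/a.parquet"])
commit_with_retries(cat, ["s3://lake/data/b.parquet"])
assert cat.version == 2 and len(cat.files) == 2
```

The real formats refine step 4: an append that conflicts only with another append can retry cheaply, while a MERGE that finds its input files rewritten must re-plan before retrying.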
What can go wrong
These are the production patterns that show up at month 4 of running any of these formats:
- The "schema evolution looked safe in dev" failure. You add a column. In Iceberg this is safe by design — field-IDs make the new column always identifiable. In Delta with column mapping disabled, renaming a column is a destructive operation that requires a full table rewrite. In Hudi, reordering columns can cause readers to mis-align values until a metadata rewrite. Test schema evolution on a real production-shape table, not a 100-row dev table.
- The "we forgot snapshot expiry" failure. All three formats keep old snapshots and old data files until you explicitly expire them. Storage cost grows linearly. Iceberg's `EXPIRE_SNAPSHOTS`, Delta's `VACUUM`, and Hudi's `clean` action all do the equivalent. Forget to schedule them and your S3 bill doubles every quarter.
- The "the catalog migration broke time-travel" failure. When you migrate Iceberg from a Hadoop catalog to Glue or Nessie, snapshot IDs are preserved but historical metadata files may move. Time-travel queries to old snapshots break unless the migration tool copies the metadata tree forward.
- The "Delta on S3 corrupted" failure. Two writers both succeed at writing the same `_delta_log/N.json` because S3's PUT semantics are last-writer-wins. The result is silent data loss — one of the two appends is invisible. The fix is DynamoDB-backed Delta on S3, or Databricks' managed catalog.
- The "Hudi compaction backed up" failure. Hudi's MoR mode means deletes and updates land in log files until compaction applies them. If compaction falls behind (because the cluster was down, or because the write rate exceeded compaction throughput), reads slow down dramatically — every scan now merges a long log against the base file. Monitor compaction lag the same way you'd monitor Kafka consumer lag.
- The "we picked Hudi but our writes are append-only" failure. Hudi's design pays a tax for upsert support that you don't recoup if you never upsert. An append-only Hudi table is heavier than the equivalent Iceberg or Delta table. Pick on your write pattern — if you're 99% append, the choice is Iceberg or Delta. If you're 30% upsert (CDC, slowly-changing dimensions), Hudi's design pays off.
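On the compaction-lag point: the check itself is simple once the metric is exposed. A sketch with invented file-group numbers and an invented 25% threshold — not a Hudi API:

```python
from dataclasses import dataclass

@dataclass
class FileGroup:
    group_id: str
    base_bytes: int              # size of the compacted base file
    uncompacted_log_bytes: int   # deletes/updates not yet applied

def compaction_lag_alerts(groups, max_log_ratio=0.25):
    """Flag file groups whose pending log exceeds a fraction of the base
    file: every read of such a group merges that much extra data."""
    return [g.group_id for g in groups
            if g.uncompacted_log_bytes > max_log_ratio * g.base_bytes]

groups = [
    FileGroup("fg-01", base_bytes=256 << 20, uncompacted_log_bytes=8 << 20),
    FileGroup("fg-02", base_bytes=256 << 20, uncompacted_log_bytes=120 << 20),
]
assert compaction_lag_alerts(groups) == ["fg-02"]
```

The useful part is the framing: lag is measured in bytes-a-reader-must-merge, per file group, and alerted on before read latency degrades.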
Common confusions
- "Iceberg, Delta, Hudi are storage formats." They are table formats — metadata layers on top of Parquet, which is the storage format. The Parquet files are bit-identical across the three.
- "Delta is open source the way Iceberg is." Delta's spec was opened in 2019 but the reference implementation is heavily Spark-tied; key features (change data feed, deletion vectors) shipped on Spark first. Iceberg's reference implementation has Java, Python, Rust, and engine-native readers in Trino/DuckDB/Snowflake. The open-vs-not gap narrowed in 2023–24 but has not closed.
- "Picking a table format locks you into a query engine." Mostly false now — Iceberg is read by Spark, Trino, Flink, Snowflake, BigQuery, Athena, DuckDB. Delta is read by Spark, Trino, Athena, and via the Delta-Rust connector by Polars/DuckDB. Hudi is read by Spark, Flink, Trino (with caveats). Engine portability is no longer an Iceberg-only feature, though Iceberg still has the broadest support.
- "Time travel costs nothing." It costs storage — every old snapshot pins its data files in S3 until expiry. A daily-snapshot table with 90-day retention can carry up to 90× the live file footprint if each snapshot rewrites the table; append-only snapshots share files, so the overhead is only the rewritten and deleted data. Pick a retention policy with cost in mind.
- "You can switch table formats by re-pointing the catalog." No. The metadata trees are different on disk. To migrate Delta → Iceberg, you either rewrite (cost: data volume × Parquet I/O) or use a tool like Iceberg's `migrate_delta` action, which converts metadata in place but still has gotchas with column mapping.
- "Hudi's Merge-on-Read is faster than Iceberg for everything." It's faster for writes with deletes/updates. It is slower for reads on tables with significant un-compacted log files, because every read merges. The trade-off is write cost vs read cost; pick on your read:write ratio.
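On the time-travel storage point above, the arithmetic is worth doing explicitly. A back-of-envelope sketch whose churn rates are pure assumptions, not measurements:

```python
# Back-of-envelope snapshot-retention cost for a table whose daily job
# (illustratively) rewrites 5% of files and appends 2% net new data.
table_tb = 90.0          # live table size, TB
overwrite_frac = 0.05    # fraction of files rewritten per day (CoW churn)
append_frac = 0.02       # net new data per day
retention_days = 90

# Every rewritten file's old copy is retained until snapshot expiry,
# so retained bytes grow with churn x retention, not with snapshot count.
retained_tb = table_tb * (overwrite_frac + append_frac) * retention_days
total_tb = table_tb + retained_tb
print(f"live: {table_tb} TB, retained for time travel: {retained_tb:.0f} TB")
```

With these assumptions the retained copies come to several multiples of the live table — which is why snapshot expiry is a cost lever, not housekeeping.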
Going deeper
The Iceberg v3 spec and what it adds
Iceberg v3, in flight at the time of writing, formalises deletion vectors (catching up to Delta), row-lineage (a row-id that survives compaction and updates, useful for incremental view maintenance), and variant columns (semi-structured JSON-like). The big win is row-lineage, which makes incremental query engines on top of Iceberg substantially cheaper — instead of re-deriving a row's identity from a key column, the engine reads the lineage row-id and knows whether this row existed in the previous snapshot. This is the kind of low-level primitive that unlocks CDC out of a lakehouse without bolting on a separate stream.
Delta UniForm and the "format gateway" pattern
Databricks shipped UniForm in 2023, which writes Delta data files plus an Iceberg metadata mirror in the same commit. A Delta writer can serve Iceberg readers with no rewrite. This is the first step toward "the table format wars don't matter for new tables — write once, read with anything". Whether it generalises to Hudi too is a 2026 question.
The Onehouse / Apache XTable bridge
XTable (formerly OneTable) translates metadata between Iceberg, Delta, and Hudi. A Hudi-managed table can be exposed as Iceberg metadata to a Trino reader without rewriting the data. This sits at the catalog layer, not the data layer — exactly where the differences between the formats live. For shops that picked Hudi for streaming-into-warehouse and Iceberg for analytics-on-S3 (a pattern that exists at Uber and at several Indian fintechs), XTable removes the "two copies of the data" problem.
File-group sizing — the 1 GB vs 256 MB debate
Iceberg's default target is 512 MB; Delta's OPTIMIZE defaults to 1 GB; Hudi's bucket size is configurable but commonly 128 MB. Bigger files reduce planner cost (fewer manifest entries to read) but increase rewrite cost when a single row needs to change (CoW rewrites the whole file). For tables with high update rates, smaller is better; for append-only analytics, bigger is better. There is no universal right answer — the right answer is to monitor avg_file_size, manifest_count, and rewrite_bytes_per_commit and tune.
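The trade-off in that paragraph can be put into a toy model; the cost units, the one-manifest-entry-per-file assumption, and the update counts are all illustrative:

```python
# Toy model of the planner-cost vs rewrite-cost trade-off for file sizing.
# Assumes one manifest entry per data file, and that a single-row update
# rewrites one whole file under copy-on-write.
def costs(table_bytes, file_bytes, updates_per_day):
    num_files = table_bytes // file_bytes
    rewrite_bytes = updates_per_day * file_bytes   # CoW bytes rewritten/day
    return num_files, rewrite_bytes

TB = 1 << 40
MB = 1 << 20
for size in (128 * MB, 512 * MB, 1024 * MB):
    files, rewrite = costs(10 * TB, size, updates_per_day=1000)
    print(f"{size // MB:>5} MB files: {files:>6} manifest entries, "
          f"{rewrite / (1 << 30):,.0f} GiB rewritten/day")
```

On a 10 TB table, doubling the file size halves the manifest entries the planner reads and doubles the bytes rewritten per scattered update — the two curves cross at a point that depends entirely on your update rate.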
The producer's cost model in one formula
For a single commit at steady state on any of the three formats:
```
cost = bytes_written + manifest_rewrite_cost + commit_latency
     ≈ ROWS × bytes_per_row × (1 + amplification_factor) + O(num_files_in_table)
```
The amplification_factor is what differs across formats. For an Iceberg append it is roughly 1 (write the data, append one manifest entry). For a Delta MERGE on copy-on-write it can be 50× (you rewrite every file containing a target row). For a Hudi MoR upsert it is roughly 1 at write time and the cost is paid later at compaction. The cost model never disappears; it moves between writer-time and reader-time.
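Plugging illustrative numbers into the model makes the spread visible; the amplification factors below are taken from the ranges quoted above, not measured:

```python
# Worked example of the commit cost model. All figures are illustrative
# assumptions, not benchmarks.
def commit_cost_bytes(rows, bytes_per_row, amplification_factor):
    return rows * bytes_per_row * (1 + amplification_factor)

rows, bpr = 1_000_000, 200   # a 200 MB logical batch

# Iceberg/Delta append: write the data once, append one manifest entry.
append = commit_cost_bytes(rows, bpr, amplification_factor=0)

# CoW MERGE whose target rows are scattered across many files: each
# touched file is rewritten in full (the 50x case from the text).
merge_cow = commit_cost_bytes(rows, bpr, amplification_factor=50)

# Hudi MoR upsert: ~1x at write time; the balance is deferred to compaction.
upsert_mor = commit_cost_bytes(rows, bpr, amplification_factor=0.1)

for name, c in [("append", append), ("merge (CoW)", merge_cow),
                ("upsert (MoR)", upsert_mor)]:
    print(f"{name:>12}: {c / 1e9:,.1f} GB written at commit time")
```

A 0.2 GB batch costs 0.2 GB as an append and over 10 GB as a scattered CoW merge — the same logical write, two orders of magnitude apart in I/O.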
Where this leads next
The next chapter, /wiki/concurrent-writers-without-stepping-on-each-other, goes deeper on the optimistic-concurrency protocol that all three formats share, and shows how a streaming writer and a backfill writer can co-exist on the same partition without either silently dropping data. After that, /wiki/time-travel-and-zero-copy-clones-for-data-engineers covers the read-side use of snapshots — how time travel works under the hood, why it costs storage, and how zero-copy clones are implemented as metadata-only branch operations.
For the immediate prerequisite on what files actually live underneath these formats, read /wiki/parquet-end-to-end-what-you-write-what-you-get-back. For the reason small files matter to all three formats equally, read /wiki/compaction-small-files-hell-and-how-to-avoid-it.
References
- Apache Iceberg specification — the canonical spec; read §6 (snapshot) and §7 (manifest) to see the metadata tree in full detail.
- Delta Lake protocol — the official protocol document; the section on commit conflict resolution is the one to read for concurrent-writer behaviour.
- Apache Hudi: timeline and file groups — the timeline-based design that distinguishes Hudi from the other two.
- Netflix: Iceberg at trillion-row scale — the production write-up that drove much of Iceberg's enterprise adoption.
- Onehouse / XTable — the metadata-translation bridge between the three formats.
- Tabular: A walkthrough of Iceberg v3 — accessible explanation of row-lineage and deletion vectors in the upcoming spec.
- Razorpay engineering: Iceberg migration — Indian-context production write-up on a 50 TB warehouse-to-lakehouse migration; calls out the catalog-choice and snapshot-expiry trade-offs.
- /wiki/compaction-small-files-hell-and-how-to-avoid-it — why all three formats need compaction and how the policy interacts with each format's snapshot model.
- /wiki/parquet-end-to-end-what-you-write-what-you-get-back — the storage layer that all three sit on.