Iceberg, Delta, Hudi from the producer's perspective

Aditi's team at a Bengaluru-based payments company is migrating off a 90-TB warehouse into S3. Three vendors have already pitched her: one says "use Iceberg, it's the open standard"; another says "use Delta, the integration with Spark is unmatched"; the third says "use Hudi, it was built for upserts at scale". She does not need a feature checklist — she needs to know which of these three formats will fail her the worst when a streaming writer crashes mid-commit, when a GDPR delete request arrives at 2 a.m., when two batch jobs try to write to the same partition, and when, six months from now, she wants to switch query engines without rewriting the data.

Iceberg, Delta, and Hudi all sit on top of the same Parquet files; what differs is the metadata tree that tracks which files are part of which snapshot, how concurrent writers coordinate, and how row-level deletes land before they hit the base files. Iceberg has the cleanest spec and the broadest engine support; Delta has the deepest Spark integration; Hudi was designed first around streaming upserts and shows it. Pick on the basis of your write pattern (append-mostly vs upsert-heavy vs delete-heavy), not the marketing.

What "table format" actually means above Parquet

A bare directory of Parquet files is not a table. It is a directory of files that happen to have related schemas. A table format adds a metadata layer on top — a tree of pointers that says "these files are the table at version N, those files are the table at version N+1, and here is how to atomically move from one to the other". The metadata layer is what gives you snapshot isolation, time travel, schema evolution, and concurrent writers without manual locking.

All three formats use the same underlying Parquet for the row data. None of them changed the data file. What they changed is the catalog pointer, the manifest tree, and the commit protocol.

[Figure: The layers above Parquet that make a table — a four-layer stack diagram. Bottom layer is Parquet data files; above that a per-format manifest layer; above that a snapshot metadata file; at the top the catalog pointer (the atomic swap). Same Parquet underneath; the metadata layers above are what differ.]

Layer 1 — Parquet data files: part-00000.parquet, part-00001.parquet, ... (the actual rows; identical across all three formats).

Layer 2 — Per-file manifest entries ("these data files belong to this snapshot; here are their statistics and partition values"): Iceberg manifest-N.avro; Delta _delta_log/00..N.json action records; Hudi .hoodie/000.commit file groups.

Layer 3 — Snapshot metadata ("snapshot N points at these manifests; snapshot N+1 points at those"): Iceberg manifest-list-N.avro; Delta checkpoint.parquet (every 10 commits); Hudi .hoodie/timeline.

Layer 4 — Catalog pointer, the atomic swap (a single pointer flip is the moment "the table version" changes): Iceberg Glue/Nessie/REST catalog; Delta _delta_log/_last_checkpoint plus filesystem rename; Hudi filesystem-based timeline.
All three formats are layered the same way. The differences are in how each layer is implemented and where the atomic-swap primitive lives. The data layer is identical Parquet.

Why the catalog-pointer layer is the design's keystone: the only operation that needs to be atomic for snapshot isolation is the pointer flip from "the table is version N" to "the table is version N+1". Everything below it (rewriting manifests, writing data files) can take minutes; the pointer flip takes microseconds. Iceberg moved this atomic operation into a real catalog (Glue, Nessie); Delta inherited it from HDFS-style filesystem rename semantics, which are fragile on S3 — which is why Delta on S3 needs DynamoDB or a LogStore implementation for safe concurrent writes.
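The flip is easiest to see as a compare-and-swap. Here is a minimal sketch with a plain dict standing in for the catalog (Glue, Nessie); swap_pointer and the entry shape are illustrative, not any catalog's actual API:

```python
# A sketch of the catalog's compare-and-swap: the flip succeeds only if the
# caller's expected version still matches, which is what turns an optimistic
# commit into an atomic one. The dict stands in for a real catalog service.
catalog = {"payments": {"version": 7, "metadata_path": "snap-7.json"}}

def swap_pointer(table, expected_version, new_metadata_path):
    entry = catalog[table]
    if entry["version"] != expected_version:
        return False                      # someone committed first; caller retries
    catalog[table] = {"version": expected_version + 1,
                      "metadata_path": new_metadata_path}
    return True

assert swap_pointer("payments", 7, "snap-8.json")       # wins the race
assert not swap_pointer("payments", 7, "snap-8b.json")  # stale view; must retry
```

Everything expensive happens before this call; the call itself only compares one integer and writes one pointer, which is why it can be made atomic cheaply.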

Where the three formats actually differ for a producer

If you only write append-only, daily-partitioned data, the three formats look identical. The differences appear the moment you do any of: concurrent writers, row-level deletes, schema evolution, or upserts. Below is a table of what a producer actually has to think about, and how each format handles it.

| Concern | Iceberg | Delta | Hudi |
|---|---|---|---|
| Concurrent writers (same table, different partitions) | Optimistic concurrency at the catalog level — works | Same, but requires DynamoDB on S3 | Same, with timeline-based ordering |
| Concurrent writers (same partition) | Conflict detected at commit; one writer retries | Conflict detected at commit; one writer retries | Hudi's design point — file-group-level merge |
| Row-level deletes | Position deletes (v2 spec) or equality deletes; cheap on the writer side | Deletion vectors (recent) or rewrite-the-file (older versions) | Built around upsert-and-delete via merge logs |
| Schema evolution | Field IDs; safe rename, drop, add (the cleanest spec of the three) | Column add is safe; rename requires column-mapping mode | Add is safe; reorder/rename has gotchas |
| Partition evolution | First-class — change the spec on an existing table | Generated columns (Delta 2+); not as flexible | Static-ish; bucketing changes are painful |
| Upserts | Yes, via MERGE; rewrites whole files (CoW) | Yes, via MERGE; CoW or deletion vectors | The native operation; Merge-on-Read is the default |
| Time travel | Snapshot-id or timestamp; cheap until snapshot expiry | Version or timestamp | Instant-time based |
| Engine support | Spark, Trino, Flink, Snowflake, BigQuery, DuckDB, Athena | Spark-first; Trino/Athena adequate | Spark-first; Flink solid; Trino limited |

The three rows that matter most in a real production decision are the delete model, the concurrent-writers-on-the-same-partition behaviour, and the engine portability. Everything else is a feature checkbox.

How each format handles a row-level delete (the operation that exposes the design)

A user requests their data deleted under DPDPA — a single user-id whose rows are scattered across 200 partitions. Watch what each format does.

Iceberg (v2 spec, position-deletes path). The writer scans the table for rows matching user_id = 'XYZ', generates a position-delete file per data file that contains a hit, and commits these as new manifest entries pointing at the original data files plus the position-deletes. Reads at a snapshot after this commit apply the position-deletes during scan. Compaction later rewrites the data files without those rows and drops the position-delete files. The base files are not touched at delete time; the cost is paid at compaction or read time.
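The scan-side half of this is worth seeing in miniature. A sketch with plain lists standing in for the Parquet data file and the position-delete file — the {file_path, pos} record shape mirrors the spec's idea, not its exact schema:

```python
# Sketch of applying a position delete at scan time. Real Iceberg stores these
# records in a Parquet delete file referenced from a manifest; dicts stand in.
data_file = ["row0", "row1_uid_XYZ", "row2", "row3"]           # base file, untouched
position_deletes = [{"file_path": "data-001.parquet", "pos": 1}]

def scan(file_path, rows, deletes):
    # Collect the dead positions for this specific data file, then skip them.
    dead = {d["pos"] for d in deletes if d["file_path"] == file_path}
    return [r for i, r in enumerate(rows) if i not in dead]

assert scan("data-001.parquet", data_file, position_deletes) == ["row0", "row2", "row3"]
```

Note that the delete file is scoped to a data file path: a GDPR delete across 200 partitions produces one small artefact per affected file, not one rewrite per file.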

Delta (deletion-vectors path, post 2.4). Identical idea — the writer emits a deletion vector (a roaring bitmap of row positions to skip) per affected data file, and commits a new transaction. Older Delta versions had to rewrite the entire data file at delete time, which made deletes expensive on cold partitions. Deletion vectors made Delta's delete path competitive with Iceberg's.
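The same idea with a bitmap instead of a record list. A minimal sketch in which a plain Python integer stands in for the roaring bitmap a real deletion vector serializes:

```python
# Sketch of a deletion vector applied at scan. A real DV is a compressed
# roaring bitmap stored alongside the data file; an int bitmask stands in.
rows = ["r0", "r1", "r2", "r3"]
dv = 0
dv |= 1 << 1                                   # mark row position 1 as deleted

# Scan: emit only rows whose bit is not set in the deletion vector.
visible = [r for i, r in enumerate(rows) if not (dv >> i) & 1]
assert visible == ["r0", "r2", "r3"]
```

The bitmap representation is why DV deletes stay cheap even when thousands of rows in one file are deleted: the artefact grows sub-linearly with the number of positions.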

Hudi (Merge-on-Read, the default). The writer appends a delete record to the partition's log file (one log file per file group). Reads merge the log against the base file at query time. Compaction (which Hudi treats as a separate operation from clustering) periodically applies the log to the base file. The cost model is the inverse of Iceberg's — cheap writes, slower reads until compaction.
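The read-time merge can be sketched with dicts standing in for the base file and the log records; the key and record shapes are illustrative, not Hudi's on-disk layout:

```python
# Sketch of a Merge-on-Read scan: the base file is merged against the file
# group's log records at query time, replaying them in commit order.
base = {"T1": {"uid": "XYZ", "amt": 100}, "T2": {"uid": "ABC", "amt": 250}}
log = [{"op": "delete", "key": "T1"},
       {"op": "upsert", "key": "T3", "row": {"uid": "DEF", "amt": 75}}]

def mor_scan(base_file, log_records):
    merged = dict(base_file)                  # base file stays untouched on disk
    for rec in log_records:                   # replay the log over it
        if rec["op"] == "delete":
            merged.pop(rec["key"], None)
        else:
            merged[rec["key"]] = rec["row"]
    return merged

assert sorted(mor_scan(base, log)) == ["T2", "T3"]
```

Compaction does exactly this merge once, writes the result as a new base file, and truncates the log — which is why reads get faster immediately after it runs.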

[Figure: How a single GDPR delete lands in each format — three side-by-side panels, each showing the same input data file with the row uid=XYZ highlighted; none of the three rewrites the base file at delete time.]

Iceberg v2: data-001.parquet unchanged; a new pos-delete.parquet records {file: data-001, pos: 47}; applied at scan, cleared at compaction.

Delta (DV): data-001.parquet unchanged; data-001.parquet.dv holds a roaring bitmap with bit 47 set; applied at scan, cleared at OPTIMIZE.

Hudi (MoR): base-file.parquet unchanged; a DELETE uid=XYZ record is appended to .log.001; merged at scan, applied at compaction.
Three different physical artefacts encode the same logical change. The producer cost is similar across all three; the read-side cost differs based on how aggressive the format is about merging at scan.

Why the formats converged on "don't rewrite the base file": rewriting a 256 MB Parquet file to remove one row writes 256 MB out for one row of work. At billion-row tables with thousands of GDPR requests per quarter, this is unworkable. All three formats independently arrived at the same answer — encode the delete as a side artefact, and apply it lazily. The differences are in the artefact's representation (delete-file, bitmap, log-record), not the strategy.
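The arithmetic behind that claim, back-of-envelope — the per-quarter request count and the ~1 KB artefact size are assumptions for illustration:

```python
# Rewrite-the-file vs side-artefact cost for single-row deletes, using the
# 256 MB file size from the text. Request count is an illustrative assumption.
file_mb = 256
deletes_per_quarter = 5000

rewrite_cost_mb = deletes_per_quarter * file_mb       # rewrite whole file per delete
side_artifact_mb = deletes_per_quarter * 0.001        # ~1 KB delete artefact each

assert rewrite_cost_mb == 1_280_000    # ~1.28 TB written to remove 5000 rows
assert side_artifact_mb == 5.0         # ~5 MB of delete artefacts instead
```

A quarter-million-fold difference in bytes written is why all three formats landed on the lazy-delete design, whatever the artefact looks like.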

What a writer actually executes — three flavours of the same commit

Below is a Python script that performs the same logical operation — appending a batch of payments rows into a partitioned table — in pseudo-Iceberg, pseudo-Delta, and pseudo-Hudi style. The point is to see the commit-protocol shape, not the SDK ergonomics.

# write_batch.py — append a batch of rows to a partitioned table.
# Same logical write, three different metadata footprints.
import json, os, time, uuid
import pyarrow as pa
import pyarrow.parquet as pq

ROOT = "/tmp/lake/payments"
PARTITION = "dt=2026-04-25"
ROWS = pa.table({
    "txn_id":     pa.array([f"T{i:06d}" for i in range(1000)]),
    "merchant":   pa.array(["razorpay-test"] * 1000),
    "amount_inr": pa.array([100 + i for i in range(1000)]),
    "user_id":    pa.array([f"U{(i % 800):04d}" for i in range(1000)]),
})

def write_data_file(path):
    pq.write_table(ROWS, path, compression="snappy")
    return os.path.getsize(path)

# 1) Iceberg-style commit ----------------------------------------------------
def commit_iceberg():
    snap_id = int(time.time() * 1000)
    data_uuid = uuid.uuid4().hex
    data_path = f"{ROOT}/data/{PARTITION}/{data_uuid}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    nbytes = write_data_file(data_path)

    manifest_entry = {
        "data_file": data_path,
        "partition": PARTITION,
        "record_count": ROWS.num_rows,
        "file_size_bytes": nbytes,
        "column_stats": {"amount_inr": {"min": 100, "max": 1099}},
    }
    manifest_path = f"{ROOT}/metadata/manifest-{snap_id}.json"
    os.makedirs(os.path.dirname(manifest_path), exist_ok=True)
    with open(manifest_path, "w") as f: json.dump([manifest_entry], f)

    snapshot = {"snapshot_id": snap_id, "manifest_list": [manifest_path],
                "parent_id": _read_catalog().get("snapshot_id")}
    snap_path = f"{ROOT}/metadata/snap-{snap_id}.json"
    with open(snap_path, "w") as f: json.dump(snapshot, f)

    # The atomic step: catalog pointer flip.
    _write_catalog({"snapshot_id": snap_id, "metadata_path": snap_path})
    return snap_id

# 2) Delta-style commit ------------------------------------------------------
def commit_delta():
    log_dir = f"{ROOT}/_delta_log"
    os.makedirs(log_dir, exist_ok=True)
    versions = sorted(int(f.split('.')[0]) for f in os.listdir(log_dir)
                      if f.endswith('.json'))
    next_v = (versions[-1] + 1) if versions else 0

    data_uuid = uuid.uuid4().hex
    data_path = f"{ROOT}/{PARTITION}/{data_uuid}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    nbytes = write_data_file(data_path)

    actions = [{"add": {"path": data_path, "size": nbytes,
                        "partitionValues": {"dt": "2026-04-25"},
                        "stats": json.dumps({"numRecords": ROWS.num_rows})}}]
    log_path = f"{log_dir}/{next_v:020d}.json"
    # Atomic rename — the moment the version becomes visible.
    tmp = log_path + ".tmp"
    with open(tmp, "w") as f:
        for a in actions: f.write(json.dumps(a) + "\n")
    os.rename(tmp, log_path)
    return next_v

# 3) Hudi-style commit -------------------------------------------------------
def commit_hudi():
    instant = time.strftime("%Y%m%d%H%M%S")
    timeline = f"{ROOT}/.hoodie"
    os.makedirs(timeline, exist_ok=True)
    inflight = f"{timeline}/{instant}.commit.inflight"
    with open(inflight, "w") as f:
        f.write("{}")                  # mark in-flight first

    fg = uuid.uuid4().hex[:8]
    data_path = f"{ROOT}/{PARTITION}/{fg}_{instant}.parquet"
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    write_data_file(data_path)

    completed = f"{timeline}/{instant}.commit"
    os.rename(inflight, completed)     # atomic state transition
    return instant

def _read_catalog():
    p = f"{ROOT}/_catalog.json"
    return json.load(open(p)) if os.path.exists(p) else {}
def _write_catalog(d):
    p = f"{ROOT}/_catalog.json"
    tmp = p + ".tmp"
    with open(tmp, "w") as f:
        json.dump(d, f)
    os.rename(tmp, p)

if __name__ == "__main__":
    print("iceberg:", commit_iceberg())
    print("delta:  ", commit_delta())
    print("hudi:   ", commit_hudi())
# Sample run:
# iceberg: 1714056234123
# delta:   0
# hudi:    20260425143514

The lines that matter:

_write_catalog({...}) in commit_iceberg is the atomic pointer flip — every operation before it could have been retried; once this line returns, the table is at the new snapshot.

os.rename(tmp, log_path) in commit_delta is the equivalent — Delta's atomicity rides on the filesystem rename being atomic, which is why naive Delta-on-S3 needs an external coordinator (DynamoDB, a LogStore implementation), since S3 PUT is not a rename.

os.rename(inflight, completed) in commit_hudi is Hudi's two-phase commit — the inflight marker is written first, then renamed to completed, giving the timeline a clear in-progress state for crash recovery.

Why all three end on a rename: a rename within a single filesystem (or a pointer swap in the catalog) is the cheapest atomic-swap primitive available. The whole table-format design is built on the assumption that you can flip exactly one pointer atomically; the rest of the metadata is allowed to be sloppy.

The shape difference: Iceberg keeps the manifest list separate from the snapshot file, so a snapshot read needs two indirections. Delta keeps everything in _delta_log/ as JSON commits, so a snapshot read scans the directory listing. Hudi separates the timeline (state transitions) from the data, so a snapshot read consults the timeline file. Each design is self-consistent; none is "wrong".
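The indirection counts are easier to see in miniature. A sketch in which one dict stands in for the object store; the key names and record shapes are illustrative:

```python
# Sketch of Iceberg's snapshot-read shape: catalog -> snapshot file ->
# manifest list -> data files, i.e. two metadata indirections after the
# catalog. The dict stands in for the object store.
iceberg_store = {
    "catalog": "snap-9.json",
    "snap-9.json": {"manifest_list": ["manifest-9.avro"]},
    "manifest-9.avro": ["data-001.parquet", "data-002.parquet"],
}

def iceberg_snapshot_files(store):
    snap = store[store["catalog"]]         # indirection 1: the snapshot file
    files = []
    for m in snap["manifest_list"]:
        files += store[m]                  # indirection 2: each manifest
    return files

# Delta, by contrast, has no indirection tree: a reader lists _delta_log/
# and replays the JSON commits (plus the latest checkpoint) in order.
delta_log_listing = ["00000000.json", "00000001.json"]

assert iceberg_snapshot_files(iceberg_store) == ["data-001.parquet", "data-002.parquet"]
```

The trade-off the sketch shows: Iceberg pays two targeted reads per snapshot regardless of commit count; Delta pays a directory listing that grows with history until the next checkpoint.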

Concurrent writers — where the formats reveal their personality

When two writers commit to the same partition at the same time, all three formats use optimistic concurrency: each writer prepares its commit, then attempts the atomic pointer flip; if it loses the race, it re-reads the new state and retries. The differences are in what counts as a conflict.
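The loop every writer runs can be sketched with an in-memory version counter standing in for the catalog pointer; a real writer also validates that the winning commit did not touch its files before retrying, which this sketch omits:

```python
# Sketch of the optimistic-concurrency commit loop shared by all three formats.
table = {"version": 0}                 # stands in for the catalog pointer

def try_swap(expected):
    """The atomic flip: succeeds only if nobody committed since we read."""
    if table["version"] != expected:
        return False
    table["version"] = expected + 1
    return True

def commit_with_retry(max_attempts=5):
    for _ in range(max_attempts):
        seen = table["version"]        # 1) read the current snapshot
        # 2) ... write data files and manifests against `seen` here ...
        if try_swap(seen):             # 3) attempt the pointer flip
            return seen + 1
        # lost the race: loop re-reads the new state and retries
    raise RuntimeError("gave up after repeated commit conflicts")

assert commit_with_retry() == 1        # two sequential commits both succeed
assert commit_with_retry() == 2
```

Step 2 is the expensive part and is never repeated wholesale on a retry for an append: the data files already written are still valid; only the metadata is re-prepared against the new snapshot.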

Aditi's payments table receives a streaming writer (every 30 s) and a daily backfill job (overnight). On Iceberg and Delta, this is a non-issue if they don't touch the same partition. If the backfill needs to overwrite yesterday's partition while the stream is still flushing files for today, there is no conflict. Hudi handles the same scenario via separate file groups.

What can go wrong

These are the production patterns that show up at month 4 of running any of these formats.

Common confusions

Going deeper

The Iceberg v3 spec and what it adds

Iceberg v3, in flight at the time of writing, formalises deletion vectors (catching up to Delta), row-lineage (a row-id that survives compaction and updates, useful for incremental view maintenance), and variant columns (semi-structured JSON-like). The big win is row-lineage, which makes incremental query engines on top of Iceberg substantially cheaper — instead of re-deriving a row's identity from a key column, the engine reads the lineage row-id and knows whether this row existed in the previous snapshot. This is the kind of low-level primitive that unlocks CDC out of a lakehouse without bolting on a separate stream.

Delta UniForm and the "format gateway" pattern

Databricks shipped UniForm in 2023, which writes Delta data files plus an Iceberg metadata mirror in the same commit. A Delta writer can serve Iceberg readers with no rewrite. This is the first step toward "the table format wars don't matter for new tables — write once, read with anything". Whether it generalises to Hudi too is a 2026 question.

The Onehouse / Apache XTable bridge

XTable (formerly OneTable) translates metadata between Iceberg, Delta, and Hudi. A Hudi-managed table can be exposed as Iceberg metadata to a Trino reader without rewriting the data. This sits at the catalog layer, not the data layer — exactly where the differences between the formats live. For shops that picked Hudi for streaming-into-warehouse and Iceberg for analytics-on-S3 (a pattern that exists at Uber and at several Indian fintechs), XTable removes the "two copies of the data" problem.

File-group sizing — the 1 GB vs 256 MB debate

Iceberg's default target is 512 MB; Delta's OPTIMIZE defaults to 1 GB; Hudi's bucket size is configurable but commonly 128 MB. Bigger files reduce planner cost (fewer manifest entries to read) but increase rewrite cost when a single row needs to change (CoW rewrites the whole file). For tables with high update rates, smaller is better; for append-only analytics, bigger is better. There is no universal right answer — the right answer is to monitor avg_file_size, manifest_count, and rewrite_bytes_per_commit and tune.
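Those three metrics are cheap to compute from the table's file inventory. A sketch with an illustrative inventory — the field names here are assumptions for the example, not any format's metadata schema:

```python
# Sketch of the three tuning metrics the text names, computed from a file
# inventory. The inventory, manifest list, and commit records are illustrative.
files = [{"path": "a.parquet", "bytes": 400 << 20},
         {"path": "b.parquet", "bytes": 112 << 20},
         {"path": "c.parquet", "bytes": 512 << 20}]
manifests = ["manifest-1.avro", "manifest-2.avro"]
commits = [{"rewritten_bytes": 256 << 20}, {"rewritten_bytes": 0}]

avg_file_size = sum(f["bytes"] for f in files) / len(files)
manifest_count = len(manifests)
rewrite_bytes_per_commit = sum(c["rewritten_bytes"] for c in commits) / len(commits)

assert round(avg_file_size / (1 << 20)) == 341   # MB; well under a 512 MB target
```

If avg_file_size drifts down while manifest_count climbs, you have a small-files problem; if rewrite_bytes_per_commit dwarfs the bytes of new data per commit, your file size is too big for your update rate.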

The producer's cost model in one formula

For a single commit at steady state on any of the three formats:

cost = bytes_written + manifest_rewrite_cost + commit_latency
     ≈ ROWS × bytes_per_row × amplification_factor + O(num_files_in_table)

The amplification_factor is what differs across formats. For an Iceberg append it is roughly 1 (write the data, append one manifest entry). For a Delta MERGE on copy-on-write it can be 50× (you rewrite every file containing a target row). For a Hudi MoR upsert it is roughly 1 at write time and the cost is paid later at compaction. The cost model never disappears; it moves between writer-time and reader-time.
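Plugging numbers into the formula, with an assumed 100-byte row and the rough factors from the text:

```python
# Worked instance of the producer cost model. The row width is an assumption;
# the amplification factors are the rough figures quoted in the text.
def commit_cost_bytes(rows, bytes_per_row, amplification_factor):
    """Steady-state write cost; ignores the O(num_files) manifest term."""
    return rows * bytes_per_row * amplification_factor

rows, width = 1_000_000, 100                           # assumed 100-byte rows
iceberg_append = commit_cost_bytes(rows, width, 1)     # ~1x: data + one manifest entry
delta_cow_merge = commit_cost_bytes(rows, width, 50)   # ~50x: rewrite touched files

assert iceberg_append == 100_000_000        # ~100 MB written
assert delta_cow_merge == 5_000_000_000     # ~5 GB written for the same rows
```

A Hudi MoR upsert would score like the append here at write time; its missing factor reappears later as compaction work, which is the whole point of the "the cost model never disappears; it moves" observation.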

Where this leads next

The next chapter, /wiki/concurrent-writers-without-stepping-on-each-other, goes deeper on the optimistic-concurrency protocol that all three formats share, and shows how a streaming writer and a backfill writer can co-exist on the same partition without either silently dropping data. After that, /wiki/time-travel-and-zero-copy-clones-for-data-engineers covers the read-side use of snapshots — how time travel works under the hood, why it costs storage, and how zero-copy clones are implemented as metadata-only branch operations.

For the immediate prerequisite on what files actually live underneath these formats, read /wiki/parquet-end-to-end-what-you-write-what-you-get-back. For the reason small files matter to all three formats equally, read /wiki/compaction-small-files-hell-and-how-to-avoid-it.
