Copy-on-write vs merge-on-read (Iceberg vs Hudi)

A Razorpay payments engineer at 2:14 a.m. ships a hotfix that rewrites the status column on 1,200 settled transactions across the last 90 days — the auditor caught a tax-bucket misclassification, and ₹4.7 crore of late-night payouts cannot wait until morning. The 1,200 rows are scattered across 4,800 Parquet files because the table was clustered by date, not by transaction id. Two architectures answer "how do we apply this update" in opposite ways. Copy-on-write says: rewrite all 4,800 files, each one re-emitted with the corrected rows merged in. Merge-on-read says: don't touch the base files at all — write a tiny delta file per affected base file, and let every reader merge the delta on the fly. The first costs 6 GB of writes and 8 minutes; the second costs 4 MB of writes and 12 seconds, but every read for the next 24 hours pays an extra 80 ms. The right answer depends entirely on the read/write ratio of the table — and that's the choice this chapter is about.

Lakehouses store base data in immutable Parquet files. When a row needs to change, copy-on-write (CoW) re-emits the entire file with the change merged in — fast reads, slow writes, lots of write amplification. Merge-on-read (MoR) keeps the base file untouched and writes the change to a separate delta (a row-position file in Iceberg, an Avro log in Hudi); reads merge base + delta on the fly — fast writes, slower reads, periodic compaction needed. Iceberg defaults to CoW; Hudi was built MoR-first; both formats now support both modes. Choose CoW when reads outnumber writes by 10× or more (the analytics default); choose MoR when ingest is high-frequency and reads are bounded by an SLA you can tune compaction against.

What "update" actually means in an immutable file format

Parquet files are immutable on object storage — S3 has no "rewrite bytes 0x4030 to 0x40A0" API. Every change becomes a new file plus a manifest entry that says "from now on, use this file instead of the old one". Why immutability is the floor: S3, GCS, ABFS all offer object-level atomicity (PUT replaces an object atomically) but no in-place edits. Even if they did, mutating bytes inside a Parquet file would invalidate the row-group statistics, the bloom filters, and any open reader's mid-scan position. So every lakehouse format — Iceberg, Delta, Hudi — forbids in-place edits at the file level. The two architectures we're comparing differ only in which new files get written when a logical row changes.
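The file-plus-manifest-entry rule can be sketched as a toy in-memory commit. Hypothetical structures only — real formats store the manifest itself in object storage and swap it atomically:

```python
# Toy model of the only mutation primitive a lakehouse has: write a new file,
# then publish a manifest that points at it instead of the old one.
def commit_swap(manifest: dict, remove: str, add: str) -> dict:
    live = [f for f in manifest["live_files"] if f != remove] + [add]
    return {"live_files": live, "version": manifest["version"] + 1}

manifest = {"live_files": ["part-00036.parquet", "part-00037.parquet"], "version": 7}
manifest = commit_swap(manifest, "part-00037.parquet", "part-00037-NEW.parquet")
assert manifest["live_files"] == ["part-00036.parquet", "part-00037-NEW.parquet"]
assert manifest["version"] == 8
```

Everything else in this chapter — delete files, log blocks, compaction — is layered on top of this one swap primitive.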

Consider a single update: UPDATE payments SET status = 'REFUNDED' WHERE txn_id = 'TXN-93021'. That row currently lives at byte offset 12,840,512 inside data/2026/04/22/part-00037.parquet, in row-group 4, position 1,283. Two paths:

Copy-on-write rewrites part-00037.parquet end-to-end. Every row is read from the old file; the matching row gets the new status; everything is re-encoded into part-00037-NEW.parquet. The manifest swap removes the old file from the table view and adds the new one. The change is invisible to readers until the swap commits; after the swap, every read sees the corrected data with no merge logic.

Merge-on-read does not touch part-00037.parquet. Instead, it writes delta files alongside it — Iceberg V2 writes a position delete file with one row ("part-00037.parquet", 1283) plus a small data file containing the corrected row (txn_id="TXN-93021", status="REFUNDED"); Hudi MoR appends a log block to .part-00037.log.0_0-... containing the changed record keyed by record-key. Reads of part-00037 now have to: (a) read the base file, (b) read the delete file to skip position 1,283, (c) read the new data file to inject the corrected row. The base file is untouched.
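The three-step read path is easy to model. A toy sketch with hypothetical structures, not any engine's reader API:

```python
# MoR read of one base file: (a) scan the base, (b) skip positions named in the
# delete file, (c) inject rows from the new data file at the end.
def mor_read(base_rows, deleted_positions, new_rows):
    survivors = [r for pos, r in enumerate(base_rows) if pos not in deleted_positions]
    return survivors + list(new_rows)

base = [{"txn_id": f"TXN-{i}", "status": "SETTLED"} for i in range(4)]
view = mor_read(base, deleted_positions={2},
                new_rows=[{"txn_id": "TXN-2", "status": "REFUNDED"}])
assert len(view) == 4                       # row count unchanged: one masked, one injected
assert [r["status"] for r in view if r["txn_id"] == "TXN-2"] == ["REFUNDED"]
```

The merge happens in the engine on every read — which is exactly the cost MoR defers instead of paying up front.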

[Figure: One row update — two architectures. Top panel (copy-on-write): part-00037.parquet (512 MB, 2.1M rows) is fully rewritten into part-00037-NEW.parquet and the old file is marked for deletion — ~512 MB written per row update, zero read overhead, huge write amplification. Bottom panel (merge-on-read): the base file is untouched; an ~80 B position delete file and an ~240 B update file are written beside it — ~320 B written per row update, but every reader merges the three sources on the fly, and compaction must run periodically. The merge happens in the engine (Spark, Trino, Flink) — readers do not see the raw layered files.]
The same row update under two strategies. CoW pays its cost up front in writes — 512 MB rewritten to change one row — and serves cheap reads forever. MoR defers the cost: tiny delta files capture the change instantly, but every subsequent read of `part-00037` has to merge three sources. Compaction periodically folds the deltas back into a fresh base file, returning the table to clean CoW-like read performance.

The decision is not "which one is better" — it's "which cost can your workload tolerate". A nightly DWH that gets 200 batch updates and serves 50 million read queries the next day wants CoW: pay 200 file rewrites once, get zero overhead on 50M reads. A streaming ingestion table that gets 10,000 row-level changes per minute and is queried by a dashboard that refreshes every 30 seconds wants MoR: pay 10k tiny writes per minute, accept 100 ms of merge overhead per query.

Iceberg's V2 deletes — position vs equality

Iceberg V2 spec (released 2021) introduced two delete file types, and the difference matters in practice.

A position delete file is keyed by (file_path, row_position). To delete a row, the writer needs to know exactly where it lives. This is cheap to apply at read time — the engine maintains a per-base-file bitmap of deleted positions and skips them during scan — but expensive to produce, because the writer must look up the position by primary key first. Iceberg uses position deletes for DELETE and MERGE INTO operations where the engine has already located the rows (it scanned the base file to find them).

An equality delete file is keyed by a column-value tuple, e.g. (txn_id="TXN-93021"). The writer doesn't need the position; just emit the key. This is cheap to produce — useful for streaming DELETE WHERE predicates from a CDC source — but expensive to apply at read time, because every base file's rows must be checked against every equality key for a match. Iceberg recommends compacting equality deletes back into position deletes (or into the base file) on a regular cadence to keep read cost bounded.
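The practical difference is scope, which a toy scan makes visible. Hypothetical file names and small row positions for readability:

```python
# Position deletes are keyed to one base file; equality deletes must be
# consulted by every file's scan until they are compacted away.
position_deletes = {"part-00037.parquet": {1}}   # (file_path, row_position)
equality_deletes = {"TXN-93021"}                 # column-value keys, table-wide

def scan_file(path, rows):
    scoped = position_deletes.get(path, set())   # empty for files the delete doesn't name
    survivors = []
    for pos, row in enumerate(rows):
        if pos in scoped:
            continue                             # cheap: only named files pay anything
        if row["txn_id"] in equality_deletes:
            continue                             # paid by every row of every file
        survivors.append(row)
    return survivors

rows = [{"txn_id": "TXN-1"}, {"txn_id": "TXN-2"}, {"txn_id": "TXN-93021"}]
assert len(scan_file("part-00037.parquet", rows)) == 1   # position + equality both applied
assert len(scan_file("part-00099.parquet", rows)) == 2   # equality key still checked here
```

The second assertion is the whole argument for compacting equality deletes: a file the delete never touched still pays per-row key checks until compaction converts the equality delete into positions.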

For updates (not just deletes), Iceberg V2 represents the update as a delete + insert pair: a position delete file marking the old row's position, plus a regular data file containing the new row. Reads see "old row deleted, new row inserted" and present the merged view. This is the format that the Razorpay scenario at the start would actually produce.

The Iceberg V3 spec (currently in draft, expected GA 2026) adds deletion vectors — bitmap files that pack many position deletes into a single Roaring bitmap per base file, much cheaper to scan than the equality-key list approach. Deletion vectors are roughly 50× smaller than position delete files at the same delete count, and 100× faster to apply at read time. Delta Lake adopted them first (2023); Iceberg V3 is bringing the same primitive.
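The primitive is easy to model with a plain Python int as the bitmap. Real deletion vectors use Roaring bitmaps, which compress runs of set bits; the apply-side check has the same shape:

```python
# A deletion vector sketched as an integer bitmap: bit N set = row N deleted.
def set_deleted(vector: int, pos: int) -> int:
    return vector | (1 << pos)

def is_deleted(vector: int, pos: int) -> bool:
    return (vector >> pos) & 1 == 1

vec = 0
for pos in (3, 1283, 1284, 1285):       # contiguous runs are what Roaring compresses well
    vec = set_deleted(vec, pos)

assert is_deleted(vec, 1283) and not is_deleted(vec, 1282)
assert sum(1 for p in range(2000) if is_deleted(vec, p)) == 4
```

Compare this single-bit-per-row check against the equality-delete scan above: the vector is applied as a constant-time filter inside the Parquet scan, which is where the read-time speedup comes from.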

Hudi's MoR layout — log files and the timeline

Hudi (started at Uber in 2016) was built MoR-first, before Iceberg or Delta existed. Its layout is meaningfully different from Iceberg V2's position/equality split — instead of per-operation delete files, Hudi appends to per-base-file log files containing the changed records.

For each base file part-00037.parquet, Hudi maintains zero or more log files: .part-00037.log.0_0-..., .part-00037.log.1_0-..., etc. Each log block is a small Avro file (or Parquet block, in newer versions) containing the records that changed in that "file group" since the last compaction. Records are keyed by the table's record key (a primary key column the user declared at table creation), and the log entries say "for record key K, the new value is V" — an upsert, not a position-based delete-and-insert.

Reads merge base + log files using the record key: scan the base, scan the log, for any record key present in both, the log wins. Hudi also tracks a global timeline of every commit, compaction, and clean operation — an ordered log of metadata at the table level, separate from the per-file logs. Time-travel queries pick a timeline timestamp and reconstruct the table view as of that instant.
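The merge rule — scan base, apply logs in timeline order, log wins — is a few lines. A toy sketch, not Hudi's actual reader:

```python
# Snapshot read of one file group: base records overlaid with log blocks,
# applied in deltacommit order so later entries win for the same record key.
def snapshot_read(base: dict, logs: list) -> dict:
    view = dict(base)                     # record_key -> record
    for log in logs:                      # logs ordered by timeline instant
        for key, record in log:
            view[key] = record            # log wins over base; later log wins over earlier
    return view

base = {"K1": "SETTLED", "K2": "SETTLED"}
log0 = [("K1", "DISPUTED")]                             # deltacommit t=10:15
log1 = [("K1", "REFUNDED"), ("K2", "DISPUTED")]         # deltacommit t=10:42
assert snapshot_read(base, [log0, log1]) == {"K1": "REFUNDED", "K2": "DISPUTED"}
```

Note that K1's intermediate DISPUTED state is invisible to a snapshot read — only the timeline (and time-travel queries against it) retains the history.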

[Figure: Hudi MoR — one file group, the timeline, and compaction. Left: base file part-00037.parquet (512 MB, 2.1M records, immutable, written at t=10:00) accumulates log.0 (t=10:15, 320 records updated), log.1 (t=10:42, 1,240 records), and log.2 (t=11:08, 85 records); a read merges base + all three logs by record key, log wins. Right: the table-level timeline records each instant — commit (10:00), three deltacommits, a compaction at 11:30 that merges base + logs into a new base with all updates baked in, and a clean at 11:45 that deletes the logs.]
Hudi keeps the base file pinned and appends log files. A read merges all of them by record key. The timeline records every operation as a separate instant — `commit`, `deltacommit`, `compaction`, `clean` — and acts as the source of truth for what's "current" and what's available for time-travel. Compaction is the load-bearing periodic operation: without it, log files accumulate forever and read latency degrades linearly.

The timeline gives Hudi two read modes that Iceberg V2 doesn't make as explicit, plus a query type built on the timeline itself. Read-optimised queries scan only the base files: zero merge cost, results stale up to the last compaction. Snapshot queries merge base + logs on the fly: fresh to the latest deltacommit, full merge cost on every read. And incremental queries replay the timeline, returning only the records that changed after a given instant — the table doubles as a change stream.

That last mode is why Uber, Robinhood, and ByteDance picked Hudi over Iceberg for their high-write tables: Hudi tables can replace Kafka for downstream change streams, eliminating an entire system.
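A toy model of the query shapes over one file group. Hypothetical structures — real Hudi resolves all of this through the timeline metadata:

```python
# One file group: base records tagged with their commit instant, plus log entries.
base = {"K1": ("SETTLED", 1000), "K2": ("SETTLED", 1000)}    # key -> (value, commit_ts)
logs = [("K1", "REFUNDED", 1015), ("K2", "DISPUTED", 1042)]  # deltacommits, timeline order

def read_optimized():
    # base files only: fast, but stale up to the last compaction
    return {k: v for k, (v, _) in base.items()}

def snapshot():
    # base + logs merged on the fly: fresh, pays the merge cost
    view = read_optimized()
    for k, v, _ in logs:
        view[k] = v
    return view

def incremental(since_ts):
    # only records changed after an instant: a Kafka-like change stream
    return [(k, v) for k, v, ts in logs if ts > since_ts]

assert read_optimized() == {"K1": "SETTLED", "K2": "SETTLED"}
assert snapshot() == {"K1": "REFUNDED", "K2": "DISPUTED"}
assert incremental(1015) == [("K2", "DISPUTED")]
```

The incremental shape is the Kafka replacement: downstream consumers checkpoint the last instant they processed and pull only what changed since.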

A working CoW vs MoR simulation, in code

The mechanics are easier to see in 80 lines of Python than in any vendor doc. Here's a minimal model showing the write/read cost tradeoff for both strategies.

# cow_vs_mor.py — minimal model of two update strategies on a Parquet-like table.
# Demonstrates: write amplification under CoW, read amplification under MoR,
# and how compaction returns MoR to CoW-like read cost.

import random
from dataclasses import dataclass, field

@dataclass
class BaseFile:
    file_id: str
    rows: dict           # record_key -> record
    size_mb: float = 0.0

@dataclass
class LogFile:
    base_file_id: str
    updates: list = field(default_factory=list)   # list of (key, new_record)
    size_kb: float = 0.0

@dataclass
class Stats:
    bytes_written: int = 0
    bytes_read: int = 0
    write_ops: int = 0
    read_ops: int = 0

class CoWTable:
    def __init__(self, base_files):
        self.base_files = list(base_files)
        self.stats = Stats()
    def update(self, key, new_record):
        for i, f in enumerate(self.base_files):
            if key in f.rows:
                # rewrite the entire base file with the change merged in
                new_rows = dict(f.rows); new_rows[key] = new_record
                self.base_files[i] = BaseFile(f.file_id + "_v2", new_rows, f.size_mb)
                self.stats.bytes_written += int(f.size_mb * 1024 * 1024)
                self.stats.write_ops += 1
                return
    def read(self, key):
        for f in self.base_files:
            self.stats.bytes_read += int(f.size_mb * 1024 * 1024)
            self.stats.read_ops += 1
            if key in f.rows: return f.rows[key]
        return None

class MoRTable:
    def __init__(self, base_files):
        self.base_files = list(base_files)
        self.logs = {f.file_id: LogFile(f.file_id) for f in base_files}
        self.stats = Stats()
    def update(self, key, new_record):
        for f in self.base_files:
            if key in f.rows:
                # append a tiny log entry; base file untouched
                self.logs[f.file_id].updates.append((key, new_record))
                self.logs[f.file_id].size_kb += 0.3   # ~300 B per log entry
                self.stats.bytes_written += 300
                self.stats.write_ops += 1
                return
    def read(self, key):
        for f in self.base_files:
            self.stats.bytes_read += int(f.size_mb * 1024 * 1024)
            self.stats.bytes_read += int(self.logs[f.file_id].size_kb * 1024)  # merge log
            self.stats.read_ops += 1
            if key in f.rows:
                # the log wins if it has a newer entry for this key — scan newest-first
                for k, v in reversed(self.logs[f.file_id].updates):
                    if k == key: return v
                return f.rows[key]
        return None
    def compact(self):
        for i, f in enumerate(self.base_files):
            log = self.logs[f.file_id]
            if not log.updates: continue
            new_rows = dict(f.rows)
            for k, v in log.updates: new_rows[k] = v
            self.base_files[i] = BaseFile(f.file_id + "_compacted", new_rows, f.size_mb)
            self.stats.bytes_written += int(f.size_mb * 1024 * 1024)  # rewrite during compaction
            self.logs[f.file_id] = LogFile(f.file_id)

# 100 base files, 100 rows each (stand-ins for ~2.1M), 512 MB nominal size — Razorpay-shaped table
random.seed(42)
base = [BaseFile(f"part-{i:05d}", {f"k{i}-{j}": j for j in range(100)}, 512.0) for i in range(100)]
cow, mor = CoWTable(base), MoRTable(base)

# Workload: 200 updates and 5000 reads
keys = [f"k{random.randint(0,99)}-{random.randint(0,99)}" for _ in range(200)]
for k in keys:
    cow.update(k, "REFUNDED"); mor.update(k, "REFUNDED")
for _ in range(5000):
    k = f"k{random.randint(0,99)}-{random.randint(0,99)}"
    cow.read(k); mor.read(k)

print(f"CoW: writes={cow.stats.bytes_written/1e9:.2f} GB, reads={cow.stats.bytes_read/1e9:.1f} GB")
print(f"MoR: writes={mor.stats.bytes_written/1e3:.2f} KB, reads={mor.stats.bytes_read/1e9:.1f} GB")
mor.compact()
print(f"MoR after compaction: extra writes={mor.stats.bytes_written/1e9:.2f} GB total")

# Approximate output (seed 42). CoW and MoR read the same base bytes; the toy
# read path scans files linearly until the key is found, so read totals are
# huge — real engines prune to one file via manifest statistics.
# CoW: writes=107.37 GB  (200 updates x one 512 MiB file rewrite each)
# MoR: writes=60.00 KB   (200 updates x ~300 B log entries)
# MoR after compaction: ~47 GB total (one 512 MiB rewrite per touched file group;
#                       ~87 of 100 groups are expected to be touched by 200 random updates)

Walk the load-bearing pieces. CoWTable.update finds the base file holding the key, copies the whole rows dict with the change applied, and charges the full file size to bytes_written — that per-update, whole-file rewrite is the write amplification. MoRTable.update appends a ~300 B tuple to the file group's log and charges only those bytes. MoRTable.read pays for the base file plus its log on every scan, and the log wins when it holds an entry for the key — that's the read-side merge. compact folds each log into a fresh base file, charges one more full-file rewrite, and resets the log — exactly the operation that returns MoR to CoW-like read cost.

Real Iceberg and Hudi add layers this model skips — concurrency control via the previous chapter's CAS protocol, manifest pruning, schema evolution under updates, time-travel — but the cost shape is identical to what these 80 lines show.

Picking CoW or MoR in production

Three workload signals decide it. Don't pick on vibes; measure these.

1. Updates per minute vs. reads per minute. If reads outnumber updates 10:1 or more, CoW wins because the per-read merge cost MoR adds, multiplied by the read rate, exceeds the per-update rewrite cost CoW imposes. If updates outnumber reads, or the write rate is unbounded (streaming ingest), MoR wins because CoW's write amplification will eventually saturate your storage bandwidth. The 10:1 threshold is approximate — Databricks, Tabular, and Onehouse all publish guidance in this neighbourhood.

2. The fraction of files an update touches. A MERGE INTO that affects 0.01% of rows scattered across 30% of files (the Razorpay 1,200-row hotfix) is a CoW disaster — you rewrite 30% of your table to fix 0.01% of the data. The same MERGE INTO under MoR writes a few KB and is done in seconds. But a MERGE INTO that affects 5% of rows clustered in 5% of files is fine under either strategy: CoW's amplification is bounded, MoR's logs are small. The pivot is the ratio of files-affected to rows-affected. Use partition-level statistics from your previous-day's audit to predict this for tomorrow's hotfix.

3. The acceptable read latency SLA. Dashboards with 200 ms p99 SLA and queries that scan 50 files will not tolerate the +10 ms per file MoR adds — that's +500 ms, blown SLA. CoW protects you here. But if your reads are minute-scale ETL jobs (downstream pipelines, scheduled refreshes), the merge overhead vanishes into the noise. MoR is fine. Why latency-sensitive serving layers pick CoW: the merge step is unavoidable additional I/O — even if the log file is in the page cache, the merge logic costs CPU and adds branches to the hot path. A query that scans 200 files under CoW does 200 reads; the same query under MoR does 200 base reads + up to 200 log reads + a merge join per file pair. For sub-second SLAs there is no room for that overhead.
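Signal 1's arithmetic is worth making concrete. A minimal break-even sketch, where every figure (file size, delta size, per-read merge I/O) is an assumption to replace with your own measurements:

```python
# Daily I/O cost of each strategy under a simple linear cost model.
def daily_cost_gb(updates_per_day, reads_per_day, *,
                  file_mb=512, files_per_update=1,
                  delta_kb=0.3, merge_read_mb=4):
    cow = updates_per_day * files_per_update * file_mb / 1024            # GB rewritten
    mor_writes = updates_per_day * delta_kb / (1024 * 1024)              # GB of deltas
    mor_reads = reads_per_day * merge_read_mb / 1024                     # GB of extra merge I/O
    return cow, mor_writes + mor_reads

cow, mor = daily_cost_gb(updates_per_day=200, reads_per_day=50_000_000)
assert cow < mor    # read-heavy DWH: pay 200 rewrites once, skip 50M merge penalties

cow, mor = daily_cost_gb(updates_per_day=1_000_000, reads_per_day=50_000)
assert mor < cow    # write-heavy ingest: tiny deltas beat a million file rewrites
```

The crossover point is exactly the 10:1-ish threshold the vendors publish — it moves with your file sizes and merge overhead, which is why you measure rather than pick on vibes.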

A Swiggy data engineering team in 2025 ran the math for their order_events table — 8 lakh updates/minute peak (during dinner rush), 50,000 read queries/minute from analyst dashboards, average update touched 0.3% of files. The 16:1 update/read ratio with the high-fanout update pattern made MoR the clear winner: CoW would have rewritten ~120 TB/hour during dinner peak. They picked Hudi MoR with hourly compactions. The read-optimised mode handled the executive dashboards (5-minute staleness was fine); the snapshot mode handled support team queries that needed real-time visibility into refunds. Compaction at 2:30 a.m. took 90 minutes and rewrote the entire table — predictable, off-peak, manageable.

The opposite case: Razorpay's monthly_settlements table — 2,000 updates/month (auditor adjustments, dispute resolutions), 3 lakh read queries/day from finance dashboards. The 4500:1 read/write ratio screams CoW. The team uses Iceberg V2 in CoW mode; the few dozen rewrites per day are a non-event, and dashboard queries pay zero merge overhead.

Common confusions

"MoR means slow reads, period." No — compaction periodically folds the logs back into fresh base files, so the read penalty is bounded by your compaction cadence, not permanent. "CoW vs MoR is the same choice as Iceberg vs Hudi." No — both formats now support both modes; the defaults differ because the formats were born into different workloads. "Deletion vectors make the distinction obsolete." Only for deletes — updates still need a separate insert file, so the read/write asymmetry still decides the mode.

Going deeper

The deletion-vector primitive that's converging both formats

Delta Lake (2023) and Iceberg V3 (in-progress) both adopt the same primitive: a per-base-file Roaring Bitmap marking deleted positions. A 10 million-row file's deletion vector is ~50 KB even with 100k positions deleted (vs. 800 KB for a position-delete-file representation). Reads apply the bitmap as a filter during the Parquet scan — single instruction per row in vectorised SIMD code paths. Deletion vectors collapse the CoW/MoR distinction for the delete case: you write the vector instead of rewriting the file (MoR-style write cost), and reads skip the deleted rows in O(1) per row (CoW-style read cost). Updates still need a separate insert file. Hudi's roadmap includes deletion-vector support as well; the three formats are converging on the same physical primitive even as they keep different metadata layers above it.

Concurrent updates, MVCC, and the timeline conflict model

The previous chapter on optimistic concurrency covered how two writers race via CAS on the manifest. For MoR specifically, the conflict model is gentler: two writers updating the same file group can both succeed if they write to separate log files. The merge step at read time will see both logs and apply them in timeline order. This is why Hudi can sustain 50+ concurrent writers per table where Iceberg CoW typically tops out at 5–10 (CoW writers fight over the same base file rewrite). The catch: ordering matters. If writer A writes status=PAID and writer B writes status=REFUNDED in overlapping log entries, the timeline order — determined by the commit timestamps the writers obtain — is what readers see. Hudi coordinates those commits through its optimistic concurrency control, backed by an external lock provider such as ZooKeeper or DynamoDB.
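A toy model of why separate log files don't conflict: readers sort entries by commit timestamp, so arrival order is irrelevant. Hypothetical structures, not Hudi's actual merge implementation:

```python
# Two writers append to separate log files for the same file group. Readers
# apply all entries in timeline (commit-timestamp) order — the later commit wins.
log_a = [("TXN-93021", "PAID", 101)]        # writer A, commit instant 101
log_b = [("TXN-93021", "REFUNDED", 102)]    # writer B, commit instant 102

def merged_view(base, *logs):
    view = dict(base)
    entries = sorted((e for log in logs for e in log), key=lambda e: e[2])
    for key, value, _ in entries:
        view[key] = value
    return view

# Passing log_b before log_a: arrival order doesn't matter, timeline order does.
view = merged_view({"TXN-93021": "SETTLED"}, log_b, log_a)
assert view == {"TXN-93021": "REFUNDED"}
```

Under CoW, the same two writers would race to rewrite the same base file, and one would lose the CAS and retry — which is the whole concurrency gap between the modes.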

Why Apple, Netflix, Uber picked different defaults

Iceberg came out of Netflix's offline analytics workload (2018, Ryan Blue's team): mostly batch updates, mostly read-heavy, GDPR delete spikes. The CoW-default fit. Hudi came out of Uber's transactional ingest workload (2016, Vinoth Chandar's team): high-frequency upserts from CDC streams, real-time dashboards. The MoR-default fit. Apple's "Photos on iCloud" metadata layer adopted Iceberg around 2022 for its analyst-facing query patterns; ByteDance's TikTok recommendation pipeline runs on Hudi for its 100M+ records/sec ingest rate. The three companies validated that the format choice maps to the workload's read/write asymmetry, not to engineering preference. The practical lesson: don't pick Iceberg or Hudi by reading vendor blogs; pick by computing your read/write ratio and your update fan-out.

Compaction strategies and their tradeoffs

There are four strategies in production. Inline compaction runs on every Nth commit, blocking the writer briefly — predictable, but adds latency to writes. Async compaction runs in a separate process, polling for files that meet a compaction threshold — non-blocking, but requires a second cluster. Scheduled compaction runs at fixed times (2 a.m. nightly) — simple, but creates a stampede if many tables compact at once. Adaptive compaction triggers when read latency degrades past a threshold — most complex, requires query-side telemetry, but matches cost to actual harm. Onehouse and Databricks both ship adaptive compaction; most open-source deployments stick to scheduled-async. Why scheduled compaction is the default despite being suboptimal: it's the simplest to reason about — you know exactly when the cost hits, you can size the cluster for it, you can debug failures in business hours. Adaptive triggers are powerful but introduce a new failure mode: "the compaction job that should have run never did because the trigger never fired". Operational simplicity wins for most teams.
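An adaptive trigger is just a predicate over telemetry. A minimal sketch, where the threshold values and telemetry fields are assumptions, not any engine's API:

```python
# Fire compaction when measured read amplification (log bytes merged per base
# byte scanned) or observed read latency crosses a threshold.
def should_compact(base_mb: float, log_mb: float, p99_read_ms: float,
                   *, amp_threshold=0.05, latency_sla_ms=200) -> bool:
    read_amplification = log_mb / base_mb
    return read_amplification > amp_threshold or p99_read_ms > latency_sla_ms

assert not should_compact(base_mb=512, log_mb=4, p99_read_ms=90)    # healthy file group
assert should_compact(base_mb=512, log_mb=80, p99_read_ms=90)       # logs piled up
assert should_compact(base_mb=512, log_mb=4, p99_read_ms=350)       # SLA breached
```

The failure mode the text warns about lives in how this predicate gets evaluated: if the telemetry pipeline feeding p99_read_ms or log_mb silently breaks, the trigger never fires and the logs grow unbounded — which is why scheduled compaction's "it runs at 2 a.m. no matter what" is operationally safer.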

Where this leads next

The next chapter, /wiki/time-travel-and-snapshot-expiration, covers the manifest history and snapshot retention machinery that both CoW and MoR depend on. CoW makes time travel cheap (every old version is a complete file); MoR makes it nuanced (you need both the historic base and the historic logs that were valid at that instant).

After that the build moves into /wiki/the-trino-spark-dremio-duckdb-quadrant — query engines all consume the same manifest layer, but they differ in how aggressively they push merge logic down to the storage tier. Trino has the most mature MoR push-down; DuckDB has the least, but is gaining quickly. Engine choice can flip a CoW-leaning workload into a MoR-leaning one.

Build 14 (real-time analytics) revisits this exact tradeoff at a different layer: ClickHouse and Pinot offer their own segment-level upsert primitives (ClickHouse's ReplacingMergeTree, Pinot's upserts) that solve the same problem with different physical layouts. The mental model from this chapter ports directly: which side of the read/write asymmetry are you on, and what compaction discipline can you afford?
