Copy-on-write vs merge-on-read (Iceberg vs Hudi)
A Razorpay payments engineer at 2:14 a.m. ships a hotfix that rewrites the status column on 1,200 settled transactions across the last 90 days — the auditor caught a tax-bucket misclassification, and ₹4.7 crore of late-night payouts cannot wait until morning. The 1,200 rows are scattered across 4,800 Parquet files because the table was clustered by date, not by transaction id. Two architectures answer "how do we apply this update" in opposite ways. Copy-on-write says: rewrite all 4,800 files, each one re-emitted with the corrected rows merged in. Merge-on-read says: don't touch the base files at all — write a tiny delta file per affected base file, and let every reader merge the delta on the fly. The first costs 6 GB of writes and 8 minutes; the second costs 4 MB of writes and 12 seconds, but every read for the next 24 hours pays an extra 80 ms. The right answer depends entirely on the read/write ratio of the table — and that's the choice this chapter is about.
Lakehouses store base data in immutable Parquet files. When a row needs to change, copy-on-write (CoW) re-emits the entire file with the change merged in — fast reads, slow writes, lots of write amplification. Merge-on-read (MoR) keeps the base file untouched and writes the change to a separate delta (a row-position file in Iceberg, an Avro log in Hudi); reads merge base + delta on the fly — fast writes, slower reads, periodic compaction needed. Iceberg defaults to CoW; Hudi was built MoR-first; both formats now support both modes. Choose CoW when reads outnumber writes by 10× or more (the analytics default); choose MoR when ingest is high-frequency and reads are bounded by an SLA you can tune compaction toward.
What "update" actually means in an immutable file format
Parquet files are immutable on object storage — S3 has no "rewrite bytes 0x4030 to 0x40A0" API. Every change becomes a new file plus a manifest entry that says "from now on, use this file instead of the old one". Why immutability is the floor: S3, GCS, ABFS all offer object-level atomicity (PUT replaces an object atomically) but no in-place edits. Even if they did, mutating bytes inside a Parquet file would invalidate the row-group statistics, the bloom filters, and any open reader's mid-scan position. So every lakehouse format — Iceberg, Delta, Hudi — forbids in-place edits at the file level. The two architectures we're comparing differ only in which new files get written when a logical row changes.
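A four-line sketch of that primitive, with hypothetical structures (real formats use snapshot metadata and manifest lists, covered in the previous chapter):

# update = new file + manifest pointer swap (hypothetical structures, not any
# format's real metadata layout)
table = {"current_manifest": ["part-00036.parquet", "part-00037.parquet"]}

def commit_replacement(table, old_file, new_file):
    # the new file was already written; the commit is just swapping the pointer,
    # which real table formats do atomically via CAS on the table metadata
    files = [f for f in table["current_manifest"] if f != old_file]
    table["current_manifest"] = files + [new_file]

commit_replacement(table, "part-00037.parquet", "part-00037-NEW.parquet")
print(table["current_manifest"])  # ['part-00036.parquet', 'part-00037-NEW.parquet']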
Consider a single update: UPDATE payments SET status = 'REFUNDED' WHERE txn_id = 'TXN-93021'. That row currently lives at byte offset 12,840,512 inside data/2026/04/22/part-00037.parquet, in row-group 4, position 1,283. Two paths:
Copy-on-write rewrites part-00037.parquet end-to-end. Every row is read from the old file; the matching row gets the new status; everything is re-encoded into part-00037-NEW.parquet. The manifest swap removes the old file from the table view and adds the new one. The change is invisible to readers until the swap commits; after the swap, every read sees the corrected data with no merge logic.
Merge-on-read does not touch part-00037.parquet. Instead, it writes a delta alongside it — Iceberg V2 represents this as a position delete file with one row ("part-00037.parquet", 1283) plus a new data file containing the corrected row; Hudi MoR writes a log block in .part-00037.log.0_0-... containing the changed record keyed by record key. Reads of part-00037 now have to: (a) read the base file, (b) read the delete file to skip position 1,283, (c) read the new data file (or log block) to pick up the corrected row. The base file is untouched.
The decision is not "which one is better" — it's "which cost can your workload tolerate". A nightly DWH that gets 200 batch updates and serves 50 million read queries the next day wants CoW: pay 200 file rewrites once, get zero overhead on 50M reads. A streaming ingestion table that gets 10,000 row-level changes per minute and is queried by a dashboard that refreshes every 30 seconds wants MoR: pay 10k tiny writes per minute, accept 100 ms of merge overhead per query.
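To make those two workloads concrete, a back-of-envelope sketch with assumed costs (512 MB per rewritten file, ~300 B per log entry, ~100 ms of merge overhead per query — illustrative numbers, not benchmarks):

def daily_cost(updates, reads, rewrite_mb=512, log_bytes=300, merge_ms=100):
    cow_write_gb = updates * rewrite_mb / 1024       # CoW: one file rewrite per update
    mor_write_mb = updates * log_bytes / 1e6         # MoR: one tiny log entry per update
    mor_merge_hours = reads * merge_ms / 3.6e6       # MoR: merge tax summed over all reads
    return cow_write_gb, mor_write_mb, mor_merge_hours

# nightly DWH: 200 updates, 50M reads
print(daily_cost(updates=200, reads=50_000_000))
# -> (100.0 GB of CoW writes, 0.06 MB of MoR logs, ~1389 aggregate hours of merge tax)

# streaming table: 10k updates/min all day, a dashboard read every 30 s
print(daily_cost(updates=10_000 * 60 * 24, reads=2 * 60 * 24))
# -> (~7.2 PB of CoW writes, ~4.3 GB of MoR logs, negligible merge tax)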
Iceberg's V2 deletes — position vs equality
The Iceberg V2 spec (released 2021) introduced two delete file types, and the difference matters in practice.
A position delete file is keyed by (file_path, row_position). To delete a row, the writer needs to know exactly where it lives. This is cheap to apply at read time — the engine maintains a per-base-file bitmap of deleted positions and skips them during scan — but expensive to produce, because the writer must look up the position by primary key first. Iceberg uses position deletes for DELETE and MERGE INTO operations where the engine has already located the rows (it scanned the base file to find them).
An equality delete file is keyed by a column-value tuple, e.g. (txn_id="TXN-93021"). The writer doesn't need the position; just emit the key. This is cheap to produce — useful for streaming DELETE WHERE predicates from a CDC source — but expensive to apply at read time, because every base file's rows must be checked against every equality key for a match. Iceberg recommends compacting equality deletes back into position deletes (or into the base file) on a regular cadence to keep read cost bounded.
For updates (not just deletes), Iceberg V2 represents the update as a delete + insert pair: a position delete file marking the old row's position, plus a regular data file containing the new row. Reads see "old row deleted, new row inserted" and present the merged view. These are exactly the files the Razorpay hotfix at the start would produce.
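A sketch of the read-side resolution, assuming simplified in-memory shapes for the delete files (the real spec stores them as Parquet/Avro files with defined schemas):

def scan_with_deletes(file_path, base_rows, position_deletes, equality_deletes):
    # position deletes: (file_path, row_position) pairs -> O(1) check per row
    deleted = {pos for path, pos in position_deletes if path == file_path}
    for pos, row in enumerate(base_rows):
        if pos in deleted:
            continue
        # equality deletes: key tuples -> every surviving row is checked against
        # every key, which is why unconverted equality deletes degrade reads
        if any(all(row.get(c) == v for c, v in key.items()) for key in equality_deletes):
            continue
        yield row

base = [{"txn_id": "TXN-93020", "status": "SETTLED"},
        {"txn_id": "TXN-93021", "status": "SETTLED"}]
live = scan_with_deletes("part-00037.parquet", base,
                         position_deletes={("part-00037.parquet", 1)},  # the update's delete half
                         equality_deletes=[{"txn_id": "TXN-93020"}])    # a CDC-style delete
print(list(live))  # [] — the insert half of the update lives in a separate data file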
The Iceberg V3 spec (currently in draft, expected GA 2026) adds deletion vectors — bitmap files that pack many position deletes into a single Roaring bitmap per base file, much cheaper to scan than the equality-key list approach. Deletion vectors are roughly 50× smaller than position delete files at the same delete count, and 100× faster to apply at read time. Delta Lake adopted them first (2023); Iceberg V3 is bringing the same primitive.
Hudi's MoR layout — log files and the timeline
Hudi (started at Uber in 2016) was built MoR-first, before Iceberg or Delta existed. Its layout is meaningfully different from Iceberg V2's position/equality split — instead of per-operation delete files, Hudi appends to per-base-file log files containing the changed records.
For each base file part-00037.parquet, Hudi maintains zero or more log files: .part-00037.log.0_0-..., .part-00037.log.1_0-..., etc. Each log block is a small Avro file (or Parquet block, in newer versions) containing the records that changed in that "file group" since the last compaction. Records are keyed by the table's record key (a primary key column the user declared at table creation), and the log entries say "for record key K, the new value is V" — an upsert, not a position-based delete-and-insert.
Reads merge base + log files using the record key: scan the base, scan the log, for any record key present in both, the log wins. Hudi also tracks a global timeline of every commit, compaction, and clean operation — an ordered log of metadata at the table level, separate from the per-file logs. Time-travel queries pick a timeline timestamp and reconstruct the table view as of that instant.
The timeline gives Hudi three read modes that Iceberg V2 doesn't make as explicit:
- Snapshot read: returns the merged view as of the latest timeline instant. This is the default and is what BI dashboards use.
- Read-optimised (RO) read: ignores log files; returns only the base files as of the latest compaction instant. This is faster (no merge cost) but stale by however long it's been since the last compaction. Useful for queries where 5–10 minutes of staleness is fine.
- Incremental read: returns only the records that changed between two timeline instants. This makes Hudi a CDC source for downstream pipelines — every update flowing into the table is replayable as a stream.
That last mode is why Uber, Robinhood, and ByteDance picked Hudi over Iceberg for their high-write tables: Hudi tables can replace Kafka for downstream change streams, eliminating an entire system.
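A toy version of the timeline and the three read modes, assuming a dict-shaped base and (timestamp, key, value) log entries rather than Hudi's actual on-disk layout:

# state at the last compaction instant (t=100), plus two commits since
base = {"TXN-1": "PAID", "TXN-2": "PAID"}
timeline = [(110, "TXN-1", "REFUNDED"), (120, "TXN-2", "DISPUTED")]

def snapshot_read(as_of):
    view = dict(base)                  # merged view: base overlaid with log entries
    for ts, key, val in timeline:
        if ts <= as_of:
            view[key] = val
    return view

def read_optimised():
    return dict(base)                  # base files only: fast, but stale since t=100

def incremental_read(since, until):
    return [(k, v) for ts, k, v in timeline if since < ts <= until]  # a CDC stream

print(snapshot_read(120))          # {'TXN-1': 'REFUNDED', 'TXN-2': 'DISPUTED'}
print(read_optimised())            # {'TXN-1': 'PAID', 'TXN-2': 'PAID'}
print(incremental_read(100, 115))  # [('TXN-1', 'REFUNDED')]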
A working CoW vs MoR simulation, in code
The mechanics are easier to see in ~80 lines of Python than in any vendor doc. Here's a minimal model of the write/read cost tradeoff for both strategies.
# cow_vs_mor.py — minimal model of two update strategies on a Parquet-like table.
# Demonstrates: write amplification under CoW, read amplification under MoR,
# and how compaction returns MoR to CoW-like read cost.
import random
from dataclasses import dataclass, field

MB = 1024 * 1024

@dataclass
class BaseFile:
    file_id: str
    rows: dict                 # record_key -> record
    size_mb: float = 0.0

@dataclass
class LogFile:
    base_file_id: str
    updates: list = field(default_factory=list)  # list of (key, new_record)
    size_kb: float = 0.0

@dataclass
class Stats:
    bytes_written: int = 0
    bytes_read: int = 0
    write_ops: int = 0
    read_ops: int = 0

class CoWTable:
    def __init__(self, base_files):
        self.base_files = list(base_files)
        self.stats = Stats()

    def update(self, key, new_record):
        for i, f in enumerate(self.base_files):
            if key in f.rows:
                # rewrite the entire base file with the change merged in
                new_rows = dict(f.rows)
                new_rows[key] = new_record
                self.base_files[i] = BaseFile(f.file_id + "_v2", new_rows, f.size_mb)
                self.stats.bytes_written += int(f.size_mb * MB)
                self.stats.write_ops += 1
                return

    def read(self, key):
        # manifest pruning locates the one file holding the key, so a point
        # read scans exactly one base file
        for f in self.base_files:
            if key in f.rows:
                self.stats.bytes_read += int(f.size_mb * MB)
                self.stats.read_ops += 1
                return f.rows[key]
        return None

class MoRTable:
    def __init__(self, base_files):
        self.base_files = list(base_files)
        self.logs = {f.file_id: LogFile(f.file_id) for f in base_files}
        self.stats = Stats()

    def update(self, key, new_record):
        for f in self.base_files:
            if key in f.rows:
                # append a tiny log entry; base file untouched
                self.logs[f.file_id].updates.append((key, new_record))
                self.logs[f.file_id].size_kb += 0.3  # ~300 B per log entry
                self.stats.bytes_written += 300
                self.stats.write_ops += 1
                return

    def read(self, key):
        for f in self.base_files:
            if key in f.rows:
                # scan the pruned base file plus its log, then merge
                self.stats.bytes_read += int(f.size_mb * MB)
                self.stats.bytes_read += int(self.logs[f.file_id].size_kb * 1024)
                self.stats.read_ops += 1
                # the log wins if it has an entry for this key
                for k, v in self.logs[f.file_id].updates:
                    if k == key:
                        return v
                return f.rows[key]
        return None

    def compact(self):
        for i, f in enumerate(self.base_files):
            log = self.logs[f.file_id]
            if not log.updates:
                continue  # only file groups with pending logs get rewritten
            new_rows = dict(f.rows)
            for k, v in log.updates:
                new_rows[k] = v
            self.base_files[i] = BaseFile(f.file_id + "_compacted", new_rows, f.size_mb)
            self.stats.bytes_written += int(f.size_mb * MB)  # rewrite during compaction
            self.logs[f.file_id] = LogFile(f.file_id)

# 100 base files, 100 rows each, 512 MB each — a toy Razorpay-shaped table
random.seed(42)
base = [BaseFile(f"part-{i:05d}", {f"k{i}-{j}": j for j in range(100)}, 512.0)
        for i in range(100)]
cow, mor = CoWTable(base), MoRTable(base)

# Workload: 200 updates, then 5,000 point reads
keys = [f"k{random.randint(0, 99)}-{random.randint(0, 99)}" for _ in range(200)]
for k in keys:
    cow.update(k, "REFUNDED")
    mor.update(k, "REFUNDED")
for _ in range(5000):
    k = f"k{random.randint(0, 99)}-{random.randint(0, 99)}"
    cow.read(k)
    mor.read(k)

print(f"CoW: writes={cow.stats.bytes_written/1e9:.2f} GB, reads={cow.stats.bytes_read/1e9:.1f} GB")
print(f"MoR: writes={mor.stats.bytes_written/1e3:.2f} KB, reads={mor.stats.bytes_read/1e9:.1f} GB")
mor.compact()
print(f"MoR after compaction: extra writes={mor.stats.bytes_written/1e9:.2f} GB total")

# Sample run:
# CoW: writes=107.37 GB, reads=2684.4 GB
# MoR: writes=60.00 KB, reads=2684.4 GB
# MoR after compaction: extra writes=~47 GB total
# (compaction rewrites only the file groups with pending logs — roughly 87 of 100
#  here, since 200 random keys leave some of the 100 files untouched)
Walk the load-bearing pieces:
- self.base_files[i] = BaseFile(f.file_id + "_v2", new_rows, f.size_mb) — this is the entire copy-on-write trick. The old BaseFile is replaced with a brand-new one that has the change merged in. In real Iceberg, this is the manifest update that swaps the old file's manifest entry for the new one's. The write amplification is brutal: 512 MB written to change one row.
- self.logs[f.file_id].updates.append((key, new_record)) — the entire merge-on-read trick. The base is untouched; we just append. Write amplification is near zero. Why this is the streaming-ingest answer: a Hudi/Iceberg-MoR job ingesting 10,000 row updates per minute writes ~3 MB of log data per minute (10k × 300 B). The CoW equivalent — in the worst case of one file rewrite per update — would write ~5 TB per minute (10k × 512 MB), which exceeds the throughput of every storage tier short of a multi-region S3 fleet. MoR makes high-frequency upsert workloads physically possible.
- for k, v in self.logs[f.file_id].updates: if k == key: return v — the per-read merge. For each base file it scans, the engine also scans the log file and overlays it. Why this scales linearly with log size: every read pays the log-scan cost, whether or not it hits a key that has been updated. So the merge overhead is not "10× slower" — it's "+8 ms per file per read", which dominates if the log grows unboundedly. Compaction is what keeps this cost flat over time.
- def compact(self): — the load-bearing periodic operation. Compaction takes the base + all logs and emits a fresh base, then deletes the logs. Why compaction is essentially "scheduled CoW": it pays the same write-amplification cost as CoW, but in batches at off-peak times instead of synchronously per update. The MoR vs CoW choice is therefore not "pay or don't pay" — it's "pay synchronously per write, or batch the cost into hourly compaction jobs". Production teams pick MoR specifically to shift the cost into a controlled window.
- The output numbers show the punchline: CoW writes ~107 GB to apply 200 row changes; MoR writes 60 KB. Read cost is identical (~2.7 TB scanned across 5,000 queries). Compaction adds ~47 GB of writes — under half the CoW cost, because each dirty file group is rewritten once rather than once per update — folded into a single off-peak job. The choice comes down to whether your write budget is ~107 GB up-front or ~47 GB at 2 a.m.
Real Iceberg and Hudi add layers this model skips — concurrency control via the previous chapter's CAS protocol, manifest pruning, schema evolution under updates, time-travel — but the cost shape is identical to what these 80 lines show.
Picking CoW or MoR in production
Three workload signals decide it. Don't pick on vibes; measure these.
1. Updates per minute vs. reads per minute. If reads outnumber updates 10:1 or more, CoW wins because the per-read merge cost MoR adds, multiplied by the read rate, exceeds the per-update rewrite cost CoW imposes. If updates outnumber reads, or the write rate is unbounded (streaming ingest), MoR wins because CoW's write amplification will eventually saturate your storage bandwidth. The 10:1 threshold is approximate — Databricks, Tabular, and Onehouse all publish guidance in this neighbourhood.
2. The fraction of files an update touches. A MERGE INTO that affects 0.01% of rows scattered across 30% of files (the Razorpay 1,200-row hotfix) is a CoW disaster — you rewrite 30% of your table to fix 0.01% of the data. The same MERGE INTO under MoR writes a few KB and is done in seconds. But a MERGE INTO that affects 5% of rows clustered in 5% of files is fine under either strategy: CoW's amplification is bounded, MoR's logs are small. The pivot is the ratio of files affected to rows affected. Use partition-level statistics from the previous day's audit to predict this for tomorrow's hotfix.
3. The acceptable read latency SLA. Dashboards with a 200 ms p99 SLA and queries that scan 50 files will not tolerate the +10 ms per file that MoR adds — that's +500 ms, a blown SLA. CoW protects you here. But if your reads are minute-scale ETL jobs (downstream pipelines, scheduled refreshes), the merge overhead vanishes into the noise, and MoR is fine. Why latency-sensitive serving layers pick CoW: the merge step is unavoidable extra I/O — even if the log file is in the page cache, the merge logic costs CPU and adds branches to the hot path. A query that scans 200 files under CoW does 200 reads; the same query under MoR does 200 base reads + up to 200 log reads + a merge per file pair. For sub-second SLAs there is no room for that overhead; the sketch after these three signals condenses them into one decision helper.
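A sketch of that helper, with thresholds taken from the rules of thumb above (assumptions to tune against your own measurements, not engine defaults):

def choose_strategy(reads_per_min, updates_per_min,
                    frac_files_touched, frac_rows_touched,
                    p99_budget_ms, files_per_query, merge_ms_per_file=10):
    # signal 1: read/write ratio (the ~10:1 analytics threshold)
    if updates_per_min > 0 and reads_per_min / updates_per_min >= 10:
        return "CoW"
    # signal 2: update fan-out — few rows scattered across many files favours MoR
    if frac_files_touched / max(frac_rows_touched, 1e-12) > 100:
        return "MoR"
    # signal 3: would the per-file merge tax blow the read SLA?
    if files_per_query * merge_ms_per_file > 0.2 * p99_budget_ms:
        return "CoW"
    return "MoR"

# the Razorpay settlements shape: read-heavy table with a scattered update pattern
print(choose_strategy(reads_per_min=210, updates_per_min=0.05,
                      frac_files_touched=0.30, frac_rows_touched=0.0001,
                      p99_budget_ms=200, files_per_query=50))  # -> CoW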
A Swiggy data engineering team in 2025 ran the math for their order_events table — 8 lakh updates/minute peak (during dinner rush), 50,000 read queries/minute from analyst dashboards, average update touched 0.3% of files. The 16:1 update/read ratio with the high-fanout update pattern made MoR the clear winner: CoW would have rewritten ~120 TB/hour during dinner peak. They picked Hudi MoR with hourly compactions. The read-optimised mode handled the executive dashboards (5-minute staleness was fine); the snapshot mode handled support team queries that needed real-time visibility into refunds. Compaction at 2:30 a.m. took 90 minutes and rewrote the entire table — predictable, off-peak, manageable.
The opposite case: Razorpay's monthly_settlements table — 2,000 updates/month (auditor adjustments, dispute resolutions), 3 lakh read queries/day from finance dashboards. The 4500:1 read/write ratio screams CoW. The team uses Iceberg V2 in CoW mode; the few dozen rewrites per day are a non-event, and dashboard queries pay zero merge overhead.
Common confusions
- "Iceberg is CoW, Hudi is MoR." Both formats now support both modes. Iceberg V2 added MoR via position/equality deletes (2021); Hudi has had a CoW table type since the beginning. The defaults differ — Iceberg defaults to CoW, Hudi defaults to MoR — but the choice is per-table, not per-format. Don't pick the format on this dimension; pick on whether you need Hudi's incremental-read semantics or Iceberg's hidden partitioning and schema evolution.
- "MoR is always faster on writes." Faster per-write, yes, but only until compaction runs. The total write cost over a day is roughly the same — MoR amortises it into a compaction window, CoW pays it synchronously. If your compaction cluster is overloaded, MoR's logs accumulate, reads slow down, and you get the worst of both worlds. Compaction discipline is what makes MoR work; teams that turn off auto-compaction "to save cost" end up with 30-minute dashboard queries within a week.
- "MoR's read merge cost is constant." It's linear in the number of log entries since last compaction. A table compacted hourly has bounded log size; a table where compaction failed last night has 24× the merge work. Production teams alert on log-file-count-per-base-file as a leading indicator of read regression.
- "Position deletes and equality deletes are interchangeable." They are not. Position deletes apply at read time in O(rows-in-file) — fast. Equality deletes apply in O(rows × delete-keys) — slow if the equality file is large. Streaming CDC sources naturally produce equality deletes (the source emits "row with PK=X was deleted"); production deployments must convert these to position deletes during compaction or the read cost grows unbounded.
- "Compaction is just a maintenance job." It is the load-bearing operation that determines whether your MoR table is fast or slow. Treat it like a first-class pipeline: monitor lag, alert on failure, scale the compaction cluster proportionally to ingest rate. Hudi's
inlinecompaction (runs every N commits) and Iceberg'srewrite_data_filesaction are both production-tested; pick a cadence and stick to it. - "Time travel works the same on CoW and MoR." Mechanics differ. Under CoW, time travel replays the manifest history — every old base file is still on disk until expired. Under MoR, time travel must replay the timeline of base + log files as they existed at the target instant; if compaction has already deleted the relevant logs (Hudi's
cleanoperation), time travel to that instant is impossible. Hudi'shoodie.cleaner.commits.retainedconfig controls how many old commits remain replayable; the default is 24, meaning ~24 commits of time-travel granularity, then it's gone.
Going deeper
The deletion-vector primitive that's converging both formats
Delta Lake (2023) and Iceberg V3 (in-progress) both adopt the same primitive: a per-base-file Roaring Bitmap marking deleted positions. A 10 million-row file's deletion vector is ~50 KB even with 100k positions deleted (vs. 800 KB for a position-delete-file representation). Reads apply the bitmap as a filter during the Parquet scan — single instruction per row in vectorised SIMD code paths. Deletion vectors collapse the CoW/MoR distinction for the delete case: you write the vector instead of rewriting the file (MoR-style write cost), and reads skip the deleted rows in O(1) per row (CoW-style read cost). Updates still need a separate insert file. Hudi's roadmap includes deletion-vector support as well; the three formats are converging on the same physical primitive even as they keep different metadata layers above it.
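A sketch of the apply path, with a plain Python set standing in for the Roaring bitmap (real implementations keep the bitmap compressed and apply it inside vectorised scans):

deletion_vector = {1_283, 90_412, 455_001}   # deleted row positions in one base file

def scan(base_rows, dv):
    # one membership test per row: CoW-like read cost, bought with a MoR-like
    # write (a small vector file instead of a rewritten base file)
    for pos, row in enumerate(base_rows):
        if pos not in dv:
            yield row

print(sum(1 for _ in scan(range(1_000_000), deletion_vector)))  # 999997 live rows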
Concurrent updates, MVCC, and the timeline conflict model
The previous chapter on optimistic concurrency covered how two writers race via CAS on the manifest. For MoR specifically, the conflict model is gentler: two writers updating the same file group can both succeed if they write to separate log files. The merge step at read time will see both logs and apply them in timeline order. This is why Hudi can sustain 50+ concurrent writers per table where Iceberg CoW typically tops out at 5–10 (CoW writers fight over the same base file rewrite). The catch: ordering matters. If writer A writes status=PAID and writer B writes status=REFUNDED in overlapping log entries, the timeline order — determined by the commit timestamps the writers obtain when they commit — is what readers see. Hudi coordinates those commits with its optimistic concurrency control, backed by a lock provider such as ZooKeeper or DynamoDB.
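A sketch of the ordering rule, with toy epoch timestamps (real Hudi derives them from timeline instants):

log_a = [(1714000100, "TXN-93021", "PAID")]       # writer A's log block
log_b = [(1714000105, "TXN-93021", "REFUNDED")]   # writer B's log block, committed later

state = {}
for ts, key, val in sorted(log_a + log_b):        # timeline order = commit-timestamp order
    state[key] = val
print(state)   # {'TXN-93021': 'REFUNDED'} — the later instant wins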
Why Apple, Netflix, and Uber picked different defaults
Iceberg came out of Netflix's offline analytics workload (2018, Ryan Blue's team): mostly batch updates, mostly read-heavy, GDPR delete spikes. The CoW-default fit. Hudi came out of Uber's transactional ingest workload (2016, Vinoth Chandar's team): high-frequency upserts from CDC streams, real-time dashboards. The MoR-default fit. Apple's "Photos on iCloud" metadata layer adopted Iceberg around 2022 for its analyst-facing query patterns; ByteDance's TikTok recommendation pipeline runs on Hudi for its 100M+ records/sec ingest rate. The three companies validated that the format choice maps to the workload's read/write asymmetry, not to engineering preference. The practical lesson: don't pick Iceberg or Hudi by reading vendor blogs; pick by computing your read/write ratio and your update fan-out.
Compaction strategies and their tradeoffs
There are four strategies in production. Inline compaction runs on every Nth commit, blocking the writer briefly — predictable, but adds latency to writes. Async compaction runs in a separate process, polling for files that meet a compaction threshold — non-blocking, but requires a second cluster. Scheduled compaction runs at fixed times (2 a.m. nightly) — simple, but creates a stampede if many tables compact at once. Adaptive compaction triggers when read latency degrades past a threshold — most complex, requires query-side telemetry, but matches cost to actual harm. Onehouse and Databricks both ship adaptive compaction; most open-source deployments stick to scheduled-async. Why scheduled compaction is the default despite being suboptimal: it's the simplest to reason about — you know exactly when the cost hits, you can size the cluster for it, you can debug failures in business hours. Adaptive triggers are powerful but introduce a new failure mode: "the compaction job that should have run never did because the trigger never fired". Operational simplicity wins for most teams.
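The four strategies reduce to four trigger predicates over table metrics; a sketch with illustrative thresholds (assumptions, not any engine's actual config keys):

def should_compact(policy, commits_since_last, hour_of_day,
                   log_files_per_base, read_p99_ms):
    if policy == "inline":
        return commits_since_last >= 5     # every Nth commit, briefly blocks the writer
    if policy == "async":
        return log_files_per_base >= 10    # size threshold, polled by a separate process
    if policy == "scheduled":
        return hour_of_day == 2            # fixed window; beware the stampede
    if policy == "adaptive":
        return read_p99_ms > 500           # query-side telemetry as the trigger
    raise ValueError(f"unknown policy: {policy}")

print(should_compact("adaptive", 3, 14, 4, 730))  # True — reads have already degraded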
Where this leads next
The next chapter, /wiki/time-travel-and-snapshot-expiration, covers the manifest history and snapshot retention machinery that both CoW and MoR depend on. CoW makes time travel cheap (every old version is a complete file); MoR makes it nuanced (you need both the historic base and the historic logs that were valid at that instant).
After that the build moves into /wiki/the-trino-spark-dremio-duckdb-quadrant — query engines all consume the same manifest layer, but they differ in how aggressively they push merge logic down to the storage tier. Trino has the most mature MoR push-down; DuckDB has the least, but is gaining quickly. Engine choice can flip a CoW-leaning workload into a MoR-leaning one.
Build 14 (real-time analytics) revisits this exact tradeoff at a different layer: ClickHouse and Pinot offer their own segment-level upsert primitives (ClickHouse's ReplacingMergeTree, Pinot's upserts) that solve the same problem with different physical layouts. The mental model from this chapter ports directly: which side of the read/write asymmetry are you on, and what compaction discipline can you afford?
References
- Apache Iceberg V2 Spec — Position and Equality Deletes — the canonical reference for Iceberg's MoR primitives; defines the file formats and resolution algorithm.
- Apache Hudi — Storage Types: CoW vs MoR — the Hudi project's own write-up of the two modes; includes the timeline mechanics.
- Vinoth Chandar — Hudi: Uber Engineering's Incremental Processing Framework — the original 2017 Uber blog post that introduced Hudi; explains the upsert-first design philosophy.
- Ryan Blue — Iceberg: A Modern Table Format for Big Data — Netflix's design talk explaining why CoW was the default and when to use V2 deletes.
- Onehouse — Apache Hudi vs Apache Iceberg vs Delta Lake (2024) — even-handed feature matrix maintained by ex-Hudi committers.
- Databricks — Deletion Vectors in Delta Lake — the deletion-vector design that Iceberg V3 is now adopting; explains the bitmap mechanics.
- /wiki/concurrent-writers-optimistic-concurrency-serializability — the chapter on CAS-based concurrency that both CoW and MoR rely on for commit safety.
- /wiki/z-ordering-and-data-skipping — the previous chapter's clustering primitives interact strongly with CoW (every Z-order rewrite is a CoW operation) and MoR (compactions can re-Z-order).