Time travel and zero-copy clones for data engineers
It's 11:47 a.m. on a Tuesday. Karan, the on-call data engineer at a Bengaluru lending startup, gets a Slack ping: "the credit-score model retrained this morning is rejecting 30% more applications than yesterday's model — can you confirm whether the training table looked different at 6 a.m. today vs 6 a.m. yesterday?" Before the team moved its tables to Iceberg, this question would have meant restoring a 4 TB Postgres backup to a staging cluster overnight and writing a diff script. Now, Karan opens a Trino notebook, runs SELECT * FROM credit.applications FOR TIMESTAMP AS OF '2026-04-21 06:00:00' EXCEPT SELECT * FROM credit.applications FOR TIMESTAMP AS OF '2026-04-22 06:00:00', and gets the answer in 14 seconds.
Lakehouse table formats keep every commit's manifest forever (until you VACUUM), so reading any past snapshot is a metadata-pointer lookup, not a backup restore. The same primitive — a snapshot ID that names an immutable set of files — gives you cheap clones: a "copy" of a table is just a new pointer to the same files. The catch is that immutable-file storage means deletes don't reclaim space until retention expires, and the cost model of "free" clones quietly stops being free the moment the original table starts churning files.
Why time travel falls out for free
The previous chapter, /wiki/concurrent-writers-without-stepping-on-each-other, established that a lakehouse table is a sequence of immutable snapshots, each identified by a version number, each pointing to a manifest of data files. Snapshot N+1 is built by adding new files, tombstoning some existing ones, and writing a fresh manifest — but the old manifest and all files it referenced still exist on disk. Nothing is mutated in place; nothing is deleted at commit time.
So when a query says "give me the table as it was at version 17", the engine fetches manifest 17, lists the data files it references, and reads them. The query does not know or care that version 18, 19, 20 also exist. Time travel is not a feature bolted on top of the table format; it is the natural read of any past commit, and the table format would have to add code to prevent it.
Why the file count grows monotonically until VACUUM: each commit can only add manifest entries and tombstone existing ones. The tombstone marks "this file is no longer in the current snapshot" but the file is still referenced by the older snapshot that included it. Until you explicitly run VACUUM (Delta) / expire_snapshots (Iceberg) / cleaner (Hudi), every file ever written stays on disk. That is what makes time travel cheap; it is also what makes storage costs creep up if you forget to run retention.
The query syntax is similar, though not identical, across the three formats:
-- Iceberg: by snapshot ID or timestamp
SELECT * FROM payments FOR VERSION AS OF 12345678901234;
SELECT * FROM payments FOR TIMESTAMP AS OF '2026-04-21 06:00:00';
-- Delta:
SELECT * FROM payments VERSION AS OF 17;
SELECT * FROM payments TIMESTAMP AS OF '2026-04-21 06:00:00';
-- Hudi (Spark SQL, recent versions):
SELECT * FROM payments TIMESTAMP AS OF '2026-04-21 06:00:00';
-- (a WHERE _hoodie_commit_time = '20260421060000' filter selects only the rows
--  written by that one commit, which is a different question from the table's
--  state at that time)
The engine resolves the timestamp/version to a snapshot ID, fetches that manifest, and reads only files referenced there. Read amplification is exactly the same as a query on the current snapshot — same files-per-partition, same Parquet column pruning, same predicate pushdown.
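Timestamp resolution is just a lookup over the snapshot log: find the latest snapshot committed at or before the requested time. A minimal sketch (the commit times and snapshot IDs below are illustrative, not any engine's actual values):

```python
# How a timestamp resolves to a snapshot: pick the latest snapshot whose
# commit time is <= the requested timestamp.
import bisect

commit_times = ["2026-04-20T06:00:00", "2026-04-21T05:55:00", "2026-04-21T07:10:00"]
snapshot_ids = [101, 102, 103]   # parallel to commit_times; illustrative values

def resolve(ts):
    # bisect_right gives the insertion point after all entries <= ts;
    # subtracting one lands on the snapshot the query should read
    i = bisect.bisect_right(commit_times, ts) - 1
    if i < 0:
        raise ValueError("timestamp predates the first snapshot")
    return snapshot_ids[i]

print(resolve("2026-04-21T06:00:00"))   # 102: last snapshot at or before 06:00
```

Note that a timestamp older than the first retained snapshot is an error, not an empty result — the same behaviour you see when a query reaches past retention.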
A worked Python implementation: time travel and clone in 80 lines
Below is a tiny lakehouse-style table that supports both. SQLite is the catalog (real CAS, like the previous chapter), data files are JSON for readability, and the same manifest mechanism that drove concurrent writes now drives time travel and zero-copy clones. Run it; the output is reproducible.
# time_travel.py — snapshots, time travel, and clone-by-pointer.
import os, json, time, uuid, sqlite3
from datetime import datetime

ROOT = "/tmp/tt_lake"
DB = f"{ROOT}/_catalog.db"

def init():
    os.makedirs(ROOT, exist_ok=True)
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("""CREATE TABLE IF NOT EXISTS snapshots (
        table_name TEXT, version INT, manifest TEXT, ts TEXT,
        PRIMARY KEY (table_name, version))""")
    con.execute("""CREATE TABLE IF NOT EXISTS current (
        table_name TEXT PRIMARY KEY, version INT, source_table TEXT)""")
    con.close()

def write_file(rows):
    path = f"{ROOT}/{uuid.uuid4().hex[:8]}.json"
    with open(path, "w") as f:
        json.dump(rows, f)
    return path

def commit(table, files_to_add, files_to_drop=()):
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("BEGIN IMMEDIATE")
    row = con.execute("SELECT version FROM current WHERE table_name=?",
                      (table,)).fetchone()
    if row is None:
        v, manifest = 0, []
    else:
        v = row[0]
        manifest = json.loads(con.execute(
            "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
            (table, v)).fetchone()[0])
    new_manifest = [f for f in manifest if f not in files_to_drop] + list(files_to_add)
    new_v = v + 1
    ts = datetime.utcnow().isoformat() + "Z"
    con.execute("INSERT INTO snapshots VALUES (?,?,?,?)",
                (table, new_v, json.dumps(new_manifest), ts))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?,NULL)", (table, new_v))
    con.execute("COMMIT")
    con.close()
    return new_v

def read_at(table, version=None, ts=None):
    con = sqlite3.connect(DB, isolation_level=None)
    if version is None and ts is None:
        version = con.execute("SELECT version FROM current WHERE table_name=?",
                              (table,)).fetchone()[0]
    elif ts is not None:
        version = con.execute("SELECT MAX(version) FROM snapshots "
                              "WHERE table_name=? AND ts<=?", (table, ts)).fetchone()[0]
    files = json.loads(con.execute(
        "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
        (table, version)).fetchone()[0])
    con.close()
    rows = []
    for f in files:
        with open(f) as fh:
            rows.extend(json.load(fh))
    return version, rows

def clone(src, dst):
    """Zero-copy: dst points at src's current manifest. No file copies."""
    con = sqlite3.connect(DB, isolation_level=None)
    src_v = con.execute("SELECT version FROM current WHERE table_name=?",
                        (src,)).fetchone()[0]
    src_manifest = con.execute(
        "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
        (src, src_v)).fetchone()[0]
    con.execute("INSERT OR REPLACE INTO snapshots VALUES (?,?,?,?)",
                (dst, 1, src_manifest, datetime.utcnow().isoformat() + "Z"))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?,?)", (dst, 1, src))
    con.close()

if __name__ == "__main__":
    init()
    f1 = write_file([{"id": 1, "amt": 100}, {"id": 2, "amt": 200}])
    commit("payments", [f1])
    time.sleep(1.1)
    f2 = write_file([{"id": 3, "amt": 300}])
    commit("payments", [f2])
    time.sleep(1.1)
    f3 = write_file([{"id": 4, "amt": 400}])
    commit("payments", [f3], files_to_drop=[f1])  # GDPR-style delete
    v_now, rows_now = read_at("payments")
    v_old, rows_old = read_at("payments", version=2)
    print(f"current v{v_now}: {len(rows_now)} rows -> {rows_now}")
    print(f"historic v{v_old}: {len(rows_old)} rows -> {rows_old}")
    clone("payments", "payments_dev")
    f4 = write_file([{"id": 99, "amt": 9999}])
    commit("payments_dev", [f4])
    _, prod = read_at("payments")
    _, dev = read_at("payments_dev")
    print(f"prod after clone+dev-write: {len(prod)} rows")
    print(f"dev after clone+dev-write: {len(dev)} rows")

# Sample run:
current v3: 2 rows -> [{'id': 3, 'amt': 300}, {'id': 4, 'amt': 400}]
historic v2: 3 rows -> [{'id': 1, 'amt': 100}, {'id': 2, 'amt': 200}, {'id': 3, 'amt': 300}]
prod after clone+dev-write: 2 rows
dev after clone+dev-write: 3 rows
The lines that matter: new_manifest = [f for f in manifest if f not in files_to_drop] + list(files_to_add) is the entire snapshot-versioning logic. Each commit produces a new manifest by removing tombstoned entries and appending new ones; the old manifest is preserved verbatim in the snapshots table.

read_at with version or ts is what time travel actually compiles to — a single-row catalog lookup followed by a regular file read. There is no separate "history reader" code path.

clone() writes one row in the catalog: a new table name pointing at the same manifest. Zero data files are copied. The dev table can then mutate without the production table seeing it, because the next commit on payments_dev produces a fresh manifest specific to the dev branch.

Why dev writes cannot corrupt prod even though they share files: writes never modify existing files in place. Every commit produces a new file with a fresh path. The dev table's new commit appends f4 to its own manifest; production's manifest is untouched. The shared files (f2, f3) are read-only from both tables' perspectives, and immutability is what makes the share safe.
If you replace the clone() function with a naive cp -r of the data files, you'd duplicate gigabytes for no functional gain. The metadata-only clone is identical in semantics and zero in storage cost — until divergence, which we'll cover in the next section.
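To make the retention side concrete, here is a deliberately simplified in-memory sketch of what expire_snapshots / VACUUM does, and why a live clone blocks reclamation. The names mirror the toy model above, but this is a separate, self-contained toy, not part of time_travel.py:

```python
# expire_sketch.py — hypothetical retention sweep over an in-memory catalog.
snapshots = {}   # table -> {version: frozenset(file paths)}
current = {}     # table -> current version

def commit(table, add=(), drop=()):
    v = current.get(table, 0)
    manifest = snapshots.setdefault(table, {}).get(v, frozenset())
    snapshots[table][v + 1] = (manifest - frozenset(drop)) | frozenset(add)
    current[table] = v + 1

def clone(src, dst):
    # zero-copy: dst's v1 is a pointer to src's current manifest
    snapshots[dst] = {1: snapshots[src][current[src]]}
    current[dst] = 1

def expire_snapshots(table, keep_last=1):
    # drop old manifests, then compute which files any live snapshot
    # in ANY table still references; only files outside that set may
    # be physically deleted
    live_versions = sorted(snapshots[table])[-keep_last:]
    snapshots[table] = {v: m for v, m in snapshots[table].items()
                        if v in live_versions}
    return set().union(*(m for t in snapshots for m in snapshots[t].values()))

commit("payments", add={"f1", "f2"})
commit("payments", add={"f3"}, drop={"f1"})  # f1 tombstoned, still held by v1
clone("payments", "payments_dev")
live = expire_snapshots("payments", keep_last=1)  # v1 expires
print("f1" in live)   # False: no snapshot references f1 any more
print("f2" in live)   # True: prod v2 and the dev clone still reference f2
```

The sweep is two steps: drop expired manifests, then keep every file some surviving snapshot still references. The dev clone would keep f2 and f3 alive even if the source table expired every one of its own snapshots — which is exactly the divergence cost the next section describes.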
How clone divergence quietly costs storage
The "zero copy" name is true at the moment of clone. After that, every write to either side adds new files that only one side references, and every retention sweep on one side can't reclaim files the other still needs. Why this matters: Aditi's team at a Bengaluru fintech clones a 50 TB production payments table to a dev sandbox on Monday morning at zero cost. Over the next two weeks, the dev team rewrites 8 TB of files (running test transformations) and the prod table churns 12 TB of files (normal daily ingest + compaction). At the end of the fortnight, total storage is 50 + 8 + 12 = 70 TB — the "free clone" cost ₹4.2 lakh in extra S3 charges that nobody budgeted for, because the prod retention sweep can't drop files the dev clone still references.
The mitigation pattern that PhonePe's data platform team adopted is "clone with TTL": a dev clone is auto-expired (manifest + reference dropped) at 14 days, after which prod retention can finally reclaim the now-unreferenced files. Iceberg's branch model and Delta's SHALLOW CLONE both expose this, but you have to remember to set the TTL.
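The TTL sweep itself is simple bookkeeping. A hedged sketch, assuming a catalog that records each clone's source and creation time (the clone table below is invented for illustration):

```python
# Hypothetical clone-TTL sweep; clone names, sources, and dates are made up.
from datetime import datetime, timedelta, timezone

clones = {  # clone name -> (source table, created-at)
    "payments_dev": ("payments", datetime(2026, 4, 7, tzinfo=timezone.utc)),
    "payments_qa":  ("payments", datetime(2026, 4, 20, tzinfo=timezone.utc)),
}

def expired_clones(now, ttl=timedelta(days=14)):
    # clones past TTL get their manifest pointer dropped, which finally
    # lets the source table's retention sweep reclaim the shared files
    return [name for name, (_, created) in clones.items() if now - created > ttl]

now = datetime(2026, 4, 22, tzinfo=timezone.utc)
print(expired_clones(now))   # ['payments_dev']: 15 days old, past the 14-day TTL
```

The real work is social, not technical: the sweep has to actually drop the pointer, and someone has to own the "clones older than 14 days" report.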
What time travel is actually used for in production
Karan's "compare yesterday to today" debugging story is one use, but in practice four patterns pull the most weight:
- Reproducible model training. ML pipelines pin a snapshot ID in the training config: train.iceberg_snapshot_id = 17234567. Re-running training tomorrow against the same snapshot ID gives the same training data even if the table has churned a lot since. Without this, "the model retrained but data shifted" debugging is forensic; with it, it's a config-diff.
- Audit and compliance. RBI requires Indian banks to be able to reconstruct any account's state at the close of any business day for 7 years. With time travel, "what was Aditi's account balance on 2024-10-31 at 23:59:59 IST" is a FOR TIMESTAMP AS OF query, not a tape restore.
- Recovering from bad backfills. Karan kicks off a backfill that overwrites 6 months of transactions with a buggy join. The next morning, he runs RESTORE TABLE payments TO VERSION AS OF <pre-backfill>. The "restore" is a metadata pointer flip — the bad commits are still there, the current pointer just walks back. No data movement.
- Side-by-side QA. A new transformation job is rolled out. QA wants to compare its output against the previous version. Both versions are queried as time-travel snapshots in parallel; rows that disagree are flagged. No staging warehouse, no data copy.
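The side-by-side QA pattern reduces to a keyed diff of two snapshot reads. A minimal sketch with invented data, where read_at_version stands in for any engine's time-travel read:

```python
# Sketch of the side-by-side QA pattern: diff two snapshot reads keyed by id.
def read_at_version(table_versions, v):
    # stand-in for a real time-travel read; returns rows indexed by key
    return {row["id"]: row for row in table_versions[v]}

versions = {   # two illustrative snapshots of the same table
    17: [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}],
    18: [{"id": 1, "amt": 100}, {"id": 2, "amt": 250}, {"id": 3, "amt": 300}],
}

def snapshot_diff(versions, v_old, v_new):
    old, new = read_at_version(versions, v_old), read_at_version(versions, v_new)
    return {
        "added":   sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

print(snapshot_diff(versions, 17, 18))
# {'added': [3], 'removed': [], 'changed': [2]}
```

At warehouse scale the same shape is expressed as two EXCEPT queries (as in Karan's opening story) rather than an in-memory dict, but the semantics are identical.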
Where it goes wrong
These are the patterns that show up after 6 months of production:
- Storage explosion from no retention. A table that commits every 30 s with a small file on each commit accumulates 2,880 commits/day × 90 days = 259,200 small manifest+data files in 3 months. Neither Iceberg nor Delta expires old snapshots on its own; if nobody schedules cleanup, storage costs grow without bound. The fix is expire_snapshots (Iceberg) or VACUUM (Delta) on a daily cron, with retention chosen to match your audit needs (typically 7–30 days).
- Time travel beyond retention silently fails. A query for FOR TIMESTAMP AS OF '2025-12-01' when retention is 30 days returns "snapshot not found" — but the SQL parser doesn't know your retention policy, so the error only surfaces at execution time. Test by running the query daily; the alarm fires the first day after retention.
- Clones that outlive their owner. Dev clones get created during a sprint and forgotten. Six months later, prod retention is failing because clones still reference 1.2 PB of "deleted" files. The fix is mandatory TTL on clones plus a daily report of "clones older than 14 days; ping the owner".
- Restore from time travel is not the same as DR. Time travel works only within retention and only on the same storage. If S3 has a region outage or a malicious actor deletes the manifest pointer, time travel is gone. Time travel is not a backup; it's a recoverability layer for logical corruption, not physical loss.
- The "I cloned and then dropped the source" failure. If the source table is dropped via DROP TABLE, some catalogs (older Glue versions) immediately schedule the data files for deletion — even though a clone still references them. The clone's reads then return file-not-found errors. Modern catalogs ref-count properly, but legacy ones may not. Test before you depend on it.
- Time-travel reads bypass row-level access controls in some engines. A user who can't read the current payments table because of a row filter may be able to read a past snapshot if the engine doesn't re-apply the filter to historical reads. Recent Trino versions fixed this; Spark with Delta has had several CVEs in this area. Audit your engine.
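One cheap guard against the silent "time travel beyond retention" failure is to validate a requested timestamp against the retention window before dispatching the query. A sketch, with a 30-day window as the assumed policy:

```python
# Pre-flight check: is the requested time-travel timestamp still within
# retention? The 30-day window is an assumed policy, not a format default.
from datetime import datetime, timedelta, timezone

def within_retention(requested, now, retention=timedelta(days=30)):
    return requested >= now - retention

now = datetime(2026, 4, 22, tzinfo=timezone.utc)
ok  = within_retention(datetime(2026, 4, 1,  tzinfo=timezone.utc), now)
bad = within_retention(datetime(2025, 12, 1, tzinfo=timezone.utc), now)
print(ok, bad)   # True False: the December snapshot has already been expired
```

Running this as a daily canary (one query per retention-critical table, pinned at the oldest timestamp you are contractually required to serve) turns the silent failure into an alert.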
Common confusions
- "Time travel is the same as backups." It is not. Time travel is logical recoverability within retention on the same storage. Backups are physical copies in a separate storage system. A region outage erases time travel; a backup survives it. You need both.
- "Zero-copy clone is the same as a database snapshot." Physical database copies (Postgres pg_basebackup, MySQL XtraBackup) duplicate the data files and keep growing as WAL accumulates. Lakehouse zero-copy clones share data files via metadata pointers; they grow only when one side mutates. Different mechanism, same goal.
- "VACUUM/expire_snapshots is the same as deleting." VACUUM only deletes files that no live snapshot references. If any snapshot within retention still points to a file, VACUUM leaves it. The "delete" you ran in your transformation set the tombstone; the file goes away later, when the last referencing snapshot expires.
- "Time travel works on any table format." It works on Iceberg, Delta, Hudi. It does not work on plain Parquet folders without a table-format layer. A folder of Parquet files has no concept of "the past version" — once you overwrite, the old data is gone.
- "Cloning is free forever." Cloning is free at the moment of creation. After divergence, the clone shares responsibility for the files in the original manifest, and prod retention cannot reclaim those files. Long-lived clones cost real storage.
- "Row-level deletes free up space immediately." No. A row-level delete in a copy-on-write table rewrites the affected file (a new file, not editing the old). The old file is tombstoned. Until VACUUM runs and retention expires, both old and new files are on disk. Storage actually rises briefly after a delete.
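The post-delete storage bump in the last point is easy to quantify. All numbers below are illustrative:

```python
# Why storage rises after a copy-on-write delete (illustrative numbers).
file_size_gb = 1.0
files_in_table = 100
files_touched_by_delete = 10   # files containing at least one deleted row

before = files_in_table * file_size_gb
# COW rewrites each touched file (minus the deleted rows, roughly the same
# size) and tombstones the old copy; both copies live until retention expires.
during_retention = before + files_touched_by_delete * file_size_gb
after_expiry = before  # old copies reclaimed; near-original footprint remains

print(before, during_retention, after_expiry)   # 100.0 110.0 100.0
```

A delete that touches a row in every file is the worst case: storage briefly doubles, which is worth knowing before running a table-wide GDPR erasure.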
Going deeper
Iceberg's branches and tags as a Git-style workflow
Iceberg 1.2+ added named branches and tags on top of the snapshot model. A tag is a permanent, named pointer to a snapshot (e.g. tag = 'release-2026-04-01'); a branch is a movable pointer that supports its own commit chain. The use case is exactly Git-style: a dev branch off main, the dev team commits to dev for a sprint, then merges back to main via fast-forward or reconciliation. Branches give you the "long-lived clone with bidirectional sync" story without writing any sync code yourself, and they share retention with main so you don't pay double.
How copy-on-write vs merge-on-read changes the time-travel cost
Copy-on-write tables (default Iceberg, Delta) rewrite the entire affected file on each row update, so each commit potentially adds many large files. Time travel reads the version's files directly — fast read, expensive storage. Merge-on-read tables (Hudi, optional Iceberg v2) write small "delta files" alongside the base files; time travel reads have to merge base + deltas at query time — cheap storage, more expensive read. The choice depends on your read/write ratio. Aditi's payments table at her Bengaluru fintech is read-heavy (1000 reads per write), so copy-on-write is right; Karan's clickstream table at the same firm is write-heavy with few reads, so merge-on-read is the better trade-off.
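A toy merge-on-read read path makes the query-time cost concrete. The delta-record structure here is invented for illustration, not any format's actual layout:

```python
# Toy merge-on-read: base file rows plus delta files merged at query time.
base = {1: {"id": 1, "amt": 100}, 2: {"id": 2, "amt": 200}}   # rows by key
deltas = [   # small delta files written by later commits, in commit order
    {"op": "update", "key": 2, "row": {"id": 2, "amt": 250}},
    {"op": "delete", "key": 1},
]

def merge_on_read(base, deltas):
    merged = dict(base)
    for d in deltas:            # every reader must replay the deltas in order
        if d["op"] == "delete":
            merged.pop(d["key"], None)
        else:
            merged[d["key"]] = d["row"]
    return list(merged.values())

print(merge_on_read(base, deltas))   # [{'id': 2, 'amt': 250}]
```

The loop over deltas runs on every read, which is the read-amplification cost merge-on-read trades for its cheap writes; compaction periodically folds the deltas back into base files to cap that loop.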
Time-travel-aware compaction
Compaction (the topic of the next chapter, /wiki/compaction-small-files-hell-and-how-to-avoid-it) merges many small files into fewer large ones. But compaction itself is a commit — it produces a new snapshot whose manifest references the merged files and tombstones the old ones. Until retention expires, both the small files and the merged file are on disk, so compaction temporarily increases storage. The savings only materialise after retention. This is the most counter-intuitive cost surprise in lakehouse production.
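A back-of-envelope using the 259,200-file figure from earlier and an assumed 4 MB average small-file size shows the shape of the bump:

```python
# Back-of-envelope for compaction's temporary storage increase.
# The 4 MB average file size is an assumption for illustration.
small_files = 259_200          # 30 s commits for 90 days, as above
avg_small_mb = 4

before_gb = small_files * avg_small_mb / 1024
during_retention_gb = 2 * before_gb   # old small files + new merged files coexist
after_expiry_gb = before_gb           # small files reclaimed once retention passes

print(before_gb, during_retention_gb, after_expiry_gb)
# 1012.5 2025.0 1012.5
```

The doubling holds regardless of file sizes, because compaction rewrites roughly the same bytes into fewer files; what retention controls is only how long the double-footprint window lasts.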
Multi-table consistent time travel
A query like "give me the join of orders and payments as they were at 06:00 yesterday" requires that both tables have a snapshot at 06:00 yesterday. If they're committed independently, their snapshot timestamps are independent — there's no atomic "at 06:00, both tables looked like this" guarantee unless a coordinator explicitly tags both tables with the same name (e.g. a hypothetical catalog.tag('eod-2026-04-21', tables=['orders','payments'])). Iceberg's REST catalog spec supports atomic multi-table commits, which make such coordinated tags possible; Delta supports cross-table coordination via Unity Catalog. Hudi does not yet.
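The coordinator-side idea fits in a few lines. Everything here (the tag function, the version numbers) is hypothetical, not a real catalog API:

```python
# Toy coordinator for a consistent multi-table tag: record each table's
# current snapshot under one tag name in a single step.
current = {"orders": 42, "payments": 17}   # current snapshot version per table
tags = {}

def tag(name, tables):
    # capture all listed tables' current versions together; a real catalog
    # must make this capture atomic with respect to concurrent commits
    tags[name] = {t: current[t] for t in tables}

tag("eod-2026-04-21", ["orders", "payments"])
print(tags["eod-2026-04-21"])   # {'orders': 42, 'payments': 17}
```

A later join query resolves the tag once and reads both tables at the recorded versions, giving the "both tables as of end-of-day" view that independent timestamps cannot.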
Where this leads next
The next chapter covers /wiki/compaction-small-files-hell-and-how-to-avoid-it — the operation that converts the "many small commit files" produced by streaming writers into the "few large files" that read engines want, and the surprising fact that compaction increases storage in the short term because of time-travel retention.
For how the snapshots themselves are produced under concurrency, return to /wiki/concurrent-writers-without-stepping-on-each-other. For the underlying file layout that snapshots are pointers into, see /wiki/iceberg-delta-hudi-from-the-producers-perspective.
References
- Apache Iceberg: Snapshot lifecycle and time travel — how expire_snapshots and the snapshot retention policy interact.
- Delta Lake: Time travel docs — the canonical SQL syntax and the transaction-log internals.
- Apache Hudi: Time travel queries — Hudi's commit-time-based query model and how it differs from Iceberg/Delta snapshots.
- Databricks: SHALLOW CLONE vs DEEP CLONE — the two clone modes and when each is right.
- Iceberg branching and tagging — the Git-style workflow on top of snapshots.
- Netflix Tech Blog: Time travel at petabyte scale — early production write-up that motivated much of Iceberg's design.
- PhonePe Engineering: Clone TTL automation — production write-up on managing dev-clone storage costs at lakehouse scale.
- /wiki/concurrent-writers-without-stepping-on-each-other — the snapshot-versioning machinery this chapter builds on.