Time travel and zero-copy clones for data engineers

It's 11:47 a.m. on a Tuesday. Karan, the on-call data engineer at a Bengaluru lending startup, gets a Slack ping: "the credit-score model retrained this morning is rejecting 30% more applications than yesterday's model — can you confirm whether the training table looked different at 6 a.m. today vs 6 a.m. yesterday?" Not long ago this question would have meant restoring a 4 TB Postgres backup to a staging cluster overnight and writing a diff script. With Iceberg, Karan opens a Trino notebook, runs SELECT * FROM credit.applications FOR TIMESTAMP AS OF '2026-04-21 06:00:00' EXCEPT SELECT * FROM credit.applications FOR TIMESTAMP AS OF '2026-04-22 06:00:00', and gets the answer in 14 seconds.

Lakehouse table formats keep every commit's manifest forever (until you VACUUM), so reading any past snapshot is a metadata-pointer lookup, not a backup restore. The same primitive — a snapshot ID that names an immutable set of files — gives you cheap clones: a "copy" of a table is just a new pointer to the same files. The catch is that immutable-file storage means deletes don't reclaim space until retention expires, and the cost model of "free" clones quietly stops being free the moment the original table starts churning files.

Why time travel falls out for free

The previous chapter, /wiki/concurrent-writers-without-stepping-on-each-other, established that a lakehouse table is a sequence of immutable snapshots, each identified by a version number, each pointing to a manifest of data files. Snapshot N+1 is built by adding new files, tombstoning some existing ones, and writing a fresh manifest — but the old manifest and all files it referenced still exist on disk. Nothing is mutated in place; nothing is deleted at commit time.

So when a query says "give me the table as it was at version 17", the engine fetches manifest 17, lists the data files it references, and reads them. The query does not know or care that versions 18, 19, and 20 also exist. Time travel is not a feature bolted on top of the table format; it is the natural read of any past commit, and the table format would have to add code to prevent it.

[Figure: Snapshots as immutable pointers to overlapping file sets. Three snapshots, each pointing to a manifest: v15 → F1, F2, F3; v16 → F1, F2, F4 (added F4); v17 → F1, F4, F5 (dropped F2, added F5). The catalog's current pointer sits at v17. All five files coexist on object storage; none have been deleted.]
v17 is the current pointer; v15 and v16 are still readable because their manifests and files have not been garbage-collected. F2 was tombstoned at v17 but the file is still on disk — that's why "give me the table at v15" still works.

Why the file count grows monotonically until VACUUM: each commit can only add manifest entries and tombstone existing ones. The tombstone marks "this file is no longer in the current snapshot" but the file is still referenced by the older snapshot that included it. Until you explicitly run VACUUM (Delta) / expire_snapshots (Iceberg) / cleaner (Hudi), every file ever written stays on disk. That is what makes time travel cheap; it is also what makes storage costs creep up if you forget to run retention.
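A few lines of Python make the retention rule concrete (a sketch of the bookkeeping, not any engine's actual GC logic): a file is reclaimable only when no retained snapshot references it. The snapshot and file names mirror the figure above.

```python
def reclaimable(snapshots, keep_versions):
    """snapshots: {version: set of file names}; keep_versions: versions retention keeps."""
    all_files = set().union(*snapshots.values())
    live = set().union(*(snapshots[v] for v in keep_versions))
    return all_files - live

# The three snapshots and five files from the figure; current pointer at v17.
snaps = {15: {"F1", "F2", "F3"}, 16: {"F1", "F2", "F4"}, 17: {"F1", "F4", "F5"}}
print(sorted(reclaimable(snaps, {17})))          # keep only current -> ['F2', 'F3']
print(sorted(reclaimable(snaps, {15, 16, 17})))  # keep everything   -> []
```

Keeping every version reclaims nothing, which is exactly why storage grows monotonically until a retention sweep shrinks keep_versions.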

The query syntax is similar across the three formats:

-- Iceberg: by snapshot ID or timestamp
SELECT * FROM payments FOR VERSION AS OF 12345678901234;
SELECT * FROM payments FOR TIMESTAMP AS OF '2026-04-21 06:00:00';

-- Delta:
SELECT * FROM payments VERSION AS OF 17;
SELECT * FROM payments TIMESTAMP AS OF '2026-04-21 06:00:00';

-- Hudi (Spark SQL; instants use Hudi's yyyyMMddHHmmss commit-time format):
SELECT * FROM payments TIMESTAMP AS OF '20260421060000';

The engine resolves the timestamp/version to a snapshot ID, fetches that manifest, and reads only files referenced there. Read amplification is exactly the same as a query on the current snapshot — same files-per-partition, same Parquet column pruning, same predicate pushdown.

A worked Python implementation: time travel and clone in under 100 lines

Below is a tiny lakehouse-style table that supports both. SQLite is the catalog (giving real compare-and-swap commits, as in the previous chapter), data files are JSON for readability, and the same manifest mechanism that drove concurrent writes now drives time travel and zero-copy clones. Run it; the output is reproducible.

# time_travel.py — snapshots, time travel, and clone-by-pointer.
import os, json, time, uuid, sqlite3
from datetime import datetime

ROOT = "/tmp/tt_lake"
DB = f"{ROOT}/_catalog.db"

def init():
    os.makedirs(ROOT, exist_ok=True)
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("""CREATE TABLE IF NOT EXISTS snapshots (
        table_name TEXT, version INT, manifest TEXT, ts TEXT,
        PRIMARY KEY (table_name, version))""")
    con.execute("""CREATE TABLE IF NOT EXISTS current (
        table_name TEXT PRIMARY KEY, version INT, source_table TEXT)""")
    con.close()

def write_file(rows):
    path = f"{ROOT}/{uuid.uuid4().hex[:8]}.json"
    with open(path, "w") as f: json.dump(rows, f)
    return path

def commit(table, files_to_add, files_to_drop=()):
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("BEGIN IMMEDIATE")
    row = con.execute("SELECT version FROM current WHERE table_name=?",
                      (table,)).fetchone()
    if row is None:
        v, manifest = 0, []
    else:
        v = row[0]
        manifest = json.loads(con.execute(
            "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
            (table, v)).fetchone()[0])
    new_manifest = [f for f in manifest if f not in files_to_drop] + list(files_to_add)
    new_v = v + 1
    ts = datetime.utcnow().isoformat() + "Z"
    con.execute("INSERT INTO snapshots VALUES (?,?,?,?)",
                (table, new_v, json.dumps(new_manifest), ts))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?,NULL)", (table, new_v))
    con.execute("COMMIT"); con.close()
    return new_v

def read_at(table, version=None, ts=None):
    con = sqlite3.connect(DB, isolation_level=None)
    if version is None and ts is None:
        version = con.execute("SELECT version FROM current WHERE table_name=?",
                              (table,)).fetchone()[0]
    elif ts is not None:
        version = con.execute("SELECT MAX(version) FROM snapshots "
                              "WHERE table_name=? AND ts<=?", (table, ts)).fetchone()[0]
    files = json.loads(con.execute(
        "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
        (table, version)).fetchone()[0])
    con.close()
    rows = []
    for f in files:
        with open(f) as fh: rows.extend(json.load(fh))
    return version, rows

def clone(src, dst):
    """Zero-copy: dst points at src's current manifest. No file copies."""
    con = sqlite3.connect(DB, isolation_level=None)
    src_v = con.execute("SELECT version FROM current WHERE table_name=?",
                        (src,)).fetchone()[0]
    src_manifest = con.execute(
        "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
        (src, src_v)).fetchone()[0]
    con.execute("INSERT OR REPLACE INTO snapshots VALUES (?,?,?,?)",
                (dst, 1, src_manifest, datetime.utcnow().isoformat() + "Z"))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?,?)", (dst, 1, src))
    con.close()

if __name__ == "__main__":
    init()
    f1 = write_file([{"id": 1, "amt": 100}, {"id": 2, "amt": 200}])
    commit("payments", [f1])
    time.sleep(1.1)
    f2 = write_file([{"id": 3, "amt": 300}])
    commit("payments", [f2])
    time.sleep(1.1)
    f3 = write_file([{"id": 4, "amt": 400}])
    commit("payments", [f3], files_to_drop=[f1])  # GDPR-style delete
    v_now, rows_now = read_at("payments")
    v_old, rows_old = read_at("payments", version=2)
    print(f"current v{v_now}: {len(rows_now)} rows -> {rows_now}")
    print(f"historic v{v_old}: {len(rows_old)} rows -> {rows_old}")
    clone("payments", "payments_dev")
    f4 = write_file([{"id": 99, "amt": 9999}])
    commit("payments_dev", [f4])
    _, prod = read_at("payments")
    _, dev = read_at("payments_dev")
    print(f"prod after clone+dev-write: {len(prod)} rows")
    print(f"dev after clone+dev-write:  {len(dev)} rows")
# Sample run:
current v3: 2 rows -> [{'id': 3, 'amt': 300}, {'id': 4, 'amt': 400}]
historic v2: 3 rows -> [{'id': 1, 'amt': 100}, {'id': 2, 'amt': 200}, {'id': 3, 'amt': 300}]
prod after clone+dev-write: 2 rows
dev after clone+dev-write:  3 rows

The lines that matter: new_manifest = [f for f in manifest if f not in files_to_drop] + list(files_to_add) is the entire snapshot-versioning logic. Each commit produces a new manifest by removing tombstoned entries and appending new ones; the old manifest is preserved verbatim in the snapshots table.

read_at with version or ts is what time travel actually compiles to — a single-row catalog lookup followed by a regular file read. There is no separate "history reader" code path.

clone() writes one row in the catalog: a new table name pointing at the same manifest. Zero data files are copied. The dev table can then mutate without the production table seeing it, because the next commit on payments_dev produces a fresh manifest specific to the dev branch.

Why dev writes cannot corrupt prod even though they share files: writes never modify existing files in place. Every commit produces a new file with a fresh path. The dev table's new commit appends f4 to its own manifest; production's manifest is untouched. The shared files (f2, f3) are read-only from both tables' perspectives, and immutability is what makes the share safe.

If you replace the clone() function with a naive cp -r of the data files, you'd duplicate gigabytes for no functional gain. The metadata-only clone is identical in semantics and zero in storage cost — until divergence, which we'll cover in the next section.

How clone divergence quietly costs storage

The "zero copy" name is true at the moment of clone. After that, every write to either side adds new files that only one side references, and every retention sweep on one side can't reclaim files the other still needs. Why this matters: Aditi's team at a Bengaluru fintech clones a 50 TB production payments table to a dev sandbox on Monday morning at zero cost. Over the next two weeks, the dev team rewrites 8 TB of files (running test transformations) and the prod table churns 12 TB of files (normal daily ingest + compaction). At the end of the fortnight, total storage is 50 + 8 + 12 = 70 TB — the "free clone" cost ₹4.2 lakh in extra S3 charges that nobody budgeted for, because the prod retention sweep can't drop files the dev clone still references.

[Figure: Clone divergence over time, files and references. Day 0 (clone): prod and dev both reference files A, B, C, D; storage is 4 files. Day 14: prod references A, D, P1..P4 and dev references B, C, D, V1..V3; physical storage is A, B, C, D + P1..P4 + V1..V3 = 11 files. Prod retention wants to drop B and C, but dev still references them, so they survive: the "free" clone is no longer free.]
The clone is free at day 0. By day 14, divergence and shared references make the storage cost rise above what either table alone would consume.
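The day-14 arithmetic can be checked with plain set operations (file names as in the figure, purely illustrative):

```python
day0 = {"A", "B", "C", "D"}                    # both tables at clone time
prod = {"A", "D", "P1", "P2", "P3", "P4"}      # prod after 14 days of churn
dev  = {"B", "C", "D", "V1", "V2", "V3"}       # dev after 14 days of tests
print(len(prod | dev))                         # physical files on storage: 11
# prod's retention wants to drop the day-0 files it no longer references...
wants_to_drop = day0 - prod
# ...but any of them still referenced by dev must survive the sweep:
print(sorted(wants_to_drop & dev))             # ['B', 'C']
```

Either table alone would hold 6 files; the union costs 11 because neither side's retention can touch the other's live references.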

The mitigation pattern that PhonePe's data platform team adopted is "clone with TTL": a dev clone is auto-expired (manifest + reference dropped) at 14 days, after which prod retention can finally reclaim the now-unreferenced files. Iceberg's branch model and Delta's SHALLOW CLONE both expose this, but you have to remember to set the TTL.
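The TTL pattern is easy to sketch. This toy registry is hypothetical (not PhonePe's tooling or any engine's API): retention treats a file as pinned while any unexpired clone still references it.

```python
def reclaimable(candidates, clones, now):
    """candidates: files prod retention wants to drop; clones: name -> {files, expires_at}."""
    pinned = set()
    for c in clones.values():
        if c["expires_at"] > now:      # clone still alive: its files are pinned
            pinned |= c["files"]
    return candidates - pinned

clones = {"payments_dev": {"files": {"B", "C", "D"}, "expires_at": 100}}
print(sorted(reclaimable({"B", "C"}, clones, now=50)))   # clone alive   -> []
print(sorted(reclaimable({"B", "C"}, clones, now=150)))  # clone expired -> ['B', 'C']
```

Until the clone's expires_at passes, the sweep returns nothing; the moment it expires, B and C become reclaimable with no further coordination.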

What time travel is actually used for in production

Karan's "compare yesterday to today" debugging story is one use, but in practice a handful of recurring patterns pull most of the weight.

Where it goes wrong

These are the patterns that tend to show up after six months in production.

Common confusions

Going deeper

Iceberg's branches and tags as a Git-style workflow

Iceberg 1.2+ added named branches and tags on top of the snapshot model. A tag is a permanent, named pointer to a snapshot (e.g. tag = 'release-2026-04-01'); a branch is a movable pointer that supports its own commit chain. The use case is exactly Git-style: a dev branch off main, the dev team commits to dev for a sprint, then merges back to main via fast-forward or reconciliation. Branches give you the "long-lived clone with bidirectional sync" story without writing any sync code yourself, and they share retention with main so you don't pay double.
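In the spirit of the toy catalog earlier, a branch or tag is just a named pointer row next to the current pointer. This is a sketch of the model, not Iceberg's API:

```python
refs = {
    "main": {"type": "branch", "snapshot": 17},             # movable pointer
    "dev":  {"type": "branch", "snapshot": 17},             # branched off main
    "release-2026-04-01": {"type": "tag", "snapshot": 15},  # permanent pointer
}

def commit_to(refs, name, new_snapshot):
    if refs[name]["type"] != "branch":
        raise ValueError("cannot commit to a tag")  # tags are read-only by design
    refs[name]["snapshot"] = new_snapshot

commit_to(refs, "dev", 18)  # dev advances; main still reads snapshot 17
print(refs["main"]["snapshot"], refs["dev"]["snapshot"])  # 17 18
```

A fast-forward merge is then just refs["main"]["snapshot"] = refs["dev"]["snapshot"], valid when main has not moved since the branch point.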

How copy-on-write vs merge-on-read changes the time-travel cost

Copy-on-write tables (the default in Iceberg and Delta) rewrite the entire affected file on each row update, so each commit potentially adds many large files. Time travel reads the version's files directly — fast read, expensive storage. Merge-on-read tables (Hudi, optional in Iceberg v2) write small "delta files" alongside the base files; time-travel reads have to merge base + deltas at query time — cheap storage, more expensive read. The choice depends on your read/write ratio. Aditi's payments table at her Bengaluru fintech is read-heavy (1,000 reads per write), so copy-on-write is right; Karan's clickstream table at his lending startup is write-heavy with few reads, so merge-on-read is the better trade-off.
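The merge-on-read read path can be sketched in a few lines, assuming rows keyed by id, which stands in here for the format's record key:

```python
def read_mor(base_rows, delta_batches):
    """Merge base-file rows with the delta files written since, in commit order."""
    table = {r["id"]: r for r in base_rows}
    for batch in delta_batches:
        for r in batch:
            table[r["id"]] = r  # upsert: the newest version of a key wins
    return list(table.values())

base = [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}]
deltas = [[{"id": 2, "amt": 250}], [{"id": 3, "amt": 300}]]
print(read_mor(base, deltas))
```

A copy-on-write read skips the merge loop entirely; that is the whole trade: it pays at write time by rewriting the base file instead.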

Time-travel-aware compaction

Compaction (the topic of the next chapter, /wiki/compaction-small-files-hell-and-how-to-avoid-it) merges many small files into fewer large ones. But compaction itself is a commit — it produces a new snapshot whose manifest references the merged files and tombstones the old ones. Until retention expires, both the small files and the merged file are on disk, so compaction temporarily increases storage. The savings only materialise after retention. This is the most counter-intuitive cost surprise in lakehouse production.
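The short-term storage bump is simple arithmetic; a sketch with made-up file counts:

```python
small_files = {f"s{i}" for i in range(1, 5)}  # four pre-compaction small files
merged = {"m1"}                                # compaction's single output file
# The compaction commit tombstones the small files but cannot delete them,
# because the pre-compaction snapshot still references all four:
after_commit = small_files | merged
print(len(after_commit))     # 5 files on disk, up from 4
# Only once retention expires the pre-compaction snapshot do the savings land:
after_retention = merged
print(len(after_retention))  # 1
```

Storage goes 4 → 5 → 1: worse immediately after compaction, better only after retention.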

Multi-table consistent time travel

A query like "give me the join of orders and payments as they were at 06:00 yesterday" requires that both tables have a snapshot at 06:00 yesterday. If they're committed independently, their snapshot timestamps are independent — there's no atomic "at 06:00, both tables looked like this" guarantee unless you explicitly tag both tables with the same name on a coordinator (e.g. catalog.tag('eod-2026-04-21', tables=['orders','payments'])). Iceberg's REST catalog spec added multi-table tagging in 2024; Delta supports it via Unity Catalog. Hudi does not yet.
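A coordinator-side sketch of same-name tagging; the tag_tables helper and its shape are hypothetical, not a real catalog call:

```python
def tag_tables(tags, current_versions, name, tables):
    """Record one cut: each table's current snapshot under a single tag name."""
    tags[name] = {t: current_versions[t] for t in tables}

current_versions = {"orders": 41, "payments": 17}
tags = {}
tag_tables(tags, current_versions, "eod-2026-04-21", ["orders", "payments"])
print(tags["eod-2026-04-21"])  # {'orders': 41, 'payments': 17}
```

Note the cut is only as consistent as the read of current_versions: if a commit lands between reading the two tables' pointers, the tag mixes epochs. Making that read atomic is exactly what the catalog-level feature provides and this sketch does not.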

Where this leads next

The next chapter covers /wiki/compaction-small-files-hell-and-how-to-avoid-it — the operation that converts the "many small commit files" produced by streaming writers into the "few large files" that read engines want, and the surprising fact that compaction increases storage in the short term because of time-travel retention.

For how the snapshots themselves are produced under concurrency, return to /wiki/concurrent-writers-without-stepping-on-each-other. For the underlying file layout that snapshots are pointers into, see /wiki/iceberg-delta-hudi-from-the-producers-perspective.
