Vacuum, retention, and the GDPR-delete problem

It is a Wednesday afternoon at a Bengaluru fintech. Dipti, the data platform on-call, gets a ticket from legal: user 8841302 has filed a deletion request under the DPDP Act, and the 30-day clock has started. She runs DELETE FROM payments WHERE user_id = 8841302 against the Iceberg table. The query returns success in 11 seconds. Two days later, an audit script flags that the row's PAN, phone number, and Aadhaar reference are still readable on S3 inside s3://prod-lake/payments/dt=2026-04-19/part-00037.parquet. The "delete" only tombstoned the file in the manifest. Until VACUUM runs and retention has expired, the byte-on-disk is still there.

Lakehouse deletes are two-phase by design: a logical delete (tombstone in the manifest) is fast and cheap, but the underlying Parquet file is only physically removed when no live snapshot references it. VACUUM (Delta) / expire_snapshots + remove_orphan_files (Iceberg) / cleaner (Hudi) is the operation that runs the actual erasure. Privacy laws like GDPR and India's DPDP Act care about the byte on disk, not the manifest pointer — so retention windows that protect time-travel directly conflict with deletion deadlines, and the design has to reconcile the two.

Why deletes are two-phase

The previous chapter, /wiki/time-travel-and-zero-copy-clones-for-data-engineers, established that every snapshot's manifest is preserved until retention expires — that's what makes time travel cheap. The same property is what makes a single-step erasure impossible: a DELETE issued today produces snapshot N+1 whose manifest no longer references the file holding the deleted row, but snapshot N still does, and snapshot N is queryable for the entire retention window.

So the deletion is logically complete the moment the new snapshot is committed (current readers don't see the row), but physically incomplete until two more things happen — the snapshot that did reference the file is expired, and a sweep job confirms no other snapshot references the file and finally calls s3:DeleteObject.
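
The overlap can be seen with two toy manifests as plain sets (file names here are illustrative, not from any real table format):

```python
# Why a DELETE is only logically complete: snapshot N's manifest still
# references file F until N itself is expired.
snapshot_n  = {"version": 7, "manifest": {"part-00036.parquet", "part-00037.parquet"}}
snapshot_n1 = {"version": 8, "manifest": {"part-00036.parquet", "part-00038.parquet"}}
# part-00037 held the deleted row; the rewrite produced part-00038 without it.

live = snapshot_n["manifest"] | snapshot_n1["manifest"]          # union of all snapshots
dropped_by_delete = snapshot_n["manifest"] - snapshot_n1["manifest"]

print(sorted(dropped_by_delete))      # the file the DELETE stopped referencing
print("part-00037.parquet" in live)   # True: snapshot N still reaches it
```

Until snapshot N is expired, `part-00037.parquet` stays in the live set, and no sweep may touch it.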

[Figure: Two-phase delete — tombstone now, physical erase after retention. Timeline: day 0, DELETE commits snapshot N+1 but file F stays on S3; days 1–29, retention window, snapshot N still readable and the deleted row visible via time travel; day 30, expire_snapshots(N) plus the VACUUM sweep finally call s3:DeleteObject(F).]
The DELETE returns success on day 0, but the actual byte on S3 is only erased once both the referencing snapshot has expired and VACUUM has swept the unreferenced file.

Why VACUUM cannot run immediately after DELETE: if VACUUM erased file F the moment N+1 committed, every time-travel query against snapshot N would return file-not-found errors, and "rollback to N" (a one-command restore) would be impossible because the data is gone. Retention is the buffer that makes time travel survivable; it is also exactly what slows down erasure.

The terminology is per-format but the mechanism is identical:

| Format  | "Drop the snapshot"                                        | "Erase the file"                            |
|---------|------------------------------------------------------------|---------------------------------------------|
| Delta   | (none — VACUUM walks the log)                              | VACUUM payments RETAIN 720 HOURS            |
| Iceberg | CALL system.expire_snapshots('payments', older_than => ts) | CALL system.remove_orphan_files('payments') |
| Hudi    | cleaner.policy = KEEP_LATEST_BY_HOURS, retention = 720     | runs as part of the cleaner commit          |

In Delta, "VACUUM" does both jobs because the transaction log encodes commit time directly. In Iceberg, the two phases are separate stored procedures — and forgetting to run the second one is the most common cause of "we deleted, but storage didn't drop" tickets in production.
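
Spelled out, the commands from the table look like this (catalog name, table name, and timestamp are placeholders; syntax follows the Spark runtimes for each format):

```sql
-- Iceberg: two separate stored procedures; forgetting the second leaves files behind.
CALL prod_catalog.system.expire_snapshots(table => 'db.payments',
                                          older_than => TIMESTAMP '2026-04-20 00:00:00');
CALL prod_catalog.system.remove_orphan_files(table => 'db.payments');

-- Delta: one statement does both, because the transaction log carries commit times.
VACUUM payments RETAIN 720 HOURS;
```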

Building it: a tiny VACUUM in 70 lines

The mechanism is simple enough that the entire retention policy fits in one Python file. Below is a working implementation that takes the toy lakehouse from the previous chapter and adds a retention-aware VACUUM. The output is reproducible.

# vacuum.py — two-phase delete with retention sweep.
import os, json, time, glob, sqlite3
from datetime import datetime, timedelta

ROOT = "/tmp/vac_lake"
DB = f"{ROOT}/_catalog.db"

def init():
    os.makedirs(ROOT, exist_ok=True)
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("""CREATE TABLE IF NOT EXISTS snapshots (
        table_name TEXT, version INT, manifest TEXT, ts TEXT,
        PRIMARY KEY (table_name, version))""")
    con.execute("""CREATE TABLE IF NOT EXISTS current (
        table_name TEXT PRIMARY KEY, version INT)""")
    con.close()

def write_file(rows):
    path = f"{ROOT}/p_{int(time.time()*1000)}_{os.getpid()}.json"
    with open(path, "w") as f: json.dump(rows, f)
    return path

def commit(table, add, drop=()):
    """Commit a new snapshot: previous manifest minus `drop` plus `add`."""
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("BEGIN IMMEDIATE")  # serialize writers against the catalog
    row = con.execute("SELECT version FROM current WHERE table_name=?", (table,)).fetchone()
    if row is None:
        manifest, v = [], 0
    else:
        v = row[0]
        manifest = json.loads(con.execute(
            "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
            (table, v)).fetchone()[0])
    new_manifest = [f for f in manifest if f not in drop] + list(add)
    new_v = v + 1
    con.execute("INSERT INTO snapshots VALUES (?,?,?,?)",
                (table, new_v, json.dumps(new_manifest),
                 datetime.utcnow().isoformat() + "Z"))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?)", (table, new_v))
    con.execute("COMMIT"); con.close()
    return new_v

def vacuum(table, retain_seconds):
    """Two-phase: expire old snapshots, then erase files no live snapshot needs."""
    cutoff = (datetime.utcnow() - timedelta(seconds=retain_seconds)).isoformat() + "Z"
    con = sqlite3.connect(DB, isolation_level=None)
    cur_v = con.execute("SELECT version FROM current WHERE table_name=?",
                        (table,)).fetchone()[0]
    expired = [r[0] for r in con.execute(
        "SELECT version FROM snapshots WHERE table_name=? AND ts<? AND version<>?",
        (table, cutoff, cur_v)).fetchall()]
    for v in expired:
        con.execute("DELETE FROM snapshots WHERE table_name=? AND version=?",
                    (table, v))
    live_files = set()
    for (m,) in con.execute(
        "SELECT manifest FROM snapshots WHERE table_name=?", (table,)).fetchall():
        live_files.update(json.loads(m))
    con.close()
    on_disk = set(glob.glob(f"{ROOT}/p_*.json"))
    erased = on_disk - live_files
    for f in erased: os.remove(f)
    return {"expired_snapshots": expired, "files_erased": len(erased)}

if __name__ == "__main__":
    init()
    f1 = write_file([{"id": 1, "user": "Riya", "pan": "ABCDE1234F"}])
    commit("payments", [f1])
    time.sleep(2)
    f2 = write_file([{"id": 2, "user": "Karan", "pan": "ZYXWV9876G"}])
    commit("payments", [f2])
    time.sleep(2)
    # GDPR/DPDP delete: drop the file with Riya's PAN
    commit("payments", [], drop=[f1])
    print("before vacuum:", sorted(os.path.basename(p) for p in glob.glob(f"{ROOT}/p_*.json")))
    print("vacuum (retain 100s, no expiry):", vacuum("payments", retain_seconds=100))
    time.sleep(3)
    print("vacuum (retain 1s, expires old):", vacuum("payments", retain_seconds=1))
    print("after vacuum:", sorted(os.path.basename(p) for p in glob.glob(f"{ROOT}/p_*.json")))
# Sample run:
before vacuum: ['p_1745613001234_4421.json', 'p_1745613003251_4421.json']
vacuum (retain 100s, no expiry): {'expired_snapshots': [], 'files_erased': 0}
vacuum (retain 1s, expires old): {'expired_snapshots': [1, 2], 'files_erased': 1}
after vacuum: ['p_1745613003251_4421.json']

The lines that matter: vacuum() is two distinct phases, not one. The first walks snapshots and drops every row older than the retention cutoff and not equal to the current version (you can never drop the current snapshot, even with retention set to 0). The second computes live_files as the union of every surviving snapshot's manifest; any file on disk that is not in that set is unreachable and safe to erase. erased = on_disk - live_files is the entire physical-deletion contract: if a file is referenced by any surviving snapshot, it stays. Why two phases instead of one: if you erased files while iterating snapshots, you might erase a file that a snapshot you haven't checked yet still references. The set-difference is also why VACUUM doesn't trample readers: it only deletes what nobody reaches. Concurrent writers need one more guard the toy omits, though. A file written but not yet committed sits on disk in no manifest, so a production sweep only erases files older than a grace window (Iceberg's remove_orphan_files defaults its older_than to 3 days for exactly this reason).

The first vacuum call (retain 100s) does nothing because all snapshots are within the retention window. The second call (retain 1s) drops snapshots 1 and 2, leaving only snapshot 3 (the current one) — and f1 is now in no surviving manifest, so it's deleted. f2 stays because it's still in snapshot 3's manifest. This is the entire mechanism behind a billion-row enterprise VACUUM job; the production code just adds parallelism and S3 batched deletes.

The GDPR/DPDP delete problem

India's Digital Personal Data Protection Act (DPDP, 2023) and the EU's GDPR both require that personal data be erased — physically — within a deadline (typically 30 days for DPDP, "without undue delay" and usually 30 days for GDPR). The lakehouse retention window is the policy knob that decides whether you can hit that deadline.

Set retention too long (say 90 days for time travel) and the legal window expires before VACUUM even runs. Set retention too short and time travel becomes useless, and any rollback older than retention requires restoring from backup. A common middle ground after a first DPDP audit is 7-day retention for tables that contain personal data and 30-day retention for derived/aggregate tables that don't. The classification has to be explicit and enforced at table creation time.
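
Enforcing that classification at creation time can be sketched as a validation hook; the property names and the 7/30-day bounds below are assumptions for illustration, not any engine's built-in config:

```python
# Derive the retention default from an explicit PII classification.
MAX_PII_RETENTION_DAYS = 30   # legal ceiling: erasure SLA
MIN_RETENTION_DAYS = 7        # operational floor: rollback window

def retention_days(table_props: dict) -> int:
    requested = int(table_props.get("retention_days", 90))
    if table_props.get("is_pii") == "true":
        if not MIN_RETENTION_DAYS <= requested <= MAX_PII_RETENTION_DAYS:
            raise ValueError(
                f"PII table retention must be {MIN_RETENTION_DAYS}-"
                f"{MAX_PII_RETENTION_DAYS} days, got {requested}")
    return requested

print(retention_days({"is_pii": "false", "retention_days": "90"}))  # non-PII: fine
print(retention_days({"is_pii": "true", "retention_days": "7"}))    # PII: in the band
```

A `{"is_pii": "true", "retention_days": "90"}` request fails loudly, which is the point: the default must not be decidable per-team.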

[Figure: Retention windows — the legal floor and the operational ceiling. A horizontal axis of retention days from 0 to 90: below 7 days, too short for operational rollback; 7–30 days, the safe zone for PII tables; above 30 days, DPDP/GDPR violation (erasure SLA breached). Non-PII tables can extend further right; classification is a table property, and "is_pii: true" must drive the retention default.]
The legal deadline puts a ceiling on retention; operational rollback puts a floor. PII tables live in the 7–30-day band; aggregate tables can sit further right.

There is one more wrinkle. A hard delete must remove the row from every replica too — backups, dev clones, archived snapshots in cold storage, downstream Kafka topics if the data was streamed. Aditi's team at her Bengaluru lending startup discovered after their first DPDP audit that the payments Iceberg table had a lineage out to 14 derived tables, 3 dev clones, and a 90-day Glacier archive of weekly snapshots. Erasing the source was step one of 19. They wrote an internal tool, pii-tracer, that reads column-level lineage and produces the full deletion plan; the audit now passes in 18 days instead of "we'll get back to you".
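
What a lineage-walking planner like the pii-tracer described above actually does is a breadth-first walk from the source table to every downstream copy. The graph here is invented; a real tool would read it from a lineage service:

```python
# Sketch: BFS over column-level lineage to produce the full deletion plan.
from collections import deque

lineage = {
    "payments":            ["payments_daily_agg", "payments_dev_clone", "glacier_weekly"],
    "payments_daily_agg":  ["finance_dashboard_extract"],
    "payments_dev_clone":  [],
    "glacier_weekly":      [],
    "finance_dashboard_extract": [],
}

def deletion_plan(source: str) -> list[str]:
    plan, seen, q = [], {source}, deque([source])
    while q:
        t = q.popleft()
        plan.append(t)                   # every reached table needs its own erasure step
        for child in lineage.get(t, []):
            if child not in seen:
                seen.add(child)
                q.append(child)
    return plan

print(deletion_plan("payments"))   # erasing the source is step one of five here
```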

Compaction makes it worse before it makes it better

A subtle interaction: compaction (the next chapter, /wiki/compaction-small-files-hell-and-how-to-avoid-it) merges many small files into fewer large ones. If you ran a DELETE for user 8841302's row and then compacted before VACUUM, a correct compactor writes a merged file without the deleted row, but the old small files that still contain it survive until retention expires, so compaction alone erases nothing. Worse: if the compactor ignores the table format's delete files, the new merged file can re-include a row that was supposed to be deleted.

Iceberg v2's "position deletes" and "equality deletes" both encode "row X is gone" in a separate file alongside the data. Compaction has to read both the data file and the delete file, then write a clean merged file without the deleted rows. Why this matters for DPDP: if compaction is misconfigured (e.g. compacts data files but ignores delete files), the merged file resurrects the deleted row, and the row only goes away after the next deletion + compaction round. There are real production stories of "I deleted that row last quarter, why is it back?" — almost always a compaction-without-delete-file bug.
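
The delete-aware merge step can be sketched with simplified stand-ins for Parquet data files and Iceberg v2 position-delete files:

```python
# Compaction that applies position deletes, so the merged file does not
# resurrect tombstoned rows.
data_files = {
    "part-00037.parquet": [{"user_id": 8841301}, {"user_id": 8841302}, {"user_id": 8841303}],
    "part-00038.parquet": [{"user_id": 8841304}],
}
# Position deletes: (file_path, row_position) tuples, as in Iceberg v2.
position_deletes = {("part-00037.parquet", 1)}   # user 8841302's row

def compact(files, deletes):
    merged = []
    for path, rows in files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in deletes:       # the step a buggy compactor skips
                merged.append(row)
    return merged

print([r["user_id"] for r in compact(data_files, position_deletes)])
# Dropping the `deletes` check re-includes 8841302: the "why is that row back?" bug.
```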

Going deeper

Position deletes vs equality deletes vs copy-on-write

Iceberg v2 supports three delete strategies. Copy-on-write (the default) rewrites the entire file on every delete — high write amplification, simple read path, and the deleted bytes are gone from the new file (but persist in the old one until VACUUM). Position deletes record (file_path, row_position) tuples in a separate delete file; reads merge data + delete file at query time — low write amplification, but the deleted bytes physically remain in the data file until the next compaction. Equality deletes record (column_value) predicates instead of positions; useful for streaming CDC-style deletes where you don't know the position. For DPDP, copy-on-write is the only strategy that physically removes the byte at delete time — but it's expensive, so most teams use position deletes + scheduled compaction with a 7-day SLO.
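
The merge-on-read path, where a scan applies both kinds of delete files before returning rows, can be sketched like this (simplified stand-in for Iceberg v2 semantics, not the real reader API):

```python
# A scan that honours position deletes and equality deletes.
rows = [  # (file_path, position, record)
    ("part-00037.parquet", 0, {"user_id": 8841301, "pan": "AAAAA1111A"}),
    ("part-00037.parquet", 1, {"user_id": 8841302, "pan": "ABCDE1234F"}),
    ("part-00038.parquet", 0, {"user_id": 8841305, "pan": "CCCCC3333C"}),
]
position_deletes = {("part-00037.parquet", 1)}   # "the row at this spot is gone"
equality_deletes = [{"user_id": 8841305}]        # "any row matching this predicate is gone"

def scan(rows, pos_deletes, eq_deletes):
    out = []
    for path, pos, rec in rows:
        if (path, pos) in pos_deletes:
            continue
        if any(all(rec.get(k) == v for k, v in pred.items()) for pred in eq_deletes):
            continue
        out.append(rec)
    return out

print([r["user_id"] for r in scan(rows, position_deletes, equality_deletes)])
```

Note that both deleted rows' bytes are still sitting in `rows`: the scan hides them, and only compaction plus VACUUM removes them.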

DPDP Act vs GDPR for Indian data engineers

DPDP (Digital Personal Data Protection Act, 2023) is India's first comprehensive personal-data law and applies to any digital personal data processed in India. The deletion deadline is "as soon as it is no longer necessary" — in practice, fintechs and health platforms use 30 days. GDPR's "right to erasure" applies to EU residents but reaches Indian companies that process EU data. The two laws diverge on definitions (DPDP is narrower on what counts as personal data; GDPR has stronger consent requirements), but converge on the operational requirement: erase the byte within the window, and prove you did. The proof is a deletion log that ties the user request to a specific snapshot ID and a VACUUM run timestamp.
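
A deletion-log entry that ties the request to snapshot IDs and manifest hashes can be sketched as below; the field names are illustrative, not a standard schema:

```python
# A tamper-evident deletion_log record: hash each manifest so the trail
# can be checked later.
import hashlib, json

def manifest_hash(manifest: list[str]) -> str:
    return hashlib.sha256(json.dumps(sorted(manifest)).encode()).hexdigest()

entry = {
    "request_id": "dpdp-2026-0418-8841302",
    "snapshot_at_request": 41,
    "snapshot_after_delete": 42,
    "vacuum_run_id": "vac-2026-05-18-01",
    "manifest_hash_before": manifest_hash(["part-00036.parquet", "part-00037.parquet"]),
    "manifest_hash_after":  manifest_hash(["part-00036.parquet", "part-00038.parquet"]),
}
# The auditor's check: the "after" manifest no longer covers the erased file,
# so its hash must differ from the "before" hash.
print(entry["manifest_hash_before"] != entry["manifest_hash_after"])
```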

Multi-region and Glacier-tier complications

A row in s3://prod-lake-mumbai/payments/ may be replicated to s3://prod-lake-singapore/payments/ for DR, and snapshots may be archived nightly to S3 Glacier Deep Archive for 7-year audit. A DPDP delete has to reach all three. Replication delete-propagation is straightforward (S3 supports it), but Glacier delete is asynchronous (up to 12 hours) and the API costs ₹2 per object — at 100M tombstoned objects, the erasure run alone is ₹20 crore. Real teams batch deletes weekly during off-peak windows and budget for it explicitly.
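
The batching itself is mechanical: S3's DeleteObjects call accepts at most 1,000 keys per request, so 100M tombstoned objects become 100,000 batched calls. A sketch of the chunking (key names invented):

```python
# Chunk an erasure run into DeleteObjects-sized batches.
def batches(keys, size=1000):
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

keys = [f"payments/part-{i:05}.parquet" for i in range(2500)]
reqs = list(batches(keys))
print(len(reqs), len(reqs[0]), len(reqs[-1]))   # 3 requests: 1000 + 1000 + 500 keys
# With boto3, each batch would go to s3.delete_objects(
#     Bucket=..., Delete={"Objects": [{"Key": k} for k in batch]})
```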

Audit trails: the "we erased it" proof

Regulators can ask for proof that a specific user's data was actually erased. The standard pattern is a deletion_log table that records: the user request ID, the snapshot ID at request time, the snapshot ID after the DELETE commit, the VACUUM run ID that physically removed the files, and the cryptographic hash of the manifest at each step. Without this trail, the auditor's question "show me that row 8841302 is gone" has no answer beyond "trust us". Razorpay's compliance team open-sourced a deletion-attest library that produces a Merkle proof of erasure tied to AWS CloudTrail logs.
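
The Merkle idea is that any surviving reference to the erased file changes the root, so "the file is in no manifest" becomes a checkable claim. A minimal sketch (a real attestation, like the library described above, would also emit the inclusion/exclusion proof path):

```python
# Merkle root over a manifest's file list.
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves: list[str]) -> str:
    level = [h(x.encode()) for x in sorted(leaves)]
    if not level:
        return h(b"").hex()
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()

before = merkle_root(["part-00036.parquet", "part-00037.parquet"])
after  = merkle_root(["part-00036.parquet", "part-00038.parquet"])
print(before != after)   # replacing/erasing a file is visible in the root
```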

Encrypted-at-rest as an alternative erasure path

If every Parquet file is encrypted with a per-user data-encryption key (DEK), erasing a user reduces to destroying their DEK — no manifest sweep, no S3 delete, no compaction race. This is "crypto-shredding" and AWS KMS, Google Tink, and Azure Key Vault all support per-key destruction. The trade-off: every read pays a KMS-decrypt round trip, and key-management complexity grows linearly with users (millions of DEKs at a UPI-scale fintech). In practice, large Indian platforms apply crypto-shredding to a narrow PII column-group (PAN, phone, Aadhaar) and use VACUUM-based deletion for everything else — the column-group encryption is a Parquet 2.6+ feature.
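
A toy version of crypto-shredding makes the mechanism concrete: destroying the DEK erases the data without touching the file. The XOR keystream below is a stand-in for real AES via KMS/Tink, and the in-memory `deks` dict stands in for a key-management service:

```python
# Per-user DEKs: delete the key, not the bytes.
import hashlib, secrets

deks = {}  # user_id -> DEK (in production this lives in KMS, not app memory)

def keystream(key: bytes, n: int) -> bytes:
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_pii(user_id: int, plaintext: bytes) -> bytes:
    dek = deks.setdefault(user_id, secrets.token_bytes(32))
    return bytes(a ^ b for a, b in zip(plaintext, keystream(dek, len(plaintext))))

def decrypt_pii(user_id: int, ciphertext: bytes) -> bytes:
    if user_id not in deks:
        raise KeyError("DEK destroyed: data is cryptographically erased")
    return bytes(a ^ b for a, b in zip(ciphertext, keystream(deks[user_id], len(ciphertext))))

ct = encrypt_pii(8841302, b"ABCDE1234F")
print(decrypt_pii(8841302, ct))   # readable while the DEK exists
del deks[8841302]                 # the "delete": destroy the key, not the file
# decrypt_pii(8841302, ct) now raises; the ciphertext on disk is unreadable.
```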

Where this leads next

Compaction is the next operational step — it converts the small-file commit churn that retention forces into the large-file efficiency that read engines need. Read /wiki/compaction-small-files-hell-and-how-to-avoid-it for the trade-off between commit frequency, small-file count, and read latency.

For the snapshot mechanism that retention is built on, return to /wiki/time-travel-and-zero-copy-clones-for-data-engineers. For the underlying file layout, see /wiki/iceberg-delta-hudi-from-the-producers-perspective. For how concurrent writers don't trample VACUUM, see /wiki/concurrent-writers-without-stepping-on-each-other.

References