# Vacuum, retention, and the GDPR-delete problem
It is a Wednesday afternoon at a Bengaluru fintech. Dipti, the data platform on-call, gets a ticket from legal: user 8841302 has filed a deletion request under the DPDP Act, and the 30-day clock has started. She runs `DELETE FROM payments WHERE user_id = 8841302` against the Iceberg table. The query returns success in 11 seconds. Two days later, an audit script flags that the row's PAN, phone number, and Aadhaar reference are still readable on S3 inside `s3://prod-lake/payments/dt=2026-04-19/part-00037.parquet`. The "delete" only tombstoned the file in the manifest. Until VACUUM runs and retention has expired, the bytes are still on disk.
Lakehouse deletes are two-phase by design: a logical delete (a tombstone in the manifest) is fast and cheap, but the underlying Parquet file is only physically removed when no live snapshot references it. `VACUUM` (Delta), `expire_snapshots` + `remove_orphan_files` (Iceberg), and the cleaner (Hudi) are the operations that run the actual erasure. Privacy laws like GDPR and India's DPDP Act care about the byte on disk, not the manifest pointer — so retention windows that protect time travel directly conflict with deletion deadlines, and the design has to reconcile the two.
## Why deletes are two-phase
The previous chapter, /wiki/time-travel-and-zero-copy-clones-for-data-engineers, established that every snapshot's manifest is preserved until retention expires — that's what makes time travel cheap. The same property is what makes a single-step erasure impossible: a DELETE issued today produces snapshot N+1 whose manifest no longer references the file holding the deleted row, but snapshot N still does, and snapshot N is queryable for the entire retention window.
So the deletion is logically complete the moment the new snapshot is committed (current readers don't see the row), but physically incomplete until two more things happen — the snapshot that did reference the file is expired, and a sweep job confirms no other snapshot references the file and finally calls s3:DeleteObject.
Why VACUUM cannot run immediately after DELETE: if VACUUM erased file F the moment N+1 committed, every time-travel query against snapshot N would return file-not-found errors, and "rollback to N" (a one-command restore) would be impossible because the data is gone. Retention is the buffer that makes time travel survivable; it is also exactly what slows down erasure.
The terminology is per-format but the mechanism is identical:
| Format | "Drop the snapshot" | "Erase the file" |
|---|---|---|
| Delta | (none — `VACUUM` walks the log) | `VACUUM payments RETAIN 720 HOURS` |
| Iceberg | `CALL system.expire_snapshots('payments', older_than => ts)` | `CALL system.remove_orphan_files('payments')` |
| Hudi | `cleaner.policy = KEEP_LATEST_BY_HOURS, retention = 720` | runs as part of the cleaner commit |
In Delta, `VACUUM` does both jobs because the transaction log encodes commit time directly. In Iceberg, the two phases are separate stored procedures — and forgetting to run the second one is the most common cause of "we deleted, but storage didn't drop" tickets in production.
## Building it: a tiny VACUUM in 70 lines
The mechanics are simple enough that the entire policy fits in one Python file. Below is a working implementation that takes the toy lakehouse from the previous chapter and adds a retention-aware VACUUM, followed by the output of a sample run.
```python
# vacuum.py — two-phase delete with retention sweep.
import os, json, time, glob, sqlite3
from datetime import datetime, timedelta

ROOT = "/tmp/vac_lake"
DB = f"{ROOT}/_catalog.db"

def init():
    os.makedirs(ROOT, exist_ok=True)
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("""CREATE TABLE IF NOT EXISTS snapshots (
        table_name TEXT, version INT, manifest TEXT, ts TEXT,
        PRIMARY KEY (table_name, version))""")
    con.execute("""CREATE TABLE IF NOT EXISTS current (
        table_name TEXT PRIMARY KEY, version INT)""")
    con.close()

def write_file(rows):
    path = f"{ROOT}/p_{int(time.time()*1000)}_{os.getpid()}.json"
    with open(path, "w") as f:
        json.dump(rows, f)
    return path

def commit(table, add, drop=()):
    con = sqlite3.connect(DB, isolation_level=None)
    con.execute("BEGIN IMMEDIATE")
    row = con.execute("SELECT version FROM current WHERE table_name=?",
                      (table,)).fetchone()
    if row is None:
        manifest, v = [], 0
    else:
        v = row[0]
        manifest = json.loads(con.execute(
            "SELECT manifest FROM snapshots WHERE table_name=? AND version=?",
            (table, v)).fetchone()[0])
    new_manifest = [f for f in manifest if f not in drop] + list(add)
    new_v = v + 1
    con.execute("INSERT INTO snapshots VALUES (?,?,?,?)",
                (table, new_v, json.dumps(new_manifest),
                 datetime.utcnow().isoformat() + "Z"))
    con.execute("INSERT OR REPLACE INTO current VALUES (?,?)", (table, new_v))
    con.execute("COMMIT")
    con.close()
    return new_v

def vacuum(table, retain_seconds):
    """Two-phase: expire old snapshots, then erase files no live snapshot needs."""
    cutoff = (datetime.utcnow() - timedelta(seconds=retain_seconds)).isoformat() + "Z"
    con = sqlite3.connect(DB, isolation_level=None)
    cur_v = con.execute("SELECT version FROM current WHERE table_name=?",
                        (table,)).fetchone()[0]
    expired = [r[0] for r in con.execute(
        "SELECT version FROM snapshots WHERE table_name=? AND ts<? AND version<>?",
        (table, cutoff, cur_v)).fetchall()]
    for v in expired:
        con.execute("DELETE FROM snapshots WHERE table_name=? AND version=?",
                    (table, v))
    live_files = set()
    for (m,) in con.execute(
            "SELECT manifest FROM snapshots WHERE table_name=?", (table,)).fetchall():
        live_files.update(json.loads(m))
    con.close()
    on_disk = set(glob.glob(f"{ROOT}/p_*.json"))
    erased = on_disk - live_files
    for f in erased:
        os.remove(f)
    return {"expired_snapshots": expired, "files_erased": len(erased)}

if __name__ == "__main__":
    init()
    f1 = write_file([{"id": 1, "user": "Riya", "pan": "ABCDE1234F"}])
    commit("payments", [f1])
    time.sleep(2)
    f2 = write_file([{"id": 2, "user": "Karan", "pan": "ZYXWV9876G"}])
    commit("payments", [f2])
    time.sleep(2)
    # GDPR/DPDP delete: drop the file with Riya's PAN
    commit("payments", [], drop=[f1])
    print("before vacuum:", sorted(os.path.basename(p) for p in glob.glob(f"{ROOT}/p_*.json")))
    print("vacuum (retain 100s, no expiry):", vacuum("payments", retain_seconds=100))
    time.sleep(3)
    print("vacuum (retain 1s, expires old):", vacuum("payments", retain_seconds=1))
    print("after vacuum:", sorted(os.path.basename(p) for p in glob.glob(f"{ROOT}/p_*.json")))
```

Sample run:

```text
before vacuum: ['p_1745613001234_4421.json', 'p_1745613003251_4421.json']
vacuum (retain 100s, no expiry): {'expired_snapshots': [], 'files_erased': 0}
vacuum (retain 1s, expires old): {'expired_snapshots': [1, 2], 'files_erased': 1}
after vacuum: ['p_1745613003251_4421.json']
```
The lines that matter: `vacuum()` is two distinct loops, not one. The first loop walks snapshots and drops rows older than the retention cutoff and not equal to the current version (you can never drop the current snapshot, even if you set retention to 0). The second loop computes `live_files` as the union of every surviving snapshot's manifest; any file on disk that is not in that set is unreachable and safe to erase. `erased = on_disk - live_files` is the entire physical-deletion contract — if a file is referenced by any surviving snapshot, it stays. Why two loops instead of one: if you erased files while iterating snapshots, you might erase a file that an earlier-iteration snapshot also references but you haven't checked yet. The set-difference is what guarantees VACUUM never touches anything a surviving snapshot reaches — you only delete what no manifest references.
The first vacuum call (retain 100s) does nothing because all snapshots are within the retention window. The second call (retain 1s) drops snapshots 1 and 2, leaving only snapshot 3 (the current one) — and f1 is now in no surviving manifest, so it's deleted. f2 stays because it's still in snapshot 3's manifest. This is the entire mechanism behind a billion-row enterprise VACUUM job; the production code just adds parallelism and S3 batched deletes.
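The toy quietly assumes no writer is mid-commit while VACUUM runs. A file that has been written but not yet committed sits on disk with no manifest referencing it, and a naive set-difference would erase it; production sweepers guard this with a grace window (Iceberg's remove_orphan_files, for instance, defaults to only removing files older than three days). A minimal sketch of that guard, where the helper name and window are mine, not part of the toy above:

```python
import os, time, tempfile

def safe_to_erase(path, live_files, grace_seconds=3 * 24 * 3600):
    """Erasable only if no surviving manifest references the file AND the
    file is older than the grace window: a freshly written file may belong
    to a commit that is still in flight."""
    if path in live_files:
        return False
    return time.time() - os.path.getmtime(path) > grace_seconds

# Demo: a just-written orphan is spared; an old one is erasable.
with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as f:
    orphan = f.name
print(safe_to_erase(orphan, live_files=set()))   # False: inside grace window
old = time.time() - 7200
os.utime(orphan, (old, old))                     # pretend it is 2 hours old
print(safe_to_erase(orphan, live_files=set(), grace_seconds=3600))  # True
os.remove(orphan)
```

The same check is why the toy's vacuum() is only safe for single-writer use: it has no notion of file age at all.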
## The GDPR/DPDP delete problem
India's Digital Personal Data Protection Act (DPDP, 2023) and the EU's GDPR both require that personal data be erased — physically — within a deadline (typically 30 days for DPDP, "without undue delay" and usually 30 days for GDPR). The lakehouse retention window is the policy knob that decides whether you can hit that deadline.
Set retention too long (say 90 days for time travel) and you've exceeded the legal window before VACUUM even runs. Set retention too short and time travel becomes useless — and any rollback older than retention requires restoring from backup. The middle ground that most Indian fintechs settled on after their first DPDP audit is 7-day retention for tables that contain personal data; 30-day retention for derived/aggregate tables that don't. The classification has to be explicit and enforced at table creation time.
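The per-class retention rule can be made mechanical. A hedged sketch: the class names and windows mirror the text, but the RETENTION_BY_CLASS table and create_table hook are illustrative, not a real engine API:

```python
# Hypothetical policy table. Refusing an unclassified table is the point:
# classification must be explicit at creation time.
RETENTION_BY_CLASS = {
    "pii": 7 * 24 * 3600,       # personal data: short window, fast erasure
    "derived": 30 * 24 * 3600,  # aggregates with no personal data
}

def create_table(name, data_class):
    if data_class not in RETENTION_BY_CLASS:
        raise ValueError(
            f"{name}: data_class must be one of {sorted(RETENTION_BY_CLASS)}")
    return {"table": name, "retain_seconds": RETENTION_BY_CLASS[data_class]}

print(create_table("payments", "pii"))
# → {'table': 'payments', 'retain_seconds': 604800}
try:
    create_table("clickstream", "unclassified")
except ValueError as e:
    print(e)   # classification is mandatory, not a default
```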
There is one more wrinkle. A hard delete must remove the row from every replica too — backups, dev clones, archived snapshots in cold storage, downstream Kafka topics if the data was streamed. Aditi's team at her Bengaluru lending startup discovered after their first DPDP audit that the payments Iceberg table had a lineage out to 14 derived tables, 3 dev clones, and a 90-day Glacier archive of weekly snapshots. Erasing the source was step one of 19. They wrote an internal tool, pii-tracer, that reads column-level lineage and produces the full deletion plan; the audit now passes in 18 days instead of "we'll get back to you".
## Compaction makes it worse before it makes it better
A subtle interaction: compaction (the next chapter, /wiki/compaction-small-files-hell-and-how-to-avoid-it) merges many small files into fewer large ones. Run a row-level DELETE for user 8841302 and then compact before VACUUM, and a correct compaction writes a merged file without the deleted row — but the old small files, deleted row included, still exist on disk until retention expires. Worse: if compaction reads only the data files and ignores the delete files, the new merged file re-includes the row that was supposed to be deleted.
Iceberg v2's "position deletes" and "equality deletes" both encode "row X is gone" in a separate file alongside the data. Compaction has to read both the data file and the delete file, then write a clean merged file without the deleted rows. Why this matters for DPDP: if compaction is misconfigured (e.g. compacts data files but ignores delete files), the merged file resurrects the deleted row, and the row only goes away after the next deletion + compaction round. There are real production stories of "I deleted that row last quarter, why is it back?" — almost always a compaction-without-delete-file bug.
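In toy form, with JSON row lists standing in for Parquet files and a plain list of (file, position) tuples standing in for Iceberg's position-delete files, a delete-aware compaction looks like this:

```python
def compact(data_files, position_deletes):
    """Merge data files into one row list, skipping every (file, pos) the
    delete files name. Forgetting the position_deletes argument is the
    'resurrected row' bug described above."""
    deleted = set(position_deletes)  # {(file_name, row_position), ...}
    merged = []
    for fname, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (fname, pos) not in deleted:
                merged.append(row)
    return merged

data = {
    "part-00037": [{"user_id": 8841302, "pan": "ABCDE1234F"},
                   {"user_id": 7001, "pan": "PQRSX5555K"}],
    "part-00038": [{"user_id": 7002, "pan": "LMNOP1111J"}],
}
deletes = [("part-00037", 0)]          # the DPDP-deleted row
print(compact(data, deletes))          # two rows, 8841302 gone
print(compact(data, []))               # buggy call: the deleted row is back
```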
## Common confusions
- "VACUUM is the same as DELETE." It is not. DELETE issues a tombstone in the manifest — the row is no longer visible in current reads. VACUUM is the physical sweep that actually removes the file from S3 once no live snapshot references it.
- "DELETE on a Parquet folder physically erases the row." Only inside a table format. A bare Parquet folder with no Iceberg/Delta layer has no delete primitive — you'd have to rewrite the file yourself, manually. The two-phase mechanism only applies inside a table format.
- "GDPR is satisfied as soon as the row stops appearing in queries." Most regulators care about the byte on disk, not query visibility. Logical deletion plus a retention-window plan that erases within the deadline is the compliance posture; tombstone-only is not.
- "Setting retention to 0 days erases instantly." No — retention only controls when expired snapshots can be removed; the current snapshot is always retained. And many engines refuse retention shorter than 7 days unless you explicitly disable the safety check (in Delta, the `spark.databricks.delta.retentionDurationCheck.enabled` config), because shorter retention can race with concurrent reads.
- "VACUUM can run while writers are committing." Yes, but only if it computes the live-file set under a snapshot lock and uses the set-difference pattern shown above. Naive implementations that delete-while-iterating can race with a commit that adds a file VACUUM has just classified as orphaned.
- "Time travel and DPDP are inherently incompatible." They are in tension, not incompatible. The reconciliation is: classify tables by PII-bearing-or-not, set retention per class, erase at the legal deadline, accept that PII tables have shorter time-travel windows than analytics tables.
## Going deeper
### Position deletes vs equality deletes vs copy-on-write
Iceberg v2 supports three delete strategies. Copy-on-write (the default) rewrites the entire file on every delete — high write amplification, simple read path, and the deleted bytes are gone from the new file (but persist in the old one until VACUUM). Position deletes record (file_path, row_position) tuples in a separate delete file; reads merge data + delete file at query time — low write amplification, but the deleted bytes physically remain in the data file until the next compaction. Equality deletes record (column_value) predicates instead of positions; useful for streaming CDC-style deletes where you don't know the position. For DPDP, copy-on-write is the only strategy that physically removes the byte at delete time — but it's expensive, so most teams use position deletes + scheduled compaction with a 7-day SLO.
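The strategies differ mainly in when the byte disappears. A toy contrast, with plain Python lists in place of Parquet files and function names that are mine, not Iceberg's:

```python
def copy_on_write_delete(rows, predicate):
    """Rewrite the file immediately without the matching rows: the byte is
    gone from the new file at delete time (the old file still awaits VACUUM)."""
    return [r for r in rows if not predicate(r)]

def position_delete(rows, predicate):
    """Record row positions in a separate delete file; the data file is
    untouched until the next compaction merges the two."""
    return [pos for pos, r in enumerate(rows) if predicate(r)]

rows = [{"user_id": 8841302}, {"user_id": 7001}]
is_target = lambda r: r["user_id"] == 8841302
print(copy_on_write_delete(rows, is_target))  # new file: [{'user_id': 7001}]
print(position_delete(rows, is_target))       # delete file: [0]; data file unchanged
```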
### DPDP Act vs GDPR for Indian data engineers
DPDP (Digital Personal Data Protection Act, 2023) is India's first comprehensive personal-data law and applies to any digital personal data processed in India. The deletion deadline is "as soon as it is no longer necessary" — in practice, fintechs and health platforms use 30 days. GDPR's "right to erasure" applies to EU residents but reaches Indian companies that process EU data. The two laws diverge on definitions (DPDP is narrower on what counts as personal data; GDPR has stronger consent requirements), but converge on the operational requirement: erase the byte within the window, and prove you did. The proof is a deletion log that ties the user request to a specific snapshot ID and a VACUUM run timestamp.
### Multi-region and Glacier-tier complications
A row in s3://prod-lake-mumbai/payments/ may be replicated to s3://prod-lake-singapore/payments/ for DR, and snapshots may be archived nightly to S3 Glacier Deep Archive for a 7-year audit trail. A DPDP delete has to reach all three. Replication delete-propagation is manageable (S3 replication can propagate delete markers, though it has to be enabled explicitly), but the archive tier is harder: restoring a Deep Archive object to inspect or rewrite it takes up to 12 hours, and objects removed before the tier's minimum storage duration incur early-deletion charges — at 100M archived objects, the erasure run is a real budget line. Real teams batch deletes weekly during off-peak windows and budget for it explicitly.
### Audit trails: the "we erased it" proof
Regulators can ask for proof that a specific user's data was actually erased. The standard pattern is a deletion_log table that records: the user request ID, the snapshot ID at request time, the snapshot ID after the DELETE commit, the VACUUM run ID that physically removed the files, and the cryptographic hash of the manifest at each step. Without this trail, the auditor's question "show me that row 8841302 is gone" has no answer beyond "trust us". Razorpay's compliance team open-sourced a deletion-attest library that produces a Merkle proof of erasure tied to AWS CloudTrail logs.
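A minimal version of that deletion_log record, with hashlib standing in for whatever hashing the real attestation tooling uses (the field names are illustrative, not Razorpay's schema):

```python
import hashlib, json
from datetime import datetime, timezone

def manifest_hash(manifest):
    """Stable SHA-256 over the sorted file list, so the auditor can verify
    exactly which files each snapshot referenced."""
    return hashlib.sha256(json.dumps(sorted(manifest)).encode()).hexdigest()

def log_deletion(request_id, snap_before, snap_after, vacuum_run_id):
    # Each field mirrors the pattern in the text: user request -> DELETE
    # commit -> VACUUM run, with a hash pinning each manifest's contents.
    return {
        "request_id": request_id,
        "snapshot_before": {"version": snap_before[0],
                            "manifest_sha256": manifest_hash(snap_before[1])},
        "snapshot_after": {"version": snap_after[0],
                           "manifest_sha256": manifest_hash(snap_after[1])},
        "vacuum_run_id": vacuum_run_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

entry = log_deletion("DPDP-2026-0412",
                     (2, ["p_001.json", "p_002.json"]),  # before: both files live
                     (3, ["p_002.json"]),                # after: target file dropped
                     "vac-2026-04-21-01")
print(json.dumps(entry, indent=2))
```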
### Encrypted-at-rest as an alternative erasure path
If every Parquet file is encrypted with a per-user data-encryption key (DEK), erasing a user reduces to destroying their DEK — no manifest sweep, no S3 delete, no compaction race. This is "crypto-shredding", and AWS KMS, Google Tink, and Azure Key Vault all support per-key destruction. The trade-off: every read pays a KMS-decrypt round trip, and key-management complexity grows linearly with users (millions of DEKs at a UPI-scale fintech). In practice, large Indian platforms apply crypto-shredding to a narrow PII column group (PAN, phone, Aadhaar) and use VACUUM-based deletion for everything else — the column-level encryption comes from Parquet's modular-encryption feature.
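The mechanism fits in a few lines. In this sketch a one-time pad stands in for AES-GCM purely so the example is self-contained and stdlib-only; the point is only that shred() destroys the key, after which the ciphertext left on S3 is unrecoverable:

```python
import secrets

class DekStore:
    """Per-user data-encryption keys. Destroying a key makes that user's
    ciphertext unrecoverable without rewriting any file."""
    def __init__(self):
        self._keys = {}

    def encrypt(self, user_id, plaintext: bytes) -> bytes:
        # One-time pad (key as long as the data, used once) stands in for
        # AES-GCM purely for illustration; do not use XOR in production.
        key = secrets.token_bytes(len(plaintext))
        self._keys[user_id] = key
        return bytes(p ^ k for p, k in zip(plaintext, key))

    def decrypt(self, user_id, ciphertext: bytes) -> bytes:
        key = self._keys[user_id]   # KeyError once shredded
        return bytes(c ^ k for c, k in zip(ciphertext, key))

    def shred(self, user_id):
        del self._keys[user_id]     # the entire "erasure"

store = DekStore()
ct = store.encrypt(8841302, b"PAN=ABCDE1234F")
print(store.decrypt(8841302, ct))   # b'PAN=ABCDE1234F'
store.shred(8841302)
# ct still sits on disk, but without the DEK it is indistinguishable from noise
```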
## Where this leads next
Compaction is the next operational step — it converts the small-file commit churn that retention forces into the large-file efficiency that read engines need. Read /wiki/compaction-small-files-hell-and-how-to-avoid-it for the trade-off between commit frequency, small-file count, and read latency.
For the snapshot mechanism that retention is built on, return to /wiki/time-travel-and-zero-copy-clones-for-data-engineers. For the underlying file layout, see /wiki/iceberg-delta-hudi-from-the-producers-perspective. For how concurrent writers don't trample VACUUM, see /wiki/concurrent-writers-without-stepping-on-each-other.
## References
- Delta Lake VACUUM documentation — canonical syntax, retention semantics, and the `RETAIN n HOURS` clause.
- Iceberg expire_snapshots and remove_orphan_files — the two-phase Iceberg cleanup procedures.
- Iceberg v2 row-level deletes spec — position deletes vs equality deletes vs copy-on-write.
- The Digital Personal Data Protection Act, 2023 (India) — text of India's data-deletion law; the 30-day operational deadline derives from §12 (right to correction and erasure).
- GDPR Article 17 — Right to erasure — the EU regulation Indian companies serving EU users must satisfy.
- Apache Hudi Cleaner — Hudi's combined retention + cleanup primitive.
- Razorpay engineering: building DPDP-compliant deletion at lakehouse scale — production write-up on the audit-trail and deletion-log patterns referenced above.
- /wiki/time-travel-and-zero-copy-clones-for-data-engineers — the snapshot model that retention controls.