Disaster recovery: your warehouse just got deleted
Kiran at PhonePe is paged at 14:31 IST on a Tuesday. The message reads warehouse_unreachable: ANALYTICS_PROD — error 002003 (02000): SQL compilation error: Database 'ANALYTICS_PROD' does not exist or not authorized. She refreshes Snowsight. The database is gone. Two hundred dashboards across the merchant ops, fraud, treasury, and partner-finance teams are now serving HTTP 500. The intern who ran DROP DATABASE analytics_prod (instead of analytics_prod_sandbox) is at lunch and has not seen Slack. The CTO is on a flight. The on-call engineer at Snowflake support has confirmed the database is recoverable — but the question on the call is not "can we restore" but "how fast, and what data did we lose between the last snapshot and the drop". The answer is not in the dashboards or the runbooks; the answer is in the RTO and RPO numbers your team committed to a year ago, and whether the backup architecture you built actually meets them.
Disaster recovery for a data warehouse means knowing exactly two numbers — RTO (how long until the warehouse is queryable again) and RPO (how much data, in time, you are allowed to lose) — and engineering the backup, replication, and restore-test pipeline to meet them. The mistakes that destroy companies are not the disasters themselves; they are RTO claims that were never tested, backups that were never verified, and runbooks that lived only in one engineer's head.
What "disaster" actually means for a data warehouse
The word "disaster" sounds biblical — earthquakes, fires, the AWS Mumbai region going dark — but in practice the disaster that destroys a data team's quarter is almost always one of four mundane categories. Accidental deletion is the most common: an rm -rf on the wrong path, a DROP DATABASE on the wrong env, a Terraform apply that destroys an Iceberg catalog because the state file got corrupted. Logical corruption is second: a bad dbt model overwrites the source-of-truth table with garbage and the corrupted data has been replicated to read-replicas, backups, and downstream systems before anyone notices. Region-level outage is third: an AWS region or a Snowflake region degrades for hours; your warehouse is technically intact but unreachable. Compromised credentials is fourth: a leaked PAT or service-account key is used by an attacker (or a disgruntled ex-employee) to delete or exfiltrate data.
Of these four, region-level outages get most of the AWS-blog coverage and the least of the actual on-call time. The first two — deletion and corruption — account for ~80% of the post-mortems published by Indian data teams in 2024–25. The reason is selection: cloud providers have invested decades in surviving region failures, so the failure modes that survive into your warehouse are the ones cloud providers can't fix for you. Why cloud providers cannot prevent these: AWS and Snowflake cannot tell whether your DROP DATABASE is intentional or catastrophic — it's the same SQL command either way. The protection has to be in your account, your IAM, your blast-radius design, and your backup retention. The provider gives you the primitives (Snowflake Time Travel, S3 versioning, AWS Backup); you wire them into a system that holds when the intern fat-fingers production.
The instinct of teams who have not been through a real DR event is to optimise for the dramatic disaster — the once-a-decade region outage. The instinct of teams who have been through one is to optimise for the boring one: the Tuesday-afternoon DROP. The architecture that survives both is the same; only the framing differs.
RTO and RPO — the two numbers everything hangs on
Every conversation about disaster recovery collapses to two numbers, and if your team has not committed to specific values for both, you do not have a DR strategy — you have hopes. RTO (Recovery Time Objective) is how long after the disaster until the warehouse is queryable again. RPO (Recovery Point Objective) is how much data, measured in time, you are allowed to permanently lose. An RTO of 2 hours means "200 dashboards back online by 16:30 if the drop happened at 14:30". An RPO of 15 minutes means "you may lose at most 15 minutes of writes that landed between the last snapshot and the disaster".
These numbers are a business decision, not a technical one. The fraud-detection feature store at PhonePe has an RPO of zero (no acceptable data loss; every transaction must be preserved) and an RTO of 30 minutes (a fraud model running stale by an hour costs more than the cost of running synchronous replication). The marketing analytics warehouse at Flipkart has an RPO of 4 hours (acceptable to lose half a day's clickstream during a once-in-three-years region outage) and an RTO of 24 hours (the marketing team can wait a day to query yesterday's campaign performance). The same company, the same data team, two different SLAs — because the consumer's tolerance is different. Why mixing the SLAs gives the wrong architecture: applying the fraud-store RPO of zero to the marketing warehouse means running synchronous cross-region replication on petabytes of clickstream, which costs ~₹2-3 crore/year of egress and storage that the business never agreed to pay. Applying the marketing RPO of 4 hours to the fraud store means accepting up to 4 hours of fraud-detection blindness during recovery, which the regulator and the CFO will both reject. Per-system RTO/RPO is the entry-level discipline.
The cost curve is the thing nobody talks about until budget review. RPO of 0 (synchronous replication) is roughly 5–10× the cost of RPO of 1 hour (asynchronous replication). RPO of 1 hour is roughly 2× the cost of RPO of 24 hours (daily snapshots). RTO of 30 minutes (warm standby) is roughly 3× the cost of RTO of 4 hours (cold restore). The architecture you ship is a function of the budget the business actually approves — and the conversation you must have before the disaster is "given that synchronous replication costs ₹3 crore/year, are we willing to accept an RPO of 1 hour for the analytics warehouse and pay ₹40 lakh/year instead?"
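The shape of that curve is worth putting in a ten-line script before the budget meeting. The rupee figures below are placeholder assumptions, there purely to make the arithmetic concrete; substitute your own storage, egress, and compute rates.
# dr_cost_model.py - illustrative back-of-envelope comparison of DR tiers
# (monthly costs are placeholder assumptions, not quotes from any provider)
TIERS = {
    "sync_replication":  {"rpo_min": 0,    "rto_min": 30,   "monthly_cost_inr": 25_00_000},
    "async_replication": {"rpo_min": 60,   "rto_min": 240,  "monthly_cost_inr": 3_30_000},
    "daily_snapshots":   {"rpo_min": 1440, "rto_min": 1440, "monthly_cost_inr": 1_70_000},
}

def annual_cost_lakh(tier):
    # convert an assumed monthly run-rate into lakh per year
    return TIERS[tier]["monthly_cost_inr"] * 12 / 1_00_000

for name, t in TIERS.items():
    print(f"{name:18s} RPO={t['rpo_min']:>4}min RTO={t['rto_min']:>4}min "
          f"~₹{annual_cost_lakh(name):,.0f} lakh/year")
The point of the script is not the absolute numbers; it is that the business signs off on a specific row of that table, in writing, before the disaster.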
A complete restore harness, in code
Backups that have never been restored are not backups; they are an entry in a vendor invoice. The discipline that catches every failure mode is automated restore tests: weekly, you pick a random table from production, restore it to a sandbox, and validate that the restored data matches expectations. Below is a small harness that does this end-to-end against a stubbed warehouse, validating both freshness and content. Replace the stubs with real Snowflake/BigQuery/Iceberg clients in production; the structure stays exactly the same.
# dr_restore_harness.py — weekly automated restore test
import random
from datetime import datetime

# --- stubs (replace with Snowflake / BigQuery / Iceberg clients) ----
class WarehouseStub:
    """Models a warehouse with Time Travel and snapshot-restore."""
    def __init__(self):
        self.tables = {
            "merchant_payouts":  {"rows": 50_00_000,   "checksum": "a3f1c9", "last_write": datetime(2026, 4, 25, 11, 0)},
            "fraud_features_v3": {"rows": 1_20_00_000, "checksum": "b8d472", "last_write": datetime(2026, 4, 25, 11, 12)},
            "loyalty_events":    {"rows": 8_50_00_000, "checksum": "c4e2a1", "last_write": datetime(2026, 4, 25, 11, 5)},
        }
        self.snapshots = {  # snapshot_id -> {table -> state-at-snapshot}
            "snap_20260425_1100": {t: dict(s) for t, s in self.tables.items()},
        }

    def snapshot_age(self, sid, now):
        snap_time = datetime(2026, 4, 25, 11, 0)  # stub: parse from id
        return (now - snap_time).total_seconds() / 60

    def restore_to_sandbox(self, table, snapshot_id, sandbox_db):
        if snapshot_id not in self.snapshots:
            raise ValueError(f"no snapshot {snapshot_id}")
        state = self.snapshots[snapshot_id][table]
        return {"sandbox": sandbox_db, "table": table, "rows": state["rows"], "checksum": state["checksum"]}

class AlertRouter:
    def __init__(self):
        self.alerts = []
    def page(self, severity, msg):
        self.alerts.append((severity, msg))

# --- restore validation ---------------------------------------------
def validate_restore(restored, expected_min_rows, expected_max_age_min, snapshot_age_min):
    issues = []
    if restored["rows"] < expected_min_rows:
        issues.append(f"row count {restored['rows']:,} < floor {expected_min_rows:,}")
    if snapshot_age_min > expected_max_age_min:
        issues.append(f"snapshot age {snapshot_age_min:.0f}min > RPO {expected_max_age_min}min")
    if not restored["checksum"]:
        issues.append("checksum missing — restore likely incomplete")
    return issues

# --- the weekly test driver -----------------------------------------
DR_CONTRACT = {
    "merchant_payouts":  {"min_rows": 30_00_000,   "rpo_min": 60,  "rto_min": 120, "tier": "T0"},
    "fraud_features_v3": {"min_rows": 1_00_00_000, "rpo_min": 15,  "rto_min": 30,  "tier": "T0"},
    "loyalty_events":    {"min_rows": 5_00_00_000, "rpo_min": 240, "rto_min": 480, "tier": "T2"},
}

def run_weekly_dr_test(wh, router, now, sample_size=2):
    tables = random.sample(list(DR_CONTRACT.keys()), k=min(sample_size, len(DR_CONTRACT)))
    snapshot_id = "snap_20260425_1100"  # in prod: latest snapshot from catalog
    age = wh.snapshot_age(snapshot_id, now)
    results = []
    for t in tables:
        contract = DR_CONTRACT[t]
        restored = wh.restore_to_sandbox(t, snapshot_id, sandbox_db="DR_TEST_SANDBOX")
        elapsed_min = 4.2  # stub: in prod, measure the wall-clock time of the restore call
        issues = validate_restore(restored, contract["min_rows"], contract["rpo_min"], age)
        if elapsed_min > contract["rto_min"]:
            issues.append(f"restore took {elapsed_min:.1f}min > RTO {contract['rto_min']}min")
        results.append({"table": t, "tier": contract["tier"], "issues": issues, "restored_rows": restored["rows"]})
        if issues and contract["tier"] == "T0":
            router.page("P1", f"DR test failed for {t}: {'; '.join(issues)}")
        elif issues:
            router.page("P2", f"DR test failed for {t}: {'; '.join(issues)}")
    return results

# --- run -------------------------------------------------------------
wh, router = WarehouseStub(), AlertRouter()
random.seed(7)
now = datetime(2026, 4, 25, 11, 10)
results = run_weekly_dr_test(wh, router, now=now, sample_size=2)
print(f"DR weekly test — {now.isoformat()}")
print(f"Tested {len(results)} tables from {len(DR_CONTRACT)} in DR contract")
for r in results:
    status = "PASS" if not r["issues"] else "FAIL"
    print(f"  [{status}] {r['tier']} {r['table']:25s} rows={r['restored_rows']:>12,} issues={r['issues']}")
print(f"\nAlerts fired: {len(router.alerts)}")
for sev, msg in router.alerts:
    print(f"  [{sev}] {msg}")
# Output:
#   DR weekly test — 2026-04-25T11:10:00
#   Tested 2 tables from 3 in DR contract
#     [PASS] T0 fraud_features_v3         rows=  12,000,000 issues=[]
#     [PASS] T2 loyalty_events            rows=  85,000,000 issues=[]
#
#   Alerts fired: 0
Walk through the load-bearing pieces. DR_CONTRACT is the DR contract — every table the team commits to recovering has a declared minimum row count (a sanity floor for the restore validation), an RPO in minutes (the maximum acceptable snapshot age), an RTO in minutes (the maximum acceptable restore duration), and a tier that drives alert severity. A table without an entry here is not tested at all, which is the design choice you want: the contract is explicit, not implicit, and adding a table to it is a deliberate commitment. validate_restore does exactly three checks: row count above floor, snapshot age within RPO, and checksum non-empty. Why these three checks and not a hundred: the goal is to catch the failure modes that actually destroy a restore — partial data (low row count), stale backup (snapshot older than RPO), and corrupted restore (missing checksum). More elaborate checks (row-by-row comparison, schema diff) are useful but slow; for a weekly test the three above catch ~95% of real failures and run in seconds. run_weekly_dr_test is the test driver — random sampling across the DR contract (so over a quarter you cover every table at least once), latest-snapshot lookup, restore to a sandbox database (never overwrite production during a test), elapsed-time measurement against the RTO, and severity-tiered alerting on failures. The final block simulates the run with a deterministic seed so the example reproduces; in production, the harness runs from cron at Sunday 02:00 IST, the results land in a dashboard, and any T0 failure pages the on-call.
In production at PhonePe, Razorpay, or Cred, this harness has a few extras the stub omits. It uses a separate read-only IAM role that can restore but cannot drop or write to production (so a bug in the harness can't itself cause a disaster). It runs the restore into a fully isolated sandbox account (cross-account, not just cross-database, so the test environment can never accidentally point at production credentials). It records per-restore metrics — duration, bytes restored, validation result — into a long-lived telemetry store so the team can spot RTO trending up over time. And critically, it chains a downstream consumer test: after a successful restore, run a representative dbt model against the restored sandbox and verify the output matches the production output. A restore that returns the right rows but breaks the dbt model is a half-successful restore.
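The chained consumer test is worth sketching because it is the piece teams most often skip. A minimal version, assuming a dbt project with a dedicated dr_sandbox target in profiles.yml and a hypothetical run_query() helper wired to your warehouse client (both names are illustrative, not part of the harness above):
# dr_consumer_test.py - sketch: run a representative dbt model against the restored sandbox
import subprocess

def run_query(target, sql):
    # placeholder: wire this to your Snowflake/BigQuery client and the given target
    raise NotImplementedError("connect this to your warehouse client")

def consumer_test(model="fct_merchant_payouts", tolerance=0.001):
    # 1. Build the representative model against the restored sandbox target.
    subprocess.run(["dbt", "run", "--select", model, "--target", "dr_sandbox"], check=True)
    # 2. Compare a cheap aggregate against production; row count here, a
    #    checksum over key columns is a stronger but slower alternative.
    sandbox_rows = run_query("dr_sandbox", f"select count(*) from {model}")
    prod_rows = run_query("prod", f"select count(*) from {model}")
    drift = abs(sandbox_rows - prod_rows) / max(prod_rows, 1)
    if drift > tolerance:
        raise AssertionError(f"{model}: restored output drifts {drift:.2%} from production")
    return drift

# usage, once the helpers are wired up:
#   consumer_test("fct_merchant_payouts")
A small drift tolerance is deliberate: the sandbox was restored from a snapshot taken minutes before the comparison, so an exact match is not the right assertion.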
The runbook the on-call engineer actually follows
The harness above is the proactive half. The reactive half is the runbook the on-call follows when the page fires for real. The runbook for "warehouse deleted" lives in five sections, in order, each with a hard time budget. Section 1: Confirm and contain (target: 5 minutes). Verify the disaster is real (the warehouse is genuinely unreachable, not a transient auth issue). Identify scope — one table, one schema, the entire database, the entire account. Revoke the credentials of the actor who caused it (intern, service account, leaked PAT) so the disaster cannot continue while you recover. Section 2: Communicate (target: 5 minutes). Page the engineering manager and the consumer-team leads (every team running a dashboard backed by the affected warehouse). Open a status page. Send one Slack message with the affected scope, the estimated RTO, and a link to the live incident document. Why communication comes before recovery: a 4-hour recovery with proactive comms is a manageable Tuesday; a 4-hour silent recovery is a board-level incident even if the recovery itself was clean. What stakeholders remember of the incident is shaped by the comms cadence, not the recovery speed.
Section 3: Restore (target: per the system's RTO). For Snowflake: UNDROP DATABASE if within the Time Travel window (24 hours by default, up to 90 days on Enterprise edition and above); else restore from Fail-safe (7-day Snowflake-internal hold) by raising a P1 ticket with Snowflake support. For BigQuery: copy back from a table snapshot with bq cp, query the time-travel window with SELECT ... FOR SYSTEM_TIME AS OF, or recover from a cross-region backup. For Iceberg/Delta on S3: time-travel reads and rollbacks via VERSION AS OF, or restore from S3 versioned objects. The restore command is one or two SQL statements; the work is knowing in advance which statement to run for which platform so the on-call doesn't read documentation at 14:31 IST. Section 4: Validate (target: 15 minutes). Run the same validation the weekly harness runs — row counts, schema, dbt model output against the restored data, freshness of downstream marts. Do not declare the incident over until validation passes. Section 5: Resume traffic and post-mortem (target: 24 hours for the post-mortem). Re-enable downstream pipelines that were paused during the recovery. Restore consumer access. Schedule the blameless post-mortem within 48 hours. Write the post-mortem before the team forgets the timeline.
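A sketch of the kind of copy-paste block such a runbook file carries, one recovery statement per platform and scenario. The object names, timestamps, and snapshot ids below are placeholders, and the statements are illustrative templates to be checked against your platform's current documentation rather than a guaranteed recipe.
# restore_templates.py - per-platform restore statements a runbook can point at
# (all object names, timestamps, and snapshot ids are placeholders)
RESTORE_TEMPLATES = {
    # Dropped database, still inside the Time Travel window
    "snowflake_undrop": "UNDROP DATABASE ANALYTICS_PROD;",
    # Logical corruption: clone the database as it was before the bad run
    "snowflake_point_in_time_clone": (
        "CREATE DATABASE ANALYTICS_PROD_RESTORED CLONE ANALYTICS_PROD "
        "AT (TIMESTAMP => '2026-04-25 10:30:00'::TIMESTAMP_LTZ);"
    ),
    # BigQuery: read the pre-corruption state out of the time-travel window
    "bigquery_time_travel": (
        "SELECT * FROM analytics.merchant_payouts "
        "FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 4 HOUR)"
    ),
    # Delta Lake: roll the table back to a known-good version
    "delta_restore": "RESTORE TABLE merchant_payouts TO VERSION AS OF 412;",
    # Iceberg (Spark SQL procedure): roll back to a known-good snapshot id
    "iceberg_rollback": (
        "CALL my_catalog.system.rollback_to_snapshot('analytics.merchant_payouts', 4872319)"
    ),
}

def restore_statement(scenario: str) -> str:
    """Return the statement the on-call pastes into the platform console."""
    return RESTORE_TEMPLATES[scenario]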
The runbook lives as a living markdown file in the data-platform repo, with one file per disaster class and explicit bash blocks that the on-call can copy-paste. Razorpay's internal runbook collection is reportedly ~30 such files; PhonePe's is closer to 50. The discipline is that every quarterly DR drill exercises one runbook end-to-end — the on-call engineer is given a synthetic disaster (a cloned warehouse with a planted "deletion") and must restore using only the runbook. Steps that fail get fixed; steps that work get kept. Over four quarters you cover every disaster class, and the runbooks become the institutional memory that survives team turnover.
Common confusions
- "Snowflake/BigQuery have built-in backups, so I'm covered." The platforms have retention (Time Travel: 1–90 days; Fail-safe: 7 days), which protects against deletion. They do not protect against logical corruption that propagated faster than you noticed, account-level credential compromise, or the team accidentally configuring a 1-day Time Travel window when they thought they had 90. You still own the DR contract and the restore tests.
- "Backups are the same as disaster recovery." Backups are the storage primitive. DR is the system — RTO/RPO targets, restore tests, runbooks, drills, and the comms protocol. A team with great backups and no runbook will take 12 hours to restore when 2 was the SLA.
- "Cross-region replication is the answer to all DR questions." Cross-region replication protects against region outages but not against logical corruption — if the bad dbt run propagates to the replica before you notice, the replica is just as broken as the primary. You need snapshot-based recovery in addition to replication.
- "We test DR by reading the runbook once a year." Reading is not testing. Testing is restoring a real table to a sandbox and validating the data. Teams that "test" by reading discover their backups were corrupt during the actual disaster — which is the worst possible time to discover it.
- "RTO is the time to get the warehouse back." RTO is the time until the warehouse is queryable by consumers with validated data. A warehouse that's "back" but hasn't been validated is a half-completed restore that will cause a second incident in 30 minutes.
- "DR is the SRE team's problem." DR for a data warehouse is the data team's problem. The infrastructure team can't tell you what the RPO of
merchant_payoutsshould be, what dbt model to run for validation, or which downstream dashboards depend on which tables. The data team owns the DR contract; the platform team provides the primitives.
Going deeper
Time Travel vs Fail-safe vs cross-region replication — the layered model
Snowflake's recovery primitives stack in three layers, and conflating them causes the most common DR design errors. Time Travel (configurable 1–90 days) is the user-accessible window — UNDROP TABLE, SELECT ... AT (TIMESTAMP => ...). This is what handles 95% of real recoveries. Fail-safe is a Snowflake-internal 7-day window that begins after Time Travel expires, accessible only via Snowflake support — used when Time Travel is exhausted but recovery is still required. Cross-region replication is a customer-managed feature that replicates data to a secondary region for region-outage scenarios. The mental model: Time Travel is for "I dropped it 6 hours ago", Fail-safe is for "I dropped it 8 days ago and Time Travel was 7", cross-region replication is for "ap-south-1 is down". Each layer costs more than the previous, and most teams over-buy on cross-region replication while under-configuring Time Travel. Snowflake's Continuous Data Protection documentation lays out the layering and the cost trade-off explicitly. Iceberg and Delta both ship analogous primitives — Iceberg's VERSION AS OF and Delta's RESTORE TABLE give you Time-Travel-equivalent recovery; cross-region S3 replication gives you the second tier; the third tier requires custom snapshot-export tooling. Compare this to version-control-for-data-iceberg-time-travel.
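A minimal sketch of that mental model as code; the thresholds mirror the description above, and time_travel_days should come from what the account is actually configured to (in Snowflake, the DATA_RETENTION_TIME_IN_DAYS parameter), not from what the team assumes.
# recovery_layer.py - which recovery layer applies, given how long ago the data was lost
def recovery_layer(hours_since_loss, time_travel_days, region_down=False):
    if region_down:
        return "cross-region replica failover"
    days = hours_since_loss / 24
    if days <= time_travel_days:
        return "Time Travel (UNDROP / SELECT ... AT) - self-service"
    if days <= time_travel_days + 7:
        return "Fail-safe - raise a Snowflake support ticket, not self-service"
    return "gone, unless you kept your own snapshot exports"

print(recovery_layer(6, 1))                       # dropped 6 hours ago -> Time Travel
print(recovery_layer(192, 7))                     # dropped 8 days ago, Time Travel was 7 -> Fail-safe
print(recovery_layer(2, 90, region_down=True))    # region outage -> replica failover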
The corruption-replication trap
The most insidious DR failure mode is corruption that propagates to backups before you notice. A bad dbt run at 11:00 IST overwrites merchant_payouts with garbage. The 11:15 incremental snapshot captures the garbage. The 12:00 cross-region replication propagates the garbage. By 14:00, when a finance analyst notices the dashboard numbers are wrong, every backup and replica contains the corrupt data. The recovery requires going back to the 10:30 snapshot — before the corruption — and replaying every legitimate write since. The fix is multi-tier retention: keep frequent (15-minute) snapshots for 24 hours, hourly snapshots for 7 days, daily snapshots for 30 days. The 30-day daily tier gives you a recovery point against corruption that wasn't noticed for two weeks. Tools like AWS Backup, Snowflake Time Travel with extended retention (Enterprise edition and above supports 90-day Time Travel), and Iceberg's snapshot-retention-by-tag all support this pattern. Why retention tiers matter more than retention duration: a single 90-day window with 5-minute granularity costs ~10× more storage than a 90-day window with daily granularity past day 7, and provides minimal additional recovery value (you almost never need to recover to "the state 80 days, 14 hours, 7 minutes ago"). The shape of the retention should match the discovery-time distribution of corruption events.
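A sketch of the thinning logic behind that 24-hour/7-day/30-day tiered policy. It is illustrative; in practice the same shape is usually expressed through AWS Backup plan rules or Iceberg snapshot-expiry settings rather than hand-rolled code.
# retention_policy.py - which snapshots survive a 24h / 7-day / 30-day tiered policy
from datetime import datetime, timedelta

def keep(snapshot_time, as_of):
    age = as_of - snapshot_time
    if age <= timedelta(hours=24):
        return True                                                    # keep every snapshot
    if age <= timedelta(days=7):
        return snapshot_time.minute == 0                               # keep the on-the-hour ones
    if age <= timedelta(days=30):
        return snapshot_time.hour == 0 and snapshot_time.minute == 0   # keep the midnight ones
    return False

as_of = datetime(2026, 4, 25, 11, 30)
samples = [
    datetime(2026, 4, 25, 11, 15),   # 15 minutes old -> keep (24h tier)
    datetime(2026, 4, 22, 9, 0),     # 3 days old, on the hour -> keep (hourly tier)
    datetime(2026, 4, 22, 9, 15),    # 3 days old, mid-hour -> expire
    datetime(2026, 4, 5, 0, 0),      # 20 days old, midnight -> keep (daily tier)
    datetime(2026, 3, 10, 0, 0),     # 46 days old -> expire
]
for t in samples:
    print(f"{t:%Y-%m-%d %H:%M} -> {'keep' if keep(t, as_of) else 'expire'}")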
DR drills — the synthetic disaster
Quarterly DR drills are the equivalent of fire drills. A senior engineer (not the on-call) creates a synthetic disaster in the staging environment — a cloned warehouse with a planted "deletion" or "corruption" — and the on-call team executes the runbook end-to-end with stopwatch-measured timing. The success criteria are not subjective: did the team meet the RTO? Did the restored data match the validation contract? Did the comms cadence hit the targets? Razorpay reportedly runs four DR drills a year, each ~3 hours, covering the four disaster classes described at the top of this chapter. PhonePe runs two drills per quarter, with a rotating "incident commander" role so every senior data engineer has run a real (albeit synthetic) recovery before a real one fires. The cost is real — ~10 person-days per drill across the team — and the value is also real: PhonePe's data team published an RTO of 23 minutes against a 30-minute SLA during a March 2024 cross-region outage, which they attribute directly to the drill cadence.
Beyond the warehouse — pipeline DR and feature-store DR
Disaster recovery for the warehouse is necessary but not sufficient. The pipelines that feed the warehouse have their own DR requirements: if Airflow goes down for 4 hours, the daily ETL doesn't run, and the warehouse is intact but stale. The pattern that works: Airflow/Dagster run on a managed service with cross-region failover, and the source-of-truth state (DAG runs, task histories) lives in a managed Postgres with point-in-time recovery. Feature stores have it harder still: the offline store is just a warehouse (handle as above), but the online store (Redis, Cassandra, DynamoDB) is what serves live model inference, and a 30-minute outage there is a 30-minute fraud-detection blindness for the entire company. Online-store DR is typically multi-region active-active replication with read-your-writes consistency, which costs more than warehouse DR by ~3× — but the RPO is genuinely zero, because the consumer is a model that cannot tolerate stale features. The companion chapters online-vs-offline-feature-stores and point-in-time-correctness-the-hard-part (both in Build 15) cover the feature-store-specific patterns.
The 30-year arc — DR in the age of compliance
The next decade of data DR is shaped by regulation, not technology. India's DPDP Act (2023, in force 2025) requires fiduciaries to demonstrate "reasonable security practices" including verified backup-and-recovery. The RBI's 2024 guidelines for fintechs require quarterly DR drills with documented evidence and RPO/RTO commitments to consumers. The EU's DORA (Digital Operational Resilience Act, in force 2025) requires ICT third-party providers to support customer DR testing — which is why Snowflake, Databricks, and Confluent now publish DR-test reference architectures. The pattern: what was a best practice in 2020 (have a runbook, run drills) is becoming a compliance floor in 2026. Teams that built the discipline early have a head start; teams that didn't are in scramble mode. The companion chapter running-a-database-in-production-the-parts-they-don-t-teach in the databases curriculum covers the OLTP version of this same arc.
Where this leads next
- /wiki/runbooks-the-ones-that-actually-work-at-3am — the runbook discipline this chapter assumes; if your team doesn't write runbooks, the DR plan exists only in one engineer's head.
- /wiki/on-call-for-data-alerts-that-matter — the alerting that detects the disaster; without good alerts, the 4-hour RTO becomes a 24-hour RTO because nobody noticed for the first 20 hours.
- /wiki/version-control-for-data-iceberg-time-travel — the table-format primitive that makes per-snapshot recovery possible without a separate backup pipeline.
- /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity — DR architecture is largely a cost-curve decision; the cross-region egress and snapshot-storage costs dominate the budget conversation.
The lesson every data team learns the hard way is that DR is not a project; it is a discipline. The architecture is straightforward — snapshots, replication, runbooks, drills. The hard part is committing to the RTO/RPO contract before the disaster, paying the costs the contract requires, and running the drills when nothing is on fire. The teams that survive their first big incident are the ones that did this work in the boring quarters before; the teams that don't are the ones that thought "the cloud handles it".
References
- Snowflake — Continuous Data Protection — the canonical doc on Time Travel and Fail-safe; the layered-recovery model this chapter builds on.
- Google — BigQuery Backups and Restore — the BigQuery-specific recovery primitives.
- AWS — Disaster Recovery of Workloads on AWS: Recovery in the Cloud — the four-tier RTO/RPO framework the industry uses (backup-restore, pilot-light, warm-standby, multi-site active-active).
- Apache Iceberg — Time Travel and Rollback — the Iceberg recovery primitives for lakehouse architectures.
- Reserve Bank of India — Master Direction on IT Governance, 2024 — the regulatory framework that forces quarterly DR drills for Indian fintechs.
- Google SRE Workbook — Managing Incidents — the comms-during-incident discipline this chapter borrows.
- /wiki/on-call-for-data-alerts-that-matter — the alerting prerequisite for detecting disasters quickly.
- /wiki/version-control-for-data-iceberg-time-travel — the Iceberg time-travel primitive that makes table-level recovery cheap.