Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Disaster recovery: RPO and RTO

It is 03:14 on a Sunday and Asha, on-call SRE at PaySetu, is staring at a Slack message from the network team: the entire ap-south-1 region is unreachable — not a node, not a rack, the whole region. The status page from the cloud provider has just turned red. PaySetu's primary database lives there. The DR runbook says "promote the eu-west-1 standby". Asha runs the promotion script. It takes 14 minutes to converge. When it finishes, customers can transact again — but the wallet-balance updates from the last 9 seconds before the outage are gone, because the asynchronous replica had not yet received them. Nine seconds of writes — 41,200 transactions — are reconciled the next morning by a manual job and a public apology. That 9 seconds is the RPO; that 14 minutes is the RTO. Both numbers were chosen at 4pm on a Tuesday, two years ago, in a meeting that took 25 minutes.

The honest framing: RPO and RTO are not about disasters. They are about the trade-offs you accept every single day — synchronous vs asynchronous replication, where the standby lives, how often you fail over, what your runbook looks like — to be ready when the disaster lands. A team that has never measured its actual RPO and RTO does not have a DR plan; it has a DR document.

RPO (Recovery Point Objective) is the maximum data loss you tolerate, measured in time — "we can lose at most 30 seconds of writes". RTO (Recovery Time Objective) is the maximum downtime you tolerate during recovery — "we can be unavailable at most 5 minutes". Tightening RPO requires synchronous or semi-synchronous cross-region replication (which costs latency on every write); tightening RTO requires automated fail-over (which risks split-brain). The two budgets pull against each other and against steady-state cost — pick the numbers honestly, then design backwards.

What the two numbers actually mean

RPO and RTO are timeline points around the moment of disaster, not vague aspirations. Draw a horizontal time axis. The disaster lands at T=0. RPO is the latest moment before T=0 whose data you are guaranteed to have after recovery. RTO is the earliest moment after T=0 at which your service is guaranteed to be back. RPO=30s means the last 30 seconds of writes might be lost. RTO=5min means you might be down for up to 5 minutes. Together they bound the unavailability window and the data-loss window — and the engineering of DR is the engineering of those two windows.

Figure: RPO and RTO on a disaster timeline. A horizontal time axis with the disaster at T=0; the RPO window sits before T=0 (data-loss tolerance, bounded by the last successful replication checkpoint) and the RTO window sits after T=0 (downtime tolerance, ending when recovery completes and service is restored). Illustrative — not measured data.
RPO is measured backwards from the disaster (how much data is lost between the last replication checkpoint and the disaster). RTO is measured forwards from the disaster (how long until the service is back). They are independent budgets — RPO=0 systems can still have RTO=30min; RTO=10s systems can still have RPO=5min.

The numbers are not symmetric in cost. Tightening RPO costs steady-state latency — every write must reach the standby before being acknowledged. Tightening RTO costs complexity at the moment of disaster — automated fail-over, leader election across regions, traffic-rerouting machinery. A bank that tolerates 30 minutes of downtime but cannot lose any rupee runs synchronous replication with a manual fail-over: RPO=0, RTO=30min. A streaming service that tolerates losing the last 60 seconds of viewing-progress events but cannot be down for more than 90 seconds runs asynchronous replication with automated fail-over: RPO=60s, RTO=90s. The shapes of the systems are entirely different even though both are "DR".

A real number to anchor on: at PaySetu, the wallet service runs synchronous replication to a same-region replica plus asynchronous replication to a cross-region standby. The intended RPO for a region-loss event is "≤5 seconds, p99 ≤15 seconds", because the cross-region async lag p99 is 12 seconds under load. The intended RTO is "≤10 minutes", because the runbook has a manual approval step before promotion. Both numbers were chosen by the CEO and the head of risk in a 30-minute meeting — engineering's job was to verify the architecture met them, not to choose them.

How replication topology determines RPO

RPO is set by the lag between the primary and the farthest replica that survives the disaster. Three replication regimes give three distinct RPO floors.

Synchronous replication (same region). The write does not return success to the client until at least one replica has durably acknowledged it. RPO=0 for a single-node failure within the region (some replica has the write). RPO=∞ for a region-loss event, because no out-of-region replica had the write. Latency cost: one intra-region round-trip, typically 1–4ms. PaySetu's default for inside-region durability.

Asynchronous replication (cross-region). The write returns success after the local region commits; the cross-region replica receives it later. RPO is determined by the replication lag at the moment of disaster. If the primary commits 8000 writes/sec and the async stream has a 1000-write buffer, RPO is roughly 125ms — but under saturation, when the consumer falls behind, the buffer can stretch to seconds or tens of seconds. The system's worst-case RPO is its lag p99.9, not its lag p50. Most teams measure p50 and quote it as RPO; the disaster lands at p99.9.

Semi-synchronous replication (cross-region quorum). The write does not return success until at least one cross-region replica has acknowledged it. RPO=0 even for region loss, but every write pays cross-region RTT — for India ↔ EU, that is 145ms minimum. For a wallet write at 8000/sec, the latency floor moves from 4ms to 149ms — a 37× increase, paid on every transaction. Worth it for systems where data loss is unacceptable; ruinous for systems where it is merely undesirable.

Figure: Three replication regimes, three RPO floors. Three side-by-side panels of client → primary → replica message flows: synchronous same-region (2-4ms write, blocks on the in-region replica's ack; RPO=0 in-region, total loss if the region dies), asynchronous cross-region (local ack, replicate later; RPO = current replication lag), and semi-synchronous cross-region quorum (~149ms write for India ↔ EU, blocks on the cross-region ack; RPO=0 even for region loss). Illustrative — not measured data.
C = client, P = primary, R = same-region replica, R' = cross-region replica. The dashed arrow in the middle panel is the asynchronous catch-up — not part of the client's critical path. The cross-region RTT in the right panel (149ms for India ↔ EU) is paid on every single write.
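
To make the trade-off concrete, here is a back-of-envelope sketch (not a benchmark) of the per-write latency floor and the region-loss RPO floor in each regime, using the article's illustrative numbers: a 4ms same-region write, a 145ms India ↔ EU round trip, 8000 writes/sec. The script and helper names are made up for the example.

# regime_floors.py
# Back-of-envelope latency and region-loss RPO floors for the three replication regimes.
# Illustrative numbers from the article, not a benchmark.

INTRA_WRITE_MS = 4      # sync same-region write (one intra-region round trip)
CROSS_RTT_MS = 145      # India <-> EU round trip
WRITE_RATE = 8000       # primary commit rate, writes/sec

def show(regime, write_latency_ms, region_loss_rpo):
    print(f"{regime:<26} write >= {write_latency_ms:>3}ms   region-loss RPO: {region_loss_rpo}")

# Sync, same region: nothing ever leaves the region, so a region loss loses everything.
show("sync (same region)", INTRA_WRITE_MS, "unbounded")

# Async, cross-region: the standby trails by the writes in flight (~1 RTT worth) plus any lag.
in_flight = int(WRITE_RATE * CROSS_RTT_MS / 1000)   # ~1160 writes at steady state
show("async (cross-region)", INTRA_WRITE_MS, f"~{CROSS_RTT_MS}ms (~{in_flight} writes), more under lag")

# Semi-sync, cross-region: the ack waits for the remote replica, so nothing committed can be lost.
show("semi-sync (cross-region)", INTRA_WRITE_MS + CROSS_RTT_MS, "0")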

Why: replication lag is not a single number — it is a distribution. The system's RPO is the p99.9 worst-case lag during the worst minute of the year, not the steady-state mean. A primary running at 60% of its WAL throughput has p50 lag of 8ms; the same primary at 95% has p99.9 lag of 14 seconds because the WAL fsync queue backs up faster than the network can drain it. RPO budgets that assume p50 lag are systematically optimistic by 2-3 orders of magnitude.
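
One way to see the gap: build a lag distribution with a calm body and a rare multi-second tail, then read off the percentiles. The samples below are synthetic and purely illustrative; the shape, not the exact values, is the point.

# lag_percentiles.py
# Why quoting p50 replication lag as "the RPO" is systematically optimistic.
# The lag samples are synthetic and purely illustrative.
import random

random.seed(1)
lag_ms = [random.lognormvariate(2.3, 0.3) for _ in range(100_000)]   # calm body, ~10ms median
lag_ms += [random.uniform(2_000, 15_000) for _ in range(200)]        # rare multi-second stalls (~0.2%)

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

for p in (50, 99, 99.9):
    print(f"p{p:<5} lag = {percentile(lag_ms, p):9.1f} ms")
# p50 reads like a comfortable RPO; p99.9 is the lag the disaster actually lands on.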

How fail-over mechanism determines RTO

RTO is the sum of three intervals: detection (how long until the system decides the region is dead), decision (how long until something or someone authorises promotion), and convergence (how long until the standby is serving traffic). Each interval has its own engineering trade-off.

Detection. Heartbeats arrive on a schedule; missing heartbeats accumulate suspicion until a threshold trips. Tight thresholds (3 missed heartbeats at 1s interval = 3s detection) catch outages quickly but flag false positives during transient packet loss. Loose thresholds (30 missed heartbeats = 30s detection) avoid false positives but cost RTO. Phi-accrual failure detectors (Hayashibara et al., 2004) parameterise this trade-off explicitly. PaySetu's cross-region detector uses phi=8 with 500ms heartbeats, giving p99 detection of 5.5 seconds.
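
A minimal sketch of the phi-accrual idea, assuming heartbeat gaps are roughly normally distributed (real detectors keep a sliding window and cope with heavier tails and clock jitter); the class name and jitter figures here are made up.

# phi_accrual.py
# Minimal phi-accrual failure detector sketch (after Hayashibara et al., 2004).
# Fits a normal distribution to recent heartbeat gaps; real detectors handle heavier tails.
import math
import random
from statistics import NormalDist

class PhiAccrualDetector:
    def __init__(self, window: int = 200):
        self.intervals = []            # recent inter-arrival times, ms
        self.last_heartbeat_ms = None
        self.window = window

    def heartbeat(self, now_ms: float):
        if self.last_heartbeat_ms is not None:
            self.intervals = (self.intervals + [now_ms - self.last_heartbeat_ms])[-self.window:]
        self.last_heartbeat_ms = now_ms

    def phi(self, now_ms: float) -> float:
        """Suspicion level: phi = -log10(P(the next heartbeat is merely late))."""
        if len(self.intervals) < 2:
            return 0.0
        dist = NormalDist.from_samples(self.intervals)
        p_later = 1.0 - dist.cdf(now_ms - self.last_heartbeat_ms)
        return -math.log10(max(p_later, 1e-15))   # clamp to avoid log(0)

random.seed(3)
detector = PhiAccrualDetector()
for t in range(0, 60_000, 500):                    # 500ms heartbeats with network jitter
    detector.heartbeat(t + random.uniform(-60, 60))
for silence_ms in (500, 1_500, 3_000, 5_500):
    now = detector.last_heartbeat_ms + silence_ms
    print(f"{silence_ms:>5}ms of silence -> phi = {detector.phi(now):5.1f}")
# A threshold like phi >= 8 trips only once the silence is far outside the learned distribution.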

Decision. The cheapest RTO comes from automated fail-over: when detection fires, the system promotes the standby without human approval. The slowest (and safest) path is manual approval: a human reads the alert, looks at the dashboards, decides the disaster is real, then approves. Manual approval adds 5-15 minutes to RTO but prevents the catastrophic failure mode where the detector flaps (region appears dead, then alive, then dead again) and the system promotes-demotes-promotes a standby, splitting writes across two primaries — split-brain, with reconciliation costs that dwarf the original outage. Banking systems almost always require manual approval for cross-region promotion; consumer streaming services typically do not.

Convergence. Once promotion is authorised, the actual machinery — drain the standby's catch-up backlog, transition it from replica to leader, update DNS / load-balancer to point clients at it, drain stale connections to the dead primary — takes 30 seconds to several minutes depending on caches and connection pools. The most common surprise here is DNS TTL: if your client DNS TTL is 300 seconds, RTO has a 5-minute floor regardless of how fast everything else is. Production teams usually set TTL to 30s or use service-mesh / smart-client routing that bypasses DNS.
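
Putting the three intervals together, with the client DNS TTL acting as a floor on convergence for DNS-routed clients. A small sketch with the article's illustrative figures; the function name and scenarios are hypothetical.

# rto_budget.py
# RTO as a sum of detection + decision + convergence, with client DNS TTL as a convergence floor.
# Numbers are the article's illustrative figures, not measurements.

def rto_seconds(detection_s, decision_s, convergence_s, client_dns_ttl_s=0):
    # Clients that cache a DNS record cannot move to the standby faster than the TTL,
    # no matter how quickly the standby itself is promoted.
    return detection_s + decision_s + max(convergence_s, client_dns_ttl_s)

print("auto failover, mesh/smart-client routing:", rto_seconds(5.5, 0, 35), "s")
print("auto failover, 300s client DNS TTL      :", rto_seconds(5.5, 0, 35, client_dns_ttl_s=300), "s")
print("manual approval (10 min)                :", rto_seconds(5.5, 600, 35), "s")

The middle line is the DNS-TTL floor in action: detection and promotion can be as fast as you like and the cached record still pins RTO near the TTL.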

A working RPO/RTO simulator

Here is a runnable simulation of a primary with a cross-region async replica, modelling replication lag, disaster timing, and the fail-over sequence. Save and run it.

# rpo_rto_simulator.py
# Two-region primary/standby with async replication.
# Inject a region-loss disaster at a chosen moment; report RPO and RTO.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    log: list = field(default_factory=list)   # list of (write_id, value)
    pending: deque = field(default_factory=deque)  # in-flight from primary

@dataclass
class Sim:
    primary: Replica
    standby: Replica
    write_rate_per_sec: int = 8000
    cross_region_rtt_ms: int = 145
    detection_phi8_ms: int = 5500
    manual_approval_ms: int = 0      # 0 for auto-failover; 600000 for manual
    promotion_convergence_ms: int = 35000

    def run_seconds(self, seconds: float, disaster_at_s: float):
        time_ms = 0
        next_write_id = 0
        end_ms = int(seconds * 1000)
        disaster_ms = int(disaster_at_s * 1000)
        disaster_fired = False

        # 1ms tick.
        while time_ms < end_ms:
            # Disaster: primary is gone, no further writes commit there.
            if not disaster_fired and time_ms >= disaster_ms:
                disaster_fired = True
                rpo_writes = len(self.primary.log) - len(self.standby.log)
                rpo_seconds = rpo_writes / self.write_rate_per_sec
                rto_ms = self.detection_phi8_ms + self.manual_approval_ms + self.promotion_convergence_ms
                return {
                    "writes_committed_on_primary": len(self.primary.log),
                    "writes_replicated_to_standby": len(self.standby.log),
                    "writes_lost": rpo_writes,
                    "rpo_seconds": round(rpo_seconds, 3),
                    "rto_seconds": round(rto_ms / 1000, 1),
                }

            # Each ms: roughly write_rate_per_sec/1000 writes commit on primary.
            n_writes_this_ms = int(self.write_rate_per_sec / 1000)
            for _ in range(n_writes_this_ms):
                self.primary.log.append((next_write_id, "v"))
                # Schedule replication arrival cross-region RTT later.
                self.standby.pending.append((time_ms + self.cross_region_rtt_ms, next_write_id))
                next_write_id += 1

            # Drain whatever has arrived at standby by now.
            while self.standby.pending and self.standby.pending[0][0] <= time_ms:
                _, wid = self.standby.pending.popleft()
                self.standby.log.append((wid, "v"))

            time_ms += 1
        return None

if __name__ == "__main__":
    for label, manual_ms in [("auto failover", 0),
                             ("manual approval (10 min)", 600000)]:
        s = Sim(primary=Replica("mumbai-p"), standby=Replica("frankfurt-s"),
                manual_approval_ms=manual_ms)
        result = s.run_seconds(seconds=120, disaster_at_s=60.0)
        print(f"\n--- {label} ---")
        for k, v in result.items():
            print(f"  {k:35} {v}")

Sample run:

--- auto failover ---
  writes_committed_on_primary         480000
  writes_replicated_to_standby        478840
  writes_lost                         1160
  rpo_seconds                         0.145
  rto_seconds                         40.5

--- manual approval (10 min) ---
  writes_committed_on_primary         480000
  writes_replicated_to_standby        478840
  writes_lost                         1160
  rpo_seconds                         0.145
  rto_seconds                         640.5

A walkthrough of the load-bearing logic:

  • self.standby.pending.append((time_ms + self.cross_region_rtt_ms, next_write_id)) — every write committed on the primary is scheduled to arrive at the standby exactly cross_region_rtt_ms later. The disaster steals the in-flight pending writes, which are the RPO loss. Why: in steady state, the standby's lag is approximately one cross-region RTT worth of writes (the ones in flight at any instant). At 8000 writes/sec and 145ms RTT, that is 8000 × 0.145 = 1160 writes — exactly the simulator's reported writes_lost. RPO is not magic; it is throughput × in-flight time.
  • rto_ms = detection_phi8_ms + manual_approval_ms + promotion_convergence_ms — RTO is a sum of three independent components. Tightening one (e.g. dropping detection from 5.5s to 1s) does little if the other two dominate. The single biggest RTO improvement in production is removing the manual-approval step if the consequences of split-brain are tolerable.
  • rpo_seconds = rpo_writes / write_rate_per_sec — RPO in time is RPO in writes divided by the throughput. A burst of writes immediately before the disaster widens the RPO; a quiet period narrows it. This is why DR drills are scheduled during peak hours — quiet-hour drills systematically underestimate RPO.
  • The two scenarios produce identical RPO — RPO and RTO are independent. The only difference between auto and manual failover here is RTO. The data loss is set by the replication lag at T=0; nothing the failover machinery does after T=0 can recover those lost writes. Why: data that did not leave the primary's region cannot be recovered after the region is gone. The failover sequence can only promote what the standby already has — it cannot retrieve what was in-flight or buffered locally on the dead primary's disks. RPO is set by replication topology, not by failover speed.

The actionable insight is the gap between the two RTO numbers (40.5s vs 640.5s) for identical RPO. Manual approval costs an order of magnitude in RTO. The choice is a question about which kind of mistake is worse — being down for 10 extra minutes during a real disaster, or accidentally promoting during a transient network blip. Banking and payments systems consistently choose the longer RTO; streaming and gaming systems consistently choose the shorter one.

Real DR runbooks: the parts that go wrong

A DR runbook reads beautifully on paper. In practice, three patterns of failure recur, and every team that has run a real DR drill (not a tabletop) has seen at least one.

The first is stale credentials. The standby's credentials to talk to upstream services were rotated 90 days ago in the primary region but never propagated to the standby. The runbook step "promote standby; redirect traffic" succeeds; the standby comes up; every upstream call fails with 401 Unauthorized. RTO blows past the budget because someone has to find the rotation script and run it. Fix: include credential validation in the standby's continuous health check, not just connectivity.

The second is DNS / load-balancer cache lag. The runbook updates the DNS A record from primary to standby with a 30-second TTL. Most clients pick up the new record in 30 seconds. Some clients — typically Java apps with DNS caching at the JVM level (with a security manager installed, networkaddress.cache.ttl historically defaulted to -1, caching successful lookups forever) — never expire the cached entry until the JVM restarts. RTO for those clients is "until someone restarts the affected service". Fix: use a service-mesh (Envoy / Linkerd) or smart-client (gRPC's name resolver) that does not rely on DNS TTLs.

The third is the standby was lying about being ready. The standby's health endpoint returns 200 OK. The replication-lag metric reports lag=120ms. Both are wrong. The lag metric was pegged to "ms since last received message", not "ms behind primary's commit position", and the standby had been stuck at the same offset for 4 hours because of a silent schema mismatch (a column was added on the primary, the replica's apply step was failing every minute, the apply-failure metric was not on any dashboard). The promotion succeeds; the standby is missing 4 hours of writes. Fix: the lag metric must be primary_commit_position - standby_apply_position, computed on the primary side and exposed there. Anything else can lie.
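
A tiny sketch of the difference, with hypothetical positions: the time-since-last-message metric stays flat on a stuck replica because messages keep arriving, while the position-based metric keeps growing.

# lag_metrics.py
# A lag metric that can lie vs one that cannot. Positions are hypothetical LSN-like offsets.

def naive_lag_ms(now_ms, last_message_received_ms):
    # Looks healthy as long as the stream keeps delivering, even if nothing is being applied.
    return now_ms - last_message_received_ms

def true_lag_bytes(primary_commit_position, standby_apply_position):
    # Computed on the primary, from the primary's own commit position.
    return primary_commit_position - standby_apply_position

# Stuck replica: messages keep arriving, so the naive metric stays tiny...
print("naive lag:", naive_lag_ms(now_ms=1_000_120, last_message_received_ms=1_000_000), "ms")
# ...while the apply position has not moved for hours, so the true metric keeps growing.
print("true lag :", true_lag_bytes(primary_commit_position=9_876_543_210,
                                   standby_apply_position=9_776_543_210), "bytes")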

CricStream had all three of these in a single botched DR drill in 2023. After the post-mortem, they instituted a policy: the DR drill counts as passing only if the standby was promoted without operator intervention beyond the runbook's documented steps. Drills that required ad-hoc commands didn't count as a pass even if the system eventually recovered. Pass rate went from 100% (theatrical) to 35% (real) in the first quarter. By Q4 it was back to 90%, but with real fixes underneath.

Common confusions

  • "RPO=0 is always achievable if I'm willing to pay." No. RPO=0 across regions requires either synchronous cross-region replication (which costs cross-region RTT on every write — possibly architecturally infeasible) or a quorum that spans regions for every commit (Spanner / Calvin style). For a primary committing 8000 writes/sec at p99 latency budget 50ms, neither is achievable across a 145ms RTT — the system is over-budget by construction.
  • "RTO=0 is achievable with active-active." Not exactly. Active-active means there is no failover step (both regions serve writes), so RTO can be near-zero for the availability dimension. But active-active introduces a different problem: cross-region write conflicts that need reconciliation. The system trades fail-over time for conflict-resolution complexity. The "RTO" is really still there, hidden inside conflict-resolution latency.
  • "Backups give me low RPO." No. Backups give you a recovery floor — the worst case if all replicas are gone. A nightly backup gives RPO ≤ 24 hours, not 24 seconds. Backups are protection against logical corruption (a buggy migration deletes the wrong table), not protection against region loss. For region-loss RPO, replication is the mechanism; backups are the long-tail safety net.
  • "My async replica's lag is 200ms p50, so my RPO is 200ms." No. RPO is the worst-case lag during the disaster. The disaster lands when the system is most stressed — at peak load, mid-deploy, during a network event. Lag at those moments is p99.9, not p50. Quote RPO as the 99.9th percentile of the lag distribution observed during the worst week of the last year, not the comfortable Tuesday-afternoon median.
  • "Synchronous replication eliminates RPO and is therefore free." Synchronous replication eliminates RPO at the cost of availability. If the standby is unreachable, the primary's writes block — synchronous replication makes the primary as available as the standby (or less). This is the AP↔CP trade-off from CAP, applied to DR. Many teams discover this when a sync replica's network blips and the primary stops accepting writes for the duration of the blip.

Going deeper

RPO budgets and the WAL fsync trade-off

The replication lag is not just a network number — it is also a function of how fast the standby can apply writes. If the standby's WAL fsync rate is lower than the primary's commit rate (because the standby's disks are slower, or it is also serving read traffic), the standby falls behind monotonically. PostgreSQL's pg_stat_replication.write_lag, flush_lag, and replay_lag measure three different stages of this; the RPO-relevant one is flush_lag (data has been written and fsynced to standby's WAL). Production teams alert on flush_lag exceeding the RPO budget for 60 seconds — a sustained breach of the budget is the leading indicator of a DR-budget violation, separate from any actual outage.
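
A sketch of that alert, assuming a psycopg2 connection and a monitoring role on the primary; the DSN, query cadence, and thresholds are placeholders, not a recommended configuration.

# flush_lag_alert.py
# Poll pg_stat_replication.flush_lag on the primary and flag a sustained RPO-budget breach.
# A sketch assuming psycopg2 and a monitoring role; DSN and thresholds are placeholders.
import time
import psycopg2

RPO_BUDGET_S = 5.0       # the budget from the DR document
SUSTAINED_FOR_S = 60     # alert on a sustained breach, not a blip

QUERY = """
    SELECT application_name, COALESCE(EXTRACT(EPOCH FROM flush_lag), 0) AS flush_lag_s
    FROM pg_stat_replication
"""

def watch(dsn: str):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True           # monitoring reads only; avoid a long-lived transaction
    breach_started = None
    while True:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            worst = max((float(lag) for _, lag in cur.fetchall()), default=0.0)
        if worst > RPO_BUDGET_S:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started >= SUSTAINED_FOR_S:
                print(f"ALERT: flush_lag {worst:.1f}s over the {RPO_BUDGET_S}s RPO budget "
                      f"for at least {SUSTAINED_FOR_S}s")
        else:
            breach_started = None
        time.sleep(5)

# watch("host=primary.internal dbname=postgres user=monitor")   # hypothetical DSN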

Disaster-recovery levels and the AWS framework

AWS publishes four DR strategies (also see Wood et al., "The Case for Cloud-Hosted DR"): backup-and-restore (RPO hours, RTO 24h), pilot-light (RPO minutes, RTO 1h, minimal standby running), warm-standby (RPO seconds, RTO 10min, scaled-down standby), multi-site active-active (RPO=0, RTO=0, full standby). The cost climbs roughly an order of magnitude per step. The choice is not technical — it is the business saying "this is what an hour of downtime costs us, and this is what we are willing to pay every month to avoid it". Engineering's job is to translate the business answer into the corresponding architecture and tell the truth about whether the architecture meets the budget.

The promotion / fencing problem

When the standby is promoted, the dead primary must not come back to life thinking it is still the primary. If the network heals after promotion and the primary rejoins, you have two primaries — split-brain with active writes on both. The mechanism to prevent this is fencing: the promoted standby's lease/term/epoch is incremented; the old primary, when it returns, sees the higher epoch and self-demotes (or is killed by STONITH — Shoot The Other Node In The Head). Production systems use a combination of generation numbers, lease tokens, and external arbitration (Zookeeper / etcd / Consul) to ensure exactly-one primary at any wall-clock instant. See leader election and leases for the deeper mechanism.
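
A minimal sketch of the generation-number mechanism; the Arbiter below stands in for Zookeeper / etcd / Consul (a real deployment would use their compare-and-swap or lease APIs, and downstream storage would check the fencing token too).

# fencing_epoch.py
# Generation-number fencing: a promoted standby bumps the epoch; a returning old primary
# sees the higher epoch and self-demotes. The Arbiter stands in for Zookeeper/etcd/Consul.

class Arbiter:
    def __init__(self, leader: str):
        self.epoch = 1
        self.leader = leader

    def promote(self, node: str) -> int:
        self.epoch += 1                 # new term; anything holding an older epoch is fenced
        self.leader = node
        return self.epoch

class Node:
    def __init__(self, name: str, epoch: int):
        self.name, self.epoch = name, epoch

    def accept_write(self, arbiter: Arbiter, payload: str) -> bool:
        if arbiter.epoch > self.epoch or arbiter.leader != self.name:
            print(f"{self.name}: fenced (epoch {self.epoch} < {arbiter.epoch}), self-demoting")
            return False
        print(f"{self.name}: committed {payload!r} at epoch {self.epoch}")
        return True

arbiter = Arbiter(leader="mumbai-p")
old_primary = Node("mumbai-p", epoch=1)
standby = Node("frankfurt-s", epoch=1)

old_primary.accept_write(arbiter, "txn-1")        # normal operation
standby.epoch = arbiter.promote("frankfurt-s")    # region loss: promotion bumps the epoch
standby.accept_write(arbiter, "txn-2")            # new primary serves writes
old_primary.accept_write(arbiter, "txn-3")        # network heals: old primary is fenced

Run it and the old primary refuses txn-3 once it sees the higher epoch; in production the refusal is a self-demotion, after which the node rejoins as a replica.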

DR drills: chaos engineering's older sibling

The only way to know your real RPO and RTO is to run the disaster. Tabletop exercises ("what would you do if eu-west-1 vanished") catch perhaps a third of the issues. Game-day drills where the team actually fails over a staging environment catch most of the rest. Production-region fail-over drills (Netflix's Chaos Kong, Cloudflare's regular regional fail-overs) catch the residual. Most teams never reach the third level because the perceived risk is too high; the teams that do reach it discover the runbook diverged from reality 6 months ago and never noticed. The discipline is similar to chaos engineering — induce the failure on a schedule so the surprise is not its first occurrence.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 rpo_rto_simulator.py
# Try: cross_region_rtt_ms=10 (short replication path; RPO drops toward zero)
# Try: write_rate_per_sec=200000 (high load — RPO grows linearly)
# Try: detection_phi8_ms=30000 (loose detector — RTO climbs)

Where this leads next

DR is the failure-mode placement; geo-partitioning was the steady-state placement. The next questions are how the system enforces exactly-one primary across regions during a fail-over (leader election and leases), how it reconciles divergent writes if both regions briefly thought they were the primary (split-brain and reconciliation), and how the broader practice of breaking the system on purpose to build confidence in its recovery becomes a discipline (chaos engineering principles).

The thread to hold: RPO and RTO are budgets, not aspirations. Pick the numbers from the business cost of downtime and data loss. Design the architecture backwards from those numbers. Measure the actual RPO and RTO continuously — not annually, not in tabletop drills, but on every replication-lag dashboard and every fail-over rehearsal. The DR plan that exists only on the wiki is the DR plan that will fail at 03:14 on a Sunday.

References

  • Wood, T. et al. (2010). The Case for Cloud-Hosted Disaster Recovery. HotCloud '10. Foundational treatment of cloud-era DR levels and the cost-vs-RPO/RTO trade-off.
  • Hayashibara, N. et al. (2004). The φ Accrual Failure Detector. SRDS '04. The detection-side mechanism behind tight RTO budgets.
  • Corbett, J. et al. (2012). Spanner: Google's Globally-Distributed Database. OSDI '12. The reference for cross-region synchronous commit (TrueTime + Paxos).
  • AWS. Disaster recovery (DR) architecture on AWS, Part I. Whitepaper. The four-strategy framework (backup, pilot light, warm standby, multi-site).
  • DeCandia, G. et al. (2007). Dynamo: Amazon's Highly Available Key-Value Store. SOSP '07. AP-side perspective — RPO via vector-clock reconciliation rather than synchronous commit.
  • Reserve Bank of India (2018). Storage of Payment System Data. The regulatory context that constrains where DR replicas may live.
  • PostgreSQL documentation. Streaming replication and pg_stat_replication. The canonical example of measuring flush_lag as the RPO-budget indicator.
  • Internal: geo-partitioned data, follower reads and bounded staleness, leader election and leases.