In short
Failover is the procedure by which a replica takes over when the primary dies. Done right, your application reconnects in tens of seconds and business continues. Done wrong, you get split-brain: the old primary — not actually dead, just network-partitioned — comes back online, does not know it has been demoted, and keeps accepting writes from stale clients while the new primary serves its own writes. Two divergent histories. Merging them is painful at best, impossible at worst.
The five-step failover sequence: detect (miss N heartbeats), elect (highest-LSN replica wins), promote (apply pending WAL, become writable), rewire (replicas re-point, application reconnects), recover (re-sync the former primary). Typical end-to-end: 10-60 seconds. The hazard is in step 3 — before letting the new primary write, you must be certain the old one cannot.
Three fencing strategies, in rough order of safety and cost. STONITH ("shoot the other node in the head") uses out-of-band power control — IPMI, cloud termination APIs — to definitively kill the old primary before promoting. Lease tokens give the primary a time-bounded write lease from a consensus store (etcd, ZooKeeper); without renewal, it refuses writes — the lease expires on its own, no external kill needed. Quorum-based consensus makes fencing structural: writes require majority ack; the minority side (containing the old primary) cannot form a quorum, so its writes never commit.
Postgres's built-in streaming replication is not enough on its own. You need an orchestrator — Patroni, repmgr, Stolon — plugged into a consensus store, or a quorum-native system like CockroachDB or TiDB where fencing is part of the commit protocol.
It is 3 AM. Your Postgres primary is serving a hundred payments per second when its NIC suffers 100% packet loss — a top-of-rack switch ate a cosmic ray, or the hypervisor live-migrated at the wrong moment. The health checker in the other AZ stops seeing heartbeats. After three missed beats it declares the primary dead. The orchestrator picks the highest-LSN replica and runs pg_ctl promote. Within fifteen seconds, DNS flips, pools reconnect, writes resume.
Two minutes later the network returns. The old primary's NIC is healthy; from its perspective nothing happened. A handful of long-lived client connections — whose DNS cache has not flipped — submit writes. The old primary accepts them. Those writes are only on the old primary; the new primary has its own two minutes. You now own two databases pretending to be one.
Recovery is a long afternoon of manual reconciliation: stop both nodes, dump the old primary's WAL, diff the tablespaces, decide per conflicting row which version wins, and pray none involved money. This is split-brain — what happens when failover is not fenced.
The failover sequence, step by step
Every failover — manual, scripted, or orchestrated — runs the same five phases. Budget for each.
Step 1 — Detect. A health checker polls on a fixed interval (commonly 1s) and declares failure after N consecutive misses (commonly 3). N too low triggers false-positive failovers from routine hiccups; N too high extends RTO. Production defaults: 3-5 seconds.
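The counting logic above is simple enough to sketch. A minimal, hypothetical health-checker loop, assuming a `ping()` callable that returns True when the primary answers a heartbeat (real checkers add timeouts and jitter):

```python
import time

def detect_failure(ping, interval_s=1.0, max_misses=3):
    """Poll every `interval_s`; declare the primary dead after
    `max_misses` consecutive missed heartbeats. Any single success
    resets the counter, which is what protects against routine hiccups.
    """
    misses = 0
    while True:
        if ping():
            misses = 0                    # one good beat resets the count
        else:
            misses += 1
            if misses >= max_misses:
                return False              # declared dead: hand off to election
        time.sleep(interval_s)
```

Note how a transient blip (miss, hit, miss) never trips the detector; only N consecutive misses do.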
Step 2 — Elect. Pick which replica becomes the new primary. The standard rule is "highest LSN" — whichever replica has replayed the most WAL loses the least on promotion. In a quorum-native system, the same rule is enforced by Raft's log-completeness check (see semi-sync and quorum acks): only a candidate whose log is at least as up-to-date as a majority can win.
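The "highest LSN wins" rule reduces to comparing LSNs numerically. Postgres prints LSNs as two hex halves (`3A/1F201A80`); a sketch of the comparison, with hypothetical helper names:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN like '3A/1F201A80' into a comparable
    integer: high 32 bits before the slash, low 32 bits after."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def elect(replicas: dict) -> str:
    """Pick the replica that has replayed the most WAL -- promoting it
    discards the least acknowledged work."""
    return max(replicas, key=lambda name: lsn_to_int(replicas[name]))
```

Using the LSNs from the worked example later in the chapter, `elect({"B": "3A/1F201A80", "C": "3A/1F2015B0"})` picks B.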
Step 3 — Promote. The winner finishes applying any pending WAL and transitions from read-only standby to read-write primary. In Postgres, pg_ctl promote writes a trigger file, the startup process exits recovery, a new timeline begins, and walsender slots open. State transition completes in 1-2 seconds.
Step 4 — Rewire. Other replicas re-point primary_conninfo and resume streaming on the new timeline. Connection poolers (PgBouncer, ProxySQL, HAProxy) switch upstream. DNS flips — then you wait out TTLs. Application pools detect broken sockets and reconnect. Usually the dominant cost: 5-30 seconds, depending on DNS TTL and pool configuration.
Step 5 — Recover. The former primary, when it returns, must not think it is still primary. It re-syncs via pg_rewind against the new timeline (or full pg_basebackup if divergence is deep) and rejoins as a replica. Not time-critical for availability but time-critical for redundancy.
Why "highest LSN" is the election rule: promoting a less-caught-up replica would discard WAL records that a more-caught-up replica already has. If the old primary's WAL reached some replica, that replica's LSN is the strongest evidence you have about what was committed. Picking the replica behind it forces you to roll back the more-caught-up one's extra records, which in a zero-downtime world means losing acknowledged writes.
End-to-end, a well-tuned failover completes in 10-60 seconds. Long DNS TTLs, a pooler without automatic failover, or unscripted manual promotion stretch it to 10-30 minutes.
What split-brain is, and why it's terrible
Split-brain is the state where two nodes both believe they are primary and both accept writes. It is the worst thing that can happen to a replicated database because writes on each side are committed from the application's perspective — both sets of clients got COMMIT OK. The system has lied to one set, but which, and in what order, is only knowable by post-hoc forensics.
The specific damages, in rough order of painfulness: data divergence (different rows on each side, foreign keys pointing nowhere); lost writes (whichever side you discard is gone, confirmation emails now contradict the database); duplicate IDs (both sides allocated from the same sequence or UUID generator, merging requires renaming); invariant violations (constraints true on each side independently — "balance ≥ 0", "one active session per user" — false on the merged result).
A 30-second split-brain at 1,000 TPS is 30,000 writes per side; even a 1% overlap is 300 rows of reconciliation. A team that has not drilled the procedure takes a day to untangle it.
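The blast-radius arithmetic is worth keeping as a one-liner you can rerun for your own TPS numbers; a trivial sketch:

```python
def splitbrain_blast_radius(duration_s: int, tps: int, overlap_frac: float):
    """Back-of-envelope: writes accepted per side during the split, and
    how many rows conflict at a given overlap fraction."""
    writes_per_side = duration_s * tps
    conflicting_rows = int(writes_per_side * overlap_frac)
    return writes_per_side, conflicting_rows

# 30 s split at 1,000 TPS with 1% overlap -> (30000, 300)
```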
Split-brain is not rare-but-catastrophic the way regional loss is. It is common-if-you-are-sloppy: every failover without fencing is a roll of the dice.
Why split-brain happens
The orchestrator's inference that the primary is dead is based on absence of signal — missed heartbeats — which is indistinguishable from a network partition isolating the primary from the orchestrator.
Replay: primary runs normally; network between primary and the world partitions (primary is alive, just cannot send or receive on that segment); orchestrator misses 3 heartbeats, declares death, promotes replica; on the primary's side of the partition some clients remain connected and keep writing, and the primary — unaware of demotion — serves them; partition heals and the old primary has no mechanism to notice it has been superseded. Two primaries, until something external intervenes.
The only asymmetry you can exploit: the orchestrator knows there should be only one primary. If it can force the old primary to stop accepting writes before promoting the new one, split-brain cannot occur. That is fencing.
Fencing strategy 1 — STONITH (shoot the other node)
STONITH stands for "shoot the other node in the head". Before promoting a new primary, the orchestrator definitively kills the old one via an out-of-band mechanism. The out-of-band part is essential: telling the old primary to shut itself down over the normal network would require reaching it, which is precisely the thing the partition prevents. STONITH uses a separate channel:
- IPMI / iLO / DRAC. Baseboard management controllers accept power-off commands over a dedicated management NIC. The BMC is an embedded computer on the motherboard, powered separately from the main CPU; it can power-cycle the host even when the OS is hung.
- Cloud instance termination APIs. AWS TerminateInstances, GCP instances.stop — the hypervisor kills the VM whether the guest OS likes it or not.
- Hypervisor controls. virsh destroy, vSphere power-off for on-prem virtualisation.
- Shared storage fencing. SCSI-3 persistent reservations — the new primary registers an exclusive reservation; the storage layer rejects the old primary's writes.
- Switch port disable. The orchestrator SSHes into the top-of-rack switch and kills the old primary's port.
Pacemaker's stonithd daemon is the standard implementation. You configure it with credentials for your chosen mechanism, and it runs the kill command before every promotion.
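The critical property is the ordering: the kill must be confirmed before the promotion runs. A hypothetical sketch of that discipline, with `fence_old_primary` standing in for whatever out-of-band mechanism you use (IPMI power-off, a cloud termination API, a switch-port disable) and `promote_replica` standing in for `pg_ctl promote`:

```python
class FencingError(RuntimeError):
    """Raised when the old primary could not be confirmed dead."""

def fenced_promote(fence_old_primary, promote_replica):
    """STONITH ordering: confirm the kill, *then* promote.
    `fence_old_primary` must return True only on a confirmed kill."""
    if not fence_old_primary():
        # Never promote on an unconfirmed kill. An unreachable BMC or
        # stale IPMI credentials here is exactly how split-brain happens.
        raise FencingError("old primary not confirmed dead; aborting promotion")
    promote_replica()
```

Aborting on an unconfirmed fence is deliberate: a failed promotion is an availability problem, but an unfenced promotion is a correctness problem.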
The upside: STONITH done before promotion makes split-brain structurally impossible. The old primary is not "asked" to stop — it is made to stop. Clients still connected get a dead socket and reconnect elsewhere.
The downsides: requires out-of-band access (cloud APIs usually work; on-prem with incorrect IPMI credentials cannot fence — test at setup and on every config change); destructive on false-positive detection, turning a network hiccup into a 30-60 second outage; and slow if the mechanism is slow (IPMI 5-15 seconds, some cloud APIs 30+), adding to RTO.
STONITH is the go-to for traditional on-prem Postgres and MySQL with Pacemaker/Corosync. It is less common in cloud-native stacks, which prefer lease-based fencing.
Fencing strategy 2 — lease tokens
Lease-based fencing flips the model: instead of the orchestrator reaching out to kill, the old primary fails to renew its own authority and voluntarily stops accepting writes.
The mechanism is a write lease: a time-bounded grant of "primary" role, issued by a highly-available consensus store (etcd, ZooKeeper, Consul), held by exactly one node at a time. The primary renews every 2 seconds with, say, a 10-second TTL. If it cannot renew, the lease expires and the primary must refuse writes.
The cycle: A acquires the "primary" lease (TTL 10s). A writes normally, renewing every 2s. A's network to etcd fails — renewals stop; A's local expiry timer counts down. After 10 seconds etcd's TTL fires; the key releases; B wins the race. Meanwhile, A has been watching its own expiry timer. Before 10 seconds elapse since its last successful renewal, A must voluntarily demote itself.
Why A must demote before the TTL fires, not after: if A waits until after, there is a window where B has taken the lease and A has not yet noticed. During that window both are primary. The discipline: your local stop-writing timer fires before any other node's "expired" timer. A demotes at 8 seconds while etcd's TTL is 10 — a 2-second safety margin.
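The renew-or-demote discipline can be sketched as a loop. This is a hypothetical simplification, not Patroni's implementation: `renew` attempts a consensus-store renewal and returns True on success, `demote` makes the local database read-only, and the self-demotion fires at TTL minus a safety margin after the last successful renewal:

```python
import time

def lease_loop(renew, demote, ttl_s=10.0, renew_every_s=2.0, margin_s=2.0):
    """Primary-side lease discipline: renew every 2 s against a 10 s TTL,
    and self-demote at ttl - margin (8 s) after the last successful
    renewal -- strictly before any observer can see the lease as expired."""
    last_renewed = time.monotonic()
    while True:
        if renew():
            last_renewed = time.monotonic()
        elapsed = time.monotonic() - last_renewed
        if elapsed >= ttl_s - margin_s:
            demote()      # stop writing while the lease still looks
            return        # alive to every other node in the cluster
        time.sleep(renew_every_s)
```

The invariant lives in the `ttl_s - margin_s` comparison: the local stop-writing timer must always fire before any remote expiry timer.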
Postgres with Patroni implements this. Each Patroni instance does three things in a loop: if leader, renew the lease, and demote before the TTL if renewal fails (Postgres has no single demote command, so Patroni restarts the server as a read-only standby); if follower, watch the leader key and race to claim it on expiry, running pg_ctl promote on win; always, publish the local replication LSN so elections pick the freshest follower.
The beauty of the lease approach: no out-of-band access is required. The old primary fences itself by observing its own disconnection from etcd. The partition that caused the suspected death is the same partition that makes the lease lapse.
The downsides: a dependency on etcd being highly available (run 3 or 5 voters, never 1); sensitivity to clock skew (the primary's self-demotion must fire before any observer's expiry, typically enforced with etcd's own lease IDs rather than wall clocks); and software cooperation — a bug that left the old primary writable despite demotion would break the invariant, though in practice Patroni restarts Postgres as a standby, which rejects writes outright.
Patroni is the de-facto standard for production Postgres HA. Stolon, repmgr, and Zalando's Spilo are alternatives with the same core pattern.
Fencing strategy 3 — quorum / consensus
The third strategy does not prevent the old primary from trying to accept writes. It prevents those writes from ever becoming durable.
In a Raft- or Paxos-based system (see semi-sync and quorum acks), every committed write requires acknowledgement from a majority of replicas.
Replay split-brain here: three replicas A (leader), B, C; quorum = 2. The network partitions A from B and C. B and C run a Raft election; B wins with term N+1. A can still accept client writes locally and tries to replicate them, but its AppendEntries RPCs to B and C fail. It collects 1 ack — its own. Not a majority. The writes never commit. A holds them pending and returns errors. Meanwhile B commits with acks from B and C. Partition heals. A sees the higher term, reverts to follower, truncates its uncommitted pending writes, and re-syncs from B.
The minority side cannot make progress. It may accept connections and try to serve writes, but without quorum nothing commits. When the partition heals, the minority learns about the new term and cleanly defers.
Quorum-based fencing is self-enforcing. No STONITH, no lease TTLs. The impossibility of forming a minority quorum is the fence. The subtlety: A's client gets an error (timeout or explicit rejection) — not a silent lie. Raft returns failure when quorum cannot be reached; the client retries. No "phantom commit" on the minority side.
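The commit rule itself fits in a few lines. A hypothetical sketch, where `replicate` stands in for an AppendEntries-style replication round and returns the set of nodes that acknowledged (the leader always acks its own log append):

```python
def commit_with_quorum(replicate, cluster_size: int) -> bool:
    """A write commits only with acks from a strict majority of the
    cluster; anything less returns failure to the client."""
    quorum = cluster_size // 2 + 1
    acks = replicate()
    return len(acks) >= quorum
```

Replaying the partition above with three nodes: the isolated old leader A collects only its own ack, `{"A"}`, and never commits; the majority side commits with `{"B", "C"}`. The fence is nothing more than this arithmetic.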
Systems using quorum fencing natively: etcd, CockroachDB, TiDB/TiKV, YugabyteDB, Spanner, Kafka with KRaft. The cost is quorum round-trip latency per write. You do not bolt quorum fencing onto Postgres; you migrate.
Why Postgres streaming replication alone is unsafe
Postgres's built-in streaming replication is a transport — it ships WAL from primary to replicas. It does not, by itself, do automatic failover (no daemon watches the primary; pg_ctl promote is manual), fencing (nothing prevents the old primary from continuing to write after promotion), or leader election (no opinion on which replica should win).
You need an orchestrator. Patroni (most common) uses etcd/Consul/ZooKeeper for leader-lease and elections and pairs with PgBouncer or HAProxy for application rewire. repmgr is simpler but harder to make fully safe. Stolon puts every Postgres behind a proxy. pg_auto_failover uses a separate monitor node. Cloud-managed (RDS, Cloud SQL, Azure Postgres) runs some variant internally.
Without an orchestrator you can still fail over manually — SSH to a replica, pg_ctl promote, update config — but the procedure is slow (minutes), error-prone at 3 AM, and unfenced. Every serious Postgres deployment has one.
The "witness node" pattern
Two-node deployments bite you at the quorum layer. With N=2, a majority is 2. If either node fails, the survivor is alone and cannot form a quorum. Three-node deployments are the production minimum — but three full Postgres copies are expensive, each needing disk equal to the data size.
The witness node pattern: add a third node that participates in quorum for coordination decisions (who is leader, whose lease is valid) without storing the database itself. A common Patroni-over-etcd topology is Postgres + etcd voter on nodes A and B, and etcd voter only on node C (no Postgres). Three etcd voters — quorum 2 of 3 — but only two Postgres nodes share data-disk cost. If A fails, B and C still form a quorum and B becomes primary; if C fails, A and B still have a quorum.
This is how small-footprint HA is done in practice — and the reason you should never deploy a 2-node cluster without reasoning explicitly about the third quorum member.
The RPO/RTO vocabulary — state them precisely
Two acronyms dominate failover planning; write specific numbers into your design doc before touching a config file.
RPO — Recovery Point Objective. How much data are you willing to lose in a failover? Measured in time or transactions. For async replication, RPO = current replica lag; for sync replication, RPO = 0; for semi-sync in its fallback window, RPO has a small non-zero ceiling.
RTO — Recovery Time Objective. How long can the service be unavailable? Includes detection + election + promotion + rewire. Orchestrated automatic failover: 10-60 seconds. Manual: minutes to hours.
These numbers come from the product owner, not the database team. "Can we lose one second of payments?" is a business question. You design the replication topology to meet the stated RPO and RTO — not the other way around.
A common mistake is confusing RTO with detection time. A 10-second detection interval with a 20-second promotion sequence gives 30-second RTO, not 10. Always measure end-to-end, client-visible.
Orchestrated failover with Patroni + etcd
Three-AZ Postgres in ap-south-1: Patroni-managed instances on nodes A, B, C, backed by a 3-voter etcd cluster (one voter per node). A is primary; B, C are replicas streaming from A.
t=0. A's network to etcd is disrupted by a routing-table corruption event. Application traffic to A is also affected.
t=0 to t=8. A's Patroni tries to renew its leader lease. Renewals fail. At t=8s — two seconds before the 10s TTL — A's Patroni demotes the node, restarting Postgres as a read-only standby.
t=10. etcd's TTL fires; the leader key is released. B's and C's Patroni instances race.
t=10.5. B has replayed to LSN 3A/1F201A80; C is at 3A/1F2015B0. B is ahead and writes itself into the leader key.
t=11. B's Patroni runs pg_ctl promote. Postgres exits recovery, begins a new timeline, opens for writes.
t=11 to t=15. C re-points its primary_conninfo to B and streams from the new timeline. HAProxy, polling Patroni's REST API, switches upstream to B. Application pools detect broken sockets and reconnect via HAProxy.
t=60. Network resolves. A rediscovers etcd, sees B is leader, runs pg_rewind against B, and rejoins as a replica.
End-to-end RTO: ~15 seconds. RPO: with synchronous_standby_names = 'ANY 1 (B, C)', zero; with default async, A's replication lag at t=0 (typically 10-50 ms).
The split-brain that almost happened: between t=8 (A demoted) and t=60 (A rejoined), A was not accepting writes. The critical invariant — A stops writing before B starts — is enforced by Patroni's 2-second safety margin. Even if some client's DNS cache held A's address for 5 minutes, those clients got "read-only mode" errors from Postgres, not silent success.
Common confusions
- "Failover is rare so I don't need to plan." Wrong. Disk failures, region outages, security-patch reboots, OS upgrades, and major-version upgrades all force failover — deliberately, several times a year at every reasonably-run organisation. If failover is unreliable, routine maintenance becomes nail-biting.
- "The orchestrator handles everything." It handles detection, election, and promotion. It does not, by default, handle fencing — that is a separate configuration item (STONITH plugin, lease TTL, quorum topology). The failover sequence works even without fencing, right up until the day the old primary returns.
- "If I use async replication, I have nothing to lose." Even async setups can split-brain. The old primary has writes that never made it to any replica plus new writes after the partition heals — two contradictory histories, both needing reconciliation. Async means "accept data loss on primary death", not "automatic safety during failover".
- "Quorum systems always prevent split-brain." They prevent it for writes. The minority side can still serve stale reads while thinking it is leader. To get linearisable reads, the client must confirm leadership via a quorum round-trip. Consensus guarantees write correctness; read correctness is a separate opt-in.
- "A witness node is enough." A witness helps coordination quorum survive single-node failure. It does not hold data, so it does not help with RPO. You can have a healthy quorum-elected new primary and still lose data if that primary's WAL was behind. For zero RPO you need sync replication to a data-holding replica, witness or not.
- "Fencing and authentication are the same thing." No. Authentication prevents unauthorised clients from writing. Fencing prevents an authorised primary from writing when it has been demoted. Orthogonal; you need both.
Going deeper
Chubby — the original lease-based lock service
Google's Chubby (Burrows, OSDI 2006) is the foundational reference for lease-based coordination. Chubby is a lock service used as a leader-election primitive by every other Google system: clients hold a "master" lease; if it lapses, the client must cease to act as master. Every consensus store since — ZooKeeper, etcd, Consul — descends from it. The paper is especially clear on failure modes: client-server clock skew, the "session" abstraction for silent clients, and the requirement that Chubby itself run as a majority-fault-tolerant ensemble.
Fencing race conditions and fencing tokens
Even well-implemented lease fencing has a theoretical race: if the old primary's clock runs fast, its lease could look alive locally but already expired elsewhere — a brief dual-primary window. Mitigations: fencing tokens — every lease acquisition increments a monotonic counter; clients pass the token on each write; the data layer rejects outdated tokens (Kleppmann's canonical generalisation of leases). Plus strict clock-sync bounds or explicit uncertainty intervals (Spanner's TrueTime), and generous self-demotion margins. Real-world Patroni rarely hits the race since margins are seconds and clock skew is milliseconds; for multi-region with larger drift, fencing tokens are the right answer.
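Kleppmann's fencing-token scheme is easiest to see from the data layer's side. A minimal hypothetical sketch: each lease acquisition hands out a strictly increasing token, clients attach it to every write, and the store rejects anything older than the highest token it has seen:

```python
class FencedStore:
    """Data layer that rejects writes carrying a stale fencing token.
    Not a production implementation -- just the rejection rule."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.highest_token:
            return False              # old primary, lease already superseded
        self.highest_token = token
        self.data[key] = value
        return True
```

A late write from the old primary (token 1) arriving after the new primary (token 2) has written is rejected regardless of clocks, which is the point: the check is on the monotonic counter, not on wall time.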
Patroni internals — the etcd key structure
Patroni's etcd keys live under /service/<cluster-name>/: initialize, config, leader (with TTL — the lease), members/<node> (role, LSN, API URL), optime/leader, failover, sync. Every iteration reads the full state, decides an action, writes it. The loop runs every loop_wait seconds (default 10). The ttl parameter (default 30) must be strictly greater than loop_wait + retry_timeout — misconfiguring this is the most common Patroni footgun.
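That timing invariant is cheap to assert in deployment tooling. A sketch following the rule as stated above (Patroni's own validation is richer; the function name is hypothetical):

```python
def check_patroni_timing(ttl: float, loop_wait: float, retry_timeout: float) -> bool:
    """The safety invariant from the text: ttl must be strictly greater
    than loop_wait + retry_timeout, or a healthy leader can lose its
    lease between loop iterations."""
    return ttl > loop_wait + retry_timeout
```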
Where this leads next
Failover and fencing are the cluster-level consequences of the replication choices in ch.67-69. Build 9 continues with:
- PITR and the backup picture — chapter 72. Failover handles node loss but cannot recover from corruption or operator mistakes (a DROP TABLE that replicates instantly to every replica). PITR from WAL archives and base backups is the second pillar of durability.
- Application-level consistency — chapter 73. Even a perfect failover does not guarantee the application does the right thing. Stale connection-pool upstreams, idempotency keys that survive reconnection, retry semantics on CommitStalled — the application-layer invariants complement the database machinery.
Failover without fencing is not failover — it is a coin flip on whether your users get a consistent database next morning. Pick a strategy (STONITH, lease, or quorum), understand its failure modes, test it before you need it, and write the runbook for the part requiring human intervention.
References
- Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, OSDI 2006 — foundational paper on lease-based distributed coordination; conceptual ancestor of etcd, ZooKeeper, and every consensus-store orchestrator since.
- Patroni documentation, Patroni: A Template for PostgreSQL HA — canonical reference for production Postgres failover, including the etcd key schema and the loop_wait / retry_timeout / ttl interactions that govern fencing safety.
- ClusterLabs, Pacemaker Fencing Documentation — reference for STONITH-based fencing in the Linux HA stack, with configurations for IPMI, cloud APIs, and SCSI reservation fencing.
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 — the Raft paper. Leader election and log-completeness are the mechanical basis for quorum-based fencing.
- Kingsbury, Jepsen analyses — ongoing series documenting real split-brain bugs in Postgres/Patroni, early etcd, and numerous NewSQL databases. The fastest way to internalise what fencing failure looks like in practice.
- Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapters 5 and 8 — the clearest book-length treatment of failover, split-brain, and fencing tokens as the safe generalisation of leases.