In short

Failover is the procedure by which a replica takes over when the primary dies. Done right, your application reconnects in tens of seconds and business continues. Done wrong, you get split-brain: the old primary — not actually dead, just network-partitioned — comes back online, does not know it has been demoted, and keeps accepting writes from stale clients while the new primary serves its own writes. Two divergent histories. Merging them is painful at best, impossible at worst.

The five-step failover sequence: detect (miss N heartbeats), elect (highest-LSN replica wins), promote (apply pending WAL, become writable), rewire (replicas re-point, application reconnects), recover (re-sync the former primary). Typical end-to-end: 10-60 seconds. The hazard is in step 3 — before letting the new primary write, you must be certain the old one cannot.

Three fencing strategies, in rough order of safety and cost. STONITH ("shoot the other node in the head") uses out-of-band power control — IPMI, cloud termination APIs — to definitively kill the old primary before promoting. Lease tokens give the primary a time-bounded write lease from a consensus store (etcd, ZooKeeper); without renewal, it refuses writes — the lease expires on its own, no external kill needed. Quorum-based consensus makes fencing structural: writes require majority ack; the minority side (containing the old primary) cannot form a quorum, so its writes never commit.

Postgres's built-in streaming replication is not enough on its own. You need an orchestrator — Patroni, repmgr, Stolon — plugged into a consensus store, or a quorum-native system like CockroachDB or TiDB where fencing is part of the commit protocol.

It is 3 AM. Your Postgres primary is serving a hundred payments per second when its NIC suffers 100% packet loss — a top-of-rack switch ate a cosmic ray, or the hypervisor live-migrated at the wrong moment. The health checker in the other AZ stops seeing heartbeats. After three missed beats it declares the primary dead. The orchestrator picks the highest-LSN replica and runs pg_ctl promote. Within fifteen seconds, DNS flips, pools reconnect, writes resume.

Two minutes later the network returns. The old primary's NIC is healthy; from its perspective nothing happened. A handful of long-lived client connections — whose DNS cache has not flipped — submit writes. The old primary accepts them. Those writes exist only on the old primary; the new primary has meanwhile accumulated its own two minutes of writes. You now own two databases pretending to be one.

Recovery is a long afternoon of manual reconciliation: stop both nodes, dump the old primary's WAL, diff the tablespaces, decide per conflicting row which version wins, and pray none involved money. This is split-brain — what happens when failover is not fenced.

The failover sequence, step by step

Every failover — manual, scripted, or orchestrated — runs the same five phases. Budget for each.

Step 1 — Detect. A health checker polls on a fixed interval (commonly 1s) and declares failure after N consecutive misses (commonly 3). N too low triggers false-positive failovers from routine hiccups; N too high extends RTO. Production defaults: 3-5 seconds.
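The detect step reduces to a counter that any successful heartbeat resets. A minimal sketch, not modeled on any particular health checker:

```python
class FailureDetector:
    """Declare the peer dead after n_misses consecutive missed heartbeats."""

    def __init__(self, interval_s: float = 1.0, n_misses: int = 3):
        self.interval_s = interval_s  # how often we expect a beat
        self.n_misses = n_misses      # consecutive misses before declaring death
        self.misses = 0

    def on_heartbeat(self) -> None:
        # Any successful beat resets the counter: only consecutive misses count.
        self.misses = 0

    def on_tick(self) -> bool:
        """Called once per interval when no heartbeat arrived. True means 'dead'."""
        self.misses += 1
        return self.misses >= self.n_misses

d = FailureDetector(interval_s=1.0, n_misses=3)
assert d.on_tick() is False   # 1 miss: could be a routine hiccup
assert d.on_tick() is False   # 2 misses
d.on_heartbeat()              # a late beat arrives; counter resets
assert d.on_tick() is False
assert d.on_tick() is False
assert d.on_tick() is True    # 3 consecutive misses: declared dead (~3s worst case)
```

The interval and threshold are the two knobs from the text: lowering N speeds detection but invites false-positive failovers.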

Step 2 — Elect. Pick which replica becomes the new primary. The standard rule is "highest LSN" — whichever replica has replayed the most WAL loses the least on promotion. In a quorum-native system, the same rule is enforced by Raft's log-completeness check (see semi-sync and quorum acks): only a candidate whose log is at least as up-to-date as a majority can win.
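The election rule becomes an integer comparison once LSNs are decoded. A sketch assuming the standard Postgres textual LSN format (high and low 32-bit halves in hex, separated by a slash):

```python
def lsn_to_int(lsn: str) -> int:
    """Decode a Postgres LSN like '3A/1F201A80' into a comparable integer.
    The textual form is <high 32 bits>/<low 32 bits>, both hexadecimal."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def elect(candidates: dict) -> str:
    """Return the replica that has replayed the most WAL (highest LSN)."""
    return max(candidates, key=lambda node: lsn_to_int(candidates[node]))

# The replica positions from the Patroni walkthrough in this chapter:
winner = elect({"B": "3A/1F201A80", "C": "3A/1F2015B0"})
assert winner == "B"   # 0x1F201A80 > 0x1F2015B0, so B loses the least WAL
```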

Step 3 — Promote. The winner finishes applying any pending WAL and transitions from read-only standby to read-write primary. In Postgres, pg_ctl promote writes a trigger file, the startup process exits recovery, a new timeline begins, and walsender slots open. State transition completes in 1-2 seconds.

Step 4 — Rewire. Other replicas re-point primary_conninfo and resume streaming on the new timeline. Connection poolers (PgBouncer, ProxySQL, HAProxy) switch upstream. DNS flips — then you wait out TTLs. Application pools detect broken sockets and reconnect. Usually the dominant cost: 5-30 seconds, depending on DNS TTL and pool configuration.

Step 5 — Recover. The former primary, when it returns, must not think it is still primary. It re-syncs via pg_rewind against the new timeline (or full pg_basebackup if divergence is deep) and rejoins as a replica. Not time-critical for availability but time-critical for redundancy.

Why "highest LSN" is the election rule: promoting a less-caught-up replica would discard WAL records that a more-caught-up replica already has. If the old primary's WAL reached some replica, that replica's LSN is the strongest evidence you have about what was committed. Picking the replica behind it forces you to roll back the more-caught-up one's extra records, which in a zero-downtime world means losing acknowledged writes.

End-to-end, a well-tuned failover completes in 10-60 seconds. Long DNS TTLs, a pooler without automatic failover, or unscripted manual promotion stretch it to 10-30 minutes.

What split-brain is, and why it's terrible

Split-brain is the state where two nodes both believe they are primary and both accept writes. It is the worst thing that can happen to a replicated database because writes on each side are committed from the application's perspective — both sets of clients got COMMIT OK. The system has lied to one set, but which, and in what order, is only knowable by post-hoc forensics.

The specific damages, in rough order of painfulness: data divergence (different rows on each side, foreign keys pointing nowhere); lost writes (whichever side you discard is gone, confirmation emails now contradict the database); duplicate IDs (both sides allocated from the same sequence or UUID generator, merging requires renaming); invariant violations (constraints true on each side independently — "balance ≥ 0", "one active session per user" — false on the merged result).

A 30-second split-brain at 1,000 TPS is 30,000 writes per side; even a 1% overlap is 300 rows of reconciliation. A team that has not drilled the procedure takes a day to untangle it.

Split-brain is not rare-but-catastrophic the way regional loss is. It is common-if-you-are-sloppy: every failover without fencing is a roll of the dice.

Why split-brain happens

The orchestrator's inference that the primary is dead is based on absence of signal — missed heartbeats — which is indistinguishable from a network partition isolating the primary from the orchestrator.

Replay: primary runs normally; network between primary and the world partitions (primary is alive, just cannot send or receive on that segment); orchestrator misses 3 heartbeats, declares death, promotes replica; on the primary's side of the partition some clients remain connected and keep writing, and the primary — unaware of demotion — serves them; partition heals and the old primary has no mechanism to notice it has been superseded. Two primaries, until something external intervenes.
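The divergence itself takes only a few lines to demonstrate: two nodes, both writable, each serving a different set of clients.

```python
class Node:
    """A database reduced to its write history."""
    def __init__(self):
        self.log = []
    def write(self, txn: str) -> None:
        self.log.append(txn)

old, new = Node(), Node()

# Partition: the orchestrator promotes `new`; `old` is alive but unreachable.
new.write("T3")    # fresh clients, routed by updated DNS
new.write("T4")
old.write("T3'")   # stale clients with cached DNS, still connected to `old`
old.write("T4'")

# Partition heals: two divergent histories grown from the same starting point.
assert old.log != new.log
assert old.log == ["T3'", "T4'"] and new.log == ["T3", "T4"]
```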

The only asymmetry you can exploit: the orchestrator knows there should be only one primary. If it can force the old primary to stop accepting writes before promoting the new one, split-brain cannot occur. That is fencing.

Fencing strategy 1 — STONITH (shoot the other node)

STONITH stands for "shoot the other node in the head". Before promoting a new primary, the orchestrator definitively kills the old one via an out-of-band mechanism. The out-of-band part is essential: telling the old primary to shut itself down over the normal network would require reaching it, which is precisely the thing the partition prevents. STONITH uses a separate channel: IPMI over a dedicated management network, a cloud instance-termination API, or similar out-of-band control.

Pacemaker's stonithd daemon is the standard implementation. You configure it with credentials for your chosen mechanism, and it runs the kill command before every promotion.

[Figure: STONITH-based fencing sequence. Orchestrator (health checker), old primary (partitioned), replica (healthy); t=0 detect, IPMI power-off, BMC reports "off", then pg_ctl promote.]
STONITH sequence. The orchestrator detects that the old primary is unreachable, issues an IPMI power-off command over a separate management network, waits for the baseboard controller to confirm the host is powered down, and only then promotes the replica. Even if the old primary's network partition heals later, the machine is powered off — it cannot serve writes until manually rebooted and verified as a replica.

The upside: STONITH done before promotion makes split-brain structurally impossible. The old primary is not "asked" to stop — it is made to stop. Clients still connected get a dead socket and reconnect elsewhere.

The downsides: requires out-of-band access (cloud APIs usually work; on-prem with incorrect IPMI credentials cannot fence — test at setup and on every config change); destructive on false-positive detection, turning a network hiccup into a 30-60 second outage; and slow if the mechanism is slow (IPMI 5-15 seconds, some cloud APIs 30+), adding to RTO.
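The essential ordering — confirm the kill, then promote, and abort if confirmation fails — can be sketched with hypothetical stand-ins for the BMC and the replica (no real IPMI or cloud API is called here):

```python
class FakeBMC:
    """Stand-in for an out-of-band controller (IPMI or a cloud termination API)."""
    def __init__(self, reachable: bool):
        self.reachable = reachable
        self.powered_off = []
    def power_off(self, node: str) -> None:
        if self.reachable:
            self.powered_off.append(node)
    def wait_powered_off(self, node: str, timeout_s: int) -> bool:
        return node in self.powered_off

class FakeReplica:
    def __init__(self):
        self.is_primary = False
    def promote(self) -> None:
        self.is_primary = True

def stonith_failover(old_primary: str, replica: FakeReplica, bmc: FakeBMC) -> None:
    """Fence-then-promote: promotion is gated on a confirmed power-off."""
    bmc.power_off(old_primary)
    if not bmc.wait_powered_off(old_primary, timeout_s=30):
        # Fail safe: with fencing unconfirmed, promoting could cause split-brain.
        raise RuntimeError("fencing unconfirmed; promotion aborted")
    replica.promote()

# Happy path: fence succeeds, then promote.
replica = FakeReplica()
stonith_failover("node-a", replica, FakeBMC(reachable=True))
assert replica.is_primary

# Bad IPMI credentials or an unreachable BMC: promotion must NOT happen.
replica2 = FakeReplica()
try:
    stonith_failover("node-a", replica2, FakeBMC(reachable=False))
except RuntimeError:
    pass
assert not replica2.is_primary
```

The second branch is the "test at setup and on every config change" advice in executable form: an unfenceable node must block promotion, not be waved through.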

STONITH is the go-to for traditional on-prem Postgres and MySQL with Pacemaker/Corosync. It is less common in cloud-native stacks, which prefer lease-based fencing.

Fencing strategy 2 — lease tokens

Lease-based fencing flips the model: instead of the orchestrator reaching out to kill, the old primary fails to renew its own authority and voluntarily stops accepting writes.

The mechanism is a write lease: a time-bounded grant of "primary" role, issued by a highly-available consensus store (etcd, ZooKeeper, Consul), held by exactly one node at a time. The primary renews every 2 seconds with, say, a 10-second TTL. If it cannot renew, the lease expires and the primary must refuse writes.

The cycle: A acquires the "primary" lease (TTL 10s). A writes normally, renewing every 2s. A's network to etcd fails — renewals stop; A's local expiry timer counts down. After 10 seconds etcd's TTL fires and the key is released; a replica, call it B, wins the race to claim it. Meanwhile, A has been watching its own expiry timer. Before 10 seconds elapse since its last successful renewal, A must voluntarily demote itself.

Why A must demote before the TTL fires, not after: if A waits until after, there is a window where B has taken the lease and A has not yet noticed. During that window both are primary. The discipline: your local stop-writing timer fires before any other node's "expired" timer. A demotes at 8 seconds while etcd's TTL is 10 — a 2-second safety margin.
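The self-demotion discipline in timer form, as a sketch driven by an explicit clock argument, using the 10-second TTL and 2-second margin from the example above:

```python
LEASE_TTL_S = 10.0       # etcd-side expiry, as in the example above
SAFETY_MARGIN_S = 2.0    # self-demote this long before the TTL can fire

class LeaseHolder:
    """The primary's local view of its own write authority."""

    def __init__(self):
        self.last_renewal = 0.0
        self.writable = True

    def on_renew_ok(self, now: float) -> None:
        self.last_renewal = now

    def check(self, now: float) -> None:
        # Stop writing BEFORE any observer could consider the lease expired.
        if now - self.last_renewal >= LEASE_TTL_S - SAFETY_MARGIN_S:
            self.writable = False   # the database-level demotion happens here

p = LeaseHolder()
p.on_renew_ok(0.0)    # last successful renewal at t=0; the partition starts here
p.check(7.9)
assert p.writable          # still inside the safe window
p.check(8.0)
assert not p.writable      # t=8s: self-demoted, 2s before etcd's t=10s expiry
```

Real implementations lean on the consensus store's own lease machinery rather than wall clocks; the explicit now argument here just keeps the ordering visible.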

Postgres with Patroni implements this. Each Patroni instance does three things in a loop: if leader, renew the lease and, should renewal fail, demote Postgres (restart it as a read-only standby; there is no demote subcommand in pg_ctl) before the TTL fires; if follower, watch the leader key and race to claim it on expiry, running pg_ctl promote on a win; always, publish the local replication LSN so elections pick the freshest follower.

[Figure: lease-based fencing via etcd. Old primary (Patroni + Postgres, partitioned from etcd), etcd cluster (leader key with TTL=10s), replica (Patroni + Postgres, watching the leader key). Timeline: t=0 partition starts; t=8s old primary self-demotes; t=10s TTL fires and etcd expires the key; t=11s replica acquires the key and promotes.]
Lease-based fencing via etcd. The old primary loses its connection to etcd and cannot renew its lease. Critically, the old primary's local Patroni process demotes it at t=8s — two seconds before the lease actually expires in etcd at t=10s. This safety margin guarantees that no two nodes ever believe they hold the lease simultaneously, even with clock skew. The replica picks up the expired key at t=11s and promotes.

The beauty of the lease approach: no out-of-band access is required. The old primary fences itself by observing its own disconnection from etcd. The partition that caused the suspected death is the same partition that makes the lease lapse.

The downsides: a dependency on etcd being highly available (run 3 or 5 voters, never 1); sensitivity to clock skew (the primary's self-demotion must fire before any observer's expiry, typically enforced with etcd's own lease IDs rather than wall clocks); and software cooperation — a bug that left Postgres accepting writes after demotion would break the invariant, though in practice Patroni demotes by restarting Postgres as a read-only standby, which refuses writes at the server level.

Patroni is the de-facto standard for production Postgres HA. Stolon, repmgr, and Zalando's Spilo are alternatives with the same core pattern.

Fencing strategy 3 — quorum / consensus

The third strategy does not prevent the old primary from trying to accept writes. It prevents those writes from ever becoming durable.

In a Raft- or Paxos-based system (see semi-sync and quorum acks), every committed write requires acknowledgement from a majority of replicas.

Replay split-brain here: three replicas A (leader), B, C; quorum = 2. The network partitions A from B and C. B and C run a Raft election; B wins with term N+1. A can still accept client writes locally and tries to replicate them, but its AppendEntries RPCs to B and C fail. It collects 1 ack — its own. Not a majority. The writes never commit. A holds them pending and returns errors. Meanwhile B commits with acks from B and C. Partition heals. A sees the higher term, reverts to follower, truncates its uncommitted pending writes, and re-syncs from B.
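The structural fence is nothing more than the majority test applied at commit time:

```python
def commits(acks: int, cluster_size: int) -> bool:
    """A write commits only with acknowledgements from a strict majority."""
    return acks > cluster_size // 2

# Three replicas, partitioned as {A} vs {B, C}:
assert not commits(acks=1, cluster_size=3)  # A alone: its own ack is no majority
assert commits(acks=2, cluster_size=3)      # B + C: majority, writes commit

# The same test shows why a 2-node cluster cannot tolerate any failure:
assert not commits(acks=1, cluster_size=2)
```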

The minority side cannot make progress. It may accept connections and try to serve writes, but without quorum nothing commits. When the partition heals, the minority learns about the new term and cleanly defers.

Quorum-based fencing is self-enforcing. No STONITH, no lease TTLs. The impossibility of forming a minority quorum is the fence. The subtlety: A's client gets an error (timeout or explicit rejection) — not a silent lie. Raft returns failure when quorum cannot be reached; the client retries. No "phantom commit" on the minority side.

Systems using quorum fencing natively: etcd, CockroachDB, TiDB/TiKV, YugabyteDB, Spanner, Kafka with KRaft. The cost is quorum round-trip latency per write. You do not bolt quorum fencing onto Postgres; you migrate.

Why Postgres streaming replication alone is unsafe

Postgres's built-in streaming replication is a transport — it ships WAL from primary to replicas. It does not, by itself, do automatic failover (no daemon watches the primary; pg_ctl promote is manual), fencing (nothing prevents the old primary from continuing to write after promotion), or leader election (no opinion on which replica should win).

You need an orchestrator. Patroni (most common) uses etcd/Consul/ZooKeeper for leader-lease and elections and pairs with PgBouncer or HAProxy for application rewire. repmgr is simpler but harder to make fully safe. Stolon puts every Postgres behind a proxy. pg_auto_failover uses a separate monitor node. Cloud-managed (RDS, Cloud SQL, Azure Postgres) runs some variant internally.

Without an orchestrator you can still fail over manually — SSH to a replica, run pg_ctl promote, update config — but the procedure is slow (minutes), error-prone at 3 AM, and unfenced. Every serious Postgres deployment runs an orchestrator.

The "witness node" pattern

Two-node deployments bite you at the quorum layer. With N=2, a majority is 2. If either node fails, the survivor is alone and cannot form a quorum. Three-node deployments are the production minimum — but three full Postgres copies are expensive, each needing disk equal to the data size.

The witness node pattern: add a third node that participates in quorum for coordination decisions (who is leader, whose lease is valid) without storing the database itself. A common Patroni-over-etcd topology is Postgres + etcd voter on nodes A and B, and etcd voter only on node C (no Postgres). Three etcd voters — quorum 2 of 3 — but only two Postgres nodes share data-disk cost. If A fails, B and C still form a quorum and B becomes primary; if C fails, A and B still have a quorum.
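The arithmetic behind the witness pattern, as a sketch: three voters, two of which carry data.

```python
def has_quorum(alive: int, total: int) -> bool:
    return alive > total // 2

# Witness topology: A and B run Postgres + an etcd voter; C is a voter only.
voters = {"A", "B", "C"}
postgres_nodes = {"A", "B"}

def survives(failed: str) -> bool:
    """After one node fails, can the survivors still field a Postgres primary?"""
    alive = voters - {failed}
    return has_quorum(len(alive), len(voters)) and bool(postgres_nodes & alive)

assert survives("A")         # B + C form a quorum; B becomes primary
assert survives("C")         # losing the witness costs nothing
assert not has_quorum(1, 2)  # contrast: a 2-voter cluster cannot lose either node
```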

This is how small-footprint HA is done in practice — and the reason you should never deploy a 2-node cluster without reasoning explicitly about the third quorum member.

The RPO/RTO vocabulary — state them precisely

Two acronyms dominate failover planning; write specific numbers into your design doc before touching a config file.

RPO — Recovery Point Objective. How much data are you willing to lose in a failover? Measured in time or transactions. For async replication, RPO = current replica lag; for sync replication, RPO = 0; for semi-sync in its fallback window, RPO has a small non-zero ceiling.

RTO — Recovery Time Objective. How long can the service be unavailable? Includes detection + election + promotion + rewire. Orchestrated automatic failover: 10-60 seconds. Manual: minutes to hours.

These numbers come from the product owner, not the database team. "Can we lose one second of payments?" is a business question. You design the replication topology to meet the stated RPO and RTO — not the other way around.

A common mistake is confusing RTO with detection time. A 10-second detection interval on top of a 20-second promotion-and-rewire sequence gives a 30-second RTO, not 10. Always measure end-to-end, client-visible.
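A back-of-envelope RTO budget makes the point; the phase durations here are illustrative, not prescriptive:

```python
# End-to-end RTO is the sum of every phase, not just detection.
phases_s = {
    "detect":  3 * 1.0,  # 3 missed heartbeats at a 1-second interval
    "elect":   0.5,
    "promote": 2.0,
    "rewire":  20.0,     # DNS TTL plus pool reconnects; usually the dominant cost
}
rto_s = sum(phases_s.values())
assert rto_s == 25.5     # a "3-second detector" still yields a ~25s visible RTO
```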

Orchestrated failover with Patroni + etcd

Three-AZ Postgres in ap-south-1: Patroni-managed instances on nodes A, B, C, backed by a 3-voter etcd cluster (one voter per node). A is primary; B, C are replicas streaming from A.

[Figure: Patroni failover orchestration, post-failover state. Node A: Postgres demoted, etcd voter, partitioned from etcd. Node B: Postgres promoted, Patroni holds the lease, etcd voter, highest LSN at election. Node C: Postgres streams from B, Patroni follower, etcd voter, participated in quorum. The etcd leader key points to node B; HAProxy and the application reconnect to node B.]
Post-failover state under Patroni. Node A (old primary) has demoted itself because its Patroni instance could not renew the etcd leader lease. Nodes B and C held quorum; B won the election because its LSN was highest, promoted itself, and wrote the leader key pointing to itself. HAProxy discovers the leader-change event from Patroni's REST API and routes application connections to node B.

t=0. A's network to etcd is disrupted by a routing-table corruption event. Application traffic to A is also affected.

t=0 to t=8. A's Patroni tries to renew its leader lease. Renewals fail. At t=8s — two seconds before the 10s TTL — A's Patroni demotes Postgres, restarting it as a read-only standby.

t=10. etcd's TTL fires; the leader key is released. B's and C's Patroni instances race.

t=10.5. B has replayed to LSN 3A/1F201A80; C is at 3A/1F2015B0. B is ahead and writes itself into the leader key.

t=11. B's Patroni runs pg_ctl promote. Postgres exits recovery, begins a new timeline, opens for writes.

t=11 to t=15. C re-points its primary_conninfo to B and streams from the new timeline. HAProxy, polling Patroni's REST API, switches upstream to B. Application pools detect broken sockets and reconnect via HAProxy.

t=60. Network resolves. A rediscovers etcd, sees B is leader, runs pg_rewind against B, and rejoins as a replica.

End-to-end RTO: ~15 seconds. RPO: with synchronous_standby_names = 'ANY 1 (B, C)', zero; with default async, A's replication lag at t=0 (typically 10-50 ms).

The split-brain that almost happened: between t=8 (A demoted) and t=60 (A rejoined), A was not accepting writes. The critical invariant — A stops writing before B starts — is enforced by Patroni's 2-second safety margin. Even if some client's DNS cache held A's address for 5 minutes, those clients got "read-only mode" errors from Postgres, not silent success.

Common confusions

Going deeper

Chubby — the original lease-based lock service

Google's Chubby (Burrows, OSDI 2006) is the foundational reference for lease-based coordination. Chubby is a lock service used as a leader-election primitive by every other Google system: clients hold a "master" lease; if it lapses, the client must cease to act as master. Every consensus store since — ZooKeeper, etcd, Consul — descends from it. The paper is especially clear on failure modes: client-server clock skew, the "session" abstraction for silent clients, and the requirement that Chubby itself run as a majority-fault-tolerant ensemble.

Fencing race conditions and fencing tokens

Even well-implemented lease fencing has a theoretical race: if the old primary's clock runs fast, its lease could look alive locally but already expired elsewhere — a brief dual-primary window. Mitigations: fencing tokens — every lease acquisition increments a monotonic counter; clients pass the token on each write; the data layer rejects outdated tokens (Kleppmann's canonical generalisation of leases). Plus strict clock-sync bounds or explicit uncertainty intervals (Spanner's TrueTime), and generous self-demotion margins. Real-world Patroni rarely hits the race since margins are seconds and clock skew is milliseconds; for multi-region with larger drift, fencing tokens are the right answer.
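A fencing-token check is a few lines at the data layer. This is a sketch of Kleppmann's pattern, not any particular store's API:

```python
class FencedStore:
    """A data layer that rejects writes carrying a stale fencing token.
    Tokens are monotonically increasing lease-generation numbers."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False          # issued to a superseded lease-holder: reject
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(token=1, key="x", value="old-primary")  # lease generation 1
assert store.write(token=2, key="x", value="new-primary")  # after failover: gen 2
# The old primary, its fast clock still trusting generation 1, tries to write:
assert not store.write(token=1, key="x", value="zombie")
assert store.data["x"] == "new-primary"
```

The fence lives where the data lives, so even a node with a wrong clock cannot sneak a write past it.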

Patroni internals — the etcd key structure

Patroni's etcd keys live under /service/<cluster-name>/: initialize, config, leader (with TTL — the lease), members/<node> (role, LSN, API URL), optime/leader, failover, sync. Every iteration reads the full state, decides an action, writes it. The loop runs every loop_wait seconds (default 10). The ttl parameter (default 30) must be strictly greater than loop_wait + retry_timeout — misconfiguring this is the most common Patroni footgun.
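The timing invariant is worth encoding as a startup check. A sketch using the rule as stated here (ttl strictly greater than loop_wait plus retry_timeout), with retry_timeout assumed at a common default of 10 seconds:

```python
def validate_patroni_timing(ttl: int, loop_wait: int, retry_timeout: int) -> None:
    """The leader lease must outlive one full failed HA-loop iteration,
    or a healthy primary can lose its own lease and demote spuriously."""
    if ttl <= loop_wait + retry_timeout:
        raise ValueError(
            f"ttl ({ttl}) must exceed loop_wait + retry_timeout "
            f"({loop_wait} + {retry_timeout})"
        )

validate_patroni_timing(ttl=30, loop_wait=10, retry_timeout=10)  # defaults: OK

try:
    validate_patroni_timing(ttl=20, loop_wait=10, retry_timeout=10)
    raise AssertionError("footgun not caught")
except ValueError:
    pass  # the lease could lapse during a single slow loop iteration
```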

Where this leads next

Failover and fencing are the cluster-level consequences of the replication choices in ch.67-69. Build 9 continues with:

Failover without fencing is not failover — it is a coin flip on whether your users get a consistent database next morning. Pick a strategy (STONITH, lease, or quorum), understand its failure modes, test it before you need it, and write the runbook for the part requiring human intervention.

References

  1. Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, OSDI 2006 — foundational paper on lease-based distributed coordination; conceptual ancestor of etcd, ZooKeeper, and every consensus-store orchestrator since.
  2. Patroni documentation, Patroni: A Template for PostgreSQL HA — canonical reference for production Postgres failover, including the etcd key schema and the loop_wait/retry_timeout/ttl interactions that govern fencing safety.
  3. ClusterLabs, Pacemaker Fencing Documentation — reference for STONITH-based fencing in the Linux HA stack, with configurations for IPMI, cloud APIs, and SCSI reservation fencing.
  4. Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 — the Raft paper. Leader election and log-completeness are the mechanical basis for quorum-based fencing.
  5. Kingsbury, Jepsen analyses — ongoing series documenting real split-brain bugs in Postgres/Patroni, early etcd, and numerous NewSQL databases. The fastest way to internalise what fencing failure looks like in practice.
  6. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapters 5 and 8 — the clearest book-length treatment of failover, split-brain, and fencing tokens as the safe generalisation of leases.