Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Split-brain and fencing
It is 03:14 on a Wednesday at PaySetu. Riya, the on-call SRE, is staring at two dashboards. The first one says payments-leader = node-7 and has been saying that for nineteen days. The second one — the one being read by half the API gateway pods — says payments-leader = node-12. Both nodes are accepting writes. Both are committing them. Both think they are the only leader. The settlement engine has just processed a ₹4.2 lakh refund twice, and the reconciliation team is about to wake up. The thing that broke is not a node; it is the agreement that one node holds the lock. This chapter is about why two leaders are the natural outcome when a partition is followed by a slow recovery, and about the surprisingly small piece of state — a monotonically increasing integer — that lets the storage layer reject the wrong leader's writes without ever needing to know which leader is "real".
Split-brain is the failure where two nodes both believe they are leader and both accept writes; it is what will happen if you only have heartbeats and timeouts, because a healthy node and a partitioned-away leader look identical from anywhere else. The fix is not "better detection" — it is fencing: every leader write carries a monotonically increasing token, and the storage layer rejects any write whose token is lower than the highest one it has seen. Detection sorts itself out eventually; fencing makes the data layer correct immediately.
Why split-brain is the default outcome, not the bug
A leader in a distributed system holds something — a lock, a lease, the right to commit writes to a log. The cluster decides who the leader is by election, and the elected leader holds the role until it crashes, voluntarily steps down, or fails to renew its lease. So far this is clean. The dirt enters when you ask: how does the cluster know the leader has crashed?
The honest answer is: it does not. The cluster knows that the leader has stopped responding. A node that has crashed and a node that is paused mid-stack-trace by a 30-second GC and a node that is on the wrong side of a network partition all look identical from the perspective of every other node. The cluster's failure detector — heartbeats, phi-accrual, SWIM — produces a verdict: "this node is not responding". The verdict is then promoted to "this node has failed", and the election logic picks a new leader.
Now imagine the original leader was not dead. It was on the wrong side of a transient partition, or paused by a long GC. From its own perspective, nothing has changed. It still holds the lease. It still believes it is leader. It still accepts writes from the clients it can still reach (perhaps through a private link, perhaps from inside the same AZ). Meanwhile the cluster has elected a new leader, the new leader is accepting writes from a different set of clients, and the storage layer is being written by both. This is split-brain. The system has two leaders simultaneously, each correctly executing the leader's protocol, each unaware of the other.
The trap most engineers fall into is thinking the cure is "better failure detection". It is not. The FLP impossibility result tells you that in an asynchronous network with even one possible failure, no detector can perfectly distinguish a crashed node from a slow one. You can tune detectors to be more conservative (fewer false positives, slower failover) or more aggressive (faster failover, more false positives), but you cannot make them perfect. So split-brain is not avoidable through detection alone; it is structurally guaranteed at some rate, and the question is what the data layer does when it happens.
The interesting question is the one that remains: does this corrupt data? It depends entirely on whether the storage layer can tell the two leaders apart and reject the older one's writes. If it can — through fencing — split-brain becomes a benign event: two writers, one quietly rejected, no corruption. If it cannot, split-brain is a data-integrity bug that surfaces hours later in reconciliation.
Fencing — the integer that closes the gap
The fencing pattern is older than most distributed databases, and the implementation is structurally simple. Every time the cluster grants leadership to a node, it bundles a monotonically increasing integer with the grant. The integer is called a fencing token (Raft calls it a term; Zookeeper calls it a zxid prefix; etcd calls it a lease ID combined with a revision; Chubby calls it a generation number — the implementation differs but the contract is the same). The leader includes this integer with every write it issues to the storage layer. The storage layer remembers the highest token it has ever seen and rejects any write whose token is lower.
The contract has three properties that compose into correctness. First: tokens are monotonically increasing — every new election produces a strictly higher token than any previous election. Second: the storage layer's "highest seen" is monotonic — it only ever increases, never resets. Third: a write that arrives with a lower token is rejected at the storage layer, not at the leader, not in the network, not in the application. The rejection happens at the last possible moment, which means it works regardless of how badly the upper layers have lied to themselves.
Walk through the PaySetu example with fencing in place. Node 7 was elected leader at term 14 and started writing with token=14. The partition happened, the cluster (without node 7) elected node 12 as the new leader at term 15. Node 12 starts writing with token=15. The storage layer sees token=15, updates its highest_seen to 15, and accepts node 12's writes. Now the partition heals. Node 7, oblivious, sends a write with token=14. The storage layer compares 14 to its highest_seen=15, rejects the write, and returns an error to node 7. Node 7's write never lands. The settlement engine never processes the duplicate refund. Riya does not get paged.
Why this works without coordination: the storage layer does not need to know who the "real" leader is, does not need to participate in elections, and does not need to communicate with the cluster's failure detector. It only needs to remember one integer and compare it on every write. The election protocol's job is to issue strictly higher tokens to new leaders; the storage layer's job is to reject anything older. The two responsibilities are decoupled, which is what makes the fix scale to systems where the storage layer is a separate process, separate machine, or separate vendor entirely.
The token does not need to be a "term" specifically. Any monotonically increasing integer works. Common implementations:
- Raft term: increments on every election. Fencing is automatic — the term is in every AppendEntries RPC.
- Zookeeper zxid: a 64-bit integer where the upper 32 bits are the epoch (elections) and the lower 32 bits are the per-write counter. The epoch alone is enough for fencing.
- etcd revision + lease ID: every key write carries the revision; leases have IDs that bind to the lease holder.
- Chubby generation number: explicitly named; returned by Open() and required on subsequent SetContents() calls.
# A minimal fencing-token storage layer
class FencedStorage:
    def __init__(self):
        self.data = {}
        self.highest_token = 0

    def write(self, key, value, token):
        if token < self.highest_token:
            return ("REJECTED", f"token {token} < highest_seen {self.highest_token}")
        self.highest_token = max(self.highest_token, token)
        self.data[key] = (value, token)
        return ("OK", token)

    def read(self, key):
        return self.data.get(key)
# Simulate the PaySetu split-brain
storage = FencedStorage()
# Term 14 — node 7 is the leader
print(storage.write("payment_id_4521", "refund_processed", token=14))
print(storage.write("payment_id_4522", "refund_processed", token=14))
# Partition happens. New election. Term 15 — node 12 wins.
# Node 12 starts writing.
print(storage.write("payment_id_4523", "refund_processed", token=15))
# Partition heals. Node 7 (still thinks it is leader at term 14)
# tries to write the duplicate refund.
print(storage.write("payment_id_4521", "refund_processed_AGAIN", token=14))
# Node 7 then catches up, learns it was deposed, but a stuck retry
# from before the catch-up still arrives.
print(storage.write("payment_id_4524", "stale_write", token=14))
Sample run:
('OK', 14)
('OK', 14)
('OK', 15)
('REJECTED', 'token 14 < highest_seen 15')
('REJECTED', 'token 14 < highest_seen 15')
The if token < self.highest_token line is the entire fencing rule — three lines of logic that close a class of bugs that has caused real outages. The max(self.highest_token, token) update ensures the storage never goes backwards, even if writes arrive out of order from the same leader. The rejected-write tuple returns enough information for the would-be leader to detect its own demotion and step down voluntarily; in production systems this is the signal that triggers node-7 → follower transition. The token is per-key implicit metadata — every value stored carries the token that wrote it, which means a follower replicating the log can verify the chain of leadership across the value's history.
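A sketch of that demotion signal in use, building on FencedStorage above; the step-down policy here is an assumption layered on top of the storage contract, not part of it:

class LeaderNode:
    def __init__(self, node_id, token, storage):
        self.node_id = node_id
        self.token = token
        self.storage = storage
        self.role = "leader"

    def replicated_write(self, key, value):
        status, detail = self.storage.write(key, value, self.token)
        if status == "REJECTED":
            # Some election issued a higher token: treat the rejection as
            # proof of demotion and step down instead of retrying.
            self.role = "follower"
        return status

storage = FencedStorage()
storage.write("k", "v", token=15)         # the term-15 leader already wrote
node7 = LeaderNode("node-7", token=14, storage=storage)
print(node7.replicated_write("k", "v2"))  # REJECTED
print(node7.role)                         # follower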
The storage layer here is in-memory Python, but the contract scales unchanged to a real database. PostgreSQL implements the same idea via its pg_replication_origin and the timeline switch on the WAL. MySQL's GTID has a UUID-tagged sequence number that plays the fencing role. S3 has a per-object versionId; combined with conditional writes, it gives you fencing for object-level mutations without the database ever participating.
When fencing fails — the gaps you have to close
Fencing is not magic; it has three practical failure modes that have caused outages.
The token-less storage layer. If your storage system does not natively support fencing, you have to bolt it on. The naive bolt-on is to put the token in the application's payload and check it in application code before writing. This works if the check-and-write is atomic at the storage layer — i.e. you use the storage's compare-and-swap or transactional write to make the check and the write a single operation. If you check in application code and then write with a separate call, the gap between the two is a race window: leader A reads highest=14, leader B writes with token=15, leader A writes with token=14, the storage layer accepts both because nothing checked. Why this gap is so dangerous: most engineers see "I check the token before writing" and assume that closes the bug. It does not, because the check and the write are two separate calls that can be interleaved. The atomic primitive — UPDATE ... SET token = 15 WHERE token <= 15, or compare_and_swap(token=14, new_value=...), or WriteIfMatch(etag=...) — is what makes the check and the write a single observable event. Without it, you have a TOCTOU bug dressed up as a fencing scheme.
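A minimal sketch of the atomic bolt-on, using SQLite (from Python's standard library) as a stand-in for any store with transactional writes; the table names and token values are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fence (highest_seen INTEGER)")
conn.execute("INSERT INTO fence VALUES (0)")
conn.execute("CREATE TABLE payments (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()

def fenced_write(conn, key, value, token):
    with conn:  # one transaction: the guard and the write commit together
        # The guard UPDATE matches only when the incoming token is current
        # or newer, so a stale leader changes zero rows and we bail out
        # before touching the data table. No separate read, no TOCTOU gap.
        guard = conn.execute(
            "UPDATE fence SET highest_seen = ? WHERE highest_seen <= ?",
            (token, token),
        )
        if guard.rowcount == 0:
            return ("REJECTED", token)
        conn.execute(
            "INSERT INTO payments VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        return ("OK", token)

print(fenced_write(conn, "payment_id_4521", "refund_processed", 15))  # ('OK', 15)
print(fenced_write(conn, "payment_id_4521", "refund_AGAIN", 14))      # ('REJECTED', 14)

SQLite serializes writers, and a server database gives the same guarantee via row locking on the fence row; the property that matters is that the guard and the data write form one atomic unit.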
The asymmetric partition. The classic fencing analysis assumes a leader is either on the majority side of a partition (and stays leader) or on the minority side (and gets deposed). Real partitions are messier. A leader on the minority side might still reach the storage layer through a separate network path — a private link, a different VPC peering, a misconfigured firewall rule that lets traffic out but not in. In this case the deposed leader's writes still arrive at the storage layer with the old token, and fencing rejects them — which is exactly the case fencing was designed for. The leader sees its writes failing, learns it has been deposed, and steps down. This is the good case.
The bad case is when the storage layer itself is partitioned. Imagine two storage replicas, each on the same side of the network as one of the two leaders. Each replica only sees one leader's writes. Each replica's highest_seen is independently updated. Each accepts writes that the other replica would have rejected. When the partition heals, the two storage replicas have to merge their state, and the merge cannot use fencing alone — it has to use the underlying replication protocol's conflict resolution. This is why fencing is necessary but not sufficient for distributed storage; you also need the storage layer's own consensus to ensure both replicas agree on highest_seen.
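A toy illustration of the divergence, reusing the FencedStorage class from earlier; each replica is only reachable from one side of the partition (the scenario is hypothetical):

replica_a = FencedStorage()  # reachable only from node 12's side
replica_b = FencedStorage()  # reachable only from node 7's side

print(replica_a.write("order_1", "buy", token=15))   # ('OK', 15)
print(replica_b.write("order_1", "sell", token=14))  # ('OK', 14): replica_b has seen nothing higher

# Each replica's highest_token advanced independently. After the partition
# heals, the replicas disagree on order_1, and no token comparison can say
# which value is canonical; that is the replication protocol's job.
print(replica_a.read("order_1"), replica_b.read("order_1"))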
The clock-skewed lease. Some systems use lease expiration as the fencing trigger: "the leader's lease is valid until t+30s; after t+30s, write with the next token". If a leader's clock is skewed forward by 60s relative to the storage layer, the leader will issue tokens for a future epoch and confuse the system. If a leader's clock is skewed backward, the leader will believe its lease is still valid after the cluster has already deposed it, and it will continue issuing writes with stale tokens. The fix is to have the cluster, not the leader, decide when tokens advance — the leader receives its token and lease boundaries from a central authority (the consensus group, or a lease service like Chubby), and the leader does not consult its own clock for token decisions.
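A minimal sketch of the authority-issued variant, with hypothetical names; the point is that the leader treats the token and lease deadline as opaque values from the cluster rather than deriving them from its own clock:

import time

class LeaseAuthority:
    """Stand-in for the consensus group or lock service. Its clock, not the
    leader's, decides when tokens advance and when leases expire."""
    def __init__(self, lease_seconds=30):
        self.current_token = 0
        self.lease_seconds = lease_seconds

    def grant(self):
        self.current_token += 1  # a new token per grant, bound to the election
        deadline = time.monotonic() + self.lease_seconds  # authority's clock
        return self.current_token, deadline

authority = LeaseAuthority()
token, deadline = authority.grant()
# A leader with a skewed clock still writes with the granted token; it never
# computes "my lease is valid until..." locally, so skew cannot mint a stale
# or future token.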
A real production story makes the failure modes concrete. KapitalKite's order-routing service in late 2025 had a Chubby-style lock service issuing fencing tokens for its order-matching engine. The engine was multi-region, with the storage layer (the order book) sharded across three regions. During a 9-minute network partition between the Mumbai and Singapore regions, both regions elected a leader for the partition's local view of the order book. Both leaders had monotonically-increasing tokens; both wrote to their local shard. So far this is the expected behaviour — the fencing scheme correctly isolated each region's writes. The bug surfaced when the partition healed: the merge logic compared the two shards and used the higher token to decide which writes were canonical. But the two shards had drifted, with the Singapore shard's token at 41,847 and the Mumbai shard's token at 41,851 — Mumbai's writes were preserved, Singapore's six minutes of accepted orders were silently discarded, and ₹3.7 crore of customer orders disappeared. The fix was to switch from "higher token wins" to "merge by application-level rules" (price-time priority for orders), but the post-mortem's main lesson was that fencing tokens prevent split-brain within a storage shard but do not prevent it across shards. That requires a separate consensus protocol, typically Spanner-style multi-Paxos with a global commit timestamp.
Fencing in production systems — the implementation differences
Different systems package fencing differently, and the differences matter when you are debugging an incident.
- etcd uses a 64-bit revision per key, advanced atomically on every write. The lease API binds a key's lifetime to a lease ID; when the lease expires, all keys bound to it are deleted, which is how etcd implements leader election with automatic fencing. The lease ID itself is the fencing token; clients hold their lease via the lease-keepalive RPC and learn from the response when the lease has been revoked.
- Zookeeper uses zxid, a 64-bit integer with the epoch in the high 32 bits and a counter in the low 32 bits. The epoch advances on every election, and ephemeral znodes are tied to the session that created them; a session timeout deletes the znode atomically, which fences out the old leader.
- Consul uses session IDs and KV locks; the session is tied to a node's health checks, and lock acquisition increments a LockIndex that serves as the fencing token.
- Chubby explicitly returns a generation number from Open() and requires it on SetContents(); Burrows's 2006 paper presents this sequencer mechanism as the canonical fencing pattern.
The pattern across all four: the consensus layer (Raft for etcd, Zab for Zookeeper, Raft for Consul, Paxos for Chubby) issues monotonically increasing tokens, and the storage layer enforces them atomically per write. The differences are about how the token flows from the lock service to the storage layer — bundled in RPCs, included in headers, written into the log — but the contract is identical.
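To make one of those packagings concrete, the zxid layout can be decoded by hand; a small sketch based on the documented epoch/counter split:

def zxid_parts(zxid: int) -> tuple[int, int]:
    """Split a Zookeeper zxid into (epoch, counter): the epoch lives in the
    high 32 bits and advances on every election; the per-write counter
    lives in the low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF

zxid = (3 << 32) | 17    # epoch 3, 17th write of that epoch
print(zxid_parts(zxid))  # (3, 17)
# For fencing, comparing epochs alone is enough: any zxid from epoch 2
# is lower than every zxid from epoch 3.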
Common confusions
- "Fencing prevents split-brain." It does not. Fencing prevents split-brain from corrupting data. Two leaders still exist during the overlap window; both still believe they are leader; both still try to write. Fencing makes the older leader's writes harmless, not non-existent. The leader on the wrong side of a partition still has to detect its demotion and step down, which is a separate concern.
- "A monotonic counter is a fencing token." Only if it is issued by the cluster, not by the leader itself. A leader that increments its own counter has no protection — both leaders will issue token 15, and the storage layer cannot tell them apart. The counter has to be issued by the consensus protocol that elected the leader, which means it is bound to the election, not to the leader's local state.
- "Fencing requires the storage layer to participate in consensus." It does not. The storage layer only needs to remember the highest token and reject lower ones. It does not need to vote, replicate, or know about the cluster's elections. This is why fencing scales to systems where the storage layer is an external service (S3, RDS, a vendor database) that has no awareness of your application's consensus protocol.
- "Higher fencing tokens always win." Within a single shard, yes. Across shards or replicas that have diverged during a partition, no — token ordering is meaningful only within a single sequence of
highest_seenupdates. KapitalKite's outage came from assuming "higher token wins" applied across shards; it did not, and ₹3.7 crore of orders disappeared. - "Fencing replaces leader election." It complements it. Election decides who the leader is; fencing makes it safe for a deposed leader to make mistakes during the demotion window. The two are paired — election without fencing produces split-brain corruption; fencing without election has no leader to fence in the first place.
- "Lease expiration is the same as fencing." A lease tells the leader when it should stop acting as leader; fencing tells the storage layer to reject writes from a stopped leader. A leader with an expired lease that does not realise it has expired (clock skew, GC pause) will still try to write — fencing is what prevents the write from corrupting state. Lease + fencing is the combination; either alone is not enough.
Going deeper
The Burrows Chubby paper's exact phrasing
Burrows's 2006 OSDI paper on Chubby is the canonical introduction to the fencing pattern. The relevant section is §2.4, "Locks and sequencers". Burrows explicitly notes that a Chubby client can hold a lock and then be delayed for an arbitrary amount of time — a GC pause, a swap-in delay, a network glitch — and during that delay the lock service can revoke the lock and grant it to another client. The paper introduces the sequencer as the fix: an opaque byte-string that a lock holder requests from the service, must include in its protected operations, and that the protected resource must verify against the most recent sequencer it has seen. The sequencer is the original name for what later systems call the fencing token, and the paper's framing — a lock holder may request a sequencer — is exactly the contract every modern system implements.
Why Raft does not need a separate fencing token
Raft's term number is a fencing token by construction. Every AppendEntries RPC includes the leader's term; the follower rejects any RPC whose term is lower than its own current term. Because the term advances on every election, an old leader's AppendEntries after a partition heal will be rejected by the followers. In Raft, the storage layer (the followers' log) is the consensus layer, so the fencing is implicit. Systems built on top of Raft (etcd, CockroachDB, TiKV) inherit this; their external clients see a unified interface where fencing is invisible. This is one reason Raft replaced Paxos in many production systems — Paxos's separation between the consensus protocol and the leader-election protocol forced application code to handle fencing explicitly, while Raft folded it into the protocol.
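A simplified sketch of the follower-side rule; this shows only the term check from the paper's AppendEntries receiver rules, and omits the log-matching checks:

class Follower:
    def __init__(self):
        self.current_term = 0
        self.log = []

    def append_entries(self, leader_term, entries):
        # Receiver rule: reject any RPC whose term is below our own.
        # This comparison is the implicit fencing check.
        if leader_term < self.current_term:
            return (False, self.current_term)  # stale leader fenced out
        self.current_term = leader_term        # adopt the newer term
        self.log.extend(entries)
        return (True, self.current_term)

f = Follower()
print(f.append_entries(15, ["x=1"]))  # (True, 15): term-15 leader accepted
print(f.append_entries(14, ["x=2"]))  # (False, 15): deposed leader rejected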
The Jepsen test that broke fencing
Aphyr's Jepsen test of Redis Sentinel in 2013 famously demonstrated split-brain in a system that had no fencing. The test partitioned a Redis cluster, observed two masters being elected, and then showed that after the partition healed, writes acknowledged by the deposed master were silently discarded. The analysis identified the root cause as the absence of a fencing mechanism: Redis Sentinel's failover was based on a quorum of sentinels agreeing the master was dead, but Redis itself had no token-based rejection at the storage layer. Later Redis releases added the WAIT command for synchronous replication and replica-priority-aware failover, but the original lesson stands: failure detection without fencing is data loss waiting to happen. The Jepsen tests for etcd, Zookeeper, and Consul, by contrast, all show clean rejection of stale-leader writes — because those systems implement fencing natively.
Multi-region fencing — the Spanner approach
When the storage layer is multi-region, fencing tokens scoped to a single region are insufficient — KapitalKite's outage is a small instance of this larger problem. Google's Spanner addresses it by making the fencing token a global timestamp produced by TrueTime, the bounded-uncertainty clock service. Every Spanner transaction commits with a timestamp t that is guaranteed to be greater than any timestamp seen at any region for any prior transaction; the storage layer's highest_seen is itself replicated globally via Paxos, so all regions agree on the same value. The cost is the TrueTime infrastructure (atomic clocks, GPS receivers, the commit-wait step that waits out the uncertainty interval before returning success to the client). The benefit is that fencing now works across the entire global database, not just within a single shard. Spanner's commit-wait is the visible manifestation of this — every write pays a 5–7 ms wait so that the global token ordering is preserved across regions.
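A toy model of commit-wait under a bounded-uncertainty clock; the interface loosely mimics TrueTime's interval API, and the epsilon value is made up:

import time

EPSILON = 0.007  # assumed uncertainty bound, in the spirit of the 5-7 ms above

def tt_now():
    """TrueTime-style interval: real time is guaranteed to lie inside it."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(write):
    _, latest = tt_now()
    commit_ts = latest  # no region can have seen a later timestamp yet
    # Commit-wait: block until commit_ts is definitely in the past
    # everywhere, i.e. even the earliest possible current time exceeds it.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts    # safe to ack: global token ordering is preserved

print(commit({"order": 42}))  # pays roughly 2 * EPSILON before returning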
Fencing in stateless leader-elected services
Not every leader holds storage state. Stateless services use leader election for tasks like "exactly one cron job runs across the fleet" or "exactly one consumer reads this Kafka partition". Fencing tokens still apply: the leader includes its token in every external action (the cron job's database mutation, the Kafka commit), and the external system rejects stale-token actions. For Kafka specifically, the consumer-group protocol uses a generation_id that plays the fencing role — a deposed consumer's offset commits will fail with an IllegalGeneration error if its generation is older than the group's current generation. This is the same pattern as the storage-layer fencing, just at the Kafka coordinator instead of a database.
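A toy model of the coordinator-side check; the names are illustrative, and the real protocol signals the failure with an IllegalGeneration error code rather than a string:

class GroupCoordinator:
    """Toy consumer-group coordinator: generation_id plays the token role."""
    def __init__(self):
        self.generation_id = 0

    def rebalance(self):
        self.generation_id += 1  # every rebalance acts like an election
        return self.generation_id

    def commit_offsets(self, member_generation, offsets):
        if member_generation < self.generation_id:
            return "ILLEGAL_GENERATION"  # deposed consumer, commit rejected
        return "OK"

coord = GroupCoordinator()
gen_a = coord.rebalance()  # consumer A joins at generation 1
gen_b = coord.rebalance()  # partitions reassigned, generation 2
print(coord.commit_offsets(gen_b, {"orders-0": 42}))  # OK
print(coord.commit_offsets(gen_a, {"orders-0": 41}))  # ILLEGAL_GENERATION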
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install simpy
# split-brain simulator with fencing on/off toggle
curl -sL https://raw.githubusercontent.com/example/dist-sys-toys/main/fencing.py > fencing.py
python3 fencing.py --leaders=2 --writes=1000 --fencing=on
# observe: zero corrupting writes with fencing on, ~50% corrupting writes with fencing off
Where this leads next
The next chapter, the FLP impossibility result, formalises why split-brain is unavoidable through detection alone — no protocol can perfectly distinguish a crashed node from a slow one in an asynchronous network with even one possible failure. Reading FLP is what makes the case for fencing rigorous: it is not "we are bad at detection", it is "perfect detection is mathematically impossible".
Two layers up, leader leases and lease expiration is the protocol layer that decides when a leader steps down voluntarily, and Chubby and the lock-service pattern is the canonical implementation of the fencing-aware lock service. The three together — leases for voluntary step-down, fencing for involuntary protection, Chubby-style services for the implementation — are the production triad for safe leadership in distributed systems.
The takeaway worth carrying forward: split-brain is not a bug to prevent; it is a condition to handle. Detection alone cannot prevent it. Fencing makes it safe. Anywhere your system has a "single leader" or "exclusive lock" abstraction, find the fencing token, verify it is monotonic across elections, and verify the storage layer rejects stale ones atomically. Where any of those three checks fails, you have a split-brain bug waiting for the right partition to surface it.
References
- Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006). The canonical introduction to the sequencer / fencing-token pattern, §2.4.
- Ongaro, D., Ousterhout, J. — "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014). Raft's term number serves as the implicit fencing token in every AppendEntries RPC.
- Kingsbury, K. (Aphyr) — "Call me maybe: Redis" (Jepsen, 2013). The post-mortem that demonstrated split-brain in Redis Sentinel due to absent fencing.
- Corbett et al. — "Spanner: Google's Globally-Distributed Database" (OSDI 2012). The TrueTime + commit-wait approach to global fencing across regions.
- Hunt, P., Konar, M., Junqueira, F., Reed, B. — "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010). Zxid as the fencing token; ephemeral znodes for automatic deposition.
- Phi accrual failure detector — the failure-detector layer that triggers leadership change, and why its uncertainty makes fencing necessary.
- FLP impossibility — the formal result that explains why detection alone cannot prevent split-brain.
- Chubby and the lock-service pattern — the canonical lock service that issues fencing tokens.
- Junqueira, F., Reed, B., Serafini, M. — "Zab: High-performance broadcast for primary-backup systems" (DSN 2011). Zookeeper's atomic broadcast protocol and how zxid is generated.
- Howard, H., Mortier, R. — "Paxos vs Raft: Have we reached consensus on distributed consensus?" (PaPoC 2020). Compares how each protocol handles leader changes and the implicit fencing pattern.