FLP impossibility — what it forbids
It is 02:14 IST on the night of a CricStream final and an etcd cluster of five nodes — the control plane for the streaming-platform's leader election — has stopped accepting writes. Two nodes in one AZ can reach each other; two in a second AZ can reach each other; the fifth, in a third AZ, is cut off from both: no group holds a three-node quorum. Heartbeats are dropping. The on-call engineer sees a CPU graph at 2% on every node, no GC pauses, no disk contention. The cluster is healthy by every metric — but it cannot make progress. The runbook says "wait for the partition to heal or remove a member". Neither happens cleanly. The control plane sits there, term incrementing every 800 ms as candidates time out and try again. Why no amount of clever code rescues this: the system has been pushed into the regime where the Fischer-Lynch-Paterson theorem (JACM 1985) becomes operational reality. In a world where you cannot tell "the message is slow" from "the sender is dead", a deterministic consensus protocol must either give up safety (let the system commit a value that conflicts with another partition's commit) or give up liveness (refuse to commit anything until the ambiguity resolves). etcd chose liveness-loss; the cluster waits. There is no protocol that would have committed safely here, and that is what FLP proves.
FLP (Fischer-Lynch-Paterson, 1985) proves that in a fully asynchronous system with even one crash-faulty process, no deterministic consensus protocol can guarantee both safety (all nodes agree on the same value) and liveness (the protocol eventually decides). Every protocol you actually run — Paxos, Raft, Zab, PBFT — sidesteps FLP by assuming partial synchrony, by using randomisation, or by relying on a failure detector. Knowing what FLP forbids tells you what your cluster will do when the assumptions break: stall.
What FLP actually says — and what it does not
The theorem, stated tightly: in an asynchronous distributed system, where messages can be delayed for any finite time and even one process can crash silently, there exists no deterministic consensus algorithm that always terminates with all correct processes agreeing on the same value.
Three words carry the entire weight.
Asynchronous. The model has no clocks, no timeouts, no upper bound on message delay. A message sent at time t may arrive at t + 1ms or t + 10 seconds or t + 3 hours — the receiver cannot tell. The receiver also cannot tell, after waiting any finite time, whether the message will ever arrive. Real networks are not quite this bad — they are partially synchronous: most messages arrive within a bound, with occasional outliers — but the FLP model is the worst case, and it is a useful worst case because reasoning under it gives you a hard floor.
Crash-faulty (one). A single process may stop forever at any point. It is not Byzantine — it does not lie or send corrupted messages — it simply stops. This is the weakest possible failure model; protocols that survive Byzantine faults must of course also handle crashes. FLP says even one crash, in this benign sense, is enough to break determinism.
Deterministic. The protocol, given the same inputs and the same message-arrival order, produces the same outputs. No randomisation, no coin flips, no probabilistic tie-breaking. Many real protocols are deterministic (Paxos, Raft) and inherit FLP's bound. Some are not (Ben-Or's randomised consensus from 1983 dodges FLP precisely because it flips coins) and FLP does not apply to them.
Three things FLP does not say. (1) FLP does not say consensus is impossible — it says deterministic, always-terminating, asynchronous consensus is impossible. Drop any one of those adjectives and you can build a protocol. (2) FLP is not about probability — the impossibility is constructive: a specific adversarial schedule of message delays exists that prevents termination. (3) FLP does not say your Raft cluster is broken — it says your Raft cluster will, under the worst-case message schedule plus one crash, fail to make progress until the schedule changes. In a real datacentre with bounded RTTs, that worst-case schedule is rare. But it can happen, and when it does, the cluster sits there.
The proof in pictures — bivalency, valency, and the critical step
The proof technique is bivalency. A configuration of the system (the joint state of all processes plus all in-flight messages) is 0-valent if every reachable terminating run from it decides 0; 1-valent if every reachable terminating run decides 1; bivalent if it has at least one terminating run deciding 0 and at least one deciding 1.
The proof has three steps.
Step 1: there exists a bivalent initial configuration. Order the initial configurations so that adjacent ones differ in exactly one process's input. The all-0 configuration is 0-valent and the all-1 configuration is 1-valent (by validity), so somewhere in the line an adjacent pair C and C′, differing only in process p's input, have different valencies. Now suppose p crashes before taking a single step: the runs from C and C′ are indistinguishable to every other process, so both decide the same value, and that value contradicts the valency of one of the pair. The contradiction means the assumption "every initial configuration is univalent" fails: some initial configuration is bivalent.
Step 2: from any bivalent configuration, there exists a step that leads to another bivalent configuration. This is the heart. Suppose at configuration C a process p is about to receive message m. The adversary (delivering messages in the worst order) can either deliver m to p immediately, or delay it while delivering other messages first. Suppose for contradiction that every configuration reachable by eventually delivering m is univalent; then both a 0-valent and a 1-valent one are reachable, from neighbouring predecessors. If the step separating those predecessors happens at a process other than p, it commutes with delivering m and the two valencies collapse into a contradiction; if it happens at p itself, a deciding run in which p takes no further steps (p may, after all, have crashed) reaches a decision consistent with both valencies, again a contradiction. So at least one successor configuration is bivalent, and the adversary can steer into it.
Step 3: extend bivalency forever. By induction on Step 2, the adversary can keep the system in bivalent configurations indefinitely. Since a bivalent configuration has not decided, the protocol never terminates. Why this is the canonical impossibility shape: the proof exhibits a specific adversarial schedule — a sequence of message-delivery orderings — under which the protocol is forced to either decide prematurely (and risk inconsistency if a different schedule had been observed) or never decide. The schedule is constructed step by step; it is not probabilistic. This is what makes FLP a worst-case bound: there exists a real execution where the protocol fails, not just an unlikely one.
The delivery of m to p in Step 2 is what the literature calls the critical step. The schedule the adversary builds is one where the message that would resolve bivalency is always the one it delays, sliding bivalency forward through configurations forever. The reader who wants the full argument should read the original 1985 paper, which is short, or Lynch's textbook Distributed Algorithms (1996), whose exposition of the bivalency proof is tighter still.
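Valency is concrete enough to compute by brute force on a protocol small enough to enumerate. The sketch below is a hypothetical toy, not FLP's general model: processes 1 and 2 each send their proposal to process 0, and process 0 decides whichever proposal is delivered first (the others then adopt its decision). Classifying a configuration is then just a matter of running every delivery schedule:
# valency_check.py — classify an initial configuration as 0-valent, 1-valent,
# or bivalent by enumerating every delivery schedule of a toy protocol.
from itertools import permutations

def run_schedule(order, proposals):
    # Deliver the two proposal messages to process 0 in the given order;
    # the first one delivered fixes the decision for everyone.
    return proposals[order[0]]

def valency(proposals):
    decisions = {run_schedule(order, proposals) for order in permutations((1, 2))}
    if decisions == {0}: return "0-valent"
    if decisions == {1}: return "1-valent"
    return "bivalent"

print(valency({1: 1, 2: 0}))  # bivalent: delivery order alone picks the outcome
print(valency({1: 0, 2: 0}))  # 0-valent: every schedule decides 0
The adversary's whole game in Step 2 is to keep steering the system into configurations whose valency check would come back "bivalent".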
A measurable demonstration — async consensus stalls under adversarial scheduling
The script below builds a tiny three-process consensus simulator. Each process proposes a value and decides once it knows proposals from a majority (2 of 3, counting its own). The simulator can deliver messages in any order chosen by an adversary. We compare four settings: fair scheduling (FIFO delivery) with and without one crashed process — the protocol decides in both, because a majority survives one crash — and adversarial scheduling with and without one crashed process, where a worst-case ordering withholds exactly the messages that would let anyone decide. This is FLP made operational on your laptop.
# flp_demo.py — demonstrate that adversarial scheduling + one crash blocks decision
import collections

class Process:
    def __init__(self, pid, proposal):
        self.pid, self.proposal = pid, proposal
        self.received = []   # messages received so far
        self.decided = None  # decided value, or None if still undecided
        self.crashed = False

    def known_proposals(self):
        # Proposals this process knows: its own, plus every one it has received.
        known = {m["from"]: m["val"] for m in self.received if m["type"] == "proposal"}
        known[self.pid] = self.proposal
        return known

    def step(self, msg):
        if self.crashed:
            return
        self.received.append(msg)
        # Toy decision rule: decide once you know proposals from a majority
        # (>= 2 of 3, counting your own). Deterministic tie-break: smallest value.
        # An illustrative rule, not a safe general-purpose consensus protocol.
        known = self.known_proposals()
        if self.decided is None and len(known) >= 2:
            self.decided = min(known.values())

def run(scheduler, crash_pid=None):
    procs = {pid: Process(pid, proposal=pid % 2) for pid in (0, 1, 2)}
    if crash_pid is not None:
        procs[crash_pid].crashed = True
    # Each live process broadcasts its proposal initially.
    inbox = collections.deque()
    for src, p in procs.items():
        if p.crashed:
            continue
        for dst in procs:
            if dst == src:
                continue
            inbox.append({"from": src, "to": dst, "type": "proposal", "val": p.proposal})
    # Scheduler controls delivery order.
    steps = 0
    while inbox and steps < 200:
        msg = scheduler(inbox, procs)
        if msg is None:
            break  # scheduler chose to deliver nothing — system stuck
        procs[msg["to"]].step(msg)
        steps += 1
        if all(p.decided is not None or p.crashed for p in procs.values()):
            break
    decisions = {pid: p.decided for pid, p in procs.items() if not p.crashed}
    return decisions, steps

def fair_scheduler(inbox, procs):
    return inbox.popleft()  # FIFO — the friendly case

def adversarial_scheduler(inbox, procs):
    # Adversary's rule: delay any message that would let a process reach a
    # decision. Concretely: if delivering a message would give its recipient a
    # proposal from a new proposer, rotate it to the back and try the next.
    tried = 0
    while tried < len(inbox):
        msg = inbox[0]
        if msg["type"] == "proposal" and msg["from"] not in procs[msg["to"]].known_proposals():
            inbox.rotate(-1); tried += 1; continue  # delay
        return inbox.popleft()
    return None  # adversary blocked everything — system stuck

print("FAIR SCHEDULER, no crash:")
print(" ", run(fair_scheduler))
print("\nFAIR SCHEDULER, one crash (pid=2 silent):")
print(" ", run(fair_scheduler, crash_pid=2))
print("\nADVERSARIAL SCHEDULER, one crash (pid=2 silent):")
print(" ", run(adversarial_scheduler, crash_pid=2))
print("\nADVERSARIAL SCHEDULER, no crash (everyone alive):")
print(" ", run(adversarial_scheduler, crash_pid=None))
Sample run on Python 3.11 (the demo is deterministic):
FAIR SCHEDULER, no crash:
  ({0: 0, 1: 0, 2: 0}, 3)
FAIR SCHEDULER, one crash (pid=2 silent):
  ({0: 0, 1: 0}, 3)
ADVERSARIAL SCHEDULER, one crash (pid=2 silent):
  ({0: None, 1: None}, 0)
ADVERSARIAL SCHEDULER, no crash (everyone alive):
  ({0: None, 1: None, 2: None}, 0)
Per-line walkthrough. Process.step holds the decision rule — a process decides once it knows proposals from a majority (2 of 3, counting its own), and deterministically picks the smallest proposed value. Under fair scheduling every live process soon learns a second proposal and decides; note the second row: even with pid=2 crashed, the two survivors form a majority and decide. adversarial_scheduler plays the FLP adversary: it scans the in-flight queue and delays precisely those messages whose delivery would hand a recipient a new proposal — the messages that would resolve bivalency. With one process crashed, it withholds the survivors' proposals from each other and the system stalls at zero steps. The third row is FLP made concrete: two healthy processes, one silent crashed process, an adversary that exists in the model — and the system produces no decision. Why this is the operational shape of FLP, not just a textbook curiosity: in production, the "adversary" is not malicious — it is the network. A message that the model says is "delayed indefinitely" is, in practice, a message buffered in a switch's queue during a microburst, or a packet caught behind a long flow's TCP congestion window, or a heartbeat that fired but never arrived. The scheduler's "delay this message" choice is what real networks do under load. FLP is not an obscure formal-methods result; it is what your cluster does when the network is having a bad day.
The fourth row deserves a caveat. Our toy adversary stalls the no-crash run too, but only by withholding every message forever: an unfair schedule that the FLP model does not permit, since every message must eventually be delivered to a live process. The theorem's real force is that a fair adversary, one that does deliver everything, can still keep a deterministic protocol undecided forever, exploiting nothing more than the possibility of one crash. Reproducing that requires the full bivalency construction, which operates on an arbitrary protocol skeleton and shows that for any deterministic protocol such a schedule exists. Our toy demonstrates the shape of the obstruction, not the construction itself.
What real protocols do — and where the assumptions live
Paxos, Raft, Zab, and PBFT all run in production. They are deterministic (no coin flips). They terminate, in practice, in milliseconds. So how do they coexist with FLP?
The answer lives in the synchrony assumption. FLP is stated under full asynchrony — no message-delay bound at all. Real datacentre networks are not fully asynchronous: 99.99% of the time, RTTs are bounded by some T (a few hundred microseconds within a rack, a few milliseconds across AZs, tens of milliseconds across regions). Real systems are partially synchronous (Dwork-Lynch-Stockmeyer, 1988): there exists a global stabilisation time GST after which message delays become bounded, but the system does not know what GST is. Under partial synchrony, deterministic consensus is possible — Paxos and Raft are examples. They give up liveness only during the periods before GST (i.e., during partitions or chaotic delay) and regain liveness once the network stabilises.
This is why Raft's electionTimeout is the parameter you tune. It encodes the protocol's bet on T: if you set it to 150ms, you are betting that under non-failure conditions, leader heartbeats arrive within 150ms. If they don't, an election is triggered. The election may itself fail to reach a majority (split vote) — that is FLP showing through — but Raft uses randomised election timeouts to break the symmetry, side-stepping FLP via randomisation in this one corner of the protocol. Why Raft's randomisation is sufficient: FLP applies to fully deterministic protocols. Raft is mostly deterministic, but its election-timeout jitter is randomised. With probability 1 over a long enough time, two candidates pick different timeouts, one starts its election first, and one wins. The randomisation removes the symmetric-stalemate state that the adversary in FLP would otherwise exploit. This is the same trick Ben-Or used in 1983 to dodge FLP entirely — randomisation as an escape hatch.
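A back-of-envelope simulation shows why the jitter is enough. The model below is a toy with invented parameters, not Raft itself: assume an election round fails whenever the two candidates' timeouts fire within one network RTT of each other (they split the vote), and succeeds otherwise.
# election_jitter.py — toy model: randomised timeouts break symmetric stalemates.
import random
random.seed(1)

RTT_MS, BASE_MS = 5, 150   # assumed RTT and base election timeout

def rounds_until_winner(jitter_ms):
    rounds = 0
    while True:
        rounds += 1
        a = BASE_MS + random.uniform(0, jitter_ms)
        b = BASE_MS + random.uniform(0, jitter_ms)
        if abs(a - b) > RTT_MS:   # one candidate clearly fires first and wins
            return rounds

# With jitter_ms=0 the loop never exits: a == b every round, the deterministic
# stalemate FLP's adversary exploits. With Raft-style jitter it exits at once:
trials = [rounds_until_winner(150) for _ in range(10_000)]
print(sum(trials) / len(trials))   # ≈ 1.07 rounds on average
The expected round count is geometric: each round fails with probability roughly 2·RTT/jitter, so more jitter buys faster convergence at the cost of slower failover.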
There is a third escape: failure detectors. Chandra and Toueg (1996) showed that consensus is solvable in an asynchronous system if you have an oracle that eventually correctly identifies which processes have crashed. They classified failure detectors by their accuracy and completeness properties; the weakest detector that solves consensus is ◇W (Eventually Weak) — eventually, all crashed processes are suspected, and at least one correct process is not falsely suspected. Real systems implement approximate ◇W detectors using heartbeats with timeouts — the heart of Part 10's chapters. The cost: those detectors are unreliable (they can mistakenly suspect a slow but live process), and that unreliability is what produces the operational pathologies of "leader thrashing" and "election storms" in real Raft clusters under high load.
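A minimal sketch of that timeout-and-retract loop (my own illustration of the idea, not Chandra-Toueg's formal construction): suspect on silence, retract when the heartbeat turns out to be merely late, and grow the timeout on every retraction. Once the network stabilises and real delays fall below the grown timeout, false suspicions stop, which is the eventual-accuracy half of ◇W.
# eventually_weak.py — timeout-based suspicion with retraction; each false
# suspicion doubles the timeout, so once delays stabilise below it, a live
# node is never suspected again (the "eventually" in Eventually Weak).
class SuspectWithRetraction:
    def __init__(self, timeout=1.0):
        self.timeout = timeout
        self.last_beat = 0.0
        self.suspected = False

    def heartbeat(self, now):
        if self.suspected:          # the node was alive after all:
            self.suspected = False  # retract the suspicion...
            self.timeout *= 2       # ...and be more patient next time
        self.last_beat = now

    def check(self, now):
        if now - self.last_beat > self.timeout:
            self.suspected = True
        return self.suspected

d = SuspectWithRetraction(timeout=1.0)
d.heartbeat(0.0)
print(d.check(2.5))   # True: 2.5 s of silence exceeds the 1 s timeout, suspect
d.heartbeat(2.6)      # it was only slow: retract, timeout grows to 2 s
print(d.check(4.0))   # False: 1.4 s of silence is within the grown timeout
The unreliability is visible in the first print: a slow-but-live node gets suspected, and that false suspicion is exactly what triggers the spurious leader elections described above.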
A war story — KapitalKite's etcd stall during the AZ partition
KapitalKite runs its order-router metadata in a 5-node etcd cluster spread across three AZs in ap-south-1: two nodes in 1a, two in 1b, one in 1c. On 11 February 2026 at 11:32 IST, a network-fabric upgrade took the provider's inter-AZ mesh down, leaving each AZ's internal network healthy but severing the three AZs from each other. The cluster split into three groups: {1a-1, 1a-2}, {1b-1, 1b-2}, and {1c-1}. None of the three groups had a quorum (3 of 5). All three groups stopped accepting writes at 11:32:14. The Raft term incremented from 4711 to 4719 in 6.4 seconds as candidates timed out and re-tried; no candidate could win because no group had three votes.
The on-call SRE saw the dashboards and panicked: writes were failing, the runbook said "remove the dead node from the cluster", but it was unclear which side of the partition the operator's etcdctl session was even talking to. They almost ran etcdctl member remove against a node that was actually on the wrong side of the partition — which would have left the membership configuration inconsistent across the partition's sides and required a full cluster rebuild after the partition healed. Why this almost-mistake was so dangerous: in a partition, every side sees the other sides' nodes as dead, so every side's dashboard justifies the same "remove the dead nodes" instinct. etcd's reconfiguration protocol (AddMember/RemoveMember) is itself a Raft-replicated operation; running it during a no-quorum partition violates the protocol's preconditions and produces undefined cluster state. The senior SRE intervened: hold tight, do nothing, wait for the partition to heal. At 11:36:42, the inter-AZ mesh came back online. Within 800 ms, the cluster elected a leader (1a-1, term 4720) and started accepting writes again. The 4-minute, 28-second outage was real; no data was lost.
The post-incident report's key conclusion was the FLP framing: the cluster was working exactly as Raft specifies. Partial synchrony's "give up liveness during partition" is a feature, not a bug. The trade-off Raft makes — block writes rather than allow split-brain — is the right one for KapitalKite's order-router metadata, where two divergent commit orders would have caused trade-settlement reconciliation issues worth ₹40 lakh per minute. The fix the team implemented was operational: better dashboards distinguishing "no quorum" from "node dead", a runbook explicitly forbidding member remove during a no-quorum state, and a chaos-engineering test that injects a no-quorum partition every quarter to keep the team's instinct sharp.
Common confusions
- "FLP says consensus is impossible." No — FLP says deterministic asynchronous always-terminating consensus is impossible. Drop any of those three adjectives and a protocol exists. Real protocols drop "asynchronous" (assume partial synchrony) or "deterministic" (use randomisation), and they work fine.
- "Raft solves FLP." Raft does not solve FLP — Raft navigates around FLP by assuming the network is partially synchronous and by using randomised election timeouts. Under a fully asynchronous model with one crash, Raft can be driven to never decide; the required schedule is just adversarial enough to be uncommon in real datacentres. The reader who tunes Raft's electionTimeout is calibrating the synchrony assumption.
- "FLP is about Byzantine failures." FLP assumes the weakest failure model: crash failures only. No lying, no corruption, no malicious behaviour — a process simply stops. The result is so striking precisely because even this benign failure is enough. Byzantine consensus (PBFT, HotStuff) inherits FLP's bound and adds further constraints on top.
- "FLP is just theory; my cluster works fine." Your cluster works fine because real networks are not the FLP adversary. Most of the time, message delays are bounded, RTTs are predictable, and consensus terminates in milliseconds. FLP becomes operationally visible exactly when those assumptions break — during a partition, during a network microburst that triples RTT, during a slow node that masquerades as failed. The KapitalKite incident above is FLP made visible. If your runbook does not have an entry for "the cluster cannot decide because the synchrony assumption is violated", FLP is going to surprise you eventually.
- "Consensus is impossible without a clock." Wall-clocks are not what FLP requires; what FLP forbids is deterministic termination under full asynchrony. Spanner's TrueTime adds physical-time bounds that allow deterministic linearisable transactions, but TrueTime is a partial-synchrony assumption with hardware backing — atomic clocks plus GPS bound the clock uncertainty to a known interval. Without TrueTime, Spanner would still be safe via Paxos consensus; TrueTime makes the read path cheaper, it is not what makes safety reachable. The clock is an optimisation; consensus's safety comes from quorum overlap, not time.
- "FLP can be ignored if I just use a good library." The library inherits FLP. etcd, Consul, and ZooKeeper all inherit FLP's bound. Their documentation explicitly tells you what they do during a partition (stall, elect, reject writes); reading that documentation is reading FLP. The library does not "solve" the impossibility — it picks a point in the trade-off space and exposes the assumption.
Going deeper
The bivalency proof — full sketch with adversarial schedule
The full FLP proof goes like this. Consider any deterministic consensus protocol P that is supposed to decide in every admissible run with at most one crash. Step 1 establishes a bivalent initial configuration C₀ by an exchange argument: line up all initial configurations so that adjacent ones differ in exactly one process's input; the all-0 end is 0-valent and the all-1 end is 1-valent (by validity), so some adjacent pair C, C′ differing only in process p's input have different valencies. If p crashes before taking a step, the runs from C and C′ are indistinguishable to everyone else and must decide the same value, contradicting the valency of one of the pair unless that configuration is in fact bivalent. Step 2's heart is the case analysis. From bivalent C, consider the set D of configurations reached by first delaying, then finally delivering, one particular event e = (p, m). Suppose for contradiction that every configuration in D is univalent; then D contains both a 0-valent and a 1-valent configuration, reached from neighbouring predecessors. If the step separating those predecessors happens at a process other than p, it commutes with e and the two valencies collapse into a contradiction; if it happens at p itself, a deciding run in which p takes no further steps (p may have crashed) is consistent with both valencies, again a contradiction. So some configuration in D is bivalent. Step 3 then iterates: from any bivalent C, there is a step to bivalent C′, and the adversary picks that step every time. The execution is infinite, all messages are eventually delivered (so the protocol cannot complain "you never delivered my message"), no process crashes (so the protocol cannot blame a fault) — yet the protocol never decides. Lynch's textbook treatment spells this case analysis out more cleanly than the original 1985 paper. This is the canonical impossibility result of distributed computing.
How partial synchrony rescues consensus — the DLS framework
Dwork, Lynch, and Stockmeyer's 1988 paper "Consensus in the Presence of Partial Synchrony" introduces three models and proves consensus solvability in each. Model 1 (eventually synchronous): there exists a bound Δ on message delay and a bound Φ on process step ratio, but they hold only after some unknown global stabilisation time GST. Before GST, the network is fully asynchronous. Model 2 (always synchronous with unknown bounds): Δ and Φ exist always but are unknown to the algorithm. Model 3 (fully synchronous): Δ and Φ are known. In all three models, deterministic consensus is solvable; the cost is in the round complexity. Model 1 is the closest to real datacentres: most of the time bounds hold, but during a partition or congestion spike, bounds violate; protocols must remain safe before GST and become live after GST. Paxos is precisely a Model-1 algorithm. Its safety holds in any execution (even fully asynchronous ones); its liveness is conditional on GST eventually arriving.
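The standard trick for coping with unknown bounds, in Model 2 and after GST in Model 1, is guess-and-grow: run each round against a timeout, and double the timeout whenever a round fails. A sketch of the arithmetic, with an invented constant standing in for the true bound the algorithm never sees:
# dls_guess_and_grow.py — only finitely many rounds can fail before the
# doubling timeout exceeds the true (unknown) delay bound Δ.
TRUE_DELTA_MS = 740   # hypothetical real bound; the algorithm never reads this

timeout_ms, failed_rounds = 1, 0
while timeout_ms < TRUE_DELTA_MS:   # a round "fails": replies missed the timeout
    timeout_ms *= 2
    failed_rounds += 1
print(f"{failed_rounds} failed rounds, then timeout={timeout_ms} ms covers Δ")
The punchline is that the failure count is finite for any finite Δ, which is all liveness-after-GST requires; the protocol pays for its ignorance in rounds, never in safety.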
Failure detectors — Chandra-Toueg's hierarchy and what makes ◇W special
Chandra and Toueg (1996) defined eight failure detector classes by varying two properties: completeness (do they eventually suspect every crashed process?) and accuracy (do they ever wrongly suspect a correct process?). The hierarchy: P (Perfect), S (Strong), W (Weak), Q (Quasi), and their eventually-versions ◇P, ◇S, ◇W, ◇Q. They proved that ◇W (Eventually Weak — eventually, all crashed processes are suspected, and at least one correct process is never suspected forever) is the weakest failure detector that solves consensus in an asynchronous system. This is a profound result: it tells you exactly what kind of "oracle knowledge" is sufficient. Real systems approximate ◇W via timeouts: if a heartbeat is late, suspect the node; if the heartbeat eventually returns, retract the suspicion. The phi-accrual detector (Hayashibara et al., 2004) is a probabilistic refinement that outputs a real-valued suspicion level instead of a boolean, giving the consumer (the consensus protocol or the leader-election layer) a tunable threshold. Akka cluster and Cassandra both ship phi-accrual.
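A minimal sketch of the phi computation (a toy assuming normally distributed heartbeat gaps over all observed samples; the 2004 paper fits the distribution over a sliding window): phi is the negative log of the probability that a live node would still be silent this long.
# phi_accrual.py — toy phi-accrual detector under a normal model of gaps.
import math, statistics

class PhiAccrual:
    def __init__(self):
        self.gaps, self.last = [], None

    def heartbeat(self, now):
        if self.last is not None:
            self.gaps.append(now - self.last)
        self.last = now

    def phi(self, now):
        mean = statistics.mean(self.gaps)
        std = statistics.stdev(self.gaps)
        # P(a live node's gap exceeds the current silence), via the normal CCDF:
        p_later = 0.5 * math.erfc((now - self.last - mean) / (std * math.sqrt(2)))
        return max(0.0, -math.log10(max(p_later, 1e-300)))

d = PhiAccrual()
for t in (0.0, 1.02, 1.98, 3.05, 3.97, 5.01, 6.00, 7.04):  # ~1 s heartbeats
    d.heartbeat(t)
print(d.phi(7.3))    # short silence: phi ≈ 0, the node looks alive
print(d.phi(10.0))   # ~3 s of silence: phi is enormous, far past the
                     # thresholds (8-12) at which consumers typically suspect
The real-valued output is the point: instead of a binary timeout, the consumer picks the phi threshold that trades false suspicions against detection latency.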
When FLP becomes blockchain — Nakamoto's escape via probabilistic finality
Nakamoto's Bitcoin paper (2008) sidesteps FLP through a fourth route: probabilistic finality. Bitcoin does not aim for the property "all nodes agree on the same value" instantaneously; it aims for "the probability of disagreement decreases exponentially with the number of confirmations". After 6 blocks, the chance that an attacker reverses a transaction is about 2 × 10⁻⁴ against an attacker holding 10% of the hash power (per the paper's own table), orders of magnitude smaller against weaker attackers, and it keeps shrinking exponentially with depth — strong enough for production. The trade-off is that finality is never absolutely certain, only probabilistically so. Modern Byzantine consensus protocols like Tendermint, HotStuff, and PBFT use the failure-detector and partial-synchrony escapes plus quorum machinery (3f+1 replicas to tolerate f Byzantine faults), giving immediate finality at the cost of synchrony assumptions. The blockchain world has revisited every FLP escape route under different names; reading FLP is the right preparation for understanding why Tendermint stalls during partitions while Bitcoin keeps mining.
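That exponential claim is directly computable from the formula in the Bitcoin paper's appendix: the attacker's progress is modelled as a Poisson process, and a gambler's-ruin bound gives the catch-up probability. A minimal sketch; with q = 0.1 it reproduces the paper's table:
# nakamoto_catchup.py — attacker-success probability from the Bitcoin paper's
# appendix: Poisson progress for the attacker, gambler's-ruin catch-up odds.
import math

def attacker_success(q, z):
    """Probability an attacker with hash-power fraction q ever overtakes a
    transaction buried under z confirmations."""
    p = 1.0 - q
    lam = z * (q / p)
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

for z in (1, 3, 6, 10):
    print(z, f"{attacker_success(0.10, z):.1e}")  # z=6 prints ~2.4e-04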
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# no pip installs needed — the demo uses only the standard library
# save flp_demo.py from the article body
python3 flp_demo.py
# Expected: fair scheduling decides (with or without the crash); the adversarial
# scheduler stalls both of its runs at zero steps by withholding every message.
# Then watch FLP in production: spin up a 3-node etcd cluster locally and cut
# one node off the peer network.
brew install etcd   # macOS; use your distribution's package manager on Linux
# (run three nodes in three terminals, each with a different name, client port,
#  and listen-peer-urls — e.g. peer ports 2380, 22380, 32380)
# In a fourth terminal (Linux; on macOS use pfctl instead of iptables):
sudo iptables -A OUTPUT -p tcp --dport 2380 -j DROP  # isolate the node peering on 2380
etcdctl --endpoints=http://127.0.0.1:23791 put /test "v"  # hangs: this node cannot reach quorum
sudo iptables -F  # heal the partition; the put completes within a heartbeat or two
Where this leads next
FLP is the foundation chapter of Part 8. Every consensus protocol that follows is a particular point in the trade-off space FLP defines. Read FLP first; then read each protocol asking "which escape route does this take, and what does it cost?".
- Paxos and why people struggle with it — chapter 51; partial-synchrony escape, derived from first principles, with the bivalency-style safety proof.
- Raft in detail — chapter 52; partial-synchrony plus randomised election timeouts; the production default for etcd, Consul, CockroachDB.
- EPaxos and Flexible Paxos — chapter 53; reduced quorum sizes for geo-distributed deployments.
- Multi-Raft and sharding consensus — chapter 54; running thousands of Raft groups in parallel.
- Byzantine consensus — PBFT, HotStuff — chapter 55; failure model strengthened to Byzantine, FLP still applies plus extra quorum constraints.
- When not to use consensus — chapter 56; the design discipline of pushing decisions to data types (CRDTs) when consensus is too expensive.
After Part 8, Part 9 covers leader election and leases — the operational machinery on top of consensus that lets the read path skip the consensus tax. FLP applies there too: leases without bounded clock skew can violate safety, which is why TrueTime and bounded-skew assumptions are not aesthetic choices.
References
- Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" — JACM 1985 — the original paper. Read this first; it is shorter and sharper than its reputation suggests.
- Lynch, Distributed Algorithms — Morgan Kaufmann 1996 — the cleanest exposition of the bivalency proof, with the case analysis spelled out.
- Dwork, Lynch, Stockmeyer, "Consensus in the Presence of Partial Synchrony" — JACM 1988 — the partial-synchrony framework; explains why Paxos and Raft work despite FLP.
- Chandra & Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems" — JACM 1996 — the failure-detector hierarchy; defines ◇W as the weakest detector that solves consensus.
- Ben-Or, "Another advantage of free choice: Completely asynchronous agreement protocols" — PODC 1983 — the randomised escape from FLP; foundational for blockchain consensus.
- Hayashibara et al., "The φ Accrual Failure Detector" — SRDS 2004 — the practical approximation of ◇W used by Akka and Cassandra.
- Wall: coordination is sometimes necessary — chapter 49; the immediately preceding wall, motivating why Part 8 exists.
- Crash, omission, timing, byzantine — Part 2's failure-model taxonomy; FLP's "one crash" failure mode situated in the broader space.