Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The bully algorithm

It is 02:17 on a Saturday and the on-call engineer at PaySetu, Aditi, is staring at a five-node coordinator cluster where node-3 just disappeared from the network. The other four nodes have started behaving oddly: node-5 is refusing writes, node-4 is sending election messages to no one, and node-1 and node-2 are both broadcasting "I am the leader". The runbook says "the cluster will self-heal in 12 seconds via the bully algorithm". Twelve seconds is a long time when ₹4 lakh in pending UPI debits is queued behind a leader election. Aditi wants to know what those twelve seconds are actually doing — and whether the algorithm is supposed to take that long or whether something is wrong. The bully algorithm, published by Hector Garcia-Molina in 1982, is the simplest leader-election protocol that works. It is also a near-perfect teaching device: every modern leader-election protocol — Raft's randomised timeouts, Paxos's ballot numbers, Chubby's lease-based election — exists because of a specific weakness the bully algorithm exposed.

The bully algorithm elects the live node with the highest id as leader by having any node that detects the leader is dead start an election: it sends ELECTION to every higher-id node, and if any reply, the originator yields; if none reply, it broadcasts COORDINATOR and becomes leader. The protocol is correct under fail-stop with reliable messaging, terminates in O(n^2) messages worst-case, and converges in 3–5 timeout windows. It loses correctness under network partitions (split-brain), under asymmetric failure (a non-leader bullies an alive leader), and when nodes restart with stale ids. It is the canonical "first leader election you learn" precisely because every modern protocol is a fix for one of those failure modes.

Where the bully fits — and what it is competing with

A distributed system needs a leader for many things: serialising writes, owning a Kafka partition, holding a database failover lock, running the cron job exactly once. The leader-election problem is: given a set of nodes that can crash and a network that can drop messages, agree on which one is the leader within bounded time. It is the easiest distributed-consensus problem to state and one of the harder ones to solve correctly.

Garcia-Molina's 1982 paper "Elections in a Distributed Computing System" introduced two algorithms: the bully and the ring (the ring is the next chapter). Both assume the same base model: nodes have unique, totally-ordered ids; messages are reliable and eventually delivered; nodes can crash but do not lie (fail-stop); after a crash a node may recover and rejoin. Under this model the bully is the simpler of the two, and the one that maps closely to how production systems actually fail over.

The mechanism in three sentences. Every node knows the id of every other node and has a reasonable failure detector — a heartbeat or a timed RPC. When any node i notices the current coordinator has gone silent, i starts an election: it sends ELECTION to every node with a higher id. If any higher-id node answers OK, i steps down and waits — the higher-id node will run its own election. If no higher-id node answers within a timeout, i concludes it is the highest-id node alive, broadcasts COORDINATOR(i) to all lower-id nodes, and becomes the new leader. The "bully" name captures the one rule doing all the work: any higher-id node overrules any lower-id node's claim — the highest id alive always wins, no matter who started the election.

[Figure: The bully algorithm — message flow when node-3 starts an election after node-5 fails. A message-sequence chart with lifelines for nodes 1–5; node-5 is crossed out (crashed). Node-3 sends ELECTION to nodes 4 and 5; node-4 replies OK and starts its own election toward node-5, which never answers; after the T_elec timeout, node-4 broadcasts COORDINATOR(4). Total: 2 ELECTION + 1 OK + 3 COORDINATOR + 1 timeout window ≈ 2× T_elec.]
Illustrative — the bully run when node-3 detects the leader (node-5) is dead. Node-3 yields to node-4 because it received an OK; node-4 then runs its own election and, finding only the dead node-5 above it, declares itself coordinator. The total wall-clock time is roughly two `T_elec` timeout windows.

The "two timeout windows" detail in the caption is where Aditi's twelve seconds come from. With T_elec = 5s and a one-step cascade (node-3 yields to node-4), the algorithm needs one timeout for node-3 to wait for replies, plus one timeout for node-4 to wait for node-5's reply, plus the network RTTs for OK and COORDINATOR broadcasts. Five plus five plus a couple of seconds of broadcasts = the twelve seconds the runbook quotes. Why the timeout dominates: the algorithm only sends a small number of messages (O(n) in the common case, O(n^2) only when the highest-id node bullies a chain of cascading lower-id elections), but each timeout is on the order of seconds because it must be larger than the cluster's worst-case RTT plus failure-detector hysteresis. If you set T_elec to 100ms to make elections fast, you get false-positive failures on every GC pause; if you set it to 30s to suppress false positives, you accept 60-second elections. The bully algorithm doesn't get to dodge this trade-off — every leader-election protocol pays it, but the bully pays it sequentially per cascading step, where Raft pays it once.

The five message types and the state machine

Every node runs the same state machine, parameterised by its own id. There are exactly five message types: ELECTION, OK, and COORDINATOR, plus the application-level heartbeat that triggers detection and an optional IM_BACK for recovery. A node is in one of four states at any time: follower (knows the current coordinator), candidate (running an election it started), electing (it received an OK and is waiting for a higher-id node to declare), or leader (it broadcast COORDINATOR and won).

The candidate state is where most implementation bugs live. A node in candidate state is waiting for replies from n - rank higher-id peers, where rank is its position in the id ordering. It must wait exactly T_elec and not longer — a node that waits indefinitely for a reply that never comes will never declare itself coordinator, and the cluster stays leaderless. It must also never declare itself coordinator once any OK has arrived, because a node that replied OK is by contract running its own election with a higher id; declaring while that election runs would create dual leaders.

The electing state has its own subtlety. After yielding (sending OK to a lower-id ELECTION), a node has two outcomes: either a COORDINATOR(j) arrives from the higher-id node — at which point it transitions to follower with j as leader — or no COORDINATOR arrives within T_coord (a longer timeout, typically 2× T_elec), at which point the node concludes "the higher-id nodes I trusted have all also crashed", and starts a new election from scratch. Without T_coord, a half-failed cluster (the highest-id nodes crashed during their own election) becomes permanently leaderless. Why two distinct timeouts are necessary: a single timeout would force the algorithm to either declare too early (declaring while higher-id nodes are still running their election → dual leader) or wait too long (no recovery from a higher-id node that crashes mid-election). The bully needs T_elec for the I sent ELECTION, am I getting OK? phase, and T_coord for the I received OK, am I getting COORDINATOR? phase. Garcia-Molina's paper sets T_coord = T_elec + δ where δ accounts for the higher-id node's election round; in practice operators set T_coord = 2 × T_elec and call it a day.

[Figure: The bully node state machine — follower, candidate, electing, leader. Follower → candidate when the leader heartbeat times out; candidate → leader when T_elec fires with no OK received; candidate → electing when an OK arrives; electing → follower when COORDINATOR(j) arrives; electing → candidate when T_coord fires. Some implementations collapse electing into candidate via a single timer.]
Illustrative — the bully node's state machine. The transition from candidate directly to leader is the optimistic path: I asked, no one with higher id answered, so I am the highest id alive. The electing state with its T_coord timer is the safety net for cascading-failure cases.
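Written out as code, the diagram is a small lookup table. A minimal sketch — the state and event names here are illustrative, not from the 1982 paper:

# The four states and five legal transitions, keyed by (state, event).
TRANSITIONS = {
    ("follower",  "leader_heartbeat_timeout"): "candidate",
    ("candidate", "t_elec_no_ok"):             "leader",     # self-promote
    ("candidate", "ok_received"):              "electing",   # yield to higher id
    ("electing",  "coordinator_received"):     "follower",
    ("electing",  "t_coord_timeout"):          "candidate",  # retry from scratch
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)  # unknown events are ignored

assert step("candidate", "ok_received") == "electing"
assert step("electing", "t_coord_timeout") == "candidate"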

A runnable bully implementation

The Python below is a tiny but faithful bully simulator. It models five nodes, supports crashes, supports a mid-election crash of the would-be leader, and prints a timeline. Run it; vary which node crashes and observe whether the algorithm cascades.

# bully.py — a faithful Garcia-Molina (1982) bully election simulator
import simpy

T_ELEC = 1.0   # seconds — wait for OK after sending ELECTION
T_COORD = 2.5  # seconds — wait for COORDINATOR after sending OK

class Node:
    def __init__(self, env, nid, peers, alive=True):
        self.env, self.id, self.peers = env, nid, peers
        self.alive = alive
        self.coord = None      # id of the coordinator this node believes in
        self.electing = False
        self.got_ok = False

    def crash(self):
        self.alive = False
        print(f"  t={self.env.now:.2f}  node-{self.id} CRASHED")

    def detect(self):
        # triggered when a heartbeat to the coordinator fails
        if not self.alive or self.electing:
            return
        print(f"  t={self.env.now:.2f}  node-{self.id} detects leader gone, starts election")
        self.env.process(self.run_election())

    def run_election(self):
        self.electing, self.got_ok = True, False
        self.coord = None  # forget the stale leader; only a COORDINATOR refills it
        higher = [p for p in self.peers if p.id > self.id]
        for p in higher:
            if p.alive:    # dead peers simply never reply
                self.env.process(p.recv_election(self))
        yield self.env.timeout(T_ELEC)
        if not self.alive:
            return         # crashed mid-election — drop out silently
        if self.got_ok:
            # someone higher is alive — wait for their COORDINATOR broadcast
            yield self.env.timeout(T_COORD)
            if not self.alive:
                return
            if self.coord is None:
                print(f"  t={self.env.now:.2f}  node-{self.id} got OK but no COORDINATOR — restarting")
                self.electing = False
                yield self.env.process(self.run_election())
            else:
                print(f"  t={self.env.now:.2f}  node-{self.id} accepts COORDINATOR({self.coord})")
                self.electing = False
        else:
            print(f"  t={self.env.now:.2f}  node-{self.id} no OK -> COORDINATOR({self.id})")
            self.coord = self.id
            for p in self.peers:   # instantaneous broadcast — a simplification
                if p.alive and p.id != self.id:
                    p.coord = self.id
            self.electing = False

    def recv_election(self, sender):
        if not self.alive:
            return
        yield self.env.timeout(0.05)  # network RTT for the OK reply
        sender.got_ok = True
        print(f"  t={self.env.now:.2f}  node-{self.id} OKs node-{sender.id}")
        if not self.electing:
            self.env.process(self.run_election())

def after(env, delay, fn):
    # helper process: run fn() after `delay` simulated seconds
    yield env.timeout(delay)
    fn()

env = simpy.Environment()
nodes = [Node(env, i, []) for i in range(1, 6)]
for n in nodes:
    n.peers = nodes
    n.coord = 5                                   # node-5 starts as leader
env.process(after(env, 0.5, nodes[4].crash))      # crash node-5 at t=0.5
env.process(after(env, 0.7, nodes[2].detect))     # node-3 notices at t=0.7
env.run(until=8)
print(f"final coords: {[n.coord for n in nodes if n.alive]}")
# Sample run
  t=0.50  node-5 CRASHED
  t=0.70  node-3 detects leader gone, starts election
  t=0.75  node-4 OKs node-3
  t=1.75  node-4 no OK -> COORDINATOR(4)
  t=4.20  node-3 accepts COORDINATOR(4)
final coords: [4, 4, 4, 4]

The load-bearing lines: `higher = [p for p in self.peers if p.id > self.id]` enumerates the bullies that will overrule the candidate. The `yield self.env.timeout(T_COORD)` under `if self.got_ok:` is the second timeout — without it, a candidate that received an OK from a node that subsequently crashed would wait forever. `yield self.env.timeout(0.05)  # network RTT` is honest: the bully is dominated by RTT plus timeout, not by computation. The broadcast loop `for p in self.peers: ... p.coord = self.id` points every alive lower-id node at the new coordinator.

The simulation deliberately includes the recovery branch (the "got OK but no COORDINATOR — restarting" print) even though the sample run never takes it: node-3's T_coord expires well after node-4's broadcast has arrived, so it simply accepts. The branch fires when the COORDINATOR broadcast is delayed past T_coord (network reordering, processing delay) or when the higher-id node crashes mid-election. The candidate then concludes "the chain crashed" and starts a new election; if node-4 is in fact alive, the retry finds it and converges on it, otherwise the candidate promotes itself. This is the "self-healing" behaviour the runbook references — but each retry adds another T_elec to convergence time. In a 5-second-timeout cluster, you can hit 15 seconds of leaderlessness in adversarial network conditions even though the algorithm is technically correct.

Where the bully breaks — and what every modern protocol fixes

The bully is correct under fail-stop with reliable messaging. Production is neither. Three failure modes break it:

Network partition (split-brain). Suppose nodes 1, 2, 3 are in DC-A and nodes 4, 5 are in DC-B. A WAN link drops. Node 5 (the existing leader) is in DC-B, still alive, still serving DC-B clients. Node 3 in DC-A, after T_elec, declares "5 is dead" and runs an election. Node 3 wins (4 and 5 are unreachable from DC-A). Now both clusters have a leader: node-3 in DC-A, node-5 in DC-B, both accepting writes. When the partition heals, the cluster has divergent state and no protocol-level mechanism to reconcile. The bully algorithm offers nothing here — it has no concept of a quorum, no fencing token, no term number to invalidate the older leader. Raft solves this with majority quorum + term numbers (a leader without majority cannot commit; a higher term invalidates a lower one). Chubby solves it with a single-master Paxos-replicated cell + leases (a partition isolates a minority side, which loses its lease).
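A minimal sketch of that partition scenario — no real networking, just the bully's decision rule applied independently on each side, plus the Raft-style majority test that strips the minority side's leader of the power to commit:

# Hypothetical sketch: the DC-A / DC-B split from the paragraph above.
def bully_winner(reachable_ids):
    return max(reachable_ids)            # highest reachable id always wins

dc_a, dc_b, cluster = {1, 2, 3}, {4, 5}, 5

leader_a = bully_winner(dc_a)            # 3 — DC-A cannot see nodes 4 or 5
leader_b = bully_winner(dc_b)            # 5 — still alive, never deposed
print(leader_a, leader_b)                # 3 5 — split-brain, two leaders

# The Raft-style fix: electing (and committing) requires a strict majority.
def has_quorum(side, cluster_size=cluster):
    return len(side) > cluster_size // 2

print(has_quorum(dc_a), has_quorum(dc_b))  # True False — only DC-A may lead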

Asymmetric failure. Node-3 can talk to nodes 1, 2 but not 4, 5. From node-3's view, nodes 4 and 5 are dead and it should be the leader. From node-5's view, it is alive, reachable from 4, and remains leader. Node-3 broadcasts COORDINATOR(3); node-1 and node-2 accept. The cluster now has split-brain by misperception, not by partition. Raft's PreVote phase exists exactly to prevent this: a candidate must first ask "would you vote for me if I started an election?" — without committing to vote — and abandons the election if the answer is no. The bully has no PreVote.
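The PreVote idea in miniature — a hypothetical function, not Raft's actual RPC format: node-3 first asks the peers it can reach whether they too have lost the leader, and abandons the attempt when they have not:

# Hypothetical PreVote sketch: peers that still hear the leader answer "no".
def prevote_ok(peers_hear_leader, cluster_size):
    # count the candidate plus every peer that ALSO lost the leader
    grants = 1 + sum(1 for hears in peers_hear_leader if not hears)
    return grants > cluster_size // 2    # need a would-be majority to proceed

# node-3's view: it reaches only nodes 1 and 2, and both still hear node-5
print(prevote_ok(peers_hear_leader=[True, True], cluster_size=5))  # False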

Restart with stale id. Node-5 crashes mid-election; node-4 declares; node-5 restarts five seconds later. Under the original 1982 paper, node-5 on restart broadcasts IM_BACK, all lower-id nodes accept it as the new coordinator (because it has the highest id), and node-4 is unceremoniously demoted. This is correct by the algorithm but wrong in practice — it churns the leadership unnecessarily and may even destabilise the cluster if node-5 is still flapping. Modern variants suppress IM_BACK and require node-5 to start a normal election; some require node-5 to wait for the current leader's lease to expire before challenging. Why this matters operationally: the bully algorithm's "highest-id always wins" rule is a stable equilibrium when ids are stable, but unstable when nodes restart. A flapping node-5 (crashes every few minutes) under the bully algorithm causes leadership to flap with it. Production systems work around this by either using fencing tokens (Chubby's epoch number, Raft's term) so that the current leader's writes remain valid even when a higher-id node returns, or by adding artificial restart-cooldown periods. Both are post-bully patches.
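Fencing in miniature — a sketch with hypothetical names, not Chubby's or Raft's API: every write carries the leader's epoch, and downstream state ratchets the highest epoch it has seen, so a restarted node-5 replaying its old epoch is rejected rather than obeyed:

# Hypothetical fencing-token sketch.
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            raise PermissionError(f"fenced: epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch       # epochs only ratchet upward
        self.data[key] = value

store = FencedStore()
store.write(epoch=7, key="order-1", value="filled")    # node-4, current leader
try:
    store.write(epoch=6, key="order-1", value="void")  # node-5, stale epoch
except PermissionError as e:
    print(e)                                           # fenced: epoch 6 < 7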

A war story — KapitalKite's bully-on-Postgres incident

KapitalKite ran a five-node order-router cluster with a homegrown bully election on top of a Postgres heartbeat table. Each node UPDATEd a row every 2 seconds with last_seen=NOW(), and any node that saw the leader's row stale by >5 seconds started an election. The id ordering was by hostname (router-1 through router-5); the highest-id alive was the leader. For two years the system ran without incident.

In March 2025, a network maintenance window introduced a brief asymmetric partition: router-5 could write to Postgres but its UDP heartbeats to the other routers were dropping. The other four routers timed out, started an election, and router-4 won. router-5, still able to UPDATE its Postgres row, never noticed it had been deposed — and continued to route incoming orders that landed on it via the load balancer's stale routing table. For 14 minutes the cluster had two leaders. ₹47 lakh of orders were duplicated; the bookkeeping reconciliation took six engineers four days.

The post-mortem identified three fixes, in increasing order of effort. (1) Add an epoch number to every routed order — when router-5 came back online and saw orders with an epoch higher than its own, it would self-demote. This was deployed in two weeks and prevented the duplicate-order class of bug. (2) Replace the Postgres heartbeat with an etcd lease — a leader that loses its lease is automatically deposed, and only one node can hold the lease at a time. This was a four-month migration. (3) Add fencing tokens to every downstream call — every call from the order router to the exchange gateway carries the leader's epoch, and the exchange gateway rejects any call with an epoch lower than the highest it has seen. This was a six-month effort that touched 23 services.

Two years later, KapitalKite's order-router has had zero leadership-misbehaviour incidents. The bully algorithm is gone; what replaced it is bully-shaped at the high level — there is still one leader, the leader is still chosen by a deterministic rule — but every weakness the bully exposes has a specific protocol-level fix. The bully was the right teacher. It was the wrong protocol to keep in production.

Common confusions

  • "The bully algorithm is the same as Raft's leader election." Raft uses randomised timeouts to break symmetry, requires majority quorum to elect (no minority leader is possible), and uses term numbers as fencing tokens. The bully uses deterministic id ordering, no quorum, no term numbers. Raft is bully-with-every-known-fix-applied; calling them the same erases the fixes.
  • "O(n^2) messages means the bully is too expensive at scale." It is O(n^2) worst-case (every node simultaneously starts an election after a leader fails) but O(n) average-case (one node detects, elects, broadcasts). The cost is the timeouts, not the messages. At 100 nodes with 5-second timeouts the algorithm sends maybe 500 messages but takes 10 seconds — operators should worry about the seconds, not the bytes.
  • "The highest-id node is the most important node, so it should be the leader." The bully chooses by id because some total order must be agreed on to break ties; id is the simplest. There is no "importance" semantics — id 5 wins over id 4 only because 5 > 4. Production systems often choose ids to encode hardware (newer → higher id) but the algorithm is indifferent.
  • "OK means the higher-id node will become leader." OK means I am alive and I have a higher id, so I will run my own election. The higher-id node may itself yield (to an even-higher-id node), or crash mid-election, or be partitioned away. The candidate must wait for COORDINATOR, not just OK, to know who the leader is.
  • "Adding more nodes makes the bully more available." It makes the worst-case election slower (more chances for cascading) and adds nothing to fault tolerance — the bully has no quorum, so a 100-node cluster with 99 dead nodes still has a leader (the lone survivor). Compare Raft's 2f+1 model where you need a majority alive — different availability semantics, different design space.
  • "The bully algorithm prevents split-brain." It does not. The bully assumes reliable messaging; under partitions it cheerfully creates two leaders, one per partition. Preventing split-brain requires a quorum mechanism, which the bully lacks. Treat the bully as a leader-election protocol given a non-partitioned network — pair it with a partition detector if you intend to use it in production.

Going deeper

Garcia-Molina's 1982 paper — what it actually proved

The paper's contribution is not the algorithm but the correctness proof under a precise failure model. Garcia-Molina specifies fail-stop nodes, reliable in-order messaging, bounded message-delivery time, and unique totally-ordered ids. Under that model he proves the bully (a) terminates within bounded time, (b) elects exactly one leader if any node is alive, and (c) all live nodes agree on the leader's identity. The paper also presents the ring algorithm as a comparison — same model, O(n) messages worst case, but more tolerant of dropped messages. The reason the bully is taught more than the ring is mostly pedagogical: the bully's "highest id wins" rule is easier to reason about than the ring's "pass the token until it returns to you" rule. Both are still valid choices for fail-stop clusters in 2026; neither is appropriate for the partition-tolerant world Raft and Paxos target.

How Raft and Paxos descend from the bully

Raft's leader election is the bully algorithm with four specific patches: (1) ids are replaced by randomised election timeouts — instead of "highest id wins", the first node to time out gets a head start, breaking symmetry probabilistically; (2) elections require a majority vote, eliminating split-brain by quorum; (3) every election advances the term number, providing a fencing token that invalidates older leaders' writes; (4) the PreVote phase prevents nodes that recently rejoined from disrupting the cluster with bogus elections. Each patch addresses a specific bully failure mode. The mental model "Raft = bully + quorum + term + randomisation + PreVote" is approximately correct and useful for teaching. Paxos's leader election (in the original Paxos and Multi-Paxos) uses a similar idea — proposers race to advance their ballot number, and only the highest accepted ballot wins — with the same quorum-and-fencing-token guarantees.
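Patch (1) is small enough to show inline — a sketch of randomised timeouts breaking symmetry, with hypothetical values in Raft's usual 150–300 ms band:

# One node almost always times out first and runs the only election.
import random

timeouts = {nid: random.uniform(0.150, 0.300) for nid in range(1, 6)}
first = min(timeouts, key=timeouts.get)
print(f"node-{first} times out first and starts the only election")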

When the bully is still the right choice

In 2026, the bully algorithm is still a defensible choice in three settings: (1) embedded clusters where every node is on the same physical board (no partitions to worry about, fail-stop is realistic, message reliability is ~100%); (2) test fixtures and pedagogy (it is the cleanest teaching protocol); (3) leader-election within a single failure domain where a higher-level mechanism handles cross-domain partitions (e.g., bully election within a Kubernetes pod's processes, with the pod itself being the failure unit). For anything with a network between nodes, prefer Raft, etcd-based leases, or Chubby-style locks.

Message complexity, derived

The bully's worst case occurs when every node simultaneously starts an election, which can happen if all nodes detect the leader's death within a tight window (typical when heartbeats arrive in synchronised batches). Node i sends ELECTION to every higher-id node — n - i messages from node i. Summing over all nodes that started elections gives Σ (n - i) for i=1..n-1 = n(n-1)/2 ELECTION messages. Each higher-id node replies with OK to every sender lower than itself — another n(n-1)/2 OK messages. Finally the highest-id node alive sends COORDINATOR to all n-1 others. Total: n^2 - 1 messages, hence O(n^2). Why the average case is much better: in production, the failure detector is loosely synchronised — one node notices first and starts the election before the others get a chance. In that single-starter case, node i sends n - i ELECTIONs, receives at most n - i OKs, and yields. Only one cascade runs. Total messages: O(n). The O(n^2) term shows up only under the worst-case "thundering herd" pattern, which is itself an indictment of the failure detector's tuning — staggered timeouts (Raft's randomisation trick) suppress the herd entirely. The bully algorithm doesn't randomise, so its worst case is sharper than Raft's, but its average case is comparable.
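The derivation is easy to check mechanically — a short sketch that counts the three message types in the simultaneous-start worst case:

# n(n-1)/2 ELECTIONs + n(n-1)/2 OKs + (n-1) COORDINATORs = n^2 - 1.
def worst_case_messages(n):
    elections = sum(n - i for i in range(1, n))   # node i -> the n - i above it
    oks = sum(j - 1 for j in range(2, n + 1))     # node j -> the j - 1 below it
    coordinator = n - 1                           # the survivor's broadcast
    return elections + oks + coordinator

for n in (5, 10, 100):
    assert worst_case_messages(n) == n * n - 1
print([worst_case_messages(n) for n in (5, 10, 100)])  # [24, 99, 9999]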

What the algorithm cannot do — even in theory

The bully assumes a synchronous-enough network that timeout T_elec is a reliable failure indicator. The FLP impossibility result (Fischer, Lynch, Paterson, 1985) tells us that no deterministic algorithm can solve consensus in a fully asynchronous network with even one faulty process. The bully sidesteps FLP by assuming bounded message delivery — i.e., a partial-synchrony model. Under truly adversarial timing (a real network with GC pauses, queue buildup, packet loss), the bully can produce wrong elections (false-positive failure detection → election while leader is alive) or fail to terminate (every candidate's higher-id peers crash mid-election repeatedly). These are not bugs in Garcia-Molina's specification; they are the cost of choosing a deterministic protocol in an asynchronous world. Modern protocols (Raft, Paxos) make the same trade-off; they just package the assumptions more explicitly and pair the election with quorum-and-term mechanisms that limit the damage of a wrong election rather than preventing the wrongness outright.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install simpy
python3 bully.py
# Vary T_ELEC, T_COORD, and which node you crash to observe convergence behaviour.
# The default run shows the one-step cascade: node-3 yields to node-4, node-4's
# election times out because node-5 is dead, node-4 declares — about 2× T_elec
# wall-clock. Crash the leader AND the second-highest id together and node-3
# self-promotes after a single T_elec; crash node-4 mid-election (e.g. at t=1.0)
# to exercise the T_coord recovery branch.

Where this leads next

The next chapter, the ring algorithm, covers Garcia-Molina's other 1982 proposal — same model, different topology, different message complexity. The ring algorithm trades the bully's O(n^2) worst-case for O(n) always, at the cost of higher latency in the common case. After the ring, comparing election protocols (bully vs ring vs Raft) puts the three on the same axis and shows where each wins.

The deeper lesson the bully algorithm teaches is operational: every modern leader-election protocol is a fix for one of its specific failure modes. When you read Raft's term-number argument, you are reading the bully's split-brain fix. When you read Chubby's lease mechanism, you are reading the bully's stale-restart fix. When you read PreVote, you are reading the bully's asymmetric-failure fix. Knowing the bully makes every other protocol read as a series of patches, each with a specific motivation. That is why every distributed-systems syllabus starts here.

References

  • Garcia-Molina, H. — "Elections in a Distributed Computing System" (IEEE Transactions on Computers, 1982) — the original bully + ring paper. The proof in §4 is the canonical correctness argument; §5 compares message complexity.
  • Ongaro, D., Ousterhout, J. — "In Search of an Understandable Consensus Algorithm (Raft)" (USENIX ATC 2014) — the modern descendant; §5 (Safety) shows how term numbers and majority quorum address bully's weaknesses.
  • Lamport, L. — "The Part-Time Parliament" (TOCS 1998) — Paxos; the ballot-number mechanism is the ancestor of Raft's term.
  • Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — the lease-based alternative to bully-style election in production.
  • Kleppmann, M. — Designing Data-Intensive Applications (O'Reilly 2017), Chapter 8 — practitioner-level treatment of leader election and split-brain.
  • Howard, H. — "ARC: Analysis of Raft Consensus" (Cambridge Tech Report 2014) — formal analysis of Raft's election; useful for understanding what PreVote actually fixes.
  • Chubby and the lock-service pattern — the previous chapter; the production alternative most teams reach for instead of the bully.
  • Lease mechanics — the safety inequality that lease-based election relies on; the bully has no equivalent.