Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Leader election without consensus
It is 03:40 on a Wednesday and Jishant is debugging why MealRush's dispatch service has two nodes both convinced they are the leader. The runbook says "we use Raft", but grep -r raft in the dispatch repo turns up nothing. What it does turn up is a 40-line file called leader_lock.py that calls etcd_client.lease(ttl=10) and etcd_client.txn(...).if(KeyDoesNotExist).then(Put(...)). Jishant has just discovered the dirty secret of leader election in production: almost no one runs Paxos or Raft inside their own service. They run a lease against a thing that already runs consensus for them — etcd, ZooKeeper, Consul, Redis, DynamoDB, even Postgres — and call that leader election. This chapter is about that pattern: how to elect a leader without writing a single line of consensus code, why it works, when it breaks, and the one trick (fencing tokens) that turns "election was wrong" from a corruption-class bug into a recoverable one.
You don't need to run consensus to elect a leader. You need to run consensus somewhere — so push it into a coordination service (etcd, ZooKeeper, Consul) and reduce your election to a single conditional write: "set leader=me if no leader exists, with a TTL." The coordination service handles the hard parts (quorum, term numbers, durability); your code just races for the lock and renews it. The whole pattern collapses without fencing tokens — a monotonically-increasing number every leader carries, which downstream services use to reject writes from a deposed leader. Lease + fencing is what 90% of "leader election" in production actually is.
The pattern: push consensus into the coordinator, reduce election to a CAS
Look at any production service that says "we have a leader". Open the leader-election code. Nine times out of ten it does not implement Paxos or Raft. It does this:
- Connect to a coordination service that already runs Paxos or Raft (etcd, ZooKeeper, Consul, sometimes Redis Sentinel, sometimes Postgres with SELECT FOR UPDATE).
- Race to write a key like /services/dispatch/leader with a time-to-live (TTL) — say 10 seconds — using a compare-and-swap: "create this key only if it does not exist."
- The winner is the leader. Periodically (every 3 seconds) it renews the TTL.
- Every other node sits in a watch loop, waiting for the key to disappear. When it does, they all race again.
That is the entire algorithm. The reason it works is not the four lines of client code — it is the consensus protocol running inside etcd that makes step 2 atomic. The election is essentially if not leader_exists: leader = me. The hard part — agreeing across a partition on which node won the race — is done once, in etcd's Raft log, and then every service that uses etcd inherits that guarantee. Why this is a strict win over running Raft yourself: you would otherwise need a Raft quorum per service (5 dispatch nodes, 5 inventory nodes, 5 pricing nodes — 15 total nodes just for quorum, before you do any actual work). The coordination-service pattern lets you run one 5-node etcd cluster and amortise it across hundreds of services. The cost is a network hop on every lease renewal; the saving is two orders of magnitude in operational complexity. This is why etcd-as-a-service is what every Kubernetes cluster uses for leader election under the hood.
The pattern has three moving parts: the lease key (whoever holds it is leader), the TTL (the leader renews periodically; if they die, the key auto-expires and a new election begins), and the watcher (followers wait for the key to vanish without polling). All three are primitives the coordination service already provides. You write the orchestration; etcd/ZooKeeper/Consul do the heavy lifting.
The four-line "election" — and why every line matters
Here is the entire algorithm in pseudocode (real Python comes in the next section):
acquire():
response = etcd.txn(
compare = [KeyDoesNotExist("/svc/leader")],
success = [Put("/svc/leader", my_id, lease=10s)],
failure = []
)
if response.succeeded:
return LEADER
else:
watch("/svc/leader") and retry on delete
That is it. There is no Paxos in the application. No quorum to track. No term numbers to manage. The transaction is atomic at the etcd layer — either the comparison succeeds and the put commits, or the comparison fails and nothing happens. Why the KeyDoesNotExist predicate is load-bearing: this is a compare-and-swap (CAS), the same primitive an atomic CPU instruction provides. Without it, two clients reading "no leader exists" simultaneously would both happily Put and overwrite each other — last-writer-wins, and now both think they won. With CAS, exactly one of the two Puts succeeds (the first to reach the etcd Raft leader); the other gets precondition failed and falls back to watch mode. The CAS is the election; everything else is bookkeeping.
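To feel why the CAS is load-bearing, here is a minimal, self-contained simulation of the broken read-then-write version — a plain dict stands in for the etcd keyspace, and a deliberate sleep widens the race window (all names are illustrative):
# race_demo.py — check-then-act without CAS double-elects
import threading, time
store = {}        # stands in for the etcd keyspace
winners = []      # every node that believes it won
def broken_acquire(my_id):
    if "leader" not in store:    # step 1: read "no leader exists"
        time.sleep(0.01)         # both nodes are parked here together
        store["leader"] = my_id  # step 2: blind put — last writer wins
        winners.append(my_id)    # ...so BOTH nodes reach this line
threads = [threading.Thread(target=broken_acquire, args=(n,))
           for n in ("node-A", "node-B")]
for t in threads: t.start()
for t in threads: t.join()
print(winners)    # usually ['node-A', 'node-B'] — two "leaders", split brain
With etcd's txn, the read and the write are a single atomic unit evaluated inside etcd's Raft leader, so the interleaving above cannot happen.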
The TTL is the second pillar. A leader does not hold the lease forever — it holds it for 10 seconds and must actively renew it. If the leader's process crashes, its network partitions, or it goes into a 30-second GC pause, it stops renewing. After 10 seconds etcd auto-deletes the key, and the watching followers fire their election retry. The TTL is what makes the system self-healing without external intervention — no operator has to manually say "node-1 is dead, promote node-2". The clock does it.
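A minimal sketch of the expiry mechanic, runnable against the local etcd from the reproduction section below (the key name is illustrative):
# ttl_demo.py — a lease that is never renewed auto-deletes its key
import etcd3, time
client = etcd3.client(host="127.0.0.1", port=2379)
lease = client.lease(3)                        # 3-second TTL, no renewer
client.put("/demo/leader", "node-X", lease)
print(client.get("/demo/leader")[0])           # b'node-X' — key exists
time.sleep(5)                                  # let the lease lapse
print(client.get("/demo/leader")[0])           # None — etcd deleted it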
The watcher is the third pillar. Without it, every follower would have to poll the key every second to check if the leader died. With 1000 follower services and 1-second polling, that is 1000 reads per second hitting etcd just to find out nothing has changed. Watch is push-based: etcd holds open a connection per watcher and sends an event only when the key changes. A 1000-watcher cluster generates one burst of events per leader change — one per watcher — rather than 1000 reads per second, always. Why this matters at scale: ZooKeeper's original 2008 design choice to make watches one-shot (you must re-arm after every event) was specifically to bound server-side state. etcd extended watches to be persistent but pays for it in connection state per watcher. Either way, watch + TTL is dramatically cheaper than poll + lock — orders of magnitude in network and CPU at production scale.
A runnable lease-based leader election
The Python below uses the etcd3 client to elect a leader, hold the lease, and demonstrate what happens when the leader's renewer is "killed" — the lease expires, a follower detects the deletion, and acquires the next term. It also shows the fencing token mechanic: every successful acquisition gets a monotonically-increasing revision number from etcd, which the application uses to invalidate writes from a stale leader.
# leader_lease.py — leader election via etcd lease + fencing token
import etcd3, time, threading, sys
LEADER_KEY = "/svc/dispatch/leader"
TTL_SEC = 5
def try_become_leader(client, my_id):
"""Race for the lease. Return (is_leader, lease, fencing_token)."""
lease = client.lease(TTL_SEC)
# transaction: put leader=my_id only if key does not exist
success, _ = client.transaction(
compare=[client.transactions.create(LEADER_KEY) == 0],
success=[client.transactions.put(LEADER_KEY, my_id, lease)],
failure=[],
)
if success:
# the etcd revision number is our fencing token
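        # (a hardened version would take the revision from the txn response
        #  header rather than a follow-up get, closing the tiny window in
        #  which the key could change between the two calls)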
meta = client.get(LEADER_KEY, serializable=False)
rev = meta[1].mod_revision
print(f" [{my_id}] BECAME LEADER, fencing_token={rev}")
return True, lease, rev
return False, None, None
def renewer(lease, my_id, stop_flag):
while not stop_flag.is_set():
try:
lease.refresh()
print(f" [{my_id}] renewed lease at t={time.time():.1f}")
except Exception as e:
print(f" [{my_id}] renew FAILED: {e}")
return
time.sleep(TTL_SEC / 2)
def watch_for_election(client, my_id):
print(f" [{my_id}] following, watching {LEADER_KEY}")
events_iter, cancel = client.watch(LEADER_KEY)
for event in events_iter:
if isinstance(event, etcd3.events.DeleteEvent):
print(f" [{my_id}] leader key deleted at t={time.time():.1f} — racing")
cancel(); return
def run_node(my_id, simulate_crash_at=None):
client = etcd3.client(host='127.0.0.1', port=2379)
while True:
is_leader, lease, fencing = try_become_leader(client, my_id)
if is_leader:
stop = threading.Event()
t = threading.Thread(target=renewer, args=(lease, my_id, stop))
t.start()
if simulate_crash_at:
time.sleep(simulate_crash_at)
print(f" [{my_id}] *** simulating crash — stopping renewals ***")
stop.set(); t.join()
return # process dies; lease will expire after TTL
t.join()
else:
watch_for_election(client, my_id)
if __name__ == "__main__":
my_id = sys.argv[1]
crash = float(sys.argv[2]) if len(sys.argv) > 2 else None
run_node(my_id, simulate_crash_at=crash)
Sample run from three terminals (node-A, node-B, node-C). Terminal A is told to crash at t=8 to simulate failover.
# Terminal A: python3 leader_lease.py node-A 8
[node-A] BECAME LEADER, fencing_token=42
[node-A] renewed lease at t=1700000000.5
[node-A] renewed lease at t=1700000003.0
[node-A] renewed lease at t=1700000005.5
[node-A] *** simulating crash — stopping renewals ***
# Terminal B: python3 leader_lease.py node-B
[node-B] following, watching /svc/dispatch/leader
# ...silence for ~5 seconds while node-A's lease expires...
[node-B] leader key deleted at t=1700000013.2 — racing
[node-B] BECAME LEADER, fencing_token=47
[node-B] renewed lease at t=1700000013.7
The load-bearing lines: compare=[client.transactions.create(LEADER_KEY) == 0] is the CAS — it checks the key has zero create-revision (i.e., does not exist). lease.refresh() in the renewer is what keeps the leader alive — every 2.5 seconds (half the TTL, so we have margin against network jitter), the leader pings etcd to extend the lease. mod_revision gives the fencing token: every transaction in etcd advances a global monotonic counter, so the second leader's token (47) is strictly greater than the first's (42), regardless of wall-clock skew. watch_for_election sleeps until etcd pushes a delete event — no polling, no timeout, just a TCP connection waiting. Notice the failover takes ~5 seconds (one TTL window) plus the few hundred ms of network round-trip. Why the new leader's fencing token jumps from 42 to 47 (not 43): etcd's revision number advances for every write to the entire keyspace, not just this key. In a busy cluster, hundreds of revisions tick by between two leadership changes. The fencing token is therefore "approximately a sequence number" — strictly monotonic, but with gaps. That is fine; downstream services only care that it increases.
The one trick that makes it survivable: fencing tokens
The lease-based pattern has a subtle hole. Suppose node-A is the leader (token=42), goes into a 12-second GC pause, and during the pause node-B is elected (token=47). When node-A's GC finishes, it does not know it has been deposed — it still thinks it holds the lease. It will keep issuing writes to downstream services for a few seconds before it notices its lease has expired. Without fencing, those writes land in the database alongside node-B's writes, and you have split-brain corruption.
The fix is the fencing token: every leader receives a monotonically-increasing number when it becomes leader (the etcd revision in our example, the ZooKeeper zxid, the Postgres transaction id). Every downstream call carries the fencing token. Every downstream service tracks the highest token it has seen and rejects any call with a lower token. When node-A (token=42) sends a write to the database after its lease has expired, the database sees that it has already accepted writes from token=47, and rejects 42's write with stale leader. The corruption is prevented at the receiver side, not at the leader side — because the receiver is the only entity that can compare two competing tokens.
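A minimal sketch of the receiver-side check (the class and method names are illustrative, not from any library):
# fencing_receiver.py — the downstream service enforces token order
import threading
class FencedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._max_token_seen = -1
        self._data = {}
    def write(self, key, value, fencing_token):
        with self._lock:
            if fencing_token < self._max_token_seen:   # deposed leader
                raise RuntimeError(f"stale leader: token {fencing_token} "
                                   f"< max seen {self._max_token_seen}")
            self._max_token_seen = fencing_token
            self._data[key] = value
store = FencedStore()
store.write("order-1", "assigned", fencing_token=42)  # node-A while leader
store.write("order-2", "assigned", fencing_token=47)  # node-B, new leader
try:
    store.write("order-3", "assigned", fencing_token=42)  # node-A, post-pause
except RuntimeError as e:
    print(e)   # stale leader: token 42 < max seen 47 — corruption prevented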
This pattern was the thing that made Chubby's lock service trustworthy in production at Google, and it is also why ZooKeeper's zxid and etcd's revision exist as exposed numbers in the API rather than internal implementation details. Martin Kleppmann's 2016 essay "How to do distributed locking" is the canonical write-up: lease-based locks are unsafe without fencing, full stop. Why GC pauses are the canonical attack: a stop-the-world GC in Java can pause a JVM for 10+ seconds, well past any reasonable lease TTL. The leader has no way to detect this from inside — to itself, the GC pause looks like normal execution that happens to take a long time. The clock outside has moved; the lease has expired; the leader is the last to find out. Fencing tokens defend against this without needing to detect the pause: even if the deposed leader is unaware, its writes carry the old token and are rejected. This is why the pattern is sometimes called defence in depth — you do not trust the leader to know it is the leader; you trust the receivers to enforce the order.
When this pattern wins, when it loses
The lease + fencing pattern is the right answer when you already have (or are willing to operate) a coordination service. It is the wrong answer when the coordination service is a single point of failure you cannot tolerate, or when the latency of a remote lease renewal is unacceptable. Three concrete scenarios:
PaySetu's payment-gateway pinned-leader election. PaySetu runs a 9-node payment-gateway cluster where exactly one node owns the connection to the upstream bank. They use etcd lease + fencing tokens. Failover takes about 6 seconds (5s TTL + 1s for the new leader's first action), during which payment requests queue. The failover budget was acceptable; running their own Raft inside the gateway service was not (the team is 4 engineers; operating their own consensus would consume 1 of them full-time). This is the typical fit.
CricStream's per-region edge-cache leader. CricStream has 200 edge POPs around India, each with a 3-node cache. The "leader" of each POP coordinates eviction. Putting all 600 nodes on a central etcd cluster would mean every leader heartbeat crosses 1000+ km of network — at 5s TTLs that is fine, but it makes the central etcd a single point of failure for cache eviction in 200 POPs. CricStream solves this differently: each POP runs its own 3-node etcd cluster, scoped to that POP. Election is regional; failures are regional; the central cluster is only consulted for cross-POP coordination (rare). The pattern still applies, but the topology is one coordination cluster per failure domain, not one global one.
KapitalKite's microsecond-sensitive market-data ingestion. KapitalKite's market-data path needs sub-millisecond decisions about which node owns a given exchange feed. A network round-trip to etcd (even on a LAN) is ~500 µs. That is the entire latency budget. KapitalKite cannot afford lease-based election in the hot path. They use it for bootstrapping — at startup, leases pick the initial leader — but once running they rely on the leader's heartbeat to its peers (UDP multicast), and a peer that does not hear a heartbeat for 50 ms invokes a fast in-memory election protocol with fencing tokens still issued by etcd. This is the rarer case where lease-based election is part of the system but not the election.
Common confusions
- "Lease-based election is the same as Raft." Raft elects a leader through a vote among the participating nodes themselves. Lease-based election delegates the consensus to an external coordinator. The application running lease election does not contain Raft — etcd does, and the application is a client of etcd's Raft. The two patterns coexist in production: Kubernetes uses lease election, Kafka used ZooKeeper for it (now KRaft, which is Raft inside Kafka), and CockroachDB runs Raft per range and uses leases on top.
- "You don't need fencing tokens if your TTL is long enough." TTL only bounds how long a leader can be deposed without knowing — it does not prevent the deposed leader from acting during that window. A 30-second TTL still allows a 30-second GC pause to cause writes from a stale leader. Fencing tokens close the window at the receiver; TTLs alone do not.
- "The lease-holder is the leader at all times." No. The lease-holder is the leader as far as etcd knows. The actual application may have crashed, paused, or partitioned — etcd just hasn't found out yet. The fencing token is what makes the gap survivable.
- "etcd revision and ZooKeeper zxid are equivalent fencing tokens." Both are monotonic, both work as fencing tokens. The implementation differs: zxid is a 64-bit number where the upper 32 are the epoch (incremented on leader change in ZK) and the lower 32 are the per-epoch counter. etcd's revision is a single 64-bit counter that ticks every transaction in the entire cluster. The reader cares about monotonicity; the operator cares about how easy it is to interpret in logs.
- "You can do leader election with a Postgres
SELECT FOR UPDATE." You can — and many systems do. The lock is held until the transaction commits or the connection drops, which acts as the TTL. The fencing token is the transaction id. The downside: Postgres connections have higher operational cost than etcd watches, and "connection drops" is a noisier failure detector than a clean lease expiry. It works, but it is the budget version of the pattern. - "This pattern only works if the coordination service is highly available." Yes — but the failure mode is inability to elect a new leader, not split-brain. If etcd is down, the current leader's lease will expire and no one can acquire a new one; the system stalls but does not corrupt. Compare to running your own Raft and getting a partition wrong: that does corrupt. The trade-off favours stalling over corruption almost always.
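A minimal sketch of that budget version, assuming a table created once with CREATE TABLE leader_lock (id int PRIMARY KEY) plus one seed row, and a psycopg2 connection (the DSN and names are illustrative):
# leader_pg.py — Postgres row lock as lease, transaction id as fencing token
import psycopg2
conn = psycopg2.connect("dbname=coord")   # assumed DSN
cur = conn.cursor()
try:
    # NOWAIT: fail immediately instead of queueing behind the current leader
    cur.execute("SELECT id FROM leader_lock WHERE id = 1 FOR UPDATE NOWAIT")
    cur.execute("SELECT txid_current()")  # token assigned after the lock, so
    token = cur.fetchone()[0]             # later leaders get larger tokens
    print(f"leader, fencing_token={token}")
    # leader work goes here; the lock is held until COMMIT/ROLLBACK or the
    # connection drops — this pattern's (noisy) equivalent of a TTL
except psycopg2.OperationalError:
    print("follower: another connection holds the row lock")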
Going deeper
The Chubby paper as the founding document
Mike Burrows's 2006 OSDI paper "The Chubby Lock Service for Loosely-Coupled Distributed Systems" is the canonical formulation of this pattern. Chubby was Google's internal coordination service (a Paxos-replicated cell, a hierarchical name space, sessions with TTLs, file-like locks). The paper's key claim — and it bears repeating — is that most distributed-systems problems people thought they had were really name-resolution problems, and giving them a strongly-consistent name service collapsed dozens of bespoke election protocols into one library call. ZooKeeper (2008), etcd (2014), and Consul (2014) all descend directly from Chubby's design, with different trade-offs (ZK is slightly higher-throughput and uses Zab, etcd uses Raft with a v3 transactional API, Consul adds service discovery primitives). Read Chubby first; the others make more sense afterwards.
Why "lease + fencing" beats "lease alone" provably
Without fencing, a deposed leader can issue writes during the period between its lease expiry and its discovery of the expiry. In the worst case (network partition isolating the deposed leader from etcd but not from the database) this period is unbounded. The system has no mechanism to prevent corruption. With fencing, the receiver-side check (token >= max_seen) is a monotonic monitor over the sequence of operations. Two leaders with different tokens can both attempt writes; the receiver totally orders them by token. The proof that this is sufficient is the same as the Lamport-timestamp causality proof: a monotonic counter induces a total order that respects causality (a deposed leader's token is strictly less than its successor's). The fencing-token system is, in formal terms, a single-writer principle implemented at the receiver — even if the senders disagree on who is the writer, the receiver picks the highest one and ignores the rest.
When the coordination service itself is the bottleneck
At very large scale (10,000+ services, each with its own lease), the coordination cluster becomes the bottleneck. etcd's published throughput at 3-node clusters is ~30k QPS for small writes; if you have 10k services each renewing every 3 seconds, you are at 3.3k renewals/second, well within budget. At 100k services that is 33k renewals/second, at the limit. Solutions: (a) shard the coordination cluster (one etcd per region or per service tier); (b) lengthen TTLs (10s instead of 3s reduces renewal rate by 3.3×); (c) use a hierarchy — services do not lease directly against etcd, but against a per-tier "lease aggregator" that batches renewals. BharatBazaar reportedly uses option (c) during their flash-sale traffic peaks, where leasing pressure jumps 5× as auto-scalers spin up new replicas. None of these are ergonomic; they are operational scars.
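The renewal arithmetic above, as a quick sanity check (numbers taken from this paragraph; the budget figure is the published throughput cited above):
# renewal_load.py — back-of-envelope lease-renewal load vs the etcd budget
ETCD_WRITE_BUDGET_QPS = 30_000            # ~30k small writes/s, 3-node cluster
def renewal_qps(n_services, renew_interval_sec):
    return n_services / renew_interval_sec   # one renewal per service per interval
print(renewal_qps(10_000, 3))    # ≈3333  — about 11% of budget, comfortable
print(renewal_qps(100_000, 3))   # ≈33333 — over budget: shard, batch, or...
print(renewal_qps(100_000, 10))  # 10000  — option (b): 10s TTLs pull it back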
The Kubernetes leader-election library — what production looks like
client-go's leaderelection package is the canonical Kubernetes leader-election library, used by every controller (kube-controller-manager, kube-scheduler, every operator). It implements the lease pattern almost exactly as described — with one wrinkle: it uses a Lease Kubernetes resource (a built-in API object in the coordination.k8s.io group) rather than an etcd key directly. Kubernetes' API server, in turn, stores Lease objects in etcd. So the pattern is: app → Kubernetes Lease API → etcd → Raft. Three layers of indirection, each adding ~10ms of latency, but each adding decoupling: applications can move between Kubernetes clusters without knowing about etcd. The library exposes LeaderCallbacks with OnStartedLeading, OnStoppedLeading, OnNewLeader — a clean separation between the election mechanic and the leader's actual work. Read the source; it is ~600 lines of Go, and it is the most-deployed leader-election code on Earth.
The bully algorithm and lease-based election as siblings
The bully algorithm and lease-based election are two answers to the same question — agree on a leader given a network and some failure detection — but they distribute the trust differently. The bully puts the trust in the algorithm itself: every node runs identical code, and correctness emerges from the protocol. Lease-based election puts the trust in the coordinator: the algorithm in your code is trivial, and correctness comes from etcd/ZK's Paxos/Raft layer. The bully scales to N nodes with N(N-1)/2 messages worst-case; lease-based scales to N nodes with N renewals/TTL — much cheaper at scale, but only because you offloaded the consensus elsewhere. They are not competitors at the same level; they are different rungs of the abstraction ladder.
Reproduce this on your laptop
# Run a single-node etcd, then three lease-election clients
docker run -d --name etcd-demo -p 2379:2379 quay.io/coreos/etcd:v3.5.10 \
/usr/local/bin/etcd --advertise-client-urls http://0.0.0.0:2379 \
--listen-client-urls http://0.0.0.0:2379
python3 -m venv .venv && source .venv/bin/activate
pip install etcd3
python3 leader_lease.py node-A 8 &
python3 leader_lease.py node-B &
python3 leader_lease.py node-C &
# Watch node-A "crash" at t=8, node-B win at t≈13 with a higher fencing token.
Where this leads next
The next chapter, lease mechanics, is the deep dive on lease safety: the inequality lease_duration > max_clock_skew + max_message_delay that bounds the period during which a deposed leader can mistakenly act, and the variants (revocable leases, hierarchical leases, lease-extension protocols) that handle real-world conditions. After that, Chubby and the lock-service pattern walks through Google's original implementation, including the session abstraction that turns lease-holding into a stateful client-server interaction.
The deeper takeaway is operational: the question "should I run consensus in my service?" almost always has the answer "no — push it into a coordinator". Run consensus once, share it across services, build everything else on top of CAS + watch + lease + fencing token. That is what production looks like. The bully algorithm and Raft-from-scratch live in textbooks and embedded systems; in your services/ directory, what you will find is leader_lock.py.
References
- Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — the founding paper. Read §2 for the API design and §4 for the locking semantics.
- Hunt, P., Konar, M., Junqueira, F.P., Reed, B. — "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010) — Chubby's open-source descendant; introduces the recipes (election, locks, barriers).
- Kleppmann, M. — "How to do distributed locking" (2016 essay, kleppmann.com) — the canonical write-up of why fencing tokens are mandatory; shows the GC-pause attack diagram.
- etcd Authors — "etcd v3 API documentation" — the lease, txn, and watch primitives the chapter relies on; the section on "lease keepalive" is essential.
- Junqueira, F.P., Reed, B., Serafini, M. — "Zab: High-performance broadcast for primary-backup systems" (DSN 2011) — the consensus protocol underneath ZooKeeper; explains zxid as a fencing token.
- Bessey, A., Block, K., Chelf, B., et al. — "A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World" (CACM 2010) — tangential; a field report on bug-finding in large real-world codebases.
- Lease mechanics — the safety inequality and clock-skew bounds; the formal companion to this chapter.
- Chubby and the lock-service pattern — the Google production system this entire pattern descends from.