Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: consensus is expensive — leases are cheap
It is 11:47 on a Tuesday at PaySetu and Riya is reading a graph she does not believe. The settlement-batch cron, which runs once a minute on exactly one node, has just produced two batches with overlapping IDs in the same minute. The Raft cluster shows no leader change. The node that was supposed to own the lock shows the lock as held. The other node — the one that also ran the batch — shows the lock as expired four hundred milliseconds before it grabbed it. Both nodes are right. The cluster is right. Wall-clock time is the thing that lied. This article is about the primitive that prices that lie correctly — the lease — and the one extra integer (the fencing token) that turns the lease from a hopeful suggestion into a safety property.
A consensus round costs you a quorum RTT on every decision. A lease costs you a quorum RTT once, then lets one node act unilaterally until the lease expires. The savings are dramatic — a leader-elected node can take 50,000 small decisions a second on its own, instead of paying 6 ms per decision through Raft. The price is that leases assume a clock bound: if a holder's clock runs slow, or its OS pauses for 800 ms, two nodes will simultaneously believe they hold the lease. The fix is the fencing token — a monotonically-increasing integer issued with each lease, attached to every action the holder takes, and rejected by the storage layer if it ever sees a smaller token. Without fencing, leases are unsafe. With fencing, they are how every production system avoids paying consensus on the hot path.
What a lease actually is
A lease is a time-bounded grant of authority over a resource — "you, node-3, may act as the leader for the next 30 seconds; after that, ask again". It is issued by a small consensus-backed store (etcd, Zookeeper, Spanner's lock service, Consul) using a single linearisable read-modify-write — if no current holder, set holder=node-3, expires_at=T+30s. From that moment until T+30s, node-3 takes decisions on its own and the rest of the cluster trusts those decisions without further coordination. The lease is renewed periodically — typically when one third of the lease has elapsed — by the same compare-and-set operation.
The economy is simple. A 5-node Raft cluster commits writes at p50 ≈ 6 ms (see the previous chapter). If every decision your service takes (every cron tick, every shard owner check, every "is it my turn to flush the batch" question) went through Raft, you would be capped at roughly 150 decisions per second per leader before the WAL fsync queue saturates. With a lease, you pay that 6 ms exactly once per renewal interval. A 30-second lease renewed every 10 seconds costs you 6 Raft commits per minute, roughly 36 ms of consensus latency per minute, and lets the holder do whatever it likes between renewals at full local speed. Why this is mathematically the right shape: consensus amortises beautifully when the same authority makes many decisions in a row. The lease is the "make many decisions" lever: once the cluster has agreed who decides, the deciding itself does not need to round-trip a quorum. This is the same trick that batched fsync uses inside the Raft leader: pay the expensive boundary cost once, then make many cheap decisions inside the boundary.
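The amortisation arithmetic above can be checked with a few lines of Python. This is a back-of-envelope sketch only: the 6 ms commit latency is the article's illustrative figure, the function names are invented for this example, and the sequential bound ignores queueing and fsync effects.

```python
# Back-of-envelope cost of leases versus per-decision consensus.
# All figures are the article's illustrative numbers, not measurements.

RAFT_COMMIT_MS = 6.0  # p50 quorum commit latency quoted in the text


def renewals_per_minute(renew_every_s: float) -> float:
    """Number of lease renewals (one Raft commit each) in 60 seconds."""
    return 60.0 / renew_every_s


def consensus_ms_per_minute(renew_every_s: float) -> float:
    """Total consensus latency spent on renewals in one minute."""
    return renewals_per_minute(renew_every_s) * RAFT_COMMIT_MS


def sequential_decisions_per_second() -> float:
    """Upper bound if every decision waits for its own commit, one at a time."""
    return 1000.0 / RAFT_COMMIT_MS


print(renewals_per_minute(10.0))               # commits per minute for a 10 s renewal
print(consensus_ms_per_minute(10.0))           # consensus latency paid per minute
print(round(sequential_decisions_per_second()))  # naive per-decision ceiling
```

The naive ceiling comes out slightly above the "roughly 150" quoted in the text, which is expected because the quoted figure also accounts for the WAL fsync queue saturating.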
How a lease is actually issued — and where the lie hides
A lease is implemented as a single conditional write against a consensus store. In etcd, that is a transaction:
TXN(
  if( Value("/lock/settlement-cron") == "" OR expires_at < now() )
  then set Value("/lock/settlement-cron", node_id),
       set Value("/lock/settlement-cron/expires_at", now() + 30s),
       increment Value("/lock/settlement-cron/token")
  else fail
)
The transaction commits through Raft. If it succeeds, the caller receives (holder, expires_at, token). The token is the integer the cluster increments on every grant — generation 1, 2, 3, monotonically. From the consensus store's perspective, the lease is unambiguous: at any wall-clock instant the cluster knows who holds the lease and what token they hold.
The lie is in the words "expires_at" and "now()". The consensus store's notion of when the lease expires lives on the consensus store's clocks. The holder's notion of when its lease is still valid lives on the holder's clock. These are different clocks, and they drift. NTP pulls a clock by 100–500 ms per resync; a VM that gets descheduled by the hypervisor can lose 1–2 seconds in a heartbeat; a stop-the-world GC pause on a 64GB JVM can freeze the process for 800 ms; a kernel that takes a long fsync can starve user-space scheduling. Any of these can produce a holder which believes its lease is still valid when, by the consensus store's clock, it has expired and been re-granted to someone else. Why this is unfixable in pure software: there is no protocol that lets a process know the wall-clock outside its own kernel without paying a network round-trip — and if you pay that round-trip on every action, you have re-invented per-decision consensus and lost the lease's economy. Even running an atomic-clock-synchronised TrueTime appliance (Spanner) reduces but does not eliminate the gap; you wait out the gap explicitly with commit_wait. The lease cannot prevent overlap; it can only make overlap expensive enough to detect.
This is the central failure mode: two simultaneous lease holders. Node-A's lease was granted at T=0 with token 7, expiring at T=30s. Node-A checked its lease, found it valid, and then suffered a 31-second GC pause before it could act. At T=30.5s the consensus store re-granted the lease to node-B with token 8. At T=31s node-A unfroze and, still working from the check it had passed before the pause, proceeded to take an action against shared state. Now both A and B think they hold the lease. The action will be applied twice, in some order, and the storage layer will silently accept both writes if it has no defence. The defence is the fencing token.
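One way to see the overlap concretely is to replay the timeline with a clock that is advanced by hand. The sketch below is an illustration under simplifying assumptions: ManualClock and UnfencedStorage are hypothetical names, nothing talks to a real lease service, and the holder is modelled as checking its lease once before the pause and never again.

```python
# Sketch of the two-holder overlap with NO fencing. The clock is a plain
# variable advanced by hand; nothing here talks to a real lease service.

class ManualClock:
    def __init__(self):
        self.now = 0.0


class UnfencedStorage:
    """Accepts every write; it has no idea who holds the lease."""

    def __init__(self):
        self.log = []

    def write(self, writer, value):
        self.log.append((writer, value))


clock = ManualClock()
storage = UnfencedStorage()

# T=0: node-A is granted a 30 s lease and checks it before acting.
lease_holder, lease_expires_at = "node-A", 30.0
a_check_passed = clock.now < lease_expires_at  # True at T=0

# node-A now pauses (GC, hypervisor, ...). Time moves on without it.
clock.now = 30.5
# T=30.5: the store sees the lease has expired and re-grants it to node-B.
if clock.now > lease_expires_at:
    lease_holder, lease_expires_at = "node-B", clock.now + 30.0
storage.write("node-B", "batch-11:47")

# T=31: node-A wakes up and carries on from where it stopped. Its check
# passed before the pause, and it does not re-check.
clock.now = 31.0
if a_check_passed:
    storage.write("node-A", "batch-11:47")

print(lease_holder)   # the store's view: a single holder
print(storage.log)    # the storage's view: two writers for the same batch
```

The store never believes there are two holders, yet the storage log ends up with writes from both nodes, which is exactly the gap a fencing token is meant to close.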
The fencing token — the integer that turns "hopeful" into "safe"
A fencing token is the generation number stamped on every action the holder takes. The storage layer (the database, the queue, the file store, whichever resource the lease is protecting) checks the incoming token against the highest token it has ever seen on that resource — if the incoming token is smaller, the action is rejected.
storage.write(record, payload, token=8) # accepted, last_seen=8
storage.write(record, payload, token=7) # REJECTED — 7 < last_seen=8
That is the entire mechanism. It is conceptually trivial and operationally life-saving. The frozen node-A with token 7 cannot corrupt state held by node-B with token 8, because every write A attempts is rejected by the storage layer. Why a monotonic generation suffices: tokens are issued only by the consensus store, and only on a successful re-grant — meaning the previous lease has been declared expired by the cluster. So any token strictly greater than last_seen was issued after the previous holder's lease was revoked. The storage layer's local check enforces a global property: no action under a revoked lease can ever land. The proof is simply that monotonicity at the issuer plus a max-seen check at the receiver gives you a happens-after relation between issuance and acceptance, without coordination between A and B.
The fencing token must be threaded through every action the lease-holder performs. If the holder does ten things — write a row, send a message, log an event, mutate cache, advance a counter — every one carries the token, and every receiving system enforces the max-seen check. Skip any of them and the system has a corruption hole. This is why fencing is hard in practice: it is a property of the entire data path, not a single layer.
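A minimal sketch of what "every receiving system enforces the max-seen check" might look like, assuming each downstream keeps its own per-resource high-water mark. FencedSink and run_batch are hypothetical names for illustration, not part of any real client library.

```python
# Hypothetical sketch: every downstream the holder touches keeps its own
# per-resource high-water mark and rejects smaller tokens.

class FencedSink:
    def __init__(self, name):
        self.name = name
        self.max_seen = {}   # resource -> highest token accepted so far
        self.accepted = []

    def apply(self, resource, action, token):
        if token < self.max_seen.get(resource, 0):
            return False     # stale holder: refuse the action
        self.max_seen[resource] = token
        self.accepted.append((resource, action, token))
        return True


def run_batch(sinks, resource, token):
    """Perform one action per sink, all stamped with the same token."""
    return {s.name: s.apply(resource, f"{s.name}-step", token) for s in sinks}


sinks = [FencedSink("table"), FencedSink("queue"), FencedSink("cache")]

new_holder = run_batch(sinks, "settlement", token=8)
stale_holder = run_batch(sinks, "settlement", token=7)
print(new_holder)
print(stale_holder)
```

If any one of the three sinks skipped the comparison, the stale holder's action would land there and only there, which is the corruption hole the paragraph describes.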
The consensus store does not need to know A is alive; the storage layer handles the safety property locally.
A runnable lease + fencing implementation
The following Python implements a lease holder, a consensus-style store backing the lease, and a storage layer that enforces fencing. We then simulate the GC-pause failure mode and show that fencing rejects the stale write.
# lease_fencing_demo.py: a lease holder, a backing store, and a fencing check
# No dependencies; pure stdlib.
import time, threading

# ---------- lease store (the consensus-backed bit) ----------
class LeaseStore:
    def __init__(self):
        self.lock = threading.Lock()
        self.holder = None
        self.expires_at = 0.0
        self.token = 0  # monotonic generation, never decreases

    def acquire(self, node_id, ttl_s):
        with self.lock:
            now = time.time()
            if self.holder is None or now > self.expires_at:
                self.token += 1  # new grant: bump the generation
                self.holder = node_id
                self.expires_at = now + ttl_s
                return (True, self.token, self.expires_at)
            if self.holder == node_id:  # renewal keeps the same token
                self.expires_at = now + ttl_s
                return (True, self.token, self.expires_at)
            return (False, None, None)

# ---------- storage with fencing check ----------
class FencedStorage:
    def __init__(self):
        self.lock = threading.Lock()
        self.last_seen_token = 0
        self.records = {}

    def write(self, key, value, token):
        with self.lock:
            if token < self.last_seen_token:
                return ("REJECTED", token, self.last_seen_token)
            self.last_seen_token = max(self.last_seen_token, token)
            self.records[key] = (value, token)
            return ("OK", token, self.last_seen_token)

# ---------- demo: simulate the GC-pause overlap ----------
store, storage = LeaseStore(), FencedStorage()
ok, t_a, exp = store.acquire("node-A", ttl_s=2.0)
print(f"node-A acquires: token={t_a}, expires_at={exp:.2f}")
# node-A writes happily under token t_a
print(storage.write("settlement-batch", "A:row1", t_a))
# node-A is now "frozen": we don't call into it for just over 2 seconds
time.sleep(2.1)
# meanwhile node-B notices the lease has expired and acquires
ok, t_b, exp = store.acquire("node-B", ttl_s=2.0)
print(f"node-B acquires: token={t_b}")
# node-B writes under token t_b (skipped if node-A's lease was still live)
if ok:
    print(storage.write("settlement-batch", "B:row1", t_b))
# node-A 'thaws' and tries to write under its old token t_a
print(storage.write("settlement-batch", "A:row2-stale", t_a))
# Sample run
node-A acquires: token=1, expires_at=1746031672.83
('OK', 1, 1)
node-B acquires: token=2
('OK', 2, 2)
('REJECTED', 1, 2)
Walking the load-bearing lines: self.token += 1 in acquire is the single source of monotonicity — every successful grant produces a token strictly greater than every prior token, which is the property the storage layer relies on. if token < self.last_seen_token: return ("REJECTED", ...) is the entire fencing check; one comparison, two integers, no coordination with the lease store at write time. time.sleep(2.1) simulates the GC pause: longer than the 2-second TTL, so the lease genuinely expires and re-grants. The final REJECTED line is the safety property in action — node-A's view of "I still hold the lease" is wrong, but its writes cannot corrupt because the storage layer catches the stale token. Why this works without A and B ever talking to each other: the fencing token is the only shared state needed. Once it lands at the storage layer, every future write is filtered against it. A's belief about its own lease is irrelevant; the storage layer is the arbiter of last-seen. This is the same architectural pattern as optimistic concurrency control on a database row, hoisted up one layer to a cluster-wide lease.
A war story — the PaySetu settlement-batch double-run
Three weeks ago, PaySetu's settlement-batch service produced two batches with overlapping IDs in the 11:47 minute. Investigation showed the following timeline.
The settlement-batch is a Python process that runs on three nodes (paysetu-settlement-1/2/3), with etcd providing a 30-second lease named /lock/settlement/holder. The lease is renewed every 10 seconds. Each minute, the holder reads the pending-payments queue, batches the rows, writes a settlement record, and emits a Kafka event. Until that Tuesday, this had run cleanly for nine months.
At 11:46:48 IST, paysetu-settlement-2 held the lease with token 4719 and was halfway through processing the 11:46 batch. At 11:46:48.412 the JVM (the Python process embeds Cassandra's Java driver) entered a major-GC pause that the on-call later reconstructed from the GC log as lasting 41.3 seconds — caused by a memory leak in a recently-deployed metric exporter. The process produced no output to its monitoring socket. Its last lease renewal had been at 11:46:43, so the lease was good until 11:47:13. At 11:47:14, etcd's lease watcher observed the renewal had not arrived and revoked the lease. At 11:47:14.007 paysetu-settlement-3 acquired the lease with token 4720, ran the 11:47 batch — which included the rows that node-2 had been mid-processing at the start of its pause — wrote the settlement record, emitted the Kafka event, and finished cleanly at 11:47:23.
At 11:47:30, paysetu-settlement-2 came out of its GC pause. From its process-local view, time had jumped: its last action before the pause was at 11:46:48, and both its monotonic clock and the wall clock now showed that roughly 42 seconds had passed. The Python code had a check: if time.time() < self.lease_expires_at: continue. The lease expiry held in node-2's memory was 11:47:13, which was already in the past. Node-2 logged "lease expired during pause" and exited cleanly. Good citizen.
Except. The settlement code, mid-batch, had already issued writes for 47% of its rows to the settlement table before pausing. Those writes went to Cassandra with no fencing token, and some of them landed after the lease had expired. Why Cassandra accepted the writes with no fencing: the writes had been handed to the JVM's networking thread and queued before the GC pause started. The OS kernel's TCP buffers contained the serialised RPCs; the JVM's freeze did not cancel them. As the kernel drained the buffer, the writes hit Cassandra at the wall-clock times the kernel happened to schedule the sends, some during the pause, some immediately after. The lease check in the application logic ran before the GC pause; the write itself ran much later, with no re-check. This is the canonical "actions outlast checks" failure mode. Without a token threaded into every write, the storage layer cannot distinguish a fresh write from a zombie write.
When paysetu-settlement-3 then ran the 11:47 batch on the same rows, the resulting settlement table contained two batches with overlapping IDs. The downstream reconciliation — the very system meant to catch double-payments — saw consistent IDs on both sides (Cassandra and Kafka) and flagged nothing. The duplicate was caught only when an external bank's settlement file the next morning showed ₹4.2 crore being credited twice. The fix was twofold: (1) every write to the settlement table now includes a fencing token, with Cassandra's lightweight transactions enforcing the monotonic-max-seen rule; (2) the lease-holder code now validates the token immediately before every write rather than once at batch start. The first month after the fix, the system rejected 14 stale-token writes — 14 prevented duplicates that would have otherwise landed.
Common confusions
- "A lease is just a lock." A lock holds until the holder releases it. A lease holds until a clock says it expires — and that clock can be wrong. The lease's value is exactly that you can recover from a holder which has crashed, paused, or partitioned, without needing the holder to come back and release. Locks need a release path; leases tolerate dead holders.
- "Renewing the lease frequently is enough to prevent overlap." Renewal frequency reduces the probability of overlap on a healthy node; it does nothing for an unhealthy node that has stopped renewing because it is paused. The fix is a fencing token, not a tighter renewal interval. A lease with a 1-second TTL on a node that pauses for 5 seconds still leaves roughly 4 seconds during which the lease has expired but the node has not yet noticed.
- "Fencing tokens require a database that supports them." Any storage layer that supports a conditional write (Postgres UPDATE ... WHERE token > current_token, DynamoDB conditional writes, S3 conditional PUT, Redis Lua-scripted CAS) can enforce fencing in three lines. The token is just a column or a key. The hard part is making sure every write path uses it, not the storage primitive.
- "Lease + fencing = consensus on every write." No: the fencing check is a local comparison at the storage layer, not a quorum round. The consensus cost is paid only at lease grant and lease renewal. The storage write itself is whatever it would have been without a lease (one Cassandra round-trip, one Postgres row write).
- "Spanner doesn't use leases." Spanner uses leases extensively — every Paxos group has a leader with a 10-second lease, and the leader serves linearisable reads without a Paxos round during the lease term. The TrueTime API is the mechanism that makes the lease's clock-bound assumption tighter than NTP — bounded uncertainty rather than unbounded skew. The lease is still the primitive; TrueTime just narrows the gap.
- "You only need fencing in distributed systems, not single-node ones." A single-node service still has multiple processes (an old version, a new version during a deploy) and multiple threads (a paused thread, a fresh thread). The same overlap pattern shows up in microcosm. Fencing is a discipline; it pays off the moment you have more than one entity that thinks it is "the writer".
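The claim that any conditional write can enforce fencing can be illustrated with Python's built-in sqlite3 module standing in for Postgres. This is a sketch, not a production schema: the settlement table, its columns, and fenced_update are names made up for the example, and the condition uses <= so that the same holder can write repeatedly under one token.

```python
import sqlite3

# In-memory SQLite as a stand-in for any store with conditional UPDATE.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settlement ("
    "key TEXT PRIMARY KEY, payload TEXT, token INTEGER NOT NULL)"
)
conn.execute("INSERT INTO settlement VALUES ('batch', '', 0)")


def fenced_update(conn, key, payload, token):
    """Apply the write only if the stored token is not newer than ours."""
    cur = conn.execute(
        "UPDATE settlement SET payload = ?, token = ? "
        "WHERE key = ? AND token <= ?",
        (payload, token, key, token),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the write was fenced off


print(fenced_update(conn, "batch", "B:row1", 8))   # current holder
print(fenced_update(conn, "batch", "A:stale", 7))  # stale holder
print(conn.execute("SELECT payload, token FROM settlement").fetchone())
```

The check and the write happen in one statement, so there is no gap between comparing the token and applying the change. The caller still has to inspect the returned flag; an ignored rejection is as bad as no fence.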
Going deeper
The clock-bound assumption — what makes leases safe enough to use
A lease's safety rests on a quantitative bound: the difference between the consensus store's clock and the holder's clock is at most ε. If the real skew exceeds the ε the holder budgets for, two nodes can simultaneously believe they hold the lease, and fencing is the only thing keeping the system safe. In practice, NTP gives ε ≈ 100–500 ms on a healthy network, but a process pause can push the effective ε into seconds. Why a smaller TTL doesn't fix the assumption: shrinking the TTL also shrinks the window during which the fencing check is "needed", but it raises the renewal frequency proportionally, and every renewal is a Raft commit. At TTL=1s with renewal at TTL/3, you pay a Raft commit every 333 ms, which is roughly 180 commits per minute per lease on the consensus store. The TTL is a parameter that trades operational cost against the window of unsafe-without-fencing behaviour. Fencing tokens decouple safety from TTL; the TTL becomes purely a liveness parameter. Spanner's contribution was not the lease; it was the TrueTime API that gives you a quantified ε. With TrueTime, you can compute "how long do I wait before my lease is provably expired", which is commit_wait. Without TrueTime, you trust NTP and use fencing.
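One common way for a holder to respect the clock bound is to stop acting a margin of ε before its own timer says the lease ends. The sketch below assumes the holder measures elapsed time on its monotonic clock from the moment it sent the grant request; LocalLease, safe_to_act, and the chosen ε are illustrative assumptions, not an API from etcd or Spanner.

```python
import time


class LocalLease:
    """Holder-side view of a lease, measured on the holder's monotonic clock."""

    def __init__(self, ttl_s, epsilon_s, requested_at=None):
        # Count the TTL from when the request was SENT, not when the reply
        # arrived: the store started its own timer no earlier than that.
        self.requested_at = time.monotonic() if requested_at is None else requested_at
        self.ttl_s = ttl_s
        self.epsilon_s = epsilon_s  # assumed bound on drift, chosen by the operator

    def safe_to_act(self, now=None):
        """True only while we are at least epsilon short of local expiry."""
        now = time.monotonic() if now is None else now
        return (now - self.requested_at) < (self.ttl_s - self.epsilon_s)


# Explicit timestamps make the behaviour deterministic for the example.
lease = LocalLease(ttl_s=30.0, epsilon_s=0.5, requested_at=100.0)
print(lease.safe_to_act(now=110.0))  # well inside the lease
print(lease.safe_to_act(now=129.6))  # inside the TTL but within the margin
print(lease.safe_to_act(now=131.0))  # past the TTL
```

This narrows the overlap window but cannot close it: a pause that begins right after safe_to_act returns True still lets a late action through, so the margin complements fencing rather than replacing it.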
Lease implementations in the wild
Zookeeper's ephemeral znodes (Burrows-style locks): the session timeout is the lease, and the ZK server tracks heartbeats. No fencing token is built in; the application must read the lock znode's czxid (the ID of the transaction that created it, which only increases across successive creations) and use it as a fencing token. Many production users miss this and end up with overlap bugs. Etcd's lease API is more explicit: Grant(ttl) returns a LeaseID which the holder attaches to keys via Put(key, value, lease=id); the keys auto-delete on lease expiry. For fencing, etcd provides the revision number on the key; treat it as the token. Consul's session API works the same way. Spanner's lock service is integrated with the database: every read and write inside a Paxos group uses the lease implicitly, and the database itself enforces the monotonic-token property. The pattern across all four: the lease primitive is straightforward; the fencing token is where systems differentiate themselves on safety.
When a lease is the wrong primitive
A lease assumes one writer at a time, with a clean handoff on expiry. Three workloads fight that assumption. (1) Multi-writer aggregates (counters, sets, registers) — use a CRDT, not a lease; see G-Counter and PN-Counter. (2) Read-mostly workloads where writes are rare but coordination must be exact — a Paxos group is cheaper than the lease's TTL+fencing complexity, because the writes are infrequent enough that you do not need amortisation. (3) Workloads where the holder genuinely cannot tolerate a takeover (e.g. a long-running ETL job that must complete or fail atomically) — use a transaction or a saga, not a lease, because the lease's failure mode (reassignment after timeout) is exactly what you want to prevent.
Why the lease holder must check the token before every action, not once at start
The PaySetu war story above is the canonical example. The application logic checked the lease validity once, at the start of the batch, then issued many writes against the assumption that the check still held. Fencing tokens that are looked up once and cached have the same flaw at finer granularity: the holder must include the token value in every outgoing write, and the value it uses must be the one the lease store most recently issued to it, not one remembered from an earlier grant. In practice, this means the lease library issues a callback to the application on every renewal or re-acquisition with the current token (which changes whenever the lease was lost and granted again), and the application must thread that token through every write. Frameworks that hide the token behind a high-level "lease-protected operation" wrapper get this right; ad-hoc implementations almost always get it wrong.
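A rough sketch of the "token on every write" discipline, assuming a lease library that reports the current token through callbacks. LeaseHandle, on_grant, on_lost, and fenced_write are hypothetical names; real libraries expose this differently, and the storage side must still enforce the max-seen check.

```python
class LeaseHandle:
    """Holds whatever token the lease library last reported."""

    def __init__(self):
        self._token = None

    def on_grant(self, token):
        # Assumed callback: invoked after each successful acquire or renewal.
        self._token = token

    def on_lost(self):
        # Assumed callback: invoked when the library knows the lease is gone.
        self._token = None

    def current_token(self):
        if self._token is None:
            raise RuntimeError("lease not held")
        return self._token


def fenced_write(handle, storage_write, key, value):
    """Read the token at call time and attach it to this one write."""
    return storage_write(key, value, handle.current_token())


sent = []

def fake_storage_write(key, value, token):  # stand-in for a real client call
    sent.append((key, value, token))
    return "OK"


handle = LeaseHandle()
handle.on_grant(4719)
fenced_write(handle, fake_storage_write, "row-1", "a")
handle.on_grant(4721)  # lease was lost and re-acquired in between
fenced_write(handle, fake_storage_write, "row-2", "b")
print(sent)
```

Because the token is read inside fenced_write rather than captured once at batch start, a mid-batch re-grant shows up on the very next write instead of being masked by a cached value.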
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 lease_fencing_demo.py
# Then change ttl_s to a longer value and shorten the sleep —
# observe that node-A's "stale" write is now accepted because
# its lease genuinely had not yet expired.
# Try removing the fencing check (the if token < self.last_seen_token line)
# and watch the demo show silent corruption.
For the etcd version, run a 3-node etcd cluster, use etcdctl lease grant and etcdctl lease keep-alive to hold the lease, and attach the lease ID to the lock key when you put it. Use that key's revision number, not the lease ID, as the fencing token in every write; it is the value the storage layer should compare against its last-seen token.
Where this leads next
The next chapter develops single-decree Paxos, the protocol that the lease's underlying consensus store actually runs internally. After that, multi-decree Paxos and the leader optimisation shows how Paxos itself uses a lease (the leader's elected term) to amortise the prepare phase across many decisions. The structural pattern repeats at every layer: pay the consensus cost at the boundary, run cheap inside it.
Part 9 — leader election with leases and fencing tokens — picks up from here, asking the next question: given that a lease is the right primitive, how do you elect the holder fast enough that a 30-second TTL does not become a 30-second outage on every leader change? The answer involves PreVote, CheckQuorum, and the difference between a lease's safety bound and its liveness bound.
The deepest lesson is this. A consensus protocol turns "one decision" into "agreed-upon one decision". A lease turns "one decision per round-trip" into "many decisions per round-trip". A fencing token turns "many decisions per round-trip" into "many safe decisions per round-trip". Each layer is doing one job, and the system's correctness is the conjunction of all three. Skip any one of them — write a Paxos round per decision and you saturate the cluster; use leases without fencing and you corrupt state on the first long GC pause; fence without a lease and you have re-invented per-action consensus. The art is in keeping the layers separated and the boundaries explicit.
References
- Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — the foundational lease-as-lock-service paper, with detailed discussion of why fencing matters.
- Corbett, J. et al. — "Spanner: Google's Globally-Distributed Database" (OSDI 2012) — leases everywhere, with TrueTime narrowing the clock-bound assumption.
- Kleppmann, M. — "How to do distributed locking" (martin.kleppmann.com, 2016) — the canonical short-form essay on why Redis-only locks without fencing are unsafe.
- Ongaro, D., Ousterhout, J. — "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014) — Raft's leader lease optimisation, §6.4.
- Junqueira, F. et al. — "Zab: High-performance broadcast for primary-backup systems" (DSN 2011) — Zookeeper's session/lease semantics and the failure modes the protocol handles.
- etcd Authors — "etcd Lease API" (etcd.io documentation) — the concrete API surface used in production deployments worldwide.
- When NOT to use consensus — the previous chapter; the framing for why leases exist at all.
- Idempotency keys — the dedupe primitive that complements fencing tokens at the request layer.