Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Lease mechanics
It is 02:14 on a Saturday at CricStream and Aditi is staring at a graph titled "leader gap, last 24 hours". The graph shows that for 11.7 seconds yesterday, between 18:42:03 and 18:42:14 IST — peak third-innings traffic, 23 million concurrent viewers, India chasing 287 — the cluster had no leader. No node was elected. No writes were accepted. The video-distribution coordinator, which assigns CDN edges to viewer regions, sat with an empty queue while clients hammered the previous leader's last-known IP. The post-mortem will pin the outage on a single number: the lease TTL. Specifically, on the fact that someone set the lease TTL to 10 seconds and the renewal cadence to "best effort", and on a Saturday night that turned out to mean an 11.7-second window during which the cluster knew its leader was dead but had not yet given anyone permission to be the new one. This article is about the five numbers that govern that window — and how the lease's mechanics, more than any other primitive in distributed systems, are unforgiving about which trade-off you picked.
A lease has five numbers — TTL, renewal interval, clock-skew bound, network-RTT bound, and safety margin — and the relationships between them determine whether the lease is safe and whether it is available. Make TTL too short and you spend all your consensus budget on renewals; too long and a dead holder blocks the cluster for the whole TTL. The renewal interval must leave enough room for one full round-trip plus skew, or a brief network blip kills a healthy holder. Lease handover — graceful release on a planned shutdown — turns a TTL-long outage into a millisecond gap, and is the single most-skipped piece of lease engineering. The mechanics are mathematical; treat them that way.
The five numbers and the inequality between them
A lease is parameterised by five numbers, and they are not independent. The lease is safe (no two simultaneous holders) only when one inequality holds; it is available (a healthy holder doesn't lose its lease to a transient blip) only when a different inequality holds. Most lease bugs are violations of one of these two, and most engineers have only seen one written down.
The five numbers:
- TTL — how long a granted lease is valid. Typically 5–30 seconds.
- Renewal interval (R) — how often the holder asks the lease store to extend its lease. Usually TTL/3.
- Clock-skew bound (ε_clock) — the maximum difference between the holder's clock and the lease store's clock. With NTP, ~100–500 ms in healthy conditions; seconds during a process pause.
- Network-RTT bound (ε_net) — the longest single round-trip the holder might experience while renewing: p99.9 latency from the holder to the lease-store quorum. Typically 5–50 ms intra-DC, 100–300 ms cross-region.
- Safety margin (σ) — the slack the holder leaves before its own lease deadline. Critical and almost always omitted.
The safety inequality is what keeps the lease from overlapping. The holder must stop acting under its lease before the lease-store believes the lease has expired:
holder's local "I still hold it" deadline + ε_clock + ε_net ≤ lease-store's expiry
In practice, that means the holder must subtract ε_clock + ε_net (often 500 ms to 1 second) from its lease's nominal expiry and stop that much earlier. Why subtraction not addition: the holder's clock might be running fast — by ε_clock the holder thinks it is 12:00:30 when the lease store says 12:00:29.5. The holder's network might be slow — its renewal request takes ε_net to arrive, during which time the lease store may already have expired and re-granted the lease. The holder has no way to measure its own skew accurately; the only safe thing is to assume worst-case error in the wrong direction and quit early. This subtraction is the safety margin σ, and it is the difference between a lease and a hopeful lock.
The availability inequality is what keeps a healthy holder from losing its lease to network noise. The holder's renewal must arrive at the lease store before the lease expires, even if a renewal attempt is lost and has to be retried:
R + ε_net + retry budget < TTL
If R = TTL/3 (the canonical choice), the holder gets two renewal attempts within the TTL — one at R, a retry at 2R — before expiry at 3R. With R = TTL/2 there is only one attempt, so a single dropped packet on the renewal path causes the lease to lapse. Why TTL/3 is not arbitrary: it is the largest renewal interval that still leaves a full retry before expiry. With three attempts (TTL/4), you pay 33% more renewal traffic and gain almost no failure-mode coverage, because by the third attempt the network is usually broken enough that one more attempt doesn't help. The TTL/3 choice is a Pareto optimum on the cost/safety frontier and is what etcd, ZooKeeper, Consul, and Spanner all converge to.
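The two inequalities can be checked mechanically before a config ships. A minimal sketch — the parameter values below are illustrative examples, not recommendations:

```python
# Check the safety and availability inequalities for a candidate lease config.
# All values are in seconds.

def lease_config_ok(ttl, renewal_interval, eps_clock, eps_net, sigma, retry_budget):
    """Return (safe, available) for the five lease numbers.

    safe:      sigma covers the worst-case clock-skew plus network error,
               so the holder stops acting before the store-side expiry.
    available: a renewal attempt plus its retry budget still lands
               before the lease expires.
    """
    safe = sigma >= eps_clock + eps_net
    available = renewal_interval + eps_net + retry_budget < ttl
    return safe, available

# A 10 s TTL renewed at TTL/3, intra-DC RTT bound, NTP-grade skew:
print(lease_config_ok(ttl=10.0, renewal_interval=10.0 / 3,
                      eps_clock=0.5, eps_net=0.05,
                      sigma=1.0, retry_budget=10.0 / 3))   # (True, True)

# Same config renewed at TTL/2: one lost packet exhausts the window.
print(lease_config_ok(ttl=10.0, renewal_interval=5.0,
                      eps_clock=0.5, eps_net=0.05,
                      sigma=1.0, retry_budget=5.0))        # (True, False)
```

The second call fails the availability inequality exactly as the prose predicts: with R = TTL/2, the retry lands at the TTL boundary, not before it.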
How a renewal actually works — the request, the round-trip, the new expiry
A lease renewal is a conditional write against the lease store: "if I am still the holder of this lease, extend the expiry by TTL". In etcd, this is LeaseKeepAlive(leaseID), which the client streams as a long-running gRPC call. Internally, the lease store does a Raft round to commit the new expiry. If the round commits, the holder receives a confirmation containing the new expiry timestamp.
There are three timestamps in a renewal — and engineers conflate them constantly. (1) The holder's wall-clock when it sent the renewal: T_send. (2) The lease store's wall-clock when the renewal commit completed: T_commit. (3) The new expiry the holder should believe in: T_commit + TTL, but reported back to the holder as a delta from its own clock, because the holder cannot trust the store's wall-clock.
The honest implementation does not return an absolute timestamp. It returns the TTL — "you have 30 more seconds from when you receive this response". The holder records T_received_local + TTL as its new local-deadline, and treats that as authoritative. If the holder used the lease store's T_commit + TTL directly, a clock-skew of 500 ms means the holder thinks it has 500 ms more (or less) than the store thinks. The safe encoding is: the lease store always returns a duration; the holder always converts to a local deadline using its own monotonic clock. Why monotonic and not wall-clock: NTP can step the wall-clock backwards by hundreds of milliseconds during a resync. If the holder records T_received_wall + 30s and then NTP steps the wall-clock back by 200 ms, the holder will believe the lease is good for 200 ms longer than the store does. Monotonic clocks (time.monotonic() in Python, CLOCK_MONOTONIC in Linux, mach_absolute_time on macOS) never step backwards; they tick at a rate close to wall-clock but are immune to NTP corrections, leap seconds, and admin clock-set commands. Every production lease library, without exception, uses monotonic clocks for the holder-side deadline.
The renewal request itself can fail in three observable ways: (a) the holder's network is slow and the response arrives close to expiry, (b) the lease store is busy and takes >TTL to commit, (c) the network drops the request entirely. The lease library's job is to retry on (a) and (c), and to detect (b) by giving up before the local deadline. A correct lease library is essentially a renewal scheduler with a single hard rule: never let the local deadline pass without either (i) a successful renewal, or (ii) ceasing all activity that depended on the lease.
A runnable lease library with the renewal contract
The following Python implements a small lease client that talks to a single-node lease store. The store enforces the lease's grant/renew/expire semantics; the client implements the renewal scheduler with the safety inequality enforced. We then run a happy path, an extended path with renewals, and a failure path where the renewal channel goes silent.
```python
# lease_client_demo.py — a lease store + a client that respects the safety inequality
import time, threading

class LeaseStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None
        self._expiry_mono = 0.0  # monotonic deadline known to the store
        self._term = 0

    def grant(self, who, ttl):
        with self._lock:
            now = time.monotonic()
            if self._holder and now < self._expiry_mono:
                return None
            self._term += 1
            self._holder = who
            self._expiry_mono = now + ttl
            return {"term": self._term, "ttl_remaining": ttl}

    def renew(self, who, term, ttl):
        with self._lock:
            now = time.monotonic()
            if self._holder != who or self._term != term or now >= self._expiry_mono:
                return None
            self._expiry_mono = now + ttl
            return {"term": term, "ttl_remaining": ttl}

class LeaseClient:
    def __init__(self, store, who, ttl, sigma):
        self.store, self.who, self.ttl, self.sigma = store, who, ttl, sigma
        self.term, self.local_deadline = None, 0.0

    def acquire(self):
        r = self.store.grant(self.who, self.ttl)
        if r is None:
            return False
        self.term = r["term"]
        self.local_deadline = time.monotonic() + r["ttl_remaining"] - self.sigma
        return True

    def renew(self):
        r = self.store.renew(self.who, self.term, self.ttl)
        if r is None:
            return False
        self.local_deadline = time.monotonic() + r["ttl_remaining"] - self.sigma
        return True

    def is_valid(self):
        return time.monotonic() < self.local_deadline

# demo
store = LeaseStore()
c = LeaseClient(store, "node-A", ttl=2.0, sigma=0.3)
assert c.acquire()
print(f"acquired term={c.term}, valid={c.is_valid()}")
time.sleep(0.5); print(f"after 0.5s valid={c.is_valid()}")
c.renew(); print(f"renewed valid={c.is_valid()}")
time.sleep(2.0); print(f"after 2.0s no-renew valid={c.is_valid()}")  # must be False
```

Sample output:

```
acquired term=1, valid=True
after 0.5s valid=True
renewed valid=True
after 2.0s no-renew valid=False
```
Walking the load-bearing lines: `time.monotonic()` is used everywhere — the store's deadline, the client's local deadline, the validity check. Wall-clock time appears nowhere, by design. `self.local_deadline = time.monotonic() + r["ttl_remaining"] - self.sigma` is the safety inequality made code: the client subtracts `sigma` (the safety margin) from the nominal deadline, ensuring its `is_valid()` returns False before the store believes the lease has expired. `if self._holder != who or self._term != term` is the store-side guard against a stale renewal from an old term — if the holder has been replaced (term incremented), the old renewal silently fails. `time.sleep(2.0)` then `is_valid() == False`: the demo proves the lease lapses without renewals, even though the client never explicitly released.

Why the term check matters: the holder name alone is not enough — a node can be granted the lease, lose it (term=2), and be granted it again later (term=3). A renewal request from term=2 that arrives during term=3 must fail, otherwise term=2's stale view of "I'm still the holder" would silently extend the term=3 lease. The term is the linearisation point that makes lease state safe across re-grants.
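The demo renews by hand; a production client runs the renewal scheduler as a background thread with the hard rule from the previous section baked in. A minimal, self-contained sketch of that loop — the cadence and backoff values are illustrative, and `renew_fn` stands in for a real renewal RPC:

```python
import threading, time

def renewal_loop(renew_fn, local_deadline_fn, interval, stop_event):
    """Renew at a fixed cadence; on failure, retry until the local deadline.

    renew_fn          -> bool   attempts one renewal, True on success
    local_deadline_fn -> float  the holder's monotonic safety deadline
    interval                    renewal cadence, canonically TTL / 3
    """
    while not stop_event.wait(interval):
        while not renew_fn():
            # Hard rule: never let the local deadline pass without a
            # successful renewal. If it does, cease lease-protected work.
            if time.monotonic() >= local_deadline_fn():
                return
            time.sleep(0.05)  # brief retry backoff (illustrative)

# Usage sketch with a renew_fn that always succeeds:
deadline = time.monotonic() + 0.5
stop = threading.Event()
t = threading.Thread(target=renewal_loop,
                     args=(lambda: True, lambda: deadline, 0.1, stop))
t.start()
time.sleep(0.35)   # a few renewal ticks happen in the background
stop.set()
t.join()
print("scheduler exited cleanly:", not t.is_alive())  # True
```

The structure mirrors the rule stated above: the only two exits are a stop request (planned shutdown) and a blown deadline (the holder must stop acting).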
Lease handover — the planned shutdown that costs nothing
A lease's worst-case failure mode is the holder dying without warning: the cluster waits the full TTL before re-granting, during which there is no leader. For an unplanned crash this is unavoidable. For a planned event — a deploy, a rolling restart, a SIGTERM during a Kubernetes pod eviction — the cost is gratuitous. The fix is lease handover: the outgoing holder explicitly releases its lease, and the lease store immediately allows the next acquirer to grant a fresh lease.
The handover protocol is three steps:

- The outgoing holder calls `store.release(who, term)` — a conditional write that clears the lease only if the caller is still the current holder.
- The store atomically clears the holder, sets `expiry = 0`, and increments the term.
- Any node attempting to acquire sees an empty lease and grants itself one immediately.
The expensive trick is making sure the outgoing holder finishes its in-flight work before releasing. If a deploy script SIGTERMs the holder mid-batch and the lease library calls release() on shutdown, the in-flight work either (a) gets cancelled — losing data — or (b) continues running after the release — and another node, holding a fresh lease, may run the same work. Both outcomes are bugs. The correct pattern: SIGTERM triggers a "drain" — stop accepting new work, finish in-flight work, then release the lease, then exit. Kubernetes' terminationGracePeriodSeconds (default 30s) is the budget for this drain; setting it shorter than the work latency is a corruption bug waiting to happen. Why "stop accepting new work" comes before "finish in-flight work": if the holder keeps accepting new work right up to the moment of release, the work that arrives in the last 100 ms cannot complete before release. Either it's truncated (data loss) or it runs after release (overlap with the new holder). The drain must be a strict pipeline: close the input first, then drain the queue, then release. This is the same shape as a graceful HTTP server shutdown — close the listener, wait for in-flight requests, then exit. The lease is the listener.
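The strict pipeline — close the input, drain the queue, release, exit — can be written down directly. A minimal sketch; the queue, `process`, and `release_lease` are stand-ins for real work and a real lease API:

```python
import queue, time

accepting = True          # producers must check this flag before enqueueing
work_queue = queue.Queue()
done = []                 # records the order of operations for the demo

def process(item):
    done.append(item)

def release_lease():
    done.append("released")   # stand-in for store.release(who, term)

def on_sigterm(drain_timeout=5.0):
    """SIGTERM handler body: the drain pipeline, in strict order."""
    global accepting
    accepting = False                       # 1. close the input first
    deadline = time.monotonic() + drain_timeout
    while time.monotonic() < deadline:      # 2. drain in-flight work
        try:
            process(work_queue.get_nowait())
        except queue.Empty:
            break
    release_lease()                         # 3. release only after the drain

work_queue.put("batch-1"); work_queue.put("batch-2")
on_sigterm()
print(done)  # ['batch-1', 'batch-2', 'released']
```

The ordering is the whole point: `release_lease()` appears last, so nothing that ran under this lease can overlap with the next holder's work.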
A war story — CricStream's 11.7-second leader gap
The 18:42 IST outage at CricStream was a textbook case of mechanics gone wrong. The video-distribution coordinator runs on three nodes (cricstream-coord-1/2/3) with etcd providing a 10-second lease. Renewal cadence had been configured at TTL/2 = 5s, on the (mistaken) reasoning that renewing at half the TTL left ample slack before expiry. The deploy that morning had pushed a new version of the coordinator with a slightly heavier startup path — 800 ms longer to warm up the in-memory edge-region cache.
At 18:41:58, cricstream-coord-2 (the holder) received a SIGTERM as part of a rolling deploy. The pod's preStop hook had no lease.release() call — it just relied on TTL expiry. The pod terminated cleanly at 18:42:02. The lease, granted at 18:41:54 with TTL=10s, was scheduled to expire at 18:42:04. Etcd's lease-watcher revoked the lease at 18:42:04.117 IST.
Now the lost-packet problem. cricstream-coord-1 and cricstream-coord-3 were both running their lease-acquisition retry loop with a 1-second backoff. Both saw the lease as held until 18:42:04. Both started attempting to acquire at 18:42:04. The first attempts collided in etcd's transaction queue and one was rejected; the second batch of attempts hit a transient network blip — a switch port flap on the rack-top switch — and were dropped. The retry backoff pushed the next attempt to 18:42:05. That attempt succeeded for coord-3, which was granted the lease at 18:42:05.412.
But coord-3 then had to start up the coordinator role — load the edge-region cache, register itself with the upstream traffic shaper, validate its own state. That took 8.3 seconds. The coordinator was usable only at 18:42:13.7. Between 18:42:02 (last action by coord-2) and 18:42:13.7 (first action by coord-3), the cluster had no leader for 11.7 seconds, during which 23 million viewers received stale CDN-edge assignments and 4.2 million reconnection attempts piled up in the upstream load balancer.
The fix had three parts. (1) The preStop hook now calls lease.release() and waits up to 5 seconds for the new holder to be elected before exiting. This collapses the unplanned-crash gap (10s) into the handover gap (~50ms) for planned events. (2) Renewal cadence moved from TTL/2 to TTL/3 — more renewals per minute, but an extra retry within each TTL window, so transient packet loss does not trigger spurious lease loss. (3) The coordinator's startup path was split into a "ready to acquire" phase and a "ready to lead" phase; the latter happens while the lease is being acquired, in parallel, so the leadership gap is the longer of the two paths rather than their sum. Why parallelising startup and acquisition matters: the lease acquisition is a Raft commit, ~6 ms intra-DC. The cache warmup was 8 seconds. Doing them sequentially gives an 8-second-plus-acquire gap. Doing them in parallel — start cache warmup speculatively on every replica all the time, only acquire the lease when warmup is done — gives an acquire-only gap. The cost is some wasted CPU on non-leader replicas; the benefit is a sub-second leader gap. This is the same pattern as JIT compilation in a HotSpot JVM: pay the warmup cost continuously in the background, so the actual switchover is free. The post-fix leader gap dropped from 11.7 seconds to under 200 ms.
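The "warm speculatively, acquire only when warm" shape of fix (3) is small enough to sketch. Names and timings here are hypothetical stand-ins — the warmup and acquire bodies represent the 8-second cache load and the ~6 ms Raft grant:

```python
import threading, time

warm = threading.Event()

def warm_cache():
    time.sleep(0.2)     # stand-in for the 8 s edge-region cache warmup
    warm.set()

def try_acquire():
    return True         # stand-in for the ~6 ms lease grant (a Raft commit)

# Every replica warms continuously in the background, leader or not.
threading.Thread(target=warm_cache, daemon=True).start()

# Only a warm replica contends for the lease, so the leader gap is
# acquire-only: max(warmup, acquire) instead of warmup + acquire.
warm.wait()
leading = try_acquire()
print("leading with a warm cache:", leading)
```

The wasted work is visible in the sketch: non-leader replicas run `warm_cache()` too and never use the result — that CPU is the price of the sub-second switchover.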
Common confusions
- "Shorter TTL = safer lease." Shorter TTL reduces the unplanned-crash gap but increases the renewal traffic and the probability of a spurious lapse due to transient network noise. Below TTL ≈ 3 seconds, you start losing leases on healthy holders during normal GC pauses. The right TTL is the largest value that keeps the unplanned-crash gap acceptable for your business — typically 10–30 seconds for most workloads, 5 seconds when you have aggressive failure-detector tuning, never below 3 seconds without TrueTime-style clock-skew bounds.
- "Wall-clock and monotonic clock are interchangeable." They are not. Wall-clock can step backwards (NTP corrections, leap seconds, admin clock-set), monotonic cannot. Use monotonic for the holder's local deadline; use the lease store's clock (whatever it is, but always consistently) for the store's expiry. The two never compare directly — the renewal protocol returns a duration, and the holder converts to its own monotonic deadline. See monotonic clocks.
- "A lease and a session are the same thing." Zookeeper sessions are leases with extra semantics (ephemeral znodes, session-expired notifications). Etcd leases are simpler — keys auto-delete on expiry, but there's no session-wide notification. Consul sessions are between the two. The mechanics are the same; the surface is different. Don't pick the implementation by surface; pick by what it does on expiry.
- "Renewals are free." Each renewal is a Raft commit on the lease store — 5–10 ms intra-DC, 50–200 ms cross-region. With a 5-second TTL renewed at TTL/3, a 1000-leaseholder cluster generates 600 commits per second on the lease store, which can saturate a small etcd cluster. Lease consolidation (one lease per process holding many keys) and lease piggybacking (renewing multiple leases in one RPC) are how production systems keep this manageable.
- "Kubernetes leader election just works." Kubernetes' `leaderelection` package is correct, but its defaults — 15s leaseDuration, 10s renewDeadline, 2s retryPeriod — give you a 15-second worst-case leader gap on unplanned crashes. If your control loop touches user data, that's 15 seconds of staleness. Tune these numbers; they are not magic.
- "You need consensus to implement a lease." You need a linearisable register, which is one feature of consensus but not the whole thing. A single-instance Postgres with `SELECT FOR UPDATE` can implement leases — until Postgres fails over and you discover that the failover path is not linearisable. The lease's safety is exactly as strong as the linearisability of the underlying store; consensus protocols (Raft, Paxos, Zab) are the standard way to get it.
Going deeper
Bounded clock skew vs unbounded clock skew — the Spanner trade
Most leases assume unbounded clock skew between the holder and the store, and rely on the fencing token (see the previous chapter) for safety. Spanner takes the opposite approach: it pays for atomic-clock and GPS infrastructure to bound the skew at ε = 7 ms (their published number), and uses that bound to make leases safe without fencing on the read path. The TrueTime API returns [earliest, latest] intervals rather than point estimates, and the lease holder waits out the latest - earliest interval (commit_wait) before externalising any write. With ε = 7 ms, commit_wait is ~7 ms; with NTP, it would be 100–500 ms, and the technique would be unusable. Why bounded skew makes a different mechanics work: in the unbounded-skew world, two holders can simultaneously believe they hold the lease for arbitrarily long, so safety must live at the storage layer (fencing). In the bounded-skew world, two holders can simultaneously believe they hold the lease for at most ε, so safety can live at the holder layer — wait ε before acting. The price is the infrastructure: GPS antennas in every data centre, atomic clocks, custom protocols to validate the bound. For most teams, fencing is cheaper than TrueTime.
Lease piggybacking and lease coalescing — keeping the renewal load sane
A naive lease implementation has one renewal per lease per renewal-interval. With 10,000 leases on a single application server, that's 3,000 renewals per second, which is more than the lease store can handle. Production systems coalesce: one lease object per process, holding a set of resource names, with the renewal applying to all of them atomically. Etcd's Lease does exactly this — you can attach hundreds of keys to a single lease ID, and one KeepAlive renews them all. The cost is granularity: if the process needs to release one resource without releasing the others, it can't (the lease is the unit of release). Most workloads don't care, and lease coalescing is the difference between an etcd cluster you can run on three small nodes and one that needs sharded etcd.
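Coalescing can be sketched as one lease object owning a set of resource names, so a single renewal covers them all. This is a toy model of the shape, not etcd's API:

```python
import time

class CoalescedLease:
    """One lease, many resources: a single renewal extends all of them."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.resources = set()
        self.deadline = time.monotonic() + ttl

    def attach(self, name):
        # Analogous to attaching a key to an etcd lease ID.
        self.resources.add(name)

    def renew(self):
        # One write on the store, however many resources are attached.
        self.deadline = time.monotonic() + self.ttl

    def holds(self, name):
        return name in self.resources and time.monotonic() < self.deadline

lease = CoalescedLease(ttl=5.0)
for name in ("shard-1", "shard-2", "shard-3"):
    lease.attach(name)
lease.renew()
print(lease.holds("shard-2"))   # True — one renewal covered all three
```

The granularity cost is visible too: there is no per-resource release here — dropping `shard-2` alone would require a new lease, which is exactly the trade-off described above.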
The lease's lower bound — why TTL=0 doesn't make sense
The TTL must be at least 2× the longest plausible network round-trip plus the longest plausible GC pause. With intra-DC RTT ≈ 1 ms and JVM GC pauses up to 200 ms, the floor is ~500 ms. Below that, you spend more time renewing than acting; above 2× the floor (1 second), you have an actual usable lease. In practice, 5–10 seconds is the sweet spot for intra-DC services; 30 seconds for cross-region. Lease TTL is one of those parameters where the right answer depends on which RTTs and pause distributions you're handling, and the wrong answer is invariably "whatever the example config used".
Lease loss vs lease expiry — a subtle distinction
A holder can lose its lease two ways: (a) it failed to renew before the local deadline (lease expired from the holder's perspective — the holder knows), or (b) the lease store revoked the lease while the holder was unaware (lease lost — the holder is in denial). The library must distinguish these cases on the holder's side. On expiry, the holder transitions to "no lease" cleanly — stop acting, attempt re-acquire. On loss, the holder may have been acting with a stale-but-still-locally-valid lease, and must invalidate any work it did since the last successful renewal. Most lease libraries handle (a) correctly and (b) incorrectly; the test for (b) is "what happens when the network drops the renewal response but the renewal succeeded on the store?" — many libraries decrement a retry counter and re-try, then time out, then declare loss, but during the retry period the holder kept acting. The clean fix is a strict invariant: act only between successful renewals, never during the retry phase.
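The "act only between successful renewals, never during the retry phase" invariant can be enforced with a small state guard. A sketch — the state names are this chapter's, not any particular library's:

```python
import time

HELD, RETRYING, LOST = "held", "retrying", "lost"

class HolderState:
    """Track lease state on the holder side; allow acting only while HELD."""
    def __init__(self, ttl, sigma):
        self.ttl, self.sigma = ttl, sigma
        self.state = LOST
        self.deadline = 0.0

    def on_renew_ok(self):
        self.deadline = time.monotonic() + self.ttl - self.sigma
        self.state = HELD

    def on_renew_failed(self):
        # A failed attempt moves us to RETRYING: the lease may still be
        # valid on the store, but we must not act while uncertain.
        self.state = RETRYING
        if time.monotonic() >= self.deadline:
            self.state = LOST   # expiry from the holder's perspective

    def may_act(self):
        return self.state == HELD and time.monotonic() < self.deadline

s = HolderState(ttl=2.0, sigma=0.3)
s.on_renew_ok()
print(s.may_act())      # True — between successful renewals
s.on_renew_failed()
print(s.may_act())      # False — retry phase: no acting while uncertain
```

The guard makes case (b) harmless by construction: even if the store revoked the lease while a renewal response was lost, the holder stopped acting at the first failed attempt, so there is no post-revocation work to invalidate.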
Reproduce the demo
```shell
python3 lease_client_demo.py
# Then set sigma=0.0 (no safety margin) and insert time.sleep(1.95)
# before the renew — observe that without sigma the lease occasionally
# lapses on the store while the client still believes it holds it.
# Increase ttl to 5.0 and observe the renewal cadence math.
```
Where this leads next
The next chapter studies Chubby and the lock-service pattern — Burrows' canonical paper on a service that only provides leases, and the architectural choices that arise when leases are the entire product. After that, a chapter on the lease holder's responsibility develops the operational discipline of a holder that knows it is one — graceful drain, coordinated handover, and the side-effects layer.
Two structural ideas unify everything that follows. First, the lease's mechanics are quantitative — every lease bug is ultimately a violated inequality between the five numbers, and every fix is restoring the inequality. Second, the lease's mechanics are separable from the lease's purpose. Whether the lease guards a leader role, a cron job, a shard-owner assignment, or a session, the TTL/renewal/safety math is the same. Master the mechanics once; apply them to every lease-shaped problem you ever encounter.
The deepest lesson is that leases are the most-used and least-understood primitive in distributed systems. Engineers reach for them constantly — in Kubernetes, in Kafka brokers, in custom cron schedulers, in distributed-lock libraries, in service-discovery registries — and the bugs are always the same shape: someone treated TTL as the only number, ignored the safety margin, skipped handover, used wall-clock instead of monotonic. The mechanics in this chapter are the floor; everything Part 9 builds on top of them is a refinement of the same arithmetic.
References
- Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — the canonical lease-service paper, with §2.4 on TTL choice.
- Corbett, J. et al. — "Spanner: Google's Globally-Distributed Database" (OSDI 2012) — leases everywhere, with TrueTime as the bounded-skew variant.
- Gray, C., Cheriton, D. — "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency" (SOSP 1989) — the original lease paper, with the safety/availability inequality formalised.
- Junqueira, F. et al. — "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010) — session/lease semantics in production.
- etcd Authors — "etcd Lease API and KeepAlive semantics" (etcd.io documentation) — the concrete API used by Kubernetes leader election.
- Kubernetes Authors — "Leader election in client-go" (kubernetes/client-go) — production reference implementation; read its renewal scheduler.
- Wall: consensus is expensive — leases are cheap — the previous chapter; framing for why leases exist.
- Monotonic clocks — the clock primitive every lease library depends on for the holder-side deadline.