Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Chubby and the lock-service pattern
It is a Tuesday morning at PaySetu and Sneha is trying to explain to a new hire why the company's payment-orchestrator cluster — fourteen stateless services, three Kafka clusters, two Postgres primaries, and a Redis fleet — depends on a five-node service that handles fewer than 200 requests per second. The new hire's question is reasonable: "Why does this tiny thing exist? Can't each service just elect its own leader?" Sneha's answer takes the better part of an hour, and ends with the observation that the lock service is the only piece of infrastructure she has never been paged for in three years. The lock service is Chubby's grand-nephew. The reason it is boring is that Burrows and the Chubby team, in 2006, wrote down what a lock service should be, and almost everything that came after — Zookeeper, etcd, Consul — is a refinement of those choices. This article is about those choices: the coarse-grained-only API, the file-system surface, the lease-and-session model, the deliberate decision to be slow, and the rule that the lock service is the only piece of consensus infrastructure most applications should ever talk to.
Chubby is a Paxos-replicated lock service that exposes a tiny file-system-like API and hands out coarse-grained leases to thousands of clients via long-lived sessions. Its three load-bearing design choices — coarse-grained locks only, a file-system namespace, and the lease-plus-session model — turn what could be a high-throughput consensus engine into a low-throughput, high-availability advisor that everything else in the cluster depends on. The lock-service pattern is the architectural decision to centralise all consensus into one boring service rather than reinvent it in every application; almost every modern coordination primitive (leader election, configuration store, service discovery) is a thin shell over this pattern.
What Chubby is, and what it deliberately is not
Chubby is a five-node service. Five nodes — always five, never three, rarely seven — running Paxos to maintain a replicated state machine. The state machine is a small file system: directories, files, and metadata, where files are small — typically well under a kilobyte — and capped at 256 KB. Clients open files, lock them, write small blobs into them, and watch them for changes. That is the entire surface.
It is not a database. It is not a queue. It is not a configuration system, except that people use it as one. It is not a leader-election library, except that people use it as one. It is, in the paper's own framing, a service for coarse-grained synchronisation that incidentally provides a small amount of reliable storage. The word "incidentally" is doing a lot of work in that sentence. The surface is a file system because file-system semantics — open, read, write, lock, watch — are a familiar shape that most engineers can use correctly without reading a paper. The implementation is a state machine because state machines are what Paxos protects.
The first deliberate non-feature is fine-grained locking. A naive lock service might allow millions of locks held for milliseconds — protecting individual rows, individual cache entries, individual queue elements. Chubby refuses this. Chubby locks are coarse-grained: a lease over a service's leadership role, not a lock over a single row. A lock might be held for hours or days. The acquire/release rate at the lock service is thousands of ops/sec at most, even when the cluster behind it does millions of qps. Why coarse-grained-only is a feature, not a limitation: a fine-grained lock service must scale linearly with the application's request rate, which means it becomes the application's bottleneck. A coarse-grained lock service is decoupled — it sits underneath the application and rarely sees traffic at all. The paper describes a busy cell serving tens of thousands of clients while the master sees only a few thousand RPCs per second, the overwhelming majority of them KeepAlives — orders of magnitude below the request rates of the services relying on it. The team made this trade explicit: if you need fine-grained locks, build them on top, in your application, using whatever protocol fits. Don't put millions of locks on the consensus path.
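A back-of-envelope sketch makes the decoupling concrete. The numbers below are hypothetical, chosen only to show the orders of magnitude; nothing here comes from the paper.

    # Hypothetical numbers: what the lock service sees under each granularity.
    app_qps = 1_000_000        # application requests/sec across the whole cluster
    roles = 200                # services/roles that need a coarse-grained lease
    lease_ttl = 12.0           # seconds
    renew_every = lease_ttl / 3

    coarse_ops = roles / renew_every     # one renewal per role every few seconds
    fine_ops = app_qps * 2               # an acquire and a release per guarded request

    print(f"coarse-grained: {coarse_ops:.0f} lock-service ops/sec")
    print(f"fine-grained:   {fine_ops:,.0f} lock-service ops/sec")

The coarse-grained number stays flat as the application grows; the fine-grained number tracks the application's own traffic, which is exactly the coupling Chubby refuses.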
The second deliberate non-feature is performance. Chubby is not fast and is not trying to be. Each operation pays a Paxos round (a few milliseconds intra-DC, tens of ms cross-region), and the team explicitly says they could make it faster but chose not to, because faster would invite misuse. A slow lock service is a lock service no one tries to put on the hot path.
The file-system surface — why a familiar API beat a new one
Chubby's API is a file system. A path is /ls/cell-name/path/to/node, where /ls is the namespace prefix ("lock service") and cell-name selects which Chubby cell — Google ran one cell per data centre plus a global cell. A node is either a directory or a file. Files have content (small blobs), an ACL, sequence numbers, and a few metadata fields. Operations are Open, Close, GetContents, SetContents, Acquire, TryAcquire, Release, GetSequencer, SetSequencer, CheckSequencer, plus event subscriptions — the watch mechanism — registered when a node is opened. That is roughly all of it.
The file-system shape was not the obvious choice. The team considered exposing a Paxos library directly — let applications run their own state machines, with Chubby providing only the consensus primitive. They rejected this for two reasons. First, every application that ran Paxos would need to debug it, and consensus debugging is famously hard. Centralising the consensus into one well-tested service amortises the engineering across the company. Second, applications wanted advisory guarantees more than they wanted the full power of consensus — "is this the leader?" or "fetch this small config" or "wake me up when this node changes" — and a file-system surface answers all three with familiar verbs. Why a file system rather than a key-value store: file systems give you free hierarchy, free naming conventions, free namespace partitioning by ACL, and free directory-watch semantics. A flat key-value store would have re-invented all of these as conventions on top of keys (think Redis's service:foo:leader naming) and lost the hierarchical ACL story. Zookeeper inherited this design choice (znodes form a tree); etcd v2 inherited it; etcd v3 went flat KV but added prefixes to recover the hierarchy. The file-system surface keeps coming back because it matches how engineers think about coordination state.
A typical use looks like this. To elect a leader, every candidate opens /ls/payments/orchestrator/leader with Acquire semantics. Exactly one candidate succeeds — it now holds the lock. The successful candidate writes its address into the file's contents (SetContents). Every other process in the cluster opens the same file with GetContents and a watch. When the leader changes, the watch fires, every reader re-fetches, and traffic shifts. The leader renews its lock by keeping its session alive (KeepAlive), not by re-Acquire-ing — the lock is held for the lifetime of the session.
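The sequence is easy to make concrete with a toy in-memory stand-in. The ToyCell class below is invented for illustration; a real candidate would issue the same calls (TryAcquire, SetContents, GetContents plus a watch) through the client library over RPC, and the lock would be held via the session rather than a field.

    class ToyCell:
        # In-memory stand-in for one Chubby file, just to show the call sequence.
        def __init__(self):
            self.owner, self.contents, self.watchers = None, b"", []
        def try_acquire(self, session_id):
            if self.owner is None:          # exactly one candidate wins the lock
                self.owner = session_id
                return True
            return False
        def set_contents(self, session_id, data):
            assert self.owner == session_id, "only the lock holder publishes"
            self.contents = data
            for notify in self.watchers:    # watch fires; readers re-fetch
                notify()
        def get_contents(self, watch=None):
            if watch is not None:
                self.watchers.append(watch)
            return self.contents

    leader_file = ToyCell()
    for candidate in ("orch-a:9090", "orch-b:9090", "orch-c:9090"):
        if leader_file.try_acquire(candidate):
            leader_file.set_contents(candidate, candidate.encode())
    follower_view = leader_file.get_contents(watch=lambda: print("leader changed"))
    print("leader is", follower_view.decode())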
A second typical use is configuration. The control plane writes a small config blob to /ls/payments/config/feature-flags. Every payment service in the cluster opens that path with a watch, fetches once, and waits for change notifications. The config is replicated via Paxos across all five Chubby replicas, and the read latency on the client side is zero — the client library caches aggressively, and the watch fires on change. This is how Google ran most of its config distribution before dedicated config systems existed.
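A sketch of the client-side half of that pattern — fetch once, serve reads from the local cache, re-fetch only when the watch fires. The fetch/register_watch pair below stands in for a GetContents-with-watch call against the lock service; both names, and the toy usage at the bottom, are assumptions made for illustration.

    import json, threading

    class ConfigWatcher:
        # Fetch once at startup, then re-fetch only when the watch fires.
        def __init__(self, fetch, register_watch):
            self._fetch = fetch
            self._mu = threading.Lock()
            self._cached = json.loads(fetch())
            register_watch(self._on_change)          # server pushes change events
        def _on_change(self, _event=None):
            with self._mu:
                self._cached = json.loads(self._fetch())   # always re-read on fire
        def get(self, key, default=None):
            with self._mu:                           # zero-RPC read from local cache
                return self._cached.get(key, default)

    # Toy usage with an in-memory stand-in for the lock service:
    flags = {"value": '{"dual_write": false}'}
    watchers = []
    w = ConfigWatcher(lambda: flags["value"], watchers.append)
    print(w.get("dual_write"))        # False, served from the local cache
    flags["value"] = '{"dual_write": true}'
    for notify in watchers:           # the "watch fires"; the client re-fetches
        notify()
    print(w.get("dual_write"))        # True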
Sessions and KeepAlives — the lease pattern, scaled
Every Chubby client has a session. A session is a long-lived relationship between a client and the master, parameterised by a lease: an interval during which the master promises not to unilaterally revoke the client's locks, watches, or open files. The lease is the same five-numbers primitive from the previous chapter on lease mechanics — TTL, renewal interval, clock skew, network RTT, safety margin — but operated at session granularity rather than per-lock granularity. One session lease covers everything the client holds.
The KeepAlive protocol is asymmetric in an unusual way. The client sends a KeepAlive to the master. The master is allowed to block the response — to hold the call open — until either (a) the lease is about to expire and needs extending, or (b) the master has an event to deliver to the client. This means the KeepAlive is both a renewal and a long-poll: a single open call that the master uses to push notifications when watches fire or sessions need attention.
The lease has two views. The master's view is when the master believes the lease will expire. The client's view is when the client believes the lease will expire. The client's view is always strictly earlier than the master's view, by an amount that bounds the maximum clock skew and network RTT. Why the client believes earlier: the client cannot know exactly when the master granted the lease — its KeepAlive request was sent at one time and the response arrived at another, with an unknown amount of time spent on the master in between. The conservative encoding is for the client to treat the grant as having happened at the moment it sent the request — the earliest instant the master could possibly have acted on it — and start its lease clock from there, shaving off a further allowance for clock-rate skew. The master, conversely, knows it granted at exactly the moment of commit. The client's view of the lease deadline therefore trails the master's by up to RTT + ε_clock, ensuring the client stops trusting the lease before the master has revoked it. This asymmetry is the safety inequality made into a session protocol.
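A minimal sketch of the client-side computation. Here send_keepalive stands in for the blocking RPC and skew_rate_bound is a fractional clock-rate allowance; both are assumptions for illustration, not Chubby's actual client code.

    import time

    def renew(send_keepalive, ttl_s, skew_rate_bound=0.01):
        # Conservative client-side view of the lease deadline, as described above.
        t_send = time.monotonic()          # earliest instant the master could grant
        send_keepalive()                   # blocks until the master extends by ttl_s
        # Start the lease clock at the send time, not the receive time, and shave
        # off an allowance for the client's clock running fast relative to the master.
        return t_send + ttl_s * (1.0 - skew_rate_bound)

    # The master's own deadline is grant_time + ttl_s with grant_time >= t_send,
    # so the value returned here is always at or before the master's deadline.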
If a KeepAlive fails — network blip, master crash, master overload — the client enters a jeopardy state. It does not immediately give up; it keeps trying KeepAlives for a configurable grace period (typically 45 seconds). During jeopardy, the client treats its locks and watches as still potentially valid but does not externalise any state-changing actions that depend on them. If the KeepAlive succeeds before the grace period ends — perhaps because the master failed over and a new master came up — the client recovers its session intact. If the grace period ends without recovery, the session is expired and all locks are released, all watches lost, all open files closed. The client can re-establish a session and re-open its files, but the application above is told "you lost everything" and must rebuild its state.
A runnable simulation of the session protocol
The following Python implements a tiny Chubby-like lock service and a driver script that exercises it. The server enforces session leases and lock-via-session semantics; the demo shows the happy path of repeated renewals, then what happens when KeepAlives stop for longer than the lease. (The client-side jeopardy and recovery dance is sketched separately after the demo.)
# chubby_lite.py — a toy lock service with sessions, KeepAlives, and session expiry
import time, threading

class ChubbyLite:
    def __init__(self, lease_ttl=4.0):
        self.lock = threading.Lock()
        self.lease_ttl = lease_ttl
        self.sessions = {}   # session_id -> lease deadline (monotonic seconds)
        self.locks = {}      # path -> owning session_id
        self.next_sid = 0

    def open_session(self):
        with self.lock:
            self.next_sid += 1
            sid = self.next_sid
            self.sessions[sid] = time.monotonic() + self.lease_ttl
            return sid, self.lease_ttl

    def keep_alive(self, sid):
        with self.lock:
            if sid not in self.sessions or time.monotonic() >= self.sessions[sid]:
                return None  # session expired
            self.sessions[sid] = time.monotonic() + self.lease_ttl  # the renewal contract
            return self.lease_ttl

    def acquire(self, sid, path):
        with self.lock:
            if sid not in self.sessions or time.monotonic() >= self.sessions[sid]:
                return False
            if path in self.locks and self.locks[path] != sid:
                return False  # held by another live session
            self.locks[path] = sid
            return True

    def expire_loop(self):
        # The master's background reaper: evict expired sessions, drop their locks.
        while True:
            time.sleep(0.5)
            with self.lock:
                now = time.monotonic()
                dead = [s for s, d in self.sessions.items() if now >= d]
                for s in dead:
                    del self.sessions[s]
                    for p, owner in list(self.locks.items()):
                        if owner == s:
                            del self.locks[p]

# demo
if __name__ == "__main__":
    srv = ChubbyLite(lease_ttl=2.0)
    threading.Thread(target=srv.expire_loop, daemon=True).start()
    sid, ttl = srv.open_session()
    srv.acquire(sid, "/ls/payments/leader")
    print(f"client opened sid={sid}, holds /ls/payments/leader")
    for tick in range(4):
        time.sleep(0.7)
        new_ttl = srv.keep_alive(sid)
        print(f"t={tick*0.7+0.7:.1f}s keepalive -> ttl={new_ttl}")
    print("now blocking keepalives for 3.0s (simulated network blip)...")
    time.sleep(3.0)
    result = srv.keep_alive(sid)
    print(f"after blip keepalive -> {result} (None = session expired)")
    print(f"lock still held? {srv.locks.get('/ls/payments/leader')}")
# Sample output
client opened sid=1, holds /ls/payments/leader
t=0.7s keepalive -> ttl=2.0
t=1.4s keepalive -> ttl=2.0
t=2.1s keepalive -> ttl=2.0
t=2.8s keepalive -> ttl=2.0
now blocking keepalives for 3.0s (simulated network blip)...
after blip keepalive -> None (None = session expired)
lock still held? None
The load-bearing lines: time.monotonic() is used everywhere on the server side — the only honest clock when sessions are at stake. self.sessions[sid] = time.monotonic() + self.lease_ttl extends the lease on every successful KeepAlive; this is the renewal contract. if path in self.locks and self.locks[path] != sid: return False is the conditional-acquire — locks are tied to sessions, not raw client identities, so a session expiry releases locks atomically. The expire_loop is the master's background reaper: every 500 ms it walks the session table and evicts expired ones, dropping their locks as a side effect. Why locks are tied to sessions, not to processes: a process can die and another process on the same machine can come up with the same identity (PID reuse, container restart). Tying the lock to a session ID — opaque, monotonically allocated, never reused — means lock ownership is unambiguous even across process death and restart. The session is the trust boundary; the lock is just metadata attached to it. Zookeeper inherits this exactly: znodes have an "ephemeral" flag that ties their lifetime to a session, and the session ID is the unit of authentication.
The demo proves three properties. (1) The session lease holds across multiple KeepAlives — the lock survives the renewal cycle. (2) When KeepAlives stop arriving, the session expires after lease_ttl seconds and the lock is automatically released. (3) The application-side recovery — re-opening a session, re-acquiring the lock — is the client's responsibility. Chubby's library handles the first two; the application handles the third.
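A sketch of what that application-side handling might look like. The assumptions are the same as in the toy server above: keep_alive() returns a fresh TTL on success, returns None once the master has expired the session, and raises OSError when the master is unreachable; the 45-second grace period mirrors the one described earlier, and none of this is Chubby's actual client library.

    import time

    def keepalive_loop(keep_alive, grace_s=45.0):
        # Client-side jeopardy handling (illustrative only).
        while True:
            try:
                new_ttl = keep_alive()
            except OSError:
                # Jeopardy: keep trying quietly, but treat held locks as suspect
                # and externalise nothing that depends on them until recovery.
                print("jeopardy: keepalive failed, entering grace period")
                deadline = time.monotonic() + grace_s
                new_ttl = None
                while time.monotonic() < deadline:
                    time.sleep(0.5)
                    try:
                        new_ttl = keep_alive()   # may reach a freshly failed-over master
                        break
                    except OSError:
                        continue
            if new_ttl is None:
                print("session expired: locks lost, caller must rebuild its state")
                return
            time.sleep(new_ttl / 3)              # renew well before the lease lapses

Wired against the ChubbyLite toy with keep_alive=lambda: srv.keep_alive(sid), the OSError branch never fires — the toy has no network — but against a real master it is the common case during fail-over.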
A war story — PaySetu's lock-service migration
PaySetu's payment-orchestrator originally ran with a homegrown leader-election scheme: each of fourteen orchestrator nodes maintained a row in a Postgres table, with a periodic UPDATE-IF-NEWER claiming "I am the leader, valid until T+30s". The mechanism worked on paper. In practice, Postgres failovers (twice in 2024) and connection-pool exhaustions (six times) led to scenarios where two orchestrator nodes simultaneously believed they were the leader, for windows ranging from 800 ms to 11 seconds. The cost was rare but expensive: duplicate UPI debit attempts, duplicated webhook fires, and in one case a customer charged twice for the same ₹284 auto-rickshaw ride.
The fix took two engineers four months and replaced the Postgres-row scheme with a small Chubby-shaped service running on five etcd nodes. The orchestrator now opens a session and acquires /payments/orchestrator/leader. The session's lease is 10 seconds, renewed every 3.3 seconds, with a 45-second jeopardy window. Every action that requires leadership — UPI initiation, webhook dispatch, settlement reconciliation — is gated by a CheckSession() call that takes <1 ms locally (the session validity is cached client-side and refreshed on KeepAlive responses).
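The gate has roughly the following shape. CachedSession, requires_leadership, and dispatch_webhook are invented names for illustration — not PaySetu's code and not etcd's API — but the structure is the point: the check is a local clock read against a cached validity deadline, refreshed whenever a KeepAlive response arrives.

    import functools, time

    class CachedSession:
        # Client-side cache of session validity, refreshed on KeepAlive responses.
        def __init__(self):
            self.valid_until = 0.0
        def renewed(self, ttl_s):
            self.valid_until = time.monotonic() + ttl_s
        def check(self):                    # <1 ms: a clock read, no RPC
            return time.monotonic() < self.valid_until

    session = CachedSession()

    def requires_leadership(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not session.check():
                raise RuntimeError("leadership lease not valid; refusing to act")
            return fn(*args, **kwargs)
        return wrapper

    @requires_leadership
    def dispatch_webhook(event):
        ...                                 # runs only while the lease is trusted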
The migration's hardest moment was not the lock service itself — etcd is well-trodden ground — but the audit. The team had to find every code path that assumed leadership and gate it on the session. There were 34 such paths. Three of them were in third-party libraries that the team had to fork. One was in a cron job that nobody had touched in five years. The migration's engineering value was less the lock service and more the enumeration — the list of "things that depend on me being the leader" became a maintained artefact, refreshed every release. Why enumeration matters more than the lock service: a lock service is mechanically simple — three primitives, well-known semantics. The hard part is knowing which actions in a million-line codebase depend on holding the lock. Production lock-service incidents are rarely the lock failing; they're an ungated code path that escaped the audit. Chubby's coarse-grained API forces this enumeration because each lock guards a role, not a row, and the role's responsibilities must be made explicit to use the lock at all. A fine-grained lock service hides this — it lets developers add locks ad-hoc, and the codebase ends up with thousands of locks no one understands. Chubby's API actively discourages that pattern.
In the 18 months post-migration, the dual-leader incident count dropped from "8 per year" to zero. The lock-service cluster has been paged once — for a disk-full alarm on the etcd data directory, fixed in 12 minutes with no client-visible impact. Sneha's claim that she has never been paged for the lock service is approximately, but not exactly, true.
Common confusions
- "Chubby is for fine-grained locks like Redis SETNX." It is the opposite. Chubby's lease and session model is deliberately too expensive for fine-grained locking. If you reach for Chubby (or any lock-service descendant) and find yourself acquiring locks at thousands of QPS per client, you are using the wrong primitive. Use a database row lock, a Redis lease with fencing token, or an in-process mutex.
- "Chubby and Zookeeper are identical." They are siblings. Both expose a hierarchical namespace, both use sessions with leases, both watch nodes. Differences: Chubby uses Paxos, Zookeeper uses Zab; Chubby's notification model is push-on-KeepAlive, Zookeeper's is one-shot watches that re-arm on read; Chubby has explicit lock primitives, Zookeeper has ephemeral znodes (you build locks with a sequence-number convention). The patterns transfer; the APIs do not.
- "The lock service must be highly available, so make it big." It must be highly available, and "big" makes it less available. Chubby cells were five nodes, never more; etcd recommendations are 3 or 5, never 7 in production. More nodes means slower Paxos rounds (the leader waits for a majority) and more failure points. Run multiple cells if you need scale, not bigger cells.
- "Watches are reliable notifications." They are not. Watches in Chubby and Zookeeper are hint-shaped: the server promises to fire a watch when the watched node changes, but the client may also receive a watch fire spuriously (e.g., on session reconnect). The contract is "if the data changed, you'll be told" not "every fire corresponds to a unique change". Always re-read on watch fire; never assume the watch fire encodes the new state.
- "Chubby gives you strong consistency, so I can use it as my database." You can read data from Chubby with linearisable consistency, and many teams use it for tiny configs. But the file size cap (256 KB), low write throughput (~hundreds of writes/sec), and the principle of "lock service, not database" make it a bad choice for anything beyond pointers, leadership state, and small flags. If your config blob is growing past a kilobyte, store the blob in object storage and put the pointer in the lock service.
- "Sessions are equivalent to TCP connections." They are decoupled. A session can survive multiple TCP reconnects (during master failover, the client's session ID is portable to the new master). A TCP connection can drop and the session lives on, jeopardised but recoverable. The session is a logical agreement; the TCP connection is the transport. Modern lock-service libraries (etcd, Zookeeper, Consul) all expose this distinction.
Going deeper
The OSDI 2006 paper — why it still reads as fresh in 2026
Burrows' paper has aged remarkably well because the problem it describes — coarse-grained coordination for clusters of thousands — has only become more universal. The paper's structure is unusual: §2 is a hard-nosed catalogue of design decisions ("we considered X, we rejected it for Y") that also walks through the cell structure, the file-system namespace, locks and sequencers, events, the API, caching, sessions and KeepAlives, and fail-over; §3 covers the mechanisms for scaling a cell (proxies and partitioning); and §4 — use, surprises, and design errors — is where most engineers go to learn how a real lock service is operated, not how it works on paper. Two passages worth re-reading: §2.1 on the rationale for building a lock service with a file-system interface, and §2.5 on the events the master pushes to clients ("file contents modified", "child node added or removed", "Chubby master failed over", "lock acquired", "conflicting lock request"). The events are an existence proof that the long-poll KeepAlive carries useful semantic load beyond renewal.
Why exactly five replicas — and never four
Paxos tolerates f failures with 2f+1 replicas. Three replicas tolerate one failure; five tolerate two; seven tolerate three. The Chubby choice was five because (a) two-failure tolerance covers the realistic correlated-failure budget of a single data centre (one node down for maintenance, one node failing during the maintenance window), and (b) five-replica Paxos rounds are still fast (~2 ms intra-DC, since the leader needs only the closer two of the four followers). Three is too risky — one node down for maintenance leaves you one failure away from outage. Seven is wasteful — it doesn't tolerate qualitatively different failure scenarios than five, and the quorum size grows. The mathematics of Paxos quorums make 5 a sweet spot for almost any availability budget; everyone since (etcd, Zookeeper, Consul) has converged to the same number for the same reasons.
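The arithmetic is worth writing down once, because it also answers the "never four" in the heading: an even-sized cluster buys no extra fault tolerance over the odd size just below it.

    # Majority-quorum arithmetic for the replica counts discussed above.
    for n in (3, 4, 5, 7):
        quorum = n // 2 + 1            # replicas a Paxos round must hear from
        tolerates = n - quorum         # equivalently (n - 1) // 2 failures
        print(f"n={n}: quorum={quorum}, tolerates {tolerates} failure(s)")
    # n=4 needs a quorum of 3 and still tolerates only 1 failure — same as n=3.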
Caching, watches, and the proxy pattern
A subtle Chubby contribution is the client-side cache with invalidations pushed back over the KeepAlive channel. Each client maintains a local cache of files it has read; the master keeps track of which files each session has cached, and pushes invalidations when the data changes. This makes reads effectively zero-latency on a cache hit while preserving consistency: the master blocks a modification until every client that may have cached the data has acknowledged the invalidation (or its cache lease has expired), so a client never serves a stale read. The downside: the master must track per-session cache state, which is O(clients × cached-files). Chubby in production deployed proxy tiers in front of the master to fan out this state — proxies cache on behalf of many clients, the master only tracks proxies. The proxy pattern is recurrent: Zookeeper has observer nodes, etcd has its gateway and gRPC proxy, Consul has its client agents. All same shape.
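A toy sketch of that master-side bookkeeping makes the O(clients × cached-files) cost visible. The class and method names are illustrative, not Chubby's.

    class InvalidationTracker:
        # Master-side map of which session has which path cached.
        def __init__(self):
            self.cached_by = {}                         # path -> set of session ids
        def record_read(self, sid, path):
            self.cached_by.setdefault(path, set()).add(sid)
        def sessions_to_invalidate(self, path):
            # The real master blocks the write until these sessions acknowledge
            # the invalidation (or their cache leases expire); a proxy tier
            # shrinks this set from "every client" to "every proxy".
            return self.cached_by.pop(path, set())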
What Chubby got wrong, in retrospect
The paper notes a few rough edges that newer designs have improved on. (1) The Chubby session model assumes a single client process; multi-process applications had to share sessions awkwardly or reconnect repeatedly. (2) The watch model — fired through the long-poll KeepAlive — coupled session activity to notification delivery, making it hard to bound notification latency under load. etcd's gRPC streaming watch decouples these. (3) Coarse-grained-only is correct as a rule, but the system gave no help to applications that needed fine-grained locking on top — every team reinvented its own fine-grained protocol over Chubby leases. Modern designs (etcd's Lock and Election recipes, Zookeeper's Curator recipes) ship higher-level primitives so each team doesn't reinvent them.
Reproduce the demo
python3 chubby_lite.py
# Vary lease_ttl from 1.0 to 10.0 and observe how the renewal cadence changes
# the survivability against the simulated 3.0s blip.
# Open a second client (sid=2) trying to acquire /ls/payments/leader during
# the blip — verify it is granted only after sid=1's session expires.
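One way to script the second-client experiment, assuming the listing above is saved as chubby_lite.py with its demo under the __main__ guard (so importing it is side-effect free). Timings are approximate because the toy reaper only runs every 0.5 seconds.

    # second_client.py — sid=2 contends for the lock while sid=1 goes silent.
    import time, threading
    from chubby_lite import ChubbyLite

    srv = ChubbyLite(lease_ttl=2.0)
    threading.Thread(target=srv.expire_loop, daemon=True).start()

    sid1, _ = srv.open_session()
    sid2, _ = srv.open_session()
    print("sid=1 acquires:", srv.acquire(sid1, "/ls/payments/leader"))

    # sid=1 sends no further KeepAlives (the blip); sid=2 keeps renewing and retrying.
    for i in range(4):
        time.sleep(1.0)
        srv.keep_alive(sid2)
        got = srv.acquire(sid2, "/ls/payments/leader")
        print(f"t={i+1}s sid=2 acquire -> {got}")   # False until the reaper expires sid=1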
Where this leads next
The next chapter, the lease holder's responsibility, develops the operational discipline of a service that holds a Chubby-style lease — graceful drain, fencing tokens for IO that escapes the lease, and the side-effects layer. After that, leader election protocols walks through the alternatives to lease-based election (bully algorithm, ring election, Raft's built-in leader election) and shows why the lock-service pattern won.
The lock-service pattern is the architectural decision worth internalising before any other consensus topic. Almost every modern distributed system you'll touch — Kubernetes' kube-controller-manager, Kafka's broker registration, every Go microservice using etcd, every Java service using Zookeeper — is using this pattern, often without naming it. Recognise the shape: small replicated cell, file-system or KV surface, sessions with leases, coarse-grained-only locks, slow on purpose, the only piece of consensus infrastructure the application talks to. Once you see it, you see it everywhere.
The deepest lesson Burrows leaves is operational, not technical: the lock service exists so that nothing else has to run consensus. Every team that thinks it needs Paxos in its application is wrong; what it needs is a lock service. The technical contribution is the API; the cultural contribution is the rule. Twenty years on, the rule is still the right one.
References
- Burrows, M. — "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — the canonical paper. Read §2 (design) and §4 (use, surprises, and design errors) twice; §2.1 (rationale) is the gold.
- Hunt, P., Konar, M., Junqueira, F., Reed, B. — "ZooKeeper: Wait-free coordination for Internet-scale systems" (USENIX ATC 2010) — the open-source descendant; same shape, slightly different choices.
- Ongaro, D., Ousterhout, J. — "In Search of an Understandable Consensus Algorithm (Raft)" (USENIX ATC 2014) — the consensus protocol that powers etcd, the modern Chubby-shaped service.
- etcd Authors — "etcd Concurrency API: Lock and Election" (etcd.io/docs) — the production library that implements Chubby-shaped semantics on top of Raft.
- Lamport, L. — "Paxos Made Simple" (ACM SIGACT News 2001) — the consensus protocol Chubby actually uses; required reading for anyone running a lock service.
- Junqueira, F., Reed, B. — ZooKeeper: Distributed Process Coordination (O'Reilly 2013) — book-length treatment of the lock-service pattern in production.
- Lease mechanics — the previous chapter; the five-numbers framework used inside every lock service.
- Wall: consensus is expensive — leases are cheap — the architectural argument for why we centralise consensus into a lock service.