Consul, etcd, ZooKeeper
It is 02:14 on the night of a CricStream live final. The score-overlay service is bouncing pods every fifteen minutes — a memory leak in a third-party transcoder library — and the on-call rotation has built an autohealer that just kills the offending pod and lets the deployment controller bring up a fresh one. The clients reading the score overlay are 40,000 long-lived gRPC channels open from edge servers across India. With DNS, this would be a disaster: every pod death would leave a 30-second sinkhole for some fraction of clients. With the etcd-backed registry the team migrated to last quarter, the death is invisible. Within roughly 200 milliseconds of the kubelet marking the pod Terminating, every one of those 40,000 clients gets a watch notification that the endpoint set has changed, drops the dead address from its connection pool, and continues serving. Karan, watching the dashboards, drinks his chai.
Consul, etcd, and ZooKeeper are coordination services: small, strongly-consistent, replicated stores whose primary purpose is to answer "what is the current membership of this group?" and to push the answer to clients the instant it changes. They replace DNS's pull-on-a-TTL-clock with a push-on-event watch, trading universal compatibility for sub-second freshness. The registry is itself a distributed system that can fail, partition, and lose quorum — your application's discovery now depends on the registry's consensus layer staying healthy.
What a coordination service actually does
Three services dominate the registry slot in production: ZooKeeper (the original, born at Yahoo, 2008), etcd (the CoreOS-then-CNCF newcomer, 2013, the brain of every Kubernetes cluster), and Consul (HashiCorp, 2014, opinionated towards multi-datacenter service mesh). They differ in API surface and ergonomics; they agree on the deeper architecture.
Each one is a strongly-consistent replicated key-value store with a small, hierarchical namespace, optimised for small values that change frequently, and exposing two primitives that DNS does not: watches (long-poll subscriptions that fire on key change) and leases / TTLs with session semantics (a key bound to a client session disappears the moment the session dies).
The core API is identical in shape across all three:
| Operation | etcd | Consul | ZooKeeper |
|---|---|---|---|
| Put a key | Put(k, v) | KV.Put(k, v) | create(/path, v) or setData |
| Read a key | Get(k) | KV.Get(k) | getData(/path) |
| Watch a key for changes | Watch(k) | blocking query with ?index=N | getData(/path, watcher) |
| Atomic compare-and-swap | Txn().If().Then().Else() | KV.CAS(k, v, modify_index) | setData(/path, v, version) |
| Lease / ephemeral | Lease.Grant(ttl) + Put(k, v, lease) | session + Acquire | ephemeral znode |
| List children | Get(prefix, WithPrefix()) | KV.List(prefix) | getChildren(/path) |
Watches are the primitive that makes registry-based discovery faster than DNS-based discovery. A client opens a long-lived connection to the registry, subscribes to a key prefix (/services/score-overlay/instances/), and the registry pushes a notification the instant any key in that prefix is created, modified, or deleted. The freshness gap is one network round-trip, not one TTL.
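On the consuming side, the idiom this enables is read-once-then-watch: fetch the current endpoint set with a single range query, then hold a watch on the prefix and apply deltas as they arrive. A minimal sketch with the python-etcd3 client against this chapter's key layout (the registry address is illustrative):

```python
# discover_with_etcd.py — sketch of the read-once-then-watch discovery idiom
# with python-etcd3; the registry address is illustrative.
import etcd3

client = etcd3.client(host="etcd.local", port=2379)
PREFIX = "/services/score-overlay/instances/"

# 1. Bootstrap: one range query for the current endpoint set.
endpoints = {meta.key.decode(): value.decode()
             for value, meta in client.get_prefix(PREFIX)}
print("initial endpoints:", sorted(endpoints))

# 2. Subscribe: the registry pushes every change on the prefix over one
#    long-lived stream; freshness is one round-trip, not one TTL.
events, cancel = client.watch_prefix(PREFIX)
for event in events:
    key = event.key.decode()
    if isinstance(event, etcd3.events.PutEvent):
        endpoints[key] = event.value.decode()   # instance registered or updated
    elif isinstance(event, etcd3.events.DeleteEvent):
        endpoints.pop(key, None)                # lease expired or clean shutdown
    print("endpoints now:", sorted(endpoints))
```

The dict mutation on each event is the whole discovery client; there is no polling loop anywhere.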
Why the consensus core has to be there at all: a registry without consensus could be split-brained — node A might say "score-overlay-3 is alive" while node B says it's gone, and a client talking to either could be told a different truth. The whole point of pushing notifications to clients is that the answer at the moment of push is the answer. That requires the registry's writes to be linearisable, which requires a consensus protocol (Raft for etcd and Consul, Zab for ZooKeeper). Naming is, at bottom, a consensus problem.
Watches, leases, and ephemeral nodes — the registry idiom
Three primitives, working together, give you the discovery semantics you actually want. Lose any one of them and the model breaks.
Leases bind a key's lifetime to a session. When a service instance starts, it asks the registry for a lease with some TTL (etcd: Lease.Grant(10) for 10 seconds; ZooKeeper: open a session). It then writes its registration key tied to that lease. As long as the instance keeps the lease alive — by sending heartbeats, typically every TTL/3 — the key persists. The moment the instance crashes, segfaults, or gets kill -9'd, the heartbeats stop. After roughly one TTL the lease expires, and the registry deletes every key bound to that lease. There is no "deregister on shutdown" code path that you have to remember to call; the absence of heartbeat is the deregistration signal.
Watches push the change to interested clients. A client opens a watch on a key or prefix. When a key is created, modified, or deleted under that prefix, the registry sends an event over the open connection. In etcd this is a streaming gRPC call; in ZooKeeper it is a one-shot watch that does not re-arm itself (the client must re-register after every event); in Consul it is a "blocking query" with an index parameter that returns when the index advances.
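Consul's blocking query is the least obvious of the three, so a concrete shape helps. This sketch uses Consul's real /v1/health/service endpoint and X-Consul-Index response header; the agent address and the use of the requests library are assumptions for illustration:

```python
# consul_blocking_query.py — sketch of Consul's long-poll watch equivalent,
# using the real /v1/health/service endpoint; addresses are illustrative.
import requests

CONSUL = "http://127.0.0.1:8500"
index = 0  # 0 means "return immediately with the current state"

while True:
    # The request blocks (up to `wait`) until X-Consul-Index advances past
    # `index`, i.e. until the service's membership or health changes.
    resp = requests.get(
        f"{CONSUL}/v1/health/service/score-overlay",
        params={"index": index, "wait": "30s", "passing": "true"},
        timeout=40,  # must exceed `wait` plus Consul's added jitter
    )
    resp.raise_for_status()
    index = int(resp.headers["X-Consul-Index"])
    healthy = [(e["Service"]["Address"], e["Service"]["Port"])
               for e in resp.json()]
    print("healthy endpoints:", healthy)
```

Setting passing=true filters to instances with passing health checks, a knob the other two registries do not have natively.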
Compare-and-swap (CAS) gives you safe coordination. Every write in etcd carries a mod_revision; every key in ZooKeeper has a version; every Consul KV operation can include a cas=N parameter. Conditional writes are how you build a leader-election lock without a race: each client tries Put(/leader, "me", If(create_revision==0)), and the operation succeeds for exactly one of N concurrent attempts. The losers see their attempt fail and watch the key — they will be notified the moment the leader's lease expires.
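In etcd terms, that race looks like the following sketch (python-etcd3; the key name is illustrative, and, as the confusions list later notes, a real election also needs fencing tokens):

```python
# leader_cas.py — sketch of the CAS-based leader lock with python-etcd3.
# The key name is illustrative; production use also needs fencing tokens.
import etcd3

client = etcd3.client(host="etcd.local", port=2379)
LEADER_KEY = "/services/score-overlay/leader"

def try_acquire(my_id: str, ttl: int = 10):
    lease = client.lease(ttl)
    # create_revision == 0 means "the key has never existed": exactly one
    # of N concurrent candidates passes this compare and wins.
    won, _ = client.transaction(
        compare=[client.transactions.create(LEADER_KEY) == 0],
        success=[client.transactions.put(LEADER_KEY, my_id, lease=lease)],
        failure=[],
    )
    return won, lease

won, lease = try_acquire("edge-7")
if won:
    print("leader; must now keep the lease alive or lose the crown")
else:
    events, cancel = client.watch(LEADER_KEY)   # losers wait for the DELETE
    for event in events:
        if isinstance(event, etcd3.events.DeleteEvent):
            cancel()
            break   # old leader's lease expired; loop back and retry
```

The compare on create_revision is the same trick python-etcd3's built-in Lock recipe uses to acquire its key.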
The full registration loop fits in a screenful:
```python
# register_with_etcd.py — minimal, faithful registration loop
import etcd3, time, signal, socket, sys, json

client = etcd3.client(host="etcd.local", port=2379)
TTL_SEC = 10
SVC, INSTANCE = "score-overlay", socket.gethostname()
KEY = f"/services/{SVC}/instances/{INSTANCE}"
VAL = json.dumps({"ip": socket.gethostbyname(INSTANCE), "port": 8080,
                  "version": "v3.4.1", "started_at": int(time.time())})

def register_and_keepalive():
    lease = client.lease(TTL_SEC)          # proof-of-life for this registration
    client.put(KEY, VAL, lease=lease)      # key's lifetime is bound to the lease
    print(f"registered {KEY} with lease {lease.id:x} (ttl={TTL_SEC}s)")
    while True:                            # heartbeat loop
        time.sleep(TTL_SEC / 3)            # heartbeat at TTL/3 cadence
        resp = lease.refresh()[0]          # one keepalive round-trip
        if resp.TTL <= 0:                  # registry already expired the lease
            print("lease lost — re-registering")
            return                         # outer loop re-registers

def shutdown_cleanly(*_):
    try:
        client.delete(KEY)                 # clean shutdown: staleness window ~0
    except Exception:
        pass
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown_cleanly)
signal.signal(signal.SIGINT, shutdown_cleanly)

while True:
    try:
        register_and_keepalive()
    except Exception as e:
        print(f"registry contact lost: {e}; retry in 1s")
        time.sleep(1.0)
```
Sample run (with one process registered and a second etcdctl watching the prefix):
```text
# Terminal 1 — the registration loop
registered /services/score-overlay/instances/edge-7 with lease 6f4a3b27a118c001 (ttl=10s)

# Terminal 2 — etcdctl watch
$ etcdctl watch --prefix /services/score-overlay/instances/
PUT
/services/score-overlay/instances/edge-7
{"ip": "10.42.7.13", "port": 8080, "version": "v3.4.1", "started_at": 1745539200}

# Terminal 1 — kill -9 the registration loop (no clean shutdown runs)
# Terminal 2 — about 10 seconds later:
DELETE
/services/score-overlay/instances/edge-7
```
Per-line walkthrough. The line lease = client.lease(TTL_SEC) is where the entire model is decided: the registry now knows that this lease is the proof-of-life for this registration. The line client.put(KEY, VAL, lease=lease) binds the key's lifetime to the lease — the registry will autonomously delete this key when the lease expires. The while True heartbeat loop is the keepalive: every TTL_SEC / 3 seconds, lease.refresh() makes one round-trip to the registry, and as long as this Python process is alive and reaching the registry, the lease never expires; a refresh that comes back with TTL <= 0 means the registry has already expired the lease, so the function returns and the outer loop re-registers from scratch. signal.signal(signal.SIGTERM, shutdown_cleanly) is a courtesy — on a clean shutdown we delete the key immediately, dropping the staleness window from ~TTL down to near zero, but on kill -9 (or pod OOM) the registry's lease-expiry path is the safety net. Note what is not in the code: no periodic re-registration, no "every 60 seconds, re-PUT the same value". The lease-with-keepalive idiom is the registration protocol — re-PUTs are unnecessary noise.
Why TTL/3 for the heartbeat cadence: with three heartbeats in one TTL window, the lease can absorb up to two consecutive packet drops without expiring. Heartbeating at TTL/2 leaves only one slack heartbeat — a single dropped packet in a saturated network will trigger a false expiry and a spurious deregistration. Heartbeating at TTL/4 or finer wastes registry capacity (registry write throughput is the constraint that actually limits how many services you can register against one cluster). TTL/3 is the convention in etcd's lease docs, ZooKeeper's session-timeout guidance, and Consul's session API examples.
How the three differ — and where they're identical
The architectural shape is the same; the API surface and the operational ergonomics are not.
etcd is the simplest. Single binary. Raft consensus. gRPC API with strict watch semantics (you get every event, in order, between two revisions, even across reconnects, as long as you fit in the history-compaction window). The data model is a flat key-value store with prefix queries — there is no hierarchy beyond the lexicographic ordering of keys. This makes prefix watches cheap (they are range watches under the hood) and discourages people from building deep configuration trees inside the registry. Kubernetes uses etcd as its sole source of truth; if your registry is sitting next to a Kubernetes cluster, you may already have an etcd you can talk to.
Consul is the most opinionated. Built around a service-discovery vocabulary rather than a generic KV: services have explicit Register, Deregister, and Health endpoints; health checks (HTTP, TCP, script, gRPC) run on the Consul agents on each node, and only services with passing checks appear in the discovery response. Consul also exposes services as DNS records (so you can do dig score-overlay.service.consul and get an answer), giving you a backwards-compatible bridge for clients that haven't been updated to use the watch API. The multi-datacenter support is built-in: federate two Consul clusters and the cross-datacenter gossip handles WAN propagation. The price is a much larger surface area than etcd.
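For contrast with the etcd registration loop above, Consul's registration is declarative: you hand the local agent a service definition plus a health check, and the agent does the probing. A sketch against the real /v1/agent/service/register endpoint (the address, port, and /healthz URL are assumptions for illustration):

```python
# consul_register.py — sketch: register a service plus a health check via the
# local Consul agent's real /v1/agent/service/register endpoint. The address,
# port, and /healthz URL are assumptions.
import requests

AGENT = "http://127.0.0.1:8500"

definition = {
    "Name": "score-overlay",
    "ID": "score-overlay-edge-7",
    "Address": "10.42.7.13",
    "Port": 8080,
    "Tags": ["v3.4.1"],
    "Check": {
        # The agent runs this probe itself; the instance only has to answer.
        "HTTP": "http://10.42.7.13:8080/healthz",
        "Interval": "5s",
        # Failing for this long removes the registration entirely:
        "DeregisterCriticalServiceAfter": "1m",
    },
}

resp = requests.put(f"{AGENT}/v1/agent/service/register", json=definition)
resp.raise_for_status()
print("registered; visible to discovery only while the check passes")
```

Note the inversion relative to the lease model: there is no client-held lease to heartbeat. The agent's health check is the liveness signal, and a check that fails for long enough deregisters the service from DNS and the catalog alike.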
ZooKeeper is the oldest and the most general. Hierarchical namespace (znodes form a tree, like a filesystem). Zab (ZooKeeper Atomic Broadcast) instead of Raft, but the practical guarantees are the same: linearisable writes, sequentially consistent reads with the option to force a sync. Watches are one-shot — after a watch fires, the client must re-register it to receive the next event. This is a footgun that the other two avoid. ZooKeeper is what every Hadoop / Kafka / HBase / Solr deployment ran on for a decade; the systems that have moved to KRaft (Kafka) or etcd (Kubernetes) did so to escape exactly the operational quirks of running ZooKeeper at scale (watch-storm thundering herd, JVM heap tuning, snapshot-and-replay for log compaction).
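The same two primitives in ZooKeeper's dialect, sketched with the kazoo client (paths mirror the chapter's etcd layout; the explicit re-arm is the part to notice):

```python
# zk_ephemeral.py — sketch of ZooKeeper's ephemeral-znode + one-shot-watch
# idiom with the kazoo client; paths mirror the chapter's etcd example.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk.local:2181")
zk.start()   # opens the session; every ephemeral below lives and dies with it

# Registration: the ephemeral znode vanishes when the session expires.
# There is no separate lease object; the session itself is the lease.
zk.create("/services/score-overlay/instances/edge-7",
          b'{"ip": "10.42.7.13", "port": 8080}',
          ephemeral=True, makepath=True)

# Discovery: a raw ZooKeeper watch is one-shot, so the callback must
# re-register to see the *next* change. This is the re-arm footgun.
def on_change(event):
    children = zk.get_children("/services/score-overlay/instances",
                               watch=on_change)   # re-arm explicitly
    print("instances now:", children)

zk.get_children("/services/score-overlay/instances", watch=on_change)
```

kazoo's ChildrenWatch recipe automates the re-arm, but the protocol-level gap between fire and re-register remains: you learn that something changed and fetch the current state, rather than receiving every intermediate event.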
What goes wrong — and the registry's own failure modes
The registry is a distributed system. It can fail. The failures are not the same as DNS failures; they are worse in some ways and better in others.
Quorum loss. If a 3-replica etcd cluster loses two nodes (an AZ outage that takes both replicas in the same AZ, a misconfigured rolling upgrade), the surviving node cannot accept writes — there is no quorum. Reads can still be served by the surviving node if the client opts into serializable (possibly stale) reads, but no new instance can register, no lease can be renewed, and existing leases expire. The registry has effectively gone read-only and is also slowly forgetting everything. PaySetu hit this exactly once: a Terraform plan misapplied an instance-type change to two of three etcd replicas at the same time, both were replaced together, quorum was lost for 8 minutes, every service whose lease expired during that window vanished from discovery, and the cascade took 25 minutes to clean up after quorum returned. The fix is operational, not architectural — replicas must be in distinct AZs, upgrades must proceed one at a time with a quorum-health check between each, and the cluster's --quota-backend-bytes must leave enough headroom that compaction never blocks consensus.
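A crude client-side approximation of quorum health, useful for dashboards and pre-upgrade checks: a sketch assuming three endpoints and python-etcd3's status() call (the hostnames are hypothetical, and real quorum is what the members agree among themselves, so treat a passing probe as necessary rather than sufficient):

```python
# quorum_probe.py — sketch: poll each etcd endpoint and estimate whether the
# cluster can still accept writes. Hostnames are hypothetical; this is a
# client-side approximation, not the members' own view of quorum.
import etcd3

ENDPOINTS = [("etcd-a.local", 2379), ("etcd-b.local", 2379), ("etcd-c.local", 2379)]

reachable = 0
for host, port in ENDPOINTS:
    try:
        etcd3.client(host=host, port=port, timeout=2).status()
        reachable += 1
    except Exception as e:
        print(f"{host}:{port} unreachable: {e}")

quorum = len(ENDPOINTS) // 2 + 1   # floor(N/2) + 1
verdict = "available" if reachable >= quorum else "UNAVAILABLE (quorum lost)"
print(f"{reachable}/{len(ENDPOINTS)} reachable; writes {verdict}")
```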
Watch-storm under churn. When 5,000 clients all hold a watch on the same prefix and a single key changes, the registry has to push 5,000 events. Under heavy churn — say a deploy that recycles 200 pods in 30 seconds — the registry can saturate its outbound bandwidth before it can keep up with applying writes. ZooKeeper is the most prone to this; its one-shot watch model means clients re-register watches in tight loops, and the re-registration traffic itself becomes the storm. The mitigation is client-side filtering (watch a small prefix, not the root) and batching at the source (etcd's watch event stream coalesces consecutive events on the same key within a single revision).
Slow-leader failure mode. A registry's leader can be alive and serving heartbeats but slow — fsync queue saturated, GC pause, swap. From the consensus protocol's view, the leader still holds the lease; from the application's view, watch events arrive 4 seconds late. The phi-accrual failure detector (Part 10) is the right tool here, but most registries ship with simple timeout-based detection. Consul exposes serf health for the gossip layer, which detects slow agents independently; etcd has no equivalent — slow-leader detection is on you.
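Absent a phi-accrual detector, a cheap stopgap is to time the keepalive round-trip itself: a slow leader shows up as a slow refresh long before leases start expiring. A sketch (the 500 ms threshold is illustrative, not a recommendation):

```python
# keepalive_latency.py — sketch: crude slow-registry detection by timing the
# lease keepalive round-trip; the 500 ms threshold is illustrative only.
import time
import etcd3

client = etcd3.client(host="etcd.local", port=2379)
TTL_SEC = 10
SLOW_MS = 500.0

lease = client.lease(TTL_SEC)
while True:
    t0 = time.monotonic()
    lease.refresh()                          # one keepalive round-trip
    rtt_ms = (time.monotonic() - t0) * 1000
    if rtt_ms > SLOW_MS:
        # The leader is alive (the refresh succeeded) but slow; watch
        # events are probably arriving late too. Alert before leases expire.
        print(f"registry keepalive took {rtt_ms:.0f} ms")
    time.sleep(TTL_SEC / 3)
```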
Network partition between clients and registry. A client whose connection to all registry replicas drops cannot tell whether (a) the registry is down or (b) the client itself is partitioned. The conservative policy is fail-closed (refuse to discover; fall back to a hardcoded last-known set), but most production clients fail-open (use the cached set indefinitely). Both are wrong sometimes; the right answer is bounded staleness with a fallback — use the cached set for up to N seconds past loss-of-watch, then refuse to serve new requests. Most production code does not implement the bounded part.
Why fail-open is usually the right pragmatic choice despite being theoretically wrong: in a real outage, the registry is far more likely to be flapping (brief loss of quorum, few-second recovery) than catastrophically gone. If clients fail-closed every time their watch breaks for 2 seconds, a registry hiccup turns into a fleet-wide outage. The trade is: accept some staleness during registry instability, in exchange for not amplifying registry instability into application instability. CricStream's edge clients hold a 60-second fail-open horizon; KapitalKite, where stale routing could place a trade against a decommissioned matching engine, holds a 5-second horizon. Tune to your blast-radius tolerance.
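The bounded part is a few dozen lines, which makes its absence from most production clients the more striking. A sketch of the cache policy (the 60-second horizon is CricStream's from the text; the plumbing around it is hypothetical):

```python
# bounded_staleness.py — sketch of a fail-open-with-horizon discovery cache.
# The 60-second horizon is CricStream's from the text; everything else is
# hypothetical plumbing.
import time

class BoundedStalenessCache:
    def __init__(self, horizon_sec: float = 60.0):
        self.horizon_sec = horizon_sec
        self.endpoints: list[str] = []
        self.last_watch_ok = 0.0   # no watch event yet: resolve() fails closed

    def on_watch_event(self, endpoints: list[str]) -> None:
        """Called by the watch loop on every event or progress notification."""
        self.endpoints = endpoints
        self.last_watch_ok = time.monotonic()

    def resolve(self) -> list[str]:
        age = time.monotonic() - self.last_watch_ok
        if age > self.horizon_sec:
            # Fail-open horizon exceeded: stop routing on stale data.
            raise RuntimeError(f"discovery data {age:.0f}s stale; refusing to serve")
        return self.endpoints   # fail-open within the horizon

cache = BoundedStalenessCache(horizon_sec=60.0)   # KapitalKite would use 5.0
```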
Common confusions
- "Consul is just etcd with DNS bolted on." No. Consul has a richer service-catalog model (services with health-check definitions, metadata, tags), an agent-on-every-node architecture (with gossip-based membership independent of the consensus core), and built-in WAN federation. etcd is intentionally a simpler primitive — Kubernetes builds its service catalog on top of etcd's flat KV. The two solve overlapping but not identical problems.
- "ZooKeeper watches give you a stream of events." They do not. Each watch fires exactly once and must be re-registered. Between fire and re-register, events can be missed. The client SDKs (Curator, kazoo) hide this with auto-rearm, but the underlying protocol is one-shot — and the gap between fire and re-register is a real correctness footgun if you rely on it for state-machine replay.
- "A registry replaces health-checking." It does not. The registry tells you something registered itself and its lease has not expired. That is a proof-of-life at the registration layer, not a proof of healthy serving. A pod can be alive — heartbeating its lease, holding its znode — while serving 100% 503s on its actual port. Real health needs an out-of-band probe (HTTP /healthz, TCP connect, gRPC health-check protocol) that gates whether the registration is visible to discovery. Consul builds this in; etcd and ZooKeeper require you to layer it on.
- "Leases solve the dead-leader problem." They solve the dead-instance problem (the registration disappears when the lease expires). They do not solve the split-brain leader problem — see leader election and leases — because lease expiry alone does not fence the old leader's writes against a new leader's storage. You also need a fencing token.
- "You can use the registry as a config store." You can, but every key change becomes a watch event on every client subscribed to that prefix, and every key value adds to the snapshot the registry must replicate to every replica. etcd rejects requests larger than roughly 1.5 MiB by default, and the total dataset is recommended to stay under 8 GiB. Treating the registry as an application-config blob store has, more than once, caused an etcd OOM that took the kube-apiserver down with it. Configuration that is large or rarely-changing belongs in object storage, with a small pointer in the registry.
- "Three replicas means three failures tolerated." It tolerates one. Quorum is floor(N/2)+1, which for N=3 is 2 — losing 2 of 3 means quorum is gone. Five replicas tolerate two failures. This catches people out because the marketing copy on coordination services emphasises "highly-available cluster" without spelling out the f = floor((N−1)/2) limit. Cross-AZ replica placement is mandatory; cross-region is expensive (every write pays cross-region RTT) and rarely worth it for a registry.
Going deeper
How etcd's watch implementation actually works
Inside etcd, a watch is a server-side cursor on the MVCC store. Every write produces a new global revision number; the keyspace is versioned, so you can ask "what did this key look like at revision N" for any N within the compaction window. A watch is (prefix, start_revision); the server streams every key change with revision > start_revision matching the prefix. When a client reconnects after a brief network blip, it sends its last-seen revision; the server replays from there. This is why etcd watches are "lossless" within the compaction horizon (history is kept until compaction runs; auto-compaction is off by default and is configured via --auto-compaction-retention). The implementation is in mvcc/watchable_store.go — a fanout structure where each new write checks every active watcher's prefix and queues an event. The fanout is the bottleneck under high churn and many watchers; etcd 3.4+ added per-watcher rate limiting to keep one slow watcher from starving the rest.
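The revision cursor is visible from the client side: a reconnecting watcher resumes from its last-seen revision and misses nothing inside the compaction window. A sketch with python-etcd3, assuming (as that client does by delegation) that event objects expose the underlying KeyValue's mod_revision; the prefix is the chapter's example:

```python
# watch_resume.py — sketch: resume an etcd watch from the last-seen revision
# after a reconnect, so no event inside the compaction window is missed.
import etcd3

client = etcd3.client(host="etcd.local", port=2379)
PREFIX = "/services/score-overlay/instances/"
last_revision = 0

while True:
    try:
        # Replay starts at last_revision + 1; omitting start_revision
        # means "from now on" for the very first connection.
        kwargs = {"start_revision": last_revision + 1} if last_revision else {}
        events, cancel = client.watch_prefix(PREFIX, **kwargs)
        for event in events:
            last_revision = event.mod_revision   # advance the cursor
            print(type(event).__name__, event.key.decode())
    except Exception as e:
        print(f"watch dropped ({e}); resuming from revision {last_revision}")
```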
Why ZooKeeper still ships in 2026
Despite Kubernetes choosing etcd and Kafka migrating to KRaft (its own internal Raft), ZooKeeper remains the registry of record for HBase, Solr, and most legacy Hadoop deployments — and it keeps being chosen for greenfield systems, for several reasons. First, the hierarchical znode model is genuinely useful for representing structured topology (Kafka used /brokers/ids/{id} and /brokers/topics/{topic}/partitions/{n}/state paths to encode the cluster's state). Second, ZooKeeper's session model is more expressive than etcd's lease model — a single session can own dozens of ephemeral znodes, and they all disappear together on session expiry, atomically. Third, the Curator framework (Apache, originally Netflix) provides high-level recipes — leader election, distributed locks, shared counters — with over a decade of production scar tissue. Greenfield etcd deployments end up reimplementing these from scratch. The trade is operational complexity: ZooKeeper is a JVM application with a snapshot-and-log on disk and a small but real ops burden. If your team already runs JVMs at scale, the marginal cost is small; if your team runs Go and Rust, it is not.
Consul's gossip layer and what it actually buys you
Consul's twist is that it runs two protocols: Raft for the linearisable KV and service catalog (the "servers"), and Serf (a SWIM-derivative gossip — see Part 11) for membership and failure detection across all "agents" (one per node, including non-server nodes). The gossip layer is what makes health checks distributed: every Consul agent locally runs the health checks for the services on its node and gossips the result. The Raft cluster only stores the aggregate health state — derived from gossip — for each service. This is why Consul scales to thousands of nodes with fewer servers than equivalent etcd / ZooKeeper deployments would need: the membership and health-check load is gossiped, not centralised. The cost is added complexity and an extra protocol whose failure modes (gossip partitions, suspicion-level oscillation) you now have to understand alongside Raft's.
Reproduce this on your laptop
```bash
# Single-node etcd + the registration loop above
docker run -d --name etcd -p 2379:2379 quay.io/coreos/etcd:v3.5.13 \
  /usr/local/bin/etcd --advertise-client-urls=http://0.0.0.0:2379 \
  --listen-client-urls=http://0.0.0.0:2379

python3 -m venv .venv && source .venv/bin/activate
pip install etcd3==0.12.0 protobuf==3.20.3

# Run register_with_etcd.py in one terminal, watch in another:
etcdctl --endpoints=localhost:2379 watch --prefix /services/

# Inspect the lease state directly:
etcdctl --endpoints=localhost:2379 lease list
etcdctl --endpoints=localhost:2379 get --prefix /services/
```
Where this leads next
The registry idiom — push-on-event, lease-bound liveness, conditional writes — composes upward in two directions. Sideways, client-side vs server-side discovery takes the registry as given and asks where the discovery logic itself lives — in every client, or in a sidecar that fronts the registry. Upward, leader election and leases uses exactly the same primitives (lease, CAS, watch) to elect a single coordinator out of a pool of candidates, with fencing tokens to keep the old coordinator from doing damage after losing its lease.
The natural follow-up to this chapter is Kubernetes Services and Endpoints, which shows how the platform-API model wraps an etcd registry behind a higher-level abstraction (Service objects) so that application developers never write registration code at all — the kubelet does it on their behalf, by reporting pod readiness to the API server.
References
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" — USENIX ATC 2014 — the Raft paper. Both etcd and Consul implement this; reading it is mandatory for anyone tuning a production cluster.
- Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems" — OSDI 2006 — Google's predecessor to ZooKeeper; introduces the lease-and-watch idiom that all three modern registries inherit.
- Hunt, Konar, Junqueira, Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems" — USENIX ATC 2010 — the founding ZooKeeper paper; §3 details the Zab protocol.
- "etcd documentation: Why etcd?" — etcd.io — the official maintainers' position on when to choose etcd vs ZooKeeper vs Consul.
- "Consul Architecture" — HashiCorp Learn — the two-protocol (Raft + Serf) design, including agent-vs-server roles and WAN federation.
- Junqueira, Reed, Serafini, "Zab: High-performance broadcast for primary-backup systems" — DSN 2011 — Zab in formal detail, including how it differs from Multi-Paxos.
- DNS-based discovery — internal companion. The "barely-sufficient" alternative this chapter is written against; reading them back-to-back makes the registry's value concrete.
- Idempotency keys — internal companion. Registries make discovery fresh; idempotency keys make the resulting RPCs safe to retry when the answer changes mid-flight.