In short
Most distributed systems need the same handful of primitives: pick one node to be leader, hold a lock that releases automatically if you crash, store a small piece of shared configuration, watch for changes, register a service so others can find it. Implementing these correctly requires consensus, and consensus is hard enough that you should not roll your own. The industry's answer is to factor consensus out into a small, dedicated coordination service that other systems use as a black box.
etcd is the modern incumbent. It is a distributed key-value store backed by Raft, exposing a gRPC + REST API with two killer primitives: leases (a TTL that auto-expires keys when the holder dies) and watches (a server-streamed feed of changes to a key or prefix). It is the Kubernetes control plane's only stateful component; it backs HashiCorp Vault, CoreOS Locksmith, M3DB's metadata, and a long tail of cluster managers. Run it as a 3- or 5-node cluster on dedicated hardware, give the WAL its own disk, and never let it cross the 8 GB data limit without a plan.
ZooKeeper is the elder. It uses ZAB (ZooKeeper Atomic Broadcast, a Paxos relative) and exposes a hierarchical filesystem-like namespace of znodes. Three primitives do all the work: ephemeral znodes vanish when the creating session ends (liveness), sequential znodes get a monotonically increasing suffix appended by the server (ordering), and watches notify clients of changes. Kafka pre-KRaft, HBase, Hadoop YARN, Apache Mesos, Apache Solr, and Druid all run on ZooKeeper. It is older, less ergonomic, and operationally heavier than etcd, but battle-tested over fifteen years.
Both let you implement leader election in twenty lines of client code; both implement service registries with ephemeral nodes; both make distributed locks straightforward. Both have a "no-leader" failure mode that is the central thing on-call engineers worry about. The migration trend is away from external coordination services and toward in-cluster Raft (Kafka's KRaft, etcd's role inside Kubernetes) — but for the next decade you will encounter both in production.
You are running a fleet of order-processing workers behind a queue. They consume orders, charge cards, write to the database. The business rule says exactly one worker processes each order, and if that worker crashes mid-charge, another worker must take over within seconds. Naively you can shard by order_id mod N, but workers come and go for deploys; the assignment of shards to workers needs to update without two workers ever thinking they own the same shard.
You sit down to write this. After an hour you realize you need a leader, then you realize the leader needs to be elected without a coordinator, then you discover Raft, then you read the five safety invariants, and then you take a long walk and decide that maybe the right call is not to write a Raft implementation for what is, fundamentally, an order-processing service.
That long walk is what etcd and ZooKeeper are. They are the answer to the question "I need consensus primitives, but consensus is not my product." They run as a small, dedicated cluster of 3 or 5 nodes; they expose locks, leader election, configuration storage, and change notification as a service; and they let your application code be ten lines of "lease this key" rather than ten thousand lines of "implement Raft."
This chapter walks through both systems — what they offer, how they differ, what they are used for, how to operate them, and how to wire one into your application. It is the first chapter in this Build that is not about implementing consensus; it is about consuming it.
The role of a coordination service
A coordination service has three distinguishing properties.
It is small. etcd's recommended max database size is 8 GB; ZooKeeper expects the working set to fit in memory. These are not document stores. They hold metadata: pod specs, configuration entries, lock holders, leader identities. Everything bulky — your actual data — lives elsewhere.
It is consistent. Writes are linearizable, and etcd reads are linearizable by default; ZooKeeper serves reads locally from each server, but a client that needs the latest state can call sync() first. There is no eventual-consistency knob: you put a key, and every subsequent linearizable read on any client sees it. Why this matters: if you are using the service to elect a leader, you cannot tolerate "two clients each read a stale value and both think they hold the lock." Linearizability is the entire point.
It exposes primitives, not data. Both systems give you compare-and-set, ephemeral keys with TTL, and change notifications. You can implement locks, leader election, and registries on top in a few dozen lines. The service does not bake in those abstractions; it gives you the safe primitives and lets the client library or your code compose them.
The bargain: you take a hard dependency on a cluster you have to operate (3 nodes minimum, 5 for zone redundancy, separate disks for WAL), and in exchange you get correct primitives instead of subtle bugs.
etcd
etcd was created by CoreOS in 2013 to back its fleet cluster manager. Today it is the only mandatory stateful component of a Kubernetes cluster. The architecture is a Raft-replicated key-value store with a small, opinionated API.
Data model
Keys are byte strings. Values are byte strings. Keys are flat (no hierarchy in the storage layer), but /-separated prefixes are the convention, and the API has a range operation that returns all keys with a given prefix — so logically you can treat etcd as a tree if you want, while the engine sees a sorted byte-string keyspace.
Every write produces a new revision, a cluster-wide monotonically increasing 64-bit integer. The store keeps recent revisions so clients can read at a specific point in history (mini-MVCC). This is what makes watches efficient: a watch starts at a revision and replays every change since then, then streams new changes as they happen.
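The replay mechanics can be sketched as a toy model. This MiniMVCC class is illustrative only (etcd's real engine is an MVCC store on a B+tree-backed file, not a flat event list), but it shows why a watch can start in the past: the events are all there, indexed by revision.

```python
# A toy model of the revision log (illustrative only, not etcd's engine):
# every write appends an event at the next revision, so a watch can start
# at an old revision and replay forward before streaming live changes.
class MiniMVCC:
    def __init__(self):
        self.revision = 0
        self.events = []  # (revision, op, key, value)

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, "PUT", key, value))
        return self.revision

    def delete(self, key):
        self.revision += 1
        self.events.append((self.revision, "DELETE", key, None))
        return self.revision

    def watch(self, prefix, start_revision):
        """Replay every event at or after start_revision under the prefix."""
        return [e for e in self.events
                if e[0] >= start_revision and e[2].startswith(prefix)]

store = MiniMVCC()
store.put("/config/a", "1")        # revision 1
rev = store.put("/config/b", "2")  # revision 2
store.delete("/config/a")          # revision 3

# A watcher starting at revision 2 replays the PUT of /config/b and the
# DELETE of /config/a, but never sees the original PUT of /config/a.
replayed = store.watch("/config/", start_revision=rev)
```

The real server does the same replay on its side: a watch created with a start revision streams historical events first, then live ones, so a client that reconnects never misses a change.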
The four operations
PUT key value [lease N] [if previous revision == R]
GET key [at revision R] [range_end K2]
DEL key [range_end K2]
TXN if [conditions] then [ops] else [ops]
Plus the supporting cast:
- Lease grant/keepalive/revoke: a lease is a TTL-bearing object. Put(key, value, lease=L) ties the key to lease L; when L expires (the holder stops sending keepalives), the key is automatically deleted by the server.
- Watch: a server-streamed gRPC subscription to changes on a key or range, optionally starting at a given revision. The server pushes events; the client does not poll.
The TXN operation is the one that makes etcd usable for primitives. It is a tiny, deterministic mini-transaction: evaluate a list of conditions on revisions/values; if all pass, run the then ops atomically; otherwise run the else ops. Compare-and-swap, "create only if not exists" (if mod_revision(K) == 0), "delete only if value matches" — all expressible as TXN.
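To make those semantics concrete, here is a toy in-memory sketch of a TXN evaluator (a hypothetical model over a dict, not the real API): all conditions are evaluated first, then exactly one branch of operations is applied.

```python
# A toy in-memory sketch of etcd's TXN semantics (hypothetical model, not
# the real API): evaluate all conditions, then apply exactly one branch.
def txn(store, conditions, then_ops, else_ops):
    ops = then_ops if all(cond(store) for cond in conditions) else else_ops
    for op in ops:
        op(store)
    return ops is then_ops  # True iff the then-branch ran

def mod_revision_is_zero(key):
    # In etcd, mod_revision == 0 means "key has never existed or was deleted".
    return lambda store: key not in store

def put(key, value):
    return lambda store: store.__setitem__(key, value)

store = {}
# Create-if-not-exists: the first caller's TXN succeeds...
won_a = txn(store, [mod_revision_is_zero("/leader")], [put("/leader", "a")], [])
# ...and a second identical TXN fails, because the key now exists.
won_b = txn(store, [mod_revision_is_zero("/leader")], [put("/leader", "b")], [])
```

In the real system the whole evaluation happens server-side in one Raft-committed step, which is what makes the compare-and-swap race-free.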
Wire and deployment
- gRPC for clients, with a JSON gateway for HTTP fallback.
- Default ports: 2379 for client traffic, 2380 for inter-peer Raft traffic.
- Cluster size: 3 or 5 nodes in production. 1 node has no fault tolerance; 3 tolerates 1 failure; 5 tolerates 2. Beyond 5 the per-write majority cost grows faster than the marginal availability benefit, and Raft heartbeat traffic gets noisy.
Why odd numbers and why 5 is the realistic ceiling: with 7 nodes the majority is 4 and every write waits for the slowest 4 of 7 to ack, while still only tolerating 3 failures (versus 2 in a 5-node cluster). The latency-vs-fault-tolerance trade-off levels off; very few real deployments need to tolerate 3 simultaneous failures, and those that do typically shard across regions instead of growing the etcd cluster.
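The arithmetic is worth spelling out. A short sketch of majority sizes and tolerated failures for common cluster sizes (assumption: a write commits once a majority acks, and the cluster stays available while a majority survives):

```python
# Majority arithmetic for a Raft/ZAB cluster of n nodes.
def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)

sizes = {n: (quorum(n), tolerated_failures(n)) for n in (1, 3, 4, 5, 7)}
# 1 -> quorum 1, tolerates 0;  3 -> 2, 1;  4 -> 3, 1;  5 -> 3, 2;  7 -> 4, 3
```

Note that 4 nodes behaves like 3 for fault tolerance while paying a larger quorum on every write, which is why clusters are odd-sized.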
The on-disk layout is a write-ahead log (/var/lib/etcd/wal) plus periodic snapshots (/var/lib/etcd/snap). The WAL is the hot path — every write fsyncs to it before being acked. Production rule: put the WAL on its own SSD, separate from the snapshot directory and from anything else on the host. A noisy neighbour on the WAL disk turns a 1-ms etcd write into a 100-ms one, and once etcd writes get slow the entire Kubernetes control plane visibly degrades.
What etcd is used for
- Kubernetes — the canonical use. Every pod spec, every node status, every secret, every configmap is a key in etcd. The apiserver is the only client; it translates kubectl/controller traffic into etcd reads, writes, and watches.
- HashiCorp Vault — when configured for HA storage, Vault uses etcd to elect an active node and to store leader status. (Consul is the more common Vault backend, but etcd is supported.)
- CoreOS Locksmith — coordinates rolling reboots of CoreOS hosts using etcd-based locks.
- Cloud provider control planes — AWS, GCP, and Azure all run etcd internally for various managed Kubernetes and infrastructure services.
- M3DB, Patroni (Postgres HA), and a long tail of "I need a tiny consistent store and don't want to run ZooKeeper" use cases.
ZooKeeper
ZooKeeper predates etcd by half a decade. Yahoo open-sourced it in 2008; it became an Apache top-level project soon after. It uses ZAB (ZooKeeper Atomic Broadcast), a primary-backup protocol that is, in the spectrum of consensus algorithms, a close cousin to Multi-Paxos. The protocol is different from Raft in the details — see Paxos, Multi-Paxos, ZAB: what Raft simplified — but the externally visible guarantees are the same: linearizable writes, sequential consistency, fault tolerance up to a minority of nodes.
The znode hierarchy
ZooKeeper exposes a filesystem-like tree. Every node is a znode. A znode has a path (/services/orders/worker-3), a small data payload (≤1 MB by default; in practice keep it ≤1 KB), and metadata (czxid, mzxid, version, ACLs).
The four flavours of znode:
- Persistent — created with create(). Survives client disconnects and server restarts. The default for configuration data and directory parents.
- Ephemeral — created with the EPHEMERAL flag. Tied to the creating client's session. When the session ends (client disconnects and the session times out, default 30 s), the znode is automatically deleted by the server. This is the liveness primitive: "I am alive iff this znode exists."
- Sequential — created with the SEQUENTIAL flag. The server appends a 10-digit monotonically increasing suffix to the requested name. Two clients both calling create("/queue/job-", SEQUENTIAL) get back distinct names like /queue/job-0000000017 and /queue/job-0000000018, with 17 < 18 reflecting the order the server saw the requests. This is the ordering primitive.
- Ephemeral-sequential — both flags. The leader-election workhorse: every client creates an ephemeral-sequential znode under a common parent; the holder of the lowest sequence number is the leader; if it crashes, its znode disappears and the next-lowest takes over.
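Because the server zero-pads the suffix to 10 digits, lexicographic order matches numeric order, so clients can sort children as plain strings. A tiny illustrative helper (the names here are hypothetical):

```python
# Zero-padded 10-digit suffixes make string sort equal numeric sort.
def seq_number(znode_name):
    return int(znode_name[-10:])

children = ["job-0000000018", "job-0000000002", "job-0000000017"]
ordered = sorted(children)          # plain string sort is already numeric order
lowest = min(children, key=seq_number)  # "job-0000000002"
```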
Watches
A watch is a one-shot trigger registered with most read operations (getData, getChildren, exists). When the watched thing changes, the server sends one notification, and the watch is then deregistered. The client must re-register the watch on its next read.
Why one-shot: it forces the client to re-read after every notification, which is exactly what a correct client should do anyway. A "permanent" watch would tempt clients to react to events without re-reading state, which is racy. The one-shot model bakes the read-after-notification pattern into the API.
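The discipline can be simulated in a few lines. This toy server (not the ZooKeeper client API) arms at most one one-shot watch per read; a client that re-reads in its callback sees every change, while one that reacts without re-reading hears exactly one notification and then goes silent.

```python
# A toy simulation (not the ZooKeeper API) of one-shot watch semantics:
# a read arms at most one watch, and the server fires it exactly once.
class OneShotWatchServer:
    def __init__(self, value):
        self.value = value
        self.watcher = None  # at most one pending one-shot watch

    def read_and_watch(self, callback):
        self.watcher = callback  # the read arms the watch...
        return self.value

    def write(self, value):
        self.value = value
        if self.watcher:
            cb, self.watcher = self.watcher, None  # ...and it fires once
            cb()

server = OneShotWatchServer("v1")
seen = []

def on_change():
    # Correct client: re-read (which re-arms the watch) on every event.
    seen.append(server.read_and_watch(on_change))

seen.append(server.read_and_watch(on_change))
server.write("v2")
server.write("v3")
server.write("v4")  # seen is now ["v1", "v2", "v3", "v4"]

# Incorrect client: reacts without re-reading, so it hears only one change.
fired = []
server.read_and_watch(lambda: fired.append(True))
server.write("v5")
server.write("v6")  # fired is still [True]; the watch is gone
```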
Wire and deployment
- Custom binary protocol on port 2181 for clients, 2888 for follower-to-leader replication traffic, 3888 for leader election.
- Cluster size (called an ensemble in ZK parlance): 3 or 5 nodes, same logic as etcd.
- On-disk: dataDir (transaction log + snapshots) and dataLogDir (just the txn log; recommended on its own disk).
- Health check: send ruok to the client port; a healthy server replies imok. Modern installations prefer the richer four-letter words (stat, mntr) or the AdminServer HTTP endpoint.
What ZooKeeper is used for
- Apache Kafka (pre-KRaft) — broker registration, topic metadata, controller election, ACLs. Every Kafka cluster used to require a ZooKeeper ensemble alongside it.
- Apache HBase — region server discovery, master election, root region location.
- Hadoop YARN — ResourceManager HA, application state.
- Apache Mesos — master election.
- Apache Solr (SolrCloud) — collection metadata, shard leader election.
- Druid, Pinot, Helix — coordinator/controller election and metadata.
The footprint is enormous; for a decade ZooKeeper was the de-facto coordination service for the JVM-centric data infrastructure stack.
Common primitives, two ways
Both systems support the same canonical primitives. The implementation patterns differ.
Leader election
ZooKeeper pattern (ephemeral-sequential). Every candidate creates /election/n_ with EPHEMERAL | SEQUENTIAL. The server returns /election/n_0000000017, /election/n_0000000018, etc. Each candidate then calls getChildren("/election") and checks: if its own sequence number is the lowest, it is the leader. Otherwise, it sets a watch on the znode immediately preceding its own (predecessor only — not the parent) and goes to sleep. When the predecessor's session ends or the predecessor explicitly resigns, the watch fires; the candidate re-checks; if it is now the lowest, it is the new leader.
The "watch only the predecessor" trick avoids the thundering herd: if everyone watched the leader's znode, every leader change would wake all N candidates simultaneously, causing N reads on a freshly elected node. By watching only your predecessor, exactly one candidate wakes up per leadership change.
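The decision each candidate makes after getChildren can be written as a pure function (an illustrative helper, not part of any client library):

```python
# After getChildren: am I the leader, and if not, which single znode do I
# watch? (Illustrative helper; names are hypothetical.)
def election_role(children, own_name):
    ordered = sorted(children)  # zero-padded suffixes sort correctly
    idx = ordered.index(own_name)
    if idx == 0:
        return ("leader", None)
    return ("follower", ordered[idx - 1])  # watch only the predecessor

children = ["n_0000000019", "n_0000000017", "n_0000000018"]
role_17 = election_role(children, "n_0000000017")  # ("leader", None)
role_19 = election_role(children, "n_0000000019")  # ("follower", "n_0000000018")
```

When a follower's watch fires, it re-runs exactly this function on a fresh getChildren result; the predecessor that vanished may not have been the leader, in which case the follower just picks up a new predecessor and keeps waiting.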
etcd pattern (lease + put-if-absent). A candidate grants a lease with a TTL (say 10 seconds), then attempts a transaction: if mod_revision("/leader") == 0 then put("/leader", self_id, lease=L). If the TXN succeeds, this candidate is leader; it spawns a keepalive loop that refreshes the lease before it expires. If the TXN fails (key already exists), the candidate watches /leader for a delete event, then retries.
Distributed lock
ZooKeeper. Same ephemeral-sequential pattern as leader election; the lowest is the lock holder, others wait on their predecessors. Releasing the lock means deleting your znode (or letting your session expire if you crash).
etcd. Same lease-and-key as leader election, with the convention that the key path is /locks/<resource> and clients release by deleting the key or letting the lease expire. The etcd3 Python client and the Go client both ship a Lock/Unlock helper that wraps the dance.
Service registry
ZooKeeper. Every worker, on startup, creates /services/<role>/<host_port> as ephemeral with its connection details as the data. Consumers do getChildren("/services/<role>", watch=True). When a worker dies, its session expires, its znode vanishes, the watch fires on every consumer, they re-read the children, and traffic stops being routed to the dead worker.
etcd. Workers grant a lease, put /services/<role>/<host_port> with that lease, and run keepalives. Consumers do Watch("/services/<role>", prefix=True) to get a stream of PUT (worker registers) and DELETE (lease expired or worker exited cleanly) events.
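The lease mechanics behind the etcd pattern can be modelled in miniature (a toy, not etcd): keys carry a lease id, keepalives push the lease's expiry forward, and an expired lease deletes its keys and emits the DELETE events consumers watch for.

```python
# A toy model (not etcd) of lease-backed registration: keys bound to a
# lease disappear together when the TTL elapses without a keepalive.
class LeaseTable:
    def __init__(self):
        self.leases = {}  # lease_id -> expiry time
        self.keys = {}    # key -> (value, lease_id)
        self.events = []  # what a prefix watcher would see

    def grant(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl

    def keepalive(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl  # push the expiry forward

    def put(self, key, value, lease_id):
        self.keys[key] = (value, lease_id)
        self.events.append(("PUT", key))

    def tick(self, now):
        # Expire leases and cascade-delete their keys, emitting events.
        for lease_id in [l for l, t in self.leases.items() if t <= now]:
            del self.leases[lease_id]
            for key, (_, kl) in list(self.keys.items()):
                if kl == lease_id:
                    del self.keys[key]
                    self.events.append(("DELETE", key))

reg = LeaseTable()
reg.grant("w1", ttl=10, now=0)
reg.put("/services/orders/10.0.0.5:8080", "up", "w1")
reg.keepalive("w1", ttl=10, now=5)  # worker is alive at t=5; expiry moves to 15
reg.tick(now=16)                    # no keepalive since t=5: lease expires
```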
Operational concerns
Both systems are operationally finicky. Three categories matter.
Disk IO. etcd's WAL fsyncs on every write; ZooKeeper's transaction log fsyncs at every commit. If the WAL disk's fdatasync p99 exceeds ~10 ms, etcd starts logging etcdserver: request timed out warnings and Raft heartbeats time out. The fix is always the same: dedicated SSD for the WAL, no shared workloads. Cloud disks with bursty IOPS budgets (gp2 on AWS, standard PD on GCP) are notorious for this.
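Before blaming etcd, measure the disk. The thorough way is an fio benchmark with per-write fdatasync; as a rough first pass, a few lines of Python (using os.fsync as a stand-in for fdatasync) give a quick read on tail latency:

```python
# Rough fsync latency probe for a candidate WAL disk (a sketch, not a
# replacement for a proper fio benchmark): write a 4 KB block and fsync,
# repeatedly, then look at the tail of the sorted latencies.
import os
import tempfile
import time

def fsync_latencies(path_dir, iterations=200, block=b"x" * 4096):
    latencies = []
    fd, path = tempfile.mkstemp(dir=path_dir)
    try:
        for _ in range(iterations):
            os.write(fd, block)
            start = time.monotonic()
            os.fsync(fd)
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return sorted(latencies)

lat = fsync_latencies(tempfile.gettempdir())
p99 = lat[int(len(lat) * 0.99)]
print(f"p99 fsync: {p99 * 1000:.2f} ms")  # etcd wants this well under 10 ms
```

Point it at the directory that will hold the WAL; a healthy dedicated SSD sits in the low single-digit milliseconds at p99.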
Snapshot management. etcd takes a full snapshot every 100,000 revisions by default. Old WAL segments below the snapshot are truncated. Without snapshots the WAL grows unbounded; with too-frequent snapshots you waste IO. The default is fine for most clusters; tune only if you have measured. ZooKeeper's snapshots are similar — periodic full memory dumps to disk, with old txn logs truncated below them.
Monitoring. The minimum viable monitor is a periodic etcdctl endpoint health (or zkServer.sh status) that fires an alert if the cluster reports no leader. The dreaded no_leader alert means every write is blocked; debugging usually finds a network partition, a crashed minority, or a disk that has gone read-only. Beyond health, watch the histograms: etcd exposes etcd_server_proposals_committed_total, etcd_disk_wal_fsync_duration_seconds, etcd_network_peer_round_trip_time_seconds. The fsync histogram is the most useful single signal; if its p99 climbs you have a disk problem before you have a Raft problem.
Maximum cluster size. Don't grow the cluster beyond 5 nodes for redundancy. If you need geographic spread, run separate clusters and replicate between them at the application layer (or use etcd's experimental learner-and-replication features carefully).
Leader election across three Python workers using etcd
You are running three instances of a cron-runner service. Exactly one should be the active runner — pulling jobs from a queue and executing them — and if the active runner crashes, another should take over within ~15 seconds. Here is the entire client code.
import os
import socket
import time
import logging

import etcd3  # pip install etcd3

logger = logging.getLogger("cron-runner")

ELECTION_KEY = "/cron-runner/leader"
LEASE_TTL = 10  # seconds; auto-expires if we crash or hang


def run_election_loop():
    """Block forever, becoming leader when possible.

    Returns only on unrecoverable error. The caller should restart this.
    """
    client = etcd3.client(host="etcd-0.etcd", port=2379)
    self_id = f"{socket.gethostname()}-{os.getpid()}".encode()

    while True:
        # 1. Grant a lease. The lease object knows how to send keepalives.
        lease = client.lease(LEASE_TTL)

        # 2. Atomically: create the leader key only if it does not exist.
        #    The condition checks mod_revision == 0 (= "key has never existed
        #    or was deleted"), which is etcd's idiomatic create-if-not-exists.
        succeeded, _ = client.transaction(
            compare=[client.transactions.mod(ELECTION_KEY) == 0],
            success=[client.transactions.put(ELECTION_KEY, self_id, lease=lease)],
            failure=[],
        )
        if succeeded:
            logger.info("won leadership; starting keepalive + work")
            try:
                act_as_leader(client, lease, self_id)
            except Exception:
                logger.exception("leader loop died; releasing lease")
                lease.revoke()
            # Fall through to retry — a transient error should not orphan us.
            continue

        # 3. Lost the election: someone else has the key. Throw away our
        #    unused lease and watch the leader key for deletion / expiry.
        lease.revoke()
        logger.info("lost election; watching for leader to disappear")
        wait_for_leader_to_die(client)
        # Loop back and try again.


def act_as_leader(client, lease, self_id):
    """Run while we hold the leader key. Refresh the lease forever."""
    while True:
        # Send a keepalive so the lease stays alive. refresh() returns the
        # server's acks; a remaining TTL of 0 means the lease is gone.
        responses = lease.refresh()
        if not responses or responses[0].TTL <= 0:
            # Lease lost — either we hung past TTL or the cluster fenced us.
            raise RuntimeError("lease keepalive failed; not leader anymore")

        # Defensive: confirm we are still the holder. If our key was deleted
        # out from under us (e.g. someone called etcdctl del), bail out.
        value, _meta = client.get(ELECTION_KEY)
        if value != self_id:
            raise RuntimeError("leader key no longer ours; stepping down")

        # Do one tick of leader work. Keep ticks short — we must come back
        # for keepalive before TTL/3 seconds elapse.
        started = time.monotonic()
        run_one_cron_tick()

        # Throttle so we are not spinning, while keeping the TTL/3 cadence.
        elapsed = time.monotonic() - started
        if elapsed < LEASE_TTL / 3:
            time.sleep(LEASE_TTL / 3 - elapsed)


def wait_for_leader_to_die(client):
    """Block until the leader key is deleted (lease expiry or explicit del)."""
    events_iter, cancel = client.watch(ELECTION_KEY)
    try:
        for event in events_iter:
            if isinstance(event, etcd3.events.DeleteEvent):
                return
    finally:
        cancel()


def run_one_cron_tick():
    # Your real work goes here: pull a job, execute it, update state.
    pass


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    while True:
        try:
            run_election_loop()
        except Exception:
            logger.exception("election loop crashed; restarting in 5s")
            time.sleep(5)
Run three copies of this on three separate hosts (or pods). All three call transaction(...). Exactly one's TXN succeeds — etcd serializes them and only the first reaches mod_revision == 0. That one runs act_as_leader and pumps keepalives every ~3 seconds. The other two see succeeded=False, revoke their unused leases, and watch the key.
Now kill the leader's host. The leader stops sending keepalives. After at most LEASE_TTL = 10 seconds, etcd expires the lease, deletes the key, and emits a DeleteEvent. Both watchers wake; both call transaction(...) again; one wins (etcd serializes), the other goes back to watching. Total downtime: between 10 and ~13 seconds depending on when the crash happened relative to the previous keepalive.
Two non-obvious correctness points in this code.
The defensive check inside act_as_leader — re-reading the key and comparing to self_id — is what handles the stale-lease problem: an operator who runs etcdctl del /cron-runner/leader does not revoke our lease, so we would keep refreshing it for nothing while a new leader was elected and put a new key with a new lease. Without the check, we would think we are still leader. With it, we step down.
The keepalive cadence (TTL / 3) is the standard rule of thumb. Lose two consecutive keepalives and you still survive; lose three and you are out. Why TTL/3 and not TTL/2: TTL/2 leaves no margin — one missed keepalive plus normal jitter and the lease expires while you are still healthy. TTL/3 is the sweet spot between safety and not flooding etcd with requests; it is what etcd's own client libraries default to.
The same shape works in Go (using the official clientv3 library's concurrency.Election), Java, Rust, and the rest. The pattern transcends the language.
Migration: KRaft and the in-cluster Raft trend
The 2020s brought a noticeable shift away from external coordination services and toward embedded Raft for systems that previously used ZooKeeper or etcd as a sidecar.
Kafka KRaft (Kafka Raft Metadata mode) was promoted to production-default in Kafka 3.3 (2022) and ZooKeeper support was deprecated in 3.5 and removed in 4.0. Instead of running a ZooKeeper ensemble alongside Kafka brokers, KRaft runs a Raft quorum inside the Kafka brokers themselves for metadata. The result: one fewer cluster to operate, faster controller failover, and much higher partition counts per cluster (millions instead of tens of thousands). The metadata is now a Kafka topic stored on a Raft-replicated log, accessed by the controllers; brokers fetch metadata from the controller leader.
etcd in Kubernetes is going the opposite direction — it remains an external dependency, but the Kubernetes community has invested heavily in operational tooling (etcdadm, kubeadm's etcd lifecycle management, automated snapshots) so that a typical operator does not interact with etcd directly. The trend here is not "remove etcd" but "make etcd disappear into the platform."
HashiCorp Consul is interesting because it is itself a coordination service (gossip + Raft) but with a service-mesh focus. It overlaps with etcd functionally for service discovery and KV storage; the relevant tradeoffs are: Consul has built-in service mesh and health checking; etcd has a tighter, sharper API and lower per-write latency. For Kubernetes, etcd; for non-Kubernetes service discovery and config management, Consul is the more common choice.
The arc is clear: external coordination services are still the right answer for many use cases, but for systems where consensus is on the critical path of every operation, embedding Raft directly is winning.
Common confusions
- "etcd is a database." No. It is a metadata store. 8 GB total data, small values, expensive writes (every put fsyncs through Raft). Do not put your application data in etcd; put your cluster's coordination state.
- "ZooKeeper is just etcd with a tree." Mostly false. The hierarchical namespace is a real semantic difference; ZK clients can atomically operate on subtrees (e.g., create /a/b/c and /a/b/d in one multi-op). The session-and-watch model is also distinct from etcd's lease-and-watch.
- "More nodes = more durable." Up to a point, then no. 5 nodes tolerates 2 simultaneous failures, which is essentially always enough. 7+ nodes increases write latency without meaningful availability gains for the common failure modes.
- "I can run the coordination service on the same hosts as my application." You can, but you should not. Resource contention (especially disk IO) on the etcd or ZK host kills cluster health, and the failure modes get tangled — your application crashing because etcd is slow, etcd being slow because your application is using the disk.
Where this leads next
We have closed the loop on Raft for now: chapters 100-106 built the algorithm; this chapter showed how to consume it. Chapter 108 (Consensus is a log, not a database) takes the next conceptual step — the realization that what Raft really gives you is a replicated log, and a database is just one of many state machines that can be driven by it. Streams, queues, configuration trees, and service registries are all the same machinery with different state machines on top.
The single sentence to carry forward: etcd and ZooKeeper exist so that every distributed system does not need to implement consensus; they expose the same handful of primitives — leases, ephemeral nodes, atomic puts, watches — that compose into leader election, locks, and registries; the right way to use them is as a small dedicated cluster you treat as a critical dependency, not as a general-purpose database.
References
- etcd authors, etcd documentation — the canonical reference for the API, deployment topology, and operational guidance. The "FAQ" page covers the 8-GB recommendation and the dedicated-disk rule explicitly.
- The Apache ZooKeeper team, ZooKeeper Programmer's Guide — the authoritative description of znode semantics, sessions, watches, and the recipe for ephemeral-sequential leader election.
- Xiang Li, Yicheng Qin, Brandon Philips, etcd: Distributed reliable key-value store for the most critical data of a distributed system — the design talk and accompanying papers from the original CoreOS team explaining the rationale for Raft, the lease primitive, and the gRPC API.
- Kubernetes documentation, Operating etcd clusters for Kubernetes — the operational playbook for running etcd as a Kubernetes control-plane component, including snapshot/restore procedures.
- Flavio Junqueira and Benjamin Reed, ZooKeeper: Distributed Process Coordination (O'Reilly, 2013) — the book by ZooKeeper's original authors. Part II ("Programming with ZooKeeper") covers the canonical recipes for locks, leader election, and group membership; their USENIX paper on ZooKeeper is the formal reference.
- HashiCorp, Consul vs. ZooKeeper, etcd, doozerd — the comparison page from the most prominent "fourth" option in this space, useful for understanding the design-space tradeoffs even if you end up choosing etcd or ZK.