In short

Most distributed systems need the same handful of primitives: pick one node to be leader, hold a lock that releases automatically if you crash, store a small piece of shared configuration, watch for changes, register a service so others can find it. Implementing these correctly requires consensus, and consensus is hard enough that you should not roll your own. The industry's answer is to factor consensus out into a small, dedicated coordination service that other systems use as a black box.

etcd is the modern incumbent. It is a distributed key-value store backed by Raft, exposing a gRPC + REST API with two killer primitives: leases (a TTL that auto-expires keys when the holder dies) and watches (a server-streamed feed of changes to a key or prefix). It is the Kubernetes control plane's only stateful component; it backs HashiCorp Vault, CoreOS Locksmith, M3DB's metadata, and a long tail of cluster managers. Run it as a 3- or 5-node cluster on dedicated hardware, give the WAL its own disk, and never let it cross the 8 GB data limit without a plan.

ZooKeeper is the elder. It uses ZAB (ZooKeeper Atomic Broadcast, a Paxos relative) and exposes a hierarchical filesystem-like namespace of znodes. Three mechanisms do all the work: ephemeral znodes vanish when the creating session ends (liveness), sequential znodes get a monotonically increasing suffix appended by the server (ordering), and watches notify clients of changes (reactivity). Kafka pre-KRaft, HBase, Hadoop YARN, Apache Mesos, Apache Solr, and Druid all run on ZooKeeper. It is older, less ergonomic, and operationally heavier than etcd, but battle-tested over fifteen years.

Both let you implement leader election in twenty lines of client code; both implement service registries with ephemeral nodes; both make distributed locks straightforward. Both have a "no-leader" failure mode that is the central thing on-call engineers worry about. The migration trend is away from external coordination services and toward in-cluster Raft (Kafka's KRaft, etcd's role inside Kubernetes) — but for the next decade you will encounter both in production.

You are running a fleet of order-processing workers behind a queue. They consume orders, charge cards, write to the database. The business rule says exactly one worker processes each order, and if that worker crashes mid-charge, another worker must take over within seconds. Naively you can shard by order_id mod N, but workers come and go for deploys; the assignment of shards to workers needs to update without two workers ever thinking they own the same shard.

You sit down to write this. After an hour you realize you need a leader, then you realize the leader needs to be elected without a coordinator, then you discover Raft, then you read the five safety invariants, and then you take a long walk and decide that maybe the right call is not to write a Raft implementation for what is, fundamentally, an order-processing service.

That long walk is what etcd and ZooKeeper are. They are the answer to the question "I need consensus primitives, but consensus is not my product." They run as a small, dedicated cluster of 3 or 5 nodes; they expose locks, leader election, configuration storage, and change notification as a service; and they let your application code be ten lines of "lease this key" rather than ten thousand lines of "implement Raft."

This chapter walks both systems — what they offer, how they differ, what they are used for, how to operate them, and how to wire one into your application. It is the first chapter in this Build that is not about implementing consensus; it is about consuming it.

The role of a coordination service

[Figure: etcd's role in a Kubernetes cluster — apiserver is the only etcd client. kubectl, kubelets (×N), and controllers (scheduler, replicaset, …) send REST/JSON to the stateless, horizontally scalable kube-apiserver (auth + admission live here), which speaks gRPC (watch + put + txn) to a 3-node etcd cluster: e1 the Raft leader, e2 and e3 followers, replicating over port 2380 (peers), serving clients on port 2379, with WAL + snapshots under /var/lib/etcd. Worker nodes run pods and kubelets and never speak to etcd directly.]
Kubernetes runs every cluster on a 3- or 5-node etcd quorum, but only the kube-apiserver speaks gRPC to it. Everything else — kubectl, controllers, kubelets — talks REST to the apiserver, which is stateless and horizontally scalable. The apiserver translates each REST call into one or more etcd transactions, watches the relevant key prefixes, and fans changes back out to its REST clients. This funnel is deliberate: etcd's API is sharp, its capacity is small (8 GB recommended ceiling), and exposing it directly to thousands of kubelets would melt it. The apiserver is the throttle.

A coordination service has three distinguishing properties.

It is small. etcd's recommended max database size is 8 GB; ZooKeeper expects its working set to fit in memory. These are not document stores. They hold metadata: pod specs, configuration entries, lock holders, leader identities. Everything bulky — your actual data — lives elsewhere.

It is consistent. Reads and writes are linearizable by default. There is no eventual consistency knob, no read-your-writes weakness; you put a key, every subsequent read on any client sees it. Why this matters: if you are using the service to elect a leader, you cannot tolerate "two clients each read a stale value and both think they hold the lock." Linearizability is the entire point.

It exposes primitives, not data. Both systems give you compare-and-set, ephemeral keys with TTL, and change notifications. You can implement locks, leader election, and registries on top in a few dozen lines. The service does not bake in those abstractions; it gives you the safe primitives and lets the client library or your code compose them.

The bargain: you take a hard dependency on a cluster you have to operate (3 nodes minimum, 5 for zone redundancy, separate disks for WAL), and in exchange you get correct primitives instead of subtle bugs.

etcd

etcd was created by CoreOS in 2013 to back its fleet cluster manager. Today it is the only mandatory stateful component of a Kubernetes cluster. The architecture is a Raft-replicated key-value store with a small, opinionated API.

Data model

Keys are byte strings. Values are byte strings. Keys are flat (no hierarchy in the storage layer), but /-separated prefixes are the convention, and the API has a range operation that returns all keys with a given prefix — so logically you can treat etcd as a tree if you want, while the engine sees a sorted byte-string keyspace.
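
How a flat sorted keyspace serves tree-like queries: a prefix range is just the half-open interval [prefix, prefix_end), where prefix_end is the prefix with its last byte incremented. A toy sketch of the idea (not the etcd engine, and it ignores the edge case of trailing 0xff bytes):

```python
import bisect

def prefix_end(prefix: bytes) -> bytes:
    # etcd's range-end convention for a prefix query: increment the last
    # byte, so [prefix, prefix_end) covers every key starting with prefix.
    return prefix[:-1] + bytes([prefix[-1] + 1])

def get_prefix(sorted_keys: list, prefix: bytes) -> list:
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix_end(prefix))
    return sorted_keys[lo:hi]

keys = sorted([b"/services/orders/w1", b"/services/orders/w2",
               b"/services/billing/w1", b"/locks/db"])
print(get_prefix(keys, b"/services/orders/"))
# → [b'/services/orders/w1', b'/services/orders/w2']
```

Two binary searches over a sorted byte-string keyspace, and the "tree" falls out for free.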

Every write produces a new revision, a cluster-wide monotonically increasing 64-bit integer. The store keeps recent revisions so clients can read at a specific point in history (mini-MVCC). This is what makes watches efficient: a watch starts at a revision and replays every change since then, then streams new changes as they happen.
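
The revision mechanics can be sketched with a toy in-memory model (nothing like etcd's real storage engine, but the same contract): every write appends to a change log at the next revision, and a watch starting at revision R is a replay of the log from R onward.

```python
class ToyStore:
    """Toy model of etcd's revision log: every write gets the next
    cluster-wide revision; a watch replays all changes since a revision."""

    def __init__(self):
        self.revision = 0
        self.log = []        # list of (revision, key, value)
        self.data = {}       # key -> (value, mod_revision)

    def put(self, key, value):
        self.revision += 1
        self.log.append((self.revision, key, value))
        self.data[key] = (value, self.revision)
        return self.revision

    def watch_from(self, start_revision):
        # Replay everything at or after start_revision; a real watch
        # would then keep streaming new events as they commit.
        return [e for e in self.log if e[0] >= start_revision]

s = ToyStore()
s.put("/config/a", "1")
r = s.put("/config/b", "2")
s.put("/config/a", "3")
print(s.watch_from(r))
# → [(2, '/config/b', '2'), (3, '/config/a', '3')]
```

A client that disconnects can resume its watch from the last revision it saw and miss nothing, which is exactly how the Kubernetes apiserver stays in sync with etcd.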

The four operations

PUT key value [lease N] [if previous revision == R]
GET key [at revision R] [range_end K2]
DEL key [range_end K2]
TXN  if [conditions]  then [ops]  else [ops]

Plus the supporting cast: LEASE (grant, revoke, keepalive) to create TTL-bound keys, WATCH to stream changes for a key or prefix starting from a revision, and COMPACT to discard revision history below a given point.

The TXN operation is the one that makes etcd usable for primitives. It is a tiny, deterministic mini-transaction: evaluate a list of conditions on revisions/values; if all pass, run the then ops atomically; otherwise run the else ops. Compare-and-swap, "create only if not exists" (if mod_revision(K) == 0), "delete only if value matches" — all expressible as TXN.
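
The shape of a TXN is easy to model. Here is a toy evaluator (a sketch of the semantics, not etcd's API) showing create-if-not-exists as a compare on mod_revision:

```python
def txn(store, compare, success, failure):
    """Toy model of etcd's TXN: evaluate all conditions, then atomically
    run either the success ops or the failure ops. `store` maps key ->
    (value, mod_revision); mod_revision 0 means the key has never existed."""
    ok = all(cond(store) for cond in compare)
    for op in (success if ok else failure):
        op(store)
    return ok

def mod_revision(key):
    return lambda store: store.get(key, (None, 0))[1]

# Create-if-not-exists, etcd-style: succeed only when mod_revision == 0.
store = {}
create = txn(store,
             compare=[lambda s: mod_revision("/leader")(s) == 0],
             success=[lambda s: s.__setitem__("/leader", ("worker-1", 1))],
             failure=[])
again = txn(store,
            compare=[lambda s: mod_revision("/leader")(s) == 0],
            success=[lambda s: s.__setitem__("/leader", ("worker-2", 2))],
            failure=[])
print(create, again, store["/leader"][0])
# → True False worker-1
```

Two candidates race; the store serializes them, the first wins, the second's compare fails. That one-round-trip atomicity is the whole value of TXN.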

Wire and deployment

Why odd numbers and why 5 is the realistic ceiling: with 7 nodes the majority is 4 and every write waits for the slowest 4 of 7 to ack, while still only tolerating 3 failures (versus 2 in a 5-node cluster). The latency-vs-fault-tolerance trade-off levels off; very few real deployments need to tolerate 3 simultaneous failures, and those that do typically shard across regions instead of growing the etcd cluster.
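
The quorum arithmetic behind that paragraph, in two lines:

```python
def majority(n: int) -> int:
    # Raft (and ZAB) commit a write once a strict majority has acked it.
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return n - majority(n)

for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: majority {majority(n)}, tolerates {failures_tolerated(n)}")
# 4 tolerates no more failures than 3, and 6 no more than 5: even sizes
# add quorum latency without adding fault tolerance, hence odd clusters.
```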

The on-disk layout is a write-ahead log (/var/lib/etcd/wal) plus periodic snapshots (/var/lib/etcd/snap). The WAL is the hot path — every write fsyncs to it before being acked. Production rule: put the WAL on its own SSD, separate from the snapshot directory and from anything else on the host. A noisy neighbour on the WAL disk turns a 1-ms etcd write into a 100-ms one, and once etcd writes get slow the entire Kubernetes control plane visibly degrades.

What etcd is used for

The headline deployment is Kubernetes, where etcd is the sole datastore for the entire control plane, accessed only through the kube-apiserver. Beyond that, HashiCorp Vault can use it as a storage backend and HA lock, M3DB keeps its cluster metadata in it, and a long tail of cluster managers use it for the classic trio of leader election, distributed locks, and service registries.

ZooKeeper

ZooKeeper predates etcd by half a decade. Yahoo open-sourced it in 2008; it became an Apache top-level project soon after. It uses ZAB (ZooKeeper Atomic Broadcast), a primary-backup protocol that is, in the spectrum of consensus algorithms, a close cousin to Multi-Paxos. The protocol is different from Raft in the details — see Paxos, Multi-Paxos, ZAB: what Raft simplified — but the externally visible guarantees are the same: linearizable writes, sequential consistency, fault tolerance up to a minority of nodes.

The znode hierarchy

ZooKeeper exposes a filesystem-like tree. Every node is a znode. A znode has a path (/services/orders/worker-3), a small data payload (≤1 MB by default; in practice keep it ≤1 KB), and metadata (czxid, mzxid, version, ACLs).

[Figure: ZooKeeper namespace — three znode flavours do all the work. A tree rooted at /, with persistent znodes /services, /locks, /services/orders, and /locks/db. Under /services/orders sit ephemeral-sequential children worker-0000000017, worker-0000000018, worker-0000000019, with a watcher registered on the parent's children. Under /locks/db sit lock-0000000042 (the holder, lowest sequence) and waiters lock-0000000043 and lock-0000000044.]
Three znode flavours: persistent (survive client disconnects), ephemeral (vanish when the creating session ends — used for liveness signals), and ephemeral-sequential (the same with a server-assigned monotonic suffix — used for ordered queues and the canonical leader-election pattern). The watcher on /services/orders is a one-shot subscription that fires when any child is added or removed; the client must re-register after each fire.

The four flavours of znode: persistent (the default; survives client disconnects and must be deleted explicitly), persistent-sequential (persistent, plus a server-assigned monotonically increasing suffix), ephemeral (deleted automatically when the creating session ends; cannot have children), and ephemeral-sequential (ephemeral plus the suffix, the workhorse behind locks and leader election).

Watches

A watch is a one-shot trigger registered with most read operations (getData, getChildren, exists). When the watched thing changes, the server sends one notification, and the watch is then deregistered. The client must re-register the watch on its next read.

Why one-shot: it forces the client to re-read after every notification, which is exactly what a correct client should do anyway. A "permanent" watch would tempt clients to react to events without re-reading state, which is racy. The one-shot model bakes the read-after-notification pattern into the API.
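
The one-shot discipline, sketched as a toy (a simulation of the semantics, not the real ZooKeeper client API; `read_and_watch` stands in for a getData/getChildren call with a watch):

```python
import queue

class OneShotWatchServer:
    """Toy stand-in for ZooKeeper watch semantics: each registered watch
    fires at most once, then the server forgets it."""

    def __init__(self, value):
        self.value = value
        self.watchers = []                 # pending one-shot notifications

    def read_and_watch(self, notify_queue):
        # Read the current state AND register a one-shot watch.
        self.watchers.append(notify_queue)
        return self.value

    def write(self, value):
        self.value = value
        pending, self.watchers = self.watchers, []   # fire once, deregister
        for q in pending:
            q.put("changed")

server = OneShotWatchServer(["w1"])
events = queue.Queue()

server.read_and_watch(events)          # initial read + watch registration
server.write(["w1", "w2"])             # watch fires once...
server.write(["w1", "w2", "w3"])       # ...this change is NOT delivered

events.get()                           # exactly one notification arrived
seen = server.read_and_watch(events)   # the mandatory re-read + re-register
print(seen)
# → ['w1', 'w2', 'w3']
```

Note that the client never misses state even though it missed an event: the change delivered while no watch was registered is picked up by the re-read. That is the read-after-notification pattern the API bakes in.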

Wire and deployment

Clients speak ZooKeeper's binary protocol to port 2181; ensemble members replicate on port 2888 and run leader election on port 3888. The deployment rules rhyme with etcd's: an odd-sized ensemble of 3 or 5 nodes, and the transaction log (dataLogDir) on its own disk, separate from snapshots (dataDir), because the txn-log fsync is ZooKeeper's hot path just as the WAL is etcd's.

What ZooKeeper is used for

The footprint is enormous; for a decade ZooKeeper was the de facto coordination service for the JVM-centric data-infrastructure stack.

Common primitives, two ways

Both systems support the same canonical primitives. The implementation patterns differ.

[Figure: the three classic primitives — same shape, different bindings. 1. Leader election (one of N becomes the only writer): etcd binds it to a lease plus a CAS create-if-absent on a "/leader" key with a TTL; ZK to ephemeral-sequential znodes under /election, lowest number wins. 2. Distributed lock (mutual exclusion across hosts; A holds, B and C wait): etcd uses lease + put-if-absent on /lock/db with auto-release on lease expiry; ZK uses ephemeral-sequential znodes, each waiter watching its predecessor (no thundering herd). 3. Service registry (live members + change events; w4 dies, the watch fires, the registry shrinks): etcd gives each worker a lease and consumers a prefix watch on /services/…, where lease expiry surfaces as a DELETE event; ZK gives each worker an ephemeral znode and consumers getChildren("/services/…") with a watch on the parent.]
The three primitives every coordination-service user implements eventually. The shapes are the same in both systems; only the bindings differ. Leader election: one of N processes becomes the only authorized writer; the others sit by until it dies. Distributed lock: at most one holder at a time; waiters block. Service registry: each live process publishes a presence record; clients watch the parent for change events. In all three, the "auto-cleanup on death" property — etcd lease expiry or ZK ephemeral-on-session-end — is what makes the primitive correct under crash failures.

Leader election

ZooKeeper pattern (ephemeral-sequential). Every candidate creates /election/n_ with EPHEMERAL | SEQUENTIAL. The server returns /election/n_0000000017, /election/n_0000000018, etc. Each candidate then calls getChildren("/election") and checks: if its own sequence number is the lowest, it is the leader. Otherwise, it sets a watch on the znode immediately preceding its own (predecessor only — not the parent) and goes to sleep. When the predecessor's session ends or the predecessor explicitly resigns, the watch fires; the candidate re-checks; if it is now the lowest, it is the new leader.

The "watch only the predecessor" trick avoids the thundering herd: if everyone watched the leader's znode, every leader change would wake all N candidates simultaneously, causing N reads on a freshly elected node. By watching only your predecessor, exactly one candidate wakes up per leadership change.
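
The decision each candidate makes after getChildren can be isolated into one pure function over the sequence numbers (a hypothetical helper for illustration, not part of any ZooKeeper client library):

```python
def election_decision(children, mine):
    """Given the children of /election (names like 'n_0000000017') and our
    own znode name, return ('leader', None) or ('follower', predecessor),
    where predecessor is the single znode we should watch."""
    ordered = sorted(children, key=lambda name: int(name.split("_")[1]))
    i = ordered.index(mine)
    if i == 0:
        return ("leader", None)
    return ("follower", ordered[i - 1])   # watch only the predecessor

children = ["n_0000000019", "n_0000000017", "n_0000000018"]
print(election_decision(children, "n_0000000017"))
print(election_decision(children, "n_0000000019"))
# → ('leader', None)
# → ('follower', 'n_0000000018')
```

When 0000000018's session dies, only 0000000019 wakes up, re-runs this function against fresh children, and finds its new predecessor is 0000000017: still a follower, exactly one wake-up per membership change.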

etcd pattern (lease + put-if-absent). A candidate grants a lease with a TTL (say 10 seconds), then attempts a transaction: if mod_revision("/leader") == 0 then put("/leader", self_id, lease=L). If the TXN succeeds, this candidate is leader; it spawns a keepalive loop that refreshes the lease before it expires. If the TXN fails (key already exists), the candidate watches /leader for a delete event, then retries.

Distributed lock

ZooKeeper. Same ephemeral-sequential pattern as leader election; the lowest is the lock holder, others wait on their predecessors. Releasing the lock means deleting your znode (or letting your session expire if you crash).

etcd. Same lease-and-key as leader election, with the convention that the key path is /locks/<resource> and clients release by deleting the key or letting the lease expire. The etcd3 Python client and the Go client both ship a Lock/Unlock helper that wraps the dance.

Service registry

ZooKeeper. Every worker, on startup, creates /services/<role>/<host_port> as ephemeral with its connection details as the data. Consumers do getChildren("/services/<role>", watch=True). When a worker dies, its session expires, its znode vanishes, the watch fires on every consumer, they re-read the children, and traffic stops being routed to the dead worker.

etcd. Workers grant a lease, put /services/<role>/<host_port> with that lease, and run keepalives. Consumers do Watch("/services/<role>", prefix=True) to get a stream of PUT (worker registers) and DELETE (lease expired or worker exited cleanly) events.

Operational concerns

Both systems are operationally finicky. Three categories matter.

Disk IO. etcd's WAL fsyncs on every write; ZooKeeper's transaction log fsyncs at every commit. If the WAL disk's fdatasync p99 exceeds ~10 ms, etcd starts logging etcdserver: request timed out warnings and Raft heartbeats time out. The fix is always the same: dedicated SSD for the WAL, no shared workloads. Cloud disks with bursty IOPS budgets (gp2 on AWS, standard PD on GCP) are notorious for this.

Snapshot management. etcd takes a full snapshot every 100,000 applied writes by default (the --snapshot-count flag). Old WAL segments below the snapshot are truncated. Without snapshots the WAL grows unbounded; with too-frequent snapshots you waste IO. The default is fine for most clusters; tune only if you have measured. ZooKeeper's snapshots are similar — periodic full memory dumps to disk, with old txn logs truncated below them.

Monitoring. The minimum viable monitor is a periodic etcdctl endpoint health (or zkServer.sh status) that fires an alert if the cluster reports no leader. The dreaded no_leader alert means every write is blocked; debugging usually finds a network partition, a crashed minority, or a disk that has gone read-only. Beyond health, watch the histograms: etcd exposes etcd_server_proposals_committed_total, etcd_disk_wal_fsync_duration_seconds, etcd_network_peer_round_trip_time_seconds. The fsync histogram is the most useful single signal; if its p99 climbs you have a disk problem before you have a Raft problem.

Maximum cluster size. Don't grow the cluster beyond 5 nodes for redundancy. If you need geographic spread, run separate clusters and replicate between them at the application layer (or use etcd's experimental learner-and-replication features carefully).

Leader election across three Python workers using etcd

You are running three instances of a cron-runner service. Exactly one should be the active runner — pulling jobs from a queue and executing them — and if the active runner crashes, another should take over within ~15 seconds. Here is the entire client code.

import os
import socket
import time
import logging
import etcd3  # pip install etcd3

logger = logging.getLogger("cron-runner")
ELECTION_KEY = "/cron-runner/leader"
LEASE_TTL = 10  # seconds; auto-expires if we crash or hang


def run_election_loop():
    """Block forever, becoming leader when possible.

    Returns only on unrecoverable error. The caller should restart this.
    """
    client = etcd3.client(host="etcd-0.etcd", port=2379)
    self_id = f"{socket.gethostname()}-{os.getpid()}".encode()

    while True:
        # 1. Grant a lease. The lease object knows how to send keepalives.
        lease = client.lease(LEASE_TTL)

        # 2. Atomically: create the leader key only if it does not exist.
        #    The condition checks mod_revision == 0 (= "key has never existed
        #    or was deleted"), which is etcd's idiomatic create-if-not-exists.
        succeeded, _ = client.transaction(
            compare=[client.transactions.mod(ELECTION_KEY) == 0],
            success=[client.transactions.put(ELECTION_KEY, self_id, lease=lease)],
            failure=[],
        )

        if succeeded:
            logger.info("won leadership; starting keepalive + work")
            try:
                act_as_leader(client, lease, self_id)
            except Exception:
                logger.exception("leader loop died; releasing lease")
                lease.revoke()
            # Fall through to retry — a transient error should not orphan us.
            continue

        # 3. Lost the election: someone else has the key. Throw away our
        #    unused lease and watch the leader key for deletion / expiry.
        lease.revoke()
        logger.info("lost election; watching for leader to disappear")
        wait_for_leader_to_die(client)
        # Loop back and try again.


def act_as_leader(client, lease, self_id):
    """Run while we hold the leader key. Refresh the lease forever."""
    last_refresh = time.monotonic()

    while True:
        # Send a keepalive so the lease stays alive. python-etcd3's
        # lease.refresh() returns the keepalive responses; an expired or
        # revoked lease answers with TTL 0 (or nothing at all).
        responses = lease.refresh()
        if not responses or responses[0].TTL <= 0:
            # Lease lost — either we hung past TTL or the cluster fenced us.
            raise RuntimeError("lease keepalive failed; not leader anymore")

        # Defensive: confirm we are still the holder. If our key was deleted
        # out from under us (e.g. someone called etcdctl del), bail out.
        value, meta = client.get(ELECTION_KEY)
        if value != self_id:
            raise RuntimeError("leader key no longer ours; stepping down")

        # Do one tick of leader work. Keep ticks short — we must come back
        # for keepalive before TTL/3 seconds elapse.
        run_one_cron_tick()

        # Throttle so we are not spinning.
        elapsed = time.monotonic() - last_refresh
        if elapsed < LEASE_TTL / 3:
            time.sleep(LEASE_TTL / 3 - elapsed)
        last_refresh = time.monotonic()


def wait_for_leader_to_die(client):
    """Block until the leader key is deleted (lease expiry or explicit del)."""
    events_iter, cancel = client.watch(ELECTION_KEY)
    try:
        for event in events_iter:
            if isinstance(event, etcd3.events.DeleteEvent):
                return
    finally:
        cancel()


def run_one_cron_tick():
    # Your real work goes here: pull a job, execute it, update state.
    pass


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    while True:
        try:
            run_election_loop()
        except Exception:
            logger.exception("election loop crashed; restarting in 5s")
            time.sleep(5)

Run three copies of this on three separate hosts (or pods). All three call transaction(...). Exactly one's TXN succeeds — etcd serializes them and only the first reaches mod_revision == 0. That one runs act_as_leader and pumps keepalives every ~3 seconds. The other two see succeeded=False, revoke their unused leases, and watch the key.

Now kill the leader's host. The leader stops sending keepalives. After at most LEASE_TTL = 10 seconds, etcd expires the lease, deletes the key, and emits a DeleteEvent. Both watchers wake; both call transaction(...) again; one wins (etcd serializes), the other goes back to watching. Total downtime: between 10 and ~13 seconds depending on when the crash happened relative to the previous keepalive.

Two non-obvious correctness points in this code.

The defensive check inside act_as_leader — re-reading the key and comparing to self_id — is what handles the stale-lease problem: an operator who runs etcdctl del /cron-runner/leader does not revoke our lease, so we would keep refreshing it for nothing while a new leader was elected and put a new key with a new lease. Without the check, we would think we are still leader. With it, we step down.

The keepalive cadence (TTL / 3) is the standard rule of thumb. Lose two consecutive keepalives and you still survive; lose three and you are out. Why TTL/3 and not TTL/2: TTL/2 leaves no margin — one missed keepalive plus normal jitter and the lease expires while you are still healthy. TTL/3 is the sweet spot between safety and not flooding etcd with requests; it is what etcd's own client libraries default to.

The same shape works in Go (using the official clientv3 library's concurrency.Election), Java, Rust, and the rest. The pattern transcends the language.

Migration: KRaft and the in-cluster Raft trend

The 2020s brought a noticeable shift away from external coordination services and toward embedded Raft for systems that previously used ZooKeeper or etcd as a sidecar.

Kafka KRaft (Kafka Raft Metadata mode) was declared production-ready in Kafka 3.3 (2022); ZooKeeper support was deprecated in 3.5 and removed in 4.0. Instead of running a ZooKeeper ensemble alongside Kafka brokers, KRaft runs a Raft quorum inside the Kafka brokers themselves for metadata. The result: one fewer cluster to operate, faster controller failover, and much higher partition counts per cluster (millions instead of tens of thousands). The metadata is now a Kafka topic stored on a Raft-replicated log, accessed by the controllers; brokers fetch metadata from the controller leader.

etcd in Kubernetes is going the opposite direction — it remains an external dependency, but the Kubernetes community has invested heavily in operational tooling (etcdadm, kubeadm's etcd lifecycle management, automated snapshots) so that a typical operator does not interact with etcd directly. The trend here is not "remove etcd" but "make etcd disappear into the platform."

HashiCorp Consul is interesting because it is itself a coordination service (gossip + Raft) but with a service-mesh focus. It overlaps with etcd functionally for service discovery and KV storage; the relevant tradeoffs are: Consul has built-in service mesh and health checking; etcd has a tighter, sharper API and lower per-write latency. For Kubernetes, etcd; for non-Kubernetes service discovery and config management, Consul is the more common choice.

The arc is clear: external coordination services are still the right answer for many use cases, but for systems where consensus is on the critical path of every operation, embedding Raft directly is winning.

Common confusions

A lease is not a fencing token. Winning the leader key proves etcd believed you were alive at your last keepalive, not that you still are: a process paused past its TTL can wake up and keep acting. That is why the defensive re-read in act_as_leader matters, and why side effects on external systems need fencing of their own.

ZooKeeper watches are one-shot; etcd watches are streams. A ZK client that forgets to re-register after a notification silently stops seeing changes; an etcd watch keeps streaming until cancelled.

The 8 GB figure is a recommended ceiling, not a hard crash threshold; but exceeding the configured backend quota raises an alarm and blocks writes until you compact and defragment, which in practice is just as bad.

Where this leads next

We have closed the loop on Raft for now: chapters 100-106 built the algorithm; this chapter showed how to consume it. Chapter 108 (Consensus is a log, not a database) takes the next conceptual step — the realization that what Raft really gives you is a replicated log, and a database is just one of many state machines that can be driven by it. Streams, queues, configuration trees, and service registries are all the same machinery with different state machines on top.

The single sentence to carry forward: etcd and ZooKeeper exist so that every distributed system does not need to implement consensus; they expose the same handful of primitives — leases, ephemeral nodes, atomic puts, watches — that compose into leader election, locks, and registries; the right way to use them is as a small dedicated cluster you treat as a critical dependency, not as a general-purpose database.

References

  1. etcd authors, etcd documentation — the canonical reference for the API, deployment topology, and operational guidance. The "FAQ" page covers the 8-GB recommendation and the dedicated-disk rule explicitly.
  2. The Apache ZooKeeper team, ZooKeeper Programmer's Guide — the authoritative description of znode semantics, sessions, watches, and the recipe for ephemeral-sequential leader election.
  3. Xiang Li, Yicheng Qin, Brandon Philips, etcd: Distributed reliable key-value store for the most critical data of a distributed system — the design talk and accompanying papers from the original CoreOS team explaining the rationale for Raft, the lease primitive, and the gRPC API.
  4. Kubernetes documentation, Operating etcd clusters for Kubernetes — the operational playbook for running etcd as a Kubernetes control-plane component, including snapshot/restore procedures.
  5. Flavio Junqueira and Benjamin Reed, ZooKeeper: Distributed Process Coordination (O'Reilly, 2013) — the book by ZooKeeper's original authors. Part II ("Programming with ZooKeeper") covers the canonical recipes for locks, leader election, and group membership; their USENIX paper on ZooKeeper is the formal reference.
  6. HashiCorp, Consul vs. ZooKeeper, etcd, doozerd — the comparison page from the most prominent "fourth" option in this space, useful for understanding the design-space tradeoffs even if you end up choosing etcd or ZK.