In short
A replicated state machine (RSM) is the theoretical model underlying every consensus-based distributed database — etcd, ZooKeeper, Consul, the per-range Raft groups in CockroachDB, the per-tablet Paxos groups in Spanner. Strip the idea down: a state machine is a deterministic program that reads an operation, updates internal state, and emits a response. Replicate the same program across N nodes, feed every node the same sequence of operations in the same order, and because the program is deterministic, every node ends up in the same state after applying any prefix.
That is the whole abstraction, and it is almost trivial in isolation. The hard part is the "same sequence in the same order" clause when the network drops messages, reorders packets, partitions, and the nodes themselves crash and restart. Consensus is the algorithm that solves exactly that problem: getting a majority of replicas to agree on the i-th entry of the operation log, durably, before any replica acts on it. Raft and Paxos are the two well-studied consensus algorithms; both produce a totally ordered, fault-tolerant log of operations.
The combination — a deterministic program plus a fault-tolerant agreed-upon log — gives you a service that behaves, from the outside, like a single reliable machine, even though no single machine inside is reliable. It tolerates a minority of replicas crashing, networks splitting, leaders dying. As long as a majority survives and can talk to each other, the system makes progress; when they cannot, it stops rather than diverge.
This chapter introduces the abstraction itself: state machines, determinism, the log, safety vs liveness, and the constraints that make the model work. Chapter 100 takes the next step into how a cluster picks a leader; chapter 101 covers how that leader replicates log entries and decides when one is committed.
Two computers, both holding an integer, both starting at zero. You hand each one the same slip with three operations: +3, +5, *2. Both process in that order: 0 → 3 → 8 → 16. They agree.
Now hand them the same operations but let the slips arrive in any order. Computer A reads +3, *2, +5 and lands at 11. Computer B reads *2, +3, +5 and lands at 8. Same operations, different order, divergent state. There is no way to retroactively reconcile without throwing one history away.
That trivial observation is the entire foundation of consensus-based replication. If every replica applies the same operations in the same order, and the operations are deterministic, the replicas cannot diverge. That guarantee — total order over operations — is what a consensus protocol provides. Everything else in this part of the book is plumbing around that one idea.
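The divergence in the two-computer example can be reproduced in a few lines. This is an illustrative sketch, not part of any real system:

```python
def apply_ops(ops):
    """Fold a sequence of integer operations over an initial state of 0."""
    state = 0
    for op, arg in ops:
        if op == '+':
            state += arg
        elif op == '*':
            state *= arg
    return state

print(apply_ops([('+', 3), ('+', 5), ('*', 2)]))  # 16 -- the agreed order
print(apply_ops([('+', 3), ('*', 2), ('+', 5)]))  # 11 -- computer A's order
print(apply_ops([('*', 2), ('+', 3), ('+', 5)]))  # 8  -- computer B's order
```

Same multiset of operations, three final states. Only a total order restores agreement.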
The state machine model
A state machine, in the Schneider 1990 sense, is four things:
- A set of states — for a KV store, every dictionary mapping strings to values.
- An initial state — the empty dictionary.
- A set of operations — `SET k v`, `GET k`, `DEL k`.
- A transition function `f(state, op) → (state', output)` producing the next state plus a response.
The transition function is the program. It can be as simple as a dictionary update or as complex as a SQL execution engine — Spanner's per-tablet RSM is effectively a SQL processor. The model cares about only one property of f: determinism. Same (state, op) always produces the same (state', output), every time, on every machine.
Why determinism is non-negotiable: if f could give different answers on different replicas — say it consults the local wall clock — then feeding the same operation sequence to two replicas no longer guarantees they reach the same state. The replicas would diverge from the first non-deterministic call, and the entire convergence argument collapses.
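A minimal illustration of the collapse, using a hypothetical dice-roll operation rather than a clock:

```python
import random

def nondeterministic_apply(state, op, rng):
    # BROKEN as an RSM transition: the result depends on replica-local
    # RNG state, not solely on (state, op).
    if op == 'ROLL':
        return state + rng.randint(1, 6)
    return state

# Two replicas apply the identical operation to the identical state,
# but each process carries its own RNG, so their results need not match.
replica_a = nondeterministic_apply(0, 'ROLL', random.Random(1))
replica_b = nondeterministic_apply(0, 'ROLL', random.Random(2))
```

From the first such call onward the replicas can disagree, and no amount of careful log ordering repairs that.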
The state machine alone is a single-node abstraction. The interesting question is what happens when you run several of them.
Why replication via same-ops works
Run the same state machine on five replicas, all starting from the same initial state. The conditions for convergence are clean:
- Every replica starts in the same state.
- Every replica applies the same set of operations.
- Every replica applies them in the same order.
- The transition function
fis deterministic.
If all four hold, induction over the operation sequence proves the replicas are byte-identical at every step. After 0 operations, all share the initial state. After applying op_i, every replica computes f(state_{i-1}, op_i); by determinism this is the same (state_i, output_i) everywhere. Apply the same logic to step i+1 forever.
Three immediate consequences. Reads are easy — any replica gives the right answer, so route reads to whichever is closest. Failures are graceful — surviving replicas have identical state and keep accepting operations; a recovering replica asks for missing operations and rejoins. Adding capacity is easy — a new replica replays the log from operation 0 and is indistinguishable from any other.
Conditions 1 and 4 are programmer responsibilities. Conditions 2 and 3 are what consensus delivers: turn a stream of client operations submitted to arbitrary nodes into a single agreed ordered sequence that every replica applies.
The consensus problem
Imagine three replicas all accepting client writes. Client X sends SET name=Alice to replica 1 at the same instant Y sends SET name=Bob to replica 2. Without coordination, replica 1 puts Alice first, replica 2 puts Bob first, and the third sees whichever arrives first. Three replicas, three histories, divergence.
The fix: force every operation through an agreement step before any replica applies it. The step picks an order — Alice then Bob, or Bob then Alice. Both are valid; the protocol just has to commit to one. Once committed, every replica applies in that order.
That agreement step is the consensus problem: get a majority of nodes to agree on a single value (the next log entry) despite an asynchronous network and crashes. The classical formal requirements are three:
- Agreement. No two correct nodes decide different values.
- Validity. The decided value was actually proposed by some node.
- Termination. Every correct node eventually decides.
The FLP impossibility result (Fischer, Lynch, Paterson 1985) proves no deterministic protocol can guarantee all three in a fully asynchronous system where even one node may crash. Real protocols escape by relaxing termination: agreement and validity always, termination only when the network behaves reasonably. Practically this means heartbeats and timeouts — aggressive timeouts may stall the protocol briefly, but never produce inconsistent state.
Raft and Paxos are the two consensus algorithms to know. Paxos came first (Lamport, formally 1998); Raft (Ongaro and Ousterhout 2014) was designed explicitly to be more understandable while solving the same problem with the same guarantees. Spanner and Chubby use Paxos variants; etcd and CockroachDB use Raft; ZooKeeper uses Zab (a Paxos cousin). The RSM abstraction is independent of which consensus algorithm sits underneath.
Determinism constraints
The "deterministic transition function" requirement is a hard constraint you must enforce. Several common habits break it without warning, and a state machine that violates determinism is a time bomb: it works fine in development, replicates fine for a long time, and then one day a replica diverges and you find yourself reconciling histories at 3 AM.
No wall-clock timestamps inside the transition function. time.time() returns different values on different replicas. If your operation needs a timestamp, the leader must assign it before the operation enters the log, and every replica reads op.timestamp, not time.time(), when applying.
No randomness. random.random() returns different values on every replica. Generate random IDs on the leader and put them in the operation payload.
No local file, environment, or external API access. Reading /etc/hostname, os.environ, or calling requests.get(...) produces different values across replicas. Replicate configuration through the log itself; have the leader make any external calls before the operation is logged and bake the response into the payload.
No floating-point arithmetic whose precision varies across CPUs. Modern x86 and ARM mostly agree on IEEE 754, but historical anomalies (x87's 80-bit intermediates, FMA availability) bite. Production systems either avoid floats in the state machine or pin to deterministic libraries.
Why this matters in practice: the state machine is your application's most replicated, longest-lived code path. A subtle non-determinism — a Python set iterated under PYTHONHASHSEED randomisation, an iteration order that depends on insertion timing — can silently corrupt a replica years after deployment. Production RSM frameworks (Apache BookKeeper, HashiCorp Raft) ship with restrictive APIs that make non-determinism awkward to introduce.
The ergonomic answer is always the same: separate side effects from state transitions. The leader does the non-deterministic work (clock, API, random ID), packs results into the operation, and submits to the log. Every replica then runs the deterministic part using the leader's pre-computed values.
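A sketch of that separation, with hypothetical helper names; the shape is what matters, not the field layout:

```python
import time
import uuid

def leader_prepare(raw_op):
    """Leader-side pre-processing: all non-deterministic work happens
    here, exactly once, before the operation enters the log."""
    return {
        'kind': raw_op['kind'],
        'key': raw_op['key'],
        'value': raw_op.get('value'),
        'timestamp': time.time(),     # wall clock read once, on the leader
        'op_id': str(uuid.uuid4()),   # random ID fixed at the leader
    }

def apply_op(state, op):
    """Deterministic transition: reads only (state, op), never the clock."""
    if op['kind'] == 'SET':
        state[op['key']] = (op['value'], op['timestamp'])
    return state

# The logged entry carries the leader's pre-computed values, so every
# replica that applies it computes exactly the same next state.
entry = leader_prepare({'kind': 'SET', 'key': 'name', 'value': 'Alice'})
```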
The log — the replication unit
The unit of replication in an RSM is not the state and not an individual operation; it is the log, an indexed append-only sequence of operations. Entry 1 applies first, then 2, then 3, forever. The log is the canonical history; the state at any replica is whatever you get by replaying the log up to wherever the replica has caught up.
Two indices matter. The applied index is the highest entry the replica has executed against its state machine. The committed index is the highest entry the cluster has agreed on — a majority has durably stored it. A replica must never apply an entry past its committed index, because an unagreed entry might be overwritten if a different leader takes over. The committed index advances only when a majority acks; the applied index follows behind lazily.
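The relationship between the two indices can be sketched in a few lines; this is a toy with invented names, not a real implementation:

```python
class LogReplica:
    def __init__(self):
        self.log = []        # entry i lives at list index i - 1
        self.committed = 0   # highest entry a majority has durably stored
        self.applied = 0     # highest entry executed against the state
        self.state = {}

    def advance(self):
        # Apply lazily, but never past the committed index: an
        # uncommitted entry might still be overwritten by a new leader.
        while self.applied < self.committed:
            self.applied += 1
            key, value = self.log[self.applied - 1]
            self.state[key] = value

r = LogReplica()
r.log = [('a', 1), ('b', 2), ('c', 3)]
r.committed = 2   # entry 3 not yet acknowledged by a majority
r.advance()
# r.applied == 2; entry 3 stays unapplied until committed advances
```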
The log is also the unit of durability. After a crash, a node reads its on-disk log, replays it against an empty state machine, and is back in business. The state machine itself is conceptually pure in-memory; you do not need to checkpoint it (though for log-truncation reasons you do, in practice).
Python sketch — toy RSM
The structure of a replicated state machine, free of consensus concerns, fits in fewer than 30 lines.
```python
class StateMachine:
    """The deterministic program. Same input -> same output, always."""

    def __init__(self):
        self.state: dict[str, str] = {}

    def apply(self, op: tuple) -> str | None:
        kind, *args = op
        if kind == 'SET':
            self.state[args[0]] = args[1]
            return 'OK'
        elif kind == 'GET':
            return self.state.get(args[0])
        elif kind == 'DEL':
            self.state.pop(args[0], None)
            return 'OK'
        raise ValueError(f'unknown op {kind!r}')


class Replica:
    """A node holding a log and a state machine. The log is the canonical
    history; state is derived by replaying log entries in order."""

    def __init__(self):
        self.sm = StateMachine()
        self.log: list[tuple] = []
        self.applied = 0

    def append_and_apply(self, op: tuple) -> str | None:
        self.log.append(op)
        self.applied += 1
        return self.sm.apply(op)
```
That is the entire abstraction. To verify three replicas converge, call append_and_apply on each with the same operation sequence in the same order, and assert all three state dictionaries are equal at every step.
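That convergence check, with the two classes repeated in compact form so it runs standalone:

```python
class StateMachine:
    def __init__(self):
        self.state: dict[str, str] = {}

    def apply(self, op):
        kind, *args = op
        if kind == 'SET':
            self.state[args[0]] = args[1]
            return 'OK'
        if kind == 'GET':
            return self.state.get(args[0])
        if kind == 'DEL':
            self.state.pop(args[0], None)
            return 'OK'
        raise ValueError(kind)

class Replica:
    def __init__(self):
        self.sm, self.log, self.applied = StateMachine(), [], 0

    def append_and_apply(self, op):
        self.log.append(op)
        self.applied += 1
        return self.sm.apply(op)

replicas = [Replica() for _ in range(3)]
ops = [('SET', 'name', 'Alice'), ('SET', 'city', 'Oslo'), ('DEL', 'name')]
for op in ops:
    for r in replicas:
        r.append_and_apply(op)
    # After every operation, all replicas hold identical state.
    assert all(r.sm.state == replicas[0].sm.state for r in replicas)

print(replicas[0].sm.state)  # {'city': 'Oslo'}
```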
What the sketch omits is the part that makes RSMs interesting: how the cluster decides what the next operation is. The test driver here calls append_and_apply synchronously on every replica. In production, clients send operations to different replicas concurrently, and consensus turns those concurrent submissions into a single ordered log every replica appends to. Building that machinery on this foundation is the job of ch.100 (leader election) and ch.101 (log replication).
Two failures RSMs don't handle alone
Standard consensus protocols handle crash-stop failures (nodes work or stop) and omission failures (messages lost or delayed but not corrupted). Two real failure modes lie outside.
Byzantine failures — a node that lies, sending different log entries to different peers, forging signatures. Raft and Paxos assume honest-but-may-crash; a Byzantine actor can fool them into committing inconsistent state. BFT protocols (PBFT, Tendermint, HotStuff) handle this by requiring a 2/3 majority and cryptographic signatures on every message. Cost: higher message complexity (O(n^2) per decision vs Raft's O(n)). BFT is essential for public blockchains and cross-organisation systems; overkill for one company's internal etcd cluster.
Catastrophic simultaneous crashes — every replica crashing at once (power failure across all DCs, a software bug crashing all nodes on the same input). Consensus tolerates a minority of failures; the majority must survive. Recovery from total loss requires durable on-disk persistence of the log so nodes can replay on restart. Every production RSM uses fsync-on-disk writes for exactly this reason.
For everything between — half the cluster crashing, partitions, slow nodes, lost packets, staggered restarts — the RSM-plus-consensus combination is exactly the right tool.
Liveness vs safety
A correct distributed protocol must hold two qualitatively different properties.
Safety — bad things never happen. Replicas never disagree on a committed entry; an entry once committed is never overwritten. Raft proves these as theorems, conditional only on correct implementation and a non-Byzantine majority. Safety holds regardless of network behaviour — partitions, arbitrary delays, reordering, repeated leader churn cannot violate it. The cluster might stop making progress; it never silently produces wrong answers.
Liveness — good things eventually happen. The cluster eventually commits new entries; a partitioned leader eventually steps down. Liveness is what FLP targets: in a fully asynchronous system where messages can be delayed arbitrarily, no deterministic protocol can guarantee liveness with even one crash failure. The proof constructs an adversarial schedule that oscillates forever.
Real consensus protocols achieve liveness by assuming partial synchrony: some unknown but finite bound on message delay exists, and after some stabilisation time the system behaves synchronously. The mechanism is timeouts and heartbeats. Too short: spurious leader changes. Too long: slow failure recovery. Raft's election timeout is typically 150-300 ms, heartbeat 50 ms.
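The usual timeout arithmetic fits in a few lines; the constants below are the typical values quoted above, not a prescription:

```python
import random

HEARTBEAT_MS = 50                  # leader -> follower keep-alive interval
ELECTION_TIMEOUT_MS = (150, 300)   # Raft's typical election-timeout range

def next_election_timeout(rng=random):
    # Each follower draws a fresh random timeout every round, so two
    # followers rarely expire together and split the vote. The lower
    # bound sits well above the heartbeat interval, so a healthy leader
    # resets the timer long before it fires.
    return rng.uniform(*ELECTION_TIMEOUT_MS)
```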
Why this matters operationally: when an RSM cluster goes silent — partition, runaway GC pause, three replicas down in sequence — it has chosen safety over liveness. It is not "broken" in any silent-corruption sense; it is correctly refusing progress until quorum re-establishes. Monitoring must distinguish "no quorum, stopped" from "quorum exists, slow" — different incidents, different remediations.
Applications
The RSM pattern appears all over modern distributed systems, often without being called out by name.
etcd, ZooKeeper, and Consul are coordination services using Raft (etcd, Consul) or Zab (ZooKeeper) over a small in-memory state machine holding configuration, distributed locks, and service registrations. Kubernetes stores its entire cluster state in etcd; every API request that mutates state goes through Raft. CockroachDB replicates every key range (64-512 MB) via an independent Raft group of three or five nodes — the state machine for each range is the KV store for that range. Spanner uses per-tablet Paxos with TrueTime for globally consistent ordering. Apache Kafka's KRaft mode (3.0+) replaced ZooKeeper with an internal Raft state machine for metadata.
The common thread: any time you need a small, strongly-consistent, fault-tolerant piece of shared state, the answer is some flavour of RSM. For terabyte-scale data, you shard into many small pieces and run an RSM per shard.
RSM vs eventual consistency
Two extremes anchor the consistency spectrum.
Replicated state machines give you strong consistency — every read sees a state consistent with some prefix of the agreed log. The cost is that every write requires a majority round trip — typically one RTT to the slowest replica in the quorum, plus an fsync on each. Cross-region, that is 50-200 ms per write. And progress requires quorum: lose a majority and the cluster stops accepting writes entirely.
Eventual consistency systems — Dynamo, Cassandra in default mode, Redis with async replication — let any replica accept writes locally, propagate asynchronously, reconcile after the fact. Reads can return stale or conflicting values; the application must reason about what it means for two clients to "both win." The benefit is dramatic: writes are local-disk fast, no quorum bottleneck.
For shared metadata controlling the rest of a system, strong consistency is non-negotiable. For high-volume application data — carts, feeds, telemetry — eventual consistency is usually right. Production architectures routinely combine both: a small RSM cluster for control plane, an eventually-consistent store for bulk data.
A KV store backed by an RSM
You run a five-replica etcd-like KV store. A client sends PUT name=Alice.
- Routing. The request lands on follower R3, which forwards to leader R1.
- Append. R1 appends the operation to its local log at index 5. Entries 1-4 are committed; entry 5 is new and not yet applied.
- Replicate. R1 sends `AppendEntries(index=5, op=PUT name=Alice, prev_index=4)` to R2-R5 in parallel.
- Followers ack. R2, R3, R4 append entry 5, fsync, respond `success`. R5 is cross-region and slow. R1 has four of five acks — past the three-needed majority.
- Commit. R1 advances its committed index to 5. The next heartbeat tells followers `committed=5`.
- Apply on leader. R1 applies entry 5: `state['name'] = 'Alice'`. Returns `OK` to the client.
- Apply on followers. R2, R3, R4 see `committed=5` and apply. All four now agree: `state['name'] == 'Alice'`. R5 catches up later.
A subsequent GET name from R1-R4 returns Alice. (A read from R5 might briefly return the previous value — which is why production systems either route reads through the leader or use a consistency-token mechanism.)
If R1 crashed between step 4 and step 5 — followers acked but not yet learned of the commit — the new leader still sees entry 5 in a majority of logs (R2, R3, R4 have it durably). It re-commits entry 5 in its own term; eventual outcome unchanged. This is the safety property: once a majority holds it, the entry is effectively committed even if the leader dies before announcing.
If R1 crashed between step 2 and step 3 — appended locally but did not replicate — entry 5 is in R1's log alone and the new leader never sees it. R1's stale entry 5 is overwritten when it rejoins. The client times out and retries. Correct behaviour, no corruption.
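The leader's commit decision in this walkthrough — advance the committed index to the highest entry a majority holds — reduces to a counting rule. A sketch, with a hypothetical `match_index` map in the style of Raft's bookkeeping:

```python
def commit_index(match_index, current):
    """Return the highest log index replicated on a majority.
    `match_index` maps replica -> highest index known stored there
    (the leader counts itself)."""
    n = len(match_index)
    for candidate in sorted(match_index.values(), reverse=True):
        if sum(1 for m in match_index.values() if m >= candidate) > n // 2:
            return max(candidate, current)
    return current

# The walkthrough: R1 (leader) and R2-R4 hold entry 5; R5 is still at 4.
print(commit_index({'R1': 5, 'R2': 5, 'R3': 5, 'R4': 5, 'R5': 4}, current=4))  # 5

# The crash-before-replicate case: only R1 holds entry 5 -> no commit.
print(commit_index({'R1': 5, 'R2': 4, 'R3': 4, 'R4': 4, 'R5': 4}, current=4))  # 4
```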
Common confusions
- "RSM means you replicate the state." No. You replicate the operations that produce the state. State is derived locally by applying operations. Replicating state directly (snapshots, page images) is a different model — closer to async log shipping (ch.67) — and does not generalise to multi-leader writes or strong consistency under failures.
- "Any program can be turned into an RSM." Only deterministic ones. A program that consults the wall clock, calls external APIs, or reads local files cannot be replicated by feeding the same operations to each replica. Refactor: pull non-determinism into the leader's pre-processing step, where results become part of the operation payload.
- "Consensus decides which operation wins." No. Consensus decides the order operations apply in. Both Alice's `SET name=Alice` and Bob's `SET name=Bob` get logged; the question is which goes first. The state machine's transition function decides what "winning" means semantically. The consensus layer is agnostic to operation meaning.
- "RSM gives you ACID transactions." Not by itself. RSM gives you linearizability — operations appear to take effect in a single global order. Enough for single-key transactions; not enough for multi-key atomicity (the wall in ch.98). Building ACID on top of an RSM requires additional protocols like two-phase commit or snapshot isolation.
- "Five replicas tolerate four failures." No. Five replicas tolerate two failures. Quorum is `floor(N/2) + 1`. With three of five down, only two are alive, below quorum, and the cluster stops. Odd numbers maximise fault tolerance per replica: 3 tolerate 1, 5 tolerate 2, 7 tolerate 3; even-numbered clusters waste a replica (4 tolerate 1, same as 3, with more cost).
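The quorum arithmetic behind the last confusion, as a two-function sketch:

```python
def quorum(n):
    # Majority quorum: floor(N/2) + 1 replicas must agree.
    return n // 2 + 1

def tolerated_failures(n):
    # The cluster makes progress while at least a quorum survives.
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
# 3 tolerates 1; 4 tolerates 1 (no gain over 3); 5 tolerates 2; 7 tolerates 3
```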
Going deeper
Lamport 1978 — ordering events
Leslie Lamport's Time, Clocks, and the Ordering of Events in a Distributed System (CACM 1978) is the foundational text. It introduces logical clocks and the happens-before relation, and sketches the state machine approach at the end: replicate events in a total order consistent with happens-before, replicas converge. Every consensus paper traces back to this argument.
Schneider 1990 — the canonical RSM formulation
Fred Schneider's Implementing Fault-Tolerant Services Using the State Machine Approach (ACM Computing Surveys 1990) is the textbook treatment. It formalises the model — state, commands, deterministic transitions, agreement and order requirements — and discusses Byzantine variants. The definitions to cite when arguing precisely about RSMs.
Raft and the readability argument
Ongaro and Ousterhout's Raft paper (USENIX ATC 2014) explicitly designs the protocol to be more teachable than Paxos, complete with a user study showing students score higher on Raft comprehension quizzes. Read it after this chapter and ch.100. The official Raft site at https://raft.github.io/ hosts the paper, an interactive visualisation, and a list of production implementations.
Why Paxos still appears in production
Paxos came first; Chubby, Spanner, BookKeeper, and many internal Amazon services run on it, and in aggregate Paxos handles more global traffic than Raft. It persists despite Raft's clarity advantage because institutional codebases migrate slowly and Paxos has well-understood optimisations (Multi-Paxos batching, Fast Paxos one-RTT writes) that took years for Raft variants to match. Greenfield today: Raft. Existing Paxos system: rarely worth migrating.
Where this leads next
The RSM model is the framework. The next two chapters fill in the consensus machinery that makes it work over a real network.
- Leader election and terms — chapter 100. How the cluster picks a single leader, detects a dead one, avoids electing two simultaneously (split vote), and what a "term" is. The election gives you the serial-leader property that simplifies log replication.
- Log replication and commit index — chapter 101. How the leader appends entries, ships them to followers, decides when an entry is committed, and recovers when a follower's log has diverged. Raft's per-entry safety theorem is established here.
After those two, ch.102 covers the failure scenarios — partitions, leader crashes during replication, brain split — the protocol is designed to survive.
The single sentence to carry forward is the one this chapter started with: same starting state, same operations, same order, deterministic transition function — replicas cannot diverge. Every subsequent complication is in service of guaranteeing that one clause about order despite a network that does not want to cooperate.
References
- Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM 21(7), 1978 — the foundational paper introducing logical clocks, happens-before, and the state-machine approach to replication.
- Schneider, Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial, ACM Computing Surveys 22(4), 1990 — the canonical formalisation of the RSM model, including Byzantine variants and a thorough discussion of agreement and order requirements.
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the Raft paper, the most readable consensus paper in the literature and the basis of etcd, Consul, and CockroachDB.
- Lamport, Paxos Made Simple, ACM SIGACT News 32(4), 2001 — Lamport's clean retelling of the original Paxos algorithm; despite the title, it took the community another decade and the Raft paper to truly understand it.
- Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 9 — book-length treatment of consistency, consensus, and the trade-offs between strong and eventual consistency, with practical comparisons across real systems.
- Howard, Schwarzkopf, Madhavapeddy, Crowcroft, Raft Refloated: Do We Have Consensus?, ACM Operating Systems Review 49(1), 2015 — independent re-implementation and analysis of Raft, surfacing edge cases the original paper underspecified and useful for anyone implementing the protocol.