In short
Raft's log is append-only and, by default, unbounded. Run a five-node cluster taking 10k writes/sec for a day and you have 864 million entries, hundreds of gigabytes on every replica's disk, and a restart time measured in hours because every replica must replay the log from index 1 to rebuild its state machine. The protocol as written in chapter 101 has no answer for this — that is what log compaction via snapshots is for.
The fix is an old idea: persist the state machine state itself (not the log that produced it), record the index and term of the last log entry baked into that state, and then throw away every log entry up to and including that index. The state machine plus the snapshot metadata is now equivalent to the original log prefix, and the prefix can go.
The metadata pair is critical. The AppendEntries consistency check (chapter 101) needs prevLogTerm for the entry one before the new batch — but if that entry has been compacted away, the follower has nothing to look up. So Raft requires every snapshot to record lastIncludedIndex and lastIncludedTerm, and the consistency check is extended: if prevLogIndex == lastIncludedIndex, the term comparison uses the snapshot's lastIncludedTerm instead of the (now non-existent) log entry. If prevLogIndex < lastIncludedIndex, the entry is part of the snapshot, and the follower accepts the AppendEntries unconditionally because the snapshot already covers that prefix.
Each server snapshots independently — typically when its log exceeds some size or entry-count threshold (etcd defaults to every 100,000 entries; CockroachDB tunes per-range). This is fine because the state machine is deterministic: any replica's snapshot at index N is byte-identical to any other replica's snapshot at index N.
The remaining wrinkle: a slow follower whose nextIndex falls below the leader's lastIncludedIndex cannot be repaired by AppendEntries — the leader has discarded the entries that follower needs. The leader detects this and sends an InstallSnapshot RPC instead, shipping the entire state machine plus metadata. The follower replaces its state, sets its commit and apply indices to lastIncludedIndex, truncates its log, and resumes following from there.
You ran chapter 101's leader-replicates-to-followers loop in production for a week. Throughput was steady at a few thousand entries per second; safety held; elections behaved themselves; the figure-8 anomaly never happened. And then on Tuesday morning a replica restarted, and your dashboard showed it reading the local log file for forty minutes before it accepted its first heartbeat from the leader. By the time it caught up, the leader was 12 million entries ahead and had to replay another batch on top.
This is the unbounded-log problem, and it is what every production Raft system runs into within hours of going live. The protocol from chapters 100-101 is correct for arbitrary log sizes, but operationally it assumes the log is short enough to replay quickly and small enough to fit comfortably on disk. Both assumptions die fast in real workloads.
This chapter does the surgery. By the end you will know why the log must be truncated, what snapshot means in Raft (it is not a database backup, despite the overloaded word), what metadata must accompany the snapshot for the consistency check to keep working, and what happens when a follower has fallen so far behind that the leader's log no longer contains the entries it needs.
The unbounded-log problem
Three concrete failure modes appear as the log grows.
Disk fills. A serialized log entry might be a few hundred bytes (a key, a value, term metadata, a checksum). At 10,000 writes/sec, that is roughly 1-3 MB/sec, 86-260 GB/day, tens of terabytes per year. Per replica. The leader cannot stop accepting writes when the disk fills; it just dies. Long before that, the per-replica disk cost stops being defensible.
Replay time grows linearly. When a replica restarts, it reads the on-disk log from index 1 and re-applies every entry to rebuild the state machine. Replaying a billion entries at 200,000 entries/sec takes 5,000 seconds — eighty-three minutes of downtime per restart. This is the most insidious failure: the replica is "alive" and re-applying, but contributes nothing to quorum until it finishes, so a restart during a partial outage can drop you below majority.
Bootstrap is hopeless. Adding a new replica to an existing cluster — for capacity, replacement of a dead node, or expansion across a region — requires shipping the entire log to it. Without compaction this is shipping a year of writes over the network and replaying them on the new machine. By the time the new replica is caught up, hours of new entries have accumulated; it might never close the gap.
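The arithmetic behind these failure modes is worth making concrete. A quick back-of-the-envelope calculator (the rates and entry sizes are illustrative assumptions, matching the figures above, not measurements of any particular system):

```python
def log_growth(writes_per_sec: float, entry_bytes: int) -> dict:
    """Back-of-the-envelope log growth per replica."""
    entries_per_day = writes_per_sec * 86_400
    bytes_per_day = entries_per_day * entry_bytes
    return {
        "entries_per_day": entries_per_day,
        "gb_per_day": bytes_per_day / 1e9,
        "tb_per_year": bytes_per_day * 365 / 1e12,
    }

def replay_seconds(total_entries: int, apply_rate_per_sec: float) -> float:
    """Restart downtime: re-applying the whole log from index 1."""
    return total_entries / apply_rate_per_sec

growth = log_growth(writes_per_sec=10_000, entry_bytes=300)
assert growth["entries_per_day"] == 864_000_000       # one day at 10k writes/sec
assert abs(growth["gb_per_day"] - 259.2) < 0.01       # ~260 GB/day per replica
assert replay_seconds(1_000_000_000, 200_000) == 5000.0  # ~83 minutes of replay
```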
Why these three are linked: they all stem from treating the log as the system's source of truth forever. The log is the audit trail of how the state was built; once the state has been baked into a checkpoint, the prefix of the log that produced it is redundant. The trick is doing the truncation in a way that doesn't break the consistency check.
The snapshot solution
A snapshot is a serialised image of the state machine at a particular log index, plus two pieces of metadata. That is all it is. The state machine, as we built it in chapter 99, has a deterministic apply function — feed the same log entries into the same initial state and you get the same final state on every replica. So at any given index N, every replica's state is bit-identical, and a serialisation of one replica's state at N is a valid replacement for replaying entries 1 through N on any replica.
The two pieces of metadata are non-negotiable:
- lastIncludedIndex — the index of the last log entry whose effect is baked into this snapshot. After installing the snapshot, the receiver's commitIndex, applyIndex, and the log's "previous entry" pointer all advance to this value.
- lastIncludedTerm — the term of the entry at lastIncludedIndex. Required because the AppendEntries consistency check on the very next entry — at index lastIncludedIndex + 1 — uses prevLogIndex = lastIncludedIndex, prevLogTerm = ?. The ? is what lastIncludedTerm provides. Without it the check has nothing to compare against and the protocol breaks.
After a snapshot at index N, the server discards every log entry with index <= N. The log effectively starts at N+1. Conceptually:
Before snapshot at N:
log = [e1, e2, e3, ..., eN, eN+1, eN+2, ...]
state = apply(apply(...apply(initial, e1)...)) # all entries replayed
After snapshot at N:
snapshot = serialise(state_at_N)
lastIncludedIndex = N
lastIncludedTerm = eN.term
log = [eN+1, eN+2, ...] # prefix discarded
state = deserialise(snapshot) # equivalent to before
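The equivalence claim is easy to demonstrate end to end. A toy key-value state machine (a hypothetical Entry type and apply function, standing in for chapter 99's full state machine) shows that restoring a snapshot at N and replaying only the suffix gives the same state as replaying everything:

```python
import pickle
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    command: tuple  # ("set", key, value)

def apply(state: dict, e: Entry) -> dict:
    """Deterministic apply: same entries, same order, same result."""
    op, k, v = e.command
    if op == "set":
        state[k] = v
    return state

entries = [Entry(term=1, command=("set", f"k{i}", i)) for i in range(1, 11)]

# Replica A: replay entries 1..N, then snapshot at N = 6.
N = 6
state_a = {}
for e in entries[:N]:
    apply(state_a, e)
snapshot = pickle.dumps(state_a)           # state_bytes
meta = (N, entries[N - 1].term)            # (lastIncludedIndex, lastIncludedTerm)

# Replica B: restore from the snapshot, replay only the suffix.
state_b = pickle.loads(snapshot)
for e in entries[N:]:
    apply(state_b, e)

# Replica C: full replay from index 1, no snapshot.
state_c = {}
for e in entries:
    apply(state_c, e)

assert state_b == state_c   # snapshot + suffix is equivalent to the full log
```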
Suppose a follower has snapshotted at index 8 with lastIncludedTerm = 3 and then receives AppendEntries with prevLogIndex=8, prevLogTerm=3 — the follower looks up lastIncludedTerm instead of log[8].term (which no longer exists), the comparison succeeds, and replication continues normally.
Why metadata is mandatory and not optional: the consistency check is the entire correctness argument for log replication. Truncating the log without recording what was at the truncation boundary breaks the inductive base case of Log Matching — the leader thinks "you should have entry 8 with term 3" and the follower has no way to confirm or deny because it discarded entry 8 long ago. By promoting the boundary (index, term) from "the last log entry" to "a piece of named metadata that survives truncation," Raft preserves the invariant across the snapshot operation.
Per-server independent snapshotting
Each replica decides on its own when to snapshot. There is no coordination, no leader-driven trigger, no requirement that snapshots happen at the same index across replicas. The state machine is deterministic, so any replica's snapshot at index N is a valid representation of the cluster's state at N.
The trigger is usually a local condition: log size exceeds a byte threshold, or entry count exceeds a count threshold, or a periodic timer fires. etcd uses an entry-count trigger by default, with --snapshot-count=100000 meaning "take a snapshot every 100,000 log entries past the previous snapshot." CockroachDB snapshots per-range with size-based triggers tuned around the 64 MB range size. TiKV uses both — count and size — with the more aggressive of the two firing.
Why per-server and not cluster-wide: forcing all replicas to snapshot at the same index would either require an extra round of consensus (slow, defeats the purpose) or arbitrary coupling (one slow replica delays everyone's compaction). Independent snapshotting means each replica reclaims disk on its own schedule, and any divergence in compaction frontiers is invisible to the protocol — the leader's nextIndex for a follower is still in the leader's un-truncated range, and that is all that matters for AppendEntries.
The cost is that snapshotting is not free: serialising the state machine state takes CPU and memory, and writing it out competes with normal write traffic for disk bandwidth. Production implementations snapshot in the background, on a copy-on-write fork of the state, so the foreground apply loop is not blocked. RocksDB-backed Raft implementations get this for free via LSM-tree compaction — the snapshot is the current state of the LSM, and compaction is already incremental.
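A combined trigger in the TiKV style — snapshot when either the entry count or the byte size crosses its threshold, whichever fires first — is a few lines. (The thresholds here are illustrative defaults in the spirit of the numbers above, not any system's actual configuration.)

```python
def should_snapshot(entries_since_snap: int,
                    bytes_since_snap: int,
                    count_threshold: int = 100_000,
                    size_threshold: int = 64 * 1024 * 1024) -> bool:
    """Fire when the more aggressive of the two local conditions trips.
    Purely local: no coordination with other replicas is needed."""
    return (entries_since_snap >= count_threshold
            or bytes_since_snap >= size_threshold)

assert not should_snapshot(50_000, 10 * 1024 * 1024)   # neither threshold hit
assert should_snapshot(100_000, 0)                     # count trigger fires
assert should_snapshot(0, 64 * 1024 * 1024)            # size trigger fires
```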
InstallSnapshot — when AppendEntries cannot help
Consider a follower that was offline for a week. It restarts with nextIndex (according to the leader) at index 200,000. The leader, meanwhile, has snapshotted at index 800,000 and discarded everything before that. When the leader tries AppendEntries(prevLogIndex=199999, ...), it cannot — log[199999] no longer exists in the leader's storage. The repair-loop strategy from chapter 101 (decrement nextIndex and retry) would just walk into the same hole every time.
The leader detects this case at send time: nextIndex[follower] <= lastIncludedIndex. Instead of AppendEntries it sends an InstallSnapshot RPC carrying the entire snapshot blob, the metadata pair, and the cluster configuration. The follower:
- Replaces its state machine with the deserialised snapshot.
- Sets commitIndex and applyIndex to lastIncludedIndex.
- Discards any conflicting log entries it might have (anything whose term or index disagrees with the snapshot is overwritten).
- Persists the snapshot and acknowledges.
After the ack, the leader resumes normal AppendEntries from nextIndex = lastIncludedIndex + 1. The follower has fast-forwarded across hundreds of thousands of entries in a single RPC.
In production, InstallSnapshot is chunked: a state machine of a few gigabytes is shipped as a stream of, say, 1-MB chunks, each carrying an offset and a "done" flag. The follower writes chunks to a temp file, and only swaps the file in (and replaces the in-memory state) when the final chunk arrives. This is what etcd's Snapshotter.SaveSnap and the receive-side handlers in the etcd server implement.
Why a separate RPC and not "AppendEntries with a snapshot field": the two RPCs have very different semantics — AppendEntries is per-entry, idempotent, frequent, and small; InstallSnapshot is per-snapshot, replaces state, infrequent, and potentially huge. Coupling them would force the AppendEntries fast path to handle multi-gigabyte payloads. Keeping them separate lets the leader use streaming, chunking, and back-pressure for snapshots without complicating the hot path.
Real Python — take_snapshot, install_snapshot, and the modified consistency check
The state-machine-side helpers and the modified follower handler fit in about a hundred lines together.
import pickle
from dataclasses import dataclass

@dataclass
class Snapshot:
    last_included_index: int
    last_included_term: int
    state_bytes: bytes  # serialised state machine

class Server:
    def __init__(self):
        self.log = []                # log[i] holds the entry at global index log_offset + i + 1
        self.log_offset = 0          # global index of log[0]'s predecessor
        self.commit_index = 0
        self.apply_index = 0
        self.current_term = 0
        self.state = {}              # the state machine
        self.snapshot: Snapshot | None = None
        self.snapshot_threshold = 100_000  # entries

    # --- snapshotting on this server ---

    def maybe_snapshot(self):
        """Take a snapshot if the log has grown past the threshold."""
        if len(self.log) < self.snapshot_threshold:
            return
        # Snapshot up to applyIndex — only entries already applied to state.
        last_idx = self.apply_index
        last_term = self._term_at(last_idx)
        snap = Snapshot(
            last_included_index=last_idx,
            last_included_term=last_term,
            state_bytes=pickle.dumps(self.state),
        )
        self._persist_snapshot(snap)
        # Truncate the log: drop entries with global index <= last_idx.
        keep_from = last_idx - self.log_offset  # local index of first entry kept
        self.log = self.log[keep_from:]
        self.log_offset = last_idx
        self.snapshot = snap

    def _term_at(self, global_index: int) -> int:
        if self.snapshot and global_index == self.snapshot.last_included_index:
            return self.snapshot.last_included_term
        local = global_index - self.log_offset - 1
        if local < 0 or local >= len(self.log):
            raise IndexError(f"index {global_index} not in log or snapshot")
        return self.log[local].term

    def _entry_at(self, global_index: int):
        return self.log[global_index - self.log_offset - 1]

    def _apply(self, command):
        ...  # state-machine-specific apply; see chapter 99

    def _persist_snapshot(self, snap: Snapshot):
        ...  # fsync snapshot bytes + metadata to stable storage

    # --- receiving a snapshot from the leader ---

    def install_snapshot(self, snap: Snapshot, leader_term: int):
        """Apply a snapshot pushed by the leader."""
        if leader_term < self.current_term:
            return False  # stale
        self.current_term = max(self.current_term, leader_term)
        # Replace the state machine.
        self.state = pickle.loads(snap.state_bytes)
        self.snapshot = snap
        # Fast-forward indices. Anything in the snapshot is by definition committed.
        self.commit_index = snap.last_included_index
        self.apply_index = snap.last_included_index
        # Truncate the log. If we happen to hold an entry matching
        # (last_included_index, last_included_term), keep entries past it;
        # otherwise discard the whole log.
        local_match = snap.last_included_index - self.log_offset - 1
        if (0 <= local_match < len(self.log)
                and self.log[local_match].term == snap.last_included_term):
            self.log = self.log[local_match + 1:]
        else:
            self.log = []
        self.log_offset = snap.last_included_index
        self._persist_snapshot(snap)
        return True

    # --- modified AppendEntries handler ---

    def handle_append_entries(self, term, leader_id, prev_log_index,
                              prev_log_term, entries, leader_commit):
        if term < self.current_term:
            return {'term': self.current_term, 'success': False}
        self.current_term = term
        # Case A: prev_log_index falls inside the snapshotted prefix.
        # The snapshot already covers it, so no log lookup is needed.
        if prev_log_index < self.log_offset:
            # Anything the leader is trying to send up to log_offset is redundant
            # (already in our snapshot). Drop those, keep the rest.
            skip = self.log_offset - prev_log_index
            entries = entries[skip:] if skip < len(entries) else []
            prev_log_index = self.log_offset
            prev_log_term = self.snapshot.last_included_term if self.snapshot else 0
        # Case B: prev_log_index == log_offset — boundary, term comes from snapshot.
        if prev_log_index == self.log_offset:
            expected_term = (self.snapshot.last_included_term
                             if self.snapshot else 0)
            if prev_log_term != expected_term:
                return {'term': self.current_term, 'success': False}
        else:
            # Case C: prev_log_index is in the live log; normal check.
            local = prev_log_index - self.log_offset - 1
            if local >= len(self.log):
                return {'term': self.current_term, 'success': False}
            if self.log[local].term != prev_log_term:
                return {'term': self.current_term, 'success': False}
        # Append (with truncate-on-conflict), advance commit, apply. Same as ch. 101.
        for i, e in enumerate(entries):
            idx = prev_log_index + 1 + i
            local = idx - self.log_offset - 1
            if 0 <= local < len(self.log) and self.log[local].term != e.term:
                self.log = self.log[:local]
            if local >= len(self.log):
                self.log.append(e)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, prev_log_index + len(entries))
        while self.apply_index < self.commit_index:
            self.apply_index += 1
            self._apply(self._entry_at(self.apply_index).command)
        self.maybe_snapshot()
        return {'term': self.current_term, 'success': True}
Three things in this code carry safety, not just plumbing:
- maybe_snapshot truncates only up to applyIndex, never beyond. Snapshotting a committed entry that has not been applied yet would mean the snapshot does not actually include its effect — the state bytes would be wrong. Always snapshot up to the apply frontier.
- install_snapshot fast-forwards commitIndex and applyIndex to lastIncludedIndex. Anything in a snapshot the leader sent is, by the leader-completeness property, already committed cluster-wide. The receiver can safely treat those entries as committed without independently re-verifying.
- The Case A handling in handle_append_entries — when the leader sends an AppendEntries whose prevLogIndex is below the receiver's current snapshot. This happens when the leader is repairing a follower that just installed a snapshot but the leader hasn't seen the ack yet. The follower must respond gracefully (skip the redundant prefix and accept the rest) instead of rejecting, otherwise the leader loops forever.
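The three cases can also be exercised in isolation. A stripped-down version of the snapshot-aware consistency check (just the branch logic, with the live log represented as a list of terms — a sketch for testing, not the full handler):

```python
def check_prev(prev_log_index, prev_log_term,
               log_offset, snap_term, log_terms):
    """Return (ok, effective_prev_index). log_terms[i] is the term of
    the entry at global index log_offset + i + 1."""
    if prev_log_index < log_offset:
        # Case A: inside the snapshotted prefix -- already covered, accept
        # from the boundary onward.
        return True, log_offset
    if prev_log_index == log_offset:
        # Case B: boundary -- compare against snapshot metadata.
        return prev_log_term == snap_term, prev_log_index
    # Case C: live log -- the normal chapter-101 check.
    local = prev_log_index - log_offset - 1
    if local >= len(log_terms):
        return False, prev_log_index
    return log_terms[local] == prev_log_term, prev_log_index

# Follower snapshotted at index 8 (lastIncludedTerm = 3); live log holds 9-10.
ok, _ = check_prev(8, 3, log_offset=8, snap_term=3, log_terms=[3, 4])
assert ok                        # boundary check uses lastIncludedTerm
ok, _ = check_prev(8, 2, 8, 3, [3, 4])
assert not ok                    # wrong term at the boundary: reject
ok, idx = check_prev(5, 1, 8, 3, [3, 4])
assert ok and idx == 8           # prefix already covered by the snapshot
ok, _ = check_prev(10, 4, 8, 3, [3, 4])
assert ok                        # normal live-log check still works
```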
Worked example — a million-entry cluster, one slow follower
A five-node cluster snapshots, and a follower at 200000 catches up via InstallSnapshot
Setup: five-node Raft cluster running a key-value state machine. Workload: a write/sec rate that has produced 1,000,000 log entries over a week. All nodes have commitIndex = 1000000 except one — node F4 was offline due to a regional network outage from index 200,000 onwards and just came back. Each node's snapshot_threshold = 200_000.
Initial state at the moment F4 rejoins.
| Node | log_offset | log length | commitIndex | snapshot? |
|---|---|---|---|---|
| L (leader) | 800000 | 200000 (entries 800001-1000000) | 1000000 | yes, at 800000 |
| F1 | 800000 | 200000 | 1000000 | yes, at 800000 |
| F2 | 800000 | 200000 | 1000000 | yes, at 800000 |
| F3 | 600000 | 400000 (entries 600001-1000000) | 1000000 | yes, at 600000 |
| F4 | 0 | 200000 (entries 1-200000) | 200000 | no (never snapshotted) |
Each node's snapshot was triggered independently when its log grew past 200k entries. F3 ran a more conservative threshold and snapshotted later. F4 was offline through every threshold crossing.
t=0. F4 rejoins. The leader's bookkeeping shows nextIndex[F4] = 1000001 (initialized optimistically on election win), matchIndex[F4] = 0.
t=10ms. L sends a heartbeat: AppendEntries(prevLogIndex=1000000, prevLogTerm=11, entries=[], leaderCommit=1000000). F4 receives, looks up its log at index 1000000 — does not exist (F4's log only goes to 200000). F4 responds success=false.
t=12ms. L decrements nextIndex[F4] to 1000000 and retries. Same failure. After several rounds of decrement, nextIndex[F4] walks down from 1000000 to 800001.
t=300ms (after several decrement cycles). L tries AppendEntries(prevLogIndex=800000, ...) — but L's log starts at 800001 (its own log_offset = 800000). To answer "what's the term at 800000?" L looks up its snapshot's lastIncludedTerm. The RPC carries prevLogTerm=7 (from L's snapshot metadata). It still fails on F4 because F4's log only goes to 200000.
t=305ms. L decrements once more. Now nextIndex[F4] = 800000 = L.log_offset, which means the next decrement would go below L's snapshotted prefix — into territory L no longer holds. L's _send_append detects this: nextIndex[F4] <= self.log_offset. Time to switch protocols.
t=306ms. L sends InstallSnapshot(term=12, lastIncludedIndex=800000, lastIncludedTerm=7, data=<120 MB of state>). The transmission is chunked into 120 RPCs of 1 MB each; at a 50 MB/s effective link the transfer takes about 2.4 seconds.
t=2.7s. Final chunk arrives. F4 deserializes the state, sets state = <restored>, commitIndex = 800000, applyIndex = 800000, log_offset = 800000, log = []. Persists everything. Responds success=true.
t=2.7s. L receives the InstallSnapshot ack. Updates matchIndex[F4] = 800000, nextIndex[F4] = 800001. Resumes normal AppendEntries.
t=2.8s onwards. L sends AppendEntries(prevLogIndex=800000, prevLogTerm=7, entries=[800001..800050], leaderCommit=1000000). F4's modified handler sees prevLogIndex == log_offset, looks up snapshot.last_included_term = 7, comparison succeeds, appends the batch. The follower's commit index advances as new AppendEntries carry the leader's commit. Within a few hundred more RPCs (and a second or two with batching), F4 has caught up to index 1000000.
Outcome. F4 went from being 800,000 entries behind to fully caught up in a few seconds, transferring ~120 MB instead of replaying ~800,000 individual log entries. The slow path (per-entry AppendEntries) would have taken minutes at 5,000 entries/sec; InstallSnapshot collapses it to one bulk transfer plus a small tail.
What if F3 (whose log_offset = 600000, more conservative than the leader's 800000) becomes leader after a leader churn? F3's log covers 600001-1000000 — strictly more than the old leader's 800001-1000000. F3 can serve an AppendEntries to a follower at nextIndex=700000 directly, no InstallSnapshot needed. The asymmetry between replicas' compaction frontiers is benign: the leader uses its own log_offset to decide whether to switch protocols.
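The leader-side decision in this timeline — switch to InstallSnapshot once nextIndex reaches the compaction frontier — is a single comparison. A sketch of the check inside a hypothetical _send_append (illustrative, mirroring the worked example's numbers, not etcd's code):

```python
def choose_rpc(next_index: int, leader_log_offset: int) -> str:
    """AppendEntries needs prevLogIndex = next_index - 1 to be answerable.
    The boundary (prev == log_offset) is answerable from snapshot metadata;
    anything below it is gone from the leader's log, so ship state instead."""
    if next_index <= leader_log_offset:
        return "InstallSnapshot"
    return "AppendEntries"

# The worked example: L.log_offset = 800000.
assert choose_rpc(1_000_001, 800_000) == "AppendEntries"    # t=0 heartbeat
assert choose_rpc(800_001, 800_000) == "AppendEntries"      # prev term from snapshot
assert choose_rpc(800_000, 800_000) == "InstallSnapshot"    # the t=305ms switch
```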
Real-world: thresholds, ranges, and the LSM connection
etcd ships with --snapshot-count=100000. At a typical control-plane rate of ~1k writes/sec, that triggers roughly every 100 seconds — frequent enough that the on-disk log stays under a few hundred MB, infrequent enough that snapshot CPU cost is negligible. Operators tune this up (to 10M for high-throughput clusters where snapshot CPU is the bottleneck) or down (to 10k for low-RAM nodes that need to keep state in memory). The implementation lives in etcd's Snapshotter, and the WAL package handles the corresponding log-file rotation.
CockroachDB snapshots per-range (a range is ~64 MB of keyspace replicated by its own Raft group). With thousands of ranges per node, snapshot scheduling has its own throttling layer — too many simultaneous range snapshots overwhelm disk. The CockroachDB replication-layer documentation describes the snapshot queue, the bandwidth limiter, and the protocol for a "preemptive snapshot" sent before a new replica starts following.
The LSM-tree connection. Most production Raft state machines sit on top of an LSM-tree storage engine — RocksDB underneath etcd, Pebble underneath CockroachDB, Badger underneath Dgraph. The LSM is already a compaction system: writes go to a memtable, flush to immutable SSTables, and background compaction merges SSTables into larger files while discarding overwritten or deleted keys. A Raft snapshot in this world is essentially "the current state of the LSM at index N." Some implementations (CockroachDB, TiKV) take this literally: the snapshot is a hardlinked copy of the SSTable files plus a metadata blob, sent to the receiver as a stream of files rather than a serialized blob. This is the approach in RocksDB's Checkpoint API — a hardlink-based snapshot that costs O(1) regardless of LSM size, then incremental file shipping.
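The hardlink trick itself is plain POSIX and can be sketched without RocksDB. A toy checkpoint over fake SSTable files (real engines also hardlink manifest/options files and flush the memtable first; this only illustrates why the operation is O(number of files), not O(data)):

```python
import os
import tempfile

def checkpoint(db_dir: str, ckpt_dir: str) -> None:
    """Hardlink every immutable file into a checkpoint directory.
    No bytes are copied -- both names point at the same inode."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for name in os.listdir(db_dir):
        os.link(os.path.join(db_dir, name), os.path.join(ckpt_dir, name))

# Demo: two fake 1 KB SSTables, checkpointed without copying data.
root = tempfile.mkdtemp()
db = os.path.join(root, "db")
os.makedirs(db)
for i in range(2):
    with open(os.path.join(db, f"{i:06}.sst"), "wb") as f:
        f.write(b"x" * 1024)

checkpoint(db, os.path.join(root, "ckpt"))
assert sorted(os.listdir(os.path.join(root, "ckpt"))) == ["000000.sst", "000001.sst"]
assert os.stat(os.path.join(db, "000000.sst")).st_nlink == 2  # shared inode, no copy
```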
The deeper observation is that log compaction in Raft and SSTable compaction in LSM are the same idea at two different layers. Both turn an append-only history into a compact representation of the current state. Both run in the background. Both must coordinate with the live read/write path. Production systems that get this right (CockroachDB, TiKV) gain enormous efficiency by treating the two as a single integrated compaction pipeline.
Common confusions
- "A snapshot is a database backup." No. A snapshot is the in-memory or in-storage state of a single replica's state machine at a particular log index, plus two integers. A backup is typically a periodic external copy for disaster recovery; it includes additional metadata, may be encrypted, and is kept off-cluster. The two have nothing in common except the word.
- "Snapshots must be taken at the same index across all replicas." No, and they shouldn't be. Coordinating snapshots would require a consensus round per snapshot, killing the point. Independent per-replica snapshotting is correct because the state machine is deterministic.
- "InstallSnapshot is only for new replicas." No. It is for any replica whose nextIndex falls below the leader's log_offset. New replicas trigger it (their nextIndex starts low), but so do replicas returning from extended outages or replicas that fell behind during a leadership change.
- "After installing a snapshot, the follower starts a fresh election." No. The InstallSnapshot is sent by the current leader; receiving it is just an extreme form of catch-up. The follower's term and leader-recognition do not change as a result.
- "lastIncludedTerm can be derived from lastIncludedIndex plus the log." Only if the log entry at that index still exists — which after compaction, it doesn't. That is the entire reason lastIncludedTerm is stored as separate metadata.
- "Snapshotting blocks writes." Naively yes (you have to stop applying new entries while you serialize state), but production implementations use copy-on-write or LSM hardlinks to make snapshotting concurrent with normal traffic. The apply loop continues; the snapshot reflects state at the point in time where the snapshot started.
Going deeper
Streaming snapshots and back-pressure
A snapshot of a 10 GB state machine cannot be sent in one RPC — most RPC frameworks cap message size at 16-64 MB, and in-memory buffering of multi-GB payloads is a non-starter. Production InstallSnapshot is a stream: the leader breaks the snapshot into chunks (typically 1-4 MB), sends them in order, and the follower writes each chunk to a temp file. The final chunk carries a done flag that triggers atomic rename and state replacement.
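A minimal receiver for such a stream — buffer chunks into a temp file, swap in atomically only when the done flag arrives — can be sketched in a few lines (a sketch of the pattern, not etcd's actual handler; the 1 MB chunk size is an assumption at the small end of the typical range):

```python
import os
import tempfile

CHUNK = 1 * 1024 * 1024  # 1 MB chunks (typical range is 1-4 MB)

class SnapshotReceiver:
    """Writes incoming chunks to a temp file; on the final chunk, fsyncs
    and atomically renames so a crash never exposes a partial snapshot."""

    def __init__(self, final_path: str):
        self.final_path = final_path
        fd, self.tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path))
        self.tmp = os.fdopen(fd, "wb")

    def receive(self, offset: int, data: bytes, done: bool) -> bool:
        self.tmp.seek(offset)
        self.tmp.write(data)
        if done:
            self.tmp.flush()
            os.fsync(self.tmp.fileno())            # durable before visible
            self.tmp.close()
            os.rename(self.tmp_path, self.final_path)  # atomic swap-in
        return done

# Ship a 2.5 MB blob as three chunks.
blob = os.urandom(2 * CHUNK + CHUNK // 2)
dest = os.path.join(tempfile.mkdtemp(), "state.snap")
rx = SnapshotReceiver(dest)
for off in range(0, len(blob), CHUNK):
    rx.receive(off, blob[off:off + CHUNK], done=off + CHUNK >= len(blob))

assert open(dest, "rb").read() == blob   # reassembled byte-for-byte
```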
Back-pressure matters because a slow follower can otherwise tie up the leader's outbound bandwidth. etcd's snapshot sender uses a per-follower rate limiter; CockroachDB has a dedicated "Raft snapshot" priority class with throttle controls. Without it, a snapshot to one slow follower can starve normal traffic to the others.
Snapshot consistency under concurrent writes
If the state machine accepts new applies while the snapshot is being serialized, you have a consistency problem: which of those applies are reflected in the snapshot? Raft's answer is to snapshot a point-in-time state. Implementation strategies:
- Stop-the-world. Pause applies, serialize, resume. Simple, but introduces a latency spike. Acceptable for small state machines (a few MB).
- Copy-on-write fork. Use OS-level COW (Linux fork() plus a child process serializing) to get a stable view without pausing. The downside is memory doubling during the fork.
- LSM hardlink snapshot. RocksDB and Pebble both expose a Checkpoint API that hardlinks the current SSTables into a new directory in O(1). The Raft snapshot is a tarball of that directory. No pause, no memory doubling.
The third is the production standard.
Joint consensus and snapshots
A snapshot must include the cluster configuration as it was at lastIncludedIndex, not the current configuration. Otherwise a replica installing an old snapshot would forget about a membership change that happened between the snapshot point and now. The InstallSnapshot RPC carries this configuration as a separate field, and the receiver applies it before fast-forwarding its commit index. Section 7.2 of the Ongaro thesis covers this carefully.
Snapshot-driven log truncation vs. WAL rotation
There are two related but distinct things being pruned: the in-memory log (the log[] array in our Python code) and the on-disk write-ahead-log files. Snapshotting truncates the in-memory log immediately. The on-disk WAL is rotated separately — once a snapshot has been persisted and confirmed, WAL segments containing only entries below lastIncludedIndex can be unlinked. etcd's WAL package does this in a background goroutine. Production-grade systems decouple the two so that a slow snapshot (gigabytes serializing to disk) does not delay WAL space reclamation.
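WAL rotation then reduces to "unlink any segment whose highest index is at or below the snapshot frontier." A sketch over an in-memory segment list (the Segment record and field names are hypothetical, not etcd's wal package types):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    first_index: int   # global index of the segment's first entry
    last_index: int    # global index of the segment's last entry

def prunable(segments: list[Segment], last_included_index: int) -> list[Segment]:
    """Segments containing only entries <= lastIncludedIndex are dead weight
    once the snapshot is durable. A segment straddling the frontier must be
    kept whole -- it still holds live entries."""
    return [s for s in segments if s.last_index <= last_included_index]

wal = [Segment("wal-0001", 1, 100_000),
       Segment("wal-0002", 100_001, 200_000),
       Segment("wal-0003", 200_001, 300_000)]

dead = prunable(wal, last_included_index=150_000)
assert [s.path for s in dead] == ["wal-0001"]   # wal-0002 straddles 150000: keep it
```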
Where this leads next
With log compaction in place, the Raft cluster as we have built it across chapters 100-104 is operationally complete: it elects leaders quickly, replicates and commits with O(1) consistency checks, recovers from arbitrary partitions and restarts, and keeps disk and replay times bounded as the cluster runs forever. The protocol pieces are all in place.
What we have not yet done is take a hard look at failure scenarios. Chapter 105 walks through partitions, leader churn during snapshotting, message reordering, and the precise interleavings that would break a less careful design. The five Raft safety invariants (leader completeness, log matching, election restriction, state machine safety, leader append-only) are each defended by specific mechanisms across chapters 100-104; chapter 105 demonstrates the defense in adversarial scenarios. After that, Build 14 takes the consensus log we now have and uses it as a primitive for cross-shard atomic commit.
The two sentences to carry forward: a snapshot replaces a log prefix with the state machine state plus the two-integer metadata that keeps the consistency check working. And: InstallSnapshot exists because a leader cannot send AppendEntries for entries it has discarded, so the leader ships state directly when the follower's nextIndex falls below the snapshot frontier. The first preserves the protocol's correctness across compaction; the second preserves its operational completeness when followers fall arbitrarily far behind.
References
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — Section 7 of the Raft paper covers log compaction, snapshotting, and InstallSnapshot. The metadata pair, the modified consistency check, and the fall-back-to-snapshot rule all come from this section.
- Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD dissertation, 2014 — Chapter 5 expands the snapshot treatment with copy-on-write strategies, configuration handling under snapshots, and the analysis of concurrent applies during serialisation.
- etcd, Snapshotter and snapshot configuration — production Go implementation of the snapshot pipeline, including the --snapshot-count flag (default 100000), WAL rotation interaction, and the streaming InstallSnapshot receiver.
- Facebook RocksDB project, Checkpoints and Compaction and Checkpoint API wiki — LSM-tree compaction is the storage-layer analogue of Raft snapshotting; the Checkpoint API is the hardlink-based primitive most production Raft systems use to take a snapshot in O(1).
- Cockroach Labs, Replication layer architecture and the raftpb package documentation — per-range snapshot scheduling, the snapshot queue, bandwidth limiting, and the preemptive-snapshot pattern for new replicas.
- Howard, Schwarzkopf, Madhavapeddy, Crowcroft, Raft Refloated: Do We Have Consensus?, ACM Operating Systems Review 49(1), 2015 — independent reimplementation surfacing edge cases in the snapshot protocol, including the AppendEntries-after-snapshot-install race condition the original paper underspecifies.