Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Log compaction and snapshots
A Raft replica restarts and spends forty minutes reading its own log file before it can rejoin the cluster. The leader is twelve million entries ahead by the time it finishes, and the operator is left wondering why a "correct" protocol behaves like a slow-motion outage. The answer is that the protocol from chapter 101 never told anyone to throw old log entries away.
Raft's log is append-only and unbounded by default — at 10k writes/sec a five-node cluster has 864 million entries on every replica's disk after one day, and a restart replays them all. The fix is to persist the state machine at some index N, record lastIncludedIndex = N and lastIncludedTerm = log[N].term, then truncate every entry up to N. A follower whose nextIndex falls below the leader's discarded prefix is repaired by an InstallSnapshot RPC instead of AppendEntries.
The unbounded-log problem
Three failure modes appear together as the log grows.
Disk fills. A serialised entry might be 200-300 bytes — a key, a value, term, type, checksum. At 10,000 writes/sec that is around 2.5 MB/sec, 9 GB/hour, 216 GB/day. Per replica. The PaisaBridge-scale write rate that ate ₹50 crore of NVMe in a quarter is not theoretical; it is what every team running etcd at scale has seen.
Replay time grows linearly. When a replica restarts it reads the on-disk log from index 1 and re-applies every entry to rebuild its state machine. Replaying a billion entries at 200,000 entries/sec takes 5,000 seconds — eighty-three minutes of downtime per restart. The replica is "alive" but contributes nothing to quorum during replay, so a restart that lands during a partial outage can drop the cluster below majority and freeze writes.
Bootstrap is hopeless. Adding a new replica — for capacity, for replacing a dead node, for a fresh region — requires shipping the full log to it. Without compaction this is a year of writes over the network and replayed on the new machine, while new entries continue to accumulate. The new replica might never close the gap.
Why these three are the same problem: they all stem from treating the log as the system's source of truth forever. The log is the audit trail of how the state was built. Once the state has been baked into a checkpoint, the prefix of the log that produced it is redundant — the trick is doing the truncation in a way that does not break Raft's consistency check.
Snapshots: the state machine is the source of truth
A snapshot is a serialised image of the state machine at a particular log index, plus two tiny pieces of metadata. That is the entire idea. The state machine, as built in chapter 99, has a deterministic apply function — feed the same log entries into the same initial state and every replica produces the same final state. So at any committed index N, every replica's state is bit-identical, and a serialisation of that state is a valid replacement for replaying entries 1 through N.
The two pieces of metadata are non-negotiable:
lastIncludedIndex— the index of the last log entry whose effect is baked into the snapshot. After installing the snapshot the receiver'scommitIndex,lastApplied, and the log's "previous entry" pointer all advance to this value.lastIncludedTerm— the term of the entry atlastIncludedIndex. Required because the next AppendEntries — at indexlastIncludedIndex + 1— carriesprevLogIndex = lastIncludedIndexand the consistency check needsprevLogTermto compare against. If we threw the entry away without recording its term, the check has nothing to look up and the protocol breaks.
After a snapshot at index N the server discards every log entry with index ≤ N. The log effectively starts at N+1. Conceptually:
Before snapshot at N:
log = [e1, e2, ..., eN, eN+1, eN+2, ...]
state = apply(apply(...apply(initial, e1)...)) # all entries replayed
After snapshot at N:
snapshot = serialise(state_at_N)
lastIncludedIndex = N
lastIncludedTerm = log[N].term
log = [eN+1, eN+2, ...] # prefix discarded
state = deserialise(snapshot) # equivalent to before
Why every replica can snapshot independently: the state machine is deterministic and the log is identical across replicas (Raft's Log Matching property guarantees it). So replica A's snapshot at index N and replica B's snapshot at index N are byte-identical for byte-identical state machines. There is no need for the leader to broadcast a "snapshot now" command — each replica does it on its own schedule when its log gets big enough.
A working implementation
Open your editor. Type this in. The compaction primitive itself is small — the subtleties live in how it interacts with AppendEntries.
# raft_snapshot.py — minimum-viable Raft snapshot for a KV state machine
import json, os, tempfile
class RaftLog:
def __init__(self, path):
self.path = path
self.entries = [] # list of {"index","term","cmd"}
self.last_included_index = 0 # snapshot covers up to here
self.last_included_term = 0
self.state = {} # the KV state machine
self.commit_index = 0
self.last_applied = 0
def append(self, term, cmd):
idx = self.last_log_index() + 1
self.entries.append({"index": idx, "term": term, "cmd": cmd})
return idx
def last_log_index(self):
if self.entries:
return self.entries[-1]["index"]
return self.last_included_index
def term_at(self, index):
# term lookup that respects the snapshot boundary
if index == self.last_included_index:
return self.last_included_term
if index < self.last_included_index:
return None # entry has been compacted away
offset = index - self.last_included_index - 1
return self.entries[offset]["term"]
def apply_committed(self):
while self.last_applied < self.commit_index:
self.last_applied += 1
entry = self.entries[self.last_applied - self.last_included_index - 1]
k, v = entry["cmd"] # ("put", "user:42", "riya") simplified
self.state[k] = v
def take_snapshot(self):
# crash-safe: write to a temp file and rename atomically
snap = {
"last_included_index": self.last_applied,
"last_included_term": self.term_at(self.last_applied),
"state": self.state,
}
tmp = self.path + ".tmp"
with open(tmp, "w") as f:
json.dump(snap, f)
f.flush()
os.fsync(f.fileno())
os.replace(tmp, self.path) # atomic on POSIX
# truncate the log prefix
keep_from = self.last_applied + 1
self.entries = [e for e in self.entries if e["index"] >= keep_from]
self.last_included_index = snap["last_included_index"]
self.last_included_term = snap["last_included_term"]
return snap
Walk it line by line.
self.last_included_index and self.last_included_term. These are persisted alongside the snapshot file. After a snapshot, every method that wants to ask "what term was at index X?" or "where is the previous entry?" first checks whether X is below the snapshot horizon. If it is, the answer comes from the metadata, not the log array.
term_at(index). Three cases. If index == last_included_index, return last_included_term — the metadata stored in the snapshot. If index < last_included_index, return None — the entry is gone, AppendEntries cannot reach it, and the leader must fall back to InstallSnapshot. Otherwise read from the in-memory entries array, offset by where the snapshot ends.
take_snapshot(). Write the state machine plus metadata to a temp file, fsync, then os.replace to swap atomically into the real path. Why the temp-file-then-rename dance: a half-written snapshot file is worse than no snapshot. If the process crashes mid-write, the next startup must find either the old snapshot (perfectly valid) or the new one (perfectly valid) — never a torn file. os.replace is atomic on POSIX, so the swap is all-or-nothing. This is the same pattern Postgres uses for pg_control and SQLite uses for the rollback journal commit.
The truncation step. After writing the new snapshot we drop every log entry with index ≤ last_applied. The entries list now starts at last_applied + 1. Memory frees, and on next persist of the log file, the on-disk log file shrinks too.
What this code is missing for production: the actual on-disk log file rewrite (we kept only the in-memory copy concise), incremental snapshotting that doesn't pause writes during the dump, and copy-on-write of the state machine so the snapshot can be serialised in the background. Real systems do all three; etcd's bbolt-backed state and CockroachDB's RocksDB-backed ranges both snapshot via filesystem copy-on-write so the foreground request loop never stalls.
InstallSnapshot: rescuing the slow follower
Most followers stay close enough to the leader that AppendEntries always carries entries the leader still has. But a follower that crashes for an hour and comes back, or a brand-new replica added by membership change, has a nextIndex that may sit below the leader's last_included_index. The leader cannot serve those entries — it discarded them. This is what InstallSnapshot is for.
The leader detects the situation in one place: when handling a follower's AppendEntries response and computing the next batch, if nextIndex - 1 < last_included_index, it stops trying to send AppendEntries and sends InstallSnapshot instead.
InstallSnapshot RPC (leader → follower):
term : leader's current term
leaderId : so the follower can redirect clients
lastIncludedIndex : the snapshot's metadata
lastIncludedTerm : ditto
offset : byte offset of this chunk in the snapshot file
data : the chunk bytes
done : true on the final chunk
Response:
term : follower's current term (so leader can step down if stale)
Snapshots are big — gigabytes for production state machines — so they ship in chunks. The offset and done fields let the protocol stream over multiple RPCs, with the follower writing each chunk to a temp file and only swapping to the live snapshot when done = true.
When the follower receives the final chunk:
- If
lastIncludedIndex ≤ commitIndex, drop the snapshot — the follower is ahead. (Common when an old snapshot arrives late.) - Otherwise, replace its state machine with the deserialised snapshot, set
commitIndex = lastApplied = lastIncludedIndex, and discard any log entries at or beforelastIncludedIndex. - If the follower had log entries beyond
lastIncludedIndexwhose term matcheslastIncludedTerm, keep them — they are valid. Otherwise discard the entire log; the leader will re-replicate the tail through normal AppendEntries.
A worked example. A Bengaluru fintech runs a five-node etcd cluster for service discovery. Replica D runs out of memory on Monday and is hard-rebooted. By the time D is back, the leader has snapshotted at index 9,500,000 and discarded the log up to there. D's last persisted index is 8,200,000. The leader sends D a sequence of 64 KB InstallSnapshot chunks — total payload around 800 MB — and once done=true arrives, D deserialises the etcd boltdb image, sets its commitIndex and lastApplied to 9,500,000, and starts catching up the live tail (entries 9,500,001 onward) over the network in seconds rather than the hour AppendEntries-only would have taken.
When to take a snapshot
Production Raft implementations take snapshots based on log size, not time, because workloads are bursty. The two common policies:
- Byte threshold. Snapshot when the log file exceeds, say, 64 MB. etcd's default is roughly
--snapshot-count=100000entries, which at typical entry sizes works out to a few tens of megabytes. - Entry-count threshold. Snapshot every
Kentries since the last snapshot. Simpler to reason about; the actual byte size depends on entry distribution but is usually predictable enough.
The cost of a snapshot is the work to serialise the state machine plus an fsync. For a state machine that fits in memory and is small (etcd's typical 100 MB or so), this takes a few hundred milliseconds. For larger states (CockroachDB ranges, TiKV regions), snapshots are taken via RocksDB's checkpoint feature — a hard-link of the SSTable files to a new directory, which is essentially free, with only the WAL tail copied. The point is to make the snapshot cheap enough that you can take one every few minutes without affecting tail latency.
$ etcdctl --endpoints=$ENDPOINTS endpoint status -w table
+----------------+------------------+---------+---------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | RAFT TERM |
+----------------+------------------+---------+---------+-----------+
| 10.0.1.10:2379 | 8e9e05c52164694d | 3.5.9 | 87 MB | 127 |
| 10.0.1.11:2379 | 91bc3c398fb3c146 | 3.5.9 | 87 MB | 127 |
| 10.0.1.12:2379 | fd422379fda50e48 | 3.5.9 | 87 MB | 127 |
+----------------+------------------+---------+---------+-----------+
DB size 87 MB after weeks of traffic, on a cluster taking thousands of writes/sec — that is what compaction buys you. Without it the same column would read in tens or hundreds of GB.
Common confusions
-
"A Raft snapshot is the same as a database backup." It is not. A backup is a point-in-time copy preserved for disaster recovery; a Raft snapshot is an operational artifact whose only purpose is to truncate the log. Backups are kept in cold storage for months; snapshots are overwritten every few minutes. You still need a backup story on top of Raft.
-
"All replicas must snapshot at the same index." They must not. Each replica snapshots independently, when its own log gets big enough. Two replicas can have
lastIncludedIndex = 9,500,000andlastIncludedIndex = 9,520,000simultaneously and the cluster works fine — the only constraint is that the snapshotted index has been committed. -
"InstallSnapshot is what new replicas get when they join." Not necessarily. New replicas typically start by trying to follow from index 0 via AppendEntries; only if the leader's
lastIncludedIndex > 0does the leader fall back to InstallSnapshot. etcd extends this with learners — non-voting replicas that catch up via snapshot before being promoted, so the catch-up traffic doesn't affect quorum. -
"You can snapshot at any committed index." You can snapshot at any applied index, not just any committed one. The state machine has only been advanced through
lastApplied; an entry atcommitIndexthat hasn't been applied yet has no effect baked into the state, so you must wait for it to apply before including it in the snapshot. -
"Snapshots make the log fully redundant." They do not. The active log tail (everything past
lastIncludedIndex) is still the source of truth for committed-but-not-yet-snapshotted entries, and it is what the leader uses to repair followers via AppendEntries. The snapshot replaces only the prefix, not the whole log. -
"Snapshot installation is atomic at the protocol level." It is atomic at the recipient (the temp-file-then-rename swap) but the InstallSnapshot RPC itself streams in chunks, and a leader change mid-stream means the new leader may resend a different snapshot. Followers must tolerate receiving partial chunks they end up throwing away — and must validate the term on every chunk in case the sender is no longer leader.
Going deeper
Why the snapshot must include lastIncludedTerm
Suppose a replica snapshotted at index 200 but only stored lastIncludedIndex = 200 and not the term. A new leader arrives in term 15 and sends AppendEntries(prevLogIndex = 200, prevLogTerm = 12, entries = [...]). The follower needs to compare 12 against the term of its entry at index 200 — but that entry is gone. Without lastIncludedTerm, the follower has no way to answer the consistency check. It would have to either accept blindly (unsafe — the new leader might be from a stale partition) or reject blindly (the cluster grinds to a halt because no AppendEntries can ever pass the boundary). Storing one extra integer per snapshot resolves the entire ambiguity.
Copy-on-write snapshots and serialisation cost
Naive snapshotting freezes the state machine, serialises it, then unfreezes — which stalls the foreground request loop for the duration of the dump. For a 1 GB state machine and 500 MB/sec serialisation, that's a 2-second pause every snapshot interval. Every production system avoids this with copy-on-write:
- etcd snapshots the boltdb file by taking a bbolt transaction, then uses bbolt's MVCC to read the snapshot's view of the b-tree while writers continue committing into newer pages.
- CockroachDB snapshots a range by
LinkSnapshotTo-ing the underlying RocksDB SSTables — the file system creates hard links to the immutable SSTables instantly, and only the active memtable's contents need actual copying. - TiKV does the same with RocksDB checkpoints.
In all three, the snapshot operation completes in milliseconds and the actual byte transfer happens in the background while writers are still appending to the live log.
Snapshots and membership changes
When a new replica is added via joint consensus or single-server change, it has an empty log and no state. The leader's first AppendEntries to it triggers an InstallSnapshot (since nextIndex - 1 = 0 < lastIncludedIndex). For a large state machine, the snapshot transfer can take minutes; during this time the new replica is in the cluster's voting set but cannot vote yes on anything (it has no state). etcd's learner abstraction fixes this: a learner is a non-voting member that receives the snapshot and catches up before being promoted to voter. The promotion is a separate config change once the learner's lastApplied is close to the leader's. Without learners, adding a replica during an outage can drop the cluster below majority because the new replica can vote on elections but holds an empty log.
Snapshot transfer flow control
InstallSnapshot can ship gigabytes. Naively streamed at line rate, it saturates the leader's NIC and starves AppendEntries for the other followers, breaking heartbeats and triggering elections. Production implementations rate-limit: etcd defaults to 100 MB/sec per peer, CockroachDB to a configurable RTT-aware budget. The streaming protocol must also support resumption — if the follower crashes mid-snapshot, the leader should not start over from byte 0. Both are left as engineering details outside the original Raft paper; the paper specifies what must be sent, not how to send it efficiently.
What the original Raft paper says, and what it skips
The Raft paper (Ongaro and Ousterhout, 2014, §7) covers snapshots in about three pages. It specifies the InstallSnapshot RPC fields, the snapshotting algorithm, and the consistency-check extension for lastIncludedIndex/lastIncludedTerm. It explicitly does not cover: snapshot streaming and chunking (only mentioned), copy-on-write serialisation, snapshot retention policy, or how snapshots interact with membership changes. Diego Ongaro's PhD thesis (2014, ch. 5) fills in some of these. Production teams fill in the rest — etcd's snapshot code in mvcc/backend/backend.go and CockroachDB's kv/kvserver/replica_raftstorage.go are the two most readable references for the engineering story.
Where this leads next
- Membership changes (joint consensus) — chapter 103: how new replicas get added, and why they almost always trigger InstallSnapshot before they trigger AppendEntries.
- Paxos, Multi-Paxos, ZAB — what Raft simplified — chapter 105: ZAB has its own snapshot story (ZooKeeper's
snap.*files) that predates Raft's by years. - The replicated state machine abstraction — chapter 99: snapshots only work because the state machine is deterministic; this is where that property is established.
- Multi-Raft sharding — chapter 106: each shard has its own Raft group, its own log, and its own snapshot schedule. CockroachDB runs hundreds of thousands of these per node.
- LSM-tree compaction — back in Build 3: the same word, "compaction", a different mechanism. Worth comparing to Raft's snapshot-based compaction.
References
- Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (USENIX ATC, 2014), §7 Log Compaction — the original specification of Raft snapshots and the InstallSnapshot RPC. raft.github.io/raft.pdf.
- Diego Ongaro, Consensus: Bridging Theory and Practice (Stanford PhD thesis, 2014), Ch. 5 — the long-form treatment, with discussion of streaming, retention, and copy-on-write that the paper omits. github.com/ongardie/dissertation.
- etcd documentation, Snapshotting and DB compaction — production-grade defaults for
--snapshot-count, retention, and learner promotion. etcd.io/docs/v3.5/op-guide/maintenance. - CockroachDB source,
pkg/kv/kvserver/replica_raftstorage.go— the file that implements snapshot creation, transfer chunking, and follower install. github.com/cockroachdb/cockroach. - Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live: An Engineering Perspective (PODC, 2007) — Querion's account of all the gaps between consensus theory and a live system, including the snapshot story they had to invent for Chubby. research.google/pubs/pub33002.
- Log replication and the commit index — chapter 101: where AppendEntries' consistency check is defined and where
lastIncludedIndex/lastIncludedTermplug back in. - Raft membership changes (joint consensus) — chapter 103: snapshots and membership changes are inseparable in production.