Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Log compaction and snapshots

A Raft replica restarts and spends forty minutes reading its own log file before it can rejoin the cluster. The leader is twelve million entries ahead by the time it finishes, and the operator is left wondering why a "correct" protocol behaves like a slow-motion outage. The answer is that the protocol from chapter 101 never told anyone to throw old log entries away.

Raft's log is append-only and unbounded by default — at 10k writes/sec a five-node cluster has 864 million entries on every replica's disk after one day, and a restart replays them all. The fix is to persist the state machine at some index N, record lastIncludedIndex = N and lastIncludedTerm = log[N].term, then truncate every entry up to N. A follower whose nextIndex falls below the leader's discarded prefix is repaired by an InstallSnapshot RPC instead of AppendEntries.

The unbounded-log problem

Three failure modes appear together as the log grows.

Disk fills. A serialised entry might be 200-300 bytes — a key, a value, term, type, checksum. At 10,000 writes/sec that is around 2.5 MB/sec, 9 GB/hour, 216 GB/day. Per replica. The PaisaBridge-scale write rate that ate ₹50 crore of NVMe in a quarter is not theoretical; it is what every team running etcd at scale has seen.

Replay time grows linearly. When a replica restarts it reads the on-disk log from index 1 and re-applies every entry to rebuild its state machine. Replaying a billion entries at 200,000 entries/sec takes 5,000 seconds — eighty-three minutes of downtime per restart. The replica is "alive" but contributes nothing to quorum during replay, so a restart that lands during a partial outage can drop the cluster below majority and freeze writes.

Bootstrap is hopeless. Adding a new replica — for capacity, for replacing a dead node, for a fresh region — requires shipping the full log to it. Without compaction this is a year of writes over the network and replayed on the new machine, while new entries continue to accumulate. The new replica might never close the gap.

Raft log growing without bound across daysHorizontal bars showing log size and restart-replay time for a five-node Raft cluster at 10k writes/sec. Day 1 small bar, day 7 medium bar, day 30 long bar, day 90 bar running off the diagram into a red disk-full region. Five-node cluster, 10k writes/sec — log size and restart-replay time day 1 864M entries, 250 MB log, replay ~30s day 7 6.0B entries, 1.7 GB, replay ~3 min day 30 26B entries, 7.5 GB, replay ~13 min day 90 78B entries, 22 GB, replay ~40 min disk-full territory (1 TB / replica) all five replicas pay this storage; bootstrap pays it on the network
Without compaction, on-disk log size grows linearly with calendar time and restart-replay time grows linearly with log size. Within a quarter, both are untenable.

Why these three are the same problem: they all stem from treating the log as the system's source of truth forever. The log is the audit trail of how the state was built. Once the state has been baked into a checkpoint, the prefix of the log that produced it is redundant — the trick is doing the truncation in a way that does not break Raft's consistency check.

Snapshots: the state machine is the source of truth

A snapshot is a serialised image of the state machine at a particular log index, plus two tiny pieces of metadata. That is the entire idea. The state machine, as built in chapter 99, has a deterministic apply function — feed the same log entries into the same initial state and every replica produces the same final state. So at any committed index N, every replica's state is bit-identical, and a serialisation of that state is a valid replacement for replaying entries 1 through N.

The two pieces of metadata are non-negotiable:

  • lastIncludedIndex — the index of the last log entry whose effect is baked into the snapshot. After installing the snapshot the receiver's commitIndex, lastApplied, and the log's "previous entry" pointer all advance to this value.
  • lastIncludedTerm — the term of the entry at lastIncludedIndex. Required because the next AppendEntries — at index lastIncludedIndex + 1 — carries prevLogIndex = lastIncludedIndex and the consistency check needs prevLogTerm to compare against. If we threw the entry away without recording its term, the check has nothing to look up and the protocol breaks.

After a snapshot at index N the server discards every log entry with index ≤ N. The log effectively starts at N+1. Conceptually:

Before snapshot at N:
  log = [e1, e2, ..., eN, eN+1, eN+2, ...]
  state = apply(apply(...apply(initial, e1)...))    # all entries replayed

After snapshot at N:
  snapshot = serialise(state_at_N)
  lastIncludedIndex = N
  lastIncludedTerm  = log[N].term
  log = [eN+1, eN+2, ...]                           # prefix discarded
  state = deserialise(snapshot)                     # equivalent to before

Why every replica can snapshot independently: the state machine is deterministic and the log is identical across replicas (Raft's Log Matching property guarantees it). So replica A's snapshot at index N and replica B's snapshot at index N are byte-identical for byte-identical state machines. There is no need for the leader to broadcast a "snapshot now" command — each replica does it on its own schedule when its log gets big enough.

Snapshot truncates the log prefixTop row: a long log of entries 1 through 200 with state machine on the right. Bottom row: a snapshot box covering entries 1-150, lastIncludedIndex 150 and lastIncludedTerm 7 metadata, and a shortened log starting at index 151. The state machine on the right is unchanged. Before and after snapshot at lastIncludedIndex = 150 Before: log[1] log[2] log[3] ... log[149] log[150] log[151] ... log[200] state machine After snapshot: snapshot (state at index 150) lastIncludedIndex=150 · lastIncludedTerm=7 log[151] ... log[200] state machine Old log[1]..log[150] discarded. State unchanged. Log now indexed from 151. AppendEntries with prevLogIndex=150 → consistency check uses lastIncludedTerm=7 instead of log[150].term.
The snapshot replaces the log prefix. State is unchanged; only the bookkeeping for "what was the term of the entry just before the new ones?" moves from the discarded log into the two-field metadata header.

A working implementation

Open your editor. Type this in. The compaction primitive itself is small — the subtleties live in how it interacts with AppendEntries.

# raft_snapshot.py — minimum-viable Raft snapshot for a KV state machine
import json, os, tempfile

class RaftLog:
    def __init__(self, path):
        self.path = path
        self.entries = []                    # list of {"index","term","cmd"}
        self.last_included_index = 0         # snapshot covers up to here
        self.last_included_term  = 0
        self.state = {}                      # the KV state machine
        self.commit_index = 0
        self.last_applied = 0

    def append(self, term, cmd):
        idx = self.last_log_index() + 1
        self.entries.append({"index": idx, "term": term, "cmd": cmd})
        return idx

    def last_log_index(self):
        if self.entries:
            return self.entries[-1]["index"]
        return self.last_included_index

    def term_at(self, index):
        # term lookup that respects the snapshot boundary
        if index == self.last_included_index:
            return self.last_included_term
        if index < self.last_included_index:
            return None                      # entry has been compacted away
        offset = index - self.last_included_index - 1
        return self.entries[offset]["term"]

    def apply_committed(self):
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.entries[self.last_applied - self.last_included_index - 1]
            k, v = entry["cmd"]              # ("put", "user:42", "riya") simplified
            self.state[k] = v

    def take_snapshot(self):
        # crash-safe: write to a temp file and rename atomically
        snap = {
            "last_included_index": self.last_applied,
            "last_included_term":  self.term_at(self.last_applied),
            "state": self.state,
        }
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snap, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)           # atomic on POSIX
        # truncate the log prefix
        keep_from = self.last_applied + 1
        self.entries = [e for e in self.entries if e["index"] >= keep_from]
        self.last_included_index = snap["last_included_index"]
        self.last_included_term  = snap["last_included_term"]
        return snap

Walk it line by line.

self.last_included_index and self.last_included_term. These are persisted alongside the snapshot file. After a snapshot, every method that wants to ask "what term was at index X?" or "where is the previous entry?" first checks whether X is below the snapshot horizon. If it is, the answer comes from the metadata, not the log array.

term_at(index). Three cases. If index == last_included_index, return last_included_term — the metadata stored in the snapshot. If index < last_included_index, return None — the entry is gone, AppendEntries cannot reach it, and the leader must fall back to InstallSnapshot. Otherwise read from the in-memory entries array, offset by where the snapshot ends.

take_snapshot(). Write the state machine plus metadata to a temp file, fsync, then os.replace to swap atomically into the real path. Why the temp-file-then-rename dance: a half-written snapshot file is worse than no snapshot. If the process crashes mid-write, the next startup must find either the old snapshot (perfectly valid) or the new one (perfectly valid) — never a torn file. os.replace is atomic on POSIX, so the swap is all-or-nothing. This is the same pattern Postgres uses for pg_control and SQLite uses for the rollback journal commit.

The truncation step. After writing the new snapshot we drop every log entry with index ≤ last_applied. The entries list now starts at last_applied + 1. Memory frees, and on next persist of the log file, the on-disk log file shrinks too.

What this code is missing for production: the actual on-disk log file rewrite (we kept only the in-memory copy concise), incremental snapshotting that doesn't pause writes during the dump, and copy-on-write of the state machine so the snapshot can be serialised in the background. Real systems do all three; etcd's bbolt-backed state and CockroachDB's RocksDB-backed ranges both snapshot via filesystem copy-on-write so the foreground request loop never stalls.

InstallSnapshot: rescuing the slow follower

Most followers stay close enough to the leader that AppendEntries always carries entries the leader still has. But a follower that crashes for an hour and comes back, or a brand-new replica added by membership change, has a nextIndex that may sit below the leader's last_included_index. The leader cannot serve those entries — it discarded them. This is what InstallSnapshot is for.

The leader detects the situation in one place: when handling a follower's AppendEntries response and computing the next batch, if nextIndex - 1 < last_included_index, it stops trying to send AppendEntries and sends InstallSnapshot instead.

InstallSnapshot RPC (leader → follower):
  term                  : leader's current term
  leaderId              : so the follower can redirect clients
  lastIncludedIndex     : the snapshot's metadata
  lastIncludedTerm      : ditto
  offset                : byte offset of this chunk in the snapshot file
  data                  : the chunk bytes
  done                  : true on the final chunk

Response:
  term                  : follower's current term (so leader can step down if stale)

Snapshots are big — gigabytes for production state machines — so they ship in chunks. The offset and done fields let the protocol stream over multiple RPCs, with the follower writing each chunk to a temp file and only swapping to the live snapshot when done = true.

When the follower receives the final chunk:

  1. If lastIncludedIndex ≤ commitIndex, drop the snapshot — the follower is ahead. (Common when an old snapshot arrives late.)
  2. Otherwise, replace its state machine with the deserialised snapshot, set commitIndex = lastApplied = lastIncludedIndex, and discard any log entries at or before lastIncludedIndex.
  3. If the follower had log entries beyond lastIncludedIndex whose term matches lastIncludedTerm, keep them — they are valid. Otherwise discard the entire log; the leader will re-replicate the tail through normal AppendEntries.
InstallSnapshot rescues a follower that fell behind the prefixLeader on the left with a long log starting at index 200, snapshot 1-200. Follower on the right with stale log ending at index 150. Arrow shows leader sending InstallSnapshot chunks; follower replaces state and resumes following from 201. Follower at index 150, leader's snapshot covers 1–200 — AppendEntries can't reach it Leader snapshot[1..200] log[201..300] Follower (stale) log[1..150] nextIndex = 151 leader has thrown 151..200 away InstallSnapshot(offset=0..N, lastIncludedIndex=200, lastIncludedTerm=12, data=...) Follower (after install) snapshot[1..200] commitIndex = lastApplied = 200 log truncated; nextIndex = 201 resume normal AppendEntries from 201
The leader detects that the follower's `nextIndex` falls below `lastIncludedIndex` and switches from AppendEntries to InstallSnapshot. The follower replaces its state in one shot, advances commit and apply indices, truncates its log, and rejoins the normal replication path at index 201.

A worked example. A Bengaluru fintech runs a five-node etcd cluster for service discovery. Replica D runs out of memory on Monday and is hard-rebooted. By the time D is back, the leader has snapshotted at index 9,500,000 and discarded the log up to there. D's last persisted index is 8,200,000. The leader sends D a sequence of 64 KB InstallSnapshot chunks — total payload around 800 MB — and once done=true arrives, D deserialises the etcd boltdb image, sets its commitIndex and lastApplied to 9,500,000, and starts catching up the live tail (entries 9,500,001 onward) over the network in seconds rather than the hour AppendEntries-only would have taken.

When to take a snapshot

Production Raft implementations take snapshots based on log size, not time, because workloads are bursty. The two common policies:

  • Byte threshold. Snapshot when the log file exceeds, say, 64 MB. etcd's default is roughly --snapshot-count=100000 entries, which at typical entry sizes works out to a few tens of megabytes.
  • Entry-count threshold. Snapshot every K entries since the last snapshot. Simpler to reason about; the actual byte size depends on entry distribution but is usually predictable enough.

The cost of a snapshot is the work to serialise the state machine plus an fsync. For a state machine that fits in memory and is small (etcd's typical 100 MB or so), this takes a few hundred milliseconds. For larger states (CockroachDB ranges, TiKV regions), snapshots are taken via RocksDB's checkpoint feature — a hard-link of the SSTable files to a new directory, which is essentially free, with only the WAL tail copied. The point is to make the snapshot cheap enough that you can take one every few minutes without affecting tail latency.

$ etcdctl --endpoints=$ENDPOINTS endpoint status -w table
+----------------+------------------+---------+---------+-----------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | RAFT TERM |
+----------------+------------------+---------+---------+-----------+
| 10.0.1.10:2379 | 8e9e05c52164694d |  3.5.9  |  87 MB  |       127 |
| 10.0.1.11:2379 | 91bc3c398fb3c146 |  3.5.9  |  87 MB  |       127 |
| 10.0.1.12:2379 | fd422379fda50e48 |  3.5.9  |  87 MB  |       127 |
+----------------+------------------+---------+---------+-----------+

DB size 87 MB after weeks of traffic, on a cluster taking thousands of writes/sec — that is what compaction buys you. Without it the same column would read in tens or hundreds of GB.

Common confusions

  • "A Raft snapshot is the same as a database backup." It is not. A backup is a point-in-time copy preserved for disaster recovery; a Raft snapshot is an operational artifact whose only purpose is to truncate the log. Backups are kept in cold storage for months; snapshots are overwritten every few minutes. You still need a backup story on top of Raft.

  • "All replicas must snapshot at the same index." They must not. Each replica snapshots independently, when its own log gets big enough. Two replicas can have lastIncludedIndex = 9,500,000 and lastIncludedIndex = 9,520,000 simultaneously and the cluster works fine — the only constraint is that the snapshotted index has been committed.

  • "InstallSnapshot is what new replicas get when they join." Not necessarily. New replicas typically start by trying to follow from index 0 via AppendEntries; only if the leader's lastIncludedIndex > 0 does the leader fall back to InstallSnapshot. etcd extends this with learners — non-voting replicas that catch up via snapshot before being promoted, so the catch-up traffic doesn't affect quorum.

  • "You can snapshot at any committed index." You can snapshot at any applied index, not just any committed one. The state machine has only been advanced through lastApplied; an entry at commitIndex that hasn't been applied yet has no effect baked into the state, so you must wait for it to apply before including it in the snapshot.

  • "Snapshots make the log fully redundant." They do not. The active log tail (everything past lastIncludedIndex) is still the source of truth for committed-but-not-yet-snapshotted entries, and it is what the leader uses to repair followers via AppendEntries. The snapshot replaces only the prefix, not the whole log.

  • "Snapshot installation is atomic at the protocol level." It is atomic at the recipient (the temp-file-then-rename swap) but the InstallSnapshot RPC itself streams in chunks, and a leader change mid-stream means the new leader may resend a different snapshot. Followers must tolerate receiving partial chunks they end up throwing away — and must validate the term on every chunk in case the sender is no longer leader.

Going deeper

Why the snapshot must include lastIncludedTerm

Suppose a replica snapshotted at index 200 but only stored lastIncludedIndex = 200 and not the term. A new leader arrives in term 15 and sends AppendEntries(prevLogIndex = 200, prevLogTerm = 12, entries = [...]). The follower needs to compare 12 against the term of its entry at index 200 — but that entry is gone. Without lastIncludedTerm, the follower has no way to answer the consistency check. It would have to either accept blindly (unsafe — the new leader might be from a stale partition) or reject blindly (the cluster grinds to a halt because no AppendEntries can ever pass the boundary). Storing one extra integer per snapshot resolves the entire ambiguity.

Copy-on-write snapshots and serialisation cost

Naive snapshotting freezes the state machine, serialises it, then unfreezes — which stalls the foreground request loop for the duration of the dump. For a 1 GB state machine and 500 MB/sec serialisation, that's a 2-second pause every snapshot interval. Every production system avoids this with copy-on-write:

  • etcd snapshots the boltdb file by taking a bbolt transaction, then uses bbolt's MVCC to read the snapshot's view of the b-tree while writers continue committing into newer pages.
  • CockroachDB snapshots a range by LinkSnapshotTo-ing the underlying RocksDB SSTables — the file system creates hard links to the immutable SSTables instantly, and only the active memtable's contents need actual copying.
  • TiKV does the same with RocksDB checkpoints.

In all three, the snapshot operation completes in milliseconds and the actual byte transfer happens in the background while writers are still appending to the live log.

Snapshots and membership changes

When a new replica is added via joint consensus or single-server change, it has an empty log and no state. The leader's first AppendEntries to it triggers an InstallSnapshot (since nextIndex - 1 = 0 < lastIncludedIndex). For a large state machine, the snapshot transfer can take minutes; during this time the new replica is in the cluster's voting set but cannot vote yes on anything (it has no state). etcd's learner abstraction fixes this: a learner is a non-voting member that receives the snapshot and catches up before being promoted to voter. The promotion is a separate config change once the learner's lastApplied is close to the leader's. Without learners, adding a replica during an outage can drop the cluster below majority because the new replica can vote on elections but holds an empty log.

Snapshot transfer flow control

InstallSnapshot can ship gigabytes. Naively streamed at line rate, it saturates the leader's NIC and starves AppendEntries for the other followers, breaking heartbeats and triggering elections. Production implementations rate-limit: etcd defaults to 100 MB/sec per peer, CockroachDB to a configurable RTT-aware budget. The streaming protocol must also support resumption — if the follower crashes mid-snapshot, the leader should not start over from byte 0. Both are left as engineering details outside the original Raft paper; the paper specifies what must be sent, not how to send it efficiently.

What the original Raft paper says, and what it skips

The Raft paper (Ongaro and Ousterhout, 2014, §7) covers snapshots in about three pages. It specifies the InstallSnapshot RPC fields, the snapshotting algorithm, and the consistency-check extension for lastIncludedIndex/lastIncludedTerm. It explicitly does not cover: snapshot streaming and chunking (only mentioned), copy-on-write serialisation, snapshot retention policy, or how snapshots interact with membership changes. Diego Ongaro's PhD thesis (2014, ch. 5) fills in some of these. Production teams fill in the rest — etcd's snapshot code in mvcc/backend/backend.go and CockroachDB's kv/kvserver/replica_raftstorage.go are the two most readable references for the engineering story.

Where this leads next

References

  1. Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (USENIX ATC, 2014), §7 Log Compaction — the original specification of Raft snapshots and the InstallSnapshot RPC. raft.github.io/raft.pdf.
  2. Diego Ongaro, Consensus: Bridging Theory and Practice (Stanford PhD thesis, 2014), Ch. 5 — the long-form treatment, with discussion of streaming, retention, and copy-on-write that the paper omits. github.com/ongardie/dissertation.
  3. etcd documentation, Snapshotting and DB compaction — production-grade defaults for --snapshot-count, retention, and learner promotion. etcd.io/docs/v3.5/op-guide/maintenance.
  4. CockroachDB source, pkg/kv/kvserver/replica_raftstorage.go — the file that implements snapshot creation, transfer chunking, and follower install. github.com/cockroachdb/cockroach.
  5. Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live: An Engineering Perspective (PODC, 2007) — Querion's account of all the gaps between consensus theory and a live system, including the snapshot story they had to invent for Chubby. research.google/pubs/pub33002.
  6. Log replication and the commit index — chapter 101: where AppendEntries' consistency check is defined and where lastIncludedIndex/lastIncludedTerm plug back in.
  7. Raft membership changes (joint consensus) — chapter 103: snapshots and membership changes are inseparable in production.