In short

A Raft leader's log is the cluster's authoritative sequence of operations. Every follower's log must converge to a prefix of it, and the entire job of the replication subsystem is to make that convergence happen despite lost messages, slow followers, and freshly-elected leaders inheriting messy histories.

The mechanism is one RPC: AppendEntries. The leader sends a batch of new log entries plus two consistency tokens — prevLogIndex and prevLogTerm, the index and term of the entry immediately preceding the batch. The follower accepts only if it already holds an entry at prevLogIndex with that exact term. If yes, the new entries are appended (overwriting any conflicting suffix). If no, the follower rejects, and the leader decrements its per-follower nextIndex and retries one entry earlier. Repeat until a match is found; from there, the leader streams forward and the follower's log gets repaired entry by entry.

The reason this works at all is the Log Matching property: if two logs share an entry at the same (index, term), then all preceding entries in those logs are identical too. The single-pair check at prevLogIndex therefore certifies the entire prefix. You do not have to compare the whole history; one well-chosen pair is enough.

Once an entry is replicated to a majority of nodes the leader can advance its commit index, with one famous caveat — the entry must be from the leader's current term. Older entries reach majority replication on their own as a side effect, but they are committed only when an entry from the current term commits on top of them. That qualification is what prevents the figure-8 anomaly dissected in chapter 102; this chapter simply makes it part of the rule.

The commit index then drives the apply loop: every replica's applyIndex chases commitIndex, and as it advances, the deterministic state machine receives entries in order. The client sees its write succeed only after the leader has applied the entry — which it does only after a majority has it durable, which happens only after prevLogIndex/prevLogTerm certified that majority's logs are consistent with the leader's. Five rules, one RPC, total ordering across the cluster.

You finished chapter 100 with a single elected leader holding term T. Followers know who it is because heartbeats arrive every 50 ms with that term in the header. The interesting question now is what those heartbeats actually carry, and how a client write turns into a durable, ordered, committed entry on every replica.

The honest answer is that one RPC does almost all the work. AppendEntries is both the heartbeat and the entry-shipping mechanism — empty when there is nothing new to send, full when the leader has accumulated client commands to replicate. The protocol's elegance is that the same handler on the follower side runs the consistency check whether the batch is empty or not. The leader is constantly probing every follower, and the follower is constantly answering "yes, my log is consistent with yours" or "no, send me something earlier."

This chapter walks the loop end to end: how the leader maintains per-follower bookkeeping, how AppendEntries is constructed, what the follower checks, what happens when the check fails, how the commit index advances, why the current-term caveat exists, and what the apply loop looks like. By the end you can read the etcd source and recognise every variable.

The leader's view of the cluster

A leader does not maintain one log; it maintains one log plus a small table of pointers, one row per follower. The two pointers per follower are:

nextIndex drives the send side: where to start the next batch from. matchIndex drives the commit side: the highest log index known to be durably replicated on that follower. The leader recomputes the commit index after every successful response by sorting the matchIndex values (plus its own last index) and taking the median — the largest N such that a majority of replicas have matchIndex >= N.

Why two pointers, not one: nextIndex is hopeful and can be wrong (it might say "send entry 50" when the follower's log only goes to 20); the rejection-and-decrement loop corrects it. matchIndex is conservative and provably correct — it only advances when the follower confirms it. Mixing the two would either make the leader pessimistic (slow to send) or optimistic about commits (unsafe). Keeping them separate gives you fast-forward sending and conservative committing in the same data structure.
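The median computation over the pointer table can be sketched in a few lines (names illustrative; the current-term check discussed under the commit rule is layered on top of this by the caller):

```python
def commit_candidate(match_index: dict, leader_last_index: int) -> int:
    """Largest N such that a majority of replicas (leader included)
    have replicated entries up through N."""
    # The leader always "matches" its own log up to its last entry.
    acked = sorted(list(match_index.values()) + [leader_last_index])
    # With the values sorted ascending, the entry at position
    # (len - 1) // 2 is held by a strict majority of replicas.
    return acked[(len(acked) - 1) // 2]

# Five-node cluster: leader at index 7, three followers at 7, one lagging at 6.
print(commit_candidate({'F1': 7, 'F2': 7, 'F3': 7, 'F4': 6}, 7))  # -> 7
```

With four of five replicas at index 7, the median is 7; if two followers lag at 6, it drops to 6 because only two replicas plus the leader hold index 7 — still a majority, but the sorted-median picks the conservative answer only when a true majority holds the higher index.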

Anatomy of an AppendEntries RPC

Every AppendEntries call from the leader carries six fields. Three of them (term, prevLogIndex, prevLogTerm) drive the consistency checks; the other three identify the leader, carry the payload, and announce the commit point.

[Figure: AppendEntries RPC structure with the consistency check. The leader (term 3, nextIndex[F2]=8, log[6]: t=2 PUT a, log[7]: t=3 PUT b, log[8]: t=3 PUT c) sends an AppendEntries RPC to follower F2 carrying term: 3, leaderId: L, prevLogIndex: 7, prevLogTerm: 3, entries: [{idx:8, t:3, cmd:"PUT c"}], leaderCommit: 6. An arrow from prevLogIndex/prevLogTerm points at the follower's own entry at index 7, indicating the consistency check runs against that single entry. F2 (currentTerm=3, log[6]: t=2 PUT a, log[7]: t=3 PUT b) finds a match at idx=7, term=3, appends log[8], and responds success=true, matchIndex=8.]
The AppendEntries RPC as the leader sees it and as the follower processes it. The leader's per-follower nextIndex is 8, so the leader ships entry 8 plus the consistency tokens — the index and term of entry 7. The follower checks its own log at index 7, finds term 3 (a match), accepts entry 8, and acknowledges. If the follower had no entry at index 7, or had an entry there with a different term, it would reject. The red fields in the RPC are the check; everything else is payload or metadata.

The six fields in detail:

AppendEntries(
    term,           # the leader's current term
    leaderId,       # so followers can redirect clients
    prevLogIndex,   # index of the log entry immediately before `entries`
    prevLogTerm,    # term of that prevLogIndex entry, on the leader
    entries,        # list of new entries (empty for a heartbeat)
    leaderCommit,   # leader's commitIndex, so follower can advance its own
)

The follower's response is two fields: term (so the leader can step down if it sees a higher term) and success (whether the consistency check passed and the batch was accepted). Some implementations add hints — conflictTerm, conflictIndex — to let the leader skip ahead instead of decrementing one entry at a time, but the protocol is correct without them.
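A minimal sketch of the response shape, with the optional hint fields included (field names here are illustrative; real implementations differ in naming and encoding):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AppendEntriesResponse:
    term: int                # follower's currentTerm; leader steps down if higher
    success: bool            # did the consistency check pass?
    # Optional fast-backoff hints (not required for correctness):
    conflict_term: Optional[int] = None    # term of follower's entry at prevLogIndex
    conflict_index: Optional[int] = None   # first index the follower holds in that term

# A plain rejection versus a hinted one:
plain = AppendEntriesResponse(term=3, success=False)
hinted = AppendEntriesResponse(term=3, success=False,
                               conflict_term=5, conflict_index=7)
```

The hints are advisory: a leader that ignores them still converges via the one-at-a-time decrement, which is why the protocol stays correct without them.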

The Log Matching property

Why does checking just one entry suffice to certify the entire preceding log? Because Raft maintains an invariant the protocol calls Log Matching:

If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.

The proof is by construction. Each log entry, once written by some leader in term T, is uniquely identified by its (index, T) pair — a leader writes at most one entry at each index in its term, because there is at most one leader per term (chapter 100). The AppendEntries handler enforces, on every accept, that the immediately preceding entry matches. Induction does the rest: if entries up to index i-1 match and entry i matches, then entries up to i match.

Why this matters for replication: without Log Matching the leader would have to send the entire log every time, or build a hash tree, or run some other O(log n) or O(n) comparison on every RPC. With it, the consistency check is O(1) — one index, one term, one comparison. The protocol's bandwidth efficiency falls directly out of this invariant. It is also the reason Raft can repair a divergent follower with a backwards linear search instead of a binary search or a full diff.
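The invariant can be spot-checked directly: represent each log as a list of (index, term) pairs and confirm that any shared pair implies identical prefixes. A toy checker (illustrative only, not part of any protocol implementation):

```python
def log_matching_holds(log_a, log_b):
    """Check the Log Matching property over two logs given as lists of
    (index, term) pairs, where list position i holds the entry at index i+1."""
    for i, (pair_a, pair_b) in enumerate(zip(log_a, log_b)):
        if pair_a == pair_b:
            # Shared (index, term) here: everything before it must be identical.
            if log_a[:i] != log_b[:i]:
                return False
    return True

# The leader/follower pair from the repair figure below terms 4 vs 5 at the tail:
leader   = [(1,1),(2,1),(3,2),(4,2),(5,3),(6,3),(7,4),(8,4)]
follower = [(1,1),(2,1),(3,2),(4,2),(5,3),(6,3),(7,5),(8,5)]
print(log_matching_holds(leader, follower))  # -> True: they agree through index 6
```

Any pair of logs Raft can actually produce passes this check; a state where the checker returns False is exactly a state the AppendEntries handler makes unreachable.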

When the consistency check fails — the repair loop

Suppose the follower's log diverges from the leader's. Maybe the follower was offline when entries 5-10 were committed and only has entries up to 4. Maybe the follower was a previous-term leader that appended uncommitted entries 8-12 that the new leader does not have. Either way, when the new leader sends AppendEntries(prevLogIndex=10, prevLogTerm=3, entries=[entry11]), the follower's check fails and it responds success=false.

The leader's reaction is simple and almost embarrassingly mechanical: decrement nextIndex[follower] by one and retry. Now nextIndex is 10 instead of 11, so the next RPC is AppendEntries(prevLogIndex=9, prevLogTerm=log[9].term, entries=[entry10, entry11]). If that fails too, decrement again. The leader walks nextIndex backward until it finds a (prevLogIndex, prevLogTerm) pair that matches the follower's log.

Once a match is found, the AppendEntries succeeds and the leader streams forward — the follower truncates any conflicting suffix (entries it had with terms different from the leader's) and appends the new entries, and from then on nextIndex only moves forward as the follower keeps up.

[Figure: Repairing a divergent follower. Leader log terms at indices 1-8: 1, 1, 2, 2, 3, 3, 4, 4. Follower log terms: 1, 1, 2, 2, 3, 3, 5, 5 — the term-5 tail at indices 7-8 came from a stale leader. Try 1: nextIndex=9, prev=(8, t4), fail. Try 2: nextIndex=8, prev=(7, t4), fail. Try 3: nextIndex=7, prev=(6, t3), MATCH. The leader then streams entries 7 and 8, and the follower truncates its divergent term-5 suffix.]
A follower whose last two entries (terms 5, 5) come from a stale leader that briefly held office. The new leader's nextIndex starts at 9 (one past its own last index, 8); the consistency check at prevLogIndex=8 fails, then at 7 fails, then at 6 succeeds because both logs agree that index 6 holds a term-3 entry. From there the leader ships entries 7 and 8 in its own version (term 4); the follower truncates its term-5 entries and appends the leader's. Log Matching now holds for the entire log.
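The figure's three-probe walk can be reproduced with plain term lists (a toy sketch; index 0 is the usual sentinel):

```python
def find_match(leader_terms, follower_terms):
    """Walk nextIndex backward one step per rejection until the
    (prevLogIndex, prevLogTerm) consistency check passes."""
    next_index = len(leader_terms)   # one past the leader's last index
    probes = []
    while True:
        prev = next_index - 1
        # The check: follower must hold an entry at prev with the leader's term.
        ok = prev < len(follower_terms) and follower_terms[prev] == leader_terms[prev]
        probes.append((next_index, ok))
        if ok:
            return next_index, probes
        next_index -= 1

# Sentinel term 0 at index 0, then the terms from the figure.
leader   = [0, 1, 1, 2, 2, 3, 3, 4, 4]
follower = [0, 1, 1, 2, 2, 3, 3, 5, 5]
ni, probes = find_match(leader, follower)
print(ni)  # -> 7: match at prev index 6 (term 3); stream entries 7 and 8 from there
```

Three probes, exactly as in the figure: nextIndex 9 and 8 fail because the follower's tail terms are 5 where the leader has 4, and nextIndex 7 succeeds on the shared (6, t3) entry.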

Decrementing one at a time is correct but can be slow if the follower is many entries behind. Production implementations use a hint: when a follower rejects, it includes the term of its own conflicting entry and the index of the first entry in that term. The leader then jumps nextIndex past the entire stale term in one hop. The etcd Raft implementation uses this optimisation; see the etcd Raft README and the MsgAppResp handling for the exact rule.

Commit index — the majority + current-term rule

A log entry is replicated once a follower has it durable on disk. An entry is committed once the leader knows a majority of the cluster has it durable. The leader announces commits by including its commitIndex in subsequent AppendEntries RPCs; followers update their own commitIndex = min(leaderCommit, lastNewEntryIndex) and then start applying.

The leader recomputes commit index after every successful AppendEntries response. The rule:

Find the largest N such that N > commitIndex, a majority of matchIndex[i] >= N, and log[N].term == currentTerm. If such N exists, set commitIndex = N.

The first two conditions are obvious — past the current commit point and replicated to a majority. The third is the famous caveat: the entry must be from the leader's current term. A new leader cannot commit an entry from an older term just because a majority happens to hold it.

Why the current-term caveat exists: this is the scaffold for Raft's safety proof. Imagine entries from term 2 sitting on a majority but not yet committed because the term-2 leader crashed. A new leader in term 3 wins election. If it commits the term-2 entries simply because they are on a majority, a subsequent leader churn could overwrite them — the figure-8 scenario in the Raft paper. By insisting that only current-term entries trigger commit advancement, the leader implicitly commits all preceding entries when its own first entry commits, and at that point the safety theorem in chapter 102 guarantees no future leader can possibly disagree on those entries. One sentence in the rule, several pages of proof behind it.

In practice this means a brand-new leader's first action is often to append a no-op entry in its term and replicate it. As soon as the no-op commits, every entry from prior terms inherited by this leader's log is also committed — not because the no-op forced them, but because the proof condition is now satisfied. Some Raft implementations make this explicit and call it the "leader's initial empty entry"; etcd and CockroachDB do.

Apply loop — committed → applied

Commitment is an agreement statement; applying is what actually changes the state machine. Each replica runs a tight loop:

while applyIndex < commitIndex:
    applyIndex += 1
    state_machine.apply(log[applyIndex])

Strictly monotonic, never skips, never replays. Applying is the only step that touches user-visible state; everything before it is just bookkeeping about who has what. The leader applies first (it knows what is committed before anyone else) and only after the apply succeeds does it return a result to the client. Followers apply when their next heartbeat tells them commitIndex has moved.

This separation of "majority replication → commit → apply" is what gives Raft its safety guarantees. A network partition during step 1 means the entry never commits and is later overwritten — the client sees a timeout, not stale state. A crash during step 2 (after replication, before commit announcement) is fine: the new leader sees the entry on a majority, re-commits it under its own term (via the no-op trick above), and the client's retry succeeds idempotently if the operation has an ID. A crash during step 3 (after commit, before apply) is also fine: on restart the replica replays the log, applies up to commitIndex, and resumes.

Real Python — the leader's replicate path

Stripping away network glue and durability, the leader's job per follower is roughly thirty lines.

import asyncio
from dataclasses import dataclass

@dataclass
class LogEntry:
    index: int
    term: int
    command: tuple   # ('SET', 'x', 5) etc.

class Leader:
    def __init__(self, peers, self_id):
        self.peers = peers              # other nodes
        self.id = self_id
        self.term = 0                   # current term
        self.log: list[LogEntry] = []   # leader's log, 1-indexed via dummy at [0]
        self.log.append(LogEntry(0, 0, ('NOOP',)))  # sentinel
        self.commit_index = 0
        self.next_index = {p: 1 for p in peers}
        self.match_index = {p: 0 for p in peers}

    async def replicate(self, command) -> int:
        """Append a command, replicate to majority, return its log index."""
        entry = LogEntry(len(self.log), self.term, command)
        self.log.append(entry)
        # Fire AppendEntries to every follower as background tasks; commit
        # advances as acks arrive. (Tasks, not gather: awaiting *all* followers
        # would stall the write behind the slowest replica instead of a majority.)
        for p in self.peers:
            asyncio.create_task(self._send_append(p))
        # Wait for this entry to be committed
        while self.commit_index < entry.index:
            await asyncio.sleep(0.001)
        return entry.index

    async def _send_append(self, peer):
        """Loop: send AppendEntries, on reject decrement nextIndex, retry."""
        while True:
            ni = self.next_index[peer]
            prev_index = ni - 1
            prev_term = self.log[prev_index].term
            entries = self.log[ni:]   # everything from nextIndex forward
            resp = await peer.append_entries(
                term=self.term,
                leader_id=self.id,
                prev_log_index=prev_index,
                prev_log_term=prev_term,
                entries=entries,
                leader_commit=self.commit_index,
            )
            if resp.term > self.term:
                self.step_down(resp.term)   # someone else is leader now
                return
            if resp.success:
                self.match_index[peer] = prev_index + len(entries)
                self.next_index[peer] = self.match_index[peer] + 1
                self._maybe_advance_commit()
                return
            # Consistency check failed — back off one and retry.
            self.next_index[peer] = max(1, ni - 1)

    def _maybe_advance_commit(self):
        """Find the largest N replicated on a majority whose term == current term."""
        match = sorted(list(self.match_index.values()) + [len(self.log) - 1])
        majority_idx = match[len(match) // 2]   # median = majority threshold
        if (majority_idx > self.commit_index
                and self.log[majority_idx].term == self.term):
            self.commit_index = majority_idx

The follower side is the consistency check plus the truncate-and-append.

class Follower:
    def __init__(self):
        self.term = 0
        self.log = [LogEntry(0, 0, ('NOOP',))]
        self.commit_index = 0
        self.apply_index = 0
        self.state = {}

    def handle_append_entries(self, term, leader_id, prev_log_index,
                              prev_log_term, entries, leader_commit):
        # 1. Reject stale leader.
        if term < self.term:
            return {'term': self.term, 'success': False}
        self.term = term
        # 2. Consistency check: must hold matching (index, term) immediately before.
        if prev_log_index >= len(self.log):
            return {'term': self.term, 'success': False}    # log too short
        if self.log[prev_log_index].term != prev_log_term:
            return {'term': self.term, 'success': False}    # term mismatch
        # 3. Append, truncating any conflicting suffix.
        for i, entry in enumerate(entries):
            idx = prev_log_index + 1 + i
            if idx < len(self.log) and self.log[idx].term != entry.term:
                self.log = self.log[:idx]   # truncate divergent tail
            if idx >= len(self.log):
                self.log.append(entry)
        # 4. Advance commit index, never past last new entry.
        if leader_commit > self.commit_index:
            last_new = prev_log_index + len(entries)
            self.commit_index = min(leader_commit, last_new)
        # 5. Apply newly-committed entries to state machine.
        while self.apply_index < self.commit_index:
            self.apply_index += 1
            self._apply(self.log[self.apply_index].command)
        return {'term': self.term, 'success': True}

    def _apply(self, cmd):
        kind, *args = cmd
        if kind == 'SET':
            self.state[args[0]] = args[1]

That is the entire data path. Election (chapter 100) gets you to a single leader; this code keeps the cluster's logs in lockstep until the next election. Real implementations add disk persistence between accepting an entry and acknowledging it (the Raft paper's currentTerm, votedFor, and log[] must be on stable storage before responding), batching of multiple entries per RPC, pipelining (sending entry N+1 before N is acked), and snapshotting (chapter 103) to truncate the on-disk log.
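To see replicated versus committed versus applied in motion, here is a two-round drive of the follower handler (a condensed, self-contained copy of the Follower above, with a hypothetical session): the first RPC replicates the entry, and the second, an empty heartbeat, commits and applies it.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    index: int
    term: int
    command: tuple

class Follower:
    # Condensed from the chapter's Follower class.
    def __init__(self):
        self.term = 0
        self.log = [LogEntry(0, 0, ('NOOP',))]   # sentinel at index 0
        self.commit_index = 0
        self.apply_index = 0
        self.state = {}

    def handle_append_entries(self, term, leader_id, prev_log_index,
                              prev_log_term, entries, leader_commit):
        if term < self.term:
            return {'term': self.term, 'success': False}
        self.term = term
        # Consistency check against the single (prevLogIndex, prevLogTerm) pair.
        if prev_log_index >= len(self.log) or \
           self.log[prev_log_index].term != prev_log_term:
            return {'term': self.term, 'success': False}
        for i, entry in enumerate(entries):
            idx = prev_log_index + 1 + i
            if idx < len(self.log) and self.log[idx].term != entry.term:
                self.log = self.log[:idx]        # truncate divergent tail
            if idx >= len(self.log):
                self.log.append(entry)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, prev_log_index + len(entries))
        while self.apply_index < self.commit_index:
            self.apply_index += 1
            cmd = self.log[self.apply_index].command
            if cmd[0] == 'SET':
                self.state[cmd[1]] = cmd[2]
        return {'term': self.term, 'success': True}

f = Follower()
# Round 1: ship entry 1; the leader has committed nothing yet.
r1 = f.handle_append_entries(term=3, leader_id='L', prev_log_index=0,
                             prev_log_term=0,
                             entries=[LogEntry(1, 3, ('SET', 'x', 5))],
                             leader_commit=0)
# Round 2: empty heartbeat announcing the commit; the follower applies.
r2 = f.handle_append_entries(term=3, leader_id='L', prev_log_index=1,
                             prev_log_term=3, entries=[], leader_commit=1)
print(r1['success'], r2['success'], f.state)  # -> True True {'x': 5}
```

After round 1 the entry is replicated but the state machine is untouched; only the round-2 heartbeat, carrying the advanced leaderCommit, moves commitIndex and triggers the apply loop.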

Worked example — a write through the cluster

A `PUT x=5` lands on a five-node cluster

You run a five-replica Raft cluster. Leader L is in term 3. Entries 1-5 are committed and applied everywhere; entry 6 (term 2, inherited from the previous leader) is replicated on all five nodes but not yet committed, because the current-term caveat forbids committing it directly. commitIndex = 5, applyIndex = 5 on every replica. The leader's per-follower table:

nextIndex  = {F1: 7, F2: 7, F3: 7, F4: 7}
matchIndex = {F1: 6, F2: 6, F3: 6, F4: 6}
currentTerm = 3

A client calls PUT x=5. The leader handles it in seven steps.

  1. Append on leader. L appends LogEntry(index=7, term=3, command=('SET','x',5)) to its own log. Persists to disk via fsync. Local log length is now 8 (with sentinel), len(log)-1 == 7.

  2. Build AppendEntries. For each follower, nextIndex=7, so prev_log_index=6, prev_log_term=2 (entry 6 was written by the previous-term leader; this leader inherited it but cannot commit it on its own — see the current-term caveat above). entries=[log[7]], leader_commit=5.

  3. Send in parallel. L fires AppendEntries to F1, F2, F3, F4 simultaneously. F1, F2, F3 receive within 5 ms (same datacentre). F4 is in a different region, 80 ms away.

  4. Consistency check on each follower. Each follower has log[6] with term=2, matching prev_log_term=2. All accept, append log[7], fsync. F1, F2, F3 respond success=true. F4 has not yet replied.

  5. Update bookkeeping. L processes responses as they arrive. After F1's: matchIndex[F1]=7, nextIndex[F1]=8. Same for F2 and F3.

  6. Try to advance commit. With L's own len(log)-1 == 7 and matchIndex = {F1:7, F2:7, F3:7, F4:6}, the median (over five values: leader's 7, F1's 7, F2's 7, F3's 7, F4's 6) is 7. Three of five are at 7, which is a majority. And log[7].term == 3 == currentTerm. The current-term caveat is satisfied. Commit advances: commitIndex = 7.

    Note: the previously-uncommitted entry 6 (term 2) was replicated to a majority all along, but the leader could not commit it directly because log[6].term != currentTerm. Now that entry 7 is committing, entry 6 commits implicitly — they are both now safe by the safety theorem.

  7. Apply and respond. L applies log[7]: state['x'] = 5. Returns OK to the client. The next heartbeat carries leader_commit=7; F1, F2, F3 receive it, set their own commitIndex=7, apply, and arrive at state['x'] == 5. F4 catches up when its delayed RPC's response is processed.

Total client-observed latency: roughly one RTT to the third-fastest follower, plus an fsync on the leader and on each acking follower. In a same-region cluster, 5-15 ms. Cross-region, 50-200 ms. That is the unavoidable cost of strong consistency through consensus, and what every CRDT and quorum-relaxation paper since 2010 has been trying to wriggle out of.

Common confusions

Replicated is not committed: an entry sitting on a majority of disks is not committed until the leader's rule (majority plus current term) fires. Committed is not applied: commitIndex can run ahead of applyIndex, and only the apply loop touches user-visible state. And the heartbeat is not a separate RPC: an empty AppendEntries runs exactly the same consistency check as a full one, which is why a quiet cluster is still continuously verifying its logs.

Going deeper

Pipelining and batching — the gap between paper and production

The reference Raft paper describes one outstanding AppendEntries per follower. Production implementations pipeline aggressively: the leader sends entry N+1 to a follower without waiting for the ack of N. As long as the follower processes RPCs in order (TCP gives this for free), the consistency check still works — entry N+1's RPC carries prevLogIndex=N, prevLogTerm=T, and if the follower has not yet processed N's RPC, it rejects, the leader's _send_append loop re-fires at nextIndex=N, and the system self-corrects.

Batching is even more important. A single AppendEntries can carry hundreds of entries; a single fsync on the follower covers all of them. Throughput goes from a few hundred ops/sec (one fsync per entry) to tens of thousands (one fsync per batch). The trade-off is latency — large batches mean each entry waits for the batch to fill. CockroachDB's proposalEvalCh and etcd's MsgPropC channels buffer a few hundred microseconds worth of proposals before flushing.

The conflictTerm/conflictIndex hint

The original Raft paper mentions but does not fully specify the hint. The version etcd uses: when a follower rejects, it returns the term of the conflicting entry at prevLogIndex (call it t_c) and the index of the first entry of term t_c in its log. The leader then either skips its nextIndex past the entire t_c segment if it has no entries of term t_c itself, or jumps to one past its own last entry of term t_c. This collapses an O(divergence) probe to typically O(1).
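Sketched as leader-side code (an approximation of the rule as described above; etcd's actual MsgAppResp handling differs in detail):

```python
def next_index_after_reject(leader_terms, conflict_term, conflict_index):
    """Fast backoff from a hinted rejection. leader_terms[i] is the term of
    the leader's entry at index i (index 0 is the sentinel)."""
    # Does the leader hold any entry of conflict_term?
    last_of_term = 0
    for i, t in enumerate(leader_terms):
        if t == conflict_term:
            last_of_term = i
    if last_of_term:
        # Leader has that term: retry just past its own last entry of it.
        return last_of_term + 1
    # Leader has none: skip the follower's entire conflict_term segment.
    return conflict_index

leader = [0, 1, 1, 2, 2, 3, 3, 4, 4]
# Follower rejected with conflict_term=5, first index of term 5 in its log = 7.
print(next_index_after_reject(leader, 5, 7))  # -> 7: one hop instead of three probes
```

On the repair-figure scenario this collapses the three one-at-a-time probes into a single jump: the leader has no term-5 entries, so nextIndex lands directly on the follower's first term-5 index.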

Diego Ongaro's PhD thesis (Stanford, 2014) section 5.3 gives the formal version, including a proof that the optimisation does not violate any Log Matching invariant. Worth reading for anyone implementing the protocol.

Read indices and lease-based reads

A naive Raft cluster requires every read to go through a log entry too — append a no-op, wait for it to commit, then serve the read. This is correct but expensive (one RTT, one fsync per read).

The two production optimisations: the read index (the leader records its current commitIndex, confirms it is still leader with one round of heartbeats to a majority, waits for applyIndex to reach that point, then serves the read from local state with no log append or fsync) and lease-based reads (the leader answers reads locally for the duration of a time lease during which no new leader can be elected, trading a bounded-clock-drift assumption for zero round trips per read).

Both are post-2014 additions to production Raft. The original paper sketches read index in section 8; lease reads come from Spanner's TrueTime literature.
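As a sketch of the read-index flow against a hypothetical leader object (confirm_leadership here stands in for one round of majority heartbeats; it is not an API from any named library):

```python
import asyncio

async def read_index_get(leader, key):
    """Read-index flow: record the commit point, confirm leadership,
    wait for the apply loop to catch up, then read locally."""
    read_index = leader.commit_index          # 1. snapshot the commit point
    await leader.confirm_leadership()         # 2. one heartbeat round to a majority
    while leader.apply_index < read_index:    # 3. wait for apply to catch up
        await asyncio.sleep(0.001)
    return leader.state.get(key)              # 4. serve from local state
```

The ordering matters: snapshotting commitIndex before the heartbeat round guarantees the read reflects every write committed before the read arrived, which is exactly the linearizability condition the no-op-per-read scheme paid a log entry for.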

Joint consensus and config changes

What if you want to change the cluster membership — add a node, remove a dead one? Naively adding a node could split the cluster into two majorities (the old and the new). Raft solves this with joint consensus: a transitional configuration where decisions require majority approval from both the old and new sets simultaneously. Once the joint config commits, the cluster transitions to the new set alone. Section 6 of the Raft paper covers this. Most production systems implement the simpler one-at-a-time variant where adding/removing a single node is always safe.

Where this leads next

Chapter 102 picks up the safety thread. With log replication mechanics in hand, you can now read the formal Raft safety properties — election restriction, leader completeness, state machine safety — and follow the proof that the figure-8 scenario is impossible. The current-term commit caveat introduced here is one of the five invariants, and chapter 102 explains the other four and how they combine.

Chapter 103 covers log compaction (snapshots), which lets the cluster discard old log entries without losing history — necessary because real logs grow without bound and replaying from entry 1 every restart becomes prohibitive. Snapshots interact subtly with AppendEntries (an InstallSnapshot RPC takes over when nextIndex is below the snapshot point) and chapter 103 walks the protocol.

Chapter 104 then puts the whole cluster through the wringer of partitions, leader churn, and message loss, showing that the safety guarantees do hold in practice. Build 14 takes the consensus log we have built and uses it as the primitive for cross-shard atomic commit — the wall from chapter 98.

The two sentences to carry forward: Log Matching means one well-chosen consistency check certifies the entire prefix. And: commit advances when a majority holds the entry AND the entry is from the leader's current term. The first is what makes the protocol efficient; the second is what makes it safe.

References

  1. Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the canonical Raft paper, with sections 5.3 (log replication) and 5.4 (safety) directly underlying this chapter.
  2. Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD thesis, 2014 — the long-form treatment with formal proofs, the conflictTerm hint, joint consensus, and a thorough analysis of edge cases the conference paper compresses.
  3. etcd-io, etcd Raft library — production Go implementation, the source most often cited as the reference for what Raft looks like in real code; the README and node.go are the entry points.
  4. Howard, Schwarzkopf, Madhavapeddy, Crowcroft, Raft Refloated: Do We Have Consensus?, ACM Operating Systems Review 49(1), 2015 — independent re-implementation surfacing edge cases the original paper underspecifies, including subtleties in the AppendEntries handler.
  5. Hashicorp, hashicorp/raft — Go library powering Consul, Nomad, and Vault, with a different code organisation from etcd; useful as a second reference implementation when reasoning about implementation freedom within the protocol.
  6. Cockroach Labs, Living Without Atomic Clocks and the CockroachDB Raft documentation — engineering write-ups on running Raft at scale, including read-index optimisations and lease-based reads not in the original paper.