In short

Adding or removing servers from a running Raft cluster is the kind of operation that looks trivial in a wiring diagram and is genuinely subtle in code. The hazard is that the cluster's majority depends on the cluster's size, and during a transition different servers can disagree on what the size is — which means two disjoint groups can each believe they hold a majority and each commit conflicting entries. That is split-brain, except it is caused by configuration drift, not by a network partition.

The naive solution — every server atomically switches from C_old to C_new — fails because the switch is not actually atomic. Configuration changes propagate through the log like every other entry, and during the propagation window some servers still think the cluster is C_old (3 nodes, majority 2) while others have already moved to C_new (5 nodes, majority 3). A leader operating under C_old and a separate leader operating under C_new can both gather their respective majorities from disjoint sets, commit different entries at the same log index, and now your replicated log has forked.

Raft's fix is joint consensus: a transitional configuration C_old,new that requires both a majority of the old set and a majority of the new set for any decision — election, log entry commit, anything. Because the majorities of C_old and C_new cannot both be satisfied without overlap (any server in the intersection has to vote yes for both), no two disjoint groups can each pass the joint test. The transition becomes a two-phase log protocol: leader appends C_old,new entry (active immediately on append, even before commit), once committed it appends C_new entry, once C_new commits the cluster has shed C_old and old-only servers can shut down.

Most production implementations skip joint consensus entirely and use single-server changes — add or remove one server at a time. The proof Diego Ongaro added in his thesis: when C_old and C_new differ by exactly one server (|C_new △ C_old| = 1, the symmetric difference), every majority of C_new overlaps every majority of C_old, so two simultaneous decisions cannot disagree. etcd, Hashicorp Raft, and CockroachDB all use this simpler form (with etcd extending it via "learners" — non-voting catch-up replicas). You should know joint consensus to read the paper; you will mostly write single-server change code in production.

The implementation note that catches everyone out: configuration entries take effect as soon as they are appended to the log, not when committed. A leader that appends C_old,new and then loses a partition before committing it has nonetheless triggered the joint rules locally — an unusual exception to "log entries are advisory until committed" — and the design is correct precisely because it errs on the side of more restriction during ambiguity.

You have a 3-node Raft cluster — {A, B, C} — running production traffic for a small fintech in Bengaluru. Your VP of engineering wants to bump it to 5 nodes so a single rack outage cannot take you down. The plan is innocent: install Raft on machines D and E, point them at the existing cluster, and let them join.

The thing that should worry you is the question: during the join, while D and E are getting up to speed, what does "majority" mean? If the cluster is 3 nodes, majority is 2. If it is 5, majority is 3. A live system has to operate continuously through the change — clients are still writing — and some servers will see the change before others. The window where servers disagree on the cluster size is the window where two different leaders can each gather two-vote and three-vote majorities from disjoint sets and commit incompatible entries at the same index.

That is not a theoretical hazard. It is the reason joint consensus exists.

The previous chapters built the leader-election layer (terms, votes, the up-to-date restriction) and the log-replication layer (AppendEntries, the Log Matching property, the commit index). Both assumed a fixed cluster — a known, immutable list of peers. This chapter relaxes that assumption and shows the protocol Raft uses to safely change the peer list while the cluster keeps serving requests.

Why the naive approach fails

The seductive idea is that every server reads a config file at startup, and to change membership you (a) write the new config to every server's file, (b) restart them in some careful order. This works fine if you can take downtime — stop the cluster, change everyone, restart everyone — but you almost never can. So you want an online change, where the cluster transitions while still serving.

The first online idea is to ship the new configuration through the log itself: leader appends a Config(C_new) entry, replicates to a majority, commits, and from that point every server uses C_new. This is the single-step or atomic switch approach. It looks correct because Raft already provides ordering and replication guarantees for log entries; why would config entries be different?

They are different because the moment a server sees the C_new entry — even before commit — it starts behaving as if C_new is the rule. We will defend this design choice later; right now accept it as a constraint and see what goes wrong without joint consensus.

Concretely, take C_old = {A, B, C} (majority 2) expanding to C_new = {A, B, C, D, E} (majority 3). Suppose leader A (in C_old) appends a Config(C_new) entry at log index 7 and replicates it to D and E (so A, D, E have it, but B and C do not yet), and then A is partitioned away from B and C but can still reach D and E.

[Figure: naive single-step C_old → C_new during a partition, producing two disjoint majorities. Side 1 ({A, D, E}, each holding log[7] = C_new) believes the cluster is C_new and A commits log[8] = "x = 1" with 3 of 5. Side 2 ({B, C}, without log[7]) believes the cluster is C_old; B wins an election with 2 of 3 and commits log[8] = "x = 2". Split brain.]
The naive single-step membership change. The leader A has propagated the new configuration to D and E but not to B and C when a partition cuts the cluster. A,D,E believe the cluster is C_new (5 nodes, majority 3) and A satisfies that majority on its side. B,C still believe the cluster is C_old (3 nodes, majority 2) and B's election satisfies that on its side. Both sides commit different entries at the same log index. The fork is now permanent — when the partition heals, neither side can reconcile because both have committed entries that must not be overwritten.

The minority majority on each side is the disaster. {A, D, E} is a 3-node majority of C_new (size 5), and {B, C} is a 2-node majority of C_old (size 3). These two sets are disjoint — they share no servers — and yet each is, by its respective rule, a majority. Two leaders, two committed entries at the same index, no overlap to detect the conflict.

Why this is impossible to fix after the fact: Raft's safety proof rests on the fact that any two majorities overlap in at least one server, so any committed entry is held by at least one server in any future leader's quorum, and the up-to-date-log rule then forces the new leader to inherit that entry. When the two majorities come from different cluster sizes, the overlap argument breaks. There is no server that both sides have to consult, so neither side knows the other has committed anything. After partition heal, both sides hold log entries they each believe are committed, and any reconciliation would require overwriting committed state — which the safety theorem explicitly forbids. The damage is unrecoverable; you cannot patch this after the fact, only prevent it.
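The disjoint-majority arithmetic is small enough to check by brute force. A sketch, where `majorities` is an illustrative helper rather than part of the chapter's code:

```python
from itertools import combinations

C_old = {"A", "B", "C"}            # naive old rule: majority = 2 of 3
C_new = {"A", "B", "C", "D", "E"}  # naive new rule: majority = 3 of 5

def majorities(cluster):
    """All minimal majority subsets of a cluster."""
    need = len(cluster) // 2 + 1
    return [set(m) for m in combinations(sorted(cluster), need)]

# Pairs of (old-rule majority, new-rule majority) sharing no servers:
disjoint = [(a, b) for a in majorities(C_old) for b in majorities(C_new)
            if not (a & b)]
assert disjoint   # non-empty: the naive rule admits disjoint majorities
```

The pair from the figure, ({B, C}, {A, D, E}), is among the hits — two "majorities" with no server in common.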

Joint consensus: requiring both majorities

The fix is to introduce an intermediate configuration that is more restrictive than both C_old and C_new alone: a configuration C_old,new that requires a majority of both sets for any decision. Now during the transition, no group of servers can pass the rule unless it includes a majority from C_old AND a majority from C_new — and any two such groups must overlap, because each contains a majority of C_old (and majorities of C_old overlap with each other), and each contains a majority of C_new (and majorities of C_new overlap with each other).

[Figure: the joint configuration C_old,new drawn as two overlapping circles, C_old = {A, B, C} (majority 2) and C_new = {A, B, C, D, E} (majority 3), intersecting in {A, B, C}. Valid quorum: {A, B, D}. Invalid: {D, E} (no C_old majority) and {A, C} (only 2 of C_new).]
The joint configuration C_old,new requires every quorum decision to be approved by a majority within both C_old and C_new. With C_old = {A,B,C} and C_new = {A,B,C,D,E}, a valid quorum like {A, B, D} contains 2 of the 3 old servers and 3 of the 5 new servers — both majorities satisfied. {D, E} contains no old servers and so cannot pass; {A, C} contains a majority of old but only 2 of the 5 new, failing the second rule. Notice that any two valid quorums must intersect in at least one server, because each must contain a majority of C_old (and majorities of any one set always overlap).

Two valid quorums under C_old,new always overlap, because each contains a majority of C_old and the intersection of any two majorities of the same set is non-empty. So while C_old,new is in effect, only one leader can be elected and only one entry can commit at any index — exactly the property C_old alone gave you, and C_new alone will give you, but reasserted across the transition.

Why this works as a bridge: think of C_old,new as a strictly stronger rule than both C_old and C_new. Any group that satisfies C_old,new automatically satisfies C_old (it contains a majority of C_old) and automatically satisfies C_new (it contains a majority of C_new). So during the transition the cluster behaves "as if" both rules are in force simultaneously. Once C_old,new itself is committed and the cluster transitions to C_new alone, you have inherited the safety properties of C_old (carried through the joint phase) and gained the safety properties of C_new going forward. There is no single instant where the rule is laxer than either endpoint.
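The overlap claim can be verified exhaustively for the running example. In this sketch, `joint_quorum` is a throwaway stand-in for the quorum rule, not the Configuration class defined later in the chapter:

```python
from itertools import combinations

C_old = frozenset("ABC")
C_new = frozenset("ABCDE")

def joint_quorum(group):
    """C_old,new rule: majority of C_old AND majority of C_new."""
    return (len(group & C_old) > len(C_old) // 2
            and len(group & C_new) > len(C_new) // 2)

# Enumerate every subset of the five servers and keep the valid quorums.
all_groups = [frozenset(g) for r in range(6)
              for g in combinations("ABCDE", r)]
valid = [g for g in all_groups if joint_quorum(g)]

# Any two valid joint quorums share at least one server.
assert all(a & b for a in valid for b in valid)
```

Every valid quorum contains at least two of {A, B, C}, and two size-2-or-more subsets of a 3-element set must intersect, which is exactly the argument above.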

The two-phase protocol

The transition runs as a two-step log dance:

Phase 1: leader appends C_old,new to the log. Every server that receives the entry — through normal AppendEntries replication — immediately starts using the joint rules for elections and commits, even before the entry is committed. The leader keeps replicating until C_old,new is committed under the joint rules. After this commit, the cluster is firmly in joint mode and no leader from C_old-only servers can be elected.

Phase 2: leader appends C_new to the log. Same propagation: every server seeing the entry switches to C_new-only rules. Leader replicates until C_new is committed under the C_new rules (since by this point everyone who matters has moved to C_new). After this commit, servers that are in C_old but not in C_new (the "removed" servers) can shut down — they are no longer part of the cluster.

[Figure: the two-phase sequence C_old → C_old,new → C_new. Phase 0: cluster {A, B, C}, majority 2 of 3, old rules. Phase 1, entered by appending the C_old,new log entry: joint rules, majority of C_old AND majority of C_new, transitional until commit. Phase 2, entered by appending the C_new log entry: cluster {A, B, C, D, E}, majority 3 of 5, new rules only. No moment exists where two disjoint majorities can form.]
The full two-phase sequence. The cluster begins in C_old (3 nodes, majority 2). The leader appends a C_old,new entry and the cluster enters joint mode where every decision needs a majority within both C_old and C_new. Once C_old,new commits under that doubled rule, the leader appends a second entry C_new, after which the cluster operates under C_new alone (5 nodes, majority 3). At no point in this sequence can two server groups, each believing it has a majority, be disjoint — because the joint phase forces every decision through C_old's overlap.

Two non-obvious details about this protocol matter for correctness:

Configuration entries take effect on append, not on commit. When A writes the C_old,new entry to its own log, it immediately starts evaluating elections, vote counts, and commit counts under the joint rules — before C_old,new reaches a majority. Same for any follower that receives the entry: the moment it lands in their log, they switch behaviour. This is unusual; every other log entry in Raft is advisory until committed.

Why on append: imagine the leader appends C_old,new and replicates it to D and E but is partitioned before the joint config commits. If the leader were still using C_old-only rules, it could in principle commit further entries through {A, B, C} — but B and C may not have the new servers in their config and could elect a competing leader. By switching to joint rules immediately, the leader becomes unable to commit anything until both old and new majorities cooperate. This conservative bias is the safety property: as soon as a server might be in joint mode, it acts as if it definitely is. Erring on the side of more restriction during ambiguity prevents the disjoint-majority hazard.

A leader that committed C_old,new is not necessarily in C_new. If C_new removes the current leader (a "demote yourself" scenario — say C_new = {B, C, D, E}, dropping A), the leader still has to drive Phase 2 to its commit. The Raft paper's prescription: the leader continues until C_new is committed, then steps down, even though it is no longer a member of the cluster. Some implementations sidestep this by transferring leadership to a C_new member before appending the C_new entry.

Election rules during joint consensus

While C_old,new is the active config (between append and commit of the next entry), the election restriction tightens: a candidate needs a majority of votes from both C_old and C_new. The voter rules are unchanged (one vote per term, up-to-date log) — only the counting changes.

For the 3→5 example with C_old = {A, B, C} and C_new = {A, B, C, D, E}:

  • Votes from {A, B, D}: 2 of C_old ✓ and 3 of C_new ✓ — passes.
  • Votes from {A, D, E}: only 1 of C_old ✗ (old majority fails), despite 3 of C_new — fails overall.
  • Votes from {A, C, E}: 2 of C_old ✓ and 3 of C_new ✓ — passes.

The point is that every winning candidate must be approved by a majority of the original cluster, which is what carries the safety of the old configuration through the transition.

The simpler alternative: single-server changes

Joint consensus is the protocol Section 6 of the Raft paper specifies. It is correct, it is general (you can swap any subset for any other), and it is more complicated than most production code wants to deal with. Diego Ongaro's PhD thesis (Chapter 4, 2014) introduces a simpler approach that is sufficient for the common case: change membership one server at a time.

The insight is straightforward. If C_new differs from C_old by exactly one server (added or removed), then every majority of C_old and every majority of C_new must overlap in at least one server. Proof sketch for addition: C_old ⊂ C_new with |C_old| = N and |C_new| = N + 1. A majority of C_old has ⌊N/2⌋ + 1 members, a majority of C_new has ⌊(N+1)/2⌋ + 1, and both are subsets of C_new. Their sizes sum to N + 2 > N + 1 = |C_new|, so by the pigeonhole principle the two sets share at least one server; removal is symmetric. So the disjoint-majority hazard cannot occur — no joint phase is needed, and you can use the standard single-step config change without joint consensus.
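The claim can also be checked by brute force for small clusters. A sketch, with an illustrative `majorities` helper:

```python
from itertools import combinations

def majorities(cluster):
    """All minimal majority subsets of a cluster."""
    need = len(cluster) // 2 + 1
    return [set(m) for m in combinations(sorted(cluster), need)]

# For every cluster size up to 7, adding exactly one server leaves every
# old-majority overlapping every new-majority (removal checks the same
# pairs in the other direction, so it is covered too).
for n in range(1, 8):
    old = set(range(n))
    new = old | {n}
    assert all(a & b for a in majorities(old) for b in majorities(new))
```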

This is what etcd, Hashicorp Raft, CockroachDB, MongoDB, Consul, and most other production Raft implementations actually do. To go from 3 to 5 nodes, you do two single-server changes back to back: 3→4, then 4→5. To replace a dead node, you do 5→4 (remove dead) then 4→5 (add fresh). It is slower than a one-shot joint-consensus swap, but the implementation is dramatically simpler — you never need a "joint" code path.

etcd extends single-server changes further with the learner role: a new server is added first as a non-voting learner (it receives log entries via AppendEntries but is not counted in any majority). Once the learner has caught up to within some replication threshold of the leader, a separate PromoteLearner operation upgrades it to a voter — a normal single-server change. The benefit: a fresh server starting from scratch can take seconds to minutes to catch up the log, and during that catchup period adding it as a voter would inflate the majority and increase write latency unnecessarily. Learners get the catchup out of the way before they vote. CockroachDB's "non-voter" replicas are the same idea, with extra knobs around geo-locality. The etcd learner documentation and CockroachDB's voter/non-voter design docs describe the production semantics.
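A sketch of how a learner role changes the quorum math: learners are replicated to but never counted. ConfigWithLearners and promote are hypothetical names modelled loosely on etcd's learner semantics, not its actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfigWithLearners:
    voters: frozenset
    learners: frozenset = frozenset()   # replicated to, never counted

    @property
    def replication_targets(self) -> frozenset:
        # Learners receive AppendEntries like voters do...
        return self.voters | self.learners

    def has_majority(self, supporters: set) -> bool:
        # ...but only voters count toward election and commit quorums.
        return len(supporters & self.voters) > len(self.voters) // 2

    def promote(self, server: str) -> "ConfigWithLearners":
        """Promote a caught-up learner: itself a single-server change."""
        assert server in self.learners
        return ConfigWithLearners(voters=self.voters | {server},
                                  learners=self.learners - {server})

cfg = ConfigWithLearners(voters=frozenset("ABC"), learners=frozenset("D"))
assert not cfg.has_majority({"A", "D"})   # D's support does not count yet
assert cfg.promote("D").voters == frozenset("ABCD")
```

Until promotion, the majority stays 2 of 3 even though four servers hold the log, which is exactly the write-latency benefit described above.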

For the rest of this chapter, we will write code for joint consensus because it is the more general and more interesting case. Translating it down to single-server changes is straightforward (drop the joint rules, only switch on append-of-C_new).

Real Python: handle_append_entries and count_votes

Here is the data layout extending the structures from the previous chapters. Configuration is the new type, and current_config is the active rule a server is using.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Configuration:
    """A Raft cluster configuration, possibly joint."""
    old_voters: frozenset[str]
    new_voters: Optional[frozenset[str]] = None  # None = single config

    @property
    def is_joint(self) -> bool:
        return self.new_voters is not None

    @property
    def all_voters(self) -> frozenset[str]:
        # Union for "who do I send AppendEntries to / collect votes from"
        if self.is_joint:
            return self.old_voters | self.new_voters
        return self.old_voters

    def has_majority(self, supporters: set[str]) -> bool:
        """Does the supporter set satisfy this configuration's quorum rule?"""
        old_ok = len(supporters & self.old_voters) > len(self.old_voters) // 2
        if not self.is_joint:
            return old_ok
        new_ok = len(supporters & self.new_voters) > len(self.new_voters) // 2
        return old_ok and new_ok  # JOINT: must satisfy both


@dataclass
class ConfigChangeEntry:
    """A log entry whose payload is a configuration."""
    new_config: Configuration


@dataclass
class LogEntry:
    term: int
    op: tuple
    config: Optional[Configuration] = None  # set for config-change entries

The append-entries handler now has to update the active configuration on append, before any commit decision. This is the load-bearing change.

class Server:
    def __init__(self, server_id, initial_config):
        self.server_id = server_id
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: list[LogEntry] = []
        self.commit_index = 0
        self.current_config: Configuration = initial_config
        # Per-follower state from chapter 101
        self.next_index = {}
        self.match_index = {}

    def handle_append_entries(self, term, leader_id, prev_log_index,
                              prev_log_term, entries, leader_commit):
        # Term and consistency checks (chapter 101) — unchanged
        if term < self.current_term:
            return {'term': self.current_term, 'success': False}
        self.current_term = term
        if prev_log_index >= len(self.log):
            return {'term': self.current_term, 'success': False}
        if prev_log_index >= 0 and self.log[prev_log_index].term != prev_log_term:
            return {'term': self.current_term, 'success': False}

        # Append entries, truncating divergent suffix.
        for i, entry in enumerate(entries):
            idx = prev_log_index + 1 + i
            if idx < len(self.log) and self.log[idx].term != entry.term:
                # Truncating: any config change in the discarded suffix must
                # be rolled back too — the append-on-receive design means an
                # uncommitted config can be undone if the entry is overwritten.
                self._rollback_config_through(idx)
                self.log = self.log[:idx]
            if idx >= len(self.log):
                self.log.append(entry)
                # CRITICAL: if this is a config entry, switch immediately,
                # before commit. This is the rule that makes the joint
                # phase safe even on uncommitted entries.
                if entry.config is not None:
                    self.current_config = entry.config

        if leader_commit > self.commit_index:
            last_new = prev_log_index + len(entries)
            self.commit_index = min(leader_commit, last_new)

        return {'term': self.current_term, 'success': True}

    def _rollback_config_through(self, truncate_from_idx: int):
        """If a truncated suffix contained a config change, walk back the
        log to find the last surviving config entry and revert to it."""
        for entry in reversed(self.log[:truncate_from_idx]):
            if entry.config is not None:
                self.current_config = entry.config
                return
        # No config in surviving log → the initial bootstrap config holds.
        self.current_config = self._bootstrap_config()

The vote-counting code on the candidate side switches between single-set and joint majority via Configuration.has_majority:

import asyncio

class Candidate(Server):
    async def count_votes(self, term: int) -> bool:
        """Run an election under the candidate's *current* configuration,
        which may be joint. Returns True if elected."""
        votes_received: set[str] = {self.server_id}  # self-vote
        peers = self.current_config.all_voters - {self.server_id}

        last_idx = len(self.log) - 1
        last_term = self.log[last_idx].term if self.log else 0

        async def request_one(peer: str):
            try:
                resp = await self.rpc.request_vote(
                    peer, term=term, candidate_id=self.server_id,
                    last_log_index=last_idx, last_log_term=last_term,
                )
                return peer, resp
            except (asyncio.TimeoutError, ConnectionError):
                return peer, None

        pending = [asyncio.create_task(request_one(p)) for p in peers]
        for fut in asyncio.as_completed(pending):
            peer, resp = await fut
            if self.state != 'Candidate' or self.current_term != term:
                return False  # superseded
            if resp is None:
                continue
            if resp.term > self.current_term:
                self.step_down(resp.term)
                return False
            if resp.vote_granted:
                votes_received.add(peer)
                # The single line that changes everything:
                if self.current_config.has_majority(votes_received):
                    return True  # won under whichever rule applies
        return False

The leader's commit-advance logic changes the same way — _maybe_advance_commit from chapter 101 now consults current_config.has_majority instead of comparing match_index count to a fixed threshold:

def _maybe_advance_commit(self):
    """Advance commit_index to the largest N replicated on a majority
    (under whichever config is currently active) AND from current term."""
    candidates = sorted(set(self.match_index.values()) | {len(self.log) - 1},
                        reverse=True)
    for N in candidates:
        if N <= self.commit_index:
            break
        # Set of servers that have entry N (or beyond) replicated.
        supporters = {s for s, m in self.match_index.items() if m >= N}
        supporters.add(self.server_id)  # leader has it by construction
        if self.current_config.has_majority(supporters):
            # Plus the current-term commit caveat from chapter 101
            if self.log[N].term == self.current_term:
                prev_commit = self.commit_index
                self.commit_index = N
                self._on_commit_advance(prev_commit, N)
                return

_on_commit_advance is where the two-phase protocol is driven forward. When C_old,new commits, the leader appends C_new. When C_new commits, the leader (if it is no longer a member) steps down.

def _on_commit_advance(self, prev_commit_index: int, new_commit_index: int):
    # Scan every index that just became committed. A batched advance can
    # commit several entries at once, so the whole newly-committed range
    # must be checked for config entries, not just the final index.
    for idx in range(prev_commit_index + 1, new_commit_index + 1):
        cfg = self.log[idx].config
        if cfg is None:
            continue
        if cfg.is_joint:
            # Phase 1 just committed. Now append the C_new entry.
            target = Configuration(old_voters=cfg.new_voters, new_voters=None)
            self.append_config_entry(target)
        else:
            # Phase 2 just committed. If we are not in C_new, step down.
            if self.server_id not in cfg.old_voters:
                self.step_down(self.current_term)

def append_config_entry(self, new_config: Configuration):
    """Leader-side: write a config entry to the log. Takes effect locally
    immediately on append (the same rule followers will obey on receive)."""
    entry = LogEntry(term=self.current_term, op=('CONFIG',),
                     config=new_config)
    self.log.append(entry)
    self.current_config = new_config
    # Initialize per-follower state for any new voters
    for peer in new_config.all_voters:
        if peer not in self.next_index:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = 0

That is the entire delta from chapter 101's leader code: a config-aware Configuration type, switching on append in handle_append_entries, joint-aware quorum checks in count_votes and _maybe_advance_commit, and the two-phase driver in _on_commit_advance. About 80 lines of code on top of the existing replicator.

Worked example: 3 nodes to 5 with joint consensus

Expanding `{A, B, C}` to `{A, B, C, D, E}` step by step

You have a 3-node cluster with leader A in term 4. The current log is committed up to index 12, and the active configuration is:

C_old = Configuration(old_voters=frozenset({'A', 'B', 'C'}))
# Single config, majority = 2 of 3

Match indices: {A: 12, B: 12, C: 12}. Servers D and E are running and reachable but not yet members.

Step 1: Leader appends C_old,new (log[13]). An admin add_servers([D, E]) API call hits the leader. A constructs:

C_joint = Configuration(
    old_voters=frozenset({'A', 'B', 'C'}),
    new_voters=frozenset({'A', 'B', 'C', 'D', 'E'}),
)

A appends LogEntry(term=4, op=('CONFIG',), config=C_joint) at index 13. Immediately, A.current_config = C_joint. A initializes next_index and match_index for D and E (both start at 13, both with match 0). From this point, A's commit-advance logic and election logic both use C_joint.has_majority.

Step 2: Replicate log[13] to all four peers. A sends AppendEntries to B, C, D, E. The fresh peers D and E are far behind — their logs are empty — so the consistency check fails, next_index decrements, and the leader streams them the entire history (indices 1-13). Meanwhile B and C already have indices 1-12 and just append index 13.

After replication settles: match_index = {A: 13, B: 13, C: 13, D: 13, E: 13}. A runs _maybe_advance_commit with current_config = C_joint:

  • Candidate N=13. Supporters = {A, B, C, D, E}.
  • Old majority of {A,B,C}: supporters ∩ old = {A,B,C}, size 3 > 1. ✓
  • New majority of {A,B,C,D,E}: supporters ∩ new = {A,B,C,D,E}, size 5 > 2. ✓
  • Both pass → joint majority satisfied. Term check: log[13].term = 4 = current_term. ✓
  • commit_index = 13.

In practice the joint config commits well before all five replicas have it — as soon as a majority of C_old (2 of {A,B,C}) and a majority of C_new (3 of {A,B,C,D,E}) hold it. So if A, B, D have it and C, E are still catching up: old majority is {A,B}, size 2 ✓; new majority needs ≥3 from {A,B,C,D,E} — supporters intersect new = {A,B,D}, size 3 ✓. Joint commit satisfied.
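The Step 2 quorum arithmetic can be checked with plain set operations; joint_committed here is a stand-in for Configuration.has_majority on the joint config:

```python
old, new = {"A", "B", "C"}, {"A", "B", "C", "D", "E"}

def joint_committed(supporters):
    """Stand-in for Configuration.has_majority on C_old,new."""
    return (len(supporters & old) > len(old) // 2
            and len(supporters & new) > len(new) // 2)

assert joint_committed({"A", "B", "C", "D", "E"})  # all five replicated
assert joint_committed({"A", "B", "D"})            # the partial case above
assert not joint_committed({"A", "D", "E"})        # old majority fails
```

The last line is the guard that matters: the leader plus the two new servers alone can never commit the joint entry, because no C_old majority has acknowledged it.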

Step 3: _on_commit_advance triggers Phase 2. Index 13 just committed and log[13].config.is_joint == True. The leader appends:

C_new = Configuration(old_voters=frozenset({'A', 'B', 'C', 'D', 'E'}))

LogEntry(term=4, op=('CONFIG',), config=C_new) at index 14. Immediately A.current_config = C_new. The cluster is now operating under C_new rules locally on A, and everyone else flips when index 14 reaches them.

Step 4: Replicate log[14] until commit. This is fast — all five servers are healthy and already have everything through index 13. The leader collects match_index >= 14 from {A, B, C, D, E}. Under C_new (single config, majority 3 of 5), five supporters easily clear the threshold. commit_index = 14.

Step 5: stable. The cluster is now in C_new permanently. Any future add_server or remove_server op repeats the same dance from the new starting point.

During the change, what could go wrong and why is it OK?

Case (a): Leader A crashes after appending log[13] (joint) but before committing it. On A's local log, joint is active. Followers B, C may or may not have it — say B has it but C does not. New election:

  • If B becomes candidate (it has the joint entry, so it switched to joint rules on append): it needs a majority of both C_old and C_new. With A down, ≥2 of {A, B, C} means B plus C's vote, and ≥3 of {A, B, C, D, E} means two further votes, necessarily from D and E. So B needs all of {B, C, D, E} to win under joint — 4 of 5, much harder than the old 2 of 3.
  • If C becomes candidate (it does not have the joint entry, so it still uses C_old-only rules): it needs 2 of {A, B, C}, so it must win B's vote. B applies the up-to-date-log restriction (chapter 100): candidate C's log lacks index 13, the joint entry B holds, so C's log is behind and B votes no. C cannot win.

The only candidate who can win is one whose log already has index 13 — i.e., a candidate who is itself in joint mode. That candidate then continues the protocol. Safety holds.

Case (b): A replicates log[13] (the joint entry) to D and E, but a partition isolates A, D, E from B, C before the entry commits. Now A, D, E have the joint entry; B and C do not. A is leader.

  • A cannot commit anything new on its side under joint rules: it can get no acks from C_old \ {A} = {B, C} (they are partitioned away), so the old majority fails. A can only commit when at least 2 of {A, B, C} ack — impossible while B and C are unreachable.
  • On the other side, B times out and starts an election under C_old-only rules (it has no joint entry). C grants its vote: both B and C are at index 12, so the up-to-date-log check passes. B becomes leader of C_old alone in the new term and now believes it can commit with majority 2.

This is the worry case. But notice: any entries that B commits on its side will not have made it past A's joint-rule barrier on the other side — the partition prevents propagation. When the partition heals, A (still leader of its side, but unable to make progress) sees B's higher term and steps down. B is now leader. B does not have the joint entry yet, so its config is still C_old. Its log on the index-13 slot is whatever client write it accepted — say "x = 9".

What about A,D,E's view? A had appended joint at index 13; it never committed (only D,E are reachable, and {A,D,E} does not contain a C_old majority). On healing and stepping down, A now receives AppendEntries from B with prev_log_index=12, prev_log_term=... and entries [(13, T_new, "x=9")]. A's check: log[12].term matches → success. The conflict: A's log[13] is the joint config (term 4), B's log[13] is "x=9" (term T_new > 4). The truncate-on-conflict rule (chapter 101) overwrites A's joint entry. The _rollback_config_through helper detects the truncated config and reverts A.current_config back to C_old. D, E get the same overwrite (cascading from A once A becomes a follower of B).
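The truncate-and-rollback step can be sketched as follows. This is a hypothetical shape, assuming a log of `(term, payload)` pairs where config entries are tagged dicts; the helper name mirrors the `_rollback_config_through` mentioned in the text but the surrounding structure is illustrative:

```python
def is_config_entry(entry) -> bool:
    term, payload = entry
    return isinstance(payload, dict) and payload.get("type") == "config"

def truncate_from(log: list, index: int, state: dict) -> None:
    """Delete log[index:] (0-based) and, if any config entry was truncated,
    revert current_config to the newest config that survives in the log."""
    truncated = log[index:]
    del log[index:]
    if any(is_config_entry(e) for e in truncated):
        # Scan backwards for the most recent surviving config entry;
        # fall back to the cluster's initial configuration.
        state["current_config"] = next(
            (e[1]["config"] for e in reversed(log) if is_config_entry(e)),
            state["base_config"],
        )
```

In case (b), A's overwrite at index 13 goes through exactly this path: the joint entry is truncated, no newer config survives, and `current_config` falls back to C_old.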

The cluster is back to C_old. The membership change is abandoned — admin re-runs it later. No safety violation: nothing was committed under joint rules, the joint entry never made it past the propagation barrier, and the rollback machinery undoes the local config switch when the entry is truncated.

This is the value of "config takes effect on append": the same machinery that replicates and overwrites entries also replicates and overwrites configs. There is no separate config-state to keep coherent — it lives in the log.

Going deeper

Why "on append" is not the same as "on receive"

A subtle distinction: configuration entries take effect when a server appends them to its log, not merely when an RPC carrying them arrives. If an AppendEntries RPC carries entries that fail the consistency check (prevLogIndex/prevLogTerm mismatch), the entries are rejected and not appended — and the server does not switch its config either. This matters because a follower with a stale prevLogIndex might receive a Config entry but reject it, and you do not want the follower to half-switch.
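A minimal sketch of that ordering, with illustrative names and a 1-indexed `prev_log_index` convention (the request fields are assumptions, not any specific implementation's wire format):

```python
def handle_append_entries(state: dict, log: list, req: dict) -> dict:
    """Switch config only AFTER the consistency check passes and the
    entries are actually appended — never on mere receipt of the RPC."""
    # Consistency check: our log must match at prev_log_index.
    if req["prev_log_index"] > 0:
        if (len(log) < req["prev_log_index"]
                or log[req["prev_log_index"] - 1][0] != req["prev_log_term"]):
            # Entries rejected: NOT appended, config NOT switched.
            return {"success": False}
    # Append (conflict truncation is folded into the slice assignment).
    log[req["prev_log_index"]:] = req["entries"]
    # Only now do appended config entries take effect.
    for term, payload in req["entries"]:
        if isinstance(payload, dict) and payload.get("type") == "config":
            state["current_config"] = payload["config"]
    return {"success": True}
```

The point is the order of operations: the `return {"success": False}` path exits before the config loop is ever reached, so a rejected Config entry leaves `current_config` untouched.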

The implementation invariant: current_config is a function of log alone, specifically current_config = the most recent config entry in log. Whenever log is mutated (append or truncate), current_config is recomputed. The _rollback_config_through helper above handles truncation; the inline assignment in handle_append_entries handles append. With both in place, current_config is exactly the most recent config in the log at all times.

Ongaro's PhD thesis Section 4.3 formalises this and proves that the log-determined config is sufficient for safety: every config decision the server has made tracks exactly the configs that survive in its log.

Diego Ongaro's single-server-change post

Outside the thesis, Ongaro wrote a shorter argument on the raft-dev mailing list defending the single-server simplification. The key passage:

"Removing the requirement to support arbitrary config changes substantially simplifies the implementation. In retrospect I should have specified the simpler version in the original paper."

This is the lineage of the etcd, Hashicorp, and CockroachDB designs. They all cite this argument as the reason they did not implement joint consensus.

Configuration changes during snapshots

Log compaction (chapter 104) discards old log entries and replaces them with a snapshot. The snapshot must include the active configuration at the snapshot point, so a server installing the snapshot sets current_config from it before any log replay. This is straightforward when the active config is single (just write C_old or C_new); slightly trickier when the snapshot point falls inside a joint phase, where the snapshot must record (C_old, C_new) together.
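One way to carry this, sketched here with hypothetical names loosely modelled on etcd's ConfState (the field layout is an assumption, not etcd's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SnapshotMeta:
    last_included_index: int
    last_included_term: int
    voters: frozenset           # the sole config, or C_new inside a joint phase
    voters_outgoing: frozenset  # C_old during a joint phase, else empty

def install_snapshot(state: dict, meta: SnapshotMeta) -> None:
    """Set current_config from the snapshot BEFORE any log replay, restoring
    the joint pair when the snapshot point fell inside a joint phase."""
    if meta.voters_outgoing:
        state["current_config"] = ("joint", meta.voters_outgoing, meta.voters)
    else:
        state["current_config"] = ("single", meta.voters)
```

A non-empty `voters_outgoing` is the marker that the snapshot was taken mid-transition, so the installing server comes up directly in joint mode.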

CockroachDB's implementation handles this by storing the configuration in the same key-value space as the snapshot data, updated atomically on every config change, so the snapshot picks it up for free. etcd stores it in a dedicated ConfState field of the snapshot metadata. The Raft paper itself underspecifies this; production implementations all add it, with the etcd ConfState design being the most-copied reference.

Why production avoids joint consensus

In practice, almost no production system uses full joint consensus. The concrete reasons:

  1. Code-path duplication. Every piece of code that asks "does this set of supporters constitute a majority?" — election counting, commit-advance, leadership transfer, read-index — has to handle the joint case. With single-server changes, none of them do.
  2. Testing complexity. Each code path needs partition tests. Joint consensus adds a third regime (joint) on top of C_old and C_new, multiplying the test matrix.
  3. Operational rarity. Most config changes are "replace a dead node" or "add a node to scale up" — inherently single-server operations. The case where you genuinely want to swap the whole set in one operation (e.g., zero-overlap relocation between datacentres) is rare enough to be done as a sequence of single-server changes spaced out by minutes.

The trade-off: single-server changes take longer for big swaps (six steps for {A,B,C} → {D,E,F} versus one for joint), but each step is fast (a single round of replication), and the simpler code is worth the latency. CockroachDB's documentation for its voter/non-voter replicas describes the full operational model.
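A whole-set swap decomposed into single-server steps looks like this. The planner below is an illustrative helper, not part of any real system; it orders adds before removes so the cluster never shrinks below its original size mid-transition:

```python
def single_server_plan(c_old: set, c_new: set) -> list:
    """Express an arbitrary membership change as a sequence of
    one-server-at-a-time operations: all adds, then all removes."""
    adds = [("add", s) for s in sorted(c_new - c_old)]
    removes = [("remove", s) for s in sorted(c_old - c_new)]
    return adds + removes

plan = single_server_plan({"A", "B", "C"}, {"D", "E", "F"})
# Six steps: add D, add E, add F, then remove A, remove B, remove C.
assert len(plan) == 6
```

In practice each step would also wait for the new server to catch up (the learner phase) before granting it a vote, and operators space the steps out rather than firing them back-to-back.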

Reconfiguration disruption and PreVote interaction

When a server is removed from C_new, it does not immediately know — its current_config updates only when the C_new entry reaches its log. Until then, it might miss heartbeats (the leader is no longer sending to it under the new rules) and start election timeouts. If it eventually times out and tries to start an election in a higher term, it can disrupt the live cluster.

The Raft paper recommends the leader explicitly tell removed servers "you are out" via a teardown message. The PreVote optimisation (chapter 100) mitigates the disruption: a removed server's PreVote will fail (the active leader is healthy from the cluster's perspective) so its term never advances past the live one. CockroachDB and etcd both rely on PreVote rather than explicit teardown for this reason.
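The PreVote-side check can be sketched as follows — a minimal illustration assuming the follower records when it last heard from a live leader (the field name and timeout value are assumptions):

```python
import time

def handle_prevote(state: dict, election_timeout: float = 1.0) -> dict:
    """Reject a PreVote while we have heard from a live leader within the
    last election timeout. A removed server's campaign is refused without
    any term advancing, so it cannot disrupt the healthy cluster."""
    if time.monotonic() - state["last_leader_contact"] < election_timeout:
        return {"vote_granted": False}
    # Otherwise fall through to the normal up-to-date-log check (elided).
    return {"vote_granted": True}
```

Because a PreVote rejection never bumps anyone's term, the removed server can retry forever without forcing the live leader to step down.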

Where this leads next

Membership changes round out the Raft protocol's "operational" surface — you can now grow, shrink, and replace nodes in a running cluster without downtime. The next chapter (log compaction and snapshots) handles the storage problem: real logs grow indefinitely, and at some point a fresh server cannot reasonably catch up by replaying every entry from the dawn of time. Snapshots solve this; they also interact with membership changes (the snapshot must carry the config state) and with nextIndex (a follower so far behind that the leader's log no longer contains its nextIndex falls back to InstallSnapshot RPC).

Chapter 104 then puts the whole assembled protocol — election, replication, membership, snapshots — through partition and crash testing, demonstrating that the safety properties hold under aggressive failure injection. Build 14 takes the resulting consensus-log primitive and uses it as the foundation for cross-shard atomic commit, returning to the cross-shard transactions problem flagged at the start of Build 13.

The single sentence to carry forward: during a membership change, the cluster passes through a strictly stronger rule (joint consensus) than either the old or the new alone — and that strict-stronger phase is what prevents two disjoint majorities from forming. The simpler one-server-at-a-time variant gets the same safety because, with |ΔC| = 1, the "strictly stronger" phase is unnecessary — every old majority and every new majority already overlap on their own.
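That final overlap claim is small enough to verify exhaustively. A brute-force check, for both the add-one and remove-one cases:

```python
from itertools import combinations

def majorities(members: set) -> list:
    """All subsets of `members` that constitute a strict majority."""
    need = len(members) // 2 + 1
    return [set(c) for k in range(need, len(members) + 1)
            for c in combinations(sorted(members), k)]

c_old = {"A", "B", "C"}
for c_new in [c_old | {"D"}, c_old - {"C"}]:   # add one / remove one
    for m_old in majorities(c_old):
        for m_new in majorities(c_new):
            # With |C_new ^ C_old| = 1, every pair of majorities intersects,
            # so no two simultaneous decisions can disagree.
            assert m_old & m_new
```

This is exactly why the single-server variant needs no joint phase: the overlap that C_old,new enforces by construction already holds outright.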

References

  1. Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014, Section 6 — the canonical specification of joint consensus, including the proof that C_old,new prevents disjoint majorities.
  2. Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD Dissertation 2014, Chapter 4 — the deep treatment, including the single-server-change simplification, configuration-rollback handling, and the formal "config = function of log" invariant.
  3. Ongaro, Bug in single-server membership changes, raft-dev mailing list, 2015 — the well-known correction to single-server membership changes (an edge case where two simultaneous changes can still produce inconsistency without joint consensus); also where Ongaro states he should have specified the simpler version in the conference paper.
  4. etcd-io, Learner Design — the etcd team's design document for non-voting "learner" replicas, the production refinement of single-server changes that decouples catchup from voting.
  5. etcd-io, Configuration Changes — the implementation-level documentation of how etcd Raft handles config changes via single-server changes plus learners, with the ConfState snapshot integration.
  6. Cockroach Labs, Configure Replication Zones (Voter/Non-Voter) — CockroachDB's production model for membership management at scale, with non-voter replicas serving the same role as etcd learners and the operational knobs for geo-distributed clusters.