In short
Adding or removing servers from a running Raft cluster is the kind of operation that looks trivial in a wiring diagram and is genuinely subtle in code. The hazard is that the cluster's majority depends on the cluster's size, and during a transition different servers can disagree on what the size is — which means two disjoint groups can each believe they hold a majority and each commit conflicting entries. That is split-brain, except it is caused by configuration drift, not by a network partition.
The naive solution — every server atomically switches from C_old to C_new — fails because the switch is not actually atomic. Configuration changes propagate through the log like every other entry, and during the propagation window some servers still think the cluster is C_old (3 nodes, majority 2) while others have already moved to C_new (5 nodes, majority 3). A leader operating under C_old and a separate leader operating under C_new can both gather their respective majorities from disjoint sets, commit different entries at the same log index, and now your replicated log has forked.
Raft's fix is joint consensus: a transitional configuration C_old,new that requires both a majority of the old set and a majority of the new set for any decision — election, log entry commit, anything. Because the majorities of C_old and C_new cannot both be satisfied without overlap (any server in the intersection has to vote yes for both), no two disjoint groups can each pass the joint test. The transition becomes a two-phase log protocol: leader appends C_old,new entry (active immediately on append, even before commit), once committed it appends C_new entry, once C_new commits the cluster has shed C_old and old-only servers can shut down.
Most production implementations skip joint consensus entirely and use single-server changes — add or remove one server at a time. The proof Diego Ongaro gives in his thesis: with |C_new △ C_old| = 1 (the configurations differ by exactly one server), every majority of C_new overlaps every majority of C_old, so two simultaneous decisions cannot disagree. etcd, Hashicorp Raft, and CockroachDB all use this simpler form (with etcd extending it via "learners" — non-voting catch-up replicas). You should know joint consensus to read the paper; you will mostly write single-server change code in production.
The implementation note that catches everyone out: configuration entries take effect as soon as they are appended to the log, not when committed. A leader that appends C_old,new and then loses a partition before committing it has nonetheless triggered the joint rules locally — an unusual exception to "log entries are advisory until committed" — and the design is correct precisely because it errs on the side of more restriction during ambiguity.
You have a 3-node Raft cluster — {A, B, C} — running production traffic for a small fintech in Bengaluru. Your VP of engineering wants to bump it to 5 nodes so a single rack outage cannot take you down. The plan is innocent: install Raft on machines D and E, point them at the existing cluster, and let them join.
The thing that should worry you is the question: during the join, while D and E are getting up to speed, what does "majority" mean? If the cluster is 3 nodes, majority is 2. If it is 5, majority is 3. A live system has to operate continuously through the change — clients are still writing — and some servers will see the change before others. The window where servers disagree on the cluster size is the window where two different leaders can each gather two-vote and three-vote majorities from disjoint sets and commit incompatible entries at the same index.
That is not a theoretical hazard. It is the reason joint consensus exists.
The previous chapters built the leader-election layer (terms, votes, the up-to-date restriction) and the log-replication layer (AppendEntries, the Log Matching property, the commit index). Both assumed a fixed cluster — a known, immutable list of peers. This chapter relaxes that assumption and shows the protocol Raft uses to safely change the peer list while the cluster keeps serving requests.
Why the naive approach fails
The seductive idea is that every server reads a config file at startup, and to change membership you (a) write the new config to every server's file, (b) restart them in some careful order. This works fine if you can take downtime — stop the cluster, change everyone, restart everyone — but you almost never can. So you want an online change, where the cluster transitions while still serving.
The first online idea is to ship the new configuration through the log itself: leader appends a Config(C_new) entry, replicates to a majority, commits, and from that point every server uses C_new. This is the single-step or atomic switch approach. It looks correct because Raft already provides ordering and replication guarantees for log entries; why would config entries be different?
They are different because the moment a server sees the C_new entry — even before commit — it starts behaving as if C_new is the rule. We will defend this design choice later; right now accept it as a constraint and see what goes wrong without joint consensus.
Concretely, take C_old = {A, B, C} (majority 2) expanding to C_new = {A, B, C, D, E} (majority 3). Suppose leader A (in C_old) appends a Config(C_new) entry at log index 7, replicates to D and E (which have it but not yet B or C), and then A is partitioned away from B and C but can still reach D and E.
The minority majority on each side is the disaster. {A, D, E} is a 3-node majority of C_new (size 5), and {B, C} is a 2-node majority of C_old (size 3). These two sets are disjoint — they share no servers — and yet each is, by its respective rule, a majority. Two leaders, two committed entries at the same index, no overlap to detect the conflict.
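The disjoint-majority arithmetic is small enough to check mechanically. A minimal sketch, with `is_majority` as an illustrative helper (not part of the chapter's code):

```python
def is_majority(supporters, voters):
    """True if supporters include a strict majority of voters."""
    return len(set(supporters) & set(voters)) > len(voters) // 2

C_old = {'A', 'B', 'C'}            # majority threshold: 2
C_new = {'A', 'B', 'C', 'D', 'E'}  # majority threshold: 3

side1 = {'A', 'D', 'E'}  # partitioned leader A plus the new servers
side2 = {'B', 'C'}       # the pair still operating under C_old

assert is_majority(side1, C_new)  # 3 of 5: a legal C_new quorum
assert is_majority(side2, C_old)  # 2 of 3: a legal C_old quorum
assert side1.isdisjoint(side2)    # and yet they share no server
```

Two quorums with no common server means no server is forced to witness both decisions — the overlap argument underpinning Raft's safety proof is gone.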
Why this is impossible to fix after the fact: Raft's safety proof rests on the fact that any two majorities overlap in at least one server, so any committed entry is held by at least one server in any future leader's quorum, and the up-to-date-log rule then forces the new leader to inherit that entry. When the two majorities come from different cluster sizes, the overlap argument breaks. There is no server that both sides have to consult, so neither side knows the other has committed anything. After partition heal, both sides hold log entries they each believe are committed, and any reconciliation would require overwriting committed state — which the safety theorem explicitly forbids. The damage is unrecoverable; you cannot patch this after the fact, only prevent it.
Joint consensus: requiring both majorities
The fix is to introduce an intermediate configuration that is more restrictive than both C_old and C_new alone: a configuration C_old,new that requires a majority of both sets for any decision. Now during the transition, no group of servers can pass the rule unless it includes a majority from C_old AND a majority from C_new — and any two such groups must overlap, because each contains a majority of C_old (and majorities of C_old overlap with each other), and each contains a majority of C_new (and majorities of C_new overlap with each other).
Two valid quorums under C_old,new always overlap, because each contains a majority of C_old and the intersection of any two majorities of the same set is non-empty. So while C_old,new is in effect, only one leader can be elected and only one entry can commit at any index — exactly the property C_old alone gave you, and C_new alone will give you, but reasserted across the transition.
Why this works as a bridge: think of C_old,new as a strictly stronger rule than both C_old and C_new. Any group that satisfies C_old,new automatically satisfies C_old (it contains a majority of C_old) and automatically satisfies C_new (it contains a majority of C_new). So during the transition the cluster behaves "as if" both rules are in force simultaneously. Once C_old,new itself is committed and the cluster transitions to C_new alone, you have inherited the safety properties of C_old (carried through the joint phase) and gained the safety properties of C_new going forward. There is no single instant where the rule is laxer than either endpoint.
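You can brute-force the overlap claim for the 3→5 example. A throwaway sketch (`joint_quorum` is a hypothetical stand-in for the C_old,new rule):

```python
from itertools import combinations

def joint_quorum(group, old, new):
    """The C_old,new rule: a majority of BOTH voter sets."""
    g = set(group)
    return (len(g & old) > len(old) // 2) and (len(g & new) > len(new) // 2)

old = frozenset('ABC')
new = frozenset('ABCDE')
servers = sorted(old | new)

quorums = [set(c) for r in range(1, len(servers) + 1)
           for c in combinations(servers, r)
           if joint_quorum(c, old, new)]

# Every pair of joint quorums intersects: no two disjoint groups can both
# pass the joint test, so no split-brain during the transition.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
```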
The two-phase protocol
The transition runs as a two-step log dance:
Phase 1: leader appends C_old,new to the log. Every server that receives the entry — through normal AppendEntries replication — immediately starts using the joint rules for elections and commits, even before the entry is committed. The leader keeps replicating until C_old,new is committed under the joint rules. After this commit, the cluster is firmly in joint mode and no leader from C_old-only servers can be elected.
Phase 2: leader appends C_new to the log. Same propagation: every server seeing the entry switches to C_new-only rules. Leader replicates until C_new is committed under the C_new rules (since by this point everyone who matters has moved to C_new). After this commit, servers that are in C_old but not in C_new (the "removed" servers) can shut down — they are no longer part of the cluster.
Two non-obvious details about this protocol matter for correctness:
Configuration entries take effect on append, not on commit. When A writes the C_old,new entry to its own log, it immediately starts evaluating new election timeouts, vote-counts, and commit-counts under the joint rules — before C_old,new reaches majority. Same for any follower that receives the entry: the moment it lands in their log they switch behaviour. This is unusual; every other log entry in Raft is advisory until committed.
Why on append: imagine the leader appends C_old,new and replicates it to D and E but is partitioned before the joint config commits. If the leader were still using C_old-only rules, it could in principle commit further entries through {A, B, C} — but B and C may not have the new servers in their config and could elect a competing leader. By switching to joint rules immediately, the leader becomes unable to commit anything until both old and new majorities cooperate. This conservative bias is the safety property: as soon as a server might be in joint mode, it acts as if it definitely is. Erring on the side of more restriction during ambiguity prevents the disjoint-majority hazard.
A leader that committed C_old,new is not necessarily in C_new. If C_new removes the current leader (a "demote yourself" scenario — say C_new = {B, C, D, E}, dropping A), the leader still has to drive Phase 2 to its commit. The Raft paper's prescription: the leader continues until C_new is committed, then steps down, even though it is no longer a member of the cluster. Some implementations sidestep this by transferring leadership to a C_new member before appending the C_new entry.
Election rules during joint consensus
While C_old,new is the active config (between append and commit of the next entry), the election restriction tightens: a candidate needs a majority of votes from both C_old and C_new. The voter rules are unchanged (one vote per term, up-to-date log) — only the counting changes.
For the 3→5 example with C_old = {A, B, C} and C_new = {A, B, C, D, E}:
- Old-set majority: ≥2 votes from {A, B, C}.
- New-set majority: ≥3 votes from {A, B, C, D, E}.
A candidate with votes from {A, B, D} passes both: 2 of C_old and 3 of C_new. A candidate with {D, E, A} has 1 of C_old (fails old majority) and 3 of C_new — overall fails. A candidate with {A, C, E} has 2 of C_old ✓ and 3 of C_new ✓ — passes. The point is that every winning candidate must be approved by a majority of the original cluster, which is what carries the safety of the old configuration through the transition.
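Those three vote-counts, as a hedged sketch (`joint_election_won` is an illustrative helper, not the Candidate code shown later in this chapter):

```python
def joint_election_won(votes, c_old, c_new):
    """A candidate wins under C_old,new only with a majority of BOTH sets."""
    votes = set(votes)
    return (len(votes & c_old) > len(c_old) // 2 and
            len(votes & c_new) > len(c_new) // 2)

c_old = {'A', 'B', 'C'}
c_new = {'A', 'B', 'C', 'D', 'E'}

assert joint_election_won({'A', 'B', 'D'}, c_old, c_new)      # 2 old, 3 new
assert not joint_election_won({'D', 'E', 'A'}, c_old, c_new)  # only 1 old
assert joint_election_won({'A', 'C', 'E'}, c_old, c_new)      # 2 old, 3 new
```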
The simpler alternative: single-server changes
Joint consensus is the protocol Section 6 of the Raft paper specifies. It is correct, it is general (you can swap any subset for any other), and it is more complicated than most production code wants to deal with. Diego Ongaro's PhD thesis (Section 4, 2014) introduces a simpler approach that is sufficient for the common case: change membership one server at a time.
The insight is straightforward. If C_new differs from C_old by exactly one server (added or removed), then every majority of C_old and every majority of C_new must overlap in at least one server. Proof sketch for the add case: |C_old| = N, |C_new| = N+1, and C_old ⊂ C_new, so |C_old ∪ C_new| = N+1. A majority of C_old has ⌊N/2⌋ + 1 servers and a majority of C_new has ⌊(N+1)/2⌋ + 1; since ⌊N/2⌋ + ⌊(N+1)/2⌋ = N, the two sizes sum to N + 2 > N + 1, so by the pigeonhole principle the two sets share at least one server. The removal case is symmetric. So the disjoint-majority hazard cannot occur — no joint phase is needed, and you can use the standard single-step config change without joint consensus.
This is what etcd, Hashicorp Raft, CockroachDB, MongoDB, Consul, and most other production Raft implementations actually do. To go from 3 to 5 nodes, you do two single-server changes back to back: 3→4, then 4→5. To replace a dead node, you do 5→4 (remove dead) then 4→5 (add fresh). It is slower than a one-shot joint-consensus swap, but the implementation is dramatically simpler — you never need a "joint" code path.
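The single-server overlap property can also be checked exhaustively for small clusters — a quick sketch with an illustrative `min_majorities` helper (checking minimal majorities suffices, since any larger majority contains one):

```python
from itertools import combinations

def min_majorities(voters):
    """All minimal majority subsets of a voter set (illustrative helper)."""
    voters = sorted(voters)
    need = len(voters) // 2 + 1
    return [set(c) for c in combinations(voters, need)]

# For every cluster size up to 7, adding exactly one server never lets a
# C_old majority and a C_new majority be disjoint — the single-server rule.
for n in range(1, 8):
    c_old = set(range(n))
    c_new = c_old | {n}          # C_new = C_old plus one server
    assert all(m1 & m2
               for m1 in min_majorities(c_old)
               for m2 in min_majorities(c_new))
```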
etcd extends single-server changes further with the learner role: a new server is added first as a non-voting learner (it receives log entries via AppendEntries but is not counted in any majority). Once the learner has caught up to within some replication threshold of the leader, a separate PromoteLearner operation upgrades it to a voter — a normal single-server change. The benefit: a fresh server starting from scratch can take seconds to minutes to catch up the log, and during that catchup period adding it as a voter would inflate the majority and increase write latency unnecessarily. Learners get the catchup out of the way before they vote. CockroachDB's "non-voter" replicas are the same idea, with extra knobs around geo-locality. The etcd learner documentation and the CockroachDB voter/non-voter design document the production semantics.
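The essential learner semantics — entries replicated, acks never counted — can be sketched like this (`quorum_met` is an illustrative helper, not etcd's actual API):

```python
def quorum_met(acks, voters, learners):
    """Learners receive AppendEntries but never count toward the majority."""
    assert not (set(voters) & set(learners))  # a server is one or the other
    counted = set(acks) & set(voters)         # drop learner acks
    return len(counted) > len(voters) // 2

voters = {'A', 'B', 'C'}
learners = {'D'}  # catching up; promoted to voter in a later config change

assert quorum_met({'A', 'B', 'D'}, voters, learners)  # D's ack is ignored
assert not quorum_met({'A', 'D'}, voters, learners)   # one voter is not enough
```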
For the rest of this chapter, we will write code for joint consensus because it is the more general and more interesting case. Translating it down to single-server changes is straightforward (drop the joint rules, only switch on append-of-C_new).
Real Python: handle_append_entries and count_votes
Here is the data layout extending the structures from the previous chapters. Configuration is the new type, and current_config is the active rule a server is using.
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class Configuration:
"""A Raft cluster configuration, possibly joint."""
old_voters: frozenset[str]
new_voters: Optional[frozenset[str]] = None # None = single config
@property
def is_joint(self) -> bool:
return self.new_voters is not None
@property
def all_voters(self) -> frozenset[str]:
# Union for "who do I send AppendEntries to / collect votes from"
if self.is_joint:
return self.old_voters | self.new_voters
return self.old_voters
def has_majority(self, supporters: set[str]) -> bool:
"""Does the supporter set satisfy this configuration's quorum rule?"""
old_ok = len(supporters & self.old_voters) > len(self.old_voters) // 2
if not self.is_joint:
return old_ok
new_ok = len(supporters & self.new_voters) > len(self.new_voters) // 2
return old_ok and new_ok # JOINT: must satisfy both
@dataclass
class ConfigChangeEntry:
"""A log entry whose payload is a configuration."""
new_config: Configuration
@dataclass
class LogEntry:
term: int
op: tuple
config: Optional[Configuration] = None # set for config-change entries
The append-entries handler now has to update the active configuration on append, before any commit decision. This is the load-bearing change.
class Server:
def __init__(self, server_id, initial_config):
self.server_id = server_id
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: list[LogEntry] = []
self.commit_index = 0
self.current_config: Configuration = initial_config
# Per-follower state from chapter 101
self.next_index = {}
self.match_index = {}
def handle_append_entries(self, term, leader_id, prev_log_index,
prev_log_term, entries, leader_commit):
# Term and consistency checks (chapter 101) — unchanged
if term < self.current_term:
return {'term': self.current_term, 'success': False}
self.current_term = term
if prev_log_index >= len(self.log):
return {'term': self.current_term, 'success': False}
if prev_log_index >= 0 and self.log[prev_log_index].term != prev_log_term:
return {'term': self.current_term, 'success': False}
# Append entries, truncating divergent suffix.
for i, entry in enumerate(entries):
idx = prev_log_index + 1 + i
if idx < len(self.log) and self.log[idx].term != entry.term:
# Truncating: any config change in the discarded suffix must
# be rolled back too — the append-on-receive design means an
# uncommitted config can be undone if the entry is overwritten.
self._rollback_config_through(idx)
self.log = self.log[:idx]
if idx >= len(self.log):
self.log.append(entry)
# CRITICAL: if this is a config entry, switch immediately,
# before commit. This is the rule that makes the joint
# phase safe even on uncommitted entries.
if entry.config is not None:
self.current_config = entry.config
if leader_commit > self.commit_index:
last_new = prev_log_index + len(entries)
self.commit_index = min(leader_commit, last_new)
return {'term': self.current_term, 'success': True}
def _rollback_config_through(self, truncate_from_idx: int):
"""If a truncated suffix contained a config change, walk back the
log to find the last surviving config entry and revert to it."""
for entry in reversed(self.log[:truncate_from_idx]):
if entry.config is not None:
self.current_config = entry.config
return
# No config in surviving log → the initial bootstrap config holds.
self.current_config = self._bootstrap_config()
The vote-counting code on the candidate side switches between single-set and joint majority via Configuration.has_majority:
import asyncio

class Candidate(Server):
async def count_votes(self, term: int) -> bool:
"""Run an election under the candidate's *current* configuration,
which may be joint. Returns True if elected."""
votes_received: set[str] = {self.server_id} # self-vote
peers = self.current_config.all_voters - {self.server_id}
last_idx = len(self.log) - 1
last_term = self.log[last_idx].term if self.log else 0
async def request_one(peer: str):
try:
resp = await self.rpc.request_vote(
peer, term=term, candidate_id=self.server_id,
last_log_index=last_idx, last_log_term=last_term,
)
return peer, resp
except (asyncio.TimeoutError, ConnectionError):
return peer, None
pending = [asyncio.create_task(request_one(p)) for p in peers]
for fut in asyncio.as_completed(pending):
peer, resp = await fut
if self.state != 'Candidate' or self.current_term != term:
return False # superseded
if resp is None:
continue
if resp.term > self.current_term:
self.step_down(resp.term)
return False
if resp.vote_granted:
votes_received.add(peer)
# The single line that changes everything:
if self.current_config.has_majority(votes_received):
return True # won under whichever rule applies
return False
The leader's commit-advance logic changes the same way — _maybe_advance_commit from chapter 101 now consults current_config.has_majority instead of comparing match_index count to a fixed threshold:
def _maybe_advance_commit(self):
    """Advance commit_index to the largest N replicated on a majority
    (under whichever config is currently active) AND from current term."""
    candidates = sorted(set(self.match_index.values()) | {len(self.log) - 1},
                        reverse=True)
    for N in candidates:
        if N <= self.commit_index:
            break
        # Set of servers that have entry N (or beyond) replicated.
        supporters = {s for s, m in self.match_index.items() if m >= N}
        supporters.add(self.server_id)  # leader has it by construction
        if self.current_config.has_majority(supporters):
            # Plus the current-term commit caveat from chapter 101
            if self.log[N].term == self.current_term:
                prev_commit = self.commit_index
                self.commit_index = N
                self._on_commit_advance(prev_commit + 1, N)
            return
_on_commit_advance is where the two-phase protocol is driven forward. When C_old,new commits, the leader appends C_new. When C_new commits, the leader (if it is no longer a member) steps down.
def _on_commit_advance(self, first_new: int, last_new: int):
    # Scan every newly committed index for config entries — a batched
    # advance can commit several indices at once, and a config entry
    # may sit anywhere in that range.
    for idx in range(first_new, last_new + 1):
        cfg = self.log[idx].config
        if cfg is None:
            continue
        if cfg.is_joint:
            # Phase 1 just committed. Now append the C_new entry.
            target = Configuration(old_voters=cfg.new_voters, new_voters=None)
            self.append_config_entry(target)
        else:
            # Phase 2 just committed. If we are not in C_new, step down.
            if self.server_id not in cfg.old_voters:
                self.step_down(self.current_term)
def append_config_entry(self, new_config: Configuration):
"""Leader-side: write a config entry to the log. Takes effect locally
immediately on append (the same rule followers will obey on receive)."""
entry = LogEntry(term=self.current_term, op=('CONFIG',),
config=new_config)
self.log.append(entry)
self.current_config = new_config
# Initialize per-follower state for any new voters
for peer in new_config.all_voters:
if peer not in self.next_index:
self.next_index[peer] = len(self.log)
self.match_index[peer] = 0
That is the entire delta from chapter 101's leader code: a config-aware Configuration type, switching on append in handle_append_entries, joint-aware quorum checks in count_votes and _maybe_advance_commit, and the two-phase driver in _on_commit_advance. About 80 lines of code on top of the existing replicator.
Worked example: 3 nodes to 5 with joint consensus
Expanding `{A, B, C}` to `{A, B, C, D, E}` step by step
You have a 3-node cluster with leader A in term 4. The current log is committed up to index 12, and the active configuration is:
C_old = Configuration(old_voters=frozenset({'A', 'B', 'C'}))
# Single config, majority = 2 of 3
Match indices: {A: 12, B: 12, C: 12}. Servers D and E are running and reachable but not yet members.
Step 1: Leader appends C_old,new (log[13]). An admin add_servers([D, E]) API call hits the leader. A constructs:
C_joint = Configuration(
old_voters=frozenset({'A', 'B', 'C'}),
new_voters=frozenset({'A', 'B', 'C', 'D', 'E'}),
)
A appends LogEntry(term=4, op=('CONFIG',), config=C_joint) at index 13. Immediately, A.current_config = C_joint. A initializes next_index and match_index for D and E (both start at 13, both with match 0). From this point, A's commit-advance logic and election logic both use C_joint.has_majority.
Step 2: Replicate log[13] to all four peers. A sends AppendEntries to B, C, D, E. The fresh peers D and E are far behind — their logs are empty — so the consistency check fails, next_index decrements, and the leader streams them the entire history (indices 1-13). Meanwhile B and C already have indices 1-12 and just append index 13.
After replication settles: match_index = {A: 13, B: 13, C: 13, D: 13, E: 13}. A runs _maybe_advance_commit with current_config = C_joint:
- Candidate N=13. Supporters = {A, B, C, D, E}.
- Old majority of {A,B,C}: supporters ∩ old = {A,B,C}, size 3 > 1. ✓
- New majority of {A,B,C,D,E}: supporters ∩ new = {A,B,C,D,E}, size 5 > 2. ✓
- Both pass → joint majority satisfied. Term check: log[13].term = 4 = current_term. ✓
- commit_index = 13.
In practice the joint config commits well before all five replicas have it — as soon as a majority of C_old (2 of {A,B,C}) and a majority of C_new (3 of {A,B,C,D,E}) hold it. So if A, B, D have it and C, E are still catching up: old majority is {A,B}, size 2 ✓; new majority needs ≥3 from {A,B,C,D,E} — supporters intersect new = {A,B,D}, size 3 ✓. Joint commit satisfied.
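That early-commit arithmetic, spelled out (`joint_commit_ok` mirrors the has_majority logic as a free function, for illustration):

```python
def joint_commit_ok(supporters, c_old, c_new):
    """Commit under C_old,new: a majority of BOTH sets must hold the entry."""
    s = set(supporters)
    return (len(s & c_old) > len(c_old) // 2 and
            len(s & c_new) > len(c_new) // 2)

c_old = {'A', 'B', 'C'}
c_new = {'A', 'B', 'C', 'D', 'E'}

# {A, B, D} already commits: 2 of C_old and 3 of C_new.
assert joint_commit_ok({'A', 'B', 'D'}, c_old, c_new)
# {A, D, E} does not: 3 of C_new but only 1 of C_old.
assert not joint_commit_ok({'A', 'D', 'E'}, c_old, c_new)
```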
Step 3: _on_commit_advance triggers Phase 2. Index 13 just committed and log[13].config.is_joint == True. The leader appends:
C_new = Configuration(old_voters=frozenset({'A', 'B', 'C', 'D', 'E'}))
LogEntry(term=4, op=('CONFIG',), config=C_new) at index 14. Immediately A.current_config = C_new. The cluster is now operating under C_new rules locally on A, and everyone else flips when index 14 reaches them.
Step 4: Replicate log[14] until commit. This is fast — all five servers are healthy and already have everything through 13. The leader collects match_index >= 14 from {A, B, C, D, E}. Under C_new (single config, majority 3 of 5), five supporters easily clear the three-vote bar. commit_index = 14.
Step 5: stable. The cluster is now in C_new permanently. Any future add_server or remove_server op repeats the same dance from the new starting point.
During the change, what could go wrong and why is it OK?
Case (a): Leader A crashes after appending log[13] (joint) but before committing it. On A's local log, joint is active. Followers B, C may or may not have it — say B has it but C does not. New election:
- If B becomes candidate (it has the joint entry, so it switched to joint rules on append): it needs a majority of both C_old and C_new. With A down, the electorate is {B=self, C, D, E}. For ≥2 of {A, B, C}, B needs C's vote; for ≥3 of {A, B, C, D, E}, B needs two more votes from {C, D, E}. The smallest winning set is {B, C, D} or {B, C, E} — and C's vote is mandatory, so the bar is meaningfully higher than the old 2 of 3.
- If C becomes candidate (it does not have the joint entry, so it still uses C_old-only rules): it needs 2 of {A, B, C}, which with A down means B's vote. B checks: candidate C's log lacks index 13 (the joint entry that B has). The up-to-date-log restriction (chapter 100) rejects: C is behind. So B votes no, and C cannot win.
The only candidate who can win is one whose log already has index 13 — i.e., a candidate who is itself in joint mode. That candidate then continues the protocol. Safety holds.
Case (b): A and a few new nodes commit log[13], but a partition isolates A,D,E from B,C. Now A,D,E have the joint entry; B,C do not. A is leader.
- A cannot commit anything new on its side under joint rules: it has no votes from {B, C} (they are partitioned away), so the old-set majority fails. A can only commit when at least 2 of {A, B, C} ack — impossible while B and C are unreachable.
- On the other side, B times out and starts an election under C_old-only rules (it has no joint entry). Both B and C are at index 12 with identical logs, so the up-to-date check passes: B gets C's vote and becomes leader of C_old alone in the new term. It now thinks it can commit with majority 2.
This is the worry case. But notice: any entries that B commits on its side will not have made it past A's joint-rule barrier on the other side — the partition prevents propagation. When the partition heals, A (still leader of its side, but unable to make progress) sees B's higher term and steps down. B is now leader. B does not have the joint entry yet, so its config is still C_old. Its log on the index-13 slot is whatever client write it accepted — say "x = 9".
What about A,D,E's view? A had appended joint at index 13; it never committed (only D,E are reachable, and {A,D,E} does not contain a C_old majority). On healing and stepping down, A now receives AppendEntries from B with prev_log_index=12, prev_log_term=... and entries [(13, T_new, "x=9")]. A's check: log[12].term matches → success. The conflict: A's log[13] is the joint config (term 4), B's log[13] is "x=9" (term T_new > 4). The truncate-on-conflict rule (chapter 101) overwrites A's joint entry. The _rollback_config_through helper detects the truncated config and reverts A.current_config back to C_old. D, E get the same overwrite (cascading from A once A becomes a follower of B).
The cluster is back to C_old. The membership change is abandoned — admin re-runs it later. No safety violation: nothing was committed under joint rules, the joint entry never made it past the propagation barrier, and the rollback machinery undoes the local config switch when the entry is truncated.
This is the value of "config takes effect on append": the same machinery that replicates and overwrites entries also replicates and overwrites configs. There is no separate config-state to keep coherent — it lives in the log.
Common confusions
- "Why not just gate config changes on commit, like every other entry?" Because if the leader applied the new config only on commit, there would be a window between append and commit where the leader operates under the old rules while some followers have the new entry in their log. A partition during that window allows a follower in the new config to be elected with new-config rules, while the original leader still runs old-config rules — two leaders. Switching on append closes that window.
- "Joint consensus seems to require holding two configs in memory." It does — Configuration holds both old_voters and new_voters during the joint phase. This is fine; the configs are small (server IDs, maybe a few hundred bytes total). The one place to be careful is when serializing snapshots (chapter 104) — the snapshot must include the active configuration, joint or not, so a fresh server reading the snapshot can begin life with the correct rules.
- "What if a server in C_old that is not in C_new keeps voting after the change?" It cannot, because by the end of Phase 2 the leader has appended C_new and replicated it to a majority. Servers in C_old \ C_new (the removed ones) will eventually receive C_new, switch their current_config, see that they are not in all_voters, and stop participating. Some implementations have them shut themselves down explicitly; others let them sit idle.
- "Single-server changes are 'safer' than joint consensus." They are simpler, not safer. Both are formally proven correct. The reason single-server changes are more popular is that the code path is one if-else away from the no-config-change code, while joint consensus requires a parallel "joint" path everywhere quorum is computed. Engineering simplicity wins.
- "Can we use joint consensus to swap entire sets — say {A,B,C} → {D,E,F} with no overlap?" Yes — that is exactly the scenario joint consensus was designed for, and where single-server changes would require many sequential steps. With single-server: 3→4 (add D), 4→5 (add E), 5→6 (add F), 6→5 (remove A), 5→4 (remove B), 4→3 (remove C). Six steps, each requiring catchup. With joint consensus: one transition. In practice, full-set swaps are rare enough that the extra code complexity is hard to justify.
- "Learners in etcd are the same as joint consensus members." No. Learners are explicitly not counted in any majority, joint or otherwise. They receive AppendEntries (so the leader can fan out the log to them) but do not vote and are not part of any quorum. Once a learner has caught up, a separate operation (PromoteLearner) runs a real (single-server) configuration change to add it as a voter. The benefit is that catchup happens before the new server's vote-presence inflates the majority size.
- "The Phase 2 entry is C_new, but Phase 2 commits under C_new rules — chicken and egg?" No. By the time the leader appends C_new, the cluster is operating under C_old,new rules and C_old,new itself is committed. The leader replicates C_new to the union C_old ∪ C_new; each server, on receiving the entry, switches to C_new rules locally. Once a majority of C_new has the entry, C_new commits — the C_old-only members have served their purpose during the joint phase and no longer figure in the quorum rule.
Going deeper
Why "on append" is not the same as "on receive"
A subtle distinction: configuration entries take effect when a server appends them to its log, not merely when an RPC carrying them arrives. If an AppendEntries RPC carries entries that fail the consistency check (prevLogIndex/prevLogTerm mismatch), the entries are rejected and not appended — and the server does not switch its config either. This matters because a follower with a stale prevLogIndex might receive a Config entry but reject it, and you do not want the follower to half-switch.
The implementation invariant: current_config is a function of log alone, specifically current_config = the most recent config entry in log. Whenever log is mutated (append or truncate), current_config is recomputed. The _rollback_config_through helper above handles truncation; the inline assignment in handle_append_entries handles append. With both in place, current_config is exactly the most recent config in the log at all times.
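Both halves of the invariant can be sketched together: the consistency check gates the append (so a rejected Config entry never half-switches the follower), and the config is recomputed from the log after any mutation. Field and helper names here are illustrative, not the chapter's actual code.

```python
def latest_config(log: list, snapshot_config):
    """current_config = most recent config entry in the log, falling back
    to the config recorded in the snapshot when the log has none."""
    for entry in reversed(log):
        if entry.get("config") is not None:
            return entry["config"]
    return snapshot_config

def append_entries(state: dict, prev_index: int, prev_term: int,
                   entries: list) -> bool:
    """Reject-then-append: entries (config entries included) take effect
    only if the consistency check passes and they land in the log."""
    log = state["log"]
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1]["term"] != prev_term:
            return False  # check failed: nothing appended, config unchanged

    # Drop only a genuinely conflicting suffix, then append new entries.
    idx = prev_index
    for e in entries:
        if idx < len(log) and log[idx]["term"] != e["term"]:
            del log[idx:]
        if idx >= len(log):
            log.append(e)
        idx += 1

    # The invariant: config is recomputed from the log after mutation.
    state["config"] = latest_config(log, state["snapshot_config"])
    return True
```

A truncation path (the rollback case) would call `latest_config` the same way after deleting the suffix, which is what keeps the config a pure function of the log.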
Ongaro's PhD thesis Section 4.3 formalises this and proves that the log-determined config is sufficient for safety: every config decision the server has made tracks exactly the configs that survive in its log.
Diego Ongaro's single-server-change post
Outside the thesis, Ongaro wrote a shorter argument on the raft-dev mailing list defending the single-server simplification. The key passage:
"Removing the requirement to support arbitrary config changes substantially simplifies the implementation. In retrospect I should have specified the simpler version in the original paper."
This is the lineage of the etcd, Hashicorp, and CockroachDB designs. They all cite this argument as the reason they did not implement joint consensus.
Configuration changes during snapshots
Log compaction (chapter 104) discards old log entries and replaces them with a snapshot. The snapshot must include the active configuration at the snapshot point, so a server installing the snapshot sets current_config from it before any log replay. This is straightforward when the active config is single (just write C_old or C_new); slightly trickier when the snapshot point falls inside a joint phase, where the snapshot must record (C_old, C_new) together.
CockroachDB's implementation handles this by storing the configuration in the same key-value space as the snapshot data, updated atomically on every config change, so the snapshot picks it up for free. etcd stores it in a dedicated ConfState field of the snapshot metadata. The Raft paper itself underspecifies this; production implementations all add it, with the etcd ConfState design being the most-copied reference.
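A snapshot's metadata carrying the active config might look like the following. This is a hedged sketch loosely modelled on etcd's ConfState idea; the class and field names are assumptions, and the joint phase is represented by a non-empty second voter set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SnapshotMeta:
    last_index: int
    last_term: int
    # Active config at the snapshot point. During a joint phase both
    # sets are populated; otherwise new_voters stays empty.
    voters: frozenset
    new_voters: frozenset = frozenset()

def install_snapshot(state: dict, snap_data, meta: SnapshotMeta):
    """Set the config from snapshot metadata *before* any log replay,
    so the server never applies entries under a stale config."""
    state["config"] = (meta.voters, meta.new_voters)
    state["log"] = []                    # log now starts after last_index
    state["last_applied"] = meta.last_index
    state["state_machine"] = snap_data
```

The key ordering constraint is in `install_snapshot`: config first, then state, then (later) replay of any entries past `last_index`.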
Why production avoids joint consensus
In practice, almost no production system uses full joint consensus. The concrete reasons:
- Code-path duplication. Every piece of code that asks "does this set of supporters constitute a majority?" — election counting, commit-advance, leadership transfer, read-index — has to handle the joint case. With single-server changes, none of them do.
- Testing complexity. Each code path needs partition tests. Joint consensus adds a third regime (joint) on top of C_old and C_new, multiplying the test matrix.
- Operational rarity. Most config changes are "replace a dead node" or "add a node to scale up" — inherently single-server operations. The case where you genuinely want to swap the whole set in one operation (e.g., zero-overlap relocation between datacentres) is rare enough to be done as a sequence of single-server changes spaced out by minutes.
The trade-off: single-server changes take longer for big swaps (six steps for {A,B,C} → {D,E,F} versus one for joint), but each step is fast (a single round of replication), and the simpler code is worth the latency. CockroachDB's documentation for its voter/non-voter replicas describes the full operational model.
Reconfiguration disruption and PreVote interaction
When a server is removed from C_new, it does not immediately know — its current_config updates only when the C_new entry reaches its log. Until then, it might miss heartbeats (the leader is no longer sending to it under the new rules) and start election timeouts. If it eventually times out and tries to start an election in a higher term, it can disrupt the live cluster.
The Raft paper recommends the leader explicitly tell removed servers "you are out" via a teardown message. The PreVote optimisation (chapter 100) mitigates the disruption: a removed server's PreVote will fail (the active leader is healthy from the cluster's perspective) so its term never advances past the live one. CockroachDB and etcd both rely on PreVote rather than explicit teardown for this reason.
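The containment mechanism is the receiver-side PreVote check. A minimal sketch, with illustrative field names and the log up-to-date check omitted for brevity: a server that has heard from a live leader recently refuses the pre-vote, so a removed server's PreVote round dies without any term advancing.

```python
def handle_prevote(state: dict, candidate_term: int, now: float) -> bool:
    """Grant a PreVote only if we ourselves believe the leader is gone.
    Every member of a healthy cluster fails the 'leader gone' test, so a
    removed server can never gather a pre-vote majority and never forces
    the live cluster into a higher term."""
    heard_recently = (now - state["last_leader_contact"]
                      < state["election_timeout"])
    if heard_recently:
        return False  # leader looks alive; refuse the pre-vote
    return candidate_term >= state["current_term"]
```

A real implementation would additionally require the candidate's log to be at least as up-to-date as the voter's, exactly as in a normal RequestVote.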
Where this leads next
Membership changes round out the Raft protocol's "operational" surface — you can now grow, shrink, and replace nodes in a running cluster without downtime. The next chapter (log compaction and snapshots) handles the storage problem: real logs grow indefinitely, and at some point a fresh server cannot reasonably catch up by replaying every entry from the dawn of time. Snapshots solve this; they also interact with membership changes (the snapshot must carry the config state) and with nextIndex (a follower so far behind that the leader's log no longer contains its nextIndex falls back to InstallSnapshot RPC).
Chapter 104 then puts the whole assembled protocol — election, replication, membership, snapshots — through partition and crash testing, demonstrating that the safety properties hold under aggressive failure injection. Build 14 takes the resulting consensus-log primitive and uses it as the foundation for cross-shard atomic commit, returning to the cross-shard transactions problem flagged at the start of Build 13.
The single sentence to carry forward: during a membership change, the cluster passes through a strictly stronger rule (joint consensus) than either the old or the new alone — and that strict-stronger phase is what prevents two disjoint majorities from forming. The simpler one-server-at-a-time variant gets the same safety because, with |ΔC| = 1, the "strictly stronger" phase is unnecessary — every old majority and every new majority already overlap on their own.
References
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014, Section 6 — the canonical specification of joint consensus, including the proof that C_old,new prevents disjoint majorities.
- Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD Dissertation 2014, Chapter 4 — the deep treatment, including the single-server-change simplification, configuration-rollback handling, and the formal "config = function of log" invariant.
- Ongaro, Bug in single-server membership changes, raft-dev mailing list, 2015 — the well-known correction to single-server membership changes (an edge case where two simultaneous changes can still produce inconsistency without joint consensus); also where Ongaro states he should have specified the simpler version in the conference paper.
- etcd-io, Learner Design — the etcd team's design document for non-voting "learner" replicas, the production refinement of single-server changes that decouples catchup from voting.
- etcd-io, Configuration Changes — the implementation-level documentation of how etcd Raft handles config changes via single-server changes plus learners, with the ConfState snapshot integration.
- Cockroach Labs, Configure Replication Zones (Voter/Non-Voter) — CockroachDB's production model for membership management at scale, with non-voter replicas serving the same role as etcd learners and the operational knobs for geo-distributed clusters.