Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Multi-Raft: sharding consensus (CockroachDB style)
A 9-node cluster sounds like a lot of hardware. Run a single Raft group on it and the whole cluster commits writes at the speed of one leader's fsync, while the other eight nodes mostly forward heartbeats. Multi-Raft is the trick that lets every node lead some of the writes and follow the rest, so the cluster's write rate scales with the number of machines instead of the speed of one.
A single Raft group has one leader, one log, and one fsync stream — so it caps near 30k writes/sec regardless of cluster size. CockroachDB, TiKV, and YugabyteDB shard the keyspace into thousands of small ranges, run a separate Raft group per range, distribute leaderships across nodes, and coalesce per-node heartbeats so the wire cost stays at O(peers), not O(ranges × peers).
Why one Raft group cannot scale
Picture the leader's hot path during a single Put(key, value):
- Append the entry to the in-memory log (one mutex on one structure).
fsyncthe entry to the leader's WAL on disk (one stream).- Send
AppendEntriesto every follower (one TCP loop per peer, one process). - Followers
fsync, ack. - Leader counts acks, advances
commitIndexonce a majority replies. - Apply to the state machine, reply to the client.
Every step is serial for that group. Raft's safety proof requires the log to have a single writer, and that writer is the leader process. Why "just shard the work inside the leader" is not enough: production Raft pipelines AppendEntries (sends the next entry without waiting for the previous ack), batches writes (one fsync per millisecond window), and parallelises replication to followers. With every trick on, etcd's published numbers land near 40,000 writes/sec at p99 = 10 ms on good hardware. Beyond that, the bottleneck moves from disk to the single mutex protecting the log, and one CPU core pegs. Adding followers does not help — they only consume more replication bandwidth.
A second wall is state size. A single group's state machine has to fit on every replica, snapshots have to stream through one network connection, and log compaction has to fit in one process's memory. A 50 TB PaisaBridge ledger in one Raft group means every replica is a 50 TB box and every snapshot is a 50 TB stream. That is not an architecture; it is a hostage situation.
Multi-Raft: thousands of independent groups
The fix is straightforward to state: split the keyspace into many small contiguous ranges, and run an independent Raft group for each one. CockroachDB calls these ranges (default size 512 MB, originally 64 MB); TiKV calls them regions (96 MB default); YugabyteDB calls them tablets (1 GB default). A 1 TB database breaks into roughly 2,000 ranges. Each range is its own Raft group: its own leader, its own log, its own commit index, its own three followers.
Once the database is sharded this way, three things change at once:
- Writes parallelise. A write to range A goes only through range A's Raft group. Range B's leader, sitting on a different node, has no idea the write happened. A second write to B can commit at the same instant. Aggregate cluster throughput now scales with the number of ranges, not the speed of one machine.
- Leadership distributes automatically. On a 9-node cluster with 3,000 ranges and replication factor 3, every node holds 1,000 replicas (9,000 ÷ 9), of which it leads roughly 333 (one in three). Read load and write load both spread evenly without any extra layer.
- State stays bounded. One snapshot is one range — at most a few hundred MB, not 50 TB. Compaction, replica rebuilds, and rebalancing all work on bite-sized units.
Why ranges (sorted) and not hash buckets: contiguous ranges keep WHERE id BETWEEN 1000 AND 2000 cheap — the scan touches a small number of consecutive ranges, not every shard. The cost is hot-spotting on monotonically increasing keys (timestamps, autoincrement IDs), which CockroachDB sidesteps by recommending UUIDs as primary keys, and TiKV by exposing pre-split hints to the application.
A worked write
Riya, on a BhojanBox delivery-status page in Pune, triggers an UPDATE orders SET status = 'delivered' WHERE id = 'ord_842' against a CockroachDB-style cluster. The key ord_842 lives in range C, which has its leader on node 3 in Hyderabad.
- Riya's request hits the load balancer and reaches some gateway node — say node 1.
- Node 1's range cache says: range C → leader node 3. Node 1 forwards the request to node 3.
- Node 3's Multi-Raft layer routes the request to range C's Raft group, where node 3 is the leader. (If leadership had moved, node 3 replies
NotLeaderError(leader=node1), and the gateway updates its cache and retries.) - Node 3 appends the entry to C's log. C's log is one of thousands on node 3, but it is C's own. Node 3 sends
AppendEntries(entry_for_C)to nodes 1 and 2 (C's followers). Bothfsync, ack. Node 3 advances C's commit index, applies the entry, replies success. - Ranges A, B, and D are entirely unaware. None of their logs grew. None of their state machines changed. Concurrently, a write to A on node 1 and a write to B on node 2 commit through their own Raft groups without touching the others.
The cluster is now executing three independent Raft commits in parallel. With 3,000 ranges, the aggregate write rate is roughly 3,000× the single-range ceiling — modulo disk and network sharing, which we will get to.
Heartbeat coalescing: the network bug Multi-Raft creates
Naïve Multi-Raft hits a wall about ten minutes into a real deployment. Each Raft group sends heartbeats every 50 ms. Three replicas, leader → two followers, twenty heartbeats a second. With 3,000 ranges per node, that is 3,000 × 2 × 20 = 120,000 RPCs/sec leaving every node, just to say "I am still here." On a 9-node cluster, the network carries 1.08 million heartbeat RPCs/sec, each with TCP framing, gRPC headers, range ID, term, leader commit. The wire is fully booked carrying the metaphorical equivalent of "yes hi still here."
The fix is heartbeat coalescing. Instead of each Raft group sending its own heartbeats, the node maintains one heartbeat channel per peer node. Every 50 ms it sends one batched message:
"From node 3 to node 1: I am still leading ranges {C, F, K, ...} with terms {14, 22, 9, ...} and commit indexes {8123, 4192, 776, ...}. Please ack."
The peer walks the list and updates each Raft group's "I heard from leader" timer in one shot. The wire RPC count drops from O(ranges × peers) to O(peers) — about 40 RPCs/sec from a node, regardless of how many ranges live on it.
Why this is correct: heartbeats only need to convey "leader X is alive in term T for group G" and the leader's commit hint. Bundling many such tuples into one envelope does not break safety — each group's state machine still receives its own update. It is a transport optimisation only. The one subtlety: if a coalesced message is dropped, every group it covered misses its tick at once, and on the next election timeout you can get a stampede. Production systems jitter election timeouts (each group rolls its own random timeout in [150ms, 300ms]) and add pre-vote mode (a candidate first asks "would you vote for me?" before advancing its term) — both standard in etcd's Raft library.
A short worked sketch of the coalesced transport in Python:
# Per-node coalesced heartbeat sender. Runs every 50 ms.
import collections, time
class CoalescedTransport:
def __init__(self, node_id, peers, raft_groups):
self.node_id = node_id
self.peers = peers # peer node IDs
self.raft_groups = raft_groups # dict: group_id -> RaftReplica
def tick(self):
# Bucket every group I lead by the peer it needs to heartbeat.
per_peer = collections.defaultdict(list)
for gid, g in self.raft_groups.items():
if not g.is_leader():
continue
for follower_node in g.follower_nodes():
per_peer[follower_node].append(
(gid, g.term, g.commit_index, g.last_log_index)
)
# Send ONE message per peer node, carrying all ranges in one envelope.
for peer, group_status_list in per_peer.items():
self.send(peer, {
"type": "coalesced_heartbeat",
"from": self.node_id,
"ts": time.time(),
"groups": group_status_list, # may have thousands of entries
})
def on_recv(self, msg):
if msg["type"] == "coalesced_heartbeat":
# Walk the list, deliver to each Raft group locally.
for gid, term, commit_idx, last_log_idx in msg["groups"]:
g = self.raft_groups.get(gid)
if g is not None:
g.on_heartbeat(msg["from"], term, commit_idx, last_log_idx)
# Output (sample): one node leading 3000 groups across 2 peers
# tick(): sent 2 coalesced messages, total payload ≈ 90 KB
# per-message: 3000 group tuples × 30 bytes = 90 KB
# compare to naive: 6000 separate RPCs × ~250 bytes each = 1.5 MB + framing
Routing clients: the catalogue
A client wants to write to key users/42. Which node leads the range that contains it?
The answer lives in a catalogue — a sorted index from key range to (range ID, replica list, leader hint). CockroachDB's catalogue is the meta range (with a meta-meta range one level above it for catalogues that themselves grow large). TiKV uses a separate Placement Driver (PD) server cluster. YugabyteDB uses a master cluster. Crucially, the catalogue is itself stored in a Raft group, with the same linearizable guarantees as everything else.
The catalogue is small. A 100 TB database with 64 MB ranges has ~1.5 million ranges; each descriptor is a few hundred bytes; the whole catalogue is hundreds of MB — small enough that one Raft group can hold it. CockroachDB further splits the meta range itself once it grows past size, with the meta-meta range pointing into it. The recursion terminates at the meta-meta range, replicated to a fixed set of nodes.
A client routes a request like this:
- Look up the key in the local range cache. Cache hit: send to the leader hint. (This is >99% of requests in steady state.)
- Cache miss or stale ("no longer leader, here is the new one"): query the meta range, get the descriptor, retry. CockroachDB piggybacks the new leader hint on the rejection, so the retry is one-shot.
- Stale on the meta range too: fall back to the meta-meta range, then to a fixed bootstrap address baked into the binary.
When a leader moves due to rebalancing or failover, clients hit the stale path a few times until caches refresh. The catalogue's load stays low because the cache absorbs almost everything.
Common confusions
-
"Multi-Raft is one big Raft group with many shards." No. It is many independent Raft groups, each with its own leader, log, term counter, and commit index. Range A's leader does not know what range B is doing; they share only the host machine and the storage engine's WAL.
-
"Splitting a range adds work to the original Raft group." Splitting is a one-time Raft commit on the parent range that produces two child groups; from the moment the split entry is applied, the children are independent Raft groups with their own logs and (initially) the same replica set. The parent group ceases to exist.
-
"Coalesced heartbeats break Raft's safety properties." They do not. Coalescing is a transport optimisation: the same per-group heartbeat content is delivered, just inside a shared envelope. Each Raft group's state machine still sees its own heartbeat. The only second-order risk is correlated election timeouts when an envelope is dropped — production systems handle that with jittered timeouts and pre-vote.
-
"Reads always go to the leader, so reads do not scale." Reads do go to the leader by default (for linearizability), but with thousands of ranges and leadership distributed evenly, every node serves reads for the ranges it leads. Read load distributes the same way write load does. CockroachDB also offers follower reads with a small staleness bound for read-only workloads that can tolerate it.
-
"Multi-Raft replaces two-phase commit." It does not. Multi-Raft solves single-key throughput. A transaction that touches keys in different ranges still needs an atomic-commit protocol on top — CockroachDB uses a parallel-commit variant of 2PC, where a transaction record lives in one designated range and intent records on every key the transaction wrote live in their respective ranges.
-
"Hash partitioning would be simpler than range partitioning." Hash partitioning eliminates hot-spotting on monotonic keys but makes range scans (
SELECT ... WHERE id BETWEEN 1000 AND 2000) expensive — every shard has to be queried. CockroachDB and TiKV chose range partitioning because relational workloads lean heavily on range scans; YugabyteDB exposes both as a per-table choice.
Going deeper
Disk contention across thousands of groups
All 3,000 ranges on one node share the same NVMe device. A naïve implementation issues 3,000 independent fsync calls per second — a hard wall on most disks. Production systems batch these: CockroachDB's storage layer (Pebble) accepts log writes from every Raft group into one shared WAL, fsyncs the WAL once, then notifies every group that its entry is durable. The consensus logic stays independent per group, but the disk write is shared. This single piece of plumbing is one of the largest gaps between a textbook Multi-Raft and a production one.
Cross-range transactions
A transaction that touches keys in different ranges needs an atomic-commit protocol layered on top. CockroachDB's parallel-commit 2PC stages the commit by writing intents to every range first, then flipping a transaction record (in one designated range) to "committed" with a single Raft commit. Intent resolution into committed values happens asynchronously. The cost is roughly N+1 Raft commits across N involved ranges, but each commit happens in its own group with its own leader, so cluster-wide commit rate stays high.
Range placement and locality
A 9-node cluster spread across three datacentres typically wants every range's three replicas to span all three DCs — so a single DC outage cannot lose quorum. CockroachDB's allocator uses locality labels (region=mumbai, zone=mumbai-1a) and refuses to place all three replicas in the same zone unless explicitly forced. The leader is then preferred to be the replica closest to the bulk of the writers (lease preferences), so a PaisaBridge deployment spanning Mumbai, Bengaluru, and Hyderabad would have replicas pinned one per city, with leadership for each customer's range placed wherever that customer mostly writes.
Why Raft and not Paxos for sharded consensus
Spanner uses Multi-Paxos. CockroachDB, TiKV, YugabyteDB, and Polara YDB all chose Multi-Raft. The shift was about operational debuggability. Raft's strict leader-only-log-write rule and explicit term-and-index ordering make every step trivially traceable; Multi-Paxos with parallel proposers and gap-filling is famously hard to debug under failure. When you have 3,000 groups per node and a hung commit at 3 a.m., "look at the leader for group 1842, check its log, check what the followers acked" beats reconstructing the highest-numbered chosen value across three proposers by a wide margin.
Splits, merges, and load-based decisions
Range splits are themselves Raft operations: the parent range's leader proposes a Split(splitKey) entry; once committed, every replica performs the split locally, producing two new in-memory Raft groups. Splits trigger when a range exceeds its target size or when load (CPU, QPS) on the range exceeds a threshold even if it is small. CockroachDB's load-based splitter monitors per-range request distribution and chooses split keys that bisect the load, not just the byte size. Merges are the reverse, applied when two adjacent ranges are both small and cold for some duration.
Where this leads next
- etcd and ZooKeeper as services — chapter 107, the opposite design choice: keep one Raft group, accept the throughput ceiling, sell consensus as a service.
- Consensus is a log, not a database — chapter 108, the abstraction that lets every Multi-Raft system share a single mental model with single-group systems.
- Paxos, Multi-Paxos, ZAB — what Raft simplified — chapter 105, the genealogy of consensus protocols and why CockroachDB chose Raft over the Paxos lineage that Spanner stayed with.
References
- Living Without Atomic Clocks: Where CockroachDB and Spanner Diverge — Cockroach Labs blog explaining Multi-Raft + HLC versus Spanner's TrueTime + Multi-Paxos.
- etcd-io/raft library — the canonical production Raft library, source of the multi-group fork used by CockroachDB.
- Multi-Raft on TiKV (PingCAP engineering blog) — TiKV's heartbeat coalescing, region scheduler, and Placement Driver design.
- CockroachDB raft library — the in-tree fork with batched proposals and quotapool.
- Spanner: Querion's Globally-Distributed Database (OSDI 2012) — origin of the "many small consensus groups, one per tablet" pattern.
- In Search of an Understandable Consensus Algorithm (Raft, USENIX ATC 2014) — Ongaro and Ousterhout, the paper Multi-Raft is built on.
- Internal: Raft log replication and the commit index — the per-group mechanics that Multi-Raft replicates thousands of times in parallel.