In short

A single Raft group has a hard throughput ceiling that no amount of hardware can paper over. One leader, one log, one fsync stream, one CPU core handling AppendEntries — even on a beefy machine with NVMe and 100 GbE you top out somewhere around 10k–50k writes per second. That is fine for a metadata store like etcd or ZooKeeper. It is laughably small for a database that wants to back a payments platform or a shopping cart for 100 million users. You cannot make Raft faster by adding more machines to the same group; adding followers only adds replication work.

The way out is Multi-Raft: shard the keyspace into many small contiguous ranges (CockroachDB calls them "ranges", typically 64–512 MB; TiKV calls them "regions"; YugabyteDB calls them "tablets"), and run an independent Raft group for each range. Range A has its own leader, its own log, its own commit index, its own followers. Range B has a completely separate Raft group, which may live on entirely different machines. Writes to A and writes to B do not contend on a leader, do not share a log, do not share a CPU core. The cluster's aggregate throughput now scales with the number of ranges, not with the speed of one machine.

Leadership is distributed. On a 9-node cluster with 3000 ranges and replication factor 3, each node holds roughly 1000 replicas (3 × 3000 ÷ 9) and leads about 333 of them, acting as follower for the other ~667. Read load and write load both spread out automatically.

The naïve implementation explodes on heartbeat traffic: 3000 ranges × 2 followers × 1 heartbeat every 50 ms is 120,000 RPCs/sec leaving every node, just to say "I am still alive". Every production Multi-Raft system coalesces heartbeats — instead of one heartbeat per (range, follower), each node sends one batched heartbeat per peer node containing a list of "ranges I am still leading", and a separate "I am still following you" reply. The wire cost drops from O(ranges) to O(nodes) and the network stops drowning.

Range splits and merges are themselves Raft operations — a range that grows past 512 MB or that gets too hot is split, which is implemented as a coordinated entry committed in the parent's Raft log producing two new Raft groups. A central catalogue (itself a Raft group; in CockroachDB it is the meta range) tracks the (key-range → leader-node) mapping, and clients consult it to route requests. CockroachDB, TiKV, YugabyteDB, and Yandex YDB all share this architecture; the differences are mostly in how they pick split points, how they place replicas, and how they rebalance.

You have read the previous chapters in this build — leader election and terms, log replication, membership changes, safety invariants. You can now build a working three-node Raft cluster that survives a leader crash, replicates a log, and gracefully accepts new members. You can run etcd. That is real progress.

It is also not enough to build a database. The Raft cluster you have built has exactly one leader, and that leader handles every write the cluster will ever serve. Run a benchmark: a modern machine fsyncing batched 4 KB writes to NVMe tops out around 30k–50k commits/second; with replication and the network round trip eating into that, a single Raft group in production is a ~10k–30k writes/sec animal. Compare that to MySQL on a single big box (~50k–100k tps), or a sharded NoSQL deployment (~millions of ops/sec across the cluster). The single Raft group is behind the unsharded single-box database, and that is before we ask it to also handle reads, snapshot transfers, and rebalancing.

This chapter is about the architectural trick — Multi-Raft — that every modern strongly-consistent distributed SQL system uses to break that ceiling. We focus on CockroachDB because its design is the most thoroughly documented, but the same pattern shows up in TiKV (which powers TiDB), YugabyteDB, and Yandex's YDB.

Why a single Raft group cannot scale

Look at the leader's hot path during a write:

  1. Client sends Put(k, v) to the leader.
  2. Leader appends one entry to its in-memory log.
  3. Leader fsyncs the entry to its persistent log on disk.
  4. Leader sends AppendEntries(entry) to every follower over the network.
  5. Followers fsync, ack.
  6. Leader counts acks, advances commitIndex once a majority replies.
  7. Leader applies the entry to the state machine and replies to the client.
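
The serial shape of this path is easier to see in code. A minimal Go sketch — all names hypothetical, with fsync and the follower RPCs stubbed out — showing that steps 2–7 all run under one mutex against one log:

```go
package main

import (
	"fmt"
	"sync"
)

// entry is one Raft log entry.
type entry struct {
	term  uint64
	index uint64
	data  []byte
}

// leader models the single-writer hot path: every step below runs
// under one mutex, against one log, for the whole group.
type leader struct {
	mu          sync.Mutex // step 2: ONE mutex guards the log
	log         []entry
	term        uint64
	commitIndex uint64
	quorum      int // majority size, e.g. 2 of 3
}

// propose walks steps 2-7 for a single write. Nothing here can run
// concurrently with another propose on the same group.
func (l *leader) propose(data []byte, ackFromFollowers int) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()

	e := entry{term: l.term, index: uint64(len(l.log)) + 1, data: data}
	l.log = append(l.log, e) // step 2: append in memory
	// step 3: fsync the entry — one serial stream to one disk (stubbed)
	// step 4: send AppendEntries(e) to every follower (stubbed)
	// steps 5-6: count acks; the leader counts as one vote
	if 1+ackFromFollowers >= l.quorum {
		l.commitIndex = e.index // step 6: advance commit index
		// step 7: apply to the state machine, reply to client (stubbed)
	}
	return l.commitIndex
}

func main() {
	l := &leader{term: 3, quorum: 2}
	fmt.Println(l.propose([]byte("put k v"), 1)) // one follower acked: commits
}
```

The mutex is the point: batching and pipelining move work around inside this critical path, but the path itself stays a single serial writer.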

Every step is serial for that group. Step 2 is one mutex on one log structure. Step 3 is one fsync stream on one disk. Step 4 is one TCP send loop per follower from one process. Step 6 is one counter being advanced. The leader is a single-threaded chokepoint by design — Raft's safety proof depends on the log having a single writer, and that writer is the leader process.

Why "just shard the work inside the leader" does not save you: you can pipeline AppendEntries (don't wait for ack before sending the next entry), batch writes (one fsync per 1 ms window), and parallelise replication to followers. Production Raft implementations do all of this, and it gets you to the 10k–50k tier. Beyond that the bottleneck moves from disk to CPU — encoding entries, computing CRCs, traversing the log data structure — and the single mutex protecting the log eventually pegs one core. A 2024 etcd benchmark on the etcd team's hardware: ~40k writes/sec at p99=10ms, dropping sharply past that. CockroachDB's per-range ceiling is in the same neighbourhood. You do not break it without parallel groups.

[Figure: Single Raft group — one leader = one bottleneck. Clients c1–c4 all funnel into the leader (single log, single fsync stream, single CPU core, ~10k–50k writes/sec); the two followers replicate the log but take no client writes.]
A single Raft group has every client write funnelling through one leader. The leader's log, fsync stream, and CPU core are serial by Raft's design. Adding followers does not raise the write ceiling — they only consume replication bandwidth. The hard limit is roughly 10k–50k writes/sec on modern hardware, regardless of how many machines you throw at the cluster.

The other constraint is state size. Even if writes were free, a single Raft group's state machine has to fit on every replica, snapshot transfers have to fit through the network, and log compaction has to fit in one process's memory. A 100 TB database in one Raft group means every replica is 100 TB and every snapshot is a 100 TB stream. That is not an architecture; that is a hostage situation.

So you shard. The question is how.

The Multi-Raft architecture

The Multi-Raft idea is straightforward to state and full of subtleties to implement: split the keyspace into many small contiguous ranges, and run an independent Raft group for each range.

A "range" in CockroachDB is a contiguous span of the sorted keyspace, sized at around 64–512 MB by default. (TiKV calls these "regions" and defaults to 96 MB; YugabyteDB and Yandex YDB both call them "tablets".) A 1 TB database is split into roughly 2,000–16,000 ranges. Each range is its own Raft group with three replicas (the typical default), three log streams, three commit indexes, and one leader.
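
Per node, the bookkeeping for many groups can be sketched as a sorted list of range descriptors plus one Raft group per range. This is an illustrative Go sketch — the field and type names are invented, not CockroachDB's actual schema:

```go
package main

import (
	"fmt"
	"sort"
)

// rangeDesc describes one contiguous span of the keyspace.
type rangeDesc struct {
	startKey, endKey string // span is [startKey, endKey)
	rangeID          uint64
}

// node holds many independent Raft groups, one per range replica.
type node struct {
	ranges []rangeDesc // sorted by startKey
	// groups maps rangeID to that range's own Raft state machine
	// (log, term, commit index); modelled here as just a name.
	groups map[uint64]string
}

// routeKey finds the range whose span contains key, by binary
// search over the sorted descriptors.
func (n *node) routeKey(key string) (uint64, bool) {
	i := sort.Search(len(n.ranges), func(i int) bool {
		return n.ranges[i].endKey > key
	})
	if i < len(n.ranges) && n.ranges[i].startKey <= key {
		return n.ranges[i].rangeID, true
	}
	return 0, false
}

func main() {
	n := &node{
		ranges: []rangeDesc{
			{"a", "f", 1}, {"f", "m", 2}, {"m", "s", 3}, {"s", "z", 4},
		},
		groups: map[uint64]string{1: "A", 2: "B", 3: "C", 4: "D"},
	}
	id, _ := n.routeKey("user_842") // "u" falls in [s..z)
	fmt.Println(n.groups[id])       // → D
}
```

A write to any key touches exactly one entry of `groups`; the other thousands of Raft state machines on the node never see it.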

[Figure: Multi-Raft — each 64 MB range is an independent Raft group. Nodes 1–3 each hold a replica of ranges A [a..f), B [f..m), C [m..s), and D [s..z); node 1 leads A and D, node 2 leads B, node 3 leads C.]
Multi-Raft on a 3-node cluster with 4 ranges (A, B, C, D). Each range has 3 replicas — one on each node. Leadership is distributed: node 1 leads A and D, node 2 leads B, node 3 leads C. Every node is simultaneously a leader for some ranges and a follower for others. A write to range A goes only through node 1's Raft group for A — ranges B, C, D and their leaders are entirely undisturbed. Aggregate cluster throughput scales with the number of ranges, not with one machine's fsync rate.

Look at what this buys you. A write to a key in range A goes through range A's Raft group only — its leader appends to A's log, replicates to A's followers, commits. Range B's leader, sitting on node 2, has no idea the write happened and does not care. Each range's log is independent. Each range's commit index is independent. Each range's snapshot is independent. The fsync streams from different ranges on the same disk are typically batched by the storage engine (CockroachDB uses Pebble; TiKV uses RocksDB), so disk throughput is shared, but the consensus work parallelises perfectly.

Why ranges, not hash buckets: contiguous key ranges support efficient range scans — a query like SELECT * FROM accounts WHERE id BETWEEN 1000 AND 2000 reads from a small number of consecutive ranges, not from every shard as in hash-partitioned systems. The cost is hot-spotting: if your application writes monotonically increasing keys (timestamps, autoincrement IDs), the highest-key range becomes a hot spot — which is why CockroachDB recommends UUID primary keys and TiKV has explicit pre-split hints. The split logic (next chapter) handles the rest by detecting load and splitting hot ranges.

Distributing leadership

On a 9-node cluster with 3,000 ranges and replication factor 3, there are 9,000 replicas total. Each node holds 1,000 replicas (9000 ÷ 9). Of those 1,000, the node leads roughly 333 (1000 ÷ 3, since one of each replica trio is the leader). That means every node is doing leader work for about 333 ranges and follower work for about 667 ranges, simultaneously, in 9,000 separate Raft state machines.

Read load distributes the same way: in CockroachDB and TiKV, reads typically go to the leader (for linearizability), so a node leading 333 ranges handles reads for those 333 keyspaces. If reads are uniform across the keyspace, every node handles 1/9 of the read traffic. If reads are skewed toward a hot range, only the node leading that range is hot — and the cluster's response is to split that range further until the heat subsides.

The rebalancer's job is to keep this distribution even. CockroachDB runs a continuous background process that monitors load (CPU, QPS, range count, disk usage) per node and moves replicas — and leaderships — to even things out. Moving a leader is cheap: it is a single Raft TransferLeader message. Moving a replica is more expensive: it is a Raft membership change (add the new replica as a learner, catch it up via snapshot, promote it, remove the old replica).

Heartbeat coalescing

Here is the bug a naïve Multi-Raft implementation hits about ten minutes after deployment.

Each Raft group sends heartbeats: the leader sends an empty AppendEntries to every follower every 50 ms or so. Three replicas means each leader sends to two followers; with 3,000 ranges per node × 2 followers per range × 20 heartbeats/sec = 120,000 RPCs/sec leaving every node, just to say "I'm alive". On a 9-node cluster, that is 1.08 million heartbeat RPCs/sec on the network. Each carries TCP framing, gRPC headers, range ID, term, leader commit index, log index — a few hundred bytes per heartbeat. The network is now fully booked carrying the metaphorical equivalent of "I'm still here, are you still there?".

[Figure: Heartbeat coalescing, O(ranges × peers) → O(peers). Left: naïve, one heartbeat per (range, peer) — a dense fan of ≈120,000 RPCs/sec from a node leading 3,000 ranges (3,000 × 2 peers × 20 Hz). Right: coalesced, one batched heartbeat per peer carrying the list of led range IDs — ≈40 RPCs/sec (2 peers × 20 Hz).]
Heartbeat coalescing. Naïvely each (range, follower) pair generates its own heartbeat — 3,000 ranges × 2 followers × 20 Hz = 120,000 RPCs/sec from one node. The coalesced version sends *one* batched heartbeat per peer node every 50 ms, carrying the list of range IDs the sender is currently leading. Wire RPC count drops to 40/sec from one node, regardless of how many ranges live there. CockroachDB and TiKV both implement this; without it, Multi-Raft does not work at scale.

The fix is heartbeat coalescing. Instead of each Raft group sending its own heartbeats, the host node maintains one heartbeat channel per peer node. Every 50 ms it sends one batched message containing: "I am still leading the following list of ranges with these terms and commit indexes — please ack." The peer node, on receipt, walks the list and updates each Raft group's "I heard from leader" timer in one go. Replies are coalesced symmetrically: "I am still following you on these ranges."
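
The coalescing step can be sketched as grouping per-range statuses by destination node, so one RPC per peer carries them all. Illustrative Go; the names are invented:

```go
package main

import "fmt"

// hbStatus is one range's slice of a coalesced heartbeat:
// "leader of range G, in term T, with commit index C".
type hbStatus struct {
	rangeID uint64
	term    uint64
	commit  uint64
}

// coalesce groups per-range heartbeats by destination node, so the
// wire carries one message per peer instead of one per (range, peer).
// followers maps rangeID to the follower node IDs for that range;
// statuses is what each led range wants to say this tick.
func coalesce(statuses []hbStatus, followers map[uint64][]string) map[string][]hbStatus {
	batches := make(map[string][]hbStatus)
	for _, s := range statuses {
		for _, peer := range followers[s.rangeID] {
			batches[peer] = append(batches[peer], s)
		}
	}
	return batches // send ONE RPC per key of this map
}

func main() {
	statuses := []hbStatus{
		{rangeID: 1, term: 14, commit: 8123}, // range A
		{rangeID: 4, term: 9, commit: 771},   // range D
	}
	followers := map[uint64][]string{
		1: {"node2", "node3"},
		4: {"node2", "node3"},
	}
	b := coalesce(statuses, followers)
	fmt.Println(len(b), len(b["node2"])) // → 2 2 (2 peers, 2 ranges per batch)
}
```

The receiver walks the batch and ticks each group's "heard from leader" timer — one TCP read, thousands of timer resets.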

The wire RPC count drops from O(ranges × peers) to O(peers). The CPU cost of heartbeat handling drops too — one TCP read, one batched dispatch, instead of thousands of RPC-framing operations. CockroachDB calls this "Raft transport coalescing"; TiKV's design document refers to "store-level heartbeats". They are the same idea.

Why this is correct: Raft heartbeats only need to convey "leader X is alive in term T for group G" and "leader X believes commit index is C". Bundling many such tuples into one message does not break any safety property — each group's state machine still receives its own update, just delivered in a shared envelope. The optimisation is purely about transport. The only subtlety: if a coalesced heartbeat is dropped, all the groups it covers miss their tick simultaneously — which can cause correlated election timeouts. Production systems jitter election timeouts and add per-group nack tracking to handle this; the issue is empirically rare.

Range splits and merges

Ranges are not static. A range that grows past its target size (CockroachDB defaults to 512 MB; TiKV to 144 MB) is split into two adjacent ranges. A range that gets too hot (high QPS) is split by load even if it is small. Pairs of small adjacent ranges may be merged.

A split is itself a Raft operation. The leader of the parent range proposes a Split(splitKey) entry; once committed by the parent's Raft group, every replica of the parent range performs the split locally — creating two new in-memory Raft groups, copying the relevant key range into each, and starting the new groups with the same replica set. From that moment on, the two new ranges are independent Raft groups with independent leaders (initially the same physical node as the parent's leader, then potentially rebalanced).
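
The local side of a split — what each replica does once the Split(splitKey) entry commits — can be sketched like this. Illustrative Go only; real systems also carry over Raft state, leases, and range statistics, and the new range ID comes from the committed entry so every replica agrees on it:

```go
package main

import "fmt"

type rng struct {
	startKey, endKey string
	rangeID          uint64
	data             map[string]string
}

// applySplit runs on EVERY replica of the parent when the committed
// Split(splitKey) entry is applied: the parent shrinks to
// [start, splitKey) and a new group takes over [splitKey, end).
func applySplit(parent *rng, splitKey string, newID uint64) *rng {
	right := &rng{startKey: splitKey, endKey: parent.endKey,
		rangeID: newID, data: map[string]string{}}
	for k, v := range parent.data {
		if k >= splitKey { // key now belongs to the right-hand range
			right.data[k] = v
			delete(parent.data, k)
		}
	}
	parent.endKey = splitKey
	return right
}

func main() {
	p := &rng{startKey: "m", endKey: "s", rangeID: 3,
		data: map[string]string{"mango": "1", "queen": "2"}}
	r := applySplit(p, "p", 99)
	fmt.Println(p.endKey, len(p.data), r.startKey, len(r.data)) // → p 1 p 1
}
```

Because every replica applies the same committed entry at the same log position, all three replicas split identically without any extra coordination round.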

We will cover the load-based split logic — when and where to split, not just how — in detail in the next chapter (Build 14). For this chapter the relevant fact is just that splits and merges are coordinated through the parent Raft group's log, so they preserve linearizability across the split point.

The catalogue: routing clients to the right leader

A client wants to write to key users/42. Which node leads the range containing that key?

The answer lives in a system catalogue — a sorted index from key range to (range ID, replica list, leader hint). In CockroachDB this is the meta range (and the meta-meta range one level above it, for catalogues large enough to themselves need splitting); in TiKV it is the Placement Driver (PD) server cluster; in YugabyteDB it is the master cluster.

Crucially, the catalogue is itself stored in a Raft group (or, in TiKV, in a separate small Raft cluster). It has all the same guarantees — linearizable updates, leader election, replication — and the same scaling characteristics. It is small (a 100 TB database with 64 MB ranges has ~1.5M ranges; each range descriptor is a few hundred bytes; the whole catalogue is ~hundreds of MB), so a single Raft group suffices. CockroachDB further splits the meta range itself once it grows past size, with a meta-meta range pointing into it; the recursion terminates at the meta-meta range, which is fixed and replicated to a configurable set of nodes.

A client routes a request like this:

  1. Client receives a request for key K.
  2. Client looks up K in its local cache of range descriptors. Cache hit: it knows range R contains K and node N is the leader hint. Send to N.
  3. Cache miss, or stale (N replies "no longer leader, here is the new one"): client queries the meta range for K, gets the current descriptor, retries.
  4. Stale on the meta range too: client falls back to the meta-meta range, then to a fixed bootstrap address baked into the binary.
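
The cache-then-catalogue flow in steps 1–3 can be sketched as follows, with lookupMeta standing in for the RPC against the meta range and a boolean standing in for the NotLeaderError retry signal (all names hypothetical):

```go
package main

import "fmt"

type desc struct {
	startKey, endKey string
	leader           string
}

// client caches range descriptors and falls back to the catalogue
// on a miss or a stale hint.
type client struct {
	cache      map[string]desc // keyed by startKey; tiny stand-in for a sorted cache
	lookupMeta func(key string) desc
}

// route returns the node to send key's request to, refreshing the
// cache from the catalogue when needed.
func (c *client) route(key string, cachedIsStale bool) string {
	for start, d := range c.cache {
		if !cachedIsStale && start <= key && key < d.endKey {
			return d.leader // step 2: cache hit
		}
	}
	d := c.lookupMeta(key) // step 3: consult the meta range
	c.cache[d.startKey] = d
	return d.leader
}

func main() {
	c := &client{
		cache: map[string]desc{"m": {"m", "s", "node3"}},
		lookupMeta: func(key string) desc {
			return desc{"m", "s", "node1"} // leadership has moved
		},
	}
	fmt.Println(c.route("mango", false)) // cache hit → node3
	fmt.Println(c.route("mango", true))  // stale → refetch → node1
}
```

In steady state the first branch wins almost always, which is what keeps the catalogue's single Raft group off the hot path.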

In steady state, the client cache is hot and >99% of requests skip the catalogue entirely. The catalogue's load is low, so its single Raft group is not a bottleneck. When a leader moves (rebalance, failover), clients hit the stale-leader path a few times until their caches refresh; CockroachDB sends piggybacked "actually the leader is N'" hints on the rejection response to make this fast.

Worked example: a 3-node CockroachDB-style cluster with 4 ranges A, B, C, D (replication factor 3). Leadership is distributed as in the architecture diagram: node 1 leads A and D, node 2 leads B, node 3 leads C. A client running on a Mumbai web server wants to execute UPDATE accounts SET balance = balance - 100 WHERE id = 'user_842' where user_842 lives in range C (key range [m..s)).

Step 1: client cache lookup. The client driver consults its in-process range cache. It has a cached descriptor: range C ([m..s)) → leader hint node 3. The cache was warmed at startup; it might be slightly stale.

Step 2: send to node 3. The client opens (or reuses) a gRPC connection to node 3 and sends Put(user_842, new_balance) for range C.

Step 3: node 3 confirms it leads C. Node 3's Multi-Raft layer routes the request to range C's Raft group, where node 3 is in fact the leader. (If it were not — say leadership had moved to node 1 — node 3 would reject with NotLeaderError(leader=node1) and the client retries against node 1, also updating its cache.)

Step 4: range C's Raft group commits the write. Node 3, as leader of C, appends the entry to C's log. C's log is one of thousands of logs on node 3, but it is C's own. Node 3 sends AppendEntries(entry_for_C) to nodes 1 and 2 (C's followers). Both fsync, ack. Node 3 advances C's commit index, applies the entry to C's state machine (a key-value store covering keys [m..s)), and replies success to the client.

Step 5: ranges A, B, D are entirely undisturbed. Range A's leader (node 1) is unaware the write happened. Range B's leader (node 2) is unaware. Range D's leader (node 1) is unaware. None of their logs have new entries. None of their state machines change. None of their replicas were contacted. The write happened in exactly the slice of the cluster relevant to it.

Step 6: heartbeat traffic stays coalesced. While this write is in flight, every node is sending one batched heartbeat per peer per 50 ms — node 3 → node 1 carries "leader of C, term=14, commit=8123" along with status for every other range it leads. Node 3 → node 2 the same. The heartbeat that confirmed C's last commit was bundled with status for thousands of other ranges. The wire cost did not budge.

The point: the write is bounded — in CPU, in disk, in network — to range C and its three replicas. A second write to range A on node 1 can happen concurrently with no contention. A third write to range B on node 2 likewise. The cluster is now executing three independent Raft commits in parallel, and the aggregate throughput is roughly 3× the single-range ceiling. With 3000 ranges, the aggregate ceiling is roughly 3000× — modulo disk and network shared resources.

Real systems

CockroachDB is the canonical Multi-Raft implementation. Range size targets default to 512 MB. The meta range plus meta-meta range structure is documented in the project's architecture docs. CockroachDB's kv layer wraps an in-tree fork of etcd-io/raft called cockroachdb/raft, with extensive modifications for batched proposals, raft scheduler integration, and leadership transfer. The "Living Without Atomic Clocks" blog post explains how Multi-Raft + HLC replaces what Spanner does with TrueTime [1].

TiKV (the storage layer of TiDB) uses the same design with regions sized at 96 MB by default. It uses a separate process called the Placement Driver (PD) as the catalogue cluster; PD itself runs etcd. TiKV pioneered many Multi-Raft optimisations and the team has presented the tradeoffs publicly [3].

YugabyteDB uses tablets (their term for ranges) and a master service for the catalogue. The architecture is similar; their tablet sizes are typically larger (1 GB+ default).

Yandex YDB uses tablets with a distributed scheduler and a separate metadata service. Public documentation is sparser but the architecture follows the same pattern.

All four trace their lineage back to Spanner (Google) [5], which used Paxos rather than Raft but introduced the "many small Paxos groups, one per tablet" pattern that Multi-Raft generalises.

Going deeper, three questions are worth working through: how does Multi-Raft interact with cross-range transactions, when does the heartbeat-coalescing optimisation become a correctness liability, and what are the second-order effects of having thousands of Raft groups per node sharing one disk?

Cross-range transactions

Multi-Raft solves single-key throughput. A transaction that touches two keys in different ranges needs an atomic-commit protocol layered on top of the per-range Raft groups. CockroachDB uses a parallel-commit variant of two-phase commit, where a transaction record lives in one designated range and intent records on every key the transaction wrote live in their respective ranges. Commit is staged: write all intents, then flip the transaction record to committed (one Raft commit in one range), then asynchronously resolve intents into committed values. This makes a multi-range transaction cost roughly N+1 Raft commits across the involved ranges, but each commit happens in its own group with its own leader — so the cluster's aggregate commit rate stays high even under transactional workloads.
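
The "N+1 Raft commits" claim is just counting, but it is worth making explicit. A Go sketch that counts commits rather than implementing the protocol:

```go
package main

import "fmt"

// commitTxn counts the Raft commits a CockroachDB-style staged
// commit spends on the latency path: one intent write per touched
// range, plus one commit of the transaction record. (A counting
// sketch, not the actual parallel-commit protocol.)
func commitTxn(rangesTouched int) int {
	commits := 0
	for i := 0; i < rangesTouched; i++ {
		commits++ // write an intent in range i's own Raft group
	}
	commits++ // flip the txn record to COMMITTED: one more commit
	// intent resolution happens asynchronously, off the latency path
	return commits
}

func main() {
	fmt.Println(commitTxn(3)) // → 4 (3 intents + 1 txn record)
}
```

The key property is that each of those N+1 commits lands in a different Raft group with its own leader, so transactional load still parallelises across the cluster.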

When heartbeat coalescing breaks

A coalesced heartbeat carries status for many groups; if the receiving node is overloaded and drops the message, every group covered by that heartbeat misses its tick. With 3,000 groups in a coalesced batch, a single dropped packet is potentially 3,000 simultaneous election timeouts on the next tick — which would cause all 3,000 groups to start elections concurrently and overwhelm the network with RequestVote traffic. Production systems mitigate this by jittering election timeouts (each group rolls its own random timeout in [150ms, 300ms]) and by setting pre-vote mode, where a candidate first asks "would you vote for me if I called an election?" before actually advancing its term. Pre-vote is in the etcd raft library [2] and used by every production Multi-Raft system.

Disk contention across thousands of groups

All 3,000 ranges on one node share the same NVMe device. Each range's Raft log appends its own fsync stream. A naive implementation would issue 3,000 independent fsync calls per second, which is a bottleneck on most disks. Production systems batch these: CockroachDB's storage layer (Pebble) accepts log writes from all Raft groups into a single shared WAL, fsyncs the WAL once, and then notifies every group that its entry is durable. The consensus logic remains independent per group, but the disk write is shared. This is one of the most subtle parts of a real Multi-Raft implementation and the source of much of the gap between toy implementations and production ones.
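
The shared-WAL group commit can be sketched as a queue of waiters flushed by one sync. Illustrative Go, with the actual disk write stubbed out:

```go
package main

import (
	"fmt"
	"sync"
)

// sharedWAL batches log appends from many Raft groups into one
// write-ahead log and fsyncs once per batch, then wakes every group
// whose entry that fsync covered. The disk sees one sync per batch
// instead of one per group.
type sharedWAL struct {
	mu      sync.Mutex
	pending []chan struct{} // groups waiting for durability
	syncs   int             // how many fsyncs actually happened
}

// enqueue queues one group's entry and returns a channel that is
// closed once the entry is durable.
func (w *sharedWAL) enqueue(entry []byte) chan struct{} {
	done := make(chan struct{})
	w.mu.Lock()
	w.pending = append(w.pending, done)
	w.mu.Unlock()
	return done
}

// syncBatch is run by a single background goroutine: one fsync
// covers every entry queued since the last batch.
func (w *sharedWAL) syncBatch() {
	w.mu.Lock()
	waiters := w.pending
	w.pending = nil
	w.mu.Unlock()
	if len(waiters) == 0 {
		return
	}
	w.syncs++ // ONE fsync of the shared WAL for the whole batch (stubbed)
	for _, done := range waiters {
		close(done) // notify each group its entry is durable
	}
}

func main() {
	w := &sharedWAL{}
	a := w.enqueue([]byte("range A entry"))
	b := w.enqueue([]byte("range B entry"))
	w.syncBatch()
	<-a
	<-b
	fmt.Println("fsyncs:", w.syncs) // → fsyncs: 1
}
```

Consensus stays independent per group — each group only learns "your entry is durable" — while the disk cost is amortised across all of them.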

Range placement and locality

A 9-node cluster with three datacentres of three nodes each typically wants every range's three replicas to span all three datacentres — so a single DC outage cannot lose quorum. CockroachDB's allocator considers locality labels (region=mumbai, zone=mumbai-1a) and will refuse to place all three replicas of a range in the same zone unless explicitly forced. The leader is then preferred to be the replica closest to the bulk of the writers (lease preferences), which on a global deployment can mean different ranges have leaders on different continents. For example, a bank's cluster spanning Mumbai, Bengaluru, and Hyderabad would typically pin replicas one per city, with leaderships distributed by where each customer's data is mostly written.

Why Raft and not Paxos for sharded consensus

Spanner uses Multi-Paxos. CockroachDB, TiKV, YugabyteDB, YDB all use Multi-Raft. Why the shift? Implementation simplicity and operational debuggability. Raft's strict leader-only-log-write rule and explicit term-and-index ordering make every step of the protocol trivially traceable; Multi-Paxos with parallel proposers and gap-filling is famously hard to debug under failure. When you have 3,000 groups per node and need to root-cause a hung commit at 3am, "look at the leader for group 1842, check its log, check what the followers acked" beats "reconstruct the highest-numbered chosen value across three proposers" by a wide margin. The Raft paper [6] explicitly cites understandability as the design goal, and that pays dividends every time you operate a cluster with thousands of independent groups.

What you have learned

You now understand why a single Raft group is a throughput dead-end (one leader, one log, one core, ~30k writes/sec) and how Multi-Raft sidesteps that by making a fresh independent Raft group per 64–512 MB range. You can see why heartbeat traffic explodes naïvely and how coalescing fixes it. You know that splits and merges are themselves Raft operations and that a system-table catalogue (also a Raft group) routes clients to the right leader.

The next chapter (Build 14) goes inside the split logic — when does a range split, how is the split key chosen, what does load-based splitting look like, and how do range descriptors update atomically through the catalogue.

References

  1. Living Without Atomic Clocks: Where CockroachDB and Spanner Diverge — Cockroach Labs blog explaining Multi-Raft + HLC vs TrueTime.
  2. etcd-io/raft library README — the canonical production Raft library, source of the MultiRaft-style fork in CockroachDB.
  3. Multi-Raft in TiKV — PingCAP engineering talk — TiKV's heartbeat coalescing, region scheduler, and PD design.
  4. CockroachDB raft library documentation — the in-tree fork with batched proposals and quotapool.
  5. Spanner: Google's Globally-Distributed Database (OSDI 2012) — origin of the "many small consensus groups per tablet" pattern.
  6. In Search of an Understandable Consensus Algorithm (Raft paper) — Ongaro and Ousterhout 2014, USENIX ATC.