Consistent hashing (ring, jump, Maglev)
It is 19:42 on a Friday and Aditi is on call for BharatBazaar's product-page cache. The fleet has 96 cache pods behind a sidecar load balancer; the cache hit rate has lived at 94% for months. At 19:38 the platform team rolled an autoscaler change that bumped the fleet to 112 pods. The hit rate fell to 31% in 90 seconds, the origin database CPU went from 22% to 91%, and the read p99 climbed from 18 ms to 1.4 s. Nothing was broken — the cache pods were healthy, the database was healthy, the LB was healthy. The problem was that the LB used hash(product_id) % N to pick a pod. When N went from 96 to 112, every key got remapped. Every cached entry on every pod was now a miss. This chapter is about the three algorithms — ring, jump, Maglev — that solve this exact problem, and how the choice between them maps to a real production decision tree.
Consistent hashing routes a key to one of N backends so that adding or removing a backend remaps roughly 1/N of the keys instead of all of them. The ring (Karger 1997, Chord) costs O(log N) per lookup and gives unbalanced load without virtual nodes. Jump hash (Lamping & Veach 2014) is O(log N) lookups with no memory and perfect balance, but only supports append/remove of the last shard. Maglev (Google 2016) gives O(1) lookups, near-perfect balance, and minimal disruption — at the cost of a precomputed lookup table sized at ~65k entries per service.
Why modulo hashing breaks
The naive picker is one line: pod = pods[hash(key) % len(pods)]. It distributes load evenly when N is fixed. It is also the textbook example of why "evenly distributed" and "stable under reconfiguration" are different properties.
Consider a fleet of 4 pods. With modulo hashing, key k goes to pod hash(k) % 4. Now add a 5th pod. Key k now goes to pod hash(k) % 5. The fraction of keys whose target pod did not change is roughly 1/5 — the keys that happened to land on the same residue under both moduli. Roughly 80% of the keys remap when you add a single pod. For a cache, that means roughly 80% of requests suddenly missing against a cold cache. For a sticky session router, that is 80% of in-flight sessions losing their backend. For a sharded database, that is 80% of writes routed to a pod that does not have the data yet.
Why modulo hashing remaps (N-1)/N of the keys when you add one pod: a key k keeps its assignment only if hash(k) % N == hash(k) % (N+1). For uniformly distributed hash(k), that condition holds with probability 1/(N+1) (one residue out of N+1 candidates aligns by chance). So the fraction of keys preserved is 1/(N+1) and the fraction remapped is N/(N+1). For N=4, that is 80%; for N=96, that is 99%. The disruption is worse for bigger fleets, which is the opposite of what you want — adding a pod to a 1000-pod fleet should be a tiny event, not a cache-cliff.
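The N/(N+1) figure is easy to check empirically. A minimal sketch, with hypothetical product-id keys and sha256 truncated to 64 bits as the stable hash (matching the chapter's h64):

```python
import hashlib

def h64(s):
    # 64-bit stable hash: sha256 truncated, as used throughout the chapter.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def preserved(n, n_keys=100_000):
    # Fraction of keys whose pod assignment survives growing the fleet n -> n+1
    # under naive modulo hashing.
    keys = [f"product-{i}" for i in range(n_keys)]
    same = sum(1 for k in keys if h64(k) % n == h64(k) % (n + 1))
    return same / n_keys

for n in (4, 96):
    print(f"N={n:3d}: preserved={preserved(n):.1%}  predicted={1/(n+1):.1%}")
```

For N=4 the preserved fraction lands near the predicted 20%, and for N=96 near 1% — the disruption really does get worse as the fleet grows.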
The ring (Karger / Chord, 1997)
Karger et al.'s 1997 STOC paper Consistent Hashing and Random Trees introduced the construction now called the ring or Chord algorithm (Stoica et al. 2001 generalised it to a peer-to-peer DHT). The construction:
- Hash every pod's identifier (e.g. hash("pod-3")) to a 32-bit or 64-bit integer. Call this the pod's position on the ring.
- To route a key, compute hash(key) and walk clockwise on the ring until you hit the first pod position. Route to that pod.
- To add a pod, hash its identifier to a new position. Only the keys between the previous-clockwise pod and the new pod move — roughly K/N keys (where K is the total key count).
- To remove a pod, its keys go to the next-clockwise pod. Again, K/N keys move.
The lookup is O(log N) because the pod positions are stored in a sorted structure (typically a binary search tree or a sorted array with binary search). The key property — adding or removing a pod moves only ~1/N of the keys instead of all of them — is the foundation of every modern key-routing system.
The naive ring has one problem: load is highly unbalanced. With N=10 randomly placed pod positions on the ring, the ratio of the most-loaded to least-loaded pod is typically 4× to 5×. The variance shrinks slowly with N. The fix is virtual nodes (vnodes): each physical pod is hashed to V ring positions instead of one. With V=200, a 10-pod fleet has 2000 virtual positions, and the load skew across physical pods drops to under 1.05×. Dynamo's paper (DeCandia 2007) reports V=128–256 as production-realistic. Cassandra's default is num_tokens=16 (recently lowered from 256 because higher counts hurt repair performance), Riak's is 64.
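The vnode effect is straightforward to reproduce. A sketch, with illustrative pod and key names, measuring ring load skew with V=1 versus V=200:

```python
import bisect
import hashlib
from collections import Counter

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def ring_skew(n_pods=10, vnodes=1, n_keys=100_000):
    # Each pod contributes `vnodes` positions on the ring.
    ring = sorted((h64(f"pod-{p}#{v}"), p)
                  for p in range(n_pods) for v in range(vnodes))
    load = Counter()
    for i in range(n_keys):
        idx = bisect.bisect(ring, (h64(f"key-{i}"),))
        load[ring[idx % len(ring)][1]] += 1
    return max(load.values()) / min(load.values())

print(f"V=1:   skew={ring_skew(vnodes=1):.2f}")    # typically several-fold
print(f"V=200: skew={ring_skew(vnodes=200):.2f}")  # close to 1.0
```

With a single position per pod the most-loaded/least-loaded ratio is typically several-fold; with V=200 it collapses toward 1.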
A 70-line implementation: ring, jump, Maglev side-by-side
This script implements all three algorithms and measures (a) lookup balance, (b) disruption when one pod is added, on the same key set. The point is to make the cost-quality trade-off concrete in numbers.
# hashing_compare.py — ring vs jump vs Maglev, balance and disruption.
import bisect, hashlib, random
from collections import Counter

def h64(s):  # 64-bit stable hash (sha256 truncated)
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

# 1. Ring with virtual nodes.
def build_ring(pods, vnodes=200):
    ring = []
    for p in pods:
        for v in range(vnodes):
            ring.append((h64(f"{p}#{v}"), p))
    ring.sort()
    return ring

def ring_lookup(ring, key):
    pos = h64(key)
    idx = bisect.bisect(ring, (pos,))
    return ring[idx % len(ring)][1]

# 2. Jump hash (Lamping & Veach 2014).
def jump_lookup(key, num_buckets):
    k = h64(key) & 0xFFFFFFFFFFFFFFFF
    b, j = -1, 0
    while j < num_buckets:
        b = j
        k = (k * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((k >> 33) + 1)))
    return b

# 3. Maglev (Google 2016) — precomputed lookup table.
def build_maglev(pods, M=65537):  # M must be prime
    N = len(pods)
    perms = []
    for p in pods:
        offset = h64(f"{p}-off") % M
        skip = h64(f"{p}-skip") % (M - 1) + 1
        perms.append([(offset + i * skip) % M for i in range(M)])
    table = [-1] * M
    next_idx = [0] * N
    filled = 0
    while filled < M:
        for i in range(N):
            c = perms[i][next_idx[i]]
            while table[c] != -1:
                next_idx[i] += 1
                c = perms[i][next_idx[i]]
            table[c] = i
            next_idx[i] += 1
            filled += 1
            if filled == M:
                break
    return table, pods

def maglev_lookup(maglev, key):
    table, pods = maglev
    return pods[table[h64(key) % len(table)]]

# Measure balance and disruption.
random.seed(42)
keys = [f"product-{i}" for i in range(50000)]
pods_a = [f"pod-{i}" for i in range(8)]
pods_b = [f"pod-{i}" for i in range(9)]  # added one pod
for name, build, lookup in [
    ("ring", lambda ps: build_ring(ps), ring_lookup),
    ("jump", lambda ps: len(ps), lambda n, k: f"pod-{jump_lookup(k, n)}"),
    ("maglev", lambda ps: build_maglev(ps), maglev_lookup),
]:
    sa, sb = build(pods_a), build(pods_b)
    bal = Counter(lookup(sa, k) for k in keys)
    moved = sum(1 for k in keys if lookup(sa, k) != lookup(sb, k))
    skew = max(bal.values()) / min(bal.values())
    print(f" {name:7s} load skew={skew:.3f} moved on +1 pod={moved/len(keys):.1%} ideal={1/9:.1%}")
Sample run:
ring load skew=1.094 moved on +1 pod=11.4% ideal=11.1%
jump load skew=1.001 moved on +1 pod=11.1% ideal=11.1%
maglev load skew=1.012 moved on +1 pod=11.6% ideal=11.1%
Per-line walkthrough. build_ring hashes each pod 200 times to create virtual nodes, then sorts the resulting (position, pod) pairs. ring_lookup binary-searches the sorted array for the smallest position greater than the key's hash — the clockwise walk in O(log N).

jump_lookup is the Google-published algorithm: a tight loop that uses an LCG-style hash to "jump" the bucket assignment forward; it terminates when the next jump would exceed num_buckets. The genius is that the algorithm uses zero memory — it computes the bucket from scratch each time.

build_maglev precomputes a permutation per pod, then fills a fixed-size lookup table by round-robin claim: each pod's permutation determines which slot it claims next, and whoever claims a slot first owns it. maglev_lookup is O(1): hash the key, modulo the table size, read the bucket.

The output shows all three give roughly ideal balance and roughly ideal disruption (1/N keys move when a pod is added). The differences are in lookup cost (ring O(log N), jump O(log N), Maglev O(1)), memory (ring ~N×V entries, jump 0, Maglev ~65k entries per service), and operational flexibility — discussed in the next section.
Why all three converge on ~11.1% disruption when you add a pod to an 8-pod fleet: the theoretical minimum is 1/N_new = 1/9 = 11.1%. Balance sets this floor: for the fleet to stay balanced after the addition, the new pod must take roughly its fair share 1/N_new of the keys, and those keys must come from somewhere, so no balanced algorithm can move fewer than ~1/N_new of them. (Balance alone does not cap disruption from above; modulo hashing is balanced and still moves nearly everything. The point of a consistent-hashing algorithm is to sit on the floor.) Ring with V=200 vnodes ends up at 11.4%, slightly above the minimum because vnode positions are random, not optimal. Jump and Maglev both reach within a percentage point of optimal. The disruption floor is the same for all three; what differs is everything else.
When to use each — the production decision tree
The three algorithms exist because they are not interchangeable. Each occupies a different point on the cost-quality-flexibility curve.
Ring (Karger / Chord) — use when nodes can be added and removed in arbitrary order. This is the only algorithm of the three that lets you remove pod-3 from the middle of a fleet without renaming or reshuffling the others. Cassandra, Riak, DynamoDB internals, memcached client-side sharding (ketama), and Discord's session-routing layer all use ring-based consistent hashing for this reason. The cost is O(log N) per lookup, ~N×V memory (small — for N=100, V=200, that is 20k entries), and a non-trivial implementation. The benefit is full reconfiguration freedom.
Jump hash (Lamping & Veach 2014) — use when shards are append-only. Jump hash assumes shard IDs are {0, 1, ..., N-1}. Adding a shard means going from N to N+1 — perfect. Removing shard 3 from the middle is not directly supported; you can only "remove" the last shard. Production uses: Google's BigQuery resharding, sharded counters, Vitess range-based sharding for tail shards, anywhere shards can be append-grown but never punched-out from the middle. The benefit is zero memory and zero coordination: any client computes jump_lookup(key, N) from scratch as long as it knows N. The cost is the inflexibility on removal.
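The append-only constraint can be seen directly. A sketch reusing the chapter's jump_lookup (sha256-based h64 as before): growing the fleet moves roughly 1/(N+1) of the keys, and the only bucket a shrink can vacate is the last one.

```python
import hashlib

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def jump_lookup(key, num_buckets):
    k = h64(key)
    b, j = -1, 0
    while j < num_buckets:
        b = j
        k = (k * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((k >> 33) + 1)))
    return b

keys = [f"k{i}" for i in range(20_000)]
# Growing 8 -> 9: the keys that move are exactly those landing in new bucket 8.
grow = sum(jump_lookup(k, 9) == 8 for k in keys) / len(keys)
print(f"moved into new bucket on 8->9: {grow:.1%} (ideal {1/9:.1%})")
# Shrinking 9 -> 8 only works because the vacated bucket is the LAST one:
# every key whose assignment changes was in bucket 8.
movers = {jump_lookup(k, 9) for k in keys if jump_lookup(k, 8) != jump_lookup(k, 9)}
print("buckets vacated on shrink:", movers)
```

There is no way to express "remove bucket 3" in this scheme without renumbering buckets 4..N-1, which is exactly the coordination event jump hash is meant to avoid.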
Maglev (Google 2016) — use when lookup latency must be O(1) and you can afford ~64k entries per service. Maglev was designed for Google's L4 software load balancer running at line rate (10M+ pps per box); every nanosecond per packet matters. The lookup is a hash, a modulo, and a table read — a handful of instructions with no data-dependent branching. The disruption when a backend changes is slightly worse than ring/jump (some Maglev tables can move 2–3% more keys than the theoretical minimum on remove), but the constant-time lookup is decisive at high pps. Cloudflare's L4 LB (Unimog) uses a Maglev variant, and Envoy ships Maglev alongside ring hash as a selectable L7 load-balancing policy. The cost is the table rebuild on backend changes (~10ms for a 65k-entry table on commodity hardware) and the fixed memory.
Why Maglev needs the table size M to be prime and large (M=65537 is the canonical pick): the table is filled by having each backend claim slots in a permutation order; the permutation is generated by (offset + i × skip) mod M. For this to be a true permutation (visit every slot exactly once), gcd(skip, M) = 1 must hold. The simplest way to guarantee that for arbitrary skip is to make M prime — every nonzero skip is coprime to a prime modulus. The table size also bounds the disruption: when a backend leaves, its slots are reclaimed by other backends in their permutation order, and the smaller M is, the lumpier the redistribution. M=65537 (a Fermat prime) is large enough that the disruption is within ~2% of the theoretical minimum and small enough that a rebuild fits in ~10ms on commodity hardware. Google's paper uses M=65537 for fleets of up to ~1000 backends; for larger fleets, M=655373 (also prime) keeps disruption tight, at roughly 10× the table size and rebuild cost.
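The coprimality argument checks out directly. A toy sketch with small illustrative moduli (the production table would use M=65537):

```python
from math import gcd

def is_permutation(offset, skip, M):
    # Does (offset + i*skip) % M visit every slot in 0..M-1 exactly once?
    return len({(offset + i * skip) % M for i in range(M)}) == M

M_prime, M_composite = 13, 12
# Prime modulus: every nonzero skip yields a full permutation.
assert all(is_permutation(0, s, M_prime) for s in range(1, M_prime))
# Composite modulus: skips sharing a factor with M cycle early.
bad = [s for s in range(1, M_composite) if not is_permutation(0, s, M_composite)]
print("non-permutation skips mod 12:", bad)
```

The failing skips are exactly those with gcd(skip, 12) > 1, which is why a prime M lets Maglev derive skip from a hash without any rejection step.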
Common confusions
- "Consistent hashing means the hash function is consistent." It does not. The hash function is just a hash function. "Consistent" refers to the mapping from keys to nodes being stable under reconfiguration — adding or removing a node remaps ~1/N of the keys instead of all of them. Karger's 1997 paper named the property; the term is awkward in retrospect but it stuck.
- "Virtual nodes are just for load balancing." They are also a fault-isolation primitive. When physical pod-3 fails, its 200 virtual positions are spread across the rest of the fleet — each surviving pod inherits roughly 200/(N-1) vnodes' worth of load. Without vnodes, pod-3's neighbour would have inherited 100% of pod-3's load, a 2× spike on a single pod. Vnodes turn that 2× spike into a 1+1/(N-1) spike, which for N=20 is just a 5% bump.
- "Jump hash and consistent hashing are different things." Jump hash is a consistent-hashing algorithm — it satisfies the property that adding a bucket only moves ~1/N of the keys. The confusion comes from jump hash not having a "ring": there is no geometric structure, no virtual nodes, no positions, just a tight loop that computes the bucket directly. Consistent hashing is the property; the ring is one implementation; jump is another.
- "Maglev is just a precomputed ring lookup." It is not. The ring picks pods by clockwise walk on hash positions; Maglev picks pods by slot ownership in a precomputed table where each pod claims slots in a permutation order. The disruption profile is different — when a Maglev backend leaves, the slots are redistributed by re-running the table fill, which can move slightly more than 1/N of the keys due to permutation-order interactions. The lookup cost and memory profile are also different. They share the name "consistent hashing" because they share the property; the mechanisms are independent.
- "You should always use Maglev because O(1) is best." Maglev's table-rebuild cost (~10ms for 65k entries) becomes a coordination problem when backends change frequently. For a fleet that adds or removes a pod every 30 seconds during autoscaling, the LB spends a noticeable fraction of CPU rebuilding tables. The ring's O(log N) lookup is fine for L7 LBs at modest throughput (~10k req/s per LB), and the incremental update (insert or remove a few vnodes) is cheap. Pick Maglev when packet rate is the constraint, not when latency is just "low".
Going deeper
Karger's original use case — distributed web caches
Karger et al.'s 1997 paper Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web was motivated by a specific problem: distributing CDN cache lookups across a pool of cache servers without flash-crowd hot spots. The paper proves two properties of the ring with virtual nodes: balance (no cache holds more than (1+ε) K/N keys with high probability for any constant ε > 0 once V = Ω(log N)) and monotonicity (adding a cache moves at most O(K/N) keys). The proofs use balls-into-bins arguments adapted to the random-position structure. Akamai's CDN — co-founded by the paper's co-authors Tom Leighton and Daniel Lewin — was the first large-scale production deployment; the technique became the standard for client-side cache sharding when memcached's ketama library shipped in 2007.
Jump hash is one of the most elegant 10-line algorithms in distributed systems
Lamping & Veach's 2014 paper A Fast, Minimal Memory, Consistent Hash Algorithm has 4 pages of body and a 10-line algorithm. The core observation: when you add a bucket going from N to N+1, each key independently has probability 1/(N+1) of moving to the new bucket and probability N/(N+1) of staying. So if you simulate the "additions" in order — start with bucket 0, then potentially move to bucket 1 with probability 1/2, then potentially to bucket 2 with probability 1/3, and so on — a key's final bucket is exactly its consistent-hashing assignment. The clever bit is doing this in O(log N) instead of O(N) by using an LCG (linear congruential generator) seeded by the key to "jump" forward over consecutive non-moves. The expected number of jumps is O(log N). The algorithm uses zero memory — every client computes the bucket independently from (key, N) alone. The downside is the inflexibility on remove: jump hash maps to the bucket index 0..N-1, and removing bucket 3 from the middle requires renaming buckets 4..N-1 to 3..N-2, which is a coordination event.
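The derivation translates into an O(N) reference implementation almost line for line. A sketch, using Python's random module as an illustrative per-key PRNG (so the buckets will not match an LCG-based implementation bit for bit):

```python
import random

def jump_slow(key_hash, num_buckets):
    # Replay each bucket addition: when bucket n joins (fleet size n -> n+1),
    # the key moves to it with probability 1/(n+1), else it stays put.
    rng = random.Random(key_hash)
    b = 0
    for n in range(1, num_buckets):
        if rng.random() < 1.0 / (n + 1):
            b = n
    return b

# Consistency: growing the fleet only ever moves a key to the NEW bucket,
# because both calls replay the same seeded random stream.
for kh in range(10_000):
    a, b = jump_slow(kh, 8), jump_slow(kh, 9)
    assert b == a or b == 8

moved = sum(jump_slow(kh, 8) != jump_slow(kh, 9) for kh in range(10_000)) / 10_000
print(f"moved on 8->9: {moved:.1%} (ideal {1/9:.1%})")
```

The published algorithm gets the same answer in O(log N) by jumping directly to the next index where a move happens, instead of sampling every addition.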
Maglev's permutation × table-fill is a clever way to amortise lookup cost
Eisenbud et al.'s 2016 NSDI paper Maglev: A Fast and Reliable Software Network Load Balancer describes the algorithm and the system. The algorithm trades one-time precomputation cost for O(1) lookups. Each backend has a permutation of the table indices 0..M-1 derived from two hashes (offset, skip). The table is filled by having each backend in turn claim its next-uncontested slot from its permutation. The order of "turns" is round-robin across backends. The result is that each backend ends up owning M/N slots, and the slot-ownership pattern is nearly identical when one backend is added or removed (because the permutations of the other backends are unchanged, and they fill the slots vacated by the absent backend in the same order they would have anyway). This gives near-minimal disruption with O(1) lookups. Maglev's deployment story is the more interesting part: Google's L4 LB runs Maglev on commodity x86 servers, processes 10M+ packets per second per box, and uses M=65537 for fleets up to ~1000 backends.
Reproduce this on your laptop
# Run the simulator from above:
python3 -m venv .venv && source .venv/bin/activate
python3 hashing_compare.py
# Expected (seed=42, 50000 keys, 8→9 pods):
# ring load skew=1.094 moved=11.4%
# jump load skew=1.001 moved=11.1%
# maglev load skew=1.012 moved=11.6%
# Inspect ketama (memcached client-side ring) in Python:
pip install pymemcache hash_ring
python3 -c "from hash_ring import HashRing; \
r = HashRing(['cache-1', 'cache-2', 'cache-3']); \
print([(k, r.get_node(k)) for k in ['session-42', 'cart-99', 'product-7']])"
# Cassandra-style virtual-node configuration (read-only inspection):
docker run --rm cassandra:4.1 \
  grep -E 'num_tokens|allocate_tokens_for_local_replication_factor' /etc/cassandra/cassandra.yaml
# Cassandra defaults to num_tokens: 16 since 4.0; older clusters used 256.
Where this leads next
Consistent hashing solves the "stable mapping" problem for stateless / sticky lookups. The same machinery shows up at three other layers of distributed systems: replication (where to put the N replicas of a key), partitioning (how to split a database across shards), and rendezvous routing (an alternative to ring-based that the next chapter covers).
- Rendezvous (HRW) hashing — alternative to the ring; computes argmax_i hash(key, pod_i); same disruption guarantees, simpler implementation, slightly worse balance without weights.
- Sticky sessions and cookie-based routing — the "L7 application" of consistent hashing; sessions encoded in cookies, routed via hash.
- Power of two choices (P2C) — what to use when stickiness is not required; trades affinity for tail-latency reduction.
- Replication and quorum reads — Part 8. Consistent hashing picks the primary replica; the next replicas are typically the next-clockwise vnodes on the ring.
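As a preview of the rendezvous item above: HRW fits in a few lines. A sketch with illustrative pod and key names, using the same sha256-based hash as elsewhere in the chapter:

```python
import hashlib

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def hrw(key, pods):
    # Route to the pod with the highest combined (key, pod) hash.
    return max(pods, key=lambda p: h64(f"{key}|{p}"))

pods = [f"pod-{i}" for i in range(8)]
keys = [f"k{i}" for i in range(20_000)]
before = {k: hrw(k, pods) for k in keys}
after = {k: hrw(k, pods + ["pod-8"]) for k in keys}
moved = sum(before[k] != after[k] for k in keys) / len(keys)
print(f"moved on +1 pod: {moved:.1%} (ideal {1/9:.1%})")
```

A key moves only when the new pod wins the argmax, which happens with probability 1/N_new — the same disruption floor as the three algorithms above, at O(N) hashes per lookup.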
After Part 6 closes, Part 7 (reliability patterns) revisits the picker correctness from the failure-handling angle — what happens when the consistently-hashed pod is unreachable.
References
- Karger et al., "Consistent Hashing and Random Trees" — STOC 1997 — the foundational paper; introduces the ring and proves balance + monotonicity.
- Stoica et al., "Chord: A Scalable Peer-to-peer Lookup Service" — SIGCOMM 2001 — the DHT generalisation; the canonical reference for ring-with-virtual-nodes.
- Lamping & Veach, "A Fast, Minimal Memory, Consistent Hash Algorithm" — arXiv 2014 — the jump-hash paper; ten-line algorithm, zero memory.
- Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" — NSDI 2016 — the Maglev paper; production L4 LB at Google.
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store" — SOSP 2007 — Dynamo's use of ring-with-vnodes; influential on Cassandra and Riak.
- Cloudflare, "Unimog — Cloudflare's edge load balancer" (2020 blog) — Maglev-variant in production at edge.
- Power of two choices (P2C) — internal companion. The previous chapter; what to use when stickiness is not required.
- Wall: many instances → load balancing decisions — internal companion. The wall this chapter sits beneath; explains why heterogeneity is the underlying problem all pickers face.