Consistent hashing (ring, jump, Maglev)

It is 19:42 on a Friday and Aditi is on call for BharatBazaar's product-page cache. The fleet has 96 cache pods behind a sidecar load balancer; the cache hit rate has lived at 94% for months. At 19:38 the platform team rolled an autoscaler change that bumped the fleet to 112 pods. The hit rate fell to 31% in 90 seconds, the origin database CPU went from 22% to 91%, and the read p99 climbed from 18 ms to 1.4 s. Nothing was broken — the cache pods were healthy, the database was healthy, the LB was healthy. The problem was that the LB used hash(product_id) % N to pick a pod. When N went from 96 to 112, every key got remapped. Every cached entry on every pod was now a miss. This chapter is about the three algorithms — ring, jump, Maglev — that solve this exact problem, and how the choice between them maps to a real production decision tree.

Consistent hashing routes a key to one of N backends so that adding or removing a backend remaps roughly 1/N of the keys instead of all of them. The ring (Karger 1997, Chord) costs O(log N) per lookup and gives unbalanced load unless you add virtual nodes. Jump hash (Lamping & Veach 2014) does O(log N) lookups with zero memory and near-perfect balance, but only supports appending a shard or removing the last one. Maglev (Google 2016) gives O(1) lookups, near-perfect balance, and near-minimal disruption — at the cost of a precomputed lookup table of ~65k entries per service.

Why modulo hashing breaks

The naive picker is one line: pod = pods[hash(key) % len(pods)]. It distributes load evenly when N is fixed. It is also the textbook example of why "evenly distributed" and "stable under reconfiguration" are different properties.

Consider a fleet of 4 pods and 12 keys. With modulo hashing, key k goes to pod hash(k) % 4. Now add a 5th pod. Key k now goes to pod hash(k) % 5. The fraction of keys whose target pod did not change is roughly 1/5 — the keys that happened to land on the same residue under both moduli. Roughly 80% of the keys remap when you add a single pod. For a cache, that is 80% of lookups suddenly going cold. For a sticky session router, that is 80% of in-flight sessions losing their backend. For a sharded database, that is 80% of writes routed to a pod that does not have the data yet.

Figure: modulo hashing, N=4 vs N=5 — the same 12 keys mapped side by side with hash mod 4 and hash mod 5. Only 2 of 12 keys keep their pod when N goes from 4 to 5; the other 10 (83%) remap, so adding one pod costs 10 cache misses out of 12 keys.
Illustrative — the modulo failure mode. Adding a single pod does not "shift" assignments by one; it permutes them. Cache fleets, sticky-session routers, and sharded stores all see the same disruption pattern.

Why modulo hashing remaps N/(N+1) of the keys when you add one pod: a key k keeps its assignment only if hash(k) % N == hash(k) % (N+1). Because N and N+1 are coprime, the two residues agree only when hash(k) % (N(N+1)) falls in [0, N) — N residues out of N(N+1) possibilities. So for uniformly distributed hash(k), the fraction of keys preserved is 1/(N+1) and the fraction remapped is N/(N+1). For N=4, that is 80%; for N=96, it is 99%. The disruption gets worse as the fleet grows, which is the opposite of what you want — adding a pod to a 1000-pod fleet should be a tiny event, not a cache cliff.
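
That derivation is easy to check empirically. The sketch below is a throwaway script, separate from the chapter's main harness; the key names and fleet sizes are arbitrary.

# modulo_check.py — fraction of keys that keep their pod when N grows to N+1.
import hashlib

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

keys = [f"key-{i}" for i in range(100_000)]
for n in (4, 8, 96, 1000):
    kept = sum(1 for k in keys if h64(k) % n == h64(k) % (n + 1))
    print(f"N={n:4d}  kept={kept/len(keys):6.2%}  predicted 1/(N+1)={1/(n+1):6.2%}")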

The ring (Karger / Chord, 1997)

Karger et al.'s 1997 STOC paper Consistent Hashing and Random Trees introduced the construction now known as the hash ring (Stoica et al.'s Chord, 2001, generalised it into a peer-to-peer DHT). The construction:

  1. Hash every pod's identifier (e.g. hash("pod-3")) to a 32-bit or 64-bit integer. Call this the pod's position on the ring.
  2. To route a key, compute hash(key) and walk clockwise on the ring until you hit the first pod position. Route to that pod.
  3. To add a pod, hash its identifier to a new position. Only the keys whose hashes fall between the new position and its counter-clockwise neighbour move — they shift from the old clockwise successor onto the new pod, roughly K/N keys on average (where K is the total key count).
  4. To remove a pod, its keys go to the next-clockwise pod. Again, K/N keys move.

The lookup is O(log N) because the pod positions are stored in a sorted structure (typically a binary search tree or a sorted array with binary search). The key property — adding or removing a pod moves only ~1/N of the keys instead of all of them — is the foundation of every modern key-routing system.

The naive ring has one problem: load is highly unbalanced. With N=10 randomly placed pod positions on the ring, the ratio of the most-loaded to least-loaded pod is typically 4× to 5×, and the imbalance does not wash out as N grows — the largest arc on a ring of N random positions is about (ln N)/N, several times the fair share of 1/N. The fix is virtual nodes (vnodes): each physical pod is hashed to V ring positions instead of one. With V=200, a 10-pod fleet has 2000 virtual positions, and the load skew across physical pods drops to under 1.05×. Dynamo's paper (DeCandia 2007) reports V=128–256 as production-realistic. Cassandra's default is num_tokens=16 per node (lowered from 256 in 4.0 because higher counts hurt repair performance); Riak's default ring size is 64 vnodes across the whole cluster.
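
The skew-versus-V claim is easy to reproduce in a few lines. A standalone sketch (pod and key names are arbitrary; the exact skew values will vary with them):

# vnode_skew.py — load skew (max/min keys per pod) as the vnode count V grows.
import bisect, hashlib
from collections import Counter

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

keys = [f"product-{i}" for i in range(50_000)]
pods = [f"pod-{i}" for i in range(10)]

for V in (1, 10, 50, 200):
    ring = sorted((h64(f"{p}#{v}"), p) for p in pods for v in range(V))
    positions = [pos for pos, _ in ring]
    load = Counter(ring[bisect.bisect(positions, h64(k)) % len(ring)][1] for k in keys)
    print(f"V={V:3d}  skew={max(load.values()) / min(load.values()):.2f}")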

Figure: a hash ring with 4 physical pods and 16 virtual nodes (V=4 per pod). Each pod owns four vnode positions spread around the ring (pod-0 → v0, v4, v8, v12; pod-1 → v2, v7, v10, v14; pod-2 → v1, v5, v9, v13; pod-3 → v3, v6, v11, v15). Keys walk clockwise to the first vnode they meet: k1 → v4/pod-0, k2 → v6/pod-3, k3 → v15/pod-3. Lookup is O(log N) on the sorted vnode array; adding a pod inserts new vnodes, and only the keys in those arc segments move.
Illustrative — the ring with virtual nodes. Each physical pod owns multiple positions on the ring; this is what flattens the load distribution. With V=128–256 virtual positions per pod, the load skew across physical pods is under 1.05× even for fleets of size 10–50.

A 70-line implementation: ring, jump, Maglev side-by-side

This script implements all three algorithms and measures, on the same key set, (a) load balance across pods and (b) disruption when one pod is added. The point is to make the cost-quality trade-off concrete in numbers.

# hashing_compare.py — ring vs jump vs Maglev, balance and disruption.
import bisect, hashlib
from collections import Counter

def h64(s):  # 64-bit stable hash (sha256 truncated)
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

# 1. Ring with virtual nodes.
def build_ring(pods, vnodes=200):
    ring = []
    for p in pods:
        for v in range(vnodes):
            ring.append((h64(f"{p}#{v}"), p))
    ring.sort()
    return ring

def ring_lookup(ring, key):
    pos = h64(key)
    idx = bisect.bisect(ring, (pos,))
    return ring[idx % len(ring)][1]

# 2. Jump hash (Lamping & Veach 2014).
def jump_lookup(key, num_buckets):
    k = h64(key) & 0xFFFFFFFFFFFFFFFF
    b, j = -1, 0
    while j < num_buckets:
        b = j
        k = (k * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((k >> 33) + 1)))
    return b

# 3. Maglev (Google 2016) — precomputed lookup table.
def build_maglev(pods, M=65537):  # M must be prime
    N = len(pods)
    perms = []
    for p in pods:
        offset = h64(f"{p}-off") % M
        skip = h64(f"{p}-skip") % (M - 1) + 1
        perms.append([(offset + i * skip) % M for i in range(M)])
    table = [-1] * M
    next_idx = [0] * N
    filled = 0
    while filled < M:
        for i in range(N):
            c = perms[i][next_idx[i]]
            while table[c] != -1:
                next_idx[i] += 1
                c = perms[i][next_idx[i]]
            table[c] = i
            next_idx[i] += 1
            filled += 1
            if filled == M: break
    return table, pods

def maglev_lookup(maglev, key):
    table, pods = maglev
    return pods[table[h64(key) % len(table)]]

# Measure balance and disruption on a fixed key set (fully deterministic — sha256, no RNG).
keys = [f"product-{i}" for i in range(50000)]
pods_a = [f"pod-{i}" for i in range(8)]
pods_b = [f"pod-{i}" for i in range(9)]   # added one pod

for name, build, lookup in [
    ("ring",   lambda ps: build_ring(ps),   ring_lookup),
    ("jump",   lambda ps: len(ps),          lambda n,k: f"pod-{jump_lookup(k,n)}"),
    ("maglev", lambda ps: build_maglev(ps), maglev_lookup),
]:
    sa, sb = build(pods_a), build(pods_b)
    bal = Counter(lookup(sa, k) for k in keys)
    moved = sum(1 for k in keys if lookup(sa,k) != lookup(sb,k))
    skew = max(bal.values()) / min(bal.values())
    print(f"  {name:7s}  load skew={skew:.3f}  moved on +1 pod={moved/len(keys):.1%}  ideal={1/9:.1%}")

Sample run:

  ring     load skew=1.094  moved on +1 pod=11.4%  ideal=11.1%
  jump     load skew=1.001  moved on +1 pod=11.1%  ideal=11.1%
  maglev   load skew=1.012  moved on +1 pod=11.6%  ideal=11.1%

Per-line walkthrough. build_ring hashes each pod 200 times to create virtual nodes, then sorts the resulting (position, pod) pairs. ring_lookup binary-searches the sorted array for the smallest position greater than the key's hash — the clockwise-walk in O(log N). jump_lookup is the Google-published algorithm: a tight loop that uses an LCG-style hash to "jump" the bucket assignment forward; it terminates when the next jump would exceed num_buckets. The genius is that the algorithm uses zero memory — it computes the bucket from scratch each time. build_maglev precomputes a permutation per pod, then fills a fixed-size lookup table by round-robin claim — each pod's permutation determines which slot it claims next; whoever claims a slot first owns it. maglev_lookup is O(1): hash the key, modulo the table size, read the bucket. The output shows all three give roughly ideal balance and roughly ideal disruption (1/N keys move when a pod is added). The differences are in lookup cost (ring O(log N), jump O(log N), Maglev O(1)), memory (ring ~N×V entries, jump 0, Maglev ~65k entries per service), and operational flexibility — discussed in the next section.

Why all three converge on ~11.1% disruption when you add a pod to an 8-pod fleet: the theoretical minimum is 1/N_new = 1/9 = 11.1%. Balance sets the floor — the new pod must end up with roughly its fair share 1/N_new of the keys, and those keys have to come from somewhere — and an algorithm that moves keys only onto the new pod (the monotonicity property) hits that floor almost exactly. Modulo hashing is balanced too, but it also shuffles keys between old pods, which is why it lands at N/(N+1) instead. Ring with V=200 vnodes ends up at 11.4% (slightly higher than the minimum because vnode positions are random, not optimal). Jump and Maglev both reach within a fraction of a point of optimal. The disruption floor is the same for all three; what differs is everything else.
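
One way to see the link concretely is to check where the moved keys land. A sketch, assuming the lines are appended to hashing_compare.py above (it reuses h64, build_ring, ring_lookup, jump_lookup, keys, pods_a, pods_b, and Counter from that script):

# Monotonicity check: when pod-8 joins, moved keys should land only on pod-8.
ring_a, ring_b = build_ring(pods_a), build_ring(pods_b)
ring_moved = Counter(ring_lookup(ring_b, k) for k in keys
                     if ring_lookup(ring_a, k) != ring_lookup(ring_b, k))
print("ring   moved keys land on:", dict(ring_moved))          # expect only {'pod-8': ...}

jump_moved = Counter(jump_lookup(k, 9) for k in keys
                     if jump_lookup(k, 8) != jump_lookup(k, 9))
print("jump   moved keys land on:", dict(jump_moved))          # expect only {8: ...}

mod_moved = Counter(h64(k) % 9 for k in keys if h64(k) % 8 != h64(k) % 9)
print("modulo moved keys scatter across", len(mod_moved), "pods")  # typically all 9
# Maglev is close but not strictly monotone: a handful of moved keys may also
# shuffle between old pods, which is why its disruption sits slightly above the 1/N floor.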

When to use each — the production decision tree

The three algorithms exist because they are not interchangeable. Each occupies a different point on the cost-quality-flexibility curve.

Ring (Karger / Chord) — use when nodes can be added and removed in arbitrary order. It is the only algorithm of the three where removing pod-3 from the middle of the fleet is a purely incremental update — delete its vnodes from the sorted structure and you are done (Maglev also allows arbitrary removal but pays a full table rebuild; jump hash does not allow it at all). Cassandra, Riak, DynamoDB internals, memcached client-side sharding (ketama), and Discord's session-routing layer all use ring-based consistent hashing for this reason. The cost is O(log N) per lookup, ~N×V memory (small — for N=100, V=200, that is 20k entries), and a non-trivial implementation. The benefit is full reconfiguration freedom.
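
The remove case described above can be measured with the same harness. A sketch, assuming it is appended to hashing_compare.py (reusing build_ring, ring_lookup, keys, pods_a, and Counter):

# Remove pod-3 from the middle of an 8-pod ring: only its keys should move.
ring_full  = build_ring(pods_a)
ring_minus = build_ring([p for p in pods_a if p != "pod-3"])
moved = [k for k in keys if ring_lookup(ring_full, k) != ring_lookup(ring_minus, k)]
assert all(ring_lookup(ring_full, k) == "pod-3" for k in moved)   # nobody else is disturbed
print(f"removed pod-3: {len(moved)/len(keys):.1%} of keys moved (ideal 1/8 = 12.5%)")
print("they spread to:", Counter(ring_lookup(ring_minus, k) for k in moved))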

Jump hash (Lamping & Veach 2014) — use when shards are append-only. Jump hash assumes shard IDs are {0, 1, ..., N-1}. Adding a shard means going from N to N+1 — perfect. Removing shard 3 from the middle is not directly supported; you can only "remove" the last shard. Production uses: Google's BigQuery resharding, sharded counters, Vitess range-based sharding for tail shards, anywhere shards can be append-grown but never punched-out from the middle. The benefit is zero memory and zero coordination: any client computes jump_lookup(key, N) from scratch as long as it knows N. The cost is the inflexibility on removal.
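
If a middle shard really does have to go, the usual workaround keeps jump hash append-only and adds a small indirection layer on top: jump picks a stable bucket index, and an explicitly managed bucket-to-shard map absorbs the removal. A minimal sketch (the map and shard names are illustrative, not from the paper; it reuses jump_lookup from the script above):

# Jump hash picks a stable bucket; the bucket -> shard map handles middle removals.
bucket_to_shard = ["shard-a", "shard-b", "shard-c", "shard-d"]   # index = jump bucket

def route(key):
    return bucket_to_shard[jump_lookup(key, len(bucket_to_shard))]

# Retire shard-c without renumbering buckets: repoint its bucket at a survivor.
bucket_to_shard[2] = "shard-d"
# Keys in bucket 2 now go to shard-d; every other key is untouched. The cost is
# skew (shard-d carries two buckets' worth of load); a real system rebalances by
# growing the bucket count, which jump hash handles cleanly.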

Maglev (Google 2016) — use when lookup latency must be O(1) and you can afford ~65k table entries per service. Maglev was designed for Google's L4 software load balancer running at line rate (10M+ packets per second per box), where every nanosecond per packet matters. The lookup is a hash, a modulo, and a table read — a handful of instructions, no search. The disruption when a backend changes is slightly worse than ring/jump (a Maglev table can move 2–3% more keys than the theoretical minimum on remove), but the constant-time lookup is decisive at high packet rates. Cloudflare's L4 LB (Unimog) uses a Maglev variant, and Envoy exposes Maglev as a selectable load-balancing policy alongside ring hash. The cost is the table rebuild on backend changes (~10ms for a 65k-entry table on commodity hardware) and the fixed memory.
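
The rebuild cost is easy to sanity-check with the toy implementation above (a measurement sketch; assume it is appended to hashing_compare.py, and note the ~10 ms figure is for an optimised C++ implementation — pure Python will be one to two orders of magnitude slower):

# Rough timing of a Maglev table rebuild for a 100-backend fleet.
import time
t0 = time.perf_counter()
table, _ = build_maglev([f"pod-{i}" for i in range(100)], M=65537)
print(f"build_maglev: {(time.perf_counter() - t0) * 1000:.0f} ms for M=65537, N=100")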

Three algorithms — cost-quality trade-off matrix

  algorithm   lookup cost       memory per service      disruption (+1 pod)    arbitrary remove?
  ring        O(log N)          N × V entries (~20k)    ~1/N                   yes
  jump        O(log N)          zero (recomputed)       1/N (optimal)          no — append-only
  Maglev      O(1) table read   ~65k entries, fixed     ~1/N (slight excess)   yes

  Production users: ring → Cassandra, Riak, Discord, ketama; jump → Google BigQuery, Vitess tail shards; Maglev → Google L4 LB, Cloudflare Unimog, Envoy.

Key trade-off: ring is the safe default for reconfiguration freedom; Maglev wins when packet rate matters; jump wins when shards are append-only and clients should hold zero state.
Illustrative — the three-way trade-off. The decision tree: do you need O(1) lookups at line rate (Maglev) → can shards be append-only (jump) → otherwise default to the ring with virtual nodes.

Why Maglev needs the table size M to be prime and large (M=65537 is the canonical pick): the table is filled by having each backend claim slots in a permutation order; the permutation is generated by (offset + i × skip) mod M. For this to be a true permutation (visit every slot exactly once), gcd(skip, M) = 1 must hold. The simplest way to guarantee that for arbitrary skip is to make M prime — every nonzero skip is coprime to a prime modulus. The table size also bounds the disruption: when a backend leaves, its slots are reclaimed by other backends in their permutation order; the smaller M is, the lumpier the redistribution. M=65537 (a Fermat prime) is large enough that the disruption is within ~2% of the theoretical minimum and small enough that a rebuild fits in ~10ms on commodity hardware. Google's paper uses M=65537 for fleets of up to ~1000 backends; for larger fleets, M=655373 (also prime) keeps disruption tight at ~10× the rebuild cost.
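
The coprimality requirement is easy to verify directly — a throwaway check, not from the paper:

# (offset + i*skip) % M visits every slot exactly once iff gcd(skip, M) == 1.
from math import gcd

def covers_all(offset, skip, M):
    return len({(offset + i * skip) % M for i in range(M)}) == M

print(covers_all(7, 12, 65537), gcd(12, 65537))   # True 1  — prime M: every skip is coprime
print(covers_all(7, 12, 65536), gcd(12, 65536))   # False 4 — composite M: only M/4 slots reached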

Going deeper

Karger's original use case — distributed web caches

Karger et al.'s 1997 paper Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web was motivated by a specific problem: distributing CDN cache lookups across a pool of cache servers without flash-crowd hot spots. The paper proves two properties of the ring with virtual nodes: balance (no cache holds more than (1+ε) K/N keys with high probability for any constant ε > 0 once V = Ω(log N)) and monotonicity (adding a cache moves at most O(K/N) keys). The proofs use balls-into-bins arguments adapted to the random-position structure. Akamai's CDN — co-founded by two of the paper's authors, Leighton and Lewin — was the first large-scale production deployment; the technique became the standard for client-side cache sharding when memcached's ketama library shipped in 2007.

Jump hash is one of the most elegant 10-line algorithms in distributed systems

Lamping & Veach's 2014 paper A Fast, Minimal Memory, Consistent Hash Algorithm has 4 pages of body and a 10-line algorithm. The core observation: when you add a bucket going from N to N+1, each key independently has probability 1/(N+1) of moving to the new bucket and probability N/(N+1) of staying. So if you simulate the "additions" in order — start with bucket 0, then potentially move to bucket 1 with probability 1/2, then potentially to bucket 2 with probability 1/3, and so on — a key's final bucket is exactly its consistent-hashing assignment. The clever bit is doing this in O(log N) instead of O(N) by using an LCG (linear congruential generator) seeded by the key to "jump" forward over consecutive non-moves. The expected number of jumps is O(log N). The algorithm uses zero memory — every client computes the bucket independently from (key, N) alone. The downside is the inflexibility on remove: jump hash maps to the bucket index 0..N-1, and removing bucket 3 from the middle requires renaming buckets 4..N-1 to 3..N-2, which is a coordination event.
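
The observation translates directly into a linear-time reference version. A sketch that uses Python's random module as the per-key PRNG — its bucket numbers will not match the bit-exact jump_lookup above, but the balance and the stay-or-move-to-the-new-bucket behaviour are the same:

import random

def jump_linear(key, num_buckets):
    rng = random.Random(key)                  # deterministic stream per key
    b = 0                                     # bucket when there is only 1 bucket
    for j in range(1, num_buckets):
        if rng.random() < 1.0 / (j + 1):      # key moves to new bucket j with prob 1/(j+1)
            b = j
    return b

# Growing from N to N+1 buckets either keeps the key in place or moves it to bucket N.
for n in range(1, 20):
    old, new = jump_linear("product-7", n), jump_linear("product-7", n + 1)
    assert new == old or new == n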

Maglev's permutation × table-fill is a clever way to amortise lookup cost

Eisenbud et al.'s 2016 NSDI paper Maglev: A Fast and Reliable Software Network Load Balancer describes the algorithm and the system. The algorithm trades one-time precomputation cost for O(1) lookups. Each backend has a permutation of the table indices 0..M-1 derived from two hashes (offset, skip). The table is filled by having each backend in turn claim its next-uncontested slot from its permutation. The order of "turns" is round-robin across backends. The result is that each backend ends up owning M/N slots, and the slot-ownership pattern is nearly identical when one backend is added or removed (because the permutations of the other backends are unchanged, and they fill the slots vacated by the absent backend in the same order they would have anyway). This gives near-minimal disruption with O(1) lookups. Maglev's deployment story is the more interesting part: Google's L4 LB runs Maglev on commodity x86 servers, processes 10M+ packets per second per box, and uses M=65537 for fleets up to ~1000 backends.
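
The "nearly identical table" claim can be checked with the toy implementation from earlier. A sketch, assuming it is appended to hashing_compare.py (absolute percentages will vary with the pod names):

# Fraction of Maglev table slots that change owner when one backend leaves.
pods_full = [f"pod-{i}" for i in range(10)]
pods_less = pods_full[:-1]                                 # drop pod-9
table_full, _ = build_maglev(pods_full)
table_less, _ = build_maglev(pods_less)
# Table entries index into the pod list passed in, so compare by pod name.
changed = sum(1 for a, b in zip(table_full, table_less) if pods_full[a] != pods_less[b])
print(f"slots reassigned: {changed/len(table_full):.1%}  (theoretical minimum 1/10 = 10.0%)")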

Reproduce this on your laptop

# Run the simulator from above:
python3 -m venv .venv && source .venv/bin/activate
python3 hashing_compare.py
# Expected (50000 keys, 8→9 pods; fully deterministic, no seed needed):
#   ring     load skew=1.094  moved on +1 pod=11.4%  ideal=11.1%
#   jump     load skew=1.001  moved on +1 pod=11.1%  ideal=11.1%
#   maglev   load skew=1.012  moved on +1 pod=11.6%  ideal=11.1%

# Inspect ketama (memcached client-side ring) in Python:
pip install pymemcache hash_ring
python3 -c "from hash_ring import HashRing; \
  r = HashRing(['cache-1', 'cache-2', 'cache-3']); \
  print([(k, r.get_node(k)) for k in ['session-42', 'cart-99', 'product-7']])"

# Cassandra-style virtual-node configuration — read-only inspection of the stock
# config shipped in the official image (it lives under /etc/cassandra there):
docker run --rm cassandra:4.1 \
  grep -E '^#? ?(num_tokens|allocate_tokens_for_local_replication_factor):' \
  /etc/cassandra/cassandra.yaml
# Cassandra defaults to V=16 since 4.0; older clusters used V=256.

Where this leads next

Consistent hashing solves the "stable mapping" problem for stateless and sticky lookups alike. The same machinery shows up at three other layers of distributed systems: replication (where to put the N replicas of a key), partitioning (how to split a database across shards), and rendezvous (highest-random-weight) hashing — an alternative to the ring that the next chapter covers.

After Part 6 closes, Part 7 (reliability patterns) revisits picker correctness from the failure-handling angle — what happens when the consistently-hashed pod is unreachable.

References

  1. Karger et al., "Consistent Hashing and Random Trees" — STOC 1997 — the foundational paper; introduces the ring and proves balance + monotonicity.
  2. Stoica et al., "Chord: A Scalable Peer-to-peer Lookup Service" — SIGCOMM 2001 — the DHT generalisation; the canonical reference for ring-with-virtual-nodes.
  3. Lamping & Veach, "A Fast, Minimal Memory, Consistent Hash Algorithm" — arXiv 2014 — the jump-hash paper; ten-line algorithm, zero memory.
  4. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" — NSDI 2016 — the Maglev paper; production L4 LB at Google.
  5. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store" — SOSP 2007 — Dynamo's use of ring-with-vnodes; influential on Cassandra and Riak.
  6. Cloudflare, "Unimog — Cloudflare's edge load balancer" (2020 blog) — Maglev-variant in production at edge.
  7. Power of two choices (P2C) — internal companion. The previous chapter; what to use when stickiness is not required.
  8. Wall: many instances → load balancing decisions — internal companion. The wall this chapter sits beneath; explains why heterogeneity is the underlying problem all pickers face.