In short
You have ten cache servers and a billion keys. The textbook placement is server_index = hash(key) % 10 — two lines of code, perfectly uniform, trivially parallel. Add an eleventh server and the formula becomes hash(key) % 11. Around 91% of your keys now map to a different server, because h % 10 and h % 11 agree only when both reduce to the same small remainder. The cluster spends the next hour shuffling data it already had, the hit rate collapses to zero, and a routine capacity bump looks like an outage.
Consistent hashing replaces the modulo with a different geometry. Imagine the 64-bit hash space as a circle from 0 to 2^64 − 1. Hash each node to a position on the circle. Hash each key to a position on the circle. Declare that a key is owned by the next node clockwise from its hash. Adding a new node inserts one new position on the circle; only the keys whose hashes fall in the arc between that position and the previous node clockwise move. On a ring of N nodes with random positions, an arrival or departure touches on average 1/N of all keys, not nearly all of them. Removing a node hands its arc to its clockwise neighbour. No other key sees any change.
The basic scheme has a flaw: N random positions on a circle are not evenly spaced. With only three nodes, one node often owns 55% of the ring and another 15%. Virtual nodes fix this. Each physical node is represented by K positions on the circle — typically 128 or 256. The law of large numbers takes over: the union of 256 random arcs is close to 1/N of the circle, and the arcs are shared across all existing nodes, so rebalance traffic on a join flows from every node rather than just one neighbour. Amazon Dynamo (2007), Apache Cassandra, Amazon DynamoDB, Riak, memcached clients, and Envoy's ring-hash load balancer all implement some flavour of this. This chapter derives the ring from first principles, writes the full algorithm in under 40 lines of Python, explains why virtual nodes are not optional in practice, and connects the scheme back to Dynamo's preference lists.
Before Build 10 of this course, every database you saw had one answer to "where does this row live": a single authoritative leader. Dynamo (chapter 75) threw the leader away and declared every node equal. That decision left a question behind. If there is no leader, how does a client decide which three nodes among the cluster's fifty hold a given key's replicas? The answer cannot be a lookup table, because every node must reach the same answer independently. The answer cannot be hash(key) % 50 either, because the next page shows why.
The answer is consistent hashing. Without it, Dynamo does not work, Cassandra does not scale, and DynamoDB has no story for capacity changes.
Naive hash-modulo-N — why it fails
The simplest way to shard data across N machines is server = hash(key) % N. It is what every junior engineer reaches for, and for a fixed N it is excellent. Hash functions like MD5 or xxHash are close to uniformly distributed on their output, so % N gives each server roughly 1/N of the keys. Lookups are O(1). No metadata. No coordination. As long as N never changes, the scheme is unbeatable.
The problem appears the moment N changes. Suppose you start with N=10 servers. You decide to add an eleventh. For every key, the new assignment is hash(key) % 11. The keys that stay put are those where hash(key) % 10 == hash(key) % 11 — exactly the keys where both moduli happen to yield the same remainder. For random hash values this happens with probability around 1/11. Concretely: roughly 91% of your keys now live on the wrong server.
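The arithmetic is easy to check empirically. A throwaway sketch (hypothetical key names) that counts how many keys change servers when N goes from 10 to 11:

```python
import hashlib

def server_for(key: str, n: int) -> int:
    # Uniform mixer; which hash you use does not matter for this point.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if server_for(k, 10) != server_for(k, 11))
print(f"{moved / len(keys):.1%} of keys moved")  # roughly 91%
```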
The cost is not theoretical. For a distributed cache like memcached, "wrong server" means "miss". Every cache request after the resize checks a server that does not hold the key, falls back to the database, and the database — which had been handling perhaps 1% of traffic behind a hot cache — suddenly handles 100% and falls over. For a sharded database, the cost is the time to physically move 91% of the data. On a 100 TB cluster, that is 91 TB of inter-node traffic for the crime of provisioning one new box.
Why % N is brittle: the modulo function has no structural memory of what used to be where. Every key's home is recomputed from scratch under the new N. Consistent hashing is the minimum structural change that preserves most of the old assignment when N changes — that is literally the property "consistent" refers to.
Before consistent hashing, production systems lived with this pain by over-provisioning or by pre-sharding — fixing a logical shard count (say 1024) on day one and mapping shards to nodes via a table. Pre-sharding works but pins you to the initial shard count. Consistent hashing replaces the workaround with one algorithm that handles arbitrary N changes.
The consistent-hash ring
Picture the output space of a 64-bit hash function as a number line from 0 to 2^64 − 1, then bend it into a circle so that 2^64 meets 0. Every hashable thing — a key, a node identifier — lives at a single point on this circle. You will use the circle in one direction only: clockwise.
Place each node on the ring by hashing some stable identifier for it. node-a hashes to, say, position 0x3F00...; node-b to 0xA100...; node-c to 0xC700.... These three positions partition the circle into three arcs. Declare a rule: every key is owned by the first node encountered by walking clockwise from the key's hash. Equivalently, each node owns the arc of the ring that ends at its own position and starts at the next-previous node's position, walking clockwise.
A key hashes to 0x6200. Walking clockwise from 0x6200, the next node is node-b at 0xA100. node-b owns the key. Another key at 0xB500 finds node-c at 0xC700 next. A key at 0xE900 wraps past 2^64 back to 0 and finds node-a at 0x3F00; node-a owns it. Every key gets exactly one owner.
Now consider what happens when a node joins or leaves.
A node joins. Suppose node-d arrives and hashes to 0x5800. Walking clockwise from 0x5800 you hit node-b at 0xA100, which means node-d sits inside the arc that node-b used to own. Before the join, keys in the arc 0x3F00–0xA100 belonged to node-b. After the join, keys in 0x3F00–0x5800 now go to node-d (the new clockwise next), and keys in 0x5800–0xA100 still go to node-b. Every other arc is untouched. node-d asks node-b for a copy of the keys it now owns; node-b forgets them; ownership has transferred; nobody else saw any change. The amount of data that moved is one arc's worth — roughly 1/4 of the ring on average, since N went from 3 to 4.
A node leaves. Suppose node-c at 0xC700 departs. The arc it owned, 0xA100–0xC700, merges into the arc that its clockwise neighbour (back to node-a at 0x3F00, wrapping past 0) owned. node-a inherits node-c's keys. No other node is affected. Again, one arc's worth of data moves — roughly 1/3 of the ring before the departure.
Why only 1/N of keys move: on a random ring of N nodes, the expected arc length per node is 1/N of the circumference. A join splits one existing arc into two; by symmetry the new node ends up owning an expected 1/(N+1) of the circle — one fair share among the N+1 nodes now present. So the fraction of keys that move on a single join is Θ(1/N), not Θ(1). That is the entire value proposition of the ring over plain modulo.
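The claim is easy to verify with a throwaway simulation (node and key names are made up): place 50 nodes on a ring, record every key's owner, add a 51st node, and count reassignments.

```python
import hashlib
import bisect

def h64(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def owner(ring: dict, positions: list, key_hash: int) -> str:
    # Next node clockwise from the key's hash, wrapping at the top.
    idx = bisect.bisect(positions, key_hash) % len(positions)
    return ring[positions[idx]]

ring = {h64(f"node-{i}"): f"node-{i}" for i in range(50)}
positions = sorted(ring)
key_hashes = [h64(f"key:{i}") for i in range(20_000)]
before = [owner(ring, positions, kh) for kh in key_hashes]

ring[h64("node-50")] = "node-50"   # one node joins
positions = sorted(ring)
after = [owner(ring, positions, kh) for kh in key_hashes]

moved = sum(a != b for a, b in zip(before, after))
print(moved / len(key_hashes))  # a small fraction, on the order of 1/51
```

Every key that moves lands on node-50 — exactly the arc-transfer behaviour described above.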
Two operational consequences matter. First, the ring is computed independently by every node. Every node that agrees on the node list and hash function reaches the same ownership map; no central lookup service is needed. Dynamo keeps the node list in sync with gossip (chapter 78). Second, lookups are O(log N) per key — binary search into a sorted list of N hash positions. With N = 1000 nodes that is ten comparisons per key, fast enough that every client does it inline on the request path.
Python implementation
The whole scheme fits in a small class.
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self):
        self.ring = {}            # hash_position -> node_id
        self.sorted_hashes = []   # sorted list of positions, for bisect

    def _hash(self, s):
        # MD5 is fine for placement: not a security boundary, just a mixer.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        h = self._hash(node)
        self.ring[h] = node
        self.sorted_hashes = sorted(self.ring.keys())

    def remove_node(self, node):
        h = self._hash(node)
        del self.ring[h]
        self.sorted_hashes = sorted(self.ring.keys())

    def get_node(self, key):
        if not self.sorted_hashes:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
        return self.ring[self.sorted_hashes[idx]]
Thirty lines. Every method does exactly what the prose said. add_node hashes the node identifier to a position and records it. get_node hashes the key, then uses bisect.bisect to find the first ring position strictly greater than the key's hash — that is the "next clockwise" — with the % len at the end to wrap around 2^64 back to the smallest ring position. remove_node is symmetric.
Why bisect.bisect and not bisect_left: bisect (a.k.a. bisect_right) returns the insertion point after any equal elements, which is what "strictly next clockwise" means. If a key happens to hash to exactly a node's position, it goes to the next node, not to that node; the convention is arbitrary but has to be fixed. With bisect_left, a key at a node's exact position would map to that node, which is equally valid. What matters is that every implementation in the cluster chooses the same convention.
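A three-line check with toy positions makes the convention visible:

```python
import bisect

positions = [10, 20, 30]                  # toy ring positions
print(bisect.bisect(positions, 20))       # 2: a key at exactly 20 goes to 30
print(bisect.bisect_left(positions, 20))  # 1: would map that key to 20 itself
print(bisect.bisect(positions, 35) % 3)   # 0: past the top, wrap to 10
```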
Using it looks like this:
ring = ConsistentHashRing()
for node in ["alpha", "beta", "gamma"]:
    ring.add_node(node)

print(ring.get_node("user:42"))     # -> one of alpha/beta/gamma
print(ring.get_node("cart:priya"))  # -> one of alpha/beta/gamma

ring.add_node("delta")
# Only keys in delta's arc change owner; the rest stay put.
MD5 is chosen for placement because it mixes well and is available in every standard library; you are not using it as a security hash here. SHA-1 or xxHash or Murmur3 would work identically. What matters is that the hash has a uniform distribution over 2^64 and that every node in the cluster agrees on which hash it is.
The load-imbalance problem
The math that says "a join moves 1/N of keys" assumes node positions are random and roughly uniform. For large N, they are. For small N, they are not.
Suppose N = 3. The three arcs between consecutive points have expected length 1/3 of the circumference, but the variance is enormous — it is not uncommon for one arc to be 55% of the circle and another to be 15%. The node owning the 55% arc receives nearly four times as many keys as the one owning 15%. One server's disks fill up; another's sit at 30%. The cluster's capacity is gated by the hottest node, so you run at 40% of paid-for capacity.
The hash function is not broken; it is uniform by construction. The problem is that three uniform samples from a circle are not evenly spaced. You need many samples before the gaps concentrate near the mean. For N uniform points, the expected ratio of longest arc to mean arc is roughly H(N), the N-th harmonic number — about 3.0 at N = 10, about 5.2 at N = 100. Factor-of-three hot spots are normal with basic consistent hashing.
Why a better hash will not help: the hash already produces uniform output. The variance comes from the small number of samples, not from bias in the sampler. Swapping MD5 for SHA-256 or xxHash changes nothing — the cure has to change the sampling, not the sampler.
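A simulation independent of any hash function shows the sampling effect directly: draw N uniform points on a unit circle and compare the longest gap to the mean gap.

```python
import random

def longest_to_mean_arc(n: int, trials: int = 500) -> float:
    random.seed(42)  # any seed; the effect is not seed-sensitive
    total = 0.0
    for _ in range(trials):
        pts = sorted(random.random() for _ in range(n))
        arcs = [b - a for a, b in zip(pts, pts[1:])]
        arcs.append(1.0 - pts[-1] + pts[0])   # wrap-around arc
        total += max(arcs) / (1.0 / n)
    return total / trials

print(round(longest_to_mean_arc(10), 2))    # near H(10) ≈ 2.9
print(round(longest_to_mean_arc(100), 2))   # near H(100) ≈ 5.2
```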
The cure is virtual nodes.
Virtual nodes
Instead of hashing each physical node to one position, hash it to K positions, usually 128 or 256. Each virtual node is a separate point on the ring; each physical node owns the union of its K arcs. With three physical nodes and K = 256, the ring has 768 positions, and the long-arc-to-mean-arc ratio drops to around 1.5 — a 50% overload rather than 400%. At K = 256 and N = 10, you get within 10% of perfect balance on most clusters.
The law of large numbers is doing the work. The sum of 256 random arc lengths concentrates tightly around its expected value. You are averaging out placement variance by taking many samples per node.
Cassandra's default is 256 tokens per node; older versions used 1 per node and required hand-picked tokens, which was a frequent operational pain. Amazon DynamoDB uses virtual nodes internally (the specific K is not public). Riak uses a fixed ring of 64 or 128 partitions mapped onto physical nodes — a slightly different scheme. Envoy's ring-hash load balancer exposes K as minimum_ring_size, default 1024.
There is no magic at K = 256; it is a pragmatic choice. Larger K is more uniform but uses more memory. Smaller K is cheaper but lumpier. Most production systems pick 128 or 256 and stop worrying.
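The effect of K can be measured with a short script (hypothetical node and key names, MD5 as before): build a ring with K vnodes per physical node, assign keys, and compare the hottest node to the perfectly even share.

```python
import hashlib
import bisect
from collections import Counter

def hottest_to_mean(num_nodes: int, k: int, num_keys: int = 20_000) -> float:
    # Ring with k vnodes per physical node; returns
    # (keys on the hottest node) / (perfectly even share).
    ring = {}
    for n in range(num_nodes):
        for i in range(k):
            pos = int(hashlib.md5(f"node-{n}#{i}".encode()).hexdigest(), 16)
            ring[pos] = n
    positions = sorted(ring)
    counts = Counter()
    for j in range(num_keys):
        h = int(hashlib.md5(f"key:{j}".encode()).hexdigest(), 16)
        counts[ring[positions[bisect.bisect(positions, h) % len(positions)]]] += 1
    return max(counts.values()) / (num_keys / num_nodes)

for k in (1, 16, 256):
    print(k, round(hottest_to_mean(10, k), 2))  # ratio shrinks as k grows
```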
Adding and removing nodes with vnodes
The behaviour of joins and leaves improves once vnodes are in the picture.
Adding a node. With K = 256 vnodes, a join adds 256 tokens scattered randomly around the ring. Each token takes over a small arc from its clockwise neighbour, and those 256 neighbours are almost always different physical nodes. The 1/N of keys that move flow from roughly 256 source nodes in parallel, rather than all flowing from one neighbour.
When you add a node to a production Cassandra cluster, the new node pulls data from existing nodes. If all of it came from one neighbour, that neighbour's disk and network would saturate and the join would take days. Spreading the pull across 256 sources means each contributes a few hundred MB; the join finishes in hours.
Removing a node. Symmetric. The departing node's 256 tokens hand their arcs to 256 different clockwise neighbours; the destination load is spread.
Why this is the real reason vnodes are not optional: load balance is one reason; rebalance parallelism is the bigger one. Without vnodes, a join saturates one link and takes a day; with vnodes, it saturates 256 links briefly and takes an hour.
Heterogeneous node capacities
Virtual nodes have a second convenience. Suppose one machine is twice the size of the others — eight SSDs instead of four, twice the RAM. In the basic ring each node owns 1/N of the data, so the big machine's extra disk sits idle.
With vnodes it is one line. Give the big node 2K tokens instead of K. It occupies twice as many ring positions, owns twice as much hash space, receives twice as much data. Load is proportional to token count, and token count is a configuration value.
Cassandra exposes this as num_tokens in cassandra.yaml. The default is 256; raise it on larger machines. A mixed cluster of 1 TB and 2 TB hosts runs them at 256:512 tokens and keys distribute accordingly. No code changes, no new protocol, just a count per node.
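The proportionality is easy to demonstrate with a throwaway script (node names and token counts are made up for illustration): two small nodes at 256 tokens each, one big node at 512.

```python
import hashlib
import bisect
from collections import Counter

ring = {}
for node, tokens in [("small-1", 256), ("small-2", 256), ("big-1", 512)]:
    for i in range(tokens):
        ring[int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16)] = node

positions = sorted(ring)
counts = Counter()
for j in range(40_000):
    h = int(hashlib.md5(f"order:{j}".encode()).hexdigest(), 16)
    counts[ring[positions[bisect.bisect(positions, h) % len(positions)]]] += 1

print(counts)  # big-1 near 20000 keys; each small node near 10000
```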
Python implementation with vnodes
The extension is short.
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self):
        self.ring = {}
        self.sorted_hashes = []

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=256):
        for i in range(vnodes):
            h = self._hash(f"{node}#{i}")
            self.ring[h] = node
        self.sorted_hashes = sorted(self.ring.keys())

    def remove_node(self, node):
        self.ring = {h: n for h, n in self.ring.items() if n != node}
        self.sorted_hashes = sorted(self.ring.keys())

    def get_node(self, key):
        if not self.sorted_hashes:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
        return self.ring[self.sorted_hashes[idx]]
The only change from the plain version: add_node loops K times, hashing node#0, node#1, ..., node#K-1 to produce K distinct positions for the same physical node. The dictionary maps each position to the physical node identifier. get_node is unchanged — it still finds the next clockwise position and returns whichever physical node owns it. remove_node drops every ring entry for the node; as written that scans all NK entries, which is fine at these sizes and easy to index if it ever is not.
To give a bigger node more weight, call add_node("big-host", vnodes=512). Same data structure, proportional load.
The preference-list pattern
Consistent hashing as described gives you one owner per key. Dynamo, Cassandra, and every other leaderless system need several owners per key — the replicas. Dynamo calls the replication factor N (an unfortunate collision with the cluster size; they are different numbers). The extension is trivial once you have the ring.
Walk clockwise from the key's hash position and collect the next N distinct physical nodes (skipping over additional vnodes of nodes you already saw). That ordered list is the key's preference list. The first entry is sometimes called the primary or coordinator; the others are replicas; quorum reads query R of them and quorum writes wait for W of them to ack (see chapter 75).
def preference_list(self, key, n):
    if not self.sorted_hashes:
        return []
    h = self._hash(key)
    idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
    seen, result = set(), []
    for i in range(len(self.sorted_hashes)):
        pos = self.sorted_hashes[(idx + i) % len(self.sorted_hashes)]
        node = self.ring[pos]
        if node not in seen:
            seen.add(node)
            result.append(node)
            if len(result) == n:
                return result
    return result
The walk is O(NK) in the worst case, but on a well-distributed ring the first n distinct physical nodes appear within a handful of steps, so the typical cost per key is close to n.
This is the natural extension point for rack-aware placement (chapter 77). If the first N clockwise physical nodes all live in one rack, a rack failure loses every replica. Production systems add a rack-distinctness constraint to the walk: "skip additional nodes in the same rack as nodes already in the list". Same ring, richer rule.
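A hypothetical sketch of that richer rule, as a standalone function rather than a method, assuming a rack_of mapping from node to rack (it returns fewer than n nodes if there are fewer than n distinct racks):

```python
import hashlib
import bisect

def rack_aware_preference_list(ring, sorted_hashes, rack_of, key, n):
    # Same clockwise walk as before, but skip a node whose rack is
    # already represented. rack_of is a {node: rack} dict (assumed input).
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect.bisect(sorted_hashes, h) % len(sorted_hashes)
    result, seen_racks = [], set()
    for i in range(len(sorted_hashes)):
        node = ring[sorted_hashes[(idx + i) % len(sorted_hashes)]]
        if node in result or rack_of[node] in seen_racks:
            continue
        result.append(node)
        seen_racks.add(rack_of[node])
        if len(result) == n:
            break
    return result
```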
A five-node cluster, then six
Start with five physical nodes — A, B, C, D, E — each with K = 128 vnodes. The ring has 640 positions scattered over 2^64. Each physical node owns roughly 1/5 = 20% of the key space.
Simulate it: create the ring, insert ten thousand synthetic keys like cart:0 through cart:9999, and count which physical node each is assigned to. You get something close to {A: 2006, B: 1987, C: 2014, D: 1998, E: 1995}. Variance is small.
Add a sixth node F. F contributes 128 new vnodes. Rerun the assignment; each existing node loses about 1/6 of its keys to F, landing near {A: 1678, B: 1642, C: 1701, D: 1660, E: 1651, F: 1668}. Total moved is around 1670 keys, roughly one-sixth of 10000, exactly as predicted.
Trace who gave up keys. Because F's 128 vnodes are scattered, each takes its small arc from whichever existing vnode was clockwise-next — a random draw from all five existing physical nodes. So F's intake is sourced from A, B, C, D, E in roughly equal parts: about 334 keys from each. Rebalance traffic is parallel across every existing node.
Now remove node C. Its 128 vnodes disappear; each arc merges into the clockwise-next vnode, which belongs to a random physical node among A, B, D, E, and F. So C's roughly 1700 keys redistribute across the other five, about 340 to each. Nobody else saw a change.
What you just demonstrated. Joins and leaves touch exactly the 1/N fraction the theory promises, and the ring distributes rebalance traffic across all existing nodes. No physical node becomes a bottleneck during the operation.
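The whole walkthrough compresses into one script. Exact counts will differ from the figures above — they depend on the hash and the key names — but the shape is the same: each node near an even share, roughly a sixth of the keys moving, all of them landing on F, sourced from all five old nodes.

```python
import hashlib
import bisect
from collections import Counter

def hpos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, k=128):
    ring = {hpos(f"{n}#{i}"): n for n in nodes for i in range(k)}
    return ring, sorted(ring)

def assign(ring, positions, keys):
    return {key: ring[positions[bisect.bisect(positions, hpos(key)) % len(positions)]]
            for key in keys}

keys = [f"cart:{i}" for i in range(10_000)]

ring5, pos5 = build_ring(list("ABCDE"))
before = assign(ring5, pos5, keys)
print(sorted(Counter(before.values()).items()))   # each node near 2000

ring6, pos6 = build_ring(list("ABCDEF"))
after = assign(ring6, pos6, keys)
moved = [key for key in keys if before[key] != after[key]]
print(len(moved))                                 # near 10000/6
print(sorted(Counter(before[key] for key in moved).items()))  # all five sources
```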
Common confusions
- "Consistent hashing gives perfectly even distribution." Only with virtual nodes, and only approximately. Plain consistent hashing with one position per node is significantly uneven — the longest arc can be 3-5× the mean. Virtual nodes at K = 256 bring the spread within 10% or so. Perfectly even requires either a very large K or an explicit rebalance protocol on top of the ring.
- "Adding a node doubles capacity." Adding a node adds one node's worth of capacity — if you had five nodes and add a sixth, you have 6/5 = 1.2× the capacity, not 2×. The ring controls placement; it does not change per-node resources. Saying "the ring distribution scales" is accurate; saying "adding a node doubles capacity" is not.
- "You need a central coordinator to maintain the ring." No. Every node can compute ring assignments independently given the agreed-upon node list. You do need a mechanism — typically gossip (chapter 78) — for the node list itself to stay consistent across the cluster, but that is a small metadata problem, not a per-request lookup.
- "Consistent hashing prevents hotspots." It distributes keys uniformly over the key space; it does nothing about per-key skew. If one specific key is requested a million times a second, all its replicas are hot regardless of how beautifully the ring is balanced. For per-key hotspots, the answer is caching, request coalescing, or explicit per-key replicas — not the placement algorithm.
- "MD5 is insecure so I should use SHA-256." For placement, the hash is a mixer, not a cryptographic primitive. No adversary is forging collisions to disrupt your key distribution. Use SHA-256 for security; use MD5 or xxHash for placement.
- "Consistent hashing and pre-sharding are the same thing." Related but different. Pre-sharding picks a fixed shard count (say, 1024) on day one and maps shards to nodes via a lookup table. The count is fixed forever. Consistent hashing makes the shard count effectively infinite and lets virtual nodes act as a dynamic per-node shard count.
Going deeper
Jump consistent hashing (Lamping and Veach, 2014)
Google published a scheme that solves the same problem without a ring or any per-node state. It is a closed-form function jump_hash(key, num_buckets) → bucket_id that runs in O(log N) time and uses zero memory, using a pseudorandom number generator seeded from the key.
Jump hashing is excellent for stateless sharding — load-balancer-style "pick a backend" problems — because it needs no ring synchronisation. It does not handle node removal in the middle of the range, only at the end. For databases where any node can leave, the ring is still the right structure.
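The whole algorithm fits in a few lines; this is a direct Python transcription of the pseudocode in the Lamping-Veach paper, followed by a check of the 1/(N+1) movement property.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    # Each loop iteration computes the next bucket the key "jumps" to as
    # the bucket count grows; the answer is the last jump below num_buckets.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

moved = sum(jump_hash(k, 10) != jump_hash(k, 11) for k in range(100_000))
print(moved / 100_000)   # close to 1/11 — the consistent-hashing guarantee
```

Note that every key that moves when the bucket count grows from 10 to 11 moves to bucket 10, the new last bucket — which is exactly why jump hashing cannot remove a bucket from the middle of the range.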
Rendezvous hashing (Thaler and Ravishankar, 1998)
Also called Highest Random Weight (HRW) hashing. For each key K and each node N_i, compute hash(K, N_i); the key is owned by the node with the highest hash value. Rendezvous gives perfectly uniform distribution for any N, no vnodes needed, and touches exactly 1/N of keys on join or leave. The cost is O(N) per lookup rather than O(log N). For small N (under 100), rendezvous is often simpler than a vnode ring; for large N, the ring wins.
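Rendezvous hashing is even shorter than the ring. A sketch with made-up node names, plus a demonstration that removing a node strands only that node's keys:

```python
import hashlib

def rendezvous_owner(key, nodes):
    # Highest Random Weight: score every (key, node) pair, keep the max.
    def score(node):
        return int(hashlib.md5(f"{key}|{node}".encode()).hexdigest(), 16)
    return max(nodes, key=score)

nodes = ["alpha", "beta", "gamma", "delta"]
keys = [f"k:{i}" for i in range(10_000)]
before = {k: rendezvous_owner(k, nodes) for k in keys}
survivors = [n for n in nodes if n != "gamma"]
after = {k: rendezvous_owner(k, survivors) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved) / len(keys))                    # near 1/4
print(all(before[k] == "gamma" for k in moved))  # True: only gamma's keys moved
```

The O(N)-per-lookup cost is visible in the code: every lookup scores every node.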
Range-based sharding versus hash-based
Not every system hashes keys at all. Range-based sharding divides the key space into contiguous ranges, with a coordinator maintaining the range table. Bigtable, HBase, CockroachDB, TiKV, and Spanner all use range-based sharding.
The trade-off is clean. Hash-based placement distributes keys uniformly, which is great for load balance but destroys locality — range scans become cluster-wide queries. Range-based placement preserves locality — users A through M live together — but needs a coordinator to manage hot ranges. Dynamo-style systems chose hash; Spanner-style systems chose range.
Chord and the academic lineage
Consistent hashing was invented by Karger et al. in 1997 for distributed web caching. Four years later, Stoica et al. published Chord (SIGCOMM 2001), a peer-to-peer lookup protocol built on the same ring with finger tables giving O(log N) lookups at P2P scale without global ring knowledge. Chord is not used in Dynamo-style databases — their N is small enough that every node holds the full ring — but it influenced BitTorrent's DHT and a generation of P2P systems.
Where this leads next
Consistent hashing is the placement layer. Chapter 77 covers replica placement and rack awareness — how Dynamo's preference-list walk is extended to spread replicas across failure domains so no single rack or datacenter takes down all copies of a key. Chapter 78 covers gossip and membership — how all nodes in a cluster stay in agreement about the node list that feeds the ring. Chapter 79 covers anti-entropy and Merkle trees — the background reconciliation that repairs the replicas the ring is routing writes to. Every one of these mechanisms assumes the ring exists and is cheap to query; every one adds engineering on top of the basic placement decision this chapter describes.
And then: sloppy quorums (ch.80) — what happens when the preference list computed from the ring contains nodes that are temporarily unreachable and the coordinator has to improvise. That chapter is where consistent hashing meets availability head-on. The ring tells you where data should live; sloppy quorum tells you what to do when should and can disagree.
References
- Karger, Lehman, Leighton, Levine, Lewin, and Panigrahy, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, STOC 1997 — the original consistent-hashing paper. Developed for web caching at what would become Akamai; the ring geometry, the 1/N movement property, and the name all come from this work.
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — Dynamo's adoption of consistent hashing with virtual nodes, and the first paper to work out the preference-list pattern on top of the ring.
- Ellis, Virtual Nodes Strategies in Cassandra, DataStax Engineering Blog 2012 — the practical case for 256-vnodes-per-node in Cassandra 1.2, with measurements of rebalance times with and without vnodes on production-scale clusters.
- Lamping and Veach, A Fast, Minimal Memory, Consistent Hash Algorithm, arXiv 2014 — Google's jump consistent hashing: closed-form, O(log N) time, zero memory. The canonical alternative to ring-based consistent hashing for stateless sharding.
- Thaler and Ravishankar, A Name-Based Mapping Scheme for Rendezvous, University of Michigan Technical Report 1996 — the original rendezvous (HRW) hashing paper. Published before the 1997 Karger et al. paper and offering the same 1/N movement guarantee through a completely different algorithm.
- Stoica, Morris, Karger, Kaashoek, and Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM 2001 — the Chord paper. Extends consistent hashing with finger tables for O(log N) lookups at P2P scale. Influenced DHT designs across BitTorrent, I2P, and a generation of peer-to-peer systems.