Bounded-load consistent hashing

It is the IPL final, 47th over, three balls left. CricStream's video-segment cache is serving 22M concurrent viewers. The fleet is 240 cache pods behind a consistent-hashing LB; the lookup is ring_lookup(segment_id) and the system has run beautifully for two seasons. Then the over ends with a six, the highlight clip is generated, and segment-clip-final-six.ts becomes the most-requested object on the internet for the next 90 seconds. That single segment-id hashes to one ring position, owned by cache-pod-117. Within 8 seconds the pod's egress NIC is at line rate, its CPU is pinned at 99% on TLS handshakes, and the p99 for every other key on that pod has blown out to 4.2 s. The other 239 pods are at 18% CPU. Aditi, on call, watches one pod sweat while the rest of the fleet idles. This chapter is about bounded-load consistent hashing (Mirrokni, Thorup, Zadimoghaddam, 2017) — the algorithm that closes exactly this gap, caps every pod at (1+ε) of the average load, and spills overflow to the next pod on the ring.

Plain consistent hashing balances key counts, not request load. A single hot key can pin one pod. Bounded-load consistent hashing (Mirrokni et al., 2017) adds a per-pod capacity cap of (1+ε) × avg_load and, when a key's primary pod is full, walks the ring to the next pod with capacity. The result: stickiness is preserved for cold keys, hot keys spread to a small handful of pods, and no pod ever exceeds (1+ε) × avg — provably.

The hot-key problem plain consistent hashing cannot solve

Plain consistent hashing — ring, jump, Maglev — solves the key-to-pod mapping stability problem. Adding a pod moves ~1/N of the keys; balance across pods is good when the keys are uniformly distributed in popularity. The unstated assumption is that every key gets roughly equal request volume. Production traffic violates this constantly:

CricStream during a final: one segment-id at the end of an over gets 40× the average segment's traffic for ~90 seconds.
BharatBazaar Diwali sale: one product page (the discounted iPhone) gets 200× the average product's traffic.
PlayDream toss-time spike: one match's lineup-cache key gets 1000× the average match's traffic in a 30-second window.
AutoGo airport surge: one geohash cell (the airport pickup point) gets 80× the surrounding cells.

These are not bugs in your traffic distribution — they are the actual shape of demand. With plain consistent hashing, the pod that owns the hot key sees its load go from avg to 40 × avg. Its CPU pins, its tail latency blows up, and the other keys on that pod (which were perfectly fine cold keys) get dragged down with the hot one. This is the hot-pod failure mode: a single hot key punishes every other key it shares a pod with.

Illustrative — the hot-pod failure mode. One hot key on plain consistent hashing means one pod at 99% CPU and 11 pods at idle. The p99 of every key on the hot pod blows up, even keys that were never hot.

Why the hot pod's cold keys also slow down: TLS handshakes, request parsing, and TCP accept queues are shared resources on a pod. When the hot key's traffic saturates the CPU and the egress NIC, every other request on that pod queues behind it. The hot key's request rate becomes the bottleneck for every request on the pod — Little's Law forces the queueing delay up for cold keys too. This is the "noisy neighbour" effect at the request-routing layer, and consistent hashing's stickiness is what makes it acute: there is no other pod that has cached segment-clip-final-six.ts, so spilling to a less-loaded pod requires a cache miss and an origin fetch.

The bound: cap each pod at `(1+ε) × avg`

Mirrokni, Thorup, and Zadimoghaddam's 2017 paper Consistent Hashing with Bounded Loads fixes the hot-pod problem with one extra rule. Define:

capacity_per_pod = ceil((1 + ε) × total_keys / num_pods)

Where ε > 0 is a small constant (typical: 0.1 to 0.25). The lookup becomes:

Hash the key, walk clockwise on the ring as in plain consistent hashing.
If the first pod's current load is below capacity_per_pod, route there — done.
Otherwise continue walking clockwise. Try the next pod, the next, until you find one with spare capacity.
With ε > 0, the total capacity (N × (1+ε) × avg) strictly exceeds the total load (N × avg), so a slot always exists. The walk terminates.

The mathematical guarantee from the paper: no pod ever exceeds (1+ε) × avg load, and the expected number of additional ring-walk hops is O(1/ε² × log(1/ε)). For ε = 0.25, the expected walk length is under 2 hops. For ε = 0.1, it climbs to ~6 hops. Tightening ε toward zero buys tighter balance but more displacement; loosening ε allows more imbalance for less displacement.

The "load" here is whatever you measure — concurrent connections, requests-in-flight, weighted bytes-per-second, anything monotonic. Vimeo's original deployment (where Andrew Rodland implemented BLCH for HAProxy in 2016, predating the paper's publication) used active-connection count: each connection to a pod increments its load by 1 on accept and decrements on close. HAProxy's hash-balance-factor directive exposes (1+ε) directly — hash-balance-factor 125 means ε = 0.25.

Illustrative — the overflow walk. Pod-3 and pod-4 are at capacity (100% of their cap, which is 1.25 × avg). The hot key walks clockwise to pod-5 which has 13% spare capacity. The cold keys already on pod-3 and pod-4 keep their primary assignment — only the new request that found them full is displaced.

A 90-line implementation: BLCH against plain ring under hot keys

This script implements bounded-load consistent hashing on top of a plain hashring, runs both against a Zipf-distributed key workload, and measures (a) max pod load and (b) p99 fleet imbalance. The Zipf distribution captures real-world traffic skew where a few keys dominate request volume.

# blch_compare.py — plain ring vs bounded-load consistent hashing under Zipf traffic.
import bisect, hashlib, math, random
from collections import Counter

def h64(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def build_ring(pods, vnodes=200):
    ring = []
    for p in pods:
        for v in range(vnodes):
            ring.append((h64(f"{p}#{v}"), p))
    ring.sort()
    return ring

def ring_lookup(ring, key):
    idx = bisect.bisect(ring, (h64(key),))
    return ring[idx % len(ring)][1]

def blch_lookup(ring, key, load, capacity):
    """Plain ring, but if the candidate is full, walk clockwise to next free."""
    pos = h64(key)
    idx = bisect.bisect(ring, (pos,)) % len(ring)
    start = idx
    while True:
        pod = ring[idx][1]
        if load[pod] < capacity:
            load[pod] += 1
            return pod
        idx = (idx + 1) % len(ring)
        if idx == start:
            raise RuntimeError("ring full — total capacity < total load")

def zipf_keys(num_distinct, num_requests, alpha=1.2, seed=42):
    """Generate request stream with Zipf-distributed key popularity."""
    random.seed(seed)
    weights = [1.0 / ((i + 1) ** alpha) for i in range(num_distinct)]
    total = sum(weights)
    keys = [f"key-{i}" for i in range(num_distinct)]
    out = []
    for _ in range(num_requests):
        r, acc = random.random() * total, 0.0
        for k, w in zip(keys, weights):
            acc += w
            if r <= acc:
                out.append(k); break
    return out

pods = [f"pod-{i}" for i in range(20)]
ring = build_ring(pods)
requests = zipf_keys(num_distinct=2000, num_requests=20000, alpha=1.3)
total_load = len(requests)
avg = total_load / len(pods)

# Plain ring — every request goes to its primary pod.
plain_load = Counter(ring_lookup(ring, k) for k in requests)
plain_max = max(plain_load.values())
plain_p99 = sorted(plain_load.values())[int(0.99 * len(pods))]

# BLCH with eps = 0.25
for eps in (0.10, 0.25, 0.50):
    capacity = math.ceil((1 + eps) * avg)
    blch_load = {p: 0 for p in pods}
    for k in requests: blch_lookup(ring, k, blch_load, capacity)
    bm = max(blch_load.values())
    print(f"eps={eps:.2f}  cap/pod={capacity}  max={bm}  ratio max/avg={bm/avg:.3f}")

print(f"\nplain ring:  max={plain_max}  ratio max/avg={plain_max/avg:.3f}  p99 pod={plain_p99}")
print(f"avg per pod = {avg:.0f}")

Sample run:

eps=0.10  cap/pod=1100  max=1100  ratio max/avg=1.100
eps=0.25  cap/pod=1250  max=1250  ratio max/avg=1.250
eps=0.50  cap/pod=1500  max=1377  ratio max/avg=1.377

plain ring:  max=2284  ratio max/avg=2.284  p99 pod=1948
avg per pod = 1000

Per-line walkthrough. build_ring and ring_lookup are the standard ring-with-vnodes from the previous chapter. blch_lookup is the entire BLCH algorithm in 12 lines: walk clockwise from the hashed position; the first pod with load < capacity wins; increment that pod's load. zipf_keys generates 20 000 requests over 2000 distinct keys with α=1.3, which means the top 1% of keys account for ~14% of traffic — a typical hot-key shape. plain_load measures the load each pod sees under plain ring lookup; plain_max=2284 means the hottest pod gets 2.28× the average — that single pod is a clear bottleneck. BLCH at ε=0.25 caps every pod at 1250 requests (1.25 × avg) — exactly the bound the paper proves. The displaced traffic from the hot pod walks to its next-clockwise neighbours; total work is the same, but no single pod exceeds the cap. The cost is a slightly longer expected walk and slightly less stickiness for the hot keys (their requests now spread across 2-3 pods instead of one).

Why ε=0.25 is the most common production choice: the paper's expected-walk bound is O(1/ε² × log(1/ε)). At ε = 0.25 the constant is small enough that the average request walks under 2 hops, which is invisible at the request layer. At ε = 0.10 (a tighter 10% imbalance allowance) the walk grows to ~6 hops, which costs a measurable cache-miss rate because each "walk" target may not have the key cached. At ε = 0.50 the cap is so loose that the algorithm degenerates back toward plain ring's behaviour for non-hot keys. Vimeo's HAProxy default is hash-balance-factor 125 (ε = 0.25); Google's GCLB defaults to hash-balance-factor 1.5 for some L7 services where the slightly higher cap is preferred for fewer cache invalidations.

Production: HAProxy, Envoy, and Google Cloud Load Balancing

Bounded-load consistent hashing is implemented in three production-grade load balancers, each with slightly different semantics on what counts as "load":

HAProxy (since 1.7, 2016 — predates Mirrokni's paper) implements BLCH via hash-balance-factor. Andrew Rodland of Vimeo wrote the patch to fix Vimeo's video-CDN hot-pod problem during the 2014–2016 traffic-spike era; the patch landed in HAProxy before the academic paper was published. HAProxy measures load as active connections per backend. Setting hash-type consistent plus hash-balance-factor 125 enables BLCH with ε = 0.25. The walk terminates within 1-2 hops in 99% of cases for typical traffic.

Envoy (since 2017) implements BLCH in its ring_hash and maglev LB policies via the consistent_hashing_lb_config.hash_balance_factor field. Envoy measures load as in-flight requests per host. The implementation is identical to HAProxy in semantics but lives inside the service-mesh sidecar; this means ε can be configured per-service via the LoadBalancer resource in Istio/Envoy CDS.

Google Cloud Load Balancing uses BLCH for its L7 internal LBs when localityLbPolicy: RING_HASH is set with a consistentHash field. Google's internal documents (and the Mirrokni paper itself) note that GCLB uses BLCH with ε = 0.5 in some configurations — a looser cap that prefers stickiness over balance, because for cache-heavy L7 traffic the cost of a cache miss exceeds the cost of a slightly skewed pod.

The shared lesson across these three: BLCH is the default upgrade for any consistent-hashing LB that sees skewed traffic. Plain ring is fine for stateless or roughly-uniform workloads (sharded counters, embedding lookups with random keys). The moment your traffic has a heavy tail — and almost all production traffic does — you want BLCH or you spend a large fraction of your time managing hot-pod incidents.

Why HAProxy uses active-connection count rather than CPU or request rate as the load metric: connection count is the only load signal HAProxy can observe locally without instrumenting the backend. CPU requires backend cooperation (a metrics endpoint, a sidecar exporter), and request rate alone misses the long-running streaming connections that dominate video CDN load. Active connections approximate "work currently in flight" well enough for video traffic — a connection that has been open 20 seconds counts the same as one open 20 milliseconds, but for video streaming both are doing roughly equal work (continuously serving bytes). For request-response workloads the same metric over-counts idle keep-alive connections; in those services Envoy's "in-flight requests" variant is more accurate. The choice of metric is the fault-line between L4 LBs (use connection count) and L7 LBs (use in-flight request count).

Common confusions

"BLCH is the same as least-connections." It is not. Least-connections always picks the backend with the fewest in-flight requests, ignoring the consistent-hash position. BLCH picks the consistently-hashed primary first and only walks if it is full — preserving stickiness for the common case (cold keys hash to one pod every time and stay there). The hot-key spread is a side effect of overflow, not the policy. For cache-fronted services, this matters: BLCH's cold keys see ~100% cache hit rate; pure least-connections sees a much lower hit rate because cold keys land on different pods on different requests.
"ε = 0 would be perfect balance." ε = 0 makes the bound capacity = avg, but avg is computed from total load, not pod count, and the algorithm may not be able to fill the table — the proof requires ε > 0. In practice, very small ε (say 0.01) requires O(10000) expected walks; the algorithm becomes glacial. There is a strict lower bound at ε > 0 for the algorithm to terminate quickly.
"BLCH replaces virtual nodes." It does not. BLCH operates on top of a ring (or a Maglev table) that already uses virtual nodes. Vnodes solve the random-position imbalance problem (one pod's vnode happens to cover a large arc of the ring); BLCH solves the traffic skew problem (one key gets 40× the average requests). Both problems exist; production systems use vnodes plus BLCH.
"The walk-cost makes BLCH slow." The walk cost in lookup latency is negligible (1-2 extra array reads at ε=0.25). The actual cost is caching effectiveness: a hot key that gets displaced lands on a different pod on each subsequent request as the original pod stays full. That displacement causes cache misses on the secondary pods until the hot key is hot enough on those pods too. For cache-fronted workloads, you might see hit rate drop from 94% to 89% during the hot-key event — much better than the 50% drop you'd see from a fully-pinned pod.
"BLCH guarantees fair load." It guarantees no pod exceeds (1+ε) × avg. It does not guarantee that pods are at equal load — pods owning popular ring arcs run hotter than pods owning sparse arcs, just below the cap. The "fairness" is one-sided: a tight upper bound, no lower bound. If you want lower-bound fairness too, you need a different policy (least-connections, JSQ).

Going deeper

The (1+ε) bound is tight — and `ε` controls a real trade-off

Mirrokni, Thorup, and Zadimoghaddam's paper proves two complementary results. Upper bound: bounded-load consistent hashing with cap (1+ε) per pod accepts every request (no overflow into a "rejected" state) as long as the total request count stays within N × (1+ε) × avg — which is automatic since avg = total/N. Walk-length bound: the expected number of clockwise hops a displaced key takes before finding a non-full pod is O(1/ε² × log(1/ε)). This is sharp — the proof uses a balls-into-bins argument and shows the bound is tight up to a constant factor.

The trade-off ε exposes is therefore not a knob for "more or less balance" — it is a knob for how much imbalance you tolerate vs how much displacement you accept. Tighter ε (smaller value) means tighter bound but longer walks (more cache misses on secondary pods). Looser ε means looser bound but minimal walks (less cache disruption). The right setting depends on whether your bottleneck is per-pod CPU/memory (tighten ε) or cache hit rate (loosen ε).

Vimeo's actual story: the patch that predated the paper

Andrew Rodland's HAProxy patch landed in 2016, a year before the formal Mirrokni paper appeared in 2017. The Vimeo deployment was production before the algorithm had a published proof. The story (recounted in HAProxy mailing-list threads and Rodland's KubeCon talks) is that Vimeo's video CDN was suffering exactly the hot-pod problem this chapter describes: one popular video would pin one cache pod, killing the p99 for every co-tenant video on that pod. Rodland's intuition — "hash to the primary, but walk if it's full" — was implemented and shipped before the math was formalised. Mirrokni's paper later proved that this exact algorithm gives the (1+ε) bound. This is one of the cleaner cases where production engineering and theoretical computer science arrive at the same algorithm independently.

Why `ε = 0.25` ends up canonical

Across HAProxy (hash-balance-factor 125), Envoy default-recommended (hash_balance_factor: 125), and several internal Google services, ε = 0.25 keeps showing up. The reason is empirical: at ε = 0.25, the expected walk length is under 2 hops, which keeps cache-hit-rate degradation under 5% during typical hot-key events on a 100-pod fleet. At ε = 0.10, the walk length grows to ~6 hops and cache hit rate drops by 15-20% during hot events — too costly. At ε = 0.50, the cap is so loose that one pod can run at 1.5× average, which on a fleet whose average is 60% CPU means the hot pod runs at 90% CPU — too tight a margin against full saturation. ε = 0.25 is the empirical sweet spot for L4 and L7 load-balanced cache fleets.

Reproduce this on your laptop

# Run the BLCH simulator from above:
python3 -m venv .venv && source .venv/bin/activate
python3 blch_compare.py
# Expected output (seed=42, 20 pods, Zipf alpha=1.3):
#   eps=0.10  max_ratio=1.10
#   eps=0.25  max_ratio=1.25
#   plain ring:  max_ratio=2.28

# Inspect HAProxy's BLCH directive in a real config:
docker run --rm haproxy:2.8 haproxy -c -f /dev/stdin <<'EOF'
defaults
  mode http
  timeout connect 5s
  timeout client 30s
  timeout server 30s
backend cache-fleet
  hash-type consistent
  hash-balance-factor 125     # epsilon = 0.25
  server pod-0 10.0.0.1:80
  server pod-1 10.0.0.2:80
  server pod-2 10.0.0.3:80
EOF
# 'haproxy -c' validates the config; 'hash-balance-factor 125' enables BLCH.

# Envoy equivalent (consistent_hashing_lb_config in CDS):
echo '{
  "consistent_hashing_lb_config": {
    "hash_balance_factor": 125
  }
}' | python3 -m json.tool

Where this leads next

Bounded-load consistent hashing is the answer when stickiness matters but a single hot key would punish the pod owning it. The next chapter — client-side vs proxy-side LB — addresses where the LB algorithm runs: in the client SDK (zero-RTT but harder to update) or in a proxy hop (extra RTT but centralised policy). Both layers can use BLCH; the choice is independent.

Consistent hashing (ring, jump, Maglev) — the foundation BLCH sits on top of.
Locality-aware load balancing — the orthogonal axis: route to nearby pods first, balance second.
Power of two choices (P2C) — the alternative when stickiness is not required; trades affinity for tail-latency reduction.
Least connections — the policy that ignores hashing entirely and routes by current load.

Beyond Part 6, Part 7 (reliability patterns) revisits the hot-key problem from the failure-handling angle — what happens when the BLCH walk lands on a pod that is itself unreachable.

References

Mirrokni, Thorup, Zadimoghaddam, "Consistent Hashing with Bounded Loads" — arXiv 2017 — the formal proof of the (1+ε) bound and O(1/ε² log 1/ε) walk-length bound.
Rodland, "Improving load balancing with a new consistent-hashing algorithm" — Vimeo Engineering 2016 — the production deployment that predated the paper.
HAProxy 1.7 release notes — hash-balance-factor — the first production implementation.
Envoy consistent_hashing_lb_config reference — Envoy's BLCH integration.
Karger et al., "Consistent Hashing and Random Trees" — STOC 1997 — the foundation BLCH builds on.
Consistent hashing (ring, jump, Maglev) — internal companion. The previous chapter; covers the underlying algorithms.
Power of two choices (P2C) — internal companion. The non-sticky alternative for tail-latency reduction.

Bounded-load consistent hashing

The hot-key problem plain consistent hashing cannot solve

The bound: cap each pod at (1+ε) × avg

A 90-line implementation: BLCH against plain ring under hot keys

Production: HAProxy, Envoy, and Google Cloud Load Balancing

Common confusions

Going deeper

The (1+ε) bound is tight — and ε controls a real trade-off

Vimeo's actual story: the patch that predated the paper

Why ε = 0.25 ends up canonical

Reproduce this on your laptop

Where this leads next

References

The bound: cap each pod at `(1+ε) × avg`

The (1+ε) bound is tight — and `ε` controls a real trade-off

Why `ε = 0.25` ends up canonical