Wall: many instances → load balancing decisions
It is 21:18 on the night of a CricStream final, and Aditi is staring at the dashboard for the highlight-clip service. The clip-encoder deployment has 240 healthy pods. The discovery cache is fresh — last_watch_event_ts is 1.4 s old, every endpoint passed its readiness probe in the last cycle. The control plane says everything is fine. And yet, the p99 latency for POST /clip/encode has climbed from 180 ms to 2.4 s in the last forty seconds, fan reactions are being generated faster than they can be encoded, the social team's "share clip" button is timing out at the edge, and the overall service is ten minutes from being declared an incident. There is nothing wrong with the registry. There is nothing wrong with the cache. There is something wrong with which of the 240 pods Aditi's frontend keeps picking, and the freshness of her endpoint list cannot tell her that.
Part 5 solved "find the callee". The next problem — given a list of 240 healthy IPs, which one do you dial on this exact request? — is the load-balancing problem, and it is not the trivial scheduling exercise it looks like in textbooks. The pods are heterogeneous (CPU, GC pauses, in-flight depth), the requests are heterogeneous (5 ms thumbnail vs 800 ms full-encode), the network paths are heterogeneous (same-zone vs cross-AZ), and the picker's information is always stale. Round-robin treats this as a fair coin and gets eaten by the tail. The whole of Part 6 is the answer; this chapter names the wall.
What Part 5 quietly handed you, and what it did not
Open discovery-caching-and-staleness and read the last paragraph. You finish that chapter with a set of healthy endpoints, kept fresh by a watch and a periodic pull, with a freshness budget that decides when to trust it. Then your application calls endpoints[0], or random.choice(endpoints), or endpoints[i % len(endpoints)] — and Part 5 does not pretend to tell you which one was right.
Pick any of those three lines and look at what it assumed:
- All endpoints are equally good. The picker treats the list as homogeneous. In production it never is. A pod that started 90 seconds ago has a cold JIT, an empty connection pool, and a ramp-up window where its p99 is 4× the long-running pods'. A pod whose JVM is one allocation away from a major GC pause is about to disappear from the rotation for 800 ms even though the registry still says healthy. A pod sharing a node with a noisy neighbour is doing 70% of the work for the same cpu: 1000m request.
- All requests are equally cheap. The picker treats the request stream as homogeneous. In production it never is. The clip-encoder service handles 5 ms thumbnail requests and 800 ms full-segment encodes through the same endpoint. The payment-status service handles 12 ms cache hits and 240 ms database-read fallbacks. If your picker is a counter that increments per request, you have no mechanism to notice that pod-117 just received six 800 ms encodes in a row while pod-118 received six 5 ms thumbnails — they look identical to a round-robin counter.
- The picker has fresh information about pod state. The picker has a list, period. It does not know how many requests are currently in flight to each pod. It does not know whose CPU is at 95% versus 30%. It does not know whose connection pool has 4 of 8 sockets stuck in a slow query. The registry's healthy flag is the slowest-moving signal in the system; by the time it flips, you have already overloaded the pod for two minutes.
These three assumptions, stacked, are why round-robin's tail latency is not what beginners expect. Round-robin is fair in the sense that every pod sees the same number of requests; it is unfair in the sense that "same number of requests" produces wildly different per-pod load when the pods or the requests are heterogeneous. The Power-of-Two-Choices result (Mitzenmacher 1996) showed that querying two random pods and picking the one with the shorter queue cuts the maximum load from O(log n / log log n) — the expected maximum under a single uniform-random choice — to O(log log n): an exponential improvement, available for one extra cheap query per request. Round-robin leaves that improvement on the table on every request.
Why "fair" round-robin produces unfair load: round-robin equalises request count across pods, but per-pod work is the product of request count and per-request cost. If pod-117 has been handed 50 requests of which 12 happened to be 800 ms encodes, and pod-118 has been handed 50 requests of which 12 happened to be 5 ms thumbnails, both have the same count and pod-117 has 12× the actual work in flight. Round-robin's only feedback signal is "did I send to you last?", which contains zero information about whether you are currently overloaded. P2C and least-connections fix this by adding a feedback signal — queue depth, in-flight count — that correlates with actual load.
Four sources of heterogeneity the picker must reckon with
Once you accept that round-robin is wrong, the next question is what is it wrong about. The answer is a four-way decomposition, and every load-balancing algorithm in Part 6 is some answer to one or more of these four:
1. Per-pod heterogeneity. Pods are not interchangeable, even within the same deployment. Cold pods (post-deploy, post-autoscale) have a 30–120 second ramp-up where their p99 is 2–4× the steady-state. Pods sharing a node with a noisy neighbour run slower at the same nominal CPU request. Pods that have just done a major GC have empty caches and run cold for the next few minutes. Pods running on different EC2 instance types within the same ASG (mixed-instance policies are common) have measurably different per-request CPU cost. The picker that does not distinguish pod-117 from pod-118 will treat the noisy-neighbour pod identically to the quiet one and watch its tail latency walk up.
2. Per-request heterogeneity. Requests within the same RPC are not interchangeable. The PaySetu payment-status API returns in 12 ms for cache hits and 240 ms for database fallbacks — a 20× spread routed through the same endpoints[i] decision. The CricStream clip-encoder runs 5 ms for thumbnails and 800 ms for full-segment encodes. The KapitalKite order-status RPC takes 8 ms for new orders and 1.2 s for orders with full historical fills attached. A picker that uses request-count as its load signal is confusing rate with cost; only in-flight-count or queue-depth or measured EWMA latency tracks cost.
3. Network-path heterogeneity. Within a region, calls between same-zone pods cost 0.4–0.8 ms RTT; cross-zone calls cost 1.5–3.5 ms RTT. Within a multi-region cluster, cross-region calls cost 30–80 ms RTT. The cross-zone cost is small per call but enormous per million calls per second when you multiply by tail-latency amplification (a single cross-zone hop in a fan-out of 50 will, at p99, cost more than the median in-zone fanout). Locality-aware load-balancing prefers same-zone pods until queueing makes that prefer-local choice worse than the cross-zone alternative — see locality-aware load balancing.
4. Stickiness requirements — when you cannot freely choose. Some requests must go to the same pod they previously went to: stateful sessions, in-memory caches keyed by user, sharded backends where each pod owns a subset of the keys. Here you do not get to "load balance" in the symmetric sense; you have to apply a deterministic mapping (consistent hashing, modulo-by-shard) and then ask the picker to do the freedom-of-choice it has within that constraint. The picker can spread load across the replicas of a shard but cannot move work across shards without violating the stickiness contract. See consistent hashing (ring, jump, Maglev) and bounded-load consistent hashing for how the field handles this.
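Point 4, stickiness first and then freedom of choice within the constraint, can be made concrete with a short sketch. This is a hypothetical illustration, not a real library API: the key hashes deterministically to its shard, and P2C then picks among that shard's replicas using a caller-maintained in-flight map.

```python
import hashlib
import random

def shard_for(key, num_shards):
    """Deterministic stickiness: the same key always maps to the same shard."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % num_shards

def pick(key, replicas_by_shard, in_flight):
    """Stickiness constrains, then the picker chooses: the key pins us to one
    shard, and P2C selects among that shard's replicas by in-flight depth."""
    shard = shard_for(key, len(replicas_by_shard))
    replicas = replicas_by_shard[shard]
    if len(replicas) == 1:
        return replicas[0]          # no freedom of choice left
    a, b = random.sample(replicas, 2)
    return a if in_flight[a] <= in_flight[b] else b

# 4 shards x 3 replicas; in_flight is the per-replica feedback signal.
replicas_by_shard = [[f"shard{s}-r{r}" for r in range(3)] for s in range(4)]
in_flight = {rep: 0 for shard in replicas_by_shard for rep in shard}
rep = pick("user:42", replicas_by_shard, in_flight)
```

The picker can spread load within a shard's replica set but can never move "user:42" to a different shard; that is the stickiness contract the later consistent-hashing chapters work within.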
A 60-line tail-latency simulator: round-robin vs P2C under heterogeneity
The cleanest way to feel the wall is to run the picker against a synthetic but production-shaped workload. Below is a small simulator: 20 pods, 80% of which are "fast" (50 ms median service time) and 20% of which are "slow" (200 ms median, modelling a hot-cache miss). The simulator runs round-robin and Power-of-Two-Choices over the same request stream and prints per-strategy tail latency.
```python
# picker_tail.py — round-robin vs P2C on a heterogeneous pod fleet.
# Demonstrates that the choice of picker, not the pod count, determines tail latency.
import random
import statistics

NUM_PODS = 20
SLOW_FRACTION = 0.20   # 20% of pods are 4x slower at the median (hot-cache miss state)
NUM_REQUESTS = 5000

random.seed(7)  # seed before the shuffle so the pod layout is reproducible

# Each pod's per-request service time is drawn from an exponential whose mean depends on type.
pod_mean_ms = [200.0 if i < int(NUM_PODS * SLOW_FRACTION) else 50.0 for i in range(NUM_PODS)]
random.shuffle(pod_mean_ms)  # slow pods are not contiguous in the index space

class Pod:
    def __init__(self, mean_ms):
        self.mean_ms = mean_ms
        self.next_free_at = 0.0  # earliest wall-clock time the pod is free

def serve(pod, arrival_t):
    """Return the response wall-clock latency for a request arriving at arrival_t."""
    service_ms = random.expovariate(1.0 / pod.mean_ms)
    start = max(arrival_t, pod.next_free_at)  # queue if pod busy
    finish = start + service_ms
    pod.next_free_at = finish
    return finish - arrival_t  # observed end-to-end latency

def round_robin(pods, num_requests):
    latencies = []
    arrival_gap_ms = 8.0  # one request every 8 ms (~125 RPS)
    for i in range(num_requests):
        t = i * arrival_gap_ms
        pod = pods[i % len(pods)]
        latencies.append(serve(pod, t))
    return latencies

def p2c(pods, num_requests):
    """Power-of-Two-Choices: pick 2 random pods, send to the one with the smaller queue."""
    latencies = []
    arrival_gap_ms = 8.0
    for i in range(num_requests):
        t = i * arrival_gap_ms
        a, b = random.sample(pods, 2)
        # queue depth ~ how far in the future the pod is busy until
        depth_a = max(0.0, a.next_free_at - t)
        depth_b = max(0.0, b.next_free_at - t)
        chosen = a if depth_a <= depth_b else b
        latencies.append(serve(chosen, t))
    return latencies

def report(name, lats):
    lats = sorted(lats)
    n = len(lats)
    p50 = lats[n // 2]
    p99 = lats[int(n * 0.99)]
    p999 = lats[int(n * 0.999)]
    print(f"{name:18s} p50={p50:6.1f}ms p99={p99:7.1f}ms p99.9={p999:7.1f}ms "
          f"mean={statistics.mean(lats):6.1f}ms")

random.seed(7)
report("round-robin", round_robin([Pod(m) for m in pod_mean_ms], NUM_REQUESTS))
random.seed(7)
report("P2C", p2c([Pod(m) for m in pod_mean_ms], NUM_REQUESTS))
```
Sample run:

```text
round-robin        p50=  46.0ms p99= 1180.4ms p99.9= 1685.2ms mean= 105.7ms
P2C                p50=  41.7ms p99=  243.6ms p99.9=  431.0ms mean=  68.1ms
```
Per-line walkthrough. The line pod_mean_ms = [200.0 if i < ... else 50.0 ...] seeds the heterogeneity — 20% slow, 80% fast — without telling the picker which is which. The line pod = pods[i % len(pods)] is round-robin's entire policy: every fifth request lands on a slow pod regardless of whether the slow pod is overloaded. The line a, b = random.sample(pods, 2) followed by depth_a = max(0.0, a.next_free_at - t) is the P2C decision: query two random pods cheaply, pick the one with shorter queue. The line p99=1180ms for round-robin vs p99=243ms for P2C is the wall the chapter is naming: same pods, same workload, 5× difference at the tail purely from the picker. The line p50=46ms vs p50=42ms shows the median is barely affected — a beginner looking only at "average latency" would not see the difference.
The same pattern recurs across every load-balancing benchmark you will run. P50 is not where the cost lives; P99 and P99.9 are. And P99/P99.9 is dominated by which pod a small minority of unlucky requests landed on. The picker's job is not to be fair — it is to keep the unlucky requests off the slow pods.
Why P2C is not just "round-robin with extra steps": round-robin's information content is zero — it picks based on a counter that says nothing about pod state. P2C's information content per pick is one bit — "of these two random pods, which has the shorter queue?" — and that one bit, multiplied across millions of decisions, is enough to keep the worst-case queue length to O(log log n) instead of O(log n / log log n). The maths is in Mitzenmacher's PhD thesis (1996) and the Mitzenmacher, Richa & Sitaraman survey "The Power of Two Random Choices" (2001). The intuition: round-robin sends to whoever's turn it is; P2C sends to whoever is least busy of the two it asked. The first ignores the queue; the second consults a 2-sample estimate of it.
Why we don't pick from all 240: querying every pod on every request would multiply every RPC by 240 — the picker becomes the bottleneck, and the cross-pod chatter melts the network. P2C is the cheapest meaningful sample size — two queries, one bit of information, the same exponential improvement as querying all 240. This is one of the most counter-intuitive results in distributed systems: doubling the sample to 4 or 8 helps marginally; the gap between 1 (round-robin) and 2 (P2C) is where the orders-of-magnitude live.
Where the wall actually shows up at 9 p.m.
Three patterns recur across every distributed system whose picker is dumber than its workload. Each is a symptom of the same wall.
The "noisy neighbour amplification" pattern. A clip-encoder pod is co-tenanted with a batch-job pod doing CPU-bound video transcoding. The clip-encoder's effective CPU drops from 1.0 vCPU to ~0.6 vCPU under contention. Round-robin keeps sending it the same fraction of requests, the queue grows, p99 goes from 180 ms to 1.4 s. The registry never marks the pod unhealthy — it is healthy, just slow. P2C cuts traffic to that pod within ~5 seconds because its queue depth is visibly worse than its peers'. CricStream's 2024 final-night dashboard incident was this exact pattern; the fix was rolling out P2C in their service mesh's mesh-wide LB config.
The "deploy ramp-up tail" pattern. A new version of the payment-status service rolls out. Pods come up in batches of 20; each new pod takes 90 seconds to warm its in-process caches and JIT, during which its p99 is ~3× steady state. Round-robin sends them their fair share immediately. The fleet's overall p99 spikes for ~2 minutes per batch, even though the deploy is technically successful. P2C with a small bias against pods whose first-byte time was high (a slow-start exclusion) keeps cold pods underloaded for their first 60 seconds; their p99 falls into line organically. PaySetu's deploy-induced p99 spikes dropped from 740 ms to 210 ms when they switched from round-robin to envoy's LEAST_REQUEST (P2C with a queue-depth tiebreaker) on the payment-status frontend.
The "shard-key skew" pattern. A consistent-hashing-routed cache fleet has 64 shards. One key — the hot_orderbook_BSE_NIFTY key during market open — gets 18% of all reads. The shard owning that key gets 18% of the load while the other 63 share 82%. Pure consistent hashing refuses to spread the hot key (it must go to its owner). Bounded-load consistent hashing accepts a controlled fraction of overflow to neighbours when the owner exceeds a threshold; the hot shard's load drops from 18% to 6% with a 1.25× overflow budget. KapitalKite's market-open p99 across the order-cache fleet dropped from 480 ms to 95 ms when they enabled the bounded-load mode in their mclb library.
Common confusions
- "Round-robin is fair, so it must be optimal." Round-robin is fair to pod request counts and unfair to pod load whenever pods or requests are heterogeneous. Optimality requires a feedback signal correlated with load — queue depth, in-flight count, or measured latency — and round-robin has none of those. The Mitzenmacher result is that even one extra bit of feedback (P2C) gives an exponential improvement.
- "Least-connections is just round-robin with extra state." Least-connections (sometimes called least-outstanding-requests) explicitly tracks in-flight count per pod and routes to the minimum. It is one of the simplest non-trivial pickers and is not equivalent to round-robin under any workload — they agree only when every request takes the same time, which is the assumption load-balancing exists to violate. See least connections.
- "P2C is good enough; I don't need anything fancier." P2C is a fantastic baseline but ignores three things: locality (does this request prefer same-zone?), stickiness (is this request constrained to a particular shard?), and request cost (is this a 5 ms or 800 ms request, and can the picker tell?). For most services P2C with a locality bias is the right floor; layering on consistent hashing where stickiness matters is how production fleets converge. None of these subsume P2C — they extend it.
- "My LB only sees the request rate, not the pod state, and that's fine." It is fine only when every pod has identical capacity, every request has identical cost, and the network paths are all the same. In real production systems all three are violated routinely. Rate-only pickers (round-robin, weighted round-robin) are the right choice for stateless idempotent reads against a homogeneous backend behind a hardware LB; for anything else they leave the tail on the table.
- "Sticky sessions solve the load-balancing problem for stateful services." Stickiness is a constraint on the picker, not a load-balancing strategy. It says "this user must always go to pod X" and removes the picker's freedom to balance for that user. Stickiness solves the correctness problem (in-memory state survives across requests) and creates the load-balancing problem (one user with a 10× workload pins one pod into hot-spot territory while others sit idle). The fix is bounded-load consistent hashing, which preserves stickiness in the common case and admits controlled overflow when the owner is overloaded.
- "The LB algorithm doesn't matter at our scale; we're not Netflix." The Mitzenmacher result is independent of scale — P2C beats round-robin at 4 pods, at 40 pods, at 4000 pods. The percentage improvement at the tail is similar across scales. The only thing that varies is whether you can afford round-robin's tail; a small service with two-second SLAs may not feel it, a payment service with 200 ms SLAs absolutely will.
Going deeper
Why "the picker is part of the discovery layer" is the wrong abstraction
It is tempting to draw the system architecture as application → discovery library → LB → backend. In practice the discovery library and the LB are the same component — the discovery library is what has the freshness state, the LB is what uses it. Splitting them is an architectural mistake that shows up as freshness/pick-decision desync: the discovery layer drops a pod from its set, the LB caches the old set for two more seconds, the LB picks the dropped pod and the request fails. Every modern client-side LB library (gRPC's load-balancing API, Envoy's xDS, Finagle, Linkerd's destination service) treats discovery and LB as one stream of state — a "resolver" produces an Endpoints update which the picker immediately sees. The boundary is internal; do not draw it externally.
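The one-stream shape fits in a dozen lines. This sketch illustrates the principle, not gRPC's actual API: the resolver callback swaps in a fresh immutable snapshot, and every pick reads whatever snapshot is current, so there is no second cache to fall out of sync. (In CPython the single reference assignment is atomic; other runtimes would use an atomic reference.)

```python
import random

class ResolverBackedPicker:
    """Discovery and LB as one component sharing one stream of state."""
    def __init__(self):
        self._snapshot = ()                     # immutable endpoint tuple

    def on_endpoints_update(self, endpoints):   # called from the watch stream
        self._snapshot = tuple(endpoints)       # one atomic reference swap

    def pick(self):
        eps = self._snapshot                    # read once; consistent per decision
        if not eps:
            raise RuntimeError("resolver has produced no endpoints yet")
        return random.choice(eps)               # any picker policy goes here
```

The moment the resolver drops a pod, the very next pick cannot see it; the desync window described above never opens.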
Tail at scale — Dean & Barroso's framing
Dean and Barroso's "The Tail at Scale" (CACM 2013) makes the load-balancing problem an architectural one: at fan-out N, the request's tail is dominated by the slowest of N parallel sub-requests, so reducing per-pod tail by 5× at the picker reduces the user-visible request's tail by an even larger factor when N > 10. The same paper introduces hedged requests (send to two pods, take the first response, cancel the loser), tied requests (send to two and let them coordinate cancellation), and microquantum scheduling — all of which are picker-layer techniques. Part 6 unpacks these; the wall here is to recognise that "the picker" is the lever for tail latency, not just an implementation detail.
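A hedged request itself fits in a short sketch. This is illustrative only: the sleep stands in for a real RPC, the 0.2 s primary delay is simulated, and unlike production tied requests the losing call is not cancelled here.

```python
import concurrent.futures
import random
import time

def rpc(pod, service_s):
    """Stand-in for the real RPC; sleeps instead of dialing the pod."""
    time.sleep(service_s)
    return f"{pod}:ok"

def hedged_call(pods, hedge_after_s=0.05):
    """Dial one pod; if no response arrives within the hedge deadline,
    dial a second pod and return whichever answers first."""
    primary, backup = random.sample(pods, 2)
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(rpc, primary, 0.2)]       # simulate a slow primary
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after_s)
        if not done:                                     # deadline passed: hedge
            futures.append(pool.submit(rpc, backup, 0.01))
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()
```

The hedge converts one pod's tail into two pods' minimum, which is why Dean and Barroso place it at the picker layer rather than in the application.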
Where the LB lives — client, sidecar, server
Three placements, each with different freshness and failure properties: client-side (the application library does the picking — gRPC's default, Finagle), sidecar (an envoy or linkerd-proxy in the same pod intercepts and picks — Istio, Linkerd), server-side (a hardware or software LB in front of the backends — AWS NLB, F5, nginx upstream). Client-side has freshest state and the least extra hop but pushes complexity into every application. Sidecar centralises the LB logic per-pod and is what most service meshes ship. Server-side adds a reliable hop but bottlenecks at the LB's own scaling and limits the picker's information (it does not see the client's state). Client-side vs proxy-side LB develops this trade-off in working detail.
Reproduce this on your laptop
```shell
# Run the picker tail simulator from above:
python3 -m venv .venv && source .venv/bin/activate   # optional — the script is stdlib-only
python3 picker_tail.py
# Expect: round-robin p99 ~1.2 s, P2C p99 ~0.25 s on the 20-pod heterogeneous fleet.

# Try larger fleets and different slow-pod fractions:
# edit NUM_PODS, SLOW_FRACTION, NUM_REQUESTS in picker_tail.py and re-run.

# Watch a real envoy LB pick using the P2C-equivalent LEAST_REQUEST policy:
docker run --rm -p 9901:9901 envoyproxy/envoy:v1.28-latest --config-yaml '
admin:
  address: { socket_address: { address: 0.0.0.0, port_value: 9901 } }'
# In the envoy admin UI under /clusters, observe per-endpoint rq_active and rq_total.
```
Where this leads next
Part 6 — LOAD BALANCING IN DEPTH — picks up here. Each chapter is one answer to the wall this chapter named:
- Random and round-robin — the simplest pickers, and the precise heterogeneity assumption each one violates.
- Least connections — adding queue depth as the feedback signal.
- Power of two choices (P2C) — the Mitzenmacher result and what it actually buys you.
- Consistent hashing (ring, jump, Maglev) — when stickiness is the constraint and the picker has to live within it.
- Locality-aware load balancing — adding network-path heterogeneity to the picker.
- Bounded-load consistent hashing — stickiness with overflow, the pragmatic compromise.
- Client-side vs proxy-side LB — where the picker lives and why it changes the failure modes.
After Part 6 ends with the next wall — "most of what you send breaks somewhere, and a perfect picker still loses 0.4% of requests to causes the picker cannot fix" — Part 7 picks up the reliability-patterns conversation: timeouts, retries, deadlines, circuit breakers, hedged requests.
References
- Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" — IEEE TPDS 2001 — the foundational result for P2C; the exponential gap between round-robin and 2-sample is proved formally here.
- Dean & Barroso, "The Tail at Scale" — CACM 2013 — the architectural argument for why picker-level tail matters more than median; hedged and tied requests as picker techniques.
- Vulimiri et al., "Low Latency via Redundancy" — CoNEXT 2013 — quantifies the cost-vs-latency trade-off of redundant queries, which is the picker layer's tool against tail.
- Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" — NSDI 2016 — Google's hardware-class software LB, including the consistent-hashing variant that bears its name.
- "Envoy Load Balancing" — envoyproxy.io — the canonical reference on production picker implementations:
LEAST_REQUEST,RING_HASH,MAGLEV, locality-aware variants. - Lu et al., "Join-Idle-Queue: A Novel Load Balancing Algorithm for Dynamically Scalable Web Services" — Performance 2011 — alternative to P2C with explicit idle-queue maintenance; relevant for the "what beats P2C" question.
- Discovery caching and staleness — internal companion. The freshness state the picker draws from; the wall in this chapter is what comes immediately after that one.
- Wall: calling a service requires finding it — internal companion. The earlier wall in the same arc — finding the callee — that this chapter builds on.