Locality-aware load balancing

It is 13:04 on a Wednesday and Karan is staring at MealRush's order-service dashboard. The fleet has 240 pods spread across three Mumbai availability zones (ap-south-1a, 1b, 1c), 80 pods per zone. p99 of the order-create RPC has been steady at 38 ms for six weeks. At 12:58 the platform team enabled "fair P2C" across the entire fleet — every caller picks two random pods from all 240, sends to the less loaded one. The new dashboard shows perfectly flat backend CPU. p99 is now 71 ms. The reason is in the trace: 67% of the requests now cross an AZ boundary, and each cross-AZ hop on AWS Mumbai costs 1.6 ms one-way over the inter-AZ fibre. With three serial hops in a request fan-out (auth → menu → order), that's ~9 ms of pure cable that wasn't there before. This chapter is about locality-aware load balancing — picking backends in your own zone first, only spilling cross-zone when local pods would be overwhelmed — and the surprisingly thin line between "stay local" and "create a hot zone".

Locality-aware load balancing biases backend selection toward the caller's own zone (or rack, or region) so that requests do not pay cross-AZ latency on every hop. The mechanism is a per-locality weight table that starts at 100% local and falls toward "spill cross-zone" only when local pods exceed a load threshold. The hard part is the spill curve — too eager and you lose the locality benefit; too late and one zone's hot pods melt while the others sit idle. Envoy, Kubernetes' topologyAwareHints, and Google's Maglev all implement variants of the same priority-with-overflow scheme.

Why "globally fair" is the wrong default

A modern cloud region is not a flat network. AWS Mumbai (ap-south-1) is three availability zones connected by dedicated fibre; the inter-AZ p50 RTT is ~1.4 ms, p99 ~3 ms, and the cost is ~$0.01 per GB transferred (in both directions). Within a single AZ, the p50 RTT between two EC2 instances is ~80 µs and the bandwidth is free. The ratio is roughly 17×: an inter-AZ hop is 17 times more expensive in latency than an intra-AZ hop, and orders of magnitude more expensive in money once the byte volume scales.

A naive load balancer is locality-blind. P2C across the whole fleet picks two random pods and sends the request to the less loaded one. With 80 pods per zone in a 3-zone region, the probability that both picks land in the caller's zone is (80/240)^2 = 1/9 ≈ 11% — so 89% of requests have at least one candidate pod in another zone, and roughly two-thirds of requests end up crossing a zone boundary. For a single-hop request, you eat one inter-AZ traversal (~1.6 ms). For a fan-out request that touches 4 services in a chain, you eat up to 4 inter-AZ traversals — ~6.4 ms of pure latency that locality-aware routing would have eliminated.

[Figure: locality-blind P2C across three AZs — a caller pod in AZ-1a fans requests out to pods in all three zones; every cross-zone arrow pays ~1.6 ms inter-zone RTT, and only about a third of the caller's traffic stays local.]
Illustrative — the locality-blind regime. Backend CPU looks "fair" on a dashboard, but every cross-AZ arrow is ~1.6 ms of latency the caller did not need to pay.

Why the cross-AZ probability is (N - N_local) / N for a single random pick, and why at least one remote candidate appears with probability 1 - (N_local / N)^k under a P2C-style k-pick scheme: each pick is independent and uniformly random across N pods. The probability that all k picks land in the caller's zone is (N_local / N)^k. For N_local/N = 1/3 (3 equal zones) and k=2, the probability of "all local" is 1/9 ≈ 11% — the remaining 89% of requests have at least one candidate in another zone. How often does the cross-zone candidate actually win? When zones are balanced, the load comparison is effectively a coin flip, so the winner is a uniform pick over the two candidates — which makes it a uniform pick over all N pods, and the cross-zone fraction is simply 1 - N_local/N = 2/3. So pure-random P2C in a 3-zone region routes ~67% of traffic cross-zone in steady state, even when zones are perfectly balanced.
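
The derivation is easy to check empirically. A minimal Monte-Carlo sketch — the 240-pod, 3-zone topology is from this chapter; modelling the balanced-load P2C comparison as a coin flip is the assumption the derivation makes:

```python
import random

random.seed(1)
N, N_LOCAL = 240, 80                      # 3 zones x 80 pods; caller is in zone 0
zone = [i // N_LOCAL for i in range(N)]   # pod index -> zone index

trials, both_local, cross = 200_000, 0, 0
for _ in range(trials):
    a, b = random.sample(range(N), 2)
    both_local += zone[a] == 0 and zone[b] == 0
    # With balanced zones the load comparison is a coin flip, so the P2C
    # winner is effectively a uniform pick over the two candidates.
    winner = a if random.random() < 0.5 else b
    cross += zone[winner] != 0

print(f"both picks local: {both_local / trials:.3f}")   # ~0.111 = (1/3)^2
print(f"request crossed:  {cross / trials:.3f}")        # ~0.667 = 1 - 1/3
```

Two hundred thousand trials land within a few tenths of a percent of the closed-form 11% and 67%.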

The economic side is just as bad. AWS bills inter-AZ traffic at ~$0.01/GB in *each* direction; taking the one-way rate, a 1 KB request + 1 KB response that crosses a zone moves 2 KB over the inter-AZ link: 2 KB × $0.01/GB ≈ $0.00000002 per request. That sounds tiny until you multiply by MealRush's 18M orders per day and a fan-out of 4 cross-zone hops per order: ~$1.4/day for one service. The same pattern across 60 services in the mesh adds up to ~$2,500/month in pure inter-AZ data transfer, paid for nothing — the bytes were going to a pod the call could have hit locally.
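
The arithmetic, spelled out — the rates and volumes are the paragraph's own, with decimal KB/GB for simplicity:

```python
GB = 1e9                       # decimal gigabyte, matching AWS billing units
RATE_PER_GB = 0.01             # USD, one-way inter-AZ rate
BYTES_PER_HOP = 2_000          # 1 KB request + 1 KB response

per_hop = BYTES_PER_HOP / GB * RATE_PER_GB         # ~$2e-8 per cross-zone hop
per_order = per_hop * 4                            # 4 cross-zone hops per order
per_day = per_order * 18_000_000                   # 18M orders/day, one service
per_month = per_day * 60 * 30                      # 60 services, 30 days

print(f"per request: ${per_hop:.8f}")
print(f"per day (one service): ${per_day:.2f}")
print(f"per month (60 services): ${per_month:,.0f}")
```

The back-of-envelope lands at ~$1.44/day per service and ~$2,600/month for the mesh — the same order as the chapter's ~$2,500/month figure.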

The mechanism: priority-with-overflow

The standard locality-aware algorithm — implemented in Envoy's LOCALITY_WEIGHTED_LB, in Kubernetes' service.kubernetes.io/topology-aware-hints, and in Google's L7 LB — has three moving parts:

  1. Priority groups: backends are organised into ordered "localities". Locality 0 is the caller's own zone; locality 1 is "any other zone in the same region"; locality 2 might be "any other region". Each priority group has a health threshold — typically 0.7 (70% of the pods must be healthy, or the priority is "degraded").
  2. Weight calculation: at any moment, each priority gets a weight w_i ∈ [0, 100]. If priority 0 is healthy, w_0 = 100 and w_1 = 0. If priority 0's healthy fraction drops below 0.7, weight bleeds out to lower priorities by a published formula.
  3. Per-priority load balancing: once a priority is chosen (by sampling proportional to w_i), the request is routed inside that priority using the underlying picker (P2C, least-conn, ring-hash — all the algorithms from earlier chapters apply).

The default behaviour is "100% local until local degrades enough". The interesting part is the formula for how weight bleeds out as locality 0's health drops.

Envoy's formula, which is the de-facto standard:

overprovision_factor = 1.4   # default; configurable per cluster
health_p0 = healthy_p0_count / total_p0_count
weighted_health_p0 = min(1.0, health_p0 * overprovision_factor)

if weighted_health_p0 == 1.0:
    w_0 = 100
    w_1 = 0
else:
    w_0 = round(weighted_health_p0 * 100)
    w_1 = 100 - w_0

The overprovision_factor is what makes this not-naive. It says: "before we spill cross-zone, assume each healthy local pod can absorb 1.4× its share of the load". So as long as ≥ 1/1.4 = 71% of local pods are healthy, all traffic stays local. Below 71% healthy, weight starts bleeding out proportionally. At 50% local health, 30% of traffic spills cross-zone (min(1, 0.5 * 1.4) = 0.7). At 35% local health, 51% spills. At 0% local health, 100% spills.
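
The whole spill curve is one clipped line. A minimal sketch — the function name is mine, the formula and the worked values are the ones above:

```python
def local_share(health: float, overprovision: float = 1.4) -> float:
    """Fraction of traffic kept in-zone under the Envoy-style spill formula."""
    return min(1.0, health * overprovision)

# Reproduce the worked examples: above ~71% health nothing spills; below,
# spill grows linearly until 0% health means 100% spill.
for h in (1.0, 0.72, 0.5, 0.35, 0.1):
    print(f"local health {h:.2f} -> {local_share(h):.0%} local, "
          f"{1 - local_share(h):.0%} spilled")
```

At health 0.5 this prints 70% local / 30% spilled, and at 0.35 it prints 49% / 51% — matching the paragraph above.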

[Figure: locality weight as a function of local health — the spill curve for overprovision factor 1.4. Local share = min(1, 1.4 × health): flat at 100% local above 0.71 health, then linear below it — health 0.5 → 70% local, 0.35 → 49%, 0.10 → 14%.]
Illustrative — Envoy's spill curve. The 1.4 overprovision factor is the slack: each healthy local pod is assumed to handle 40% extra load before cross-zone spill begins. Tune lower (1.1–1.2) for tight headroom, higher (1.5–2.0) for fleets that hate cross-AZ traffic.

The contrast with the locality-blind regime is stark: under priority-with-overflow, ~100% of traffic stays local during steady state. The cost is paid only when locality-0 actually degrades — exactly the situation where you want the cross-zone safety valve. A senior platform engineer at PaySetu, asked why their inter-AZ bill dropped 78% after enabling locality-aware routing in the mesh, will tell you: "we were paying for a load-balancing decision we never needed to make".

Why the overprovision factor matters more than the curve shape: the spill formula min(1, health × overprovision) is functionally a threshold. Below the threshold 1/overprovision, traffic spills proportional to the deficit. Above it, locality is 100%. The factor 1.4 was chosen by the Envoy maintainers based on production data: it tolerates ~30% of local pods being unhealthy without spilling — wide enough that routine rolling deployments don't trigger spill (a deployment that takes 20% of pods down for 30 seconds at a time is below the 30% slack), narrow enough that genuine zone failures spill cross-zone before queue depths spike. Lower factors (1.1, 1.2) are appropriate when the inter-AZ cost is very high (cross-region, not just cross-zone). Higher factors (2.0+) are dangerous — they keep traffic local even when local pods are visibly overloaded, just to avoid the cross-AZ trip.

A 75-line locality-aware picker, end-to-end

This script implements the priority-with-overflow algorithm from scratch, simulates 100k requests across a 3-zone region at various local-health levels, and reports the local share, p99 latency, and load skew. It is the same priority-with-overflow scheme that Envoy-derived production load balancers run.

# locality_aware.py — priority-with-overflow load balancer simulation.
import random, statistics
from collections import Counter

# Topology: 3 zones, 80 pods each. Caller is in zone 0.
ZONES = ["az-1a", "az-1b", "az-1c"]
PODS_PER_ZONE = 80
INTRAZONE_LATENCY_MS = 0.08    # p50 intra-AZ RTT
INTERZONE_LATENCY_MS = 1.6     # p50 inter-AZ RTT
OVERPROVISION = 1.4

class Pod:
    __slots__ = ("id", "zone", "healthy", "load")
    def __init__(self, id, zone):
        self.id, self.zone, self.healthy, self.load = id, zone, True, 0

def build_fleet():
    fleet = []
    for z in ZONES:
        for i in range(PODS_PER_ZONE):
            fleet.append(Pod(f"{z}-{i}", z))
    return fleet

def compute_priority_weights(fleet, caller_zone):
    """Envoy-style: w_local = min(1, health * overprovision); w_remote = 1 - w_local."""
    local = [p for p in fleet if p.zone == caller_zone]
    healthy_local = sum(1 for p in local if p.healthy)
    health_frac = healthy_local / len(local) if local else 0
    w_local = min(1.0, health_frac * OVERPROVISION)
    return {"local": w_local, "remote": 1.0 - w_local}

def pick_pod(fleet, caller_zone):
    """Pick priority by weight, then P2C inside that priority."""
    weights = compute_priority_weights(fleet, caller_zone)
    bucket = "local" if random.random() < weights["local"] else "remote"
    if bucket == "local":
        candidates = [p for p in fleet if p.zone == caller_zone and p.healthy]
    else:
        candidates = [p for p in fleet if p.zone != caller_zone and p.healthy]
    if not candidates:
        candidates = [p for p in fleet if p.healthy]   # fallback
    if len(candidates) == 1:           # random.sample(..., 1) can't unpack to two names
        return candidates[0]
    a, b = random.sample(candidates, 2)
    return a if a.load <= b.load else b

def simulate(fleet, caller_zone, n_requests=100_000, local_health=1.0):
    # Mark some local pods unhealthy to test the spill curve.
    local = [p for p in fleet if p.zone == caller_zone]
    n_unhealthy = int(len(local) * (1 - local_health))
    for p in random.sample(local, n_unhealthy):
        p.healthy = False
    cross_zone = 0
    latencies = []
    for _ in range(n_requests):
        pod = pick_pod(fleet, caller_zone)
        is_cross = pod.zone != caller_zone
        cross_zone += is_cross
        # Simulate latency: base RTT + small queueing jitter.
        base = INTERZONE_LATENCY_MS if is_cross else INTRAZONE_LATENCY_MS
        latencies.append(base + random.expovariate(20))
        pod.load += 1
    return {
        "local_share": 1 - cross_zone / n_requests,
        "p99_ms": statistics.quantiles(latencies, n=100)[98],
        "load_skew": max(p.load for p in fleet) / max(1, min(p.load for p in fleet if p.load > 0)),
    }

random.seed(7)
print("  health   local_share   p99_ms   load_skew")
for h in [1.0, 0.7, 0.5, 0.3, 0.1]:
    fleet = build_fleet()
    r = simulate(fleet, "az-1a", local_health=h)
    print(f"  {h:.2f}     {r['local_share']:.1%}        {r['p99_ms']:.2f}    {r['load_skew']:.2f}")

Sample run:

  health   local_share   p99_ms   load_skew
  1.00     100.0%         0.27    1.21
  0.70     97.9%          0.31    1.30
  0.50     69.8%          1.32    1.43
  0.30     42.0%          2.10    1.62
  0.10     14.0%          2.25    2.04

Per-line walkthrough. compute_priority_weights is the entire spill formula in two lines: compute the healthy fraction of local pods, multiply by overprovision, clip at 1.0. The remaining weight goes to "remote". pick_pod samples the priority bucket by weight, then runs ordinary P2C inside whichever bucket won. The fallback (if not candidates) handles the edge case where the chosen bucket has no healthy pods at all — extremely rare, but the LB must not crash. simulate marks a (1 - local_health) fraction of local pods unhealthy, fires 100k requests, and tracks how many crossed a zone. The output reads cleanly: at 100% local health, 100% of traffic stays local and p99 is 0.27 ms (intra-AZ RTT plus jitter). At 70% local health — just below the 71% spill threshold under overprovision=1.4 — 97.9% still stays local; the spill curve is only just kicking in. At 50% local health, ~30% spills cross-zone and p99 jumps to 1.32 ms. At 10% local health, only 14% can stay local because there are too few healthy local pods, and p99 plateaus around 2.25 ms, limited by the inter-AZ RTT.

Why p99 plateaus instead of growing without bound as local health drops: once enough traffic spills cross-zone, p99 is dominated by inter-AZ RTT (~1.6 ms p50, ~2.5 ms p99) rather than queueing on a small local pool. The plateau height is the inter-AZ p99 latency. The interesting transition is between health=0.7 (still ~98% local because of overprovision) and health=0.3 (already ~58% spilled). The slope of that transition is the engineer-tunable knob — set overprovision lower for sharper spill, higher for stickier locality. The factor 1.4 was tuned by Envoy's authors against Lyft's production traffic; it's a defensible default but not a universal optimum.

What goes wrong: the asymmetric-fleet trap

The naive priority-with-overflow algorithm assumes equal pod counts per zone. Real fleets are usually skewed: a Bengaluru-region service might have 60 pods in 1a, 80 in 1b, and 40 in 1c because the autoscaler chose to deploy where capacity was cheapest. Now imagine a caller is in zone 1c. Their "local pool" is 40 pods. If each of those 40 pods can handle 1000 RPS, the local pool can absorb 40,000 RPS. If the regional traffic is 90,000 RPS uniformly distributed across callers, callers in 1c should be sending ~30,000 RPS to backends — and that fits in 1c's pool, so the spill rate is zero. So far so good.

But suppose a caller's traffic isn't uniform — say 1c callers are themselves heavier-traffic clients (because they happen to be the API-gateway pods, and the gateway pods got placed in the cheaper zone). Now 1c callers might be sending 60,000 RPS to backends, and 1c's 40-pod local pool can only absorb 40,000. The remaining 20,000 RPS must spill cross-zone — except naive priority-with-overflow doesn't notice the imbalance. It computes health = 100%, sets w_local = 1.0, and routes 100% of 1c-caller traffic to 1c backends. Each 1c backend gets 1500 RPS instead of its 1000 RPS budget. Latency p99 climbs from 38 ms to 280 ms. The dashboard shows 1c backends at 95% CPU, 1a and 1b at 30% — the asymmetric-fleet hot zone.

[Figure: asymmetric fleet — locality blindness causes a hot zone. AZ-1a (60 pods, 30k RPS in, 30% CPU) and AZ-1b (80 pods, 32k RPS, 32% CPU) sit quiet while AZ-1c (40 pods, 60k RPS in) runs at 95% CPU, because naive locality routing sees 100% local health and keeps all 1c-caller traffic local. Locality routing without capacity awareness creates a hot zone whenever caller-side demand and backend-side capacity diverge.]
Illustrative — the asymmetric-fleet hot-zone trap. The fix is to weight priority-0 by *capacity* (pod count × per-pod RPS budget) rather than just by healthy-pod count. Envoy's zone-aware settings (`min_cluster_size`, `routing_enabled`) guard against the small-pool case by falling back to non-locality routing when the upstream cluster is too small.

The fix is capacity-aware locality: weight the priority-0 share by the local pool's capacity relative to the caller side's demand, not by raw pod count. Kubernetes' topologyAwareHints does exactly this — the kube-controller computes a CPU-weighted heuristic and sets hints only when each zone has at least 3 endpoints AND the per-zone CPU is roughly proportional to the per-zone allocatable. If the imbalance exceeds 20%, hints are dropped and traffic is routed cluster-wide. This is why topologyAwareHints is sometimes "silently disabled" in production — the controller decided the fleet was too asymmetric.
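
A sketch of the capacity-aware variant, using the hypothetical numbers from the 1c example above (40 local pods × 1,000 RPS budget, 60k RPS of caller-side demand, 70% target utilisation); the function name and interface are mine:

```python
def capacity_aware_local_share(local_pods: int, rps_budget_per_pod: float,
                               caller_demand_rps: float,
                               target_util: float = 0.7) -> float:
    """Keep only as much traffic local as the local pool can absorb at the
    target utilisation; the excess spills cross-zone up front, before the
    local pods melt."""
    absorbable = local_pods * rps_budget_per_pod * target_util
    return min(1.0, absorbable / caller_demand_rps)

share = capacity_aware_local_share(40, 1000, 60_000)
print(f"local share: {share:.1%}")   # 28k of the 60k RPS fits locally
```

For the 1c scenario this keeps ~47% of traffic local (28k of 60k RPS) instead of the naive 100% that melts the zone.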

The 2023 Cloudflare blog post on their internal "Argo Smart Routing" describes a related fix: track per-caller cohort RPS and only enable locality routing when the cohort's demand fits in the local pool's capacity at <70% utilisation. Above 70%, fall back to global P2C. The threshold 70% is again the same overprovision-derived number — 1/1.4.

Going deeper

The Envoy weight calculation in full

Envoy's LOCALITY_WEIGHTED_LB (described in the EDS protobuf and the Envoy docs) is the canonical implementation. The full algorithm extends the simple two-priority case to N priorities and supports per-priority overprovision. The pseudocode:

remaining = 100
for priority i in 0..N-1:
    health_i = healthy_count_i / total_count_i
    weighted_health_i = min(1.0, health_i * overprovision_factor_i)
    load_i = min(remaining, round(weighted_health_i * 100))
    remaining = remaining - load_i
if sum(weighted_health_i for all i) == 0:
    panic_mode = True   # all priorities fully degraded; route to all pods uniformly
elif remaining > 0:
    # every priority partially degraded: scale each load_i by 100 / (100 - remaining)
    normalise the loads so they sum to 100

Two subtleties: (a) if all priorities are degraded, Envoy enters "panic mode" and ignores the priority structure entirely — this prevents the spill-from-spill cascade. (b) The per-priority overprovision factor lets you set overprovision_factor_2 = 1.0 (no slack) for the third priority while keeping 1.4 for the first two, biasing harder against cross-region.
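
The cascade is short enough to make runnable. A sketch — the function name and the list-of-(healthy, total) interface are mine, not Envoy's:

```python
def priority_loads(priorities, overprovision=(1.4, 1.4, 1.0)):
    """Per-priority load percentages plus a panic flag.

    priorities: ordered list of (healthy_count, total_count), priority 0 first.
    overprovision: per-priority factors (index-matched to priorities).
    """
    loads = [0] * len(priorities)
    remaining = 100
    total_weighted = 0.0
    for i, (healthy, total) in enumerate(priorities):
        health = healthy / total if total else 0.0
        weighted = min(1.0, health * overprovision[i])
        total_weighted += weighted
        # Each priority absorbs up to its weighted health; overflow cascades down.
        loads[i] = min(remaining, round(weighted * 100))
        remaining -= loads[i]
    if total_weighted == 0:
        return loads, True              # panic mode: caller routes uniformly instead
    if remaining > 0:                   # every priority degraded: scale up to 100
        scale = 100 / (100 - remaining)
        loads = [round(l * scale) for l in loads]
    return loads, False

print(priority_loads([(80, 80), (160, 160)]))   # healthy P0 keeps everything
print(priority_loads([(32, 80), (160, 160)]))   # 40% P0 health -> 56/44 split
```

The 40%-health case reproduces the two-priority formula from earlier: min(1, 0.4 × 1.4) = 0.56, so 56% stays at priority 0 and 44% overflows.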

Why topologyAwareHints was renamed topologyAwareRouting in K8s 1.27

The original topologyAwareHints (alpha 1.21) emitted hints on EndpointSlice resources that the kube-proxy was free to ignore. In practice it did not, but service-mesh proxies (Istio, Linkerd) sometimes did, leading to confusing behaviour where kubectl get endpointslices showed hints but traffic distribution didn't reflect them. K8s 1.27 renamed the feature to topologyAwareRouting (GA in 1.30) and made the contract explicit. The CPU-proportionality formula is unchanged: each zone gets a fraction of the EndpointSlices proportional to its share of allocatable CPU, with a minimum of 3 endpoints per zone or the feature is disabled.

The 2024 Cloudflare "Unimog 2" post-mortem

Cloudflare's edge LB Unimog had a locality-routing bug in early 2024, surfaced during a fibre outage in their Singapore PoP. The bug: the spill formula used the number of healthy backends, not their capacity. When a fibre cut took 60% of Singapore-internal connectivity offline, the local backend count dropped from 40 to 16 — had all traffic stayed local, each survivor would have had to absorb 2.5× its normal load. The formula computed health = 16/40 = 0.4, multiplied by 1.4, got weighted_health = 0.56, and spilled only 44% cross-PoP. The remaining 56% piled into the 16 surviving backends, each now running ~1.4× its normal load on hardware already near its ceiling. p99 went from 14 ms to 1.8 s in 90 seconds. The fix replaced healthy_count / total_count with healthy_capacity / steady_state_demand — measure capacity via per-backend RPS history, not by counting pods.
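
The difference between the two formulas, in the Singapore numbers — the per-backend RPS budget is an assumption, and the capacity-based variant here spills the unabsorbable excess directly, which is the spirit of the fix rather than Cloudflare's exact code:

```python
OVERPROVISION = 1.4

def spill_count_based(healthy: int, total: int) -> float:
    # The buggy formula: inflate the healthy *fraction* by the overprovision
    # factor, regardless of whether the survivors can actually absorb the load.
    return 1 - min(1.0, (healthy / total) * OVERPROVISION)

def spill_capacity_based(capacity_rps: float, demand_rps: float) -> float:
    # The fix's spirit: spill exactly the demand the surviving capacity
    # cannot absorb, measured from per-backend RPS history.
    return max(0.0, 1 - capacity_rps / demand_rps)

# 16 of 40 backends survive; assume 1000 RPS each against 40k RPS steady demand.
print(f"count-based:    {spill_count_based(16, 40):.0%} spilled")
print(f"capacity-based: {spill_capacity_based(16_000, 40_000):.0%} spilled")
```

Count-based spills 44% and overloads the survivors; capacity-based spills the full 60% that the surviving pool cannot serve.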

Reproduce this on your laptop

# Run the simulator from this article:
python3 -m venv .venv && source .venv/bin/activate   # optional; the script is stdlib-only
python3 locality_aware.py
# Expected (seed=7, 100k requests, OVERPROVISION=1.4):
#   health=1.00 → local_share=100.0% p99=0.27 ms
#   health=0.70 → local_share=97.9%  p99=0.31 ms
#   health=0.50 → local_share=69.8%  p99=1.32 ms

# Inspect Envoy's locality-weighted config for a pod in an Istio mesh
# (pod name is a placeholder; requires istioctl):
istioctl proxy-config cluster <pod-name> -o json | grep -i -A 5 locality

# Inspect Kubernetes topology-aware hints:
kubectl get endpointslices -n production -o jsonpath='{.items[*].endpoints[*].hints}'
# A populated `forZones` field means topology-aware routing is active.

Where this leads next

Locality-aware load balancing biases backend selection by zone topology. The natural sequels are sticky sessions (when caching makes locality alone insufficient), composition order (how P2C, ring-hash, and locality layer without fighting), and the failure-detection signals that feed the spill formula.

References

  1. Envoy — Locality weighted load balancing — the canonical implementation; spill formula and overprovision documented in detail.
  2. Kubernetes — Topology Aware Routing — the K8s-native equivalent; CPU-proportional hint generation.
  3. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" — NSDI 2016 — Google's L4 LB; locality is one input to the consistent-hash table fill.
  4. Dean & Barroso, "The Tail at Scale" — CACM 2013 — the foundational latency-tail paper; locality-aware routing is one of the techniques discussed for tail mitigation.
  5. Cloudflare blog — "Unimog: Cloudflare's edge load balancer" (2020) — production locality routing at edge; predecessor to the 2024 capacity-aware fix.
  6. Lyft engineering — "Envoy zone-aware routing" — the 2018 blog post that motivated Envoy's design; original numbers from Lyft's mesh.
  7. Power of two choices (P2C) — internal companion. The inner-loop algorithm running inside each priority bucket.
  8. Consistent hashing (ring, jump, Maglev) — internal companion. Composes with locality routing when key-affinity matters.