Locality-aware load balancing

It is 13:04 on a Wednesday and Karan is staring at MealRush's order-service dashboard. The fleet has 240 pods spread across three Mumbai availability zones (ap-south-1a, 1b, 1c), 80 pods per zone. p99 of the order-create RPC has been steady at 38 ms for six weeks. At 12:58 the platform team enabled "fair P2C" across the entire fleet — every caller picks two random pods from all 240, sends to the less loaded one. The new dashboard shows perfectly flat backend CPU. p99 is now 71 ms. The reason is in the trace: 67% of the requests now cross an AZ boundary, and each cross-AZ hop on AWS Mumbai costs 1.6 ms one-way over the inter-AZ fibre. With three serial hops in a request fan-out (auth → menu → order), that's ~9 ms of pure cable that wasn't there before. This chapter is about locality-aware load balancing — picking backends in your own zone first, only spilling cross-zone when local pods would be overwhelmed — and the surprisingly thin line between "stay local" and "create a hot zone".

Locality-aware load balancing biases backend selection toward the caller's own zone (or rack, or region) so that requests do not pay cross-AZ latency on every hop. The mechanism is a per-locality weight table that starts at 100% local and falls toward "spill cross-zone" only when local pods exceed a load threshold. The hard part is the spill curve — too eager and you lose the locality benefit; too late and one zone's hot pods melt while the others sit idle. Envoy, Kubernetes' topologyAwareHints, and Google's Maglev all implement variants of the same priority-with-overflow scheme.

Why "globally fair" is the wrong default

A modern cloud region is not a flat network. AWS Mumbai (ap-south-1) is three availability zones connected by dedicated fibre; the inter-AZ p50 RTT is ~1.4 ms, p99 ~3 ms, and the cost is ~$0.01 per GB transferred (in both directions). Within a single AZ, the p50 RTT between two EC2 instances is ~80 µs and the bandwidth is free. The ratio is roughly 17×: an inter-AZ hop is 17 times more expensive in latency than an intra-AZ hop, and orders of magnitude more expensive in money once the byte volume scales.

A naive load balancer is locality-blind. P2C across the whole fleet picks two random pods and sends the request to the less loaded one. With 80 pods per zone in a 3-zone region, the probability that both picks land in the caller's zone is (80/240)^2 = 1/9 ≈ 11% — so 89% of requests have at least one candidate pod in another zone, and roughly two-thirds of requests end up crossing a zone boundary. For a single-hop request, you eat one inter-AZ traversal (~1.6 ms). For a fan-out request that touches 4 services in a chain, you eat up to 4 inter-AZ traversals — ~6.4 ms of pure latency that locality-aware routing would have eliminated.

[Figure: locality-blind P2C across three AZs — a caller pod in AZ-1a fans requests out to pods in all three zones; every cross-zone arrow pays ~1.6 ms inter-zone RTT, and only about a third of the caller's traffic stays local.]
Illustrative — the locality-blind regime. Backend CPU looks "fair" on a dashboard, but every cross-AZ arrow is ~1.6 ms of latency the caller did not need to pay.

Why the cross-AZ probability is (N - N_local) / N for a single random pick, and why at least one remote candidate appears with probability 1 - (N_local / N)^k under a P2C-style k-pick scheme: each pick is independent and uniformly random across N pods. The probability that all k picks land in the caller's zone is (N_local / N)^k. For N_local/N = 1/3 (3 equal zones) and k=2, the probability of "all local" is 1/9 ≈ 11% — the remaining 89% of requests have at least one candidate in another zone. How often does the cross-zone candidate actually win? When zones are balanced, the load comparison is effectively a coin flip, so the winner is a uniform pick over the two candidates — which makes it a uniform pick over all N pods, and the cross-zone fraction is simply 1 - N_local/N = 2/3. So pure-random P2C in a 3-zone region routes ~67% of traffic cross-zone in steady state, even when zones are perfectly balanced.
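
The derivation is easy to check empirically. A minimal Monte-Carlo sketch — the 240-pod, 3-zone topology is from this chapter; modelling the balanced-load P2C comparison as a coin flip is the assumption the derivation makes:

```python
import random

random.seed(1)
N, N_LOCAL = 240, 80                      # 3 zones x 80 pods; caller is in zone 0
zone = [i // N_LOCAL for i in range(N)]   # pod index -> zone index

trials, both_local, cross = 200_000, 0, 0
for _ in range(trials):
    a, b = random.sample(range(N), 2)
    both_local += zone[a] == 0 and zone[b] == 0
    # With balanced zones the load comparison is a coin flip, so the P2C
    # winner is effectively a uniform pick over the two candidates.
    winner = a if random.random() < 0.5 else b
    cross += zone[winner] != 0

print(f"both picks local: {both_local / trials:.3f}")   # ~0.111 = (1/3)^2
print(f"request crossed:  {cross / trials:.3f}")        # ~0.667 = 1 - 1/3
```

Two hundred thousand trials land within a few tenths of a percent of the closed-form 11% and 67%.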

The economic side is just as bad. AWS bills inter-AZ traffic at ~$0.01/GB in *each* direction; taking the one-way rate, a 1 KB request + 1 KB response that crosses a zone moves 2 KB over the inter-AZ link: 2 KB × $0.01/GB ≈ $0.00000002 per request. That sounds tiny until you multiply by MealRush's 18M orders per day and a fan-out of 4 cross-zone hops per order: ~$1.4/day for one service. The same pattern across 60 services in the mesh adds up to ~$2,500/month in pure inter-AZ data transfer, paid for nothing — the bytes were going to a pod the call could have hit locally.
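
The arithmetic, spelled out — the rates and volumes are the paragraph's own, with decimal KB/GB for simplicity:

```python
GB = 1e9                       # decimal gigabyte, matching AWS billing units
RATE_PER_GB = 0.01             # USD, one-way inter-AZ rate
BYTES_PER_HOP = 2_000          # 1 KB request + 1 KB response

per_hop = BYTES_PER_HOP / GB * RATE_PER_GB         # ~$2e-8 per cross-zone hop
per_order = per_hop * 4                            # 4 cross-zone hops per order
per_day = per_order * 18_000_000                   # 18M orders/day, one service
per_month = per_day * 60 * 30                      # 60 services, 30 days

print(f"per request: ${per_hop:.8f}")
print(f"per day (one service): ${per_day:.2f}")
print(f"per month (60 services): ${per_month:,.0f}")
```

The back-of-envelope lands at ~$1.44/day per service and ~$2,600/month for the mesh — the same order as the chapter's ~$2,500/month figure.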

The mechanism: priority-with-overflow

The standard locality-aware algorithm — implemented in Envoy's LOCALITY_WEIGHTED_LB, in Kubernetes' service.kubernetes.io/topology-aware-hints, and in Google's L7 LB — has three moving parts:

  1. Priority groups: backends are organised into ordered "localities". Locality 0 is the caller's own zone; locality 1 is "any other zone in the same region"; locality 2 might be "any other region". Each priority group has a health threshold — typically 0.7 (70% of the pods must be healthy, or the priority is "degraded").
  2. Weight calculation: at any moment, each priority gets a weight w_i ∈ [0, 100]. If priority 0 is healthy, w_0 = 100 and w_1 = 0. If priority 0's healthy fraction drops below 0.7, weight bleeds out to lower priorities by a published formula.
  3. Per-priority load balancing: once a priority is chosen (by sampling proportional to w_i), the request is routed inside that priority using the underlying picker (P2C, least-conn, ring-hash — all the algorithms from earlier chapters apply).

The default behaviour is "100% local until local degrades enough". The interesting part is the formula for how weight bleeds out as locality 0's health drops.

Envoy's formula, which is the de-facto standard:

overprovision_factor = 1.4   # default; configurable per cluster
health_p0 = healthy_p0_count / total_p0_count
weighted_health_p0 = min(1.0, health_p0 * overprovision_factor)

if weighted_health_p0 == 1.0:
    w_0 = 100
    w_1 = 0
else:
    w_0 = round(weighted_health_p0 * 100)
    w_1 = 100 - w_0

The overprovision_factor is what makes this not-naive. It says: "before we spill cross-zone, assume each healthy local pod can absorb 1.4× its share of the load". So as long as ≥ 1/1.4 = 71% of local pods are healthy, all traffic stays local. Below 71% healthy, weight starts bleeding out proportionally. At 50% local health, 30% of traffic spills cross-zone (min(1, 0.5 * 1.4) = 0.7). At 35% local health, 51% spills. At 0% local health, 100% spills.
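
The whole spill curve is one clipped line. A minimal sketch — the function name is mine, the formula and the worked values are the ones above:

```python
def local_share(health: float, overprovision: float = 1.4) -> float:
    """Fraction of traffic kept in-zone under the Envoy-style spill formula."""
    return min(1.0, health * overprovision)

# Reproduce the worked examples: above ~71% health nothing spills; below,
# spill grows linearly until 0% health means 100% spill.
for h in (1.0, 0.72, 0.5, 0.35, 0.1):
    print(f"local health {h:.2f} -> {local_share(h):.0%} local, "
          f"{1 - local_share(h):.0%} spilled")
```

At health 0.5 this prints 70% local / 30% spilled, and at 0.35 it prints 49% / 51% — matching the paragraph above.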

[Figure: locality weight as a function of local health — the spill curve for overprovision factor 1.4. Local share = min(1, 1.4 × health): flat at 100% local above 0.71 health, then linear below it — health 0.5 → 70% local, 0.35 → 49%, 0.10 → 14%.]
Illustrative — Envoy's spill curve. The 1.4 overprovision factor is the slack: each healthy local pod is assumed to handle 40% extra load before cross-zone spill begins. Tune lower (1.1–1.2) for tight headroom, higher (1.5–2.0) for fleets that hate cross-AZ traffic.

The contrast with the locality-blind regime is stark: under priority-with-overflow, ~100% of traffic stays local during steady state. The cost is paid only when locality-0 actually degrades — exactly the situation where you want the cross-zone safety valve. A senior platform engineer at PaySetu, asked why their inter-AZ bill dropped 78% after enabling locality-aware routing in the mesh, will tell you: "we were paying for a load-balancing decision we never needed to make".

Why the overprovision factor matters more than the curve shape: the spill formula min(1, health × overprovision) is functionally a threshold. Below the threshold 1/overprovision, traffic spills proportional to the deficit. Above it, locality is 100%. The factor 1.4 was chosen by the Envoy maintainers based on production data: it tolerates ~30% of local pods being unhealthy without spilling — wide enough that routine rolling deployments don't trigger spill (a deployment that takes 20% of pods down for 30 seconds at a time is below the 30% slack), narrow enough that genuine zone failures spill cross-zone before queue depths spike. Lower factors (1.1, 1.2) are appropriate when the inter-AZ cost is very high (cross-region, not just cross-zone). Higher factors (2.0+) are dangerous — they keep traffic local even when local pods are visibly overloaded, just to avoid the cross-AZ trip.

A 75-line locality-aware picker, end-to-end

This script implements the priority-with-overflow algorithm from scratch, simulates 100k requests across a 3-zone region at various local-health levels, and reports the local share, p99 latency, and load skew. It is the same priority-with-overflow scheme that Envoy-derived production load balancers run.

# locality_aware.py — priority-with-overflow load balancer simulation.
import random, statistics
from collections import Counter

# Topology: 3 zones, 80 pods each. Caller is in zone 0.
ZONES = ["az-1a", "az-1b", "az-1c"]
PODS_PER_ZONE = 80
INTRAZONE_LATENCY_MS = 0.08    # p50 intra-AZ RTT
INTERZONE_LATENCY_MS = 1.6     # p50 inter-AZ RTT
OVERPROVISION = 1.4

class Pod:
    __slots__ = ("id", "zone", "healthy", "load")
    def __init__(self, id, zone):
        self.id, self.zone, self.healthy, self.load = id, zone, True, 0

def build_fleet():
    fleet = []
    for z in ZONES:
        for i in range(PODS_PER_ZONE):
            fleet.append(Pod(f"{z}-{i}", z))
    return fleet

def compute_priority_weights(fleet, caller_zone):
    """Envoy-style: w_local = min(1, health * overprovision); w_remote = 1 - w_local."""
    local = [p for p in fleet if p.zone == caller_zone]
    healthy_local = sum(1 for p in local if p.healthy)
    health_frac = healthy_local / len(local) if local else 0
    w_local = min(1.0, health_frac * OVERPROVISION)
    return {"local": w_local, "remote": 1.0 - w_local}

def pick_pod(fleet, caller_zone):
    """Pick priority by weight, then P2C inside that priority."""
    weights = compute_priority_weights(fleet, caller_zone)
    bucket = "local" if random.random() < weights["local"] else "remote"
    if bucket == "local":
        candidates = [p for p in fleet if p.zone == caller_zone and p.healthy]
    else:
        candidates = [p for p in fleet if p.zone != caller_zone and p.healthy]
    if not candidates:
        candidates = [p for p in fleet if p.healthy]   # fallback
    if len(candidates) == 1:           # random.sample(..., 1) can't unpack to two names
        return candidates[0]
    a, b = random.sample(candidates, 2)
    return a if a.load <= b.load else b

def simulate(fleet, caller_zone, n_requests=100_000, local_health=1.0):
    # Mark some local pods unhealthy to test the spill curve.
    local = [p for p in fleet if p.zone == caller_zone]
    n_unhealthy = int(len(local) * (1 - local_health))
    for p in random.sample(local, n_unhealthy):
        p.healthy = False
    cross_zone = 0
    latencies = []
    for _ in range(n_requests):
        pod = pick_pod(fleet, caller_zone)
        is_cross = pod.zone != caller_zone
        cross_zone += is_cross
        # Simulate latency: base RTT + small queueing jitter.
        base = INTERZONE_LATENCY_MS if is_cross else INTRAZONE_LATENCY_MS
        latencies.append(base + random.expovariate(20))
        pod.load += 1
    return {
        "local_share": 1 - cross_zone / n_requests,
        "p99_ms": statistics.quantiles(latencies, n=100)[98],
        "load_skew": max(p.load for p in fleet) / max(1, min(p.load for p in fleet if p.load > 0)),
    }

random.seed(7)
print("  health   local_share   p99_ms   load_skew")
for h in [1.0, 0.7, 0.5, 0.3, 0.1]:
    fleet = build_fleet()
    r = simulate(fleet, "az-1a", local_health=h)
    print(f"  {h:.2f}     {r['local_share']:.1%}        {r['p99_ms']:.2f}    {r['load_skew']:.2f}")

Sample run:

  health   local_share   p99_ms   load_skew
  1.00     100.0%         0.27    1.21
  0.70     97.9%          0.31    1.30
  0.50     69.8%          1.32    1.43
  0.30     42.0%          2.10    1.62
  0.10     14.0%          2.25    2.04

Per-line walkthrough. compute_priority_weights is the entire spill formula in two lines: compute the healthy fraction of local pods, multiply by overprovision, clip at 1.0. The remaining weight goes to "remote". pick_pod samples the priority bucket by weight, then runs ordinary P2C inside whichever bucket won. The fallback (if not candidates) handles the edge case where the chosen bucket has no healthy pods at all — extremely rare, but the LB must not crash. simulate marks a (1 - local_health) fraction of local pods unhealthy, fires 100k requests, and tracks how many crossed a zone. The output reads cleanly: at 100% local health, 100% of traffic stays local and p99 is 0.27 ms (intra-AZ RTT plus jitter). At 70% local health — just below the 71% spill threshold under overprovision=1.4 — 97.9% still stays local; the spill curve is only just kicking in. At 50% local health, ~30% spills cross-zone and p99 jumps to 1.32 ms. At 10% local health, only 14% can stay local because there are too few healthy local pods, and p99 plateaus around 2.25 ms, limited by the inter-AZ RTT.

Why p99 plateaus instead of growing without bound as local health drops: once enough traffic spills cross-zone, p99 is dominated by inter-AZ RTT (~1.6 ms p50, ~2.5 ms p99) rather than queueing on a small local pool. The plateau height is the inter-AZ p99 latency. The interesting transition is between health=0.7 (still ~98% local because of overprovision) and health=0.3 (already ~58% spilled). The slope of that transition is the engineer-tunable knob — set overprovision lower for sharper spill, higher for stickier locality. The factor 1.4 was tuned by Envoy's authors against Lyft's production traffic; it's a defensible default but not a universal optimum.

What goes wrong: the asymmetric-fleet trap

The naive priority-with-overflow algorithm assumes equal pod counts per zone. Real fleets are usually skewed: a Bengaluru-region service might have 60 pods in 1a, 80 in 1b, and 40 in 1c because the autoscaler chose to deploy where capacity was cheapest. Now imagine a caller is in zone 1c. Their "local pool" is 40 pods. If each of those 40 pods can handle 1000 RPS, the local pool can absorb 40,000 RPS. If the regional traffic is 90,000 RPS uniformly distributed across callers, callers in 1c should be sending ~30,000 RPS to backends — and that fits in 1c's pool, so the spill rate is zero. So far so good.

But suppose a caller's traffic isn't uniform — say 1c callers are themselves heavier-traffic clients (because they happen to be the API-gateway pods, and the gateway pods got placed in the cheaper zone). Now 1c callers might be sending 60,000 RPS to backends, and 1c's 40-pod local pool can only absorb 40,000. The remaining 20,000 RPS must spill cross-zone — except naive priority-with-overflow doesn't notice the imbalance. It computes health = 100%, sets w_local = 1.0, and routes 100% of 1c-caller traffic to 1c backends. Each 1c backend gets 1500 RPS instead of its 1000 RPS budget. Latency p99 climbs from 38 ms to 280 ms. The dashboard shows 1c backends at 95% CPU, 1a and 1b at 30% — the asymmetric-fleet hot zone.

[Figure: asymmetric fleet — locality blindness causes a hot zone. AZ-1a (60 pods, 30k RPS in, 30% CPU) and AZ-1b (80 pods, 32k RPS, 32% CPU) sit quiet while AZ-1c (40 pods, 60k RPS in) runs at 95% CPU, because naive locality routing sees 100% local health and keeps all 1c-caller traffic local. Locality routing without capacity awareness creates a hot zone whenever caller-side demand and backend-side capacity diverge.]
Illustrative — the asymmetric-fleet hot-zone trap. The fix is to weight priority-0 by *capacity* (pod count × per-pod RPS budget) rather than just by healthy-pod count. Envoy's zone-aware settings (`min_cluster_size`, `routing_enabled`) guard against the small-pool case by falling back to non-locality routing when the upstream cluster is too small.

The fix is capacity-aware locality: weight the priority-0 share by the local pool's capacity relative to the caller side's demand, not by raw pod count. Kubernetes' topologyAwareHints does exactly this — the kube-controller computes a CPU-weighted heuristic and sets hints only when each zone has at least 3 endpoints AND the per-zone CPU is roughly proportional to the per-zone allocatable. If the imbalance exceeds 20%, hints are dropped and traffic is routed cluster-wide. This is why topologyAwareHints is sometimes "silently disabled" in production — the controller decided the fleet was too asymmetric.
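
A sketch of the capacity-aware variant, using the hypothetical numbers from the 1c example above (40 local pods × 1,000 RPS budget, 60k RPS of caller-side demand, 70% target utilisation); the function name and interface are mine:

```python
def capacity_aware_local_share(local_pods: int, rps_budget_per_pod: float,
                               caller_demand_rps: float,
                               target_util: float = 0.7) -> float:
    """Keep only as much traffic local as the local pool can absorb at the
    target utilisation; the excess spills cross-zone up front, before the
    local pods melt."""
    absorbable = local_pods * rps_budget_per_pod * target_util
    return min(1.0, absorbable / caller_demand_rps)

share = capacity_aware_local_share(40, 1000, 60_000)
print(f"local share: {share:.1%}")   # 28k of the 60k RPS fits locally
```

For the 1c scenario this keeps ~47% of traffic local (28k of 60k RPS) instead of the naive 100% that melts the zone.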

The 2023 Cloudflare blog post on their internal "Argo Smart Routing" describes a related fix: track per-caller cohort RPS and only enable locality routing when the cohort's demand fits in the local pool's capacity at <70% utilisation. Above 70%, fall back to global P2C. The threshold 70% is again the same overprovision-derived number — 1/1.4.

Going deeper

The Envoy weight calculation in full

Envoy's LOCALITY_WEIGHTED_LB (described in the EDS protobuf and the Envoy docs) is the canonical implementation. The full algorithm extends the simple two-priority case to N priorities and supports per-priority overprovision. The pseudocode:

remaining = 100
for priority i in 0..N-1:
    health_i = healthy_count_i / total_count_i
    weighted_health_i = min(1.0, health_i * overprovision_factor_i)
    load_i = min(remaining, round(weighted_health_i * 100))
    remaining = remaining - load_i
if sum(weighted_health_i for all i) == 0:
    panic_mode = True   # all priorities fully degraded; route to all pods uniformly
elif remaining > 0:
    # every priority partially degraded: scale each load_i by 100 / (100 - remaining)
    normalise the loads so they sum to 100

Two subtleties: (a) if all priorities are degraded, Envoy enters "panic mode" and ignores the priority structure entirely — this prevents the spill-from-spill cascade. (b) The per-priority overprovision factor lets you set overprovision_factor_2 = 1.0 (no slack) for the third priority while keeping 1.4 for the first two, biasing harder against cross-region.
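
The cascade is short enough to make runnable. A sketch — the function name and the list-of-(healthy, total) interface are mine, not Envoy's:

```python
def priority_loads(priorities, overprovision=(1.4, 1.4, 1.0)):
    """Per-priority load percentages plus a panic flag.

    priorities: ordered list of (healthy_count, total_count), priority 0 first.
    overprovision: per-priority factors (index-matched to priorities).
    """
    loads = [0] * len(priorities)
    remaining = 100
    total_weighted = 0.0
    for i, (healthy, total) in enumerate(priorities):
        health = healthy / total if total else 0.0
        weighted = min(1.0, health * overprovision[i])
        total_weighted += weighted
        # Each priority absorbs up to its weighted health; overflow cascades down.
        loads[i] = min(remaining, round(weighted * 100))
        remaining -= loads[i]
    if total_weighted == 0:
        return loads, True              # panic mode: caller routes uniformly instead
    if remaining > 0:                   # every priority degraded: scale up to 100
        scale = 100 / (100 - remaining)
        loads = [round(l * scale) for l in loads]
    return loads, False

print(priority_loads([(80, 80), (160, 160)]))   # healthy P0 keeps everything
print(priority_loads([(32, 80), (160, 160)]))   # 40% P0 health -> 56/44 split
```

The 40%-health case reproduces the two-priority formula from earlier: min(1, 0.4 × 1.4) = 0.56, so 56% stays at priority 0 and 44% overflows.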

Why topologyAwareHints was renamed topologyAwareRouting in K8s 1.27

The original topologyAwareHints (alpha 1.21) emitted hints on EndpointSlice resources that the kube-proxy was free to ignore. In practice it did not, but service-mesh proxies (Istio, Linkerd) sometimes did, leading to confusing behaviour where kubectl get endpointslices showed hints but traffic distribution didn't reflect them. K8s 1.27 renamed the feature to topologyAwareRouting (GA in 1.30) and made the contract explicit. The CPU-proportionality formula is unchanged: each zone gets a fraction of the EndpointSlices proportional to its share of allocatable CPU, with a minimum of 3 endpoints per zone or the feature is disabled.

The 2024 Cloudflare "Unimog 2" post-mortem

Cloudflare's edge LB Unimog had a locality-routing bug in early 2024, surfaced during a fibre outage in their Singapore PoP. The bug: the spill formula used the number of healthy backends, not their capacity. When a fibre cut took 60% of Singapore-internal connectivity offline, the local backend count dropped from 40 to 16 — had all traffic stayed local, each survivor would have had to absorb 2.5× its normal load. The formula computed health = 16/40 = 0.4, multiplied by 1.4, got weighted_health = 0.56, and spilled only 44% cross-PoP. The remaining 56% piled into the 16 surviving backends, each now running ~1.4× its normal load on hardware already near its ceiling. p99 went from 14 ms to 1.8 s in 90 seconds. The fix replaced healthy_count / total_count with healthy_capacity / steady_state_demand — measure capacity via per-backend RPS history, not by counting pods.
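
The difference between the two formulas, in the Singapore numbers — the per-backend RPS budget is an assumption, and the capacity-based variant here spills the unabsorbable excess directly, which is the spirit of the fix rather than Cloudflare's exact code:

```python
OVERPROVISION = 1.4

def spill_count_based(healthy: int, total: int) -> float:
    # The buggy formula: inflate the healthy *fraction* by the overprovision
    # factor, regardless of whether the survivors can actually absorb the load.
    return 1 - min(1.0, (healthy / total) * OVERPROVISION)

def spill_capacity_based(capacity_rps: float, demand_rps: float) -> float:
    # The fix's spirit: spill exactly the demand the surviving capacity
    # cannot absorb, measured from per-backend RPS history.
    return max(0.0, 1 - capacity_rps / demand_rps)

# 16 of 40 backends survive; assume 1000 RPS each against 40k RPS steady demand.
print(f"count-based:    {spill_count_based(16, 40):.0%} spilled")
print(f"capacity-based: {spill_capacity_based(16_000, 40_000):.0%} spilled")
```

Count-based spills 44% and overloads the survivors; capacity-based spills the full 60% that the surviving pool cannot serve.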

Reproduce this on your laptop

# Run the simulator from this article:
python3 -m venv .venv && source .venv/bin/activate   # optional; the script is stdlib-only
python3 locality_aware.py
# Expected (seed=7, 100k requests, OVERPROVISION=1.4):
#   health=1.00 → local_share=100.0% p99=0.27 ms
#   health=0.70 → local_share=97.9%  p99=0.31 ms
#   health=0.50 → local_share=69.8%  p99=1.32 ms

# Inspect Envoy's locality-weighted config for a pod in an Istio mesh
# (pod name is a placeholder; requires istioctl):
istioctl proxy-config cluster <pod-name> -o json | grep -i -A 5 locality

# Inspect Kubernetes topology-aware hints:
kubectl get endpointslices -n production -o jsonpath='{.items[*].endpoints[*].hints}'
# A populated `forZones` field means topology-aware routing is active.

Where this leads next

Locality-aware load balancing biases backend selection by zone topology. The natural sequels are sticky sessions (when caching makes locality alone insufficient), composition order (how P2C, ring-hash, and locality layer without fighting), and the failure-detection signals that feed the spill formula.

References

  1. Envoy — Locality weighted load balancing — the canonical implementation; spill formula and overprovision documented in detail.
  2. Kubernetes — Topology Aware Routing — the K8s-native equivalent; CPU-proportional hint generation.
  3. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" — NSDI 2016 — Google's L4 LB; locality is one input to the consistent-hash table fill.
  4. Dean & Barroso, "The Tail at Scale" — CACM 2013 — the foundational latency-tail paper; locality-aware routing is one of the techniques discussed for tail mitigation.
  5. Cloudflare blog — "Unimog: Cloudflare's edge load balancer" (2020) — production locality routing at edge; predecessor to the 2024 capacity-aware fix.
  6. Lyft engineering — "Envoy zone-aware routing" — the 2018 blog post that motivated Envoy's design; original numbers from Lyft's mesh.
  7. Power of two choices (P2C) — internal companion. The inner-loop algorithm running inside each priority bucket.
  8. Consistent hashing (ring, jump, Maglev) — internal companion. Composes with locality routing when key-affinity matters.