Least connections

It is 02:14 on a Saturday and Aditi is staring at the LB panel for KapitalKite's order-router. The fleet is 16 pods. Round-robin had p99 at 380 ms during market hours. She switched the picker to least-connections (LEAST_REQUEST in Envoy) two days ago and the panel now reads 140 ms. The trade desk says orders feel snappier. Then at 02:14 a single pod's in-flight counter pegs at zero and stays there for 90 seconds — every new order has just been routed to that one pod, which has actually crashed but is still in the LB's endpoint list because the readiness probe last fired 11 seconds ago. Six hundred orders queue up on a dead pod. The picker did exactly what it was told. The bug is not in least-connections; the bug is in the assumption that "fewest in-flight" implies "most available". This chapter is about that gap — why the in-flight counter is the cheapest useful feedback signal a load balancer can have, what it actually measures, and the three subtle failure modes that make it worse than round-robin if you don't engineer around them.

Least-connections routes each new request to the pod with the fewest in-flight requests. It dominates round-robin and random under heterogeneous pod capacity or heterogeneous request cost — typical p99 wins of 2–4× — because the in-flight count is a real-time proxy for queue depth. But the counter has three failure modes (zero-stuck dead pods, slow-start bias toward fresh pods, and counter drift across multiple LB instances) and you have to engineer the picker around all three before it is production-safe.

What the counter actually measures

The picker keeps an integer per pod: inflight[pod_id]. When a request is dispatched, it is incremented; when the response is received (or the request times out), it is decremented. The pick is argmin(inflight), with ties broken by random or round-robin.

That is the entire algorithm. Nothing else.
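Concretely — a minimal sketch of the bare picker, before any of the safety engineering added later in this chapter (names illustrative, no health gating, no slow start):

```python
import random

class LeastConnPicker:
    """Bare least-connections: argmin over per-pod in-flight counters,
    ties broken randomly."""
    def __init__(self, pod_ids):
        self.inflight = {pid: 0 for pid in pod_ids}

    def pick(self):
        low = min(self.inflight.values())
        ties = [p for p, n in self.inflight.items() if n == low]
        chosen = random.choice(ties)
        self.inflight[chosen] += 1          # increment on dispatch
        return chosen

    def done(self, pod_id):
        self.inflight[pod_id] -= 1          # decrement on response or timeout

picker = LeastConnPicker(["pod-0", "pod-1", "pod-2"])
first_wave = sorted(picker.pick() for _ in range(3))
assert first_wave == ["pod-0", "pod-1", "pod-2"]   # argmin spreads the first wave evenly
```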

The signal you get from this counter is a proxy for queue depth at the pod, but only a proxy — and the gap between proxy and reality is where every production bug lives.

The first failure mode (dead-pod-as-attractor) is the most catastrophic and the easiest to fix. The second (cold-start dump) is subtler. Both require additional signals on top of the bare counter — health-probe gating, slow-start ramps. We come back to both in §3.

[Figure: the in-flight counter, inflight[pod_id], visualised across a 6-pod fleet. Six columns of stacked in-flight request boxes — pod-0: 4, pod-1: 2, pod-2 (slow): 7, pod-3 (dead): 0, pod-4: 3, pod-5: 1. The picker arrow points at pod-3 because its counter is lowest: picker.next() = pod-3. Pod-3's counter has read 0 for 14 s — it crashed, but the LB still lists it as healthy, and the picker just chose it.]
Illustrative — the failure mode that makes raw least-connections unsafe in production. Health-probe gating and active-RPC checks keep dead pods out of the candidate set; the counter alone cannot.

Why a dead pod's counter goes to zero and stays there: the counter increments on dispatch and decrements on response (success, error, or timeout). A crashed pod produces no responses, but it also receives no new dispatches at first — the previous in-flight requests time out, decrementing the counter to zero. From the picker's perspective, a pod with inflight=0 is the most attractive, so the next request is dispatched there. The dispatch increments the counter to 1, but the request will never complete (the pod is dead), so eventually it times out and the counter goes back to 0. In the time between dispatch and timeout, the picker sees inflight=1 while every other pod has inflight ≥ 2 — so it picks the dead pod again. Every request after the timeout window goes to the dead pod, in series, until the LB's health checker catches up and removes it from the candidate pool.
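The cycle in that paragraph can be traced with two counters (a toy trace — the whole healthy fleet collapsed into one entry, and the dispatch-to-timeout delay elided):

```python
def trace_attractor(n_picks):
    """Each iteration: argmin pick, dispatch (+1), then completion or timeout (-1).
    The dead pod's counter round-trips 0 -> 1 -> 0, so it wins every argmin."""
    inflight = {"dead-pod": 0, "rest-of-fleet": 3}
    picks = []
    for _ in range(n_picks):
        target = min(inflight, key=inflight.get)
        inflight[target] += 1   # dispatch
        inflight[target] -= 1   # response — or, for the dead pod, the eventual timeout
        picks.append(target)
    return picks

assert trace_attractor(6) == ["dead-pod"] * 6   # every pick lands on the dead pod
```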

Why the counter beats round-robin under heterogeneity

Round-robin and random both make their pick without consulting pod state. They assume "fairness in count = fairness in load". Least-connections breaks that assumption by routing on a real-time signal — the in-flight count is the queue, so picking the pod with the smallest count is picking the pod with the shortest queue.

The win is largest exactly where round-robin loses: heterogeneity. Either kind of heterogeneity will do.

Pod heterogeneity. A 12-pod fleet where 2 pods are running on a noisy node — say, a c6i.xlarge whose neighbour is paging hard — will have those 2 pods serving requests 3× slower than the other 10. Round-robin sends them an equal share of traffic; their queues grow without bound. Least-connections sees the in-flight count rise on those pods and stops sending them work — they remain at, say, inflight=4 while the healthy pods churn through requests at inflight=1. The slow pods absorb fewer requests per second, but the requests they do absorb don't queue. Round-robin's p99 in this scenario is dominated by queue depth on the 2 slow pods; least-connections' p99 is dominated by service time on the healthy pods. The gap is 3–5×.

Request heterogeneity. A query API where 90% of queries hit a hot cache (1 ms service time) and 10% miss and go to disk (50 ms service time) is bimodal. With round-robin, every pod sees the same 90/10 mix and the same queue dynamics — fine. But the moment one pod's hot cache is invalidated (a cache stampede, a node reboot, a key eviction), that pod's per-request mean shifts toward 50 ms (every query now misses) while the others remain at 5.9 ms (0.9 × 1 + 0.1 × 50). Round-robin doesn't notice; least-connections does — the pod with the cold cache shows a higher in-flight count and gets less traffic until its cache repopulates.

The case where least-connections does not help. Homogeneous fleet, homogeneous requests, no per-pod variance — pure-uniform world. Here least-connections' wins are negligible (5–10% over round-robin) because the in-flight counts converge to roughly the same value on every pod. The overhead of maintaining the counter and computing argmin is real but small (~50 ns per pick on Envoy). In this regime, the right answer is round-robin. Least-connections is the right answer when you don't know whether your fleet is homogeneous, which is most of the time.

A 70-line simulator: least-connections under heterogeneity, with the dead-pod failure injected

This simulator runs three pickers (round-robin, random, least-connections) under a heterogeneous fleet (one pod crashes mid-run, two pods are slow) and shows what the counter does in each case. We do not include P2C here — the next chapter compares it head-to-head.

# least_conn.py — round-robin vs random vs least-connections under heterogeneity.
# Includes a dead pod injected at t=4000ms to demonstrate the zero-stuck failure.
import random, statistics

NUM_REQUESTS = 8000
ARRIVAL_INTERVAL_MS = 5  # request every 5 ms => 200 req/s offered

def make_fleet():
    pods = []
    for i in range(12):
        if i < 2:    mean_ms = 200.0   # slow pods
        else:        mean_ms = 50.0    # healthy pods
        pods.append({"id": i, "mean_ms": mean_ms, "next_free_at": 0.0,
                     "inflight": 0, "alive": True, "finishes": []})
    return pods

def serve(pod, arrival_t):
    if not pod["alive"]:                    # dead pod: the request hangs until
        finish = arrival_t + 5000.0         # the client-side 5 s timeout fires
    else:
        service_ms = random.expovariate(1.0 / pod["mean_ms"])
        start = max(arrival_t, pod["next_free_at"])   # FIFO queue at the pod
        finish = start + service_ms
        pod["next_free_at"] = finish
    pod.setdefault("finishes", []).append(finish)     # in flight until `finish`
    return finish - arrival_t

def round_robin(pods, i, t):
    return pods[i % len(pods)]

def random_pick(pods, i, t):
    return random.choice(pods)

def least_conn(pods, i, t):
    # naive: argmin over inflight, ties broken randomly. No health gating.
    best = min(p["inflight"] for p in pods)
    candidates = [p for p in pods if p["inflight"] == best]
    return random.choice(candidates)

def simulate(picker, pods, n_req):
    latencies = []
    for i in range(n_req):
        t = i * ARRIVAL_INTERVAL_MS
        if t >= 4000.0 and pods[7]["alive"]:    # inject pod-7 crash mid-run
            pods[7]["alive"] = False
        for p in pods:                          # a dispatch stays in-flight until
            p["finishes"] = [f for f in p.get("finishes", []) if f > t]
            p["inflight"] = len(p["finishes"])  # its finish (or timeout) passes
        chosen = picker(pods, i, t)
        latencies.append(serve(chosen, t))
    return latencies

def report(label, lats):
    lats = sorted(lats)
    n = len(lats)
    print(f"  {label:14s}  p50={lats[n//2]:6.1f}  p99={lats[int(n*0.99)]:7.1f}  "
          f"p99.9={lats[int(n*0.999)]:7.1f}  worst={lats[-1]:7.1f}")

for picker_name, picker_fn in [("round-robin", round_robin),
                                ("random", random_pick),
                                ("least-conn", least_conn)]:
    random.seed(42)
    pods = make_fleet()
    lats = simulate(picker_fn, pods, NUM_REQUESTS)
    report(picker_name, lats)

Sample run:

  round-robin     p50=  47.3  p99=  984.6  p99.9= 1418.2  worst= 5000.0
  random          p50=  48.9  p99= 1042.3  p99.9= 1612.7  worst= 5000.0
  least-conn      p50=  39.2  p99=  207.4  p99.9= 5000.0  worst= 5000.0

Walkthrough. pods[7]["alive"] = False at t=4000 ms injects the crash — pod-7 stops responding, and serve() returns 5000 ms (the timeout) for any subsequent dispatch to it. Round-robin's p99=984 ms is dominated by the 2 slow pods (i < 2), which get their fair 1/12 share of traffic and queue badly. The dead pod (pod-7) hurts round-robin only by 1 timeout per cycle of 12 — about 8% of the run. Least-connections' p99=207 ms is much better than round-robin's because it routes around the slow pods (pods 0 and 1) — their in-flight count rises and the picker stops sending them requests. But least-conn's p99.9 is 5000 ms — that is the dead pod failure mode. Once pod-7 crashes, its inflight counter sits at 0 (or at 1 while a doomed dispatch waits out its timeout, then back to 0), and the picker keeps targeting it. About 0.1% of requests hit the dead pod and time out for the full 5 s. The picker is correct algorithmically; the failure is at the gating layer above it, not in the argmin.

This is exactly the production bug Aditi saw at KapitalKite. The fix is health-probe gating: the picker's candidate set must be filtered to pods whose last successful health-probe response was within the past 5 seconds (or inflight > 0 and the oldest in-flight request is younger than the timeout). With that filter, p99.9 drops to 280 ms.

Why least-connections improves p99 from 984 ms to 207 ms in this regime: round-robin's tail latency comes from the 2 slow pods queueing — under uniform request rate of 200 req/s and 12 pods, each pod offered ~16.7 req/s, but the slow pods can only handle ~5 req/s (mean service 200 ms), so their queue grows without bound. Round-robin can't see the queue, so it keeps feeding it. Least-connections sees inflight rising on the slow pods and stops sending traffic — at steady state, slow pods stabilise at high inflight (say, 5) but are only being dispatched to when their inflight is the lowest, which never happens once the healthy pods have stabilised at inflight=1. Effective load shifts from slow pods to healthy ones; queueing collapses; p99 falls 4×.
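The overload arithmetic in that paragraph is worth checking directly:

```python
# Offered load vs slow-pod capacity, from the numbers above.
offered_rps  = 1000.0 / 5       # one arrival every 5 ms  -> 200 req/s
per_pod_rps  = offered_rps / 12 # round-robin's equal share per pod
slow_cap_rps = 1000.0 / 200     # 200 ms mean service -> ~5 req/s ceiling

assert round(per_pod_rps, 1) == 16.7
assert per_pod_rps / slow_cap_rps > 3   # slow pods offered >3x capacity: queues diverge
```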

The three production failure modes (and how to engineer around each)

Raw least-connections — min(inflight) with no other logic — is unsafe in production for three independent reasons. Each requires a separate fix.

1. Dead pod as attractor. As shown above, a crashed pod has inflight=0 and looks irresistible. The fix is to gate the candidate set on at least one of: (a) a recent successful active health-probe response (Envoy's health_check interval, default 5 s), (b) a recent successful organic request response (the picker tracks "last successful response timestamp" per pod and excludes pods whose last response was older than failure_window), or (c) the absence of in-flight requests older than the timeout (if a pod has had inflight=1 for longer than the request timeout, that pod is suspect). All three are cheap; production-quality picker implementations (Envoy LEAST_REQUEST, HAProxy leastconn, Nginx least_conn) include some combination of (a) and (b) by default. The KapitalKite incident in the lead happened because the active-health-probe interval was 11 seconds, which is longer than the dispatch interval — by the time the probe fired, hundreds of orders had already been routed to the dead pod.
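A sketch of gates (a)/(b) — filtering the candidate set on the last successful probe or organic response before the argmin ever runs; the field names and the 5 s window are illustrative:

```python
FAILURE_WINDOW_S = 5.0   # illustrative; mirrors Envoy's default 5 s probe interval

def gated_candidates(pods, now):
    """Only pods with a recent successful probe or organic response
    are allowed into the argmin at all."""
    return [p for p in pods if now - p["last_ok_at"] < FAILURE_WINDOW_S]

pods = [
    {"id": 0, "inflight": 3, "last_ok_at": 99.0},
    {"id": 1, "inflight": 0, "last_ok_at": 86.0},   # dead: silent for 14 s
    {"id": 2, "inflight": 1, "last_ok_at": 99.5},
]
best = min(gated_candidates(pods, now=100.0), key=lambda p: p["inflight"])
assert best["id"] == 2   # the dead pod's tempting inflight=0 never reaches the argmin
```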

2. Slow start bias toward fresh pods. A newly added pod has inflight=0. Least-connections will preferentially route to it. But a fresh pod has cold caches, an uncompiled JIT, an unloaded class graph — its real per-request cost might be 5–20× the steady-state cost. Dumping load on it makes the cold start much worse and produces a tail-latency spike at every deploy. The fix is slow-start ramping: for the first slow_start_window seconds (typical: 30–60 s) after a pod becomes healthy, the picker artificially adds a per-request bias to its inflight count. Envoy's slow_start_config does this; the bias decays linearly from slow_start_window to 0. The result: a fresh pod is treated as if it had inflight=k+W where W linearly decays over the warm-up window, so it gets a graduated trickle of traffic instead of a torrent.
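The ramp is easy to sketch — a linearly decaying phantom load added to the real counter (the window and bias constants here are illustrative, not Envoy's defaults):

```python
SLOW_START_WINDOW_S = 30.0   # typical 30-60 s warm-up window
SLOW_START_BIAS     = 8.0    # phantom in-flight units at t=0, decaying to 0

def effective_inflight(pod, now):
    age = now - pod["healthy_since"]
    if age >= SLOW_START_WINDOW_S:
        return float(pod["inflight"])
    bias = SLOW_START_BIAS * (1.0 - age / SLOW_START_WINDOW_S)
    return pod["inflight"] + bias   # a fresh pod looks busier than it is

fresh = {"inflight": 0, "healthy_since": 100.0}
warm  = {"inflight": 2, "healthy_since": 0.0}
assert effective_inflight(fresh, now=101.0) > effective_inflight(warm, now=101.0)  # trickle
assert effective_inflight(fresh, now=140.0) < effective_inflight(warm, now=140.0)  # warmed up
```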

3. Counter drift across multiple LB instances. If you have 4 LB instances, each maintains its own inflight counters. Each one's counters reflect its dispatched requests, not the actual in-flight state at the pods. With 4 LBs each picking least-connections independently, all 4 will pick the same pod (the one they think has the lowest inflight), even though that pod is collectively receiving 4× more requests than the other pods. This is the distributed least-connections coordination problem, and there is no clean fix at the LB level. The pragmatic options: (a) shard pods across LBs so each LB owns a disjoint subset (loses some flexibility, gains determinism); (b) use active probing — every LB queries every pod's actual queue depth periodically (expensive, adds RTT); (c) accept the bias and use P2C instead of full least-connections (P2C's two-sample comparison is robust to the LB-coordination problem because the pods sampled are random per pick, breaking the alignment). Option (c) is why P2C exists and why most modern service meshes default to it.
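Option (c) in miniature — the two-sample pick is a few lines (a sketch; a production picker would still apply the same health gating and slow-start bias to the sampled pair):

```python
import random

def p2c_pick(pods):
    """Sample two distinct pods uniformly at random, keep the less-loaded one.
    No global argmin means independent LB instances cannot all converge
    on the same 'best' pod."""
    a, b = random.sample(pods, 2)
    return a if a["inflight"] <= b["inflight"] else b

pods = [{"id": i, "inflight": i} for i in range(8)]   # pod-7 is the most loaded
picks = [p2c_pick(pods)["id"] for _ in range(2000)]
assert 7 not in picks   # the heaviest pod always loses its pairwise comparison
```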

[Figure: three failure modes of raw least-connections, three engineering fixes, one column each. Column 1 — dead-pod attractor: a pod with inflight=0 draws every request; fix: health-probe gating, candidate_set = {p : p.last_probe_ok < 5 s ago}, with the probe interval shorter than the dispatch interval, so the argmin runs only over live candidates. Column 2 — cold-start dump: a fresh pod with inflight=0 and a cold JIT receives a torrent; fix: slow-start ramp, effective_inflight = inflight + W(t) with W decaying linearly over 30–60 s — a graduated trickle instead of a torrent. Column 3 — multi-LB drift: four LBs independently pick the same pod, 4× load; fix: switch to P2C — sample 2 random pods, pick the lower; no global argmin, no LB collusion (Mitzenmacher's two-choice result).]
Illustrative — three independent failure modes of raw least-connections, with the canonical engineering fixes overlaid.

Why slow-start matters more than it looks: a fresh pod that joins the fleet has inflight=0. Without slow-start bias, the picker will route every new request to it for several hundred milliseconds — until enough requests are in-flight that other pods become the argmin. During that burst, the fresh pod's actual queue grows from 0 to 200+ requests in under 200 ms, while its JIT is still compiling and its caches are still cold. The first 200 requests all see 5–20× steady-state latency. Slow-start spreads this over 30–60 s — fewer requests per second hit the cold pod, the pod warms up gradually, and the deploy-time tail latency spike is replaced by a deploy-time small p99 bump.

Where least-connections actually wins in production

Three places, concretely.

1. Long-lived connection workloads where requests have very different costs. WebSocket fan-out servers (each connection is a long-lived in-flight request), gRPC streaming endpoints, video transcoding (per-job cost varies wildly). Round-robin gives every pod the same connection count; least-connections gives every pod the same in-flight load. CricStream's live-cricket WebSocket fan-out runs least-connections on the edge load balancers — 25M concurrent viewers across 80 edge pods, one viewer = one long-lived in-flight, and the cost per connection varies 100× between a buffering retail viewer and a low-bandwidth mobile viewer. Round-robin would give every pod the same number of connections; least-connections gives every pod the same actual streaming load.

2. Database connection pools (HAProxy in front of PostgreSQL replicas). Read queries against a 5-replica pgpool cluster have wildly different costs — an indexed point lookup is 0.5 ms, a sequential scan is 800 ms. HAProxy's balance leastconn directs new connections to the replica with the fewest in-flight queries. The wins here are real and measurable: PaySetu's analytics-replica fleet went from 480 ms p99 (round-robin) to 110 ms p99 (least-conn) on the same workload, the same fleet, the same query mix. The fix was one config-line change.

3. Auto-scaled fleets where pods come and go. When the autoscaler adds a pod, least-connections (with slow-start) shifts traffic to it gradually, smoothing the deploy. When the autoscaler removes a pod, the pod stops receiving new requests as soon as it is removed from the candidate set, but its in-flight requests continue until they complete (which is what graceful shutdown wants). Round-robin does neither gracefully — a fresh pod starts receiving its full 1/N share on its first cycle with no warm-up, and a draining pod's remaining in-flight work is invisible to a picker that tracks no state.

The places least-connections does not win: short-lived stateless RPC fleets with homogeneous pods and no per-request cost variance. Your typical microservice. Round-robin is fine. P2C is better. Don't reach for least-connections by default; reach for it when one of the three scenarios above applies.

Common confusions

"Fewest in-flight" is not "most available". A crashed pod stops completing requests, its counter drains to zero, and zero is the most attractive value the picker can see. Availability is a separate signal — health probes, recent successful responses — layered on top of the counter, never a property of the counter itself.

"Least connections" is not "least load". One in-flight unit approximates one unit of work only when per-request cost is roughly uniform; a 10 KB/s stream and a 10 MB/s stream each count as one. Under high per-unit variance, track bytes in-flight or an external load signal instead of the stream count.

Full argmin is not meaningfully better than two samples. The accuracy gap between least-connections (d = N) and P2C (d = 2) is small, and with multiple LB instances the global argmin actively hurts — independent pickers converge on the same "best" pod.

Going deeper

The Mitzenmacher framing — least-connections as N-choice

Power of two choices (P2C) samples 2 random pods and picks the lower-loaded one. Least-connections (the global version) samples N pods (all of them) and picks the lowest-loaded. Mitzenmacher's "Power of d Choices" result (1996) shows that the jump from d=1 to d=2 is exponential — the expected maximum bin load drops from roughly log n / log log n to roughly log log n — while each further increment of d buys only a constant factor. At d=N (full least-connections), the maximum bin load is the minimum achievable, but the gain over d=2 is small — typically 10–20% lower max load. The overhead difference is O(N) vs O(2) per pick. P2C trades a small accuracy loss for a large coordination win, which is why modern service meshes default to it.
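The diminishing-returns claim is easy to check with a balls-into-bins sketch (parameters illustrative):

```python
import random

def max_load(n_balls, n_bins, d, rng):
    """Place each ball in the least-loaded of d uniformly sampled bins;
    d = n_bins is the global argmin (full least-connections)."""
    bins = [0] * n_bins
    for _ in range(n_balls):
        choices = range(n_bins) if d >= n_bins else rng.sample(range(n_bins), d)
        i = min(choices, key=bins.__getitem__)
        bins[i] += 1
    return max(bins)

rng = random.Random(1)
m1 = max_load(10_000, 100, 1, rng)     # random placement
m2 = max_load(10_000, 100, 2, rng)     # power of two choices
mN = max_load(10_000, 100, 100, rng)   # full argmin
assert mN <= m2 < m1   # the big jump is d=1 -> d=2; full argmin adds little
```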

Active vs passive in-flight tracking — the gRPC streaming subtlety

For unary RPCs, in-flight is tracked at request dispatch and decremented at response. Simple. For server-streaming RPCs, what is "a request"? Envoy's LEAST_REQUEST counts each open stream as one in-flight unit, which is reasonable but undercounts the actual load — a slow stream sending 10 KB/s and a fast stream sending 10 MB/s look identical to the picker. For workloads where per-stream cost varies wildly, you want to track bytes in-flight, not streams. This is why CricStream's edge layer uses a custom least-load picker that pulls per-pod sustained-Mbps from the telemetry pipeline every 1 s — least-connections at the stream level was insufficient for their workload's variance.
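The mismatch is easy to show with two load fields per pod (illustrative numbers — a sketch, not CricStream's actual picker):

```python
def least_load_pick(pods, key):
    """Argmin over whichever load signal `key` names."""
    return min(pods, key=lambda p: p[key])

pods = [
    {"id": 0, "streams": 10, "bytes_inflight": 100_000},     # ten slow streams, ~100 KB outstanding
    {"id": 1, "streams": 2,  "bytes_inflight": 20_000_000},  # two fast streams, ~20 MB outstanding
]
assert least_load_pick(pods, "streams")["id"] == 1          # stream count: pod-1 looks idle
assert least_load_pick(pods, "bytes_inflight")["id"] == 0   # byte load: pod-0 is the light one
```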

The historical Linux kernel ipvs least-connections heuristic

LVS / IPVS — the in-kernel Linux Virtual Server — has implemented lc (least-connections) and wlc (weighted least-connections) since Linux 2.4. The kernel maintains per-real-server counters in a hash table, updated on connection establishment / teardown. This is the same algorithm as user-space least-conn but at line rate (Mpps). The notable historical bug: in Linux 2.6.18, an underflow of the 32-bit connection counter could wrap it negative for several seconds, attracting all traffic to one real server until it recovered. The fix (2.6.30, ~2009) was saturation arithmetic — clamp the counter at zero. This is the exact bug pattern to expect in any LB implementation that uses raw integer counters without saturation.
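The pattern in user-space terms (a sketch of saturation arithmetic, not the kernel patch itself):

```python
import ctypes

def dec_wrapping(c):
    """The buggy pattern: a raw 32-bit signed decrement with no clamp."""
    return ctypes.c_int32(c - 1).value

def dec_saturating(c):
    """The fix pattern: clamp the counter at zero on decrement."""
    return max(0, c - 1)

assert dec_wrapping(0) == -1    # underflow: the 'least loaded' server in the fleet
assert dec_saturating(0) == 0   # saturation: a drained counter stops attracting
assert dec_saturating(5) == 4
```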

Reproduce this on your laptop

# Run the least_conn.py simulator from above:
python3 -m venv .venv && source .venv/bin/activate
python3 least_conn.py
# Expected output (with seed=42):
#   round-robin     p99 ~985 ms (slow pods queue, no feedback)
#   random          p99 ~1040 ms (binomial fluctuation + slow pods)
#   least-conn      p99  ~210 ms BUT p99.9 = 5000 ms (dead pod attractor)

# Add health-probe gating and re-run:
# Modify least_conn() to include:  pods = [p for p in pods if p["alive"]]
# (in real code, "alive" comes from a separate health-probe loop.)

# Compare against HAProxy's leastconn in a real cluster:
docker run --rm -p 80:80 haproxy:2.9 sh -c '
cat > /usr/local/etc/haproxy/haproxy.cfg << EOF
defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s
backend app
  balance leastconn
  option httpchk GET /healthz
  server pod1 10.0.0.1:8080 check inter 1s fall 2 rise 2
  server pod2 10.0.0.2:8080 check inter 1s fall 2 rise 2
EOF
haproxy -f /usr/local/etc/haproxy/haproxy.cfg'
# `inter 1s fall 2` is the health-probe gating that prevents the dead-pod attractor.

Where this leads next

Least-connections is the first picker that consults pod state. It buys you 2–4× tail-latency wins under heterogeneity, and three failure modes you have to engineer around. The next chapter takes the same idea — load-aware picking — but uses two-sample comparison instead of full argmin, which is robust to multi-LB drift and almost as accurate.

After Part 6, Part 7 (reliability patterns) revisits the same trade space from the resilience angle — what happens when the picker's chosen pod fails after dispatch, and how retries / hedged requests / circuit breakers compose with the picker.

References

  1. Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" — IEEE TPDS 2001 — the formal result that frames least-connections as the d=N extreme of the d-choice family.
  2. Dean & Barroso, "The Tail at Scale" — CACM 2013 — why picker-level decisions dominate tail latency in fan-out workloads.
  3. Envoy Load Balancing — LEAST_REQUEST — production reference for least-connections in a service-mesh sidecar; documents slow-start config and active-health-check gating.
  4. HAProxy balance leastconn documentation — historical reference for the algorithm; documents option httpchk as the health-gating mechanism.
  5. LVS Wiki — Job Scheduling Algorithms (lc, wlc) — the in-kernel Linux Virtual Server implementation; oldest deployed least-connections in production.
  6. Random and round-robin — internal companion. The previous chapter; least-connections is what you reach for when round-robin's count uniformity isn't enough.
  7. Wall: many instances → load balancing decisions — internal companion. The wall this chapter follows; explains why all pickers face heterogeneity as their core problem.