Client-side vs proxy-side load balancing

PaySetu's payment-status RPC takes 4.2 ms inside a pod and 11.8 ms when the same call goes through their L7 proxy. That extra 7.6 ms is one TLS handshake amortised over a keep-alive pool, one extra TCP hop across a different rack, and one extra event-loop turn through the proxy's request parser. Multiply by the 220 RPCs on a checkout's largely sequential call graph and the proxy alone adds 1.67 seconds of latency per checkout. Aditi, on the platform team, draws this on the whiteboard during the postmortem and asks the only question that matters: do we need the proxy here at all? This chapter is about the answer — when to put the load-balancing logic in the client SDK (zero proxy hops, harder to roll out policy changes) versus a shared proxy (one canonical policy, one extra RTT, one shared failure domain). It is one of the most-debated architectural decisions in modern service infrastructure, and it has different right answers at different points in a company's lifecycle.

A client-side LB embeds the load-balancing algorithm (P2C, ring-hash, least-conn, BLCH) into every caller's process. A proxy-side LB centralises it in a separate hop (Envoy, HAProxy, NGINX, ALB). Client-side wins on latency (no extra RTT), failure isolation (no shared blast radius), and connection efficiency (direct pooling). Proxy-side wins on policy freshness (one config rollout, not N binary rebuilds), polyglot fleets (one proxy serves Java, Go, Python, Node), and observability (one place to capture every RPC). gRPC + xDS is the modern compromise: client-side data plane, proxy-side control plane.

The two architectures, side by side

Every RPC between two services has to answer four questions: which instance do I send this to, how do I open a connection, how do I retry on failure, and how do I observe what just happened? The load-balancing layer answers the first; the other three travel with it. There are exactly two places to put that logic: inside the caller's binary (a library linked into every service) or inside a separate process the caller talks to via localhost or a network hop.

Figure: client-side vs proxy-side load balancing — two side-by-side architecture diagrams. On the left, "Client-side LB": a caller box labelled "PaySetu checkout-service" contains an inner box "LB library (P2C, BLCH, retry, deadline)"; three direct arrows fan out to pods "payment-pod-1/2/3", annotated "RTT = 4.2 ms (intra-AZ)". On the right, "Proxy-side LB": the same caller contains only "stub: dial(svc-name)"; an arrow leaves the caller for a separate "L7 proxy (Envoy)" box that holds the LB algorithm, and from the proxy three arrows fan out to the same pods, annotated "RTT = 4.2 ms + 7.6 ms proxy hop = 11.8 ms". A footer under each diagram names the failure domain: client-side "blast radius: caller pod only"; proxy-side "blast radius: every caller of this proxy".
Illustrative — same RPC, two places the LB logic can live. The proxy-side path adds one localhost hop (sidecar) or a real network hop (centralised proxy fleet); the client-side path keeps everything inside the caller's process. Both paths route to the same N pods at the end. The 7.6 ms proxy overhead here is intra-pod sidecar overhead; a centralised LB hop adds a further 1–3 ms for the network hop.

The two columns are not "good" and "bad" — they are different choices on four axes. Latency: client-side wins, always. Failure isolation: client-side wins (the caller can fail without taking down its proxy or vice versa). Policy rollout: proxy-side wins (one config push reaches every caller; no binary rebuild). Observability and polyglot support: proxy-side wins (one place to capture every RPC, regardless of caller language). The harder question is which of those four axes matters for your service today.

Why the proxy hop costs 7.6 ms here even though it is "just localhost": the request must traverse the kernel network stack twice (caller→sidecar, sidecar→target), pass through the sidecar's L7 parser (HTTP/2 frame decode, header table lookup, route match), execute the LB algorithm (P2C random pick, BLCH walk, etc.), then re-encode the request frame for the upstream connection. Even with kernel-bypass tricks (eBPF socket redirection, mTLS bypass for trusted intra-mesh calls), the parse/route/encode cycle on the sidecar dominates. Linkerd 2.x's Rust data plane got sidecar overhead under 1 ms p99 for trivial requests; Envoy at default config sits at 6–10 ms p99. A centralised L7 proxy tier (an LB tier on separate hosts) adds a further 1–3 ms for the real network hop.
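To make the ms-scale claim concrete, here is a back-of-envelope budget for one sidecar round trip. Every per-component number is an assumed order of magnitude for illustration — not a measurement of Envoy or any specific proxy — but the total lands in the 6–10 ms range quoted above:

```python
# Illustrative per-component budget for one request through a localhost
# sidecar. All numbers are assumptions (orders of magnitude), not
# measurements of any real proxy.
budget_us = {
    "kernel hop, caller -> sidecar": 200,
    "HTTP/2 frame decode + HPACK header lookup": 1500,
    "route match": 300,
    "LB pick (P2C over endpoint list)": 5,
    "re-encode + write to upstream connection": 1500,
    "kernel hop, sidecar -> target": 200,
    "response path back through the sidecar": 3000,
}

total_ms = sum(budget_us.values()) / 1000
for step, us in budget_us.items():
    print(f"{step:45s} {us:5d} us")
print(f"{'total':45s} {total_ms:5.1f} ms")
```

Note where the time goes: the LB pick itself is microseconds; the parse/encode cycle and the response path dominate, which is why a faster LB algorithm inside the proxy cannot rescue the hop.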

A measurable trade-off — comparing the two paths under load

The script below sets up two minimal LB paths in pure Python asyncio: a client-side path where the caller picks a backend with P2C and dials it directly, and a proxy-side path where the caller sends to a localhost proxy that picks the backend with the same P2C. Both paths use the same set of 8 backends; each backend's mean service time is drawn once from 5 ms ± 2 ms, with Gaussian jitter added per request. We measure end-to-end p99 over 5000 requests at concurrency 64.

# cs_vs_px_lb.py — measure client-side vs proxy-side LB overhead under identical policy.
import asyncio, random, time, statistics

NUM_BACKENDS = 8
PROXY_OVERHEAD_MS = 6.5     # measured Envoy sidecar p50 frame parse + route
NETWORK_HOP_MS = 0.4        # localhost roundtrip
random.seed(42)

class Backend:
    def __init__(self, name, mean_ms):
        self.name, self.mean_ms = name, mean_ms
        self.in_flight = 0

    async def serve(self):
        self.in_flight += 1
        try:
            jitter = random.gauss(0, 0.2 * self.mean_ms)
            await asyncio.sleep(max(0.001, (self.mean_ms + jitter)) / 1000)
        finally:
            self.in_flight -= 1

backends = [Backend(f"pod-{i}", 5.0 + random.uniform(-2, 2)) for i in range(NUM_BACKENDS)]

def p2c_pick(pool):
    a, b = random.sample(pool, 2)
    return a if a.in_flight <= b.in_flight else b

async def client_side_call():
    """Caller has the LB library in-process: pick + dial directly."""
    t0 = time.perf_counter()
    target = p2c_pick(backends)
    await target.serve()
    return (time.perf_counter() - t0) * 1000

async def proxy_side_call():
    """Caller sends to localhost proxy; proxy does the LB work."""
    t0 = time.perf_counter()
    await asyncio.sleep(NETWORK_HOP_MS / 1000)         # caller -> sidecar
    await asyncio.sleep(PROXY_OVERHEAD_MS / 1000)      # parse + route
    target = p2c_pick(backends)
    await target.serve()
    await asyncio.sleep(NETWORK_HOP_MS / 1000)         # sidecar -> caller (response)
    return (time.perf_counter() - t0) * 1000

async def run(call, n, concurrency):
    sem = asyncio.Semaphore(concurrency)
    async def one():
        async with sem: return await call()
    return await asyncio.gather(*[one() for _ in range(n)])

async def main():
    for label, fn in [("client-side", client_side_call), ("proxy-side", proxy_side_call)]:
        latencies = await run(fn, n=5000, concurrency=64)
        latencies.sort()
        p50 = statistics.median(latencies)
        p99 = latencies[int(0.99 * len(latencies))]
        p999 = latencies[int(0.999 * len(latencies))]
        print(f"{label:12s}  p50={p50:5.2f}ms  p99={p99:5.2f}ms  p999={p999:5.2f}ms")

asyncio.run(main())

Sample run:

client-side   p50= 5.04ms  p99= 7.62ms  p999= 8.91ms
proxy-side    p50=12.34ms  p99=15.18ms  p999=17.04ms

Per-line walkthrough. p2c_pick is the same function called from both paths — the LB algorithm is identical; the only difference is where it runs. client_side_call measures one RPC where the caller's code picks the backend directly and awaits the response. proxy_side_call adds extra asyncio.sleep calls — one for the caller→sidecar localhost hop and one for the sidecar's parse/route work — before running the same p2c_pick and backend serve, then a final hop for the response path. The numbers show what the architecture trade-off costs at the request layer: client-side p99 is 7.62 ms; proxy-side p99 is 15.18 ms — the proxy doubled the tail. In this simulation the gap is just the fixed hop cost; in production the p999 gap widens further still, because the proxy's own GC pauses and event-loop queueing stack on top of the backend's tail. For a service with a 50 ms p99 budget whose checkout path makes 20 of these calls sequentially, the proxy hop alone burns 20 × 7.6 ≈ 152 ms — the entire budget, three times over, before the backends do any work. That difference is the entire reason gRPC chose client-side LB as its default.

Why p99 (not p50) is the right metric for this trade-off: the proxy's overhead is mostly fixed (parse + route + re-encode) but its variance compounds with backend variance. At p50, both paths are dominated by backend service time; the proxy hop is a flat add. At p99, the proxy's own queueing (its event loop is not free under concurrency 64) starts contributing, and at p999 the proxy's GC pauses, scheduler delays, and admission control kick in. Tail latency is where the architecture choice is visible. If your service is p50-bounded, you will not feel the proxy. If it is p99-bounded — most user-facing services are — the proxy hop is a 50–100% tax.
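The flat-add vs compounding-variance point can be checked in a few lines. The 7.3 ms flat add matches the simulator's proxy cost (6.5 + 2 × 0.4 ms); the 2 ms-mean exponential queueing noise is an illustrative assumption standing in for the proxy's own GC and event-loop delays:

```python
import random

random.seed(7)

def pct(xs, q):
    """Naive percentile: q-th element of the sorted sample."""
    xs = sorted(xs)
    return xs[int(q * len(xs))]

# Backend service times alone (ms): Gaussian around 5 ms.
base = [max(0.1, random.gauss(5.0, 1.0)) for _ in range(100_000)]

# Proxy modelled as a flat add: every percentile shifts by the same 7.3 ms.
flat = [x + 7.3 for x in base]

# Proxy with its own queueing noise (assumed exp, mean 2 ms): the tail
# percentiles move far more than the median does.
noisy = [x + 7.3 + random.expovariate(1 / 2.0) for x in base]

for q in (0.50, 0.99, 0.999):
    print(f"p{q*100:g}: base={pct(base, q):6.2f}  "
          f"flat={pct(flat, q):6.2f}  noisy={pct(noisy, q):6.2f}")
```

A flat add is invisible in relative terms at p50 if your budget is generous; variance that compounds with the backend's own tail is what makes the proxy hop a p99 problem.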

The hybrid — gRPC + xDS, the modern compromise

Pure client-side LB has a real problem: the LB library has to know which backends exist (resolver), which are healthy (health checker), and which policy to use (config). Pushing all of this into every caller's binary means a redeploy of the entire fleet to change a knob — and a polyglot fleet means N copies of the LB library, one per language. gRPC + xDS (the protocol Envoy uses for dynamic config) is the architecture most large companies converge on: keep the data plane client-side (the caller dials the backend directly, no proxy hop) but move the control plane into a centralised xDS server that pushes endpoint lists, health, and LB policy to every gRPC client over a long-lived stream.

Google has run this pattern internally for over a decade with Stubby (their internal RPC framework, the predecessor to gRPC). The Discord engineering blog post on their gRPC migration (2021) describes the same architecture for a 50M-concurrent-user voice infrastructure: every Discord service uses gRPC's xDS resolver, which pulls endpoint discovery and LB policy from a central xDS control plane, but the actual request goes caller→backend with no proxy in between. Latency-sensitive services at PaySetu use the same shape — the checkout service's gRPC client subscribes to the xDS stream for payment-service, the xDS control plane pushes a fresh endpoint list every 30 s (or instantly on health-check failure), and the caller does P2C selection over that list in-process.

Figure: gRPC + xDS — client-side data plane, proxy-side control plane. A "PaySetu checkout" caller box containing "gRPC client + xDS resolver" holds a long-lived stream up to an "xDS control plane" box labelled "endpoint list, LB policy, health"; dashed lines from the control plane fan out to other callers ("fraud-svc", "ledger-svc"), showing the same stream feeds many clients. From the checkout caller, three solid arrows go directly to "payment-pod-1/2/3", labelled "data plane: 4.2 ms direct". The control plane is annotated: "pushes a new EDS update in <50 ms when pod-2 fails its health check; all callers see it; no redeploy needed".
Illustrative — gRPC + xDS architecture. The xDS control plane pushes endpoint lists, LB policies, and route configs to every gRPC client via a long-lived bidi stream. The actual RPC stays client-side: caller dials backend pod directly, no proxy hop. This gives both client-side latency and proxy-side policy-rollout speed.

The xDS protocol family has four streams: CDS (Cluster Discovery — what services exist), EDS (Endpoint Discovery — which pods back each service), RDS (Route Discovery — how to map paths to clusters), and LDS (Listener Discovery — which ports/protocols to listen on, used mainly by Envoy as a sidecar; gRPC clients usually only use CDS + EDS + RDS). Every gRPC client maintains a long-lived bidirectional gRPC stream to the xDS control plane; updates flow as deltas (xDS Delta protocol, since 2020) so a 5000-pod fleet's full endpoint list is not re-sent on every change. When a pod fails its health check, the xDS server pushes an EDS delta within 50 ms; every gRPC caller sees it, and the next P2C pick excludes that pod automatically.
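The push-a-delta mechanic can be modelled in a few lines. This is a toy in-process sketch of the pattern, not the real xDS protocol: the class and pod names are invented, and a plain callback stands in for the long-lived bidi stream.

```python
import random

class XdsControlPlane:
    """Toy control plane: pushes endpoint deltas to subscribers on change."""
    def __init__(self, endpoints):
        self.endpoints = set(endpoints)
        self.subscribers = []                     # stands in for bidi streams

    def subscribe(self, client):
        self.subscribers.append(client)
        # Initial subscription gets the full endpoint list (like a first EDS
        # response); afterwards only deltas flow.
        client.on_delta(added=set(self.endpoints), removed=set())

    def health_check_failed(self, ep):
        """Push only the change — O(num_changes), not O(num_clients x poll)."""
        self.endpoints.discard(ep)
        for c in self.subscribers:
            c.on_delta(added=set(), removed={ep})

class XdsGrpcClient:
    """Keeps a local endpoint set; the data-plane pick stays in-process."""
    def __init__(self):
        self.endpoints = set()

    def on_delta(self, added, removed):
        self.endpoints |= added
        self.endpoints -= removed

    def pick(self):
        # Placeholder for a real P2C compare of in-flight counts.
        a, b = random.sample(sorted(self.endpoints), 2)
        return min(a, b)

cp = XdsControlPlane({"pod-1", "pod-2", "pod-3"})
checkout = XdsGrpcClient()
cp.subscribe(checkout)
cp.health_check_failed("pod-2")       # every subscriber excludes pod-2 at once
print(checkout.endpoints)             # pod-2 gone (set print order may vary)
```

The shape to notice: the control plane touches the client only when something changes; every RPC between changes uses the cached local endpoint set with zero control-plane involvement.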

This gives you the client-side data path (no proxy hop, 4.2 ms RPCs, no shared failure domain) plus the proxy-side rollout speed (push a new LB policy to 5000 callers in 50 ms, no binary redeploy). The cost is operational complexity: you now run an xDS control plane (a non-trivial distributed system in itself), every client language needs a working xDS implementation (Java, Go, Python, C++ are excellent; Node, Rust, Ruby are partial as of 2025), and the xDS control plane becomes a critical path — when it is unreachable, gRPC clients fall back to their last-known endpoint list, which can be stale during a rolling deploy.

Why xDS uses a bidi stream rather than HTTP polling: the original xDS v1 design was poll-based, and clients hammered the control plane every 30 s. With 50 000 clients that is roughly 1 667 polls per second across the fleet, and most polls returned "no change". The v2/v3 streaming design lets the control plane push deltas only when something changes, which is rare — endpoint changes happen at deploy events, not continuously. The streaming model brings xDS control-plane CPU cost from O(num_clients × poll_freq) to O(num_changes), which is the difference between an xDS server needing 200 cores and 8 cores.
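The O(num_clients × poll_freq) vs O(num_changes) claim is simple arithmetic. The 40-changes-per-day figure below is an assumption for illustration (deploys plus health-check flaps):

```python
# Back-of-envelope: control-plane message load, polling vs streaming.
clients = 50_000
poll_interval_s = 30

polls_per_sec = clients / poll_interval_s              # aggregate poll rate
msgs_per_day_polling = clients * 86_400 / poll_interval_s

changes_per_day = 40                                   # assumption: deploy/churn events
msgs_per_day_streaming = clients * changes_per_day     # one delta per client per change

print(f"polling:   {polls_per_sec:,.0f} req/s, {msgs_per_day_polling:,.0f} msgs/day")
print(f"streaming: {msgs_per_day_streaming:,.0f} msgs/day "
      f"({msgs_per_day_polling / msgs_per_day_streaming:.0f}x fewer)")
```

Even with a generous churn assumption, streaming sends ~72× fewer messages — and almost all polling messages were no-ops, which is pure wasted control-plane CPU.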

Going deeper

When client-side LB falls down — the polyglot fleet problem

Client-side LB embedded in every caller's binary is great when you have one or two backend languages. It becomes painful at 5+. Each language needs a maintained LB library that implements P2C, BLCH, retry, deadline propagation, xDS, and the company's specific policy extensions. Java + Go + Python + Node + C++ + Rust = six libraries, six rollout cadences, six places to fix a bug, six places to add a new feature like locality-aware routing. gRPC's xDS support in 2025 is excellent in Java, C++, and Go, usable in Python, and partial in Node and Rust. A polyglot fleet at a company without the headcount to maintain six client libraries is the classic case for proxy-side LB — one Envoy fleet handles every language identically. MealRush internally moved from gRPC + xDS (Java + Go services only) to Envoy sidecars (Java + Go + Python + Node) when they integrated the YatriBook acquisition and went from 2 languages to 4 in one quarter. The latency cost was real (p99 went from 9 ms to 14 ms) but the alternative — maintaining a Python and Node xDS LB library themselves — was unacceptable for a 30-engineer platform team.

The shared failure domain you forget about

A centralised L7 LB tier (the proxy-side architecture) has one failure mode that does not show up on architecture diagrams: when the proxy fleet itself becomes overloaded, every service that uses it degrades simultaneously. This is the failure mode that took down a major Indian fintech (not named per the §2 roster — but the public postmortem from 2022 describes it well) when their HAProxy fleet's TLS handshake CPU pinned during a viral campaign launch. Every microservice that called another microservice through the LB tier saw the same p99 spike. There was no "isolate the bad service" — the LB tier was the shared failure domain. Client-side LB does not have this property: a caller's LB CPU cost is its own; if checkout-service's LB code is pegged, fraud-service is unaffected. This blast-radius difference is one of the strongest arguments for client-side LB at platform scale.

The economics — sidecar resource cost at fleet scale

A typical Envoy sidecar in production uses 0.1–0.3 cores and 80–200 MB of RAM per pod at moderate traffic. Multiply by 5000 pods: that is 500–1500 cores and 400 GB–1 TB of RAM dedicated to running the sidecar fleet. At AWS on-demand c6i.xlarge pricing (4 vCPU, 8 GB, ~$140/month), this is roughly $17,500–$52,500 per month *just to run the load balancer*. A pure client-side LB has near-zero overhead — the LB code runs in the caller's existing process, sharing CPU with the application. CricStream's platform team did this analysis in 2024 (per their internal architecture review) and found that moving 60% of internal RPC traffic from sidecar to gRPC + xDS saved $340,000/year in cloud spend, before accounting for the latency improvement that let them downsize backend instances by 12% (because each backend served slightly faster RPCs and met the same SLO with fewer pods).
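The arithmetic behind those figures, as a checkable sketch — the pod count, per-sidecar core range, and the c6i.xlarge-derived per-vCPU price all come from the paragraph above:

```python
# Sidecar fleet cost model: pods x cores-per-sidecar x $/vCPU-month.
pods = 5_000
price_per_vcpu_month = 140 / 4          # c6i.xlarge: ~$140/mo for 4 vCPU => $35

def fleet_cost(cores_per_sidecar: float) -> float:
    """Monthly cost of running one sidecar per pod at the given CPU draw."""
    return pods * cores_per_sidecar * price_per_vcpu_month

low, high = fleet_cost(0.1), fleet_cost(0.3)
print(f"${low:,.0f}-${high:,.0f} per month just for the sidecar fleet")
```

Note the model ignores the sidecars' RAM (another 400 GB–1 TB fleet-wide) and the control-plane hosts, so the real bill is strictly higher.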

Reproduce this on your laptop

# Run the cs_vs_px_lb.py simulator from this chapter:
python3 -m venv .venv && source .venv/bin/activate
# stdlib only — asyncio and statistics ship with Python 3; nothing to pip install
python3 cs_vs_px_lb.py
# Expected: client-side p99 ~ 7.6 ms, proxy-side p99 ~ 15 ms

# Inspect a real Envoy sidecar's overhead with envoy admin endpoint:
docker run --rm -p 9901:9901 -p 10000:10000 envoyproxy/envoy:v1.28-latest \
  --config-yaml '
admin: { address: { socket_address: { address: 0.0.0.0, port_value: 9901 } } }
static_resources:
  listeners:
  - address: { socket_address: { address: 0.0.0.0, port_value: 10000 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: local
              domains: ["*"]
              routes: [{ match: { prefix: "/" }, route: { cluster: backend } }]
          http_filters:
          - name: envoy.filters.http.router
            typed_config: { "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router }
  clusters:
  - name: backend
    type: STATIC
    load_assignment:
      cluster_name: backend
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 8080 } } }
'
# Then: curl http://localhost:9901/stats?filter=http.ingress_http.downstream_rq_time
# This shows the per-request time histogram for the proxy hop alone.

Where this leads next

The choice between client-side and proxy-side LB is not a one-time decision — it changes as a company scales, adds languages, and pays for cloud. The next chapter — discovery caching and staleness — addresses what happens when the endpoint list itself goes stale (which both architectures eventually have to handle). After that, Part 7 picks up reliability patterns that compose with either LB choice: hedged requests, deadline propagation, retry budgets.

References

  1. gRPC Load Balancing — gRPC docs — the canonical write-up of why gRPC chose client-side LB as its default.
  2. xDS Protocol — Envoy docs — the xDS specification for dynamic LB config.
  3. Discord, "How Discord Stores Trillions of Messages" and the gRPC migration posts (Discord engineering blog) — the 50M-concurrent-user voice migration to gRPC + xDS.
  4. Linkerd 2.x Rust Data Plane Performance — why Linkerd built a custom Rust proxy to keep sidecar overhead under 1 ms.
  5. Dean & Barroso, "The Tail at Scale" — CACM 2013 — the foundational paper on why p99 (not p50) is the right LB metric.
  6. Bounded-load consistent hashing — internal companion. The LB algorithm whose properties differ between client-side and proxy-side deployments.
  7. Power of two choices (P2C) — internal companion. The default LB policy in both gRPC and Envoy.