Client-side vs proxy-side load balancing
PaySetu's payment-status RPC takes 4.2 ms inside a pod and 11.8 ms when the same call goes through their L7 proxy. That extra 7.6 ms is one TLS handshake amortised over a keep-alive pool, one extra TCP hop across a different rack, and one extra event-loop turn through the proxy's request parser. Multiply by the 220 RPCs in a checkout's fan-out call graph and the proxy alone accounts for 1.67 seconds of cumulative latency per checkout. Aditi, on the platform team, draws this on the whiteboard during the postmortem and asks the only question that matters: do we need the proxy here at all? This chapter is about the answer — when to put the load-balancing logic in the client SDK (zero proxy hops, harder to roll out policy changes) versus a shared proxy (one canonical policy, one extra RTT, one shared failure domain). It is one of the most-debated architectural decisions in modern service infrastructure, and it has different right answers at different points in a company's lifecycle.
A client-side LB embeds the load-balancing algorithm (P2C, ring-hash, least-conn, BLCH) into every caller's process. A proxy-side LB centralises it in a separate hop (Envoy, HAProxy, NGINX, ALB). Client-side wins on latency (no extra RTT), failure isolation (no shared blast radius), and connection efficiency (direct pooling). Proxy-side wins on policy freshness (one config rollout, not N binary rebuilds), polyglot fleets (one proxy serves Java, Go, Python, Node), and observability (one place to capture every RPC). gRPC + xDS is the modern compromise: client-side data plane, proxy-side control plane.
The two architectures, side by side
Every RPC between two services has to answer four questions: which instance do I send this to, how do I open a connection, how do I retry on failure, and how do I observe what just happened? The load-balancing layer answers the first; the other three travel with it. There are exactly two places to put that logic: inside the caller's binary (a library linked into every service) or inside a separate process the caller talks to via localhost or a network hop.
The two columns are not "good" and "bad" — they are different choices on four axes. Latency: client-side wins, always. Failure isolation: client-side wins (the caller can fail without taking down its proxy or vice versa). Policy rollout: proxy-side wins (one config push reaches every caller; no binary rebuild). Observability and polyglot support: proxy-side wins (one place to capture every RPC, regardless of caller language). The harder question is which of those four axes matters for your service today.
Why PaySetu's proxy hop costs 7.6 ms even though the sidecar is "just localhost": the request must traverse the kernel network stack twice (caller→sidecar, sidecar→target), pass through the sidecar's L7 parser (HTTP/2 frame decode, header table lookup, route match), execute the LB algorithm (P2C random pick, BLCH walk, etc.), then re-encode the request frame for the upstream connection. Even with kernel-bypass tricks (eBPF socket redirection, mTLS bypass for trusted intra-mesh calls), the parse/route/encode cycle on a sidecar dominates. Linkerd 2.x's "Rust data plane" got the sidecar overhead under 1 ms p99 for trivial requests; Envoy at default config sits at 6–10 ms p99. Centralised L7 proxies (an LB tier on separate hosts) add 1–3 ms more for the network hop on top.
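A back-of-envelope way to see where those milliseconds go; the per-component numbers below are illustrative, loosely matching the figures quoted in this chapter rather than measurements of any particular deployment.
# hop_cost_breakdown.py: itemise one proxy hop's cost (illustrative numbers only).
SIDECAR_COMPONENTS_MS = {
    "kernel network stack, caller -> proxy": 0.2,
    "L7 parse (HTTP/2 frame decode, header table, route match)": 4.5,
    "LB algorithm (P2C pick)": 0.1,
    "re-encode + write to upstream connection": 1.7,
    "kernel network stack, proxy -> caller (response)": 0.2,
}
EXTRA_NETWORK_HOP_MS = 2.0  # centralised LB tier on separate hosts adds a real network hop

sidecar = sum(SIDECAR_COMPONENTS_MS.values())
centralised = sidecar + EXTRA_NETWORK_HOP_MS
for name, ms in SIDECAR_COMPONENTS_MS.items():
    print(f"  {name:58s} {ms:4.1f} ms")
print(f"sidecar hop total        ~{sidecar:.1f} ms")
print(f"centralised LB tier      ~{centralised:.1f} ms")
print("client-side LB            ~0 ms extra: the pick runs in the caller's own process")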
A measurable trade-off — comparing the two paths under load
The script below sets up two minimal LB paths in pure Python asyncio: a client-side path where the caller picks a backend with P2C and dials it directly, and a proxy-side path where the caller sends to a localhost proxy that picks the backend with the same P2C. Both paths use the same set of 8 backends, each with a service-time mean drawn from 5 ms ± 2 ms plus per-request Gaussian jitter. We measure end-to-end p50/p99/p999 over 5000 requests at a concurrency of 64.
# cs_vs_px_lb.py — measure client-side vs proxy-side LB overhead under identical policy.
import asyncio, random, time, statistics

NUM_BACKENDS = 8
PROXY_OVERHEAD_MS = 6.5   # measured Envoy sidecar p50 frame parse + route
NETWORK_HOP_MS = 0.4      # localhost roundtrip
random.seed(42)

class Backend:
    def __init__(self, name, mean_ms):
        self.name, self.mean_ms = name, mean_ms
        self.in_flight = 0

    async def serve(self):
        self.in_flight += 1
        try:
            jitter = random.gauss(0, 0.2 * self.mean_ms)
            await asyncio.sleep(max(0.001, (self.mean_ms + jitter)) / 1000)
        finally:
            self.in_flight -= 1

backends = [Backend(f"pod-{i}", 5.0 + random.uniform(-2, 2)) for i in range(NUM_BACKENDS)]

def p2c_pick(pool):
    a, b = random.sample(pool, 2)
    return a if a.in_flight <= b.in_flight else b

async def client_side_call():
    """Caller has the LB library in-process: pick + dial directly."""
    t0 = time.perf_counter()
    target = p2c_pick(backends)
    await target.serve()
    return (time.perf_counter() - t0) * 1000

async def proxy_side_call():
    """Caller sends to localhost proxy; proxy does the LB work."""
    t0 = time.perf_counter()
    await asyncio.sleep(NETWORK_HOP_MS / 1000)     # caller -> sidecar
    await asyncio.sleep(PROXY_OVERHEAD_MS / 1000)  # parse + route
    target = p2c_pick(backends)
    await target.serve()
    await asyncio.sleep(NETWORK_HOP_MS / 1000)     # sidecar -> caller (response)
    return (time.perf_counter() - t0) * 1000

async def run(call, n, concurrency):
    sem = asyncio.Semaphore(concurrency)
    async def one():
        async with sem:
            return await call()
    return await asyncio.gather(*[one() for _ in range(n)])

async def main():
    for label, fn in [("client-side", client_side_call), ("proxy-side", proxy_side_call)]:
        latencies = await run(fn, n=5000, concurrency=64)
        latencies.sort()
        p50 = statistics.median(latencies)
        p99 = latencies[int(0.99 * len(latencies))]
        p999 = latencies[int(0.999 * len(latencies))]
        print(f"{label:12s} p50={p50:5.2f}ms p99={p99:5.2f}ms p999={p999:5.2f}ms")

asyncio.run(main())
Sample run:
client-side p50= 5.04ms p99= 7.62ms p999= 8.91ms
proxy-side p50=12.34ms p99=15.18ms p999=17.04ms
Per-line walkthrough. p2c_pick is the same function called from both paths — the LB algorithm is identical. The only difference is where it runs. client_side_call measures one RPC where the caller's code picks the backend directly and awaits the response. proxy_side_call adds two extra asyncio.sleep calls — one for the caller→sidecar localhost hop, one for the sidecar's parse/route work — before running the same p2c_pick and the backend serve, then a final hop for the response path. The numbers show what the architecture trade-off costs at the request layer: client-side p99 is 7.62 ms; proxy-side p99 is 15.18 ms — the proxy doubled the tail. At p999 the gap is even sharper (8.91 vs 17.04 ms): in the simulation the proxy's fixed overhead compounds with event-loop scheduling jitter under concurrency, and in production the proxy's own GC pauses and queueing land on top of the backend's. For a service whose p99 budget is 50 ms and whose request chains ten of these calls in sequence, the proxy path eats roughly 152 ms; the client-side path roughly 76 ms — both blow the budget, but the proxy path by twice as much. That difference is the entire reason gRPC chose client-side LB as its default.
Why p99 (not p50) is the right metric for this trade-off: the proxy's overhead is mostly fixed (parse + route + re-encode) but its variance compounds with backend variance. At p50, both paths are dominated by backend service time; the proxy hop is a flat add. At p99, the proxy's own queueing (its event loop is not free under concurrency 64) starts contributing, and at p999 the proxy's GC pauses, scheduler delays, and admission control kick in. Tail latency is where the architecture choice is visible. If your service is p50-bounded, you will not feel the proxy. If it is p99-bounded — most user-facing services are — the proxy hop is a 50–100% tax.
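To see that p50/p99 split concretely, here is a toy model; the distributions and constants are assumptions for illustration, not measurements. The proxy adds a mostly fixed cost plus a rare stall, and the delta it causes grows as you move from the median toward the tail.
# tail_compounding.py: toy model of a fixed-plus-occasionally-spiky proxy overhead
# layered on a lognormal backend. All constants are illustrative assumptions.
import random
random.seed(7)

def backend_ms():
    return random.lognormvariate(1.6, 0.3)        # roughly 5 ms median service time

def proxy_overhead_ms():
    cost = 6.5 + random.gauss(0, 0.5)             # fixed parse/route/encode cost
    if random.random() < 0.01:                    # rare proxy-side stall (GC pause, queueing)
        cost += random.uniform(5, 20)
    return max(cost, 0.0)

N = 100_000
direct = sorted(backend_ms() for _ in range(N))
proxied = sorted(backend_ms() + proxy_overhead_ms() for _ in range(N))

def quantile(xs, q):
    return xs[int(q * (len(xs) - 1))]

for name, q in [("p50", 0.50), ("p99", 0.99), ("p999", 0.999)]:
    d, p = quantile(direct, q), quantile(proxied, q)
    print(f"{name:5s} direct={d:6.2f} ms  proxied={p:6.2f} ms  delta={p - d:5.2f} ms")
At p50 the delta is close to the fixed ~6.5 ms; at p999 it is markedly larger, because the rare proxy stalls and the backend's own slow requests line up in the same tail.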
The hybrid — gRPC + xDS, the modern compromise
Pure client-side LB has a real problem: the LB library has to know which backends exist (resolver), which are healthy (health checker), and which policy to use (config). Pushing all of this into every caller's binary means a redeploy of the entire fleet to change a knob — and a polyglot fleet means N copies of the LB library, one per language. gRPC + xDS (the protocol Envoy uses for dynamic config) is the architecture most large companies converge on: keep the data plane client-side (the caller dials the backend directly, no proxy hop) but move the control plane into a centralised xDS server that pushes endpoint lists, health, and LB policy to every gRPC client over a long-lived stream.
Google has run this pattern internally for over a decade with Stubby (their internal RPC framework, the predecessor to gRPC). The Discord engineering blog post on their gRPC migration (2021) describes the same architecture for a 50M-concurrent-user voice infrastructure: every Discord service uses gRPC's xDS resolver, which pulls endpoint discovery and LB policy from a central xDS control plane, but the actual request goes caller→backend with no proxy in between. Latency-sensitive services at PaySetu use the same shape — the checkout service's gRPC client subscribes to the xDS stream for payment-service, the xDS control plane pushes a fresh endpoint list every 30 s (or instantly on health-check failure), and the caller does P2C selection over that list in-process.
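What this looks like from the caller's side, as a minimal sketch: the only visible change is the target scheme. The stub and message names below are hypothetical, and the client assumes a GRPC_XDS_BOOTSTRAP file on the pod already points at the control plane.
# xds_client_sketch.py: caller-side shape of gRPC + xDS. The xds: scheme hands name
# resolution, endpoint health, and LB policy to the control plane named in the
# bootstrap file; the RPC itself dials the chosen backend directly, no proxy hop.
# payment_pb2 / payment_pb2_grpc are hypothetical generated stubs.
import grpc
import payment_pb2, payment_pb2_grpc

channel = grpc.insecure_channel("xds:///payment-service")   # resolver scheme, not a hostname
stub = payment_pb2_grpc.PaymentServiceStub(channel)
resp = stub.GetStatus(payment_pb2.StatusRequest(order_id="ord-1042"), timeout=0.05)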
The xDS protocol family has four streams: CDS (Cluster Discovery — what services exist), EDS (Endpoint Discovery — which pods back each service), RDS (Route Discovery — how to map paths to clusters), and LDS (Listener Discovery — which ports/protocols to listen on, used mainly by Envoy as a sidecar; gRPC clients usually only use CDS + EDS + RDS). Every gRPC client maintains a long-lived bidirectional gRPC stream to the xDS control plane; updates flow as deltas (xDS Delta protocol, since 2020) so a 5000-pod fleet's full endpoint list is not re-sent on every change. When a pod fails its health check, the xDS server pushes an EDS delta within 50 ms; every gRPC caller sees it, and the next P2C pick excludes that pod automatically.
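The client-side effect of an EDS delta can be modelled in a few lines. This is a sketch of the behaviour described above, not gRPC's internal API: the client keeps a local endpoint set, applies each delta, and the next pick simply never sees the removed pod.
# eds_delta_model.py: toy model of a caller applying incremental EDS updates to its
# local endpoint set. Illustrative only; gRPC does this inside its xDS resolver.
import random

endpoints = {"pod-0", "pod-1", "pod-2", "pod-3"}   # last list pushed by the control plane
in_flight = {}                                      # endpoint -> outstanding RPCs

def apply_eds_delta(added=(), removed=()):
    endpoints.update(added)
    endpoints.difference_update(removed)

def p2c_pick():
    a, b = random.sample(sorted(endpoints), 2)
    return a if in_flight.get(a, 0) <= in_flight.get(b, 0) else b

# pod-2 fails its health check; the control plane pushes a removal delta.
apply_eds_delta(removed={"pod-2"})
assert all(p2c_pick() != "pod-2" for _ in range(1000))   # later picks exclude it automatically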
This gives you the client-side data path (no proxy hop, 4.2 ms RPCs, no shared failure domain) plus the proxy-side rollout speed (push a new LB policy to 5000 callers in 50 ms, no binary redeploy). The cost is operational complexity: you now run an xDS control plane (a non-trivial distributed system in itself), every client language needs a working xDS implementation (Java, Go, Python, C++ are excellent; Node, Rust, Ruby are partial as of 2025), and the xDS control plane becomes a critical path — when it is unreachable, gRPC clients fall back to their last-known endpoint list, which can be stale during a rolling deploy.
Why xDS uses a bidi stream rather than HTTP polling: the original xDS v1 design was poll-based, and clients hammered the control plane every 30 s. With 50 000 clients that is 1666 polls per second per service, and most polls returned "no change". The v2/v3 streaming design lets the control plane push deltas only when something changes, which is rare — endpoint changes happen at deploy events, not continuously. The streaming model brings xDS control-plane CPU cost from O(num_clients × poll_freq) to O(num_changes), which is the difference between an xDS server needing 200 cores and 8 cores.
Common confusions
- "Service mesh sidecars are the same as a centralised L7 LB." They are both proxy-side LB, but the cost shapes are different. A sidecar runs co-located with the caller (same pod), so the network hop is localhost (~0.4 ms) — the overhead is the sidecar's parse/route work (6–10 ms). A centralised L7 LB tier is a separate fleet of pods, which adds a real network hop (1–3 ms) on top of the same parse/route cost. Centralised L7 LBs are cheaper to operate (one fleet vs N sidecars) but are a shared failure domain across all callers; sidecars are double the resource cost but isolate failures to one pod.
- "Client-side LB doesn't need health checks because the caller knows its own connections." It does need them. A backend can fail to accept new connections (process crashed) without the caller's existing connections breaking; the caller needs an active health-check signal to remove that backend from the pool. The gRPC default is passive health checking (mark unhealthy after 3 consecutive failures from real traffic), which is fine for steady-state but slow during incidents — gRPC adds active health-check via xDS Endpoint Health Check Service for production deployments. A minimal sketch of the passive variant follows this list.
- "xDS is just for Envoy." It started as Envoy's config protocol but is now the gRPC standard for dynamic LB config. The protobuf definitions live in the envoyproxy/data-plane-api repo, but the gRPC client libraries (Java, C++, Go, Python) speak xDS natively without any Envoy involvement. You can run a pure gRPC + xDS architecture with no Envoy in the data path at all — this is the Google-internal design.
- "Proxy-side LB always adds an RTT." In a sidecar deployment the "extra RTT" is a localhost roundtrip (~0.4 ms each way), not a real network RTT. The dominant cost is the sidecar's parse/route work, not the network. eBPF-based sidecar acceleration (Cilium's bandwidth-manager, AWS App Mesh's appnet) bypasses the kernel TCP stack for in-pod traffic and brings the localhost hop under 50 µs — but the parse/route still happens.
- "Client-side LB scales worse than proxy-side." The opposite. Client-side LB has no shared component on the data path; every caller picks independently. Proxy-side LB has a shared proxy fleet; if traffic doubles, the proxy fleet must scale, which is a coordinated capacity-planning exercise. Discord's gRPC migration explicitly cited this — their voice traffic was outgrowing what their HAProxy fleet could handle, and moving to client-side gRPC made the LB layer scale automatically with the caller fleet.
Going deeper
When client-side LB falls down — the polyglot fleet problem
Client-side LB embedded in every caller's binary is great when you have one or two backend languages. It becomes painful at 5+. Each language needs a maintained LB library that implements P2C, BLCH, retry, deadline propagation, xDS, and the company's specific policy extensions. Java + Go + Python + Node + C++ + Rust = six libraries, six rollout cadences, six places to fix a bug, six places to add a new feature like locality-aware routing. The xDS protocol's gRPC clients in 2025 are excellent in Java, C++, Go, mediocre in Python, weak in Node and Rust. A polyglot fleet at a company without the headcount to maintain six client libraries is the classic case for proxy-side LB — one Envoy fleet handles every language identically. MealRush internally moved from gRPC + xDS (Java + Go services only) to Envoy sidecars (Java + Go + Python + Node) when they integrated the YatriBook acquisition and went from 2 languages to 4 in one quarter. The latency cost was real (p99 went from 9 ms to 14 ms) but the alternative — maintaining a Python and Node xDS LB library themselves — was unacceptable for a 30-engineer platform team.
The shared failure domain you forget about
A centralised L7 LB tier (the proxy-side architecture) has one failure mode that does not show up on architecture diagrams: when the proxy fleet itself becomes overloaded, every service that uses it degrades simultaneously. This is the failure mode that took down a major Indian fintech (not named per the §2 roster — but the public postmortem from 2022 describes it well) when their HAProxy fleet's TLS handshake CPU pinned during a viral campaign launch. Every microservice that called another microservice through the LB tier saw the same p99 spike. There was no "isolate the bad service" — the LB tier was the shared failure mode. Client-side LB does not have this property: a caller's LB CPU cost is its own; if checkout-service's LB code is pegged, fraud-service is unaffected. This blast-radius difference is one of the strongest arguments for client-side LB at platform scale.
The economics — sidecar resource cost at fleet scale
A typical Envoy sidecar in production uses 0.1–0.3 cores and 80–200 MB of RAM per pod at moderate traffic. Multiply by 5000 pods: that is 500–1500 cores and 400 GB–1 TB of RAM dedicated to running the sidecar fleet. At AWS on-demand c6i.xlarge pricing (4 vCPU, 8 GB, ~$140/month), this is roughly $17,500–$52,500 per month just to run the load balancer. A pure client-side LB has near-zero overhead — the LB code runs in the caller's existing process, sharing CPU with the application. CricStream's platform team did this analysis in 2024 (per their internal architecture review) and found that moving 60% of internal RPC traffic from sidecar to gRPC + xDS saved $340,000/year in cloud spend, before accounting for the latency improvement that let them downsize backend instances by 12% (because each backend served slightly faster RPCs and met the same SLO with fewer pods).
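The arithmetic behind those figures, as a quick sketch; the inputs are the ranges quoted above, not a pricing API, and your fleet's numbers will differ.
# sidecar_fleet_cost.py: reproduce the back-of-envelope above.
PODS = 5_000
CORES_PER_SIDECAR = (0.1, 0.3)        # low / high per-pod CPU use
RAM_MB_PER_SIDECAR = (80, 200)        # low / high per-pod RAM use
VCPU_PER_INSTANCE = 4                 # c6i.xlarge
USD_PER_INSTANCE_MONTH = 140

for label, cores, ram in zip(("low", "high"), CORES_PER_SIDECAR, RAM_MB_PER_SIDECAR):
    fleet_cores = PODS * cores
    fleet_ram_gb = PODS * ram / 1024
    instances = fleet_cores / VCPU_PER_INSTANCE
    usd_month = instances * USD_PER_INSTANCE_MONTH
    print(f"{label:4s}: {fleet_cores:6.0f} cores, {fleet_ram_gb:5.0f} GB RAM, "
          f"~{instances:.0f} instances, ~${usd_month:,.0f}/month")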
Reproduce this on your laptop
# Run the cs_vs_px_lb.py simulator from this chapter:
python3 -m venv .venv && source .venv/bin/activate
# no third-party packages needed — asyncio and statistics are in the standard library
python3 cs_vs_px_lb.py
# Expected: client-side p99 ~ 7.6 ms, proxy-side p99 ~ 15 ms
# Inspect a real Envoy sidecar's overhead with envoy admin endpoint:
docker run --rm -p 9901:9901 -p 10000:10000 envoyproxy/envoy:v1.28-latest \
  --config-yaml '
admin: { address: { socket_address: { address: 0.0.0.0, port_value: 9901 } } }
static_resources:
  listeners:
  - address: { socket_address: { address: 0.0.0.0, port_value: 10000 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: local
              domains: ["*"]
              routes: [{ match: { prefix: "/" }, route: { cluster: backend } }]
          http_filters:
          - name: envoy.filters.http.router
            typed_config: { "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router }
  clusters:
  - name: backend
    type: STATIC
    load_assignment:
      cluster_name: backend
      endpoints:
      - lb_endpoints:
        - endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 8080 } } }
'
# Then: curl http://localhost:9901/stats?filter=http.ingress_http.downstream_rq_time
# This shows the per-request time histogram for the proxy hop alone.
Where this leads next
The choice between client-side and proxy-side LB is not a one-time decision — it changes as a company scales, adds languages, and pays for cloud. The next chapter — discovery caching and staleness — addresses what happens when the endpoint list itself goes stale (which both architectures eventually have to handle). After that, Part 7 picks up reliability patterns that compose with either LB choice: hedged requests, deadline propagation, retry budgets.
- Bounded-load consistent hashing — the LB algorithm that benefits most from client-side data plane (every caller sees the same ring without proxy coordination).
- Power of two choices (P2C) — the policy used in the simulation above.
- Locality-aware load balancing — how the same client-side LB can be region-aware via xDS locality fields.
- Client-side vs server-side discovery — the orthogonal axis at the discovery layer (where do endpoint lists come from), often confused with this chapter.
References
- gRPC Load Balancing — gRPC docs — the canonical write-up of why gRPC chose client-side LB as its default.
- xDS Protocol — Envoy docs — the xDS specification for dynamic LB config.
- Discord, "How Discord Stores Trillions of Messages" and the gRPC migration posts (Discord engineering blog) — the 50M-concurrent-user voice migration to gRPC + xDS.
- Linkerd 2.x Rust Data Plane Performance — why Linkerd built a custom Rust proxy to keep sidecar overhead under 1 ms.
- Dean & Barroso, "The Tail at Scale" — CACM 2013 — the foundational paper on why p99 (not p50) is the right LB metric.
- Bounded-load consistent hashing — internal companion. The LB algorithm whose properties differ between client-side and proxy-side deployments.
- Power of two choices (P2C) — internal companion. The default LB policy in both gRPC and Envoy.