Client-side vs server-side discovery
PaySetu's payment-status service has 240 backend pods. Every front-end pod, when handed the request GET /payment/PS-2871-9931/status, must answer one question before the request can go out on the wire: which of those 240 pods do I send this to? There are exactly two architectural answers. The front-end pod can fetch the membership list itself, pick a backend, and connect to its IP directly — client-side discovery. Or the front-end pod can send the packet to a virtual address belonging to a load balancer, and let that load balancer pick — server-side discovery. The choice is not a style preference. It changes who owns retries, who sees the staleness, who pays the egress cost, and which engineer's pager rings when a pod dies mid-request.
Client-side discovery puts the membership list and load-balancing decision inside the caller — the client fetches endpoints from a registry, picks one, connects directly. Server-side discovery hides backends behind a virtual IP or DNS name; a load-balancer (kube-proxy, an L4 NLB, an L7 envoy) does the picking. Client-side wins on latency and L7 awareness; server-side wins on operational simplicity, language-independence, and not needing the registry library in every codebase. Modern service meshes are server-side discovery that lives next to each client — the sidecar pattern is the synthesis.
The two architectures, drawn precisely
In client-side discovery, the calling process holds three pieces of state simultaneously: the service name (payments-status), the current membership list (the set of pod IPs that back it, kept fresh by a watch on a registry like etcd / Consul / Eureka), and a load-balancing policy (round-robin, P2C, weighted least-connection). When a request arrives, the client picks one IP from its in-memory list and opens a TCP connection straight to that pod. There is no intermediary on the wire — the packet's destination IP, on the wire between the front-end node and the backend node, is the real pod IP.
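To make the policy piece concrete, here is a minimal sketch of a client-side pick using power-of-two-choices (P2C) over the in-memory list; the endpoint values and the inflight counter below are invented for illustration:

import random

def pick_p2c(endpoints, inflight):
    """Power-of-two-choices: sample two endpoints, keep the one with
    fewer in-flight requests. endpoints: list of (ip, port);
    inflight: dict mapping endpoint -> current in-flight count."""
    a, b = random.sample(endpoints, 2)
    return a if inflight.get(a, 0) <= inflight.get(b, 0) else b

# The caller connects straight to the winner: no intermediary on the wire.
endpoints = [("10.244.7.81", 8080), ("10.244.7.82", 8080), ("10.244.7.83", 8080)]
inflight = {("10.244.7.82", 8080): 4}
print(pick_p2c(endpoints, inflight))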
In server-side discovery, the calling process holds only the service name. It resolves the name to a virtual address — a Kubernetes ClusterIP, an internal NLB DNS name like payments-internal.elb.ap-south-1.amazonaws.com, an envoy listener on 127.0.0.1:9001. The packet's destination on the wire is that virtual address. Something else — kernel iptables rules, a hardware load-balancer, a sidecar proxy — rewrites the destination address (or terminates the L4 connection and opens a new one) and forwards the request to a backend it picked. The membership list and the load-balancing policy live in the load balancer, not in the client.
Why the wire-level destination matters: in client-side discovery, if you tcpdump between the front-end node and a backend node, the destination IP in the IP header is 10.244.7.83 — the actual backend. In server-side discovery (kube-proxy iptables mode), the destination on the wire is also 10.244.7.83 — but only because the iptables NAT rule rewrote it on the source node before transmission. The cluster-IP 10.96.42.17 never appears on the wire. With an external load balancer (an AWS NLB), the destination on the wire from the front-end is the NLB's IP, then the NLB opens a new TCP connection to the backend. That is the operational difference: one TCP connection vs two. Two connections mean two retransmit timers, two congestion windows, two TLS handshakes.
What each architecture costs in practice
The trade is deceptively simple to state. Client-side discovery saves a network hop and gives the application full control over load-balancing policy. Server-side discovery removes the registry library from every application codebase and decouples the load-balancing choice from the application's release cycle. The complications hide in five places: latency, freshness, fan-out, language coverage, and blast radius.
Latency. A client-side call traverses one TCP connection, one round trip. A server-side call to an out-of-process load-balancer traverses two connections, with the load-balancer's processing time as a fixed cost — typically 0.2–0.8 ms for an L4 LB, 0.5–2 ms for an L7 LB doing TLS termination and HTTP parsing. PaySetu's internal observability platform measured exactly this in 2025 when the team migrated payments-status from a kube-proxy ClusterIP (server-side, L4) to a gRPC xDS client-side balancer: p50 dropped by 0.6 ms, p99 dropped by 1.4 ms, p99.9 dropped by 4.1 ms. The p99.9 win came from removing the iptables connection-tracking table's tail latency, which on a busy node (>50K concurrent connections) shows up as occasional multi-millisecond hiccups.
Freshness — who sees the stale endpoint. Both architectures have a watch on the registry; both have a propagation delay. The difference is who suffers when the cache is stale. In client-side discovery, every client has its own cache; a slow watch on one client means that one client sends to a dead pod. In server-side discovery, the load-balancer has a shared cache; a slow watch on the load-balancer means every client sends to a dead pod. Client-side spreads the failure; server-side concentrates it.
Fan-out — registry load. A client-side cluster with 5,000 callers means 5,000 watch connections to the registry. The registry must handle 5,000 long-lived connections, broadcast every membership change to all of them, and survive their reconnect storms after registry restarts. Eureka's design — large in-memory caches, periodic full snapshots, eventual consistency — comes from exactly this constraint. Server-side discovery puts ~10–50 watches on the registry (one per LB instance), but every client pays the LB's hop cost.
Language coverage. Client-side discovery requires a registry-aware library in every language the company writes services in. Netflix's Eureka had Java-first support; Python and Node.js teams either wrote their own clients (with subtle bugs) or routed through a shim. PaySetu's polyglot stack — Java, Python, Go, Node.js, a few Kotlin services — meant 5 client-side libraries to maintain, each with its own bug list, each with a different jitter algorithm for the watch backoff. Server-side discovery is language-agnostic: any program that can open a socket gets the discovery for free.
Blast radius. A bug in the client-side library — a memory leak in the watch handler, a panic on malformed membership data — affects every service using that library. A bug in the server-side load-balancer affects every request through that LB. The blast radius of the LB is usually larger (more callers per LB than per client process), which is why production LB software (HAProxy, envoy, nginx) is among the most battle-tested infrastructure in the industry.
# discovery_compare.py — measure round-trip latency for client-side vs server-side discovery.
# Both caches go stale (one dead backend); pass refresh=True to client_side_call to also
# pay the cold-cache registry round-trip.
import asyncio
import random
import time

class Registry:
    """The membership source of truth — one network round-trip away."""
    def __init__(self):
        self.endpoints = []  # list of (ip, port, alive)

    async def fetch(self):
        await asyncio.sleep(0.012)  # 12 ms RTT to registry across AZ
        return [(ip, port) for ip, port, alive in self.endpoints if alive]

class Backend:
    async def serve(self, alive):
        if not alive:
            await asyncio.sleep(2.0)  # stuck until the TCP connect timeout fires
            raise ConnectionRefusedError("dead pod")
        await asyncio.sleep(random.uniform(0.0008, 0.0015))  # 0.8-1.5 ms service time

async def client_side_call(reg, cache, backends, refresh):
    """Caller picks; one network hop to the backend."""
    if refresh:
        cache[:] = await reg.fetch()
    if not cache:
        return None
    ip, port = random.choice(cache)
    alive = next((a for (i, p, a) in reg.endpoints if (i, p) == (ip, port)), False)
    t0 = time.perf_counter()
    try:
        await backends[(ip, port)].serve(alive)
    except ConnectionRefusedError:
        pass  # error counted as latency
    return time.perf_counter() - t0

async def server_side_call(reg, lb_cache, backends):
    """Caller hits LB; LB picks; two network hops."""
    await asyncio.sleep(0.0004)  # 0.4 ms LB hop in-cluster
    if not lb_cache:
        return None
    ip, port = random.choice(lb_cache)
    alive = next((a for (i, p, a) in reg.endpoints if (i, p) == (ip, port)), False)
    t0 = time.perf_counter()
    try:
        await backends[(ip, port)].serve(alive)
    except ConnectionRefusedError:
        pass  # error counted as latency
    return time.perf_counter() - t0

async def main():
    reg = Registry()
    reg.endpoints = [(f"10.244.7.{i}", 8080, True) for i in range(1, 9)]
    backends = {(ip, port): Backend() for (ip, port, _) in reg.endpoints}
    cs_cache = [(ip, port) for (ip, port, _) in reg.endpoints]
    ss_cache = list(cs_cache)
    # Now mark one pod dead — but the caches do NOT get updated yet (stale)
    reg.endpoints[2] = (reg.endpoints[2][0], reg.endpoints[2][1], False)
    cs_lat = [await client_side_call(reg, cs_cache, backends, refresh=False) for _ in range(200)]
    ss_lat = [await server_side_call(reg, ss_cache, backends) for _ in range(200)]
    pct = lambda xs, p: sorted(xs)[int(len(xs) * p / 100)]
    print(f"client-side p50={1000*pct(cs_lat,50):5.2f}ms p99={1000*pct(cs_lat,99):6.2f}ms")
    print(f"server-side p50={1000*pct(ss_lat,50):5.2f}ms p99={1000*pct(ss_lat,99):6.2f}ms")

asyncio.run(main())
Sample run on a quiet laptop:
client-side p50= 1.13ms p99=2003.42ms
server-side p50= 1.51ms p99=2003.78ms
Per-line walkthrough. The line await asyncio.sleep(0.012) simulates the registry round-trip — 12 ms is realistic for a cross-AZ etcd quorum read. The line await asyncio.sleep(0.0004)  # 0.4 ms LB hop in-cluster is the cost server-side discovery pays on every call — a single in-cluster L4 hop runs about 0.3–0.6 ms when the LB is on a different node, lower if it is on the same node (the kube-proxy iptables case). The line reg.endpoints[2] = (reg.endpoints[2][0], reg.endpoints[2][1], False) marks one pod dead without updating either cache — both caches are now stale. The if not alive branch in Backend.serve — a 2-second sleep followed by ConnectionRefusedError — is the failure mode: a stale endpoint costs 2 seconds per stuck request (the TCP connect timeout). That is why the p99 in both rows is dominated by the 2-second timeout — staleness is the same disaster on both sides; the only difference is whether one client pays it or all clients do.
Why p50 differs but p99 is nearly identical: in steady state (warm cache, no failures), the per-call cost difference is the LB hop (~0.4 ms) — visible in the median. But under failure (stale cache, dead endpoint), both architectures hit the same TCP-connect timeout, and the timeout dwarfs the hop cost by 5,000×. The architectural choice therefore does not move the tail under failure; it moves the tail only in the normal regime, where the hop adds up to a 30–40% p99 difference. This is why companies obsessed with tail latency (high-frequency trading, real-time bidding) prefer client-side: every microsecond saved on the normal path is a win, and they have separate machinery for failure handling.
When each one is the right choice
The choice maps cleanly onto a few concrete questions about the system. The first is how many languages the company writes services in. With one language and a deeply integrated framework — the way Netflix used Java + Spring Cloud + Ribbon, or the way Twitter used Scala + Finagle — client-side discovery is cheap to maintain and gives you fine-grained policy per service. With five or seven languages, the maintenance cost of keeping every client library current explodes; server-side wins by default.
The second question is how much L7 awareness you need. Server-side L4 load-balancing (kube-proxy, an NLB) treats every TCP connection as an opaque byte stream; it cannot pick a backend based on the request's HTTP path, gRPC method, or tenant header. If you need per-method weighting (canary v2 of /Checkout/Submit only), per-tenant routing (tenant=mealrush always goes to pool-A), or retries on specific status codes, you need either an L7 LB (envoy, nginx) or client-side discovery with policy logic in the client. Modern service meshes solved this by making the L7 LB a sidecar — it is server-side from the application's POV (talks to localhost) but client-side from the network's POV (one hop, no shared LB).
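A sketch of what that L7 awareness looks like as picking logic; the pool names, tenant value, and canary fraction below are hypothetical:

import random

def pick_l7(method, path, headers, pools):
    """Route on request content: per-tenant pinning first, then a 10%
    canary for one specific path, else the default pool. pools maps
    pool name -> list of (ip, port). An L4 LB cannot see any of this."""
    if headers.get("tenant") == "mealrush":
        return random.choice(pools["pool-A"])       # per-tenant routing
    if path == "/Checkout/Submit" and random.random() < 0.10:
        return random.choice(pools["canary-v2"])    # per-method canary
    return random.choice(pools["default"])

pools = {"pool-A": [("10.0.1.1", 8080)],
         "canary-v2": [("10.0.2.1", 8080)],
         "default": [("10.0.3.1", 8080), ("10.0.3.2", 8080)]}
print(pick_l7("POST", "/Checkout/Submit", {"tenant": "acme"}, pools))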
The third question is whether you can tolerate the registry being a fan-out target. Eureka gave up linearizable consistency precisely because it was holding 50,000 long-lived watches and could not afford a strongly-consistent read on every membership lookup. If your registry is etcd or ZooKeeper (strongly consistent), the watch fan-out limit is in the low thousands; client-side discovery beyond that scale either needs caching layers (which reintroduce staleness) or a different registry. Server-side discovery puts a small constant number of watches on the registry — even at very large scales.
CricStream's 2024 architecture choice illustrates the trade. Their internal microservice mesh (~800 services, 4 languages) runs on Kubernetes with kube-proxy server-side discovery for the unary RPC traffic. Their streaming path — chunked video segments from the encode stage to the edge CDN — runs client-side: every encode pod holds an in-memory list of edge-CDN ingest endpoints, weighted by recent observed latency, and sends each segment to the lowest-RTT edge directly. The unary path traded 0.5 ms p99 for not having to ship a Java + Python + Go + Node.js client library. The streaming path could not afford that 0.5 ms because they ship 4 segments per second per stream and the LB hop showed up as a 2 ms p99 bump on the segment-ingest distribution.
Failure modes and the sidecar synthesis
Each architecture has a characteristic failure mode that the other does not.
Client-side: the slow-watch problem. The client's endpoint cache is updated by a watch on the registry. If the watch falls behind — registry overload, network partition between the client and the registry, GC pause in the client — the client goes on confidently routing to an endpoint set that no longer exists. PaySetu hit this exact failure mode in 2024 when their Eureka registry was migrated to a new instance class with a smaller heap: one rolling restart of the registry caused 4,000 client pods to lose their watches simultaneously, all 4,000 reconnected with cached data that was 90 seconds stale, and the 90-second window of stale routing resulted in a 17-minute p99 spike from 80 ms to 1.4 s as 12% of requests hit pods that had been killed. The fix involved staggering reconnects with jittered backoff and adding a "freshness budget" — refusing to use cache older than 30 seconds without a successful refresh.
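A minimal sketch of that freshness budget, assuming a synchronous fetch function; the class name and the 30-second default mirror the fix described above but are otherwise invented:

import time

class BudgetedCache:
    """Endpoint cache that refuses to serve entries older than max_age
    seconds; a failed refresh makes lookups fail fast instead of
    routing on 90-second-old membership."""
    def __init__(self, fetch, max_age=30.0):
        self.fetch, self.max_age = fetch, max_age
        self.endpoints, self.stamp = [], 0.0

    def get(self):
        if time.monotonic() - self.stamp > self.max_age:
            self.endpoints = self.fetch()   # may raise; that is the point:
            self.stamp = time.monotonic()   # better to fail fast than route stale
        return self.endpoints

cache = BudgetedCache(lambda: [("10.244.7.81", 8080)])
print(cache.get())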
Server-side: the LB-as-failure-domain problem. The load-balancer is on the request path. If it is unhealthy — out of file descriptors, deadlocked thread pool, hung TLS session — every request fails. AWS NLBs and ALBs are highly available because they are distributed (per-AZ instances behind a single DNS name), but a single HAProxy in front of a service pool is a single point of failure. The mitigation is typically N LB instances behind anycast or DNS round-robin, which reintroduces the question of how the client picks an LB — at which point you are doing client-side discovery of the load-balancer.
The service mesh sidecar pattern, popularised by Linkerd (2016) and Istio (2017) and now standard via envoy + xDS, resolves the dilemma by making the LB local to the client. Each client pod has an envoy sidecar at 127.0.0.1:9001. The application opens a socket to localhost — server-side discovery from the app's perspective. The sidecar holds the membership cache and applies the LB policy — like a co-located client-side library, but in a separate process so the application code is language-agnostic. The LB hop is now 0.05 ms (loopback), not 0.4 ms (cross-node). The LB is no longer a shared failure domain — if your sidecar crashes, only your pod's requests fail, not everyone's. The xDS protocol gives you the L7 policy of an envoy without the maintenance burden of a per-language client library.
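To make "server-side from the app's POV, client-side from the network's POV" concrete, here is a toy loopback sidecar in asyncio, with no error handling; a real mesh runs envoy fed by an xDS control plane, and the port and backend list here are assumptions:

import asyncio, random

BACKENDS = [("10.244.7.81", 8080), ("10.244.7.82", 8080)]  # pushed by a control plane in a real mesh

async def handle(app_reader, app_writer):
    ip, port = random.choice(BACKENDS)  # the sidecar, not the app, picks
    be_reader, be_writer = await asyncio.open_connection(ip, port)
    async def pump(r, w):
        while data := await r.read(65536):
            w.write(data)
            await w.drain()
        w.close()
    await asyncio.gather(pump(app_reader, be_writer), pump(be_reader, app_writer))

async def main():
    # The application only ever talks to 127.0.0.1:9001.
    server = await asyncio.start_server(handle, "127.0.0.1", 9001)
    async with server:
        await server.serve_forever()

asyncio.run(main())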
The cost is the sidecar's CPU and memory — typically 30–80 MB resident and 0.05–0.2 cores per pod. At PaySetu scale (say 10,000 pods running an envoy sidecar) that is ~600 GB of memory and ~1,000 cores spent on the mesh data plane. Whether that bill is worth it depends on what L7 policy the team needs. For L4-only routing, a service mesh is over-engineering; kube-proxy is enough. For mTLS, retries with budgets, per-method timeouts, traffic shifting, and detailed per-call observability — the sidecar pays for itself.
Why the sidecar's loopback hop is essentially free compared to a cross-node LB hop: a loopback packet never touches a NIC, never enters a qdisc, and pays almost no segmentation cost (the loopback MTU is 64 KB), and the kernel short-circuits much of the IP stack in the loopback driver. Applications can shave a little more with TCP_NODELAY (disabling Nagle's delay) and kernel busy-polling (SO_BUSY_POLL). The end-to-end cost of a 1-byte loopback round-trip is dominated by the two context switches (app→sidecar→app), which is on the order of 30–80 microseconds. Compare that to a cross-node L4 LB hop, where you pay two NIC transmissions plus the LB's processing — roughly an order of magnitude more expensive (0.4 ms vs 0.05 ms, per the numbers above).
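You can measure the loopback floor yourself with a 1-byte echo ping-pong; treat the result as a sketch, since the figure varies widely with CPU, scheduler, and power state:

import socket, threading, time

def echo_server(sock):
    conn, _ = sock.accept()
    while data := conn.recv(1):
        conn.sendall(data)

srv = socket.create_server(("127.0.0.1", 0))
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()
cli = socket.create_connection(srv.getsockname())
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

N = 10_000
t0 = time.perf_counter()
for _ in range(N):  # 1-byte ping-pong: two context switches per iteration
    cli.sendall(b"x")
    cli.recv(1)
print(f"loopback RTT ~ {(time.perf_counter() - t0) / N * 1e6:.1f} us")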
Common confusions
- "Server-side discovery means the server picks the backend." No. "Server-side" refers to the load-balancer being a separate component the request flows through, not to the backend server picking. The picking still happens before the request reaches a backend — just inside the LB instead of inside the client.
- "DNS-based discovery is server-side." It is in-between. DNS resolves the name, the client opens the connection. If DNS returns a single virtual IP (an LB's IP), it is server-side. If DNS returns multiple A records and the client picks one, it is client-side discovery using DNS as the registry — exactly what Kubernetes headless Services and Cassandra clients do (see the sketch after this list).
- "Service meshes are client-side discovery." They are server-side from the application's viewpoint — the app sends to localhost, not to a backend. They behave like client-side from the network's viewpoint — there is no shared remote LB; each pod has its own. The mesh is a hybrid: server-side architecture, distributed deployment.
- "Client-side discovery is always faster." Only by the LB hop (0.3–1 ms typically). Under cache staleness or a load-balancing-policy mismatch, client-side can be much slower: a P2C client-side balancer with a stale latency-EWMA can keep sending requests to a slow pod long after a centralised LB would have noticed and shifted load.
- "Server-side discovery is always simpler." It is simpler for the application developer — but the LB is now part of your operational surface. You need to monitor, scale, upgrade, and patch it. A team without good LB-operations skills can find that they have moved complexity from the application into a place where they have less expertise.
- "You must pick one." Most production systems run both. Internal RPC: server-side via mesh or kube-proxy. Cache lookups (Redis, memcached): client-side via consistent hashing. Database writes: server-side via the database's own LB (PgBouncer, RDS Proxy). External traffic: server-side via cloud LB (NLB / ALB). The choice is per-pair, not per-cluster.
Going deeper
Netflix's path: client-side, then sidecar, then both
Netflix popularised client-side discovery in the 2010s with Eureka (registry) and Ribbon (Java client-side balancer) — every microservice in their JVM-based stack queried Eureka and balanced load itself. The architecture worked extraordinarily well at Netflix's scale (~1,000 services, almost entirely Java) and won them a 0.5–1 ms p99 advantage on internal hops. As they expanded into Node.js (for the web tier) and Python (for ML services), the per-language client cost mounted; in 2017 the team published a post on rebuilding their service-discovery as a sidecar (using envoy) that other languages could use without writing a Java port of Ribbon. The design today is hybrid: legacy Java services still use Ribbon directly; new services and other-language services go through the sidecar. The pragmatic lesson is that the architectural choice is rarely "all-or-nothing" — large companies accumulate both, and the migration plan from one to the other lasts years.
gRPC's xDS: client-side discovery, server-side policy
Modern gRPC clients (since 2020) ship with built-in xDS support. xDS — the API used by envoy — lets a control plane push membership and policy into the client library. The client opens connections directly to backends (client-side data plane), but the load-balancing policy, retry policy, and traffic-shifting decisions are made by the central control plane and pushed down. This gives you the latency win of client-side discovery and the central-policy win of server-side, at the cost of needing an xDS-aware client library in every language. Google, Lyft, and Airbnb run substantial xDS deployments; the gRPC project has reference implementations for Java, Go, C++, and (less mature) Python and Node.
The lookup-cache tax — what FreshDirect taught us in 2008
A foundational reference for service discovery is the 2008 FreshDirect post-mortem (referenced in the original Eureka design): a 90-second freshness window in their client-side discovery cache, combined with a registry restart that took 90 seconds to repopulate, produced a 7-minute outage during which 30% of internal requests routed to dead instances. The remediation that emerged from that incident — bounded freshness, jittered watch reconnects, and a "negative cache" to demote endpoints that fail repeatedly even if the registry still claims they're alive — is now standard practice in Eureka, Consul Connect, and the gRPC xDS client. The principle: never trust the registry alone; corroborate with health-check signals from the data plane.
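A sketch of that negative cache; the class name, threshold, and cooldown values are invented for illustration:

import time

class NegativeCache:
    """Locally demote endpoints that fail repeatedly, overriding the
    registry's view for a cooldown period."""
    def __init__(self, threshold=3, cooldown=15.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = {}  # endpoint -> consecutive failure count
        self.banned = {}    # endpoint -> time the local ban expires

    def report(self, ep, ok):
        if ok:
            self.failures.pop(ep, None)
            return
        self.failures[ep] = self.failures.get(ep, 0) + 1
        if self.failures[ep] >= self.threshold:
            self.banned[ep] = time.monotonic() + self.cooldown

    def filter(self, endpoints):
        now = time.monotonic()
        live = [ep for ep in endpoints if self.banned.get(ep, 0) < now]
        return live or endpoints  # never ban everything: fall back to the full list

nc = NegativeCache()
for _ in range(3):
    nc.report(("10.244.7.83", 8080), ok=False)
print(nc.filter([("10.244.7.81", 8080), ("10.244.7.83", 8080)]))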
Reproduce this on your laptop
# Compare client-side vs server-side discovery latency in ~60 lines of Python
python3 -m venv .venv && source .venv/bin/activate   # optional; the script is stdlib-only, nothing to pip install
python3 discovery_compare.py
# Expect ~0.4 ms p50 difference between client-side and server-side
# under a warm cache, with both p99s dominated by the (simulated) TCP timeout
# whenever the cache is stale relative to the registry.
Where this leads next
Once you know which side does the picking, the next question is how the picker picks. Load balancing strategies — round-robin, P2C, least-connections, weighted variants — takes up the algorithms each side runs. The data plane is the same regardless of where the picker lives — the algorithm questions are about heterogeneity, queueing, and tail.
Closely related: DNS-based discovery is the in-between mode where the resolver is the registry. Consul, etcd, ZooKeeper describes the strongly-consistent registries that back both architectures. Kubernetes services and endpoints is the canonical server-side discovery implementation in modern infrastructure — and a useful counter-example, because kube-proxy also runs on the calling node, blurring the client/server split.
References
- Chris Richardson, "Pattern: Client-side discovery" — microservices.io — the canonical pattern definition, with sequence diagrams.
- Chris Richardson, "Pattern: Server-side discovery" — microservices.io — companion pattern; lists Kubernetes, AWS ELB, and Marathon-LB as examples.
- Netflix Tech Blog, "Eureka 2.0 architecture overview" (archived) — Netflix's client-side discovery system; explains the AP design and the 90-second cache freshness window.
- Matt Klein, "Service mesh data plane vs. control plane" (2017) — the envoy author's framing of why a sidecar is the synthesis of client-side and server-side discovery.
- gRPC xDS documentation — the xDS protocol that pushes server-side policy into client-side data planes; current implementation status across languages.
- Daniel Bryant, "Service Discovery in a Microservices Architecture" — InfoQ 2017 — a balanced survey of client-side, server-side, and hybrid approaches.
- Kubernetes services and endpoints — internal companion. kube-proxy iptables mode is server-side discovery whose load-balancer happens to live on the calling node.
- DNS-based discovery — internal companion. DNS straddles the client-side / server-side line depending on whether the resolver returns a virtual IP or many backend IPs.