Wall: calling a service requires finding it
It is 03:14 on a Tuesday and Karan is staring at a CricStream alert: 8% of requests to the recommendations service are returning connection refused, but only from one of the three frontend tiers. He SSHes into a healthy frontend pod and curls recos.internal:8080 — it answers. He runs the same curl from an unhealthy pod — connection refused. He then resolves recos.internal from both pods and gets different IPs. The healthy pod has resolved the DNS name to the recommendations pods that exist now; the unhealthy one is holding onto a cached A record from twelve minutes ago, pointing at three pods that have already been recycled by the autoscaler. Every chapter in Part 4 — gRPC, idempotency keys, deadlines, retries — assumed the client could open a TCP connection to the right IP. That assumption is the wall, and Karan just walked into it.
Part 4 taught you what to do once two services are talking to each other. Part 5 is about getting them on the line in the first place. In production, every callee is a moving target — pods get recycled, instances get autoscaled, AZs get drained — so an address obtained at 12:00 is statistically wrong by 12:10. Service discovery is the distributed-systems problem of mapping a stable name (recos.internal) to a freshly-correct set of addresses, and every solution to it is itself a distributed-systems problem in miniature.
What Part 4 quietly assumed
Open any chapter in Part 4 and look at the first line of the example. client = grpc.insecure_channel('payments:8080'). await session.post('http://recos.internal/v1/predict', ...). r = requests.post('https://api.paysetu.in/charge', ...). Every single call begins with a string — a hostname, a service name, a URL — and ends with the operating system magically producing an IP address and a TCP connection. The chapter then proceeds to discuss what happens after the bytes leave the box: serialisation, deadlines, retries, idempotency. The work that happens before the bytes leave — turning payments:8080 into 10.42.7.193:8080 — was treated as a single instruction.
In a single-process or single-machine world that simplification is harmless. localhost does not move. The local Postgres on port 5432 is the local Postgres. The OS /etc/hosts is a static file someone wrote once and forgot. In a distributed system on real infrastructure — Kubernetes, ECS, Nomad, plain VMs behind autoscaling groups — the simplification is a lie. The callee's address is ephemeral, often by design:
- A Kubernetes pod's IP is allocated when the pod is scheduled and reclaimed when it terminates. A typical CricStream recommendations deployment recycles pods every 4–7 hours during normal operation; during a deploy or autoscale event, every IP in the deployment can change inside 90 seconds.
- An EC2 instance behind an autoscaling group gets a new private IP each time it is launched. The 30 instances serving PaySetu's payment-status read-replicas at noon are mostly not the same 30 instances that were serving it at midnight.
- A serverless function (Lambda, Cloud Run) does not even have a stable IP from one invocation to the next; the platform may route consecutive calls to physically different machines in different racks.
This is not an accident or an inconvenience — it is the consequence of treating compute as fungible and recoverable. The same property that lets the cluster route around a failed node ("just spin up another one with a fresh IP") is what makes the address ephemeral. You cannot have both immutable infrastructure and stable addresses. You can have one or the other, and modern cloud designs picked immutable infrastructure, which means something else has to absorb the address-churn problem. That something else is service discovery.
Why the OS-level abstraction is not enough: gethostbyname() calls the resolver, the resolver caches by TTL, and TTLs in DNS are advisory at best — applications, the JVM, and host-level daemons such as nscd or systemd-resolved all layer their own caches above the upstream record. CricStream's outage above was caused by glibc's nscd cache holding an A record for 12 minutes despite the upstream record's 30-second TTL. Until you know which cache layer holds the answer, you cannot know how stale your address is, and stale addresses point at pods that no longer exist.
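To make the layering concrete, here is a minimal sketch — ours, not CricStream's code — of an application-level cache sitting above whatever the OS resolver returns. recos.internal is the chapter's placeholder name; set cache_ttl_s longer than the record's real TTL and you have rebuilt the 03:14 outage in a dozen lines.
# app_dns_cache.py — an application-layer cache above the OS resolver (sketch)
import socket, time
class CachingResolver:
    """Caches getaddrinfo answers for cache_ttl_s, ignoring the record's real TTL."""
    def __init__(self, cache_ttl_s: float):
        self.cache_ttl_s = cache_ttl_s
        self._cache = {}  # hostname -> (ips, cached_at)
    def resolve(self, host: str, port: int) -> list[str]:
        hit = self._cache.get(host)
        if hit and time.time() - hit[1] < self.cache_ttl_s:
            return hit[0]                                   # served from the app cache — possibly stale
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
        self._cache[host] = (ips, time.time())
        return ips
# With cache_ttl_s=720 (12 minutes), every pod recycled after the first lookup is
# invisible to this layer until the entry ages out — whatever TTL the record advertises.
resolver = CachingResolver(cache_ttl_s=720.0)
# resolver.resolve("recos.internal", 8080)   # placeholder name; resolves only inside the cluster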
Three layers that pretend to solve it (and where each leaks)
The industry has built three broad classes of solution to "find the callee right now". Each works inside its operating envelope and breaks outside it. Knowing where each leaks is what matters at 03:14 when an alert is firing.
Layer 1 — DNS. The simplest answer: recos.internal is a name in DNS, the resolver returns one or more A records, the client picks one and dials. This is what every Kubernetes service gives you for free; it is also what every cloud-provider load balancer gives you (internal-paysetu-payments-1234.ap-south-1.elb.amazonaws.com). DNS is universal, language-agnostic, and built into every TCP stack on earth. It leaks in three places: TTLs are advisory (the resolver may hold the record longer), the client typically caches above the resolver (JVM networkaddress.cache.ttl defaults to 30 s, sometimes infinite), and the answer is opaque — you get IPs, not health, not load, not version, not zone.
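What Layer 1 looks like from the client's side, as a minimal sketch (recos.internal again standing in for a real name): ask for A records, pick one, dial. Everything the paragraph above lists as missing — health, load, version, zone — is simply absent from the answer.
# layer1_dns.py — resolve, pick, dial: the whole of DNS-based discovery (sketch)
import random, socket
def resolve_and_dial(host: str, port: int, timeout_s: float = 1.0) -> socket.socket:
    # One lookup; any cache between here and the authoritative server decides freshness.
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    addrs = [(info[4][0], info[4][1]) for info in infos]   # just (ip, port) pairs — no health, no zone
    return socket.create_connection(random.choice(addrs), timeout=timeout_s)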
Layer 2 — a registry: Consul, etcd, ZooKeeper. Each service instance, when it starts, registers itself in a key-value store under a known path (/services/recos/<instance-id>) with a session that auto-expires if the instance dies. Clients query the registry, get a list of currently-registered healthy instances, and dial. This is what Netflix's Eureka, HashiCorp's Consul, and Apache ZooKeeper do. It leaks at the registry boundary: the registry is itself a distributed system that can have its own partitions, the session-expiry timing means stale entries can persist for the lease duration after a hard crash, and now you have a new dependency that must be running before any service can discover any other service (the chicken-and-egg of registry bootstrap).
Layer 3 — the platform's API: Kubernetes Endpoints, AWS Cloud Map, Nomad. The orchestrator already knows where every pod / task / instance lives because it scheduled them. It exposes that knowledge as a queryable, watchable API: kubectl get endpoints recos. A Service resource is a stable name; the matching Endpoints (or EndpointSlice) object is a list of currently-ready pod IPs that the kubelet keeps fresh as pods come and go. Clients (or, more usually, a sidecar / kube-proxy) watch the API and update their local view in milliseconds. This is the model Kubernetes won with, and it is what every modern service mesh (Istio, Linkerd, Consul Connect) builds on. It leaks at the watch boundary: the watcher's local cache can be seconds behind the API server during high churn, the API server itself depends on etcd quorum, and a partition between the kubelet and the API server can leave stale endpoints that point at pods which have terminated.
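A minimal sketch of watching that API from Python — assuming the official kubernetes client library (pip install kubernetes), a kubeconfig that can see the cluster, and a Service named recos in the default namespace; the informer-style caching and resync logic a production client adds is omitted.
# watch_endpoints.py — Layer 3: watch the platform's view of recos (sketch)
from kubernetes import client, config, watch
config.load_kube_config()                      # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()
w = watch.Watch()
for event in w.stream(v1.list_namespaced_endpoints,
                      namespace="default",
                      field_selector="metadata.name=recos",
                      timeout_seconds=120):
    ep = event["object"]
    ready = [addr.ip
             for subset in (ep.subsets or [])
             for addr in (subset.addresses or [])]
    # Each event is the API server's current belief, not ground truth: it can lag
    # the kubelet by anything from milliseconds to seconds under churn.
    print(event["type"], sorted(ready))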
Why no single layer is sufficient: each layer's failure mode is correlated with the things you most need it for. DNS works fine in steady state and lies hardest during the deploy that changed all the IPs. The registry works fine until the registry itself is the partitioned component (this is what let Roblox's October 2021 Consul outage cascade from the discovery layer into application-level failures). The platform API works fine until the API server is overloaded by the same control-plane storm that is causing the application incident. Production systems typically layer at least two of the three — for example, Kubernetes Endpoints for in-cluster discovery, plus Consul as a fallback for cross-cluster — so that the failure of one does not blind the client.
Code: a tiny registry-with-leases that exposes the freshness problem
The cleanest way to feel the wall is to build the smallest possible registry yourself and watch where freshness breaks. Below is a 50-line in-process registry: services register with a TTL, the registry expires entries whose lease was not renewed, and a client querying mid-expiry sees the inconsistency.
# tiny_registry.py — a 50-line model of "find the callee now"
import threading, time
from collections import defaultdict
class Registry:
"""Thread-safe service registry with TTL-based liveness."""
def __init__(self):
self._instances = defaultdict(dict) # service -> {instance_id: (addr, expires_at)}
self._lock = threading.Lock()
def register(self, service: str, instance_id: str, addr: str, ttl_s: float):
with self._lock:
self._instances[service][instance_id] = (addr, time.time() + ttl_s)
def renew(self, service: str, instance_id: str, ttl_s: float):
with self._lock:
if instance_id in self._instances[service]:
addr, _ = self._instances[service][instance_id]
self._instances[service][instance_id] = (addr, time.time() + ttl_s)
def query(self, service: str) -> list[str]:
now = time.time()
with self._lock:
return [addr for (addr, exp) in self._instances[service].values() if exp > now]
def reap(self):
"""Remove expired entries — the only place stale data goes to die."""
now = time.time()
with self._lock:
for service, insts in list(self._instances.items()):
self._instances[service] = {k: v for k, v in insts.items() if v[1] > now}
REG = Registry()
def instance(name: str, addr: str, lifetime_s: float, renew_every_s: float):
"""Simulate a pod that registers, renews, and eventually dies."""
inst_id = f"{name}-{addr.split(':')[-1]}"
REG.register(name, inst_id, addr, ttl_s=2.0)
deadline = time.time() + lifetime_s
while time.time() < deadline:
time.sleep(renew_every_s)
REG.renew(name, inst_id, ttl_s=2.0)
# death — no cleanup, just stop renewing. Lease times out in <=2s.
def client_loop():
for t in range(8):
live = REG.query("recos")
REG.reap()
print(f"t={t}s recos has {len(live)} live instances: {sorted(live)}")
time.sleep(1.0)
# Three pods register; pod-2 dies at t=3s (stops renewing).
threading.Thread(target=instance, args=("recos", "10.42.0.1:8080", 100.0, 0.5), daemon=True).start()
threading.Thread(target=instance, args=("recos", "10.42.0.2:8080", 3.0, 0.5), daemon=True).start()
threading.Thread(target=instance, args=("recos", "10.42.0.3:8080", 100.0, 0.5), daemon=True).start()
time.sleep(0.1) # let registrations settle
client_loop()
Sample run:
t=0s recos has 3 live instances: ['10.42.0.1:8080', '10.42.0.2:8080', '10.42.0.3:8080']
t=1s recos has 3 live instances: ['10.42.0.1:8080', '10.42.0.2:8080', '10.42.0.3:8080']
t=2s recos has 3 live instances: ['10.42.0.1:8080', '10.42.0.2:8080', '10.42.0.3:8080']
t=3s recos has 3 live instances: ['10.42.0.1:8080', '10.42.0.2:8080', '10.42.0.3:8080']
t=4s recos has 3 live instances: ['10.42.0.1:8080', '10.42.0.2:8080', '10.42.0.3:8080']
t=5s recos has 2 live instances: ['10.42.0.1:8080', '10.42.0.3:8080']
t=6s recos has 2 live instances: ['10.42.0.1:8080', '10.42.0.3:8080']
t=7s recos has 2 live instances: ['10.42.0.1:8080', '10.42.0.3:8080']
Walkthrough. The line self._instances[service][instance_id] = (addr, time.time() + ttl_s) is the heart of the model: the registry never trusts that an instance is alive — it only trusts the lease. The line return [addr for (addr, exp) in ... if exp > now] is where the freshness window lives: pod-2 stopped renewing at t=3 s but stayed in the result set until its lease expired at t=5 s. That 2-second lag between actual death and observed death is not a bug in the implementation — it is the lease duration, and shrinking it costs renewal traffic (a fleet of 3 000 pods with ttl=200ms would generate 15 000 renewals/sec). The line # death — no cleanup, just stop renewing is the realistic failure mode: a crashing pod does not get to send a goodbye. The registry must infer death from absence, and absence is only meaningful relative to a TTL clock.
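The renewal-traffic arithmetic from that trade-off, as a back-of-envelope helper; the renew-once-per-lease assumption is ours, and real clients usually renew two or three times per lease, which only makes the number bigger.
# How much steady-state load a lease duration buys you (sketch).
def renewal_qps(fleet_size: int, ttl_s: float, renews_per_ttl: int = 1) -> float:
    """Registry writes per second just to keep every instance's lease alive."""
    return fleet_size * renews_per_ttl / ttl_s
print(renewal_qps(3_000, 0.2))    # 15000.0 — the 200 ms lease from the walkthrough
print(renewal_qps(3_000, 30.0))   # 100.0  — a 30 s lease: cheap, but death is seen up to 30 s late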
Why this is different from idempotency keys: with idempotency keys the receiver is the source of truth, and the client retries until the receiver has stored the result. Here the registry is not the source of truth about liveness — the pod is. The registry can only model the pod's liveness through the lease, and the model is always lagging. Tightening the TTL reduces lag at the cost of renewal load, and it increases the rate of false-positive expirations during a brief network blip on the renewer's side. There is no TTL setting that gives both fast detection and zero false positives; this is the same impossibility frontier that failure detection rests on, surfacing here in the discovery layer.
Why this leaks back into Part 4: the deadlines, retries, and idempotency you learned in Part 4 only work if your retry sends the request to a pod that exists. If your discovery layer hands you the IP of a pod that died 1.4 seconds ago and is no longer listening, the deadline and retry budget are both being spent on connection refused errors that no application-layer mechanism can fix. Discovery freshness is a precondition for Part 4's mechanisms to behave the way Part 4 said they would.
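A minimal sketch of that precondition in code, reusing REG from the registry above: re-query discovery before every attempt instead of retrying the address that just refused the connection. The raw-socket HTTP call is a stand-in for whatever client Part 4 gave you.
# retry_with_rediscovery.py — spend the retry budget on fresh addresses (sketch)
import random, socket
def call_with_rediscovery(service: str, path: str, attempts: int = 3,
                          timeout_s: float = 0.5) -> bytes:
    last_err: Exception = RuntimeError(f"no live instances of {service}")
    for _ in range(attempts):
        live = REG.query(service)                  # a fresh discovery answer per attempt
        if not live:
            raise RuntimeError(f"no live instances of {service}")
        host, port = random.choice(live).rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=timeout_s) as sock:
                sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
                return sock.recv(65536)
        except OSError as err:                     # connection refused / timed out: rediscover, retry
            last_err = err
    raise last_err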
Where the wall actually shows up at 3 a.m.
A handful of incident patterns recur across every distributed system that does not take discovery seriously. Each is a symptom of the same wall.
The "blast radius of a deploy" pattern. A team rolls out a new version of the recommendations service. Half the old pods terminate before the new ones are ready. The DNS records show 5 IPs healthy for 90 seconds while only 3 are actually serving traffic; clients keep dialling the dead 2, get connection refused, and retry. The retry storm spikes load on the surviving 3, which start failing readiness probes, which gets them removed from the endpoints — and now you are in the cascade. CricStream lost 4 minutes of reco traffic on a Saturday night for exactly this pattern; the fix was not "make DNS faster", it was wiring preStop: sleep 30 into the pod spec so the pod stops accepting new connections, drains in-flight ones, then terminates after the endpoints have been updated.
The "DNS TTL lie" pattern. A team migrates the payments service from one ELB to another and updates the DNS record with a 30-second TTL. JVM clients, with networkaddress.cache.ttl=-1 (the JDK default for security-manager-on environments), continue to send traffic to the old ELB for 17 minutes after the cutover. The old ELB is still healthy because its targets are still healthy — they just shouldn't be receiving production traffic. PaySetu had a 14-minute settlement-job delay during a cross-region cutover for this; the fix was a tooling check that fails the deploy if any reachable JVM service has a non-zero networkaddress.cache.ttl.
The "registry partition" pattern. A Consul cluster loses quorum after an etcd snapshot grows past 8 GB and restoration takes longer than the heartbeat-failure window. Every service in the mesh starts seeing stale or empty discovery results. Cascading failures because clients fall back to whatever cached list they last saw — including pods that have since terminated. This is what bit Twilio in 2021 (a Consul control-plane outage that propagated as application-level failures across most of their fleet); the lesson is that the discovery layer must degrade gracefully, ideally with a stale-but-positive fallback rather than empty results that leave clients with no addresses at all.
Common confusions
- "DNS is service discovery." DNS is one layer of service discovery, and a leaky one. It maps names to IPs but tells you nothing about health, version, or load, and its caching layers make freshness unreliable under churn. Production systems built on DNS-only discovery have all the freshness limitations of DNS without realising they have built a discovery system at all.
- "Once I have the IP, the rest is just Part 4." True in steady state, false during churn. The IP you got might be 800 ms old, which is fine for an idempotent read but devastating for a connection-pooled long-lived gRPC channel that will keep trying that IP for the next 30 seconds. The freshness of your discovery answer interacts with how long your connection pool holds onto it; long-lived pools amplify staleness. See client-side vs server-side discovery for how the choice of discovery placement changes this.
- "Just use a load balancer." A load balancer is one implementation of server-side service discovery — the LB itself has to discover its backends from somewhere (usually the platform API or a registry). You have not removed the discovery problem; you have moved it from the client to the LB. The LB's discovery layer can fail, get stale, or partition exactly the same way the client's would have. What an LB does give you is centralisation of the discovery decision, which can be a benefit (one place to debug) or a drawback (one place that can fail).
- "The platform API (Kubernetes Endpoints) is always fresh." It is the freshest practical answer, but "always" is wrong. The kubelet → API server → etcd path can lag by hundreds of milliseconds during high churn, the API server can be slow under control-plane load, and the watcher inside your client (informer cache, kube-proxy iptables rules) adds its own propagation delay. Empirically, a Kubernetes Endpoints update is visible to a client between 50 ms and 2 s after the actual pod state change, and the tail can be much worse during cluster events.
- "Service mesh removes the need to think about discovery." A service mesh (Istio, Linkerd) puts the discovery into a sidecar proxy that watches the platform API on behalf of the application. The application no longer has to do discovery, but the mesh's discovery layer is still subject to the same freshness, partition, and bootstrap problems. The wall has moved one process to the left; it has not been demolished. When Istio's Pilot has problems, the entire mesh has problems, often without the application realising why.
- "Cache TTLs are always safer when shorter." Shorter TTLs reduce staleness but increase load on the discovery layer (DNS server, registry, API server). At extreme scale this matters: Netflix's Eureka famously increased the cache TTL during peak hours because the registry traffic itself was becoming a bottleneck. The right TTL is "short enough that staleness is bearable, long enough that you do not melt your registry under steady-state queries", and that intersection is workload-specific.
Going deeper
The CAP framing of service discovery
A service registry is itself a distributed system, and the CAP theorem bites it the same way it bites every other one. ZooKeeper and etcd are CP systems: under partition, the minority side refuses to serve queries, which means a partitioned client may get no answer rather than a stale one. Eureka and Consul (in some modes) lean AP: under partition, both sides keep serving stale-but-positive answers, which keeps applications running but means a client may dial a pod that the platform considers gone. There is no right answer in the abstract — fintech systems usually want CP discovery (better to refuse a charge than dial a pod that has already been recycled into a different service's instance), while best-effort recommendation systems usually want AP discovery (better to answer with stale recommendations than 503 the user). Picking is part of the service contract, not a default.
Why the bootstrap problem is permanent
Every discovery solution has to answer the question "how does the client find the registry?", and that question recurses unless you stop it somewhere. In practice it stops at one of three places: a hard-coded IP list (the registry's seed nodes are written into config), a DNS name with very high TTL and very stable IPs (the registry's load balancer hostname), or a multicast / mDNS local-network probe. Each of these is its own distributed-systems problem. The bootstrap problem is permanent because turtles-all-the-way-down does not actually work — somewhere a human or a piece of infra-as-code has to inject a fixed point. This is why Kubernetes ships with a hard-coded kubernetes.default.svc for the API server inside the cluster, and why every Consul deployment has a retry_join list with explicit IPs.
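The fixed point, as a minimal sketch; the seed addresses and hostname below are illustrative placeholders, not real infrastructure. Hard-coded seeds are tried first, then a stable DNS name, and the failure is loud rather than recursive.
# find_registry.py — the bootstrap fixed point: seeds first, then a stable name (sketch)
import socket
REGISTRY_SEEDS = ["10.0.0.11:8500", "10.0.0.12:8500", "10.0.0.13:8500"]  # written by infra-as-code
REGISTRY_NAME = "registry.internal"                                      # high-TTL, very stable name
def find_registry(port: int = 8500, timeout_s: float = 0.5) -> str:
    candidates = list(REGISTRY_SEEDS)
    try:
        infos = socket.getaddrinfo(REGISTRY_NAME, port, proto=socket.IPPROTO_TCP)
        candidates += [f"{info[4][0]}:{port}" for info in infos]
    except socket.gaierror:
        pass                                   # DNS itself may be down; the seeds are the fixed point
    for candidate in candidates:
        host, p = candidate.rsplit(":", 1)
        try:
            socket.create_connection((host, int(p)), timeout=timeout_s).close()
            return candidate
        except OSError:
            continue
    raise RuntimeError("no registry reachable via seeds or DNS — bootstrap has no further turtle")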
Discovery freshness vs connection pooling — the long tail
Long-lived connections (HTTP/2 channels, gRPC streams, MySQL connection pools) tend to outlive the freshness of the discovery answer that opened them. A gRPC channel opened to recos:8080 at t=0 with a 30-minute keep-alive will keep using that single TCP connection long after the underlying pod has been recycled — if the pod terminated gracefully and closed the connection cleanly, the client retries; if the pod was killed hard or its node went away, the client sits on a dead connection until TCP keep-alive notices, and the Linux keep-alive timer defaults to 2 hours (assuming keep-alive is enabled at all). Production systems mitigate this with MAX_CONNECTION_AGE (server-side gRPC closes any connection older than 30 min, forcing the client to re-resolve) and aggressive TCP keep-alives, but the underlying tension is real: connection pooling and discovery freshness are in opposition, and you need both. KapitalKite's order-router enforces a 5-minute hard recycle on every gRPC channel for this exact reason.
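What those mitigations look like as gRPC channel arguments in Python — a sketch of the knobs rather than KapitalKite's actual configuration; the option names are standard gRPC channel args, the values are illustrative.
# grpc_connection_age.py — cap connection lifetime so clients must re-resolve (sketch)
from concurrent import futures
import grpc
# Server side: close any connection older than 30 minutes, with a grace period for
# in-flight RPCs, forcing well-behaved clients back through discovery.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=8),
    options=[
        ("grpc.max_connection_age_ms", 30 * 60 * 1000),
        ("grpc.max_connection_age_grace_ms", 10 * 1000),
    ],
)
# Client side: aggressive keep-alives so a hard-killed pod is noticed in seconds,
# not whenever the kernel's two-hour keep-alive timer finally fires.
channel = grpc.insecure_channel(
    "recos.internal:8080",                     # placeholder target from the chapter
    options=[
        ("grpc.keepalive_time_ms", 10 * 1000),
        ("grpc.keepalive_timeout_ms", 2 * 1000),
    ],
)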
Reproduce this on your laptop
# Run the tiny registry from above:
python3 -m venv .venv && source .venv/bin/activate
python3 tiny_registry.py
# Watch a real Kubernetes Endpoints object update during a deploy:
kubectl get endpoints recos -w
# In another terminal:
kubectl rollout restart deployment/recos
# Watch the Endpoints list change in real time as pods drain and new ones become ready.
# See your JVM's DNS cache TTL right now:
jcmd <pid> VM.system_properties | grep networkaddress.cache.ttl
Where this leads next
Part 5 — SERVICE DISCOVERY — picks up here. The chapters ahead cover each of the three layers in working detail and the trade-offs between them:
- DNS-based discovery — the universal default, what its caching layers actually do, and the failure modes that come with TTL-based freshness.
- Consul, etcd, ZooKeeper — the registry pattern, session leases, watch APIs, and why each picked a different consistency model.
- Kubernetes Services and Endpoints — how the platform API model works, EndpointSlice scaling beyond 1000 pods, and what kube-proxy actually does to your iptables.
- Client-side vs server-side discovery — where the discovery decision lives changes the failure modes.
- Discovery caching and staleness — how to size cache TTLs and what to do when they lie.
After Part 5 ends with another wall — having addresses is necessary but not sufficient; once you have ten of them, which one do you call? — Part 6 picks up the load-balancing question.
References
- DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store" (SOSP 2007) — the canonical paper that introduced consistent hashing into the discovery / partitioning conversation; §4.8 on membership and failure detection is what every modern registry inherited.
- "Eureka at Cloud Scale" — Netflix Tech Blog, 2012 — Netflix's open-source registry, the canonical AP discovery system, with discussion of why they accept stale-but-positive answers under partition.
- Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems" (OSDI 2006) — Google's foundational lease-based discovery / coordination service; the design choices in §2 are the template every CP registry follows.
- "Kubernetes Service and EndpointSlice Documentation" — kubernetes.io — the platform-API model in working detail, including the EndpointSlice scaling story for clusters past 1 000 pods per service.
- "Consul Service Discovery and Configuration" — HashiCorp — Consul's design notes on session-based liveness and the gossip protocol underneath the registry layer.
- "The Tail at Scale" — Dean & Barroso, CACM 2013 — although primarily about request-level tail latency, §IV's discussion of replicated requests presupposes a discovery layer that can hand back multiple equivalent replicas, making it a foundational read for why discovery and load balancing are intertwined.
- Idempotency keys — internal companion. Discovery hands you an address; idempotency keys make the call you make to that address safe to retry when the address turns out to have been stale.
- Wall: distributed is a failure-first design — internal companion. The same failure-first lens, applied earlier in the curriculum, frames why the discovery layer's freshness is a probability rather than a guarantee.