DNS-based discovery
It is 21:47 on a Friday. PaySetu is rolling out a new version of its payment-status service. The deploy looks clean: pods come up, readiness probes turn green, the new ELB target group hits 100% healthy. Then the on-call alert fires — settlement reconciliation is failing for 6.4% of merchants. Aditi opens the Java client logs and sees the smoking gun: the JVM clients are still dialling payments-status.internal and getting an answer that points at three pods that were terminated forty seconds ago. The Kubernetes Endpoints object is correct. The DNS authoritative answer is correct. Everything in between is wrong, and the wrongness has a name: caching.
DNS is the lowest-common-denominator answer to "where does this name live right now". It works because every TCP stack on earth knows how to resolve a hostname, and it lies under churn because at least four caches sit between the authoritative answer and the application's connect() call. The freshness of a DNS-based discovery answer is not bounded by the TTL you set; it is bounded by the longest-lived cache in the chain, which is almost never the one you control.
What "DNS-based discovery" actually means
When someone says "we use DNS for service discovery", they usually mean one of three concrete setups, and each has a different freshness story.
The first is plain A-record DNS: a hostname (recos.paysetu.internal) maps to one or more IP addresses, the resolver returns the list, and the client picks one. This is what nslookup shows you. The TTL on the record is the authoritative server's suggestion about how long the answer should be considered valid. The client has no obligation to honour that exact TTL — it can hold the answer longer (most do) or shorter (rare).
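In code, the plain A-record flavour is a single call. A minimal Python sketch, resolving localhost so it runs anywhere; inside a real network you would pass a name like recos.paysetu.internal:

```python
import socket

def resolve_a(name: str) -> list[str]:
    """Return the deduplicated IPv4 addresses the resolver hands back for a name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    ips: list[str] = []
    for family, type_, proto, canonname, sockaddr in infos:
        ip = sockaddr[0]          # an AF_INET sockaddr is (ip, port)
        if ip not in ips:
            ips.append(ip)
    return ips

print(resolve_a("localhost"))     # typically ['127.0.0.1']
```

Every cache discussed in this article sits behind this one call, and nothing in the return value tells you how stale the answer already is.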
The second is SRV records: instead of just name → IP, the record returns name → (priority, weight, port, target-hostname) tuples. SRV gives you ports and load distribution hints, which is why HashiCorp Consul exposes services as SRV records by default and why Kubernetes' headless services produce SRV records for stateful sets. The freshness story is the same as plain A — TTLs and caches still lie.
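SRV answers also push selection logic onto the client: sort by priority, then choose within the best priority class at random, proportionally to weight. A simplified Python sketch of that selection (the record values are invented for illustration, and the full RFC 2782 algorithm additionally special-cases weight-0 entries):

```python
import random

# (priority, weight, port, target) tuples, as an SRV query would return them.
# Values are illustrative, not from a real zone.
SRV_RECORDS = [
    (10, 60, 8080, "recos-a.paysetu.internal"),
    (10, 40, 8080, "recos-b.paysetu.internal"),
    (20, 100, 8080, "recos-fallback.paysetu.internal"),  # used only if all priority-10 fail
]

def pick_srv(records):
    """Lowest priority class first, weighted-random choice within it."""
    best_priority = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best_priority]
    total_weight = sum(r[1] for r in candidates)
    point = random.uniform(0, total_weight)
    running = 0.0
    for priority, weight, port, target in candidates:
        running += weight
        if point <= running:
            return target, port
    return candidates[-1][3], candidates[-1][2]

target, port = pick_srv(SRV_RECORDS)
print(f"dial {target}:{port}")   # recos-a or recos-b, in roughly a 60/40 split
```

The fallback record is only reachable if the client implements failover across priority classes, which is exactly the kind of logic plain A-record clients never have.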
The third is Kubernetes' in-cluster DNS (CoreDNS). The cluster runs a CoreDNS deployment that watches the Kubernetes API for Service and EndpointSlice objects and synthesises DNS records on the fly. When a pod's /etc/resolv.conf points to the cluster DNS service IP, every gethostbyname() call goes through CoreDNS, which gives you the freshest answer the API server has — typically within 1–2 seconds of an actual pod state change. This sounds great until you remember that CoreDNS is itself just a DNS server, with TTLs (default 5 seconds in CoreDNS) and downstream caches that don't go away.
Why the JVM is the most common offender: the JDK ships with two DNS-cache TTL controls, networkaddress.cache.ttl and networkaddress.cache.negative.ttl. With a security manager installed, the positive TTL defaults to -1 — cache forever. Without one, it defaults to 30 seconds. Most enterprise Java applications run with the security manager (or did, until JDK 17+) and inherit the forever-cache default without realising it. PaySetu's 14-minute settlement delay was caused by exactly this: a JVM that resolved payments-status.internal to an old ELB IP at 09:02 and kept dialling that IP until the JVM restarted at 23:14.
How a real DNS lookup unfolds — and where each cache decides
Trace what happens when a Python service running in a Kubernetes pod calls socket.gethostbyname("recos.paysetu.internal"). The exact path depends on how the pod's /etc/resolv.conf is configured, but the canonical flow is six hops:
1. Application cache (JVM, Python urllib3 connection pool, Go net.DefaultResolver). The Python socket module does not cache, but anything sitting on top of it — requests Sessions, aiohttp.ClientSession, grpc.Channel — holds the resolved IP for the lifetime of the connection pool. A grpc.aio.insecure_channel("recos:8080") opened at boot may use the same IP for hours.
2. glibc resolver (getaddrinfo). On most Linux distributions, this is a thin layer over /etc/resolv.conf. With nscd (Name Service Cache Daemon) running, it caches positive answers for positive-time-to-live (default 600 s, ten minutes) and negative answers for negative-time-to-live (default 20 s).
3. systemd-resolved (if active, on Ubuntu 18.04+ and most modern distros). This adds another cache layer between glibc and the upstream resolver, with its own TTL handling (it does honour the upstream TTL, capped at 2 hours).
4. The upstream recursive resolver — in Kubernetes, this is CoreDNS running in kube-system. CoreDNS itself caches with the cache plugin (default 30 s positive, 5 s negative for kubernetes plugin records).
5. Authoritative answer. Inside Kubernetes, CoreDNS is the authority for *.cluster.local records and synthesises them on demand from the API server (via informers watching Service and EndpointSlice objects).
6. The Kubernetes API server, which holds the truth in etcd. The etcd watch tells CoreDNS when an Endpoints object changes; CoreDNS updates its in-memory representation; the next DNS query gets the new answer.
The freshness of the answer the application gets is bounded by the longest-lived cache in the chain (strictly, chained caches can add their windows, since a layer may cache an answer that was already near-expiry at the layer below, but in practice the longest TTL dominates), plus the watch propagation delay between layer 5 and layer 6. In a healthy cluster, layers 5 and 6 update in well under a second. In an unhealthy cluster — control-plane storm, etcd compaction, kubelet partition — this can stretch to many seconds. The application doesn't see any of this; it just sees connection refused when it tries the old IP.
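The bound is simple enough to write down as arithmetic. A sketch using default TTLs quoted in this article; the max line is the practical bound when one cache dominates, and the sum line is the strict worst case, since each layer can cache an answer that was already near-expiry at the layer below:

```python
# Staleness bound for a chain of DNS caches, using defaults quoted in the text.
ttls_seconds = {
    "app cache (JVM, no security manager)": 30.0,   # networkaddress.cache.ttl
    "nscd positive cache": 600.0,                   # positive-time-to-live
    "CoreDNS kubernetes plugin": 5.0,               # record TTL
}
watch_delay = 0.1   # API server -> CoreDNS propagation in a healthy cluster

dominant = max(ttls_seconds.values())     # the longest-lived cache dominates
worst_case = sum(ttls_seconds.values())   # strict bound when expiries chain badly

print(f"dominant-cache bound: {dominant + watch_delay:.1f} s")
print(f"strict worst case:    {worst_case + watch_delay:.1f} s")
```

Swap in your fleet's real numbers; with the JVM's forever-cache default the app entry becomes unbounded and the rest of the arithmetic stops mattering.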
Code: a tiny resolver-with-TTL that exposes the staleness math
The clearest way to internalise where staleness comes from is to write a resolver that has the same shape as the real chain. The Python script below models a two-level cache (an "app cache" with TTL app_ttl above a "resolver cache" with TTL resolver_ttl) above a backing source-of-truth that updates asynchronously. Run it and watch the staleness window emerge.
# dns_staleness.py — model a two-level cache over a moving source of truth
import time, threading

# The "world" — what the authoritative answer would say if asked right now.
TRUTH = {"recos.internal": ["10.42.0.1", "10.42.0.2", "10.42.0.3"]}
TRUTH_LOCK = threading.Lock()

class CachedResolver:
    """Two-level cache: app layer above a resolver layer above the truth."""

    def __init__(self, app_ttl: float, resolver_ttl: float):
        self.app_ttl = app_ttl
        self.resolver_ttl = resolver_ttl
        self._app: dict[str, tuple[list[str], float]] = {}
        self._resolver: dict[str, tuple[list[str], float]] = {}

    def _query_authoritative(self, name: str) -> list[str]:
        with TRUTH_LOCK:
            return list(TRUTH.get(name, []))

    def _resolver_lookup(self, name: str, now: float) -> list[str]:
        cached = self._resolver.get(name)
        if cached and cached[1] > now:
            return cached[0]
        ips = self._query_authoritative(name)
        self._resolver[name] = (ips, now + self.resolver_ttl)
        return ips

    def lookup(self, name: str) -> list[str]:
        now = time.time()
        cached = self._app.get(name)
        if cached and cached[1] > now:
            return cached[0]
        ips = self._resolver_lookup(name, now)
        self._app[name] = (ips, now + self.app_ttl)
        return ips

def churn_world():
    """Replace one IP every 4 seconds, mimicking pod recycling."""
    for i in range(5):
        time.sleep(4.0)
        with TRUTH_LOCK:
            new_ip = f"10.42.0.{100 + i}"
            TRUTH["recos.internal"][0] = new_ip
        print(f"  [t={time.time()-START:.1f}s] world: rotated first IP to {new_ip}")

R = CachedResolver(app_ttl=30.0, resolver_ttl=5.0)
START = time.time()
threading.Thread(target=churn_world, daemon=True).start()

for tick in range(32):  # run past app_ttl so the t=30s refresh is visible
    ips = R.lookup("recos.internal")
    print(f"t={time.time()-START:5.1f}s client sees: {ips}")
    time.sleep(1.0)
Sample run (truncated):
t= 0.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 1.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
[t=4.0s] world: rotated first IP to 10.42.0.100
t= 4.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3'] # <-- world changed, client unaware
t= 5.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
[t=8.0s] world: rotated first IP to 10.42.0.101
t= 8.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 10.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
[t=12.0s] world: rotated first IP to 10.42.0.102
t= 12.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 16.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
[t=20.0s] world: rotated first IP to 10.42.0.104
t= 20.0s client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 30.0s client sees: ['10.42.0.104', '10.42.0.2', '10.42.0.3'] # <-- finally refreshed at app_ttl=30s
Per-line walkthrough. The line if cached and cached[1] > now: return cached[0] is the entire mechanism — both layers do the same thing, and either layer's hit short-circuits the chain. The line self._app[name] = (ips, now + self.app_ttl) is where the longest-lived cache is born: even if the resolver layer correctly fetched a fresh answer at t=5, the app layer pinned that answer for 30 seconds. Notice that the first refresh happens at exactly t=30s — not at t=4s (when the world changed) and not at t=5s (when the resolver's TTL would have allowed a re-fetch). The staleness window is app_ttl, period. Lowering resolver_ttl to 1 second changes nothing about the application's experience; the app cache dominates.
Why this matches PaySetu's outage: when Aditi looked at the JVM's networkaddress.cache.ttl, it was set to -1 (forever-cache, the security-manager default). The CoreDNS authoritative TTL was 5 seconds. The recursive resolver TTL was 5 seconds. Lowering either of those would have done nothing — the JVM was the longest-lived cache in the chain, and lowering an upstream TTL only affects the layers below the longest cache. The fix was to set -Dsun.net.inetaddr.ttl=30 on the JVM, restart, and add a deploy-time check that fails the rollout if any reachable JVM service reports a networkaddress.cache.ttl of -1.
Why this differs from the registry pattern: a Consul or etcd client gets a push notification (long-poll watch) when the answer changes, rather than polling on a TTL clock. The registry can deliver "the answer changed" inside one round-trip; DNS cannot, because DNS is a request-response protocol with no native push. The freshness gap between DNS-based and registry-based discovery is fundamentally about pull vs push, not about how short you can make the TTL.
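The pull-versus-push gap is easy to see in a toy model. The sketch below stands a threading.Event in for a registry's long-poll watch: the watcher is woken the moment the value changes, while the poller notices only at its next TTL expiry (names and timings are invented):

```python
import threading, time

class ToyRegistry:
    """One key with a change event, standing in for a Consul/etcd watch."""
    def __init__(self, value):
        self._value = value
        self._changed = threading.Event()

    def write(self, value):
        self._value = value
        self._changed.set()              # wake every watcher immediately

    def read(self):
        return self._value

    def watch(self, timeout):
        """Block until the value changes (the long-poll), then return it."""
        self._changed.wait(timeout)
        return self._value

reg = ToyRegistry("10.42.0.1")
results = {}

def poller(ttl=0.5):
    cached, fetched_at = reg.read(), time.time()
    while cached == "10.42.0.1":
        time.sleep(0.01)
        if time.time() - fetched_at >= ttl:   # TTL expiry: re-poll
            cached, fetched_at = reg.read(), time.time()
    results["poll"] = time.time()

def watcher():
    reg.watch(timeout=5.0)
    results["watch"] = time.time()

threads = [threading.Thread(target=poller), threading.Thread(target=watcher)]
for t in threads:
    t.start()
time.sleep(0.2)
change_at = time.time()
reg.write("10.42.0.99")                      # the answer changes here
for t in threads:
    t.join()
print(f"watcher noticed after {results['watch'] - change_at:.3f}s")
print(f"poller  noticed after {results['poll'] - change_at:.3f}s")
```

The watcher's latency is one wakeup; the poller's latency is whatever remains of its TTL window, no matter how fresh the registry itself is.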
What this means for production deploys
The staleness math has direct, predictable consequences for how you have to structure deploys behind DNS-based discovery.
You cannot do an instant cutover. If you migrate payments.internal from ELB-A to ELB-B by flipping the DNS A record, the cutover is not instant — it is bounded by the longest-lived cache in your fleet. KapitalKite once tried a "5-minute cutover" for an order-router migration; nine hours later, 0.3% of traffic was still hitting ELB-A because of long-lived JVM connection pools holding the old IP. The fix is to keep both backends serving identical traffic for at least the maximum cache horizon (24 hours is a safe upper bound for JVM-heavy fleets), then drain ELB-A only after observing zero traffic for one full TTL beyond that.
You need pre-stop sleeps. When a Kubernetes pod is deleted, two things start in parallel: the endpoints controller removes the pod from Endpoints, and the kubelet sends SIGTERM. There is no ordering guarantee between them, and CoreDNS, intermediate resolvers, and especially clients with their own caches will keep the old IP for some staleness window regardless. If the pod stops accepting connections the moment SIGTERM lands, every client request inside that window gets connection refused. The standard mitigation is a lifecycle.preStop exec hook that runs sleep 30, which delays SIGTERM and holds the pod alive (still serving traffic on already-open connections, accepting new ones for the first few seconds) while the endpoints update propagates. CricStream uses 45 seconds in production for everything behind the cluster's CoreDNS; anything shorter causes detectable user-visible 502s during deploys.
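In manifest form the mitigation is a few lines. A sketch, not a drop-in: the container name and the 30-second figure are illustrative, and terminationGracePeriodSeconds must cover the sleep plus graceful shutdown.

```yaml
spec:
  terminationGracePeriodSeconds: 60   # must exceed the sleep + shutdown time
  containers:
    - name: payments-status           # illustrative container name
      lifecycle:
        preStop:
          exec:
            # Keep serving while the Endpoints removal propagates through
            # CoreDNS and every downstream cache.
            command: ["sh", "-c", "sleep 30"]
```

Tune the sleep to your fleet's observed cache horizon, not to a number copied from a blog post.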
Long-lived gRPC channels need forced recycling. A gRPC Channel opened at boot will resolve recos:8080 once, pick one IP from the answer, and reuse that connection for hours. Even if DNS is updated and other clients see the new answer, this channel keeps using the original IP. The fix is server-side: gRPC servers can set grpc.max_connection_age (typical: 30 minutes), which makes the server send a GOAWAY frame once a connection reaches that age, forcing the client to re-resolve and reconnect. Without this, a client opened during deploy-N keeps using deploy-N's IP indefinitely, and you discover this only when one of those IPs is reused for a different service.
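With Python gRPC, the recycling policy is passed as server channel options. A sketch: the 30-minute figure comes from the discussion above, the grace period is an illustrative addition that lets in-flight RPCs finish before the connection closes, and the grpc.server(...) line is shown as a comment so the fragment stays self-contained:

```python
# Server-side connection recycling policy, expressed as gRPC channel options.
# After max_connection_age the server sends GOAWAY; the grace period bounds
# how long in-flight RPCs may run on the old connection.
MAX_CONNECTION_AGE_MS = 30 * 60 * 1000        # 30 minutes
MAX_CONNECTION_AGE_GRACE_MS = 5 * 60 * 1000   # illustrative grace period

SERVER_OPTIONS = [
    ("grpc.max_connection_age_ms", MAX_CONNECTION_AGE_MS),
    ("grpc.max_connection_age_grace_ms", MAX_CONNECTION_AGE_GRACE_MS),
]

# Would be applied when building the server, e.g.:
#   server = grpc.server(thread_pool, options=SERVER_OPTIONS)
print(SERVER_OPTIONS)
```

A small random jitter on the age is common in practice, so an entire client fleet does not re-resolve in the same second.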
Common confusions
- "DNS TTL is the staleness bound." False. The TTL is the staleness bound only for the resolver-layer cache. Caches above the resolver — JVM, glibc, application connection pool — have their own TTLs that ignore the authoritative record's TTL. The actual staleness is the maximum of all cache TTLs in the chain.
- "DNS is stateless and can't lie." DNS the protocol is stateless; the deployment of DNS in a real network is anything but. Every cache in the chain is mutable state. The protocol can't lie; the caches can, and routinely do.
- "Setting TTL to 0 makes DNS instant." It mostly makes you slow. Most resolvers treat TTL=0 as "do not cache at this layer", but they may still cache for an internal minimum (BIND defaults to a 1-second floor; some intermediate caches floor TTL at 5 seconds regardless). And it does nothing for application-level caches. What you get is a ten-fold increase in DNS query traffic against your authoritative server in exchange for marginal freshness improvements.
- "DNS is the same as service discovery." DNS is one layer of service discovery — the universal, lowest-common-denominator one. It maps name to IP and tells you nothing about health, version, weight, or protocol. Real service-discovery systems either layer on top of DNS (Kubernetes adds health gates and EndpointSlice readiness) or replace it (Consul, Eureka). Saying "we use DNS for service discovery" is correct; saying "DNS is service discovery" is the mental model that makes you build the wrong thing. See the wall for the broader framing.
- "CoreDNS makes Kubernetes DNS fresh." It makes it the freshest practical DNS layer — sub-second propagation from API server to CoreDNS in healthy clusters — but everything above CoreDNS still has its own caches. CoreDNS being fresh is necessary, not sufficient. The application-level caches are still the bottleneck.
- "DNS load balancing is fine for sticky sessions." It is not. DNS returns a list of IPs; the client picks one based on its own logic (often "first in the list"). Without explicit weighting and without any health-check feedback, DNS-based load balancing produces unbalanced traffic the moment any client population disagrees on its picking strategy. See client-side vs server-side discovery for why DNS-only is rarely sufficient under heterogeneous load.
Going deeper
Why JVM networkaddress.cache.ttl=-1 exists at all
The JVM's forever-cache default with a security manager is not a bug — it is a security feature. The reasoning, baked in around Java 1.4, was that a malicious DNS server could rebind a hostname to a different IP between when the security manager checks the destination (using DNS) and when the actual connection is opened (using DNS again). If the cache is forever, this race is impossible: the IP used at check time is the IP used at connect time. This is the DNS rebinding attack, and it is real (it bites browsers all the time). The cost of the defence is a discovery freshness disaster in any modern cloud-native environment, which is why every Java cloud guide on the planet starts with "first, set networkaddress.cache.ttl=30". Kubernetes' Java SDKs ship with this set by default; vanilla OpenJDK does not. Always check.
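The race the forever-cache closes is a classic time-of-check to time-of-use bug, and it fits in a few lines. A toy model (hostnames and IPs are invented): a malicious authoritative server answers with an allowed IP for the first resolution and rebinds to an internal IP for the second:

```python
class RebindingDNS:
    """Malicious authoritative server: safe first answer, then rebind."""
    def __init__(self):
        self.queries = 0

    def resolve(self, name: str) -> str:
        self.queries += 1
        # First query gets a public, allowed IP; later queries rebind
        # to an internal address behind the firewall.
        return "203.0.113.7" if self.queries == 1 else "10.0.0.1"

ALLOWED = {"203.0.113.7"}   # what a security-manager-style check permits

def fetch(dns: RebindingDNS, cache_forever: bool) -> str:
    cache: dict[str, str] = {}
    def lookup(name: str) -> str:
        if cache_forever and name in cache:
            return cache[name]
        ip = dns.resolve(name)
        cache[name] = ip
        return ip
    check_ip = lookup("evil.example")      # resolution #1: the policy check
    assert check_ip in ALLOWED, "blocked at check time"
    return lookup("evil.example")          # resolution #2: the actual connect

print(fetch(RebindingDNS(), cache_forever=False))  # 10.0.0.1: rebinding wins
print(fetch(RebindingDNS(), cache_forever=True))   # 203.0.113.7: cache closes the race
```

With the forever-cache, the IP checked is provably the IP dialled; the price, as the rest of this article shows, is unbounded discovery staleness.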
What CoreDNS actually does inside a Kubernetes cluster
CoreDNS is a plugin-chain DNS server. Inside Kubernetes it loads the kubernetes plugin, which establishes a watch against the API server's Service, Endpoints, and EndpointSlice resources via informers. When a Service is created, CoreDNS synthesises an A record <service>.<namespace>.svc.cluster.local → <cluster-IP>. When a headless Service exists (cluster-IP=None), CoreDNS synthesises one A record per backing pod IP, plus SRV records with port info. The watch latency from API-server-update to CoreDNS-cache-update is typically under 100ms; CoreDNS then serves with a short TTL (default 5s for the kubernetes plugin, configurable). The 5s TTL is what shows up as the upstream TTL in the freshness chain — it is the authoritative answer your downstream caches see.
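The synthesis rule itself is nearly a pure function from a Service object to records. A simplified model (real CoreDNS also emits SRV and PTR records, handles ports, and filters on endpoint readiness):

```python
def synthesise_a_records(service: dict) -> dict[str, list[str]]:
    """Model of the kubernetes plugin's A-record synthesis.

    ClusterIP services get one record pointing at the cluster IP;
    headless services (clusterIP=None) get one record per backing pod IP.
    """
    fqdn = f"{service['name']}.{service['namespace']}.svc.cluster.local"
    if service["clusterIP"] is not None:
        return {fqdn: [service["clusterIP"]]}
    return {fqdn: list(service["podIPs"])}   # headless: all backing pods

# A ClusterIP service...
print(synthesise_a_records(
    {"name": "recos", "namespace": "prod",
     "clusterIP": "10.96.0.20", "podIPs": []}))
# ...versus a headless one backed by three pods.
print(synthesise_a_records(
    {"name": "recos-hs", "namespace": "prod", "clusterIP": None,
     "podIPs": ["10.42.0.1", "10.42.0.2", "10.42.0.3"]}))
```

The headless case is why StatefulSet clients see pod IPs directly, and why churn in those pods shows up as churn in the DNS answer rather than being hidden behind a stable cluster IP.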
Why DNS still wins despite all this
The leak budget of DNS is well-understood, and the ecosystem has built around it. Every mature load balancer, service mesh, and gRPC client knows about MAX_CONNECTION_AGE, knows about JVM TTLs, knows to use connection recycling and warm-pool warmup. DNS wins because it is universal: every TCP stack, every language runtime, every load balancer, every cloud provider, every certificate authority understands DNS. Adopting any of the alternatives (Consul, Eureka, a custom registry) means each language and each tool needs an SDK, and each SDK needs to be kept current. The total cost of DNS's freshness leak — measured in occasional deploy-window 502s and 30-second cutover delays — is, for most teams, lower than the total cost of running and operating a full registry. This is why even systems that "use Consul" usually also keep DNS as a fallback, and why service meshes on Kubernetes (Istio, Linkerd) ultimately expose everything as DNS-resolvable names even though the actual discovery is happening via the API server.
Reproduce this on your laptop
# Run the resolver simulation:
python3 dns_staleness.py
# Inspect a real Kubernetes resolution path:
kubectl run dnsutils --rm -it --image=tutum/dnsutils -- bash
nslookup kubernetes.default.svc.cluster.local
dig +noall +answer +ttlid kubernetes.default.svc.cluster.local
# Check your local nscd / systemd-resolved cache state:
sudo nscd -g # if nscd is running
resolvectl statistics # if systemd-resolved is running (older releases: systemd-resolve --statistics)
# Check a JVM's DNS cache TTL right now (for a running PID):
jcmd <pid> VM.system_properties | grep networkaddress.cache.ttl
Where this leads next
DNS gets you a name-to-IP mapping with universal compatibility and unbounded staleness. The next chapter — Consul, etcd, ZooKeeper — picks up the registry-based alternative: instead of polling on a TTL clock, clients hold a long-poll watch and get notified the moment the answer changes. That model trades universality for freshness, and adds a new dependency that itself can fail.
After that, Kubernetes Services and Endpoints shows how the platform-API model compresses the entire chain into one watch against the API server, and Discovery caching and staleness returns to the freshness math for the case when no single layer is acceptable on its own and you have to compose them.
References
- "DNS Specification" — RFC 1034 / RFC 1035, Mockapetris 1987 — the foundational DNS protocol; §3.6 on TTLs is what every cache in this article inherits.
- "Common Configuration Hazards When Using DNS for Service Discovery" — AWS architecture blog — production-grade discussion of cache layering and TTL pitfalls in EC2/ELB environments.
- "CoreDNS: DNS Server" — coredns.io documentation — the plugin-chain architecture, the
kubernetesplugin, and how cache plugins layer on top. - "Java DNS Caching Properties" — Oracle JDK Docs, networkaddress.cache.ttl — the canonical source for the JVM TTL behaviour described in §Going deeper.
- "DNS Rebinding Attacks" — Singh et al., 2007 — the historical security context for why JVMs default to forever-cache.
- Dean & Barroso, "The Tail at Scale" — CACM 2013 — §IV's discussion of replicated requests presupposes a discovery layer that can hand back multiple equivalent backends; DNS's multi-A record is the simplest implementation.
- Wall: calling a service requires finding it — internal companion. The wall chapter that frames why discovery is a distinct distributed-systems problem; this article picks up the DNS-specific corner of it.
- Idempotency keys — internal companion. Stale DNS answers cause retries against dead IPs; idempotency keys are what make those retries safe when the IP eventually does respond.