DNS-based discovery

It is 21:47 on a Friday. PaySetu is rolling out a new version of its payment-status service. The deploy looks clean: pods come up, readiness probes turn green, the new ELB target group hits 100% healthy. Then the on-call alert fires — settlement reconciliation is failing for 6.4% of merchants. Aditi opens the Java client logs and sees the smoking gun: the JVM clients are still dialling payments-status.internal and getting an IP that points at three pods that were terminated forty seconds ago. The Kubernetes Endpoints object is correct. The DNS authoritative answer is correct. Everything in between is wrong, and the wrongness has a name: caching.

DNS is the lowest-common-denominator answer to "where does this name live right now". It works because every TCP stack on earth knows how to resolve a hostname, and it lies under churn because at least four caches sit between the authoritative answer and the application's connect() call. The freshness of a DNS-based discovery answer is not bounded by the TTL you set; it is bounded by the longest-lived cache in the chain, which is almost never the one you control.

What "DNS-based discovery" actually means

When someone says "we use DNS for service discovery", they usually mean one of three concrete setups, and each has a different freshness story.

The first is plain A-record DNS: a hostname (recos.paysetu.internal) maps to one or more IP addresses, the resolver returns the list, and the client picks one. This is what nslookup shows you. The TTL on the record is the authoritative server's suggestion about how long the answer should be considered valid. The client has no obligation to honour that exact TTL — it can hold the answer longer (most do) or shorter (rare).
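The mechanics are visible from any language's standard library. The sketch below is a minimal illustration rather than production code: it resolves a name and picks one address the way a naive client would. recos.paysetu.internal is this article's example hostname and will only resolve on a network that defines it.

# a_record_client.py — plain A-record discovery from the client's point of view
import random
import socket

def resolve_all(name: str, port: int) -> list[str]:
    """Ask the resolver for every IPv4 address behind this name."""
    infos = socket.getaddrinfo(name, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({sockaddr[0] for *_, sockaddr in infos})

# Example hostname from this article; substitute anything your resolver can answer.
ips = resolve_all("recos.paysetu.internal", 8080)
target = random.choice(ips)  # naive client-side balancing: pick one answer at random
print(f"resolver returned {ips}; dialling {target}")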

The second is SRV records: instead of just name → IP, the record returns name → (priority, weight, port, target-hostname) tuples. SRV gives you ports and load distribution hints, which is why HashiCorp Consul exposes services as SRV records by default and why Kubernetes' headless services produce SRV records for stateful sets. The freshness story is the same as plain A — TTLs and caches still lie.
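To see those tuples yourself, a query along these lines works; it assumes the third-party dnspython package (not in the standard library) and uses a hypothetical Consul-style SRV name — substitute a record your resolver actually serves.

# srv_lookup.py — read SRV (priority, weight, port, target) tuples
import dns.resolver  # third-party: pip install dnspython

# Hypothetical Consul-style name; Consul publishes services as <name>.service.consul.
answer = dns.resolver.resolve("payments-status.service.consul", "SRV")
print(f"TTL suggested by the server: {answer.rrset.ttl}s")
for record in answer:
    print(f"priority={record.priority} weight={record.weight} "
          f"port={record.port} target={record.target}")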

The third is Kubernetes' in-cluster DNS (CoreDNS). The cluster runs a CoreDNS deployment that watches the Kubernetes API for Service and EndpointSlice objects and synthesises DNS records on the fly. When a pod's /etc/resolv.conf points to the cluster DNS service IP, every gethostbyname() call goes through CoreDNS, which gives you the freshest answer the API server has — typically within 1–2 seconds of an actual pod state change. This sounds great until you remember that CoreDNS is itself just a DNS server, with TTLs (default 5 seconds in CoreDNS) and downstream caches that don't go away.

[Figure: the full path of a DNS resolution from app code to the authoritative answer. A vertical stack of caches sits between application code and the authoritative DNS server, each labelled with its typical TTL and who controls it: application code calling gethostbyname("recos.internal") → JVM cache (networkaddress.cache.ttl, default 30 s; forever if a security manager is active — the stale-danger layer) → glibc / nscd (positive 600 s, negative 20 s by default) → systemd-resolved stub (honours the TTL, caches min(TTL, 2 hours)) → recursive resolver (CoreDNS / VPC resolver; honours the TTL, caches per record) → authoritative server, the only place that holds the truth.]
Illustrative — the application sits five caches above the truth. The TTL you set on the authoritative record is honoured by exactly one of them (the recursive resolver). Every layer above is a fresh chance for staleness. This is why a 30-second TTL change does not propagate in 30 seconds.

Why the JVM is the most common offender: the JDK ships with two DNS-cache TTL controls, networkaddress.cache.ttl and networkaddress.cache.negative.ttl. With a security manager installed, the positive TTL defaults to -1 — cache forever. Without one, it defaults to 30 seconds. Many enterprise Java applications ran with a security manager (at least until its deprecation in JDK 17) and inherited the forever-cache default without realising it. PaySetu's settlement failures were caused by exactly this: a JVM that resolved payments-status.internal to an old ELB IP at 09:02 and kept dialling that IP until the JVM restarted at 23:14 — a staleness window of over fourteen hours.

How a real DNS lookup unfolds — and where each cache decides

Trace what happens when a Python service running in a Kubernetes pod runs socket.gethostbyname("recos.paysetu.internal"). The exact path depends on how the pod's /etc/resolv.conf is configured, but the canonical flow is six hops:

  1. Application cache (the JVM's InetAddress cache, a Python urllib3 connection pool, the pooled connections of a Go http.Transport). The Python socket module does not cache, but anything sitting on top of it — requests Sessions, aiohttp.ClientSession, grpc.Channel — holds the resolved IP for the lifetime of the connection pool. A grpc.aio.insecure_channel("recos:8080") opened at boot may use the same IP for hours.
  2. glibc resolver (getaddrinfo). On most Linux distributions, this is a thin layer over /etc/resolv.conf. With nscd (Name Service Cache Daemon) running, it caches positive answers for positive-time-to-live (default 600 s, ten minutes) and negative answers for negative-time-to-live (default 20 s).
  3. systemd-resolved (if active, on Ubuntu 18.04+ and most modern distros). This adds another cache layer between glibc and the upstream resolver, with its own TTL handling (it does honour the upstream TTL, capped at 2 hours).
  4. The upstream recursive resolver — in Kubernetes, this is CoreDNS running in kube-system. CoreDNS itself caches with the cache plugin (default 30 s positive, 5 s negative for kubernetes plugin records).
  5. Authoritative answer. Inside Kubernetes, CoreDNS is the authority for *.cluster.local records and synthesises them on demand from the API server (via informers watching Service and EndpointSlice objects).
  6. The Kubernetes API server, which holds the truth in etcd. CoreDNS's watch against the API server fires when an Endpoints object changes; CoreDNS updates its in-memory representation; the next DNS query gets the new answer.

The freshness of the answer the application gets is bounded by the maximum of the TTLs across all six layers, plus the watch propagation delay between layer 5 and layer 6. In a healthy cluster, layers 5 and 6 update in well under a second. In an unhealthy cluster — control-plane storm, etcd compaction, kubelet partition — this can stretch to many seconds. The application doesn't see any of this; it just sees connection refused when it tries the old IP.
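One way to see the layering from inside a pod is to ask the same question at two points in the chain: getaddrinfo, which returns whatever the local caches hand back with the TTL stripped, and a direct query to the resolver named in /etc/resolv.conf (CoreDNS, in-cluster), whose reply still carries a counting-down TTL. The sketch below assumes the third-party dnspython package and uses this article's hypothetical service name.

# probe_layers.py — the same name, seen from two points in the cache chain
# Assumes `pip install dnspython`; the hostname is this article's example.
import socket
import dns.resolver

NAME = "recos.paysetu.internal"

# View through the local stack: app + glibc/nscd + stub-resolver caches.
# getaddrinfo hands back addresses only — the TTL is invisible at this layer.
infos = socket.getaddrinfo(NAME, 8080, type=socket.SOCK_STREAM)
print("getaddrinfo:", sorted({sockaddr[0] for *_, sockaddr in infos}))

# View at the recursive resolver from /etc/resolv.conf (CoreDNS in a pod).
# The TTL on the reply is the remaining lifetime in that resolver's cache.
answer = dns.resolver.resolve(NAME, "A")
print("resolver   :", [r.address for r in answer], f"ttl={answer.rrset.ttl}s")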

[Figure: where staleness lives — three timelines for the same event. Top: the pod's actual lifecycle; pod-X serves from t=0 to t=10s, then is terminated. Middle: the CoreDNS authoritative answer, which still includes pod-X until the endpoint update removes it at t=11s. Bottom: the JVM client view (cache.ttl=30s), which keeps pod-X in its cached answer until the TTL expires at t=42s. Staleness window: 32 seconds, during which every retry to the dead IP returns connection refused. The TTL on the authoritative record (5s) was honoured; the application-level cache (30s) was not.]
Illustrative — for a single pod death event with a 30-second JVM cache TTL above a 5-second authoritative TTL, the staleness window is 32 seconds, not 5. The longest-lived cache wins. This is why setting tight TTLs on the authoritative record alone does not solve the problem.

Code: a tiny resolver-with-TTL that exposes the staleness math

The clearest way to internalise where staleness comes from is to write a resolver that has the same shape as the real chain. The Python script below models a two-level cache (an "app cache" with TTL app_ttl above a "resolver cache" with TTL resolver_ttl) above a backing source-of-truth that updates asynchronously. Run it and watch the staleness window emerge.

# dns_staleness.py — model a two-level cache over a moving source of truth
import time, threading

# The "world" — what the authoritative answer would say if asked right now.
TRUTH = {"recos.internal": ["10.42.0.1", "10.42.0.2", "10.42.0.3"]}
TRUTH_LOCK = threading.Lock()

class CachedResolver:
    """Two-level cache: app layer above a resolver layer above the truth."""
    def __init__(self, app_ttl: float, resolver_ttl: float):
        self.app_ttl = app_ttl
        self.resolver_ttl = resolver_ttl
        self._app: dict[str, tuple[list[str], float]] = {}
        self._resolver: dict[str, tuple[list[str], float]] = {}

    def _query_authoritative(self, name: str) -> list[str]:
        with TRUTH_LOCK:
            return list(TRUTH.get(name, []))

    def _resolver_lookup(self, name: str, now: float) -> list[str]:
        cached = self._resolver.get(name)
        if cached and cached[1] > now:
            return cached[0]
        ips = self._query_authoritative(name)
        self._resolver[name] = (ips, now + self.resolver_ttl)
        return ips

    def lookup(self, name: str) -> list[str]:
        now = time.time()
        cached = self._app.get(name)
        if cached and cached[1] > now:
            return cached[0]
        ips = self._resolver_lookup(name, now)
        self._app[name] = (ips, now + self.app_ttl)
        return ips

def churn_world():
    """Replace one IP every 4 seconds, mimicking pod recycling."""
    for i in range(5):
        time.sleep(4.0)
        with TRUTH_LOCK:
            new_ip = f"10.42.0.{100 + i}"
            TRUTH["recos.internal"][0] = new_ip
            print(f"  [t={time.time()-START:.1f}s] world: rotated first IP to {new_ip}")

R = CachedResolver(app_ttl=30.0, resolver_ttl=5.0)
START = time.time()
threading.Thread(target=churn_world, daemon=True).start()

for tick in range(35):  # run past app_ttl=30s so the refresh at t=30 is visible
    ips = R.lookup("recos.internal")
    print(f"t={time.time()-START:5.1f}s  client sees: {ips}")
    time.sleep(1.0)

Sample run (truncated):

t=  0.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t=  1.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
  [t=4.0s] world: rotated first IP to 10.42.0.100
t=  4.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']    # <-- world changed, client unaware
t=  5.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
  [t=8.0s] world: rotated first IP to 10.42.0.101
t=  8.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 10.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
  [t=12.0s] world: rotated first IP to 10.42.0.102
t= 12.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 16.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
  [t=20.0s] world: rotated first IP to 10.42.0.104
t= 20.0s  client sees: ['10.42.0.1', '10.42.0.2', '10.42.0.3']
t= 30.0s  client sees: ['10.42.0.104', '10.42.0.2', '10.42.0.3']  # <-- finally refreshed at app_ttl=30s

Per-line walkthrough. The line if cached and cached[1] > now: return cached[0] is the entire mechanism — both layers do the same thing, and either layer's hit short-circuits the chain. The line self._app[name] = (ips, now + self.app_ttl) is where the longest-lived cache is born: even if the resolver layer correctly fetched a fresh answer at t=5, the app layer pinned that answer for 30 seconds. Notice that the first refresh happens at exactly t=30s — not at t=4s (when the world changed) and not at t=5s (when the resolver's TTL would have allowed a re-fetch). The staleness window is app_ttl, period. Lowering resolver_ttl to 1 second changes nothing about the application's experience; the app cache dominates.

Why this matches PaySetu's outage: when Aditi looked at the JVM's networkaddress.cache.ttl, it was set to -1 (forever-cache, the security-manager default). The CoreDNS authoritative TTL was 5 seconds. The recursive resolver TTL was 5 seconds. Lowering either of those would have done nothing — the JVM was the longest-lived cache in the chain, and lowering the upstream TTL only affects layers below the longest cache. The fix was to set -Dsun.net.inetaddr.ttl=30 on the JVM, restart, and add a deploy-time check that fails the rollout if any reachable JVM service still has a negative (cache-forever) networkaddress.cache.ttl.

Why this differs from the registry pattern: a Consul or etcd client gets a push notification (long-poll watch) when the answer changes, rather than polling on a TTL clock. The registry can deliver "the answer changed" inside one round-trip; DNS cannot, because DNS is a request-response protocol with no native push. The freshness gap between DNS-based and registry-based discovery is fundamentally about pull vs push, not about how short you can make the TTL.
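The shape of the difference fits in a few lines. The loop below is a generic long-poll sketch, not any particular registry's API — the endpoint path and parameter names are made up for illustration: the client reports the version of the answer it already holds, the server parks the request until the answer changes (or a wait window expires), and the client reacts within one round-trip instead of waiting out a TTL.

# watch_loop.py — registry-style long-poll watch, in outline
# Hypothetical endpoint and parameters; real registries (Consul, etcd) differ in detail.
import requests  # third-party HTTP client; any would do

def watch(base_url: str, service: str) -> None:
    index = 0  # version of the endpoint list we have already seen
    while True:
        # The server holds this request open until the answer moves past `index`
        # or the wait window expires — freshness costs a round-trip, not a TTL.
        resp = requests.get(
            f"{base_url}/v1/discovery/{service}",
            params={"index": index, "wait": "55s"},
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        if body["index"] != index:  # the answer actually changed
            index = body["index"]
            print("new endpoints:", body["endpoints"])

# watch("http://registry.internal:8500", "payments-status")  # hypothetical registry URL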

What this means for production deploys

The staleness math has direct, predictable consequences for how you have to structure deploys behind DNS-based discovery.

You cannot do an instant cutover. If you migrate payments.internal from ELB-A to ELB-B by flipping the DNS A record, the cutover is not instant — it is bounded by the longest-lived cache in your fleet. KapitalKite once tried a "5-minute cutover" for an order-router migration; nine hours later, 0.3% of traffic was still hitting ELB-A because of long-lived JVM connection pools holding the old IP. The fix is to keep both backends serving identical traffic for at least the maximum cache horizon (24 hours is a safe upper bound for JVM-heavy fleets), then drain ELB-A only after observing zero traffic for one full TTL beyond that.

You need pre-stop sleeps. When a Kubernetes pod is terminating, the control plane removes it from Endpoints at roughly the same time the kubelet sends SIGTERM — the removal and the signal are not ordered, and neither waits for downstream caches. CoreDNS, intermediate resolvers, and especially clients with their own caches will keep the old IP for some staleness window. If the pod stops accepting connections immediately, every client request inside that window gets connection refused. The standard mitigation is a lifecycle.preStop hook that runs sleep 30, which holds the pod alive (still serving traffic on already-open connections, accepting new ones for the first few seconds) while the endpoints update propagates. CricStream uses 45 seconds in production for everything behind the cluster's CoreDNS; anything shorter causes detectable user-visible 502s during deploys.

Long-lived gRPC channels need forced recycling. A gRPC Channel opened at boot will resolve recos:8080 once, pick one IP from the answer, and reuse that connection for hours. Even if DNS is updated and other clients see the new answer, this channel keeps using the original IP. The fix is server-side: gRPC servers can cap connection age (exposed as the grpc.max_connection_age_ms channel argument in the C-core-based APIs; 30 minutes is typical), which causes the server to send a GOAWAY frame once a connection reaches that age, forcing the client to re-resolve and reconnect. Without this, a client opened during deploy-N keeps using deploy-N's IP indefinitely, and you discover this only when one of those IPs is reused for a different service.
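A minimal sketch of that server-side knob, assuming the grpcio Python package: the option strings are C-core channel arguments (grpc.max_connection_age_ms and its grace-period companion), and the service-registration line is left as a comment because it depends on your generated stubs.

# grpc_server_recycle.py — cap connection age so clients must re-resolve periodically
from concurrent import futures
import grpc  # third-party: pip install grpcio

MAX_AGE_MS = 30 * 60 * 1000    # send GOAWAY once a connection is ~30 minutes old
GRACE_MS = 5 * 60 * 1000       # give in-flight RPCs five more minutes to finish

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=16),
    options=[
        ("grpc.max_connection_age_ms", MAX_AGE_MS),
        ("grpc.max_connection_age_grace_ms", GRACE_MS),
    ],
)
# add_PaymentsStatusServicer_to_server(PaymentsStatus(), server)  # generated registration
server.add_insecure_port("[::]:8080")
server.start()
server.wait_for_termination()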

Common confusions

Going deeper

Why JVM networkaddress.cache.ttl=-1 exists at all

The JVM's forever-cache default with a security manager is not a bug — it is a security feature. The reasoning, baked in around Java 1.4, was that a malicious DNS server could rebind a hostname to a different IP between when the security manager checks the destination (using DNS) and when the actual connection is opened (using DNS again). If the cache is forever, this race is impossible: the IP used at check time is the IP used at connect time. This is the DNS rebinding attack, and it is real (it bites browsers all the time). The cost of the defence is a discovery freshness disaster in any modern cloud-native environment, which is why every Java cloud guide on the planet starts with "first, set networkaddress.cache.ttl=30". Kubernetes' Java SDKs ship with this set by default; vanilla OpenJDK does not. Always check.

What CoreDNS actually does inside a Kubernetes cluster

CoreDNS is a plugin-chain DNS server. Inside Kubernetes it loads the kubernetes plugin, which establishes a watch against the API server's Service, Endpoints, and EndpointSlice resources via informers. When a Service is created, CoreDNS synthesises an A record <service>.<namespace>.svc.cluster.local → <cluster-IP>. When a headless Service exists (cluster-IP=None), CoreDNS synthesises one A record per backing pod IP, plus SRV records with port info. The watch latency from API-server-update to CoreDNS-cache-update is typically under 100ms; CoreDNS then serves with a short TTL (default 5s for the kubernetes plugin, configurable). The 5s TTL is what shows up as the upstream TTL in the freshness chain — it is the authoritative answer your downstream caches see.

Why DNS still wins despite all this

The freshness leak of DNS is well understood and the ecosystem has built around it. Every mature load balancer, service mesh, and gRPC client knows about MAX_CONNECTION_AGE, knows about JVM TTLs, knows to use connection recycling and warm-pool warmup. DNS wins because it is universal: every TCP stack, every language runtime, every load balancer, every cloud provider, every certificate authority understands DNS. Adopting any of the alternatives (Consul, Eureka, custom registry) means each language and each tool needs an SDK, and each SDK needs to be kept current. The total cost of DNS's freshness leak — measured in occasional deploy-window 502s and 30-second cutover delays — is, for most teams, lower than the total cost of running and operating a full registry. This is why even systems that "use Consul" usually also keep DNS as a fallback, and why service meshes on Kubernetes (Istio, Linkerd) ultimately expose everything as DNS-resolvable names even though the actual discovery happens via the API server.

Reproduce this on your laptop

# Run the resolver simulation:
python3 dns_staleness.py

# Inspect a real Kubernetes resolution path:
kubectl run dnsutils --rm -it --image=tutum/dnsutils -- bash
nslookup kubernetes.default.svc.cluster.local
dig +noall +answer +ttlid kubernetes.default.svc.cluster.local

# Check your local nscd / systemd-resolved cache state:
sudo nscd -g                          # if nscd is running
systemd-resolve --statistics          # if systemd-resolved is running (newer releases: resolvectl statistics)

# Check a JVM's DNS cache TTL right now (for a running PID):
jcmd <pid> VM.system_properties | grep -E 'networkaddress|inetaddr'   # also catches -Dsun.net.inetaddr.ttl overrides

Where this leads next

DNS gets you a name-to-IP mapping with universal compatibility and unbounded staleness. The next chapter — Consul, etcd, ZooKeeper — picks up the registry-based alternative: instead of polling on a TTL clock, clients hold a long-poll watch and get notified the moment the answer changes. That model trades universality for freshness, and adds a new dependency that itself can fail.

After that, Kubernetes Services and Endpoints shows how the platform-API model compresses the entire chain into one watch against the API server, and Discovery caching and staleness returns to the freshness math for the case when no single layer is acceptable on its own and you have to compose them.

References

  1. "DNS Specification" — RFC 1034 / RFC 1035, Mockapetris 1987 — the foundational DNS protocol; §3.6 on TTLs is what every cache in this article inherits.
  2. "Common Configuration Hazards When Using DNS for Service Discovery" — AWS architecture blog — production-grade discussion of cache layering and TTL pitfalls in EC2/ELB environments.
  3. "CoreDNS: DNS Server" — coredns.io documentation — the plugin-chain architecture, the kubernetes plugin, and how cache plugins layer on top.
  4. "Java DNS Caching Properties" — Oracle JDK Docs, networkaddress.cache.ttl — the canonical source for the JVM TTL behaviour described in §Going deeper.
  5. "DNS Rebinding Attacks" — Singh et al., 2007 — the historical security context for why JVMs default to forever-cache.
  6. Dean & Barroso, "The Tail at Scale" — CACM 2013 — §IV's discussion of replicated requests presupposes a discovery layer that can hand back multiple equivalent backends; DNS's multi-A record is the simplest implementation.
  7. Wall: calling a service requires finding it — internal companion. The wall chapter that frames why discovery is a distinct distributed-systems problem; this article picks up the DNS-specific corner of it.
  8. Idempotency keys — internal companion. Stale DNS answers cause retries against dead IPs; idempotency keys are what make those retries safe when the IP eventually does respond.