Discovery caching and staleness

PaySetu's payments-status service has 240 backend pods. The front-end pods do not query the registry on every request — that would multiply read load by the request rate and turn etcd into the slowest part of the call. Instead, every caller holds an in-memory copy of the membership list, refreshed by a long-poll watch. The cache is the request path. The registry is not. And then on a Tuesday afternoon a node-pool autoscaler drains 14 pods over 90 seconds, the watches on 4,000 callers fall behind by 22 seconds because the registry's CPU saturated under the broadcast storm, and 9% of payment-status calls land on TCP-RST or 2-second connect timeouts. Nobody changed any code. The discovery cache decided to be wrong for 22 seconds.

A discovery cache is a pre-computed answer to "where does this service live?" that the caller trusts until the next refresh — and during the gap between truth and refresh, every routing decision is wrong by the size of the gap. The two knobs are TTL (how long until you must re-ask) and watch lag (how long the registry takes to push you a change). Whichever of the two refresh paths makes progress first bounds your staleness; under load, watch lag is the one that balloons. The mitigations are bounded freshness, jittered reconnects, and a negative cache that lets the data plane veto the registry when observed request outcomes disagree with it.

What the cache stores and what it gets wrong

A discovery cache is not the same shape as a database row cache. It holds a set — the current members of a service — keyed by service name, with each member tagged by IP, port, weight, status (healthy / draining / failed), and metadata (zone, version, capacity hints). The reads are point-lookups by service name; the writes are full-set replacements driven by registry events. Most of the time the set does not change; when it does, the change is usually small — one pod up, one pod down, one weight adjustment.
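
A minimal sketch of that shape (field names are illustrative, not any particular library's API):

# member_shape.py — illustrative shape of a discovery-cache entry.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    HEALTHY = "healthy"
    DRAINING = "draining"
    FAILED = "failed"

@dataclass(frozen=True)
class Member:
    ip: str
    port: int
    weight: float = 1.0
    status: Status = Status.HEALTHY
    zone: str = ""        # metadata: zone, version, capacity hints
    version: str = ""

# Reads are point-lookups by service name; writes are full-set replacements.
cache: dict[str, frozenset[Member]] = {}
cache["payments-status"] = frozenset({
    Member("10.244.7.117", 8080),
    Member("10.244.7.118", 8080, weight=2.5),
})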

The thing the cache gets wrong is not "the value is old". It is "the membership has changed since the last refresh". There are exactly four ways the cache and the registry can disagree, and every one of them produces a different production failure:

  1. Phantom member — the cache lists a pod that no longer exists. Picking it sends the request to a TCP-RST (kernel says "no listener") or a connect timeout (SYNs silently blackholed once the pod and its IP are recycled). Worst-case: a 2-second timeout per stuck request.

  2. Missing member — a new healthy pod has joined the registry but is not in the cache yet. Load lands on the old set, which is now under-provisioned. The new pod sits idle until the next refresh — wasting both capacity and the ramp-up money.

  3. Wrong status — the cache says a pod is healthy but the registry has marked it draining (or vice versa). Picking it works at the TCP level but the pod itself is rejecting new requests with a 503 ("draining") or accepting them and then dropping mid-flight when the SIGTERM grace period expires.

  4. Wrong weight / metadata — the registry says a pod was scaled up (CPU bumped, weight raised from 1.0 to 2.5); the cache still has 1.0. Load distribution is now skewed: the bumped pod is under-utilised, smaller pods are over-utilised by the LB policy that consults the stale weights.

Four ways a discovery cache disagrees with the registry, and the failure mode for each — [figure: registry truth at time T (pod-117 healthy w=1.0, pod-118 healthy w=2.5, pod-201 new, pod-119 draining, pod-120 healthy) beside the stale cache view at T − 22 s, highlighting the phantom (pod-090), the missing member (pod-201), the wrong status (pod-119), and the wrong weight (pod-118 at 1.0 instead of 2.5), with a table mapping each disagreement to its production symptom]
Illustrative — the four canonical disagreements between a registry's truth and a stale cache, with the production symptom each one creates. Numbers (epoch=4319, 22 s lag) are placeholder values typical of a Eureka-class registry under autoscaler-driven churn.
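
The four disagreements are mechanical enough to compute: diff the registry's truth against the cache view and classify every difference. A sketch reproducing the figure's scenario (the pod names and the helper are hypothetical):

# diff_membership.py — classify the four ways a cache view disagrees with truth.
def classify_disagreements(truth: dict, cached: dict):
    """Both args map (name, port) -> {"status": ..., "weight": ...}."""
    findings = []
    for key in cached.keys() - truth.keys():
        findings.append((key, "phantom: cache lists a pod that no longer exists"))
    for key in truth.keys() - cached.keys():
        findings.append((key, "missing: new pod not yet in the cache"))
    for key in truth.keys() & cached.keys():
        if truth[key]["status"] != cached[key]["status"]:
            findings.append((key, f"wrong status: registry={truth[key]['status']} cache={cached[key]['status']}"))
        elif truth[key]["weight"] != cached[key]["weight"]:
            findings.append((key, f"wrong weight: registry={truth[key]['weight']} cache={cached[key]['weight']}"))
    return findings

truth = {
    ("pod-117", 8080): {"status": "healthy", "weight": 1.0},
    ("pod-118", 8080): {"status": "healthy", "weight": 2.5},
    ("pod-201", 8080): {"status": "healthy", "weight": 1.0},
    ("pod-119", 8080): {"status": "draining", "weight": 1.0},
}
cached = {
    ("pod-117", 8080): {"status": "healthy", "weight": 1.0},
    ("pod-118", 8080): {"status": "healthy", "weight": 1.0},
    ("pod-119", 8080): {"status": "healthy", "weight": 1.0},
    ("pod-090", 8080): {"status": "healthy", "weight": 1.0},
}
for key, finding in classify_disagreements(truth, cached):
    print(key[0], "->", finding)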

Why "the cache is N seconds old" understates the problem: a single number conflates two different kinds of staleness. Refresh staleness is bounded by your TTL — the cache will refresh within ttl seconds. Event-propagation staleness is bounded by your watch lag — the time between a registry change and the watch event landing in the client. Under registry overload, watch lag can balloon to 30–60 s while TTL stays at 30 s, because the watch is a long-running TCP stream that backed up rather than a periodic poll that completed. The two clocks are not the same; assuming they are is how you build a system that "should be at most 30 s stale" and is then 90 s stale on the day you need it most.

TTL, watch lag, and the freshness budget

Cache freshness is the upper bound on how old your view of membership can be while you still consider it usable. There are two mechanisms a discovery library uses to refresh, and they fail differently:

Pull (TTL-driven). The client polls the registry every T_pull seconds and replaces its cache wholesale. Worst-case staleness is T_pull — but only if every pull succeeds. If a pull fails (registry overload, transient network), the previous data ages further. Kubernetes node-local kube-proxy used to work this way before informers replaced it. Eureka's classic client polled every 30 s.

Push (watch-driven). The client opens a long-poll or streaming connection (HTTP/2 watch on etcd, gRPC stream on Consul, server-sent-events on Eureka 2.0) and the registry pushes events. Best-case staleness is the network RTT plus the registry's notify latency — typically 5–50 ms. Worst-case is far worse: if the watch falls behind (too many events queued, registry CPU saturated, client GC pause), the client may be 30+ seconds behind without knowing it.

Real systems combine the two: a watch for low-latency updates, a periodic pull as a safety net to catch missed events. The aggregate freshness rule is the one a careful operator writes down explicitly:

effective_cache_age = now - max(last_successful_pull_ts, last_watch_event_ts)
                    = min(time_since_last_successful_pull, time_since_last_watch_event)

If neither side has made progress in freshness_budget seconds — the effective age exceeds the budget — the cache is considered expired and the safest action is to refuse to use it. Refusing has two flavours (both sketched below): hard-fail (return an error to the application — "no fresh endpoints, retry") or fall back to a stale-but-better-than-nothing path (use the cache anyway but mark every chosen endpoint as low-confidence, so a single failure causes immediate eviction rather than retry-with-the-same-pod).
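
A sketch of both flavours, assuming the CacheEntry shape used in the demo below (NoFreshEndpoints and the confidence tags are hypothetical names):

# fresh_or_fail.py — the two refusal flavours when the budget is blown.
import time

class NoFreshEndpoints(Exception):
    pass

def pick_endpoints(entry, budget_s, hard_fail=True):
    age = time.time() - max(entry.last_pull_ts, entry.last_watch_event_ts)
    if age <= budget_s:
        return entry.endpoints, "fresh"
    if hard_fail:
        # Flavour 1: hard-fail — "no fresh endpoints, retry".
        raise NoFreshEndpoints(f"cache is {age:.1f}s old, budget is {budget_s}s")
    # Flavour 2: stale-but-better-than-nothing — serve, but tag low-confidence
    # so one observed failure evicts the endpoint instead of retrying it.
    return entry.endpoints, "low-confidence"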

PaySetu's payments-status SDK uses a freshness budget of 30 s. CricStream's encode-to-edge segment shipper uses 5 s, because they ship 4 segments per stream per second and a 22 s gap is 88 segments lost. KapitalKite's order-router uses 2 s because the cost of routing a market-buy order to a stale price-feed pod is regulatory exposure, not just a timeout. The budget is set by the cost-of-being-wrong, not by some universal default.

# discovery_cache.py — a fresh-or-fail discovery cache with TTL, watch lag, and a freshness budget.
# Demonstrates the freshness rule: the cache's effective age is the smaller of the
# two clock ages — either path making progress keeps it fresh — so the cache only
# expires once both paths stall, as they do here when the registry overloads.
import asyncio, random, time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    endpoints: list                       # current set of (ip, port)
    last_pull_ts: float = 0.0             # last successful TTL refresh
    last_watch_event_ts: float = 0.0      # last membership event from watch
    consecutive_pull_fails: int = 0

@dataclass
class Registry:
    truth: list = field(default_factory=list)
    notify_latency_ms: float = 8.0        # push delay, normal regime
    overloaded: bool = False              # if True, push delay balloons

    async def long_poll_event(self):
        # Snapshot first, then simulate delivery latency. When overloaded, queued
        # watch events take far longer than the demo window to drain.
        snapshot = list(self.truth)
        delay_ms = 60_000 if self.overloaded else self.notify_latency_ms
        await asyncio.sleep(delay_ms / 1000)
        return snapshot

    async def pull(self):
        await asyncio.sleep(0.012)        # 12 ms cross-AZ RTT
        if self.overloaded:               # a saturated registry times out reads too
            raise TimeoutError("pull timed out")
        if random.random() < 0.05: raise TimeoutError("pull failed")   # transient
        return list(self.truth)

class DiscoveryCache:
    def __init__(self, freshness_budget_s=30.0):
        self.entry = CacheEntry(endpoints=[])
        self.budget = freshness_budget_s

    def is_fresh(self, now):
        age = now - max(self.entry.last_pull_ts, self.entry.last_watch_event_ts)
        return age <= self.budget, age

    async def pull_loop(self, reg, every=15.0):
        while True:
            try:
                self.entry.endpoints = await reg.pull()
                self.entry.last_pull_ts = time.time()
                self.entry.consecutive_pull_fails = 0
            except Exception:
                self.entry.consecutive_pull_fails += 1
            await asyncio.sleep(every)

    async def watch_loop(self, reg):
        while True:
            self.entry.endpoints = await reg.long_poll_event()
            self.entry.last_watch_event_ts = time.time()

async def main():
    reg = Registry(truth=[("10.244.7.117", 8080), ("10.244.7.118", 8080)])
    cache = DiscoveryCache(freshness_budget_s=30.0)
    asyncio.create_task(cache.pull_loop(reg, every=15.0))
    asyncio.create_task(cache.watch_loop(reg))
    await asyncio.sleep(0.5)              # let initial pull and watch land
    print(f"after warmup    fresh={cache.is_fresh(time.time())}  endpoints={cache.entry.endpoints}")
    reg.overloaded = True                 # registry CPU saturates
    reg.truth.append(("10.244.7.201", 8080))   # new pod added
    for sec in [5, 15, 25, 35]:
        await asyncio.sleep(sec - (time.time() - cache.entry.last_pull_ts) if sec - (time.time() - cache.entry.last_pull_ts) > 0 else 0)
        ok, age = cache.is_fresh(time.time())
        print(f"t+{sec:2d}s  fresh={ok}  age={age:5.2f}s  endpoints={len(cache.entry.endpoints)}  knows-pod-201? {('10.244.7.201',8080) in cache.entry.endpoints}")

asyncio.run(main())

Sample run:

after warmup    fresh=True  age=0.01s  endpoints=[('10.244.7.117', 8080), ('10.244.7.118', 8080)]
t+ 5s  fresh=True  age= 5.02s  endpoints=2  knows-pod-201? False
t+15s  fresh=True  age=14.97s  endpoints=2  knows-pod-201? False
t+25s  fresh=True  age=24.85s  endpoints=2  knows-pod-201? False
t+35s  fresh=False  age=34.96s  endpoints=2  knows-pod-201? False

Per-line walkthrough. The line age = now - max(self.entry.last_pull_ts, self.entry.last_watch_event_ts) is the freshness rule that matters: take the more recent of the two refresh paths, because either one progressing is enough to call the cache fresh. The line reg.overloaded = True flips the registry into the regime where both paths stall: queued watch events take longer than the demo window to deliver, and pulls time out against the saturated registry. The line if random.random() < 0.05: raise TimeoutError("pull failed") simulates ordinary transient pull failure — the second mechanism that lets last_pull_ts go stale even when the registry is up. The line reg.truth.append(("10.244.7.201", 8080)) adds a new pod the cache cannot learn about until one of the two paths makes progress again. By t+35, the budget has expired and the cache is marked unusable; the application sees fresh=False and either fails fast or runs in degraded mode. The pull-and-watch design is not "belt and braces" — it is a hedge against each mechanism's failure mode, and the freshness check looks at both clocks.

Why pull and watch, not pull or watch: a watch-only client is fast in the common case but blind when the registry's notify pipeline is the part that broke; a pull-only client is robust but slower to converge on small changes. Combining them gives best-case latency from the watch (5–50 ms to learn about a normal change) and worst-case bounded staleness from the pull (T_pull even when the watch is silent for an unrelated reason). The freshness budget then asks the only question that matters from the application's perspective: "is the most recent of these two clocks within my tolerance?" — not "are both healthy?".

Negative caching, jittered reconnects, and the data-plane veto

Even a perfectly fresh registry view can route to a dead pod, because the registry is not the only source of truth about a pod's liveness. The pod might have been killed two seconds ago by the kernel's OOM killer; the registry's health-check has a check interval of 10 s and a 3-failure threshold, so the pod will be marked dead 30+ seconds after it stopped serving. During that window, the registry says "healthy" and is wrong.

Discovery libraries deal with this by maintaining a negative cache — a set of endpoints the data plane has recently observed to be failing, regardless of what the registry says. The signal is a 5xx response, a TCP-RST, a connect timeout, or a circuit-breaker trip. When the data plane records pod-117 → 5 connect timeouts in 30 s, the discovery library demotes pod-117 even though the registry still has it listed. The demotion has a TTL — typically 30–120 s — after which the pod is reinstated and gets a probe request to confirm.
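
A sketch of that mechanic under the thresholds quoted above — 5 failures in a 30 s window, 60 s demotion; the class and method names are hypothetical:

# negative_cache.py — a data-plane negative cache (illustrative, not any library's API).
import time
from collections import defaultdict, deque

class NegativeCache:
    def __init__(self, fail_threshold=5, window_s=30.0, demote_ttl_s=60.0):
        self.failures = defaultdict(deque)   # endpoint -> timestamps of recent failures
        self.demoted_until = {}              # endpoint -> reinstatement deadline
        self.fail_threshold, self.window_s, self.ttl = fail_threshold, window_s, demote_ttl_s

    def record_failure(self, endpoint, now=None):
        # Signal sources: 5xx, TCP-RST, connect timeout, circuit-breaker trip.
        now = now or time.time()
        q = self.failures[endpoint]
        q.append(now)
        while q and q[0] < now - self.window_s:   # slide the observation window
            q.popleft()
        if len(q) >= self.fail_threshold:
            self.demoted_until[endpoint] = now + self.ttl   # demote, registry be damned

    def usable(self, endpoint, now=None):
        now = now or time.time()
        return self.demoted_until.get(endpoint, 0.0) <= now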

The negative cache is a data-plane override on the control plane. It says: I trust my own request outcomes more than I trust the registry's idea of liveness. The principle generalises — anywhere the control plane and data plane disagree about reality, the data plane is closer to reality and should win.

There are three other safety patterns worth naming because they are uniformly missed in first-cut implementations:

Jittered watch reconnects. When a registry restarts (rolling deploy, leader election, instance replacement), every client's watch breaks at roughly the same instant. Without jitter, every client retries at the same instant, and the registry's first 5 s of life is spent under a thundering herd of 4,000 reconnects. Jitter with delay = base * 2^attempt + random(0, base) spreads the storm. Eureka caps initial reconnect at [0, 30 s] random; gRPC xDS uses exponential backoff with full jitter (each retry uniformly random in [0, current_max]).
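
A sketch of both jitter flavours named above (parameter values are illustrative):

# jittered_reconnect.py — backoff with jitter for watch reconnects, following
# the delay = base * 2^attempt + random(0, base) rule quoted above.
import random

def reconnect_delay(attempt, base=0.5, cap=30.0):
    # Cap the exponential term so 4,000 clients spread across [0, cap] rather
    # than all landing in the registry's first seconds of life.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, base)

def reconnect_delay_full_jitter(attempt, base=0.5, cap=30.0):
    # Full jitter (the gRPC xDS flavour): uniformly random in [0, current_max].
    return random.uniform(0, min(cap, base * (2 ** attempt)))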

Stale-while-revalidate (borrowed from HTTP). When a pull fails, do not invalidate the cache immediately — keep serving the previous data while a background refresh retries. The invariant becomes: the cache is never empty unless we have explicit evidence to empty it. This trades a longer staleness window during registry incidents for not failing every request the moment the registry hiccups.
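
The demo's pull_loop above already obeys this discipline — its except branch keeps the old endpoints and only bumps a failure counter. Made explicit as a sketch (reusing the demo's CacheEntry and Registry shapes):

# stale_while_revalidate.py — never empty the cache on a failed pull.
import time

async def refresh_once(entry, reg):
    try:
        entry.endpoints = await reg.pull()    # an empty result IS explicit evidence
        entry.last_pull_ts = time.time()
    except Exception:
        # Pull failed: keep serving the previous endpoints and let the
        # freshness budget decide when they become unusable.
        pass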

Probe before promote. When a pod that was in the negative cache reaches the end of its demotion TTL, do not put it straight back in the rotation. Send one or two probe requests (typically the same path the LB health-check uses) and only restore full traffic if the probes succeed. This prevents the oscillation where a pod that is genuinely broken keeps getting re-added every 60 s and immediately re-demoted by the next failed request.
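
A sketch of the end-of-demotion probe, building on the NegativeCache sketch above (probe_request is a hypothetical async helper standing in for one real request on the health-check path):

# probe_before_promote.py — probe at the end of the demotion TTL, not before.
import time

async def maybe_promote(neg_cache, endpoint, probe_request, probes=2, extension_s=60.0):
    now = time.time()
    if neg_cache.demoted_until.get(endpoint, 0.0) > now:
        return False                       # demotion TTL not yet expired
    for _ in range(probes):
        if not await probe_request(endpoint):
            # Probe failed: extend the demotion instead of oscillating
            # (re-added every 60 s, re-demoted by the next failed request).
            neg_cache.demoted_until[endpoint] = now + extension_s
            return False
    neg_cache.demoted_until.pop(endpoint, None)   # restore full traffic
    return True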

One pod's lifecycle through cache, negative-cache, and probe — [figure: horizontal timeline; the pod starts healthy in the cache, three 5xx failures at t=20 demote it into the negative cache for 60 s while the registry's lagging view still reports "healthy" for ~30 s, and a probe at t=80 either restores it to rotation or extends the demotion by another 60 s]
Illustrative — the local data-plane signal demotes a misbehaving pod ~30 s before the registry's health-check would have. Without this veto, the gap between truth and registry is the gap during which every client routes traffic into a black hole.

Why the data plane is closer to truth than the registry: the registry's health-check is a synthetic probe issued every 5–10 s from one location. The data plane sees actual production traffic from many callers in real time — orders of magnitude more samples per minute, and on the real code paths that actually matter. When 3 callers in 5 seconds all see 5xx from the same pod, that is a far stronger statistical signal than a single registry health-check pass. The negative cache encodes the principle: the system that observed the failure has the highest authority to demote, even if the registry is still optimistic.

Watch lag under registry overload — the failure that hides

The most dangerous failure mode of a discovery cache is the one where every component is "healthy" but the cache is silently 30+ seconds behind. The clients are up. The registry is up. The pulls succeed. The watch is connected. But the watch's event delivery is queued behind a backlog of broadcast events the registry cannot drain fast enough.

This is what happened to PaySetu in late 2024 during a multi-tenant Eureka migration. The new instance class for the registry had ~30% less CPU per node. Under steady-state load this was fine — the registry sat at 40% CPU. Then a node-pool autoscaler drained 14 pods in 90 seconds, generating ~14 × 4,000 = 56,000 watch broadcasts the registry needed to deliver. Registry CPU pegged at 100%. Each event's broadcast latency went from ~10 ms to ~22 s — not because any single broadcast was slow, but because they queued behind each other on the limited-CPU broadcast thread pool. Every client's watch looked perfectly healthy — the connection was up, and without heartbeats there was nothing to distinguish "no changes" from "changes stuck in a queue". Pulls were succeeding every 30 s against a registry whose own state was fresh — the membership delta had been ingested by t+1 s — but a pull only helps once it fires, and for most clients the next one was tens of seconds away. From the client's perspective: pulls succeeded, the watch was connected, the cache looked "fresh" by every metric it had — and 9% of requests went to drained pods.

The remediations that came out of this incident, all of which are now defaults in the PaySetu discovery library:

  1. Watch heartbeats — the registry sends a no-op heartbeat every 5 s on each watch, so last_watch_event_ts advances on heartbeats as well as real events and silence becomes measurable. If nothing arrives for >15 s, the client treats the watch as broken and reconnects (sketched after this list).
  2. Per-event timestamps — each event carries the registry's own ingest time, not just delivery time. The client computes event_lag = now - event.ingest_ts and exports it as a metric. Watch lag becomes observable.
  3. Quorum reads as a freshness probe — periodically, force a strongly-consistent read against the registry (etcd serializable=false, Consul consistent=true) and compare against the cache. A divergence beyond freshness_budget flips the client into negative-cache-only mode until the next successful watch event lands.
  4. Backpressure at the registry — when broadcast latency exceeds 1 s, the registry drops watch events older than the most recent one for the same (service, member) tuple. Old events are stale by definition; coalescing them is correctness-preserving.
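
Remediations 1 and 2 look like this from the client side — a sketch, with the stream interface, event fields, and metrics helper as illustrative stand-ins:

# watch_supervision.py — heartbeat-aware watch loop with observable event lag.
import asyncio, time

def metrics_gauge(name, value):
    pass                                   # stand-in for a real metrics client

async def supervised_watch_loop(cache, stream, reconnect, silence_s=15.0):
    while True:
        try:
            # Heartbeats every 5 s mean *something* always arrives; >15 s of
            # silence proves the watch is broken rather than merely quiet.
            event = await asyncio.wait_for(stream.next_event(), timeout=silence_s)
        except asyncio.TimeoutError:
            stream = await reconnect()     # jittered, per the earlier sketch
            continue
        cache.entry.last_watch_event_ts = time.time()
        # Per-event ingest timestamps make watch lag a number, not a guess.
        metrics_gauge("discovery_watch_lag_seconds", time.time() - event.ingest_ts)
        if not event.is_heartbeat:
            cache.entry.endpoints = event.members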

After the incident, watch lag during the next autoscaler-driven drain was 1.2 s p99 instead of 22 s — not because the registry got faster, but because the system stopped pretending the watch connection's existence implied the watch's freshness.

Going deeper

Eureka's AP design — staleness as a feature, not a bug

Netflix designed Eureka to favour availability over consistency. The registry is partition-tolerant and eventually-consistent across its own peers; clients are explicitly designed to use stale data when fresh data is unavailable. The original 30 s pull interval was chosen because that's how long Netflix's data showed a typical microservice membership change took to propagate end-to-end, and they considered any stricter freshness guarantee misleading. The 2014 paper Eureka 2.0 Architecture Overview explains the AP rationale: in a network partition between client and registry, an AP design lets the client keep routing to whatever pods it last knew about, on the assumption that most of them are probably still healthy. A CP registry (etcd, ZooKeeper) would refuse to serve, which means the client's only options are "fail every request" or "use very stale data anyway" — Eureka chose to make the second case explicit.

Consul's hierarchical caching — agent-local, server-side, and beyond

Consul layers the discovery cache: each Consul agent (one per node) holds a local cache populated by streaming RPCs from the Consul server cluster. Application clients query the local agent over localhost, never the Consul servers directly. This gives sub-millisecond reads (loopback) at the cost of one more layer of staleness. Consul's streaming subsystem (introduced 2020, default since 1.10) replaced the original blocking-query model precisely to fix watch-lag issues in large fleets. The streaming protocol delivers per-event ingest timestamps and explicit heartbeats — exactly the patterns described above as remediation for watch-lag failure modes.

gRPC xDS — push-based discovery with built-in freshness signals

The xDS protocol used by gRPC and Envoy carries explicit version_info and nonce fields on every config update. The client ACKs each version; the server tracks per-client lag and can refuse to push more updates if the client is behind a configurable threshold. The freshness budget is built into the protocol rather than left to each client to implement. xDS also supports "delta" updates (only the changed members) and full-state updates (the entire set), with the client requesting whichever is appropriate after a reconnect — solving the "I just reconnected, what is the canonical truth?" problem that plain watch protocols handle ad hoc.
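
A schematic of that version/nonce handshake in Python (the real protocol is protobuf messages over a gRPC stream; only the field names are mirrored here):

# xds_ack_sketch.py — schematic of xDS version/nonce acknowledgement.
from dataclasses import dataclass

@dataclass
class DiscoveryResponse:
    version_info: str     # opaque version of this config snapshot
    nonce: str            # per-response token the client must echo

def ack(resp: DiscoveryResponse) -> dict:
    # ACK: echo both fields — the server now knows this client is current.
    return {"version_info": resp.version_info, "response_nonce": resp.nonce}

def nack(resp: DiscoveryResponse, prev_version: str, error: str) -> dict:
    # NACK: echo the nonce with the previous version plus an error detail.
    return {"version_info": prev_version, "response_nonce": resp.nonce,
            "error_detail": error}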

Reproduce this on your laptop

# Run the freshness-budget cache demo
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 discovery_cache.py
# Expect: cache stays "fresh" for 30 s after the registry overloads, then flips to fresh=False
# and the application code is supposed to either retry or degrade gracefully.

# Optional: run a real Consul agent and observe streaming watch behaviour
docker run -d -p 8500:8500 hashicorp/consul:1.18
consul members
consul watch -type=service -service=web   # leave running
consul services register -name=web -address=10.0.0.1 -port=8080  # in another terminal
# Watch events should arrive in <50 ms; this is the healthy regime.

Where this leads next

Once the cache mechanics are solid, the next failure to confront is what happens when the cache and the LB algorithm interact. "Load balancing strategies — round-robin, P2C, least-connections" covers how the picker uses the cache; a stale weight here turns into hot-spot heat there. "Health checks — active probes vs passive observation" is the natural next chapter — the negative-cache mechanics described above are exactly its passive-observation primitives.

Closely related: DNS-based discovery inherits a particularly hostile staleness regime (DNS TTLs are coarse, intermediate resolvers cache aggressively, and there is no watch). "Consul, etcd, ZooKeeper" goes deeper into how the strongly-consistent registries implement watch and where their own staleness comes from.

References

  1. Eureka 2.0 Architecture Overview — Netflix (archived) — the AP-design rationale and why staleness is treated as a first-class feature.
  2. HashiCorp, "Streaming for Service Health" (2020) — Consul's move from blocking queries to streaming watches; rationale and watch-lag measurements before/after.
  3. Matt Klein, "xDS protocol" — Envoy proxy — the version-and-nonce protocol that makes freshness an explicit signal rather than an implicit assumption.
  4. Mark Nottingham, "stale-while-revalidate" — RFC 5861 — the HTTP cache directive whose discipline maps directly onto discovery caches.
  5. Adrian Cole, "DNS in service discovery" — Square Engineering Blog (2018) — practitioner notes on why DNS staleness regimes are particularly painful and what mitigations work.
  6. Werner Vogels, "Eventually Consistent" — CACM 2009 — foundational framing for why staleness is unavoidable in distributed registries; the AP/CP trade-offs that follow.
  7. Client-side vs server-side discovery — internal companion. The cache mechanics in this article apply on whichever side of that split holds the membership view.
  8. DNS-based discovery — internal companion. DNS TTL is the cache primitive of one of the oldest discovery protocols still in production.