Caching patterns and the thundering-herd problem

It is 11:59:59 PM on a Flipkart Big Billion Days sale. A million phones in Bengaluru are refreshing the same Redboard banner page. The cached HTML expires at exactly midnight, and in the next millisecond ten thousand application servers each see a MISS, each fire the same query at the same MySQL primary, and the database falls over before the first user has finished saying "add to cart". This chapter is about why that happens, and the four-line fix that prevents it.

In short

A cache sits between a fast caller and a slow source of truth, and four patterns describe how reads and writes flow through it: cache-aside, read-through, write-through, and write-behind. The dangerous failure mode is the thundering herd (or cache stampede): when one hot key expires, every concurrent request misses, recomputes, and slams the source. The cheap fix is a SETNX recompute lock; the cleaner fix is probabilistic early recompute so the cache renews itself before it expires.

The previous three chapters built up the in-memory store itself — Redis data structures, persistence, replication and Cluster, and the contrast with Memcached's minimalism. This chapter closes Build 22 by stepping back to ask the question every team eventually reaches: given that I have an in-memory store, what is the discipline of using it as a cache? The answer is four patterns, two failure modes, and a fix you can write in five lines of Python.

The four patterns: who reads, who writes, who owns the staleness

A cache is defined by three things — what triggers a read, what triggers a write, and who is responsible for keeping the cache and the source of truth in sync. Pick any reasonable answer to those three questions and you land on one of four canonical patterns. The names are old and the boundaries are fuzzy, but the trade-offs are sharp.

[Figure: Four cache patterns — who reads, who writes, who owns sync. A four-quadrant grid. 1. Cache-aside (lazy, read-around): the application talks to both cache and DB directly. Read: v = cache.get(k); if miss, v = db.read(k); cache.set(k, v, ttl). Write: db.write(k, v); cache.delete(k). Simple, and a cache failure does not break writes; but every miss pays one DB read, with stampede risk. 2. Read-through: the application talks only to the cache, which loads from the DB on a miss (v = cache.get_or_load(k, loader=db.read)). App code is dead simple and loader logic is centralised — a natural fit for a library or sidecar (Caffeine, EhCache, JCache) — but a cache outage becomes a read outage, with no fallback path. 3. Write-through: writes go through the cache, which persists to the DB synchronously before acking; reads always find the key populated. The cache is always consistent with the DB after a write, but writes pay both cache and DB latency, and a DB outage breaks all writes. 4. Write-behind (write-back): writes ack from the cache immediately; a background worker flushes batches to the DB asynchronously, so the DB may lag by seconds to minutes. Lowest write latency and batched DB throughput, but a cache crash loses data and the DB sees a stale snapshot — only safe when the source of truth is the cache itself.]
The four patterns differ on one axis: how strongly the cache and the source of truth are coupled. Cache-aside leaves the application in charge of both; read-through hides the database behind the cache; write-through makes the cache forward writes synchronously; write-behind decouples writes entirely. The 90/10 rule of production caching is that **cache-aside covers most use cases**, write-through is the right choice when you cannot tolerate stale reads after a write, and write-behind is reserved for systems where the cache *is* the source of truth (e.g. Redis with AOF holding session state).
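The read-through shape is worth seeing in code: the application hands the cache a loader once, then only ever talks to the cache. A minimal sketch with a plain dict standing in for the store (the class and its instrumentation are illustrative, not a real library API):

```python
# Read-through sketch: the cache, not the app, owns access to the source.
class ReadThroughCache:
    def __init__(self, loader):
        self._store = {}
        self._loader = loader          # DB access lives inside the cache
        self.loads = 0                 # instrumentation for the example

    def get(self, key):
        if key not in self._store:     # miss: the cache fetches from the source
            self._store[key] = self._loader(key)
            self.loads += 1
        return self._store[key]

db = {'product:101': {'price': 19999}}
cache = ReadThroughCache(loader=db.__getitem__)

assert cache.get('product:101') == {'price': 19999}   # miss: loader fires
assert cache.get('product:101') == {'price': 19999}   # hit: loader skipped
assert cache.loads == 1
```

The app never imports the database client at all — which is exactly why a cache outage here is a read outage.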

Two patterns deserve dwelling on, because the trade-offs decide the architecture.

Cache-aside is the default. Application code looks like v = cache.get(k); if v is None: v = db.read(k); cache.set(k, v, ttl=300); return v. The application owns both halves of the relationship: it reads from the cache, falls back to the database, and writes back to the cache. On the write path, the canonical idiom is write to the database, then delete the cache key — not cache.set(k, new_value), because two concurrent writers can race and leave the cache stale (write A writes the DB, write B writes the DB, write B sets the cache, write A sets the cache → cache holds A's value but the DB holds B's). Deleting the key forces the next reader to repopulate from the now-correct DB. Why "delete on write" beats "set on write": the database is the source of truth and serialises writes; the cache only needs to invalidate. A DEL is idempotent and cannot leave the cache in a state that contradicts the DB. A SET competes with concurrent SETs, and the last writer wins on the cache — which is the wrong winner if it is not the last writer on the DB.
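The race is easier to believe when you replay it by hand. A minimal sketch with plain dicts standing in for the cache and the database — the interleaving is fixed so the bad outcome is reproducible:

```python
# Replay the write-path race deterministically, dicts standing in for
# the cache and the DB (no Redis needed).
db, cache = {}, {}

def replay(steps):
    """Run a fixed interleaving of two writers' operations."""
    db.clear(); cache.clear()
    for step in steps:
        step()

# Writer A stores 'A', writer B stores 'B'.
# "Set on write", with the unlucky ordering from the text:
replay([
    lambda: db.__setitem__('k', 'A'),     # A writes the DB first...
    lambda: db.__setitem__('k', 'B'),     # ...B writes the DB (B wins the DB)
    lambda: cache.__setitem__('k', 'B'),  # B sets the cache
    lambda: cache.__setitem__('k', 'A'),  # A sets the cache last — wrong winner
])
assert db['k'] == 'B' and cache['k'] == 'A'   # cache contradicts the DB

# "Delete on write": both writers DEL instead of SET — any ordering is safe.
replay([
    lambda: db.__setitem__('k', 'A'),
    lambda: db.__setitem__('k', 'B'),
    lambda: cache.pop('k', None),         # B invalidates
    lambda: cache.pop('k', None),         # A invalidates (DEL is idempotent)
])
# The next reader misses and repopulates from the now-correct DB.
value = cache.get('k') or cache.setdefault('k', db['k'])
assert db['k'] == cache['k'] == 'B'
```

The first replay ends with the cache and DB disagreeing forever (until the TTL saves you); the second ends with a single harmless miss.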

Write-behind is the dangerous one. It looks attractive because the application's write returns the moment the cache acknowledges, and a background worker batches writes to the DB. The catch is that the cache is now the source of truth between the ack and the flush, and a Redis crash in that window erases real user data. The pattern is only safe when the cache has its own durability story (AOF with appendfsync everysec, replication to a synchronous follower) and when the application has accepted up to N seconds of write loss as a contract. In practice, very few teams should reach for write-behind; if you think you need it, you usually want a queue (Kafka, SQS) in front of the database instead, which gives you ordered, durable, replayable writes without pretending the cache owns them.
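The shape of write-behind — and its dangerous window — can be sketched with the async flush made explicit. Dicts stand in for Redis and the database, and the background worker is a manually invoked flush() so the example is deterministic:

```python
from collections import deque

# Write-behind sketch: writes ack from the cache immediately; a worker
# (here invoked by hand) drains batches to the DB later.
class WriteBehindCache:
    def __init__(self, db):
        self._db = db
        self._store = {}
        self._pending = deque()       # the window where the cache IS the truth

    def set(self, key, value):
        self._store[key] = value      # ack to the caller right here
        self._pending.append((key, value))

    def flush(self, batch=100):       # in production: a background thread/timer
        for _ in range(min(batch, len(self._pending))):
            key, value = self._pending.popleft()
            self._db[key] = value     # batched, possibly seconds later

db = {}
cache = WriteBehindCache(db)
cache.set('cart:42', ['sku1', 'sku2'])
assert 'cart:42' not in db            # the dangerous window: DB lags the cache
cache.flush()
assert db['cart:42'] == ['sku1', 'sku2']
```

Everything between the set() ack and the flush() exists only in the cache — which is exactly the data a crash erases.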

The thundering herd: one expiry, ten thousand misses

Now the failure mode this chapter is named for. Pick any cache pattern. Set a TTL on a key. Make the key very popular. The moment the TTL expires, every concurrent request that was reading that key sees a miss at the same instant, and every request fires its own database query to refill the cache. If 10,000 requests/second were hitting the cache and the recompute takes 50 ms, you have just sent 500 simultaneous identical queries to a database that was sized for one. This is the thundering herd — also called cache stampede, dogpile effect, or cache miss storm.

[Figure: Thundering herd timeline — one TTL expiry, every concurrent caller misses, the database melts. Steady state (t = −1 ms): cache holds product:101, 10 K req/s all HIT, DB CPU 20%, p99 latency 1 ms. At t = 0 the TTL expires; during the 50 ms stampede window, 500 application servers each independently see a MISS and fire the same SELECT at the MySQL primary — DB CPU spikes to 100%, the connection pool fills, p99 reaches 5 s with timeouts. At t = 50 ms the first recompute lands in the cache and the remaining 499 race to SET the same value, but the 500 DB queries are already paid and the DB recovers for roughly 5 s. The multiplier is request rate × recompute time: 10 K req/s × 50 ms = 500 in-flight duplicates; at 100 K req/s × 100 ms, 10 000. The bigger the traffic and the bigger the cache benefit, the bigger the cliff; "the cache was load-bearing and we did not know" is the post-mortem one-liner.]
The amplification factor is `request_rate × recompute_latency`. A page taking 100 ms to assemble at a steady 100 K req/s has 10,000 in-flight requests at any moment — and at the millisecond the cache key expires, every one of them misses. The herd is not a Redis problem; it is a side-effect of caches working too well. The faster the cache, the bigger the cliff when it disappears.

The mechanism is mechanical, not exotic. While the cache held the value, the database saw 0 queries on that key. The instant the TTL fires, the database sees request_rate × recompute_latency queries on that key — possibly within a single millisecond if the cache delete was atomic. Every replica of the application sees the same miss, none of them coordinate, and they all race to refill. The first one that finishes wins (its SET populates the cache); the others finish a few milliseconds later and overwrite with the same value, having paid the database cost for nothing.
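The amplification arithmetic is worth making explicit, since it is the number you size your safety margin against:

```python
def herd_size(request_rate_per_s, recompute_latency_s):
    """Duplicate recomputes in flight the instant a hot key expires."""
    return int(request_rate_per_s * recompute_latency_s)

assert herd_size(10_000, 0.050) == 500        # the worked example above
assert herd_size(100_000, 0.100) == 10_000    # the Flipkart-scale figure
```

Halving the recompute latency halves the herd; so does halving the request rate — but only the fixes below take it all the way to one.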

The textbook expiry is not the only trigger. The same pattern fires when a Redis node is rebooted (every key disappears), when FLUSHDB is run by accident, when a deploy bumps the cache-key namespace (every key effectively becomes new), when the cache evicts under memory pressure, or when an application bug invalidates too aggressively. The herd is a system-level property, not a property of any particular pattern.

The fix: SETNX recompute lock + early probabilistic recompute

Two fixes together kill the herd. The first is a recompute lock: when a request misses, only the first one is allowed to recompute; the rest either wait briefly and retry, or serve a stale value. The second is probabilistic early recompute: a small fraction of requests refresh the key before it actually expires, so the synchronous-miss event never happens for a hot key. Together they cap the database hit at exactly one per recompute interval, regardless of how popular the key is.

Here is the recompute lock in five lines, using Redis SET key value NX EX ttl (the modern atomic form of SETNX):

import redis, json, time, random

r = redis.Redis(decode_responses=True)

def get_with_lock(key, loader, ttl=300, lock_ttl=10):
    """Cache-aside read with a SETNX recompute lock to prevent stampedes."""
    val = r.get(key)
    if val is not None:
        return json.loads(val)

    # Cache miss. Try to acquire the recompute lock.
    lock_key = f'lock:{key}'
    got_lock = r.set(lock_key, '1', nx=True, ex=lock_ttl)
    if got_lock:
        try:
            fresh = loader()                                  # the slow recompute
            r.set(key, json.dumps(fresh), ex=ttl)
            return fresh
        finally:
            r.delete(lock_key)

    # Lost the race — wait briefly and retry the cache.
    for _ in range(20):
        time.sleep(0.05 + random.random() * 0.05)             # 50–100 ms jitter
        val = r.get(key)
        if val is not None:
            return json.loads(val)
    # Fallback: pay the DB cost ourselves (better than 500-error)
    return loader()

def assemble_product(sku):
    """Pretend this hits MySQL + inventory service — 50 ms."""
    time.sleep(0.05)
    return {'sku': sku, 'price': 19999, 'stock': 42, 'name': 'Redmi Note 13'}

# The herd test: 500 threads ask for the same key the moment after expiry
# Without the lock: 500 calls to assemble_product (DB melts)
# With the lock:    1 call to assemble_product, 499 sleep-and-retry hits

Run that under 500 concurrent requests and you will see exactly one assemble_product call; the other 499 sleep 50–100 ms, find the now-populated cache, and return. The database paid one query instead of 500. Why SET ... NX EX 10 is the right primitive: SET with NX (only set if the key does not exist) is atomic on a single Redis instance — exactly one client among 500 simultaneous callers gets the True reply; every other caller gets None. The EX 10 (a ten-second TTL on the lock) is a safety net: if the lock holder crashes mid-recompute, the lock auto-releases and the next request retries. Without the TTL, a crashed worker would block every reader on that key until manual intervention.
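The herd test can be run without a Redis server at all. A sketch in which a mutex-guarded dict plays the role of Redis (its set(nx=True) is atomic, which is the only property the lock needs), 50 threads race on one expired key, and we count how many times the loader actually runs — note the re-check of the cache after acquiring the lock, which closes the small window where a slow thread grabs the lock after the winner has already filled the key:

```python
import threading, time

class FakeRedis:
    """Thread-safe dict standing in for a single Redis instance."""
    def __init__(self):
        self._d, self._mu = {}, threading.Lock()
    def get(self, k):
        with self._mu:
            return self._d.get(k)
    def set(self, k, v, nx=False, ex=None):
        with self._mu:
            if nx and k in self._d:
                return None            # lost the SETNX race
            self._d[k] = v
            return True
    def delete(self, k):
        with self._mu:
            self._d.pop(k, None)

store = FakeRedis()
calls = []                             # one entry per "DB query"

def loader():
    calls.append(1)
    time.sleep(0.05)                   # the 50 ms recompute
    return '{"sku": 101}'

def get_with_lock(key):
    val = store.get(key)
    if val is not None:
        return val
    if store.set(f'lock:{key}', '1', nx=True, ex=10):
        try:
            val = store.get(key)       # re-check: winner may have just filled it
            if val is not None:
                return val
            fresh = loader()
            store.set(key, fresh)
            return fresh
        finally:
            store.delete(f'lock:{key}')
    for _ in range(40):                # lost the race: sleep and re-poll
        time.sleep(0.025)
        val = store.get(key)
        if val is not None:
            return val
    return loader()                    # last-resort fallback: pay the DB cost

threads = [threading.Thread(target=get_with_lock, args=('product:101',))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
assert len(calls) == 1                 # one DB query, not fifty
```

The same double-check-after-lock idiom is worth adding to the Redis version too: it costs one extra GET and removes the only duplicate-recompute window the lock leaves open.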

That fixes the spike at expiry, but it leaves a smaller problem: the lock holder still pays the recompute synchronously, and during those 50 ms, the 499 other readers are blocked. For very hot keys you want the recompute to happen before the key expires, so no reader ever sees the gap. This is probabilistic early expiration (XFetch, named after the algorithm in Vattani et al., 2015):

import math, random   # json, time, and the Redis client r carry over from the listing above

def get_with_probabilistic_refresh(key, loader, ttl=300, beta=1.0):
    """Each reader independently rolls a die to refresh early."""
    pipe = r.pipeline()
    pipe.get(key)
    pipe.get(f'{key}:delta')   # how long the last recompute took (ms)
    pipe.ttl(key)
    val, delta_ms, remaining = pipe.execute()

    if val is None:
        return _recompute(key, loader, ttl)

    delta = float(delta_ms or 50) / 1000.0     # default 50 ms if unknown
    expiry = time.time() + (remaining or 0)

    # XFetch: refresh probability rises as TTL approaches zero.
    if time.time() - delta * beta * math.log(random.random()) >= expiry:
        return _recompute(key, loader, ttl)

    return json.loads(val)

def _recompute(key, loader, ttl):
    start = time.time()
    fresh = loader()
    elapsed_ms = int((time.time() - start) * 1000)
    pipe = r.pipeline()
    pipe.set(key, json.dumps(fresh), ex=ttl)
    pipe.set(f'{key}:delta', elapsed_ms, ex=ttl)
    pipe.execute()
    return fresh

The trick is the line time.time() - delta * beta * log(random()). As remaining shrinks toward zero, the probability that this expression exceeds expiry grows; one of the in-flight readers will roll a small enough random number to trigger a refresh before the natural expiry. The hotter the key (more readers per second), the more likely some reader refreshes early, while cold keys experience zero overhead. Why −log(random()) is the right shape: random() is uniform on (0, 1), so −log(random()) is exponentially distributed. The probability of triggering scales with the key's request rate — exactly the keys you most want refreshed early. Cold keys get refreshed only when they actually expire (one recompute, no stampede because there is no herd). Hot keys get refreshed by a single early reader, with a smooth probability curve that has no sharp boundary anyone can synchronise against. Combined with the SETNX lock, you get the full fix: hot keys self-refresh before expiry, and the rare honest expiry triggers a single locked recompute instead of a herd.

There are two cheaper alternatives worth knowing. Stale-while-revalidate (the HTTP-cache pattern, also baked into CDNs like Cloudflare) serves the stale value from the cache while a single background worker refreshes it; the stampede becomes invisible because every reader gets a hit. TTL jitter — instead of SETEX 300, use SETEX (300 + random.randint(0, 60)) — desynchronises expiries across keys so a million keys created at deploy time do not all expire in the same second. Jitter alone does not save a single hot key from stampeding, but it saves the system from every key stampeding at the same minute.
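TTL jitter really is a one-liner. A sketch (the base and spread values are illustrative, matching the SETEX example above):

```python
import random

def jittered_ttl(base=300, spread=60):
    """Spread expiries of keys created together across a window."""
    return base + random.randint(0, spread)

random.seed(7)                          # seeded only to make the demo repeatable
ttls = {jittered_ttl() for _ in range(1000)}
assert all(300 <= t <= 360 for t in ttls)
assert len(ttls) > 1                    # expiries are spread, not synchronised
```

In the cache-aside write path this is just r.set(k, v, ex=jittered_ttl()) — a thousand keys bulk-loaded at deploy time now expire across a minute instead of a second.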

Common confusions

Going deeper

The math of −log(random())

The XFetch algorithm works because of one elegant property of the exponential distribution. If U is uniform on (0, 1), then X = −log(U) is Exponential(1). Multiplied by the recompute time delta, the random variable delta × β × X has mean delta × β and a long thin tail. The check now − delta × β × log(random()) ≥ expiry triggers a refresh when the rolled random sample is small enough; the probability of a small sample is calibrated by delta so the refresh window scales with the recompute cost. The full analysis is in Vattani, Chierichetti & Lowenstein (2015), "Optimal Probabilistic Cache Stampede Prevention" — a four-page VLDB paper that is unusually readable.
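The distributional claim is easy to check empirically. If U is uniform on (0, 1), then −log(U) should have mean 1 and variance 1 (the signature of Exponential(1)):

```python
import math, random

# Empirical check: U ~ Uniform(0,1)  =>  -log(U) ~ Exp(1),
# so the sample mean and variance should both be close to 1.
random.seed(42)
samples = [-math.log(random.random()) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
assert abs(mean - 1.0) < 0.02
assert abs(var - 1.0) < 0.05
```

Scaling by delta × β then gives a refresh offset whose mean is the recompute cost itself — which is what calibrates the early-refresh window to exactly the keys that are expensive to rebuild.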

Negative caching: cache the misses too

A subtle related stampede: someone keeps requesting product:99999, which does not exist. Every request misses the cache (correctly — the row is not there) and queries the database (which returns nothing, also correctly), but the database query is paid every time. The fix is to cache the negative result — r.set('product:99999', '__NOT_FOUND__', ex=60) — and check for the sentinel on read. The TTL is short to allow new rows to appear quickly; the value is small and tagged so it does not collide with real data. Without negative caching, a single bot scanning random IDs can DoS your database through a perfectly healthy cache.
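The read path with the sentinel check, sketched with dicts standing in for Redis and the DB (the sentinel string matches the example above; the query counter is just instrumentation):

```python
NOT_FOUND = '__NOT_FOUND__'            # sentinel no real value can collide with
db = {'product:101': '{"price": 19999}'}
cache = {}
db_queries = []                        # count the expensive path

def get_product(key):
    val = cache.get(key)
    if val == NOT_FOUND:
        return None                    # known miss — the DB is not touched
    if val is not None:
        return val
    db_queries.append(key)
    row = db.get(key)
    # In real Redis: ex=60 here, so new rows appear within a minute.
    cache[key] = row if row is not None else NOT_FOUND
    return row

assert get_product('product:99999') is None   # first probe pays a DB query
assert get_product('product:99999') is None   # repeat probes are absorbed
assert db_queries.count('product:99999') == 1
```

A bot hammering the same nonexistent ID now costs one DB query per minute instead of one per request.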

Cache penetration, breakdown, avalanche

Three named failure modes from Chinese tech blogs (击穿, 穿透, 雪崩) that map cleanly to English. Cache breakdown is the single-hot-key herd this chapter is built around. Cache penetration is the negative-caching problem above — keys that do not exist hammering the DB. Cache avalanche is many keys expiring simultaneously (often after a deploy or a synchronised TTL on bulk-loaded data) — the fix is jitter at insert time. The three terms describe three distinct symptoms with three distinct fixes; muddling them produces the wrong fix and an outage that "looks like" the herd but is actually penetration.

Two-tier caching: L1 in-process + L2 Redis

For the very hottest keys (under a few thousand per app pod), an in-process cache (Caffeine in Java, cachetools.TTLCache in Python, lru-cache in Node) sits in front of Redis with its own short TTL. A read hits L1 (microseconds), falls back to Redis (sub-millisecond), falls back to the DB (tens of milliseconds). Two-tier caching takes pressure off Redis itself the same way Redis takes pressure off the DB; the cost is that L1 is per-pod so invalidation needs a pub/sub broadcast (or short L1 TTLs and acceptance of brief inconsistency). Facebook's TAO and Twitter's home-timeline cache both use multi-tier shapes.
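The two-tier read path can be sketched without any library: a tiny per-process TTL map (L1) in front of a shared dict standing in for Redis (L2), falling back to a loader (the DB). The class, hit counters, and TTL values are illustrative:

```python
import time

class L1TTLCache:
    """Minimal in-process cache with per-entry expiry."""
    def __init__(self, ttl=1.0):
        self._d, self._ttl = {}, ttl
    def get(self, k):
        hit = self._d.get(k)
        if hit and hit[1] > time.time():
            return hit[0]
        self._d.pop(k, None)           # expired or absent
        return None
    def set(self, k, v):
        self._d[k] = (v, time.time() + self._ttl)

l1 = L1TTLCache(ttl=0.5)               # short TTL bounds per-pod staleness
l2 = {}                                # stands in for Redis
hits = {'l1': 0, 'l2': 0, 'db': 0}

def read(key, loader):
    v = l1.get(key)                    # microseconds
    if v is not None:
        hits['l1'] += 1; return v
    v = l2.get(key)                    # sub-millisecond in real Redis
    if v is None:
        hits['db'] += 1                # tens of milliseconds
        v = loader(key)
        l2[key] = v
    else:
        hits['l2'] += 1
    l1.set(key, v)                     # populate L1 on the way back up
    return v

loader = lambda k: {'sku': k}
read('101', loader); read('101', loader); read('101', loader)
assert hits == {'l1': 2, 'l2': 0, 'db': 1}
```

The short L1 TTL is the "acceptance of brief inconsistency" in code form: for up to 0.5 s a pod may serve a value the rest of the fleet has already replaced.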

CDN as a cache layer

Everything above also applies to the CDN layer (Cloudflare, Akamai, Fastly). The "Big Billion Days" stampede in the lead is, in production, mostly absorbed by the CDN edge — Cloudflare's cache-control: s-maxage plus stale-while-revalidate and stale-if-error directives implement the stale-while-revalidate pattern at the network edge so end users never see the herd. The application-level cache (Redis) catches what the CDN misses (logged-in pages, personalised data, dynamic API responses). The patterns are identical; only the layer and latency differ.

Where this leads next

This chapter closes Build 22 — In-memory databases. The next build moves up the stack to streaming systems, where the "log is the source of truth" idea returns.

References