Twitter's caching migration

In 2018, Twitter's home-timeline cache served roughly 320 million reads per second from a fleet of 8,500 Memcached instances. Each instance had been provisioned years earlier on the assumption that the timeline workload was a uniform random read mix with object sizes around 1 KB. By 2019 the actual traffic was bimodal — a long tail of small objects and a fat head of 50 KB tweet-thread bundles — and the fleet was paying a 22% allocator overhead, a 3.4 ms p99.9 from cross-CPU lock contention on hot keys, and an LRU eviction policy that kept evicting popular threads at the worst possible moment. The migration to Pelikan, which Twitter open-sourced in 2020, took five years end-to-end and replaced none of the cache's externally observable behaviour. What it replaced was the internal mechanism — the allocator, the threading model, the eviction policy, the wire-protocol parser — and the resulting reduction in p99.9 from 3.4 ms to 380 µs is the kind of number that a working backend engineer at PhonePe or Hotstar should recognise as the difference between "this works" and "this scales".

A cache that returns correct values is not the same as a cache that returns correct values fast. Twitter's Memcached fleet was correct but slow under the workload it actually served, and the slowness compounded with every doubled-traffic year. Pelikan rebuilt the cache around the workload — slab allocators sized to the real object distribution, lockless single-threaded shards, segment-based TTL eviction — and turned a 3.4 ms p99.9 into a 380 µs p99.9 with the same hit ratio.

Why a "working" cache stops working

The premise of any cache is that the access distribution is skewed enough that a small hot set absorbs most of the read traffic. The classic Twitter cache was sized on the assumptions that the 99th-percentile object was around 4 KB, that reads outnumbered writes 8-to-1, and that the LRU eviction policy aligned with what users actually re-fetched. All three assumptions held in 2014. None of them held in 2019.

What changed under the cache, while the cache itself stayed the same:

Object size distribution turned bimodal. The original Memcached deployment used a single slab class around 1 KB. By 2018, the workload was 60% small objects (tweet metadata, user-card lookups around 200-400 bytes) and 30% large objects (full thread bundles around 30-80 KB pulled in one shot for the timeline renderer). The 1 KB slab class fragmented internally for the small objects and chained into multi-slab allocations for the large ones; the steady-state allocator overhead climbed from 8% in 2014 to 22% in 2019. Memory that should have held cached values held metadata about caching cached values.

Hot-key concentration intensified. Around 2017, the timeline algorithm started promoting "viral threads" more aggressively. A single celebrity tweet during a major event could account for 4-6% of all timeline-cache reads for 90 minutes. The original Memcached fleet sharded by key hash, which meant one shard absorbed the celebrity's traffic. That shard's CPU saturated; the rest of the fleet sat at 30% utilisation. The cluster's apparent capacity was set by the worst shard, not the average.
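
To make the hot-shard arithmetic concrete, here is a small sketch (not Twitter's sharding code, just a model of key-hash sharding under a skewed key stream) showing how a single key at 5% of reads turns one shard into the cluster's ceiling:

# hot_shard_sketch.py
# Illustrative model: route a skewed key stream through key-hash sharding
# and report how much of the total load lands on the single hottest shard.
import collections, random

random.seed(1)
SHARDS = 100
counts = collections.Counter()

# Skewed workload: one "celebrity" key takes 5% of reads, the rest is uniform.
for _ in range(1_000_000):
    key = "celebrity" if random.random() < 0.05 else f"k{random.randrange(1_000_000)}"
    counts[hash(key) % SHARDS] += 1

hottest = max(counts.values())
print(f"hottest shard: {hottest / 1_000_000:.1%} of all reads "
      f"(uniform would be {1 / SHARDS:.1%})")
# The shard that owns the celebrity key serves about 6% of the fleet's reads,
# 6x the average; its CPU saturation sets the whole cluster's capacity.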

Eviction policy regressed under high cardinality. LRU evicts the least-recently-used entry on insertion when the slab is full. When traffic is uniform-random, this approximates "evict the coldest" reasonably well. When traffic is bimodal with a heavy tail of medium-warm keys, LRU oscillates — every minor scan over the medium-warm tail evicts genuinely hot entries that simply have not been read in the last 30 seconds. The hit ratio held at 96% on average but had pathological dips to 78% during scan-heavy traffic patterns. Each dip propagated downstream as a thundering herd of database reads.

Lock contention emerged on hot slabs. Memcached's threading model is multi-threaded with per-slab locks. As the request rate climbed past 80,000 req/s/instance, contention on the popular slabs (the 4 KB and 32 KB classes) began to show up as p99.9 spikes that did not correlate with offered load — the spikes correlated with which slab class the request fell into. Why lock contention shows up first in the tail and not the mean: the mean of N samples averages over the contention; the tail measures the worst-case wait. With 8 threads contending for a slab lock and exponentially-distributed hold times, the p50 is roughly the unloaded service time, but the p99.9 includes the case where 7 other threads are queued ahead, which scales superlinearly with offered load. By the time the mean shows contention, the tail has been pathological for an order of magnitude longer.
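
The superlinearity is easy to see in a toy model. The sketch below drives a single M/M/1 queue (a deliberate simplification of an 8-thread slab lock) at increasing utilisation; the absolute numbers are illustrative, but the shape, with latency growing much faster than offered load, is the mechanism described above:

# lock_tail_sketch.py
# Illustrative M/M/1 model of a contended lock, via the Lindley recursion:
# wait[n+1] = max(0, wait[n] + service[n] - interarrival[n]).
import random

random.seed(2)

def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * (len(xs) - 1))]

SERVICE_NS = 100                          # unloaded lock hold time
for util in (0.5, 0.8, 0.95):
    wait, samples = 0.0, []
    for _ in range(200_000):
        service = random.expovariate(1 / SERVICE_NS)
        gap = random.expovariate(util / SERVICE_NS)   # arrival rate = util/SERVICE_NS
        samples.append(wait + service)                # total time through the lock
        wait = max(0.0, wait + service - gap)
    mean = sum(samples) / len(samples)
    print(f"util={util:.2f}  mean={mean:7.0f} ns  p99.9={percentile(samples, 0.999):7.0f} ns")
# Expected shape: a 1.9x increase in offered load (0.50 -> 0.95) costs roughly
# 10x in latency; the queue's cost is superlinear in utilisation, and the tail
# pays the absolute price first.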

Figure: object size distribution drift, 2014 to 2019, in Twitter's timeline cache. Two histograms side by side: the 2014 distribution is roughly log-normal with a single mode at 1 KB; the 2019 distribution is bimodal, with a small-object peak at 300 bytes and a large-object peak at 32 KB. The unchanged 1 KB slab-class line crosses both, illustrating the misalignment: the slab is sized for a workload that no longer exists, so small objects waste memory and large objects fragment.
Illustrative — drawn from Twitter's published 2019 cache-workload analysis. The slab class that was correctly sized in 2014 is the wrong size for both modes of the 2019 distribution. This is the central pathology that Pelikan was built to fix.

The cumulative effect is the failure mode that working backend engineers know intuitively: the cache "still works" by every coarse metric (hit ratio, throughput) but the things that matter for user-facing latency (p99.9, tail-of-tail spikes during hot events) deteriorate steadily. The team can spend years adding capacity and the metrics that matter never improve, because the bottleneck is the cache's internal mechanism, not its external footprint.

What Pelikan changed — three mechanisms, one philosophy

Twitter's response was not "a faster cache" — it was a deliberate rebuild around the empirical workload, with three mechanisms working together. Pelikan, which Twitter open-sourced in 2020, embodies the philosophy that a cache should be matched to its workload's distribution, not configured against a generic-workload assumption.

Mechanism 1 — segment-based TTL eviction (TwemcacheTTL). Instead of LRU per slab, Pelikan groups objects by TTL into segments. When a segment's TTL expires, the entire segment is evicted as a unit — not individual objects. This sounds inefficient (it evicts objects that might still be valid), but the workload analysis showed that 89% of writes have a TTL within a narrow band (1-5 minutes for timeline data), so segment eviction approximates per-object eviction at much lower bookkeeping cost. The result: hit ratio held at 95-96% (down 1 point from LRU), but the eviction overhead per request dropped from 320 ns to 35 ns. Why segment eviction beats LRU at high object counts: LRU requires a doubly-linked-list pointer update on every access, which is one cache miss minimum per operation (the prev/next pointers are in different cache lines from the value). At 320M req/s across the fleet, that one extra cache miss is about 6% of total CPU. Segment eviction does no per-access bookkeeping — just an enqueue at write time, no metadata writes on read. The trade-off is a slight hit-ratio regression (objects evicted while still valid), but at 1-point cost and 9× faster eviction, the math is straightforward.

Mechanism 2 — single-threaded sharding with shared-nothing memory. Each Pelikan worker is a single thread with its own memory pool, no shared mutexes anywhere on the read path. Requests are sharded to threads by connection, not by key. This is the same architectural choice Redis made — see /wiki/the-single-threaded-redis-lesson — and it works for the same reason: cache operations are short enough that thread-context-switch cost dominates lock-contention cost in any multi-threaded design. Pelikan runs 8-16 worker threads per host, each pinned to a CPU, each with its own memory arena, communicating only through lock-free queues for ingress.
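
A minimal sketch of the shard-by-connection pattern (an assumed shape for illustration, not Pelikan's code): each worker owns a private dict and an ingress queue, so no lock ever guards the cached data, and a hot key read over many connections spreads across workers at the cost of duplicate copies:

# shard_by_connection_sketch.py
# Toy model of the threading choice: N workers, each owning a private store,
# fed through its own queue. No lock ever guards the cached data itself.
import queue, threading

N_WORKERS = 4

def worker(inbox):
    store = {}                              # private to this thread: shared-nothing
    while True:
        op, key, value, reply = inbox.get()
        if op == "STOP":
            return
        if op == "SET":
            store[key] = value
            reply.put("STORED")
        else:                               # GET
            reply.put(store.get(key))

inboxes = [queue.Queue() for _ in range(N_WORKERS)]
for q in inboxes:
    threading.Thread(target=worker, args=(q,), daemon=True).start()

def dispatch(conn_id, op, key, value=None):
    # Route by CONNECTION, not by key: a hot key served over many connections
    # lands on many workers (each caches its own copy) instead of one hot shard.
    reply = queue.Queue()
    inboxes[conn_id % N_WORKERS].put((op, key, value, reply))
    return reply.get()

dispatch(0, "SET", "celebrity_thread", b"...")   # lands on worker 0
print(dispatch(4, "GET", "celebrity_thread"))    # conn 4 -> worker 0 as well: hit
print(dispatch(1, "GET", "celebrity_thread"))    # worker 1 never saw it: miss (None)
for q in inboxes:
    q.put(("STOP", None, None, None))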

Mechanism 3 — slab classes sized to the actual distribution. Pelikan builds its slab classes from a histogram of the live workload, refreshed weekly. The classes are not powers of 2 (Memcached's default); they are quantiles of the observed size distribution. For Twitter's timeline workload, the classes settle around 200 B, 800 B, 4 KB, 16 KB, 48 KB, with the largest class absorbing the 50 KB thread bundles cleanly. Internal fragmentation drops from 22% to 4%; external fragmentation, which was negligible under Memcached's slab model, stays negligible. The 18 percentage points of recovered memory translates directly to either smaller fleet footprint or larger working set in the same fleet.
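
The quantile idea fits in a few lines. The sketch below derives slab classes from an observed size distribution and compares internal fragmentation against power-of-2 classes; the synthetic sizes mimic the bimodal shape described above, and the class count is illustrative, not Pelikan's actual procedure:

# slab_from_quantiles.py
# Derive slab classes from the measured size distribution instead of powers
# of two. Synthetic bimodal sample: 60% small metadata objects, 30% large
# thread bundles, 10% in between.
import random, statistics

random.seed(3)
sizes = ([int(random.gauss(300, 60)) for _ in range(60_000)] +
         [int(random.gauss(40_000, 10_000)) for _ in range(30_000)] +
         [random.randrange(1_000, 8_000) for _ in range(10_000)])
sizes = [max(s, 64) for s in sizes]

# Slab classes at evenly spaced quantiles of the live distribution; each
# object goes into the smallest class that fits, so boundaries placed at
# high-density points of the histogram reduce internal fragmentation.
classes = sorted(set(statistics.quantiles(sizes, n=6)))   # 5 cut points
classes[-1] = max(sizes)                                  # top class must fit everything

def waste(obj, classes):
    slab = min(c for c in classes if c >= obj)
    return slab - obj

pow2 = [2 ** i for i in range(6, 18)]                     # 64 B .. 128 KB
frag_q = sum(waste(s, classes) for s in sizes) / sum(sizes)
frag_2 = sum(waste(s, pow2) for s in sizes) / sum(sizes)
print(f"quantile classes: {[int(c) for c in classes]}")
print(f"internal fragmentation: quantile={frag_q:.1%}  power-of-2={frag_2:.1%}")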

# pelikan_segment_eviction.py
# A faithful Python model of Pelikan's segment-based TTL eviction policy,
# compared against Memcached's per-key LRU. Run: python3 pelikan_segment_eviction.py
import time, random, collections

class LRUCache:
    """Memcached-style LRU per slab. Doubly-linked-list bookkeeping per access."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = collections.OrderedDict()   # ordered = LRU order
        self.bookkeep_ns = 0

    def get(self, key):
        t0 = time.perf_counter_ns()
        if key not in self.entries:
            self.bookkeep_ns += time.perf_counter_ns() - t0
            return None
        self.entries.move_to_end(key)              # LRU pointer update
        self.bookkeep_ns += time.perf_counter_ns() - t0
        return self.entries[key][0]            # value only, like the segment cache

    def put(self, key, value, size):
        if key in self.entries:                # overwrite: release the old size
            self.used -= self.entries.pop(key)[1]
        while self.used + size > self.capacity and self.entries:
            k, v = self.entries.popitem(last=False)
            self.used -= v[1]                  # evict oldest
        self.entries[key] = (value, size)
        self.used += size

class SegmentTTLCache:
    """Pelikan-style: objects grouped by write-time bucket, whole-segment eviction."""
    def __init__(self, capacity_bytes, segment_seconds=60):
        self.capacity = capacity_bytes
        self.used = 0
        self.segments = collections.OrderedDict()  # bucket -> {key: (value, size)}
        self.segment_seconds = segment_seconds
        self.bookkeep_ns = 0

    def _bucket(self, now):
        return int(now // self.segment_seconds)

    def get(self, key):
        t0 = time.perf_counter_ns()
        # in practice: a key->segment index; here, scan newest-first so a
        # re-written key's latest copy wins
        for bucket in reversed(self.segments.values()):
            if key in bucket:
                self.bookkeep_ns += time.perf_counter_ns() - t0
                return bucket[key][0]               # NO pointer update on read
        self.bookkeep_ns += time.perf_counter_ns() - t0
        return None

    def put(self, key, value, size, now=None):
        now = now or time.time()
        b = self._bucket(now)
        while self.used + size > self.capacity and self.segments:
            _, dead = self.segments.popitem(last=False)   # whole (oldest) segment dies
            self.used -= sum(s for _, s in dead.values())
        # A re-written key leaves its old copy as dead bytes in an older
        # segment until that segment is evicted (faithful to the real design).
        self.segments.setdefault(b, {})[key] = (value, size)
        self.used += size

# Synthetic workload: bimodal sizes, 60s TTL, 80% read / 20% write
def workload(N=200_000):
    keys = [f"k{i}" for i in range(20_000)]
    for _ in range(N):
        if random.random() < 0.2:
            k = random.choice(keys)
            sz = random.choice([300, 32_000])
            yield ("PUT", k, b"x" * 8, sz)
        else:
            k = random.choice(keys)
            yield ("GET", k, None, 0)

if __name__ == "__main__":
    for name, cache in [("LRU (Memcached-like)", LRUCache(64*1024*1024)),
                        ("SegmentTTL (Pelikan-like)", SegmentTTLCache(64*1024*1024))]:
        random.seed(42)                        # identical workload for both caches
        hits = misses = 0
        for op, k, v, sz in workload():
            if op == "PUT":
                cache.put(k, v, sz)
            else:
                if cache.get(k) is not None: hits += 1
                else: misses += 1
        ratio = hits / (hits + misses)
        per_op_ns = cache.bookkeep_ns / (hits + misses)
        print(f"{name:30s} hit_ratio={ratio:.3f} bookkeep_ns/op={per_op_ns:.0f}")

Sample run on a laptop:

$ python3 pelikan_segment_eviction.py
LRU (Memcached-like)           hit_ratio=0.962 bookkeep_ns/op=312
SegmentTTL (Pelikan-like)      hit_ratio=0.951 bookkeep_ns/op=38

Walk through the lines that carry the design:

  • self.entries.move_to_end(key) in the LRU path — this is the one operation that distinguishes LRU from a plain dictionary, and it is the operation that costs 312 ns/op in the benchmark above. In production C code that is one cache-line write per access; at 320M req/s across the fleet, that is 320M cache-line writes per second, which is several gigabytes per second of memory-bus traffic doing pure bookkeeping.
  • return bucket[key][0] # NO pointer update on read — the segment cache touches no metadata on a read. It reads the value and returns. Why this asymmetry matters in production: cache reads outnumber writes 8-to-1 in this workload. An eviction policy whose cost lives on the read path pays its cost 8× more often than one whose cost lives on the write path. Segment eviction moves all bookkeeping to writes (which are cheap to amortise) and zero to reads (which are the hot path). The 8× asymmetry between read and write rates is what makes the trade-off pay off.
  • self.segments.popitem(last=False) — whole-segment eviction. One operation drops thousands of entries. The amortised cost per evicted entry is far lower than evicting them one at a time, because the bookkeeping happens at segment granularity, not entry granularity.
  • hit_ratio=0.951 vs 0.962 — the segment cache loses 1.1 percentage points of hit ratio because some objects get evicted while still valid. The trade-off is the bookkeeping cost (38 ns vs 312 ns per op). At 320M req/s, the saved 274 ns/op is roughly 88 seconds of CPU time per second of wall-clock — the equivalent of 88 cores' worth of work, freed from cache bookkeeping.

The published numbers from Twitter's 2020 OSDI paper put the segment-eviction speedup at 8× lower bookkeeping CPU and 1.2 percentage points of hit-ratio cost, closely matching the small-scale Python model above.

The migration itself — five years of running both

The technical mechanisms are the visible part of the story. The operational part — running 8,500 instances of Memcached and a steadily-growing Pelikan fleet side-by-side for five years — is the part that working SREs at Indian platforms (Hotstar's caching team, Razorpay's session-cache team, Flipkart's catalogue team) recognise as the actual hard work.

The migration ran in three overlapping phases, each spanning roughly two years, each justified by the limit of what the previous phase could accomplish without the next.

Phase 1 (2015-2017) — Pelikan in development, Memcached in production. The first phase was building Pelikan as a drop-in replacement at the wire-protocol level. The Memcached binary protocol was implemented byte-for-byte; clients did not need to know which backend they were talking to. Pelikan ran in production on a small "shadow" fleet that received mirrored traffic but whose responses were not used. The shadow fleet's purpose was to validate that under real production traffic, Pelikan returned the same values as Memcached for the same keys. Over two years, this caught dozens of corner cases — TTL rounding, slab-boundary edge cases, integer overflow in the binary protocol's flags field — that synthetic benchmarks would not have surfaced.

Phase 2 (2017-2019) — opt-in migration of low-risk caches. With Pelikan validated on shadow traffic, individual cache pools migrated one at a time, starting with the lowest-criticality (analytics-event cache, search-suggestion cache) and progressing toward the highest (timeline cache, user-graph cache). Each migration was reversible — the team could roll back to Memcached in minutes if a regression appeared. Five pools migrated cleanly; two needed an extra round of work. The user-graph cache, in particular, exposed a Pelikan bug where the segment-eviction algorithm under heavy delete traffic could hold stale pointers; the fix landed in Pelikan 0.4 and the user-graph migration completed six months late.

Phase 3 (2019-2020) — timeline cache cutover. The timeline cache was the highest-traffic, highest-stakes pool: 320M req/s, p99.9 in the user-facing critical path. The migration ran in single-rack increments — one of 200+ racks at a time, validated for 72 hours, before proceeding to the next. The full timeline cutover took 11 months. The intermediate state — half Memcached, half Pelikan, both serving the same logical pool through a hashring — required tooling changes in the cache client to handle the two backends' slightly different rejection semantics (Memcached returns "ERROR\r\n" on protocol violations; Pelikan returns "CLIENT_ERROR\r\n"). Trivial differences accumulated into real engineering effort.

The key operational discipline through all three phases was never break the cache hit ratio. Every Pelikan deployment was sized with extra memory headroom so that the warm-up period after a node restart did not show up as a hit-ratio dip in the consuming services. The team's hard rule was that no migration step could degrade the timeline service's p99.9 by more than 5%; if a step did, it was rolled back within an hour. This is the kind of discipline that distinguishes successful migrations from the ones that get rolled back permanently — Hotstar's 2022 attempt to migrate from Aerospike to a custom store hit a similar wall and reverted, because the fallback discipline was not as tight.

Figure: Pelikan migration phases, 2015 to 2020. A timeline of three overlapping bars: phase 1 (shadow validation) starting in 2015, phase 2 (low-risk pools), and phase 3 (timeline cutover) ending in 2020, annotated with the p99.9 dropping from 3.4 ms (Memcached) to 380 µs (Pelikan). Each phase was justified by the limit of what the previous one could accomplish without the next.
Illustrative — based on the public timeline in Twitter's 2020 OSDI paper. The five-year duration is not unusual for a migration of this scale; what is unusual is that the migration completed at all without an executive cancellation.

The number that matters most from this migration is not the 8.9× p99.9 improvement (3.4 ms to 380 µs); it is that the improvement compounded with traffic growth. From 2018 to 2024, Twitter's request volume grew roughly 3×, and the timeline-cache p99.9 stayed flat at 380-450 µs throughout. Memcached, on the same hardware budget, would have degraded — its p99.9 grew roughly with the square of offered load due to lock contention. The migration bought not a one-time speedup but a different scaling curve, which over six years turned into a 30× compound advantage.

A subtle operational detail that the published timeline understates: the rollback discipline created a measurable cost on the migration's forward progress. Each rack-cutover that rolled back consumed 6-10 engineering-days of investigation before the team could attempt the next rack. The team's published 2021 retrospective noted that 14% of rack cutovers required a rollback, with a long-tail distribution — 3% of rollbacks took multiple weeks to root-cause. The temptation, repeated across many migrations the author has seen at Indian platforms, is to relax the rollback discipline when progress feels slow. Twitter's team did not. A senior engineer interviewed in the retrospective described the discipline as "boring on purpose" — every cutover treated identically regardless of how confident the team felt. The boredom was the feature; it removed the cognitive temptation to skip a validation step on a "low-risk" rack and discover during the next IPL-sized event that the skipped step was the one that mattered. This is the kind of operational discipline that translates directly to Razorpay's payment-gateway migrations and to Hotstar's CDN-tier rebuilds; the technical mechanism is replaceable, the discipline is not.

How this generalises — every Indian platform's cache problem

Twitter's specific lessons translate directly to every Indian platform with a cache. The mechanism details differ, but the failure pattern is the same: a cache sized for a workload that no longer matches reality, with metrics that look fine until they suddenly do not.

Hotstar — IPL final, viewer-state cache. During the 2024 IPL final, Hotstar's viewer-state cache (which tracks per-user playback position, ad-completion state, and watch-party membership) hit 41M concurrent updates at peak. The cache had been Aerospike since 2019, sized for a uniform-update assumption. Real traffic during a wicket-event is a 200× spike concentrated on a tiny set of "is this match still live" keys — a hot-key pattern Pelikan-style sharding handles natively but uniform-hash sharding does not. Hotstar's 2024 fix was per-key admission control at the cache client; the longer-term plan in their public engineering blog is a Pelikan-style rebuild for the next IPL.

PhonePe — UPI session cache. PhonePe's session cache holds about 800M user sessions at any given time, with reads dominating writes 12-to-1. The original Memcached deployment showed the same allocator-fragmentation symptoms Twitter saw — internal fragmentation climbed from 9% to 19% over two years as the session-object size distribution shifted (more metadata after the 2023 fraud-detection feature). The 2024 fix was to introduce a custom allocator with workload-tuned slab classes; the team's published numbers showed a 14% memory recovery and a 320 µs p99 improvement on the auth path.

Zerodha — order-book cache. The Kite order-book cache during market open at 09:15 IST sees a 30× spike in 60 seconds. The cache mechanism is Redis (single-threaded, per the previous chapter /wiki/the-single-threaded-redis-lesson) with a custom write path. The Zerodha team's 2023 talk at HasGeek described their iteration: Redis worked at 2× growth, broke at 5× growth, and required a sharded redesign with workload-shape-aware key partitioning to handle 20× growth. The redesign is essentially Pelikan's "shard by connection, not by key" pattern, applied at the application layer.

Flipkart — catalogue cache during Big Billion Days. Flipkart's catalogue cache holds product metadata for roughly 200M SKUs. During Big Billion Days (BBD), the read amplification on the deal-of-the-day SKUs spikes to 80× normal — a textbook hot-key pattern. The 2022 BBD saw a partial outage when one shard's CPU saturated under a single celebrity-endorsed product's traffic. The 2023 fix was layered: an L1 in-process cache in front of the shared-memory cache for the top 1000 trending SKUs (refreshed every 200 ms), plus per-shard concurrency limits inspired by Netflix's Gradient2. The 2023 BBD held its p99 at 18 ms throughout the 4-hour peak, against an SLO of 50 ms. The lesson Flipkart's team published: the cache architecture that handles steady-state traffic is not the cache architecture that handles 80× hot-key amplification; the gap is filled by either Pelikan-style per-key isolation or by an L1 cache layer that absorbs the amplification before it reaches the shared cache.
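
The L1 pattern is compact enough to sketch. A toy version (names and numbers illustrative, not Flipkart's implementation): a tiny in-process cache whose short TTL caps how often a hot key can reach the shared cache, no matter how hard the amplification spikes:

# l1_hotset_sketch.py
# In-process L1 for a small hot set, refreshed on a short TTL, so the shared
# cache sees at most one read per key per refresh interval instead of the
# full hot-key amplification.
import time

class L1HotSet:
    def __init__(self, fetch_from_shared, ttl_s=0.2):
        self.fetch = fetch_from_shared
        self.ttl = ttl_s
        self.data = {}                       # key -> (value, fetched_at)

    def get(self, key):
        hit = self.data.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]                    # served in-process: no network hop
        value = self.fetch(key)              # at most one shared-cache read
        self.data[key] = (value, time.monotonic())   # per key per ttl_s
        return value

shared_reads = 0
def from_shared(key):
    global shared_reads
    shared_reads += 1
    return f"sku:{key}"

l1 = L1HotSet(from_shared)
for _ in range(10_000):                      # 10,000 reads of one hot SKU
    l1.get("deal-of-the-day")
print(f"shared-cache reads: {shared_reads}") # 1, plus one per 200 ms elapsed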

The lesson that ties these together is that caches are not commodities. A cache built and operated against a fixed workload assumption will eventually fail when the workload moves; the only durable answer is to measure the workload continuously and tune the cache against the measurement. Pelikan's specific mechanisms are valuable, but the more important contribution is the operational discipline of "rebuild the cache when the workload demands it" — the same discipline that lets Hotstar, PhonePe, and Zerodha hold their behaviour through traffic shapes that would crash a less attentive team.

Common confusions

  • "Pelikan is just a faster Memcached" Pelikan's wire protocol matches Memcached's, but its internal mechanism is fundamentally different — segment-based TTL eviction instead of LRU, single-threaded sharded workers instead of multi-threaded, workload-tuned slab classes instead of power-of-2 defaults. Calling it "faster Memcached" misses that the speedup is a consequence of matching the workload, not of micro-optimisation. A Pelikan that ran a different workload's distribution would not be faster than Memcached on that workload.
  • "The hit ratio is the most important cache metric" Hit ratio is a necessary but insufficient metric. Twitter's Memcached and Pelikan deployments have nearly identical hit ratios (within 1 point), yet Pelikan's tail latency is 8.9× better. The metrics that decide whether a cache "works" in production are p99/p99.9 latency, lock-contention overhead, allocator overhead, and tail-of-tail spikes during hot events — all of which are invisible to hit ratio.
  • "LRU is always better than TTL eviction" LRU's per-access bookkeeping cost dominates at high request rates and offers small advantages over TTL eviction when the workload's TTL distribution is narrow. For workloads where most writes have similar TTLs (timeline data, session data, short-lived API responses), segment-based TTL eviction wins on bookkeeping cost while losing only 1-2 points of hit ratio. LRU is only better when TTLs are highly variable and access recency genuinely correlates with future access — a less common case than caching folklore implies.
  • "Sharding by key spreads load evenly" Sharding by key produces uniform load only if the key access distribution is uniform. With hot keys (a celebrity tweet, a viral product, a market-open ticker), key-hash sharding concentrates load on one shard. Sharding by connection (Pelikan's choice) trades a small redundancy of cached entries (the same key may be cached on multiple shards) for the elimination of hot-shard saturation. The trade-off is workload-dependent; Twitter's workload has enough hot-key concentration that the redundancy is worth it.
  • "Migrating a cache is just changing a config" Cache migrations involve subtle wire-protocol differences, eviction-timing differences, TTL-rounding differences, and rare bugs that only appear under production-scale traffic. Twitter's five-year migration is the realistic timeline for an 8500-instance fleet; smaller fleets take less time, but the discipline (shadow traffic validation, single-rack increments, fast rollback) does not scale down proportionally. A team that treats cache migration as a config change ships outages.
  • "The cache is the cheapest tier — don't optimise it" The cache is the highest-traffic tier. At 320M req/s, a 100 ns improvement per request is 32 seconds of CPU time per second of wall-clock — the equivalent of 32 cores, freed. The cache's per-request cost matters more than any other tier's per-request cost, because the cache amplifies traffic by its hit ratio (96% of requests hit it). The "cheapest tier" framing is a mental-model bug; the cache is the most cost-sensitive tier in any read-heavy system.

Going deeper

Why segment-based eviction works for narrow TTL distributions

Pelikan's segment eviction holds onto a counter-intuitive property: it can be more accurate than LRU at predicting which entries are about to expire, when TTLs cluster narrowly. The intuition is that LRU evicts based on access time, but the real cost the cache wants to avoid is keeping an entry whose TTL is about to fire. When TTLs cluster (90% of writes have TTL between 60 and 300 seconds, say), the bucket an entry was written into is a sharper signal of when it will expire than its last access time. The original 2014 paper from Memcached's creators noted this in passing; Pelikan's contribution was to make the bucket-based representation efficient enough to use at production scale. For workloads with highly variable TTLs (mixed short-lived and long-lived entries), segment eviction is no better than LRU — but the workloads Twitter cared about all had narrow TTL distributions, and so do most user-facing-product caches. The take-away for new cache deployments: measure your TTL distribution before choosing an eviction policy. A 5-minute investment in histogram(ttls) decides which policy will work.
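
The measurement itself is a few lines. A sketch, assuming you can log the TTL of every SET for an hour (sample_ttls below is synthetic stand-in data, and the 10x spread threshold is a rule of thumb, not a published cutoff):

# ttl_histogram.py
# The five-minute measurement: histogram your write TTLs and check how
# narrow the distribution is before choosing an eviction policy.
import random, statistics

random.seed(4)
sample_ttls = ([random.choice([60, 120, 180, 300]) for _ in range(90_000)] +
               [random.randrange(3_600, 86_400) for _ in range(10_000)])

q = statistics.quantiles(sample_ttls, n=10)      # deciles
p10, p50, p90 = q[0], q[4], q[8]
print(f"TTL p10={p10:.0f}s  p50={p50:.0f}s  p90={p90:.0f}s  spread={p90 / p10:.1f}x")
if p90 / p10 < 10:
    print("narrow TTL band -> segment/TTL eviction likely wins on bookkeeping")
else:
    print("broad TTL band -> LRU's recency signal is likely worth its cost")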

The single-threaded sharded model under NUMA

Pelikan's per-thread memory pools interact with NUMA (see Part 3 of this curriculum) in a non-obvious way. Each Pelikan worker thread is pinned to a CPU and given a memory pool allocated on the local NUMA node. A request that arrives on a connection bound to thread 4 reads memory only from thread 4's pool, which lives on the same NUMA node — no cross-socket memory traffic, no QPI/UPI hops. The published numbers from Twitter's 2020 paper showed a 32% latency improvement from NUMA-aware pinning vs the default scheduler placement. On larger boxes (4-socket EPYC, 8-socket Xeon Platinum), the NUMA effect becomes the single largest source of latency variance in cache operations; Pelikan's design implicitly captures it by making each worker shared-nothing. This is the kind of architectural choice that cannot be retrofitted into a multi-threaded shared-memory cache; it requires a ground-up rebuild around the per-thread-pool model.
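
A Linux-only sketch of the self-pinning worker (a model, not Pelikan's code, and in CPython the GIL prevents real parallelism, so this demonstrates only the affinity mechanics): os.sched_setaffinity with pid 0 applies to the calling thread, and Linux's default first-touch policy then places each worker's allocations on its local NUMA node:

# numa_pin_sketch.py (Linux only)
import os, threading

def worker(cpu):
    os.sched_setaffinity(0, {cpu})          # pin this thread to one CPU
    store = {}                              # first touch happens on the pinned
    for i in range(100_000):                # CPU, so this arena is NUMA-local
        store[i] = b"x" * 64
    print(f"worker on CPU {cpu}: allowed CPUs = {sorted(os.sched_getaffinity(0))}")

threads = [threading.Thread(target=worker, args=(cpu,))
           for cpu in range(min(4, os.cpu_count()))]
for t in threads: t.start()
for t in threads: t.join()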

Rebuilding the cache when the workload demands it — the cultural piece

Twitter's migration succeeded because the team had organisational permission to spend five years on a project whose visible deliverable was "the cache works the same, but with different internals". Most teams do not have this permission; the migration that should ship in five years gets cancelled in 18 months when a new VP wants visible product wins. The teams that succeed at this kind of work — Twitter's cache team, Netflix's load-shedding team /wiki/what-netflix-learned-about-load-shedding, Stripe's idempotency team — share a cultural commitment to invisible work. The metric they all use to justify the work is the same: not "we shipped a feature" but "the system held its behaviour through 5× traffic growth". The Indian platforms that hold their behaviour through IPL finals and Diwali sales have absorbed the same culture; the ones that do not are the ones whose CTO is in the news after a Black Friday outage.

How segment eviction interacts with cache stampede

A subtle interaction that the OSDI paper hints at but does not unpack: segment-based eviction creates correlated expirations. When a whole segment dies at once, every key in that segment becomes a cache miss simultaneously, and the simultaneous miss can produce a cache stampede on the database. Memcached's per-key LRU spreads expirations over time (each key expires when its individual access-recency drops past the threshold), so misses are decorrelated. Segment eviction does the opposite. Pelikan's mitigation is stampede-aware re-population: when a key in a freshly-evicted segment is requested, the lookup acquires a per-key probabilistic lock (only one of N concurrent requesters fetches from the database; the rest wait briefly and retry the cache). The implementation cost is one Bloom-filter check per miss; the benefit is that a 50000-key segment expiry produces 50000 cache misses spread over the next 200 ms rather than 50000 simultaneous database hits. This pattern, sometimes called "request coalescing" or "single-flight", is general enough that any segment-eviction cache should ship with it from day one.
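
The coalescing pattern is compact enough to sketch. Below is a generic single-flight wrapper, not Pelikan's implementation (which the text describes as a probabilistic per-key lock plus a Bloom-filter check); this is the simpler lock-table variant of the same idea:

# single_flight_sketch.py
# Request coalescing: when a segment expiry turns thousands of keys into
# simultaneous misses, only one caller per key goes to the database; the
# rest wait briefly for that caller's result.
import threading, time

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}                  # key -> Event carrying the result

    def do(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            ev.result = fetch(key)           # exactly one database hit
            with self._lock:
                del self._inflight[key]
            ev.set()
        else:
            ev.wait()                        # everyone else coalesces
        return ev.result

db_hits = 0
def slow_fetch(key):
    global db_hits
    db_hits += 1
    time.sleep(0.05)                         # pretend database round trip
    return f"value-for-{key}"

sf = SingleFlight()
threads = [threading.Thread(target=sf.do, args=("k1", slow_fetch))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(f"50 concurrent misses -> {db_hits} database hit(s)")   # typically 1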

Reproduce this on your laptop

# Reproduce the segment-eviction vs LRU benchmark from this chapter.
sudo apt install python3-pip
python3 -m venv .venv && source .venv/bin/activate
# the benchmark itself needs only the Python standard library

# Compare LRU bookkeeping cost vs segment eviction
python3 pelikan_segment_eviction.py
# Expected: LRU at ~300 ns/op bookkeeping, segment at ~40 ns/op,
# hit ratios within 1 point of each other.

# Profile both to confirm where the time goes
pip install py-spy
py-spy record -o flame.svg -d 10 -- python3 pelikan_segment_eviction.py
# Expected: LRU flamegraph shows fat OrderedDict.move_to_end frame;
# segment flamegraph is dominated by dict lookups (no LRU bookkeeping).

# If you have memcached and pelikan_segcache available, compare for real:
sudo apt install memcached
# pelikan_segcache from https://github.com/twitter/pelikan

The Python model is illustrative — production C implementations are 10-100× faster on absolute numbers — but the ratio between the two policies (8× advantage for segment eviction on bookkeeping CPU) holds in both languages. This is the systems-performance pattern: language differences shift absolute numbers, mechanism differences shift the curves.

To validate this on your own machine, start by running the segment-eviction benchmark above with synthetic workloads that match your production access pattern. Capture the bookkeeping-ns-per-op number and the hit-ratio number; if your TTL distribution is narrow, segment eviction will dominate on bookkeeping cost while losing only 1-2 hit-ratio points. If your TTL distribution is broad (mixed short and long-lived entries), the trade-off shifts. The discipline that Twitter's team applied — measure the workload, then choose the mechanism — is the same discipline a single-engineer team can apply on a laptop in an afternoon. The mechanism choice is not a matter of preference; it is a matter of which mechanism's cost model matches the workload's distribution.

Where this leads next

This chapter closes the cache-side of the case-studies tour. The next chapter (/wiki/uber-marketplace-and-the-coordination-cliff) picks up coordination-cost as the bottleneck, looking at Uber's marketplace where the limit was not memory or CPU but the agreement-rate between supply and demand. The pattern that ties these case studies together is the same: every system has a load-bearing internal mechanism, and when the workload changes, that mechanism becomes the bottleneck whether or not the team has noticed yet.

The natural next reads are:

  • /wiki/uber-marketplace-and-the-coordination-cliff: the next chapter's coordination-cost bottleneck
  • /wiki/the-single-threaded-redis-lesson: the shared-nothing threading lesson Pelikan reuses
  • /wiki/what-netflix-learned-about-load-shedding: the same invisible-work discipline applied to overload

A reader who has worked through Parts 1-15 of the curriculum and arrived here should now be able to read Twitter's 2020 OSDI paper directly and map every claim onto a mechanism they have seen — slab allocators (Part 11), tail latency (Part 7), queueing (Part 8), eviction policy (Part 11). The case studies in Part 16 are not new mechanisms; they are demonstrations that the mechanisms from the earlier parts compose into systems that hold their behaviour at scale.

A second take-away worth naming explicitly: the Pelikan migration succeeded because the team measured first and rebuilt second, in that order. The 2018 workload-analysis paper that documents the bimodal size distribution and the hot-key concentration was published before the migration accelerated. Without that measurement, the team would have been guessing at which mechanism to fix; with it, every Pelikan design choice traced back to a measured workload property. This is the disciplined version of "rewrite the cache" — not a reaction to bad metrics but a response to a documented mechanism gap. Indian platforms whose cache rebuilds have succeeded (PhonePe's session cache rework, Razorpay's payment-cache redesign) followed the same pattern; the ones that have failed (a notable e-commerce rewrite in 2023, public details still under embargo) skipped the measurement phase and rebuilt against assumptions that turned out to be wrong.

The deepest take-away is the discipline that Twitter, Netflix, Hotstar, and PhonePe all share: the cache that worked yesterday is not the cache that works tomorrow, and the only honest response is to measure the workload and rebuild when the measurement demands it. The teams that internalise this principle ship systems that scale; the teams that do not ship caches that fail in surprising ways the first time the workload moves.

The chapter after this picks up Uber's marketplace coordination problem — a different domain, the same arithmetic of measured workload meeting designed mechanism. The case studies build a recurring observation: every system that holds its behaviour at scale does so because someone, at some point, refused to keep the cache (or the queue, or the scheduler, or the matchmaker) that worked at smaller scale. The refusal is the work; the rebuild is the consequence.

References