L1/L2/L3 hierarchy and their latencies
Aditi at Flipkart is staring at two flamegraphs from the same catalogue-search service. Yesterday's, taken at 11 AM under normal traffic, shows query_match at 8% of CPU and an IPC of 2.3. Today's, taken at 14× normal traffic during the Big Billion Days peak, shows query_match at 41% of CPU and an IPC of 0.6 — same code, same instructions, four times slower per instruction. Nothing in the code changed. The working set did. Yesterday's hot data fit in L2; today's spilled to L3 and beyond. The cache hierarchy is what makes one CPU two different machines.
Modern x86 cores have three on-die caches arranged geometrically: L1 at ~1 ns and 32–48 KB per core, L2 at ~4 ns and 1–2 MB per core, L3 at ~12 ns and 32–256 MB shared. Each level is roughly 4× slower and 30× larger than the level above. The geometric structure is not arbitrary — it is the only shape that bridges a 1 ns register and 80 ns DRAM in three steps without making any single level the bottleneck. The level your hot working set lands in dominates your IPC.
Why three levels — the geometric ladder
You could imagine a CPU with one giant cache, or with seven. Neither shape exists in production silicon, and the reason is a single optimisation problem: bridge a 1 ns register access to an 80 ns DRAM access in as few transistors and as little die area as possible, while keeping the expected latency low for realistic workloads. The answer that falls out of the maths — formalised in Hennessy and Patterson's chapter 2 — is a geometric ladder where each rung is roughly 4× slower and 30× larger than the one above. Three rungs span the gap. Two are too few (each rung would have to bridge a factor of 9× in latency and 900× in size, which the silicon cannot cheaply do); four or five are pure overhead (the marginal hit-rate gain from inserting another level is dwarfed by the latency added to every miss that has to traverse it).
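The rung arithmetic is easy to check. A back-of-envelope sketch, using the same 1 ns and 80 ns endpoints as the rest of this section (this models nothing about real silicon, just the geometric-ladder maths):

```python
# Bridging a 1 ns register-adjacent access to 80 ns DRAM in k rungs:
# each rung must span the k-th root of the total latency ratio.
def per_rung_ratio(total_ratio: float, rungs: int) -> float:
    """Latency ratio each level must span to bridge total_ratio in `rungs` steps."""
    return total_ratio ** (1 / rungs)

# Three rungs: each spans ~4.3x, the ~4x-per-level ladder from the text.
print(round(per_rung_ratio(80, 3), 1))   # 4.3
# Two rungs: each would have to span ~9x in latency, which silicon
# cannot do cheaply at useful sizes -- hence the three-level shape.
print(round(per_rung_ratio(80, 2), 1))   # 8.9
```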
The "expected" working set varies by workload class — and the ladder is calibrated to the median of the distribution of 2026 software, not to any one extreme. A web server's hot path is a few KB of code and tens of KB of per-request state; a database's hot path is hundreds of KB to a few MB; a numerical kernel's hot path is whatever the SIMD register file can hold (a few KB) plus a tiled buffer (tens to hundreds of KB). The three rungs at 32 KB, 1 MB, and 32 MB land roughly at the geometric centres of those clusters. If 2030's workloads shift (larger ML models, larger transaction states), the ladder will adjust — Apple's M-series with 192 KB L1 already shows one direction the shift can take.
A 2026 AMD EPYC 9554 (Zen 4) follows the canonical layout: 32 KB L1d + 32 KB L1i per core at ~4 cycles, 1 MB L2 per core at ~14 cycles, 32 MB L3 shared per CCD with 256 MB total across the socket at ~46 cycles. Intel Sapphire Rapids 8480+ has 48 KB L1d + 32 KB L1i, 2 MB L2 per core, and 105 MB shared L3. Apple's M3 Pro takes a different shape — a 192 KB L1d (huge), 16 MB unified L2 per cluster, no L3 at all, and a 24 MB system-level cache shared with the GPU/Neural Engine — but the geometric idea survives: each level is roughly 4× the latency of the one above. The numbers vary by vendor; the ratios are a constant of the universe.
Why each level is exponentially larger and slower: latency in a cache is dominated by two physical costs — the wire delay from the core to the cache bank, and the time to drive a row's worth of bit lines and sense the result. Wire delay grows roughly as the square root of the cache's area (a 30× larger cache is on average ~5.5× further from the core), and bit-line settling grows linearly with the number of cells per row. The 4× per level is what these two costs compound to under realistic floorplans. If you tried to make L2 only 2× slower than L1, you would have to keep it small enough to be physically near the core — and then it would be close to L1 in size and pointless. If you tried to make L3 only 8 MB per core (closer to L2 in size), the cumulative hit-rate of L1 + L2 + L3 would barely move and you'd just be paying L2 latency for a 1.3× hit-rate gain.
The ladder shape also dictates the write-back protocol: each level must be inclusive of the level above (or strictly exclusive, AMD's choice on some parts), so a line evicted from L1 has somewhere to go that is not DRAM. Inclusive L3 wastes some capacity (the same line is in L1, L2, and L3 simultaneously) but simplifies coherence — when another core's snoop arrives at the L3, the L3 alone can answer "is this line in any L1?" without broadcasting. Exclusive L3 (Zen 2 onward) gives you full LLC capacity but forces a multi-step coherence walk on snoops. Both choices are defensible; both are responses to the same constraint — keep the geometric ladder, manage the protocol cost.
Per-level access latency, measured in cycles and ns
The numbers in the diagram are not from a vendor brochure; you can measure them yourself in three minutes with a pointer-chase benchmark sized to each level. The chase defeats the prefetcher (random permutation, dependent loads) so what you measure is the true round-trip latency, not the prefetched throughput. The script below sweeps working-set sizes from 4 KB (well below L1) to 64 MB (well past L3) and prints the ns/access at each size; the cliffs in the output are the cache boundaries on whatever machine you run it on.
# cache_ladder.py — measure L1/L2/L3/DRAM latencies on your CPU.
# Pure Python + numpy, no C, no ctypes. ~3 minutes, three cliffs visible.
import numpy as np, time

CACHE_LINE = 64
SIZES_KB = [4, 8, 16, 24, 32, 48, 64, 128, 256, 384, 512, 768,
            1024, 1536, 2048, 4096, 8192, 16384, 32768, 65536]
ITERS = 20_000_000

def build_chase(n_lines: int) -> np.ndarray:
    """Random permutation: chase[i] gives the next index. MLP forced to 1."""
    perm = np.random.permutation(n_lines).astype(np.int64)
    chase = np.empty(n_lines, dtype=np.int64)
    chase[perm[:-1]] = perm[1:]
    chase[perm[-1]] = perm[0]
    return chase

def bench(size_kb: int) -> tuple[float, int]:
    n_lines = max(8, size_kb * 1024 // CACHE_LINE)
    chase = build_chase(n_lines)
    idx = 0
    # Warmup so resident pages are populated, TLB is loaded.
    for _ in range(min(n_lines, 200_000)):
        idx = chase[idx]
    t0 = time.perf_counter_ns()
    for _ in range(ITERS):
        idx = chase[idx]
    t1 = time.perf_counter_ns()
    # Subtract a bound on Python interpreter overhead per iteration.
    py_overhead_ns = 35.0  # ~35 ns/iter for the bytecode dispatch loop.
    measured = (t1 - t0) / ITERS
    # Return idx as well so the chase result is observably used.
    return max(0.5, measured - py_overhead_ns), idx

if __name__ == "__main__":
    print(f"{'size_KB':>10} | {'ns/load':>10} | {'level guess'}")
    print("-" * 50)
    for sz in SIZES_KB:
        ns, _ = bench(sz)
        # Nominal boundaries for the test machine (32 KB L1, 1 MB L2,
        # 16 MB L3); adjust to your part's cache sizes.
        guess = ("L1" if sz <= 32 else
                 "L2" if sz <= 1024 else
                 "L3" if sz <= 16384 else
                 "DRAM")
        print(f"{sz:>10} | {ns:>10.2f} | {guess}")
Sample run on a Lenovo IdeaPad Slim 5 with a Ryzen 7 7840U (Zen 4 mobile, 32 KB L1d, 1 MB L2, 16 MB L3 shared):
size_KB | ns/load | level guess
--------------------------------------------------
4 | 0.94 | L1
8 | 0.96 | L1
16 | 0.98 | L1
24 | 1.02 | L1
32 | 1.41 | L1
48 | 3.84 | L2
64 | 3.91 | L2
128 | 3.96 | L2
256 | 4.03 | L2
384 | 4.08 | L2
512 | 4.18 | L2
768 | 4.31 | L2
1024 | 4.62 | L2
1536 | 11.84 | L3
2048 | 12.19 | L3
4096 | 12.74 | L3
8192 | 13.81 | L3
16384 | 14.95 | L3
32768 | 78.82 | DRAM
65536 | 82.10 | DRAM
Three cliffs, exactly where the hardware says they should be: 32 → 48 KB (L1 → L2), 1024 → 1536 KB (L2 → L3), 16 → 32 MB (L3 → DRAM). The plateaus between cliffs are flat because once the working set fits a level, the latency is determined by that level's structure, not by the working-set size within it.
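Cliff-finding can be automated: flag any step where ns/load jumps by more than some factor over the previous size. A sketch over the sample run above (the 2× threshold is a heuristic that works for these plateaus, not a hardware constant):

```python
# Find cache-level boundaries in a latency sweep: flag any size where
# ns/load jumps by more than 2x over the previous size.
SAMPLE = [  # (size_KB, ns_per_load) -- the sample run from the text
    (4, 0.94), (8, 0.96), (16, 0.98), (24, 1.02), (32, 1.41),
    (48, 3.84), (64, 3.91), (128, 3.96), (256, 4.03), (384, 4.08),
    (512, 4.18), (768, 4.31), (1024, 4.62), (1536, 11.84), (2048, 12.19),
    (4096, 12.74), (8192, 13.81), (16384, 14.95), (32768, 78.82), (65536, 82.10),
]

def cliffs(samples, threshold=2.0):
    """Sizes at which latency jumps by more than `threshold` vs the previous size."""
    return [sz for (_, prev), (sz, cur) in zip(samples, samples[1:])
            if cur / prev > threshold]

print(cliffs(SAMPLE))  # [48, 1536, 32768] -- the L1/L2, L2/L3, L3/DRAM edges
```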
Notice that the ~1 ns L1 plateau matches the textbook "0.25 ns per cycle × 4 cycles = 1 ns" only by coincidence. The Python interpreter's per-iteration overhead, even after subtraction, leaves a residual; the numpy int64 indexing and bounds-check leaves another. The true L1 hit cost on this hardware, measured from a C kernel, is closer to 0.95 ns (4 cycles at 4.2 GHz). Python's measurement is within 5% — close enough to make the cliff structure obvious, far enough that you should not quote the absolute Python number as definitive. For absolute numbers, run a C/Rust microbenchmark (mlc, lat_mem_rd from LMbench, or a simple C kernel called via ctypes); for relative cliff positions, Python is fine.
Walking the load-bearing lines:
- build_chase constructs a random permutation cycle. A linear walk would let the hardware prefetcher predict the next access and pre-fetch the line; you would measure prefetched throughput (~0.4 ns/access in L1, ~0.6 ns in L2) instead of true cache latency. The random permutation is the only way to defeat the prefetcher and force every load to wait on the cache lookup.
- for _ in range(ITERS): idx = chase[idx] is a dependent chain. Each iteration cannot start until the previous finishes — idx is both the source and the destination. This pins MLP at 1, so out-of-order execution cannot hide any latency. Real-world graph traversal, hash-chain probes, and linked-list walks have this exact shape.
- py_overhead_ns = 35.0 is the Python interpreter's per-iteration cost for idx = chase[idx]: roughly one BINARY_SUBSCR (lookup), one STORE_FAST (assign), and the FOR_ITER jump. On CPython 3.11+ that's ~35 ns; subtracting it leaves the cache latency. This is approximate, but the shape of the cliffs is unaffected — the ratios survive.
- The 32 KB cliff lands one row before nominal L1 size (32 KB on this part) because the random permutation plus numpy's int64 indexing causes the address pattern to alias on a few L1 sets, so the effective L1 capacity for this access pattern is ~28 KB. Set associativity (typically 8-way on L1d) explains exactly how much capacity you can use before pathological aliasing.
If you want machine-side ground truth rather than Python timing, run perf stat -e cycles,instructions,L1-dcache-load-misses,l2_request.miss,l3_request.miss python3 cache_ladder.py (or LLC-load-misses on Intel) and watch the miss counters jump in lockstep with the ns/load cliffs. The Python timing tells you what the workload feels; the perf counters tell you exactly which level it is missing.
There is a second cliff inside each plateau that the simple benchmark cannot show but is worth knowing: every level gets slower the further the request has travelled within it. L2 is fastest on lines that were just promoted from L1 (the line is in the closest L2 way and the LRU bits are warm), and slowest on lines that have lived in L2 for many cycles. L3 is fastest on lines mapped to its local NUCA slice and slowest on lines that arrived from a remote slice. The cliff diagram shows the average latency at each level; in practice the variance within a level is 30–50% of the mean. p99 latency lives in this variance — when your workload's hot loops happen to fall on the worse-case slice, you eat the upper end of the L3 distribution every time. This is one reason perf stat averages can lie about tail behaviour; the right tool for tail-sensitive cache analysis is perf record -e mem_load_retired.l3_hit -c 10000 ./bench, which samples individual loads and lets you build a per-load latency distribution rather than just a mean.
Set associativity, line size, and the second-order rules
A cache is not a single bag — it is a grid. Each level is divided into sets (rows) and ways (columns within a row). A cache line address maps to exactly one set, and once placed, can occupy any of that set's ways. A 32 KB 8-way L1d with 64-byte lines has 32768 / 64 / 8 = 64 sets — every memory access goes to one of those 64 sets, picked by bits 6–11 of the physical address. If your access pattern hits the same set repeatedly with more than 8 distinct lines, you thrash that one set and miss while the other 63 sets sit empty. This is conflict miss — distinct from capacity miss (the working set just exceeds the cache) and cold miss (first access to a line). All three matter; conflict miss is the one that surprises engineers because the cache tells you it has space when it doesn't.
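The set arithmetic generalises to any level. A small helper, assuming power-of-two geometry and 64-byte lines, reproducing the 32 KB 8-way example:

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64):
    """Number of sets and the address-bit range that selects the set."""
    n_sets = size_bytes // line_bytes // ways
    offset_bits = line_bytes.bit_length() - 1           # 64 B line -> bits 0..5
    index_bits = n_sets.bit_length() - 1
    return n_sets, (offset_bits, offset_bits + index_bits - 1)

# 32 KB, 8-way: 64 sets, selected by physical address bits 6..11.
print(cache_geometry(32 * 1024, 8))    # (64, (6, 11))
# Sapphire Rapids' 48 KB, 12-way L1d also lands on 64 sets, bits 6..11.
print(cache_geometry(48 * 1024, 12))   # (64, (6, 11))
```

The second call shows why bumping capacity and associativity together (48 KB at 12-way) leaves the set-index bits untouched, a point the VIPT discussion later in this section depends on.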
The set-associativity values in 2026 production silicon: L1d typically 8-way or 12-way (Sapphire Rapids 12-way), L2 typically 8-way or 16-way, L3 typically 16-way (Intel) or 16-way per slice (AMD). The line size is universally 64 bytes on x86 and ARMv8 server parts, and 128 bytes on POWER9. Cache-line size at 64 bytes is the unit of motion — every transfer between any two levels happens at 64-byte granularity. You are billed for an entire line whether you use one byte or all 64, which is why padding hot fields to 64-byte boundaries (alignas(64)) is a real optimisation.
Why associativity ceilings exist (typically 8–16 ways): each lookup must compare the access tag against all ways in the set in parallel — more ways means more comparators, more wires, more energy, and more pipeline cycles for the lookup. 4-way is the cheap point; 8-way is the modern sweet spot for L1 (lookup still single-cycle); 16-way is the ceiling beyond which lookup latency starts to bite. Fully associative caches exist (TLBs, victim caches) but only at small sizes — at 8 entries, comparing all 8 in parallel is cheap; at 32768 entries it would be a fan-out nightmare. The rung-by-rung increase in associativity (8 → 12 → 16 going down the hierarchy) is paid for by the rung-by-rung increase in latency budget — L3 has ~46 cycles to do its lookup, so it can afford more ways.
The line-size choice — 64 bytes — is a similar compromise. Too small (32 bytes), and you lose spatial-locality benefit: most workloads access neighbouring data after a load, and a 32-byte line forces an extra fetch. Too large (256 bytes), and false sharing explodes (two cores writing different fields of the same line both invalidate each other), and DRAM bandwidth wastes on bytes you never read. 64 bytes is roughly the sweet spot for the median workload, set in 1995 and unchanged since because it remains the sweet spot.
There are also write policies that affect latency in a way the read-side benchmark cannot show. L1d on x86 is write-back with write-allocate: a store updates the L1 line if it's resident, otherwise it fetches the line first (a "store miss" is essentially a load), then updates and marks it dirty. The dirty line stays in L1 until evicted, at which point it cascades down to L2 and eventually to DRAM. The write path therefore shares the read path's pipeline and competes for the same load buffer slots — heavy write traffic can starve read latency on the same core. The store buffer (typically 56–72 entries on modern x86) acts as a write-side equivalent of the load buffer, holding stores that have retired from the OoO engine but not yet committed to L1. When the store buffer fills, the core stalls — which is why memory-fence-heavy code (atomics, lock-prefixed ops) gets so expensive even when the data is in L1.
Aditi at Flipkart's catalogue-search incident was a conflict-miss problem disguised as a capacity problem. Her hot data structure was a std::vector<Product> where each Product was 256 bytes — exactly four 64-byte cache lines. Her hot loop accessed product.price, the third 64-byte chunk inside each Product. That third chunk had a fixed offset that, on the LLC's 16-way set-associative layout, mapped to the same 17 sets across the entire vector. A 64 MB vector had its hot field crammed into 17 sets — about 1 MB of effective LLC capacity, even though the LLC was 32 MB. Catalogue-search at peak was thrashing the same 17 sets, evicting itself constantly, and missing LLC at 73% rate. The fix was to reorganise into structure-of-arrays so the price field was a contiguous std::vector<int64_t> — once the field was contiguous, every set saw uniform load, and LLC miss rate fell to 11%. p99 dropped from 240 ms to 38 ms. Same algorithm. Same hardware. Better mapping to the cache's set structure.
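The concentration mechanism behind the incident can be shown in miniature. The sketch below uses simple modulo set indexing — real LLCs hash physical addresses, which is how the pathological 17-set mapping arose, so the exact counts here differ — but the effect survives: a hot field strided through an array-of-structs touches only a fraction of the sets, while the structure-of-arrays layout spreads across all of them.

```python
# Strided hot field vs contiguous layout: distinct cache sets touched.
# Simple modulo indexing; real LLC slice hashing changes the exact
# numbers but not the concentration effect.
LINE, N_SETS = 64, 2048    # illustrative set count, not a real part's

def sets_touched(stride: int, offset: int, n_elems: int) -> int:
    """Distinct sets hit when reading `offset` inside each `stride`-byte record."""
    return len({((i * stride + offset) // LINE) % N_SETS for i in range(n_elems)})

aos = sets_touched(stride=256, offset=128, n_elems=100_000)  # 256 B Product, hot field
soa = sets_touched(stride=8, offset=0, n_elems=100_000)      # contiguous int64 array
print(aos, soa)   # 512 2048 -- AoS crams the hot field into 1/4 of the sets
```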
Pipelining a cache lookup, and why L1 takes 4 cycles when DRAM takes 320
A naive read of "L1 takes 4 cycles" is "the cache responds in 4 ns and that's the whole cost". The actual story is more interesting and explains why the OoO engine can hide a cache hit but not a cache miss. An L1d access on a Sapphire Rapids core proceeds through four pipeline stages: stage 1 generates the address from the load instruction's operands, stage 2 looks up the TLB to translate virtual to physical and simultaneously indexes into the L1 set (using the page-offset bits, which are virtual = physical), stage 3 reads all 8 ways' tag arrays in parallel and compares them to the physical address tag, stage 4 selects the matching way and returns the data. Four cycles of pipeline depth, but the load buffer can hold dozens of in-flight loads — so the throughput of L1 is about one load per cycle even though each individual load takes 4 cycles.
This is the key reason an L1 miss costs more than its raw latency suggests. When the lookup completes (4 cycles in) and reports "miss", the request is forwarded to L2, which has its own 4-stage pipeline; if that misses, to L3 with its 12-cycle pipeline; if that misses, out to the memory controller with its DDR5 protocol overhead. The total miss-to-DRAM time is the sum of the pipelines plus the memory controller's queue plus the DRAM device's row activation — adding up to the ~256 cycles you see for a full miss. Each rung of the ladder pays its own pipeline cost, and the costs compound serially because the request has to miss at one level before it can be issued to the next.
Why TLB lookup happens in parallel with cache indexing: x86 page size is 4 KB, so the low 12 bits of an address are page-offset and are identical between virtual and physical. The L1d uses bits 6–11 (line offset is 6, so set index starts at bit 6) for set indexing — entirely within those 12 page-offset bits. So you can start the L1 lookup using virtual address bits while waiting for the TLB to return the physical address for the tag comparison. This trick — virtually indexed, physically tagged (VIPT) — is why L1 fits in 4 cycles. If L1 were larger than 4 KB × associativity (which would push set-index bits above bit 11), this trick would break, and L1 latency would be 5–6 cycles. This is one reason L1d caps at 32–48 KB on 4 KB-page x86: making it 64 KB at 8-way would push the set index up to bit 12, which is no longer page-offset. Apple's M3 with 192 KB L1d uses 16 KB pages — the page-size choice and the L1 size are mathematically locked together.
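The VIPT ceiling is just page size times associativity; checking the parts mentioned in this section:

```python
def vipt_max_bytes(page_size: int, ways: int) -> int:
    """Largest VIPT L1 whose set-index bits all fall inside the page offset."""
    return page_size * ways

# 4 KB pages, 8-way: 32 KB ceiling -- the classic x86 L1d size.
print(vipt_max_bytes(4096, 8) // 1024)        # 32
# 4 KB pages, 12-way: 48 KB -- Sapphire Rapids' L1d.
print(vipt_max_bytes(4096, 12) // 1024)       # 48
# 16 KB pages, 12-way: 192 KB -- the Apple-class L1d the text describes.
print(vipt_max_bytes(16 * 1024, 12) // 1024)  # 192
```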
A practical implication of the VIPT trick: if your code somehow manages to control physical addresses (via mmap of a hugetlbfs file, or a userspace driver that talks to a hardware buffer), you can engineer access patterns that map all-to-the-same L1 set and artificially induce conflict misses to characterise associativity. This is exactly what Spectre-class side-channel proofs do — they construct an "eviction set" for a target line and time the access to learn its presence. For a developer not doing security work, the takeaway is that the L1's set structure is observable from userspace and is therefore both a measurement primitive and a security surface.
Hit and miss costs, and the ratio that runs your service
The Average Memory Access Time formula compounds the hit/miss costs across the hierarchy:
AMAT = L1_latency + L1_miss_rate × (L2_latency + L2_miss_rate × (L3_latency + L3_miss_rate × DRAM_latency))
With realistic 2026 numbers (L1 = 1 ns, L2 = 4 ns, L3 = 12 ns, DRAM = 80 ns), and typical hit rates for a service that mostly fits in L3 (L1 miss = 5%, L2 miss = 30%, L3 miss = 15%):
AMAT = 1 + 0.05 × (4 + 0.30 × (12 + 0.15 × 80)) = 1 + 0.05 × (4 + 0.30 × 24) = 1 + 0.05 × 11.2 = 1.56 ns
The arithmetic means a service that mostly fits in L3 averages 1.56 ns per load, and the 80 ns DRAM cliff is felt only on 0.05 × 0.30 × 0.15 = 0.225% of accesses. But that 0.225% is the worst-case tail: those loads stall the core fully, the OoO engine fills up, dependent instructions can't make forward progress, and the loads cluster (one cache miss often correlates with several others at the same workload phase). AMAT is right for throughput; for tails, you weight the DRAM cliff much more heavily.
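The formula drops straight into code. A sketch reproducing the arithmetic above:

```python
def amat_ns(l1, l2, l3, dram, l1_miss, l2_miss, l3_miss):
    """Average memory access time, compounding miss rates down the ladder."""
    return l1 + l1_miss * (l2 + l2_miss * (l3 + l3_miss * dram))

# The section's numbers: 1/4/12/80 ns latencies, 5% / 30% / 15% miss rates.
print(round(amat_ns(1, 4, 12, 80, 0.05, 0.30, 0.15), 2))   # 1.56
# Fraction of loads that fall all the way to DRAM:
print(round(0.05 * 0.30 * 0.15, 5))                        # 0.00225
```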
perf stat will give you the empirical version of every term. The relevant events on Intel: mem_load_retired.l1_hit, mem_load_retired.l2_hit, mem_load_retired.l3_hit, mem_load_retired.l3_miss. Sum them, and you have the load distribution across levels. On AMD: ls_dispatch.ld_st_dispatch, l1_data_cache.refills_l2, l2_request_l3.miss, l3_lookup_state.l3_miss. Run them on your service for 60 seconds during peak and you have the AMAT for that service, that workload, that hour. The number you get back is the most honest single-number summary of how memory-bound your code actually is — IPC alone hides whether you are CPU-bound or memory-bound, but AMAT plus IPC together pin the diagnosis.
The Razorpay UPI-router team monitors AMAT continuously on their fleet. The router has an internal "cache health" metric: (L1_hits + L2_hits) / total_loads. Healthy is ≥0.92; below 0.85 is a warning; below 0.70 means the router's per-request state has grown past L2's reach. In November 2024 a feature flag rolled out a per-transaction "compliance proof" structure that doubled per-request memory from 28 KB to 62 KB; the cache-health metric dropped from 0.94 to 0.81 within an hour, p99 went from 47 ms to 89 ms, and the on-call SRE rolled the flag back before any customer-facing SLO breach. The diagnosis was the cache hierarchy talking through perf stat, before the user-facing dashboard had even noticed.
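The health metric and its alert thresholds fit in a few lines. Threshold values are the ones quoted above; the label for the 0.85–0.92 band, which the text leaves unnamed, is ours:

```python
def cache_health(l1_hits: int, l2_hits: int, total_loads: int) -> float:
    """Near-cache locality: fraction of loads served by L1 or L2."""
    return (l1_hits + l2_hits) / total_loads

def classify(score: float) -> str:
    if score >= 0.92:
        return "healthy"
    if score >= 0.85:
        return "below target"   # band not named in the text; label is ours
    if score >= 0.70:
        return "warning"
    return "per-request state has outgrown L2"

print(classify(0.94))   # healthy -- before the compliance-proof flag
print(classify(0.81))   # warning -- an hour after the rollout
```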
The same metric on a Hotstar IPL-final transcoder fleet looks completely different: cache health typically sits at 0.55–0.70 because video transcoding streams through DRAM by design, the per-frame working set (12 MB raw 4K) far exceeds L2, and the prefetcher carries the load. For streaming workloads, cache health is a useless indicator; bandwidth utilisation is the right metric instead. The lesson is that no single cache metric is universal — the right metric depends on whether the workload is cache-resident (latency-bound, watch hit rate) or streaming (bandwidth-bound, watch GB/sec). Confusing the two and applying the wrong threshold has caused at least three Indian engineering teams I have seen to wrongly conclude their service was unhealthy when it was simply running a stream-bound workload at the bandwidth ceiling, exactly as designed.
The hierarchy is also the reason "rewrite this in Rust" or "rewrite this in C++" is the wrong question. The hot path's IPC can vary by 4× between L2-resident and L3-resident execution of the same compiled code. A Python service whose hot path stays in L1 outruns a C++ service whose hot path spills to DRAM. The compiler matters; the runtime matters; but the cache footprint dominates both whenever the working set is comparable to LLC. The first question on a slow service should never be "what language?" — it should be "where in the hierarchy does the hot working set live?"
A second consequence, less often appreciated: the hierarchy's tier sizes form a natural budget for per-request memory. A service handling N concurrent requests on a core with L_2 of size S has roughly S/N bytes of L2 budget per request. On a 64-vCPU EPYC 9554 carrying 80,000 tx/sec of UPI traffic, with each vCPU servicing ~1250 tx/sec and a typical request taking ~3 ms wall-time, each vCPU has ~3.75 in-flight requests at any moment. With 1 MB of L2, that's 270 KB per request before L2 thrashes. Real Razorpay UPI router request structs are ~62 KB — comfortably under, leaving headroom for stack frames and code. If a feature flag pushed the per-request struct to 320 KB (e.g. by inlining a giant compliance proof), L2 would thrash, every request would push to L3, p99 would jump from 47 ms to ~140 ms, and the on-call SRE would be paged. The cache hierarchy is invisible until you violate its budget; once you do, the violation shows up at the SLO layer, not at the cache layer where it originated. Engineers who keep this budget in their head — "what's my per-request L2 footprint, and what fraction of L2 is that?" — design fewer cliffs into their services.
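The budget calculation is worth keeping as a helper. Little's law gives the in-flight count; the numbers below are the UPI-router figures from the paragraph above:

```python
def l2_budget_per_request_kb(l2_kb: float, req_per_sec: float, wall_s: float) -> float:
    """L2 capacity per in-flight request: in-flight = arrival rate x wall time."""
    in_flight = req_per_sec * wall_s
    return l2_kb / in_flight

# 1 MB L2, 1250 tx/sec per vCPU, ~3 ms per request -> ~3.75 in flight.
print(round(l2_budget_per_request_kb(1024, 1250, 0.003)))   # 273
```

The exact quotient is ~273 KB, which the text rounds to ~270 KB; either way a 62 KB request struct fits with headroom and a 320 KB one does not.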
A second practical implication: the LLC miss rate alone tells you nothing about the shape of your service's memory pressure. A 5% LLC miss rate on a service with high MLP (many independent in-flight loads, prefetcher friendly) costs 5% × 80 ns / 12 in-flight ≈ 0.33 ns of effective latency per load — invisible. The same 5% miss rate on a dependent-chain workload (linked-list traversal, hash chain) costs the full 5% × 80 ns = 4 ns per load — a 5× IPC hit. The miss-rate number is identical; the consequence is not. The right pairing is LLC-load-misses plus mem_load_retired.l3_miss plus a measurement of MLP (l1d_pend_miss.pending divided by l1d_pend_miss.pending_cycles on Intel) — the three together give you the latency cost. One number alone always lies.
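The MLP adjustment is a one-liner, and worth writing down because it is the entire difference between the two scenarios just described:

```python
def effective_miss_ns(miss_rate: float, dram_ns: float, mlp: float) -> float:
    """Per-load latency added by LLC misses, amortised over in-flight misses."""
    return miss_rate * dram_ns / mlp

print(round(effective_miss_ns(0.05, 80, 12), 2))   # 0.33 -- prefetch-friendly stream
print(effective_miss_ns(0.05, 80, 1))              # 4.0  -- dependent pointer chain
```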
Common confusions
- "L1, L2, L3 are just different sizes of the same thing." They are physically and electrically distinct caches with different access patterns: L1 is split into d-cache and i-cache (for instructions vs data), is 8-way and pipelined for single-cycle parallel reads, and lives directly next to the core's load/store unit; L2 is unified (i + d), 8–16-way, and serves as a victim buffer plus prefetch target for L1; L3 is shared across cores, has a NUCA (non-uniform cache access) structure where access latency depends on which slice you are talking to, and is the coherence resolution point for the socket. Same name, three different machines.
- "Bigger cache is always better." Up to a point — cache lookup latency grows with size, and a 64 MB L1 would be slower than DRAM. The 4× geometric ratio between levels is the optimum for expected AMAT; doubling L1 to 64 KB to capture more capacity would push L1 latency from 4 cycles to 6 or 7, which slows down the 99% of loads that already hit. Real-world L1 sizes haven't grown much in 25 years for this exact reason — the latency budget caps the size.
- "L1 hit rate is the metric to optimise." L1 hit rate ≥98% is so common that improving it is rarely the win. The metric that moves p99 is the L3 miss rate — the fraction of loads that fall all the way to DRAM. A 30% L2 miss rate is fine if L3 catches almost all of them; a 5% L3 miss rate is a disaster because every one of those loads costs 80 ns. Look at the bottom of the ladder, not the top.
- "L3 is the same speed for every core that shares it." No — modern L3 caches use NUCA: the LLC is sliced across the die, and a core's access to its local slice is 30–40 cycles, while access to a remote slice (across the mesh) is 50–70 cycles. On Sapphire Rapids the mesh is 4×7, so the worst-case in-socket distance adds ~12 ns to LLC latency. The "shared" in "shared LLC" is accurate logically but misleading physically.
- "Caches are coherent and you can ignore that." L1 and L2 are private per core; L3 is shared. Coherence is maintained by MESI (or MOESI on AMD), which means a write to a line in one core's L1 invalidates that line in every other core's L1 and L2, and the next read by another core triggers a snoop to the writing core's L1. This adds 30–60 ns to inter-core sharing — usually invisible, occasionally catastrophic (false sharing on hot counters destroys throughput). Coherence is not free; it is just hidden.
- "L2's hit rate is improved by making L1 smaller." Counter-intuitive but partly true — a smaller L1 means more L1 misses, which means more L2 accesses, which mechanically grows L2 hit count even if hit rate drops. What you actually want is high combined (L1 + L2) hit rate. The right metric is (L1_hits + L2_hits) / total_loads, sometimes called near-cache locality. Optimising L1 in isolation can pessimise the whole hierarchy.
Going deeper
The NUCA penalty: when LLC is not really one cache
On a 60-core Sapphire Rapids 8480+, the L3 is sliced into 60 logical banks distributed across the mesh. Each bank holds a fraction of the total 105 MB. When core 0 misses in L2 and queries L3, the address hashes to one of those 60 banks — possibly its own local bank (cheap, 36 cycles), possibly a bank diagonally across the die (expensive, 68 cycles). The mesh hop cost is ~2 cycles per hop, and the worst-case path is ~16 hops. Real average LLC latency on a fully-loaded 60-core part is 50–55 cycles, not the 36 cycles brochures quote. NUMA balancing usually addresses inter-socket distance; intra-socket NUCA is the hidden version of the same problem.
The implication for hot data structures: pinning a thread to a core does not pin its data to that core's local LLC slice — the address-to-slice hash is fixed by the silicon, not by the thread. This means a critical read-mostly structure (a config blob, a hot lookup table) accessed by many cores ends up cached in multiple LLC banks (one cache line, replicated across slices) — using up effective LLC capacity and adding NUCA hops on the snoop path. Some workloads benefit from explicit pthread_setaffinity_np plus carefully placed allocations such that the hot structure's address hashes to the bank closest to the consuming thread; this is a tier of optimisation only LMAX-style and HFT codebases bother with, but the gap on a contended LLC is real (5–10 ns saved per access).
Why the address-to-slice hash is fixed and not configurable: the L3 slice selector is computed by a hash over the physical address bits, baked into silicon for cycle-budget reasons (a programmable router would add 1–2 cycles to every L3 lookup, paid by every access on every core). Intel publishes the hash as part of the manual, but only at the level of "bits 6 through N feed the hash"; the exact hash function is a hardware secret that occasionally leaks via reverse engineering. This is also the reason cache-coloring tricks from the 1990s no longer work directly — the hash spreads addresses across slices uniformly enough that you can't reliably keep your hot data in one bank just by choosing virtual addresses, even with hugepages.
Inclusive vs exclusive vs NINE policies — and why AMD switched
Intel L3s have historically been inclusive: every line in L1 or L2 is also in L3. This wastes some L3 capacity (the same line stored in L1, L2, and L3) but simplifies coherence — a snoop hits L3, and L3's directory says definitively "in this CCD's L1, in core 3" or "not anywhere on this socket". AMD Zen 1's L3 was non-inclusive non-exclusive (NINE): a line evicted from L2 may or may not stay in L3. Zen 2 onward, AMD moved to exclusive L3: a line in L2 is not in L3, so total fast-cache capacity = L2 + L3 per core, but coherence has to walk the L1-L2 stack on snoop. The choice depends on the coherence cost balance: Intel's mesh + ring made inclusive cheap; AMD's CCD-of-8-cores + Infinity Fabric + 32 MB L3 per CCD made exclusive cheap. By Zen 4 the formula stayed (exclusive L3 per CCD), and the behaviour visible in benchmarks is that workloads with very large per-thread working sets benefit on Zen and workloads with shared read-mostly data benefit on Intel. Both are 2–8% effects in microbenchmarks; in production they are dwarfed by data layout.
Apple silicon's deliberately different shape
The M-series is the most-deviating commercial design from the standard ladder. M3 Pro: 192 KB L1d (six times the typical x86 L1), 16 MB L2 per cluster (not per core; six performance cores share one L2), no L3, and a 24 MB system-level cache (SLC) shared by CPU + GPU + Neural Engine. The L1 is huge because Apple's CPUs run at lower frequency (4 GHz vs Intel's 5+ GHz), giving each L1 access more time to settle — they can afford a bigger L1 within the latency budget. The shared-cluster L2 is a bet that intra-cluster threads run related code (a single application's threads) and benefit from sharing; cross-cluster traffic goes through the SLC. The SLC is unique to Apple and Qualcomm-class designs — a "level beyond L3" that handles GPU-CPU shared data, which on x86 server parts has no equivalent. The takeaway: the ladder shape is a constant of the universe, but the rung count and the rung-vs-DRAM gap are design choices, and Apple's choices are different because their workload mix and frequency targets are different.
For Indian developers buying hardware in 2026: M3 Pro and M3 Max laptops have measurably better cache-hierarchy headroom than equivalently priced Ryzen 7 / Ryzen 9 mobile chips for typical software-engineering workloads (compilation, IDE indexing, database queries on local datasets). Not because Apple made better silicon — different silicon. The bigger L1 swallows more of the working set; the SLC catches GPU-accelerated work without round-tripping through DRAM. If you compile Rust crates for a living, the ladder shape of your laptop matters more than its peak GHz number.
The Zerodha order-book matcher's L2-fits-in-2-MB design
Zerodha Kite's cash-equity order-book matcher is sized deliberately to live in L2. The hot data structures — the price-level array (256 KB), the order-id index (180 KB), the per-symbol best-bid/best-offer cache (64 KB), and the working set of pending matches (~400 KB) — sum to ~900 KB, comfortably inside the 1 MB L2 of one Skylake-X core. The matcher is single-writer (LMAX-style), pinned to one core via taskset -c 4 plus isolcpus=4 on the kernel boot line, so no other thread evicts its L2. Steady-state L2 hit rate measured across the 09:15 IST market open is 99.4%; p99 producer-to-consumer latency is 1.4 µs.
When the matching team accidentally switched the order-id index from flat_hash_map (open-addressed, 180 KB at expected load) to unordered_map (chained; 1.6 MB after a few hours of trading, due to chain pointers and per-bucket overhead) in a 2024 internal release, L2 thrashing began immediately — the index alone exceeded the L2's effective capacity, and every match probe spilled to L3. p99 climbed from 1.4 µs to 4.2 µs in pre-prod. The fix was reverting the data-structure choice; the diagnosis took 12 minutes because the team's monitoring treats mem_load_retired.l2_miss as a first-class metric. Most fintech teams in India don't yet monitor this; the ones that do find issues like this in minutes instead of customer-impacting hours. The L2/L3 boundary is the most actionable monitoring signal in any low-latency service — it tells you, in advance of any latency change, that your data layout has drifted past your hardware budget. By the time customer-facing p99 moves, the hardware metric has been red for hours.
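The footprint blow-up is simple per-entry arithmetic. A hedged sketch (entry count, load factors, and allocator overheads below are illustrative assumptions — typical of glibc malloc and libstdc++ — not Zerodha's actual figures; real chained maps grow further still from rehash slack and heap fragmentation):

```python
# Rough footprint of an open-addressed vs a chained hash map holding the
# same entries. All overheads are illustrative assumptions, not measurements.

KEY, VAL = 8, 8            # 64-bit order-id -> 64-bit handle (assumed)
N = 10_000                 # live entries (assumed)

# Open addressing (flat_hash_map-style): one flat slot array, no pointers.
oa_load = 0.8                            # typical max load factor
oa_slots = int(N / oa_load)
oa_bytes = oa_slots * (KEY + VAL) + oa_slots   # slots + ~1 control byte/slot

# Chaining (unordered_map-style): one heap node per entry + a bucket array.
node = KEY + VAL + 8       # payload + next pointer
node += 16                 # assumed per-allocation malloc bookkeeping
ch_load = 1.0              # libstdc++ default max_load_factor
ch_bytes = N * node + int(N / ch_load) * 8     # nodes + bucket pointers

print(f"open-addressed: {oa_bytes / 1024:6.0f} KiB")
print(f"chained:        {ch_bytes / 1024:6.0f} KiB  ({ch_bytes / oa_bytes:.1f}x)")
```

Even under these conservative assumptions the chained map is a multiple of the flat one — and, unlike raw size, the chained map's nodes are scattered across the heap, so each probe touches several cache lines instead of one.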
Server vs laptop vs mobile: same ladder, very different rungs
The hierarchy looks dramatically different across the classes of silicon a 2026 Indian developer might touch in one day:
- Desktop — Ryzen 7 7700X: 32 KB L1d / 1 MB L2 / 32 MB L3, shared across the part's single CCD.
- AMD server — EPYC 9554: the same per-core L1/L2, but 256 MB L3 spread across 8 CCDs (32 MB each).
- Intel server — Sapphire Rapids 8480+: 48 KB L1d / 2 MB L2 / 105 MB mesh-distributed L3.
- Mobile — Snapdragon 8 Gen 3 (in midrange phones in India running Cricbuzz, Zomato, Swiggy): 64 KB L1d / 1 MB per-cluster L2 / 8 MB system cache.
- High-end mobile — iPhone 15 Pro's A17 Pro, or M-series on iPad Pro: 128 KB L1d, no L3 in the traditional sense, and a 32 MB SLC.
For a service developer, the practical implication is that production hardware looks nothing like your laptop. Razorpay's UPI router lives on c6a.4xlarge instances (EPYC 7R13 or 9R14, 64 MB L3 per VM, more like server territory). Your laptop is desktop-class. Microbenchmarks you run locally will hit cliffs at different sizes than the same code running in production. The fix is to measure in production — perf stat -p $(pgrep <service>) on a real instance for a real minute of traffic. The only thing your laptop benchmark can tell you is the shape of the workload (which level dominates, what the cliff structure looks like). The absolute numbers always change. A team that makes data-layout decisions based on laptop-only profiling will under-pad cache lines for production hardware about half the time.
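That production measurement can be wrapped in a few lines of Python. A minimal sketch, assuming Linux perf with the generic event names and its `-x` CSV output format (event availability varies by PMU; `cache_profile` and `parse_perf_csv` are hypothetical helper names):

```python
import subprocess

# Generic hardware events; most Linux perf builds expose these names.
EVENTS = "instructions,cycles,L1-dcache-load-misses,LLC-load-misses"

def parse_perf_csv(text):
    """Parse `perf stat -x ,` output (written to stderr) into {event: count}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        # Rows look like: <value>,<unit>,<event>,<run-time>,... ;
        # unsupported counters show "<not counted>" and are skipped.
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2].strip()] = int(fields[0])
    return counts

def cache_profile(pid, seconds=60):
    """Attach perf stat to a live process for `seconds` of real traffic."""
    cmd = ["perf", "stat", "-x", ",", "-e", EVENTS,
           "-p", str(pid), "--", "sleep", str(seconds)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_perf_csv(result.stderr)

# Usage on a real instance:
#   c = cache_profile(pid_of_service)
#   print(c["LLC-load-misses"] / c["instructions"])   # misses per instruction
# Track that ratio across deploys; it moves before p99 does.
```

The ratio of LLC misses to instructions is the portable form of the signal — raw counts vary with traffic, but misses-per-instruction drifts only when the working set or layout changes.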
For students learning this material, the inverse is true: your laptop's ladder is cleaner and easier to measure than a noisy server. Run cache_ladder.py on your own ThinkPad / MacBook / IdeaPad and you will get cleaner cliffs than any server-room measurement. Once you trust your laptop's measurement, the same Python script run on an AWS c7i.4xlarge or a Hetzner CCX13 will show you the server's shape. Two runs, an hour of work, and you have a calibrated mental model of every CPU you will program for the next decade.
Reproduce this on your laptop
```bash
# Cache-ladder microbenchmark + perf-stat ground truth
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
sudo sysctl kernel.perf_event_paranoid=0
python3 cache_ladder.py | tee ladder.txt

# Confirm the cliffs with hardware counters:
perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses \
    python3 -c "import cache_ladder; cache_ladder.bench(64*1024)"
# On Apple silicon (no perf), use Xcode Instruments → "Counters" template instead.

lscpu | grep -E "L1|L2|L3"  # confirm your cache sizes match the cliff positions
```
You should see your L1 cliff at your reported L1 size, your L2 cliff at your reported L2 size, and your L3 cliff at your reported L3 size — verifiable in three minutes. If the perf counters come back empty or zeroed, your kernel's perf_event_paranoid setting may be filtering events; if the timing cliffs themselves sit away from the lscpu sizes, suspect set-associative aliasing on the random permutation (rare; usually only on very small L1d parts under 24 KB). Run the benchmark twice; the cliff positions are reproducible to within 5%.
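cache_ladder.py itself is not reproduced in this chapter. A minimal pointer-chase sketch in its spirit — the `bench(nbytes)` name matches the usage above, but everything else here is an assumption — looks like this:

```python
import time
import numpy as np

def bench(nbytes, accesses=1_000_000):
    """Average latency of one dependent load over a buffer of nbytes.

    Builds a single random cycle through an index array, then chases it:
    each load's address depends on the previous load's value, so the CPU
    cannot overlap the misses. In pure Python the interpreter adds a large
    constant per access — read the *relative* steps, not the absolute ns.
    """
    n = max(nbytes // 8, 16)            # 8-byte slots
    order = np.random.permutation(n)
    chain = np.empty(n, dtype=np.int64)
    chain[order[:-1]] = order[1:]       # one cycle visiting every slot
    chain[order[-1]] = order[0]
    j = 0
    t0 = time.perf_counter()
    for _ in range(accesses):
        j = chain[j]                    # dependent load: next index = value
    return (time.perf_counter() - t0) / accesses * 1e9  # ns per access

if __name__ == "__main__":
    for kb in (16, 32, 64, 128, 256, 512, 1024, 2048, 8192, 32768):
        print(f"{kb:>6} KB  {bench(kb * 1024):6.1f} ns/access")
```

The random cycle is the essential trick — sequential strides would let the hardware prefetcher hide the latency you are trying to measure, and independent random loads would let out-of-order execution overlap the misses.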
Where this leads next
The hierarchy is the structure; the next chapters are about how that structure interacts with the data and code you write:
- /wiki/cache-lines-and-why-64-bytes-rules-everything — the 64-byte unit of motion and what it does to data structures.
- /wiki/cache-coherence-mesi-moesi — the protocol that keeps multiple cores' L1s consistent, and the latency it leaks.
- /wiki/false-sharing-the-silent-killer — when two cores share a line they did not mean to share.
- /wiki/data-layout-for-cache-friendliness — structure-of-arrays vs array-of-structures, padding, hot/cold splits.
- /wiki/prefetching-hardware-vs-software — how the prefetcher tries to hide cache misses, and when you have to help.
- /wiki/tlb-and-address-translation-costs — the second cache hierarchy, hidden behind the first.
The hierarchy reappears in NUMA (Part 3, where the LLC is no longer the bottom — there is "remote" DRAM further down), in profiling (Part 5, where flamegraphs of memory-bound functions point to the wrong line of code), and in capacity planning (Part 14, where service capacity falls off a cliff exactly when the per-request working set times the request rate exceeds the LLC). Every later part references the ladder either to use it (prefetching, layout) or to break it (NUMA, coherence, false sharing).
The senior-engineer instinct on a slow service — "what fits where?" — is the hierarchy talking. Internalising the four numbers (1 ns / 4 ns / 12 ns / 80 ns) and three sizes (32 KB / 1 MB / 32 MB) gives you a model of every CPU your code will run on for the next decade. Vendors will move them; the ratios will not.
A useful exercise the day after you read this chapter: pull lscpu | grep -E "L1|L2|L3" on every machine you have access to (laptop, dev VM, prod instance, the ARM box in the lab, your phone over Termux) and write the numbers in a notebook. Within a week you will recognise three or four "ladder shapes" — Intel mesh-server, AMD CCD-server, Apple unified, ARM-cluster — and you will be able to predict, just from the shape, which of your services will run well on which silicon. This is not a memorisation exercise; it is a calibration of intuition. The numbers stop being abstract the moment you have run lscpu on five different machines and seen the pattern. Every flamegraph, every perf stat, every benchmark you read for the rest of your career will land somewhere on that ladder, and your first read on whether a service is healthy or sick will be the level it is operating at.
The next chapter — cache lines and the 64-byte unit of motion — picks up where this one ends: the hierarchy is the structure, and 64 bytes is the quantum of motion within it. Every byte your code touches is part of a 64-byte transfer; designing data structures to use that transfer well is the topic that converts hierarchy awareness into actual speed.
References
- Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019) — Chapter 2 derives the AMAT formula and the geometric-ladder argument. The canonical reference for this chapter's claims.
- Drepper, "What Every Programmer Should Know About Memory" (2007) — §3 (CPU caches) explains set associativity, line size choices, and inclusive vs exclusive policies in depth.
- Intel® 64 and IA-32 Architectures Optimization Reference Manual — per-microarchitecture cache sizes, latencies, and the full PMU event list (mem_load_retired.*).
- AMD Zen 4 Software Optimization Guide (PUB 57647) — definitive numbers for L1/L2/L3 sizes, associativity, and exclusive-LLC behaviour on EPYC 9004.
- Agner Fog, Optimizing software in C++ (2024) — §9 on caching is the most pragmatic field manual for matching code to a real cache hierarchy.
- LMbench / lat_mem_rd — the standard latency-vs-stride benchmark; the cache_ladder.py here is its Python cousin.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 (Memory) covers PMU-based cache analysis in production-debugging language.
- /wiki/wall-cpus-are-fast-memory-is-not — the prequel to this chapter; the gap that the ladder exists to bridge.