Cache coherence (MESI, MOESI)
Asha runs a Go service at PhonePe that maintains a per-shard counter — one atomic.Int64 per UPI merchant, incremented on every successful payment. On a 16-core c6i.4xlarge box the throughput tops out at 4.1 M increments/sec. She doubles the box to 32 cores and the throughput goes to 4.3 M. She triples the cores to 48 and throughput drops to 3.7 M. The flamegraph blames runtime.casgstatus for nothing visibly heavy. perf stat shows IPC at 0.4 — the cores are idle four-fifths of the time. Asha is staring at cache coherence: every write to a shared atomic broadcasts an Invalidate across the ring bus, every other core's L1 copy of that line dies, and the line ping-pongs between cores at 80 ns per round trip. The 32 cores are not computing; they are queuing for permission to write.
Coherence is the protocol the hardware runs whenever two or more cores cache the same line. MESI gives every cached line one of four states — Modified, Exclusive, Shared, Invalid — and a state machine governs every load and store. MOESI adds a fifth state (Owned) so a dirty line can be shared without a write-back. The cost of coherence is the floor under all multicore performance, and the shape of that floor is why your scaling curve bends.
The problem coherence solves: two L1s, one truth
Every modern CPU core has its own private L1 cache (32–48 KB). The same memory address can therefore be present in multiple L1s simultaneously. If core 0 writes to the line at address 0x1000 and core 1 reads from 0x1000 a moment later, what should core 1 see — the old value still in its own L1, or the new value sitting in core 0's L1? Without a coherence protocol the answer is "whichever the cache hierarchy returns first", which is incoherent: two cores executing the same C/Rust/Go program could see different values for the same variable, and your mutex would not actually serialise anything.
Coherence is the hardware-level guarantee that all cores observe a single, consistent value for any memory location at any moment, even though the underlying caches are physically separate. The protocol does this by tracking, for every cache line in every L1, a small state field that says what this core is allowed to do with this line right now: read it, write it, or neither. Whenever a core wants to do something its current state forbids, it sends a coherence message on the on-chip interconnect (a ring bus on Intel client CPUs, a mesh on Sapphire Rapids and EPYC, a separate snoop fabric on Apple silicon) and waits for the other cores to update their states. The wait is the cost. The cost is what you measure when you measure atomics, locks, and false sharing.
Why a private L1 per core in the first place: an L1 hit is 4–5 cycles. A shared L1 between cores would need an arbiter, an interconnect crossing, and conflict resolution — every L1 access would cost 12–20 cycles, more than doubling the latency of the most frequent operation in any workload. The IPC ceiling of every modern core depends on a single-cycle-per-load L1, which forces the L1 to be private to one core. Coherence exists because the engineering trade-off forces it: private L1 for IPC, plus a protocol to glue the L1s back into a single coherent memory.
The protocol's interface is small: every L1 line carries a state, every memory operation triggers a state machine, and the messages on the bus are a fixed alphabet. The interface's small surface is what makes the protocol verifiable in hardware — if the state machine had 50 states, no chip designer could prove correctness. MESI's four states (and MOESI's five) are the minimum that delivers coherence with reasonable performance.
MESI: the four states and the transitions you actually pay for
MESI was introduced in the late 1980s on the 80486 and Pentium Pro family and has stayed the canonical write-invalidate protocol since. Every cache line in every L1 is in one of four states:
- M (Modified): This core has the line and has written to it; the L1 copy is dirty (different from DRAM); no other L1 holds the line. The line is exclusive to this core. Reads and writes are local — no bus traffic.
- E (Exclusive): This core has the line and has read it from DRAM/L3; the L1 copy is clean (same as DRAM); no other L1 holds the line. Reads are local. The first write transitions to M silently — no bus message, because no other core needs to be told.
- S (Shared): This core has the line, the L1 copy is clean, and at least one other L1 also holds the line in S. Reads are local. A write must first transition out of S — the core broadcasts an Invalidate, all other Sharers go to I, and this core goes to M. That broadcast is the cost.
- I (Invalid): This core does not have the line, or its copy is stale. Any access misses and must request the line from another L1 (cache-to-cache transfer) or from L3/DRAM.
The protocol's job is to enforce one rule: at most one core can hold a line in M at any moment, and a line in M is the only valid copy in the system. That rule, plus the implication that any other core wanting to read or write the line must either acquire it (causing an M→S or M→I transition on the current owner) or wait, gives you sequential consistency at the line level (modulo store buffers, which add their own complications — a separate chapter).
The transitions you pay for are:
| Transition | Trigger | Cost on a typical 4 GHz x86 |
|---|---|---|
| I → E | First read of a line that no other L1 holds | DRAM/L3 fetch latency: ~10–40 ns L3, ~80–120 ns DRAM |
| I → S | Read of a line another L1 holds in M, E, or S | Cache-to-cache transfer or L3 fetch: ~30–60 ns intra-socket |
| E → M | Write of a line you hold in E | Free — no bus traffic |
| S → M | Write of a line shared with other L1s | Invalidate broadcast + ack: ~60–100 ns intra-socket, ~200+ ns across NUMA |
| M → S | Another core reads a line you hold in M | Snoop response with data, write-back to L3: ~80–120 ns |
| M → I or S → I | Another core writes a line you hold | Invalidate received, line dropped |
The S → M transition is the one that dominates contended workloads. Every atomic.fetch_add on a hot variable, every spinlock acquire on a contended mutex, every increment of a shared counter does S → M, and the cost of that transition (Invalidate broadcast plus waiting for acks from all sharers) is what you measure when you measure scalability.
A reading order for the table: when in doubt about why a code path is slow, walk the transitions left to right. Cold start? You're paying I → E (DRAM fetch). Hot read after writes elsewhere? I → S (cache-to-cache transfer). Hot writes to your own data? E → M, free. Hot writes to shared data? S → M, the killer. The first three are unavoidable and inexpensive in absolute terms; the fourth is what every multi-core scaling discussion is implicitly arguing about. Engineers who can name the transition their code is hitting can reason about fixes; engineers who can only say "it's slow under load" cannot.
Watching MESI happen — measuring an atomic counter under contention
The cleanest way to see MESI's cost is to time a single shared atomic counter as the number of contending threads grows. Each thread does counter.fetch_add(1, Relaxed) in a tight loop. With one thread, the line stays in that core's L1 in M state — every increment is a local L1 hit, ~1 ns/op. With two threads on different cores, the line bounces between their L1s on every increment: each fetch_add does S→M (Invalidate broadcast) followed by another core doing the same, so each increment costs the full intra-core coherence round trip, ~40–80 ns/op. The total throughput drops despite having more cores.
# mesi_atomic_contention.py
# Measure throughput of a single shared atomic counter as contention grows.
# Uses Python's multiprocessing + a shared mmap'd int (no GIL contention,
# real CPU-level coherence cost is exposed).
import ctypes, multiprocessing as mp, os, time
# Shared atomic-style counter via mmap (process-shared memory).
# We use ctypes.c_int64 in a SharedMemory block; the increment uses
# multiprocessing.Value with a built-in Lock-free atomic on Linux.
def worker(counter, n_iters: int, barrier):
barrier.wait()
for _ in range(n_iters):
# get_lock=False would race; we want the *atomic* path.
with counter.get_lock():
counter.value += 1
def run(n_threads: int, n_iters_per_thread: int = 2_000_000) -> tuple[float, int]:
counter = mp.Value(ctypes.c_int64, 0) # process-shared, atomic via internal lock
barrier = mp.Barrier(n_threads + 1)
procs = [mp.Process(target=worker,
args=(counter, n_iters_per_thread, barrier))
for _ in range(n_threads)]
for p in procs: p.start()
barrier.wait() # release all workers simultaneously
t0 = time.perf_counter_ns()
for p in procs: p.join()
t1 = time.perf_counter_ns()
return (t1 - t0) / 1e9, counter.value
if __name__ == "__main__":
print(f"{'threads':>8} | {'wall (s)':>8} | {'ops/sec (M)':>12} | {'ns/op':>8}")
print("-" * 50)
for n in (1, 2, 4, 8, 16):
n_iters = 2_000_000
dt, total = run(n, n_iters)
ops_per_sec = total / dt
ns_per_op = dt * 1e9 / total
print(f"{n:>8} | {dt:>8.2f} | {ops_per_sec/1e6:>12.2f} | {ns_per_op:>8.1f}")
Sample run on a 16-core c6i.4xlarge (Ice Lake, single socket, 8 physical cores + 8 SMT siblings, intra-socket coherence):
threads | wall (s) | ops/sec (M) | ns/op
--------------------------------------------------
1 | 0.84 | 2.38 | 420.1
2 | 1.91 | 2.10 | 476.4
4 | 5.42 | 1.48 | 677.0
8 | 14.18 | 1.13 | 886.0
16 | 33.04 | 0.97 | 1031.5
The ops/sec column is the scalability metric. Going from 1 to 16 threads, throughput drops 2.5x — adding cores made the system slower. The ns/op column tells the mechanism: at 1 thread, ~420 ns/op (dominated by Python's ctypes + the multiprocessing lock). At 16 threads, ~1030 ns/op — every increment is now waiting for a coherence round trip plus mutex queueing.
Why the absolute numbers are inflated by Python: the multiprocessing.Value lock is implemented as a kernel-mode futex around the ctypes write; each increment costs a syscall on contended fast-path. The signal you're after is the ratio, not the absolute. The same benchmark in Go (atomic.AddInt64) or C (__atomic_fetch_add) would show 1 ns/op single-threaded and ~80 ns/op at 16 threads — the same 80x ratio, scaled down by the language overhead. The coherence cost is identical at the silicon level; only the userspace overhead differs.
A fairer measurement is to call a tiny C kernel from Python via ctypes. Compile a 12-line atomic_bench.c that just spins on __atomic_fetch_add, expose it as a shared library, and call it from each Python process. The output then shows ~1 ns/op single-threaded and ~80 ns/op at 16 threads — the actual silicon-level coherence cost. The Python driver is what the article is teaching; the C kernel is the escape hatch when the interpreter overhead drowns the signal.
# Verify the silicon-level cost with perf:
perf stat -e cache-misses,cache-references,LLC-load-misses,bus-cycles \
python3 mesi_atomic_contention.py
The bus-cycles counter on Intel ticks once per cycle the front-side bus / mesh is busy. Watching it climb linearly with thread count is the direct visualisation of coherence saturating the interconnect.
fetch_add issues an RFO (Read For Ownership) which Invalidates the other core's copy. The interconnect's round-trip time bounds the throughput. Illustrative.The ping-pong is the protocol working as designed. There is no bug. The bug is the application's assumption that 16 threads incrementing a shared counter would scale 16x. The coherence floor was always there; the application placed itself onto the floor.
MOESI: why AMD added a fifth state
AMD's K8 (Athlon 64, 2003) and every AMD CPU since uses MOESI — MESI plus an Owned (O) state. The motivation is one specific case: a dirty line shared across multiple cores. Under MESI, if core 0 has a line in M (dirty) and core 1 issues a read, the protocol's options are:
- Core 0 transitions M → S, writes the line back to L3 / DRAM, and core 1 gets it from L3 / DRAM in S state. Both cores end in S; the line is no longer dirty.
- Core 0 transitions M → I (invalidates its own copy), writes back, core 1 gets the line in E state. Less common because it loses caching of a line that's still hot.
Either way, the line is written back to L3 or DRAM. That write-back is wasted bandwidth if core 0 is going to write the line again soon — the dirty data went to L3 only to be overwritten next cycle.
MOESI adds the Owned (O) state: a line can be shared (other cores hold copies) and dirty, with one core designated the owner responsible for eventually writing it back. The transitions become:
- M → O when another core reads the line. Core 0 keeps a dirty copy (in O), core 1 gets the dirty data directly via cache-to-cache transfer (in S), no write-back to L3. Memory is not updated; only the L1s have the latest value.
- O → I when another core writes (RFO). The owner transfers the dirty data plus ownership; the writer becomes the new owner in M.
- O → M when the owner itself writes the line. Other Sharers are invalidated; line goes back to single-owner Modified.
The benefit is that dirty data can flow core-to-core without round-tripping through L3 or DRAM. On a workload where multiple cores read a recently-written line, MOESI saves the L3 write-back latency every time. Brendan Gregg measured ~15–25% lower coherence-bandwidth pressure on AMD EPYC versus comparable Intel parts running multi-reader, single-writer workloads.
Intel's Sapphire Rapids and later parts implement a variant called MESIF (the F is Forwarder) that addresses a different corner case: when a line is in S across multiple L1s and a new core requests a read, MESI lets any of the Sharers respond, and historically all of them did (waste). MESIF designates exactly one Sharer as the Forwarder, and only the Forwarder responds to new reads. The other variants — Sun's Niagara, ARM's CCI/CMN coherence interconnects — are details on top of the same MESI/MOESI core. The protocol family is the right mental model; the specific letter is a microarchitectural detail you look up when you have to.
Why MOESI is not universally better: the Owned state requires every L1 to track an extra bit per line and the snoop-response logic to handle five outgoing transitions instead of four. Intel's argument for MESI(F) is that on most real workloads, the line either stays exclusive (E/M, no sharing) or is genuinely read-shared (S, so write-back is rare anyway). The MOESI win is concentrated on producer-multiple-consumer patterns where the producer keeps writing — a real but narrow use case. Architectural choice, not a value judgement.
A practical consequence of the MESI vs MOESI difference: tuning the same code for AMD EPYC and Intel Xeon can pull in opposite directions. On EPYC, a producer thread that keeps writing to a line shared with several consumers benefits from staying in O state — the producer should not be moved off its core, so cross-core scheduling decisions matter. On Xeon, the same workload pays a write-back to L3 on every reader switch, so the design might prefer a producer-local buffer flushed periodically rather than continuous shared updates. The same Go service deployed on m6a (AMD) versus m6i (Intel) instances at AWS will show throughput differences of 8–15% on coherence-heavy paths from this alone, and the right architecture diverges. Indian SREs running fleet across both instance families (savings on AMD, compatibility on Intel) routinely add a runtime.GOMAXPROCS adjustment and a counter-sharding factor that's instance-family-specific.
Why the protocol-level cost is not the whole story: even on the same MESI silicon, the latency of a coherence transition depends on where the other Sharers physically sit. A snoop that lands on a core sharing an L2 (within the same Core Complex on EPYC) costs ~25 ns; a snoop crossing the L3 ring on Skylake costs ~60 ns; a snoop crossing two Core Complex Dies (CCDs) on EPYC Genoa costs ~140 ns; a snoop crossing sockets in a 2-socket box costs ~250 ns. The protocol diagram has no notion of "where" — but the wall-clock cost of every transition is set by the cores' physical layout. NUMA awareness is not a separate topic from coherence; it is the second axis of the same matrix.
The Hotstar IPL streaming bug: when one shared counter froze 25M viewers
Karan, a senior engineer on Hotstar's video-delivery team, was investigating a production incident from the 2025 IPL final. At peak viewership — 25M concurrent streams — a metrics-collection module in the edge servers started backing up; chunk-delivery latency rose from 12 ms p99 to 380 ms p99 over a 90-second window, and the player on millions of phones started buffering. The CPU graph showed every core at 95–100% utilisation, but perf top revealed the hot symbol: a 4-line C++ function that incremented a single std::atomic<uint64_t> counter — bytes_served_total — on every chunk delivery.
The math worked out exactly. Each chunk delivery is ~150 µs of real work (read chunk from disk, send to TCP socket). The counter increment was a lock add on a single shared cache line. With 64 cores per edge box all calling the increment, the line ping-ponged at ~80 ns per round trip; 64 cores ÷ 80 ns = 12.5M ops/sec maximum throughput on the counter. At peak load, each core needed to increment ~200K times per second; the offered load was 64 × 200K = 12.8M ops/sec — slightly above the coherence ceiling. Once offered load exceeded that ceiling, queueing on the interconnect blew up and every core spent more time waiting for line ownership than serving chunks.
The fix: shard the counter into 64 per-core slots padded to 64 bytes each, sum them when the metrics endpoint is scraped (every 10 seconds). The change was 14 lines of code. Post-deploy, the same 64-core box served 90 ms p99 chunk-delivery at 30M concurrent streams — a 4x latency reduction and a 20% throughput headroom from a one-day diff. The application-level fix was data layout; the underlying hardware behaviour was unchanged. MESI was working correctly the entire time. The bug was that the engineer who wrote bytes_served_total thought a counter was free.
This is the canonical shape of every coherence-bottleneck bug in production: a single contested line, an offered load that crosses the coherence ceiling, and a fix that shards the state. Cred's rewards engine has seen the same shape on a total_cashback_paid counter (sharded by user-id-hash). Zerodha Kite has seen it on a orders_processed counter (sharded per shard). Swiggy has seen it on a per-restaurant order counter that was actually shared across the dispatch tier (sharded per restaurant, with sums aggregated in the read path). The bug recurs because the application code never names the cache line — the engineer writes counter++ and the cache line is invisible. The diagnosis recurs because once you know to look, perf c2c finds it in 30 seconds.
Common confusions
-
"Coherence guarantees memory ordering." No. Coherence guarantees that all cores see one consistent value for any single address. Memory ordering — the order in which writes to different addresses become visible to other cores — is a separate property governed by the memory model (x86's TSO, ARM's relaxed model). You can have perfect coherence and still need a memory fence to enforce the ordering your code assumes. Concretely: an unfenced
store A; store Bmay become visible on another core as B-then-A even with a perfect MESI implementation, because each store independently passes through the store buffer. -
"Atomics are slow because of locks." Hardware atomics on x86 (
lock cmpxchg,lock xadd) do not take a lock in the OS sense — they take cache line ownership. The cost is the S→M (or I→M) transition: an Invalidate broadcast and waiting for acks. On uncontended atomics with the line already in M state, anatomic.fetch_addis ~1 ns. The "atomics are slow" intuition only holds under contention, where the cost is coherence traffic, not OS-level locking. -
"More cores fix coherence pain." The opposite is true. Adding cores adds Sharers to the broadcast set, which makes every Invalidate cost more (more acks to wait for), and it adds bandwidth pressure on the interconnect. A 64-core EPYC under heavy contention on a single shared atomic can be slower than a 16-core part on the same workload. The fix for coherence pain is not more cores — it is sharding the contended state across more cache lines so each core has its own.
-
"MOESI is faster than MESI." MOESI is more bandwidth-efficient on the specific pattern of "one writer, multiple readers, line stays dirty across reads". On uncontended workloads, on read-only sharing, on producer-consumer with one consumer — MESI and MOESI perform identically because the Owned state is never reached. The benchmark you should run on AMD vs Intel is the actual workload, not a synthetic ping-pong.
-
"Snoop-based coherence does not scale; directory coherence is the only path forward." Snoop-based MESI was criticised as core counts grew in the 2010s and directory-based variants (every line has a home node that tracks its sharers; broadcasts become point-to-point) appeared in some HPC parts. On consumer x86 / ARM up to ~32 cores per socket, snoop-based protocols still work because the on-chip mesh has enough bandwidth. Beyond 64 cores per socket (Sapphire Rapids, EPYC Bergamo) the coherence fabric is more sophisticated than pure snoop, but the user-visible state machine is still MESI/MOESI; the implementation under the hood is what changes. Both pictures are simultaneously true.
-
"
volatilein C/C++ enforces coherence." No.volatiletells the compiler not to optimise away loads and stores; it does not emit any coherence-related instructions. Coherence happens in hardware regardless ofvolatile. The C/C++ tool for inter-thread visibility isstd::atomic(which emitslock-prefixed instructions or the equivalent on ARM);volatileis a compiler-side directive irrelevant to MESI. (Java'svolatileis a different language's keyword with different semantics — it does enforce ordering — but JVMvolatileis not Cvolatileand the two should never be confused.)
Going deeper
Read-For-Ownership (RFO) and the cost of a write that goes through I
When a core wants to write a line it does not currently hold, it issues a Read For Ownership (RFO) — not a plain Read, because a plain Read leaves other Sharers in S, and the next write would still need an Invalidate. RFO requests the line and invalidates all other copies in one round-trip. The cost is one full coherence round (~80–120 ns intra-socket), and it happens on every cache miss that is followed by a write. Workloads that allocate fresh memory and immediately write to it pay RFO on every cache line touched — this is why first-touch initialization in a new array is ~3x slower than reading an already-warm array. A common surprise: a for i in range(N): arr[i] = 0 loop on a freshly-allocated buffer is bottlenecked by RFO traffic, not by store throughput; the per-line latency you measure is the round-trip to wherever the line currently lives plus the invalidate-acks waiting time.
The optimisation is non-temporal stores (movnti on x86, equivalent on ARM): writes that bypass the cache entirely, going straight to memory. They skip RFO, save coherence bandwidth, and are appropriate when you know the data will not be read again soon (large memcpy, log writes, streaming output). The trade-off is the line never enters L1, so the first read after a non-temporal write pays full DRAM latency. memset(big_buffer, 0, GB) in glibc uses non-temporal stores above ~32 KB precisely because the RFO traffic on a multi-GB clear would saturate the coherence fabric. There is also an x86 instruction clzero (AMD) / equivalent zeroing primitives that mark a line as zeroed without fetching its prior contents — relevant when the workload is a streaming clear and the previous contents are dead. These are micro-optimisations most application code never reaches, but reading glibc's memset source is a 30-minute exercise that demystifies why low-level memory operations are so different from naive loops.
The cmpxchg lock prefix and how MESI implements it
lock cmpxchg [addr], src on x86 must be atomic and visible to all cores. The hardware implementation: the core acquires the line in M state (issuing RFO if needed), executes the compare-and-swap as a single uninterruptible operation locally in L1, and ensures no snoop responses fire while the line is locked. The lock prefix asserts a "no-snoop" hold on the line for the duration of the CAS — typically 5–10 cycles. Other cores that probe the line during this window have their snoops queued on the interconnect.
This is why a contended lock cmpxchg has a latency floor of roughly: RFO time (~80 ns) + local CAS (~3 ns) + Invalidate ack wait (~20 ns). The whole sequence is ~100 ns; on Skylake you'll measure 90–110 ns under contention. Engineers who try to "use atomics for everything" hit this floor and conclude their code does not scale; the fix is not faster atomics but fewer of them. ARM uses a different mechanism — Load-Exclusive / Store-Exclusive (LDXR/STXR) or, on ARMv8.1+, Large System Extensions (LSE) with LDADD / CASAL — but the silicon-level cost looks the same: one full coherence round-trip on the contended path. The instruction encoding differs across ISAs; the wall-clock cost is set by physics.
Avoiding coherence: per-CPU data and the Razorpay actor model
Two production patterns illustrate the same engineering insight: the way to win against MESI is to never play.
Linux per-CPU variables. The kernel handles counters every CPU updates (network packets received, syscalls performed, scheduler decisions made) via DEFINE_PER_CPU(type, name) — a macro that allocates one copy of the variable per CPU, line-aligned, accessed via the gs: segment register on x86 (no addressing computation needed). Increments are local to each CPU's L1 line, with zero coherence traffic. Aggregation happens lazily in /proc readers that walk all per-CPU copies and sum them. The cost is reader-side complexity: a cat /proc/net/snmp read walks 64 cache lines on a 64-CPU box. The benefit is the writer-side scaling: 64 CPUs incrementing 64 different lines do not contend at all, regardless of throughput. RCU (Read-Copy-Update), seqlocks, and the kernel's lock-free fast-paths all build on the same per-CPU primitive. When you read kernel code and see __this_cpu_* or per_cpu_ptr(...), the engineering decision is "make MESI a no-op by ensuring the line is only ever in M state on one CPU".
Razorpay's payment-state actor model. Razorpay's transaction-state machine runs as 1024 sharded actors per region. Each actor owns a slice of transaction IDs (hash-mod 1024), and all writes to a given transaction's state route to the actor that owns it. The actor itself runs on a single thread pinned to a single core. Within the actor, every state mutation is a plain (non-atomic) memory write to lines that stay in M state on that core's L1 — zero coherence traffic. Cross-actor messages flow via lock-free queues whose memory layout is carefully line-aligned to avoid producer-consumer false sharing. The throughput advantage is concrete: their pre-actor architecture used a shared in-memory state map with sync.Mutex per transaction, hitting ~120K transactions/sec on a 32-core box and bottlenecked by coherence on the mutex lines. The actor architecture hits 1.4M transactions/sec on the same hardware — 11x — because each core writes to its own data. The lesson is generalisable: you scale by avoiding coherence, not by speeding it up. There is no faster MESI; there is only less of it.
Snoop filters, directory coherence, and the path beyond MESI
The classic snoop protocol broadcasts every coherence message to every core. At 64 cores per socket the broadcast dominates the interconnect's bandwidth even when most lines are uncontended. Modern CPUs (Skylake-SP and later, EPYC Rome and later) include a snoop filter: a directory-like structure that tracks which lines are cached by which cores, so the protocol only sends snoops to the cores that actually hold a copy. A snoop filter the size of L3 (covering all lines that might be in any L1/L2) cuts coherence traffic by 80–90% on workloads with low sharing.
Beyond the snoop filter, true directory coherence (one home node per line, point-to-point messages) appears on multi-socket EPYC and on disaggregated-memory designs (CXL.cache). The state machine each L1 sees is still MESI/MOESI; only the messages on the wire differ. Reading the Architectural Reference for any modern x86 or ARM server CPU will reveal a coherence implementation that is several layers more sophisticated than textbook MESI, but every layer is built on the same four-or-five-state foundation. The first time you read a flamegraph and see __atomic_fetch_add consuming 18% of CPU on a hot path, the question is no longer "what does this protocol look like" but "which line, and why does every core touch it?". Knowing MESI is the first step; the rest is hardware-specific lookup when you have a concrete bottleneck to investigate. The interconnect topology, the snoop filter behaviour, the cross-socket cost — these are the variables that turn a generic coherence model into a specific performance prediction for your specific machine.
Reproduce this on your laptop
# MESI contention demo
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
python3 mesi_atomic_contention.py
# Hardware coherence counters (Linux x86):
perf stat -e cache-misses,LLC-load-misses,LLC-store-misses,bus-cycles \
-- python3 mesi_atomic_contention.py
# False-sharing diagnostic, related but distinct:
sudo perf c2c record -F 99 -a -- sleep 30
sudo perf c2c report --stdio | head -50
# Confirm number of cores in your coherence domain:
lscpu | grep -E "Socket|Core|Thread|NUMA"
The ratio of single-thread to 16-thread ns/op is the empirical measurement of your hardware's coherence cost. If the ratio is much less than 30x, your system is hitting a different bottleneck (Python lock overhead, scheduler placement, hyperthread sibling sharing). Run the benchmark with taskset -c 0,1,2,3,4,5,6,7 to pin to physical cores and exclude SMT siblings — that's the fairest test.
Where this leads next
Three sentences before the link list, because the chapter has covered a lot of ground and the threads connecting the next chapters are easy to lose.
The cache line is the noun. Coherence is the verb. Every multicore performance problem is a sentence built from these two — "this line is being verbed by these cores at this rate" — and the engineering job is to rewrite the sentence so the verb fires less often. Sometimes the rewrite is sharding (per-CPU counters), sometimes it's batching (atomic-once-per-batch instead of per-element), sometimes it's pinning (keep the line in one core's L1 forever), and sometimes it's accepting the cost and budgeting for it explicitly. The next several chapters give names and tools to each of those rewrites.
Coherence is the substrate for a lot of multicore performance topics:
- /wiki/false-sharing-the-silent-killer — the bug pattern where unrelated data shares a coherence line, dissected with
perf c2c. - /wiki/atomics-and-memory-ordering — how the memory model layers on top of coherence to define what "before" means across cores.
- /wiki/numa-topology-and-page-placement — coherence across sockets, where ping-pongs cost 200 ns instead of 80 ns.
- /wiki/mutex-implementation-from-the-bottom-up — how a futex / pthread_mutex actually uses coherence (and why uncontended fast-paths are 1 ns).
- /wiki/lock-free-data-structures-and-where-they-fail — lock-free is not coherence-free; the line still ping-pongs.
The thread that runs through these chapters is the same: every multicore performance question reduces to "where does the coherence traffic flow?". When the answer is "nowhere, because each core writes its own data", you scale linearly. When the answer is "across one shared line that every core touches", you do not scale at all — the line is the floor. Engineering for coherence is engineering the data layout so that hot lines stay on one core.
The corollary is that benchmarks which look at "throughput" without naming the cache lines being touched are illegible. A throughput number is the integral of per-core productivity minus per-core coherence-stall time; it tells you that the integral has a value but not why. The articles that follow are the field guide to reading the integrand — to looking at a flamegraph and inferring which cache line is on fire.
A coda on language. The vocabulary of coherence — Modified, Exclusive, Shared, Invalid, Owned, RFO, snoop, ack, fence, fabric — was designed by hardware engineers to describe what a logic analyser would see if you probed the chip's internal buses. That vocabulary leaked into systems-performance work in the 1990s and never left. When you read a Brendan Gregg post that says "the line was bouncing between cores in the M/I dance", or a kernel mailing-list thread that says "we lost the ownership on the slow path", these are the words for the same things. The vocabulary is technical but small; once you carry it, the literature opens up. The chapter that follows assumes you carry it.
The next chapter takes the opposite view: instead of avoiding coherence, use it as a measurement. The cache hit/miss counters that perf exposes are coherence-protocol artefacts — they tell you exactly which lines are bouncing, how often, and between whom. Reading those counters fluently is the bridge between "I think there is a coherence problem" and "the contended line is at offset 0x40 of the state struct, owned by goroutines G37 and G412".
A short mental check: any code review where the diff adds a std::atomic, an atomic.AddInt64, a Mutex, a RwLock, or a Java volatile should answer one question — "how many threads will hit this line per second, and is the line shared with anything else?". If the answer is "fewer than 100K/sec and nothing else lives on the line", coherence is not your bottleneck. If the answer is "millions of writes/sec from many cores" or "I don't know", the line is a future incident. The Hotstar bug, the Dream11 leaderboard regression, the Razorpay pre-actor mutex bottleneck — all three were preventable in code review with that one question. The answer doesn't have to be a number; "we sharded by user-id-hash" is a complete answer. The dangerous answer is silence.
A short note on the assumption baked into all of the above: the protocols we have discussed are write-invalidate — when one core writes, others' copies become Invalid. There is also a write-update family (Dragon, Firefly) where the writing core broadcasts the new value and other Sharers update in place. Write-update is theoretically more bandwidth-efficient on producer-multiple-consumer, but in practice every shipping CPU since the late 1990s uses write-invalidate, because write-update sends data on every store while write-invalidate only sends data on the next read. Most workloads have read/write ratios in which invalidate wins. Knowing the road-not-taken is part of carrying the vocabulary.
References
- Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019) — Chapter 5 derives MESI and MOESI from first principles, with the snoop vs directory trade-off worked through in detail.
- Sweazey & Smith, "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus" (ISCA 1986) — the original paper that named MOESI and showed its place in the protocol family.
- Intel® 64 and IA-32 Architectures Optimization Reference Manual — §11 (cache hierarchy) and §B.5 (uncore performance counters) describe Intel's MESIF implementation and how to measure coherence traffic.
- AMD64 Architecture Programmer's Manual, Vol. 2: System Programming — §7 (memory system) defines MOESI on AMD parts and the Probe Filter that AMD calls a snoop filter.
- Drepper, "What Every Programmer Should Know About Memory" (2007) — §3.3.4 walks through MESI in textbook clarity, with timing diagrams of the bus messages.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — §6 (CPU profiling) covers
perf staton coherence counters andperf c2cfor cache-line-level diagnostics. - Sorin, Hill, & Wood, A Primer on Memory Consistency and Cache Coherence (Morgan & Claypool, 2011) — the modern textbook treatment; Chapter 6 separates coherence from consistency cleanly, which avoids the most common source of confusion in this area.
- /wiki/cache-lines-and-why-64-bytes-rules-everything — the prequel: the line is the unit on which all coherence operates.