Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Cache coherence as a concurrency primitive
KapitalKite's order-book hot path increments a std::atomic<uint64_t> sequence number on every order. On a 64-core EPYC under a Tatkal-hour load burst, p99 add-order spikes from 9 µs to 140 µs. Aditi's first instinct was a lock-contention bug. There is no lock. The atomic increment compiles to a single lock incq instruction. Yet perf c2c record -- ./matchbench reports 8.4 million HITM events — a load that hit a cache line in another core's modified state — concentrated on one 64-byte line. The CPU is not slow because of code. It is slow because of a protocol that nobody on the team had ever named: MESI, the cache-coherence protocol that makes "shared memory" mean what your program thinks it means.
A multicore CPU does not have shared memory; it has many private L1d caches that agree on the value of each 64-byte cache line through a hardware protocol called MESI. Every atomic operation, every mutex acquisition, and every lock-free CAS is a request to MESI to move a cache line into the right state on the right core. Coherence is not free — a line bounced between two cores costs ~80 ns per flip, and contended atomics on a shared line are MESI traffic, not arithmetic. The protocol is the primitive on which every higher-level synchronization primitive is built.
What cache coherence actually is
Look at any modern multicore CPU and you will find this fact: each physical core has its own L1 data cache (32–64 KB on Intel and AMD; 64 KB on Apple M-series). Each core's L1d is private. Yet your C++ program writes 42 to *p on core 0, and a load of *p on core 5 returns 42 — not stale memory, not garbage. The hardware that makes this true is a protocol that sends messages between caches every time a line is read, written, or evicted.
The protocol most x86 and ARM CPUs use is MESI, a four-state-per-cache-line invariant first published by Papamarcos & Patel in 1984. Every line in every L1d is in one of four states from the point of view of that core:
- M (Modified) — this core has the only valid copy; its value is dirtier than memory; no other core has the line.
- E (Exclusive) — this core has the only copy and it is clean (matches memory); no other core has the line.
- S (Shared) — this core has a clean copy and one or more other cores also have clean copies; nobody has modified it.
- I (Invalid) — this core's copy is stale or not present; a load will miss and fetch from L2/L3/DRAM (or from another core's L1 via a snoop).
The invariant the protocol maintains: for any cache line, at most one core may be in M or E state, OR multiple cores may be in S state, but never one core in M and another core with a copy at all. A write-intent on a line forces every other core's copy to I (invalidation); the writer transitions S→M or E→M and proceeds with a private dirty copy. A read-intent on a line that another core holds in M forces that core to write-back (M→S) and supply the data; the reader transitions I→S.
lock-prefixed instruction, every std::atomic store, every mutex lock() ultimately translates to one or more transitions in this diagram. The protocol runs in hardware on every cycle; software cannot disable it, only avoid the expensive transitions.Why MESI's invariant is the primitive: the protocol guarantees that the value any core observes for a cache line is consistent with some single global write order on that line. This is not the same as sequential consistency for the whole program (loads and stores can still reorder within a core), but it is per-line linearizability. Every higher synchronization primitive — atomics, mutexes, RCU, lock-free queues — is built on the assumption that each cache line has a single global modification history that all cores agree on. Without MESI (or its descendants MOESI, MESIF), std::atomic<int>::fetch_add would not be atomic; it would be a racy read-modify-write across private caches.
Measuring the cost — pingponging a single cache line between cores
Theory aside, the question that matters operationally is: how much does this cost? When two cores write to the same cache line, every write is a MESI transition. The line moves from one core's M state to the other core's M state through I and back, and every move is an interconnect round trip. On a single-socket EPYC or modern Intel server, that round trip is 50–100 ns. On a two-socket box where the cores are on different sockets, it is 200–400 ns because the snoop crosses the socket interconnect.
Here is the canonical micro-benchmark — two threads each incrementing a shared std::atomic<uint64_t> in a hot loop. The number printed is the average time per increment, which is exactly the cost of one MESI ping-pong round.
// pingpong.cpp — measure cost of one MESI bounce per atomic increment.
// Build: g++ -O2 -std=c++17 -pthread pingpong.cpp -o pingpong
// Usage: ./pingpong <cpuA> <cpuB> e.g. ./pingpong 0 1 (cross-core)
// ./pingpong 0 0 (single-core baseline)
#define _GNU_SOURCE
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <sched.h>
#include <thread>
static constexpr uint64_t ITERS = 50'000'000;
static std::atomic<uint64_t> shared{0}; // the contended line
static void pin(int cpu) {
cpu_set_t s; CPU_ZERO(&s); CPU_SET(cpu, &s);
pthread_setaffinity_np(pthread_self(), sizeof s, &s);
}
static void worker(int cpu) {
pin(cpu);
for (uint64_t i = 0; i < ITERS; i++) {
shared.fetch_add(1, std::memory_order_relaxed); // lock incq on x86
}
}
int main(int argc, char** argv) {
if (argc != 3) { fprintf(stderr, "usage: %s cpuA cpuB\n", argv[0]); return 2; }
int a = atoi(argv[1]), b = atoi(argv[2]);
auto t0 = std::chrono::steady_clock::now();
std::thread tA(worker, a), tB(worker, b);
tA.join(); tB.join();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
std::chrono::steady_clock::now() - t0).count();
double per = double(ns) / (2.0 * ITERS);
fprintf(stderr, "cpus=%d,%d total_ns=%lld per_op=%.1f ns final=%lu\n",
a, b, (long long)ns, per, shared.load());
return 0;
}
Sample output on a Ryzen 7950X (16 cores, single CCD on cores 0–7, second CCD on 8–15):
$ ./pingpong 0 0 # both threads on the same logical CPU — sequential, no MESI
cpus=0,0 total_ns=412813921 per_op=4.1 ns final=100000000
$ ./pingpong 0 1 # two cores, same CCD (intra-CCD interconnect)
cpus=0,1 total_ns=8214095123 per_op=82.1 ns final=100000000
$ ./pingpong 0 8 # two cores, different CCDs (inter-CCD via Infinity Fabric)
cpus=0,8 total_ns=14302118473 per_op=143.0 ns final=100000000
The first run pins both threads to one logical CPU. They never run truly concurrently, the line stays in M on that one core, and lock incq runs at its uncontended cost — about 4 ns. The second run spreads them across two cores on the same CCD. Each fetch_add requires the line to migrate from the other core's M to this core's M, and that migration is one round trip on the intra-CCD interconnect — ~80 ns. The third run crosses CCDs on Zen 4: the line traverses the Infinity Fabric between the two L3 slices, paying ~140 ns per increment. The cost is not the increment. The cost is the MESI traffic the increment forces.
Why memory_order_relaxed does not help here: relaxed ordering only weakens the reordering constraints on the loads and stores around the atomic. It does not change the underlying cache-line traffic. lock incq is the same instruction on x86 regardless of the C++ memory order specified — the only thing that changes with stronger orders is whether the compiler emits an additional mfence around the atomic. The MESI bounce is intrinsic to having two cores write to one line, not to ordering. The fix is to not share the line, not to weaken the ordering.
Cache-line ping-pong unfolds in time
The benchmark gives you a number. The diagram gives you the cost's source — the messages that flow between cores on every increment.
The protocol-traffic cost is also visible in perf c2c (cache-to-cache):
$ perf c2c record -- ./pingpong 0 1
$ perf c2c report --stats-only
=================================================
Trace Event Information
=================================================
Total records : 100002154
Locked Load/Store : 100000000 <-- our lock incq stream
Load Operations : 50001077
Loads - HITM : 49998213 <-- cross-core M-state hits, the bad ones
Loads - Local HITM : 49998213
Loads - LLC Misses : 12
Every single fetch_add other than the very first is a HITM — a load that hit the line in another core's modified state. That is the signature of cache-line ping-pong, the operational symptom you grep for in perf c2c reports when an "atomic" suddenly becomes the slowest thing in your profile.
Why this is the foundation of every concurrency primitive
This is the chapter's claim, stated directly: cache coherence is the lowest-level mechanism that makes shared-memory concurrency work at all, and every primitive above it inherits MESI's costs.
A std::mutex::lock() on x86 (libstdc++ uses futex-backed parking) is in the uncontended fast path a single lock cmpxchg on a futex word. Uncontended, the line is in M on the calling core, and the cost is ~18 ns. Contended across two cores, the cmpxchg is a MESI bounce — 80 ns minimum, with additional cost if the futex syscall is taken. The mutex did not become slow because the kernel got busy; it became slow because two cores wanted the same cache line. Why this composes: every atomic-based primitive — spinlocks, futexes, ticket locks, MCS locks, lock-free counters, the per-core ref-count in std::shared_ptr — has a hot cache line that the protocol shuttles between cores. The throughput ceiling of any such primitive is roughly 1 / line_migration_cost operations per second per contending core pair. On a 64-core EPYC with ~150 ns cross-CCD migration, that is 6.7 M ops/sec ceiling on a single contended line — independent of how clever the lock implementation is.
The implication: contention is not a property of the lock implementation; it is a property of which cache line your operation touches. The fix is structural — partition the data so different cores write to different lines (per-core counters, sharded queues, NUMA-local allocators) — not algorithmic ("use a faster lock"). Every chapter in Parts 5, 6, 7, and 8 of this curriculum revolves around this re-framing.
# probe_coherence.py — sweep one cache line vs separated lines, see the cost.
# Drives the C++ kernel across pinning configurations and prints a table.
import subprocess, json, sys
def measure(a, b):
r = subprocess.run(["./pingpong", str(a), str(b)],
capture_output=True, text=True, check=True)
last = r.stderr.strip().split()
per_op = float(last[2].split("=")[1])
return per_op
configs = [
("same-core (no MESI)", 0, 0),
("same-CCD cores", 0, 1),
("cross-CCD cores", 0, 8),
]
print(f"{'config':24s} {'ns/op':>8s} {'mops/s':>8s}")
for name, a, b in configs:
ns = measure(a, b)
print(f"{name:24s} {ns:8.1f} {1000.0/ns:8.2f}")
config ns/op mops/s
same-core (no MESI) 4.1 243.90
same-CCD cores 82.1 12.18
cross-CCD cores 143.0 6.99
The 60× gap between the same-core baseline and the cross-CCD case is pure protocol cost. It is not avoidable by changing the atomic operation (fetch_add vs CAS), the memory order (relaxed vs seq_cst), or the language (C++ vs Rust vs Java). The only fix is to stop sharing the line.
Common confusions
- "Cache coherence is the same as memory consistency" They are different layers. Cache coherence is per-line: every core agrees on the modification history of each cache line. Memory consistency (sequential, TSO, ARM weak) is across lines: in what order can different cores observe writes to different lines. x86 has strong cache coherence (MESI/MOESI) and a fairly strong consistency model (TSO). ARM has equally strong coherence and a much weaker consistency model. You can have coherence without strong consistency, and the difference is the entire subject of Part 3.
- "
volatileprovides cache coherence" No.volatilein C/C++ tells the compiler not to optimise away the load or store; it has nothing to do with caches. Coherence is a hardware property — you get it whether or not you usevolatile. Usestd::atomic<T>(which compiles tolock-prefixed instructions on x86) to control both compiler reordering and the memory order;volatilecontrols only the former and only on the local core. - "Atomic operations bypass the cache" They do not.
lock incqon x86 acquires the cache line in M state on the issuing core (using MESI), executes the read-modify-write in L1d, and releases the line. There is no DRAM round trip. The line never leaves the cache hierarchy. The "lock" prefix is a coherence-protocol marker, not a memory-bus lock — the original 80486 actually assertedLOCK#on the bus, but every CPU since the Pentium Pro (1995) implements it via cache coherence. - "Snooping happens for every memory access" It happens for every miss and for every coherence transition the issuing core requests. Hits in M, E, or S state stay local — no bus traffic. The protocol's efficiency depends on most accesses being hits in stable states; ping-pong is pathological because every access is a state transition.
- "MESI is the only protocol" Real CPUs use variants. Intel uses MESIF (adds Forward state — exactly one S-cache forwards data to a requester, the others stay quiet). AMD uses MOESI (adds Owned state — one S-cache holds the dirty value and supplies it without forcing write-back to memory). ARM uses MESI-like protocols with directory-based filtering. The four-state intuition (own / exclusive / shared / nothing) is correct everywhere; the optimisations differ.
- "More shared cache levels mean more coherence cost" The opposite is often true. A shared L3 lets two cores' L1d caches communicate via the L3 slice without crossing a socket interconnect — the line migrates one level, not three. Cross-socket boxes pay the most coherence cost; single-socket boxes with shared L3 pay much less. Multi-CCD AMD layouts are an intermediate case (Infinity Fabric between CCDs is faster than QPI between sockets, slower than intra-CCD).
Going deeper
Directory-based coherence vs snoop-based, and why scale changed everything
Snoop-based coherence (the original MESI implementation) broadcasts every coherence request to every core's cache. This works on 4–8 cores and falls apart at 64+ — the broadcast traffic scales O(n²) with core count, and the snoop bandwidth becomes the bottleneck. Modern many-core CPUs (Intel Xeon 56-core+, AMD EPYC 96-core+) use directory-based coherence: a hardware directory tracks which caches hold each line and which states they are in, so a coherence request is sent only to the cores that actually hold the line. The directory adds latency (an extra hop to consult the directory before sending the snoop) but reduces traffic from O(n) to O(1) per request. The crossover is around 16 cores. AMD's CCD-CCD coherence over Infinity Fabric is directory-based; intra-CCD coherence is snoop-based. Why this matters for your program: cross-CCD atomic contention pays the directory hop on top of the line migration — that is why ./pingpong 0 8 was 143 ns instead of 82 ns. The protocol layer was different. On a 64-core box, the same atomic spread across 64 contending cores can collapse to under 1 M ops/sec; the directory becomes saturated and starts queueing.
The "Owned" state in MOESI and what it saves
In MESI, when core 0 has a line in M state and core 1 reads it, the line transitions to S on core 0 and S on core 1, and the dirty value must be written back to memory (or to the next-level cache) to maintain the S-state invariant of "matches memory". MOESI adds the O (Owned) state: core 0 transitions M→O, core 1 transitions I→S, and the value stays dirty in core 0's cache. Core 0 supplies subsequent reads from its O state without write-back. Only when the line is evicted from core 0's cache does the write-back happen. This saves a write-back per M→S downgrade on read-mostly shared lines — a meaningful optimisation for AMD's typical workload (server consolidation, where read sharing dominates write sharing). Intel's MESIF chose a different optimisation point (filtering which S-cache responds to forwarding); both are isomorphic in correctness, different in traffic patterns.
CricStream's video-segment hash table — when MESI defeated a "lock-free" data structure
CricStream's CDN edge nodes maintain a per-region hash table of currently-streaming video segments, with one bucket per cricket match (~2,000 active matches at the IPL final). The original implementation used a std::atomic<Segment*> bucket head with CAS-based insertion — textbook lock-free, no mutex anywhere. On the IPL final, with 12 cores all updating buckets concurrently, the team measured 6 M ops/sec — fine for the load. On the next-gen 64-core EPYC box, the same code measured 4 M ops/sec. It got slower with more cores. perf c2c showed every bucket head was sharing a 64-byte cache line with three or four other bucket heads. Inserting into bucket A invalidated three sibling buckets' lines on the cores that were currently inserting into them. The fix was structural: pad each bucket to a full cache line (alignas(64)), eliminating false sharing between buckets. Throughput rose to 38 M ops/sec on the same hardware. The "lock-free algorithm" was correct all along; the coherence protocol was the bottleneck. (This is the false sharing chapter's preview — Part 2 of this curriculum keeps coming back to it.)
Coherence and atomic write-combine — when the protocol takes a different path
Modern x86 supports write-combining (WC) memory regions where stores accumulate in a write-combining buffer and are flushed as a 64-byte burst, bypassing the normal coherence flow. WC is used for graphics framebuffers, DMA-capable buffers, and some PMEM regions. WC stores do not participate in MESI in the same way — the destination cache line is invalidated, not migrated, and the protocol skips the read-for-ownership phase. This is faster for streaming writes (a video frame copy is 5–10× faster on WC memory than on cached memory) but useless for atomics — lock cmpxchg on WC memory is undefined behaviour on x86. The atomic operations you write in C++ assume cacheable, coherent memory, and the compiler will not emit them otherwise.
Where this leads next
The next chapter — false sharing: the invisible killer — picks up where this one ends. Two unrelated counters on the same cache line cause exactly the ping-pong measured here, even though the program never accesses the same variable from two cores. The fix (cache-line padding, alignas(64), std::hardware_destructive_interference_size) is a five-line change that can move a benchmark from 4 M ops/sec to 38 M ops/sec, as the CricStream story shows.
After Part 2 closes, memory models builds on coherence to define what across-line ordering guarantees a CPU offers. Coherence gives you per-line linearizability for free; consistency is the harder problem of how loads and stores to different lines may be reordered. The atomic operations in Part 4 are coherence requests with explicit ordering attached; the Treiber stack and Michael-Scott queue in Part 7 are coherence patterns dressed up as data structures. Every chapter that follows is, in the end, a story about which cache lines you choose to share and which transitions you force MESI to perform.
References
- Papamarcos & Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories" (ISCA 1984) — the original MESI paper.
- Sweazey & Smith, "A Class of Compatible Cache Consistency Protocols" (ISCA 1986) — the four-state classification (MESI, MOESI, MEI, MSI) and trade-offs.
- Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3, Ch. 11 "Memory Cache Control" — Intel's authoritative description of cache states and coherence.
- AMD64 Architecture Programmer's Manual, Vol. 2, §7 "Memory System" — MOESI on AMD, including the Owned state's transitions.
- Sorin, Hill, Wood, A Primer on Memory Consistency and Cache Coherence (2nd ed., 2020) — the textbook reference; chapters 6–9 cover MESI/MOESI/MESIF and directory protocols.
- Joe Mario, "C2C — False Sharing Detection in Linux Perf" — the introductory
perf c2cpost; this is the tool you reach for when MESI ping-pong is suspected. - Folly's
cachelinepadandHazptr— Facebook's production library; the padding patterns in the source are an empirical guide to what coherence costs in practice. - Internal — cores, hyperthreads, and the pretend-ness of parallelism — the topology layer that defines which cores' caches are siblings.
- Internal — wall: no discussion is possible without a machine model — Part 2's framing chapter.
# Reproduce on your laptop (Linux x86_64)
sudo apt install build-essential linux-tools-common
g++ -O2 -std=c++17 -pthread pingpong.cpp -o pingpong
./pingpong 0 0 # baseline: same logical CPU, no MESI traffic
./pingpong 0 1 # cross-core, same CCD on AMD / same socket on Intel
./pingpong 0 8 # cross-CCD on AMD (substitute a far-CPU on Intel)
# Trace the actual coherence events:
sudo perf c2c record -- ./pingpong 0 1
sudo perf c2c report --stats-only
# Drive everything via the Python harness:
python probe_coherence.py