Coherence traffic as a hidden ceiling

Aditi runs a stress test on a Rust counter service at a Bengaluru ad-tech startup. With one core, throughput is 84 million increments/second. With two, 152M — almost linear. With four, 188M. With eight, 142M. With sixteen, 71M — worse than one core. The flamegraph shows 99.7% of time in the increment function. There is no lock, no syscall, no allocation. The Rust compiler emitted a single lock xadd instruction per increment. Aditi has discovered, the hard way, that "lock-free" is not the same as "scales".

The counter sits on one cache line. Every core that increments it must first own that line in Modified state, which means every other core's copy must be invalidated. At sixteen cores, the line is being yanked across the ring bus 71 million times per second, each yank costing roughly 70-200 nanoseconds of coherence latency. The instructions execute; the cache line cannot keep up. Throughput is no longer instruction-bound; it is coherence-bound. Amdahl's law has nothing to say about this — every "increment" is parallel by Amdahl's definition. The bottleneck is the conversation between cores about who owns the line right now.

A multi-core CPU keeps cached copies of memory consistent across cores by exchanging messages on an on-chip interconnect — the coherence protocol (MESI/MOESI). Every write to a cache line that other cores have cached costs an invalidate-and-fetch round trip whose latency is two orders of magnitude larger than the write itself. At small N this overhead is invisible; past some N* it becomes the dominant cost, and adding cores lowers throughput instead of raising it. Coherence traffic is the physical source of the beta term in the Universal Scalability Law and the silent killer of "lock-free" code that shares state.

The coherence protocol — what the cores actually say to each other

Every core has its own L1 and L2 cache; the LLC and DRAM are shared. To make this look like "one memory", the hardware runs a coherence protocol — most commonly MESI (Modified, Exclusive, Shared, Invalid) or its variants MOESI/MESIF. Each cache line is in one of those states on each core, and the protocol maintains the invariant that no two cores hold the line in M simultaneously, and a core that holds it in M is the unique writable owner.

The four states tell you what a core is allowed to do without asking anyone else. Modified (M): this core has the only copy, and it differs from memory — the core can read or write at L1 latency (~1 ns). Exclusive (E): this core has the only copy and it matches memory — the core can read at L1 latency, and a write upgrades to M with no traffic. Shared (S): multiple cores have read-only copies — reads are L1 latency, but a write requires invalidating the other copies first. Invalid (I): this core does not have the line — any access requires fetching it from another core or memory.

The transitions are where the cost lives. A write to a line in S triggers a read-for-ownership (RFO): the core broadcasts an invalidate message, every other core that holds the line in S or E marks it I, the issuing core upgrades to M, and only then can the write retire. On a Skylake-X server the round trip is ~75-90 ns within a socket and ~200-300 ns across sockets. A write to a line in I is worse — the line must first be fetched from whichever core holds it in M (a forward), then invalidated everywhere else, then upgraded.

[Figure: MESI state transitions for a cache line under shared writes. Four cores arranged around a central 64-byte cache line (state M on Core 2); arrows labelled 1. RFO (read-for-own), 2. forward data, and 3. invalidate show the traffic between cores. ~75-90 ns within socket; ~200-300 ns across sockets (Skylake-X class).]
Core 0 wants to write a line currently held in M by Core 2 and shared by Cores 1 and 3. The write costs four bus messages: RFO from Core 0, forward of the dirty data from Core 2, plus two invalidate acknowledgements from Cores 1 and 3. Each round trip is ~75-300 ns. Illustrative.
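
The message accounting can be made concrete with a toy model. The sketch below is a deliberate simplification (one cache line, broadcast snooping, no LLC, and not how any real implementation is structured); its only job is to show why alternating writers generate traffic while a silent E-to-M upgrade costs nothing.

# mesi_toy.py — toy MESI model for one cache line; counts interconnect messages.
# A sketch under simplifying assumptions (broadcast snoop, no LLC), not real hardware.
def simulate(trace, ncores=4):
    state = ["I"] * ncores                   # this line's MESI state on each core
    msgs = 0
    for core, op in trace:                   # op is "R" or "W"
        others = [c for c in range(ncores) if c != core]
        if op == "R" and state[core] == "I":
            msgs += 1                        # Read broadcast on the interconnect
            if any(state[c] != "I" for c in others):
                for c in others:             # an owner in M/E downgrades to S
                    if state[c] in "ME": state[c] = "S"
                state[core] = "S"
            else:
                state[core] = "E"            # fetched from memory, sole copy
        elif op == "W":
            if state[core] == "E":
                state[core] = "M"            # silent upgrade: zero traffic
            elif state[core] != "M":         # S or I: RFO plus invalidate acks
                msgs += 1 + sum(state[c] != "I" for c in others)
                for c in others: state[c] = "I"
                state[core] = "M"
    return msgs

print(simulate([(0, "W")] * 20))             # one writer: 1 message total
print(simulate([(0, "W"), (1, "W")] * 10))   # two writers ping-pong: 39 messages

Twenty writes from one core cost a single message; the same twenty writes alternating between two cores cost thirty-nine. That ratio is the whole chapter in two print statements.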

The protocol is correct — the cores never see inconsistent state — but it is not free. Every write to a line another core has touched costs at least one round trip on the interconnect, and the interconnect has finite bandwidth and a finite number of in-flight transactions per core (the line fill buffer depth on Intel parts is ~10-12 per core; AMD's Infinity Fabric has its own credit limit). When the offered coherence rate exceeds what the interconnect can sustain, additional cores produce more traffic without producing more useful work — and the throughput curve turns over.

The mental model that matters: a cache line is the unit of coherence. Two variables that fit in the same 64-byte line are, from the protocol's point of view, the same thing — touching one invalidates the other on every other core. This is false sharing, and it is the most common way coherence costs ambush production code. Two unrelated counters that two threads each increment, allocated next to each other in a struct, will produce the exact same coherence traffic as a single shared counter — except that the developer never wrote the word "shared" anywhere and has no reason to suspect a coherence problem. The protocol does not care about your variable names; it cares about the line.

Why coherence is not "just slow memory access": a cold L1 miss that hits LLC costs ~12 ns; a cold L1 miss that hits DRAM costs ~80 ns. Both are point-to-point fetches with predictable bandwidth. Coherence traffic is broadcast — every write to a shared line must reach every core that might have it cached, and the response latency is the slowest such core's reply. The cost grows superlinearly with N because the broadcast fanout grows with N. This is structurally different from memory bandwidth, which grows linearly with N until the DRAM controller saturates. Coherence is the cost of agreement; memory bandwidth is the cost of motion.
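
The fanout argument is just arithmetic, and a three-line sketch makes it visible. The 10M writes/core/s rate below is a hypothetical, chosen only to show the shape:

# fanout_cost.py — back-of-envelope: broadcast fanout vs useful work.
# Assumes a hypothetical 10M writes/core/s and one invalidate per other caching core.
RATE = 10_000_000
for N in (2, 4, 8, 16, 32):
    writes = N * RATE                  # useful work grows linearly with N
    msgs = writes * (N - 1)            # invalidate fanout adds another factor of N
    print(f"N={N:2d}  writes={writes/1e6:6.0f}M/s  coherence msgs={msgs/1e6:8.0f}M/s")
# Work grows ~N while message volume grows ~N^2: the shape of the beta term.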

Measuring the coherence ceiling

The harness below reproduces Aditi's experiment. It launches N Python worker processes that each increment a shared counter, then plots the per-process and aggregate throughput. The shared counter is implemented in a tiny C kernel called via ctypes — Python's per-instruction overhead would otherwise dominate and hide the coherence signal. The Python driver carries the article's logic; the C kernel is the escape hatch the style guide allows for sub-microsecond loops.

# coherence_ceiling.py — measure shared-counter throughput vs core count
# Run: python3 coherence_ceiling.py 1 2 4 8 16
import ctypes, multiprocessing as mp, sys, time, subprocess

# Build the kernel once (atomic xadd on a shared u64).
KSRC = r"""
#include <stdint.h>
#include <stdatomic.h>
void bump(_Atomic uint64_t *p, uint64_t n) {
    for (uint64_t i = 0; i < n; ++i)
        atomic_fetch_add_explicit(p, 1, memory_order_relaxed);
}
"""
def build():
    with open("/tmp/bump.c", "w") as f:
        f.write(KSRC)
    subprocess.check_call(["cc","-O2","-shared","-fPIC","-o","/tmp/bump.so","/tmp/bump.c"])

def worker(addr, iters):
    lib = ctypes.CDLL("/tmp/bump.so")
    lib.bump.argtypes = [ctypes.c_void_p, ctypes.c_uint64]
    lib.bump(ctypes.c_void_p(addr), iters)

def measure(N, iters=20_000_000):
    counter = mp.RawValue(ctypes.c_uint64, 0)   # shared u64 in an mmap'd page
    addr = ctypes.addressof(counter)            # stays valid in children under fork (the Linux default)
    procs = [mp.Process(target=worker, args=(addr, iters)) for _ in range(N)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    dt = time.perf_counter() - t0
    return iters * N / dt

if __name__ == "__main__":
    build()
    Ns = [int(x) for x in sys.argv[1:]] or [1, 2, 4, 8, 16]
    print(f"# logical cores available: {mp.cpu_count()}")
    print(f"# {'N':>3}  {'M-incr/s':>10}  {'per-core':>10}")
    for N in Ns:
        if N > mp.cpu_count(): break
        thru = measure(N)
        print(f"  {N:3d}  {thru/1e6:10.2f}  {thru/1e6/N:10.2f}")
# Sample run on c6i.4xlarge (16 vCPU, Intel Ice Lake), ap-south-1
# logical cores available: 16
#   N    M-incr/s    per-core
#     1       84.31       84.31
#     2      152.04       76.02
#     4      188.61       47.15
#     8      141.88       17.74
#    16       71.42        4.46

Walk-through. KSRC is a six-line C kernel: a for loop calling atomic_fetch_add_explicit on a shared u64. The loop emits a single lock xadd instruction per iteration on x86 — the cheapest possible atomic operation. mp.RawValue allocates the shared u64 in an mmap'd page so all child processes see the same memory; the addr we pass is a process-virtual pointer that resolves to the same physical line in every child. measure(N) spawns N workers, each running iters = 20M increments, and returns aggregate throughput. The output is the lesson. Per-core throughput is 84M at N=1 but only 4.5M at N=16 — a 19x drop in per-core productivity. Aggregate throughput peaks at N=4 (188M) and falls off; from N=4 to N=16 the system is using 4x more cores to do less work. This is the textbook USL retrograde-scaling regime, and the cause is not contention in the algorithmic sense (there is no lock to wait on) — it is the cache line itself ping-ponging between cores at the speed of the coherence interconnect.

[Figure: Shared counter — aggregate throughput vs N (retrograde scaling). Aggregate increments/second (y-axis, 0-200M) against worker count (x-axis, 1-16): 84M, 152M, 188M, 142M, 71M at N = 1, 2, 4, 8, 16. Annotations mark the linear hope (S=N), the USL knee at N=4, and the retrograde region where the beta term dominates.]
The curve is the canonical USL shape — rise, knee, fall — when coherence cost dominates. From N=4 onward, every additional core reduces aggregate throughput. Numbers measured by the harness above on c6i.4xlarge.

Why USL fits and Amdahl does not: Amdahl's S(N) = 1 / (alpha + (1-alpha)/N) is monotonically increasing — it can never produce a curve that falls. The data above falls. The Universal Scalability Law S(N) = N / (1 + alpha(N-1) + beta * N(N-1)) has the beta * N(N-1) coordination term, which grows quadratically with N; past the knee, where beta * N(N-1) exceeds the linear gains, the curve turns over. Fitting USL to the table above gives roughly alpha ~= 0.0, beta ~= 0.07 — almost no irreducible serial fraction, but a coordination cost that overwhelms scaling at modest N. The USL peak sits at N* = sqrt((1 - alpha) / beta) ~= 3.8, matching the observed knee at N=4. The whole bottleneck is beta, and beta is, in this microbenchmark, exactly the per-core-pair coherence-traffic cost.
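
The fit itself takes a few lines. The sketch below uses scipy's curve_fit, which is an added dependency, not part of the measurement harness; any least-squares fitter gives the same answer.

# usl_fit.py — fit the USL to the measured table above.
# numpy/scipy are assumed to be installed; they are not part of the harness.
import numpy as np
from scipy.optimize import curve_fit

N = np.array([1, 2, 4, 8, 16], dtype=float)
thru = np.array([84.31, 152.04, 188.61, 141.88, 71.42])   # M-incr/s, from the table

def usl(n, lam, alpha, beta):
    # lam: single-worker rate; alpha: serial fraction; beta: coordination cost
    return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

(lam, alpha, beta), _ = curve_fit(usl, N, thru, p0=(84, 0.01, 0.05),
                                  bounds=(0, [300, 1, 1]))
nstar = ((1 - alpha) / beta) ** 0.5        # USL peak: where dS/dN = 0
print(f"lambda={lam:.1f}  alpha={alpha:.4f}  beta={beta:.4f}  N*={nstar:.1f}")
# Prints roughly lambda~86, alpha~0, beta~0.07, N*~3.8 for the data above.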

The diagnostic procedure to confirm the cost is coherence and not something else:

# perf stat — count coherence-related events
$ perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses,L1-dcache-loads,L1-dcache-load-misses \
    -- python3 coherence_ceiling.py 16

  16  71.42  4.46  (M-incr/s, per-core)

 Performance counter stats for 'python3 coherence_ceiling.py 16':
   1,243,118,201      L1-dcache-loads
     983,427,914      L1-dcache-load-misses    # 79.11% of all L1-dcache hits
     651,883,447      LLC-loads
     617,529,210      LLC-load-misses          # 94.73% of all LL-cache hits

# perf c2c — pinpoint the offending cache line
$ perf c2c record -- python3 coherence_ceiling.py 16
$ perf c2c report --stdio | head -20
            Shared Cache Line Distribution Pareto
=================================================
   #         ----- HITM -----   ---- Store Refs ---  Total
 0          81.4%      96M        92.8%   1.86B      1.95B   counter+0x0  bump.c:5

The 79% L1 miss rate is the smoking gun — every write to the line invalidates the local copy, so the next write misses. The 94% LLC miss rate confirms the line is bouncing between cores via the inter-core interconnect, not sitting in any single core's caches. perf c2c ("cache-to-cache") goes one step further and identifies the exact line and source-code line responsible for the HITM (hit-modified-elsewhere) traffic — in this case the counter's address inside bump.c:5, the increment loop. The HITM count of 96 million matches the order of magnitude of lock xadd operations the kernel issued. There is no ambiguity in the diagnosis once perf c2c lands on the line.

False sharing — the same problem with no shared variable

The shared-counter case is at least honest — the line really is shared. The more dangerous version of coherence cost happens when no programmer has ever written code that says "share this". Consider this Python-driven C harness measuring two threads incrementing two different counters:

# false_sharing.py — two threads, two counters, but on the same cache line
import ctypes, threading, time, subprocess

KSRC = r"""
#include <stdint.h>
#include <stdatomic.h>

struct adjacent { _Atomic uint64_t a; _Atomic uint64_t b; };
struct padded   { _Atomic uint64_t a; char _pad[56]; _Atomic uint64_t b; };

void bumpA_adj (struct adjacent *p, uint64_t n) { for (uint64_t i=0; i<n; ++i) atomic_fetch_add_explicit(&p->a, 1, memory_order_relaxed); }
void bumpB_adj (struct adjacent *p, uint64_t n) { for (uint64_t i=0; i<n; ++i) atomic_fetch_add_explicit(&p->b, 1, memory_order_relaxed); }
void bumpA_pad (struct padded   *p, uint64_t n) { for (uint64_t i=0; i<n; ++i) atomic_fetch_add_explicit(&p->a, 1, memory_order_relaxed); }
void bumpB_pad (struct padded   *p, uint64_t n) { for (uint64_t i=0; i<n; ++i) atomic_fetch_add_explicit(&p->b, 1, memory_order_relaxed); }
"""
open("/tmp/fs.c","w").write(KSRC)
subprocess.check_call(["cc","-O2","-shared","-fPIC","-o","/tmp/fs.so","/tmp/fs.c"])
lib = ctypes.CDLL("/tmp/fs.so")

class Adj(ctypes.Structure): _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]
class Pad(ctypes.Structure): _fields_ = [("a", ctypes.c_uint64), ("_pad", ctypes.c_char*56), ("b", ctypes.c_uint64)]

ITERS = 50_000_000
def run(struct_t, fnA, fnB, label):
    s = struct_t()
    fnA.argtypes = fnB.argtypes = [ctypes.POINTER(struct_t), ctypes.c_uint64]
    t0 = time.perf_counter()
    # ctypes.CDLL calls release the GIL, so the two threads genuinely run in
    # parallel on separate cores despite CPython's interpreter lock.
    tA = threading.Thread(target=fnA, args=(ctypes.byref(s), ITERS))
    tB = threading.Thread(target=fnB, args=(ctypes.byref(s), ITERS))
    tA.start(); tB.start(); tA.join(); tB.join()
    dt = time.perf_counter() - t0
    print(f"  {label:>20}  {2*ITERS/dt/1e6:8.2f} M-incr/s   wall={dt:.3f}s")

run(Adj, lib.bumpA_adj, lib.bumpB_adj, "adjacent (false sh.)")
run(Pad, lib.bumpA_pad, lib.bumpB_pad, "padded (no sh.)")
# Output — same machine, same instruction count, only struct layout differs
#   adjacent (false sh.)     34.18 M-incr/s   wall=2.926s
#        padded (no sh.)    188.04 M-incr/s   wall=0.532s

Walk-through. struct adjacent packs a and b into 16 contiguous bytes — both fit on the same 64-byte cache line. struct padded inserts 56 bytes between them, forcing them onto separate lines. The two threads each modify only their own counter — there is no shared variable in any meaningful sense — but the adjacent version is 5.5x slower. The reason is the protocol: when thread A writes a, the line goes into M on A's core, invalidating B's copy; when B writes b (a different variable!), the line must be fetched back, invalidating A's copy. Every write triggers a coherence round trip. The protocol cannot tell that a and b are independent because the protocol does not see variables — it sees lines.

The fix is padding: align hot per-thread data to 64-byte boundaries (or 128-byte on systems with adjacent-line prefetch, like Intel's Sandy Bridge and later — the prefetcher fetches both lines as a pair, so two structs on consecutive lines can still false-share). C uses alignas(64) and explicit padding bytes; Rust uses #[repr(align(64))]; Java's @Contended annotation adds 128 bytes of padding (off by default; enable with -XX:-RestrictContended); Go uses cache-line-padded structs in the runtime's internal scheduler. The cost is wasted memory — typically 56 bytes per hot field on a 64-byte line — but the speedup is routinely 5-10x on multithreaded workloads, and the pattern is so well-established that high-performance Indian fintechs and exchanges treat per-thread struct padding as a code-review checklist item rather than an optimisation.

A real-world example: Zerodha Kite's order-matching engine had a per-symbol counter struct where the symbol's bid-side and ask-side trade counters lived adjacently in memory. At 10:00 IST market open the order rate hit 200k/s across a small set of high-volume symbols (Reliance, HDFC Bank, Infosys), and the two sides of the book were updated by different threads. False sharing on those structs imposed a 4x latency penalty on the per-symbol update path that did not show up at all in load tests against synthetic uniform-distribution data. The fix — padding the bid and ask counters to separate cache lines — dropped p99 update latency from 480 µs to 110 µs at the same offered load, and the extra 56 bytes per symbol cost a negligible amount of memory across the ~5000 listed symbols.

What kills coherence — and what does not

Coherence cost is not "anything that touches shared memory". A read-only line shared across cores costs zero — every core has a copy in S, and reads are L1-fast on all of them. A line written by one core and never read by anyone else costs zero — the line stays in M on that core, never leaves, never gets invalidated. The cost arrives specifically from shared writes to the same line at meaningful frequency. The diagnostic is:

Pattern                                        | Coherence cost                | Why
One thread writes a line, others ignore it     | None                          | Line stays in M on the writer; no traffic
All threads read a line, none writes           | None                          | Line is in S on all cores; reads are local
One thread writes, all threads read            | High (per write)              | Each write invalidates all readers' copies, which then re-fetch
All threads write the same line                | Catastrophic (USL retrograde) | Line bounces M→I→M between every pair of cores
Threads write different lines                  | None                          | No protocol traffic for unrelated lines
Threads write different variables in same line | High (false sharing)          | Protocol sees the line, not the variables

The third row is the read-mostly with one writer pattern, and it is what breaks naively-parallel code that "just publishes a status flag". A monitoring thread in a Go service that updates a current_state field every 100 ms, read by 32 worker threads that check the field every iteration of their hot loop, will quietly impose a 200-400 ns penalty on every loop iteration of every worker due to the constant invalidate-and-refetch cycle. The monitoring thread does almost nothing; it kneecaps the workers. The fix is either to read-replicate the state (per-core copy, periodically refreshed) or to read it less often (cache locally and check every 10ms instead of every iteration). Hotstar's IPL-final dispatcher went through this in 2024 — a "current viewer count" gauge updated every 50ms by the metrics thread was being read by the per-connection event loop on every event, and the resulting coherence traffic was capping per-connection throughput at 60% of what the same code did on a single-core sandbox.
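
The "read it less often" fix is structural, so a sketch in plain Python shows its shape even though CPython's GIL masks the hardware effect. All names below are illustrative, not taken from any of the services mentioned; the coherence payoff applies when the same structure is written in Go, Java, Rust, or C.

# amortised_read.py — structural sketch of the "read it less often" fix.
# Illustrative names; shows the shape of the fix, not a measurement of it.
import itertools, threading, time

state = {"viewers": 0}                       # written by one monitor thread

def monitor(stop):
    while not stop.is_set():
        state["viewers"] += 1                # the periodic status update
        time.sleep(0.05)

def worker(stop, refresh_every=10_000):
    local = state["viewers"]                 # thread-local copy: free to read
    for i in itertools.count():
        if stop.is_set():
            return
        if i % refresh_every == 0:           # touch the shared line rarely,
            local = state["viewers"]         # bounding invalidate traffic
        _ = local                            # hot-loop work consults the copy

stop = threading.Event()
threads = [threading.Thread(target=monitor, args=(stop,))]
threads += [threading.Thread(target=worker, args=(stop,)) for _ in range(4)]
for t in threads: t.start()
time.sleep(0.5); stop.set()
for t in threads: t.join()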

The fifth row is the case the textbooks emphasise — independent threads writing independent data — and it is genuinely the cheap case. The protocol does nothing because the lines don't intersect. This is what makes embarrassingly-parallel workloads (Monte Carlo, image filtering, parameter sweeps) scale linearly: every worker writes its own arena of memory, the lines are disjoint, no coherence traffic is generated. The architectural pattern that saves you is per-thread state with explicit reduction at the end — every thread has its own counter, its own buffer, its own sub-result, and the merge step happens once at the end (and uses tree-reduction if N is large; see /wiki/the-serial-fraction-problem).
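
Applied to the opening harness, per-thread state with a final reduction looks like the sketch below: one 64-byte slot per worker, summed once at the end. It reuses the bump() kernel that coherence_ceiling.py builds, so run that script first to produce /tmp/bump.so.

# sharded_counter.py — per-worker padded counters with a final reduction.
# Reuses /tmp/bump.so from coherence_ceiling.py; run that script first.
import ctypes, multiprocessing as mp, time

STRIDE = 8                                        # 8 x u64 = 64 B = one cache line

def worker(addr, iters):
    lib = ctypes.CDLL("/tmp/bump.so")
    lib.bump.argtypes = [ctypes.c_void_p, ctypes.c_uint64]
    lib.bump(ctypes.c_void_p(addr), iters)

def measure(N, iters=20_000_000):
    slots = mp.RawArray(ctypes.c_uint64, N * STRIDE)   # one line per worker
    base = ctypes.addressof(slots)                     # page-aligned, so 64-aligned
    procs = [mp.Process(target=worker, args=(base + i * STRIDE * 8, iters))
             for i in range(N)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    dt = time.perf_counter() - t0
    total = sum(slots[i * STRIDE] for i in range(N))   # the one-time reduction
    assert total == N * iters
    return total / dt

if __name__ == "__main__":
    for N in (1, 2, 4, 8, 16):
        if N > mp.cpu_count(): break
        print(f"N={N:2d}  {measure(N)/1e6:8.1f} M-incr/s")   # expect near-linear scaling

Same kernel, same instruction stream; the only change is that no two workers ever write the same line, so the coherence term vanishes and throughput scales near-linearly until some other ceiling (memory bandwidth, the scheduler) takes over.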

A subtler pattern hides in the sixth row: false sharing applies to any two variables on the same line, including pointer-chasing data structures. A linked-list node where the next pointer and the per-node lock live in the same line will false-share between the thread traversing the list (reading next) and any thread holding the lock (writing it). The fix is to lay out hot fields on separate lines — read-mostly fields together, write-mostly fields separately, with padding between groups. This is the technique behind disruptor patterns (LMAX) and the Linux kernel's per-CPU ringbuffers — careful struct layout that respects the cache-line boundary turns multi-core code from coherence-bound into compute-bound.
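
A minimal sketch of that layout discipline, written as a ctypes mirror of the C struct it would describe (the field names are illustrative, not from LMAX or the kernel):

# layout_groups.py — read-mostly and write-hot fields on separate lines.
# Field names are illustrative, not taken from any cited codebase.
import ctypes

LINE = 64

class Node(ctypes.Structure):
    _fields_ = [
        ("next",  ctypes.c_void_p),              # read-mostly: traversal fields together
        ("key",   ctypes.c_uint64),
        ("_pad0", ctypes.c_char * (LINE - 16)),  # fill out the first line
        ("lock",  ctypes.c_uint64),              # write-hot: alone on its own line
        ("_pad1", ctypes.c_char * (LINE - 8)),
    ]

assert Node.lock.offset % LINE == 0              # lock starts a fresh line
assert Node.next.offset // LINE != Node.lock.offset // LINE
print(f"sizeof(Node) = {ctypes.sizeof(Node)} bytes")   # 128: two full lines

The traversing threads keep the first line in S indefinitely; lock writers bounce only the second line, and neither group invalidates the other.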

Common confusions

Going deeper

MOESI, MESIF, and the directory protocols on big servers

The textbook MESI protocol is what's documented in undergraduate architecture courses; production CPUs use refinements. AMD's MOESI adds an Owned state that lets a single core own a dirty line that is also shared by readers — without writing back to memory immediately. This avoids unnecessary writebacks when one writer's dirty data is being consumed by many readers (a producer-consumer pattern). Intel's MESIF adds a Forward state that designates exactly one core as the line's responder for reads, avoiding the "every core that has the line in S responds at once" storm. Both are optimisations of the same fundamental protocol — the cost structure is the same, just with smaller constant factors.

On many-socket systems (8+ sockets, classic HPC machines or some Oracle Exadata configurations), broadcast coherence does not scale — the message volume grows quadratically with cores. These systems use directory-based coherence: each line has a designated home node that maintains a list of cores currently holding the line, and traffic is point-to-point rather than broadcast. The latency is higher per message (an extra round trip via the home node) but the bandwidth is bounded. Directory protocols are why AMD's Genoa-X (96 cores per socket, 192 across two) can sustain coherence at scales that broadcast MESI would melt at. For the typical Indian-startup workload running on 16-64 vCPU cloud instances, the protocol is broadcast-flavoured (snoop) and the analysis here applies directly. If you are running on 192-core bare metal in IIT Bombay's data centre, you are on directory coherence and the constants are different — but the qualitative shape (knee, retrograde region) is the same.

Measuring coherence with perf c2c — the only honest diagnostic

perf c2c is the cache-to-cache analyser introduced in Linux 4.10. It samples memory accesses tagged with HITM (hit-modified-elsewhere) events and aggregates them by cache line. The output identifies the exact 64-byte line responsible for the worst coherence traffic, the source files and line numbers writing to that line, and the breakdown by reader/writer core. Brendan Gregg's recipe is the canonical workflow:

# 1. Record
perf c2c record -F 60000 -a -- sleep 30        # 30s, system-wide

# 2. Report — pareto sort by HITM count
perf c2c report --stdio --full-symbols | less

# 3. Walk the top lines, fix each

The typical finding in production is one or two lines responsible for >50% of HITM events — usually a hot counter, a stat array, or a flag in a frequently-accessed struct. Fixing those lines (padding, per-thread state, batched updates) typically yields 2-10x throughput improvements on the affected workload. Why perf c2c is the right tool: a sampling profiler shows you where time is spent, not where traffic is generated; a perf stat cache-miss count shows you total traffic but not the responsible line. perf c2c is the only Linux tool that joins both — it samples HITM events specifically and traces them back to source. On a workload that scales weirdly past N=8, running perf c2c for 30 seconds will, in 80% of cases, name the offending line within a minute.

Atomic operation cost classes — what each instruction actually does

Not all atomics cost the same. On x86-64, the cost ladder is roughly:

1. A plain load or store to a line already in M or E: ~1 ns, no interconnect traffic.
2. lock xadd / lock cmpxchg on a line this core already owns (M or E): ~5-10 ns; the lock prefix serialises the local pipeline but sends no coherence messages.
3. The same instruction on a line in S: a read-for-ownership round trip, roughly 40-90 ns within a socket.
4. The same instruction on a line currently in M on another core: a forward plus invalidates, ~75-300 ns depending on socket distance.

ARM is different: weak memory ordering means atomics use load-linked/store-conditional (LDXR/STXR), which can fail spuriously and retry; coherence cost is similar but the retry overhead is more variable. ARMv8.1's LSE atomics (LDADD, SWP, CAS) are single-instruction primitives that scale better than LL/SC under contention — Graviton 3 (c7g) is roughly 2x better at contended atomics than Graviton 2 because of LSE. For Indian cloud workloads on Graviton, this is a real architectural difference: code that is coherence-bound on Graviton 2 may be CPU-bound on Graviton 3 simply because the atomics are cheaper per op.

The implication for engineering: the cost of an atomic operation is not a number; it is a distribution that depends on the line's current state. Microbenchmarks that measure atomics on uncontended lines are measuring the cheap end of the distribution. Production code on shared lines is in the expensive end. The 30x cost difference between the two is not a constant the developer can ignore — it is the entire reason concurrent data structures are a separate field of study.

The cost of measuring coherence — and when it lies

perf c2c itself imposes overhead — typically 3-7% on the workload, larger on coherence-heavy code where the sampler triggers more often. The sampling rate is configurable (-F 60000 is 60 kHz, the default for c2c); lower it to 10 kHz for production sampling at sub-1% overhead, accepting that the report is noisier. A more subtle failure mode: perf c2c reports HITM events, which are one specific kind of coherence traffic (a load that hits a line currently modified by another core). Pure invalidate traffic (no load, just stores and the resulting Invalidate broadcasts) does not produce HITM events and may be undercounted. For pure-write workloads (a stats-update path, an event-counter ring), supplement perf c2c with perf stat -e mem_inst_retired.all_stores,offcore_response.demand_rfo.l3_miss.local_dram to count RFO traffic directly.

Coherence ceilings in production — the canonical Indian war stories

Razorpay's idempotency-key cache (2024). A ConcurrentHashMap<String, IdempotencyRecord> in the payment-API hot path was being updated by every request thread under load. The bucket array was 8KB (128 64-byte lines), and at 50k req/s peak across 32 worker threads, the false-sharing rate on the most-hit bucket reached 14M HITM/sec. p99 latency on the cache-write path was 1.8 ms — the cache supposed to take microseconds was the bottleneck. Fix: switched to a striped lock with 256 stripes, each stripe's data on its own line. p99 dropped to 80 µs, and the request rate ceiling lifted from 32k req/s to 95k req/s on the same fleet.

Flipkart's catalogue fast-path (2023, BBD prep). The product-detail page handler had a shared "request count" gauge — a single AtomicLong incremented by every request and read every 100ms by the metrics export. At 18k req/s the increment cost ~250 ns of coherence per request, capping single-pod throughput at 4k req/s instead of the 12k the JIT-compiled handler should have done. Fix: per-thread counters with periodic flush to a shared aggregate. Same code, same hardware, 3x throughput.

Aadhaar/UIDAI auth pipeline (2023). The biometric-match worker pool used a shared BloomFilter to deduplicate in-flight requests. The Bloom filter's bit array (512 KB, all-shared-write) at 30k auth/s was the dominant coherence cost in the pipeline — the dedup check was supposed to be a fast accelerator and was instead 40% of CPU on the worker fleet. Fix: per-shard Bloom filters keyed by request hash. Coherence traffic vanished, and the dedup check returned to its expected sub-microsecond cost.

The pattern across all three: a shared mutable structure on the hot path, written by all worker threads, becomes the scaling ceiling at modest N. The fix is structural — partition or replicate — and the cost is a few engineer-days; the cost of not fixing it is overprovisioned fleets sized to the cliff rather than to the algorithm.

Reproducibility footer

Both harnesses run on any Linux box with gcc and Python 3.11+. The coherence ceiling is visible from N=4 onward; full retrograde scaling needs N >= physical cores, so use a 16+ vCPU instance for the sharpest curve.

# Reproduce this on your laptop, ~30 s for both
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
python3 coherence_ceiling.py 1 2 4 8 16
python3 false_sharing.py
# To see HITM events, run with perf:
sudo perf c2c record -F 10000 -- python3 coherence_ceiling.py 16
sudo perf c2c report --stdio | head -30

Where this leads next

The next chapter (/wiki/memory-bandwidth-as-the-real-ceiling) covers the second physical ceiling that masquerades as Amdahl's alpha: DRAM bandwidth saturation. Even with perfect cache locality and zero coherence traffic, a workload that streams more data than the memory controllers can deliver hits a hard ceiling unrelated to core count — and the fit looks identical to a coherence ceiling unless you measure both.

Chapter 66 covers the heterogeneous-hardware case — performance cores plus efficiency cores plus accelerators — where coherence operates across asymmetric domains and the cost ladder gets a third rung.

The recurring pattern across Part 9: every "scaling does not work past N=X" diagnosis names a different bottleneck (serial fraction, coherence, memory bandwidth, heterogeneous topology), but they all share the same workflow — fit the curve, identify the dominant term, classify the bottleneck physically, then fix it architecturally. Adding more cores is the engineering response to none of these problems; finding which physical resource is saturated and changing the workload to use it less is the response to all of them.

The two operational habits this chapter adds to the Part 9 toolkit: run perf c2c on every "weird scaling" investigation before doing anything else — it answers in 30 seconds whether the bottleneck is coherence, and that answer changes the entire fix-strategy. And lay out hot per-thread structs on separate cache lines from day one — the cost is 56 wasted bytes per struct, which is nothing; the cost of not doing it is a 5-10x penalty that takes a week of perf c2c debugging to find. A code-review checklist item that says "is any hot per-thread field within 64 bytes of another thread's hot field?" prevents 80% of the coherence regressions Indian fintechs and exchanges have shipped in the last five years.
