False sharing — the silent killer
Riya is benchmarking a per-thread request counter at Zerodha Kite. The order-matching engine runs eight worker threads pinned to eight physical cores; each thread increments its own uint64_t in an array counters[8]. There are no locks. There is no atomic. The counters are read once per second by a metrics thread, never under contention. On a single thread, the loop runs at 6.2 ns per increment. On eight threads, she expects the same — 6.2 ns per increment, with throughput scaling 8x. Instead, throughput scales 1.4x. The eight-thread number is 36 ns per increment, six times slower than single-threaded. perf stat shows 480 million mem_load_l3_miss_retired.remote_hitm events per second — cache lines being yanked between cores in Modified state hundreds of millions of times per second. There is no shared variable in Riya's code. There is, in the silicon, a shared cache line. The eight uint64_t counters fit perfectly in one 64-byte line, and every core is fighting every other core for ownership of that one line.
False sharing happens when two threads write to different variables that happen to live on the same cache line. The cache-coherence protocol cannot distinguish "different variables" from "same variable" — it tracks lines, not bytes — so it bounces the line between cores on every write. Your code looks lock-free; the silicon enforces an implicit lock on the entire 64-byte line. The fix is padding: align hot per-thread state to 64-byte boundaries. The diagnosis is a single perf counter: mem_load_l3_miss_retired.remote_hitm.
Why "different variables" don't exist at the cache level
A modern CPU does not move bytes. It moves cache lines. Every load and store on x86-64 (and on ARM Neoverse, Apple silicon, IBM POWER, AMD EPYC — every server-class CPU you will meet in production) operates on a 64-byte unit, and the cache-coherence protocol that keeps L1 caches consistent across cores tracks ownership at exactly that 64-byte granularity. When core 0 writes to byte 7 of a line, the line transitions to Modified state in core 0's L1 and is invalidated in every other core's L1. When core 1 then writes to byte 23 of the same line — a completely different variable, a completely different word — the protocol still has to:
- Send a Read-For-Ownership (RFO) message from core 1 to the rest of the die
- Find the line in core 0's L1 (where it sits in Modified state)
- Snoop it out of core 0's L1 (transitioning core 0's copy to Invalid)
- Forward the data to core 1's L1, where it transitions to Modified
- Only then does the actual write happen
This round-trip costs 30–80 ns on Intel Sapphire Rapids, 25–60 ns on AMD EPYC Genoa, 8–15 ns on Apple M-series silicon (where the snoop fabric is faster). It is paid on every write to the line, by every core, regardless of which byte within the line each core is touching. The protocol does not know — does not have the bits to know — that core 0 cares about bytes 0–7 and core 1 cares about bytes 8–15. It tracks the whole line.
Why the protocol cannot just track bytes: a per-byte coherence directory would need 64× the tracking state of a per-line one — coherence bits for every byte of every core's L1d (48 KB × 64 cores on a Sapphire Rapids die), plus byte-wide compare logic on every snoop. The transistor budget for that does not exist. Granularity at 64 bytes — already 8x larger than a single x86 register write — is a transistor-budget compromise, not a correctness one. Programs that want byte-level independence must encode it spatially, by leaving 64-byte gaps between independent variables.
This is false sharing: two threads writing to logically-independent variables that share a cache line. There is no race in your code; you do not need a lock; the variables are conceptually private. The silicon, however, treats the entire line as a unit and serialises every write through the coherence fabric. The throughput of a "lock-free" parallel section silently collapses to roughly the rate of cache-to-cache transfer — typically 25–80 ns per increment, regardless of how trivial the increment itself is. From the developer's perspective, adding more cores makes things slower. From the silicon's perspective, every core is asking for the same line in Modified state and only one can have it at a time.
The L1 cache on every modern x86 server core (Skylake-X through Sapphire Rapids, Zen 1 through Zen 4) is 32–48 KB with 64-byte lines and 8-way associativity. ARM Neoverse N1, V1, V2 use the same 64-byte line size. Apple M1/M2/M3 use 128-byte lines on the performance cores, which makes false sharing worse, not better — a single line covers 16 uint64_t values, so a 16-element per-thread array all aliases. The line size matters; the existence of false sharing as a phenomenon does not depend on it.
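The aliasing is pure address arithmetic: line index = byte address // line size. A quick Python sketch of which array elements share a line, assuming 64-byte lines as on x86 and ARM Neoverse (the buffer's starting alignment is whatever the allocator happens to give):

```python
# Map array elements to 64-byte cache-line indices: line = address // 64.
import numpy as np

LINE = 64
counters = np.zeros(8, dtype=np.uint64)   # 8 x 8 bytes = 64 bytes total
base = counters.ctypes.data

# Packed: all eight 8-byte counters land on at most two adjacent lines
# (exactly one if the allocator happened to 64-byte-align the buffer).
packed_lines = {(base + 8 * i) // LINE for i in range(8)}
assert len(packed_lines) <= 2

# Strided by 64 bytes: consecutive slots differ by a full line, so the
# eight line indices are always distinct regardless of base alignment.
padded = np.zeros(8 * 8, dtype=np.uint64)
pbase = padded.ctypes.data
padded_lines = {(pbase + 64 * i) // LINE for i in range(8)}
assert len(padded_lines) == 8
```

Note that the packed case cannot be fixed by alignment alone: a 64-byte-aligned 8-element array is the worst case, with all eight counters on exactly one line.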
Watching the line bounce — measuring false sharing in code
The cleanest way to see false sharing is to compare two implementations of the same parallel counter: one where the per-thread counters are packed into adjacent slots of an array (false sharing), one where each counter is padded to its own 64-byte cache line (no sharing). The arithmetic is identical; the memory layout is the only difference; the throughput differs by 5–20×.
# false_sharing.py
# Compares two layouts of per-thread counters:
#   packed: 8 uint64 in one np.array — they share cache lines
#   padded: each counter on its own 64-byte boundary
# The work per thread is identical; only the memory layout differs.
import threading
import time

import numpy as np

N_THREADS = 8
ITERS_PER_THREAD = 50_000_000

# --- packed layout: 8 uint64 in a contiguous numpy array ---
# Adjacent counters live in the same 64-byte line.
def run_packed():
    counters = np.zeros(N_THREADS, dtype=np.uint64)
    def worker(idx: int):
        # A one-element view writes straight into the shared buffer,
        # so every increment touches the same backing array.
        view = counters[idx:idx+1]
        for _ in range(ITERS_PER_THREAD):
            view[0] += 1
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
    t0 = time.perf_counter_ns()
    for t in threads: t.start()
    for t in threads: t.join()
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    total = int(counters.sum())
    return elapsed, total

# --- padded layout: each thread gets its own 64-byte line ---
# Layout: [counter, 7 unused uint64] x N_THREADS — 64 bytes per slot.
def run_padded():
    counters = np.zeros(N_THREADS * 8, dtype=np.uint64)  # 8 * 8 = 64 bytes per slot
    def worker(idx: int):
        view = counters[idx*8 : idx*8 + 1]
        for _ in range(ITERS_PER_THREAD):
            view[0] += 1
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
    t0 = time.perf_counter_ns()
    for t in threads: t.start()
    for t in threads: t.join()
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    total = int(counters[::8].sum())
    return elapsed, total

if __name__ == "__main__":
    # Note: CPython has a GIL that serialises bytecode, so this pure-Python
    # loop is a mostly-sequential demo; treat the measured ns/op as a
    # relative indicator only. For a real demo the loop must run in native
    # code, one thread per core — use the ctypes harness in false_sharing.c
    # (see the §Reproduce footer).
    for label, fn in [("packed (false sharing)", run_packed),
                      ("padded (per-line)", run_padded)]:
        elapsed, total = fn()
        ns_per_inc = elapsed * 1e9 / (N_THREADS * ITERS_PER_THREAD)
        print(f"{label:>26}: {elapsed:5.2f} s, {ns_per_inc:6.1f} ns/inc, total={total}")
The pure-Python version above is a teaching skeleton — CPython's GIL serialises the bytecode and partially hides the effect. To see false sharing in its full clarity, the loop has to run in native code, one thread per core, with no GIL. The companion C harness drops to a ctypes-loaded shared object so the threads run in parallel:
// false_sharing.c — compile: gcc -O2 -shared -fPIC -pthread -o libfs.so false_sharing.c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t v; } packed_t;                // 8 bytes — 8 fit per line
typedef struct { uint64_t v; char pad[56]; } padded_t;  // 64 bytes — one per line

static void *bump(void *arg) {
    volatile uint64_t *p = (uint64_t *)arg;
    for (uint64_t i = 0; i < 50000000ull; i++) (*p)++;
    return NULL;
}

uint64_t run_packed(int n) {
    packed_t *cs = aligned_alloc(64, sizeof(packed_t) * n);
    pthread_t ts[64];
    for (int i = 0; i < n; i++) cs[i].v = 0;
    for (int i = 0; i < n; i++) pthread_create(&ts[i], NULL, bump, &cs[i].v);
    for (int i = 0; i < n; i++) pthread_join(ts[i], NULL);
    uint64_t total = 0;
    for (int i = 0; i < n; i++) total += cs[i].v;
    free(cs);
    return total;
}

uint64_t run_padded(int n) {
    padded_t *cs = aligned_alloc(64, sizeof(padded_t) * n);
    pthread_t ts[64];
    for (int i = 0; i < n; i++) cs[i].v = 0;
    for (int i = 0; i < n; i++) pthread_create(&ts[i], NULL, bump, &cs[i].v);
    for (int i = 0; i < n; i++) pthread_join(ts[i], NULL);
    uint64_t total = 0;
    for (int i = 0; i < n; i++) total += cs[i].v;
    free(cs);
    return total;
}
# false_sharing_native.py — Python driver for the C harness
import ctypes
import time

lib = ctypes.CDLL("./libfs.so")
lib.run_packed.restype = lib.run_padded.restype = ctypes.c_uint64
lib.run_packed.argtypes = lib.run_padded.argtypes = [ctypes.c_int]

for label, fn in [("packed (false sharing)", lib.run_packed),
                  ("padded (per-line)", lib.run_padded)]:
    t0 = time.perf_counter_ns()
    total = fn(8)  # 8 threads
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    ns_per_inc = elapsed * 1e9 / (8 * 50_000_000)
    print(f"{label:>26}: {elapsed:5.2f} s, {ns_per_inc:6.1f} ns/inc")
Sample run on a c6i.4xlarge (Ice Lake, 16 vCPUs on 8 physical cores, 64-byte lines):
packed (false sharing): 3.78 s, 9.4 ns/inc
padded (per-line): 0.41 s, 1.0 ns/inc
The padded version runs 9.2× faster for arithmetic that is byte-for-byte identical. The 9.4 ns/inc figure for the packed version isn't the cost of an add instruction (which takes 1 cycle, ~0.3 ns) — it is the cost of a coherence transaction across 8 cores. Why the packed version takes ~9 ns instead of the ~50 ns naively predicted by snoop latency: at any moment one core holds the line in M, six cores are queued at the snoop fabric waiting for it, and one core just lost it. The 9 ns is the steady-state amortised cost when all 8 cores are continuously contending — the line spends almost all its time in transit. The exact constant depends on how aggressively the CPU's store buffers can pipeline writes, but 5–15 ns/inc on an 8-core write-only workload is the typical range.
The diagnostic is a single perf counter:
perf stat -e mem_load_l3_miss_retired.remote_hitm,offcore_response.demand_rfo.l3_miss.remote_hitm \
python3 false_sharing_native.py
mem_load_l3_miss_retired.remote_hitm counts loads that missed L3 and were satisfied from another core's cache, where the line was held modified — the silicon's signature of cache-line ping-ponging. On the packed version this counter reads ~480M/s; on the padded version it reads ~0.1M/s. Very few other workload patterns produce those numbers in that ratio. If you see remote_hitm events even moderately above zero on a workload that should be lock-free, the answer is false sharing 90% of the time.
That throughput shape is the tell. Any time you find a parallel workload whose throughput-vs-cores curve flat-lines below the linear-scaling line and the flat-line value matches "one core's worth of work, regardless of N", false sharing is the first hypothesis to test. It does not always turn out to be the answer (lock contention, atomic operations, NUMA traffic, and memory bandwidth saturation produce similar shapes), but mem_load_l3_miss_retired.remote_hitm differentiates them in one command.
How false sharing shows up at scale — Hotstar's IPL ad-counter incident
In April 2024, during a Mumbai Indians vs Chennai Super Kings match on Hotstar, the ad-impression-counter service started returning p99 = 380 ms instead of its usual 12 ms. The service runs as 64 Go workers per pod, each maintaining a per-worker [64]int64 of impression counts indexed by ad-creative-id. The counts are flushed to Kafka every 200 ms by a separate goroutine. There are no locks. The int64 increments are plain, non-atomic writes; the flush goroutine tolerates marginally stale counts, so nothing on the hot path needs to synchronise. On staging (1 worker, 1 core) the service handles 250k impressions/sec/pod. On production with 64 workers, throughput should scale to ~16M impressions/sec/pod. It scaled to 1.8M.
The flamegraph showed nothing useful — the workers were all in the bump function, but no symbol stood out. CPU utilisation was 94% across all 64 cores. perf stat -e mem_load_l3_miss_retired.remote_hitm immediately showed the answer: 11 billion remote-HITM events per second across the box. The [64]int64 is 512 bytes — exactly 8 cache lines of 64 bytes each. Sixty-four workers, eight workers per cache line on average, each pair fighting every other pair for line ownership. The flush goroutine, which read the array every 200 ms, made things slightly worse by promoting the lines to Shared and then back to Modified when the next worker write arrived.
The fix was 8 lines of Go: change [64]int64 to [64]struct{ v int64; _ [56]byte }. Each counter is now in its own 64-byte cache line. The struct is 64 bytes, so 64 of them is 4096 bytes — eight times more memory than before, completely insignificant compared to the 12 GB heap the service uses. Throughput jumped from 1.8M to 14.6M impressions/sec/pod after deployment; p99 latency dropped from 380 ms to 9 ms. The Kafka flush rate stayed identical because nothing about the workload had changed, only the memory layout.
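The arithmetic of that fix can be sanity-checked from Python with ctypes structs that mirror the Go layouts; this is a sketch of the sizes, not Hotstar's actual code:

```python
# Verify the layout sizes quoted above with equivalent C-style structs.
import ctypes

class PackedCounter(ctypes.Structure):   # mirrors a bare int64
    _fields_ = [("v", ctypes.c_int64)]

class PaddedCounter(ctypes.Structure):   # mirrors struct{ v int64; _ [56]byte }
    _fields_ = [("v", ctypes.c_int64),
                ("_pad", ctypes.c_char * 56)]

assert ctypes.sizeof(PackedCounter * 64) == 512    # 8 counters per 64-byte line
assert ctypes.sizeof(PaddedCounter) == 64          # one counter per line
assert ctypes.sizeof(PaddedCounter * 64) == 4096   # 8x the memory, zero sharing
```

A check like this belongs in a unit test: a later refactor that adds a field to the padded struct and silently pushes it past 64 bytes reintroduces the bug.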
This is the canonical shape of false-sharing incidents in Indian-scale systems: a parallel data structure (counters, queues, slabs, per-thread allocator caches) was laid out for memory efficiency without thinking about cache-line boundaries; under low load there is no contention so the layout works fine; under peak load every line is contended and the line-bouncing tax kicks in. CRED's per-shard rate-limiter, Swiggy's per-region order-counter array, Dream11's per-T20-match score buffer — all four have hit this shape. The pattern repeats because the language idioms encourage tight packing (Go's struct fields are word-aligned, not line-aligned; Java's class layout is JVM-managed; C arrays are densely packed by default) and the cost is invisible in single-threaded benchmarks.
The detection process is operationally cheap once you know to look. A reasonable practice is to wire mem_load_l3_miss_retired.remote_hitm into your service's runtime metrics — Andi Kleen's pmu-tools can sample it continuously — and alert when it exceeds, say, 5% of total memory references. False-sharing bugs do not appear from nowhere; they appear when an existing struct gets a new field, or a new worker count is added, or the access pattern changes shape. Continuous monitoring catches them on the deploy that introduces them, not weeks later under peak load.
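The alert reduces to one ratio of two counters. A trivial sketch of the check; the 5% threshold comes from the text, and the sample counter values are invented for illustration:

```python
# Alert when remote-HITM events exceed some fraction of retired loads.
def hitm_ratio(remote_hitm: int, retired_loads: int) -> float:
    """Fraction of loads that hit a modified line in another core's cache."""
    return remote_hitm / retired_loads if retired_loads else 0.0

THRESHOLD = 0.05  # the 5%-of-references rule of thumb from the text

# Hypothetical counter values scraped over a 1-second window:
assert hitm_ratio(480_000_000, 4_000_000_000) > THRESHOLD   # packed: alert
assert hitm_ratio(100_000, 4_000_000_000) < THRESHOLD       # padded: healthy
```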
Padding strategies and the trade-offs
Padding to one cache line per hot variable is the default fix. But it is not the only fix, and it has costs.
Manual padding with a _ [56]byte field (or char pad[56]; in C, uint64_t _pad[7]; in Rust) is the most explicit. The reader sees the padding in the struct definition; the intent is documented; the layout is predictable. Cost: 8× memory for a single counter, ~$2/year per pod in cloud rupees, irrelevant. This is the right default for known-hot fields like rate-limiter buckets, per-thread allocator stats, and per-shard counters.
Standard-library annotations package the same idea. C++17 added std::hardware_destructive_interference_size, a compile-time constant that reads as 64 on x86 and 128 on Apple silicon; you align with alignas(std::hardware_destructive_interference_size). Rust offers #[repr(align(64))] on structs. Java has @Contended (since JDK 8, hidden behind -XX:-RestrictContended); the JVM inserts the padding for you. Go does not have a stdlib idiom — you write the byte-padding by hand, which is exactly what the Hotstar fix does.
Splitting the array into per-thread slices sidesteps explicit padding. Instead of [64]int64 shared across goroutines, give each goroutine its own counter in a separately heap-allocated, 64-byte-sized object; Go rounds heap allocations up to size classes, so a 64-byte allocation never shares a cache line with another allocation (a bare 8-byte int64 on the heap can, because small size classes pack several objects per line). The memory cost is identical; the layout is more idiomatic. The disadvantage is reduced locality for the rare scan-everything operation, which now has to chase 64 pointers instead of striding through one array.
Hot/cold splitting is the right approach when you have a struct with both contended-write fields and rarely-touched fields. Put the hot fields on their own cache line; let the cold fields share lines freely. A RateLimiter struct with tokens: int64 (written every request) and name: string, created: time.Time, tags: []string (read once per minute) should have tokens in its own line and the rest packed. This is the layout most performance-sensitive Java and C++ libraries adopt: ConcurrentHashMap's bucket headers, jemalloc's per-thread arena descriptors, the Linux kernel's per-cpu variables.
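The hot/cold split can be expressed directly in the struct layout. A ctypes sketch of the RateLimiter shape described above (field names and widths are illustrative):

```python
# Hot field on its own cache line; cold fields packed together after it.
import ctypes

class RateLimiter(ctypes.Structure):
    _fields_ = [
        ("tokens", ctypes.c_int64),        # hot: written on every request
        ("_pad", ctypes.c_char * 56),      # finish out the hot line
        ("created_unix", ctypes.c_int64),  # cold: read ~once a minute
        ("name", ctypes.c_char * 48),      # cold: free to share lines
    ]

assert RateLimiter.tokens.offset == 0
# Cold data starts exactly one 64-byte line after the hot counter, so a
# metrics scan of the cold fields never invalidates the writers' line.
assert RateLimiter.created_unix.offset == 64
```

Putting the hot field first matters: if cold fields preceded it, the padding arithmetic would have to account for their sizes too.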
The cost of over-padding is real but small. A 64-byte counter that should have been 8 bytes uses 8× the L1 cache space, so a working set of 1024 padded counters consumes 64 KB — twice an L1d. Cache pressure on hot paths matters; if the only access is a single increment, the padded version still benefits because the alternative (false sharing) is far worse. If the access is a scan of all 1024 counters to compute a sum, the unpadded version is faster because everything fits in L1d. The decision rule: if writes outnumber scans by ≥10x, pad. If scans outnumber writes, do not.
# pad_or_not.py — a quick decision helper for "should I pad this struct?"
# Inputs: number of threads, fraction of accesses that are writes, line size.
# Outputs: estimated speedup from padding under steady-state contention.

def predict_speedup(n_threads: int, write_fraction: float,
                    snoop_ns: float = 50, work_ns: float = 1.0) -> float:
    """Return predicted (no-pad-time / pad-time) ratio."""
    # Padded: every access takes work_ns (purely local L1).
    pad_ns = work_ns
    # Unpadded write: forces RFO, ~snoop_ns latency.
    # Unpadded read: line is in M elsewhere, becomes S — also a snoop.
    unpad_write_ns = snoop_ns
    unpad_read_ns = snoop_ns * 0.5  # roughly half the cost
    unpad_avg = (write_fraction * unpad_write_ns +
                 (1 - write_fraction) * unpad_read_ns)
    # Under N-thread contention the slowest core sets the rate.
    contention_factor = 1 + (n_threads - 1) * 0.4
    return (unpad_avg * contention_factor) / pad_ns

if __name__ == "__main__":
    print(f"{'threads':>8} | {'write%':>8} | {'speedup':>8}")
    for n in (2, 4, 8, 16):
        for wf in (0.1, 0.5, 1.0):
            s = predict_speedup(n, wf)
            print(f"{n:>8} | {int(wf*100):>7}% | {s:>7.1f}x")
 threads |   write% |  speedup
       2 |      10% |    38.5x
       2 |      50% |    52.5x
       2 |     100% |    70.0x
       4 |      10% |    60.5x
       4 |      50% |    82.5x
       4 |     100% |   110.0x
       8 |      10% |   104.5x
       8 |      50% |   142.5x
       8 |     100% |   190.0x
      16 |      10% |   192.5x
      16 |      50% |   262.5x
      16 |     100% |   350.0x
The model's constants are crude: snoop_ns = 50 is at the high end, and the linear contention factor overstates how badly real hardware queues (the C harness above measures ~9×, not hundreds). But the shape is right — high-write-fraction parallel workloads pay massive false-sharing taxes, and in extreme cases the real-world speedup from padding can exceed 100×. Write-heavy, many-thread workloads need padding; read-heavy, few-thread workloads do not.
Common confusions
- "False sharing is the same as cache contention." Cache contention is when two threads compete for capacity in a shared cache (L2 or L3). False sharing is when two threads' writes invalidate each other's L1 copies of the same line. They look similar in symptoms (degraded scaling) but differ in mechanism: contention is a capacity problem solved by giving each thread its own cache slice; false sharing is a coherence problem solved by spatial separation. The perf counters differ — cache-misses for contention, mem_load_l3_miss_retired.remote_hitm for false sharing.
- "Read-only access cannot cause false sharing." Mostly true, with one caveat. Multiple cores reading the same line all hold it in Shared state simultaneously — no bouncing, no problem. But on x86, a read on a line currently in Modified state on another core forces a snoop and a state transition (M → S on the writer, I → S on the reader); this is a one-time cost. If the line is repeatedly written by some cores and read by others, every read forces a coherence message. Pure read-only sharing is free; mixed read/write sharing costs almost as much as write/write false sharing.
- "Padding to 8 bytes is enough for an int64." No — padding must extend to a full cache line (64 bytes on x86, 128 bytes on Apple silicon). A struct with a single 8-byte field followed by 56 bytes of pad is 64 bytes total; the next struct starts at the next 64-byte boundary; no two structs share a line. Padding to 8 bytes (i.e., not padding at all for an int64) is what the unpadded version does and is what causes the bug.
- "std::atomic<int64> avoids false sharing." No. Atomic operations are atomic with respect to other operations on the same variable, but they are still cache-line-granular. Two std::atomic<int64> variables packed adjacently in memory will cause false sharing exactly as much as two non-atomic ones. The lock prefix on x86 forces a pipeline drain plus a coherence transaction; on a contended cache line, atomic ops are worse than non-atomic ops because the pipeline drain serialises with the coherence wait. Atomic and non-false-sharing are orthogonal properties; you need both, and you achieve the second by padding.
- "My language's stdlib protects me." Some do, some don't. Go's runtime cache-line-pads some of its own internal structures, but plain struct fields are never auto-padded. Java's @Contended annotation is gated behind a JVM flag; the default behaviour does not pad. Rust's std::sync::atomic types are not auto-padded; you need crossbeam::utils::CachePadded. C++ has std::hardware_destructive_interference_size but does not insert it implicitly anywhere. The default in every mainstream language is "no padding". Padding is something you opt into, not something you can assume is happening.
- "False sharing only matters on x86 servers." It exists everywhere server-class silicon exists; only the constants vary. ARM Neoverse N2 has the same 64-byte line size as x86; Apple M3 has 128-byte lines, which makes the problem 2× more likely (twice the addresses alias), even though its faster snoop fabric makes each bounce cheaper. AMD EPYC's CCX-to-CCX transfers are slower than within-CCX transfers, so EPYC servers can suffer worse than equivalent Intel boxes when the contending cores are on different CCXes.
Going deeper
The MESI / MOESI state transitions, in code
The full state machine determines which transitions are cheap and which are catastrophic. On Intel (MESI), every line is in one of four states: Modified (owned, dirty, exclusive), Exclusive (owned, clean, exclusive), Shared (read-only, possibly multiple owners), Invalid (not present). On AMD (MOESI), there is an additional Owned state that allows a dirty line to be shared without writing back to L3 first; this saves bandwidth in some workloads but does not change false-sharing behaviour.
False sharing is the M → I → M cycle: core 0 writes (line goes M), core 1 writes (line goes I on core 0, M on core 1), core 0 writes again (line goes I on core 1, M on core 0), and so on. Each transition involves a snoop and a forward across the on-chip interconnect. On Sapphire Rapids the interconnect is a mesh; the worst-case latency between two cores on opposite corners of the die is ~80 ns. On EPYC Genoa it is the Infinity Fabric, with cross-CCX latencies up to 100+ ns.
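The M → I → M cycle can be written down as a toy state machine. This is a sketch of the transition rules only — no timing, no interconnect, no reads — and the helper and its bookkeeping are invented for illustration:

```python
# Toy model of one cache line across N cores under MESI, write-only.
# E and S never appear in a pure write ping-pong, so only M and I are used.
MODIFIED, INVALID = "M", "I"

def write(states, core):
    """Core writes the line: it must end in M, everyone else in I.
    Returns 1 if the RFO had to snoop a copy out of another core."""
    if states[core] == MODIFIED:
        return 0                      # already owned: a free, local write
    snooped = int(any(s != INVALID for i, s in enumerate(states) if i != core))
    for i in range(len(states)):
        states[i] = INVALID           # the RFO invalidates every other copy
    states[core] = MODIFIED
    return snooped

states = [INVALID, INVALID]           # two cores, line cached nowhere
snoops = sum(write(states, step % 2) for step in range(10))
# The first write finds no other copy; each of the other nine writes must
# yank the line out of the opposite core's M state — the ping-pong.
assert snoops == 9
```

Run the same loop with a single core writing all ten times and the snoop count drops to zero, which is exactly what padding buys: each core's writes stay in the already-owned-M fast path.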
The cleanest tool for inspecting this in real time is perf c2c (cache-to-cache), specifically designed by Joe Mario at Red Hat for diagnosing false sharing:
sudo perf c2c record -F 60000 -- ./your_binary
sudo perf c2c report --stats
The report shows, for each cache line that experienced HITM events, which two cores were fighting and which source-code lines (with debug info) corresponded to the contended addresses. It is the single best tool for finding non-obvious false sharing in a complex codebase. Brendan Gregg's Systems Performance (2nd ed., §6.6) covers this workflow in detail; reading that chapter once is the highest-leverage 30 minutes you can spend on coherence diagnostics.
Inter-line vs intra-line: when 128-byte alignment helps
Most documentation says "align to 64 bytes". On Intel CPUs from Sandy Bridge onwards, the L2 cache prefetcher fetches lines in adjacent pairs (the spatial prefetcher) — when you miss on line N, the prefetcher fetches N+1 too. If your hot variable is on line N and your next hot variable is on line N+1, the prefetcher pulls line N+1 into the local L1 every time line N is accessed, defeating part of your padding. The fix is to align to 128 bytes — i.e., to a pair of lines — which is why std::hardware_destructive_interference_size is 128 on some Intel platforms even though the line size is 64. The same logic applies on Apple silicon, which has 128-byte L1 lines plus a 256-byte spatial prefetcher unit.
This corner case matters for the most-contended structures only: rate-limiter buckets, lock structures, work-stealing deque headers. For ordinary per-thread counters, 64-byte padding is enough; the spatial-prefetcher false-sharing variant requires that both adjacent lines are written from different cores, which is rare. When you see HITM events that survive 64-byte padding, try 128-byte padding next.
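The 128-byte variant is the same padding trick with a doubled constant. A ctypes sketch, assuming the pair-of-lines prefetch granularity described above:

```python
# Pad each slot to 128 bytes so even the adjacent-line (spatial)
# prefetcher cannot drag a neighbour's line into the local cache.
import ctypes

class Padded128(ctypes.Structure):
    _fields_ = [("v", ctypes.c_uint64),
                ("_pad", ctypes.c_char * 120)]  # 8 + 120 = 128

assert ctypes.sizeof(Padded128) == 128

slots = (Padded128 * 8)()
base = ctypes.addressof(slots)
# Consecutive slots differ by 128 bytes: distinct line PAIRS, not just
# distinct lines, so no two slots share a prefetch unit.
pairs = {(base + 128 * i) // 128 for i in range(8)}
assert len(pairs) == 8
```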
The Razorpay actor-model fix — designing layout from the start
The Hotstar fix above is a patch: a struct grew, the patch added padding, the bug went away. The Razorpay payment-state actor model (mentioned in the /wiki/cache-coherence-mesi-moesi chapter) takes a different approach — it designs the layout so false sharing is structurally impossible. Each transaction-state actor owns a slice of state allocated separately on the heap; threads pin to cores; cross-actor messages flow through lock-free queues whose internal buffers are aligned to 128-byte boundaries.
The result is that no two transaction-state writes from different actors ever land on the same cache line — not because someone added _ [56]byte after every field, but because the data architecture makes it impossible. This generalises: when designing for high write concurrency, the question is not "where do I add padding?" but "what is my unit of ownership and does it map to a cache line?" If the answer is yes, padding is unnecessary; if the answer is no, padding is a band-aid.
This is the same lesson Linux's per-CPU variables encode (DEFINE_PER_CPU macros allocate one variable per CPU, each on its own cache line, with hardware support for fast access via gs/fs segment register on x86), and the same lesson lock-free queue libraries like LMAX Disruptor encode (sequence numbers and event slots are explicitly cache-line-aligned in the design, not as an afterthought). When you see a system that scales to 64+ cores cleanly, look for cache-line-aware design at the structural level; you will always find it.
Reproduce this on your laptop
# Linux x86 with perf, GCC, Python 3.11+
sudo apt install linux-tools-common linux-tools-generic build-essential
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
# Build the C harness (loop runs in native code, no GIL):
gcc -O2 -shared -fPIC -pthread -o libfs.so false_sharing.c
# Run the benchmark:
python3 false_sharing_native.py
# Confirm with perf:
sudo perf stat -e mem_load_l3_miss_retired.remote_hitm \
-- python3 false_sharing_native.py
# Find the contended lines with perf c2c:
sudo perf c2c record -F 60000 -- python3 false_sharing_native.py
sudo perf c2c report --stats
Expect roughly 8–15× speedup from padding on any 8-core or larger box. The exact ratio depends on snoop latency: Apple M-series will show 4–6× (faster fabric), Sapphire Rapids will show 8–12×, dual-socket EPYC with cross-socket contention will show 20–30×. The shape — packed scales sub-linearly, padded scales linearly — is universal.
Where this leads next
The cache line was the noun of /wiki/cache-lines-and-why-64-bytes-rules-everything. The coherence protocol was the verb of /wiki/cache-coherence-mesi-moesi. False sharing is what happens when those two facts collide in code that did not know it was sharing anything. Once you carry the model — coherence tracks lines, not variables — the entire class of "lock-free is slower than locked" mysteries dissolves. The fix is always spatial: put the things one core writes on a different line from the things another core writes.
The next chapters extend the coherence story to broader settings:
- /wiki/numa-topology-and-page-placement — when the contending cores are on different NUMA nodes, the coherence transaction crosses a socket boundary; the cost goes up another 2–3×.
- /wiki/hardware-prefetchers-and-when-they-help — the spatial prefetcher's 128-byte fetch unit is what makes 128-byte alignment matter; the same prefetcher hides some L1 misses on padded structures.
- /wiki/atomic-operations-and-their-real-cost — atomics serialise via the same coherence fabric; on a contended cache line, atomic is slower than non-atomic, not faster.
- /wiki/lock-free-data-structures-and-the-line-as-unit-of-ownership — designing data structures so the unit of ownership is a cache line is the structural answer to false sharing.
- /wiki/per-cpu-variables-and-when-the-kernel-uses-them — Linux's structural answer: every per-cpu allocation is line-aligned by definition.
The deeper thread is that the parallel-programming model in your head — "threads operate on independent variables independently" — does not match the silicon. The silicon operates on cache lines, and any time two threads' variables share a line, they share an implicit lock that you did not write and cannot see in your code. Carrying this model changes how you design data structures: the line, not the byte or the word, is the unit of independence. Languages and runtimes can hide this from you up to a point, but on hot paths under contention the silicon wins, and the only way to get full parallel scaling is to align your data layout to the silicon's actual unit of motion.
A final calibration: if you write a parallel data structure today and skip the padding "to save memory", the memory you save is 56 bytes per hot field on a structure that probably consumes megabytes already. The performance you lose is, in extreme cases, 100× — Hotstar's IPL incident saved 3.5 KB per pod and lost 88% of the service's throughput. The trade is overwhelmingly in favour of padding hot per-thread state, and there is no production-grade reason to skip it. The default should be padded; the exception should require a written justification.
References
- Joe Mario, "C2C — False Sharing Detection in Linux Perf" — the canonical introduction to perf c2c, by the engineer who built it. Includes example output from real workloads.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — §6.6 (cache analysis with perf c2c and PMU counters), §16 (case studies including false sharing in production).
- Intel® 64 and IA-32 Architectures Optimization Reference Manual — §3.7 (sharing in multiprocessor systems), §B.4 (uncore PMU events including mem_load_l3_miss_retired.remote_hitm).
- Drepper, "What Every Programmer Should Know About Memory" (2007) — §6.4.2 (atomicity and false sharing), with timing diagrams that still apply 18 years later.
- Martin Thompson, "Mechanical Sympathy: False Sharing" — the LMAX Disruptor team's classic post showing 70× speedups from cache-line padding on Java workloads.
- Pellegrini, Quaglia, "Asymmetric Lock-Free Synchronization" (Euro-Par 2015) — formal analysis of cache-line-aware lock-free queue designs; the academic basis for the padding patterns LMAX uses in production.
- /wiki/cache-coherence-mesi-moesi — the protocol substrate that makes false sharing a coherence problem rather than a contention problem.
- /wiki/cache-lines-and-why-64-bytes-rules-everything — the 64-byte unit that defines the granularity at which false sharing appears.