Cache lines and why 64 bytes rules everything
Rahul, an SRE at Dream11, is debugging a leaderboard updater that mysteriously slows down when he scales it from one thread to eight. He checks every obvious thing — no contention in `perf lock`, no stray syscalls per request, CPU at 100% on every core. Throughput per thread drops from 9.2 M updates/sec at one thread to 1.1 M updates/sec at eight: same code, eight times the cores, less total throughput than a single thread delivered alone. The flamegraph shows nothing unusual. The culprit is invisible at the source-code level: an array of eight per-thread counters, each a 4-byte int, all packed into one 64-byte cache line. Eight cores fighting over a line they did not know they shared. The 64-byte cache line decided his service's behaviour, and the source code never mentioned 64 once.
The cache moves data in fixed 64-byte units called cache lines, never in single bytes. Every load, every store, every coherence message operates on a whole line. The line size dictates spatial locality, alignment rules, false sharing, and the entire shape of cache-friendly data structures. If you internalise one number from this curriculum, internalise 64.
The 64-byte quantum and why it is not 32 or 128
A modern x86 or ARMv8 server CPU does not "fetch a byte" — that operation does not exist in the hardware. When the load-store unit issues a load for a single byte, the L1 cache fetches the whole 64-byte line containing that byte from wherever the line lives (L1 already, L2, L3, DRAM, another core), holds the line in one of its ways, and returns the requested byte to the register. The 63 other bytes are now in L1 too, ready to be served from there if the next load asks for them. This is spatial locality turned into hardware: the bet that programs which read byte N usually read bytes N+1 through N+63 soon, so the cache might as well bring them all together.
The choice of 64 bytes is not arbitrary, and it is not free. Three independent constraints collide on this one number:
- DRAM burst length. A DDR4 read returns 8 transfers of 8 bytes each = 64 bytes per burst, by JEDEC spec. DDR5 doubles the burst to 16 transfers but halves the channel width to 4 bytes, still 64 bytes total. The cache line size matches the DRAM burst length so a cache miss issues exactly one DRAM transaction; making the line smaller wastes the burst, making it larger forces multiple bursts and serialised waits.
- Spatial-locality return curve. Empirically, on the SPEC CPU and TPC benchmarks that dictate microarchitecture decisions, doubling line size from 32 to 64 cuts miss rate by ~30%; doubling again from 64 to 128 cuts miss rate by only ~10%. The marginal hit-rate gain stops paying for itself after 64.
- False-sharing tax. Bigger lines mean two unrelated variables stored close to each other share a line more often. At 32 bytes, false sharing is rare; at 256 bytes it is endemic. 64 is the size at which the false-sharing penalty for typical struct layouts is bearable.
The intersection of these three curves landed on 64 bytes around the turn of the millennium (the Athlon and Pentium 4 generation), and it has not moved since because nothing in the underlying physics has moved enough to disturb the optimum. POWER9 picked 128 bytes for HPC workloads where spatial locality dominates and false sharing is structurally avoided; Apple's M-series also uses 128-byte lines; ARM Neoverse stayed at 64. Treat 64 as a constant of the universe for x86 / ARM server programming.
Why the 6-bit line offset is fixed in hardware: 64 bytes = 2^6, so any byte's address has 6 low bits identifying its position within a line and the remaining bits identifying which line. The L1 indexing logic uses these bits at no bit-manipulation cost — the address simply splits at bit 6. If the line size were 48 bytes (not a power of two), every cache lookup would need a divide-by-48, far too slow for the single-cycle address path. Powers of two are mandatory, and 64 = 2^6 matches the DRAM burst size exactly.
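The split is pure bit arithmetic; a short sketch (the helper name is ours, for illustration):

```python
LINE = 64                            # bytes per line = 2**6
OFFSET_BITS = LINE.bit_length() - 1  # 6 low bits address a byte within a line

def split_address(addr: int) -> tuple[int, int]:
    """Split a byte address into (line number, offset within the line)."""
    return addr >> OFFSET_BITS, addr & (LINE - 1)

# Bytes 0x103F and 0x1040 are adjacent in memory but live on different lines:
assert split_address(0x1000) == (0x40, 0)   # line-aligned address
assert split_address(0x103F) == (0x40, 63)  # last byte of that line
assert split_address(0x1040) == (0x41, 0)   # first byte of the next line
```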
You can confirm your machine's line size with one shell command. On Linux: cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size returns the L1d line size (almost always 64 on x86 and ARM servers). On macOS: sysctl hw.cachelinesize (64 on Intel Macs, 128 on Apple's M-series). On a Snapdragon Android phone via Termux: getconf LEVEL1_DCACHE_LINESIZE works similarly. Until you have run one of these on a machine you actually use, 64 is an abstraction; once you have, it becomes a fact about your hardware that you can plan around.
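The same lookup can be scripted. A best-effort helper, assuming the Linux sysfs path shown above and falling back through POSIX `sysconf` to a default of 64:

```python
import os

def cache_line_size(default: int = 64) -> int:
    """Best-effort query of the L1d cache line size, in bytes."""
    # Linux sysfs, present on most distros:
    path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        pass
    # POSIX sysconf, exposed by CPython where the C library defines it:
    name = "SC_LEVEL1_DCACHE_LINESIZE"
    if name in os.sysconf_names:
        val = os.sysconf(name)
        if val > 0:
            return val
    return default

print(cache_line_size())  # almost always 64 on x86/ARM servers
```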
Measuring the cost of a single byte vs the whole line
A clean way to see the line size is to time two access patterns: one that touches every 64th byte (one byte per line), and one that touches every byte sequentially (64 bytes per line). The first pattern issues N/64 loads per N bytes of data; the second issues N loads. If bytes were the unit of motion, the first should be 64x faster. If lines are, the first should be only marginally faster — both pay one cache-line fetch per line, with the difference dominated by per-byte instruction overhead.
# cacheline_quantum.py — show that the cache line, not the byte, is the unit.
# Pure Python + numpy. Touches the same physical bytes via two patterns.
import numpy as np, time
LINE = 64
SIZE_MB = 64
N_BYTES = SIZE_MB * 1024 * 1024
ITERS = 3
def stride_walk(buf: np.ndarray, stride: int) -> tuple[float, int]:
    """Read one byte every 'stride' bytes, summing into an accumulator."""
    n = buf.shape[0]
    total = 0
    t0 = time.perf_counter_ns()
    for _ in range(ITERS):
        # A strided numpy view avoids per-element Python loop overhead.
        total += int(buf[::stride].sum())
    t1 = time.perf_counter_ns()
    n_loads = (n // stride) * ITERS
    return (t1 - t0) / n_loads, total

if __name__ == "__main__":
    # Warm a 64 MB buffer; this far exceeds L3 so each line comes from DRAM.
    buf = np.ones(N_BYTES, dtype=np.uint8)
    print(f"Buffer = {SIZE_MB} MB, far past L3 — every line miss hits DRAM\n")
    print(f"{'stride':>10} | {'loads / iter':>16} | {'ns / load':>10} | comment")
    print("-" * 70)
    for stride in (1, 8, 32, 64, 128, 256, 1024):
        ns, _ = stride_walk(buf, stride)
        loads_per_iter = (N_BYTES + stride - 1) // stride
        comment = ("byte-by-byte" if stride == 1
                   else "every line once" if stride == LINE
                   else "skipping lines" if stride > LINE
                   else "<1 line/load")
        print(f"{stride:>10} | {loads_per_iter:>16,} | {ns:>10.2f} | {comment}")
Sample run on the same Ryzen 7 7840U laptop (32 KB L1d, 1 MB L2, 16 MB L3, 64-byte line):
Buffer = 64 MB, far past L3 — every line miss hits DRAM
    stride |     loads / iter |  ns / load | comment
----------------------------------------------------------------------
         1 |       67,108,864 |       0.42 | byte-by-byte
         8 |        8,388,608 |       0.51 | <1 line/load
        32 |        2,097,152 |       0.78 | <1 line/load
        64 |        1,048,576 |       1.03 | every line once
       128 |          524,288 |       1.94 | skipping lines
       256 |          262,144 |       3.81 | skipping lines
      1024 |           65,536 |      14.92 | skipping lines
The story the numbers tell: at stride 1, 8, 32 — all within a single line — the cost is dominated by the first load that brings the line in; subsequent reads inside the same line are essentially free (L1 hit at <1 ns). At stride 64 you cross a line on every load, so every load is a fresh fetch from DRAM (because the buffer is past L3) and the cost climbs to ~1 ns/load (with the prefetcher catching most of them). At stride 128 and beyond, you defeat the prefetcher's stride-detection and pay nearly the full DRAM round-trip per load. The cliff between stride 32 and stride 64 is the line boundary; below it you are riding the line, above it you are paying per line.
Why stride 1 is faster than stride 64 even though they "do the same thing": at stride 1, the hardware prefetcher detects the perfectly linear pattern after 2-3 lines and starts issuing fetches 8-16 lines ahead of the demand load. By the time your code asks for line N, line N is already in L1. At stride 64, the prefetcher still sees a linear pattern (one load per line, advancing by 64) and prefetches, but each prefetch returns the whole line for one used byte — bandwidth is wasted on 63 unused bytes per line. At stride 128, the prefetcher sees a stride-2 pattern in line space; on most modern CPUs it still prefetches, but two-line strides are below the prefetcher's confidence threshold and many demand loads arrive before their line does. At stride 1024, you have effectively turned off prefetching by exceeding the prefetcher's reach; every load waits the full DRAM latency.
Walking the load-bearing lines:
- `buf = np.ones(N_BYTES, dtype=np.uint8)` allocates 64 MB of contiguous bytes. `dtype=np.uint8` is 1 byte per element, so addresses are dense; this gives clean stride semantics. `np.ones` (vs `np.empty`) ensures pages are physically backed before the benchmark — Linux otherwise lazy-allocates them and your first measurement would include page faults.
- `buf[::stride]` is a numpy view, not a copy. It walks the original buffer, reading every `stride`-th byte. This is what makes the benchmark a true memory-traffic measurement: no allocation overhead per access.
- `.sum()` forces a real read of every accessed element. Without it, numpy/Python could optimise away the access entirely, and you would measure nothing. The accumulator pattern is the standard way to defeat dead-code elimination in microbenchmarks.
- Why we run for 3 iterations and divide: the first iteration warms the TLB and pulls some lines into L3 (briefly, before they evict). Iterations 2 and 3 see consistent timings. Averaging over more iterations would smooth out hardware noise but lose the cliff sharpness; 3 is the sweet spot.
The same benchmark on a Sapphire Rapids EC2 c7i.4xlarge (48 KB L1d, 2 MB L2, 105 MB L3) shows the same shape but different absolute numbers: the prefetcher is more aggressive (Intel's L2 streamer is 2-3 lines wider), so the stride-1 vs stride-64 gap shrinks; and because the entire 64 MB fits in L3, you never hit DRAM, so the worst-case stride still costs only ~5 ns vs ~15 ns on the laptop. Run it on your hardware. The shape is universal; the numbers are local.
Alignment, struct layout, and how the compiler thinks about lines
If a 64-byte line is the unit of motion, then where a struct sits relative to line boundaries matters. A 96-byte struct that happens to start at offset 48 of a line spans three lines (16 bytes from line A, all 64 of line B, 16 bytes of line C); a 96-byte struct aligned to a line boundary spans two. Same data, 50% more line traffic. The compiler aligns ordinary stack/heap objects to their natural alignment (8 bytes for a double, 16 bytes for a SIMD vector); cache-line alignment is not the default — you have to ask for it.
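The line-counting arithmetic is simple enough to keep as a helper. A sketch (the function name is ours):

```python
LINE = 64

def lines_spanned(start: int, size: int) -> int:
    """Number of 64-byte cache lines touched by the range [start, start + size)."""
    if size == 0:
        return 0
    first = start // LINE
    last = (start + size - 1) // LINE
    return last - first + 1

# The 96-byte struct from the text:
assert lines_spanned(0, 96) == 2    # line-aligned: two lines
assert lines_spanned(48, 96) == 3   # offset 48: 16 + 64 + 16 bytes, three lines
# A 64-byte struct at any non-boundary start costs two lines instead of one:
assert lines_spanned(8, 64) == 2
```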
In C++ the request is `alignas(64)`; in Rust, `#[repr(align(64))]`; in C, `__attribute__((aligned(64)))` or C11 `_Alignas(64)`. In Python via numpy you trust numpy's allocator, which aligns to SIMD-friendly boundaries but does not guarantee 64 — over-allocate and offset if you need a line boundary. For dynamic allocation in C, `posix_memalign(&p, 64, size)` or `aligned_alloc(64, size)` returns line-aligned memory. Alignment sets where an object starts; layout sets how many lines its hot fields occupy, which the following demo measures:
# layout_demo.py — see how struct layout decisions move cache lines.
# Uses ctypes to control field alignment exactly the way C would.
import ctypes, time, numpy as np
# Pattern A: the lazy struct. Hot field 'price' is at offset 16, surrounded
# by other 8-byte fields that pack into the same 64-byte line.
class ProductLazy(ctypes.Structure):
    _fields_ = [
        ("id", ctypes.c_int64),          # 0..7
        ("created_at", ctypes.c_int64),  # 8..15
        ("price", ctypes.c_int64),       # 16..23 <-- HOT
        ("seller_id", ctypes.c_int64),   # 24..31
        ("category_id", ctypes.c_int64), # 32..39
        ("inventory", ctypes.c_int64),   # 40..47
        ("warehouse", ctypes.c_int64),   # 48..55
        ("flags", ctypes.c_int64),       # 56..63  total 64 B = 1 line
    ]

# Pattern B: structure of arrays. Only the hot field is in the hot loop.
def soa_iter(prices: np.ndarray) -> int:
    return int(prices.sum())

# Pattern C: hot/cold split. Hot fields packed into a small struct;
# cold fields in a parallel array, accessed only when needed.
class ProductHot(ctypes.Structure):
    _fields_ = [
        ("id", ctypes.c_int64),
        ("price", ctypes.c_int64),
    ]
# 16 bytes — four ProductHot per cache line.

N = 1_000_000  # 1 M products

# Build all three layouts.
aos = (ProductLazy * N)()
for i in range(N):
    aos[i].price = i
prices_soa = np.arange(N, dtype=np.int64)
aos_hot = (ProductHot * N)()
for i in range(N):
    aos_hot[i].price = i

def bench(label, fn, iters=200):
    t0 = time.perf_counter_ns()
    for _ in range(iters):
        s = fn()
    t1 = time.perf_counter_ns()
    print(f"{label:>30} | {(t1 - t0) / iters / 1e6:7.2f} ms/iter | sum={s}")

bench("AoS lazy (64 B/product)", lambda: sum(p.price for p in aos))
bench("SoA prices only (8 B/p)", lambda: soa_iter(prices_soa))
bench("AoS hot/cold (16 B/p)", lambda: sum(p.price for p in aos_hot))
Sample run on the Ryzen 7 7840U laptop (1 M products = 64 MB AoS-lazy, 16 MB AoS-hot, 8 MB SoA):
AoS lazy (64 B/product) | 138.20 ms/iter | sum=499999500000
SoA prices only (8 B/p) | 1.94 ms/iter | sum=499999500000
AoS hot/cold (16 B/p) | 97.10 ms/iter | sum=499999500000
The 71x gap between SoA and AoS-lazy comes from line traffic: AoS-lazy moves 64 MB through the cache hierarchy to read the hot field; SoA moves 8 MB. Both code paths "do the same work" — sum a million prices — and the silicon cost differs by nearly two orders of magnitude. The hot/cold split (16 B/product) sits between them because it shrinks the hot footprint 4x but Python's per-element loop overhead now dominates; rewritten in numpy or a vectorised C kernel, it would land near SoA. The point is not "always use SoA" — sometimes you need related fields together for write atomicity or for joining with another structure — but to know that the layout decision sets the line traffic, and the line traffic sets the speed.
pahole (from the dwarves package) is the tool to inspect any compiled C/C++/Rust struct's layout, padding, and which fields share lines. `pahole -C ProductLazy ./binary` prints the offset, size, and any holes the compiler inserted. This is the same view a senior engineer carries in their head when reading a struct definition. For a Python `ctypes.Structure`, the offsets are deterministic and printable: `[(name, getattr(ProductLazy, name).offset, getattr(ProductLazy, name).size) for name, _ in ProductLazy._fields_]`. If you cannot answer "which line does field X land on", you cannot diagnose why the cache profile looks the way it does.
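A sketch of that per-field view for the ProductLazy layout above, computing which line each field lands on (assumes 64-byte lines; `field_lines` is our helper, not a ctypes API):

```python
import ctypes

LINE = 64

class ProductLazy(ctypes.Structure):
    _fields_ = [(n, ctypes.c_int64) for n in
                ("id", "created_at", "price", "seller_id",
                 "category_id", "inventory", "warehouse", "flags")]

def field_lines(struct_cls) -> list[tuple[str, int, int]]:
    """(field name, byte offset, cache line index) for every field."""
    out = []
    for name, _ctype in struct_cls._fields_:
        desc = getattr(struct_cls, name)  # ctypes field descriptor
        out.append((name, desc.offset, desc.offset // LINE))
    return out

for name, off, line in field_lines(ProductLazy):
    print(f"{name:>12} @ offset {off:3d} -> line {line}")
# All eight fields land on line 0: the whole struct is one cache line.
```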
Aditi at Flipkart's catalogue-search service (from the previous chapter) is a real example. The original Product struct was ~256 bytes — four lines per product. The hot path read only price, id, and availability (24 bytes total), but every product access pulled all four lines. Splitting into a hot struct (16 bytes: id, price) and a cold struct (240 bytes: everything else, accessed only on detail-page render) cut hot-path memory traffic 16x. p99 search latency dropped from 240 ms to 38 ms; LLC miss rate went from 73% to 11%. The change was 80 lines of code. The improvement was the line traffic, not the algorithm.
False sharing — when threads share a line they did not mean to share
The line being the unit of coherence (not just motion) is where things get dangerous. When core 0 writes a byte in line L, the cache coherence protocol must invalidate every other core's copy of L — even if those other cores only ever touched different bytes of L. From the protocol's point of view, the line is one indivisible thing; "byte 8 vs byte 16" does not exist at the coherence layer. Two threads that share no logical state but happen to live on the same line will perform an invalidation pingpong on every write, and their throughput collapses.
This is false sharing: false because the threads share no logical resource, sharing because the cache line bridges them anyway. It is the canonical bug in multi-threaded performance work, and it is invisible at the source level — there is no lock, no atomic, no shared variable to find with grep. You see it only by measuring perf c2c (cache-to-cache transfers) or by knowing the symptom: throughput that scales perfectly to N threads up to some N, then drops off a cliff at N+1.
Rahul at Dream11 (from the lead) was hit by exactly this. His leaderboard updater had a per-thread counter array int counters[8], used to track per-thread updates for monitoring. All 8 ints fit in 32 bytes; the whole array fit in one cache line. Every thread incrementing its own counter — counters[tid]++ — pulled the line into its L1 in M (Modified) state, which forced an Invalidate broadcast that bounced the line back out of every other thread's L1. Eight threads, eight invalidations per round, ~80 ns each: the coherence traffic alone exceeded the actual update work by 50x. Throughput per thread fell from 9.2 M/sec (single-threaded, line stays in one L1) to 1.1 M/sec at 8 threads. The fix is one line of code: pad each counter to its own line.
# falsesharing_demo.py — measure the cost of false sharing.
# Two layouts: packed counters (false sharing) vs padded counters (no sharing).
import ctypes, threading, time
LINE = 64
N_THREADS = 8
N_ITERS = 50_000_000
# Layout A: 8 packed int64 counters → all 8 fit in ONE 64-byte line.
PackedArray = ctypes.c_int64 * N_THREADS
# Layout B: each counter padded to occupy a full 64-byte line.
class PaddedCounter(ctypes.Structure):
    _fields_ = [("value", ctypes.c_int64),
                ("pad", ctypes.c_int64 * 7)]  # 7 * 8 = 56 bytes pad → 64 total

PaddedArray = PaddedCounter * N_THREADS

def bench(arr, get_value, set_value):
    barrier = threading.Barrier(N_THREADS + 1)
    def worker(tid):
        barrier.wait()
        for _ in range(N_ITERS):
            set_value(arr, tid, get_value(arr, tid) + 1)
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
    for t in threads: t.start()
    barrier.wait()
    t0 = time.perf_counter_ns()
    for t in threads: t.join()
    t1 = time.perf_counter_ns()
    total = sum(get_value(arr, i) for i in range(N_THREADS))
    return (t1 - t0) / 1e9, total

packed = PackedArray()
padded = PaddedArray()
dt_p, n_p = bench(packed, lambda a, i: a[i], lambda a, i, v: a.__setitem__(i, v))
dt_d, n_d = bench(padded,
                  lambda a, i: a[i].value,
                  lambda a, i, v: setattr(a[i], "value", v))
print(f"packed (false sharing): {dt_p:.2f} s, {n_p / dt_p / 1e6:.2f} M ops/s")
print(f"padded (per-line) : {dt_d:.2f} s, {n_d / dt_d / 1e6:.2f} M ops/s")
print(f"speedup from padding : {dt_p / dt_d:.2f}x")
Sample run on the same laptop, 8 threads, 50 M increments per thread:
packed (false sharing): 28.41 s, 14.08 M ops/s
padded (per-line) : 3.37 s, 118.69 M ops/s
speedup from padding : 8.42x
8.4x speedup from 56 bytes of padding per counter. The CPython GIL muddies the absolute numbers (the GIL serialises bytecode, but the underlying L1 line bouncing still happens during the C-level ctypes write), but the ratio survives — on a CPython service the GIL is the upper bound on contention, and false sharing eats deeply into whatever throughput remained after the GIL cost. On a GIL-free runtime (the C kernel called via ctypes, or PyPy with gc.disable(), or a Rust/Go service doing the same pattern), the false-sharing penalty cleanly shows up as a wall-time ratio.
perf c2c is the kernel-level diagnostic. Run sudo perf c2c record -F 99 -a -- sleep 30 while the workload runs, then sudo perf c2c report --stdio to see the hottest cache lines and which thread-pair pings them. The output lists each contended line with its address, the cores involved, and the modify/load count. If your service has any hot data shared across threads, run perf c2c once a quarter — most teams find a false-sharing line within the first 3 attempts.
The defensive coding pattern is straightforward: when you have a per-thread counter or a per-CPU array, pad each entry to a full line. Most ecosystems have an idiom for this: `alignas(64)` in C/C++, `#[repr(align(64))]` or crossbeam's `CachePadded<T>` in Rust, a trailing `_ [64]byte` pad field in Go structs, a padding member in a `ctypes.Structure` in Python. The check is on the engineer: if two threads write near each other in memory, they ping-pong a line. Add padding before you measure the regression, not after.
Common confusions
- "Cache line size = page size." No. A cache line is 64 bytes; a page is 4 KB (or 16 KB on Apple silicon, 64 KB on some ARM server parts). Lines are the unit of cache motion; pages are the unit of virtual-memory motion (TLB entries, page faults, mmap regions). A 4 KB page contains 64 cache lines on a 64-byte-line CPU. The two are independent quanta; confusing them produces wrong padding (padding a counter to a whole page is wasteful) or wrong page allocation (an `mmap` of 64 bytes is impossible — you get 4 KB minimum).
- "Bigger structs are always slower." Not in isolation — a 200-byte struct accessed once has the same cost as one 4-line fetch, regardless of how many fields are inside. The cost rises only when you read the whole struct in a hot loop, or when only one field is hot and you waste the rest of every line you fetch. The right model is "how many lines does my hot loop touch?", not "how big is my struct?"
- "Padding always helps." Padding helps when the unpadded version has cross-thread false sharing, hot-line aliasing, or alignment mishandling. On single-threaded code with cold structs, padding hurts — it inflates the working set, so what fit in L2 now spills to L3. Apply padding surgically: to shared-counter arrays, hot-loop fields, mutex-adjacent state. Not to every struct in your codebase. Rust's `crossbeam::CachePadded<T>` (a third-party crate, not the standard library) exists specifically because the user must opt in.
- "`alignas(64)` aligns the struct's size to 64." It aligns the start address to 64. In C and C++ a type's size is rounded up to a multiple of its alignment, so `alignas(64)` on the type does make each array element occupy a full line — but `alignas(64)` on a single member only moves that member's start, and the compiler is free to place the next member in the remainder of the same line. Many false-sharing bugs come from engineers aligning the hot field's start and assuming the rest of its line was reserved.
- "Cache lines are 32 bytes on small CPUs." Mostly false in 2026. Even ARM Cortex-A55 (the small/efficiency cores in Indian midrange phones) uses 64-byte lines; the last common 32-byte-line x86 CPU was the Pentium III. POWER9 and Apple's M-series use 128 bytes. x86 server, ARM Neoverse, ARM Cortex-A7xx/X-series: all 64. When porting code to ARM mobile, assume 64.
- "`memset` and `memcpy` work in lines internally." They work in vectors — typically 16, 32, or 64 bytes per SIMD register — and large copies may use non-temporal (cache-bypassing) stores. The line is still the unit of memory transfer even for cache-bypass copies; the vector is the unit of compute. The two interact: `rep movsb` on modern x86 performs line-sized moves internally. When optimising large copies, the line determines the floor; the vector determines the throughput.
Going deeper
Cache-line-aligned heap allocators and the cost of malloc(48)
The default malloc (glibc, jemalloc, tcmalloc) returns 16-byte-aligned memory on x86, not line-aligned. A malloc(48) returns 48 bytes that may straddle two cache lines — 16 bytes on line A and 32 bytes on line B. Reading the whole struct fetches both lines. For frequently-accessed structs, this is silent overhead. The escape hatches: posix_memalign(&p, 64, 48) for one-off allocations; a custom slab allocator with line-aligned slabs for high-frequency objects; jemalloc's MALLOCX_ALIGN(64) flag for selective alignment.
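Python has no `posix_memalign`, but the standard over-allocate-and-offset trick produces a line-aligned numpy view. A sketch (the helper name is ours):

```python
import numpy as np

LINE = 64

def aligned_buffer(n_bytes: int, align: int = LINE) -> np.ndarray:
    """Return a uint8 view of length n_bytes whose data pointer is aligned
    to 'align' bytes, by over-allocating and slicing at the right offset."""
    raw = np.empty(n_bytes + align, dtype=np.uint8)
    addr = raw.ctypes.data
    offset = (-addr) % align          # bytes to the next aligned address
    return raw[offset:offset + n_bytes]

buf = aligned_buffer(4096)
assert buf.ctypes.data % LINE == 0    # starts exactly on a line boundary
assert buf.shape[0] == 4096
```

The slice is a view, so no copy happens; the extra `align` bytes of the backing array are the price of the guarantee.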
The cost is real but moderate: typical workloads see 1-2% LLC miss-rate increase from straddle, which translates to ~3-5% IPC loss in hot paths. Razorpay's transaction-state allocator is line-aligned by policy after a 2023 incident in which a txn_state of 96 bytes was straddling lines 50% of the time, costing them ~6 ms of p99 latency they couldn't initially explain.
The prefetcher's view of cache lines
The hardware prefetcher does not look at bytes; it looks at line addresses. A linear walk that touches one byte per line every iteration looks identical to the prefetcher as a linear walk that touches all 64 bytes per line — same line-address sequence. The prefetcher will detect a stride of +1 line (the simplest case) and start streaming 8-16 lines ahead, then progressively more aggressive if it sees high consumer rate. Stride-2, stride-3, and small fixed strides are detected on Skylake and later; arbitrary or random strides are not.
This means a benchmark that walks every 64th byte does not "test cache miss latency without prefetching"; it tests prefetched sequential access at line granularity. To defeat the prefetcher you must either (a) walk randomly (random permutation of line addresses, as in the cache-ladder benchmark from the previous chapter) or (b) walk with a stride larger than the prefetcher's reach (~4 KB on Skylake, larger on Zen 4). Confusing "stride 64 = one line per access" with "no prefetcher help" is a common microbenchmark error.
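A sketch of option (a): touch exactly the same set of lines, one byte each, but in a shuffled order the stride detector cannot learn. Absolute numbers vary by machine; the shuffled walk is the one that approaches true miss latency.

```python
import numpy as np, time

LINE, N_BYTES = 64, 64 * 1024 * 1024
buf = np.ones(N_BYTES, dtype=np.uint8)   # np.ones forces physical backing
n_lines = N_BYTES // LINE

def walk(order: np.ndarray) -> tuple[float, int]:
    """Read one byte per line, visiting lines in 'order'; return ns/load."""
    idx = order * LINE                   # one byte address per line
    t0 = time.perf_counter_ns()
    total = int(buf[idx].sum())          # fancy indexing gathers in 'order'
    t1 = time.perf_counter_ns()
    return (t1 - t0) / len(order), total

seq = np.arange(n_lines, dtype=np.int64)
shuf = np.random.default_rng(0).permutation(seq)

ns_seq, sum_seq = walk(seq)
ns_rand, sum_rand = walk(shuf)
assert sum_seq == sum_rand == n_lines    # same bytes read either way
print(f"sequential lines: {ns_seq:6.2f} ns/load (prefetcher helps)")
print(f"shuffled lines  : {ns_rand:6.2f} ns/load (prefetcher defeated)")
```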
Why prefetcher confidence requires multiple line accesses to engage: the prefetcher trains by observing the address pattern of recent loads — typically the last 16 misses to a 4 KB region. It needs at least 2 confirming hits before it issues speculative fetches, and 4 hits before it goes fully aggressive (8-16 lines ahead). A short loop that touches only 3 lines never trains the prefetcher; a 100-iteration loop trains it after the first 4 lines and rides streaming behaviour for the remaining 96. Microbenchmarks that report "DRAM latency" while walking sequentially are usually reporting prefetched-streaming bandwidth divided into per-load latency — a number that bears no relation to true cache-miss latency.
The Zerodha order-book and the 16-byte hot record
Zerodha Kite's matching engine groups orders into a hot record of exactly 16 bytes: 8-byte order ID + 8-byte (price, quantity) packed pair. Four hot records fit per cache line. The engine processes orders by walking an array of these records — 4 records per line, prefetcher detects the pattern instantly, streaming throughput at near-DRAM-bandwidth limit. The hot path's IPC is 3.1, near the maximum a Skylake-X core can sustain.
If the hot record were 24 bytes (adding a client_id), 2.67 records would fit per line — meaning 2 records share one line and the third record straddles two lines, breaking streaming. Throughput would drop ~30% and IPC to ~2.1. The team's data-layout review checks every change to a hot record against three numbers: 16, 32, 64. A new field that pushes the size past one of those boundaries triggers a "show your work" coherence walkthrough before merge. This is what cache-line awareness looks like in production engineering: not a magic trick, but a checklist that fits on a Post-it.
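That boundary checklist is mechanical enough to script. A sketch that computes, for a densely packed array of fixed-size records, what fraction straddle a line boundary (record sizes from the example above; the function is ours):

```python
from math import gcd

LINE = 64

def straddle_fraction(rec_size: int) -> float:
    """Fraction of densely packed rec_size-byte records that cross a
    64-byte line boundary. The layout repeats every lcm(64, rec_size)."""
    period = LINE * rec_size // gcd(LINE, rec_size)
    n = period // rec_size
    crossing = sum(
        1 for i in range(n)
        if (i * rec_size) // LINE != (i * rec_size + rec_size - 1) // LINE
    )
    return crossing / n

assert straddle_fraction(16) == 0.0   # 4 records/line, never straddles
assert straddle_fraction(32) == 0.0   # 2 records/line, never straddles
assert straddle_fraction(24) == 0.25  # every 4th record crosses a boundary
```

Power-of-two record sizes that divide 64 never straddle; the 24-byte variant puts one record in four across two lines, which is where the streaming break comes from.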
Sharing across NUMA nodes — the cost compounds
False sharing within a socket costs ~80 ns per ping-pong (intra-socket coherence latency). The same false sharing across two sockets — if your two threads happen to be pinned to cores on different NUMA nodes — costs ~250 ns per ping-pong, because the line has to traverse the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD). The Dream11 leaderboard updater scaled to 8 cores on one socket but degraded catastrophically if those 8 cores were spread across 2 sockets — 4 ms total coherence cost per round of updates instead of 0.6 ms.
The defensive pattern: pin closely-cooperating threads to the same socket via taskset -c 0-7 plus numactl --cpunodebind=0, and accept that adding a 9th thread on the other socket will slow you down unless that thread has its own non-shared data. Cache-line awareness leads naturally to NUMA awareness; the next part of this curriculum picks up that thread.
Reproduce this on your laptop
# Cache-line quantum, false sharing, and layout demo
sudo apt install linux-tools-common linux-tools-generic dwarves
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 cacheline_quantum.py
python3 layout_demo.py
python3 falsesharing_demo.py
# Confirm the line size on your machine:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size # almost always 64
# On Apple silicon: sysctl hw.cachelinesize
# Look for false sharing in a running process:
sudo perf c2c record -F 99 -a -- sleep 30
sudo perf c2c report --stdio | head -50
If your falsesharing_demo.py shows less than a 4x speedup from padding, you may be on a single-NUMA system where the GIL is dominating the runtime cost; rewrite the inner loop in a small C kernel via ctypes to expose the underlying coherence cost. The cliff is hardware-real; the Python overhead can hide it.
Where this leads next
The 64-byte cache line is the unit on which the next several chapters operate:
- /wiki/cache-coherence-mesi-moesi — the protocol that moves lines between cores, and what each state transition costs.
- /wiki/false-sharing-the-silent-killer — the dedicated chapter on the bug pattern shown above, with more diagnostic tools and case studies.
- /wiki/data-layout-for-cache-friendliness — SoA vs AoS, hot/cold splits, padding decisions in production codebases.
- /wiki/prefetching-hardware-vs-software — how the prefetcher decides which line to fetch next, and when you need explicit `__builtin_prefetch`.
- /wiki/tlb-and-address-translation-costs — the second cache hierarchy operating at page granularity, layered on top of line-granular L1/L2/L3.
Once you internalise that 64 bytes is the unit, every cache discussion stops being about bytes and starts being about lines. A struct is "two and a half lines"; a counter array is "one line of ping-pong"; an SoA hot field is "1/8 of a line per element". This conceptual shift is what separates engineers who can reason about cache behaviour from engineers who cannot. The hardware does not move bytes; do not let your mental model move them either.
The next chapter — cache coherence — picks up at the moment two cores both want the same line. The line moves between them via a protocol that defines four (or five) states per cached line and the messages that transition between them; the protocol's overhead is what makes false sharing expensive, what makes atomics expensive, and what shapes every multi-core scaling curve. The 64-byte line is the noun; coherence is the verb.
References
- Drepper, "What Every Programmer Should Know About Memory" (2007) — §3.3 (cache structure) and §6.2 (data alignment) cover line size, set associativity, and the alignment-vs-size argument in depth.
- Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019) — Chapter 2 establishes the line-size optimisation framework (spatial locality, false sharing, miss rate vs line size curves).
- Intel® 64 and IA-32 Architectures Optimization Reference Manual — §3.6 (memory order) and §11 (data alignment) define line size, store-buffer behaviour, and prefetcher reach for each microarchitecture.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 covers `perf c2c`, line-level diagnostics, and false-sharing detection in production-debugging style.
- `perf c2c` documentation (Linux kernel docs) — the canonical guide to cache-line-level coherence diagnostics.
- Crossbeam `CachePadded` source (Rust) — production reference for line-padded data structures, with platform-specific line-size choices documented.
- Agner Fog, Optimizing software in C++ (2024) — §9.10 (alignment) and §9.11 (cache organisation) are the field manual for line-aware C++ programming.
- /wiki/l1-l2-l3-hierarchy-and-their-latencies — the prequel: where the line lives, and what each level's hit costs.