Wall: measuring is harder than optimizing

Aditi at Zerodha had spent two weeks doing everything right. Her order-matcher service on the new 2-socket Sapphire Rapids 8480+ box was pinned with numactl --cpunodebind=0,1 --membind=0,1, the matching engine threads were sharded by symbol across sockets, the order book used jemalloc with narenas:2 and a per-arena pinning callback, and numastat -p $(pgrep matcher) showed 97% local hits on both nodes. The dashboard was the colour the dashboard was supposed to be. She rolled out to 10% of production at 09:14 IST, six minutes before market open, and watched. p99 order-match latency on the canary held at 1.4 ms — exactly the lab number. She rolled to 50% at 09:18. The same. She rolled to 100% at 09:23. p99 spiked to 4.2 ms within ninety seconds and stayed there for the rest of the trading day. The dashboard still said 97% local. The flamegraph still pointed at memory. Nothing on her screen could tell her where the latency had gone.

Optimising for NUMA is a problem you can solve in a week of careful work. Measuring NUMA — knowing whether your optimisation actually held, in production, under real load, on a kernel that is rebalancing under you — is a problem most teams never solve. numastat counts what is easy to count, not what costs you latency. AutoNUMA migrates pages while you A/B-test. Containers reshuffle topology between pod restarts. Every signal lies in a different way. This chapter is the wall that closes Part 3 and motivates Part 4 (Benchmarking).

The dashboards report the wrong thing

numastat is the canonical NUMA observability command. Run bare, it prints six counters per node: numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, other_node (the numastat -p <pid> form switches to a per-process, per-node memory breakdown instead). Engineers learn the names, watch the ratios, and assume that when numa_hit / (numa_hit + numa_miss) is above 95% the placement is healthy. That assumption is the first place the wall opens up.
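
The ratio engineers watch can be computed straight from the per-node sysfs counters, which is where numastat itself reads them. A minimal sketch, assuming the standard /sys/devices/system/node/nodeN/numastat layout (one "name value" pair per line); the parsing helpers are hypothetical names, not a shipped tool:

```python
# numa_hit_ratio.py — compute the allocation-hit ratio per node from sysfs.
import glob

def parse_numastat(text):
    """Parse a nodeN/numastat file: one 'counter_name value' pair per line."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}

def hit_ratio(stats):
    """The ratio dashboards plot: numa_hit / (numa_hit + numa_miss)."""
    hit, miss = stats["numa_hit"], stats["numa_miss"]
    return hit / (hit + miss) if hit + miss else 1.0

if __name__ == "__main__":
    for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
        with open(path) as f:
            stats = parse_numastat(f.read())
        node = path.split("/")[-2]
        print(f"{node}: {100 * hit_ratio(stats):.1f}% allocation-hit ratio")
```

Run it on a healthy box and you will see the 95%+ number the chapter warns about; the point of the next paragraphs is that this number is about page faults, not loads.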

The kernel populates these counters in the page-fault path. numa_hit increments when a page fault is satisfied by the node the calling thread asked for and got. numa_miss increments when the requested node was full and the kernel had to spill to another node. local_node and other_node increment based on which node ran the faulting thread. None of these counters are touched on the hot path of normal access. A thread running on node 1 that loads from a page resident on node 0 increments nothing — the access is just slower. The page-fault counters say everything is fine because, in the page-fault sense, everything is fine. The pages were placed wherever the kernel decided to place them. The application later read them from the wrong CPU, but that is not a fault, so it is not counted.

[Figure: numastat counts page faults; latency comes from loads. Two parallel timelines. The top, "what numastat sees (page-fault path)", shows a handful of fault events, each one ticking numa_hit, ending at "100% hits, dashboard green". The bottom, "what your latency feels (load-instruction path, >10 million/sec)", shows ordinary load instructions, each served either locally (~30 ns) or remotely (~95 ns), none of them ticking any counter, ending at "62% of loads remote: a 2.7× slowdown invisible to numastat".]
The page-fault counters fire once per page on first touch. The load counters that could tell you about residency would have to fire on every memory access — billions per second per core — so the kernel does not maintain them. Illustrative — the gap between counted faults and uncounted loads is the gap between dashboard and reality.

What you actually want to know — of the loads I issued in the last second, what fraction crossed a socket boundary? — is not in numastat. It is, partially, in the PMU's OFFCORE_RESPONSE events on Intel and LS_DC_ACCESSES_BY_LOCALITY on AMD. Reading those events requires perf stat -e with the right umask, the right unit-mask filter, and a kernel new enough to expose the event. Aditi's matcher dashboard had numastat on the front page and no PMU events anywhere, because PMU events are per-CPU, expensive to aggregate, and the team had wired the dashboard months ago when the matcher was on a single-socket box and numastat was the only thing that mattered.

There is a second hazard buried in the same counters. numa_foreign increments on the node that received a foreign allocation, not the node that requested it. If a thread on node 1 asked for node-1 memory but the kernel had to spill the allocation to node 0 because node 1's free-page pool was empty, numa_foreign ticks on node 0 and numa_miss ticks on node 1. Reading numastat and seeing a low numa_foreign on a node tells you nothing about whether that node's threads are getting their allocations satisfied locally; it tells you whether other nodes are dumping allocations into this one. Engineers misread this column constantly. The Zerodha matcher dashboard had numa_foreign plotted as "remote allocations from this node" with the time-series flat-lining at zero on socket 1; the actual remote-allocation count for socket-1 threads was in numa_miss on socket 1, which the dashboard did not plot at all because the panel had been built for a single-socket machine and the second-node breakdown had never been added.

The lesson generalises beyond numastat. Every observability tool exposes the metrics that are easy for it to expose, not the metrics that map to the question you wanted to ask. The metric you actually want — "fraction of hot-path loads served from a non-local node" — is computable, but not from any single tool, and not from any default dashboard. Building it requires a deliberate combination of perf stat --per-socket -e mem_load_l3_miss_retired.remote_dram (numerator) and mem_load_retired.l1_hit + mem_load_retired.l2_hit + mem_load_retired.l3_hit + mem_load_l3_miss_retired.local_dram + mem_load_l3_miss_retired.remote_dram (denominator), divided per socket, plotted over time, ideally with the NUMA topology overlaid. No vendor ships this dashboard out of the box. Every NUMA-mature production team has built a version of it and considers it a competitive advantage.
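
A sketch of what assembling that metric looks like in practice, assuming the Sapphire Rapids event spellings from the paragraph above and the CSV layout of perf stat -x, --per-socket (socket, CPU count, value, unit, event name; column order can vary across perf versions, so check yours before trusting the parse):

```python
# remote_fraction.py — assemble "fraction of loads served from remote DRAM"
# per socket out of perf stat CSV output. Event names are the Sapphire Rapids
# spellings; other generations spell them differently (see `perf list`).
import subprocess

EVENTS = [
    "mem_load_retired.l1_hit", "mem_load_retired.l2_hit", "mem_load_retired.l3_hit",
    "mem_load_l3_miss_retired.local_dram", "mem_load_l3_miss_retired.remote_dram",
]

def parse_per_socket_csv(text):
    """Lines look like 'S0,48,123456,,event_name,...'; skip everything else."""
    counts = {}
    for line in text.splitlines():
        cols = line.split(",")
        if len(cols) >= 5 and cols[0].startswith("S") and cols[2].isdigit():
            counts.setdefault(cols[0], {})[cols[4]] = int(cols[2])
    return counts

def remote_fraction(per_socket):
    """remote_dram loads / all retired loads, per socket."""
    out = {}
    for socket, ev in per_socket.items():
        total = sum(ev.get(e, 0) for e in EVENTS)
        out[socket] = ev.get("mem_load_l3_miss_retired.remote_dram", 0) / total if total else 0.0
    return out

if __name__ == "__main__":
    import shutil
    if shutil.which("perf"):
        cmd = ["perf", "stat", "-a", "--per-socket", "-x,",
               "-e", ",".join(EVENTS), "--", "sleep", "5"]
        text = subprocess.run(cmd, capture_output=True, text=True).stderr
        for socket, frac in sorted(remote_fraction(parse_per_socket_csv(text)).items()):
            print(f"{socket}: {100 * frac:.1f}% of retired loads served from remote DRAM")
```

Wire the per-socket fractions into a time series and you have the dashboard the paragraph describes; nothing here is exotic, it is just assembly work nobody ships for you.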

Why numastat's counters cannot just be made better: the kernel does not see ordinary loads. The MMU translates a virtual address to a physical address, walks the page tables (or hits the TLB), and issues the load on the memory bus. The kernel is not in the loop. The hardware PMU sees every load but classifying each load by node-of-residency requires the chip's offcore response logic, which counts a small number of programmed events at a time and which different chip generations expose with different unit-masks. There is no dmesg-equivalent flow where the kernel can keep a perfect "loads-by-node" counter without paying a PMU slot per CPU.

The kernel is rebalancing while you measure

The second crack in the wall is AutoNUMA balancing. Linux 3.13 added automatic NUMA balancing (numa_balancing): the kernel periodically samples a process's pages, marks their PTEs with PROT_NONE-style hint bits to provoke a fault on the next access, and uses the faulting CPU to decide whether to migrate the page closer. It is enabled by default in every modern distro kernel (/proc/sys/kernel/numa_balancing = 1) and it is constantly running.

This means the placement you set up at 09:00 is not the placement you have at 09:30. Pages migrate. The hint faults keep resampling which node each thread's accesses come from. The mbind policy you set persists for new allocations, but pages that were placed before the policy can be moved. For a pinned latency-sensitive service this is usually a feature; for an A/B test it is a contamination source you cannot eliminate without turning the kernel feature off, which changes the very thing you are testing.

# autonuma_drift.py — observe AutoNUMA migrating pages under you.
# Allocates a 512 MB buffer on node 0 with first-touch, then runs a
# read-only worker pinned to node 1 for 60 seconds, sampling
# /proc/<pid>/numa_maps each second to see how many pages have moved.
#
# Run as: sudo python3 autonuma_drift.py
# Requires: 2-socket box, libnuma, /proc/sys/kernel/numa_balancing == 1.

import ctypes, os, re, sys, time, threading
from ctypes.util import find_library

libnuma = ctypes.CDLL(find_library("numa") or "libnuma.so.1")
libnuma.numa_available.restype = ctypes.c_int
libnuma.numa_max_node.restype = ctypes.c_int
libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
libnuma.numa_run_on_node.restype = ctypes.c_int

if libnuma.numa_available() < 0:
    sys.exit("libnuma reports no NUMA support on this kernel.")
if libnuma.numa_max_node() < 1:
    sys.exit("Need 2+ NUMA nodes for this benchmark.")

with open("/proc/sys/kernel/numa_balancing") as f:
    bal = f.read().strip()
print(f"numa_balancing = {bal}  ({'on' if bal == '1' else 'off'})")

SIZE = 512 * 1024 * 1024
buf = libnuma.numa_alloc_onnode(SIZE, 0)
if not buf:
    sys.exit("numa_alloc_onnode failed (node 0 out of memory?)")
ctypes.memset(buf, 0xAB, SIZE)              # touch every page; binding keeps them on node 0
print(f"allocated {SIZE//(1<<20)} MB on node 0 at 0x{buf:x}")

stop = threading.Event()
def reader():
    libnuma.numa_run_on_node(1)             # pin to node 1
    discard = (ctypes.c_byte * 4096)()
    while not stop.is_set():
        for off in range(0, SIZE, 4096):
            ctypes.memmove(discard, buf + off, 64)

t = threading.Thread(target=reader); t.start()

pid = os.getpid()
pat = re.compile(r"N(\d+)=(\d+)")
print(f"{'t':>4}  {'N0_pages':>10}  {'N1_pages':>10}  pct_on_N1")
for tick in range(60):
    time.sleep(1)
    n0 = n1 = 0
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            if not line.startswith(f"{buf:x}"):  # numa_maps lines begin with the mapping address
                continue
            for node, count in pat.findall(line):
                if node == "0": n0 += int(count)
                elif node == "1": n1 += int(count)
    total = n0 + n1
    pct = 100.0 * n1 / total if total else 0.0
    print(f"{tick:>4}  {n0:>10,}  {n1:>10,}  {pct:>6.1f}%")

stop.set(); t.join()

A sample run on a 2-socket Sapphire Rapids 8480+ box with numa_balancing=1:

$ sudo python3 autonuma_drift.py
numa_balancing = 1  (on)
allocated 512 MB on node 0 at 0x7f8a3c000000
   t   N0_pages    N1_pages  pct_on_N1
   0    131,072            0     0.0%
   2    131,072            0     0.0%
   5    130,816          256     0.2%
  10    127,488        3,584     2.7%
  20    104,192       26,880    20.5%
  30     71,168       59,904    45.7%
  45     22,016      109,056    83.2%
  60      6,144      124,928    95.3%

What to read here: at t=0 all 131,072 pages (512 MB / 4 KB) sit on node 0, where numa_alloc_onnode put them. Nothing moves for the first few seconds while the balancer's scan catches up. From t≈5 the hint faults raised by the node-1 reader start pulling pages across, and by t=60 over 95% of the buffer lives on node 1. No code in the process asked for any of this; the kernel decided the reader's locality mattered more than the original placement.

numa_balancing is also why benchmarks run on single-socket developer boxes do not predict 2-socket production behaviour. On a single socket, the balancer has nothing to do and never runs; on 2 sockets, it is a constantly-firing background process that interacts with your memory layout. Hotstar's IPL transcoder team learned this in 2024: the lab benchmarks all ran on a 1-socket Threadripper workstation and showed a 12% improvement from a placement change. The same change rolled to 2-socket production servers showed 4% improvement — the rest had been AutoNUMA cleaning up the application's placement automatically.

There is a quieter consequence of AutoNUMA that is worth flagging for any team running stateful processes for hours or days: the migrate_pages calls that AutoNUMA invokes are not free. Each migrated page costs a copy (at DRAM bandwidth) plus a TLB shootdown across the cores that had the page mapped — the shootdown is an IPI to every CPU in the page's address-space mask, and every recipient has to drop into kernel mode, flush part of its TLB, and return. On a busy 2-socket box with 96 cores, a page migration can briefly stall dozens of cores. The aggregate cost is small (typically <1% of CPU) but the latency tail is not — a process that triggers 50 migrations in a single 100 ms window will see a p99.9 spike correlated with the migration burst. Tracing this requires perf record -e migrate:mm_migrate_pages and is a frequent surprise in services where p99.9 matters more than mean.
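
You do not need a tracepoint to see whether a migration burst is happening right now: the kernel exports cumulative counters in /proc/vmstat. A small sampler, assuming the numa_hint_faults and numa_pages_migrated field names present on NUMA-balancing kernels (the script name and loop length are illustrative):

```python
# migration_watch.py — per-second deltas of the kernel's NUMA-balancing counters.
import time

FIELDS = ("numa_hint_faults", "numa_hint_faults_local", "numa_pages_migrated")

def parse_vmstat(text):
    """Keep only the NUMA-balancing counters out of /proc/vmstat."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key in FIELDS:
            stats[key] = int(value)
    return stats

def snapshot():
    try:
        with open("/proc/vmstat") as f:
            return parse_vmstat(f.read())
    except OSError:                # non-Linux or restricted container
        return {}

if __name__ == "__main__":
    prev = snapshot()
    for _ in range(3):             # sample a short window; extend as needed
        time.sleep(1)
        cur = snapshot()
        print("  ".join(f"{k}:+{cur.get(k, 0) - prev.get(k, 0)}" for k in FIELDS))
        prev = cur
```

A p99.9 spike that lines up with a jump in numa_pages_migrated is the signature the paragraph describes; perf record -e migrate:mm_migrate_pages then tells you which process triggered it.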

Containers shuffle the topology between restarts

The third crack: in any container orchestrator that does not have explicit NUMA pinning, the CPU set you got at pod start is not the CPU set the next pod gets after a restart. The kubelet's CPU manager has a static policy that pins guaranteed-QoS pods to specific cores, but the default none policy assigns CPUs from a shared pool and can pick different cores on each restart. The Topology Manager is a separate subsystem and is off unless you turn it on.

The result: an A/B test that compares "before" and "after" code on the same pod template, in different pods, is not comparing the same NUMA topology. The "before" pod might land on cores 0–7 (all socket 0), the "after" pod on cores 0,1,16,17,32,33,48,49 (split across sockets). The performance difference you measure is a mix of the code change and a topology change, and you have no way to disentangle them without re-deploying with the same code on both pods to get a baseline.

[Figure: Same pod template, three restarts, three different CPU sets. Socket 0 holds cores 0–15, socket 1 holds cores 16–31. Restart 1 lands on cores 0,1,2,3 (all socket 0): p99 = 1.4 ms. Restart 2 lands on cores 0,1,16,17 (split across sockets): p99 = 3.1 ms. Restart 3 lands on cores 16,17,18,19 (all socket 1): p99 = 1.5 ms. Restart 2's split placement was the regression, but the deploy was the same code as restarts 1 and 3.]
Without `Guaranteed` QoS class plus CPU Manager `static` policy plus Topology Manager `single-numa-node`, kubelet picks CPU sets from a shared pool and the placement varies. The variance shows up as p99 noise that looks like code-change effects. Illustrative — based on the kubelet `none` CPU policy default.
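Checking which sockets a pod actually landed on takes a few lines. A sketch assuming cgroup v2 (the /sys/fs/cgroup/cpuset.cpus.effective path; cgroup v1 and some runtimes put the file elsewhere) and the standard sysfs per-node cpulist files:

```python
# pod_cpuset_numa.py — map this container's effective CPU set onto NUMA nodes.
import glob, pathlib

def parse_cpulist(s):
    """Expand a kernel cpulist such as '0-3,16,18-19' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def cpus_by_node():
    """node id -> set of CPU ids, read from /sys/devices/system/node/."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node_id = int(path.split("/node")[-1].split("/")[0])
        nodes[node_id] = parse_cpulist(pathlib.Path(path).read_text())
    return nodes

if __name__ == "__main__":
    try:
        mine = parse_cpulist(
            pathlib.Path("/sys/fs/cgroup/cpuset.cpus.effective").read_text())
    except OSError:
        mine = set()
        print("no cgroup-v2 cpuset file found; adjust the path for your runtime")
    for node_id, cpus in sorted(cpus_by_node().items()):
        overlap = sorted(mine & cpus)
        print(f"node {node_id}: {len(overlap)} of my CPUs -> {overlap}")
```

Emit this at pod start-up as a log line or a metric label and the "which restart got the split placement" question becomes answerable after the fact instead of during the incident.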

PhonePe's UPI auth team hit this in late 2024 during a roll-out of a payload-size optimisation. The new code dropped p99 from 38 ms to 31 ms in their staging cluster — a clean 18% win. The same code in production showed p99 ranging from 28 ms to 49 ms across the 200-pod fleet, with the variance correlated to which physical node the pod landed on. Three of the production hosts had recently been added to the cluster and had different CPU pinning behaviour because the underlying kubelet config drifted during a rolling kernel upgrade. The "win" was real on 60% of the fleet and a regression on 8%; on the rest it was within noise. Without per-pod NUMA telemetry the team couldn't see this and almost rolled back a real improvement because the aggregate p99 looked unchanged.

The deeper lesson is that aggregate metrics across a fleet of pods with non-uniform NUMA placement are a kind of statistical lie. The mean is the average of two distributions — well-placed and poorly-placed pods — and the mean does not describe either. The p99 is the 99th percentile of a mixed population and tells you only that some pod somewhere had a bad time; it does not tell you which pods, on which hosts, with which placement. The actionable signal lives in per-pod histograms keyed by host, and those histograms are not what most observability stacks ship by default. Building them takes deliberate effort: a kube_pod_info join in PromQL, a node_name label on the matcher's own latency histogram, and a heatmap visualisation in Grafana that shows latency variance by host. PhonePe's auth team built exactly this dashboard during the 2024 incident's post-mortem. It is now the first thing they look at on any deploy. Not the aggregate p99 — the per-host heatmap.

Production patterns and their pitfalls

Trusting the flamegraph's frame name. A flamegraph generated with perf record -F 99 on a 2-socket box shows you which functions were on-CPU when the sampler fired. It does not tell you whether those functions were on-CPU because they were doing work or because they were stalled waiting for a remote DRAM access. A function that spends 40% of its CPU time stalled on cross-socket loads looks identical, in a flamegraph, to a function that spends 40% of its CPU time doing real arithmetic. To distinguish them you need PMU stall events: perf record -e cycles -e cycle_activity.stalls_l3_miss -F 99 --call-graph dwarf, then look at the per-function ratio of l3-miss-stall cycles to total cycles. Most teams do not do this; they look at the flat flamegraph, optimise the fattest box, and discover that the "hot function" gets faster while the service gets slower because the optimisation moved work to a different cache-miss-heavy function.

Why a stalled cycle and a working cycle are indistinguishable to the sampler: perf record at 99 Hz fires an interrupt 99 times a second per CPU and records the program counter at the interrupt point. The PC is the address of the next instruction to retire — but on an out-of-order machine, an instruction can sit at the head of the retirement window for hundreds of cycles waiting on a load. Every one of those cycles, the PC is the same. The sampler records that PC for as long as the stall lasts. The flamegraph attributes the stall time to that function, the same way it would attribute real compute time. Without a PMU event that fires only on stall cycles, the two are unobservable separately.

Aggregating per-CPU counters with arithmetic mean. perf stat -a (system-wide) and perf stat -p (per-process) report aggregated counters across cores. If half your cores are on socket 0 with a 4% L3 miss rate and the other half are on socket 1 with a 28% L3 miss rate, the aggregate reports a 16% L3 miss rate — a number that does not describe either side and obscures the actual story. The fix is perf stat --per-socket or --per-core, which preserves the per-domain breakdown. Zerodha's matcher debugging session for Aditi's incident finally cracked when someone ran perf stat --per-socket: socket 0 was at IPC 1.8, socket 1 was at IPC 0.4, and the dashboard's IPC 1.1 had been hiding both extremes.
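
The arithmetic of why the aggregate hides both extremes: perf sums the raw counters across domains before dividing, so the aggregate IPC is the cycle-weighted blend, a number neither socket ever exhibited. A worked version of the incident's figures (the per-socket counts are illustrative, chosen to reproduce the 1.8 / 0.4 / 1.1 split from the text):

```python
# aggregate_ipc.py — how perf's counter summing produces an IPC neither socket had.
def aggregate_ipc(per_socket):
    """perf stat sums instructions and cycles across domains, then divides once."""
    instr = sum(s["instructions"] for s in per_socket.values())
    cycles = sum(s["cycles"] for s in per_socket.values())
    return instr / cycles

sockets = {
    0: {"instructions": 1.8e9, "cycles": 1.0e9},   # healthy side: IPC 1.8
    1: {"instructions": 0.4e9, "cycles": 1.0e9},   # remote-stalled side: IPC 0.4
}

if __name__ == "__main__":
    for n, s in sockets.items():
        print(f"socket {n}: IPC {s['instructions'] / s['cycles']:.1f}")
    print(f"aggregate: IPC {aggregate_ipc(sockets):.1f}")  # describes neither socket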

Using time for benchmarks on tickless kernels. The time command reports real (wall-clock), user, and sys. On a kernel built with CONFIG_NO_HZ_FULL=y and an isolated CPU set (isolcpus=...), the scheduler tick is suppressed on the isolated cores, and the user/sys accounting for short-running benchmarks becomes coarse enough to mislead. Worse, on the same kernel, frequency scaling can boost the core for the first few hundred milliseconds and then settle to a lower clock — time reports wall-clock seconds, not the cycle count, so the drift is invisible. The correct measurement for benchmarks is perf stat -e task-clock,cycles,instructions so you see both wall-clock and cycle counts and can spot frequency drift directly.

Treating Prometheus rate counters as if they were instantaneous. A Prometheus rate(numastat_remote_dram_total[1m]) query returns the per-second rate averaged over the last minute. If a NUMA-related spike lasted 8 seconds — say, a market-open burst on the matcher — and you scrape Prometheus on a 15-second interval with a 1-minute rate window, the spike is averaged into a window that is 87% calm and 13% spiked. The rate query reports 13% of the actual spike value, which probably falls below the alerting threshold. The spike never pages anyone. The next morning the dashboard looks fine and nobody investigates. The fix is to drop the rate window to 15-30 seconds for NUMA telemetry specifically, which trades sample-noise for spike-fidelity. The trade-off is a deliberate choice that has to be re-made for each metric.
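
The dilution is pure arithmetic and worth internalising before picking a window. A sketch with a hypothetical 8-second burst at 1000 events/sec (the rates are made up; the 13% figure matches the paragraph above):

```python
# rate_window_dilution.py — what rate(x[window]) reports for a short spike.
def reported_rate(spike_rate, spike_s, window_s, calm_rate=0.0):
    """rate() averages the spike over the whole window, diluting it."""
    return (spike_rate * spike_s + calm_rate * (window_s - spike_s)) / window_s

if __name__ == "__main__":
    peak = 1000.0   # hypothetical remote-DRAM events/sec during an 8 s burst
    for window in (60, 30, 15):
        r = reported_rate(peak, 8, window)
        print(f"rate[{window}s] reports {r:>6.0f}/s ({100 * r / peak:.0f}% of the spike)")
```

Shrinking the window from 60 s to 15 s recovers roughly half the spike's amplitude, which is usually enough to clear an alert threshold; the residual noise is the price.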

Confusing latency with throughput when only throughput improved. A NUMA-aware placement change often improves throughput (more total work done per second) without improving p99 latency, because the bottleneck for tail latency is a different code path — usually a synchronous I/O or a lock contention spike. Teams celebrating a 20% throughput improvement and assuming p99 followed are usually wrong; you have to measure both, separately, and not use throughput as a proxy for latency. This is the bridge to Part 4 (benchmarking) and Part 7 (latency).

The lab-prod gap is structural, not bad luck

There is a comforting story engineers tell themselves: "the lab benchmark and production disagree because production has unpredictable load". The reality is sharper. The lab and production disagree on NUMA-sensitive code because they are different machines running different kernels under different orchestrators with different background work, and every one of those differences is a placement-affecting variable.

The lab box is usually a single-socket developer workstation with numa_balancing=1 (no effect — only one node), transparent_hugepage=madvise, an ondemand governor, no isolated CPUs, and no other workload. The production box is a 2-socket server with numa_balancing=1 (active, migrating pages every minute), transparent_hugepage=always, a performance governor, possibly isolcpus, and 30+ other containers competing for memory bandwidth. The same Python script will produce a different numastat profile, a different perf stat IPC, and a different p99 on these two boxes — and the gap is not because production has "more load", it is because the OS is making different placement decisions in response to different topology and configuration.

Aditi's matcher hit this exactly. The lab benchmark on her single-socket dev workstation showed the new placement code dropping p99 from 1.8 ms to 1.4 ms — a clean win. Production p99 jumped to 4.2 ms because the 2-socket scheduler was bouncing matcher threads between sockets every few hundred milliseconds during the market-open burst, the AutoNUMA daemon was simultaneously trying to migrate pages to follow the threads, and the cumulative effect was page-migration thrash that did not exist on the single-socket lab box. The code was identical. The behaviour was not.

The actionable form of this is a checklist that the team runs before every NUMA-sensitive deploy:

Check                                                      Lab          Prod         Same?
cat /proc/version                                          6.6.30-dev   6.6.31-prod  close enough
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  ondemand     performance  NO
cat /proc/sys/kernel/numa_balancing                        1            1            yes
cat /sys/kernel/mm/transparent_hugepage/enabled            [madvise]    [always]     NO
numactl --hardware (node count)                            1            2            NO
nproc                                                      16           96           NO
Background containers                                      0            28           NO

Five of the seven rows differ. Any one of them can flip a NUMA-sensitive number by 2-3×. A "lab benchmark passes, production regresses" outcome is not surprising; it is the expected result of running on two different machines and pretending they are the same. The fix is not to make the lab match production exactly (impractical) but to measure both, document the gap explicitly, and refuse to claim a production win until the production measurement confirms it.
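
The checklist turns into a script in a few lines: capture the same files on both boxes and diff the snapshots. A sketch under the obvious assumption that the paths exist (CHECKS covers four of the rows; the numactl and nproc rows need a subprocess call and are left out):

```python
# env_snapshot.py — snapshot the NUMA-relevant knobs so lab and prod can be diffed.
import json, pathlib

CHECKS = {
    "kernel": "/proc/version",
    "governor": "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "numa_balancing": "/proc/sys/kernel/numa_balancing",
    "thp": "/sys/kernel/mm/transparent_hugepage/enabled",
}

def snapshot(checks=CHECKS):
    out = {}
    for name, path in checks.items():
        try:
            out[name] = pathlib.Path(path).read_text().strip()
        except OSError:
            out[name] = "unavailable"
    return out

def diff(lab, prod):
    """Rows where the two environments disagree — the NO column of the table."""
    return {k: (lab.get(k), prod.get(k)) for k in lab if lab.get(k) != prod.get(k)}

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))  # save one per box, diff before deploying
```

Run it on both machines, commit the lab snapshot next to the benchmark, and the "is this the same environment?" question becomes a one-line diff instead of an argument.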

The mature pattern adopted by Indian backends doing latency-critical work — Zerodha's matcher, PhonePe's UPI authoriser, Hotstar's transcode farm — is a canary protocol rather than a benchmark protocol. Roll the change to one production pod, measure it for an hour against the rest of the fleet acting as a control group, and only then expand. The protocol does not eliminate the lab-prod gap; it ignores the lab. The lab is a place to verify the code compiles, the tests pass, and nothing obviously wrong happens. The number that matters comes from production, against a control, with the same kernel, the same kubelet, the same neighbours. The lab is for correctness; production is for performance. Treating them as interchangeable was the original sin every team in this list confessed to in the post-mortem that produced the canary protocol they now use.

Going deeper

PMU offcore-response events: the only honest signal

Intel's OFFCORE_RESPONSE_0 and OFFCORE_RESPONSE_1 PMU events let you classify each load by where it was satisfied. The event takes a request-mask (DMND_DATA_RD, DMND_RFO, DMND_IFETCH, etc.) and a response-mask (L3_HIT, L3_MISS, LOCAL_DRAM, REMOTE_DRAM, REMOTE_HITM, etc.). You program the umask to count, e.g., DMND_DATA_RD with REMOTE_DRAM response, and the counter increments once per remote-DRAM-served load. AMD's equivalent is LS_DC_ACCESSES.LOCAL_DRAM and LS_DC_ACCESSES.REMOTE_DRAM. The catch: each PMU has only 4-8 general-purpose counters, and offcore-response events use one of the special slots, so you can program at most 1-2 offcore breakdowns concurrently. Production telemetry uses perf record -e ... rotating through event groups, sampling each group for a fraction of a second, then aggregating — Brendan Gregg's numa-loaded script demonstrates the pattern. The output gives you per-process, per-second, per-node load distribution. This is the signal to trust; everything else is a proxy.

perf c2c: the cross-socket cache-line tracker

perf c2c record and perf c2c report are an underused tool that uses the PMU's HITM (Hit Modified) event to find cache lines that are bouncing between cores under coherence traffic. The output shows, per cache line, how many remote-modified hits each core saw — which is the signature of false sharing across sockets. For Aditi's matcher, perf c2c would have shown a hot 64-byte line in the order-book index that was being modified on socket 0 and read on socket 1 thousands of times per second, which is exactly the cross-socket-coherence pattern that numastat cannot see. The tool is finicky (kernel needs CONFIG_HW_BREAKPOINT, root or perf_event_paranoid <= 0, and the report parser is slow on 2 GB perf.data files), but it is the only tool that surfaces coherence traffic by code location. Production debug runs use it with a 5-10 second sample window during the incident, then a long offline analysis pass.

eBPF for ad-hoc per-load breakdown

The bcc numa_top.bt script (bpftrace-based) attaches a tracepoint to sched_switch and on every context-switch logs the migrating thread's PID, source CPU, and dest CPU, with a per-CPU histogram of "how many cross-socket migrations per second". This reveals scheduler-induced placement thrash that no other tool catches. The corresponding load-classifier (mem_load.bt) uses Intel PEBS to sample memory loads and classify by latency bucket, giving you a real-time histogram of loads-per-second by latency. The eBPF approach has overhead (1-3% of CPU per probe, more if probes fire frequently) and requires a recent kernel (5.4+ for stable BPF tracepoints), but it scales to per-process tracing of running production services. For one-off NUMA debugging, eBPF is the fastest path from "something is wrong" to "here is the line of code".

The pattern that has worked for Indian backends doing serious NUMA work is to keep a small library of pre-written bpftrace scripts (numa_migrations.bt, cross_socket_loads.bt, mempolicy_drift.bt) checked into the same repo as the service code, with a make trace target that runs them against a running pod. The first time you need them is a 02:00 incident; that is not the moment to be writing eBPF for the first time. Pre-loading the toolkit during calm engineering weeks turns a 4-hour debug session into a 20-minute one. The Hotstar performance team's internal repo has 23 such scripts; the public bpftrace tools/ directory has another 60. Steal liberally from both.

The reproducibility ladder

For an A/B test on a NUMA service to produce a number you can trust, climb the ladder: (1) Same kernel version and config (/proc/version, /proc/config.gz if exposed). (2) Same numa_balancing setting (write 0 to disable, or a fixed scan period). (3) Same CPU isolation (isolcpus, nohz_full, rcu_nocbs if used). (4) Same CPU frequency scaling governor (performance for benchmarks, never ondemand). (5) Same THP setting (/sys/kernel/mm/transparent_hugepage/enabled set to never or always, never madvise for benchmarks). (6) Same PMU isolation (no other process running perf). (7) Identical taskset or sched_setaffinity for the test process. Skip any rung and your A/B numbers carry that rung's variance baked in. Most teams skip rungs 2, 5, and 6, then wonder why their benchmarks are noisy.

Reproduce this on your laptop

# 2-socket box recommended; single-socket boxes will not show drift.
sudo apt install numactl libnuma-dev linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
# numa_balancing knob and PMU access:
echo 1 | sudo tee /proc/sys/kernel/numa_balancing       # ensure on
sudo sysctl kernel.perf_event_paranoid=0                # PMU access

sudo python3 autonuma_drift.py                          # see migration in real time

# Show the offcore-response truth your dashboard hides:
sudo perf stat -a --per-socket -e cycles,instructions,\
mem_load_l3_miss_retired.remote_dram \
  -- python3 -c "import time; time.sleep(5)"

# Find cross-socket coherence traffic by code location:
sudo perf c2c record -- python3 your_workload.py
sudo perf c2c report --stdio | head -60

# Disable autonuma for a clean A/B:
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

Where this leads next

This is the wall that closes Part 3. Five chapters of NUMA tooling — numactl, topology discovery, allocators, sharded data structures — have given you the levers to do the work. This chapter has shown why doing the work is the easy half. The hard half is knowing whether you actually did it, in production, on a kernel that is rebalancing under you, in a container that gets a different topology each restart.

Each Part of this curriculum closes with a wall that frames the limit of the tools the Part just taught. Part 2 closed with "CPUs are fast; memory is not": you can optimise compute as much as you like, but DRAM latency is the floor under everything. Part 3 closes with this chapter's wall: you can optimise placement as much as you like, but if you cannot reliably measure placement in production, you cannot tell whether your optimisation worked. The two walls compose. The Part-2 wall says compute speed is bounded by memory; the Part-3 wall says memory placement is bounded by your ability to observe it. Together they motivate Parts 4-7 — every one of those Parts is a different attack on the observability problem this chapter framed.

Part 4 (Benchmarking without lying) is the direct response. It teaches the methodology to measure anything — NUMA effects, allocator overhead, kernel-vs-userspace cost — without the foot-guns this chapter just enumerated.

The deeper habit, carrying forward: every claim about performance must come with a measurement, and every measurement must come with the conditions under which it was taken. "p99 dropped from 4.2 ms to 1.4 ms" is not a claim — it is half of one. The full claim is "p99 dropped from 4.2 ms to 1.4 ms on a 2-socket Sapphire Rapids 8480+ box with numa_balancing=0, governor=performance, THP=never, kernel 6.6.30, measured with wrk2 -R 50000 -t 16 -c 256 -d 60s and HdrHistogram-corrected percentiles." The conditions are the claim.

Aditi eventually found her matcher's regression. The perf c2c report output identified a 64-byte cache line in the order-book index that was being modified on socket 0 (the leader thread) and read on socket 1 (the follower thread). The fix was to pad the index entries to a full cache line and explicitly route reads through a per-socket replica. p99 dropped to 1.3 ms in production — confirmed by the per-host heatmap, not the aggregate dashboard. The fix took 40 lines of C++. Finding the bug took two weeks. The week-to-line ratio is the shape of NUMA work: the mechanism is small, the measurement is large, and the team that has internalised the measurement discipline is the team that ships the win.

The matcher post-mortem ended with one new piece of internal tooling and one new piece of internal culture. The tooling was the per-host heatmap dashboard, now mandatory for any deploy touching the matching engine's hot path. The culture was a single line in the engineering handbook: "We do not claim a NUMA win until production confirms it, on a control group of unchanged pods, on the same kernel, on the same kubelet config, with the same neighbours." The line is not exciting. It is the discipline that turns the tools in this Part into outcomes.

The conditions are the claim, the dashboard lies, and the only honest signal is the one you have to assemble yourself out of PMU events that nobody ships by default. That is the wall — and that is exactly the wall the next Part is built to climb.

References

  1. Andi Kleen, "An NUMA API for Linux" (Novell whitepaper, 2005) — the design rationale for mbind, set_mempolicy, and the libnuma userspace API; still the clearest explanation of what the counters mean.
  2. Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 7 — Memory — the production-grade reading list for numastat, perf c2c, and the offcore-response events.
  3. Mel Gorman, "Automatic NUMA balancing in the Linux kernel" (LWN, 2014) — the kernel-side explanation of how AutoNUMA samples, faults, and migrates; reading this is the difference between "the daemon is magic" and "the daemon is a sampling page-fault handler".
  4. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, Chapter 19 — Performance Monitoring — the authoritative description of OFFCORE_RESPONSE events; the umask tables in Section 19.6.x are what perf list is rendering.
  5. Joe Mario, "C2C — False Sharing Detection in Linux Perf" (Linux Plumbers, 2016) — the canonical introduction to perf c2c from the author of the tool; explains the HITM event and how to read the report.
  6. Kubernetes, "Control Topology Management Policies on a Node" — the orchestrator-level documentation for the policies that determine whether your container even gets deterministic NUMA placement.
  7. Gil Tene, "How NOT to Measure Latency" (Strange Loop, 2015) — the talk that invented "coordinated omission" as a term; the methodology applies to NUMA benchmarks the same way it applies to load tests.
  8. /wiki/numa-aware-allocators-and-data-structures — the previous chapter; the levers this chapter argues you cannot trust without the right measurements.