numactl and memory binding
Karan at PhonePe was rolling out a new fraud-scoring model on the UPI authoriser fleet. The model held a 64 GB feature index in shared memory; benchmarks on a single c7a.metal-48xl box showed p99 of 4.2 ms. Production rolled to 30 nodes overnight; p99 climbed to 11.8 ms with no other change. Twelve hours of grepping perf record later, an SRE noticed the systemd unit on production used numactl --cpunodebind=0,1,2,3 but had quietly dropped the --membind flag during a refactor. The threads were pinned to socket 0, but the 64 GB index — mmap-ed at startup before the pin — had landed wherever the first-touch allocator put it, which on the new SKU was uniformly across all 8 nodes. Every feature lookup was a 50/50 cross-socket access. The fix was one flag. Finding the missing flag took 12 hours, because nobody was reading the kernel's numa_maps for the running process.
numactl is a thin user-space wrapper around the mbind, set_mempolicy, and sched_setaffinity syscalls. Its four binding modes — --membind, --preferred, --interleave, and --cpunodebind — answer two different questions: "which CPUs may this process run on?" and "which NUMA nodes may its pages live on?". You almost always need both, and getting one without the other is the single most common NUMA misconfiguration in production. Read /proc/<pid>/numa_maps to verify what actually happened, not what you asked for.
The four flags and what they actually do
numactl ships with six binding flags that look similar and are not interchangeable. Confusing them is the bug at the top of this chapter. The clean way to remember them is that two flags pin CPUs (--cpunodebind, --physcpubind) and four flags pin memory (--membind, --preferred, --interleave, --localalloc). You combine one from each side to get a fully-pinned process.
| Flag | What it pins | Strict? | When to reach for it |
|---|---|---|---|
| `--cpunodebind=N[,M]` | The process's CPU affinity to all CPUs in NUMA node N (and M, ...) | Yes — the scheduler will never place the process on CPUs outside those nodes | Default for "run on this socket / chiplet" |
| `--physcpubind=0-11` | CPU affinity to a literal CPU list | Yes | When you want specific cores, not whole nodes |
| `--membind=N[,M]` | Memory allocations may only come from these nodes | Yes — OOM rather than spill | Latency-critical; you would rather fail than fall back to remote |
| `--preferred=N` | Allocator tries node N first, falls back transparently | No | Most workloads — gives you locality with a safety net |
| `--interleave=N,M[,...]` | Each new page round-robins among the listed nodes | No | Bandwidth-bound workloads that benefit from spreading |
| `--localalloc` | Allocate on whatever node the touching CPU is on (the kernel's default policy) | No | Explicit "first-touch" — useful when threads are pre-pinned |
The two CPU flags are mutually exclusive with each other; the four memory flags are mutually exclusive with each other; but you pick one CPU flag and one memory flag together. numactl --cpunodebind=0 --membind=0 ./my-service is the canonical incantation for "pin this whole process to socket 0, refuse to allocate elsewhere".
`--membind` is strict; `--preferred` is soft; `--interleave` spreads; `--localalloc` follows the toucher. Illustrative — based on the Linux `mbind(2)` man page semantics.

Why strict vs soft matters in production: a Razorpay matcher pinned with `--membind=0` will OOM and crash if the JVM's heap grows past node 0's free memory — even if 200 GB is sitting idle on node 1. That sounds bad, until you consider the alternative: a `--preferred=0` matcher silently spills to node 1, p99 climbs from 4 ms to 9 ms because half the heap is now remote, and nobody notices until the SLO breach pages the on-call. Strict gives you a loud failure; soft gives you a quiet regression. For latency-critical paths, the loud failure is the better signal.
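A sketch of what the canonical incantation does before exec'ing the service, reproduced with the stdlib plus one raw syscall. The x86_64 syscall number, the choice of node 0, and the commented-out service path are assumptions; production code would use libnuma.

```python
# pin_then_exec.py
# Sketch of what `numactl --cpunodebind=0 --membind=0 ./my-service` does
# before exec'ing the service: set CPU affinity, set the memory policy, exec.
# Syscall 238 is x86_64-specific; the service path below is hypothetical.
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MPOL_BIND = 2
NR_SET_MEMPOLICY = 238  # x86_64; use libnuma for portable code

def cpus_of_node(n: int) -> set[int]:
    """Parse /sys .../cpulist ('0-11,24-35') into a set of CPU ids."""
    path = f"/sys/devices/system/node/node{n}/cpulist"
    if not os.path.exists(path):          # non-NUMA kernel: keep current set
        return set(os.sched_getaffinity(0))
    cpus: set[int] = set()
    for chunk in open(path).read().strip().split(","):
        if "-" in chunk:
            lo, hi = (int(x) for x in chunk.split("-"))
            cpus.update(range(lo, hi + 1))
        elif chunk:
            cpus.add(int(chunk))
    return cpus

node = 0
os.sched_setaffinity(0, cpus_of_node(node))        # --cpunodebind=0
mask = ctypes.c_ulong(1 << node)
rc = libc.syscall(NR_SET_MEMPOLICY, MPOL_BIND,     # --membind=0
                  ctypes.byref(mask), ctypes.c_ulong(64))
policy_ok = (rc == 0)  # may be False under restrictive seccomp profiles
# Both settings survive exec, so the service would inherit them:
# os.execvp("/opt/my-service", ["/opt/my-service"])
print("cpu affinity:", sorted(os.sched_getaffinity(0))[:8],
      "| membind ok:", policy_ok)
```

Both the affinity and the memory policy are inherited across exec, which is the whole trick behind numactl being a launcher rather than a daemon.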
What numactl is actually doing — the syscall view
numactl is roughly 800 lines of C wrapping three Linux syscalls: set_mempolicy, mbind, and sched_setaffinity. The kernel does the work; numactl just translates flags into syscall arguments. Knowing this lets you replace numactl with code when you need finer control — per-thread policies, per-allocation policies, or runtime policy changes.
```python
# numactl_under_the_hood.py
# Demonstrates that numactl is a thin wrapper. We invoke set_mempolicy
# and mbind directly via ctypes, then check that /proc/self/numa_maps
# shows the allocation landing on the node we asked for.
#
# Run as: numactl --hardware first to see your topology, then:
#   python3 numactl_under_the_hood.py
# Requires Linux with CONFIG_NUMA. Works on a 1-node laptop too
# (the policy is set, the placement is trivially satisfied).
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Constants from /usr/include/linux/mempolicy.h
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE = 0, 1, 2, 3
MPOL_MF_STRICT, MPOL_MF_MOVE = 1, 2

PAGE = os.sysconf("SC_PAGE_SIZE")   # 4096 on x86, often 64K on ARM
SIZE = 256 * 1024 * 1024            # 256 MiB allocation

def discover_nodes():
    """Read /sys to find which NUMA nodes exist."""
    if not os.path.isdir("/sys/devices/system/node"):
        return [0]
    return sorted(int(d.removeprefix("node"))
                  for d in os.listdir("/sys/devices/system/node")
                  if d.startswith("node") and d[4:].isdigit())

def bind_to_node(addr: int, length: int, node: int):
    """Call mbind(addr, length, MPOL_BIND, &mask, maxnode, MPOL_MF_STRICT)."""
    nmask = ctypes.c_ulong(1 << node)
    rc = libc.syscall(237,  # __NR_mbind on x86_64
                      ctypes.c_void_p(addr), ctypes.c_size_t(length),
                      ctypes.c_int(MPOL_BIND),
                      ctypes.byref(nmask), ctypes.c_ulong(64),
                      ctypes.c_uint(MPOL_MF_STRICT))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"mbind failed: {os.strerror(err)}")

nodes = discover_nodes()
target = nodes[-1] if len(nodes) > 1 else 0  # last node, or only node
print(f"Available NUMA nodes: {nodes}; binding to node {target}")

buf = mmap.mmap(-1, SIZE, prot=mmap.PROT_READ | mmap.PROT_WRITE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
bind_to_node(addr, SIZE, target)

# Touch every page so the kernel actually allocates it.
view = (ctypes.c_char * SIZE).from_buffer(buf)
for off in range(0, SIZE, PAGE):
    view[off] = 0x55

# Now /proc/self/numa_maps reports per-mapping placement.
with open(f"/proc/{os.getpid()}/numa_maps") as f:
    for line in f:
        if "anon" in line and "bind" in line:
            print(line.strip())
            break
```
A sample run on a 2-socket EPYC 9654 (NPS=4, 8 nodes):
```
$ python3 numactl_under_the_hood.py
Available NUMA nodes: [0, 1, 2, 3, 4, 5, 6, 7]; binding to node 7
7f3a4e000000 bind:7 anon=65536 dirty=65536 active=0 N7=65536 kernelpagesize_kB=4
```
The line that matters is N7=65536: every one of the 65,536 4 KiB pages we touched landed on node 7, exactly as mbind(MPOL_BIND, mask=0x80, MPOL_MF_STRICT) requested. The bind:7 token confirms the policy attached to the VMA. MPOL_BIND is the strict part for future faults: if node 7 fills up, the allocation fails rather than spilling, and the kernel's OOM killer takes the process down (visible in dmesg) even while other nodes sit half-empty. MPOL_MF_STRICT adds a separate guarantee for pages that already existed in the range when mbind was called: the call itself fails with EIO if any of them sit on a node outside the mask.
A walkthrough of the parts that matter most:

- `libc.syscall(237, ...)` invokes `mbind` directly. The syscall number 237 is x86_64-specific (it's 235 on aarch64); production code uses `libnuma`'s wrapper. We go raw here to make the kernel-interface boundary visible.
- `MPOL_MF_STRICT` is the flag that makes the difference between "I'd like" and "I demand" for pages that already exist in the range: with it, `mbind` fails with `EIO` if any already-faulted page sits on a node outside the mask; without it, `mbind` returns success and pre-existing pages stay where they were. Pages faulted in after the call are governed by the mode itself, and `MPOL_BIND` is strict either way.
- Touch every page in the loop because Linux is lazy: `mmap` reserves a virtual address range, but physical pages are only allocated on first write (the page-fault path). Until you touch a page, `numa_maps` won't show it. Why this trips up benchmarks: an `mmap`-then-`numactl` sequence that doesn't touch every page produces misleading `numastat` output. The pages aren't on the wrong node; they aren't anywhere yet. Many "NUMA tuning had no effect" bug reports trace back to this — the workload allocates a 64 GB region, runs for 10 seconds, only touches 8 GB of it, and concludes that NUMA placement doesn't matter.
- `/proc/<pid>/numa_maps` is the truth-source. `numastat -p <pid>` aggregates it; `pmap -X <pid>` adds the VMA names. Whenever a NUMA debug session starts, the first command is `cat /proc/<pid>/numa_maps | head -30` — it tells you what the kernel actually did, which is often not what your `numactl` flags asked for.
The set_mempolicy syscall does the same thing as mbind but applies the policy to future allocations of the calling thread, not a specific address range. numactl --membind=0 ./prog calls set_mempolicy(MPOL_BIND, mask=0x1) once before exec-ing ./prog, so every allocation ./prog makes inherits the bind policy. A multi-threaded program that wants per-thread policies (one thread on node 0, another on node 1) calls set_mempolicy itself from each thread after creation.
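As a sketch of that per-thread pattern, here is set_mempolicy called from inside a worker thread via ctypes. The x86_64 syscall number and the 8 MiB touch size are illustrative; libnuma's numa_set_membind() is the production route, and the sketch records failure instead of crashing where the syscall is filtered.

```python
# per_thread_policy.py
# Each worker thread calls set_mempolicy for itself; the policy is per-thread,
# so two threads in one process can allocate from different nodes.
# Syscall 238 is x86_64-specific.
import ctypes, ctypes.util, os, threading

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MPOL_BIND = 2
NR_SET_MEMPOLICY = 238  # x86_64

results = {}

def worker(node: int):
    mask = ctypes.c_ulong(1 << node)
    rc = libc.syscall(NR_SET_MEMPOLICY, MPOL_BIND,
                      ctypes.byref(mask), ctypes.c_ulong(64))
    if rc != 0:
        results[node] = f"set_mempolicy failed: {os.strerror(ctypes.get_errno())}"
        return
    # Everything this thread allocates AND first-touches from here on
    # comes from `node`; other threads in the process are unaffected.
    buf = bytearray(8 * 1024 * 1024)
    buf[::4096] = b"\x01" * len(buf[::4096])  # touch every page
    results[node] = "ok"

t = threading.Thread(target=worker, args=(0,))  # node 0 exists everywhere
t.start(); t.join()
print(results)
```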
Where the placement actually happens — first-touch vs interleave
NUMA placement decisions in Linux happen at page-fault time, not at malloc time, not at mmap time. This is the most under-appreciated detail of NUMA tuning. Your application calls malloc(64 * GB) on thread A pinned to node 0; the allocator reserves 64 GB of virtual address space; nothing is on any node yet. Thread B, pinned to node 1, then iterates over the buffer initialising it. Every page faults on thread B's CPU. Under the default MPOL_DEFAULT policy (which is first-touch / local-alloc), every page lands on node 1, not node 0.
```python
# first_touch_demo.py
# Demonstrate that NUMA placement depends on the thread that first touches
# each page, not the thread that allocated. This is the root cause of the
# "I pinned my service but it's still slow" class of bug.
import ctypes, mmap, os, subprocess, threading

PAGE = os.sysconf("SC_PAGE_SIZE")
SIZE = 64 * 1024 * 1024  # 64 MiB

# Allocate; do NOT touch (so no pages are placed yet).
buf = mmap.mmap(-1, SIZE, prot=mmap.PROT_READ | mmap.PROT_WRITE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
view = (ctypes.c_char * SIZE).from_buffer(buf)

# Pick two CPUs on different NUMA nodes if available.
def cpus_for_node(n):
    p = f"/sys/devices/system/node/node{n}/cpulist"
    if not os.path.exists(p):
        return []
    raw = open(p).read().strip()
    out = []
    for chunk in raw.split(","):
        if "-" in chunk:
            lo, hi = (int(x) for x in chunk.split("-"))
            out.extend(range(lo, hi + 1))
        else:
            out.append(int(chunk))
    return out

nodes = sorted(int(d[4:]) for d in os.listdir("/sys/devices/system/node")
               if d.startswith("node") and d[4:].isdigit())
cpu_a = cpus_for_node(nodes[0])[0]
cpu_b = cpus_for_node(nodes[-1])[0] if len(nodes) > 1 else cpu_a

def toucher(start, end, cpu):
    os.sched_setaffinity(0, {cpu})   # pid 0 = the calling thread
    for off in range(start, end, PAGE):
        view[off] = 0x42

# Half the buffer touched on node 0's CPU, half on node N's CPU.
half = SIZE // 2
ta = threading.Thread(target=toucher, args=(0, half, cpu_a))
tb = threading.Thread(target=toucher, args=(half, SIZE, cpu_b))
ta.start(); tb.start(); ta.join(); tb.join()

# Snapshot the placement.
out = subprocess.check_output(["numastat", "-p", str(os.getpid())]).decode()
print(out)
```
Sample run on a 2-node laptop (nodes 0 and 1):
```
$ python3 first_touch_demo.py
Per-node process memory usage (in MBs) for PID 14782 (python3)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        5.21            0.00            5.21
Stack                       0.04            0.00            0.04
Private                    32.13           32.10           64.23
---------------- --------------- --------------- ---------------
Total                      37.38           32.10           69.48
```
The Private row shows 32 MiB on node 0 and 32 MiB on node 1, exactly mirroring the work-split in our two threads. The mmap happened on whichever CPU the main thread was running on, but the kernel placed each page when it was first written, on whichever node that CPU belonged to. This is first-touch — the kernel default — and it's both useful and dangerous.
It's useful because it means a well-pinned worker pool gets local memory automatically: each thread initialises its own region, each region lands on the thread's node, and subsequent accesses are local. It's dangerous because the allocator is not the toucher in many real codebases — JVM heap is allocated by a heap-management thread that may not be on the right node; numpy arrays are zero-filled by numpy.zeros on the calling thread, but accessed by worker threads later; mmap-backed shared memory is touched once by the loader process and then read by everyone forever. In all three cases, locality depends on a coincidence of who-touched-first.
The two ways out are explicit binding (use mbind or numactl --membind=N to force placement regardless of toucher) or interleave (use MPOL_INTERLEAVE to spread pages round-robin across nodes, accepting average-case latency in exchange for predictability).
The interleave alternative trades a bit of best-case latency for a lot of worst-case predictability. numactl --interleave=all round-robins each page among all NUMA nodes; on a 4-node box, page 0 lands on node 0, page 1 on node 1, page 2 on node 2, page 3 on node 3, page 4 on node 0, and so on. Local accesses are now ~25% (1-in-4 chance the toucher is on the page's node); remote accesses are ~75%; but the bandwidth budget is now the sum of all four memory controllers, not just one. For bandwidth-bound workloads — HPC stencils, TensorFlow training graphs, FFmpeg encoders — interleave often beats local-binding because the workload was bottlenecked on one node's DRAM channels, not on remote-access latency.
Why interleave wins for bandwidth-bound code: a single DDR5-4800 channel delivers ~38 GB/s peak; one EPYC chiplet has 3 channels, so ~115 GB/s of theoretical bandwidth. A workload that streams 200 GB/s through memory (a common shape for FP16 training kernels on a CPU fallback, or for FFmpeg's HEVC encoder during 4K transcodes) cannot be served by one node — it saturates that node's controllers and stalls. Interleaving across 8 nodes pools 8 × 115 = 920 GB/s of bandwidth, and the 30 ns of average extra latency per page is hidden by the workload's own deep memory pipeline. Latency-bound code (a hash-table probe, a B-tree walk) sees the opposite: 30 ns extra per access dominates, because the workload was never bandwidth-bound to begin with. The right policy depends on which side of the latency-vs-bandwidth boundary your workload sits on, and perf stat -e cycle_activity.stalls_l3_miss,offcore_response.demand_data_rd.l3_miss.local_dram is the measurement that tells you.
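To watch the interleave policy land, the mbind demo from earlier can be rerun with MPOL_INTERLEAVE across every online node. A sketch under the same assumptions as numactl_under_the_hood.py (x86_64 syscall number, raw ctypes); it degrades gracefully where the syscall is filtered.

```python
# interleave_demo.py
# Same raw-syscall approach as numactl_under_the_hood.py, but with
# MPOL_INTERLEAVE across every online node. Syscall 237 is x86_64-specific.
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MPOL_INTERLEAVE = 3
NR_MBIND = 237  # x86_64
SIZE = 32 * 1024 * 1024

def all_nodes_mask() -> int:
    base = "/sys/devices/system/node"
    if not os.path.isdir(base):
        return 1
    mask = 0
    for d in os.listdir(base):
        if d.startswith("node") and d[4:].isdigit():
            mask |= 1 << int(d[4:])
    return mask

buf = mmap.mmap(-1, SIZE, prot=mmap.PROT_READ | mmap.PROT_WRITE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
nmask = ctypes.c_ulong(all_nodes_mask())
rc = libc.syscall(NR_MBIND, ctypes.c_void_p(addr), ctypes.c_size_t(SIZE),
                  ctypes.c_int(MPOL_INTERLEAVE), ctypes.byref(nmask),
                  ctypes.c_ulong(64), ctypes.c_uint(0))
policy_line = ""
if rc == 0:
    view = (ctypes.c_char * SIZE).from_buffer(buf)
    for off in range(0, SIZE, os.sysconf("SC_PAGE_SIZE")):
        view[off] = 1  # fault the page in; placement round-robins per page
    with open("/proc/self/numa_maps") as f:
        for line in f:
            if line.startswith(f"{addr:x} "):
                policy_line = line.strip()
                break
    print(policy_line)
else:
    print("mbind unavailable here:", os.strerror(ctypes.get_errno()))
```

On a multi-node box the VMA's line shows an `interleave:0-7`-style policy token and roughly equal `N<node>=` page counts; on a 1-node laptop it collapses to `interleave:0`.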
Production traps and the verification ladder
Three classes of bug recur often enough that experienced operators check for them by reflex.
Trap 1: pinning the threads but not the memory. Karan's incident at the top of this chapter. The systemd unit's ExecStart=numactl --cpunodebind=0,1,2,3 /opt/phonepe/scorer pins CPUs but leaves memory under default first-touch policy. If the loader process touches the 64 GB index before the worker threads pin themselves, the index lands wherever the loader was running. Verify with numastat -p $(pgrep scorer): if the per-node distribution doesn't match your binding, the binding isn't binding what you think.
Trap 2: pinning the parent but not the children. A bash script numactl --cpunodebind=0 ./service.sh binds the bash interpreter, not the Java process the script eventually exec's. Most modern services inherit policies through exec, but several don't — JVMs that re-exec themselves with different flags (e.g. CRaC, GraalVM AOT runners), Python services that spawn worker subprocesses with multiprocessing.Process, Go services that runtime-fork into a child with LockOSThread. Verify by reading /proc/<child-pid>/numa_maps directly, not the parent's.
Trap 3: pinning to the wrong nodes after a BIOS change. This is the previous chapter's territory (NUMA topology discovery) revisited from the binding side. A pin-mask --cpunodebind=0,1 written for a 2-node box pins to one chiplet's worth of CPUs after a BIOS flip to NPS=4 (8 nodes). Verify by re-running numactl --hardware on every deploy and asserting the node count matches the pin-mask's assumption.
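A minimal deploy-time guard for trap 3 might look like this; EXPECTED_NODES is illustrative and would come from the unit file's pin-mask in a real pipeline.

```python
# assert_topology.py
# Fail the deploy if the machine's online NUMA nodes don't cover the nodes
# the pin-mask was written for. EXPECTED_NODES is illustrative; a real
# pipeline would derive it from the systemd unit's --cpunodebind argument.
import os

EXPECTED_NODES = {0}  # e.g. {0, 1} for a unit pinned with --cpunodebind=0,1

def online_nodes() -> set[int]:
    base = "/sys/devices/system/node"
    if not os.path.isdir(base):
        return {0}  # non-NUMA kernel: everything is node 0
    return {int(d[4:]) for d in os.listdir(base)
            if d.startswith("node") and d[4:].isdigit()}

actual = online_nodes()
ok = EXPECTED_NODES <= actual  # every pinned node must actually exist
print(f"pin-mask nodes {sorted(EXPECTED_NODES)}, online {sorted(actual)}: "
      + ("OK" if ok else "MISMATCH - fail the rollout"))
```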
```python
# verify_pinning.py
# Run after starting your service; reads /proc/<pid>/numa_maps and reports
# whether the running placement matches the binding intent.
import argparse, re, sys

def read_numa_maps(pid: int):
    """Yield (policy, per_node_pages) for each VMA in /proc/PID/numa_maps."""
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            policy = parts[1]  # 'default', 'bind:0', 'interleave:0-3', ...
            per_node = {int(m.group(1)): int(m.group(2))
                        for m in re.finditer(r"N(\d+)=(\d+)", line)}
            yield policy, per_node

def main(pid: int, expected_nodes: set[int]):
    total = {}
    for policy, per_node in read_numa_maps(pid):
        for n, p in per_node.items():
            total[n] = total.get(n, 0) + p
    grand = sum(total.values()) or 1
    # Page counts are converted at 4 KiB/page; adjust for hugepage mappings.
    print(f"PID {pid}: {grand * 4 // 1024} MiB resident across nodes:")
    for n in sorted(total):
        pct = 100 * total[n] / grand
        marker = "OK " if n in expected_nodes else "OFF"
        print(f"  [{marker}] node {n}: {total[n] * 4 // 1024:>6} MiB ({pct:5.1f}%)")
    leak = sum(p for n, p in total.items() if n not in expected_nodes)
    if leak:
        print(f"\nFAIL: {leak * 4 // 1024} MiB leaked outside expected nodes {sorted(expected_nodes)}")
        sys.exit(1)
    print(f"\nOK: all memory within expected nodes {sorted(expected_nodes)}")

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("pid", type=int)
    ap.add_argument("--nodes", required=True, help="comma-list, e.g. 0,1")
    args = ap.parse_args()
    main(args.pid, {int(x) for x in args.nodes.split(",")})
```
Sample run after starting the scorer pinned to nodes 0–3:
```
$ python3 verify_pinning.py $(pgrep scorer) --nodes 0,1,2,3
PID 18342: 65532 MiB resident across nodes:
  [OK ] node 0:  16380 MiB ( 25.0%)
  [OK ] node 1:  16384 MiB ( 25.0%)
  [OK ] node 2:  16382 MiB ( 25.0%)
  [OK ] node 3:  16386 MiB ( 25.0%)

OK: all memory within expected nodes [0, 1, 2, 3]
```
Run this script as a healthcheck after every deploy. If it ever prints FAIL, the deploy violates the binding contract — fail the rollout, alert the on-call, do not let traffic onto the node. Razorpay's deploy pipeline runs this check on a 5 % canary for 60 seconds before promoting; the check has caught two production incidents (a JVM that re-exec'd itself losing its policy, and a sidecar that touched the parent's mmap region from the wrong node) before they reached the rest of the fleet.
Common confusions
- "`numactl --cpunodebind=0` pins memory too." No. `--cpunodebind` only sets CPU affinity. The default memory policy (`MPOL_DEFAULT`, first-touch) still applies, and pages land on whichever node first touches them — which is usually but not always node 0. Always pair with `--membind=0` or `--preferred=0` to control memory placement explicitly.
- "`--membind` and `--preferred` are interchangeable." They are not. `--membind` is strict — exhausting the bound nodes causes an OOM kill, not silent spillover. `--preferred` is soft — the kernel tries the preferred node first and falls back to others when full. Pick strict for latency-critical services where the loud failure is the right signal; pick preferred when the workload has a known hot region (`--preferred=0`) but tolerates spillover for cold data.
- "`--interleave=all` is for everyone." Only for bandwidth-bound workloads. A latency-bound workload (small working set, mostly L3-resident) interleaved across 8 nodes pays cross-socket latency on 7-of-8 accesses for no benefit — its bottleneck wasn't bandwidth. Use interleave when `perf stat -e LLC-load-misses` shows you're DRAM-bandwidth-saturated on one node, not when your code has the wrong constant factors.
- "Once `numactl` runs, the policy is permanent." A child process can override the inherited policy by calling `set_mempolicy` itself; some library code (older versions of jemalloc with `narenas=auto`, the JVM with `-XX:+UseNUMA`) does exactly this on startup. Verify with `numastat -p <pid>` after the service has been running for 30 seconds, not at start time.
- "`numactl` works on cgroup-confined containers the same way." Mostly, but the cgroup's `cpuset.mems` and `cpuset.cpus` constrain what the kernel will allow. If your container's cgroup is restricted to nodes 0–1 and your `numactl --membind=2,3` request asks for nodes the cgroup forbids, the syscall returns `EINVAL` and `numactl` fails to launch the process. Check `/sys/fs/cgroup/cpuset.mems.effective` from inside the container before troubleshooting binding flags.
- "The kernel's auto-NUMA balancer (`numa_balancing=1`) makes manual binding unnecessary." It does help, but it works on a 100 ms+ horizon and migrates pages reactively after observing remote accesses. For a request-response service with a p99 SLO of 4 ms, the migration overhead itself becomes a tail-latency source. Manual binding gives you predictable placement from second 1; auto-balancing gives you "eventual locality" with measurement noise. For HPC and batch workloads, auto-balancing is fine. For low-latency services, set `numa_balancing=0` and bind manually.
Going deeper
mbind vs set_mempolicy vs move_pages — three ways to move memory
mbind(addr, len, mode, nodemask, maxnode, flags) attaches a policy to a specific virtual address range and may move existing pages if MPOL_MF_MOVE is set. set_mempolicy(mode, nodemask, maxnode) attaches a policy to the calling thread, affecting all future allocations by that thread. move_pages(pid, count, pages[], nodes[], status[], flags) moves a specific list of pages without changing any policy. The three serve different needs: mbind is for "this 64 GB region must live on node 0"; set_mempolicy is for "from now on, this thread allocates on node 0"; move_pages is for "I just discovered these specific pages are on the wrong node, fix them and don't touch anything else". numactl only invokes the first two; numa_move_pages from libnuma exposes the third.
Production tip: MPOL_MF_MOVE_ALL (vs MPOL_MF_MOVE) tells the kernel to move shared pages too, not just pages mapped only by the calling process; it requires CAP_SYS_NICE. It is what you need when migrating shared-memory regions (a Postgres shared buffer, an Aerospike shared mmap) — without it, the syscall reports success but the shared pages stay where they were.
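A sketch of move_pages(2) from user space, moving one of our own pages to node 0 and reading back the per-page status. The raw ctypes approach and the x86_64 syscall number 279 are assumptions of this sketch; numa_move_pages() from libnuma is the portable wrapper, and some sandboxes filter the syscall, so the sketch degrades gracefully.

```python
# move_one_page.py
# Move a single page of our own process to node 0 via move_pages(2).
# Syscall 279 is x86_64-specific.
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
NR_MOVE_PAGES = 279  # x86_64
MPOL_MF_MOVE = 2     # move only pages exclusive to this process
PAGE = os.sysconf("SC_PAGE_SIZE")

buf = mmap.mmap(-1, PAGE, prot=mmap.PROT_READ | mmap.PROT_WRITE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[0:1] = b"\x01"  # touch: only present pages can be moved
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

pages = (ctypes.c_void_p * 1)(addr)
dest = (ctypes.c_int * 1)(0)     # destination node, one entry per page
status = (ctypes.c_int * 1)(-1)  # out: node number, or -errno, per page
rc = libc.syscall(NR_MOVE_PAGES, 0,  # pid 0 = the calling process
                  ctypes.c_ulong(1), pages, dest, status,
                  ctypes.c_int(MPOL_MF_MOVE))
if rc == 0:
    outcome = f"moved: page now on node {status[0]}"
else:
    outcome = f"move_pages unavailable: {os.strerror(ctypes.get_errno())}"
print(outcome)
```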
libnuma — the C API that numactl is built on
libnuma (-lnuma, headers in <numa.h>) is the C library every NUMA-aware service should be calling instead of shelling out to numactl. It exposes numa_available(), numa_set_membind(), numa_alloc_onnode(size, node), numa_set_localalloc(), and the bitmap helpers numa_bitmask_alloc and numa_bitmask_setbit. The advantages over numactl are per-thread granularity (different threads can have different policies) and per-allocation granularity (numa_alloc_onnode is mmap + mbind in one call). Java's -XX:+UseNUMA JVM flag is libnuma-backed; Postgres's NUMA-aware shared buffer work (in development as of 16/17) is libnuma-backed; jemalloc's NUMA extension is libnuma-backed. If you're writing a service that cares about NUMA placement and runs longer than a few minutes, link against libnuma and call it from the right thread at the right time, rather than relying on the launcher's numactl flags.
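For prototyping, the same libnuma entry points are reachable from Python via ctypes. A sketch; it assumes libnuma.so is installed and skips cleanly when it isn't.

```python
# libnuma_via_ctypes.py
# Prototype the libnuma calls from Python. Requires libnuma (package
# libnuma-dev / numactl-libs); degrades gracefully when it's absent.
import ctypes, ctypes.util

SIZE = 4 * 1024 * 1024
libnuma_status = "unavailable"

path = ctypes.util.find_library("numa")
if path:
    numa = ctypes.CDLL(path)
    if numa.numa_available() != -1:
        numa.numa_alloc_onnode.restype = ctypes.c_void_p
        numa.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
        numa.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        p = numa.numa_alloc_onnode(SIZE, 0)  # allocate 4 MiB bound to node 0
        if p:
            numa.numa_free(p, SIZE)
            libnuma_status = "ok"
print("libnuma:", libnuma_status)
```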
NUMA in containers — Kubernetes Topology Manager
Kubernetes's CPU Manager (--cpu-manager-policy=static) and Topology Manager (--topology-manager-policy=single-numa-node or restricted) implement the equivalent of numactl --cpunodebind --membind for pods. The kubelet computes per-NUMA-node availability and admits a pod only if its requested CPU + memory + device topology can be satisfied entirely on one node. Once admitted, the kubelet writes the cgroup's cpuset.cpus and cpuset.mems to enforce the binding. Hotstar's IPL encoder pods run with topologyManagerPolicy: single-numa-node and cpuManagerPolicy: static; the encoder gets exclusive cores on one node, the FFmpeg memory pool stays on that node, and cross-encoder interference is eliminated. The cost: lower bin-packing efficiency. The kubelet refuses to schedule a pod if no single node can fit it, even if the pod would fit fine across two nodes.
What numastat actually shows you
numastat (without -p) shows kernel-wide NUMA counters from /sys/devices/system/node/node*/numastat: numa_hit (allocation satisfied on the node the policy asked for), numa_miss (this node supplied a page that was intended for another node, usually because that node was full), numa_foreign (an allocation intended for this node was served by another node — every numa_foreign here pairs with a numa_miss somewhere else), interleave_hit (interleave policy got the requested node), local_node (allocation made from a CPU on the page's home node), other_node (allocation made from a CPU not on the home node). The two most actionable counters: numa_miss rising means memory pressure is forcing spills, and other_node rising as a fraction of the total means your workload is increasingly allocating remote pages — the binding may have drifted. Set up a Prometheus exporter that emits per-node deltas of these counters every 10 seconds and alert on sudden changes. The PhonePe SRE team's binding-drift dashboard is exactly this.
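A sketch of the scrape loop behind such an exporter; the counter names are the real /sys names, while the interval and output format are illustrative.

```python
# numastat_deltas.py
# Scrape /sys/devices/system/node/node*/numastat twice and report per-node
# counter deltas, the raw material for a binding-drift alert.
import os, time

def read_counters():
    """{node: {counter: value}} from each node's numastat file."""
    base = "/sys/devices/system/node"
    out = {}
    if not os.path.isdir(base):
        return out
    for d in os.listdir(base):
        if not (d.startswith("node") and d[4:].isdigit()):
            continue
        with open(f"{base}/{d}/numastat") as f:
            out[int(d[4:])] = {k: int(v) for k, v in
                               (line.split() for line in f)}
    return out

def deltas(before, after):
    """Per-node counter increments between two snapshots."""
    return {n: {k: after[n][k] - before[n][k] for k in after[n]}
            for n in after if n in before}

before = read_counters()
time.sleep(0.5)  # a real exporter would use its scrape interval
after = read_counters()
for n, c in sorted(deltas(before, after).items()):
    print(f"node {n}: numa_miss +{c.get('numa_miss', 0)}, "
          f"other_node +{c.get('other_node', 0)}")
```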
Reproduce this on your laptop
```
# On any Linux box (single-node laptops still work — the policies are set,
# the placement just collapses trivially):
sudo apt install numactl libnuma-dev linux-tools-common
numactl --hardware                              # confirm topology
numactl --membind=0 --cpunodebind=0 \
    python3 -c "import time; time.sleep(60)" &  # bind a sleeper
numastat -p $!                                  # see its placement

python3 -m venv .venv && source .venv/bin/activate
# numactl_under_the_hood.py uses ctypes (stdlib); no pip install needed
python3 numactl_under_the_hood.py
python3 first_touch_demo.py
python3 verify_pinning.py $! --nodes 0
```
A single-socket laptop reports one node and the binding is degenerate; a 2-node desktop or a c6a.metal cloud box exercises the real policies. Both teach the same discipline: ask, then verify.
Where this leads next
You can now place threads and pages where you want them. The next chapters turn that placement into measurement and into the wires beneath:
- /wiki/interconnects-qpi-upi-infinity-fabric — the cross-socket links that carry every cache-line miss your binding strategy fails to prevent. Bandwidth budgets, coherence overhead, and the symptoms of saturation in `perf stat`.
- /wiki/numa-aware-allocators-and-data-structures — once placement is solved, the next layer is making `malloc` itself NUMA-aware: per-arena allocators, sharded hash tables, per-node freelists. The shape of code that doesn't fight the topology.
- /wiki/measuring-numa-effects-with-perf — turning placement into perf-counter expectations: `perf stat -e numa_hit,numa_miss,numa_foreign` on a workload tells you whether your binding is doing what you intended.
The deeper habit is to treat binding as a contract that the deploy pipeline verifies, not a CLI flag your service's launcher happens to mention. Karan's incident at PhonePe was caused by a missing flag in a systemd unit, but the structural fix wasn't "always remember to write --membind". It was a healthcheck — verify_pinning.py above — that the deploy pipeline blocks promotion on. Three lines of numastat parsing prevent the next 12-hour incident.
References
- Linux `numactl(8)` manual page — flag reference; the section on `--cpunodebind` vs `--physcpubind` is essential.
- Linux `mbind(2)` manual page — the syscall `numactl` wraps for memory binding; covers `MPOL_MF_STRICT`, `MPOL_MF_MOVE`, `MPOL_MF_MOVE_ALL`.
- Linux `set_mempolicy(2)` manual page — the per-thread policy syscall; explains the precedence order between thread, VMA, and system policies.
- `libnuma` documentation — the C API for fine-grained NUMA control; the right surface for production services.
- Brendan Gregg, *Systems Performance* (2nd ed., 2020) — Chapter 7 (Memory) on NUMA debugging; covers `numastat`, `numa_maps`, and the auto-balancing knobs.
- Linux kernel `Documentation/admin-guide/mm/numa_memory_policy.rst` — the canonical description of `MPOL_DEFAULT`, `MPOL_BIND`, `MPOL_PREFERRED`, `MPOL_INTERLEAVE`.
- Christoph Lameter, "Local and Remote Memory: Memory in a Linux/NUMA System" (2006) — the foundational LWN article; still accurate on policy semantics.
- /wiki/numa-topology-discovery — the previous chapter; you must discover the topology before you can bind to it.