NUMA-aware allocators and data structures
Karan was on-call for Razorpay's payment-matcher service the night a refactor went out. Throughput on the post-deploy nodes was 18% lower than the pre-deploy baseline despite the same core count, the same threads, the same numactl --cpunodebind invocation, and numastat showing 96% local accesses on each node. The fix had taken six hours to find: a single std::vector<MatchEntry> constructed once on socket 0 during startup, then served as scratch space for every worker thread on every socket for the rest of the process's life. numastat reported "local" because each worker happened to read its own per-thread region — but that region had been allocated from socket-0-resident pages on day one. The allocator gave the workers what they asked for; nothing told it to ask the right node.
The pages your allocator hands you are pinned to whichever NUMA node first wrote them, not to whichever thread asked for them. A NUMA-aware allocator (per-arena jemalloc, per-node tcmalloc, mimalloc with `MIMALLOC_RESERVE_HUGE_OS_PAGES`, or your own per-socket pool) closes that gap by giving each thread its own arena rooted on its own node. The data-structure work — sharding hash tables, per-socket work-stealing deques, replicated read-mostly tables — is the other half. Without both, `numactl --cpunodebind` is only half a fix.
First-touch placement is what decides node residency
Linux does not place a page on a node when you call malloc. It places the page on a node when you first write to it — the first-touch behaviour of the default NUMA mempolicy (how eagerly the kernel spills to a remote node rather than reclaiming locally is governed by `/proc/sys/vm/zone_reclaim_mode`). `malloc(64 MB)` returns a virtual address range; the kernel records the range in the page tables and marks every entry "not present". The first store instruction that touches each 4 KB page triggers a page fault; the fault handler decides which node's free-page pool to draw from, and by default it draws from the node of the CPU that ran the faulting thread. The page then stays on that node for its lifetime, unless something explicitly migrates it.
This is the rule that bit Karan. The startup thread on socket 0 zeroed every byte of the worker pool's vectors during initialisation. From that moment, every page of those vectors lived on socket 0 — even though socket 1's workers later reused them. Why first-touch is the right default but the wrong intuition: the kernel cannot read your mind. It does not know that the thread doing the initialising is not the thread that will use the data later. It picks the most defensible heuristic — "the writer is probably also the reader" — and that heuristic is correct in 95% of code paths but catastrophically wrong in pre-allocated worker pools, lazy-init singletons, and any object that crosses thread boundaries.
Picture the failure in one line: the startup thread runs a `memset` over a 64 MB vector, and every page of that vector is pinned to node 0's DRAM. When a worker on socket 1 later reads from the buffer, every load crosses the UPI link and lands on node 0 — even though the worker is "local" by every dashboard's definition. (Illustrative — based on Linux's default first-touch policy.) The fix at the application level is to defer the first touch to the consumer. Don't initialise on the main thread; have each worker initialise its own slab the first time it runs on its own CPU. Or use `mbind(MPOL_BIND, nodes)` to declare the placement explicitly before any write. Or — the option this chapter is about — use an allocator that already knows the topology and gives each thread an arena rooted on the right node.
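You can watch first-touch happen from userspace. The snippet below is an illustrative sketch, assuming libnuma is installed: it maps an anonymous page, first-touches it from a thread pinned to node 1, then asks the kernel where the page landed via `get_mempolicy` with `MPOL_F_NODE | MPOL_F_ADDR`. On a single-node box the pin silently fails and the page reports node 0. The `node_of` helper name is ours, not libnuma's.

```python
import ctypes, ctypes.util, mmap, threading

# With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy(2) reports the node id of
# the physical page backing `addr`, not the policy itself.
MPOL_F_NODE, MPOL_F_ADDR = 1, 2

_path = ctypes.util.find_library("numa")
libnuma = ctypes.CDLL(_path) if _path else None
if libnuma is not None:
    libnuma.get_mempolicy.argtypes = [
        ctypes.POINTER(ctypes.c_int), ctypes.c_void_p,
        ctypes.c_ulong, ctypes.c_void_p, ctypes.c_uint]

def node_of(addr: int) -> int:
    """Ask the kernel which node physically holds the page at addr."""
    node = ctypes.c_int(-1)
    libnuma.get_mempolicy(ctypes.byref(node), None, 0,
                          ctypes.c_void_p(addr), MPOL_F_NODE | MPOL_F_ADDR)
    return node.value

resident_node = None
if libnuma is not None and libnuma.numa_available() >= 0:
    buf = mmap.mmap(-1, 4096)  # anonymous mapping; no physical page yet
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

    def toucher(target_node: int):
        libnuma.numa_run_on_node(target_node)  # pin BEFORE the first write
        buf[0] = 1                             # first touch faults the page in

    t = threading.Thread(target=toucher, args=(1,))
    t.start(); t.join()
    resident_node = node_of(addr)
    print("first page resides on node", resident_node)
else:
    print("libnuma unavailable; skipping the residency check")
```

On a 2-socket box this prints node 1 — the node of the touching thread, not of the thread that created the mapping.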
What a NUMA-aware allocator actually does
A modern thread-cached allocator like jemalloc, tcmalloc, or mimalloc already partitions its bookkeeping by thread to avoid lock contention. The NUMA-aware modes extend that partition to the node level: instead of one global heap with per-thread caches, the allocator runs one heap per NUMA node, with per-thread caches that pull from their node's heap. A malloc from a thread on node 0 returns a chunk whose backing pages are guaranteed to live on node 0; a malloc from a thread on node 1 returns a chunk from node 1's heap.
The mechanics differ by allocator:
| Allocator | NUMA mode | Knob | What it does |
|---|---|---|---|
| jemalloc | Per-arena with explicit `mbind` | `MALLOC_CONF=narenas:N` plus per-thread arena pinning via `mallctl("thread.arena")` | One arena per NUMA node, each backed by anonymous `mmap` + `mbind` to that node |
| tcmalloc (gperftools) | Per-CPU caches + `MADV_HUGEPAGE` | `TCMALLOC_NUMA_AWARE=t` (Google-internal fork) | Per-node central freelists, per-CPU thread caches; refills draw from the local node |
| mimalloc | Reserved per-node huge pages | `MIMALLOC_RESERVE_HUGE_OS_PAGES_AT=0,1,2,3` | At startup, reserves huge pages from each named node; arenas pull from their reserved pool |
| Custom per-socket pool | Hand-rolled `mmap` + `mbind` | Application-specific | Allocate large slabs once at startup with `mbind`; carve from per-socket pools at request time |
The custom per-socket pool sounds like reinventing the wheel, but it's the most common production pattern in latency-critical Indian backends because it gives the application precise control over how much memory each socket holds. Razorpay's payment matcher uses exactly this: a 4 GB pool per socket reserved at startup, sub-allocated by a small bump-allocator inside each request handler. The standard allocators all kept fragmenting under the matcher's mixed object-size workload, and the bump-allocator's deterministic placement was worth the 200 lines of code it cost.
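That bump-allocator pattern reduces to a few lines. The `SocketPool` class below is an illustrative sketch, not Razorpay's code: a production version would `mmap` the slab and `mbind` it to the socket's node before first touch, which this sketch only notes in comments.

```python
import mmap

class SocketPool:
    """Per-socket bump allocator over one pre-reserved slab (a sketch).
    Real code would mbind the slab to `node` before any write; here the
    node argument is bookkeeping only."""
    ALIGN = 64  # hand out cache-line-aligned chunks

    def __init__(self, node: int, size: int):
        self.node = node
        self.slab = mmap.mmap(-1, size)  # in production: mmap + mbind(node)
        self.size = size
        self.offset = 0

    def alloc(self, nbytes: int) -> int:
        """Return a slab offset; O(1), no locks, no freelist."""
        start = (self.offset + self.ALIGN - 1) & ~(self.ALIGN - 1)
        if start + nbytes > self.size:
            raise MemoryError(f"pool on node {self.node} exhausted")
        self.offset = start + nbytes
        return start

    def reset(self):
        """Per-request pools free everything at once, in O(1)."""
        self.offset = 0

pool = SocketPool(node=0, size=1 << 20)
a = pool.alloc(100)
b = pool.alloc(100)
print(a, b)  # 0 128 — the second chunk starts at the next 64-byte boundary
```

The deterministic placement the chapter mentions falls out of the design: every byte a request handler touches comes from a slab whose node was fixed at startup, so there is nothing left for the general-purpose allocator to get wrong.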
```python
# numa_arena_demo.py
# Compare the latency of a worker thread reading a buffer that was
# (a) first-touched on its own node, vs (b) first-touched on the OTHER node.
# Uses ctypes + libnuma to bind pages explicitly. The "before" case
# mirrors the Razorpay startup-thread bug at the top of the chapter.
#
# Run as:
#   sudo python3 numa_arena_demo.py
# Requires: libnuma-dev (apt install libnuma-dev), 2-socket box.
import ctypes, ctypes.util, sys, threading, time

libnuma = ctypes.CDLL(ctypes.util.find_library("numa") or "libnuma.so.1")
libnuma.numa_available.restype = ctypes.c_int
libnuma.numa_max_node.restype = ctypes.c_int
libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
libnuma.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libnuma.numa_run_on_node.argtypes = [ctypes.c_int]

if libnuma.numa_available() < 0:
    sys.exit("libnuma reports NUMA unavailable")
if libnuma.numa_max_node() < 1:
    sys.exit("Need 2+ NUMA nodes; this is a single-node box.")

SIZE = 256 * 1024 * 1024  # 256 MB; large enough to defeat any cache

def stream_read(addr: int, nbytes: int, scratch, runs: int) -> float:
    """Streaming read over the buffer; returns ns/byte averaged over runs."""
    t0 = time.perf_counter_ns()
    for _ in range(runs):
        # One memmove streams the whole buffer in C, so DRAM traffic —
        # not per-call Python overhead — dominates the measurement.
        ctypes.memmove(scratch, addr, nbytes)
    return (time.perf_counter_ns() - t0) / (runs * nbytes)

def worker(node_for_buf: int, node_for_thread: int, label: str):
    libnuma.numa_run_on_node(node_for_thread)  # pin this thread
    buf = libnuma.numa_alloc_onnode(SIZE, node_for_buf)
    if not buf:
        print(f"[{label}] alloc failed"); return
    ctypes.memset(buf, 0, SIZE)  # first touch; pages honour the mbind
    # The destination is allocated (and first-touched) by this worker, so
    # it is local; the remote traffic comes from reading `buf`.
    scratch = ctypes.create_string_buffer(SIZE)
    nspb = stream_read(buf, SIZE, scratch, runs=3)
    print(f"[{label}] thread on node {node_for_thread}, "
          f"buf on node {node_for_buf}: {nspb:.2f} ns/byte")
    libnuma.numa_free(buf, SIZE)

# Worker thread runs on node 1 in both cases; the buffer is allocated
# on node 1 (local) vs node 0 (remote — the bug case). Run the cases
# sequentially so they don't contend for memory bandwidth.
for args in ((1, 1, "LOCAL"), (0, 1, "REMOTE")):
    t = threading.Thread(target=worker, args=args)
    t.start(); t.join()
```
A sample run on a 2-socket Sapphire Rapids 8480+ box:

```
$ sudo python3 numa_arena_demo.py
[LOCAL] thread on node 1, buf on node 1: 2.18 ns/byte
[REMOTE] thread on node 1, buf on node 0: 5.84 ns/byte
```
What matters in the walkthrough:

- `numa_alloc_onnode(SIZE, node)` is libnuma's wrapper for `mmap` + `mbind`. It reserves the virtual range and binds its backing pages to the named node. Why we use this instead of `numa_alloc_local`: `numa_alloc_local` picks the node from the calling thread's current CPU, which makes the "remote" case impossible to construct. `numa_alloc_onnode` is the explicit form, and it is what production allocators do internally when they detect the requesting thread's node and call `mbind` for that node specifically.
- `ctypes.memset(buf, 0, SIZE)` is the first touch. Until it runs, the kernel hasn't allocated any physical pages — `numa_alloc_onnode` sets the mempolicy on the virtual range, but `mbind` does not eagerly populate. The `memset` faults every page in, and the kernel honours the bound mempolicy.
- 2.18 ns/byte vs 5.84 ns/byte is a 2.7× slowdown on streaming reads, dominated by the UPI hop's latency and the cross-socket bandwidth ceiling discussed in the interconnects chapter. The CPU does the same work in both cases; the difference is entirely the data-residency tax.
- The fix in production code is one of: (a) defer first touch to the consumer thread; (b) allocate explicitly on the consumer's node with `numa_alloc_onnode(size, numa_node_of_cpu(sched_getcpu()))`; (c) use jemalloc with `narenas` set to one per node and pin threads to arenas. All three close the gap; the third is the lowest-effort because it needs no code changes — just `LD_PRELOAD=libjemalloc.so MALLOC_CONF=...`.
The 2–3× range is consistent across topologies, even as the exact multiple shifts: AMD Genoa's xGMI3 hop costs about 2.1×; older Intel Skylake-SP UPI hops about 2.4×; Arm Graviton3's mesh hop about 1.9×. The relative cost is the structural feature; the absolute number depends on the wire and the bandwidth pressure. The point is that the cost is never zero, which means a NUMA-aware allocator is never a no-op — it always saves something, and on a saturated link it saves a lot.
Sharded data structures: the other half of the story
A NUMA-aware allocator solves the placement problem for objects that one thread owns. It does not solve the problem for objects that multiple threads share. A hash table accessed by 64 threads across 2 sockets will have its buckets scattered across both nodes regardless of which allocator placed them; every lookup is a roughly 50% chance of a remote access. The fix is structural — shard the data structure by socket, so each socket has its own copy of the buckets that its threads touch.
Three patterns dominate Indian production workloads:
Per-socket sharded hash tables. Instead of one ConcurrentHashMap shared by all threads, run two — one per socket — and route lookups based on sched_getcpu() mod 2. Hotstar's session cache uses this; user sessions are sticky to a socket via consistent-hash on user_id, and a session lookup never crosses sockets. Memory cost: 2× (one copy per socket). Throughput gain: 1.6–1.9× vs the shared map under contention.
Per-socket work-stealing deques. A thread pool that spans multiple sockets benefits from per-socket work queues; threads steal from their own socket's queue first, and only fall back to a remote socket's queue when their local queue is empty. Tokio's runtime, Java's ForkJoinPool, and Go's runtime scheduler all do variants of this — the production tuning is to set the steal-from-remote probability low enough that cross-socket stealing is rare. Zerodha's order-matching engine pins each worker to a socket and uses per-socket deques with steal_remote_probability = 0.05; cross-socket events drop to 4% of total work, down from 47% with a global queue.
Replicated read-mostly state. A catalogue index, a routing table, a feature flag map — anything read 1000× more than written — gets a copy per socket. Writes are slower (must update every socket's copy under a small mutex), reads are local (no cross-socket access). Flipkart's catalogue tier replicates the 12 GB hot-product index per socket; the write path goes through a leader-followed-by-broadcast protocol, the read path is a pointer chase entirely on the local socket. The memory cost is the replication factor; the latency benefit is a 3-4× drop in p99 reads.
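The replicated read-mostly pattern fits in a few lines. The `ReplicatedMap` below is a hypothetical sketch, not any of the systems above: writes broadcast to every node's copy under a small lock, reads touch only the caller's node's copy, and the even/odd CPU-to-node mapping is a stand-in assumption for libnuma's `numa_node_of_cpu`.

```python
import ctypes, threading

_libc = ctypes.CDLL(None)  # glibc; exposes sched_getcpu on Linux

def current_node(num_nodes: int) -> int:
    """Map the calling CPU to a node. ASSUMPTION: even CPUs on node 0,
    odd CPUs on node 1 — real code should use numa_node_of_cpu()."""
    try:
        cpu = _libc.sched_getcpu()
    except AttributeError:  # libc without sched_getcpu
        cpu = 0
    return cpu % num_nodes

class ReplicatedMap:
    """Read-mostly table with one copy per node: writes broadcast to
    every copy; reads touch only the local node's copy."""
    def __init__(self, num_nodes: int = 2):
        self.num_nodes = num_nodes
        self.copies = [dict() for _ in range(num_nodes)]
        self.locks = [threading.Lock() for _ in range(num_nodes)]

    def put(self, key, value):
        # Slow path: update every node's copy under its lock.
        for copy, lock in zip(self.copies, self.locks):
            with lock:
                copy[key] = value

    def get(self, key):
        # Fast path: no cross-socket access, no lock for dict reads
        # (CPython dict reads are safe under the GIL).
        return self.copies[current_node(self.num_nodes)].get(key)

flags = ReplicatedMap()
flags.put("new-checkout-flow", True)
print(flags.get("new-checkout-flow"))
```

The write path here is deliberately dumb (broadcast under a lock); the Flipkart-style leader-plus-broadcast protocol is an optimisation of the same shape for larger structures.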
```python
# sharded_vs_shared_lookup.py
# Compare a single shared dict vs per-socket sharded dicts under
# cross-socket lookup load. Pin one worker to node 0 and one to node 1;
# measure throughput.
#
# Requires: 2-socket box; libnuma. We use multiprocessing rather than
# threads so the GIL doesn't serialise the workers — and specifically
# the fork start method, because the children must inherit the parent's
# pages as-is (spawn would re-pickle the dicts and rebuild them locally,
# erasing the placement we're trying to measure).
import ctypes, multiprocessing as mp, random, sys, time
from ctypes.util import find_library

libnuma = ctypes.CDLL(find_library("numa") or "libnuma.so.1")
libnuma.numa_run_on_node.argtypes = [ctypes.c_int]

N_KEYS = 1_000_000
N_LOOKUPS = 2_000_000

def make_lookup_table(n: int) -> dict:
    return {i: f"value-{i:08d}" for i in range(n)}

def worker(table, node: int, q):
    """Pin to `node`, hammer `table` with random lookups, report elapsed."""
    libnuma.numa_run_on_node(node)
    rnd = random.Random(node * 31337)
    t_start = time.perf_counter()
    h = 0
    for _ in range(N_LOOKUPS):
        h ^= hash(table[rnd.randrange(N_KEYS)])
    q.put((node, time.perf_counter() - t_start, h))

def run_pair(tables, q):
    procs = [mp.Process(target=worker, args=(tables[n], n, q)) for n in (0, 1)]
    for p in procs: p.start()
    for p in procs: p.join()
    return [q.get() for _ in procs]

def main():
    if libnuma.numa_max_node() < 1:
        sys.exit("Need 2+ NUMA nodes for this benchmark.")
    mp.set_start_method("fork")
    # Shared: one table built (first-touched) on node 0, read from both nodes.
    libnuma.numa_run_on_node(0)
    shared = make_lookup_table(N_KEYS)
    shared_results = run_pair({0: shared, 1: shared}, mp.Queue())
    # Sharded: one table per node, each built on its own node.
    libnuma.numa_run_on_node(0); table0 = make_lookup_table(N_KEYS)
    libnuma.numa_run_on_node(1); table1 = make_lookup_table(N_KEYS)
    sharded_results = run_pair({0: table0, 1: table1}, mp.Queue())
    print("Shared (single table on node 0):")
    for node, elapsed, _ in shared_results:
        print(f"  node {node}: {N_LOOKUPS/elapsed/1e6:.2f} M lookups/s")
    print("Sharded (per-node tables):")
    for node, elapsed, _ in sharded_results:
        print(f"  node {node}: {N_LOOKUPS/elapsed/1e6:.2f} M lookups/s")

if __name__ == "__main__":
    main()
```
Sample run on a 2-socket Genoa workstation:

```
$ python3 sharded_vs_shared_lookup.py
Shared (single table on node 0):
  node 0: 4.81 M lookups/s
  node 1: 2.93 M lookups/s
Sharded (per-node tables):
  node 0: 4.92 M lookups/s
  node 1: 4.78 M lookups/s
```
Two observations matter. First, in the shared case, node 1's throughput is 39% lower than node 0's — every lookup on node 1 fetches its bucket from node 0's DRAM over xGMI. Second, in the sharded case, both nodes hit ~4.85 M lookups/s, the throughput is symmetric, and the aggregate is about 25% higher than the shared case (9.7 M/s vs 7.7 M/s) — the cross-socket tax was burning a quarter of the total work. Why this shows up so cleanly even with multiprocessing: the children are forked, so the dict's PyObjects sit on the pages first-touched at construction time, and `numa_run_on_node` keeps each worker's CPU on its bound node. The shared dict's storage is all on node 0; node-1's workers do a remote-DRAM round-trip for bucket accesses. (One caveat: CPython's refcount updates write to the objects' pages, so copy-on-write gradually copies touched pages into the child; on very long runs this narrows the gap.) A 39% throughput delta from a single allocation decision is exactly the order of magnitude that converts "this service is fine" into "this service tail-latency-pages me at 02:00".
Production patterns and their pitfalls
The patterns above are easy to write down and easy to get wrong. Three failure modes show up often enough to be worth naming.
Sharding by thread_id instead of by socket. A common mistake is to shard a hash table into num_threads buckets, route by thread_id mod num_threads, and assume that's NUMA-aware. It isn't — the OS scheduler can move threads between cores within a socket (fine) and between sockets (not fine, but it happens during reschedule storms). The shard a thread is responsible for moves with the thread; the data in that shard does not. The right key is numa_node_of_cpu(sched_getcpu()), not thread_id. The wrong choice produces a service that runs fine in steady state and falls apart during the rebalances triggered by container migrations or auto-scaling events.
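A minimal version of the correct routing key, assuming glibc's `sched_getcpu` and, where installed, libnuma's `numa_node_of_cpu` (the `shard_index` helper name is ours):

```python
import ctypes, ctypes.util

_libc = ctypes.CDLL(None)  # glibc; exposes sched_getcpu on Linux
_numa_path = ctypes.util.find_library("numa")
_libnuma = ctypes.CDLL(_numa_path) if _numa_path else None

def shard_index(num_shards: int) -> int:
    """Route by the NUMA node under the calling thread RIGHT NOW.
    Survives scheduler migrations, unlike a thread_id-based key."""
    cpu = _libc.sched_getcpu()
    if _libnuma is not None:
        node = _libnuma.numa_node_of_cpu(cpu)
    else:
        node = 0  # single-node / no-libnuma fallback
    return node % num_shards

print("this thread's shard:", shard_index(2))
```

If the scheduler moves the thread to the other socket mid-flight, the next call returns the other shard — the routing follows the data's locality, not the thread's identity.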
Initialising sharded structures from a single thread. This is the bug from the chapter's opening — you allocate num_sockets shards, but you initialise all of them from the main thread on socket 0, so every shard's pages are first-touched on node 0. The fix is to initialise each shard in a thread pinned to its target socket. numa_alloc_onnode plus a deferred initialisation in a worker pool is the cleanest pattern; mbind on the virtual range before any write is the lower-level alternative.
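A sketch of consumer-side initialisation using thread affinity. The even/odd CPU-to-node mapping is an assumption standing in for the real topology, which production code should read from `/sys/devices/system/node` or libnuma:

```python
import os, threading

NUM_NODES = 2

def cpus_of_node(node: int) -> set:
    """ASSUMPTION: even CPUs on node 0, odd on node 1. Real code reads
    /sys/devices/system/node/node<N>/cpulist instead."""
    ncpu = os.cpu_count() or 1
    cpus = {c for c in range(ncpu) if c % 2 == node}
    return cpus or {0}

shards = [None] * NUM_NODES

def init_shard(node: int, size: int):
    try:
        # Pin this thread to the target socket BEFORE allocating, so the
        # zero-fill below is the first touch and happens on that node.
        os.sched_setaffinity(0, cpus_of_node(node))
    except OSError:
        pass  # box smaller than assumed; placement becomes best-effort
    shards[node] = bytearray(size)  # first touch: pages land locally

threads = [threading.Thread(target=init_shard, args=(n, 1 << 20))
           for n in range(NUM_NODES)]
for t in threads: t.start()
for t in threads: t.join()
print([len(s) for s in shards])
```

The structural point is the ordering: affinity first, allocation-and-zero second, one thread per target socket — never one loop on the main thread.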
Forgetting that allocators recycle freed pages. A long-running service mallocs and frees millions of objects per second. The allocator returns freed pages to its arena's freelist; the freelist is per-arena, so a page allocated by a node-0 arena and freed back to that arena stays on node 0. Good. But if your allocator is not NUMA-aware, the freelist is global, and a page freed by a thread on node 0 can be handed out to a thread on node 1 next — which then writes new data into a node-0-resident page. This is the slowest possible path: every reuse of a recycled page is a remote first-touch in disguise (the page was already first-touched on node 0; the new node-1 accesses pay the remote cost permanently). The symptom is performance that degrades over hours as the allocator's freelists scramble residency. The fix for jemalloc is `MALLOC_CONF=narenas:<num_nodes>` plus `mallctl("thread.arena", ...)` to pin each thread to its node's arena.
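A toy model makes the freelist scrambling concrete. Pages here are `(home_node, id)` tuples; the only difference between the two heaps is whether freed pages return to one global freelist or to their home node's freelist:

```python
from collections import deque

class Heap:
    """Minimal model of page recycling. per_node=False is the
    NUMA-oblivious allocator with one global freelist."""
    def __init__(self, per_node: bool, nodes: int = 2):
        self.per_node = per_node
        self.freelists = [deque() for _ in range(nodes if per_node else 1)]
        self.next_id = 0

    def alloc(self, node: int):
        fl = self.freelists[node if self.per_node else 0]
        if fl:
            return fl.popleft()       # recycled page keeps its old home node
        self.next_id += 1
        return (node, self.next_id)   # fresh page: first-touched locally

    def free(self, page):
        home = page[0]
        self.freelists[home if self.per_node else 0].append(page)

def remote_fraction(heap, rounds=10_000):
    """Alternate node-0 / node-1 allocations; count pages handed to a
    thread whose node differs from the page's home node."""
    remote = 0
    for i in range(rounds):
        node = i % 2
        page = heap.alloc(node)
        if page[0] != node:
            remote += 1
        heap.free(page)
    return remote / rounds

g = remote_fraction(Heap(per_node=False))
p = remote_fraction(Heap(per_node=True))
print("global freelist :", g)   # 0.5 — every node-1 alloc gets a node-0 page
print("per-node arenas :", p)   # 0.0 — recycled pages stay home
```

Half of all allocations in the global-freelist model end up remote, which is exactly the slow degradation the pitfall describes: residency scrambles as recycling proceeds, even though every individual malloc "succeeds".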
Hugepages interacting badly with NUMA. Transparent hugepages (THP) coalesce 512 4KB pages into one 2MB page; the coalesced 2MB page lives on whatever node held the majority of the 4KB pages. If your shards span node boundaries — say, a 64MB shard with the first 32MB on node 0 and the second on node 1, because the allocator's freelist had pages from both — THP merges them into 2MB pages whose residency is determined by majority vote, and a couple of pages' worth of the shard ends up on the wrong node. The fix is to allocate aligned to 2MB boundaries when you intend to use hugepages, and to use mbind(..., MPOL_BIND, ..., MPOL_MF_STRICT) so that pages violating the binding are reported instead of silently tolerated. Razorpay disabled THP for their matcher process specifically because of this interaction; the page-fault and TLB savings from 2MB pages were overwhelmed by the residency drift on long-lived processes.
Common confusions
- "`numactl --membind=0` makes my whole process node-0-resident." Only for memory the kernel allocates after `numactl` runs. Memory the binary's static initialisers allocated before `main()` — including thread-local storage, library globals, and pre-launched JVM/CLR heap — was placed by the kernel's default policy and is not affected. `numactl` controls the policy of subsequently-allocated pages; the policy applies forward, not backward.
- "`jemalloc` is NUMA-aware out of the box." Mainline `jemalloc` runs `narenas = 4 * num_cpus` arenas by default and assigns them round-robin to threads with no node-awareness. To make it NUMA-aware you must set `MALLOC_CONF=narenas:<num_nodes>` (one arena per node) and call `mallctl("thread.arena", ...)` on each thread to pin it. Without that explicit configuration, jemalloc's per-thread caches help with lock contention but not with placement.
- "`mbind` works on individual addresses." It works on page-aligned virtual ranges; the unit is a 4 KB page (or 2 MB hugepage). A request to bind a single object smaller than a page binds the whole page, which may contain unrelated data. The usable patterns are: allocate the object as part of a page-aligned slab, or use a slab allocator (jemalloc, tcmalloc) that already aligns its chunks to page boundaries.
- "Local access vs remote access is the only NUMA distinction." It's the headline, not the full story. There's also interleaved memory (`numactl --interleave=all`), which spreads pages across all nodes round-robin. For workloads with no spatial locality (a giant in-memory hash table accessed uniformly by all threads on all sockets), interleave is sometimes the right answer: the local/remote ratio stays around 50/50, but the bandwidth pressure spreads evenly across every node's memory controller instead of saturating one. Interleave is rare in practice but useful when sharding is impossible.
- "My allocator's stats say 0% remote allocations, so I'm fine." Allocator stats track which arena handed out which chunk; they do not track which pages of that chunk are physically resident on which node. The reliable signal is `numastat -p <pid>` — it shows `numa_hit`, `numa_miss`, and `numa_foreign` per node, populated from the kernel's actual page-fault decisions. Trust the kernel's view, not the allocator's.
- "Per-socket sharding is just consistent hashing." Consistent hashing solves the routing problem; per-socket sharding solves the placement problem. They compose (route by consistent-hash to the right socket; the socket has its own shard) but they are not the same problem. A workload can be perfectly consistent-hashed across N nodes and still incur 50% remote accesses if the shards themselves are placed on the wrong nodes.
Going deeper
Per-arena vs per-CPU vs per-thread caches
Modern allocators have three levels of bookkeeping: per-thread caches (a small fast freelist on the calling thread, no locks), per-CPU caches (slightly larger, accessed from any thread running on that CPU, requires preemption-safe access), and per-arena central freelists (large, lock-protected, the source of truth for an arena). NUMA-awareness can apply at any level. tcmalloc's per-CPU mode (TCMALLOC_NUMA_AWARE plus MADV_HUGEPAGE) places per-CPU caches on the local node and central freelists per-node; jemalloc's per-arena mode places everything per-arena and binds arenas to nodes. The trade-off: per-CPU is faster for the common case (no lock, no atomic) but requires kernel restartable-sequences (rseq(2)) to avoid migration races; per-arena is slower per-call but works on every kernel and is more portable. Production: use per-CPU on Linux 4.18+ kernels (where rseq is mature), use per-arena everywhere else. The configuration knob to know is MALLOC_CONF=metadata_thp:auto for jemalloc, which puts the metadata itself on hugepages — a small win that compounds over millions of allocator calls per second.
mbind, set_mempolicy, and move_pages
The kernel exposes three syscalls for explicit NUMA control. mbind(addr, len, mode, nodes, ...) sets the policy for a virtual range — applied at next page fault. set_mempolicy(mode, nodes, ...) sets the policy for the calling thread for any subsequent allocation. move_pages(pid, n, pages, nodes, status, flags) migrates already-faulted pages to a different node — at the cost of a copy and a TLB shootdown per page. The first two are cheap and forward-only; the third is expensive (typically 50-100 µs per page) and is what numa_balancing uses behind the scenes. Production code rarely uses move_pages directly because it's so expensive; it's reserved for one-shot rebalancing during process startup or before a long-running phase change ("we're switching from training to inference; move the model weights to the GPU's local node"). Razorpay's matcher uses mbind at allocation time and never invokes move_pages — anything that can't be placed correctly the first time is a code smell that triggers a redesign.
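The query side of move_pages is cheap and worth knowing: pass NULL for the nodes argument and the syscall fills status with each page's current node, migrating nothing. A hedged ctypes sketch, assuming libnuma is installed:

```python
import ctypes, ctypes.util, mmap

_path = ctypes.util.find_library("numa")
page_node = None
if _path:
    libnuma = ctypes.CDLL(_path, use_errno=True)
    libnuma.move_pages.argtypes = [
        ctypes.c_int, ctypes.c_ulong, ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int),
        ctypes.c_int]
    buf = mmap.mmap(-1, 4096)
    buf[0] = 1  # first touch: the page now physically exists somewhere
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    pages = (ctypes.c_void_p * 1)(addr)
    status = (ctypes.c_int * 1)(-1)
    # nodes=NULL turns move_pages into a pure residency query: no copy,
    # no TLB shootdown; status[i] receives the node of pages[i].
    if libnuma.move_pages(0, 1, pages, None, status, 0) == 0:
        page_node = status[0]
        print("page is resident on node", page_node)
else:
    print("libnuma not found; install libnuma-dev to run the query")
```

This is the programmatic equivalent of grepping `numa_maps`, and it is how a startup-time assertion like "every page of this pool is on node 1" can be written in a few lines.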
Container and Kubernetes interactions
Containers complicate NUMA in two ways. First, cpuset cgroups limit which CPUs a container can run on, but do not by default limit which NUMA nodes it can allocate from. A container pinned to socket-1 CPUs can still allocate node-0 memory unless the orchestrator explicitly sets cpuset.mems. Second, the Topology Manager in Kubernetes (--topology-manager-policy=single-numa-node or restricted) attempts to align CPU and memory allocations to the same node, but only honours hints from device plugins (GPUs, NICs) by default — generic memory allocations rely on the application's own placement. The production setup is to enable Topology Manager + set the container's cpuset.mems to the same node as cpuset.cpus + use --cpu-manager-policy=static to pin the container's CPUs. PhonePe's UPI authoriser deployment uses this stack; without it, pod restarts shuffled the placement and tail latency drifted by ±20% per restart. The QoS class matters — only Guaranteed pods (CPU and memory both set to integer requests = limits) get topology-aware placement; Burstable pods are placed by the kubelet's default scheduler with no NUMA awareness.
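Outside Kubernetes, the same CPU/memory alignment can be done by hand with cgroup v2 cpuset files. An illustrative sketch — the cgroup name "payments" and the CPU range 48-95 are assumptions for a 2-socket box where node 1 owns CPUs 48-95; confirm your own topology first:

```shell
# Confirm which CPUs belong to node 1 on THIS box before copying ranges.
cat /sys/devices/system/node/node1/cpulist

# Requires the cpuset controller enabled in the parent cgroup:
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

sudo mkdir -p /sys/fs/cgroup/payments
echo "48-95" | sudo tee /sys/fs/cgroup/payments/cpuset.cpus   # socket-1 CPUs
echo "1"     | sudo tee /sys/fs/cgroup/payments/cpuset.mems   # node-1 memory only
echo $$      | sudo tee /sys/fs/cgroup/payments/cgroup.procs  # move this shell in
```

Setting `cpuset.mems` is the step container runtimes skip by default — with it in place, even a NUMA-oblivious allocator inside the cgroup cannot fault pages onto the wrong node.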
Reading numastat, numa_maps, and migrate_pages
`numastat -p <pid>` shows per-node counters for the process: `numa_hit` (pages allocated on this node when this node was the intended target), `numa_miss` (pages allocated on this node even though the policy preferred another node, typically because the intended node was full), `numa_foreign` (pages intended for this node that spilled to another node; every `numa_miss` on one node is a `numa_foreign` on another), `interleave_hit` (interleaved allocations served on the intended node), `local_node` (pages allocated on this node while the requesting thread was running on it), `other_node` (pages allocated on this node while the requesting thread was running on a different node). The pattern that signals a placement bug: `other_node` high while `numa_miss` is zero — the allocations "succeeded" by the kernel's metric, but the requesting threads were on another node, so the application asked for the wrong node. The fix is application-side, not kernel-side. `cat /proc/<pid>/numa_maps` shows the same data per virtual mapping, so you can identify which specific allocation is misbehaving — `[heap]`, `[stack]`, individual `mmap` regions. The line `N0=12345 N1=67 anon=12412 dirty=12412` means 12,345 pages on node 0 and 67 on node 1 in that mapping; if it's a "should be all-node-1" mapping, you have a first-touch bug to find.
Reproduce this on your laptop
```shell
# 2-socket box recommended; single-socket boxes will only show node 0.
sudo apt install numactl libnuma-dev linux-tools-common
numactl --hardware                         # confirm 2+ nodes
# Both scripts use only the Python stdlib (ctypes); no pip installs needed.
sudo python3 numa_arena_demo.py            # local vs remote latency
sudo python3 sharded_vs_shared_lookup.py   # shared vs sharded throughput
# Look at the actual placement of any running process:
sudo numastat -p $(pgrep -f your-service)
sudo cat /proc/$(pgrep -f your-service)/numa_maps | head -20
# Run with jemalloc per-node arenas:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  MALLOC_CONF="narenas:2,metadata_thp:auto" \
  python3 sharded_vs_shared_lookup.py
```
A single-socket laptop runs the scripts but cannot show the local-vs-remote delta — the LOCAL and REMOTE cases will produce the same number, which is itself useful confirmation that the test environment is single-node. A 2-socket workstation, or any cloud *.metal instance, exercises the real placement and you'll see the 2-3× delta.
For cloud reproduction: AWS c7i.metal-48xl (Sapphire Rapids, 2-socket) on ap-south-1 rents at roughly ₹650/hour and runs both scripts unmodified. A 30-minute session is enough to see the full pattern: the local/remote delta in numa_arena_demo.py, the ~25% aggregate throughput delta in sharded_vs_shared_lookup.py, and the numastat -p columns that turn the abstract argument into the concrete numbers your perf dashboard would surface in a real incident.
Where this leads next
You now have the allocator and the data structures wired to the topology. The next chapters turn that into routine practice — measuring it, automating it, and surviving the cases where the topology itself shifts under you:
- /wiki/measuring-numa-effects-with-perf — `numa_hit`, `numa_miss`, `numa_foreign` as Prometheus dashboards; the recipe for catching a regression at 5% deviation instead of at the post-incident retro.
- /wiki/numa-balancing-and-page-migration — the kernel's automatic background mechanism that migrates hot pages closer to the CPUs that touch them; when to leave it on, when to disable it for latency-critical paths.
- /wiki/false-sharing-the-silent-killer — the cross-socket coherence traffic that sharding doesn't fix on its own; padding patterns that pair with NUMA-aware allocation.
The deeper habit is to treat placement as a first-class property of every allocation in a multi-socket service — a property that lives next to size and alignment in the type system of your mental model. Razorpay's matcher post-mortem from Karan's incident produced a 40-line internal coding convention that boils down to: "every long-lived allocation declares its target node; every pre-allocated pool initialises on the consumer thread; every shared data structure documents whether it is shared, sharded, or replicated, with rationale". The convention has caught two more would-be regressions in the year since. The discipline is the deliverable; the allocator and the syscalls are how you express it.
References
- Christoph Lameter, "NUMA: An Overview" (ACM Queue, 2013) — the foundational article on Linux NUMA, by the kernel developer who built much of the per-node allocator infrastructure.
- jemalloc documentation: arenas, `narenas`, and tcache — the canonical reference for per-arena tuning; the `narenas` and `MALLOC_CONF` knobs map directly to NUMA-aware patterns.
- Daniel Lemire, "How fast can your malloc go?" (2024) — a measurement-driven look at allocator throughput and locality on contemporary hardware.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 7 — the production NUMA chapter, including `numastat` interpretation and the `numa_balancing` trade-offs.
- Linux kernel `Documentation/admin-guide/mm/numa_memory_policy.rst` — the authoritative description of `mbind`, `set_mempolicy`, and `move_pages`, including the corner cases the libnuma wrappers paper over.
- Daan Leijen et al., "Mimalloc: Free List Sharding in Action" (Microsoft Research, 2019) — the cleanest modern allocator design paper; per-thread sharding generalises naturally to per-NUMA-node sharding.
- Kubernetes Topology Manager documentation — the orchestrator-level controls that determine whether your container even gets a chance to be NUMA-aware.
- /wiki/interconnects-qpi-upi-infinity-fabric — the previous chapter; the wire whose bandwidth budget makes all of this matter.