NUMA topology discovery
Aditi at Zerodha was debugging a 40 % p99 regression on the Kite order-matcher. The same binary, the same kernel, the same c6a.metal. Yesterday's box reported 2 NUMA nodes; today's reported 8. Nothing in the application had changed. What had changed was that BIOS-level NPS4 sub-NUMA clustering had been enabled by an automation playbook three nights ago, and the matcher's pin-mask — written for 2 nodes — was now scattering threads across 4 chiplets per socket. Her numactl --physcpubind=0-95 was no longer pinning to "socket 0"; it was pinning to "the first 96 logical CPUs the kernel happened to enumerate", which on this BIOS were spread across all 8 sub-nodes. The fix took 15 minutes. Finding the fix took her 11 hours, because she trusted her code's pin-mask without reading the topology.
NUMA topology discovery is the act of asking the running kernel — not your spec sheet, not the cloud-vendor docs — how many memory nodes exist, which CPUs belong to each, and how the firmware is grouping chiplets this boot. The tools are numactl --hardware, lstopo from hwloc, and the /sys/devices/system/node/ tree. Run them on every box your code touches, every time. The same hardware can look 2-node, 4-node, or 8-node depending on BIOS settings, and the wrong assumption costs you 1.5–3× latency on real workloads.
Why discovery is non-negotiable
The previous chapter (UMA vs NUMA: the architectural shift) ended with a script that walked /sys/devices/system/node. That was the cheapest discovery you can do — twenty lines of Python. The reason it matters is that the topology your kernel reports is not the topology you might assume from the SKU. The same dual-EPYC-9654 box can be presented to the OS as 2, 4, or 8 NUMA nodes depending on a BIOS setting called NPS (Nodes Per Socket); the same dual-Xeon-Platinum-8480 box can be 2 or 4 nodes depending on SNC (Sub-NUMA Clustering). The setting is firmware-level; your application sees the result and has to behave correctly under whichever shape it boots into today.
Three concrete production incidents tie this to money:
- Razorpay payment-routing, 2024: a fleet refresh upgraded BIOS firmware on 30 % of nodes; the new firmware changed the NPS default from 1 (one node per socket) to 4. The matcher's `numactl --interleave=0,1` now interleaved across two chiplets of one socket instead of across both sockets. p99 climbed from 11 ms to 19 ms across the upgraded fleet for three days, until someone ran `numactl --hardware` on a single host and noticed the 8-node output.
- Hotstar ingestion, IPL 2025 final: a Kubernetes node taint was applied to "single-socket" nodes for a memory-bound encoder pod. The taint logic checked `lscpu | grep Socket` and saw `1`, but the box was a dual-socket EPYC presented as 1 socket due to a BIOS misconfiguration. The pod ran, mistook 384 GB on socket 1 as local, and saturated xGMI for the duration of the final innings. ₹40 lakh of streaming-quality regression.
- Aadhaar UIDAI auth, 2023 capacity audit: the team's allocator config (`MALLOC_CONF=narenas:16`) was tuned for a 2-node SKU and shipped to a 4-socket replacement SKU as-is. Each arena now spanned multiple NUMA nodes, fragmenting allocations. p99 rose 22 % until the discovery script flagged the SKU change.
All three incidents share a structural cause: the operator's mental model of the hardware was static, but the actual kernel topology was dynamic. Discovery is the discipline of reading the box every time you deploy onto it, before you trust any tuning that depends on its shape.
`numactl --hardware` output is the only authoritative source for which mode is active right now.

Why the same hardware presents differently: NPS (Nodes Per Socket) on AMD and SNC (Sub-NUMA Clustering) on Intel tell the firmware how to group memory channels and chiplets into NUMA domains. NPS=1 lumps all 12 channels of a socket into one node — simplest, but the cross-chiplet L3 access cost is hidden from the scheduler. NPS=4 splits the 12 channels into 4 groups of 3, exposing intra-socket non-uniformity to the OS. The trade-off: NPS=1 is easier to program for and gives the OS less work, but lets bandwidth-hungry workloads land on a chiplet whose channels are saturated by another worker. NPS=4 lets workloads pin to a 3-channel slice they own, but multiplies the number of allocator arenas, scheduling domains, and NUMA-balancing decisions the kernel has to make. Production teams pick based on whether their workload partitions cleanly into chiplet-sized chunks (NPS=4 wins) or runs as one large shared dataset (NPS=1 wins).
The discovery toolkit, in order of use
Three tools cover 95 % of NUMA discovery. Run them in this order on any new box: numactl --hardware for the headline numbers, lstopo for the visual topology, /sys/devices/system/node/ for the raw kernel facts when something looks wrong.
numactl --hardware is the first call. It is shipped with the numactl package (Debian/Ubuntu/RHEL/Alpine all package it). The output is dense:
$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 96 97 98 99 100 101 102 103 104 105 106 107
node 0 size: 95876 MB
node 0 free: 89234 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 108 109 110 111 112 113 114 115 116 117 118 119
node 1 size: 95879 MB
node 1 free: 91102 MB
...
node 7 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 180 181 182 183 184 185 186 187 188 189 190 191
node 7 size: 95878 MB
node 7 free: 92107 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 11 12 12 32 32 32 32
1: 11 10 12 12 32 32 32 32
2: 12 12 10 11 32 32 32 32
3: 12 12 11 10 32 32 32 32
4: 32 32 32 32 10 11 12 12
5: 32 32 32 32 11 10 12 12
6: 32 32 32 32 12 12 10 11
7: 32 32 32 32 12 12 11 10
This is a 2-socket EPYC in NPS=4 mode (48 physical cores per socket, SMT on — 192 logical CPUs). Node 0 has CPUs 0-11 (the physical cores) and 96-107 (their SMT siblings). Local distance is 10. Adjacent-chiplet distance inside the same socket is 11–12. Cross-socket distance is 32. Nodes 0–3 are socket 0; nodes 4–7 are socket 1, and the matrix's block structure tells you that — the cross-socket entries form two solid 32-blocks, while intra-socket entries are 11s and 12s.
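That block structure can be recovered programmatically. A minimal sketch — `sockets_from_distances` and the distance cutoff are illustrative assumptions based on the conventional 10/11–12/32 encoding shown above, not a kernel guarantee:

```python
# Group NUMA nodes into sockets from the distance matrix alone.
# Heuristic (an assumption): intra-socket distances are small (< 20),
# cross-socket distances are large (>= 20). Union-find over node pairs.

def sockets_from_distances(dist: list[list[int]], cutoff: int = 20) -> list[list[int]]:
    """Union nodes whose mutual distance is below `cutoff`."""
    n = len(dist)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if dist[a][b] < cutoff:
                parent[find(a)] = find(b)

    groups: dict[int, list[int]] = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# The 8x8 matrix from the numactl output above:
dist = [
    [10, 11, 12, 12, 32, 32, 32, 32],
    [11, 10, 12, 12, 32, 32, 32, 32],
    [12, 12, 10, 11, 32, 32, 32, 32],
    [12, 12, 11, 10, 32, 32, 32, 32],
    [32, 32, 32, 32, 10, 11, 12, 12],
    [32, 32, 32, 32, 11, 10, 12, 12],
    [32, 32, 32, 32, 12, 12, 10, 11],
    [32, 32, 32, 32, 12, 12, 11, 10],
]
print(sockets_from_distances(dist))  # → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The same grouping falls out of `physical_package_id` in sysfs; deriving it from the matrix is a useful cross-check when the two disagree.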
Why the SMT siblings show up as 0 and 96 instead of 0 and 1: AMD's enumeration scheme assigns all physical cores first, then all SMT threads. So on a 192-thread box, CPUs 0–95 are physical cores, 96–191 are SMT threads, and CPU 96 is the SMT sibling of CPU 0. Intel's default enumeration is interleaved (CPU 0 and CPU 1 are siblings of the same core), and some BIOSes expose settings that change the enumeration order — so never hardcode either convention. Reading this wrong leads to the classic bug: on an interleaved-enumeration box, "I pinned 12 threads to CPUs 0–11 expecting 12 cores; I got 6 cores plus their 6 SMT siblings, half my threads contend on shared execution units, throughput drops 30 %". Always check `lscpu | grep -E 'Core|Socket|Thread'` and the `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` files when planning a pin-mask.
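The sibling files make a pin-mask safe against either enumeration scheme. A sketch — `one_cpu_per_core` is an illustrative helper, not part of numactl or hwloc, and the sibling strings below are hard-coded samples matching the AMD physical-first layout just described; on a live box you would read each CPU's `topology/thread_siblings_list`:

```python
# Build a pin list with one logical CPU per physical core, regardless
# of whether the box enumerates siblings as (0,96) or (0,1).

def one_cpu_per_core(siblings: dict[int, str]) -> list[int]:
    """siblings maps cpu -> its thread_siblings_list string, e.g. '0,96'."""
    chosen = set()
    for cpu, sib in siblings.items():
        group = sorted(int(x) for x in sib.split(","))
        chosen.add(group[0])          # keep the lowest-numbered sibling
    return sorted(chosen)

# Sample: node 0 of the AMD box above -- physical cores 0-11,
# SMT siblings 96-107, each file listing both halves of the pair.
sample = {c: f"{c},{c + 96}" for c in range(12)}
sample.update({c: f"{c - 96},{c}" for c in range(96, 108)})

print(one_cpu_per_core(sample))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

On an Intel-style interleaved box the same function would return the even CPUs — the caller never needs to know which scheme is in play.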
lstopo (from the hwloc package) is the second call. It produces a tree-or-graph view that makes the hierarchy clickable when run as lstopo (X11) or lstopo --of png topology.png (headless). On a remote box without X11, lstopo --of console prints an ASCII tree:
$ lstopo --of console
Machine (768GB total)
Package L#0
NUMANode L#0 (P#0 96GB)
L3 L#0 (32MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#96)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#97)
... (12 cores per L3 — one chiplet)
NUMANode L#1 (P#1 96GB)
L3 L#1 (32MB)
...
Package L#1
NUMANode L#4 (P#4 96GB)
...
The tree mirrors the cache hierarchy: Package (socket) → NUMANode → L3 → L2/L1d/L1i → Core → PU (logical CPU). lstopo reads from the same /sys files as numactl but joins them with cache topology, so you can see exactly which cores share an L3 (the chiplet boundary), which cores share an L2 (none on x86, but ARM Neoverse shares L2 across pairs), and how SMT siblings hang off cores.
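The chiplet boundary lstopo draws is also readable directly from sysfs: each CPU's `/sys/devices/system/cpu/cpu<N>/cache/index3/shared_cpu_list` names every CPU sharing its L3. A sketch of grouping CPUs into chiplets that way — the strings here are hard-coded samples shaped like the NPS=4 box above, and `chiplets` is an illustrative helper:

```python
# Group CPUs into chiplets by shared L3, from the
# cache/index3/shared_cpu_list strings sysfs exposes per CPU.

def parse_list(s: str) -> list[int]:
    """Parse the kernel cpulist format '0-11,96-107' into a flat list."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = (int(x) for x in chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def chiplets(l3_shared: dict[int, str]) -> list[list[int]]:
    """Dedupe identical L3 groups; each surviving group is one chiplet."""
    seen: dict[tuple, None] = {}
    for cpu, s in l3_shared.items():
        seen[tuple(parse_list(s))] = None
    return [list(g) for g in seen]

# Two chiplets' worth of sample strings (cores + their SMT siblings):
sample = {**{c: "0-11,96-107" for c in range(12)},
          **{c: "12-23,108-119" for c in range(12, 24)}}
print(len(chiplets(sample)), chiplets(sample)[0][:4])  # → 2 [0, 1, 2, 3]
```

This is the programmatic version of "which cores share an L3" that the lstopo tree shows visually.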
The third call is the raw sysfs tree. Every numactl --hardware and lstopo value comes from one of these files; reading them yourself is what you do when the higher-level tool reports something surprising:
/sys/devices/system/node/node0/
cpulist # "0-11,96-107" — CPUs in this node
cpumap # bitmask form of cpulist
distance # space-separated distances to all nodes
meminfo # MemTotal/MemFree/Cached for this node only
numastat # numa_hit / numa_miss counters
hugepages/ # per-node hugepage pools (1G, 2M)
vmstat # per-node VM statistics
compact # write 1 to compact this node's memory
The /sys/devices/system/cpu/cpu<N>/topology/ tree carries the inverse mapping — given a CPU, which package, core, and thread group:
/sys/devices/system/cpu/cpu5/topology/
physical_package_id # "0" — which socket
core_id # "5" — which physical core within socket
thread_siblings_list # "5,101" — all CPUs sharing this physical core
core_cpus_list # same thing on newer kernels
cluster_id           # "-1" on x86; on ARMv8.4+, the CPU's cluster
Together, the two trees let you answer any topology question without trusting an SKU spec sheet. The next section turns that into a Python script.
A reusable discovery script
Every tuning decision your code makes — pinning threads, allocating with mbind, sharding a hash table — depends on the topology you discover at startup. Hardcoding the topology in config (workers_per_socket: 96) is the bug. Reading it at startup is the fix.
# numa_discover.py
# Read /sys/devices/system/node and /sys/devices/system/cpu to build a
# topology map the application can branch on. Designed for production:
# returns a dict, not stdout. Tested on EPYC 9654, Xeon Platinum 8480,
# Graviton 3, and a single-socket laptop (1 node).
import json
import os
import re
from pathlib import Path
NODE_ROOT = Path("/sys/devices/system/node")
CPU_ROOT = Path("/sys/devices/system/cpu")
def parse_list(s: str) -> list[int]:
"""Parse '0-11,96-107' into [0,1,...,11,96,...,107]."""
out: list[int] = []
for chunk in s.strip().split(","):
if "-" in chunk:
lo, hi = (int(x) for x in chunk.split("-"))
out.extend(range(lo, hi + 1))
elif chunk:
out.append(int(chunk))
return out
def read_text(p: Path, default: str = "") -> str:
try:
return p.read_text().strip()
except FileNotFoundError:
return default
def discover() -> dict:
if not NODE_ROOT.exists():
return {"nodes": 1, "topology": "non-numa", "cpus": list(range(os.cpu_count() or 1))}
nodes: list[dict] = []
for nd in sorted(NODE_ROOT.glob("node[0-9]*"),
key=lambda p: int(re.search(r"node(\d+)$", p.name).group(1))):
n = int(re.search(r"node(\d+)$", nd.name).group(1))
cpus = parse_list(read_text(nd / "cpulist"))
distance = [int(x) for x in read_text(nd / "distance").split()]
# MemTotal lives in /sys/devices/system/node/nodeN/meminfo
meminfo = read_text(nd / "meminfo")
m = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo)
mem_mb = int(m.group(1)) // 1024 if m else 0
nodes.append({"node": n, "cpus": cpus, "distance": distance, "mem_mb": mem_mb})
# Group nodes into sockets via physical_package_id
socket_of_cpu: dict[int, int] = {}
for n in nodes:
for c in n["cpus"]:
pkg = read_text(CPU_ROOT / f"cpu{c}" / "topology" / "physical_package_id")
if pkg:
socket_of_cpu[c] = int(pkg)
sockets = sorted(set(socket_of_cpu.values()))
nodes_per_socket = {s: [] for s in sockets}
for n in nodes:
s = socket_of_cpu.get(n["cpus"][0]) if n["cpus"] else None  # guard: memory-only nodes have no CPUs
if s is not None:
nodes_per_socket[s].append(n["node"])
return {
"nodes": len(nodes),
"sockets": len(sockets),
"nodes_per_socket": nodes_per_socket,
"topology": "numa" if len(nodes) > 1 else "uma",
"details": nodes,
}
if __name__ == "__main__":
t = discover()
print(json.dumps(t, indent=2))
print(f"\n→ {t['nodes']} NUMA nodes across {t.get('sockets', 1)} socket(s)")
if t["topology"] == "numa":
nps = t["nodes"] // t.get("sockets", 1)
print(f"→ NPS / SNC mode appears to be {nps} (nodes per socket)")
Sample run on the same 2-socket EPYC in NPS=4:
$ python3 numa_discover.py
{
"nodes": 8,
"sockets": 2,
"nodes_per_socket": {"0": [0, 1, 2, 3], "1": [4, 5, 6, 7]},
"topology": "numa",
"details": [
{"node": 0, "cpus": [0, 1, ..., 11, 96, ..., 107], "distance": [10, 11, 12, 12, 32, 32, 32, 32], "mem_mb": 95876},
...
]
}
→ 8 NUMA nodes across 2 socket(s)
→ NPS / SNC mode appears to be 4 (nodes per socket)
The walkthrough on the parts that matter most:
- `parse_list` turns the `cpulist` format ("0-11,96-107") into a flat Python list. The format is the kernel's standard for any CPU set — `cpuset.cpus`, `irq_affinity_list`, the cgroup v2 `cpuset.cpus.effective` file. Knowing this parser by heart saves an hour the first time you debug a misbehaving cgroup.
- `distance = [int(x) for x in read_text(nd / "distance").split()]` loads this node's row of the distance matrix. The matrix is symmetric on x86, so the row from node 0 tells you everything about node 0's relations. On some ARM topologies it isn't symmetric — the link from A to B is faster than B to A — and you need both rows.
- `physical_package_id` tells you which socket each CPU lives on. The kernel's NUMA node ID and socket ID are not the same thing — in NPS=4, sockets are 0/1 but nodes are 0–7. Conflating them is the root of half of all NUMA misconfigurations. Why this matters in production: when you write `numactl --cpunodebind=0`, you bind to node 0, which on NPS=4 is just one chiplet-sized slice (24 logical CPUs out of 192). When you write `taskset -c 0-95`, you bind to the first 96 logical CPUs — and with AMD's physical-first enumeration those are the physical cores of both sockets, scattered across all 8 nodes, not socket 0. The two commands are not equivalent. Aditi's bug at the top of this chapter was exactly this confusion — her pin-mask assumed the first 96 logical CPUs meant socket 0, but on the new BIOS they spanned every sub-node on the box.
- `nodes_per_socket` is the headline interpretation: how many NUMA nodes the BIOS is exposing per physical socket. If this is 1, you're in the simple 2-node-per-2-socket world. If it is 2, 4, or 8, you're in chiplet-aware territory and your pin-masks need to match.
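Because a contiguous CPU range is never a safe socket pin, the socket's CPU list should be computed from discovery output. A sketch against the dict shape `discover()` returns above — `socket_cpus` is an illustrative helper, and the 4-node sample topology is hypothetical:

```python
# Derive a socket-level CPU list from the discover() output shape:
# 'nodes_per_socket' maps socket -> node IDs, 'details' carries each
# node's cpulist. This is what "pin to socket 0" should be computed
# from -- never from a contiguous taskset range.

def socket_cpus(topo: dict, socket: int) -> list[int]:
    node_ids = set(topo["nodes_per_socket"][socket])
    cpus: list[int] = []
    for nd in topo["details"]:
        if nd["node"] in node_ids:
            cpus.extend(nd["cpus"])
    return sorted(cpus)

# Hypothetical 2-socket, 4-node box with interleaved-style SMT IDs:
topo = {
    "nodes_per_socket": {0: [0, 1], 1: [2, 3]},
    "details": [
        {"node": 0, "cpus": [0, 1, 8, 9]},
        {"node": 1, "cpus": [2, 3, 10, 11]},
        {"node": 2, "cpus": [4, 5, 12, 13]},
        {"node": 3, "cpus": [6, 7, 14, 15]},
    ],
}
print(socket_cpus(topo, 0))  # → [0, 1, 2, 3, 8, 9, 10, 11]
```

The resulting list can feed `numactl --physcpubind=...` or `os.sched_setaffinity`, and stays correct whichever NPS mode the box booted into.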
The script is 60 lines, has no third-party dependencies, and runs in under 50 ms. Ship it as part of your service's startup. Log the resulting nodes, sockets, and nodes_per_socket fields. Alert when those numbers change between deploys — a node that booted with 2 NUMA nodes yesterday and 8 today is the signal Aditi needed at hour 1, not hour 11.
Reading lstopo for the picture you cannot get from numactl
numactl --hardware gives you the rectangles. lstopo gives you the picture, with cache hierarchy joined to NUMA. For visual inspection it is the single most useful systems-performance tool you don't yet have installed.
The visual makes three things obvious that text-form numactl hides. First, the chiplet boundary is the L3 boundary: cores within one chiplet share an L3 of 32 MB; a load from another chiplet in the same NUMA node still has to cross a memory-fabric link (Infinity Fabric) and pays an additional 5–10 ns. Second, the NUMA-node-to-chiplet ratio (3 chiplets per node in NPS=4) means a pin-mask of "one chiplet" is not the same as "one NUMA node" — pinning to chiplet 0 only gives you 1/3 of the node's memory bandwidth budget. Third, the socket boundary is the place where xGMI (or UPI on Intel) crosses, and is the only place where the distance jumps from 11/12 to 32.
Run lstopo --of png topology.png once on every cloud SKU you deploy on, save the PNG to your wiki, and refer back to it when you read flamegraphs. The visual recall is faster than re-deriving the topology from numactl text every incident.
lstopo also has an underused flag: lstopo --output-format xml produces machine-readable XML that includes hwloc's full graph (cache associativities, memory-channel widths, PCIe device locality). Production ops teams at PhonePe and Hotstar feed this XML into their config-management pipeline and assert that BIOS settings match the expected NPS / SNC / hyperthreading shape before promoting a node into production traffic. The check is one xmllint --xpath away; the cost of skipping it is the kind of incident the chapter opened with.
Cloud-SKU quirks and the BIOS settings that flip them
The cloud SKUs Indian engineering teams meet most often have specific NPS / SNC defaults that differ from on-prem builds. Knowing them saves the first 30 minutes of every cloud-migration NUMA debug.
| Cloud SKU | Hardware | Default mode | Discovery surprise |
|---|---|---|---|
| AWS c6a.metal | Dual EPYC 7R13 | NPS=1 (2 nodes) | `numactl --hardware` shows 2 nodes; the AMD spec sheet implies 8 chiplets per socket, but those are L3-only, not NUMA |
| AWS c7a.metal-48xl | Dual EPYC 9R14 | NPS=4 (8 nodes) | Same SKU family as c6a but 4× the node count — pin-masks must change |
| AWS r7iz.metal-32xl | Dual Sapphire Rapids | SNC=2 (4 nodes) | Intel's default is SNC off (2 nodes); AWS enabled SNC=2 silently in late 2023 |
| Azure HBv4 | Dual EPYC 9V33X | NPS=4 (8 nodes) | HPC-targeted SKU; documented on the SKU page |
| GCP c3-standard-176 | Dual Sapphire Rapids | SNC=off (2 nodes) | GCP keeps SNC off for predictability |
| GCP c3d-standard-180 | Dual EPYC 9B14 | NPS=2 (4 nodes) | Genoa-X variant; 4 nodes is the GCP default |
| Oracle BM.Standard.E5.192 | Dual EPYC 9J14 | NPS=4 (8 nodes) | Oracle exposes the maximum granularity |
The on-prem story is messier. Razorpay's bare-metal fleet runs Supermicro EPYC 9654 boards; the BIOS default from Supermicro is NPS=1, but the server-image build pipeline flips it to NPS=4 in IPMI before deployment. PhonePe's bare-metal Xeon Platinum 8480+ boards default to SNC=off; their build pipeline leaves it that way except on the 8 boxes designated for HFT-style market-data ingestion, which run SNC=2. Both teams maintain a config-management assertion that BIOS state matches expected mode before any production traffic lands on the node.
# Quick BIOS-mode check on a running Linux box (no reboot required):
$ numactl --hardware | head -1 # gives the node count
$ lscpu | grep Socket # gives the socket count
$ python3 -c "
import json, subprocess
out = subprocess.check_output(['numactl','--hardware']).decode()
nodes = int([l for l in out.splitlines() if l.startswith('available:')][0].split()[1])
sockets = int(subprocess.check_output(['lscpu']).decode().split('Socket(s):')[1].split()[0])
print(f'NPS/SNC mode: {nodes // sockets} nodes per socket ({nodes} total / {sockets} sockets)')
"
When the result of that 5-line check changes between two deploys of the same image, your config-management pipeline must fail the deploy. The cost of treating BIOS settings as static is the cost of the incidents at the top of this chapter — six-figure-rupee outages, weeks of misattributed root cause.
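That fail-the-deploy rule can be reduced to a few lines. A minimal sketch — `check_topology` and the manifest field names are illustrative, not an existing tool:

```python
# Fail-fast topology guard: compare the discovered shape against the
# deploy manifest's expectation and report every mismatch.

def check_topology(discovered: dict, expected: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means OK."""
    problems = []
    for key in ("nodes", "sockets"):
        if discovered.get(key) != expected.get(key):
            problems.append(
                f"{key}: expected {expected.get(key)}, kernel reports {discovered.get(key)}"
            )
    return problems

# Yesterday's manifest said 2 nodes; today's kernel says 8 (NPS flipped):
issues = check_topology({"nodes": 8, "sockets": 2}, {"nodes": 2, "sockets": 2})
print(issues)  # → ['nodes: expected 2, kernel reports 8']
```

Wire the discovered values from `numa_discover.py` into `discovered`, the deploy manifest into `expected`, and exit non-zero when the list is non-empty — that is the check that would have turned Aditi's 11 hours into 15 minutes.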
The dmidecode route gives you BIOS strings without numactl:
$ sudo dmidecode -t bios | grep -i 'Vendor\|Version\|Release Date'
$ sudo dmidecode -t processor | grep -i 'Version\|Voltage'
But dmidecode does not report NPS / SNC directly — those are runtime-only properties of how the firmware advertised the SRAT (System Resource Affinity Table) to the kernel. The kernel parses SRAT into the /sys/devices/system/node/ tree at boot. To read SRAT directly: cat /sys/firmware/acpi/tables/SRAT | hexdump -C (or use the acpidump tool to parse it). Most operators do not go this deep — they trust numactl --hardware — but knowing SRAT is the source of truth helps when SRAT-vs-kernel disagreements happen (rare; usually a kernel bug after a CPU hotplug event).
Common confusions
- "`numactl --hardware` node count = socket count." No. On an NPS=1 / SNC=off box it does. On any box with sub-NUMA clustering enabled, the node count is sockets × NPS. A 2-socket EPYC in NPS=4 reports 8 nodes; a 4-socket Xeon in SNC=2 reports 8 nodes. Always cross-check with `lscpu | grep Socket` to separate "how many physical sockets" from "how many NUMA nodes the kernel sees".
- "`numactl --cpunodebind=N` binds to socket N." Only when nodes are sockets. When NPS / SNC > 1, `cpunodebind=0` binds to one sub-node's worth of CPUs — typically 1/4 or 1/8 of a socket. To bind to a whole socket on NPS=4, write `numactl --cpunodebind=0,1,2,3`. The flag operates on node IDs, not socket IDs; the kernel does not expose a "bind to socket" alternative. And do not reach for a contiguous `taskset -c` range as a socket pin: CPU numbering is an enumeration artifact, not a socket boundary, and `taskset` does not bind memory allocations the way `numactl` does. Derive the socket's CPU list from `physical_package_id` instead.
- "hwloc and numactl read different sources." They don't. Both read `/sys/devices/system/node/` and `/sys/devices/system/cpu/`. hwloc adds richer parsing (cache associativity, PCIe locality, GPU topology via `nvidia-smi topo`) but the NUMA shape comes from the same kernel files. If `lstopo` and `numactl --hardware` disagree on node count, you have a kernel bug — file it.
- "Discovery is a startup-time concern." Not always. Modern kernels support CPU hotplug (`echo 0 > /sys/devices/system/cpu/cpu5/online`), and cloud workload-migration features can live-migrate a guest onto a host with different NPS settings. Long-running services should re-discover topology periodically (every 60 s) or subscribe to udev events on `/sys/devices/system/node/` and react. Production teams at Razorpay log a topology snapshot every minute as a sanity check; the day it changes mid-run is the day they catch the live-migrate.
- "Discovery doesn't matter on a single-socket box." Single-socket EPYC parts (the 1-socket SKUs with 96 cores, 12 chiplets) still have NPS modes. NPS=4 on a single socket gives you 4 NUMA nodes within one physical CPU. Workloads written for the "single socket = single node" assumption will scatter across chiplets and lose 10–15 % throughput before they notice. Discovery applies whenever the kernel reports more than one node, regardless of how many sockets are physically present.
- "AWS Nitro hides NUMA from the guest." Only on small instance types. The `metal` variants and any instance with more than ~64 vCPUs expose the host's NUMA shape directly to the guest. `c6a.metal` (192 vCPU, 2-socket) shows 2 nodes in `numactl --hardware`; `c7a.metal-48xl` shows 8. Smaller VM-shaped instances (e.g. `c6a.4xlarge`, 16 vCPU) usually live entirely inside one NUMA node and report 1 — but verify, don't assume; very small SKUs sometimes straddle a NUMA boundary if the host is fragmented.
Going deeper
How the kernel builds the topology — SRAT, SLIT, MSCT
At boot, the BIOS hands the kernel three ACPI tables that together describe NUMA: SRAT (System Resource Affinity Table) maps CPUs and memory ranges to NUMA proximity domains; SLIT (System Locality Information Table) gives the distance matrix; MSCT (Maximum System Characteristics Table) describes the maximum possible expansion (used by virtualisation hypervisors). The kernel parses these in arch/x86/mm/numa.c (numa_init → acpi_numa_init) and populates the /sys/devices/system/node/ tree from the parsed structures.
The implication for discovery is that everything numactl shows came from the BIOS via SRAT/SLIT, with one parsing step in between. When the kernel disagrees with reality (very rare), the cause is usually a buggy SRAT — fixed by a BIOS update. When two kernels on the same hardware report different topologies, the cause is a BIOS version difference (the SRAT-emit logic changed between BIOS releases). The dmesg | grep -i numa log lines on boot show the parsing step; saving these logs alongside topology snapshots pays off in long-tail debugging.
NUMA on ARM and the Graviton question
ARM servers (AWS Graviton 3/4, Ampere Altra) use the same Linux kernel infrastructure (/sys/devices/system/node/) and the same discovery tools. The differences are subtle: ARM's interconnect is CMN-700 (Coherent Mesh Network), not Infinity Fabric or UPI; distance values come out smaller (typically 10/20 instead of 10/32) because the ARM mesh has lower remote latency than x86 inter-socket links; and ARM exposes a cluster layer in /sys/devices/system/cpu/cpu*/topology/cluster_id that x86 leaves at -1.
Graviton 3 (c7g.16xlarge) is single-socket, 64-core, and reports 1 NUMA node — but the cluster layer reveals that cores are grouped in pairs sharing L2. Graviton 4 (c8g.metal-48xl) is dual-socket, 96 cores per socket, 192 total, and reports 2 nodes by default. AWS does not expose ARM sub-NUMA equivalents on Graviton. Discovery on ARM is simpler and more uniform than on x86 — but the same discipline applies.
Discovery in containers and Kubernetes
Docker containers without --cap-add=SYS_ADMIN and without bind-mounting /sys/devices/system/node from the host see only what the cgroup gave them. The default bridge-network container on a NUMA host sees the host's full topology in /sys/devices/system/node/ (read-only) but is constrained by the cgroup's cpuset.cpus and cpuset.mems to a subset. Reading /sys/fs/cgroup/cpuset.cpus.effective and /sys/fs/cgroup/cpuset.mems.effective tells you which CPUs and nodes the container is allowed to use.
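The same cpulist parser from the discovery script works on the cgroup files, which lets a containerised service compute which nodes it can actually use. A sketch — the strings are hard-coded samples; on a real box they would come from `/sys/devices/system/node/node*/cpulist` and `/sys/fs/cgroup/cpuset.cpus.effective`:

```python
# Intersect the host's per-node cpulists with the cgroup's effective
# cpuset to see which NUMA nodes this container can actually run on.

def parse_list(s: str) -> list[int]:
    """Parse the kernel cpulist format '0-11,96-107' into a flat list."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = (int(x) for x in chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

# Sample data: two host nodes, and a cgroup restricted to CPUs 4-9.
node_cpulists = {0: "0-11,96-107", 1: "12-23,108-119"}
effective = set(parse_list("4-9"))

usable = {n: sorted(effective & set(parse_list(s)))
          for n, s in node_cpulists.items()}
print(usable)  # → {0: [4, 5, 6, 7, 8, 9], 1: []}
```

Here the cgroup confines the container entirely to node 0 — any `--cpunodebind=1` inside it would conflict with the cpuset and fail.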
Kubernetes Topology Manager (--topology-manager-policy=single-numa-node) pins pods to a single NUMA node. The discovery dance is: the kubelet reads the host topology at startup, matches pod resource requests against per-node availability at admission, and the runtime sets cpuset.cpus / cpuset.mems on the cgroup. Production teams at Hotstar enable single-numa-node for their FFmpeg-encoder pods (memory-bandwidth-heavy, single-threaded-per-pod, perfect for chiplet-pinning) and leave it disabled for their Java services, which are allowed to spill across multiple nodes via --interleave=all.
Discovery on GPUs — nvidia-smi topo -m
GPUs are NUMA too. An 8×H100 box exposes 8 separate HBM stacks, each closer to one GPU than to others, with NVLink links among them. nvidia-smi topo -m prints the GPU-to-GPU and GPU-to-NIC affinity matrix:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity
GPU0 X NV18 NV18 NV18 PXB 0-31 0
GPU1 NV18 X NV18 NV18 PXB 0-31 0
GPU2 NV18 NV18 X NV18 SYS 32-63 1
GPU3 NV18 NV18 NV18 X SYS 32-63 1
The NV18 entries indicate 18 NVLink lanes between GPU pairs (full bandwidth); PXB is "across PCIe host bridges" (slower); SYS is "across NUMA sockets" (slowest). The NUMA Affinity column shows which CPU NUMA node each GPU is closest to — the data-loading pipeline should run on that node. Treating discovery as CPU-only on a GPU box is leaving 30–50 % of training throughput on the floor.
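The GPU→node mapping can be scraped straight from that matrix. A parsing sketch — real `nvidia-smi topo -m` output varies by driver version and GPU count, so the sample rows are hard-coded from the output above and the column handling is an assumption:

```python
# Map each GPU to its closest CPU NUMA node by reading the last column
# (NUMA Affinity) of the nvidia-smi topo -m rows shown above.

sample = """\
GPU0 X NV18 NV18 NV18 PXB 0-31 0
GPU1 NV18 X NV18 NV18 PXB 0-31 0
GPU2 NV18 NV18 X NV18 SYS 32-63 1
GPU3 NV18 NV18 NV18 X SYS 32-63 1
"""

gpu_node: dict[str, int] = {}
for line in sample.splitlines():
    fields = line.split()
    if fields and fields[0].startswith("GPU"):
        gpu_node[fields[0]] = int(fields[-1])   # NUMA Affinity column

print(gpu_node)  # → {'GPU0': 0, 'GPU1': 0, 'GPU2': 1, 'GPU3': 1}
```

A data-loading pipeline would use this map to start GPU2's and GPU3's loader processes with `numactl --cpunodebind=1 --membind=1`, keeping host-side staging buffers on the HBM-adjacent socket.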
Reproduce this on your laptop
# On any Linux box, single-socket or dual-socket:
sudo apt install numactl hwloc linux-tools-common
numactl --hardware # the headline output
lstopo --of console | head -50 # ASCII topology tree
lstopo --of png topology.png       # save the visual
# Set up the discovery script:
# numa_discover.py uses only the stdlib — no venv or pip install needed
python3 numa_discover.py | python3 -m json.tool
# Watch topology in real time (1 second poll):
watch -n 1 'numactl --hardware | head -3; echo; cat /sys/devices/system/node/node0/numastat'
# Compare to your CPU's enumerated topology:
lscpu | grep -E '^(Architecture|CPU\(s\)|Socket|Core|Thread|NUMA)'
cat /proc/cpuinfo | grep 'physical id' | sort -u
A laptop is single-node; the output is short. Cloud metal instances are 2- to 8-node; the output is long. Both teach the same discipline: trust the kernel, not the spec sheet.
Where this leads next
Discovery is the prerequisite. Once you know the shape, the next chapters build the levers and the measurement loops on top:
- /wiki/numactl-and-memory-binding — using the discovered node IDs to pin CPUs (`--cpunodebind`) and bind memory (`--membind`, `--interleave`, `--preferred`). The user-space mechanics for putting pages where you want them.
- /wiki/interconnects-qpi-upi-infinity-fabric — the wires between the nodes you just discovered: bandwidth budgets, coherence overhead, link-saturation symptoms readable through `perf stat`.
- /wiki/measuring-numa-effects-with-perf — turning the discovered topology into perf-counter expectations: `perf stat -e numa_miss/numa_hit` on a workload tells you whether your pin-masks are working as designed.
The deeper habit is to treat topology discovery as something you do at every deploy boundary — every new SKU, every BIOS update, every kernel bump. Aditi's incident at Zerodha was caused by trusting yesterday's topology after a BIOS automation playbook ran. The discovery script she now runs at service startup, with a hard fail if nodes_per_socket differs from the deploy manifest, would have caught the change in the first millisecond instead of the 11th hour.
The 30-second discipline: numactl --hardware, lscpu | grep Socket, lstopo --of console | head. Three commands, every box, every time. The cost is 30 seconds; the cost of skipping it is the rest of this part of the curriculum trying to debug a workload whose topology assumption was wrong.
References
- Linux kernel documentation — NUMA — the canonical description of how the kernel populates `/sys/devices/system/node/` from ACPI SRAT and SLIT.
- hwloc (Portable Hardware Locality) project — `lstopo`'s home; documentation includes the full topology object hierarchy and the XML schema.
- AMD EPYC 9004 Series Architecture Overview — NPS modes (NPS=1/2/4), chiplet-to-channel mapping, recommended modes per workload class.
- Intel Xeon Scalable (Sapphire Rapids) — Sub-NUMA Cluster modes — SNC=Off/2/4 behaviour, distance values, BIOS configuration paths.
- `numactl` manual page (Linux) — flag reference; the section on `--cpunodebind` vs `--physcpubind` is essential.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 (Memory) for production-perspective NUMA debugging; Appendix C lists the discovery commands every operator should know.
- ACPI Specification — System Resource Affinity Table — the firmware-side source of truth that the Linux kernel parses into NUMA topology.
- /wiki/uma-vs-numa-the-architectural-shift — the previous chapter; the architectural pivot this chapter teaches you to read.