NUMA topology discovery

Aditi at Zerodha was debugging a 40% p99 regression on the Kite order-matcher. The same binary, the same kernel, the same c6a.metal. Yesterday's box reported 2 NUMA nodes; today's reported 8. Nothing in the application had changed. What had changed was that NPS=4 (four NUMA nodes per socket) had been enabled in the BIOS by an automation playbook three nights ago, and the matcher's pin-mask, written for 2 nodes, no longer matched the 8-node shape. Her numactl --physcpubind=0-95 was no longer pinning to "socket 0"; it was pinning to "the first 96 logical CPUs the kernel happened to enumerate", which on this BIOS were spread across all 8 sub-nodes. The fix took 15 minutes. Finding the fix took her 11 hours, because she trusted her code's pin-mask without reading the topology.

NUMA topology discovery is the act of asking the running kernel — not your spec sheet, not the cloud-vendor docs — how many memory nodes exist, which CPUs belong to each, and how the firmware is grouping chiplets this boot. The tools are numactl --hardware, lstopo from hwloc, and the /sys/devices/system/node/ tree. Run them on every box your code touches, every time. The same hardware can look 2-node, 4-node, or 8-node depending on BIOS settings, and the wrong assumption costs you 1.5–3× latency on real workloads.

Why discovery is non-negotiable

The previous chapter (UMA vs NUMA: the architectural shift) ended with a script that walked /sys/devices/system/node. That was the cheapest discovery you can do — twenty lines of Python. The reason it matters is that the topology your kernel reports is not the topology you might assume from the SKU. The same dual-EPYC-9654 box can be presented to the OS as 2, 4, or 8 NUMA nodes depending on a BIOS setting called NPS (Nodes Per Socket); the same dual-Xeon-Platinum-8480 box can be 2 or 4 nodes depending on SNC (Sub-NUMA Clustering). The setting is firmware-level; your application sees the result and has to behave correctly under whichever shape it boots into today.

Production incidents like Aditi's share a structural cause: the operator's mental model of the hardware was static, but the actual kernel topology was dynamic. Discovery is the discipline of reading the box every time you deploy onto it, before you trust any tuning that depends on its shape.

[Figure: The same physical box, three BIOS settings, three kernel topologies. Three side-by-side panels of a dual-EPYC-9654. Left: NPS=1, the kernel sees 2 nodes of 96 CPUs (one per socket, all 12 chiplets; distances 10/32), the simplest mental model. Middle: NPS=2, 4 nodes of 48 CPUs (6 chiplets each; distances 10/12/32), intra-socket halves. Right: NPS=4, 8 nodes of 24 CPUs (3 chiplets each; distances 10/11/12/32), the finest granularity and the most allocator pressure.]
One physical box, three BIOS configurations, three kernel-visible topologies. Illustrative — based on AMD EPYC 9004-series NPS modes documented in the AMD architecture overview. Your numactl --hardware output is the only authoritative source of which mode is active right now.

Why the same hardware presents differently: NPS (Nodes-Per-Socket on AMD; SNC / Sub-NUMA Clustering on Intel) tells the firmware how to group memory channels and chiplets into NUMA domains. NPS=1 lumps all 12 channels of a socket into one node — simplest, but the cross-chiplet L3 access cost is hidden from the scheduler. NPS=4 splits the 12 channels into 4 groups of 3, exposing intra-socket non-uniformity to the OS. The trade-off: NPS=1 is easier to program for and gives the OS less work, but lets bandwidth-hungry workloads land on a chiplet whose channels are saturated by another worker. NPS=4 lets workloads pin to a 3-channel slice they own, but multiplies the number of allocator arenas, scheduling domains, and NUMA-balancing decisions the kernel has to make. Production teams pick based on whether their workload partitions cleanly to chiplet-sized chunks (NPS=4 wins) or runs as one large shared dataset (NPS=1 wins).

The discovery toolkit, in order of use

Three tools cover 95 % of NUMA discovery. Run them in this order on any new box: numactl --hardware for the headline numbers, lstopo for the visual topology, /sys/devices/system/node/ for the raw kernel facts when something looks wrong.

numactl --hardware is the first call. It is shipped with the numactl package (Debian/Ubuntu/RHEL/Alpine all package it). The output is dense:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
node 0 size: 95876 MB
node 0 free: 89234 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 1 size: 95879 MB
node 1 free: 91102 MB
...
node 7 cpus: 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383
node 7 size: 95878 MB
node 7 free: 92107 MB
node distances:
node    0    1    2    3    4    5    6    7
  0:   10   11   12   12   32   32   32   32
  1:   11   10   12   12   32   32   32   32
  2:   12   12   10   11   32   32   32   32
  3:   12   12   11   10   32   32   32   32
  4:   32   32   32   32   10   11   12   12
  5:   32   32   32   32   11   10   12   12
  6:   32   32   32   32   12   12   10   11
  7:   32   32   32   32   12   12   11   10

This is a 2-socket EPYC 9654 in NPS=4 mode. Node 0 has CPUs 0-23 (the physical cores of its three chiplets) and 192-215 (their SMT siblings). Local distance is 10. Adjacent-node distance inside the same socket is 11–12. Cross-socket distance is 32. Nodes 0–3 are socket 0; nodes 4–7 are socket 1, and the matrix's block structure tells you that — the cross-socket entries form two solid 32-blocks, while intra-socket entries are 11s and 12s.

Why the SMT siblings show up as 0 and 192 instead of 0 and 1: AMD's enumeration scheme assigns all physical cores first, then all hyperthreads. So on this 384-thread box, CPUs 0–191 are physical, 192–383 are hyperthreads, and CPU 192 is the SMT sibling of CPU 0. Intel's default enumeration is interleaved (CPU 0 and CPU 1 are siblings of the same core), though some BIOSes offer a sequential-enumeration option. Reading this wrong leads to the classic bug: "I pinned 12 threads to CPUs 0–11 expecting 12 cores; I got 6 cores plus 6 SMT siblings, half my threads contend on shared execution units, throughput drops 30 %". Always check lscpu | grep -E 'Core|Socket|Thread' and the /sys/devices/system/cpu/cpu*/topology/thread_siblings_list files when planning a pin-mask.
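A pin-mask planner can guard against the sibling trap mechanically. A minimal sketch: it parses thread_siblings_list contents and keeps one logical CPU per physical core. The sibling strings below are illustrative samples (one AMD-style physical-first set, one Intel-style interleaved set); on a real box you would feed it the contents of each /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.

```python
# pin_physical.py — keep one logical CPU per physical core when building a
# pin-mask, so SMT siblings never sneak in. Sample sibling strings below are
# illustrative: AMD-style (siblings far apart) and Intel-style (interleaved).

def parse_cpu_list(s: str) -> list[int]:
    """Parse '0-3,192' into [0, 1, 2, 3, 192]."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = map(int, chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def physical_cpus(sibling_lists: list[str]) -> list[int]:
    """One entry per thread_siblings_list; keep the lowest CPU of each core."""
    seen: set[tuple[int, ...]] = set()
    picks: list[int] = []
    for s in sibling_lists:
        sibs = tuple(sorted(parse_cpu_list(s)))
        if sibs not in seen:
            seen.add(sibs)
            picks.append(sibs[0])
    return sorted(picks)

amd_style = ["0,192", "1,193", "192,0", "193,1"]   # physical cores first
intel_style = ["0,1", "1,0", "2,3", "3,2"]         # interleaved siblings

print(physical_cpus(amd_style))    # [0, 1]
print(physical_cpus(intel_style))  # [0, 2]
```

The resulting list can be handed straight to os.sched_setaffinity; the point is that the mask is derived from sysfs at runtime, never hardcoded.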

lstopo (from the hwloc package) is the second call. It draws the full hierarchy: run lstopo under X11 for an interactive window, or lstopo --of png topology.png for a headless PNG. On a remote box without X11, lstopo --of console prints an ASCII tree:

$ lstopo --of console
Machine (768GB total)
  Package L#0
    NUMANode L#0 (P#0 96GB)
      L3 L#0 (32MB)
        L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#192)
        L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#1)
          PU L#3 (P#193)
        ... (8 cores per L3 — one chiplet)
      L3 L#1 (32MB)
        ... (chiplets 1 and 2 complete the 24-core node)
    NUMANode L#1 (P#1 96GB)
      L3 L#3 (32MB)
        ...
  Package L#1
    NUMANode L#4 (P#4 96GB)
      ...

The tree mirrors the cache hierarchy: Package (socket) → NUMANode → L3 → L2/L1d/L1i → Core → PU (logical CPU). lstopo reads from the same /sys files as numactl but joins them with cache topology, so you can see exactly which cores share an L3 (the chiplet boundary), which cores share an L2 (on x86 an L2 is private to one core and its SMT siblings; some ARM Neoverse designs share an L2 across core pairs), and how SMT siblings hang off cores.

The third call is the raw sysfs tree. Every numactl --hardware and lstopo value comes from one of these files; reading them yourself is what you do when the higher-level tool reports something surprising:

/sys/devices/system/node/node0/
  cpulist               # "0-23,192-215" — CPUs in this node
  cpumap                # bitmask form of cpulist
  distance              # space-separated distances to all nodes
  meminfo               # MemTotal/MemFree/Cached for this node only
  numastat              # numa_hit / numa_miss counters
  hugepages/            # per-node hugepage pools (1G, 2M)
  vmstat                # per-node VM statistics
  compact               # write 1 to compact this node's memory

The /sys/devices/system/cpu/cpu<N>/topology/ tree carries the inverse mapping — given a CPU, which package, core, and thread group:

/sys/devices/system/cpu/cpu5/topology/
  physical_package_id   # "0" — which socket
  core_id               # "5" — which physical core within socket
  thread_siblings_list  # "5,197" — all CPUs sharing this physical core
  core_cpus_list        # same contents under the newer name
  cluster_id            # "-1" on x86; on ARMv8.4+, the CPU's cluster number

Together, the two trees let you answer any topology question without trusting an SKU spec sheet. The next section turns that into a Python script.
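As a warm-up, the node tree alone already answers a concrete question: which NUMA node owns CPU N? A minimal sketch on sample cpulist strings (illustrative, NPS=4-shaped); a real run would read /sys/devices/system/node/node*/cpulist instead.

```python
# which_node.py — map a CPU number to its owning NUMA node from cpulist
# strings. The sample cpulists are illustrative; a real run reads
# /sys/devices/system/node/node*/cpulist.

def parse_cpu_list(s: str) -> list[int]:
    """Parse '0-23,192-215' into the explicit list of CPU numbers."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = map(int, chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def node_of_cpu(cpu: int, cpulists: dict[int, str]) -> int:
    """cpulists maps node number -> that node's cpulist string."""
    for node, s in cpulists.items():
        if cpu in parse_cpu_list(s):
            return node
    raise ValueError(f"CPU {cpu} is in no node's cpulist")

cpulists = {0: "0-23,192-215", 1: "24-47,216-239"}
print(node_of_cpu(5, cpulists))    # 0
print(node_of_cpu(200, cpulists))  # 0 — SMT sibling of a node-0 core
print(node_of_cpu(30, cpulists))   # 1
```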

A reusable discovery script

Every tuning decision your code makes — pinning threads, allocating with mbind, sharding a hash table — depends on the topology you discover at startup. Hardcoding the topology in config (workers_per_socket: 96) is the bug. Reading it at startup is the fix.

# numa_discover.py
# Read /sys/devices/system/node and /sys/devices/system/cpu to build a
# topology map the application can branch on. Designed for production:
# returns a dict, not stdout. Tested on EPYC 9654, Xeon Platinum 8480,
# Graviton 3, and a single-socket laptop (1 node).

import json
import os
import re
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")
CPU_ROOT = Path("/sys/devices/system/cpu")

def parse_list(s: str) -> list[int]:
    """Parse '0-11,96-107' into [0,1,...,11,96,...,107]."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = (int(x) for x in chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def read_text(p: Path, default: str = "") -> str:
    try:
        return p.read_text().strip()
    except FileNotFoundError:
        return default

def discover() -> dict:
    if not NODE_ROOT.exists():
        return {"nodes": 1, "sockets": 1, "topology": "uma",
                "cpus": list(range(os.cpu_count() or 1))}

    nodes: list[dict] = []
    for nd in sorted(NODE_ROOT.glob("node[0-9]*"),
                     key=lambda p: int(re.search(r"node(\d+)$", p.name).group(1))):
        n = int(re.search(r"node(\d+)$", nd.name).group(1))
        cpus = parse_list(read_text(nd / "cpulist"))
        distance = [int(x) for x in read_text(nd / "distance").split()]
        # MemTotal lives in /sys/devices/system/node/nodeN/meminfo
        meminfo = read_text(nd / "meminfo")
        m = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo)
        mem_mb = int(m.group(1)) // 1024 if m else 0
        nodes.append({"node": n, "cpus": cpus, "distance": distance, "mem_mb": mem_mb})

    # Group nodes into sockets via physical_package_id
    socket_of_cpu: dict[int, int] = {}
    for n in nodes:
        for c in n["cpus"]:
            pkg = read_text(CPU_ROOT / f"cpu{c}" / "topology" / "physical_package_id")
            if pkg:
                socket_of_cpu[c] = int(pkg)
    sockets = sorted(set(socket_of_cpu.values()))
    nodes_per_socket = {s: [] for s in sockets}
    for n in nodes:
        s = socket_of_cpu.get(n["cpus"][0])
        if s is not None:
            nodes_per_socket[s].append(n["node"])

    return {
        "nodes": len(nodes),
        "sockets": len(sockets),
        "nodes_per_socket": nodes_per_socket,
        "topology": "numa" if len(nodes) > 1 else "uma",
        "details": nodes,
    }

if __name__ == "__main__":
    t = discover()
    print(json.dumps(t, indent=2))
    print(f"\n→ {t['nodes']} NUMA nodes across {t.get('sockets', 1)} socket(s)")
    if t["topology"] == "numa":
        nps = t["nodes"] // t.get("sockets", 1)
        print(f"→ NPS / SNC mode appears to be {nps} (nodes per socket)")

Sample run on a 2-socket EPYC 9654 in NPS=4:

$ python3 numa_discover.py
{
  "nodes": 8,
  "sockets": 2,
  "nodes_per_socket": {"0": [0, 1, 2, 3], "1": [4, 5, 6, 7]},
  "topology": "numa",
  "details": [
    {"node": 0, "cpus": [0, 1, ..., 23, 192, ..., 215], "distance": [10, 11, 12, 12, 32, 32, 32, 32], "mem_mb": 95876},
    ...
  ]
}

→ 8 NUMA nodes across 2 socket(s)
→ NPS / SNC mode appears to be 4 (nodes per socket)

The parts that matter most: discover() degrades gracefully on non-NUMA boxes (the NODE_ROOT check, since laptops and many VMs have no node tree), sockets come from physical_package_id rather than from an assumption about enumeration order, and nodes_per_socket makes the NPS / SNC mode explicit in the output instead of leaving it implied.

The script is 60 lines, has no third-party dependencies, and runs in under 50 ms. Ship it as part of your service's startup. Log the resulting nodes, sockets, and nodes_per_socket fields. Alert when those numbers change between deploys — a node that booted with 2 NUMA nodes yesterday and 8 today is the signal Aditi needed at hour 1, not hour 11.
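The alert can be a hard startup gate. A minimal sketch, assuming a deploy manifest that records the qualified shape; the manifest dict format here is hypothetical, and `discovered` would come from the discover() function above.

```python
# topology_guard.py — refuse to serve traffic when the discovered topology
# differs from the deploy manifest. The manifest shape is a hypothetical
# example; `discovered` would come from a discover()-style call at startup.

def assert_topology(discovered: dict, manifest: dict) -> None:
    """Raise if node or socket counts drifted since qualification."""
    for key in ("nodes", "sockets"):
        got, want = discovered.get(key), manifest.get(key)
        if got != want:
            raise RuntimeError(
                f"topology drift: {key}={got} but manifest says {want} "
                f"-- check BIOS NPS/SNC before taking traffic")

manifest = {"nodes": 2, "sockets": 2}   # shape the service was qualified on

assert_topology({"nodes": 2, "sockets": 2}, manifest)   # passes silently

try:
    assert_topology({"nodes": 8, "sockets": 2}, manifest)  # NPS flipped
except RuntimeError as e:
    print(e)   # the hour-1 signal instead of the hour-11 one
```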

Reading lstopo for the picture you cannot get from numactl

numactl --hardware gives you the rectangles. lstopo gives you the picture, with cache hierarchy joined to NUMA. For visual inspection it is the single most useful systems-performance tool you don't yet have installed.

[Figure: lstopo view of an EPYC 9654 socket in NPS=4 mode. A nested-rectangle diagram: the outer rectangle is one socket (Package L#0); inside are 4 NUMA-node rectangles (24 cores, 96 GB DRAM each); each node contains 3 L3-cache rectangles (32 MB chiplets of 8 cores); each core carries its own L2, L1d, L1i, and 2 PUs (SMT siblings). NUMANode L#1 sits at distance 11 from L#0; L#2 and L#3 at distance 12.]
One socket of an EPYC 9654 in NPS=4: 4 NUMA nodes, each containing 3 L3-cache chiplets, each containing 8 cores. The second socket is the same shape repeated, with cross-socket distance 32. Illustrative — based on AMD EPYC 9004 architecture overview.

The visual makes three things obvious that text-form numactl hides. First, the chiplet boundary is the L3 boundary: cores within one chiplet share an L3 of 32 MB; a load from another chiplet in the same NUMA node still has to cross a memory-fabric link (Infinity Fabric) and pays an additional 5–10 ns. Second, the NUMA-node-to-chiplet ratio (3 chiplets per node in NPS=4) means a pin-mask of "one chiplet" is not the same as "one NUMA node" — pinning to chiplet 0 only gives you 1/3 of the node's memory bandwidth budget. Third, the socket boundary is the place where xGMI (or UPI on Intel) crosses, and is the only place where the distance jumps from 11/12 to 32.
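The chiplet-vs-node distinction is scriptable: every CPU's /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list names the CPUs sharing its L3, so deduplicating those strings yields the chiplet groups. A sketch on sample strings shaped like one NPS=4 node; a real run reads the sysfs files.

```python
# chiplets.py — recover chiplet (shared-L3) groups by deduplicating
# cache/index3/shared_cpu_list strings. The sample strings are illustrative.

def parse_cpu_list(s: str) -> list[int]:
    """Parse '0-7,192-199' into the explicit list of CPU numbers."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = map(int, chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def chiplet_groups(shared_lists: list[str]) -> list[tuple[int, ...]]:
    """Each distinct shared_cpu_list string is one L3 domain (chiplet)."""
    return sorted({tuple(sorted(parse_cpu_list(s))) for s in shared_lists})

# One NPS=4 node: 3 chiplets of 8 cores, each core with an SMT sibling
samples = (["0-7,192-199"] * 8
           + ["8-15,200-207"] * 8
           + ["16-23,208-215"] * 8)
groups = chiplet_groups(samples)
print(len(groups))     # 3 — three chiplets inside one NUMA node
print(groups[0][:8])   # (0, 1, 2, 3, 4, 5, 6, 7)
```

A pin-mask built from one of these groups is a chiplet pin; a pin-mask built from the node's cpulist is a node pin. The script makes the difference explicit instead of eyeballed.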

Run lstopo --of png > topology.png once on every cloud SKU you deploy on, save the PNG to your wiki, and refer back to it when you read flamegraphs. The visual recall is faster than re-deriving the topology from numactl text every incident.

lstopo also has an underused flag: lstopo --output-format xml produces machine-readable XML that includes hwloc's full graph (cache associativities, memory-channel widths, PCIe device locality). Production ops teams at PhonePe and Hotstar feed this XML into their config-management pipeline and assert that BIOS settings match the expected NPS / SNC / hyperthreading shape before promoting a node into production traffic. The check is one xmllint --xpath away; the cost of skipping it is the kind of incident the chapter opened with.
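The assertion itself needs nothing beyond an XML parser that can count NUMANode objects. A sketch using Python's stdlib on a hand-trimmed, hypothetical fragment of lstopo-style XML; real hwloc output nests many more object types and attributes, but the counting idea is the same.

```python
# check_shape.py — count NUMANode objects in hwloc-style XML and compare to
# the expected shape. SAMPLE is a hypothetical, heavily trimmed stand-in for
# real `lstopo` XML output.

import xml.etree.ElementTree as ET

SAMPLE = """<topology>
  <object type="Machine">
    <object type="Package">
      <object type="NUMANode" os_index="0"/>
      <object type="NUMANode" os_index="1"/>
    </object>
    <object type="Package">
      <object type="NUMANode" os_index="2"/>
      <object type="NUMANode" os_index="3"/>
    </object>
  </object>
</topology>"""

def numa_nodes(xml_text: str) -> int:
    """Count NUMANode objects anywhere in the topology tree."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//object[@type='NUMANode']"))

expected = 4   # from the deploy manifest
found = numa_nodes(SAMPLE)
assert found == expected, f"expected {expected} NUMA nodes, found {found}"
print(f"shape ok: {found} NUMA nodes")
```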

Cloud-SKU quirks and the BIOS settings that flip them

The cloud SKUs Indian engineering teams meet most often have specific NPS / SNC defaults that differ from on-prem builds. Knowing them saves the first 30 minutes of every cloud-migration NUMA debug.

Cloud SKU                   Hardware               Default mode       Discovery surprise
AWS c6a.metal               Dual EPYC 7R13         NPS=1 (2 nodes)    numactl --hardware shows 2 nodes; the AMD spec sheet implies 8 chiplets per socket, but they are L3-only, not NUMA
AWS c7a.metal-48xl          Dual EPYC 9R14         NPS=4 (8 nodes)    Same SKU family as c6a but 4× the node count — pin-masks must change
AWS r7iz.metal-32xl         Dual Sapphire Rapids   SNC=2 (4 nodes)    Intel default is SNC off (2 nodes); AWS enabled SNC=2 silently in late 2023
Azure HBv4                  Dual EPYC 9V33X        NPS=4 (8 nodes)    HPC-targeted SKU; documented on the SKU page
GCP c3-standard-176         Dual Sapphire Rapids   SNC=off (2 nodes)  GCP keeps SNC off for predictability
GCP c3d-standard-180        Dual EPYC 9B14         NPS=2 (4 nodes)    Genoa-X variant; 4 nodes is the GCP default
Oracle BM.Standard.E5.192   Dual EPYC 9J14         NPS=4 (8 nodes)    Oracle exposes the maximum granularity

The on-prem story is messier. Razorpay's bare-metal fleet runs Supermicro EPYC 9654 boards; the BIOS default from Supermicro is NPS=1, but the server-image build pipeline flips it to NPS=4 in IPMI before deployment. PhonePe's bare-metal Xeon Platinum 8480+ boards default to SNC=off; their build pipeline leaves it that way except on the 8 boxes designated for HFT-style market-data ingestion, which run SNC=2. Both teams maintain a config-management assertion that BIOS state matches expected mode before any production traffic lands on the node.

# Quick BIOS-mode check on a running Linux box (no reboot required):
$ numactl --hardware | head -1     # gives the node count
$ lscpu | grep Socket               # gives the socket count
$ python3 -c "
import json, subprocess
out = subprocess.check_output(['numactl','--hardware']).decode()
nodes = int([l for l in out.splitlines() if l.startswith('available:')][0].split()[1])
sockets = int(subprocess.check_output(['lscpu']).decode().split('Socket(s):')[1].split()[0])
print(f'NPS/SNC mode: {nodes // sockets} nodes per socket ({nodes} total / {sockets} sockets)')
"

When the result of that 5-line check changes between two deploys of the same image, your config-management pipeline must fail the deploy. The cost of treating BIOS settings as static is the cost of the incidents at the top of this chapter — six-figure-rupee outages, weeks of misattributed root cause.

The dmidecode route gives you BIOS strings without numactl:

$ sudo dmidecode -t bios | grep -i 'Vendor\|Version\|Release Date'
$ sudo dmidecode -t processor | grep -i 'Version\|Voltage'

But dmidecode does not report NPS / SNC directly — those are runtime-only properties of how the firmware advertised the SRAT (System Resource Affinity Table) to the kernel. The kernel parses SRAT into the /sys/devices/system/node/ tree at boot. To read SRAT directly: cat /sys/firmware/acpi/tables/SRAT | hexdump -C (or use the acpidump tool to parse it). Most operators do not go this deep — they trust numactl --hardware — but knowing SRAT is the source of truth helps when SRAT-vs-kernel disagreements happen (rare; usually a kernel bug after a CPU hotplug event).

Common confusions

Going deeper

How the kernel builds the topology — SRAT, SLIT, MSCT

At boot, the BIOS hands the kernel three ACPI tables that together describe NUMA: SRAT (System Resource Affinity Table) maps CPUs and memory ranges to NUMA proximity domains; SLIT (System Locality Information Table) gives the distance matrix; MSCT (Maximum System Characteristics Table) describes the maximum possible expansion (used by virtualisation hypervisors). The kernel parses these at boot (see acpi_numa_init() and the x86 glue in arch/x86/mm/numa.c) and populates the /sys/devices/system/node/ tree from the parsed structures.

The implication for discovery is that everything numactl shows came from the BIOS via SRAT/SLIT, with one parsing step in between. When the kernel disagrees with reality (very rare), the cause is usually a buggy SRAT — fixed by a BIOS update. When two kernels on the same hardware report different topologies, the cause is a BIOS version difference (the SRAT-emit logic changed between BIOS releases). The dmesg | grep -i numa log lines on boot show the parsing step; saving these logs alongside topology snapshots pays off in long-tail debugging.

NUMA on ARM and the Graviton question

ARM servers (AWS Graviton 3/4, Ampere Altra) use the same Linux kernel infrastructure (/sys/devices/system/node/) and the same discovery tools. The differences are subtle: ARM's interconnect is CMN-700 (Coherent Mesh Network), not Infinity Fabric or UPI; distance values come out smaller (typically 10/20 instead of 10/32) because the ARM mesh has lower remote latency than x86 inter-socket links; and ARM exposes a cluster layer in /sys/devices/system/cpu/cpu*/topology/cluster_id that x86 leaves at -1.

Graviton 3 (c7g.16xlarge) is single-socket, 64-core, and reports 1 NUMA node — but the cluster layer reveals that cores are grouped in pairs sharing L2. Graviton 4 (c8g.metal-48xl) is dual-socket, 96 cores per socket, 192 total, and reports 2 nodes by default. AWS does not expose ARM sub-NUMA equivalents on Graviton. Discovery on ARM is simpler and more uniform than on x86 — but the same discipline applies.

Discovery in containers and Kubernetes

A container on a NUMA host normally sees the host's full topology in /sys/devices/system/node/ (sysfs is mounted read-only into the container) but is constrained by its cgroup's cpuset.cpus and cpuset.mems to a subset — so what discovery reports and what the container may actually use can differ. Reading /sys/fs/cgroup/cpuset.cpus.effective and /sys/fs/cgroup/cpuset.mems.effective tells you which CPUs and nodes the container is allowed to use.
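Reading the effective cpuset is a two-file affair. A minimal sketch with a fallback to sample strings so it runs outside a container too; the function name and fallback behaviour are this sketch's own, not a cgroup API.

```python
# container_view.py — which CPUs and NUMA nodes is this cgroup allowed?
# Falls back to illustrative sample strings when the cgroup-v2 files are
# absent, so the sketch runs anywhere.

from pathlib import Path

def parse_cpu_list(s: str) -> list[int]:
    """Parse '0-23,192-215' into the explicit list of numbers."""
    out: list[int] = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = map(int, chunk.split("-"))
            out.extend(range(lo, hi + 1))
        elif chunk:
            out.append(int(chunk))
    return out

def effective(path: str, sample: str) -> list[int]:
    """Read a cpuset.*.effective file, or fall back to the sample string."""
    p = Path(path)
    text = p.read_text().strip() if p.exists() else sample
    return parse_cpu_list(text)

cpus = effective("/sys/fs/cgroup/cpuset.cpus.effective", sample="0-23,192-215")
mems = effective("/sys/fs/cgroup/cpuset.mems.effective", sample="0")
print(f"allowed CPUs: {len(cpus)}, allowed NUMA nodes: {mems}")
```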

Kubernetes Topology Manager (--topology-manager-policy=single-numa-node) pins pods to a single NUMA node when scheduling. The discovery dance is: the kubelet reads the host topology at startup, the scheduler matches pod resource requests against per-node availability, the runtime sets cpuset / memset on the cgroup. Production teams at Hotstar enable single-numa-node for their FFmpeg-encoder pods (memory-bandwidth-heavy, single-threaded-per-pod, perfect for chiplet-pin) and explicitly restricted for their Java services (which can spill across multiple nodes via --interleave=all).

Discovery on GPUs — nvidia-smi topo -m

GPUs are NUMA too. An 8×H100 box exposes 8 separate HBM stacks, each closer to one GPU than to others, with NVLink links among them. nvidia-smi topo -m prints the GPU-to-GPU and GPU-to-NIC affinity matrix:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity   NUMA Affinity
GPU0     X     NV18    NV18    NV18    PXB     0-31           0
GPU1    NV18     X     NV18    NV18    PXB     0-31           0
GPU2    NV18    NV18     X     NV18    SYS     32-63          1
GPU3    NV18    NV18    NV18     X     SYS     32-63          1

The NV18 entries indicate 18 NVLink lanes between GPU pairs (full bandwidth); PXB is "across PCIe host bridges" (slower); SYS is "across NUMA sockets" (slowest). The NUMA Affinity column shows which CPU NUMA node each GPU is closest to — the data-loading pipeline should run on that node. Treating discovery as CPU-only on a GPU box is leaving 30–50 % of training throughput on the floor.
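Wiring that into the data loader starts with parsing the matrix. A sketch that turns the sample output above into a GPU-to-NUMA-node map; the column layout varies across driver versions, so a production parser should key off the header row rather than, as here, simply taking the last column.

```python
# gpu_affinity.py — parse `nvidia-smi topo -m` text into a GPU -> NUMA-node
# map. SAMPLE reproduces the matrix shown above; real output may have more
# columns and GPUs.

SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity   NUMA Affinity
GPU0     X     NV18    NV18    NV18    PXB     0-31           0
GPU1    NV18     X     NV18    NV18    PXB     0-31           0
GPU2    NV18    NV18     X     NV18    SYS     32-63          1
GPU3    NV18    NV18    NV18     X     SYS     32-63          1
"""

def gpu_numa_map(topo_text: str) -> dict[str, int]:
    """Data rows start with a GPU label; the last column is NUMA Affinity.
    The header also starts with 'GPU0', but its last token is not a digit,
    so the isdigit() check skips it."""
    out: dict[str, int] = {}
    for line in topo_text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("GPU") and parts[-1].isdigit():
            out[parts[0]] = int(parts[-1])
    return out

print(gpu_numa_map(SAMPLE))  # {'GPU0': 0, 'GPU1': 0, 'GPU2': 1, 'GPU3': 1}
```

With that map, the loader feeding GPU2 can pin itself to node 1's CPUs (via os.sched_setaffinity) before touching its input tensors.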

Reproduce this on your laptop

# On any Linux box, single-socket or dual-socket:
sudo apt install numactl hwloc linux-tools-common
numactl --hardware                     # the headline output
lstopo --of console | head -50         # ASCII topology tree
lstopo --of png > topology.png         # save the visual

# Set up the discovery script:
python3 -m venv .venv && source .venv/bin/activate
# numa_discover.py uses only stdlib — no pip install needed
python3 numa_discover.py               # already pretty-prints JSON plus a summary

# Watch topology in real time (1 second poll):
watch -n 1 'numactl --hardware | head -3; echo; cat /sys/devices/system/node/node0/numastat'

# Compare to your CPU's enumerated topology:
lscpu | grep -E '^(Architecture|CPU\(s\)|Socket|Core|Thread|NUMA)'
cat /proc/cpuinfo | grep 'physical id' | sort -u

A laptop is single-node; the output is short. Cloud metal instances are 2- to 8-node; the output is long. Both teach the same discipline: trust the kernel, not the spec sheet.

Where this leads next

Discovery is the prerequisite. Once you know the shape, the next chapters build the levers (pinning, binding, interleaving) and the measurement loops on top of it.

The deeper habit is to treat topology discovery as something you do at every deploy boundary — every new SKU, every BIOS update, every kernel bump. Aditi's incident at Zerodha was caused by trusting yesterday's topology after a BIOS automation playbook ran. The discovery script she now runs at service startup, with a hard fail if nodes_per_socket differs from the deploy manifest, would have caught the change in the first millisecond instead of the 11th hour.

The 30-second discipline: numactl --hardware, lscpu | grep Socket, lstopo --of console | head. Three commands, every box, every time. The cost is 30 seconds; the cost of skipping it is the rest of this part of the curriculum trying to debug a workload whose topology assumption was wrong.

References

  1. Linux kernel documentation — NUMA — the canonical description of how the kernel populates /sys/devices/system/node/ from ACPI SRAT and SLIT.
  2. hwloc (Portable Hardware Locality) project — lstopo's home; documentation includes the full topology object hierarchy and the XML schema.
  3. AMD EPYC 9004 Series Architecture Overview — NPS modes (NPS=1/2/4), chiplet-to-channel mapping, recommended modes per workload class.
  4. Intel Xeon Scalable (Sapphire Rapids) — Sub-NUMA Cluster Modes — SNC=Off/2/4 behaviour, distance values, BIOS configuration paths.
  5. numactl manual page (Linux) — flag reference; the section on --cpunodebind vs --physcpubind is essential.
  6. Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 (Memory) for production-perspective NUMA debugging; Appendix C lists the discovery commands every operator should know.
  7. ACPI Specification — System Resource Affinity Table — the firmware-side source of truth that the Linux kernel parses into NUMA topology.
  8. /wiki/uma-vs-numa-the-architectural-shift — the previous chapter; the architectural pivot this chapter teaches you to read.