Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

UMA vs NUMA: the architectural shift

Riya's payment-routing service at PaisaBridge had a beautiful constant in its latency model: 72 ns to reach DRAM, no matter which thread asked. She had benchmarked it for two years on dual-Xeon E5-2680 v2 boxes (Ivy Bridge, 2013) and the constant held. In 2024 she migrated to dual-EPYC 9654 (Zen 4, 2023). Same socket count. More cores. The constant disappeared. A pointer-chase from socket 0 into socket-0 memory still cost ~80 ns; the same chase into socket-1 memory cost 145 ns; under load the second number drifted to 280 ns. Nothing in her code had changed. What had changed was that the box had stopped being uniform-memory-access (UMA) and become non-uniform-memory-access (NUMA) — and the constant in her latency model was now a function of (core, page).

UMA means every CPU core sees DRAM through the same shared memory controller, with one latency. NUMA means each socket owns a memory controller, and accessing memory attached to another socket pays interconnect cost on top of the DRAM cost. The architectural pivot from UMA to NUMA happened in 2003 with AMD's Opteron and Intel's 2008 Nehalem; every server you write code for in 2026 is NUMA. The cost of pretending it is still UMA is a 1.5–3× latency penalty on remote loads, and a throughput plateau the moment your workload spills past one socket.

What UMA was, and why it stopped working

A UMA system has one memory controller. All CPUs reach DRAM through it. The bus between CPUs and the memory controller — Intel called it the front-side bus (FSB) through 2008 — is shared, and arbitration on the FSB serialises memory traffic from every core. In a 4-socket Pentium Pro server (1996) every load from any of the 4 sockets travelled the same FSB, took the same ~120 ns to DRAM, and competed with every other socket's load for FSB cycles.

The UMA design has two virtues: every core has the same view of memory (programming model is simple), and the memory controller is a single point you can optimise. It has one fatal flaw: the FSB is a fixed-bandwidth resource, and adding more cores does not add more bandwidth. By the early 2000s, server CPUs were already saturating their FSB at 4–8 cores; doubling the core count did not double throughput because half the new cores spent their cycles waiting for FSB arbitration.

UMA put the memory controller in a shared chip (Northbridge / MCH) on the motherboard. NUMA pushed it inside each CPU package, giving each socket its own memory it can reach quickly — and turning the other socket's memory into a remote resource. Illustrative — based on Intel Xeon Pentium-Pro era and AMD EPYC 9654 / Intel Sapphire Rapids published topologies.

The pivot was AMD's Opteron in 2003. AMD pulled the memory controller out of the motherboard chipset and integrated it into the CPU die itself. Each Opteron socket got its own DDR memory channels, directly attached. Sockets talked to each other (and to memory attached to other sockets) through a new interconnect AMD called HyperTransport. Intel followed in 2008 with Nehalem, integrating the memory controller and introducing the QuickPath Interconnect (QPI), later renamed UPI with Skylake-SP in 2017. AMD renamed HyperTransport to Infinity Fabric with Zen in 2017, with the inter-socket variant called xGMI.

The architectural change is precise: instead of one memory controller serving all CPUs through one bus, every CPU has its own memory controller serving its own DRAM, and CPUs communicate over a switched point-to-point interconnect. The win is bandwidth — a 2-socket EPYC 9654 has 24 DDR5 channels (12 per socket) at ~32 GB/s each, totalling ~770 GB/s of DRAM bandwidth, versus an FSB-era box that maxed out at ~30 GB/s shared among all sockets. The cost is non-uniformity: a load from socket 0 into socket-1 DRAM has to cross the interconnect, and pays for the privilege.

Why integrating the memory controller helps at all: when the controller lived on a separate chip (the Memory Controller Hub, or MCH), every memory access traversed two boards-worth of wires plus the FSB arbitration logic. The on-package controller has microscopic wires to its DRAM channels, no FSB to arbitrate, and the memory channels are fed by signal-integrity-tuned traces on the motherboard for sub-30 ns DRAM round-trip. Multiplied across multiple sockets, you also gain bandwidth proportionally — six channels per socket times two sockets is twelve channels of independent bandwidth, while one MCH could only physically carry 2–4 channels on its package. The integration was both a latency win and a bandwidth-scaling win; the NUMA non-uniformity it introduced was the price.

The new cost model: NUMA distance

In a NUMA system, every node-to-node hop has a cost called NUMA distance. The Linux kernel exposes this in /sys/devices/system/node/nodeN/distance. By convention, the distance from a node to itself is 10 (an arbitrary baseline). The distance to another node is some larger integer — typically 12, 15, 20, 32 — that approximates the access-cost ratio between local and remote DRAM. A distance of 20 means roughly 2× the latency to that remote node compared with local; a distance of 32 means roughly 3.2×.

For a typical 2-socket EPYC 9654 box, the distance matrix looks like:

node 0 distances: 10 32
node 1 distances: 32 10

For a 4-socket Cascade Lake box with mesh interconnect:

node 0 distances: 10 21 21 21
node 1 distances: 21 10 21 21
node 2 distances: 21 21 10 21
node 3 distances: 21 21 21 10

For an 8-socket Sandy Bridge box with QPI ring topology — where some pairs of sockets are 1 hop apart and others are 2 — the matrix is non-uniform, and reading it tells you which pairs of sockets to keep your shared data between:

node 0 distances: 10 16 16 22 16 22 22 22
node 1 distances: 16 10 22 16 22 16 22 22
...

The distance integer is not a measured nanosecond — it is a kernel-supplied hint to the scheduler and the page allocator. The kernel uses it for exactly two things: (1) deciding which NUMA node to allocate memory from when no policy is specified (prefer the lowest-distance node from the current CPU), and (2) deciding whether to migrate a page closer to the thread that is accessing it (the auto-NUMA-balancing knob). Your application code rarely reads the distance directly; the kernel reads it to make the same decisions you would have made with numactl.

The five numbers from the previous chapter are the latency reality the distance integer compresses:

Local L1 hit: ~1 ns.
Local L3 hit, same chiplet: ~12 ns.
Local L3 hit, different chiplet: ~17–22 ns.
Local DRAM: 75–95 ns.
Remote DRAM (other socket): 120–160 ns under low contention; 250–400 ns under interconnect saturation.

# numa_distance_walk.py
# Read /sys/devices/system/node/* to print the NUMA topology, distances,
# and the CPUs / memory attached to each node. This is the Python
# equivalent of `numactl --hardware`, but parsed so the script can
# branch on the topology in a real workload.

import os
import re
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")

def read_int(path):
    return int(path.read_text().strip())

def parse_cpulist(s):
    """Parse '0-23,48-71' into a list of ints."""
    cpus = []
    for chunk in s.strip().split(","):
        if "-" in chunk:
            lo, hi = map(int, chunk.split("-"))
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(chunk))
    return cpus

def read_node(node_dir):
    n = int(re.search(r"node(\d+)$", node_dir.name).group(1))
    cpus = parse_cpulist((node_dir / "cpulist").read_text())
    distances = list(map(int, (node_dir / "distance").read_text().split()))
    mem_kb = read_int(node_dir / "meminfo_total")  # may not exist on all kernels
    return {"node": n, "cpus": cpus, "distance_to": distances, "mem_kb": mem_kb}

# Walk all NUMA nodes
nodes = sorted(NODE_ROOT.glob("node[0-9]*"))
if not nodes:
    print("No NUMA nodes exposed — likely single-socket or container without /sys/devices/system/node")
    raise SystemExit(0)

topology = []
for nd in nodes:
    try:
        topology.append(read_node(nd))
    except FileNotFoundError as e:
        print(f"  (skip {nd.name}: {e})")

print(f"NUMA nodes: {len(topology)}")
for t in topology:
    print(f"  node {t['node']}: {len(t['cpus'])} CPUs, {t['mem_kb']/1024/1024:.1f} GB RAM")
    print(f"    distances: {t['distance_to']}")
    print(f"    cpu range: {t['cpus'][0]}..{t['cpus'][-1]}")

# What you would do with this in production: pin a worker pool to one node.
# os.sched_setaffinity(0, set(topology[0]['cpus']))

Sample run on a 2-socket EPYC 9654 (c6a.metal):

$ python3 numa_distance_walk.py
NUMA nodes: 2
  node 0: 96 CPUs, 384.0 GB RAM
    distances: [10, 32]
    cpu range: 0..95
  node 1: 96 CPUs, 384.0 GB RAM
    distances: [32, 10]
    cpu range: 96..191

Why distance 32 maps to roughly 1.7–2.0× latency, not 3.2×: the integer is a kernel-supplied policy hint, not a calibrated latency multiplier. The kernel needs an ordering of nodes from "best" to "worst" relative to a given CPU, and the integer provides that ordering. The actual ratio depends on the hardware; on EPYC 9654, the local-vs-remote ratio under low contention is ~1.78× (the chapter's pointer-chase numbers), but the kernel writes 32 because that's what the firmware tells it. AMD's BIOS reports the value; Intel's BIOS reports a different value for the same physical topology. Treat the integer as ordinal — what matters is distance(local) < distance(remote) and distance(adjacent_socket) < distance(far_socket) in mesh topologies. Never multiply distance/10 against your latency budget; measure the actual ratio with perf stat or a pointer-chase microbenchmark.

The asymmetry in the matrix is the whole story. UMA's distance matrix has every entry equal — the architecture-encoded statement "every load is the same". NUMA's matrix has a diagonal of 10s and off-diagonal entries that climb with hop count. When your application allocates a 64 GB shared cache, the matrix is what tells you whether to put it on one node, interleave across all nodes, or shard it.

What changes for the programmer

The UMA-era programming model is simple. malloc returns a pointer; the address is universal; any thread on any core reaches it equivalently; the only thing that matters for performance is whether the data fits in cache. The NUMA-era model is more layered. malloc still returns a pointer, but the page that pointer refers to lives on a specific NUMA node. Threads that run on that node's cores reach the page in 80 ns; threads on other nodes pay 145 ns or more. The address is still universal (you can dereference it from any thread without error); the latency is not.

Three behaviours that were correct under UMA become silent performance bugs under NUMA.

Single-threaded warmup. A common idiom under UMA: allocate a buffer, then have the main thread scribble zeros over it to ensure the pages are committed and warm. Under NUMA, this warmup runs on whatever node the main thread is currently scheduled on, and the kernel's first-touch policy places every page on that node. When worker threads later read the buffer from other nodes, they pay remote-access cost on every line. The fix is to warm the buffer in parallel from all nodes that will eventually use it, so each page is first-touched by a thread on the node that will later access it most.

The kernel's first-touch policy makes "who writes first" decide "where the page lives". A single-threaded warmup hands every page to one socket; a parallel warmup matches page placement to the threads that will read it. Half a flamegraph of `__memmove_avx_unaligned` stalls comes from this exact mistake. Illustrative.

Global counters and shared queues. A std::atomic<uint64_t> request counter incremented by every request handler. Under UMA, the counter's cache line lives in shared L3 or L2; updates are serialised but the cache-coherence protocol stays inside one socket and overhead is bounded. Under NUMA, the line bounces between sockets every time a handler on each side increments. Each bounce is an interconnect round-trip; under load, the counter alone can consume 10–20 % of cycles. The fix is per-socket counters summed periodically (the Aadhaar story is the canonical case).

Default-thread-pool allocation. Modern runtimes (Java's ForkJoinPool, Go's GOMAXPROCS scheduler, Tokio, Erlang's BEAM scheduler) treat all CPUs as equivalent and let the OS scheduler decide which thread runs where. Under UMA, this is correct. Under NUMA, the runtime ends up scheduling threads on whichever socket is least loaded at any moment, regardless of where their working-set data lives. The result is a steady stream of cross-socket migrations, each one cold-cache-faulting the migrated thread. The fix is socket-affine thread pools: each pool of workers pinned to one socket, with work-stealing only within the socket. PaisaBridge's payment-router moved from a 64-thread global pool to two 32-thread socket-pinned pools and saw p99 drop from 28 ms to 11 ms with no other change.

The transition is not zero-cost in code complexity. Per-socket counters need a reduction step. Socket-pinned thread pools need affinity to be set explicitly (pthread_setaffinity_np, taskset, numactl --physcpubind). Allocations need policy hints (numa_alloc_onnode, numa_alloc_interleaved, mbind syscalls). Most production code carries some of this, often inconsistently — one part of the system is NUMA-aware, another part allocates everything on node 0 by accident, and the misalignment produces the latency variance that plagues multi-socket deployments.

The BharatBazaar catalogue ranking team found this during their Mega Bargain Days 2023 capacity planning. The ranker process was correctly NUMA-aware — its inverted index was sharded per socket, its workers were pinned. But the JVM heap itself — managed by G1GC — was allocated as one large mmap'd region first-touched by the JVM startup thread on whichever socket the OS scheduler put it on. Half the application's object allocations ended up on the wrong socket relative to the worker that read them. Adding numactl --interleave=all to the JVM launch command — six characters in the systemd unit — cut p99 from 19 ms to 12 ms across the fleet. Why interleave-all helped a workload that was already partially NUMA-aware: G1GC allocates objects from thread-local-allocation-buffers (TLABs) carved out of the heap. With first-touch placement, the TLAB pages were all on node 0 (where the JVM main thread started), and worker threads on node 1 paid remote-DRAM cost on every TLAB read. Interleaving spreads the heap pages 50/50 across both nodes, so on average half a worker's allocations come from local DRAM regardless of which node the worker is on. It is a lossy fix — half the workload still pays remote cost — but it cuts the worst-case 100 % remote down to a uniform 50 %, which on this workload was the difference between 19 ms p99 and 12 ms p99.

Failure modes that look UMA from the outside

The hardest NUMA bugs are the ones where the system stays correct, the dashboards stay green-ish, and only the latency tail tells you something is wrong. Three failure modes are worth recognising on sight.

The cold-cache migration storm. A latency-critical service runs with N+1 redundancy and a load balancer. Under steady load, each replica's threads keep their working set warm in local L3. Under a traffic spike, the OS scheduler migrates threads across sockets to balance CPU pressure; each migrated thread now reads cold cache lines from a different L3 slice (or worse, remote DRAM), and its per-request latency spikes for the next few milliseconds. The dashboard shows mean CPU at 60 %, which looks healthy. The p99.9 doubles for the duration of the spike, which is what wakes the on-call engineer. The signature: perf stat -e sched:sched_migrate_task,context-switches shows migration count tracking the latency excursion. The fix: pin per-replica thread pools to one socket; trade some scheduler flexibility for migration-free steady state.

The lopsided allocator. A service uses a generic memory allocator (glibc's ptmalloc, jemalloc with default settings) that hands out arenas from a pool. The arenas live on whichever node first allocated them — which is whichever node the main thread happened to start on. Worker threads pull from these arenas regardless of which node they're on. Half the allocations land "right", half land "wrong"; the per-request latency is bimodal. Why this is uniquely hard to debug: the bimodality is invisible to averages and to most percentile dashboards because it averages out at the request level (each request makes hundreds of allocations). The signature shows up only in perf c2c record (a Linux tool that profiles cache-line contention) or in the per-allocation-size histograms exposed by jemalloc's stats interface. Most teams never look at either, so the bug lives in production for years as "ambient slowness" until someone profiles it deliberately. The fix: jemalloc with --enable-percpu-arena and MALLOC_CONF=percpu_arena:percpu, or tcmalloc which has been NUMA-aware by default since 2018.

Interconnect saturation under coordinated load. Under normal load, your interconnect carries 5 GB/s of cross-socket traffic and has 60 GB/s of headroom. Under a SetuStream-IPL-final-style traffic spike — every replica sees 14× normal QPS at the same instant — the cross-socket traffic also spikes 14×, hitting 70 GB/s on a 64 GB/s link. The interconnect is now the global bottleneck; every cross-socket load queues on it. Per-socket CPU utilisation drops (cores are stalled waiting for memory), and dashboards interpret "CPU dropped" as "load is being handled fine" — exactly the wrong conclusion. The signature: perf stat -e amd_iommu_0/cmd_buf_full/ (AMD) or perf stat -e uncore_upi_0/qpi_data_bandwidth_tx/ (Intel) saturating their counter rates. The fix is structural — reduce cross-socket traffic — not adding capacity, because more replicas of a topology-blind service add more interconnect pressure.

The common thread is that NUMA failure modes don't trip the alarms you've already configured. The latency tail moves, but the means and medians often don't. The CPU graph looks normal because stalled cores still count as "busy" in top. The flamegraph still names the same hot functions. The only signals that tell the truth are the NUMA-specific perf counters and the numa_maps view of your process's page placement — both of which require knowing they exist before you go looking. That meta-discipline — knowing which signals to trust when the easy ones are lying — is the difference between teams that diagnose NUMA bugs in hours and teams that ship them as ambient performance loss for quarters.

Common confusions

"NUMA only matters on multi-socket boxes." Most do, but chiplet-based single-socket parts (AMD Zen 2/3/4 EPYC, Intel Sapphire Rapids tiles) introduce intra-socket non-uniformity too. A 96-core EPYC 9654 has twelve 8-core chiplets; loads from a different chiplet's L3 are slower than loads from the same chiplet's L3. The kernel exposes this as sub-NUMA clustering (SNC) — set in BIOS, the single socket is presented to the kernel as 4 NUMA nodes, each covering 3 chiplets. Single-socket NUMA-awareness is real on chiplet hardware.
"NUMA distance is the latency ratio." It is an ordinal hint, not a measured nanosecond multiplier. A distance of 32 might be 1.7× latency on EPYC and 2.1× on Xeon — the kernel uses the integer for ordering decisions, not for capacity planning. Always measure the actual ratio with perf stat or a pointer-chase benchmark on the specific hardware.
"numactl --interleave=all always helps." It helps when your workload's data has no obvious NUMA-friendly partitioning and you want to avoid the worst-case "all on one node" failure mode. It hurts when your workload IS NUMA-aware, because interleaving forces 50 % of accesses remote even when you've sharded properly. The right default depends on whether your code knows where its data should be — a stateless web service is a great fit for interleave-all; a cache-shard server is a terrible fit.
"Auto-NUMA-balancing makes NUMA invisible." It tries. The kernel periodically samples access patterns by injecting page faults and migrates pages to the node that accesses them most. For workloads with stable access patterns and long-lived processes, it works. For workloads with bursty or migrating access patterns, it adds latency variance — every sampling fault costs tens of microseconds — and sometimes makes things worse. Production teams at PaisaBridge and ParakhTrade disable it (echo 0 > /proc/sys/kernel/numa_balancing) on latency-critical services and pin manually.
"The cloud abstracts NUMA away." AWS, Azure, and GCP all expose multi-socket topology on their metal and large memory-optimised SKUs. c6a.metal (192 vCPU, 2-socket EPYC), m7i.metal-48xl (192 vCPU, 2-socket Sapphire Rapids), Standard_HB176-96rs_v4 (Azure, 96-core single-socket but with sub-NUMA clustering), c3-standard-176 (GCP) — all NUMA. numactl --hardware shows it directly. Container orchestrators (Kubernetes Topology Manager) can pin pods to NUMA nodes when explicitly configured; the default is "scheduler picks", which produces the cross-socket migrations this chapter is about.
"NUMA is just a CPU thing." GPUs are also NUMA — an 8×H100 box has 8 separate HBM stacks, each closer to one GPU than the others, and inter-GPU memory access goes over NVLink (faster but still non-uniform). The mental model that "memory is plural and locality matters" is the same; the constants are 10–100× tighter and the interconnect is NVSwitch instead of UPI/xGMI. Cluster-NUMA — where memory across nodes connected by RDMA fabric is treated as a single addressable space (CXL, future) — is the next frontier; the same architectural pivot, scaled to a rack.

Going deeper

ccNUMA: cache coherence under non-uniform memory

The full name of every modern x86 NUMA system is ccNUMA — cache-coherent non-uniform memory access. Memory is non-uniform, but the cache-coherence protocol still presents a single coherent view across all sockets, which means a write on one socket is visible to all sockets without explicit synchronisation. The protocol that makes this work — Intel's MESIF (Modified-Exclusive-Shared-Invalid-Forward), AMD's MOESI (the same five plus Owned) — sends coherence probes over the interconnect every time a cache line is read or written by multiple sockets.

Coherence probes are part of the interconnect's bandwidth budget. On a 2-socket EPYC 9654 with ~64 GB/s of cross-socket Infinity Fabric, a workload with heavy false sharing can spend 40–60 % of interconnect bandwidth on coherence traffic alone, leaving only 25–35 GB/s for actual remote-DRAM data. The visible symptom is that adding cores on socket 1 makes things worse — the more cores writing to shared lines, the more coherence probes fly across the wire, and the less bandwidth remains for the data those cores are trying to read. False sharing — covered in its own chapter — is a single-socket issue that becomes a multi-socket disaster.

When NUMA is more than 2 sockets — mesh and ring topologies

The 2-socket case has one interconnect link and a single distance ratio. Larger NUMA systems get more interesting topologies. Intel's Skylake-SP and later use a mesh on-die for intra-socket and UPI for inter-socket; an 8-socket box may be wired as a mesh (every socket has direct links to several others, distance 21 to all) or a ring (some pairs are 1 hop apart, distance 16; others are 2 hops, distance 22).

The matrix asymmetry matters for shared-data placement. If sockets 0 and 1 are 1-hop neighbours but socket 0 and socket 4 are 2-hop, you place your shared queue between sockets 0 and 1, not between sockets 0 and 4. Modern Linux's NUMA scheduler reads the distance matrix and tries to keep cooperating threads on low-distance node pairs; production code that ignores the matrix and pins threads arbitrarily across an 8-socket box can see latency variance of 3× depending on which socket-pair the OS happened to land them on.

The CXL future: pooled memory across the rack

Compute Express Link (CXL) 2.0 introduces memory pooling — a separate memory module that multiple servers can share as a network-attached NUMA node. Conceptually, your distance matrix grows: local DRAM at distance 10, same-socket DRAM at 10, other-socket DRAM at 32, CXL-pooled DRAM at 80–120. Latency to CXL is currently 200–400 ns (compared to 80 ns local), so CXL is for capacity expansion (terabytes of cheap memory) and shared-state use cases (cluster-coherent caches), not latency-critical paths.

Vistron, Meta, and Samsung have published 2024-2025 measurements of CXL-attached memory deployments; the consensus is that CXL is "another tier" — between DRAM and SSD — that needs the same locality-aware programming as inter-socket NUMA, just with looser constants. The architectural pivot this chapter names is happening again, one level out, and the discipline transfers.

How Linux decides which node to allocate from

The kernel's default policy is MPOL_DEFAULT, which means "use the policy of the calling thread" — and the thread's default policy is MPOL_LOCAL (was MPOL_DEFAULT before 5.11), meaning "allocate from the node the calling CPU is on". Combined with first-touch placement, this gives you the local-allocation behaviour you'd want — if the thread that first touches a page is the thread that mostly reads it.

The lever to override is set_mempolicy(2) (per-thread), mbind(2) (per-region), or the numactl user-space wrapper. The four policy modes are: DEFAULT (thread-local with first-touch), PREFERRED (try this node, fall back if full), BIND (must allocate from this node, OOM if it can't), and INTERLEAVE (round-robin pages across a node mask). Each is the right answer for a different workload — BIND for a partitioned shard, INTERLEAVE for a uniformly-accessed hash table, PREFERRED for cache-friendly fallbacks.

Reading /proc/<pid>/numa_maps for any running process shows which policy each memory region uses and which nodes its pages are physically on. For a NUMA-bug investigation, this file is the ground truth — the flamegraph won't show you that 60 % of your heap is on node 1 when your workers run on node 0, but numa_maps will.

Why some workloads stay UMA-shaped on NUMA hardware

Not every multi-socket workload sees a NUMA tax. The shape that stays clean is embarrassingly parallel with per-thread working sets that fit in a single socket's L3 plus a small footprint of locally-allocated DRAM. Examples: per-request video transcoding (SetuStream's encode pipeline, when each worker reads its segment from local memory), per-key cryptographic hashing (BCrypt password verification at DigiPaisa, where each request is independent), in-RAM cache lookups against a sharded keyspace where each shard fits in one socket (PaisaBridge's idempotency-key cache, sharded by key prefix to one socket per shard).

The common property is that no two threads share data. Each thread reads only memory it (or its socket-local producer) wrote, and writes only memory its own consumers will read locally. The interconnect carries no application data — only periodic kernel coherence overhead — and the throughput scales linearly with cores until the interconnect's coherence-probe traffic itself becomes the bottleneck (typically past 96 cores per socket).

This is also why the partitioned-shard pattern is the dominant production response to NUMA. If you can't make your shared data go away, shard it once across sockets and never share across the boundary again. The ParakhTrade order-matcher example from the previous chapter — moving the order book from "shared, first-touched on socket 0" to "sharded by symbol prefix, one shard per socket" — turned a workload that was getting 1.4× scaling into one getting 1.9×. The hardware was identical; only the shape of data sharing changed.

The pattern transfers: any workload that can be partitioned by a stable key (user-id, symbol, geo-region) into roughly equal shards can run as N independent UMA-shaped sub-workloads on a NUMA box, one per socket. The cross-socket interconnect carries only out-of-band coordination traffic (heartbeats, leadership, occasional rebalancing), which is small relative to the data-plane bandwidth each socket consumes locally. This is the architecture every database that runs well on multi-socket hardware (CockroachDB, ScyllaDB, Aerospike) implements explicitly, and the architecture every monolithic in-memory cache that doesn't (Memcached older than 1.6) reaches for as a v2.

Reproduce this on your laptop

Most laptops are single-socket and will show one node. To explore NUMA in earnest:

# On a multi-socket Linux box (AWS c6a.metal, Azure HBv4, GCP c3-standard-176):
sudo apt install linux-tools-common linux-tools-generic numactl
numactl --hardware                  # see node count, distances, CPU/RAM per node
cat /sys/devices/system/node/node0/distance
cat /proc/<pid>/numa_maps | head    # see policy + page placement of any process

python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 numa_distance_walk.py        # the script in this chapter

# Toggle auto-NUMA-balancing off and watch latency variance change:
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# (re-enable with: echo 1 | sudo tee /proc/sys/kernel/numa_balancing)

If your laptop is single-socket, numactl --hardware will report one node. The exercise still teaches you the interface — every cloud SKU you'll ever scale to exposes it — and the numa_maps view of your laptop's running processes shows the same kernel mechanism, just with one node. The constants vary; the discipline does not.

Where this leads next

UMA is where memory was a single resource. NUMA is where memory became a topology of resources, each closer to some CPUs than to others. Every chapter in Part 3 from this point operates on the NUMA assumption — multiple memory controllers, an interconnect with bandwidth limits, and access cost as a function of (thread location, page location).

The chapters that follow build the discipline:

/wiki/numa-topology-discovery — numactl --hardware, lstopo, /sys/devices/system/node. Read your hardware before you measure it.
/wiki/numactl-and-memory-binding — the user-space and library-level levers for placing pages and pinning threads. numa_alloc_onnode, mbind, MPOL_BIND, set_mempolicy.
/wiki/interconnects-qpi-upi-infinity-fabric — the wire between sockets: bandwidth budgets, coherence overhead, link-saturation symptoms.
/wiki/numa-aware-allocators-and-data-structures — the patterns (per-node arenas, sharded structures, replication) that make production code stop fighting the topology.

The mental pivot to take into Part 3: every UMA-era programming idiom you carry forward is a hypothesis that needs validation on the new hardware. Some idioms still work — most of them, on workloads that fit one socket. The ones that break do so silently — the code is correct, the tests pass, the benchmarks on the dev box look fine, and the production p99 is mysterious. The remaining chapters of Part 3 are the toolkit for finding those idioms before they cost you a fleet.

Riya's payment-routing service got its constant back, but only after she rebuilt the latency model around (core_node, page_node) instead of a single number. The constant for (0,0) is 80 ns; for (0,1) it's 145 ns; for (0,1) under interconnect saturation it's 280 ns. The model is more complicated, but it predicts production now. The constant did not disappear; it just stopped being one number.

References

Lameter, "NUMA: An Overview" (USENIX Queue, 2013) — the practitioner's introduction; the background reading the rest of Part 3 builds on.
Linux kernel documentation — NUMA Memory Policy — the authoritative reference for MPOL_* policies, mbind, set_mempolicy.
AMD EPYC 9004 Series Architecture Overview — Infinity Fabric / xGMI specifications, sub-NUMA clustering, NPS modes on Genoa.
Intel Xeon Scalable (Sapphire Rapids) Architecture Brief — UPI 2.0, mesh-on-die, sub-NUMA cluster modes.
Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 (Memory) for production-debugging perspective on NUMA bugs.
Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017) — Chapter 5 for ccNUMA coherence protocols and distance models.
Compute Express Link (CXL) Consortium — CXL 2.0 Specification — the next NUMA frontier: memory pooling across servers.
/wiki/wall-single-socket-is-no-longer-where-the-action-is — the previous chapter; the wall this one formalises.