Wall: single-socket is no longer where the action is

Asha at Flipkart benchmarked the catalogue ranking service on her laptop for a week. One core, 240 ns per ranked-document score, IPC of 2.6, L3 miss rate under 4 %. The flamegraph was clean. She shipped it. On Big Billion Days morning the service ran on two-socket EPYC 9654 boxes — 192 cores, 384 hyperthreads, 768 GB of DDR5 spread across 12 memory channels split over two NUMA nodes — and p50 per-document score climbed to 1.4 µs and p99 to 22 µs. CPU utilisation reported 38 %. The flamegraph still looked clean. Nothing in her code had changed. What had changed was that "the machine" was no longer a number — it was a topology, with two memory controllers, a 64 GB/s socket-to-socket interconnect, twelve L3 slices, and a scheduler whose cross-socket migration cost a thousand cycles per touched cache line. The single-core number was real; it just stopped predicting anything that ran in production.

A 2026 server is not "a CPU" — it is a graph of sockets, memory controllers, L3 slices, and an interconnect, and a single benchmark number cannot describe it. Frequency scaling stopped in 2005; per-core IPC growth tapered after 2015; the only path to more throughput since has been more cores, more sockets, and more memory channels — each of which makes the machine less uniform. The wall this chapter names is the gap between the single-socket mental model your benchmark validates and the multi-socket reality your production code runs on. Every chapter in Part 3 is about closing that gap.

What changed when single-socket stopped being enough

The single-socket era — roughly 1995 to 2010 — was the period when performance came almost entirely from making one core faster. Frequency climbed from 100 MHz to 4 GHz. Issue width grew from 1 to 4. Branch predictors got cleverer; out-of-order windows widened from 16 entries to 192; SIMD widened from 64-bit MMX to 256-bit AVX2. A program that ran twice as fast on next year's CPU was the default. You wrote single-threaded code, ran a benchmark, and the number was the machine.

Three physical limits ended this era together. Dennard scaling — the rule that as transistors got smaller they also got proportionally less power-hungry — broke around 2005. Below the 65 nm node, leakage current rose so sharply that frequency increases came with cubic-rather-than-linear power increases; the chips literally could not be cooled. Amdahl's law put a ceiling on how much wider a single core could go: every additional pipeline stage and execution port had diminishing returns because not all programs have enough independent work to fill them. The memory wall — the topic of the Part-2 wall chapter — meant that even when the core could go faster, it spent most of its time waiting on DRAM, so making the core itself wider stopped helping past a point.

The escape hatch was parallelism. Intel's Core 2 Duo (2006) shipped two cores on a die. By 2010 server parts had 6–8 cores. By 2015 Xeon E5 servers had 16–22 cores per socket and dual-socket boards were the default. By 2020 AMD's Rome (EPYC 7742) shipped 64 cores per socket; in 2026 EPYC 9654 ships 96 cores per socket and Sapphire Rapids ships 60. A two-socket box now packs 120–192 hardware cores, 240–384 SMT threads, and 12–16 memory channels. The throughput is real. But the machine is no longer a uniform pool.

The shift is observable in the price/performance tables every cloud provider publishes. AWS launched the c4 family in 2015 with 36-vCPU instances; the c7a family launched in 2024 tops out at 192 vCPU. The per-vCPU price has fallen by roughly 4×; the per-instance compute capacity has grown by roughly 5×. But the per-vCPU experience under contention has gotten worse in absolute terms — a vCPU on c4 was a hyperthread on one socket sharing memory bandwidth with 35 others, while a vCPU on c7a is a hyperthread on one of two sockets sharing memory bandwidth with 191 others through a topology that includes cross-socket coherence. The same workload that delivered consistent 8 ms p99 on c4 in 2017 might deliver 6 ms p99 on c7a in 2025 if it is topology-aware, and 24 ms p99 if it is not. The hardware is faster on paper; whether it is faster in practice depends on whether your code is keeping up with the architectural pivot.

[Figure: From single core to multi-socket NUMA — 1995, 2005, 2015, 2026. Four stacked panels: a 1995 box with one CPU and one DRAM ("one wire, one number"); a 2005 box with two cores sharing one L2 and one DRAM; a 2015 box with one socket holding eight cores with private L1/L2, a shared L3 (LLC), and one memory controller; a 2026 box with two sockets, each holding 96 cores, 12 L3 slices, and six memory channels into 384 GB of local DRAM, joined by a UPI/xGMI interconnect (~64 GB/s, ~120 ns).]
The machine grew sideways. Each step added cores, cache levels, and memory paths — but also non-uniformity. By 2026, the same pointer dereference costs 80 ns or 145 ns depending on which socket the page lives on. Illustrative — based on Intel SPR 8480+ and AMD EPYC 9654 published topologies.

Why frequency stopped scaling at ~4 GHz: power dissipation grows roughly with frequency × voltage², and below the 65 nm node, voltage could no longer drop with each shrink because leakage current rose exponentially as the gate thickness approached atomic scale. So a 5 GHz part dissipates ~250 W in the same area a 4 GHz part dissipates ~140 W, and the heat-sink and motherboard infrastructure caps practical sustained operation around 4 GHz for server parts. AMD and Intel have edged single-core boost to 5.5–6.0 GHz for short bursts, but no part runs all cores at boost frequency for any sustained workload — the chip throttles in milliseconds. The era of "next year's CPU is 50 % faster on a single thread" ended in 2005 and has not restarted.

The pivot is not just architectural; it is methodological. A benchmark that runs on one core, on one socket, with the working set fitting in L3, is reporting a number. The number is real. It just does not predict what the same code does on 192 cores with 12 memory channels and an interconnect between two sockets — because the machine your code runs on no longer has a single answer to "how fast is a load from memory?". It has a different answer for every (core, page) pair.

The pricing makes the pivot unavoidable in production. AWS prices c6a.metal (two sockets, 96 physical cores, 192 hyperthreads across two NUMA nodes) at roughly ₹1,400/hour in ap-south-1; the m7i.metal-48xl two-socket Sapphire Rapids configuration runs around ₹1,650/hour. Renting one box for two days of a Big Billion Days run costs ₹70,000–₹80,000; renting eight of them for the same window crosses ₹6 lakh. At those numbers, every percentage point of throughput you leave on the table by ignoring topology multiplies out. A team that pushes from 38 % core utilisation to 80 % on the same fleet has cut its cloud bill in half — and the only thing that changed is whether the code knew the topology it was running on. Most teams discover this by accident, after the bill arrives; the discipline this chapter introduces is the alternative.

The single number that no longer exists: load latency

On the 2010 desktop, every load that missed L3 cost the same 70 ns to DRAM — there was one memory controller, one channel, one DRAM module, and the answer was a constant. On the 2026 EPYC 9654 dual-socket box, load latency depends on five things at once: which core issued the load, which L3 slice the line maps to, which memory controller owns the physical page, whether the page is on the local socket or remote, and how saturated the interconnect is right now. The same instruction at the same address can take 80 ns on one run and 200 ns on the next because the OS scheduler migrated the thread between sockets in between.

Five numbers anyone running production code on a multi-socket box must internalise:

  1. Local L1 hit: ~1 ns. The cache line is in the requesting core's private L1d.
  2. Local L3 hit, same chiplet: ~12 ns. The line is in the L3 slice physically attached to the same chiplet as the requesting core.
  3. Local L3 hit, different chiplet: ~17–22 ns. The line is in another chiplet's L3 on the same socket; the request traverses the on-package fabric.
  4. Local DRAM: 75–95 ns. The load goes from the core to its socket's memory controller to the local DRAM module and back.
  5. Remote DRAM (other socket): 120–160 ns. The load crosses the interconnect (UPI on Intel, Infinity Fabric / xGMI on AMD), pays interconnect serialisation, hits the remote memory controller, and returns over the interconnect again.

The factor that ruins benchmarks is that a single-threaded benchmark on an idle box almost always hits case 1 or 4, and it never sees case 5 or the chiplet-cross variant of case 3. Your single-core number is not wrong — it is just measuring the easy case. The production case picks among all five with a distribution determined by the OS scheduler and mmap page placement, neither of which your benchmark exercised. The 2× spread between cases 4 and 5 is the headline; the 30 % spread between cases 2 and 3 (intra-socket but inter-chiplet) is the silent contributor that nobody notices until they pin a thread to a different chiplet and the same code runs 8 % faster for no apparent reason.
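The "distribution over the five cases" framing can be made concrete with a toy expected-cost model. This is a sketch: the per-case costs are midpoints of the ranges above, and both mixes are hypothetical illustrations, not measurements.

```python
# Toy model: effective load latency as a probability mix over the five cases.
COST_NS = {
    "l1_hit": 1,             # case 1
    "l3_same_chiplet": 12,   # case 2
    "l3_cross_chiplet": 20,  # case 3
    "local_dram": 85,        # case 4
    "remote_dram": 140,      # case 5
}

def expected_load_ns(mix):
    """Expected per-load cost given a probability for each case."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(p * COST_NS[case] for case, p in mix.items())

# A laptop benchmark never sees cases 3 or 5:
bench = {"l1_hit": 0.96, "l3_same_chiplet": 0.0, "l3_cross_chiplet": 0.0,
         "local_dram": 0.04, "remote_dram": 0.0}
# A topology-blind production mix leaks a few percent into the slow cases:
prod = {"l1_hit": 0.90, "l3_same_chiplet": 0.02, "l3_cross_chiplet": 0.02,
        "local_dram": 0.03, "remote_dram": 0.03}

print(f"benchmark:  {expected_load_ns(bench):.2f} ns/load")
print(f"production: {expected_load_ns(prod):.2f} ns/load")
```

Leaking just 3 % of loads into remote DRAM nearly doubles the expected per-load cost — which is exactly how a "clean" single-core benchmark under-predicts production by 2×.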

# numa_load_latency.py
# Measure how the same pointer-chase loop runs on the local-vs-remote NUMA node.
# Run with:
#   numactl --cpunodebind=0 --membind=0 python3 numa_load_latency.py   # local
#   numactl --cpunodebind=0 --membind=1 python3 numa_load_latency.py   # remote
# (Requires a multi-socket Linux box. On a laptop, both will be the same.)

import numpy as np
import time
import os

# 256 MB working set, well past LLC. Pointer-chase pattern: each element
# holds the index of the next, so the prefetcher cannot overlap loads.
N = 32_000_000
SIZE_BYTES = N * 8

print(f"Working set: {SIZE_BYTES / 1e6:.0f} MB, hostname={os.uname().nodename}")
print(f"Allocated under: numactl policy from environment (see /proc/self/numa_maps)")

# Build a random permutation so we walk the array in dependent-load chain.
rng = np.random.default_rng(0)
chain = rng.permutation(N).astype(np.int64)

# Touch every page once so soft page faults don't land in the timed loop.
# (rng.permutation already wrote the array, so pages were physically placed
# at creation under the numactl policy; this read pass is belt-and-braces.)
chain.sum()

# Pointer-chase: idx = chain[idx], 100M iterations. Timing stays outside the
# hot function so the loop body is trivially JIT-compilable.
ITERS = 100_000_000

def chase(chain, iters):
    idx = 0
    for _ in range(iters):
        idx = chain[idx]
    return idx

# JIT with numba if available; else accept Python overhead on a scaled-down run.
try:
    from numba import njit
    chase_fast = njit(cache=True)(chase)
    chase_fast(np.zeros(1, dtype=np.int64), 1000)  # trigger compilation before timing
    t0 = time.perf_counter_ns()
    sink = chase_fast(chain, ITERS)
    ns = time.perf_counter_ns() - t0
except ImportError:
    # Pure-Python fallback: 100x fewer iterations, scaled back up. Interpreter
    # overhead dominates here, so treat the absolute number as an upper bound.
    t0 = time.perf_counter_ns()
    sink = chase(chain, ITERS // 100)
    ns = (time.perf_counter_ns() - t0) * 100

print(f"sink={sink}, total={ns/1e9:.3f} s, per-load={ns/ITERS:.1f} ns")

Sample run on a c6a.metal (AMD EPYC 7R13, two sockets, NUMA nodes 0 and 1):

$ numactl --cpunodebind=0 --membind=0 python3 numa_load_latency.py
Working set: 256 MB, hostname=ip-10-0-1-77
sink=29483711, total=8.214 s, per-load=82.1 ns          # local

$ numactl --cpunodebind=0 --membind=1 python3 numa_load_latency.py
Working set: 256 MB, hostname=ip-10-0-1-77
sink=29483711, total=14.601 s, per-load=146.0 ns        # remote

Why the 1.78× ratio — and why it gets bigger when cores compete: the local case pays the round trip from core → memory controller → DRAM. The remote case pays the same round trip plus an interconnect crossing in each direction. On the EPYC 9654 the Infinity Fabric runs at 32 Gb/s per link and adds roughly 32 ns per crossing. When a single thread is running, the interconnect is uncontended and the cost is purely additive — 82 ns + ~64 ns of interconnect = 146 ns. When 96 cores on socket 0 are all pulling from socket 1's memory, the interconnect saturates: each request queues behind others, the latency curve enters the M/M/c regime, and per-load cost climbs to 250–400 ns at high contention. The ratio 1.78× is the floor; the production reality on a busy box is often 3–5×.
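The additive-then-queueing behaviour can be sketched with a toy M/M/1-style model. The hop latency and the shape of the saturation curve are illustrative assumptions for this sketch, not EPYC specifications:

```python
# Toy model: a remote load = local round trip + two interconnect crossings,
# with each crossing inflated by queueing as utilisation approaches 1.
def remote_latency_ns(local_ns=82.0, hop_ns=32.0, utilisation=0.0):
    # M/M/1-style inflation: waiting time grows as 1 / (1 - utilisation)
    interconnect_ns = 2 * hop_ns / (1.0 - utilisation)
    return local_ns + interconnect_ns

for u in (0.0, 0.5, 0.8, 0.9):
    print(f"interconnect utilisation {u:.0%}: "
          f"{remote_latency_ns(utilisation=u):.0f} ns/load")
```

At zero utilisation the model reproduces the uncontended 146 ns; at 80 % it lands around 400 ns, inside the contended range quoted above. The point of the sketch is the shape, not the constants: remote latency is additive when idle and queueing-dominated when busy.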

The fact that numactl --membind=1 exists tells you something about the design. The kernel's first-touch policy places a page on whatever node ran the thread that first wrote to it; if your single-threaded malloc happens on socket 0 and your worker threads later run on socket 1, every access pays the remote tax. numactl is the user-space lever for overriding the default. Half the work in the next chapters of Part 3 is exactly this: deciding where pages live, where threads run, and how to keep them on the same socket.

A counter-intuitive consequence of first-touch is that malloc itself does not place pages. The libc allocator returns a virtual address, but the page is not physically allocated until the first store touches it. So a "warm-up" loop that writes zeros into a freshly-malloc'd buffer, run by a single initialiser thread, places every page on that thread's socket — even if the buffer will later be read by 96 worker threads spread across both sockets.

This is the most common NUMA-bug shape in production code: a clean-looking initialisation routine places a multi-gigabyte data structure on one node, and every worker thread on the other node pays remote-DRAM cost forever. The fix is either to write the warm-up in parallel (so the kernel first-touches different pages from different sockets) or to call numa_alloc_interleaved / mbind(MPOL_INTERLEAVE) to scatter pages across all nodes.
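The parallel-warm-up fix can be sketched in pure Python. This is a sketch assuming the standard Linux sysfs layout; `parse_cpulist`, `node_cpus`, and `parallel_first_touch` are illustrative helpers, not library APIs:

```python
import os
import threading

def parse_cpulist(text):
    """Parse a Linux cpulist string like '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus

def node_cpus(node):
    """CPUs belonging to one NUMA node, from the kernel's sysfs view."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

def parallel_first_touch(buf, nodes, page=4096):
    """Touch each slice of buf from a thread pinned to a different node, so
    the kernel's first-touch policy spreads the pages across the nodes."""
    n_pages = len(buf) // page
    per_node = n_pages // len(nodes)

    def touch(node, lo, hi):
        os.sched_setaffinity(0, node_cpus(node))  # pin the calling thread
        for p in range(lo, hi):
            buf[p * page] = 1  # the first store physically places the page

    threads = []
    for i, node in enumerate(nodes):
        hi = n_pages if i == len(nodes) - 1 else (i + 1) * per_node
        threads.append(threading.Thread(target=touch, args=(node, i * per_node, hi)))
    for t in threads: t.start()
    for t in threads: t.join()

# e.g. parallel_first_touch(bytearray(1 << 30), nodes=[0, 1])
```

The GIL serialises the writes, but that does not matter here: each page's first store still executes on a CPU belonging to the intended node, which is all first-touch placement cares about.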

Both fixes are five-line changes; recognising the bug from a flamegraph that looks normal but mem_load_l3_miss_retired.remote_dram shows 50 % is the harder part. The bug-hunt pattern is always the same: the function that takes longest in the flamegraph is rarely the function with the bug; the function that allocated the data the slow function reads is the one to refactor.

What this changes about benchmarking

The discipline that worked on single-socket — write a microbenchmark, pin to one core, take 10 samples, report the median — does not survive contact with a multi-socket box. Three things break.

The wall-clock floor moves with topology. A 1 ns difference in your microbenchmark means nothing if your production threads might land on the remote socket and pay 60 ns extra. The single-core number sets a floor; it does not predict the production cost.

Calibration requires running the benchmark on the topology you will deploy on, not on whatever box the CI runner happens to allocate. AWS, Azure, and GCP all expose NUMA topology in their instance metadata; reading /proc/cpuinfo and numactl --hardware before every benchmark run tells you what you are actually measuring. The numactl --hardware output is short — typically 8–15 lines — and reading it once before pressing "go" on a benchmark is the cheapest discipline available.

Confidence intervals widen by orders of magnitude. On a single-socket idle laptop, repeated runs of a microbenchmark cluster within ±2 %. On a 192-core dual-socket production box with concurrent workload noise, the same benchmark's run-to-run variance can hit ±40 % — because the scheduler, the L3-slice mapping, and the page placement all change between runs.

The classical warmup-and-median recipe has to be replaced with explicit pinning (taskset, numactl --physcpubind), explicit memory binding (numactl --membind), and explicit isolation (isolcpus, cpusets). Without all three, your "median" is a sample from a multi-modal distribution. Reporting the mean of a multi-modal distribution is a category error; the only honest summary is to report the modes themselves and the conditions that select between them.
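One way to report modes instead of a misleading mean is a simple gap-based clustering of the samples. A sketch — `modes` is an illustrative helper, and the median-gap heuristic is a rule of thumb, not a statistical test:

```python
def modes(samples, gap_factor=10.0):
    """Split sorted samples wherever neighbours are separated by more than
    gap_factor x the median gap; return (centre, count) per cluster."""
    s = sorted(samples)
    gaps = [b - a for a, b in zip(s, s[1:])]
    median_gap = sorted(gaps)[len(gaps) // 2] if gaps else 0.0
    clusters, current = [], [s[0]]
    for a, b in zip(s, s[1:]):
        if median_gap and (b - a) > gap_factor * median_gap:
            clusters.append(current)
            current = []
        current.append(b)
    clusters.append(current)
    return [(sum(c) / len(c), len(c)) for c in clusters]

# Bimodal per-load latencies (ns): a local-DRAM mode and a remote-DRAM mode.
samples = [81, 82, 82, 83, 145, 146, 146, 147]
print(modes(samples))  # two modes near 82 and 146 -- not one mean near 114
```

The mean of those samples (~114 ns) corresponds to no physical event on the machine; the two modes do, and the conditions that select between them (which socket the page landed on) are the real finding.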

Throughput stops being core × single-core throughput. On the 2010 box, eight cores running the same benchmark gave eight times the single-core throughput. On the 2026 box, 192 cores running the same benchmark sometimes give 90× throughput, sometimes 30×, sometimes 8× — depending entirely on whether the workload's data layout matches the topology.

If every core chases pointers through the same 256 MB array placed on socket 0, the interconnect saturates at the bandwidth of the cross-socket links, and adding more cores on socket 1 reduces total throughput. Universal Scalability Law — covered in Part 9 — formalises this; the single-socket-era assumption that scaling is linear is the assumption USL was invented to break.
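The USL curve can be previewed with illustrative coefficients — σ for contention, κ for coherence traffic. The numbers below are made up to show the shape, not fitted to any machine:

```python
def usl_throughput(n, lam=1.0, sigma=0.02, kappa=0.0004):
    """Universal Scalability Law: lam = single-core rate, sigma = serial
    contention fraction, kappa = pairwise coherence (crosstalk) cost."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 48, 96, 144, 192):
    print(f"{n:3d} cores: {usl_throughput(n):6.1f}x single-core throughput")
```

With any non-zero κ the curve has a peak (here near n ≈ 50) and then regresses — the worse-than-flat region past the socket boundary is exactly the κ term dominating.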

Frequency throttling makes "instructions per second" a moving target. A 2026 server CPU has more cores than its thermal and power budget can feed at peak frequency simultaneously: one core at 5.5 GHz boost or 96 cores at 2.4 GHz base, but not 96 cores at 5.5 GHz. The CPU's clock-control firmware (Intel SST, AMD CPPC) decides per-core frequency in 1-millisecond windows based on the current power draw and die temperature.

A microbenchmark that runs for 100 ms on one core sees boost frequency the entire run; the same code run as 96 parallel instances sees base frequency starting ~10 ms in. The benchmark on the laptop reports 0.18 ns per instruction; the production run effectively runs at 0.42 ns per instruction because the cores are at base frequency. Frequency is a throughput-dependent property of the workload itself, not a property of the CPU. The turbostat output is the ground truth; the spec sheet's GHz number is a vendor-marketing optimum that rarely matches sustained reality.
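You can observe the effect crudely without perf counters: run the same spin loop alone, then on every logical CPU at once. A sketch — interpreter overhead dominates the absolute numbers, but on most machines the solo-vs-parallel ratio still tracks boost-vs-base frequency (plus SMT sharing):

```python
import multiprocessing as mp
import time

def spin_ns_per_iter(n=5_000_000):
    """Time a pure-ALU loop; per-iteration cost tracks the core's clock."""
    t0 = time.perf_counter_ns()
    x = 0
    for i in range(n):
        x += i
    return (time.perf_counter_ns() - t0) / n

if __name__ == "__main__":
    solo = spin_ns_per_iter()
    ncpu = mp.cpu_count()
    with mp.Pool(ncpu) as pool:  # one spinner per logical CPU
        parallel = pool.map(spin_ns_per_iter, [5_000_000] * ncpu)
    mean_par = sum(parallel) / len(parallel)
    print(f"solo: {solo:.1f} ns/iter, all-cores mean: {mean_par:.1f} ns/iter "
          f"({mean_par / solo:.2f}x slower under full load)")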

Cache-line size and false-sharing thresholds shift between dies. On a single-die part, two cores writing to addresses 64 bytes apart hit two distinct L1 lines and the false-sharing chapter's worst case does not fire. On a chiplet design, the cache line still has the same 64-byte width, but coherence latency between cores on different chiplets is 25–40 ns — versus 5–10 ns within a chiplet.

False-sharing thresholds that were never hit on single-die hardware become real bottlenecks; padding requirements that worked on a 32-core single-socket machine sometimes break on a 96-core chiplet machine because the workload now spans more chiplets and more lines bounce. The general rule is: false-sharing patterns get worse, not better, as parallelism scales — the more cores writing to nearby addresses, the worse the coherence storm.
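The line-bouncing cost is visible even from Python, using a raw shared array whose worker slots are either packed into one cache line or padded one line apart. A sketch — absolute times are dominated by interpreter and ctypes overhead, but the packed-vs-padded gap is the false-sharing signal:

```python
import multiprocessing as mp
import time

def bump(arr, idx, iters):
    """Each worker increments only its own slot -- no logical sharing."""
    for _ in range(iters):
        arr[idx] += 1

def run(stride, workers=4, iters=200_000):
    # lock=False: raw shared memory; correctness holds because slots are disjoint
    arr = mp.Array('q', stride * workers, lock=False)
    procs = [mp.Process(target=bump, args=(arr, i * stride, iters))
             for i in range(workers)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    packed = run(stride=1)  # 8-byte slots, adjacent: all in one 64-byte line
    padded = run(stride=8)  # slots 64 bytes apart: one line per worker
    print(f"packed: {packed:.3f} s, padded: {padded:.3f} s")
```

On a chiplet or multi-socket box the packed run pays coherence traffic on every increment even though no two workers ever touch the same element — the sharing is entirely an artifact of cache-line granularity.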

The Zerodha Kite order-matching team learned this the hard way during the 2024 budget-day NIFTY rally. Their order-matcher had been benchmarked at 380,000 orders/sec on a single-socket box. They migrated to a two-socket EPYC system to scale further; production throughput peaked at 410,000 orders/sec — barely 8 % more than single-socket — because the order book lived in pages first-touched on socket 0, and order workers ran on both sockets.

Cross-socket loads on every order match dragged the second socket's worth of cores down to a third of their potential. The fix — explicit numa_alloc_onnode per shard plus pinning order-book shards to socket-local cores — pushed throughput to 730,000 orders/sec on the same hardware. The hardware was never the limit; the topology-blind code was.

[Figure: Throughput vs core count — ideal vs naive vs NUMA-aware (Zerodha order-matcher, 2-socket EPYC 9654). x-axis: cores active, 1–192; y-axis: throughput, 0–12 M ops/s. The ideal-linear curve climbs to ~11.5 M at 192 cores. The naive (cross-socket) curve climbs to ~4.5 M by 96 cores — the socket boundary — then falls off slightly past it as the interconnect saturates. The NUMA-aware curve reaches ~9 M at 192 cores, near but below ideal.]
Same workload, same hardware, different topology assumptions. The naive build saturates around 96 cores and falls off — adding cores on the remote socket makes it worse. The NUMA-aware build approaches near-linear scaling. Illustrative — based on runs on a 2-socket EPYC 9654 (c7a.metal-class) box; absolute numbers anonymised.

The Zerodha curve is the pattern every multi-socket workload follows when topology is ignored. Performance climbs while threads stay on the first socket, then plateaus or regresses as work spills onto the second socket and every cross-socket access pays the interconnect tax. The signature in perf stat -e mem_load_l3_miss_retired.remote_dram is a metric that goes from 0 % at low core counts to 40–60 % the moment you exceed one socket's worth of cores. Watching that counter climb is watching your benchmark stop describing your machine.

Why the curve is sometimes worse-than-flat past the socket boundary: when the second socket starts contributing cores, the interconnect carries cache-line traffic in both directions — coherence probes from socket 1 asking socket 0 for ownership of cache lines, plus the actual data transfer back. Each cross-socket coherence transaction consumes 4–8 interconnect transfer slots that would otherwise carry data. Past a critical contention threshold, the interconnect becomes the global bottleneck, and adding more remote cores means each existing core sees more queueing on its memory accesses. The throughput curve regresses below its single-socket peak — the more parallelism you add, the slower the system runs. This is a real phenomenon, not a theoretical worst case; the Zerodha team measured it, and so has every team that ships multi-socket workloads without topology awareness.

The other signature worth knowing is the flamegraph that lies. On a NUMA-blind workload, the flamegraph still shows the same hot functions you saw on your laptop — __memcpy_avx512, hash_lookup, score_document — at roughly the same relative percentages. Nothing in the flame structure tells you that half the cycles inside __memcpy_avx512 are pure remote-DRAM stall.

The diagnostic that does tell you is perf stat -e cycle_activity.stalls_l3_miss and the per-socket mem_load_l3_miss_retired.local_dram vs mem_load_l3_miss_retired.remote_dram ratio. When the remote-DRAM count exceeds 30 % of total L3-miss-retired loads, you are running a NUMA-blind workload regardless of what the flamegraph says.

Asha at Flipkart spent two days reading flamegraphs before someone on the SRE team showed her the perf-stat counters; the diagnosis took twenty minutes once she was looking at the right number. The lesson she took away — and the one this chapter exists to compress — is that flamegraphs answer "where am I spending time?" but cannot answer "why is that time so much higher than my benchmark predicted?". The "why" is in the topology counters; the discipline is to read both.

What this means for the rest of the curriculum

Every chapter from this point forward operates on the assumption that the machine is a topology, not a number. The Part-2 chapters before this one — caches, lines, prefetchers, layout, false sharing — all hold; they describe the per-socket subsystem that lives inside each NUMA node. But their effectiveness depends on the data being on the right node. A perfectly cache-friendly access pattern on remote DRAM still pays the interconnect tax; a perfectly NUMA-local layout that ignores cache lines still thrashes coherence. The two disciplines compose; neither replaces the other.

Three mental moves separate engineers who navigate this transition cleanly from those who don't. First, internalise that "the box" is a graph. Sockets are nodes; interconnects are edges; memory controllers are attached to nodes; cores are clustered into chiplets within nodes. The output of lstopo (from hwloc) is the visual representation of this graph for any given machine, and reading it once before benchmarking should be a habit. Second, treat thread placement and page placement as part of the program, not as kernel-scheduler responsibility. The kernel will do something reasonable by default; "reasonable" is rarely "optimal" for latency-critical workloads. Third, learn to read the perf counters that are NUMA-aware. mem_load_l3_miss_retired.local_dram and mem_load_l3_miss_retired.remote_dram are the two counters that distinguish a single-socket workload from a topology-blind one; their ratio is the single number that tells you whether you have closed the gap.
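The graph has a numeric form the kernel already exposes: the ACPI SLIT distance matrix, in which 10 means local and larger values scale the relative access cost. A minimal reader, assuming the standard Linux sysfs layout (`numa_distances` is an illustrative helper, not a library API):

```python
import glob

def numa_distances():
    """Return {node: [distance to node 0, node 1, ...]} from sysfs.
    By SLIT convention 10 = local; ~20-32 = one interconnect hop away."""
    dists = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/distance")):
        node = int(path.rsplit("node", 1)[1].split("/")[0])
        with open(path) as f:
            dists[node] = [int(x) for x in f.read().split()]
    return dists

if __name__ == "__main__":
    for node, row in sorted(numa_distances().items()):
        print(f"node {node}: {row}")
    # A dual-socket box typically prints something like:
    # node 0: [10, 32]
    # node 1: [32, 10]
```

A single-socket laptop prints one row, `node 0: [10]` — which is itself the diagnostic: there is no remote case to measure on that machine.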

The next part of the curriculum builds this discipline systematically. The chapter immediately after this one formalises the UMA-to-NUMA architectural shift; the one after that walks through topology discovery via numactl, lstopo, and /sys/devices/system/node; the one after that introduces numactl as the user-space lever for binding pages and pinning threads. By the end of Part 3, you have the toolkit; what this chapter contributes is the recognition that you need it.

Common confusions

Going deeper

What changed in 2017 — chiplets and the inside-the-socket NUMA

AMD's Zen architecture (2017) and successors (Zen 2, 3, 4, 5) split a single socket into multiple chiplets — small dies connected by an on-package Infinity Fabric. The 96-core EPYC 9654 is twelve eight-core chiplets plus one I/O die. Crucially, each chiplet has its own L3 slice and accesses to other chiplets' L3 cross the on-package fabric.

This gives you "inside-the-socket NUMA": even within one socket, a load from a different chiplet pays 5–10 ns more than a load from the same chiplet. Intel's Sapphire Rapids has a similar (though tighter) tiled structure with mesh-on-die.

The era of "one socket = one homogeneous compute pool" ended with chiplet designs; on-package non-uniformity is now real and benchmarks that pin to "any core in the socket" can see 10 % variance from chiplet placement alone. The lstopo tool (from hwloc) renders the chiplet structure; running it on a 2026 EPYC box shows twelve clusters of eight cores each, and the visualisation makes the on-package boundaries impossible to miss.

The Hotstar IPL final story — when single-socket-tuned code met two sockets

The Hotstar transcoder fleet during the IPL final 2024 ran 1,200 instances of a video-segment-encoding service, each a two-socket Sapphire Rapids 8480+. The encode kernel had been benchmarked on a single-socket dev box at 14 ms p50 per segment. In production, p50 was 22 ms and p99 was 71 ms; SLO was 30 ms. The flamegraph showed __memcpy_avx512_unaligned at 31 % of CPU — but cycle counts on those memcpy lines were dominated by stalls. perf stat showed mem_load_l3_miss_retired.remote_dram at 47 % of all L3 misses. Half the encoder's memory accesses were crossing the socket boundary. The encoder was allocating segment buffers in a producer thread that ran on socket 0, then handing them to encoder threads scattered across both sockets. Half the encoder threads were on socket 1, paying remote-DRAM cost on every read. The fix was a five-line change: allocate segment buffers using numa_alloc_local(), and assert in CI that producer and encoder for a given segment ran on the same socket. p50 dropped to 13 ms; p99 to 28 ms. Fleet shrank from 1,200 instances to 780. Saving: ~₹4.7 crore/month. Engineering cost: half a day to write the fix, two days to validate.

Why the OS scheduler is part of the wall

Linux's CFS scheduler is socket-aware (it prefers to keep threads on the same NUMA node they were last scheduled on) but not NUMA-optimal — it will migrate threads across sockets when load imbalance demands it, without checking whether their working set is local. On a 192-core box, the scheduler can move a thread between sockets every few milliseconds during load spikes; the thread's L1, L2, and L3-slice working set is then cold on the new socket, paying ~thousands of cycles to refetch even if the data is in DRAM on the other socket.

The fix is taskset or sched_setaffinity() for latency-critical threads, but most production workloads accept the migration cost in exchange for load-balance fairness. Auto-NUMA balancing (/proc/sys/kernel/numa_balancing) tries to migrate pages to follow threads, but it has its own overhead — the kernel periodically samples access patterns by injecting page faults, which adds tens of microseconds to those accesses.

Some Razorpay services disable auto-NUMA-balancing entirely (echo 0 > /proc/sys/kernel/numa_balancing) and pin threads explicitly; others leave it on and accept the noise. The right answer depends on your workload's tolerance for latency variance — a topic Part 7 covers in depth.

The Aadhaar/UIDAI auth pipeline — when one socket was enough, until it wasn't

UIDAI's biometric-auth pipeline ran for years on single-socket Xeon E5 v4 boxes (22 cores per socket) at peak loads of 380 auth/second per box. The workload — fingerprint match against a sharded index — fits the single-socket model: each request hits one shard, the shard fits in L3, and the per-request working set never escapes one node. Throughput scaled linearly with cores; the team had no reason to think about multi-socket.

In 2022, with Aadhaar-linked DBT (direct-benefit-transfer) traffic crossing 1B authentications/day, the team migrated to two-socket Ice Lake boxes (40 cores per socket) expecting near-doubling. They got a 1.4× improvement instead of the expected 1.9×. The investigation took three weeks. The cause was a global rate-limiter, implemented as a single atomic_uint64_t counter incremented by every request handler. On the single-socket box, the cache line lived in the shared L3 and contention was bounded. On the two-socket box, the cache line bounced between sockets every few microseconds — every increment from socket 1 invalidated socket 0's copy, every increment from socket 0 invalidated socket 1's copy. The rate-limiter's overhead went from <1 % of cycles to 14 %.

The fix was a per-CPU counter array, summed periodically by a separate thread; throughput jumped to 2.0× the single-socket baseline, slightly better than ideal because the new design also removed contention noise. The lesson: assumptions baked into single-socket code show up as performance cliffs the moment the topology grows. Code that was correct stays correct; code that was fast does not stay fast. The story is not x86-only either — Apple's M2 Ultra fuses two dies with a 2.5 TB/s interconnect, AWS Graviton4 ships in two-socket configurations on c8g.metal-48xl, and every one of them has the same socket-boundary cliff. The constants differ; the discipline does not.
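The per-CPU-counter shape translates to any language. A Python sketch of the design — in C the slots would be `alignas(64)` structs; here the stride only illustrates the layout, since Python ints are heap objects and the GIL already serialises access:

```python
import itertools
import threading

class ShardedCounter:
    """Each thread increments its own slot; readers sum all slots.
    No slot is ever written by two threads, so no cache line bounces."""
    STRIDE = 8  # 8 x 8-byte slots = one 64-byte line per shard (in a C layout)

    def __init__(self, shards):
        self._slots = [0] * (shards * self.STRIDE)
        self._next_id = itertools.count()
        self._local = threading.local()
        self._shards = shards

    def add(self, n=1):
        slot = getattr(self._local, "slot", None)
        if slot is None:  # first add() from this thread: claim a shard
            slot = (next(self._next_id) % self._shards) * self.STRIDE
            self._local.slot = slot
        self._slots[slot] += n  # uncontended: this thread owns the slot

    def value(self):
        # Racy-but-monotonic snapshot, like the periodic summing thread above
        return sum(self._slots)

counter = ShardedCounter(shards=4)
workers = [threading.Thread(target=lambda: [counter.add() for _ in range(10_000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(counter.value())  # 40000
```

The design trades read cost (summing N slots) for write scalability (no shared line on the hot path) — the right trade for a rate-limiter that is incremented millions of times per second and read a few times per interval.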

Reproduce this on your laptop

Most laptops are single-socket and will not show the local-vs-remote gap. To see it:

# On a multi-socket Linux box (e.g. AWS c6a.metal, Azure HBv4, GCP n2-standard-128):
sudo apt install linux-tools-common linux-tools-generic numactl
numactl --hardware                  # see how many NUMA nodes the box has
python3 -m venv .venv && source .venv/bin/activate
pip install numpy numba

# Local: thread on node 0, memory on node 0
numactl --cpunodebind=0 --membind=0 python3 numa_load_latency.py

# Remote: thread on node 0, memory on node 1
numactl --cpunodebind=0 --membind=1 python3 numa_load_latency.py

# Watch the difference live with perf:
sudo perf stat -e mem_load_l3_miss_retired.local_dram,mem_load_l3_miss_retired.remote_dram \
    numactl --cpunodebind=0 --membind=1 python3 numa_load_latency.py

On any 2-socket box from the 2018-and-later generation, the local-vs-remote ratio is in the 1.5–2.0× range for single-thread runs, and 3–6× for runs that contend on the interconnect. If your laptop is single-socket, the two numactl commands give the same answer — and that, by itself, is the lesson: the laptop you benchmark on is not the machine your code runs on in production.

Where this leads next

Part 2 was about the memory hierarchy as if it were uniform — caches, lines, prefetchers, layout. Every chapter in it implicitly assumed one socket, one memory controller, one global "DRAM". This chapter is the moment that assumption breaks. The next part — Part 3, NUMA and multi-socket — replaces the uniform-memory model with a topology graph and walks through what changes once you accept that "memory" is plural.

The chapters that follow build the multi-socket programming discipline.

The pivot to think with: the 2026 production server is not a CPU, it is a small distributed system on one motherboard. The network is the interconnect; the nodes are the sockets; the latency budget is split between local DRAM, remote DRAM, and L3 hits across chiplets.

Every technique that works for distributed systems — locality, sharding, data-near-compute, traffic minimisation — has a microarchitectural analogue inside one server. The skill that follows is recognising which of your single-socket assumptions just stopped holding, and rebuilding the mental model around the topology your code actually runs on.

When Asha re-ran her catalogue ranking benchmark with numactl --membind=0 --physcpubind=0, she got the 240 ns/op number back — exactly the laptop measurement, on production hardware. The benchmark was always real. It just described a different machine than the one her code actually ran on. The gap between those two machines is what this curriculum is about, and Part 3 is where you learn to close it.

References

  1. Wulf and McKee, "Hitting the Memory Wall: Implications of the Obvious" (1995) — the paper that named the wall, predicting essentially the trajectory the industry followed.
  2. Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017) — Chapter 5 (multiprocessor architecture), Chapter 6 (warehouse-scale computing); the standard reference for the multi-socket transition.
  3. AMD EPYC 9004 Series Architecture (Zen 4 / Genoa) — chiplet topology, Infinity Fabric specifications, on-package NUMA structure on the EPYC 9654.
  4. Intel Xeon Scalable Processors (Sapphire Rapids) Architecture — UPI 2.0 specifications, mesh-on-die structure, latency targets.
  5. Christoph Lameter, "NUMA: An Overview" (USENIX 2013) — the practitioner's introduction to NUMA programming and the kernel interfaces.
  6. Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 7 (Memory) — the production-debugging perspective on multi-socket memory issues.
  7. /wiki/wall-cpus-are-fast-memory-is-not — the Part-2 wall this chapter is downstream of; the single-socket memory wall is the limit that pushed the industry sideways.
  8. /wiki/uma-vs-numa-the-architectural-shift — the next chapter, which formalises everything this one motivates.