Memory bandwidth as the real ceiling

Karan, an engineer on Flipkart's catalogue ranking team, parallelises a feature-vector dot-product kernel that scores 8 million products against a query embedding. Single-threaded on his c6i.4xlarge it runs at 4.2 GB/s of useful data movement — already memory-heavy, but tolerable. He fans it across 16 vCPUs expecting 16x. He gets 5.8x. CPU utilisation is 100% on every core. The flamegraph shows 99% of time in the inner SIMD loop, no syscalls, no allocations, no locks, no shared writes to invalidate any cache line. The cores are busy; they are just busy waiting. The 8 million vectors are 12 GB; the L3 holds 30 MB of them at a time; the rest is streaming from DRAM, and DRAM on this socket can deliver about 38 GB/s. Sixteen cores asking for 4.2 GB/s each is 67 GB/s of demand against a 38 GB/s pipe. The pipe wins.

Memory bandwidth is the second physical ceiling that masquerades as Amdahl's serial fraction. Unlike coherence traffic, which is a coordination cost between cores, bandwidth is a delivery cost between the DRAM controllers and everything else. It does not show up in perf c2c. It does not look like contention. It looks like cores that are busy executing instructions but those instructions are stalled on memory loads — and the only way to see it is to measure DRAM controller saturation directly. This chapter is about how to measure it, when it dominates, why DDR5 and HBM exist, and what to do when you discover your scaling problem is the wire and not the code.

Every multi-core CPU has a fixed number of memory channels feeding a fixed number of DRAM controllers, and their combined bandwidth is hard-capped (~38–80 GB/s on modern DDR4/DDR5 servers, ~800+ GB/s on HBM). When the aggregate working-set demand of all cores exceeds that ceiling, additional cores cannot add throughput — they only add stall cycles. The fit looks like coherence saturation but the diagnostic is different: high cycles, low IPC, near-zero LLC residency, and pcm-memory reading near peak GB/s. The fix is structural — reduce bytes-per-useful-op, not add cores.

The bandwidth pipe — why DRAM has a fixed cap

A modern server CPU has a small number of integrated memory controllers (typically 2–6 per socket), each driving a small number of memory channels (DDR4: 64 data bits + 8 ECC; DDR5: two 32-bit subchannels per slot). Each channel runs at a fixed transfer rate determined by the DIMM clock — DDR4-3200 transfers at 3200 MT/s, DDR5-4800 at 4800 MT/s. The peak theoretical bandwidth is just channels × bytes_per_transfer × MT/s. For a 2-socket Skylake-X with 6 channels per socket of DDR4-2666: 2 × 6 × 8 × 2.666 = 256 GB/s total, ~128 GB/s per socket. That is the ceiling of every workload that actually streams memory.
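The arithmetic is worth being able to redo on demand. A minimal sketch of the peak-bandwidth formula, using the nominal numbers just quoted (illustrative, not measured):

# Illustrative: peak bandwidth = channels x bytes_per_transfer x transfer rate
def peak_gbs(channels, mt_per_s, bytes_per_transfer=8):
    # DDR4: one 64-bit (8-byte) channel; DDR5: two 32-bit subchannels per slot,
    # still 8 bytes per slot per transfer in aggregate
    return channels * bytes_per_transfer * mt_per_s / 1000  # GB/s, decimal

print(peak_gbs(12, 2666))  # 2 sockets x 6 channels of DDR4-2666: ~256 GB/s total
print(peak_gbs(6, 2666))   # per socket: ~128 GB/s
print(peak_gbs(8, 3200))   # 8 channels of DDR4-3200 (EPYC 7543 class): ~205 GB/s peak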

The reason this matters for scaling: cores share the controllers. A 32-core EPYC 7543 has 32 cores feeding 8 DDR4-3200 channels; if every core wants 5 GB/s, demand is 160 GB/s and the channels can deliver ~204 GB/s peak (often ~140 GB/s sustained after row-activate / refresh overhead). Add more cores and the per-core bandwidth share drops — the aggregate stays flat. Unlike coherence cost, which gets worse with N (the curve turns retrograde), bandwidth saturation gets flat with N: throughput hits the ceiling and stays there. Both look like "scaling broke past N=X"; the diagnostic separating them is whether the DRAM controllers are at saturation.
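The flattening itself is easy to model: aggregate throughput is the smaller of total demand and the sustained ceiling. A minimal sketch using the illustrative EPYC figures above (~140 GB/s sustained, 5 GB/s per-core demand):

# Illustrative saturation model: aggregate = min(N * per-core demand, ceiling)
SUSTAINED_GBS = 140.0   # assumed sustained ceiling, 8 x DDR4-3200 (from the text)
PER_CORE_GBS = 5.0      # assumed per-core streaming demand (from the text)

for n in (1, 4, 8, 16, 28, 32):
    aggregate = min(n * PER_CORE_GBS, SUSTAINED_GBS)
    print(f"N={n:2d}  aggregate {aggregate:6.1f} GB/s  per-core {aggregate / n:5.2f} GB/s")

# Past ceiling/demand cores (~28 here), extra cores only dilute the per-core share
print(f"cores to saturation: ~{SUSTAINED_GBS / PER_CORE_GBS:.0f}")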

Memory hierarchy bandwidth and latency by tier. (Figure: a bar chart and pyramid of the memory hierarchy — registers, L1, L2, L3, DRAM, SSD — with bandwidth shrinking and latency growing by orders of magnitude as you descend.)

The bandwidth cliff — every tier is 3-10x slower than the one above

tier           bandwidth (GB/s, per core)     latency    capacity
L1d            ~1200                          ~1 ns      32-48 KB
L2             ~600                           ~4 ns      256 KB - 1 MB
L3 (LLC)       ~250                           ~12 ns     8-96 MB shared
DRAM (DDR4)    ~12 (peak ~38 / 16 cores)      ~80 ns     128 GB - 2 TB
DRAM (DDR5)    ~22 (peak ~75 / 16 cores)      ~75 ns     256 GB - 4 TB
HBM3           ~80 (peak ~819 / 10 stacks)    ~95 ns     64-128 GB
NVMe Gen4      ~7 (per device, no per-core)   ~80 µs     1-30 TB

Bandwidth bars on a log scale; per-core figures assume a fully-utilised 16-core socket.
The DRAM tier delivers ~10-30x less bandwidth per core than the LLC and ~100x less than L1. A workload whose working set escapes the LLC pays this 10-30x bandwidth tax silently — there is no error, no warning, just slower aggregate throughput. Illustrative — typical 2024 server class.

The reader's gut model usually puts DRAM "next to" the CPU because they sit in adjacent slots on a motherboard; the hardware reality is that DRAM is two orders of magnitude slower than L1 in both bandwidth and latency, and the wire between them — the memory channel — is the narrowest link in the system once your working set escapes the LLC. The cores themselves can issue loads at vastly higher rates than the channels can satisfy them, so the cores end up stalled — instructions in flight but waiting for data — while the channels are running at full utilisation. From the OS's point of view this looks like 100% CPU; from the hardware's point of view this is 100% memory.

Why DRAM has fixed bandwidth and not "as much as you need": DRAM cells are leaky capacitors arranged in arrays; each access requires opening a row (~15 ns), reading or writing the column (~10 ns), and eventually closing the row to refresh capacitors before they decay (every ~64 ms). The channel between the DIMM and the controller has to be physically wide (64 data bits) and clocked at fixed rates (3200 MT/s for DDR4-3200) determined by signal integrity over the trace lengths on the motherboard. You cannot increase bandwidth without adding channels (more pins on the CPU package, more DIMM slots) or generations (DDR5 doubled the clock and split each channel in two). It is a physical interconnect with hard physical limits, not a software resource you allocate.

Measuring the bandwidth ceiling

The harness below pushes a tunable amount of memory traffic per worker and sweeps worker count, plotting aggregate bandwidth. It uses numpy because numpy arrays have predictable contiguous memory layout — Python lists of objects do not, and would not produce a measurable signal. The kernel is np.sum(arr) over a buffer too large to fit in any cache, which on x86-64 compiles down to a tight SIMD streaming load.

# bandwidth_ceiling.py — measure aggregate DRAM bandwidth vs core count
# Run: python3 bandwidth_ceiling.py 1 2 4 8 16
import multiprocessing as mp, numpy as np, time, sys, os

# 256 MB per worker — well above any LLC; forces DRAM-bound streaming
BUF_BYTES = 256 * 1024 * 1024
ITERS = 8

def worker(seed, q):
    # Each worker allocates its own buffer in its own pages.
    arr = np.zeros(BUF_BYTES // 8, dtype=np.float64)
    arr[::512] = seed  # write to every 4 KB page so the OS commits real DRAM backing pages
    t0 = time.perf_counter()
    s = 0.0
    for _ in range(ITERS):
        s += float(arr.sum())  # streaming SIMD load over 256 MB
    dt = time.perf_counter() - t0
    bytes_moved = ITERS * BUF_BYTES
    q.put((bytes_moved / dt, s))  # bandwidth, checksum (prevent DCE)

def measure(N):
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, q)) for i in range(N)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    dt = time.perf_counter() - t0
    bws = [q.get()[0] for _ in range(N)]
    return sum(bws), sum(bws)/N, dt

if __name__ == "__main__":
    Ns = [int(x) for x in sys.argv[1:]] or [1, 2, 4, 8, 16]
    print(f"# logical cores: {mp.cpu_count()}, buffer per worker: {BUF_BYTES//(1024*1024)} MB")
    print(f"# {'N':>3}  {'agg GB/s':>10}  {'per-core GB/s':>14}  wall(s)")
    for N in Ns:
        if N > mp.cpu_count(): break
        agg, per, wall = measure(N)
        print(f"  {N:3d}  {agg/1e9:10.2f}  {per/1e9:14.2f}  {wall:.2f}")
# Sample run on c6i.4xlarge (16 vCPU Ice Lake, 8 channels DDR4-3200, ap-south-1)
# logical cores: 16, buffer per worker: 256 MB
#   N    agg GB/s   per-core GB/s  wall(s)
#   1       12.40           12.40  0.165
#   2       23.11           11.55  0.177
#   4       33.74            8.43  0.243
#   8       38.62            4.83  0.424
#  16       39.18            2.45  0.836

Walk-through. BUF_BYTES = 256 MB is chosen to be far larger than the 30 MB LLC, so every arr.sum() is forced to stream from DRAM — there is no cache reuse to hide behind. arr[::512] = seed writes to every 4 KB OS page so the kernel actually commits backing pages in DRAM (Linux allocates lazily; the untouched pages of an np.zeros buffer map to the shared zero page until first written). ITERS = 8 averages out single-shot variance; with the buffer this large each .sum() is ~20 ms of pure streaming.

The numbers tell the story. Per-core bandwidth at N=1 is 12.4 GB/s — the limit of a single core's outstanding loads and prefetch depth. At N=2 aggregate nearly doubles and per-core barely drops, so two cores share the channels happily. By N=4 aggregate is 33.7 GB/s and per-core has dropped to 8.4 GB/s — the channels are getting saturated. From N=8 onward, aggregate is flat at ~38-39 GB/s — exactly the published sustained DDR4-3200 ceiling for this socket. Per-core productivity collapses from 12.4 to 2.45 GB/s — a 5x drop in bytes per second per core, the same shape as Karan's catalogue ranker.

Aggregate DRAM bandwidth vs core count, showing the saturation plateau. (Figure: aggregate bandwidth in GB/s against worker count from 1 to 16; the curve rises sharply from N=1 to N=4, then flattens at the ~38 GB/s DDR4-3200 sustained ceiling for N >= 8. Measured points: 12.4, 23.1, 33.7, 38.6, 39.2 GB/s; annotations mark the "linear hope" line and the channel-saturation plateau — flat, so adding cores does not help.)
Unlike coherence-bound code (which turns retrograde), bandwidth-bound code flattens at the DRAM channel ceiling. Aggregate stays at ~38 GB/s; per-core throughput drops as 1/N because the fixed pipe is shared across more workers. Numbers measured by the harness above on c6i.4xlarge.

The diagnostic that confirms bandwidth saturation (and rules out coherence) uses Intel's pcm-memory or AMD's amd_uprof to read the DRAM controller counters directly:

# pcm-memory — direct DRAM controller throughput, sampled every 1 second
$ sudo pcm-memory 1
---------------------------------------||---------------------------------------
--             Socket 0              --||--             Socket 1              --
---------------------------------------||---------------------------------------
--  Memory Channel Monitoring        --||--  Memory Channel Monitoring        --
---------------------------------------||---------------------------------------
--  Mem Ch  0:  Reads (MB/s):  4582  --||--  Mem Ch  0:  Reads (MB/s):     0  --
--  Mem Ch  1:  Reads (MB/s):  4598  --||--  Mem Ch  1:  Reads (MB/s):     0  --
--  Mem Ch  2:  Reads (MB/s):  4571  --||--  Mem Ch  2:  Reads (MB/s):     0  --
--  Mem Ch  3:  Reads (MB/s):  4612  --||--  Mem Ch  3:  Reads (MB/s):     0  --
--  Mem Ch  4:  Reads (MB/s):  4604  --||--  Mem Ch  4:  Reads (MB/s):     0  --
--  Mem Ch  5:  Reads (MB/s):  4587  --||--  Mem Ch  5:  Reads (MB/s):     0  --
--  Mem Ch  6:  Reads (MB/s):  4591  --||--  Mem Ch  6:  Reads (MB/s):     0  --
--  Mem Ch  7:  Reads (MB/s):  4583  --||--  Mem Ch  7:  Reads (MB/s):     0  --
---------------------------------------||---------------------------------------
--  NODE 0 Mem Read (MB/s) :  36728  --||--  NODE 1 Mem Read (MB/s) :      0  --
--  NODE 0 Mem Write(MB/s) :    412  --||--  NODE 1 Mem Write(MB/s) :      0  --
---------------------------------------||---------------------------------------
--                System Read Throughput(MB/s):    36728                      --
--                System Write Throughput(MB/s):     412                      --
--                System Memory Throughput(MB/s):  37140                      --
---------------------------------------||---------------------------------------

Every channel is reading at ~4.58 GB/s — well above 90% of the 5 GB/s per-channel sustained ceiling. Aggregate matches the harness measurement (37 GB/s vs 39 GB/s, the small gap is sampling jitter). When pcm-memory shows every channel near its peak and the workload is still slow, the diagnosis is unambiguous: you are memory-bandwidth bound, and no amount of parallelism will help. Compare with the coherence-bound diagnostic (perf c2c showing HITM events on a hot line) — they are mutually exclusive failure modes that look superficially identical from the outside (cores busy, scaling poor) but have entirely different fixes.

Why pcm-memory is the right tool and not perf stat: perf stat -e cache-misses counts cache-miss events but those events do not map cleanly to DRAM bandwidth — a cache miss could hit another core's cache (no DRAM traffic) or hit DRAM (full DRAM traffic), and the counters do not separate them by default. pcm-memory reads the integrated memory controller's own performance counters, which count actual bytes transferred on the DRAM bus. It is the only tool that gives you bytes per second on the wire as a direct measurement, which is the only number that decides whether you are at the bandwidth ceiling.

The roofline model — why bandwidth and compute are duals

Sam Williams' roofline model (Berkeley, 2008) is the canonical mental model for bandwidth-vs-compute bottlenecks. Plot arithmetic intensity (FLOPs per byte of data moved from DRAM) on the x-axis and achieved performance (FLOPs per second) on the y-axis. The plot has two regions: a sloped line on the left where performance is bandwidth-limited (perf = arith_intensity × peak_bandwidth), and a horizontal ceiling on the right where performance is compute-limited (perf = peak_FLOPS). The ridge point — where the slope hits the ceiling — is the kernel's machine balance, the arithmetic intensity above which compute matters and below which memory matters.

For Ice Lake at peak: ~2 TFLOPS double-precision FMA across 16 cores, ~38 GB/s sustained DRAM. The ridge is at 2000 / 38 ≈ 53 FLOPs per byte (roughly 420 FLOPs per 8-byte double-precision element). A naive vector dot product is 2 FLOPs per element (one multiply, one add) against 16 bytes loaded — well to the left of the ridge, deeply bandwidth-bound. A well-blocked dense matrix multiply with N=4096 reaches hundreds of FLOPs per byte — well to the right, compute-bound. The roofline tells you, before you write the code, which physical resource your kernel will saturate first.

Why arithmetic intensity is the right axis and not "operation count": two kernels can do the same number of FLOPs but move very different amounts of data — a dot product of two vectors moves 16 bytes per multiply-add pair, while an in-cache matrix multiply moves a fraction of a byte per FLOP. The bottleneck is determined by the ratio, not by either quantity alone. Engineers who try to "optimise" a bandwidth-bound kernel by shaving FLOPs (algebraically refactoring to trade a multiply for an add, say) are optimising the wrong axis — the time is in the load, not the multiply.

# roofline.py — compute and plot arithmetic intensity for a few kernels
import time, numpy as np

def time_it(fn, *args):
    t0 = time.perf_counter(); s = fn(*args); dt = time.perf_counter() - t0
    return dt, s

N = 8 * 1024 * 1024  # 64 MB of doubles, well past LLC
a = np.random.rand(N); b = np.random.rand(N)

# Kernel 1 — dot product: 2 FLOPs/elem, 16 bytes/elem → 0.125 FLOP/byte
dt, s = time_it(np.dot, a, b)
flops = 2 * N / dt; bw = 16 * N / dt
print(f"dot      : {flops/1e9:6.2f} GFLOPS  {bw/1e9:6.2f} GB/s  AI={flops/bw:.3f} FLOP/B")

# Kernel 2 — element-wise multiply: 1 FLOP/elem, 24 bytes/elem (read a, read b, write c)
c = np.empty_like(a)
dt, _ = time_it(lambda: np.multiply(a, b, out=c))
flops = N / dt; bw = 24 * N / dt
print(f"a*b      : {flops/1e9:6.2f} GFLOPS  {bw/1e9:6.2f} GB/s  AI={flops/bw:.3f} FLOP/B")

# Kernel 3 — 1024x1024 matrix multiply, operands hot in cache: ~85 FLOPs/byte
M = np.random.rand(1024, 1024); P = np.empty((1024, 1024))
dt, _ = time_it(lambda: np.dot(M, M, out=P))
flops = 2 * 1024**3 / dt; bytes_moved = 8 * 3 * 1024**2  # 3 matrices, mostly in cache
bw = bytes_moved / dt
print(f"matmul1k : {flops/1e9:6.2f} GFLOPS  {bw/1e9:6.2f} GB/s  AI={flops/bw:.2f} FLOP/B")
# Sample run on c6i.4xlarge
# dot      :   4.61 GFLOPS   36.92 GB/s  AI=0.125 FLOP/B
# a*b      :   1.58 GFLOPS   37.96 GB/s  AI=0.042 FLOP/B
# matmul1k : 384.20 GFLOPS    4.50 GB/s  AI=85.33 FLOP/B

The dot product hits 36.9 GB/s — within 3% of the channel ceiling — and 4.6 GFLOPS, which is abysmal compared to the 2000 GFLOPS the cores can do. It is bandwidth-bound. The element-wise multiply moves more bytes per FLOP and so achieves even less compute. The matmul, by contrast, achieves 384 GFLOPS — roughly 80x more — at only ~4.5 GB/s of memory traffic, because each loaded byte contributes to ~85 FLOPs instead of 0.125. Same 16 cores, same DRAM, completely different bottleneck. The roofline diagnoses the regime in one number: the arithmetic intensity.

Why this matters for parallel scaling: bandwidth-bound kernels do not benefit from more cores past the point of channel saturation. Compute-bound kernels benefit linearly with cores until the FLOP ceiling is hit. Mixed kernels — and most real workloads are mixed — benefit until whichever ceiling the workload sits closest to is hit. Identifying which ceiling you are about to hit is what tells you whether to add cores or restructure the algorithm.
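The roofline bound can be checked directly against those measurements: attainable performance is the smaller of the compute ceiling and AI times the bandwidth ceiling. A sketch using this chapter's nominal c6i figures (~2000 GFLOPS peak, ~38 GB/s sustained); the bound is an upper limit, not a prediction of what a given library will achieve:

# Roofline bound: attainable GFLOPS = min(peak compute, AI x peak bandwidth)
PEAK_GFLOPS = 2000.0    # nominal double-precision peak, 16 cores (from the text)
PEAK_GBS = 38.0         # nominal sustained DRAM bandwidth (from the text)

kernels = {             # AI in FLOP/byte, as computed by roofline.py's formulas
    "dot": 0.125,
    "a*b": 0.042,
    "matmul1k": 85.3,
}
for name, ai in kernels.items():
    bound = min(PEAK_GFLOPS, ai * PEAK_GBS)
    regime = "bandwidth-bound" if ai * PEAK_GBS < PEAK_GFLOPS else "compute-bound"
    print(f"{name:8s}  AI={ai:7.3f}  bound ~{bound:7.1f} GFLOPS  {regime}")
# dot and a*b measure within a few percent of their bounds; matmul1k sits well
# below its 2000 GFLOPS bound because the bound is a ceiling, not a BLAS forecast.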

What lifts the bandwidth ceiling — DDR5, HBM, and CXL

Three technologies have made the bandwidth ceiling less crushing in 2024-2026 hardware. Knowing which one you are running on changes the ceiling number you should plan against.

DDR5 doubles the DDR4 transfer rate (3200 → 4800-6400 MT/s) and splits each 64-bit channel into two independent 32-bit subchannels, halving the bank conflict probability under random access. Sapphire Rapids and Sierra Forest with 8 channels of DDR5-4800 hit ~75 GB/s per socket sustained — roughly 2x DDR4-3200 at the same channel count. Genoa and Bergamo with 12 channels of DDR5-4800 reach ~110-130 GB/s per socket. For Indian cloud workloads the migration is happening on AWS m7i (Sapphire Rapids), c7g (Graviton 3 with DDR5), and Azure's Mv3 series. If you are bandwidth-bound on a c6i and considering more cores, switching to a c7i with DDR5 likely buys you more headroom than the core count increase alone would.

HBM (High Bandwidth Memory) stacks DRAM die vertically and connects them to the processor with a 1024-bit silicon interposer rather than a 64-bit motherboard channel. HBM3 delivers ~819 GB/s per stack, and chips like Intel's Xeon Max (4 stacks, 64 GB HBM, optionally with DDR5 alongside) hit ~1.2 TB/s aggregate. NVIDIA's H100 and AMD's MI300 use HBM exclusively. The cost is capacity — HBM tops out at ~128 GB per package, vs multiple TB for DDR5 — and price (~5-10x per GB). HBM is the right answer when your working set fits in 64 GB and bandwidth is the bottleneck; for analytics workloads at Flipkart's scale (catalogue ranking, recommender embeddings) this is exactly the profile, which is why ML inference fleets have moved en masse to HBM-equipped GPUs and accelerators.

CXL (Compute Express Link) does the opposite — it adds bandwidth via memory-pooling expansion: PCIe-attached DRAM modules accessible at ~30-50 GB/s per card. CXL 2.0 lets multiple servers share a pool of memory at ~150 ns access latency (vs ~75 ns local DRAM). The bandwidth per socket increases, but the latency per access also increases. CXL is the right answer when capacity matters more than per-access latency — large in-memory databases, Spark shuffles, sparse-graph workloads. It does not help a streaming workload that already saturates local channels; it helps a workload that needs more memory than fits on the motherboard.

Why these three are not interchangeable: they trade off bandwidth, latency, and capacity along different axes. DDR5 keeps the latency profile of DDR4 (~80 ns) and improves bandwidth ~2x. HBM gives ~10-20x bandwidth at slightly higher latency (~95 ns) and severely reduced capacity. CXL gives ~3-5x effective bandwidth via pooling at much higher latency (~150-300 ns) but effectively unbounded capacity. The right pick depends on whether your kernel's working set fits in HBM, whether your latency budget tolerates CXL, and whether the compute ceiling above you is high enough that more bandwidth even matters. A workload sitting at 40 GB/s on DDR4 whose kernels are already above the ridge (50+ FLOPs per byte) will not be helped by HBM — the cores, not the wire, are already the bottleneck, and a wider pipe delivers bytes they cannot consume.
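One way to make the three-way trade-off concrete is a rough triage function. The thresholds below are illustrative, lifted from the figures quoted in this section; treat it as a sketch of the reasoning, not a procurement tool:

# Illustrative memory-technology triage based on the trade-offs described above
def suggest_memory_tier(working_set_gb, ai_flop_per_byte, latency_budget_ns,
                        ridge_flop_per_byte=53):
    if ai_flop_per_byte >= ridge_flop_per_byte:
        return "compute-bound: more bandwidth will not help; add cores or FLOPs"
    if working_set_gb <= 64:
        return "HBM: working set fits in-package and the kernel is bandwidth-bound"
    if latency_budget_ns >= 300 and working_set_gb > 1024:
        return "CXL pooling: capacity-bound, latency budget tolerates ~150-300 ns"
    return "DDR5 (more/faster channels): bandwidth-bound, working set too big for HBM"

print(suggest_memory_tier(12, 0.125, 100))    # embedding scoring: HBM
print(suggest_memory_tier(2048, 2.0, 500))    # huge in-memory join: CXL
print(suggest_memory_tier(400, 0.5, 100))     # streaming analytics: DDR5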

Going deeper

The exact mechanics of a DRAM read — row activate, column access, refresh

A DRAM access is not a single operation; it is a sequence. The controller issues an ACT (activate) command which copies the target row into the bank's sense amplifiers (~15 ns latency, called tRCD). Then it issues a RD (read) or WR (write) for the target column (~10 ns, CAS latency). Subsequent accesses to the same row are fast (~1 ns, tCCD). When a different row in the same bank is needed, the current row must be PRE (precharged) back to the array (~15 ns, tRP) before activation. Modern controllers exploit bank parallelism — DDR4 has 16 banks per chip, DDR5 has 32 — by interleaving accesses across banks so one row can be active while another is being precharged.

The implication: random-access patterns destroy DRAM throughput. Sequential streaming keeps hitting the same open row for a few kilobytes of contiguous data (one DRAM row) before a precharge is needed, achieving near-peak channel utilisation. Random access at cache-line granularity hits a different row almost every time, paying tRCD + tCAS + tRP ≈ 40 ns per access — a 4-8x slowdown versus sequential. The STREAM benchmark numbers you see published assume sequential streaming; a random-access workload will see roughly a quarter of those numbers on the same hardware.

Why this matters for scan-heavy Indian fintech workloads: Razorpay's reconciliation jobs read transaction tables sequentially (good), but their join operations against a customer table do random index lookups (bad). The reconciliation phase saturates DDR4-3200 at ~38 GB/s per socket; the join phase achieves ~9 GB/s on the same socket. The bandwidth ceiling depends on the access pattern, not just the hardware spec sheet.
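A back-of-the-envelope model of that gap, using the nominal timings above (tRCD ≈ 15 ns, CAS ≈ 10 ns, tRP ≈ 15 ns) and an assumed 8 KB row. It deliberately ignores bank parallelism, which is exactly what a pathological random pattern defeats, so treat the absolute numbers as illustrative:

# Illustrative sequential-vs-random DRAM cost model from the timings above
T_RCD, T_CAS, T_RP = 15e-9, 10e-9, 15e-9   # nominal DDR4 timings (seconds)
LINE, ROW = 64, 8192                        # cache line, assumed row span (bytes)

# Sequential: pay row open/close once per row, then stream line after line
lines_per_row = ROW // LINE
t_seq_per_line = (T_RCD + T_RP) / lines_per_row + T_CAS
# Random: almost every line lands in a different (closed) row
t_rand_per_line = T_RCD + T_CAS + T_RP

print(f"sequential: {LINE / t_seq_per_line / 1e9:5.1f} GB/s per bank stream")
print(f"random    : {LINE / t_rand_per_line / 1e9:5.1f} GB/s per bank stream")
print(f"ratio     : {t_rand_per_line / t_seq_per_line:.1f}x slower")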

NUMA + bandwidth — the cross-socket penalty

On a 2-socket server, each socket has its own memory controllers and channels. A thread on socket 0 reading memory allocated on socket 1 traverses the inter-socket link (UPI on Intel, Infinity Fabric on AMD). The bandwidth of UPI is ~30-50 GB/s per direction, less than a single socket's local DRAM. The latency is ~140-200 ns vs ~80 ns local. A cross-socket workload thus roughly halves its bandwidth and doubles its latency — and for a memory-bound kernel, either one alone is enough to cut throughput in half.

The fix is numactl --cpunodebind=0 --membind=0 to pin both threads and memory to the same socket, or to use numactl --interleave=all to spread allocations evenly so cross-socket traffic balances. The Linux automatic NUMA balancer (numa_balancing=1) tries to migrate pages closer to the threads using them, but the migration cost is ~2-5 µs per page and only helps if access patterns are stable enough for migration to pay off. For latency-sensitive Indian payment-processing workloads (Paytm, PhonePe), the routine is to pin the per-NUMA worker pool with numactl at startup and never touch the auto-balancer; the predictable pinned latency beats the variable migrated latency every time.
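A quick way to see the cross-socket penalty with the harness from earlier in the chapter is to run it pinned local and pinned remote and compare. This sketch assumes a 2-socket host with numactl installed and bandwidth_ceiling.py in the current directory; the size of the gap depends on the UPI / Infinity Fabric generation.

# numa_compare.py — run the bandwidth harness local vs cross-socket (sketch)
# Assumes a 2-socket host, numactl installed, bandwidth_ceiling.py in cwd.
import subprocess

def run(label, numactl_args):
    cmd = ["numactl", *numactl_args, "python3", "bandwidth_ceiling.py", "8"]
    print(f"--- {label}: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

# Threads and memory both on node 0: local-DRAM bandwidth
run("local ", ["--cpunodebind=0", "--membind=0"])
# Threads on node 0, memory forced onto node 1: every load crosses the socket link
run("remote", ["--cpunodebind=0", "--membind=1"])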

Software prefetching and the limits of speculation

The hardware prefetcher detects sequential and stride-N access patterns and issues loads ~16-64 cache lines ahead of the CPU's actual demand. When it works, the load completes before the demand-load issues, hiding DRAM latency entirely. When it fails — pointer-chasing, irregular strides, indirect addressing — every load stalls for the full DRAM round trip.

__builtin_prefetch (GCC/Clang) and _mm_prefetch (Intel intrinsics) let software issue prefetches manually. Used well, they hide 50-80 ns of DRAM latency on every load — turning a memory-bound kernel into a compute-bound one if the arithmetic intensity allows. Used poorly, they waste bandwidth fetching lines that are evicted before use, making throughput worse. The rule of thumb: prefetch ~10 iterations ahead, and only where the hardware prefetcher cannot predict the pattern (pointer chasing, indirect indexing) but the software can. Profile with perf stat -e l1d.replacement,l2_lines_in.all before and after to confirm the prefetches are helping. Zerodha Kite's order-book traversal uses manual prefetching when walking the price-level linked list — each level's next pointer is prefetched 4 levels ahead, which dropped median traversal time from 1.8 µs to 0.7 µs during the 9:15 IST market-open burst.

CXL memory and the bandwidth-via-tier hack

CXL (Compute Express Link) lets you attach memory expansion modules over PCIe Gen5, accessible to the CPU as cache-coherent memory at ~150 ns latency vs ~75 ns local DRAM. The bandwidth per CXL card is ~30-50 GB/s (PCIe Gen5 x16). For a server with 4 CXL cards, that is 120-200 GB/s of additional bandwidth on top of the local DDR5 — at the cost of higher latency on accesses that hit CXL.

The architectural pattern: hot data on local DRAM, warm data on CXL, cold data on SSD. The OS or the application places pages by access frequency. For in-memory databases and Spark workloads at Indian unicorns, this is the path forward when the working set exceeds 1 TB and DDR5 capacity is the binding constraint. Hotstar's video-segment cache (the in-memory store of which video segments are hot at any given moment, used for CDN cache decisions) was a candidate for CXL in 2025 — the working set is ~600 GB at peak IPL traffic, and the access pattern is read-heavy with predictable hot keys. The transition added 50 ns of average latency on cache reads (acceptable for the use case) but tripled the per-server cache capacity, halving the CDN cache miss rate.
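The placement policy itself can start as something as simple as bucketing keys by recent access frequency. A toy sketch of the idea (tier names and thresholds are illustrative, not Hotstar's implementation):

# Illustrative hot/warm/cold placement by access frequency (not a real allocator)
from collections import Counter

def place(access_log, hot_fraction=0.1, warm_fraction=0.3):
    """Return {key: tier} given a list of accessed keys, hottest first."""
    ranked = [k for k, _ in Counter(access_log).most_common()]
    n_hot = max(1, int(len(ranked) * hot_fraction))
    n_warm = max(1, int(len(ranked) * warm_fraction))
    tiers = {}
    for i, key in enumerate(ranked):
        if i < n_hot:
            tiers[key] = "local DRAM"   # lowest latency, scarce
        elif i < n_hot + n_warm:
            tiers[key] = "CXL"          # ~2x latency, abundant
        else:
            tiers[key] = "SSD"          # microseconds, effectively unlimited
    return tiers

print(place(["seg1"] * 50 + ["seg2"] * 10 + ["seg3"] * 2 + ["seg4"]))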

The bandwidth-bound parallel pattern — sub-linear is the new linear

The mental shift required: when you parallelise a bandwidth-bound kernel, stop expecting linear speedup past the channel ceiling. The correct expectation is "speedup until the ceiling, flat afterwards". A kernel doing 4 GB/s per core on an 8-channel DDR4 socket (38 GB/s ceiling) can productively use ~9 cores in parallel; cores 10-32 add no aggregate throughput. The right architectural response is to over-subscribe with cheaper compute — run multiple bandwidth-bound jobs concurrently on the same socket so they share the channels (each job gets less, but cumulative throughput stays at the ceiling) and use the saved cores for compute-bound jobs that do benefit from parallelism. This is why analytics clusters at Flipkart and Razorpay run mixed workloads on each node rather than pure single-tenant scheduling — the bandwidth-bound joins and the compute-bound regressions complete each other on the same hardware.

Bandwidth war stories from Indian production

Flipkart's catalogue ranker (BBD 2024 prep). The product-scoring service ran 16-thread feature-vector dot products against an 80 GB embedding store. Single-thread bandwidth was 12.1 GB/s; aggregate at 16 threads stalled at 38.4 GB/s — exactly the c6i.4xlarge DDR4-3200 ceiling. The team's first reflex was "more cores" — they tried c6i.8xlarge (32 vCPU). Aggregate stayed at 38.4 GB/s; per-core dropped to 1.2 GB/s. The fix was structural: quantise embeddings from float32 to int8, dropping bytes-per-dot-product 4x. The kernel arithmetic intensity went from 0.125 to 0.5 FLOP/byte, the working set fit in L3 for the top-1M products, and aggregate throughput jumped from 38 GB/s to 84 GB/s (most of the new traffic served from L3 with much higher bandwidth ceiling). The 16-vCPU instance now does what 64-vCPU instances were targeted to do.
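The quantisation lever is easy to demonstrate with the same numpy streaming trick used by the harness: the same element count in int8 moves a quarter of the bytes of float32, so a memory-bound scan speeds up accordingly. An illustrative sketch (real int8 scoring also needs a scale factor and a proper dot-product kernel, which this omits, and the exact ratio depends on how well the int8 reduction vectorises):

# Illustrative: int8 vs float32 streaming over the same element count
import numpy as np, time

N = 64 * 1024 * 1024                            # 64M elements, well past any LLC
rng = np.random.default_rng(0)
f32 = rng.random(N, dtype=np.float32)           # 256 MB
i8 = (f32 * 127).astype(np.int8)                # 64 MB, same element count

for name, arr in (("float32", f32), ("int8", i8)):
    t0 = time.perf_counter()
    s = float(arr.sum())                        # streaming reduction forces the read
    dt = time.perf_counter() - t0
    print(f"{name:7s} {arr.nbytes / 2**20:4.0f} MB  "
          f"{arr.nbytes / dt / 1e9:5.1f} GB/s  wall {dt * 1e3:6.1f} ms  (checksum {s:.0f})")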

Zerodha Kite's tick-replay harness (2024). The post-trade analytics pipeline replays a day's tick stream — ~120 GB of compressed Level-2 order book updates — through a series of strategy backtests. The decompression and parse stages were bandwidth-bound at 36 GB/s on c6i.4xlarge. Adding 8 backtests in parallel on the same node should have multiplied throughput; instead aggregate stayed at ~37 GB/s and each backtest took 8x longer than running alone. The diagnostic was unambiguous in pcm-memory — every channel near peak. Fix: pipeline the decompression stage with the backtest stages so the decompressed buffer is kept hot in L3 and consumed by all backtests from cache rather than re-fetched from DRAM. Aggregate effective throughput rose to 180 GB/s of post-decompression bandwidth (most served from L3); wall time for 8-strategy backtests dropped from 4 hours to 35 minutes.

Hotstar's chunk-encoder fleet (IPL 2025). The video-chunk re-encoder reads ~5 MB chunks from S3 and re-encodes them at lower bitrates. At peak IPL traffic the encoder fleet was memory-bandwidth-bound on c6i instances — encoding throughput plateaued at 42 GB/s aggregate per node despite 32 vCPUs of compute headroom. Switching to c7i (Sapphire Rapids + DDR5) raised the ceiling to 78 GB/s aggregate; the same 32 cores now did 1.85x the work at the same instance cost. The fix was a hardware migration rather than a code change — the workload's arithmetic intensity was already as high as the encoder format allowed, so the only lever was a bigger pipe.

The pattern across all three: bandwidth saturation is not a code bug, it is an architectural mismatch between kernel arithmetic intensity and hardware machine balance. The fix is to either reduce bytes-per-useful-op (quantisation, compression, pipeline reuse) or to raise the hardware ceiling (DDR5, HBM, CXL). Adding cores never helps once the channels are full.

A note on cloud-instance pricing and the bandwidth-per-rupee axis

The cloud pricing optimisation that follows from this chapter is non-obvious. AWS prices c6i.4xlarge at roughly ₹35/hour in ap-south-1; c7i.4xlarge (DDR5) at ₹42/hour. For a bandwidth-bound workload, the c7i's 75 GB/s ceiling vs the c6i's 38 GB/s is ~2x more useful throughput at 1.2x the price — a 67% improvement in bandwidth-per-rupee. For a compute-bound workload (matmul, image filtering), the same instance pair is roughly equal in performance because the compute ceiling is similar; you pay 20% more for nothing.

The skill is knowing which axis your workload sits on before you size the fleet. The roofline calculation in this chapter gives you that answer in thirty seconds: compute arithmetic intensity, look up the machine balance of the candidate instance type, and pick whichever instance has the higher ceiling on the resource your workload saturates. Indian platform teams that do this calculation as a matter of course (rather than benchmarking each new instance type ad hoc) routinely beat their fleet budgets by 30-40% on bandwidth-bound analytics workloads — and the calculation costs nothing.
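That thirty-second calculation, as a sketch. The figures are this chapter's nominal ones for the c6i/c7i pair and the approximate prices quoted above; treat them as placeholders, not a price sheet:

# Illustrative roofline-based instance pick (this chapter's nominal figures)
instances = {
    # name: sustained GB/s, peak GFLOPS (similar for both), approx rupees/hour
    "c6i.4xlarge": {"bw": 38.0, "flops": 2000.0, "price": 35.0},
    "c7i.4xlarge": {"bw": 75.0, "flops": 2000.0, "price": 42.0},
}

def pick(ai):
    """Return the instance with the best attainable GFLOPS per rupee-hour at this AI."""
    best = None
    for name, spec in instances.items():
        attainable = min(spec["flops"], ai * spec["bw"])   # roofline bound
        value = attainable / spec["price"]
        print(f"  {name}: bound ~{attainable:7.1f} GFLOPS, {value:6.2f} GFLOPS per rupee-hr")
        if best is None or value > best[1]:
            best = (name, value)
    return best[0]

for label, ai in (("bandwidth-bound kernel (AI=0.125)", 0.125),
                  ("compute-bound kernel (AI=80)", 80.0)):
    print(label)
    print("  pick:", pick(ai))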

Where this leads next

The next chapter (/wiki/heterogeneous-computing-and-the-end-of-symmetric-multiprocessing) covers the third major scaling regime: heterogeneous topologies — performance cores, efficiency cores, accelerators, GPUs — where the cost ladder gains a third rung and the choice "which core do I run this on" becomes a first-class scheduling decision.

The recurring pattern across Part 9 (chapters 60-66): every "scaling does not work past N=X" investigation lands on a different physical resource — serial fraction, coherence interconnect, memory bandwidth, heterogeneous topology — but the workflow is the same. Measure the curve, identify the dominant term, classify the bottleneck physically, fix it architecturally. Adding cores is the engineering response to none of these problems; finding which physical resource is saturated and changing the workload to use it less is the response to all of them.

The two operational habits this chapter adds to the Part 9 toolkit. First, run pcm-memory on every "scaling does not work" investigation alongside perf c2c — the two tools answer different questions (bandwidth vs coherence) and you need both to know which physical resource is the binding constraint. Second, calculate arithmetic intensity for every hot kernel before deciding whether parallelism is the right fix — a kernel at AI < 1 FLOP/byte is bandwidth-bound and parallelism is bounded by the channel ceiling; a kernel at AI > 50 FLOP/byte is compute-bound and parallelism scales until cores run out. The roofline calculation takes thirty seconds and replaces months of fruitless core-count tuning. See /wiki/coherence-traffic-as-a-hidden-ceiling for the coherence-side companion to this chapter and /wiki/the-serial-fraction-problem for the Amdahl-side foundation.

Reproducibility footer

Both harnesses run on any Linux box with Python 3.11+ and numpy. The bandwidth ceiling is visible from N=2 onward; full saturation needs N >= channels, so use a 16+ vCPU instance for the sharpest curve. pcm-memory requires Intel hardware and the intel-pcm package; AMD users can substitute amd_uprof_cli.

# Reproduce this on your laptop, ~60 s for both harnesses
sudo apt install linux-tools-common linux-tools-generic intel-pcm
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 bandwidth_ceiling.py 1 2 4 8 16
python3 roofline.py
# To see DRAM controller saturation, run the harness in one terminal and pcm-memory in another:
sudo pcm-memory 1 &
python3 bandwidth_ceiling.py 16
