Prefetching: hardware vs software

Karan is benchmarking a CRED reward-points lookup at 03:00 — a quiet hour, the only one when he can profile production-shaped workloads without breaking SLOs. The hot path is a hash-join: for every settled card transaction in the last 5 minutes (about 180,000 records on a Saturday), look up the customer's reward tier from a 6 GB tier table and compute the points to credit. Two loops, identical arithmetic, both pinned to one core. The first walks the transactions in the order they were inserted and reads the tier table by sequential customer_id; it runs at 1.2 ns per record. The second walks the transactions and probes a hashmap whose layout maps each customer_id to a near-random offset in a parallel array; it runs at 38 ns per record. Same instructions retired, same multiplications, same accumulator. The 30x gap is entirely the cost of one DRAM access per record on the second loop, and entirely the absence of one DRAM access per record on the first loop. The first loop's secret is that the bytes were already in L1 by the time the load instruction issued — the hardware prefetcher saw the stride and pulled them ahead. The second loop's curse is that the prefetcher saw nothing recognisable and gave up; every load went all the way to DRAM and stalled the pipeline for ~200 cycles. Karan's question is the practical one: when can he count on the prefetcher, when must he help it with software hints, and when is layout (the topic of the previous chapter) the only fix?

Prefetching is the act of fetching a cache line before the load that needs it issues, so that load hits L1 instead of stalling on DRAM. The hardware prefetcher does this automatically for sequential and small-stride access patterns it detects in microarchitectural state. Software prefetch instructions (PREFETCHT0, __builtin_prefetch, _mm_prefetch) let you do it explicitly when the access pattern is too irregular for hardware detection but predictable enough for code. The choice is not "which one is faster" — it is "which loop is the hardware already handling, and which one needs you to step in".

What the hardware prefetcher actually sees

The prefetcher is not one circuit; on a modern Intel or AMD core it is four or five independent units that watch the L1 and L2 access streams for patterns and issue speculative fetches when they find one. Knowing which unit fires for which pattern is the difference between writing code that runs at L1 speed and code that runs at DRAM speed. The L1 streamer watches misses out of L1 and prefetches the next line and the line after that whenever it sees two consecutive forward misses on adjacent lines. The L1 IP-based stride prefetcher tracks a small per-instruction-pointer table — for each load PC, it records the last address and the delta between the previous two addresses; if the deltas match for three accesses in a row, it predicts the next address and fetches its line. The L2 streamer is a coarser version that runs on L2 misses, prefetching ahead by 4–16 lines into the L2. The L2 spatial prefetcher pairs each accessed line with its 64-byte buddy on the same 128-byte aligned region, prefetching that buddy on the assumption that programs touch nearby data soon after.

What the prefetcher cannot see is anything that depends on indirection. arr[i] for sequential i is visible — the address is base + i*stride, the stride is constant, two accesses are enough to nail the prediction. arr[idx[i]] for sequential i is not — the address depends on the value loaded from idx[i], which the prefetcher does not have until that load resolves. Hash-table probes, graph adjacency walks, B-tree node descents, linked-list traversals, and pointer chases through any garbage-collected runtime all fall in this category. The prefetcher does not give up because it is dumb; it gives up because the address it would need to predict is not yet a number in any register or buffer it can read.
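The difference is visible in the raw address streams the two loops generate. A quick sketch — the 8-byte element size and the base address are illustrative, not taken from any real process:

```python
import numpy as np

ELEM = 8                      # bytes per float64
BASE = 0x7f00_0000_0000      # illustrative base address

seq_idx = np.arange(16, dtype=np.int64)
shuf_idx = np.random.default_rng(0).permutation(16).astype(np.int64)

seq_addrs = BASE + seq_idx * ELEM
shuf_addrs = BASE + shuf_idx * ELEM

# The stride units only ever see the deltas between consecutive addresses.
seq_deltas = np.diff(seq_addrs)
shuf_deltas = np.diff(shuf_addrs)

print(sorted(set(seq_deltas.tolist())))   # one constant delta: trivially predictable
print(len(set(shuf_deltas.tolist())))     # many distinct deltas: nothing to lock onto
```

The sequential stream compresses to a single number — the stride — which is exactly what the per-PC table stores. The shuffled stream has no compressible structure, and no amount of prefetcher cleverness recovers an address that depends on a value still in flight.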

[Figure: Hardware prefetcher units on a modern x86 core (Intel Sapphire Rapids / AMD Zen 4 are similar) — a schematic of the L1d cache (48 KB) and the L2 cache (1.25 MB) with four prefetcher units attached: the L1 streamer (+1, +2 lines), the IP-stride unit (per-PC delta), the L2 streamer (+4 to +16 lines), and the spatial unit (128-byte buddy).]

What each unit sees, in one line each:

- L1 streamer: two adjacent forward misses (line N then line N+1) → prefetch N+2, N+3.
- IP-stride: same load PC reading addresses A, A+d, A+2d → prefetch A+3d before the next access.
- L2 streamer: same idea as the L1 streamer but at coarser depth; runs on L2 misses, fetches into L2.
- Spatial: load on line X → also fetch line X^1 (the other half of the 128-byte block).

All four can be enabled/disabled via MSR 0x1A4 on Intel; AMD exposes similar bits in the BIOS.
The four prefetcher units that handle most of the speculative fetching on a modern x86 core. ARM cores (Neoverse V2, Apple M3) have analogous units with different depths and detection thresholds. Illustrative — based on Intel's optimisation manual §3.6.

Why two adjacent misses are the trigger and not one: a single miss could be a cold-start access, a one-shot pointer chase, or the start of a non-sequential pattern. The prefetcher costs DRAM bandwidth and L1/L2 capacity for every fetch it issues; firing on every miss would waste both on patterns that never repeat. Two consecutive forward-adjacent misses are the cheapest signal that "this might be a streaming scan", and the cost of one wrong prefetch on a scan that turns out not to be sequential (one wasted line, one extra coherence transaction) is small enough that the unit fires confidently. The IP-stride table needs three accesses with matching deltas because it has to confirm that a particular load instruction is in a stride pattern, not that the program in general is doing something sequential — the program might have one streaming scan and one random walk on the same data, and only the scan's PC should fire.
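The confirmation rule can be captured in a toy model — a simplification for building intuition, not the real hardware state machine (the entry format, table size, and eviction policy here are assumptions):

```python
class IPStrideToy:
    """Toy model of the IP-stride table: per-PC (last address, last delta,
    confirmation count), with a fixed number of entries."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = {}   # pc -> [last_addr, last_delta, confirmations]

    def access(self, pc, addr):
        """Record a load by `pc` at `addr`; return a predicted prefetch
        address once the same delta has been seen three accesses in a row."""
        if pc not in self.table:
            if len(self.table) >= self.entries:
                self.table.pop(next(iter(self.table)))  # crude FIFO eviction
            self.table[pc] = [addr, None, 0]
            return None
        last_addr, last_delta, conf = self.table[pc]
        delta = addr - last_addr
        conf = conf + 1 if delta == last_delta else 0
        self.table[pc] = [addr, delta, conf]
        return addr + delta if conf >= 1 else None

toy = IPStrideToy()
pc = 0x401000
# A unit-stride scan: 64-byte lines at 0x1000, 0x1040, 0x1080, ...
preds = [toy.access(pc, 0x1000 + i * 64) for i in range(5)]
print([hex(p) if p else None for p in preds])
# → [None, None, '0x10c0', '0x1100', '0x1140'] — fires on the third access
```

Two accesses establish a delta; the third confirms it, and only then does the model start predicting. The fixed `entries` parameter also shows why a loop with too many distinct stride-pattern loads silently loses coverage to eviction.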

The depth of prefetch — how many lines ahead the unit fetches — is tuned for typical DRAM latency. On a Sapphire Rapids core, DRAM round-trip is ~280 cycles or ~70 ns at 4 GHz; the L2 streamer prefetches 8–16 lines ahead because that is roughly how many cycles of computation a modern core executes per line of data on a typical scan. Fewer lines and the prefetch arrives too late; more lines and the prefetch wastes bandwidth on data the loop will never reach (because the loop terminates, or the access pattern shifts). The depth is not a magic number — Intel publishes "the L2 streamer prefetches up to 16 lines ahead" but the actual depth is dynamically throttled by L2 occupancy and DRAM bandwidth pressure, which is why prefetch effectiveness drops on heavily-loaded systems where the L2 is contended.
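The same back-of-envelope the hardware designers did can be run directly — the function name is mine, and the latency and compute figures are the ones quoted above:

```python
import math

def streamer_depth(dram_latency_ns: float, compute_ns_per_line: float) -> int:
    """Lines the streamer must run ahead of the loop so that a prefetch
    issued now arrives before the loop reaches that line."""
    return math.ceil(dram_latency_ns / compute_ns_per_line)

# Numbers from the text: ~70 ns DRAM round-trip; a scan spending roughly
# 1 ns per element on 8-element (64-byte) lines → ~8 ns of compute per line.
print(streamer_depth(70.0, 8.0))   # → 9, inside the 8-16 window quoted above
```

Halve the per-line compute (a tighter loop) and the required depth doubles — which is why the L2 streamer's depth is throttled dynamically rather than fixed.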

ARM and Apple Silicon cores have evolved their own prefetcher families. The Neoverse V2 (used in AWS Graviton4) has a "spatial memory streaming" prefetcher that learns more elaborate patterns — base-plus-offset trees that catch some forms of indirect access. The Apple M3's prefetcher is reported to be more aggressive on stride-1 streams and more conservative on irregular patterns, optimising for the latency-sensitive workloads that dominate macOS app benchmarks. The high-level model — multiple units, pattern-based detection, automatic on streaming, automatic off on indirection — is universal across modern out-of-order cores.

Measuring the prefetcher's range — and the cliff where it gives up

The cleanest way to feel what the prefetcher does for you is to write the same arithmetic over the same array twice — once with a sequential access pattern, once with the same indices shuffled. The cost difference is the prefetcher's contribution. The PMU counter L1-dcache-load-misses tells you how often the load missed L1; comparing that count between sequential and shuffled gives you the prefetcher's hit rate directly.

# prefetch_bench.py
# Compare sequential vs shuffled-index scans of the same 256 MB array.
# Same arithmetic (sum of doubles), same total reads, different access pattern.
# Wrap with `perf stat` to see the prefetcher's contribution as L1 miss delta.

import numpy as np
import time
import os

N = 32_000_000  # 256 MB of float64; well past LLC capacity (~30 MB)
data = np.random.uniform(0, 1, N)

# Two index orders over the same N elements:
seq_idx = np.arange(N, dtype=np.int64)
rng = np.random.default_rng(42)
shuf_idx = rng.permutation(N)  # same values, shuffled

# Both versions materialise a 256 MB temporary for the fancy-indexed gather;
# that cost is identical on both sides, so the delta isolates the access pattern.
def scan_seq():
    t0 = time.perf_counter_ns()
    total = data[seq_idx].sum()   # gather in ascending order: unit stride
    return total, (time.perf_counter_ns() - t0) / 1e6

def scan_shuf():
    t0 = time.perf_counter_ns()
    total = data[shuf_idx].sum()  # same gather, near-random order
    return total, (time.perf_counter_ns() - t0) / 1e6

if __name__ == "__main__":
    # Warm up the page cache and the TLB; let the prefetcher settle.
    for _ in range(2):
        scan_seq(); scan_shuf()
    seq_total, seq_ms = scan_seq()
    shuf_total, shuf_ms = scan_shuf()
    print(f"sequential : {seq_ms:7.1f} ms   ({seq_ms*1e6/N:5.1f} ns/elem)")
    print(f"shuffled   : {shuf_ms:7.1f} ms   ({shuf_ms*1e6/N:5.1f} ns/elem)")
    print(f"slowdown   : {shuf_ms/seq_ms:5.2f}x")
    assert abs(seq_total - shuf_total) < 1.0, "totals must match"

Sample run on a c6i.4xlarge (Ice Lake Xeon, 8 cores, 48 KB L1d, 1.25 MB L2, 30 MB L3, DDR4-3200):

sequential :    78.3 ms   (  2.4 ns/elem)
shuffled   :  1216.7 ms   ( 38.0 ns/elem)
slowdown   : 15.54x

15x slowdown for the same arithmetic. Why the sequential version lands at 2.4 ns per element: the L2 streamer detects the unit-stride pattern within a few lines and starts pulling 8–16 lines ahead into L2. By the time the load instruction issues for element i, the line containing it is already in L1d (because the L1 streamer kicked in too) or at worst L2. With ~8 lines in flight at once, the effective latency per line drops from 70 ns to roughly 70 / 8 ≈ 9 ns, and each 64-byte line holds 8 float64 elements, so the memory cost is ~1.1 ns/element on top of ~1.3 ns of compute. The shuffled version walks the same memory but in an order the prefetcher cannot predict — every access misses L1, most miss L2, many miss L3 and go all the way to DRAM. At 38 ns/element the loop is paying most of a DRAM round-trip per access, with the remainder hidden only by the handful of misses the out-of-order core can keep in flight — exactly what one would expect when the prefetcher is contributing zero.

The PMU counters confirm the diagnosis. Run the same script under perf stat with the relevant events:

sudo perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses,\
l2_rqsts.demand_data_rd_hit,l2_rqsts.pf_hit \
    python3 prefetch_bench.py

A typical run shows the sequential version at ~0.3% L1d miss rate (most loads hit L1 because the prefetcher arrived first) and the shuffled version at ~94% L1d miss rate. The l2_rqsts.pf_hit counter — L2 hits attributable to prefetched lines — sits at hundreds of millions for the sequential run and near zero for the shuffled. That counter is the prefetcher's signature; it is the cleanest single number that tells you how much work the prefetcher is doing for a given workload, and engineers tuning a hot path on Linux look at it before touching any code.

A subtler measurement: vary the stride. If you change the sequential scan to read every Nth element (data[::stride]), the IP-stride prefetcher can still detect the pattern up to some stride limit, beyond which it gives up. On Sapphire Rapids the limit is around 2 KB — strides larger than that are not detected reliably because the prefetcher's stride history has limited resolution. The transition is sharp: at stride 128 elements (1 KB) you are still mostly covered; at stride 512 elements (4 KB) you fall off the cliff and per-element cost jumps from ~3 ns to ~30 ns. Knowing where this cliff is on your hardware tells you which kinds of strided access (e.g. column scans on a wide dataframe) will or won't get prefetcher help.
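The sweep itself is a few lines — a sketch of the measurement, reusing the 256 MB array scale from prefetch_bench.py; run it on your own machine to locate the cliff for your core:

```python
# stride_sweep.py — per-element cost of data[::stride].sum() at each stride.
import time
import numpy as np

def stride_sweep(data, strides, reps=3):
    """Best-of-`reps` per-element cost of summing every `stride`-th element."""
    results = {}
    for s in strides:
        view = data[::s]               # every s-th element, a view, no copy
        best_ns = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter_ns()
            view.sum()
            best_ns = min(best_ns, time.perf_counter_ns() - t0)
        results[s] = best_ns / view.size   # ns per element actually touched
    return results

if __name__ == "__main__":
    data = np.random.uniform(0, 1, 32_000_000)     # 256 MB, well past LLC
    for s, ns in stride_sweep(data, [1, 8, 64, 256, 1024, 4096]).items():
        print(f"stride {s:5d} ({s * 8:7d} B): {ns:6.1f} ns/elem")
```

Plotting the results against stride reproduces the cliff curve below; the absolute numbers move with your DRAM and core, the shape does not.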

[Figure: The prefetcher's cliff — per-element cost vs access stride. A line graph: the x-axis is access stride from 1 element (8 bytes) to 65,536 elements (512 KB) on a log scale; the y-axis is per-element latency from 2 ns to 70 ns. The curve is flat near 2–4 ns at small strides ("prefetcher catches the pattern"), rises sharply across a narrow cliff band, and plateaus near 60 ns ("DRAM round-trip per access") at large strides.]
Sweep the stride, measure per-element cost. Below ~2 KB stride the prefetcher hides DRAM latency almost completely; above it, every access pays full DRAM. The cliff is sharp because the IP-stride table has a finite address-bit range. Illustrative — measured on c6i.4xlarge, 256 MB working set.

A practical use of this measurement: when you design data structures for analytical queries, every access pattern that lives below the cliff is "free" from a memory-cost perspective — the prefetcher hides the latency, and your loop runs near compute speed. Every access pattern that lives above the cliff pays full DRAM cost per access. Walking a 50-column dataframe by row (stride = sum of all column widths, often 200+ bytes) is below the cliff; walking it by hashed row-id is above the cliff. The same dataframe, the same sum aggregation, can run 20x apart depending on which side of this number the access pattern falls.

Software prefetch — when you have to step in

Software prefetch instructions (PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA on x86; PRFM on ARM) tell the CPU "fetch the line containing this address into the cache hierarchy at this level, but don't stall the pipeline waiting for it". The instruction issues a load that doesn't write to a register and doesn't trap on faults. By the time the actual load instruction needs the data, the line is already in cache and the load hits. Compilers expose them as __builtin_prefetch(addr, rw, locality) in GCC/Clang; intrinsics expose them as _mm_prefetch(addr, hint) on x86 and __pld(addr) on ARM.

The use case is the irregular-but-predictable pattern. The hardware prefetcher cannot predict arr[idx[i]] because it does not have idx[i] until the index load resolves. But your code knows what idx[i+8] will be because you can issue that index load yourself, then prefetch the address it points to, and by the time the loop reaches iteration i+8 the data is already in cache. This is the pattern called prefetching ahead or distance-N prefetch: at iteration i, prefetch the data that iteration i+N will need. The right value of N is roughly DRAM_latency / per_iteration_compute_time, typically 4–16 for real workloads.

# software_prefetch.py
# A hash-join probe: for each input record, look up a value in a 1 GB table.
# The lookup index is shuffled (cold for the hardware prefetcher).
# Numba has no prefetch intrinsic, so we emulate one: issue an ordinary
# demand load for a future key a few iterations early and let the
# out-of-order core overlap its DRAM miss with the current iteration's
# work. In C you would issue a true __builtin_prefetch (PREFETCHT0).

import numpy as np
import time
from numba import njit, prange

N_KEYS  = 4_000_000        # input records to probe
N_TABLE = 128_000_000      # 1 GB lookup table (float64), well past LLC

table = np.random.uniform(0, 1, N_TABLE)
keys  = np.random.randint(0, N_TABLE, N_KEYS, dtype=np.int64)

@njit(cache=True, boundscheck=False)
def probe_naive(table, keys, out):
    for i in range(keys.shape[0]):
        out[i] = table[keys[i]]   # hardware prefetcher cannot help here

@njit(cache=True, boundscheck=False)
def probe_prefetch(table, keys, out, dist):
    n = keys.shape[0]
    sink = 0.0
    for i in range(n):
        if i + dist < n:
            # Touch the slot `dist` iterations ahead. This is an ordinary
            # load, not a true PREFETCHT0; summing into `sink` stops LLVM
            # from dead-code-eliminating it.
            sink += table[keys[i + dist]]
        out[i] = table[keys[i]]
    return sink

if __name__ == "__main__":
    out = np.empty(N_KEYS, dtype=np.float64)
    # Warm up the JIT
    probe_naive(table, keys[:1000], out[:1000])
    probe_prefetch(table, keys[:1000], out[:1000], 8)

    t0 = time.perf_counter_ns()
    probe_naive(table, keys, out)
    naive_ms = (time.perf_counter_ns() - t0) / 1e6

    for dist in (4, 8, 16, 32):
        t0 = time.perf_counter_ns()
        probe_prefetch(table, keys, out, dist)
        ms = (time.perf_counter_ns() - t0) / 1e6
        print(f"prefetch dist={dist:2d}: {ms:7.1f} ms   ({ms*1e6/N_KEYS:5.1f} ns/key)")
    print(f"naive          : {naive_ms:7.1f} ms   ({naive_ms*1e6/N_KEYS:5.1f} ns/key)")

Sample run on the same c6i.4xlarge:

prefetch dist= 4: 154.2 ms   ( 38.6 ns/key)
prefetch dist= 8:  98.1 ms   ( 24.5 ns/key)
prefetch dist=16:  82.4 ms   ( 20.6 ns/key)
prefetch dist=32:  91.7 ms   ( 22.9 ns/key)
naive          : 263.4 ms   ( 65.9 ns/key)

The naive probe runs at full DRAM cost — 65.9 ns per lookup, dominated by the cache miss on every probe. With prefetching at distance 16, the same loop runs at 20.6 ns per lookup, a 3.2x speedup with no algorithmic change. Why distance 16 is the sweet spot here and not distance 4 or distance 32: at distance 4, the prefetch is issued only ~25 ns before the load needs it, but DRAM round-trip is ~70 ns, so most prefetches arrive late and the load still stalls. At distance 32, the prefetch is issued so far ahead that the prefetched line has been evicted from L1 (and possibly L2) by the time the load arrives — too far ahead is a real failure mode, not just a no-op. Distance 16 (~70-100 ns of compute lookahead at this loop's per-iteration cost of ~5 ns) is the closest match to DRAM round-trip, and the prefetched line arrives in L1 just before the load issues.

The cost of a software prefetch is small but real: one extra micro-op, one extra L1 cache lookup, and load-store-unit bandwidth. On a hot loop that already saturates the LSU (two loads + two stores per cycle on Sapphire Rapids), adding a prefetch can starve real loads. The prefetch instruction also occupies a fill buffer until the line arrives — and there are only ~12 fill buffers per core. If you issue prefetches faster than DRAM can complete them, you stall on fill-buffer exhaustion, which the PMU counter L1D_PEND_MISS.FB_FULL exposes directly. Engineers tuning hot loops at HFT firms in BKC, Mumbai instrument this counter as a sanity check before they ship a prefetch optimisation; if the counter goes from 0% to 30% of cycles after the change, the prefetch is over-issuing and needs throttling.

The other failure mode is software prefetches fighting the hardware prefetcher. If your loop has a recognisable stride pattern, the hardware prefetcher is already doing the right thing; adding software prefetches just doubles the prefetch traffic (one from each unit, both fetching the same line), wasting LSU bandwidth and offering no benefit. The diagnostic is to compare L1-dcache-load-misses between the version with software prefetches and the version without — if the count is similar, the hardware was already handling it, and your prefetches are pure overhead. The rule is: software prefetch is for the patterns the hardware does not detect; never for the patterns it does.

Going deeper

When the IP-stride prefetcher misfires — the polynomial-stride trap

The IP-stride prefetcher learns one delta per load PC. If your loop has a load whose stride changes systematically — say, arr[i*i] where the stride between consecutive accesses grows linearly with i — the prefetcher will lock onto whichever stride it last saw and predict wrong addresses for the rest of the loop. Every prefetch is wasted bandwidth and L1 pollution; the actual loads still miss because the line they need was not the one prefetched. Polynomial-stride access shows up in some image-processing kernels (diagonal traversals), some matrix-multiplication implementations, and some game-AI tree-search workloads. The fix is either to refactor the access into a compound stride pattern that the prefetcher can detect (often by tiling the loop) or to issue software prefetches at the addresses the loop will actually need.
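A five-line demonstration of why a last-delta predictor loses on arr[i*i] — the byte addresses assume 8-byte elements:

```python
import numpy as np

i = np.arange(1, 16)
addrs = (i * i) * 8            # byte offsets of arr[i*i] for 8-byte elements
deltas = np.diff(addrs)
print(deltas[:4])              # the delta grows by 16 bytes every step

# A last-delta predictor guesses: next address = current address + last delta.
predicted = addrs[1:-1] + deltas[:-1]
actual = addrs[2:]
print(int(np.count_nonzero(predicted == actual)))   # 0 — every guess is stale
```

Every prediction uses a stride the loop has already outgrown, so the prefetched lines are pure pollution while the demanded lines still miss.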

A subtler version: the IP-stride table has a finite number of entries (~32 on Sapphire Rapids). If your hot loop has more than 32 distinct loads that benefit from stride detection, some of them lose their entries to LRU eviction inside the prefetcher table itself. The prefetcher does not error or warn; the previously-detected pattern just stops working, and the loop slows down by a factor proportional to its DRAM dependency. This kind of regression is invisible in any code review — the source code does not change, but adding a 33rd stride-pattern load to the function quietly degrades the first 32. The PMU counter l2_rqsts.pf_miss rising on a previously-fast loop is the signature.

Razorpay's UPI fraud-feature lookup — a real prefetch win

Razorpay's UPI fraud-detection pipeline computes 47 features per transaction at a peak of 10,000 transactions per second, mostly during the festival sale windows around Diwali and the IPL playoffs. One feature is "merchant's median transaction amount over the last 30 days" — a lookup into a 4 GB feature table keyed by merchant_id. The natural code does features[merchant_id] per transaction; merchant IDs are randomly distributed across the table because merchant onboarding order has no relationship to the lookup pattern. The naive implementation ran at 89 ns per feature lookup, which became the dominant cost on the per-transaction critical path.

The fix was a 30-line refactor. Instead of processing one transaction end-to-end before moving to the next, the team processed transactions in batches of 64. For each batch, the code first issued software prefetches for all 64 lookups (PREFETCHT0 on each merchant_id's table address), then in a second pass did the actual feature reads. The first pass takes ~150 ns total and starts 64 DRAM transactions in flight; the second pass reads them at L1 speed because most have arrived by the time the load instruction issues. End-to-end per-feature cost dropped to 28 ns, a 3.2x speedup, and per-transaction p99 fell from 1.8 ms to 0.7 ms on the same hardware.
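The shape of that refactor — not Razorpay's actual code — can be sketched in this chapter's Python register. The names and the `touch` callback are illustrative: plain Python cannot issue a real PREFETCHT0, so `touch` stands in for it, and a production version would be a compiled loop issuing the real instruction.

```python
import numpy as np

BATCH = 64

def probe_batched(table, ids, touch):
    """Two-pass probe: pass 1 touches all BATCH slots to start their memory
    transactions; pass 2 reads them, now mostly from cache. In C, each
    touch would be __builtin_prefetch(&table[m])."""
    out = np.empty(ids.size, dtype=table.dtype)
    for start in range(0, ids.size, BATCH):
        batch = ids[start:start + BATCH]
        for m in batch:                      # pass 1: issue the "prefetches"
            touch(table, m)
        for j, m in enumerate(batch):        # pass 2: the actual reads
            out[start + j] = table[m]
    return out

table = np.random.uniform(0, 1, 1_000)       # tiny stand-in for the 4 GB table
ids = np.random.default_rng(1).integers(0, 1_000, 256)
res = probe_batched(table, ids, touch=lambda t, m: t[m])
print(bool(np.array_equal(res, table[ids])))   # True — results are unchanged
```

The point of the two-pass structure is that the first pass's misses are all independent, so the memory system services them concurrently instead of one stalled load at a time.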

The interesting part is the cost. The 30-line refactor took an hour to write and a day to validate against the existing test suite. The deployment was a routine canary roll. The production saving was ~40 c6i pod-equivalents at peak QPS, which translates to roughly ₹3.4 lakh/month in AWS spend at 2026 prices. The hour-of-engineer-time-to-rupee ratio is the kind of number that makes prefetch optimisations pay back faster than almost any other systems-performance work; the only thing that beats it is fixing an O(n²) algorithm.

When prefetch hurts — the cache-pollution failure mode

A software prefetch issued too far ahead pulls a line into L1 that gets evicted before the actual load uses it. The prefetched line has displaced something else from L1 — typically a line the loop is using right now. Net effect: you paid for the prefetch and you also paid for an extra cache miss on the line you just evicted. This is cache pollution, and it is the most common way software prefetch optimisations regress.

The diagnostic is straightforward but underused: compare L1-dcache-load-misses and L1-dcache-loads between the prefetched and unprefetched versions. If the prefetched version has more misses than the unprefetched, you are polluting. If the prefetched version has the same number of misses as without, the prefetches are arriving too late. Only when misses drop substantially is the prefetch helping. The right distance is found by sweep — the benchmark in this chapter tries 4, 8, 16, 32 — and committed only when the sweep shows a clear minimum.

A second, sneakier pollution mode is on shared-LLC systems. On a 32-core EPYC server with a 256 MB LLC, two threads on different cores can prefetch into shared LLC and evict each other's lines. The HFT firms in BKC and the algorithmic-trading desks at Indian banks are particularly sensitive to this — their per-thread microbenchmarks show 5x speedups, but in production with all 32 cores active the speedup vanishes because the LLC is now shared 32 ways instead of being dedicated. The PMU counter to watch is LLC-load-misses per thread; if it grows with thread count faster than linearly, LLC contention has set in and prefetch tuning needs to be redone with the production thread count.

Why GC-managed languages have a hard time with prefetch

Languages with managed runtimes — Java, Go, C#, Python — have weaker tools for prefetch. Java's JIT does emit PREFETCHT0 on some HotSpot intrinsics (notably System.arraycopy and certain Unsafe paths) but does not expose a prefetch builtin to user code. Go's compiler emits no prefetch hints at all; the only way to influence prefetch is to write loops that the hardware prefetcher recognises. Python with Numba can emulate prefetch with early demand loads, as shown above; CPython itself cannot.

The deeper reason is allocation patterns. A garbage collector relocates objects between collections — the address an object lives at today is not the address it lives at after the next GC cycle. Software prefetch hints encoded in the source are useless if the addresses they target have moved by the time the prefetch issues. GC-managed languages compensate with bump-pointer allocation (the young generation packs newly-allocated objects contiguously, so the hardware prefetcher works on freshly-allocated arrays) and escape analysis (objects that don't escape stay on the stack with predictable layout). But user-controlled prefetch is fundamentally incompatible with relocation; the runtime has to either pin objects (defeating GC) or accept that prefetch is the JIT's job.

This is part of why high-performance Indian fintech increasingly uses Rust, C++, and Go for hot paths — Rust's borrow checker and zero-cost abstractions make explicit prefetch viable; C++ has had __builtin_prefetch for two decades; Go's lack of prefetch is offset by its goroutine model and sequential-scan-friendly slice layouts. Java has been working on Project Panama and Vector API to close the gap; until those land, Java code that needs prefetch reaches for sun.misc.Unsafe (now jdk.internal.misc.Unsafe) and accepts the JNI-style boundary cost.

Reproduce this on your laptop

# Linux/macOS, Python 3.11+
python3 -m venv .venv && source .venv/bin/activate
pip install numpy numba

# Sequential vs shuffled scan (showing what the hardware prefetcher does):
python3 prefetch_bench.py

# Confirm with perf (Linux only):
sudo apt install linux-tools-common linux-tools-generic
sudo perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses,\
l2_rqsts.demand_data_rd_hit,l2_rqsts.pf_hit \
    python3 prefetch_bench.py

# Software prefetch sweep (showing when manual hints help):
python3 software_prefetch.py

Expect 10–20x sequential-vs-shuffled gap on any modern x86 server (2018+); ARM laptops show 6–12x because their prefetcher units are slightly less aggressive but the DRAM gap is also smaller. The shape — flat near unity below the cliff, jump to full DRAM cost above it — is universal.

Where this leads next

Prefetching is the bridge between two facts that would otherwise be incompatible: cores execute one instruction per nanosecond, and DRAM responds in seventy. Without prefetching, every cache miss would stall the pipeline for 280 cycles and modern programs would run at 1990s speeds. With prefetching, a sequential scan over a 100 GB array runs at compute speed because the hardware sees the pattern and pulls the next lines in before the loop notices it needs them. The discipline of the systems-performance engineer is to know which loops the prefetcher is already handling (most of them, when the access is sequential or stride-detectable), which loops the prefetcher cannot handle (every indirect access, every hash probe, every pointer chase), and which of those uncovered loops are predictable enough in software to benefit from explicit prefetch hints.

The mental model for tuning a hot loop becomes: first measure the L1 miss rate. If it is below 5%, the prefetcher is winning, and your time is better spent elsewhere — algorithmic complexity, branch prediction, or SIMD. If it is between 5% and 30%, the prefetcher is partly winning; check whether the access pattern can be reshaped (loop tiling, layout change) to be more stride-friendly. If it is above 30% and the access pattern is still predictable in code, software prefetch with a tuned distance is the next move; this is the regime where Razorpay's UPI fraud-feature refactor lived. If it is above 50% and the access pattern is genuinely random (true hash probes, true graph walks), the only real fix is layout — reshape the data so that what you read is what the cache fetches.
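The ladder is mechanical enough to write down as code — a mnemonic, with the thresholds being this chapter's rules of thumb rather than universal constants:

```python
def hot_loop_advice(l1_miss_rate: float, pattern_predictable: bool) -> str:
    """The tuning ladder above as a lookup. `l1_miss_rate` is a fraction
    in [0, 1]; `pattern_predictable` means code can compute future
    addresses ahead of time (e.g. an index array), even if hardware cannot."""
    if l1_miss_rate < 0.05:
        return "prefetcher is winning: spend time on algorithms, branches, SIMD"
    if l1_miss_rate < 0.30:
        return "reshape the access pattern: loop tiling or a layout change"
    if pattern_predictable:
        return "software prefetch with a swept, tuned distance"
    return "layout is the only fix: reshape data to match what the cache fetches"

print(hot_loop_advice(0.40, True))
```

The one input the function cannot compute for you is `pattern_predictable` — deciding whether the future addresses are knowable in code is the judgment call the whole chapter trains.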

A working backend engineer at Razorpay, Zerodha, or Hotstar usually does not write __builtin_prefetch in their daily code. But they do read flamegraphs every week, and the answer to "this hot loop is bottlenecked on memory, not compute" is roughly "make it more prefetcher-friendly" often enough — perhaps once a quarter — that the discipline of asking "what does the prefetcher see when this loop runs?" pays off as a permanent habit. The prefetcher is doing more work than your code, on most days.

References

  1. Intel® 64 and IA-32 Architectures Optimization Reference Manual — §3.6 (data prefetching), §B.4 (prefetcher PMU events), §11 (software prefetch usage guidelines).
  2. Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) — §6.3 (prefetching), the canonical software-prefetch worked example over a linked-list traversal.
  3. Brendan Gregg, Systems Performance (2nd ed., 2020) — §6.4 (cache analysis), §6.6 (PMU counters); the production diagnostic ladder for memory-bound hot paths.
  4. Agner Fog, "The microarchitecture of Intel, AMD and VIA CPUs" — §11 (data prefetching), per-microarchitecture details on prefetcher depth and stride limits.
  5. Onur Mutlu et al., "Runahead Execution" (HPCA 2003) — the academic foundation for hardware-managed lookahead prefetching that influences modern Intel and AMD designs.
  6. Linux perf wiki — uncore and prefetcher events — practical reference for the l2_rqsts.pf_hit and l2_rqsts.pf_miss counters used in this chapter.
  7. /wiki/data-layout-for-cache-friendliness — the layout-side companion to this chapter; prefetch and layout are the two halves of the cache-friendly-loop problem.
  8. /wiki/cache-lines-and-why-64-bytes-rules-everything — the unit-of-motion substrate that makes prefetch matter at all.