Out-of-order execution and reorder buffers
Karan at Zerodha is profiling the order-matching engine on a Tuesday morning. The hot loop reads a price level from a flat array, compares it against an incoming order, conditionally updates a counter, and steps to the next level. He times one iteration at 3.1 ns. He times one hundred iterations at 142 ns — not 310. The CPU somehow ran the second iteration before the first one had finished, and the third before the second, and at any moment during those 142 ns there were forty-something half-executed iterations sitting in flight inside the core. The numbers only line up if you accept that "next instruction" is a fiction the CPU maintains for your benefit.
The structure that maintains this fiction is the reorder buffer (ROB) — a circular array of slots, one per in-flight µop, that records the original program order so the CPU can finish things in any order it likes and still commit them in the order you wrote. The ROB is the difference between a 1980s in-order MIPS R2000 retiring at 0.7 IPC and a 2024 Apple M1 Firestorm retiring at 4+ IPC on the same C code. It is also the structure that fills up first when you write code the CPU cannot parallelise, and watching it fill is how you tell pointer-chase from streaming-load by reading two perf counters.
Out-of-order (OoO) execution lets the CPU dispatch µops as their operands become ready rather than as the programmer wrote them; the reorder buffer keeps original program order so retirement looks sequential. The ROB's size — 224 on Skylake, 320 on Zen 4, 630 on Apple M1 — is the maximum amount of latency the CPU can hide behind independent work. When the ROB fills with µops waiting on one slow load, IPC collapses regardless of how clever the rest of the core is.
In-order, out-of-order, and the gap between them
A 1980s in-order CPU executed instructions in strict program order: fetch instruction 1, finish all of it, fetch instruction 2, finish all of it. If instruction 1 was a load that missed L1 and went to DRAM (~80 ns ≈ 270 cycles at 3.4 GHz), the entire core sat idle for 270 cycles waiting. Instruction 2, even if it had nothing to do with instruction 1, did not start until cycle 271.
A modern out-of-order CPU does the opposite. It fetches and decodes in program order, but dispatches µops to execution ports the moment their operands are ready, regardless of where they sit in the program. The slow load on instruction 1 still takes 270 cycles to complete, but during those 270 cycles instructions 2, 3, 4, ..., up to ~220 more, can be issued, executed, and waiting to retire. If instruction 1 is independent of the next 200 instructions, the core has done 200 instructions' worth of useful work in the time the in-order chip did zero.
The OoO engine does not make memory faster. It does not reduce the load's 270 cycles by even one. What it does is find independent work to overlap with the load, so the core's 4-wide retirement bandwidth keeps moving while the load is in flight. Why this is the entire game: memory latency has barely moved in fifteen years (DDR3-1600 in 2010 had ~13 ns CAS; DDR5-6400 in 2024 has ~12 ns), but core IPC ceilings have gone from 2-wide to 4-wide to 8-wide. The only way the wider core gets used is to overlap more independent work with each unavoidable memory access. Every architectural trick — bigger ROB, more LFBs, deeper RS, smarter prefetch — is a different way to expose that overlap.
This is also why "the language doesn't matter, the algorithm matters" is half a lie. Two algorithms with the same big-O and the same instruction count can run 15× apart in wall-time depending on whether the OoO engine can find parallel work in their µop streams. Karan's matching-engine inner loop ran at 3.1 ns/iteration not because the silicon was magic but because every iteration's loads were independent of the previous iteration's loads — the OoO engine kept ~46 iterations in flight, all making progress, while individual load latencies were still 80 ns each. Switch the same algorithm to a linked-list traversal and you get 80 ns/iteration: same instructions, same data, no parallel work to find, ROB sits 95% full of waiting µops.
The historical jump matters here. The Pentium Pro (1995) was the first mass-market OoO x86, with a 40-entry ROB. By 2008 (Nehalem) the ROB was 128 entries; by 2015 (Skylake) it was 224; by 2021 (Golden Cove, the core in Alder Lake desktops and Sapphire Rapids servers) it reached 512. The ROB-size curve is steeper than the clock-speed curve has been since Dennard scaling broke around 2005 — single-core IPC has roughly tripled since 2005 while clocks moved from 3.0 GHz to 4.5 GHz boost. Almost all of that IPC growth came from bigger ROBs, more execution ports, and the supporting renaming/scheduling machinery — i.e., from finding more independent work in the same instruction stream rather than executing each instruction faster. The Apple M-series, Qualcomm Oryon, and AWS Graviton 4 cores extend this trend with ROBs in the 400–630 range; in 2026, a "fast core" is mostly defined by how much in-flight work it can manage at once.
What the ROB actually stores
The ROB is a circular FIFO of slots, indexed by a head pointer (the oldest in-flight µop) and a tail pointer (the newest). Each slot holds:
- The µop's opcode and operand encoding (what to compute).
- Its destination physical-register number — the rename map's allocation for this µop's output.
- A completed bit — set when the execute stage reports the result is in the physical register.
- An exception bit — set if execution faulted (page fault, divide-by-zero, etc.).
- Bookkeeping for the load buffer (LB) and store buffer (SB) if the µop is a memory op.
When a µop finishes executing, the execute stage flips its completed bit. The retire stage walks the head pointer forward: while the head's completed bit is set and there is no pending exception, it commits the µop's result to architectural state, frees the renamed physical register's old binding, and advances the head. Retirement is in program order; execution is not.
The ROB is also where precise exceptions come from. If µop 4 takes a page fault but µop 0 hasn't retired yet, the architectural state still reflects "before µop 0". The ROB walks the head forward, retiring 0, 1, 2, 3 normally, and only then does it observe µop 4's exception bit and trigger the fault handler — with architectural state pointing exactly at instruction 4 as if nothing past it ever ran. Why this matters for debuggers, signal handlers, and JITs: precise exceptions are what let your try/except see the line number of the actual fault, what let gdb step into the right instruction on SIGSEGV, and what let the JVM/V8 tier-up at a known program point. Without ROB-driven retirement, the core could not undo µops 5, 6, 7 that already executed past the fault — and you'd see register state from "the future" in your stack trace.
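The slot fields and the retire-walk above condense into a toy model (a teaching sketch, not a cycle-accurate simulator; the class and method names are invented for illustration):

```python
# rob_sketch.py -- toy reorder buffer: out-of-order completion, in-order
# retirement, precise exceptions. A teaching model, nothing more.

class Slot:
    def __init__(self, op):
        self.op = op            # what to compute (an opaque label here)
        self.completed = False  # set by the execute stage
        self.exception = None   # set if execution faulted

class ROB:
    def __init__(self, size):
        self.slots = [None] * size  # circular buffer
        self.head = 0               # oldest in-flight uop
        self.tail = 0               # next free slot
        self.retired = []

    def dispatch(self, op):
        assert self.slots[self.tail] is None, "ROB full: rename stalls"
        self.slots[self.tail] = Slot(op)
        self.tail = (self.tail + 1) % len(self.slots)

    def complete(self, idx, exception=None):
        # The execute stage finishes uops in ANY order.
        self.slots[idx].completed = True
        self.slots[idx].exception = exception

    def retire(self):
        # The retire stage walks the head forward in PROGRAM order.
        while self.slots[self.head] and self.slots[self.head].completed:
            slot = self.slots[self.head]
            if slot.exception:
                # Precise exception: everything older has retired,
                # everything younger is squashed wholesale.
                self.slots = [None] * len(self.slots)
                self.tail = self.head
                raise RuntimeError(f"fault at {slot.op}")
            self.retired.append(slot.op)
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)

rob = ROB(8)
for op in ["u0", "u1", "u2", "u3", "u4", "u5"]:
    rob.dispatch(op)
for idx in (5, 3, 1, 0, 2):      # execution completes out of order
    rob.complete(idx)
rob.retire()
print(rob.retired)                # ['u0', 'u1', 'u2', 'u3'] -- program order
rob.complete(4, exception="page fault")
try:
    rob.retire()                  # u4 faults precisely; u5 is squashed
except RuntimeError as e:
    print(e)                      # fault at u4
```

The retire loop stops at the first incomplete slot, which is exactly why a single stuck load at the head lets everything behind it pile up.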
The ROB's size is the most important number in the OoO engine. Skylake: 224 entries. Ice Lake: 352. Golden Cove (Alder Lake / Sapphire Rapids): 512, statically partitioned between threads when SMT is active. Zen 3: 256. Zen 4: 320. Apple M1 Firestorm: ~630. ARM Cortex-X4: 384. Each slot is roughly 16 bytes of SRAM-backed state plus the wakeup logic, so the ROB is also one of the bigger structures in the core's transistor budget — adding 100 ROB entries is not free in area or power, which is part of why x86 cores long shipped smaller ROBs than Apple's M-series.
Measuring the ROB at work: the MLP knee
The clearest way to see the ROB is to write two microbenchmarks that touch the same number of cache lines but differ only in memory-level parallelism (MLP) — how many independent loads the OoO engine can have in flight at once. Skylake's line fill buffer (LFB) caps a single core's outstanding L1 misses at 12, so a workload with MLP = 1 (pointer chase) and one with MLP = 12 (parallel chases) on the same DRAM-resident data should differ by ~12× in throughput.
```python
# bench_mlp.py — measure the effect of memory-level parallelism on throughput.
# Both loops touch the same N cache lines from the same array.
# Loop A: every load depends on the previous load (MLP=1, the ROB cannot help).
# Loop B: K independent chains interleaved (MLP≈K, the ROB hides latency).
import sys, time, random, array

N = 8 * 1024 * 1024                               # 8 M slots, 64 MB working set, defeats L3
K = int(sys.argv[2]) if len(sys.argv) > 2 else 1  # number of parallel chains

def make_chain(n, seed):
    """Build a permutation array: nxt[i] = next index in a single random cycle."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    nxt = array.array('q', [0] * n)
    for i in range(n - 1):
        nxt[perm[i]] = perm[i + 1]
    nxt[perm[-1]] = perm[0]   # close the cycle
    return nxt, perm[0]       # return array + start index

def chase_serial(nxt, start, steps):
    """MLP = 1: each step depends on the previous step's load."""
    p = start
    s = 0
    for _ in range(steps):
        p = nxt[p]
        s ^= p
    return s

def chase_parallel(nxts_starts, steps):
    """MLP ≈ K: K independent chains advance one step each per outer iter."""
    ps = [s for (_, s) in nxts_starts]
    nxts = [n for (n, _) in nxts_starts]
    s = 0
    K = len(ps)
    for _ in range(steps // K):
        for k in range(K):
            ps[k] = nxts[k][ps[k]]
        s ^= ps[0]
    return s

mode = sys.argv[1]
if mode == "serial":
    nxt, start = make_chain(N, 42)
    t0 = time.perf_counter_ns()
    s = chase_serial(nxt, start, N)
    t1 = time.perf_counter_ns()
    print(f"serial MLP=1 steps={N} ms={(t1-t0)/1e6:.1f} ns/step={(t1-t0)/N:.1f}")
elif mode == "parallel":
    chains = [make_chain(N // K, seed) for seed in range(1000, 1000 + K)]
    t0 = time.perf_counter_ns()
    s = chase_parallel(chains, N)
    t1 = time.perf_counter_ns()
    print(f"parallel K={K} steps={N} ms={(t1-t0)/1e6:.1f} ns/step={(t1-t0)/N:.1f}")
```
Run it sweeping K from 1 to 16, with `perf stat` collecting the L3 miss rate so you can confirm the working set really is DRAM-bound.
```python
# run_mlp.py — sweep K and watch ns/step drop until the LFB saturates.
import subprocess

def run(mode, K=None):
    cmd = ["perf", "stat", "-x,",
           "-e", "cycles,instructions,LLC-load-misses,l1d_pend_miss.pending_cycles",
           "python3", "bench_mlp.py", mode]
    if K is not None:
        cmd.append(str(K))
    r = subprocess.run(cmd, capture_output=True, text=True)
    print(r.stdout, end="")   # the benchmark's own ns/step line
    return r.stderr           # perf stat writes its counters to stderr

print(run("serial"))
for K in (1, 2, 4, 8, 12, 16):
    print(run("parallel", K))
```
Sample output on a c6i.4xlarge (Ice Lake, LFB depth = 12):
```
serial MLP=1 steps=8388608 ms=2380.2 ns/step=283.7
parallel K=1 steps=8388608 ms=2390.6 ns/step=284.9
parallel K=2 steps=8388608 ms=1240.4 ns/step=147.8
parallel K=4 steps=8388608 ms=670.1 ns/step=79.9
parallel K=8 steps=8388608 ms=380.5 ns/step=45.3
parallel K=12 steps=8388608 ms=290.7 ns/step=34.6
parallel K=16 steps=8388608 ms=292.1 ns/step=34.8
```
The throughput climbs almost linearly from K=1 (284 ns/step) to K=12 (34 ns/step) — an 8× gain — and then stops. K=16 gives the same ~35 ns/step as K=12. The LFB is full; adding more parallel chains doesn't help because the core can't track more than 12 outstanding L1 misses at once. Why the curve flattens exactly at K=12: the LFB has 12 slots on Ice Lake; once all 12 are occupied with in-flight DRAM accesses, the load-store unit blocks further loads from issuing. Adding a 13th independent chain just means one more µop sits in the RS waiting for an LFB slot to free. The effective MLP saturates at the LFB depth, so additional software-level parallelism is wasted. Zen 3 has 22 LFB slots, Zen 4 has 24, Apple M1 ≈ 30 — the "knee" of this curve moves accordingly. This is the single most useful chart for sizing how aggressively to unroll memory-bound loops on a given microarchitecture.
Walking the key lines:

- `for _ in range(steps): p = nxt[p]; s ^= p` — the serial loop. Each iteration's load address is the previous iteration's load result. The ROB fills with µops, but they're all blocked on the same dependency chain. MLP = 1, throughput = 1 / (memory latency).
- `for k in range(K): ps[k] = nxts[k][ps[k]]` — the parallel inner loop. K independent chains advance one step each. The OoO engine sees K loads with no dependencies between them; it issues all K to the LFB; they complete roughly together. Effective MLP ≈ K (capped by the LFB).
- `-e l1d_pend_miss.pending_cycles` — the perf event that directly measures MLP: the average number of cycles where there's at least one outstanding L1 miss. Divide by `cycles` to get "fraction of cycles with at least one miss in flight"; the fraction climbs from ~0.10 at K=1 to ~0.95 at K=12.
- The fact that the Python `for` loop doesn't kill the signal — yes, the interpreter is overhead, but the gap between the two regimes (8× throughput improvement) is so large that it survives the noise. For sub-µs measurements you'd drop to a C kernel via `ctypes`; here we don't need to.
This experiment is the empirical version of the ROB story. The ROB is the structure that lets you have 12 loads in flight; the LFB is the structure that caps it. The OoO engine's job is to find K independent µops to fill those 12 slots; your job, as the programmer, is to write code where K independent µops actually exist.
A subtle point worth sitting with: the Python code in the parallel loop interleaves K chains using a Python-level `for k in range(K)`. CPython's interpreter dispatch is tens of nanoseconds per bytecode — a real overhead. Yet the K=12 measurement still hits 34 ns/step, faster than the K=1 measurement's 284 ns/step. The reason is that even Python-interpreted µops have plenty of independence between them: each `nxts[k][ps[k]]` lookup is independent of the others, the bytecode dispatch's ALU work overlaps with the K outstanding DRAM loads, and the OoO engine genuinely does fill the LFB through the haze of CPython overhead. If you were to drop to a C kernel via `ctypes` you would push the K=12 number down to roughly 8 ns/step — a further 4× — but the shape of the curve would be identical: linear improvement up to the LFB's depth, then flat. The shape is the lesson; the absolute numbers are the platform.
Why the ROB fills and what happens when it does
The ROB has 224 slots on Skylake. If your hot loop has 4 µops/iteration, that's ~56 iterations in flight at any moment. If each iteration's load misses to DRAM and all loads are independent, the ROB stays mostly empty: loads complete, retire, new ones dispatch. If each iteration's load depends on the previous, the head sits stuck on slot 0 for 270 cycles waiting for the load to complete. Slots 1, 2, 3, ..., 223 fill up behind it. At slot 223 the rename stage stalls — there is no slot 224 to dispatch into — and the front-end backs up. IPC drops to roughly 4 / 270 = 0.015 until the head finally retires.
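The arithmetic in that paragraph is Little's law (in-flight work = latency × throughput) applied to the ROB. A few lines make the numbers explicit; the constants are the Skylake figures quoted above:

```python
# Little's law applied to the ROB, reproducing the paragraph's
# back-of-envelope numbers. Constants are the Skylake figures.

ROB_SIZE      = 224   # entries
UOPS_PER_ITER = 4     # uops per loop iteration
RETIRE_WIDTH  = 4     # uops retired per cycle
DRAM_CYCLES   = 270   # one DRAM miss at ~80 ns, 3.4 GHz

# Independent loads: iterations the ROB can keep in flight at once.
iters_in_flight = ROB_SIZE // UOPS_PER_ITER
print(iters_in_flight)             # 56

# Dependent loads: the head is stuck for DRAM_CYCLES, then one
# iteration's worth of uops retires. Effective IPC:
ipc_dependent = UOPS_PER_ITER / DRAM_CYCLES
print(round(ipc_dependent, 3))     # 0.015

# Latency the ROB alone can hide: the window drains at RETIRE_WIDTH
# uops/cycle, so it covers ROB_SIZE / RETIRE_WIDTH cycles of stall.
hideable_cycles = ROB_SIZE / RETIRE_WIDTH
print(int(hideable_cycles))        # 56 -- far short of 270: a single
# miss cannot be fully hidden; only multiple misses in flight (MLP) can.
```

The last number is the punchline: even a perfectly filled Skylake ROB covers only 56 cycles of a 270-cycle miss, which is why MLP, not ROB size alone, decides memory-bound throughput.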
This is the back-end memory-bound stall that perf stat --topdown reports as BackendBound.MemoryBound.DRAM. It has nothing to do with the front-end, branch prediction, or port pressure. The fix is not "a faster CPU" — every shipping x86 core has DRAM at ~80 ns. The fix is to expose more independent loads in the µop stream: split a linked list into a parallel array of pointers, prefetch ahead by 8 cache lines, unroll the loop manually so dependencies between iterations are broken.
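The last of those fixes, unrolling to break cross-iteration dependencies, can be sketched with the classic multi-accumulator transform. CPython's interpreter overhead masks the speedup here, so read this as the shape of the transform you would write in compiled code, not as a benchmark:

```python
# Dependency-breaking by accumulator splitting. The one-accumulator
# loop is a serial chain: each add waits on the previous add's result.
# Four accumulators give the OoO engine four independent chains.

def sum_serial(data):
    s = 0
    for x in data:
        s += x                  # every add depends on the previous add
    return s

def sum_unrolled4(data):
    s0 = s1 = s2 = s3 = 0
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        s0 += data[i]           # four independent dependency chains:
        s1 += data[i + 1]       # the OoO engine overlaps them, cutting
        s2 += data[i + 2]       # the critical path roughly 4x in
        s3 += data[i + 3]       # compiled code (Python overhead hides it)
    for x in data[n:]:          # tail elements
        s0 += x
    return s0 + s1 + s2 + s3

data = list(range(1, 1001))
print(sum_serial(data), sum_unrolled4(data))   # both 500500
```

The same idea, applied to loads instead of adds, is exactly what `chase_parallel` in the benchmark does with its K chains.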
The asymmetric case is front-end-bound stalls, where the ROB drains because the front-end can't deliver new µops fast enough — branch mispredicts, I-cache misses, µop-cache evictions. There the ROB head does retire normally; the tail can't keep up. The diagnostic difference is direct: a back-end stall shows ROB occupancy near full; a front-end stall shows ROB occupancy near empty. perf stat -e uops_issued.any,uops_retired.retire_slots gives you both numbers.
There is one further symptom worth knowing. When the ROB is mostly full but the head is not stuck on a load — instead it's stuck on a long-latency compute µop like a 96-cycle integer divide or a 5-cycle dependent FPU chain — the diagnostic looks almost identical to the memory-bound case (high resource_stalls.rob) but the fix is different. perf stat -e arith.divider_active will spike for divider-bound code; cycle_activity.stalls_total minus cycle_activity.stalls_l3_miss will be large. Compute-bound back-end stalls are rarer than memory-bound ones in real workloads (most production code is memory-bound long before it is compute-bound), but they exist — image-processing pipelines, cryptographic inner loops, and physics simulations are the usual suspects. The general technique is the same as for memory: expose more independent work so the ROB has parallel non-divider µops to retire while the divider is busy.
A real Flipkart Big Billion Days story from October 2024: the catalogue search service was running at IPC = 0.55 on c6i.8xlarge, p99 = 320 ms, SLO = 200 ms. The on-call team's first instinct was "scale out" — they doubled instance count and got 5% latency relief at 1.9× cost. A senior engineer ran `perf stat --topdown -a sleep 30` and got 71% back-end-bound, 18% retiring, 8% front-end-bound, 3% bad-spec. Drilling down with `toplev.py -l3` showed BackendBound.MemoryBound.L3_Bound = 48% — half of all cycles were waiting for L3 hits. The catalogue's hot-path data structure was an `unordered_map<sku_id, attributes*>` where each attribute pointer chased into a separately-allocated struct. The fix landed two days later: pack attributes inline into the bucket node, flatten the hash table into a parallel array layout. IPC went from 0.55 to 1.4, p99 dropped to 110 ms, the team rolled back the instance-count doubling, and the savings paid the engineer's salary for the year. The ROB was full the entire time — but full of µops waiting on dependent L3 loads, not full of useful work.
The general pattern: when the ROB is full and IPC is low, you do not need a faster CPU; you need to find independent work. When the ROB is empty and IPC is low, you do not need a faster CPU; you need a smaller code footprint or fewer branches. The ROB tells you which.
Edge cases the textbook diagram skips
Three failure modes deserve their own paragraph because they are common in production and almost never appear in undergraduate computer-architecture courses.
The ROB-RS imbalance trap. Skylake's ROB is 224 entries; its RS is 97 entries. The RS is the smaller pool and frequently fills before the ROB does. When the RS is full, rename stalls — even though there are 100+ free ROB slots — because new µops have nowhere to wait for their operands. perf stat -e resource_stalls.rs separates RS-full from ROB-full stalls. Code with long dependency chains (RS-bound) and code with poor MLP (ROB-bound) look identical at the IPC level but require different fixes. The Hotstar IPL streaming team found this in 2024: their video-segmenter was at IPC = 0.7, resource_stalls.rob = 6%, resource_stalls.rs = 33%. The fix was not "shorter dependency chains" but "fewer simultaneously live FP results" — they reduced the working set of in-flight FPU ops by reordering scalar arithmetic, freeing RS slots, and IPC went to 1.4 without changing memory access patterns at all.
ROB flush on memory-ordering machine clear (MOMC). x86's TSO model requires loads to appear to execute in program order, but the OoO engine executes them early anyway, speculating that nothing will change underneath them. If another core writes the cache line a speculatively-executed load has already read before that load retires, the speculation was wrong: the load is flushed along with everything younger than it in the ROB. The cost: 30–50 cycles, sometimes more. `perf stat -e machine_clears.memory_ordering` counts these. False sharing across cores produces them in bulk; reducing them often means padding hot atomics to a full cache line.
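The padding fix at the end of that paragraph can be sketched structurally with ctypes (assuming 64-byte cache lines; a real fix would also allocate the struct itself line-aligned):

```python
# Padding hot shared counters to a full cache line, sketched with ctypes.
# Packed: two counters written by different cores share one 64-byte line,
# so every cross-core write triggers coherence traffic and MOMC flushes.
# Padded: each counter owns its own line; the writes never conflict.
import ctypes

CACHE_LINE = 64

class Packed(ctypes.Structure):
    _fields_ = [("a", ctypes.c_uint64),    # core 0 writes a
                ("b", ctypes.c_uint64)]    # core 1 writes b: same line

class Padded(ctypes.Structure):
    _fields_ = [("a",    ctypes.c_uint64),
                ("_pad", ctypes.c_char * (CACHE_LINE - 8)),  # push b out
                ("b",    ctypes.c_uint64)]                   # next line

print(ctypes.sizeof(Packed))    # 16: a and b share one cache line
print(Padded.b.offset)          # 64: b starts on the next line
```

The cost of the fix is 56 wasted bytes per counter; the benefit is that `machine_clears.memory_ordering` stops climbing when two cores hammer the pair.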
Self-modifying code and SMC machine clears. Less common but spectacular. JITs (V8, JVM C2, .NET RyuJIT) write new instructions then jump to them. If the rewritten bytes overlap a cache line that any currently in-flight µop's prefetched instruction stream touched, the entire ROB is flushed and the front-end re-fetches from L1i with the new bytes. Tens of thousands of cycles per event. perf stat -e machine_clears.smc is your friend here; aligning JIT-emitted code to a 64-byte boundary and padding gaps between functions is what lets a JIT-heavy service hold IPC > 1.
A 2023 Dream11 incident illustrates the SMC problem in production. Their contest-payout service used HotSpot's tiered C2 JIT, and during the IPL final's payout cascade (200× normal write QPS for 30 seconds), the JIT recompiled hot methods more aggressively than usual to meet the throughput. machine_clears.smc jumped from a baseline of ~50/s to ~80,000/s during the cascade; IPC collapsed from 1.4 to 0.3 for the duration. The mitigation was to set -XX:CompileThreshold=100000 (delaying re-compilation past the 30-second cascade window) and pre-warming the JIT with a synthetic load before the payout window opened. Two-line config change, ₹6 crore in deferred infrastructure for the next IPL final. The lesson: knowing what flushes the ROB is what lets you find the fix that does not require buying more hardware.
Common confusions
- "Out-of-order execution means my code runs in any order." It doesn't. Architectural state — register writes visible to other instructions in this thread, stores visible to other cores — happens in program order via the ROB. What reorders is the internal scheduling of when each µop computes its result. From inside the same thread, you cannot tell out-of-order is happening. From another core, the memory model (TSO on x86, weaker on ARM) defines which reorderings are observable.
- "The ROB is the same as the reservation station." They're different structures. The ROB is indexed by program order and tracks who retires next. The reservation station (RS) — also called the scheduler or issue queue — is indexed by readiness and tracks who issues next. A µop is in both at once: ROB slot 137 to remember its position, RS slot 22 to wait for its operands. When the RS marks it ready, execute fires; when execute completes, the ROB's completed bit flips. Mixing them up is the most common ROB-related mistake.
- "A bigger ROB always helps." Only if your workload has independent work to fill it. A pure pointer-chase with a 224-entry ROB performs identically to one with a 1024-entry ROB — the dependency chain serialises everything, so 990 of those 1024 slots sit idle. Apple's 630-entry ROB only buys M1's IPC because their compilers and library code aggressively expose MLP; running the same x86 binary's pointer-chase under Rosetta does not magically benefit.
- "The ROB hides branch mispredicts the same way it hides cache misses." It doesn't. The ROB hides back-end latency (memory, FPU divides) by overlapping independent µops. It cannot hide branch mispredicts — when a branch is found mispredicted, every µop fetched past it (potentially the entire ROB tail) must be flushed, costing ~15–20 cycles of work and adding pipeline-refill latency. The deeper your pipeline, the more a ROB flush costs you on a mispredict.
- "`perf stat` shows me IPC; that's enough." IPC is the headline. The diagnostic depth is in `--topdown`'s four bins (FrontendBound, BackendBound, BadSpeculation, Retiring) and the L2 breakdown of each. A workload with IPC 0.6 and 70% back-end-bound is fixed differently than one with IPC 0.6 and 40% bad-speculation. Always run `perf stat --topdown` before deciding what to fix.
- "Renaming and the ROB do the same thing." Renaming converts architectural register names (`rax`) into physical-register-file (PRF) entries, eliminating false write-after-write and write-after-read dependencies. The ROB tracks program order and retirement readiness, totally separate from where data lives. A µop holds a ROB slot and a PRF entry and (until issued) an RS slot — three different structures with three different sizes (Skylake: 224 / 180 int + 168 FP / 97). Saturating any one of them stalls rename. `perf stat -e resource_stalls.{rob,rs,sb}` separates which one.
Going deeper
Tomasulo's algorithm and where the ROB came from
Out-of-order execution was invented by Robert Tomasulo at IBM in 1967 for the IBM System/360 Model 91's floating-point unit, twenty years before mainstream microprocessors used it. The original Tomasulo's algorithm had reservation stations and register renaming but no reorder buffer — it could not handle precise exceptions, which made it impractical for general-purpose CPUs. The ROB was added by Smith and Pleszkun at Wisconsin in 1985 ("Implementing Precise Interrupts in Pipelined Processors", IEEE Transactions on Computers) precisely to solve the exception problem. Every modern OoO core descends from this combination: Tomasulo's RS for wakeup, Smith-Pleszkun's ROB for in-order retirement. Read the 1985 paper if you want to see the modern core's skeleton sketched in 12 pages, before it was buried under decades of accretion.
The intermediate history is worth knowing because vendors still revisit the trade-offs. The Pentium Pro (1995, Intel) was the first mass-market OoO x86; its 40-entry ROB was bigger than anyone had previously shipped at consumer prices, and Intel's ability to outscale RISC competitors over the next decade traces directly to this microarchitectural lead. AMD's K7 / K8 (1999–2003) closed the gap with their own ROB-based designs. The DEC Alpha 21264 (shipped 1998) had an 80-entry instruction window, larger and more aggressive than any contemporary x86, and Alpha's per-clock IPC remained best-in-class until DEC's collapse. Sun's UltraSPARC line stayed in-order and later bet on deep multithreading (Niagara) instead — a road not taken at scale, though IBM's POWER5 SMT-2 design preserved part of the idea. Today's IBM POWER10 has a 480-entry ROB with 8-way SMT; the design space is wider than just the Intel/AMD x86 axis.
Memory ordering and the store buffer
The ROB retires loads and stores in program order, but memory becomes visible to other cores at a slightly different moment. Loads commit when the ROB retires them — but stores go into a separate store buffer (SB) of 56–80 entries on modern x86. The SB drains to L1 at its own pace, so a store can be retired (architecturally committed in this thread's view) before it is visible to other cores. This is the source of x86's TSO (Total Store Order) memory model: each core sees its own stores in order, but observers on other cores see stores potentially reordered with respect to their loads (StoreLoad reordering). The mfence instruction, used in lock primitives, drains the store buffer. ARM's weaker model lets the SB reorder more aggressively, which is why ARM code typically needs more explicit dmb barriers than x86. Why this matters for lock-free programming: the ROB makes single-thread reasoning easy (program order is observable in your own register file) but multi-thread reasoning hard (other threads see what the SB makes visible, not what the ROB retired). The "happens-before" relations in C++/Java/Go memory models are precisely the rules for which ROB-retired stores other cores are guaranteed to observe in which order.
The Apple M1 anomaly and why the ROB went huge
Apple's M1 (launched late 2020) shipped with a ~630-entry ROB — nearly 3× Skylake's 224 — and 6+ wide rename and retirement. The reason was their target workload: macOS's Objective-C / Swift runtime is heavy on indirect calls, dynamic dispatch, and pointer-chasing data structures. To hit competitive IPC on this workload, the core needs to hide the resulting cache misses behind enormous amounts of speculative independent work. The wider ROB was the only path; x86 cores ship narrower partly because their thermal/power budgets at 3.5+ GHz prohibit Apple's transistor-heavy approach. The Zen 4 / Zen 5 trend toward 320–512 entry ROBs is a partial catch-up. For the systems-performance reader the lesson is: the ROB's size is a constraint a microarchitecture optimises for its expected workloads, and a workload whose MLP exceeds that constraint will run faster on the architecture with the bigger ROB even at lower clock speed.
The Razorpay case: ROB pressure on payment-routing
Razorpay's payment-routing service, written in Go, was IPC = 0.8 on c6i.4xlarge during normal traffic and 0.45 during the IRCTC Tatkal window's UPI spike (10:00 IST, 18M sessions in 90 seconds, payment volume 14× normal). perf stat -e uops_dispatched.thread,resource_stalls.rob showed resource_stalls.rob jumping from 8% of cycles to 41% during the spike — the ROB was filling and stalling rename. The hot path was a chain of pointer dereferences through a routing-rules graph: bank → rule → next-rule → terminal. Each dereference was an independent L3 miss but the chain made them dependent. The fix was to flatten the routing graph into a parallel-array representation indexed by integer keys, and prefetch the next two rules' keys at each step. IPC went from 0.45 to 1.6 during the spike, the ROB stall rate dropped to 11%, and Razorpay's UPI p99 stayed under their 200 ms SLO during the next month's GST filing-deadline traffic spike. The ROB had not got bigger; the workload's MLP had — from ~1.5 to ~10 — and the ROB now had work to fill its slots with.
Speculation, side channels, and what mitigations cost the ROB
Spectre (Kocher et al., 2018) showed that the OoO engine's speculative pre-execution leaves measurable footprints in the cache hierarchy even when the µops are eventually flushed from the ROB. A speculative load on a mispredicted branch warms a cache line; the attacker timing-probes the cache to read out a victim's secret indirectly. Mitigations — IBPB / IBRS / STIBP MSRs on Intel, retpolines on syscall paths, the kernel's mitigations=auto Linux 5.x default — work by limiting what the OoO engine speculates past. Each mitigation cuts into the ROB's effective depth: an indirect-call retpoline forces a serialised lfence, draining roughly 30–50 µops of speculative work before the call resolves. PhonePe measured a 9% throughput regression on UPI authorisation when their hosts were patched for Spectre v2 in 2018. The IPC cost is real and durable; it is also why Cascade Lake (2019) and later silicon, with hardware Spectre v2 mitigations, recovered most of the loss. The ROB's depth on paper is one thing; the ROB's usable depth after security mitigations is another, often 5–15% smaller depending on workload.
Reproduce this on your laptop
```bash
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
# bench_mlp.py and run_mlp.py from above; no pip deps beyond stdlib
sudo perf stat -e cycles,instructions,LLC-load-misses,l1d_pend_miss.pending_cycles \
    python3 bench_mlp.py serial
for K in 1 2 4 8 12 16; do
  sudo perf stat -e cycles,instructions,LLC-load-misses \
      python3 bench_mlp.py parallel $K
done
sudo perf stat --topdown python3 bench_mlp.py serial
```
You should see ns/step plateau between K=8 and K=16 on Skylake/Ice Lake, between K=16 and K=22 on Zen 3, between K=20 and K=24 on Zen 4. The exact knee is your machine's LFB depth.
Where this leads next
The ROB is one of three structures that decide how much latency you can hide. The next chapters fill in the others.
- Branch prediction: TAGE and perceptron predictors — chapter 3, how the predictor decides which speculative path to fill the ROB with.
- Front-end vs back-end bound: reading top-down — chapter 4, the diagnostic methodology that tells you whether the ROB is full or empty during a stall.
- Instructions per cycle: what 4-wide retire really means — chapter 5, the IPC ceiling the ROB feeds.
- The line fill buffer and memory-level parallelism — Part 2, the structure that caps how many concurrent L1 misses the ROB can have outstanding.
- Memory-ordering machine clears and false sharing — Part 2, the cross-core symptom of ROB flushes that ruins lock-free data structures at scale.
Part 2 of the curriculum (caches and memory) follows directly from this chapter — the ROB hides latency, but only the latency the cache hierarchy actually charges. To know what to hide, you have to know what the L1/L2/L3/DRAM hierarchy costs and how the prefetcher reduces some of it for free.
A practical takeaway for the on-call engineer: the four perf events `cycles`, `instructions`, `resource_stalls.rob`, and `l1d_pend_miss.pending_cycles` together tell you almost everything about whether the ROB is the limiting structure on your workload. If `instructions/cycles` < 1.0 and `resource_stalls.rob` > 20% of cycles, the ROB is full and the workload is back-end bound. If `instructions/cycles` < 1.0 and `resource_stalls.rob` < 5%, the ROB is empty and the workload is front-end bound — fix branches, not memory. Add `l1d_pend_miss.pending_cycles / cycles` and you also know whether the back-end stall is memory-level (high, ~0.7+) or core-level port pressure (low, below ~0.2). Three ratios, fifteen seconds, the right diagnosis. Bind it to a hotkey on your laptop the way Aditi at Razorpay did; it pays for itself the first incident you use it on.
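That decision rule can be written down directly. A sketch, where the function name and thresholds are this chapter's rules of thumb rather than universal constants:

```python
# Triage from the four counters described above. Thresholds are the
# rule-of-thumb values quoted in this chapter, not universal constants.

def diagnose(cycles, instructions, rob_stall_cycles, l1d_pend_cycles):
    ipc = instructions / cycles
    rob_full = rob_stall_cycles / cycles          # fraction of cycles ROB-stalled
    miss_in_flight = l1d_pend_cycles / cycles     # fraction with a miss pending
    if ipc >= 1.0:
        return "healthy: not obviously ROB-limited"
    if rob_full > 0.20:
        if miss_in_flight > 0.7:
            return "back-end memory-bound: expose more independent loads (MLP)"
        return "back-end core-bound: long-latency compute, check arith.divider_active"
    if rob_full < 0.05:
        return "front-end bound: fix branches and code footprint, not memory"
    return "mixed: run perf stat --topdown for the full breakdown"

# The Razorpay-style spike from earlier in the chapter: IPC 0.45,
# ROB stalled 41% of cycles, a miss in flight most of the time.
print(diagnose(cycles=1_000_000, instructions=450_000,
               rob_stall_cycles=410_000, l1d_pend_cycles=900_000))
```

Feed it the raw counts from a fifteen-second `perf stat` run and it reproduces the paragraph's branching logic mechanically.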
Subsequent chapters in Part 1 (branch prediction, IPC, top-down) and Part 2 (caches, prefetchers, MLP, false sharing) make these ratios more precise — but the first 80% of the diagnosis is in this chapter's handful of perf counters. Bookmark this page; you will be back when the next p99 alert fires.
A final framing to keep next to that perf hotkey: the ROB is how much future you can run in advance. The bigger and more independent your code's ROB occupancy, the more of the next 200 ns you've already started before the current 80 ns finishes. Performance work is, in the end, the discipline of giving the ROB more of that future to chew on.
A practical note: HotSpot's -XX:+PrintAssembly (with the hsdis plugin) and Go's GOSSAFUNC=funcName env var both let you read the JIT'd machine code your program actually runs. Pair that with perf record to attribute samples to specific machine instructions. The first time you do this on your own production code is the moment OoO execution stops being abstract and becomes a concrete thing you can read off the page — "this load is independent of that one, the OoO engine will overlap them; this load uses the result of that one, it won't". The investment is roughly half a day to set up the toolchain; the payback is a decade of clearer thinking about every performance bug you encounter.
References
- Smith & Pleszkun, "Implementing Precise Interrupts in Pipelined Processors" (IEEE TC, 1985) — the paper that introduced the ROB. Twelve pages; still the cleanest explanation.
- Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" (IBM Journal, 1967) — the original out-of-order paper. Reservation stations and renaming, no ROB yet.
- Agner Fog, "The microarchitecture of Intel, AMD and VIA CPUs" — the canonical independent reference for ROB sizes, RS depths, LFB counts, and execution port maps across every shipping CPU. Chapter 8 covers the OoO engine in detail.
- Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed.) — chapter 3 on instruction-level parallelism. The textbook treatment of OoO, hazard detection, and renaming.
- Yasin, "A Top-Down Method for Performance Analysis and Counters Architecture" (ISPASS 2014) — the methodology behind `perf stat --topdown`, including how `BackendBound.MemoryBound.DRAM` is computed from ROB-occupancy counters.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — chapter 6 on CPUs, especially the section on speculative execution costs and the post-Spectre mitigation overhead.
- Andi Kleen, `pmu-tools` and `toplev.py` — the open-source TMA implementation. Run `toplev.py -l3 ./your-binary` and read the back-end-bound and resource-stalls sub-bins.
- Pipelines: fetch, decode, issue, execute, retire — chapter 1 of this curriculum, the framing this chapter builds on.