Pipelines: fetch, decode, issue, execute, retire

Aditi at Razorpay is staring at a payment-routing service that does ₹600 crore in volume every Tuesday. The service has 16 cores, the CPU is at 38% utilised, and the p99 of /route is 14 ms when the SLO is 8. She runs perf stat. Cycles look normal. Instructions per second is reasonable. Then she looks at the ratio — instructions / cycles = 0.62 — and the rest of the day is spent figuring out why a modern Xeon, capable of retiring four instructions per clock, is only retiring half of one.

That ratio is IPC, instructions per cycle, and it is the single most diagnostic number a performance engineer can read off a CPU. A 4-wide retire at a 2.2 GHz clock can in principle retire 8.8 billion instructions per second per core; Aditi's service was retiring closer to 1.4 billion, and the missing 7.4 billion instructions per second of per-core capacity went to waiting — for the front-end to deliver instructions, for the back-end to drain a load, or for a mispredicted branch to be unwound. The first job is not "go faster" but "find which kind of waiting".

A modern CPU is a deeply pipelined out-of-order machine that fetches, decodes, renames, issues, executes, and retires instructions in overlapping stages. Throughput is bounded not by clock speed but by which stage starves first — usually the front-end (instruction supply) or the back-end (waiting on memory). IPC is the number that tells you which one is hurting; everything in Part 1 is about reading that signal.

The five stages, and why there are really fifteen

The textbook diagram of a CPU pipeline shows five boxes — fetch, decode, execute, memory, writeback — borrowed from the 1980s MIPS R2000. Modern x86 and ARM cores have somewhere between 14 and 19 pipeline stages, but they still cluster into five logical phases. Knowing the phases is enough to read a perf stat output; knowing the exact stages of your specific microarchitecture is what gets you the last 20% of performance.

[Figure: the five logical stages of a modern CPU pipeline, left to right.
  FRONT-END (in-order):
    Fetch — read 32 B from L1i + BTB; width 16–32 B; stalls: I-cache miss, branch mispredict
    Decode — x86 → µops + µop cache; width 4–8 µops; stalls: complex insn, µop-cache miss
  BACK-END (out-of-order):
    Rename + Issue — arch → phys regs, into ROB / RS; width 5–6 µops; stalls: ROB full, RS full
    Execute — 10+ ports: ALU / FPU / LD / ST; width 8–10 ports; stalls: cache miss, port contention
    Retire — commit in program order; width 4–8 µops; stalls: head of ROB stuck
  Each stage is 1–4 physical pipeline cycles; total in-flight: 200–400 µops on Skylake/Zen3.]
The five logical phases of a modern CPU pipeline. The front-end is in-order and serial; the back-end is out-of-order and parallel. Different stalls live on different sides of the dashed line. Illustrative — not measured data.

Fetch pulls 16–32 bytes per cycle from the L1 instruction cache, guided by the branch predictor (BTB — branch target buffer). On a Skylake-X core, this is a 4-cycle pipeline by itself.

Decode turns variable-length x86 instructions into fixed-size internal µops (micro-operations). A simple add eax, 1 becomes one µop; a complex rep movsb can balloon to thousands. Modern Intel cores have a decoded µop cache (DSB) that holds ~1500 recently-decoded µops, so a hot loop running entirely from the µop cache skips the legacy decoder entirely. Zen 3/4 has the equivalent op cache.

Rename + Issue is where the magic happens. The architectural registers (rax, rbx, r8...) are renamed to physical registers from a pool of 180–320, eliminating false dependencies. The renamed µops drop into the reorder buffer (ROB) and the reservation station (RS) — the ROB tracks original program order for retirement, the RS holds µops waiting for their inputs.

Execute is where µops actually run, dispatched from the RS to one of 8–10 execution ports as soon as their operands are ready. Skylake has 4 ALUs (ports 0, 1, 5, 6), 2 load AGUs (ports 2, 3), 1 store AGU (port 4), and various FPU/SIMD units sharing the integer ports. Why ports matter: even with infinite ROB capacity, your throughput is bounded by port pressure. A loop of 100% integer adds peaks at 4 µops/cycle (4 ALU ports). A loop of 100% loads peaks at 2/cycle. Mix them and you can hit the full 4-wide retire — provided the front-end can feed it.
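The port-pressure arithmetic above can be sketched as a back-of-envelope model. The function name and the Skylake-like port counts (4 ALU, 2 load, 1 store, 4-wide issue) are illustrative assumptions, not a cycle-accurate simulator:

```python
# Sketch: lower bound on cycles for a µop mix, from port pressure alone.
# Port counts are Skylake-like; `min_cycles` is a hypothetical helper.

def min_cycles(n_alu: int, n_load: int, n_store: int,
               alu_ports=4, load_ports=2, store_ports=1, width=4) -> float:
    """Each resource imposes its own floor on cycle count; the slowest wins."""
    total = n_alu + n_load + n_store
    return max(n_alu / alu_ports,      # ALU-port floor
               n_load / load_ports,    # load-AGU floor
               n_store / store_ports,  # store-AGU floor
               total / width)          # issue/retire-width floor

# 100 integer adds: bounded by the 4 ALU ports -> 25 cycles (4 µops/cycle)
print(min_cycles(100, 0, 0))   # 25.0
# 100 loads: bounded by the 2 load ports -> 50 cycles (2 µops/cycle)
print(min_cycles(0, 100, 0))   # 50.0
# A 60/30/10 mix: here the 4-wide width itself is the binding limit
print(min_cycles(60, 30, 10))  # 25.0
```

The third case shows the claim in the text: mix the µop types and the per-port floors drop below the width floor, so the machine can run at its full 4-wide rate.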

Retire commits completed µops back to architectural state, in original program order, at up to 4–8 µops/cycle. This is where exceptions become observable, where stores actually become visible to other cores, and where the IPC counter you read with perf stat increments.

The whole thing is an assembly line. At any moment, there are 200–400 µops in flight on a Skylake core — being fetched, decoded, renamed, executing, or waiting to retire. The CPU's job is to keep the pipe full. Yours is to write code that doesn't make that hard.

A useful mental check: if the core retired 4 µops every cycle, a 3.4 GHz core would retire 13.6 billion µops per second. Real production workloads run at 0.5–2 IPC, retiring 1.7–6.8 billion. The factor of 2–8× between the silicon's peak and the workload's actual throughput is the performance budget you have not yet spent — and reading that budget through the right counters is what Part 1 is about. Every chapter from here on is, in some sense, a strategy for spending that budget.
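That mental check as a few lines of arithmetic — a minimal sketch with the 4-wide, 3.4 GHz figures from the text plugged in; `unspent_budget` is a hypothetical helper, not a real API:

```python
# Peak = retire width × clock; the gap between peak and the measured
# retirement rate is the unspent performance budget.

def unspent_budget(width: int, ghz: float, measured_ipc: float):
    peak = width * ghz * 1e9           # µops/second at full retire width
    actual = measured_ipc * ghz * 1e9  # what the workload actually retires
    return peak, actual, peak / actual # headroom factor

peak, actual, factor = unspent_budget(4, 3.4, 0.5)
print(f"peak={peak:.2e} actual={actual:.2e} headroom={factor:.0f}x")  # 8x at 0.5 IPC
```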

The cost of the wrong mental model is concrete. A team at MakeMyTrip in 2023 doubled their hot-path cluster from 64 to 128 c6i.4xlarge instances to chase a p99 regression, spending roughly ₹38 lakh/month extra. The actual fix, found six weeks later by a contractor with a perf stat cheat sheet, was a hash-map collision pattern that pushed IPC from 1.6 down to 0.4 on the booking-search path. Going back to 64 instances after the IPC fix saved more than the regression cost. "Add cores" is a refactor of last resort, not first.

Measuring IPC: a Python driver wrapping perf stat

The fastest way to see the pipeline at work is to measure IPC on two loops that look identical but feed the back-end at very different rates. Write a microbenchmark, wrap it with perf stat, parse the counters in Python.

# bench_ipc.py — measure IPC of two loops with the same op count
# Run: python3 bench_ipc.py linear   (and:   python3 bench_ipc.py chase)
import sys, time, array, random

N = 64 * 1024 * 1024  # 64 M, enough to defeat L3 on most laptops

def linear_sum():
    """Sequential add — branch-predictable, prefetcher-friendly."""
    a = array.array('q', [1] * N)
    s = 0
    t0 = time.perf_counter_ns()
    for i in range(N):
        s += a[i]
    t1 = time.perf_counter_ns()
    print(f"linear: sum={s} elapsed_ms={(t1-t0)/1e6:.1f}")

def pointer_chase():
    """Random pointer-chase — every load depends on the previous."""
    # Sattolo's algorithm: build a single-cycle permutation so the chase
    # is guaranteed to visit every slot, instead of getting trapped in a
    # short cycle of a plain random shuffle. nxt[i] = where to jump next.
    nxt = array.array('q', range(N))
    rng = random.Random(42)
    for i in range(N - 1, 0, -1):
        j = rng.randrange(i)
        nxt[i], nxt[j] = nxt[j], nxt[i]
    t0 = time.perf_counter_ns()
    p, s = 0, 0
    for _ in range(N):
        p = nxt[p]
        s ^= p
    t1 = time.perf_counter_ns()
    print(f"chase: sum={s} elapsed_ms={(t1-t0)/1e6:.1f}")

if __name__ == "__main__":
    {"linear": linear_sum, "chase": pointer_chase}[sys.argv[1]]()

Now invoke it under perf stat from a Python harness so you can parse the counters:

# run_ipc.py — wrap bench_ipc.py with perf stat, parse IPC
import subprocess, re, sys

def measure(mode):
    r = subprocess.run(
        ["perf", "stat", "-x,", "-e", "cycles,instructions,branch-misses,L1-dcache-load-misses,LLC-load-misses",
         "python3", "bench_ipc.py", mode],
        capture_output=True, text=True,
    )
    print(f"\n=== {mode} ===")
    print(r.stdout, end="")
    counters = {}
    for line in r.stderr.splitlines():
        # perf -x, CSV format: <count>,<unit>,<event>,...  (unit often empty)
        parts = line.split(",")
        if len(parts) >= 3 and parts[0].replace(".", "", 1).isdigit():
            counters[parts[2]] = int(float(parts[0]))  # skips "<not supported>" rows
    cyc, ins = counters.get("cycles", 0), counters.get("instructions", 0)
    if cyc:
        print(f"cycles      = {cyc:>15,}")
        print(f"instructions= {ins:>15,}")
        print(f"IPC         = {ins/cyc:.3f}")
        print(f"L1d-misses  = {counters.get('L1-dcache-load-misses',0):>15,}")
        print(f"LLC-misses  = {counters.get('LLC-load-misses',0):>15,}")

for m in ("linear", "chase"):
    measure(m)

Sample run on a c6i.4xlarge (Ice Lake, 16 vCPU, 32 GB):

=== linear ===
linear: sum=67108864 elapsed_ms=2110.4
cycles      =   6,420,318,055
instructions=  14,103,772,901
IPC         = 2.197
L1d-misses  =      33,841,972
LLC-misses  =          12,008

=== chase ===
chase: sum=...      elapsed_ms=24780.1
cycles      =  74,210,884,612
instructions=  10,902,330,118
IPC         = 0.147
L1d-misses  =      62,915,447
LLC-misses  =      62,891,402

Same Python interpreter, similar instruction count, a ~12× difference in wall time and a 15× difference in IPC. The linear loop runs at 2.2 IPC because the prefetcher streams the next 16 cache lines into L1 and the back-end is fed at full speed. The pointer-chase runs at 0.15 IPC because every load depends on the previous load's result — a dependency chain the OoO engine cannot break — and almost every load misses to DRAM.

Why the LLC-miss count for chase is ~63M out of ~64M iterations: the working set (64M × 8 bytes = 512 MB) is 16× the L3 size on this Ice Lake part (~32 MB), so essentially every random access falls through L1, L2, and L3 to DRAM at ~80 ns. With dependent loads, the back-end cannot overlap them — only one load is in flight at a time per dependency chain. That's why IPC collapses: 80 ns × 3.4 GHz ≈ 270 cycles of stall per dependent load.
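The residency arithmetic behind that miss count, as a back-of-envelope sketch (the ~32 MB L3 figure is the approximation quoted above, and uniform-random access is assumed):

```python
# With a 512 MB working set and a ~32 MB L3, a uniformly random access
# finds its line resident with probability at most l3/ws; the rest go to DRAM.

N = 64 * 1024 * 1024          # iterations / array slots
ws_bytes = N * 8              # 'q' array: 8 bytes per slot -> 512 MB
l3_bytes = 32 * 1024 * 1024   # approximate L3 on this part

hit_p = l3_bytes / ws_bytes          # ~0.0625 upper bound on the hit rate
expected_misses = N * (1 - hit_p)    # ~62.9M of 67.1M accesses
print(f"hit_p≈{hit_p:.3f} expected_misses≈{expected_misses/1e6:.1f}M")
```

The predicted ~62.9M lines up with the 62,891,402 LLC misses in the sample run.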

Walking the key lines: -x, puts perf stat into CSV mode, and because perf writes its counters to stderr, the benchmark's own stdout passes through cleanly. The digit guard on the first CSV field skips blank and <not supported> rows, and the third field — the event name — keys the counters dict. From there, IPC is just instructions divided by cycles.

Why the back-end is not always the bottleneck

The mental model "back-end is the bottleneck because memory is slow" is half the truth. There are at least four common shapes a stall can take, and only one of them is "memory was slow":

  1. Back-end memory-bound — load missed L3, ROB filled with µops waiting on the load. Symptom: high LLC-load-miss rate, IPC collapses to single-load-at-a-time.
  2. Back-end core-bound — execution ports oversubscribed (e.g. seven independent multiplies per cycle but only two FP-multiply ports). Symptom: port utilisation pegged, no cache misses, IPC plateaus below width.
  3. Front-end fetch-bound — branch mispredict, I-cache miss, or µop-cache eviction. Symptom: idq_uops_not_delivered.core is high; back-end is starving for work.
  4. Bad speculation — predictor confident but wrong. Symptom: branch-misses rate >2%, the speculative work is drained at every mispredict.

The other half: on branchy or large-code-footprint workloads, the front-end starves first. The Zerodha Kite order-matching engine on cash equity at 10:00 IST sees this — the matching state machine has dozens of branches per matched order, and the BTB cannot hold all the targets. Branch mispredicts cost ~15–20 cycles each (the mis-speculated work in the pipeline has to be drained), and at one mispredict every 50 instructions, the front-end loses 30% of its throughput before the back-end has anything to do.
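The arithmetic behind that throughput loss can be modelled in a few lines. This is an illustrative model, not a measurement — `mispredict_loss` is a hypothetical helper, and the penalty, mispredict rate, and baseline IPC are knobs you would fill in from your own counters:

```python
# One mispredict every `insns_per_miss` instructions, each costing `penalty`
# drained cycles, on top of insns_per_miss/base_ipc cycles of useful work.

def mispredict_loss(insns_per_miss: float, penalty: float, base_ipc: float) -> float:
    useful = insns_per_miss / base_ipc   # cycles spent doing real work
    return penalty / (useful + penalty)  # fraction of cycles wasted on drains

# One mispredict per 50 instructions, ~17-cycle drain, 2-IPC baseline:
print(f"{mispredict_loss(50, 17, 2.0):.0%}")   # ~40%
# The same rate at a 1-IPC baseline hurts proportionally less:
print(f"{mispredict_loss(50, 17, 1.0):.0%}")   # ~25%
```

The ~30% figure quoted above sits between these two cases, depending on the baseline IPC the engine would otherwise sustain.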

[Figure: top-down stall classification — one stacked bar per scenario, split into retiring, bad speculation, front-end bound, and back-end bound.
  Memory-bound (pointer chase), IPC = 0.15: thin retiring and front-end slices; ~70% back-end bound (L3 + DRAM).
  Branchy (matching engine), IPC = 0.6: retiring ~25%, bad speculation ~25%, front-end bound ~40%, small back-end slice.
  The same CPU, the same 4-wide retire ceiling, two completely different stall profiles. perf stat --topdown gives these percentages; fix the biggest box.]
Two stall profiles on the same CPU. The pointer-chase wastes the back-end; the matching engine wastes the front-end. The fix differs accordingly. Illustrative percentages — measure your own with perf stat --topdown.

The right diagnostic flow:

  1. Run perf stat -a sleep 30 on the production host. Read IPC.
  2. If IPC ≥ 1.5, the CPU is mostly retiring — your hot path is working hard, look elsewhere (allocator, syscalls, GC).
  3. If IPC < 1.0, run perf stat --topdown to split the slack into FE/BE/bad-spec/retire.
  4. Front-end-bound? Your code has too many branches or too large an I-cache footprint. Look at branch density and I-TLB miss rate.
  5. Back-end-bound? Your code is waiting on memory. Look at L1d/L2/L3 miss rates and the dependency-chain length in the hot loop.
  6. Bad-speculation? The branch predictor is failing. Look at your tightest data-dependent branch.
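The six-step flow condenses into a sketch. `triage` is a hypothetical helper and the 1.5 IPC threshold is the one quoted above; the topdown fractions would come from perf stat --topdown:

```python
# First-order diagnosis from IPC plus topdown fractions (each 0–1).

def triage(ipc: float, fe: float, be: float, bad_spec: float) -> str:
    if ipc >= 1.5:
        return "mostly retiring: look elsewhere (allocator, syscalls, GC)"
    # Below ~1.0 IPC, attack whichever stall bin dominates.
    biggest = max(("front-end bound", fe), ("back-end bound", be),
                  ("bad speculation", bad_spec), key=lambda t: t[1])
    return f"fix the biggest box: {biggest[0]}"

print(triage(0.15, 0.05, 0.70, 0.05))  # pointer chase -> back-end bound
print(triage(0.60, 0.40, 0.10, 0.25))  # matching engine -> front-end bound
print(triage(1.80, 0.10, 0.10, 0.05))  # healthy hot path -> mostly retiring
```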

A real Hotstar SRE story from the 2024 IPL final, 25M concurrent viewers: the catalogue API was running at IPC = 0.7 on c6i.8xlarge instances. perf stat --topdown showed 52% back-end bound, 28% front-end bound, 9% bad spec, 11% retiring. Two fixes landed in the same week — a memory-layout change in the catalogue cache (struct-of-arrays instead of array-of-structs) brought IPC up to 1.1 and pushed back-end-bound below 35%; profile-guided optimisation (PGO) with the playoff workload as input brought IPC up to 1.6 by reorganising the hot code so the µop cache covered the inner loop. p99 dropped from 180 ms to 95 ms, ₹4 crore in compute spend deferred for the next IPL season.

The diagnostic flow above is the discipline of Part 1. Every chapter that follows — branch prediction, ROB sizing, port pressure, µop cache hits — is a refinement of one of those six steps. If you cannot answer "what is the IPC on the hot path" within 30 seconds of being paged, you are doing performance engineering on hope rather than measurement. Aditi at Razorpay keeps a one-line script (perf stat -a sleep 5 2>&1 | grep -E 'cycles|insn') bound to a hotkey on her on-call laptop; it has saved her hours on three separate incidents.

What the OoO engine actually buys you

A naive single-issue in-order CPU running at 3 GHz could in principle retire 3 billion instructions/second. A modern 4-wide OoO CPU at 3 GHz can retire 12 billion µops/second at its peak — but only on code that lets it. The OoO engine's job is to hide latency by finding parallel work in the instruction stream that the programmer wrote serially.

program order:                      OoO execution:
  ld r1, [a]    (200 cyc miss)        ld r1, [a]   ───────[200]──
  add r2, r1,1                                    waiting r1   ─[1]─ add
  ld r3, [b]    (200 cyc miss)        ld r3, [b]   ───────[200]──
  add r4, r3,1                                    waiting r3   ─[1]─ add
  ld r5, [c]    (200 cyc miss)        ld r5, [c]   ───────[200]──
  ...                                  three loads in flight in parallel

If a, b, c are independent, all three loads issue in the first cycle and complete around cycle 200 together. Total latency: ~205 cycles for 6 instructions — IPC ≈ 0.03 in absolute terms but 3× better than serial execution of the same code. The OoO engine does not make memory faster; it makes the program hide more memory latency.

The ROB is the fuel tank. Skylake's 224-entry ROB means up to 224 µops can be in flight, and you can have 12–16 outstanding L1d misses simultaneously (the LFB — line fill buffers). When the ROB fills up because the head is stuck on a slow load, the front-end is forced to stop fetching, and IPC collapses regardless of how clever your branch predictor is. Why pointer-chase pegs at 0.15 IPC despite a 224-entry ROB: every load depends on the prior load. The ROB does fill up — but with 220 µops all waiting on a single dependency chain. There is no parallel work to find. The ROB protects you from latency, but only when the program has independent work to do; a serial chain is OoO-proof.

The lesson: performance engineering is about feeding the back-end with independent work. Loop unrolling, SIMD, software pipelining, prefetch, branch hoisting — all of these are tricks to expose more independent work in the instruction stream so the ROB has something to do while waiting for the slow load.

A small numerical walkthrough makes this concrete. Suppose your hot loop touches one cache line per iteration and that line is in DRAM (~80 ns ≈ 270 cycles at 3.4 GHz). If iterations are dependent (each load uses the previous result), you complete one load every 270 cycles → throughput = 1/270 ≈ 0.004 loads/cycle. If iterations are independent and the LFB has 12 slots, you complete 12 loads every 270 cycles → throughput = 12/270 ≈ 0.044 loads/cycle, a 12× improvement on the same hardware running the same number of instructions. Why the 12× ceiling is not infinite: the LFB caps how many outstanding L1d misses a single core can have in flight. Skylake/Ice Lake = 12, Zen 3 = 22, Zen 4 = 24, Apple M1 Firestorm ≈ 30. This is the hidden parameter behind "memory-level parallelism" (MLP); a workload's MLP — its average number of outstanding cache misses — is what separates DRAM-bound code that runs at 0.15 IPC from DRAM-bound code that runs at 1.5 IPC. Same memory speed; very different software.
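The same arithmetic as a two-line helper — a sketch using the ~270-cycle DRAM round trip and the LFB sizes quoted above; `loads_per_cycle` is a hypothetical name:

```python
# Steady-state miss throughput: with `mlp` misses in flight and a
# `latency`-cycle round trip, you complete mlp loads every latency cycles.

def loads_per_cycle(latency_cycles: float, mlp: int) -> float:
    return mlp / latency_cycles

dep = loads_per_cycle(270, 1)    # dependent chain: one miss in flight
ind = loads_per_cycle(270, 12)   # independent, Skylake-like 12-entry LFB
print(f"dependent={dep:.4f} independent={ind:.4f} gain={ind/dep:.0f}x")
```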

Common confusions

  1. High CPU utilisation is not useful work. Aditi's service showed 38% utilisation, but each busy cycle retired only 0.62 instructions — utilisation counts busy cycles, not progress.
  2. instructions is not µops. perf counts retired architectural instructions while the pipeline moves µops; a rep movsb is one instruction but thousands of µops, so the two diverge sharply on string- or microcode-heavy paths.
  3. Low IPC does not always mean slow memory. Front-end starvation and bad speculation produce the same headline number; only the top-down split tells them apart.
  4. High IPC does not mean a fast program. A spin-wait loop retires near the machine's full width while accomplishing nothing; IPC measures pipeline occupancy, not the value of the work.

Going deeper

Top-down microarchitecture analysis (TMA)

Intel's Top-down Microarchitecture Analysis is the methodology behind perf stat --topdown, formalised by Yasin (Intel, 2014). At the top level it splits cycles into four bins — Retiring, Bad Speculation, Front-end Bound, Back-end Bound — which together sum to 100%. Each bin further decomposes — back-end-bound splits into core-bound (port pressure, dependency chains) and memory-bound (L1, L2, L3, DRAM). Intel's pmu-tools (Andi Kleen) and the toplev.py script give the full breakdown automatically. Spend an afternoon running toplev.py -l3 on your hot path; the output is more diagnostic than any flamegraph for back-end issues.

The reorder buffer is not infinite — but it's bigger than you think

Skylake-X: ROB = 224, integer PRF = 180, FP PRF = 168, RS = 97, store buffer = 56, line fill buffers = 12. Zen 4: ROB = 320, PRF = 224 int + 192 FP. Apple M1 Firestorm cores: ROB = 630, the deepest in any shipping CPU. The trend over a decade has been "wider and deeper" — bigger ROBs, more PRF entries, more execution ports. The reason: memory has not got faster, so the only way to keep IPC up is to find more independent work to overlap with each load. The M1's 630-entry ROB is the canonical example — it can hide 5+ DRAM accesses in flight, which is why it's competitive with x86 cores at half the clock.

Why the front-end is asymmetric

Modern Intel cores can decode 4–5 simple x86 instructions per cycle in the legacy decoder, or fetch 6–8 µops/cycle from the µop cache (DSB), or fetch up to 8 µops/cycle from the loop stream detector (LSD) for tiny loops that fit in 56–64 µops. The front-end picks the highest-bandwidth source available. Why this matters: a hot loop that fits in the LSD runs 30% faster than the same loop that spills out of the µop cache, even with identical decoded µops. The cost is invisible from perf stat -e instructions — you have to look at idq.dsb_uops, idq.mite_uops, and idq.ms_uops to see which front-end source is feeding your loop. Code-layout tools (BOLT, Propeller, AutoFDO) optimise for this — placing the hottest 64 KB of code on contiguous pages so it fits the I-TLB and the DSB.

The Zerodha case: front-end-bound order matching

The Zerodha Kite cash-equity matching engine, written in C++, was running at IPC = 0.6 at the 10:00 IST market open in 2023. The hot path — match an incoming order against the price-time priority book — has 14 indirect branches per matched order (vtable dispatch into order-type handlers, side-of-book dispatch, fill-or-kill checks). BTB pressure was real: with 200,000 distinct order targets per session, the BTB's 4096 entries thrashed. The fix was devirtualisation — replace the runtime polymorphism with a switch over an enum tag — and PGO with a recorded morning workload as the training input. IPC went from 0.6 to 1.4, the matching engine's p99 latency dropped from 180 µs to 75 µs, and the broker hit its <100 µs SLO without buying new hardware. The full story is in Part 5; the point here is that front-end stalls are real, common, and ignored at most companies.

The same shape recurs across financial services: a Dream11 contest-payout engine in 2024, with a polymorphic rules-evaluator that grew from three contest types to thirty over two years, hit IPC = 0.5 during the T20 World Cup final's payout cascade. The same devirtualisation pattern — replace std::variant<...> visit dispatch with a switch on enum — bought IPC up to 1.3 and shaved 40% off the payout-completion latency. The lesson generalises beyond Indian fintech: any system where the set of code paths grows over time without the BTB scaling along with it will eventually become front-end-bound, and the fix is almost always either devirtualisation or PGO, not "more cores".

Speculation, security, and the in-order alternative

The OoO engine speculates aggressively — it executes µops past unresolved branches and loads, only committing when the branch resolves. Spectre (Kocher et al., 2018) and Meltdown (Lipp et al., 2018) showed that the microarchitectural side effects of mis-speculated execution (cache lines warmed, predictor entries trained) leak data across security boundaries. The mitigations — IBPB/IBRS/STIBP MSRs, retpolines, LFENCE on syscalls — cost real IPC, with mitigations=auto on Linux 5.x adding 5–15% slowdowns on syscall-heavy workloads. PhonePe measured a 9% throughput regression on UPI authorisation when their hosts were patched for Spectre v2 in 2018; the only way back was upgrading to Cascade Lake silicon with hardware mitigations.

The OoO engine is also not free in transistors or watts: ROB, RS, rename tables, and rollback machinery all cost. Apple's E-cluster cores and ARM Cortex-A55 are in-order 2-wide designs that retire at 1/4 the power of an OoO big core. Phones, laptops, and AWS Graviton C/T-series ship with big-LITTLE clusters so the OS can park bursty interactive work on OoO cores and background daemons on in-order ones. The Razorpay batch ledger reconciliation job runs 2× more cost-effective on Graviton T4g than on c6i at the same SLA — because the workload's IPC is only 0.4 either way and the in-order chip costs less per IPC. The same speculation that gave you OoO's headroom gave you a side channel and a power bill; in-order is still alive in 2026 silicon for good reasons.

Reproduce this on your laptop:

# Linux only — perf is a kernel-paired tool
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
# bench_ipc.py and run_ipc.py from above; no pip deps beyond stdlib
sudo perf stat -e cycles,instructions,branch-misses,L1-dcache-load-misses,LLC-load-misses \
    python3 bench_ipc.py linear
sudo perf stat --topdown python3 bench_ipc.py chase

You should see linear IPC > 1.5 and chase IPC < 0.3 on any modern x86 laptop. The exact numbers depend on your DRAM speed and L3 size; the story — 10× IPC gap — is universal.

Where this leads next

The next chapters in Part 1 take this skeleton and put muscle on it.

By the end of Part 1 you will read a perf stat --topdown output the way an SRE reads a flamegraph: as a diagnosis, not a list of numbers.

Part 2 (caches and memory) follows because back-end-memory-bound is the most common stall on real services. Part 3 (NUMA) follows because on multi-socket boxes the cache hierarchy gains a second axis. Part 4 (benchmarking without lying) is what stops you from chasing phantoms when frequency scaling or warmup hides the real number.

References

  1. Yasin, "A Top-Down Method for Performance Analysis and Counters Architecture" (ISPASS 2014) — the paper behind perf stat --topdown. The single most useful 10 pages a performance engineer can read.
  2. Intel® 64 and IA-32 Architectures Optimization Reference Manual — Volume 3, especially the "Pipeline" and "Top-down" appendices. Open it; bookmark the front-end / back-end stall tables.
  3. Agner Fog's microarchitecture manual — the canonical independent reference for x86 pipeline depths, port mappings, and µop-cache behaviour across every shipping CPU.
  4. Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed.) — chapters 3 and 4 on instruction-level parallelism and the OoO engine. The textbook foundation.
  5. Brendan Gregg, Systems Performance (2nd ed., 2020) — chapter 6 on CPUs; chapter 13 on perf. The handbook for the working engineer.
  6. Andi Kleen, pmu-tools and toplev.py — the open-source implementation of TMA on top of Linux perf. Run toplev.py -l3 ./your-binary and read the output.
  7. The Append-Only Log — cross-domain reference on the storage-side mirror of "feed the back-end": the log buys the same kind of independence between writes that the ROB buys between µops.