Branch prediction and why it matters

Aditi at Zerodha has a function that filters a sorted array of 100 million order prices and counts how many are above a threshold. On her laptop it runs in 380 ms. She copies the array, shuffles it (same numbers, different order), and runs the same function on the shuffled copy. It takes 1,640 ms — 4.3× slower. The instruction count is identical. The cache footprint is identical. The only thing that changed is whether the if (price > threshold) branch goes the same way many times in a row, or flips chaotically. The CPU was guessing, and on the sorted data it guessed right; on the shuffled data it guessed wrong, and every wrong guess cost her ~17 cycles of pipeline work that had to be thrown away.

That throw-away is the entire reason branch prediction exists, and the reason it matters more than almost anything else in your hot path. A 2026 Zen 4 core fetches up to 6 instructions per cycle, dispatches up to 6 µops, has a 320-entry reorder buffer, and runs at 4.5 GHz. Every one of those numbers is a fiction unless the front-end knows which instructions to fetch next — and the next instruction is conditional on a branch whose direction the CPU does not yet know. So the branch predictor guesses. When it guesses right (and on a typical workload it does, ~96–99% of the time), the pipeline stays full. When it guesses wrong, the entire chain of speculatively-fetched, decoded, renamed, dispatched µops behind the branch is flushed, the front-end re-steers to the correct target, and the pipeline refills from cycle zero. On a typical Skylake/Zen pipeline that re-steer costs 14–20 cycles of completely lost work — and at 4 GHz, those 17 cycles are 4.25 ns each, gone, with nothing retired.

Branch prediction lets the CPU fetch and execute past a conditional branch before it knows the branch's direction; on a correct prediction the work counts, on a misprediction it is flushed. Modern predictors (TAGE, perceptron, ITTAGE) achieve 96–99% accuracy on real workloads. Each misprediction costs 14–20 cycles of refill. The structure of your data — sorted vs random, periodic vs chaotic — is often a bigger lever on misprediction rate than the structure of your code.

The pipeline's problem and why guessing is the only answer

A pipelined CPU finishes one µop per cycle per port even though each µop takes 14+ cycles to traverse fetch → decode → rename → dispatch → execute → retire. The trick is to keep all 14+ stages busy with different µops at different stages of completion, so 14 are in flight and one finishes per cycle. This trick requires that the front-end know the address of the next instruction every cycle, because every cycle it fetches the next 32 bytes from L1i.

For straight-line code, the next instruction's address is pc + 4 (or pc + instruction_length on x86, where instructions are variable-length and decoders do their own work to figure out where one ends and the next begins). Trivial. The problem is conditional branches: je, jne, cmp+jl, the indirect call rax, the ret. The branch decides the next instruction's address, but the branch itself is an ALU operation that finishes only at the execute stage, ~10–14 cycles into the pipeline. If the front-end waited for the branch to resolve before fetching the next instruction, the pipeline would drain to zero between every branch — and real code has a branch every 5–6 instructions on average. IPC would collapse to ~0.4.

So the front-end does not wait. The branch predictor produces a single-cycle answer: "this branch will go here". The front-end fetches from that address. If the predictor is right, the speculatively-fetched µops were the right ones; they execute, retire, and count. If the predictor is wrong, those µops are squashed, the front-end is steered to the correct target, and ~14–20 cycles of pipeline refill cost is paid. Branch prediction is the bet that keeps the pipeline full; misprediction is the cost of a losing bet.

Pipeline bubble from a mispredicted branch — figure: a pipeline diagram with stages fetch, decode, rename, dispatch, execute, retire across the top and cycles 1 through 17 along the bottom. Dark cells show useful instructions flowing through the stages. At cycle 12 a branch in execute is found mispredicted; the speculative entries from cycles 8–11 are shown squashed (lighter), fetch re-steers at cycle 13, and the first correct µop reaches retire at cycle 17 — a ~5-cycle window in which retire was empty.
The branch resolves at execute (cycle 12) and is found wrong. Every speculative µop that entered the pipeline at cycles 8–11 is flushed; fetch re-steers and the pipeline refills. Illustrative — not measured data.

Why the cost is "pipeline depth", not "branch latency": the branch's ALU latency itself is one cycle. The cost is that fetch ran ahead of the branch by however many stages the front-end is from execute — typically 14–20 stages on modern x86. Every µop in those stages is speculative on the branch's outcome; on misprediction every one of them is squashed. Deeper pipelines hit higher peak clock speeds but pay more per misprediction. The Pentium 4's 31-stage pipeline (Prescott, 2004) was a famous over-correction in this direction; modern designs cap pipeline depth at ~14–20 partly because the misprediction cost from anything deeper exceeds the clock-speed gain.

The empirical regime is brutal. On a workload with 5% branch density and a 4% misprediction rate, ~0.2% of all instructions cause a 17-cycle bubble — 0.034 stall cycles, or 8.5 ps at 4 GHz, per instruction. Against a baseline of 0.4 cycles per instruction (IPC 2.5), that is ~8.5% of execution time burnt on misprediction alone. Drop the misprediction rate to 1% and the same workload runs ~6% faster. Push it to 10% and the stall term grows to ~21%; push both density and miss rate to the shuffled-filter regime — a branch every few instructions, ~50% miss — and the bubbles dominate the useful work, which is how sorted-vs-shuffled multiplies runtime several times over.
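That arithmetic packages neatly into a back-of-envelope model. This is a sketch: the 0.4 base CPI corresponds to the assumed IPC-2.5 baseline, and every input is illustrative, not measured.

```python
# Back-of-envelope misprediction cost model. base_cpi = 0.4 assumes an
# IPC-2.5 baseline at 4 GHz; all numbers are illustrative.
def mispredict_stall_ratio(branch_density, miss_rate, penalty=17, base_cpi=0.4):
    """Stall cycles from mispredicts, as a fraction of baseline cycles."""
    stall_cpi = branch_density * miss_rate * penalty
    return stall_cpi / base_cpi

print(f"{mispredict_stall_ratio(0.05, 0.04):.1%}")  # ~8.5% at a 4% miss rate
print(f"{mispredict_stall_ratio(0.05, 0.01):.1%}")  # ~2% at a 1% miss rate
print(f"{mispredict_stall_ratio(0.05, 0.10):.1%}")  # ~21% at a 10% miss rate
```

The model makes the lever visible: stall time is linear in both branch density and miss rate, so halving either halves the bubble tax.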

Inside the predictor: from 2-bit counters to TAGE

Branch predictors evolved from comically simple to startlingly clever over thirty years. The progression is worth walking because each step exposes a property of real branches that the predictor exploits.

Static prediction (1980s): branch backward = predict taken (it's probably a loop), branch forward = predict not-taken (it's probably an if for an error path). Accuracy on real code: ~65%. Used as a fallback when no dynamic state exists.

Bimodal predictor (Smith 1981): for each branch's PC, keep a 2-bit saturating counter: 00=strongly-not-taken, 01=weakly-not-taken, 10=weakly-taken, 11=strongly-taken. Each correct prediction saturates the counter further; each wrong one steps it toward the other side. A branch that's been taken 10× in a row needs 2 misses to flip to "not taken". Accuracy on a typical workload: ~88%.

Two-level adaptive prediction (Yeh & Patt 1991), refined into gshare/gselect (McFarling 1993): hash the global branch history (a bit per recent branch, taken=1, not-taken=0) with the branch's PC, and index a table of 2-bit counters with the result. Now the same branch can predict differently depending on what came before it. Accuracy: ~95%.
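Both schemes are small enough to simulate directly. A toy sketch — the table sizes, class shapes, and the alternating test pattern are illustrative, not any real core's design:

```python
# Toy simulation: per-PC 2-bit counters vs. gshare (PC XOR global history).
class Bimodal:
    """Per-PC 2-bit saturating counters, initialised weakly-not-taken."""
    def __init__(self, index_bits=10):
        self.table = [1] * (1 << index_bits)
        self.mask = (1 << index_bits) - 1

    def predict_update(self, pc, taken):
        i = pc & self.mask
        correct = (self.table[i] >= 2) == taken   # states 2,3 predict taken
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        return correct

class GShare:
    """Same counters, indexed by PC XOR global history: the same branch in a
    different context hits a different counter."""
    def __init__(self, index_bits=10):
        self.table = [1] * (1 << index_bits)
        self.mask = (1 << index_bits) - 1
        self.hist = 0

    def predict_update(self, pc, taken):
        i = (pc ^ self.hist) & self.mask
        correct = (self.table[i] >= 2) == taken
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.hist = ((self.hist << 1) | taken) & self.mask  # shift in outcome
        return correct

def accuracy(predictor, stream, pc=0x400):
    return sum(predictor.predict_update(pc, t) for t in stream) / len(stream)

# A strictly alternating branch defeats the counter but is trivial with history.
alternating = [i % 2 for i in range(10_000)]
print(f"bimodal: {accuracy(Bimodal(), alternating):.2f}")  # → bimodal: 0.50
print(f"gshare:  {accuracy(GShare(), alternating):.2f}")   # ~1.00 after warm-up
```

On the alternating stream the bimodal counter oscillates forever and lands at 50%, while gshare's history register separates the two contexts into two counters that each saturate, approaching 100% after a short warm-up.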

TAGE (Tagged Geometric history length, Seznec 2006): combine multiple predictors, each indexed by a different length of global history (e.g., 5, 13, 30, 71, 162 bits). A long history catches deep correlation patterns; a short history catches tight loops. The longest-history predictor that has a tagged entry for this branch wins. Accuracy on SPECint2006: ~97–99% — within fractions of a percent of the theoretical entropy limit. This is what every high-end x86, ARM, and Apple core ships in 2026.

Perceptron predictor (Jiménez & Lin 2001): a per-branch linear classifier whose weights are updated online. Used by AMD on Bulldozer and refined through Zen 4. Roughly equivalent accuracy to TAGE but with different hardware trade-offs. Some Zen designs use a hybrid TAGE + perceptron.

ITTAGE (Indirect-Target TAGE): TAGE for indirect-branch targets (which address does call rax go to?), not just direction. Critical for virtual function dispatch in C++/Java/Go interfaces.

The historical jump from bimodal to TAGE roughly tripled the work the OoO engine could keep usefully in flight. Correct predictions are, in that sense, the front-end's contribution to instruction-level parallelism: every correctly-predicted branch lets the front-end keep filling the ROB with µops that will retire; every mispredicted branch is a wasted opportunity to do so. The accuracy gain from bimodal (~88%) to TAGE (~98%) takes the predictor from causing one bubble per ~8 branches to one per ~50 — on a workload with branch density 1-in-5, roughly a 6× reduction in bubble frequency. That alone explains a large fraction of the IPC growth from late-1990s x86 cores to modern ones.

TAGE predictor: bank selection by history length — figure: five horizontal banks labelled with history lengths 0, 5, 13, 30, 71. T0 is the always-hit bimodal fallback (2-bit counter, no history); T1–T4 hold tagged entries (2-bit counter + 2-bit useful) indexed by a hash of the PC and progressively longer global history. In the illustrated lookup, T3 (history length 30) has the longest tag hit and its prediction ("taken") wins; T4 misses its tag check and is not consulted.
Five banks, four with tag-checked entries indexed by progressively longer global history. The longest bank that has a tagged entry for this PC wins; on a tag miss the predictor falls back to a shorter bank. The bimodal base (T0) always hits as a fallback. Illustrative — not measured data.

Why TAGE is so good: a branch's direction at a given moment in execution is correlated with the recent global path that led there — and different branches need different amounts of history to disambiguate. A for loop's exit branch is correlated with a 1–5 bit history (the last few iterations). A polymorphic dispatch in a parser is correlated with a 30–70 bit history (which token type sequence got us here). TAGE gives every branch the "right" amount of history by letting it claim a tagged entry in whichever bank's hash collides least. This is far more powerful than any single fixed-history predictor and within a percent or two of the information-theoretic optimum.
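The "right amount of history" point can be made concrete with an idealised history-indexed table — a sketch, with invented patterns and bit-widths, not a TAGE model:

```python
# Idealised predictor indexed only by the last `hist_bits` branch outcomes:
# it predicts whatever outcome followed this exact history last time.
def history_accuracy(stream, hist_bits):
    table, hist, hits = {}, 0, 0
    mask = (1 << hist_bits) - 1
    for taken in stream:
        hits += table.get(hist) == taken   # hit if this context repeats its outcome
        table[hist] = taken
        hist = ((hist << 1) | taken) & mask
    return hits / len(stream)

loop_exit = [1, 1, 0] * 20_000            # period-3 loop-exit branch
long_period = ([1] * 19 + [0]) * 3_000    # 19 taken, 1 not-taken: period 20

print(f"{history_accuracy(loop_exit, 4):.3f}")     # short history suffices
print(f"{history_accuracy(long_period, 4):.3f}")   # ~0.90: 4 bits can't see the period
print(f"{history_accuracy(long_period, 20):.3f}")  # long history disambiguates
```

Four bits of history nail the period-3 loop but leave the period-20 pattern ambiguous (the all-ones context precedes both outcomes, costing two misses per period); twenty bits make every context deterministic. TAGE's banks exist to give each branch whichever history length it actually needs.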

There are two more structures the front-end needs alongside the direction predictor. The branch target buffer (BTB) stores predicted target addresses for direct and indirect branches — a 4096-entry BTB on Skylake, ~16K on Zen 4. The return address stack (RAS) predicts the target of ret instructions by pushing, at each call, the address of the instruction after that call. The RAS is small (~32 entries) but cheap and accurate, and a corrupt RAS — common in deeply recursive code that spills past 32 frames — produces a flood of return mispredictions even when the direction predictor is doing fine.
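The RAS-overflow failure mode is easy to model. A toy sketch — the capacity and the lose-oldest-on-overflow policy follow the description above; real implementations differ in recovery details:

```python
class ReturnAddressStack:
    """Toy RAS: fixed capacity, oldest entry lost on overflow."""
    def __init__(self, size=32):
        self.size = size
        self.stack = []

    def push(self, ret_addr):
        if len(self.stack) == self.size:
            self.stack.pop(0)           # overflow: deepest frame's entry is lost
        self.stack.append(ret_addr)

    def pop_predict(self):
        return self.stack.pop() if self.stack else None

def simulate_recursion(depth, ras_size=32):
    ras = ReturnAddressStack(ras_size)
    for frame in range(depth):
        ras.push(frame)                 # each call pushes its return address
    missed = 0
    for frame in reversed(range(depth)):
        if ras.pop_predict() != frame:  # each ret consults the RAS
            missed += 1
    return missed

print(simulate_recursion(50))  # → 18: the deepest 18 returns mispredict
```

With a 32-entry RAS and 50-deep recursion, the top 32 returns predict perfectly and the deepest 18 find nothing useful — every one of them a misprediction.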

Measuring branch mispredictions in your code

perf stat -e branches,branch-misses is the headline. Run it and you get the rate. The Aditi sorted-vs-shuffled experiment from the lead is two perf stat invocations apart.

# bench_branch.py — measure misprediction rate on sorted vs shuffled data.
# The hot loop counts elements above a threshold. Same instructions either way;
# only the data order changes the predictor's accuracy.
import sys, time, random, array

N = 64 * 1024 * 1024  # 64 Mi ints, 8 bytes each: ~512 MB working set
THRESHOLD = N // 2

def build_sorted():
    return array.array('q', range(N))

def build_shuffled():
    a = list(range(N))
    random.Random(42).shuffle(a)
    return array.array('q', a)

def count_above(arr, t):
    """The hot loop. One branch per element. Predictor learns from history."""
    c = 0
    for x in arr:
        if x > t:
            c += 1
    return c

mode = sys.argv[1] if len(sys.argv) > 1 else "sorted"
arr = build_sorted() if mode == "sorted" else build_shuffled()

t0 = time.perf_counter_ns()
c = count_above(arr, THRESHOLD)
t1 = time.perf_counter_ns()
ms = (t1 - t0) / 1e6
ns_per = (t1 - t0) / N
print(f"{mode}  count={c}  ms={ms:.0f}  ns/elem={ns_per:.1f}")

Now wrap it with perf stat to see what the predictor actually did:

# run_branch.py — invoke perf stat, parse stderr, print misprediction rate.
import subprocess, re, sys

def run(mode):
    cmd = ["perf", "stat", "-x,",
           "-e", "cycles,instructions,branches,branch-misses",
           "python3", "bench_branch.py", mode]
    r = subprocess.run(cmd, capture_output=True, text=True)
    print(r.stdout, end="")
    # perf stat -x, output: value,unit,event,...  per line on stderr
    counters = {}
    for line in r.stderr.splitlines():
        parts = line.split(",")
        if len(parts) >= 3 and parts[0].replace(".", "").isdigit():
            counters[parts[2]] = int(parts[0].replace(".", ""))
    if "branches" in counters and "branch-misses" in counters:
        rate = 100.0 * counters["branch-misses"] / counters["branches"]
        ipc = counters["instructions"] / counters["cycles"]
        print(f"  ipc={ipc:.2f}  br-miss-rate={rate:.2f}%")

run("sorted")
run("shuffled")

Sample output on a c6i.4xlarge (Ice Lake-SP, Linux 6.1) — illustrative numbers:

sorted    count=33554431  ms=380   ns/elem=5.7
  ipc=2.34  br-miss-rate=0.04%
shuffled  count=33554431  ms=1640  ns/elem=24.4
  ipc=0.55  br-miss-rate=49.31%

The numbers are stark. Sorted: the predictor sees the branch go not-taken for the first 32M iterations and taken for the last 32M; after a brief learning period it gets every prediction right except at the transition, hence 0.04% misprediction. Shuffled: the data is uncorrelated with the index, the predictor has nothing to latch onto, and it lands at roughly 50% — the miss rate of guessing an unbiased coin. IPC drops 4×, runtime rises 4.3×.


The classic next experiment, famous from Stack Overflow's most-voted question, is to replace the if with a branchless equivalent — c += int(x > t) — and re-run on shuffled data. The branchless version compiles to a compare plus setg/add sequence on x86 and cmp + cset + add on ARM: no branch, nothing to mispredict. On shuffled data it runs at roughly the sorted speed, vindicating the diagnosis. (CPython's bytecode doesn't quite expose this; you'd need numpy's (arr > t).sum() or a C kernel to see the full effect.)
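In pure-Python terms the two forms look like this — a sketch with invented data, and, as noted, interpreter overhead masks most of the hardware effect in CPython:

```python
import random

def count_branchy(arr, t):
    c = 0
    for x in arr:
        if x > t:          # data-dependent conditional branch per element
            c += 1
    return c

def count_branchless(arr, t):
    c = 0
    for x in arr:
        c += x > t         # bool coerces to 0/1: no if on the data
    return c

rng = random.Random(7)
data = [rng.randrange(1000) for _ in range(10_000)]
assert count_branchy(data, 500) == count_branchless(data, 500)
```

Same result either way; the difference only matters once the comparison compiles down to a real conditional branch, which is why the full effect needs a compiled kernel.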

Where mispredictions actually live in production code

A handful of patterns produce most of the mispredictions a typical service hits in production. Recognising them in your own flamegraphs and perf stat outputs is more useful than memorising the predictor algorithm — predictors evolve every microarchitecture generation, but the workload patterns that defeat them are remarkably stable across decades.

Polymorphic dispatch. Animal::speak() called on a vector of Cat, Dog, Cow, Cat, Cow, Dog... compiles to an indirect call through a vtable. The ITTAGE predictor learns the call site's history, but with three or four interleaved targets the BTB churns and the misprediction rate climbs to 5–15%. Razorpay's payment-router had this exact shape: each transaction was routed through a chain of Rule::apply() calls on a polymorphic IRule interface, and ITTAGE could not keep up with 30+ rule types interleaved in production traffic. Replacing the polymorphic dispatch with a switch on a small integer tag dropped the call-site misprediction rate from 9% to 0.4% and the routing service's IPC went from 0.8 to 1.5.

Data-dependent branches on uncorrelated input. Aditi's filter on shuffled data is the canonical example. The fix is either (a) sort the data so the predictor latches on, (b) make the operation branchless via SIMD or cmov, or (c) accept the cost. SIMD-accelerated filtering (numpy, Polars, DuckDB) avoids the branch entirely.

Indirect branches in interpreters. CPython, V8, the JVM, .NET CLR — every interpreter dispatches on the next bytecode via an indirect branch. The "computed goto" pattern (goto *dispatch_table[opcode], using GCC's &&label address-of-label extension) and threaded-code interpreters reduce branch-prediction pressure by giving each opcode its own dispatch site, so the predictor can specialise per-opcode. CPython 3.11's specialising adaptive interpreter, PyPy's tracing JIT, and V8's Ignition + TurboFan are all, in part, branch-prediction-aware optimisations.

switch on enum with many cases. If 60% of traffic hits one case, the predictor learns that case as the default and only mispredicts on the 40%. If traffic is evenly distributed across 8 cases, the predictor cannot win — misprediction rate climbs to ~80% on the dispatch. Rewriting an 8-way switch as two binary checks (if (case in {A,B,C,D})...) sometimes helps if the binary split aligns with the actual traffic distribution.

Recursion past the RAS depth. Calling 50 levels deep on a CPU with a 32-entry RAS means the bottom 18 returns will mispredict. Tail-call optimisation, iterative rewrites, or splitting recursive descent parsers into explicit-stack form removes this hazard.
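The explicit-stack rewrite looks like this for a toy tree-depth computation — a hypothetical example where nodes are (left, right) tuples, not code from any of the systems above:

```python
def depth_recursive(node):
    # Recursive form: call depth tracks tree depth, so past ~32 frames the
    # RAS can no longer predict the returns.
    if node is None:
        return 0
    left, right = node
    return 1 + max(depth_recursive(left), depth_recursive(right))

def depth_iterative(root):
    # Explicit-stack form: same traversal, no call/ret pairs in the hot loop.
    best, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        if node is None:
            continue
        best = max(best, d)
        left, right = node
        stack.append((left, d + 1))
        stack.append((right, d + 1))
    return best

# A 100-deep left spine: deep enough to overflow a 32-entry RAS.
tree = None
for _ in range(100):
    tree = (tree, None)
print(depth_recursive(tree), depth_iterative(tree))  # → 100 100
```

The iterative form trades call/ret pairs (RAS pressure) for loads and stores on an explicit list, which the hardware handles without any return prediction at all.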

A 2024 Hotstar IPL-streaming incident showed how subtle this can get. The video-segmenter service had an inner loop that classified each video frame as keyframe / P-frame / B-frame using a switch on the frame's H.264 NAL type. Under steady-state encoding of cricket footage, frame-type sequences were highly periodic (1 keyframe per 60 frames, then 14 P-frames, then 45 B-frames — a fixed cadence), and the predictor learned the period exactly. During a sudden scene-change (a 6 hitting the camera), the encoder emitted an unscheduled keyframe and the next 30 frames had irregular type interleaving. branch-misses spiked from 0.5% to 8.3% for the duration; the segmenter fell behind by ~80 ms; the streaming pipeline buffered, viewers experienced a brief stall. The fix was to replace the switch with a lookup table — processor[frame_type](frame) indexed by the NAL byte directly — which compiles to a single indirect call whose target the BTB caches per frame-type. Misprediction-rate-during-scene-change dropped to 1.2%, and the segmenter no longer stuttered on big shots.
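The lookup-table dispatch shape, sketched — the frame-type codes and handler names here are invented for illustration, not Hotstar's code:

```python
# Handlers for each frame class (hypothetical; real handlers do encoding work).
def on_keyframe(frame):
    return ("key", frame)

def on_p_frame(frame):
    return ("p", frame)

def on_b_frame(frame):
    return ("b", frame)

# Table indexed by a (hypothetical) frame-type code. Dispatch is one indirect
# call whose target the BTB caches per type, instead of a multi-way branch
# the predictor must re-learn whenever the frame cadence changes.
PROCESSOR = {
    0: on_keyframe,
    1: on_p_frame,
    2: on_b_frame,
}

def process(frame_type, frame):
    return PROCESSOR[frame_type](frame)

print(process(0, "frame-42"))  # → ('key', 'frame-42')
```

The table converts a data-dependent multi-way branch into a data-dependent load plus one indirect call — still predictable state, but far less of it to re-learn when the input mix shifts.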

A subtler Zerodha example, the one Aditi found by accident: their order-validation pipeline ran a chain of if checks — if order.symbol_valid && order.qty > 0 && order.price_in_band && order.margin_ok && !order.user_blocked. Under normal traffic 99.7% of orders pass all five checks. The compiler short-circuits the chain, but the predictor sees five branches per order and learns each one as "almost always taken". Misprediction rate: 0.6%. During the 09:15 IST market open's pre-open auction, the symbol-valid check started failing 12% of the time (clients submitting orders for symbols not yet activated). The first-check misprediction rate jumped to 22%; the subsequent checks, which never even ran when the first failed, didn't fire — but the predictor's BTB now had stale state for the second-through-fifth check sites and re-warmed them with high misprediction during the next minute as traffic normalised. IPC oscillated between 1.7 and 0.6 in 15-second windows for several minutes after market-open every day. The fix was to validate order.symbol_valid in a separate, earlier dispatch lane — before the main pipeline — so the rare-failure case never polluted the always-pass main pipeline's predictor state. Misprediction stabilised at 0.4% across the entire morning window, and the order-acceptance latency p99 fell from 2.4 ms to 0.9 ms during the first five minutes of trading.

What the misprediction cost actually buys (and what it doesn't)

It is worth being precise about what the pipeline-flush actually costs, because under-counting and over-counting are both common.

The misprediction cost is the pipeline depth in cycles, multiplied by the front-end's effective issue width during refill. On a 4-wide front-end with 18-stage pipeline, a single misprediction discards roughly 4 × 18 = 72 µop-slots of speculative work, of which typically only 30–40% would have retired anyway (the rest were waiting on operands or memory). So the effective cost of one misprediction is roughly 25–30 µops worth of useful work, paid as 14–20 wall-clock cycles of empty retirement.

What the misprediction does not cost is anything that the speculative path did finish — speculative loads that have already brought a cache line into L1 are not undone (which is why Spectre works). Speculative stores that have not yet committed from the store buffer are squashed cleanly. Speculative ALU operations that wrote to physical-rename registers leave those registers dead; the rename-recovery process during a flush walks the register-rename map back to the branch's snapshot in 1–2 cycles. The pipeline-refill latency is the dominant cost; the rename-recovery is comparatively cheap.

This decomposition matters when you read flamegraphs. A function that spends 12% of cycles "in" itself but with 40% branch-misprediction rate is paying its bill mostly to the front-end, and the fix is structural (sort, branchless, dispatch table). A function that spends 12% of cycles in itself with 1% misprediction is paying its bill to back-end memory, and the fix is in the data-layout chapters of Part 2. The same flamegraph row, two different problems, two different fix shapes — only the perf events distinguish them. Why this distinction is the entire diagnostic premise: a flamegraph alone tells you where time is spent, not why. The why comes from per-function PMU counters — branch-misses, cache-misses, resource_stalls.* attributed by sample. perf record -e branch-misses followed by perf report is the second-pass investigation that turns a flamegraph from "this function is slow" into "this function is slow because of this microarchitectural bottleneck", which is the only kind of finding that yields a fix.

Common confusions

Going deeper

Reading top-down to confirm front-end-bound

A branch-misprediction problem shows up under Bad Speculation in the top-down hierarchy — a sibling of Frontend Bound, which covers the fetch-side stalls (I-cache misses, µop-cache evictions). perf stat --topdown -l3 separates them. The relevant slice of the hierarchy:

BadSpeculation  → BranchMispredicts
                → MachineClears

If BadSpeculation.BranchMispredict > 10%, your hot path's predictor is losing — spend the engineering on either (a) making branches more predictable (sort data, peel off cold cases), or (b) removing them (SIMD, branchless). If it's MachineClears instead, the cause is memory-ordering or self-modifying-code clears (see the previous chapter on the ROB) — different fix entirely. Always check --topdown before deciding which knob to turn.

The Spectre family and what mitigations cost

Spectre v1 trains the direction predictor with a long sequence of in-bounds accesses, then makes one out-of-bounds access; the predictor speculatively follows the in-bounds pattern, executes the would-be-rejected load, and leaks the result through cache timing. Spectre v2 trains the BTB to redirect an indirect call to attacker-chosen code. Mitigations include (a) lfence after bounds checks (serialises speculation), (b) retpolines that replace indirect calls with a ret-trampoline that the RAS handles safely, and (c) IBPB / IBRS MSR writes that flush predictor state on context switch. PhonePe measured a 9% throughput regression on UPI authorisation when their hosts were patched in 2018; Cascade Lake (2019) added hardware mitigations that recovered most of the loss. Why mitigations hurt IPC: every retpoline forces a serialised ret sequence (~30–50 cycles of drained pipeline); every lfence after a bounds check is a serialising barrier; every IBPB on syscall flushes the predictor's learned state, costing tens of mispredictions on the syscall return path. The cumulative tax is real and durable on pre-2019 hardware.

Indirect branches and the BTB capacity wall

The BTB has finite capacity (4K–16K entries on x86, larger on Apple). A code base with 50,000 distinct call sites (large C++ binary, JIT-compiled JVM with many compiled methods) will overflow the BTB — call sites get evicted, then re-encountered, miss in the BTB, and pay a default-prediction penalty. Profile-guided optimisation (PGO) and post-link optimisers (Propeller, BOLT) exist partly to cluster hot call sites in instruction-address space so they index into the same BTB lines, reducing overflow pressure. Facebook's BOLT reportedly gives 5–10% IPC gains on HHVM and the like by improving BTB locality alone.

Apple M1, traffic-mix shocks, and cold-start tax

Apple's M1 / M2 / M3 cores ship a TAGE-class predictor with unusually long history (~1000+ bits in some analyses) and a BTB in the 16K+ range. Combined with the wide ROB (~630 entries), the M1 keeps far more speculative work in flight per branch than any contemporary x86. This is part of why Apple's IPC on Objective-C and Swift code (heavy on dynamic dispatch, polymorphic calls) outpaces x86 cores at lower clock speed: the predictor wins more often, and when it wins more often, the wider OoO engine stays full. The architectural choice to spend transistor budget on predictor accuracy rather than clock speed is the M1's defining design move.

The same logic shows up in the inverse during traffic-mix shocks. Flipkart's catalogue search service, October 2024 sale: peak QPS 14× normal, p99 SLO 200 ms. Two days into the sale, p99 drifted to 280 ms, only on a subset of hosts. perf stat --topdown on a struggling host reported BadSpeculation = 22% — three times the baseline of 7%. Drilling deeper, frontend_retired.indirect_call_misp showed indirect-call misprediction concentrated in the Lucene-style query rewriter, which dispatched on a QueryNode interface with seven concrete subtypes. Under normal traffic, queries were dominated by two subtypes (TermQuery and BooleanQuery), and the BTB happily cached those two targets per call site. During the sale, the query mix shifted — flash-sale promotional queries used PhraseQuery and SpanNearQuery much more, and the call sites' BTB entries thrashed across four subtypes instead of two. The fix was a tag-dispatched fast path for the top three QueryNode subtypes (a 3-arm switch on a small integer tag, no virtual call), with a fallback to the polymorphic dispatch for the rare cases. Indirect-call misprediction rate fell from 11% to 1.3% on the rewriter's hot loop; p99 returned to 180 ms; the on-call team reverted a costly horizontal scale-out they'd done the previous night.

A related production pattern is the cold-start tax. Every fresh process starts with empty predictor state. The first few thousand iterations of any new hot loop run with predictor accuracy closer to the static-fallback ~65% than to TAGE's 99%; the predictor is filling its tagged banks with usable entries. CRED's reward-engine team measured this as a 1.4× p99 cliff in the first 30 seconds after a fresh JVM came online during a rolling deploy. Mitigation patterns: synthetic warm-up traffic before a host joins the load balancer; using -XX:CompileThreshold and tiered JIT to keep predictor-friendly compiled code resident; pinning long-lived processes so the OS scheduler does not migrate them off-core (which empties the per-core BTB).

Branchless rewrites and the predictor-as-cache mental model

The textbook example of a branchless rewrite is c += int(x > t) instead of if (x > t) c++;. On a predictable branch (sorted data, >99% one-way), the branchy version costs ~1 cycle per element on average — the predictor is right almost every time, no flush. The branchless version costs ~1.2 cycles per element regardless of data — slightly slower. On an unpredictable branch (random data, ~50% misprediction), the branchy version costs ~17 cycles × 50% = ~8.5 cycles per element. The branchless version still costs ~1.2 cycles. The branchless rewrite is a 7× win on the unpredictable case and a 20% loss on the predictable case. The decision criterion is "what is the misprediction rate on production traffic?" — measured, not guessed. SIMD takes this further: AVX-512 processes 8–16 elements at once, all branchless, and gets 50–100× throughput over a branchy scalar loop on unpredictable filters. Polars and DuckDB's filter operators are SIMD-branchless; this is a large fraction of why they outpace Pandas on real analytical queries.
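The decision criterion can be written down. Using the illustrative per-element costs above (~1 cycle branchy baseline, ~17-cycle penalty, ~1.2 cycles branchless — assumed numbers, not measurements):

```python
# Branchy vs. branchless per-element cost, with the illustrative numbers above.
def branchy_cost(miss_rate, base=1.0, penalty=17.0):
    """Expected cycles per element for the branchy form."""
    return base + miss_rate * penalty

BRANCHLESS_COST = 1.2  # cycles per element, data-independent

def break_even_miss_rate(base=1.0, penalty=17.0, branchless=BRANCHLESS_COST):
    # Solve base + m * penalty == branchless for m.
    return (branchless - base) / penalty

print(f"{break_even_miss_rate():.2%}")  # → 1.18%: above this, go branchless
```

Under these assumptions the crossover sits near a ~1% misprediction rate: below it the predictor is cheaper, above it the branchless form wins, and at 50% miss the branchy form costs ~8× more per element.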

It often helps to think of every branch-predictor structure as a cache. The bimodal table is a cache of "which way did this PC go last time". The TAGE banks are a cache of "which way does this PC go given the last K branches' outcomes". The BTB is a cache of "where did this PC's branch go to last". The RAS is a LIFO cache of call-push targets. Like every cache, each has finite capacity, an eviction policy, and a warm-up period after a workload shift. When you see misprediction rate climb during a workload-mix change, you are watching the predictor's cache get cold-miss-stormed in real time. The fix shapes follow the cache analogy: increase associativity (larger BTB, longer history bank — a microarchitectural choice you don't get to make), reduce working set (cluster hot call sites, fewer dispatch types — a code-layout choice you do get to make), or pre-warm explicitly (synthetic load before cutover — an operational choice).

Reproduce this on your laptop

sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
# bench_branch.py and run_branch.py as above; only stdlib used
sudo perf stat -e cycles,instructions,branches,branch-misses \
    python3 bench_branch.py sorted
sudo perf stat -e cycles,instructions,branches,branch-misses \
    python3 bench_branch.py shuffled
sudo perf stat --topdown -l3 python3 bench_branch.py shuffled

Sorted should show <0.1% misprediction and IPC > 2; shuffled should show ~50% misprediction and IPC < 0.7. The exact numbers depend on your microarchitecture (Skylake, Ice Lake, Zen 3/4, Apple M-series via Linux on Asahi) but the gap between the two is the predictor's contribution.

Where this leads next

Branch prediction is the front-end's bet. The next chapters fill in the cost model and the diagnostic ladder.

Part 2 (caches and memory) and Part 5 (CPU profiling) build directly on this. The flamegraph reader who learns to spot a "tall narrow stack" will recognise it as a misprediction bottleneck only if they know what the predictor does and what its bubble looks like in perf record's sampling output. The capacity-planning chapter (Part 14) folds branch-prediction effects into the IPC term of its scaling model.

One concrete connection back to the previous chapter: a misprediction is the only way the ROB's tail-end speculative work gets discarded wholesale. The ROB-flush protocol on misprediction is the same one that fires on a memory-ordering machine clear or an SMC clear — the chapters before this one explained the mechanism; this chapter explains the most common reason it fires. When you read cycles ≈ 4 × instructions in a perf stat output, the question to ask is which of {misprediction, MOMC, SMC, divider, L3 miss} is filling the ROB with squashed work. The events that distinguish them all live within five lines of perf list's output.

A practical takeaway: the four perf events cycles, instructions, branches, branch-misses together give you the first-order branch diagnosis in fifteen seconds. If instructions/cycles < 1.5 and branch-misses/branches > 3%, your hot path is bleeding cycles on the front-end — look for polymorphic dispatch, unpredictable data branches, or a switch over too many cases. Add perf stat --topdown -l3 and you'll know whether the lost cycles are branch mispredicts (Bad Speculation) or I-cache / µop-cache stalls (Frontend Bound). Three numbers, fifteen seconds, the right diagnosis. The same hotkey discipline that worked for the ROB chapter pays off here.
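The fifteen-second diagnosis reduces to a few lines — the thresholds are this chapter's rules of thumb, not universal constants:

```python
def first_order_diagnosis(cycles, instructions, branches, branch_misses):
    """Classify a hot path from the four headline perf counters."""
    ipc = instructions / cycles
    miss_rate = branch_misses / branches
    # Rule of thumb from the text: low IPC plus >3% branch-miss rate
    # points at the front-end (mispredicts) rather than back-end memory.
    frontend_suspect = ipc < 1.5 and miss_rate > 0.03
    return ipc, miss_rate, frontend_suspect

# Shuffled-filter shape: low IPC, huge miss rate -> front-end suspect.
print(first_order_diagnosis(1000, 550, 180, 90))
# Healthy shape: high IPC, sub-1% miss rate -> look elsewhere.
print(first_order_diagnosis(1000, 2340, 180, 1))
```

Feed it the raw counts from perf stat and it answers the first routing question: front-end (predictor) problem or not.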

The deeper move, once you've internalised the predictor: design data layouts and dispatch patterns for the predictor. Sort on the predicate column when you'll filter it many times. Use a tag-based dispatch (small integer enum + switch) when traffic is biased; use a function pointer table when traffic is uniform across many cases. Cluster hot functions in the binary so their BTB entries don't churn. These are not micro-optimisations; on a service running 24×7 at Indian-scale throughput, a 5% IPC gain on the hot path is the difference between provisioning 100 instances and 95 — and at ₹15K/instance/month on a c6i.4xlarge reserved instance, that's ₹9 lakh/year saved on a single optimisation.

The branch predictor is also a structure whose effects compound across the stack. A predictor-friendly hot loop runs faster, which means it holds the L1i cache more tightly (less code footprint per unit of work), which means the µop cache hit rate climbs, which means the front-end delivers more µops per cycle, which means the back-end stays fuller — every layer of the front-end benefits from the predictor's wins, and the IPC gain you measure with perf stat is the integrated effect of all of them. Conversely, a predictor-hostile loop doesn't just lose on misprediction; it churns the BTB, evicts µop-cache entries, sometimes evicts L1i entries on the recovery path, and the cascading losses are why a 10% misprediction rate often produces a 30% IPC hit rather than a 17% one. Understanding the predictor is, in this sense, the single best lever for understanding why front-end performance behaves non-linearly with the workload's complexity.

A closing thought before the references: the branch predictor is the place in the microarchitecture where the shape of your data is most visible. Sorted vs shuffled, biased vs uniform, periodic vs chaotic — all of those distributional properties get folded into a single number, the misprediction rate, which then sets the IPC floor of your hot path. Almost no other CPU structure is so directly sensitive to data distribution. That is what makes branch prediction worth understanding deeply: it is the bridge between "the data my service processes" and "the cycles my service costs", and the bridge is short enough that you can usually see across it once you know what to look for.

References

  1. Seznec & Michaud, "A case for (partially) TAgged GEometric history length branch prediction" (JILP 2006) — the TAGE paper. Reads cleanly even without a microarchitecture background.
  2. Yeh & Patt, "Two-Level Adaptive Training Branch Prediction" (MICRO 1991) — the foundational paper that introduced global-history-indexed prediction.
  3. Jiménez & Lin, "Dynamic Branch Prediction with Perceptrons" (HPCA 2001) — the perceptron predictor AMD shipped variants of in Bulldozer through Zen.
  4. Agner Fog, "The microarchitecture of Intel, AMD and VIA CPUs" — chapter 3 on branch prediction, with specific BTB/RAS sizes and predictor types per CPU generation.
  5. Kocher et al., "Spectre Attacks: Exploiting Speculative Execution" (S&P 2019) — the security-side companion to this chapter.
  6. Yasin, "A Top-Down Method for Performance Analysis and Counters Architecture" (ISPASS 2014) — the methodology behind --topdown -l3.
  7. Brendan Gregg, Systems Performance (2nd ed., 2020) — chapter 6 on CPUs, branch-prediction events and how to read them.
  8. Out-of-order execution and reorder buffers — chapter 2 of this curriculum. The predictor's correct guesses fill the ROB; its wrong guesses flush it.