Performance counters (PMUs) and what to measure

Asha at Zerodha is staring at a flamegraph that says nothing useful. The order-matching engine's hot function match_order shows up at 41% of CPU time, but the function is 18 lines of straight-line code with no obvious cost. She runs perf stat. IPC = 0.7. She runs it again with -e cache-misses. 4.2 million cache misses per second. She runs it again with -e branch-misses. Negligible. She runs it again with -e cycle_activity.stalls_l3_miss. 38% of cycles. The function is waiting on memory, not computing. None of this was visible from the source code. None of it was visible from the flamegraph alone. All of it came from a small piece of silicon called the Performance Monitoring Unit that has been counting events the whole time, sitting next to the actual core, costing nothing to run.

A PMU is a set of hardware counters built into every modern CPU core that count microarchitectural events — cycles, instructions retired, cache misses, branch mispredictions, µops dispatched — at full speed with near-zero overhead. The events you ask for are the lens you use to diagnose: cycles + instructions gives you IPC; cache events tell you where memory is hurting; the Top-Down hierarchy tells you which of its four categories (Retiring, Bad Speculation, Frontend Bound, Backend Bound) is rate-limiting. The hard part is not running perf stat — it is knowing which events to combine, and reading the resulting numbers without lying to yourself.

What a PMU actually is, and why it costs nothing

Every modern x86 core has a small block of silicon called the Performance Monitoring Unit. On recent Intel client cores (12th-gen Alder Lake and later) a P-core exposes 4 fixed-purpose counters and 8 programmable counters; recent Intel server parts (Ice Lake-SP, Sapphire Rapids, Emerald Rapids) expose the same 4 fixed and 8 programmable counters per core. AMD Zen 4 and Zen 5 expose 6 programmable counters per core. Each counter is a 48-bit register that increments every time a specific microarchitectural event happens — a cycle ticks, an instruction retires, a load misses L1, a branch mispredicts.
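That 48-bit width sets the wraparound horizon for long counting runs. A quick back-of-envelope, assuming a hypothetical 5 GHz core clock:

```python
# How long until a 48-bit cycle counter wraps? (illustrative arithmetic)
COUNTER_BITS = 48
CLOCK_HZ = 5_000_000_000  # assumed 5 GHz core clock

wrap_events = 2 ** COUNTER_BITS        # 281,474,976,710,656 events
wrap_seconds = wrap_events / CLOCK_HZ  # ~56,295 s
print(f"wraps after {wrap_seconds / 3600:.1f} hours")  # wraps after 15.6 hours
```

The kernel's perf_event layer accumulates into 64-bit software counters across overflows, so perf stat users never see the wrap; it matters only if you read the raw counter via rdpmc yourself.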

The counters live next to the execution pipeline, not in software. Incrementing them is free in the sense that matters: the core does not slow down because a counter is enabled. The reason is that the events the counter watches are happening anyway — the cycle counter is just a 48-bit version of the same clock that drives the pipeline; the L1-miss counter is wired into the L1 cache controller's miss-handling logic. The counter is a tap, not a probe.

[Figure: PMU sits beside the core, tapping existing event signals. The core's pipeline stages (front-end, back-end, retire) and the L1d / L1i / L2, branch predictor, and TLB each emit event signals, wired into the PMU's mux. The PMU contains 4 fixed counters (cycles, instructions, ref-cycles, slots) and 8–12 programmable counters (L1-miss, branch-miss, uops_issued, cycle_activity, and 200+ more). The PMU selects which signals feed which counters; the counters increment in hardware at full speed, and userspace reads them via rdpmc or the perf_event syscall.]
The PMU is a tap into existing pipeline event signals. Counters increment at hardware speed; the cost is in reading the counter, not in incrementing it. Illustrative — not measured data.

The fixed counters always count the same thing — they cannot be repurposed. On Intel they are: core cycles (the unhalted clock), instructions retired (inst_retired.any), reference cycles (counted at the base frequency, immune to turbo and frequency scaling), and, on newer cores, issue slots — the denominator the Top-Down methodology needs.

The programmable counters can be set to any one of ~200 documented events from a per-microarchitecture table. You program them by writing a configuration register (an MSR — Model-Specific Register) that selects the event, then reading the counter (another MSR, readable from userspace via the rdpmc instruction). The Linux kernel exposes this through perf_event_open(2); everything you run with perf stat, perf record, perf top, bpftrace, BCC's BPF_PERF_OUTPUT, py-spy, scalene, and pprof ultimately bottoms out in that one syscall.

Why the count of programmable counters matters: if you ask perf stat for 12 events on a CPU with 8 programmable counters, the kernel time-multiplexes — running 8 events for ~50 ms, then swapping to the other 4 for ~50 ms, scaling each count up by the run-fraction. The scaled numbers are estimates, not exact counts. On a steady-state workload the estimates are close. On a bursty workload (a 200-microsecond hot path that runs once and never again) the multiplexing can miss the event entirely on half the counters. perf stat prints the fraction of time each event was actually on a counter (the trailing percentage on each line); below ~50% the number is unreliable.
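The scaling the kernel applies is simple enough to sketch — a hypothetical event that sat on a counter for half the run:

```python
# Multiplex scaling: when an event occupied a counter for only part of the
# run, perf scales the raw count up by the inverse of the run-fraction.
def scale_multiplexed(raw_count: int, time_enabled_ns: int,
                      time_running_ns: int) -> float:
    """Estimate the full-run count from a time-multiplexed counter.
    Mirrors the computation userspace does from perf_event read() values."""
    if time_running_ns == 0:
        return 0.0
    return raw_count * (time_enabled_ns / time_running_ns)

# Event was on a counter for 500 ms of a 1 s run and counted 2.1M events:
est = scale_multiplexed(2_100_000, 1_000_000_000, 500_000_000)
print(f"estimated full-run count: {est:,.0f}")  # estimated full-run count: 4,200,000
```

The estimate assumes the event rate during the on-counter window is representative of the whole run — exactly the assumption a bursty workload violates.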

The four numbers every diagnosis starts with

You can spend a career learning the 200+ events your CPU exposes. You start with four. These four, in this order, give you the first cut:

  1. cycles — wall clock for the CPU, measured at the core's actual running frequency.
  2. instructions — instructions retired (architectural — inst_retired.any).
  3. branch-misses — branches that the predictor got wrong, costing 15–20 cycles each.
  4. cache-misses — last-level cache misses, costing ~200 cycles each (DRAM latency).
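A sketch of the derived metrics these four raw counts unlock, using made-up counter values (MPKI, misses per kilo-instruction, is a common normalisation you compute yourself; perf does not print it):

```python
# Derived metrics from the four headline counters (illustrative numbers).
counts = {
    "cycles":        12_000_000_000,
    "instructions":   8_400_000_000,
    "branch-misses":     42_000_000,
    "cache-misses":       9_000_000,
}

ipc = counts["instructions"] / counts["cycles"]
# MPKI normalises miss counts by work done, making runs of
# different lengths directly comparable.
branch_mpki = 1000 * counts["branch-misses"] / counts["instructions"]
llc_mpki = 1000 * counts["cache-misses"] / counts["instructions"]

print(f"IPC={ipc:.2f}  branch-MPKI={branch_mpki:.2f}  LLC-MPKI={llc_mpki:.2f}")
# IPC=0.70  branch-MPKI=5.00  LLC-MPKI=1.07
```

Note that the same miss count reads very differently at different instruction counts — which is the denominator lesson this chapter keeps returning to.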

The first derived metric — IPC (instructions per cycle) — is instructions / cycles. A modern x86 core can issue up to 4 (Skylake), 5 (Sunny Cove), or 6 (Golden Cove and later) µops per cycle, and because some instructions decode to more than one µop, the architectural-instruction rate typically caps below the µop rate. Numbers to anchor against: IPC below ~1.0 usually means the core is stalling, most often on memory; 1–2.5 is typical mixed application code; above 3 is a well-fed compute loop.

For Asha's order matcher at IPC 0.7, the next questions are: how much of the missing throughput is lost to memory? How much to the front-end? How much to branch mispredictions? The four-number cut is the start; the real diagnosis lives one level deeper.

The Top-Down methodology, the one diagnosis that scales

Yasin's 2014 ISPASS paper — productised by Intel as perf stat --topdown — is the closest thing systems performance has to a universal first-cut diagnostic. The methodology classifies every issue slot — the renamer's per-cycle 4 / 5 / 6 issue capacity — into one of four categories:

                                  +---------------+
                                  |  Issue slot   |
                                  +-------+-------+
                                          |
                  +-----------------------+-----------------------+
                  |                       |                       |
            +-----v-----+         +-------v-------+      +--------v-------+
            | Retiring  |         | Bad           |      |  Stalled       |
            | (good!)   |         | Speculation   |      |                |
            +-----------+         +---------------+      +-------+--------+
                                                                 |
                                                  +--------------+--------------+
                                                  |                             |
                                          +-------v--------+            +-------v---------+
                                          | Frontend Bound |            | Backend Bound   |
                                          +----------------+            +-----------------+

Each slot, every cycle, lands in exactly one bucket. Retiring means useful work — the µop allocated in this slot eventually retired, and the work counts. Bad Speculation means the slot was used by a µop that got squashed (branch misprediction, memory ordering machine clear). Frontend Bound means the slot was empty because the front-end could not deliver a µop this cycle (decode stall, i-cache miss, branch resteer). Backend Bound means the slot was empty because the back-end could not accept a µop this cycle (no free port, ROB full, memory stall).
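The level-1 partition can be computed by hand from a handful of raw events. A sketch using the Skylake-era formulation from Yasin's paper, with a hypothetical memory-bound run (event names in the comments; the issue width is assumed to be 4):

```python
# Level-1 Top-Down from raw events (Yasin 2014, Skylake-era formulas).
# Total issue slots = width * cycles; each category is a fraction of that.
def topdown_l1(cycles, uops_issued, uops_retired_slots, recovery_cycles,
               fe_undelivered_slots, width=4):
    slots = width * cycles
    frontend_bound = fe_undelivered_slots / slots   # IDQ_UOPS_NOT_DELIVERED.CORE
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots  # INT_MISC.RECOVERY_CYCLES
    retiring = uops_retired_slots / slots           # UOPS_RETIRED.RETIRE_SLOTS
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return dict(Retiring=retiring, Bad_Speculation=bad_speculation,
                Frontend_Bound=frontend_bound, Backend_Bound=backend_bound)

# Hypothetical memory-bound run:
td = topdown_l1(cycles=1_000_000, uops_issued=1_100_000,
                uops_retired_slots=1_000_000, recovery_cycles=10_000,
                fe_undelivered_slots=280_000)
print({k: f"{v:.1%}" for k, v in td.items()})  # Backend_Bound comes out at 64.5%
```

The four fractions sum to 1.0 by construction — Backend Bound is the remainder, which is exactly how the hierarchy guarantees the partition is exhaustive.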

The percentages always sum to 100. A healthy hot loop is 60–80% Retiring, 5–10% Bad Speculation, 5–15% Frontend Bound, 10–25% Backend Bound. Asha's order matcher at IPC 0.7 will show something like 25% Retiring, 5% Bad Speculation, 5% Frontend Bound, 65% Backend Bound — the immediate diagnosis is "you are stalled on the back-end". One more level of drill-down (-l3) splits Backend Bound into Memory_Bound vs Core_Bound, and Memory_Bound into L1 / L2 / L3 / DRAM bound. Asha's L3-bound number jumps to 38%; her diagnosis is now "your hot path is missing in L3 and waiting for DRAM". That points at data-structure layout, not at match_order's code.

[Figure: stacked horizontal bars of the Top-Down breakdown for two scenarios. Healthy compute-bound loop (IPC = 3.4): Retiring 75%, Bad Speculation 5%, Frontend 8%, Backend 12%. Memory-bound loop (Asha's matcher, IPC = 0.7): Retiring 24%, Bad Speculation 4%, Frontend 7%, Backend 65%, of which 38% is L3-bound. The breakdown sums to 100% by construction; the empty issue slot is the unit of measurement, and whichever category absorbs the slot is the named bottleneck. A high Backend % with high Memory_Bound and L3_Bound means the working set fell out of LLC — the fix is data layout, not source-level micro-optimisation. Read --topdown -l3 first; individual branch-miss or cache-miss counts are noise without the slot accounting around them.]
Top-Down's stacked bar tells you which of four categories owns the empty issue slots. The chart compresses 200+ event-counter combinations into one diagnostic frame. Illustrative — not measured data.

Why Top-Down is so much better than reading individual events: a high cache-misses count tells you cache misses happened, not whether they cost you anything. If the OoO engine hides the misses behind other work, IPC stays fine and the cache misses are free. Top-Down accounts for stalled issue slots — the slots where the core wanted to do work and couldn't. That accounting is the only one that maps directly to lost throughput. Reading cache-misses without Top-Down is like reading "10000 packets dropped" without knowing whether the link is at 1% or 100% utilisation. The denominator is everything.

Reading PMU events from Python — the artefact

The reproducible workflow on Linux is perf stat invoked from a Python driver, with the driver parsing perf stat's machine-readable CSV output. This pattern lets you wrap any benchmark in any language, then post-process the counters in code rather than by eye.

# pmu_diagnose.py — run a benchmark, capture PMU events, classify by Top-Down.
# Works on Intel CPUs with --topdown support (Skylake+).
import subprocess, re
from pathlib import Path

# A small benchmark with two phases: compute-bound, then memory-bound.
BENCH = '''
import numpy as np, time
N = 4096
# Phase 1: compute-bound (matrix ops in L1/L2)
A = np.random.rand(N, N).astype(np.float32)
t0 = time.perf_counter_ns()
for _ in range(3):
    B = (A * A) + 1.0
t1 = time.perf_counter_ns()
print(f"phase1_ns={t1-t0}")
# Phase 2: memory-bound (random gather over 256 MB)
big = np.random.rand(32_000_000).astype(np.float64)
idx = np.random.randint(0, big.size, size=10_000_000)
t0 = time.perf_counter_ns()
s = big[idx].sum()
t1 = time.perf_counter_ns()
print(f"phase2_ns={t1-t0}  sum={s:.3f}")
'''
Path("bench.py").write_text(BENCH)

EVENTS = ("cycles,instructions,branch-misses,cache-misses,"
          "L1-dcache-load-misses,LLC-load-misses,"
          "cycle_activity.stalls_l3_miss,uops_issued.any,uops_retired.retire_slots")

def parse_perf_csv(stderr: str) -> dict:
    out = {}
    for line in stderr.splitlines():
        # perf -x, format: count,unit,event,runtime,pct,...
        parts = line.split(",")
        if len(parts) >= 3 and re.match(r"^[\d,]+$", parts[0]):
            n = int(parts[0].replace(",", ""))
            out[parts[2]] = n
    return out

def run(label: str, extra: list[str]) -> dict:
    cmd = ["perf", "stat", "-x,", "-e", EVENTS] + extra + ["python3", "bench.py"]
    r = subprocess.run(cmd, capture_output=True, text=True)
    c = parse_perf_csv(r.stderr)
    if "cycles" in c and "instructions" in c:
        ipc = c["instructions"] / c["cycles"]
        l3_stall_pct = 100.0 * c.get("cycle_activity.stalls_l3_miss", 0) / c["cycles"]
        print(f"[{label}] IPC={ipc:.2f}  "
              f"branch-misses={c.get('branch-misses', 0):>10,}  "
              f"LLC-misses={c.get('LLC-load-misses', 0):>10,}  "
              f"L3-stall%={l3_stall_pct:5.1f}")
    return c

run("default", [])

# Top-Down breakdown (Skylake+; older CPUs ignore --topdown)
td = subprocess.run(["perf", "stat", "--topdown", "-l3", "python3", "bench.py"],
                    capture_output=True, text=True)
print("\n--- Top-Down (-l3) ---\n" + td.stderr)

Sample output on a c6i.4xlarge (Ice Lake-SP, kernel 6.1):

[default] IPC=1.42  branch-misses=    382,114  LLC-misses=  4,192,830  L3-stall%= 31.4

--- Top-Down (-l3) ---
   Frontend_Bound:                          7.2 %
   Bad_Speculation:                         2.8 %
   Retiring:                               28.4 %
   Backend_Bound:                          61.6 %
       Memory_Bound:                       54.1 %
           L1_Bound:                        4.8 %
           L2_Bound:                        2.6 %
           L3_Bound:                       12.9 %
           DRAM_Bound:                     33.8 %

Walking the load-bearing lines:

  1. IPC=1.42 is a blend of both phases: the compute phase runs well above it, the gather phase well below. Whole-run IPC alone localises nothing.
  2. L3-stall%= 31.4 says nearly a third of all cycles had execution stalled behind an outstanding L3 miss.
  3. DRAM_Bound: 33.8 % confirms it from the slot side: a third of issue slots are lost to main-memory latency, dwarfing the L1/L2/L3-bound shares.

A workflow using just cache-misses would have seen "4.2M LLC misses" and shrugged. The workflow using cycle_activity.stalls_l3_miss and --topdown -l3 immediately tells you those 4.2M misses are absorbing 33.8% of issue slots — they are the bottleneck, and the search space narrows to data layout.

Sampling vs counting — when each is the right tool

Two distinct PMU usage modes exist, and conflating them produces wrong diagnoses.

Counting mode (perf stat) accumulates a counter across the whole run and prints the total at exit. It tells you how much but not where. IPC = 1.4 on the whole binary tells you nothing about which function caused it — the binary may have one hot function at IPC 0.4 dragging down twenty cold ones at IPC 4. Counting is the right mode for headline diagnostics ("is the back-end stalled?"), the wrong mode for localising the cause.
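The blending arithmetic is worth seeing once. A sketch with made-up numbers for a two-function binary:

```python
# Whole-binary IPC is a cycle-weighted blend; it hides per-function extremes.
funcs = [
    # (name, fraction_of_total_cycles, per_function_ipc) — made-up numbers
    ("hot_gather",   0.80, 0.4),
    ("cold_helpers", 0.20, 4.0),
]
# Per unit cycle, instructions retired = sum of (cycle share * that code's IPC).
blended_ipc = sum(share * ipc for _, share, ipc in funcs)
print(f"blended IPC = {blended_ipc:.2f}")  # blended IPC = 1.12
```

A blended 1.12 looks merely mediocre, while the function eating 80% of the cycles is running at a catastrophic 0.4 — which is why counting answers "how much" and sampling is needed for "where".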

Sampling mode (perf record) configures the PMU to interrupt the CPU every N events and capture the instruction pointer, the call stack, and the user/kernel context at that interrupt. The default sampling event is cycles, at a default frequency of a few kilohertz — thousands of samples per second of CPU time, weighted toward functions that spend the most cycles (hence "CPU time profile"). But you can sample on any event: perf record -e cache-misses produces a profile weighted toward functions that take cache misses; perf record -e branch-misses localises mispredictions.

The killer combination: perf record -e cycles,L1-dcache-load-misses,branch-misses -F 99. One run, three profiles. The cycles profile shows where time is spent; the L1-dcache-load-misses profile shows where the cache misses live; the branch-misses profile shows where the branch predictor is failing. Three flamegraphs from one trace, each weighted by a different event. Brendan Gregg's flamegraph.pl (and its descendants in perf's scripting support and bcc's profile tool) renders any of them. A function that is hot in cycles but cold in cache-misses is compute-bound and benefits from algorithmic optimisation; a function that is hot in cache-misses but only proportionally hot in cycles is memory-bound and benefits from data-layout work. The shape of the disagreement between flamegraphs is the diagnosis.

Sampling has overhead — typically 1–5% at 99 Hz — because each interrupt walks the stack. Counting has near-zero overhead. Run counting always (cheap headline), run sampling when counting reveals a problem worth localising.

A subtle Hotstar example: during the 2024 IPL final, the transcoder service hit 87% CPU utilisation (the alerting threshold). The on-call engineer ran perf stat and saw IPC = 1.1. He ran perf stat --topdown -l3 and saw L3_Bound at 24% and DRAM_Bound at 18% — back-end stalled on memory. He then ran perf record -e cache-misses -p $(pgrep transcoder) -- sleep 30 and rendered a flamegraph weighted by cache misses. The flamegraph showed 41% of cache misses concentrated in framecache_lookup — a function that took only 11% of CPU time in the cycles flamegraph. Cache misses were roughly 4× over-represented in that function relative to its CPU share — the smoking gun. The fix was to size the framecache hash table to fit in L3 (it had grown past it under the final's higher concurrent stream count). LLC miss rate dropped from 12% to 4%, IPC climbed from 1.1 to 1.9, and the transcoder fleet shed 22% of its instances. ₹3.4 crore/month saved on a fleet that runs in ap-south-1 at roughly ₹16 lakh per month per c6i.8xlarge reservation. The whole diagnosis took 90 minutes and never looked at source code until the last 10.
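That "over-represented" step generalises into a one-line heuristic: divide a function's share of event-weighted samples by its share of cycle-weighted samples. A sketch with fractions mirroring the story above:

```python
# Over-representation: a function whose share of an event's samples far
# exceeds its share of cycles is where that event is concentrated.
def over_representation(event_share: float, cycles_share: float) -> float:
    return event_share / cycles_share if cycles_share else float("inf")

# framecache_lookup: 41% of cache-miss samples vs 11% of cycle samples.
ratio = over_representation(0.41, 0.11)
print(f"cache misses are {ratio:.1f}x over-represented")  # ~3.7x
```

A ratio near 1.0 means the event is merely proportional to time spent; a ratio well above 1.0 marks the function as the event's concentration point and the first place to look.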

Common confusions

  1. "High cache-misses means the cache is the bottleneck." Not by itself — the OoO engine may hide the misses behind other work. Only the slot accounting (Top-Down) says whether the misses cost throughput.
  2. "perf stat and perf record answer the same question." Counting tells you how much, not where; sampling tells you where. Conflating the modes produces wrong diagnoses.
  3. "The printed counts are exact." Not when the kernel multiplexes more events than counters — below ~50% run-fraction the scaled estimates are unreliable.
  4. "The sampled IP is the instruction that caused the event." Only with PEBS/IBS precise events; standard sampling skids the IP tens to hundreds of cycles downstream.

Going deeper

The Top-Down hierarchy and Yasin's paper

Yasin's 2014 ISPASS paper, "A Top-Down Method for Performance Analysis and Counters Architecture", is the foundation. Read sections 3 and 4 to understand the pipeline-slot model and how the four categories partition the issue space without overlap. Intel's productisation in perf stat --topdown (Skylake-era) encodes the same partitioning into hardware support — the slots fixed counter (Ice Lake and later) exists specifically so the kernel can run Top-Down without any multiplexing, and the accompanying PERF_METRICS MSR reports the level-1 Retiring / Bad_Speculation / Frontend_Bound / Backend_Bound fractions directly, so the whole level-1 breakdown costs one fixed counter and zero programmable ones. AMD's equivalent is the Pipeline Utilization metric in uProf (Zen 3+) — the methodology is portable, the counter mapping is per-vendor.

The Razorpay payments-router blind-spot

October 2024, Razorpay's payments-router (the service that handles UPI transaction routing across PSP banks): a routine deploy went out, and p99 climbed from 18 ms to 31 ms over the next six hours. The on-call team's first instinct was to run perf record -F 99 and look at the flamegraph. The flamegraph was identical to the previous deploy — same shape, same hot functions, same fractions. They almost concluded "no regression, alerting noise". The senior SRE on the team ran perf stat --topdown -l3 and saw Frontend_Bandwidth.MITE at 19% (baseline 4%). The flamegraph could not see this because the regression was distributed across every function — a global front-end-bandwidth degradation, not a localised hot function. The cause was a compiler upgrade that changed the binary's overall instruction layout, reducing the µop cache hit rate. The fix was a build-flag change. The regression had been visible to Top-Down all along; the flamegraph alone had concealed it. This is the canonical case for "always read Top-Down before any flamegraph". The cost of the six-hour delay in detecting it was an estimated ₹47 lakh in delayed UPI settlements (queueing fees, customer compensation, partner-bank SLA hits).

Reading PEBS-precise events for cycle attribution

Standard PMU sampling is imprecise — when the counter overflows, the interrupt fires a few cycles later, and the instruction pointer captured at interrupt time is not necessarily the one that caused the event. On a memory-miss event at 200-cycle latency, the IP can be 50–200 cycles downstream of the actual load. Intel's Precise Event-Based Sampling (PEBS) fixes this for a subset of events by snapshotting the architectural state into a memory buffer at the exact instruction the event fired. Events that accept perf's :p / :pp precision modifiers are PEBS-capable: mem_load_retired.l3_miss:p, inst_retired.any:p, frontend_retired.dsb_miss:p. Use PEBS sampling whenever the question is "which exact line of code caused this event" — non-PEBS sampling lies about that question. The cost is restricted to specific events and ~1% extra overhead. AMD's equivalent is Instruction-Based Sampling (IBS), which samples every Nth instruction and walks back to capture all the events that touched it — different mechanism, similar precision goal.
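In a scripted workflow, switching to PEBS is just an event-modifier change. A sketch that only builds the perf record command line (the event name here is one of the PEBS-capable examples above; whether it exists depends on your CPU, and the pid and duration are placeholders):

```python
# Construct a PEBS-precise sampling command. ':pp' requests precise IP
# attribution; perf reports an error if the event is not PEBS-capable.
def pebs_record_cmd(event: str, pid: int, seconds: int,
                    freq: int = 99) -> list[str]:
    return ["perf", "record", "-e", f"{event}:pp", "-F", str(freq),
            "-g", "-p", str(pid), "--", "sleep", str(seconds)]

cmd = pebs_record_cmd("mem_load_retired.l3_miss", 1234, 30)
print(" ".join(cmd))
```

Run it with subprocess.run(cmd) against a live pid; the resulting perf.data attributes each L3 miss to the exact load instruction rather than to wherever the interrupt landed.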

When PMU events are not enough — eBPF-driven probes

The PMU answers "how much is happening at the hardware level". It does not answer "which userspace function called this kernel path", "which Kafka topic this LLC miss correlated with", or "what was the request-id for the syscall that took 80 ms". Those answers need software-level instrumentation that PMU events feed into. eBPF is the substrate: a kprobe on kmem_cache_alloc can read the PMU's cycle counter at entry and exit, computing the cycle cost per call site; a uprobe on a Python function can read the same counter and attribute cycles to userspace call sites. The PMU is the high-frequency oscilloscope; eBPF is the wiring that connects it to your application's logical structure. Part 6 of this curriculum covers the eBPF side in depth; the relevant point here is that PMU events are a primitive, not a complete answer.

Reproduce this on your laptop

sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
# Allow non-root perf events for the current user (kernel >=4.6):
sudo sysctl kernel.perf_event_paranoid=0
python3 pmu_diagnose.py
# Then a sampling run:
perf record -e cycles,cache-misses,branch-misses -F 99 -g -- python3 bench.py
perf report --stdio | head -50

Expect whole-run IPC near 1.4, DRAM_Bound between 25% and 40% driven by the gather phase, and a cache-misses-weighted profile concentrated in numpy's fancy-indexing kernel. Specific numbers will vary by CPU generation and DRAM speed, but the shape — back-end-bound, memory-stall-dominant, gather-localised — is the signal.

When to ignore PMU events entirely

PMU diagnosis is the right tool for questions about single-machine throughput and latency. It is the wrong tool for questions about distributed systems throughput, queueing-induced tail latency, coordination overhead between services, or I/O-driven blocking. A service that spends 95% of its wall-clock time epoll_wait-ing on a socket has IPC undefined — the CPU is halted, not running instructions slowly. PMU events still exist (the cycle counter freezes when the core is halted), but they do not tell you anything about the service's bottleneck because the bottleneck is upstream. Always check top's CPU utilisation column first: if the service is using less than ~30% of a core, do not run perf stat — run strace, bpftrace, or look at upstream queue depths instead. PMU is a CPU-bound tool; non-CPU-bound services need different tools.
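The "check utilisation first" gate is easy to automate from two busy/total jiffy samples (the ~30% threshold is this section's rule of thumb, not a standard):

```python
# Decide whether PMU diagnosis is even applicable: below ~30% of a core,
# the bottleneck is usually upstream (I/O, queueing), not the CPU.
def cpu_bound_enough(busy_delta: float, total_delta: float,
                     threshold: float = 0.30) -> bool:
    """busy_delta / total_delta are jiffy deltas between two samples,
    e.g. from /proc/<pid>/stat utime+stime vs wall-clock jiffies."""
    util = busy_delta / total_delta if total_delta else 0.0
    return util >= threshold

# Service used 120 busy jiffies out of 1000 elapsed -> 12% of a core:
print(cpu_bound_enough(120, 1000))  # False -> reach for strace/bpftrace, not perf
```

Wiring this gate in front of an automated perf stat collector keeps it from producing meaningless IPC numbers for services that are mostly blocked.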

Where this leads next

The PMU is the measurement substrate every later chapter in this curriculum builds on. The next steps:

Part 5 (CPU profiling) wraps the PMU into flamegraph-style sampling tools — py-spy, perf record, pprof. Part 6 (eBPF) uses the PMU's cycle counter inside kernel-space probes for per-syscall and per-event attribution. Part 7 (latency / tail latency) feeds PMU samples into HdrHistograms to localise tail-latency causes by hardware event. Part 14 (capacity planning) uses PMU-derived IPC and stall percentages to build per-instance capacity models that predict the cliff before load tests find it.

The deeper pattern: a senior engineer's cognitive workflow on any performance regression starts with three commands. First top (is the service CPU-bound?). Second perf stat --topdown -l3 (which side of the rename stage is stalled?). Third perf record -e <event-from-step-2> -F 99 -g (where in the code does the stall live?). Three commands, four numbers per step, ten minutes total to reach a hypothesis. The PMU makes that workflow possible — it is the silicon-level telemetry that turns "the service is slow" into "the service is L3-bound in the framecache hash function". Without it, the diagnosis is guesswork; with it, the diagnosis is measurement.

The PMU's deepest lesson for an engineer who has only ever read top and htop: the CPU is not opaque. There is a microscope built into every core. It costs nothing to use. The cost is in learning what to look at — which of 200+ events to combine, which derived metrics to trust, when to sample versus count, when PEBS-precise versus standard sampling. The first month you use it you will misread the output; by the third month you will pull up --topdown -l3 reflexively before any other diagnostic. The transition from "the dashboard says CPU is high" to "the back-end is L3-bound at 38%, the framecache hash needs to fit in 28 MB" is the transition from observation to diagnosis. The PMU is the substrate of that transition.

References

  1. Yasin, "A Top-Down Method for Performance Analysis and Counters Architecture" (ISPASS 2014) — the foundational paper for the Top-Down methodology, including the pipeline-slot accounting and the four-category partitioning.
  2. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B — Chapters 18–19 on performance monitoring, with the per-microarchitecture event tables and PEBS configuration registers.
  3. AMD Processor Programming Reference (PPR), Family 19h Models 10h-1Fh — AMD's equivalent reference for Zen 4 PMU events, IBS configuration, and counter widths.
  4. Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 6 (CPUs) and Chapter 13 (perf), the canonical practitioner's guide to PMU usage on Linux with concrete perf recipes.
  5. Andi Kleen's pmu-tools — toplev.py for Top-Down on older CPUs, ocperf.py for symbolic event names, and the event_download.py helper for keeping per-CPU event tables current.
  6. Linux kernel perf_event_open(2) man page — the syscall every PMU-reading tool wraps; needed when you build your own non-perf front-end.
  7. Brendan Gregg, "CPU Flame Graphs" — sampling-mode visualisation with PMU-event-weighted profiles.
  8. Micro-ops, fusion, and decode bandwidth — chapter 6 of this curriculum, where the front-end events read by Top-Down originate.