Hardware event sampling (PEBS, IBS)
At 09:42 IST on a Tuesday, Karan opens a flame graph from the Razorpay risk-scoring service. The service runs at 38% CPU on c6i.4xlarge, p99 = 14 ms, but for two weeks p99 has been creeping up by ~200 µs per day. The flame graph shows 41% of CPU in score_features, a 220-line numpy-heavy routine. He runs perf record -e cycles -F 999 and zooms in. The hot column is real, but the line-level annotation says line 87 — a comment line. The line above it (risk = w @ x + b) shows zero samples. The line below it (risk = float(risk[0])) shows three samples. Cycles-event sampling has fingered an attribution that is geometrically impossible: the comment cannot be executing. The interrupt fired several instructions after the one that actually retired, and the IP recorded in the sample is the one that happened to be in flight when the kernel got control. To find the real culprit, Karan needs the CPU to itself tell him which instruction caused the work — not which instruction was running when the interrupt was serviced. That capability has a name on Intel: PEBS. On AMD: IBS.
PEBS (Intel) and IBS (AMD) are hardware features that make the CPU record a sample's metadata — instruction pointer, registers, latency, data address — at the moment the chosen event occurs, not when the interrupt is later serviced. They eliminate the skid problem of regular interrupt-based profiling, attribute cache misses and branch mispredicts to the exact instruction that suffered, and let perf record -e cycles:pp produce line-level accuracy that ordinary cycles cannot. Use them whenever line-level attribution matters, and use IBS (in its fetch or op flavour) as the AMD counterpart, with similar precision but a different sampling model.
Why a regular profile points at the wrong instruction
A modern Intel or AMD core retires up to 4–6 instructions per cycle out of an execution window 200–512 instructions deep. When the performance-monitoring unit (PMU) increments a counter past its programmed threshold — say, after the N-th cache miss — the CPU raises a non-maskable interrupt to tell the kernel "sample now". By the time the kernel's interrupt handler reads the register file and asks "what instruction was running", several to several-hundred instructions have already retired beyond the one that incremented the counter. The kernel records the architectural instruction pointer (RIP) it sees, which is the next instruction to retire — not the instruction that caused the event. This gap is called interrupt skid, and on a wide out-of-order core it is typically 10–200 instructions, occasionally more when a store buffer is draining or a long-latency miss is in flight.
Skid is invisible at the function level — a 50-instruction skid lands somewhere inside the same hot function 99% of the time, so flame graphs are still useful — but it is fatal at the line level. A perf annotate view tries to map sample IPs back to source lines, and a 50-instruction skid is enough to walk past a tight inner loop into the loop epilogue, into the next basic block, even into the next function in some cases. Karan's "comment line shows samples" pathology is exactly this: the cycles event sampled at a threshold, the interrupt fired, the IP recorded was the one in the architectural register at that moment — which happened to be the address of the first instruction of the next basic block, which the debug-info DWARF map tied back to a source line that visually contained a comment. Nothing in the recorded sample is technically wrong; the sample just answers a different question than the one Karan needs answered.
Why skid is a hardware property, not a software bug: when the PMU counter overflows, the core does not stop. It signals an interrupt request, which queues behind whatever the core is currently doing — finishing the in-flight uop, draining the store buffer, allowing higher-priority interrupts, deciding whether to break out of a critical section. By the time the core actually services the PMU interrupt, the architectural state has advanced. PEBS and IBS solve this by routing around the interrupt: the hardware writes the sample record into a memory buffer at the moment of the event, and the interrupt just tells the kernel "the buffer has new entries, flush it". The IP in the buffer is captured by hardware in the cycle the event occurred; the interrupt skid no longer pollutes attribution.
The fix — which Intel first shipped in the Pentium 4's PMU — is Precise Event-Based Sampling (PEBS). The PMU is told: when the counter overflows, do not just raise an interrupt; first, capture a hardware snapshot of the offending instruction's architectural state — the precise RIP, all general-purpose registers, the data address (for memory events), the latency in cycles (for memory events) — into a kernel-managed memory buffer (the PEBS buffer). Then raise an interrupt for the kernel to drain the buffer in batches. The IP in each PEBS record is the IP of the instruction that caused the event, with one-instruction precision on modern Intel parts (Skylake and later); the buffer batches dozens of records before each interrupt, so the per-sample CPU cost is amortised. AMD's equivalent — Instruction-Based Sampling (IBS) — works on a different sampling model (it samples the fetch or op stream, not a counter overflow), but produces the same line-level attribution.
The practical effect is that perf record -e cycles:pp is fundamentally a different tool from perf record -e cycles. The trailing :pp tells perf to use the PEBS-precise version of the event. One p means "best-effort precise"; two ps (:pp) mean "request precise IP"; three ps (:ppp) mean "demand precise IP, fail if not available". On Intel parts since Haswell, modern perf builds automatically upgrade cycles to the most precise form available for perf top / perf record when the event is PEBS-capable. You almost never want plain cycles once you understand what :pp gives you.
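Because precision can degrade silently on some hardware (virtualised PEBS, busy sibling counters), profiling wrappers often probe downward from :ppp. A minimal sketch — the helper and its probing strategy are my own, not part of perf:

```python
import subprocess

# Precision tiers, strongest first: :ppp demands a precise IP, :pp requests
# one, :p is best-effort, and the bare event falls back to skidded sampling.
PRECISION_TIERS = ["ppp", "pp", "p", ""]

def event_with_max_precision(event: str, probe=None) -> str:
    """Return `event` with the strongest precision suffix the PMU accepts.

    `probe` is a callable that takes a candidate event string and returns
    True if perf accepts it; by default it dry-runs `perf record` on a
    trivial command. (Illustrative helper, not a perf API.)
    """
    if probe is None:
        def probe(candidate: str) -> bool:
            r = subprocess.run(
                ["perf", "record", "-e", candidate, "-o", "/dev/null",
                 "--", "true"],
                capture_output=True)
            return r.returncode == 0
    for tier in PRECISION_TIERS:
        candidate = f"{event}:{tier}" if tier else event
        if probe(candidate):
            return candidate
    raise RuntimeError(f"PMU rejected every form of {event}")

# On a machine where only single-p precision is exposed (simulated here
# with a fake probe), the helper degrades gracefully to "cycles:p":
# event_with_max_precision("cycles", probe=lambda e: e.count("p") <= 1)
```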
What PEBS actually records, and what IBS does instead
A PEBS record is much richer than the "RIP + counter value" of a regular sample. On Skylake and later, every PEBS record includes:
- The precise RIP of the instruction that caused the event (the "EventingIP" field in Intel manuals).
- All 16 general-purpose register values at the moment the event occurred — useful for reconstructing the exact memory address that missed, the loop induction variable's value, the branch direction.
- The data linear address (for memory events) — the address that suffered the cache miss or store-forward stall.
- The load/store latency in cycles (for memory-load events with MEM_TRANS_RETIRED.LOAD_LATENCY) — how many cycles the load actually waited.
- A timestamp (TSC) — letting you reconstruct event ordering across cores.
That richness means PEBS does not just answer "where is my hot code"; it answers "which line of code is generating L3 misses on which address with what latency". perf mem record -t load captures load samples with all this metadata; perf mem report --type=load,store displays them by address and latency bucket. This is the tool you reach for when you suspect false sharing, when you want to know which struct field is cache-cold, when you want to confirm whether a pointer-chase is actually pointer-chasing or whether the prefetcher is hiding the problem.
AMD's IBS is structurally different. There are two flavours: IBS-Fetch samples the fetch unit (which instruction was fetched, how long it took to fetch, did the branch predictor correctly steer it), and IBS-Op samples the op pipeline (which retired uop, what was its latency, did its load miss the L1, the L2, did it hit DRAM). IBS-Op is the AMD analogue to PEBS for memory analysis. The sampling model is also different: IBS samples one out of every N fetched instructions or retired ops based on a hardware-tagged "this is the chosen one" mechanism, rather than overflowing a counter on a chosen event. The result is that an IBS-Op sample is uniform across all retired ops — it does not bias toward any particular event — but you get a richer report on every sample.
The trade-off matters for choosing between PEBS and IBS in cross-architecture work. PEBS biases sampling toward whichever event you chose to count (cache misses, branch misses, TLB misses); IBS gives you everything about a uniformly-sampled stream of ops and you filter post-hoc. For a "find me the cache misses" workflow, PEBS is more efficient because it only samples when a cache miss happens. For a "tell me about every aspect of the hot path" workflow, IBS is more efficient because one capture answers many questions.
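The efficiency gap can be put in numbers. A back-of-envelope helper — illustrative only, with the event fraction supplied as an assumed input:

```python
from math import ceil

def samples_needed_uniform(target_event_samples: int,
                           event_fraction: float) -> int:
    """Uniform (IBS-style) samples needed to observe `target_event_samples`
    occurrences of an event that only `event_fraction` of retired ops
    exhibit.

    Event-biased (PEBS-style) sampling needs exactly `target_event_samples`
    captures, because every sample is an event by construction; uniform
    sampling pays a 1/event_fraction multiplier. (Back-of-envelope only:
    real IBS filtering is post-hoc on richer per-sample records.)
    """
    return ceil(target_event_samples / event_fraction)
```

For example, to catch 10,000 L3 misses when misses are 0.5% of retired ops, uniform sampling must cover ~2,000,000 ops — a 200× multiplier over event-biased sampling, which is the quantitative form of the trade-off above.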
# pebs_demo.py — measure load-latency distribution on a numpy stride benchmark
# using perf record -e mem_load_retired.l3_miss:pp, then parse the per-sample
# data with perf report. This is the workflow Karan's risk-scoring debug used
# to attribute the regression to a specific access pattern in score_features.
import re
import subprocess
from pathlib import Path

PERF = "/usr/bin/perf"

WORKLOAD = """
import numpy as np, time

N = 12_000  # 12k x 12k float64 = 1.15 GB > LLC on c6i.4xlarge
A = np.random.rand(N, N)
B = np.random.rand(N, N)

t0 = time.perf_counter_ns()
# Column-major access pattern, deliberately cache-hostile
total = 0.0
for j in range(N):
    for i in range(N):
        total += A[i, j] * B[i, j]
elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
print(f"score_features: total={total:.4e} elapsed={elapsed_ms:.1f}ms")
"""

def run_perf_pebs(workload_path: Path, perf_data: Path) -> None:
    """Capture PEBS samples for L3 misses on the workload."""
    # mem_load_retired.l3_miss is a PEBS-capable event on Skylake+; the :pp
    # suffix demands precise IP. -c 5000 means sample 1 in every 5000 events
    # (we want enough to see structure but not so many that buffer-drain costs
    # dominate). --call-graph=dwarf for full unwinding when frame pointers fail.
    cmd = [PERF, "record",
           "-e", "mem_load_retired.l3_miss:pp",
           "-c", "5000",
           "-d",                      # capture data address (the missed addr)
           "--call-graph=dwarf",
           "-o", str(perf_data),
           "--", "python3", str(workload_path)]
    subprocess.run(cmd, check=True)

def parse_perf_mem_report(perf_data: Path) -> list[dict]:
    """Run `perf mem report` and parse the load-latency histogram."""
    p = subprocess.run([PERF, "mem", "report",
                        "--input", str(perf_data),
                        "--stdio", "--type", "load",
                        "--sort", "mem,sym,dso"],
                       check=True, capture_output=True, text=True)
    rows = []
    # Each row: <samples> <pct> <mem-class> <symbol> <dso>
    for line in p.stdout.splitlines():
        m = re.match(r"\s*(\d+)\s+([\d.]+)%\s+(\S.*?)\s{2,}(\S+)\s+(\S+)", line)
        if m:
            rows.append({"samples": int(m.group(1)),
                         "pct": float(m.group(2)),
                         "mem_class": m.group(3).strip(),
                         "symbol": m.group(4),
                         "dso": m.group(5)})
    return rows

if __name__ == "__main__":
    Path("/tmp/wl.py").write_text(WORKLOAD)
    perf_data = Path("/tmp/perf.pebs.data")
    run_perf_pebs(Path("/tmp/wl.py"), perf_data)
    rows = parse_perf_mem_report(perf_data)
    print(f"\n{'samples':>10} {'pct':>6} mem-class symbol")
    for r in rows[:8]:
        print(f"{r['samples']:>10,d} {r['pct']:>5.1f}% "
              f"{r['mem_class']:<36} {r['symbol']}")
# Sample run on c6i.4xlarge, kernel 6.6, perf 6.6.0.
# `perf list | grep -i pebs` confirms mem_load_retired.l3_miss has [Precise event]
[ perf record: Woken up 14 times to write data ]
[ perf record: Captured and wrote 8.2 MB /tmp/perf.pebs.data (52,184 samples) ]
score_features: total=1.7320e+11 elapsed=148302.4ms
   samples    pct mem-class                            symbol
    34,118  65.4% L3 hit (or fwd hit)                  _aligned_strided_loop
    11,902  22.8% Local DRAM hit                       _aligned_strided_loop
     3,944   7.6% L2 hit (no fwd)                      _aligned_strided_loop
     1,210   2.3% L3 hit (HitM, fwd)                   _aligned_strided_loop
       802   1.5% Local DRAM hit                       PyEval_EvalFrameDefault
       208   0.4% L3 hit                               PyDict_GetItem
Walk-through. -e mem_load_retired.l3_miss:pp picks the PEBS event "load that retired with an L3 miss"; the :pp demands precise IP — without it, the symbol attribution lies. -c 5000 is the period: count 5000 L3 misses, then write a PEBS record. On a workload generating ~10M L3 misses per second, this gives ~2000 samples/sec — high enough to see structure, low enough that the buffer-drain interrupts add ~2% overhead. -d asks PEBS to capture the data linear address; without it, you only get the IP and lose the ability to ask "which struct field?". --call-graph=dwarf lets PEBS sample the user-stack via DWARF unwinding, since Python compiles without frame pointers. The output's "mem-class" column is the load's resolution: 65.4% of samples hit L3, 22.8% went to DRAM, 2.3% hit L3 with a HitM (cache line dirty in another core's cache — the false-sharing fingerprint).

Karan's regression turned out to be the second row: he had refactored score_features to allocate a fresh feature matrix per call (instead of reusing one), and the new allocation pattern was thrashing the LLC every call. The _aligned_strided_loop symbol with 22.8% of loads going to DRAM was the smoking gun — without PEBS, that signal is invisible because no per-event sampling tool would have attributed it to that loop.
Why precise IP changes the diagnosis: with regular cycles sampling, the hot symbol shows score_features overall but the per-line breakdown is scrambled by skid; you cannot tell whether the L3-missing instructions are the matrix-multiply, the column-extraction, or the post-processing. PEBS's :pp mode lets perf annotate map every L3-miss sample to the exact instruction — and on this workload, that turned out to be a single load instruction in _aligned_strided_loop corresponding to the column-stride access pattern. The fix was a one-line change (transpose B before the loop), which dropped p99 from 14 ms to 9 ms. Without PEBS the diagnosis would have stopped at "score_features is hot", which has dozens of plausible explanations.
The buffer-drain cost is the other reason -c matters. PEBS records are 200–600 bytes each on modern parts. At a sample period of 100, on a workload generating 10⁸ events/sec, you produce 10⁶ records/sec ≈ 200 MB/sec of PEBS data — enough to dominate the workload. At -c 5000 the same workload produces 2 × 10⁴ records/sec ≈ 4 MB/sec, which the kernel can drain comfortably. Tune the period until perf record reports "Woken up N times" with N small (say, 10–50 wakeups for a 30-second capture); higher N means the kernel was overwhelmed and likely dropped samples.
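The arithmetic generalises to a one-liner worth keeping in a tuning notebook. A sketch, using 200 bytes — the low end of the record-size range — as the default:

```python
def pebs_data_rate_mb_s(event_rate_per_s: float, period: int,
                        record_bytes: int = 200) -> float:
    """PEBS buffer bandwidth in MB/s for a given event rate and -c period.

    records/sec = event_rate / period; each record is roughly 200-600
    bytes on modern parts, so this is a lower-bound estimate.
    """
    records_per_s = event_rate_per_s / period
    return records_per_s * record_bytes / 1e6

# 1e8 events/sec at -c 100  -> 200.0 MB/s (dominates the workload)
# 1e8 events/sec at -c 5000 ->   4.0 MB/s (comfortably drainable)
```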
Memory-load-latency analysis — the killer use case
The deepest reason to learn PEBS is load-latency analysis, the mode where every sample tells you not just "this load missed" but "this load took 312 cycles to complete". On Skylake-and-later, the mem_trans_retired.load_latency_gt_<threshold> events let you ask "show me only loads that took more than N cycles" — useful for separating L2 misses (~12 cycles) from L3 misses (~40 cycles) from DRAM accesses (~200 cycles) from cross-socket NUMA accesses (~400 cycles) from page-walks-because-TLB-missed (~1000 cycles).
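A post-processing step many teams script is bucketing PEBS-reported latencies by likely resolution level. A sketch using the rough thresholds above — the boundaries are illustrative, not vendor-specified, and should be calibrated per machine:

```python
# Approximate upper bounds (cycles) per resolution level, from the rough
# figures in the text; calibrate on your own hardware before trusting them.
BUCKETS = [
    (12,   "L1/L2 hit"),
    (40,   "L3 hit"),
    (200,  "local DRAM"),
    (400,  "remote (cross-socket NUMA)"),
    (1000, "page walk / TLB miss"),
]

def classify_load_latency(cycles: int) -> str:
    """Map a PEBS load latency to its likely resolution level."""
    for upper, label in BUCKETS:
        if cycles <= upper:
            return label
    return "pathological (>1000 cycles)"

# classify_load_latency(8)    -> "L1/L2 hit"
# classify_load_latency(150)  -> "local DRAM"
# classify_load_latency(5000) -> "pathological (>1000 cycles)"
```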
The Zerodha order-matching team used this in 2024 to track down a tail-latency regression in their FIX-message parser. The mean latency was unchanged after the deploy, but p99.99 had moved from 280 µs to 410 µs. A regular CPU profile showed nothing — the slow path was a tiny fraction of total CPU. The team ran perf mem record --ldlat=200 for one minute (capturing only loads with ≥200-cycle latency) on the production node, and got 14,000 samples. 11,200 of them came from a single memcpy inside the FIX-tag-decoder — the new code had introduced an unaligned load that crossed a cache line, and the second cache line was almost always cold. The fix was to align the FIX message buffer to a 64-byte boundary, which moved p99.99 back to 270 µs. Total time spent on the diagnosis: 22 minutes. Without mem_trans_retired.load_latency_gt_* and PEBS, the same diagnosis would have required either invasive instrumentation (adding rdtsc calls around suspect loads) or hours of guesswork.
Two practical patterns dominate load-latency analysis:
Pattern A — false-sharing detection. Look for samples with high latency and high HitM rate (cache-line was dirty in another core's cache). HitM is a unique identifier for cross-core coherence traffic; a high-frequency HitM cluster on a single source line is almost always false sharing. The perf c2c tool (c2c = cache-to-cache) is built specifically on top of PEBS load-latency data and surfaces false-sharing candidates ranked by HitM rate.
Pattern B — pointer-chase detection. Look for samples with high latency and the data linear address being the result of a previous load. PEBS does not directly tell you "this load depended on the previous load", but you can correlate by comparing the loaded value (visible in the captured registers) with the data address of the next load. Tools like pmu-tools (Intel's Andi Kleen) and toplev automate this analysis.
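The correlation Pattern B describes can be sketched over already-parsed samples. The dict layout is hypothetical — real tools decode perf.data records to recover the loaded value from the register snapshot:

```python
def find_pointer_chases(samples: list[dict]) -> list[tuple[int, int]]:
    """Flag consecutive load samples where one load's *value* is the next
    load's *data address* — the signature of a dependent pointer chase.

    Each sample dict needs 'data_addr' (the address the load read) and
    'loaded_value' (recovered from the PEBS register snapshot). Field
    names are illustrative, not a perf output format.
    """
    chases = []
    for i in range(len(samples) - 1):
        if samples[i]["loaded_value"] == samples[i + 1]["data_addr"]:
            chases.append((i, i + 1))
    return chases

# A linked-list walk shows up as every sample feeding the next:
walk = [
    {"data_addr": 0x1000, "loaded_value": 0x2040},
    {"data_addr": 0x2040, "loaded_value": 0x77c0},
    {"data_addr": 0x77c0, "loaded_value": 0x0},
]
# find_pointer_chases(walk) -> [(0, 1), (1, 2)]
```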
perf c2c deserves a paragraph by itself. The tool runs perf record with carefully chosen PEBS events (cpu/mem-loads,ldlat=30/pp and cpu/mem-stores/pp), records 10–60 seconds of production traffic, then post-processes the records to find cache lines that were modified in one core and read in another within a short time window. The output is a ranked list of cache-line addresses, each annotated with how many cores accessed it, the source code locations of the readers and writers, and the total HitM cost. The Hotstar streaming-router team used perf c2c in 2024 to find a per-connection counter struct that was being incremented from one core and read from another every 50 ms — classic false sharing — and the fix (cache-line padding around the counter) restored 11% of throughput on their TCP-pacing path. The tool's existence is the strongest argument for learning PEBS: nothing else can produce that diagnosis.
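The core of the c2c post-processing step — group samples by 64-byte cache line, rank lines by HitM count — can be approximated in a few lines once samples are parsed. A simplified sketch; the sample field names are assumptions, and the real tool also tracks per-core access sets and time windows:

```python
from collections import Counter

def rank_false_sharing_candidates(samples: list[dict],
                                  top: int = 5) -> list[tuple[int, int]]:
    """Rank cache lines by HitM sample count, perf-c2c style.

    Each sample dict carries 'data_addr' (the accessed address) and
    'hitm' (True if the line was dirty in another core's cache).
    Returns (line_base_address, hitm_count) pairs, hottest first.
    """
    hitm_per_line = Counter()
    for s in samples:
        if s["hitm"]:
            hitm_per_line[s["data_addr"] >> 6] += 1  # 64-byte line index
    return [(line << 6, n) for line, n in hitm_per_line.most_common(top)]

# Two HitM samples on the 0x1000 line outrank one on the 0x2000 line:
# rank_false_sharing_candidates([
#     {"data_addr": 0x1000, "hitm": True},
#     {"data_addr": 0x1008, "hitm": True},   # same 64B line as 0x1000
#     {"data_addr": 0x2000, "hitm": True},
#     {"data_addr": 0x3000, "hitm": False},  # clean load, ignored
# ]) -> [(0x1000, 2), (0x2000, 1)]
```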
Why load-latency analysis needs hardware help: measuring a single load's latency in software requires reading a clock before the load, issuing the load, and reading the clock after — three instructions whose own execution latency (and serialising barriers) dwarfs the ~4-cycle L1 hit you are trying to measure. Even rdtsc plus lfence adds ~30 cycles of overhead per measurement, so software-only latency profiling can only see loads slower than ~50 cycles. PEBS captures the per-load latency from a hardware counter inside the load-store unit, which observes the load directly without software interference. The L1-hit bucket of a PEBS load-latency histogram is genuinely measurable; the same bucket in a software-only profile is invisible noise.
Setting up PEBS and IBS in production
The capability is in hardware on every server-class Intel CPU since Nehalem (2008) and every AMD CPU since Family 10h (Barcelona, 2007), but accessing it from perf requires the kernel's perf subsystem to be configured correctly. Common gotchas:
- /proc/sys/kernel/perf_event_paranoid must be ≤ 1 for unprivileged users to use PEBS / IBS, and ≤ 0 for kernel symbol resolution. Many distros ship with 2, which silently downgrades :pp events to non-precise.
- /proc/sys/kernel/kptr_restrict must be 0 for kernel addresses to appear in samples.
- CPU frequency scaling complicates latency comparisons. Set the governor to performance (cpupower frequency-set --governor performance) before capturing.
- Hyper-threading and PEBS counters share PMU resources; if a sibling thread is using a counter, your :pp event may be delivered as :p (best-effort) without warning. perf record --no-inherit and pinning to a specific thread mitigate this.
- Kernel version. PEBS-via-PT (using Intel Processor Trace as the sample carrier) requires kernel 5.4+. IBS for AMD Zen 4 / Zen 5 requires kernel 6.0+. Check uname -r before running.
- Virtualised environments. Most cloud providers expose PEBS in their dedicated-host or bare-metal SKUs only. AWS c6i instances since 2022 expose PEBS in the regular shared-tenant instance types, but t3 does not. Run perf list | grep -i precise to verify; events without the [Precise event] tag will silently degrade.
A canonical capture sequence for "find me everything about loads that took more than 30 cycles":
# Verify PEBS availability
sudo perf list | grep mem_load_retired | head -3
# Lock CPU frequency
sudo cpupower frequency-set --governor performance
# Capture: load-latency >= 30 cycles, with full call graph
sudo perf record -e cpu/mem-loads,ldlat=30/pp \
-e cpu/mem-stores/pp \
-c 1000 -d --call-graph=dwarf \
-o /tmp/pebs.data \
-- python3 your_workload.py
# View the load-latency report
sudo perf mem report --input /tmp/pebs.data --stdio
# Find false-sharing candidates
sudo perf c2c report --input /tmp/pebs.data --stdio --full-symbols
The IBS equivalent on AMD:
# Confirm IBS available
sudo perf list | grep -i ibs
# IBS-Op sampling, 1-in-65536 ops, full call graph
sudo perf record -e ibs_op//p -c 65536 --call-graph=dwarf \
-o /tmp/ibs.data -- python3 your_workload.py
sudo perf report --input /tmp/ibs.data --stdio
Why the events differ in name and semantics across vendors: the underlying hardware counts events that exist in that microarchitecture's datapath, and the datapaths differ. Intel's mem_load_retired.l3_miss is defined in terms of Intel's specific cache-coherence states and the ring/mesh interconnect's response codes; AMD's equivalent is split across the load-op samples and the data-cache miss-status registers because Zen's L3 is a victim cache shared across an 8-core CCD, not a monolithic LLC. The events are not "the same thing with a different name"; they describe genuinely different hardware. Treat the per-vendor PMU as a separate domain and write the cross-architecture abstraction at the workflow level (capture, summarise, present), not at the event-name level.
The Flipkart catalogue team's 2024 cross-architecture port (moving 30% of their Big Billion Days fleet from Intel c6i to AMD c7a for cost) hit this difference head-on. Their PEBS-tuned profiling scripts had to be rewritten as IBS-Op scripts; the per-event filtering they relied on (mem_load_retired.l3_miss) does not exist on AMD with the same semantics. The team's general lesson: build profiling scripts that abstract over PEBS-vs-IBS at the workflow layer (capture, summarise, attribute) rather than baking event names in. The pmu-tools and toplev projects do this for top-down microarchitecture analysis; perf mem does it for memory analysis on Intel; the AMD equivalent (amduprof) does it on AMD. All four are worth keeping in the toolbox.
Common confusions
- "perf record -e cycles and perf record -e cycles:pp give the same data." They do not. The first uses interrupt-skid sampling and is unreliable at line level; the second uses PEBS and gives precise IPs. The function-level flame graph looks similar; the line-level annotation is dramatically different. Always use :pp when annotating.
- "PEBS doubles the overhead of profiling." No — PEBS amortises the interrupt cost across many samples (the buffer batches dozens of records before each interrupt), so PEBS at a moderate sample period is typically cheaper per sample than non-precise sampling, not more expensive. The overhead concern only kicks in at very small periods (-c 100 or smaller) where the buffer fills faster than the kernel can drain it.
- "IBS and PEBS measure the same thing." They both attribute events to instructions with hardware precision, but the sampling models differ. PEBS samples on counter overflow (biased toward the chosen event); IBS samples uniformly across retired ops (richer per-sample, less efficient for chasing a specific event). Pick the tool that matches the question.
- "Precise events are available for every event." No — only a subset of PMU events are PEBS-capable (Intel manuals list them per-uarch). Branch mispredicts, cache misses, store-buffer stalls, TLB misses, and memory loads/stores have precise variants on Skylake; many micro-architectural counters do not. Run perf list and look for [Precise event] in the output.
- ":p and :pp are the same thing." They are tiers of "precise". :p = "best-effort precise, fall back if not available"; :pp = "request precise IP"; :ppp = "demand precise IP, error out if unavailable". The differences matter when running on virtualised hardware where PEBS may be partially exposed.
- "PEBS samples in user space; kernel events need a different mechanism." PEBS captures samples for both user-space and kernel-space events; the only difference is whether kernel.perf_event_paranoid allows the kernel address to be reported back to the unprivileged process. Run as root (or set perf_event_paranoid=0) for kernel attribution.
Going deeper
How PEBS records get from the PMU into the kernel ring buffer
The PEBS hardware writes records into a per-CPU Debug Store (DS) memory area whose address is configured in IA32_DS_AREA. The kernel pre-allocates this area at PMU-setup time, points the PMU at it, and registers a buffer-overflow interrupt threshold ("interrupt me when the buffer is 80% full"). When an event causes a PEBS write, the hardware advances IA32_PEBS_INDEX; when the index hits the threshold, the kernel interrupt handler runs, copies records from the DS area into the perf ring buffer that user-space perf record is reading from, and resets IA32_PEBS_INDEX. The architectural property that matters: the DS-area write is uninterruptible from software — it is part of the microarchitectural retirement sequence — which is what makes the IP precise. AMD's IBS uses a slightly different mechanism (per-sample MSRs that user-space reads directly), but the principle is the same: hardware tags the relevant uop, captures its state at retirement, and software just drains the result.
Sample period selection — the math behind -c
Sample period N means "fire one PEBS record per N events". Total sample count over a T-second capture is N_samples ≈ (event_rate × T) / N. You want roughly 10⁴ to 10⁵ samples for good per-symbol confidence; for a workload with 10⁸ events/sec captured for 60 seconds, that's 10⁸ × 60 / N ≈ 6 × 10⁹ / N, solving for N ≈ 60,000 to 600,000. The default Intel period for mem_load_retired.l3_miss on perf record is 100003 (a prime, to avoid resonance with periodic event sources) — fine for most workloads. Halve it (-c 50000) for short captures; double it (-c 200000) when buffer-drain wakeups exceed 100 in a 60-second capture.
When PEBS lies — silent degradation modes
PEBS can silently degrade to non-precise sampling under specific conditions: (1) a HyperThreading sibling using the counter, (2) virtualised PEBS not fully exposed by the hypervisor, (3) the kernel's PEBS handler being preempted by a higher-priority interrupt and missing records. perf report --stats -i <data> prints per-record-type counts, including LOST; non-zero lost counts mean the interpretation of percentages is suspect. Always check this number after a capture; if it is more than 1% of total samples, raise the period or shorten the capture, then retry.
Reproduce this on your laptop
# Reproduce on Linux 5.4+ with a PEBS-capable Intel CPU (any post-2008)
sudo apt install linux-tools-common linux-tools-generic
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
sudo cpupower frequency-set --governor performance
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
sudo perf list | grep mem_load_retired # confirm Precise events available
sudo python3 pebs_demo.py
sudo perf mem report --input /tmp/perf.pebs.data --stdio | head -30
IBS-Fetch: the AMD-only trick for branch-prediction analysis
IBS-Fetch sampling is unique to AMD and has no Intel equivalent. Each IBS-Fetch sample includes the predicted-branch-target, the actual branch-target, the fetch-latency, the L1i-miss flag, and the iTLB-miss flag. This is the data you need to debug a frontend-bound workload — Intel's Top-Down methodology can tell you a workload is frontend-bound, but only IBS-Fetch tells you which branch instruction is the cause. The Cleartrip search team in 2024 used IBS-Fetch on an AMD c7a deploy to track a 6% throughput regression to a single indirect branch in their query-router; the fix (replacing a virtual-call dispatch with a switch statement) recovered the throughput. This kind of analysis is awkward to do on Intel (you can use Last Branch Records, but the LBR depth is limited to 16–32 entries on most parts, and LBR sampling has its own quirks).
Where this leads next
Hardware event sampling is the highest-resolution profile signal an engineer can capture. Together with the previous chapters in Part 5 — sampling vs instrumentation (/wiki/sampling-vs-instrumentation), perf from scratch (/wiki/perf-from-scratch), flame graphs (/wiki/flamegraphs-reading-them-and-making-them), differential flame graphs (/wiki/differential-flamegraphs) — PEBS and IBS round out the on-CPU diagnostic toolkit. The next chapter, continuous profiling in production (/wiki/continuous-profiling-in-production), shows how to keep these signals running 24/7 at low overhead so that a regression's root cause is already in your warehouse before the alert fires.
Two threads run forward from here. The first leads into Part 6, eBPF (/wiki/ebpf-the-kernel-as-an-observable-program), where a programmable hook in the kernel replaces some of the PMU's role for software-defined events — different mechanism, same goal of low-overhead high-resolution observability. The second leads into Part 12, hidden costs (/wiki/hidden-costs-tlb-misses-page-faults-syscalls), where the events PEBS measures (TLB misses, page faults, syscall exits) become the explicit subject of optimisation — once you can attribute them to source lines with PEBS, you can choose to reduce them.
The mental shift from this chapter forward: a CPU is not opaque. Every retired instruction can be described by a hardware-captured record that includes its IP, its data address, its latency, and the architectural state around it. Once you internalise this, "the profiler is lying" stops being your first reaction to a confusing measurement. The profiler is rarely lying; it is just being asked the wrong question, and PEBS / IBS let you ask the right one.
A useful organisational practice borrowed from the Razorpay performance team: every time a deploy is preceded by a benchmarking run, capture both a regular cycles profile and a mem_load_retired.l3_miss:pp profile, archive both into S3 with the deploy SHA in the key. When a regression alert fires later, the on-call engineer pulls both — the cycles profile for "where is time spent now vs before", the PEBS profile for "what specific addresses changed access pattern". The combined signal closes most regressions in under 20 minutes; either signal alone is incomplete. Cost: ~5 MB extra per deploy. Payoff: the kind of fast incident closure that turns an on-call shift from a stress test into routine work.
References
- Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B, §19.6 "Performance Monitoring (PEBS)" — the canonical PEBS specification; covers DS-area layout, EventingIP semantics, load-latency event definitions.
- AMD64 Architecture Programmer's Manual, Vol. 2, §13.3 "Instruction-Based Sampling" — IBS-Fetch and IBS-Op specification, including the per-sample MSRs and tagging mechanism.
- Brendan Gregg, Systems Performance (2nd ed., 2020), §6.4.4 "Sampling" — the textbook treatment of PEBS in context with regular profiling, with worked examples.
- Andi Kleen, "pmu-tools" GitHub repository — toplev.py builds a top-down microarchitecture analysis on top of PEBS; reading the source is the fastest way to understand precise-event semantics in practice.
- Linux kernel perf subsystem documentation, tools/perf/Documentation — the practical command reference for :pp, perf mem, perf c2c.
- Joe Mario, "perf c2c — Cache-to-Cache and Cache-Line Contention Analysis" — the foundational write-up of perf c2c, the false-sharing diagnostic built on PEBS.
- /wiki/perf-from-scratch — the prerequisite chapter on perf mechanics; PEBS is one event mode among several that perf exposes.
- /wiki/cache-coherence-mesi-moesi — the coherence model that produces the HitM samples PEBS surfaces in perf c2c.