perf from scratch
Karan's Hotstar catalogue service is at p99 = 1.4 s during the IPL playoffs and the dashboard says CPU is 73% on every pod. He SSHes in, types perf top, sees __memmove_avx_unaligned_erms at 22% and panics. He runs perf record -F 99 -p $(pgrep catalogue) -g -- sleep 30, then perf report, and the column headers — Overhead, Children, Self, Command, Shared Object, Symbol — are simultaneously informative and impenetrable. Which column is the percent of CPU? Why does Children equal 100% on [unknown]? Why does the same function show up under libc-2.31.so and [kernel.kallsyms]? Karan needs an answer in the next 9 minutes — that's how long the playoff break is. This chapter rebuilds perf from the syscall up, so the column headers stop being a riddle.
perf is a thin user-space wrapper over the perf_event_open(2) syscall, which programs hardware performance counters and software timers in the kernel and streams sample records into a per-CPU mmap'd ring buffer. Every perf subcommand — stat, record, report, top, script, annotate — is a different way to drive that one syscall. Learn the syscall and the subcommands stop being magic; learn the ring buffer's overflow semantics and you stop being surprised when samples vanish during traffic spikes.
What perf actually is — one syscall, six tools
perf ships in the Linux kernel source tree (tools/perf/), not as a separate package. It is a user-space CLI built on a single kernel facility: perf_event_open(2). That syscall takes a struct perf_event_attr describing what to count (cycles, instructions, branch misses, cache references, page faults, context switches, ...) and how to count it (counting mode, sampling mode, with or without stack capture), and returns a file descriptor. You read(2) the fd to get a counter value; you mmap(2) it to get a ring buffer of sample records; you ioctl(2) it to enable, disable, reset, or refresh.
Every perf subcommand is a recipe over that fd:
- perf stat opens counters in counting mode, runs your command, reads the counters, divides, prints a table.
- perf record opens counters in sampling mode, mmap's the ring buffer, runs your command (or attaches to a pid), drains samples to perf.data.
- perf report reads perf.data, resolves PCs to symbols, aggregates by stack, prints a TUI tree.
- perf top is perf record + perf report continuously refreshed, in-memory, with no on-disk file.
- perf script reads perf.data and prints one line per sample — the format flame-graph generators consume.
- perf annotate reads perf.data, finds samples that landed in a specific function, and prints the function's instructions interleaved with sample-attributed percentages.
Why this layering matters in production: when a perf record run produces a flamegraph that looks wrong, the bug can be in the live half (sample rate too low, ring buffer overflow, missing call-graph unwinder) or the offline half (perf script's output filtered the wrong way, the flamegraph generator collapsed stacks aggressively, symbol resolution failed because /tmp/perf-<pid>.map was missing for a JIT). Knowing which side of the line the bug lives on cuts the diagnosis time in half. The recurring pattern at Razorpay's reliability team: when a flamegraph has a fat [unknown] bar, the live half is fine and the offline half couldn't resolve PCs — the fix is --symfs or installing debug symbols, not re-running the profile.
The split also explains why some flags only work at record time (-F, -e, -g, -c, --call-graph dwarf|fp|lbr) and others only at report time (--sort, --no-children, --stdio, -i, --symfs). Live flags configure the syscall; offline flags configure how perf.data is read back. Mixing them up — perf record --sort or perf report -F 999 — produces silently wrong commands that perf does not always reject.
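Because perf does not always reject a misplaced flag, a runbook can lint the command before running it. A minimal sketch — the flag lists below mirror the ones named above and are deliberately not exhaustive:

```python
# Illustrative lint: catch record-time flags passed to report (and vice
# versa) before they produce a silently wrong command. Not exhaustive.
RECORD_FLAGS = {"-F", "-e", "-g", "-c", "--call-graph", "-m", "--realtime"}
REPORT_FLAGS = {"--sort", "--no-children", "--stdio", "-i", "--symfs",
                "--percent-limit"}

def lint_perf_cmd(argv):
    """Return the flags in argv that belong to the *other* subcommand."""
    sub = argv[1] if len(argv) > 1 else ""
    wrong = (REPORT_FLAGS if sub == "record"
             else RECORD_FLAGS if sub == "report" else set())
    return [a for a in argv[2:] if a.split("=")[0] in wrong]

print(lint_perf_cmd(["perf", "record", "--sort", "overhead"]))  # ['--sort']
print(lint_perf_cmd(["perf", "report", "-F", "999"]))           # ['-F']
```

A check like this is cheap insurance in automation that builds perf command lines from templates.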
How perf_event_open actually programs the hardware
The syscall takes five arguments — attr, pid, cpu, group_fd, flags — and the attr struct is where every interesting design decision lives. For sampling mode, the kernel programs a hardware counter to overflow after a configured number of events (sample_period) or at a configured frequency (sample_freq), and on overflow it raises a non-maskable interrupt (NMI). The NMI handler captures the program counter, walks the user stack via the configured unwinder, and writes a PERF_RECORD_SAMPLE record into the per-CPU ring buffer.
Three things in attr matter most:
- type and config select the event source. type = PERF_TYPE_HARDWARE, config = PERF_COUNT_HW_CPU_CYCLES is the cycle counter. type = PERF_TYPE_SOFTWARE, config = PERF_COUNT_SW_CPU_CLOCK is the wall-clock-driven software timer that fires regardless of CPU activity. The choice matters: a hardware cycle counter only ticks while the CPU is executing; the software clock is timer-driven. For "what is on-CPU", use hardware cycles. For off-CPU work — threads blocked on locks, I/O, or sleep — neither is enough; you need scheduler tracepoints, covered in the next chapter.
- sample_freq vs sample_period is the rate question. sample_freq = 99 means "auto-tune the period so we get 99 samples per second per CPU"; sample_period = 1000000 means "fire every 1M cycles". freq is what perf record -F 99 selects; period is what -c 1000000 selects. Frequency adapts to load (low CPU load = larger period); period is fixed.
- sample_type is a bitmask of what each sample record contains: PERF_SAMPLE_IP (instruction pointer), PERF_SAMPLE_CALLCHAIN (frame-pointer-walked stack), PERF_SAMPLE_STACK_USER (raw stack bytes for DWARF unwinding to do later), PERF_SAMPLE_BRANCH_STACK (LBR — Last Branch Record, the hardware's record of the last 16–32 branches), PERF_SAMPLE_TIME, PERF_SAMPLE_TID, PERF_SAMPLE_CPU. The choice of unwinder lives here: --call-graph fp sets CALLCHAIN, --call-graph dwarf sets STACK_USER plus a fixed dump size (default 8192 bytes), --call-graph lbr sets BRANCH_STACK. Each has trade-offs.
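The freq-vs-period relationship is simple arithmetic. A sketch — this mirrors the idea, not the kernel's actual adaptive algorithm, which continuously re-estimates the period from recent event rates:

```python
# Back-of-envelope: with sample_freq, the kernel tunes the period so that
# counter overflows land at the requested rate. On a core running flat out
# at 3 GHz, -F 99 implies a period of ~30M cycles; on a core that is busy
# only 3% of the time, the period shrinks ~30x so the sample *rate* stays
# at 99 Hz. (Illustrative numbers, not a kernel implementation.)
def implied_period(cycles_per_sec: float, sample_freq: int) -> int:
    return int(cycles_per_sec / sample_freq)

print(implied_period(3_000_000_000, 99))  # saturated core: ~30.3M cycles
print(implied_period(90_000_000, 99))     # 3%-busy core: ~909k cycles
```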
Why the buffer size is the silent killer of long flamegraphs: a service handling a Hotstar IPL traffic burst can briefly hit 99% CPU on every core, which means every NMI fires on time and the ring buffer fills at peak rate. If perf record is fighting for CPU with the very service it is profiling — common on a CPU-saturated pod — the user-space drainer falls behind, the head laps the tail, and the kernel starts emitting PERF_RECORD_LOST records. The flamegraph is now missing exactly the high-load moments you wanted to investigate. The fix is either -m 16M (a bigger buffer), running perf record at real-time priority (--realtime 99), or — cleanest on production pods — flight-recorder mode: perf record --overwrite records continuously into a circular buffer, and --switch-output flushes it only when you signal perf on an alert.
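The overflow risk is quantifiable. A back-of-envelope sizing model, with assumed per-sample sizes (the exact record size depends on the sample_type bits chosen; these numbers are illustrative, not measured):

```python
# Rough sizing math: how long can the user-space drainer stall before the
# ring buffer laps itself? Assumed sizes: ~512 B/sample with frame-pointer
# stacks, ~8 KB/sample with --call-graph dwarf's default stack dump.
def seconds_until_overflow(buf_bytes, samples_per_sec, bytes_per_sample):
    return buf_bytes / (samples_per_sec * bytes_per_sample)

mb = 1 << 20
# 99 Hz x 64 CPUs, frame-pointer stacks: a few seconds of slack.
print(f"fp:    {seconds_until_overflow(16 * mb, 99 * 64, 512):.2f}s")
# Same rate with dwarf's 8 KB dumps: the slack shrinks ~16x.
print(f"dwarf: {seconds_until_overflow(16 * mb, 99 * 64, 8192):.2f}s")
```

The point of the exercise: under load, the drainer's allowed stall time can drop below a scheduler quantum, at which point losses are guaranteed without a bigger buffer or higher priority.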
The kernel's perf_event_open(2) man page is honest about this: under "BUGS" it notes that on heavily loaded systems samples can be lost silently if the buffer is too small. perf record prints [ perf record: Woken up X times to write data ] and [ Lost N samples ] at the end of a run; if the second number is nonzero, the flamegraph has gaps. Always check those lines.
A from-scratch sampler — perf_event_open in 80 lines of Python via ctypes
The fastest way to internalise what perf is doing is to write the syscall yourself. The script below opens a PERF_TYPE_SOFTWARE / PERF_COUNT_SW_CPU_CLOCK event in counting mode, reads it before and after a Python loop, and prints the result. There is no perf binary involved; this is the syscall, raw.
# perf_event_from_scratch.py — open a perf event via ctypes, no perf binary.
# Demonstrates: this is what perf stat is doing under the hood, just with
# more event sources and prettier formatting.
#
# Tested on Linux 5.15 / 6.6, x86_64. Usually needs no privileges: the
# default kernel.perf_event_paranoid=2 still allows user-space events on
# the calling process, which is all this script opens. If the open fails
# with EACCES/EPERM: sudo sysctl kernel.perf_event_paranoid=-1
import ctypes, ctypes.util, os, struct, time
# Constants from <linux/perf_event.h>. Values are stable across kernels
# precisely because perf is part of the kernel ABI.
PERF_TYPE_HARDWARE = 0
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_COUNT_SW_CPU_CLOCK = 0
PERF_COUNT_SW_TASK_CLOCK = 1
# struct perf_event_attr — only the fields we need. Real struct is 120+ bytes;
# we lay out enough to set type, size, config, and a couple of flag bits.
class PerfEventAttr(ctypes.Structure):
_fields_ = [
("type", ctypes.c_uint32),
("size", ctypes.c_uint32),
("config", ctypes.c_uint64),
("sample_period", ctypes.c_uint64), # union with sample_freq
("sample_type", ctypes.c_uint64),
("read_format", ctypes.c_uint64),
("flags", ctypes.c_uint64), # disabled, inherit, exclude_kernel, ...
("wakeup_events", ctypes.c_uint32),
("bp_type", ctypes.c_uint32),
("bp_addr", ctypes.c_uint64),
("bp_len", ctypes.c_uint64),
("padding", ctypes.c_byte * 64), # rest of struct, zeroed
]
# perf_event_open is syscall 298 on x86_64.
SYS_perf_event_open = 298
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
def perf_event_open(attr, pid, cpu, group_fd, flags):
fd = libc.syscall(SYS_perf_event_open, ctypes.byref(attr),
pid, cpu, group_fd, flags)
if fd < 0:
err = ctypes.get_errno()
raise OSError(err, os.strerror(err))
return fd
def measure(event_type: int, event_config: int, label: str, work_fn):
attr = PerfEventAttr()
attr.type = event_type
attr.size = ctypes.sizeof(PerfEventAttr)
attr.config = event_config
attr.flags = (1 << 0) # disabled = 1, we'll enable after fork-equivalent
fd = perf_event_open(attr, 0, -1, -1, 0) # pid=0 → self, cpu=-1 → any CPU
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_DISABLE = 0x2401
libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
t0 = time.perf_counter()
work_fn()
elapsed = time.perf_counter() - t0
libc.ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)
raw = os.read(fd, 8)
count = struct.unpack("Q", raw)[0]
os.close(fd)
print(f" {label:>30s} = {count:>15,d} ({count/elapsed:>12,.0f}/sec)")
return count
def hot_loop():
# 5M Python iterations. Real CPU work, mostly bytecode dispatch.
s = 0
for i in range(5_000_000):
s += (i * 2654435761) & 0xFFFFFFFF
return s
print("=== Razorpay payments-API style hot loop, 5M iterations ===")
measure(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, "cycles", hot_loop)
measure(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, "instructions", hot_loop)
measure(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK, "cpu_clock_ns", hot_loop)
measure(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK, "task_clock_ns", hot_loop)
# Sample run on c6i.large (Ice Lake, kernel 5.15.0-1051-aws):
=== Razorpay payments-API style hot loop, 5M iterations ===
cycles = 842,613,229 (1,879,041,584/sec)
instructions = 1,610,448,712 (3,591,180,520/sec)
cpu_clock_ns = 448,213,901 ( 999,847,392/sec)
task_clock_ns = 448,021,118 ( 999,418,221/sec)
# Quick math the reader can do in their head:
# IPC = instructions / cycles = 1.61e9 / 8.43e8 = 1.91
# wall = cpu_clock_ns ≈ 0.448 s
# Python is doing 1.6 billion instructions to run 5M iterations of a
# one-line arithmetic loop — i.e. ~320 instructions per Python iteration.
# That's the bytecode dispatch tax: most of those instructions are CPython's
# ceval.c switch dispatch, not the i*2654435761 work itself.
The walk-through. perf_event_open(attr, 0, -1, -1, 0) is the entire perf syscall — pid=0 means "this process", cpu=-1 means "whichever CPU it runs on", group_fd=-1 means "not part of a counter group", flags=0 means "default semantics". Every perf subcommand calls this with different attr fields. PERF_EVENT_IOC_ENABLE / _DISABLE are how perf stat's --delay and --interval-print work internally — they toggle the counter on and off without closing it. os.read(fd, 8) returns an 8-byte native-endian counter value when read_format = 0; read_format has bits to add total_time_enabled, total_time_running, and per-counter values for grouped reads, which is how perf stat reports counters that were multiplexed because there weren't enough hardware counter slots. The IPC of 1.91 is the same number perf stat -- python3 hot.py would report, derived the exact same way. Why the IPC is so high for an interpreter loop: each Python bytecode is a chain of dependent loads (read opcode, jump to handler, read operand, decode, push result), but a modern out-of-order Intel core has enough execution ports and reorder depth to overlap the address calculation of the next opcode with the integer ALU work of the current one. The interpreter dispatch is not as serialised as it looks in C — modern OoO engines extract IPC from chains that look serial in source.
The whole perf stat UX is this script plus pretty formatting plus an event-name lookup table plus group-counter logic for read_format. There's no magic.
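As a concrete illustration of that lookup table, here is a sliver of it — the (type, config) values are the real <linux/perf_event.h> constants used in the script above; the real table in tools/perf is far larger and partly generated per-PMU:

```python
# What `perf stat -e <name>` does before touching the syscall: map the
# event name to the (attr.type, attr.config) pair perf_event_open wants.
EVENT_TABLE = {
    # name: (attr.type, attr.config)
    "cycles":       (0, 0),  # PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES
    "instructions": (0, 1),  # PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS
    "cpu-clock":    (1, 0),  # PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK
    "task-clock":   (1, 1),  # PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK
}

def resolve_event(name: str):
    if name not in EVENT_TABLE:
        raise ValueError(f"unknown event {name!r} — see `perf list`")
    return EVENT_TABLE[name]

print(resolve_event("cycles"))      # (0, 0)
print(resolve_event("task-clock"))  # (1, 1)
```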
Reading perf report — the columns demystified
After perf record -F 99 -g -p $(pgrep catalogue) -- sleep 30, you run perf report --stdio --no-children and see something like this:
# Total Lost Samples: 0
# Samples: 2K of event 'cycles'
# Event count (approx.): 2487163091
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ....................
#
18.41% catalogue [kernel.kallsyms] [k] copy_user_enhanced_fast_string
14.22% catalogue libc-2.31.so [.] __memmove_avx_unaligned_erms
9.07% catalogue catalogue [.] catalogue::serialize::serialize_response
7.83% catalogue catalogue [.] tokio::runtime::scheduler::run_task
5.91% catalogue [kernel.kallsyms] [k] futex_wake
4.62% catalogue catalogue [.] hashbrown::HashMap::get
...
Overhead is the column you want. It is the percentage of samples that landed in this symbol (Self time). Command is the process name from the kernel's task_struct->comm. Shared Object is the file the symbol came from — a .so, the main binary, or [kernel.kallsyms] for kernel functions. The [k] and [.] markers distinguish kernel and user-space symbols. Symbol is the resolved function name; if symbol resolution failed you get [unknown] or a hex address.
The --no-children flag collapses the display to a single Self-based Overhead column. Without it, perf report shows two percentages: Children (samples in this function or anything it called) and Self (samples in this function only). For a typical service, main will have Children = 100% and Self = 0.001% because almost no time is spent in main itself but main transitively contains all work. Children-time is what flamegraphs visualise as frame width; Self-time is what optimisation targets care about. Showing both in a TUI is overwhelming; --no-children is what most engineers eventually settle on for the first pass.
The rule: a symbol with high Self% but low Children% (or --no-children showing a high Overhead%) is a leaf. Optimising it directly pays off. A symbol with high Children% but low Self% is a hub — optimising it requires looking at what it calls. __memmove_avx_unaligned_erms is always a leaf; tokio::runtime::scheduler::run_task is always a hub. Karan's __memmove_avx_unaligned_erms at 14% means the catalogue service is moving memory around — likely deserialising a payload, copying a buffer, reallocating a vector — and the optimisation lives in whoever called memmove, not in memmove itself. perf report lets you press Enter on that line to see the call chains that lead to it.
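The leaf/hub rule can be written down as a checkable heuristic — the thresholds below are judgment calls of this sketch, not perf semantics:

```python
# Leaf vs hub, per the rule above. A leaf is worth optimising directly;
# a hub points at its callees. Thresholds are illustrative assumptions.
def classify(self_pct: float, children_pct: float) -> str:
    if self_pct >= 5 and (children_pct - self_pct) < 5:
        return "leaf"   # nearly all of its time is its own
    if children_pct >= 20 and self_pct < 2:
        return "hub"    # big subtree, negligible own work
    return "neither"

print(classify(self_pct=14.2, children_pct=15.0))  # memmove-like → leaf
print(classify(self_pct=0.4,  children_pct=71.0))  # run_task-like → hub
```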
# parse_perf_report.py — wrap perf report and pull out the diagnosis.
# This is what a production runbook should call: don't make Karan eyeball
# the TUI at 02:30 when the playoff is over and traffic is recovering.
import subprocess, re, json, sys
def perf_record_and_report(pid: int, seconds: int = 30, freq: int = 99) -> dict:
# Step 1 — record. -g for call graph (frame pointers), -F for sample rate.
record = subprocess.run(
["perf", "record", "-F", str(freq), "-g", "-p", str(pid),
"-o", f"/tmp/perf_{pid}.data", "--", "sleep", str(seconds)],
capture_output=True, text=True)
# perf record prints sample stats to stderr.
sample_match = re.search(r'(\d+) samples', record.stderr)
lost_match = re.search(r'Lost (\d+)', record.stderr)
samples = int(sample_match.group(1)) if sample_match else None
lost = int(lost_match.group(1)) if lost_match else 0
# Step 2 — report, plain text, no children, top 20 only.
report = subprocess.run(
["perf", "report", "-i", f"/tmp/perf_{pid}.data",
"--stdio", "--no-children", "--percent-limit", "0.5"],
capture_output=True, text=True)
# Step 3 — parse the table. Columns: Overhead Command DSO Symbol.
rows = []
for line in report.stdout.splitlines():
        # Note: [.] marks user-space symbols, so the marker class must be [k.]
        m = re.match(r'\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[([k.])\]\s+(.+)$', line)
if not m: continue
overhead, comm, dso, kind, sym = m.groups()
rows.append({"overhead_pct": float(overhead), "comm": comm,
"dso": dso, "kind": "kernel" if kind == "k" else "user",
"symbol": sym.strip()})
return {"samples": samples, "lost": lost,
"lost_pct": (lost / samples * 100) if samples else None,
"top": rows[:10]}
def diagnose(report: dict) -> str:
if report["samples"] is None:
return "perf record produced no samples — wrong pid? service idle?"
if report["lost_pct"] and report["lost_pct"] > 5:
return (f"WARNING: {report['lost_pct']:.1f}% of samples lost — "
f"increase ring buffer with -m 16M or run with --realtime 99")
leaf_kernel = sum(r["overhead_pct"] for r in report["top"]
if r["kind"] == "kernel")
leaf_libc = sum(r["overhead_pct"] for r in report["top"]
if "libc" in r["dso"])
leaf_user = 100 - leaf_kernel - leaf_libc
return (f"Top symbol: {report['top'][0]['symbol']} "
f"({report['top'][0]['overhead_pct']:.1f}%). "
f"Kernel time {leaf_kernel:.1f}%, libc {leaf_libc:.1f}%, "
f"app code {leaf_user:.1f}%. ")
if __name__ == "__main__":
pid = int(sys.argv[1])
rep = perf_record_and_report(pid, seconds=30)
print(json.dumps(rep, indent=2))
print("\nDiagnosis:", diagnose(rep))
# Sample run on Karan's catalogue pod (pid 18419) during IPL playoff:
{
"samples": 2731,
"lost": 0,
"lost_pct": 0.0,
"top": [
{"overhead_pct": 18.41, "comm": "catalogue", "dso": "[kernel.kallsyms]",
"kind": "kernel", "symbol": "copy_user_enhanced_fast_string"},
{"overhead_pct": 14.22, "comm": "catalogue", "dso": "libc-2.31.so",
"kind": "user", "symbol": "__memmove_avx_unaligned_erms"},
{"overhead_pct": 9.07, "comm": "catalogue", "dso": "catalogue",
"kind": "user", "symbol": "catalogue::serialize::serialize_response"},
...
]
}
Diagnosis: Top symbol: copy_user_enhanced_fast_string (18.4%).
Kernel time 24.3%, libc 14.2%, app code 61.5%.
The walk-through. perf record -F 99 -g produces 99 samples/sec/CPU with frame-pointer call graphs — the cheapest stack-walking option, which requires -fno-omit-frame-pointer at compile time. perf report --no-children --percent-limit 0.5 suppresses the children column and hides anything below 0.5% overhead; this is what makes the output skimmable. re.match(r'\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[([k.])\]\s+(.+)$', line) is the parsing trick: perf report --stdio is stable enough across versions to regex-parse (note the [k.] class — the user-space marker is a literal dot), but if the format ever shifts, use perf script (one line per sample) instead of perf report (aggregated). The diagnosis line answers Karan's actual question: 24.3% kernel, 14.2% libc memmove, 61.5% app — meaning the service is spending one-quarter of its CPU in kernel I/O paths (copy_user_enhanced_fast_string is the network/socket copy primitive), so the real fix is reducing per-request payload size or batching the response, not micro-optimising the serialiser. Why copy_user_enhanced_fast_string matters as a kernel-side hint: it is what copy_to_user ends up calling on x86-64 parts with enhanced REP MOVSB, and it appears in flamegraphs whenever the service is moving data across the kernel/user boundary. 18% of CPU there means this service is not compute-bound; it is bandwidth-bound on the syscall boundary. The optimisation is fewer, bigger writes — sendfile, splice, io_uring with registered buffers — not faster serialisation.
When perf lies — three traps and how to spot them
Three failure modes of perf cause production teams to misdiagnose flamegraphs. All three are visible in the output if you know what to look for.
Trap 1: missing frame pointers, garbled stacks. If the binary was compiled with -O2 (the usual release default) and without -fno-omit-frame-pointer, the frame-pointer unwinder inside the kernel walks garbage and produces stacks that look plausible but are wrong. Symptoms: perf report shows [unknown] frames, fat unrelated libraries near the top, or stacks shorter than the actual call depth. Fix: rebuild with -fno-omit-frame-pointer, or switch to --call-graph dwarf (uses DWARF debug info to unwind; larger samples, slower) or --call-graph lbr (uses Intel's hardware Last Branch Record; fastest, but only the last 16–32 frames). On Rust or C++ binaries built with default release flags, lbr is the safest production choice: the binary doesn't need rebuilding, and the depth limit is enough for most flamegraphs that aren't deeply recursive. Go binaries keep frame pointers by default (since Go 1.7 on amd64), so fp just works there.
Trap 2: JIT'd languages and [unknown] symbols. A JVM, V8, PyPy, or .NET runtime emits machine code at runtime that has no entry in any .so's symbol table. perf sees those PCs and cannot resolve them; the flamegraph shows [unknown] exactly where the hot path is. The fix is /tmp/perf-<pid>.map — a text file the runtime is supposed to write, listing <addr> <size> <symbol> for each JITted method. Java's async-profiler and perf-map-agent write it (run the JVM with -XX:+PreserveFramePointer so the stacks are walkable too); perf's own JVMTI agent (libperf-jvmti.so, loaded via -agentpath) takes the related jitdump route instead; V8/Node writes the map with --perf-basic-prof. Without that file, perf is blind to the runtime.
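The map format itself is trivial — hex address, hex size, symbol name, one JITted method per line — and so is the lookup perf does with it. A minimal resolver (sketch: perf's real code also sorts the entries and binary-searches; the symbol names here are made up):

```python
# Parse a /tmp/perf-<pid>.map-style file and resolve a PC to a symbol,
# the way perf resolves samples that land in anonymous executable memory.
def load_perf_map(text: str):
    entries = []
    for line in text.splitlines():
        addr, size, name = line.split(maxsplit=2)  # name may contain spaces
        entries.append((int(addr, 16), int(size, 16), name))
    return entries

def resolve(entries, pc: int) -> str:
    for addr, size, name in entries:
        if addr <= pc < addr + size:
            return name
    return "[unknown]"

jit_map = ("7f3a00001000 180 Lcom/hotstar/Catalogue;::serialize\n"
           "7f3a00001200 80 Lcom/hotstar/Catalogue;::lookup")
m = load_perf_map(jit_map)
print(resolve(m, 0x7f3a00001042))  # lands inside serialize
print(resolve(m, 0x7f3a00009999))  # outside every range → [unknown]
```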
Trap 3: lost samples during the spike you cared about. The [ Lost N samples ] line at the end of perf record's output is non-negotiable — if N is more than ~1% of total samples, the flamegraph has gaps in exactly the high-load moments you wanted to see (because that's when the ring buffer overflows). Fixes in order of cost: -m 16M (raise the per-CPU buffer), --realtime 99 (let perf record preempt other work), --overwrite plus --switch-output (record into a circular buffer, dump on SIGUSR2 — the flight-recorder pattern production continuous profilers use).
There is a fourth trap, less common: counter multiplexing. Hardware PMUs have a fixed number of counter slots — typically 4 general-purpose plus 3 fixed on Intel x86. If you ask perf stat for more events than fit, the kernel time-multiplexes them, running each for a fraction of the wall time and scaling the count up. The scaled values are estimates. perf stat shows time enabled and time running per event; if enabled / running > 1.0 the count was scaled. Most of the time this is fine, but for short-running benchmarks with many requested events, the scaling noise can dominate. Solution: ask for fewer events per run, or pin events to fixed counters where possible.
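The un-scaling arithmetic behind multiplexing is small: perf scales the raw count by the ratio of time the event was enabled to time it actually ran on a counter (the PERF_FORMAT_TOTAL_TIME_ENABLED / _RUNNING values from read_format). The formula is the documented one; the numbers below are made up:

```python
# How perf stat estimates a multiplexed counter's "true" count.
def descale(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> int:
    if time_running_ns == 0:
        return 0  # the event was never scheduled onto a counter
    return int(raw_count * time_enabled_ns / time_running_ns)

# Event got hardware time for only 25% of the window → count scaled 4x.
# perf stat shows this as a "(25.00%)" annotation next to the value.
print(descale(1_000_000, time_enabled_ns=400_000, time_running_ns=100_000))
```

This is why short benchmarks with many events are noisy: the smaller the running window, the larger and shakier the extrapolation.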
Common confusions
- "
perf topis the live version ofperf record." They share the syscall but differ structurally:perf topre-aggregates a sliding window of samples in-memory and refreshes the screen, whereasperf recordwrites every sample to disk.perf topis a triage tool — "what's hot right now?";perf recordis the diagnostic tool — "let me capture 30 seconds and analyse it later". On a saturated production pod,perf topitself contributes to CPU pressure because of the constant TUI redraws;perf record --snapshotis the production-safe alternative. - "
perf statandperf recordmeasure the same thing." No.statis counting mode — read counter, run, read counter, subtract; you get exact totals but no per-function breakdown.recordis sampling mode — interrupt N times per second, capture stack; you get per-function breakdown but counts are statistical estimates. Usestatfor "did my optimisation reduce cache misses by 30%"; userecordfor "where are those cache misses coming from". - "
perf record -F 99is fine for short benchmarks." Not always. A 100 ms benchmark at 99 Hz produces 9.9 samples per CPU, which is statistical noise. For sub-second profiles, raise to-F 999or-F 4000; for multi-minute production runs,-F 99is correct and-F 999is wasteful. - "Kernel symbols mean my code is doing something kernel-y." Often, but not always.
[kernel.kallsyms]symbols appear because the sample landed in kernel mode at the time of the NMI; that can be because your code made a syscall, but it can also be because of a timer interrupt, page fault, or scheduler tick that happened to coincide with the sample. Use-e cycles:u(user-only) to filter out kernel time entirely if you want a pure user-space profile. - "
perf reportpercentages add up to 100%." They add up to 100% across all samples in the file, but if you filtered (--comm,-c,--cpu) or set--percent-limit, the displayed rows sum to less. Always check the# Samples:and# Event count (approx.):headers at the top of the report — they tell you the denominator. - "
perfis Linux-only because ofperf_event_open." Yes — but the analogues exist: macOS hasdtrace(and Instruments wrapping it), Windows has ETW (andxperf/ Windows Performance Analyzer wrapping it), Solaris haddtracefirst. The mental model — counters and ring buffers and per-CPU sampling — transfers; only the syscall and CLI change.
Going deeper
What perf list is actually showing — and why half of it is unsupported on your CPU
perf list enumerates every event the kernel knows the name of: cycles, instructions, cache-misses, but also mem_load_retired.l1_miss, frontend_retired.dsb_miss, cycle_activity.stalls_l3_miss, and dozens of micro-architecture-specific events. The names come from two sources: the kernel's hard-coded event list (the "generic" events at the top of the output) and JSON files shipped in tools/perf/pmu-events/arch/<arch>/<vendor>/<microarch>.json. When perf list runs, it reads /sys/devices/cpu/caps/pmu_name to identify the microarchitecture, then loads the matching JSON. If you're on AWS Graviton (ARM Neoverse) the list is different from Intel Ice Lake, which is different from AMD Zen 3. The events you read about in an Intel optimisation guide may not exist on the CPU you're profiling. Worse, an event with the same name may have a slightly different meaning on a different microarch. The defence is to always cross-reference perf list output against the vendor's optimisation manual for the specific part — Intel's "Optimization Reference Manual", AMD's "Software Optimization Guide", Arm's "Neoverse Performance Analysis Methodology". The Razorpay reliability team learned this the hard way during a 2024 migration from Intel-based EC2 to Graviton: a counter their capacity dashboard relied on (longest_lat_cache.reference) didn't exist on Neoverse N1, and their automated capacity script silently produced zero values for two weeks before someone noticed.
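A defensive pattern that would have caught the Graviton gap: before a dashboard depends on an event name, verify the current host's PMU actually exposes it. A tolerant sketch — the parsing of perf list output and the event names in the example are assumptions, not a spec:

```python
# Check that every event a dashboard needs appears in this host's
# `perf list` output; fail loudly instead of charting zeros for weeks.
def supported_events(perf_list_output: str) -> set:
    names = set()
    for line in perf_list_output.splitlines():
        token = line.strip().split(" ")[0]
        if token and not token.startswith("#"):
            names.add(token.lower())
    return names

def missing_events(required, perf_list_output):
    have = supported_events(perf_list_output)
    return sorted(e for e in required if e.lower() not in have)

# Hypothetical Graviton host: ARM event names, no Intel L3 event.
graviton_list = "cycles\ninstructions\nl2d_cache_refill\n"
print(missing_events(["cycles", "longest_lat_cache.reference"],
                     graviton_list))  # ['longest_lat_cache.reference']
```

In production this would wrap `subprocess.run(["perf", "list", "--no-desc"], ...)` and abort (or alert) when the list is non-empty.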
perf record --call-graph — three unwinders, three trade-offs
The frame-pointer unwinder (fp) is the cheapest: it walks rbp through stack frames at NMI time, inside the kernel, in roughly 1 µs per stack. It needs the binary built with -fno-omit-frame-pointer. Most distributions ship libraries without frame pointers (omitting them has been the GCC optimised-build default for well over a decade), so production stacks have garbage above any libc or libstdc++ frame. The DWARF unwinder (dwarf) dumps a fixed slice of the user stack (default 8 KB) into each sample record and resolves it offline using DWARF debug info; this works without rebuilding anything but produces samples 50–100× larger and roughly 5× slower. The LBR unwinder (lbr) reads Intel's hardware Last Branch Record buffer (32 entries on Skylake and later, 16 on older parts like Haswell); it's almost free but limited in depth. Production engineers at Flipkart's Big Billion Days team standardised on lbr for Java services (works without rebuilding, and the depth limit is enough for the catalogue API's stacks), fp for Go services (Go always preserves frame pointers since 1.7), and dwarf for one-off investigations on third-party binaries they can't rebuild. The mistake to avoid: don't use dwarf on a high-throughput production process — the 50× sample-size inflation will overflow your ring buffer and you'll lose more than you capture.
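The "50× inflation" claim is easy to sanity-check with rough arithmetic. Assumed per-sample sizes below (a small fixed header plus the unwinder's payload) are illustrative, not measured:

```python
# Approximate perf.data volume for a 30 s, 99 Hz, 64-CPU profile under
# each unwinder. Assumptions: ~64 B record header; fp stores ~8 B/frame;
# lbr stores ~24 B/branch-record; dwarf dumps a fixed 8 KB stack slice.
def perf_data_mb(duration_s, hz, ncpu, bytes_per_sample):
    return duration_s * hz * ncpu * bytes_per_sample / (1 << 20)

for name, size in [("fp (~40 frames)",  64 + 40 * 8),
                   ("lbr (32 entries)", 64 + 32 * 24),
                   ("dwarf (8 KB dump)", 64 + 8192)]:
    print(f"{name:>18s}: {perf_data_mb(30, 99, 64, size):8.1f} MB")
```

The dwarf row lands in the gigabyte range for the same profile that fp keeps under 100 MB — which is exactly why dwarf overruns ring buffers on busy hosts.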
perf script — the format flamegraph generators eat
perf script reads perf.data and emits one block of text per sample: a header line with comm/pid time: period eventname: followed by the call stack, one frame per line, indented. Brendan Gregg's flamegraph.pl consumes this format directly: perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. The collapse step turns the multi-line per-sample format into one line per unique stack with a count, which is what flamegraph.pl actually wants. Knowing this is what unlocks debugging when flamegraphs go wrong: you can grep perf script output for a specific function, count how many samples mention it, and verify the flamegraph proportion matches by hand. When a flamegraph at Hotstar showed __memmove_avx_unaligned_erms at 22% but perf script | grep memmove | wc -l said only 4% of samples mentioned it, the discrepancy was because the collapse script was treating two slightly-different stacks as different (one had [stripped] for an unresolved frame); fixing the symbol resolution merged them. Always sanity-check a flamegraph against perf script | grep.
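The collapse transform is small enough to sketch. This toy version handles only the core shape — header line, indented frames innermost-first, blank line between samples — whereas the real stackcollapse-perf.pl copes with many more record variants; the sample text is invented:

```python
# Toy stackcollapse: perf-script-style blocks in, "a;b;c count" lines out —
# the one-line-per-unique-stack format flamegraph.pl consumes.
from collections import Counter

def collapse(perf_script_text: str) -> list:
    stacks, frames = Counter(), []
    for line in perf_script_text.splitlines() + [""]:
        if not line.strip():                # blank line ends a sample
            if frames:
                stacks[";".join(reversed(frames))] += 1  # emit root-first
                frames = []
        elif line.startswith(("\t", " ")):  # indented → one stack frame
            parts = line.split()            # "55f2a3 func (/path/bin)"
            frames.append(parts[1] if len(parts) > 1 else parts[0])
        # non-indented, non-blank lines are sample headers; skip them
    return [f"{s} {n}" for s, n in sorted(stacks.items())]

sample = ("app 181 1.0: cycles:\n"
          "\t55f2a3 memmove (/lib/libc.so)\n"
          "\t55f100 serialize (/bin/app)\n"
          "\t55f000 main (/bin/app)\n"
          "\n"
          "app 181 2.0: cycles:\n"
          "\t55f2a3 memmove (/lib/libc.so)\n"
          "\t55f100 serialize (/bin/app)\n"
          "\t55f000 main (/bin/app)\n")
print(collapse(sample))  # ['main;serialize;memmove 2']
```

Seeing the transform this explicitly is what makes the Hotstar debugging story above reproducible: two stacks differing in one unresolved frame produce two distinct collapsed lines, and the flamegraph splits what should be one bar.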
Reproduce this on your laptop
sudo apt install linux-tools-common linux-tools-generic
sudo sysctl kernel.perf_event_paranoid=-1
# Both scripts below use only the Python standard library — no venv needed.
# (1) The from-scratch sampler — no perf binary needed:
python3 perf_event_from_scratch.py
# (2) Real perf record + parse from Python:
python3 -c "import time; [i*2654435761 & 0xFFFFFFFF for i in range(50_000_000)]" &
PID=$!
python3 parse_perf_report.py $PID
Where this leads next
This chapter rebuilt perf from the syscall up; you now know what -F, -g, --call-graph, -m, and flight-recorder mode (--overwrite) actually configure, and what columns perf report is showing you. The next chapters in Part 5 use this foundation to read flamegraphs fluently and to decide whether the bottleneck is on-CPU or off-CPU.
Flame graphs and how to read them (/wiki/flame-graphs-and-how-to-read-them) takes the perf script output from this chapter and turns it into the visualisation that names the hot path in one screen.
Off-CPU flamegraphs — the other half (/wiki/off-cpu-flamegraphs-the-other-half) covers the case where perf record shows low CPU but the service is slow — when threads are blocked on locks, I/O, or sleeping, and the on-CPU sampler tells you the wrong story.
Hardware event sampling — PEBS and IBS (/wiki/hardware-event-sampling-pebs-ibs) goes one layer deeper into the PMU itself, covering Intel's Precise Event-Based Sampling and AMD's Instruction-Based Sampling — what perf record -e cycles:pp and -e cpu/mem-loads/pp actually do, and why they matter for cache-miss attribution.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 13 "perf" — the operational reference for every flag in this chapter, including the ring buffer overflow discussion and the call-graph unwinder trade-offs.
- Linux perf_event_open(2) man page — the syscall ABI, including perf_event_attr field semantics, read_format flags, and the "BUGS" section that documents lost-samples behaviour.
- Linux kernel source: tools/perf/ — the user-space CLI source. Read builtin-record.c and builtin-report.c to see exactly how the subcommands wire up to the syscall.
- Brendan Gregg, "perf Examples" — a comprehensive recipe collection; the perf record --call-graph discussion and the flamegraph.pl pipeline at the end of this chapter draw on it directly.
- Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix B "Performance Monitoring Events" — the canonical event reference for Intel PMUs; cross-reference event names from perf list against this when working on an Intel host.
- Arm Neoverse Reference Designs — Performance Analysis Methodology — the equivalent for AWS Graviton; events and methodology differ from x86 enough to break Intel-tuned dashboards on migration.
- /wiki/sampling-vs-instrumentation — the previous chapter, which explained why sampling is safe in production and instrumentation is not. This chapter is the operational deep-dive on Linux's sampling tool.
- /wiki/flame-graphs-and-how-to-read-them — the next chapter; takes perf script output and turns it into the diagnostic visualisation.