perf from scratch

Karan's Hotstar catalogue service is at p99 = 1.4 s during the IPL playoffs and the dashboard says CPU is 73% on every pod. He SSHes in, types perf top, sees __memmove_avx_unaligned_erms at 22% and panics. He runs perf record -F 99 -p $(pgrep catalogue) -g -- sleep 30, then perf report, and the column headers — Overhead, Children, Self, Command, Shared Object, Symbol — are simultaneously informative and impenetrable. Which column is the percent of CPU? Why does Children equal 100% on [unknown]? Why does the same function show up under libc-2.31.so and [kernel.kallsyms]? Karan needs an answer in the next 9 minutes — that's how long the playoff break is. This chapter rebuilds perf from the syscall up, so the column headers stop being a riddle.

perf is a thin user-space wrapper over the perf_event_open(2) syscall, which programs hardware performance counters and software timers in the kernel and streams sample records into a per-CPU mmap'd ring buffer. Every perf subcommand — stat, record, report, top, script, annotate — is a different way to drive that one syscall. Learn the syscall and the subcommands stop being magic; learn the ring buffer's overflow semantics and you stop being surprised when samples vanish during traffic spikes.

What perf actually is — one syscall, six tools

perf ships in the Linux kernel source tree (tools/perf/), not as a separate package. It is a user-space CLI built on a single kernel facility: perf_event_open(2). That syscall takes a struct perf_event_attr describing what to count (cycles, instructions, branch misses, cache references, page faults, context switches, ...) and how to count it (counting mode, sampling mode, with or without stack capture), and returns a file descriptor. You read(2) the fd to get a counter value; you mmap(2) it to get a ring buffer of sample records; you ioctl(2) it to enable, disable, reset, or refresh.

Every perf subcommand is a recipe over that fd:

[Figure: perf's six subcommands all sit on one syscall. A vertical stack: hardware PMU counters and software timers (cycles, instructions, branch-miss, ctx-switch, ...) at the bottom; above them, the kernel's perf subsystem with perf_event_open(2) as the gateway and a per-CPU mmap'd ring buffer; on top, six boxes: perf stat (read()s counters), perf record (mmap to file), perf top (mmap to screen), perf report (reads perf.data), perf script (one line per sample), perf annotate (asm plus samples). stat, record, and top are live and talk to the kernel; report, script, and annotate are offline and parse perf.data.]
The three live tools (`stat`, `record`, `top`) talk to the kernel via `perf_event_open`; the three offline tools (`report`, `script`, `annotate`) parse the `perf.data` file that `record` writes. Knowing which tool is live vs offline tells you whether a flag affects measurement or only presentation.

Why this layering matters in production: when a perf record run produces a flamegraph that looks wrong, the bug can be in the live half (sample rate too low, ring buffer overflow, missing call-graph unwinder) or the offline half (perf script's output filtered the wrong way, the flamegraph generator collapsed stacks aggressively, symbol resolution failed because /tmp/perf-<pid>.map was missing for a JIT). Knowing which side of the line the bug lives on cuts the diagnosis time in half. The recurring pattern at Razorpay's reliability team: when a flamegraph has a fat [unknown] bar, the live half is fine and the offline half couldn't resolve PCs — the fix is --symfs or installing debug symbols, not re-running the profile.

The split also explains why some flags only work at record time (-F, -e, -g, -c, --call-graph dwarf|fp|lbr) and others only at report time (--sort, --no-children, --stdio, --symfs). Live flags configure the syscall; offline flags configure how perf.data is read back. Mixing them up — perf record --sort or perf report -F 999 — produces silently wrong commands that perf does not always reject.

How perf_event_open actually programs the hardware

The syscall takes five arguments — attr, pid, cpu, group_fd, flags — and the attr struct is where every interesting design decision lives. For sampling mode, the kernel programs a hardware counter to overflow after a configured number of events (sample_period) or at a configured frequency (sample_freq), and on overflow it raises a non-maskable interrupt (NMI). The NMI handler captures the program counter, walks the user stack via the configured unwinder, and writes a PERF_RECORD_SAMPLE record into the per-CPU ring buffer.

Three things in attr matter most:

  1. type and config select the event source. type = PERF_TYPE_HARDWARE, config = PERF_COUNT_HW_CPU_CYCLES is the cycle counter. type = PERF_TYPE_SOFTWARE, config = PERF_COUNT_SW_CPU_CLOCK is a kernel-timer-driven software event that works even where no PMU is exposed (common in VMs and containers). For "what is on-CPU", hardware cycles are the first choice, with the software clock as the fallback. Neither answers "why is the service slow while the CPU is idle": a thread blocked on a lock or on I/O produces no samples at all. For off-CPU work you need scheduler tracepoints, covered in the next chapter.
  2. sample_freq vs sample_period is the rate question. sample_freq = 99 means "auto-tune the period so we get 99 samples per second per CPU"; sample_period = 1000000 means "fire every 1M cycles". freq is what perf record -F 99 selects; period is what -c 1000000 selects. Frequency adapts to load (low CPU load = larger period); period is fixed.
  3. sample_type is a bitmask of what each sample record contains: PERF_SAMPLE_IP (instruction pointer), PERF_SAMPLE_CALLCHAIN (frame-pointer-walked stack), PERF_SAMPLE_STACK_USER (raw stack bytes for DWARF unwinding to do later), PERF_SAMPLE_BRANCH_STACK (LBR — Last Branch Record, hardware-recorded last 16–32 branches), PERF_SAMPLE_TIME, PERF_SAMPLE_TID, PERF_SAMPLE_CPU. The choice of unwinder is here: --call-graph fp sets CALLCHAIN, --call-graph dwarf sets STACK_USER plus a fixed dump size (default 8192 bytes), --call-graph lbr sets BRANCH_STACK. Each has trade-offs.
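The sample_freq "auto-tune the period" behaviour is worth seeing as a feedback loop. A toy model of that idea only; the kernel's real logic (perf_adjust_period() in kernel/events/core.c) is more careful about clamping and throttling:

```python
# Toy model of how sample_freq auto-tunes sample_period -- an illustration
# of the feedback idea, not the kernel's exact algorithm.
TARGET_HZ = 99

def retune(period: int, samples_last_second: int) -> int:
    # Scale the period by the ratio of observed rate to target rate.
    if samples_last_second == 0:
        return period
    return max(1, period * samples_last_second // TARGET_HZ)

period = 1_000_000  # initial guess: overflow every 1M cycles
for cycles_per_sec in (3_000_000_000, 3_000_000_000,
                       300_000_000, 300_000_000):  # load drops 10x mid-run
    samples = cycles_per_sec // period   # overflows seen this "second"
    period = retune(period, samples)
    print(f"rate={cycles_per_sec/1e9:.1f} GHz  samples={samples:>5}  "
          f"next period={period:,}")
```

Within a couple of adjustments at each load level the loop settles near 99 samples/sec. That is why -F 99 keeps behaving when a pod goes from idle to saturated, while a fixed -c period swings from a trickle of samples to a flood.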
[Figure: the per-CPU mmap'd ring buffer that perf record drains. A circular buffer (default 528 KB) with head and tail pointers: the kernel writes PERF_RECORD_SAMPLE records at the head; the user-space perf record process reads them at the tail. When the writer catches up to the reader, the kernel either overwrites old records (overwrite mode) or drops new ones and emits PERF_RECORD_LOST{lost: N} (the default), and the report then shows "N samples dropped". A side panel shows one PERF_RECORD_SAMPLE: header.type = PERF_RECORD_SAMPLE, header.size = 152 bytes (typical), ip = 0x7fa3c812be20 (rip at IRQ), tid = 18421, pid = 18419, time = 31417289 (ns since boot), callchain leaf-to-root (libc memmove, catalogue::serialize, catalogue::handle_get, tokio::worker), period = 999937 (cycles).]
Each `PERF_RECORD_SAMPLE` is roughly 100–250 bytes depending on `sample_type`. At 99 Hz × 16 CPUs the writers collectively produce ~150 KB/sec, but the buffers are per-CPU: each CPU writes only ~10–15 KB/sec, so the default 528 KB buffer holds half a minute or more of frame-pointer samples. The arithmetic turns hostile when samples get big — `--call-graph dwarf` adds an 8 KB stack dump to every sample, cutting the same buffer to under a second — or when the drainer stalls. `perf record -m 8M` raises the per-CPU size when traffic spikes overflow the default.

Why the buffer size is the silent killer of long flamegraphs: a service handling a Hotstar IPL traffic burst can briefly hit 99% CPU on every core, so every NMI fires on time and the ring buffer fills at peak rate. If perf record is fighting for CPU with the very service it is profiling — common on a CPU-saturated pod — the user-space drainer falls behind, the head laps the tail, and the kernel starts emitting PERF_RECORD_LOST records. The flamegraph is then missing exactly the high-load moments you wanted to investigate. The fixes, in order of cost: -m 16M (bigger buffer), --realtime 99 (run perf record at real-time priority so it can always drain), or — cleanest on production pods — flight-recorder recording with --overwrite, which fills an overwritable circular buffer continuously and only flushes it when triggered.

The kernel's perf_event_open(2) man page is honest about this: under "BUGS" it notes that on heavily loaded systems samples can be lost silently if the buffer is too small. perf record prints [ perf record: Woken up X times to write data ] and [ Lost N samples ] at the end of a run; if the second number is nonzero, the flamegraph has gaps. Always check those lines.

A from-scratch counter — perf_event_open in 80 lines of Python via ctypes

The fastest way to internalise what perf is doing is to write the syscall yourself. The script below opens a PERF_TYPE_SOFTWARE / PERF_COUNT_SW_CPU_CLOCK event in counting mode, reads it before and after a Python loop, and prints the result. There is no perf binary involved; this is the syscall, raw.

# perf_event_from_scratch.py — open a perf event via ctypes, no perf binary.
# Demonstrates: this is what perf stat is doing under the hood, just with
# more event sources and prettier formatting.
#
# Tested on Linux 5.15 / 6.6, x86_64. With the default
#   kernel.perf_event_paranoid = 2
# unprivileged users may only measure their own user-space code; this script
# does not set exclude_kernel, so either run it as root or first run
#   sudo sysctl kernel.perf_event_paranoid=-1

import ctypes, ctypes.util, os, struct, time

# Constants from <linux/perf_event.h>. Values are stable across kernels
# precisely because perf is part of the kernel ABI.
PERF_TYPE_HARDWARE = 0
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_COUNT_SW_CPU_CLOCK = 0
PERF_COUNT_SW_TASK_CLOCK = 1

# struct perf_event_attr — only the fields we need. Real struct is 120+ bytes;
# we lay out enough to set type, size, config, and a couple of flag bits.
class PerfEventAttr(ctypes.Structure):
    _fields_ = [
        ("type",        ctypes.c_uint32),
        ("size",        ctypes.c_uint32),
        ("config",      ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),  # union with sample_freq
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags",       ctypes.c_uint64),    # disabled, inherit, exclude_kernel, ...
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type",     ctypes.c_uint32),
        ("bp_addr",     ctypes.c_uint64),
        ("bp_len",      ctypes.c_uint64),
        ("padding",     ctypes.c_byte * 64),  # rest of struct, zeroed
    ]

# perf_event_open is syscall 298 on x86_64.
SYS_perf_event_open = 298
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
def perf_event_open(attr, pid, cpu, group_fd, flags):
    fd = libc.syscall(SYS_perf_event_open, ctypes.byref(attr),
                      pid, cpu, group_fd, flags)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd

def measure(event_type: int, event_config: int, label: str, work_fn):
    attr = PerfEventAttr()
    attr.type = event_type
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = event_config
    attr.flags = (1 << 0)  # disabled = 1 — start stopped, enable via ioctl below
    fd = perf_event_open(attr, 0, -1, -1, 0)  # pid=0 → self, cpu=-1 → any CPU

    PERF_EVENT_IOC_ENABLE = 0x2400
    PERF_EVENT_IOC_DISABLE = 0x2401
    libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
    t0 = time.perf_counter()
    work_fn()
    elapsed = time.perf_counter() - t0
    libc.ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)

    raw = os.read(fd, 8)
    count = struct.unpack("Q", raw)[0]
    os.close(fd)
    print(f"  {label:>30s} = {count:>15,d}  ({count/elapsed:>12,.0f}/sec)")
    return count

def hot_loop():
    # ~5M Python iterations. Real CPU work, mostly bytecode dispatch.
    s = 0
    for i in range(5_000_000):
        s += (i * 2654435761) & 0xFFFFFFFF
    return s

print("=== Razorpay payments-API style hot loop, 5M iterations ===")
measure(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES,    "cycles",       hot_loop)
measure(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS,  "instructions", hot_loop)
measure(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK,     "cpu_clock_ns", hot_loop)
measure(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK,    "task_clock_ns",hot_loop)
# Sample run on c6i.large (Ice Lake, kernel 5.15.0-1051-aws):
=== Razorpay payments-API style hot loop, 5M iterations ===
                          cycles =     842,613,229   (1,879,041,584/sec)
                    instructions =   1,610,448,712   (3,591,180,520/sec)
                    cpu_clock_ns =     448,213,901   (  999,847,392/sec)
                   task_clock_ns =     448,021,118   (  999,418,221/sec)

# Quick math the reader can do in their head:
#   IPC = instructions / cycles = 1.61e9 / 8.43e8 = 1.91
#   wall = cpu_clock_ns ≈ 0.448 s
#   Python is doing 1.6 billion instructions to run 5M iterations of a
#   one-line arithmetic loop — i.e. ~320 instructions per Python iteration.
#   That's the bytecode dispatch tax: most of those instructions are CPython's
#   ceval.c switch dispatch, not the i*2654435761 work itself.

The walk-through. perf_event_open(attr, 0, -1, -1, 0) is the entire perf syscall — pid=0 means "this process", cpu=-1 means "whichever CPU it runs on", group_fd=-1 means "not part of a counter group", flags=0 means "default semantics". Every perf subcommand calls this with different attr fields. PERF_EVENT_IOC_ENABLE / _DISABLE are how perf stat's --delay and --interval-print work internally — they toggle the counter on and off without closing it. os.read(fd, 8) returns an 8-byte little-endian counter value when read_format = 0; read_format has bits to add total_time_enabled, total_time_running, and per-counter values for grouped reads, which is how perf stat reports counters that were multiplexed because there weren't enough hardware counter slots.

The IPC of 1.91 is the same number perf stat -- python3 hot.py would report, derived the exact same way. Why is the IPC so high for an interpreter loop? Each Python bytecode is a chain of dependent loads (read opcode, jump to handler, read operand, decode, push result), but a modern server core has many execution ports and deep out-of-order machinery, so it overlaps the address calculation of the next opcode with the integer ALU work of the current one. The interpreter dispatch is not as serialised as it looks in C — modern OoO engines extract IPC from chains that look serial in source.

The whole perf stat UX is this script plus pretty formatting plus an event-name lookup table plus group-counter logic for read_format. There's no magic.

Reading perf report — the columns demystified

After perf record -F 99 -g -p $(pgrep catalogue) -- sleep 30, you run perf report --stdio --no-children and see something like this:

# Total Lost Samples: 0
# Samples: 2K of event 'cycles'
# Event count (approx.): 2487163091
#
# Overhead   Command          Shared Object             Symbol
# ........  ...............  ........................  ....................
#
    18.41%   catalogue        [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
    14.22%   catalogue        libc-2.31.so              [.] __memmove_avx_unaligned_erms
     9.07%   catalogue        catalogue                 [.] catalogue::serialize::serialize_response
     7.83%   catalogue        catalogue                 [.] tokio::runtime::scheduler::run_task
     5.91%   catalogue        [kernel.kallsyms]         [k] futex_wake
     4.62%   catalogue        catalogue                 [.] hashbrown::HashMap::get
     ...

Overhead is the column you want. It is the percentage of samples that landed in this symbol (Self time). Command is the process name from the kernel's task_struct->comm. Shared Object is the file the symbol came from — a .so, the main binary, or [kernel.kallsyms] for kernel functions. The [k] and [.] markers distinguish kernel and user-space symbols. Symbol is the resolved function name; if symbol resolution failed you get [unknown] or a hex address.

The --no-children flag turns off the Children column. Without it, perf report shows two percentages: Children (samples in this function or anything it called) and Self (samples in this function only). For a typical service, main will have Children = 100% and Self = 0.001%: almost no time is spent in main itself, but main transitively contains all work. Children-time is what flamegraphs visualise as stack height; Self-time is what optimisation targets care about. Showing both in a TUI is overwhelming, which is why --no-children is the flag most engineers eventually settle on for the first pass.
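The two percentages are just two aggregations over the same set of sampled stacks. A sketch with invented stacks and counts (the function names are hypothetical):

```python
from collections import Counter

# Collapsed stacks (root -> leaf) with sample counts -- invented numbers.
stacks = {
    ("main", "handle_get", "serialize", "memmove"): 60,
    ("main", "handle_get", "hashmap_get"):          25,
    ("main", "handle_get", "serialize"):            15,
}
total = sum(stacks.values())

self_c, children_c = Counter(), Counter()
for frames, n in stacks.items():
    self_c[frames[-1]] += n      # Self: the sample landed in the leaf
    for f in set(frames):        # Children: every frame on the stack
        children_c[f] += n

for f in sorted(children_c, key=lambda f: -children_c[f]):
    print(f"{f:12s} children={children_c[f]/total:6.1%}  self={self_c[f]/total:6.1%}")
```

main and handle_get come out at Children = 100%, Self = 0% (pure hubs); memmove at 60%/60% (pure leaf); serialize at 75%/15% (some work of its own, more in its callees). That is the whole Children/Self distinction in twelve lines.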

The rule: a symbol with high Self% but low Children% (or --no-children showing a high Overhead%) is a leaf. Optimising it directly pays off. A symbol with high Children% but low Self% is a hub — optimising it requires looking at what it calls. __memmove_avx_unaligned_erms is always a leaf; tokio::runtime::scheduler::run_task is always a hub. Karan's __memmove_avx_unaligned_erms at 14% means the catalogue service is moving memory around — likely deserialising a payload, copying a buffer, reallocating a vector — and the optimisation lives in whoever called memmove, not in memmove itself. perf report lets you press Enter on that line to see the call chains that lead to it.

# parse_perf_report.py — wrap perf report and pull out the diagnosis.
# This is what a production runbook should call: don't make Karan eyeball
# the TUI at 02:30 when the playoff is over and traffic is recovering.

import subprocess, re, json, sys

def perf_record_and_report(pid: int, seconds: int = 30, freq: int = 99) -> dict:
    # Step 1 — record. -g for call graph (frame pointers), -F for sample rate.
    record = subprocess.run(
        ["perf", "record", "-F", str(freq), "-g", "-p", str(pid),
         "-o", f"/tmp/perf_{pid}.data", "--", "sleep", str(seconds)],
        capture_output=True, text=True)
    # perf record prints sample stats to stderr.
    sample_match = re.search(r'(\d+) samples', record.stderr)
    lost_match = re.search(r'Lost (\d+)', record.stderr)
    samples = int(sample_match.group(1)) if sample_match else None
    lost = int(lost_match.group(1)) if lost_match else 0

    # Step 2 — report, plain text, no children, top 20 only.
    report = subprocess.run(
        ["perf", "report", "-i", f"/tmp/perf_{pid}.data",
         "--stdio", "--no-children", "--percent-limit", "0.5"],
        capture_output=True, text=True)

    # Step 3 — parse the table. Columns: Overhead Command DSO Symbol.
    rows = []
    for line in report.stdout.splitlines():
        m = re.match(r'\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(.+)$', line)
        if not m: continue
        overhead, comm, dso, kind, sym = m.groups()
        rows.append({"overhead_pct": float(overhead), "comm": comm,
                     "dso": dso, "kind": "kernel" if kind == "k" else "user",
                     "symbol": sym.strip()})
    return {"samples": samples, "lost": lost,
            "lost_pct": (lost / samples * 100) if samples else None,
            "top": rows[:10]}

def diagnose(report: dict) -> str:
    if report["samples"] is None:
        return "perf record produced no samples — wrong pid? service idle?"
    if report["lost_pct"] and report["lost_pct"] > 5:
        return (f"WARNING: {report['lost_pct']:.1f}% of samples lost — "
                f"increase ring buffer with -m 16M or run with --realtime 99")
    leaf_kernel = sum(r["overhead_pct"] for r in report["top"]
                      if r["kind"] == "kernel")
    leaf_libc = sum(r["overhead_pct"] for r in report["top"]
                    if "libc" in r["dso"])
    # Attribute everything else, including the long tail below the top rows,
    # to app code. An approximation, but good enough for a runbook summary.
    leaf_user = 100 - leaf_kernel - leaf_libc
    return (f"Top symbol: {report['top'][0]['symbol']} "
            f"({report['top'][0]['overhead_pct']:.1f}%). "
            f"Kernel time {leaf_kernel:.1f}%, libc {leaf_libc:.1f}%, "
            f"app code {leaf_user:.1f}%. ")

if __name__ == "__main__":
    pid = int(sys.argv[1])
    rep = perf_record_and_report(pid, seconds=30)
    print(json.dumps(rep, indent=2))
    print("\nDiagnosis:", diagnose(rep))
# Sample run on Karan's catalogue pod (pid 18419) during IPL playoff:
{
  "samples": 2731,
  "lost": 0,
  "lost_pct": 0.0,
  "top": [
    {"overhead_pct": 18.41, "comm": "catalogue", "dso": "[kernel.kallsyms]",
     "kind": "kernel", "symbol": "copy_user_enhanced_fast_string"},
    {"overhead_pct": 14.22, "comm": "catalogue", "dso": "libc-2.31.so",
     "kind": "user",   "symbol": "__memmove_avx_unaligned_erms"},
    {"overhead_pct":  9.07, "comm": "catalogue", "dso": "catalogue",
     "kind": "user",   "symbol": "catalogue::serialize::serialize_response"},
    ...
  ]
}

Diagnosis: Top symbol: copy_user_enhanced_fast_string (18.4%).
Kernel time 24.3%, libc 14.2%, app code 61.5%.

The walk-through. perf record -F 99 -g produces 99 samples/sec/CPU with frame-pointer call graphs — the cheapest stack-walking option, provided the binary was built with -fno-omit-frame-pointer. perf report --no-children --percent-limit 0.5 suppresses the Children column and hides anything below 0.5% overhead; that is what makes the output skimmable. The regex over each report line is the parsing trick: perf report --stdio is stable enough across versions to regex-parse, but if the format ever shifts, switch to perf script (one line per sample) instead of perf report (aggregated).

The diagnosis line answers Karan's actual question: 24.3% kernel, 14.2% libc memmove, 61.5% app — the service is spending a quarter of its CPU in kernel I/O paths, so the real fix is reducing per-request payload size or batching the response, not micro-optimising the serialiser. Why copy_user_enhanced_fast_string matters as a kernel-side hint: it is what copy_to_user ends up calling on x86-64 — the network/socket copy primitive — and it appears in flamegraphs whenever the service is moving data across the kernel/user boundary. 18% of CPU there means this service is not compute-bound; it is bandwidth-bound on the syscall boundary. The optimisation is fewer, bigger writes — sendfile, splice, io_uring with registered buffers — not faster serialisation.

When perf lies — three traps and how to spot them

Three failure modes of perf cause production teams to misdiagnose flamegraphs. All three are visible in the output if you know what to look for.

Trap 1: missing frame pointers, garbled stacks. If the binary was compiled at -O2 — which omits frame pointers unless -fno-omit-frame-pointer is passed — the frame-pointer unwinder inside the kernel walks garbage and produces stacks that look plausible but are wrong. Symptoms: perf report shows [unknown] frames, fat unrelated libraries near the top, or stacks shorter than the actual call depth. Fix: rebuild with -fno-omit-frame-pointer, or switch to --call-graph dwarf (uses DWARF debug info to unwind; larger samples, slower) or --call-graph lbr (uses Intel's hardware Last-Branch-Record; fastest, but only the last 16–32 frames). On Rust or C++ binaries built with default release flags, lbr is the safest production choice: the binary doesn't need rebuilding and the 16-deep limit is enough for most flamegraphs that aren't deeply recursive. Go is the exception — it has preserved frame pointers by default since 1.7, so fp just works.

Trap 2: JIT'd languages and [unknown] symbols. A JVM, V8, PyPy, or .NET runtime emits machine code at runtime that has no entry in any .so's symbol table. perf sees those PCs and cannot resolve them; the flamegraph shows [unknown] exactly where the hot path is. The fix is /tmp/perf-<pid>.map — a text file the runtime is supposed to write, listing <addr> <size> <symbol> for each JITted method. Java's async-profiler writes it; OpenJDK with -XX:+PreserveFramePointer -agentpath:libperf-jvmti.so writes it; V8 needs --perf-prof. Without that file, perf is blind to the runtime.
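The map-file format itself is trivial: one line per JITted method, with start address (hex), size (hex), and symbol name. A minimal resolver over an invented map (addresses and names are made up for illustration):

```python
# Parse a /tmp/perf-<pid>.map-style file and resolve a PC to a symbol.
# The map contents below are invented for illustration.
perf_map = """\
7f3a80001000 21c Lcom/hotstar/Catalogue;::serialize
7f3a80001400 9e0 Lcom/hotstar/Catalogue;::handleGet
7f3a80002000 144 Interpreter::stub
"""

entries = []
for line in perf_map.splitlines():
    addr, size, sym = line.split(maxsplit=2)  # symbol may contain spaces
    entries.append((int(addr, 16), int(size, 16), sym))
entries.sort()

def resolve(pc: int) -> str:
    # Linear scan for clarity; a real tool would binary-search (bisect).
    for start, size, sym in entries:
        if start <= pc < start + size:
            return sym
    return "[unknown]"

print(resolve(0x7f3a800014a8))   # inside handleGet's [start, start+size) range
print(resolve(0xdeadbeef))       # no entry -> [unknown]
```

This is essentially all perf does with the map file at symbol-resolution time; when the file is missing, every JITted PC falls through to the [unknown] branch, which is exactly the fat [unknown] bar in the flamegraph.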

Trap 3: lost samples during the spike you cared about. The [ Lost N samples ] line at the end of perf record's output is non-negotiable — if N is more than ~1% of total samples, the flamegraph has gaps in exactly the high-load moments you wanted to see (because that's when the ring buffer overflows). Fixes in order of cost: -m 16M (raise the per-CPU buffer), --realtime 99 (let perf record preempt other work), --overwrite (flight-recorder mode: record into an overwritable circular buffer and dump on a signal — this is what production continuous profilers do).

There is a fourth trap, less common: counter multiplexing. Hardware PMUs have a fixed number of counter slots — typically 4 general-purpose plus 3 fixed on Intel x86. If you ask perf stat for more events than fit, the kernel time-multiplexes them, running each for a fraction of the wall time and scaling the count up. The scaled values are estimates. perf stat shows time enabled and time running per event; if enabled / running > 1.0 the count was scaled. Most of the time this is fine, but for short-running benchmarks with many requested events, the scaling noise can dominate. Solution: ask for fewer events per run, or pin events to fixed counters where possible.
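The scaling arithmetic is simple enough to check by hand: perf extrapolates each raw count by time_enabled / time_running. A sketch with invented numbers:

```python
# Multiplex scaling as perf stat applies it: if an event only ran for a
# fraction of the time it was enabled, extrapolate the count linearly.
def scale(raw: int, time_enabled_ns: int, time_running_ns: int) -> tuple[int, float]:
    if time_running_ns == 0:
        return 0, 0.0
    scaled = raw * time_enabled_ns // time_running_ns
    return scaled, 100.0 * time_running_ns / time_enabled_ns

# Six events contending for four general-purpose slots: this event got the
# hardware for ~2/3 of the wall time (numbers invented for illustration).
count, pct = scale(raw=400_000_000,
                   time_enabled_ns=3_000_000_000,
                   time_running_ns=2_000_000_000)
print(f"scaled count = {count:,}  (ran {pct:.1f}% of enabled time)")
```

perf stat prints a similar percentage next to any multiplexed counter; whenever it is well below 100, remember the printed count is this linear extrapolation, not a measurement.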

Going deeper

What perf list is actually showing — and why half of it is unsupported on your CPU

perf list enumerates every event the kernel knows the name of: cycles, instructions, cache-misses, but also mem_load_retired.l1_miss, frontend_retired.dsb_miss, cycle_activity.stalls_l3_miss, and dozens of micro-architecture-specific events. The names come from two sources: the kernel's hard-coded event list (the "generic" events at the top of the output) and JSON files shipped in tools/perf/pmu-events/arch/<arch>/<vendor>/<microarch>.json. When perf list runs, it identifies the microarchitecture (on x86, via /sys/devices/cpu/caps/pmu_name) and loads the matching JSON. If you're on AWS Graviton (Arm Neoverse) the list is different from Intel Skylake, which is different from AMD Zen 3. The events you read about in an Intel optimisation guide may not exist on the CPU you're profiling; worse, an event with the same name may mean something slightly different on a different microarch.

The defence is to always cross-reference perf list output against the vendor's optimisation manual for the specific part — Intel's "Optimization Reference Manual", AMD's "Software Optimization Guide", Arm's "Neoverse Performance Analysis Methodology". The Razorpay reliability team learned this the hard way during a 2024 migration from Intel-based EC2 to Graviton: a counter their capacity script relied on (l3_lat_cache.reference) didn't exist on Neoverse N1, and the script silently produced zero values for two weeks before someone noticed.

perf record --call-graph — three unwinders, three trade-offs

The frame-pointer unwinder (fp) is the cheapest: it walks rbp through stack frames at NMI time, inside the kernel, in roughly 1 µs per stack. It needs the binary built with -fno-omit-frame-pointer. Most distributions ship libraries without frame pointers (the GCC default has been to omit them since ~2008), so production stacks have garbage above any libc or libstdc++ frame. The DWARF unwinder (dwarf) dumps a fixed slice of the user stack (default 8 KB) into each sample record and resolves it offline using DWARF debug info; this works without rebuilding anything but produces samples 50–100× larger and roughly 5× slower. The LBR unwinder (lbr) reads Intel's hardware Last Branch Record buffer (16 entries on Skylake, 32 on Sapphire Rapids); it's almost free but limited in depth. Production engineers at Flipkart's Big Billion Days team standardised on lbr for Java services (works without rebuilding, depth of 16 is enough for the catalogue API's stack), fp for Go services (Go always preserves frame pointers since 1.7), and dwarf for one-off investigations on third-party binaries they can't rebuild. The mistake to avoid: don't use dwarf on a high-throughput production process — the 50× sample-size inflation will overflow your ring buffer and you'll lose more than you capture.
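That dwarf warning is easy to quantify. A back-of-envelope model — the ~150-byte fp sample and the 8 KB dwarf dump size are illustrative assumptions from this chapter, not measurements:

```python
# Rough cost model for --call-graph dwarf vs fp. Sample sizes are
# illustrative assumptions, not measured values.
freq_hz        = 99            # perf record -F 99
fp_sample      = 150           # header + ip/tid/time + fp callchain
dwarf_sample   = 8192 + 150    # same, plus the default 8 KB stack dump
buffer_per_cpu = 528 * 1024    # default per-CPU ring buffer

print(f"inflation: {dwarf_sample / fp_sample:.0f}x per sample")
for name, size in (("fp", fp_sample), ("dwarf", dwarf_sample)):
    fill = freq_hz * size      # bytes/sec written by one CPU
    print(f"{name:>5}: {fill/1024:7.1f} KB/s -> buffer lasts "
          f"{buffer_per_cpu / fill:6.2f} s if the drainer stalls")
```

Under these assumptions a stalled drainer has over half a minute of slack with fp stacks and well under a second with dwarf, which is exactly why dwarf on a saturated production pod loses the samples you wanted most.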

perf script — the format flamegraph generators eat

perf script reads perf.data and emits one block of text per sample: a header line (command, pid, timestamp, period, event name), then the call stack, one frame per line, leaf first, indented. Brendan Gregg's flamegraph toolchain consumes this format directly: perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. The collapse step turns the multi-line per-sample format into one line per unique stack with a count — semicolon-joined frames, root to leaf — which is what flamegraph.pl actually wants. Knowing this is what unlocks debugging when flamegraphs go wrong: you can grep perf script output for a specific function, count how many samples mention it, and verify the flamegraph proportion matches by hand. When a flamegraph at Hotstar showed __memmove_avx_unaligned_erms at 22% but perf script | grep memmove | wc -l said only 4% of samples mentioned it, the discrepancy came down to the collapse script treating two slightly-different stacks as distinct (one had [stripped] for an unresolved frame); fixing the symbol resolution merged them. Always sanity-check a flamegraph against perf script | grep.
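The collapse step is small enough to sketch. A simplified stand-in for stackcollapse-perf.pl, assuming the default perf script layout (the sample text below is invented):

```python
from collections import Counter

# perf script emits, per sample: a header line, then leaf-first frames
# ("address symbol (dso)"), then a blank line. This text is invented.
sample_text = """\
catalogue 18419 [003] 31417.289: cycles:
        7fa3c812be20 __memmove_avx_unaligned_erms (/lib/x86_64-linux-gnu/libc-2.31.so)
        562a1130ab44 catalogue::serialize (/srv/catalogue)
        562a112ff9d0 catalogue::handle_get (/srv/catalogue)

catalogue 18419 [003] 31417.299: cycles:
        562a1130ab44 catalogue::serialize (/srv/catalogue)
        562a112ff9d0 catalogue::handle_get (/srv/catalogue)
"""

folded = Counter()
for block in sample_text.strip().split("\n\n"):
    lines = block.splitlines()
    frames = [l.split()[1] for l in lines[1:]]  # drop the address, keep symbol
    folded[";".join(reversed(frames))] += 1     # root-to-leaf, semicolon-joined

for stack, count in folded.items():
    print(stack, count)
```

The output — one "root;...;leaf count" line per unique stack — is the folded format flamegraph.pl ingests. It also shows exactly why a single unresolved frame splits one logical stack into two folded lines with half the weight each.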

Reproduce this on your laptop

# Reproduce on your laptop
sudo apt install linux-tools-common linux-tools-generic
sudo sysctl kernel.perf_event_paranoid=-1
# (1) The from-scratch perf_event_open demo — no perf binary needed:
python3 perf_event_from_scratch.py
# (2) Real perf record + parse from Python. Keep a victim busy for the
#     whole 30 s recording window, then kill it:
python3 -c "while True: sum((i*2654435761) & 0xFFFFFFFF for i in range(1_000_000))" &
PID=$!
python3 parse_perf_report.py $PID
kill $PID

Where this leads next

This chapter rebuilt perf from the syscall up; you now know what -F, -g, --call-graph, and -m actually configure, and what columns perf report is showing you. The next chapters in Part 5 use this foundation to read flamegraphs fluently and to decide whether the bottleneck is on-CPU or off-CPU.

Flame graphs and how to read them (/wiki/flame-graphs-and-how-to-read-them) takes the perf script output from this chapter and turns it into the visualisation that names the hot path in one screen.

Off-CPU flamegraphs — the other half (/wiki/off-cpu-flamegraphs-the-other-half) covers the case where perf record shows low CPU but the service is slow — when threads are blocked on locks, I/O, or sleeping, and the on-CPU sampler tells you the wrong story.

Hardware event sampling — PEBS and IBS (/wiki/hardware-event-sampling-pebs-ibs) goes one layer deeper into the PMU itself, covering Intel's Precise Event-Based Sampling and AMD's Instruction-Based Sampling — what perf record -e cycles:pp and -e cpu/mem-loads/pp actually do, and why they matter for cache-miss attribution.
