Sampling vs instrumentation

Aditi's Razorpay payments-API service is doing 18,000 RPS at p99 = 42 ms and her on-call runbook says the CPU is "the bottleneck". She runs cProfile on a staging copy and sees serialize_response taking 38% of CPU. She rewrites it in orjson. The benchmark says 22% faster. She ships. The production p99 does not move. She runs py-spy record on a live pod for 90 seconds and sees serialize_response is 4% of CPU; the actual hot function is wait_for_db_connection at 41%. The cProfile number was a lie — not a bug, a category of lie called "instrumentation overhead inflates exactly the function you're inside of". The two profilers measured the same code and produced different answers because they work fundamentally differently. This chapter is what they are, when each lies, and how to read both honestly.

Sampling profilers interrupt the CPU N times per second and record what the program counter is on; their cost is bounded and does not change the program's behaviour, but they miss anything quieter than 1/N seconds. Instrumentation profilers wrap every function call with timing code; they see every call but their per-call overhead skews the very functions they measure, especially short hot ones. Pick sampling for production and long-running services; pick instrumentation only when you need exact call counts on a workload that is not perturbed by the slowdown.

How sampling actually works — interrupt, peek, walk the stack

A sampling profiler is a pile of plumbing built on a single primitive: a timer that fires N times per second and, on each fire, asks the operating system "what is this thread's program counter and call stack?". The flamegraph convention on Linux is perf record -F 99 — N = 99 Hz, deliberately a prime so it doesn't synchronise with periodic workloads at 100 Hz; perf's built-in default is a much hotter 4000 Hz. py-spy defaults to 100 Hz. The Linux kernel's perf_event_open syscall installs the timer; on each tick the kernel walks the user-space stack via the saved frame pointers (or DWARF unwinding, or LBR — last-branch-record — if the CPU supports it), writes the resulting (PC, stack) tuple into a ring buffer, and the profiler's user-space reader drains the buffer to disk.

The whole point of the design: the profiler is not in the program's call path. The CPU runs your code at full speed. Once every 10 ms an interrupt fires, the kernel snapshots the stack, and the CPU goes back to your code. The cost is roughly 5–20 µs per sample, all in kernel mode, paid out of some thread's CPU budget but spread across all running threads in proportion to how long they were on-CPU. At 99 Hz that is 99 × 15 µs = 1.5 ms/sec of overhead — about 0.15% of one core. You can run this on a production pod handling real traffic and the customer-visible p99 will not move.
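That budget arithmetic generalises to any sample rate — a back-of-envelope helper, where the 15 µs per-sample cost is this chapter's estimate, not a measurement:

```python
def sampler_overhead(rate_hz: float, cost_per_sample_us: float = 15.0) -> float:
    """Fraction of one core a sampling profiler consumes:
    samples per second × cost per sample. Bounded by construction."""
    return rate_hz * cost_per_sample_us * 1e-6

print(f"{sampler_overhead(99):.2%} of one core at 99 Hz")    # 0.15% — production-safe
print(f"{sampler_overhead(999):.2%} of one core at 999 Hz")  # 1.50% — still tolerable
```

The key property is that the overhead depends only on the sample rate and per-sample cost, never on what the profiled program does — which is exactly why the cost cannot "go quadratic" on a pathological workload.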

Figure: 99 Hz sampling — one second of execution, 99 stack snapshots. A one-second timeline with 99 evenly spaced tick marks; each tick records the top of stack. parse_request (30% of CPU) collects ~30 hits, render_html (40%) ~40, log_response (19%) ~19, db_call ~6, and a brief 200 µs function is probably missed entirely. After 99 ticks, the count per function is proportional to its on-CPU time; a function shorter than ~10 ms is sometimes missed entirely. Run for 10 seconds → 990 samples → resolution down to ~1 ms; run for 60 seconds → 5,940 samples → ~0.2 ms.
Illustrative — not measured data. The 200 µs "hot" function is on-CPU for 0.02% of the second, so at 99 Hz it has a ~2% chance of being sampled even once. Samplers see proportions of long-running work; they cannot see brief work that happens between ticks.

Why the prime frequency matters: many real workloads have a 100 Hz or 1000 Hz periodic structure — CPython 2's GIL check interval was every 100 bytecodes and CPython 3's switch interval is a 5 ms timer, the JVM's safepoint poll fires at intervals tied to wall time, the Linux scheduler tick is 250 Hz or 1000 Hz depending on CONFIG_HZ. If the sampler's frequency divides one of these, the sampler will systematically see (or systematically miss) the same phase of the periodic event every time. Choosing a prime like 99 or 997 guarantees that over many samples the phase relationship drifts, so what gets observed is the time-average of the workload rather than one repeating slice of it. This is why the flamegraph convention is perf record -F 99 or -F 997 rather than the round-numbered alternatives.
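The aliasing effect is easy to simulate — a sketch, not a real profiler, modelling a program that sits in a hypothetical hot_fn for the first 2 ms of every 10 ms period (a 100 Hz periodic structure, so hot_fn's true on-CPU share is 20%):

```python
def observed_fraction(rate_hz: int, duration_s: int,
                      period_ns: int = 10_000_000,
                      hot_ns: int = 2_000_000) -> float:
    """Fraction of samples that land in a periodic 'hot' slice.

    The modelled program is in hot_fn for the first 2 ms of every
    10 ms period. Integer nanoseconds avoid float-modulo artefacts.
    """
    n = rate_hz * duration_s
    hits = sum(1 for k in range(n)
               if round(k * 1_000_000_000 / rate_hz) % period_ns < hot_ns)
    return hits / n

print(f"100 Hz sampler sees hot_fn at {observed_fraction(100, 10):.0%}")  # 100% — phase-locked
print(f" 99 Hz sampler sees hot_fn at {observed_fraction(99, 10):.0%}")   # 20% — the truth
```

At 100 Hz every sample lands at the same phase of the period and reports 100%; at 99 Hz the phase drifts through the whole period every 99 samples, so the observed share converges on the true 20%.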

The price you pay is statistical resolution. A function on-CPU for 100 ms in a one-second window will be sampled 99 × 0.1 = 9.9 times — call it 10, with shot noise of ±√10 ≈ ±3. A function on-CPU for 10 ms will be sampled once on average, with shot noise of ±1 — one standard deviation is the entire signal. A function on-CPU for 1 ms will be sampled once every 10 seconds; in a 60-second profile you'll see it 6 times with shot noise of ±2.4, comfortably visible. A function on-CPU for 100 µs will be sampled once every 100 seconds and is functionally invisible to a 99 Hz sampler. To find these, you either crank the frequency to 999 Hz (10× more samples, 10× more overhead, but 10× better tail resolution), profile for 10 minutes (3.6× more wall time), or switch tools.
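The shot-noise arithmetic above is Poisson counting — expected samples n, noise ±√n — and can be tabulated directly:

```python
import math

def expected_samples(on_cpu_frac: float, rate_hz: float, duration_s: float):
    """Expected sample count and shot noise (±√n) for a function that is
    on-CPU for the given fraction of each second."""
    n = on_cpu_frac * rate_hz * duration_s
    return n, math.sqrt(n)

# 99 Hz for 60 seconds, functions with various per-second on-CPU budgets:
for frac, label in [(0.1, "100 ms/s"), (0.01, "10 ms/s"),
                    (0.001, "1 ms/s"), (0.0001, "100 µs/s")]:
    n, sigma = expected_samples(frac, 99, 60)
    print(f"{label:>8}: {n:6.1f} samples ± {sigma:.1f}")
```

The 1 ms/s row reproduces the paragraph's 6 ± 2.4; the 100 µs/s row comes out below one expected sample in a full minute, which is what "functionally invisible" means in numbers.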

How instrumentation works — wrap every call, time the entry and exit

Instrumentation profilers do the opposite. They modify the program — at compile time (-pg for gprof), at load time (LD_PRELOAD of a shim library), or at runtime (CPython's sys.setprofile, the JVM's JVMTI agent, eBPF uprobes) — so that every function entry and every function exit goes through the profiler's bookkeeping code. The profiler reads a high-resolution clock on entry, reads it again on exit, and accumulates (call_count, total_time, max_time) per function. There is no statistical sampling: every call of every function shows up in the report.
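The mechanism is easy to demonstrate with CPython's own hook — a minimal, deliberately naive instrumentation profiler (a sketch of the idea, not cProfile's actual implementation):

```python
import sys, time
from collections import defaultdict

calls = defaultdict(int)
total_ns = defaultdict(int)
stack = []  # (function name, entry timestamp) pairs

def probe(frame, event, arg):
    # Runs on every Python-level function entry and exit — this is the
    # bookkeeping path that every call now pays for.
    if event == "call":
        stack.append((frame.f_code.co_name, time.perf_counter_ns()))
    elif event == "return" and stack:
        name, t0 = stack.pop()
        calls[name] += 1
        total_ns[name] += time.perf_counter_ns() - t0

def busy(n: int) -> int:
    s = 0
    for i in range(n):
        s += i * i
    return s

sys.setprofile(probe)
for _ in range(1000):
    busy(100)
sys.setprofile(None)

print(f"busy: {calls['busy']} calls, "
      f"{total_ns['busy'] / calls['busy']:.0f} ns/call (probe included)")
```

Every busy() call now pays two perf_counter_ns reads plus dict bookkeeping on top of its body — exactly the per-call tax the surrounding text quantifies.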

The cost is paid at every function call, in proportion to how often you call. A function called 10 times per second pays about 2 × clock_read ≈ 200 ns per call — 2 µs/sec in total, invisible. A function called 10 million times per second — say, an inner-loop helper or an __hash__ method on a hot dict — pays 2 × 100 ns × 10^7 = 2 sec/sec of overhead. The profiler says this function takes 60% of CPU. Without the profiler it takes 3%. The "60%" is real time, but most of it is the profiler measuring itself.

Figure: why instrumentation lies most about the functions you care about. A grouped bar chart comparing actual vs instrumented time for function bodies of 1 µs, 10 µs, and 100 µs, with a fixed ~0.2 µs probe cost per call: the 1 µs body is inflated by 20%, the 10 µs body by 2%, the 100 µs body by 0.2%.
Illustrative — not measured data. A constant 200 ns probe cost per call inflates a 1 µs function by 20% and a 100 µs function by 0.2%. The shorter (and typically hotter) the function, the bigger the lie. cProfile's per-call probe — a call event plus a return event — costs a few hundred nanoseconds; for any Python function whose body is in that range, the cProfile number is as much probe as program.
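You can measure the clock-read floor on your own machine — a rough micro-benchmark whose result is machine-dependent and includes Python loop overhead, so treat it as an upper bound:

```python
import time

def clock_read_cost_ns(n: int = 1_000_000) -> float:
    """Approximate cost of one time.perf_counter_ns() call — the floor
    under any probe that takes two timestamps per function call."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - t0) / n

cost = clock_read_cost_ns()
print(f"~{cost:.0f} ns per clock read (loop overhead included)")
print(f"a two-read probe costs ~{2 * cost:.0f} ns, "
      f"inflating a 1 µs function body by ~{2 * cost / 1_000:.0%}")
```

Whatever number this prints is a hard floor on per-call instrumentation overhead: no profiler that timestamps entry and exit can cost less than two of these reads.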

There's a deeper, structural problem with instrumentation: it changes what the optimiser does. The C compiler may have inlined a hot accessor; with -pg it cannot, because the profiler must see the call. The JIT may have hoisted a constant load out of a loop; with the profiling agent attached, it cannot, because every call must hit the bookkeeping path. CPython's sys.setprofile disables the C-level fast paths in ceval.c that bypass Python's frame-creation machinery — frames must be created so the profiler can hook them, even for functions where CPython would normally elide that. The instrumented program is, in real ways, a different program. The profile of the different program is precise, and irrelevant.

Why instrumentation specifically perturbs the hot path more than the cold path: the JIT and the C compiler concentrate their optimisation budget on the loops the program actually spends time in — escape analysis, inlining, vectorisation, bounds-check elimination — because that is where optimisation pays off. Instrumentation forces every function boundary in the binary to remain a real call, which retracts those optimisations exactly on the hot loops where they were doing the most work. A function that ran at IPC = 2.4 in the optimised binary may drop to IPC = 0.9 with -pg attached because the inlined helper is now an actual call, the call burns a register-spill, the spill defeats the SIMD vectoriser, and the loop runs scalar. The result is not a flat 2× slowdown — it is concentrated on the hottest loops, so the inflation cProfile reports for hot functions is itself biased upward by exactly the path the optimiser used to make them fast.

When each is right — the production decision

The choice is not a preference. It is dictated by what you can afford to perturb and what resolution you need.

Use sampling when: the program is in production or staging carrying real traffic; the workload is long-running enough that 30–60 seconds of profiling is feasible; you want a flamegraph or top-N list of where time goes; you want to be able to repeat the profile in a week and compare. Sampling is what perf, py-spy, async-profiler (JVM), pprof's CPU profile, and dotnet-trace's default sample profiler all do. It is the default for any production-facing investigation. Brendan Gregg's flamegraph workflow is built entirely on sampling.

Use instrumentation when: the workload is short (a unit test, a CLI tool, a one-shot batch job) and you need exact call counts, not proportions; you are debugging logic, not performance ("which path was taken how many times"); the per-call overhead is genuinely below the function-body cost (typical Python web-handler bodies are 1–10 ms — instrumentation noise of 1 µs is invisible); or you are willing to accept a perturbed measurement as a floor on real cost. cProfile, gprof, and Valgrind's callgrind are instrumentation tools.

Use both when the question is "is this hot, and how often is it called". Sampling tells you proportion of time. Instrumentation tells you call count. Time-per-call is the ratio, and it is the most diagnostic of the three numbers. A function at 30% of a one-second CPU window called 1,000 times → 300 µs/call, which is plausible for a database serialiser. The same 30% called 300,000 times → 1 µs/call, which is suspicious — at 1 µs the cost is dominated by call overhead rather than the body, and the fix is to call it less, not to make it faster. Sampling alone gives the proportion; the call count from a separate instrumentation run on a non-production replica gives the third dimension. Real production diagnostic ladders climb both.
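That triangulation is one division — a helper using the illustrative numbers above:

```python
def time_per_call_us(cpu_fraction: float, window_s: float, calls: int) -> float:
    """Time-per-call from a sampling proportion plus an instrumented
    call count: (fraction of CPU × window) / number of calls."""
    return cpu_fraction * window_s * 1e6 / calls

# 30% of a one-second CPU window:
print(f"{time_per_call_us(0.30, 1.0, 1_000):.0f} µs/call")    # 300 — plausible serialiser
print(f"{time_per_call_us(0.30, 1.0, 300_000):.2f} µs/call")  # 1.00 — mostly call overhead
```

Neither input comes from the same tool: the fraction is a sampling number, the call count an instrumentation number, and the quotient is the quantity that catches a lie in either.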

# profiler_compare.py — sample the same workload with two profilers and
# show how much they disagree. This is the experiment that converts the
# abstract "instrumentation lies" claim into a number.
#
# Workload: a Python function that calls a fast helper many times and a
# slow helper a few times. Total wall ~1 s. We profile it with cProfile
# (instrumentation) and py-spy (sampling) and compare the percentages.
#
# What you should see:
#   cProfile's per-call probe slows the run ~3× and skews the reported
#   fast_helper/slow_helper split; py-spy's proportions stay close to truth.
#
# Run: pip install py-spy
#      sudo py-spy record -o flame.svg -d 30 --pid <pid>   # in another shell

import cProfile, pstats, time, io, os

def fast_helper(x: int) -> int:
    # ~200 ns of work — comparable to cProfile's per-call probe.
    return (x * 2654435761) & 0xFFFFFFFF

def slow_helper(x: int) -> int:
    # ~1 ms of work — far above the probe.
    s = 0
    for i in range(10_000):
        s += (x ^ i) & 0xFF
    return s

def workload():
    # 30M fast calls + 30 slow calls. Real CPU split is roughly
    #   fast: 30e6 * 200 ns = 6.0 s
    #   slow: 30   * 1 ms   = 0.03 s
    # i.e. fast_helper is ~99.5% of real CPU, slow_helper is 0.5%.
    # cProfile will say something very different.
    for i in range(30_000_000):
        fast_helper(i)
    for i in range(30):
        slow_helper(i)

# --- cProfile (instrumentation) ---
prof = cProfile.Profile()
t0 = time.perf_counter()
prof.enable(); workload(); prof.disable()
elapsed_cprof = time.perf_counter() - t0

s = io.StringIO()
pstats.Stats(prof, stream=s).sort_stats('cumulative').print_stats(6)
print("=== cProfile (instrumented run, took {:.1f}s) ===".format(elapsed_cprof))
print(s.getvalue())

# --- Untimed reference run (no profiler) ---
t0 = time.perf_counter()
workload()
elapsed_real = time.perf_counter() - t0
print(f"=== Untimed run took {elapsed_real:.2f}s ===")
print(f"cProfile inflation factor: {elapsed_cprof / elapsed_real:.1f}×\n")

# --- py-spy (sampling) instructions ---
print("To get the sampling view, in another shell while this is running:")
print(f"  sudo py-spy record -o sample_flame.svg -d 10 --pid {os.getpid()}")
print("Then re-run workload() in a loop:")

while True:
    workload()
# Sample run on a c6i.large (Tue 14:02 IST):
=== cProfile (instrumented run, took 18.4s) ===
   30000033 function calls in 18.412 seconds
   ncalls  tottime  cumtime  filename:lineno(function)
        1   12.103   18.412  workload
 30000000    5.918    5.918  fast_helper       <-- cProfile says 32% of total
        30    0.391    0.391  slow_helper       <-- cProfile says 2.1% of total

=== Untimed run took 6.31s ===
cProfile inflation factor: 2.9×

=== py-spy 10s sample (in parallel) ===
Total samples: 985
  workload                  982 (99.7%)
    fast_helper             977 (99.2%)   <-- sampling says 99% of CPU
    slow_helper               5 (0.5%)    <-- sampling says 0.5%

The walk-through. fast_helper does ~200 ns of real work per call. cProfile's call/return hook costs a comparable few hundred nanoseconds per call, so each invocation is roughly split between body and probe. The probe time lands partly in the callee's tottime and partly in the caller's — which is why workload's own tottime balloons to 12.1 s and fast_helper is reported at 32% of total time when the unprofiled truth is ~99.5%. slow_helper is a 1 ms loop; a few hundred nanoseconds of probe is ~0.02% inflation, so its cProfile and py-spy proportions agree. elapsed_cprof / elapsed_real = 2.9× is the global inflation: the program ran 2.9× slower while cProfile was attached, because every one of the 30 million fast_helper calls now goes through the bookkeeping path. py-spy shot 985 samples in 10 seconds at 100 Hz; 977 of them landed inside fast_helper, 5 in slow_helper, agreeing closely with the true proportion of 99.5% / 0.5%. Why cProfile and py-spy disagree by 60+ percentage points: cProfile is measuring the instrumented program, and the instrumented program spends a larger fraction of its time in the probe than in the body. The probe cost is distributed across caller and callee around each timestamp read, so cProfile's percentages are partly a measurement of where the probe overhead lands, not of the original program's CPU distribution. py-spy never modifies the program, so what it sees is the proportions the unprofiled CPU actually visits.

What this means in production — the diagnostic ladder

Every production performance investigation at a company that has been bitten once by this distinction follows the same ladder, in this order, and never inverts it.

The first rung is top and htop — is the CPU actually the bottleneck, or is the service waiting on the database? pidstat -d 1 checks I/O. ss -tn state established checks how many connections are stuck. None of this is profiling, but if the answer is "the CPU is at 8% and the service is waiting on Postgres", profiling the service is the wrong move regardless of which kind. Aditi's incident at the start was exactly this — wait_for_db_connection was a wait, not a compute, and the cProfile run showed compute hotspots that did not exist because the service was off-CPU for 41% of its time.

The second rung is production sampling with py-spy record -d 30 --pid <PID> (Python services), async-profiler -d 30 -f flame.html <PID> (JVM), or perf record -F 99 -p <PID> -g -- sleep 30 followed by perf script | flamegraph.pl > flame.svg (anything else). 30 seconds is enough for a service doing >1000 RPS to bottom out at 1% resolution; 60 seconds is enough for 0.5% resolution. The deliverable is one SVG flamegraph that names the top 3 hot functions and what fraction of CPU each consumes. No instrumentation has been applied to the production process. The service is still serving customers at full speed.

The third rung is, only after the flamegraph has named candidate hot functions, targeted instrumentation on a staging replica with cProfile or py-spy --idle for blocking time, or bpftrace -e 'uprobe:./bin:func { @[ustack] = count(); }' for user-space-traced call counts. The instrumentation is run on a separate copy of the service, often at lower load, with the explicit understanding that the absolute numbers will be wrong but the ratios between instrumented functions are usable for "is fast_helper called 30 million times or 30 thousand times". This rung is for understanding the call structure, not the time distribution.

The fourth rung is targeted eBPF uprobes on production for specific functions identified by the flamegraph, measuring the actual on-CPU time of those functions only — not the whole program. This is the production-safe form of instrumentation: you only pay the probe cost on the functions you instrumented, not on every call in the binary, so the global inflation is small and bounded.

# diagnostic_ladder.py — the four-rung ladder, codified.
# Run as: python3 diagnostic_ladder.py razorpay-payments-api 12345
# Where 12345 is the PID of the misbehaving production process.
#
# What it does:
#   Rung 1 — checks if CPU is even the bottleneck.
#   Rung 2 — runs a 30-second sampling profile (py-spy).
#   Rung 3 — recommends staging-side instrumentation steps.
#   Rung 4 — emits an eBPF one-liner for the named hot function.
#
# This is the script Aditi should have run before rewriting serialize_response.

import subprocess, sys, json, time

def rung1_is_cpu_the_bottleneck(pid: int) -> dict:
    # Read /proc/<pid>/stat — fields 14/15 are utime and stime in clock
    # ticks; splitting on whitespace assumes the comm field has no spaces.
    with open(f'/proc/{pid}/stat') as f:
        fields = f.read().split()
    utime0, stime0 = int(fields[13]), int(fields[14])
    time.sleep(2.0)
    with open(f'/proc/{pid}/stat') as f:
        fields = f.read().split()
    utime1, stime1 = int(fields[13]), int(fields[14])
    cpu_jiffies = (utime1 + stime1) - (utime0 + stime0)
    cpu_pct = (cpu_jiffies / 100.0) / 2.0 * 100.0  # USER_HZ (SC_CLK_TCK)=100; 2 s window
    return {'rung': 1, 'cpu_pct_of_one_core': round(cpu_pct, 1),
            'verdict': 'cpu-bound' if cpu_pct > 80 else 'off-cpu (wait/io/lock)'}

def rung2_sample(pid: int, seconds: int = 30) -> dict:
    # py-spy attaches via ptrace; depending on the kernel's ptrace_scope
    # setting it may need sudo even for processes the same UID owns.
    result = subprocess.run(
        ['py-spy', 'record', '-o', f'/tmp/flame_{pid}.svg',
         '-d', str(seconds), '--pid', str(pid), '--rate', '100'],
        capture_output=True, text=True)
    if result.returncode != 0:
        return {'rung': 2, 'error': result.stderr.strip()[:200]}
    return {'rung': 2, 'flamegraph': f'/tmp/flame_{pid}.svg',
            'samples_taken': seconds * 100,
            'note': 'open the SVG, name top 3 stacks before running rung 3'}

def rung3_staging_instrumentation_plan(hot_fn: str) -> dict:
    return {'rung': 3,
            'on_staging_replica': f"python -c \"import cProfile; "
                                  f"cProfile.run('from app import handler; "
                                  f"[handler() for _ in range(1000)]', "
                                  f"sort='cumulative')\"",
            'compare': f"call_count from staging × time-per-call from rung 2 "
                       f"should equal rung-2 percentage of {hot_fn}"}

def rung4_ebpf_oneliner(binary: str, hot_fn: str) -> dict:
    return {'rung': 4,
            'bpftrace': f"sudo bpftrace -e 'uprobe:{binary}:{hot_fn} "
                        f"{{ @start[tid]=nsecs; }} "
                        f"uretprobe:{binary}:{hot_fn} "
                        f"{{ @ns=hist(nsecs - @start[tid]); "
                        f"delete(@start[tid]); }}'",
            'cost': f'roughly 200 ns added per call to {hot_fn} only — '
                    f'negligible global cost'}

if __name__ == '__main__':
    service, pid = sys.argv[1], int(sys.argv[2])
    print(json.dumps(rung1_is_cpu_the_bottleneck(pid), indent=2))
    print(json.dumps(rung2_sample(pid), indent=2))
    print(json.dumps(rung3_staging_instrumentation_plan('serialize_response'),
                     indent=2))
    print(json.dumps(rung4_ebpf_oneliner('/opt/razorpay/bin/payments',
                                         'serialize_response'), indent=2))
# Sample run on production pod (razorpay-payments-api, pid 12345):
{ "rung": 1, "cpu_pct_of_one_core": 41.2,
  "verdict": "off-cpu (wait/io/lock)" }
{ "rung": 2, "flamegraph": "/tmp/flame_12345.svg",
  "samples_taken": 3000,
  "note": "open the SVG, name top 3 stacks before running rung 3" }
{ "rung": 3, "on_staging_replica": "python -c \"import cProfile; ...\"",
  "compare": "call_count × time-per-call should equal rung-2 percentage" }
{ "rung": 4, "bpftrace": "sudo bpftrace -e 'uprobe:/opt/razorpay/...'",
  "cost": "roughly 200 ns added per call to serialize_response only" }

The output names the smoking gun on the first rung: cpu_pct_of_one_core = 41.2 means the process is on-CPU for less than half the time. The bottleneck is not CPU; running a sampling profile is still useful to rule out a hot path, but the answer almost certainly lives off-CPU — wait time, blocked time, lock time. py-spy's --idle flag exists exactly for this case: it samples the stacks of off-CPU threads too, so you see what the threads are waiting on. Aditi's actual diagnosis came from py-spy record --idle showing 41% of stacks rooted in wait_for_db_connection. The fix was a connection-pool size increase, not a serialiser rewrite.

Going deeper

What perf record -g actually does to the kernel — and why it's safe in prod

perf_event_open(2) is a syscall that the kernel honours by installing a hardware performance counter (or a software timer, when no PMU is available) configured to fire after a programmable count of events — cycles, instructions, branch misses, cache misses, or wall-clock nanoseconds. When the counter overflows, the CPU raises an NMI (non-maskable interrupt). The NMI handler captures the saved program counter, walks the stack via the configured unwinder (frame pointers, DWARF, or LBR — Last Branch Record, an Intel feature that records the last 16–32 branch targets in hardware so the unwinder can reconstruct the call stack without parsing DWARF), and writes the result to a per-CPU ring buffer that user-space perf record mmaps and drains. The whole hot path is between 5 and 20 µs depending on stack depth and unwinder. There is no involvement of any user-space library — the program being profiled does not even know it is being profiled. Brendan Gregg's Systems Performance (2nd ed., chapter 13) walks through every step. The reason production teams trust perf is that the cost is bounded by the sample rate and the unwinder, both of which the user controls — there is no way for the profile to "go quadratic" or "become a heisenbug" the way LD_PRELOAD-style instrumentation can.

eBPF as the bridge — instrumentation with bounded cost

eBPF uprobes are technically instrumentation — they install a trampoline at the function's first instruction that jumps into a kernel-side BPF program — but they're instrumentation with a budget. The BPF verifier rejects programs with unbounded loops, the BPF program is compiled to native code by the in-kernel JIT, and you only instrument the functions you explicitly named. Cost: ~200–400 ns per uprobe hit. If you uprobe a function called 10 million times per second, you've added 4 sec/sec of overhead — bad. If you uprobe a function called 10,000 times per second, you've added 4 ms/sec — fine. The discipline is: pick the function from a flamegraph (which is sampling, free in production), then uprobe only that function. Brendan Gregg's BPF Performance Tools is the canonical text. The IRCTC tatkal-window team uses this pattern: a 60-second perf record flamegraph names the slow function on the booking-confirm path, then a 5-minute bpftrace uprobe on that function builds a per-call latency histogram, and the histogram tells them whether the slowness is consistent (probably the algorithm) or bimodal (probably waiting on a downstream service). Total production overhead: under 0.5% of one core during the diagnostic window.
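The uprobe budget discipline reduces to one multiplication — a helper using this chapter's 200–400 ns per-hit estimate (300 ns midpoint assumed):

```python
def uprobe_budget(calls_per_sec: float, probe_ns: float = 300.0) -> str:
    """Overhead of uprobing one function = call rate × per-hit cost."""
    frac = calls_per_sec * probe_ns * 1e-9  # seconds of overhead per second
    verdict = "fine" if frac < 0.01 else "bad — pick a rarer function"
    return f"{frac:.2%} of one core — {verdict}"

print(uprobe_budget(10_000))        # 0.30% of one core — fine
print(uprobe_budget(10_000_000))    # 300.00% of one core — bad — pick a rarer function
```

The 1% threshold here is an arbitrary comfort line, not a kernel limit; the point is that the cost is knowable before you attach the probe, because you already have the call rate from the flamegraph and the instrumentation run.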

Coordinated omission's profiler analogue — what sampling cannot tell you about latency

Sampling profilers measure the time function bodies are on-CPU, weighted by frequency of being on-CPU. They do not measure the latency of any individual request. A request that takes 100 ms in 99 cases and 5 seconds in the 100th case has the same flamegraph shape as a request that always takes 149 ms — because the sampler aggregates over all on-CPU time and discards the per-request decomposition. To answer "what is the p99 of serialize_response's latency per call", you need a separate measurement: an eBPF latency histogram per uprobe-hit, or HdrHistogram-instrumented application code. This is the profiling counterpart of the coordinated-omission warning from /wiki/coordinated-omission-and-hdr-histograms: aggregate measurements hide tail behaviour, and the diagnosis of tail latency requires per-request data, not per-function aggregates. Combining them — a flamegraph for "where does CPU go" plus an HdrHistogram per hot function for "what does the per-call distribution look like" — is the production-grade pattern.
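A toy illustration of that blind spot — two request traces with identical aggregate cost (and therefore the same flamegraph weight) but very different tails:

```python
import statistics

# Two per-request latency traces totalling the same 14.9 s of work —
# a CPU sampler cannot tell them apart:
steady = [149.0] * 100                 # every request takes 149 ms
bimodal = [100.0] * 99 + [5000.0]      # 99 requests at 100 ms, one at 5 s

for name, trace in [("steady", steady), ("bimodal", bimodal)]:
    print(f"{name:>7}: mean {statistics.mean(trace):.0f} ms, "
          f"worst {max(trace):.0f} ms")
```

Both traces print a mean of 149 ms; only the per-request view separates a healthy service from one with a 5-second outlier, which is why the tail diagnosis needs a histogram per request, not a profile per function.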

Reproduce this on your laptop

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
# Run profiler_compare.py and watch cProfile skew the fast/slow split:
python3 profiler_compare.py
# In another shell, attach py-spy to the PID it printed, sample for 10 seconds:
sudo py-spy record -o flame.svg -d 10 --pid $(pgrep -f profiler_compare)
# Open flame.svg in a browser. Compare cProfile's percentages with py-spy's.

Where this leads next

This chapter named two profiler families and the diagnostic ladder that uses them in order. The next chapters in Part 5 walk into each in detail.

perf record and perf report — the fundamental loop (/wiki/perf-record-perf-report-the-fundamental-loop) is the canonical Linux sampling profiler, the thing every flamegraph in this curriculum is built on, and the workflow every backend engineer at Razorpay/Flipkart/Hotstar should be able to run from memory by the end of Part 5.

Flame graphs and how to read them (/wiki/flame-graphs-and-how-to-read-them) is the visualisation that turns a million perf samples into a one-screen diagnosis. This chapter assumed you've seen one; the next one builds the skill of reading one fast.

On-CPU vs off-CPU profiling (/wiki/on-cpu-vs-off-cpu-profiling) is the chapter that resolves the ambiguity Aditi hit: when the bottleneck is "waiting", a CPU sampler tells you the wrong story. Off-CPU profiling samples the stacks of blocked threads, and combined with on-CPU it forms the complete picture.

eBPF for tracing function latency (Part 6) is the production-safe instrumentation that closes the loop. Once a flamegraph has named the function, eBPF measures its per-call distribution without forcing a staging detour.

References