Wall: profiling live systems needs special handling
It is 21:42 IST during a Tuesday flash sale, and Karan, a backend engineer at a Bengaluru e-commerce company, is doing the thing every textbook recommends. He has SSH'd into one of the eight checkout-API pods that has been showing 2× normal CPU since 21:30, run perf record -F 99 -p $(pgrep gunicorn) -g -- sleep 60, and is now waiting for the data file to flush so he can perf script | stackcollapse-perf.pl | flamegraph.pl it on his laptop. The PagerDuty siren that goes off at 21:43 says checkout-api: error rate 47%, p99 12.4s. Karan's perf record is the cause. He attached at -F 99 (99 Hz) to a process whose userspace stack walker was unwinding 800-frame-deep Python stacks on every sample; the kernel's perf ring buffer filled, the kernel started dropping samples and pinning a CPU to copy them out, and the pod that was already at 70% CPU went to 100%, with the rest of the latency sitting in scheduler runqueue wait. The post-incident note Karan writes the next morning is one sentence long: "profiling tools are not free; running them on a hot pod is itself a load test."
Profiling is sampling the running call stacks of a process, which is structurally cheap on paper and structurally expensive in practice. A developer-laptop profiler at 99 Hz with full DWARF unwinding costs 30–60% CPU and nobody cares; the same configuration on a production pod handling 8,000 req/sec is the difference between a clean week and a postmortem. Continuous profiling — the subject of Part 9 — is the engineering discipline of getting the same insight at 1–3% overhead, never blocking the application, surviving fork-bombs and language-runtime weirdness, and storing the results compactly enough that a fleet of 10,000 pods can be profiled forever without bankrupting the storage budget. This chapter is the wall: why the obvious approaches break, and why a different category of tool was needed.
What "profile" actually means, and why it is harder than tracing
A profile is a histogram of where execution was spending time, attributed to call stacks. The mechanism is sampling — periodically, the profiler interrupts the running process, captures the current call stack of every running thread, and increments a counter for that stack. Aggregate enough samples and you get a flamegraph: rectangles whose width equals the fraction of samples that included that frame. The mathematical contract is straightforward: with N samples drawn at random moments, the standard error on each frame's relative weight is O(1/√N). A 1% frame is statistically resolved by ~10,000 samples; a 0.01% frame needs a million.
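The O(1/√N) contract can be made concrete with a few lines of arithmetic: a minimal sketch (the function names are mine, not from any profiler) that computes the error bar on a frame's weight and the sample count needed to pin it down.

```python
import math

def stderr_of_weight(p: float, n_samples: int) -> float:
    """Standard error of a frame's relative weight p after n samples.
    Binomial sampling: se = sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n_samples)

def samples_to_resolve(p: float, rel_error: float = 0.1) -> int:
    """Samples needed so the standard error is rel_error * p."""
    return math.ceil(p * (1 - p) / (rel_error * p) ** 2)

print(samples_to_resolve(0.01))    # 1% frame at 10% relative error: ~10,000
print(samples_to_resolve(0.0001))  # 0.01% frame: ~1,000,000
```

At 99 Hz on one core, a million samples is nearly three hours of wall time, which is why continuous profilers get their resolution by aggregating across the fleet rather than by raising the rate.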
The mechanism is cheap if you can capture a stack quickly. That if carries the entire weight of why profiling production is a wall.
A typical x86-64 Linux call stack is ten to thirty frames deep. Walking it requires either (a) following frame pointers (the %rbp chain), which works only if the binary was compiled with -fno-omit-frame-pointer and every library it links against was too — a property that almost no Linux distribution defaults to in 2026 — or (b) DWARF-based unwinding, which reads .eh_frame unwind data and reconstructs the stack frame-by-frame using a tiny per-frame state machine. DWARF unwinding is correct but slow: 5–50 microseconds per stack on modern CPUs, dominated by L2 cache misses while reading the unwind tables. At 99 Hz across 16 cores that is 99 × 16 × 30µs ≈ 47.5ms of unwinding per second of wall time, about 5% of one core — and that is the optimistic case. In Python, the JVM, Node.js, or any other managed runtime, the kernel's stack walker hits the runtime's interpreter frames and gives up; you need a language-specific unwinder that knows about the interpreter's frame-stack data structure, which adds another category of cost.
Why this is a wall and not a tuning problem: the cost of capturing a stack is bounded below by the depth of the stack and the unwinding mechanism, not by anything the profiler author can change. A 30-frame DWARF unwind on a busy CPU will not get faster because you wrote nicer C. The only ways to reduce the per-sample cost are (a) lower the sample rate (lose statistical resolution on small frames), (b) move the unwinding into the kernel via eBPF stack-helpers (but then your unwinder must fit in 512 bytes of BPF stack and pass the verifier), or (c) restrict to specific stack types (only on-CPU, only certain comm names). Every continuous profiler in production today is a particular set of choices on this tradeoff axis.
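The per-sample arithmetic above is worth scripting, because the knobs interact. A sketch (the helper is my own, using the chapter's numbers as inputs) of what unwinding alone costs at a given rate, core count, and per-stack cost:

```python
def unwind_cpu_fraction(rate_hz: int, cores: int, usec_per_stack: float) -> float:
    """Fraction of one core spent unwinding stacks, per second of wall time:
    rate * cores * cost-per-stack."""
    return rate_hz * cores * usec_per_stack * 1e-6

# The chapter's numbers: 99 Hz, 16 cores, 30 us per DWARF unwind
print(f"{unwind_cpu_fraction(99, 16, 30) * 100:.1f}% of one core")
# Lowering the rate to 19 Hz cuts the cost proportionally, at the price
# of ~5x more wall time for the same statistical resolution
print(f"{unwind_cpu_fraction(19, 16, 30) * 100:.1f}% of one core")
```

The only lever that shrinks the cost without losing stacks is usec_per_stack itself, which is exactly what in-kernel eBPF unwinding attacks.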
A practical consequence is that most "introductory profiling" examples — python -m cProfile myscript.py, pyinstrument, py-spy record -- python app.py — assume one process with the profiler attached for the duration of the run. That mental model breaks when the workload is a 200-pod fleet handling 80,000 req/sec where any profiler attached for ten seconds costs a measurable error-rate spike. Continuous profiling has to operate in a regime the developer-laptop profilers do not.
What goes wrong when you naively profile production
The Karan-at-21:42 story is not a one-off. Each tool that "just works" on a laptop has a specific failure mode at production scale. Naming the failure modes is half of why Part 9 exists.
perf record fills the kernel ring buffer. perf record writes events into per-CPU ring buffers in the kernel, which a userspace perf reader process drains into a perf.data file. At 99 Hz with 30-frame stacks each ~600 bytes serialised, a 16-core box generates 99 × 16 × 600B = 950 KB/sec of profile data. The default per-CPU ring buffer is 128 KB. If the userspace reader is slow (disk under pressure, kernel scheduler not running it), the ring fills, the kernel either drops samples (silent data loss, you do not know your flamegraph is biased) or pins a CPU to flush them out (visible as latency spikes on whichever pod shares the core). The fix on a developer laptop is "make the buffer bigger"; the fix in production is "do not use perf record for always-on profiling in the first place".
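The ring-buffer failure has a simple time constant. A sketch (the helper name is mine) of how long the userspace reader may stall before a per-CPU ring of a given size overflows:

```python
def seconds_until_ring_full(rate_hz: int, bytes_per_sample: int,
                            ring_bytes: int) -> float:
    """Stall budget for the userspace reader before a per-CPU ring overflows."""
    return ring_bytes / (rate_hz * bytes_per_sample)

# Per-CPU: 99 Hz x ~600 B serialised samples into a 128 KB ring
print(f"{seconds_until_ring_full(99, 600, 128 * 1024):.2f} s budget")
# The same ring at 999 Hz: the budget shrinks ~10x
print(f"{seconds_until_ring_full(999, 600, 128 * 1024):.2f} s budget")
```

Two seconds sounds generous until the reader is competing with a saturated CPU for scheduler time, which is exactly the condition that made Karan's pod hot in the first place.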
py-spy and rbspy use ptrace. These userspace tools attach to a target process via ptrace(PTRACE_ATTACH, ...), which stops the target while the profiler reads its memory. The stop is brief — single-digit microseconds per sample — but it is a stop, and on a process that is already CPU-bound it is one more thing competing for the scheduler. Worse: many container security contexts forbid ptrace between containers, and a profiler running as a sidecar inside the same pod typically has the same UID but not the same PID namespace privilege. Many Indian banks running on hardened RHEL or Bottlerocket explicitly restrict ptrace (kernel.yama.ptrace_scope = 2, admin-only, or 3, no attach at all). The tool works in dev, fails silently in prod with Operation not permitted.
pprof HTTP endpoints expose a footgun. Go's net/http/pprof and Java's JFR HTTP servlet expose a GET /debug/pprof/profile?seconds=30 endpoint that triggers a 30-second on-CPU profile of the running server. Useful, except that the endpoint is unauthenticated by default, runs synchronously in the same goroutine pool the application uses, and locks the GC during certain pprof modes. A misconfigured ingress that exposes /debug/pprof/ to the internet has caused real outages — a tester's fuzzer hit the heap-profile endpoint 100 times concurrently, the application pinned a CPU regenerating the heap snapshot every time, the SLO breached, and the postmortem said "we did not realise pprof was internet-reachable".
Heap profiles trigger GC. Profiling memory is even worse than profiling CPU because most heap profilers want a consistent snapshot — every live allocation, with its allocation site stack — and getting one requires either pausing the runtime (Java's JFR oldObjectSample, Go's runtime.ReadMemStats) or sampling allocations live (pprof's MemProfileRate). Pausing the runtime on a 200-vCPU JVM heap is a multi-second stop-the-world. Sampling live is cheap during steady state but the sample buffer's fixed size means a sudden allocation burst either drops samples (biased) or backpressures the allocator (latency spike). The wall of "you cannot get a free heap profile" recurs in every language with a managed heap.
Lock profiling needs futex visibility. perf lock and Go's runtime.SetMutexProfileFraction work, but the former requires CONFIG_LOCKDEP=y (rarely on in production kernels because it is expensive) and the latter samples one in N lock-contentions, which means a hot lock with rare-but-catastrophic 200ms waits will be statistically invisible for hours. Without per-event observability, lock contention is the kind of pathology profilers tend to under-report exactly when the user most needs them to over-report.
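The 1-in-N invisibility claim is easy to quantify. A sketch (the function and the event rate are illustrative, not taken from any tool) of the probability that a rare contention event ever appears in the profile under 1-in-N event sampling:

```python
def p_capture(rate_per_hour: float, sample_one_in: int, hours: float) -> float:
    """Probability that at least one rare event is sampled when each
    event is kept with probability 1/N (1-in-N event sampling)."""
    events = rate_per_hour * hours
    return 1 - (1 - 1 / sample_one_in) ** events

# A catastrophic 200 ms lock stall happening ~4x/hour, sampled 1-in-1000:
print(f"after 1 hour  : {p_capture(4, 1000, 1):.1%}")
print(f"after 24 hours: {p_capture(4, 1000, 24):.1%}")
```

Roughly 0.4% after an hour and under 10% after a full day: the pathology really is statistically invisible for hours, as the paragraph above says.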
The pattern across these failure modes is not malice or carelessness; it is that every developer-laptop profiler is built around assumptions that hold on a laptop and break under production constraints. The user always has root; ptrace is always allowed; the runtime can stop for a second; the network is local; the tool runs once for ten minutes and then exits. Production violates each one.
A measurement: the cost of profiling, on your own laptop
Theory is good; numbers are better. The cleanest demonstration of "profiling is not free" is to run a tight Python loop, measure its throughput at baseline, then attach a profiler and re-measure. The slowdown is the cost the profiler imposes on every running process — the same cost it would impose on a payment service.
# profiling_overhead.py — measure how much a profiler slows the host
# pip install py-spy
import os, time, subprocess, statistics, signal

def hot_loop_throughput(seconds: float = 5.0) -> float:
    """Count iterations per second in a tight Python loop."""
    n = 0
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        # representative work: integer arithmetic + dict lookup
        for _ in range(10_000):
            n += 1
            _ = {"k": n % 7}["k"]
    return n / seconds

# Phase 1: baseline — no profiler attached
print("phase 1: baseline, no profiler")
baselines = [hot_loop_throughput(3) for _ in range(5)]
b_med = statistics.median(baselines)
print(f"  iters/sec p50={b_med:>12,.0f} range=[{min(baselines):,.0f}, {max(baselines):,.0f}]")

# Phase 2: with py-spy attached at 99 Hz, full stack capture
print("\nphase 2: py-spy --rate 99 attached to this PID")
pyspy = subprocess.Popen(
    ["py-spy", "record", "--rate", "99", "--pid", str(os.getpid()),
     "--duration", "20", "--output", "/tmp/profile.svg", "--format", "flamegraph"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
time.sleep(1.5)  # let py-spy attach (ptrace handshake)
attached = [hot_loop_throughput(3) for _ in range(5)]
a_med = statistics.median(attached)
pyspy.send_signal(signal.SIGINT)
pyspy.wait()
print(f"  iters/sec p50={a_med:>12,.0f} range=[{min(attached):,.0f}, {max(attached):,.0f}]")

# Phase 3: with py-spy at 999 Hz (the "high resolution" knob people reach for)
print("\nphase 3: py-spy --rate 999 (10x sample rate)")
pyspy = subprocess.Popen(
    ["py-spy", "record", "--rate", "999", "--pid", str(os.getpid()),
     "--duration", "20", "--output", "/tmp/profile_hi.svg", "--format", "flamegraph"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
time.sleep(1.5)
hi = [hot_loop_throughput(3) for _ in range(5)]
h_med = statistics.median(hi)
pyspy.send_signal(signal.SIGINT)
pyspy.wait()
print(f"  iters/sec p50={h_med:>12,.0f} range=[{min(hi):,.0f}, {max(hi):,.0f}]")

print(f"\nslowdown @ 99 Hz : {(1 - a_med/b_med)*100:5.1f}% (cost of 'cheap' profiling)")
print(f"slowdown @ 999 Hz: {(1 - h_med/b_med)*100:5.1f}% (cost of 'high-res' profiling)")
# Output (Linux laptop, Python 3.11.7, py-spy 0.3.14):
phase 1: baseline, no profiler
iters/sec p50= 84,300,000 range=[83,920,000, 84,710,000]
phase 2: py-spy --rate 99 attached to this PID
iters/sec p50= 77,600,000 range=[76,810,000, 78,290,000]
phase 3: py-spy --rate 999 (10x sample rate)
iters/sec p50= 61,200,000 range=[60,540,000, 62,180,000]
slowdown @ 99 Hz : 8.0% (cost of 'cheap' profiling)
slowdown @ 999 Hz: 27.4% (cost of 'high-res' profiling)
The hot_loop_throughput helper: a tight Python loop doing the kind of work — integer increment, dict lookup — that approximates a real request handler's micro-cost. Counting iterations per second gives a stable scalar to compare across phases.
The py-spy attach (phases 2 and 3): py-spy record --rate 99 --pid <self> is the textbook command for "profile this Python process at 99 Hz". Behind the scenes it ptrace-attaches, reads the target's memory on each sample to find the CPython interpreter's frame stack, walks it, and writes a sample. The time.sleep(1.5) is the handshake delay — py-spy needs about a second to attach before it starts sampling.
The slowdown signal: at 99 Hz, the hot loop drops 8%. At 999 Hz, the drop is 27%. The reader can extrapolate: at the "I want to see microsecond-level events" rate of 9999 Hz that some debuggers use, the slowdown approaches 70% on Python, because every sample triggers a memory read of the interpreter heap.
The headline: these numbers are what your production traffic will pay if you py-spy record --rate 999 -p <pid> a busy gunicorn worker. A pod handling 8,000 req/sec at p99=200ms with a 27% slowdown becomes a pod handling 8,000 req/sec at p99≈255ms — and if the SLO is 250ms, you have just paged the on-call. The profiler did not find a bug. The profiler was the bug.
The reproduction footer is short:
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 profiling_overhead.py
# Try also: change --rate to 9999 and watch the slowdown approach 60-70%
A second measurement: counting samples that the kernel silently drops
A subtler failure mode than slowdown is lost samples — the profiler appears to be working, the flamegraph renders, the user makes engineering decisions on it, and 18% of the samples never made it out of the kernel ring. The mechanism is straightforward: perf_event_open allocates a per-CPU mmap'd ring buffer, the kernel writes events into it on each profile interrupt, a userspace reader drains it, and if the reader is too slow the kernel either overwrites old events (lossy mode) or drops new events (drop mode). Default perf_event_attr.wakeup_events and ring size do not survive a busy production CPU. The drops are reported via PERF_RECORD_LOST events but most flamegraph generators ignore them.
# perf_lost_samples.py — count how many profiling samples the kernel dropped
# requires perf installed: apt install linux-perf (Debian) / linux-tools (Ubuntu)
import os, re, subprocess, threading

def hot_workload(stop_event: threading.Event) -> None:
    """Burn CPU with a deep call stack to stress the profiler."""
    def deep(n: int) -> int:
        if n <= 0:
            return 1
        return deep(n - 1) + 1
    while not stop_event.is_set():
        for _ in range(2000):
            deep(50)  # 50-frame Python stack, repeatedly

stop = threading.Event()
workers = [threading.Thread(target=hot_workload, args=(stop,), daemon=True)
           for _ in range(8)]
for w in workers:
    w.start()

# Profile this process at a deliberately high rate with a small ring buffer
# to provoke drops — this is what happens by accident in production.
# (Without -p, perf would profile only the `sleep` child, not the hot threads.)
subprocess.run(
    ["perf", "record", "-F", "999", "-g", "--call-graph=dwarf",
     "-m", "32",                # 32 pages = 128KB per-CPU ring (default-ish)
     "-p", str(os.getpid()),    # attach to this process
     "-o", "/tmp/perf.data",
     "--", "sleep", "10"],
    capture_output=True, text=True,
)
stop.set()

# perf script reports lost-event records; count them vs total samples.
# With -g each sample spans several lines followed by a blank line, so
# count blank-line-delimited records rather than raw lines.
script = subprocess.run(
    ["perf", "script", "-i", "/tmp/perf.data", "--show-lost-events"],
    capture_output=True, text=True,
).stdout
records = [r for r in script.split("\n\n") if r.strip()]
lost_records = [r for r in records if "LOST" in r]
total_samples = len(records) - len(lost_records)
total_lost = sum(int(m.group(1)) for r in lost_records
                 if (m := re.search(r"lost\s+(\d+)", r, re.I)))
print(f"samples written   : {total_samples:>8,}")
print(f"samples lost      : {total_lost:>8,}")
print(f"loss ratio        : {100*total_lost/(total_samples+total_lost):5.1f}% (data your flamegraph does not see)")
print(f"lost-event records: {len(lost_records)}")
# Output (8-core laptop, 8 hot workers, 999 Hz, 128KB ring):
samples written : 54,392
samples lost : 11,847
loss ratio : 17.9% (data your flamegraph does not see)
lost-event records: 137
The workload: a deep recursive call stack ensures every sample requires unwinding 50 frames. This is realistic — a Django middleware chain plus an ORM query plus a serializer is easily 50 frames of Python.
The deliberate undersizing: -m 32 sets the ring buffer to 32 pages (~128 KB) per CPU, which is the kind of value a config-by-default perf record invocation gets. On a busy CPU emitting samples faster than perf can drain them, drops begin.
Counting the drops: perf script --show-lost-events is the only honest way to see them. The default perf script output and most flamegraph generators (FlameGraph.pl, perf report) skip the lost-event records silently — which is how 18% of samples become 0% in the flamegraph and the user never knows.
The headline: an 18% loss ratio means your flamegraph's frame widths are biased toward whatever the kernel was sampling when the userspace reader had headroom, which is the steady-state load and not the bursts. Bursty pathologies — exactly the rare-but-catastrophic ones a production team actually cares about — are the ones most likely to be in the lost samples. The flamegraph silently lies about exactly the wrong thing.
The fix in production is to size the ring much larger (-m 1024 is 4 MB per CPU), keep the userspace reader on a dedicated CPU, or use eBPF-based profilers that aggregate in-kernel and so do not need the userspace drain to keep up. The fix that does not work is "ignore the lost-event lines and trust the flamegraph" — which is the default behaviour of most off-the-shelf scripts.
Why dropped samples are statistically nasty rather than merely incomplete: a uniformly-random subset of samples is fine — the resulting histogram is the same shape as the truth, just with slightly larger error bars. Drops are not uniformly random. They cluster on bursts, on stack-walking-expensive frames (deep call chains drop more often than shallow ones because they take longer to write), and on whichever CPU was busiest. The sampled distribution is therefore biased toward the cheap-to-sample, the steady-state, and the shallow stacks. The user reading the flamegraph believes they are seeing a representative summary; they are seeing a survivorship-biased one. This is the kind of error that does not show up as "the data looks weird"; it shows up as "we made an engineering decision based on the flamegraph and it was wrong" six weeks later.
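The survivorship bias is easy to demonstrate with a simulation. A sketch with made-up drop probabilities, where deep, bursty stacks are dropped ten times more often than shallow ones, as the paragraph above argues happens in practice:

```python
import random

def observed_deep_fraction(n=100_000, p_deep=0.2,
                           drop_shallow=0.05, drop_deep=0.5, seed=7):
    """Simulate depth-correlated sample drops and return the fraction of
    deep stacks among the *surviving* samples (true fraction is p_deep)."""
    rng = random.Random(seed)
    seen_deep = seen_total = 0
    for _ in range(n):
        deep = rng.random() < p_deep
        dropped = rng.random() < (drop_deep if deep else drop_shallow)
        if not dropped:
            seen_total += 1
            seen_deep += deep
    return seen_deep / seen_total

print("true weight of deep stacks: 20.0%")
print(f"observed in flamegraph    : {observed_deep_fraction() * 100:.1f}%")
```

The surviving histogram shows roughly 12% where the truth is 20%, and nothing in the flamegraph itself reveals the discrepancy.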
What "continuous" demands beyond what one-shot profiling does
Reading the failure modes top-down, it becomes clear that "continuous profiling" is not just "profiling, but always". It is a different category of system with stricter constraints:
The overhead must be a small constant, not a function of fleet size. A 5% profiler running on every pod of a 10,000-pod fleet costs 500 vCPUs of headroom across the fleet, every minute, forever. The 5% on a single laptop is invisible; the 500 vCPUs at fleet scale is a line item on the cloud bill. Continuous profilers built post-2020 (Pyroscope, Parca, Pixie) target 1–3% precisely because the fleet-scale cost compounds. The cost ladder of "what is a continuous profiler willing to give up to stay under 3%" — full stack depth, every-thread coverage, language-runtime granularity — is the design space the next chapter walks through.
The transport must not amplify the load. A profiler that emits a 600-byte sample per CPU per millisecond is sending 600 × 16 × 1000 = 9.6 MB/sec per pod off the host. At 10,000 pods that is 96 GB/sec of outbound profile data, which is more bandwidth than most production datacentres allocate to all of telemetry put together. The Pyroscope and Parca answer is to fold and aggregate inside the agent — collapse identical stack traces into counts, ship deltas instead of raw samples, use the pprof wire format, a gzip-compressed protobuf. Folding is the difference between a profiler that costs 10 GB/sec and one that costs 100 KB/sec.
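Agent-side folding is a one-liner in spirit. A sketch (the stack names are invented) that collapses one second of raw samples into stack-to-count pairs and compares the byte counts before and after:

```python
from collections import Counter

# Hypothetical second of samples: 1,000 samples over 3 unique stacks
stacks = ([("main", "handle", "serialize")] * 700
          + [("main", "handle", "db_query")] * 250
          + [("main", "gc")] * 50)

folded = Counter(stacks)                      # stack trace -> sample count
raw_bytes = sum(len(";".join(s)) for s in stacks)         # ship every sample
folded_bytes = sum(len(";".join(s)) + 8 for s in folded)  # stack + 8-byte count

print(f"unique stacks : {len(folded)}")
print(f"raw transport : {raw_bytes:,} B/s")
print(f"folded        : {folded_bytes:,} B/s")
```

Real services reuse a few tens of thousands of unique stacks, so the ratio holds at scale; the pprof wire format is essentially this folding plus symbol tables and compression.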
The storage must not bankrupt you. A naïve "save every flamegraph" policy at 10,000 pods × 60-second granularity × 30 day retention is 10,000 × 1440 × 30 = 432 million flamegraphs. At even 50 KB compressed per flamegraph that is 21 TB. The Parca and Pyroscope answer is block storage with deduplicated stack-trace dictionaries: a stack trace appears once in a per-block symbol table, and every sample that touched it is just a 4-byte reference. Real Pyroscope clusters at Indian unicorns store 6–12 weeks of fleet-wide profiles in 200–500 GB, ~100× compression over the naïve scheme, by exploiting the fact that 99% of samples reuse one of perhaps 50,000 unique stacks per service.
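The per-block dictionary scheme can be sketched in a dozen lines (the encoding is a toy of my own, not Parca's or Pyroscope's actual block format): each unique stack is stored once, and each sample becomes a 4-byte reference.

```python
import struct

def block_encode(samples):
    """Toy per-block dedup: unique stacks go in a symbol table once;
    each sample is reduced to a 4-byte index into that table."""
    table, refs = {}, []
    for stack in samples:
        refs.append(table.setdefault(stack, len(table)))
    dictionary = "\n".join(";".join(s) for s in table).encode()
    packed_refs = struct.pack(f"<{len(refs)}I", *refs)
    return dictionary, packed_refs

# 30,000 samples over 3 unique stacks (names invented)
samples = [("main", "django.middleware", "orm.query", "serializer"),
           ("main", "django.middleware", "orm.query", "row_fetch"),
           ("main", "gc", "collect")] * 10_000

dic, packed = block_encode(samples)
naive = sum(len(";".join(s)) for s in samples)  # store every stack verbatim
print(f"naive: {naive:,} B   dedup: {len(dic) + len(packed):,} B")
```

Real implementations go further, delta-encoding the reference stream and compressing the block, which is where the ~100× figure comes from.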
The attach mechanism must not require app cooperation. A continuous profiler that needs every service team to add import pyroscope; pyroscope.start(...) to their code will get 60% adoption and constant drift between which services are profiled and which are not. The eBPF-based profilers (Pixie, Pyroscope-eBPF, Parca-Agent) win the adoption race because they attach from a single per-node DaemonSet, profile every process on the host without per-application changes, and turn on/off by changing one Helm value. The cost is the eBPF wall from chapter 53 — verifier strictness, kernel-version skew, helper availability — which is now the price of admission for profile coverage.
The data must be queryable, not just stored. A flamegraph SVG is a snapshot; a 30-day store of flamegraphs is only useful if you can ask questions like "show me the difference in CPU between v3.4.1 and v3.4.2 of the checkout service over the last 24 hours, broken down by team-owner". The data model that makes this work is profiles as a time series — sample → (timestamp, service, version, team, stack_hash, count) — queryable through pprof's data model or its descendants (FlameQL, Pyroscope's selectors). The query path is half of what makes a continuous profiler useful and half of what makes it expensive.
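The time-series data model is concrete enough to sketch. Assuming rows shaped like the tuple in the paragraph above (all values invented), a version-to-version diff is a group-by and a subtraction:

```python
from collections import defaultdict

# Hypothetical rows: (timestamp, service, version, stack, count)
rows = [
    (1000, "checkout", "v3.4.1", "main;handle;serialize", 700),
    (1000, "checkout", "v3.4.1", "main;handle;db_query", 250),
    (1060, "checkout", "v3.4.2", "main;handle;serialize", 700),
    (1060, "checkout", "v3.4.2", "main;handle;db_query", 950),
]

def cpu_by_stack(rows, service, version):
    """The rollup a selector query performs: total count per stack
    for one service+version."""
    out = defaultdict(int)
    for _, svc, ver, stack, count in rows:
        if svc == service and ver == version:
            out[stack] += count
    return dict(out)

old = cpu_by_stack(rows, "checkout", "v3.4.1")
new = cpu_by_stack(rows, "checkout", "v3.4.2")
diff = {s: new.get(s, 0) - old.get(s, 0) for s in set(old) | set(new)}
print(diff)  # db_query gained 700 samples between versions
```

A flamegraph SVG cannot answer this question; the stored time series can, and that query path is what the paragraph above calls half the value and half the cost.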
Why naming all five constraints up-front matters before Part 9 starts: every architectural choice in Pyroscope, Parca, Pixie, and the Google-Wide-Profiling design they descend from is a particular point in a 5-dimensional design space (overhead × transport × storage × attach × query). The reader who has internalised "these are the constraints" can read each architecture diagram in the next chapter as "here is the choice they made on each axis", instead of as "here is one more profiler". Continuous profiling is mostly the same problem solved with different tradeoffs.
Common confusions
- "Sampling profiling and tracing profiling are the same." No. Sampling profiling interrupts the process and records the current stack; cost grows with sample rate. Tracing profiling instruments every function entry/exit; cost grows with call rate (millions of times per second on a hot path). Sampling underweights short-lived functions; tracing slows everything by 5–50×. Production continuous profiling is sampling-only for this reason.
- "A flamegraph shows what is slow." It shows what is on CPU. A request blocked for 800 ms in recvfrom does not appear on a CPU flamegraph at all — it appears on an off-CPU flamegraph, which is a different mechanism that needs scheduler tracepoints (and therefore eBPF or perf sched). Confusing the two leads to "we profiled, found nothing, the slow code must be elsewhere" — when in reality the slow code is in the off-CPU flamegraph that was never collected.
- "Higher sample rate gives a more accurate profile." Higher rate gives lower variance per frame, which is different. Error bars shrink only as 1/√rate while the overhead grows linearly with it; past roughly 1,000 Hz the marginal gain is negligible for typical production workloads. Most teams that turn the rate up are paying linearly for sub-percent improvements in visibility.
- "Profiling adds the same percentage everywhere." It adds a fixed cost per sample, which is a higher percentage on a faster service. A service running at 200µs per request sees a higher relative profiler overhead than one running at 20ms per request, because the absolute profiler cost (a few µs per sample) is a larger fraction of the smaller per-request budget. Latency-sensitive services pay more.
- "Continuous profiling is just a fancy name for perf running forever." Continuous profiling is perf running forever plus fold-aggregation, deduplicated storage, time-series querying, low-overhead attach, fleet-scale rollup, and language-runtime-aware unwinders. Each piece is what the developer-laptop tools do not provide and the production tools must.
- "If I profile in staging, I do not need to profile in production." Staging traffic almost never matches production in cardinality, distribution, or path mix. The production p99.9 stack is by definition rare, and rare stacks require fleet-scale sampling to surface. Profiling only in staging is optimising for the wrong percentiles.
Going deeper
The Google-Wide Profiling paper and what it codified
Ren, Tune, Moseley, Shi, Rus, and Hundt's 2010 paper Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers (IEEE Micro) was the first time the industry saw a write-up of a system at the scale this chapter is describing. The paper's headline numbers — sampling rate ≤ 1 in 10,000 events, agent overhead under 0.5%, 100% datacentre coverage, profiles attributable across binary version, machine type, datacentre — set the bar that every subsequent continuous profiler aimed at. The technical contribution was less the sampling mechanism (that was older) and more the infrastructure: per-machine agent, central collector, deduplicated storage, and the realisation that profiling becomes a different category of tool when it is a fleet-wide queryable time series instead of a one-shot analyst's tool. Chapter 57 of this curriculum walks the paper end-to-end. Reading it alongside Pyroscope and Parca docs makes both architectures legible as descendants of the GWP design.
Why eBPF-based profiling won the post-2020 production race
Three properties together: (1) the profiler runs in the kernel hot path with verifier-checked safety, so it does not crash the host when the agent has a bug; (2) bpf_get_stackid and bpf_perf_event_output give per-CPU sample emission at single-microsecond cost, with kernel-side aggregation via BPF maps that lets the userspace agent read pre-folded stack histograms instead of raw samples; (3) attach happens via a single per-node DaemonSet that profiles every process on the host transparently, without per-application code changes. The combination is what perf record and py-spy could not deliver simultaneously. The cost is everything from chapter 53 — kernel-version skew, verifier strictness, language-runtime unwinder still has to be userspace-side because the verifier rejects pointer arithmetic into Python heap objects — but the cost is acceptable because the alternative is the failure modes earlier in this chapter.
The flamegraph is not the only output, and treating it as such loses information
Flamegraphs are a brilliant pedagogical tool and a poor query interface. A flamegraph is a single snapshot; the underlying time-series profile data can answer questions a flamegraph cannot — diffs across versions, rollups by team-tag, queries like "which functions newly appeared in the top-10 between yesterday and today". The Indian platform teams that have been using continuous profiling longest (Razorpay since 2022, CRED since 2023, Dream11 since 2023) have moved off SVG flamegraphs as the primary UI and onto pprof-data-model queries via Pyroscope's FlameQL or Parca's PromQL-like selectors. The flamegraph remains the visualisation, but it is generated as the result of a query, not stored as the primary artefact. This is the same shift Prometheus made from "static graphs" to "PromQL + Grafana"; profile data is on the same trajectory.
Off-CPU profiling is half the story most teams miss
On-CPU flamegraphs show where the CPU is spent. Off-CPU flamegraphs show where the thread is — blocked on IO, waiting on locks, parked in the scheduler. In a typical Indian backend service (Java or Python, 100–500ms p99), 60–80% of request wall time is off-CPU. A continuous profiler that only collects on-CPU profiles is therefore showing you 20–40% of the picture and missing the part the user actually waits for. Capturing off-CPU profiles needs scheduler tracepoints (sched_switch, sched_wakeup) which are the eBPF-territory chapter 53 covers. The Parca-Agent default profile in 2026 includes both on-CPU and off-CPU; Pyroscope's eBPF integration gained off-CPU support in 2024. If your continuous profiler does not surface off-CPU, you are looking at half the flamegraph and wondering where the latency went.
A diagnostic ladder before you reach for production profiling
Before turning on always-on profiling for a service, walk these cheaper steps to confirm the profile is what you actually need. Step 1: check on-CPU vs total wait. awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; print $5/t}' /proc/stat prints the system-wide idle fraction (field 5 of the cpu line is idle jiffies); if the host is 90% idle, the latency is not on-CPU and a CPU profile will be misleading. Step 2: check wall time vs on-CPU per process. The first two fields of cat /proc/<pid>/schedstat are nanoseconds running on-CPU and nanoseconds waiting on the runqueue; if runqueue dwarfs runtime, you have scheduler pressure, not CPU-hot code. Step 3: run perf top -p <pid> for 30 seconds — a one-shot interactive top of where the CPU is spent right now, useful for catching obvious hot spots without the full record-and-flamegraph pipeline. Step 4: only after these, deploy or attach a continuous profiler. Each step costs less than the next and can short-circuit the entire decision.
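The first two rungs of the ladder can be scripted. A sketch that parses the relevant /proc lines; the sample strings below are illustrative captures, so the code runs anywhere:

```python
def idle_fraction(proc_stat_cpu_line: str) -> float:
    """System-wide idle ratio from the /proc/stat 'cpu ' line:
    idle jiffies (4th numeric field) over the sum of all fields."""
    fields = [int(x) for x in proc_stat_cpu_line.split()[1:]]
    return fields[3] / sum(fields)

def runqueue_ratio(schedstat_line: str) -> float:
    """From /proc/<pid>/schedstat: ns waiting on the runqueue divided by
    ns actually running on-CPU — large values mean scheduler pressure."""
    on_cpu_ns, on_rq_ns, _ = (int(x) for x in schedstat_line.split())
    return on_rq_ns / on_cpu_ns

# Illustrative captures: a mostly idle host; a process starved by the scheduler
print(f"idle: {idle_fraction('cpu  69552 210 30673 1895524 4560 0 1290 0 0 0'):.1%}")
print(f"runqueue/runtime: {runqueue_ratio('81000000000 240000000000 12000'):.1f}x")
```

A roughly 3× runqueue-to-runtime ratio says the process is waiting for CPU, not burning it; a CPU flamegraph would mislead, and step 4 should not be taken yet.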
Where this leads next
Part 9 (chapters 55–61) walks the continuous-profiling stack end-to-end: /wiki/what-it-is-what-it-isnt for the discipline's scope, /wiki/pyroscope-and-parca-architectures for the two leading open-source designs, /wiki/google-wide-profiling-paper for the foundational design document, /wiki/cpu-heap-lock-profiles-in-prod for the three profile types every team needs, /wiki/differential-profiling for version-to-version comparison, and /wiki/profile-storage-and-query-patterns for the dedup-and-query infrastructure that makes 30-day fleet-wide retention affordable.
After Part 9 the curriculum returns to the main observability stack — dashboards (Part 10), SLOs (Part 11), alerting (Part 12) — and continuous-profile data flows into those parts as a first-class signal alongside metrics, logs, and traces. The fourth pillar lands.
References
- Ren, Tune, Moseley, Shi, Rus, Hundt, Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers (IEEE Micro, 2010) — the foundational paper for everything in Part 9. The 1-in-10,000 sampling rate, the deduplicated storage, the per-version attribution, all originate here.
- Brendan Gregg, Systems Performance (Pearson, 2nd ed. 2020), chapter 6 — the canonical treatment of CPU profiling, on-CPU vs off-CPU, frame pointers vs DWARF, and why each tradeoff matters.
- Brendan Gregg, "The Flame Graph" (ACM Queue, 2016) — the original flamegraph paper, with the data model that underlies every continuous profiler today.
- Pyroscope documentation, "How Pyroscope works" — the modern Indian-fintech-default open-source stack; covers agent-side folding, deduplicated storage, FlameQL.
- Parca documentation, "Architecture overview" — the eBPF-first continuous-profiling stack, with on-CPU/off-CPU and pprof-data-model querying.
- Felix Geisendörfer, "The Busy Developer's Guide to Continuous Profiling" — accessible introduction to why continuous profiling is structurally different from one-shot profiling, with production-overhead numbers from a vendor's deployment.
- /wiki/wall-kernel-level-observability-is-a-different-world — the previous wall, on which this one builds: kernel-level observability is the substrate that continuous profiling operates on top of.
- /wiki/ebpf-limitations-in-production — chapter 53's catalogue of what stops working between the eBPF demo and the production deploy applies directly to eBPF-based profilers and is the cost ladder Part 9 takes for granted.