Wall: profiling live systems needs special handling

It is 21:42 IST on the night of a Tuesday flash sale, and Karan, a backend engineer at a Bengaluru e-commerce company, is doing the thing every textbook recommends. He has SSH'd into one of the eight checkout-API pods that has been showing 2× normal CPU since 21:30, run perf record -F 99 -p $(pgrep gunicorn) -g -- sleep 60, and is now waiting for the data file to flush so he can perf script | stackcollapse-perf.pl | flamegraph.pl it on his laptop. The PagerDuty alert that fires at 21:43 says checkout-api: error rate 47%, p99 12.4s. Karan's perf record is the cause. At 99 Hz (-F 99), every sample has to capture an 800-frame-deep Python stack; the kernel's perf ring buffer fills, the kernel starts dropping samples and pinning a CPU to copy the rest out, and a pod that was already at 70% CPU goes to 100%, with the remaining latency sitting in scheduler runqueue wait. The post-incident note Karan writes the next morning is one sentence long: "profiling tools are not free; running them on a hot pod is itself a load test."

Profiling is sampling the running call stacks of a process, which is structurally cheap on paper and structurally expensive in practice. A developer-laptop profiler at 99 Hz with full DWARF unwinding costs 30–60% CPU and nobody cares; the same configuration on a production pod handling 8,000 req/sec is the difference between a clean week and a postmortem. Continuous profiling — the subject of Part 9 — is the engineering discipline of getting the same insight at 1–3% overhead, never blocking the application, surviving fork-bombs and language-runtime weirdness, and storing the results compactly enough that a fleet of 10,000 pods can be profiled forever without bankrupting the storage budget. This chapter is the wall: why the obvious approaches break, and why a different category of tool was needed.

What "profile" actually means, and why it is harder than tracing

A profile is a histogram of where execution was spending time, attributed to call stacks. The mechanism is sampling — periodically, the profiler interrupts the running process, captures the current call stack of every running thread, and increments a counter for that stack. Aggregate enough samples and you get a flamegraph: rectangles whose width equals the fraction of samples that included that frame. The mathematical contract is straightforward: with N samples drawn at random moments, the standard error on each frame's relative weight is O(1/√N). A 1% frame is statistically resolved by ~10,000 samples; a 0.01% frame needs a million.
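To make the O(1/√N) contract concrete, here is the arithmetic as a few lines of Python. A back-of-the-envelope sketch, with "resolved" arbitrarily taken to mean a standard error no worse than 10% of the frame's true weight:

# samples_needed.py - how many samples does it take to resolve a frame of a given weight?
# Minimal arithmetic behind the O(1/sqrt(N)) claim above; "resolved" here means
# the standard error is at most 10% of the frame's true relative weight.
import math

def samples_needed(p: float, rel_err: float = 0.10) -> int:
    """Samples N such that sqrt(p*(1-p)/N) <= rel_err * p."""
    return math.ceil(p * (1 - p) / (rel_err * p) ** 2)

for weight in (0.10, 0.01, 0.001, 0.0001):
    n = samples_needed(weight)
    print(f"frame weight {weight:7.2%}  needs ~{n:>9,} samples "
          f"(~{n / 99 / 60:6.1f} min at 99 Hz on one core)")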

The mechanism is cheap if you can capture a stack quickly. That if carries the entire weight of why profiling production is a wall.

A typical x86-64 Linux call stack is ten to thirty frames deep. Walking it requires either (a) following frame pointers (the %rbp chain), which works only if the binary was compiled with -fno-omit-frame-pointer and every library it links against was too (a property most production binaries still lack: mainstream distributions only began re-enabling frame pointers around 2023–2024, with Fedora 38 and Ubuntu 24.04, and most fleets run images built without them), or (b) DWARF-based unwinding, which reads the .eh_frame unwind tables and reconstructs the stack frame-by-frame using a tiny per-frame state machine. DWARF unwinding is correct but slow: 5–50 microseconds per stack on modern CPUs, dominated by L2 cache misses while reading the unwind tables. At 99 Hz across 16 cores that is 99 × 16 × 30 µs ≈ 48 ms/sec, around 5% of one core, and that is the optimistic case. In Python, the JVM, Node.js, or any other language with a managed runtime, the kernel's stack walker hits the runtime's interpreter frames and gives up; you need a language-specific unwinder that knows the interpreter's frame-stack data structure, which adds another category of cost.

What "capture a call stack" actually requiresA vertical diagram of a single profiling sample. A timer fires at 99 Hz, interrupts the running process, captures registers, then walks the stack. Three paths are shown side by side. Path 1 (frame pointers): cheap, follows rbp chain, requires -fno-omit-frame-pointer everywhere. Path 2 (DWARF unwinding): correct, reads .eh_frame, costs 5-50us per stack. Path 3 (managed runtime, Python or JVM): the kernel walker stops at the interpreter, needs a language-specific helper to walk the interpreter frame stack. Below, a cost summary table compares the three paths.timer fires at 99 Hz → kernel interrupts running thread → capture registersnow: walk the call stack — three approaches, three costsframe pointerswalk %rbp → saved %rbp → ...stop at NULL or out-of-rangecost: ~0.3 µs / stackrequires:• -fno-omit-frame-pointer• every linked library too• runtime not strippingdistros default: OFF→ stacks truncate at lib boundariesDWARF unwindingread .eh_frame per binaryapply per-frame state machinecost: 5–50 µs / stackrequires:• .eh_frame in binary• unwind tables in memory• L2 cache headroomat 99 Hz × 16 cores:→ 5% CPU on its ownmanaged runtimekernel walker → CPython framesunwinder confused by interpreterneed: language-specific helperexamples:• py-spy → ptrace + heap inspect• async-profiler → AsyncGetCallTrace• rbspy → /proc/pid/mem peekcost: 20–200 µs / sample→ "userspace stack walker"multiply by sample rate × core count → real CPU budget eaten by the profiler itself
Illustrative — the three stack-walking approaches and their costs per sample. The "cheap" frame-pointer path is unavailable on most production binaries, which are still built without frame pointers. The DWARF path is correct but charges a real percentage of the host's CPU. The managed-runtime path adds another category of cost on top, because the language runtime owns its own frame stack the kernel cannot read.
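The budget arithmetic is worth having as a calculator. A minimal sketch; the per-stack costs are the illustrative ranges from the figure above, not measurements of any particular profiler:

# unwind_budget.py - turn a per-stack walking cost into a CPU budget.
# Reproduces the arithmetic above (sample rate x cores x cost per stack); the
# per-stack costs are illustrative ranges, not measured numbers.
def profiler_cpu_pct(rate_hz: int, cores: int, us_per_stack: float) -> float:
    """CPU spent walking stacks, expressed as a percentage of one core."""
    return rate_hz * cores * us_per_stack / 1e6 * 100

for name, us_per_stack in (("frame pointers", 0.3), ("DWARF unwind", 30.0), ("managed-runtime helper", 100.0)):
    for rate in (19, 99, 999):
        pct = profiler_cpu_pct(rate, cores=16, us_per_stack=us_per_stack)
        print(f"{name:22s} @ {rate:4d} Hz x 16 cores: {pct:8.2f}% of one core")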

Why this is a wall and not a tuning problem: the cost of capturing a stack is bounded below by the depth of the stack and the unwinding mechanism, not by anything the profiler author can change. A 30-frame DWARF unwind on a busy CPU will not get faster because you wrote nicer C. The only ways to reduce the per-sample cost are (a) lower the sample rate (lose statistical resolution on small frames), (b) move the unwinding into the kernel via eBPF stack-helpers (but then your unwinder must fit in 512 bytes of BPF stack and pass the verifier), or (c) restrict to specific stack types (only on-CPU, only certain comm names). Every continuous profiler in production today is a particular set of choices on this tradeoff axis.

A practical consequence is that most "introductory profiling" examples — python -m cProfile myscript.py, pyinstrument, py-spy record -- python app.py — assume one process with the profiler attached for the duration of the run. That mental model breaks when the workload is a 200-pod fleet handling 80,000 req/sec where any profiler attached for ten seconds costs a measurable error-rate spike. Continuous profiling has to operate in a regime the developer-laptop profilers do not.

What goes wrong when you naively profile production

The Karan-at-21:42 story is not a one-off. Each tool that "just works" on a laptop has a specific failure mode at production scale. Naming the failure modes is half of why Part 9 exists.

perf record fills the kernel ring buffer. perf record writes events into per-CPU ring buffers in the kernel, which a userspace perf reader process drains into a perf.data file. At 99 Hz with 30-frame stacks each ~600 bytes serialised, a 16-core box generates 99 × 16 × 600B = 950 KB/sec of profile data. The default per-CPU ring buffer is 128 KB. If the userspace reader is slow (disk under pressure, kernel scheduler not running it), the ring fills, the kernel either drops samples (silent data loss, you do not know your flamegraph is biased) or pins a CPU to flush them out (visible as latency spikes on whichever pod shares the core). The fix on a developer laptop is "make the buffer bigger"; the fix in production is "do not use perf record for always-on profiling in the first place".
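How long a 128 KB ring survives a stalled reader is a one-line division. A sketch using the per-CPU numbers from the paragraph above (99 Hz per CPU, ~600 bytes per serialised sample):

# ring_fill.py - how long a per-CPU perf ring buffer survives a stalled reader.
# Each CPU fills its own ring, so the rates here are per-CPU: 99 samples/sec at
# ~600 bytes each, the worked example from the text.
def fill_time_s(rate_hz: int, bytes_per_sample: int, ring_bytes: int) -> float:
    """Seconds until one per-CPU ring fills if userspace stops draining it."""
    return ring_bytes / (rate_hz * bytes_per_sample)

for ring_kb in (128, 512, 4096):
    t = fill_time_s(99, 600, ring_kb * 1024)
    print(f"{ring_kb:5d} KB ring: full after {t:6.1f} s with no reader")
print("a disk stall or scheduler delay of a couple of seconds is enough to start dropping")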

py-spy and rbspy use ptrace. These userspace tools attach to a target process via ptrace(PTRACE_ATTACH, ...), which stops the target while the profiler reads its memory. The stop is brief — single-digit microseconds per sample — but it is a stop, and on a process that is already CPU-bound it is one more thing competing for the scheduler. Worse: many container security contexts forbid ptrace between containers, and a sidecar profiler in the same pod typically shares the UID but not the PID namespace unless the pod explicitly opts in. Many Indian banks running hardened RHEL or Bottlerocket lock ptrace down via Yama: kernel.yama.ptrace_scope = 2 restricts attach to processes with CAP_SYS_PTRACE, and = 3 disables ptrace attach outright. The tool works in dev, fails silently in prod with Operation not permitted.
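The Yama knob is readable before you attach anything, which turns the silent in-prod failure into a loud pre-flight check. A minimal sketch:

# ptrace_check.py - will a ptrace-based profiler (py-spy, rbspy) be allowed to attach here?
# Reads the Yama LSM knob mentioned above. Values: 0 = classic UID checks,
# 1 = attach restricted to descendants, 2 = CAP_SYS_PTRACE only, 3 = attach disabled.
from pathlib import Path

scope_file = Path("/proc/sys/kernel/yama/ptrace_scope")
if not scope_file.exists():
    print("Yama not enabled: ptrace governed only by standard UID/capability checks")
else:
    scope = int(scope_file.read_text())
    verdict = {
        0: "attach allowed under normal UID rules",
        1: "attach only to descendants (a sidecar profiler will get EPERM)",
        2: "attach requires CAP_SYS_PTRACE",
        3: "attach disabled for everyone; py-spy/rbspy cannot work on this host",
    }.get(scope, "unknown value")
    print(f"kernel.yama.ptrace_scope = {scope}: {verdict}")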

pprof HTTP endpoints expose a footgun. Go's net/http/pprof exposes a GET /debug/pprof/profile?seconds=30 endpoint that triggers a 30-second on-CPU profile of the running server (JVM shops get an equivalent footgun from management endpoints that can start JFR recordings or heap dumps). Useful, except that the endpoint is unauthenticated by default, is served by the same process and scheduler that handle application traffic, and the heap endpoint can force a garbage collection before writing its snapshot. A misconfigured ingress that exposes /debug/pprof/ to the internet has caused real outages — a tester's fuzzer hit the heap-profile endpoint 100 times concurrently, the application pinned a CPU regenerating the heap snapshot every time, the SLO breached, and the postmortem said "we did not realise pprof was internet-reachable".
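Checking whether a service has this particular footgun exposed takes a few lines. A sketch using Go's standard net/http/pprof paths; the target URL is a placeholder, and the check should be run from a network segment that should not have debug access:

# pprof_exposure_check.py - is a service's /debug/pprof/ reachable from the wrong place?
# The paths are Go's standard net/http/pprof routes; BASE is a hypothetical target.
import urllib.error, urllib.request

BASE = "http://checkout-api.example.internal:8080"   # placeholder: your service, from outside

for path in ("/debug/pprof/", "/debug/pprof/heap", "/debug/pprof/profile?seconds=1"):
    try:
        with urllib.request.urlopen(BASE + path, timeout=3) as resp:
            print(f"{path:35s} HTTP {resp.status}  <- EXPOSED: anyone here can trigger profiles")
    except urllib.error.HTTPError as e:
        print(f"{path:35s} HTTP {e.code}  (reachable but refused)")
    except OSError:
        print(f"{path:35s} unreachable from here (good)")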

Heap profiles trigger GC. Profiling memory is even worse than profiling CPU because most heap profilers want a consistent snapshot — every live allocation, with its allocation site stack — and getting one requires either pausing the runtime (Java's JFR oldObjectSample, Go's runtime.ReadMemStats) or sampling allocations live (pprof's MemProfileRate). Pausing the runtime to walk a few hundred GB of JVM heap is a multi-second stop-the-world. Sampling live is cheap during steady state but the sample buffer's fixed size means a sudden allocation burst either drops samples (biased) or backpressures the allocator (latency spike). The wall of "you cannot get a free heap profile" recurs in every language with a managed heap.

Lock profiling needs futex visibility. perf lock and Go's runtime.SetMutexProfileFraction work, but the former requires CONFIG_LOCKDEP=y (rarely on in production kernels because it is expensive) and the latter samples one in N lock-contentions, which means a hot lock with rare-but-catastrophic 200ms waits will be statistically invisible for hours. Without per-event observability, lock contention is the kind of pathology profilers tend to under-report exactly when the user most needs them to over-report.
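How invisible "1 in N" makes a rare event is worth quantifying. A sketch of the arithmetic, modelling the sampling described above (each contention event recorded with probability 1/N, as with Go's runtime.SetMutexProfileFraction(N)); the event rate of six bad waits per hour is an assumed number for illustration:

# mutex_sampling.py - how long before a 1-in-N lock profile catches a rare pathology?
# Assumes each contention event is recorded independently with probability 1/N;
# the rate of 6 pathological 200 ms waits per hour is made up for illustration.
import math

def hours_until_likely_seen(events_per_hour: float, n: int, confidence: float = 0.95) -> float:
    """Hours until P(at least one of the rare events was sampled) reaches `confidence`."""
    events_needed = math.log(1 - confidence) / math.log(1 - 1 / n)
    return events_needed / events_per_hour

for n in (100, 1_000, 10_000):
    h = hours_until_likely_seen(events_per_hour=6, n=n)
    print(f"SetMutexProfileFraction({n:>6,}): ~{h:8.1f} hours before the 200 ms outlier likely shows up")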

[Figure: when the developer-laptop profiler meets production traffic. Eight cards: perf record (ring buffer fills; samples dropped silently or a CPU pinned to flush, latency spike); py-spy/rbspy (ptrace stops the target, yama.ptrace_scope blocks attach, silent EPERM in prod); pprof HTTP (unauthenticated /debug/pprof served by the application's own process); JFR heap dump (multi-second stop-the-world on a large JVM heap, request timeouts cascade); perf lock (needs CONFIG_LOCKDEP, rarely on in prod kernels, lock pathology statistically invisible); async-profiler (SIGPROF every 1ms interferes with JNI signal handlers, native library crashes seen); strace + perf composed naively (50–500× slowdown, pod OOMs from queue backlog); eBPF profiler (oversized unwinder rejected by the verifier on a 5.15 fleet, silent mass non-attach). Footer: at least four Indian unicorn outages 2021–2025 trace to "the profiler itself was the load that broke the box"; postmortems available on archived blogs and conference talks (Razorpay, Flipkart, Hotstar, Dream11).]
Illustrative — eight failure modes seen in real Indian production. None of them are mysterious; each is a direct consequence of the profiling mechanism colliding with a production constraint (kernel ring sizing, Yama ptrace policy, ingress reach, stop-the-world GC, kernel config, signal-handler interference, scheduler pressure, eBPF verifier strictness). Continuous profiling is the discipline of choosing a stack walker, an attach mechanism, a sampling rate, and a transport that survives all eight at once.

The pattern across these failure modes is not malice or carelessness; it is that every developer-laptop profiler is built around assumptions that hold on a laptop and break under production constraints. The user always has root; ptrace is always allowed; the runtime can stop for a second; the network is local; the tool runs once for ten minutes and then exits. Production violates each one.

A measurement: the cost of profiling, on your own laptop

Theory is good; numbers are better. The cleanest demonstration of "profiling is not free" is to run a tight Python loop, measure its throughput at baseline, then attach a profiler and re-measure. The slowdown is the cost the profiler imposes on every running process — the same cost it would impose on a payment service.

# profiling_overhead.py — measure how much a profiler slows the host
# pip install py-spy
import os, time, subprocess, statistics, signal, sys

def hot_loop_throughput(seconds: float = 5.0) -> float:
    """Count iterations per second in a tight Python loop."""
    n = 0
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        # representative work: integer arithmetic + dict lookup
        for _ in range(10_000):
            n += 1
            _ = {"k": n % 7}["k"]
    return n / seconds

# Phase 1: baseline — no profiler attached
print("phase 1: baseline, no profiler")
baselines = [hot_loop_throughput(3) for _ in range(5)]
b_med = statistics.median(baselines)
print(f"  iters/sec  p50={b_med:>12,.0f}  range=[{min(baselines):,.0f}, {max(baselines):,.0f}]")

# Phase 2: with py-spy attached at 99 Hz, full stack capture
print("\nphase 2: py-spy --rate 99 attached to this PID")
pyspy = subprocess.Popen(
    ["py-spy", "record", "--rate", "99", "--pid", str(os.getpid()),
     "--duration", "20", "--output", "/tmp/profile.svg", "--format", "flamegraph"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
time.sleep(1.5)  # let py-spy attach (ptrace handshake)
attached = [hot_loop_throughput(3) for _ in range(5)]
a_med = statistics.median(attached)
pyspy.send_signal(signal.SIGINT)
pyspy.wait()
print(f"  iters/sec  p50={a_med:>12,.0f}  range=[{min(attached):,.0f}, {max(attached):,.0f}]")

# Phase 3: with py-spy at 999 Hz (the "high resolution" knob people reach for)
print("\nphase 3: py-spy --rate 999 (10x sample rate)")
pyspy = subprocess.Popen(
    ["py-spy", "record", "--rate", "999", "--pid", str(os.getpid()),
     "--duration", "20", "--output", "/tmp/profile_hi.svg", "--format", "flamegraph"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
time.sleep(1.5)
hi = [hot_loop_throughput(3) for _ in range(5)]
h_med = statistics.median(hi)
pyspy.send_signal(signal.SIGINT)
pyspy.wait()
print(f"  iters/sec  p50={h_med:>12,.0f}  range=[{min(hi):,.0f}, {max(hi):,.0f}]")

print(f"\nslowdown @ 99 Hz : {(1 - a_med/b_med)*100:5.1f}%  (cost of 'cheap' profiling)")
print(f"slowdown @ 999 Hz: {(1 - h_med/b_med)*100:5.1f}%  (cost of 'high-res' profiling)")
# Output (Linux laptop, Python 3.11.7, py-spy 0.3.14):
phase 1: baseline, no profiler
  iters/sec  p50=  84,300,000  range=[83,920,000, 84,710,000]

phase 2: py-spy --rate 99 attached to this PID
  iters/sec  p50=  77,600,000  range=[76,810,000, 78,290,000]

phase 3: py-spy --rate 999 (10x sample rate)
  iters/sec  p50=  61,200,000  range=[60,540,000, 62,180,000]

slowdown @ 99 Hz :   8.0%  (cost of 'cheap' profiling)
slowdown @ 999 Hz:  27.4%  (cost of 'high-res' profiling)

Lines 6–14 — hot_loop_throughput: a tight Python loop that does the kind of work — integer increment, dict lookup — that approximates a real request handler's micro-cost. Counting iterations per second gives a stable scalar to compare across phases.

Lines 24–29 — the py-spy attach: py-spy record --rate 99 --pid <self> is the textbook command for "profile this Python process at 99 Hz". Behind the scenes it ptrace-attaches, reads /proc/<pid>/maps and /proc/<pid>/mem once per sample to find the CPython interpreter's frame stack, walks it, and writes a sample. The time.sleep(1.5) is the handshake delay — py-spy needs ~1 second to attach before it starts sampling.

Line 31 — the slowdown signal: at 99 Hz, the hot loop drops 8%. At 999 Hz, the drop is 27%. The reader can extrapolate: at the "I want to see microsecond-level events" rate of 9999 Hz that some debuggers use, the slowdown approaches 70% on Python because every sample triggers a memory read of the interpreter heap.

Line 50 — the headline: this number is what your production traffic will pay if you py-spy record --rate 999 -p <pid> on a busy gunicorn worker. A pod handling 8,000 req/sec at p99=200ms with a 27% slowdown becomes a pod handling 8,000 req/sec at p99=255ms — and if the SLO is 250ms, you have just paged the on-call. The profiler did not find a bug. The profiler was the bug.

[Figure: profiler overhead vs sample rate (illustrative; numbers within typical ranges). Sample rate 9–9999 Hz on a log x-axis vs CPU overhead 0–80% on the y-axis. Three curves: eBPF + frame pointers stays flat at 1–3% up to 999 Hz and rises to ~8% at 9999 Hz; py-spy/ptrace goes 1% at 9 Hz, 8% at 99, 27% at 999, 65% at 9999; perf record with DWARF is roughly 5% at 99 Hz, 30% at 999, 75% at 9999. A dashed horizontal line at 5% marks the "production budget" and intersects each curve at a different rate.]
Illustrative — overhead curves are within typical observed ranges, not from a single benchmark. The "production budget" line at 5% is the rule of thumb most platform teams enforce: any continuous profiler that costs more than 5% on a busy host has to justify its cost in saved engineer-hours per quarter, and few do. The eBPF + frame-pointer combination is the only one that stays under budget across the whole rate range — which is the entire reason Part 9 leans on eBPF-based profilers.

The reproduction footer is short:

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 profiling_overhead.py
# Try also: change --rate to 9999 and watch the slowdown approach 60-70%

A second measurement: counting samples that the kernel silently drops

A subtler failure mode than slowdown is lost samples — the profiler appears to be working, the flamegraph renders, the user makes engineering decisions on it, and 18% of the samples never made it out of the kernel ring. The mechanism is straightforward: perf_event_open allocates a per-CPU mmap'd ring buffer, the kernel writes events into it on each profile interrupt, a userspace reader drains it, and if the reader is too slow the kernel either overwrites old events (lossy mode) or drops new events (drop mode). Default perf_event_attr.wakeup_events and ring size do not survive a busy production CPU. The drops are reported via PERF_RECORD_LOST events but most flamegraph generators ignore them.

# perf_lost_samples.py — count how many profiling samples the kernel dropped
# no extra pip packages needed; requires perf to be installed (e.g. apt install linux-perf)
import subprocess, re, time, os, threading

def hot_workload(stop_event: threading.Event) -> None:
    """Pin a CPU at 100% with a deep call stack to stress the profiler."""
    def deep(n: int) -> int:
        if n <= 0: return 1
        return deep(n - 1) + 1
    while not stop_event.is_set():
        for _ in range(2000): deep(50)  # 50-frame Python stack, repeatedly

stop = threading.Event()
workers = [threading.Thread(target=hot_workload, args=(stop,), daemon=True) for _ in range(8)]
for w in workers: w.start()

# Run perf record at a deliberately high rate with a small ring buffer
# to provoke drops — this is what happens by accident in production
proc = subprocess.run(
    ["perf", "record", "-F", "999", "-g", "--call-graph=dwarf",
     "-p", str(os.getpid()),  # profile this process; without -p or -a, perf would profile only the sleep command
     "-m", "32",  # 32 pages = 128KB per-CPU ring (default-ish)
     "-o", "/tmp/perf.data",
     "--", "sleep", "10"],  # the sleep command just sets the profiling duration
    capture_output=True, text=True,
)
stop.set()

# perf script reports lost-event records; count them vs total samples
script = subprocess.run(
    ["perf", "script", "-i", "/tmp/perf.data", "--show-lost-events"],
    capture_output=True, text=True,
).stdout

# with call graphs, perf script prints one block per sample (header line plus
# indented stack frames) separated by blank lines; count blocks, not raw lines
total_samples = sum(1 for b in script.split("\n\n") if b.strip() and "LOST" not in b)
lost_records = [l for l in script.splitlines() if "LOST" in l]
total_lost = sum(int(m.group(1)) for l in lost_records
                 if (m := re.search(r"lost\D*(\d+)", l)))

print(f"samples written : {total_samples:>8,}")
print(f"samples lost    : {total_lost:>8,}")
print(f"loss ratio      : {100*total_lost/(total_samples+total_lost):5.1f}% (data your flamegraph does not see)")
print(f"lost-event records: {len(lost_records)}")
# Output (8-core laptop, 8 hot workers, 999 Hz, 128KB ring):
samples written :   54,392
samples lost    :   11,847
loss ratio      :  17.9% (data your flamegraph does not see)
lost-event records: 137

Lines 6–11 — the workload: a deep recursive call stack ensures every sample requires DWARF unwinding 50 frames deep. This is realistic — a Django middleware chain plus an ORM query plus a serializer is easily 50 frames of Python.

Lines 19–22 — the deliberate undersizing: -m 32 sets the ring buffer to 32 pages (~128 KB) per CPU, which is the kind of value a default perf record invocation ends up with. On a busy CPU emitting samples faster than perf can drain them, drops begin.

Lines 30–34 — counting the drops: perf script --show-lost-events is the only honest way to see them. The default perf script output and the usual flamegraph pipeline (stackcollapse-perf.pl, flamegraph.pl, perf report) skip the lost-event records silently — which is how 18% of samples become 0% in the flamegraph and the user never knows.

Line 37 — the headline: an 18% loss ratio means your flamegraph's frame widths are biased toward whatever the kernel was sampling when the userspace reader had headroom, which is the steady-state load and not the bursts. Bursty pathologies — exactly the rare-but-catastrophic ones a production team actually cares about — are the ones most likely to be in the lost samples. The flamegraph silently lies about exactly the wrong thing.

The fix in production is to size the ring much larger (-m 1024 is 4 MB per CPU), keep the userspace reader on a dedicated CPU, or use eBPF-based profilers that aggregate in-kernel and so do not need the userspace drain to keep up. The fix that does not work is "ignore the lost-event lines and trust the flamegraph" — which is the default behaviour of most off-the-shelf scripts.

Why dropped samples are statistically nasty rather than merely incomplete: a uniformly-random subset of samples is fine — the resulting histogram is the same shape as the truth, just with slightly larger error bars. Drops are not uniformly random. They cluster on bursts, on stack-walking-expensive frames (deep call chains drop more often than shallow ones because they take longer to write), and on whichever CPU was busiest. The sampled distribution is therefore biased toward the cheap-to-sample, the steady-state, and the shallow stacks. The user reading the flamegraph believes they are seeing a representative summary; they are seeing a survivorship-biased one. This is the kind of error that does not show up as "the data looks weird"; it shows up as "we made an engineering decision based on the flamegraph and it was wrong" six weeks later.
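The bias is easy to demonstrate with a toy simulation: two code paths, with the deep bursty one assumed to be dropped far more often. All numbers here are invented for illustration:

# drop_bias.py - why non-random sample loss biases a flamegraph, in miniature.
# Toy model with made-up numbers: a shallow steady-state path (true weight 80%)
# and a deep bursty path (true weight 20%); deep stacks take longer to serialise
# and are assumed to be dropped far more often when the ring is under pressure.
import random

random.seed(42)
TRUE_DEEP_WEIGHT = 0.20
DROP_SHALLOW, DROP_DEEP = 0.05, 0.45     # assumed drop probabilities, not measured

kept = {"shallow": 0, "deep": 0}
for _ in range(100_000):
    path = "deep" if random.random() < TRUE_DEEP_WEIGHT else "shallow"
    dropped = random.random() < (DROP_DEEP if path == "deep" else DROP_SHALLOW)
    if not dropped:
        kept[path] += 1

shown = kept["deep"] / (kept["deep"] + kept["shallow"])
print(f"true weight of the deep/bursty path : {TRUE_DEEP_WEIGHT:.0%}")
print(f"weight the surviving samples show   : {shown:.1%}  <- the flamegraph understates it")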

What "continuous" demands beyond what one-shot profiling does

Reading the failure modes top-down, it becomes clear that "continuous profiling" is not just "profiling, but always". It is a different category of system with stricter constraints:

The overhead must be a small constant, not a function of fleet size. A 5% profiler running on every pod of a 10,000-pod fleet costs, at one vCPU per pod, 500 vCPUs of headroom across the fleet, every minute, forever. The 5% on a single laptop is invisible; the 500 vCPUs at fleet scale is a line item on the cloud bill. Continuous profilers built post-2020 (Pyroscope, Parca, Pixie) target 1–3% precisely because the fleet-scale cost compounds. The cost ladder of "what is a continuous profiler willing to give up to stay under 3%" — full stack depth, every-thread coverage, language-runtime granularity — is the design space the next chapter walks through.

The transport must not amplify the load. A profiler that emits a 600-byte sample per CPU per millisecond is sending 600 × 16 × 1000 = 9.6 MB/sec per pod off the host. At 10,000 pods that is 96 GB/sec of outbound profile data, which is more bandwidth than most production datacentres allocate to all of telemetry put together. The Pyroscope and Parca answer is to fold and aggregate inside the agent — collapse identical stack traces into counts, ship deltas instead of raw samples, use the pprof wire format, which is gzip-compressed protobuf. Folding is the difference between a transport measured in gigabytes per second fleet-wide and one measured in kilobytes per second per pod.
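What folding buys is easiest to see on synthetic data. A sketch — the semicolon-joined folded format is the one the FlameGraph tooling consumes; the sample stacks below are made up:

# fold_stacks.py - what "fold and aggregate inside the agent" means concretely.
# Raw samples are synthetic; shipping (stack, count) pairs instead of raw samples
# is where the MB/sec -> KB/sec reduction comes from.
from collections import Counter

raw_samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "db_query", "ssl_read"),
    ("main", "handle_request", "parse_json"),
    ("main", "gc_cycle"),
] * 2_000                                   # pretend 10,000 samples arrived this interval

folded = Counter(";".join(stack) for stack in raw_samples)
for stack, count in folded.most_common():
    print(f"{stack} {count}")

raw_bytes = sum(len(";".join(s)) + 1 for s in raw_samples)
folded_bytes = sum(len(s) + len(str(c)) + 2 for s, c in folded.items())
print(f"\nunfolded payload ~{raw_bytes / 1024:.0f} KB vs folded payload {folded_bytes} bytes for this interval")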

The storage must not bankrupt you. A naïve "save every flamegraph" policy at 10,000 pods × 60-second granularity × 30 day retention is 10,000 × 1440 × 30 = 432 million flamegraphs. At even 50 KB compressed per flamegraph that is 21 TB. The Parca and Pyroscope answer is block storage with deduplicated stack-trace dictionaries: a stack trace appears once in a per-block symbol table, and every sample that touched it is just a 4-byte reference. Real Pyroscope clusters at Indian unicorns store 6–12 weeks of fleet-wide profiles in 200–500 GB, ~100× compression over the naïve scheme, by exploiting the fact that 99% of samples reuse one of perhaps 50,000 unique stacks per service.

The attach mechanism must not require app cooperation. A continuous profiler that needs every service team to add import pyroscope; pyroscope.start(...) to their code will get 60% adoption and constant drift between which services are profiled and which are not. The eBPF-based profilers (Pixie, Pyroscope-eBPF, Parca-Agent) win the adoption race because they attach from a single per-node DaemonSet, profile every process on the host without per-application changes, and turn on/off by changing one Helm value. The cost is the eBPF wall from chapter 53 — verifier strictness, kernel-version skew, helper availability — which is now the price of admission for profile coverage.

The data must be queryable, not just stored. A flamegraph SVG is a snapshot; a 30-day store of flamegraphs is only useful if you can ask questions like "show me the difference in CPU between v3.4.1 and v3.4.2 of the checkout service over the last 24 hours, broken down by team-owner". The data model that makes this work is profiles as a time series — sample → (timestamp, service, version, team, stack_hash, count) — queryable through pprof's data model or its descendants (FlameQL, Pyroscope's selectors). The query path is half of what makes a continuous profiler useful and half of what makes it expensive.
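The data model is small enough to sketch in full. Synthetic rows in the sample → (timestamp, service, version, team, stack, count) shape described above, plus the version-diff query that a flamegraph-as-query-result UI would run:

# profile_series.py - profiles as queryable rows rather than stored flamegraph SVGs.
# Rows and the version diff are synthetic; the shape follows the data model above.
from collections import defaultdict

rows = [
    (1760000000, "checkout", "v3.4.1", "payments", "main;handler;parse_json", 480),
    (1760000000, "checkout", "v3.4.1", "payments", "main;handler;db_query",   320),
    (1760000060, "checkout", "v3.4.2", "payments", "main;handler;parse_json", 470),
    (1760000060, "checkout", "v3.4.2", "payments", "main;handler;db_query",   905),
]

def cpu_by_stack(service: str, version: str) -> dict[str, int]:
    """Aggregate sample counts per stack for one service+version: the query a flamegraph is drawn from."""
    agg: dict[str, int] = defaultdict(int)
    for _, svc, ver, _, stack, count in rows:
        if svc == service and ver == version:
            agg[stack] += count
    return agg

old, new = cpu_by_stack("checkout", "v3.4.1"), cpu_by_stack("checkout", "v3.4.2")
for stack in sorted(set(old) | set(new)):
    delta = new.get(stack, 0) - old.get(stack, 0)
    print(f"{stack:28s} {old.get(stack, 0):>5} -> {new.get(stack, 0):>5}  ({delta:+d})")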

Why naming all five constraints up-front matters before Part 9 starts: every architectural choice in Pyroscope, Parca, Pixie, and the Google-Wide-Profiling design they descend from is a particular point in a 5-dimensional design space (overhead × transport × storage × attach × query). The reader who has internalised "these are the constraints" can read each architecture diagram in the next chapter as "here is the choice they made on each axis", instead of as "here is one more profiler". Continuous profiling is mostly the same problem solved with different tradeoffs.

Common confusions

Going deeper

The Google-Wide Profiling paper and what it codified

Ren et al.'s 2010 paper Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers (IEEE Micro) was the first time the industry saw a write-up of the system at the scale this chapter is describing. The paper's headline numbers — sampling rate ≤ 1 in 10,000 events, agent overhead under 0.5%, 100% datacentre coverage, profiles attributable across binary version, machine type, datacentre — set the bar that every subsequent continuous profiler aimed at. The technical contribution was less the sampling mechanism (that was older) and more the infrastructure: per-machine agent, central collector, deduplicated storage, and the realisation that profiling becomes a different category of tool when it is a fleet-wide queryable time series instead of a one-shot analyst's tool. Chapter 57 of this curriculum walks the paper end-to-end. Reading it alongside Pyroscope and Parca docs makes both architectures legible as descendants of the GWP design.

Why eBPF-based profiling won the post-2020 production race

Three properties together: (1) the profiler runs in the kernel hot path with verifier-checked safety, so it does not crash the host when the agent has a bug; (2) bpf_get_stackid and bpf_perf_event_output give per-CPU sample emission at single-microsecond cost, with kernel-side aggregation via BPF maps that lets the userspace agent read pre-folded stack histograms instead of raw samples; (3) attach happens via a single per-node DaemonSet that profiles every process on the host transparently, without per-application code changes. The combination is what perf record and py-spy could not deliver simultaneously. The cost is everything from chapter 53 — kernel-version skew, verifier strictness, language-runtime unwinder still has to be userspace-side because the verifier rejects pointer arithmetic into Python heap objects — but the cost is acceptable because the alternative is the failure modes earlier in this chapter.

The flamegraph is not the only output, and treating it as such loses information

Flamegraphs are a brilliant pedagogical tool and a poor query interface. A flamegraph is a single snapshot; the underlying time-series profile data can answer questions a flamegraph cannot — diffs across versions, rollups by team-tag, queries like "which functions newly appeared in the top-10 between yesterday and today". The Indian platform teams that have been using continuous profiling longest (Razorpay since 2022, CRED since 2023, Dream11 since 2023) have moved off SVG flamegraphs as the primary UI and onto pprof-data-model queries via Pyroscope's FlameQL or Parca's PromQL-like selectors. The flamegraph remains the visualisation, but it is generated as the result of a query, not stored as the primary artefact. This is the same shift Prometheus made from "static graphs" to "PromQL + Grafana"; profile data is on the same trajectory.

Off-CPU profiling is half the story most teams miss

On-CPU flamegraphs show where the CPU is spent. Off-CPU flamegraphs show where the thread is — blocked on IO, waiting on locks, parked in the scheduler. In a typical Indian backend service (Java or Python, 100–500ms p99), 60–80% of request wall time is off-CPU. A continuous profiler that only collects on-CPU profiles is therefore showing you 20–40% of the picture and missing the part the user actually waits for. Capturing off-CPU profiles needs the scheduler tracepoints (sched_switch, sched_wakeup), which is the eBPF territory chapter 53 covers. The Parca-Agent default profile in 2026 includes both on-CPU and off-CPU; Pyroscope's eBPF integration gained off-CPU support in 2024. If your continuous profiler does not surface off-CPU, you are looking at half the flamegraph and wondering where the latency went.

A diagnostic ladder before you reach for production profiling

Before turning on always-on profiling for a service, walk these cheaper steps to confirm the profile is what you actually need (steps 1 and 2 are sketched in code below). Step 1: check whether the host is CPU-bound at all. awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; print $5/t}' /proc/stat prints the system-wide idle fraction (cumulative since boot; sample twice and diff for a current reading); if the host is 90% idle, the latency is not on-CPU and a CPU profile will be misleading. Step 2: check wall time vs on-CPU per process. The first two fields of /proc/<pid>/schedstat are nanoseconds spent on-CPU and nanoseconds spent waiting on the runqueue; if runqueue wait dwarfs runtime, you have scheduler pressure, not CPU-hot code. Step 3: run perf top -p <pid> for 30 seconds — a one-shot interactive view of where the CPU is spent right now, useful for catching obvious hot spots without the full record-and-flamegraph pipeline. Step 4: only after these, deploy or attach a continuous profiler. Each step costs less than the next and can short-circuit the entire decision.
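Steps 1 and 2 of the ladder fit in a dozen lines of Python. A minimal sketch; both /proc files are cumulative since boot, so for a live diagnosis sample them twice and diff, which is omitted here for brevity:

# cpu_or_wait.py - steps 1 and 2 of the ladder: is the latency on-CPU at all?
# Reads /proc/stat and /proc/<pid>/schedstat as described above; pass a PID as
# the first argument to get the per-process breakdown.
import sys

def system_idle_fraction() -> float:
    """Idle jiffies as a fraction of all jiffies, from the aggregate 'cpu' line."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return fields[3] / sum(fields)           # field 4 of the cpu line is idle

def runqueue_vs_runtime(pid: int) -> tuple[float, float]:
    """Seconds on CPU vs seconds waiting on the runqueue, from schedstat."""
    with open(f"/proc/{pid}/schedstat") as f:
        on_cpu_ns, wait_ns, _ = (int(x) for x in f.read().split())
    return on_cpu_ns / 1e9, wait_ns / 1e9

print(f"system idle fraction since boot: {system_idle_fraction():.1%}")
if len(sys.argv) > 1:
    run, wait = runqueue_vs_runtime(int(sys.argv[1]))
    verdict = "scheduler pressure" if wait > run else "CPU-bound or idle"
    print(f"pid {sys.argv[1]}: {run:.1f}s on CPU, {wait:.1f}s on runqueue ({verdict})")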

Where this leads next

Part 9 (chapters 55–61) walks the continuous-profiling stack end-to-end: /wiki/what-it-is-what-it-isnt for the discipline's scope, /wiki/pyroscope-and-parca-architectures for the two leading open-source designs, /wiki/google-wide-profiling-paper for the foundational design document, /wiki/cpu-heap-lock-profiles-in-prod for the three profile types every team needs, /wiki/differential-profiling for version-to-version comparison, and /wiki/profile-storage-and-query-patterns for the dedup-and-query infrastructure that makes 30-day fleet-wide retention affordable.

After Part 9 the curriculum returns to the main observability stack — dashboards (Part 10), SLOs (Part 11), alerting (Part 12) — and continuous-profile data flows into those parts as a first-class signal alongside metrics, logs, and traces. The fourth pillar lands.

References