Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Continuous profiling in production

At 02:14 IST a PaisaBridge payments-API alert fires: p99 = 480 ms, threshold 220 ms, traffic flat at 12k req/s. The on-call engineer Aditi opens the runbook. Step 1: "SSH to a hot pod, run perf record -F 99 -p $(pidof gunicorn) -- sleep 60, scp the data, generate a flamegraph." She does. By the time the SVG renders, three minutes have passed and the latency has self-corrected. The flamegraph she has shows a healthy steady state because she captured it after the spike. She has nothing to compare against and no profile from the moment the alert fired. The incident closes as "transient, no action". Two days later it happens again. And again. The pattern repeats for three weeks before someone realises that the only way to debug a 90-second tail-latency spike is to already have been profiling when it happened. That is what continuous profiling means: a low-overhead sampler running on every production host, every minute of every day, with profiles indexed by time and tagged with deploy SHA, so that "what was the CPU doing at 02:14 yesterday?" is a query, not a forensic re-enactment.

Continuous profiling runs a sampling profiler (py-spy, async-profiler, Parca, Pyroscope, Datadog) on every production host at low frequency — typically 19–99 Hz — and ships compressed pprof files to a time-series-indexed object store. The query model is "give me a flamegraph for service X between 02:13 and 02:14 IST yesterday, on the v4271 deploy"; the diff model is "show me what changed between v4271 and v4272". Overhead stays under 1% CPU when frequency and stack-unwinding are tuned correctly, and the diagnostic value comes from already having the profile before the incident, not running the profiler after it.

Why on-demand profiling is too slow for incident response

The traditional profiling workflow — perf record after the alert fires, copy the file, run perf report or generate a flamegraph — has three failure modes that cumulatively make it useless for short-lived incidents.

The first is the time-to-capture is longer than most incidents. PaisaBridge's median incident-acknowledged-to-mitigated time is 4 minutes for a self-correcting tail-latency spike. SSH-ing to a pod, finding the right PID (gunicorn forks 32 workers, the slow one is a moving target), starting perf record, waiting 60 s for samples to accumulate, generating the flamegraph — the workflow itself takes 5+ minutes when it works. The incident is over before the profile is ready. You end up profiling the post-incident steady state and learning nothing.

The second is the stack-unwinding cost of reactive profiling is visible at incident-time CPU pressure. perf record with --call-graph=dwarf on a hot Python process can add 5–8% CPU overhead. If the host is already under pressure (the reason p99 climbed in the first place), kicking on a 5% profiler can push the host past saturation and make the incident worse — which then "validates" the team's instinct to never run profilers in production, which then guarantees the next incident is also blind.

The third is you have nothing to compare against. A flamegraph in isolation is a list of where time went now. Without a baseline ("here is what this service looked like last Tuesday at 02:14 when it was healthy"), the engineer cannot tell whether 38% in _aligned_strided_loop is normal or anomalous. Diagnosis collapses into pattern-matching against the engineer's memory of past incidents — which is exactly the regime where senior engineers are valuable and junior engineers are stuck. Continuous profiling moves this from a memory exercise to a diff-the-flamegraphs exercise.

Why "always-on" beats "on-demand" mathematically: an incident lasts T seconds; capturing a useful profile requires C seconds of sampling (typically 30–60). On-demand profiling succeeds only when the engineer is alerted, decides to profile, and starts the capture before T-C seconds have elapsed. With T = 90s and C = 30s, the engineer has 60 seconds to acknowledge, log in, and run a command — in practice this works less than 30% of the time. Continuous profiling has T-C = T (profile is always running), success rate ~100%. The shift is from "profile when something is wrong" to "always have a profile, query the bad time window".

The mental shift the rest of this chapter is built around: a profile is time-series data, not a one-off artefact. The same way you keep CPU-utilisation metrics in Prometheus for 30 days, you keep flamegraphs for 30 days. The same way you alert on metric anomalies, you can diff-flamegraph profile anomalies. The same way you correlate metrics with deploys, you correlate profiles with deploys. The profile becomes a first-class observable.

The agent on each host samples at 19 Hz and uploads compressed pprof every 10 seconds. The store is keyed by (service, deploy SHA, host, timestamp), so the two main queries — "flamegraph for time window" and "diff between deploys" — both reduce to a range scan plus a stitch step. Storage cost: roughly 50–200 KB per host per minute, ~5–20 GB/day for a 200-host service.

How a low-overhead always-on profiler stays under 1% CPU

The single biggest objection to running a profiler in production is overhead. Continuous profilers in 2026 sit comfortably under 1% CPU when configured correctly; the cost model is well understood and the dial that matters is sample frequency.

The math is straightforward. A sample requires three things: an interrupt to stop the target thread, a stack walk to record the call chain, and a write of the resulting frame into a buffer. On a typical x86 server the interrupt costs ~1 µs, a frame-pointer-based stack walk on a 12-deep stack costs ~2 µs, and a buffer write is ~0.2 µs. Total per-sample cost: ~3.5 µs. At 99 Hz that is 99 × 3.5 = 347 µs/sec of overhead — 0.035% of one CPU. At 999 Hz it climbs to 0.35%. At 9999 Hz it crosses 3.5% and starts mattering. For continuous use, 19 Hz to 99 Hz is the sweet spot: enough samples to find anything that occupies more than ~5% of CPU within 5 minutes of capture, low enough that overhead is invisible.

The cost explodes when stack-unwinding goes wrong. DWARF unwinding (used by perf when frame pointers are absent) costs ~30–50 µs per sample — an order of magnitude worse than frame-pointer unwinding. A Python process built without frame pointers and profiled with perf record --call-graph=dwarf at 99 Hz will burn 3–5% CPU just on unwinding. The fix is to either build with frame pointers (-fno-omit-frame-pointer, the default in modern Linux distributions since the Ubuntu 24.04 default change in 2024) or use a profiler that has its own walker — py-spy for Python (walks the CPython interpreter frame chain directly), async-profiler for JVM (uses AsyncGetCallTrace to avoid safepoint bias), parca-agent and bpf-perf for system-wide eBPF-based unwinding.

Why frame pointers matter so much for continuous profiling: a frame-pointer-based stack walk is just while (rbp != 0): rip = *(rbp+8); rbp = *rbp — one cache line per frame, ~10 ns per frame on a modern CPU. A DWARF-based walk has to read .eh_frame, decode CFI instructions, and replay them — easily 30× slower. For a one-off perf record the difference does not matter; for a profiler running 24/7 on every host, the 30× difference is the difference between "0.4% CPU" and "12% CPU", which is the difference between "ship it" and "kill the project".

The other knob is wall-clock vs CPU-time sampling. CPU-time profilers (the default for most tools) only sample threads that are currently on-CPU; off-CPU work — blocking on locks, blocking on disk I/O, blocking on network — is invisible. Wall-clock profilers (py-spy's --idle, async-profiler's wall mode) sample every thread regardless of state, including those blocked. For a service that spends 80% of wall time blocked on Redis or Postgres, a CPU profile shows you 20% of the picture. Run both — CPU profile to find the hot computation, wall profile to find the lock contention. The off-CPU side has a dedicated chapter (/wiki/off-cpu-flamegraphs-the-other-half); continuous profiling captures both modes side-by-side.

# pyroscope_agent.py — a minimal continuous profiling agent.
# Wraps py-spy, captures a 10-second pprof sample, uploads to an HTTP ingestion
# endpoint with service/deploy/host tags, sleeps, repeats. This is the same
# shape as the Pyroscope, Parca, and Datadog agents — the actual implementations
# add retries, backoff, and a local disk spool, but the loop is this.

import gzip
import os
import socket
import subprocess
import sys
import time
import urllib.request
from pathlib import Path

# Realistic Indian-context config: a PaisaBridge payments-API pod
SERVICE = os.getenv("PROFILE_SERVICE", "payments-api")
DEPLOY_SHA = os.getenv("DEPLOY_SHA", "v4271-7c3a9d2")
REGION = os.getenv("AWS_REGION", "ap-south-1")
INGEST_URL = os.getenv("PYROSCOPE_INGEST", "http://pyroscope.internal:4040/ingest")
SAMPLE_HZ = 19           # 19 Hz: ~0.4% CPU on Skylake, plenty of samples
WINDOW_SEC = 10          # 10s windows = 6 uploads/min/host
PID = int(sys.argv[1])   # the gunicorn worker pid

def capture_one_window(pid: int, hz: int, secs: int, out: Path) -> None:
    """Run py-spy for `secs` and write a folded-stack file."""
    cmd = ["py-spy", "record",
           "-p", str(pid),
           "-r", str(hz),
           "-d", str(secs),
           "-f", "speedscope",   # JSON format, easy to ship as gzip
           "-o", str(out),
           "--idle"]             # include off-CPU samples (wall-clock mode)
    subprocess.run(cmd, check=True, capture_output=True)

def upload(payload: Path, tags: dict) -> None:
    """POST gzipped speedscope JSON to the ingestion endpoint."""
    body = gzip.compress(payload.read_bytes())
    qs = "&".join(f"{k}={v}" for k, v in tags.items())
    req = urllib.request.Request(
        f"{INGEST_URL}?{qs}",
        data=body,
        headers={"Content-Encoding": "gzip",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as r:
        if r.status >= 300:
            raise RuntimeError(f"ingest {r.status}: {r.read()[:200]}")

def main() -> None:
    host = socket.gethostname()
    spool = Path("/var/spool/profiler")
    spool.mkdir(exist_ok=True)
    while True:
        ts = int(time.time())
        out = spool / f"{ts}.json"
        try:
            capture_one_window(PID, SAMPLE_HZ, WINDOW_SEC, out)
            tags = {"service": SERVICE, "deploy": DEPLOY_SHA,
                    "host": host, "region": REGION,
                    "from": ts, "until": ts + WINDOW_SEC}
            upload(out, tags)
            sz = out.stat().st_size
            print(f"[{ts}] uploaded {sz/1024:.1f} KB pprof "
                  f"(svc={SERVICE} deploy={DEPLOY_SHA} host={host})")
        except subprocess.CalledProcessError as e:
            print(f"[{ts}] py-spy failed: {e}", file=sys.stderr)
        except Exception as e:
            print(f"[{ts}] upload failed (will retry): {e}", file=sys.stderr)
        finally:
            out.unlink(missing_ok=True)

if __name__ == "__main__":
    main()

# Sample run, 60 seconds on a payments-api pod (PaisaBridge-style stack):
$ DEPLOY_SHA=v4271-7c3a9d2 python3 pyroscope_agent.py 28471

[1714003200] uploaded 38.4 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)
[1714003210] uploaded 41.2 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)
[1714003220] uploaded 39.7 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)
[1714003230] uploaded 40.1 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)
[1714003240] uploaded 38.9 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)
[1714003250] uploaded 42.0 KB pprof (svc=payments-api deploy=v4271-7c3a9d2 host=pay-pod-7c3)

# Overhead measured by `perf stat` against the gunicorn worker:
# without agent:  3.42 cycles/req  (16.4 GHz·s of CPU per second of req traffic)
# with agent:     3.43 cycles/req  (16.5 GHz·s) — 0.4% increase, within noise

Walk-through. -r 19 is the sample frequency in Hz: 19 samples per second per thread. This is the dial that controls overhead, and 19 Hz is the floor where you still see anything occupying more than ~5% of CPU within a 10-second window (you need ~10–20 samples on a frame for it to be visible above noise). -d 10 is the window length: capture 10 seconds, upload, capture again. Long enough to catch anything periodic shorter than 10 s; short enough that the time-series resolution is good. --idle tells py-spy to include threads that are blocked (sleeping, waiting on a lock, waiting on a socket); without this you only see CPU-bound work and miss the entire off-CPU side. -f speedscope is a JSON format that compresses well with gzip (~5× ratio on a typical Python flamegraph). tags = {...} is what makes the system queryable: every upload carries the service name, the deploy SHA, the host, the region, and the time range. The ingestion side indexes by all of these, so "flamegraph for payments-api on v4271 between 02:13 and 02:14 IST in ap-south-1" is a one-line query, not a forensic exercise. The 0.4% measured overhead is below most teams' tolerance threshold; this is what makes always-on viable.

The agent loop pattern — capture short windows, tag richly, ship gzipped — is the universal continuous-profiling shape. Pyroscope, Parca, Datadog Continuous Profiler, Google Cloud Profiler, and the Polar Signals agent all do exactly this; the differences are the storage backend (ClickHouse, FrostDB, custom columnar store) and the unwinder (py-spy for Python, async-profiler for JVM, eBPF-based for system-wide).

Querying profiles by time, deploy, and tag — the diff is the diagnostic

Once profiles are stored as time-series data, the diagnostic workflow shifts from "generate a flamegraph" to "ask questions of a flamegraph store". Three query shapes dominate.

Time-window query. "Show me a flamegraph for payments-api between 02:13:00 and 02:14:00 IST yesterday." Internally: range-scan all profiles in the window, decompress, fold-stacks-merge, render. The merge is just summing sample counts per unique stack — pprof files compose linearly. A 60-second window across 200 hosts is 200 × 6 = 1,200 pprof files, ~50 MB total compressed; a modern column store reads and merges this in under 2 seconds. The result is a flamegraph that represents 1,200 host-seconds of execution, which is enough samples to see anything above 0.1% of CPU.

Diff query. "Show me what changed in payments-api between deploy v4271 and v4272." Internally: range-scan profiles tagged deploy=v4271, range-scan profiles tagged deploy=v4272, compute (samples_v4272[stack] / total_v4272) - (samples_v4271[stack] / total_v4271) per stack, render with red for "more time spent here in v4272" and blue for "less". A diff flamegraph (/wiki/differential-flamegraphs) where one bar lights up red is a one-glance regression diagnosis. The BharatBazaar catalogue team's 2024 Mega Bargain Days p99 regression on the search service was caught this way: the diff flamegraph between v17.4.2 and v17.4.3 lit up a single function, _normalize_query_string, as a 14% red bar — a regex compile that had moved from module-load to per-request after a refactor. Total diagnosis time from alert to commit: 9 minutes. Without the diff, this kind of regression takes hours.

Anomaly query. "Show me hosts in payments-api whose flamegraph diverges from the fleet median right now." Internally: cluster the per-host flamegraphs by stack-distribution, flag the outliers. A single host with pthread_cond_wait consuming 60% of wall time when the fleet median is 8% is the host with a deadlock or a slow downstream — the kind of "one-pod-is-weird" issue that Kubernetes-level metrics often miss because they aggregate. Anomaly queries are the newest of the three (Pyroscope added them in 2024, Datadog earlier) and require more compute than time-window or diff, but they catch a class of incidents that nothing else does.

The diff is the diagnostic. A flat colour-coded view of "what got slower and what got faster between two deploys" is the kind of question that takes hours of grep through git log, change manifests, and on-call notes — and minutes when you have continuous profiles on both sides of the deploy. The BharatBazaar 2024 catalogue regression was diagnosed from a graph with this exact shape.

A subtle property of the diff query is that it is asymmetric in time direction. Diffing v4272 against v4271 shows what got worse (red) and what got better (blue) in the new deploy. Diffing v4271 against v4272 shows the same data with the colours flipped. Most engineers default to "new minus old" because that matches the reading order ("what did we just deploy?"); a few systems default to "old minus new" because they want regressions to render in red regardless of direction. Pyroscope and Parca went with new-minus-old; Datadog goes with old-minus-new. Pick a direction, document it in the runbook, and never let the team mix them up — a flipped diff during a 2 AM incident is exactly the kind of cognitive trap that ends with the wrong fix.

The SetuStream streaming-router team built a query layer in 2024 that does one further trick: profile-correlated alerts. Their alert manager, when a p99 metric breaches its SLO, automatically queries the profile store for "flamegraph diff between now and 30 minutes ago", attaches the resulting SVG to the PagerDuty incident, and includes the top three changed stacks in the alert body. The on-call engineer's first view of the incident already includes the candidate root cause. This integration moved their median time-to-mitigate on tail-latency incidents from 18 minutes to 4 minutes — entirely because the engineer no longer has to do the profile-capture step manually.

Storage, retention, and the cost model

The cost question is the thing that decides whether continuous profiling ships or stays in a vendor's slide deck. The numbers are friendly once you compute them honestly.

A single pprof sample for a Python service running 32 worker threads, captured over 10 seconds at 19 Hz, contains about 32 × 19 × 10 = 6,080 stack samples. Compressed pprof representation is ~50–60 KB. Six uploads per minute per host = ~300 KB/min/host = ~430 MB/host/day. For a 200-host service, that is ~85 GB/day of profile data.

Retention policy controls the total storage. A common setup at Indian fintech and ecommerce companies in 2026:

Raw, per-host profiles: 7 days. Enough for week-over-week comparison and most incident postmortems. ~600 GB for a 200-host service.
Per-service merged profiles, 1-minute resolution: 30 days. Enough for monthly capacity reviews and deploy-comparison work. ~100 GB.
Per-service merged profiles, 1-hour resolution: 365 days. Used for "is the service slowly degrading?" and architectural retrospectives. ~10 GB.

S3 Standard storage at ap-south-1 prices is approximately ₹2 per GB-month. Total storage cost: roughly ₹1,500/month for a 200-host service — well below 1% of the compute spend the service is generating. The compute cost of running the agents is 200 hosts × 0.4% CPU × ₹3/CPU-hour × 720 hours/month = ₹1,728/month. Total system cost ~₹3,200/month for a 200-host service. PaisaBridge's published 2025 numbers: their continuous profiling stack covers ~3,500 hosts across all services and costs ~₹65,000/month, which is the salary of one engineer's morning coffee fund — for a system that closes p99 incidents in minutes instead of hours.

The cost-vs-overhead trade-off has a Pareto front. Higher sample frequency = larger pprof files = more storage = more diagnostic resolution. Lower sample frequency = smaller files = less storage = less resolution. For Python services, 19 Hz is the right default; for JVM with async-profiler the default of 99 Hz is fine because the unwinder is so efficient that even 999 Hz costs under 1%; for system-wide eBPF unwinding (Parca-style), 99 Hz is again the default. The dial is per-language because the unwinder cost is per-language.

Why per-host raw profiles matter even when you have merged ones: a service-merged flamegraph hides the case where one pod is weird. A single host with pthread_cond_wait at 60% wall time will be dwarfed in the merged view (1/200 of the data) but is glaringly obvious in the per-host view. Anomaly detection runs against per-host profiles for exactly this reason. Drop per-host retention below 24 hours and you lose the ability to debug "one pod was sad" incidents — which are roughly 30% of all production incidents.

The retention policy also has a deploy-correlation requirement. If your deploy cadence is multiple-times-per-day and your raw retention is 7 days, you have ~50–100 deploys' worth of data — enough to diff any two recent deploys. If you drop retention to 24 hours, you can only diff against the immediately previous deploy, which fails the moment a regression hits multiple deploys later (a slow leak that nobody noticed until it crossed the SLO). The ParakhTrade Kite team's standard since 2024 is "raw retention >= 2× the longest plausible regression-discovery window", which lands them at 14 days for the order-matching service and 7 days for everything else.

Compression algorithm choice matters more than most teams expect. The pprof wire format gzip-compresses at ~5× ratio; zstd at level 3 hits ~7×; zstd at level 9 hits ~9× but takes 30× the CPU. The Polar Signals team's 2024 measurement: switching their ingestion path from gzip to zstd-3 saved 40% of object-store cost and added 8% to ingestion CPU — a clear win at any scale above ~100 hosts. Most modern continuous-profiling systems default to zstd; if yours still uses gzip, switching is an afternoon of work for a one-time multi-month payback.

Deploying the agent — sidecar, daemonset, or system-wide

The agent has to actually run on every host, and the deploy model decides who owns it. Three patterns dominate, each with a clear cost/coverage trade-off.

Sidecar in every pod. A py-spy (or async-profiler, or language-specific) container runs alongside the main service container in the same pod, sharing the PID namespace so it can attach to the worker process. This is the model Pyroscope's official Helm chart and most language-specific agents (Datadog's dd-trace-py, the Pyroscope SDK) use. Pros: agent version pinned per-service, language-aware unwinding, easy rollout. Cons: doubles the pod count, every team has to opt in, multi-tenant clusters end up with hundreds of agent images. PaisaBridge's payments stack uses this model because each service team owns their profiling configuration.

Kubernetes DaemonSet. A single agent runs once per node and profiles every process on that node, including across namespaces. This is the Parca-agent and Polar Signals model, and it works because the unwinder is eBPF-based and language-agnostic. Pros: one deploy covers the whole cluster, zero per-service opt-in, no pod-count inflation. Cons: requires kernel >= 5.4 with BTF, requires CAP_SYS_ADMIN (a security review item), language-specific unwinders are second-class (Python and JVM stacks may show as raw addresses without the right helpers). SetuStream's streaming tier uses this model for every host outside the regulated payment path.

System-wide systemd service. For VM-based fleets (the ParakhTrade Kite tier, BharatRail's regulated stack, anything outside Kubernetes), a systemd unit runs the agent as a long-lived process. Pros: works on bare metal, fits existing config-management pipelines (Ansible, Salt, Chef). Cons: process-discovery is harder (no Kubernetes labels to query), restart hygiene depends on systemd's restart policy. The BharatRail Tatkal-hour fleet runs this way because the underlying hosts are not Kubernetes nodes.

The deploy-model decision usually maps to "what does the rest of your infrastructure look like" rather than to anything intrinsic to profiling. Pick the one that matches your existing operational posture; do not force a Kubernetes DaemonSet onto a VM fleet just because the cool kids use one.

A small but operationally important detail: tag every profile with the host's kernel version and CPU model (read once at agent start from /proc/version and /proc/cpuinfo), plus the container runtime version when running under Kubernetes. When a regression query returns "the new deploy is 14% slower on hosts A, B, C but normal on D, E, F", the answer is often "A/B/C are on Skylake, D/E/F are on Ice Lake, and the regression is a cache-line layout issue that only Skylake notices". Without the kernel/CPU tags, that diagnosis takes hours of correlation work; with them, the query layer surfaces it directly.

The cost of carrying these tags is a few dozen bytes per profile — irrelevant compared to the diagnostic value when the heterogeneous-fleet edge case eventually fires. Most ingestion paths support arbitrary tag dictionaries, so adding them is a one-line change in the agent.

Common confusions

"Continuous profiling is just perf record on a cron." No. The differences are sub-1% overhead (cron-based perf record typically runs at 99 Hz with DWARF, costing 5–8%), tagged time-series storage (cron writes files, continuous profiling writes to a queryable store), and language-specific unwinders (cron-perf cannot walk a Python or JVM stack reliably; py-spy and async-profiler can). The query model is the actual product, not the sampling.
"Profile data is too large to store at fleet scale." ~430 MB/host/day at 19 Hz with default settings. For a 1,000-host service, ~430 GB/day raw, 30 GB/day after merge to 1-minute resolution. ClickHouse compresses pprof bodies further. The realistic monthly bill is ₹5,000–₹50,000 for any service under 5,000 hosts. Storage is not the blocker.
"Sampling at 99 Hz misses sub-100ms events." True at the per-host level, false at the fleet level. With 200 hosts × 99 Hz = 19,800 samples/sec across the fleet, an event lasting 10 ms that runs once per second per host produces ~200 samples in a 10-second window — enough to be visible. Nyquist applies per signal source; aggregation across hosts effectively raises the sample rate.
"Wall-clock and CPU-time profiles are the same with extra overhead." They are different views. CPU-time shows where compute is spent; wall-clock shows where wall-time is spent (including blocked threads). A service that is 80% blocked on Postgres has a CPU profile dominated by the 20% that is on-CPU, and a wall profile dominated by the Postgres wait. Both are useful. Run both. They answer different questions.
"Continuous profiling replaces metrics." No. Metrics tell you that p99 climbed; profiles tell you what changed inside the process. They are complementary. The integration point is profile-correlated alerts: metric breach triggers profile query, profile diff lands in the incident.
"Frame pointers cost performance, so omitting them is a perf win." A measurable but small loss (≤1% in most workloads, per the Ubuntu 2024 frame-pointer-by-default analysis), traded for the ability to profile at all. The PaisaBridge and ParakhTrade standard since 2024 is "frame pointers on, always" precisely because the profiling-time gain (10–30× cheaper unwinding) dwarfs the runtime cost.

Going deeper

What pprof actually stores, and why it composes linearly

A pprof file is a Protocol Buffer message containing a list of Sample records, each holding a stack-trace (as a list of Location IDs into a deduplicated Location table) and a value vector (samples, cpu-time, alloc-bytes, etc.). The Function and Mapping tables are also deduplicated; a stack-trace is just a list of integers. The format is designed so that merging two pprof files is just concatenating their Sample lists and unioning their Location/Function tables — there is no recomputation, no reanalysis, just structural composition. This is what makes "merge 1,200 host-seconds of profiles for a query" cheap. A naive flamegraph format (folded-stacks text) does not have this property; you would have to re-parse and re-deduplicate every merge. Picking pprof as the wire format is the single most important storage decision in the system.

Stack-unwinding strategies and their trade-offs

Three families of stack walkers exist, with very different cost profiles. Frame-pointer walking chases rbp up the stack — fast (~10 ns/frame), correct only when frame pointers are present. DWARF-based unwinding reads .eh_frame tables — slow (~300 ns/frame), correct when frame pointers are missing. Compact-unwind tables (a Linux 6.x kernel feature, the BPF-friendly variant of DWARF) — fast (~30 ns/frame), correct without frame pointers, but require a one-time table-build step. Continuous profilers in 2026 use a hybrid: frame pointers when present, compact-unwind tables when not. Parca-agent, Polar Signals' tooling, and Datadog's BPF profiler all converged on this design.

Profile-guided optimisation as a downstream user

A second use of continuous profile data, beyond incident response, is to feed profile-guided optimisation (PGO) of the service itself. PGO compiles a binary with hint data about which branches are taken and which functions are hot. The Go 1.21 PGO support consumes pprof files directly — exactly the format the continuous profiler produces. The BharatBazaar search team in 2024 set up a pipeline that pulls the previous week's merged pprof, feeds it into the next build's go build -pgo=, and ships the PGO'd binary. Result: 5–8% throughput improvement per release on the catalogue search hot path. The continuous profile becomes both a diagnostic tool and an optimisation input — the data is so easy to keep that the second use was nearly free once the first use was deployed.

Reproduce this on your laptop

# Reproduce a tiny continuous profiler against a local Python service
sudo apt install build-essential
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy gunicorn

# Start a victim service
echo 'def app(env, sr): sr("200 OK", []); return [b"ok"]' > app.py
gunicorn -w 4 app:app -b :8000 &
SVC_PID=$!

# Run the agent in a second terminal
DEPLOY_SHA=local-test python3 pyroscope_agent.py $SVC_PID

# Generate some load and watch the per-window pprof uploads
ab -n 100000 -c 50 http://127.0.0.1:8000/

Memory profiling, allocation profiling, and the multi-signal future

CPU and wall profiles are the obvious starting point, but the same continuous-collection infrastructure can carry every other profile type. Heap profiles (where bytes are allocated and which stacks own them) are emitted by Go's runtime/pprof, by Java's async-profiler --alloc, by tracemalloc in Python. Lock contention profiles show which sync.Mutex or pthread_mutex_t calls are blocking the most threads. Goroutine / thread profiles show how many threads are alive in each state (running, runnable, blocked-on-net, blocked-on-syscall) at each sample. The Pyroscope and Datadog ingestion paths accept all of these as the same pprof format — only the value-vector dimension differs. A team that has shipped CPU continuous profiling in 2024 typically adds heap and lock profiles in the next quarter, almost for free, because the agent and storage layer already exist.

The PaisaBridge performance team's 2025 dashboard view stacks four profiles per service in one timeline: CPU, wall, heap-alloc, lock-contention. The pattern they catch repeatedly: a memory-leak issue shows as a slow climb in heap-alloc but not CPU, and a lock-contention issue shows in wall and lock-contention but not heap. Looking at all four side-by-side is what makes the diagnosis fast.

When continuous profiling is the wrong answer

Three regimes where you should not deploy continuous profiling. Sub-millisecond critical paths (HFT order-matching, NIC-driven packet processing) cannot tolerate even 19 Hz of sampling overhead — 0.4% of a 50-microsecond budget is 200 ns, which is half a cache miss. Use offline benchmarking with perf record runs in dev, ship without profiling. Air-gapped or sovereign-cloud environments where the profile store cannot exist outside the airgap; running an in-cluster Pyroscope works but the operational cost is high if there is no engineer to babysit it. Services that ship a profile-redacting requirement (financial regulatory, PII-bearing stack traces) — function names like _decrypt_pan_aes256 or _parse_aadhaar_number in a flamegraph are themselves sensitive and may be regulated. The Aadhaar / UIDAI auth pipeline runs profiling internally but never ships flamegraphs to a vendor; the trade-off is a custom in-cluster store, an extra team to maintain it, and slower iteration.

Where this leads next

Continuous profiling is the operational form of every profiling technique in Part 5. The chapters that came before — sampling vs instrumentation (/wiki/sampling-vs-instrumentation), perf from scratch (/wiki/perf-from-scratch), flame graphs (/wiki/flamegraphs-reading-them-and-making-them), differential flame graphs (/wiki/differential-flamegraphs), off-CPU flame graphs (/wiki/off-cpu-flamegraphs-the-other-half), hardware event sampling (/wiki/hardware-event-sampling-pebs-ibs) — all become production capabilities once you accept that profiling runs always, not just during incidents. The mental shift is from "profiling is an investigation" to "profiling is a stream".

Two threads run forward. The first leads into Part 6, eBPF (/wiki/ebpf-the-kernel-as-an-observable-program), where the same continuous-data philosophy extends from CPU profiles to syscall counts, scheduler events, network drops, and disk-queue depths. Continuous profiling is the proof-of-concept that low-overhead always-on observation works; eBPF is the generalisation. The second leads into Part 15, production debugging (/wiki/perf-in-production-without-melting-the-host), where the stored profile data becomes the input to incident response — the on-call engineer no longer captures profiles, they query them.

The organisational practice that makes continuous profiling worth its operational cost is to make the profile a default attachment on every PagerDuty incident. Once the on-call engineer's first action on receiving an alert is to look at a profile diff, the team's collective skill at reading flamegraphs becomes weekly practice instead of yearly heroics. The 19-year-old who joined in March learns to read a flamegraph in week one, not year three. The senior engineers stop being the only people who can debug production. That cultural shift is the real product; the technology is just the enabler.

References

Brendan Gregg, "Continuous Profiling" (USENIX SREcon 2017) — the foundational talk arguing for always-on profiling, with the original overhead measurements.
Querion, "Querion-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers" (IEEE Micro 2010) — the original GWP paper, which is the design ancestor of every modern continuous profiler.
Pyroscope (now Grafana Pyroscope) documentation — the open-source reference implementation; the agent / ingestion / query design described here mirrors theirs.
Polar Signals, "Parca: Open-source Continuous Profiling" — the eBPF-based system-wide variant; the source is the cleanest reference for compact-unwind table generation.
Datadog Continuous Profiler documentation — the commercial reference, well documented on overhead measurement methodology.
Ubuntu 24.04 frame-pointer-by-default analysis (2024) — the empirical study of frame-pointer overhead that drove the 2024 Linux distro shift.
/wiki/flamegraphs-reading-them-and-making-them — the prerequisite chapter on how to read what a continuous profiler stores.
/wiki/differential-flamegraphs — the diff view that makes deploy-to-deploy regressions visible at a glance.
Felix Geisendörfer, "Profile-Guided Optimization with Go" (Go Blog, 2023) — the canonical write-up of feeding continuous-profile pprof files into the Go compiler's PGO mode, the second-use pattern that makes the storage cost pay for itself.
Brendan Gregg, "Production Profiling with eBPF and Continuous Profiling" (USENIX SREcon 2023) — updated 2023 talk covering eBPF-based system-wide unwinders and the operational lessons from running Streamora-scale continuous profiling.