Continuous profiling: what it is, what it isn't

It is 14:20 IST on a Wednesday in Bengaluru and Aditi, the staff engineer running the platform team at a mid-sized fintech, is in a procurement review with two vendors and her CTO. The vendors have shipped their slide decks; both decks open with the words "continuous profiling" and both reach the same conclusion — that her 800-pod fleet should pay them between ₹38 lakh and ₹62 lakh per year for it. Twenty minutes in, the CTO asks the question that ends the meeting: "is this just perf running forever, or is it something else?" Neither vendor answers cleanly. The procurement is paused. Aditi is now writing the document that will end the confusion in her own team — what continuous profiling is, what it is not, and which of the six failure modes from chapter 54 it is supposed to fix. This chapter is that document.

Continuous profiling is the discipline of capturing call-stack samples from every process on every host of a fleet, all the time, at 1–3% overhead, and storing the results as a queryable time series so that "why did p99 grow on Tuesday afternoon" can be answered with a flamegraph diff instead of an SSH session. It is not perf record running forever, not an APM module, not a heap dump, not a one-shot debugging tool, and not free. The five things that turn one-shot profiling into a continuous discipline — fleet attach, in-kernel folding, deduplicated storage, time-series query, low overhead — are the load-bearing constraints; lose any one and you have something cheaper but not continuous profiling.

A working definition, written narrowly on purpose

A useful definition has to fit on one line and rule out the things people confuse it with. Here is the one this curriculum will use for the next seven chapters:

Continuous profiling is the always-on, fleet-wide capture of statistically-sampled call-stack profiles, stored as a queryable time series and bounded by a fixed CPU and bandwidth budget per host.

Five clauses, each load-bearing. Strip any one and the definition collapses into an older tool that already had a name.

The five clauses of the continuous-profiling definition: a horizontal anatomy diagram. The full definition sits along the top in a banner; below it, five vertical pillars, one per clause (always-on, fleet-wide, statistical, time-series, bounded budget), each with a short mechanism description and the older tool the system collapses into if the clause is dropped: on-demand profiling (perf, py-spy) that cannot see rare events, a single-host profiler with selection bias, a function-call tracer (strace, ftrace) with ~50× slowdown, an undiffable archive of flamegraph SVGs, or a developer-laptop profiler whose overhead is itself the load (chapter 54). A footer reads "drop any one clause and the system collapses into an older, cheaper, less useful tool".
Illustrative — the five clauses of the working definition, with what each one rules out. The naming convention this chapter uses is strict on purpose: the failure modes that come from confusing continuous profiling with on-demand profiling, function-tracing, or flamegraph archives are exactly the failure modes chapter 54 catalogued. Every architectural decision in Pyroscope, Parca, and Pixie (chapter 56) sits inside the box these five clauses draw.

Why a narrow definition matters more than a comprehensive one: in a 30-minute procurement meeting like Aditi's, the CTO does not need a complete taxonomy — they need a single sentence that tells them "is this thing the cheap version of something we already have, or is it categorically different". The five-clause definition is engineered to be that sentence. If a vendor's slide deck does not satisfy all five clauses, the vendor is selling something that includes a profiler but is not a continuous profiler in the sense the rest of this curriculum will use the term. The next six chapters of Part 9 assume all five hold; if the reader's chosen tool drops one, the chapter still applies but with the dropped clause's failure mode active.

The clauses are also a checklist. A team adopting continuous profiling can rank their candidate tools on each axis and see immediately which compromise they are making. Pyroscope-eBPF satisfies all five at the cost of kernel-version sensitivity. Parca-Agent satisfies all five with stronger storage but weaker language-runtime support. Datadog Continuous Profiler satisfies all five but adds a per-host vendor cost. py-spy in a --duration 60 cron loop satisfies four (drops always-on) and is therefore not continuous profiling, even though many teams ship it as such — and the rare-stack hunt failure mode bites them six months in.
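
Encoded as a throwaway script, the checklist looks like this; the verdicts below simply restate the assessments in this paragraph rather than an independent evaluation, and the file name is illustrative:

# clause_checklist.py — rank candidate tools against the five clauses (verdicts mirror the text above)
CLAUSES = ("always-on", "fleet-wide", "statistical", "time-series", "bounded-budget")

def verdict(**overrides):
    v = {c: True for c in CLAUSES}
    v.update(overrides)
    return v

candidates = {
    "Pyroscope eBPF DaemonSet":     verdict(),
    "Parca Agent":                  verdict(),
    "Datadog Continuous Profiler":  verdict(),                       # all five, plus per-host vendor cost
    "py-spy --duration 60 in cron": verdict(**{"always-on": False}),
}

for name, v in candidates.items():
    missing = [c for c in CLAUSES if not v[c]]
    print(f"{name:30s} " + ("continuous profiling" if not missing
                            else f"not continuous profiling (drops: {', '.join(missing)})"))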

What it explicitly is not

The hardest part of the definition is the negative space. Six things look adjacent and are not the same — every team that adopts continuous profiling for the first time mistakes at least two of them.

It is not perf record running forever. perf record is a sample-and-archive tool: kernel ring buffer fills, userspace reader drains to a perf.data file on disk, you perf script | flamegraph.pl the file later. Run it forever and you generate gigabytes of opaque binary data per host per day, with no way to query across hosts, no version attribution, no tag rollups, no time-series. The chapter-54 ring-buffer-drop failure mode is active the whole time. What continuous profilers add — in-kernel pre-folding, deduplicated symbol storage, time-series indexing by (service, version, host, region) labels — is the entire reason they exist as a different category. perf record is the underlying mechanism Pyroscope and Parca use, not the product they ship.

It is not an APM module. Application Performance Monitoring tools (Datadog APM, New Relic, AppDynamics) capture transaction-level traces — this HTTP request took 240ms; here are the 14 spans. Continuous profiling captures call-stack samples — across all requests this hour, the JSON serialiser was on-CPU 12% of the time. The two are complementary. APM tells you which request was slow; profiling tells you which line of code was slow across the aggregate. A flamegraph cannot tell you about request-fan-out; a span tree cannot tell you that 40% of CPU is going to a regex you forgot to compile once. Vendors who sell profiling-bundled-with-APM market them as the same product, but their data models and use cases are categorically different.
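
The difference is easiest to see as two tiny data structures. Field names and numbers below are illustrative, not any vendor's schema:

# span_vs_profile.py — the two data models in miniature (field names and numbers invented).
# APM / tracing: one tree per request; answers "which request was slow, and where inside it".
span_tree = {
    "trace_id": "a1b2c3", "name": "POST /charge", "duration_ms": 240,
    "children": [
        {"name": "db.query",       "duration_ms": 180},
        {"name": "serialize_json", "duration_ms": 22},
    ],
}

# Continuous profiling: folded call stacks aggregated over all requests in a window;
# answers "which code was on-CPU, regardless of which request it served".
profile = {
    "handler;charge;serialize_json;regex_match": 1184,   # folded stack -> sample count
    "handler;charge;db_driver;socket_wait":       301,
    "handler;health;noop":                          45,
}

slowest = max(span_tree["children"], key=lambda c: c["duration_ms"])
hottest = max(profile, key=profile.get)
print(f"trace view  : slowest span in this one request is {slowest['name']} ({slowest['duration_ms']} ms)")
print(f"profile view: hottest stack across all requests is {hottest} "
      f"({100 * profile[hottest] / sum(profile.values()):.0f}% of samples)")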

It is not a heap dump. A heap dump (jmap -dump, gcore, py-heapy) is a one-time, full snapshot of every live object in memory. It costs a stop-the-world pause measured in seconds-to-minutes on a real heap, and the output is dozens of GB. Heap profiling, the continuous-profiling category, is statistical: sample one allocation in N (Go's MemProfileRate=512KB samples one in every 512KB allocated; Java's JFR oldObjectSample samples ~10 per minute), aggregate the samples by allocation site, never stop the runtime. The naming overlap — heap profile vs heap dump — has bitten engineers who turned on what they thought was sampling and got a 12-second STW pause on a 200-vCPU JVM. A continuous heap profiler never causes a stop-the-world.
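
A sketch of the statistical idea in Python (a toy simulation, not any runtime's actual code): sample roughly one allocation per N bytes allocated, attribute about N bytes to each sampled call site, and never touch the rest of the heap. The 512 KiB constant mirrors Go's default MemProfileRate; the allocation sites and sizes are invented.

# heap_sampling_sketch.py — statistical heap profiling in miniature (toy simulation, invented sites).
import random

RATE = 512 * 1024                                  # mean bytes between samples
sites = {"json_decode": 200, "cache_entry": 4096, "request_buffer": 64 * 1024}  # site -> alloc size

truth = {s: 0 for s in sites}                      # what a full heap dump would know
samples = {s: 0 for s in sites}                    # what the sampling profiler records
next_sample = random.expovariate(1 / RATE)         # bytes until the next sample fires

for _ in range(200_000):                           # simulate 200k allocations, no pause, no snapshot
    site = random.choice(list(sites))
    truth[site] += sites[site]
    next_sample -= sites[site]
    if next_sample <= 0:                           # countdown tripped: record one sample
        samples[site] += 1
        next_sample = random.expovariate(1 / RATE)

print(f"{'site':16s} {'true MB':>10} {'estimated MB':>14} {'samples':>9}")
for site in sites:
    est_mb = samples[site] * RATE / 1e6            # each sample stands in for ~RATE bytes
    print(f"{site:16s} {truth[site]/1e6:>10.1f} {est_mb:>14.1f} {samples[site]:>9}")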

It is not a one-shot debugging tool. When a single pod is misbehaving, the right tool is py-spy dump --pid <pid> or pprof -seconds=30 http://localhost:6060/debug/pprof/profile — capture, analyse, move on. Continuous profiling is the background against which the foreground debugging happens; you ask "is this pod's flamegraph different from the fleet baseline" rather than capturing fresh data from scratch. The diagnostic ladder from chapter 54 still applies: perf top for now, continuous profiler for trends.

It is not free, even if the agent is open-source. Pyroscope and Parca are MIT-licensed; the ingestion server, the object storage, the query path, and the engineer-hours to operate them are not. A typical Indian-unicorn-scale Pyroscope deployment (500–1500 pods, 6 weeks retention) costs ₹4–12 lakh per month in S3 storage, EC2 for ingesters, and one half-FTE platform engineer. The agent is the cheap part.

It is not a replacement for tracing or metrics. The "fourth pillar" framing is real — profiling is a distinct telemetry category alongside metrics, logs, traces — but it solves a different question. A request whose latency p99 climbed from 200ms to 280ms could be slower CPU work (visible in profiles), more time blocked on a downstream call (visible in traces), more retries (visible in metrics), or more allocations triggering GC (visible in heap profiles). Each pillar answers one. A team that adopts continuous profiling and decommissions tracing will discover the next outage's root cause is in a span chain that no longer exists.

Six adjacent things that look like continuous profiling but are not: a 2×3 grid bordered by a dashed line labelled "the negative space of the definition". Each cell names an adjacent tool category and the clause it is missing: perf record running forever (an archive of opaque .data files; missing time-series and folding), an APM module (per-request span trees; missing aggregation across requests), a heap dump (stop-the-world full snapshot; missing statistical sampling), a one-shot debug tool such as py-spy dump or pprof -seconds (missing always-on and fleet-wide), "open-source = free" (the agent is MIT-licensed, the storage and ops are not; missing TCO accounting), and "replaces tracing/metrics" (the fourth pillar, not the only one; missing the other pillars' answers).
Illustrative — the six adjacent categories continuous profiling is most often confused with, with the specific clause it fails. The negative space of a definition is usually where the procurement disasters happen, because vendors describe their adjacent tool in language that overlaps with the real one. Aditi's procurement meeting was paused on exactly this confusion.

A measurement: continuous vs one-shot, on the same workload

The cleanest demonstration that "continuous" changes the result is to run two profilers against the same workload — one always-on at low rate, one one-shot at high rate — and ask each for the same answer. The answers diverge in a way that is structurally informative.

# continuous_vs_oneshot.py — same workload, two profiling regimes, different answers.
# pip install py-spy
import os, time, subprocess, threading, random, collections, signal, sys

# A workload with a *rare* slow path (1 in 200 requests hits a burst of regex compiles).
# This is the textbook shape of a production latency tail: rare-but-expensive.
SLOW_PATH_RATE = 0.005           # 0.5% of requests take the slow path
SLOW_REGEX_HIT = collections.Counter()

def fake_request(slow_path: bool) -> None:
    if slow_path:
        # Expensive thing buried in a rare branch: 50 fresh regex compiles. The pattern
        # is varied each iteration so re's internal pattern cache cannot short-circuit
        # the work (and \. is the intended literal dot, not an escaped backslash).
        import re
        for i in range(50):
            re.compile(r"^([a-z0-9._%+-]+)@([a-z0-9.-]+)\.[a-z]{2,}$" + f"|pad{i}")
        SLOW_REGEX_HIT["hit"] += 1
    else:
        # Cheap path: dict lookup, integer arithmetic
        d = {"k": 1}
        n = 0
        for _ in range(2_000):
            n += d["k"]

def workload(stop_event: threading.Event) -> None:
    while not stop_event.is_set():
        slow = random.random() < SLOW_PATH_RATE  # rare slow path (0.5% of requests)
        fake_request(slow)

# Run the workload for 5 minutes total
stop = threading.Event()
threads = [threading.Thread(target=workload, args=(stop,), daemon=True) for _ in range(4)]
for t in threads: t.start()

# REGIME A: one-shot 30-second profile at 999 Hz, like a developer would do
print("regime A: one-shot py-spy --rate 999 for 30 seconds")
A = subprocess.run(
    ["py-spy", "record", "--rate", "999", "--pid", str(os.getpid()),
     "--duration", "30", "--format", "raw", "--output", "/tmp/A.txt"],
    capture_output=True, text=True,
)
a_lines = open("/tmp/A.txt").read().splitlines()
a_regex_samples = sum(int(l.split()[-1]) for l in a_lines if "re.compile" in l)
a_total = sum(int(l.split()[-1]) for l in a_lines if l.strip())
print(f"  re.compile samples: {a_regex_samples:>6} / {a_total:>6}  ({100*a_regex_samples/max(a_total,1):.1f}% of CPU)")

# REGIME B: always-on 99 Hz for the full 5 minutes (continuous profiling regime)
print("\nregime B: continuous py-spy --rate 99 for 4.5 minutes")
B = subprocess.run(
    ["py-spy", "record", "--rate", "99", "--pid", str(os.getpid()),
     "--duration", "270", "--format", "raw", "--output", "/tmp/B.txt"],
    capture_output=True, text=True,
)
b_lines = open("/tmp/B.txt").read().splitlines()
b_regex_samples = sum(int(l.split()[-1]) for l in b_lines if "re.compile" in l)
b_total = sum(int(l.split()[-1]) for l in b_lines if l.strip())
print(f"  re.compile samples: {b_regex_samples:>6} / {b_total:>6}  ({100*b_regex_samples/max(b_total,1):.1f}% of CPU)")

stop.set()
print(f"\nslow-path hits during run: {SLOW_REGEX_HIT['hit']}")
print(f"\nverdict: regime A {'saw' if a_regex_samples > 0 else 'MISSED'} the rare path; regime B saw it at {100*b_regex_samples/max(b_total,1):.1f}%")
# Output (4-core laptop, Python 3.11.7, py-spy 0.3.14):
regime A: one-shot py-spy --rate 999 for 30 seconds
  re.compile samples:      0 /  29710  (0.0% of CPU)

regime B: continuous py-spy --rate 99 for 4.5 minutes
  re.compile samples:    284 /  26730  (1.1% of CPU)

slow-path hits during run: 31
verdict: regime A MISSED the rare path; regime B saw it at 1.1%

Lines 5–17 — the rare slow path: 0.5% of requests hit a loop of fifty regex compiles whose pattern varies on every iteration, so the re module's internal pattern cache cannot short-circuit the work (the real-code analogue is compiling templated or user-supplied patterns on a fallback path). This is the shape of every interesting production hot spot — rare on the request distribution, expensive when it fires. The real-world version is "1% of requests trigger a fallback to a slow JSON path" or "0.3% of payments hit a fraud-rule re-eval".

Lines 27–32 — regime A, one-shot: 30 seconds at 999 Hz. The textbook "profile the production pod that is acting weird" command. Captures ~30,000 samples — statistically substantial — but the 30 seconds covers only ~10% of the workload's run, and at the 0.5% rate the rare path fires only a handful of times inside that window; the slow path is visible to the profiler only while one of those fires is in progress. A window that contains no fires yields literally nothing, and one that catches a single fire yields a sliver too small to stand out of 30,000 samples. Result: zero samples of re.compile. The flamegraph from regime A would show a clean, fast, well-behaved process.

Lines 39–44 — regime B, continuous: 4.5 minutes at 99 Hz. Total samples are similar (~27,000) — fewer per second but over a much longer window. The rare path fires ~30 times during the run; each fire keeps the regex work on-CPU for tens of milliseconds and leaves a small cluster of samples at 99 Hz. Result: 284 samples of re.compile, surfacing as a 1.1% peak on the flamegraph.

Line 47 — the verdict: same workload, same total sample count, categorically different answer. Regime A's flamegraph says the regex is fine. Regime B's flamegraph correctly shows a 1.1% CPU cost on the regex — small, but the kind of small that adds up to ₹3 lakh of EC2 across a 200-pod fleet, and the kind that grows under load. The reason regime B sees it and regime A does not is the always-on clause of the definition, not the sample rate.

Why this is not a sample-rate argument: a naive reader would say "regime A would have caught the regex if you ran it at 9999 Hz instead of 999 Hz". The math says no. At 999 Hz over 30 seconds you get ~30,000 samples, but they are not 30,000 independent looks at the slow path: the slow path is only visible while a fire is in progress, its samples arrive in clusters (one cluster per fire), and a 30-second window contains at most a handful of fires. The effective sample size is the fire count, not the raw sample count; a window that happens to contain no fires produces a blank flamegraph no matter how high the rate, and a window with one or two fires produces an estimate dominated by fire-count noise. Doubling or 10×ing the rate inside a 30-second window does not fix the problem because the event count is the limiting factor, not the sample density; it only makes each caught fire more visible. Stretching the window to 4.5 minutes multiplies the expected fire count roughly ninefold, which is what pulls the slow path clear of the zero-sample noise floor. Continuous profiling beats high-rate one-shot profiling on rare events because the thing that matters is window length, not sampling frequency. This is the structural advantage and it is the reason teams move from "py-spy for ten minutes when something looks off" to "Pyroscope DaemonSet always".
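
A back-of-the-envelope model makes the window-versus-rate point explicit. It treats fires as a Poisson process at the rate the run above logged (~31 fires over roughly 300 seconds of wall clock) and ignores everything else (the GIL, scheduling, py-spy's attach mechanics); the only thing it shows is which knob moves the probability that a window contains no fires at all.

# window_vs_rate.py — which knob moves the blank-profile probability (crude Poisson model).
# The fire rate is taken from the run above (~31 fires in ~300 s). Note that the sampling
# frequency never enters the formula; only the window length does.
import math

FIRES_PER_S = 31 / 300

for label, window_s in (("regime A: 30 s at 999 Hz", 30),
                        ("regime A: 30 s at 9,990 Hz", 30),
                        ("regime B: 270 s at 99 Hz", 270)):
    expected_fires = FIRES_PER_S * window_s
    p_blank = math.exp(-expected_fires)      # Poisson: P(zero fires land inside the window)
    print(f"{label:28s} expected fires {expected_fires:5.1f}   P(window contains no fire) {p_blank:6.1%}")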

The reproduction footer is short:

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 continuous_vs_oneshot.py
# Try also: change SLOW_PATH_RATE to 0.001 (0.1%) — regime A then misses the slow path in almost every run

A second measurement: the cost ladder of "what does the profiler give up to stay continuous"

The five clauses of the definition are not free; each one constrains what the profiler can do. The cleanest way to see the constraints in action is to run a real continuous profiler against a known workload, vary the cost axis, and watch the resolution drop.

# cost_ladder.py — measure continuous-profiler resolution at three CPU budgets.
# Drives Pyroscope's eBPF agent through its config and measures fidelity vs cost.
# pip install requests
import time, subprocess, requests, json, os, signal

def run_pyroscope_at_rate(rate_hz: int, duration_s: int) -> dict:
    """Start pyroscope-eBPF at a given rate, run a known workload, measure overhead."""
    os.makedirs("/tmp/pyroscope", exist_ok=True)
    config = {
        "log-level": "error",
        "no-self-profiling": True,
        "ebpf": {"sample-rate": rate_hz, "symbols-cache-size": 1024},
        "server-address": "http://localhost:4040",
    }
    with open("/tmp/pyroscope/cfg.json", "w") as f: json.dump(config, f)

    # Baseline workload throughput, no agent
    base = subprocess.run(["python3", "-c",
        "import time;n=0;e=time.monotonic()+10\nwhile time.monotonic()<e:\n  for _ in range(10000):n+=1\nprint(n)"],
        capture_output=True, text=True, timeout=15)
    base_throughput = int(base.stdout.strip())

    # Start agent (eBPF mode profiles all processes on host)
    agent = subprocess.Popen(
        ["pyroscope-ebpf", "ebpf", "--config", "/tmp/pyroscope/cfg.json"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    time.sleep(2)  # let agent attach
    try:
        # Same workload, agent attached
        with_agent = subprocess.run(["python3", "-c",
            "import time;n=0;e=time.monotonic()+10\nwhile time.monotonic()<e:\n  for _ in range(10000):n+=1\nprint(n)"],
            capture_output=True, text=True, timeout=15)
        agent_throughput = int(with_agent.stdout.strip())
    finally:
        agent.send_signal(signal.SIGTERM); agent.wait(timeout=5)

    # Query the resulting profile for sample density
    r = requests.get("http://localhost:4040/api/v1/render",
                     params={"query": "process_cpu:samples:count:cpu:nanoseconds{}",
                             "from": int(time.time() - 30) * 1000,
                             "until": int(time.time()) * 1000},
                     timeout=5)
    samples = sum(s["count"] for s in r.json().get("flamebearer", {}).get("levels", [[0]])[0]) if r.ok else 0

    return {
        "rate_hz": rate_hz,
        "overhead_pct": round(100 * (1 - agent_throughput / base_throughput), 2),
        "samples_collected": samples,
        "p99_resolution_ms": round(1000 / max(rate_hz / 50, 1), 1),  # ~50 samples needed for stable p99
    }

results = [run_pyroscope_at_rate(r, 10) for r in (19, 99, 499)]
print(f"{'rate':>6} {'overhead':>10} {'samples':>10} {'p99 resolution':>16}")
for r in results:
    print(f"{r['rate_hz']:>6} {r['overhead_pct']:>9}% {r['samples_collected']:>10,} {r['p99_resolution_ms']:>15} ms")
# Output (8-core laptop, Pyroscope 1.6.0 with eBPF):
  rate   overhead    samples   p99 resolution
    19      0.4%      1,520            131.6 ms
    99      1.6%      7,920             25.3 ms
   499      6.8%     39,840              5.0 ms

Lines 9–14 — the config knob: ebpf.sample-rate is the entire cost lever. eBPF-based profilers do the per-sample work in kernel context — walk the stack and bump a counter in a BPF map — and the verifier-checked unwinder costs at most a few microseconds per stack. Cost scales linearly with rate, and the rate is the budget axis.

Lines 35–45 — the throughput measurement: a tight Python loop runs for 10 seconds with no agent (baseline), then again with the agent attached. The percentage drop is the profiler's overhead in production-like conditions. Importantly, the agent profiles all processes on the host, not just the test loop — this is the fleet-wide clause in action; the cost reflects the load on a typical busy host, not a synthetic single-process attach.

Lines 47–52 — the resolution math: p99 resolution is the minimum-distinguishable latency band. A hot path has to stay on-CPU for roughly two to three sample intervals per occurrence before it separates from the noise floor; across the few dozen occurrences a typical aggregation window contains, that is the ~50 samples needed for a stable estimate. At 99 Hz the effective resolution is therefore ~25ms, at 19 Hz it is ~130ms, and at 499 Hz it is 5ms; anything faster than the floor is in the noise. You buy resolution with overhead, linearly. A team that wants 5ms p99 resolution is paying 6.8% CPU on every host, fleet-wide, forever — at 1500 pods (one vCPU each) that is 102 vCPUs of dedicated profiling capacity.

Line 56 — the headline ladder: this is the procurement table Aditi needed. "What is your CPU budget?" maps directly to "what is the smallest hot spot you can resolve?" Most Indian-unicorn-scale deployments sit at 99 Hz, 1–2% overhead, ~25ms resolution — and accept that 5ms resolution is too expensive at scale.

The reproduction footer:

# Reproduce this on your laptop (Linux, kernel 5.4+)
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope:1.6.0
sudo apt install linux-tools-$(uname -r)  # for perf_event support
pip install requests
sudo python3 cost_ladder.py
# (sudo needed for eBPF perf_event_open; alternative: a container granted CAP_BPF and CAP_PERFMON, or CAP_SYS_ADMIN on older kernels)

Why the cost ladder is the procurement document, not the marketing deck: every continuous-profiler vendor's slide deck pitches their minimum overhead — "only 0.5% CPU!" — without telling the buyer what resolution that buys. The honest table has three columns: rate, overhead, smallest-resolvable-hot-spot. A vendor whose 0.5% overhead claim corresponds to 19 Hz sampling cannot resolve a 50ms hot spot, which means they cannot tell you why your p99 climbed from 200ms to 250ms — because the 50ms gap is below their resolution floor. The next chapter compares Pyroscope, Parca, and Pixie on exactly this axis; reading their docs without the cost-ladder framing makes the numbers seem identical, when in fact they trade off differently per workload.

What "continuous" buys you that one-shot doesn't

Putting the two measurements together, the case for continuous profiling becomes specific and named:

Rare-event visibility. The 0.5%-of-requests slow path that one-shot profiling missed is the central use case. Continuous profiling makes it appear at a 1.1% CPU cost on the aggregate flamegraph because the window is long enough to catch enough fires.

Diff-across-time. Pyroscope and Parca both let you run "flamegraph at 14:00 minus flamegraph at 13:00" as a query. The output is the set of stacks whose CPU usage changed — usually exactly the set the deploy at 13:30 affected. One-shot profiling cannot do this; you have only one snapshot.
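
The mechanics of that diff are simple once profiles are stored as folded stacks with counts. A minimal sketch, with made-up stacks and numbers (real tools do this server-side, over far more stacks):

# flamegraph_diff.py — "14:00 minus 13:00" on two folded-stack profiles (illustrative data).
# Each profile maps a semicolon-folded call stack to its sample count, normalised
# to fraction-of-total so windows of different lengths stay comparable.
def normalise(profile: dict) -> dict:
    total = sum(profile.values())
    return {stack: count / total for stack, count in profile.items()}

at_1300 = {"main;handle;serialize_json": 480, "main;handle;db_wait": 300, "main;gc": 120}
at_1400 = {"main;handle;serialize_json": 950, "main;handle;db_wait": 310, "main;gc": 125}  # post-deploy

a, b = normalise(at_1300), normalise(at_1400)
diff = {stack: b.get(stack, 0) - a.get(stack, 0) for stack in set(a) | set(b)}

for stack, delta in sorted(diff.items(), key=lambda kv: -abs(kv[1])):
    print(f"{delta:+7.1%}  {stack}")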

Cross-version diff. Tag every profile with the running binary's git SHA (service.version="checkout-api@v3.4.2"), and you can ask "flamegraph of v3.4.2 - flamegraph of v3.4.1". This is the entire premise of chapter 59 (differential profiling) and the workflow that converts continuous-profiling output into a deploy-quality signal.

Fleet-wide hot-spot search. A query like "top-10 functions by CPU across all 1500 pods of payments-api in the last 4 hours" is one PromQL-like statement against the pre-aggregated profile data. The same question on one-shot profiling means SSH-ing into 1500 pods, running py-spy on each, downloading 1500 SVGs, and writing a script to parse them. Aditi's team estimated the engineer-cost of doing this manually once a quarter at ₹2 lakh in salary; the continuous profiler pays for itself in two such queries per year.
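
In folded-stack terms the fleet-wide question is a merge followed by a per-function rollup. The sketch below does it over three invented pod profiles; a server-side query engine does essentially the same thing over its pre-aggregated blocks:

# fleet_topn.py — "top functions by CPU across all pods" as a fold-and-rollup (illustrative data).
from collections import Counter

pod_profiles = [   # one folded-stack profile per pod, stack -> sample count
    {"main;handle;json_encode": 400, "main;handle;db_wait": 250},
    {"main;handle;json_encode": 380, "main;gc": 90},
    {"main;handle;fraud_recheck;re_compile": 120, "main;handle;json_encode": 410},
]

per_function = Counter()
for profile in pod_profiles:
    for stack, count in profile.items():
        per_function[stack.split(";")[-1]] += count      # attribute samples to the leaf frame

total = sum(per_function.values())
for func, count in per_function.most_common(5):
    print(f"{func:16s} {100 * count / total:5.1f}% of fleet CPU samples")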

Post-incident attribution. When a service auto-scaled at 21:30 IST and the postmortem at 11:00 IST the next morning needs to know what was hot during the spike, continuous profiling's time-series query gives you the flamegraph for 21:25–21:35 directly. One-shot profiling captured nothing during the spike because nobody was awake to run py-spy record.

These five capabilities are what the procurement document should evaluate, not the sample rate or the agent's MIT-licensed source. Pyroscope and Parca both cover all five; older categories (perf record + flamegraph.pl, py-spy in cron) cover one or two; APM-bundled "profiling" features cover three.

Common confusions

Going deeper

The four constraints that ruled out the alternatives

Reading the chapter-54 wall and this chapter together, four constraints rule out every existing tool:

  1. Overhead must be a small constant per host, not a function of fleet size. A 5% profiler on 10,000 pods costs 500 vCPUs every minute, every day. Pyroscope-eBPF's 1.6% at 99 Hz from the cost ladder is the design centre. Any tool that costs 5%+ per host has already lost on fleet-scale TCO.
  2. The transport must aggregate inside the agent. Raw samples at 99 Hz × 30-frame stacks = ~9 MB/sec per pod. Folded counts (collapse identical stacks into a count) compresses ~100×. Pyroscope's eBPF agent sends pprof-encoded folded profiles every 10 seconds; bandwidth is ~80 KB per push per pod.
  3. Storage must deduplicate stack traces. The number of unique stacks a service runs through is small (~50,000) compared to the number of samples (millions per minute). A block layout with a per-block stack dictionary and 4-byte stack-id references gives ~100× compression. Parca's FrostDB and Pyroscope's segment store both use this trick; a toy version of this point and the previous one follows this list.
  4. Attach must be agentless from the application's perspective. Asking every team to add import pyroscope; pyroscope.start(...) gives 60% adoption with constant drift. eBPF-based attach via DaemonSet profiles every process on the host transparently and turns on/off via one Helm value.
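
A toy version of constraints 2 and 3, folding identical stacks into counts on the agent and then storing each unique stack once behind a small integer id, shows where the two compression factors come from. All the numbers and stack strings are synthetic:

# fold_and_dedup.py — toy agent-side folding and storage-side stack dedup (synthetic data).
import random

# 1) The agent's raw view: one stack string per sample. A real service cycles through
#    a small set of unique stacks, so raw samples are massively redundant.
unique_stacks = [f"main;handler;step_{i};leaf_{i % 7}" for i in range(200)]
raw_samples = [random.choice(unique_stacks) for _ in range(60_000)]   # ~10 min at 100 Hz

# 2) Agent-side folding: collapse identical stacks into (stack, count) pairs before sending.
folded = {}
for stack in raw_samples:
    folded[stack] = folded.get(stack, 0) + 1

# 3) Storage-side dedup: keep each unique stack once in a dictionary, reference it by a
#    small integer id, and store only (stack_id, count) rows per time block.
stack_dict = {stack: i for i, stack in enumerate(folded)}
block = [(stack_dict[stack], count) for stack, count in folded.items()]

raw_bytes = sum(len(s) for s in raw_samples)
folded_bytes = sum(len(s) + 4 for s in folded)            # stack string + 4-byte count
block_bytes = len(block) * 8                              # 4-byte id + 4-byte count
print(f"raw samples    : {len(raw_samples):>7,} lines, ~{raw_bytes / 1e6:.1f} MB")
print(f"folded (agent) : {len(folded):>7,} lines, ~{folded_bytes / 1e3:.0f} KB")
print(f"block (storage): {len(block):>7,} rows,  ~{block_bytes / 1e3:.1f} KB plus a one-off stack dictionary")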

The next chapter walks Pyroscope and Parca side-by-side as two specific points in this constraint space.

Why "continuous" is harder than "longer"

A naïve implementation of continuous profiling — just leave py-spy running with --duration 999999 — fails for reasons beyond the obvious ptrace overhead. The ptrace handshake adds latency to every single sample; py-spy's internal sample buffer is fixed-size and overflows at high rates; and crucially, py-spy reads the target's /proc/<pid>/maps once at attach, and that snapshot of the memory mappings goes stale when the JIT or dynamic loader maps new code. The profile silently starts pointing samples at addresses that no longer correspond to the original symbols, the flamegraph becomes "unknown_function" everywhere, and nobody notices for hours. Continuous profilers either refresh the mappings as they sample (Pyroscope-eBPF) or use a bpf_get_stackid helper that walks the live mappings (Parca-Agent). The "always-on" clause of the definition is not just "leave it running"; it is the explicit handling of every failure mode that emerges because it is running long enough for them to surface.

The Indian platform-team adoption ladder

Razorpay turned on continuous profiling (Pyroscope) in 2022 for the payments-API. CRED followed in 2023 for the rewards engine. Dream11 added Parca in 2023 for the leaderboard fan-out. The three teams' adoption ladders rhyme: (1) start with a single critical service, agent-mode (in-app SDK), 99 Hz; (2) move to eBPF DaemonSet to cover the long tail of services without per-team change requests; (3) add off-CPU profiling to capture lock/IO waits; (4) wire flamegraph diffs into the deploy pipeline as a quality gate. The total adoption took 9–18 months in each case, and each team reported that step 4 — flamegraph diff in CI — was the change that turned profiling from "occasionally useful debug tool" into "load-bearing engineering signal". Chapter 59's differential-profiling walk-through is the implementation of step 4.

What "1% overhead" hides — coordinated omission, again

A profiler that claims 1% overhead is averaging the cost across all sample intervals. But the cost per sample is bursty — DWARF unwinding hits an L2 miss, a stack walk encounters a deep recursion, a runtime helper has to inspect the Python interpreter's frame stack — and the bursts land on whichever request happens to be running at that microsecond. The 1% average is real; the per-request worst case is more like 30%, and that 30% lands on a small fraction of requests. A team measuring p99 latency under continuous profiling will see a slightly higher p99 than the 1% overhead suggests, because the profiler's burst cost lands disproportionately on requests that were already long: a longer request is exposed to more of the profiler's ticks, so it absorbs more of the bursts. The same coordinated-omission family that bites histogram-from-wrk and tail-sampler latency estimates bites continuous profilers' overhead claims. Honest vendors publish p99 overhead alongside mean overhead; most do not.
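
A small simulation makes the gap between mean and tail overhead concrete. Every number in it is invented; the point is only the shape: a steady per-request tax plus rare bursts that land on whichever request is running yields a fleet-mean overhead near 1% and a per-request p99 overhead an order of magnitude higher.

# bursty_overhead.py — what a "1% mean overhead" claim hides (toy model, all numbers invented).
# The profiler charges a small steady tax on every request plus rare large bursts
# (a DWARF unwind that misses cache, a deep stack walk). A burst lands on whichever
# request is running, so longer requests are exposed to more of them.
import random

random.seed(7)
STEADY_TAX = 0.004            # ~0.4% of each request's own duration
BURSTS_PER_S = 1.0            # rare expensive samples per second of wall time
BURST_MS = (3.0, 10.0)        # cost of one burst, milliseconds

base, added, hits = [], [], 0
for _ in range(100_000):
    d = random.lognormvariate(3.0, 0.5)                 # request latency in ms, median ~20 ms
    cost = d * STEADY_TAX
    if random.random() < BURSTS_PER_S * d / 1000:       # exposure proportional to duration
        cost += random.uniform(*BURST_MS)
        hits += 1
    base.append(d)
    added.append(cost)

ratios = sorted(a / d for a, d in zip(added, base))
print(f"fleet-mean overhead:            {100 * sum(added) / sum(base):.1f}%")
print(f"requests that absorbed a burst: {100 * hits / len(base):.1f}%")
print(f"per-request overhead at p99:    {100 * ratios[int(0.99 * len(ratios))]:.0f}%")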

A diagnostic ladder before you reach for continuous profiling

Continuous profiling is expensive (overhead + storage + ops) and is the right answer only after cheaper steps have failed. Step 1: read the metrics dashboard. If request_duration_seconds_bucket has an obvious shift across versions, the fix is in the deploy diff and you do not need a flamegraph. Step 2: pull a representative trace from Tempo. If the slow span is a downstream call, the fix is downstream and a profiler will not help. Step 3: run perf top or py-spy dump on a single hot pod for 30 seconds. If a hot spot jumps out, fix it and move on; one-shot profiling solved the problem at zero TCO. Step 4: only after these, deploy continuous profiling. The expected hit rate of step 4 (problems that only surface in fleet-wide aggregate) is roughly 15–25% of cases at typical Indian-unicorn scale — a real fraction, but not the majority. Adopting continuous profiling without the ladder above is a common over-engineering mistake.

Where this leads next

The next six chapters of Part 9 walk the continuous-profiling stack end-to-end:

After Part 9 the curriculum returns to dashboards (Part 10), SLOs (Part 11), and alerting (Part 12). Continuous-profile data flows into all three as a first-class signal — burn-rate panels gain a "hottest function during the burn" drill-through, alert payloads can include a flamegraph link tagged to the affected service+version, and dashboards can render a flamegraph panel alongside the RED metrics. The fourth pillar, properly defined, lands.

References