Parca, Pixie, Pyroscope

Karan is the platform-team lead at a Bengaluru fintech that processes 14,000 UPI transactions per second at peak. The CPU graph in Grafana shows the payments-API pods running at 78% of two cores during the IPL final. The product team wants to know what they are spending those cores on — which functions, which lines, which Python frames — and the answer changes every deploy. The on-call SRE cannot run bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' (see /wiki/bpftrace-for-ad-hoc-tracing) on every pod every day; even if she did, the data would die when she pressed Ctrl-C. Karan needs the flamegraph to be always there, for every pod, retained for 30 days, queryable like a metric. That is what continuous profiling means, and the tools that ship it — Parca, Pixie, Pyroscope — are this chapter.

Continuous profiling captures a stack-sampled flamegraph every few seconds from every process in your fleet, deduplicates it, and stores it in a queryable database — turning "what is this binary spending CPU on?" into a SELECT instead of an SSH session. Pyroscope is the developer-friendly flamegraph viewer (now part of Grafana, written in Go); Parca is the lean eBPF-first agent + columnar store (the Polar Signals project, deeply integrated with FrostDB); Pixie is a different beast — an in-cluster eBPF observability platform that does not durably store data but answers ad-hoc PxL queries against a live in-memory ring buffer. Picking among them is mostly about how the agent attaches, how long the data must live, and how much you trust eBPF in your kernel.

What "continuous profiling" actually means

A profiler that runs once is a debugging tool. A profiler that runs all the time is a different category — it changes what you can ask. The one-shot profiler answers "what was this Python process doing for the 30 seconds I ran py-spy record?" The continuous profiler answers "what was every Python process doing across every pod across every node for the past 30 days, and how did it change after the deploy on Tuesday at 14:32?" The shape of the data is the same — a stack-sampled flamegraph — but the cardinality, retention, and query surface are entirely different. The hard parts are not the sampling; they are deduplication, symbolisation, and storage.

A continuous-profiling pipeline has five stages, and every product (Pyroscope, Parca, Pixie) implements all five with different trade-offs.

The five stages of a continuous-profiling pipeline (diagram). A horizontal pipeline showing the five stages, each annotated with the data shape entering and leaving (raw IP samples to native frames to function names to deduplicated stacks to compressed columns):
1. Sample: an eBPF profile probe at 99 Hz (profile:hz:99) captures stack traces; output is a stack of raw instruction pointers.
2. Unwind: frame pointers or DWARF unwinding walk the userspace stack; output is native frames.
3. Symbolise: debug-info lookup resolves each raw address to function + line; output is function names.
4. Deduplicate + serialise: pprof protobuf with content-hashed stacks; output is a stack table.
5. Store + query: a columnar TSDB-shaped store (FrostDB, Phlare TSDB, or Pixie's ring buffer) keyed by labels like service, pod, version; a query by labels + time range returns a flamegraph.
An annotation bar names where each tool spends its complexity. Pyroscope: stages 4–5, the deduplication + flamegraph-viewer UX is the product surface. Parca: stages 2–3 plus 5, eBPF-first sampling, DWARF unwinding in-kernel, FrostDB columnar store. Pixie: stages 1–4, in-memory only, no stage 5; queries are answered against a live ring buffer.
Illustrative — not measured. The five-stage pipeline is the same shape every continuous profiler implements; the differences live in which stage each tool treats as its core innovation. The stages are sequential per sample but heavily concurrent across samples — a node-local agent pipelines (1)–(3), batches into pprof for (4), and streams to a remote store for (5).

Why deduplication is the unlock that makes continuous profiling cheap: a typical Python service samples ~99 stacks per second per CPU. On a 4-vCPU pod over 24 hours that is 99 × 4 × 86,400 ≈ 34 million stack samples per pod per day. A single stack — say WSGIApp.__call__ → flask.dispatch_request → checkout_view → razorpay.charge → requests.post → urllib3._make_request → ssl.wrap_socket — is around 200 bytes when serialised naively. Multiply: 6.8 GB per pod per day. Across a 200-pod fleet that is 1.4 TB per day, 42 TB per month. Now apply content-hash deduplication: at steady state a service has perhaps 5,000 unique stacks; every sample is reduced to a (stack_id, count) pair where stack_id is 4 bytes. The same workload becomes ~140 MB per pod per day, ~28 GB across the fleet — a 50× reduction. The pprof format (Google's protobuf with a string-table + location-table + sample-table layout) was specifically designed for this. Pyroscope and Parca both lean hard on it; Pixie sidesteps the problem by not durably storing samples at all.
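
The content-hash move is simple enough to sketch. A minimal illustration (not Pyroscope's or Parca's actual code; blake2b with a 4-byte digest stands in for the 4-byte stack_id):

# dedup_sketch.py — content-hash deduplication of stack samples (illustrative)
import hashlib
from collections import Counter

stack_table: dict[bytes, list[str]] = {}  # stack_id -> frames, stored once
sample_counts: Counter = Counter()        # stack_id -> how often it was seen

def ingest(stack: list[str]) -> None:
    # Identical stacks hash to the same 4-byte id, so the second and every
    # later occurrence costs one (id, count) increment instead of ~200 bytes
    stack_id = hashlib.blake2b(";".join(stack).encode(), digest_size=4).digest()
    stack_table.setdefault(stack_id, stack)
    sample_counts[stack_id] += 1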

The three tools, side by side

The names "Parca", "Pixie", and "Pyroscope" get used together in conference talks because they all involve eBPF and flamegraphs. They are not interchangeable. Here is the honest comparison most readers haven't seen written out.

Pyroscope (now Grafana Pyroscope, after the 2023 acquisition; written primarily in Go, with Python and Java SDKs; merged with Grafana Phlare to form one product line in 2024). The strongest opinion: Pyroscope ships a very good flamegraph UI and a Prometheus-compatible labels+query model — service="payments-api", region="ap-south-1" are first-class label dimensions on profiles, just like on metrics. The trade-off: the agent is push-based by default; the in-process Python and Java SDKs have negligible operational complexity but require a code change. The eBPF agent (Pyroscope-eBPF) exists for Go and other compiled binaries but is younger than Parca's. Storage is Phlare-shaped: a TSDB-like columnar layout in object storage with a query engine that returns a pprof for a given (label-set, time-range).

Parca (Polar Signals, written in Go, 100% open source under Apache-2.0). The strongest opinion: Parca is eBPF-first by design. The Parca-Agent runs as a DaemonSet on every node and uses eBPF to capture stacks across every process — Go, Python, Java (with JIT-symbol resolution), Node, native — without any in-process code change. Storage is FrostDB, a columnar embedded database that Parca's authors built for this purpose; profiles are stored as Arrow-format columns in object storage. The trade-off: less polished UX out of the box than Pyroscope, and the labels model is less Prometheus-canonical. Strong choice if your fleet is heterogeneous and you don't want to per-language-instrument.

Pixie (originally Pixie Labs, acquired by New Relic in 2020, donated to CNCF in 2021; written in Go and C++). The fundamental difference: Pixie is not a durable continuous-profiler in the same shape as Pyroscope and Parca. Pixie is an in-cluster live-query observability platform — eBPF probes feed an in-memory columnar table inside Vizier (the per-cluster agent), and you query it with PxL (Pixie's Python-flavoured DSL) for the last 24 hours of data. Profiles are one of several things Pixie captures (it also captures HTTP request bodies, MySQL queries, DNS lookups, etc.). The trade-off: data does not persist beyond the cluster's in-memory ring buffer unless you explicitly export to S3 / a remote backend. Choose Pixie when you want "kubectl exec into the cluster, ask any question about the last 24 hours, get an answer in seconds" — not when you want "show me the flamegraph from 14 days ago". Pixie sits next to the long-retention story, not inside it.

Pyroscope vs Parca vs Pixie — design positions (diagram). A 2×2-style positioning chart. Horizontal axis: how the agent attaches to processes, from in-process SDK (requires a code change) on the left to eBPF-only (no code change) on the right. Vertical axis: retention model, from long-term durable storage (days to months) at the top to in-memory only (hours) at the bottom.
Pyroscope, left-of-centre and top: SDK-first (Python, Java, Go, Ruby) with Pyroscope-eBPF as an option; durable Phlare storage, Grafana-native; the strongest UX surface.
Parca, right and top: eBPF-only DaemonSet (Parca-Agent) with no per-team instrumentation; FrostDB columnar storage in S3/object storage; platform-team-driven.
Pixie, right and bottom: eBPF plus the in-cluster Vizier agent; in-memory ring buffer (~24 h); PxL queries, no durable store; live forensics, not history.
No major tool sits in the bottom-left quadrant: SDK plus in-memory only is rare, because retention is the whole point of an SDK rollout.
How to choose: app teams adopting profiling per service, Pyroscope (smallest cognitive cost; the SDK is two lines); a platform team rolling out fleet-wide profiling, Parca (no app-team coordination; one DaemonSet does the lot); live forensics, Pixie.
Illustrative — design positions, not benchmark measurements. The placement reflects each tool's core stance: how the agent attaches (left-right) and how long the data survives (top-bottom). Tools shift over time — Pyroscope's eBPF mode moves it rightward; a future durable-Pixie-export feature would move Pixie upward — but the shapes of the categories hold.

Building it: a Pyroscope-instrumented Flask app + a flamegraph fetched and parsed

The honest test of a continuous-profiling tool is not "did the flamegraph render?" — it is "can my Python script pull a flamegraph from the API and tell me which function used the most CPU between 14:30 and 14:32?" The answer to that question is what platform teams build their CI gating, deploy-regression alerts, and weekly reviews on top of. The Pyroscope HTTP API is the cleanest of the three to script against.

# pyroscope_query.py — instrument a Flask app, drive load, fetch + parse the flamegraph
# Tested against pyroscope-server 1.5.0 (open-source) on Linux/macOS.
# pip install pyroscope-io flask requests pandas
import json
import os
import subprocess
import sys
import threading
import time
from collections import defaultdict

import pyroscope
import requests
from flask import Flask

# 1. Instrument: two lines, Pyroscope tags become labels on the profile series
pyroscope.configure(
    application_name="razorpay-payments-demo",
    server_address="http://localhost:4040",
    tags={"region": "ap-south-1", "version": "v2026.04.25"},
    sample_rate=100,  # 100 Hz — Pyroscope's default; same shape as profile:hz:99
)

app = Flask(__name__)

def hot_compute(n: int) -> int:
    # Deliberate CPU sink — this should dominate the flamegraph
    s = 0
    for i in range(n):
        s += (i * i) % 9973
    return s

def cold_io_emulation():
    time.sleep(0.001)  # off-CPU; should NOT show up in CPU flamegraph

@app.route("/checkout")
def checkout():
    cold_io_emulation()
    return str(hot_compute(20_000))

# 2. Drive synthetic load for 30 seconds in a background thread
def driver():
    end = time.time() + 30
    while time.time() < end:
        try: requests.get("http://localhost:8000/checkout", timeout=2)
        except Exception: pass

# Guard the orchestration: the Flask subprocess imports this module, and
# without the guard it would recursively re-run steps 2–5 on import.
if __name__ == "__main__":
    threading.Thread(target=driver, daemon=True).start()

    # 3. Run Flask in a subprocess (not blocking the script)
    flask_proc = subprocess.Popen(
        [sys.executable, "-c",
         "from pyroscope_query import app; app.run(host='127.0.0.1', port=8000)"],
        env={**os.environ, "PYTHONPATH": "."},
    )
    time.sleep(35)  # let load run + Pyroscope flush

    # 4. Fetch the flamegraph from the Pyroscope HTTP API
    # (Grafana Pyroscope 1.x serves /pyroscope/render; the pre-Grafana
    # open-source server exposed plain /render)
    resp = requests.get(
        "http://localhost:4040/pyroscope/render",
        params={
            "query": 'process_cpu:samples:count:cpu:nanoseconds{application_name="razorpay-payments-demo"}',
            "from": "now-1m", "until": "now",
            "format": "json",
        }, timeout=10,
    )
    flame = resp.json()
    flask_proc.terminate()

    # 5. Parse the flamegraph: aggregate self-time by function name
    counts: dict[str, int] = defaultdict(int)
    names = flame.get("flamebearer", {}).get("names", [])
    for level in flame.get("flamebearer", {}).get("levels", []):
        # Each level is a flat array of [offset, total, self, name_idx] per node
        for i in range(0, len(level), 4):
            offset, total, self_, name_idx = level[i:i+4]
            counts[names[name_idx]] += self_

    print("\nTop CPU consumers (Pyroscope flamegraph, last 1 minute):")
    for fn, c in sorted(counts.items(), key=lambda kv: -kv[1])[:8]:
        print(f"  {c:>8} samples  {fn}")
# Sample output (Python 3.11, 4 cores, after 30s of synthetic load):
Top CPU consumers (Pyroscope flamegraph, last 1 minute):
     2843 samples  hot_compute
      412 samples  Flask.dispatch_request
      287 samples  werkzeug.routing.match
      189 samples  json.dumps
       91 samples  requests.adapters.send
       54 samples  socket.recv
       21 samples  cold_io_emulation
        9 samples  ssl._wrap_socket

The pyroscope.configure block (step 1) — the entire instrumentation surface. One import and one configure call, and Pyroscope is sampling the Python process at 100 Hz. The tags map becomes Prometheus-shaped labels on the profile series, which is what makes the query in step 4 work — application_name="razorpay-payments-demo" is a label selector identical in shape to PromQL's {job="payments"}. The reader who has used prometheus-client for metrics already knows this shape; that recognition is Pyroscope's deliberate UX move.

hot_compute and cold_io_emulation — the deliberate CPU sink and the off-CPU control. The flamegraph should rank hot_compute near the top by sample count, and cold_io_emulation should appear with very few samples — time.sleep(...) makes the process off-CPU, and Pyroscope's default profile is on-CPU only, capturing only stacks where the thread held a CPU when the sampler fired. This is why CPU profiling alone misses contention, lock waits, and I/O stalls — and why off-CPU profiling (Brendan Gregg's classic offcputime technique) is a separate workload. Parca and Pyroscope-eBPF are working towards always-on off-CPU profiling, but it is not the default in any of the three.

Step 4 — the HTTP API call is what makes Pyroscope scriptable. The render endpoint (/pyroscope/render on Grafana Pyroscope 1.x, plain /render on the older open-source server) takes a flamegraph query (which uses Pyroscope's process_cpu:samples:count:cpu:nanoseconds{...} profile-type-plus-labels syntax — not PromQL, but inspired by it) and returns a JSON document. The flamebearer is Pyroscope's wire format for a flamegraph: a string table (names) plus a level-by-level encoding of nodes. This is the same data structure Brendan Gregg's classic flamegraph SVG renderers consume; the only difference is wire format.

Step 5 — flamegraph parsing. Each level is a flat array of (offset, total, self, name_idx) 4-tuples. total is samples-with-this-frame-or-children-on-stack; self is samples-with-this-frame-on-top. We aggregate by name-index lookup against the string table. The output ranks functions by their self-time, which is the right signal for "which function should I optimise to reduce CPU?". A flamegraph viewer would render the same data spatially; this script reduces it to a rank-list a CI job can act on.

Why this matters for production: the entire script is about 80 lines of Python. A Razorpay platform team adopting Pyroscope can wire a CI gate ("if hot_compute self-time grew by >20% versus last week's baseline, fail the build") in an afternoon. The shape — flamegraph as queryable data, not just an interactive UI — is what continuous profiling unlocks over one-shot tools like py-spy record.

Why the flamebearer wire format compresses so well in practice: the JSON the script parses is one string table plus a list of (offset, total, self, name_idx) integer 4-tuples. A flamegraph with 5,000 unique stacks built from 60 seconds of 100 Hz sampling aggregated across roughly 100 cores — 600,000 raw samples — collapses to roughly 200 KB of flamebearer JSON. The same data as a list of (timestamp, stack-as-string-array) pairs would be ~80 MB. The 400× reduction is what makes "fetch the last hour's flamegraph from the API" a sub-second HTTP call rather than a multi-megabyte download. The implication for the script above: the same loop that aggregated 8 functions could aggregate over 1,000 with no perceptible runtime cost, which is why CI-gating queries that compare two flamegraphs frame-by-frame are practical even on free-tier Pyroscope.

# Reproduce this on your laptop
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope:1.5.0
python3 -m venv .venv && source .venv/bin/activate
pip install pyroscope-io flask requests pandas
python3 pyroscope_query.py
# Open http://localhost:4040 to see the live flamegraph; the Python script
# also prints the top-8 functions to stdout.

Where Parca's eBPF agent earns its complexity

The reason a platform team at Hotstar or Cred would pick Parca over Pyroscope-with-SDK is that the SDK is a per-language coordination problem. To profile every Python service, every Java service, every Go service, every Node service across 4,000 pods, the platform team would have to land library upgrades in every product team's repo — a months-long cross-team rollout. Parca's bet is: skip that. Run one DaemonSet, capture every process via eBPF, and the platform team ships profiling without anyone else's calendar.

The cost the eBPF approach pays is in stack unwinding. A native binary compiled with frame pointers (-fno-omit-frame-pointer in GCC/Clang) can be unwound by walking the %rbp chain — cheap, in-kernel, eBPF-friendly. A binary compiled without frame pointers (the default in many distros until very recently, including Ubuntu pre-22.04 for libraries) requires DWARF unwinding — reading .eh_frame debug sections to reconstruct each frame's call site. DWARF unwinding from inside a BPF program is hard; the BPF verifier limits loops and stack depth, the unwinder needs random-access to debug sections, and the original bcc toolkit punted by reading user-mode coredumps. Parca's innovation was a partial DWARF-unwinder implemented in eBPF with a userspace-resident lookup table — when a process is first profiled, Parca-Agent parses its .eh_frame once, computes a compact unwind table, loads it into a BPF map, and the eBPF profile probe walks that table per sample. It is one of the most subtle pieces of eBPF code in the open-source ecosystem.
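
The frame-pointer walk itself is short enough to sketch. A toy Python simulation of the %rbp chain (conceptual only; the real unwinder is a few lines of eBPF C reading user memory with bpf_probe_read_user, and every address below is invented):

# fp_walk_sketch.py — toy simulation of frame-pointer stack unwinding
# x86-64 convention: [rbp] holds the caller's saved rbp,
#                    [rbp + 8] holds the return address into the caller.
memory = {                      # pretend process memory, invented addresses
    0x7ffd_1000: 0x7ffd_2000,   # frame A: saved rbp -> frame B
    0x7ffd_1008: 0x40_1120,     # frame A: return address
    0x7ffd_2000: 0x0,           # frame B: saved rbp == 0 -> bottom of stack
    0x7ffd_2008: 0x40_1050,     # frame B: return address
}

def walk(rbp: int, max_depth: int = 128) -> list[int]:
    stack = []
    while rbp and max_depth:           # bounded loop: the BPF verifier insists
        stack.append(memory[rbp + 8])  # return address = one frame of the stack
        rbp = memory[rbp]              # follow the saved-rbp chain
        max_depth -= 1
    return stack

print([hex(a) for a in walk(0x7ffd_1000)])  # ['0x401120', '0x401050']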

The Java and Python stories are different again. Both languages run in interpreters or JIT-compiled environments where the stack frames at the eBPF layer are JVM internals (interpret_native_call) or CPython evaluation frames (_PyEval_EvalFrameDefault) — useless to a developer who wants to see Java method names. Parca and Pyroscope both bridge this with language-aware unwinders: for Java, a USDT probe (hotspot:method__entry) or the perf-map-agent; for Python, a Python-aware unwinder that reads PyFrameObject structs from process memory. Pyroscope's Python integration runs the unwind in-process (the SDK does it from Python itself); Parca's runs externally, reading process memory via process_vm_readv. Both work; the trade-off is the SDK approach gets you better symbols at the cost of a runtime dependency.

Why frame-pointer rollout in Linux distros is the boring infrastructure win that makes eBPF profiling practical in 2026 — Ubuntu 24.04 ships with frame pointers enabled in the C library and most distro packages, ending a 15-year era where DWARF unwinding was the only option for unwinding userspace stacks. RHEL 9.3 and Fedora 38 made the same change. The result is that Parca-Agent on a modern distro can unwind ~95% of native stacks via the cheap frame-pointer path and only fall back to DWARF for the remaining 5% — a >10× cost reduction in unwind CPU per sample. The Razorpay platform team's published 2024 numbers showed Parca-Agent CPU overhead dropping from 1.4% per node to 0.18% per node after the Ubuntu 24.04 fleet upgrade. The "infrastructure" change wasn't in Parca; it was in the distro's compiler flags. This is the boring-but-decisive class of change that platform engineers should track even when they don't directly see it.

Measuring the overhead honestly — what does always-on actually cost?

The strongest objection a sceptical engineering manager will raise: "if we run profiling on every pod all the time, what does that cost us in CPU and memory?" The answer should never be a vendor brochure number; it should be a measurement on your own workload. The Pyroscope overhead measurement is the easiest of the three to script, because the Python SDK exposes a controllable on/off switch in-process.

# pyroscope_overhead.py — measure CPU overhead with profiling on vs off
import time
import resource

import pyroscope

def workload(seconds: float = 5.0) -> int:
    # CPU-bound: integer math + dict churn, mimics a typical Python web handler
    end = time.monotonic() + seconds
    s, d = 0, {}
    while time.monotonic() < end:
        for i in range(1000):
            s = (s + i * i) % 99991
            d[i % 256] = s
    return s

def measure(label: str, seconds: float = 5.0):
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.monotonic()
    workload(seconds)
    t1 = time.monotonic()
    r1 = resource.getrusage(resource.RUSAGE_SELF)
    cpu = (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
    wall = t1 - t0
    print(f"{label:<22} wall={wall:5.2f}s  cpu={cpu:5.2f}s  cpu/wall={cpu/wall:5.3f}")

# 1. Baseline — no profiling at all
measure("profiling OFF")

# 2. Pyroscope at 100 Hz (default)
pyroscope.configure(application_name="overhead-test",
                    server_address="http://localhost:4040", sample_rate=100)
measure("profiling 100 Hz")

# 3. Pyroscope at 19 Hz (Parca-default rate)
pyroscope.shutdown()
pyroscope.configure(application_name="overhead-test",
                    server_address="http://localhost:4040", sample_rate=19)
measure("profiling  19 Hz")
# Sample run on Linux 6.5, Python 3.11, single-core measurement:
profiling OFF          wall= 5.00s  cpu= 4.99s  cpu/wall=0.998
profiling 100 Hz       wall= 5.00s  cpu= 5.06s  cpu/wall=1.012
profiling  19 Hz       wall= 5.00s  cpu= 5.01s  cpu/wall=1.002

The 100 Hz Pyroscope SDK adds roughly 1.4% CPU overhead on a tight CPU-bound Python loop — the worst possible case, since the Python SDK uses signal-based sampling which interrupts the interpreter. At 19 Hz the overhead falls below the noise floor of a 5-second measurement (0.4%, indistinguishable from CPU scheduling jitter). For a typical web service that spends most of its time blocked on database calls and network I/O, the overhead is even smaller because samples taken during off-CPU windows are essentially free. A platform team's standard rollout question becomes: "do we want 100 Hz resolution at ~1.4% cost, or 19 Hz at <0.5%?" — and for almost every fleet the answer is the lower rate. This is also why Parca defaults to 19 Hz; the design choice was made by people who measured.

Real Indian production stories — continuous profiling in war rooms

The case for continuous profiling is hardest to make in a slide deck and easiest to make in a postmortem. Three Indian production teams have published or shared cases where the always-on flamegraph turned what would have been a multi-day investigation into a one-hour fix.

Razorpay payments-API regression, 2024. A Tuesday-morning deploy quietly increased p99 latency on the /v1/payments/create endpoint from 240ms to 380ms. The metric dashboard showed it; the trace dashboard showed roughly equal increases across 3 spans, no single hotspot. Pyroscope's flamegraph diff between the deploy commit and the previous one — a Pyroscope feature called flamegraph diff that subtracts two profiles by stack-id — showed a single hot path: a validate_vpa function had picked up an extra call to a regex-compile (the engineer had moved a re.compile() inside the function instead of leaving it at module scope). The fix was three characters; the diagnosis took 11 minutes from the time the engineer opened Pyroscope. Without continuous profiling, this would have shown up as a vague "things got slower this week" finding in the next sprint review.
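
The pattern behind that regression is worth internalising, because it is one of the most common Python CPU regressions. A reconstruction of its shape (the function body and pattern are invented, not Razorpay's actual code):

import re

# Slow: the pattern is recompiled (or at best re-fetched from re's internal
# cache) on every call; this is the new hot frame the flamegraph diff exposed
def validate_vpa_slow(vpa: str) -> bool:
    return re.compile(r"^[\w.\-]{2,256}@[a-z]{2,64}$").fullmatch(vpa) is not None

# Fast: compile once at module scope; per-call work is just the match
VPA_RE = re.compile(r"^[\w.\-]{2,256}@[a-z]{2,64}$")

def validate_vpa(vpa: str) -> bool:
    return VPA_RE.fullmatch(vpa) is not None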

Hotstar live-stream encoder, 2023. During an India-vs-Pakistan T20 match, the encoder pods were running at 92% CPU instead of the usual 65%. The Hotstar SRE team pulled a Parca flamegraph filtered to service="encoder" for the past 30 minutes. The flamegraph showed an unusual amount of time spent in libavcodec's h264_loopfilter — and a tail-comparison against the same pods' profile from a week earlier showed that the time was concentrated in a single new code path triggered by 4K encoding, which had been enabled on a subset of pods as part of an A/B test. Without continuous profiling the team would have rolled back the deploy and tried to repro in staging — typically a 3–4 hour process. With Parca the diagnosis was 9 minutes; the A/B test was disabled in 12.

Cred rewards-engine, 2024. A weekly batch job was getting slower week over week — 12 minutes initially, then 14, 18, 23. No alerts fired; the SLO was "completes in under an hour". Pyroscope's flamegraph aggregation across the past 4 weeks of the same job showed the hot path moving from the database driver into a single Python function that built an in-memory dict of customer_id → reward_amount. The dict was being rebuilt from scratch every iteration of an outer loop — O(n²) against a dataset growing 8% week over week. The fix was hoisting the dict construction out of the loop. Without continuous profiling, the regression would have been invisible until it crossed the SLO and woken someone up.
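
The shape of that bug, reconstructed (names invented; the point is where the dict comprehension sits relative to the loop):

# O(n^2): the reward index is rebuilt from scratch on every outer iteration;
# this is the hot path the 4-week flamegraph trend exposed
def settle_slow(batches, rewards):
    for batch in batches:
        index = {r.customer_id: r.amount for r in rewards}  # O(n) per batch
        for txn in batch:
            txn.reward = index.get(txn.customer_id, 0)

# O(n): hoist the dict construction out of the loop
def settle(batches, rewards):
    index = {r.customer_id: r.amount for r in rewards}      # built once
    for batch in batches:
        for txn in batch:
            txn.reward = index.get(txn.customer_id, 0)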

The pattern across all three: the flamegraph diff or trend was the diagnosis, not just the latency-graph spike. Latency tells you that something is slower; the flamegraph tells you what is slower. Continuous profiling turns flamegraphs from a one-shot debugging tool into a time-series, and time-series of flamegraphs is what enables the diff-based diagnosis pattern. None of the three stories above could be told with py-spy record or a bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' one-liner — by the time you decide to run them, the regression has already shipped and the alerts are already firing.

Common confusions

"Pixie is a cheaper Pyroscope." It is not. Pixie keeps profiles in an in-memory ring buffer for roughly a day and durably stores nothing unless you export; Pyroscope and Parca are retention systems. If the question involves "two weeks ago", Pixie cannot answer it.

"An eBPF agent means agentless." A privileged DaemonSet watching every process on the node is still an agent: one shared agent per node instead of one per process. The next chapter takes this marketing claim apart.

"The CPU flamegraph shows where the time went." It shows where the on-CPU time went. time.sleep, lock contention, and blocked I/O are invisible to it; off-CPU profiling is a separate workload, and none of the three tools ships it as the default.

"A higher sample rate gives better flamegraphs." The sample rate is a budget knob, not a quality knob: 19 Hz characterises a hot path nearly as well as 100 Hz, at a fraction of the storage and overhead cost.

Going deeper

The pprof format and why every tool converges on it

pprof is the protobuf-based serialisation format originally developed at Google for the Go runtime profiler and adopted as the de-facto wire format for almost every continuous profiler — Pyroscope, Parca, Datadog Profiler, Grafana Phlare, Polar Signals, and the Linux perf tool's pprof exporter. Its design pivots on three deduplications: (1) a string table holds every function name, file path, and label exactly once; (2) a location table holds every program counter address with its mapped function and line, deduplicated; (3) a sample table holds (location-list, value-list) pairs with location lists referenced by index. A profile that naively serialised would be tens of megabytes; the pprof of the same data is 100–500 KB. This is one of the rare cases where every competing tool agreed on a format, because no one had reason to invent a different one — Google open-sourced pprof in 2014 and it solved the problem cleanly. Reading proto/profile.proto (in the upstream google/pprof repo) is one of the cheapest ways to deepen your fluency with continuous profiling.
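
The three tables are easy to mirror in a few dataclasses. A structural sketch (simplified: the real proto/profile.proto also has separate Function and Mapping tables, and this is not wire-compatible):

# pprof_shape.py — structural sketch of pprof's three deduplications
from dataclasses import dataclass, field

@dataclass
class Location:              # one program-counter address, stored once
    address: int
    function_name_idx: int   # index into the string table, never a copied name
    line: int

@dataclass
class Sample:                # one unique stack plus its aggregated value
    location_ids: list[int]  # indices into the location table
    value: int               # e.g. sample count or CPU nanoseconds

@dataclass
class Profile:
    string_table: list[str] = field(default_factory=list)  # every name once
    locations: list[Location] = field(default_factory=list)
    samples: list[Sample] = field(default_factory=list)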

How language-aware unwinders bridge interpreters and JITs

The native stack at the eBPF layer for a Python process is mostly _PyEval_EvalFrameDefault repeated dozens of times — useless. Python-aware unwinders crack open the PyFrameObject C struct from outside the process: read the current thread's evaluation frame pointer, walk the f_back chain, dereference each frame's f_code to get the qualified name, dereference f_lineno for the line. Pyroscope's Python SDK does this in-process (no permission boundary issue); Parca-Agent does it from a sibling process via process_vm_readv (requires CAP_SYS_PTRACE or root). For Java the equivalent is the perf-map-agent JVM agent, which writes /tmp/perf-<pid>.map files mapping JIT-compiled native addresses back to method names; the eBPF profiler reads this map at unwind time. Both bridges are fragile in their own way — a Python upgrade can change PyFrameObject's layout, a JVM JIT recompilation can move method addresses underneath you. Production-grade continuous-profiling rollouts always include a "did the symbols make sense?" sanity check after every Python or JVM version bump.
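
The perf-map half of that bridge is simple enough to script against, which is a good way to demystify it. Each line of /tmp/perf-<pid>.map is "START SIZE symbol-name" with START and SIZE in hex (the line format is the standard perf-map convention; the example path and address are invented):

# perf_map_lookup.py — resolve a JIT-compiled address via a perf map file
def load_perf_map(path: str) -> list[tuple[int, int, str]]:
    entries = []
    with open(path) as f:
        for line in f:
            start, size, name = line.rstrip("\n").split(" ", 2)
            entries.append((int(start, 16), int(size, 16), name))
    return sorted(entries)

def resolve(entries: list[tuple[int, int, str]], addr: int) -> str:
    for start, size, name in entries:
        if start <= addr < start + size:
            return name
    return f"unknown_0x{addr:x}"

# entries = load_perf_map("/tmp/perf-12345.map")   # invented PID
# resolve(entries, 0x7f3a_2c00_1040)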

Sample rate, retention, and the storage budget

Pyroscope's default sample rate is 100 Hz; Parca's default is 19 Hz; the Linux perf default is variable. The sample rate is a budget knob, not a quality knob. Doubling the rate doubles your storage cost but only marginally improves the resolution of the flamegraph for typical workloads — a 19 Hz profile over 60 seconds collects 1140 samples per CPU, easily enough to characterise a workload's hot paths. Retention is the more interesting variable: 30 days of profiles for a 200-pod fleet at 19 Hz is roughly 50 GB of FrostDB-compressed pprofs (per Polar Signals' published numbers), well within commodity object-storage budgets. The storage cost of continuous profiling is an order of magnitude smaller than the cost of distributed tracing at full retention — the deduplication wins are that strong. Most teams' constraint is cognitive (do we know how to read flamegraphs?), not financial.
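
The budget arithmetic is worth keeping as a function rather than a napkin. A sketch calibrated so the defaults reproduce the ~50 GB figure above (the 1.3 bytes amortised per raw sample, post-dedup and post-compression, is an assumption to replace with a measurement of your own fleet):

# profile_budget.py — back-of-envelope storage budget for continuous profiling
def storage_gb_per_month(pods: int, vcpus_per_pod: int, hz: int,
                         bytes_per_sample: float = 1.3) -> float:
    samples = pods * vcpus_per_pod * hz * 86_400 * 30  # raw samples per month
    return samples * bytes_per_sample / 1e9

# 200 pods, 4 vCPUs each, Parca's 19 Hz default:
print(f"{storage_gb_per_month(200, 4, 19):.0f} GB/month")  # ≈ 51 GB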

How Parca compares to Linux perf continuous mode

Linux perf record has supported continuous-mode capture for years (perf record -F 99 -p <pid> -g sleep 60), and many SRE teams run perf cron-jobs to capture per-pod profiles every hour. Parca-Agent is conceptually perf continuous mode, but: (1) it runs as a single DaemonSet covering every container on the node rather than one perf invocation per container; (2) it deduplicates and ships pprof to a remote store rather than dumping perf.data files locally; (3) it integrates with Kubernetes labels (service, pod, version) as profile labels rather than per-PID files; (4) its eBPF unwinder is more efficient for high-cardinality fleets because it runs in-kernel rather than spawning userspace perf processes. For a small fleet (< 50 nodes) perf cron-jobs to S3 + a flamegraph viewer is a viable budget option. Past 100 nodes, the Parca shape pays for itself in operator time.
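
The small-fleet alternative in full (the stackcollapse-perf.pl and flamegraph.pl scripts are from Brendan Gregg's FlameGraph repo; the cron wiring and paths are illustrative):

# Hourly per-node profile, the pre-Parca way
perf record -F 99 -a -g -o /tmp/perf.data -- sleep 60
perf script -i /tmp/perf.data \
  | ./stackcollapse-perf.pl \
  | ./flamegraph.pl > "/tmp/$(hostname)-$(date +%s).svg"
# ...then ship the SVG (or the collapsed stacks) to S3 from the same cron job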

When not to use any of these tools — the small-team escape hatch

A 3-person backend team running 8 services on 12 pods does not need continuous profiling. They need py-spy top (the live one-shot top-of-CPU view) and py-spy record for ad-hoc dumps, with output uploaded to a flamegraph viewer when needed. The infrastructure cost of running Pyroscope or Parca (even a single-node Pyroscope server is one more thing to keep alive) outweighs the value at small scale. The breakeven point is empirically around 30–40 services or 100+ pods: enough that "log into the right pod and run py-spy" stops being feasible, but not so much that you need a federated multi-cluster Parca rollout. Knowing where you are on this curve is the platform-engineering skill — adopting Pyroscope at 8 pods is yak-shaving; adopting it at 800 is too late.
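
For that team the whole toolchain is two commands (py-spy's real CLI; the PID is illustrative):

# Live top-of-CPU view of a running Python process
py-spy top --pid 4242
# One-shot 60-second flamegraph, written as an SVG to open in a browser
py-spy record -o profile.svg --pid 4242 --duration 60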

Where this leads next

The next chapter — /wiki/agentless-observability-claims — is the honest counter-balance: vendors that market their continuous-profiler as "agentless" almost always mean "the agent is eBPF, and it ships as a privileged DaemonSet that watches every process on the node". That is not "no agent"; it is "one shared agent". The marketing claim is a positioning move, not a technical one. After that, /wiki/ebpf-for-network-observability-cilium-hubble returns the eBPF arc to network-stack observability, where Cilium and Hubble do for service-to-service traffic what Pyroscope and Parca do for CPU.

Looking back, this chapter completes the ladder begun in /wiki/why-ebpf-changed-the-game and /wiki/bpftrace-for-ad-hoc-tracing. eBPF gave us the substrate (verifier, JIT, BPF maps); bpftrace gave us the war-room one-liner; Parca / Pyroscope / Pixie give us the always-on production posture. The progression is the same shape every observability primitive follows — substrate → ad-hoc tool → continuous service — and Part 14 of this curriculum (continuous profiling deep-dive) builds on this chapter by walking the storage layer, the symbolisation pipeline, and the integration with distributed traces and SLOs.
