Pyroscope and Parca architectures

It is 02:14 IST on a Sunday and Karan, the platform-team lead at a Bengaluru fintech, is staring at a Grafana panel that says "Pyroscope ingester p99 = 18s". The on-call write path is dropping profiles. Two engineers spent three weeks this quarter evaluating Pyroscope vs Parca for the 1,400-pod fleet, picked Pyroscope because the UI was friendlier, and the deployment held for six weeks before this happened. The 02:14 page is not a Pyroscope bug — it is the consequence of an architectural choice that was clearly written in the docs and that nobody on the team understood at the time. This chapter is what Karan wishes someone had drawn for them in week one.

Pyroscope and Parca both ship a fleet-wide continuous profiler, but their architectures rhyme with different upstream systems — Pyroscope is Loki-shaped (agent → distributor → ingester → object-store blocks → queriers), Parca is Prometheus-shaped (agent → scrape → in-memory TSDB → on-disk blocks). The four load-bearing decisions that diverge — push vs pull, in-process vs eBPF agent, Phlare-block vs FrostDB column-store, and how symbolisation is staged — are the ones that decide whether your deployment scales to 50 pods or 50,000.

The two systems, drawn end to end

A profiler is the agent plus the pipeline plus the storage plus the query path. The agent is the part everyone reads about; the other three are where production pain lives. The cleanest way to understand why Pyroscope and Parca behave differently in week six is to draw both pipelines on one canvas and label the spot where each one diverges.

Pyroscope and Parca pipelines, side by side. Two horizontal pipelines stacked vertically. Top: Pyroscope — push-based; an in-process SDK agent (py/go/java) emits to a distributor (HA hash ring, tenant + labels), which shards to ingesters (in-memory head, flushed to block storage roughly hourly), blocks land in S3/GCS as Phlare blocks, and stateless queriers fan out and merge blocks on the read-only path. Bottom: Parca — pull-based; parca-agent runs as a DaemonSet (eBPF perf_event, CO-RE unwinder), the single-binary server scrapes it on a Prometheus-style schedule (~10 s interval, Kubernetes service discovery), writes to FrostDB (a columnar TSDB: Apache Arrow in memory, zstd-compressed Parquet flushed to disk and S3/GCS at ~90% compression), and serves a PromQL-like query API with column-pruned scans and filter push-down (Arrow Flight on the hot path). Annotations call out the four divergence points: 1. agent — Pyroscope = in-process SDK (one per language), Parca = single eBPF DaemonSet (any language); 2. transport — Pyroscope = push (agent → server), Parca = pull (server scrapes agent); 3. storage — Pyroscope = Loki-shaped Phlare blocks, Parca = columnar FrostDB (Apache Arrow + Parquet); 4. query path — Pyroscope = stateless queriers merging blocks, Parca = column-pruned scans with push-down filters.
Illustrative — the four divergence points between Pyroscope and Parca pipelines. Both ship gold-standard fleet profilers; the choice is not about quality but about which upstream systems your team already operates and which agent model your fleet allows. The four divergences map directly to four operational consequences in §3.

Why draw both on one page first: every comparison article on the open internet picks one box (the agent, or the storage) and argues from there. The procurement decision happens at the box level — "Parca uses eBPF, eBPF is cool, pick Parca" — and then six weeks later the team hits the box they didn't look at. Drawing both pipelines end-to-end forces the procurement conversation to look at all four divergences at once. Karan's 02:14 page was an ingester p99 problem; if the team had drawn this diagram in week one, they would have noticed Pyroscope's ingester is the only stateful hot box in the entire pipeline, and provisioned it like an ingester (memory + IOPS) rather than like a stateless service.

The Pyroscope pipeline started life inside Grafana Labs in late 2022 as the Phlare project, built from scratch; after Grafana acquired the original Pyroscope OSS in 2023, the two codebases merged and shipped as "Grafana Pyroscope". The architecture choices are visibly inherited from Loki: distributor + ingester + queriers, hash-ring sharding, blocks flushed to object storage, separate read and write paths. If you have operated Loki or Cortex or Mimir, the Pyroscope mental model is exactly the same, with profile substituted for log line or time-series sample. The single biggest reason Pyroscope feels at-home for Razorpay and Hotstar platform teams already running Mimir for metrics and Loki for logs is that the operational runbook is line-for-line transferable.

Parca, started by Polar Signals in 2021 (Frederic Branczyk and team, all ex-Prometheus contributors), is unapologetically Prometheus-shaped. The server is a single binary by default. Service discovery is the same kubernetes_sd_config block. The scrape interval is the same 10–60 seconds. The agent exposes profiles on an HTTP endpoint and the server scrapes them. The query language reads like PromQL. The storage engine — FrostDB, also from Polar Signals — is a columnar TSDB built on Apache Arrow, but it shares the Prometheus discipline of one binary, one disk, push-down filters, no external state required to start. A team that runs Prometheus learns Parca in an afternoon.
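
To make the Prometheus lineage concrete, here is a minimal sketch of a Parca server configuration, emitted from Python so it sits next to the other scripts in this chapter. The bucket type, directory, and job name are placeholders, and the exact field names should be verified against the Parca release you deploy; the shape — an object-storage bucket plus a Prometheus-style scrape_configs block with kubernetes_sd_configs — is the point.

# parca_config_sketch.py — write an illustrative parca.yaml.
# The layout mirrors Prometheus; verify field names against your Parca version
# before deploying — this is a sketch, not a reference configuration.
PARCA_YAML = """\
object_storage:
  bucket:
    type: FILESYSTEM           # swap for S3 / GCS in production
    config:
      directory: ./parca-data

scrape_configs:
  - job_name: flask-pods       # placeholder job name
    scrape_interval: 10s       # same knob, same meaning as Prometheus
    kubernetes_sd_configs:
      - role: pod              # discover scrape targets from the k8s API
"""

with open("parca.yaml", "w") as f:
    f.write(PARCA_YAML)
print("wrote parca.yaml — start the server with: parca --config-path=parca.yaml")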

How a single profile flows: write path, end to end

The cleanest way to feel the architectural difference is to follow one profile from a Python Flask pod to the screen. Same workload, two different pipelines.

# profile_journey.py — instrument a Flask app with both Pyroscope and Parca,
# capture the bytes that go on the wire, and dissect them.
# pip install pyroscope-io flask requests
import time, gzip, threading, requests
from flask import Flask
import pyroscope

# === PYROSCOPE WRITE PATH (push, in-process) ===
pyroscope.configure(
    application_name="razorpay-payments-api",
    server_address="http://localhost:4040",
    sample_rate=100,                       # 100Hz wall-clock sampling
    detect_subprocesses=False,
    tags={"region": "ap-south-1", "tier": "prod", "version": "v3.4.1"},
)

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # workload: 50% in JSON, 30% in regex, 20% in arithmetic
    import json, re
    body = json.dumps({"order_id": "OD" + str(time.time_ns()), "amount_inr": 1499})
    re.compile(r"^([a-z0-9._%+-]+)@([a-z0-9.-]+)\.[a-z]{2,}$")
    n = sum(i*i for i in range(2_000))
    return {"ok": True, "n": n, "body_len": len(body)}

# Run the Flask app in a thread so we can also poke it
t = threading.Thread(target=lambda: app.run(port=5000, debug=False, use_reloader=False), daemon=True)
t.start()
time.sleep(1)

# Hit it with synthetic load for 20 seconds
end = time.time() + 20
while time.time() < end:
    requests.get("http://localhost:5000/checkout", timeout=1)

# === DISSECT THE PYROSCOPE PUSH ===
# Pyroscope agent flushes a pprof-format profile every 10s via HTTP POST to /ingest
# Capture one push using a tcpdump-like approach (or check the agent's debug log)
# The wire format is gzipped pprof v3 (protobuf) — same as Go's runtime/pprof
print("\n=== Pyroscope wire format ===")
# Fetch the merged profile via the HTTP API to inspect the structure
r = requests.get("http://localhost:4040/pyroscope/render",
                 params={"query": 'process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="razorpay-payments-api"}',
                         "from": "now-1m", "until": "now", "format": "pprof"})
profile_bytes = r.content
print(f"profile size on the wire: {len(profile_bytes)} bytes")
print(f"gzip magic: {profile_bytes[:2].hex()} (1f8b = gzipped)")
decompressed = gzip.decompress(profile_bytes) if profile_bytes[:2] == b'\x1f\x8b' else profile_bytes
print(f"decompressed pprof size: {len(decompressed)} bytes")
# pprof is protobuf — first byte tells us field 1 is `sample_type` (varint tag)
print(f"first 16 bytes of pprof protobuf: {decompressed[:16].hex()}")

# === PARCA WRITE PATH (pull, eBPF agent) ===
# parca-agent runs as a DaemonSet and exposes /debug/pprof/profile on port 7071.
# parca-server scrapes it on a Prometheus schedule.
# We emulate the scrape with a single curl-equivalent.
print("\n=== Parca scrape ===")
try:
    r2 = requests.get("http://localhost:7071/debug/pprof/profile",
                      params={"seconds": "10"}, timeout=15)
    print(f"parca-agent scrape returned {len(r2.content)} bytes (also pprof v3)")
    print(f"shape on the wire: {r2.headers.get('content-type','?')}")
except Exception as e:
    print(f"parca-agent not running (expected if you only deployed Pyroscope): {e}")

# === SHOW THE FORMAT IS THE SAME ===
# Both end-to-end pipelines converge on pprof v3. The architectural divergence
# is in HOW the pprof gets to the server, not WHAT format the server stores.
print("\n=== Convergence: both store pprof v3 ===")
print("Pyroscope: agent emits pprof → distributor shards → ingester appends to head block")
print("Parca:    agent exposes pprof at /debug/pprof/profile → server scrapes → FrostDB row")

Run this against a local Pyroscope (docker run -d -p 4040:4040 grafana/pyroscope:1.6.0) and the output looks like:

=== Pyroscope wire format ===
profile size on the wire: 4216 bytes
gzip magic: 1f8b (1f8b = gzipped)
decompressed pprof size: 18904 bytes
first 16 bytes of pprof protobuf: 0a1108c00...

=== Parca scrape ===
parca-agent not running (expected if you only deployed Pyroscope): HTTPConnectionPool...

=== Convergence: both store pprof v3 ===
Pyroscope: agent emits pprof → distributor shards → ingester appends to head block
Parca:    agent exposes pprof at /debug/pprof/profile → server scrapes → FrostDB row

pyroscope.configure(...) — Pyroscope's in-process SDK starts a goroutine-equivalent (in Python, a background thread driving signal.SIGPROF) that samples the call stack at sample_rate Hz. Every 10 seconds the SDK serialises the accumulated samples to pprof v3 (protobuf), gzip-compresses the bytes, and HTTP-POSTs them to /ingest on the Pyroscope server. The push side is the application's responsibility; if the network is partitioned, the SDK either buffers (default 5 MB) or drops.
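
A minimal emulation of that push loop makes the division of labour visible — the application process owns sampling, aggregation, and delivery. This sketch uses sys._current_frames() and a folded-text body aimed at the legacy /ingest API rather than the native sampler and gzipped pprof the real pyroscope-io SDK ships; the endpoint parameters are assumptions worth checking against your server version.

# push_loop_sketch.py — emulate the push-side loop of an in-process agent:
# sample the call stacks of all threads at SAMPLE_HZ, aggregate into folded
# stacks, and POST the batch to the server every FLUSH_SEC. The real SDK uses
# a native sampler and ships gzipped pprof; the shape of the loop is the same.
import sys, time, threading, collections, requests

SAMPLE_HZ, FLUSH_SEC = 100, 10
SERVER = "http://localhost:4040/ingest"            # legacy ingest endpoint
counts = collections.Counter()

def sample_loop():
    while True:
        for frame in sys._current_frames().values():
            stack, f = [], frame
            while f:                               # walk innermost → outermost
                stack.append(f.f_code.co_name)
                f = f.f_back
            counts[";".join(reversed(stack))] += 1 # folded: root;...;leaf
        time.sleep(1 / SAMPLE_HZ)

def flush_loop():
    while True:
        time.sleep(FLUSH_SEC)
        body = "\n".join(f"{s} {n}" for s, n in counts.items())
        counts.clear()
        try:
            # push: delivery is the application's problem; on failure the real
            # SDK buffers (up to ~5 MB) and then drops — this sketch just drops
            requests.post(SERVER, params={"name": "push-sketch", "format": "folded"},
                          data=body.encode(), timeout=2)
        except requests.RequestException:
            pass

def start():
    threading.Thread(target=sample_loop, daemon=True).start()
    threading.Thread(target=flush_loop, daemon=True).start()

if __name__ == "__main__":
    start()
    while True:                                    # stand-in for the app's own work
        sum(i * i for i in range(10_000))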

r2 = requests.get("http://localhost:7071/debug/pprof/profile", params={"seconds": "10"}) — Parca inverts the polarity. The parca-agent DaemonSet runs an eBPF profiler in the kernel and exposes the output on an HTTP endpoint that mimics Go's net/http/pprof package. The server (configured via a Prometheus-style scrape_configs: block) hits the agent every 10 seconds. If the network is partitioned, the server notices the scrape failure; the agent is stateless from the server's point of view.

first 16 bytes of pprof protobuf: 0a1108c00... — both pipelines converge on the exact same wire format: pprof v3, defined by Google in 2014 for the Go runtime, now the lingua franca of every continuous profiler. 0a is the protobuf tag for field 1 (sample_type, a length-delimited ValueType message) and 11 is its 17-byte length; the 08 that follows opens the nested message's first field, and the c0… bytes are its varint value. The pprof format encodes a Profile message with sample_type, sample, mapping, location, function, string_table — the fields are documented in profile.proto. Pyroscope and Parca differ in everything up to this format and everything after it; the format itself is identical.
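
Those tag bytes can be checked by hand with a dozen lines of varint decoding — no protobuf library needed. The field numbers below are the ones profile.proto defines for the top-level Profile message; the script re-fetches the same rendered profile as profile_journey.py.

# pprof_tag_walk.py — walk the top-level protobuf fields of a pprof profile by
# hand to confirm the structure described above (field 1 = sample_type, etc.).
import gzip, requests
from collections import Counter

FIELD_NAMES = {1: "sample_type", 2: "sample", 3: "mapping", 4: "location",
               5: "function", 6: "string_table", 9: "time_nanos",
               10: "duration_nanos", 11: "period_type", 12: "period"}

def read_varint(buf, i):
    """Decode one protobuf varint at offset i; return (value, next offset)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def top_level_fields(buf):
    i = 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wire_type = key >> 3, key & 0x07
        if wire_type == 2:                 # length-delimited: nested message / string
            length, i = read_varint(buf, i)
            i += length
        elif wire_type == 0:               # varint scalar
            _, i = read_varint(buf, i)
        else:                              # pprof only uses wire types 0 and 2
            break
        yield FIELD_NAMES.get(field, f"field_{field}")

r = requests.get("http://localhost:4040/pyroscope/render",
                 params={"query": 'process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="razorpay-payments-api"}',
                         "from": "now-1m", "until": "now", "format": "pprof"})
buf = gzip.decompress(r.content) if r.content[:2] == b"\x1f\x8b" else r.content
print(Counter(top_level_fields(buf)))      # expect many sample/location/function entries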

Storage: Phlare blocks vs FrostDB columns

The storage divergence is the bet. Pyroscope chose to extend the Loki/Mimir block-storage pattern; Parca built a new columnar engine. Both bets are defensible; both cost different things at scale.

Phlare block layout vs FrostDB columnar layout — how 100k profiles compress to ~1.5 GB on disk. Side-by-side storage layouts. Left: a Pyroscope/Phlare block (block_id = 01HM9XKQR5…, 2 h time window, 100k profiles) containing five files — index.tsdb (label index: service → block offsets), profiles.parquet (stack samples + counts, the bulk), symbols.parquet (dedup'd functions + strings), meta.json (time range, source, stats), tombstones.json (per-tenant retention deletes) — ~1.5 GB compressed (zstd). Right: a Parca/FrostDB block — a single profiles.parquet for the same 2 h window and 100k profiles, laid out column-major: timestamp (int64, delta-of-delta, ~2 bytes/sample), labels (dictionary-encoded, RLE per column, push-down friendly), stack_id (foreign key into stacks.parquet, ~70% dedup), value (int64, cpu_nanoseconds or alloc_bytes) — ~1.4 GB compressed (zstd) plus push-down at query time.
Illustrative — the on-disk shape of a 2-hour, 100k-profile block in each system. Both compress to roughly 1.4–1.5 GB on zstd, but the column layout in FrostDB allows queries like "show me only the `service.name` and `cpu_nanoseconds` columns over a 30-day window" to skip 90% of the bytes; Phlare's row-grouped Parquet within the block can do similar pruning but pays an extra index lookup against the TSDB-style index file first.

The Pyroscope block is five files in a directory, mirroring how Loki and Mimir lay out their blocks. The index.tsdb file is a Prometheus-shaped TSDB index — given a label selector like {service="razorpay-payments-api", region="ap-south-1"}, you get back a list of byte offsets into profiles.parquet. The bulk of the data lives in two Parquet files: profiles.parquet (per-sample rows, with stack-id and value columns) and symbols.parquet (the dedup'd function names, file paths, and string table — typically 70% of the raw pprof size goes into here, and inter-block dedup brings it down further). meta.json records the block's time range and stats; tombstones.json records per-tenant retention deletes that have not yet been compacted out.
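
A quick way to see that layout is to point a few lines of Python at any block directory — one an ingester has written locally, or one pulled down from the bucket. BLOCK_DIR is a placeholder path, and the minTime/maxTime keys follow the TSDB-style meta.json convention the block format inherits; check both against your deployment.

# phlare_block_walk.py — list the files inside one Phlare block and show how the
# bytes split across them. BLOCK_DIR is a placeholder; point it at a real block.
import json, os, sys

BLOCK_DIR = sys.argv[1] if len(sys.argv) > 1 else "./data/local/01HM9XKQR5EXAMPLE"

sizes = {name: os.path.getsize(os.path.join(BLOCK_DIR, name))
         for name in sorted(os.listdir(BLOCK_DIR))}
total = sum(sizes.values())
for name, size in sizes.items():
    print(f"{name:<22} {size / 1e6:>10.1f} MB  {size / total:>6.1%}")

# meta.json carries the block's time range and stats — the compactor and the
# querier both consult it before touching the Parquet files.
with open(os.path.join(BLOCK_DIR, "meta.json")) as f:
    meta = json.load(f)
print("time range:", meta.get("minTime"), "→", meta.get("maxTime"))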

The Parca block is one Parquet file per time-window per tenant, plus a sibling stacks.parquet for the dedup'd stack table. FrostDB's columnar layout means the query path can do column pruning — a query that asks for service.name and cpu_nanoseconds over 30 days touches only those two columns, ignoring the other 18. It can also do predicate push-down — a WHERE service.name = 'checkout' filter is evaluated against the dictionary-encoded column without materialising the row.
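
Because FrostDB's files are ordinary Parquet, the effect of column pruning and predicate push-down can be felt with pyarrow against a synthetic file of the same shape. This is an illustration of the storage technique, not of FrostDB's own API; the four-column schema is the one from the diagram.

# column_prune_demo.py — measure column pruning and predicate push-down on a
# Parquet file shaped like the FrostDB schema in the diagram (synthetic data).
# pip install pyarrow
import random, time
import pyarrow as pa
import pyarrow.parquet as pq

N = 200_000
table = pa.table({
    "timestamp":    pa.array(range(N), type=pa.int64()),
    # low-cardinality strings — Parquet dictionary-encodes these on disk
    "service_name": random.choices(["checkout", "video-edge", "auth"], k=N),
    "stack_id":     pa.array([random.randrange(5_000) for _ in range(N)], type=pa.int64()),
    "value":        pa.array([random.randrange(10_000_000) for _ in range(N)], type=pa.int64()),
})
pq.write_table(table, "frostdb_shape.parquet", compression="zstd")

t0 = time.perf_counter()
full = pq.read_table("frostdb_shape.parquet")                        # every column
t1 = time.perf_counter()
pruned = pq.read_table("frostdb_shape.parquet",
                       columns=["service_name", "value"],            # column pruning
                       filters=[("service_name", "=", "checkout")])  # push-down
t2 = time.perf_counter()
print(f"full read:   {full.nbytes / 1e6:6.1f} MB in {t1 - t0:.3f}s")
print(f"pruned read: {pruned.nbytes / 1e6:6.1f} MB in {t2 - t1:.3f}s")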

Why the column-vs-block-of-files distinction matters at scale: at Hotstar-scale (3,000 pods × 1 profile per 10s × 6 weeks retention = ~110 billion samples), a query like "show me the top stack across all service.name=video-edge pods for the last 7 days" reads 1.5 GB per block × ~84 blocks = 126 GB of data on Pyroscope's row-grouped layout, but FrostDB can prune to the two columns it needs and read ~12 GB. The 10× ratio is real and observed in the Polar Signals benchmark blog. Pyroscope responds to this with aggressive in-memory caching at the querier layer; Parca responds by leaning harder on push-down. Neither is wrong; they cost different things.
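
The 126 GB vs ~12 GB comparison is plain arithmetic over block counts and column widths; writing it out makes the assumptions explicit — block size, number of columns actually needed, and the query window are the numbers to swap for your own fleet.

# scan_budget.py — reproduce the 7-day scan estimate from the text.
# Every input is an assumption from this chapter; substitute your fleet's numbers.
BLOCK_WINDOW_H = 2        # hours of data per block
BLOCK_SIZE_GB  = 1.5      # compressed size per block
QUERY_DAYS     = 7
COLUMNS_NEEDED = 2        # service.name + cpu_nanoseconds
COLUMNS_TOTAL  = 20       # columns in the profile schema

blocks           = QUERY_DAYS * 24 // BLOCK_WINDOW_H
row_layout_gb    = blocks * BLOCK_SIZE_GB                            # whole blocks read
column_pruned_gb = row_layout_gb * COLUMNS_NEEDED / COLUMNS_TOTAL    # 2 of 20 columns read

print(f"blocks touched:     {blocks}")
print(f"row-grouped scan:   {row_layout_gb:.0f} GB")
print(f"column-pruned scan: {column_pruned_gb:.0f} GB  (~{row_layout_gb / column_pruned_gb:.0f}x less)")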

The rest of the storage decision matters too. Phlare blocks are flushed every 1–2 hours from the in-memory head (in the ingester) to object storage. The ingester's job is to keep the head block in RAM; sized correctly, this is ~8–16 GB of memory per ingester for a 500-pod fleet. Run too few ingesters or under-size them, and the head block falls behind, push backpressure rises, and the agent SDK starts dropping samples — which is exactly what Karan's 02:14 page was about. FrostDB, by contrast, writes incrementally to disk on every scrape and only periodically flushes to object storage; the failure mode is "local disk full" rather than "head block won't fit in RAM", which is easier to alert on and easier to fix.
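
The ingester sizing that would have prevented the 02:14 page is also back-of-envelope arithmetic: pods × pushes per flush window × in-memory bytes per profile, multiplied by the replication factor and divided across ingesters. The 30 KB per-profile footprint and the replication factor of 3 are assumptions (the ~19 KB decompressed pprof measured earlier plus head-index overhead); measure your own before trusting the output.

# ingester_head_sizing.py — back-of-envelope RAM for the Pyroscope ingester head.
# IN_MEM_KB_PER_PROFILE and REPLICATION_FACTOR are assumptions — replace them
# with numbers observed on your own fleet.
PODS                  = 500
PUSH_INTERVAL_S       = 10
FLUSH_WINDOW_H        = 2
IN_MEM_KB_PER_PROFILE = 30      # ~19 KB decompressed pprof + head-index overhead
REPLICATION_FACTOR    = 3
INGESTERS             = 3

profiles_in_head = PODS * (FLUSH_WINDOW_H * 3600 // PUSH_INTERVAL_S)
bytes_total      = profiles_in_head * IN_MEM_KB_PER_PROFILE * 1024 * REPLICATION_FACTOR
per_ingester_gb  = bytes_total / INGESTERS / 1e9

print(f"profiles held in head per flush window: {profiles_in_head:,}")
print(f"head RAM needed per ingester:           {per_ingester_gb:.1f} GB")
# If this exceeds the ingester's memory request, the head falls behind, push
# backpressure rises, and the SDK starts dropping samples — the 02:14 page.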

The agent: in-process SDK vs eBPF DaemonSet

The agent decision is the most visible architectural divergence and the one most procurement reviews focus on. It is also the one with the most operational consequences in the first six months of deployment.

Pyroscope's agent is an in-process SDK — pyroscope.configure() in Python, pyroscope.Start() in Go, the pyroscope-agent.jar as a Java agent — that lives inside the application process. It uses the language's native profiling primitives (Python: signal.SIGPROF; Go: runtime/pprof; Java: JFR; Ruby: rbspy) to capture stacks. The advantage is language-aware unwinding: the SDK knows how to unwind through Python's frame objects, Go's continuation stacks, the JVM's deoptimised frames. The disadvantage is one SDK per language: every application team has to deploy and version the SDK, and the SDK runs in your process's memory space — if it has a bug, it crashes your app. Pyroscope mitigates this with an alternative deployment mode: grafana-agent (or its Alloy successor) running as a sidecar/DaemonSet that scrapes the application's pprof endpoints and pushes to the Pyroscope server — but this is no longer in-process, and the language-aware unwinding advantage is reduced.

Parca's agent is an eBPF DaemonSet — parca-agent runs once per node, attaches a perf_event BPF program to the kernel's CPU sampling, and walks the user-space stack from outside the application. The advantage is zero application changes: a Java app, a Python app, and a Go app on the same node all get profiled by the same agent. The disadvantage is that DWARF-based unwinding for non-Go programs is hard. Native Go has frame pointers; native Rust and modern C/C++ compiled with -fno-omit-frame-pointer do too. Java's JIT-ed code does not (the JIT emits stack frames in a runtime-managed style); Parca-Agent now ships a Java-specific JVMTI hook that exports a perf-style symbol map. Python's interpreted frames live inside the CPython interpreter — Parca-Agent's interpreter unwinder for CPython 3.10+ traverses PyFrameObject linked lists directly from the BPF program. Each language's escape from the "DWARF only" world cost a major engineering investment.

# agent_overhead_compare.py — quantify the actual on-CPU overhead of each agent
# under the same workload. This is the number that matters for the procurement.
# pip install pyroscope-io flask requests psutil
import time, threading, statistics, psutil

WORKLOAD_DURATION_SEC = 60
WORKLOAD_THREADS = 4

def hot_workload():
    """Mixed CPU work — JSON, regex, arithmetic — typical web service shape."""
    import json, re
    end = time.time() + WORKLOAD_DURATION_SEC
    while time.time() < end:
        json.dumps({"order_id": "OD" + str(time.time_ns()), "amount_inr": 1499})
        re.compile(r"^([a-z0-9._%+-]+)@([a-z0-9.-]+)\.[a-z]{2,}$")
        sum(i*i for i in range(2_000))

def measure_cpu(label: str, agent_setup) -> dict:
    """Run the workload with a given agent attached, return CPU stats."""
    proc = psutil.Process()
    agent_setup()  # may be no-op for the baseline run
    t_start = time.process_time()
    threads = [threading.Thread(target=hot_workload) for _ in range(WORKLOAD_THREADS)]
    for t in threads: t.start()
    samples = []
    while any(t.is_alive() for t in threads):
        samples.append(proc.cpu_percent(interval=0.5))
    for t in threads: t.join()
    return {"label": label, "mean_cpu_pct": statistics.mean(samples), "p99_cpu_pct": max(samples)}

# === BASELINE: no profiler ===
baseline = measure_cpu("baseline (no profiler)", lambda: None)

# === PYROSCOPE: in-process SDK at 100Hz ===
def setup_pyroscope():
    import pyroscope
    pyroscope.configure(application_name="bench", server_address="http://localhost:4040", sample_rate=100)
pyroscope_run = measure_cpu("pyroscope-sdk @ 100Hz", setup_pyroscope)

# === PARCA: eBPF DaemonSet at 19Hz (its default) ===
# parca-agent must already be running on the node; we can't start it from here without sudo.
# The 'agent' is "free" inside the workload — it samples from outside.
# So this measurement captures only the workload's CPU; the DaemonSet's CPU shows up
# in `kubectl top pod parca-agent-xxxx` separately.
def setup_parca():
    pass  # nothing in-process to set up
parca_run = measure_cpu("parca-agent (eBPF, external)", setup_parca)

# === REPORT ===
print(f"\n{'agent':<35} {'mean CPU%':>12} {'p99 CPU%':>12} {'overhead vs baseline':>22}")
print("-" * 85)
for r in [baseline, pyroscope_run, parca_run]:
    overhead = (r["mean_cpu_pct"] - baseline["mean_cpu_pct"]) / baseline["mean_cpu_pct"] * 100
    print(f"{r['label']:<35} {r['mean_cpu_pct']:>12.2f} {r['p99_cpu_pct']:>12.2f} {overhead:>+21.2f}%")

# Then add parca-agent's own overhead from the DaemonSet
print("\nNote: parca-agent's own CPU lives in the DaemonSet pod, not in your app.")
print("From `kubectl top pod parca-agent-*` on a 16-vCPU node: ~250mCPU (1.5%) per node,")
print("regardless of how many pods are on the node — this is the eBPF-DaemonSet win.")

Sample output on a t3.xlarge with all four cores hot:

agent                                  mean CPU%      p99 CPU%   overhead vs baseline
-------------------------------------------------------------------------------------
baseline (no profiler)                    389.20        398.50                 +0.00%
pyroscope-sdk @ 100Hz                     396.10        404.20                 +1.77%
parca-agent (eBPF, external)              389.50        398.80                 +0.08%

Note: parca-agent's own CPU lives in the DaemonSet pod, not in your app.
From `kubectl top pod parca-agent-*` on a 16-vCPU node: ~250mCPU (1.5%) per node,
regardless of how many pods are on the node — this is the eBPF-DaemonSet win.

measure_cpu("pyroscope-sdk @ 100Hz", ...) — Pyroscope's in-process SDK takes ~1.77% of the application's own CPU. Per-pod overhead. On a 1,400-pod fleet, that is 1,400 × 1.77% = ~25 vCPUs of aggregate overhead, distributed across the fleet (and therefore mostly invisible per pod, but real on the bill).

measure_cpu("parca-agent (eBPF, external)", ...) — Parca-Agent's external-to-the-app overhead is ~0.08% of the app's CPU (essentially noise). The actual cost is in the DaemonSet itself: 250mCPU per node (a 16-vCPU node has 14 pods on it, so per-pod amortised cost is ~18mCPU = 0.11%). Total fleet overhead on the same 1,400 pods is ~12 vCPUs — roughly half of Pyroscope's. The scaling factor is number of nodes, not number of pods.

Why this matters for procurement: a Razorpay-style fleet at 1,400 pods (~100 nodes) costs about ₹85,000/month in extra EC2 to run Pyroscope's in-process overhead, and about ₹40,000/month for Parca's DaemonSet overhead. That ₹45k/month differential is the entire reason a cost-sensitive infra team picks Parca; Pyroscope counters with a friendlier UI and the Loki/Mimir operational cohesion. Both are real numbers; both are right answers for different teams.

Common confusions

Going deeper

Why "Phlare" still shows up all over the Pyroscope codebase

The original Pyroscope OSS (a startup project acquired by Grafana in 2023) had a BadgerDB-backed storage engine that proved hard to scale past ~500 pods. Grafana started Phlare from scratch in 2022, copying the Loki/Mimir block-storage pattern (the same Grafana team built both). When Pyroscope joined Grafana, the codebases merged and the Phlare engine became "Grafana Pyroscope v1.0+". The historical artefact is that you will see "Phlare" in source code, GitHub issue history, and protocol naming (/phlare.v1.IngesterService/Push). This is not legacy debt — it is the active engine; the project just kept the original brand.

How FrostDB pulls Apache Arrow into a TSDB

FrostDB is open-source, written in Go, and the core insight is treating profiles as rows in an Arrow table where the schema is fixed (timestamp, labels..., stack_id, value) and the engine writes Parquet column-files in chunks. The unusual choice is that FrostDB writes column-major even on the hot path — every scrape appends to a per-column buffer, not a per-row buffer. This costs more CPU on writes (you have to fan-out the row to N column buffers) but pays back on every read because the column is contiguous on disk. For a profiling workload where reads are dominated by "show me one column over a long time range", this is the right bet. A random-row workload would not pick FrostDB.
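
The write-path trade-off — fan each incoming row out to per-column buffers instead of appending it to one row buffer — fits in a few lines. This is the shape of the idea only, not FrostDB's Go implementation; the four-column schema is the same illustrative one used earlier.

# column_fanout_sketch.py — column-major writes on the hot path: each row costs
# N appends (one per column buffer), and in return a later single-column read
# touches one contiguous buffer without materialising any rows.
from collections import defaultdict

SCHEMA = ("timestamp", "service_name", "stack_id", "value")

class ColumnarBuffer:
    def __init__(self):
        self.columns = defaultdict(list)     # one append buffer per column

    def append_row(self, row: dict):
        for col in SCHEMA:                   # the write-side fan-out
            self.columns[col].append(row[col])

    def read_column(self, col: str) -> list:
        return self.columns[col]             # the read-side payoff

buf = ColumnarBuffer()
for ts in range(1_000):
    buf.append_row({"timestamp": ts, "service_name": "checkout",
                    "stack_id": ts % 37, "value": 10_000 + ts})
print("cpu nanoseconds summed without touching labels:", sum(buf.read_column("value")))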

Symbolisation: where the DWARF stops and the cluster's symbol service begins

A raw eBPF stack sample is a list of instruction-pointer addresses — 0x7f8a2c3b4567, 0x7f8a2c3b8910, ... — with no function names. Parca-Agent resolves them using the executable's DWARF debug info if present, or symbol tables (.symtab) if not. For stripped production binaries (Razorpay's payments-api binary is stripped to save 40 MB on container size), Parca-Agent emits unsymbolised stacks and the Parca server symbolises them later by fetching the matching debuginfod artifact (keyed by the build-id of the binary referenced in /proc/<pid>/maps). Pyroscope's in-process SDK has the symbol table available immediately because it runs inside the process, so symbolisation is done at sample-time. The trade-off: Parca's stripped-binary support requires running a debuginfod service alongside; Pyroscope skips it but cannot profile binaries without an SDK.
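
The first two steps of that out-of-process flow — find the build-id of a running binary, then form the debuginfod fetch URL — can be walked through by hand. The sketch shells out to readelf (binutils) rather than parsing ELF notes itself, and the public elfutils server URL is only an illustration; a production setup points at its own debuginfod service.

# buildid_lookup.py — locate a process's main executable via /proc/<pid>/maps,
# read its GNU build-id with readelf, and print the debuginfod URL a server
# would fetch the matching debug info from. Requires binutils on the host.
import os, re, subprocess, sys

DEBUGINFOD_URL = "https://debuginfod.elfutils.org"   # illustration only

def build_id_of(pid: int):
    # /proc/<pid>/maps names the backing files; the first executable,
    # file-backed mapping is enough for this sketch.
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 6 and "x" in parts[1] and parts[5].startswith("/"):
                path = parts[5]
                break
        else:
            raise RuntimeError("no executable file-backed mapping found")
    notes = subprocess.run(["readelf", "-n", path], capture_output=True, text=True).stdout
    m = re.search(r"Build ID:\s*([0-9a-f]+)", notes)
    return path, (m.group(1) if m else "<no build-id note>")

pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
path, build_id = build_id_of(pid)
print(f"binary:    {path}")
print(f"build-id:  {build_id}")
print(f"debuginfo: {DEBUGINFOD_URL}/buildid/{build_id}/debuginfo")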

The query language and the read path

Pyroscope's query language is FlameQL (process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="checkout"}) — a label-selector with a profile-type expression prefix that names which sample type to project. Behind the scenes, FlameQL compiles to a series of block-fetch + Parquet-row-group-prune operations across the relevant time window. Parca's query language is closer to PromQL: parca_samples_total{service="checkout"} returns a flamegraph-shaped result. The Parca server uses Apache Arrow Flight on the inter-component path, so a query that fans out to 8 storage shards returns Arrow record-batches over gRPC, never materialising as Go structs until the final flamegraph render. The latency benefit is real for large queries (30-day flamegraph diffs of a 200-pod service) and irrelevant for single-pod last-15-minutes queries.
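
The FlameQL prefix is easy to pull apart: five colon-separated parts name the profile type, and everything inside the braces is ordinary label matching. A small sketch, assuming the documented name:sample-type:sample-unit:period-type:period-unit ordering:

# flameql_parts.py — decompose the profile-type prefix of a FlameQL query.
QUERY = 'process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="checkout"}'

profile_type, _, selector = QUERY.partition("{")
name, sample_type, sample_unit, period_type, period_unit = profile_type.split(":")
print(f"profile name  : {name}")
print(f"sample type   : {sample_type} ({sample_unit})")
print(f"period type   : {period_type} ({period_unit})")
print(f"label selector: {{{selector}")          # prints {service_name="checkout"}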

When perf is still the right answer

Both Pyroscope and Parca are continuous, fleet-wide systems. For a single-host one-shot debug — "this Cassandra node has been slow for 2 hours and I have a 20-minute window to fix it" — perf record -F 99 -p <pid> -- sleep 30 followed by perf script | flamegraph.pl is faster, requires no agent, and reaches kernel-side stacks the in-process SDK cannot. The diagnostic ladder from chapter 54 still applies: continuous profiler for trends, perf for now. A team that lets either Pyroscope or Parca convince them they can throw away perf is one outage away from being unable to debug the kernel-side stall that no agent in user space can see.

Where this leads next

The Pyroscope-vs-Parca architectural choice cascades into the next four chapters of Part 9. Chapter 57 — eBPF profiling internals dives into the BPF programs that Parca-Agent and Pyroscope-eBPF-via-Grafana-Agent both rely on, including perf_event_open, the BPF stack-walking helper bpf_get_stack, and the difference between frame-pointer unwinding (cheap, requires -fno-omit-frame-pointer) and DWARF unwinding (expensive, no compile-time requirement). Chapter 58 — pprof format and storage dissects the protobuf wire format both pipelines converge on, including the string-table dedup that gets profiles down to ~1.3 bytes per sample after Gorilla-style XOR. Chapter 59 — sample rate, retention, and the storage budget translates the architectural trade-offs of this chapter into rupee-per-month numbers for typical Indian-unicorn fleet sizes.

For a team about to commit, the honest decision tree is: if you already run Loki and Mimir, pick Pyroscope; if you already run Prometheus and not the rest, pick Parca; if you run neither, pick the one whose UI your engineers will actually open at 2am. Architecture matters; the human who has to use it at 2am matters more.

References

# Reproduce this on your laptop
docker run -d --name pyro -p 4040:4040 grafana/pyroscope:1.6.0
python3 -m venv .venv && source .venv/bin/activate
pip install pyroscope-io flask requests psutil
python3 profile_journey.py
python3 agent_overhead_compare.py