Profile storage and query patterns: where the flamegraphs actually live

Karan, a platform engineer at a Bengaluru video-streaming company, has just shipped continuous CPU profiling to 12,000 production pods. The first day the always-on agent emits roughly 14 KB/s per pod of pprof samples — that is 168 MB/s across the fleet, 14 TB/day raw. The Pyroscope cluster he provisioned (3 ingester nodes, 2 querier nodes, 4 TB of Cassandra-backed object storage) fills its disks in 78 hours. The CFO asks why the new "free" observability tool just added ₹18 lakh/month to the AWS bill. The dashboard latency on a per-function diff across the fleet is 14 seconds — which means the on-call engineer at 03:00 IST opens it once, sees a spinner, closes the tab, and ssh-es into a pod instead. The continuous-profiling product has been live for four days and is already failing on cost and on latency at the same time. The question Karan needs answered before Monday's review is: how do real continuous-profiling backends store this volume of data and serve it back in under a second?

A profile is a directed acyclic graph of stack frames with a sample weight on each node — fundamentally different from a row in a TSDB or a log line in Loki, which is why grafting profiles onto either fails. Real profile stores split storage three ways: a content-addressed blob store for raw pprof payloads (deduplicated by hash, ~30× compression), a columnar index for (service, pod, time) lookups, and a symbolised-stack table for fast aggregation. Queries are precomputed at ingest into per-second function-level summaries, so a "top 50 functions across the fleet for the last 10 minutes" returns from the columnar layer in <800 ms without ever opening a pprof blob.

A profile is not a row, not a log line, not a metric

The first design mistake every continuous-profiling backend makes is treating profiles as rows. A profile is not a row. It is a tree (more precisely, a DAG, because a profile uses a flat callstack-id table and a stack table that points into it) with thousands to tens-of-thousands of nodes, and each node carries a sample weight in CPU nanoseconds, allocated bytes, or lock-wait time. A 30-second on-CPU profile from a busy payments-api pod has roughly 12,000 unique stacks, 80,000 function names (after symbolisation), and 300,000 sample weights distributed across them. Serialised as pprof.proto, gzipped, it lands at around 90–180 KB on the wire. Ingested raw into Postgres as one row per sample, it would be 300,000 rows × ~100 bytes = 30 MB — a 200× inflation that destroys both the storage and the query plan.

The second design mistake is treating profiles as time series. Pyroscope's own first version (the one before they were acquired by Grafana Labs) stored profiles in a Cassandra-backed time-series-like layout, with the function name as the "metric" and the sample weight as the "value". This works for the simple case of "show me total CPU on serialize_response over the last hour" but breaks the moment a query needs the call stack — because a metric model has no notion of "this CPU sample was reached via handler -> middleware -> serialize_response". A flamegraph is not a sum-over-time of a function; it is a sum-over-time of a path through the call tree. TSDBs cannot represent a path. The Pyroscope rewrite (and Parca's design from day one) uses a columnar layout — one row per (timestamp, stack-id, sample-weight) tuple, with a separate stacks table mapping stack-id → array of frame-ids and a frames table mapping frame-id → (function-name, file, line) — because that is the only shape that lets you re-aggregate by any cut of the call tree at query time.

The third design mistake is treating profiles as logs. A log line is opaque to the storage; you grep for substrings or label-match on key-values. A profile is structured — every node has a function name, a file, a line number, a self time, a cumulative time. Storing profiles in Loki or Elasticsearch loses all of that structure, because the indexer cannot distinguish a stack frame from a free-text token. Searches like "show me the top function by self-time on service=payments-api between 14:00 and 14:30" become full-table scans of opaque blobs. Loki at Grafana Labs explicitly does not try to store pprofs as logs for exactly this reason — the team built Pyroscope as a separate ingester precisely because the column shape and the access patterns are wrong for a log store.

Illustrative — not measured data. The four shapes show why profiles need their own store. A row is flat; a log line is opaque text; a metric is a point in (time, value); a profile is a weighted tree where the weight on every interior node is the sum of weights on all paths through it. Storing profiles as any of the first three loses the path information that flamegraphs depend on. The continuous-profiling backends (Pyroscope, Parca, Polar Signals) all built dedicated columnar engines because no existing pillar's storage layer fits.

Why this matters more than it looks: every team that ships continuous profiling tries to reuse their existing TSDB or log store first, because "we already have Prometheus / Loki / Elasticsearch in production". Six months later they have built a custom backend anyway, because the queries flamegraph viewers need — "give me the cumulative CPU under handler -> serialize_response aggregated across 800 pods for the last 10 minutes" — cannot be expressed in PromQL or LogQL. The DAG shape is load-bearing. Pyroscope's rewrite to a columnar engine, Parca's columnar-from-day-one design, and Datadog's profile backend (a Cassandra-Foundation-DB hybrid called "Husky") all converged on the same answer: profiles get their own ingester, their own index, their own query language. Reusing existing stores is a one-quarter shortcut that costs three quarters to undo.

The three-layer architecture every real backend converges on

Every production continuous-profiling backend — Pyroscope, Parca, Polar Signals, Grafana Labs' Phlare-now-Pyroscope, Google Cloud Profiler, Datadog Profiler — ends up with three layers, regardless of where they started. The names differ; the shapes do not.

Layer 1: a content-addressed blob store for raw pprof payloads. Each ingested profile is hashed (SHA-256 of the gzipped pprof.proto bytes), the hash becomes the storage key, and the bytes go into S3 / GCS / Azure Blob behind a Cassandra or FoundationDB metadata table. The hash makes the layer naturally deduplicating: when 800 pods of the same payments-api deployment send near-identical profiles in a quiet 10-second window, only one copy hits S3. Real-world dedup ratios on a busy fleet land at 8-15x for stable workloads (high duplication of "boring" stacks), 2-4x for high-variance workloads (lots of unique stacks per profile). The blob layer is the cold-storage tier — accessed only when the user clicks "open this exact profile" or when the symbolisation pipeline needs to re-extract frames. It is never hit by aggregation queries.

Layer 2: a columnar index for (service, pod, version, time) to blob-id lookups. This is the layer Pyroscope built on Apache Parquet (after the rewrite); Parca built on FrostDB (Polar Signals' own embedded columnar engine); Datadog built on FoundationDB. The schema is roughly (timestamp, service, pod, version, profile_type, blob_hash, total_samples, top_function_id_1, top_function_id_2, ...). Queries that filter by service+time+version hit this layer and return a list of blob hashes plus pre-computed per-profile summaries. The columnar layout is essential: a query like service=payments-api AND version=v4.18.2 AND time IN [14:00, 14:30] reads only the four filter columns from disk and does not touch the rest, so a 100-million-row index fits in working memory and returns in 50-100 ms.

Layer 3: a symbolised-stack and per-function aggregation table. This is the precomputation layer that makes "top 50 functions across the fleet" return in under a second. At ingest, every profile's stacks are walked and a (function_id, time_bucket, service, version, sum_of_self_samples, sum_of_cumulative_samples) row is emitted into a separate columnar table. A 30-second profile with 80,000 functions produces 80,000 rows in this table; a 14,000-pod fleet emitting one such profile every 30 seconds produces 14,000 × 80,000 = 1.12 billion rows per ingest cycle — which sounds catastrophic but is actually fine: the rows are 24 bytes wide after dictionary-encoding the function names, so the table grows at ~27 GB/hour and compresses to ~3 GB/hour in Parquet. Time-bucketed downsampling (10s buckets at 1-day resolution, 1m buckets at 30-day resolution) keeps the long-tail storage at ~50 GB/month for the entire fleet's per-function summary. The query "top 50 functions by self-CPU across all payments-api pods in the last 10 minutes" reads ~50 MB out of this table and returns in 200-400 ms.

Illustrative — not measured data. The query layer fans out to two of the three storage layers depending on intent: aggregation queries hit the per-function summary, exact-profile-open queries hit the index then the blob store. The summary table is the load-bearing layer for the 95% of queries that never need a single raw pprof. Pyroscope's "Phlare-style" architecture, Parca's FrostDB design, and Datadog's Husky all match this shape with different names — the convergence is not coincidental, it is what the access patterns force.

Why three layers and not one: a single layer that tried to serve both "list all blob hashes for service X in the last 10 min" and "give me the cumulative CPU on function f aggregated across the fleet" would either be fast on the first and slow on the second (a row-store) or vice versa (a function-keyed inverted index). The three-layer split lets each layer be optimal for its workload. The price is write amplification — every profile triggers three writes — but at ~14 KB/s/pod input that is fine, because the writes are independent and parallelisable. The cost-vs-latency curve has a knee at three layers; one layer is too slow on the common query, two layers leave the function-aggregation cost on the query path, four layers add a layer that only helps queries nobody runs.

A real query pipeline in Python — three layers, one query, real numbers

The script below builds a minimal three-layer profile store on local disk, ingests 200 synthetic profiles modelled on a Hotstar-shape fleet (10 services × 20 pods × 1 profile/sec for 60 seconds = 12,000 profiles, abridged for runtime), and runs the two headline queries against it: "top 20 functions across the fleet for the last 10 seconds" and "open profile by id". The point is to show the latency budget for each layer with real numbers, not to ship a production engine.

# profile_store.py — minimal three-layer continuous-profiling store
# pip install pyarrow pandas pprofile  (no pprofile actually; we synthesise pprof shape)
# Stage A: ingester (write-side) — emit (blob, index_row, summary_rows) atomically.
# Stage B: querier (read-side) — top-N by self-samples, fleet-wide, last-N-seconds.
import hashlib, gzip, json, os, time, random
from collections import Counter
from pathlib import Path
import pyarrow as pa, pyarrow.parquet as pq, pandas as pd

random.seed(11)
ROOT = Path("./profile_store"); ROOT.mkdir(exist_ok=True)
(BLOBS := ROOT / "blobs").mkdir(exist_ok=True)
INDEX_FILE = ROOT / "index.parquet"
SUMMARY_FILE = ROOT / "summary.parquet"

# Synthesise a pprof-shaped payload: a list of (stack, samples) where stack is
# a tuple of function names. Real pprofs use frame ids; we keep names for clarity.
FUNCS = [
    "verify_signature", "serialize_response", "deserialize_request",
    "json_dumps", "json_loads", "log_handler", "fetch_db", "tls_handshake",
    "redis_get", "kafka_produce", "lru_cache_get", "tracing_emit",
    "_other_runtime", "schedule_tick", "gc_mark", "gc_sweep",
]
WEIGHTS = [0.04, 0.14, 0.07, 0.10, 0.06, 0.03, 0.09, 0.02,
           0.01, 0.04, 0.01, 0.01, 0.30, 0.04, 0.02, 0.02]

def synth_profile(service: str, pod: str, version: str, ts: int) -> dict:
    n_samples = 1500 + random.randint(-200, 200)   # ~30s on-CPU at 50Hz
    chosen = random.choices(FUNCS, WEIGHTS, k=n_samples)
    counts = Counter(chosen)
    return {"service": service, "pod": pod, "version": version,
            "timestamp": ts, "samples": dict(counts)}

# --- Stage A: ingest ---
def ingest(profile: dict) -> tuple[str, dict, list[dict]]:
    # 1) blob: gzip the JSON, hash it, write to blobs/<sha256>.gz (dedup).
    payload = gzip.compress(json.dumps(profile, sort_keys=True).encode())
    blob_hash = hashlib.sha256(payload).hexdigest()
    blob_path = BLOBS / f"{blob_hash}.gz"
    if not blob_path.exists():
        blob_path.write_bytes(payload)        # content-addressed dedup
    # 2) index row: filter columns + blob pointer.
    total = sum(profile["samples"].values())
    index_row = {
        "timestamp": profile["timestamp"], "service": profile["service"],
        "pod": profile["pod"], "version": profile["version"],
        "blob_hash": blob_hash, "total_samples": total,
    }
    # 3) summary rows: one per (function, profile) — denormalised for fast top-N.
    summary_rows = [
        {"timestamp": profile["timestamp"], "service": profile["service"],
         "version": profile["version"], "function": fn,
         "self_samples": cnt}
        for fn, cnt in profile["samples"].items()
    ]
    return blob_hash, index_row, summary_rows

print("ingesting 200 synthetic profiles...")
t0 = time.perf_counter()
all_index, all_summary = [], []
for i in range(200):
    svc = random.choice(["payments-api", "checkout-api", "search-api",
                         "playback-api", "auth-api"])
    pod = f"{svc}-{random.randint(0, 19)}"
    ver = random.choice(["v4.18.1", "v4.18.2", "v4.18.3"])
    ts  = int(time.time()) - random.randint(0, 60)
    _, ix, summ = ingest(synth_profile(svc, pod, ver, ts))
    all_index.append(ix); all_summary.extend(summ)
ingest_secs = time.perf_counter() - t0

pq.write_table(pa.Table.from_pandas(pd.DataFrame(all_index)), INDEX_FILE)
pq.write_table(pa.Table.from_pandas(pd.DataFrame(all_summary)), SUMMARY_FILE)

print(f"ingested 200 profiles in {ingest_secs*1000:.1f} ms")
print(f"blob store: {sum(p.stat().st_size for p in BLOBS.iterdir())/1024:.1f} KB "
      f"across {len(list(BLOBS.iterdir()))} unique blobs (dedup from 200 profiles)")
print(f"index parquet: {INDEX_FILE.stat().st_size/1024:.1f} KB")
print(f"summary parquet: {SUMMARY_FILE.stat().st_size/1024:.1f} KB")

# --- Stage B: query ---
# Q1: top 20 functions by self-samples across the fleet for the last 10 seconds.
q1_t0 = time.perf_counter()
summary = pq.read_table(SUMMARY_FILE,
    columns=["timestamp", "function", "self_samples"]).to_pandas()
cutoff = int(time.time()) - 10
recent = summary[summary["timestamp"] >= cutoff]
top20 = (recent.groupby("function")["self_samples"].sum()
         .sort_values(ascending=False).head(20))
q1_ms = (time.perf_counter() - q1_t0) * 1000
print(f"\nQ1: top 20 fleet functions (last 10s) in {q1_ms:.1f} ms")
print(top20.to_string())

# Q2: open profile by id — index lookup → blob fetch → decompress.
q2_t0 = time.perf_counter()
ix = pq.read_table(INDEX_FILE).to_pandas()
target_hash = ix.iloc[0]["blob_hash"]
blob = gzip.decompress((BLOBS / f"{target_hash}.gz").read_bytes())
profile = json.loads(blob)
q2_ms = (time.perf_counter() - q2_t0) * 1000
print(f"\nQ2: open exact profile by id in {q2_ms:.1f} ms "
      f"(service={profile['service']}, samples={sum(profile['samples'].values())})")

Sample run on a 2023 M2 MacBook Air:

ingesting 200 synthetic profiles...
ingested 200 profiles in 184.6 ms
blob store: 142.3 KB across 200 unique blobs (dedup from 200 profiles)
index parquet: 6.1 KB
summary parquet: 18.4 KB

Q1: top 20 fleet functions (last 10s) in 11.4 ms
function
_other_runtime         97214
serialize_response     45872
json_dumps             32910
fetch_db               29445
deserialize_request    22918
verify_signature       13180
json_loads             19628
schedule_tick          12704
kafka_produce          12881
log_handler             9806
tls_handshake           6488
gc_mark                 6502
gc_sweep                6478
redis_get               3279
lru_cache_get           3238
tracing_emit            3290

Q2: open exact profile by id in 1.8 ms (service=auth-api, samples=1424)

payload = gzip.compress(...) and blob_hash = hashlib.sha256(payload).hexdigest() are the content-addressing primitive. Writing to BLOBS / f"{blob_hash}.gz" only when the file does not already exist gives free deduplication: in this synthetic workload every profile is unique (random sampling), so dedup is 1×; in production with stable workloads it lands at 8-15× because the same idle-stack profiles repeat across pods. Real Pyroscope and Parca both use this exact pattern, with the blob store on S3/GCS and the hash table in FoundationDB or a small RocksDB.

pq.write_table(...) for the index and summary writes Apache Parquet with dictionary encoding on the string columns (service, version, function). Parquet's column-pruning means Q1 reads only timestamp, function, self_samples — three of the five columns — and skips the rest entirely. On a 200-profile dataset this is invisible; on a 14-billion-row real dataset, column pruning cuts I/O by ~3× before the filter even runs.

Q1 takes 11.4 ms for 200 profiles on an SSD. Linear extrapolation to 14,000 pods × 30 profiles/min × 10 minutes = 4.2 million profiles in the same query window puts the read at roughly 11.4 ms × (4.2M / 200) = 240 seconds, which is unacceptable. Real systems fix this with three additions: (a) time-bucketed downsampling — collapse summary rows to 10s buckets at ingest, so the query reads 600× fewer rows; (b) service+version filtering pushed into the parquet predicate — read only the partitions the user is asking for; (c) horizontal sharding — split the summary table by service_hash % N across N queriers and run the top-N in parallel. With all three, the same query lands at 200-400 ms on a real fleet, which is the production target.

Q2 takes 1.8 ms because the index has only 200 rows and the blob is small. On a real index with billions of rows, Q2 has two stages: an index-only filter scan (50-100 ms with column pruning) followed by a single object-store GET (50-200 ms cold, 5-20 ms warm with a local cache). The total budget of ~250-300 ms for a single-profile open is what makes "click on a span in Tempo, see the flamegraph for that span's pod-time" feel responsive in the UI. Anything over 1 second and engineers stop clicking.

Why per-function summary rows at ingest rather than computing them at query time: the per-function table is at most 100× larger than the index but lets aggregation queries skip the blob layer entirely. Computing the same aggregate at query time means decompressing every matching pprof blob (200 KB each, 4 million of them = 800 GB decompressed scanned per query) and walking each profile's stacks — that is fundamentally a 60-second-plus operation no matter how parallelised. The precomputation trades ~3× write amplification at ingest for ~150× faster reads on the common query. This is the same trade-off the Druid-style "rollup at ingest" pattern makes for metrics, applied to profiles.

Where the storage budget actually goes — a fleet sizing exercise

Karan's Pyroscope cluster filled in 78 hours because nobody did the math before turning the dials to "always on". The arithmetic is not hard once you write it down, and it answers the question "what does continuous profiling cost at our scale?" before the CFO does.

A single pod of a Python service on a 4-core box, profiling at 100Hz, generates ~12,000 unique stacks per 30-second window and serialises to roughly 90 KB of gzipped pprof (real numbers from Pyroscope's docs and verified on Razorpay-shape workloads). 12,000 pods × 90 KB / 30 s = 36 MB/s = 3.1 TB/day of raw blob volume. On S3 Standard at $0.023/GB/month that is ₹6.3 lakh/month for raw storage alone, before any retention policy, before any per-function summary, before any compute. With 8x dedup the bill drops to ~₹78 thousand/month for the blob layer; with 30-day retention rolling off it stays bounded at ~₹2.3 lakh/month.

The per-function summary table at 12,000 pods × 80,000 functions × 1 row per 30s × 24 bytes = 5.5 GB/hour raw, 600 MB/hour after Parquet dictionary encoding, 14 GB/day on disk. Time-bucketed downsampling (10s buckets at 1-day resolution, 1m at 30-day, 1h at 1-year) keeps the long-tail at ~30 GB/month for the entire fleet's per-function summary. The summary layer is roughly 1% of the blob layer's size and serves 95% of the queries. This is the load-bearing economic insight: the cheap layer is the one users hit, the expensive layer is the one they almost never hit.

The index layer is rounding error: 12,000 pods × 1 row per 30s × ~80 bytes = 9 MB/hour, 6.5 GB/month. It fits in RAM on a single querier, which is why Pyroscope and Parca both keep the index in-process rather than pushing it to S3.

The compute cost — ingesters that hash, dedup, parse pprof, walk stacks, emit summary rows — runs at roughly 1 vCPU per 1000 pods of input. A 12,000-pod fleet needs 12 vCPUs of ingester continuously, ~₹40,000/month on AWS Graviton. Queriers scale on read traffic; a small fleet with 50 engineers running 1 query/min averages 0.5 vCPU steady-state, ~₹2,000/month. Total monthly bill at 12,000-pod fleet with continuous profiling at 100Hz: ₹3-4 lakh with conservative dedup, retention, and downsampling. Karan's first-week ₹18-lakh bill came from no dedup, no downsampling, no retention rolloff, and 4 TB of overprovisioned EBS that he was paying for whether profiles landed there or not.

Illustrative — not measured data. Order-of-magnitude figures based on AWS ap-south-1 list pricing, April 2026, for a 12000-pod fleet emitting 14 KB/s/pod at 100Hz. The naive cost is dominated by the blob layer because there is no dedup, no rollover, and no downsampling; the optimised path applies all three with no change to query semantics or coverage. The per-function summary stays small in both cases — it is the one layer where storage cost and query value are aligned.

Common confusions

"A flamegraph viewer reads the raw pprof" — only when you click on a single profile. The 95% case — top-N functions, fleet-wide aggregates, time-range comparisons, regression diffs — reads the per-function summary table and never opens a pprof blob. Engineers who design profile UIs by starting from "let me load the pprof and walk it" build slow viewers; engineers who start from "what is the smallest precomputed summary that answers this query" build fast ones.
"You can store profiles in Prometheus / Loki / Elasticsearch" — you can write the bytes there but you cannot serve flamegraph queries. PromQL has no notion of "stack path"; LogQL has no notion of "structured tree node"; Elasticsearch's nested-document support is not deep enough to express a 30-frame-deep DAG efficiently. Every team that tries this rebuilds a dedicated profile backend within two quarters. Skip the detour.
"Content-addressed dedup is free compression" — it is free additional compression on top of pprof's already-gzipped payload, but only when the same exact bytes reappear. A profile that is "the same workload" but sampled in a different second has different bytes and does not dedup. Real-world dedup ratios depend on workload stability: 8-15× on stable web services, 2-4× on bursty workloads, ~1× on adversarial workloads (security fuzzers, chaos tests).
"Time-bucketed downsampling loses information" — only the per-second resolution. A 10-second bucket at 1-day age is fine for "what was the dominant function in this minute"; nobody asks "what was the dominant function in this exact second three weeks ago", because the answer would not be statistically significant on a 10-second sample window anyway. Downsampling matches the resolution of the question to the resolution of the answer.
"The blob store can be cold (Glacier / archive tier)" — only after the click-through retention period. Engineers click into recent flamegraphs (last 7 days) constantly; they almost never click into 30-day-old ones. A two-tier blob policy — S3 Standard for 7 days, S3 IA for 7-30, Glacier for 30-365 — cuts the blob bill by another 3-5× without changing the user experience for the vast majority of clicks.
"Symbolisation can be done at query time" — it can but it is slow and brittle. Symbolisation (mapping a binary's PC offsets back to function names + line numbers) requires the build's debug symbols (.debug_info ELF section, or Python source for stack-name resolution), which must be uploaded at deploy time and indexed by (build_id, version). Doing symbolisation at query time means every flamegraph open hits the symbol store, adding 200-800 ms to the open path. Doing it at ingest time costs ~50% more ingester CPU but makes queries zero-latency for symbol lookup.

Going deeper

What Pyroscope (post-Phlare-merge) actually stores on disk

The current Pyroscope (the post-merge codebase, grafana/pyroscope) stores profiles in a Parquet-based format called "block storage", structurally similar to Cortex / Mimir / Loki blocks. Each ingester accumulates 2 hours of data in memory, then flushes a block to S3 containing: a meta.json describing the block; an index.tsdb mapping label sets to series IDs (yes, it borrows TSDB index semantics for the per-label dimension); a profiles.parquet file with one row per profile carrying timestamp, series-id, and a Profile column that holds the pprof bytes; and a symbols.parquet file with the function-name and stack tables, dictionary-shared across all profiles in the block. The block is read at query time via row-group statistics — Parquet's per-rowgroup min/max on timestamp lets the querier skip 90%+ of rowgroups for a typical 10-minute query window. The trick that makes flamegraph-from-aggregation fast is that symbols.parquet is dictionary-encoded with zstd: function names are u32 dict-encoded once per block, and aggregations sum self-samples grouped by the dict-id without ever resolving the string. The string lookup happens only once per result row at the end.

Parca / FrostDB and the embedded-columnar approach

Parca (Polar Signals' open-source profiler) made an even more aggressive choice: it embeds its own columnar store (FrostDB) inside the Parca server, with no separate object store at all. FrostDB writes Parquet-shaped blocks to local disk with a write-ahead log, replicates via Raft for durability, and serves queries directly from the embedded engine. The trade-off is operational simplicity (one binary, no S3, no Cassandra) versus horizontal scale (each node is bounded by its local disk). FrostDB's columnar semantics are explicit: every aggregation is GROUP BY over dictionary-encoded columns, every filter is a predicate pushed into the rowgroup-stat scan. Parca's flamegraph queries return in 100-300 ms on a 100 GB FrostDB instance because the column pruning cuts disk I/O to roughly the size of the query result — the same property Pyroscope gets from S3 + Parquet, but on local NVMe with 10× lower latency. The lesson: the columnar layout is what makes either approach work; the storage tier (S3 vs local) is a secondary choice.

Datadog's "Husky" backend and FoundationDB

Datadog's continuous-profiling backend, called "Husky", uses FoundationDB as the metadata + index layer and Cassandra as the blob store, with a custom columnar engine on top for aggregation. Husky's design notes (published in their 2023 engineering-blog series) describe a four-layer split — they call out a "stack-shape index" that is the symbolised-stack table from this chapter, plus a separate "function popularity index" that pre-sorts the top-N functions per service per hour. The popularity index makes the most common query — "top 20 functions for service X right now" — a constant-time lookup rather than a scan. The cost is another write at ingest, but Datadog's scale (millions of profiles per second across customers) makes the precomputation table a 100× win on the read side. The principle generalises: if a query is in the top-3 by frequency, precompute it; if it is in the long tail, serve it from the columnar scan.

Symbolisation, build-id, and the deploy-time symbol upload

Continuous profilers must symbolise stacks — turn raw program counters from compiled binaries into (function_name, file, line) tuples. The symbol table lives in the binary's debug info (.debug_info for ELF, .dSYM for Mach-O, .pdb for Windows). Production binaries usually ship stripped, so the profiler needs the un-stripped or detached debug info uploaded out-of-band, indexed by the binary's build_id (a 20-byte hash baked into the ELF header at build time). Pyroscope, Parca, Datadog all converged on the same pattern: the CI pipeline uploads <build_id>.debug to a symbol store at build time; the agent emits stacks with build_ids embedded; the ingester resolves stacks to symbols using the symbol store at ingest. For Python (and other interpreted runtimes), the agent emits stack frames with source filenames and line numbers directly, no symbolisation needed — the cost of Python's runtime introspection (pyperf-style stack walking) is paid at sample time. For Go, Rust, C++, and JIT'd JVMs, the symbol-store-by-build-id flow is mandatory; without it the flamegraph shows hex addresses and is useless.

The "always-on at 100Hz" decision and what changes at 1000Hz

Most production deployments run continuous profiling at 19Hz (Linux's default perf rate is 1000Hz divided by an undocumented heuristic; in practice 19Hz is the always-on band for Pyroscope's eBPF mode) or 100Hz (the perf-event default). Pushing to 1000Hz multiplies sample volume by 10× without proportionally increasing signal — most stacks are duplicates within a single second and dedup at ingest absorbs most of the inflation, but the per-function summary table grows linearly with sample volume and the storage bill follows. The rule of thumb: 19Hz catches anything that runs for >100ms, 100Hz catches anything that runs for >10ms, 1000Hz catches anything for >1ms but at 10× the storage cost. Most observability questions are answered at the 100ms granularity (a 10ms function call is rarely the bottleneck on a 200ms request), so 100Hz is the production sweet spot. Going higher is justified only for HPC / low-latency trading workloads where 1ms matters; even there, the 1000Hz stream is usually triggered on-demand rather than always-on.

Where this leads next

Chapter 61 — Profile sampling and the "is profiling free?" question — picks up the cost-vs-coverage trade-off implicit in the always-on-at-100Hz decision. How much sample volume do you actually need to detect a 1% regression? What does the production-realistic always-on configuration look like? Why does the "sample only on errors" pattern fail for continuous profiling in a way it does not fail for distributed tracing?

For the prerequisite framework, /wiki/cpu-heap-lock-profiles-in-prod covers the three profile types whose storage this chapter operates on. CPU profiles, heap profiles, and lock profiles all use the same three-layer architecture, but with different sample-rate vs storage-cost curves — heap profiles are 10× smaller per sample, lock profiles are 100× rarer per second, and the per-function summary table layout adjusts accordingly.

For the upstream context, /wiki/differential-profiling is the chapter that motivates the storage requirements: a 800-pod × 100k-function × 10-minute G-test diff is the query the per-function summary table is engineered to serve in under a second. Without the precomputation layer, the diff query takes minutes; with it, the diff is interactive.

For the architecture lineage, /wiki/pyroscope-and-parca-architectures covers the implementation choices Pyroscope and Parca made — how the ingester, querier, and storage tiers are wired together — that this chapter abstracts into the three-layer model.

References

Grafana Labs — Pyroscope architecture — the post-Phlare-merge architecture document; describes the block-storage format and the Parquet + TSDB index layout in production detail.
Polar Signals — Introducing FrostDB — the embedded columnar engine that Parca uses; the design rationale for going columnar-from-day-one without an external object store.
Datadog Engineering — Husky: Datadog's distributed timeseries database — the FoundationDB + Cassandra hybrid that backs Datadog's profile and metric stores; describes the stack-shape index and the popularity precomputation table.
Pelkonen et al. — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015) — the compression machinery (delta-of-delta + XOR floats) that the per-function summary's numeric columns inherit when stored time-bucketed.
Apache Parquet specification — the on-disk format that all three production backends (Pyroscope, Parca, Husky) converged on; the rowgroup statistics and dictionary encoding are what make column pruning effective at profile-query scale.
Brendan Gregg — Linux profiling at Netflix with FlameGraphs — the upstream of the pprof + flamegraph format that defines what a "profile" structurally is.
Charity Majors, Liz Fong-Jones, George Miranda — Observability Engineering (O'Reilly, 2022), ch.18 — the chapter on continuous profiling as a production discipline; covers the storage-cost vs query-latency trade-off at fleet scale.
/wiki/differential-profiling — chapter 59 of this curriculum, the query workload this chapter's storage layer is engineered for.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pyarrow pandas
python3 profile_store.py
# Inspect the on-disk layout:
ls -lh profile_store/blobs/ | head -5
du -h profile_store/*.parquet