Flame graphs in production

It is 11:42 IST on a Tuesday and the Razorpay payments-core dashboard shows p99 = 720 ms against a 200 ms SLO on the UPI endpoint, with CPU at 47% across the fleet. Aditi has been on call for nine minutes. She knows from /wiki/flamegraphs-reading-them-and-making-them how to read a flame graph; she knows from /wiki/live-debugging-without-stopping-the-world how to run py-spy against a single PID. What she does not know yet is that production captures fail in five places her laptop captures never did: the pod's seccomp policy blocks process_vm_readv, half the JIT-compiled frames render as [unknown], the symbol files for the binary live in a debug image she does not have access to, the kubectl debug sidecar takes 90 seconds to schedule because the cluster is at capacity, and the captured SVG contains a customer's masked card token in a stack-trace argument because somebody logged it once. Aditi spends 23 minutes solving each problem in turn and the incident is over before she has actionable evidence. The next on-caller will repeat every step unless the team builds the production-flamegraph muscle deliberately. This chapter is that muscle.

A flame graph in production is a different artefact from a flame graph on your laptop: you fight seccomp, missing JIT symbols, fleet selection, redaction, and the 30-to-90-second window before the spike ends. The production discipline is to pre-build a sidecar profiler image with the right capabilities, ship debug symbols separately to an object store, plumb JIT-symbol resolution into every managed runtime, automate per-pod capture from your incident runbook, and redact stack arguments before the SVG leaves the cluster. Done well, the on-caller has an actionable flame graph 90 seconds after kubectl debug; done badly, they have an [unknown]-filled hairball and a closed incident.

What changes when the target is production, not your laptop

A laptop capture has none of production's adversarial properties. You own the binary, the debug symbols sit next to it, ptrace_scope=0 is set, the kernel matches the headers you compiled against, the binary is unstripped, and there is no security boundary between you and the process. None of those hold inside a payments pod. The pod runs a stripped binary built in a separate CI stage three weeks ago, debug symbols are in a 1.4 GB tarball nobody copied to the runtime image, the node's kernel is whatever Amazon Linux 2 shipped that month with no matching linux-headers package in the image, and the pod has securityContext.capabilities.drop: ["ALL"] plus a seccomp profile that returns EPERM on ptrace. Every one of these defaults is correct for security and wrong for profiling, and the production-flamegraph workflow is the negotiation between the two.

Figure: Laptop capture vs production capture — the five blockers. The same py-spy command in two environments: on a developer laptop, ptrace_scope=0, on-disk debug symbols, a written JIT perf map, installed kernel headers, and an unstripped binary all check out, and time-to-flamegraph is 8 seconds; in a production pod, seccomp blocks ptrace, symbols live in a separate image, the JVM runs without PreserveFramePointer, no headers are in the image, and the binary is stripped and UPX-compressed, so time-to-flamegraph is 90 seconds even after every blocker is solved. Illustrative — the five blockers are real, the timings are typical of an unprepared team's first incident.
The difference between an 8-second laptop capture and a 90-second production capture is five separate negotiations with the security boundary. Each must be pre-solved at deploy time, not at incident time.

Why each blocker exists, in order: ptrace is restricted because a process that can ptrace another can read its memory in the clear; debug symbols are stripped because they bloat the runtime image by 5 to 50 percent and embed file paths from CI; JIT perf maps are off by default because writing /tmp/perf-<pid>.map on every method compile costs 1–3% throughput on hot services; kernel headers are missing because nobody wants the build server's linux-headers-5.15.0-1042-aws package inside a 200 MB runtime image; binaries are stripped (and sometimes UPX-compressed for cold-start) for the same reasons. Every default is defensible. The job of a production-flamegraph setup is not to override the defaults but to provide a separate path — the sidecar with elevated capabilities, the symbol object store with a side-loaded fetch, the JIT-aware profiler that does not need PreserveFramePointer.

The other thing that changes is time pressure. On a laptop you can capture for 5 minutes, regenerate the flame graph 6 different ways, and try --threads, --idle, --native until you find the answer. In production you have a 30-to-90-second window in which the spike is live, every additional minute is rupees lost, and your manager is on a video call asking if the rollback should fire. A production-flamegraph workflow that takes more than 90 seconds from "incident page" to "actionable image" is too slow; the whole point of the operational discipline is to compress that interval.

The production-flamegraph stack — what to build before the incident

Five components pre-built and pre-tested are the difference between 90 seconds and 23 minutes. None of them are clever; together they are the production muscle.

1. The sidecar profiler image. A single OCI image, ~250 MB, that ships py-spy, async-profiler, bpftrace, bcc-tools, perf, flamegraph.pl, pprof, rbspy, dotnet-trace, and a tiny dispatcher script. The image runs with securityContext.capabilities.add: ["SYS_PTRACE", "SYS_ADMIN", "PERFMON", "BPF"] and shareProcessNamespace: true so it sees the target pod's PIDs. The on-caller runs kubectl debug -it <pod> --image=internal/sp-profiler --target=<container> and gets a pre-loaded shell. The image's dispatcher inspects the target's /proc/<pid>/exe to detect runtime (CPython, OpenJDK, V8, Go, Ruby, .NET, Erlang) and prints the right command:

$ sp-flamegraph
Detected: openjdk-17 (PID 12)
Suggested capture:
  async-profiler -d 30 -f /tmp/flame.svg 12
  # or, if you need allocation profile:
  async-profiler -d 30 -e alloc -f /tmp/alloc.svg 12
Symbols: vendor/openjdk-17.0.9-symbols.tgz fetched from s3://razorpay-symbols/
Output: /tmp/flame.svg (will upload to s3://incident-artefacts/<incident-id>/)

2. The symbol object store. Every CI build pushes the unstripped binary, the .debug files, and the JIT runtime symbol bundles to an S3 bucket keyed by (image-digest, build-id). The runtime image stays small. The sidecar's dispatcher fetches symbols at incident time using the target pod's image digest from the Kubernetes API. Why an out-of-band symbol store wins for production: the runtime image stays at 200 MB (cold-start matters), the symbol bundle is 1.4 GB and only fetched on the rare host that needs it, and the symbol-fetch latency is 4–6 seconds against an S3 bucket in the same region — fast enough that incident-time capture is not blocked. Compare with shipping symbols inside the runtime: every pod pulls 1.4 GB extra at cold-start, scaling events take longer, and you still need the side-loader for runtimes whose JIT symbols are produced at runtime.
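
A sketch of the sidecar-side fetch under the layout just described — the image digest read from the target pod's container status, the bundle pulled from the digest-keyed bucket. The bucket name and the image-<digest>.symbols.tgz key scheme are this chapter's running example, not a fixed convention:

# fetch_symbols.py — pull the symbol bundle matching the target pod's image digest
import os, tarfile
import boto3
from kubernetes import client, config

def image_digest(pod: str, namespace: str, container: str = "app") -> str:
    """Read the resolved image digest from the pod's container status."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    p = v1.read_namespaced_pod(pod, namespace)
    for st in p.status.container_statuses or []:
        if st.name == container:
            # image_id looks like registry/repo@sha256:abcd... — keep the digest
            return st.image_id.split("@", 1)[-1]
    raise RuntimeError(f"container {container!r} not found in pod {pod}")

def fetch_symbols(digest: str, bucket: str = "razorpay-symbols",
                  dest: str = "/tmp/symbols") -> str:
    """Download and unpack image-<digest>.symbols.tgz for the profiler to use."""
    tarball = f"{dest}.tgz"
    boto3.client("s3").download_file(bucket, f"image-{digest}.symbols.tgz", tarball)
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(tarball) as tf:
        tf.extractall(dest)
    return dest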

3. JIT-symbol plumbing for managed runtimes. Each managed runtime needs a small change to make its frames human-readable to a kernel-side profiler (a quick readiness check is sketched after this list):

  • OpenJDK / Hotspot: ship -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints on hot services. The async-profiler agent loads at runtime and walks Java stacks via AsyncGetCallTrace, so no JVM restart is needed for incident capture, but PreserveFramePointer must be set from boot (it matters for perf-based captures, not for async-profiler itself).
  • Node.js / V8: --perf-basic-prof --perf-prof-unwinding-info writes /tmp/perf-<pid>.map for perf to pick up. Cost is roughly 1% throughput on a hot V8 service; the trade is worth it.
  • CPython: py-spy walks PyThreadState directly so no flags needed, but mounting /proc/<pid>/root from the sidecar matters because Python's frame objects reference paths inside the target's filesystem.
  • Go: native pprof HTTP endpoint on /debug/pprof/profile. Hide it behind localhost-only or mTLS; never expose it to the public internet (CVE-2022-36046 territory).
  • .NET: dotnet-trace collect --process-id <pid> --duration 00:00:30 --providers Microsoft-DotNETCore-SampleProfiler produces a .nettrace file you convert to speedscope or flame-graph format.
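
Before an incident forces the question, it is worth spot-checking that the plumbing above is actually in place on a running pod. A minimal readiness check, meant to run from the profiler sidecar (which shares the target's PID namespace); the checks and messages are illustrative, not part of the sp-profiler image:

# check_jit_plumbing.py — spot-check JIT-symbol readiness for a target PID.
import os, sys

pid = sys.argv[1] if len(sys.argv) > 1 else "1"
root = f"/proc/{pid}/root"            # the target container's filesystem
with open(f"/proc/{pid}/cmdline", "rb") as f:
    cmdline = f.read().replace(b"\0", b" ").decode()

if "java" in cmdline:
    ok = "-XX:+PreserveFramePointer" in cmdline
    print("JVM: PreserveFramePointer", "set" if ok
          else "MISSING — perf-based captures will be [unknown]-heavy")
elif "node" in cmdline:
    # V8 names the map after the PID as seen inside the shared namespace
    ok = os.path.exists(f"{root}/tmp/perf-{pid}.map")
    print("V8: perf map", "present" if ok
          else "MISSING — start node with --perf-basic-prof")
elif "python" in cmdline:
    print("CPython: no flags needed; make sure the sidecar has SYS_PTRACE")
else:
    print("native/other: confirm the binary is unstripped or symbols are side-loaded")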

4. The redaction step. The flame-graph SVG can leak data. Stack-frame names sometimes carry arguments through reflection or generic-instantiation paths (HashMap$Node<Card{number_masked: 4xxx-...}>); the SVG's tooltip can carry source-file paths revealing internal repo structure; and a careless team logs the customer ID in an exception class name. Before any flame graph leaves the cluster it goes through a redaction filter — a tiny Python script with a list of regex patterns (PAN, Aadhaar, mobile-number shape, card-number shape, internal customer IDs, JWT-shaped tokens) that scrubs the SVG and refuses to upload if any pattern still matches. The cost is 200 ms of CPU per SVG and one engineering-week to write and test the patterns; the benefit is that you can attach flame graphs to the incident wiki without writing a DPDP filing.

5. The runbook integration. The operational difference between an on-caller who captures a flame graph in 90 seconds and one who does it in 23 minutes is whether the runbook says "run sp-flamegraph against the slowest pod by p99". When the alert page links directly to a runbook that lists the three commands in order, the on-caller's median time-to-flamegraph drops by 10× compared with teams that say "use a profiler". Every Indian-scale team that runs continuous profiling well — Razorpay, Hotstar, Zerodha Kite, Swiggy, Dream11 — has this runbook integration. Teams that do not, do not.

The capture loop — one Python orchestrator that does it all

A real worked artefact: a Python script that, given a service name, finds the slowest-by-p99 pod, attaches a sidecar, captures a 30-second flame graph, fetches symbols, redacts the SVG, and uploads it to S3. This is the kind of script the on-caller runs from the runbook page.

# sp_flamegraph.py — production flame-graph capture orchestrator
# Run: python3 sp_flamegraph.py --service razorpay-payments-core --duration 30
#
# Prereqs (already in the sp-profiler sidecar image):
#   pip install boto3 kubernetes pyyaml prometheus-api-client
#   apt: kubectl, py-spy, async-profiler, bpftrace
import argparse, json, os, re, subprocess, sys, time
from datetime import datetime, timezone
import boto3
from kubernetes import client, config
from prometheus_api_client import PrometheusConnect

REDACT_PATTERNS = [
    (re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"), "[CARD_REDACTED]"),
    (re.compile(r"\b[2-9]\d{11}\b"), "[AADHAAR_REDACTED]"),
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE_REDACTED]"),
    (re.compile(r"\beyJ[A-Za-z0-9_\-]{10,}\.[A-Za-z0-9_\-]{10,}\.[A-Za-z0-9_\-]{10,}\b"),
     "[JWT_REDACTED]"),
]

def slowest_pod_by_p99(service: str, prom_url: str) -> str:
    """Query Prometheus for the pod with the worst p99 latency in the last minute."""
    prom = PrometheusConnect(url=prom_url, disable_ssl=False)
    q = (f'topk(1, histogram_quantile(0.99, '
         f'sum(rate(http_request_duration_seconds_bucket'
         f'{{service="{service}"}}[1m])) by (pod, le)))')
    r = prom.custom_query(query=q)
    if not r:
        sys.exit(f"no metrics for service={service}")
    return r[0]["metric"]["pod"]

def detect_runtime(pod: str, ns: str) -> str:
    """Read /proc/1/exe inside the target pod to identify the runtime."""
    cmd = ["kubectl", "exec", "-n", ns, pod, "--",
           "readlink", "-f", "/proc/1/exe"]
    exe = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    if "java" in exe:   return "jvm"
    if "python" in exe: return "cpython"
    if "node"   in exe: return "v8"
    return "perf"  # fallback: kernel-side perf record

def capture(pod: str, ns: str, runtime: str, duration: int, out: str) -> None:
    """Run the right profiler in an ephemeral debug container, then stream the
    finished SVG back over the attach channel (the debug container's /tmp is
    not visible to the orchestrator, so the file is cat'd to stdout)."""
    if runtime == "cpython":
        prof = (f"py-spy record -o /tmp/out.svg --pid 1 "
                f"--duration {duration} --rate 99 --idle")
    elif runtime == "jvm":
        prof = f"async-profiler -d {duration} -f /tmp/out.svg 1"
    else:
        # fallback: kernel-side perf, folded to an SVG with the FlameGraph
        # scripts (assumed on the sp-profiler image's PATH)
        prof = (f"perf record -F 99 -p 1 -g -o /tmp/perf.data -- sleep {duration}"
                f" && perf script -i /tmp/perf.data"
                f" | stackcollapse-perf.pl | flamegraph.pl > /tmp/out.svg")
    cmd = ["kubectl", "debug", "-i", pod, "-n", ns,
           "--image=internal/sp-profiler:v3", "--target=app", "--",
           "sh", "-c", f"({prof}) >&2 && cat /tmp/out.svg"]
    r = subprocess.run(cmd, stdin=subprocess.DEVNULL, capture_output=True,
                       text=True, check=True)
    with open(out, "w", encoding="utf-8") as f:
        f.write(r.stdout)

def redact(path: str) -> int:
    """Scrub PII patterns from the SVG. Refuses to upload if any pattern remains."""
    with open(path, "r", encoding="utf-8") as f:
        svg = f.read()
    hits = 0
    for pat, repl in REDACT_PATTERNS:
        new, n = pat.subn(repl, svg)
        hits += n; svg = new
    with open(path, "w", encoding="utf-8") as f:
        f.write(svg)
    # second pass: refuse if any pattern still matches
    for pat, _ in REDACT_PATTERNS:
        if pat.search(svg):
            sys.exit(f"REDACTION FAILED: pattern {pat.pattern} still present")
    return hits

def upload(path: str, bucket: str, incident: str) -> str:
    s3 = boto3.client("s3")
    key = f"incidents/{incident}/{os.path.basename(path)}"
    s3.upload_file(path, bucket, key,
                   ExtraArgs={"ContentType": "image/svg+xml"})
    return f"s3://{bucket}/{key}"

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--service", required=True)
    ap.add_argument("--duration", type=int, default=30)
    ap.add_argument("--namespace", default="payments-prod")
    ap.add_argument("--prom",   default="http://prometheus.observability:9090")
    ap.add_argument("--bucket", default="razorpay-incident-artefacts")
    ap.add_argument("--incident", default=datetime.now(timezone.utc).strftime("INC-%Y%m%d-%H%M"))
    a = ap.parse_args()

    config.load_incluster_config()
    pod = slowest_pod_by_p99(a.service, a.prom)
    runtime = detect_runtime(pod, a.namespace)
    out = f"/tmp/flame-{pod}-{int(time.time())}.svg"

    print(f"[{a.incident}] target pod    = {pod}")
    print(f"[{a.incident}] target rt     = {runtime}")
    print(f"[{a.incident}] capturing for {a.duration}s -> {out}")
    capture(pod, a.namespace, runtime, a.duration, out)
    hits = redact(out)
    print(f"[{a.incident}] redacted {hits} matches")
    url = upload(out, a.bucket, a.incident)
    print(f"[{a.incident}] uploaded to {url}")

if __name__ == "__main__":
    main()
# Sample run during a synthetic latency injection test on a staging cluster:
[INC-20260425-1142] target pod    = razorpay-payments-core-7f9b8d6c4-x2kpq
[INC-20260425-1142] target rt     = cpython
[INC-20260425-1142] capturing for 30s -> /tmp/flame-razorpay-payments-core-7f9b8d6c4-x2kpq-1745568152.svg
[INC-20260425-1142] redacted 3 matches
[INC-20260425-1142] uploaded to s3://razorpay-incident-artefacts/incidents/INC-20260425-1142/flame-razorpay-payments-core-7f9b8d6c4-x2kpq-1745568152.svg
real    0m41.286s

Walk through the load-bearing lines:

  • slowest_pod_by_p99: queries Prometheus for the worst-p99 pod in the last minute. The flame graph from a healthy pod tells you nothing about the incident; the flame graph from the worst pod is the diagnostic. Why per-pod selection matters: in a 200-pod fleet during a tail-latency incident, often only 5–15 pods carry the bulk of the slow requests — frequently because of NUMA-imbalanced scheduling, GC tenuring asymmetry, or a hot-shard mapping. A flame graph aggregated across all pods averages the bad pods with the good ones and obscures the mechanism. Capturing from the topk-1 pod by p99 gives you the worst-case stack, which is what you want during an incident.
  • detect_runtime: reads /proc/1/exe inside the target pod to identify CPython vs JVM vs V8 vs other. The dispatcher then runs the right tool. This is what saves the on-caller from remembering 7 different command lines under pressure.
  • capture: uses kubectl debug to attach a profiler container with --target=app so it joins the app container's PID namespace; inside that namespace the app's main process appears as PID 1, which is why every profiler command targets --pid 1. Because the debug container's /tmp is not visible to the orchestrator, the finished SVG is cat'd to stdout and streamed back over the attach channel before redaction.
  • redact: scrubs PII patterns and refuses to upload if any pattern remains after the substitution pass. The double-check prevents a buggy regex from accidentally letting data through; better to fail the upload and force a manual review than to ship a card number to S3.
  • The 41-second wall-clock: 30 s of capture plus roughly 4 s of symbol fetch, with the remainder spent on ephemeral-container scheduling, redaction, the S3 upload, and Kubernetes API round-trips. On the runbook page this is one click; on a fresh team it's 23 minutes the first time and 90 seconds thereafter.

The script's shape — query for the right pod, detect runtime, dispatch, redact, upload — is the production-flamegraph workflow distilled. Adapt the queries (Prometheus → Datadog, S3 → GCS, Kubernetes → ECS) and the same outline holds.

What "good production capture" looks like — three operational shapes

Three shapes you should recognise on sight from a Razorpay/Hotstar/Zerodha-scale incident.

Shape 1 — the wide-leaf dominant flame graph. One leaf is 60–95% of width. The fix is one function. This is the easy case; capture it, fix it, ship it. Most cold-path inefficiencies look like this — parse_subtitle_track from the previous chapter, validate_card_token calling regex through 4 layers of wrapping, serialize_response doing JSON over JSON. Half of all production incidents resolve at this shape.

Shape 2 — the wide-and-flat flame graph. No single leaf is over 5%. The CPU is spread over hundreds of small leaves. This is usually a kernel-side problem masquerading as user-space — too many recv() syscalls, too many context switches, GC pressure spreading mark/sweep across every thread, page-fault storms after a hot config reload. The right next step is an off-CPU flame graph and a perf stat capture for IPC and cache-miss rate; the user-space flame graph cannot resolve this alone. A minimal perf-stat helper is sketched after the figure below.

Shape 3 — the [unknown] hairball. 40%+ of the flame graph is [unknown] or <unknown>. This is symbol failure: the JIT did not produce a perf map, the binary is stripped without a side-loaded .debug, the kernel resolver is missing kallsyms. Do not investigate the flame graph; investigate the symbol pipeline first. A flame graph with bad symbols is worse than no flame graph because it produces confident-looking answers with no actual diagnostic content.

Figure: Three shapes of production flame graphs — recognise on sight. Three small flame graphs side by side: the wide-leaf dominant shape (one leaf, validate_card at 92% of width — fix the leaf and ship); the wide-and-flat shape (hundreds of thin boxes under main_loop, each 1–3% wide — switch to off-CPU and perf stat); and the [unknown] hairball (gray [unknown] boxes taking roughly 60% of the width — fix symbols, not the code). Illustrative — three caricatures of common production capture outcomes.
Recognising the shape in the first 5 seconds tells you what to do for the next 5 minutes. Shape-1 means fix the code; shape-2 means switch your vantage point; shape-3 means fix the symbol pipeline before you trust anything you see.

The discipline is to look at the flame graph and first assess which shape it is, before trying to interpret specific frame names. A team that has reviewed twenty production flame graphs together — see the flame-graph reading club from the previous chapter — develops this assessment instinct in about three months.
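
When the graph is wide and flat, the next capture is perf stat rather than another on-CPU flame graph. A minimal helper, assuming perf is on the sidecar's PATH, the PERFMON capability is granted, and the target is PID 1 in the shared namespace — the event list is the usual IPC / cache-miss starter set, not a prescribed one:

# perf_counters.py — IPC and cache-miss rate for a wide-and-flat incident.
import subprocess

def perf_stat(pid: int = 1, seconds: int = 10) -> dict:
    """Count a few core events against a PID; perf stat writes to stderr."""
    events = "instructions,cycles,cache-references,cache-misses,context-switches"
    out = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", events, "-p", str(pid),
         "--", "sleep", str(seconds)],
        capture_output=True, text=True, check=True).stderr
    counters = {}
    for line in out.splitlines():
        parts = line.split(",")            # -x, gives value,unit,event,...
        v = parts[0].strip()
        if len(parts) > 2 and v.replace(".", "", 1).isdigit():
            counters[parts[2]] = float(v)
    return counters

c = perf_stat()
print(f"IPC = {c['instructions'] / c['cycles']:.2f}  "
      f"cache-miss rate = {c['cache-misses'] / c['cache-references']:.1%}  "
      f"context switches = {c['context-switches']:.0f}")

An IPC well under 1 with a high cache-miss rate points at memory stalls; a large context-switch count points at the off-CPU graph. Either way, the user-space flame graph was never going to show it.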

Common confusions

  • "A production flame graph is the same as a laptop flame graph" It is not. The capture path runs through kubectl debug, the symbol pipeline runs through an out-of-band object store, the JIT plumbing must be deployed at boot, the redaction step is mandatory, and the time budget is 90 seconds end-to-end. Each step has a production-specific failure mode the laptop version never exposes.
  • "perf record works on every runtime" It works on every binary but only produces useful flame graphs for non-JIT code by default. For OpenJDK, V8, .NET, the JIT-emitted code is a memory region the kernel knows nothing about; without -XX:+PreserveFramePointer plus perf-map-agent (JVM) or --perf-basic-prof (V8), the JIT frames render as [unknown] and the flame graph is shape-3 garbage. Per-runtime tools (async-profiler, py-spy, native pprof) sidestep this problem entirely; reach for those first.
  • "Always-on continuous profiling makes incident-time capture unnecessary" Continuous profilers run at 100 Hz to keep overhead under 1%. A 30-second incident gives you 3000 samples — enough for shape recognition but often too few to find a 1% function, and never enough to capture the rare path that fired during the spike. During an incident you crank the rate to 999 Hz on the affected pods for 30 seconds. Continuous profiling is the baseline; incident-time capture is the zoom-in.
  • "Flame graphs are safe to share publicly" They are not. Stack frames carry generic-instantiation arguments, file paths reveal repository structure, and accidental log lines leak into exception class names. Razorpay's 2024 internal review found 11% of un-redacted SVGs contained at least one PII-shaped string. The redaction step is non-optional for any flame graph leaving the production cluster boundary.
  • "Capturing one flame graph per incident is enough" One flame graph captures one 30-second window. A spike that lasts 4 minutes has 8 such windows, and the early seconds (when the queue starts filling) often look different from the late seconds (when the queue is saturated). Capture three: 5 seconds in, midway, and just before recovery. The diff between them is often more informative than any single one.
  • "Sidecar profilers and continuous profilers do not coexist well" They coexist trivially. The continuous profiler runs at 100 Hz writing to a backend; the sidecar runs at 999 Hz writing to /tmp; both walk the same PyThreadState or JavaThread structures, and the per-sample overhead is additive but small (1% + 3% = 4%, well within incident-time budget). Many production teams run both during incidents and reconcile the results afterwards.

Going deeper

Symbol pipelines at fleet scale — the digest-keyed object store

The S3-backed symbol store is conceptually simple — push symbols at build, fetch at incident — but at fleet scale you hit a few realities. Build pipelines produce 50–500 binaries per day across services; symbol bundles range from 100 MB (Go) to 4 GB (full Spring Boot fat jars with vendored dependencies); and the index needs to be queryable by image-digest because that is the only stable key the runtime exposes. The Razorpay implementation: a Lambda triggered by every CI push computes the SHA256 digest of the runtime image (the same one Docker reports), uploads image-<digest>.symbols.tgz to a bucket with 90-day lifecycle, and writes a row to a DynamoDB table keyed by digest. The sidecar's dispatcher does aws s3 cp s3://razorpay-symbols/image-<digest>.symbols.tgz /tmp/symbols/ and unpacks before invoking the profiler. A 90-day lifecycle is enough for any incident traceable to a deploy in the last quarter; older symbols can be re-built from CI on demand.
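
The build-side push is small enough to read in full. A sketch under the assumptions above — a digest-keyed bucket and a DynamoDB index with the digest as partition key; the bucket, table, and docker-based digest lookup are this chapter's running example rather than a prescribed layout:

# push_symbols.py — CI step: upload a symbol bundle keyed by the image digest.
import subprocess
import boto3

def push_symbols(image_ref: str, symbols_tgz: str,
                 bucket: str = "razorpay-symbols",
                 table: str = "symbol-index") -> str:
    # Resolve the digest the runtime will later report for this image.
    repo_digest = subprocess.run(
        ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", image_ref],
        capture_output=True, text=True, check=True).stdout.strip()
    digest = repo_digest.split("@", 1)[-1]            # sha256:abcd...
    key = f"image-{digest}.symbols.tgz"
    boto3.client("s3").upload_file(symbols_tgz, bucket, key)
    boto3.resource("dynamodb").Table(table).put_item(Item={
        "digest": digest,                             # partition key
        "s3_key": key,
        "image":  image_ref,
    })
    return digest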

The cost: roughly $200/month in S3 storage and $50/month in retrieval for a 500-pod fleet. Compared with the cost of a single 30-minute outage where the on-caller could not get past the symbol problem, the ROI is decisive.

async-profiler vs perf record for the JVM — when each wins

async-profiler calls Hotspot's AsyncGetCallTrace API, which walks JavaThread structures the JVM maintains for itself. It produces correct Java method names without any JVM flag, walks safe-points cleanly, and costs ~0.5% CPU at 99 Hz. It is the right default for any JVM-only profiling.

perf record walks the OS-level stack via frame pointers (or DWARF unwinding). With -XX:+PreserveFramePointer and the perf-map-agent running, perf sees Java frames correctly, plus it sees the JNI boundary, the kernel-side time inside epoll_wait, the malloc spent inside JNI, and the exact syscall path. For mixed-language workloads (a JVM service calling a C library, or a JNI-heavy ML inference path), only perf shows the full stack.

The production rule: async-profiler is the daily-use default; perf record is the escape hatch when you need to see across the JNI boundary or into the kernel. The sidecar image ships both; the dispatcher picks one based on flags, and the on-caller can override with sp-flamegraph --tool perf.

Redaction patterns — the regexes you need

The redaction filter is the unsexy infrastructure that lets the flame graph leave the cluster. The patterns that matter for an Indian fintech / streaming / commerce stack:

  • PAN: [A-Z]{5}[0-9]{4}[A-Z] — Indian Permanent Account Number, leaks through tax-related code paths.
  • Aadhaar: \b[2-9]\d{11}\b — 12-digit, first digit non-0/1; leaks through KYC.
  • Mobile: \b[6-9]\d{9}\b — Indian mobile numbers.
  • Card numbers: Luhn-valid 16-digit clusters (a minimal Luhn filter is sketched below).
  • JWT: eyJ[A-Za-z0-9_\-]{10,}\.[A-Za-z0-9_\-]{10,}\.[A-Za-z0-9_\-]{10,} — bearer tokens.
  • Internal customer IDs: organisation-specific patterns (e.g. Razorpay's cust_[A-Za-z0-9]{14}, Zerodha's client codes).
  • Email addresses: standard RFC 5322 lite.

The patterns are conservative — better to over-redact and require a second pass than to leak. The redaction tool runs in CI tests with synthetic SVGs containing each pattern; a passing test suite is the gate that prevents regressions in the regex set. Why redaction must run on the SVG, not the raw stacks: the SVG flame graph is the artefact that reaches Slack, Jira, the incident wiki, and frequently a third-party SaaS continuous-profiler. The raw stack data is internal-only; the SVG is the version humans will share. Redacting the SVG catches both the original frames and any <title> tooltip text, which is where most of the leaks come from. Some teams redact the raw stacks too; many do not, and only catch the leak when something escapes through the SVG.
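
The Luhn requirement on card numbers is the one item in the list a plain regex cannot express — a bare 16-digit pattern would also scrub order IDs and trace IDs. A minimal filter that can sit alongside the REDACT_PATTERNS pass in the orchestrator; the function names are illustrative:

# luhn_redact.py — replace only Luhn-valid 16-digit clusters, not every number.
import re

CANDIDATE = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def luhn_valid(digits: str) -> bool:
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    def sub(m: re.Match) -> str:
        digits = re.sub(r"[ -]", "", m.group(0))
        return "[CARD_REDACTED]" if luhn_valid(digits) else m.group(0)
    return CANDIDATE.sub(sub, text)

print(redact_cards("token=4111 1111 1111 1111 order=1234567890123456"))
# token=[CARD_REDACTED] order=1234567890123456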

The Hotstar IPL pattern — pre-warmed sidecar pods

Hotstar's streaming-platform team runs a deliberately wasteful pattern during high-stakes events: every viewer-facing pod has a pre-warmed profiler sidecar running idle, ready to capture in zero seconds. The cost is 50 MB RSS and 0.05% CPU per pod. The benefit is that during the IPL final, when 25 million viewers are connected and a 90-second pod schedule delay is unacceptable, the on-caller's flame-graph capture starts instantly. Outside marquee events the sidecar is removed by the deployer; inside them it is the cheapest insurance possible against a 30-minute symbol-fetch dance.
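
Concretely, the pre-warmed pattern is one extra container in the Deployment template, added and removed by the deployer around the event window. A sketch using the Kubernetes Python client and a strategic-merge patch; the image, resource numbers, and deployment name are this chapter's running examples, not Hotstar's actual configuration:

# prewarm_sidecar.py — add an idle profiler sidecar to a Deployment for the event.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

sidecar = {
    "name": "sp-profiler",
    "image": "internal/sp-profiler:v3",
    "command": ["sleep", "infinity"],                     # idle until needed
    "resources": {"requests": {"cpu": "10m", "memory": "50Mi"},
                  "limits":   {"cpu": "100m", "memory": "128Mi"}},
    "securityContext": {"capabilities": {"add": ["SYS_PTRACE", "PERFMON"]}},
}

# Strategic-merge patch: containers merge by name, so this adds the sidecar
# without touching the app container. Remove it after the event by reverting
# the template in the deploy pipeline.
patch = {"spec": {"template": {"spec": {
    "shareProcessNamespace": True,
    "containers": [sidecar],
}}}}

apps.patch_namespaced_deployment(name="razorpay-payments-core",
                                 namespace="payments-prod", body=patch)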

The pattern generalises: any team with an event-shaped traffic profile (Big Billion Days, Tatkal hour, market open, IPL knock-out, Diwali peak) has a pre-warming opportunity. The trade is operational complexity (an extra container in every pod template) for capture latency (zero seconds vs 90+ seconds). For most teams the trade is worth it during the 4-hour event window and not worth it the rest of the year.

Reproduce this on your laptop

# Local kubernetes (kind / minikube), then point the script at it.
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy boto3 kubernetes pyyaml prometheus-api-client

# Build a tiny test "service" pod that intentionally has a hot loop
kubectl apply -f https://raw.githubusercontent.com/example/sp-test/main/slow-pod.yaml

# Drive load with hey or wrk
hey -z 60s -c 10 http://$(minikube ip):30080/  &

# In another terminal, run the orchestrator (point --bucket at a scratch
# bucket you own, or comment out the upload step for a purely local run):
python3 sp_flamegraph.py --service slow-pod \
    --prom http://prometheus.observability.svc:9090 \
    --bucket my-scratch-bucket --duration 30
# inspect /tmp/flame-*.svg in a browser

Where this leads next

Production flame-graph capture is the operational application of the techniques in /wiki/flamegraphs-reading-them-and-making-them and the safety counterpart to /wiki/live-debugging-without-stopping-the-world. The next chapters walk further through Part 15's production toolkit:

  • /wiki/tracepoints-and-dynamic-instrumentation — when a flame graph tells you "validate_card is 92% of CPU", a uprobe-based latency histogram tells you the distribution of per-call latency, often revealing that the 92% is one slow tail rather than uniformly slow calls.
  • /wiki/perf-record-and-perf-script-the-survival-kit — the kernel-side primitive underneath every flame-graph workflow, and the only tool that sees across the kernel/userspace boundary cleanly.
  • /wiki/continuous-profiling-in-production — the always-on companion to incident-time capture; what to ship when the spike is over before you noticed it.
  • /wiki/differential-flamegraphs — diffing two captures (before/after deploy, baseline/incident) to find what changed.

The arc is: pre-build the infrastructure quietly, capture cleanly during incidents, redact before sharing, and integrate the workflow into the runbook. Teams that treat flame-graph capture as a permanent piece of platform engineering — owned by SRE, version-controlled, tested in staging — out-perform teams that treat it as an incident-time hack by an order of magnitude on mean-time-to-diagnosis. The unglamorous infrastructure week pays back the first day a real incident lands.

A final cultural note: the team that captures flame graphs from production for the first time during an incident will fail. Aditi's 23-minute story is the typical first-time experience. The fix is to schedule a quarterly production flame-graph drill: pick a Friday afternoon, inject 200 ms of latency into a staging endpoint, and have every on-caller run the orchestrator end-to-end while a senior engineer watches and corrects. After two drills the team's median time-to-flame-graph drops from 23 minutes to 4 minutes; after four, to 90 seconds. The drills cost an afternoon a quarter; their absence costs a 30-minute production outage every time the muscle is needed and not present.

References