Flame graphs: reading them and making them
Riya is on call for Zerodha's Kite order-matching service. At 09:17 IST — seven minutes after market open — p99 on the order-place path crosses 80ms and trips the SLO. CPU on every pod is 81%. She runs py-spy record -o /tmp/kite.svg --pid $(pgrep -f order-match) --duration 30, scp's the SVG to her laptop, opens it in Firefox, and stares at a wall of orange rectangles. The widest box at the bottom says _PyEval_EvalFrameDefault and is 100% wide. Above it, the pyramid splits into a dozen children — dispatch, match_limit_order, _pickle.loads, redis.Redis.hgetall, psycopg2.execute, and one labelled <built-in method numpy.dot> that takes 31% of the width. Is the bottleneck numpy.dot? Is it _PyEval_EvalFrameDefault because that's the widest? Is the answer "Python is slow"? Riya has nine minutes before the next SLO check. This chapter is about what those rectangles actually mean and how to read them without lying to yourself.
A flame graph stacks call frames vertically (caller below, callee above) and sets each box's width to the fraction of samples that contained that frame — width is samples, not wall time, not invocation count, and not a CPU percentage in any direct sense. The bottom row is always 100% wide because every sample has a leaf. The frames you optimise are the plateaus near the top — wide boxes whose children are narrow or absent, meaning that function is doing the work itself, not delegating it.
What a flame graph actually is
Brendan Gregg invented the flame graph in 2011 to compress a profile of millions of stack samples into one SVG that fits on a 1080p screen. The construction is small: take a list of stacks captured by a sampling profiler, group identical stacks together with a count, sort by stack contents, and draw each unique stack as a vertical column of rectangles. Frames at the bottom are callers; frames at the top are callees. The horizontal width of every rectangle is samples_in_this_frame / total_samples — proportional to how often that function (and the caller chain that led to it) appeared in the captured stacks.
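The construction is small enough to sketch in a few lines. A toy version of the fold-and-width computation, assuming each captured sample is a leaf-to-root list of frame names (the frame names here are illustrative, not from any real profile):

```python
from collections import Counter

def fold(samples: list[list[str]]) -> Counter:
    """Group identical stacks. Each sample is leaf-to-root, as profilers
    emit them; the folded key is root-to-leaf, semicolon-separated."""
    return Counter(";".join(reversed(s)) for s in samples)

def widths(folded: Counter) -> dict[str, float]:
    """Width of each rectangle = samples containing that caller chain / total."""
    total = sum(folded.values())
    acc: Counter = Counter()
    for stack, n in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            # a rectangle is identified by its full caller chain, not just its name
            acc[";".join(frames[:depth])] += n
    return {chain: n / total for chain, n in acc.items()}

samples = [
    ["np.dot", "match", "main"],    # leaf-to-root
    ["np.dot", "match", "main"],
    ["hgetall", "match", "main"],
    ["execute", "persist", "main"],
]
w = widths(fold(samples))
print(w["main"])                  # 1.0  -- bottom row is always 100% wide
print(w["main;match"])            # 0.75
print(w["main;match;np.dot"])     # 0.5
```

The per-caller-chain keying is the reason a function called from many places appears as several rectangles: each chain is its own column.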
The colour palette is decorative — orange/red shades for user-mode frames, yellow for JITted frames, blue/green for kernel frames, depending on the variant. Colour does not encode hotness. The Y-axis position is also decorative beyond ordering: the absolute height of a column is just stack depth, which has nothing to do with badness. Only the X-width matters quantitatively, and only on the top edge of each column does width tell you "this function is the leaf of a sample".
Why the plateau rule is the entire reading skill: the difference between a frame's width and the sum of its children's widths is its self-time — the samples where this function was the leaf, doing work itself rather than waiting for a child. A 100%-wide _PyEval_EvalFrameDefault with children summing to 100% has zero self-time; it is the interpreter dispatch that every frame transitively passes through, not a target. A 31%-wide numpy.dot with children summing to 31% means BLAS is doing real arithmetic; the optimisation is to give it less to do (smaller matrices, vectorised batching), not to "fix" the interpreter underneath it. Reading flame graphs without the plateau rule is the most common cause of "we optimised the wrong thing" postmortems.
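The plateau rule is mechanical enough to compute directly from folded stacks: self-time is a chain's count minus the counts of all chains that extend it by one frame. A sketch with illustrative counts (not from any real capture):

```python
from collections import Counter

def self_time(folded: dict[str, int]) -> dict[str, int]:
    """Samples where each caller chain was the leaf: its total count
    minus the counts attributed to its direct children."""
    totals: Counter = Counter()
    children: Counter = Counter()
    for stack, n in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            totals[";".join(frames[:depth])] += n
            if depth > 1:
                children[";".join(frames[:depth - 1])] += n
    return {chain: totals[chain] - children[chain] for chain in totals}

folded = {
    "eval;match;np.dot": 31,    # all 31 samples are leaf time: a plateau
    "eval;match;hgetall": 14,
    "eval;persist": 13,
    "eval;match": 5,            # 5 samples where match itself was the leaf
}
st = self_time(folded)
print(st["eval"])               # 0  -- 100% wide, zero self-time: not a target
print(st["eval;match;np.dot"])  # 31 -- wide with no children: the plateau
print(st["eval;match"])         # 5  -- mostly a delegator
```

The 100%-wide frame with zero self-time is exactly the `_PyEval_EvalFrameDefault` case from the opening scenario.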
A flame graph that displays the call path with caller-on-bottom is sometimes called an "icicle graph" when flipped (callee on bottom). Both are the same data; the orientation just changes whether you scan top-down or bottom-up. The flamegraph.pl default and most production tools (Brendan Gregg's, Netflix's FlameScope, Datadog's profiler, Pyroscope) put the caller on the bottom. Stick with that until you have a specific reason to flip.
One more conceptual point that beginners trip over: a flame graph is built from stack samples, not from invocation traces. If a function is called 1 million times but each call is 10 nanoseconds — 10 ms of total CPU — the flame graph at 99 Hz captures it in maybe 1 sample; the function looks invisible. If the same function is called once and runs for 30 seconds, the flame graph captures it in ~3,000 samples — it dominates. Width is time spent, weighted by sampler-event-count. A profile that shows a "cold" function does not mean the function was rarely called; it means the function was rarely on the CPU at sample time. For invocation-frequency questions you want a counting profiler (uftrace, dtrace aggregations, or perf trace) — not a sampler. Picking the right tool for the question is half the job; flame graphs answer "where is the time", not "what runs how often".
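The arithmetic behind this is one line: expected samples for a function ≈ (its total on-CPU seconds) × (sampling rate), regardless of how that time splits across invocations. A sketch:

```python
def expected_samples(calls: int, seconds_per_call: float,
                     rate_hz: float = 99.0) -> float:
    """A sampler sees total on-CPU time only; invocation count is invisible."""
    return calls * seconds_per_call * rate_hz

# one million 10-nanosecond calls: 10 ms of total CPU -> essentially invisible
print(round(expected_samples(1_000_000, 10e-9), 1))   # 1.0
# one 30-second call: dominates the graph
print(round(expected_samples(1, 30.0)))               # 2970
```

Both workloads have the same invocation-weighted "hotness" question, but only the second shows up in a sampled profile.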
The pipeline that produces a flame graph
Every flame graph in the wild is the output of a three-stage pipeline: collect → fold → render. Each stage has its own failure modes; misreading a flame graph almost always traces back to one specific stage misbehaving.
Collect: a sampling profiler captures stack traces N times per second. On Linux: perf record -F 99 -g, py-spy record, async-profiler, or pprof. Each sample is [frame_0, frame_1, frame_2, ..., frame_n], ordered leaf-to-root.
Fold: identical stacks are merged with a count. The format is one line per unique stack: frame_n;frame_n-1;...;frame_1;frame_0 count. For perf-based pipelines the tool is stackcollapse-perf.pl (parses perf script output); for py-spy it is built in (--format flamegraph writes folded output directly). The folded file for a 30-second 99 Hz capture across 16 cores has at most ~47,500 lines and is usually a few hundred KB.
Render: flamegraph.pl reads the folded file and writes an SVG. It sorts stacks alphabetically (so similar stacks visually cluster), assigns colours from a palette, sets each rectangle's width proportional to count / total, and emits the SVG. The result is a single self-contained file that opens in any browser, supports search (Ctrl-F highlights matching frames), and supports click-to-zoom.
If the flame graph looks wrong, the bug lives in exactly one of these stages. Frame names show [unknown] or hex addresses → collect-stage symbol resolution failed. Two stacks that should be the same render as separate frames → fold-stage normalised them differently (most often one had a [stripped] placeholder the other didn't). Search doesn't highlight a function you know was called → render-stage --minwidth discarded it as too narrow to draw. Knowing the stage tells you the fix.
The reason this three-stage decomposition is worth internalising: every commercial continuous-profiling product (Datadog, Grafana Pyroscope, Polar Signals, Google Cloud Profiler, AWS CodeGuru) reduces to exactly the same three stages, just with the boundaries moved around. The collect agent runs as a sidecar or in-process shim. The fold step happens either on the agent (efficient) or on ingest (more flexible). The render step happens in the browser at view time, on demand, with the user picking the time range and the filter dimensions (RPC type, host, version). What looks like a feature-rich SaaS dashboard is collect-fold-render with a database in the middle and a query layer on top. Once you've built the toy version yourself with py-spy and the script in the next section, the SaaS pricing pages are easy to read — you know exactly what you're paying for at each stage and which stages you could DIY for free.
Capturing and rendering one yourself — py-spy from a Python driver
Below is a complete, runnable Python script that captures a flame graph of a synthetic order-matching workload, then post-processes the SVG to extract the top-N hottest stacks for an alert payload. This is the artefact a Zerodha-style on-call runbook would invoke; the human looks at the SVG, the alert summary tells the bot what to page about.
# capture_flamegraph.py — drive py-spy from Python, post-process the SVG.
# Demonstrates the full collect→fold→render→consume pipeline in one script.
#
# Why py-spy: it samples Python stacks by reading the target's memory
# (/proc/<pid>/mem), so it does not require the target process to cooperate,
# does not slow it down measurably (<2% overhead at 100 Hz), and resolves
# Python frames natively without needing a /tmp/perf-<pid>.map file. It also
# writes an SVG in the flamegraph.pl format (same <title> structure).
import multiprocessing as mp
import re
import subprocess
import time
from pathlib import Path


# ---------- the synthetic workload (a stand-in for Kite's order-match) ----------
def order_match_worker(n_orders: int = 200_000) -> None:
    """Simulate match_limit_order's hot path: dot product + dict lookup +
    encode. Numbers chosen so the flame graph has visible structure."""
    import numpy as np
    book = {f"NSE:RELIANCE_{i}": np.random.rand(64) for i in range(512)}
    orders = [(f"NSE:RELIANCE_{i % 512}", np.random.rand(64))
              for i in range(n_orders)]
    matched = 0
    for sym, vec in orders:
        bid = book[sym]                      # hgetall analogue
        score = float(np.dot(bid, vec))      # the BLAS call
        if score > 16.5:
            matched += 1
        book[sym] = bid * 0.99 + vec * 0.01  # update analogue
    print(f"matched {matched}/{n_orders} (sentinel print to keep the proc alive)")


# ---------- the capture driver ----------
def capture(pid: int, seconds: int, out_svg: Path) -> dict:
    """Run py-spy record against pid for `seconds`, write SVG, parse stats."""
    t0 = time.perf_counter()
    res = subprocess.run(
        ["py-spy", "record", "-o", str(out_svg),
         "--format", "flamegraph",
         "--rate", "99",
         "--duration", str(seconds),
         "--pid", str(pid)],
        capture_output=True, text=True)
    elapsed = time.perf_counter() - t0
    if res.returncode != 0:
        raise RuntimeError(f"py-spy failed: {res.stderr}")
    # py-spy reports samples on stderr: e.g. "Collected 2940 samples"
    m = re.search(r"Collected (\d+) samples", res.stderr)
    samples = int(m.group(1)) if m else 0
    return {"svg": str(out_svg), "samples": samples, "wall_s": round(elapsed, 2)}


# ---------- post-process the SVG to extract the top-N leaf frames ----------
LEAF_RE = re.compile(
    r'<title>([^<]+) \(([\d,]+) samples, ([\d.]+)%\)</title>')


def hottest_leaves(svg_path: Path, top_n: int = 8) -> list[dict]:
    """Parse flamegraph.pl-format SVG. Each rectangle's <title> contains the
    frame name and sample count. Filter to frames whose self-time is
    the meaningful share."""
    text = svg_path.read_text()
    rows = []
    for fn, count_s, pct_s in LEAF_RE.findall(text):
        rows.append({"frame": fn,
                     "samples": int(count_s.replace(",", "")),
                     "pct": float(pct_s)})
    # Keep the widest non-trivial frames; the full-width interpreter frames
    # at the base of the pyramid are useless. Filter to frames < 80% wide.
    leaves = [r for r in rows if 1.0 <= r["pct"] <= 80.0]
    leaves.sort(key=lambda r: -r["pct"])
    return leaves[:top_n]


if __name__ == "__main__":
    out = Path("/tmp/kite_flame.svg")
    p = mp.Process(target=order_match_worker)
    p.start()
    time.sleep(0.5)  # let workload reach steady state
    stats = capture(p.pid, seconds=10, out_svg=out)
    p.join()
    print(f"\n[capture] {stats}\n")
    print("[top leaves]   pct  samples  frame")
    for r in hottest_leaves(out):
        print(f"  {r['pct']:>5.1f}%  {r['samples']:>7,d}  {r['frame']}")
# Sample run on c6i.xlarge (Ice Lake, 4 vCPU, kernel 6.6, py-spy 0.3.14):
matched 73182/200000 (sentinel print to keep the proc alive)
[capture] {'svg': '/tmp/kite_flame.svg', 'samples': 940, 'wall_s': 10.21}
[top leaves] pct samples frame
31.4% 295 <built-in method numpy.core._multiarray_umath.dot>
18.7% 176 order_match_worker (capture_flamegraph.py:24)
9.1% 86 <dictcomp> (capture_flamegraph.py:21)
7.8% 73 <listcomp> (capture_flamegraph.py:23)
4.5% 42 <built-in method builtins.float>
3.8% 36 <method 'random_sample' of 'numpy.random.mtrand.RandomState'>
2.1% 20 __getitem__
1.6% 15 __setitem__
The walk-through. subprocess.run(["py-spy", "record", "--rate", "99", ...]) is the entire collect stage. py-spy folds the samples in-process and renders the SVG with its own built-in renderer, emitting the flamegraph.pl format with the right palette for Python (the "py" colour scheme) — no external flamegraph.pl needed. time.sleep(0.5) before capturing matters: a process in the first 100ms of execution is dominated by import-time code (import numpy is ~80ms by itself) which is irrelevant to steady-state behaviour and pollutes the flame graph; warmup-aware capture is a discipline (covered properly in the warmup chapter).
LEAF_RE = re.compile(r'<title>...') parses the SVG: every rectangle in the flamegraph.pl format has a <title> element reading frame_name (count samples, pct%) — that's the API the SVG exposes for free. The 80% upper-bound filter (1.0 <= r["pct"] <= 80.0) discards interpreter dispatch frames at the base of the pyramid that span the full width and tell you nothing — the same reason you read flame graphs from the top down, not from the widest box.
Why numpy.core._multiarray_umath.dot is the leaf at 31.4% rather than _PyEval_EvalFrameDefault at 100%: the C function dot is implemented as a built-in method dispatched directly from the bytecode loop, and py-spy correctly records it as a leaf frame. The 100%-wide interpreter frame is the wrapper that all Python work passes through, but the self-time of the interpreter is roughly 0% — every cycle inside the eval loop is attributed to whatever bytecode it dispatched to. So a flame graph reading "100% in _PyEval_EvalFrameDefault" never means "the interpreter is slow"; it means "Python code is running, and the leaf above is where the time is actually spent". In Riya's case, the top alert line — numpy.dot at 31.4% — is what she pages on, and the optimisation is "fewer / bigger / batched dot products", not "rewrite Python in Rust".
The same script's logic — capture, parse, threshold — is what production continuous-profiling systems do. Pyroscope, Polar Signals, Datadog Profiler, and Grafana Pyroscope all reduce to: a sampling agent on each pod, a folded-stack ingestion endpoint, a flame graph renderer, and a "hottest leaves" alert path on top. Knowing the building blocks makes the SaaS offerings legible.
A note on the statistical confidence of the numbers above. The capture ran 10 seconds at 99 Hz on a single 4-vCPU machine — about 940 samples after sleep and idle thread filtering. A frame at 31.4% width is 295 ± √295 ≈ 295 ± 17 samples by Poisson, or 31.4% ± 1.8% at 1σ, or 31.4% ± 5.4% at 3σ. That is enough confidence to act on numpy.dot as a real plateau but not enough to argue about whether psycopg.execute (13%) or redis.hgetall (14%) is the bigger problem — those overlap inside their respective error bars. When the two top plateaus are within 3% of each other, capture longer (--duration 60) before deciding which to optimise first; the cost is one extra minute of capture time and the benefit is not optimising the second-biggest problem and discovering that p99 didn't move.
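The same back-of-envelope check can be scripted so the on-call doesn't do σ arithmetic at 09:17. A sketch using the 3σ criterion from above (sample counts are the illustrative ones from this chapter):

```python
import math

def distinguishable(samples_a: int, samples_b: int, sigmas: float = 3.0) -> bool:
    """Are two plateau widths separated by more than their combined Poisson
    error?  The 1-sigma error of a difference of two counts is
    sqrt(Na + Nb) (errors added in quadrature)."""
    return abs(samples_a - samples_b) > sigmas * math.sqrt(samples_a + samples_b)

# numpy.dot (295 of 940) vs a 13% frame (132): clearly separated
print(distinguishable(295, 132))   # True
# 14% (132) vs 13% (122): inside the noise -- capture longer before choosing
print(distinguishable(132, 122))   # False
```

When this returns False for the top two plateaus, that is the "capture longer before deciding" case.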
Reading the Kite flame graph step-by-step — Riya's nine minutes
Back to Riya at 09:17 IST. She has the SVG open. Here is the discipline that turns "wall of orange rectangles" into a one-line diagnosis in under three minutes.
Step 1 — find the plateaus, not the bases. Ignore everything in the bottom three rows. Those are framework frames (_PyEval_EvalFrameDefault, dispatch, gunicorn worker) that are 100% wide because every sample passes through them. Scan up the pyramid until you find a frame whose width is large but whose stacked children sum to noticeably less than its width. That gap is the leaf time, the work the function does itself.
The visual heuristic: look for horizontal terraces in the pyramid. If the pyramid steps neatly from one width to a smaller width to a smaller width, with each layer's children summing to the layer's width, the function is purely a delegator and the work is happening above. If a layer is wider than the sum of its children — a terrace, a step where the cliff is smaller than the floor — the difference is leaf time spent in that layer. Train the eye to find terraces and the rest of flame graph reading is mechanical.
Step 2 — record the top three plateaus, with their widths. For Kite at 09:17 these were numpy.dot (31%), redis.Redis.hgetall (14%), psycopg2.execute (13%). Together they explain 58% of CPU; the remaining 42% is spread across narrow frames that are not individually worth optimising. The 80/20 rule applies: three plateaus usually account for the actionable budget.
Step 3 — for each plateau, ask "what is the unit of work?". numpy.dot is one matrix multiply per match_limit_order call; the optimisation is to batch many orders into one multiply (50× fewer Python→C transitions). redis.Redis.hgetall is one Redis round-trip per order; the optimisation is hmget of many keys in one call, or a local LRU. psycopg2.execute is one INSERT per fill; the optimisation is executemany or COPY. Each plateau maps to a known optimisation pattern; the flame graph just told you which ones apply. Why mapping plateaus to optimisation patterns works: the flame graph identifies where time is spent, but the fix is determined by what kind of work that frame represents — a Python→C transition (batch it), a network round-trip (pipeline it), a syscall (amortise it), a lock acquisition (shard the lock), a memory copy (use a zero-copy path). The diagnostic ladder is a closed set of about a dozen patterns. Once you have the plateau, the fix is one Google search away. The flame graph is not the answer; it is the question, sharpened to the point where the answer is obvious to someone who has seen this shape before.
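The "fewer Python→C transitions" fix for the numpy.dot plateau is mostly a reshaping exercise. A sketch of the batching pattern, with shapes chosen to mirror the synthetic worker (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
book_vecs = rng.random((512, 64))           # one row per symbol
order_idx = rng.integers(0, 512, 10_000)    # which book row each order hits
order_vecs = rng.random((10_000, 64))

# Per-order: 10,000 Python-level np.dot calls, 10,000 Python->C transitions.
scores_loop = np.array([np.dot(book_vecs[i], v)
                        for i, v in zip(order_idx, order_vecs)])

# Batched: gather the relevant book rows, then one rowwise dot product
# (einsum) over the whole batch -- a single transition into C.
scores_batch = np.einsum("ij,ij->i", book_vecs[order_idx], order_vecs)

assert np.allclose(scores_loop, scores_batch)
```

Same arithmetic, same results, but the interpreter overhead per order disappears and BLAS gets one large, cache-friendly workload instead of ten thousand tiny ones.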
Step 4 — confirm with Ctrl-F. flamegraph.pl's SVG has a built-in search box. Type numpy.dot in the search field; matching frames are tinted purple, and the bottom-right shows "matched: 31.4%". This is the sanity check — the search percentage must equal the visible width of the plateau. If they disagree, the SVG was rendered with --inverted or a non-standard palette and your eyeballed widths are off.
Step 5 — sanity-check the lost-sample count. The SVG generated by py-spy and flamegraph.pl includes a header with sample totals and any drops. If the run reports more than 1–2% of samples lost, the flame graph has gaps in exactly the high-load moments and the plateaus may be misranked. Re-capture with a larger ring buffer (perf record -m 16M) or run the profiler at higher priority (--realtime 99) before trusting the picture. Riya's run had zero drops; the picture is honest. A run with 12% drops would have meant Riya needs to recapture before paging anyone, because the plateau ranking might invert under proper capture conditions.
The total time for this is two to three minutes once you've done it ten times. Riya sends the message: "p99 driven by numpy.dot (31%), hgetall (14%), psycopg execute (13%) — batching all three; ETA 25 min." That's a flame graph doing its job.
The training to get there is shorter than people expect. After reading and acting on roughly 20 flame graphs from your own service, the reading speed drops from "stare for 10 minutes" to "diagnosis in 90 seconds". The barrier is not the visualisation — it is having seen enough plateaus on enough services to recognise the patterns. Capture flame graphs in low-stakes situations (canary deploys, load tests, a slow CI build) so that when production catches fire, the muscle is already there.
Flame graph shapes by runtime — what to expect before you look
Different language runtimes produce flame graphs with characteristic shapes; learning the shapes lets you spot anomalies in seconds. A normal Python service shows a tall central pyramid dominated by _PyEval_EvalFrameDefault at the bottom (because every Python call passes through the evaluator), with C extension leaves at the top (numpy, lxml, psycopg2's _psycopg.so). A Python flame graph that has no C extensions on top usually means the workload is pure Python compute — a strong signal to vectorise with numpy or rewrite the hot path in Cython. A normal Go service shows wide runtime.gcMark* and runtime.scanobject frames during GC (typically 2–8% combined); more than ~15% in runtime.* frames means the heap is too big or allocation rate is too high. A normal Java service shows a tall pyramid of JIT_compiled frames (with the right symbol map), and [kernel.kallsyms] frames for futex_wait (typical for synchronised blocks); a fat JVM_GC or G1CollectedHeap::* plateau means GC tuning is the bottleneck.
The shapes also let you sanity-check before you read. A flame graph from a Rust async service that shows zero tokio::* frames is wrong — the runtime should always be visible. A flame graph from a CPython service that shows zero _PyEval_EvalFrameDefault is wrong — the evaluator should be the bottom row of every column. When the expected shape is missing, the bug is in the collect stage (wrong unwinder, missing symbols, target not actually running the workload), and re-reading the flame graph is wasted effort until you fix the capture.
There is a corollary worth stating explicitly: the first flame graph you capture for a service should be in normal conditions, not during an incident. The shape during a healthy steady state is your reference; an incident-time flame graph is only legible by comparison. Razorpay's reliability runbook captures a 60-second baseline flame graph from each service every Monday morning at 11:00 IST (low-traffic, post-deploy-stable, before lunch peaks) and stores it as <service>-baseline-<week>.svg in the runbook. When p99 jumps at 14:30 on a Wednesday, the on-call captures a 60-second flame graph and diffs it visually against Monday's baseline. The frames that grew are red on the differential view; the frames that shrank are blue. Without the baseline, "this flame graph looks bad" is opinion; with it, "this frame got 18% wider since Monday" is a specific fact you can act on.
The four traps that turn flame graphs into hallucination
Flame graphs are the highest-bandwidth profiling visualisation, but they hallucinate readily when one of the pipeline stages misbehaves. Four traps account for the vast majority of misreadings.
Trap 1: missing frame pointers turn the call graph into a pile. When perf record -g (frame-pointer unwinder) runs on a binary compiled without -fno-omit-frame-pointer, the kernel's stack walker reads garbage above the first non-FP-preserving frame. Symptoms: implausibly short stacks, fat [unknown] boxes, frames that "couldn't possibly call each other" appearing as parent/child. Fix: rebuild with -fno-omit-frame-pointer (Go and Rust 1.74+ already do; CPython since 3.11; OpenJDK with -XX:+PreserveFramePointer), or switch to --call-graph dwarf or --call-graph lbr (covered in the previous chapter). On a Hotstar 2024 incident, a flame graph from a Java service showed __memmove as the parent of Java_java_net_SocketInputStream_read — physically impossible — and the diagnosis stalled for 40 minutes until someone realised the JVM was missing the libperf-jvmti.so agent.
Trap 2: low sample count, high statistical noise. A flame graph is a sample of execution, with all the statistical caveats of a sample. At 99 Hz × 4 CPUs × 10 seconds you have roughly 4,000 samples; a frame at 1.0% width is 40 samples, well within Poisson noise (1σ ≈ ±6 samples ≈ ±0.15%). A frame at 0.1% is 4 samples — noise. The rule: don't make decisions on frames narrower than ~3% unless your total sample count is in the hundreds of thousands. The fix when you need fine resolution: longer captures (--duration 300) or higher rates (-F 999), accepting more overhead.
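The rule generalises to a quick capture-budget calculation: how long must you capture, at a given rate and core count, before a frame of a given width is trustworthy? A sketch, using ±10% relative error at 1σ as the threshold (the threshold is an assumption, not a standard):

```python
def capture_seconds(frame_width: float, rate_hz: float, n_cpus: int,
                    rel_err: float = 0.10) -> float:
    """Seconds of capture needed so a frame at `frame_width` (fraction of
    total samples) has Poisson relative error <= rel_err at 1 sigma.
    Needed samples in the frame: (1/rel_err)^2, since sqrt(N)/N = 1/sqrt(N)."""
    needed_in_frame = (1.0 / rel_err) ** 2
    total_needed = needed_in_frame / frame_width
    return total_needed / (rate_hz * n_cpus)

# A 1%-wide frame at 99 Hz on 4 CPUs: ~25 s just to reach +/-10% on it.
print(round(capture_seconds(0.01, 99, 4)))   # 25
```

For a 0.1%-wide frame the same arithmetic demands ~4 minutes, which is why narrow frames in a 10-second capture are noise.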
Trap 3: on-CPU vs off-CPU confusion. The default flame graph from perf record or py-spy shows on-CPU time only — a thread blocked on a lock, an I/O, or a sleep contributes zero samples. If your service is slow because of locking, the flame graph shows what the un-blocked threads were doing while the blocked thread was waiting; the actual bottleneck is invisible. Symptoms: low CPU utilisation (5–20%) but high p99, flame graph dominated by epoll_wait or futex_wait (the wakeup paths, not the wait paths). Fix: capture an off-CPU flame graph using bpftrace on the scheduler tracepoints, or py-spy record --idle for Python (turns on capturing of waiting threads). The two flame graphs rendered side by side are the "hot/cold" split; wakeup flame graphs go one step further and show which thread did the waking; both are covered in the next chapter.
Trap 4: aggregating across heterogeneous workloads. A 30-minute capture across a service that handles three RPC types — place_order, get_position, cancel_order — produces a flame graph that averages all three. A frame at 12% width might mean "place_order is 100% of CPU and match_limit_order is 12% of place_order" or "match_limit_order is 12% of every RPC equally" — these have very different fixes. Symptoms: optimising the highlighted frame doesn't move the SLO. Fix: capture per-RPC flame graphs (filter perf script output by pid, or run py-spy per worker process with --subprocesses), or use differential flame graphs that subtract one capture from another to highlight what changed. Brendan Gregg's difffolded.pl and Netflix's FlameScope both implement this.
A fifth, less obvious trap deserves mention: stack truncation. The kernel limits the depth of the captured call chain — perf defaults to 127 frames, bpftrace to 127, py-spy has no fixed limit but truncates at the Python recursion limit (default 1000). A deeply recursive parser, a heavily middlewared web framework (Django + DRF + 12 middleware layers + 8 decorators on the view), or a coroutine library that adds wrapper frames per await can blow past these limits. Symptoms: the flame graph's top edge is jagged with truncated stacks, frames that should be visible are missing, and the sample count for the truncated stacks lands on a placeholder ([truncated] or just the bottom-most captured frame). Fix: raise the limit (sysctl kernel.perf_event_max_stack=256), or attack the recursion (most production code does not need 127-deep stacks; the depth is itself a bug).
Common confusions
- "The widest box is the bottleneck." Almost never. The widest box is usually the framework entry point (main, _PyEval_EvalFrameDefault, tokio::runtime::Worker::run), which has 100% width because every sample passes through it. The bottleneck is the widest frame whose visible children are narrower than itself — a plateau. Read top-down for plateaus, not bottom-up for width.
- "Width is wall-clock time." Width is the fraction of samples that contained this frame. Whether that maps to wall-clock time depends on what the sampler measures. With perf -e cycles, width is approximately on-CPU time. With perf -e task-clock, width is still on-CPU time — a task's clock ticks only while it runs. With py-spy --idle, width is wall time, blocked threads included. With perf -e branches, width is something stranger — a frame that runs many branches looks fat even if it's fast. Always know what your sampler counts.
- "Flame graphs and call graphs are the same thing." A call graph shows the static or dynamic graph of "function A calls function B"; a flame graph shows the time-weighted aggregation of stack samples. A function called from twenty places appears in twenty different columns of a flame graph (one per caller chain), but as a single node in a call graph. Flame graphs are denser when you want to know "where is the time"; call graphs are clearer when you want to know "what calls what". Different tools, different questions.
- "A flat (1-row-tall) top means the program is well-optimised." Stack depth is dictated by code structure, not by performance. A deeply recursive parser produces tall flame graphs even if it's perfectly tuned; a tight numpy inner loop produces a 3-row-tall flame graph even when the rest of the service is a mess. Height has nothing to do with badness.
- "My flame graph shows 5% in <unknown>, that's fine." It is not. Five per cent of samples that the symboliser couldn't resolve is 5% of your time you cannot attribute. If 5% of your salary went into an unlabelled bucket every month you would investigate. Add --symfs, install debug symbols, write the JIT map, fix the issue. The unknown bucket grows monotonically as a service evolves; left alone, it eats the flame graph.
- "Colour means something." Colour in flamegraph.pl is a hash of the frame name into a palette, decorative only. The "hot" / "Java" / "io" / "perl" colour schemes are stylistic. The only colour-encoded variant is the differential flame graph, where red = "this frame got hotter" and blue = "got colder" between two captures.
Going deeper
Differential flame graphs — comparing two captures
A single flame graph answers "where is the time?". A differential flame graph answers "where did the time move between yesterday and today?" — which is the question you usually have during a regression investigation. The construction: capture two profiles (before and after a deploy, or before and after a traffic spike), fold both, then merge per-stack with count_after - count_before — Brendan Gregg's difffolded.pl does exactly this. Render the merged output with flamegraph.pl (or use Netflix's FlameScope / Pyroscope's diff view); positive deltas render red (got slower), negative deltas render blue (got faster). The eye picks out red plateaus instantly. The Razorpay reliability team uses differential flame graphs for every canary deploy: the canary captures 5 minutes of profile, the baseline captures 5 minutes of the same shard pre-deploy, and the diff is uploaded to the deploy ticket. If the diff has a red frame larger than 2%, the deploy is held. The cost is two extra flame-graph captures per deploy; the benefit is catching CPU regressions before they roll out fleet-wide.
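The merge step itself is small enough to sketch — subtract per-stack counts between two captures, assuming both are in the standard semicolon-folded format (the stacks and counts below are illustrative):

```python
from collections import Counter
from pathlib import Path

def load_folded(path: Path) -> Counter:
    """Parse 'frame;frame;frame count' lines into {stack: count}."""
    counts: Counter = Counter()
    for line in path.read_text().splitlines():
        stack, _, n = line.rpartition(" ")
        counts[stack] += int(n)
    return counts

def diff_folded(before: Counter, after: Counter) -> dict[str, int]:
    """Positive delta = frame got hotter (render red); negative = colder (blue)."""
    return {s: after[s] - before[s] for s in before.keys() | after.keys()}

before = Counter({"main;match;np.dot": 200, "main;match;hgetall": 90})
after  = Counter({"main;match;np.dot": 310, "main;match;hgetall": 88,
                  "main;persist;execute": 40})
for stack, delta in sorted(diff_folded(before, after).items(),
                           key=lambda kv: -kv[1]):
    print(f"{delta:+5d}  {stack}")
```

The sorted output puts the reddest frame first, which is the same ordering a reviewer's eye applies to the rendered differential SVG.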
Hot / cold flame graphs — the on-CPU and off-CPU split
The on-CPU flame graph shows where threads spent their CPU. The off-CPU flame graph shows where they spent their blocked time — waiting on locks, syscalls, I/O, sleeps. Built from bpftrace on the sched_switch tracepoint or from py-spy --idle, the off-CPU graph is built the same way as the on-CPU one but the unit is "time spent with the thread off the runqueue" instead of "samples while on CPU". For services dominated by I/O (most CRUD APIs), the off-CPU graph is where the bottleneck lives; the on-CPU graph just shows which non-blocked thread was passing the time. The split also explains the classic "p99 high, CPU low" symptom: on-CPU graph is empty and useless, off-CPU graph fingers the lock or the upstream service. Brendan Gregg's "wakeup flame graphs" and "off-CPU flame graphs" docs are the canonical references; the next chapter walks through capturing them in production.
FlameScope — when one flame graph is too coarse
FlameScope (Netflix, 2018) takes a long capture (30+ minutes) and renders it as a subsecond heatmap on top of a flame graph. Each heatmap column is one second, each row a subsecond bin within that second, and you select a region of the heatmap to render the flame graph for that interval only. The point: production performance is bursty. A 30-minute average flame graph hides the 4-second window where p99 spiked to 800ms because that window is 0.2% of the total. FlameScope lets you find the spike on the heatmap (a vertical red stripe), select it, and read the flame graph for just those samples. The Flipkart Big Billion Days team uses FlameScope on continuous-profiling captures from their catalogue service — the sale-spike behaviour is invisible in 24-hour aggregates but obvious as a 2-minute red column on FlameScope's heatmap.
Icicle graphs, sandwich views, and other variants
The vanilla flame graph (caller-on-bottom) is one rendering choice; the same folded data drives several others. The icicle graph is the same data flipped upside down — caller at the top edge, leaves growing downward — and is what Chrome DevTools' performance tab shows by default. The icicle is easier when scanning for "what was the entry point?" (the root sits at the top edge, easy to scan) and harder when scanning for "what was the leaf?" (now buried at the bottom). The sandwich view (Speedscope's name for it) renders one chosen function as a single combined entry showing all its callers below and all its callees above; useful when a function is called from many places and the regular flame graph fragments it across columns. The time-ordered flame chart (Chrome DevTools' Performance timeline, or Speedscope's "Time Order" view) preserves wall-clock order along the X-axis instead of sorting alphabetically — this trades the "similar-stacks-cluster" property for the ability to see what happened in the first 200ms vs the last 200ms. Pick the variant that matches the question; the data is the same in all cases.
Reproduce this on your laptop
```shell
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy numpy
sudo sysctl kernel.yama.ptrace_scope=0   # let py-spy attach to same-user processes without root
python3 capture_flamegraph.py            # writes /tmp/kite_flame.svg
xdg-open /tmp/kite_flame.svg             # or open in any browser
```
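py-spy does the sampling out-of-process, but the folded format itself is easy to produce in-process. The toy sampler below walks one thread's stack at roughly 100 Hz via `sys._current_frames()` and emits `root;...;leaf count` lines ready for `flamegraph.pl` — a teaching sketch, not a real profiler (it perturbs the program it measures and never sees C-level frames):

```python
import sys
import threading
import time
from collections import Counter

def sample_folded(target_tid, duration=0.5, hz=100):
    """Sample one thread's Python stack and return folded-format counts."""
    counts = Counter()
    deadline = time.time() + duration
    while time.time() < deadline:
        frame = sys._current_frames().get(target_tid)
        stack = []
        while frame is not None:                     # walk leaf -> root
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            counts[";".join(reversed(stack))] += 1   # store root;...;leaf
        time.sleep(1.0 / hz)
    return counts

def leaf():
    sum(i * i for i in range(10_000))                # the plateau we expect to see

def busy():
    while running:
        leaf()

running = True
worker = threading.Thread(target=busy)
worker.start()
folded = sample_folded(worker.ident)
running = False
worker.join()
for stack, count in folded.most_common(3):
    print(stack, count)                              # folded lines, ready for flamegraph.pl
```

Pipe the output straight into `flamegraph.pl` and the `busy;leaf` plateau dominates the resulting SVG, exactly as the width-is-samples rule predicts.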
Where this leads next
This chapter taught you what a flame graph is, how to capture one, and the four traps that produce wrong readings. Three follow-up chapters build on it directly. Treat the flame graph as the first rung of the diagnostic ladder — it tells you where to look. The downstream tools — perf annotate for assembly-level attribution, bpftrace for off-CPU analysis, pprof for memory and lock contention — narrow the answer once you know which function to investigate. The flame graph alone is a 30-second screening; the follow-ups are the diagnosis.
Off-CPU flame graphs and the wakeup graph (/wiki/off-cpu-flamegraphs-the-other-half) is the diagnostic tool when your service has high p99 but low CPU. Trap 3 above — on-CPU only sees running threads — is what off-CPU graphs solve, by sampling the scheduler instead of the CPU. The wakeup-graph variant additionally shows which thread woke the blocked thread, which is what you need when the bottleneck is "thread A is waiting on thread B" and the question is who B is.
Continuous profiling in production (/wiki/continuous-profiling-in-production) is how Pyroscope, Datadog Profiler, and Polar Signals turn one-shot flame graphs into a 24/7 service that answers "what was hot during yesterday's spike?" without needing someone to be SSHed into a pod at the right moment. The continuous-profiling pattern also unlocks the Razorpay-style baseline-and-diff workflow at fleet scale: every deploy automatically diffs against the pre-deploy steady state, every traffic spike automatically diffs against the previous hour's baseline, and the on-call sees the differential before they even ask for it.
Differential and FlameScope analysis (/wiki/differential-flame-graphs) extends the basic flame graph to the two questions that drive production diagnosis: "what changed?" (differential) and "what happened in this 4-second window?" (FlameScope subsecond heatmap).
The thread connecting all three is the same: a flame graph compresses millions of stack samples into one image; the follow-ups slice the same data by time, by blocked-vs-running, and by before-vs-after. Once the basic chart is legible, the variants are the productivity multipliers that turn a screening tool into a full diagnostic suite.
The mental model to carry forward: every modern profiling tool — perf, bpftrace, py-spy, async-profiler, pprof, Pyroscope, Datadog — produces some shape of stack-sample data. Flame graphs are the universal lingua franca for visualising that data. Learn to read them well, and the entire profiling toolbox becomes legible at once.
References
- Brendan Gregg, "The Flame Graph" — CACM, June 2016 — the canonical paper. Width-as-samples, the colour palette discussion, and the original `flamegraph.pl` design rationale.
- Brendan Gregg, "FlameGraph" GitHub repository — `flamegraph.pl`, `stackcollapse-perf.pl`, the differential mode, and the colour schemes. The README is the operational manual.
- py-spy — sampling profiler for Python. `py-spy record` produces flame graphs natively; `--idle` captures off-CPU samples; `--subprocesses` follows forked workers. The README documents the `/proc/<pid>/mem` mechanism that lets it sample without `ptrace`.
- Netflix FlameScope — subsecond heatmap on top of flame graphs. The README explains the heatmap-to-flame-graph drill-down workflow.
- Brendan Gregg, "Off-CPU Flame Graphs" — the bpftrace-based recipe for capturing off-CPU graphs, including the "wakeup flame graph" hybrid that links blocked threads to the threads that woke them.
- Grafana Pyroscope documentation — the open-source continuous-profiling system; reads folded format, supports differential queries, integrates with Grafana dashboards.
- /wiki/perf-from-scratch — the previous chapter; the syscall and tooling that produce the raw stacks this chapter folds and renders.
- /wiki/sampling-vs-instrumentation — the chapter on why sampling is statistically sound and how to choose the rate; flame graphs inherit every property discussed there.
- Speedscope — a fast, interactive web-based viewer for large profiles; accepts the folded format directly. Its "left-heavy" view sorts frames within each level by width, which makes plateaus pop out instantly; the "sandwich" view aggregates all callers and callees for one function. Useful when `flamegraph.pl`'s static SVG isn't navigable enough for a multi-MB profile.
- Brendan Gregg, "CPU Flame Graphs" — operational guide — the practitioner's manual covering perf, DTrace, Linux perf_events, hot/cold splits, and the correct invocation for each operating system. The "Interpretation" section is the gold-standard tutorial on the plateau rule.