Differential flame graphs
At 02:14 IST on a Tuesday, Aditi merges PR #4821 into the Razorpay payments-router repo — a one-line dependency bump from grpc-python==1.59 to 1.62. CI passes, the canary at 5% traffic shows no error-rate change, the rollout completes at 02:48. By 09:30 IST, the payments-team dashboard shows p99 on POST /v1/payments/capture has drifted from 38 ms to 71 ms. CPU utilisation is unchanged; error rate is unchanged; throughput is unchanged. Aditi pulls a flame graph from the new build. It looks normal. She pulls the flame graph from the old build (the previous canary archive). It also looks normal. The two graphs are 99% identical to her eye, but somewhere in that 1% lives 33 ms of new latency on every payment. Staring at two flame graphs side by side will not find it. She needs a graph that subtracts the two — colours red where time grew, blue where it shrank — and that is the differential flame graph.
A differential flame graph is the per-frame difference of two flame graphs: subtract the sample count of each stack between baseline and target, render with red = grew and blue = shrank. It surfaces regressions that hide inside two superficially identical flame graphs by attributing the delta to the specific stack frames that gained or lost samples. The capture and folding pipeline is identical to a single flame graph; the diff happens at the folded-text stage, where difffolded.pl emits a two-count folded file that flamegraph.pl then renders in differential colours. Use it for every deploy-driven regression, every A/B perf test, and every "did this commit help" question.
Why two side-by-side flame graphs cannot find a regression
A flame graph compresses a million stack samples into a few hundred labelled rectangles whose widths sum to 100%. The compression is deliberate — it lets you see the shape of where CPU time goes on one screen — but it has a structural cost: a 5% increase in the width of one rectangle is invisible to the human eye when that rectangle is buried 12 frames deep in a stack that already accounts for 30% of the graph. Aditi's regression sits at exactly that depth: the new gRPC client added one extra _validate_metadata call inside the request-encoder hot path, costing ~33 ms per request. On the post-deploy flame graph the encoder column is 12% wide instead of 8% wide. That difference exists, but human visual estimation of two flame-graph rectangle widths is roughly ±5% per frame — well above the 4% the regression actually moved.
The mathematics of the problem make the eyeball-comparison strategy hopeless. A flame graph with 20,000 unique stacks averaging 50 samples each (a typical production capture) shows fewer than 200 boxes wider than 0.5%. A regression that adds 3% to total CPU is distributed across many frames in the call chain — the leaf where the new work happens gains 3%, but every ancestor frame from the leaf back to the thread root also gains 3%, because the ancestor was on-stack while the leaf was running. So the visible difference is "every column on the path from main to the new function got slightly wider", and identifying which column got slightly wider, against the background of normal sample-count noise, is the kind of pattern recognition humans are bad at and arithmetic is good at.
Why frame-by-frame ancestor inflation matters: each stack sample contributes one count to every frame on its call chain — leaf, parent, grandparent, all the way to the root. If a regression adds 100,000 samples to a single deep leaf, it also adds 100,000 samples to each of the 12 ancestors above that leaf. So an isolated "new function added latency" event manifests as a 12-frame-deep column whose every cell got heavier by the same absolute amount. A diff that operates per-frame attributes the gain correctly: each ancestor carries the same absolute delta as the leaf, but because the colour is normalised against each frame's own count, the leaf (where the delta is largest relative to the frame's size) renders the deepest red.
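The ancestor-inflation arithmetic is easy to verify mechanically. A toy sketch (stack strings and counts invented for illustration) that folds stacks into per-frame totals and shows every ancestor absorbing the new leaf's samples:

```python
from collections import Counter

def per_frame_counts(folded_lines):
    """Accumulate each stack's sample count into every frame on its call chain."""
    counts = Counter()
    for line in folded_lines:
        stack, n = line.rsplit(" ", 1)
        for frame in stack.split(";"):
            counts[frame] += int(n)
    return counts

base   = ["main;handle;encode 100"]
target = ["main;handle;encode 100",
          "main;handle;encode;validate 50"]   # regression: new leaf, 50 samples

# Counter subtraction keeps the positive deltas.
delta = per_frame_counts(target) - per_frame_counts(base)
print(delta)
# Every ancestor of the new leaf (main, handle, encode) gains the same 50
# samples as the leaf itself -- the "12-frame column got heavier" effect.
```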
The fix is to do the subtraction in the data, not in the eye. Take the folded-stack file from the baseline capture (one line per unique stack, format frame_a;frame_b;frame_c <count>), take the folded file from the target capture, and compute, per stack: delta = target_count - baseline_count. Stacks that got hotter have positive delta, stacks that got colder have negative delta. Render the two-count file with flamegraph.pl, which detects the format and switches to differential colouring: each frame's colour is now mapped from the delta sign and magnitude, not from a hash of its name. Red frames absorbed new samples; blue frames lost samples. Stacks that did not change render in neutral grey. The eye finds a single fat red leaf in a sea of grey in two seconds — the same task that took two minutes of side-by-side staring before, with worse accuracy.
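The data-side subtraction is small enough to sketch end to end. A minimal Python stand-in for difffolded.pl (same two-count output format; the stacks and counts are invented), not the real Perl script:

```python
def diff_folded(base_text: str, target_text: str) -> str:
    """Emit '<stack> <base_count> <target_count>' for every stack seen in
    either folded file, treating a stack missing from one side as zero."""
    def parse(text):
        out = {}
        for line in text.strip().splitlines():
            stack, n = line.rsplit(" ", 1)
            out[stack] = out.get(stack, 0) + int(n)
        return out
    b, t = parse(base_text), parse(target_text)
    return "\n".join(f"{stack} {b.get(stack, 0)} {t.get(stack, 0)}"
                     for stack in sorted(set(b) | set(t)))

base   = "main;encode 800\nmain;io_wait 200\n"
target = "main;encode 800\nmain;encode;validate 300\nmain;io_wait 180\n"
print(diff_folded(base, target))
# main;encode 800 800
# main;encode;validate 0 300
# main;io_wait 200 180
# The brand-new stack (0 -> 300) renders deep red; the shrunken one blue.
```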
The differential flame graph does not replace the regular flame graph. The regular graph still answers "where does my service spend its time" — useful when you don't have a baseline. The differential graph answers a strictly narrower question: "what changed between two captures of the same service". You need a baseline to use it. But every regression-hunting workflow — pre-deploy vs post-deploy, before patch vs after patch, A/B test arm 1 vs arm 2, this build vs last week's build — has a baseline by construction. So in practice, every time you are about to compare two flame graphs by eye, you should be running a diff instead.
The capture-and-fold pipeline, end to end
The implementation is mechanical but each step matters because misalignment between baseline and target capture conditions corrupts the diff. The pipeline:
- Capture baseline. perf record -F 99 -p <pid> -g --call-graph=dwarf -o /tmp/base.perf -- sleep 60. Record at 99 Hz (just under the round 100 Hz, to avoid lockstep sampling with periodic timers) for 60 seconds. Use --call-graph=dwarf for accurate user-space unwinding when frame pointers are missing (the default for most modern compiled languages).
- Capture target. Same command, against the post-deploy build's PID, on the same host class (same instance type, same kernel, same CPU governor setting), under the same load (same wrk2/locust arrival pattern, same data set).
- Fold both. perf script -i /tmp/base.perf | stackcollapse-perf.pl > base.folded; same for target.perf. Each line is frame_a;frame_b;frame_c <count>.
- Diff the folded files. difffolded.pl base.folded target.folded > diff.folded. The output carries two count columns: frame_a;frame_b;frame_c <baseline_count> <target_count>.
- Render the diff. flamegraph.pl diff.folded > diff.svg. On seeing the two-count format, flamegraph.pl switches to differential colouring and maps each frame's fill by (target - baseline) / max(target, baseline) clipped to [-1, +1]: deep red at +1, white at 0, deep blue at -1.
The Python driver below wraps the entire pipeline, captures both samples, runs the diff, prints the top changed frames as an alert payload, and saves the SVG. Realistic and runnable on any Linux host with perf and the FlameGraph tools installed:
# diff_flame.py — capture two perf samples, render a differential flame graph,
# and print the top frames whose sample share grew or shrank the most.
# Useful in a pre/post-deploy gate, an A/B canary comparison, or any
# "did this change help?" question.
import subprocess
import sys
import time
from collections import defaultdict
from pathlib import Path

PERF = "/usr/bin/perf"
STACKCOLLAPSE = "stackcollapse-perf.pl"  # from the FlameGraph repo, on PATH
DIFFFOLDED = "difffolded.pl"
FLAMEGRAPH = "flamegraph.pl"

def perf_record(pid: int, secs: int, out: Path) -> None:
    """Capture `secs` seconds of stack samples at 99 Hz from `pid`."""
    t0 = time.perf_counter()
    subprocess.run(
        [PERF, "record", "-F", "99", "-p", str(pid),
         "-g", "--call-graph=dwarf",
         "-o", str(out), "--", "sleep", str(secs)],
        check=True, capture_output=True)
    print(f"[perf record pid={pid}] wall={time.perf_counter()-t0:.1f}s "
          f"size={out.stat().st_size//1024} KiB → {out}")

def fold(perf_data: Path) -> str:
    """Run perf script | stackcollapse-perf.pl, return folded text."""
    p1 = subprocess.run([PERF, "script", "-i", str(perf_data)],
                        check=True, capture_output=True, text=True)
    p2 = subprocess.run([STACKCOLLAPSE], input=p1.stdout,
                        check=True, capture_output=True, text=True)
    return p2.stdout

def diff_top_frames(base: str, target: str, k: int = 8) -> list[tuple[str, int, int, float]]:
    """Aggregate by leaf frame, return (frame, base_total, target_total, pct_delta)."""
    def leaf_totals(folded: str) -> dict[str, int]:
        out: dict[str, int] = defaultdict(int)
        for line in folded.splitlines():
            stack, n = line.rsplit(" ", 1)
            leaf = stack.split(";")[-1]
            out[leaf] += int(n)
        return out
    b, t = leaf_totals(base), leaf_totals(target)
    b_tot, t_tot = sum(b.values()) or 1, sum(t.values()) or 1
    rows = []
    for frame in set(b) | set(t):
        b_share = 100 * b.get(frame, 0) / b_tot
        t_share = 100 * t.get(frame, 0) / t_tot
        rows.append((frame, b.get(frame, 0), t.get(frame, 0),
                     t_share - b_share))
    rows.sort(key=lambda r: -abs(r[3]))
    return rows[:k]

def render_diff(base_folded: str, target_folded: str, out_svg: Path) -> None:
    """difffolded.pl, then flamegraph.pl. Differential colouring is triggered
    automatically by the two-count folded format difffolded.pl emits."""
    Path("/tmp/base.folded").write_text(base_folded)
    Path("/tmp/target.folded").write_text(target_folded)
    p1 = subprocess.run([DIFFFOLDED, "/tmp/base.folded", "/tmp/target.folded"],
                        check=True, capture_output=True, text=True)
    p2 = subprocess.run([FLAMEGRAPH,
                         "--title=payments-router: pre-deploy vs post-deploy",
                         "--countname=samples"],
                        input=p1.stdout, check=True,
                        capture_output=True, text=True)
    out_svg.write_text(p2.stdout)

if __name__ == "__main__":
    base_pid, target_pid, secs = int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3])
    bp, tp = Path("/tmp/base.perf"), Path("/tmp/target.perf")
    perf_record(base_pid, secs, bp)
    perf_record(target_pid, secs, tp)
    base_folded = fold(bp)
    target_folded = fold(tp)
    render_diff(base_folded, target_folded, Path("/tmp/diff.svg"))
    print("\n[top diff frames]     base   target  Δshare frame")
    for frame, b, t, d in diff_top_frames(base_folded, target_folded):
        sign = "↑" if d > 0 else "↓"
        print(f"  {b:>8,d} {t:>8,d} {sign}{abs(d):5.2f}%  {frame}")
# Sample run on c6i.4xlarge (16 vCPU), kernel 6.6, perf 6.6.0, FlameGraph tip-of-tree.
# pids: 12 = old build (grpc 1.59), 18 = canary (grpc 1.62). 60s capture each
# under wrk2 -R 4000 -c 200 against POST /v1/payments/capture.
[perf record pid=12] wall=60.4s size=14328 KiB → /tmp/base.perf
[perf record pid=18] wall=60.3s size=15102 KiB → /tmp/target.perf
[top diff frames]     base   target  Δshare frame
    14,231   18,844 ↑ 4.81%  _validate_metadata   ← grpc 1.62 added this call
     1,840    4,777 ↑ 3.06%  pb_decode_varint
     9,140    8,211 ↓ 0.98%  _PyEval_EvalFrameDefault
     3,200    2,911 ↓ 0.32%  PyDict_GetItem
    11,844   11,890 ↑ 0.05%  epoll_wait
     4,430    4,471 ↑ 0.04%  tcp_recvmsg
Walk-through. perf record -F 99 -g --call-graph=dwarf captures stacks at 99 Hz with full DWARF unwinding — sampling just under 100 Hz avoids resonance with periodic activity such as the kernel timer tick and housekeeping kthread wakeups. stackcollapse-perf.pl turns perf script output (one frame per line, blank-line-separated samples) into the canonical folded format (one stack per line, ;-joined, count at end). difffolded.pl is a short Perl script that reads two folded files, builds a hash keyed by stack string, and emits <stack> <base_count> <target_count> for every key seen in either file — missing-from-one is treated as zero. flamegraph.pl recognises the dual-count format, computes delta = (target - base) / max(target, base) per frame, maps that to a colour gradient (deep red = +1, white = 0, deep blue = -1), and emits the SVG. The leaf-frame aggregation in Python at the end is the alert payload: top of the list is _validate_metadata with +4.81% share — the regression is named, and attributed to grpc 1.62, in 60 seconds of capture and 2 seconds of analysis.
Why same-host, same-load capture matters: a flame graph diff measures the difference in sample distribution, not absolute time. If the baseline ran on c6i.4xlarge under 4000 req/s and the target ran on c6i.8xlarge under 6000 req/s, the diff will show every CPU-bound frame as "smaller" simply because the bigger box has more total CPU and the larger load shifted the bottleneck. The diff is meaningful only when the two captures are taken under matched conditions: same instance type, same kernel, same CPU governor setting (use performance not ondemand), same offered load, same data set. The single-host A/B trick — run two builds on the same machine on the same loadgen, capture each — is the gold standard; cross-host diffs require very careful normalisation.
The output frame list is the workflow's actionable surface. Reading top-to-bottom: _validate_metadata grew by 4.81% of total samples, going from ~14k to ~19k — that one frame is the regression. pb_decode_varint grew by 3.06%, which is downstream of the same code path (the new metadata validation walks more protobuf fields). _PyEval_EvalFrameDefault shrank by 0.98% — Python interpreter cycles got displaced by the new C-extension work, which is the expected pattern for a C-side regression. The remaining frames (epoll_wait, tcp_recvmsg) are within ±0.1%, the noise floor of a 60-second capture. The diff named the cause, the magnitude, and the displaced work, all from one render.
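The colour mapping itself is one line of arithmetic. A sketch of the normalisation (the gradient endpoints follow the deep-red/white/deep-blue convention described in the text; the exact RGB ramp inside flamegraph.pl may differ):

```python
def diff_colour(base: int, target: int) -> tuple[int, int, int]:
    """Map a frame's (base, target) counts to an RGB fill:
    deep red at delta = +1, white at 0, deep blue at -1."""
    if max(base, target) == 0:
        return (255, 255, 255)                       # frame absent on both sides
    delta = (target - base) / max(target, base)      # already in [-1, +1]
    delta = max(-1.0, min(1.0, delta))
    if delta >= 0:                                   # hotter: fade toward red
        fade = int(255 * (1 - delta))
        return (255, fade, fade)
    fade = int(255 * (1 + delta))                    # colder: fade toward blue
    return (fade, fade, 255)

print(diff_colour(100, 100))  # (255, 255, 255): unchanged, white
print(diff_colour(0, 300))    # (255, 0, 0): brand-new frame, deep red
print(diff_colour(300, 0))    # (0, 0, 255): vanished frame, deep blue
```

Note how the normalisation by max(target, base) makes a brand-new leaf saturate at deep red while a large ancestor carrying the same absolute delta only turns pale pink.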
A practical caution: the differential render is sensitive to stack-key alignment. If baseline and target captures use different unwinders (DWARF vs frame-pointer vs LBR), the same logical call site can produce different stack strings — _PyEval_EvalFrameDefault;handle_request;_validate_metadata vs _PyEval_EvalFrameDefault;handle_request;[unknown];_validate_metadata — and the diff will count them as separate stacks, yielding spurious +100% / -100% pairs. Always capture both sides with the same --call-graph mode, the same debuginfo state, and the same kernel symbol table. Mismatched captures generate visually striking but meaningless diffs.
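When the capture modes cannot be perfectly re-matched after the fact, one salvage option is to normalise stack keys before diffing. A hypothetical pre-diff filter that drops [unknown] placeholder frames so a partially-failed unwind aligns with its clean counterpart (use with care: it can also merge genuinely different stacks):

```python
def normalise_stack(stack: str) -> str:
    """Drop unwinder placeholder frames so partially-unwound stacks
    align with their fully-unwound counterparts before diffing."""
    return ";".join(f for f in stack.split(";") if f != "[unknown]")

def normalise_folded(text: str) -> dict[str, int]:
    """Parse folded text into {normalised_stack: count}, merging collisions."""
    out: dict[str, int] = {}
    for line in text.strip().splitlines():
        stack, n = line.rsplit(" ", 1)
        key = normalise_stack(stack)
        out[key] = out.get(key, 0) + int(n)   # merged keys accumulate
    return out

a = "py_eval;handle_request;_validate_metadata 40\n"
b = "py_eval;handle_request;[unknown];_validate_metadata 40\n"
# Same logical stack after normalisation -> no spurious +100% / -100% pair.
assert normalise_folded(a) == normalise_folded(b)
```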
Reading a differential flame graph in production
The colour code is the entire interface. The frame's hue encodes the sign and magnitude of the delta; everything else (frame name, position, width) is identical to a normal flame graph. Three rules of thumb cover most production cases.
Rule 1: a single fat red frame near the top is the regression. When a deploy adds a new function call on the hot path, the differential graph shows one column with one or two deep-red leaf frames and a thin red "spine" rising from those leaves to the entry point. Aditi's gRPC regression is exactly this shape: _validate_metadata is solid red at the top, request_encoder and handle_request and dispatch are pale red on the spine above it, the rest of the graph is grey. The eye sees the red, follows the spine down to the leaf, reads the leaf's name, and the diagnosis is done.
Rule 2: alternating red and blue in the same column means work moved, not grew. If compute_hash_xxhash shrank from 8% to 2% and compute_hash_blake3 grew from 0% to 6% in the same call-chain neighbourhood, the diff shows a deep blue patch (xxhash) next to a deep red patch (blake3), with the parent frames neutral. The total CPU spent on hashing is unchanged; the implementation changed. This pattern is what you expect after a library swap, an algorithm change, or a config flip — it tells you "the change you made took effect, but it neither helped nor hurt overall". The fact that the parent column is grey is the proof: all the wider-up frames absorbed the same total time, so the change is local to one leaf decision, not a regression.
Rule 3: a thin diffuse red haze across many unrelated frames is a frequency or noise artefact. If the CPU governor switched between captures (the baseline ran at 3.2 GHz turbo, the target at 2.4 GHz base), the same work takes ~33% longer everywhere, every CPU-bound frame accumulates proportionally more samples, and the differential graph paints the whole graph faintly red. Same for a thermal-throttle event that hit one capture but not the other, a noisy-neighbour VM on the post-deploy host, or a nice priority change. Diffuse coloration is the diff's signal that "you have a capture-condition mismatch, not a code change". The fix is to re-capture with locked CPU frequency (cpupower frequency-set --governor performance) and a thermal-stable window.
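Rule 3's signature, most frames drifting the same direction by a small, similar amount, can even be machine-checked before a human looks at the SVG. A heuristic sketch (the thresholds are invented, not from any real tool):

```python
def looks_like_condition_mismatch(diff_rows, small=0.30, frac=0.8):
    """Heuristic: if most changed stacks drifted the same direction by a
    small relative amount, suspect a capture-condition mismatch rather
    than a localised code change. diff_rows: (base_count, target_count) pairs."""
    ratios = [(t - b) / max(t, b)
              for b, t in diff_rows if max(t, b) > 0 and t != b]
    if not ratios:
        return False
    uniform_drift = [r for r in ratios if 0 < r < small]
    return len(uniform_drift) / len(ratios) >= frac

# A uniform ~20% slowdown everywhere (e.g. governor dropped the frequency):
uniform = [(100, 120), (500, 610), (80, 97), (1000, 1195), (60, 74)]
print(looks_like_condition_mismatch(uniform))    # True: diffuse haze

# A single hot new leaf, everything else flat-to-slightly-down:
localized = [(0, 300), (500, 495), (80, 79), (1000, 990), (60, 60)]
print(looks_like_condition_mismatch(localized))  # False: real regression shape
```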
There is one more pattern worth knowing: the asymmetric diff, where one frame is solid deep red but has no corresponding blue anywhere. This means the new code path added work without displacing any old work — the service is simply doing more total work per request. If on-CPU time grew from 14% to 18% utilisation while throughput is unchanged, every frame on the differential graph absorbs some of that extra 4 percentage points, and the deepest-red leaf names what the new work was. Compare this to the symmetric "work moved" pattern, where total CPU is unchanged and one frame's gain is exactly another's loss. The asymmetric case is the most common deploy regression shape; the symmetric case is rarer but more interesting because it usually means an explicit refactor or library swap.
A subtle reading discipline: trust the colour, not the width. The differential graph's frame widths are typically rendered from the target capture (post-change), not the diff itself, so a red frame and a blue frame can be the same width. The width tells you "this is how big the frame is now"; the colour tells you "this is how it changed". Beginners read width and miss the message; the colour is where the diagnosis lives.
Capture-condition control — what "matched" actually means
Most failed differential analyses fail at the capture stage, not the diff stage. Two captures are "matched" only if every variable that affects the sample distribution has been held constant between them. The list is longer than it sounds, and the Razorpay payments team's runbook for capturing a "regression diff" enumerates it explicitly:
- Same instance type and the same physical host class. A c6i.4xlarge and a c6a.4xlarge have different CPUs (Intel Ice Lake vs AMD Milan); the same Python code generates different stack signatures because numpy ships AVX-512 paths on Intel and AVX2 paths on AMD. Diffing across instance types produces frame-name mismatches that look like real changes.
- Same kernel, same kallsyms, same debuginfo. A kernel upgrade between captures rewrites kernel-frame names; a linux-tools upgrade changes how perf walks user stacks. Both produce stack-string mismatches that artefactually look like regressions or improvements.
- Same CPU governor. If the baseline ran with performance (locked at max frequency) and the target ran with ondemand (boosts on demand), every CPU-bound frame consumes a different number of cycles per sample, and the diff highlights every CPU-heavy column as red even though no code changed.
- Same load profile. Diff under matched offered load: same wrk2 -R <rate>, same payload distribution, same key distribution in any cache lookups. A 10% load shift between captures shifts which frames hit the page cache vs disk, which is real but is workload drift, not a code regression.
- Same warm-up state. JIT-compiled runtimes (CPython 3.13 with the experimental JIT, PyPy, JVM, V8) generate different stack signatures pre-warm-up vs post-warm-up. The baseline must have run for the same duration as the target before sampling started, typically 30–60 seconds of synthetic load before perf record begins.
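Parts of this checklist can be machine-checked before each capture. A sketch that snapshots the host facts alongside each perf.data file and refuses to diff on mismatch (the sysfs path is the standard Linux location for the cpufreq governor; the gate logic itself is illustrative):

```python
import platform
from pathlib import Path

def capture_conditions() -> dict[str, str]:
    """Snapshot the host facts that must match between baseline and target."""
    gov = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
    return {
        "kernel": platform.release(),    # e.g. "6.1.85"
        "machine": platform.machine(),   # e.g. "x86_64"
        "governor": gov.read_text().strip() if gov.exists() else "unknown",
    }

def assert_matched(base: dict[str, str], target: dict[str, str]) -> None:
    """Refuse to diff when any recorded capture condition differs."""
    mismatches = {k: (v, target.get(k))
                  for k, v in base.items() if target.get(k) != v}
    if mismatches:
        raise RuntimeError(
            f"capture conditions differ; a diff would be misleading: {mismatches}")

# Record capture_conditions() alongside each perf.data file, then gate the diff:
assert_matched({"kernel": "6.1.85", "governor": "performance"},
               {"kernel": "6.1.85", "governor": "performance"})   # passes silently
```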
The matched-capture checklist is not just bureaucracy. The Cleartrip search team in 2024 spent six hours chasing a phantom 12% regression that turned out to be a kernel patch (5.15.0-105 → 5.15.0-110) that changed how __schedule was inlined; once they re-captured the baseline on the same kernel as the target, the diff was clean and the actual deploy had introduced a 0.3% regression, well below the noise floor. The cost of a careful capture is a 5-minute checklist; the cost of an un-careful capture is hours of debugging an artefact.
Why even small kernel patch numbers matter for diff stability: kernel inlining decisions are part of the build, not the runtime. A patch-level kernel update can change which kernel functions are inlined into __schedule or do_syscall_64, which changes the captured stack string for the same logical event. Two captures that "look" like the same kernel can produce structurally different stacks if the build configurations differ. The defensible rule is: pin the kernel version, the kernel command-line, and the kernel modules loaded; only then is the stack-key namespace stable enough to diff against.
The Zerodha matching-engine team's 2024 deploy regression hunt is a worked example. Their pre-deploy capture was taken on a c6i.4xlarge running kernel 6.1.85 with performance governor, 60 seconds at 99 Hz under a synthetic 5000-orders-per-second arrival pattern. The post-deploy capture was supposed to match — same instance, same kernel, same governor, same arrival rate — but the on-call SRE forgot to re-warm the JIT, so the first 12 seconds of the post-deploy capture were JIT-compilation samples that did not exist in the baseline. The diff showed a giant red pyjit_compile frame that did not represent a real regression. The team learned the lesson, added a 60-second warm-up to the runbook, and re-captured. The actual diff (after warm-up) showed the real regression: a 1.8% red frame in the FIX-message parser, traced to a config flip that disabled connection-pool reuse. Total time-to-fix on the second attempt was 14 minutes; the first attempt's wasted hour was entirely the warm-up gap.
Going beyond on-CPU diffs — the wider family
The differential rendering trick works for any signal that produces a folded-stack file, not just on-CPU profiling. The same difffolded.pl plus flamegraph.pl pipeline handles off-CPU diffs (samples weighted by sleep time), allocation diffs (samples weighted by bytes allocated), wall-clock diffs (samples taken regardless of on/off-CPU state), and lock-contention diffs (samples weighted by lock-wait time). Each variant answers a different "what changed" question and uses a different capture mechanism, but the diff stage is identical.
The Hotstar SRE team uses off-CPU differential graphs to catch deploy regressions in I/O-bound services where the on-CPU graph is small to begin with. Their recommendation-service runs at 8% CPU, so the on-CPU diff is dominated by sample noise; the off-CPU diff (capture with offcputime-bpfcc for 60s, fold, diff) showed a 280 ms median wait on a Cassandra read after a driver upgrade — the on-CPU graph never could have surfaced that, because the regression was in blocked time, not running time. The off-CPU diff turned a four-hour incident debug into a 12-minute one.
The Razorpay payments team uses allocation differential graphs (capture allocation stacks with a memory profiler such as memray, fold, diff) to track memory-pressure regressions. A 2024 deploy of payment-router quintupled the allocation rate of a small string-formatter helper that was now called inside a tight loop; the allocation diff highlighted that frame in deep red while every other allocation site was neutral. The fix was to hoist the format string out of the loop; the alloc rate dropped from 240k/s to 8k/s, and the next-day GC pause distribution moved from p99 = 18 ms to p99 = 4 ms.
The IRCTC tatkal-window team uses lock-contention differential graphs (from mutrace output, folded with a custom script, diffed) to track which session-store lock got hotter after a config change. A 2023 change to the connection-pool size doubled lock-wait time on the booking-confirmation mutex; the diff named the lock and the call site within seconds, and the rollback was decided in the same minute the alert fired.
A useful mental rule: whenever you have two captures of anything stack-attributed, you can diff them. The capture mechanism is independent of the diff. New diff variants — kernel-tracepoint diffs, syscall-latency diffs, GC-pause-source diffs — are all just "fold the data into the canonical format, then run difffolded.pl and flamegraph.pl". The format is the standard; everything else is composable.
Common confusions
- "A diff colour map and a regular flame graph colour map are the same." They are unrelated. A regular flame graph maps colour from a hash of the frame name (so _PyEval_EvalFrameDefault is always the same orange across captures); the colour roughly means "this is a Python frame". A differential graph maps colour from (target - baseline) / max(target, baseline) per frame; the colour means "this frame got hotter or colder". Reading a diff with regular-graph muscle memory leads to "all the orange frames look the same as before" thinking; the diff is signed, not categorical.
- "You can diff two flame graphs that came from different services." No. The folded-stack key has to mean the same call site in both captures. Diffing payments-router against notification-service produces a graph where every frame is +100% or -100% (each frame exists in only one service). That is not a regression analysis; it is a service comparison, and the differential format is the wrong tool for it.
- "difffolded.pl and flamegraph.pl's differential mode do the same thing." No. difffolded.pl is the data-side diff: it reads two folded files and emits a three-column file with both counts. flamegraph.pl is the render-side colour mapper: when it sees the two-count format it switches to differential colouring and produces the SVG. You always need both, in that order.
- "A bigger sample time means a more accurate diff." Up to a point. Beyond ~5 minutes per capture you stop accumulating new stack signatures — the long tail of unique stacks saturates — and the diff stops getting more accurate, while the risk of workload drift between captures (load shifts, cache-warmth changes) grows. 60–120 seconds per side is the sweet spot for most services.
- "The diff is symmetric: diff(A, B) = -diff(B, A)." Approximately, but not exactly. The default colour normalisation is delta / max(target, baseline), which is asymmetric: swapping the order changes the denominator frame by frame. Use the --negate flag on flamegraph.pl to flip the colour sign without re-running the diff, instead of swapping inputs.
- "A grey frame in a diff means nothing happened there." Grey means the change was below the noise floor, typically less than ±0.5% of total samples. A frame that gained 0.1% of samples may be perfectly real but invisible to the diff render. For sub-1% regressions you need longer captures, statistical resampling (bootstrap CI on per-frame deltas), or a different tool.
Going deeper
Statistical noise floor and per-frame confidence intervals
A 60-second capture at 99 Hz produces roughly 6000 samples per CPU. If the service is multi-threaded across 16 cores, you have ~96000 total samples — sounds like a lot, but the long tail of stack signatures is heavy. A frame that absorbs 0.5% of samples is ~480 samples; the standard error on a 0.5% proportion at n=96000 is √(0.005 × 0.995 / 96000) ≈ 0.022%, so a 95% CI is roughly ±0.045%. Per-frame deltas under 0.1% are within sampling noise and should not be acted on.
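The same arithmetic, as code (a direct transcription of the binomial standard-error formula used above):

```python
import math

def share_ci(share: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of the ~95% CI on a frame's sample share (as a fraction)."""
    return z * math.sqrt(share * (1 - share) / n_samples)

# The worked example: a 0.5%-share frame in a 96,000-sample capture.
half = share_ci(0.005, 96_000)
print(f"±{100 * half:.3f}%")   # prints ±0.045%, matching the text

def min_actionable_delta(share: float, n_base: int, n_target: int) -> float:
    """A per-frame delta is only actionable when it exceeds the combined
    sampling noise of both captures."""
    return math.hypot(share_ci(share, n_base), share_ci(share, n_target))
```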
Brendan Gregg's stripped-flame-graph work (2022) added per-frame bootstrap confidence intervals to the differential render: capture each side N times (say, 5 × 60s windows), bootstrap-resample the per-frame counts, and compute CI for each per-frame delta. Frames whose CI does not cross zero render in saturated colour; frames whose CI crosses zero render in muted colour. The visualisation tells you "this red frame is real" vs "this red frame might be noise" without forcing you to know the formula.
A practical heuristic if you do not want to set up bootstrap captures: take three baseline captures and three target captures, render the worst-case diff (target with smallest red frame vs baseline with largest red frame), and only act on frames that remain red in the worst-case render. The Flipkart catalogue team uses this triple-capture rule for any deploy that increases p99 by more than 5% — the cost is 6× the capture time, the payoff is filtering out half the false alarms that single-capture diffs generate.
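A minimal version of the bootstrap idea, assuming several repeated capture windows per side (the per-window counts below are invented):

```python
import random

def bootstrap_delta_ci(base_windows, target_windows, frame, iters=2000, seed=7):
    """Bootstrap a CI on one frame's share delta across repeated capture windows.
    base_windows / target_windows: lists of {frame: count} dicts, one per window."""
    rng = random.Random(seed)
    def share(window):
        total = sum(window.values()) or 1
        return window.get(frame, 0) / total
    deltas = sorted(share(rng.choice(target_windows)) - share(rng.choice(base_windows))
                    for _ in range(iters))
    return deltas[int(0.025 * iters)], deltas[int(0.975 * iters)]

# Three 96,000-sample windows per side; frame "f" roughly doubled its share.
base   = [{"f": 480, "other": 95_520}, {"f": 510, "other": 95_490},
          {"f": 465, "other": 95_535}]
target = [{"f": 950, "other": 95_050}, {"f": 990, "other": 95_010},
          {"f": 920, "other": 95_080}]
lo, hi = bootstrap_delta_ci(base, target, "f")
print(lo > 0)  # True: the CI excludes zero, so the red frame is real
```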
Why the diff is computed at the folded-text stage and not at the SVG stage
The folded-stack file is a canonical, lossless representation of a flame graph: every stack and its count, no rendering decisions baked in. You can sort, filter, aggregate, and diff folded files with arbitrary text-processing tools (awk, Python, sort | uniq -c) in a way you cannot with the SVG, where stacks have already been turned into rectangles with positions and widths. Diffing two SVGs would require parsing the SVG back into stacks (lossy — frame names get truncated, group structure flattened) before subtracting, so the project's design correctly puts the diff at the data layer. This is the same architectural choice as in git diff (text-level diff, then rendered) or diff-pdf (raster-level diff, then visual): the diff must operate on the canonical form, the render is a downstream presentation concern.
Reproduce this on your laptop
# Reproduce on Linux 5.4+
sudo apt install linux-tools-common linux-tools-generic
git clone https://github.com/brendangregg/FlameGraph
export PATH=$PWD/FlameGraph:$PATH
python3 -m venv .venv && source .venv/bin/activate
pip install requests
# In one terminal: start two versions of a load target on ports 8001 and 8002.
# Then capture and diff:
sudo python3 diff_flame.py $(pgrep -f ':8001') $(pgrep -f ':8002') 60
xdg-open /tmp/diff.svg
Differential graphs in continuous-profiling pipelines
Pyroscope, Polar Signals, and Datadog Profiler all support computing differential flame graphs server-side from continuously-uploaded folded data: the user picks two time windows ("yesterday 14:00 to 14:30" vs "today 14:00 to 14:30"), the backend pulls the per-window aggregated folded files from object storage, runs difffolded.pl (or its Go equivalent in Pyroscope's case), and serves the differential SVG. The whole pipeline takes 200 ms server-side for a typical 60M-sample query. The Flipkart Big Billion Days team uses Pyroscope's diff view to compare the same hour's profile across deploy generations during the sale — "BBD-day-2 14:00 vs BBD-day-1 14:00" answers "did our overnight optimisation help?" without anyone re-running benchmarks. The architectural lesson is the same as the Tail-at-Scale trick: store the lossless folded data, defer the render decision until the user asks the specific question, and the diff falls out for free.
What the differential graph cannot tell you
The diff is a what-changed tool, not a why tool. If the diff names _validate_metadata as the regressed frame, you still need to (a) read the gRPC 1.62 changelog to see what _validate_metadata does that 1.59 didn't, (b) check whether the regression is constant per-call or proportional to payload size, and (c) decide whether to roll back, fix forward, or accept. The diff also cannot tell you whether the regression is worth fixing — _validate_metadata may have grown by 4.81% but added a critical security check that you cannot remove. The diff is the diagnostic; the fix is a separate engineering judgement. Engineers who treat the diff as "fix the red frame" miss this and end up reverting useful changes.
The diff also cannot distinguish "this frame got hotter because the function is slower" from "this frame got hotter because the function is called more often". Both look like a wider red column. Disambiguating requires a second signal — a per-function call-count counter (instrumented via perf probe or a bpftrace uprobe) — and dividing the sample-count delta by the call-count delta. Razorpay's payments team adds a per-deploy uprobe-counter sweep on the top-3 red frames precisely for this disambiguation: _validate_metadata grew 4.81% in samples but its call-count was unchanged, so the per-call cost grew — a true latency regression, not a call-rate regression. Had the call-count grown 4.81% with per-call cost flat, the diagnosis would point at the caller (something is now triggering this function more often), not the function itself. The differential flame graph hands you the suspect; the call-count probe interrogates them.
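The disambiguation arithmetic is trivial once both signals exist. A sketch (the threshold and counts are illustrative, not Razorpay's pipeline):

```python
def classify_regression(base_samples, target_samples,
                        base_calls, target_calls, tol=0.10):
    """Split a sample-count gain into call-rate change vs per-call-cost change."""
    call_ratio = target_calls / base_calls                 # called more often?
    cost_ratio = ((target_samples / target_calls)
                  / (base_samples / base_calls))           # slower per call?
    if call_ratio > 1 + tol and cost_ratio <= 1 + tol:
        return "call-rate regression: look at the caller"
    if cost_ratio > 1 + tol and call_ratio <= 1 + tol:
        return "per-call-cost regression: look at the function"
    return "mixed or unchanged"

# Sample count grew ~32% while the call count stayed flat:
print(classify_regression(base_samples=14_000, target_samples=18_500,
                          base_calls=240_000, target_calls=240_500))
# -> per-call-cost regression: look at the function
```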
Where this leads next
The differential pattern is the bridge from one-off profile-reading to continuous performance management. The previous chapters in Part 5 covered how to capture a flame graph (/wiki/perf-from-scratch), how to read one (/wiki/flamegraphs-reading-them-and-making-them), and how to extend the technique to off-CPU time (/wiki/off-cpu-flamegraphs-the-other-half). The diff is what turns those captures from "an artefact you stare at during one incident" into "an artefact you compare across time and across deploys".
Continuous profiling in production (/wiki/continuous-profiling-in-production) is the natural next chapter — running a profiler permanently at low overhead, archiving folded data, and serving differential views on demand. The diff workflow described here is what makes the archived data valuable; without diff, continuous profiling produces a haystack of flame graphs with no way to ask "which deploy made p99 climb?".
Profile-guided optimisation (PGO) (/wiki/profile-guided-optimisation) closes a different loop: the differential graph fingers a regression, you fix it, you generate a new profile, you feed that profile back to the compiler so it inlines / lays out / branch-predicts better. PGO is the differential workflow turned into an engineering discipline rather than a debugging one — measure, change, re-measure, ship.
The mental shift to take into the next part of the curriculum: a flame graph alone is incomplete information. The same flame graph rendered twice (before and after a change) is complete information for a regression-hunting workflow. The diff is the operator that turns two artefacts into one diagnosis, and once you internalise it, you stop looking at single flame graphs at all — every capture becomes one half of a future diff.
A useful organisational practice borrowed from Hotstar's reliability team: every deploy archives a 60-second flame-graph capture, taken five minutes after the rollout completes, into S3. The artefact costs ~2 MB per service per deploy and lives for 90 days. Every regression alert that fires within the next 24 hours auto-pulls the deploy's flame graph and the previous deploy's flame graph, computes the differential, and attaches the SVG to the alert. The on-call engineer opens the page already knowing which frame regressed, before they have even decided whether to investigate. The cost is one S3 lifecycle policy and a 50-line capture script in the deploy pipeline; the payoff is the kind of "first-glance diagnosis" that turns a 40-minute incident into a 4-minute one. Adopting this pattern in your own organisation does not require new tools — perf record, flamegraph.pl --differential, and an S3 bucket are sufficient.
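The auto-diff step in that pipeline is tiny because the folded format is tiny: one `stack count` line per unique stack. A minimal Python sketch of the folded-stage merge (my own illustration of the data model, assuming the `stack count1 count2` output shape that `difffolded.pl` produces; read the real 60-line script for the authoritative version) looks like this:

```python
from collections import defaultdict

def diff_folded(before_lines, after_lines):
    """Merge two folded profiles into one 'stack count_before count_after'
    line per stack seen in either profile -- the input shape that
    flamegraph.pl renders as a red/blue differential."""
    counts = defaultdict(lambda: [0, 0])   # stack -> [before, after]
    for idx, lines in enumerate((before_lines, after_lines)):
        for line in lines:
            # Folded format: semicolon-joined frames, a space, a sample count.
            stack, _, n = line.rstrip().rpartition(" ")
            counts[stack][idx] += int(n)
    return [f"{stack} {before} {after}"
            for stack, (before, after) in sorted(counts.items())]
```

A stack present in only one profile gets a zero on the other side, which is exactly how brand-new (all red) and vanished (all blue) code paths show up in the rendered diff.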
References
The following list is curated, not exhaustive — pick the canonical Gregg blog post first if you read only one. The Pyroscope and Polar Signals docs are valuable when you are actually wiring up a continuous-profiling system; the Coz paper is for engineers thinking about predictive differential analysis ("what if I sped this function up?").
- Brendan Gregg, "Differential Flame Graphs" — the canonical introduction; covers the `difffolded.pl` design, the colour-map choices, and the `--negate` flag for swap-without-recapture.
- Brendan Gregg, *Systems Performance* (2nd ed., 2020), §6.21 "Differential Profiling" — the textbook treatment, with worked examples beyond on-CPU.
- brendangregg/FlameGraph — `difffolded.pl` source — the 60-line reference implementation; reading it end-to-end takes 4 minutes and clarifies the data model.
- brendangregg/FlameGraph — `flamegraph.pl --differential` source — the colour-mapping logic for `--differential` is the body of `color_scale()`; useful when debugging unexpected hues.
- Grafana Pyroscope — Diff view documentation — server-side differential rendering for continuously-collected profiles, with time-window selection.
- Coz: Causal Profiling (Curtsinger & Berger, SOSP 2015) — a related-but-distinct technique that synthesises differential information by virtually speeding up one function and measuring the program-level effect; complementary to flame-graph diffs when you need to predict the impact of a hypothetical fix.
- /wiki/flamegraphs-reading-them-and-making-them — the previous chapter; the rectangle-reading discipline that the diff inherits unchanged.
- /wiki/off-cpu-flamegraphs-the-other-half — the off-CPU capture mechanism; the differential pattern in this chapter applies directly to off-CPU folded files via the same `--differential` flag.