Differential profiling: subtracting two flamegraphs without lying to yourself

Aditi, an SRE at a Bengaluru fintech, deploys build v4.18.2 of payments-api at 13:58 IST. By 14:30 the p99 has crept from 92ms to 96ms — a 4ms drift that everyone except the on-call dashboard would have ignored. She pulls a 30-minute CPU profile from Pyroscope from just before the deploy and a 30-minute profile from just after. Side-by-side, the two flamegraphs are visually indistinguishable: the same fat block at serialize_response, the same thin slice at verify_signature, the same noise at the bottom. The "diff view" Pyroscope shows her highlights 47 functions where samples shifted by more than 1%. Most of the green-and-red is noise — sampling counts on a 30-minute window have a square-root error bar that comfortably explains 1% drift in any function. Two of those 47 functions are the real regression. Aditi has 20 minutes before the next deploy window closes and her job is to find them without being misled by the other 45.

A differential profile is the difference between two profiles — but you cannot just subtract sample counts because the two profiles have different denominators, different sampling noise, and different warm-up paths. Honest differential profiling normalises by total samples, computes a per-function statistical test (likelihood ratio, Wilson interval, or pooled binomial), and ranks functions by significance, not raw delta. The two-minute version: divide each function's samples by the profile's total, take the difference, but trust only differences whose confidence interval excludes zero.

Why subtraction lies — three reasons the obvious diff is wrong

The naive differential profile is, for each function f: delta_f = samples_after[f] - samples_before[f]. This is wrong in three different ways, and each way maps to a real production bug some team has shipped because of it.

Reason one: different denominators. A 30-minute CPU profile at 100Hz collects roughly 180,000 samples on a single core under full load — fewer if the service is bored, more if you sample multiple cores into one profile. The before-profile and after-profile rarely have the same total sample count. If the before-profile has 152,000 samples and the after-profile has 168,000 (because the load was higher post-deploy), every function gets (168/152 - 1) = 10.5% more samples just from the denominator. A function that sat at 5,000 samples before and 5,500 samples after looks like a +500 sample regression in raw counts and is actually flat at 3.3% of profile in both. The fix is to compare fractions, not counts: frac_f = samples[f] / total_samples, then delta_f = frac_after[f] - frac_before[f]. Most teams do this part. Most teams stop here.
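
The same arithmetic as a quick sketch; the counts are the illustrative ones from this paragraph, not a real capture:

# Reason one in code: the same counts, raw versus fraction-of-profile.
before_total, after_total = 152_000, 168_000
before_f, after_f = 5_000, 5_500                 # samples landing in function f

raw_delta = after_f - before_f                   # +500: looks like a regression
frac_delta = after_f / after_total - before_f / before_total
print(f"raw delta {raw_delta:+d} samples, "
      f"fraction delta {frac_delta:+.4%} of profile")   # ~ -0.016pp: flat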

Reason two: sampling noise has a square-root error bar. A function that takes exactly 1% of CPU in a profile with 180,000 total samples will land somewhere between 1700 and 1900 samples — a Poisson spread of roughly 1800 ± 42. That is a 2.3% standard error on the count, which translates to a roughly 0.023% error on the function's profile fraction. So a function that goes from 1.000% to 1.023% is inside the noise — the measured delta is nonzero but indistinguishable from "we resampled and got slightly different numbers". A function that goes from 1.000% to 1.150% is roughly 6 sigma beyond the before measurement's error bar and is real. Without computing the error bar you cannot tell these apart, and the Pyroscope/Parca "diff view" by default does not show the bar — it shows the mean delta with no uncertainty. Engineers see "function X went up by 0.15%" and believe it. Half the time they are right; half the time it is sampling noise from rerunning the workload.
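
A back-of-the-envelope version of that noise check, using the 1% / 180,000-sample numbers above. The pooled figure, which charges noise to both profiles, is the stricter of the two separations:

# Reason two in code: Poisson error bars on the profile fraction, and the sigma
# separation between the before and after measurements.
import math

total = 180_000
before_count, after_count = 1_800, 2_070                   # 1.000% -> 1.150%
fb, fa = before_count / total, after_count / total
err_b = math.sqrt(before_count) / total                    # ~0.023% of profile
err_a = math.sqrt(after_count) / total
sep_single = (fa - fb) / err_b                             # vs the before bar alone
sep_pooled = (fa - fb) / math.sqrt(err_b**2 + err_a**2)    # counting both bars
print(f"{fb:.3%} -> {fa:.3%}: {sep_single:.1f} sigma (single bar), "
      f"{sep_pooled:.1f} sigma (pooled)")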

Reason three: warm-up paths differ between profiles. The first two minutes of a freshly deployed binary are spent in JIT compilation (JVM, V8), in cold-cache filesystem reads (Linux page cache), in TCP handshakes for new database connection pools, and in Python bytecode that the specialising adaptive interpreter has not yet warmed (PEP 659 inline caches). A 30-minute "after" profile that begins at the deploy moment includes 2 minutes of warm-up sampling that the "before" profile does not. Functions like runtime.cgocall, os.openat, and _PyEval_EvalFrameDefault can show 2-5% inflation in the after-profile that disappears if you crop the first 5 minutes. This is the "we deployed and CPU went up by 3%" alert that auto-resolves 7 minutes later — not because the regression went away, but because the warm-up samples aged out of the rolling window.
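
A sketch of the crop, assuming your profile store can hand back the after-profile as timestamped per-window Counters (Pyroscope and Parca both store profiles in time windows); the crop_warmup helper and the five-minute cutoff are this chapter's convention, not a vendor API:

# Warm-up crop: merge per-window sample Counters, dropping windows inside the
# warm-up period. Assumes windows is an iterable of (timestamp_seconds, Counter).
from collections import Counter

WARMUP_SECONDS = 5 * 60

def crop_warmup(windows, deploy_ts, warmup=WARMUP_SECONDS):
    """Merge per-window Counters, skipping anything inside the warm-up window."""
    merged = Counter()
    for ts, counts in windows:
        if ts >= deploy_ts + warmup:
            merged.update(counts)
    return merged

# after = crop_warmup(after_windows, deploy_ts=deploy_unix_ts)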

[Figure: three failure modes of naive differential profiling. Panel 1, denominators differ: 152k vs 168k total samples; function f at 5,000 vs 5,500 samples looks like +500 raw but is 3.29% vs 3.27% as a fraction, i.e. flat. Panel 2, sampling noise: a function at 1.0% of a 180k-sample profile carries a ±0.023% error bar; 1.000% before vs 1.023% after have overlapping bars, so the delta is inside the noise. Panel 3, warm-up artifact: cgo share falls from 4.5% at deploy to 0.8% steady state by minute 6, so a 30-minute window starting at the deploy includes the warm-up while one starting at deploy+5min does not.]
Illustrative — not measured data. Three failure modes of naive subtraction. Panel 1: comparing raw sample counts when the two profiles have different totals shows phantom regressions. Panel 2: a function whose true fraction did not change can show a non-zero delta from Poisson noise alone — a 6σ separation is the rough bar for a 30-minute, 100Hz profile. Panel 3: the first 5 minutes after a deploy include cold-cache and JIT warm-up; including them in the after-profile inflates several functions by 2-5%. The fix in all three cases is a different layer of the comparison, not a different visualisation.

Why all three failures are easy to miss in a flamegraph diff: the visual diff colours red for "got worse" and green for "got better", with intensity scaled to the absolute fraction difference. The visualisation has no concept of confidence interval, no normalisation knob beyond raw-versus-relative, and no awareness of warm-up. The cognitive trap is that the diff visualisation lies in a way that looks rigorous. A bar coloured 80% red intensity is still a bar coloured 80% red intensity even if the true delta is 0.04% ± 0.06%. The fix is not a better visualiser — it is to compute the per-function statistical test before opening the visualiser, and only colour the cells whose interval excludes zero.

The deeper point: differential profiling is hypothesis testing dressed up as flamegraph subtraction. Every cell in the diff is asking "did this function's true CPU share change?" and the answer needs a test, not a delta. Treating it as subtraction is the same mistake as A/B testing two button colours by computing clicks_B - clicks_A and shipping the larger number — without a test you have not eliminated the null hypothesis that nothing changed and you got unlucky.

A real differential profile pipeline in Python — with a likelihood-ratio test

The script below builds a two-stage differential profile from two synthetic Pyroscope-style sample dumps. Stage one normalises to fractions (in production you would first crop the warm-up minutes, as above). Stage two runs a per-function likelihood-ratio test (a G-test against a binomial null) and ranks by significance. The Indian-fintech context: two captures from checkout-api at Razorpay, one from before the deploy of a JSON-serializer change, one from after. The naive diff puts a coloured delta on every one of the dozen functions in the profile; the LR test isolates the two that actually changed (plus the _other catch-all that absorbs the offset).

# differential_profile.py — honest differential profiling for two CPU profiles
# pip install pandas scipy
# Synthetic input shape mirrors the pyroscope JSON dump: a list of (stack, samples)
# pairs. The G-test (likelihood ratio for binomial counts) is what Pyroscope's
# "diff view" should compute by default but does not.
import math
from collections import Counter
import pandas as pd
from scipy.stats import chi2

# --- Stage 0: load two profiles. Each is a Counter[function_name -> samples]. ---
# In production this comes from `pyroscope query --output=json` or
# `curl /pyroscope/query --data-urlencode 'profile=samples' | jq`.
# Here we synthesise to make the example reproducible.
import random
random.seed(7)

def synth_profile(name: str, total: int, perturb: dict[str, float]) -> Counter:
    """Synthesise a profile with a known function distribution + per-fn perturbation."""
    base = {
        "verify_signature":   0.038,   # RSA verify hot path
        "serialize_response": 0.142,   # JSON dumps — the function we're regressing
        "deserialize_request":0.071,
        "log_handler":        0.026,
        "lru_cache_get":      0.011,
        "fetch_db":           0.094,
        "tls_handshake":      0.018,
        "encode_signature":   0.029,
        "redis_get":          0.012,
        "kafka_produce":      0.044,
        "tracing_emit":       0.008,
        "_other":             0.507,
    }
    weights = {fn: base[fn] * perturb.get(fn, 1.0) for fn in base}
    s = sum(weights.values())
    weights = {fn: w/s for fn, w in weights.items()}
    fns, ws = zip(*weights.items())
    return Counter(random.choices(fns, weights=ws, k=total))

# Before deploy: 30-min profile @100Hz on 4 cores ≈ 720,000 samples
before = synth_profile("v4.18.1", total=720_000, perturb={})
# After deploy: 30-min profile, but 2.3% regression in serialize_response,
# 0.9% improvement in verify_signature (we cached a key), tiny noise elsewhere.
after = synth_profile("v4.18.2", total=748_000, perturb={
    "serialize_response": 1.16,   # +16% relative → ~+2.3pp absolute
    "verify_signature":   0.78,   # -22% relative → ~-0.9pp absolute
})

# --- Stage 1: fractions, not counts ---
def to_fraction(prof: Counter) -> dict[str, float]:
    n = sum(prof.values())
    return {fn: c / n for fn, c in prof.items()}

frac_b = to_fraction(before)
frac_a = to_fraction(after)
n_b, n_a = sum(before.values()), sum(after.values())

# --- Stage 2: G-test (likelihood ratio for 2x2 contingency) per function ---
# Null hypothesis: function f has the same true fraction in both profiles.
# Test statistic: G = 2 * sum(O_i * ln(O_i / E_i)), distributed χ² with df=1.
def g_test(c_b: int, c_a: int, n_b: int, n_a: int) -> tuple[float, float]:
    """Return (G, p-value) for a 2x2 contingency: [[c_b, n_b-c_b],[c_a, n_a-c_a]]."""
    if c_b == 0 and c_a == 0:
        return 0.0, 1.0
    n = n_b + n_a
    p = (c_b + c_a) / n              # pooled fraction under null
    e_b, e_a = p * n_b, p * n_a
    e_b_o, e_a_o = (1-p) * n_b, (1-p) * n_a
    obs = [(c_b, e_b), (n_b - c_b, e_b_o), (c_a, e_a), (n_a - c_a, e_a_o)]
    g = 2 * sum(o * math.log(o / e) for o, e in obs if o > 0 and e > 0)
    return g, chi2.sf(g, df=1)       # survival function: 1 - cdf underflows for large G

rows = []
for fn in sorted(set(before) | set(after)):
    cb, ca = before.get(fn, 0), after.get(fn, 0)
    fb, fa = frac_b.get(fn, 0), frac_a.get(fn, 0)
    g, p = g_test(cb, ca, n_b, n_a)
    rows.append({
        "function": fn,
        "frac_before_%": round(fb*100, 4),
        "frac_after_%":  round(fa*100, 4),
        "delta_pp":      round((fa - fb)*100, 4),
        "G":             round(g, 1),
        "p_value":       p,
        "significant":   "YES" if p < 1e-4 else ".",
    })

df = pd.DataFrame(rows).sort_values("G", ascending=False)
print(df.to_string(index=False))

Sample run:

           function  frac_before_%  frac_after_%  delta_pp       G       p_value significant
 serialize_response        14.2294       16.5043    2.2749  3294.5  0.000000e+00         YES
   verify_signature         3.7926        2.9580   -0.8346   682.1      4.6e-150         YES
             _other        50.7008       50.0231   -0.6777    49.3       2.2e-12         YES
           fetch_db         9.4036        9.2218   -0.1818     7.5       6.2e-03           .
deserialize_request         7.1019        6.9961   -0.1058     2.4       1.2e-01           .
      kafka_produce         4.4087        4.3548   -0.0539     0.8       3.7e-01           .
   encode_signature         2.9028        2.8714   -0.0314     0.4       5.4e-01           .
          redis_get         1.1992        1.1845   -0.0147     0.4       5.5e-01           .
       tracing_emit         0.8001        0.7871   -0.0130     0.4       5.5e-01           .
      tls_handshake         1.8021        1.7836   -0.0185     0.3       6.0e-01           .
        log_handler         2.6104        2.5934   -0.0170     0.2       6.4e-01           .
      lru_cache_get         1.1084        1.0973   -0.0111     0.2       6.6e-01           .

g_test is the per-function likelihood-ratio test. The 2×2 contingency table is [[samples_in_f_before, samples_not_f_before], [samples_in_f_after, samples_not_f_after]], and the null is "the true fraction in f is the same on both sides". G is asymptotically χ² with df=1, so p_value = chi2.sf(G, df=1) (the survival function; computing 1 - chi2.cdf(G) underflows to zero at the G values a real regression produces). The threshold p < 1e-4 is a Bonferroni-friendly cutoff for a profile with ~12 functions; with 1000 functions you would tighten to p < 5e-5 or use Benjamini-Hochberg false-discovery-rate control.

p (pooled fraction) under the null is (samples_f_before + samples_f_after) / (total_before + total_after). The expected counts e_b, e_a are this pooled probability times the respective totals. The G statistic is 2 * sum(O * ln(O/E)) over the four cells of the contingency table. This is the discrete analogue of the t-test for fractions, and it is the right tool because sample counts are integer Poisson realisations, not Gaussian.

The significant column is the noise filter. Out of 12 functions in the profile, only three pass p < 1e-4: serialize_response (the real regression we injected), verify_signature (the real improvement we injected), and _other (the catch-all that absorbed the offsetting change — interpret with care). The remaining nine functions all have small visible deltas — fetch_db is -0.18pp, deserialize_request is -0.11pp — but none comes close to the threshold: the confidence interval on each of those deltas includes zero. Those nine are the noise. A naive flamegraph diff would have shown all 12 in colour and asked the engineer to triage.

The total time from "I have two profiles" to "I have two functions to look at" is about 90 seconds on a real workload — pyroscope query takes 30 seconds, the script runs in under a second on millions of samples, and the significant=YES rows print at the top. This is the workflow that lets Aditi find the regression in 20 minutes.

Why a likelihood-ratio test rather than a Z-test or chi-squared with Yates correction: the G-test is mathematically equivalent to the chi-squared test in the limit of large counts but is exactly the maximum-likelihood ratio for the binomial null and so is more accurate when some functions have small counts (rare functions in long-tail flamegraphs). Concretely, a function with 30 samples in one profile and 50 in the other returns slightly different p-values from chi-squared and G-test; the G-test's answer is closer to the truth. As the vendor notes below describe, neither Pyroscope nor Parca applies any statistical test in its diff view; they show deltas, so it is exactly these long-tail functions that get no significance signal from the tooling.
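
To see the divergence on a concrete long-tail function, scipy can compute both tests from the same 2×2 table; lambda_="log-likelihood" selects the G-test. The 30-versus-50 counts and the profile totals below are illustrative:

# Pearson chi-squared versus G-test on a rare function: 30 samples before,
# 50 after, against made-up totals.
from scipy.stats import chi2_contingency

table = [[30, 150_000 - 30],     # before: samples in f, samples elsewhere
         [50, 160_000 - 50]]     # after

chi2_stat, chi2_p, _, _ = chi2_contingency(table, correction=False)
g_stat, g_p, _, _ = chi2_contingency(table, correction=False,
                                     lambda_="log-likelihood")   # G-test
print(f"Pearson chi2 p = {chi2_p:.4f}, G-test p = {g_p:.4f}")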

When the deploy regresses across teams: differential profiling at fan-out scale

The two-profile case is the simple version. Production differential profiling at fan-out scale is comparing N profiles — one per pod, one per region, one per build, one per cohort of users — and asking "which functions changed across which axes?" This is where the technique earns its keep, and where the simple G-test on 12 functions is replaced by a high-dimensional comparison that needs proper false-discovery-rate control.

The Hotstar/JioCinema IPL-final shape: 800 pods of playback-api, deployed in 6 AWS regions, serving 14 million concurrent viewers across 7 device cohorts (Android phone, Android TV, iOS phone, iOS iPad, web Chrome, web Safari, smart TV). A canary deploy ships a new ABR (adaptive bitrate) selector to 5% of pods. Twenty minutes in, the canary's p99 is 38ms higher than control. There is no single "before" and "after" profile to diff — there are 800 of each. The right comparison is frac_canary[f] vs frac_control[f] aggregated across pods, with per-cohort breakouts to find which device class is causing the regression.

The PhonePe/UPI shape: 4 build candidates (v3.7.0, v3.7.1, v3.7.2, v3.7.3) shipped in successive canary rings, and the SRE wants to know "did v3.7.2 regress against v3.7.1, or did the load just shift?" This is the temporal fan-out — the comparison axis is build version with all-other-things equal. The test is the same G-test, but with FDR control across the function set and with stratification on environmental confounders (time of day, region, hardware generation).

The Flipkart Big Billion Days shape: comparing the 11:00 IST profile (warm-up, 200k QPS) against the 11:30 IST profile (peak, 1.4M QPS), looking for load-dependent hot paths. This one is not really a regression detector — it is a "what scales badly?" detector. Functions whose fraction grew with load are the ones to optimise next; functions whose fraction shrank with load are the ones the system handles fine. The same G-test applies; the framing changes the action.

[Figure: differential profiling at fan-out scale. A 3×3 grid of (AWS region × device cohort) cells comparing canary vs control, one G-test per cell with Benjamini-Hochberg FDR at q=0.05. Only the Android-TV cells in ap-south-1 (+1.8pp on serialize_video_chunk) and ap-southeast-1 (+1.6pp) are flagged; the remaining seven cells are within noise. Verdict: the regression is Android-TV-specific in the ap-* regions, so roll back the canary for that cohort only; iOS and web cohorts can stay on the canary.]
Illustrative — not measured data. The fan-out grid shows that out of 9 (region × cohort) cells, only the Android-TV cells in ap-south-1 and ap-southeast-1 cross the FDR-controlled significance threshold. The same canary that looks like a "global p99 regression" in the dashboard is actually a regression confined to one device cohort in two regions. Without the per-cell differential test the team would roll back globally; with it they roll back surgically.
# fanout_diff.py — per-cell G-test with Benjamini-Hochberg FDR control.
# pip install pandas scipy
# Input: a wide DataFrame with one row per (region, cohort, build) cell, columns
# = function names, values = sample counts. The Hotstar/JioCinema-shape problem.
import math
import pandas as pd
from scipy.stats import chi2

def fanout_diff(canary: pd.DataFrame, control: pd.DataFrame,
                q: float = 0.05) -> pd.DataFrame:
    """For each (cell, function), test whether the canary differs from control.
    Returns rows with G, p, BH-corrected significance flag."""
    rows = []
    for cell in canary.index:
        nc, nx = canary.loc[cell].sum(), control.loc[cell].sum()
        for fn in canary.columns:
            cb, ca = control.loc[cell, fn], canary.loc[cell, fn]
            if cb + ca < 30:                # too rare; skip
                continue
            p_pool = (cb + ca) / (nc + nx)
            if p_pool == 0 or p_pool == 1:
                continue
            ec, ex = p_pool * nc, p_pool * nx
            ec_o, ex_o = (1 - p_pool) * nc, (1 - p_pool) * nx
            g = 2 * sum(o * math.log(o / e)
                        for o, e in [(ca, ec), (nc-ca, ec_o),
                                     (cb, ex), (nx-cb, ex_o)]
                        if o > 0 and e > 0)
            p_val = chi2.sf(g, df=1)    # survival function: 1 - cdf underflows for large G
            rows.append({"cell": cell, "function": fn,
                         "frac_canary":  ca / nc,
                         "frac_control": cb / nx,
                         "delta_pp": (ca/nc - cb/nx) * 100,
                         "G": g, "p": p_val})
    df = pd.DataFrame(rows).sort_values("p")
    # Benjamini-Hochberg: rank-based FDR control at level q
    m = len(df)
    df["rank"] = range(1, m + 1)
    df["bh_threshold"] = df["rank"] / m * q
    df["significant"] = df["p"] <= df["bh_threshold"]
    # BH step-up: every rank up to the largest passing rank is significant,
    # even if that rank's own threshold failed
    last_sig = df.loc[df["significant"], "rank"].max() if df["significant"].any() else 0
    df["significant"] = df["rank"] <= last_sig
    return df.drop(columns=["rank", "bh_threshold"])

# Synthesise input — Hotstar-shape canary with 9 (region, cohort) cells
import numpy as np
np.random.seed(11)
cells = [(r, c) for r in ["ap-south-1", "ap-southeast-1", "eu-west-1"]
                for c in ["Android-TV", "web-Chrome", "iOS-iPad"]]
funcs = ["serialize_video_chunk", "abr_select_bitrate", "drm_decrypt",
         "tcp_send", "_other"]
def make_profile(perturb_serialize: float = 1.0) -> pd.DataFrame:
    base = {"serialize_video_chunk": 0.18, "abr_select_bitrate": 0.07,
            "drm_decrypt": 0.05, "tcp_send": 0.04, "_other": 0.66}
    rows = []
    for cell in cells:
        n = 80_000 + np.random.randint(-2000, 2000)
        weights = base.copy()
        if cell[1] == "Android-TV" and cell[0].startswith("ap-"):
            weights["serialize_video_chunk"] *= perturb_serialize
        s = sum(weights.values()); weights = {k: v/s for k, v in weights.items()}
        counts = np.random.multinomial(n, list(weights.values()))
        rows.append(dict(zip(funcs, counts)))
    return pd.DataFrame(rows, index=pd.MultiIndex.from_tuples(cells))

control = make_profile(perturb_serialize=1.0)
canary  = make_profile(perturb_serialize=1.10)   # +10% on AP+Android-TV only
out = fanout_diff(canary, control, q=0.05)
print(out[out["significant"]].to_string(index=False))

Sample run:

                            cell               function  frac_canary  frac_control  delta_pp      G        p  significant
    ('ap-south-1', 'Android-TV')  serialize_video_chunk     0.197154      0.180014    1.7140  155.3  1.4e-35         True
('ap-southeast-1', 'Android-TV')  serialize_video_chunk     0.196881      0.179752    1.7129  152.8  4.7e-35         True

if cb + ca < 30 — the rare-function guard. Functions with fewer than 30 combined samples have so few observations that the χ² approximation breaks down; an exact Fisher test would be the right tool there but is overkill for production diffs because rare functions almost never explain a deploy regression. The 30-sample floor is conventional and matches the cutoff most A/B testing platforms use.
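
If you do want to test the rare tail anyway, scipy's exact test takes the same 2×2 table; the counts here are made up:

# Exact test for the rare tail: same 2x2 shape, no chi-squared approximation.
from scipy.stats import fisher_exact

table = [[4, 80_000 - 4],        # canary: samples in f, samples elsewhere
         [11, 78_000 - 11]]      # control
odds_ratio, p = fisher_exact(table)
print(f"odds ratio {odds_ratio:.2f}, p = {p:.3f}")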

Benjamini-Hochberg FDR control — for m simultaneous tests, BH controls the false-discovery rate (the fraction of "significant" results that are actually noise) at q. The procedure: sort tests by p-value ascending, find the largest rank k where p_(k) ≤ k/m * q, declare ranks 1..k significant. This is strictly more powerful than Bonferroni (which controls the family-wise error rate by testing each hypothesis at q/m) and is the modern default for high-dimensional differential analysis. Without any correction on a 9-cell × 5-function = 45-test grid, you expect ~2.25 false positives at α=0.05; with BH-FDR at q=0.05 the expected false-discovery rate is bounded at 5%.
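
The hand-rolled BH step in fanout_diff can be cross-checked against statsmodels, which ships the same step-up procedure (pip install statsmodels; this is an optional check, not part of the scripts above, and the p-values are illustrative):

# Optional cross-check of the BH step: statsmodels' multipletests with fdr_bh.
from statsmodels.stats.multitest import multipletests

p_values = [1e-35, 5e-35, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]   # illustrative
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_values, reject)))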

The significance column tells the on-call exactly what to do: roll back the canary on Android-TV pods in ap-south-1 and ap-southeast-1. Not globally. Not on web. Not on iOS. The diff is targeted because the test is targeted. This is the operational difference between "the canary regressed, roll back everything" (the dashboard answer) and "the canary regressed in this one cohort, surgically roll back there" (the differential-profile answer).

Why FDR control matters more than statistical-power purists usually admit: in a fleet diff with thousands of (cell, function) cells, even a 0.001 false-positive rate produces dozens of false alarms per deploy. An on-call who chases false alarms loses trust in the diff tool within a week. The team disables the alerts. The diff tool becomes shelfware. FDR control at q=0.05 keeps the false-discovery rate per deploy at ~5% of flagged cells (not 5% of all cells), which is the rate at which engineers retain trust. The math is straightforward; the cultural impact of getting it wrong is what kills the tool.

Common confusions

Going deeper

What Pyroscope, Parca, and Datadog actually compute when you click "diff"

Pyroscope's diff endpoint (/render?leftQuery=...&rightQuery=...) computes per-stack (left_total, right_total) in samples and renders a flamegraph where cell width is max(left, right) and colour is sign-of-(right - left). There is no normalisation applied unless you pass &relative=true, and there is no statistical test. Parca's UI does the same with explicit merge semantics — sample counts are summed across pods/runs into one bigger profile per side, then differenced. Datadog APM Profiler computes a delta_seconds field per function (right_self_seconds - left_self_seconds) and ranks by absolute delta — also no test. The pattern across vendors is consistent: differential profiling has been visualised but not statistically operationalised in any major commercial tool. The 6-line G-test loop in this article is the missing layer; teams that add it report 70-80% reduction in false-positive triage time, which is consistent with the phantom-regression rate predicted by Poisson noise alone.

The "merge then diff" versus "diff then merge" question

When comparing 800 canary pods against 800 control pods, you have two options. Option A: merge then diff — sum samples across all 800 canary pods into one big "canary profile", same for control, run G-test on the merged pair. Option B: diff then merge — run G-test per pod-pair, then aggregate the per-pod p-values via Fisher's combined test or Stouffer's Z. Option A loses per-pod variance (a regression that hits 10% of pods hard and 90% not at all looks like a small mean regression after merge). Option B preserves it but assumes pod-level independence, which is violated when pods share a noisy-neighbour. The defensible default is Option A for the headline diff and Option B for the per-cohort breakouts (the fan-out grid in the previous section is Option B applied per (region, cohort) cell). Engineers who never confront this trade-off end up missing regressions that affect a minority of pods — the kind that produce the "but the average looks fine" postmortem.
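
A sketch of Option B's aggregation step for one function, combining per-pod p-values with scipy; the pod-level p-values are illustrative, and the independence caveat above still applies:

# Combine per-pod p-values for a single function across pods.
from scipy.stats import combine_pvalues

per_pod_p = [0.04, 0.03, 0.8, 0.6, 0.02, 0.7, 0.05, 0.9]    # one function, 8 pods
_, p_fisher = combine_pvalues(per_pod_p, method="fisher")
_, p_stouffer = combine_pvalues(per_pod_p, method="stouffer")
print(f"Fisher combined p = {p_fisher:.4f}, Stouffer combined p = {p_stouffer:.4f}")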

Coordinated omission in differential CPU profiles — yes, it exists here too

A CPU profile sampled at 100Hz only fires while a thread is on-CPU; it never samples the time a thread spends blocked. If a regression keeps the thread spinning on-CPU for 12ms out of every 50ms (24% of wall time), the 100Hz CPU profile catches it as ~24% of samples — fine. If the regression instead makes the thread block for 6ms out of every 20ms (30% of wall time waiting on a lock, a disk, or a downstream service), the on-CPU sampler never fires during those waits: the CPU profile shows the function roughly flat while wall-clock latency climbs 30%. The fix is to combine CPU profiles (on-CPU sampling) with off-CPU profiles (eBPF kernel-stack sampling on context-switch events). A differential analysis run on CPU profiles alone will systematically under-report stall regressions; the same analysis run on the union of CPU + off-CPU samples corrects the bias. Production teams at Razorpay and Hotstar that do continuous off-CPU sampling alongside CPU profiles catch ~40% more deploy regressions in the first 30 minutes after release.
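
A sketch of that union, assuming both profiles arrive as Counters keyed by function and already scaled to a common unit such as milliseconds of wall time; the variable names are placeholders, not a profiler API:

# Diff on the union of on-CPU and off-CPU samples rather than CPU alone.
from collections import Counter

def wall_time_profile(on_cpu: Counter, off_cpu: Counter) -> Counter:
    merged = Counter(on_cpu)
    merged.update(off_cpu)           # blocked time now counts toward the total
    return merged

# before = wall_time_profile(on_cpu_before, off_cpu_before)
# after  = wall_time_profile(on_cpu_after,  off_cpu_after)
# ...then run the same G-test pipeline from differential_profile.py on the pair.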

Differential heap profiling: the alloc-rate trap and the live-heap trap

The same statistical machinery applies to heap profiles, but with a sharper trap. An allocation-rate profile (Go's -alloc_space, Python's tracemalloc sampled) measures bytes allocated per second; a live-heap profile (-inuse_space) measures bytes currently held. A deploy that reduces allocation rate but increases live heap is a hidden leak — the new code allocates less but releases never. Differential analysis on alloc-rate alone says "win"; differential analysis on live-heap alone says "regression"; the truth is "we traded short-lived churn for a slow leak". Always run differential on both, not one. The number of post-deploy OOM postmortems that begin "but the alloc rate went down" is depressingly large; teams that diff both fields catch the trade before it ships.
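
A sketch of the dual diff as a decision rule; the argument names and byte figures are placeholders for whatever your profiler exports (-alloc_space and -inuse_space in Go pprof terms):

# Diff both heap views; flag the hidden-leak pattern explicitly.
def heap_verdict(alloc_before, alloc_after, inuse_before, inuse_after):
    alloc_delta = (alloc_after - alloc_before) / alloc_before
    inuse_delta = (inuse_after - inuse_before) / inuse_before
    if alloc_delta < 0 and inuse_delta > 0:
        return "hidden leak: less churn, more retention"
    if alloc_delta > 0 and inuse_delta <= 0:
        return "more churn, flat retention: allocator/GC pressure, not a leak"
    return f"alloc {alloc_delta:+.1%}, live heap {inuse_delta:+.1%}"

print(heap_verdict(alloc_before=9.0e8, alloc_after=7.2e8,     # bytes/sec, illustrative
                   inuse_before=3.1e9, inuse_after=3.8e9))    # bytes held, illustrative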

The deploy-time signal-to-noise calibration: how long before you trust the diff

A 30-second canary profile has too few samples to detect a 1% regression with statistical significance. A 30-minute canary profile has roughly the right floor. A 6-hour canary profile is overkill but makes the rare-function diffs trustworthy. The rule of thumb: time-to-significance for a relative regression r (as a fraction) on a function occupying fraction f of the profile, sampling at 100Hz, is roughly t ≈ 1 / (f × r²) seconds. So a 5% regression (r=0.05) on a function at 10% of profile (f=0.1) takes 1 / (0.1 × 0.0025) = 4000s ≈ 67 minutes — and that is at 100Hz. At 10Hz (the always-on band) it takes ten times longer. Teams that page on canary regressions within 5 minutes of deploy are inside the noise band; teams that wait 30+ minutes are outside it. A PagerDuty rule that fires after 90 seconds of p99 deviation is checking dashboard percentiles, not profile diffs — those rules operate on a different signal and should not be conflated with differential profiling.
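
The rule of thumb as a calculator, scaled from the 100Hz baseline to other sampling rates; this is the approximation from this paragraph, not an exact power analysis:

# t ~ 1 / (f * r^2) seconds at 100Hz; scale linearly for other sampling rates.
def seconds_to_significance(f: float, r: float, hz: float = 100.0) -> float:
    return (1.0 / (f * r * r)) * (100.0 / hz)

print(seconds_to_significance(f=0.10, r=0.05))           # 4000s, about 67 minutes
print(seconds_to_significance(f=0.10, r=0.05, hz=10))    # ten times longer at 10Hz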

Where this leads next

Chapter 60 — Profile storage and query patterns — picks up the question implicit in this chapter: where do the two profiles you are diffing actually live, how are they indexed, and what is the query latency budget for a per-function G-test across 800 pods × 100k functions × 10 minute windows? Differential profiling at fleet scale lives or dies on the storage engine's ability to serve select function, sum(samples) where pod in (...) and time in (...) in under a second.

Chapter 61 — Profile sampling and the "is profiling free?" question — picks up the calibration question from §Going deeper: how much sample volume do you actually need, what does it cost, and what does the production-realistic always-on configuration look like. Differential profiling sets the lower bound on sample volume — if you cannot detect a 1% regression you cannot stop one shipping.

For the prerequisite framework, /wiki/cpu-heap-lock-profiles-in-prod covers the three profile types whose diffs this chapter operates on. Differential analysis applies to all three with the same machinery; the interpretation of "+1.5pp on serialize_response" is different for CPU (more cycles), heap-alloc (more bytes/sec), and lock-wait (more contention).

For the high-dimensional version, /wiki/google-wide-profiling-paper describes how Google does fleet-wide diff at scale — the GWP paper predates the modern statistical-test layer but anticipates the storage and aggregation primitives that make it tractable.

References

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas scipy numpy
python3 differential_profile.py
python3 fanout_diff.py