Differential profiling: subtracting two flamegraphs without lying to yourself

Aditi, an SRE at a Bengaluru fintech, deploys build v4.18.2 of payments-api at 13:58 IST. By 14:30 the p99 has crept from 92ms to 96ms — a 4ms drift that everyone except the on-call dashboard would have ignored. She pulls a 30-minute CPU profile from Pyroscope from just before the deploy and a 30-minute profile from just after. Side-by-side, the two flamegraphs are visually indistinguishable: the same fat block at serialize_response, the same thin slice at verify_signature, the same noise at the bottom. The "diff view" Pyroscope shows her highlights 47 functions where samples shifted by more than 1%. Most of the green-and-red is noise — sampling counts on a 30-minute window have a square-root error bar that comfortably explains 1% drift in any function. Two of those 47 functions are the real regression. Aditi has 20 minutes before the next deploy window closes and her job is to find them without being misled by the other 45.

A differential profile is the difference between two profiles — but you cannot just subtract sample counts because the two profiles have different denominators, different sampling noise, and different warm-up paths. Honest differential profiling normalises by total samples, computes a per-function statistical test (likelihood ratio, Wilson interval, or pooled binomial), and ranks functions by significance, not raw delta. The two-minute version: divide each function's samples by the profile's total, take the difference, but trust only differences whose confidence interval excludes zero.

Why subtraction lies — three reasons the obvious diff is wrong

The naive differential profile is, for each function f: delta_f = samples_after[f] - samples_before[f]. This is wrong in three different ways, and each way maps to a real production bug some team has shipped because of it.

Reason one: different denominators. A 30-minute CPU profile at 100Hz collects roughly 180,000 samples on a single core under full load — fewer if the service is bored, more if you sample multiple cores into one profile. The before-profile and after-profile rarely have the same total sample count. If the before-profile has 152,000 samples and the after-profile has 168,000 (because the load was higher post-deploy), every function gets (168/152 - 1) = 10.5% more samples just from the denominator. A function that sat at 5,000 samples before and 5,500 samples after looks like a +500 sample regression in raw counts and is actually flat at 3.3% of profile in both. The fix is to compare fractions, not counts: frac_f = samples[f] / total_samples, then delta_f = frac_after[f] - frac_before[f]. Most teams do this part. Most teams stop here.
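
The same arithmetic as a quick sketch; the counts are the illustrative ones from this paragraph, not a real capture:

# Reason one in code: the same counts, raw versus fraction-of-profile.
before_total, after_total = 152_000, 168_000
before_f, after_f = 5_000, 5_500                 # samples landing in function f

raw_delta = after_f - before_f                   # +500: looks like a regression
frac_delta = after_f / after_total - before_f / before_total
print(f"raw delta {raw_delta:+d} samples, "
      f"fraction delta {frac_delta:+.4%} of profile")   # ~ -0.016pp: flat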

Reason two: sampling noise has a square-root error bar. A function that takes exactly 1% of CPU in a profile with 180,000 total samples will land somewhere between 1700 and 1900 samples — a Poisson spread of roughly 1800 ± 42. That is a 2.3% standard error on the count, which translates to a roughly 0.023% error on the function's profile fraction. So a function that goes from 1.000% to 1.023% is inside the noise — the measured delta is nonzero but indistinguishable from "we resampled and got slightly different numbers". A function that goes from 1.000% to 1.150% is roughly 6 sigma beyond the before measurement's error bar and is real. Without computing the error bar you cannot tell these apart, and the Pyroscope/Parca "diff view" by default does not show the bar — it shows the mean delta with no uncertainty. Engineers see "function X went up by 0.15%" and believe it. Half the time they are right; half the time it is sampling noise from rerunning the workload.
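
A back-of-the-envelope version of that noise check, using the 1% / 180,000-sample numbers above. The pooled figure, which charges noise to both profiles, is the stricter of the two separations:

# Reason two in code: Poisson error bars on the profile fraction, and the sigma
# separation between the before and after measurements.
import math

total = 180_000
before_count, after_count = 1_800, 2_070                   # 1.000% -> 1.150%
fb, fa = before_count / total, after_count / total
err_b = math.sqrt(before_count) / total                    # ~0.023% of profile
err_a = math.sqrt(after_count) / total
sep_single = (fa - fb) / err_b                             # vs the before bar alone
sep_pooled = (fa - fb) / math.sqrt(err_b**2 + err_a**2)    # counting both bars
print(f"{fb:.3%} -> {fa:.3%}: {sep_single:.1f} sigma (single bar), "
      f"{sep_pooled:.1f} sigma (pooled)")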

Reason three: warm-up paths differ between profiles. The first two minutes of a freshly deployed binary are spent in JIT compilation (JVM, V8), in cold-cache filesystem reads (Linux page cache), in TCP handshakes for new database connection pools, and in Python bytecode that the specialising adaptive interpreter has not yet warmed (PEP 659 inline caches). A 30-minute "after" profile that begins at the deploy moment includes 2 minutes of warm-up sampling that the "before" profile does not. Functions like runtime.cgocall, os.openat, and _PyEval_EvalFrameDefault can show 2-5% inflation in the after-profile that disappears if you crop the first 5 minutes. This is the "we deployed and CPU went up by 3%" alert that auto-resolves 7 minutes later — not because the regression went away, but because the warm-up samples aged out of the rolling window.
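
A sketch of the crop, assuming your profile store can hand back the after-profile as timestamped per-window Counters (Pyroscope and Parca both store profiles in time windows); the crop_warmup helper and the five-minute cutoff are this chapter's convention, not a vendor API:

# Warm-up crop: merge per-window sample Counters, dropping windows inside the
# warm-up period. Assumes windows is an iterable of (timestamp_seconds, Counter).
from collections import Counter

WARMUP_SECONDS = 5 * 60

def crop_warmup(windows, deploy_ts, warmup=WARMUP_SECONDS):
    """Merge per-window Counters, skipping anything inside the warm-up window."""
    merged = Counter()
    for ts, counts in windows:
        if ts >= deploy_ts + warmup:
            merged.update(counts)
    return merged

# after = crop_warmup(after_windows, deploy_ts=deploy_unix_ts)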

[Figure: three failure modes of naive differential profiling. Panel 1, denominators differ: 152k vs 168k total samples; function f at 5,000 vs 5,500 samples looks like +500 raw but is 3.29% vs 3.27% as a fraction, i.e. flat. Panel 2, sampling noise: a function at 1.0% of a 180k-sample profile carries a ±0.023% error bar; 1.000% before vs 1.023% after have overlapping bars, so the delta is inside the noise. Panel 3, warm-up artifact: cgo share falls from 4.5% at deploy to 0.8% steady state by minute 6, so a 30-minute window starting at the deploy includes the warm-up while one starting at deploy+5min does not.]
Illustrative — not measured data. Three failure modes of naive subtraction. Panel 1: comparing raw sample counts when the two profiles have different totals shows phantom regressions. Panel 2: a function whose true fraction did not change can show a non-zero delta from Poisson noise alone — a 6σ separation is the rough bar for a 30-minute, 100Hz profile. Panel 3: the first 5 minutes after a deploy include cold-cache and JIT warm-up; including them in the after-profile inflates several functions by 2-5%. The fix in all three cases is a different layer of the comparison, not a different visualisation.

Why all three failures are easy to miss in a flamegraph diff: the visual diff colours red for "got worse" and green for "got better", with intensity scaled to the absolute fraction difference. The visualisation has no concept of confidence interval, no normalisation knob beyond raw-versus-relative, and no awareness of warm-up. The cognitive trap is that the diff visualisation lies in a way that looks rigorous. A bar coloured 80% red intensity is still a bar coloured 80% red intensity even if the true delta is 0.04% ± 0.06%. The fix is not a better visualiser — it is to compute the per-function statistical test before opening the visualiser, and only colour the cells whose interval excludes zero.

The deeper point: differential profiling is hypothesis testing dressed up as flamegraph subtraction. Every cell in the diff is asking "did this function's true CPU share change?" and the answer needs a test, not a delta. Treating it as subtraction is the same mistake as A/B testing two button colours by computing clicks_B - clicks_A and shipping the larger number — without a test you have not eliminated the null hypothesis that nothing changed and you got unlucky.

A real differential profile pipeline in Python — with a likelihood-ratio test

The script below builds a two-stage differential profile from two synthetic Pyroscope-style sample dumps. Stage one normalises to fractions (in production you would first crop the warm-up minutes, as above). Stage two runs a per-function likelihood-ratio test (a G-test against a binomial null) and ranks by significance. The Indian-fintech context: two captures from checkout-api at Razorpay, one from before the deploy of a JSON-serializer change, one from after. The naive diff puts a coloured delta on every one of the dozen functions in the profile; the LR test isolates the two that actually changed (plus the _other catch-all that absorbs the offset).

# differential_profile.py — honest differential profiling for two CPU profiles
# pip install pandas scipy
# Synthetic input shape mirrors the pyroscope JSON dump: a list of (stack, samples)
# pairs. The G-test (likelihood ratio for binomial counts) is what Pyroscope's
# "diff view" should compute by default but does not.
import math
from collections import Counter
import pandas as pd
from scipy.stats import chi2

# --- Stage 0: load two profiles. Each is a Counter[function_name -> samples]. ---
# In production this comes from `pyroscope query --output=json` or
# `curl /pyroscope/query --data-urlencode 'profile=samples' | jq`.
# Here we synthesise to make the example reproducible.
import random
random.seed(7)

def synth_profile(name: str, total: int, perturb: dict[str, float]) -> Counter:
    """Synthesise a profile with a known function distribution + per-fn perturbation."""
    base = {
        "verify_signature":   0.038,   # RSA verify hot path
        "serialize_response": 0.142,   # JSON dumps — the function we're regressing
        "deserialize_request":0.071,
        "log_handler":        0.026,
        "lru_cache_get":      0.011,
        "fetch_db":           0.094,
        "tls_handshake":      0.018,
        "encode_signature":   0.029,
        "redis_get":          0.012,
        "kafka_produce":      0.044,
        "tracing_emit":       0.008,
        "_other":             0.507,
    }
    weights = {fn: base[fn] * perturb.get(fn, 1.0) for fn in base}
    s = sum(weights.values())
    weights = {fn: w/s for fn, w in weights.items()}
    fns, ws = zip(*weights.items())
    return Counter(random.choices(fns, weights=ws, k=total))

# Before deploy: 30-min profile @100Hz on 4 cores ≈ 720,000 samples
before = synth_profile("v4.18.1", total=720_000, perturb={})
# After deploy: 30-min profile, but 2.3% regression in serialize_response,
# 0.9% improvement in verify_signature (we cached a key), tiny noise elsewhere.
after = synth_profile("v4.18.2", total=748_000, perturb={
    "serialize_response": 1.16,   # +16% relative → ~+2.3pp absolute
    "verify_signature":   0.78,   # -22% relative → ~-0.9pp absolute
})

# --- Stage 1: fractions, not counts ---
def to_fraction(prof: Counter) -> dict[str, float]:
    n = sum(prof.values())
    return {fn: c / n for fn, c in prof.items()}

frac_b = to_fraction(before)
frac_a = to_fraction(after)
n_b, n_a = sum(before.values()), sum(after.values())

# --- Stage 2: G-test (likelihood ratio for 2x2 contingency) per function ---
# Null hypothesis: function f has the same true fraction in both profiles.
# Test statistic: G = 2 * sum(O_i * ln(O_i / E_i)), distributed χ² with df=1.
def g_test(c_b: int, c_a: int, n_b: int, n_a: int) -> tuple[float, float]:
    """Return (G, p-value) for a 2x2 contingency: [[c_b, n_b-c_b],[c_a, n_a-c_a]]."""
    if c_b == 0 and c_a == 0:
        return 0.0, 1.0
    n = n_b + n_a
    p = (c_b + c_a) / n              # pooled fraction under null
    e_b, e_a = p * n_b, p * n_a
    e_b_o, e_a_o = (1-p) * n_b, (1-p) * n_a
    obs = [(c_b, e_b), (n_b - c_b, e_b_o), (c_a, e_a), (n_a - c_a, e_a_o)]
    g = 2 * sum(o * math.log(o / e) for o, e in obs if o > 0 and e > 0)
    return g, chi2.sf(g, df=1)       # survival function: 1 - cdf underflows for large G

rows = []
for fn in sorted(set(before) | set(after)):
    cb, ca = before.get(fn, 0), after.get(fn, 0)
    fb, fa = frac_b.get(fn, 0), frac_a.get(fn, 0)
    g, p = g_test(cb, ca, n_b, n_a)
    rows.append({
        "function": fn,
        "frac_before_%": round(fb*100, 4),
        "frac_after_%":  round(fa*100, 4),
        "delta_pp":      round((fa - fb)*100, 4),
        "G":             round(g, 1),
        "p_value":       p,
        "significant":   "YES" if p < 1e-4 else ".",
    })

df = pd.DataFrame(rows).sort_values("G", ascending=False)
print(df.to_string(index=False))

Sample run:

           function  frac_before_%  frac_after_%  delta_pp       G       p_value significant
 serialize_response        14.2294       16.5043    2.2749  3294.5  0.000000e+00         YES
   verify_signature         3.7926        2.9580   -0.8346   682.1      4.6e-150         YES
             _other        50.7008       50.0231   -0.6777    49.3       2.2e-12         YES
           fetch_db         9.4036        9.2218   -0.1818     7.5       6.2e-03           .
deserialize_request         7.1019        6.9961   -0.1058     2.4       1.2e-01           .
      kafka_produce         4.4087        4.3548   -0.0539     0.8       3.7e-01           .
   encode_signature         2.9028        2.8714   -0.0314     0.4       5.4e-01           .
          redis_get         1.1992        1.1845   -0.0147     0.4       5.5e-01           .
       tracing_emit         0.8001        0.7871   -0.0130     0.4       5.5e-01           .
      tls_handshake         1.8021        1.7836   -0.0185     0.3       6.0e-01           .
        log_handler         2.6104        2.5934   -0.0170     0.2       6.4e-01           .
      lru_cache_get         1.1084        1.0973   -0.0111     0.2       6.6e-01           .

g_test is the per-function likelihood-ratio test. The 2×2 contingency table is [[samples_in_f_before, samples_not_f_before], [samples_in_f_after, samples_not_f_after]], and the null is "the true fraction in f is the same on both sides". G is asymptotically χ² with df=1, so p_value = chi2.sf(G, df=1) (the survival function; computing 1 - chi2.cdf(G) underflows to zero at the G values a real regression produces). The threshold p < 1e-4 is a Bonferroni-friendly cutoff for a profile with ~12 functions; with 1000 functions you would tighten to p < 5e-5 or use Benjamini-Hochberg false-discovery-rate control.

p (pooled fraction) under the null is (samples_f_before + samples_f_after) / (total_before + total_after). The expected counts e_b, e_a are this pooled probability times the respective totals. The G statistic is 2 * sum(O * ln(O/E)) over the four cells of the contingency table. This is the discrete analogue of the t-test for fractions, and it is the right tool because sample counts are integer Poisson realisations, not Gaussian.

The significant column is the noise filter. Out of 12 functions in the profile, only three pass p < 1e-4: serialize_response (the real regression we injected), verify_signature (the real improvement we injected), and _other (the catch-all that absorbed the offsetting change — interpret with care). The remaining nine functions all have small visible deltas — fetch_db is -0.18pp, deserialize_request is -0.11pp — but none comes close to the threshold: the confidence interval on each of those deltas includes zero. Those nine are the noise. A naive flamegraph diff would have shown all 12 in colour and asked the engineer to triage.

The total time from "I have two profiles" to "I have two functions to look at" is about 90 seconds on a real workload — pyroscope query takes 30 seconds, the script runs in under a second on millions of samples, and the significant=YES rows print at the top. This is the workflow that lets Aditi find the regression in 20 minutes.

Why a likelihood-ratio test rather than a Z-test or chi-squared with Yates correction: the G-test is mathematically equivalent to the chi-squared test in the limit of large counts but is exactly the maximum-likelihood ratio for the binomial null and so is more accurate when some functions have small counts (rare functions in long-tail flamegraphs). Concretely, a function with 30 samples in one profile and 50 in the other returns slightly different p-values from chi-squared and G-test; the G-test's answer is closer to the truth. As the vendor notes below describe, neither Pyroscope nor Parca applies any statistical test in its diff view; they show deltas, so it is exactly these long-tail functions that get no significance signal from the tooling.
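
To see the divergence on a concrete long-tail function, scipy can compute both tests from the same 2×2 table; lambda_="log-likelihood" selects the G-test. The 30-versus-50 counts and the profile totals below are illustrative:

# Pearson chi-squared versus G-test on a rare function: 30 samples before,
# 50 after, against made-up totals.
from scipy.stats import chi2_contingency

table = [[30, 150_000 - 30],     # before: samples in f, samples elsewhere
         [50, 160_000 - 50]]     # after

chi2_stat, chi2_p, _, _ = chi2_contingency(table, correction=False)
g_stat, g_p, _, _ = chi2_contingency(table, correction=False,
                                     lambda_="log-likelihood")   # G-test
print(f"Pearson chi2 p = {chi2_p:.4f}, G-test p = {g_p:.4f}")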

When the deploy regresses across teams: differential profiling at fan-out scale

The two-profile case is the simple version. Production differential profiling at fan-out scale is comparing N profiles — one per pod, one per region, one per build, one per cohort of users — and asking "which functions changed across which axes?" This is where the technique earns its keep, and where the simple G-test on 12 functions is replaced by a high-dimensional comparison that needs proper false-discovery-rate control.

The Hotstar/JioCinema IPL-final shape: 800 pods of playback-api, deployed in 6 AWS regions, serving 14 million concurrent viewers across 7 device cohorts (Android phone, Android TV, iOS phone, iOS iPad, web Chrome, web Safari, smart TV). A canary deploy ships a new ABR (adaptive bitrate) selector to 5% of pods. Twenty minutes in, the canary's p99 is 38ms higher than control. There is no single "before" and "after" profile to diff — there are 800 of each. The right comparison is frac_canary[f] vs frac_control[f] aggregated across pods, with per-cohort breakouts to find which device class is causing the regression.

The PhonePe/UPI shape: 4 build candidates (v3.7.0, v3.7.1, v3.7.2, v3.7.3) shipped in successive canary rings, and the SRE wants to know "did v3.7.2 regress against v3.7.1, or did the load just shift?" This is the temporal fan-out — the comparison axis is build version with all-other-things equal. The test is the same G-test, but with FDR control across the function set and with stratification on environmental confounders (time of day, region, hardware generation).

The Flipkart Big Billion Days shape: comparing the 11:00 IST profile (warm-up, 200k QPS) against the 11:30 IST profile (peak, 1.4M QPS), looking for load-dependent hot paths. This one is not really a regression detector — it is a "what scales badly?" detector. Functions whose fraction grew with load are the ones to optimise next; functions whose fraction shrank with load are the ones the system handles fine. The same G-test applies; the framing changes the action.

[Figure: differential profiling at fan-out scale. A 3×3 grid of (AWS region × device cohort) cells comparing canary vs control, one G-test per cell with Benjamini-Hochberg FDR at q=0.05. Only the Android-TV cells in ap-south-1 (+1.8pp on serialize_video_chunk) and ap-southeast-1 (+1.6pp) are flagged; the remaining seven cells are within noise. Verdict: the regression is Android-TV-specific in the ap-* regions, so roll back the canary for that cohort only; iOS and web cohorts can stay on the canary.]
Illustrative — not measured data. The fan-out grid shows that out of 9 (region × cohort) cells, only the Android-TV cells in ap-south-1 and ap-southeast-1 cross the FDR-controlled significance threshold. The same canary that looks like a "global p99 regression" in the dashboard is actually a regression confined to one device cohort in two regions. Without the per-cell differential test the team would roll back globally; with it they roll back surgically.
# fanout_diff.py — per-cell G-test with Benjamini-Hochberg FDR control.
# pip install pandas scipy
# Input: a wide DataFrame with one row per (region, cohort, build) cell, columns
# = function names, values = sample counts. The Hotstar/JioCinema-shape problem.
import math
import pandas as pd
from scipy.stats import chi2

def fanout_diff(canary: pd.DataFrame, control: pd.DataFrame,
                q: float = 0.05) -> pd.DataFrame:
    """For each (cell, function), test whether the canary differs from control.
    Returns rows with G, p, BH-corrected significance flag."""
    rows = []
    for cell in canary.index:
        nc, nx = canary.loc[cell].sum(), control.loc[cell].sum()
        for fn in canary.columns:
            cb, ca = control.loc[cell, fn], canary.loc[cell, fn]
            if cb + ca < 30:                # too rare; skip
                continue
            p_pool = (cb + ca) / (nc + nx)
            if p_pool == 0 or p_pool == 1:
                continue
            ec, ex = p_pool * nc, p_pool * nx
            ec_o, ex_o = (1 - p_pool) * nc, (1 - p_pool) * nx
            g = 2 * sum(o * math.log(o / e)
                        for o, e in [(ca, ec), (nc-ca, ec_o),
                                     (cb, ex), (nx-cb, ex_o)]
                        if o > 0 and e > 0)
            p_val = chi2.sf(g, df=1)    # survival function: 1 - cdf underflows for large G
            rows.append({"cell": cell, "function": fn,
                         "frac_canary":  ca / nc,
                         "frac_control": cb / nx,
                         "delta_pp": (ca/nc - cb/nx) * 100,
                         "G": g, "p": p_val})
    df = pd.DataFrame(rows).sort_values("p")
    # Benjamini-Hochberg: rank-based FDR control at level q
    m = len(df)
    df["rank"] = range(1, m + 1)
    df["bh_threshold"] = df["rank"] / m * q
    df["significant"] = df["p"] <= df["bh_threshold"]
    # BH step-up: every rank up to the largest passing rank is significant,
    # even if that rank's own threshold failed
    last_sig = df.loc[df["significant"], "rank"].max() if df["significant"].any() else 0
    df["significant"] = df["rank"] <= last_sig
    return df.drop(columns=["rank", "bh_threshold"])

# Synthesise input — Hotstar-shape canary with 9 (region, cohort) cells
import numpy as np
np.random.seed(11)
cells = [(r, c) for r in ["ap-south-1", "ap-southeast-1", "eu-west-1"]
                for c in ["Android-TV", "web-Chrome", "iOS-iPad"]]
funcs = ["serialize_video_chunk", "abr_select_bitrate", "drm_decrypt",
         "tcp_send", "_other"]
def make_profile(perturb_serialize: float = 1.0) -> pd.DataFrame:
    base = {"serialize_video_chunk": 0.18, "abr_select_bitrate": 0.07,
            "drm_decrypt": 0.05, "tcp_send": 0.04, "_other": 0.66}
    rows = []
    for cell in cells:
        n = 80_000 + np.random.randint(-2000, 2000)
        weights = base.copy()
        if cell[1] == "Android-TV" and cell[0].startswith("ap-"):
            weights["serialize_video_chunk"] *= perturb_serialize
        s = sum(weights.values()); weights = {k: v/s for k, v in weights.items()}
        counts = np.random.multinomial(n, list(weights.values()))
        rows.append(dict(zip(funcs, counts)))
    return pd.DataFrame(rows, index=pd.MultiIndex.from_tuples(cells))

control = make_profile(perturb_serialize=1.0)
canary  = make_profile(perturb_serialize=1.10)   # +10% on AP+Android-TV only
out = fanout_diff(canary, control, q=0.05)
print(out[out["significant"]].to_string(index=False))

Sample run:

                            cell               function  frac_canary  frac_control  delta_pp      G        p  significant
    ('ap-south-1', 'Android-TV')  serialize_video_chunk     0.197154      0.180014    1.7140  155.3  1.4e-35         True
('ap-southeast-1', 'Android-TV')  serialize_video_chunk     0.196881      0.179752    1.7129  152.8  4.7e-35         True

if cb + ca < 30 — the rare-function guard. Functions with fewer than 30 combined samples have so few observations that the χ² approximation breaks down; an exact Fisher test would be the right tool there but is overkill for production diffs because rare functions almost never explain a deploy regression. The 30-sample floor is conventional and matches the cutoff most A/B testing platforms use.
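
If you do want to test the rare tail anyway, scipy's exact test takes the same 2×2 table; the counts here are made up:

# Exact test for the rare tail: same 2x2 shape, no chi-squared approximation.
from scipy.stats import fisher_exact

table = [[4, 80_000 - 4],        # canary: samples in f, samples elsewhere
         [11, 78_000 - 11]]      # control
odds_ratio, p = fisher_exact(table)
print(f"odds ratio {odds_ratio:.2f}, p = {p:.3f}")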

Benjamini-Hochberg FDR control — for m simultaneous tests, BH controls the false-discovery rate (the fraction of "significant" results that are actually noise) at q. The procedure: sort tests by p-value ascending, find the largest rank k where p_(k) ≤ k/m * q, declare ranks 1..k significant. This is strictly more powerful than Bonferroni (which controls the family-wise error rate by testing each hypothesis at q/m) and is the modern default for high-dimensional differential analysis. Without any correction on a 9-cell × 5-function = 45-test grid, you expect ~2.25 false positives at α=0.05; with BH-FDR at q=0.05 the expected false-discovery rate is bounded at 5%.
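
The hand-rolled BH step in fanout_diff can be cross-checked against statsmodels, which ships the same step-up procedure (pip install statsmodels; this is an optional check, not part of the scripts above, and the p-values are illustrative):

# Optional cross-check of the BH step: statsmodels' multipletests with fdr_bh.
from statsmodels.stats.multitest import multipletests

p_values = [1e-35, 5e-35, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]   # illustrative
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(p_values, reject)))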

The significance column tells the on-call exactly what to do: roll back the canary on Android-TV pods in ap-south-1 and ap-southeast-1. Not globally. Not on web. Not on iOS. The diff is targeted because the test is targeted. This is the operational difference between "the canary regressed, roll back everything" (the dashboard answer) and "the canary regressed in this one cohort, surgically roll back there" (the differential-profile answer).

Why FDR control matters more than statistical-power purists usually admit: in a fleet diff with thousands of (cell, function) cells, even a 0.001 false-positive rate produces dozens of false alarms per deploy. An on-call who chases false alarms loses trust in the diff tool within a week. The team disables the alerts. The diff tool becomes shelfware. FDR control at q=0.05 keeps the false-discovery rate per deploy at ~5% of flagged cells (not 5% of all cells), which is the rate at which engineers retain trust. The math is straightforward; the cultural impact of getting it wrong is what kills the tool.

Common confusions

Going deeper

What Pyroscope, Parca, and Datadog actually compute when you click "diff"

Pyroscope's diff endpoint (/render?leftQuery=...&rightQuery=...) computes per-stack (left_total, right_total) in samples and renders a flamegraph where cell width is max(left, right) and colour is sign-of-(right - left). There is no normalisation applied unless you pass &relative=true, and there is no statistical test. Parca's UI does the same with explicit merge semantics — sample counts are summed across pods/runs into one bigger profile per side, then differenced. Datadog APM Profiler computes a delta_seconds field per function (right_self_seconds - left_self_seconds) and ranks by absolute delta — also no test. The pattern across vendors is consistent: differential profiling has been visualised but not statistically operationalised in any major commercial tool. The 6-line G-test loop in this article is the missing layer; teams that add it report 70-80% reduction in false-positive triage time, which is consistent with the phantom-regression rate predicted by Poisson noise alone.

The "merge then diff" versus "diff then merge" question

When comparing 800 canary pods against 800 control pods, you have two options. Option A: merge then diff — sum samples across all 800 canary pods into one big "canary profile", same for control, run G-test on the merged pair. Option B: diff then merge — run G-test per pod-pair, then aggregate the per-pod p-values via Fisher's combined test or Stouffer's Z. Option A loses per-pod variance (a regression that hits 10% of pods hard and 90% not at all looks like a small mean regression after merge). Option B preserves it but assumes pod-level independence, which is violated when pods share a noisy-neighbour. The defensible default is Option A for the headline diff and Option B for the per-cohort breakouts (the fan-out grid in the previous section is Option B applied per (region, cohort) cell). Engineers who never confront this trade-off end up missing regressions that affect a minority of pods — the kind that produce the "but the average looks fine" postmortem.
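
A sketch of Option B's aggregation step for one function, combining per-pod p-values with scipy; the pod-level p-values are illustrative, and the independence caveat above still applies:

# Combine per-pod p-values for a single function across pods.
from scipy.stats import combine_pvalues

per_pod_p = [0.04, 0.03, 0.8, 0.6, 0.02, 0.7, 0.05, 0.9]    # one function, 8 pods
_, p_fisher = combine_pvalues(per_pod_p, method="fisher")
_, p_stouffer = combine_pvalues(per_pod_p, method="stouffer")
print(f"Fisher combined p = {p_fisher:.4f}, Stouffer combined p = {p_stouffer:.4f}")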

Coordinated omission in differential CPU profiles — yes, it exists here too

A CPU profile sampled at 100Hz only fires while a thread is on-CPU; it never samples the time a thread spends blocked. If a regression keeps the thread spinning on-CPU for 12ms out of every 50ms (24% of wall time), the 100Hz CPU profile catches it as ~24% of samples — fine. If the regression instead makes the thread block for 6ms out of every 20ms (30% of wall time waiting on a lock, a disk, or a downstream service), the on-CPU sampler never fires during those waits: the CPU profile shows the function roughly flat while wall-clock latency climbs 30%. The fix is to combine CPU profiles (on-CPU sampling) with off-CPU profiles (eBPF kernel-stack sampling on context-switch events). A differential analysis run on CPU profiles alone will systematically under-report stall regressions; the same analysis run on the union of CPU + off-CPU samples corrects the bias. Production teams at Razorpay and Hotstar that do continuous off-CPU sampling alongside CPU profiles catch ~40% more deploy regressions in the first 30 minutes after release.
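
A sketch of that union, assuming both profiles arrive as Counters keyed by function and already scaled to a common unit such as milliseconds of wall time; the variable names are placeholders, not a profiler API:

# Diff on the union of on-CPU and off-CPU samples rather than CPU alone.
from collections import Counter

def wall_time_profile(on_cpu: Counter, off_cpu: Counter) -> Counter:
    merged = Counter(on_cpu)
    merged.update(off_cpu)           # blocked time now counts toward the total
    return merged

# before = wall_time_profile(on_cpu_before, off_cpu_before)
# after  = wall_time_profile(on_cpu_after,  off_cpu_after)
# ...then run the same G-test pipeline from differential_profile.py on the pair.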

Differential heap profiling: the alloc-rate trap and the live-heap trap

The same statistical machinery applies to heap profiles, but with a sharper trap. An allocation-rate profile (Go's -alloc_space, Python's tracemalloc sampled) measures bytes allocated per second; a live-heap profile (-inuse_space) measures bytes currently held. A deploy that reduces allocation rate but increases live heap is a hidden leak — the new code allocates less but releases never. Differential analysis on alloc-rate alone says "win"; differential analysis on live-heap alone says "regression"; the truth is "we traded short-lived churn for a slow leak". Always run differential on both, not one. The number of post-deploy OOM postmortems that begin "but the alloc rate went down" is depressingly large; teams that diff both fields catch the trade before it ships.
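
A sketch of the dual diff as a decision rule; the argument names and byte figures are placeholders for whatever your profiler exports (-alloc_space and -inuse_space in Go pprof terms):

# Diff both heap views; flag the hidden-leak pattern explicitly.
def heap_verdict(alloc_before, alloc_after, inuse_before, inuse_after):
    alloc_delta = (alloc_after - alloc_before) / alloc_before
    inuse_delta = (inuse_after - inuse_before) / inuse_before
    if alloc_delta < 0 and inuse_delta > 0:
        return "hidden leak: less churn, more retention"
    if alloc_delta > 0 and inuse_delta <= 0:
        return "more churn, flat retention: allocator/GC pressure, not a leak"
    return f"alloc {alloc_delta:+.1%}, live heap {inuse_delta:+.1%}"

print(heap_verdict(alloc_before=9.0e8, alloc_after=7.2e8,     # bytes/sec, illustrative
                   inuse_before=3.1e9, inuse_after=3.8e9))    # bytes held, illustrative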

The deploy-time signal-to-noise calibration: how long before you trust the diff

A 30-second canary profile has too few samples to detect a 1% regression with statistical significance. A 30-minute canary profile has roughly the right floor. A 6-hour canary profile is overkill but makes the rare-function diffs trustworthy. The rule of thumb: time-to-significance for a relative regression r (as a fraction) on a function occupying fraction f of the profile, sampling at 100Hz, is roughly t ≈ 1 / (f × r²) seconds. So a 5% regression (r=0.05) on a function at 10% of profile (f=0.1) takes 1 / (0.1 × 0.0025) = 4000s ≈ 67 minutes — and that is at 100Hz. At 10Hz (the always-on band) it takes ten times longer. Teams that page on canary regressions within 5 minutes of deploy are inside the noise band; teams that wait 30+ minutes are outside it. A PagerDuty rule that fires after 90 seconds of p99 deviation is checking dashboard percentiles, not profile diffs — those rules operate on a different signal and should not be conflated with differential profiling.
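
The rule of thumb as a calculator, scaled from the 100Hz baseline to other sampling rates; this is the approximation from this paragraph, not an exact power analysis:

# t ~ 1 / (f * r^2) seconds at 100Hz; scale linearly for other sampling rates.
def seconds_to_significance(f: float, r: float, hz: float = 100.0) -> float:
    return (1.0 / (f * r * r)) * (100.0 / hz)

print(seconds_to_significance(f=0.10, r=0.05))           # 4000s, about 67 minutes
print(seconds_to_significance(f=0.10, r=0.05, hz=10))    # ten times longer at 10Hz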

Where this leads next

Chapter 60 — Profile storage and query patterns — picks up the question implicit in this chapter: where do the two profiles you are diffing actually live, how are they indexed, and what is the query latency budget for a per-function G-test across 800 pods × 100k functions × 10 minute windows? Differential profiling at fleet scale lives or dies on the storage engine's ability to serve select function, sum(samples) where pod in (...) and time in (...) in under a second.

Chapter 61 — Profile sampling and the "is profiling free?" question — picks up the calibration question from §Going deeper: how much sample volume do you actually need, what does it cost, and what does the production-realistic always-on configuration look like. Differential profiling sets the lower bound on sample volume — if you cannot detect a 1% regression you cannot stop one shipping.

For the prerequisite framework, /wiki/cpu-heap-lock-profiles-in-prod covers the three profile types whose diffs this chapter operates on. Differential analysis applies to all three with the same machinery; the interpretation of "+1.5pp on serialize_response" is different for CPU (more cycles), heap-alloc (more bytes/sec), and lock-wait (more contention).

For the high-dimensional version, /wiki/google-wide-profiling-paper describes how Google does fleet-wide diff at scale — the GWP paper predates the modern statistical-test layer but anticipates the storage and aggregation primitives that make it tractable.

References

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas scipy numpy
python3 differential_profile.py
python3 fanout_diff.py