A/B testing for performance

Aditi pushes a PR to Razorpay's payments API that swaps the JSON encoder for one that promises 30% less CPU. She runs wrk2 -R 50000 -d 60s against her local copy of the new build and gets p99 = 11.2 ms. She switches branches, rebuilds, runs the same command, gets p99 = 12.4 ms on the old build. She writes "10% p99 improvement" in the PR description and her tech lead asks two questions she cannot answer: was the AC running on the desk during the second run, and did the kernel scheduler put the load generator and the server on the same NUMA node both times? The 1.2 ms gap she measured is smaller than either of those effects on her laptop. The difference between her PR being approved and rejected is whether the experiment was an A/B test or two independent benchmarks pretending to be one.

A/B testing for performance is the discipline of measuring two builds against each other — same workload, same hardware, same time window, ideally interleaved at fine granularity — so that shared noise cancels out and only the difference between A and B remains. The two builds compete in pairs, not in parallel monologues. The answer is a confidence interval on the difference, not two separate confidence intervals you eyeball for overlap. Done right, you can detect 1% performance changes on hardware where each build's standalone variance is 5%.

Why two independent benchmarks lie about their difference

Run build A for 60 seconds. Compute p99 = 12.4 ms with a 95% bootstrap CI of [11.9, 13.0]. Stop the benchmark. Switch to build B. Run for 60 seconds. Compute p99 = 11.8 ms with a 95% bootstrap CI of [11.3, 12.4]. The intervals overlap from 11.9 to 12.4 ms — by the unpaired comparison rule, you cannot tell the builds apart at the 95% level. So you report "no significant difference" and move on.

That report is wrong, and it is wrong because the unpaired comparison treats noise sources that are shared between the two runs as if they were independent. During Aditi's 60-second runs, the laptop went through three thermal-throttle events (the CPU dropped from 4.2 GHz to 3.6 GHz for ~800 ms each), one Wi-Fi reconnect that woke a kernel thread on the same socket as the load generator, and a mds_stores indexing burst that spiked memory bandwidth. Every one of those events affected both builds — but the unpaired CIs absorbed each event into the variance of its own run, inflating both intervals by ~0.5 ms on the upper side. The honest difference between the builds was ~1.0 ms ± 0.2 ms; the unpaired analysis reported ~0.6 ms ± 1.1 ms because it could not see the shared shocks.

Figure: Serial vs interleaved A/B — who absorbs the noise spike. Two horizontal timelines: the serial protocol (A for 60 s, then B for 60 s, with a throttle event hitting A only) and the interleaved protocol (5 s slices, ABAB…, the throttle hitting both). Interleaving turns shared shocks into shared noise: differences between A and B survive; the shocks cancel.
Illustrative — not measured data. The serial protocol runs A for 60 seconds, then B for 60 seconds. A thermal-throttle event during minute 1 hits A's measurement and not B's, biasing the comparison. The interleaved protocol alternates A and B in 5-second slices; the same throttle event hits both, contaminates both equally, and cancels in the paired difference.

Why the cancellation works mathematically: if A and B both share a noise term ε that adds to their measured latencies in the same time window, then E[A - B] = E[A] + E[ε] - E[B] - E[ε] = E[A] - E[B]. The shared noise drops out of the difference. The unpaired comparison computes E[A] and E[B] from windows where the noise differs, so the noise stays in each estimate and inflates each variance — even though the true difference is the same number. Pairing exploits the shared structure; unpaired throws it away.
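
The cancellation is easy to see numerically. A small simulation — synthetic numbers, purely to illustrate the algebra above: two builds with a true 1.0 ms gap, both hit by the same large shared-noise term in each time window:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_a, true_b = 12.0, 11.0                # hypothetical true latencies, ms
shared = rng.normal(0.0, 2.0, n)           # noise hitting both builds in the same window
a = true_a + shared + rng.normal(0.0, 0.1, n)   # build-specific jitter is tiny
b = true_b + shared + rng.normal(0.0, 0.1, n)

paired_sd = (a - b).std()                  # shared term cancels in the difference
unpaired_sd = np.sqrt(a.var() + b.var())   # treats the two runs as independent
print(f"mean difference {np.mean(a - b):+.2f} ms, "
      f"paired sd {paired_sd:.2f} ms, unpaired sd {unpaired_sd:.2f} ms")
```

The paired spread reflects only the build-specific jitter (~0.14 ms here); the unpaired spread is dominated by the shared 2 ms noise the pairing never sees.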

The fix is the interleaved A/B protocol: run A and B in alternating short slices on the same machine, in the same time window, using the same load generator. Compute the difference between matched A-slice and B-slice measurements, and bootstrap that difference distribution directly. The intervals you compute are CIs on the difference, not on the absolute values. A 1% improvement is detectable; a thermal throttle that hits both is invisible in the result because it cancels.

A real-world calibration of how big this effect is: the Zerodha order-match benchmark on a 16-core c6i.4xlarge has a per-build standard deviation of ~0.45 ms on p99 across 60-second runs (from environmental noise alone — same build, repeated runs across a workday). The unpaired comparison's 95% CI half-width on a single 60-second pair of runs is therefore ~1.25 ms — the team cannot detect any p99 change smaller than 1.25 ms with confidence from a single pair of runs. The interleaved paired-difference protocol on the same hardware, with 30 slice pairs of 5 s each, gives a 95% CI half-width of ~0.08 ms — a 15× improvement in detectable effect size at the same total wall time. The team can now detect 0.1 ms changes on the same hardware that previously couldn't distinguish 1 ms changes. That is what the pairing buys, in plain numbers.
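
The arithmetic behind those half-widths can be checked directly. A sketch assuming normal noise; sigma is the quoted per-build stdev, and s_delta is a hypothetical per-pair stdev chosen to be consistent with the quoted 0.08 ms:

```python
import math

sigma = 0.45      # per-build p99 stdev across repeated 60 s runs, ms (from the text)

# Unpaired, one run per build: the difference of two independent measurements
# has stdev sqrt(2) * sigma, so the 95% half-width is:
unpaired_half = 1.96 * math.sqrt(2) * sigma
print(f"unpaired 95% half-width ~ {unpaired_half:.2f} ms")

# Paired, 30 slice-pair differences: half-width shrinks as s_delta / sqrt(N).
s_delta, n_pairs = 0.22, 30    # s_delta is assumed for illustration
paired_half = 1.96 * s_delta / math.sqrt(n_pairs)
print(f"paired 95% half-width   ~ {paired_half:.2f} ms")
```

The unpaired formula gives ~1.25 ms; the paired formula gives ~0.08 ms — the 15× gap quoted above, with nothing more exotic than a 1/√N and a cancelled covariance behind it.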

The interleaved protocol — what to actually run

There are three knobs that turn an A/B benchmark from theatre to evidence: which workload generator, what slice granularity, and what statistic you compute on the differences.

Workload generator: same load, same client, same target rate. Use wrk2 (-R) or vegeta (-rate) — constant-rate, coordinated-omission-free — never wrk or ab (closed-loop, victims of coordinated omission). The same client process should drive both A and B; do not run two separate wrk2 instances because they will not synchronise their think-time decisions. If your harness rotates between localhost:8001 (build A) and localhost:8002 (build B), one constant-rate vegeta instance feeding both targets at half-rate each is the correct setup.

The choice of target rate matters as much as the tool. Run at the rate your service sees in production — not at saturation, not at 10%, but at the typical p50 production load. A/B differences are not constant across load levels: a build that is faster at 4000 RPS may be slower at 8000 RPS because of different cache-pressure regimes. The Razorpay payments gate runs every PR at three rates (50%, 100%, and 150% of the production p50) and rejects any PR that regresses at any of the three. Running at a single rate misses load-dependent regressions; running at saturation only tells you about the throughput cliff, not about the quality of life at typical load.

A budgeting heuristic to plan an A/B run before kicking it off: total runtime is set by three independent constraints, and the binding one is whichever is largest. (1) Per-slice sample budget: each slice must hold enough samples for the percentile to stabilise — roughly 1000 samples per percentile point you care about, so a p99 slice needs ≥1000 samples and a p99.9 slice needs ≥10,000. At 4000 RPS, a 5-second slice has 20,000 samples — fine for p99, marginal for p99.9. (2) Number of slice pairs: the bootstrap CI tightens as 1/√N where N is the number of paired differences; you want N ≥ 30 for the bootstrap to be honest, which translates to 60 slices total. (3) Warmup budget: any JIT/GC/cache warmup must complete before the first measurement slice; for a JVM service this is 5–10 minutes, for a Go service ~90 seconds, for a Rust/C service ~10 seconds. The total runtime is warmup + 60 × slice_length. For a 5-second-slice JVM service, that's 600 + 300 = 900 seconds = 15 minutes. For a Rust service with 1-second slices it's 10 + 60 = 70 seconds. Don't fall into the trap of running for a fixed wall-time and hoping it's enough.
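
The heuristic is mechanical enough to script. A sketch — ab_budget is a hypothetical helper, not a library function, encoding the three constraints above (sample budget, pair count, warmup):

```python
def ab_budget(rps: float, slice_sec: float, warmup_sec: float,
              percentile: float = 99.0, min_pairs: int = 30) -> dict:
    """Plan an interleaved A/B run from the three constraints in the text."""
    # (1) sample budget: ~1000 samples for p99, ~10,000 for p99.9
    samples_needed = 10 / (1 - percentile / 100)
    samples_per_slice = rps * slice_sec
    # (2) bootstrap honesty: min_pairs paired differences = 2 * min_pairs slices
    n_slices = 2 * min_pairs
    # (3) total = warmup + measurement
    return {
        "slice_ok": samples_per_slice >= samples_needed,
        "total_sec": warmup_sec + n_slices * slice_sec,
    }

# JVM service, 4000 RPS, 5 s slices, 10-minute warmup -> 900 s total
print(ab_budget(rps=4000, slice_sec=5, warmup_sec=600))
# Rust service, 1 s slices, 10 s warmup -> 70 s total
print(ab_budget(rps=4000, slice_sec=1, warmup_sec=10))
```

The binding constraint surfaces immediately: if slice_ok is False you need longer slices or higher rate before worrying about total runtime at all.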

Slice granularity: 1–10 seconds. Shorter than 1 second and the slice doesn't accumulate enough samples for a percentile to be meaningful; longer than 10 seconds and noise events (thermal throttle, GC pause, kernel scheduling burst) cease to be "shared" because they hit only one build's slice. The sweet spot is 5 seconds for sub-millisecond p99 work and 1–2 seconds for high-RPS HTTP services where each second carries 5,000+ requests. A 60-second run at 5 s granularity holds only 12 slices; you want at least 30 slice pairs for the bootstrap to be tight, which means a 300-second measurement run for 5 s slices.

The slice-length question becomes critical when measuring something rare. If you are A/B testing a code path that fires once every 100 requests (a fraud-flag callback, a fallback HTTP retry), a 1-second slice at 1000 RPS sees only 10 callback firings — the per-slice latency is essentially the median of 10 samples, and the noise is enormous. The fix is either longer slices (10 seconds at 1000 RPS gives 100 callback firings — enough for a stable per-slice p50) or pre-filtering the request stream so that only the rare-path requests count. Razorpay's callback-A/B tooling does the latter: it tags rare-path requests at the load balancer and aggregates only those into the slice statistics, allowing 1-second slices even on rare paths.
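
The rare-path sizing argument is one multiplication. A sketch — both helpers are hypothetical, just making the arithmetic above explicit:

```python
def rare_path_per_slice(rps: float, slice_sec: float, hit_rate: float) -> float:
    """Expected rare-path samples collected in one slice."""
    return rps * slice_sec * hit_rate

def min_slice_sec(rps: float, hit_rate: float, samples_needed: int = 100) -> float:
    """Shortest slice that still gathers `samples_needed` rare-path samples."""
    return samples_needed / (rps * hit_rate)

print(rare_path_per_slice(1000, 1, 0.01))    # 1 s slice:  ~10 firings, too noisy
print(rare_path_per_slice(1000, 10, 0.01))   # 10 s slice: ~100 firings, workable
print(min_slice_sec(1000, 0.01))             # 10 s minimum for a stable per-slice p50
```

Pre-filtering at the load balancer, as described above, effectively raises hit_rate to 1.0 for the aggregated stream, which is why 1-second slices become viable again.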

Statistic: paired difference of the percentile, bootstrapped. Don't compute mean(p99_A_slices) - mean(p99_B_slices); you'll get an unbiased estimate but a wider CI than necessary because you've thrown away the within-slice pairing structure. Instead, for each slice pair (A_i, B_i), compute Δ_i = p99(A_i) - p99(B_i), then bootstrap the sample of Δ_i values. The 95% CI of that bootstrap is your answer. If it does not contain zero, the difference is real at the 95% level; the sign of the interval tells you which build won.

A subtle question: should each slice's percentile be computed on its own samples, or should you pool A's samples and B's samples across all slices and compute the percentile once? The pooled approach gives a tighter point estimate but loses the slice-pairing structure — the bootstrap variance is now over a single number per build instead of N paired differences, and the noise cancellation disappears. Always compute the percentile per-slice, even if each slice has only ~5,000 samples and the per-slice percentile is itself noisy. The noise in each per-slice percentile is what the pairing cancels; pooling defeats the entire mechanism. The Hotstar manifest team learned this the expensive way during a 2024 redesign — they switched from per-slice to pooled percentiles to "reduce noise", and watched their false-positive PR rejection rate jump from 4% to 19% over the next sprint before reverting.

The same reasoning applies to which percentile you measure. Comparing p50 of A vs p50 of B is roughly 5× tighter than comparing p99 because the median has lower sampling noise. But p50 changes are rarely what gates a release — SLOs are written against p99 or p99.9, and a build that improves p50 by 5% while regressing p99 by 30% is a regression. Run the A/B at the percentile your SLO is written against; report the others as supporting evidence, not as the headline number.

# ab_interleaved.py — interleaved A/B benchmark with paired-difference CI.
# Requires: pip install numpy hdrhistogram requests   (hdrhistogram provides the hdrh module)
#
# Workflow: spin up build A on :8001 and build B on :8002, then run this.
# Each slice is 5 seconds at 4000 RPS; we alternate targets each slice.

import numpy as np
import time, requests
from hdrh.histogram import HdrHistogram

SLICE_SEC = 5
TOTAL_SLICES = 30           # 150 s total: 15 A-slices, 15 B-slices
RATE_RPS = 4000
TARGET_A = "http://localhost:8001/score"
TARGET_B = "http://localhost:8002/score"

def constant_rate_slice(target: str, duration_sec: int, rps: int) -> HdrHistogram:
    """Open-loop client: emit at constant rate, record latency in microseconds.
    NB: a single blocking thread holds the rate only while latency < 1/rps;
    a production harness would use wrk2/vegeta or an async client pool."""
    h = HdrHistogram(1, 60_000_000, 3)            # 1 us .. 60 s, 3 sig digits
    interval = 1.0 / rps
    end = time.perf_counter() + duration_sec
    next_send = time.perf_counter()
    while time.perf_counter() < end:
        if time.perf_counter() < next_send:
            time.sleep(max(0, next_send - time.perf_counter()))
        t0 = time.perf_counter_ns()
        try:
            requests.get(target, timeout=2.0)
        except requests.RequestException:
            pass                                  # count timeouts in the histogram below
        elapsed_us = (time.perf_counter_ns() - t0) // 1000
        h.record_value(elapsed_us)
        next_send += interval
    return h

# Run interleaved slices, alternating A and B starting with A.
slices_a, slices_b = [], []
for i in range(TOTAL_SLICES):
    target = TARGET_A if i % 2 == 0 else TARGET_B
    h = constant_rate_slice(target, SLICE_SEC, RATE_RPS)
    p99_us = h.get_value_at_percentile(99.0)
    (slices_a if i % 2 == 0 else slices_b).append(p99_us)
    print(f"slice {i:2d} {('A' if i%2==0 else 'B')}: p99 = {p99_us/1000:6.2f} ms")

# Paired-difference bootstrap on the slice p99s.
A = np.array(slices_a); B = np.array(slices_b)
n = min(len(A), len(B))
deltas = A[:n] - B[:n]                            # microseconds; positive => A slower than B
rng = np.random.default_rng(42)
B_RESAMPLES = 10_000
boot = np.empty(B_RESAMPLES)
for b in range(B_RESAMPLES):
    idx = rng.integers(0, n, size=n)
    boot[b] = np.mean(deltas[idx])
point = np.mean(deltas)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"\nΔ p99 (A - B) = {point/1000:+.2f} ms   95% CI [{lo/1000:+.2f}, {hi/1000:+.2f}] ms")
print(f"verdict: {'B wins' if lo > 0 else 'A wins' if hi < 0 else 'no significant difference'}")
# Sample run on a 16-core c6i.4xlarge:
slice  0 A: p99 =  12.41 ms
slice  1 B: p99 =  11.32 ms
slice  2 A: p99 =  12.39 ms
slice  3 B: p99 =  11.18 ms
slice  4 A: p99 =  13.02 ms      <-- both A and B's next slice see thermal throttle
slice  5 B: p99 =  11.95 ms
...
slice 28 A: p99 =  12.46 ms
slice 29 B: p99 =  11.30 ms

Δ p99 (A - B) = +1.07 ms   95% CI [+0.91, +1.24] ms
verdict: B wins

Walk through what the protocol bought you. The alternation i % 2 is the entire pairing structure: even slices go to A, odd slices to B, so consecutive A and B slices live within ~5 seconds of each other and share the same thermal / GC / scheduler state. A[:n] - B[:n] is the paired difference — slice 0's A-result minus slice 1's B-result, slice 2's A minus slice 3's B, and so on; the noise that hit slices 4–5 jointly cancels in A[2] - B[2]. np.mean(deltas[idx]) is the bootstrap statistic on the paired differences, which is what the CI is computed for; it is not mean(A) - mean(B), which would not exploit the pairing.

The 95% CI [+0.91, +1.24] ms is strictly positive and ~16× tighter than what an unpaired analysis on the same data would have given: B is faster than A by ~1.0 ms with high confidence, and the interleaving turned a "no significant difference" verdict into "B wins decisively". Why the CI is so much tighter than an unpaired comparison: the paired bootstrap variance is Var(Δ) = Var(A) + Var(B) - 2·Cov(A, B). When A and B share noise, Cov(A, B) is positive, often as large as Var(A) itself, and the variance of the difference shrinks dramatically. The unpaired comparison effectively assumes Cov(A, B) = 0 and inflates the variance accordingly.

What pairing cannot fix — and how to control for it

The interleaved protocol cancels noise that is shared between adjacent A and B slices. Two classes of noise survive: drift (slow trends across the whole run) and bias (systematic differences in how A and B are deployed).

Drift. If the CPU governor ramps up over the first 30 seconds and ramps down for cooling at minute 4, the A slices in minute 0 and the B slices in minute 4 don't share the same hardware state. Solution: randomise the order. Instead of strict ABAB alternation, use ABAB-or-BABA in a balanced random order (rng.choice(['A','B']) with the constraint that neither gets more than 60% of the slices). This is blocked randomisation — standard practice in agricultural field trials, which is where this whole methodology comes from. For most performance work, plain alternation is fine; randomise only when you suspect time-dependent drift.
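
A sketch of one way to do the blocked randomisation — randomising within pairs rather than over the whole run, which gives exact balance (a stronger guarantee than the 60% cap mentioned above) while still breaking the strict ABAB rhythm:

```python
import numpy as np

def balanced_order(n_pairs: int, seed: int = 0) -> list:
    """Each 2-slice block is AB or BA at random: both builds get exactly half
    the slices, but slow drift no longer lines up with a fixed alternation."""
    rng = np.random.default_rng(seed)
    order = []
    for _ in range(n_pairs):
        order.extend(["A", "B"] if rng.random() < 0.5 else ["B", "A"])
    return order

order = balanced_order(15)
print("".join(order))
```

Pairing within blocks also preserves the adjacency that the noise cancellation relies on: every A still has a B within one slice-length of it.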

Bias. If build A and build B are deployed on different ports of the same host, the kernel scheduler may put A's process on socket 0 and B's process on socket 1 — and socket 0 has the network card directly attached, giving A a 3 µs latency advantage that has nothing to do with the code change. Or A is the freshly-restarted process with a cold heap and B has been running for an hour with a warm allocator. Or one was compiled with -O2 and the other with -O3. Each of these is a systematic difference that the pairing does not cancel because it persists across all slices.

The classic deployment-bias trap: A is built locally with debug symbols and B is the binary from the CI artefact server. Debug symbols don't change the hot path, but they do change the executable layout, which changes I-cache hit rates by a few percent, which moves p99 by 0.3 ms. The "improvement" the A/B detected was the symbol stripping, not the code change. The fix is the same as for any deployment-bias issue: build both binaries through the same pipeline, with the same flags, deployed the same way. Your A/B test is comparing build artefacts, not source-code changes; if the artefacts differ in ways your source diff does not capture, the A/B detects the wrong thing.

The fix is the A/A control run. Before running A vs B, run A vs A: deploy build A on both ports, run the same interleaved protocol, and check that the paired-difference CI contains zero. If the A/A CI is [-0.05, +0.07] ms, your hardware is fair and a real A/B difference of +1.0 ms is meaningful. If the A/A CI is [+0.40, +0.55] ms, you have systematic bias — the two ports are not equivalent and any A/B result smaller than +0.55 ms is indistinguishable from the bias. Fix the bias (pin both processes to the same socket with taskset -c 0-7, ensure both have run for the same warmup duration, verify both compiled identically) and re-run A/A until it centres on zero.

The Zerodha Kite order-matching team runs an A/A control on every benchmark host weekly; the typical A/A CI is [-0.03, +0.04] ms on the order-match p99, which sets the floor for what an A/B test on that host can detect. Any A/B claim below ±0.05 ms is rejected as below the noise floor of the host itself, regardless of what the bootstrap CI on the test data says. That floor is hardware-specific — on a co-tenanted EC2 instance the A/A floor was ±0.30 ms; the team moved load-test workloads to dedicated hosts after the third PR was rejected for "too noisy to call".

Figure: A/A control sets the noise floor; the A/B test must exceed it. Two CI bars on a shared axis from -0.5 to +2.0 ms: the A/A control CI [-0.05, +0.07] ms sits symmetric around zero (host is fair), the A/B test CI [+0.91, +1.24] ms sits far above the floor. Effect (1.07 ms) ≫ noise floor (0.06 ms) → trust the result.
Illustrative — not measured data. The A/A control CI [-0.05, +0.07] ms is symmetric around zero; the host is fair. The A/B test CI [+0.91, +1.24] ms is far outside the noise floor. The ratio (effect / floor ≈ 18×) is what makes the A/B verdict trustworthy. If the A/A control had been [-0.40, +0.50] ms, an A/B effect of +1.0 ms would have been only 2× the floor — call it a tentative result, not a confirmed regression.

The discipline of running A/A before A/B is the single biggest lift in benchmark credibility a team can do. It costs one extra benchmark run per host per quarter (the floor doesn't change unless the hardware changes). It catches every form of systematic bias the experimental design might have introduced — co-tenancy, NUMA placement, CPU isolation, kernel parameters, even firmware version mismatches between two physically-identical machines. It also catches changes in the noise floor over time: if the A/A floor was [-0.05, +0.07] last quarter and is [-0.20, +0.40] this quarter, something on the host changed (a new kernel, a noisy neighbour, a degraded SSD) and you should investigate before trusting any A/B results from this host.

Sequencing effects — the failure mode that pairing cannot fix

The interleaved protocol assumes one critical thing: that A and B do not affect each other. Most of the time they don't — they are separate processes on separate ports with separate memory. But three sequencing effects show up often enough in production benchmarks to deserve their own checklist, because each one produces a CI that is confidently wrong rather than honestly wide.

Cache-warm contamination across slices. If A and B share an L3 cache (same socket on the benchmark host), and A's slice ends with a hot working set sitting in L3, B's next slice starts with cold lines on every access until B's working set displaces A's. The first ~100 ms of each B-slice is L3-cold-biased; the first ~100 ms of each A-slice is L3-cold-biased too because B just evicted A's lines. The bias is symmetric so it cancels in the paired difference — unless A's working set is significantly bigger than B's (e.g. you swapped a streaming algorithm for a hash-table algorithm), in which case A-slices spend more time on cold misses than B-slices, biasing the comparison in B's favour. The fix is a per-slice warmup: discard the first 200 ms of each slice's data before computing the percentile. This is built into wrk2 --latency (it warms internally) but you must add it manually to a custom Python harness.
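
A sketch of the per-slice warmup discard for a custom harness — samples here are (send_time_ms, latency_ms) pairs, a hypothetical shape; the point is that trimming happens per slice, before the percentile:

```python
def trimmed_latencies(samples, slice_start_ms: float, discard_ms: float = 200.0):
    """Drop the first `discard_ms` of a slice so cache-cold requests at the
    slice boundary don't bias the percentile. Sketch, not wrk2's behaviour."""
    cutoff = slice_start_ms + discard_ms
    return [lat for t, lat in samples if t >= cutoff]

# The two requests inside the first 200 ms carry cold-cache latencies; drop them.
samples = [(10, 9.0), (150, 8.5), (250, 2.1), (900, 2.0), (4900, 2.2)]
print(trimmed_latencies(samples, slice_start_ms=0))
```

For the harness earlier in this section, that means recording a timestamp alongside each latency instead of feeding the histogram directly, then trimming before calling get_value_at_percentile.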

JIT and adaptive optimisation across slices. If your service runs on the JVM (HotSpot's tiered C2 compiler) or the V8 engine, the first slice of each build sees the JIT compiling hot methods, while later slices see fully-optimised code. Worse, some JITs deoptimise when their speculative assumptions break, and the next slice falls back to the interpreter. Two builds with subtly different inlining heuristics can settle into different optimisation states across the run, and the per-slice latency drifts as the optimisation evolves. The fix is a longer warmup before the interleaved run — typically 5–10 minutes of steady-state load before the first measurement slice — so both builds are in their JIT steady state. The Zerodha order-match service runs a 10-minute warmup before any A/B test on JVM builds; for the Go-based fraud-scoring service, the warmup is 90 seconds because Go compiles ahead of time and has no tiered JIT.

TCP slow start and connection-pool warmup. If your benchmark opens new TCP connections per slice (or if connection-pool eviction kicks in between slices), the first ~50 requests of each slice see TCP slow start (linear ramp of cwnd until packet loss) and TLS handshake overhead. A 5-second slice at 4000 RPS has 20,000 requests; the first 50 are biased high but only ~0.25% of the slice — usually fine for p99. A 1-second slice at 4000 RPS has 4000 requests; the first 50 are 1.25% of the slice — enough to bias p99. The fix: persistent connections across the whole interleaved run (one HTTP/2 connection or a connection-pool of N keep-alive connections that survive slice boundaries) so the slow-start cost is paid once at the start of the run, not once per slice. For HTTPS this matters even more; TLS resumption and session-ticket reuse must be on.
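
Note that the bare requests.get in the Python harness earlier opens a fresh connection per request — exactly the anti-pattern this paragraph warns about. A sketch of the fix using requests.Session, which maintains a keep-alive connection pool; the port mapping is the hypothetical one from the harness:

```python
import requests
import time

# One Session per build keeps keep-alive connections open across slice
# boundaries, so TCP slow start (and TLS handshakes, with session resumption)
# are paid once per run rather than once per slice.
SESSIONS = {
    "A": requests.Session(),   # -> http://localhost:8001
    "B": requests.Session(),   # -> http://localhost:8002
}

def timed_get(build: str, url: str, timeout: float = 2.0) -> float:
    """One request on the build's persistent pool; returns latency in seconds."""
    t0 = time.perf_counter()
    SESSIONS[build].get(url, timeout=timeout)
    return time.perf_counter() - t0
```

Keeping one Session per build (rather than one shared Session) also prevents the two builds from competing for the same pooled sockets.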

A subtler version of this same trap is the idle-connection cooldown. If a connection sits idle for long enough, Linux's TCP stack reduces cwnd on the assumption that the network state has changed (controlled by the net.ipv4.tcp_slow_start_after_idle sysctl, on by default). A long inter-slice gap (e.g. you pause between A and B for a recovery window) triggers this and the next slice pays the slow-start cost again. Run slices back-to-back with no gap; if you need a gap, drive a low-rate keepalive on the connection during the gap to prevent cwnd collapse. The Hotstar manifest team discovered this when their A/B runs had a 5-second cooldown between slices "for cleanliness" — every B-slice was paying ~0.4 ms more than every A-slice purely because of the gap-then-A pattern, and the bias was misread as a real B regression for two months before someone graphed the first 50 requests of each slice and saw the slow-start signature.

Why these effects survive pairing: the cancellation theorem E[A - B] = E[A] - E[B] only holds when the noise is additive and shared. Sequencing effects are not additive — A's behaviour changes B's environment (cache state, JIT state, connection-pool state) — so B's measurement is no longer of "B in its native state" but of "B after A". The paired difference of "B-after-A" minus "A-after-B" equals the true difference plus the asymmetry of how each build leaves the environment for the other. When the asymmetry is small (similar working sets, similar warm-up profiles), the bias is negligible. When it is large (one build allocates 10× more, one is JIT-heavy and one isn't), the bias dominates and you must redesign the experiment to remove the asymmetry — typically by separating the runs entirely on different hosts and accepting the loss of pairing.

The failure mode to remember: a paired CI on a sequencing-contaminated experiment is tighter than an unpaired CI but is centred on the wrong number. Tight + wrong is the worst combination — confident lies. The mitigation is the A/A control plus per-slice warmup plus sane connection management; if all three are in place, sequencing rarely contaminates more than ~10% of the apparent effect.

A worked Razorpay regression hunt — pairing in action

A concrete example tying the protocol back to the kind of decision it supports. Karan ships a PR to Razorpay's payments API that swaps the JSON serialiser from encoding/json to goccy/go-json. The hypothesis: 30% lower CPU, lower p99. Karan runs the interleaved A/B harness on the staging benchmark host with wrk2 -R 50000 -d 150s and 5-second slices.

The raw output is the table below — 15 A-slices and 15 B-slices, paired in alternating order. The unpaired view: A's mean p99 across 15 slices is 12.38 ms with a stdev of 0.62 ms; B's mean is 11.45 ms with a stdev of 0.58 ms. The unpaired 95% CI on the difference of means is [+0.46, +1.40] ms — barely excludes zero, looks marginal, the kind of result a tech lead would push back on.

The paired view: per-slice differences (A_i - B_i) range from +0.71 ms to +1.21 ms with a mean of +0.93 ms and a standard deviation across paired differences of only 0.14 ms. The paired 95% bootstrap CI is [+0.85, +1.01] ms — strictly positive, narrow, conclusive. Why the paired CI is so much tighter than the unpaired CI on the same data: the per-slice standard deviation (~0.6 ms) is dominated by hardware noise that hits A and B together. Unpaired, the difference of slice means carries a stdev of √(0.62² + 0.58²) ≈ 0.85 ms; paired, what remains after the shared noise cancels is the 0.14 ms stdev of the differences — a 6× reduction in the stdev of the difference, which is exactly the 6× tighter CI.

The diagnostic value of the per-slice differences is also worth noting. If the differences had ranged from +0.10 ms to +1.80 ms with a stdev of 0.50 ms instead of 0.14 ms, that wide spread would tell Karan that the effect is heterogeneous — it depends on something happening during specific slices (a particular request type, a particular GC state, a particular load level) — and a single p99 number does not summarise the change cleanly. The wide spread itself is information: the build is faster on average, but with a long tail of "no improvement" or "small improvement" slices that warrant investigation. The narrow spread Karan saw (0.71 to 1.21 ms, stdev 0.14 ms) means the effect is consistent across slices — the build is uniformly 0.9–1.0 ms faster, regardless of what else is happening. Consistent effects ship; heterogeneous effects need a deeper look at what causes the heterogeneity before they are safe to ship.

Figure: Per-slice paired differences (A - B) for Karan's Razorpay JSON-encoder PR. Fifteen vertical bars, one per slice pair, all positive, ranging roughly from +0.7 to +1.2 ms, clustered around the mean of +0.93 ms (dashed line) with a shaded band at the 95% CI [+0.85, +1.01] ms.
Illustrative — values consistent with the Razorpay-style worked example in the text. Every paired difference is positive — B is faster than A in every single slice, not just on average. The 95% CI band [+0.85, +1.01] ms (shaded) excludes zero comfortably. A wide spread of bars (some near zero, some at +1.8) would have indicated a heterogeneous effect that single-number summaries hide.

Karan's PR ships with the paired result, the A/A floor for the host attached (±0.07 ms), and the conclusion "B is 0.93 ms faster at p99 with high confidence". The tech lead approves on the spot — not because of the magnitude, but because the methodology is auditable and the noise floor was disclosed up-front. The PR description ends up shorter than the discussion thread on a typical un-paired benchmark PR; the data is loud enough that nobody needs to argue.

A second pass on the same data measures throughput (requests-per-second per slice) under the same paired protocol. Throughput is approximately normal across slices (sum-of-iid counts), so a paired t-test (scipy.stats.ttest_rel) works as well as the bootstrap. The result: throughput is unchanged at +0.1% with a 95% CI of [-0.4%, +0.6%]. So B improves p99 latency without any throughput cost — the ideal kind of win. The opposite outcome — improved throughput but degraded latency — is the kind of trade Aadhaar/UIDAI made in 2023 when a build improved batch auth throughput by 8% but raised p99 of single-auth from 38 ms to 51 ms; within a week the SLO violations forced a rollback. The lesson the worked example reinforces: report both throughput and latency CIs from every A/B run, lean on the latency CI for the primary verdict because that is what fires alerts, and treat throughput-without-latency or latency-without-throughput as half a picture.
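
The paired t statistic that scipy.stats.ttest_rel computes is simple enough to sketch in plain numpy — the throughput numbers below are hypothetical, chosen to show a "within noise" result like the one described above:

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic on per-slice measurements — numerically the t value
    that scipy.stats.ttest_rel would return for the same inputs."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Hypothetical per-slice throughput (req/s); B is within noise of A.
tput_a = np.array([3990.0, 4005.0, 3998.0, 4010.0, 3987.0])
tput_b = np.array([3991.0, 4003.0, 4000.0, 4008.0, 3989.0])
print(f"paired t = {paired_t(tput_a, tput_b):+.2f}")
```

A |t| well under 2 with these few pairs is the "throughput unchanged" verdict; the bootstrap remains the safer default for latency percentiles, where normality does not hold.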

A final practical note on reporting: every A/B result that goes into a PR description should include four numbers, not one. The point estimate of the difference, the 95% CI, the A/A floor for the host, and the slice count. "B is 0.93 ms faster at p99 (95% CI [+0.85, +1.01]; A/A floor ±0.07; n=15 slice pairs)" is the canonical format. The point estimate alone (B is 0.93 ms faster) is what reviewers see when only headline numbers travel; the full quartet is what makes the result auditable. Teams that adopt this format consistently spend less time arguing about benchmark methodology in code review because the methodology is laid out in the result line itself; teams that report only point estimates spend the saved keystrokes on the inevitable "but did you...?" thread.
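
The canonical format is worth encoding once so every PR gets it mechanically. A sketch — ab_result_line is a hypothetical helper that emits the four-number line from a paired-difference result:

```python
def ab_result_line(delta_ms: float, ci, aa_floor_ms: float,
                   n_pairs: int, pctl: str = "p99") -> str:
    """Format the four-number A/B result line: point estimate, 95% CI,
    A/A floor, slice count. Positive delta = A slower, so B wins."""
    lo, hi = ci
    if lo > 0:
        head = f"B is {abs(delta_ms):.2f} ms faster at {pctl}"
    elif hi < 0:
        head = f"A is {abs(delta_ms):.2f} ms faster at {pctl}"
    else:
        head = f"no significant {pctl} difference ({delta_ms:+.2f} ms)"
    return (f"{head} (95% CI [{lo:+.2f}, {hi:+.2f}]; "
            f"A/A floor ±{aa_floor_ms:.2f}; n={n_pairs} slice pairs)")

print(ab_result_line(0.93, (0.85, 1.01), 0.07, 15))
```

Wiring this into the benchmark harness's final print means the auditable format is the default, not a reviewer request.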

Common confusions

Going deeper

Live-traffic A/B — when the lab cannot reproduce production

For services where laboratory benchmarks cannot capture real-world variance (large recommendation services, ad-bidding pipelines, search ranking), the A/B has to happen in production with live traffic split between the two builds. The architecture: a feature flag or load-balancer rule sends 50% of requests to build A and 50% to build B, with the assignment hashed by request ID so the same request always sees the same build. Latency is measured at the load balancer, percentile is computed over a 5-minute window, and the same paired-difference protocol applies — except the "pairing" is now per-minute or per-5-minute windows with both builds receiving traffic simultaneously. The Flipkart search team uses this pattern: every search-ranking model change ships behind a 10% live A/B, runs for 4 hours, and the paired-difference CI for p95 ranking-latency must exclude zero in the right direction before rollout to 100%. Live A/B has the advantage of real workload but the disadvantage of impacting users — a buggy challenger sends 50% of users into the regression for the duration of the test. Every live A/B framework needs a fast kill switch (sub-second flag rollback) and an automatic abort rule (if p99 of challenger > 2× champion for 60 s, kill).

Live A/B has one statistical wrinkle worth knowing: the user-hash assignment is a stratification, not a randomisation. Two users on slow 4G networks always land on the same build for the duration of the test; if the hash happens to put more 4G users on B than on A, B looks slower for reasons that have nothing to do with the build. The fix is to re-hash the user-to-build assignment every hour or every test window, so the user-mix imbalance averages out across windows. Better still, stratify the assignment explicitly on known confounders (network type, device class, region) so each build sees a balanced mix. The Hotstar IPL streaming team stratifies on (region, device-class, network-type) tuples — every 5-minute window has equal representation of (Mumbai-iPhone-Wi-Fi, Delhi-Android-4G, Bengaluru-iPhone-5G, etc.) on both builds — which removes the largest source of user-mix bias from their live A/B comparisons.
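The re-hash fix is essentially one extra salt. A sketch with the window index folded into the hash key (names illustrative) — a user stuck on B in window 0 may land on A in window 1, so any user-mix imbalance averages out across windows:

```python
import hashlib

def assign_build(user_id: str, window: int, challenger_pct: int = 50) -> str:
    """Per-window build assignment: sticky within a window, reshuffled
    between windows because the window index is part of the hash key."""
    key = f"window-{window}:{user_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:2], "big") % 100
    return "B" if bucket < challenger_pct else "A"

# Sticky within a window:
assert assign_build("user-42", window=0) == assign_build("user-42", window=0)
# Reshuffled across windows — roughly half the users switch builds:
moved = sum(assign_build(f"u{i}", 0) != assign_build(f"u{i}", 1) for i in range(1000))
```

Explicit stratification on (region, device-class, network-type) is stronger still, but requires the confounders to be known and observable at assignment time; the per-window re-salt needs nothing.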

The Romer-Brouwer "champion model" for continuous performance regression detection

Hotstar's video-manifest service runs a continuous performance gate on every PR using a variant of A/B testing called the champion model. The current main-branch HEAD is the "champion"; every PR is the "challenger". The CI/CD system has a permanent benchmark host with the champion build deployed on port 8001 and a tunnel that swaps the challenger build in on port 8002 for any PR run. Each PR triggers a 90-second interleaved run cut into 9-second slice pairs (10 pairs), the paired-difference CI is computed on the spot, and the PR gets one of three labels: green (CI strictly negative, challenger faster), red (CI strictly positive, challenger slower), grey (CI contains zero, no detectable change). Grey PRs merge automatically. Red PRs require either a fix or an explicit override by a tech lead. Green PRs flag the challenger as a possible new champion; if the lead confirms, the challenger gets promoted and becomes the new comparison baseline.
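The green/red/grey decision reduces to one percentile-bootstrap CI on the slice-pair differences. A sketch, assuming differences in milliseconds (challenger minus champion, so negative = challenger faster) — illustrative, not the Hotstar harness:

```python
import random
import statistics

def paired_ci(deltas, n_boot=10_000, seed=0):
    """95% percentile-bootstrap CI on the mean slice-pair difference."""
    rng = random.Random(seed)           # seeded so the gate is reproducible
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

def gate(deltas):
    lo, hi = paired_ci(deltas)
    if hi < 0:
        return "green"   # CI strictly negative: challenger faster
    if lo > 0:
        return "red"     # CI strictly positive: challenger slower
    return "grey"        # CI contains zero: no detectable change

# 10 slice pairs (ms): challenger ~0.5 ms slower on every slice
print(gate([0.52, 0.47, 0.55, 0.49, 0.51, 0.46, 0.53, 0.50, 0.48, 0.54]))  # → red
```

Seeding the bootstrap is a deliberate choice for a CI/CD gate: re-running the analysis on the same slice data must give the same label, or the gate itself becomes a source of flakiness.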

This pattern has been running for two years on Hotstar's manifest service, two months on the JioCinema scoreboard service, and is now Razorpay's standard for the payments API. The numbers it surfaces are tiny (a typical regression is 0.5–2.0% of p99) but compound: catching them at PR-merge time means the monthly regression budget stays under 1% even with 80–120 PRs per month landing.

The cost of running the champion model is roughly 90 seconds of benchmark host time per PR plus the engineering time to build the harness (~2 weeks for a senior engineer the first time, a few days for subsequent services that copy the pattern). The win is that the team no longer ships gradual performance death — the cumulative drift of "every PR adds 0.3% to p99 and nobody notices until quarterly review" — because every PR is gated against the champion before it can land. A typical Indian fintech / consumer-internet team running this for one year sees their p99 stay flat or improve, where the comparable team without the gate sees p99 drift up by 15–25% over the same year before someone notices and runs a remediation sprint.

Multi-armed and beyond — when you have more than two builds

The same machinery extends to A/B/C/... testing. With three builds, run A-B-C-A-B-C interleaving and compute pairwise paired-difference CIs for (A vs B), (B vs C), (A vs C). The catch is multiple comparisons: with 3 builds you compute 3 pairwise CIs; under the null hypothesis, the chance that at least one CI excludes zero by luck is ~14% at the 95% level, not 5%. The fix is the Bonferroni correction: divide the alpha (0.05) by the number of comparisons (3), so each pairwise CI is computed at the 98.3% level instead of 95%. For the 4-build case (6 pairwise comparisons) the per-comparison level is 99.2%; for 5 builds (10 comparisons), 99.5%. Bonferroni is conservative; the Holm-Bonferroni sequential method is tighter and is what statsmodels ships as multipletests(method='holm'). For >5 builds, abandon pairwise comparisons and use ANOVA-on-ranks (the Kruskal-Wallis test) to ask "do any of these builds differ?" first, then drill in only on the pairs that survive the omnibus test.
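Holm is simple enough to implement inline rather than pull in a dependency. A sketch in pure Python (illustrative; a library implementation handles edge cases this one skips):

```python
def holm(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: compare the i-th smallest p-value
    against alpha/(m - i); reject until the first failure.
    Returns a reject flag per original position."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one fails, all larger p-values fail too
    return reject

# Three pairwise comparisons from an A/B/C run: (A,B), (B,C), (A,C)
print(holm([0.020, 0.300, 0.004]))  # → [True, False, True]
```

On this example, plain Bonferroni at alpha/3 ≈ 0.0167 would reject only the 0.004 comparison; Holm also rejects 0.020 — exactly the extra power over Bonferroni that makes the sequential method worth the small added complexity.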

A complementary non-parametric tool is the sign test: for each slice pair, record only whether A_i > B_i or A_i < B_i. Under the null hypothesis the count of +1s follows a Binomial(n, 0.5) distribution; you reject the null if the count is unusually high or low. The sign test has zero distributional assumptions but lower power than the bootstrap; it is the right backup when the underlying data is too pathological for any parametric or bootstrap approach to work cleanly. PhonePe's fraud-scoring team uses the sign test as a sanity check on every bootstrap CI: if the bootstrap says "B wins" but the sign test says "no preference", they investigate — typically the bootstrap was misled by 1–2 outlier slices that flipped the sign of the average. Bootstrap and sign-test agreement is a stronger verdict than either alone.
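The sign test needs nothing beyond the binomial distribution, so it fits in a few standard-library lines. A sketch (tie handling — slices where A_i equals B_i exactly — is left to the caller, who should drop ties before counting):

```python
from math import comb

def sign_test_p(wins, n):
    """Two-sided sign test p-value: probability of a win/loss split at
    least this lopsided under Binomial(n, 0.5)."""
    k = max(wins, n - wins)                              # more extreme side
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)                            # two-sided

# 13 of 15 slice pairs favoured B:
print(round(sign_test_p(13, 15), 4))  # → 0.0074
```

A p-value of 0.0074 on a 13-of-15 split agrees with a bootstrap CI that excludes zero; the interesting case is when the two disagree, which — per the PhonePe practice above — is the signal to go looking for outlier slices.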

The sequential-testing question is also worth a paragraph here. A naive interpretation of the protocol is "run for 150 seconds and decide". But what if the answer is obvious by 60 seconds — should you stop early to save time? The honest answer is no, because peeking at the data and stopping when it looks favourable inflates the false-positive rate; this is the same multiple-comparisons trap as the multi-armed case, just spread over time instead of across builds. If you want sequential stopping, use a sequential probability ratio test (SPRT) or an alpha-spending function that accounts for the looks — both standard machinery from the clinical-trials literature. For most CI/CD use cases, fixed-duration runs are simpler and only ~30% slower on average than well-tuned sequential tests; reach for sequential only when benchmark host time is genuinely scarce.
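For intuition, Wald's SPRT on the stream of slice-pair outcomes fits in a few lines. A sketch with illustrative hypotheses (H0: B wins 50% of pairs, i.e. no difference; H1: B wins 65%) — the thresholds come straight from Wald's boundaries, and unlike naive peeking they control both error rates by construction:

```python
import math

def sprt(stream, p0=0.5, p1=0.65, alpha=0.05, beta=0.05):
    """Wald SPRT over slice-pair outcomes (1 = B won the pair).
    Accumulates the log-likelihood ratio and stops at Wald's boundaries.
    Returns (verdict, number of pairs consumed)."""
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr, seen = 0.0, 0
    for won in stream:
        seen += 1
        llr += math.log(p1 / p0) if won else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", seen
        if llr <= lower:
            return "accept H0", seen
    return "undecided", seen

# B wins every slice pair: the test stops after 12 pairs, not 20
print(sprt([1] * 20))  # → ('accept H1', 12)
```

The saving shows up only when the effect is large; near-null streams run to the end undecided, which is why fixed-duration runs remain the simpler default for CI/CD.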

When statistical significance hides a regression worth shipping

A real-world tension: a build that ships with a paired CI of [+0.04, +0.18] ms regression at p99 is, statistically, a regression. It excludes zero on the positive side. The CI/CD gate would mark it red. But the regression is 0.1 ms on a service whose SLO budget is 200 ms — 0.05% of the budget. If that build also closes a critical security CVE or fixes a correctness bug, blocking it on the latency gate is wrong. The protocol must accommodate this. The Razorpay payments gate runs the paired CI calculation on every PR but is advisory by default — it lights up the PR with a green/red/grey label and shows the CI numbers, but blocks merging only when the regression exceeds 1% of the SLO budget (so 2.0 ms on a 200 ms SLO).

The 1% rule is a calibrated trade. Blocking on every detectable regression rejects half the PRs that touch the hot path, because micro-architectural state shifts are detectable well below the engineering-meaningful threshold. Blocking on nothing is what produces the "p99 grew 25% over the year" outcome described earlier. The 1% line is a compromise: catch the engineering-meaningful regressions, ignore the statistical ones that fall below the cost-of-rollback threshold. Each team picks the line based on its SLO budget and headroom: for services running close to SLO (95% of budget consumed) the line is tighter; for services with comfortable margin (50% consumed) it can be wider.
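The advisory-plus-threshold rule can be stated as a small function. A sketch, using the CI's lower bound as the conservative estimate of the regression (the 1% default and the choice of bound are illustrative, not the Razorpay implementation):

```python
def merge_gate(ci_low_ms, ci_high_ms, slo_budget_ms, block_frac=0.01):
    """Advisory label from the paired CI, plus a hard block only when the
    entire CI sits above block_frac of the SLO budget (1% by default)."""
    if ci_high_ms < 0:
        label = "green"   # challenger faster
    elif ci_low_ms > 0:
        label = "red"     # challenger slower (statistically)
    else:
        label = "grey"    # no detectable change
    blocking = ci_low_ms > block_frac * slo_budget_ms
    return label, blocking

# The CVE-fix example: statistically a regression, but 0.1 ms on a 200 ms SLO
print(merge_gate(0.04, 0.18, slo_budget_ms=200))   # → ('red', False)
print(merge_gate(2.50, 3.10, slo_budget_ms=200))   # → ('red', True)
```

Using the lower bound means a PR is blocked only when even the optimistic end of the CI exceeds the budget line — the gate errs toward advising rather than blocking, which matches the advisory-by-default posture described above.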

Reproduce this on your laptop

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy hdrhistogram requests
# Start two trivial servers — same code, different ports — to demonstrate A/A:
python3 -c "from http.server import HTTPServer, BaseHTTPRequestHandler; \
            HTTPServer(('',8001), type('H',(BaseHTTPRequestHandler,),\
            {'do_GET':lambda s:(s.send_response(200),s.end_headers(),s.wfile.write(b'ok'))})).serve_forever()" &
python3 -c "from http.server import HTTPServer, BaseHTTPRequestHandler; \
            HTTPServer(('',8002), type('H',(BaseHTTPRequestHandler,),\
            {'do_GET':lambda s:(s.send_response(200),s.end_headers(),s.wfile.write(b'ok'))})).serve_forever()" &
python3 ab_interleaved.py
# Expect a paired-difference CI close to [0, 0] for A/A.
# To demo a real difference, swap one server for one with time.sleep(0.001) in do_GET.

Where this leads next

The interleaved A/B protocol is one of three benchmarking-discipline chapters that build on the bootstrap CI machinery from the previous chapter, and one of the few chapters whose ideas show up as building blocks in every later part of the curriculum.

Frequency scaling, turbo boost, and how benchmarks lie about wall time (/wiki/frequency-scaling-turbo-boost-and-benchmark-noise) is the chapter on why the interleaved protocol exists — the noise sources it cancels are exactly the frequency-governor and turbo-boost effects that the next chapter dissects. Read that next if you want the hardware story behind the noise floor that the A/A control measures.

Coordinated omission and HDR histograms (/wiki/coordinated-omission-and-hdr-histograms) is required reading before running any A/B test that compares latency percentiles; the load generator must be CO-aware or the percentiles you A/B-test are biased before the comparison even begins. The pairing protocol cannot rescue a CO-corrupted measurement; it inherits the bias and confidently reports the wrong difference.

The methodology problem — most benchmarks are wrong (/wiki/the-methodology-problem-most-benchmarks-are-wrong) frames the catalogue of failure modes that pairing addresses: shared environmental noise, drift, deployment bias. Read those three together with this one and you have the toolkit to convert "I think this PR is faster" into "the paired CI says +0.91 to +1.24 ms with 95% confidence on hardware whose A/A floor is ±0.06 ms".

Beyond Part 4, the A/B machinery shows up wherever you compare two systems empirically: comparing two allocator choices in Part 11, two GC tunings in Part 13, two load-balancer hashing schemes in Part 14. The protocol is identical in all those settings; only the unit under test changes. Once the muscle memory is built — interleave, pair, bootstrap, A/A floor — every subsequent "is X faster than Y?" question becomes mechanical instead of opinion-driven.

A final connection back to the bootstrap chapter that immediately precedes this one. The bootstrap CI machinery is the statistic; the interleaved A/B protocol is the experimental design that produces data the bootstrap can honestly summarise. Either alone is half the picture: a bootstrap CI on contaminated data is a confident lie, and a perfect interleaving with no CI on the result is a number with no error bar.

Together they convert "I think this PR is faster" into a number with bounds, computed on data that is fair by construction. That is the entire promise of "benchmarking without lying" as a discipline — and the foundation that the rest of Part 4, plus every later part of this curriculum, builds upon.

References