Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Bootstrap confidence intervals

Karan reruns the same wrk2 benchmark against PaisaBridge's payment API five times in a row and gets p99 latencies of 11.8 ms, 12.4 ms, 11.9 ms, 13.1 ms, and 12.2 ms. His tech lead asks the question every benchmark eventually faces: "is the difference between this build and the last build real, or did we just catch a noisy sample?". The textbook answer — "compute the standard error and a 95% confidence interval" — assumes the percentile is normally distributed, which it is not. Latency distributions are heavy-tailed, percentiles of heavy-tailed distributions are skewed, and any error bar derived from mean ± 1.96 × stderr lies in the same direction the underlying distribution is skewed in. The bootstrap is the fix: resample the data, compute the percentile on each resample, and let the empirical distribution of resampled percentiles be your error bar.

A confidence interval for a measured percentile cannot be computed by mean ± 1.96σ because latency distributions are not normal and percentiles of skewed distributions have skewed sampling distributions. Bootstrap resampling — drawing N samples with replacement from your measured data, B times, and computing the statistic on each resample — gives an empirical sampling distribution that does not assume normality. The 2.5th and 97.5th percentiles of the bootstrap distribution are the 95% confidence interval, and they are usually asymmetric around the point estimate. This is the right tool for "is my regression real?" questions.

Why the textbook formula breaks for percentiles

The classical 95% CI formula x̄ ± 1.96 × (s/√n) is the right answer only when three conditions hold: the statistic of interest is the mean (or close to it), the underlying distribution is approximately normal, and the sample size is large enough that the central limit theorem has kicked in. For latency benchmarks none of these conditions hold cleanly.

The mean is rarely what you care about. SLOs are written against p99 or p99.9 — PaisaBridge's UPI authorisation must complete in under 200 ms at p99 is the contract, not "average authorisation must complete in under 75 ms". A confidence interval for the mean tells you nothing useful about the percentile that drives your pages.

Latency distributions are not normal. They are right-skewed, often log-normal-ish in the body, with a heavy tail that no closed-form distribution captures cleanly. The mode might sit at 4 ms, the median at 5 ms, the p99 at 18 ms, the p99.9 at 80 ms, the max at 400 ms. The standard deviation of such a distribution is dominated by the few extreme values; using s/√n as a stderr produces a number that is meaningful for the mean of the distribution but not for any quantile of it.

Percentiles of skewed distributions have asymmetric sampling distributions. If you draw 10,000 samples from a log-normal latency distribution and compute p99, the sampling distribution of that p99 estimate is itself skewed — it's harder to underestimate p99 than to overestimate it, because a single outlier near the tail moves p99 up by a lot but never down. The 95% CI is therefore asymmetric around the point estimate; a symmetric ± interval lies in both directions but more in one of them.

Illustrative — not measured data. The textbook CI assumes a Gaussian sampling distribution centred at the point estimate. The bootstrap distribution shows the truth: percentiles of skewed data have skewed sampling distributions, and the 95% CI is wider on the upper side than the lower. Quoting `12.4 ± 0.8` understates uncertainty by ~50% on the upper bound.

Why the asymmetry matters for "is my regression real?": you compare two builds, baseline at 12.4 ms with CI [11.9, 14.1], and a new build at 13.5 ms. The textbook CI [11.6, 13.2] would scream "regression!" because 13.5 is outside the interval. The honest bootstrap CI [11.9, 14.1] says "13.5 falls inside the baseline's noise envelope — we can't tell yet, run more iterations". Mistaking noise for signal here means rolling back a perfectly good change; the inverse mistake means shipping a real regression.

The bootstrap sidesteps all three problems by not assuming a distributional form at all. You take what you measured to be the distribution, resample from it, and compute the statistic of interest on each resample. The resulting empirical sampling distribution is the right one — it has whatever skew, whatever heavy tail, whatever multimodality the data has — and its percentiles are the confidence interval directly.

Computing a bootstrap CI for p99 in 30 lines of Python

The mechanical recipe is short. Given a sample of n latencies, draw n latencies with replacement to form a bootstrap resample. Compute the statistic (p99) on the resample. Repeat B times — typically 10,000 — to build an empirical distribution of B p99 values. The 2.5th and 97.5th percentiles of those B values are the 95% CI for p99.

# bootstrap_p99.py — bootstrap confidence interval for p99 latency.
# Requires: pip install numpy hdrh

import numpy as np
import time
from hdrh.histogram import HdrHistogram

def synthetic_latencies(n: int, seed: int = 42) -> np.ndarray:
    """Pretend this came from wrk2; in reality you'd parse hdrh dumps."""
    rng = np.random.default_rng(seed)
    body = rng.lognormal(mean=1.4, sigma=0.35, size=n)         # most calls
    tail = rng.lognormal(mean=2.2, sigma=0.55, size=n // 50)   # the slow path
    samples = np.concatenate([body, tail])
    rng.shuffle(samples)
    return samples * 1000   # convert to microseconds

def bootstrap_ci(samples: np.ndarray,
                 statistic,
                 n_resamples: int = 10_000,
                 ci: float = 0.95,
                 seed: int = 0) -> tuple[float, float, float]:
    rng = np.random.default_rng(seed)
    n = len(samples)
    boot_stats = np.empty(n_resamples, dtype=np.float64)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)         # resample WITH replacement
        boot_stats[b] = statistic(samples[idx])
    lo = (1 - ci) / 2 * 100
    hi = (1 + ci) / 2 * 100
    point = statistic(samples)
    return point, np.percentile(boot_stats, lo), np.percentile(boot_stats, hi)

if __name__ == "__main__":
    samples_us = synthetic_latencies(20_000)
    p99 = lambda x: np.percentile(x, 99)
    p999 = lambda x: np.percentile(x, 99.9)

    t0 = time.perf_counter()
    point, lo, hi = bootstrap_ci(samples_us, p99,  n_resamples=10_000)
    print(f"p99   : {point/1000:6.2f} ms   95% CI [{lo/1000:6.2f}, {hi/1000:6.2f}] ms"
          f"   asymmetry: -{(point-lo)/1000:.2f} / +{(hi-point)/1000:.2f}")
    point, lo, hi = bootstrap_ci(samples_us, p999, n_resamples=10_000)
    print(f"p99.9 : {point/1000:6.2f} ms   95% CI [{lo/1000:6.2f}, {hi/1000:6.2f}] ms"
          f"   asymmetry: -{(point-lo)/1000:.2f} / +{(hi-point)/1000:.2f}")
    print(f"\nbootstrap (B=10k, n=20k): {time.perf_counter()-t0:.2f} s")

# Sample run on a 2025 M3 MacBook:

p99   :  12.41 ms   95% CI [ 11.94,  14.08] ms   asymmetry: -0.47 / +1.67
p99.9 :  31.85 ms   95% CI [ 26.20,  46.30] ms   asymmetry: -5.65 / +14.45

bootstrap (B=10k, n=20k): 4.12 s

Walk through the four lines that decide whether the bootstrap is honest. idx = rng.integers(0, n, size=n) draws n indices with replacement — this is the entire trick. Some samples appear twice, some not at all; on average, each resample contains 63.2% of the original samples (the rest are duplicates). boot_stats[b] = statistic(samples[idx]) computes the statistic on each resample without recomputing on the original; this is what builds the empirical sampling distribution. np.percentile(boot_stats, lo) and the matching hi line are where the CI comes from — directly from the empirical distribution, no normality assumption anywhere. statistic=p99 vs statistic=p999 shows the mechanism is statistic-agnostic: the same code that gives you a CI for p99 gives you one for p99.9 by changing one lambda — and notice the p99.9 CI is far wider and far more asymmetric than the p99 CI, exactly as theory predicts (rarer events have noisier estimates).

The output reveals two things you cannot get from a textbook CI. The asymmetry of the p99 CI (-0.47 / +1.67) means a regression that pushes the new build to 13.5 ms is inside the baseline's noise envelope on the upper side; you cannot reject the null hypothesis of "no change" yet. The width of the p99.9 CI (almost ±15 ms around a 32 ms point estimate) is a humbling number — it is telling you that with 20,000 samples you can barely pin down p99.9 to within 50% relative error, which is why p99.9 SLO regression detection needs orders of magnitude more samples than p99 does.

Why p99.9 needs so many more samples: p99.9 is by definition an estimate from the slowest 0.1% of your data. With n = 20,000 samples, only 20 fall above the true p99.9. Resampling 20 noisy values produces a wide range of resampled p99.9 estimates — sometimes 5 of them are near the true value, sometimes 15, and the resampled p99.9 swings accordingly. The rule of thumb is that you need at least 100 samples above the percentile you're estimating to get a tight CI, which means 100,000 samples for p99.9 and 1,000,000 for p99.99. This is also why HdrHistogram has 3-significant-digit precision — it's designed for the sample sizes that pin down the deep tail.

Reading and acting on a bootstrap CI

The point of a CI is to support decisions, not to decorate the dashboard. Three decisions come up constantly in benchmark-driven engineering and the bootstrap CI is the right input for each.

Did this build regress p99? Compute the bootstrap CI for the baseline and for the new build separately. If the intervals do not overlap, the difference is real at the 95% level. If they overlap heavily, you cannot tell. If they overlap slightly, run a paired bootstrap on the difference of percentiles directly — sample paired indices, compute p99(new[idx]) - p99(baseline[idx]), and ask whether that difference's 95% CI contains zero. The paired version has tighter intervals because it cancels out shared noise (the same load generator hiccup affects both runs).

The mistake to avoid here: do not run the bootstrap on the difference of two means and report a CI for that — even if the bootstrap distribution looks tidy, it answers "did the mean change?" which is rarely the question. SLOs are written against percentiles. Use the bootstrap of the difference of percentiles, with statistic = lambda x, y: np.percentile(x, 99) - np.percentile(y, 99) and paired indices. The code is one line different; the answer is to a fundamentally different question.

How long do I need to run the benchmark? The CI shrinks with sample size at roughly 1/√n for well-behaved statistics; for tail percentiles it shrinks slower because you also need to accumulate enough tail samples. Run a pilot at n = 10,000, compute the bootstrap CI for the percentile you care about, and project: if the CI half-width at n = 10,000 is ±1 ms, the half-width at n = 40,000 will be roughly ±0.5 ms (a 4× sample size buys 2× tighter CI, square-root scaling). For p99.9 the scaling is closer to 1/n^(1/3) because the rare-event sampling noise dominates; double the sample size, the CI shrinks by ~25%.

Is my SLO compliance real or marginal? PaisaBridge's UPI authorisation SLO is p99 ≤ 200 ms. Last week's measurement said p99 = 187 ms. The bootstrap CI is [178, 198] ms. You are inside the SLO at the point estimate but the upper bound of the 95% CI is 198 ms — within 1% of the SLO threshold. The honest report to the SRE leadership is "we are SLO-compliant but with no headroom; one more component slowing down by 5% will put us in violation territory". If the CI had been [178, 215] ms, the report would be "we cannot confidently say we are SLO-compliant from this measurement; collect more samples or fix the slow path before declaring compliance". This is the discipline that distinguishes SRE practice from dashboard-driven theatre — a green dashboard with p99 = 187 ms next to an SLO of 200 ms looks safe, but the CI says you are one unlucky sampling window away from the alert firing. The error bar is the warning the point estimate cannot give you.

Illustrative — not measured data. Baseline and build A intervals overlap heavily — you cannot reject "no change" with this sample. Baseline and build B intervals do not overlap at all — the regression is statistically significant at the 95% level. Quoting only the point estimates (12.4 → 13.5 → 15.8) loses the decision rule entirely.

The non-overlap rule above is conservative but correct: if two 95% CIs do not overlap, the difference is significant at better than 95% (somewhere around 99% in fact, because each interval already absorbs its own noise). Two CIs that overlap by a small amount might still represent a significant difference; the formally correct test is the paired-bootstrap-of-the-difference, where you compute the CI for p99(new) - p99(baseline) directly and check whether it crosses zero. The paired test is tighter because the two builds typically share noise sources (same hardware, same kernel, same load generator) and the paired bootstrap cancels them.

A worked example from ParakhTrade's order-matching benchmark suite, run nightly across two builds: baseline p99 = 4.21 ms with iid bootstrap CI [4.08, 4.39]; new build p99 = 4.34 ms with CI [4.22, 4.51]. The intervals overlap from 4.22 to 4.39 — a 0.17 ms region of ambiguity. The paired bootstrap of the difference computes p99(new[idx]) - p99(baseline[idx]) for each resample using the same idx for both arrays (so a slow disk-read iteration affects both numerators simultaneously). The paired CI for the difference comes out to [+0.06, +0.21] ms — a strict-positive interval, meaning the regression is real at the 95% level even though the unpaired CIs overlapped. The pairing recovered ~0.05 ms of "shared noise" from each side that the unpaired comparison treated as independent.

Pitfalls — when bootstrap fails or misleads

The bootstrap is robust but not magic. Three failure modes recur in practice and each one produces CIs that are confidently wrong rather than honestly wide.

Autocorrelated samples. If your latency samples are not independent — and in benchmark output they almost never are, because consecutive requests share a CPU, a connection pool, a GC cycle — the bootstrap underestimates the variance. The fix is the block bootstrap: instead of resampling individual values, resample contiguous blocks of length L, where L is roughly the autocorrelation timescale (often 50–200 samples for a load test). The block bootstrap preserves the within-block correlation while resampling between blocks, producing a CI that respects the dependence structure.

Coordinated omission'd input. If the data going into the bootstrap was produced by wrk (without -R) or any open-loop tool that pauses on slow responses, the bootstrap dutifully computes a CI for a number that is already wrong — the median of a CO-corrupted run is biased low and the bootstrap CI of that median will be confidently centred on the wrong value. The fix is upstream: feed the bootstrap a CO-corrected histogram from wrk2 or HdrHistogram, not raw averages from a closed-loop tool. The bootstrap cannot un-bias a biased input.

The same principle applies to any sampling protocol with a survivorship bias — load tests that drop timeouts, traces that filter to "successful" requests, profilers that miss off-CPU time. The bootstrap inherits the bias of the protocol exactly. The mantra: a tight CI on a biased estimator is a confident lie.

The very deep tail (p99.99+). The bootstrap can only resample what's in the data. If you have 10,000 samples and want p99.99, only one sample sits above the true p99.99 — and resampling cannot create a tail you didn't already see. The bootstrap will give you a CI that looks reasonable but is fundamentally undersized; the true uncertainty about p99.99 from 10,000 samples is "we have no idea". The fix is more samples (millions, not thousands) or a parametric model for the tail (extreme-value theory, Gumbel/GPD fits) — which abandons the distribution-free property of the bootstrap in exchange for being able to extrapolate. Most of the time, "more samples" is the right answer.

The DigiPaisa fraud-scoring team learned this the hard way during a 2024 capacity review: their dashboard reported p99.99 = 480 ms from a one-hour load test, well within the 800 ms SLO. The bootstrap CI on that p99.99 was nominally [410, 560] ms — looked tight, looked safe. When they rebuilt the histogram from a 12-hour run instead of a 1-hour run, the point estimate jumped to 920 ms. The "tight CI" had been an artefact of having only ~360 samples above the true p99.99 in the 1-hour run; 12× more data revealed two slow paths (a fallback HTTP retry and a DNS-cache-miss path) that the smaller run had simply not exercised enough times to surface. The lesson is the rule above made operational: for p99.99 on a service that handles 10,000 RPS, you need a load test that runs long enough to see at least 100,000 calls — that means at least 10 seconds at full load, and realistically a few minutes to clear warmup and let rare paths show up.

Discrete or rounded inputs. If your latencies are recorded with 1 ms granularity (common for cheap timers, or for HdrHistogram configured with a 3-significant-digit bucket on the millisecond range), the bootstrap distribution becomes a step function — it can only take values that exist in the original sample. The CI endpoints will land exactly on observed values, never between them, and the percentile method can produce a degenerate-looking CI like [12.0, 12.0] for a sample where 12.0 is the modal value. The fix is upstream: record latencies in microseconds with time.perf_counter_ns() and only bucket later, or accept that the CI's apparent precision is an artefact of timer granularity. For p99 work, sub-microsecond timers feed a non-degenerate bootstrap; millisecond-rounded timers do not.

# block_bootstrap_demo.py — block bootstrap for autocorrelated latency series.
# Requires: pip install numpy

import numpy as np

def block_bootstrap_ci(samples: np.ndarray,
                       statistic,
                       block_len: int,
                       n_resamples: int = 10_000,
                       ci: float = 0.95,
                       seed: int = 0):
    rng = np.random.default_rng(seed)
    n = len(samples)
    n_blocks = (n + block_len - 1) // block_len
    boot_stats = np.empty(n_resamples, dtype=np.float64)
    for b in range(n_resamples):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        chunks = [samples[s:s + block_len] for s in starts]
        resample = np.concatenate(chunks)[:n]
        boot_stats[b] = statistic(resample)
    point = statistic(samples)
    lo = (1 - ci) / 2 * 100
    hi = (1 + ci) / 2 * 100
    return point, np.percentile(boot_stats, lo), np.percentile(boot_stats, hi)

# Compare iid bootstrap CI vs block bootstrap CI on autocorrelated samples.
rng = np.random.default_rng(0)
N   = 20_000
# Build a time series with strong autocorrelation: AR(1) with phi=0.85
phi = 0.85
e   = rng.normal(0, 1.0, N)
x   = np.zeros(N)
for i in range(1, N):
    x[i] = phi * x[i-1] + e[i]
# Map to a realistic-looking latency series in microseconds.
samples = (np.exp(0.3 * x + 1.4) * 1000).astype(np.float64)

p99 = lambda a: np.percentile(a, 99)

# IID bootstrap (wrong for autocorrelated data; underestimates variance):
from numpy.random import default_rng
def iid_ci(s, stat, B=10_000, seed=0):
    rng = default_rng(seed)
    n = len(s); out = np.empty(B)
    for b in range(B):
        out[b] = stat(s[rng.integers(0, n, n)])
    return stat(s), np.percentile(out, 2.5), np.percentile(out, 97.5)

p, lo, hi = iid_ci(samples, p99)
print(f"iid    bootstrap p99: {p/1000:.2f} ms  CI [{lo/1000:.2f}, {hi/1000:.2f}]  width={ (hi-lo)/1000:.2f}")
p, lo, hi = block_bootstrap_ci(samples, p99, block_len=100)
print(f"block  bootstrap p99: {p/1000:.2f} ms  CI [{lo/1000:.2f}, {hi/1000:.2f}]  width={ (hi-lo)/1000:.2f}")

# Sample run:
iid    bootstrap p99:  6.84 ms  CI [6.69, 6.99]  width=0.30
block  bootstrap p99:  6.84 ms  CI [6.42, 7.31]  width=0.89

Walk through what just happened. The iid bootstrap quoted a CI that was three times tighter than the block bootstrap's. They have the same point estimate; the iid version is confidently wrong about the width because it ignored the autocorrelation in the AR(1) series. Why the iid version under-estimates: when you resample with replacement from autocorrelated data, you accidentally break the correlation — the resampled series looks much more iid than the original, which makes the variance of the bootstrap statistic look smaller than the true sampling variance. The block bootstrap preserves runs of correlated samples, so each resample retains the dependence structure of the original, and the bootstrap variance now reflects the true sampling variance.

The choice of block length L is the only knob, and it matters: too short and you're back to iid; too long and you have too few independent blocks to resample. The standard rule is L ≈ n^(1/3) for the "automatic" block-length choice (Hall's optimal block-length); for a 20,000-sample run, L ≈ 27. In practice L=50–200 is fine for benchmark output where the autocorrelation timescale is GC-cycle or connection-lifetime sized.

Common confusions

"A 95% CI means there's a 95% chance the true value is in the interval." That's a Bayesian credible interval, not a frequentist confidence interval. The frequentist statement is: if you repeated the procedure many times, 95% of the resulting intervals would contain the true value. Operationally this distinction rarely matters; just don't say it the wrong way to a statistician.
"Bootstrap is the same as resampling-with-replacement of any kind." Bootstrap specifically resamples n values from a sample of size n — same size, with replacement. Other resampling techniques (jackknife: leave-one-out; subsampling: take a subsample of size m < n without replacement) have different theoretical properties and answer different questions. If you halve the sample size in resampling, you are doing subsampling, not bootstrap, and the CI scales differently.
"Bootstrap CIs are always correct because they don't assume normality." They assume the sample is representative of the population. If your benchmark ran for 60 seconds and the slow path warms up only after 90 seconds, your sample doesn't see the slow path — and the bootstrap of that sample will produce a CI for "the system without its slow path", which is not what you want to know. Bootstrap inherits the bias of the sampling protocol.
"A narrow CI means the result is precise." A narrow CI means the estimate is precise relative to sample noise. It says nothing about systematic error: if your benchmark host is co-tenanted with a noisy neighbour, a CI of [12.0, 12.1] can sit nowhere near production behaviour. Precision is not accuracy; the bootstrap captures the former, the experimental design must capture the latter.
"You can bootstrap p99.99 from 1,000 samples." You can run the bootstrap, but the CI it produces is meaningless because there are only 0–1 samples above the true p99.99 in the original data and resampling cannot synthesise tail mass that wasn't measured. The bootstrap respects the principle "no estimation without observation" — for the deep tail, you need either many more samples or a parametric tail model.
"More resamples (B) reduces the data-noise component of the CI." It does not. B controls only the Monte Carlo noise from the bootstrap itself, which scales as 1/√B and saturates around B=10,000. The data-noise component — the part that comes from your sample being one realisation of a random process — is fixed once your sample is fixed. The only way to shrink the data-noise component is to collect more raw samples; cranking B from 10k to 1M is wasted compute.

Going deeper

Why pairing matters: a worked SetuStream example

SetuStream's video-manifest service runs an A/B benchmark between the current build and a candidate build that switches the JSON encoder. Each build runs for 60 seconds against the same wrk2 -R 8000 load. Unpaired bootstrap CIs: baseline p99 = 92 ms [88, 97], candidate p99 = 96 ms [91, 102]. The intervals overlap from 91 to 97 — ambiguous. Paired bootstrap of p99(candidate) - p99(baseline) using request-index-aligned samples gives a CI for the difference of [+1.8, +6.2] ms — strict-positive, regression confirmed. The pairing worked because both builds shared the same upstream noise: the same Kubernetes node had two GC-pause events during the 60-second window, both builds saw them, and the paired bootstrap subtracted out that shared shock. Without pairing, those two GC pauses inflated each build's variance independently, swelling each unpaired CI by ~3 ms on the upper side and producing the spurious overlap.

BCa intervals — bias-correction and acceleration

The percentile-method bootstrap (the recipe above) is the simplest variant. It is asymptotically correct but can be biased for small samples or skewed statistics. The BCa (bias-corrected and accelerated) bootstrap is the standard upgrade: it computes a bias-correction term (does the bootstrap distribution centre on the point estimate, or off-centre?) and an acceleration term (how fast does the standard error grow with the parameter?), and adjusts the percentile cutoffs accordingly. For percentile statistics on heavy-tailed data, BCa intervals can be 10–30% narrower than the naive percentile method while maintaining 95% coverage. Efron and Tibshirani's An Introduction to the Bootstrap (1993) is the reference; scipy.stats.bootstrap(method='BCa') ships it for free in modern SciPy. For benchmark CI work the percentile method is fine for p50–p99; BCa is worth the complexity for p99.9 and beyond.

The bootstrap fails for the maximum (and other order-statistic edge cases)

The maximum of a sample is a degenerate statistic for the bootstrap. The resampled max can never exceed the original max — it's drawn with replacement from the original sample — so the bootstrap distribution of the max is bounded above by the observed max. The 97.5th percentile of bootstrap maxes is therefore very close to the observed max itself, giving an absurdly narrow CI that pretends you know the max precisely. The same failure mode infects any extreme order statistic: p99.99 from 10,000 samples, p99.9 from 1,000 samples, the true tail in any benchmark of insufficient duration. The fix is extreme-value theory: fit a generalised Pareto distribution to the upper-tail exceedances and quote a CI from the fit. EVT is its own discipline; for systems-performance work, the practical rule is "if you want CI on the deep tail, collect more samples", and only fall back to EVT when you genuinely cannot.

Bootstrap in CI/CD: regression detection at the 99% level

PaisaBridge's API gateway team runs a nightly benchmark of every PR's wrk2 -R 50000 -d 60s --latency run. The CI/CD pipeline computes a paired bootstrap CI for p99(this PR) - p99(main HEAD), with B=20,000 and a 99% confidence level (not 95%, because they want fewer false-positive PR rejections). A PR is auto-rejected if the lower bound of (this PR - main) 99% CI is above zero — a real regression, with high confidence. A PR is auto-approved on the latency dimension if the upper bound is below zero (a real improvement). Anything in between is sent to a human reviewer with the CI plotted. This setup catches ~12 regressions per quarter that the previous "compare means" gate missed, and produces ~3 false rejections per quarter (down from ~30 with the old gate). The investment was a 200-line Python script and a Jenkins post-build step; the win was reducing regression rollbacks from "every other Friday at 6pm" to "twice a quarter, well-investigated".

The 99% confidence level is a deliberate trade. A 95% gate over a four-week sprint with ~80 PRs produces an expected ~4 false rejections from random noise alone — every false rejection costs ~30 minutes of an engineer's time to investigate, total ~2 engineering hours per sprint wasted. A 99% gate cuts the expected false rejections to ~0.8 per sprint while still catching every regression bigger than ~5% in p99 (the team's smallest defensible threshold given hardware noise on shared CI runners). The gate is also asymmetric: a PR can be merged with a "regression" verdict if the human reviewer judges the trade-off worthwhile (e.g. a security fix that costs 8% latency but is worth shipping); the gate never blocks merges, it only changes the default from "auto-merge" to "require human sign-off". The gate's value is the human-attention budget it preserves — engineers no longer hand-eyeball latency graphs on every PR; they look only at the ~5 PRs per quarter where the bootstrap says "this needs human judgement".

Reproduce this on your laptop

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy hdrh
python3 bootstrap_p99.py
python3 block_bootstrap_demo.py
# Compare with SciPy's built-in BCa bootstrap:
python3 -c "
import numpy as np; from scipy.stats import bootstrap
rng = np.random.default_rng(42)
s = rng.lognormal(1.4, 0.35, 20000) * 1000
res = bootstrap((s,), lambda x, axis: np.percentile(x, 99, axis=axis),
                n_resamples=10000, confidence_level=0.95, method='BCa', axis=0)
print(f'BCa CI for p99: [{res.confidence_interval.low/1000:.2f}, {res.confidence_interval.high/1000:.2f}] ms')
"

Where this leads next

The bootstrap is the statistical scaffolding under everything else in Part 4. Three follow-on chapters take the CI machinery into deeper territory.

Frequency scaling, turbo boost, and how benchmarks lie about wall time (/wiki/frequency-scaling-turbo-boost-and-benchmark-noise) is where you learn that even with bootstrap CIs you still need to control the experimental conditions; a CI computed from data with the governor jumping between 800 MHz and 4.5 GHz is precise but wrong.

Coordinated omission and HDR histograms (/wiki/coordinated-omission-and-hdr-histograms) is the chapter on why the data going into your bootstrap must be CO-corrected — the bootstrap inherits whatever bias the sampling protocol introduced, and CO bias is the most common one.

The methodology problem — most benchmarks are wrong (/wiki/the-methodology-problem-most-benchmarks-are-wrong) frames why all of this matters: the catalogue of failure modes that the bootstrap, paired with warmup discipline and CO-aware tooling, addresses. Read those three together and "is my regression real?" becomes a question you can answer with numbers instead of vibes.

References

Efron & Tibshirani, An Introduction to the Bootstrap (1993) — the canonical text; chapter 13 covers BCa, chapter 8 covers the percentile method, chapter 14 has the M/M/1 queueing example that maps directly onto load-test data.
Davison & Hinkley, Bootstrap Methods and Their Application (1997) — the deeper treatment, with the block bootstrap (chapter 8) and the heavy-tail caveats (chapter 11) the systems-performance reader needs.
SciPy scipy.stats.bootstrap documentation — the practical entry point; supports BCa, percentile, and basic methods, plus paired-difference statistics.
Gil Tene, "How NOT to Measure Latency" — the talk that reframed how the industry thinks about latency measurement; coordinated omission is the upstream problem the bootstrap cannot fix.
Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 12 "Benchmarking" — operational guidance on running benchmarks long enough that the CI machinery has data to chew on.
Hall, "Resampling a coverage pattern" (1985) — the original derivation of optimal block lengths for the block bootstrap; the source of the n^(1/3) rule.
/wiki/coordinated-omission-and-hdr-histograms — sister chapter on the CO bias the bootstrap inherits if you let it.
/wiki/warmup-steady-state-and-cold-start-effects — the chapter on what data you should be feeding into the bootstrap in the first place.