The methodology problem: most benchmarks are wrong

Karan at Flipkart ran a microbenchmark on Friday afternoon to compare two JSON parsers for the catalogue-service hot path. He got 1.42M ops/sec on parser A and 0.84M ops/sec on parser B — a clean 1.7× win for A. He cleaned up the code, posted the numbers in the engineering Slack, and went home. On Monday, the platform team's reproduction run showed 0.91M ops/sec on parser A and 1.38M ops/sec on parser B — a 1.5× win for B, in the opposite direction. Same code, same c6i.4xlarge instance type, same JDK build. The thing that changed was that Karan had run his benchmark in a fresh JVM with -Xmx256m on a CPU that had been idle for two hours, and the platform team had run theirs in a long-lived JVM with -Xmx16g on a CPU that had been pinned at 80% by the noisy neighbour for six hours. Both numbers were correct measurements. Neither was the truth.

A benchmark is not a measurement; it is an experiment, and most engineers run it without controls. Frequency boost, JIT warmup, allocator state, page-cache temperature, NUMA placement, sample size, and choice of statistic each shift results by 1.5×–4× independently of the code under test. The methodology — what you control, what you randomise, what you report — is more important than the number itself, and is the one thing benchmark blog posts almost never describe.

A benchmark is an experiment, not a stopwatch

The folk model of benchmarking is: write a loop, time it, divide. Throughput is operations per second. Lower is faster. The model is wrong because it treats the benchmark as a function of the code, when in fact it is a function of the code, the machine, the OS, the runtime, the workload shape, and the experimenter's choices about all of the above. A benchmark is closer to a chemistry experiment than a stopwatch reading: the number you get out depends on conditions you must control, randomise, or document — and if you do none of these, your number describes nothing reproducible.

The standard rebuttal — "but I ran it three times and got similar numbers" — is the trap. Three runs taken back-to-back on the same machine in the same JVM with the same cache state are highly correlated; they are not three independent samples of the underlying performance distribution, they are three samples of one particular operating point. A different morning, a different boot, a different background workload, and the operating point shifts. The within-run noise is small; the between-run noise is what you actually care about and is what your methodology must surface.

[Figure: Within-run vs between-run variance — the same benchmark, different days. Two clusters of points on a throughput axis from 0.5M to 2.0M ops/sec: a Friday session of three runs tightly grouped at 1.42M ± 1%, and a Monday session of three runs tightly grouped at 0.91M ± 1%. The within-run spread of ±1% is invisible to the experimenter; the between-run gap of 1.56× is the variance that matters. Caption: "I ran it three times" hides the variance you care about.]
Three back-to-back runs in one session look "tight" because the operating point is fixed within the session. The methodology problem is the gap between sessions, which only randomised, multi-session sampling can expose. Illustrative — the magnitudes are those from Karan's incident.

Why three back-to-back runs are not three independent samples: the JIT compiler reuses its profile across runs in the same process; the allocator's free-list shape carries across; the L1/L2/LLC are warm with the benchmark's working set; the OS page cache holds the binary and the input data; the CPU's branch predictor has trained on the loop. Every one of these is sticky state that resets only on process exit, machine reboot, or coincidental eviction. Re-running the loop measures how fast that operating point goes — not how fast the code goes in general.
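One way to sample across operating points rather than within one is to launch every run in a fresh process, so JIT profiles, allocator free-lists, and warmed code caches cannot carry over between samples. A minimal sketch of the pattern; the inline child workload and batch lengths are illustrative, not a recommendation:

```python
import json
import statistics
import subprocess
import sys

# Child script executed in a fresh interpreter per sample: per-process
# sticky state (allocator shape, warmed code paths) resets on every exec.
CHILD = r"""
import json, time
data = bytes(range(256)) * 4096          # ~1 MB working set
t0 = time.perf_counter(); ops = 0
while time.perf_counter() - t0 < 0.5:    # 0.5 s measured batch
    sum(data)                            # stand-in for the code under test
    ops += 1
print(json.dumps({"ops_per_s": ops / (time.perf_counter() - t0)}))
"""

def fresh_process_samples(n_runs: int = 5) -> list:
    """One sample per fresh process, so samples are closer to independent."""
    samples = []
    for _ in range(n_runs):
        out = subprocess.run([sys.executable, "-c", CHILD],
                             capture_output=True, text=True, check=True)
        samples.append(json.loads(out.stdout)["ops_per_s"])
    return samples

if __name__ == "__main__":
    s = fresh_process_samples()
    print(f"median {statistics.median(s):.0f} ops/s, "
          f"spread {max(s) - min(s):.0f}")
```

Fresh processes remove only the process-local sticky state; page cache, governor, and background load still need the controls described below.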

The first methodological move is to recognise that "this benchmark says A is faster than B" is a statement that includes a confidence interval and a list of held-constant conditions. Without those, the statement does not have a truth value. With them, the statement is testable, refutable, and useful. Most published benchmark blog posts ship with neither — and the silent assumption that one run is one truth is what produces the rolling cycle of "X is 3× faster than Y" / "no, Y is 2× faster than X" arguments that dominate the systems-performance corner of the internet.

The seven knobs that move benchmark numbers more than your code does

Before you compare two implementations, you have to neutralise the variables that move the result independently of either implementation. The list is shorter than people think — seven knobs cover ~95% of the noise sources you will encounter on a Linux box.

1. CPU frequency scaling. The Linux ondemand and schedutil governors raise the clock when load appears and lower it when load falls. A microbenchmark that warms up for 200 ms on an idle box runs the warmup at 800 MHz and the measured loop at 3.6 GHz, conflating warmup with measurement. Worse, Intel Turbo Boost and AMD Precision Boost raise single-core clocks above the rated base when other cores are idle — your benchmark hits 4.7 GHz alone, but at production load with 30 other cores busy, the same code runs at 3.2 GHz. Set the governor to performance (echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor) and disable boost (echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo — writing 1 to no_turbo disables turbo) for the measurement run. Document the choice; report it alongside the number.
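Because a mis-set governor silently invalidates the run, it is worth failing fast when these controls are not in place. A sketch that inspects the sysfs state before measuring; the paths are the common intel_pstate layout, and other cpufreq drivers expose different files (e.g. cpufreq/boost), so treat the checks as an adaptable template:

```python
from pathlib import Path

def check_cpu_controls(base: str = "/sys/devices/system/cpu") -> list:
    """Return a list of problems; an empty list means the knobs look right."""
    problems = []
    gov = Path(base) / "cpu0/cpufreq/scaling_governor"
    if gov.exists():
        current = gov.read_text().strip()
        if current != "performance":
            problems.append(f"governor is {current!r}, want 'performance'")
    turbo = Path(base) / "intel_pstate/no_turbo"
    if turbo.exists() and turbo.read_text().strip() != "1":
        problems.append("turbo boost still enabled (no_turbo != 1)")
    return problems

if __name__ == "__main__":
    for p in check_cpu_controls():
        print("WARNING:", p)
```

Running this at harness startup and refusing to proceed on warnings turns "I forgot to set the governor" from a silent 1.5× error into an immediate failure.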

2. JIT and runtime warmup. A JVM benchmark that runs for less than ~10 seconds is measuring the interpreter, the C1 tier, and the C2 tier in unknown proportions, not the steady-state code. A V8 Node benchmark for less than ~5 seconds is measuring Sparkplug and Maglev, not TurboFan. A PyPy benchmark in its first 30 seconds is in the tracing phase, not the JIT-emitted phase. Tools like JMH (Java) and Criterion (Rust) explicitly handle this with mandatory warmup phases; ad-hoc benchmarks usually do not. The fix is to discard the first N seconds of measurements (3-5s for low-tier JITs, 10-30s for HotSpot's C2) and then measure for at least as long again.

3. Page cache temperature. A benchmark that reads a 4 GB file from disk on its first iteration measures disk-read throughput; on its second iteration, the file is in the page cache and the same code measures memcpy throughput. The two numbers can differ by 100×. Either drop caches (echo 3 | sudo tee /proc/sys/vm/drop_caches) before each measurement or pre-warm and report only post-warmup numbers — but never silently mix the two regimes in the same run.

4. Allocator state and fragmentation. malloc returns blocks from a free-list whose shape depends on the program's allocation history. A benchmark in a freshly-started process gets clean, contiguous allocations; the same benchmark after 20 minutes of mixed-size allocations from background work gets fragmented, scattered ones. The latency difference can be 1.5× from pure cache-locality effects on the allocated regions. Run benchmarks in fresh processes, or explicitly call malloc_trim(0) (glibc) to compact, or use a fixed-arena allocator that does not depend on history.
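The glibc compaction call mentioned above is reachable from Python through ctypes. A hedged sketch — malloc_trim is glibc-specific, so on musl or macOS the symbol is absent and the helper degrades to a no-op:

```python
import ctypes
import ctypes.util

def malloc_trim() -> bool:
    """Ask glibc to release free heap pages back to the OS before a
    measured run. Returns True if memory was released, False if not
    (or if malloc_trim is unavailable on this libc)."""
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return False
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return False              # non-glibc libc: nothing to do
    return bool(libc.malloc_trim(0))

if __name__ == "__main__":
    # Allocate and free a burst of mixed objects, then compact before
    # the measured loop would begin.
    junk = [bytearray(4096) for _ in range(10_000)]
    del junk
    print("trimmed:", malloc_trim())
```

A fresh process per sample (as in the harness below) sidesteps the problem entirely; the trim call is for long-lived benchmark drivers that cannot restart between samples.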

5. NUMA placement. Covered in detail in the previous chapter — a thread pinned to socket 0 reading data placed on socket 1 runs 1.5–3× slower than the same code with co-located thread and data. If the benchmark's first iteration first-touches the data, then migrates the thread to a different socket, the rest of the benchmark measures cross-socket DRAM. Pin both thread (taskset -c <cpu>) and memory (numactl --membind=<node>) for benchmarks; if the production code is not pinned, run the benchmark in both pinned and unpinned modes and report the spread.

6. Background load. A benchmark on a "quiet" cloud instance still has the cloud provider's monitoring agents (CloudWatch, Datadog, kubelet, fluentd), the logging daemons (rsyslog, journald), and any scheduled cron jobs running in the background, all of which compete for CPU, memory bandwidth, and L3. On a c6i.4xlarge the steady-state background load is around 2-4% CPU, but it spikes to 15-20% when log shipper agents flush. Benchmark during a flush and you measure the flush; benchmark between flushes and you measure your code. Either isolate cores (isolcpus=2-7 boot param + taskset to the isolated set), schedule the benchmark in a known-quiet window, or run for long enough that the agent activity averages out — but document which.
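A cheap guard against benchmarking during an agent flush is to sample machine-wide CPU busy time from /proc/stat just before the run and abort if the box is not quiet. A Linux-only sketch; the 5% threshold is an arbitrary illustration, not a standard:

```python
import time

def cpu_busy_fraction(interval_s: float = 1.0) -> float:
    """Fraction of CPU time spent non-idle over interval_s, machine-wide,
    read from the aggregate 'cpu' line of /proc/stat."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]          # idle + iowait jiffies
        return sum(fields), idle
    total0, idle0 = snapshot()
    time.sleep(interval_s)
    total1, idle1 = snapshot()
    d_total = total1 - total0
    return 1.0 - (idle1 - idle0) / d_total if d_total else 0.0

if __name__ == "__main__":
    busy = cpu_busy_fraction()
    if busy > 0.05:
        raise SystemExit(f"machine not quiet: {busy:.1%} busy, rerun later")
    print(f"background load {busy:.1%}, ok to benchmark")
```

This does not replace core isolation; it only documents the condition "the machine was N% busy when the run started" so the report can carry it.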

7. Choice of statistic. Mean throughput hides the tail; max latency hides the median; min latency hides the warmup. Reporting "1.42M ops/sec" without saying whether it is mean, median, or "the highest number from five runs" leaves the reader with no way to compare against another benchmark. The defaults the rest of this curriculum will use: throughput as a median across N runs with the IQR stated; latency as a percentile ladder (p50, p99, p99.9, p99.99) sourced from an HdrHistogram, never as a mean.
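The percentile ladder is mechanical to compute once raw samples are kept. A sketch of the convention above using plain nearest-rank percentiles; HdrHistogram is the production tool because it adds bounded relative error and compact storage, which matters at millions of samples:

```python
def percentile_ladder(latencies_us, pcts=(50, 99, 99.9, 99.99)):
    """Latency percentiles from raw samples — never report the mean."""
    s = sorted(latencies_us)
    n = len(s)
    # nearest-rank style: index the sorted samples directly
    return {f"p{p}": s[min(n - 1, int(n * p / 100))] for p in pcts}

if __name__ == "__main__":
    import random
    random.seed(7)
    # Bimodal latency: mostly ~1 ms, with ~1% slow ~50 ms outliers —
    # the shape the mean hides and the ladder exposes.
    lat = [random.gauss(1000, 50) if random.random() > 0.01
           else random.gauss(50_000, 2000) for _ in range(100_000)]
    for k, v in percentile_ladder(lat).items():
        print(f"{k:>7}: {v / 1000:7.2f} ms")
```

On the bimodal sample the mean lands near 1.5 ms, a value almost no request actually experiences, while p99 and above sit near 50 ms; the ladder makes the outlier mode visible.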

These seven are independent. Each can move the result by 1.5×–3×. Combined, they explain a 10× discrepancy between two well-meaning benchmark runs of the same code — which is exactly what Karan and the platform team had reproduced without knowing it.

A working harness — what controlling for these knobs looks like in code

The principles above are practical, not theoretical. Here is a Python benchmark harness that controls for the seven knobs explicitly, runs a short workload, and produces a number you can compare against another machine, another day, with confidence intervals attached.

# bench_harness.py — methodology-aware microbenchmark harness.
# Measures throughput of a small numpy operation under controlled
# CPU pinning, fixed governor, and post-warmup-only sampling.
# Run: sudo python3 bench_harness.py

import json, os, statistics, subprocess, sys, time
from pathlib import Path

import numpy as np

# --- Knob 1: governor + boost ---------------------------------------
def set_perf_governor():
    out = subprocess.run(
        ["sudo", "cpupower", "frequency-set", "-g", "performance"],
        capture_output=True, text=True
    )
    boost = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
    if boost.exists():
        boost.write_text("1\n")  # disable turbo for stable clock
    return out.returncode == 0

# --- Knob 5: NUMA + pin (cpu 2: cores 0-1 often take IRQ noise) -----
def pin_to_cpu(cpu: int):
    os.sched_setaffinity(0, {cpu})

# --- Knob 3: drop page cache (the input is in-memory anyway, but
#     dropping caches before run guards against unrelated state) -----
def drop_caches():
    Path("/proc/sys/vm/drop_caches").write_text("3\n")

# --- The workload under test: matmul of two 512x512 float32 arrays. -
#     Real production hot path; numpy gives predictable layout. -----
def make_workload():
    rng = np.random.default_rng(42)              # fixed seed = stable layout
    A = rng.standard_normal((512, 512), dtype=np.float32)
    B = rng.standard_normal((512, 512), dtype=np.float32)
    return A, B

def one_iteration(A, B):
    return A @ B

# --- The harness: warmup, then measure several short batches.
#     Each batch is one independent sample for the IQR calculation. --
def measure(A, B, warmup_s=3.0, batch_s=2.0, n_batches=15):
    end = time.perf_counter() + warmup_s
    while time.perf_counter() < end:
        one_iteration(A, B)                      # discard warmup output

    samples = []
    for _ in range(n_batches):
        ops = 0
        t0 = time.perf_counter()
        deadline = t0 + batch_s
        while time.perf_counter() < deadline:
            one_iteration(A, B)
            ops += 1
        elapsed = time.perf_counter() - t0
        samples.append(ops / elapsed)
    return samples

def main():
    if os.geteuid() != 0:
        sys.exit("Run as root for governor + cache control.")
    set_perf_governor()
    pin_to_cpu(2)
    drop_caches()
    A, B = make_workload()
    samples = measure(A, B)
    samples.sort()
    n = len(samples)
    p50 = statistics.median(samples)
    iqr = samples[3 * n // 4] - samples[n // 4]   # naive IQR for n=15
    p_min, p_max = samples[0], samples[-1]
    report = {
        "kernel": Path("/proc/version").read_text().split()[2],
        "governor": Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
                    .read_text().strip(),
        "pinned_cpu": 2,
        "n_batches": n, "batch_s": 2.0, "warmup_s": 3.0,
        "throughput_median_ops": round(p50, 1),
        "throughput_iqr_ops": round(iqr, 1),
        "throughput_min_ops": round(p_min, 1),
        "throughput_max_ops": round(p_max, 1),
    }
    print(json.dumps(report, indent=2))

if __name__ == "__main__":
    main()

Sample run on a c6i.4xlarge (Ice Lake, 16 vCPU, 32 GB):

$ sudo python3 bench_harness.py
{
  "kernel": "6.1.0-26-cloud-amd64",
  "governor": "performance",
  "pinned_cpu": 2,
  "n_batches": 15,
  "batch_s": 2.0,
  "warmup_s": 3.0,
  "throughput_median_ops": 1842.4,
  "throughput_iqr_ops": 7.8,
  "throughput_min_ops": 1828.1,
  "throughput_max_ops": 1849.6
}

What to read here: the IQR of 7.8 on a median of 1842.4 is a within-session spread of roughly 0.4%, and the min-to-max range is roughly 1.2%, which says the governor, pinning, and warmup controls held for this session. A later session that lands outside this range is a genuine between-session shift, not noise; the JSON fields exist so that two sessions' conditions can be compared field by field.

What this harness does not do, and what later chapters in this Part cover: it does not handle multi-machine variance (chapter 24 — confidence intervals across boots), it does not apply coordinated-omission correction to latency histograms (chapter 27 — open-loop load testing), and it does not control for SMT/hyperthreading effects (chapter 25 — when SMT helps and when it lies). For a single-machine, single-session microbenchmark of a CPU-bound workload, the harness above is the floor.

The Hotstar IPL benchmark that ran fine in staging and exploded in prod

The harness above protects against well-known noise sources. The deeper failure mode is the benchmark that controls for everything measurable and still misses the dominant production effect because the workload shape is wrong. Hotstar's transcoder team hit this in 2024 and the case study is now used internally as a methodology training example.

The team had built a new Rust transcoder for the IPL final, expecting 25M concurrent viewers. The benchmark suite ran on a 96-core EPYC workstation under a 10× safety-factor synthetic load, with the seven knobs all properly controlled. Median throughput: 14,200 transcode-tasks/sec, p99 latency 12 ms — well within the SLO of 50 ms. They shipped it. On match day at peak load, the same code on the same instance class showed median 9,800 tasks/sec and p99 of 240 ms — a 4.4× SLO violation that came within ten minutes of triggering an automatic rollback.

The methodology had been correct. The workload had been wrong. The synthetic load fed the transcoder uniformly-distributed input streams; production fed it a Pareto-distributed mix where 0.3% of streams were 4K HDR (8× the work) and the long tail was 480p (1× the work). The mean work per stream was the same, so median throughput degraded only moderately. The p99 latency, though, was set by the 4K streams stacking up in the queue while the 480p streams moved through quickly — a head-of-line-blocking effect that the uniform synthetic load could not produce. The benchmark had measured a correctly-controlled experiment on the wrong input distribution.

The fix had two parts. First, capture the production input distribution (a one-day sample of stream-metadata from the previous month's IPL match) and replay it as a JSON-driven workload generator. Second, run the benchmark at multiple offered-load levels (50%, 70%, 85%, 95% of capacity) and report the whole throughput-vs-latency curve, not just the median at one load point. This exposed the head-of-line-blocking knee at 70% load, which had been invisible at the 30% load the original benchmark had used. The transcoder was rewritten with separate priority queues per resolution; the next IPL's p99 stayed at 18 ms across the full match.
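The distribution mismatch is easy to reproduce in miniature: a mix whose mean matches the uniform load but whose tail does not. A sketch — the 0.3% / 8× figures come from the incident description above; the generator itself is illustrative, not Hotstar's:

```python
import random

def uniform_mix(n, mean_work=1.021):
    """Synthetic load: every task costs the same (mean-matched below)."""
    return [mean_work] * n

def production_mix(n, heavy_frac=0.003, heavy_work=8.0, light_work=1.0):
    """0.3% of streams are 4K HDR (8x the work); the rest are 480p."""
    return [heavy_work if random.random() < heavy_frac else light_work
            for _ in range(n)]

if __name__ == "__main__":
    random.seed(1)
    u, p = uniform_mix(200_000), production_mix(200_000)
    # Means agree, so mean-throughput benchmarks agree...
    print(f"mean work: uniform {sum(u)/len(u):.3f}, "
          f"pareto-ish {sum(p)/len(p):.3f}")
    # ...but the tails do not: the heavy tasks drive queueing behaviour.
    print(f"max work:  uniform {max(u):.1f}, pareto-ish {max(p):.1f}")
```

The mean-matched uniform mix is exactly the trap: any statistic that averages over tasks will agree between the two loads, while any statistic set by queueing behind heavy tasks will not.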

[Figure: Throughput-vs-latency curves for two workload distributions. Horizontal axis: offered load, 0 to 100% of capacity; vertical axis: p99 latency in milliseconds, log scale. The "uniform synthetic" curve stays flat near 12 ms until 95% load, then rises sharply. The "production Pareto" curve starts at 12 ms but rises steeply from 70% load, hitting 240 ms at 95%. A horizontal line marks the 50 ms SLO; the single-load benchmark point at 30% is marked where the two curves overlap. Caption: same throughput at 30% load; very different at 95%.]
The single-load benchmark sat at 30% offered load, where both distributions agree. The production load at IPL kickoff was 88%, where the two curves are 5× apart. Illustrative — based on the Hotstar IPL 2024 transcoder post-mortem numbers.

The lesson generalises beyond Hotstar. A single-load benchmark is a single point on a curve; the curve itself is what predicts production. Running the benchmark at 30%, 50%, 70%, 85%, and 95% offered load takes 5× the time of one run, but it is the only way to see the knee where latency leaves the SLO. Most published benchmark numbers are single points without context — "1.42M ops/sec on parser A" — and produce no actionable signal about the curve.
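The whole-curve discipline can be sketched with a single-server FIFO simulation: feed the queue an open-loop arrival schedule at each offered-load level and read off p99 sojourn time. This is an illustrative toy, not a model of the Hotstar transcoder:

```python
import random

def fifo_p99_latency(work, offered_load):
    """Open-loop arrivals at fixed spacing into one FIFO server.
    Returns p99 sojourn time (queue wait + service) in work units."""
    mean_work = sum(work) / len(work)
    spacing = mean_work / offered_load        # arrival period at this load
    finish, lat = 0.0, []
    for i, service in enumerate(work):
        arrive = i * spacing
        start = max(arrive, finish)           # queue behind earlier tasks
        finish = start + service
        lat.append(finish - arrive)
    lat.sort()
    return lat[int(len(lat) * 0.99)]

if __name__ == "__main__":
    random.seed(3)
    # Heavy-tailed mix (0.3% tasks at 8x work) vs a uniform mix with
    # the exact same mean work per task.
    heavy = [8.0 if random.random() < 0.003 else 1.0 for _ in range(100_000)]
    mean = sum(heavy) / len(heavy)
    flat = [mean] * len(heavy)
    for load in (0.3, 0.5, 0.7, 0.85, 0.95):
        print(f"load {load:.0%}: "
              f"uniform p99 {fifo_p99_latency(flat, load):6.1f}  "
              f"heavy-tail p99 {fifo_p99_latency(heavy, load):6.1f}")
```

At 30% load the two mixes report nearly identical p99; from 70% upward the heavy-tail curve pulls away, reproducing in miniature the knee the single-load benchmark could not see.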

What an honest benchmark report looks like

If a benchmark is an experiment and a number is a half-claim, then an honest benchmark report has the same structure as a small scientific report: hypothesis, method, controls, raw data, statistic, conclusion, and reproduction recipe. The shape that has converged across teams running serious latency-critical work — Razorpay's matcher team, Zerodha's risk-engine team, the Hotstar transcoder team after the 2024 incident — is roughly:

TITLE: Compare parser A vs parser B on Flipkart catalogue payloads.
HYPOTHESIS: parser A is faster than parser B on payloads ≥ 2 KB.
HARDWARE: c6i.4xlarge, 16 vCPU Ice Lake, 32 GB DDR4-3200.
KERNEL: 6.1.0-26-cloud-amd64.
GOVERNOR: performance, no_turbo=1.
PIN: cpu 2, taskset.
WORKLOAD: 1,000,000 catalogue payloads sampled from prod log,
          size dist: p50=1.4KB, p99=12KB, p99.9=84KB.
WARMUP: 30 s discarded.
MEASUREMENT: 15 batches of 60 s each, fresh JVM per batch.
STATISTIC: median throughput across 15 batches, IQR reported.
RESULT:    parser A: median 0.91M ops/s, IQR 0.04M
           parser B: median 1.38M ops/s, IQR 0.06M
           parser B faster by 1.52× (95% CI: 1.46×–1.58×).
HYPOTHESIS: REJECTED — parser B is faster on this distribution.
REPRODUCTION: scripts/bench/parsers.sh, commit 4a2c91e, expected runtime 16 min.

This is more verbose than a one-line throughput claim, but it is the only form that survives review. A reader can check whether the workload distribution matches their own production, whether the hardware is comparable, whether the governor was set; they can reproduce the number, and they can disagree with the methodology in specific places. A one-line claim — "parser A is 1.7× faster" — is unfalsifiable, which is the same as saying it carries no information.

The discipline pays off twice. The first time, when the benchmark report catches a misread (Karan's Friday number was a Friday-evening operating-point reading, not a steady-state number, and a properly written report would have flagged the missing controls before the Slack post). The second time, six months later, when a new engineer needs to know whether to upgrade parser B and can read the report and see exactly what was tested and what was not. Undocumented benchmark numbers rot; documented ones compound.

Common confusions

Going deeper

Coordinated omission as a methodology bug, not a tool bug

Most benchmark tools (wrk, ab, siege) measure latency by sending one request, waiting for the response, then sending the next. When the server is slow, the client correspondingly slows down — and the tool only records the latency of requests it actually sent. The requests it would have sent during the slow period are missing from the histogram entirely. This is coordinated omission: the client's measurement schedule is coordinated with the server's response schedule, and the tail of the latency distribution gets silently truncated. The result is a benchmark that reports a p99 of 12 ms when the real p99 was 240 ms, because the 240 ms outliers prevented the client from sending its scheduled load. The fix is not "use a faster client"; it is to use an open-loop tool that sends requests on a fixed schedule regardless of response time — wrk2 (with -R <rate> for constant rate), vegeta, k6. This is so important to latency benchmarking that chapter 27 is dedicated to it; it earns a mention here because it is the canonical example of how the methodology of the tool, not the code under test, dictates the result.
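The effect is reproducible in pure simulation: the same server trace measured closed-loop and open-loop yields two different p99s. A sketch in which the 12 ms / 240 ms figures echo the numbers above and everything else is illustrative:

```python
def closed_loop_p99(service_ms):
    """Client sends the next request only after the previous response:
    each sample is just the service time, so the stall hides itself."""
    lat = sorted(service_ms)
    return lat[int(len(lat) * 0.99)]

def open_loop_p99(service_ms, period_ms):
    """Client sends on a fixed schedule; requests issued during a stall
    queue up, and their queueing delay lands in the histogram."""
    finish, lat = 0.0, []
    for i, s in enumerate(service_ms):
        intended = i * period_ms          # when the request SHOULD go out
        start = max(intended, finish)     # it queues behind the stall
        finish = start + s
        lat.append(finish - intended)
    lat.sort()
    return lat[int(len(lat) * 0.99)]

if __name__ == "__main__":
    # 10,000 requests at 12 ms each, with one 240 ms stall per 1,000.
    trace = [240.0 if i % 1000 == 0 else 12.0 for i in range(10_000)]
    print(f"closed-loop p99: {closed_loop_p99(trace):6.1f} ms")
    print(f"open-loop  p99: {open_loop_p99(trace, period_ms=13.0):6.1f} ms")
```

Only 0.1% of requests stall, so the closed-loop p99 never sees them; the open-loop schedule charges every request delayed behind a stall for its full intended-to-response time, which is what a real user would experience.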

Confidence intervals: the bootstrap is your friend

A benchmark that reports "median 1.38M ops/s, IQR 0.06M" is honest about within-session noise but does not tell you whether 1.38M is statistically distinguishable from a different benchmark's 1.42M. The bootstrap is the cleanest way to answer this. Take your N batch samples; resample with replacement to produce 10,000 synthetic samples of size N each; compute the median of each synthetic sample; the 2.5th and 97.5th percentiles of those 10,000 medians are your 95% confidence interval on the median. Compute this for both benchmarks; if the intervals overlap, the difference is not statistically meaningful; if they do not, it is. Python's numpy.random.choice(..., replace=True) does this in three lines; the arch.bootstrap library does it more carefully for non-stationary samples. The discipline of computing and reporting CI ranges separates engineers who know whether an optimisation worked from those who hope it did.
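The three-line numpy version, plus the overlap check, as a sketch (the two sample sets are synthetic stand-ins for two benchmarks' batch medians):

```python
import numpy as np

def bootstrap_median_ci(samples, n_boot=10_000, seed=0):
    """95% percentile-bootstrap confidence interval on the median."""
    rng = np.random.default_rng(seed)
    # Resample with replacement: n_boot synthetic samples of size N each.
    resampled = rng.choice(samples, size=(n_boot, len(samples)), replace=True)
    medians = np.median(resampled, axis=1)
    return float(np.percentile(medians, 2.5)), float(np.percentile(medians, 97.5))

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    a = rng.normal(1.38, 0.04, size=15)   # one benchmark's batch medians, M ops/s
    b = rng.normal(1.42, 0.04, size=15)   # the rival benchmark's batches
    ci_a, ci_b = bootstrap_median_ci(a), bootstrap_median_ci(b)
    overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
    print(f"A: [{ci_a[0]:.3f}, {ci_a[1]:.3f}]  B: [{ci_b[0]:.3f}, {ci_b[1]:.3f}]")
    print("difference is", "not distinguishable" if overlap else "real")
```

With only 15 batches per benchmark and a 0.04 IQR-scale spread, 1.38 vs 1.42 typically comes out as overlapping intervals — exactly the case where reporting a winner would be dishonest.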

What perf stat -r N actually does (and does not)

perf stat -r 10 -- ./prog runs prog ten times, reports the mean and standard deviation of each counter across the runs, and is the closest the standard Linux toolkit comes to a methodology-aware harness. It handles the multi-run aspect cleanly. What it does not handle: warmup discarding (every run starts cold; if the program does not self-warmup, the first iteration's cold-cache penalty is in the average), governor (still the OS's responsibility), CPU pinning (use taskset -c in front of the inner command), or distribution of work (each run is one sample, not a within-run distribution). For Python, calling perf stat -r 10 -- python3 bench.py and parsing the stderr output via regex from a Python driver is a clean pattern that combines perf's counter access with Python's flexibility for everything else. The stderr format is stable enough to parse; the counter values you can trust because they come from PMU events.
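A sketch of that driver pattern. The regex targets the common shape of perf stat -r lines (count, event name, a "( +- x% )" spread), which varies slightly across perf versions; the sample text below is illustrative, not captured output:

```python
import re

# Illustrative shape of `perf stat -r N` stderr — not a real capture.
SAMPLE_STDERR = """
 Performance counter stats for 'python3 bench.py' (10 runs):

     9,812,345,678      cycles                    ( +-  0.42% )
    21,345,678,901      instructions              ( +-  0.18% )
        12,345,678      cache-misses              ( +-  3.10% )
"""

LINE_RE = re.compile(
    r"^\s*([\d,]+)\s+([\w-]+)\s+\(\s*\+-\s+([\d.]+)%\s*\)", re.MULTILINE)

def parse_perf_stat(stderr: str) -> dict:
    """Map event name -> (count, relative stddev %) from perf stat -r output."""
    return {name: (int(count.replace(",", "")), float(pct))
            for count, name, pct in LINE_RE.findall(stderr)}

if __name__ == "__main__":
    counters = parse_perf_stat(SAMPLE_STDERR)
    cycles, _ = counters["cycles"]
    instr, _ = counters["instructions"]
    print(f"IPC ~ {instr / cycles:.2f}")
```

In a real driver the stderr comes from subprocess.run(["perf", "stat", "-r", "10", ...], stderr=subprocess.PIPE); pin and set the governor outside the command, since perf does neither.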

Reproduce this on your laptop

# Linux box recommended; macOS and WSL2 will not have proper governor / PMU access.
sudo apt install python3-pip linux-tools-common linux-tools-generic numactl
python3 -m venv .venv && source .venv/bin/activate
pip install numpy

# Set governor + disable turbo for a clean run:
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || true

# Run the harness:
sudo python3 bench_harness.py

# Compare with `perf stat`-augmented run for cycle/IPC-level confirmation:
sudo perf stat -r 5 -e cycles,instructions,cache-misses \
  -- python3 bench_harness.py

Where this leads next

This chapter is the prologue to Part 4. The remaining chapters in this Part each take one of the methodology levers and develop it into a tool you can wield in production:

The pattern across the Part: every chapter teaches one control, one randomisation, one statistic, or one tool — and the assembly of all of them is what an honest benchmark looks like. The reader who finishes Part 4 will recognise, on sight, which benchmark blog posts are doing their homework and which are publishing operating-point readings as if they were measurements.

The deeper habit, carrying forward through Parts 5 (profiling), 7 (latency), 8 (queueing), and 14 (capacity): the conditions are the claim. Karan's incident at Flipkart ended with a one-line entry in the team's engineering handbook that has since been quoted across half a dozen Indian backend teams: "If you cannot list the seven knobs you held constant, you have not benchmarked — you have measured one Friday afternoon." The line is not exciting. It is the discipline that turns the next 16 chapters of this curriculum into outcomes you can ship.

References

  1. Aleksey Shipilëv, "Nanotrusting the Nanotime" (2014) — the canonical reference on JVM microbenchmarking pitfalls; everything in here generalises beyond the JVM.
  2. Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 12 — Benchmarking — the production-grade methodology chapter; every benchmark mistake in this article is catalogued there.
  3. Gil Tene, "How NOT to Measure Latency" (Strange Loop, 2015) — the talk that named "coordinated omission"; mandatory viewing for anyone benchmarking latency.
  4. Curtsinger and Berger, "STABILIZER: Statistically Sound Performance Evaluation" (ASPLOS 2013) — the paper that demonstrated layout-induced performance noise can shift benchmark results by 40% with no code change.
  5. Mytkowicz, Diwan, Hauswirth, Sweeney, "Producing Wrong Data Without Doing Anything Obviously Wrong!" (ASPLOS 2009) — the foundational paper on environment-induced benchmark bias; section on UNIX environment-variable size affecting C-program runtime is unforgettable.
  6. JMH (Java Microbenchmark Harness) documentation — the gold-standard methodology-aware harness; reading the @Warmup, @Measurement, @Fork annotations is itself an education.
  7. Criterion.rs documentation — the Rust equivalent; explicit treatment of warmup, sample size, and outlier detection.
  8. /wiki/wall-measuring-is-harder-than-optimizing — the previous chapter; the wall this chapter's Part is built to climb.