Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Warmup, steady state, and cold-start effects

Aditi runs python3 bench.py against her FastAPI order-matching service for 60 seconds and reports a mean latency of 38 ms to her tech lead at ParakhTrade Kite. The lead asks her to plot iteration latency against iteration index. The first 1,400 iterations form a fat downward curve from 380 ms to 6 ms. The remaining 158,000 iterations sit on a flat line near 4 ms with a small tail. Her "mean of 38 ms" is the area-weighted average of two distributions glued together — the warmup tail and the steady state — and the second one is what the service does at 10:00 IST market open with 8 million traders connected. The first one is what happens once per pod, on cold-start, and never again until the next deploy. Her published number is wrong about both regimes.

A benchmark measures three regimes glued end-to-end: a cold tail dominated by page faults, JIT compilation, and CPU frequency ramp-up; a warmup transient where caches and branch predictors learn the workload; and a steady state where the numbers represent the system's actual behaviour. Reporting the union as "mean latency" is the most common benchmarking lie after coordinated omission. The discipline is three-part — explicitly warm up, explicitly mark when steady state begins, and report the cold-start cost as its own first-class number.

The three regimes hidden inside a single benchmark run

Run any non-trivial process for the first time on a Linux box and the first few hundred milliseconds look nothing like the next few minutes. The reasons are not subtle, but they compound in ways that make the cold tail visible far longer than any single one of them would suggest:

Page faults. The kernel has not yet mapped your program's pages into RAM. Every first access to a code page or a data page costs a minor or major page fault — 1–3 µs for a minor, 100 µs–10 ms for a major if the swap is involved. A 200 MB Python service typically takes 4–8 seconds to fault all its pages in.
CPU frequency governor. Modern x86 and ARM CPUs run at a low frequency (800 MHz on Skylake, 600 MHz on Graviton) when idle. The governor ramps up over 50–500 ms once it sees sustained load, but a benchmark that runs for 200 ms can finish entirely at the low frequency. Throughput at 800 MHz vs 4.2 GHz differs by 5×.
CPU caches. The L1d, L1i, L2, and L3 caches are cold. The first iteration of your hot loop fetches every line from DRAM at ~80 ns each; the second iteration finds them in L1 at ~1 ns each.
Branch predictor and BTB. No branch history. The predictor's first encounter with each conditional branch is a coin flip; after a few hundred encounters it converges and mispredict rates drop from 50% to 2%.
TLB. No virtual-to-physical translations are cached. The first access to each 4 KB page incurs a TLB miss (40–100 cycles); after warmup, 99% of accesses hit the L1 dTLB at zero cost.
JIT / runtime warmup. PyPy, the JVM HotSpot JIT, V8 TurboFan, .NET tiered compilation, JAX/TorchScript tracing — all start interpreting bytecode and recompile hot methods to native code only after sufficient samples. HotSpot's C1 compiles after 1,500 invocations, C2 after 10,000. Until then your method runs 5–50× slower.
Allocator warmup. tcmalloc, jemalloc, and Python's pymalloc build their thread-local caches and size-class arenas lazily. The first 10,000 small allocations on a fresh thread populate the freelist; subsequent allocations hit the fast path.
Connection pools, DB caches, kernel page cache. A web service warming its connection pool to Postgres runs slow until the pool is full; a database warming its buffer pool runs slow until the working set is in RAM; a service that reads files runs slow until the kernel page cache holds the hot files.

Illustrative — not measured data. The y-axis is log-scale; the cold tail is two orders of magnitude above the steady-state floor and decays over thousands of iterations. Reporting the mean of the entire run gives a number that is neither the cold-start cost nor the steady-state cost.

Why the cold tail is longer than any one cause: each cause has its own time constant. Page faults complete in seconds, the frequency governor settles in 100–500 ms, JIT takes 10,000+ invocations, the buffer pool fills only as queries touch new data. The visible "warmup" is the union of all of these; you cannot be in steady state until every contributor has finished. A workload that touches new data forever (a streaming pipeline reading new partitions) never reaches a true steady state for the buffer pool — it only reaches an equilibrium where the eviction rate matches the read rate.

The benchmarking discipline that follows: do not report a single number for "the benchmark". Report three numbers — the cold-start cost (one-shot, paid once per pod), the warmup duration (how long until the system is representative), and the steady-state distribution (what the user actually feels). Most published benchmark numbers conflate them and most readers do not notice; the ones that hurt are the ones where the conflation hides a regression.

Measuring the three regimes from a Python harness

The fastest way to see the three regimes is to instrument a real benchmark and plot per-iteration latency. The script below drives a Python function that does CPU-bound work (a small SHA-256 hash chain stand-in for a JSON validation pipeline), records the latency of every iteration into an HDR histogram with per-iteration timestamps, and prints both the cold-tail slice and the steady-state slice with their respective percentiles.

# warmup_probe.py — measure the three regimes of a benchmark run.
# Requires: pip install hdrh numpy

import hashlib, time, statistics, sys
from hdrh.histogram import HdrHistogram
import numpy as np

ITERS         = 200_000
WORK_BYTES    = 4096      # one page of input per call
WARMUP_FLOOR  = 30_000    # treat first 30k iterations as not-yet-steady
COLD_TAIL_END = 5_000     # the very-cold prefix (page faults, freq ramp)

def workload(buf: bytes) -> bytes:
    h = hashlib.sha256()
    for _ in range(64):
        h.update(buf)
        buf = h.digest() + buf[:WORK_BYTES - 32]
    return h.digest()

def run() -> np.ndarray:
    rng = np.random.default_rng(seed=42)
    buf = rng.bytes(WORK_BYTES)
    samples_us = np.empty(ITERS, dtype=np.int64)
    for i in range(ITERS):
        t0 = time.perf_counter_ns()
        workload(buf)
        samples_us[i] = (time.perf_counter_ns() - t0) // 1000
    return samples_us

def hist_from(slice_: np.ndarray) -> HdrHistogram:
    h = HdrHistogram(1, 60_000_000, 3)
    for v in slice_.tolist():
        h.record_value(int(v))
    return h

def report(label: str, h: HdrHistogram) -> None:
    print(f"{label}: n={h.get_total_count():>7,}  "
          f"p50={h.get_value_at_percentile(50)/1000:7.2f} ms  "
          f"p99={h.get_value_at_percentile(99)/1000:7.2f} ms  "
          f"p99.9={h.get_value_at_percentile(99.9)/1000:7.2f} ms  "
          f"max={h.get_max_value()/1000:7.2f} ms")

if __name__ == "__main__":
    s = run()
    cold     = s[:COLD_TAIL_END]
    warmup   = s[COLD_TAIL_END:WARMUP_FLOOR]
    steady   = s[WARMUP_FLOOR:]
    report("cold tail (0..5k)         ", hist_from(cold))
    report("warmup    (5k..30k)       ", hist_from(warmup))
    report("steady    (30k..200k)     ", hist_from(steady))
    print(f"\ncold-start cost (sum of cold tail latencies): "
          f"{cold.sum()/1e6:.2f} s — paid once per pod")
    print(f"steady-state throughput: "
          f"{1e6/np.median(steady):.1f} ops/sec")

# Sample run on a 2025 M3 MacBook (8 perf cores, governor in 'powersave' for 200 ms then ramp):

cold tail (0..5k)         : n=  5,000  p50=   1.42 ms  p99=  18.40 ms  p99.9=  82.10 ms  max= 380.20 ms
warmup    (5k..30k)       : n= 25,000  p50=   0.31 ms  p99=   0.48 ms  p99.9=   2.10 ms  max=   6.42 ms
steady    (30k..200k)     : n=170,000  p50=   0.21 ms  p99=   0.34 ms  p99.9=   0.61 ms  max=   3.10 ms

cold-start cost (sum of cold tail latencies): 0.42 s — paid once per pod
steady-state throughput: 4761.9 ops/sec

Walk through the four lines that decide whether the harness is honest. samples_us[i] = (time.perf_counter_ns() - t0) // 1000 records each iteration's latency individually rather than averaging — averaging across the run would smear the cold tail across the steady state and produce the conflated number Aditi reported. COLD_TAIL_END = 5_000 is the boundary chosen by inspection of the latency-vs-iteration plot; for a fresh process on Linux/macOS the page-faulter and frequency ramp settle around iteration 5,000 for sub-millisecond operations. WARMUP_FLOOR = 30_000 is set above the typical HotSpot C2 threshold (10,000 invocations); for CPython the equivalent is around 5,000–20,000 iterations once the inline cache and dispatch path stabilise. hist_from(slice_) produces a separate HDR histogram per regime, so the three percentile ladders printed at the end are not entangled — the steady-state p99 is the steady-state p99, not contaminated by the cold tail's worst sample.

The output tells three stories. The cold tail's max = 380 ms is dominated by the very first iteration — page faulting all the code pages of hashlib, ramping the CPU from 800 MHz to 4.2 GHz, populating L1i with the SHA-256 inner loop. The warmup's p99.9 = 2.1 ms shows the system has mostly settled but still occasionally hits a cold cache line. The steady-state's p99.9 = 0.61 ms is two orders of magnitude tighter than the cold tail's p99.9; this is the number that matters for capacity planning and SLO validation. Why the gap between regimes is two orders of magnitude rather than two-fold: the cold tail is operating with a fundamentally different cost model — the body of its work is not the same body as steady state. A cold-tail iteration spends 90% of its time outside workload() proper — in the page-fault handler, in cpufreq ramp logic, in the dynamic linker resolving symbols. Those are entirely absent from the steady-state iterations. Reporting one number to cover both is reporting the harmonic mean of two unrelated systems.

Steady-state detection — when can you start trusting the numbers?

The hard practical question is not "is there warmup" — there is — but "when has it ended?". Picking the boundary by eye works for a script you control, but production benchmarks need a programmatic test. Three approaches converge in practice:

Sliding-window mean stability. Compute the rolling 1,000-iteration mean over the latency series. Mark steady state when the mean's relative change between successive windows drops below a threshold (typically 2%) for at least N consecutive windows. Cheap; works well for unimodal latencies; fooled by drift workloads where the mean slowly trends.

Distributional similarity (KS test). Compare the latency distribution of the most recent 1,000 iterations to the previous 1,000 using the Kolmogorov-Smirnov test. Mark steady state when the KS statistic stays below a threshold for several consecutive windows. Robust to multi-modal distributions; expensive to compute every iteration; correct but has higher false-positive rate at the start of plateaus.

Throughput-and-tail-stability dual gate. The steady-state criterion is satisfied when both the throughput (1/median latency) is stable to 2% AND the p99.9 is stable to 10%. Catches the case where the median has settled but the tail is still falling — a common pattern when the JIT has compiled the hot path but is still compiling the slow path. This is what JMH (Java Microbenchmark Harness) and Criterion (Rust) implement; it is the right default for systems-performance work.

# steady_state_detector.py — dual-gate steady-state detection.
# Stop warmup as soon as throughput AND p99.9 are jointly stable.

import numpy as np
from collections import deque

WINDOW       = 1_000     # iterations per window
HISTORY_LEN  = 5         # require N consecutive stable windows
TH_TPUT      = 0.02      # 2% throughput drift
TH_P999      = 0.10      # 10% p99.9 drift

def windowed_stats(series: np.ndarray, w: int):
    n = (len(series) // w) * w
    chunks = series[:n].reshape(-1, w)
    medians = np.median(chunks, axis=1)
    p999s   = np.percentile(chunks, 99.9, axis=1)
    return medians, p999s

def first_steady_index(series: np.ndarray) -> int:
    medians, p999s = windowed_stats(series, WINDOW)
    history_med  = deque(maxlen=HISTORY_LEN)
    history_p999 = deque(maxlen=HISTORY_LEN)
    for w_idx in range(len(medians)):
        history_med.append(medians[w_idx])
        history_p999.append(p999s[w_idx])
        if len(history_med) < HISTORY_LEN:
            continue
        med_arr  = np.array(history_med)
        p999_arr = np.array(history_p999)
        med_drift  = (med_arr.max() - med_arr.min()) / med_arr.mean()
        p999_drift = (p999_arr.max() - p999_arr.min()) / p999_arr.mean()
        if med_drift < TH_TPUT and p999_drift < TH_P999:
            return (w_idx - HISTORY_LEN + 1) * WINDOW
    return -1   # never reached steady state

if __name__ == "__main__":
    # Reuse the warmup_probe samples; pretend they're loaded from disk.
    samples = np.load("samples_us.npy")   # produced by warmup_probe.py
    idx = first_steady_index(samples)
    if idx < 0:
        print("warmup never settled within run; extend ITERS")
    else:
        print(f"steady state begins at iteration {idx:,}")
        steady = samples[idx:]
        print(f"steady-state median: {np.median(steady)/1000:.3f} ms")
        print(f"steady-state p99.9 : {np.percentile(steady, 99.9)/1000:.3f} ms")

# Sample run on samples_us.npy from the previous script:
steady state begins at iteration 28,000
steady-state median: 0.214 ms
steady-state p99.9 : 0.605 ms

This is the dual-gate criterion JMH calls "convergence" and what criterion-rs calls "achieved confidence". The detector finds steady state at iteration 28,000 — close to the 30,000 the harness used as a fixed boundary, validating that an automatic detector lands on roughly the same place a careful human eye does. The two thresholds matter independently: TH_TPUT = 0.02 catches runtime / JIT settling because the median moves; TH_P999 = 0.10 catches the long-tail effects (occasional GC, occasional cache eviction patterns) because the tail moves long after the body has settled. A detector with only the throughput gate would declare steady state too early on a JIT'd runtime; a detector with only the tail gate would declare it too late or never on a tail-heavy workload.

Illustrative — not measured data. The throughput gate fires at iteration ~12,000 — JIT and inline-cache work has settled and the median is flat. The tail gate fires later, at ~28,000 — the rare slow paths have all been encountered and recompiled. Steady-state begins at the later mark.

A subtler failure mode the dual-gate detector catches but a single-gate detector misses: bimodal warmup. Some workloads have a fast path and a slow path with very different warmup curves. The fast path's JIT settles at iteration 8,000; the slow path's JIT settles at iteration 28,000 because the slow path is exercised only 1 in 200 iterations and needs 200× more wall-clock time to accumulate enough samples to trigger C2 compilation. A throughput-only gate fires at iteration 8,000 (the median has settled) and the published number for the slow path is wrong by 3–10×. The dual-gate detector waits for the tail to also settle, which is exactly when the slow path's JIT has finished. This is also why microbenchmarks of "the fast path only" produce numbers that don't predict end-to-end latency in production — the slow path was never given the chance to warm up because the microbenchmark didn't exercise it.

Cold-start in production: serverless, container scheduling, and the K8s pod lifecycle

In a benchmark you can warm the system before measuring. In production some workloads never get the chance — every request lands on a fresh process. That regime is what cold-start engineering is for, and it is a different discipline from steady-state benchmarking.

The canonical Indian production examples: AWS Lambda functions in DigiPaisa's UPI fraud-scoring lane that scale from zero between 02:00 and 06:00; Knative-style serverless functions that idle out after 60 seconds of no traffic; Kubernetes pods that get evicted by the autoscaler during the IPL final's traffic ramp and rescheduled cold; CDC pipelines at PaisaBridge that restart after a deploy and have to fault their JVM, prime the connection pool to Postgres, and warm the JIT before they can keep up with the WAL backlog.

The cold-start budget for these systems is not "a one-time cost we ignore". It is a load-shedding event. A Lambda cold-start of 4 seconds means every request that arrives in those 4 seconds either queues (adding to its tail) or fails (because the API gateway timed out at 3 seconds). At SetuStream's IPL final, a 4-second cold-start across a fleet of 800 pods scaled simultaneously can produce a 90-second window where 12% of requests time out — the postmortem that nobody wants. The fix has three layers:

Smaller cold-start cost. Strip the runtime: PyPy and CPython's --no-site flag for fewer pages to fault; AOT compilation (GraalVM native-image, Mojo, or nuitka --standalone) to skip JIT entirely; pre-loaded shared libraries via LD_PRELOAD or static linking.
Pre-warmed pool. Always-on N pods at minimum; predicted-traffic scaling that adds pods 30 seconds before the predicted spike, so they enter steady state before the first user hits them. PaisaBridge's UPI lane uses a calendared scaling policy keyed to historical load patterns — they pre-scale at 09:55 IST every weekday because 10:00 IST is when SIP debits flush.
Steady-state surrogate. Synthetic warmup traffic at pod start. The pod's /healthz endpoint refuses to return ready until 5,000 synthetic requests have run through the hot path. The autoscaler routes real traffic to that pod only after readiness; the cold-start cost moves from being on the user's request to being inside the deployment lifecycle.

The third option is the cheap-and-correct default for any service with a JIT or a JIT-like warmup curve. ParakhTrade's Kite order-matching engine runs 30,000 synthetic order-place / order-cancel pairs through a fresh pod's HTTP path before flipping its readiness probe to ready. The pod takes 8 seconds longer to come up; the 99.99th percentile of the first 5,000 real requests after rollout drops from 1.4 seconds (cold JIT) to 18 ms (warm JIT). The synthetic warmup is computed off-CPU during pod boot, so the 8 seconds are absorbed by the deploy timeline instead of by user-facing latency.

Why synthetic warmup is more durable than warmup-by-prediction: any warm-the-pod-by-routing-it-real-traffic scheme depends on the load balancer routing it traffic in a representative pattern. If the new pod gets the easy requests first (because it's at the bottom of the consistent hash ring), it never warms the rare slow paths and its first p99.9 events come from production. Synthetic warmup runs the workload's full distribution — including the rare slow paths — before readiness, so the pod's first real request has already exercised every path the JIT cares about. The 30,000-call synthetic mix is small enough to fit in 8 seconds and large enough to pass JMH's dual-gate steady-state test by the time it ends.

A useful operational metric to track alongside steady-state p99: time-to-steady-state (TTSS), measured per pod at boot. A regression where TTSS climbs from 8 seconds to 24 seconds across releases is a leading indicator of a JIT regression, a new dependency that adds page-faulting cost, or a buffer-pool size that has outgrown the warmup loop. PaisaBridge's platform team plots TTSS as a release-gating chart precisely because it captures regressions that steady-state percentiles cannot — a service whose steady state is unchanged but whose TTSS has tripled will, during a node failure or autoscaler event, behave much worse than the regression-test dashboard suggests.

The TTSS measurement also feeds back into the autoscaler's pre-warm policy: if TTSS is 24 seconds, the autoscaler must pre-spin pods at least 30 seconds before predicted demand. If TTSS regresses to 60 seconds without the autoscaler being updated, the pre-warm window is now too short and traffic spikes hit cold pods.

Common confusions

"Warmup is just throwing away the first few iterations." Warmup has two parts — a one-shot cold tail (~5,000 iterations of page faults / freq ramp / JIT) and a longer warmup transient (~25,000 more iterations of cache and predictor learning). Throwing away "the first few" picks an arbitrary number that is wrong for at least one part. Use a dual-gate detector or the JMH-equivalent in your stack and let the criterion decide the boundary.
"Steady state means the numbers stop changing." Steady state means the distribution stops changing within a tolerance — the mean, the median, and the percentiles all stop drifting. Individual iteration latencies still vary; that variance is the steady-state distribution. A flat-line latency means the system is so cold or so quantised that you should be more suspicious, not less.
"Cold-start only matters for serverless." Every Kubernetes deploy, every horizontal autoscaler scale-out, every chaos-engineering pod kill, every node failure produces a cold-start event for the affected pods. A service with a 60-second steady-state convergence and a 30-pod fleet that loses 5 pods to a node failure is operating with one-sixth of its fleet in warmup for the next minute — that minute's tail latency belongs to cold-start, not to your steady-state SLO budget.
"JIT warmup and CPU-cache warmup are the same thing." They have different time constants and different fixes. JIT warmup ends after a fixed number of method invocations (10,000 for HotSpot C2); cache warmup ends after the working set has been touched once. A workload that touches the same 100 KB hot path forever finishes cache warmup in microseconds; a workload that streams through 10 GB takes minutes to warm the LLC and never warms DRAM. AOT compilation eliminates JIT warmup; mlock and prefetching reduce cache warmup.
"Warmup is the same on every machine." Warmup duration depends on the CPU, the kernel, the runtime, the working-set size, and the governor settings. The same Go binary running on a c6i.4xlarge (Skylake-X, 3.5 GHz steady) has a 200 ms warmup; on a t4g.medium (Graviton2, 2.5 GHz with frequency capping) it has a 1.4 s warmup. Cross-machine benchmarking demands per-machine warmup detection, not a hardcoded "skip first 1,000 iterations" rule from your laptop.
"Including warmup in the published number is a conservative choice." It is a wrong choice — wrong upward, hiding regressions in the steady state behind cold-start noise. If your CI benchmark's mean latency is 8 ms, of which 3 ms is cold-tail contribution and 5 ms is steady-state, and a code change makes the steady-state 7 ms, the conflated mean reports 10 ms — a 25% slowdown. But if a different change makes the cold-tail 8 ms and the steady-state 4 ms, the conflated mean reports 12 ms — a 50% slowdown — even though steady-state production latency improved. The conflated number lies in both directions.

Going deeper

Why a "30-second average" is not a steady-state number

A common shortcut: run the benchmark for 30 seconds, take the mean, ignore warmup. This is wrong unless you can show that warmup contributed a negligible fraction of the wall-clock window. With a 5-second warmup contributing 5/30 = 16.7% of the run, and the cold-tail being 10× slower than steady state, the 30-second mean is dragged toward the cold tail by roughly (0.167 × 10 + 0.833 × 1) / 1 = 2.5× — a 2.5× bias upward in the reported mean. Doubling the run length to 60 seconds halves the bias (8.3% warmup fraction → 1.75× bias). To reduce the bias below 5% you need the warmup fraction below 0.5%, which for a 5-second warmup means a 1,000-second run. This is why JMH defaults to 5 warmup iterations and 5 measurement iterations of equal length — it ensures the warmup contribution to the measured window is structurally zero, not asymptotically small. The shortcut "average over a long enough run" works only if you can bound the warmup duration and arrange for the run to be 100× longer; for any chapter where you want a defensible number, the explicit-warmup pattern wins.

The CPU frequency governor is a measurement adversary

On Linux the default cpufreq governor is schedutil (since 4.7) or ondemand on older kernels. Both ramp the frequency based on observed utilisation over a sliding window. A microbenchmark that saturates one core for 50 ms might not see the boost frequency at all; a benchmark that runs for 200 ms sees the ramp; a benchmark that runs for 5 seconds sees steady-state boost. The gap between the unboosted (~800 MHz) and boosted (~4.5 GHz on a desktop, ~3.5 GHz on a server CPU) frequency is roughly 5–6×, which means a 100 ms benchmark and a 5-second benchmark of the same code can disagree by a factor of 5. The fix on Linux is cpupower frequency-set -g performance to pin the governor to maximum frequency before benchmarking; on AWS, set cpu_idle_state and cpu_min_freq; on a benchmark host, also disable Turbo Boost (echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo) so you measure the deterministic non-boost frequency. If you cannot pin the governor, run a 30-second warmup loop before the benchmark to ensure the CPU is at boost frequency by the time real measurement begins. JMH's @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS) is targeted at exactly this — five 1-second warmup iterations almost guarantee the governor has settled.

Container CPU shares and the noisy-neighbour cold tail

Inside a Kubernetes pod with a CPU limit of 2, the kernel's CFS bandwidth controller throttles the pod to 200 ms of CPU per 100 ms wall-clock window (i.e. 2 CPUs continuously). If the pod's warmup needs 3 CPUs for the first 200 ms — page faulter, dynamic linker, JIT, all running concurrently — the CFS controller throttles half of it. The throttling looks identical to a cold-tail outlier: latency spikes for the first iteration, recovers afterward. The tell is kubectl describe pod showing cpu_throttling_total increasing, or a histogram of /sys/fs/cgroup/cpu.stat's nr_throttled counter. Production fix: increase the CPU limit for the duration of warmup, then drop it; or set cpu.cfs_period_us to a longer window so brief warmup spikes don't trip the throttle. The cleaner fix is K8s 1.27+ pod.spec.resources.resize — an in-place resource update that lets the pod ask for 4 CPUs during boot and shrink to 2 once readiness fires.

Drepper's stride benchmark and how to warm a memory hierarchy

Ulrich Drepper's What Every Programmer Should Know About Memory (2007) ships a stride benchmark that walks an array of N elements at stride s and measures the per-access latency. Run it once on a fresh process and the result is dominated by page faults and cache cold-start. Run it after a warmup pass and you see the canonical cache-hierarchy curve — flat at L1 latency until N exceeds L1 size, jump to L2, plateau, jump to L3, plateau, jump to DRAM. The warmup pass for this benchmark is itself non-trivial: you must touch every page (so they're mapped) and every cache line (so the prefetcher's training is converged) before measurement begins. The standard pattern is one full warmup pass through the array at the same stride as the measured pass, discarded, before the measured passes. Without it, the L1 bar in the curve is contaminated by L2 latencies because half the lines were not yet in L1 when the first measurement iteration touched them. Drepper's paper still explains this better than any other source; it's the foundation Part 2 of this curriculum builds on.

Pre-touching pages with `mlock` and the `MAP_POPULATE` shortcut

Page faults are the single largest contributor to the cold tail for memory-heavy workloads. Two Linux primitives let you pay this cost up-front instead of during measurement. mlock(2) locks pages in physical memory and faults them in immediately; calling mlockall(MCL_CURRENT | MCL_FUTURE) at process startup forces the kernel to map every page of the process's address space and pin them, so no subsequent access will ever fault. The cost: locked pages cannot be swapped, so total VM is bounded by physical RAM minus other locked memory. mmap(MAP_POPULATE) is the cheaper variant — for a single mapping it eagerly faults the pages without locking them, so the kernel may evict them later but the first-access cost is paid up-front. Aerospike, ScyllaDB, and several order-matching engines at Indian exchanges use mlockall at startup to eliminate page-fault contributions to their tail; the trade-off is that the process's resident-set size matches its virtual size from second one, which is fine on a dedicated host and disastrous on a co-tenanted one. For a benchmark, MAP_POPULATE on the test data plus a one-pass touch loop over the code segment (read every cache line of .text) reproduces the same effect with no operational concern.

Reproduce this on your laptop

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh numpy
python3 warmup_probe.py
# save samples for the detector:
python3 -c "import numpy as np; from warmup_probe import run; np.save('samples_us.npy', run())"
python3 steady_state_detector.py
# Linux only — pin the governor to make warmup shorter:
sudo cpupower frequency-set -g performance
sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'

Where this leads next

The discipline of separating warmup from steady state generalises to every measurement that follows. Two follow-on chapters take the warm-vs-cold framing into deeper territory.

Frequency scaling, turbo boost, and how benchmarks lie about wall time (/wiki/frequency-scaling-turbo-boost-and-benchmark-noise) is the chapter that takes the governor adversary seriously. Once you know warmup exists, the next question is: even after warmup, what is the CPU's actual frequency, and how does it move during a 60-second run? Turbo Boost adds another regime — a thermally-bounded burst frequency that decays under sustained load — and a benchmark that runs hotter than its cooling can sustain produces a smooth downward latency drift that looks like a regression but is actually a thermal envelope.

Coordinated omission and HDR histograms (/wiki/coordinated-omission-and-hdr-histograms) is the sister chapter on the other main benchmark lie. Where this chapter is about which iterations to include, that one is about whether the iterations you do include are scheduled in a way that represents production. Both must be addressed before any latency number can be trusted; together they form the two structural failures most published benchmarks exhibit.

References

Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) — the foundational paper for cache warmup; the stride benchmark in §3 is the canonical pattern for warming a memory hierarchy before measurement.
Aleksey Shipilëv, "Java Microbenchmark Harness (JMH) — Hard Things About Microbenchmarking" — the slide deck that operationalised dual-gate steady-state detection; JMH's @Warmup annotation implements exactly this.
Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 12 "Benchmarking" — the reference treatment of cold-start, frequency scaling, and the discipline of running benchmarks long enough.
Linux cpufreq governor documentation — the kernel-side reference for governor behaviour, the performance pin, and Turbo Boost interaction.
Brendan Burns et al., "Borg, Omega, and Kubernetes" (CACM 2016) — the K8s lineage of pod lifecycle and readiness probes; the operational basis for synthetic-warmup-before-readiness as a cold-start mitigation.
Criterion-rs benchmarking framework documentation — Rust's equivalent of JMH; its convergence detection and per-iteration outlier classification are worth reading even if you never write Rust.
/wiki/coordinated-omission-and-hdr-histograms — sister chapter on the load-schedule failure mode; together with this one, the two structural lies in published benchmarks.
/wiki/the-methodology-problem-most-benchmarks-are-wrong — the part-opening chapter that catalogues the four families of benchmarking failures; warmup conflation is one of them.