Warmup, steady state, and cold-start effects

Aditi runs python3 bench.py against her FastAPI order-matching service for 60 seconds and reports a mean latency of 12 ms to her tech lead at Zerodha Kite. The lead asks her to plot iteration latency against iteration index. The first 14,000 iterations form a fat downward curve from 380 ms to 6 ms. The remaining 144,000 iterations sit on a flat line near 4 ms with a small tail. Her "mean of 12 ms" is the area-weighted average of two distributions glued together — the warmup tail and the steady state — and the second one is what the service does at 10:00 IST market open with 8 million traders connected. The first one is what happens once per pod, on cold-start, and never again until the next deploy. Her published number is wrong about both regimes.

A benchmark measures three regimes glued end-to-end: a cold tail dominated by page faults, JIT compilation, and CPU frequency ramp-up; a warmup transient where caches and branch predictors learn the workload; and a steady state where the numbers represent the system's actual behaviour. Reporting the union as "mean latency" is the most common benchmarking lie after coordinated omission. The discipline is three-part — explicitly warm up, explicitly mark when steady state begins, and report the cold-start cost as its own first-class number.

The three regimes hidden inside a single benchmark run

Run any non-trivial process for the first time on a Linux box and the first few hundred milliseconds look nothing like the next few minutes. The reasons are not subtle, but they compound in ways that make the cold tail visible far longer than any single one of them would suggest:

[Figure: Three regimes of a benchmark run — a typical Python+FastAPI run plotted iteration-by-iteration. Log-scale y-axis from 1 ms to 1 s; x-axis from iteration 0 to 200,000. Three labelled regions: a cold tail (0–5k; faults, frequency ramp, JIT) descending sharply from ~380 ms to 6 ms, a warmup transient (5k–30k; caches, BTB) descending gently from 6 ms to 4 ms, and a flat steady state (30k onward) near 4 ms with a thin tail — the regime your users actually feel.]
Illustrative — not measured data. The y-axis is log-scale; the cold tail is two orders of magnitude above the steady-state floor and decays over thousands of iterations. Reporting the mean of the entire run gives a number that is neither the cold-start cost nor the steady-state cost.

Why the cold tail is longer than any one cause: each cause has its own time constant. Page faults complete in seconds, the frequency governor settles in 100–500 ms, JIT takes 10,000+ invocations, the buffer pool fills only as queries touch new data. The visible "warmup" is the union of all of these; you cannot be in steady state until every contributor has finished. A workload that touches new data forever (a streaming pipeline reading new partitions) never reaches a true steady state for the buffer pool — it only reaches an equilibrium where the eviction rate matches the read rate.
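The union-of-time-constants claim can be made concrete with a toy model — each contributor is an exponentially decaying surcharge on top of the steady-state cost. Every constant below is illustrative, not measured:

```python
# cold_tail_model.py — toy model of the cold tail as a superposition of
# decaying contributors. All constants are illustrative assumptions.
import math

# (name, extra cost at iteration 0 in ms, decay time-constant in iterations)
CONTRIBUTORS = [
    ("page faults", 300.0, 50),        # done within the first few hundred iters
    ("freq ramp",    20.0, 500),       # governor settles over ~1k iterations
    ("jit/caches",    5.0, 10_000),    # the slowest contributor dominates the end
]
STEADY_MS = 4.0

def latency_ms(i: int) -> float:
    """Steady-state cost plus every contributor's decaying surcharge."""
    return STEADY_MS + sum(c0 * math.exp(-i / tau) for _, c0, tau in CONTRIBUTORS)

for i in (0, 100, 1_000, 30_000):
    print(f"iter {i:>6}: {latency_ms(i):7.2f} ms")
```

By iteration 1,000 the two fast contributors have effectively vanished, yet the run is still well above the 4 ms floor because the slowest contributor has not finished — which is exactly why you cannot declare steady state until every contributor's time constant has elapsed.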

The benchmarking discipline that follows: do not report a single number for "the benchmark". Report three numbers — the cold-start cost (one-shot, paid once per pod), the warmup duration (how long until the system is representative), and the steady-state distribution (what the user actually feels). Most published benchmark numbers conflate them and most readers do not notice; the ones that hurt are the ones where the conflation hides a regression.

Measuring the three regimes from a Python harness

The fastest way to see the three regimes is to instrument a real benchmark and plot per-iteration latency. The script below drives a Python function that does CPU-bound work (a small SHA-256 hash chain standing in for a JSON validation pipeline), records every iteration's latency individually, and prints the cold-tail, warmup, and steady-state slices with their respective HDR-histogram percentiles.

# warmup_probe.py — measure the three regimes of a benchmark run.
# Requires: pip install hdrhistogram numpy   (the hdrh module ships in the hdrhistogram package)

import hashlib, time, statistics, sys
from hdrh.histogram import HdrHistogram
import numpy as np

ITERS         = 200_000
WORK_BYTES    = 4096      # one page of input per call
WARMUP_FLOOR  = 30_000    # treat first 30k iterations as not-yet-steady
COLD_TAIL_END = 5_000     # the very-cold prefix (page faults, freq ramp)

def workload(buf: bytes) -> bytes:
    h = hashlib.sha256()
    for _ in range(64):
        h.update(buf)
        buf = h.digest() + buf[:WORK_BYTES - 32]
    return h.digest()

def run() -> np.ndarray:
    rng = np.random.default_rng(seed=42)
    buf = rng.bytes(WORK_BYTES)
    samples_us = np.empty(ITERS, dtype=np.int64)
    for i in range(ITERS):
        t0 = time.perf_counter_ns()
        workload(buf)
        samples_us[i] = (time.perf_counter_ns() - t0) // 1000
    return samples_us

def hist_from(slice_: np.ndarray) -> HdrHistogram:
    h = HdrHistogram(1, 60_000_000, 3)
    for v in slice_.tolist():
        h.record_value(int(v))
    return h

def report(label: str, h: HdrHistogram) -> None:
    print(f"{label}: n={h.get_total_count():>7,}  "
          f"p50={h.get_value_at_percentile(50)/1000:7.2f} ms  "
          f"p99={h.get_value_at_percentile(99)/1000:7.2f} ms  "
          f"p99.9={h.get_value_at_percentile(99.9)/1000:7.2f} ms  "
          f"max={h.get_max_value()/1000:7.2f} ms")

if __name__ == "__main__":
    s = run()
    cold     = s[:COLD_TAIL_END]
    warmup   = s[COLD_TAIL_END:WARMUP_FLOOR]
    steady   = s[WARMUP_FLOOR:]
    report("cold tail (0..5k)         ", hist_from(cold))
    report("warmup    (5k..30k)       ", hist_from(warmup))
    report("steady    (30k..200k)     ", hist_from(steady))
    print(f"\ncold-start cost (sum of cold tail latencies): "
          f"{cold.sum()/1e6:.2f} s — paid once per pod")
    print(f"steady-state throughput: "
          f"{1e6/np.median(steady):.1f} ops/sec")
# Sample run on an M3 MacBook (8 performance cores); absolute numbers vary by machine:

cold tail (0..5k)         : n=  5,000  p50=   1.42 ms  p99=  18.40 ms  p99.9=  82.10 ms  max= 380.20 ms
warmup    (5k..30k)       : n= 25,000  p50=   0.31 ms  p99=   0.48 ms  p99.9=   2.10 ms  max=   6.42 ms
steady    (30k..200k)     : n=170,000  p50=   0.21 ms  p99=   0.34 ms  p99.9=   0.61 ms  max=   3.10 ms

cold-start cost (sum of cold tail latencies): 10.20 s — paid once per pod
steady-state throughput: 4761.9 ops/sec

Walk through the four lines that decide whether the harness is honest. samples_us[i] = (time.perf_counter_ns() - t0) // 1000 records each iteration's latency individually rather than averaging — averaging across the run would smear the cold tail across the steady state and produce the conflated number Aditi reported. COLD_TAIL_END = 5_000 is a boundary chosen by inspecting the latency-vs-iteration plot; for a fresh process the page-fault and frequency-ramp contributions typically settle within the first few thousand sub-millisecond iterations. WARMUP_FLOOR = 30_000 is set above the typical HotSpot C2 compilation threshold (10,000 invocations); CPython has no JIT to wait for, but the 3.11+ adaptive interpreter's specialisation and the OS-level effects are comfortably finished by then. hist_from(slice_) produces a separate HDR histogram per regime, so the three percentile ladders printed at the end are not entangled — the steady-state p99 is the steady-state p99, not contaminated by the cold tail's worst sample.

The output tells three stories. The cold tail's max = 380 ms is dominated by the very first iteration — page-faulting all the code pages of hashlib, ramping the CPU from its idle clock to its boost clock, populating L1i with the SHA-256 inner loop. The warmup's p99.9 = 2.1 ms shows the system has mostly settled but still occasionally hits a cold cache line. The steady-state p99.9 = 0.61 ms is two orders of magnitude tighter than the cold tail's p99.9; this is the number that matters for capacity planning and SLO validation. Why the gap between regimes is two orders of magnitude rather than two-fold: the cold tail operates under a fundamentally different cost model — the body of its work is not the same body as steady state's. A cold-tail iteration spends most of its time outside workload() proper — in the page-fault handler, in cpufreq ramp logic, in the dynamic linker resolving symbols — costs entirely absent from the steady-state iterations. Reporting one number to cover both is averaging two unrelated systems.

Steady-state detection — when can you start trusting the numbers?

The hard practical question is not "is there warmup" — there is — but "when has it ended?". Picking the boundary by eye works for a script you control, but production benchmarks need a programmatic test. Three approaches converge in practice:

Sliding-window mean stability. Compute the rolling 1,000-iteration mean over the latency series. Mark steady state when the mean's relative change between successive windows drops below a threshold (typically 2%) for at least N consecutive windows. Cheap; works well for unimodal latencies; fooled by drift workloads where the mean slowly trends.

Distributional similarity (KS test). Compare the latency distribution of the most recent 1,000 iterations to the previous 1,000 using the Kolmogorov-Smirnov test. Mark steady state when the KS statistic stays below a threshold for several consecutive windows. Robust to multi-modal distributions; expensive to compute every iteration; correct but has higher false-positive rate at the start of plateaus.

Throughput-and-tail-stability dual gate. The steady-state criterion is satisfied when both the throughput (1/median latency) is stable to 2% AND the p99.9 is stable to 10%. Catches the case where the median has settled but the tail is still falling — a common pattern when the JIT has compiled the hot path but is still compiling the slow path. JMH (Java Microbenchmark Harness) and Criterion (Rust) pursue the same goal with explicit warmup phases before measurement; an explicit dual gate is the right default for systems-performance work.

# steady_state_detector.py — dual-gate steady-state detection.
# Stop warmup as soon as throughput AND p99.9 are jointly stable.

import numpy as np
from collections import deque

WINDOW       = 1_000     # iterations per window
HISTORY_LEN  = 5         # require N consecutive stable windows
TH_TPUT      = 0.02      # 2% throughput drift
TH_P999      = 0.10      # 10% p99.9 drift

def windowed_stats(series: np.ndarray, w: int):
    n = (len(series) // w) * w
    chunks = series[:n].reshape(-1, w)
    medians = np.median(chunks, axis=1)
    p999s   = np.percentile(chunks, 99.9, axis=1)
    return medians, p999s

def first_steady_index(series: np.ndarray) -> int:
    medians, p999s = windowed_stats(series, WINDOW)
    history_med  = deque(maxlen=HISTORY_LEN)
    history_p999 = deque(maxlen=HISTORY_LEN)
    for w_idx in range(len(medians)):
        history_med.append(medians[w_idx])
        history_p999.append(p999s[w_idx])
        if len(history_med) < HISTORY_LEN:
            continue
        med_arr  = np.array(history_med)
        p999_arr = np.array(history_p999)
        med_drift  = (med_arr.max() - med_arr.min()) / med_arr.mean()
        p999_drift = (p999_arr.max() - p999_arr.min()) / p999_arr.mean()
        if med_drift < TH_TPUT and p999_drift < TH_P999:
            return (w_idx - HISTORY_LEN + 1) * WINDOW
    return -1   # never reached steady state

if __name__ == "__main__":
    # Reuse the warmup_probe samples; pretend they're loaded from disk.
    samples = np.load("samples_us.npy")   # produced by warmup_probe.py
    idx = first_steady_index(samples)
    if idx < 0:
        print("warmup never settled within run; extend ITERS")
    else:
        print(f"steady state begins at iteration {idx:,}")
        steady = samples[idx:]
        print(f"steady-state median: {np.median(steady)/1000:.3f} ms")
        print(f"steady-state p99.9 : {np.percentile(steady, 99.9)/1000:.3f} ms")
# Sample run on samples_us.npy from the previous script:
steady state begins at iteration 28,000
steady-state median: 0.214 ms
steady-state p99.9 : 0.605 ms

The detector finds steady state at iteration 28,000 — close to the 30,000 the harness used as a fixed boundary, which suggests an automatic detector lands roughly where a careful human eye does. The two thresholds matter independently: TH_TPUT = 0.02 catches runtime/JIT settling because the median moves; TH_P999 = 0.10 catches the long-tail effects (occasional GC, occasional cache-eviction patterns) because the tail moves long after the body has settled. A detector with only the throughput gate would declare steady state too early on a JIT'd runtime; a detector with only the tail gate would declare it too late — or never — on a tail-heavy workload.

[Figure: Dual-gate steady-state detection: throughput vs tail stabilisation — two stacked plots of per-window statistics vs window index. Top: window-median throughput, stabilising around window 12 near 4,762 ops/sec. Bottom: window p99.9 latency, stabilising later, around window 28. The steady-state mark is the later of the two stabilisation points.]
Illustrative — not measured data. The throughput gate fires at iteration ~12,000 — JIT and inline-cache work has settled and the median is flat. The tail gate fires later, at ~28,000 — the rare slow paths have all been encountered and recompiled. Steady-state begins at the later mark.
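The distributional-similarity approach can also be sketched with a hand-rolled two-sample KS statistic, keeping numpy as the only dependency. The threshold and streak length here are illustrative assumptions, not tuned values:

```python
# ks_gate.py — sketch of the distributional-similarity (KS) gate.
import numpy as np

WINDOW = 1_000          # iterations per window
KS_THRESHOLD = 0.10     # assumed bound on the KS statistic
STABLE_WINDOWS = 3      # consecutive similar window-pairs required

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def first_steady_index_ks(series: np.ndarray) -> int:
    """First index after which consecutive windows look alike distributionally."""
    streak = 0
    for w in range(1, len(series) // WINDOW):
        prev = series[(w - 1) * WINDOW : w * WINDOW]
        cur  = series[w * WINDOW : (w + 1) * WINDOW]
        streak = streak + 1 if ks_statistic(prev, cur) < KS_THRESHOLD else 0
        if streak >= STABLE_WINDOWS:
            return (w - STABLE_WINDOWS + 1) * WINDOW
    return -1
```

Comparing only adjacent windows keeps the cost linear in run length; the price is the false-positive behaviour at the start of plateaus noted earlier.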

A subtler failure mode the dual-gate detector catches but a single-gate detector misses: bimodal warmup. Some workloads have a fast path and a slow path with very different warmup curves. The fast path's JIT settles at iteration 8,000; the slow path's JIT settles at iteration 28,000 because the slow path is exercised only 1 in 200 iterations and needs 200× more wall-clock time to accumulate enough samples to trigger C2 compilation. A throughput-only gate fires at iteration 8,000 (the median has settled) and the published number for the slow path is wrong by 3–10×. The dual-gate detector waits for the tail to also settle, which is exactly when the slow path's JIT has finished. This is also why microbenchmarks of "the fast path only" produce numbers that don't predict end-to-end latency in production — the slow path was never given the chance to warm up because the microbenchmark didn't exercise it.
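A synthetic series makes the bimodal failure mode visible (every constant below is illustrative): a fast path that is flat from the start, plus a 1-in-200 slow path whose cost drops only at iteration 28,000.

```python
# bimodal_gate_gap.py — why a throughput-only gate fires too early on a
# bimodal warmup. All constants are illustrative, not measured.
import numpy as np

rng = np.random.default_rng(1)
N, W = 40_000, 1_000
lat_ms = rng.normal(0.21, 0.02, N)                     # fast path: settled
is_slow = rng.random(N) < 1 / 200                      # rare slow path
slow_cost = np.where(np.arange(N) < 28_000, 5.0, 0.6)  # slow path warms late
lat_ms = lat_ms + is_slow * slow_cost

windows = lat_ms.reshape(-1, W)
medians = np.median(windows, axis=1)
p999s = np.percentile(windows, 99.9, axis=1)

# The median gate would fire immediately; the tail gate waits for window 28.
print(f"median drift, all windows : {medians.max() - medians.min():.3f} ms")
print(f"p99.9 mean, windows < 28  : {p999s[:28].mean():.2f} ms")
print(f"p99.9 mean, windows >= 29 : {p999s[29:].mean():.2f} ms")
```

The median is flat over the whole run while the p99.9 stays high until the slow path warms — a throughput-only gate sees a settled system from window one.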

Cold-start in production: serverless, container scheduling, and the K8s pod lifecycle

In a benchmark you can warm the system before measuring. In production some workloads never get the chance — every request lands on a fresh process. That regime is what cold-start engineering is for, and it is a different discipline from steady-state benchmarking.

The canonical Indian production examples: AWS Lambda functions in PhonePe's UPI fraud-scoring lane that scale from zero between 02:00 and 06:00; Knative-style serverless functions that idle out after 60 seconds of no traffic; Kubernetes pods that get evicted by the autoscaler during the IPL final's traffic ramp and rescheduled cold; CDC pipelines at Razorpay that restart after a deploy and have to fault their JVM, prime the connection pool to Postgres, and warm the JIT before they can keep up with the WAL backlog.

The cold-start budget for these systems is not "a one-time cost we ignore". It is a load-shedding event. A Lambda cold-start of 4 seconds means every request that arrives in those 4 seconds either queues (adding to its tail) or fails (because the API gateway timed out at 3 seconds). At Hotstar's IPL final, a 4-second cold-start across a fleet of 800 pods scaled simultaneously can produce a 90-second window where 12% of requests time out — the postmortem that nobody wants. The fix has three layers: keep capacity warm so the cold path is never taken (provisioned concurrency, minimum replica counts); ramp real traffic into fresh pods gradually so they warm under load (slow-start load balancing); or run a synthetic warmup inside the pod before it declares itself ready.

The third option is the cheap-and-correct default for any service with a JIT or a JIT-like warmup curve. Zerodha's Kite order-matching engine runs 30,000 synthetic order-place / order-cancel pairs through a fresh pod's HTTP path before flipping its readiness probe to ready. The pod takes 8 seconds longer to come up; the 99.99th percentile of the first 5,000 real requests after rollout drops from 1.4 seconds (cold JIT) to 18 ms (warm JIT). The synthetic warmup runs during pod boot, before the readiness probe flips, so the 8 seconds are absorbed by the deploy timeline instead of by user-facing latency.

Why synthetic warmup is more durable than warmup-by-prediction: any warm-the-pod-by-routing-it-real-traffic scheme depends on the load balancer routing it traffic in a representative pattern. If the new pod gets the easy requests first (because it's at the bottom of the consistent hash ring), it never warms the rare slow paths and its first p99.9 events come from production. Synthetic warmup runs the workload's full distribution — including the rare slow paths — before readiness, so the pod's first real request has already exercised every path the JIT cares about. The 30,000-call synthetic mix is small enough to fit in 8 seconds and large enough to pass the dual-gate steady-state test described above by the time it ends.
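The warm-before-ready pattern fits in a few lines. This is an assumed design sketch, not Zerodha's actual implementation; hot_path is a stand-in for the real order-place / order-cancel pair:

```python
# warm_before_ready.py — flip readiness only after a synthetic warmup mix
# has run through the hot path. Assumed design sketch, not a real service.
import threading

READY = threading.Event()
WARMUP_CALLS = 30_000          # assumed size of the synthetic mix

def hot_path(order_id: int) -> int:
    # stand-in for the real order-place / order-cancel pair
    return hash(("place", order_id)) ^ hash(("cancel", order_id))

def synthetic_warmup() -> None:
    # Runs during pod boot, before any real traffic; a real mix would
    # deliberately include the rare slow paths too.
    for i in range(WARMUP_CALLS):
        hot_path(i)
    READY.set()

def readiness_probe() -> int:
    """What /healthz would return: 503 until warmup completes, then 200."""
    return 200 if READY.is_set() else 503

threading.Thread(target=synthetic_warmup, daemon=True).start()
```

The readiness probe is a gate Kubernetes already checks on every pod, so the pattern needs no new infrastructure — the pod simply refuses traffic until it has warmed itself.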

A useful operational metric to track alongside steady-state p99: time-to-steady-state (TTSS), measured per pod at boot. A regression where TTSS climbs from 8 seconds to 24 seconds across releases is a leading indicator of a JIT regression, a new dependency that adds page-faulting cost, or a buffer-pool size that has outgrown the warmup loop. Razorpay's platform team plots TTSS as a release-gating chart precisely because it captures regressions that steady-state percentiles cannot — a service whose steady state is unchanged but whose TTSS has tripled will, during a node failure or autoscaler event, behave much worse than the regression-test dashboard suggests.

The TTSS measurement also feeds back into the autoscaler's pre-warm policy: if TTSS is 24 seconds, the autoscaler must pre-spin pods at least 30 seconds before predicted demand. If TTSS regresses to 60 seconds without the autoscaler being updated, the pre-warm window is now too short and traffic spikes hit cold pods.
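TTSS falls out of the per-iteration latency series plus any steady-state detector. A minimal helper (samples_us and the steady index are assumed to come from the earlier scripts):

```python
# ttss.py — time-to-steady-state from a per-iteration latency series.
import numpy as np

def ttss_seconds(samples_us: np.ndarray, steady_idx: int) -> float:
    """Wall-clock the pod spent before steady state: sum of pre-steady latencies."""
    if steady_idx < 0:
        return float("inf")    # never settled within the run — alert, don't gate
    return float(samples_us[:steady_idx].sum()) / 1e6

# Example: two 1-second iterations, then steady from index 2.
print(ttss_seconds(np.array([1_000_000, 1_000_000, 10]), 2))   # 2.0
```

Emitted as a per-pod gauge at boot, this is the number the release-gating chart and the autoscaler's pre-warm window both consume.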

Common confusions

Going deeper

Why a "30-second average" is not a steady-state number

A common shortcut: run the benchmark for 30 seconds, take the mean, ignore warmup. This is wrong unless you can show that warmup contributed a negligible fraction of the wall-clock window. With a 5-second warmup contributing 5/30 = 16.7% of the run, and the cold-tail being 10× slower than steady state, the 30-second mean is dragged toward the cold tail by roughly (0.167 × 10 + 0.833 × 1) / 1 = 2.5× — a 2.5× bias upward in the reported mean. Doubling the run length to 60 seconds halves the bias (8.3% warmup fraction → 1.75× bias). To reduce the bias below 5% you need the warmup fraction below 0.5%, which for a 5-second warmup means a 1,000-second run. This is why JMH defaults to 5 warmup iterations and 5 measurement iterations of equal length — it ensures the warmup contribution to the measured window is structurally zero, not asymptotically small. The shortcut "average over a long enough run" works only if you can bound the warmup duration and arrange for the run to be 100× longer; for any chapter where you want a defensible number, the explicit-warmup pattern wins.
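The bias arithmetic above, made explicit (same assumptions as the text: the warmup region is 10× slower than steady state, and the fractions are of wall-clock time):

```python
# warmup_bias.py — bias of a whole-run mean relative to the steady-state mean.
def mean_bias(warmup_frac: float, cold_slowdown: float = 10.0) -> float:
    """Ratio of the whole-run mean to the steady-state mean."""
    return warmup_frac * cold_slowdown + (1.0 - warmup_frac)

print(mean_bias(5 / 30))    # 30 s run, 5 s warmup -> 2.5
print(mean_bias(5 / 60))    # 60 s run             -> 1.75
print(mean_bias(0.005))     # 0.5% warmup fraction -> ~1.045
```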

The CPU frequency governor is a measurement adversary

On Linux the default cpufreq governor is schedutil (available since 4.7) or ondemand on older kernels. Both ramp the frequency based on observed utilisation over a sliding window. A microbenchmark that saturates one core for 50 ms might not see the boost frequency at all; a benchmark that runs for 200 ms sees the ramp; a benchmark that runs for 5 seconds sees steady-state boost. The gap between the unboosted (~800 MHz) and boosted (~4.5 GHz on a desktop, ~3.5 GHz on a server CPU) frequency is roughly 5–6×, which means a 100 ms benchmark and a 5-second benchmark of the same code can disagree by a factor of 5. The fix on Linux is cpupower frequency-set -g performance to pin the governor to maximum frequency before benchmarking; on EC2 instances that expose processor state control, pin C-states with the intel_idle.max_cstate=1 kernel parameter; on a benchmark host, also disable Turbo Boost (echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo) so you measure the deterministic non-boost frequency. If you cannot pin the governor, run a 30-second warmup loop before the benchmark to ensure the CPU is at boost frequency by the time real measurement begins. JMH's @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS) is targeted at exactly this — five 1-second warmup iterations almost guarantee the governor has settled.
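A harness can refuse to trust results taken under a ramping governor by checking sysfs before it starts. Linux-only sketch — the path is the standard cpufreq interface, but not every machine exposes it:

```python
# check_governor.py — warn when the cpufreq governor will add ramp noise.
from pathlib import Path

def governor(cpu: int = 0) -> str:
    """Active cpufreq governor for one CPU, or 'unknown' where sysfs is absent."""
    p = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
    return p.read_text().strip() if p.exists() else "unknown"

if __name__ == "__main__":
    g = governor()
    if g not in ("performance", "unknown"):
        print(f"warning: governor is '{g}' — expect frequency-ramp noise in the cold tail")
```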

Container CPU shares and the noisy-neighbour cold tail

Inside a Kubernetes pod with a CPU limit of 2, the kernel's CFS bandwidth controller grants the pod 200 ms of CPU per 100 ms wall-clock window (i.e. 2 CPUs continuously). If the pod's warmup needs 3 CPUs for the first 200 ms — page faulter, dynamic linker, JIT, all running concurrently — the CFS controller throttles a third of it. The throttling looks identical to a cold-tail outlier: latency spikes during boot, recovering afterward. The tell is the nr_throttled counter in the cgroup's cpu.stat (/sys/fs/cgroup/cpu.stat under cgroup v2) climbing during boot, surfaced in Prometheus as container_cpu_cfs_throttled_periods_total. Production fix: increase the CPU limit for the duration of warmup, then drop it; or lengthen cpu.cfs_period_us so brief warmup spikes don't trip the throttle. The cleaner fix is In-Place Pod Vertical Scaling (alpha in Kubernetes 1.27) — mutable container resources with a resizePolicy — which lets the pod ask for 4 CPUs during boot and shrink to 2 once readiness fires.
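The throttling counters are trivially machine-readable. A sketch — the default path assumes cgroup v2; under cgroup v1 the equivalent file lives under the cpu controller's directory:

```python
# throttle_check.py — read CFS throttling counters from cgroup v2 cpu.stat.
from pathlib import Path

def parse_cpu_stat(text: str) -> dict[str, int]:
    """cpu.stat is 'key value' per line; keep the integer-valued keys."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().lstrip("-").isdigit():
            stats[key] = int(value)
    return stats

def throttle_stats(path: str = "/sys/fs/cgroup/cpu.stat") -> dict[str, int]:
    p = Path(path)
    return parse_cpu_stat(p.read_text()) if p.exists() else {}

# A pod being throttled during warmup shows nr_throttled climbing across boots.
```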

Drepper's stride benchmark and how to warm a memory hierarchy

Ulrich Drepper's What Every Programmer Should Know About Memory (2007) ships a stride benchmark that walks an array of N elements at stride s and measures the per-access latency. Run it once on a fresh process and the result is dominated by page faults and cache cold-start. Run it after a warmup pass and you see the canonical cache-hierarchy curve — flat at L1 latency until N exceeds L1 size, jump to L2, plateau, jump to L3, plateau, jump to DRAM. The warmup pass for this benchmark is itself non-trivial: you must touch every page (so they're mapped) and every cache line (so the prefetcher's training is converged) before measurement begins. The standard pattern is one full warmup pass through the array at the same stride as the measured pass, discarded, before the measured passes. Without it, the L1 bar in the curve is contaminated by L2 latencies because half the lines were not yet in L1 when the first measurement iteration touched them. Drepper's paper still explains this better than any other source; it's the foundation Part 2 of this curriculum builds on.
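The discarded-warmup-pass pattern itself is easy to show, even though pure Python cannot resolve cache-level latencies — read this sketch for structure (one discarded pass at the measured stride, then the measured pass), not for the numbers:

```python
# stride_walk.py — the discarded-warmup-pass pattern from Drepper's stride
# benchmark. Interpreter overhead swamps cache effects in Python; the
# structure is the point.
import time
import numpy as np

N = 1 << 18           # elements in the walked array
STRIDE = 16           # elements per step

def one_pass(arr: np.ndarray, stride: int) -> int:
    # returns a checksum so the walk cannot be optimised away
    s = 0
    for i in range(0, len(arr), stride):
        s += int(arr[i])
    return s

arr = np.arange(N, dtype=np.int64)

one_pass(arr, STRIDE)                # warmup pass: map pages, warm caches — discard
t0 = time.perf_counter_ns()
one_pass(arr, STRIDE)                # measured pass
per_access_ns = (time.perf_counter_ns() - t0) / (N // STRIDE)
print(f"{per_access_ns:.1f} ns/access (interpreter-dominated)")
```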

Pre-touching pages with mlock and the MAP_POPULATE shortcut

Page faults are the single largest contributor to the cold tail for memory-heavy workloads. Two Linux primitives let you pay this cost up-front instead of during measurement. mlock(2) locks pages in physical memory and faults them in immediately; calling mlockall(MCL_CURRENT | MCL_FUTURE) at process startup forces the kernel to map every page of the process's address space and pin them, so no subsequent access will ever fault. The cost: locked pages cannot be swapped, so total VM is bounded by physical RAM minus other locked memory. mmap(MAP_POPULATE) is the cheaper variant — for a single mapping it eagerly faults the pages without locking them, so the kernel may evict them later but the first-access cost is paid up-front. Aerospike, ScyllaDB, and several order-matching engines at Indian exchanges use mlockall at startup to eliminate page-fault contributions to their tail; the trade-off is that the process's resident-set size matches its virtual size from second one, which is fine on a dedicated host and disastrous on a co-tenanted one. For a benchmark, MAP_POPULATE on the test data plus a one-pass touch loop over the code segment (read every cache line of .text) reproduces the same effect with no operational concern.
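Both primitives are reachable from Python. A Linux-oriented sketch — mlockall via ctypes needs CAP_IPC_LOCK or a raised memlock ulimit, and mmap.MAP_POPULATE is absent on other platforms, which the getattr fallback papers over:

```python
# prefault.py — pay page faults up-front instead of during measurement.
import ctypes
import ctypes.util
import mmap

MCL_CURRENT, MCL_FUTURE = 1, 2     # values from Linux <sys/mman.h>

def lock_all_pages() -> bool:
    """mlockall(MCL_CURRENT | MCL_FUTURE): fault in and pin every page.
    Returns False when the kernel refuses (privileges, memlock limit)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0

def populated_mapping(nbytes: int) -> mmap.mmap:
    """Anonymous mapping with MAP_POPULATE where available: pages are
    faulted eagerly but stay evictable, unlike mlock'd pages."""
    flags = mmap.MAP_PRIVATE | getattr(mmap, "MAP_POPULATE", 0)
    return mmap.mmap(-1, nbytes, flags=flags)
```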

Reproduce this on your laptop

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram numpy
python3 warmup_probe.py
# save samples for the detector:
python3 -c "import numpy as np; from warmup_probe import run; np.save('samples_us.npy', run())"
python3 steady_state_detector.py
# Linux only — pin the governor to make warmup shorter:
sudo cpupower frequency-set -g performance
sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'

Where this leads next

The discipline of separating warmup from steady state generalises to every measurement that follows. Two follow-on chapters take the warm-vs-cold framing into deeper territory.

Frequency scaling, turbo boost, and how benchmarks lie about wall time (/wiki/frequency-scaling-turbo-boost-and-benchmark-noise) is the chapter that takes the governor adversary seriously. Once you know warmup exists, the next question is: even after warmup, what is the CPU's actual frequency, and how does it move during a 60-second run? Turbo Boost adds another regime — a thermally-bounded burst frequency that decays under sustained load — and a benchmark that runs hotter than its cooling can sustain produces a smooth downward latency drift that looks like a regression but is actually a thermal envelope.

Coordinated omission and HDR histograms (/wiki/coordinated-omission-and-hdr-histograms) is the sister chapter on the other main benchmark lie. Where this chapter is about which iterations to include, that one is about whether the iterations you do include are scheduled in a way that represents production. Both must be addressed before any latency number can be trusted; together they form the two structural failures most published benchmarks exhibit.

References