Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Steady-state hypotheses: how chaos engineering defines normal
It is 11:47am on a Thursday at PaySetu and Riya, the on-call SRE for the payments tier, has the chaos-platform draft-experiment open in one tab and a Grafana dashboard in another. The experiment is terminate-one-payment-shard-replica. The blast radius is set. The abort condition is set. There is one field left to fill: the steady-state hypothesis. Riya types payment_success_rate > 99.5%, hovers over Save, and stops. Over the last week, the payment success rate at 11am has been 99.84% on weekdays and 99.71% on Sundays during peak. The lower bound 99.5% is so wide that the experiment can pass while customer-visible behaviour quietly degrades. She tightens it to payment_success_rate >= 99.7% over a rolling 60-second window with at least 200 attempts. That single edit — a number, a time window, a sample-size floor — is the difference between a chaos experiment that proves resilience and one that lies about it. The steady-state hypothesis is the most important field on the form, and almost everyone gets it wrong on the first try.
A steady-state hypothesis is a numeric, time-bounded, falsifiable claim about a business-level metric — phrased as metric op threshold over window with sample-floor — that says exactly what normal looks like during the experiment. It is the definition of pass. Without it, every chaos experiment passes by default and proves nothing. The hypothesis is built from observed historical distributions, not from SLO documents, and it is tightened until a real degradation would falsify it within the experiment window.
What a steady-state hypothesis actually is
A chaos experiment has three moving parts: a hypothesis, a fault, and an abort condition. The hypothesis is the only one that determines whether the experiment passes. The fault is the variable. The abort condition is the safety net. The hypothesis is the measurement that decides the verdict.
Casey Rosenthal and the Netflix chaos team formalised the term in the 2017 Chaos Engineering book — they borrowed it from the scientific method. A steady state, in their framing, is the system's normal output under load: the rate at which orders complete, the p99 of the checkout latency, the streams-started-per-second. The hypothesis is the claim that this output remains within a defined band even when the fault is injected. Falsify the hypothesis and you've found a real weakness. Confirm it and you've earned a small piece of evidence that the system tolerates that fault under that load.
The shape of a well-formed steady-state hypothesis has six parts:
- A business-level metric, not an infrastructure metric (payment_success_rate, not cpu_usage).
- An operator (>=, <=, between, within ±%).
- A threshold (a number derived from observed normal).
- A time window (a rolling window, never an instantaneous reading).
- A sample-size floor (the window must contain at least N data points before the verdict is meaningful).
- A scope (the slice of traffic the hypothesis applies to).
Concretely: payment_success_rate >= 99.7% over a rolling 60-second window with at least 200 attempts, scoped to payments-shard-3 in ap-south-1. A hypothesis missing any of those six parts will fail in production — either by accepting a degradation it should have caught, or by alarming on noise that has nothing to do with the fault.
Why "metric > SLO" is the wrong threshold
The single most common mistake — Riya's first attempt above, and what almost every engineer writes when they fill in the field for the first time — is to copy the threshold from the SLO document. The SLO says "payment success rate above 99.5% measured over 28 days". So the chaos hypothesis says success_rate > 99.5%. Both are 99.5%. Surely that's right.
It is not right, for two reasons. First, the SLO is computed over a 28-day window; the chaos experiment runs for 5 minutes. A success rate of 99.5% over 28 days allows a sustained dip to 95% for 3 hours and still passes. A 5-minute experiment that drops to 95% the whole time would be invisible to the SLO computation but is a catastrophic result for the experiment. Second, the SLO is the floor below which the business panics; the steady-state hypothesis should be tighter than that, because the experiment exists to detect deviations from normal, not to detect full-blown SLO breaches. By the time you are breaching SLO, the experiment is already a failed deploy.
The right threshold is built from the metric's observed distribution during a comparable window — same hour-of-day, same day-of-week, same scope. PaySetu's payment success rate at 11am on a Thursday is, say, 99.84% with a standard deviation of 0.06 percentage points across the last 8 weeks. A reasonable hypothesis threshold is 3 standard deviations below the mean: success_rate >= 99.66%. That number tracks the empirical distribution, not the marketing target. Why: a one-sided 3σ bound corresponds to roughly a 0.13% false-positive rate under a Gaussian assumption — meaning if the fault is doing nothing, the experiment will spuriously fail in roughly 1 of every 740 runs, which is a tolerable rate for a quarterly cadence and a tight enough threshold to catch fault-induced degradation that lies between "noise" and "SLO breach".
The 3σ rule isn't sacred — for very-low-variance metrics (success rates near 100% with tiny standard deviation) you may use 2σ; for very-high-variance metrics (long-tail latencies) you often use a percentile-based threshold like p99 instead. The principle — derive the threshold from the distribution, not from the SLO document — is what matters.
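To see how the σ-multiplier trades sensitivity against false alarms, the one-sided Gaussian tail probability can be computed directly from the standard library. This is a small illustrative calculation, not part of any team's pipeline, and it leans on the same Gaussian assumption the σ-based threshold already makes.

import math

def one_sided_tail(k):
    """Probability that a benign window lands more than k sigma below the mean,
    assuming the per-window metric is roughly Gaussian."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (2.0, 3.0, 4.0):
    p = one_sided_tail(k)
    print(f"k={k}: false-positive rate ~{p:.3%}, roughly 1 in {round(1 / p):,} benign runs")

At k=3 the rate is about 0.135%, roughly 1 run in 740; at k=2 it rises to about 2.3%, which is why 2σ only makes sense for metrics whose variance is genuinely tiny.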
Computing the threshold from real history
The mechanics are mechanical: query the metric for the last N comparable windows, compute mean and standard deviation, set the threshold at mean − k × σ for an "above" metric or mean + k × σ for a "below" metric. The pure-Python version is fifteen lines and works against any time-series store via its query API.
import statistics
from dataclasses import dataclass

@dataclass
class Hypothesis:
    metric: str
    operator: str
    threshold: float
    window_seconds: int
    min_samples: int
    scope: str

def derive_hypothesis(history_buckets, metric, scope, window_s, min_samples, k=3.0):
    """history_buckets: list of (window_value_pct, sample_count) for past comparable windows."""
    samples = [v for v, n in history_buckets if n >= min_samples]
    if len(samples) < 8:
        raise ValueError(f"need 8+ comparable windows, got {len(samples)}")
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return Hypothesis(metric, ">=", round(mu - k * sigma, 3),
                      window_s, min_samples, scope)

# PaySetu payment success rate, last 8 Thursdays at 11am, 60s windows
history = [
    (99.84, 247), (99.91, 253), (99.78, 261), (99.86, 244),
    (99.82, 258), (99.79, 249), (99.88, 256), (99.83, 251),
]
h = derive_hypothesis(history, "payment_success_rate",
                      "shard-3:ap-south-1", window_s=60, min_samples=200)
print(f"hypothesis: {h.metric} {h.operator} {h.threshold}%")
print(f"window: {h.window_seconds}s, min_samples: {h.min_samples}")
print(f"scope: {h.scope}")
print(f"derived from mean={statistics.mean([v for v,_ in history]):.3f}, "
      f"sigma={statistics.stdev([v for v,_ in history]):.3f}")
Output:
hypothesis: payment_success_rate >= 99.707%
window: 60s, min_samples: 200
scope: shard-3:ap-south-1
derived from mean=99.839, sigma=0.044
Walking through it: history_buckets is the list of past 60-second windows at the comparable hour/day — the assumption is that the chaos experiment will run at the same hour-of-day, so the comparable distribution is the same hour-of-day across the past 8 weeks, not the 28-day rolling average. min_samples filters out windows that did not see enough traffic — if Thursday morning had a holiday with 50 attempts, that window's success rate is a noisy outlier and should not contribute to the threshold. k=3.0 is the σ-multiplier; tightening to k=2.0 would raise the threshold to 99.751%, and loosening to k=4.0 would lower it to 99.663%. The output threshold of 99.707% is tighter than the SLO of 99.5% by a meaningful margin. Why: the SLO is what the business considers a contractual floor; the hypothesis is what the system normally does. The gap between those two numbers — about 0.21 percentage points here — is exactly the headroom the experiment is supposed to interrogate. If injecting a fault drops the rate from 99.84% to 99.6%, the SLO is fine but the system has lost its margin, and that's the leak the experiment is meant to expose. Re-deriving the threshold is itself a job worth automating: the production hypothesis should be re-computed weekly from the most recent comparable windows, because the metric's distribution drifts as traffic and code change.
When the metric is latency, not a success rate
Success rates are simple: there's a clean Gaussian-ish distribution and 3σ does the right thing. Latencies are harder. p99 latency is bounded below by physics and unbounded above by GC pauses and tail events. The standard deviation of p99 across windows is often dominated by the rare worst windows, so 3σ produces a threshold that's looser than reality. For latency, switch from σ-based bounds to percentile-of-percentile bounds.
The pattern: take the per-window p99s for the last 8 weeks at the comparable hour, then take the p95 of those values as the threshold. That is, "the experiment passes if the in-experiment p99 is no worse than the historical p95 of p99s". The double-percentile sounds fussy but it captures the right thing: the system normally has a p99 of 180ms; in the worst 5% of comparable windows the p99 climbs to 240ms; the experiment is allowed to push p99 up to 240ms before the hypothesis is falsified. Anything looser than that is indistinguishable from a bad day, and anything tighter alarms on benign tail variation.
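A sketch of the double-percentile derivation, using a nearest-rank percentile over historical per-window p99s; the millisecond values are illustrative, not any team's real history:

import math

def p95_of_p99s(per_window_p99s):
    """Nearest-rank 95th percentile of historical per-window p99 latencies (ms)."""
    vals = sorted(per_window_p99s)
    idx = math.ceil(0.95 * len(vals)) - 1   # nearest-rank index
    return vals[idx]

# p99 latency (ms) of past comparable windows -- illustrative numbers
p99_history = [182, 176, 191, 188, 240, 179, 205, 186, 174, 198, 183, 221]
print(f"hypothesis: in-experiment p99 <= {p95_of_p99s(p99_history)}ms")

With these illustrative numbers the routine p99 sits near 180ms and the threshold lands at 240ms, the same shape as the worked example in the paragraph above.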
CricStream's video-start latency hypothesis during their first chaos experiment, for context: start_latency_p99 <= 380ms over a rolling 90-second window with at least 1000 starts, scoped to ap-south-1 mobile clients. The 380ms came from p95-of-p99s across 12 prior weekday-evening windows. The infrastructure team had wanted to set it at 250ms (the SLO target) — that would have failed almost every benign chaos run because evening tail latency under normal CDN-cache-miss conditions already touches 300ms. The team that builds the hypothesis from observed normal, not aspirational normal, is the team whose chaos programme survives past quarter two.
The deeper move, when the metric is heavy-tailed, is to measure the same metric at multiple percentiles and write a hypothesis on each. CricStream's mature configuration carries three latency hypotheses per experiment: p50 within ±10ms of historical p50, p95 within ±25ms, and p99 within ±60ms. The asymmetric tolerances reflect that medians are stable and tails are not — a fault that pushes p50 by 10ms is genuinely unusual; a fault that pushes p99 by 30ms is statistically routine. Writing one hypothesis per percentile costs three configuration lines and catches three different classes of degradation that a single p99 hypothesis would conflate.
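One way to express that three-percentile configuration, reusing the Hypothesis dataclass from the derivation code earlier; the window, sample floor, and scope come from the CricStream hypothesis quoted above, the ±10/±25/±60ms tolerances are the ones in this paragraph, and the baseline latencies are illustrative placeholders:

def band_hypotheses(metric, center_ms, tol_ms, window_s, min_samples, scope):
    """Express 'within +/- tol of the historical value' as two one-sided hypotheses."""
    return [
        Hypothesis(metric, ">=", round(center_ms - tol_ms, 1), window_s, min_samples, scope),
        Hypothesis(metric, "<=", round(center_ms + tol_ms, 1), window_s, min_samples, scope),
    ]

# baseline latencies (ms) are illustrative; tolerances are the ones quoted above
latency_hypotheses = (
      band_hypotheses("start_latency_p50",  62.0, 10.0, 90, 1000, "ap-south-1:mobile")
    + band_hypotheses("start_latency_p95", 210.0, 25.0, 90, 1000, "ap-south-1:mobile")
    + band_hypotheses("start_latency_p99", 310.0, 60.0, 90, 1000, "ap-south-1:mobile")
)
for h in latency_hypotheses:
    print(f"{h.metric} {h.operator} {h.threshold}ms over {h.window_seconds}s, scope {h.scope}")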
Why the time window matters as much as the threshold
The window length is a parameter most engineers leave at "5 minutes" because that's the experiment duration. That's wrong. The window must be short enough to catch the fault's signal but long enough to have statistical meaning.
A window of 1 second on a 200-rps service contains only 200 samples — at a true success rate of 99.7%, the sampling noise alone is roughly ±0.4 percentage points (one standard error), so a real dip is hard to distinguish from chance. A window of 60 seconds contains 12,000 samples — roughly ±0.05 percentage points, narrow enough that real degradation stands out. A window of 5 minutes contains 60,000 samples but smears the signal: if the fault degrades the rate for 90 seconds and recovers, the 5-minute window averages it away to "barely perceptible".
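Those noise figures fall straight out of the binomial standard error; a back-of-envelope check you can rerun with your own traffic rate (the 200 rps and the 99.7% true rate are the numbers from this paragraph):

import math

def success_rate_stderr_pp(p, window_s, rps):
    """One standard error, in percentage points, of a success rate p measured
    over window_s seconds at rps requests per second."""
    n = window_s * rps
    return 100 * math.sqrt(p * (1 - p) / n)

for window_s in (1, 60, 300):
    se = success_rate_stderr_pp(0.997, window_s, rps=200)
    print(f"{window_s:>4}s window ({window_s * 200:>6} samples): ~±{se:.3f} pp of sampling noise")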
The rule of thumb that holds across most chaos programmes: window length should be 1× to 2× the expected detection-and-mitigation time of the fault. If the fault's expected impact lasts 30 seconds (e.g. a single-pod restart with a 30-second pod-startup time), the window is 30–60 seconds. If the fault's expected impact lasts 5 minutes (e.g. a slow leak in a connection pool), the window is 5–10 minutes. Picking the window arbitrarily — same length as the experiment — is the third-most-common mistake in this whole framework.
Common confusions
- "The hypothesis is the same as the SLO." The SLO is a 28-day floor; the hypothesis is what the system normally does in this hour. The hypothesis is always tighter than the SLO — the gap between them is the resilience headroom the experiment is testing.
- "Infrastructure metrics are fine for the hypothesis." CPU at 80% might be fine or terrible — it depends on what users see. Always phrase the hypothesis in terms of a customer-visible signal: success rate, end-to-end latency, throughput. Infra metrics are for diagnosing why the hypothesis was falsified, not for the hypothesis itself.
- "You only need one hypothesis per experiment." Most useful experiments have 2–4 hypotheses: one on the primary success metric, one on tail latency, one on a downstream-dependent metric, sometimes one on customer-facing error budget burn. They are evaluated independently; falsifying any one fails the experiment.
- "A passing hypothesis means the fault did nothing." Passing means the system held within its normal band — the fault may still have caused real internal stress that didn't surface in the chosen metric. The hypothesis is a necessary but not sufficient signal of resilience.
- "You can write the hypothesis after the experiment." No — that is p-hacking. The hypothesis is locked before injection; if you reach for "let me adjust the threshold a bit" mid-experiment, the result tells you nothing about the system and everything about the experimenter.
- "More hypotheses are always better." Above 4–5 you start chasing ghost metrics that flap on every run for unrelated reasons, and the chaos programme becomes a flaky-test-suite. Pick the metrics that genuinely encode "the system is doing its job".
Going deeper
The Netflix derivation: from steady state to "informative experiment"
The Netflix chaos team's framing — captured in Basiri et al.'s 2016 IEEE Software article and the 2017 O'Reilly book — defines the steady state as the metric distribution under normal load and a useful experiment as one whose hypothesis is sufficiently tight that an undetected-by-monitoring failure would falsify it. Crucially, the team treats the hypothesis as a hypothesis in the Popperian sense: it must be falsifiable in the experiment window. A hypothesis like "user satisfaction stays high" is unfalsifiable; "stream-starts-per-second stays within ±5% of the median of the last 12 comparable hours" is. The discipline of writing falsifiable hypotheses is half the value of the practice — engineers who train on it learn to write better SLOs, alerts, and post-mortem questions across the rest of their work.
When the metric has a discontinuous distribution (cricket-final spikes)
CricStream's stream-starts metric has a heavy modality problem: during a cricket final, stream-starts go from 80k/s to 4 lakh/s in under 45 seconds and stay there for 4 hours. The historical distribution looks bimodal: one cluster around steady-state evening traffic, another cluster around event traffic. A naive σ-based threshold fitted across the whole 8-week history is useless — it splits the difference and matches neither mode. The fix is regime-aware hypothesis derivation: detect the operating regime first (if start_rate > 200000: regime = "event"; else: regime = "normal"), then derive a threshold from the matching cluster. Why: a single Gaussian poorly fits a bimodal distribution; the threshold derived under that bad fit will reject benign event-traffic windows and accept benign normal-traffic windows, getting the false-positive and false-negative rates wrong on both sides. A two-mixture Gaussian fit (or a simple regime classifier from the rate alone) recovers the right threshold per regime.
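A sketch of that regime split; the 200,000 starts-per-second cut-off is the rule quoted above, derive_hypothesis is the function from the PaySetu example, and each history entry carries the window's start rate alongside the usual (value, sample_count) pair:

EVENT_RATE_THRESHOLD = 200_000   # starts/sec; the regime rule quoted above

def classify_regime(start_rate):
    return "event" if start_rate > EVENT_RATE_THRESHOLD else "normal"

def derive_regime_hypotheses(history, metric, scope, window_s, min_samples):
    """history: list of (start_rate, window_value_pct, sample_count) tuples.
    Splits the history by regime, then derives one threshold per regime."""
    by_regime = {"normal": [], "event": []}
    for start_rate, value, count in history:
        by_regime[classify_regime(start_rate)].append((value, count))
    return {
        regime: derive_hypothesis(buckets, metric, scope, window_s, min_samples)
        for regime, buckets in by_regime.items()
        if len(buckets) >= 8   # only derive where there is enough comparable history
    }

At experiment time the current start rate selects which per-regime hypothesis applies, so an event-traffic window is never judged against the normal-traffic threshold.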
Hypothesis testing on long-tailed metrics — bootstrap intervals
When the metric is something like p99 checkout latency, the underlying per-request latency distribution is heavy-tailed (often log-normal or worse). The mean and standard deviation of the per-window p99 are not reliable estimators because a single bad window — one GC pause hitting one node — pulls the mean up and the stdev with it. Bootstrap confidence intervals are the right tool: resample the historical windows with replacement, compute the per-resample p99, and use the 5th and 95th percentiles of that distribution as the hypothesis bounds. The bootstrap captures the empirical sampling distribution without making distributional assumptions, and the resulting bounds are robust to single-window outliers in the historical data.
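A minimal bootstrap sketch over the same shape of data as the latency example (a list of historical per-window p99s). The per-resample statistic is parameterised; the default shown is the median, and you would swap in whichever summary your hypothesis actually tests. The 10,000 resamples is a conventional choice, not a requirement:

import random
import statistics

def bootstrap_bounds(window_values, stat=statistics.median, resamples=10_000, seed=7):
    """Resample the historical per-window values with replacement, apply `stat`
    to each resample, and return the 5th/95th percentiles of the resulting
    distribution as (lower, upper) hypothesis bounds."""
    rng = random.Random(seed)
    stats = sorted(
        stat(rng.choices(window_values, k=len(window_values)))
        for _ in range(resamples)
    )
    return stats[int(0.05 * resamples)], stats[int(0.95 * resamples)]

# same illustrative per-window p99s (ms) as the percentile-of-percentile sketch
p99_history = [182, 176, 191, 188, 240, 179, 205, 186, 174, 198, 183, 221]
low, high = bootstrap_bounds(p99_history)
print(f"hypothesis band: {low:.0f}ms to {high:.0f}ms")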
The "frozen for 28 days" anti-pattern
A common organisational drift: the SRE team writes the steady-state hypothesis once, sets it as the production threshold, and never re-derives it. Six months later the service has shifted — there's a new code path, an extra dependency, a 5% throughput-driven shift in p99. The old threshold is now wrong: too tight in some dimensions (alarms during normal operation) and too loose in others (passes during real degradation). Hypotheses must be re-derived on a cadence, ideally weekly, automatically, from the most recent comparable windows. PaySetu runs chaos-hypothesis-update as a Sunday-night cron and emails the SRE team if any threshold has shifted by more than 0.05 percentage points week-over-week — those shifts are signals worth investigating regardless of whether a chaos experiment is scheduled.
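A sketch of what that Sunday-night job does; chaos-hypothesis-update is the cron name from the text, the 0.05-percentage-point bar is the one quoted above, derive_hypothesis and Hypothesis come from the PaySetu example, and the print is a stand-in for whatever notification channel the team actually uses:

DRIFT_ALERT_PP = 0.05   # percentage points of week-over-week threshold shift

def weekly_update(previous, recent_buckets):
    """Re-derive the hypothesis from the most recent comparable windows and
    flag it for review if the threshold moved by more than DRIFT_ALERT_PP."""
    updated = derive_hypothesis(recent_buckets, previous.metric, previous.scope,
                                previous.window_seconds, previous.min_samples)
    drift = abs(updated.threshold - previous.threshold)
    if drift > DRIFT_ALERT_PP:
        # stand-in for the team's real channel (email, chat, ticket)
        print(f"ALERT: {previous.metric} threshold moved "
              f"{previous.threshold} -> {updated.threshold} ({drift:.3f} pp)")
    return updated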
What to do when the hypothesis flaps
A hypothesis that fires falsely on every third run is a liability — engineers stop trusting the chaos programme, runs get re-tried until they pass, and the practice quietly turns into theatre. When a hypothesis flaps, three diagnostics in order: first, check the sample-size floor — a window with 50 attempts will produce wider confidence intervals than 200 and will flap on tiny absolute changes. Second, check whether the metric has a regime the σ-fitting did not separate (event-day vs. normal-day, weekday vs. weekend); if so, switch to regime-aware derivation as in the CricStream subsection above. Third, check whether the metric itself is noisy at the chosen window length — sometimes a 60s window is just too short for that metric and a 180s window with the same threshold will be stable. Only after all three diagnostics fail should the threshold be loosened. Loosening a hypothesis to make it pass is the cardinal sin of chaos engineering; it converts a real signal-detection apparatus into wishful thinking. Why: the σ derivation already absorbs the metric pipeline's own variance because σ was measured through that pipeline — pipeline drop, retry double-counting, and reordering are all implicitly priced into the historical distribution, so loosening the threshold to "account for noise" double-counts the noise and lets real degradation through.
Where this leads next
A steady-state hypothesis is a number, but the practice of writing one well is a habit that bleeds into the rest of the chaos-engineering toolkit. The next two chapters in Part 19 build directly on it: blast radius and recovery (/wiki/blast-radius-and-recovery) is about scoping the fault so that even if the hypothesis is falsified, the damage is bounded; and the principles framework (/wiki/the-principles-netflix) puts steady-state at the top of the list of five tenets that distinguish chaos engineering from "just breaking things in production".
Beyond Part 19, the same hypothesis-from-historical-distribution pattern shows up in canary deploys (success-rate of the canary cohort vs. the control), in autoscaling triggers (load-based scale-up keyed off observed normal), and in alerting (3σ-derived alerts as the alternative to fixed thresholds). The reader who internalises the six-part hypothesis structure here will find themselves writing better alerts and better canary criteria within weeks, and that transfer is half the curriculum's value.
References
- Basiri, Ali et al. "Chaos Engineering." IEEE Software, 2016. — formalises steady-state hypothesis as the central abstraction.
- Rosenthal, Casey & Jones, Nora. "Chaos Engineering: System Resiliency in Practice." O'Reilly, 2020. Chapter 3 covers steady-state hypotheses end-to-end.
- Beyer, Betsy et al. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly, 2016. SLO and error-budget chapters set the context the hypothesis lives within.
- Allspaw, John. "Resilience Engineering: Where Do I Start?" — on the discipline of writing measurable claims about systems.
- Dean, Jeff & Barroso, Luiz André. "The Tail at Scale." CACM, 2013. — for why latency hypotheses need percentile-of-percentile thresholds.
- Hoorn, Andre van et al. "A Survey of Methods and Tools for Steady State Detection." — the statistical-engineering literature behind the practice.
- /wiki/the-principles-netflix — the five-tenet framework that puts steady state first.
- /wiki/blast-radius-and-recovery — the safety counterpart to the hypothesis.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# statistics is in the Python standard library; no extra installs needed
python3 -c "
import statistics
hist = [(99.84, 247), (99.91, 253), (99.78, 261), (99.86, 244),
(99.82, 258), (99.79, 249), (99.88, 256), (99.83, 251)]
samples = [v for v, n in hist if n >= 200]
mu, sigma = statistics.mean(samples), statistics.stdev(samples)
print(f'threshold (mu - 3*sigma): {mu - 3*sigma:.3f}%')"
The output, threshold (mu - 3*sigma): 99.707%, matches the article. Tighten or loosen by changing the σ-multiplier; swap in your own historical buckets to derive your own threshold.
A short checklist before you save the experiment form
Practical kit, learned from the same teams who learned this the hard way:
- The threshold came from a query you ran against last week's data, not from the SLO doc.
- The window is 1× to 2× the fault's expected impact duration, not the experiment duration.
- The sample-size floor is set so that a quiet hour cannot pass the hypothesis by accident.
- The metric is something a customer would notice changing, not something an infra dashboard would notice.
- There are between 1 and 4 hypotheses, and each one is independently falsifiable.
- The hypothesis was written before the fault was named — not after.
- There is a one-line rationale inline with each threshold — # 99.707 = mu(99.839) - 3*sigma(0.044), 8 wk Thu 11am — so the next engineer to read the file understands why this number and not the SLO number.
The form takes 90 seconds to fill in. The thinking takes an hour. The hour is the work.