Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
The principles (Netflix): the five tenets that defined chaos engineering
It is December 2014 and the Netflix Traffic team is staring at a blank document. Chaos Monkey has been running for four years and has prevented at least two dozen near-incidents that nobody outside the team knows about. Every external talk gets the same question: "isn't this just randomly killing servers?" And every internal review keeps re-litigating the same arguments — can we run this in production, should we, who decides. Casey Rosenthal, Lorin Hochstein, Ali Basiri, and a handful of others realise the problem is not the tool. The problem is that nobody has written down what chaos engineering is, separately from what Chaos Monkey does. They publish chaos.principles — five tenets, one page, no diagrams. Almost every chaos programme that fails over the next decade fails because it violates at least one of those five tenets, and the document is precise enough that the violation is detectable from outside.
The 2014 Principles of Chaos Engineering define five tenets: build a hypothesis around steady-state behaviour, vary real-world events, run experiments in production, automate experiments to run continuously, and minimise blast radius. The tenets sound obvious in isolation; they are violated everywhere in practice — most commonly by skipping the hypothesis, restricting experiments to staging, or running them quarterly instead of continuously. The principles are not a checklist; they are a definition. A programme that satisfies four of five is not "80% chaos engineering" — it is something else, with a different and weaker guarantee.
Why a manifesto, and why these five tenets
The principles document was not the first chaos engineering work at Netflix — Chaos Monkey shipped in 2010, the Simian Army by 2011, Chaos Kong by 2013. The principles came after four years of running experiments, watching what worked and what got copy-pasted into other organisations and broke. Every tenet in the document corresponds to a failure mode the Netflix team had already seen in adopters. The tenets are descriptive, not aspirational — they describe what the successful chaos programmes looked like.
The five tenets, ordered as in the original document:
- Build a hypothesis around steady-state behaviour. State, in advance, the measurable property the system must maintain — not "is fine", but "merchant-write success rate stays ≥ 99.5% over any rolling 5-minute window".
- Vary real-world events. Inject the failures the system actually faces in production — network partitions, slow disks, dependency timeouts, certificate expirations — not just the ones convenient to test.
- Run experiments in production. Staging is a different system; only production has the real traffic, the real dependencies, the real failure surface.
- Automate experiments to run continuously. A one-shot experiment tests the system as it was on the day of the experiment. Drift defeats one-shot tests within months.
- Minimise blast radius. Every experiment is the smallest one that would falsify the hypothesis — bounded population, bounded duration, one-click rollback.
Why these five and not others: the tenets correspond to the five dimensions across which a chaos experiment can degrade into theatre. Drop the hypothesis and you cannot fail; drop real-world events and you test only the easy things; drop production and you test a different system; drop continuity and you test a snapshot; drop blast-radius minimisation and the experiment becomes too dangerous to run, so it does not get run. Each tenet plugs one specific way the discipline collapses into something safer-feeling but useless.
Tenet 1 in depth — what counts as a steady-state hypothesis
The first tenet is the one most often skipped, because it requires more thinking than any of the others. A steady-state hypothesis must be:
- Measurable. Backed by a metric the team already trusts — usually an SLO or SLI. "Success rate", "p99 latency", "queue depth", "replication lag".
- Quantitative. Has numbers. Not "stays healthy" but "stays ≥ 99.5%".
- Windowed. Has a time window — "over any rolling 5-minute window" — because instantaneous metrics are too noisy to falsify against.
- Pre-stated. Written down before the experiment runs. After-the-fact rationalisation ("well, the dashboard didn't go red, so we passed") is the most common failure mode of every empirical discipline, not just chaos engineering.
- Falsifiable. It must be possible for the experiment to violate it. A hypothesis that no realistic experiment could falsify is not a hypothesis; it is a definition.
A worked PaySetu example. The team that owns the merchant-write API has the SLO "99.9% of write requests succeed within 200ms over a 30-day window". That SLO is too long-windowed for a 90-second experiment — the SLO budget is consumed gradually, and a single experiment's contribution is below noise. The chaos hypothesis derives a short-window version: "over any rolling 5-minute window during the experiment, success rate stays ≥ 99.5% and p99 stays ≤ 400ms". The relaxed thresholds are the cost of the shorter window — accept more variance because the sample is smaller. The hypothesis is still a demanding test, because it requires near-SLO behaviour while a fault is actively injected: if the system cannot stay near its SLO during a controlled fault, it certainly cannot during an uncontrolled one.
Why the relaxation matters: the rolling 5-minute window has fewer samples, so the variance of the success-rate estimate is higher. Demanding 99.9% in a 5-minute window during a partial fault would auto-abort the experiment on the first noise burst, even when the system is fine. The relaxed threshold is calibrated to "what does the metric look like during normal operation in a 5-minute window" — and the abort threshold is set just below that, so the abort fires only when the system is genuinely degrading, not when it is briefly noisy.
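To make that calibration concrete, here is a minimal sketch of deriving the short-window abort threshold from baseline data. The margin value, sample figures, and function name are illustrative assumptions, not PaySetu's actual tooling — the point is only that the threshold comes from measured normal-operation noise, not from a round number someone liked.
# derive_hypothesis_threshold.py — illustrative sketch, not PaySetu's actual tooling.
# Given success rates observed over rolling 5-minute windows during normal operation,
# pick an abort threshold just below the metric's normal floor, so the experiment
# aborts on genuine degradation rather than on ordinary short-window noise.
NOISE_MARGIN = 0.002   # how far below normal-operation behaviour to tolerate (assumed value)

def derive_threshold(baseline_5m_rates):
    # The worst 5-minute window seen during quiet operation tells us how noisy the
    # short window is even when nothing is wrong; the abort threshold sits just below it.
    normal_floor = min(baseline_5m_rates)
    return round(normal_floor - NOISE_MARGIN, 4)

# Example: a quiet week's rolling 5-minute success rates dip to 0.9972 at their worst,
# so the derived hypothesis is "success rate stays >= 0.9952 over any rolling 5-minute window".
baseline = [0.9998, 0.9991, 0.9987, 0.9994, 0.9972, 0.9989, 0.9996]
print(derive_threshold(baseline))   # -> 0.9952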
Tenet 2 in depth — varying real-world events, not convenient ones
The second tenet says: inject the failures the system actually faces, not the failures easiest to inject. The failure surface of a real distributed system is dominated by:
- Network partitions — partial, asymmetric, intermittent. Not "one server is down" but "server A can reach B but not C, while D can reach C but not B, for 47 seconds".
- Slow dependencies — the call returns, but in 4.2 seconds instead of 40ms. This is far harder to handle than a hard failure, because most timeout configurations are too generous.
- DNS and certificate failures — DNS TTLs expire at the worst time; certificates rotate; intermediate CAs become unreachable.
- Disk failures, partial — the disk is up but slow, or returns occasional EIO, or fills up. Not "disk is dead" but "disk takes 800ms for an fsync that usually takes 4ms".
- Resource exhaustion — file descriptors, sockets, memory, connection pools — all of which can be exhausted without the process crashing.
- Clock skew — a server's clock jumps forward 3 seconds during NTP correction, breaking every TTL-based caching layer for one tick.
- Dependency outages — a managed service the application relies on (object store, secrets manager, external API) returns 5xx for 90 seconds.
The seductive trap is to inject only the failures the tooling makes easy: tc netem makes latency easy, the AWS API makes instance kills easy, Kubernetes makes pod evictions easy. None of these are the failures that cause the worst outages. CricStream's worst incident in 2025 was not a server crash — it was a 600ms intermittent slowdown from one CDN POP that triggered retries from 4 million viewers, which triggered connection-pool exhaustion at the origin, which cascaded to the stream control plane. Inject only "kill instance" and you never test that path.
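One way to resist the trap is to keep the fault catalogue as data, so the gaps show up in a coverage check instead of hiding behind whichever tool is handy. A sketch of what that might look like, assuming Linux tc-netem shell commands for the two easy entries — the fault names, the eth0 interface, and the required-category list are illustrative assumptions, not any team's actual catalogue.
# fault_catalogue.py — illustrative sketch: the catalogue as data, so coverage gaps are visible.
# Injection commands assume Linux tc-netem on the target host; adapt to your own tooling.
import subprocess

CATALOGUE = {
    "latency_200ms": {
        "category": "slow_dependency",
        "inject":   ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "200ms", "50ms"],
        "rollback": ["tc", "qdisc", "del", "dev", "eth0", "root"],
    },
    "packet_loss_2pct": {
        "category": "network_partition",
        "inject":   ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "loss", "2%"],
        "rollback": ["tc", "qdisc", "del", "dev", "eth0", "root"],
    },
    # The hard-to-inject categories — DNS/cert failures, slow fsync, fd exhaustion,
    # clock skew, dependency 5xx — belong here too; an empty category is a gap, not a pass.
}

REQUIRED_CATEGORIES = {
    "network_partition", "slow_dependency", "dns_or_cert", "partial_disk",
    "resource_exhaustion", "clock_skew", "dependency_outage",
}

def coverage_gaps():
    covered = {f["category"] for f in CATALOGUE.values()}
    return sorted(REQUIRED_CATEGORIES - covered)

def inject(name):
    subprocess.run(CATALOGUE[name]["inject"], check=True)

def rollback(name):
    subprocess.run(CATALOGUE[name]["rollback"], check=True)

if __name__ == "__main__":
    print("uncovered fault categories:", coverage_gaps())
A coverage check like this turns tenet 2 into something reviewable: the list of uncovered categories is the list of outages the programme cannot yet say anything about.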
The escalation ladder — how blast radius grows in disciplined steps
A programme that runs only the smallest possible experiment forever learns nothing about how the system behaves under larger faults. A programme that jumps to the largest fault on day one breaks production. The principles' fifth tenet implies — and Netflix's later writing makes explicit — an escalation ladder: each level of fault is attempted only after experiments at the previous level have confirmed the system tolerates it.
The ladder is also a staffing tool. A team that can comfortably run L1–L3 weekly may not yet be ready for L6 — not because the technology is missing, but because the on-call rotation, the postmortem culture, and the dependency graph are not. AutoGo's ride-hail platform spent eighteen months at L1–L4 before its first L5 experiment, because every L4 experiment surfaced a different cascading-failure pattern that needed fixing first. The ladder turned out to be a roadmap of what to fix, in priority order, sorted by what the experiments themselves revealed.
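The levels themselves are system-specific, but the gating rule is general: no promotion until the current level has repeatedly failed to falsify the hypothesis, and any falsification resets progress. A minimal sketch of that rule follows — the level names and pass counts are hypothetical, not AutoGo's actual ladder.
# escalation_ladder.py — illustrative sketch of the gating rule; level names and
# pass counts are hypothetical, not any real team's ladder.
from dataclasses import dataclass

@dataclass
class Level:
    name: str                  # e.g. "L1: one instance, 0.1% traffic"
    passes_required: int       # consecutive non-falsified runs before promotion
    consecutive_passes: int = 0

@dataclass
class Ladder:
    levels: list
    current: int = 0

    def record(self, hypothesis_held):
        level = self.levels[self.current]
        if not hypothesis_held:
            # A falsified hypothesis resets the level and blocks promotion until the
            # finding is fixed and re-verified — the ladder never skips ahead.
            level.consecutive_passes = 0
            return f"stay at {level.name}: hypothesis falsified, fix before retrying"
        level.consecutive_passes += 1
        if (level.consecutive_passes >= level.passes_required
                and self.current < len(self.levels) - 1):
            self.current += 1
            return f"promote to {self.levels[self.current].name}"
        return f"stay at {level.name} ({level.consecutive_passes}/{level.passes_required} passes)"

ladder = Ladder([
    Level("L1: one instance, 0.1% traffic", passes_required=4),
    Level("L2: one dependency slowed by 200ms", passes_required=4),
    Level("L3: one dependency hard-down", passes_required=6),
])
print(ladder.record(hypothesis_held=True))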
A worked example: a steady-state hypothesis enforcer in Python
The hardest principle to operationalise is the automatic steady-state check during the experiment. The script below is the harness PaySetu uses to wrap any chaos experiment: it polls a Prometheus metric every two seconds, evaluates the hypothesis on a rolling window, and triggers the experiment's rollback callback if the hypothesis is violated. The same script runs in production for 14 different experiments; only the metric query, threshold, and fault-injection callback differ.
# steady_state_enforcer.py
# Rolling-window hypothesis check with auto-rollback for chaos experiments.
# Designed for one-experiment-per-process; PaySetu runs ~14 of these on schedule.
import time, statistics, urllib.request, urllib.parse, json, sys
from collections import deque
from datetime import datetime

PROM_URL = "http://prometheus.paysetu.internal:9090"
METRIC = ('sum(rate(merchant_write_requests_total{status="2xx"}[1m]))'
          ' / sum(rate(merchant_write_requests_total[1m]))')
WINDOW_S = 300            # 5-minute rolling window
THRESHOLD = 0.995         # success rate must stay >= 99.5%
POLL_S = 2
EXPERIMENT_DURATION_S = 90

def query_prom(q):
    # Synchronous instant query. Returns None when Prometheus has no data or is
    # unreachable, so a degraded metric pipeline skips the sample rather than
    # counting as a violation.
    url = f"{PROM_URL}/api/v1/query?{urllib.parse.urlencode({'query': q})}"
    try:
        with urllib.request.urlopen(url, timeout=4) as r:
            data = json.load(r)
    except OSError:
        return None
    result = data["data"]["result"]
    if not result:
        return None
    return float(result[0]["value"][1])

def hypothesis_holds(samples, threshold):
    # Hypothesis: median success rate over the rolling window >= threshold.
    # Median (not mean) because a single bad sample shouldn't tank the test.
    if len(samples) < 5:
        return True  # too few samples to falsify
    return statistics.median(s for _, s in samples) >= threshold

def run(inject, rollback):
    samples = deque()  # (timestamp, success_rate)
    inject()
    print(f"[{datetime.utcnow().isoformat()}] fault injected; enforcing hypothesis")
    t0 = time.time()
    aborted = False
    try:
        while time.time() - t0 < EXPERIMENT_DURATION_S:
            now = time.time()
            v = query_prom(METRIC)
            if v is not None:
                samples.append((now, v))
                # Trim to window
                while samples and samples[0][0] < now - WINDOW_S:
                    samples.popleft()
                ok = hypothesis_holds(samples, THRESHOLD)
                print(f"  t={now-t0:5.1f}s success={v:.4f} "
                      f"window_n={len(samples)} hypothesis={'OK' if ok else 'VIOLATED'}")
                if not ok:
                    aborted = True
                    print(f"  ABORT — rolling success below {THRESHOLD}")
                    break
            time.sleep(POLL_S)
    finally:
        rollback()
    verdict = "violated" if aborted else "held"
    print(json.dumps({"hypothesis": verdict,
                      "samples": len(samples),
                      "duration_s": round(time.time() - t0, 1)}))
    sys.exit(1 if aborted else 0)

if __name__ == "__main__":
    # Example: hooks for tc-netem-based partition injection
    from chaos_partition_experiment import inject_latency, rollback
    run(inject_latency, rollback)
Sample run during a real experiment (200ms latency injected into one replica):
[2026-04-28T11:47:02] fault injected; enforcing hypothesis
t= 0.1s success=0.9994 window_n=1 hypothesis=OK
t= 4.2s success=0.9991 window_n=3 hypothesis=OK
t= 12.4s success=0.9982 window_n=7 hypothesis=OK
t= 28.6s success=0.9978 window_n=15 hypothesis=OK
t= 60.1s success=0.9971 window_n=31 hypothesis=OK
t= 88.9s success=0.9968 window_n=45 hypothesis=OK
{ "hypothesis": "held", "samples": 45, "duration_s": 90.0 }
Walkthrough: query_prom() is a synchronous Prometheus instant query for the success-rate ratio metric. It returns None when Prometheus has no data or cannot be reached — usually a sign that the metric pipeline is itself degraded — in which case the harness skips the sample rather than counting it as a violation. hypothesis_holds() uses the median over the rolling window, not the mean, because the median is robust to a single bad sample (a 1-second burst of errors during a deploy, say) while still catching sustained degradation. The len(samples) < 5 guard prevents the experiment from aborting in the first ten seconds, before enough data has accumulated. run() is the harness loop: inject, poll every POLL_S seconds, evaluate, abort on violation, always roll back (the finally clause runs even on Ctrl-C or an exception).
Why the median and not the mean: a single 0.0 sample (Prometheus returning 0 for one second because the metric was momentarily empty) would tank the mean and falsely abort the experiment. The median tolerates up to half the samples being garbage, which is exactly what you want for a metric pipeline that itself has occasional gaps. The asymmetry — robust to false positives, sensitive to true degradation — is what makes the hypothesis test usable in production rather than just theoretically correct.
The exit code is the contract with the orchestrator: 0 means hypothesis held (next experiment may proceed); 1 means hypothesis violated (alert, postmortem, freeze further experiments on this surface until investigated). PaySetu's chaos scheduler uses the exit code as the gate for the next experiment in sequence — a violated experiment automatically pauses the entire schedule until a human acknowledges it.
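A sketch of what that gate can look like on the orchestrator side — the experiment list is illustrative, and the --experiment flag is a hypothetical extension of the enforcer above, not something the script currently parses:
# chaos_schedule_gate.py — illustrative: run experiments in sequence and freeze the
# schedule on the first violated hypothesis (non-zero exit code) until a human acks it.
import subprocess
import sys

EXPERIMENTS = [  # each command wraps one fault with the steady-state enforcer
    ["python3", "steady_state_enforcer.py", "--experiment", "replica-latency-200ms"],
    ["python3", "steady_state_enforcer.py", "--experiment", "dependency-timeout"],
]

for cmd in EXPERIMENTS:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Exit code 1 = hypothesis violated: stop here, page a human, run nothing
        # further against this surface until the finding is acknowledged.
        print(f"schedule frozen: {' '.join(cmd)} violated its hypothesis", file=sys.stderr)
        sys.exit(1)
print("all hypotheses held; schedule complete")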
Tenets 3, 4, and 5 in tighter focus
The first two tenets get most of the attention because they are the most often skipped. The remaining three are the ones that quietly get half-implemented — the programme nominally satisfies them, but the implementation undercuts the guarantee.
Tenet 3 — production. "Production" is not a binary; it is a spectrum. PaySetu's production includes the merchant-write API serving real merchants, but also a "pre-production" stack with synthetic traffic that mirrors real load shapes. The synthetic stack is more permissive — the team runs more aggressive faults there before promoting them to the real stack. The principle is satisfied as long as some fraction of experiments runs against real traffic with real dependencies; running zero experiments against real production is the violation, not the existence of a tiered escalation. CricStream uses a four-tier ladder: synthetic stack → 0.1% real-traffic shadow → 1% canary → full production, with an experiment promoted upward only after passing the previous tier without falsifying the hypothesis.
Tenet 4 — automate to run continuously. The thing automated is not the fault; it is the schedule. A programme where engineers manually run a fault every Tuesday during a 30-minute "chaos slot" is more continuous than one where a fully automated harness runs only when someone remembers to trigger it. The automation that matters is the cron + on-call ack + auto-rollback chain — the part that ensures the experiment runs even when the team is busy. KapitalKite's brokerage-trading platform runs nine experiments on a weekly schedule: each Monday at 14:00 IST, 30 minutes before the experiment, the platform notifies the on-call engineer; the engineer either acks (the default — the experiment runs at 14:30) or defers (a one-week postponement). After two consecutive defers, the platform alerts the engineering manager — the friction is deliberate, because a continuously deferred experiment is a violated tenet.
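A sketch of the defer-counter logic behind that workflow — the notification hooks are hypothetical stand-ins for whatever paging or chat integration is in use, and the counts mirror the KapitalKite example above rather than describing its real platform:
# chaos_ack_defer.py — illustrative sketch of the weekly ack/defer gate.
MAX_CONSECUTIVE_DEFERS = 2

def notify_oncall(msg):
    print(f"[on-call] {msg}")      # hypothetical hook; a real system would page or DM

def notify_manager(msg):
    print(f"[eng-manager] {msg}")  # hypothetical hook

def weekly_gate(experiment, oncall_response, consecutive_defers):
    # oncall_response: "ack" (the default) or "defer". Returns (run_now, new_defer_count).
    if oncall_response == "defer":
        consecutive_defers += 1
        notify_oncall(f"{experiment} deferred by one week ({consecutive_defers} in a row)")
        if consecutive_defers >= MAX_CONSECUTIVE_DEFERS:
            # Repeated deferral is the tenet being violated in slow motion;
            # escalate so the deferral becomes a decision someone owns.
            notify_manager(f"{experiment} has now been deferred {consecutive_defers} weeks running")
        return False, consecutive_defers
    notify_oncall(f"{experiment} acknowledged; running at 14:30 IST")
    return True, 0   # an ack resets the defer counter

run_now, defers = weekly_gate("dependency-timeout", "defer", consecutive_defers=1)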
Tenet 5 — minimise blast radius. The non-obvious part is that minimising blast radius is not the same as minimising risk. A 0.5% blast-radius experiment that runs weekly produces 50 experiments a year against the real system; an annual "kill the entire region" exercise produces one experiment a year, with vastly higher risk per execution. The cumulative learning from 50 small experiments dominates the learning from one large one, because each small experiment falsifies a different hypothesis and the failures interact. The principle is asking you to make experiments small enough that you can run many of them, not small enough to be safe in isolation.
Common confusions
- "Chaos engineering is the same as chaos testing or chaos QA" — no. Chaos testing runs in CI against a known set of injected failures, like a unit test. Chaos engineering runs in production with hypotheses about emergent behaviour the team did not pre-enumerate. The first answers "does the code I just wrote handle these specific faults?"; the second answers "what does the running system actually do under partial failure?". Both are useful; only the second is what the principles document refers to.
- "Steady-state hypothesis is the same as an SLO" — related but distinct. SLOs are long-windowed (30-day, quarterly) success-rate or latency targets, computed against the full traffic distribution. Steady-state hypotheses are short-windowed (1- to 10-minute) thresholds tuned to the experiment's duration, often relaxed from the SLO because short windows have higher variance. SLOs answer "is the product meeting its commitment?"; hypotheses answer "is the experiment causing observable harm?".
- "Running in production means running on every customer" — production means real traffic, real dependencies, and the real failure surface — but the blast radius (tenet 5) restricts which fraction of that surface participates. A 0.1%-blast-radius experiment is in production by every definition that matters, while affecting fewer users than most internal demos.
- "You need Chaos Monkey to do chaos engineering" — Chaos Monkey is one tool, narrow in vocabulary (kills VMs at random within a service group). The principles cover every fault type the system faces; instance kills are at most 10% of the experiment surface. Adopting Chaos Monkey without adopting the principles is cargo-culting; adopting the principles without ever running Chaos Monkey is fine.
- "Continuous means hourly" — continuous means on a regular cadence such that drift cannot accumulate undetected. For a fast-moving service that ships 30 deploys a day, continuous might be hourly. For a slow-moving service with quarterly releases, continuous might be weekly. The criterion is "do experiments run often enough that the system never sits in an untested configuration for long?", not a fixed clock.
- "Falsifiable means easy to falsify" — a hypothesis is falsifiable if some realistic experiment outcome would prove it wrong. A good hypothesis is one the system reliably satisfies under normal conditions and a controlled fault is at the boundary of falsifying. A hypothesis that is trivially falsified (the threshold is too tight) is broken; one that cannot be falsified by any realistic fault is also broken.
Going deeper
The history before 2014 — why a manifesto was needed
By 2013 Netflix had Chaos Monkey, Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10–18 Monkey, and Chaos Kong. Adopters outside Netflix were copying the tools but not the practices — running Chaos Monkey in staging only, with no hypothesis, on a rare schedule, with no rollback. The conferences in 2013–2014 (ChaosConf precursors, Velocity tracks) kept showing the same pattern: organisations would announce they had "started chaos engineering", run for two months, hit one bad incident caused by a botched experiment, and shut it all down. The principles document was, in part, a diagnostic — it gave the early adopters a vocabulary to point at what had gone wrong rather than blaming "chaos" wholesale. The five tenets, distilled from four years of observing what worked, formed the contract between the discipline and its practitioners: do all five, or you are doing something else.
Falsifiability as the load-bearing idea
Karl Popper's criterion — that scientific claims must be falsifiable — is the philosophical spine of the principles document, even though it is never named. The first tenet's emphasis on hypotheses, the third tenet's insistence on production (because staging cannot falsify production claims), the fourth tenet's continuity (because today's claim becomes tomorrow's stale guess) — all are forms of "the claim about your system's resilience must be empirically tested, repeatedly, against the real surface". Most engineering disciplines admit this in principle and avoid it in practice, because falsification is uncomfortable. The principles make falsification non-negotiable. This is also why the document survives — it does not bind itself to specific technologies (no mention of EC2, Kubernetes, AWS, even Linux) and so its relevance has not decayed in eleven years.
How the principles map to other empirical disciplines
The closest analogy is clinical trial design: hypothesis (efficacy claim), real-world events (real patients, not lab animals), production (the actual disease, not in-vitro), continuity (longitudinal monitoring), blast radius (Phase 1 / Phase 2 / Phase 3 escalation). Engineers occasionally find this framing helpful when explaining chaos engineering to non-engineers — "we run controlled trials against our system the way pharma runs trials against patients" lands far better with executives than "we break things in production". The framing is also useful for designing experiments: clinical trial methodology has fifty years of insight into stopping rules, blinding, sample size — most of which transfers, mutatis mutandis, to chaos design.
Where the principles are silent
Three modern problems the 2014 document does not address. First, multi-tenant cloud platforms: the document assumes the system under test is the only thing on the infrastructure; modern systems share kernels, networks, and noisy neighbours. Second, cross-organisation experiments: a chaos experiment whose fault propagates through an external dependency (a payment gateway, a national ID auth system, a CDN) hits a blast-radius wall the principles do not enumerate, because in 2014 the assumption was that you owned everything in your stack. Third, machine-learning systems: ML pipelines have non-deterministic outputs, drift over time even without code changes, and have steady-state behaviour that is itself a distribution rather than a point — none of which the principles directly cover. The successor literature (Chaos Engineering, O'Reilly 2020; Hochstein and Jones, 2022) extends the framework to these cases, but the original document is the floor, not the ceiling.
Where this leads next
The next chapter — fault injection at the platform level — picks up where this one stops: tenet 5 says minimise blast radius, but the mechanism that lets you minimise it is platform-level fault injection. AWS Fault Injection Simulator, Azure Chaos Studio, the Linux tc-netem family, and the Kubernetes litmus operator are the implementations of "blast-radius-bounded fault" as a primitive. Without those primitives, every team rolls its own injection harness and the rollback story degrades into "ssh in and undo whatever".
After that, game days covers the human-in-the-loop variant of chaos engineering — large-scale simulated outages with on-call teams responding in real time, which the principles document acknowledges but does not dwell on. Then steady-state hypotheses goes deeper on tenet 1 — how to derive the hypothesis from the SLO, how to set thresholds, how to run the rolling-window check. Finally the wall: every system is unique closes Part 19 with the observation that generic chaos playbooks do not transfer; you must run experiments shaped by your system's real failure modes, which is itself a corollary of tenet 2.
The thread connecting all five chapters of Part 19 is the same claim the principles make: resilience is empirical, not designed. You cannot review your way to confidence; you cannot test-in-staging your way to confidence; you cannot SLO your way to confidence. You can only run experiments with hypotheses, in production, continuously, with bounded blast radius — and the floor for "what counts as an experiment" is the five tenets, not negotiable, not a checklist.
References
- Principles of Chaos Engineering (chaos.principles, 2014) — the original one-page manifesto. Still the canonical source.
- Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri, Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020) — book-length expansion of the principles by the original Netflix authors.
- Ali Basiri et al., "Chaos Engineering" (IEEE Software, 2016) — peer-reviewed synthesis of the practice as developed at Netflix.
- Karl Popper, The Logic of Scientific Discovery (1959) — the falsifiability criterion that underlies tenet 1.
- Netflix Technology Blog, "The Netflix Simian Army" (2011) — historical context for the four years of experience that produced the principles.
- Lorin Hochstein, "Chaos Engineering as I see it" (blog, 2019) — one author's reflection on what the principles got right and what they did not anticipate.
- Tammy Bütow, "How to start a chaos engineering programme" (Gremlin blog, 2019) — adoption-focused walkthrough that explicitly maps each step to a tenet.
- See also: wall: to trust the system, you must break it, fault injection at the platform level, steady-state hypotheses, game days.