Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: to trust the system, you must break it

It is 21:40 IST on a Friday and Aditi, the platform tech lead at PaySetu, is staring at a four-month-old dashboard screenshot on a runbook PR. The dashboard is green. The runbook says "if the merchant-write Postgres primary fails, the read replicas in ap-south-1b take over within 8 seconds — verified by the failover test on 2025-11-17". Aditi is the engineer who ran that test in November, and she now knows three things she did not know then: the connection-pool library version changed in January, the primary's IAM role lost a permission in February that nobody noticed because nothing was failing over, and the read replicas were re-provisioned in March on a different subnet. Three months of green dashboards, 99.99% headline uptime, zero customer complaints — and the failover code path has not actually executed in production for 145 days. The wall every distributed system hits is that observability tells you what is happening; it cannot tell you what would happen, and the only honest test of "does our system survive failure mode X" is to make failure mode X happen on purpose, in production, while you are watching.

Uptime numbers and dashboards measure the past, not the future — a system that has been green for 90 days has merely not been tested for 90 days. The failure modes of a distributed system are exponential in the number of components, so you cannot enumerate them in a design review or guarantee them through code review; you can only sample them by injecting controlled failures and watching what the system actually does. Chaos engineering is not "breaking things for fun" — it is the empirical method, applied to operational software, with a hypothesis, a blast-radius limit, and a rollback. Without it, every claim of resilience is unfalsified.

The fundamental gap: dashboards record outcomes, not capabilities

Observability — the topic of the part you just finished — is a measurement discipline. Logs, metrics, and traces all answer questions about events that have already happened: "what was the p99 of the merchant-write API yesterday at 14:00", "which pods restarted in the last hour", "where did this trace spend 2.4 seconds". Every one of those answers is a fact about the past. None of them tell you what the system would do if the primary database lost network connectivity to two of its three replicas right now, or if the rate-limiter Redis cluster lost a quorum at 09:30 on the busiest payday of the month.

The instinct most engineering organisations develop is to substitute historical reliability for future reliability. "We had four nines last quarter, therefore we have four nines this quarter." This substitution feels rigorous because the number is precise, but the inference is wrong: 99.99% is a property of the workload that ran, the failures that occurred, and the recovery code paths that were exercised. Change any of those — a new traffic shape, a new failure mode, a code path that has not run for three months — and the historical number predicts nothing about the future one. Aditi's failover code is the canonical example: it ran once in a controlled test, has not run since, and three of the things it depends on — the pool library, the IAM role, the replica subnet — have changed underneath it.

Figure: The gap between observed reliability and actual reliability (illustrative). Two parallel timelines cover the same 145 days. "What dashboards see" — success rate, latency, error rate — is uniformly green at 99.99%, with one yellow blip. "What was actually exercised" is mostly grey, with only a few markers where a failover or retry actually fired: roughly 3% of code paths. The space between the two lines is the trust gap — assumed working, but not exercised. A 145-day green window does not mean 145 days of resilience; it means 145 days during which the failure modes you fear did not happen. The only way to close the gap is to deliberately exercise the paths that did not run on their own.

Most production code paths do not execute on a normal day. The retry path executes when something fails. The circuit-breaker open-state executes when the failure threshold trips. The cross-region failover executes when the home region is unreachable. The quorum-loss handler executes when ≥2 of 3 replicas die. None of these run on the green-dashboard days, and "green dashboard" is silence about them, not evidence.
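To make the asymmetry concrete, here is a minimal, hypothetical circuit-breaker sketch (class name and thresholds invented for illustration). On a green-dashboard day only the closed-state branch at the bottom ever executes; the open-state branch — the one the runbook quietly assumes works — has never been observed running until something forces it to.

# Hypothetical sketch: on a healthy day only the "closed" branch runs.
# The "open" branch executes only after FAILURE_THRESHOLD consecutive
# errors — i.e. never on the days the dashboard is green.
import time

FAILURE_THRESHOLD = 5
OPEN_SECONDS = 30                    # stay open this long before probing again

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None        # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:                        # OPEN path
            if time.time() - self.opened_at < OPEN_SECONDS:
                raise RuntimeError("circuit open — failing fast")
            self.opened_at = None                              # half-open probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:             # trip the breaker
                self.opened_at = time.time()
            raise
        self.failures = 0                                      # CLOSED path
        return result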

Why historical reliability is a worse predictor than people assume: the failure surface of a system with N components and M states per component is roughly M^N joint configurations. A 400-service system with even 4 meaningful states per service (healthy / degraded / partitioned / dead) has ~10^240 joint states. Production has visited maybe 10^4 of them in 145 days. The 99.99% number is computed over an infinitesimal sample of the actual state space — and, crucially, it is the sample biased toward states the system is already good at handling, because those are the only states it spent time in.
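The arithmetic is worth doing once. A back-of-the-envelope sketch, using the same illustrative numbers as above (400 services, 4 states each, ~10^4 states visited):

# Back-of-the-envelope: joint state space vs. states actually visited.
import math

services = 400
states_per_service = 4               # healthy / degraded / partitioned / dead

log10_joint = services * math.log10(states_per_service)
print(f"joint states     ~ 10^{log10_joint:.1f}")        # ~ 10^240.8

visited = 1e4                        # generous estimate for 145 days
log10_coverage = math.log10(visited) - log10_joint
print(f"fraction visited ~ 10^{log10_coverage:.1f}")     # ~ 10^-236.8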

Why fault injection is not "breaking things for fun"

A common misread of chaos engineering is that it is "intentional sabotage" — engineers maliciously killing pods to "test" resilience. That framing is wrong, and the misread is what blocks adoption in most organisations. Chaos engineering is the empirical method applied to running systems: you state a hypothesis about what should happen under a specific failure, you arrange the smallest controlled experiment that would falsify it, you run the experiment with a defined blast radius and a one-click rollback, and you record what actually occurred.

The structure of a chaos experiment looks more like a clinical trial than vandalism. Four elements are non-negotiable:

  1. Steady-state hypothesis. A measurable claim about normal behaviour the system maintains. "Merchant-write API success rate stays above 99.5% over any rolling 5-minute window." Vague hypotheses ("the system is fine") cannot be falsified, so they are not hypotheses.
  2. The fault to inject. A specific, named, reversible failure. "Drop all egress packets from the merchant-write Postgres primary to its synchronous replica for 90 seconds." Not "make the database slow".
  3. The blast radius. The maximum population of users / requests / regions the experiment can affect, agreed in advance with the on-call team. "Up to 0.5% of merchant-write traffic, only in ap-south-1, only during business hours, with one-click abort."
  4. The rollback. A pre-tested, fast (< 30 second) way to undo the fault if the steady-state hypothesis is violated. The rollback button is itself part of the experiment design — running an experiment whose rollback you have not rehearsed is the actual irresponsible thing.
Figure: Anatomy of a chaos experiment (illustrative) — five stages, every one pre-specified, like a clinical trial. 1. Hypothesis: a falsifiable steady-state claim ("merchant-write success ≥ 99.5% over a 5-minute rolling window" — it has a number, a window, a threshold). 2. Blast radius: ≤0.5% of writes, ap-south-1 only, 11:00–17:00 IST, on-call acknowledged. 3. Inject: drop egress from the primary to the sync replica for 90 seconds (tc qdisc / pumba). 4. Observe: success rate, replication lag, connection errors, tail latency, with auto-abort if success drops below 99.0%. 5. Decide: hypothesis confirmed (close, archive the trace) or violated (rollback, postmortem). An ABORT lever — one-click rollback — stays armed for the entire experiment lifetime.

Every chaos experiment runs through these five stages. Skipping any of them turns it into vandalism. The hypothesis is what makes the experiment scientific; the blast radius is what makes it safe; the rollback is what makes it repeatable.
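One way to keep these elements honest is to refuse to run anything that cannot be written down completely. A hedged sketch — not a real framework, and every value is a hypothetical PaySetu number — of the discipline encoded as a data structure:

# Hypothetical sketch: an experiment is refused unless every element of the
# clinical-trial structure is present and the hypothesis is falsifiable.
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    name: str
    steady_state_metric: str         # what we measure
    steady_state_threshold: float    # the falsifiable number
    window_s: int                    # rolling window for the hypothesis
    fault: str                       # specific, named, reversible
    max_traffic_pct: float           # blast radius, agreed with on-call
    region: str
    duration_s: int
    rollback_cmd: str                # pre-tested, < 30 s to execute

    def validate(self):
        assert self.steady_state_threshold > 0, "hypothesis needs a number"
        assert 0 < self.max_traffic_pct <= 1.0, "blast radius must be capped"
        assert self.rollback_cmd, "no rehearsed rollback, no experiment"

exp = ChaosExperiment(
    name="primary_replica_latency",
    steady_state_metric="merchant_write_success_rate",
    steady_state_threshold=0.995,
    window_s=300,
    fault="netem delay 200ms, primary -> sync replica, 90s",
    max_traffic_pct=0.5,             # percent of merchant-write traffic
    region="ap-south-1",
    duration_s=90,
    rollback_cmd="tc qdisc del dev eth0 root",
)
exp.validate()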

The principle that makes chaos engineering not-stupid is blast radius minimisation: every experiment is the smallest one that would still falsify the hypothesis. If the hypothesis is "the merchant-write API tolerates loss of one Postgres replica", you do not start by killing the primary; you start by introducing 200ms of latency to one replica's network interface for 30 seconds in one region during business hours, and you check whether the steady-state metric moves. The escalation ladder — latency → packet loss → process kill → instance termination → AZ-level partition → region-level partition — is climbed one rung at a time, with confirmation at each step that the rollback works and the steady-state hypothesis holds. Netflix's Chaos Monkey, the original, only kills one instance at a time within an auto-scaling group during business hours; everything more aggressive (Chaos Kong, the simulated region failure) came later, after years of building the smaller-radius vocabulary first.
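The ladder is worth writing down per system, so that nobody skips a rung under deadline pressure. A hypothetical sketch of how PaySetu's might look — the injections and gates are illustrative, not prescriptions:

# Hypothetical escalation ladder — climbed one rung at a time, never skipped.
# Each rung: (name, example injection, gate that must hold before climbing).
ESCALATION_LADDER = [
    ("latency",          "netem delay 200ms to one replica, 30s",
                         "steady state held, rollback verified"),
    ("packet loss",      "netem loss 5% to one replica, 60s",
                         "steady state held, rollback verified"),
    ("process kill",     "kill -9 one replica's postgres process",
                         "replica rejoined, no client-visible errors"),
    ("instance kill",    "terminate one replica instance",
                         "re-provisioning replaced it within the SLO"),
    ("AZ partition",     "block traffic to one availability zone",
                         "cross-AZ failover completed within the SLO"),
    ("region partition", "simulate loss of the home region",
                         "attempted only after every lower rung passes repeatedly"),
]

for rung, injection, gate in ESCALATION_LADDER:
    print(f"{rung:17}  inject: {injection:42}  gate: {gate}")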

A worked example: simulating a partition with tc netem and watching the failover

The tool used most often for the smallest-blast-radius network experiments is Linux's traffic-control subsystem (tc), specifically the netem (network emulation) qdisc. It runs in the kernel of one host, can be applied to one network interface, can be enabled and disabled in milliseconds, and can simulate every kind of network failure that matters: latency, packet loss, packet reordering, duplication, and corruption. The Python harness below wraps tc in a controlled experiment that injects 200 ms of one-way latency between a primary database and its synchronous replica, watches replication lag, and rolls back automatically if lag crosses a threshold.

# chaos_partition_experiment.py
# Inject one-way 200ms latency between primary and sync replica,
# watch replication lag, auto-abort if it exceeds threshold.
# Run as root on the primary host. Rollback is unconditional.
import subprocess, time, signal, sys, json
from datetime import datetime

PRIMARY_IFACE = "eth0"
REPLICA_IP = "10.42.7.18"           # the sync replica's address
LATENCY_MS = 200
EXPERIMENT_DURATION_S = 90
LAG_ABORT_MS = 5000                 # auto-abort if replication lag > 5s
POLL_INTERVAL_S = 2

def run(cmd, check=True):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if check and r.returncode != 0:
        raise RuntimeError(f"FAILED: {cmd}\n{r.stderr}")
    return r.stdout.strip()

def inject_latency():
    # Add a prio qdisc as root, attach a netem delay to band 3, and steer
    # packets destined for REPLICA_IP into that band with a u32 filter
    run(f"tc qdisc add dev {PRIMARY_IFACE} root handle 1: prio")
    run(f"tc qdisc add dev {PRIMARY_IFACE} parent 1:3 handle 30: netem delay {LATENCY_MS}ms")
    run(f"tc filter add dev {PRIMARY_IFACE} protocol ip parent 1:0 prio 3 "
        f"u32 match ip dst {REPLICA_IP}/32 flowid 1:3")

def rollback():
    # Idempotent — safe to call multiple times
    run(f"tc qdisc del dev {PRIMARY_IFACE} root", check=False)

def measure_lag_ms():
    # Replay-lag query against the standby: pg_last_xact_replay_timestamp()
    # is only meaningful there. Real PaySetu reads pg_stat_replication instead.
    out = run(f"psql -h {REPLICA_IP} -tAc \"SELECT EXTRACT(EPOCH FROM "
              "(now() - pg_last_xact_replay_timestamp())) * 1000\"")
    return float(out)

def main():
    signal.signal(signal.SIGINT, lambda *_: (rollback(), sys.exit(1)))
    log = []
    print(f"[{datetime.utcnow().isoformat()}] arming rollback")
    rollback()  # ensure clean state
    print(f"[{datetime.utcnow().isoformat()}] injecting {LATENCY_MS}ms to {REPLICA_IP}")
    inject_latency()
    t0 = time.time()
    try:
        while time.time() - t0 < EXPERIMENT_DURATION_S:
            lag = measure_lag_ms()
            log.append({"t": time.time() - t0, "lag_ms": lag})
            print(f"  t={time.time()-t0:5.1f}s  lag={lag:8.1f}ms")
            if lag > LAG_ABORT_MS:
                print(f"  ABORT — lag {lag:.0f}ms > threshold {LAG_ABORT_MS}ms")
                break
            time.sleep(POLL_INTERVAL_S)
    finally:
        rollback()
        print(f"[{datetime.utcnow().isoformat()}] rollback complete")
        print(json.dumps({"experiment": "primary_replica_latency",
                          "samples": len(log),
                          "max_lag_ms": max(s["lag_ms"] for s in log),
                          "result": "completed"}, indent=2))

if __name__ == "__main__":
    main()

Sample run on a PaySetu staging cluster:

[2026-04-28T11:32:14] arming rollback
[2026-04-28T11:32:14] injecting 200ms to 10.42.7.18
  t=  0.1s  lag=    14.8ms
  t=  2.1s  lag=   213.4ms
  t=  4.1s  lag=   438.2ms
  t=  6.2s  lag=   712.9ms
  t=  8.2s  lag=   984.1ms
  t= 10.2s  lag=  1198.7ms
  ...
  t= 88.4s  lag=  1402.3ms
[2026-04-28T11:33:44] rollback complete
{ "experiment": "primary_replica_latency",
  "samples": 44, "max_lag_ms": 1402.3, "result": "completed" }

Walkthrough: inject_latency() uses three tc commands — a priority qdisc as the root, a netem qdisc on one of its bands that adds the configured delay, and a u32 filter that matches packets destined for the replica's IP and routes them through that band. The result is one-way egress delay applied only to traffic from this host to the one specific replica IP. rollback() deletes the root qdisc, which atomically removes everything underneath; it is check=False because deleting a non-existent qdisc returns a non-zero exit status, which is fine for an idempotent rollback. The SIGINT handler ensures a Ctrl-C still removes the qdisc, and the finally block covers other failure paths; a hard kill would leave the qdisc in place, which is why rollback() also runs at startup to clear any stale state.

Why this design is safe in production: the tc filter only affects traffic to one IP, so 100% of customer traffic to other replicas, application servers, and external services is untouched. The blast radius is one TCP flow on one host. The rollback is one kernel call. The 90-second duration and the auto-abort threshold cap the worst-case impact at a defined level. This is what "smallest experiment that would falsify the hypothesis" looks like in code.

The numbers tell the story: lag climbs steeply in the first seconds because synchronous commit waits for the replica's ack, so every commit now pays the added round trip, and it then levels off around 1.2–1.4 seconds as the commit rate adapts. By second 10, lag is approaching 1.2 seconds — well under the 5-second auto-abort, so the experiment runs to completion. Why a real experiment includes more than just the lag metric: PaySetu watches the merchant-write API's success rate, p99 latency, connection-pool wait time, and the application's internal replication_drift_seconds gauge in parallel. Any one of these crossing its abort threshold triggers rollback. A single observability metric is a single sensor; the steady-state hypothesis usually needs three.
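A hedged sketch of what that multi-sensor abort check could look like — the metric names and thresholds are invented, and fetch_metric() stands in for whatever query interface the team actually uses (a Prometheus instant query, say):

# Hypothetical multi-sensor abort check: any single sensor crossing its
# threshold triggers rollback. fetch_metric() is a stand-in for the real
# metrics backend.
ABORT_THRESHOLDS = {
    "merchant_write_success_rate": ("min", 0.990),   # abort if it drops below
    "merchant_write_p99_ms":       ("max", 800.0),   # abort if it rises above
    "conn_pool_wait_ms":           ("max", 250.0),
    "replication_drift_seconds":   ("max", 5.0),
}

def should_abort(fetch_metric):
    for metric, (kind, threshold) in ABORT_THRESHOLDS.items():
        value = fetch_metric(metric)
        if kind == "min" and value < threshold:
            return f"{metric}={value} below {threshold}"
        if kind == "max" and value > threshold:
            return f"{metric}={value} above {threshold}"
    return None          # all sensors nominal — keep the experiment running

# In the harness's polling loop, alongside the lag check:
#     reason = should_abort(fetch_metric)
#     if reason:
#         print(f"  ABORT — {reason}")
#         break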

The cultural barrier — and why it is the real wall

The technical apparatus of chaos engineering is straightforward: tc netem, pumba, AWS Fault Injection Simulator, Gremlin, the Chaos Toolkit, Litmus. The actual wall is cultural. Four objections kill chaos programmes more often than the technology:

  1. "We can't break production — customers will notice." This conflates uncontrolled failure (an outage) with controlled failure (a 0.5%-blast-radius experiment with one-click rollback). The premise of the objection is also wrong-shaped: production is being broken every day by entropy, deploys, and dependency drift; the only choice is whether you find out at 02:14 from PagerDuty or at 14:00 from a planned experiment.

  2. "We'll do it after we improve the system." This is the trap that keeps chaos engineering perpetually one quarter away. The system is improved precisely by finding the failure modes that experiments expose; waiting until it is "ready" means waiting forever. Netflix's chaos programme started in 2008 on a system that was visibly unreliable; the chaos was the improvement loop.

  3. "We don't have the staffing." Real, but inverted. The cost of a half-hour weekly chaos experiment is much smaller than the cost of one bad-day incident with a code path that has not been exercised. The accounting that makes this objection persuasive only works if you ignore the cost of incidents.

  4. "What if the experiment causes an outage?" It might. The blast radius is the answer: a 0.5% experiment that turns into a 0.5% outage for 90 seconds is a recoverable event; a 100% outage at 02:14 because nobody tested the failover is a career event. The first is bounded by design; the second is bounded by luck.

The mature pattern shipped at most chaos-engineering organisations is the steady-state hypothesis review: every team owns a "runbook" of chaos experiments, each with its hypothesis, blast radius, and rollback, scheduled to run weekly or monthly. Failed experiments produce postmortems exactly as outages do, but with the recovery already complete and the customer impact bounded. CricStream runs 14 standing experiments on its live-streaming control plane, half of them executing automatically in production every Tuesday between 11:00 and 13:00 IST. Why the schedule matters: chaos experiments that run "when we get to them" do not run. Chaos experiments that run on a calendar slot, with on-call coverage, and with a default rollback window pre-approved, do run — and they run consistently enough to catch the slow-drift failure modes (a library upgrade in January, an IAM change in February, a subnet move in March) that destroy a once-tested code path.
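What "runs on a calendar slot" looks like in practice is unglamorous: a registry in which the schedule, the owning team, and the pre-approved rollback window live next to the experiment itself. A hypothetical sketch — none of these are real CricStream systems, and the scheduler itself would be cron or CI, not this script:

# Hypothetical standing-experiment registry: the shape is what matters —
# owner, calendar slot, cadence, and a pre-approved rollback window.
STANDING_EXPERIMENTS = [
    {"name": "edge-cache node kill",            "owner": "stream-edge",
     "slot": "Tue 11:00-13:00 IST", "cadence": "weekly",
     "rollback_window_s": 30, "on_call_ack_required": True},
    {"name": "origin -> packager +300ms",       "owner": "video-pipeline",
     "slot": "Tue 11:00-13:00 IST", "cadence": "weekly",
     "rollback_window_s": 30, "on_call_ack_required": True},
    {"name": "control-plane Redis quorum loss", "owner": "platform",
     "slot": "first Tue of month, 11:00 IST",   "cadence": "monthly",
     "rollback_window_s": 60, "on_call_ack_required": True},
]

# No entry is allowed into the registry without a bounded rollback window
# and an acknowledged on-call owner.
for exp in STANDING_EXPERIMENTS:
    assert exp["rollback_window_s"] <= 60 and exp["on_call_ack_required"], exp["name"]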

Common confusions

  • "Chaos engineering is the same as load testing" — load testing answers "how many requests per second can the system serve?" (a capacity question). Chaos engineering answers "what happens when one component fails?" (a resilience question). The two use overlapping tooling (production-shaped traffic, blast-radius limits) but they ask different questions, and a system that passes load tests can still fail trivially under partial-failure conditions.

  • "Chaos Monkey is what chaos engineering means" — Chaos Monkey is one of the simplest possible experiments: kill one VM in a service group at random during business hours. Real chaos engineering covers a far broader range — network partitions, clock skew, slow disks, full disks, DNS failures, dependency outages, certificate expirations — most of which are not "kill an instance".

  • "You need a 'chaos engineering team' to do chaos engineering" — a centralised team is one model and not the most successful. The more durable pattern is a chaos platform run by the platform team and chaos experiments owned by every product team for the systems they own. Centralised teams know the tools but not the failure modes; product teams know the failure modes but need the platform.

  • "Production is sacred — only test in staging" — staging environments are systematically different from production: less traffic, fewer real dependencies, different data shapes, fewer concurrent operations. A failover that passes in staging routinely fails in production because the very thing being tested (load-dependent behaviour, real dependency failures, contention) is missing from staging. Chaos in staging is necessary; chaos limited to staging is insufficient.

  • "Once we ran the experiment, we know the answer" — the answer is valid only for the version of the system that was running when you ran it. Code, config, infrastructure, and dependency versions all drift. Aditi's failover test in November was true in November; it predicts nothing about April. Chaos experiments are continuous, not one-shot — the same way unit tests are continuous, not one-shot.

  • "Game days are the same as automated chaos experiments" — game days are scheduled, human-driven, large-scope simulations: an entire team disconnects from its tools and tries to handle an injected outage, often involving multiple services. Automated chaos experiments are smaller, narrower, run continuously without human attention. Both matter; they answer different questions. Game days test humans + the system. Automated chaos tests the system without putting humans on the spot.

Going deeper

From Chaos Monkey to FIS — the platformisation of fault injection

The original Chaos Monkey (Netflix, 2010) was a simple service that called the AWS API to terminate one EC2 instance per auto-scaling group during business hours. The fault model was deliberately simple — one instance dies — because the systems being tested were not yet ready for anything more aggressive. By 2016, Netflix had extended the family to Chaos Kong (simulate the loss of an entire region), Chaos Gorilla (take out an availability zone), Latency Monkey (inject delays into RPC paths), and Conformity Monkey (terminate instances violating policy). The lesson was that the experiment vocabulary needs to evolve as fast as the system does. Cloud platforms now ship fault-injection as a managed service: AWS Fault Injection Simulator (FIS) supports networking, EC2, RDS, EKS, EBS faults declaratively; Azure Chaos Studio offers a similar model. The advantage of platformisation is that the rollback is also the platform's responsibility, with timeouts, IAM-based authorisation for who can run experiments, and audit logs for compliance. The disadvantage is that platform-level fault injection only covers faults the platform knows about — application-layer faults (a poison message in a Kafka queue, a deadlocked database query, a leaked file descriptor) still need application-level injection harnesses.

The Principles of Chaos Engineering — what the original 2014 manifesto actually says

The Principles document (principlesofchaos.org, 2014) lays out five tenets that have aged remarkably well: build a hypothesis around steady-state behaviour, vary real-world events (network failures are most realistic, not just instance kills), run experiments in production (because staging lies), automate experiments to run continuously (drift defeats one-shot tests), and minimise blast radius. The principle most often violated in adoption attempts is the third — running experiments only in staging "until we are ready". The principle most often forgotten is the first — running experiments without a pre-stated steady-state hypothesis, so any outcome is rationalisable as "well, it didn't crash, so we passed". Without a falsifiable hypothesis there is no experiment, only theatre.

Why FLP, CAP, and the impossibility results matter for chaos design

The chapters on the FLP impossibility result and the CAP theorem tell you what the system cannot do; chaos engineering tells you what it actually does in the regimes those theorems describe. A consensus-backed coordination service (etcd, ZooKeeper) cannot make progress during a partition that costs it its quorum — the impossibility results and quorum arithmetic guarantee that much. Whether your application correctly detects this regime, fails over to a degraded mode, and recovers cleanly when the partition heals — that is empirical. Chaos experiments at the Raft layer (kill one of three voters; kill two of three voters; partition the leader from a majority) are how you confirm the system actually does what the theory says it should. Most subtle bugs at the consensus layer live precisely in the seam between "what the protocol guarantees" and "what the implementation does on this particular hardware with this particular library version".
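A hedged sketch of what such an experiment could look like against a three-member etcd cluster running in local containers. The container names are hypothetical, the docker / etcdctl invocations are the ordinary CLI commands rather than a chaos framework, and the hypothesis is exactly what the quorum rule predicts:

# Hypothetical sketch: stop one of three etcd voters and confirm the cluster
# still accepts writes; stop a second and confirm it (correctly) cannot.
import subprocess, time

MEMBERS = ["etcd-1", "etcd-2", "etcd-3"]      # hypothetical container names
ENDPOINTS = "http://127.0.0.1:2379"           # a surviving client endpoint

def sh(cmd):
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def can_write():
    # etcdctl times out and returns non-zero when the cluster has no quorum
    r = sh(f"etcdctl --endpoints={ENDPOINTS} put chaos-probe ok")
    return r.returncode == 0

try:
    sh(f"docker stop {MEMBERS[0]}")           # 2 of 3 alive: quorum intact
    time.sleep(2)
    assert can_write(), "writes should survive the loss of one voter"

    sh(f"docker stop {MEMBERS[1]}")           # 1 of 3 alive: quorum lost
    time.sleep(2)
    assert not can_write(), "writes must stall without a quorum"
finally:
    for m in MEMBERS:                         # rollback is unconditional
        sh(f"docker start {m}")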

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# no third-party packages needed — the harness uses only the standard library
# Need a Linux box with iproute2; macOS does not ship tc.
# In a disposable Linux VM (or a docker container with --cap-add=NET_ADMIN):
sudo python3 chaos_partition_experiment.py
# To experiment without a real Postgres, swap measure_lag_ms() for a
# ping-RTT-based stand-in — a minimal sketch follows.
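One way to make that stand-in concrete — a minimal sketch that reports the ping round-trip time to REPLICA_IP in milliseconds, so the injected netem delay shows up directly in the harness's lag column (assumes a ping binary on PATH; this is a laptop toy, not a replication-lag measurement):

# Hypothetical stand-in for measure_lag_ms(): report ping RTT to the replica
# so the injected netem delay is directly visible in the experiment output.
import re, subprocess

def measure_lag_ms():
    r = subprocess.run(f"ping -c 1 -W 2 {REPLICA_IP}",
                       shell=True, capture_output=True, text=True)
    m = re.search(r"time=([\d.]+) ms", r.stdout)
    return float(m.group(1)) if m else float("inf")   # treat loss as worst case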

Where this leads next

This chapter closes Part 18 — observability — and opens the doorway to Part 19, chaos engineering. The two parts are not a sequence so much as two halves of the same loop: observability tells you what the system is doing, chaos engineering tells you what it would do, and together they constitute the only honest answer to "is the system reliable". A team that has invested deeply in observability but not in chaos has built a sensor array that records past failures with high fidelity and predicts future failures with none. A team that has invested in chaos without observability cannot tell whether an experiment passed or failed, so the experiments are noise. You need both.

Part 19 builds out the discipline: the principles document, fault injection at the platform level, game days as a human-system test, steady-state hypotheses as the unit of experiment, and the wall that every system is unique — meaning generic chaos playbooks do not transfer; you must run experiments shaped by your system's real failure modes, not someone else's.

The thread connecting Part 18 to Part 19 is one claim: resilience is a property you cannot prove by reading the code, only by running the experiment. A system that has not been broken on purpose has never been observed to recover; a system that has been broken on purpose, repeatedly, with falsifiable hypotheses and bounded blast radius, has the only kind of resilience evidence that actually predicts the next failure.

References

  1. Netflix Technology Blog, "The Netflix Simian Army" (2011) — the introduction of Chaos Monkey and its successors.
  2. Casey Rosenthal and Nora Jones, Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020) — the canonical book-length treatment.
  3. Ali Basiri et al., "Chaos Engineering" (IEEE Software, 2016) — the academic-flavoured synthesis from the Netflix team.
  4. Principles of Chaos Engineering (principlesofchaos.org, 2014) — the five-tenet manifesto.
  5. AWS, "Fault Injection Simulator" documentation — the cloud-platform-managed fault injection model.
  6. Linux tc-netem(8) man page — the canonical reference for kernel-level network fault injection.
  7. Adrian Cockcroft, "Failure Modes and Continuous Resilience" (QCon London 2019) — failure-mode taxonomy and resilience patterns.
  8. See also: the principles — Netflix, fault injection at the platform level, observability is a data problem, steady-state hypotheses.