Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Chaos under load
Aditi runs SetuStream's pre-IPL chaos drill. On an idle staging cluster she kills one of three replicas of the catalogue service. The remaining two pick up the load instantly, p99 climbs from 18 ms to 22 ms, the killed pod restarts, and the dashboard goes back to green in 11 seconds. The runbook is signed off. Three weeks later, during the IPL final at 23M concurrent viewers, a real EC2 host failure takes out the same shape of pod. The remaining replicas are already at 78% CPU. Retries from the load balancer triple their incoming RPS in 400 ms. The connection pool to the downstream metadata service exhausts. p99 climbs from 240 ms to 14 seconds. The runbook said the system tolerates a single replica loss; the runbook tested it on an idle cluster. Chaos under load is the discipline that makes that drill produce a number you can take to production.
Chaos engineering on an idle system tests the failure-handling code path; chaos engineering under load tests the failure-handling code path plus the queueing dynamics, retry storms, connection-pool behaviour, and autoscaler reaction time that the failure triggers. Only the second one predicts what happens in production. The right drill combines a load generator running open-loop at the realistic offered load with a fault injector firing kernel-level, network-level, or process-level failures on a deterministic schedule, while a measurement layer captures p99 corrected for coordinated omission (CO-corrected p99), error rates, and the time-to-recovery curve. If your chaos drill does not include the load, you are testing a system that does not exist.
Why chaos without load lies
Every production failure has two parts: the failure itself (the pod dies, the network drops, the disk fills) and the system's response to the failure under whatever load it was carrying at the moment. Most chaos drills test only the first part. They run on staging clusters with 5% of production load, kill a pod, watch the system recover, and declare the system "resilient". The recovery on staging is fast and clean because there are no queued requests competing for the surviving capacity, no in-flight retries hammering the remaining replicas, no connection pools to exhaust, no autoscaler to wake up.
Production has all of those. When a PaisaBridge payment-API pod dies during the Diwali peak (8500 RPS sustained), the load balancer's health check takes 4 seconds to mark the pod unhealthy. Roughly 34000 requests arrive in that window, and the dead pod's share of them (about 1/N for N evenly weighted pods) is sent to a dead address. Each of those requests times out at the client after 2 seconds and retries up to 3 times, and the retry storm temporarily multiplies the offered load on the surviving pods by 1.4×. The surviving pods, already at 65% CPU before the failure, hit 91% CPU during the retry storm. Their tail latency climbs into the queueing knee: p99 goes from 95 ms to 680 ms. The autoscaler reacts at the next 30-second tick and decides to add 2 pods, but those pods take 90 seconds to start because of container pull, JVM warmup, and cold JIT. For 90 seconds the system is in a degraded state that the staging drill never produced, because staging never had the load to put the surviving pods near the queueing knee in the first place.
Why the loaded recovery is structurally worse than the idle recovery, not just proportionally worse: when a single replica out of N dies, each survivor's share of the total load rises from 1/N to 1/(N-1). That is an absolute increase of 1/(N-1) - 1/N = 1/(N(N-1)) of total load and a relative increase of 1/(N-1). For N=3 the relative jump is 50% (each survivor goes from 33% to 50% of total load). On the idle cluster, a 50% increase on a 5% baseline gives 7.5%: still well below the queueing knee, so latency barely moves. On the loaded cluster, a 50% increase on 78% CPU gives 117%: past 100%, which means the surviving pods cannot keep up at all, requests queue, queue depth grows roughly linearly with time, and p99 grows with the queue depth. The loaded system is in a fundamentally different operating regime: offered load exceeds capacity, so there is no steady state for a model such as M/M/c to describe, and latency is set by how long the overload lasts. This is why "scaled-down" chaos drills do not extrapolate.
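The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the load balancer spreads the same total load evenly across the survivors and that CPU utilisation scales linearly with load; the function name `survivor_utilisation` is illustrative, not part of any tool in this chapter.

```python
def survivor_utilisation(replicas: int, failed: int, utilisation: float) -> float:
    """Per-replica utilisation after `failed` of `replicas` die.

    Assumes the same total load is spread evenly over the survivors
    and that utilisation is proportional to load (both are simplifications).
    """
    survivors = replicas - failed
    if survivors <= 0:
        raise ValueError("no surviving replicas")
    return utilisation * replicas / survivors


# The two scenarios from the text: 3 replicas, one lost.
for label, before in (("idle staging", 0.05), ("loaded production", 0.78)):
    after = survivor_utilisation(replicas=3, failed=1, utilisation=before)
    state = "overloaded, queue grows" if after >= 1.0 else "below saturation"
    print(f"{label}: {before:.0%} -> {after:.1%} ({state})")
```

Anything at or above 100% in the second column means the survivors cannot drain the queue until load drops or capacity is added, which is the regime the idle drill never enters.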
A runnable chaos-under-load harness
The right chaos drill runs three components together: a load generator firing open-loop at realistic peak RPS, a fault injector firing failures on a deterministic schedule, and a measurement layer capturing CO-corrected latency, error rate, and recovery time. The Python script below orchestrates all three: it drives wrk2 (load) through subprocess.Popen, injects the fault with kubectl (process fault) or tc (network fault) through subprocess.run, and parses the CO-corrected percentile table that wrk2 prints with a regular expression. It reports aggregate percentiles for the whole run; a per-second recovery curve needs interval logging on top of this.
# chaos_under_load.py: orchestrate load + fault + measurement together
# Runs wrk2 against a target service, injects one fault on a schedule,
# parses the CO-corrected HdrHistogram percentile table, prints a verdict.
import re
import subprocess
import sys
import threading
import time

TARGET_URL = sys.argv[1] if len(sys.argv) > 1 else "http://catalogue.svc:8080/items/popular"
TARGET_RPS = int(sys.argv[2]) if len(sys.argv) > 2 else 8500  # PaisaBridge-like peak
DURATION_S = int(sys.argv[3]) if len(sys.argv) > 3 else 120
FAULT_AT_S = 30                      # inject fault 30s into the run
FAULT_TYPE = "kill_pod"              # kill_pod | net_delay | net_loss
TARGET_POD = "catalogue-7d9b-x4f2c"  # the victim
SLO_P99_MS = 500.0

def fault_kill_pod(pod):
    return subprocess.run(["kubectl", "delete", "pod", pod, "--grace-period=0", "--force"],
                          capture_output=True, text=True)

# The tc faults shape traffic on the host that runs this script, and they stay
# in place until removed with: sudo tc qdisc del dev <iface> root
def fault_net_delay(iface="eth0", delay_ms=300):
    return subprocess.run(["sudo", "tc", "qdisc", "add", "dev", iface, "root",
                           "netem", "delay", f"{delay_ms}ms"], capture_output=True, text=True)

def fault_net_loss(iface="eth0", pct=10):
    return subprocess.run(["sudo", "tc", "qdisc", "add", "dev", iface, "root",
                           "netem", "loss", f"{pct}%"], capture_output=True, text=True)

def fire_fault():
    print(f"[{time.time():.1f}] FAULT: injecting {FAULT_TYPE}")
    if FAULT_TYPE == "kill_pod": return fault_kill_pod(TARGET_POD)
    if FAULT_TYPE == "net_delay": return fault_net_delay()
    if FAULT_TYPE == "net_loss": return fault_net_loss()
    raise ValueError(FAULT_TYPE)

# Schedule the fault to fire FAULT_AT_S seconds from now. The timer is a daemon
# thread so the fault cannot fire after the script has already exited on an error.
fault_thread = threading.Timer(FAULT_AT_S, fire_fault)
fault_thread.daemon = True
fault_thread.start()

# Run wrk2 at a constant arrival rate. -L prints the CO-corrected percentile table
# and --u_latency adds the uncorrected one. This script reads only the end-of-run
# report from stdout; a per-second trace needs interval logging or a sidecar that
# scrapes /metrics.
print(f"[{time.time():.1f}] LOAD: wrk2 -R{TARGET_RPS} for {DURATION_S}s against {TARGET_URL}")
load_proc = subprocess.Popen(
    ["wrk2", "-t", "8", "-c", "256", "-d", f"{DURATION_S}s",
     "-R", str(TARGET_RPS), "-L", "--u_latency", TARGET_URL],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# Drain stderr on its own thread to avoid PIPE backpressure
def drain(stream, label):
    for line in stream: print(f"[{label}] {line.rstrip()}")
threading.Thread(target=drain, args=(load_proc.stderr, "wrk2-err"), daemon=True).start()

# Read stdout directly: communicate() would race the drain thread for stderr.
stdout = load_proc.stdout.read()
load_proc.wait()
fault_thread.cancel()

# Parse only the CO-corrected ("Recorded Latency") table. The uncorrected table
# that --u_latency adds uses the same row format and would overwrite these values.
UNIT_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0, "m": 60000.0, "h": 3600000.0}
percentiles, section = {}, None
for line in stdout.splitlines():
    if "Latency Distribution" in line:
        section = "uncorrected" if "Uncorrected" in line else "recorded"
        continue
    m = re.match(r"\s*([\d.]+)%\s+([\d.]+)(\w+)", line)
    if not m or section != "recorded": continue
    pct, val, unit = float(m.group(1)), float(m.group(2)), m.group(3)
    percentiles[pct] = val * UNIT_TO_MS.get(unit, 1.0)

p99 = percentiles.get(99.0, float("inf"))
p999 = percentiles.get(99.9, float("inf"))
p9999 = percentiles.get(99.99, float("inf"))
m_rps = re.search(r"Requests/sec:\s+([\d.]+)", stdout)
ach = float(m_rps.group(1)) if m_rps else 0.0

print(f"\n=== CHAOS-UNDER-LOAD RESULT ===")
print(f"Target RPS / Achieved RPS: {TARGET_RPS} / {ach:.0f} shortfall {100*(1-ach/TARGET_RPS):.1f}%")
print(f"p99 (CO-corrected): {p99:8.1f} ms SLO: {SLO_P99_MS} ms")
print(f"p99.9: {p999:8.1f} ms")
print(f"p99.99: {p9999:8.1f} ms")
print(f"VERDICT: {'BREACH' if p99 > SLO_P99_MS or ach < TARGET_RPS * 0.95 else 'within SLO'}")
Sample output running this against a Kubernetes-deployed catalogue service at 8500 RPS, with kill_pod injected at t=30s:
[1714053920.4] LOAD: wrk2 -R8500 for 120s against http://catalogue.svc:8080/items/popular
[1714053950.4] FAULT: injecting kill_pod
[wrk2-err] Running 2m test @ http://catalogue.svc:8080/items/popular
[wrk2-err] 8 threads and 256 connections
[wrk2-err] Thread Stats Avg Stdev Max
[wrk2-err] Latency 312.40ms 1180.20ms 18.40s
[wrk2-err] Req/Sec 1.04k 312.10 1.62k
Latency Distribution (HdrHistogram - Recorded Latency)
50.000% 42.10ms
75.000% 118.40ms
90.000% 680.20ms
99.000% 4.82s
99.900% 12.40s
99.990% 18.10s
Latency Distribution (HdrHistogram - Uncorrected Latency)
50.000% 38.10ms
75.000% 62.40ms
90.000% 140.20ms
99.000% 680.40ms
99.900% 1.42s
99.990% 2.10s
982134 requests in 2.00m, 412.18MB read
Non-2xx or 3xx responses: 8412
Requests/sec: 8184.45
=== CHAOS-UNDER-LOAD RESULT ===
Target RPS / Achieved RPS: 8500 / 8184 shortfall 3.7%
p99 (CO-corrected): 4820.0 ms SLO: 500.0 ms
p99.9: 12400.0 ms
p99.99: 18100.0 ms
VERDICT: BREACH
Walking the key lines. fault_thread = threading.Timer(FAULT_AT_S, fire_fault) schedules the fault to fire at a known offset into the load run, so the post-mortem can correlate the latency spike to the exact second the fault landed; fault timing that is random and unrecorded makes the recovery curve much harder to read. -R str(TARGET_RPS) is the load-bearing flag for wrk2: it sets the constant open-loop arrival rate, so when the killed pod backs up the surviving replicas, wrk2 keeps to its send schedule and counts the full queue-wait time as part of the latency. A closed-loop generator such as the original wrk backs off when responses slow, and the spike never appears in its histogram. p99 (CO-corrected): 4820.0 ms versus the uncorrected p99 of 680 ms is the smoking gun: the uncorrected number under-reports the recovery cost by 7×, which is exactly the gap that lets idle-staging chaos drills look "fine" while production drills look catastrophic. Non-2xx or 3xx responses: 8412 counts the error responses returned during the recovery window (wrk2 itself does not retry); the fact that Achieved RPS is only 3.7% below target tells you the load generator kept firing, so the errors are real failed requests, not unsent ones.
Why the per-second timeseries matters more than the aggregate percentiles for chaos drills: a 30-second p99 window that includes the fault injection averages a small number of catastrophic samples with many fast samples, producing a "bad but not terrible" number. The honest signal is the per-second p99 trace — it shows the actual height and duration of the latency cliff. A 10-second cliff at p99=14s is a service-degradation incident; a 90-second cliff at p99=4s is a major outage. The aggregate p99 number cannot distinguish these, but the per-second trace can. Production chaos drills should always emit a per-second percentile trace (HdrHistogram supports interval logging via record_value + periodic reset cycles, k6 emits per-second trends natively, Gatling emits per-1s trace lines in simulation.log).
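A per-second trace can be built from raw samples with very little code. The sketch below assumes you already have `(send_time_s, latency_ms)` pairs, where the timestamp is the intended send time so that queue wait stays attributed to the second in which the request should have gone out. It uses a plain nearest-rank percentile rather than HdrHistogram, which is enough to illustrate the shape; the function name `per_second_p99` and the sample data are hypothetical.

```python
from collections import defaultdict


def per_second_p99(samples):
    """Bucket (send_time_s, latency_ms) samples by whole second and return
    {second: p99_ms} using the nearest-rank method inside each bucket."""
    buckets = defaultdict(list)
    for send_time_s, latency_ms in samples:
        buckets[int(send_time_s)].append(latency_ms)

    trace = {}
    for second, values in sorted(buckets.items()):
        values.sort()
        # Nearest-rank p99: smallest 1-based rank r with r >= 0.99 * n,
        # computed in integers to avoid floating-point rounding.
        rank = (99 * len(values) + 99) // 100
        trace[second] = values[rank - 1]
    return trace


# Hypothetical data: second 0 is healthy, second 1 contains 5 slow requests.
samples = [(0.0 + i / 100, 20.0) for i in range(100)]
samples += [(1.0 + i / 100, 20.0) for i in range(95)]
samples += [(1.95 + i / 100, 4000.0) for i in range(5)]

for second, p99_ms in per_second_p99(samples).items():
    print(f"t={second}s p99={p99_ms:.0f} ms")
```

Plotting this trace against the fault timestamp shows both the height and the width of the cliff, which a single end-of-run percentile cannot.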
The fault families that matter — what to inject and why
Production failures cluster into a small set of structural shapes. The chaos drill needs to cover each shape because each one triggers a different recovery code path with different timing characteristics. Skipping any of them leaves a class of bug invisible until it hits production.
Process-level faults kill or pause an individual process or container. kill -9 <pid>, kubectl delete pod, docker kill --signal=KILL, nsenter --target $PID --pid kill -9 1. These exercise the orchestrator's failure detection (Kubernetes liveness probe latency, EC2 ASG health-check interval, ECS task health gate), the load balancer's deregistration latency (typically 5–30 seconds depending on configured deregistration_delay), and the surviving replicas' queue-absorption capacity. The recovery time is dominated by the orchestrator's reaction time, not the application's. At SetuStream this is the standard IPL pre-season drill — random pod kills against the live catalogue service at 1.2× expected peak RPS.
Network-level faults introduce delay, loss, partition, or bandwidth caps using tc qdisc add dev eth0 root netem delay 300ms, netem loss 10%, iptables -A INPUT -p tcp --dport 6379 -j DROP (full partition), or AWS VPC Network ACL changes. These exercise the client's timeout settings, retry policy, circuit breaker thresholds, and the upstream's connection pool behaviour. The most insidious shape is netem delay 300ms loss 1% corrupt 0.1% — the slow-and-flaky network that does not trip the circuit breaker (because requests do eventually succeed) but pushes every latency percentile upward and triggers slow retries. PaisaBridge's drills include a deliberate tc netem against the connection to the bank-rail downstream, because that link routinely degrades during real production peaks.
Resource exhaustion faults consume CPU, memory, inodes, or disk space on the target host: stress-ng --cpu 8 --cpu-load 90, stress-ng --vm 4 --vm-bytes 90%, for i in $(seq 1 100000); do touch /tmp/$i; done (inode pressure; touch closes each file immediately, so this loop does not exhaust file descriptors), dd if=/dev/zero of=/tmp/fill bs=1M count=10000 (disk fill). These exercise the application's behaviour under degraded host conditions, such as JIT recompilation slowdown, GC tuning failure, and log file write blocking. The SetuStream IPL drill includes a CPU-burn fault on a single replica because real bursty IPL traffic does this naturally: one replica gets wedged with a hot key.
Dependency faults fault the downstream service the application calls — DB query timeout, Redis eviction storm, S3 throttle, third-party API 503. These exercise the application's error-handling code path, retry budget, fallback logic (cached value? default value? failed request?), and degraded-mode behaviour. The ParakhTrade chaos drill includes deliberate slowness injection on the order-matching engine's persistence layer to verify that the matching path falls back to in-memory order book during the spike.
Data-corruption faults introduce malformed or unexpected inputs at the API boundary — oversized payloads, malformed JSON, missing required headers, encoded character sets the parser cannot handle. These exercise input validation, parser robustness, and exception-handling timing (catching an exception is not free — Python's try/except for a deeply-nested raise costs 50–500 µs of CPU). DigiPaisa's chaos drill includes deliberate malformed UPI message injection because real UPI traffic occasionally produces corrupted message frames.
Why running these in isolation versus combined matters: production failures rarely arrive as a single fault. The SetuStream IPL final at 23M concurrent viewers experienced a sequence — an EC2 host failure (process fault), followed by retry storm from clients (load amplification), followed by Redis eviction pressure on the surviving replicas because the retry traffic missed the cache (dependency fault), followed by JVM GC pause on the loaded replicas (resource fault). The four faults arrived in 6 seconds and compounded. A chaos drill that injects only one fault type catches only one shape of bug; a drill that injects two faults 2 seconds apart catches the interaction bugs that production produces. Mature chaos programs (Streamora, Riverone, PaisaBridge's 2025 drills) inject combinations on a schedule, with the second fault landing during the recovery from the first.
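A combined drill is mostly a scheduling problem: the second fault has to land at a fixed offset after the first. The sketch below shows one way to express that with `threading.Timer`, the same mechanism the harness uses for its single fault. The injector callables here are hypothetical stand-ins that only record the call; in a real drill they would wrap the kubectl, tc, or stress-ng commands.

```python
import threading


def build_schedule(first_at_s, fault_names, gap_s=2.0):
    """Return [(offset_s, fault_name)], each fault gap_s after the previous one."""
    return [(first_at_s + i * gap_s, name) for i, name in enumerate(fault_names)]


def arm(schedule, injectors):
    """Start one threading.Timer per scheduled fault and return the timers,
    so the caller can cancel them if the load run aborts early."""
    timers = []
    for offset_s, name in schedule:
        timer = threading.Timer(offset_s, injectors[name])
        timer.daemon = True  # never fire after the main script has exited
        timer.start()
        timers.append(timer)
    return timers


# Hypothetical injectors: stand-ins that only record which fault fired.
fired = []
injectors = {
    "kill_pod": lambda: fired.append("kill_pod"),
    "net_delay": lambda: fired.append("net_delay"),
}

schedule = build_schedule(30, ["kill_pod", "net_delay"], gap_s=2.0)
print(schedule)  # [(30.0, 'kill_pod'), (32.0, 'net_delay')]
```

Keeping the schedule as data, separate from the code that arms it, means the exact offsets can be logged alongside the latency trace for the post-mortem.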
The drill cadence — when, how often, with how much load
A chaos drill is not a single event; it is a continuous discipline. The cadence and intensity matter as much as the fault type. The table below comes from observed production practice at Indian-scale operators — PaisaBridge, SetuStream, ParakhTrade, BharatBazaar — and represents what works versus what looks good on a slide.
| Drill type | Cadence | Load level | Fault types | Pass criteria |
|---|---|---|---|---|
| Smoke chaos (CI) | every PR | 5% peak | single pod kill | recovery < 30s, no SLO breach |
| Daily chaos (staging) | daily 02:00 IST | 30% peak | random fault from family menu | per-second p99 returns to baseline within 60s |
| Weekly chaos (staging) | Wednesday 14:00 IST | 100% peak | combined faults (2 within 5s) | error rate < 0.1% during drill, recovery < 90s |
| Pre-event drill (prod) | week before Diwali/IPL/BBD | 130% expected peak | full fault menu over 4 hours | no customer-visible degradation, p99 within 2× SLO |
| Game day (prod) | quarterly | live production traffic | single, announced, well-fenced fault | runbook executed end-to-end, all alerts fired correctly |
The smoke chaos in CI runs as part of every PR that touches a hot path. It uses pumba or chaos-mesh to kill one pod in the staging cluster while a k6 script runs at 5% peak load for 60 seconds. The pass criterion is "recovery within 30 seconds, no SLO breach". PaisaBridge added this in 2024 and it caught two regressions in the first month — both were inadvertent removals of retry-with-backoff logic that the developers thought was unused. The compute cost is ~₹120 per PR, paid for in three months by one Diwali-night incident saved.
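A pass criterion such as "recovery within 30 seconds" needs a definition that a CI job can compute. One simple option, sketched below, is to take a per-second p99 trace and measure how long after the fault the last breached second occurs. The tolerance factor, the trace values, and the function name `time_to_recovery` are all illustrative assumptions, and the trace is assumed to have one entry per second with no gaps.

```python
def time_to_recovery(trace, fault_at_s, baseline_ms, tolerance=1.2):
    """Seconds from the fault until per-second p99 is back within
    tolerance x baseline and stays there for the rest of the trace.

    trace: {second: p99_ms}. Returns 0 if p99 never breached,
    or None if the trace ends while still breached.
    """
    limit = baseline_ms * tolerance
    breached = [s for s, p99 in trace.items() if s >= fault_at_s and p99 > limit]
    if not breached:
        return 0
    last_breach = max(breached)
    if last_breach == max(trace):
        return None  # still degraded when measurement stopped
    return last_breach + 1 - fault_at_s


# Hypothetical trace around a fault injected at t=30s, baseline p99 of 95 ms.
trace = {28: 90, 29: 95, 30: 100, 31: 700, 32: 900, 33: 400, 34: 110, 35: 95, 36: 92}
recovery_s = time_to_recovery(trace, fault_at_s=30, baseline_ms=95)
print(f"recovered after {recovery_s}s:", "PASS" if recovery_s is not None and recovery_s < 30 else "FAIL")
```

Returning `None` for a trace that ends while still degraded keeps a too-short measurement window from being mistaken for a fast recovery.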
The daily chaos run in staging fires a randomly-chosen fault from the family menu at a randomly-chosen time inside a 30-minute window. The randomisation matters — a deterministic fault schedule produces a system that is good at handling the deterministic fault, not at handling the unknown fault. The randomness simulates real production failure distribution, which is itself random. SetuStream runs this nightly on the catalogue and stream-routing services; the drill catches three to five regressions per quarter, most of them in third-party library upgrades that subtly change retry behaviour.
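Randomised drills are easier to debug when the randomness is reproducible. A minimal sketch, assuming the drill date is used as the seed so that a failed run can be replayed with the same fault and offset; the menu entries and the function name `pick_drill` are hypothetical.

```python
import random

# Hypothetical fault menu; each name would map to an injector in a real harness.
FAULT_MENU = ["kill_pod", "net_delay", "net_loss", "cpu_burn", "dependency_slow"]


def pick_drill(seed, window_s=30 * 60, menu=FAULT_MENU):
    """Pick one fault and one offset (in seconds) inside the window.

    The same seed always yields the same pick, so the drill is random
    across days but replayable for any single day.
    """
    rng = random.Random(seed)
    fault = rng.choice(menu)
    offset_s = rng.randrange(window_s)
    return fault, offset_s


fault, offset_s = pick_drill("2025-03-01")
print(f"today's drill: {fault} at +{offset_s}s")
```

Logging the seed with the drill result gives the random tiers the same "which fault, which second" correlation that a fixed schedule provides.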
The pre-event drill is the high-value drill — the week before IPL, Diwali, BBD, or Tatkal-window changes — that runs at 130% of expected peak load for 4 hours, with the full fault menu cycling through every 15 minutes. The 130% target is deliberately above expected peak so that the drill exercises the autoscaler's reaction time and the at-saturation queueing dynamics. The 4-hour duration is chosen to expose memory leaks, file-descriptor leaks, and connection-pool exhaustion that only surface over time. SetuStream's 2024 IPL pre-event drill caught a JVM old-gen growth bug that would have produced a GC death-spiral 3 hours into the IPL final — without the drill, the bug would have hit during the Mumbai Indians vs CSK final in front of 30M concurrent viewers.
The game day is the rarest and most ceremonial drill — once a quarter, on a calendar invite circulated two weeks in advance, with the entire SRE team in the war room. A single fault is injected against live production traffic, with the full runbook executed end-to-end. The pass criterion is procedural: every alert fires correctly, every escalation path resolves to a human within SLA, every dashboard shows the expected signal. Game days are not about whether the system survives the fault (the previous tiers already established that) — they are about whether the humans and tooling respond correctly. ParakhTrade's Q3 2025 game day intentionally degraded the order-matching latency by 80 ms during the 14:00 IST market hours; the drill discovered that the on-call runbook was 18 months out of date and the alert routing pointed to a Slack channel that had been archived.
The blast-radius control is what makes game days possible at all. Real production game days fence the fault to a small percentage of traffic — Streamora's ChAP routes 1% of traffic through the faulted path, PaisaBridge's UPI game days fence by merchant category code so the drill never affects a critical merchant, SetuStream's IPL game days run on the catalogue tier (degrades user experience) but never on the playback tier (loses streaming sessions). The blast-radius design is part of the drill spec, reviewed and approved before the calendar invite goes out. A game day without explicit blast-radius fencing is not a game day; it is an outage.
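One common way to fence a fault to a fixed slice of traffic is a deterministic hash of a stable request key. The sketch below is an assumption about how such fencing could look, not a description of any operator's implementation: the salt, the key format, and the `excluded` set are hypothetical.

```python
import hashlib


def in_blast_radius(request_key, percent, salt="drill-salt", excluded=frozenset()):
    """Return True if this key falls inside the faulted slice of traffic.

    The decision is a pure function of (salt, key), so the same user or
    merchant is consistently inside or outside the drill for its duration.
    Keys in `excluded` are never faulted, whatever the percentage.
    """
    if request_key in excluded:
        return False
    digest = hashlib.sha256(f"{salt}:{request_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0.01% granularity
    return bucket < percent * 100


# Roughly 1% of hypothetical user keys end up in the faulted path.
hits = sum(in_blast_radius(f"user-{i}", percent=1.0) for i in range(100_000))
print(f"{hits} of 100000 keys faulted")
```

Changing the salt per drill moves the faulted slice to a different set of keys, so the same users are not hit every quarter.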
The cadence is also the budget. A team that runs daily staging chaos plus weekly combined chaos plus quarterly game days spends roughly 8% of its SRE engineering capacity on the chaos discipline — meaningful but not crushing. A team that runs only a quarterly game day spends 1% and discovers most of its bugs during real Diwali nights instead of during scheduled drills. The 8% investment is what separates the operators who sleep on Diwali night from the operators who get paged.
Common confusions
- "Chaos engineering and load testing are different things." They are the same thing on different axes. Chaos engineering without load is just unit-testing the failure-handling code. Load testing without chaos is just measuring steady-state capacity. Production failure modes live at the intersection: a fault during a load spike. The right harness drives both axes simultaneously, which is why this chapter exists between /wiki/load-testing-wrk-k6-gatling and /wiki/headroom-peak-and-degraded-modes.
- "Random fault injection is more rigorous than scheduled fault injection." No. For the smoke and daily tiers, randomised faults catch unknown bugs; for the pre-event and game-day tiers, scheduled faults are required because you need to correlate the recovery curve to the exact fault, and you need humans positioned to observe specific signals. Use random for "is the system robust"; use scheduled for "did this specific recovery path work as designed".
- "Chaos drills should not run in production." The opposite — staging chaos cannot reproduce the real network topology, the real load shape, the real third-party dependencies, or the real autoscaler behaviour. Game-day production chaos, with proper fencing (small blast radius, well-announced, abort button ready, runbook in hand), is required to validate that the drill's findings translate. Streamora's Chaos Monkey ran in production from day one; PaisaBridge's quarterly game day runs against live UPI traffic with a 1% blast radius.
- "If staging passes the chaos drill, production will too." No — staging never has the production scale, the production traffic distribution, or the production network topology. A staging cluster of 3 replicas at 5% load behaves under fault injection completely differently from a production cluster of 60 replicas at 78% load with retry traffic from 200M clients. The staging drill validates the code path; the production drill validates the system behaviour. Both are required.
- "The recovery is fine if no errors are returned." No — silent slow degradation is worse than visible failure. A service that takes 14 seconds to respond is "available" by the strict 200-OK definition but unusable for any user. The pass criterion must include both error rate AND latency percentiles, with the latency criterion using CO-corrected p99 from an open-loop generator (see the load-testing chapter for why).
- "Chaos drills are about resilience." They are about predictability. A resilient system that recovers in 30 seconds during the drill but in 8 minutes in production is worse than a brittle system that fails consistently in 60 seconds in both. The drill's purpose is to build a calibrated mental model of "if X happens, the system does Y for Z seconds" — that calibration is what the on-call engineer at 03:00 AM relies on, not abstract resilience.
Going deeper
chaos-mesh and litmus — the Kubernetes-native fault injectors
For Kubernetes-deployed services, chaos-mesh (CNCF, originally PingCAP) and litmus (CNCF, originally Mayadata) are the standard fault-injection tools. They run as Kubernetes operators, accept Chaos CRDs that declaratively specify the fault (PodChaos, NetworkChaos, IOChaos, StressChaos, TimeChaos), and target pods using label selectors. The advantage over hand-rolled kubectl delete pod scripts is that the fault is governed — the CRD specifies the duration, the blast radius, the abort condition, and the audit trail. PaisaBridge's chaos drills use chaos-mesh because the CRDs integrate with the GitOps deployment flow — the drill itself is reviewed, approved, and merged like any other code change. The Python harness from earlier in this chapter is the right shape for ad-hoc drills; chaos-mesh is the right shape for the institutional discipline.
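To make the "drill as reviewed code" idea concrete, the sketch below builds a PodChaos manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The field names follow the chaos-mesh `v1alpha1` PodChaos schema as I understand it; treat them as an assumption and check them against the version installed in your cluster. The resource name, namespace, and label are hypothetical.

```python
import json


def pod_kill_manifest(name, namespace, app_label):
    """Build a PodChaos resource that kills one pod matching app=<app_label>.

    Schema assumed from chaos-mesh v1alpha1; verify before applying.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


manifest = pod_kill_manifest("catalogue-pod-kill", "staging", "catalogue")
print(json.dumps(manifest, indent=2))
```

Committing the generated manifest, rather than a shell script, is what lets the drill go through the same review and merge flow as any other change.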
Coordinated omission in chaos drills — the second source of lying
The load-testing chapter covered coordinated omission in detail. Chaos drills add a second source of CO: the load generator itself may stop sending requests during the fault if its connection pool exhausts. A wrk2 run with 256 connections, when 30% of the connections become useless because the killed pod's TCP sockets time out, can lose up to 30% of its offered load during the fault window, and the resulting histogram under-counts the latency tail because the requests that were never sent are exactly the ones that would have waited longest. The fix is to oversize the connection pool (4× the expected production fan-in), to use vegeta attack -keepalive=false for fault windows where connection reuse becomes a liability, and to monitor the achieved RPS during the fault window. If it drops more than 5% below target, the latency numbers are CO-suspect and need re-running with a fresh connection pool.
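The 5% rule is easy to automate once achieved RPS is available per second. A minimal sketch, assuming a `{second: achieved_rps}` mapping collected from the load generator or a metrics scrape; the function name `co_suspect_windows` and the sample numbers are hypothetical.

```python
def co_suspect_windows(achieved_rps_by_second, target_rps, max_shortfall=0.05):
    """Return the seconds in which achieved RPS fell more than
    max_shortfall below target, i.e. where latency numbers are CO-suspect."""
    floor = target_rps * (1 - max_shortfall)
    return [s for s, rps in sorted(achieved_rps_by_second.items()) if rps < floor]


# Hypothetical per-second achieved RPS around a fault at t=30s.
achieved = {29: 8500, 30: 8490, 31: 5900, 32: 6100, 33: 8400}
suspect = co_suspect_windows(achieved, target_rps=8500)
print("CO-suspect seconds:", suspect)  # CO-suspect seconds: [31, 32]
```

Any non-empty result means the percentiles for those seconds describe a lighter load than the one the drill claimed to apply.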
Steady-state hypothesis — the formal framing from the principles of chaos
The Principles of Chaos Engineering (Basiri et al., Streamora, 2016) frame the discipline around a steady-state hypothesis — a falsifiable prediction about the system's measurable behaviour that should hold under fault injection. "p99 latency stays below 500 ms during single pod loss at 80% peak load" is a steady-state hypothesis; "the system is resilient" is not. The discipline of writing the hypothesis before injecting the fault forces clarity about what the drill is testing — which metric, at what threshold, under what conditions — and produces a binary pass/fail rather than the squishy "well, it kind of recovered" judgement that informal chaos drills produce. Every chaos drill ticket at PaisaBridge opens with a steady-state hypothesis and closes with a yes/no answer; the institutional clarity this produces is worth the small amount of upfront design work.
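A steady-state hypothesis becomes enforceable once it is written as data with an explicit evaluation step. The sketch below is one possible encoding, not a standard format: the class name and fields are illustrative, the p99 threshold reuses the 500 ms example from the text, and the 0.1% error-rate threshold is borrowed from the weekly-drill row of the cadence table. The measured values are the ones from the sample harness output earlier in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    description: str
    max_p99_ms: float
    max_error_rate: float  # fraction, e.g. 0.001 for 0.1%

    def evaluate(self, p99_ms, errors, requests):
        """Return (passed, reasons). Both latency and error rate must hold."""
        error_rate = errors / requests if requests else 1.0
        reasons = []
        if p99_ms > self.max_p99_ms:
            reasons.append(f"p99 {p99_ms:.0f} ms > {self.max_p99_ms:.0f} ms")
        if error_rate > self.max_error_rate:
            reasons.append(f"error rate {error_rate:.3%} > {self.max_error_rate:.3%}")
        return (not reasons, reasons)


hypothesis = SteadyStateHypothesis(
    description="p99 stays below 500 ms during single pod loss at 80% peak load",
    max_p99_ms=500.0,
    max_error_rate=0.001,
)

# Numbers taken from the sample run: CO-corrected p99, non-2xx count, total requests.
passed, reasons = hypothesis.evaluate(p99_ms=4820.0, errors=8412, requests=982134)
print("HOLDS" if passed else "FALSIFIED", reasons)
```

Writing the thresholds down before the drill is the point; the evaluation step only removes the temptation to reinterpret them afterwards.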
Reproduce this on your laptop
# Install wrk2, chaos-mesh (or pumba for Docker), and the Python parser
sudo apt install build-essential libssl-dev libz-dev git
git clone https://github.com/giltene/wrk2 && cd wrk2 && make && sudo cp wrk /usr/local/bin/wrk2
docker pull gaiaadm/pumba # lightweight Docker chaos tool, no k8s needed
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh requests
# Start a tiny target service in 3 containers behind nginx
docker run -d --name=svc1 -p 8081:80 nginxdemos/hello
docker run -d --name=svc2 -p 8082:80 nginxdemos/hello
docker run -d --name=svc3 -p 8083:80 nginxdemos/hello
# (set up nginx LB on :8080 -> 8081/8082/8083)
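# Run the chaos-under-load harness (arguments: URL, RPS, duration in seconds)
# The built-in kill_pod fault still calls kubectl at t=30s; without a cluster that
# call fails harmlessly and the docker pause below is the only fault.
python3 chaos_under_load.py http://localhost:8080/ 1000 60 &
sleep 15
docker pause svc2 # inject a fault: pause one container
sleep 30
docker unpause svc2
wait # let the harness finish and print its percentile report
Raise the RPS argument (the second parameter) to find the load level at which a single-replica fault breaches your local SLO; the gap between idle and loaded fault behaviour will be visible in the wrk2 percentile output.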
Where this leads next
This chapter sits between the load-testing foundation and the operational disciplines that turn measurements into production confidence.
- /wiki/load-testing-wrk-k6-gatling: the previous chapter; the load harness this chapter combines with fault injection.
- /wiki/headroom-peak-and-degraded-modes: the sister chapter; the headroom calculations that the chaos drill validates against real recovery numbers.
- /wiki/shadow-traffic: the next chapter; replaying production traffic to staging without serving users from staging, the precursor to safe production game days.
- /wiki/load-shedding-strategies: the patterns the chaos drill discovers are needed when the fault produces sustained overload.
- /wiki/coordinated-omission-and-hdr-histograms: the measurement foundation that keeps the chaos drill from lying about its own recovery numbers.
The closing rule: a chaos drill that does not run under realistic load is testing a system that does not exist; a chaos drill under load that does not measure CO-corrected p99 is lying about its own recovery numbers; a chaos drill that runs only on staging is validating the code path, not the system. Hold all three together and the drill produces a number you can take to the IPL final, the Diwali peak, the GSTN deadline. Skip any one and the drill is theatre.
References
- Basiri et al., "Chaos Engineering" (IEEE Software 2016) — the Streamora paper that formalised the discipline; defines the steady-state hypothesis framing.
- Principles of Chaos Engineering (principlesofchaos.org) — the canonical short statement of the discipline; required reading before designing any chaos drill.
- chaos-mesh documentation — the CNCF-graduated Kubernetes-native fault injector; the standard tool for Kubernetes chaos engineering.
- Pumba — chaos testing for Docker — the lightweight Docker fault injector for non-Kubernetes drills.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 12 — Benchmarking — the methodology framework for any benchmark, including chaos-under-load drills.
- Streamora Tech Blog, "ChAP: Chaos Automation Platform" — the engineering of running chaos drills against live production traffic with safe blast-radius control.
- Gil Tene, "How NOT to Measure Latency" — the coordinated-omission talk; mandatory context for measuring chaos-drill recovery accurately.
/wiki/load-testing-wrk-k6-gatling— the internal cross-link to the load-generation foundation this chapter builds on.