Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Chaos under load

Aditi runs SetuStream's pre-IPL chaos drill. On an idle staging cluster she kills one of three replicas of the catalogue service. The remaining two pick up the load instantly, p99 climbs from 18 ms to 22 ms, the killed pod restarts, and the dashboard goes back to green in 11 seconds. The runbook is signed off. Three weeks later, during the IPL final at 23M concurrent viewers, a real EC2 host failure takes out the same shape of pod. The remaining replicas are already at 78% CPU. Retries from the load balancer triple their incoming RPS in 400 ms. The connection pool to the downstream metadata service exhausts. p99 climbs from 240 ms to 14 seconds. The runbook said the system tolerates a single replica loss; the runbook tested it on an idle cluster. Chaos under load is the discipline that makes that drill produce a number you can take to production.

Chaos engineering on an idle system tests the failure-handling code path; chaos engineering under load tests the failure-handling code path plus the queueing dynamics, retry storms, connection-pool behaviour, and autoscaler reaction time that the failure triggers. Only the second one predicts what happens in production. The right drill combines a load generator running open-loop at the realistic offered load with a fault injector firing kernel-level, network-level, or process-level failures on a deterministic schedule, while a measurement layer captures p99 corrected for coordinated omission (CO), error rates, and the time-to-recovery curve. If your chaos drill does not include the load, you are testing a system that does not exist.

Why chaos without load lies

Every production failure has two parts: the failure itself (the pod dies, the network drops, the disk fills) and the system's response to the failure under whatever load it was carrying at the moment. Most chaos drills test only the first part. They run on staging clusters with 5% of production load, kill a pod, watch the system recover, and declare the system "resilient". The recovery on staging is fast and clean because there are no queued requests competing for the surviving capacity, no in-flight retries hammering the remaining replicas, no connection pools to exhaust, no autoscaler to wake up.

Production has all of those. When a PaisaBridge payment-API pod dies during the Diwali peak (8500 RPS sustained), the load balancer's health check takes 4 seconds to mark the pod unhealthy. Roughly 34000 requests arrive in that window (8500 RPS × 4 s), and the dead pod's share of them is sent to a dead address. Each of those requests times out at the client after 2 seconds and is retried up to 3 times, and the retry storm temporarily multiplies the offered load on the surviving pods by 1.4×. The surviving pods, already at 65% CPU before the failure, hit 91% CPU during the retry storm (65% × 1.4). Their tail latency climbs into the queueing knee: p99 goes from 95 ms to 680 ms. The autoscaler reacts at the next 30-second tick and decides to add 2 pods, but those pods take 90 seconds to start because of container pull, JVM warmup, and cold JIT. For at least 90 seconds the system is in a degraded state that the staging drill never produced, because staging never had the load to put the surviving pods near the queueing knee in the first place.

[Figure: Same chaos event, two outcomes (idle vs loaded system). Two side-by-side charts of p99 latency against time (s). Left panel, idle staging cluster (5% load): baseline 20 ms, pod killed at t=10s, p99 peaks at 24 ms (+20%) and is back at baseline by t=21s (recovery 11s). Right panel, loaded production cluster (78% CPU): baseline 240 ms, same pod kill at t=10s, p99 peaks at 14000 ms (+5800%) and decays slowly, with recovery complete at t=48s (recovery 38s).]
Same kernel-level pod kill, same Kubernetes deployment, same HPA configuration. The idle staging drill produces a tiny bump and an 11-second recovery — looks fine. The loaded production cluster produces a 14-second p99 spike and a 38-second recovery, because the surviving pods were already near the queueing knee and the retry storm pushed them past it. Illustrative — magnitudes vary by service shape, but the qualitative gap is universal.

Why the loaded recovery is structurally worse than the idle recovery, not just proportionally worse: when a single replica out of N dies, each survivor's share of the total load rises from 1/N to 1/(N-1). That is an absolute increase of 1/(N-1) - 1/N = 1/(N(N-1)) and a relative increase of 1/(N-1). For N=3 each survivor goes from 33% to 50% of total load, a 50% relative jump. On the idle cluster, 1.5 × the 5% baseline load is 7.5%, still well below the queueing knee, so latency barely moves. On the loaded cluster, 1.5 × 78% CPU is 117%, past 100%, which means the surviving pods cannot keep up at all: requests queue, queue depth grows linearly with time, and p99 grows with the queue depth. The loaded system is in a fundamentally different operating regime, governed by queueing dynamics (M/M/c near the knee, an ever-growing backlog past saturation) rather than by steady-state response time. This is why "scaled-down" chaos drills do not extrapolate.
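The arithmetic in the paragraph above can be checked in a few lines of Python. This is a minimal sketch, assuming load is spread evenly across replicas and that CPU utilisation scales linearly with load; the function names `survivor_utilisation` and `backlog_after` are illustrative, not from any library.

```python
def survivor_utilisation(replicas: int, utilisation: float, lost: int = 1) -> float:
    """Per-replica utilisation after `lost` replicas die, assuming even load spread."""
    survivors = replicas - lost
    if survivors <= 0:
        raise ValueError("no surviving replicas")
    # Total work is unchanged; it is simply divided among fewer replicas.
    return utilisation * replicas / survivors


def backlog_after(seconds: float, offered_rps: float, capacity_rps: float) -> float:
    """Requests queued after `seconds` when offered load exceeds capacity."""
    # Below capacity nothing accumulates; above it the backlog grows linearly.
    return max(0.0, offered_rps - capacity_rps) * seconds


idle = survivor_utilisation(3, 0.05)     # staging at 5% of capacity
loaded = survivor_utilisation(3, 0.78)   # production at 78% CPU
print(f"idle survivors:   {idle:.1%}")
print(f"loaded survivors: {loaded:.1%}")
```

Anything above 100% from `survivor_utilisation` means the survivors are past saturation, which is the regime where `backlog_after` starts returning a non-zero, linearly growing queue.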

A runnable chaos-under-load harness

The right chaos drill runs three components together: a load generator firing open-loop at realistic peak RPS, a fault injector firing failures on a deterministic schedule, and a measurement layer capturing CO-corrected latency, error rate, and recovery time. The Python script below orchestrates all three using the subprocess module to drive wrk2 (load), tc (network fault), and kubectl delete pod (process fault), and parses the CO-corrected percentile table that wrk2 prints at the end of the run. It reports aggregate percentiles only; a per-second recovery curve needs an additional per-interval data source.

# chaos_under_load.py: orchestrate load + fault + measurement together
# Runs wrk2 against a target service, injects one fault on a schedule,
# then parses the aggregate percentile table that wrk2 prints at the end.
import re, subprocess, sys, threading, time

TARGET_URL   = sys.argv[1] if len(sys.argv) > 1 else "http://catalogue.svc:8080/items/popular"
TARGET_RPS   = int(sys.argv[2]) if len(sys.argv) > 2 else 8500     # PaisaBridge-like peak
DURATION_S   = int(sys.argv[3]) if len(sys.argv) > 3 else 120
FAULT_AT_S   = 30                                  # inject fault 30s into the run
FAULT_TYPE   = "kill_pod"                          # kill_pod | net_delay | net_loss
TARGET_POD   = "catalogue-7d9b-x4f2c"              # the victim
SLO_P99_MS   = 500.0

def fault_kill_pod(pod):
    return subprocess.run(["kubectl", "delete", "pod", pod, "--grace-period=0", "--force"],
                          capture_output=True, text=True)

def fault_net_delay(iface="eth0", delay_ms=300):
    return subprocess.run(["sudo", "tc", "qdisc", "add", "dev", iface, "root",
                           "netem", "delay", f"{delay_ms}ms"], capture_output=True, text=True)

def fault_net_loss(iface="eth0", pct=10):
    return subprocess.run(["sudo", "tc", "qdisc", "add", "dev", iface, "root",
                           "netem", "loss", f"{pct}%"], capture_output=True, text=True)

def fire_fault():
    print(f"[{time.time():.1f}] FAULT: injecting {FAULT_TYPE}")
    if FAULT_TYPE == "kill_pod":      return fault_kill_pod(TARGET_POD)
    if FAULT_TYPE == "net_delay":     return fault_net_delay()
    if FAULT_TYPE == "net_loss":      return fault_net_loss()
    raise ValueError(FAULT_TYPE)

# Schedule the fault to fire FAULT_AT_S seconds from now
fault_thread = threading.Timer(FAULT_AT_S, fire_fault)
fault_thread.start()

# Run wrk2 at a constant arrival rate. With -L (and --u_latency) wrk2 prints its
# HdrHistogram percentile tables to stdout once, at the end of the run, so this
# script only sees aggregate percentiles. Realistic production harnesses get the
# per-second trace from a sidecar that scrapes /metrics; we parse stdout here.
print(f"[{time.time():.1f}] LOAD: wrk2 -R{TARGET_RPS} for {DURATION_S}s against {TARGET_URL}")
load_proc = subprocess.Popen(
    ["wrk2", "-t", "8", "-c", "256", "-d", f"{DURATION_S}s",
     "-R", str(TARGET_RPS), "-L", "--u_latency", TARGET_URL],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# Drain stderr on a background thread so a full pipe cannot block wrk2.
def drain(stream, label):
    for line in stream: print(f"[{label}] {line.rstrip()}")
threading.Thread(target=drain, args=(load_proc.stderr, "wrk2-err"), daemon=True).start()

# Read stdout directly: communicate() would also read stderr and race the drain thread.
stdout = load_proc.stdout.read()
load_proc.wait()
fault_thread.cancel()

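# Parse the percentile tables from wrk2 stdout. With --u_latency wrk2 prints the
# CO-corrected ("Recorded") table first and the uncorrected table second, so track
# which table each row belongs to instead of letting the second overwrite the first.
UNIT_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0, "m": 60000.0}
percentiles, uncorrected = {}, {}
table = percentiles
for line in stdout.splitlines():
    if "Uncorrected Latency" in line: table = uncorrected
    m = re.match(r"\s*([\d.]+)%\s+([\d.]+)(us|ms|s|m)\b", line)
    if not m: continue
    table[float(m.group(1))] = float(m.group(2)) * UNIT_TO_MS[m.group(3)]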

p99    = percentiles.get(99.0, float("inf"))
p999   = percentiles.get(99.9, float("inf"))
p9999  = percentiles.get(99.99, float("inf"))
m_rps  = re.search(r"Requests/sec:\s+([\d.]+)", stdout)
ach    = float(m_rps.group(1)) if m_rps else 0.0

print(f"\n=== CHAOS-UNDER-LOAD RESULT ===")
print(f"Target RPS / Achieved RPS: {TARGET_RPS} / {ach:.0f}    shortfall {100*(1-ach/TARGET_RPS):.1f}%")
print(f"p99 (CO-corrected):    {p99:8.1f} ms    SLO: {SLO_P99_MS} ms")
print(f"p99.9:                 {p999:8.1f} ms")
print(f"p99.99:                {p9999:8.1f} ms")
print(f"VERDICT: {'BREACH' if p99 > SLO_P99_MS or ach < TARGET_RPS * 0.95 else 'within SLO'}")

Sample output running this against a Kubernetes-deployed catalogue service at 8500 RPS, with kill_pod injected at t=30s:

[1714053920.4] LOAD: wrk2 -R8500 for 120s against http://catalogue.svc:8080/items/popular
[1714053950.4] FAULT: injecting kill_pod
[wrk2-err] Running 2m test @ http://catalogue.svc:8080/items/popular
[wrk2-err]   8 threads and 256 connections
[wrk2-err]   Thread Stats   Avg      Stdev     Max
[wrk2-err]     Latency   312.40ms  1180.20ms   18.40s
[wrk2-err]     Req/Sec     1.04k    312.10     1.62k

  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   42.10ms
   75.000%  118.40ms
   90.000%  680.20ms
   99.000%   4.82s
   99.900%  12.40s
   99.990%  18.10s
  Latency Distribution (HdrHistogram - Uncorrected Latency)
   50.000%   38.10ms
   75.000%   62.40ms
   90.000%  140.20ms
   99.000%  680.40ms
   99.900%   1.42s
   99.990%   2.10s
  982134 requests in 2.00m, 412.18MB read
  Non-2xx or 3xx responses: 8412
Requests/sec:   8184.45

=== CHAOS-UNDER-LOAD RESULT ===
Target RPS / Achieved RPS: 8500 / 8184    shortfall 3.7%
p99 (CO-corrected):    4820.0 ms    SLO: 500.0 ms
p99.9:                12400.0 ms
p99.99:               18100.0 ms
VERDICT: BREACH

Walking the key lines. fault_thread = threading.Timer(FAULT_AT_S, fire_fault) schedules the fault to fire deterministically at a known offset into the load run, so the post-mortem can correlate the latency spike to the exact second the fault landed; unrecorded, randomised fault timing makes the recovery curve impossible to read. "-R", str(TARGET_RPS) is the load-bearing flag for wrk2: it forces open-loop generation at the configured rate, so when the killed pod backs up the surviving replicas, wrk2 keeps firing requests into the queue and measures the full queue-wait time as part of the latency. A closed-loop generator such as the original wrk would back off when responses slow, and the spike in the histogram would never appear. p99 (CO-corrected): 4820.0 ms versus the uncorrected p99 of 680 ms is the smoking gun: the uncorrected number under-reports the recovery cost by 7×, which is exactly the gap that lets idle-staging chaos drills look "fine" while production drills look catastrophic. Non-2xx or 3xx responses: 8412 counts the responses that came back with an error status during the run. The fact that Achieved RPS is only 3.7% below target tells you the load generator kept firing, so the errors are real failed requests during the recovery window, not unsent ones.

Why the per-second timeseries matters more than the aggregate percentiles for chaos drills: a 30-second p99 window that includes the fault injection averages a small number of catastrophic samples with many fast samples, producing a "bad but not terrible" number. The honest signal is the per-second p99 trace — it shows the actual height and duration of the latency cliff. A 10-second cliff at p99=14s is a service-degradation incident; a 90-second cliff at p99=4s is a major outage. The aggregate p99 number cannot distinguish these, but the per-second trace can. Production chaos drills should always emit a per-second percentile trace (HdrHistogram supports interval logging via record_value + periodic reset cycles, k6 emits per-second trends natively, Gatling emits per-1s trace lines in simulation.log).
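A per-second percentile trace is easy to compute once the load generator logs one record per request. The sketch below assumes a hypothetical log of (intended send time in seconds, latency in ms) pairs, with latency measured from the intended send time so the numbers stay CO-corrected; `per_second_percentile` is an illustrative name, and the synthetic samples are made up for the demonstration.

```python
import math
from collections import defaultdict


def per_second_percentile(samples, pct=99.0):
    """samples: iterable of (send_time_s, latency_ms). Returns {second: percentile_ms}."""
    buckets = defaultdict(list)
    for sent_at, latency_ms in samples:
        # Bucket by the second the request was *meant* to be sent.
        buckets[int(sent_at)].append(latency_ms)
    trace = {}
    for second, values in sorted(buckets.items()):
        values.sort()
        # Nearest-rank percentile: smallest value with at least pct% of samples at or below it.
        rank = max(1, math.ceil(pct * len(values) / 100.0))
        trace[second] = values[rank - 1]
    return trace


# Synthetic example: 100 requests per second for 3 seconds, with a latency
# cliff in second 1 where 10% of requests take 4 s.
samples = (
    [(0 + i / 100, 20.0) for i in range(100)]
    + [(1 + i / 100, 20.0 if i < 90 else 4000.0) for i in range(100)]
    + [(2 + i / 100, 25.0) for i in range(100)]
)
trace = per_second_percentile(samples)
for second, p99_ms in trace.items():
    print(f"t={second}s p99={p99_ms:.0f} ms")
```

For long runs at high RPS an HdrHistogram per interval would use far less memory than sorting raw samples; the sorted-list version is used here only because it is self-contained.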

The fault families that matter — what to inject and why

Production failures cluster into a small set of structural shapes. The chaos drill needs to cover each shape because each one triggers a different recovery code path with different timing characteristics. Skipping any of them leaves a class of bug invisible until it hits production.

Process-level faults kill or pause an individual process or container. kill -9 <pid>, kubectl delete pod, docker kill --signal=KILL, nsenter --target $PID --pid kill -9 1. These exercise the orchestrator's failure detection (Kubernetes liveness probe latency, EC2 ASG health-check interval, ECS task health gate), the load balancer's deregistration latency (typically 5–30 seconds depending on configured deregistration_delay), and the surviving replicas' queue-absorption capacity. The recovery time is dominated by the orchestrator's reaction time, not the application's. At SetuStream this is the standard IPL pre-season drill — random pod kills against the live catalogue service at 1.2× expected peak RPS.

Network-level faults introduce delay, loss, partition, or bandwidth caps using tc qdisc add dev eth0 root netem delay 300ms, netem loss 10%, iptables -A INPUT -p tcp --dport 6379 -j DROP (full partition), or AWS VPC Network ACL changes. These exercise the client's timeout settings, retry policy, circuit breaker thresholds, and the upstream's connection pool behaviour. The most insidious shape is netem delay 300ms loss 1% corrupt 0.1% — the slow-and-flaky network that does not trip the circuit breaker (because requests do eventually succeed) but pushes every latency percentile upward and triggers slow retries. PaisaBridge's drills include a deliberate tc netem against the connection to the bank-rail downstream, because that link routinely degrades during real production peaks.

Resource exhaustion faults consume CPU, memory, inodes, file descriptors, or disk space on the target host: stress-ng --cpu 8 --cpu-load 90, stress-ng --vm 4 --vm-bytes 90%, for i in $(seq 1 100000); do touch /tmp/$i; done (inode exhaustion; touch closes each file immediately, so exhausting file descriptors instead requires holding files or sockets open), dd if=/dev/zero of=/tmp/fill bs=1M count=10000 (disk fill). These exercise the application's behaviour under degraded host conditions: JIT recompilation slowdown, GC tuning failure, log file write blocking. The SetuStream IPL drill includes a CPU-burn fault on a single replica because real bursty IPL traffic does this naturally when one replica gets wedged with a hot key.

Dependency faults fault the downstream service the application calls — DB query timeout, Redis eviction storm, S3 throttle, third-party API 503. These exercise the application's error-handling code path, retry budget, fallback logic (cached value? default value? failed request?), and degraded-mode behaviour. The ParakhTrade chaos drill includes deliberate slowness injection on the order-matching engine's persistence layer to verify that the matching path falls back to in-memory order book during the spike.

Data-corruption faults introduce malformed or unexpected inputs at the API boundary — oversized payloads, malformed JSON, missing required headers, encoded character sets the parser cannot handle. These exercise input validation, parser robustness, and exception-handling timing (catching an exception is not free — Python's try/except for a deeply-nested raise costs 50–500 µs of CPU). DigiPaisa's chaos drill includes deliberate malformed UPI message injection because real UPI traffic occasionally produces corrupted message frames.

[Figure: Five fault families, each testing a different recovery code path. One row per family, giving the injection tool, the code path exercised, and a note. Process: kubectl delete pod; orchestrator failure detection + LB deregistration; recovery time 5–30s, dominated by health-check interval. Network: tc netem delay/loss; client timeouts + retry policy + circuit breaker thresholds; slow-and-flaky is worst because it does not trip the circuit breaker but pushes p99. Resource: stress-ng --cpu / --vm; JIT slowdown + GC tuning + log-write blocking; single hot replica wedged, the bursty IPL traffic shape. Dependency: downstream timeout / 503; retry budget + fallback logic + degraded-mode behaviour; ParakhTrade matching falls back to in-memory order book. Data: malformed JSON / oversized payload; parser robustness + exception-handling cost (50–500 µs); DigiPaisa UPI corrupted message frames in real traffic.]
Each fault family exercises a distinct code path with distinct recovery timing characteristics. A drill that only kills pods will never catch the bug in the network-timeout path; a drill that only injects network faults will never catch the bug in the orchestrator's failure-detection path. Cover all five over the chaos calendar.

Why running these in isolation versus combined matters: production failures rarely arrive as a single fault. The SetuStream IPL final at 23M concurrent viewers experienced a sequence — an EC2 host failure (process fault), followed by retry storm from clients (load amplification), followed by Redis eviction pressure on the surviving replicas because the retry traffic missed the cache (dependency fault), followed by JVM GC pause on the loaded replicas (resource fault). The four faults arrived in 6 seconds and compounded. A chaos drill that injects only one fault type catches only one shape of bug; a drill that injects two faults 2 seconds apart catches the interaction bugs that production produces. Mature chaos programs (Streamora, Riverone, PaisaBridge's 2025 drills) inject combinations on a schedule, with the second fault landing during the recovery from the first.
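Landing a second fault during the recovery from the first needs a schedule rather than a single timer. The following is a minimal sketch of one way to do that; `run_schedule` and `FakeClock` are hypothetical helpers, the fault actions are placeholders that only record their names, and the injectable clock exists purely so the example runs instantly instead of waiting 32 real seconds.

```python
import time


def run_schedule(schedule, sleep=time.sleep, clock=time.monotonic):
    """Fire each (offset_s, name, action) at its offset from the start of the drill.

    Returns a log of (name, actual_offset_s) so the recovery curve can be lined
    up against the moment each fault actually landed.
    """
    start = clock()
    log = []
    for offset_s, name, action in sorted(schedule, key=lambda item: item[0]):
        remaining = offset_s - (clock() - start)
        if remaining > 0:
            sleep(remaining)
        action()
        log.append((name, clock() - start))
    return log


class FakeClock:
    """Stand-in clock for the demonstration: sleeping just advances `now`."""
    def __init__(self):
        self.now = 0.0
    def clock(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds


fired = []
schedule = [
    (32.0, "net_delay", lambda: fired.append("net_delay")),  # 2 s into the recovery
    (30.0, "kill_pod", lambda: fired.append("kill_pod")),    # first fault
]
fake = FakeClock()
log = run_schedule(schedule, sleep=fake.sleep, clock=fake.clock)
print(log)
```

In a real drill the default `time.sleep` and `time.monotonic` would be used and each action would call the actual injector; the returned log is what gets attached to the post-mortem.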

The drill cadence — when, how often, with how much load

A chaos drill is not a single event; it is a continuous discipline. The cadence and intensity matter as much as the fault type. The table below comes from observed production practice at Indian-scale operators — PaisaBridge, SetuStream, ParakhTrade, BharatBazaar — and represents what works versus what looks good on a slide.

| Drill type | Cadence | Load level | Fault types | Pass criteria |
|---|---|---|---|---|
| Smoke chaos (CI) | every PR | 5% peak | single pod kill | recovery < 30s, no SLO breach |
| Daily chaos (staging) | daily 02:00 IST | 30% peak | random fault from family menu | per-second p99 returns to baseline within 60s |
| Weekly chaos (staging) | Wednesday 14:00 IST | 100% peak | combined faults (2 within 5s) | error rate < 0.1% during drill, recovery < 90s |
| Pre-event drill (prod) | week before Diwali/IPL/BBD | 130% expected peak | full fault menu over 4 hours | no customer-visible degradation, p99 within 2× SLO |
| Game day (prod) | quarterly | live production traffic | single, announced, well-fenced fault | runbook executed end-to-end, all alerts fired correctly |
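Pass criteria such as "per-second p99 returns to baseline within 60s" only work as gates if "returns to baseline" is defined precisely. One possible definition, sketched below, is the first second after the fault from which p99 stays within a tolerance of baseline for several consecutive seconds. The function name, the 1.2× tolerance, and the 5-second hold are all assumptions for illustration, not values from the drills described here.

```python
def time_to_recovery(trace, fault_at_s, baseline_ms, tolerance=1.2, hold_s=5):
    """Seconds from the fault until p99 is back within tolerance * baseline
    and stays there for hold_s consecutive seconds.

    trace: {second: p99_ms}. Seconds missing from the trace count as not
    recovered. Returns None if the trace never recovers.
    """
    limit = baseline_ms * tolerance
    streak_start = None
    for second in range(fault_at_s, max(trace) + 1):
        p99 = trace.get(second)
        if p99 is not None and p99 <= limit:
            if streak_start is None:
                streak_start = second
            # The hold requirement guards against declaring recovery on a brief dip.
            if second - streak_start + 1 >= hold_s:
                return streak_start - fault_at_s
        else:
            streak_start = None
    return None


# Synthetic trace: 95 ms baseline, fault at t=30s, 38 s at 680 ms, then ~100 ms.
trace = {s: 95.0 for s in range(30)}
trace.update({s: 680.0 for s in range(30, 68)})
trace.update({s: 100.0 for s in range(68, 120)})

recovery_s = time_to_recovery(trace, fault_at_s=30, baseline_ms=95.0)
print("recovery:", recovery_s, "s;", "PASS" if recovery_s is not None and recovery_s <= 60 else "FAIL")
```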

The smoke chaos in CI runs as part of every PR that touches a hot path. It uses pumba or chaos-mesh to kill one pod in the staging cluster while a k6 script runs at 5% peak load for 60 seconds. The pass criterion is "recovery within 30 seconds, no SLO breach". PaisaBridge added this in 2024 and it caught two regressions in the first month — both were inadvertent removals of retry-with-backoff logic that the developers thought was unused. The compute cost is ~₹120 per PR, paid for in three months by one Diwali-night incident saved.

The daily chaos run in staging fires a randomly-chosen fault from the family menu at a randomly-chosen time inside a 30-minute window. The randomisation matters — a deterministic fault schedule produces a system that is good at handling the deterministic fault, not at handling the unknown fault. The randomness simulates real production failure distribution, which is itself random. SetuStream runs this nightly on the catalogue and stream-routing services; the drill catches three to five regressions per quarter, most of them in third-party library upgrades that subtly change retry behaviour.

The pre-event drill is the high-value drill — the week before IPL, Diwali, BBD, or Tatkal-window changes — that runs at 130% of expected peak load for 4 hours, with the full fault menu cycling through every 15 minutes. The 130% target is deliberately above expected peak so that the drill exercises the autoscaler's reaction time and the at-saturation queueing dynamics. The 4-hour duration is chosen to expose memory leaks, file-descriptor leaks, and connection-pool exhaustion that only surface over time. SetuStream's 2024 IPL pre-event drill caught a JVM old-gen growth bug that would have produced a GC death-spiral 3 hours into the IPL final — without the drill, the bug would have hit during the Mumbai Indians vs CSK final in front of 30M concurrent viewers.

The game day is the rarest and most ceremonial drill — once a quarter, on a calendar invite circulated two weeks in advance, with the entire SRE team in the war room. A single fault is injected against live production traffic, with the full runbook executed end-to-end. The pass criterion is procedural: every alert fires correctly, every escalation path resolves to a human within SLA, every dashboard shows the expected signal. Game days are not about whether the system survives the fault (the previous tiers already established that) — they are about whether the humans and tooling respond correctly. ParakhTrade's Q3 2025 game day intentionally degraded the order-matching latency by 80 ms during the 14:00 IST market hours; the drill discovered that the on-call runbook was 18 months out of date and the alert routing pointed to a Slack channel that had been archived.

The blast-radius control is what makes game days possible at all. Real production game days fence the fault to a small percentage of traffic — Streamora's ChAP routes 1% of traffic through the faulted path, PaisaBridge's UPI game days fence by merchant category code so the drill never affects a critical merchant, SetuStream's IPL game days run on the catalogue tier (degrades user experience) but never on the playback tier (loses streaming sessions). The blast-radius design is part of the drill spec, reviewed and approved before the calendar invite goes out. A game day without explicit blast-radius fencing is not a game day; it is an outage.
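The source does not describe how any of these operators implement their fencing. One common approach, sketched here as an assumption, is a deterministic hash of a stable request key (user, session, or merchant ID) so that the same key always lands inside or outside the faulted cohort for the whole drill. The function name and the salt value are hypothetical.

```python
import hashlib


def in_blast_radius(request_key: str, percent: float, salt: str = "gameday-demo") -> bool:
    """Deterministically place roughly `percent`% of keys in the faulted cohort."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999, i.e. 0.01% steps
    return bucket < percent * 100


# Rough check of the cohort size for a 1% blast radius over synthetic keys.
keys = [f"user-{i}" for i in range(20_000)]
hits = sum(in_blast_radius(key, 1.0) for key in keys)
print(f"{hits / len(keys):.2%} of keys routed through the faulted path")
```

Changing the salt per drill reshuffles the cohort, so the same customers are not the ones exposed every quarter; setting `percent` to 0 acts as the abort switch.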

The cadence is also the budget. A team that runs daily staging chaos plus weekly combined chaos plus quarterly game days spends roughly 8% of its SRE engineering capacity on the chaos discipline — meaningful but not crushing. A team that runs only a quarterly game day spends 1% and discovers most of its bugs during real Diwali nights instead of during scheduled drills. The 8% investment is what separates the operators who sleep on Diwali night from the operators who get paged.

Common confusions

  • "Chaos engineering and load testing are different things." They are the same thing on different axes. Chaos engineering without load is just unit-testing the failure-handling code. Load testing without chaos is just measuring steady-state capacity. Production failure modes live at the intersection — a fault during a load spike. The right harness drives both axes simultaneously, which is why this chapter exists between /wiki/load-testing-wrk-k6-gatling and /wiki/headroom-peak-and-degraded-modes.
  • "Random fault injection is more rigorous than scheduled fault injection." No — for the smoke and daily tiers, randomised faults catch unknown bugs; for the pre-event and game-day tiers, scheduled faults are required because you need to correlate the recovery curve to the exact fault, and you need humans positioned to observe specific signals. Use random for "is the system robust"; use scheduled for "did this specific recovery path work as designed".
  • "Chaos drills should not run in production." The opposite — staging chaos cannot reproduce the real network topology, the real load shape, the real third-party dependencies, or the real autoscaler behaviour. Game-day production chaos, with proper fencing (small blast radius, well-announced, abort button ready, runbook in hand), is required to validate that the drill's findings translate. Streamora's Chaos Monkey ran in production from day one; PaisaBridge's quarterly game day runs against live UPI traffic with a 1% blast radius.
  • "If staging passes the chaos drill, production will too." No — staging never has the production scale, the production traffic distribution, or the production network topology. A staging cluster of 3 replicas at 5% load behaves under fault injection completely differently from a production cluster of 60 replicas at 78% load with retry traffic from 200M clients. The staging drill validates the code path; the production drill validates the system behaviour. Both are required.
  • "The recovery is fine if no errors are returned." No — silent slow degradation is worse than visible failure. A service that takes 14 seconds to respond is "available" by the strict 200-OK definition but unusable for any user. The pass criterion must include both error rate AND latency percentiles, with the latency criterion using CO-corrected p99 from an open-loop generator (see the load-testing chapter for why).
  • "Chaos drills are about resilience." They are about predictability. A resilient system that recovers in 30 seconds during the drill but in 8 minutes in production is worse than a brittle system that fails consistently in 60 seconds in both. The drill's purpose is to build a calibrated mental model of "if X happens, the system does Y for Z seconds" — that calibration is what the on-call engineer at 03:00 AM relies on, not abstract resilience.

Going deeper

chaos-mesh and litmus — the Kubernetes-native fault injectors

For Kubernetes-deployed services, chaos-mesh (CNCF, originally PingCAP) and litmus (CNCF, originally MayaData) are the standard fault-injection tools. They run as Kubernetes operators, accept custom resources that declaratively specify the fault (chaos-mesh's kinds include PodChaos, NetworkChaos, IOChaos, StressChaos, and TimeChaos), and target pods using label selectors. The advantage over hand-rolled kubectl delete pod scripts is that the fault is governed: the CRD specifies the duration, the blast radius, the abort condition, and the audit trail. PaisaBridge's chaos drills use chaos-mesh because the CRDs integrate with the GitOps deployment flow, so the drill itself is reviewed, approved, and merged like any other code change. The Python harness from earlier in this chapter is the right shape for ad-hoc drills; chaos-mesh is the right shape for the institutional discipline.

Coordinated omission in chaos drills — the second source of lying

The load-testing chapter covered coordinated omission in detail. Chaos drills add a second source of CO: the load generator itself may stop sending requests during the fault if the connection pool exhausts. In a wrk2 run with 256 connections, if 30% of the connections become useless because the killed pod's TCP sockets time out, the offered load effectively drops by about 30% during the fault window, and the resulting histogram under-counts the latency tail accordingly. The fix is to oversize the connection pool (4× the expected production fan-in), to use vegeta -keepalive=false for fault windows where connection reuse becomes a liability, and to monitor the achieved RPS during the fault window: if it drops more than 5% below target, the latency numbers are CO-suspect and need re-running with a fresh connection pool.
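The 5% rule can be applied per second rather than over the whole run, which is where a stalled connection pool shows up. This sketch assumes the load generator can log the send time of every request (wrk2's summary output does not provide this, so a per-request log is an assumption); `co_suspect_seconds` is an illustrative name.

```python
from collections import Counter


def co_suspect_seconds(send_times_s, target_rps, start_s, end_s, max_shortfall=0.05):
    """Seconds in [start_s, end_s) where sends fell more than max_shortfall below target."""
    sent = Counter(int(t) for t in send_times_s)
    floor = target_rps * (1.0 - max_shortfall)
    return [s for s in range(start_s, end_s) if sent.get(s, 0) < floor]


# Synthetic log: 1000 RPS target over 10 s, but only 700 sends in second 5.
send_times = []
for second in range(10):
    n = 700 if second == 5 else 1000
    send_times.extend(second + i / n for i in range(n))

suspect = co_suspect_seconds(send_times, target_rps=1000, start_s=0, end_s=10)
print("CO-suspect seconds:", suspect)
```

Any second this returns inside the fault window means the percentiles for that window describe a lighter load than the one the drill claims to have offered.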

Steady-state hypothesis — the formal framing from the principles of chaos

The Principles of Chaos Engineering (Basiri et al., Streamora, 2016) frame the discipline around a steady-state hypothesis — a falsifiable prediction about the system's measurable behaviour that should hold under fault injection. "p99 latency stays below 500 ms during single pod loss at 80% peak load" is a steady-state hypothesis; "the system is resilient" is not. The discipline of writing the hypothesis before injecting the fault forces clarity about what the drill is testing — which metric, at what threshold, under what conditions — and produces a binary pass/fail rather than the squishy "well, it kind of recovered" judgement that informal chaos drills produce. Every chaos drill ticket at PaisaBridge opens with a steady-state hypothesis and closes with a yes/no answer; the institutional clarity this produces is worth the small amount of upfront design work.
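Writing the hypothesis down as data makes the yes/no answer mechanical. A minimal sketch, assuming a drill ticket carries a latency threshold and an error-rate threshold: the class name and fields are hypothetical, and the 0.1% error-rate limit is an assumed value, since the example hypothesis in the text names only latency. The observed numbers are the ones from the sample run earlier in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    description: str
    max_p99_ms: float
    max_error_rate: float

    def evaluate(self, p99_ms: float, errors: int, requests: int) -> bool:
        """True only if the observed run stays within both thresholds."""
        # A run with no requests cannot confirm anything, so treat it as failed.
        error_rate = errors / requests if requests else 1.0
        return p99_ms <= self.max_p99_ms and error_rate <= self.max_error_rate


hypothesis = SteadyStateHypothesis(
    description="p99 latency stays below 500 ms during single pod loss at 80% peak load",
    max_p99_ms=500.0,
    max_error_rate=0.001,   # assumed threshold, not from the source
)

holds = hypothesis.evaluate(p99_ms=4820.0, errors=8412, requests=982134)
print(hypothesis.description, "->", "PASS" if holds else "FAIL")
```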

Reproduce this on your laptop

# Install wrk2, chaos-mesh (or pumba for Docker), and the Python parser
sudo apt install build-essential libssl-dev libz-dev git
git clone https://github.com/giltene/wrk2 && cd wrk2 && make && sudo cp wrk /usr/local/bin/wrk2
docker pull gaiaadm/pumba    # lightweight Docker chaos tool, no k8s needed

python3 -m venv .venv && source .venv/bin/activate
pip install hdrh requests

# Start a tiny target service in 3 containers behind nginx
docker run -d --name=svc1 -p 8081:80 nginxdemos/hello
docker run -d --name=svc2 -p 8082:80 nginxdemos/hello
docker run -d --name=svc3 -p 8083:80 nginxdemos/hello
# (set up nginx LB on :8080 -> 8081/8082/8083)

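# Run the chaos-under-load harness in the background. The fault here is injected
# by hand with docker pause; without a Kubernetes cluster the script's built-in
# kill_pod fault at t=30s fails in its timer thread and does not affect the run.
python3 chaos_under_load.py http://localhost:8080/ 1000 60 &
sleep 15
docker pause svc2     # inject a fault: pause one container
sleep 30
docker unpause svc2
wait                  # let the harness finish and print its percentile summary

# Compare the reported p99 with a run where no container is paused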

Edit TARGET_RPS upward to find the load level at which a single-replica fault breaches your local SLO; the gap between idle and loaded fault behaviour will be visible in the wrk2 percentile output.

Where this leads next

This chapter sits between the load-testing foundation and the operational disciplines that turn measurements into production confidence.

The closing rule: a chaos drill that does not run under realistic load is testing a system that does not exist; a chaos drill under load that does not measure CO-corrected p99 is lying about its own recovery numbers; a chaos drill that runs only on staging is validating the code path, not the system. Hold all three together and the drill produces a number you can take to the IPL final, the Diwali peak, the GSTN deadline. Skip any one and the drill is theatre.

References