Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Fault injection at the platform level: when chaos becomes infrastructure

It is a Tuesday afternoon at PaySetu and a senior engineer named Ananya is typing kubectl exec into a production pod to add 200ms of latency on the payments-api egress to the fraud-scoring service. She has done this nineteen times this quarter. Each time she writes the same tc qdisc command, each time she sets a phone alarm to undo it in 4 minutes, and once last month she forgot the alarm and the latency stayed in for 31 minutes before the on-call paged her. The problem is not that Ananya is reckless. The problem is that fault injection at PaySetu has not graduated from a script to a platform — there is no central place that knows what's running, no automatic timeout, no audit log, no steady-state guard, and no one-click rollback. This chapter is about that graduation: what changes when you stop running tc by hand and start treating fault injection the way you treat deploys.

Platform-level fault injection means a single API for "inject this fault on this target for this duration" that owns target selection, blast-radius enforcement, automatic timeout, audit logging, steady-state observation, and one-click rollback. The mechanism is usually a data plane attached at deploy time (Envoy sidecar filters, eBPF programs, or kernel netfilter rules), controlled by a central scheduler. The non-obvious win is not that injection becomes easier — it's that every fault now has a known operator, a known scope, a known revert path, and a known steady state. The move from script to platform is the move from "tribal chaos engineering" to "chaos engineering as a service".

What "platform" means here, and why a script is not enough

A fault-injection script is a bash one-liner — tc qdisc add dev eth0 root netem delay 200ms on a specific pod. It works. It also has every property a production-grade tool must not have: no timeout, no audit, no scope, no abort path, no observability hook, no permission model, and no awareness that another script is already running. Two engineers running the same one-liner on the same pod produce undefined behaviour. A pod restart silently wipes the state. The on-call who finds latency at 3am has no way to know whether it's an experiment or a real fault.

A platform replaces all of that with one API call:

POST /injections
{
  "target": {"service": "payments-api", "selector": "shard=south-1", "max_pods": 3},
  "fault":  {"type": "latency", "p50_ms": 200, "p99_ms": 500},
  "duration_seconds": 240,
  "steady_state": {"slo": "merchant-write-success", "min": 0.995, "abort_after": 60},
  "hypothesis": "fraud-scoring 200ms degradation must not breach merchant-write SLO",
  "owner": "ananya@paysetu.in"
}

Six fields, every one of them load-bearing. target.max_pods=3 is the blast-radius cap — even if the selector matches 200 pods, only 3 are affected. duration_seconds=240 is the dead-man's switch; the fault auto-clears at T+240 even if the controller crashes. steady_state is the abort condition — if the merchant-write success rate falls below 99.5% for more than 60 seconds, the platform reverts the fault without waiting for a human. owner is the audit trail. hypothesis is the falsifiable claim that justifies the experiment existing. None of these are optional. A platform that lets you skip any of them has reverted to being a script with extra steps.
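
The platform can enforce that rule mechanically. Here is a minimal sketch of the request validation a control plane might run before accepting an injection; the field names mirror the request above, while the handler, the hard caps, and the error strings are hypothetical:

REQUIRED = {"target", "fault", "duration_seconds", "steady_state", "hypothesis", "owner"}
HARD_CAPS = {"max_pods": 10, "duration_seconds": 900}   # illustrative platform-wide ceilings

def validate_injection(req: dict) -> list:
    """Reject any request that skips a load-bearing field or tries to
    exceed the platform's hard blast-radius and duration ceilings."""
    errors = [f"missing field: {f}" for f in REQUIRED - req.keys()]
    if errors:
        return errors
    if req["target"].get("max_pods", 0) < 1:
        errors.append("target.max_pods must be an explicit positive cap")
    elif req["target"]["max_pods"] > HARD_CAPS["max_pods"]:
        errors.append(f"target.max_pods exceeds platform cap of {HARD_CAPS['max_pods']}")
    if req["duration_seconds"] > HARD_CAPS["duration_seconds"]:
        errors.append("duration_seconds exceeds platform cap; no open-ended faults")
    return errors   # empty list means accepted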

[Figure: graduation, from script to platform. Left, the script (tc qdisc add dev eth0 root netem delay 200ms) and everything it lacks: no timeout (alarm-clock revert), no audit (only your shell history), no scope (affects whatever pod you ran on), no abort (the fault stays through an SLO breach), no permissions (anyone with kubectl exec), no concurrency control (two scripts collide silently), no steady-state guard (you watch the graph), no reliable revert (a pod restart hides state). Right, the platform (POST /injections with target, fault, duration, slo) and the guarantees it owns: dead-man timeout with auto-revert at T+N, audit log of who, what, when, and scope, selector plus max-pods blast-radius cap, steady-state abort condition, RBAC with owner, approver, and on-call awareness, concurrency manager so faults never overlap, SLO observer wired to the abort path, one-click rollback that survives restarts. Illustrative.]
The platform's job is to convert each missing property of the script into an enforced guarantee.

How the fault is actually injected: sidecars, eBPF, and kernel hooks

The control plane (the API above) is the easy half. The hard half is the data plane — the thing that actually slows a packet, drops a connection, or returns a 503. There are three mainstream mechanisms, each with a different cost-and-precision profile.

Sidecar proxy injection (Envoy / Istio fault filters). Every pod ships with a sidecar — Envoy or similar — that already mediates all in/out traffic. The platform pushes a config update via xDS that adds a fault filter: delay, abort, or bandwidth. The filter intercepts requests at L7 and applies the fault. Why: the sidecar already terminates the connection at the application boundary, so adding a delay is a single conditional in an existing critical path — no new container, no kernel-level surgery. Cost: works only for traffic going through the sidecar (mesh members), not for raw TCP connections to managed databases bypassing the mesh.
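
For concreteness, the xDS push for the 200ms delay would materialise as a fault filter in the sidecar's HTTP filter chain. A sketch of the JSON form of Envoy's v3 HTTPFault config (Envoy's filter injects a fixed delay, so the platform would translate the p50/p99 spec into one or more filters like this; the exact shape depends on your Envoy version):

{
  "name": "envoy.filters.http.fault",
  "typed_config": {
    "@type": "type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault",
    "delay": {
      "fixed_delay": "0.200s",
      "percentage": {"numerator": 100, "denominator": "HUNDRED"}
    }
  }
}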

eBPF programs at the kernel boundary. A privileged DaemonSet loads an eBPF program that hooks into tc (traffic control) or XDP (eXpress Data Path). The program inspects packets in kernel space and drops/delays/corrupts them based on filters set by the platform. Why: eBPF runs inside the kernel and sees every packet regardless of which container or sidecar emitted it — so you can target traffic from any pod to any destination, including managed dependencies the mesh cannot see. Cost: requires kernel ≥ 4.18 with BPF_PROG_TYPE_SCHED_CLS enabled, and the program is harder to debug than a YAML config.
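
The control-plane half of this mechanism is usually just a map write: the eBPF program is loaded once by the DaemonSet and reads its targeting rules from a pinned map. A sketch, assuming a hypothetical pre-loaded program whose map keys delays by destination IPv4 address (the bpftool invocation is standard; the map path and layout are the assumption):

import subprocess

def set_ebpf_delay(dst_ip_hex: list, delay_ms: int):
    """Write (destination IP -> delay_ms) into the pinned map the
    hypothetical eBPF program consults on every packet. Needs root."""
    value_hex = [f"{b:02x}" for b in delay_ms.to_bytes(4, "little")]
    subprocess.run(
        ["bpftool", "map", "update",
         "pinned", "/sys/fs/bpf/faultd/delay_by_dst",
         "key", "hex", *dst_ip_hex,
         "value", "hex", *value_hex],
        check=True)

set_ebpf_delay(["0a", "00", "02", "1f"], 200)   # 10.0.2.31 -> 200ms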

Kernel netfilter / tc qdisc directly. The platform runs a privileged container per node that issues iptables / nftables or tc qdisc commands scoped by network namespace. Why: this is the lowest-common-denominator mechanism — every Linux kernel since 2.4 has it, no special kernel config, no eBPF verifier, no sidecar requirement. Cost: less precise than eBPF (you cannot easily filter on L7 fields), and the tc rule lives in kernel state that survives container restarts but not node reboots.
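
The per-node agent for this mechanism can be tiny. A sketch, assuming a hypothetical helper that has already resolved the pod to a container PID (nsenter, tc, and netem are standard; the function names are not):

import subprocess

def netem(pod_pid: int, args: list):
    # Run tc inside the pod's network namespace (nsenter -n).
    subprocess.run(["nsenter", "-t", str(pod_pid), "-n", "tc", *args], check=True)

def inject_latency(pod_pid: int):
    netem(pod_pid, ["qdisc", "add", "dev", "eth0", "root",
                    "netem", "delay", "200ms"])

def revert(pod_pid: int):
    # Deleting the root qdisc removes the netem rule in one shot.
    netem(pod_pid, ["qdisc", "del", "dev", "eth0", "root"])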

Most production platforms (Netflix's ChAP, AWS FIS, Chaos Mesh, LitmusChaos) ship at least two of the three mechanisms because no single one covers every fault type. Latency on outgoing HTTP is best done in the sidecar; clock skew is best done with eBPF or clock_gettime syscall interposition; packet loss to a managed Postgres works only at the kernel level because the mesh does not see that traffic.

[Figure: where the fault gets injected, top to bottom. Application (your service code, user space); sidecar (Envoy/Istio) holding the L7 fault filter for delay, abort, or bandwidth, chosen via xDS push (user space); eBPF program at the tc/XDP hook, L3/L4, seeing all pods (kernel); netfilter/tc qdisc rules, iptables-level, kernel ≥ 2.4 (kernel); NIC and physical network (hardware). Packets flow down through each layer: the sidecar sees L7 fields (HTTP method, headers), eBPF sees packets in kernel space with precision, netfilter sees everything but coarsely. Illustrative.]
Each mechanism trades precision for coverage; production platforms ship at least two of the three.
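
That routing logic is small enough to encode directly. A sketch of a mechanism chooser, with hypothetical names, encoding the latency/clock-skew/packet-loss rules above:

def choose_mechanism(fault_type: str, target_in_mesh: bool) -> str:
    """Pick the cheapest mechanism that can actually see the traffic."""
    if fault_type in ("latency", "abort", "bandwidth") and target_in_mesh:
        return "sidecar"      # L7 filter, no privileged agent needed
    if fault_type == "clock_skew":
        return "ebpf"         # no network boundary exists to interpose on
    return "netfilter"        # lowest common denominator for mesh-blind traffic

print(choose_mechanism("latency", True))        # sidecar
print(choose_mechanism("packet_loss", False))   # netfilter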

A minimal control loop you can read in one screen

Here is the core control loop — the thing that takes an injection request, applies it, watches the SLO, aborts if needed, and reverts on schedule:

import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Injection:
    id: str
    target_pods: list
    fault: dict           # {"type": "latency", "p50_ms": 200}
    duration_s: int
    slo_check: Callable[[], float]   # returns current SLO value
    slo_min: float
    abort_grace_s: int

def apply_to_sidecar(pod, fault):
    print(f"  [apply]  {pod} <- {fault}")

def revert_sidecar(pod):
    print(f"  [revert] {pod}")

def run_injection(inj: Injection):
    print(f"[t=0] starting injection {inj.id}: {inj.fault} on {inj.target_pods}")
    for pod in inj.target_pods:
        apply_to_sidecar(pod, inj.fault)
    start = time.time()
    breach_started_at = None
    try:
        while time.time() - start < inj.duration_s:
            time.sleep(1)
            slo = inj.slo_check()
            elapsed = int(time.time() - start)
            if slo < inj.slo_min:
                breach_started_at = breach_started_at or time.time()
                breach_dur = time.time() - breach_started_at
                print(f"[t={elapsed}] SLO {slo:.4f} < {inj.slo_min} (breach {breach_dur:.0f}s)")
                if breach_dur >= inj.abort_grace_s:
                    print(f"[t={elapsed}] ABORT: SLO breached for {inj.abort_grace_s}s")
                    return "aborted"
            else:
                breach_started_at = None
                print(f"[t={elapsed}] SLO {slo:.4f} ok")
        return "completed"
    finally:
        for pod in inj.target_pods:
            revert_sidecar(pod)
        print(f"[t={int(time.time()-start)}] injection {inj.id} reverted (dead-man fired)")

# fake SLO that drops mid-experiment
slo_values = [0.999, 0.998, 0.997, 0.992, 0.991, 0.991, 0.999, 0.999]
i = [0]
def fake_slo():
    v = slo_values[min(i[0], len(slo_values)-1)]
    i[0] += 1
    return v

inj = Injection(
    id="exp-2026-04-29-001",
    target_pods=["payments-api-7d8", "payments-api-9b2"],
    fault={"type": "latency", "p50_ms": 200},
    duration_s=8,
    slo_check=fake_slo,
    slo_min=0.995,
    abort_grace_s=2,
)
print("RESULT:", run_injection(inj))

Output when run:

[t=0] starting injection exp-2026-04-29-001: {'type': 'latency', 'p50_ms': 200} on ['payments-api-7d8', 'payments-api-9b2']
  [apply]  payments-api-7d8 <- {'type': 'latency', 'p50_ms': 200}
  [apply]  payments-api-9b2 <- {'type': 'latency', 'p50_ms': 200}
[t=1] SLO 0.9990 ok
[t=2] SLO 0.9980 ok
[t=3] SLO 0.9970 ok
[t=4] SLO 0.9920 < 0.995 (breach 0s)
[t=5] SLO 0.9910 < 0.995 (breach 1s)
[t=6] SLO 0.9910 < 0.995 (breach 2s)
[t=6] ABORT: SLO breached for 2s
  [revert] payments-api-7d8
  [revert] payments-api-9b2
[t=6] injection exp-2026-04-29-001 reverted (dead-man fired)
RESULT: aborted

Walking through it line by line: apply_to_sidecar is the platform-specific hook that pushes the fault config (in production, an xDS update or eBPF map write). The main loop ticks once per second, calling slo_check — which in production reads the merchant-write success rate from your metrics backend over the last 1-minute rolling window. The breach_started_at variable is the heart of the abort logic: it tracks when the SLO first dipped below the threshold, and the experiment only aborts if the breach persists for abort_grace_s seconds — without that grace window, a single noisy data point would cancel every experiment. The try/finally guarantees revert even if the loop crashes; that is the dead-man's switch. The scariest line is apply_to_sidecar — it is the single line where production state changes, and a real platform wraps it in idempotency checks, RBAC, and audit logging before that call ever fires.
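
In production, slo_check is typically a one-query wrapper around the metrics backend. A sketch against the Prometheus HTTP API, where the endpoint and metric name are hypothetical and the query shape is ordinary PromQL:

import requests

PROM = "http://prometheus.internal:9090"   # hypothetical endpoint

def merchant_write_slo() -> float:
    # Success ratio over the last 1-minute rolling window, in [0, 1].
    query = ('sum(rate(merchant_writes_total{code=~"2.."}[1m]))'
             ' / sum(rate(merchant_writes_total[1m]))')
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=2)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 1.0

Returning 1.0 when the query comes back empty is itself a design decision; a stricter platform treats missing data as a breach, because a fault that takes out your metrics pipeline should abort the experiment, not blind it.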

Production story: PaySetu rolls this out and discovers what their SLO actually is

After Ananya's 31-minute incident, PaySetu's reliability team decides to build the platform. They pick Envoy fault filters as the v1 mechanism (the mesh covers 80% of inter-service calls) and a dead-simple Python control loop like the one above. The first three weeks of running real experiments produce a result no one expected: the platform aborts roughly 40% of experiments not because the system fails, but because nobody actually knows what the steady state is. Engineers were defining min: 0.999 for SLOs that, on a normal Tuesday with no fault injected, oscillated between 0.992 and 0.998 just due to background noise. The platform was correctly aborting on what looked like a breach but was actually the system's resting state.

The fix was not in the platform. The fix was that every team had to spend two weeks measuring their own steady-state baseline before they were allowed to file an injection — what is your p99 on a quiet Sunday, what is it during a Monday-morning settlement run, what is it on the third Friday of the month when merchants reconcile. Six months later, PaySetu's reliability lead said the most valuable thing the chaos platform produced was not the experiments but the forced introspection: you cannot inject a fault until you have written down, in numbers, what "fine" looks like. The platform was not the answer; the platform was the question.
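
The two-week exercise reduces to a small computation once the fault-free samples exist. A sketch, with hypothetical numbers, of deriving the abort threshold from the observed noise floor instead of from wishful thinking:

import statistics

def suggest_slo_min(baseline: list, headroom: float = 0.001) -> float:
    """Put the abort threshold just under the ~5th percentile of
    fault-free readings, so the platform aborts on real degradation
    rather than on the system's resting-state noise."""
    p5 = statistics.quantiles(baseline, n=100)[4]
    return round(p5 - headroom, 4)

quiet_week = [0.998, 0.992, 0.996, 0.994, 0.997, 0.993, 0.998, 0.995]
print(suggest_slo_min(quiet_week))   # well below the naive 0.999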

Common confusions

  • "Platform-level injection is the same as Chaos Monkey." Chaos Monkey kills VMs at random with no hypothesis, no SLO, no abort path — it is the opposite of platform-level injection. Platform injection is targeted, bounded, falsifiable, observed; Chaos Monkey was deliberately the simplest possible thing that worked, and it shipped before the discipline existed.
  • "Sidecar injection covers everything." It covers only traffic that flows through the sidecar. Calls to managed databases, Redis behind a load balancer, or anything the mesh does not see are invisible to Envoy filters and need eBPF or tc qdisc instead.
  • "The dead-man's timer makes manual revert unnecessary." The dead-man's timer is the fallback. The primary revert is the SLO-driven abort. Relying on the timer means accepting that your fault runs to completion every time, which defeats blast-radius minimisation when the system is already breaching.
  • "Adding a 200ms delay tests the system." A constant 200ms delay tests one operating point. Real network failure modes are bursty, asymmetric, and correlated — a platform that only ships constant-latency injection is missing 80% of the failure surface (timeouts, jitter, partial partitions).

Going deeper

The lineage-driven fault injection (LDFI) idea

Peter Alvaro's 2015 paper "Lineage-driven Fault Injection" argues that random fault injection is wasteful — most random faults are uninteresting because the system trivially tolerates them. Instead, derive faults from successful executions: trace which messages and services contributed to a successful response, then inject faults targeting that lineage. Why: the only faults that matter are ones on the critical path of a successful outcome — anything off that path is, by construction, irrelevant. LDFI is what makes Netflix's ChAP smarter than Chaos Monkey, and it is the mathematical justification for the move from "random kills" to "targeted, hypothesis-driven" platform injection.
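
A drastically simplified sketch of the idea, with hypothetical service names (the real paper reasons over a logic of data lineage, not a flat call graph):

from itertools import combinations

# Call edges that contributed to one observed successful request.
lineage = [("edge-gw", "payments-api"),
           ("payments-api", "fraud-scoring"),
           ("payments-api", "ledger")]

def fault_hypotheses(edges, max_simultaneous=2):
    """Every small subset of the lineage is a candidate experiment;
    faults on edges outside the lineage are, by construction,
    irrelevant to this outcome and never get scheduled."""
    for k in range(1, max_simultaneous + 1):
        yield from combinations(edges, k)

for h in fault_hypotheses(lineage):
    print(h)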

Time-and-clock injection is a different beast

Latency, drops, and aborts are easy because they happen at network boundaries. Clock skew is not — there is no sidecar between your code and gettimeofday(). To inject clock skew at the platform level you either (a) use eBPF to intercept clock_gettime syscalls and shift the result by a configurable offset, or (b) use LD_PRELOAD to replace libc's clock functions in user space, or (c) use Linux's time namespaces (kernel ≥ 5.6) to run pods in a namespace with a different boot offset. Each has a sharp edge: eBPF interception breaks if the program uses VDSO (which most do, for performance); LD_PRELOAD only works for dynamically-linked binaries; time namespaces are powerful but only applied at namespace creation. CricStream learned this the hard way when their first clock-skew experiment did nothing for 40 minutes because their Go binaries were statically linked and the LD_PRELOAD shim was never loaded.
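
CricStream's failure mode is cheap to pre-empt: check that the target binary is dynamically linked before trusting an LD_PRELOAD shim at all. A sketch (ldd's "not a dynamic executable" message is standard behaviour on statically linked binaries):

import subprocess

def ld_preload_will_load(binary_path: str) -> bool:
    """LD_PRELOAD shims are silently ignored by statically linked
    binaries: the experiment 'runs' and injects nothing."""
    out = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return "not a dynamic executable" not in (out.stdout + out.stderr)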

The "experiment scheduler" problem

Once you have a platform, you can run many experiments per day. But you cannot run two experiments on the same blast radius simultaneously (their effects confound each other's results), and humans are bad at filing experiments efficiently. The next layer is a scheduler: given a queue of pending experiments and a graph of service dependencies, pick a non-overlapping schedule that maximises coverage. This is bin packing with conflicts, and it is where chaos engineering meets job scheduling. AWS FIS uses a simple "one experiment per blast-radius per hour" rule. ChAP at Netflix is more sophisticated, allowing multiple experiments if their target sets are disjoint.
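
The simplest useful scheduler is a greedy pass that admits an experiment only if its blast radius is disjoint from everything already admitted in the window. A sketch, with hypothetical experiment tuples; AWS FIS's one-per-blast-radius rule is the degenerate case:

def schedule(pending):
    """pending: list of (experiment_id, set_of_target_services)."""
    admitted, claimed = [], set()
    for exp_id, targets in pending:
        if claimed.isdisjoint(targets):
            admitted.append(exp_id)
            claimed |= targets          # these services are now off-limits
    return admitted

pending = [("exp-1", {"payments-api"}),
           ("exp-2", {"payments-api", "ledger"}),   # overlaps exp-1, waits
           ("exp-3", {"fraud-scoring"})]
print(schedule(pending))   # ['exp-1', 'exp-3']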

Why audit logs are not optional

When an experiment runs at PaySetu and an on-call engineer is paged for a latency spike at 2:47am, the only difference between "experiment" and "incident" is the audit log. If the log says "exp-2026-04-29-001 owned by ananya, target payments-api shard south-1, slo merchant-write, scheduled abort 02:51", the engineer goes back to sleep. If there's no log, the engineer assumes incident, pages the team, escalates, and the next time someone proposes an experiment, the team blocks it. Audit is not a compliance feature — it is what prevents the platform from being killed by its own users.
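
What the paged engineer needs is a single queryable record. An illustrative shape, with the fields mirroring the injection request:

{
  "id": "exp-2026-04-29-001",
  "owner": "ananya@paysetu.in",
  "target": {"service": "payments-api", "selector": "shard=south-1", "max_pods": 3},
  "fault": {"type": "latency", "p50_ms": 200},
  "started_at": "2026-04-29T02:47:12Z",
  "scheduled_revert_at": "2026-04-29T02:51:12Z",
  "steady_state": {"slo": "merchant-write-success", "min": 0.995},
  "state": "running"
}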

Restart-survival is harder than it looks

A platform pushes a fault into a sidecar, the pod restarts mid-experiment, and now the platform's view of the world disagrees with reality. Two design choices: (a) faults are ephemeral — pods come back clean, the platform re-applies if needed, and the dead-man timer still owns the timeline; (b) faults are persisted — written to a CRD or external store, re-applied automatically by the sidecar on startup. Option (a) is simpler and the usual answer; option (b) is required only when you cannot tolerate the brief window between restart and re-apply. KapitalKite chose (a) for everything except their order-routing experiments, where even a 200ms gap in injected latency would let an outlier order slip through and skew the result.
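
Option (a) reduces to a reconcile loop. A sketch where pod_has_fault and apply_fault are hypothetical hooks into whichever data plane the platform chose:

def reconcile(active_injections, pod_has_fault, apply_fault):
    """Faults are ephemeral: a restarted pod comes back clean, so the
    control plane re-applies anything still inside its dead-man window.
    The dead-man timer, not this loop, still owns when the fault ends."""
    for inj in active_injections:
        for pod in inj.target_pods:
            if not pod_has_fault(pod, inj.fault):
                apply_fault(pod, inj.fault)   # must be idempotent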

Where this leads next

The platform you have built so far runs one experiment at a time, on demand. The next chapter — automating chaos in CI/CD — is about making the platform run experiments automatically, gated on the deploy pipeline, so every release is implicitly chaos-tested. After that, the curriculum moves into game days (humans drive the experiments), incident-response tooling (the SLO-abort path the platform just used is the same path the incident-response platform uses), and finally the maturity model that puts all of this into context.

The platform also unlocks experiments you cannot run with a script: cross-region partitions, correlated dependency failures, time-of-day-conditional injections — all of which require the scheduler, the steady-state observer, and the audit log working together. See /wiki/automating-chaos-in-ci-cd, /wiki/game-day-design, and /wiki/the-observability-maturity-model for the path forward.

References

  • Alvaro, Peter et al. "Lineage-driven Fault Injection." SIGMOD 2015. — the formal argument that random injection is the wrong default.
  • Basiri, Ali et al. "ChAP: Chaos Automation Platform." Netflix Tech Blog, 2017. — the canonical write-up of platform-level injection at scale.
  • Rosenthal, Casey & Jones, Nora. "Chaos Engineering: System Resiliency in Practice." O'Reilly, 2020. Chapters 5–7.
  • Envoy documentation: "HTTP fault filter." The reference for sidecar-level fault injection.
  • Linux kernel docs: "BPF and TC qdisc" — the kernel-side mechanism for L3/L4 injection.
  • AWS Fault Injection Simulator (FIS) docs — the managed equivalent of a chaos platform.
  • /wiki/the-principles-netflix — the five tenets that any platform must enforce.
  • /wiki/blast-radius-and-recovery — the conceptual ancestor of the max_pods cap.