In short
Back in Build 1 you wrote a harness that killed an append-only log mid-write and checked that every acknowledged record came back. That harness was protecting one invariant: no lost writes. The store you have now is enormously more complicated — a buffer pool, a WAL with LSNs, pages with page-LSNs, an analysis / redo / undo recovery pipeline, compensation log records so undo is itself crash-safe — and its correctness claim is enormously stronger. It promises atomicity under crash: after any SIGKILL, restart, and recovery, the database looks as if every committed transaction ran in full and every uncommitted transaction never started. To actually know that, you rebuild the Build 1 harness at Build 5 scale. A parent process runs a worker that opens random transactions and does random puts and commits and aborts, logging to a parent-visible file which transactions it saw COMMIT OK for. A killer thread in the parent sends SIGKILL at a random interval. The parent restarts the database, runs recovery, and checks three invariants: every committed transaction's writes are visible; no uncommitted transaction's writes are visible; every page's page-LSN is consistent with the WAL. Run the loop 500 times and aggregate. Five minutes of this exposes more recovery bugs than a week of hand-written tests — and it connects the whole build, because the discipline is the same discipline Build 1 used on a 60-line log and the same discipline that Jepsen uses on distributed CockroachDB clusters. What you have at the end of this chapter is a toy, but it is a stress-tested toy: a single-node transactional store with WAL, ARIES-like recovery, and a harness that has personally tried to break it five hundred times and failed.
Every chapter of Build 5 has been building towards a single promise your database now makes to the application on top of it: commit means durable, abort means invisible, and a crash in between — at any instruction, in any transaction — resolves to one of those two outcomes for every operation you ever issued. That is the atomicity and durability contract of ACID, in full. You wrote the mechanisms — the write-ahead rule, LSNs, page-LSNs, checkpointing, the analysis pass, the redo pass, the undo pass, compensation log records. You believe they work. Your happy-path tests pass.
They are not a proof. They are not even strong evidence. Everything you have tested so far has been tested by code running inside a live process — exactly the region chapter 4 told you is structurally blind to the interesting bugs. A recovery bug lives in the window between two WAL writes, or between a page flush and an fsync, or between the CLR for the N-th undone update and its own durability barrier. You cannot see that window with a unit test. You can only see it by standing outside the process, killing the process inside that window, and inspecting what the disk kept.
This is the capstone chapter of Build 5: you build the harness that does exactly that, and you run it.
The shape of the test — a bigger Build 1
The Build 1 harness had three moving parts: a writer child doing puts to a log; a parent sending SIGKILL at a random moment; a checker that, on restart, compared an ack file (parent-visible) to the log (disk) and counted lost writes. That is the shape of every crash harness that has ever worked. You are going to build exactly the same shape, scaled up to a transactional store.
The worker now looks like this: it opens a random number of concurrent transactions (single-threaded is fine — just interleave them), each doing a random mix of put / delete / get. At random points the worker commits a transaction and records that commit's success to a parent-visible file. At other points it aborts a transaction. Meanwhile the parent sleeps a random interval and then SIGKILLs the worker. On restart, the parent launches the database, lets it run recovery, and then asks the database: show me everything you believe is committed. The parent compares that against its own record of COMMIT OK acknowledgements.
The two files that matter are the database files (on disk, modified by the worker) and a plain-text commits.log file, appended to by the worker — but only ever after a successful COMMIT OK has returned to it, which is what makes the file the parent's record of acknowledged commits. The parent must never infer a commit from anything except a commits.log line. That is what makes the harness honest: the parent only "knows" what it "saw". Anything the worker did that the parent did not see, the parent makes no claim about.
The harness in ~60 lines
Here is the full thing. toydb is the single-node transactional store you built across Build 5 — MiniDB in your local tree, or whatever you named it. Its interface: MiniDB(path) opens the database and runs recovery; begin() starts a transaction and returns a handle txn; txn.put(k, v) and txn.delete(k) record updates; txn.commit() returns normally on success, raises on failure; txn.abort() rolls back in the foreground; db.scan() yields every currently-visible (k, v) pair. For the post-crash checks the harness also needs three introspection hooks: db.committed_xids() (the set of xids recovery considers committed), db.in_doubt_xids() (transactions still active after undo — should always be empty), and db.verify_page_lsns() (the page-LSN consistency check of invariant 3 below).
# crash_harness5.py — SIGKILL-in-a-loop for a WAL+ARIES store
import os, sys, time, random, signal, shutil, subprocess, json, string

DBDIR = "harness5.db"
COMMITS = "commits.log"  # parent-visible: one xid per line for every COMMIT OK

def rand_key(): return "k" + "".join(random.choices(string.ascii_lowercase, k=4))
def rand_val(): return "v" + "".join(random.choices(string.ascii_lowercase + string.digits, k=8))

def worker():
    """Child: open txns, do random ops, commit/abort, log commits AFTER ack."""
    from toydb import MiniDB
    db = MiniDB(DBDIR)                         # opens + runs recovery if needed
    commits = open(COMMITS, "a", buffering=1)  # line-buffered for parent visibility
    live = {}                                  # xid -> txn handle
    for _ in range(10_000_000):                # way more than the kill delay permits
        action = random.choice(["begin", "op", "op", "op", "commit", "abort"])
        if action == "begin" and len(live) < 4:
            t = db.begin(); live[t.xid] = t
        elif action == "op" and live:
            t = random.choice(list(live.values()))
            if random.random() < 0.85: t.put(rand_key(), rand_val())
            else: t.delete(rand_key())
        elif action == "commit" and live:
            xid, t = live.popitem()
            t.commit()                         # raises on failure; never returns partial
            commits.write(f"{xid}\n")          # parent sees this ONLY after COMMIT OK
        elif action == "abort" and live:
            xid, t = live.popitem()
            t.abort()

def parent_loop(trials=500):
    stats = {"trials": 0, "lost_commits": 0, "phantom_visible": 0, "page_lsn_bad": 0}
    for t in range(trials):
        for p in (DBDIR, COMMITS):             # fresh slate for every trial
            if os.path.isdir(p): shutil.rmtree(p)
            elif os.path.exists(p): os.remove(p)
        child = subprocess.Popen([sys.executable, __file__, "worker"])
        time.sleep(random.uniform(0.020, 0.500))   # kill window
        child.send_signal(signal.SIGKILL); child.wait()
        committed = set()                      # parent's ground truth
        if os.path.exists(COMMITS):
            with open(COMMITS) as f:
                committed = {int(ln) for ln in f if ln.strip().isdigit()}
        from toydb import MiniDB
        db = MiniDB(DBDIR)                     # opens + runs recovery
        visible = {k: v for k, v in db.scan()} # recovered state, for spot checks
        # ground truth: the DB's own view of which xids committed, after recovery
        recovered_committed = db.committed_xids()
        uncommitted_after = db.in_doubt_xids() # should be empty — undo ran
        bad_page_lsn = db.verify_page_lsns()   # returns list of offending page_ids
        lost = committed - recovered_committed
        phantom = uncommitted_after
        stats["trials"] += 1
        stats["lost_commits"] += len(lost)
        stats["phantom_visible"] += len(phantom)
        stats["page_lsn_bad"] += len(bad_page_lsn)
        print(f"trial {t:03d} acked_commits={len(committed):>4} "
              f"recovered={len(recovered_committed):>4} "
              f"lost={len(lost)} phantom={len(phantom)} pageLSN_bad={len(bad_page_lsn)}")
    print(json.dumps(stats, indent=2))

if __name__ == "__main__":
    (worker if len(sys.argv) > 1 and sys.argv[1] == "worker" else parent_loop)()
That is the harness. Every line of it is the same discipline as the Build 1 harness: the parent is the only process that sees both sides of the crash, and the parent only believes what it saw. If the parent never saw COMMIT OK, the parent makes no claim about that transaction — recovery may commit it, may roll it back, either outcome is correct as long as it is consistent.
Why line-buffering on the commits file: the parent uses commits.log to know "for a fact" that the worker saw COMMIT OK. If Python's stdio buffered the file, a SIGKILL could lose an xid line that had already been written (from the worker's point of view) but had not reached the kernel yet — the parent would then fail to count that commit as acked, make no claim about it, and silently lose coverage for exactly the commits it most needs to check. Setting buffering=1 (line-buffered) forces Python to call write(2) after every newline, so the kernel has the bytes the moment the line is complete. The parent reads the same file after the crash and sees whatever write(2) calls completed before the kill. Note carefully: this does not require fsync. The parent is still alive and its kernel's page cache still holds the line; it can read it straight back without durability. This is a SIGKILL harness (rung 2 of the ladder from chapter 4), not a power-cut harness.
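The effect is easy to demonstrate in isolation. With buffering=1, the newline itself pushes the line into the kernel, so a second reader sees it immediately even though the writer never flushed. (The path and function name below are a standalone demo, not part of the harness.)

```python
import os

def demo_line_buffering(path):
    """Show that a line-buffered text file is visible to another reader
    line by line, with no flush(), close(), or fsync() by the writer."""
    w = open(path, "a", buffering=1)   # line-buffered, like the commits file
    w.write("7\n")                     # the newline triggers write(2) now
    seen = open(path).read()           # a second open() sees it already
    w.close()
    return seen
```

Swap buffering=1 for the default block buffering and the second reader sees an empty file, which is exactly the failure mode the paragraph above describes.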
The three invariants — and what each one protects
After each trial, the parent checks three things. These are not arbitrary; each one is the test for one of the three guarantees your recovery algorithm is supposed to provide.
Invariant 1 — no lost commits. For every xid the parent saw COMMIT OK for, the recovered database must believe that xid is committed. Formally: committed ⊆ recovered_committed. A violation means a transaction whose commit returned successfully was rolled back by recovery. That is the worst possible bug in a transactional store — the user was told "your money has been transferred" and after a crash, it hasn't.
The mechanism that protects this invariant is the write-ahead rule plus the durable commit barrier: the COMMIT record must be fsync'd to the WAL before COMMIT OK returns to the caller. If that fsync is missing, or if the commit record is written to the log file but never reaches disk, or if the commit record's LSN is somehow lost across recovery — the invariant fires.
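A minimal sketch of that barrier, assuming a raw file descriptor for the WAL; the one-line-per-commit record format and the function name here are hypothetical, not your MiniDB's:

```python
import os

def append_commit_record(log_fd, xid):
    """Sketch of the durable commit barrier. What matters is the ordering:
    write(2) puts the bytes in the kernel, fsync(2) puts them on stable
    storage, and only after BOTH may COMMIT OK return to the caller."""
    os.write(log_fd, f"COMMIT {xid}\n".encode())
    os.fsync(log_fd)          # the barrier invariant 1 is really testing
    return True               # caller may now report COMMIT OK
```

Delete the os.fsync line and you have a version of the bug injected later in this chapter: the commit usually survives anyway because the kernel flushes on its own schedule, which is why only a statistical harness catches it.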
Invariant 2 — no phantom visible writes. After recovery, there must be no transaction in the "in-doubt" or "active" state. The analysis pass builds the active-transaction table; the undo pass walks each entry in it backward, writing CLRs and undoing every change. If any transaction is left active in the ATT after undo completes, the harness reports it. And crucially — though this is harder to check directly without a full log scan — none of the writes of those active transactions should be visible through db.scan().
The mechanism protecting this is the undo pass and compensation log records from the previous chapter. A failure here means either the undo pass did not finish all active transactions, or a CLR did not get flushed, or a before-image was wrong and the page was left in a corrupted intermediate state.
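The skip-over-compensated-work behaviour is easiest to see in a toy model. This sketch undoes an active transaction backward and logs a CLR per undone update; the in-memory record shapes and names are invented for illustration, not your store's. Because the cursor follows a CLR's undo_next pointer, a re-run after a mid-undo crash never undoes the same update twice:

```python
def undo_with_clrs(log, active_xids):
    """Toy undo pass with compensation log records.
    Updates: {"lsn", "xid", "type": "UPD", "key", "before", "prev_lsn"}.
    CLRs:    {"lsn", "xid", "type": "CLR", "undo_next"}.
    Mutates `log` by appending CLRs; returns the (key, before_image)
    restores it applied, newest first."""
    by_lsn = {r["lsn"]: r for r in log}
    cursor = {}                        # each active txn's most recent record
    for r in log:
        if r["xid"] in active_xids:
            cursor[r["xid"]] = r["lsn"]
    applied = []
    next_lsn = max(by_lsn, default=0) + 1
    for xid, lsn in cursor.items():
        while lsn is not None:
            rec = by_lsn[lsn]
            if rec["type"] == "CLR":
                lsn = rec["undo_next"]  # a prior crashed undo already
                continue                # compensated past here: skip it
            applied.append((rec["key"], rec["before"]))  # restore before-image
            log.append({"lsn": next_lsn, "xid": xid, "type": "CLR",
                        "undo_next": rec["prev_lsn"]})
            next_lsn += 1
            lsn = rec["prev_lsn"]
    return applied
```

Run it once and every update of the active transaction is undone and compensated; run it again (modelling a second crash after the CLRs reached the log) and it does nothing, which is the whole point of CLRs.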
Invariant 3 — page-LSN sanity. For every page P in the database, P.page_lsn must be the LSN of the last log record that modified P. Specifically, P.page_lsn must be ≥ every LSN in the log that names P as its target, modulo records that were never applied because of the skip-or-apply rule. The verify_page_lsns helper walks the WAL, for each page computes max_lsn_for_page, and compares to the page's on-disk page-LSN.
This is the invariant that protects the next recovery. If page-LSNs are inconsistent after this recovery, the next crash (running recovery from a different starting point) will apply records it should have skipped, or skip records it should have applied. Getting this right is what makes recovery idempotent — and idempotence is what keeps a database repairable under repeated failure.
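A toy version of that helper, assuming the simplest possible inputs — a list of (lsn, page_id) pairs pulled from the WAL and a dict of on-disk page-LSNs — and the strict post-recovery condition that every touched page was flushed at its final LSN:

```python
def verify_page_lsns(wal_records, pages):
    """Toy page-LSN check. wal_records: (lsn, page_id) for every WAL record
    that targets a page. pages: page_id -> on-disk page_lsn. Under the
    simplifying assumption that recovery flushed every page it touched,
    each page's LSN must equal the newest WAL record naming it; return
    the page_ids where it does not."""
    max_lsn = {}
    for lsn, page_id in wal_records:
        max_lsn[page_id] = max(lsn, max_lsn.get(page_id, 0))
    return [pid for pid, lsn in max_lsn.items() if pages.get(pid, 0) != lsn]
```

A real check is looser on the low side — a page may legitimately lag the WAL tail if redo will cover it on the next start — but must always flag a page whose LSN is ahead of every record that names it, because that page claims work the log cannot account for.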
Notice how much more the invariants are doing than in Build 1. In chapter 4 the harness checked "no lost writes / no torn writes / no gaps." Those were properties of the bytes on the disk. Here the harness checks properties of transactions — atomicity ("all or nothing") and isolation-of-uncommitted-work from recovered state. The bytes-on-disk invariants still apply underneath (each WAL record must be framed and CRC'd, same discipline as chapter 5), but they are the floor. The transactional invariants are the ceiling.
Running the harness
500 crashes on a store that got it right, and one that didn't
On a correct implementation, running the harness produces output like this. Each trial is one SIGKILL cycle and the counters are per-trial.
trial 000 acked_commits= 37 recovered= 37 lost=0 phantom=0 pageLSN_bad=0
trial 001 acked_commits= 82 recovered= 82 lost=0 phantom=0 pageLSN_bad=0
trial 002 acked_commits= 12 recovered= 12 lost=0 phantom=0 pageLSN_bad=0
...
trial 499 acked_commits= 154 recovered= 154 lost=0 phantom=0 pageLSN_bad=0
{
"trials": 500,
"lost_commits": 0,
"phantom_visible": 0,
"page_lsn_bad": 0
}
Zero violations in 500 trials. This is what "WAL + ARIES-style recovery" is supposed to look like under stress.
Now inject a bug — say, flush the commit record to the WAL file with write() but skip the fsync before returning COMMIT OK — and run the same harness:
trial 000 acked_commits= 34 recovered= 34 lost=0 phantom=0 pageLSN_bad=0
trial 001 acked_commits= 29 recovered= 28 lost=1 phantom=0 pageLSN_bad=0
trial 002 acked_commits= 95 recovered= 95 lost=0 phantom=0 pageLSN_bad=0
trial 003 acked_commits= 61 recovered= 59 lost=2 phantom=0 pageLSN_bad=0
...
trial 499 acked_commits= 103 recovered= 103 lost=0 phantom=0 pageLSN_bad=0
{
"trials": 500,
"lost_commits": 47,
"phantom_visible": 0,
"page_lsn_bad": 0
}
Forty-seven lost commits across 500 trials. About one in ten trials exposes the bug; the rest happen to crash at a moment where the kernel had already flushed the commit record on its own. This is why a harness needs to be noisy and statistical, not deterministic — the bug's firing rate tells you how severe it is, and aggregating over hundreds of trials gives you a number instead of a hunch.
Inject a different bug — have the undo pass skip writing a CLR for the last record it undoes — and you will see phantom counters rise on some trials: a second crash during recovery leaves the first crash's partial undo half-applied, because without the CLR the second recovery does not know the work was done. That is what the phantom_visible column catches.
Connecting back to chapter 4 — same discipline, bigger database
Go back and reread the harness listing from chapter 4 — Power-Loss Testing Your Own Database. Put it next to the harness above. The two listings are the same program, written against two different databases.
- Parent spawns worker. Identical. The parent must live through the crash to check the invariants.
- Worker appends to a parent-visible ack file only after the durability contract succeeds. Identical. In chapter 4 the ack was per-put; here the ack is per-COMMIT OK. The boundary of what counts as "durable from the caller's view" moved from the write to the commit — that is the only change.
- Parent kills with SIGKILL at a random delay. Identical. Same rung 2 of the violence ladder.
- Parent reads the ack file, reopens the DB, compares. Identical in shape. The comparison grew from "byte equality of record i" to "set equality of committed-xid sets," but it is still a parent-built ground truth versus a DB-reported ground truth.
- Loop N times, aggregate. Identical. Durability bugs are probabilistic; you run until the expected hit count is high.
Every lesson from chapter 4 transfers. SIGSTOP + inspect + SIGCONT to catch the worker at a chosen instant — still works. dm-log-writes to replay the exact block-level sequence under arbitrary power-cut schedules — still works, now against a WAL and data files instead of a log file. eBPF to intercept fsync and model lying storage — still works, and has become more important because your store now depends on a strict partial order of fsyncs (WAL before page flush; page flush before WAL truncation; CLR before checkpoint). A single dishonest fsync in the wrong place now corrupts a transaction instead of a record. The harness is how you find out.
There is one new thing Build 5 adds, which Build 1 did not have and could not have: recovery is itself a process that can crash. A crash during recovery was a non-event in chapter 4 (the recovery was "scan the file"). A crash during recovery in Build 5 means the analysis/redo/undo pipeline is itself running, and the kill might land in the middle of it. That is exactly why CLRs exist — so that a second crash during undo does not leave the database worse than the first crash did. To stress this, extend the harness: on half of the trials, let recovery start and then kill it again a few milliseconds into the replay, and verify that a third start (which runs recovery on the already-partially-recovered state) still holds the three invariants. That is the test that actually exercises repeated-crash recovery and proves recovery is itself recoverable.
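One way to sketch the parent-side extension: a helper that starts any command and SIGKILLs it after a random delay, used once on the worker and then again on the restarted process while recovery is replaying. The names and the delay window here are illustrative.

```python
import random, signal, subprocess, sys, time

def run_and_kill(cmd, min_s=0.002, max_s=0.050):
    """Start cmd, SIGKILL it after a random delay, return its exit status.
    Call it on the worker, then again on the restart, so the second kill
    can land inside the analysis/redo/undo pipeline itself."""
    child = subprocess.Popen(cmd)
    time.sleep(random.uniform(min_s, max_s))
    child.send_signal(signal.SIGKILL)
    return child.wait()        # negative value = killed by that signal (POSIX)
```

A trial then becomes: run_and_kill(worker_cmd); on half the trials also run_and_kill(recover_cmd); finally open the database for real and check all three invariants against the ack file.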
What you have built
At the start of Build 5 you had a store that could survive a clean shutdown and some torn tails, and a vague notion that "real databases" had something called a write-ahead log. You have now built, piece by piece:
- A WAL with LSNs and framed, CRC-checksummed records (from chapters in Build 5's lead-in).
- A buffer pool that obeys the write-ahead rule and a page layout that stores each page's page-LSN in its header.
- A checkpointing scheme that bounds how much log recovery must replay.
- An ARIES-style three-pass recovery algorithm — analysis, redo, undo — with the clean LSN-comparison invariants that make redo idempotent and undo monotonic.
- Compensation log records that make undo itself crash-safe, so recovery survives being interrupted by a second crash.
- And now, a harness that kills the whole thing 500 times in a loop and proves, statistically, that after any crash at any instant, every committed transaction survives and every uncommitted transaction vanishes.
It is a toy. It has no SQL, no query planner, no indexing beyond a flat KV, no MVCC, no concurrency beyond a single process. But it is a toy that has been crashed at, and at the end of every crash it has correctly reassembled itself. That is a property an enormous amount of production software does not actually have — and it is the property that separates a database from a file.
Common confusions
- "If 500 SIGKILL trials are green, my database is production-ready." No. 500 trials at rung 2 (SIGKILL) is the minimum. A production database also needs rung 3 — actual power-cut simulation with dm-log-writes or a VM — because SIGKILL does not test the fsync path (the kernel is still alive and flushes the page cache). Run rung 3 in nightly CI, on every filesystem you plan to support, before you let a user near your store.
"The harness is inconsistent — it gives different results each run." That is not a flake, it is the signal. Recovery bugs fire probabilistically on the timing of the kill. Run more trials until the 95% confidence interval on the failure rate is tight enough to tell you whether you fixed the bug or hid it.
- "If my CLR is broken, recovery will crash and I'll know." Sometimes — if the CLR bug produces a scanner error or an assertion. But often a CLR bug produces silent wrong state: a page is partially undone, the database looks plausible, and only a specific read later reveals the corruption. The page-LSN sanity check is what catches the silent case; without it the harness misses whole bug classes.
- "I check that committed ⊆ recovered_committed. That's the durability guarantee." It is half of it. The other half is recovered_committed ⊆ committed ∪ unobserved_commits — a recovered "committed" transaction whose xid was never in the parent's ack file might be fine (the worker saw COMMIT OK and was killed before writing the ack line) or might be a bug (recovery committed a transaction that never wrote a COMMIT record). Distinguishing the two requires scanning the WAL yourself post-mortem and looking for the COMMIT record. Harnesses in production do this; the toy harness above is permissive on this side on purpose, so that it does not false-alarm on the "killed just after COMMIT OK, just before the ack line" window.
"My store passes the harness in 500 trials with no bug. I can stop testing." Every serious database team in the world would be delighted to reach 500 clean trials and would then immediately start running the harness continuously in CI forever. Bugs are introduced by future changes. A harness is not a one-shot test; it is a permanent fixture.
Going deeper
The harness above is the smallest honest Build 1 → Build 5 extension. The real tools that grown-up database teams use for recovery testing are closer in spirit but bigger in scope.
Jepsen for distributed recovery
Jepsen, introduced briefly in chapter 4, is first and foremost a distributed-systems testing framework: it partitions networks, skews clocks, pauses VMs, reshuffles packets. But everything Jepsen does rests on the same core as our harness — a parent process that records every operation issued and every response seen, injects faults, and checks at the end that the resulting history is consistent with some serial order allowed by the database's stated contract.
Where Jepsen becomes particularly illuminating for a recovery-testing discussion is its analyses of distributed transactional databases — CockroachDB, YugabyteDB, FaunaDB, TiDB. In those systems, "commit" is not a single fsync; it is a Paxos or Raft round across multiple nodes, each running its own ARIES-style local recovery on top. A crash of one node mid-round is a recovery event for that node and a correctness concern for the distributed layer. Jepsen has found multiple bugs in those systems where the single-node recovery was correct but the interaction between local recovery and the distributed commit protocol was not. The harness shape is the same; the invariants are linearisability and external consistency instead of single-node atomicity. Read the CockroachDB analysis for a particularly clean worked example.
ALICE for file-system-level semantics
ALICE (OSDI 2014) attacks recovery from a different angle. Instead of killing the process and seeing what the disk kept, ALICE systematically enumerates every valid crash state a filesystem could produce between two fsync points, and runs the application's recovery code against each. Where our harness samples the crash state randomly, ALICE explores the space exhaustively (within bounds).
For a WAL-based store, ALICE is devastating. It finds the cases where your WAL appends look correct to the application but the filesystem's journal ordering produces a disk state that the application's recovery code has never seen in a SIGKILL test. The ALICE paper specifically catches missing directory fsyncs, misordered metadata commits, and renamed-file-not-followed-by-fsync bugs in LevelDB, SQLite, HDFS, Git, and other widely deployed applications. If you graduate your toy to a real storage engine, this is the tool you run before releasing.
Property-based testing of recovery
Libraries like Hypothesis (Python), QuickCheck (Haskell), and proptest (Rust) let you state recovery invariants as properties — "for any sequence of operations op1, op2, … opn and any crash point k ≤ n, recovery produces a state consistent with some prefix ending in a commit boundary" — and have the library generate thousands of randomised sequences searching for a counterexample, then shrink the counterexample to its minimal form. This is exactly what our SIGKILL loop does, except property-based testing is declarative about the invariant and automatic about the shrinking. When the library finds a 47-operation sequence that breaks recovery, it shrinks it to the minimal 4-operation sequence that still breaks recovery — which is what you want at 2 a.m. when you are debugging.
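The shrinking half of that workflow is worth seeing in miniature. This is a greedy delta-debugging loop over an operation sequence — far cruder than Hypothesis's real shrinker, but the same idea: keep deleting operations while the failing property still fails.

```python
def shrink(ops, fails):
    """Greedy one-op-at-a-time shrink. `fails(ops)` returns True when the
    sequence still breaks the invariant; repeatedly drop any single op
    whose removal preserves the failure, until no single drop does."""
    changed = True
    while changed:
        changed = False
        for i in range(len(ops)):
            candidate = ops[:i] + ops[i + 1:]
            if fails(candidate):          # failure survived: keep the cut
                ops = candidate
                changed = True
                break
    return ops
```

Given a 47-operation failing workload, a loop like this hands you the minimal core that still breaks recovery; Hypothesis does the same automatically, plus smarter cuts than single deletions.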
The combination — property-based generation of workloads, SIGKILL injection, ALICE-style disk-state enumeration on top — is the state of the art for single-node recovery testing. For distributed recovery, Jepsen is the state of the art. The toy harness you built in this chapter is a thread running through every one of them.
Where this leads next
Build 5 is complete. You have a transactional, crash-safe, stress-tested single-node store. It answers the atomicity and durability halves of ACID. What it does not yet do is answer the consistency half well, because it has no schema, no types, no way to express "this row must have a foreign key to that table," and no query language more expressive than "scan the KV space."
That is Build 6.
- Build 6 — SQL and the query engine. A parser turns text into an abstract syntax tree. A logical planner rewrites that tree into relational algebra. A physical planner chooses access paths over your B-tree and LSM storage. An execution engine runs pipelines of operators — scan, filter, join, aggregate, sort — over your buffer pool. Every mechanism from Builds 1-5 becomes a primitive the query engine uses: the buffer pool for page access, the WAL for transactional writes from DML, recovery for crash safety under DDL, the page layout for tuple access. Build 6 is where your storage engine becomes a database in the sense a user would recognise the word.
Before you start Build 6, keep the harness from this chapter running in the background on your CI. Every change you make to the storage engine from here on should cross it without a single violation. If the harness ever fails, you have a recovery bug — and because the harness is deterministic about what it saw (the commits.log file) and non-deterministic about the kill timing, the reproducer is already in your hand: the sequence in commits.log plus the kill-delay seed that triggered the failure. That is the debugging shape every database team eventually converges on, and now you have it too.
References
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — the original ARIES paper. The recovery algorithm you just stress-tested, in full.
- Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 — the ALICE paper. Exhaustive enumeration of filesystem crash states, versus our random sampling.
- Kyle Kingsbury, Jepsen — the canonical adversarial testing framework and its library of database analyses, including many recovery-related findings in distributed stores.
- MacIver, Hypothesis: Property-Based Testing for Python — the most accessible property-based testing library; the natural way to generate workloads for a harness like this one.
- Mohan et al., Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing, OSDI 2018 — CrashMonkey, the successor to ALICE, scaled to mainline filesystem bug-finding.