In short
You cannot unit-test durability. A unit test runs inside your process; a crash happens between two of its instructions, somewhere your test function is never allowed to stand. The only honest way to know your store survives power loss is to actually kill it — with SIGKILL, a parent-managed kill -9, or a virtual-machine power-cut — mid-write, thousands of times, and then restart it and check that a stated invariant still holds. This chapter builds that harness. You will write a parent supervisor in Python that spawns a child writing to an append-only log, murders the child at a random instant with os.kill(pid, signal.SIGKILL), starts a fresh child that reads the log back, and verifies that every write the parent saw acknowledged is still present and every record that the parent did not yet see is either absent or a discardable partial tail. Five minutes of this loop exposes more durability bugs than a year of hand-written tests. You will also meet the four classes of failure it finds — lost writes, torn writes, reordered writes, and silent corruption — with a tiny reproducer for each, and the production-grade tools (Jepsen, ALICE, CrashMonkey, dm-log-writes, eBPF) that grown-up database teams use to catch them.
Your unit tests are green. Every record you wrote comes back. The put/get round-trip passes a thousand times in CI. You ship. Three days later, a user opens the application after a power outage and it starts with an empty database.
What went wrong is not a bug in any single line of your code. It is a bug in the gap between lines. Somewhere between the write() that returned success to Python and the fsync() that never happened — or happened to the wrong file, or happened before the directory entry was durable, or was swallowed by a lying SSD cache — a power cut slipped in and collected the difference. Unit tests cannot see this gap because unit tests run on one side of it. They run inside the process. A crash lives between two instructions of that process, in a place the test framework is structurally forbidden from inspecting.
This chapter is the missing test. It is the test you cannot write inside the program because the program has to die for the test to mean anything. You will build a small, fast harness that kills your store at a random moment, restarts it, and checks whether the invariants survived. Five minutes of this loop will teach you more about your own code than the previous week of linting did.
Why unit tests cannot catch durability bugs
A unit test looks like this:
def test_put_get_roundtrip():
    db = AppendOnlyKV("test.log")
    db.put("k", "v")
    assert db.get("k") == "v"
It tests one thing: that within a single live process, the put → get path produces the right answer. It is a useful test. It will catch a typo in the scanner. It will catch a regression where you forgot to flush before returning. It is also completely blind to every interesting durability bug, because every interesting durability bug happens across a process boundary.
Why across a boundary: the only reason durability is interesting is that something outside the process can end the process at an arbitrary moment. That something is power loss, a kernel panic, an OOM killer, or a kill -9 from an admin. A unit test function cannot invoke any of these on itself, because if it did its test framework would die with it and report nothing. The failure mode lives in the region the test cannot inhabit.
Compare the two pictures below. A unit test sees the full sequence of instructions; a crash happens between instructions in a way no assertion can observe.
put() is not an instant. It is an interval, stretched across userspace, the kernel, and the disk controller. The "crash window" is everything between the moment the bytes leave your Python buffer and the moment the disk reports them persistent. A unit test cannot observe this interval because the unit test is one of the instructions inside it. The harness in this chapter does exactly what a unit test cannot: it stands outside the store process, so when the store process is killed mid-interval, the harness is still alive to inspect what the disk actually kept.
Simulating crashes — the three levels of violence
Not every way of stopping a process tests the same thing. The ladder from "gentle" to "actually pulling the plug" has three rungs, each exposing a stricter set of durability bugs.
Rung 1 — SIGTERM (handled) or normal exit. You ask the process nicely to stop. Assuming it catches the signal and exits cleanly, it runs its exit handlers, flushes any Python-level buffers, closes files, and returns. Any bytes in Python's buffer make it to the kernel via the interpreter's shutdown-time flush and close(). This is not a crash. It tests only that your store flushes its own buffers on a clean shutdown. Every toy gets this right on the first try.
Rung 2 — SIGKILL (kill -9). The kernel destroys the process immediately. No exit handlers run. No close() is called. Anything sitting in Python's buffer is gone — only bytes that already made it to the kernel page cache survive. This is the first real crash test: it simulates a process abort in the middle of a write, which mimics an OOM-killer strike or a segfault. It does not simulate power loss, because the kernel is still alive and will still flush its own page cache to disk in the background.
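The rung-2 claim, that userspace buffers die with the process while bytes already handed to the kernel survive, is easy to demonstrate. This sketch (the file name rung2.tmp and the timings are arbitrary choices, not part of the chapter's harness) spawns a child, lets it write one flushed and one unflushed line, and SIGKILLs it:

```python
import os, signal, subprocess, sys, time

child_src = """
import time
f = open("rung2.tmp", "w")      # block-buffered: writes sit in userspace first
f.write("buffered-only\\n")
f.flush()                        # write(2): now in the kernel page cache
f.write("never-flushed\\n")      # still in Python's buffer when the kill lands
time.sleep(60)
"""

p = subprocess.Popen([sys.executable, "-c", child_src])
time.sleep(1.0)                            # let the child reach its sleep
os.kill(p.pid, signal.SIGKILL)             # rung 2: no exit handlers, no close()
p.wait()
print(repr(open("rung2.tmp").read()))      # the flushed line survived; the buffered one died
```

The file contains only "buffered-only\n": the kernel kept what it was given, and everything still inside the dead process vanished with it.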
Rung 3 — power cut. The kernel dies too. The page cache is gone. Only bytes that were issued a durability barrier (fsync or a write-through FUA write) and successfully acknowledged by the disk survive. This is the test your users' hardware actually runs on them. You cannot trigger this from within the OS — you have to pull the plug, unplug the VM's virtual disk, or use a block-layer tool like dm-log-writes (covered later) to simulate it deterministically.
There is also SIGSTOP, which pauses a process without killing it. SIGSTOP is not a crash — it is the pause button. But it is useful in a harness for a different reason: you can SIGSTOP a writer, inspect the on-disk state while it is frozen, and SIGCONT it, catching it mid-write in a reproducible way.
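Such a probe might look like the following sketch, where snapshot_while_frozen is a hypothetical helper and pid belongs to a writer child that is already running:

```python
import os, signal

def snapshot_while_frozen(pid, path):
    """Freeze the writer, read the on-disk tail while nothing can move, resume it."""
    os.kill(pid, signal.SIGSTOP)         # pause the writer between two instructions
    try:
        size = os.path.getsize(path)     # stable while the writer is frozen
        with open(path, "rb") as f:
            f.seek(max(0, size - 64))
            tail = f.read()              # the possibly mid-write tail of the log
    finally:
        os.kill(pid, signal.SIGCONT)     # resume the writer no matter what
    return size, tail
```

Call it repeatedly and diff consecutive snapshots to see exactly which bytes each put moved.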
If the store calls fsync on every commit, a rung-2 test with SIGKILL plus a fresh process that reads the log back is close enough to a rung-3 power cut to catch the overwhelming majority of bugs — and it runs a thousand times a minute. For the rest of this chapter, the harness uses rung 2 — SIGKILL — because it is cheap, deterministic, and scriptable. For real production hardening, you graduate to rung 3 with dm-log-writes or a VM, which is covered in Going deeper.
The harness — writes, kills, recovers, in a loop
The shape of a power-loss test harness is simple enough to state in one sentence: a parent process spawns a writer child, tells the writer which records to produce and in what order, kills the writer at a random moment, spawns a reader child on the same file, and checks the invariants. Let us build that.
Here is the full harness in ~50 lines of Python.
# crash_harness.py — power-loss tester for AppendOnlyKV
import os, sys, time, random, signal, subprocess, json

LOG = "harness.log"
ACK = "harness.ack"  # parent-visible record of which puts the writer claimed were durable

def writer():
    """Child: write N records to the log; append each i to ACK only after fsync returns."""
    from appendkv import AppendOnlyKV  # the store from chapter 2, now fsync'd per chapter 3
    db = AppendOnlyKV(LOG)
    ack = open(ACK, "a", buffering=1)  # line-buffered: acks are visible to parent promptly
    for i in range(100_000):
        db.put(f"key{i}", f"value{i}")  # this call must fsync internally
        ack.write(f"{i}\n")             # only written AFTER fsync returns
        # no sleep; go as fast as we can so the parent's kill lands mid-write

def parent_loop(trials=200):
    losses = torn = reordered = 0
    for t in range(trials):
        for path in (LOG, ACK):
            if os.path.exists(path):
                os.remove(path)
        child = subprocess.Popen([sys.executable, __file__, "writer"])
        # pick a random kill delay: long enough for many writes, short enough to catch mid-write
        time.sleep(random.uniform(0.005, 0.050))
        child.send_signal(signal.SIGKILL)
        child.wait()
        acked = set()
        if os.path.exists(ACK):
            with open(ACK) as f:
                acked = {int(line) for line in f if line.strip().isdigit()}
        from appendkv import AppendOnlyKV
        db = AppendOnlyKV(LOG)
        present = {int(k[3:]): v for k, v in db.scan_all() if k.startswith("key")}
        # invariant 1: every acked write must be present with its declared value
        missing = {i for i in acked if i not in present}
        if missing:
            losses += len(missing)
        # invariant 2: present values must match what the writer deterministically produced
        wrong = {i for i, v in present.items() if v != f"value{i}"}
        if wrong:
            torn += len(wrong)
        # invariant 3: present keys should be a prefix of what the writer intended (no reorder)
        gap = set()
        if present:
            max_seen = max(present)
            gap = {i for i in range(max_seen) if i not in present}
            if gap:
                reordered += len(gap)
        print(f"trial {t:03d} acked={len(acked):>6} present={len(present):>6} "
              f"lost={len(missing)} torn={len(wrong)} gaps={len(gap)}")
    print(json.dumps({"trials": trials, "total_lost": losses,
                      "total_torn": torn, "total_gaps": reordered}))

if __name__ == "__main__":
    (writer if len(sys.argv) > 1 and sys.argv[1] == "writer" else parent_loop)()
The harness assumes AppendOnlyKV.put calls fsync internally (chapter 3 made that change) and that a helper scan_all() yields (key, value) pairs from the log, silently skipping any torn tail line (a line that does not contain =, or whose checksum fails, once chapter 5 adds one).
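For concreteness, here is a minimal sketch of what that scanner could look like. It is hypothetical, and written as a free function over the log file rather than as the store method the harness calls. Note that it skips lines with no = but deliberately still parses a torn tail that happens to contain one — the exact weakness the worked example below exposes:

```python
def scan_all(path):
    """Naive text-format scanner: yield (key, value) pairs from a key=value log."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return
    for line in data.split(b"\n"):
        if b"=" not in line:
            continue  # blank or malformed line: silently skipped
        # a torn tail like b"key4=value" still contains "=" and is NOT skipped
        k, _, v = line.partition(b"=")
        yield k.decode(), v.decode()
```

Running it over a log whose last record was cut mid-value yields that fragment as a bogus pair — which is why invariant 2 exists.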
Let us unpack the three invariants the parent checks, because they are the ones that map to the bugs you are hunting.
Invariant 1 — no lost acks. If the writer wrote i into the ACK file, it did so only after db.put(...) for record i returned. By the contract of put, that means the record was durable. So on recovery, every i that is in ACK must be in the log. Any that is not is a lost write — the worst durability bug, because the writer was told its data was safe and it wasn't.
Invariant 2 — no corrupted values. The writer produces deterministic records: key{i} → value{i}. If the log contains key42 = valueXX for some mangled XX, the record has been corrupted — either by a partial write that scrambled a byte boundary, or by the scanner misinterpreting a torn tail as valid. This is a torn write (or a scanner bug).
Invariant 3 — no gaps. The writer wrote records in order, and each put returned before the next one started. So if key99 is durable but key50 is missing, something reordered the writes underneath you. This should be impossible in a single-threaded single-file append — if you see it, you have a bug in how you are flushing, or your "append" is actually doing a seek somewhere, or the filesystem is reordering metadata updates in a way your recovery is not handling.
One trial of the harness, step by step
Say the writer has just been killed, and the parent is about to check the invariants. Here is what the filesystem looks like, and what each check says.
harness.log:        harness.ack:
key0=value0         0
key1=value1         1
key2=value2         2
key3=value3         3
key4=value          (fifth ack never written)
The last line in harness.log is torn — value was cut off before 4\n made it to the page cache. harness.ack contains four lines, so the writer acknowledged putting key0 through key3. (It had called db.put("key4", "value4") before the kill, but put never returned, so the ack for 4 was never written.)
Check 1 — lost acks. acked = {0, 1, 2, 3}, and every one of those keys is in the log, so invariant 1 passes. What about the torn fifth line? With the checksum-framed record format from chapter 5, the scanner would discard it outright: the CRC fails and present = {0, 1, 2, 3}. Our text-format scanner, though, happily parses it — partition("=") yields key "key4" and value "value" — which is exactly the point.
Check 2 — corrupted values. With the text-format scanner, present now includes a bogus {4: "value"}, and the invariant 2 check v != f"value{i}" trips: we have detected a torn write surviving as a fake record. This is the harness doing its job.
Check 3 — gaps. None. The present keys are a contiguous prefix.
Verdict. On one trial the harness would report lost=0 torn=1 gaps=0, and you would go fix the scanner to require framing or checksums. You run 200 trials, aggregate, and get a statistical picture of where your store fails.
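The parent's verdict on this trial can be reproduced directly from the two on-disk states; this snippet hard-codes the sets from the example above and applies the three invariant checks:

```python
# Sets taken from the worked example: four acked writes, plus the torn fifth
# line that the text scanner parsed as a real record with value "value".
acked = {0, 1, 2, 3}
present = {0: "value0", 1: "value1", 2: "value2", 3: "value3", 4: "value"}

lost = {i for i in acked if i not in present}                 # invariant 1
torn = {i for i, v in present.items() if v != f"value{i}"}    # invariant 2
gaps = {i for i in range(max(present)) if i not in present}   # invariant 3

print(f"lost={len(lost)} torn={len(torn)} gaps={len(gaps)}")  # lost=0 torn=1 gaps=0
```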
Run the harness. On a correct store with proper fsync + checksumming, all three counters are zero across 200 trials. On the toy from chapter 2 — no fsync, no checksums — you will see lost writes and torn writes immediately. Every production database team runs a harness of this shape, continuously, on every commit.
The four classes of failure you will find
When you run the harness on a buggy store, the failures do not distribute evenly. They fall into four classes, each with a different root cause and a different fix. Learn these four; every failure you will ever debug is one of them.
Lost writes
A write that was acknowledged as durable is gone after recovery. This is the worst class — the store told the application its data was safe and it wasn't.
Root cause. The write was not actually durable at the moment the ack went out. Somewhere between put() and the ack.write(), a layer was skipped — no fsync, a flush() that pushed only to the page cache, a missing directory fsync after a new file was created.
Tiny reproducer.
# BAD: writes acknowledged before fsync → lost on crash
def put_bad(self, k, v):
    self._f.write(f"{k}={v}\n")
    self._f.flush()  # only pushes to kernel page cache
    # missing: os.fsync(self._f.fileno())
    return "acked"   # we lied

# GOOD: fsync before we return
def put_good(self, k, v):
    self._f.write(f"{k}={v}\n")
    self._f.flush()
    os.fsync(self._f.fileno())
    return "acked"
Torn writes
A record on disk is half-written — some bytes of the new value, some bytes of garbage or of the old value. The scanner either sees it as a malformed record (and hopefully rejects it) or, worse, interprets it as a real record with a wrong value.
Root cause. A write of N bytes is not atomic across the stack. Your write() might translate to several sector or block writes internally, and only some of them complete before the crash. Filesystems like ext4 give you atomicity for writes within a single 4KB page under certain mount options, but anything larger can tear across pages.
Tiny reproducer. With the text format, a record like age=16\n can be cut off anywhere — including just after ag — and look like a new line that starts with ag and continues into whatever garbage is on disk. The fix is length-prefixed, checksum-framed records (chapter 5):
# framed record: [4-byte length][payload][4-byte CRC32]
import struct, binascii

def encode(kv_bytes: bytes) -> bytes:
    return (struct.pack("<I", len(kv_bytes))
            + kv_bytes
            + struct.pack("<I", binascii.crc32(kv_bytes)))

def decode_next(f):
    hdr = f.read(4)
    if len(hdr) < 4:
        return None  # clean EOF
    (n,) = struct.unpack("<I", hdr)
    body = f.read(n + 4)
    if len(body) < n + 4:
        return None  # torn tail: stop recovery here
    payload, crc = body[:n], struct.unpack("<I", body[n:])[0]
    if crc != binascii.crc32(payload):
        return None  # corruption: stop recovery here
    return payload
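To see the framing earn its keep, tear a framed log by hand. This demo inlines the same encode/decode logic as above (renamed with leading underscores so the snippet is self-contained) and truncates the log three bytes short, as a crash mid-write would:

```python
import io, struct, binascii

def _encode(payload: bytes) -> bytes:
    # same [length][payload][CRC32] framing as encode() above
    return struct.pack("<I", len(payload)) + payload + struct.pack("<I", binascii.crc32(payload))

def _decode_next(f):
    # same recovery rule as decode_next() above: any short read or bad CRC ends the log
    hdr = f.read(4)
    if len(hdr) < 4:
        return None
    (n,) = struct.unpack("<I", hdr)
    body = f.read(n + 4)
    if len(body) < n + 4:
        return None
    payload, crc = body[:n], struct.unpack("<I", body[n:])[0]
    return payload if crc == binascii.crc32(payload) else None

log = _encode(b"key0=value0") + _encode(b"key1=value1")
torn_log = log[:-3]                        # the crash cut off the last 3 bytes
f = io.BytesIO(torn_log)
recovered = []
while (rec := _decode_next(f)) is not None:
    recovered.append(rec)
print(recovered)                           # [b'key0=value0']: the torn record is discarded whole
```

The second record does not survive as a half-parsed fake; it simply is not there, which is the behaviour invariant 2 demands.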
Reordered writes
The log contains key99 but is missing key50, even though the writer produced them in order. The filesystem or the disk has committed later writes before earlier ones.
Root cause. Two very different things can produce this symptom. First, the directory entry and the file data are separate metadata operations; creating a new file then writing to it can have the data land on disk before the name does, so on recovery the data is unreachable. Fix: after creating a file, fsync the parent directory. Second, the OS writeback scheduler and the disk's internal write buffer can reorder queued writes; fsync inserts a barrier but does not guarantee ordering across unrelated fsyncs. Fix: issue fsyncs in the strict order you want them persisted, and do not use nobarrier mount options.
Tiny reproducer.
# BAD: new file written to, but parent directory not fsynced
# on recovery, the file may not appear in its directory at all
import os

f = open("new.log", "w")
f.write("data\n")
f.flush(); os.fsync(f.fileno())  # file data is durable
f.close()
# missing: dir_fd = os.open(".", os.O_DIRECTORY); os.fsync(dir_fd)
# on a crash before the dir entry is flushed, "new.log" is nowhere

# GOOD: fsync the directory too
f = open("new.log", "w")
f.write("data\n")
f.flush(); os.fsync(f.fileno())
f.close()
dir_fd = os.open(".", os.O_DIRECTORY)
os.fsync(dir_fd)
os.close(dir_fd)
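The good path generalises into a small reusable helper; create_durably is a hypothetical name for illustration, not part of the chapter's store:

```python
import os

def create_durably(path, data: bytes):
    """Write a new file so that both its contents AND its directory entry
    survive a crash: fsync the file, then fsync its parent directory."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())             # file data + inode are durable
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)                 # the name itself is durable
    finally:
        os.close(dir_fd)
```

Every place the store creates, renames, or deletes a file needs this second fsync; forgetting it is one of the most common bugs ALICE found in real systems.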
Silent corruption
A record reads back as valid — no torn framing, no missing bytes — but its value is not what was written. A bit has flipped on the platter. No syscall will catch this because, as far as the OS is concerned, the disk returned what it returned and there was no I/O error.
Root cause. Cosmic rays, DRAM errors, SSD wear, firmware bugs, controller bit-flips. The frequency is low — on consumer SSDs, on the order of one error per 10^15 bits read — but a large database reads 10^15 bits in a week.
Fix. Per-record checksums (CRC32C is the standard choice) that you verify on every read. If the checksum fails, you do not return the value to the application; you return an error. Recovery from that error is a separate question — it is why real systems replicate. A single-disk database can detect silent corruption but cannot repair it.
Tiny reproducer. Simulate a bit-flip:
# simulate a single-bit flip somewhere in the file
import os

with open("harness.log", "r+b") as f:
    f.seek(os.path.getsize("harness.log") // 2)
    byte = f.read(1)
    f.seek(-1, 1)
    f.write(bytes([byte[0] ^ 0x01]))  # flip the lowest bit

# re-run the scanner. Without checksums, you will read a wrong value and not know.
# With CRC32 framing, the scanner will reject that record with a corruption error.
Jepsen-lite for a single-node store
The harness above is what you write on day one. It covers SIGKILL — rung 2 of the violence ladder. To climb further, you need two more capabilities:
Fault injection below the store. Instead of killing the process, intercept the syscalls it makes, and lie to it about which ones succeeded. For fsync in particular, you want a mode where the store thinks the data is durable but actually is not — because that is what a lying SSD does. On Linux, the cleanest way is an eBPF probe that hooks sys_fsync and, in a fraction of cases, turns the fsync into a no-op before the kernel ever runs it. Your harness now tests the store's behaviour under a realistically adversarial storage stack.
# conceptual — real eBPF code is in C with bcc or bpftrace
# bpftrace -e 'kprobe:vfs_fsync /pid == TARGET/ { @n = @n + 1; if (@n % 10 == 0) { override(0); } }'
# every 10th fsync returns success without actually doing anything
Signal-driven probes inside the store. A SIGSTOP from the parent freezes the store at a random instant. The parent can inspect the on-disk state — a snapshot — then SIGCONT the store and compare to see what moved. This is how Jepsen-style tools narrow down which exact operation was in flight when the failure happened.
This combination — parent-controlled kills, eBPF-based fsync interception, on-disk snapshots between signals — is sometimes called Jepsen-lite, because it brings the Jepsen style of adversarial testing (normally aimed at distributed systems) down to a single-node store. For single-machine correctness, this is enough. When you get to replicated systems in Build 5, you graduate to Jepsen proper.
Common confusions
- "If SIGKILL passes, fsync is correct." No. SIGKILL only kills the process; the kernel still flushes its page cache shortly afterwards. A store with no fsync at all will survive most SIGKILL runs because the background writeback commits the pages to disk before the parent ever looks. To actually test fsync, you need rung 3 — a power cut, or dm-log-writes replay (below).
- "If I fsync often enough, I do not need a checksum." No. fsync addresses lost writes and some torn writes; it does nothing for silent corruption. A bit can flip on the platter a year after you wrote the record, and no future fsync will notice. Checksums are the only defence for bytes already on disk.
- "The harness is flaky because it gets different results each run." That is not a flake — that is the point. Durability bugs are inherently probabilistic; a given bug has some probability of firing in any one trial. You tune the harness by running enough trials (hundreds to thousands) to make the expected hit count high. If a bug fires in 1% of trials, 200 trials still miss it about 13% of the time (0.99^200 ≈ 0.13) — so run 1000.
- "I can skip fsync because my Python version guarantees flushing on close." CPython's close() flushes Python's userspace buffer, which issues write(2) to the kernel. It does not call fsync. The POSIX standard explicitly allows close() to return before the data is on disk. This confusion has cost a lot of data, and it is the reason chapter 3 exists.
- "A correctness harness should be green every time." Counterintuitively, a durability harness that is green every time is a harness that is not stressing the system hard enough. If you never see a torn tail on disk, increase the write rate, shrink the kill delay, and push the parallelism up until you see torn tails being correctly discarded by the scanner. Then you know the recovery code is on the hot path.
- "If my store passes 200 trials of SIGKILL, it's production-ready." No. 200 trials at rung 2 is the minimum bar. Production-ready means: thousands of trials at rung 2; at least some trials at rung 3 (dm-log-writes or VM power-cut); eBPF fsync interception modelling dishonest hardware; and — the part no harness can do — real hardware QA with real power cuts on the class of disk you plan to ship on.
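The trial-count arithmetic from the flakiness bullet above is worth having on hand: a bug that fires with probability p per trial is missed by n independent trials with probability (1 - p)^n.

```python
def p_miss(p, n):
    # probability that n independent trials ALL fail to trigger a bug
    # that fires with probability p in any one trial
    return (1 - p) ** n

print(p_miss(0.01, 200))   # ≈ 0.134: 200 trials miss a 1%-per-trial bug about 13% of the time
print(p_miss(0.01, 1000))  # ≈ 0.00004: 1000 trials make a miss negligible
```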
Going deeper
If you just wanted a harness that shakes out the obvious durability bugs, the 50-line loop above is enough. The rest of this section points at the industrial-strength tools that real storage teams use, and the historical incidents that motivated them.
Jepsen proper
Jepsen — built by Kyle Kingsbury — is the gold-standard adversarial testing framework for database correctness. It is aimed primarily at distributed systems (it partitions the network, skews clocks, pauses nodes, reshuffles packets) but many of its ideas scale down cleanly to a single-node store: the harness records the history of every operation the client issued, every response the database returned, and every fault injected, and then checks at the end whether a linearisable history could have produced that sequence. When it cannot, the tool prints the shortest counterexample. Kingsbury's Jepsen reports have found correctness violations in MongoDB, Cassandra, etcd, CockroachDB, RabbitMQ, and many others — published, reproducible, and archived at jepsen.io.
The technique transfers down: a single-node Jepsen-lite tool records every put and get, kills the process at a random moment, restarts, and checks that the returned history is compatible with some serial order consistent with the ack/no-ack signals. If an ack'd put is not visible on recovery, that is a linearisability violation by the database's own contract.
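A history check for this single-node contract can be tiny. This sketch (check_history is a hypothetical name) takes the acks the client collected before the crash and the state a fresh process recovered:

```python
def check_history(acked_puts, recovered):
    """acked_puts: {key: value} for every put the client saw acknowledged
    before the crash; recovered: {key: value} read back by a fresh process.
    Returns the keys that violate the store's own durability contract."""
    # an acked put that is missing, or present with the wrong value, is a violation;
    # an UNacked put surviving (or not) is always legal — the crash caught it in flight
    return [k for k, v in acked_puts.items() if recovered.get(k) != v]

print(check_history({"a": "1"}, {"a": "1", "b": "2"}))  # []: unacked "b" surviving is legal
print(check_history({"a": "1"}, {}))                    # ['a']: a lost ack
```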
ALICE and CrashMonkey
ALICE (OSDI 2014, Pillai et al. — the same paper cited in chapter 1) is an academic tool that systematically explores every valid crash state a filesystem might produce between two fsync points, and runs the application's recovery code on each. It modelled ext3, ext4, btrfs, xfs; it found durability bugs in LevelDB, SQLite, HDFS, Git, Mercurial, and VMware's WAL. ALICE is what taught the industry that "call fsync where it matters" is a lot harder in practice than it sounds.
CrashMonkey (OSDI 2018, Mohan et al.) is the spiritual successor, more automated and scalable. It snapshots the block device after every barrier-crossing write, replays every possible subset of unsynced operations, and checks the application state on recovery. It found crash-consistency bugs in mainline ext4 and btrfs.
You are not expected to run ALICE or CrashMonkey yourself during Build 1 — they are substantial tools — but you should know their names and shape. When you graduate to a real storage engine, they are the tools you reach for.
dm-log-writes — deterministic power-cut replay
Linux ships a device-mapper target called dm-log-writes that records every block write issued to a disk, with barriers marked. You run your database on top of a dm-log-writes device; it produces a log of every write. Then you replay the log up to any chosen point (just before a barrier, in between two barriers, at the exact instant of the Nth fsync) and mount the resulting image as the state of the disk at that simulated crash moment. Point your recovery code at it. If recovery cannot handle every prefix that ended between barriers, you have a bug.
This is the closest thing to a deterministic power-cut test on commodity Linux, and it is what kernel filesystem developers use to test their own changes. It is used by Btrfs, XFS, and the LWN-published benchmarks of database fsync behaviour.
eBPF for fsync interception and lying-disk simulation
eBPF lets you attach small programs to kernel tracepoints and syscall entry/exit points without modifying the kernel. For our use:
- Hook vfs_fsync_range on the store process's fd and override its return code.
- Sample one fsync in N and "lose" it — let the kernel return success while dropping the actual flush.
- Record every write + fsync and replay the sequence back into the store to reproduce a specific schedule.
bcc and bpftrace are the usual front-ends. Written correctly, an eBPF-driven harness lets you model the behaviour of dishonest consumer-grade SSDs on honest enterprise hardware — which means your tests cover the user's hardware, not yours.
Real-world incidents that motivated all of this
The reason every serious database team runs a crash harness is that every serious database has lost data to a crash bug. A partial roll-call:
- MongoDB, 2013–2015: multiple rounds of Jepsen testing uncovered lost-write windows and stale reads in a variety of configurations. The team published fixes and tightened defaults. The episode is documented at jepsen.io/analyses/mongodb.
- etcd, 2014–2016: Jepsen found several correctness bugs; the team adopted rigorous crash testing and today etcd is one of the most-tested WAL-based stores in open source.
- RocksDB / LevelDB: the ALICE paper (cited above) found missing directory fsyncs and unsafe filename reuse in LevelDB, which were fixed upstream.
- Postgres, 2018 — the infamous "fsyncgate", where Linux's fsync semantics were discovered to silently drop dirty pages after an I/O error, causing Postgres (and every other database on Linux) to think data was durable when it wasn't. The fix involved both kernel changes and application-side changes, and it is required reading for anyone who thinks fsync is straightforward.
- SQLite: the project maintains a suite of crash tests that run for hours per release, and the list of documented crash fixes in its release notes is a multi-year timeline of edge cases no unit test ever caught.
The pattern is consistent: the bug is not in the obvious line of code. It is in the interaction between the store, the filesystem, and the disk, exposed by a specific schedule of operations that a crash interrupts. The only tool that finds it is a harness that runs that schedule and that crash, over and over.
Where this leads next
The harness you have built can find lost writes, torn writes, reordered writes, and silent corruption — but fixing the first two requires a new record format. The text-line format from chapter 2 cannot detect a torn tail reliably, and it has no place to hang a checksum.
- The Three Walls — Linear Reads, Corrupt Tails, Unbounded Growth — chapter 5 takes the three structural limits of the append-only log (slow reads, torn tails, unbounded growth) and shows how every real storage engine is designed around exactly these three pressures. It introduces length-prefixed framing, CRC32C checksums, segment files, tombstones, and compaction — the concrete fixes for torn writes and unbounded growth, and the setup for the indexing work of Build 2.
- The Append-Only Log — chapter 2: the store this harness was attacking. Revisit it with the harness running and watch the bugs you could not see before.
- fsync, Write Barriers, and What "Durable" Actually Means — chapter 3: the syscall-level details the harness is stress-testing. Every green run of the harness is evidence that the decisions you made in chapter 3 were correct.
References
- Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 — the ALICE paper. The empirical demonstration that write + fsync is, in practice, much harder than it looks.
- Mohan et al., Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing, OSDI 2018 — the CrashMonkey paper, and the current state of the art in automated crash testing.
- Kyle Kingsbury, Jepsen — the canonical adversarial testing framework and its library of database analyses (MongoDB, etcd, Cassandra, CockroachDB, and many more).
- PostgreSQL wiki, Fsync Errors — the 2018 "fsyncgate" post-mortem. A required read on why fsync is not as simple as its man page.
- Linux kernel documentation, dm-log-writes — the deterministic block-layer power-cut replay tool used by kernel filesystem developers.
- SQLite, How SQLite Is Tested — the test strategy of one of the most rigorously crash-tested databases in existence, including anomaly testing and fault injection.