In short

When your Python program calls write(), the bytes do not go to the disk. They go into a queue of in-RAM buffers — Python's, then the kernel's, then the disk controller's — and a few milliseconds later, if nothing goes wrong, they make it to the platter or the NAND. A power cut at any point in that journey loses the write, even though your code already moved on. fsync(fd) is the one POSIX syscall that says do not return until the bytes for this file are on persistent storage. That promise has a hole — the disk controller's own cache, which on consumer hardware sometimes lies — that you close with write barriers (a mount-time setting the kernel turns on by default on modern Linux) and by buying disks with power-loss protection. Three nearby syscalls — fdatasync, O_DSYNC, and O_DIRECT — trade different bits of the cost for different slices of the guarantee. By the end of this chapter you will have a decision table: for any given durability target, you will know exactly what to call, in what order, and why.

You built the append-only log in chapter 2. You called f.flush() after every write. You benchmarked it at 850,000 writes per second and went to bed happy.

Then the power dies. You reboot. Your load_all() returns half the records you thought you wrote. The last thing you see in the log is a record from four hundred milliseconds before the crash — everything after that is gone. But your code returned from put(). Your flush() succeeded. You did not crash inside the write; the power died hundreds of milliseconds after your code had already moved on.

This is the mystery. The flush worked, no exception was raised, the function returned — and the data is not on disk. To understand why, you have to learn that there is no single thing called "the disk" between your code and safety. There are four things, each with its own little pool of RAM, each willing to tell the layer above that the write is "done" when it merely reached the next layer down. Durability is not a property of a syscall. It is a property of a journey across four caches, and the syscall only controls how far along the journey you have waited.

This chapter walks the four caches, names the syscall that flushes each one, and ends with a decision flow: for this durability target, call these syscalls in this order.

The four-layer write stack

Start with a picture of what happens the moment your Python program says f.write("age=16\n").

[Figure: the four-layer write stack. Four stacked boxes, top to bottom: (1) application buffer, inside your Python process, volatile RAM; (2) kernel page cache, shared by all processes, volatile RAM; (3) disk controller cache, DRAM on the drive, volatile unless power-loss protected; (4) platter or NAND flash, persistent, the only layer that survives power loss. The barrier between boxes 1 and 2 is flush()/write(), about 1 µs; between 2 and 3, fsync()/fdatasync(), about 100 µs; between 3 and 4, FUA or a write barrier, 50 µs to 10 ms. write() reaches layer 2; fsync() pushes to layer 3; a write barrier pushes layer 3 to layer 4. Each arrow is a place the bytes can be lost on a power cut, if you did not wait for it.]
The four caches between your code and the physical media. Each arrow between boxes is a barrier you can cross only by calling a specific syscall and paying for the latency it implies.

Read the picture slowly. There are four boxes and three barriers. A durability guarantee is nothing more than the claim that the bytes have crossed barriers 1, 2, and 3 and are sitting in box 4. Let us walk each barrier.

Barrier 1 — application buffer to kernel page cache. When you write to a Python file object in text or buffered mode, the bytes go first into a FILE*-style buffer living inside your process's address space. The default buffer size is 8192 bytes (io.DEFAULT_BUFFER_SIZE), two typical 4 KiB pages. f.flush() calls the write(2) syscall, which copies the buffer into the kernel's page cache — a slab of RAM the kernel uses to mirror file contents. The cost of crossing this barrier is microseconds. The danger: if the process crashes (segfault, kill -9) while bytes are still in the application buffer, those bytes are lost, even though the kernel is fine.
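A five-line demonstration of barrier 1 (the file name is invented; run it anywhere): the kernel cannot see a small buffered write until flush() issues the write(2) call.

```python
import os

# Bytes written to a buffered file object sit in the application buffer;
# the kernel (and os.path.getsize) cannot see them until flush().
path = "buf_demo.txt"
f = open(path, "w")                   # text mode: buffered in-process
f.write("age=16\n")                   # 7 bytes, far below the 8192-byte buffer

size_before = os.path.getsize(path)   # kernel has not seen the bytes yet
f.flush()                             # write(2): cross barrier 1
size_after = os.path.getsize(path)    # now the page cache has them

print(size_before, size_after)        # 0 7
f.close()
os.remove(path)
```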

Barrier 2 — kernel page cache to disk controller. The page cache is RAM owned by the kernel. A reboot or power cut wipes it. The kernel flushes dirty pages to the disk on its own schedule — typically every 5 seconds, or sooner under memory pressure — but there is no guarantee of ordering and no promise about which writes have made it. fsync(fd) is the syscall that says wait until every dirty page for this file has been sent to the disk and the disk has acknowledged receipt. The cost: about 50 microseconds on a fast NVMe SSD, 1–10 milliseconds on a typical SSD, 5–20 milliseconds on a rotational disk.

Barrier 3 — disk controller cache to the media. Every SSD and every recent spinning disk has a small DRAM cache inside the drive itself. When the kernel sends write commands, the controller acknowledges them when they land in its DRAM — not when they are written to NAND or platter. A power cut at this moment loses the writes unless the drive has power-loss protection (PLP): onboard capacitors that hold enough charge to flush the DRAM to NAND when the rails collapse. The kernel crosses this barrier by sending the disk a cache flush command (SCSI SYNCHRONIZE CACHE, NVMe FLUSH, or a per-write FUA — Force Unit Access — flag). fsync on Linux issues this command by default.
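Crossing the barriers from ordinary buffered Python takes two calls, as in this minimal sketch; flush() alone stops at the page cache, and on Linux the fsync also sends the drive the cache-flush command for barrier 3.

```python
import os

# A durable append from a buffered file object: flush() crosses
# barrier 1 only; os.fsync() crosses barrier 2 (and, on Linux, asks
# the drive to cross barrier 3 as well).
with open("log.txt", "a") as f:
    f.write("age=16\n")
    f.flush()                  # barrier 1: app buffer -> page cache
    os.fsync(f.fileno())       # barrier 2/3: page cache -> device -> media
```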

Why this stack exists: every layer is there for throughput. The application buffer reduces syscall overhead by batching small writes. The kernel page cache reduces disk I/O by batching across processes and by allowing reads to hit RAM. The disk controller cache reduces flash wear by coalescing and re-ordering writes. Each layer speeds up the common case (no crash) by a factor of 10 to 1000 and costs you extra work in the uncommon case (crash). Durability is the art of paying that extra work only where it matters.

Now that the stack is named, the syscall you need becomes a simple function of how far down you insist on waiting.

What fsync(2) actually promises

Here is the POSIX specification for fsync, paraphrased into one sentence:

fsync(fd) shall not return until the system has completed the transfer of modified data associated with fd (and, for regular files, the file's metadata needed to retrieve it) to the storage device.

Three things in that sentence matter.

"Transfer to the storage device." POSIX says the bytes must reach the storage device — not that they must be on the media inside it. On a drive with a volatile cache, this is the distinction between barrier 2 and barrier 3 in our diagram. Linux's fsync since kernel 2.6 (roughly 2005, though the setting has moved around) does send a cache-flush command by default, closing that gap. But POSIX does not require it, and on macOS the default fsync does not cross barrier 3 — you need fcntl(fd, F_FULLFSYNC) for that.
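A portable wrapper for that difference might look like this sketch (the helper name is invented; F_FULLFSYNC is real, but macOS-only):

```python
import os
import sys
import fcntl

def full_fsync(fd: int) -> None:
    # macOS: plain fsync stops at the drive's volatile cache, so ask
    # for F_FULLFSYNC, which flushes all the way to the media.
    # Linux: fsync already sends the cache-flush command by default.
    if sys.platform == "darwin":
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)

fd = os.open("fullfsync_demo.txt", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"x")
full_fsync(fd)
os.close(fd)
```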

"Metadata needed to retrieve it." fsync also flushes inode metadata — the file length, the modification time, the block map that says this file's bytes live at these disk addresses. If you fsync the file's data but the filesystem crashes before the metadata is written, you could have bytes sitting on the disk that no file knows about — effectively garbage. This is why fsync is more expensive than just writing data: it has to durably commit two things, and they live in different places on the disk.

"For fd." This is the subtle one. fsync(fd) promises durability for that file's data and metadata. It does not promise anything about the directory entry that lets a future open() find the file. If you just created a new file and fsynced it, but the directory entry has not been flushed, a crash could leave the file on disk with nothing pointing at it. To make a newly-created file findable after a crash, you must fsync both the file and its parent directory.

The "fsync the directory too" rule

import os

def durable_create(path, data):
    # 1. Write the file atomically — write to a temp name, fsync, rename, fsync dir.
    dirpath = os.path.dirname(path) or "."
    tmp = path + ".tmp"

    # write data and fsync the file
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # bytes + inode metadata to disk
    finally:
        os.close(fd)

    # rename into place (atomic on POSIX)
    os.rename(tmp, path)

    # fsync the directory — makes the new name durable
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

Why each step.

  1. Write to a temporary name so that if we crash mid-write the original file (if any) is intact.
  2. fsync the temp file so its bytes and inode are on the disk before anyone learns the name.
  3. rename — atomic on POSIX — swaps the new name in. Either the old file or the new one exists; never a half-written one under the target name.
  4. fsync the directory — pushes the rename itself to disk. Without this step, a crash after the rename but before the kernel flushed the directory entry could leave the directory still pointing at the old file, and the new file (with its good data) orphaned.

The sequence write → fsync data → rename → fsync directory is the idiom for durable file creation on POSIX. Every production database that writes new segments does something like this. Skipping the last fsync is one of the classic durability bugs — the bytes are safe, but the system does not know where they are.

Why the directory matters: on most filesystems, the directory is itself a file whose contents are (name, inode) pairs. Updating the directory is a separate write, to a separate inode, and it goes through the same page cache as everything else. fsync on your file does not touch the directory's inode. The rename syscall makes the directory change atomic with respect to readers, but it does not make it durable.

fdatasync vs fsync vs O_DSYNC vs O_DIRECT

The POSIX family of durability syscalls is small, and each member trades a different piece of cost for a different piece of guarantee. Here is the four-way comparison, with a running example: appending one 64-byte record to a log file.

fsync(fd) — flushes the file's data and all relevant metadata (size, timestamps, block map) to the storage device. Waits for the disk's acknowledgement. This is the default durability primitive. On a modern NVMe drive it costs 50–200 microseconds per call; on spinning rust, 5–15 ms.

fdatasync(fd) — flushes the file's data to the storage device, and only the metadata that a reader would need to retrieve the data — typically the file's length if it grew, but not the modification time. On a workload that appends to an ever-growing file (exactly our log), fdatasync ends up flushing almost as much as fsync, because the file's length changes on every append. But on a workload that overwrites existing bytes in place — updating a fixed-size record — fdatasync saves one metadata write per call, which on busy systems can double throughput.
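The overwrite-in-place case is worth seeing concretely. A sketch, with an invented file name and slot layout:

```python
import os

# Fixed-size slots overwritten in place: the file length never changes,
# so fdatasync can skip the metadata write that fsync would pay for.
SLOT = 64
fd = os.open("slots.db", os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 1024 * SLOT)            # preallocate 1024 slots, once

def update_slot(n: int, record: bytes) -> None:
    assert len(record) == SLOT
    os.pwrite(fd, record, n * SLOT)      # in-place overwrite, same length
    os.fdatasync(fd)                     # data blocks only; mtime can wait

update_slot(3, b"name=Ada".ljust(SLOT, b"\x00"))
print(os.pread(fd, 8, 3 * SLOT))         # b'name=Ada'
os.close(fd)
```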

O_DSYNC (open flag) — set at open time. Every subsequent write(fd, ...) behaves as if you called fdatasync immediately after. Useful when you are doing one write per transaction and want the commit to happen inside the same syscall, with no chance of forgetting to sync. The cost per write is the same as one fdatasync, but the syscall count is halved.

O_SYNC (open flag) — like O_DSYNC but also flushes full metadata, so equivalent to fsync on every write. Rarely used in modern code; O_DSYNC plus explicit fsync at the right moments is almost always better.

O_DIRECT (open flag) — bypasses the kernel page cache entirely. Writes go straight from your user-space buffer to the disk controller. Reads come back the same way. This is not a durability primitive by itself — the disk controller can still buffer. But combined with fsync (or O_DSYNC), it gives the database control over caching: the kernel never has a copy, so you never waste memory on pages the database plans to evict anyway. Major production databases — Oracle, DB2, InnoDB under some configurations — use O_DIRECT because they run their own buffer pool and do not want the kernel duplicating the cache. It is also picky: the buffer address and the length must be aligned to the filesystem block size (typically 4096 bytes). Get it wrong and you get EINVAL.

Here is each of these, end to end, in Python:

import os

# (1) fsync — the workhorse
fd = os.open("log.txt", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
os.write(fd, b"age=16\n")
os.fsync(fd)                      # data + all metadata
os.close(fd)

# (2) fdatasync — the cheaper cousin when only data/size matter
fd = os.open("log.txt", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
os.write(fd, b"age=16\n")
os.fdatasync(fd)                  # data + only required metadata
os.close(fd)

# (3) O_DSYNC — implicit fdatasync on every write
fd = os.open("log.txt", os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, b"age=16\n")         # returns only after data is on the device
os.close(fd)

# (4) O_DIRECT — bypass the page cache (requires aligned buffers)
import mmap
buf = mmap.mmap(-1, 4096)         # anonymous mmap: page-aligned 4 KiB buffer
rec = b"age=16\n"
buf.write(rec + b"\x00" * (4096 - len(rec)))
fd = os.open("raw.log", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.write(fd, buf)                 # must write a whole 4096-byte block
os.fsync(fd)                      # still needed — O_DIRECT is not sync
os.close(fd)

Why O_DIRECT needs alignment: when the kernel does the copy into the page cache, it can handle arbitrary byte offsets. But when the DMA engine on the disk reads your buffer directly, it can only operate on whole sectors; DMA transfers do not work at byte granularity. So the buffer address and the size must both be multiples of the block size the hardware expects.
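You can ask the filesystem what block size it uses; a sketch with statvfs (the value you get back depends on your filesystem, but 4096 is common):

```python
import os
import mmap

# O_DIRECT wants buffers aligned to (a multiple of) the device block
# size. The filesystem's f_bsize is a safe choice, and anonymous mmap
# regions are page-aligned, which satisfies typical 4 KiB requirements.
st = os.statvfs(".")
print("filesystem block size:", st.f_bsize)
print("page size:", mmap.PAGESIZE)

buf = mmap.mmap(-1, max(st.f_bsize, mmap.PAGESIZE))  # aligned buffer
```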

Which one to use is a function of your durability target. The decision flow at the end of this chapter puts it on one page.

Measuring the real cost of each call

# bench_fsync.py — how slow is each flavour, in real microseconds?
import os, time

def bench(label, setup, write_and_sync, n=10_000):
    fd = setup()
    t0 = time.perf_counter()
    for _ in range(n):
        write_and_sync(fd)
    elapsed = time.perf_counter() - t0
    os.close(fd)
    print(f"{label:<22}  {n/elapsed:>8,.0f} writes/s  ({elapsed*1e6/n:>6.1f} µs/write)")

REC = b"key=value\n"

bench("no sync",
    lambda: os.open("t.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_TRUNC),
    lambda fd: os.write(fd, REC))

bench("fsync every write",
    lambda: os.open("t.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_TRUNC),
    lambda fd: (os.write(fd, REC), os.fsync(fd)))

bench("fdatasync every write",
    lambda: os.open("t.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_TRUNC),
    lambda fd: (os.write(fd, REC), os.fdatasync(fd)))

bench("O_DSYNC",
    lambda: os.open("t.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_TRUNC | os.O_DSYNC),
    lambda fd: os.write(fd, REC))

Typical output on a 2025 consumer NVMe:

no sync                 3,100,000 writes/s  (   0.3 µs/write)
fsync every write          8,400 writes/s  ( 119.0 µs/write)
fdatasync every write     11,200 writes/s  (  89.0 µs/write)
O_DSYNC                   11,500 writes/s  (  87.0 µs/write)

The unsynced version is three hundred times faster because it is paying for nothing below the kernel page cache. fdatasync and O_DSYNC run neck and neck and are 30% faster than fsync, because each append extends the file, but only the size and data go to disk — modification time and other inode fields stay in the page cache until the next natural flush.

This is the exact ratio you feel as a database engineer: durability costs two to three orders of magnitude, and the art is deciding which writes are worth paying for.

Write barriers and the honest-disk problem

So far the story has been: you call fsync, the kernel flushes to the disk controller, and some moments later the data is on the media. The last step — controller cache to NAND or platter — is the one no syscall in your language can directly observe. It is owned by the firmware inside the drive.

Linux crosses barrier 3 by sending a cache flush command (SYNCHRONIZE CACHE for SCSI/SATA, FLUSH for NVMe) to the drive as part of every fsync. The drive is supposed to refuse to acknowledge the command until its own DRAM cache is on the media. This behaviour is called a write barrier: the cache-flush command forms a barrier in the command stream, with all writes before it guaranteed persistent before any write after it is processed.

Three things can break this.

The disk lies. Some consumer SSDs acknowledge the flush command instantly, without actually flushing NAND, because benchmarks run faster with a dishonest cache. Server-grade drives with power-loss-protection capacitors do not need to lie — the capacitors guarantee that even if the power cuts, the cache will still make it to NAND. Consumer drives without capacitors face a choice: be slow, or be unsafe, or lie about it. Many of them lie.

The filesystem is mounted with nobarrier. For performance, some administrators mount filesystems with the nobarrier (or barrier=0) option, which tells the kernel to skip the cache-flush command on fsync, treating the disk as if it had no volatile cache. Barriers are on by default on modern Linux, but they are one mount flag away from being off, and many older tuning guides still recommend turning them off "for speed." On a drive without PLP, this setting makes fsync almost pure fiction.

The controller reorders writes. Even a disk that honours flushes correctly will reorder the non-flushed writes between barriers. If your code does write A; write B; fsync;, the disk may commit B to media before A — as long as both are on media before the fsync returns. This is fine for most workloads. But if you are doing something clever (like assuming that the log record appears on disk before the index page that references it), you have to use two fsyncs with a dependency between them, or use O_DSYNC to make every write its own barrier.
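That two-fsync dependency pattern, sketched with invented file names:

```python
import os

# Guarantee "log record durable before the index page that references
# it": fsync between the two writes. A single fsync after both writes
# would let the drive commit them in either order.
log_fd = os.open("wal.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
idx_fd = os.open("index.db", os.O_RDWR | os.O_CREAT, 0o644)

os.write(log_fd, b"put key=value at offset 0\n")
os.fsync(log_fd)                          # barrier: the record is on media
os.pwrite(idx_fd, b"key -> 0".ljust(32, b"\x00"), 0)
os.fsync(idx_fd)                          # only now may the index follow

os.close(log_fd)
os.close(idx_fd)
```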

fsyncgate — the time Postgres discovered the kernel was lying

In 2018, the PostgreSQL team discovered that on Linux, fsync could silently return success even when the kernel had failed to write data to disk. Here is the sequence:

  1. A dirty page sits in the kernel page cache.
  2. The kernel tries to write it to disk in the background.
  3. The write fails (disk full, I/O error, drive yanked).
  4. The kernel clears the dirty flag on the page — "we tried, we failed, moving on."
  5. Your program, which has no idea any of this happened, eventually calls fsync.
  6. fsync looks at the page: not dirty! Returns success.
  7. Your data is gone. fsync said it was fine.

This bug, nicknamed fsyncgate, had been silently corrupting databases for years. Postgres's fix was to crash the server the moment fsync returns an error — because by then, the kernel has already forgotten which pages it failed on, and there is no safe way to retry. "Crash on fsync error" is now the advice for any durable-by-default application on Linux.
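The "crash on fsync error" rule fits in a few lines; a sketch, with os.abort() standing in for whatever your server's panic path is:

```python
import os
import sys

def fsync_or_die(fd: int) -> None:
    # Never retry a failed fsync: by the time it reports the error, the
    # kernel may already have dropped the dirty pages it failed to
    # write. The only safe response is to stop and recover from the log.
    try:
        os.fsync(fd)
    except OSError as e:
        sys.stderr.write(f"fsync failed: {e}; aborting, recover from log\n")
        os.abort()

fd = os.open("commit.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
os.write(fd, b"record\n")
fsync_or_die(fd)          # returns normally on success
os.close(fd)
```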

Lesson: even a correct, well-tested, widely-used syscall on a widely-used kernel can silently lose your data for a decade before anyone notices. Power-loss testing (chapter 4) is the only way to trust your stack.

Why this is so hard to diagnose: the data is gone, but the return code is zero. No exception is raised. Your application has no signal that anything went wrong. The filesystem logs may have a line about an I/O error, but it is not tied to any particular fsync call in your code. The only way to catch it is to actively test — write data, crash the machine, reboot, verify the data is there.

The decision flow — what to call for which target

You now know enough to pick a durability strategy. The question is: how much data loss can you tolerate on a crash, and how much throughput will you trade to reduce it? Here is the mapping, from loosest to strictest.

[Figure: durability decision tree. Top node: how much data can you afford to lose on a crash? Five leaves, loosest to strictest: anything (flush() only, no fsync; caches, logs, crash-tolerant apps; ~3M writes/s), about 5 seconds (periodic fsync from a background thread; Redis AOF "everysec", MongoDB default; ~3M writes/s), one commit batch (fsync at commit with group commit; Postgres WAL, MySQL InnoDB; ~100k txn/s), zero per transaction (fsync per commit, fdatasync acceptable; financial ledgers, synchronous commit; ~10k txn/s), paranoid (O_DSYNC plus fsync plus a PLP disk plus power-loss testing; HFT journals, medical devices; ~10k txn/s). The middle branch is where most production databases sit; group commit amortises one fsync across many transactions.]
Five durability targets and the syscall sequence each one demands. The throughput numbers are order-of-magnitude estimates on a consumer NVMe in 2026; your mileage will vary.

The five targets, in detail:

1. "Anything" — no fsync, flush only. For a cache or a recompute-from-source log. Your code calls f.flush() to move bytes into the kernel and walks away. A crash can lose seconds of writes. This is what the chapter-2 store did, and it is the right answer for a workload where losing recent writes is cheaper than the fsync bill.

2. "About five seconds" — periodic fsync. A background thread wakes every 1–5 seconds and calls fsync once. All writes between fsyncs are committed together. Redis's appendfsync everysec is this pattern; so is MongoDB's default. You lose at most the last few seconds of writes, and the fsync cost is amortised across thousands of writes. This is the pragmatist's choice.

3. "One commit batch" — group commit. Every transaction is an fsync, but the database deliberately batches concurrent transactions so that one fsync covers many of them. Postgres, MySQL InnoDB, and every serious RDBMS uses this. If ten threads all commit in the same 1 ms window, they share one fsync — each transaction pays a fraction of the cost. This is the sweet spot for durability-per-transaction.

4. "Zero, per individual commit" — fsync every transaction alone. No batching. Each transaction blocks until its fsync returns. Slow under load but strictest. This is what you pick for a financial ledger where lying to the user even for 10 ms is unacceptable.

5. "Paranoid" — O_DSYNC plus explicit fsync plus power-loss-protected hardware plus active power-loss testing. For systems where a single lost transaction is a regulatory or safety event. The software stack above does everything it can; the hardware stack underneath has capacitors and certification. You combine both, and you still test by yanking the plug on a real copy.
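Target 2, the periodic-fsync background thread, can be sketched in a few lines (the class name is invented; Redis's AOF code does essentially this):

```python
import os
import threading

class EverySecondLog:
    """Periodic-fsync durability: appends hit only the page cache; a
    background thread pays for one fdatasync per interval, amortised
    over every append that arrived since the last one."""

    def __init__(self, path: str, interval: float = 1.0):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._syncer, args=(interval,), daemon=True)
        self._thread.start()

    def append(self, record: bytes) -> None:
        os.write(self.fd, record)          # fast path: no sync here

    def _syncer(self, interval: float) -> None:
        while not self._stop.wait(interval):
            os.fdatasync(self.fd)          # one sync covers the window

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        os.fdatasync(self.fd)              # final sync on shutdown
        os.close(self.fd)

log = EverySecondLog("every_sec.log", interval=0.1)
log.append(b"a=1\n")
log.append(b"b=2\n")
log.close()
```

You lose at most one interval's worth of appends on a crash, which is exactly the contract target 2 promises.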

Group commit, 20 lines of Python

import os, threading, queue, time

class GroupCommitLog:
    def __init__(self, path, flush_ms=1):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.q = queue.Queue()
        self.flush_ms = flush_ms / 1000
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes) -> None:
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                           # blocks until fsync returns

    def _writer(self):
        while True:
            items = [self.q.get()]            # at least one
            time.sleep(self.flush_ms)         # gather more into the batch
            while not self.q.empty():
                items.append(self.q.get_nowait())
            for rec, _ in items:
                os.write(self.fd, rec)
            os.fdatasync(self.fd)             # ONE fdatasync for the whole batch
            for _, done in items:
                done.set()                    # wake every committer at once

What this gets you. Each commit() call blocks until the writer thread has fsynced. But the writer thread waits 1 millisecond between batches, gathering every transaction that arrives in that window. If 100 threads commit in the same millisecond, they share one fsync: each pays about 10 microseconds of real blocking time plus its fair share of the 150-microsecond fsync. Without batching, each would pay 150 microseconds alone.

Postgres's commit group, MySQL InnoDB's commit concurrency, and every WAL-based engine you have ever used is some elaboration of this 20-line loop. Group commit is the single most important performance optimisation in transactional storage.


Going deeper

If you just wanted to know what fsync does and when to call it, you have it: four caches, four syscalls (plus O_DIRECT for cache bypass), one decision tree. The rest of this section connects the syscall-level story to filesystem internals, SSD firmware, and the bugs databases have actually hit in production.

Filesystem differences — ext4, XFS, ZFS, and btrfs

POSIX names fsync; what fsync actually does depends on the filesystem.

ext4 data=ordered (the default). Metadata is journaled; data is not. Before a metadata change commits, the corresponding data blocks are written. fsync on a file flushes its data, then journals and flushes the metadata. This is the common case you assume in most discussions.

ext4 data=journal. Both data and metadata go through the journal — every byte gets written twice (once to the journal, once to the final location). fsync becomes roughly twice as expensive, but the crash-consistency guarantee is stronger: after recovery, the filesystem replays the journal and the file is exactly in the last fsynced state or the previous one, never in a partial state.

ext4 data=writeback. Metadata journaling, but no ordering between data and metadata. Faster; weaker guarantee. A crash can leave metadata pointing at blocks that contain old data from before the write. Rarely used in production.

XFS. Journaling filesystem from SGI, known for performance on large files and heavy concurrent writes. fsync semantics are comparable to ext4 data=ordered. XFS's metadata journal is more aggressive, so metadata-heavy workloads often run faster than ext4.

ZFS and btrfs. Copy-on-write filesystems. Every write goes to a new location; the old location is freed later. The effect on fsync is subtle: there is no "torn write" at the block level because the new block is written in full before the pointer to it is updated. ZFS's transaction group commit is roughly the filesystem running its own group-commit over the top of yours. Both of these filesystems do their own end-to-end checksumming, catching the silent-corruption class of bugs that a bare fsync cannot.

For a database author, the practical rule is: test on the filesystem your users will run. A storage engine that works perfectly on ext4 can fail on XFS if it relies on undocumented ordering, and vice versa. Postgres, MySQL, and SQLite all have filesystem-specific notes in their documentation for this reason.

SSD internals — why power-loss protection matters

An SSD is not a passive block store. Inside the drive is a general-purpose ARM or RISC-V processor running proprietary firmware that does three hard jobs: wear levelling (spreading writes across the NAND so one cell does not wear out first), garbage collection (consolidating the valid pages in a partially-erased block), and write coalescing (buffering incoming writes in DRAM so it can write them out in large, efficient bursts).

That DRAM — typically 256 MB to 2 GB on a modern consumer SSD — is the source of the durability hazard. If the drive loses power mid-operation, the DRAM contents are gone, unless the drive can keep its controller powered just long enough to finish the in-flight NAND writes and flush its metadata.

Power-loss protection (PLP) is the feature that does this. The drive has onboard super-capacitors that hold a few dozen millijoules — enough to run the controller for 10–50 milliseconds after the 5V rail drops. In that window, the firmware flushes the DRAM cache to NAND and updates the translation table. Drives with PLP can honestly acknowledge fsync the moment data lands in their DRAM, because they are confident they can always finish the journey.

Drives without PLP have to choose: be slow (wait for NAND before acknowledging) or be unsafe (acknowledge from DRAM and pray). Most consumer SSDs pick unsafe-by-default with no way to configure otherwise. Some Samsung, Intel, and Micron enterprise drives advertise PLP prominently; many QLC consumer drives do not have it at any price point.

The practical test: if you plan to trust fsync on a new SSD, search the datasheet for "power loss protection" or "power loss imminent." If the words are not there, assume they are not in the firmware either.

NVMe flush semantics — FUA and the flush command

NVMe — the protocol for modern PCIe SSDs — has two ways to force a write to persistent media.

The FLUSH command. A standalone command that says "flush everything you have been told about." The kernel issues this as part of fsync. Cost: one controller round-trip (microseconds) plus however long the NAND write takes.

FUA — Force Unit Access. A per-write flag that tells the drive this particular write must be on media before you acknowledge it. Cost: the NAND write cost for this write only, nothing else.

On a busy system with many concurrent writes, FUA can be faster than FLUSH, because FLUSH waits for every pending write to drain while FUA only waits for the one you care about. But not every drive honours FUA correctly — some treat it as a no-op, and then O_DSYNC silently degrades to non-durable. The Linux kernel has heuristics to pick between FUA and FLUSH per-device.

You do not normally touch these as an application programmer. They matter when you read database tuning guides that talk about "turn on FUA" or "disable flushes"; now you know which barrier is being tweaked.

Real-world bugs — fsyncgate, the ext4 delayed allocation zeroing, and dirty-zero pages

fsyncgate (2018). Described earlier. fsync could return 0 after silently losing data when a background writeback failed and the kernel cleared the dirty flag. Postgres, MySQL, and many others hit this. Fixed in kernel 4.13+ with errseq tracking, but any code that retries fsync after an error is still playing with fire.

ext4 delayed allocation zeroing (2009). ext4 delays allocating blocks for newly-written files until a flush, for performance. If you write() a lot to a new file and the system crashes before the allocator runs, the file can appear with its correct size but zeroed contents — the length was journalled but the data was not. Fixed by forcing allocation at rename and fsync; the bug made it into the press as "KDE loses config files on crash."

The dirty-zero-page bug (various). Some drives, under specific firmware versions, would return zeros for recently-written sectors that had been flushed successfully — the data was written, then the mapping table update was lost, and the physical block was re-used. Only visible on reboot and only with a specific I/O pattern. Cheap SSDs in 2014–2016 were notorious for this.

Every one of these was invisible to a syscall-level observer until someone actually tested with a power cord. That is the subject of chapter 4.

What about networked filesystems — NFS, SMB, cloud volumes?

Durability on a networked filesystem is an even longer journey: app buffer → kernel → network stack → wire → remote kernel → remote disk controller → remote media. fsync on an NFS client asks the server to commit, but the semantics depend on server configuration (async exports skip the commit entirely). On EBS, Azure Disk, or Google Persistent Disk, fsync semantics are documented by the provider and generally honest, but you pay a network round-trip (hundreds of microseconds) on top of everything else.

Databases running on cloud storage typically redesign their commit strategy around this: very aggressive group commit, often coordinated with replication so that replicated to a quorum is the durability primitive rather than fsynced on one machine. Build 5 will come back to this.

Where this leads next

You now know how to get bytes durably to the media on one machine. The remaining two chapters of Build 1 turn that knowledge into tools.

And by then you will be ready to turn the thirty-line log from chapter 2 into a real, fsync-protected, crash-tested, indexed key-value store — the subject of Build 2.

References

  1. Corbet, Ensuring data reaches disk, LWN.net (2011) — the classic short article on write(), fsync(), and the kernel page cache.
  2. Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 — empirical study of how fragile write()+fsync() is across real filesystems.
  3. POSIX.1-2017, fsync — the specification text the kernel is legally standing on.
  4. Craig Ringer, Anthony Iliopoulos, PostgreSQL's fsync() surprise, LWN.net (2018) — the fsyncgate write-up, with the painful details.
  5. PostgreSQL documentation, Reliability and the Write-Ahead Log — a production database's own notes on disk caches, lying hardware, and synchronous commit.
  6. Zheng et al., Understanding the Robustness of SSDs under Power Fault, FAST 2013 — the paper that measured how often consumer SSDs actually lose data on power cuts. The numbers are not comforting.