In short

A file, as far as your operating system is concerned, is a byte stream — a sequence of bytes you can read, write, and seek inside. That is all it promises. A database is built on top of a file (or a few files), but it promises five things a bare file does not: durability (writes survive a crash), atomicity (a write is all-or-nothing), concurrency (many readers and writers without corruption), queries (you can ask where is the row with id 42 instead of scanning), and integrity (the bytes you read back are the bytes you wrote). Every one of those promises has to be built, in code, on top of read, write, and fsync. The rest of this track builds them, one by one.

Here are ten lines of Python that look like they store data:

# save_note.py
def save(note_id: int, text: str) -> None:
    with open("notes.txt", "a") as f:
        f.write(f"{note_id}\t{text}\n")

def load_all() -> dict[int, str]:
    out = {}
    with open("notes.txt") as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            out[int(k)] = v
    return out

Run it. It appears to work. You call save(1, "buy milk"), you call load_all(), you see the note. Ship it.

Now pull the power cord out of the wall one second after the save() call returns. Plug it back in. Run load_all() again. Your note may be there. It may be gone. It may be a half-written garbage string that crashes the loader. Nothing about open(), write(), and close() gave you a promise about what happens if the machine dies. The file system was not lying to you — it just never claimed anything stronger than "if nothing goes wrong, the bytes you wrote will be there when you read again."
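You do not need a power cord to see the failure mode. Here is a hedged simulation of the torn write (the file location and truncation point are illustrative): chop the file mid-record, as a power cut might, and watch the loader fall over.

```python
import os, tempfile

PATH = os.path.join(tempfile.mkdtemp(), "notes.txt")

def save(note_id, text):
    with open(PATH, "a") as f:
        f.write(f"{note_id}\t{text}\n")

def load_all():
    out = {}
    with open(PATH) as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            out[int(k)] = v
    return out

save(1, "buy milk")
save(2, "pay rent")

# Tear the last record: cut the file 10 bytes short, so the final
# line is just "2" — no tab, no text, no newline.
os.truncate(PATH, os.path.getsize(PATH) - 10)

crashed = False
try:
    load_all()
except ValueError:
    crashed = True
print("loader survived:", not crashed)  # loader survived: False
```

A real crash can leave the tear anywhere — inside the key, inside the value, or between records — so a loader that assumes well-formed lines is a loader that will someday refuse to start.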

A database is the software you build when "if nothing goes wrong" is not good enough. This chapter names the five promises a database makes that your notes.txt does not, and shows — with real Python and real syscalls — why getting even the first one right is harder than it looks.

This is chapter 1 of a 5-chapter build. By the end of the build, you will have written a tiny append-only key-value store that survives crashes, and you will know exactly why every piece of it exists. For now, the job is to name the ground we are standing on.

A file is a byte stream — and nothing else

When you call open("notes.txt", "w") on a POSIX system, what you get back is a file descriptor. A file descriptor is a handle into a kernel data structure that points at an inode, which is metadata describing a sequence of bytes somewhere on disk. The operations the kernel lets you perform on that handle are few: read bytes at the current offset, write bytes at the current offset, lseek to move the offset, close to drop the handle, and a handful of metadata and control calls such as fstat, ftruncate, and fsync.

That is the entire interface. The POSIX standard does not promise that the bytes you write are on the physical disk when write returns. It does not promise that two processes write-ing at the same time will not interleave their bytes. It does not promise that if you lose power halfway through a write, the file will be in any particular state. It does not even promise that the bytes you read back next week will be the same bytes you wrote, if the disk media has degraded.
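The whole interface fits in a dozen lines of Python through the os module, one call per operation (the file name here is illustrative):

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.bin")

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)  # open  -> a file descriptor
os.write(fd, b"hello, bytes")                      # write -> into the page cache
os.lseek(fd, 0, os.SEEK_SET)                       # seek  -> move the offset
data = os.read(fd, 5)                              # read  -> the first five bytes
os.fsync(fd)                                       # durability must be asked for
os.close(fd)                                       # close -> drop the handle

print(data)  # b'hello'
```

Everything a database does — logs, indexes, checkpoints, transactions — is ultimately compiled down to sequences of these calls.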

Why the interface is so thin: the kernel is trying to expose a uniform abstraction over wildly different hardware — spinning platters, SSDs, NFS mounts over a flaky network, tmpfs backed by RAM, a USB stick someone will yank out in 2 seconds. A promise the kernel cannot keep on the weakest hardware cannot be part of the interface. So the file interface is the intersection of what every backing store can do. Everything beyond that — durability, ordering, atomicity — has to be asked for explicitly with extra syscalls like fsync.

A file, then, is an unreliable byte sequence. Useful, universal, fast — and a long way from a database.

The five promises

A database is the name we give to a program that turns that unreliable byte sequence into something you can trust your data to. Different databases make these promises in different ways and to different degrees — an embedded SQLite instance and a Spanner deployment spanning three continents do not share an implementation — but they all promise, at minimum, these five things:

[Figure: The five promises of a database. Five labelled boxes in a row, each naming one promise a database makes beyond what a file offers, with the plain-file failure mode below it — Durability (write survives power loss; file: maybe), Atomicity (all or nothing, never half; file: torn writes), Concurrency (safe parallel readers/writers; file: races), Queries (ask by key, not by offset; file: scan only), Integrity (bytes in = bytes out; file: bitrot).]
The five promises. A plain file fails all five in ordinary operation; a database pays the cost in code and in syscalls to keep them.

1. Durability. Once a write is acknowledged, it must survive crashes. If the database tells your code I saved it, then pulling the plug and booting back up must produce that saved value. The word people use for this guarantee is persisted.

2. Atomicity. A logical write either happens entirely or does not happen at all. You never read back half of a transaction. If you are moving 500 rupees from account A to account B and the power dies between the two writes, the database must either leave the money in A or deliver it to B — never make it vanish in transit.

3. Concurrency. Many processes, or many threads inside one process, can read and write at the same time without corrupting each other. A reader in the middle of scanning must not see a half-updated row. Two writers hitting the same key must serialize somehow; one of them must win a defined contest.

4. Queries. You can ask the database for data by some logical predicate — give me the user whose email is a@b.in — and get an answer in less time than scanning the whole file. This is what an index buys you: O(log n) or O(1) access instead of O(n).

5. Integrity. The bytes you read back are the bytes you wrote. If the disk flipped a bit between then and now, the database must detect it (via checksums) rather than silently hand you a corrupt value. The database must also reject writes that violate its stated schema or constraints before they hit disk.
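To make promise 5 concrete, here is a minimal sketch of checksummed records — a CRC-32 prefixed to each line and verified on every read. The record format is an illustration, not the one this track settles on:

```python
import os, tempfile, zlib

PATH = os.path.join(tempfile.mkdtemp(), "notes.crc")

def save(note_id: int, text: str) -> None:
    payload = f"{note_id}\t{text}"
    crc = zlib.crc32(payload.encode())
    with open(PATH, "a") as f:
        f.write(f"{crc:08x}\t{payload}\n")   # checksum first, then the record

def load_all() -> dict[int, str]:
    out = {}
    with open(PATH) as f:
        for line in f:
            crc_hex, payload = line.rstrip("\n").split("\t", 1)
            if zlib.crc32(payload.encode()) != int(crc_hex, 16):
                raise ValueError(f"corrupt record: {payload!r}")  # detect, never return garbage
            k, v = payload.split("\t", 1)
            out[int(k)] = v
    return out

save(1, "buy milk")
assert load_all() == {1: "buy milk"}

# Simulate bitrot: flip one bit inside the stored payload.
data = bytearray(open(PATH, "rb").read())
data[12] ^= 0x01
open(PATH, "wb").write(bytes(data))

detected = False
try:
    load_all()
except ValueError:
    detected = True
print("bitrot detected:", detected)  # bitrot detected: True
```

Without the checksum, the flipped bit would have come back as a silently wrong note — the worst kind of failure, because nothing tells you it happened.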

Your notes.txt implementation above keeps none of these five. It is a file — nothing more and nothing less. The rest of this chapter focuses on the first promise, durability, because it is the one that most surprises people. Atomicity, concurrency, queries, and integrity are the subject of later builds.
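As a taste of what a later build does properly, here is a hedged sketch of the smallest possible index over the notes file: an in-memory dict mapping each key to the byte offset of its record, so a lookup is one seek plus one line read instead of a full scan. Names and format are illustrative, and this toy ignores updates and restarts:

```python
import os, tempfile

PATH = os.path.join(tempfile.mkdtemp(), "notes.log")

index: dict[int, int] = {}           # note_id -> byte offset of its record

def save(note_id: int, text: str) -> None:
    with open(PATH, "ab") as f:
        index[note_id] = f.tell()    # remember where this record starts
        f.write(f"{note_id}\t{text}\n".encode())

def get(note_id: int) -> str:
    with open(PATH, "rb") as f:
        f.seek(index[note_id])       # one O(1) jump instead of an O(n) scan
        line = f.readline().decode()
    _, v = line.rstrip("\n").split("\t", 1)
    return v

save(1, "buy milk")
save(2, "pay rent")
print(get(2))  # pay rent
```

The dict lives only in RAM, so it vanishes on restart — which is exactly why a real index has to be rebuilt from, or persisted alongside, the log.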

Why open() + write() + close() is not durable

Here is the hard thing about durability: on a modern Linux system, a successful return from write() and a clean close() give you almost no guarantee that the bytes are on the disk. Two layers of caching sit between your Python code and the physical storage, and both are lying by default — for good reasons — about where your data is.

[Figure: The write stack between your code and the platter. A vertical stack of five layers, top to bottom: your Python process with an app-level buffer; the write() syscall into the kernel page cache (volatile RAM, lost on power cut); fsync() down through the block layer to the disk controller cache (DRAM on the SSD, still volatile — capacitor or nothing); and finally, past the write barrier, the platter/NAND flash (persistent, survives power loss). write() reaches the page cache; fsync() is what pushes it further — and even then, a lying disk cache can acknowledge before persisting.]
Four layers stand between your write() call and bytes on the physical media. write() only promises to reach the first layer (the kernel page cache). Getting to the last layer needs explicit syscalls — and, on many disks, trust that the controller is not lying.

Layer 1 — the kernel page cache. When you call write(fd, buf), the kernel typically just copies buf into a region of RAM called the page cache and returns success. It does not wait for the disk. It will schedule the actual disk write later, in batches, because disks are roughly 100,000 times slower than RAM and writing immediately on every syscall would cripple your program. This is a feature for throughput, but it is a landmine for durability: if the machine loses power in the next few seconds, everything sitting in the page cache is gone.

The syscall that pushes the page cache to the disk is fsync(fd). It blocks until the kernel has told the disk to persist everything it knows about for that file. You pay for it — a typical fsync on an SSD takes 1 to 10 milliseconds, on a spinning disk 10 to 100 milliseconds. That cost is why nobody fsyncs after every write by default; the trade-off between throughput and durability is the central design question of every storage engine you will ever meet.
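You can measure that cost yourself with a rough microbenchmark. The absolute numbers depend entirely on your disk and filesystem — on tmpfs the gap nearly disappears, on a hard drive it is enormous — so treat this as a sketch, not a claim:

```python
import os, tempfile, time

path = os.path.join(tempfile.mkdtemp(), "bench.log")
N = 200

def bench(do_fsync: bool) -> float:
    """Append N small records, optionally fsyncing after each one."""
    start = time.perf_counter()
    with open(path, "ab") as f:
        for _ in range(N):
            f.write(b"x" * 64)
            f.flush()                      # user buffer -> page cache
            if do_fsync:
                os.fsync(f.fileno())       # page cache -> disk, per write
    return time.perf_counter() - start

fast = bench(False)
durable = bench(True)
print(f"no fsync:   {fast * 1e3:.1f} ms for {N} writes")
print(f"fsync each: {durable * 1e3:.1f} ms for {N} writes")
```

On most real disks the second number is orders of magnitude larger — which is the throughput-versus-durability trade-off in a single pair of lines.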

Layer 2 — the disk controller cache. Even after fsync returns, your bytes may not be on the platter. Modern SSDs have a volatile DRAM cache on the device itself, where they buffer writes for performance. If the power dies while data sits in that cache, it is gone — unless the disk has a super-capacitor (a "power-loss-protected" cache) that holds enough charge to flush the cache on its way down. Consumer SSDs mostly do not. Enterprise SSDs mostly do.

Worse, some disks lie. They acknowledge a flush instantly, without actually pushing the cache to persistent storage, because a lying disk benchmarks faster than an honest one. Database engineers have spent the last two decades learning to distrust the stack underneath them. Chapter 4 of this build walks you through actually power-testing your own disk; chapter 3 explains the exact set of syscalls (fsync, fdatasync, O_DSYNC, O_DIRECT, write barriers) that let you control the layers above.

Here is what all of this means for the notes.txt example above. Let's look at two versions of the same save:

The fsync makes all the difference

# version A — probably loses data on a crash
def save_fast(note_id, text):
    with open("notes.txt", "a") as f:
        f.write(f"{note_id}\t{text}\n")
    # f.close() ran when the 'with' block exited
    # but the bytes may still be in the kernel page cache

# version B — durable by the time it returns
import os

def save_durable(note_id, text):
    with open("notes.txt", "a") as f:
        f.write(f"{note_id}\t{text}\n")
        f.flush()              # push Python's buffer into the kernel
        os.fsync(f.fileno())   # push kernel page cache to the disk

What each step does.

  1. f.write(...) copies bytes into Python's in-process buffer (the io module's user-space buffer).
  2. f.flush() copies Python's buffer down into the kernel's page cache using write(2). After this, your process has no copy — but the data is still just in kernel RAM.
  3. os.fsync(...) asks the kernel to issue a barrier and wait until the disk says the file's data is persisted. After this, barring a lying cache, a power cut will not lose your write.
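You can watch the boundary in step 2 directly: until flush(), the bytes sit in Python's private buffer and no other reader sees them; after flush(), they are in the kernel page cache, which every process shares — visible to all, durable to none. A small sketch (the file location is illustrative):

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

f = open(path, "w")
f.write("1\tbuy milk\n")        # step 1: bytes in Python's buffer only
before = open(path).read()      # a second reader sees nothing yet
f.flush()                       # step 2: write(2) pushes to the page cache
after = open(path).read()       # now every reader sees it — but it is
f.close()                       #   still only in kernel RAM, not on disk

print(repr(before), repr(after))  # '' '1\tbuy milk\n'
```

That "visible but not durable" state is exactly where version A leaves your note when the power goes out.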

What close() alone does not do. Closing the file descriptor does a flush for you (copies Python's buffer into the kernel) but it does not issue an fsync. The POSIX standard explicitly allows close() to return before data hits the disk. So version A is racy: most of the time the data survives, but the one time in a thousand that the power cuts in the next few seconds, it does not.

What version B still does not promise. Even version B is at the mercy of the disk controller. If the SSD has a volatile cache and no power-loss protection and is willing to lie about flushes, version B can still lose data. This is not paranoia — database vendors have reproduced it. The defence is to buy hardware with honest flush semantics, or to turn on write barriers and test.

The gap between version A and version B is a single syscall, but it is the gap that separates a file from a database. Every production database calls some variant of fsync at carefully chosen points — after a transaction commits, after a checkpoint, after a log record is written — and not between them. Getting those points right is most of what a storage engine does.

The minimum bar — and why a CSV is not a database

So what is the minimum a system must do to earn the name database? There is no standards body and no certifying authority, but a practical working definition is: a database is a system that gives at least durability plus one or more of atomicity, concurrency, queries, and integrity, and that exposes this through an interface more abstract than raw file offsets. By that bar:

  - notes.txt, or any CSV written with plain open/write/close, keeps none of the five promises — it is a file, not a database.
  - the same file with an fsync after every write is durable, but still offset-addressed and scan-only — not yet a database.
  - the append-only key-value store you will build in this track is durable and queryable by key — small, slow, but a legitimate database by this definition.
  - SQLite, Postgres, Spanner keep all five promises, at very different cost and scale.

The whole arc of this subject is the arc between the first bullet and the last. Every chapter adds a piece — a log, an index, a lock manager, a checksum, a schema — and each piece exists to turn one broken promise into a kept one.

Why the word "database" is not a sharp boundary: the promises are on a spectrum. SQLite gives you durability by default. Redis by default only snapshots to disk periodically — a crash can lose the last few seconds. Both call themselves databases, and both are right. What matters is knowing which promise you are actually getting, which is why the rest of this track spends so much time on the machinery underneath.

Common confusions

  1. "flush() and fsync() are the same thing." They are not: flush() moves bytes from your process's buffer into the kernel's page cache; fsync() moves them from the page cache toward the disk.
  2. "close() makes the write durable." It does not: close() implies a flush but never an fsync, and POSIX explicitly allows close() to return before data reaches the disk.
  3. "After fsync() the bytes are guaranteed to be on the platter." Only if the disk controller honours the flush; a volatile device cache without power-loss protection can still lose them.

Going deeper

If you just wanted to know what makes a database different from a file, you have it — five promises, a cache hierarchy, and a single syscall called fsync that turns the first one from a wish into a guarantee. The rest of this section is for readers who want to connect this to the ACID vocabulary, the exact POSIX text, and the performance cost of keeping each promise.

ACID, BASE, and the five-promise taxonomy

The classical database textbook vocabulary is ACID — Atomicity, Consistency, Isolation, Durability. The five-promise decomposition in this chapter maps onto ACID like this: Durability is D directly, Atomicity is A, Concurrency is I (isolation — what concurrent transactions see of each other) plus part of A, Integrity is C (consistency — schema and invariants) plus the separate concern of data not rotting on disk, and Queries is outside ACID entirely (ACID is about transactional safety, not access time).

There is also BASE — Basically Available, Soft state, Eventual consistency — which is the vocabulary of systems that consciously weaken some ACID property to gain availability or scale. A BASE system is still a database; it just tells you loudly which promise it is not keeping.

Picking a taxonomy is a choice. This track uses the five-promise version because it tracks the machinery you actually write in code: a write-ahead log is how you get durability; a lock manager or MVCC engine is how you get concurrency; a B-tree is how you get queries. ACID is the contract; the five promises are the levers.

What POSIX actually says about write and fsync

The POSIX specification for write(2) is stricter than people assume in one way and looser than they assume in another. For pipes and FIFOs, it guarantees that a single write() of up to PIPE_BUF bytes is atomic relative to other writers — two concurrent writes will not interleave their bytes inside that limit — and for regular files, its section on thread interactions with regular file operations requires concurrent read() and write() calls to be atomic with respect to each other. Those are concurrency guarantees, not durability ones. The specification says nothing about whether the bytes are on disk, nothing about what happens on a crash, and on many filesystems (especially network filesystems) even the atomicity guarantees are weakened in practice.

The POSIX specification for fsync(2) says it shall not return "until the system has completed that action" (transfer the file's modified data to the storage device). That sentence is the entire legal basis for durability on POSIX. It says nothing about disk-controller caches below the kernel, and it explicitly allows filesystems to interpret "transfer" in ways the original authors of POSIX did not anticipate. Linux's fsync does include a cache-flush command to the storage device by default, but this was not always true, and it can be disabled by mount options (nobarrier) that some users still turn on for performance.

The exact gap between what POSIX says and what the hardware does is the subject of Chapter 3. For this chapter, the rule of thumb is: if you did not call fsync explicitly, assume the bytes are not on disk.

The cost of fsync, concretely

On a typical consumer NVMe SSD in 2026, an fsync to a small file takes about 50 microseconds to 2 milliseconds depending on queue depth and drive. On a rotational disk it takes 5 to 15 milliseconds (one rotation of the platter at 7200 rpm is 8.3 milliseconds). On a networked filesystem it can take 50 to 200 milliseconds.

A naive implementation that fsyncs after every individual row insert therefore caps at a few thousand inserts per second on SSD and a few hundred on a hard drive. Real databases bulk-commit: they batch many logical writes into one durable write, amortising the fsync across a group. This is called group commit, and it is the reason a Postgres or MySQL instance can do tens of thousands of transactions per second on the same hardware that caps your naive loop at one thousand. You will build this batching in Chapter 3.
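Here is a hedged sketch of the batching idea. The class and parameter names are made up for illustration, and real group commit coordinates concurrent transactions, which this single-threaded toy does not:

```python
import os, tempfile

class GroupCommitLog:
    """Append-only log that fsyncs once per batch, not once per record."""

    def __init__(self, path: str, batch_size: int = 100):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = 0
        self.fsyncs = 0

    def append(self, record: bytes) -> None:
        self.f.write(record + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.commit()

    def commit(self) -> None:
        if self.pending == 0:
            return
        self.f.flush()
        os.fsync(self.f.fileno())   # one barrier covers the whole batch
        self.fsyncs += 1
        self.pending = 0

path = os.path.join(tempfile.mkdtemp(), "wal.log")
log = GroupCommitLog(path, batch_size=100)
for i in range(1000):
    log.append(f"row {i}".encode())
log.commit()          # flush any tail that did not fill a batch
log.f.close()

print(log.fsyncs)     # 10 — a thousand writes, ten fsyncs
```

The catch, of course, is that a record is not durable until its batch commits — which is why real engines let a transaction block on its batch's fsync before acknowledging.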

The edge of the abstraction — what even a perfect database cannot promise

No amount of fsync will protect you against a failure of the storage media itself. If the disk physically fails, your data is gone. Databases handle this by replicating — copying the data to another disk, another machine, another data centre. Replication is not durability; it is durability's insurance policy. A database that replicates can survive one-disk failures, but it introduces a whole new problem: the copies might disagree. That problem, called consistency in the CAP-theorem sense, is the subject of much later parts of the track (Builds 4 and 5).

For Build 1, we stay on a single machine and single disk, and we focus on getting the five promises right for one copy of the data. If you cannot get durability right for one disk, replicating the data only replicates the bug.

Where this leads next

This chapter named the five promises and showed why even durability — the simplest of them — is not free on top of a plain file. The next four chapters of Build 1 turn that naming into code: the append-only log itself, the exact syscalls (fsync, fdatasync, O_DSYNC, O_DIRECT) that make it durable, and the fault tests — including power-testing your own disk — that prove it.

After Build 1, you will have a durable, crash-safe, fault-tested append-only log. It will be slow, it will have no index, and it will be a legitimate database by the definition this chapter set. Build 2 adds the index; Build 3 adds atomicity; Build 4 and beyond add queries, concurrency, integrity, and — eventually — distribution.

References

  1. Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases (2007) — the canonical modern reference for what a database system actually contains.
  2. Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 — the paper that empirically showed how fragile write() + fsync() is across real filesystems.
  3. POSIX.1-2017, fsync — the specification text that durability code is legally standing on.
  4. Kleppmann, Designing Data-Intensive Applications (2017), Ch. 3 — plain-English treatment of storage engines and the log-structured vs page-oriented split.
  5. Corbet, Ensuring data reaches disk, LWN.net — the classic short article on why close() is not enough and what the Linux kernel actually does on fsync.
  6. PostgreSQL documentation, Reliability and the Write-Ahead Log — a production database's own notes on disk caches, lying hardware, and the cost of getting durability right.