In short
The smallest thing that deserves to be called a database is a file you only append to. Every write adds one line — a key, an equals sign, a value, a newline — to the end. Every read starts at the top and scans until the end, keeping the last value it saw for the key you asked about. That is it. Thirty lines of Python. No index, no header, no schema. This store is called an append-only log, and it is crash-safer than the "obvious" design (overwrite records in place) because a half-finished append can only corrupt the tail — never a record that was already written. Reads are O(n) in the file size, so this design will not scale on its own. But every production database in the world — Postgres, SQLite, LevelDB, Kafka, HDFS — has an append-only log sitting underneath, because the log is the only part that can be made durable cheaply. This chapter builds that log from scratch.
Open a terminal. Create a file called mydb.log. Write one line to it: name=dipti. Close the file.
Congratulations, you have a database.
It is a bad database. It has no index, no concurrency, no types, no query language. But it satisfies the minimum definition from chapter 1 of this track: it is persistent, and a program can later come back and ask it what is the name? and get dipti back. Everything a real database adds — fast reads, transactions, schemas, SQL — is a layer of complexity sitting on top of this exact primitive. Understanding the primitive is the first and hardest step, because every instinct you have from classical programming will fight it.
The first instinct, the one you have to unlearn right now, is that when you update a record you overwrite the old value. That is not how durable storage works. That is how you lose data. The rule of every real storage engine is the opposite: you append the new value to the end, and you leave the old value exactly where it is. This chapter is a thirty-line Python program and a set of reasons why.
The simplest possible store
Here is the format. One record per line, key and value separated by =:
name=dipti
age=15
city=coimbatore
age=16
When you want the value of age, you read the file top to bottom and remember the last time you saw age=. In this file, the answer is 16. The earlier line age=15 is still there — you did not delete it, you did not overwrite it. It just no longer matters, because a later line shadowed it.
Why the last-write-wins rule works: the writer only ever adds to the end, so "later in the file" means "written later in time." Scanning top-to-bottom replays history in order. Whichever value you see last for a given key is the one the writer most recently committed.
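The rule fits in a three-line function; a dict stands in for "what the reader remembers" (replay is a name invented here for illustration):

```python
def replay(lines):
    # scan in file order; a later line for the same key shadows the earlier one
    state = {}
    for line in lines:
        k, _, v = line.rstrip("\n").partition("=")
        state[k] = v
    return state

log = ["name=dipti", "age=15", "city=coimbatore", "age=16"]
print(replay(log))  # {'name': 'dipti', 'age': '16', 'city': 'coimbatore'}
```

Replaying the four-line file above yields age = 16, exactly as the scan described in the text.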
That is the whole design. Two operations:
- put(k, v): append one line k=v\n to the end of the file.
- get(k): scan the file; return the value from the last line whose key equals k.
No deletions yet (we will get to tombstones in chapter 5). No types — everything is a string. No crash handling beyond what the filesystem gives you. Just a file, and two rules.
The thirty-line implementation
Open your editor. Type this in. Do not copy-paste — type it. Every line will matter later.
# appendkv.py — a tiny append-only key-value store

class AppendOnlyKV:
    def __init__(self, path):
        self.path = path
        # open for appending in text mode; create if missing
        self._f = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        # one record = one line: key=value\n
        # keys and values must not contain '=' or '\n' (we will fix this later)
        self._f.write(f"{key}={value}\n")
        self._f.flush()  # push Python's buffer into the OS

    def get(self, key):
        # scan the whole file, keep the last value we saw for this key
        latest = None
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("=")
                if k == key:
                    latest = v
        return latest  # None if the key has never been written

    def close(self):
        self._f.close()
Thirty lines including blanks and the comment header. Let us walk through what each part is doing, because every choice here is a simplification of something a real database has to do more carefully.
open(path, "a", ...). The "a" mode is append. Every write goes to the current end of the file, regardless of the file pointer. If two processes open the file in "a" mode on POSIX and write at the same time, their writes do not interleave at a sub-record level on most filesystems — each write() system call is atomic for small buffers. (That is a lucky guarantee you should not rely on in production, but it saves you here.)
f"{key}={value}\n". The record format is human-readable text. key=value, one per line. This is what SQLite's .dump format uses, what .env files use, what most configuration file systems use. It has two properties that matter: (a) newlines separate records, so a reader can scan line-by-line, and (b) you can cat the file and read it. The tradeoff is that = and \n cannot appear inside keys or values without an escape mechanism, which we are ignoring. Real databases use a binary framing format (length-prefixed records) to avoid this; chapter 5 rebuilds this store with proper framing.
self._f.write(...) then self._f.flush(). In Python, write() puts the bytes into Python's internal buffer. flush() pushes them to the operating system. Neither of these puts the bytes on disk — the OS still has them in the kernel page cache. If the power cuts right now, you lose the write, even though Python has returned from put(). This is the single most important fact in database engineering, and chapter 3 is devoted to the missing step: os.fsync(), which tells the OS actually write it to the disk platter, and do not come back until you have. We are deliberately not calling fsync here so you can feel the performance difference later.
Why flush() alone is not durable: write() → Python buffer → OS kernel buffer → disk controller → platter/NAND. flush() moves bytes from the first buffer to the second. There are still two levels of volatile memory between the second buffer and actual persistent storage. A power cut at any of those levels loses data that put() already "returned." Chapter 3 walks this stack in full.
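Chapter 3 owns the details, but for orientation, the missing step is a single line. (durable_put is a name made up for this sketch, not part of our class yet.)

```python
import os

def durable_put(f, key, value):
    f.write(f"{key}={value}\n")
    f.flush()              # userspace buffer -> kernel page cache
    os.fsync(f.fileno())   # kernel page cache -> the storage device; blocks until done

# usage: same call shape as put(), but the record survives a power cut once this returns
f = open("durable.log", "a", encoding="utf-8")
durable_put(f, "name", "dipti")
f.close()
```

The fsync call is what turns "the OS has it" into "the disk has it", and it is also what costs two or more orders of magnitude in write throughput.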
for line in f. Python iterates the file by reading it in chunks and yielding one line at a time. rstrip("\n").partition("=") splits the line at the first =. The read loop keeps overwriting latest every time it finds the key, so after the loop latest holds whichever value came last — the live value.
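partition has two edge behaviors worth knowing now, because both matter later; these are plain assertions you can run as-is:

```python
# partition splits at the FIRST separator and never raises
assert "age=16".partition("=") == ("age", "=", "16")

# extra '=' signs stay in the value, so a value containing '=' actually
# round-trips; it is keys containing '=' that truly break the format
assert "note=a=b".partition("=") == ("note", "=", "a=b")

# a line with no '=' at all (e.g. a torn tail) comes back whole as the
# "key" with an empty value; nothing crashes, the scan just moves on
assert "partialrec".partition("=") == ("partialrec", "", "")
```

The third case is why the recovery experiment later in this chapter works at all.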
Save the file. Try it:
db = AppendOnlyKV("mydb.log")
db.put("name", "dipti")
db.put("age", "15")
db.put("city", "coimbatore")
db.put("age", "16")
print(db.get("age")) # 16
print(db.get("city")) # coimbatore
print(db.get("missing")) # None
db.close()
That is a database. You can kill the Python process, restart it, call get("name"), and dipti will come back. Persistence, the first of the five promises from chapter 1, is satisfied by nothing more than "the file is still there."
Why append-only beats in-place writes
The natural objection hits immediately: this is wasteful. If I change age from 15 to 16, why not just find the old age=15 line and overwrite those bytes in place? The file stays small, and I do not have to scan past dead records.
The answer is crash safety. An in-place overwrite cannot be made atomic on a raw file. An append can, almost for free.
Here is the difference in pictures.
The key asymmetry: when you overwrite in place, the only copy of the data is the one being modified. If the write is interrupted — a power cut, a kernel panic, a kill -9 — you are left with a record that is neither the old value nor the new one. It is garbage.
When you append, the new value goes to unused space at the end of the file. The old record is bystander bytes; the crash cannot reach them. Whatever broken half-written bytes sit at the tail after recovery, you can detect them (their line does not contain =, or the checksum does not match — chapter 5) and throw them away. The previous state of the database is intact.
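A quick simulation makes the asymmetry concrete. demo.log is a throwaway file, and the truncated writes stand in for a crash mid-write:

```python
# Simulate a crash that interrupts a 13-byte write after only 8 bytes.

# Case 1: overwrite in place. The old record is the thing being clobbered.
with open("demo.log", "w", encoding="utf-8") as f:
    f.write("city=coimbatore\n")
with open("demo.log", "r+b") as f:           # open for in-place modification
    f.write(b"city=chennai\n"[:8])           # "crash" after 8 of 13 bytes
after_overwrite = open("demo.log", encoding="utf-8").read()
print(repr(after_overwrite))                 # 'city=chembatore\n'
# garbage: neither the old city nor the new one survives

# Case 2: append. The old record is untouched; only the tail is suspect.
with open("demo.log", "w", encoding="utf-8") as f:
    f.write("city=coimbatore\n")
with open("demo.log", "a", encoding="utf-8") as f:
    f.write("city=chennai\n"[:8])            # "crash" after 8 of 13 bytes
after_append = open("demo.log", encoding="utf-8").read()
print(repr(after_append))                    # 'city=coimbatore\ncity=che'
# old record fully intact; the torn tail (no trailing newline) can be
# detected on recovery and discarded
```

Note that in the append case the torn tail still parses as a plausible record; a checksum (chapter 5) is what makes discarding it reliable rather than heuristic.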
This is the reason the rule exists in almost every storage engine you will meet:
- SQLite writes its rollback journal or WAL file by appending, and only modifies the main database file after the journal is safely on disk.
- Postgres uses a write-ahead log (WAL) — every change is appended to an on-disk log before touching any data page. Chapter 5 discusses this.
- LevelDB, RocksDB, Cassandra are LSM-trees — every write goes to an in-memory buffer backed by an append-only log on disk. The whole design of LSM is "append and merge later."
- Kafka is nothing but an append-only log, exposed to the network.
Everybody appends because appending is the only operation the filesystem is willing to make crash-safe cheaply. You are building the same primitive.
Benchmarking the thing
Talk is talk. Run the store. Here is a ten-line benchmark: write 100,000 records, then read 1,000 keys back, and report the rates.
# bench.py — benchmark the append-only KV store
import os, time, random
from appendkv import AppendOnlyKV

if os.path.exists("bench.log"):
    os.remove("bench.log")
db = AppendOnlyKV("bench.log")
N = 100_000

t0 = time.time()
for i in range(N):
    db.put(f"key{i}", f"value{i}")
write_rate = N / (time.time() - t0)

t0 = time.time()
for _ in range(1_000):
    db.get(f"key{random.randint(0, N - 1)}")
read_rate = 1_000 / (time.time() - t0)

print(f"writes: {write_rate:>10,.0f}/s reads: {read_rate:>6,.0f}/s size: {os.path.getsize('bench.log'):,} bytes")
On a 2024-era laptop you should see numbers in the ballpark of:
writes: 850,000/s reads: 400/s size: 1,977,780 bytes
Stop. Read those two numbers again.
Writes are eight hundred thousand per second. Just appending to a file, flushing Python's buffer, and moving on. No fsync, so these writes are not durable yet — they live in the kernel page cache until the OS decides to flush them — but they are blazingly fast. This is the speed of modern RAM and syscall dispatch, not the speed of disk. Real databases with fsync on every commit drop this number by a factor of 100 or more; the gap is what chapter 3 is about.
Reads are four hundred per second. That is more than three orders of magnitude slower than the writes. Why? Every get() opens the file and scans nearly 2 megabytes of text from the top. A thousand reads means a thousand full scans. Even a tenfold-larger log — 20 MB, still nothing — will cut reads to 40/s. At a gigabyte you are down to sub-one-per-second, and the database is unusable for reads.
This is the first law you discover by building things: raw append is fast to write and horrifying to read. Every database design decision after this chapter is, in some form, a trade that gives up some of the writing speed to buy back some of the reading speed.
The scan cost grows linearly with the log
You can see the O(n) scan cost directly. Add this to the benchmark:
for k in (1_000, 10_000, 100_000, 1_000_000):
    db2 = AppendOnlyKV(f"log_{k}.log")
    for i in range(k):
        db2.put(f"key{i}", f"v{i}")
    t0 = time.time()
    for _ in range(100):
        db2.get(f"key{random.randint(0, k - 1)}")
    print(f"n={k:>9,} reads: {100/(time.time()-t0):>8,.0f}/s")
    db2.close()
Typical output on a modern laptop:
n=    1,000 reads:   35,000/s
n=   10,000 reads:    4,200/s
n=  100,000 reads:      410/s
n=1,000,000 reads:       39/s
Every tenfold increase in the log size multiplies the per-read time by roughly ten. A graph of read rate against log size is a clean straight line on log-log paper with slope -1. That is what O(n) looks like in the wild.
The O(n) problem and the concept of the index
The benchmark is not subtle. A database whose read time grows linearly with the number of writes is not a database you can put behind a website. Imagine Flipkart on a sale day. The product catalogue has 50 million entries. Each page load has to look up a product. If every lookup scans 50 million records, every page takes seconds. The site dies.
The problem is that get() has no idea where in the file key42 lives. It has to look at every line from the top. If you could only remember, for each key, which byte offset in the file its latest value starts at, you could skip straight to that offset, read one line, and be done. One seek, one read, constant time — O(1) instead of O(n).
That table of key → byte-offset is called an index. Build 2 of this track is about building one. Here is the shape of what comes:
# preview: the in-memory hash index that chapter 7 will add
class IndexedKV:
    def __init__(self, path):
        self.path = path
        self._offsets = {}  # key → byte offset in the file
        self._rebuild_index()

    def _rebuild_index(self):
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                k = line.decode().partition("=")[0]
                self._offsets[k] = offset
                offset += len(line)
The index is a Python dictionary, living in memory. On every put() we update _offsets[key] to the current file length. On every get() we look up _offsets[key] and seek() straight to that byte. Reads become O(1) in the size of the file, and cost only one memory lookup plus one disk read.
There are three problems with that preview, and solving them is what the rest of this track is about:
- The index lives only in RAM. If the machine restarts, you rebuild it by scanning the log — the old slow O(n) read, except you only pay it once at startup. Chapter 9 describes how real engines checkpoint the index to disk periodically.
- The index must fit in memory. For a billion keys, that is gigabytes of pointers. Build 4 introduces B+-trees, disk-resident indexes that page themselves in and out.
- The log still grows forever. Even with an index, old age=15 lines sit in the file. Build 2 chapter 12 introduces compaction — merging segments and dropping dead records.
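For a taste of that third item, here is the single-file version of the compaction idea (compact is a name invented here; real engines compact immutable segments in the background, never the live log):

```python
def compact(src_path, dst_path):
    # replay the whole log to find the live value of every key...
    live = {}
    with open(src_path, encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition("=")
            live[k] = v
    # ...then write a fresh log containing only those live records
    with open(dst_path, "w", encoding="utf-8") as f:
        for k, v in live.items():
            f.write(f"{k}={v}\n")
```

Compacting a log that holds both age=15 and age=16 keeps only age=16; the shadowed record is finally gone.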
For now, notice the clean decomposition: the log is the storage; the index is the accelerator. The log is the thing the database commits to disk; the index is a derived, rebuildable structure. Lose the index, and you can rebuild it from the log. Lose the log, and there is nothing to rebuild from.
What you have actually built
The thirty-line class you typed has a name used by every real database engineer. It is a write-ahead log, or WAL.
The WAL is the most fundamental on-disk structure in modern databases. Its job in a real system is a little broader than what our toy does — in Postgres or MySQL, the WAL is where changes are committed first, before the actual data pages get updated, so that after a crash the database can replay the WAL and reconstruct the state. But the shape is identical: an append-only sequence of records, durably on disk, scanned in order on recovery.
When the Postgres source code logs a committed transaction, it is running a fancier version of your put(). When SQLite's WAL mode commits a page change, it is running a fancier version of your put(). When Kafka writes a message to a topic partition, it is literally running your put() (with framing, fsync, and replication). You have not built a toy — you have built the atom.
What you are missing, relative to those real systems:
- Durability — you flushed Python's buffer but did not call fsync. That is chapter 3.
- Crash recovery — when the tail is half-written, you have no way to tell and no way to cleanly truncate. That is chapter 4.
- Indexing — reads are O(n), not O(1). That is Build 2.
- Compaction — dead records pile up forever. That is chapter 5 and Build 2 chapter 12.
- Concurrency — two processes writing at once can interleave. That is Build 5.
- Types and queries — keys and values are strings, nothing else. That is Build 8 onwards.
Every one of those is a layer. The log stays. Every layer sits on top of it.
Recovering from a half-written tail
Append this adversarial test to the bench and see what your store does when a write is interrupted.
# inject a corrupt tail
with open("bench.log", "a") as f:
    f.write("partialrec")  # no = and no newline — a torn write

# now try to read — the scanner needs to survive this
db = AppendOnlyKV("bench.log")
print(db.get("key0"))        # still returns "value0"
print(db.get("partialrec"))  # returns "" — see below
Because partition("=") on a line with no = leaves the whole line as the key and the empty string as the value, the mangled line cannot match any real key, so get("key0") is unaffected. (Asking for the torn key itself, get("partialrec"), returns the empty string rather than a real value, which is one way a recovery pass could spot and discard it.) Your store is accidentally crash-tolerant for that specific failure mode. But it will not notice if the last line reads key42=v42 when the disk actually wrote key42=v41 (silent corruption). That is what checksums are for. Chapter 4 (power-loss testing) shows you how to catch these.
Common confusions
- "Append-only means the file can never shrink." Within a single log, yes — you never delete records from the middle. But real systems use many logs, called segments. When a segment becomes old enough that no live record references it, you can delete the whole segment file. That is called compaction, and Build 2 covers it.
- "If I flush after every write, my data is safe." No — flush() in Python, or fflush() in C, pushes your userspace buffer into the kernel. The kernel still has the data in RAM (the page cache) and may not have written it to disk. You need os.fsync() for durability, and even that is subject to disk-controller lies on consumer hardware. Chapter 3 is about this.
- "Last-write-wins means old writes are wasted." They are not wasted — they form a complete audit trail. For many use cases (event sourcing, accounting, version history) the old records are the value of the database. Kafka treats every old record as equally interesting; a traditional KV store treats only the latest as live. Same substrate, different reading rules.
- "Text format is fine for a real database." It is not. A real WAL uses binary framing: a length prefix, a type byte, a checksum. That way you can store any bytes (including = and \n) and detect corruption. Our text format is a pedagogical choice that makes the file cat-readable; chapter 5 replaces it.
- "Reading the whole file on every get() is obviously terrible." It is, and it will be fixed — with an index. But notice that the "terrible" O(n) scan is also what the recovery path of every real database does at startup, once. The cost you complain about on the hot read path is the cost every database pays on crash recovery. Making it faster there (with checkpoints) is its own chapter.
- "If the power cuts between write() and flush(), how much data do I lose?" Up to Python's buffer size, which is 8192 bytes by default for text files. A single put() with a short value almost never fills the buffer by itself, so without flush() the write can sit in userspace for tens of milliseconds or longer before it even reaches the kernel. This is why every toy KV store example you will see online has a flush() call — and it is why it is not enough.
Going deeper
If you just wanted to understand what an append-only log is, you have it: a file you only add to, scanned top-to-bottom on reads, crash-safer than in-place because partial tails can be discarded. The rest of this section connects it to the real-world designs you will meet in later chapters.
Text framing versus length-prefixed binary framing
The key=value\n format works until a key or value contains = or \n. The classic fix is a length-prefixed binary format:
[ 4-byte key length | 4-byte value length | key bytes | value bytes | 4-byte CRC ]
A record is a fixed 8-byte header (two 4-byte length fields) followed by the key and value bytes, plus a trailing 4-byte checksum. The reader reads the header, knows exactly how many bytes to grab, reads them, and verifies the checksum. Any truncation at the tail produces a header that claims more bytes than the file has — an unambiguous "this record is torn, stop here" signal. This is the shape Bitcask, LevelDB, RocksDB, Kafka, Postgres WAL, and SQLite WAL all use to frame records, each with its own details. Chapter 5 rewrites our store in this format.
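A sketch of that framing in Python; the struct layout and CRC placement follow the diagram above, not any particular engine's exact on-disk format:

```python
import struct, zlib

HEADER = struct.Struct("<II")  # 4-byte key length, 4-byte value length

def encode_record(key: bytes, value: bytes) -> bytes:
    body = HEADER.pack(len(key), len(value)) + key + value
    return body + struct.pack("<I", zlib.crc32(body))  # trailing 4-byte CRC

def decode_record(buf: bytes, pos: int):
    """Return (key, value, next_pos), or None if the record at pos is torn/corrupt."""
    if pos + HEADER.size > len(buf):
        return None                           # truncated header: stop here
    klen, vlen = HEADER.unpack_from(buf, pos)
    end = pos + HEADER.size + klen + vlen + 4
    if end > len(buf):
        return None                           # header claims more bytes than exist
    body = buf[pos:end - 4]
    (crc,) = struct.unpack_from("<I", buf, end - 4)
    if zlib.crc32(body) != crc:
        return None                           # corruption detected by checksum
    key = buf[pos + HEADER.size : pos + HEADER.size + klen]
    value = buf[pos + HEADER.size + klen : end - 4]
    return key, value, end
```

Notice that the two failure modes of the text format fall out for free: a torn tail fails the length check, and a silent bit flip fails the CRC check.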
Why the log is the source of truth
In a modern database, there are two things on disk: the log and the data pages. The log is append-only and authoritative. The data pages are the "current picture" of the database — fast to read, but updated lazily from the log.
When you commit a transaction, the database appends to the log and fsyncs it. That is the commitment point. Only later, when the DB has spare time, does it update the data pages to match. If a crash happens before the pages are updated, startup replays the log against the last good checkpoint of the pages, and the state is reconstructed exactly.
This is called write-ahead logging, and the invariant is the one sentence: no change is considered committed until it is durably in the log. It is why the log is the part of the system that gets the fsync, and the pages are the part that can be rebuilt.
Our store has only the log, no pages. That is the simplest possible instance of the pattern — the pages, so to speak, are reconstructed afresh on every get() by scanning the log. A log without any derived state is the purest form of WAL.
What real systems do about the read-time problem
Three different design families give three different answers:
- Bitcask (the Riak backend) keeps an in-memory hash index mapping every key to its byte offset. Reads are one disk seek. Writes are one append. The whole key set has to fit in memory. Build 2 explains this in full.
- LSM-trees (LevelDB, RocksDB, Cassandra, ScyllaDB) keep recent writes in an in-memory sorted table, flush it periodically to disk as an immutable sorted file (an SSTable), and merge those SSTables in the background. Reads check memory first, then the newest SSTable, then the next, and so on. Build 3 covers LSM in depth.
- B+-trees (Postgres, SQLite, MySQL InnoDB, every traditional RDBMS) keep a disk-resident balanced tree of the keys. Reads traverse the tree, O(log n) disk accesses. Writes update the tree in place — under the protection of a WAL for crash safety. Build 4 covers B-trees.
Each design picks a different point on the write-speed / read-speed / memory-footprint tradeoff triangle. All three are built on top of an append-only log. The log is the floor.
Why fsync-less append is still useful
If un-fsynced writes can be lost on a power cut, what is this toy good for? Two things.
First, in-process caches — a browser history file, a search index, a prompt history file. If the machine loses power, losing the last few seconds of keystrokes is acceptable, and you can save the cost of fsync on every write (which would be brutal for something written a thousand times a second).
Second, pedagogy. By deferring fsync to chapter 3, you get to feel the 100× difference between "fast, not durable" and "slow, durable" in your bones before the syntax arrives. Every database in the world balances on this tradeoff. Make it real before you name it.
What "the same object every DB uses" really means
If you read the source code of SQLite (sqlite3.c, the famous 230,000-line amalgamation) and find the WAL writer, you will see a function that does exactly put() — open the WAL file, write() a record, optionally fsync(), return. Fancier bookkeeping around it, but the hot loop is the same. The same is true of pg_xlog in Postgres, of the commit log in Cassandra, of the segment writer in Kafka. Learn this one loop, read it in every code base for the rest of your career.
Where this leads next
- fsync, write barriers, and what "durable" actually means — chapter 3: the step between "in the kernel" and "on the platter," and what it costs.
- Power-loss testing your own database — chapter 4: how to shoot your process at random moments and verify that the log survives every crash.
- The three walls of append-only logs — chapter 5: the three design pressures (slow reads, torn tails, unbounded growth) that every real storage engine is built to answer.
- What makes a database different from a file — chapter 1: the five promises a database owes its reader, and why this log is the minimum.
References
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — the reference treatment of logs, indexes, and LSM-trees for working engineers. dataintensive.net.
- C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992) — the original write-ahead-logging paper that every modern RDBMS descends from. ACM.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the essay that made "log as a first-class primitive" canonical. engineering.linkedin.com.
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper behind LevelDB, RocksDB, and Cassandra. Springer.
- SQLite, Write-Ahead Logging — the SQLite docs for WAL mode, surprisingly readable. sqlite.org/wal.html.
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Riak, 2010) — the cleanest short paper on log + in-memory index. riak.com/assets/bitcask-intro.pdf.