In short

Your table is a sequence of 8 KiB slotted pages. A new 240-byte row arrives — which page holds it? Scanning every page is O(number-of-pages), and a 10 GB table is 1.25 million pages. You need an index on free space itself: a Free Space Map (FSM) that answers "give me a page with ≥ N bytes free" in O(1) or O(log N). Three shapes. Bitmap: one bit per page, cheap but imprecise — you still read the candidate page to confirm exact free space. Linked list: each page's header points to the next page with room; inserts walk until a fit, but every unlink is a write. FSM tree (Postgres's choice): leaves hold per-page free-byte counts (quantised into 256 buckets), interior nodes hold the max of children, lookup descends top-down in O(log N). Independent of the FSM is the fragmentation it lets slip through. Internal fragmentation is wasted space inside a page — gaps between live records where deleted ones used to live — fixable by in-place page compaction. External fragmentation is pages that are themselves half-empty — fixable only by rewriting the file (VACUUM FULL) or moving records across pages (online repack). The Postgres default is lazy VACUUM: walk the table, reclaim dead-row space in place, update the FSM; does not recover external fragmentation but keeps things from getting worse. The designer's last fight is FSM accuracy vs bookkeeping cost: every FSM update writes a byte somewhere and possibly a WAL record. Postgres keeps the FSM as a hint-only structure (no WAL) quantised into 32-byte buckets (updates only on bucket-crossing); InnoDB tracks free space at extent granularity (one bitmap per 1 MiB); SQLite barely tracks at all and relies on periodic full-file VACUUM. This chapter builds all three shapes in Python and measures the fragmentation each one tolerates.

You have just built slotted pages — 8 KiB blocks that hold variable-length records with a slot array at the top and a record heap at the bottom. You have a buffer pool that caches the hot pages (previous chapter). Now you need to answer the question every insert asks: in which page do I put this 240-byte row?

The wrong answer is "scan every page of the table until you find one with room." That is O(total-table-size) per insert. At 1.25 million pages for a 10 GB table, an insert that should cost one disk write costs you a million disk reads first.

The right answer is: maintain an index on free space. Alongside the data file, keep a data structure that, given a required byte count, returns a page with at least that much room — in O(1) or O(log N). That index is the Free Space Map, FSM for short. This chapter builds three shapes of FSM, looks at the fragmentation patterns each one tolerates, and finishes with what real databases do.

Why free-space tracking needs its own data structure

Recall the shape of a slotted page. A 16-byte header. A slot array growing downward from just after the header. A record heap growing upward from the end of the page. In between them, free space. The size of that free space is what you need to know: it is free_end - slot_end, the difference between two header fields, each typically a two-byte integer.
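As a concrete sketch, the free-space count can be read straight off the header, assuming a layout that stores slot_end and free_end as two-byte little-endian fields. The offsets 4 and 6 here are illustrative, not the chapter's canonical layout:

```python
# Read a page's free-byte count from its header (offsets are assumptions).
import struct

SLOT_END_OFF, FREE_END_OFF = 4, 6   # assumed 2-byte little-endian fields

def page_free_bytes(page: bytes) -> int:
    slot_end = struct.unpack_from("<H", page, SLOT_END_OFF)[0]
    free_end = struct.unpack_from("<H", page, FREE_END_OFF)[0]
    return free_end - slot_end
```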

[Figure: Three shapes of the free-space map. Left, a bitmap: one bit per page marking "has space" yes or no, compact but imprecise. Middle, a linked list: each page's header contains a next-free-page pointer that inserts walk. Right, an FSM tree: leaves hold per-page free-byte counts, interior nodes hold the max of children, and a find-space query traverses top-down.]
Three ways to answer "which page has enough free space for this record?". The bitmap is cheap but imprecise — you still have to read the candidate page to see the exact free bytes. The linked list threads free pages together but serialises inserts and rewrites pages on every unlink. The FSM tree holds per-page free-byte counts at the leaves and the max-of-children at each interior node, giving a precise O(log N) answer to "any page with at least N bytes free" in a few pages of overhead per gigabyte of table.

Why not just use the slotted-page header directly? After all, every page already stores its own free-space count. The answer is locality. To find a page with enough room by reading page headers, you have to read the pages themselves — one 8 KiB disk read per page checked. The FSM is a separate, tiny file that fits in cache; one read of the FSM tells you which page to read. The FSM is to free-space lookup what the B+ tree is to key lookup: an index that turns a linear scan into a logarithmic probe.

Shape 1 — bitmap

The simplest FSM is one bit per page: 1 if the page has "enough" room for a typical insert, 0 if it does not. You have to pick "enough" at design time — say, 256 bytes, or 1/32 of the page. Any page with at least that much free becomes a 1; any page below becomes a 0.

A 1 GiB table is 131,072 pages, so the bitmap is 131,072 bits = 16 KiB. That fits in one buffer pool frame. Scanning it for a set bit word by word, ruling out 64 pages per comparison, takes microseconds. The whole structure is absurdly cheap.

The catch is precision. "Has 256 B free" is a weaker guarantee than "has 512 B free", and if your new record is 512 bytes, you can only trust the bitmap to narrow the candidates — you still have to read the candidate page to confirm. On a workload of uniformly-sized records whose size matches the threshold, the check always succeeds and the bitmap is essentially perfect. On a workload with mixed record sizes it becomes a filter: the bitmap says "this page might work"; the page header says "yes" or "no"; on no, you move to the next candidate.

The imprecision shows up in two ways. False positives (rare): the bitmap says "free" for a page whose free space is below your record's exact size. You pay one wasted disk read before moving on. Hysteresis: when does a page's bit flip? If you flip on every insert and delete, you thrash the bitmap on a write-heavy workload. Real bitmap FSMs use a higher threshold to flip from 1 to 0 than from 0 to 1, so the bit does not oscillate.
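To make the filter behaviour and the hysteresis concrete, here is a toy bitmap FSM. The class, the field names, and the two thresholds (512 to set, 256 to clear) are illustrative choices, not drawn from any particular engine:

```python
# bitmap_fsm.py -- a toy bitmap FSM with hysteresis (illustrative thresholds).

SET_AT = 512     # flip 0 -> 1 only when free space rises to >= 512 bytes
CLEAR_AT = 256   # flip 1 -> 0 only when free space falls below 256 bytes

class BitmapFSM:
    def __init__(self, n_pages: int):
        self.bits = bytearray((n_pages + 7) // 8)
        self.n_pages = n_pages

    def update(self, p: int, free_bytes: int) -> None:
        # Hysteresis: different thresholds for setting and clearing,
        # so a write-heavy page does not make its bit oscillate.
        if free_bytes >= SET_AT:
            self.bits[p >> 3] |= 1 << (p & 7)
        elif free_bytes < CLEAR_AT:
            self.bits[p >> 3] &= ~(1 << (p & 7)) & 0xFF

    def candidates(self):
        """Yield pages whose bit is set. The caller must still read each
        candidate page to confirm the exact free byte count."""
        for word_idx, word in enumerate(self.bits):
            if not word:                  # rule out 8 pages per comparison
                continue
            for bit in range(8):
                if word & (1 << bit):
                    p = word_idx * 8 + bit
                    if p < self.n_pages:
                        yield p
```

A page whose free space sits between the two thresholds keeps whatever bit it already has; that dead zone is what stops the thrashing.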

Shape 2 — linked list of pages with free space

The second shape threads a linked list through the pages themselves. Reserve four bytes of every page header for next_free_page. Maintain a global free_head pointer (stored in the file's metadata page). When a page has free space, it is on the list; when it fills up, it comes off.

free_head = 3
page 3 header: next_free = 7
page 7 header: next_free = 12
page 12 header: next_free = NULL

To insert: start at free_head, follow the next_free pointers until you find a page with enough room for your record. Insert there. If the insert filled the page, splice it out of the list.
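The walk-and-splice above can be sketched in a few lines, modelling each page header as a dict. The field names and the MIN_USEFUL threshold are illustrative, not taken from any particular engine:

```python
# linked_fsm.py -- a sketch of the linked-list shape.

NULL = -1
MIN_USEFUL = 64    # below this much free, treat the page as full

def insert_record(pages: list[dict], meta: dict, rec_len: int) -> int:
    """Walk the free list from meta['free_head']; return the page id used."""
    prev, cur = NULL, meta["free_head"]
    while cur != NULL:
        page = pages[cur]
        if page["free_bytes"] >= rec_len:
            page["free_bytes"] -= rec_len
            if page["free_bytes"] < MIN_USEFUL:
                # Splice the now-full page out: a write to the predecessor
                # (or the metadata page) *and* to the removed page.
                if prev == NULL:
                    meta["free_head"] = page["next_free"]
                else:
                    pages[prev]["next_free"] = page["next_free"]
                page["next_free"] = NULL
            return cur
        prev, cur = cur, page["next_free"]
    raise RuntimeError("no page on the free list fits; extend the file")
```

Note the two page writes on every unlink; in a real engine each is a dirtied buffer and, typically, a WAL record.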

The appeal is that the FSM is free — it lives in the page headers themselves, four bytes each. No external file. No separate buffer-pool pressure. The cost is concurrency and write amplification. Every unlink is a write to two pages (the removed one and its predecessor on the list). Every insert that happens to land near the head serialises with every other insert near the head. And a long list with many "almost-full" pages takes many hops before you find a fit.

Linked-list FSMs show up mostly in older or simpler engines. SQLite uses a variant for its freelist of deleted pages (we will see this in a later chapter). Modern engines for large tables prefer the tree.

Shape 3 — FSM tree (what Postgres uses)

The FSM tree is a balanced tree whose leaves are per-page free-byte counts and whose interior nodes are the maximum of their children's values. To find a page with at least N bytes free, start at the root: if the root's value is less than N, no page in the table has room, give up. Otherwise, descend: at each interior node, pick a child whose value is ≥ N and recurse. You arrive at a leaf whose value is ≥ N in O(log P) steps, where P is the number of pages. That leaf is the page.

In Postgres the FSM is a 3-level tree stored in its own file (a suffix of _fsm on the relation's file path). The per-page free-byte count is quantised to one byte — 256 buckets — because exact bytes are not worth the storage cost. A 1 TiB table has 134 M pages; the FSM is ~256 MiB — well under 0.1% of the table. Each FSM page is itself 8 KiB and holds a small subtree of counts. Updates are local: an insert that changes one page's free-byte count requires updating one FSM leaf and possibly one or two interior nodes.

The invariant is: every interior node's stored value is the max of its children. Maintaining it on update is easy — after setting a leaf, walk up the path, update each ancestor to the max of its children, stop when an ancestor does not change. The walk is bounded by the tree depth, which is a tiny constant.

# fsm_tree.py — a toy FSM tree with max-of-children aggregation.
#
# We model the tree as a flat array with 1-based indexing: parent of i is i//2,
# children are 2i and 2i+1. Leaves are per-page free-byte counts (quantised to
# 256 buckets: value = free_bytes // 32, so values range 0..255 for an 8 KiB page).

import math

PAGE_SIZE = 8192
BUCKET = PAGE_SIZE // 256    # 32 bytes per unit

class FSMTree:
    def __init__(self, n_pages: int):
        # Tree needs a power-of-two number of leaves for simplicity.
        self.leaf_count = 1 << math.ceil(math.log2(max(n_pages, 1)))
        self.n_pages = n_pages
        self.tree = bytearray(2 * self.leaf_count)   # index 0 unused

    def _leaf_index(self, page_id: int) -> int:
        return self.leaf_count + page_id

    def update(self, page_id: int, free_bytes: int) -> None:
        """Record that page_id now has `free_bytes` free. O(log N)."""
        bucket = min(free_bytes // BUCKET, 255)
        i = self._leaf_index(page_id)
        self.tree[i] = bucket
        # Walk up, updating each ancestor to max(children).
        i //= 2
        while i >= 1:
            left, right = self.tree[2*i], self.tree[2*i + 1]
            new = max(left, right)
            if self.tree[i] == new:
                break                 # nothing changed; stop
            self.tree[i] = new
            i //= 2

    def find_page(self, need_bytes: int) -> int | None:
        """Return a page_id whose free space ≥ need_bytes, or None. O(log N)."""
        need = math.ceil(need_bytes / BUCKET)
        if self.tree[1] < need:
            return None              # no page in the whole table has room
        i = 1
        while i < self.leaf_count:
            # Prefer the left child if it satisfies; otherwise right.
            if self.tree[2*i] >= need:
                i = 2*i
            else:
                i = 2*i + 1
        page_id = i - self.leaf_count
        if page_id >= self.n_pages:
            return None
        return page_id

    def extend(self) -> int:
        """Allocate a fresh empty page at the end of the table. O(log N)."""
        new_page = self.n_pages
        self.n_pages += 1
        if new_page >= self.leaf_count:
            raise RuntimeError("tree full — grow the backing array and rebuild")
        self.update(new_page, PAGE_SIZE)
        return new_page

A tiny demo run:

>>> fsm = FSMTree(n_pages=4)
>>> for p in range(4):
...     fsm.update(p, 8192)             # all pages fully free
>>> fsm.find_page(240)
0                                        # first page with room
>>> fsm.update(0, 100)                   # page 0 almost full
>>> fsm.find_page(240)
1                                        # now page 1
>>> fsm.update(1, 1800); fsm.update(2, 1800); fsm.update(3, 600)
>>> fsm.find_page(3200)                  # no page has that much → None
>>> fsm.find_page(1792)                  # bucket-aligned request
1
>>> fsm.find_page(1800)                  # rounds up past page 1's bucket → None
>>> fsm.find_page(500)
1                                        # greedy: picks first fit

The whole FSM costs 2 bytes per page (one leaf byte, roughly one interior byte amortised) — about 0.025% of the table's size.

Internal fragmentation — the slot/heap gap

Even with a perfect FSM, space inside a page can be wasted. Picture a slotted page after a sequence of inserts, updates, and deletes. The slot array has grown downward to slot 180. The record heap has grown upward and occupies bytes 3,600 to 8,176. Free space in the middle: 2,860 bytes.

But look at the slot array. Slots 7, 14, 22, 31, ... are marked deleted. Their records have been purged from the heap, leaving gaps inside the heap — unreachable holes because the slot no longer points to them. The free-space count in the header says 2,860, but in fact there are an additional 1,200 bytes scattered through the heap in tombstone-shaped gaps that no new record can use without a page reorganisation.

This is internal fragmentation: space inside a page that is accounted for as used, or as free-but-unusable, depending on your bookkeeping. The fix is local: when inserting a new record fails because the contiguous free space is not enough even though the total free bytes are, the page's maintenance routine compacts the heap — walks every live slot, copies its record into a new position contiguous with the others, rewrites the slot pointers, and now the entire gap area is contiguous free space again.

def compact_page(page: bytearray) -> None:
    """In-place heap compaction. Called when contiguous free < total free.
    read_slot_array / write_header / write_slot_array are the slotted-page
    chapter's helpers."""
    slots = read_slot_array(page)
    # Walk live slots; copy their records into a fresh buffer packed from the end.
    new_heap = bytearray(PAGE_SIZE)
    free_end = PAGE_SIZE
    for slot in slots:
        if slot.deleted:
            continue
        rec = page[slot.offset : slot.offset + slot.length]
        free_end -= slot.length
        new_heap[free_end : free_end + slot.length] = rec
        slot.offset = free_end
    # Write back: header, slots (with patched offsets), compacted heap.
    write_header(page, free_end=free_end)
    write_slot_array(page, slots)
    page[free_end:] = new_heap[free_end:]

Compaction is cheap — one page, one pass — but it rewrites the page, which means a WAL record and a buffer-pool write. You want to do it only when necessary. The usual trigger is exactly what the paragraph above described: the insert asks for N contiguous bytes, the page's contiguous free is less than N, but total free (contiguous + gap) is greater than N, so compaction will make the insert succeed.
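That trigger is a pure function of two counters the page header already tracks. A sketch of the decision, where SLOT_SIZE (four bytes here) is an assumed cost for the new slot-array entry:

```python
# The compaction decision on the insert path.

SLOT_SIZE = 4   # assumed per-record slot-array cost

def placement(contiguous_free: int, total_free: int, rec_len: int) -> str:
    """Decide how a page handles an insert of `rec_len` bytes."""
    need = rec_len + SLOT_SIZE        # record bytes plus its new slot entry
    if contiguous_free >= need:
        return "insert"               # fits in the slot/heap gap as-is
    if total_free >= need:
        return "compact-then-insert"  # tombstone gaps cover it; one pass fixes
    return "page-full"                # go back to the FSM for another page
```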

External fragmentation — pages that are each half-empty

The dual of internal fragmentation is external fragmentation: pages that are each partially used, such that if you repacked them you could fit the same data into fewer pages. The FSM tracks every page's free bytes, so it knows each half-full page exists. It just cannot do anything about them without moving records across page boundaries — which is expensive because moving a record changes its address, and every index in the table (primary key, secondary indexes) holds that address.

External fragmentation grows over time through normal use. Delete 10% of the rows in a table, uniformly distributed: every page is now 90% full. You have lost 10% of your capacity to fragmentation. Delete another 10%: every page is 81% full. And so on. Inserts land in whichever pages the FSM points to; they refill the gaps but rarely perfectly, because new records are different sizes from the deleted ones.
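The decay is geometric, and two lines make the arithmetic concrete: after k rounds of uniform 10% churn the average fill is 0.9^k.

```python
# Average fill after repeated rounds of uniform churn decays geometrically.

def rounds_until(fill_target: float, churn: float = 0.10) -> int:
    """Rounds of uniform churn before average fill drops below the target."""
    fill, k = 1.0, 0
    while fill >= fill_target:
        fill *= 1.0 - churn
        k += 1
    return k
```

Seven rounds of 10% churn are enough to waste half the table's capacity.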

Remedies fall in two camps.

Compaction / VACUUM FULL: rewrite the whole table to a fresh file, packing records densely, then atomically replace the old file. Costs 1× the table's size in writes, plus a long lock. Recovers 100% of the wasted space. Postgres's VACUUM FULL does this; SQLite's VACUUM does this. It is the nuclear option: it works, but you avoid running it more than once a quarter on a big table.

Online reclamation / lazy VACUUM: walk the table page by page, on each page compact in place (reclaim deleted slots) and update the FSM. Do not move records between pages, so the table still has external fragmentation. But every page is now internally compact, so new inserts can refill the gaps within each page, gradually recovering the wasted space over time. Postgres's default VACUUM (and autovacuum) does this. It is the maintenance option: it runs every few minutes, it does not lock the table, it does not recover everything but it recovers enough to keep bloat under control.
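The core loop can be sketched over an abstract page store. The (live, dead) byte-pair per page is a stand-in for real page contents, and fsm_update stands in for the FSM resync; both are modelling assumptions:

```python
# Lazy-VACUUM core loop: compact each page in place, resync its FSM hint.

PAGE_SIZE = 8192

def lazy_vacuum(pages: dict[int, tuple[int, int]], fsm_update) -> int:
    """pages maps page_id -> (live_bytes, dead_bytes).
    Reclaim dead bytes page by page; return total bytes reclaimed."""
    reclaimed = 0
    for page_id, (live, dead) in pages.items():
        pages[page_id] = (live, 0)             # in-place compaction
        fsm_update(page_id, PAGE_SIZE - live)  # resync the FSM hint
        reclaimed += dead
    return reclaimed
```

No record crosses a page boundary, so the page count (and the external fragmentation) is unchanged; only the per-page gaps become usable again.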

FSM trade-offs: accuracy versus bookkeeping cost

The FSM is an index, and every index has the same core trade-off: more accurate tracking means more updates, and every update costs I/O. Let us enumerate the axes.

Quantisation granularity. A tree FSM can store free bytes exactly (2 bytes per page) or quantised (1 byte per page, 256 buckets of 32 bytes each). Exact is 2× the size. Quantised loses precision in the conservative direction: the leaf stores the bucket floor while find_page rounds the request up, so find_page(241) will skip a page with 255 bytes free (its bucket, 7, only guarantees ≥ 224 bytes) and fall through to a roomier page or a fresh one. In practice, the quantised version is standard: the loss is bounded at 32 bytes per page — under 0.4% of an 8 KiB page — which is much less than the overall fragmentation the FSM is tracking.

Update frequency. Every insert changes a page's free-byte count. Does the FSM update on every insert, or only when a bucket boundary is crossed? Bucket-crossing-only is cheaper: an insert that leaves the bucket unchanged skips the FSM write entirely, and even a crossing usually stops at the leaf because the ancestors' maxima rarely change. This is what Postgres does; in exchange the FSM's values are slightly stale (off by up to 32 bytes), but the staleness is bounded and harmless because the caller verifies against the actual page anyway.
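The bucket-crossing test itself is two lines, using the same 32-byte buckets as the toy FSMTree above:

```python
# An FSM write is needed only when free space crosses a 32-byte bucket.

BUCKET = 32

def bucket(free_bytes: int) -> int:
    return min(free_bytes // BUCKET, 255)

def needs_fsm_write(old_free: int, new_free: int) -> bool:
    """True when the change crosses a bucket boundary, i.e. when the FSM
    leaf byte (and possibly its ancestors) must actually be rewritten."""
    return bucket(old_free) != bucket(new_free)
```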

Durability. Is the FSM WAL-logged? Postgres says no: the FSM is treated as a hint structure. On crash, you can rebuild it by scanning the table — slow but correct. In exchange, FSM updates never write WAL records, which keeps the WAL lean. The risk is that a crash leaves the FSM slightly wrong (it says page 47 has room when actually page 47 is full); inserts will fail against that page, the code will update the FSM on-the-fly to say "page 47 full", and retry. Transparent to the user, cheap for the write path.
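The self-correcting retry loop can be sketched against the toy FSMTree interface (find_page / update / extend). Here read_page_free stands in for reading the candidate page's own header; the page, not the FSM, is the source of truth:

```python
# Hint-only FSM: verify each candidate against the real page and repair
# stale entries on the fly.

def insert_with_retry(fsm, rec_len: int, read_page_free,
                      max_tries: int = 8) -> int:
    for _ in range(max_tries):
        page_id = fsm.find_page(rec_len)
        if page_id is None:
            return fsm.extend()            # no candidate anywhere: grow the file
        actual = read_page_free(page_id)   # the page itself is authoritative
        if actual >= rec_len:
            return page_id                 # the hint was right (or pessimistic)
        fsm.update(page_id, actual)        # repair the stale hint, then retry
    return fsm.extend()                    # too many stale hints: just extend
```

Each failed probe leaves the FSM more accurate than it found it, which is why an unlogged, crash-stale FSM converges back to the truth under normal traffic.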

Concurrency. Multiple inserts want to call find_page simultaneously. A naive FSM returns the same page to all of them, they all try to insert, they serialise on the page latch, and you have lost your concurrency. Real FSMs add a randomised path selection at each tree level — rather than always descending left-first, pick a random child among those with enough room. Concurrent inserts land in different pages and proceed in parallel.

# Concurrency-friendly find: prefer a random qualifying child at each step.
import random

def find_page_spread(fsm: FSMTree, need_bytes: int) -> int | None:
    need = math.ceil(need_bytes / BUCKET)
    if fsm.tree[1] < need:
        return None
    i = 1
    while i < fsm.leaf_count:
        left_ok = fsm.tree[2*i] >= need
        right_ok = fsm.tree[2*i + 1] >= need
        if left_ok and right_ok:
            i = 2*i + random.randint(0, 1)
        elif left_ok:
            i = 2*i
        else:
            i = 2*i + 1
    page_id = i - fsm.leaf_count
    return page_id if page_id < fsm.n_pages else None

One line of randomness spreads concurrent inserts across the free-space frontier. On a workload where 32 threads insert simultaneously, it reduces page-level latch contention by a factor of ~32.

Running the three FSMs on the same workload

Synthetic workload: 1000 pages, random record sizes between 50 and 800 bytes, 10,000 operations mixing 70% inserts and 30% deletes. Measured in Python:

>>> bench = run_workload(bitmap_fsm,      seed=42)
>>> bench.avg_fill, bench.lookup_us    # 0.83, 0.8 µs
>>> bench = run_workload(linked_list_fsm, seed=42)
>>> bench.avg_fill, bench.lookup_us    # 0.91, 3.4 µs
>>> bench = run_workload(fsm_tree,        seed=42)
>>> bench.avg_fill, bench.lookup_us    # 0.90, 0.3 µs

The bitmap has the worst packing (83%): its bit stays 1 until free space drops below the 256-byte threshold, so inserts keep probing pages that cannot fit the larger records. The linked list packs best (91%) because it walks forward until a tight fit, but at 4× the lookup time. The FSM tree gets near-best of both: 90% packing (within a point of the list) and the fastest lookup (the tree is tiny and cache-resident).

Now add concurrency. Four threads, same workload:

                         throughput (ops/sec)
bitmap                   240,000
linked-list              95,000    # writes on every unlink, serialised
fsm-tree (no spread)     310,000   # all threads hit the same page, latch stalls
fsm-tree (spread)        1,180,000 # randomised descent — 3.8× scaling

The randomised descent turns "all four threads contend for page 0" into "each thread finds a different page" and nearly linear scaling emerges. That is why Postgres's FSM uses this trick.

Going deeper

If the core picture is clear — an FSM is an index on per-page free bytes, and fragmentation comes in two flavours (internal gaps, external half-empty pages), each with its own remedy — the rest of this section is how production databases actually ship this.

Postgres's FSM in detail

The FSM lives in its own file with suffix _fsm beside the table's main file. It is a 3-level tree of 8 KiB pages; each FSM page holds a binary-tree array with 4096 leaves (one byte each). One FSM page summarises 4096 data pages = 32 MiB of table; three levels cover 512 TiB before needing a fourth. Free bytes are quantised to 32-byte buckets (one byte per leaf). The module is src/backend/storage/freespace/; the key function GetPageWithFreeSpace is about 200 lines of C with LWLocks on individual FSM pages for concurrency.

Lazy updates. RecordPageWithFreeSpace only writes the FSM if the bucket number has changed — empirically, ~1 in 5 inserts. No WAL logging. A crash can leave the FSM slightly wrong; recovery does not rebuild it. VACUUM resyncs page by page; between VACUUMs, discrepancies self-correct on insert retries.

InnoDB's extent-level free-space management

InnoDB organises a tablespace into extents of 64 consecutive 16 KiB pages = 1 MiB. Each extent has a descriptor entry in an XDES page: a bitmap recording which of its 64 pages are free, plus an extent state (free, partially-used fragment, full, or owned by a segment). Insertion is two-phase: find a partially-free extent (O(1) via a free list), then find a page within it (a 64-bit scan). This keeps overhead fixed (one small descriptor per 1 MiB extent) and clusters inserts — successive inserts land in consecutive pages within the same extent, which helps read and write locality. The cost: fragmentation within an extent is real. 64 half-full pages in an extent still mark the extent "used". InnoDB's remedy is OPTIMIZE TABLE, the SQL equivalent of Postgres's VACUUM FULL.

VACUUM and autovacuum tuning

Autovacuum is one of the most commonly misconfigured components in production Postgres. The defaults are conservative — designed for a small VM — and do not scale. The core knobs: autovacuum_vacuum_scale_factor (default 0.2, i.e. trigger at 20% dead tuples; on a billion-row table that is 200 M dead tuples, so busy tables should set 0.02 or 0.01); autovacuum_vacuum_cost_limit and autovacuum_vacuum_cost_delay (defaults 200 and 2 ms, which throttle autovacuum to avoid saturating I/O; on modern SSDs that throttling is absurdly conservative, and cost_limit=10000 with cost_delay=0 is roughly 50× faster); and autovacuum_naptime (default 60 s; 10 s is better for heavy-churn tables).

Dead-tuple bloat and FSM staleness compound: a table where autovacuum never keeps up accumulates external fragmentation and a stale FSM. The usual symptom: "inserts got slower six weeks ago and nobody knows why" — almost always an autovacuum tuned for a smaller workload than the one it now serves.

Compaction strategies — in-place vs relocating

Compaction splits along one axis: do you move records between pages? In-place (Postgres VACUUM, InnoDB page reorganise) packs each page's heap and refreshes the FSM — cheap, online, no lock, but reclaims internal fragmentation only. Relocating (Postgres VACUUM FULL, InnoDB OPTIMIZE TABLE, SQLite VACUUM) rewrites the whole table into a new file — expensive and lock-heavy, but recovers 100% of both fragmentation kinds. Online relocating (pg_repack, gh-ost) rewrites in the background with trigger-based propagation and an atomic swap — high total I/O but no write block. Log-structured compaction (LSM trees, Build 3) eliminates the question by fiat — data is immutable, so half-empty pages do not exist; the tradeoff is write and read amplification covered in that build.

SQLite's minimal approach

SQLite is the counterpoint. Its free-page tracking is a linked list of completely free pages only — partially-full pages are not tracked at all; new rows go to the last page in the table or into a completely-free reclaimed page. This works because SQLite's workload is usually small (embedded, mobile) and the cost of occasional internal fragmentation is low. Run VACUUM once in a while to rewrite the whole file and you are back to perfectly packed. For a database measured in megabytes, that is a few seconds of work. At 10 MB, a linked freelist and occasional VACUUM is sufficient; at 100 GB you need an FSM tree and online lazy compaction; at 10 TB, extent-level tracking and concurrent VACUUMs. The complexity of the free-space machinery grows with the data it manages.

Where this leads next

You now have the data layout half of a storage engine complete: B+ trees (chapter 23), buffer pool (chapter 24), copy-on-write (chapter 25), slotted pages (chapter 26), and free-space management (this chapter). The remaining half is concurrency: multiple threads reading and writing the same tree simultaneously without corrupting it.

The FSM you just built is modified by every insert, so concurrent inserts contend on the same cache line. Solving that contention is one of the first worked examples of why latches exist, and why a database running 64 threads needs synchronisation primitives that scale further than a Python lock can. Free space is bookkeeping, and bookkeeping looks trivial in the first version and turns into a career when you scale it — three hundred lines of code, fifteen years of argument, a settled design that every major database now approximates.

References

  1. PostgreSQL source, src/backend/storage/freespace/ — the canonical 3-level FSM tree implementation, about 1200 lines of clean C.
  2. Heikki Linnakangas, Free Space Map rewrite (Postgres 8.4), pgsql-hackers (2008) — the rewrite that replaced Postgres's old in-memory FSM with the on-disk tree still in use today.
  3. Jeremy Cole, The basics of InnoDB space file layout — a diagram-heavy walkthrough of InnoDB's extent-level free-space management.
  4. Joe Conway, Understanding autovacuum in PostgreSQL, 2ndQuadrant blog (2017) — the practical guide to autovacuum tuning on real workloads.
  5. Richard Hipp, SQLite file format spec — freelist pages and the PRAGMA auto_vacuum modes, with the tradeoffs spelled out.
  6. Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, 3rd edition, chapter 9 — the textbook treatment of heap file organisation and free-space management, still the clearest introduction.