In short

You have a 4 KiB page. You want to put rows in it. The problem is that "row" is the wrong unit: rows are variable-sized (a user record with a short name is 32 bytes; one with a long email and a bio is 400 bytes), and the page is fixed-sized, and the lookup you want — "give me row 3 in this page" — has to be O(1) even though row 3 does not live at a predictable offset. The answer is the slotted page, a layout that has been the industry standard since the 1980s and is used, with tiny variations, by Postgres, SQLite, InnoDB, and every other page-oriented engine. A small header sits at the top of the page. A slot array grows down from just after the header, one fixed-size slot per record, each slot holding (offset, length) of one record plus a flags byte. The record heap grows up from the bottom of the page; each new record is written at the end of the free space. Slot array growing down, heap growing up, free space in the middle, meet-or-stop rule: the page is full when the two ends touch. A row-id is the pair (page_id, slot_index) — stable even as records move around inside the page, because the slot stays in place and only its offset field changes. Delete a record: mark its slot as a tombstone (flags bit 1, offset=0, length=0); the slot stays, so row-ids that point past it stay valid. Update a record: if the new bytes fit in the old hole, write them in place; if not, either (a) append the new version at the end of free space and point the slot there, leaving a garbage hole that compaction will reclaim, or (b) if the page has no room, move the record to a fresh page and leave a forwarding pointer in the original slot, so the old row-id still resolves to the right record after one indirection. Compaction — a single in-place operation on the page — slides all live records to one end, updates every slot's offset, and reclaims every hole at once. The whole machinery costs about a hundred lines of Python. 
The whole machinery is the reason a relational row-id (ctid in Postgres, ROWID in SQLite, the clustered-index key in InnoDB) is the stable handle it is, and why your indexes can point at rows without being invalidated every time the row next door grows by one byte.

You finished the last chapter with a B+ tree whose pages are either internal nodes (fixed-size keys and child pointers) or leaves (fixed-size keys and values). That tree works for data you control — integers, timestamps, short hashes — where you can commit to a fixed 8-byte key and an 8-byte value and call it done. Real tables are not like that. A row in a users table is an integer id, a variable-length email, a variable-length display name, a short bio (sometimes empty, sometimes 500 characters), a created-at timestamp, a few booleans. One row is 40 bytes; the next is 450; the next is 80. You cannot build a B+ tree leaf on top of "row" as a unit without first deciding how rows live inside a page.

The decision is the slotted page. It is an old idea — the 1980s System R papers describe it; Gray and Reuter's textbook treats it in chapter 13; every open-source engine you can read implements a variant of it — and it is one of those rare ideas in systems engineering that is simple enough to fit in a diagram and robust enough that nobody has found a reason to replace it in forty years. This chapter walks the layout, the five operations a slotted page supports, and the one subtle property — row-id stability — that is the whole reason it exists.

The problem, stated precisely

You have a page: a bytearray of exactly 4,096 bytes, which you will pwrite to disk as one unit. You want to store a collection of records inside it. The records have four properties that together rule out every naive layout:

  1. Variable size. Different records have different byte counts. You cannot lay them out on a fixed-width grid.
  2. Dense packing. You do not want padding between records — it wastes page space and reduces fanout. Every byte of the page should be usable.
  3. O(1) indexed access. Given an index i into the page's records, you want to locate record i in constant time. Scanning from the start to find the i-th record is not acceptable; a B+ tree's binary search inside a leaf needs record_at(i) to be a direct lookup.
  4. Stable ids under mutation. A row's identifier, once assigned, must stay valid through updates and through deletes of neighbouring rows. If row 3 is 40 bytes and you delete row 2 and compact the page to reclaim the space, row 3 has moved — but any index pointing at row 3 should still find it.

These four requirements together force a specific shape. Property (1) says you cannot put records in a fixed-width array. Property (3) says you need some indirection table that is a fixed-width array — something the engine can index in constant time. Property (4) says the index into that indirection table, not the record's byte position, is what the row-id refers to. Property (2) says the indirection table and the records have to share the page's byte budget — neither gets a separate file, and together they have to fit in 4 KiB.

Put those four constraints together and there is essentially one solution: the slot array at one end of the page, the records at the other end, free space in the middle. That is the slotted page.

The layout

[Figure: one 4 KiB page in the slotted layout. At the top, a 16-byte header (n_slots, free_space_end, flags). Below it, the slot array grows down, one 8-byte slot per record — e.g. slot 0 → (offset=3800, len=48, flags=0), slot 1 → (offset=3720, len=80, flags=0), slot 2 → TOMBSTONE (flags=1), slot 3 → (offset=3580, len=140, flags=0). In the middle, free space, shrinking from both ends. From the bottom (byte 4095), the record heap grows up: variable-width records r0, r1, r3, a hole where the deleted r2 used to be, and a fresh insert r4. Each slot points at its record's start. The meet-in-the-middle rule: the page is full when free_space_end - (header + n_slots*slot_size) < new_record_size + slot_size.]
One 4 KiB page in the slotted layout. The header sits at the top; the slot array grows down from just below it; the record heap grows up from the bottom of the page. Each slot is `(offset, length, flags)` pointing at a record in the heap. A row-id is `(page_id, slot_index)` — not a byte offset. The slot's `offset` can change as records move around inside the page, but the slot's position in the slot array is what external pointers reference, so the row-id is stable.

The shape is deliberately symmetric. Both the slot array and the heap grow toward the middle; free space is whatever is left between them. A single scalar in the header — free_space_end, the byte offset at which the record heap currently begins — tells you where the next new record will go (its last byte just below that offset, its first byte at free_space_end minus the record's length, which becomes the new free_space_end). A second scalar — n_slots — tells you where the next new slot will go (immediately after the last slot, at byte offset header_size + n_slots * slot_size). The page is full when those two pointers meet.

Why grow the slot array toward the heap instead of keeping them in the same direction? Because with a fixed slot size (say, 8 bytes: a 2-byte offset, a 2-byte length, a 4-byte flags/reserved block), the slot array's tail is the single number n_slots. With the heap growing up from the bottom, its head is the single number free_space_end. Both pointers are in the header; the arithmetic for "how much free space is left" is one subtraction. If the slots and heap grew the same direction, you would need a separate slots_end scalar and a heap_start scalar, and compaction would have to move both the slot array and the records whenever you wanted to reclaim space. Meeting in the middle is algebraically cleaner.

Three numbers live in the header, and the rest of the page is data. With a 16-byte header and 8-byte slots, a 4 KiB page holds at most (4096 - 16) / 8 = 510 slots if every record were zero-byte — nonsense, but a useful ceiling. In practice the number of records per page is dominated by record size: a table of 100-byte rows fits 37 rows per page; a table of 1000-byte rows fits 4.
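The capacity arithmetic is a one-liner worth checking. A back-of-envelope sketch, assuming the header and slot sizes above (the function name `rows_per_page` is ours, not part of the chapter's code):

```python
# Capacity ceiling for one slotted page: each row costs its record bytes
# plus one 8-byte slot, and everything shares the page minus the header.
PAGE_SIZE, HEADER_SIZE, SLOT_SIZE = 4096, 16, 8

def rows_per_page(record_size: int) -> int:
    return (PAGE_SIZE - HEADER_SIZE) // (record_size + SLOT_SIZE)

print(rows_per_page(0))     # 510 -- the zero-byte ceiling
print(rows_per_page(100))   # 37
print(rows_per_page(1000))  # 4
```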

Five operations, one page

Every slotted page implementation supports the same five operations. We will walk them in order, and then the Python code will implement them literally.

insert(record_bytes) → slot_index. Look at the header. Compute the current free space: free_space_end - (header + n_slots * slot_size). If the new record plus a new slot fit, write the record at free_space_end - len(record) (decreasing free_space_end by the record length), write a new slot at index n_slots pointing at that offset, increment n_slots. Return the slot index.

get(slot_index) → record_bytes. Read the slot at position header + slot_index * slot_size. Check its flags: if tombstoned, the record is gone. Otherwise read length bytes starting at offset. That is the record.

delete(slot_index). Set the slot's flags bit to tombstone; set offset=0, length=0. Do not remove the slot from the array and do not touch the record heap. The slot stays in its position forever (slot indices are promised to be stable); the record bytes become a hole in the heap that compaction will later reclaim.

update(slot_index, new_bytes). Three cases. (a) len(new_bytes) <= old_length: overwrite the record bytes in place, update the slot's length to the new (smaller or equal) length. Zero extra space used. (b) len(new_bytes) > old_length but the new record still fits in the page's remaining free space: write the new version at the end of the heap (like an insert), point the slot at the new location, leave the old bytes as a hole that compaction will reclaim. No slot moves; the row-id is unchanged. (c) The new version does not fit in this page at all: mark the slot as a forwarding pointer (flags bit 2), store the target (page_id, slot_index) of the new home, and insert the record into a different page.

compact(). A single pass over the page, reclaiming every hole. Walk the slot array, collect all live records, write them back to the heap in order starting from the bottom, update each slot's offset to the record's new position, update free_space_end. The slot indices do not change; external row-ids are still valid. The operation is O(records in page), runs entirely in RAM on one already-loaded page, and produces one page of new bytes that the buffer pool will write back at its next flush.

Five operations. Every database's heap storage is some dressed-up version of these. Now the code.

The Python

# slotted_page.py — a self-contained slotted-page implementation.
# A page is a 4096-byte bytearray; this class is a view over it.

import struct

PAGE_SIZE   = 4096
HEADER_SIZE = 16
SLOT_SIZE   = 8              # 2 B offset, 2 B length, 4 B flags
HEADER_FMT  = "<HHIII"       # n_slots, free_space_end, page_flags, reserved, reserved
SLOT_FMT    = "<HHI"         # offset, length, flags

FLAG_LIVE       = 0
FLAG_TOMBSTONE  = 1 << 0
FLAG_FORWARDED  = 1 << 1

class PageFull(Exception):
    pass

class SlottedPage:
    """
    Byte layout:
      [0..16)          header: n_slots, free_space_end, page_flags, ...
      [16..16+N*8)     slot array, N = n_slots; each slot is (off,len,flags)
      [free_space_end..4096)  record heap, each record occupies [off, off+len)
    Free space is the gap between the end of the slot array and free_space_end.
    """

    def __init__(self, buf: bytearray):
        assert len(buf) == PAGE_SIZE
        self.buf = buf

    # ---- classmethod to initialise a blank page ----
    @classmethod
    def init_empty(cls) -> "SlottedPage":
        buf = bytearray(PAGE_SIZE)
        struct.pack_into(HEADER_FMT, buf, 0, 0, PAGE_SIZE, 0, 0, 0)
        return cls(buf)

    # ---- header accessors ----
    @property
    def n_slots(self) -> int:
        return struct.unpack_from(HEADER_FMT, self.buf, 0)[0]

    @property
    def free_space_end(self) -> int:
        return struct.unpack_from(HEADER_FMT, self.buf, 0)[1]

    def _set_header(self, n_slots: int, free_space_end: int) -> None:
        struct.pack_into(HEADER_FMT, self.buf, 0, n_slots, free_space_end, 0, 0, 0)

    # ---- slot accessors ----
    def _slot_pos(self, i: int) -> int:
        return HEADER_SIZE + i * SLOT_SIZE

    def _read_slot(self, i: int):
        return struct.unpack_from(SLOT_FMT, self.buf, self._slot_pos(i))

    def _write_slot(self, i: int, offset: int, length: int, flags: int) -> None:
        struct.pack_into(SLOT_FMT, self.buf, self._slot_pos(i), offset, length, flags)

    # ---- public ops ----
    def free_space(self) -> int:
        slots_end = HEADER_SIZE + self.n_slots * SLOT_SIZE
        return self.free_space_end - slots_end

    def insert(self, record: bytes) -> int:
        n = self.n_slots
        need = len(record) + SLOT_SIZE
        if self.free_space() < need:
            raise PageFull(f"need {need}, have {self.free_space()}")
        new_end = self.free_space_end - len(record)
        self.buf[new_end:new_end + len(record)] = record
        self._write_slot(n, new_end, len(record), FLAG_LIVE)
        self._set_header(n + 1, new_end)
        return n

    def get(self, slot_index: int) -> bytes | None:
        off, length, flags = self._read_slot(slot_index)
        if flags & FLAG_TOMBSTONE:
            return None
        if flags & FLAG_FORWARDED:
            # Caller should use read_forward() to follow the pointer instead.
            return None
        return bytes(self.buf[off:off + length])

    def delete(self, slot_index: int) -> None:
        off, length, flags = self._read_slot(slot_index)
        if flags & FLAG_TOMBSTONE:
            return
        self._write_slot(slot_index, 0, 0, FLAG_TOMBSTONE)
        # The record's bytes at [off, off+length) are now a hole; leave them.
        # Compaction will reclaim the space on its next run.

    def update(self, slot_index: int, new_record: bytes) -> int:
        """Returns 0 on in-place update, 1 on reshuffle within page,
        raises PageFull if the new record does not fit."""
        off, length, flags = self._read_slot(slot_index)
        if flags & (FLAG_TOMBSTONE | FLAG_FORWARDED):
            raise ValueError("update on tombstoned or forwarded slot")
        if len(new_record) <= length:
            # Case (a): fits in the old hole, write in place.
            self.buf[off:off + len(new_record)] = new_record
            self._write_slot(slot_index, off, len(new_record), FLAG_LIVE)
            return 0
        # Case (b): doesn't fit in place; write at end of free space, leave a hole.
        if self.free_space() < len(new_record):
            raise PageFull(f"update needs {len(new_record)}, have {self.free_space()}")
        new_end = self.free_space_end - len(new_record)
        self.buf[new_end:new_end + len(new_record)] = new_record
        self._write_slot(slot_index, new_end, len(new_record), FLAG_LIVE)
        self._set_header(self.n_slots, new_end)
        return 1

    def forward(self, slot_index: int, target_page: int, target_slot: int) -> None:
        """Mark this slot as forwarding to another page."""
        # Encode the forward target in 8 bytes at a freshly-allocated spot.
        target = struct.pack("<II", target_page, target_slot)
        if self.free_space() < len(target):
            raise PageFull("no room to record forwarding pointer")
        new_end = self.free_space_end - len(target)
        self.buf[new_end:new_end + len(target)] = target
        self._write_slot(slot_index, new_end, len(target), FLAG_FORWARDED)
        self._set_header(self.n_slots, new_end)

    def read_forward(self, slot_index: int) -> tuple[int, int] | None:
        off, length, flags = self._read_slot(slot_index)
        if not (flags & FLAG_FORWARDED):
            return None
        return struct.unpack("<II", bytes(self.buf[off:off + length]))

    def compact(self) -> None:
        """In-place compaction: slide all live records to the top of the page,
        update every live slot's offset, reclaim every hole at once."""
        live = []
        for i in range(self.n_slots):
            off, length, flags = self._read_slot(i)
            if flags & FLAG_TOMBSTONE:
                continue
            live.append((i, off, length, flags))
        # Write live records back from the page end. Process them in
        # descending current-offset order: the record highest in the heap is
        # relocated first, so no write ever lands on the bytes of a live
        # record that has not been read yet.
        live.sort(key=lambda t: t[1], reverse=True)
        new_end = PAGE_SIZE
        for (i, off, length, flags) in live:
            data = bytes(self.buf[off:off + length])
            new_end -= length
            self.buf[new_end:new_end + length] = data
            self._write_slot(i, new_end, length, flags)
        # Zero the reclaimed gap between the slot array and the new heap
        # start — not strictly required for correctness, but keeps debug
        # dumps clean.
        gap_start = HEADER_SIZE + self.n_slots * SLOT_SIZE
        self.buf[gap_start:new_end] = bytes(new_end - gap_start)
        self._set_header(self.n_slots, new_end)

About a hundred and fifty lines including blanks, comments, and docstrings. The abstraction that took three-quarters of the chapter to motivate is a few dozen statements.

Why does insert grow the slot array forever (slot indices never get reused), even after a tombstone? Because the row-id (page_id, slot_index) is a promise. Any other data structure — a B+ tree's secondary index, a foreign-key reference in another table — may be holding onto the slot index of a long-deleted row. If insert reused slot 3 after the original row-3 was deleted, the stale pointer would now resolve to a different row, silently. Slot indices must be append-only; the cost is one 8-byte slot of waste per deleted row, which is small and which compaction at the page level cannot reclaim (you would need a page-level rewrite that also rewrites every index to the page, which is a much heavier operation).
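The promise is easy to see in miniature. A toy sketch — a Python list standing in for the slot array, none of the byte layout — showing that delete tombstones in place and insert never reuses an index:

```python
# Toy model of append-only slot indices: delete leaves a tombstone in place,
# insert always appends, so a stale index can never resolve to the wrong row.
TOMBSTONE = object()

class TinySlots:
    def __init__(self):
        self.slots = []                    # grows forever

    def insert(self, rec: bytes) -> int:
        self.slots.append(rec)
        return len(self.slots) - 1         # fresh index, never reused

    def delete(self, i: int) -> None:
        self.slots[i] = TOMBSTONE          # slot stays; index stays valid

    def get(self, i: int):
        r = self.slots[i]
        return None if r is TOMBSTONE else r

p = TinySlots()
s0 = p.insert(b"row-a")                    # index 0
s1 = p.insert(b"row-b")                    # index 1
p.delete(s0)
s2 = p.insert(b"row-c")                    # index 2, NOT 0
assert s2 == 2                             # the dead slot 0 was not reused
assert p.get(s0) is None                   # stale pointers see "deleted"
assert p.get(s1) == b"row-b"
```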

The row-id promise — what stays stable and what does not

Stand back from the code and look at the invariants.

The row-id in this system is the pair (page_id, slot_index). It is handed out on insert and it is the only handle by which the rest of the engine refers to a row. The promise the slotted page makes is: once issued, a row-id continues to resolve to the same logical record, as long as that record exists.

Three things can happen to the row behind a given slot index, and all three preserve the row-id:

  1. In-place update. The record's bytes are overwritten where they sit; the slot's length may shrink. Nothing external can tell.
  2. Move within the page. A growing update or a compaction rewrites the slot's offset; the slot's index — the thing the row-id names — does not change.
  3. Forwarding. The record moves to another page; the slot becomes a forwarding pointer, and the old row-id still resolves after one indirection.

Four things that are not promised:

  1. The record's byte offset inside the page — it changes on every move and every compaction.
  2. The record's physical order in the heap relative to its neighbours.
  3. Single-hop access — reading a forwarded record costs an extra page fetch.
  4. Anything after deletion — a tombstoned slot resolves to nothing, forever, and its index is never reused.

Why is row-id stability worth engineering around? Because the rest of the storage engine — secondary indexes, the B+ tree's own leaf pointers, foreign-key references — stores row-ids. If a row-id could change whenever a neighbouring row got bigger, every index that referenced the rearranged page would have to be patched, in lockstep. That coordination is what nearly every page-layout design before the slotted page got wrong: they stored row byte-offsets in secondary structures, then discovered that a single update on page 4,827 required rewriting half the indexes. The slotted page decouples "where the bytes live on the page" from "what the row's external handle is", and this one decoupling is the enabler for every higher structure in the engine.

A page, in action

Trace five operations on one page

Start with an empty 4 KiB page. Insert four records, delete one, update one, compact. Watch what happens to the header, the slots, and the heap at each step. Records are short ASCII strings; sizes include their actual bytes.

from slotted_page import SlottedPage, PAGE_SIZE, HEADER_SIZE

p = SlottedPage.init_empty()
print(f"start:    n_slots={p.n_slots}, free_end={p.free_space_end}, free={p.free_space()}")
# start:    n_slots=0, free_end=4096, free=4080

# Insert four user records. Each is serialised to bytes; sizes differ.
r0 = b"alice|15|alice@example.com"          # 26 B
r1 = b"bob|42|bob@padho.wiki"               # 21 B
r2 = b"carol|7|carol.is@writing-a-really-long-email@example.org"  # 56 B
r3 = b"david|28|d@d.in"                     # 15 B

s0 = p.insert(r0)                           # slot 0
s1 = p.insert(r1)                           # slot 1
s2 = p.insert(r2)                           # slot 2
s3 = p.insert(r3)                           # slot 3

print(f"after 4:  n_slots={p.n_slots}, free_end={p.free_space_end}, free={p.free_space()}")
# after 4:  n_slots=4, free_end=3978, free=3930
# Math: heap used = 26 + 21 + 56 + 15 = 118, so free_end = 4096 - 118 = 3978.
# Slots take 4*8 = 32 B; free = 3978 - (16-byte header + 32 B of slots) = 3930.

# Delete record 2 (the 56-byte one — biggest hole).
p.delete(s2)
print(f"delete s2: get(s2) = {p.get(s2)!r}")       # None
print(f"           free_end={p.free_space_end}")    # unchanged! 3978
print(f"           free={p.free_space()}")          # unchanged! hole not reclaimed

# The 56 bytes of r2 are still sitting in the heap as garbage.
# Slot 2 is a tombstone; slots 0, 1, 3 still resolve correctly:
assert p.get(s0) == r0
assert p.get(s1) == r1
assert p.get(s3) == r3

# Update r1 to something SHORTER — in-place update, no reshuffle.
p.update(s1, b"bob|99|b@b")                 # 10 B, fits in old 21 B slot
assert p.get(s1) == b"bob|99|b@b"

# Update r0 to something LONGER — must reshuffle within page.
p.update(s0, b"alice|15|alice-renamed@example.com|bio=hi")  # 41 B > 26 B
assert p.get(s0) == b"alice|15|alice-renamed@example.com|bio=hi"
# Slot 0's offset changed; its index did not.

# Holes in the heap now:
#   - 56 B from deleted r2
#   - 26 B from the original r0 (updated to a larger version elsewhere)
#   - 11 B tail from r1's in-place shrink (reclaimable only by compact)

# Compact the page: slide live records up, reclaim every hole at once.
before = p.free_space()
p.compact()
after = p.free_space()
print(f"compact: free before={before}, free after={after}")
# free after is substantially larger — the three holes (56 + 26 + 11 = 93 B)
# are all reclaimed in one pass.
# Slot indices unchanged; row-ids still resolve:
assert p.get(s0) == b"alice|15|alice-renamed@example.com|bio=hi"
assert p.get(s1) == b"bob|99|b@b"
assert p.get(s2) is None
assert p.get(s3) == b"david|28|d@d.in"

Every row-id that was issued before the compaction still works after it. The slot indices 0, 1, 2, 3 are permanent. The only things that changed are the offsets inside the slot array — the page's internal bookkeeping. External data structures (a B+ tree leaf's (page, slot) pointers, a secondary index's row-ids) are unaffected.

This is the whole reason slotted pages exist. Compaction is a page-local, index-free operation. It does not touch any other page, and it does not invalidate any row-id in the system. A storage engine can compact individual pages as a background maintenance task without coordinating with any other component.

Forwarded records — when the page cannot hold the update

One case is left: the update whose new bytes do not fit in the current page even after compaction. The old record was 40 bytes; the new one is 300; the page has 200 bytes of free space (including the hole the old record will vacate). There is no room in this page for the new record no matter how well you compact.

Two strategies exist, and different engines choose differently.

Forwarding pointer (Postgres style). Move the record to a new page (one that does have room), and leave the original slot in place with a forwarding pointer — a small record (typically 8 bytes: 4 for the target page id, 4 for the target slot index) that says "the real record now lives over there". The original row-id still resolves: the reader fetches the original page, sees the forward flag in the slot, reads the target, fetches the target page, reads the record from the target's slot. Two page fetches for one row, plus one extra level of indirection — not free, but bounded.
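The read path is mechanical. A minimal sketch of the two-hop resolution — here pages are plain dicts mapping slot index to either a live record or a forward target, standing in for real SlottedPage objects and a pager:

```python
# Resolve a row-id (page_id, slot_index), following at most one forwarding
# hop. `pages` is a toy pager: page_id -> {slot: ("live", bytes) or
# ("fwd", (page, slot))}.
def resolve(pages, page_id, slot_index):
    kind, payload = pages[page_id][slot_index]
    if kind == "fwd":                      # one bounded indirection
        page_id, slot_index = payload
        kind, payload = pages[page_id][slot_index]
    return payload if kind == "live" else None

pages = {
    4: {0: ("fwd", (9, 2))},               # the old home forwards...
    9: {2: ("live", b"the 300-byte new version")},   # ...to the new one
}
assert resolve(pages, 4, 0) == b"the 300-byte new version"   # two hops
assert resolve(pages, 9, 2) == b"the 300-byte new version"   # direct read
```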

Postgres keeps the old tuple in place, marked dead, with its t_ctid pointing at the new tuple's location. Chain length is limited: the HOT-update mechanism tries to keep the forward pointer inside the same page, and when a record must leave the page, a later update starts over from the new home rather than chaining again — the worst case is two hops, not N.

Full relocation with index fixup (Oracle, some older engines). When a row is migrated, rewrite every index pointing at the old row-id to point at the new one. No forwarding pointer is left behind. Lookups are always one hop; the price is the index fixup, which can be expensive if the table has many secondary indexes. Oracle calls this "row migration" and provides tuning to control when it happens.

The forwarding-pointer approach is more common because it localises the cost of a bad update to that one update and two-hop reads; the index-fixup approach spreads the cost across many index pages. For most workloads (modest update size, not everything relocates) the forwarding approach is cheaper overall, which is why it dominates modern open-source engines.

Common confusions

Going deeper

If the picture is clear — header, slot array, record heap, five operations, stable row-ids — the rest of this section is how three production engines extend the same layout.

InnoDB's record format

MySQL's InnoDB uses a variant of the slotted page called "COMPACT" record format (introduced in 5.0, default since 5.1). The layout differs from the basic slotted page in three ways.

First, there is no separate slot array. Instead, every record carries a record header of 5 to 6 bytes that includes the offset to the next record in the page's sort order. Records are linked in a singly-linked list in key-sorted order; a page-internal "page directory" at the end of the page holds pointers every four to eight records to accelerate lookup within a page. The effect is similar to a slot array but with a different tradeoff: sequential scan through the page's records in sort order is direct (follow the link), but random access by position requires hopping through the directory.

Second, variable-length fields are stored with a length prefix at the front of the record, not in the slot. A row with three VARCHAR columns has a length-prefix byte (or two, for longer columns) for each, then the actual bytes; nullable columns also have a null bitmap at the front. This keeps slot size minimal.

Third, large BLOB and TEXT columns are stored outside the page. If a column's value exceeds about 767 bytes, InnoDB writes it to an overflow page chain and keeps a 20-byte pointer in the main record. This is how InnoDB handles rows that would otherwise not fit in a single page — a universal technique we will build in the next chapter on free-space management.

The Postgres tuple format is similar in spirit: a small tuple header, a null bitmap, variable-length fields packed end to end, with a t_ctid pointer that doubles as the forwarding pointer for updated tuples. Both formats are dense — every byte of the page is used — and both are slot-indexed through a small directory at the top of the page.

Postgres line pointers and HOT updates

Postgres pages use a classic slotted layout with one important refinement. The slot array is called the line pointer array; each line pointer is 4 bytes: a 15-bit offset into the page, a 15-bit length, and a 2-bit flags field (LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD). The record heap at the bottom holds heap tuples with their own 23-byte headers.

A regular UPDATE in Postgres is logically a delete-plus-insert: mark the old tuple dead, insert the new version (possibly on a different page), update every index to point at the new row-id. This is expensive when only the primary data changed and none of the indexed columns did — why rewrite every index?

Postgres's HOT update (Heap-Only Tuple) avoids this. When the new tuple fits on the same page and no indexed column has changed, Postgres:

  1. Inserts the new tuple as usual.
  2. Sets the old tuple's t_ctid to point at the new one (a forwarding pointer).
  3. Leaves the old line pointer in place; a later prune pass collapses it to LP_REDIRECT pointing at the new tuple.
  4. Does not touch any index. Secondary indexes still point at the old line pointer; lookups follow the redirect.

This reduces index maintenance dramatically for update-heavy workloads where most updates only touch non-indexed columns. The cost is a chain in the heap (old → new), which VACUUM eventually collapses.

Overflow pages for large records

Every engine has a strategy for records that exceed the page size — a VARCHAR column with 100,000 characters, a BLOB with a megabyte image. The standard solution is overflow pages: the main record on the data page holds the record's leading bytes plus a pointer to a chain of overflow pages, each of which holds more of the record's bytes.

InnoDB's DYNAMIC row format keeps only a 20-byte pointer in the main record and stores the entire variable-length column on overflow pages. Postgres's TOAST mechanism (The Oversized-Attribute Storage Technique) compresses large values and, if still too big, moves them to a separate TOAST table, leaving a small pointer in the main tuple. SQLite splits long records across a chain of "overflow pages" linked from the main page.

The details differ; the shape is the same. A record that cannot fit on one page is split into a head-part that lives on the main page and a tail-part that lives on overflow pages. The slot array on the main page is unchanged; a single bit in the slot's flags tells the reader to follow the overflow chain. Row-ids remain (page_id, slot_index) — stable even when the record's body lives across six different overflow pages.
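The shape can be sketched in a few lines. Assume each overflow page holds a chunk of the record plus the id of the next page in the chain (the dict-based pager here is a stand-in for illustration, not any engine's format):

```python
# Reassemble a record whose head lives on the main page and whose tail is
# split across a chain of overflow pages: overflow_id -> (chunk, next_id).
def read_with_overflow(head, first_overflow, overflow_pages):
    parts, nxt = [head], first_overflow
    while nxt is not None:                 # follow the chain to its end
        chunk, nxt = overflow_pages[nxt]
        parts.append(chunk)
    return b"".join(parts)

overflow_pages = {7: (b"-middle", 8), 8: (b"-tail", None)}
assert read_with_overflow(b"head", 7, overflow_pages) == b"head-middle-tail"
```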

We will build overflow pages in the next chapter.

TID stability and what invalidates a row-id

One last subtlety worth knowing. The row-id — what Postgres calls the TID (tuple id), InnoDB calls the clustered-index key (different concept), SQLite calls the ROWID — is promised to be stable "as long as the record exists". In Postgres, that promise has a specific exception: when VACUUM FULL or CLUSTER rewrites the table, every row is assigned a new TID. Secondary indexes get rebuilt. Applications must not cache TIDs across these operations.

Why does that exception exist? Because VACUUM FULL is a full page rebuild, exactly the operation that reclaims slot-array space. Every live tuple gets a fresh line pointer in a fresh page; the old pages are freed; every index gets rewritten to point at the new locations. The operation is expensive and disruptive (it holds an exclusive lock on the table) precisely because it gives up the stability promise in exchange for reclaiming bloat.

InnoDB sidesteps this by using the primary key as the row-id — the clustered index key is how rows are addressed, not (page, slot). A row moving between pages (because its clustered-index key is the same) still has the same row-id. Secondary indexes store primary-key values, not physical locations. This makes updates that move rows cheap (no secondary index changes needed) but makes every secondary-index lookup a two-step query (find primary key, then use primary key to find the row). The row-id stability tradeoff is different; the mechanism that achieves it is different.
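The two-step lookup is easy to sketch. Toy dicts stand in for the clustered index and one secondary index (the table data is invented for illustration):

```python
# InnoDB-style logical row-ids: the secondary index stores primary keys,
# never (page, slot), so every secondary lookup is two steps.
clustered = {101: ("alice", "alice@example.com"),
             102: ("bob", "bob@example.com")}
email_idx = {"bob@example.com": 102}       # indexed value -> primary key

def lookup_by_email(email):
    pk = email_idx.get(email)              # step 1: secondary index
    if pk is None:
        return None
    return clustered[pk]                   # step 2: clustered index

assert lookup_by_email("bob@example.com") == ("bob", "bob@example.com")
assert lookup_by_email("nobody@example.com") is None
```

A row can move between pages without touching email_idx at all — which is exactly the tradeoff the paragraph above describes.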

Postgres chose physical row-ids with a big-hammer exception for VACUUM FULL. InnoDB chose logical row-ids with a permanent two-step lookup tax. Both are reasonable; both are consequences of how they answer the "what stays stable under updates" question.

Where this leads next

You have a page layout that can hold heterogeneous records, identify them by stable row-ids, update them without invalidating external pointers, compact itself in place, and forward records that grow beyond what one page can hold. The page is now a container that the B+ tree can use as a leaf, and that a heap-file can use as its unit of storage.

What is missing is the bookkeeping around pages themselves. When you insert a new record and the current page is full, you need a new page — and the engine must find one. When you delete every record from a page, the page should go back into a free list so some future insert can reuse it. Maintaining that list, managing the fragmentation that builds up as pages become partially-full in different amounts, and deciding when to rebuild low-fill pages is the subject of the next chapter.

After that, Build 4 moves to concurrency (latches) and then to the write-ahead log (Build 5). The page is the unit of I/O. The slotted page is the unit of storage. Everything above both is either a data structure (B+ tree, heap file) or a protocol (locking, logging) built on top of them.

The slot array growing down, the heap growing up, free space in the middle — forty years old, a hundred lines of code, and the reason a row-id means what it means.

References

  1. Jim Gray, Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1992), chapter 13 — the canonical treatment of page layouts for transactional storage, including the slotted-page pattern.
  2. PostgreSQL documentation, Database Page Layout — the exact byte layout of a Postgres page, line pointers and all, with the invariants spelled out.
  3. MySQL InnoDB source, storage/innobase/rem/rem0rec.cc and the reference documentation InnoDB Row Formats — how a production engine packs variable-length records inside a 16 KiB page with forwarding pointers and overflow chains.
  4. SQLite documentation, File Format section 1.6 "The freeblock list" and 1.5 "The cell pointer array" — the slotted-page layout under SQLite's own terminology, with a self-describing B-tree structure.
  5. Goetz Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases (2011), section 3 "Page organizations" — a survey of slot-array variants used by real engines and the tradeoffs each one makes.
  6. Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases (2007), section 4.5 "Storage Management" — a high-level tour of how slotted pages fit into the broader database architecture.