In short

Your database says "lock" in five different places and means two completely different things. A latch is a short, physical mutex — microseconds long — that one thread holds while it is reading or writing the bytes of a single page or a single internal B+ tree node. Its whole purpose is to keep the shape of the tree consistent while a concurrent writer is splitting a node or a concurrent reader is racing through it: without the latch, one thread could read a half-written node and see a key twice, or crash.

A lock is a long, logical guard — seconds to minutes — that a transaction holds on a row, a range of rows, a table, or a predicate, for the entire duration of the transaction. Its purpose is isolation: two users running concurrent SQL statements must see outcomes equivalent to some serial order, and locks (or MVCC, the multiversion alternative) are the mechanism.

Four axes separate them: lifetime (microseconds vs the whole transaction), API (latch/unlatch around a critical section vs acquire-on-first-access and release-at-commit), deadlock treatment (latches use lock-ordering conventions — in a B+ tree, always top-down, so cycles are impossible by construction; locks use runtime deadlock detection — a wait-for graph, a victim chosen on cycle), and granularity (latches guard one page; locks cover rows, ranges, tables, and entire databases in a hierarchy).

The B+ tree descent illustrates the difference perfectly through latch crabbing: you latch the root, descend to a child, decide whether the child is "safe" (will not propagate a split/merge up to the root), and if it is, release the parent's latch before recursing further. Only the minimum set of latches needed to protect the current modification point is held at any time — the two-or-three-node window crawls down the tree like a crab along a branch. Meanwhile, the transaction-level lock on "row 17" is still held for ten more seconds until the user's transaction commits. Same word, different scales.
This chapter builds the mental model, walks through a crabbing traversal, and lays the ground for the chapter on torn writes and the write-ahead log — which exists precisely because latches do not survive a crash and locks do not survive a process restart.

Open the Postgres source code and search for the word lock. You will find at least four different things called locks, three of which are not locks in the sense the rest of this chapter cares about. There is LWLock, a "lightweight lock" — a latch, really, under a different name. There are the heavyweight locks — the lock manager's locks, the ones that protect a row for the duration of a transaction. There is the buffer pool's content_lock on each frame — also a latch. There are advisory locks, predicate locks, buffer pins — every one of them is called a lock somewhere, and half of them are not what a textbook means by lock.

This chapter is about pulling the two real concepts apart and putting each one in its place. You have a B+ tree from chapter 23 and a buffer pool from chapter 24. You are about to let more than one thread use them at the same time. Two mechanisms are about to show up, and you need to know which is which before the first race condition ruins an afternoon.

Same word, two jobs

Consider one SQL statement from the database's point of view.

UPDATE accounts SET balance = balance - 100 WHERE id = 17;

Under the hood, this expands into roughly the following sequence. Find row 17 in the accounts table. Read its balance. Compute the new balance. Write it back. Commit the transaction.

"Find row 17" is a B+ tree lookup. The tree has a root page, a second-level internal page, a third-level internal page, and a leaf page. The thread has to read those four pages. If any of those pages is being modified right now by another thread — perhaps a concurrent insert that is splitting a leaf — the reading thread must not see the page mid-modification. That protection is the job of a latch. The latch is acquired when the thread pins the page in the buffer pool, held for as long as the thread is reading bytes out of the page, and released as soon as the thread moves on to the next page. Total lifetime: a few microseconds.

"Compute the new balance. Write it back. Commit" is a logical transaction. The user expects that no other transaction's update to row 17 can interleave with this one in a way that violates isolation — if two users simultaneously deduct 100 from an account that starts at 500, the result had better be 300, not 400. That protection is the job of a lock on row 17. The lock is acquired the first time the transaction touches the row, held for the entire duration of the transaction, and released only at commit (or rollback). Total lifetime: the length of the transaction — usually milliseconds, sometimes many seconds, occasionally minutes.

[Figure: the two time windows on one SQL statement. A horizontal timeline of an UPDATE: the top track shows latches held on four B+ tree pages during the descent — each bar microseconds wide, released as soon as the descent moves past its page; the bottom track shows a single X lock on row 17, acquired when the leaf is read and held all the way to COMMIT.]
One SQL UPDATE statement, two concurrency windows. The top track shows four latches held during the B+ tree descent — each one microseconds long, each released as soon as the thread finishes reading that page. The bottom track shows one row-level lock, acquired when the leaf is first touched and held for the *entire* length of the transaction, until COMMIT. A typical transaction's latches and its locks differ in duration by six orders of magnitude.

Two different mechanisms, guarding two different kinds of invariant. The latch is there so that the bytes of a B+ tree page cannot be read while they are being written; without it, a reader could see a key listed twice (because the writer is halfway through a split), or segfault trying to follow a freshly rewritten pointer. The lock is there so that two users' transactions cannot interleave in a way that violates their intuition about what the balance should be.

Why do we need both? A lock on row 17 held from the start of the transaction to COMMIT keeps other transactions out of row 17 — but while this transaction is descending the tree to find row 17, what keeps the tree itself consistent against a different transaction that is inserting row 4,900,000 at the same time? That different transaction may be splitting the very same internal node this transaction is currently traversing. The lock on row 17 says nothing about the internal node — which is a purely physical object the user cannot even reference. A second, much shorter, much lower-level mechanism is needed for that: the latch. Latches protect physical data structures; locks protect logical data. The two populations do not overlap.

Four axes of difference

Pull the two mechanisms apart along four axes and the distinction clarifies fast.

What it protects
  Latch: a physical object — a page in the buffer pool, an internal node of the B+ tree, a slot in a shared array.
  Lock: a logical object — a row, a range of rows, a table, a predicate.

Lifetime
  Latch: microseconds — the length of a critical section.
  Lock: seconds to minutes — the length of a transaction.

Who holds it
  Latch: one OS thread.
  Lock: one transaction (which may span many threads).

API
  Latch: latch / unlatch, called by the storage engine's own code.
  Lock: acquire / release, called by the transaction manager, usually implicit to the user.

Modes
  Latch: shared (S) vs exclusive (X), sometimes intent.
  Lock: S, X, Update, Intent-S, Intent-X, Schema, and a full hierarchy.

Deadlock treatment
  Latch: prevented by convention — a strict lock-ordering (e.g. "always top-down in the tree") makes cycles impossible.
  Lock: detected at runtime — a wait-for graph, a cycle detector, a victim chosen to abort.

Survives a crash?
  Latch: no — process death releases everything.
  Lock: no — but the transaction holding it rolls back, so logical state is preserved.

Implementation
  Latch: a CPU atomic — atomic_exchange, cmpxchg, often a futex.
  Lock: a shared data structure in RAM — a lock table hashed by (table_id, row_id).

Cost when uncontended
  Latch: ~20 ns for an atomic op.
  Lock: ~1 μs for a lock table lookup, sometimes 10 μs.

Cost when contended
  Latch: microseconds to milliseconds — spin, then block.
  Lock: milliseconds to seconds — the transaction waits for the holder.

The axes that deserve unpacking are the middle three: API, modes, and deadlock treatment.

API. A latch is code you write inside the storage engine. Every function that reads a page starts with page.latch(SHARED) and ends with page.unlatch() — the bracket is tight around the critical section. The B+ tree's own implementer writes these calls; the user, the SQL planner, the application code never see them. A lock is the opposite: the transaction manager acquires a lock on row 17 implicitly, the moment the executor reads or writes row 17, and releases it implicitly at COMMIT. The user's SQL never says LOCK row 17; the lock manager sees the UPDATE and synthesises the lock. (SQL does let you name locks with SELECT ... FOR UPDATE or LOCK TABLE, but these are the transaction manager's API exposed to the user, not the storage engine's latch API.)
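A minimal sketch of the latch bracket, assuming a Python context manager in place of the engine's real latch primitive — Page and latched are invented names, threading.Lock stands in for the CPU atomic, and only exclusive mode is modelled:

```python
import threading
from contextlib import contextmanager

class Page:
    """A buffer-pool page with its latch. threading.Lock stands in for the
    CPU atomic a real engine uses; only exclusive mode is modelled here."""
    def __init__(self, page_id):
        self.page_id = page_id
        self._latch = threading.Lock()

    @contextmanager
    def latched(self):
        # the bracket is tight around the critical section: latch, touch
        # the bytes, unlatch -- microseconds in a real engine
        self._latch.acquire()
        try:
            yield self
        finally:
            self._latch.release()

page = Page(42)
with page.latched():
    data = page.page_id * 2        # read the page while latched
# the latch is already gone here; no transaction manager was involved
assert data == 84
assert not page._latch.locked()
```

The storage engine's implementer writes every one of these brackets by hand; nothing above this layer ever sees them.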

Modes. Latch modes are almost always binary — shared (many readers, no writer) or exclusive (one writer, no readers). That is enough because the critical section is tiny; adding more modes buys no useful compatibility. Lock modes are a zoo, because the critical section is an entire transaction and there are many distinct scenarios to optimise for. S and X are the base. U (update) is "I am going to read then write" — compatible with other readers but conflicts with another U or X; it avoids a two-phase upgrade from S to X that is a deadlock magnet. IS (intent-shared) and IX (intent-exclusive) are hierarchical locks that live on tables and ranges to say "I hold S or X on some row inside here, do not place an X lock on the whole table". The hierarchy — database, table, page, row, with intent locks on the upper levels — lets a single lock table hold locks of very different granularities without enumerating them.
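The compatibility rules can be written down as a small table. This sketch uses a simplified symmetric matrix in the style of Gray and Reuter; real systems publish larger, subtler ones (SIX, schema locks, asymmetric conversion rules):

```python
# Simplified, symmetric lock compatibility matrix. A pair being in the set
# means a request in one mode can coexist with a holder in the other.
COMPATIBLE = {
    ("IS", "IS"), ("IS", "IX"), ("IS", "S"), ("IS", "U"),
    ("IX", "IX"),
    ("S",  "S"),  ("S",  "U"),
}

def compatible(held, requested):
    return (held, requested) in COMPATIBLE or (requested, held) in COMPATIBLE

assert compatible("S", "S")       # many readers coexist
assert compatible("S", "U")       # an update lock admits readers...
assert not compatible("U", "U")   # ...but only one "about to write" at a time
assert not compatible("IX", "S")  # a table-wide read conflicts with row writes below
assert not compatible("X", "X")   # X conflicts with everything, including itself
```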

Deadlock treatment. This is the sharpest contrast, and the one that drives the design of latch crabbing below.

A lock deadlock is unavoidable in the general case. Two transactions each hold one lock and want the other's. The system has no way to prevent it a priori — the lock acquisitions are driven by the user's SQL, which the database has no control over. So it lets deadlocks happen and detects them: a wait-for graph is maintained, a cycle detector runs periodically (or on-demand when a wait times out), and when a cycle is found, one transaction is chosen as the victim and aborted; its locks are released; the surviving transaction proceeds. Detection costs CPU and the victim loses work, but it is tractable because transaction starts are rare enough (thousands per second, not millions) that the graph is small.
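A wait-for graph detector can be sketched in a few lines, under the simplifying assumption that each transaction waits on at most one other. Real detectors handle multiple outgoing edges and then pick a victim, usually the youngest transaction or the one with the least work to undo:

```python
def find_deadlock(wait_for):
    """Detect a cycle in a wait-for graph {txn: txn_it_waits_on}.
    Returns the transactions on the first cycle found, else None."""
    for start in wait_for:
        seen, t = [], start
        while t in wait_for:
            if t in seen:
                return seen[seen.index(t):]     # the cycle itself
            seen.append(t)
            t = wait_for[t]
    return None

# T1 waits on T2's lock, T2 waits on T1's: the classic two-txn deadlock
assert find_deadlock({"T1": "T2", "T2": "T1"}) == ["T1", "T2"]
# a chain with no cycle: T1 waits on T2, T2 waits on T3, T3 runs freely
assert find_deadlock({"T1": "T2", "T2": "T3"}) is None
```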

A latch deadlock cannot be tolerated at all. Latches are held in the nanosecond-to-microsecond window; any deadlock detector that runs at that scale would cost more than the latches it protects. So latches use lock-ordering — a strict global rule about the order in which any two latches may be acquired, enforced by convention. In a B+ tree the rule is: latches are always acquired top-down, never bottom-up. Any thread that wants to latch the root and then a leaf acquires the root first. Any thread that wants to latch a leaf and then a parent never does. If the rule is followed, a cycle is structurally impossible: there is no pair of latches A and B where one thread holds A and wants B while another holds B and wants A, because top-down means the order (A, B) is always the same across all threads. The rule is cheap (free at runtime) and leaves no room for dynamic detection — it is enforced at code review.

Why does lock-ordering work for latches but not for transaction-level locks? Because the storage engine's code paths are small, well-known, and owned by the database implementer. The set of pages a single storage-engine call touches is bounded and can be arranged in a global order by construction. A transaction, on the other hand, runs arbitrary user SQL — which rows it touches is data-dependent and not known until runtime. There is no way to enforce a global order over "row 17 and row 92" because neither the user nor the database knows ahead of time which one the transaction will see first. So transaction locks get deadlock detection; latches get discipline.

Latch crabbing — the minimum set of latches

Here is the specific algorithm that makes the top-down ordering rule work in a B+ tree under concurrent reads and writes. It is called latch crabbing (or latch coupling), and the name comes from how a crab moves along a branch — one claw always holding on while the other reaches for the next hold.

The problem. A thread is inserting key 20 into a B+ tree. It descends from the root to the leaf. At the leaf, it may need to split. If the split happens, it needs to modify the parent — promoting the median into the parent's key list. If the parent overflows, it needs to modify the grandparent. In the worst case the cascade reaches the root. While this is happening, the thread must hold exclusive latches on every node from the leaf up to the point where the cascade stops, because any concurrent reader that visited those nodes mid-split would see a half-written key list.

The naive solution is: on descent, take an exclusive latch on every node you pass. Hold them all the way to the leaf. If the leaf splits, cascade up; when the cascade stops, release everything. This is correct. It is also catastrophically bad for concurrency: the root is latched exclusively for the entire duration of every insert, so only one insert at a time makes progress. The root becomes the bottleneck for a thousand-thread workload.

The crabbing solution is: release the parent's latch as soon as you know the current node is safe — where safe means "this insert cannot possibly propagate a split up past this node."

Formally, on the descent, a node is safe for an insert if it has at least one free slot (so even if every descendant splits, the new key fits here without overflowing). A node is safe for a delete if it has more than the minimum number of keys (so even if a descendant underflows and merges, this node will still be above the minimum). If the current node is safe, you can release every latch you are holding on its ancestors — because whatever happens below cannot travel up past a safe node. If it is not safe, you must hold the ancestor's latch, because a cascade might reach it.

[Figure: latch crabbing on a B+ tree insert — three snapshots of a descent through root, L2, L3, leaf. Snapshot 1: only the root is latched; it holds 3 of 3 keys, so it is unsafe — keep its latch. Snapshot 2: descended to L2, which has 1 of 3 keys — safe — so the root's latch is released. Snapshot 3: descended to L3 (2 of 3 keys, also safe) — L2's latch is released, leaving only L3 held before the leaf is latched X. At any instant the thread holds at most two latches in the common case, rising only when a cascade is genuinely needed.]
Latch crabbing on an insert. As the thread descends, at each node it checks whether the current node is "safe" — will any split/merge below it stop at this node without propagating further up? If yes, every latch above this node is released. In the common case (no overflow in the ancestors), the thread holds at most two latches at once — parent and current — hence "crabbing": always one claw holding, one claw reaching. The rule scales: when the leaf is finally latched in exclusive mode for the actual modification, every ancestor that could not possibly be affected by the insert is already free for other threads.

The algorithm, in pseudocode:

# Crabbing descent for an insert. Latches acquired in EXCLUSIVE mode since we are modifying.

def insert_crabbing(tree, key, value):
    held = []                          # stack of latched nodes, root at bottom
    node = tree.root
    latch_X(node); held.append(node)

    while not node.is_leaf:
        # pick child that covers key
        i = route(node, key)
        child = node.children[i]
        latch_X(child)

        if is_safe_for_insert(child):
            # child will absorb any split from below; ancestors are no longer at risk
            for a in held:
                unlatch(a)
            held = [child]
        else:
            held.append(child)

        node = child

    # node is the leaf; it is latched X (added to `held` on its iteration).
    # Modify the leaf; if it splits, the cascade walks back up `held` modifying
    # each ancestor under its already-held X latch, and stops at the first safe one.
    do_insert_and_cascade(held, key, value)
    for a in held:
        unlatch(a)

def is_safe_for_insert(node):
    # safe = even after a worst-case insert, this node will not overflow
    return len(node.keys) < node.max_keys

Read it carefully. The held stack never gets longer than the number of consecutive unsafe ancestors. In the common case where the leaf's parent has at least one free slot, held shrinks to a single element after the first descent step that finds a safe child — and from then on the thread holds at most two latches at any instant: the current node, plus the child it has just latched and is about to test for safety. A concurrent thread inserting into a different subtree hits no contention whatsoever.

Reads get the simpler version, because reads never modify the tree: a reader latches the root shared, descends to the child, latches the child shared, releases the root, and so on. The held stack has at most two elements — parent and current — at any instant, no matter what. For reads, crabbing is simply "hold your grip until you have latched the next rung". This is why concurrent reads on a B+ tree are essentially free: every reader holds two shared latches at a time, on different nodes, non-conflicting.
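The read-only descent can be sketched directly. Node and its helpers are toy stand-ins, with a latch counter in place of a real shared latch so the coupling invariant is easy to check:

```python
class Node:
    """Toy B+ tree node, instrumented with a latch counter for illustration."""
    def __init__(self, keys, children=None, values=None):
        self.keys, self.children, self.values = keys, children, values
        self.is_leaf = children is None
        self.latches = 0                 # count of S latches currently held

    def latch_S(self): self.latches += 1
    def unlatch(self): self.latches -= 1
    def route(self, key):
        # index of the child whose key range covers `key`
        return sum(1 for k in self.keys if key >= k)
    def get(self, key):
        return self.values[self.keys.index(key)]

def lookup_coupling(root, key):
    """Read-only latch coupling: at most two shared latches held at once.
    Latch the child first, then release the parent -- never the reverse."""
    node = root
    node.latch_S()
    while not node.is_leaf:
        child = node.children[node.route(key)]
        child.latch_S()      # grab the next rung...
        node.unlatch()       # ...then let go of the previous one
        node = child
    value = node.get(key)
    node.unlatch()
    return value

leaf_lo = Node(keys=[3, 5], values=["a", "b"])
leaf_hi = Node(keys=[7, 9], values=["c", "d"])
root = Node(keys=[7], children=[leaf_lo, leaf_hi])
assert lookup_coupling(root, 9) == "d"
assert root.latches == 0 and leaf_hi.latches == 0   # everything released
```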

The B-link variant. Even crabbing has a worst case: a write that cascades to the root holds latches on every level. The B-link tree of Lehman and Yao (1981) eliminates this by adding a right-sibling pointer to every level and a high key on every node. A reader that arrives at a node whose high key is now too low (because a concurrent split has moved its target to the right sibling) simply follows the right-sibling pointer — without needing to hold the parent's latch to confirm. This lets writers split without latching the parent up front; the parent gets updated later, lazily. Postgres's nbtree is an explicit Lehman-Yao implementation, and the same move-right idea recurs across serious production B+ trees; the trick that makes it work is exactly this: a latch-free way to recover when the shape changes under you.
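The move-right step can be sketched with a toy node. LinkNode and blink_find are illustrative names, and only the reader's recovery path is shown — the writer's lazy parent update is omitted:

```python
class LinkNode:
    """Toy B-link leaf: high_key bounds the keys stored here; right points
    at the sibling holding anything greater (None at the rightmost node)."""
    def __init__(self, keys, high_key, right=None):
        self.keys, self.high_key, self.right = keys, high_key, right

def blink_find(node, key):
    # Lehman/Yao move-right: if a concurrent split relocated our key range,
    # the high key tells us so, and the sibling pointer recovers -- no
    # parent latch needed to re-descend.
    while node.high_key is not None and key >= node.high_key:
        node = node.right
    return key in node.keys

# before the reader arrives, a concurrent split moved 20 and 21 rightward
new_right = LinkNode(keys=[20, 21], high_key=None)
stale = LinkNode(keys=[17, 19], high_key=20, right=new_right)
assert blink_find(stale, 21)      # reader recovers by moving right
assert blink_find(stale, 17)      # keys still here are found directly
```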

Two threads, same tree, no deadlock

Thread A is inserting key 20. Thread B is reading key 9. The tree looks like this:

                         [17]
                /                   \
           [7 | 11]              [23 | 31]
          /    |    \           /    |     \
   [3,5] [7,9,11] [13,15] [17,19,21] [23,25,29] [31,37,41]

A acquires the root latch in X (exclusive) mode and descends to the right internal node [23, 31], latching it X. [23, 31] has 2 keys of max 3 — safe, since it can absorb a promotion from below — so A releases the root's latch. B arrives wanting to read — it finds the root unlatched and grabs it in S mode. No contention: A has already moved on.

A continues to the leaf [17, 19, 21], which has 3 keys of max 3 — full. Latch it X. [23, 31] must be held because the leaf is not safe (it will split). A now holds two latches: the internal node [23, 31] and the leaf [17, 19, 21].

B, meanwhile, has descended from root (S) to [7, 11] (S, then released root), and on to the leaf [7, 9, 11] (S, then released [7, 11]). B reads key 9, returns it, releases the leaf.

A splits the leaf: [17, 19] and [20, 21]. The median 20 is promoted into [23, 31], which becomes [20, 23, 31] — 3 keys, now exactly full. It did not overflow, because it had one free slot at the moment A latched it (with keys [23, 31] it was safe from A's point of view, even though the leaf was not). A releases both latches. Total time A held exclusive latches across the tree: a few hundred microseconds.

B's read and A's write completed concurrently, on disjoint subtrees, holding at most two latches each. No deadlock could occur because all latches were acquired top-down: even if A and B had wanted the same nodes, one of them would have waited on the X latch and the other would have proceeded; the graph has no cycles by construction.

Now contrast this with the transaction lock picture. If A's insert was part of a transaction that also updated row 9, and B's read was part of a transaction that did SELECT FOR UPDATE on row 20, you could have a classic deadlock — A holds the row lock on 9 and waits for 20; B holds 20 and waits for 9. The lock manager's deadlock detector would find the cycle and abort one of the two transactions. Latches: deadlock-free by discipline. Locks: deadlock-detected at runtime. Two mechanisms, two strategies, both necessary.

Common confusions

Three traps worth naming before moving on. First, "lock-free" in the research literature almost always means latch-free: a lock-free index takes no latches, but the transactions using it may still take transaction-level locks. Second, SELECT ... FOR UPDATE and LOCK TABLE belong to the lock manager, not the latch layer — they are the transaction manager's API surfaced to SQL. Third, Postgres's LWLock is a latch despite the name; its heavyweight locks are what this chapter means by lock.

Going deeper

If the distinction is clear — latches guard physical pages for microseconds, locks guard logical rows for a transaction — here are the advanced topics the rest of the field has written papers about.

Lock-free B-tree variants (the Bw-tree)

Microsoft Research's Bw-tree (Levandoski, Lomet, Sengupta, 2013) removes page-level latches entirely. The core idea: a page is a chain of delta records ending in a base page, and modifications are applied by prepending a new delta to the chain with a single atomic compare-and-swap. The page table maps page ids to the physical address of the head of the chain; updating a page is a single CAS of the head pointer. Readers walk the chain applying deltas on the fly. Periodically a background thread consolidates a chain into a new base page.

The result is a latch-free B-tree where insert, delete, and lookup use only atomic operations — no mutexes, no spin locks, no latch contention at all. The tradeoff is a more complex memory reclamation scheme (since old deltas and pages cannot be freed until no reader is still walking them — epoch-based reclamation does the job) and a harder-to-analyse performance profile under heavy write contention. The Bw-tree ships in Microsoft Hekaton (SQL Server's in-memory OLTP engine) and in a few research systems; it has not displaced latch-based B-trees in general-purpose databases, partly because the implementation complexity is substantial and partly because latch-based B-link trees already scale to hundreds of cores on modern hardware.
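The CAS-on-the-head-pointer idea can be sketched with a toy mapping table. MappingTable and its cas method are invented for illustration — cas simulates the hardware instruction single-threaded, and consolidation and epoch reclamation are omitted:

```python
class MappingTable:
    """Toy Bw-tree mapping table: page id -> head of a delta chain."""
    def __init__(self):
        self.slots = {}

    def cas(self, pid, expected, new):
        # simulated compare-and-swap on the head pointer; a real engine
        # uses one hardware CAS here, and a losing caller retries
        if self.slots.get(pid) is expected:
            self.slots[pid] = new
            return True
        return False

def read_page(table, pid):
    # readers fold the delta chain into a logical view of the page
    deltas, node = [], table.slots.get(pid)
    while node is not None:
        deltas.append(node)
        node = node.get("next")
    state = {}
    for d in reversed(deltas):          # apply oldest -> newest
        state.update(d.get("entries", {}))
    return state

table = MappingTable()
base = {"entries": {17: "old"}, "next": None}
table.cas(0, None, base)
# an insert prepends a delta with one CAS of the head pointer -- no latch
delta = {"entries": {20: "new"}, "next": base}
assert table.cas(0, base, delta)
assert read_page(table, 0) == {17: "old", 20: "new"}
```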

Latch-free indexes more broadly

Beyond the Bw-tree, the concurrent-indexes literature has explored optimistic lock coupling, in which readers take no shared latches at all — they read a per-node version counter, do their read, and restart if the version changed underneath them (the Leis, Haubenschild, and Neumann paper in the references); B-link-style move-right protocols, covered above; and the epoch-based reclamation schemes that any latch-free structure needs in order to free memory while readers may still be traversing it.

PostgreSQL's SLRU buffers and content locks

Postgres's concurrency layer is instructive to read because it has evolved in the open over thirty years: the buffer manager's per-frame content locks are latches, and the SLRU caches — small page caches for transaction-status data — carry their own bank of them.

Reading the Postgres source for two hours will teach you more about concurrent databases than a semester of lectures. Start with src/backend/storage/lmgr/lwlock.c and src/backend/storage/buffer/bufmgr.c.

Hierarchical locks and the intention protocol

When a transaction wants to lock an entire table — for ALTER TABLE, or because the query planner proved that no finer granularity will be correct — you would naively scan every row and take a row lock on each. On a billion-row table this is absurd. The answer is hierarchical locking: the lock manager recognises a hierarchy (database → table → page → row), and a table-level X lock implicitly covers every row below it.

The subtlety: a transaction that wants a row lock inside that table must somehow announce its intention to the table level, so that a later transaction trying to take a table X lock knows to wait. This is the job of intent locks: IX (intent-exclusive) on the parent, then X on the child. Compatibility rules: two IX locks on the same table are compatible (both intentions coexist); an X on the table conflicts with any IX or IS below. The cost is one extra lock acquisition per level of the hierarchy, but the benefit is that coarse-grained locks become possible without quadratic work.
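The protocol can be sketched as: take the intent mode on each ancestor, then the real mode on the resource itself. HierarchicalLocks and its small compatibility set are invented for this illustration and cover only the modes the example needs:

```python
# Minimal compatibility set for this sketch (symmetric pairs).
COMPAT = {("IS", "IS"), ("IS", "IX"), ("IX", "IX"), ("IS", "S"), ("S", "S")}

def compatible(a, b):
    return (a, b) in COMPAT or (b, a) in COMPAT

class HierarchicalLocks:
    def __init__(self):
        self.held = {}                       # resource -> list of modes held

    def try_lock(self, resource, mode):
        if all(compatible(m, mode) for m in self.held.get(resource, [])):
            self.held.setdefault(resource, []).append(mode)
            return True
        return False

    def lock_row(self, table, row, mode):
        intent = "IX" if mode == "X" else "IS"
        # announce the intention at the table, then lock the row itself
        return self.try_lock(table, intent) and self.try_lock((table, row), mode)

mgr = HierarchicalLocks()
assert mgr.lock_row("accounts", 17, "X")     # txn 1: IX on table, X on row 17
assert mgr.lock_row("accounts", 92, "X")     # txn 2: two IX locks coexist
assert not mgr.try_lock("accounts", "X")     # a table-X is blocked by the IXes
```

The last assertion is the payoff: the would-be ALTER TABLE discovers the row-level activity from two lock entries, not a billion.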

All major relational DBMSes implement this — DB2, SQL Server, Oracle, Postgres. The hierarchy is usually three or four levels deep; the compatibility matrix is a 6×6 table that every database textbook reprints.

Where this leads next

You now have a mental model for concurrent access to a B+ tree: latches for the physical structure, locks for the logical data, crabbing for the descent. What happens when a thread is mid-split, holding an X latch on a leaf with the right half already written to its new page but the parent's pointer not yet updated, and the power goes out?

The latch evaporates with the process, leaving the disk with a torn state: a page that is not a valid B+ tree node because it is half-way through a transformation. Every modification that takes more than one disk write — every split, every merge, every delete that cascades — has a window where the on-disk tree is temporarily not a tree at all. The buffer pool can hide this window from concurrent readers using latches, but the disk cannot hide it from a crash.

The answer is the write-ahead log: before a modification touches any page, write a record describing the intended change to a sequential log file, and fsync the log. If the crash happens during the modification, the log tells us how to finish (redo) or undo the operation on recovery. The WAL is the subject of the next chapter, and it is the mechanism by which the on-disk tree becomes as concurrent-safe across crashes as latches made it concurrent-safe across threads.

After that, Build 4 closes with the full WAL-protected B+ tree; Build 5 rebuilds it with ARIES-style recovery; Build 7 puts transactions and isolation on top of both. The latch/lock distinction from this chapter shows up at every layer — the names of the mechanisms change, but the two time scales never do.

References

  1. Philip L. Lehman, S. Bing Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM TODS (1981) — the B-link tree paper; the canonical solution to concurrent B-tree access.
  2. Goetz Graefe, A Survey of B-Tree Locking Techniques, ACM TODS (2010) — a comprehensive survey of latch and lock protocols for B-trees, including crabbing, B-link, and their modern variants.
  3. Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, The Bw-Tree: A B-Tree for New Hardware Platforms, ICDE (2013) — the latch-free B-tree used in SQL Server Hekaton.
  4. Viktor Leis, Michael Haubenschild, Thomas Neumann, Optimistic Lock Coupling: A Scalable and Efficient General-Purpose Synchronization Method, IEEE Data Engineering Bulletin (2019) — the OLC paper behind HyPer and Umbra's concurrency.
  5. Jim Gray, Andreas Reuter, Transaction Processing: Concepts and Techniques (1992), chapters 7–8 — the textbook treatment of latches, locks, and the hierarchical lock protocol.
  6. PostgreSQL source tree, src/backend/storage/lmgr/ and src/backend/storage/buffer/ — a production implementation of everything in this chapter, exhaustively commented.