Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: no discussion is possible without a machine model
At 14:02 on a Wednesday, Riya pushes a one-line "fix" to KapitalKite's order-matching engine: a volatile qualifier on the done flag the watchdog reads after the matcher writes it. The CI suite passes 4,800 deterministic tests; canary on one box looks fine for an hour. At 14:48 she rolls it to the full 64-core EPYC fleet. Within nine minutes the watchdog stops firing — done reads as false long after the matcher has set it to true, on a different core. Riya's mental model said volatile means "the write is visible to other threads". The C++ standard says volatile means almost nothing of the sort. The hardware says even less. Every prior chapter in this build has talked about threads, races, deadlocks, and liveness using a vocabulary that pretends one core is the same as another and that a write becomes visible to the next core the instant it executes. From here on, that pretence breaks.
You cannot argue about a concurrent program without a machine model — a precise statement of which writes become visible to which reads, in which order, and through what mechanism. "Thread-safe", "synchronised", and "atomic" are not properties; they are claims that are only meaningful relative to a memory model. The rest of this curriculum is the machine showing up: store buffers, cache coherence, instruction reordering, the C++ / Rust / JMM specs that bound what hardware is allowed to do. Every concurrency bug you have ever written was, ultimately, you reasoning under the wrong model.
The four chapters before this one used a model they never named
Chapters 1–4 talked about concurrency vs parallelism vs async, why concurrency is hard with shared state, safety and liveness, and deadlock / livelock / starvation / priority inversion. They drew interleaving diagrams in which T1 wrote x = 1 and T2 read x and saw 1. That diagram embeds a strong assumption: every interleaving of individual reads and writes is allowed, and the value a read sees is the value of the most recent write in the chosen interleaving. This model has a name — sequential consistency — and Lamport published it in 1979. It is what the diagrams in chapters 1–4 silently assumed.
Real hardware has not implemented sequential consistency since approximately 1992. x86 implements Total Store Order (TSO): each core has a private store buffer between its execution unit and the L1 cache, so a store can sit in the buffer for tens of cycles before any other core can see it, and your own subsequent loads can see the store before anyone else does. ARM and POWER implement weak ordering — store-store, load-load, load-store, and store-load reorderings are all permitted unless you insert explicit barriers. The Java Memory Model (JMM, JSR-133, 2004), the C++11 memory model, and the Rust memory model (which inherits C++'s) are the language-level contracts that bound what the underlying hardware is allowed to do for code written in those languages.
Why this matters more than it looks: every "thread-safe" claim, every code review comment, every "I added a lock" assumes a model. If your reader is reasoning under sequential consistency and your hardware is running TSO, your code can be correct under SC and broken under TSO. The Dekker / Peterson mutual-exclusion algorithms are the canonical example — both are correct under SC and both are broken on x86 without an explicit mfence between the store and the load. Lamport's original paper that defined SC was a warning that real hardware would not provide it. The warning was correct.
The watchdog flag — Riya's volatile did not do what she thought
Reproduce Riya's bug in a dozen lines of C++ plus a small test harness. Two threads, one shared bool done, no atomics, no mutexes. The writer sets done = true after computing a value; the reader spins until done is true and then reads the value. With volatile bool done, the program "looks" correct — but on aarch64 (Apple M1, AWS Graviton, almost any phone) the reader can see done == true before it sees the value the writer wrote. The C++ standard does not guarantee any ordering between two non-atomic stores, even with volatile.
```cpp
// volatile_is_not_atomic.cpp
// build: g++ -O2 -std=c++17 -pthread volatile_is_not_atomic.cpp -o vol
// run:   ./vol   (run on aarch64 to see the bug; x86 hides it under TSO)
#include <atomic>
#include <cstdio>
#include <cstdint>
#include <thread>

volatile bool done = false; // Riya's "fix"
volatile int payload = 0;
std::atomic<uint64_t> seen_zero{0};
std::atomic<uint64_t> trials{0};

int main() {
    for (int run = 0; run < 200000; ++run) {
        done = false; payload = 0;
        std::thread writer([]{ payload = 42; done = true; });
        std::thread reader([]{
            while (!done) { /* spin */ }
            if (payload == 0) seen_zero.fetch_add(1, std::memory_order_relaxed);
        });
        writer.join(); reader.join();
        trials.fetch_add(1, std::memory_order_relaxed);
    }
    std::printf("trials=%llu, reader_saw_done_but_payload_zero=%llu\n",
                (unsigned long long)trials.load(), (unsigned long long)seen_zero.load());
}
```
Sample run on a Graviton3 (aarch64) box:

```
trials=200000, reader_saw_done_but_payload_zero=1247
```
On x86 the same binary will almost always print seen_zero=0 — TSO's per-core FIFO store buffer means stores from a single thread reach other threads in program order. On ARM it does not, so done = true can become globally visible before payload = 42 does, and the reader observes the impossible-looking state where done is set but payload is still its initial zero.

Why volatile does not save Riya: the C++ standard says volatile prevents the compiler from optimising the access away (good — her spinloop will not be turned into while(1)). It says nothing about cache coherence, store-buffer drain, or inter-thread visibility ordering. The Linux kernel's WRITE_ONCE / READ_ONCE macros use volatile for that one purpose only and pair every cross-thread synchronisation with explicit barriers.

Java's volatile is completely different — JSR-133 redefined it in 2004 to provide release/acquire semantics, which is what Riya's intuition was actually borrowing from. The same word in two languages means two different things. Without naming the model, you cannot tell which intuition is right.
The fix is std::atomic<bool> done with std::memory_order_release on the store and std::memory_order_acquire on the load. That pair establishes a synchronizes-with edge in the C++ memory model: every write sequenced before the release-store is visible to every read sequenced after the acquire-load. The cost on aarch64 is one dmb ish instruction per release / acquire (~12 ns); on x86 the release is free and the acquire is one bypass-the-store-buffer load (~1 ns). The bug-rate goes to zero because the model now says it goes to zero — not because the hardware feels like behaving today.
The vocabulary you have been using is shorthand for a model
Recast the prior chapters' vocabulary in machine-model terms. Each phrase you have been reading was implicitly a claim about which hardware behaviours are allowed.
| Phrase from chs 1–4 | What it actually means once you name the model |
|---|---|
| "T1 writes x = 1, then T2 reads x and sees 1" | Under SC, after a global linearisation point. Under TSO, after T1's store buffer drains. Under ARM, after a release/acquire pair establishes a synchronizes-with edge. |
| "atomic increment" | An indivisible read-modify-write with a defined memory ordering parameter. std::atomic<int>::fetch_add(1, std::memory_order_relaxed) is atomic but contributes nothing to inter-variable ordering. std::memory_order_seq_cst is atomic and participates in a global total order. |
| "thread-safe data structure" | One of: race-free (no two operations on the same memory without HB ordering), atomic per operation, linearizable (every operation appears to take effect at a single instant between its invocation and response), sequentially consistent, wait-free, lock-free, obstruction-free. State which. |
| "the lock provides mutual exclusion" | The lock acquire is an acquire-fence, the release is a release-fence; together they create an HB edge from one critical section to the next, which is what makes the data inside the critical section visible. The mutual exclusion is half the property; the memory ordering is the other half. |
| "the watchdog detects the matcher is done" | Only if the write to done and the read of done are paired by an HB edge. In the absence of an edge, the watchdog can see the old value forever, even after the matcher has exited. |
This re-grounding is why Part 2 of this curriculum is the machine and Part 3 is memory models: until you have the machine and the model, the words are shorthand for assumptions that cannot be checked.
Happens-before is the definition doing all the work
Without the acquire-release pair, nothing guarantees payload's store is visible. With it, every write sequenced-before the release is visible to every read sequenced-after the acquire — by transitive closure of program order with synchronizes-with. Happens-before is the central definition of every modern memory model because it is the smallest relation that gives compilers and hardware enough freedom to reorder while still letting programmers reason. Sequential consistency is happens-before lifted to a global total order — too strong, costs too much. Per-thread program order alone gives no inter-thread guarantees — too weak, no programs can be written. The acquire-release pair is exactly the construct that builds an HB edge across threads when you need one and lets the compiler reorder freely when you don't.
What the rest of this curriculum is, in one sentence each
The remainder of the 114 chapters is, structurally, the machine model arriving in eight successive layers of detail. Each part is a sharper picture of the same machine.
- Part 2 — the machine underneath. Cores, hyperthreads, store buffers, cache lines, MESI, NUMA. The hardware substrate every line of concurrent code runs on.
- Part 3 — memory models. TSO, weak ordering, the C++ / JMM / Rust specs. The contract that bounds what the hardware is allowed to do.
- Part 4 — atomics from scratch. compare_exchange, fetch_add, the universal primitive Herlihy proved CAS to be in 1991.
- Part 5 — mutexes. Futex + spin + park; uncontended vs contended; priority inheritance.
- Part 6 — read-heavy synchronisation. RWLock, seqlock, RCU; specialise the read path.
- Part 7 — lock-free data structures. Treiber, Michael-Scott, Harris; CAS loops, ABA, linearizability.
- Part 8 — memory reclamation. Hazard pointers, epoch-based, RCU; the "free()" you can't write naively in a lock-free world.
- Parts 9–17 — HTM, STM, async, coroutines, work-stealing, actors, testing, data parallelism, and the frontier.
KapitalKite's 14:48 incident touches exactly the layers in Parts 2 and 3. Riya's volatile fix spoke Part 1's vocabulary but assumed Part 2's machine was simpler than it is and Part 3's contract was stronger than it is. Every line of the corrected code is an IOU on those two parts of this curriculum.
Common confusions
- "volatile makes a variable thread-safe in C/C++" — It does not. volatile prevents the compiler from optimising the access away (so your spinloop is not deleted), but it provides no memory-ordering guarantees, no cache coherence beyond what the hardware already gives, and is not the same as Java's volatile. The Linux kernel uses volatile only inside READ_ONCE / WRITE_ONCE and pairs every cross-thread access with explicit barriers. Use std::atomic with explicit memory orders.
- "Sequential consistency is what the hardware does" — It is not. SC was published by Lamport in 1979 as the idealised model that real hardware should approximate. No mainstream CPU has implemented strict SC since the early 1990s; x86 implements TSO, ARM and POWER implement weakly-ordered variants. The cost of forcing SC on x86 is one mfence per store (~7 ns); on ARM it is one dmb ish per release/acquire (~12 ns).
- "A mutex makes the data inside it thread-safe" — It does two things, not one. The mutex provides mutual exclusion (no two threads inside the critical section simultaneously) and memory ordering (the unlock is a release, the lock is an acquire, together establishing an HB edge between successive critical sections). A mutex with mutual exclusion but no memory ordering would silently expose torn reads on multi-core ARM. The standard requires both, but it is two properties, not one.
- "Atomic operations are slow because of locks" — They are slow because of cache-coherence traffic. An uncontended std::atomic<int>::fetch_add on x86 is lock xadd, which costs ~10 ns when the cache line is local and ~80 ns when it must be shifted from another core's cache (ping-pong). The lock prefix on x86 is not a software lock — it is a cache-coherence assertion telling other cores their copy of this line is invalid. The cost is hardware physics, not OS bookkeeping.
- "If it works on x86 it works everywhere" — False, dangerously. x86's TSO is one of the strongest commodity memory models. Code that relies on accidental TSO guarantees — store-store ordering, store-load ordering through bypass — silently breaks on aarch64 (M1, Graviton, every phone) and POWER. ThreadSanitizer and loom will find these even on x86 by injecting permitted reorderings; running your suite on a Graviton box catches the rest.
- "Thread-safe means linearizable" — Not necessarily. Thread-safe is a vague claim; linearizability is a precise one (Herlihy & Wing 1990): every operation appears to take effect at a single instant between its invocation and response, and that order is consistent with real time. A std::atomic<int> with seq_cst is linearizable. A std::atomic<int> with relaxed is atomic per-operation but not linearizable across operations. tbb::concurrent_queue is linearizable. boost::lockfree::queue is linearizable. A mutex-protected std::queue with relaxed atomics around a counter is not linearizable in general. State which property you need.
Going deeper
Lamport's 1979 SC paper as the warning that hardware would betray it
Lamport's "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (IEEE TC 1979) defined SC and proved that it is what programmers naturally reason about. The paper's purpose was to warn that real multiprocessors of that era were not implementing SC and that programs written under SC assumptions would fail on them. The warning was correct; SC turned out to be too expensive to implement in hardware once cores got out-of-order and added private store buffers. SPARC's TSO (1991) and Alpha's weakly ordered model (1992) were the first explicit, documented relaxations. x86 documented TSO formally only in 2008 — twelve years after deployed CPUs were actually doing it. Why this history matters: the language-level memory models (Java JSR-133, C++11) had to be designed against hardware that already existed, not against an ideal. The result is that the language models expose the hardware's reordering through acquire / release / relaxed and force the programmer to opt into the cost. The alternative — define the language as SC and emit a fence at every load and store — was measured at 30–40% slowdown on lock-free code in early JVM prototypes, and was abandoned.
The cost matrix — what each model costs in nanoseconds
A handful of micro-benchmarks calibrate the cost of "naming the model". On a 3.6 GHz x86 (Skylake, libstdc++ 13): uncontended std::atomic<int>::fetch_add(1, relaxed) is 5 ns; seq_cst adds an mfence-strength barrier, paid only on the store→load path, for ~8 ns net; a contended fetch_add over 8 cores ping-ponging on the same line is 80 ns; uncontended std::mutex lock/unlock is 18 ns; contended over 8 cores is 1.4 µs. On an Apple M1: uncontended relaxed fetch_add is 4 ns; seq_cst is 14 ns (the dmb ish is more expensive than x86's mfence); contended over the four P-cores is 110 ns. Rust's parking_lot::Mutex lock is 6 ns uncontended on x86 — three times faster than std::mutex — because it skips the futex syscall path in the fast case. These numbers are the cost of correctness; choosing the wrong model wastes them, choosing the wrong primitive wastes more.
Why Java got volatile right and C/C++ inherited the wrong word
JSR-133 (2004) redefined Java's volatile to provide release-acquire semantics: a volatile write is a release, a volatile read is an acquire, and the two pair to give the synchronizes-with edge. This was a deliberate fix to JLS pre-2004, where volatile was — like C's — a compiler-only directive. Doug Lea drove the redesign so that java.util.concurrent could be implemented portably across IBM POWER, ARM, and x86 JVMs without each implementer reinventing the memory ordering. C and C++ inherited volatile from K&R (1978) where it was for memory-mapped I/O on a single-core PDP, and the meaning was never expanded to cover threading because, in 1978, there was no multi-threading to cover. The 2011 standards added std::atomic rather than redefine volatile, leaving the legacy keyword in place with its old, narrow meaning. The result is that the same word "volatile" carries one meaning in Java and another in C++. This is a real source of bugs across language ports — Riya's intuition was the Java one, the compiler followed the C++ one, the hardware did what the hardware does.
Memory model verifiers — loom, cdschecker, herd7
You cannot test your way to memory-model correctness; the bug-once-in-a-million is too rare. Three verifiers actually search the model's permitted reorderings. loom (Rust, Tokio team) treats every atomic operation as a non-deterministic choice and exhaustively explores the schedules a permitted reordering allows; it found real bugs in crossbeam and parking_lot between 2018–2021. cdschecker (Norris & Demsky, OOPSLA 2013) does the same for C++11 atomics and is the standard for proving lock-free algorithms correct against the standard model. herd7 (Maranget, Sarkar, Sewell) takes a litmus test in their cat language and tells you which outcomes a given hardware model permits — it is how the ARM and POWER memory model papers were written. Using these tools is the only way the next chapters can claim correctness without hand-waving; "it works on my x86" is not a proof.
Where this leads next
Part 2 (chapters 6–12) opens the machine: threads as a kernel concept, user-space schedulers, cores vs hyperthreads, cache coherence, store buffers, NUMA, and the wall those last two impose at 64+ cores. Every one of those chapters cashes in the IOU written by this wall. Part 3 (chapters 13–18) puts the language-level memory model on top: TSO vs SC, acquire-release, sequenced-before vs synchronizes-with vs happens-before, and the exact rule that turns Riya's bug from a mystery into a one-line spec violation.
If you want to read ahead in the wild before the next chapters land: Hans-J. Boehm's "Threads Cannot Be Implemented As a Library" (PLDI 2005) is the paper that forced C++ to grow its own memory model, and Doug Lea's "JSR-133 Cookbook" is the engineer's-eye view of what fences live where in a Java implementation.
References
- How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs — Lamport (IEEE TC 1979) — the paper that defined sequential consistency and warned hardware would not provide it.
- Threads Cannot Be Implemented As a Library — Boehm (PLDI 2005) — the paper that proved C/C++ could not have threading without a memory model in the language spec.
- The Java Memory Model — Manson, Pugh, Adve (POPL 2005) / JSR-133 — Java's memory model, the first widely-deployed language-level model.
- A Tutorial Introduction to the ARM and POWER Relaxed Memory Models — Maranget, Sarkar, Sewell (2012) — the formal model behind every aarch64 phone.
- Shared Memory Consistency Models: A Tutorial — Adve & Gharachorloo (IEEE Computer 1996) — the textbook reference for memory-model taxonomy.
- JSR-133 Cookbook — Doug Lea — the engineer's-eye view of what fences a JVM must emit on each platform.
- Internal — why concurrency is hard with shared state — the prior chapter that this wall closes the loop on.
- Internal — safety vs liveness properties — the framework whose claims this chapter forces you to re-state in machine-model terms.
```sh
# Reproduce on your laptop (Linux x86_64 OR aarch64; the bug shows on aarch64)
g++ -O2 -std=c++17 -pthread volatile_is_not_atomic.cpp -o vol
./vol
# expected on aarch64: reader_saw_done_but_payload_zero > 0 (the bug)
# expected on x86:     reader_saw_done_but_payload_zero ≈ 0 (TSO hides it)

# Now check what your CPU actually is and what fences it needs:
lscpu | grep -i 'arch\|name'
grep -i 'flags' /proc/cpuinfo | head -1
```