Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Instruction reordering by CPU and compiler
A KapitalKite engineer hands you a 9-line excerpt from the order-matching daemon. Two threads, one shared bool ready, one shared Order order. The producer thread fills order, then sets ready = true. The consumer thread loops on while (!ready) {}, then reads order. The code has shipped in production for nine months. On a Friday afternoon the team upgrades GCC from 11.4 to 13.2, recompiles with -O3, and the consumer starts reading order fields that are still zero. Nothing changed in the source. Nothing changed in the CPU. The compiler is the only variable. This is the moment every concurrency engineer eventually meets: program order is a fiction maintained at two layers, and you only own one of them.
There are two independent reorderings between your source code and what the CPU actually does: the compiler reorders statically (alias analysis, loop hoisting, register allocation, dead-store elimination), and the CPU reorders dynamically (out-of-order issue, store buffer, load speculation). Both preserve single-threaded semantics. Neither preserves cross-thread semantics. A correct concurrent program needs to constrain both: volatile constrains only the compiler half, a bare hardware fence instruction constrains only the CPU half, and std::atomic (or the equivalent in your language), together with the locks built on top of it, is what constrains both at once.
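A minimal sketch of a handshake that constrains both layers, written in C11. The two-field Order struct and the function name run_handshake are assumptions made for illustration; they are not taken from the KapitalKite excerpt.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int id; int qty; } Order;   /* hypothetical payload */

static Order order;        /* plain data, published through `ready` */
static atomic_bool ready;  /* the flag that both layers must respect */

static void *producer(void *arg) {
    (void)arg;
    order.id = 7;
    order.qty = 100;
    /* release: the writes above may not be moved below this store */
    atomic_store_explicit(&ready, true, memory_order_release);
    return NULL;
}

/* Returns the qty the consumer observes once it has seen `ready`. */
int run_handshake(void) {
    pthread_t t;
    order.id = 0;
    order.qty = 0;
    atomic_store_explicit(&ready, false, memory_order_relaxed);
    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    /* acquire: the read of order.qty below may not be moved above this load */
    while (!atomic_load_explicit(&ready, memory_order_acquire)) { }
    int qty = order.qty;
    pthread_join(t, NULL);
    return qty;
}
```

Called from a main function, run_handshake should return 100 on any conforming implementation, because the acquire load that sees true synchronises with the release store that wrote it.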
Two layers, two reorderings, one bug
The program you write is not the program the compiler emits. The program the compiler emits is not the program the CPU executes. There are two independent rewriting steps, and they each obey one rule and one rule only: single-threaded observable behaviour must be preserved. As-if-serial. Anything else (reorderings, eliminations, fusions, register caching) is fair game. The C++ standard calls this the "as-if rule" (§1.9 in C++11). The Java Language Specification calls it "intra-thread semantics" (JLS §17.4). Hardware architects put it as: program order is the appearance of program order, observed by the issuing thread alone.
Why both reorderings are necessary, not bugs: a modern compiler producing ordered, naively-emitted code on a Zen 4 core would run at perhaps 8% of the chip's peak performance. Loop-invariant hoisting, common-subexpression elimination, and register allocation routinely give 3–10× speedups; out-of-order issue, register renaming, and store buffering give another 4–8×. The combined uplift is the difference between a CPU that holds Moore's-Law promises and a CPU that does not. Concurrency correctness is paid for on top of this stack, not by removing it.
What the compiler does, and what stops it
Compiler reordering is static: it happens at build time, has no notion of other threads, and sees only the dataflow within one translation unit (the whole program under LTO). At -O2, GCC and Clang routinely apply: dead-store elimination (a store is removed when its value is overwritten before anything could observe it), load hoisting (a repeated read of memory the compiler believes no one else writes is moved out of a loop into a register), store sinking (a store is delayed past unrelated work), CSE (a value computed twice from the same memory is computed once), inlining followed by rescheduling (after inlining, the new, larger basic block is reordered as one), and branch layout (blocks marked likely or unlikely are moved to favour the hot path). Every one of these is legal under as-if-serial. Every one of these can break a multithreaded handshake.
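To make load hoisting concrete, here is a hand-applied version of the transformation, as a sketch. The function names are invented for illustration, and the second function only imitates what an optimiser may do; it is not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* As written: *limit is loaded again on every iteration. */
int count_below_as_written(const int *a, size_t n, const int *limit) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] < *limit)
            count++;
    return count;
}

/* As an optimiser may emit it: one load, kept in a local ("register"). */
int count_below_as_hoisted(const int *a, size_t n, const int *limit) {
    int count = 0;
    if (n == 0)
        return 0;               /* do not load *limit if the loop never would */
    const int cached = *limit;  /* the hoisted load */
    for (size_t i = 0; i < n; i++)
        if (a[i] < cached)
            count++;
    return count;
}

/* Single-threaded check that both forms agree, which is all as-if-serial asks. */
void check_same(const int *a, size_t n, int limit) {
    assert(count_below_as_written(a, n, &limit) ==
           count_below_as_hoisted(a, n, &limit));
}
```

A single thread cannot tell the two functions apart. If another thread changed *limit halfway through the loop, the first form might notice and the second never would, which is the spinning-flag bug in miniature.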
The notorious example is the spinning-flag bug:
// flag_hoist.c: what GCC -O2 does to a naive spin-wait.
// Build: gcc -O2 -pthread flag_hoist.c -S -o flag_hoist.s   (look at the asm)
//        gcc -O2 -pthread flag_hoist.c -o flag_hoist
// Run:   ./flag_hoist   (will hang on most machines)
//
// Compare with the _Atomic version (acquire load) which compiles to a
// real memory load every iteration.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static int ready_plain;          // plain int: compiler hoists it
static _Atomic int ready_atomic; // atomic: compiler must reload
static int payload;

static void* producer(void* _) {
    sleep(1);
    payload = 42;
    ready_plain = 1;
    atomic_store_explicit(&ready_atomic, 1, memory_order_release);
    return NULL;
}

int main(int argc, char** argv) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    if (argc > 1 && argv[1][0] == 'a') {
        while (!atomic_load_explicit(&ready_atomic, memory_order_acquire)) { }
        printf("atomic: payload=%d\n", payload);
    } else {
        while (!ready_plain) { } // hoisted to a register at -O2: hangs
        printf("plain : payload=%d\n", payload);
    }
    pthread_join(t, NULL);
    return 0;
}
For the plain branch, gcc -O2 emits roughly mov eax, ready_plain; .L: test eax, eax; je .L. The load happens once, before the loop, and the loop then spins on the cached register value forever. The compiler is allowed to do this because nothing in the loop writes ready_plain, and a concurrent write to a non-atomic object from another thread would be a data race, which the C abstract machine lets the optimiser assume never happens. The atomic branch compiles to .L: mov eax, ready_atomic; test eax, eax; je .L, a real load on every iteration, because another thread may legally store to an _Atomic object and the compiler therefore cannot treat its value as loop-invariant. Sample output, after the producer wakes:
$ ./flag_hoist a
atomic: payload=42
$ timeout 3 ./flag_hoist
$ echo "exit=$?"
exit=124 # timed out — the plain version hangs
The atomic version completes; the plain version hangs forever.
Per-line walk-through of what the compiler did. The producer's payload = 42; ready_plain = 1; pair compiles to two ordinary mov stores in source order. Nothing obliges the compiler to keep that order: the two stores are independent, and as-if-serial would allow swapping them. GCC at -O2 simply has no reason to do so here. (Crucially, another thread reading these variables is not visible to the dataflow analysis; without atomics it has no representation in the C abstract machine.)
The consumer's while (!ready_plain) {} is the interesting half. GCC's loop-invariant code motion (LICM) pass observes that ready_plain is never written inside the loop body and is not qualified volatile or _Atomic. LICM hoists the load above the loop and replaces the body with test eax, eax; je .L, a register-only spin. The compiler is not "wrong"; it is producing a legal output under as-if-serial within this thread. The presence of another thread is not visible to the optimiser unless you tell it through _Atomic (or, crudely, volatile). Tell the compiler, and it tells the CPU; don't tell the compiler, and the CPU never gets the chance.
Why volatile int ready would have unhung this loop but is still wrong: volatile tells the compiler "treat every access as observable; do not elide reloads, do not cache in a register, do not reorder with respect to other volatiles". That is enough to fix this specific bug, but volatile says nothing about reordering relative to non-volatile memory operations, emits no fences, and does not constrain the CPU at all. It fixes layer 1 (compiler) and ignores layer 2 (CPU). It is a 1990s tool for memory-mapped I/O, not for concurrency. The Linux kernel's 2007 guidance document ("Why the 'volatile' type class should not be used") is about precisely this confusion: kernel code that used volatile for cross-thread visibility was correct on x86 by accident and broken on Alpha, ARM, and POWER.
What the CPU does, on top of what the compiler did
Even after the compiler emits a fixed instruction sequence, the CPU rewrites it at runtime. A modern out-of-order core has a scheduler window of 200–500 in-flight instructions; it issues them as their operands become ready, not in program order. The store buffer (previous chapter) lets stores retire before they are coherently visible. Load buffers and load-load speculation let later loads issue ahead of earlier ones, with a re-execute on snoop-hit. Branch prediction lets the front-end speculate down the predicted arm before the branch has resolved. Register renaming removes false dependencies. Memory disambiguation lets a load issue past an older store to a different address even when the address arithmetic isn't yet resolved.
The architecture's memory consistency model (TSO on x86, weak on ARM and POWER, weaker still on the now-dead Alpha 21264, which needed an explicit mb even between dependent loads) defines which of these reorderings are visible to other cores. x86 TSO permits exactly one (store→load, as the previous chapter showed). ARMv8 permits all four (store→store, store→load, load→load, load→store) between distinct addresses, and lets a load be satisfied speculatively before an earlier branch resolves. The CPU's fence instructions (mfence, dmb sy, lwsync, sync) and ordered accesses such as ARMv8's ldar and stlr are the only knobs the programmer has on this layer.
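The store→load case can be probed with a store-buffering litmus test. The sketch below is a simplified harness with invented names (sb_both_zero, sb_x, sb_y); it creates fresh threads on every iteration, which keeps the code short but makes the two threads overlap far less often than a real litmus harness with long-lived threads would.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int sb_x, sb_y;
static int r0, r1;      /* each written by one thread, read only after join */
static int use_fence;   /* set before the threads are created */

static void *t0(void *arg) {
    (void)arg;
    atomic_store_explicit(&sb_x, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);  /* forbids store->load */
    r0 = atomic_load_explicit(&sb_y, memory_order_relaxed);
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&sb_y, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&sb_x, memory_order_relaxed);
    return NULL;
}

/* Counts iterations in which both threads read 0 from the other's variable. */
long sb_both_zero(long iters, int fence) {
    long both_zero = 0;
    use_fence = fence;
    for (long i = 0; i < iters; i++) {
        pthread_t a, b;
        atomic_store(&sb_x, 0);
        atomic_store(&sb_y, 0);
        int rc = pthread_create(&a, NULL, t0, NULL);
        assert(rc == 0);
        rc = pthread_create(&b, NULL, t1, NULL);
        assert(rc == 0);
        (void)rc;
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            both_zero++;
    }
    return both_zero;
}
```

With fence set to 1, the C11 rules for seq_cst fences rule out the both-zero outcome, so the count must be 0 on every architecture. With fence set to 0 the outcome is permitted even on x86; how often it shows up depends on the machine and on how closely the two threads overlap.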
Why std::atomic is not just volatile plus a fence: the C++11 atomics specify ordering on both layers simultaneously. memory_order_acquire on a load means: (1) the compiler must not move any later read or write up across this load (compiler half), and (2) the compiler must emit machine code under which no later memory operation can be observed before the load, which needs a dedicated instruction on some architectures and nothing extra on others (CPU half). The compiler chooses the right machine code for each architecture: on x86, an acquire load is a plain mov because TSO already forbids load→load and load→store reordering; on ARMv8, it is ldar (load-acquire). Same C++ source, different machine code, same observable cross-thread behaviour. The atomic abstraction unifies the two reordering layers into one mental model, which is Boehm and Adve's foundational result.
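The same two-layer contract can also be expressed with standalone fences. The following sketch (the names fence_publish and publisher are made up for illustration) uses relaxed accesses to the flag and lets atomic_thread_fence carry the ordering.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;      /* plain data */
static atomic_int flag;  /* touched only with relaxed operations */

static void *publisher(void *arg) {
    (void)arg;
    payload = 42;
    /* Release fence: the payload write is ordered before the flag store,
       for the compiler and for the CPU. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

/* Returns the payload value seen after the flag was observed. */
int fence_publish(void) {
    pthread_t t;
    payload = 0;
    atomic_store_explicit(&flag, 0, memory_order_relaxed);
    int rc = pthread_create(&t, NULL, publisher, NULL);
    assert(rc == 0);
    (void)rc;
    while (!atomic_load_explicit(&flag, memory_order_relaxed)) { }
    /* Acquire fence: the payload read is ordered after the flag load. */
    atomic_thread_fence(memory_order_acquire);
    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

C11 also provides atomic_signal_fence, which constrains only the compiler and emits no instruction. Swapping it in for atomic_thread_fence here would restore exactly the one-layer guarantee this chapter warns about.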
Diagnosing reordering bugs in production
Reordering bugs are silent. The program does not crash; it returns an inconsistent value. The race shows up under load, on specific CPUs, after a compiler upgrade, or after enabling LTO. There is no stack trace pointing to the root cause — the symptom is a value (a stale read, a torn struct, a missed wakeup), and the cause is metres of distance away in the binary, in time, or in the compiler pass that ran nine months ago when the binary was built.
Four diagnostic moves from the field:
- Compare -O0 vs -O2. If the bug appears at -O2 and disappears at -O0, you most likely have a compiler-reordering bug. Audit for missing _Atomic / std::atomic, missing volatile on MMIO, and missing asm volatile("" ::: "memory") compiler barriers in hand-written sync code.
- Run the same program on x86 and on ARM. Code that passes 10⁹ iterations of stress-testing on Intel and fails at 10⁵ iterations on Graviton has a CPU-reordering bug invisible under TSO. The fix is almost always a missed memory_order_acquire / memory_order_release annotation that on x86 became free.
- Use loom (Rust) or cppmem (C/C++) to model-check the suspect interleaving. If your Rust code uses atomics with Relaxed orderings, loom explores the executions the memory model permits, within its configured bounds, and produces a concrete failing trace; cppmem does the same for litmus-sized C/C++11 fragments. This is faster than reading the C++ standard. KapitalKite's matching team uses loom as a pre-merge gate on all lock-free PRs; the false-positive rate is essentially zero, and it caught three production-grade bugs in the first month.
- Use tsan (ThreadSanitizer) for races at the access level. Compile with -fsanitize=thread, run a load test, read the race reports. tsan operates at the C++ memory-model level: it flags pairs of conflicting accesses (at least one a write, at least one non-atomic) with no happens-before edge between them. It will not catch ordering bugs that involve only atomic accesses, but it catches the "I forgot the atomic" bug, which is by far the more common failure, on the paths the test actually executes. The runtime overhead is 5–15× and the memory overhead 5–10×.
$ # 1. Reproducing the compiler half
$ gcc -O0 -pthread flag_hoist.c -o flag_hoist_O0 && timeout 3 ./flag_hoist_O0
plain : payload=42
$ gcc -O2 -pthread flag_hoist.c -o flag_hoist_O2 && timeout 3 ./flag_hoist_O2
$ echo "exit=$?" # 124 — hung at -O2, OK at -O0
exit=124
$ # 2. Reproducing the CPU half: the same litmus test, built for aarch64, reorders far more often than on x86
$ scp sb_litmus aarch64.host:/tmp && ssh aarch64.host '/tmp/sb_litmus 10000000'
iters=10000000 both_zero=487263 fence=no # 4.9% on ARM vs 0.15% on x86
$ # 3. Model-checking with loom
$ cd treiber_stack && cargo test --release --features loom 2>&1 | tail -5
test treiber::tests::push_pop_loom ... FAILED in 0.91s
loom::sync::atomic::AtomicUsize::store(_, Ordering::Relaxed)
counterexample: T2 stores 1 with relaxed; T1's acquire-load on flag observed 1; payload still 0
This is the empirical workflow. Reordering bugs reproduced this way are obvious in retrospect and invisible in advance.
In each of the modes above, the fix is almost always smaller than the diagnostic effort. A handful of _Atomic qualifiers, a single memory_order_release annotation on a publish-store, a single memory_order_acquire on the matching consumer load: these are the edits. The hard part is finding which unsynchronised access is the culprit. Reading random codepaths and adding atomics defensively is worse than the original bug; it bloats binaries, hides the actual race, and gives a false sense of security. Diagnose first, fix second.
A war story to make the cost concrete. PaySetu's reconciliation pipeline shipped a "fast path" in early 2025 that skipped a lock by having two threads cooperate via two flags: the producer sets data_ready, the consumer reads the data and then sets consumed, and the producer waits for consumed. The four memory accesses were plain ints. Stress-testing on the Bengaluru staging cluster (Intel Xeon Platinum, x86 TSO) ran 14 hours with zero anomalies. The same source, rebuilt for the production Graviton2 fleet, started corrupting reconciled batches within 90 seconds: data_ready=1 was visible on the consumer before the data payload writes had drained from the producer's store buffer. The fix was four std::atomic<int> declarations and release / acquire annotations on the four operations; the x86 build emitted byte-identical assembly afterwards (TSO already preserved everything for free), and the aarch64 build picked up stlr / ldar instructions where it needed them. The lesson: passing on x86 is not evidence. The architecture hides bugs the source contains.
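A sketch of what the repaired two-flag protocol could look like in C11. This is not PaySetu's code: the four-element batch, the use of round numbers in place of 0/1 flags, the sched_yield calls in the spin loops, and every identifier are assumptions made so the example can check itself.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static int batch[4];           /* plain payload, ordered by the two flags */
static atomic_int data_ready;  /* producer -> consumer, holds the round number */
static atomic_int consumed;    /* consumer -> producer, holds the round number */
static int rounds_total;
static long corrupted;         /* written by the consumer, read after join */

static void *consumer(void *arg) {
    (void)arg;
    for (int r = 1; r <= rounds_total; r++) {
        /* acquire pairs with the producer's release store of round r */
        while (atomic_load_explicit(&data_ready, memory_order_acquire) != r)
            sched_yield();
        for (int i = 0; i < 4; i++)
            if (batch[i] != r)
                corrupted++;
        /* release: the reads above finish before the producer may overwrite */
        atomic_store_explicit(&consumed, r, memory_order_release);
    }
    return NULL;
}

/* Runs the ping-pong for `rounds` batches; returns how many values were stale. */
long run_pingpong(int rounds) {
    pthread_t t;
    rounds_total = rounds;
    corrupted = 0;
    atomic_store(&data_ready, 0);
    atomic_store(&consumed, 0);
    int rc = pthread_create(&t, NULL, consumer, NULL);
    assert(rc == 0);
    (void)rc;
    for (int r = 1; r <= rounds; r++) {
        for (int i = 0; i < 4; i++)
            batch[i] = r;
        atomic_store_explicit(&data_ready, r, memory_order_release);
        while (atomic_load_explicit(&consumed, memory_order_acquire) != r)
            sched_yield();
    }
    pthread_join(t, NULL);
    return corrupted;
}
```

With the release and acquire annotations in place, run_pingpong should report zero corrupted values on both x86 and aarch64. Declaring the flags as plain int would turn the same code into the racy version from the story.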
Common confusions
- "volatile makes a variable thread-safe" It does not: volatile constrains only the compiler, never the CPU, and emits no fences. It is correct for a single-threaded program reading memory-mapped I/O registers (where the compiler must not optimise away accesses) and incorrect for cross-thread synchronisation under the C and C++ memory models, whatever the CPU. Use _Atomic (C11) or std::atomic (C++11) for thread-safe access; both include the compiler-barrier semantics that volatile gave plus the architecture-specific fences volatile did not.
- "Compiler reordering only matters at high optimisation levels" Even -O1 enables most reorderings that break naive multithreading: dataflow analysis, register allocation, dead-code elimination. -O0 is the only level that emits a near-literal translation, and nobody ships at -O0. Code that "works at -O1" is luck, not correctness.
- "x86 has no compiler reordering issues because it's strongly ordered" The CPU is strongly ordered (TSO); the compiler still reorders aggressively. A KapitalKite outage in 2024 was traced to a std::shared_ptr ref-count read hoisted out of a loop by Clang 17; the architecture was Skylake, and the bug was 100% compile-time. As-if-serial is the rule the compiler obeys; "the CPU is strong" does not change that.
- "std::atomic with memory_order_relaxed is the same as volatile" relaxed guarantees that each access is a single, indivisible operation and that all threads agree on the order of writes to that one variable; in practice compilers also do not elide or fuse relaxed accesses. It does not impose any ordering relative to other memory locations, atomic or not; that is what acquire/release/seq_cst are for. volatile does not even guarantee that the access is atomic on multi-byte types, and it gives none of these guarantees in cross-thread terms.
- "Adding printf made the bug go away, so it's a compiler issue" printf calls are opaque to the compiler: the optimiser cannot prove they don't read or write your shared state, so it is forced to reload everything across the call. Inserting printf is a brutally effective compiler-barrier kludge, though it also changes timing. If your bug disappears with printf, you probably have a compiler-reordering bug; the proper fix is _Atomic or an explicit asm volatile("" ::: "memory") barrier, not the printf.
- "The CPU only reorders memory operations, not arithmetic" The CPU reorders everything: arithmetic, branches, address computations. What the architecture guarantees is that other cores cannot detect the reordering of arithmetic (it is purely local). Memory operations are the only reorderings that leak into the cross-core observable behaviour, which is why the memory model is stated in terms of loads and stores rather than in terms of "the CPU executes in order".
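What relaxed does guarantee can be shown with a shared counter. This is a sketch with invented names (relaxed_count, worker): every increment is indivisible, so no update is lost, even though the operations impose no ordering on any other memory.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16

static atomic_long hits;
static long per_thread_iters;  /* set before the workers start */

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < per_thread_iters; i++)
        /* relaxed: atomic read-modify-write, no cross-variable ordering */
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
long relaxed_count(int nthreads, long iters) {
    pthread_t t[MAX_THREADS];
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    atomic_store_explicit(&hits, 0, memory_order_relaxed);
    per_thread_iters = iters;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    /* pthread_join orders every worker's increments before this load */
    return atomic_load_explicit(&hits, memory_order_relaxed);
}
```

The total is always nthreads times iters. A volatile long incremented the same way carries no such guarantee, because hits++ on a volatile is a separate load and store that another thread can interleave with.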
Going deeper
How LTO and PGO change the reordering surface
Link-time optimisation (LTO, -flto in GCC/Clang) and profile-guided optimisation (PGO) make the compiler's static reordering window enormous. Without LTO, the compiler reorders within one translation unit; with LTO, it reorders across the entire program. Inlined library functions are no longer opaque, and previously-hidden-behind-an-extern-call shared state may suddenly become visible — and reorderable. PGO additionally lets the compiler restructure code based on observed branch frequencies, hoisting cold paths and merging hot ones. The net effect is that a program that worked at -O2 may break at -O2 -flto -fprofile-use. The fix is the same — annotate every cross-thread access as _Atomic — but the diagnostic surface is different. Folly's documentation explicitly notes that several of its lock-free containers required hardening between LTO-off and LTO-on builds, and the canonical bugs are all "an inlined helper's load got moved up across a release store".
The exact set of compiler reorderings the C++ memory model permits
The C++11 memory model permits the compiler to perform any transformation that does not change the observable behaviour of a data-race-free program; in return, a data-race-free program that uses only the default seq_cst atomics behaves as if it were sequentially consistent (DRF-SC). Two operations on the same memory location, where at least one is a write and at least one is non-atomic, with no happens-before edge between them, are a data race; the program has undefined behaviour and the compiler is not required to preserve any observable property. This is why "if you race, even your debug log is wrong": once UB is invoked, optimisations that look catastrophic become legal. CricStream's video pipeline once observed a memory leak that looked like a missing delete; the underlying bug was a benign-looking read of a bool from one thread and a write from another, both non-atomic. Because the program raced, its behaviour was undefined; the compiler had elided the delete because it inferred the path "could never be reached" from a value the racing thread had stale-read. There is no reasoning about what the compiler will do once you race; the only contract is "don't race, and we promise SC".
The Linux kernel's READ_ONCE / WRITE_ONCE discipline
The Linux kernel pre-dates C11 atomics and has its own reordering vocabulary, codified in tools/memory-model/ and the READ_ONCE / WRITE_ONCE / smp_load_acquire / smp_store_release / smp_mb macros. READ_ONCE(x) expands to a volatile-cast read: it stops the compiler from eliding or fusing the load but emits no fence. smp_load_acquire(p) is READ_ONCE plus the architecture's acquire barrier (only a compiler barrier on x86, ldar on ARMv8, a load followed by lwsync on POWER). The kernel rule is that every lockless shared-memory access must be one of these macros; a plain x read in a function that another CPU can concurrently write to is a bug, full stop. The Kernel Concurrency Sanitizer (KCSAN) flags exactly this. The kernel's discipline is stricter than C++'s and predates it; understanding READ_ONCE is also useful for anyone writing C code that needs to interoperate with kernel APIs (eBPF, io_uring, perf ringbuffers all use the same primitives).
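The volatile-cast idea can be imitated in user space. The macros below are simplified stand-ins with made-up names (MY_READ_ONCE, MY_WRITE_ONCE), not the kernel's definitions, which carry additional type checks; they rely on the GCC/Clang __typeof__ extension, and the demonstration is deliberately single-threaded.

```c
#include <assert.h>

/* Simplified imitations: access x exactly once through a volatile lvalue. */
#define MY_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Polls *flag at most max_spins times; returns 1 if it was seen non-zero.
   Because of the volatile access, the compiler must emit a load on every
   iteration instead of hoisting it out of the loop. */
int wait_for_flag(int *flag, int max_spins) {
    assert(flag != 0);
    for (int i = 0; i < max_spins; i++) {
        if (MY_READ_ONCE(*flag))
            return 1;
    }
    return 0;
}
```

This constrains only the compiler, which is the point of the comparison: portable user-space code that shares flag between threads should still use _Atomic, because ISO C gives volatile no cross-thread meaning.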
volatile versus compiler barriers versus atomics — the precise differences
Three constructs people conflate, with three precise meanings:
| Construct | Stops compiler? | Emits CPU fence? | Atomic on multi-byte? | Use for |
|---|---|---|---|---|
| volatile T x | Per-access only | Never | No | Memory-mapped I/O |
| asm volatile("" ::: "memory") | Globally at the barrier point | Never | N/A | Hand-written sync, around inline asm |
| _Atomic T x / std::atomic<T> | Yes, per-operation, with ordering semantics | Yes, per the chosen memory_order | Yes (up to lock-free width) | Cross-thread synchronisation |
asm volatile("" ::: "memory") is the kernel-style compiler barrier: it emits no instructions but tells GCC/Clang that all memory may have changed, forcing reloads. It is what the kernel's barrier() macro expands to; READ_ONCE, by contrast, is a volatile access that constrains only the one object it touches. It is not a CPU fence; on ARM you still need dmb. KapitalKite's matching engine uses asm volatile barriers around its custom seqlock implementation because the seqlock's pattern (read seq → read data → re-read seq) is awkward to express as std::atomic without the compiler defeating the optimisation. The choice is intentional and documented; do not use it casually.
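For comparison, the read seq → read data → re-read seq pattern can be written with C11 atomics alone. The sketch below is not KapitalKite's implementation; it swaps the asm volatile barriers for a commonly described fence-based arrangement, assumes a single writer, and uses invented names throughout.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_uint seq;               /* even: stable, odd: write in progress */
static atomic_long field_a, field_b;  /* payload as relaxed atomics: no data race */
static long writes_total;

static void seq_write(long a, long b) {  /* single writer assumed */
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* odd seq before the data */
    atomic_store_explicit(&field_a, a, memory_order_relaxed);
    atomic_store_explicit(&field_b, b, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);
}

static void seq_read(long *a, long *b) {
    unsigned s0, s1;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        *a = atomic_load_explicit(&field_a, memory_order_relaxed);
        *b = atomic_load_explicit(&field_b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* data before re-read */
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1u) || s0 != s1);  /* retry if a write overlapped */
}

static void *writer(void *arg) {
    (void)arg;
    for (long i = 1; i <= writes_total; i++)
        seq_write(i, i);  /* both fields always carry the same value */
    return NULL;
}

/* Reads concurrently with the writer; returns how many snapshots were torn. */
long seqlock_torn_reads(long writes) {
    pthread_t t;
    long a = 0, b = 0, torn = 0;
    atomic_store(&seq, 0);
    atomic_store(&field_a, 0);
    atomic_store(&field_b, 0);
    writes_total = writes;
    int rc = pthread_create(&t, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;
    do {
        seq_read(&a, &b);
        if (a != b)
            torn++;
    } while (a != writes);
    pthread_join(t, NULL);
    return torn;
}
```

The cost of staying inside the language model is that every payload field becomes an atomic accessed with relaxed operations, which is the awkwardness the paragraph above refers to. In exchange, the compiler and the CPU are both bound by the same two fences.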
Compiler explorer as the only honest documentation
The most practical reference for "what GCC does to my code" is Compiler Explorer (godbolt.org). Pasting the source, picking the compiler version, picking the architecture, and reading the asm is the only way to know what your build pipeline actually emits. The C++ standard tells you what is legal; godbolt tells you what your compiler chose. PaySetu's lock-free queue audit in 2024 used a godbolt-driven CI gate: every change to the queue's source files triggered a compile across GCC 11/12/13/14 and Clang 15/16/17/18 on x86-64 and aarch64, with the assembly diffed against a checked-in golden. Three regressions were caught in the first quarter, all involving acquire/release annotations the compiler chose to satisfy with weaker fences after a heuristic change. This is heavyweight CI but cheaper than a 3am production page. Tokio adopted the same pattern in 2023 for its scheduler hot path.
Where this leads next
The next chapter — the wall: you cannot reason without a memory model — closes Part 2 by formalising the question this chapter pushes you towards: given that compilers and CPUs both reorder, and that the reorderings interact, what contract lets a programmer reason at all? The answer is "a memory model", and Part 3 spends seven chapters constructing the language-level memory models you will use every day: sequential consistency, TSO, ARM/POWER weak memory, the C++ memory orders, and the Java and Go memory models. Once you have a memory model, the two-layer reordering problem collapses into a single set of rules expressed in terms of acquire/release/seq_cst and happens-before. That is the abstraction Boehm and Adve gave us in 2008, and it is the only way to write portable concurrent code today.
References
- Boehm, "Threads Cannot be Implemented as a Library" (PLDI 2005): the foundational paper on why pre-C++11 threading was unsound; documents specific compiler reorderings that break pthreads code.
- Boehm & Adve, "Foundations of the C++ Concurrency Memory Model" (PLDI 2008): the design of the C++11 memory model, DRF-SC, and the unification of compiler and CPU reorderings.
- Sewell et al., "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors" (CACM 2010): formal model of x86 hardware reordering.
- Linux kernel documentation, "Why the 'volatile' type class should not be used": the canonical kernel-perspective argument for volatile's inadequacy.
- Paul McKenney, "Memory Barriers: a Hardware View for Software Hackers": the deep operational walkthrough of compiler and CPU reorderings on real hardware.
- Compiler Explorer (godbolt.org): the only honest documentation for what your compiler emits.
- Internal: store buffers and their consequences (the CPU half of the reordering story).
- Internal: wall: no discussion is possible without a machine model (Part 2's framing chapter).