Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Store buffers and their consequences

PaySetu's settlement service has a two-flag handshake: ready=1 on one thread, done=1 on another, with each thread reading the other's flag after writing its own. The code is symmetric, the variables are aligned, and the flags are std::atomic<int> written with memory_order_release and read with memory_order_acquire. Once every 12 hours, a node logs both saw 0, a state the program's author can prove is impossible by reading the source. It is impossible if the CPU executes stores and loads in program order. It is not impossible on x86. The reason is not a cache bug, a broken atomic, or a compiler reordering. It is a small FIFO inside every core, a few dozen entries deep, called the store buffer, and it is the single most-misunderstood structure in the modern CPU.

A store buffer is a per-core FIFO that holds retired stores while their cache line is being acquired in M state via MESI. It lets the core continue executing past a store without stalling on coherence. The cost: your own subsequent loads can see stores that no other core has observed yet, and the only way to force the buffer to drain before a later load is an mfence (or a lock-prefixed instruction, which implies a fence). Every "x86 TSO weirdness" you have ever seen (Dekker's algorithm breaking, store-load reordering, the SB litmus permitting 0/0) is the store buffer.

What the store buffer is and why it exists

When core 0 executes mov [x], 1, the cache line containing x may be in I state on this core — held in M state by core 5. To complete the store correctly, core 0 must send a Read-For-Ownership (RFO) request to core 5, wait for an invalidate-ack and the line transfer, and only then write the value into L1d. That round trip is 80–150 ns. If the CPU stalled the pipeline waiting, every cross-core store would be a 200-cycle bubble; sustained throughput would collapse to single-digit million stores per second.

The fix designed into every modern x86 (and ARM and POWER) core: when a store retires, the value goes into a small per-core FIFO called the store buffer — typically 16–60 entries deep on Intel and AMD — and the pipeline immediately continues. The store buffer drains in the background as cache lines arrive in M state. From the point of view of other cores, the store has not happened yet — the value is invisible until the line is acquired and the entry is retired into L1d. From the point of view of this core, however, subsequent loads to the same address will be served out of the store buffer (a feature called store-to-load forwarding), so the writer always sees its own writes immediately.

[Figure: Store buffer between the core and L1d. One core with an execution unit (issuing mov [x], 1; mov eax, [y]; add eax, 1), a FIFO store buffer of ~16–60 entries holding pending stores such as [x]=1 while they wait for M state, and the L1d cache, which is visible to other cores via MESI. Stores retire into the buffer and drain from its head into L1d once the line is in M state. A load checks the store buffer first (forwarding) and goes to L1d only if no entry matches. Other cores never see the buffer.]
The store buffer sits between retirement and L1d. It hides MESI's RFO latency from the issuing core but creates a window where this core's stores are visible to itself and not to anybody else. That asymmetry is the whole story.

Why this is necessary, not a bug: without a store buffer, every store to an I-state line stalls the pipeline for one full coherence round trip. With a 16-entry buffer, the core absorbs bursts of stores up to the buffer depth before it has to stall. On a Zen 4, a tight loop of stores to alternating cache lines runs at ~1 store/cycle when the buffer can absorb them; stalling for one RFO round trip per store instead would give ~0.012 stores/cycle. That is roughly an 80× throughput difference, and it is the reason no production CPU since the Pentium Pro has shipped without a store buffer.

The consequence: store-to-load reordering on x86

The store buffer creates exactly one allowed reordering on x86 TSO: a younger load to a different address may complete before an older store has drained. The store has retired (it is in the buffer) but it has not committed to the coherence layer (the line is not yet in M state). Meanwhile, the younger load — to a different line that is already in S state in this core's L1d — sails through and returns its value. As far as any other core can tell, this core executed the load before the store.

This is the store-buffer (SB) litmus test, the canonical example. Two threads, two flags, four lines of code total, and a state the source rejects but the hardware allows.

// sb_litmus.c — the canonical store-buffer reordering demo, on x86 TSO.
// Build: gcc -O2 -pthread sb_litmus.c -o sb_litmus
// Run:   taskset -c 0,1 ./sb_litmus 10000000
// Fenced variant: pass "fence" as the second argument
// (./sb_litmus 10000000 fence) to add a seq_cst fence after each store.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static _Alignas(64) _Atomic int x;   // shared flag on its own 64-byte cache line
static _Alignas(64) _Atomic int y;   // shared flag on its own 64-byte cache line
static int r1, r2;              // per-thread loads
static volatile int go, done_a, done_b;
static int use_fence = 0;

static void pin(int cpu) {
    cpu_set_t s; CPU_ZERO(&s); CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof s, &s);
}

static void* T1(void* _) {
    pin(0);
    while (!go) { }
    atomic_store_explicit(&x, 1, memory_order_relaxed);   // store x
    if (use_fence) atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  // load y
    done_a = 1;
    return NULL;
}
static void* T2(void* _) {
    pin(1);
    while (!go) { }
    atomic_store_explicit(&y, 1, memory_order_relaxed);   // store y
    if (use_fence) atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);  // load x
    done_b = 1;
    return NULL;
}

int main(int argc, char** argv) {
    long iters = (argc > 1) ? atol(argv[1]) : 1000000;
    use_fence = (argc > 2 && argv[2][0] == 'f');
    long both_zero = 0;
    for (long i = 0; i < iters; i++) {
        x = y = 0; r1 = r2 = 0; done_a = done_b = 0; go = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, T1, NULL);
        pthread_create(&b, NULL, T2, NULL);
        go = 1;
        pthread_join(a, NULL); pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0) both_zero++;
    }
    printf("iters=%ld both_zero=%ld fence=%s\n", iters, both_zero,
           use_fence ? "yes" : "no");
    return 0;
}

The crucial four lines are the body of T1 and T2: each writes its flag, then reads the other's. By any naive reading of the source, at least one of the threads must observe the other's write, so r1==0 && r2==0 should be impossible. Sample output on a Zen 4 desktop, threads pinned to cores 0 and 1:

$ gcc -O2 -pthread sb_litmus.c -o sb_litmus
$ ./sb_litmus 10000000
iters=10000000 both_zero=14823 fence=no
$ ./sb_litmus 10000000 fence
iters=10000000 both_zero=0 fence=yes

About 0.15% of runs return both_zero=true. With mfence after each store (fence=yes), it is exactly zero. The hardware is doing what the source forbids — unless you read the source as describing the architectural contract, which is x86 TSO. TSO permits exactly one reordering: an earlier store may be observed by other cores after a later load. Why TSO calls it "store→load reordering, nothing else": loads to different lines never reorder past each other (load→load is preserved). Stores to different lines never reorder past each other (store→store is preserved, because the store buffer is FIFO). Loads do not reorder past earlier stores to the same address (forwarding handles this). The only relaxation is store→load to a different address, and it is the one the store buffer creates by design.

How the litmus failure unfolds inside the store buffers

Sequencing the SB failure step by step makes the buffer visible. Each core initially holds the line for the flag it writes in I state (or in S; the outcome is identical). The line each core loads from is already in S state in that core's L1d.

[Figure: SB litmus, how both cores can see 0 (illustrative). Time-ordered trace for core 0 and core 1 in three phases: retire stores, issue loads, drain stores. Core 0: store x=1 goes into its buffer, load y returns 0 (S hit, no MESI request), then x=1 drains into L1d. Core 1 mirrors this: store y=1 goes into its buffer, load x returns 0 (S hit), then y=1 drains into L1d. By the time the stores drain, both loads have already returned 0.]
The SB-litmus failure decomposed: each core's store sits in its own store buffer while its load reads the other's flag from a still-S-state line. Drain happens later. From a global timeline view both stores precede both loads in program order; from the coherence layer's view, both loads precede both stores. Both views are "correct" for x86 TSO — the architecture explicitly admits this gap.

Why this is not a bug, and why Dekker's algorithm does not work on x86 without a fence: Dekker's mutual-exclusion algorithm, Peterson's algorithm, and any "two threads each set a flag and check the other" pattern are all isomorphic to the SB litmus. They all depend on store→load ordering. They all fail on x86 TSO without mfence, and on ARM/POWER without a dmb sy / sync barrier. Lamport's bakery algorithm has the same dependency. The store buffer is the reason every textbook mutual-exclusion algorithm needs to be re-verified for any real CPU's memory model.

Why mfence (and lock-prefix) drains the buffer

The fix for the SB litmus, and for every store-buffer-induced reordering, is to drain the buffer before the next load. mfence does exactly that: the instruction blocks retirement of any later load until every prior store in the buffer has committed to L1d (i.e., the line is in M state and the entry is gone). On x86, any lock-prefixed instruction (lock cmpxchg, lock incq, xchg with memory) implicitly fences — they all guarantee the buffer is drained before the locked operation, and stronger, that the locked operation is globally observed before any later instruction.

The cost is the cost of waiting for the worst pending store. On Zen 4 with intermittent contention, mfence is 25–50 cycles when the buffer is empty and 200+ cycles when a contended line is still being acquired. This is why C++ memory_order_seq_cst stores on x86 are ~5× the cost of release stores: seq_cst requires a fence, release does not (because TSO already preserves load→store and store→store ordering).

// fence_cost.c: measure the cost of a fencing atomic RMW (lock xadd) when its
// cache line is uncontended vs. when a helper thread keeps pulling it away.
// Build: gcc -O2 -pthread fence_cost.c -o fence_cost
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static _Atomic uint64_t shared;     // contended by helper thread
static _Atomic int stop_helper;

static void* hot(void* _) {
    while (!atomic_load(&stop_helper)) {
        atomic_fetch_add_explicit(&shared, 1, memory_order_relaxed);
    }
    return NULL;
}

static inline uint64_t now_ns(void) {
    struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(int argc, char** argv) {
    int contended = (argc > 1 && argv[1][0] == 'c');
    pthread_t h;
    if (contended) pthread_create(&h, NULL, hot, NULL);
    enum { N = 5000000 };
    uint64_t t0 = now_ns();
    for (int i = 0; i < N; i++) {
        atomic_fetch_add_explicit(&shared, 1, memory_order_seq_cst);
        // seq_cst RMW on x86 = lock xadd, which fences the store buffer
    }
    uint64_t t1 = now_ns();
    if (contended) { atomic_store(&stop_helper, 1); pthread_join(h, NULL); }
    printf("mode=%-10s ns/op=%.1f\n",
           contended ? "contended" : "uncontended", (double)(t1 - t0) / N);
    return 0;
}

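Sample output, with the two threads kept on cores 0 and 1 (the program does not pin threads itself, so restrict it externally, for example with taskset -c 0,1):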

$ ./fence_cost
mode=uncontended ns/op=4.2
$ ./fence_cost c
mode=contended  ns/op=86.7

The 80-ns gap is the cache-line ping-pong (covered in the cache-coherence chapter) plus the fence wait. Why uncontended fences look "free": the buffer is essentially empty, the line is already in M state, and lock xadd retires immediately. The fencing semantics are still active; you just cannot see them. This is the trap programmers fall into when they micro-benchmark atomics on a single core and conclude seq_cst is cheap. It is cheap only when nothing is happening.

Common confusions

  • "The store buffer is the same as the write-back cache" No. Write-back is a property of L1d/L2 — dirty lines stay in cache and only flush on eviction. The store buffer is upstream of L1d, between the core's retirement and the coherence layer. A store goes buffer → L1d (becomes M) → L2/L3/DRAM (on eviction). The buffer is per-core and invisible to MESI; the cache hierarchy is visible.
  • "volatile flushes the store buffer" It does not. volatile controls compiler reordering only. It does not emit any fence instruction, does not interact with MESI, and does not drain the store buffer. Use atomic_thread_fence(memory_order_seq_cst) or any lock-prefixed atomic operation to drain the buffer.
  • "x86 is sequentially consistent" It is TSO (Total Store Order), not SC. TSO permits exactly the SB-litmus reordering: store→load to different addresses. Every other reordering (store→store, load→load, load→store) is forbidden by the architecture. SC forbids all four. Programmers who say "x86 is strong enough to ignore" usually mean "I have never written Dekker's algorithm"; the moment you do, the difference matters.
  • "Store-to-load forwarding is unsafe" It is the safe behaviour. Forwarding ensures the writer always observes its own writes, which is required for any sequential reasoning at all (x = 1; assert(x == 1) must hold). What is "unsafe" is other cores not seeing the store yet, and that is a property of the buffer not having drained, not of forwarding.
  • "mfence is the same as sfence plus lfence" No. sfence orders stores against stores; lfence orders loads against loads (and serializes the front-end). Neither drains the store buffer for a load-after-store pair. Only mfence (or a lock-prefixed instruction) does. The Intel manual is explicit on this point: MFENCE guarantees that every load and store instruction preceding it in program order becomes globally visible before any load or store instruction that follows it.
  • "Store buffer contents are visible to anybody" They are visible only to the issuing core (via forwarding). Other cores cannot read them, cannot snoop them, cannot wait for them. The buffer is private state. The instant the buffer entry retires into L1d (line in M), MESI takes over and the value becomes globally observable.

Going deeper

ARM/POWER weak models — every reordering is allowed unless fenced

x86 TSO permits one reordering (store→load). ARMv8 and POWER permit all four: store→store, store→load, load→load, and load→store may all reorder freely between two different addresses. The reason is the same (store buffers, plus more aggressive load-load speculation), but the architecture exposes more of it. The consequences are operationally severe: code that "happens to work" on x86 because TSO hides the store buffer's effects on store→store and load→load can fail catastrophically when ported to ARM. Why this is the bug-pattern people see when their service moves from Intel to Graviton: the same C++ code, with the same std::atomic types and the same memory orders, runs on x86 with r1==0 && r2==0 happening at 0.15%, and on Graviton at 4–8%. The failure rate grows by more than an order of magnitude on weaker memory models because more reorderings are visible. This is why C++11's memory model was specified to the weakest reasonable target (DRF-SC for race-free programs, with explicit weaker orderings as opt-ins): to make the same source compile correctly on every CPU.

Why seq_cst stores cost more than release stores on x86

In C++, atomic_store_explicit(p, v, memory_order_release) compiles to a plain mov on x86: TSO already provides release ordering for free, because load→store and store→store are both preserved (the latter by the buffer's FIFO discipline). atomic_store_explicit(p, v, memory_order_seq_cst) on most compilers compiles to mov; mfence (or xchg, which implicitly fences). The difference is one drained store buffer, and on a contended workload it is 5–10× the latency. Most production systems can use release/acquire everywhere, paying for seq_cst only at the precise points (Dekker-style handshakes, certain epoch-based reclamation algorithms) where the SB pattern actually matters. PaySetu's settlement service was using seq_cst everywhere "to be safe" and was paying 60 ns per atomic store on the hot path; the audit cut 21 of 23 fences and brought sustained throughput from 8 M to 41 M ops/sec on the same hardware.

Store-buffer overflow and back-pressure

The buffer is finite. Sustained stores to lines that are not arriving in M state — common when many cores write to lines held in M elsewhere — fill the buffer, and once full, the next store stalls retirement. Front-end issue grinds to a halt; the core sits at IPC ≈ 0. The signature in perf is high cycles_stalled_md and low IPC concurrent with mem_inst_retired.lock_loads. KapitalKite's risk-engine team once observed a 12× throughput regression after migrating an order-management daemon from a single-CCD pinning to spread across two CCDs — store-buffer back-pressure, not lock contention, was the cause. The fix was to keep producer threads on the same CCD as the data they wrote.

Hazards beyond ordering: speculation past stalled stores and Spectre-class concerns

Intel cores will speculatively execute past a stalled-buffer store, and that speculation has been the foundation of multiple side-channel attacks (Spectre v4, also known as Speculative Store Bypass, and Fallout, which samples store-buffer data). Modern microcode mitigates these, but the architectural quirk remains: the store buffer's interaction with speculation creates side channels that pure software cannot fully close. This is largely irrelevant to performance engineering but matters for any hardened deployment. Erlang's BEAM and Rust's loom model checker both assume the architectural memory model, not the speculative one, and security-critical paths add lfence accordingly.

Where this leads next

The next chapter — false sharing: the invisible killer — combines this chapter's store-buffer story with the previous chapter's MESI ping-pong story. Two unrelated counters on the same cache line cause the write side to suffer ping-pong (MESI) while the producer's store buffer accumulates pending stores it cannot drain (because the line keeps getting yanked away). The combination is the worst of both pathologies and explains the canonical 5–50× throughput collapse from a single misaligned struct field.

Part 3 of this curriculum builds the bridge from store buffers to formal memory consistency models. Once you have absorbed why the buffer permits store→load reordering, the TSO vs ARM vs SC chapter and the acquire/release semantics chapter are mostly nomenclature on top of the same physics. The store buffer is also the structure responsible for half of the surprises in the Linux kernel's memory model — the other half being load buffers and speculative execution, which Part 3 will name.

References