Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Why concurrency is hard: shared state

Aditi, an SRE at PaySetu, is debugging a daily reconciliation report. The merchant ledger says ₹4,82,71,500 was settled yesterday; the bank statement says ₹4,82,71,503. The mismatch is three rupees. It happens every day, with a different small number, and only on the days when the settlement service is run with eight worker threads instead of one. The code path is fifteen lines long, has been reviewed twice, and contains exactly one shared field — a running total. The function looks correct. It is not.

Concurrency is hard because the moment two threads share mutable state, the program's correct behaviour stops depending only on its source code and starts depending on the interleaving of CPU instructions chosen by the OS scheduler — an interleaving you cannot see, cannot reproduce, and cannot test exhaustively. A single integer increment is three machine operations, and any of them can be ripped apart by another thread. Every later mechanism in this curriculum — atomics, mutexes, lock-free queues, memory models — exists to give you back the property that "step N completes before step N+1 starts" that you took for granted in single-threaded code.

One counter, two threads, the wrong answer

Here is the entire problem of concurrency, distilled. Two threads each increment a shared integer one million times. The expected final value is two million. The actual value, on a free-threaded (no-GIL) Python build or in the equivalent Java or C program, is something like 1,438,921 — different on every run.

# bug.py — run under the free-threaded build (python3.13t bug.py) to see the race
import threading

counter = 0  # shared mutable state

def bump(n):
    global counter
    for _ in range(n):
        counter += 1   # looks atomic. is not.

t1 = threading.Thread(target=bump, args=(1_000_000,))
t2 = threading.Thread(target=bump, args=(1_000_000,))
t1.start(); t2.start()
t1.join();  t2.join()

print(f"expected 2000000, got {counter}, lost {2_000_000 - counter}")

A representative run on the Python 3.13 free-threaded build (M2 MacBook Air, 8 cores):

expected 2000000, got 1438921, lost 561079

Run it again — you get 1,512,003. Again — 1,602,847. The number is different every time. Why: the OS scheduler decides moment-to-moment which thread runs on which core and for how long. Each different scheduling decision produces a different interleaving of the two threads' instructions, and most of those interleavings drop some increments. The bug is not in the source — the source is the same on every run. The bug is in the interaction between the source and the runtime, which the source alone cannot describe.

Why does the increment lose updates? Because counter += 1 is not one operation. It is three:

  1. Load the current value of counter from memory into a CPU register.
  2. Add 1 to the register.
  3. Store the register back to counter's memory location.

Between any two of those steps, the OS can suspend this thread and run the other one. If both threads load the same value (say 47), both add 1 (to 48), and both store 48 — two increments collapsed into one. The lost update is invisible: no exception, no log line, no crash. The counter just ends up too small.

Figure: Lost update — two threads, three steps each, one collapsed increment. Left panel: the correct interleaving (no overlap) ends with counter = 49. Right panel: the buggy interleaving (both threads load 47, both store 48) ends with counter = 48, one update lost. Illustrative only: the scheduler can pick any of the C(6,3) = 20 distinct interleavings of the six steps, and only some give the right answer.
Two threads each running `counter += 1` on the same `counter`. The left interleaving runs the threads sequentially and produces 49. The right interleaving overlaps the loads — both threads read 47, both compute 48, both write 48 — and one update is lost. Nothing in the source code distinguishes the two cases.
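The "only some interleavings are correct" claim can be checked exhaustively. The sketch below simulates every distinct interleaving of the two threads' load/add/store steps, under the simplifying assumption that each thread increments exactly once starting from zero, and counts the outcomes:

```python
from itertools import combinations

STEPS = ("load", "add", "store")

def simulate(t1_positions):
    """Run one interleaving: t1_positions says which of the 6 slots belong to T1."""
    counter = 0
    regs = {0: 0, 1: 0}        # each thread's private CPU register
    progress = {0: 0, 1: 0}    # how many of its three steps each thread has done
    for slot in range(6):
        tid = 0 if slot in t1_positions else 1
        step = STEPS[progress[tid]]
        if step == "load":
            regs[tid] = counter      # read shared counter into the register
        elif step == "add":
            regs[tid] += 1           # bump the private register
        else:
            counter = regs[tid]      # write the register back to the counter
        progress[tid] += 1
    return counter

results = [simulate(set(pos)) for pos in combinations(range(6), 3)]
print(f"{len(results)} interleavings: "
      f"{results.count(2)} correct (counter=2), "
      f"{results.count(1)} lost an update (counter=1)")
```

Only the two fully sequential interleavings (all of T1's steps before all of T2's, or vice versa) produce the correct answer of 2; the other eighteen lose an update.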

This is the problem. Every concurrency mechanism in the rest of this curriculum exists to remove, contain, or detect interleavings of this shape. A mutex makes step 1 → step 3 indivisible from another thread's perspective. An atomic fetch_add collapses load-add-store into one CPU instruction (lock xadd on x86) that the cache-coherence protocol guarantees no other core can interleave with. A lock-free CAS loop reads, computes, and re-checks before writing, restarting if anyone else got there first. A linearizable counter is a counter where, no matter the interleaving, the visible effect is as if all increments happened one at a time in some order. Why: the language of solutions in this field is the language of ordering and atomicity. "Atomic" means "no interleaving can split this op". "Linearizable" means "the global trace looks like some serial order". "Happens-before" means "the language guarantees one event finishes before another begins". You will meet these words again in every later chapter; their entire purpose is to defeat the lost update from the right column above.
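The CAS retry shape is worth seeing concretely. Python has no user-level compare-and-swap, so the `EmulatedAtomic` class below is purely hypothetical: a lock stands in for the hardware guarantee (on real hardware, CAS is a single instruction and needs no lock). What the sketch shows faithfully is the read, compute, re-check-before-write, retry loop:

```python
import threading

class EmulatedAtomic:
    """Hypothetical stand-in: a lock plays the role of the hardware CAS guarantee."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # atomically: if the value is still what we saw, write new and report success
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def cas_increment(atom):
    while True:                                   # the lock-free retry shape
        old = atom.load()                         # 1. read
        if atom.compare_and_swap(old, old + 1):   # 2. re-check and write in one step
            return                                # 3. success
        # someone else updated first — loop and retry with the fresh value

counter = EmulatedAtomic()

def worker():
    for _ in range(50_000):
        cas_increment(counter)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 200000 — every increment lands, none lost
```

The contrast with the buggy `counter += 1` is step 2: the write only happens if nobody else wrote in between, so a lost update becomes a retried update.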

Why a single core hides the bug, and a single GIL hides it too

If you run the same program on a uniprocessor — or on CPython 3.12 with the GIL on — the lost-update bug usually goes away. Aditi tested the settlement service on her laptop, on staging, on a single-core canary container; it never lost a rupee. It only failed on the eight-core production fleet. This is the most disorienting property of concurrency bugs: they are invisible on the configuration where you debug, and they appear on the configuration where it matters.

There are two reasons the bug hides.

Reason one — single-core scheduling. On one CPU core, a context switch can happen anywhere, but it costs the OS a few microseconds (save registers, load the other thread's registers, possibly flush the TLB). The Linux scheduler typically gives a thread several milliseconds before preempting it: millions of instructions. A 1,000,000-iteration loop therefore spans only a handful of scheduling slices, so there are only a handful of preemption points, and the chance that one of them lands exactly between a load and its matching store is tiny. Whether the bug can fire is a separate question from whether it will fire. On one core, statistically, almost every interleaving runs sequentially.

Reason two — the GIL. CPython's Global Interpreter Lock serialises Python bytecode execution. The bytecode sequence for counter += 1 (a load, a BINARY_OP, a store) is interruptible: CPython offers to hand the GIL over roughly every 5 ms (the sys.setswitchinterval default), and the handover can land between any two bytecodes. But for short, hot loops the probability that a switch lands exactly between the load and the store is small. Aditi's settlement service ran on a free-threaded (no-GIL) build; the bug fired immediately. The lesson is not "use the GIL to be safe". It is "the GIL hides bugs that will resurface the moment you switch interpreter, language, or build".

# gil_observer.py — observe GIL hiding the race on standard CPython
import sys, threading
counter = 0
def bump(n):
    global counter
    for _ in range(n):
        counter += 1

for trial in range(5):
    counter = 0
    t1 = threading.Thread(target=bump, args=(1_000_000,))
    t2 = threading.Thread(target=bump, args=(1_000_000,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(f"trial {trial}: counter = {counter}")

A representative run on CPython 3.12 with the GIL (the default):

trial 0: counter = 1382491
trial 1: counter = 1245773
trial 2: counter = 1671004
trial 3: counter = 1058362
trial 4: counter = 1429815

On standard CPython 3.12 the race still fires, because the GIL can be released between bytecodes — and the lost-update rate is high enough to demonstrate. The myth that the GIL makes Python "thread-safe" is wrong: it serialises bytecodes, not Python statements. counter += 1 compiles to multiple bytecodes (LOAD_GLOBAL, LOAD_CONST, BINARY_OP, STORE_GLOBAL on 3.12; older versions show BINARY_ADD), and the GIL can be handed over between any two of them. Why: CPython's eval loop checks for a pending GIL handover at well-defined points between bytecodes, subject to the sys.setswitchinterval floor. A statement that compiles to four bytecodes has three internal points where another thread can grab the GIL, run its own bytecodes, mutate the same global, and hand the GIL back. The illusion of atomicity belongs to a single bytecode (each bytecode is atomic with respect to GIL handover), not to a single line of source.
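You can see the bytecode decomposition directly with the standard-library dis module (exact opcode names vary by CPython version — 3.12 shows BINARY_OP where older versions showed BINARY_ADD or INPLACE_ADD):

```python
import dis

counter = 0

def bump():
    global counter
    counter += 1   # one source line, several bytecodes

# list the opcode names the interpreter actually executes for bump()
ops = [ins.opname for ins in dis.get_instructions(bump)]
print(ops)  # several opcodes, e.g. LOAD_GLOBAL ... STORE_GLOBAL — never just one
```

Every gap between two of those opcodes is a point where the GIL can change hands.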

The four words you need to internalise for the rest of the curriculum:

  • Race condition: the program's correctness depends on the relative timing of unsynchronised threads. The lost-update example is the canonical race.
  • Data race (a stricter, formal version): two threads access the same memory location, at least one access is a write, and there is no synchronisation ordering between them. In C++ and Rust, a data race is undefined behaviour — the compiler can do anything, including miscompile your code under the assumption that it doesn't happen.
  • Atomicity: the property that an operation is indivisible from another thread's point of view. fetch_add is atomic; counter += 1 in pure Python is not.
  • Linearizability: the property that, even though operations from many threads overlap in real time, the observed result is as if they all ran one at a time in some sequential order consistent with real-time. A linearizable counter is what you want; counter += 1 is what you have.

Memorise these. Every later chapter is a refinement on one of them.

How Aditi found the three rupees

The settlement service has one shared field — total_settled — that every worker thread increments after each successful merchant transfer. The original code, slightly simplified:

class Settlement:
    def __init__(self):
        self.total_settled = 0  # in paise
    def record(self, amount_paise):
        self.total_settled += amount_paise   # the bug

On the eight-worker fleet, the record calls overlap. Two workers credit ₹1 (100 paise) at the same instant; both load total_settled = 47_138_204 (paise), both compute 47_138_304, both store 47_138_304. The merchant-side ledger accumulates correctly because each worker's local view is independent; the bank-side ledger logs each transfer to a database row, also correct. But the in-memory total_settled — used for the daily report — has dropped one credit. ₹1 lost on a quiet day; ₹3 on a busy day; ₹47 on Diwali. The bug had been silently shaving rupees off the daily total since the service moved from one worker to eight, six months ago.
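The collapsed interleaving from the story, replayed step by step with explicit "registers" (the paise figures are the hypothetical ones from the narrative):

```python
total_settled = 47_138_204   # shared field, in paise

r1 = total_settled           # worker 1 loads
r2 = total_settled           # worker 2 loads before worker 1 stores
r1 += 100                    # worker 1 credits ₹1 = 100 paise
r2 += 100                    # worker 2 credits ₹1 = 100 paise
total_settled = r1           # worker 1 stores 47_138_304
total_settled = r2           # worker 2 stores 47_138_304 — overwrites, one credit lost

print(total_settled)         # 47138304, not the 47138404 two credits should give
```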

The fix is one line — but the lesson is which one line. Three options worked:

import threading
class Settlement:
    def __init__(self):
        self._lock = threading.Lock()
        self.total_settled = 0
    def record(self, amount_paise):
        with self._lock:                           # option A: mutex
            self.total_settled += amount_paise

Option A wraps the read-modify-write in a mutex. Cost: every increment now pays a lock cmpxchg plus a possible futex wait under contention. Throughput drops but the count is exact.
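A quick self-check that option A reconciles exactly — a sketch with made-up worker and credit counts, repeating the option A class so it runs standalone:

```python
import threading

class Settlement:
    def __init__(self):
        self._lock = threading.Lock()
        self.total_settled = 0
    def record(self, amount_paise):
        with self._lock:                 # option A: mutex around the read-modify-write
            self.total_settled += amount_paise

s = Settlement()
N_WORKERS, N_CREDITS = 8, 50_000         # hypothetical load: 8 workers, 50k credits each

def worker():
    for _ in range(N_CREDITS):
        s.record(100)                    # credit ₹1 = 100 paise

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

print(s.total_settled == N_WORKERS * N_CREDITS * 100)  # True — exact, no lost paise
```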

Option B uses an atomic counter. Python's standard library doesn't ship one for threads: you reach for a third-party atomics package, or for multiprocessing.Value('q'), whose increments you must still guard with its built-in lock (with v.get_lock(): v.value += n), because v.value += n is itself a read-modify-write. In Java, AtomicLong.addAndGet. In C++, std::atomic<int64_t>. Why: an atomic fetch-add on x86 compiles to a single lock xadd instruction. The lock prefix asserts ownership of the cache line via the coherence protocol — no other core can read or write that line until the instruction retires. The load-add-store fuses into one indivisible step, which is exactly what counter += 1 was missing.

Option C — the option Aditi actually shipped — eliminates the shared field. Each worker keeps its own local_total, and a single aggregator thread sums the per-worker totals once a minute for the report. This is the structure-the-program-to-not-share approach: when you can avoid shared mutable state entirely, you don't need atomics, mutexes, or memory models. The settlement service ships zero Lock calls and zero atomics; the workers don't talk to each other; the aggregator is the only reader, and it reads after the workers have finished a batch. Throughput went up compared to the original buggy code (no contention at all, no false sharing on the counter's cache line), and the daily report reconciles to the paise.
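A minimal sketch of option C (all names hypothetical): each worker owns one slot that no other thread ever writes, and the aggregator sums the slots after the batch has finished:

```python
import threading

N_WORKERS = 8
local_totals = [0] * N_WORKERS          # one slot per worker — never written by two threads

def worker(worker_id, credits):
    for _ in range(credits):
        local_totals[worker_id] += 100  # only this thread writes this slot: no race, no lock

threads = [threading.Thread(target=worker, args=(i, 50_000)) for i in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

# the aggregator reads after the batch is done — the only moment of sharing
total_settled = sum(local_totals)
print(total_settled)  # 40000000 paise, exact
```

No Lock objects, no atomics: the design removes the concurrent write path instead of synchronising it.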

Figure: Three fixes for the lost-update race, with cost trade-offs. A. Mutex — threads queue at the lock (~22 ns uncontended, ~1.4 µs at 8 cores). B. Atomic fetch_add — all threads hammer one hot cache line (~6 ns uncontended, ~80 ns per failed CAS). C. Per-thread + aggregate — independent local counters, an aggregator reads them later (writes ~free, reads up to a minute stale).
Three structurally different fixes. A and B both keep the shared counter and pay the cost of synchronising on it. C eliminates the share. The right answer depends on whether you need a real-time exact value (A or B) or whether a stale aggregated value is acceptable (C). Latency numbers are illustrative — see Parts 4–5 for measurements.

The structural choice — share-and-synchronise versus don't-share — recurs in every later chapter. Lock-free queues, hazard pointers, RCU, work-stealing schedulers, actor mailboxes — they are all answers to "we have to share something; what is the smallest, cheapest thing we can share?". Aditi's option C is the limit case where the shared thing is almost nothing — just the per-thread totals, read once per minute, no concurrent writes.

Common confusions

  • "A single line of source is atomic" No. counter += 1 compiles to a load-modify-store sequence in every imperative language; in Python it's three or four bytecodes; in C it's three or more machine instructions on most ISAs. The unit of atomicity is decided by the compiler and the hardware, not the source line. State the atomicity property explicitly: "is this a single CPU instruction?" or "does this acquire a lock?".
  • "The GIL makes Python thread-safe" No. The GIL serialises bytecode execution, not Python-statement execution. Statements that compile to multiple bytecodes (which is almost all of them) can be torn between bytecodes. Empirically, counter += 1 loses updates under standard CPython 3.12 with two threads. The GIL prevents some races (single-bytecode operations on built-in types are typically atomic) and hides others until you switch interpreter or --disable-gil.
  • "Race condition is the same as data race" Related but distinct. A race condition is a logical bug — the program's correctness depends on timing. A data race is the formal C++/Rust term: two unsynchronised accesses to the same location, at least one a write. Every data race is a race condition; not every race condition is a data race (you can have correctly-synchronised but logically-racy code, e.g. a TOCTOU file-system race).
  • "Thread-safe" without saying what property "Thread-safe" is a marketing word. State the property: atomic (each call is indivisible), linearizable (atomic plus a real-time order across calls), race-free (no undefined behaviour from data races), internally synchronised (uses a mutex internally — safe to call but not necessarily safe to compose with another such object). A Mutex<HashMap<K,V>> is internally synchronised; python dict access of a single key is atomic for the bytecode under the GIL but not linearizable across get+put pairs. They are different guarantees with different composability.
  • "The bug doesn't repro on my machine" Concurrency bugs are probabilistic. The probability depends on core count, scheduler policy, cache architecture, and contention level. A bug that fires once per million increments on a 2-core laptop fires fifty times per million on a 32-core production host. "Doesn't repro" means "the probability on this configuration was below your patience threshold". Use stress tools (TSAN, Loom, Go's -race flag) that amplify the detection rate, not your eyes.
  • "Adding a print makes it stop happening" It does, and the bug is still there. print involves a syscall, an I/O wait, and (in Python) a GIL release at well-defined points; all of these change the scheduling pattern enough that the buggy interleaving becomes rarer. This is why "Heisenbugs" exist — the act of observing perturbs the timing. Trust race detectors, not log lines.

Going deeper

The 1965 paper that named the problem

Edsger Dijkstra's Cooperating Sequential Processes (1965, EWD123) is where shared-state concurrency was first put on rigorous footing. He posed the mutual exclusion problem — two processes accessing a critical section, only one at a time, no fairness assumptions, no atomic instructions beyond load and store — and gave Dekker's algorithm as a solution. The paper is forty pages of careful reasoning about exactly the lost-update problem, decades before multicore CPUs existed, in a setting where the only "concurrency" was an OS time-slicing two batch jobs on a single processor. The lesson the paper teaches, which has not aged a day: correctness in concurrency is a property of the protocol, not of the implementation. If the protocol is wrong, no amount of testing or "obviously correct" code review catches it. If the protocol is right, you can implement it in any language and on any hardware and it will work — provided your hardware honours the memory model the protocol assumes (which is the entire content of Part 3).
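Dekker's algorithm is short enough to sketch. The version below is illustrative, not production code: under CPython's GIL, plain loads and stores happen to be sequentially consistent, which is exactly the memory model the protocol assumes; on real hardware or a free-threaded build you would need the fences of Part 3.

```python
import sys, threading

sys.setswitchinterval(1e-4)      # hand the GIL over often so the spin loops make progress

flag = [False, False]            # flag[i]: thread i wants the critical section
turn = 0                         # whose turn it is to back off
counter = 0                      # the protected shared state

def lock(i):
    other = 1 - i
    flag[i] = True
    while flag[other]:           # contend until the other thread isn't interested
        if turn == other:        # their turn: withdraw politely
            flag[i] = False
            while turn == other:
                pass             # busy-wait for our turn
            flag[i] = True

def unlock(i):
    global turn
    turn = 1 - i                 # hand the turn over
    flag[i] = False

def worker(i, n):
    global counter
    for _ in range(n):
        lock(i)
        counter += 1             # safe: mutual exclusion comes from the protocol alone
        unlock(i)

N = 2_000
ts = [threading.Thread(target=worker, args=(i, N)) for i in (0, 1)]
for t in ts: t.start()
for t in ts: t.join()
print(counter)  # 4000 — exact, using nothing but loads and stores
```

This is Dijkstra's point made executable: the correctness lives entirely in the protocol (the flags and the turn variable), not in any special instruction.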

Why the OS scheduler is your adversary

The scheduler is not malicious, but for the purpose of reasoning about correctness, you must treat it as if it were. A correct concurrent program is correct under every interleaving the scheduler might produce, including the worst one. This is the adversarial scheduler model from theoretical CS, and it is the right mental model for production code. Why: the kernel's CFS (or Linux 6's EEVDF) does not know about your data structures. It picks the next thread to run based on its own fairness and latency goals. If your code is correct only on the schedules CFS happens to produce on your laptop, you will discover the schedules CFS happens to produce on a 64-core EPYC under load — and they will be different. The cheapest way to write correct concurrent code is to assume the scheduler will pick the worst interleaving, then engineer it so that even the worst interleaving is correct.

Why "looks correct" is the most dangerous review comment

Code reviewers are trained on sequential code. In sequential code, counter += 1 does what it says. In concurrent code, the source describes the programmer's intent, not the runtime's behaviour — and the gap between them is invisible to inspection. The Therac-25 medical-accelerator deaths in the 1980s, the Knight Capital ₹3,000 crore loss in 2012 (in 45 minutes of mis-routed orders, partly attributed to a non-atomic flag flip), the Mars Pathfinder priority-inversion bug — every one passed code review. The reviewer who says "looks correct" without checking the protocol against an adversarial schedule is doing the equivalent of approving a multiplication routine without checking it for negative inputs. The defence is not better reviewers; it is tools that exhaust the schedule space — race detectors (TSAN), model checkers (TLA+, Loom), exhaustive interleaving exploration (CDSChecker). Part 15 of this curriculum is dedicated to those tools precisely because human review provably cannot catch this class of bug.

The cost of synchronisation, in numbers

Synchronisation is not free, and the costs are dramatic. Rough numbers for x86 (libstdc++ on Linux, Skylake-class):

  • Uncontended std::atomic<int>::fetch_add (lock xadd): ~6 ns
  • Contended fetch_add on a hot cache line, 8 cores hammering: ~80 ns per attempt (the line bounces M→I→M between cores)
  • Uncontended std::mutex::lock(): ~22 ns (one CAS in user space, no syscall)
  • Contended std::mutex::lock() over 8 cores: ~1.4 µs (futex syscall to park; wake-up overhead)
  • Cross-core message via std::atomic flag (the so-called "load-acquire/store-release ping"): ~80 ns

For a counter incremented one million times per second per thread, with eight threads: option A (mutex) costs 8M × 1.4 µs = 11.2 seconds of total CPU per wall-clock second — more than your eight cores have. Option B (atomic on the same line) costs 8M × 80 ns = 640 ms of CPU per second — fits, but the cores spend 80% of their time waiting for the cache line. Option C (per-thread, aggregate at 1 Hz) costs 8M × 1 ns local increment + 1 read-out per second = essentially zero contention. The arithmetic is why "structure-the-program-to-not-share" is the pattern senior engineers reach for first.
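The back-of-envelope arithmetic, spelled out with the illustrative latencies above:

```python
ops_per_sec = 8 * 1_000_000           # 8 threads × 1M increments per second

mutex_cpu  = ops_per_sec * 1.4e-6     # contended mutex: ~1.4 µs per increment
atomic_cpu = ops_per_sec * 80e-9      # contended fetch_add: ~80 ns per increment
local_cpu  = ops_per_sec * 1e-9       # per-thread local increment: ~1 ns

print(f"mutex : {mutex_cpu:5.2f} s CPU per wall-clock second")  # 11.20 — over 8 cores' budget
print(f"atomic: {atomic_cpu:5.2f} s CPU per wall-clock second") #  0.64
print(f"local : {local_cpu:5.3f} s CPU per wall-clock second")  #  0.008
```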

Where this leads next

The next chapter (safety-vs-liveness) draws the formal distinction between something bad never happens (safety — no two threads in the critical section, no lost update) and something good eventually happens (liveness — the thread waiting for the lock eventually gets in, the program eventually makes progress). Every later mechanism in the curriculum is judged against both lenses; a deadlock is a liveness failure, a data race is a safety failure, and the two require different proofs.

After Part 1 establishes the vocabulary, Part 2 plunges into the machine — cache lines, store buffers, MESI coherence — because every claim in this chapter ("the load and the store are separate instructions", "the cache line ping-pongs between cores", "lock xadd is one indivisible op") only makes sense if you can see the hardware. Part 3 then layers the language-level memory model on top, explaining why C++'s memory_order_seq_cst exists and why ARM does not give it to you for free.

The lost update from this chapter will reappear, in a different costume, in every part. In Part 4 it returns as the ABA problem in a CAS loop. In Part 7 it returns as a torn read across a non-atomic 16-byte pointer-and-counter pair. In Part 14 it returns as message reordering between actors. The mechanism changes; the underlying fact — that shared mutable state without an ordering protocol is broken — does not.

Reproduce it yourself

# Reproduce this on your laptop (Linux / macOS, Python 3.13+ for the no-GIL build)
# 1. Standard CPython — race still fires, slowly:
python3 bug.py
# 2. Free-threaded CPython 3.13 — race fires immediately:
python3.13t bug.py
# 3. Or in C, without any GIL question:
cat > bug.c <<'EOF'
#include <pthread.h>
#include <stdio.h>
long counter = 0;
void *bump(void *_) { for (int i=0;i<1000000;i++) counter++; return 0; }
int main(){ pthread_t a,b; pthread_create(&a,0,bump,0); pthread_create(&b,0,bump,0);
  pthread_join(a,0); pthread_join(b,0); printf("counter = %ld\n", counter); }
EOF
gcc -O0 -pthread bug.c -o bug && ./bug
# expect: a number less than 2000000, different on every run
# (note: at -O2 the compiler may fold the loop into a single add, hiding the per-iteration race)