Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Safety vs liveness properties
Riya, an on-call SRE at KapitalKite, gets two pages within the same hour. Page one: "OrderBook total volume is 17 lots short of the exchange's confirmation; reconcile-or-roll-back fired at 09:23." Page two: "Trade-stream consumer has been idle for 11 minutes; backlog 4.2 M messages and rising." Both are concurrency bugs. They look similar in the dashboard — a service "broken in production" — but they are opposite kinds of broken. The first one is a thing that should never happen, that did. The second is a thing that should always eventually happen, that didn't. Tell them apart in the first sixty seconds and you fix the right problem; confuse them and you spend an afternoon adding a Lock to code that needs a wakeup().
Every correctness property of a concurrent program is either a safety property (something bad never happens — no two threads in the critical section, no torn read, no negative balance) or a liveness property (something good eventually happens — every request is answered, every waiter eventually wakes, no deadlock). Safety violations are demonstrated by a finite trace; liveness violations require an infinite trace. The two demand different proofs, different tools, and different fixes — confusing them is the most common diagnostic error in concurrent debugging.
Two failures, two shapes of trace
Riya's first page is a safety failure. The OrderBook had two threads simultaneously credit a partial fill against the same parent order — both read the remaining-quantity field as 42, both subtracted their fill of 9, both wrote back 33, when the correct answer was 24. A finite sequence of events — six instructions across two threads — proves the bug exists. Print the trace, point at it, file the JIRA. The system did the wrong thing in finite time.
Riya's second page is a liveness failure. The trade-stream consumer is alive — the process is up, threads are scheduled, the heap is healthy — but it is not making progress. One worker is parked on a condition variable that nobody is signalling; another is spinning on a CAS that always fails because a third worker holds a lock the second is also trying to take. There is no finite trace that says "the bug happened here". The bug is the absence of an event — a wake-up that should arrive and does not, a step the program should take and never takes. Liveness is about the infinite future.
Lamport (1977) gave the canonical definitions:
- Safety: a property such that, if it is violated, there exists a finite prefix of the execution that witnesses the violation. Informally — "something bad happens". Safety bugs include data races, lost updates, double-free, two threads in the critical section, broken invariants on shared data.
- Liveness: a property such that no finite prefix of an execution can witness its violation; only the infinite trace can. Informally — "something good never happens". Liveness bugs include deadlock, livelock, starvation, missed wake-ups, async cancellation leaks.
The asymmetry matters. To prove a safety property, you reason about every reachable state and show none of them is bad. To prove a liveness property, you reason about every infinite execution and show every one eventually reaches a good state. The first is a reachability argument; the second is a fairness argument, and it usually requires assumptions about the scheduler ("every thread that is enabled infinitely often is scheduled infinitely often").
Why: the formal distinction matters because every concurrency property decomposes into the two kinds. Alpern & Schneider proved (1985) that any property of an infinite execution can be written as the intersection of one safety property and one liveness property. So when you see a bug, the first triage question is always "is it that something happened that shouldn't have, or that something didn't happen that should have?" — and the answer narrows the search to one of two halves of the universe of possible bugs.
Reproducing both failures in Python
Here is a single program that, depending on a flag, exhibits a safety violation or a liveness violation. Run it both ways. The diagnostic output is dramatically different — and that difference is exactly what the formal distinction predicts.
```python
# safety_vs_liveness.py — run with: python3 safety_vs_liveness.py safety
#                             or:   python3 safety_vs_liveness.py liveness
import sys, threading, time

# ---------- SAFETY: lost update on a shared balance -------------
def run_safety():
    balance = [1000]  # start: ₹1000

    def transfer(amount, n):
        for _ in range(n):
            cur = balance[0]            # load
            time.sleep(0)               # invite a context switch
            balance[0] = cur - amount   # store — racy

    t1 = threading.Thread(target=transfer, args=(1, 50_000))
    t2 = threading.Thread(target=transfer, args=(1, 50_000))
    t1.start(); t2.start(); t1.join(); t2.join()
    expected = 1000 - 100_000
    print(f"[safety] expected balance={expected}, got={balance[0]}, "
          f"violation={'YES' if balance[0] != expected else 'NO'}")

# ---------- LIVENESS: deadlock from inverted lock order ---------
def run_liveness():
    A = threading.Lock(); B = threading.Lock()

    def t1_body():
        with A:
            time.sleep(0.01)  # hold A long enough for T2 to grab B
            print("[liveness] T1 has A, asking for B...")
            with B:
                print("[liveness] T1 got both")

    def t2_body():
        with B:
            time.sleep(0.01)
            print("[liveness] T2 has B, asking for A...")
            with A:
                print("[liveness] T2 got both")

    # daemon=True so the interpreter can exit even though the deadlocked
    # threads never finish; without it the process itself hangs on exit
    t1 = threading.Thread(target=t1_body, daemon=True)
    t2 = threading.Thread(target=t2_body, daemon=True)
    t1.start(); t2.start()
    t1.join(timeout=2.0)
    t2.join(timeout=2.0)
    if t1.is_alive() or t2.is_alive():
        print("[liveness] DEADLOCK — both threads parked, no progress in 2s")
    else:
        print("[liveness] completed (race didn't fire this run)")

if __name__ == "__main__":
    {"safety": run_safety, "liveness": run_liveness}[sys.argv[1]]()
```
Sample output:
```shell
$ python3 safety_vs_liveness.py safety
[safety] expected balance=-99000, got=-83417, violation=YES

$ python3 safety_vs_liveness.py liveness
[liveness] T1 has A, asking for B...
[liveness] T2 has B, asking for A...
[liveness] DEADLOCK — both threads parked, no progress in 2s
```
The two failures look the same in the operating system's eyes — a Python process running two threads — but the evidence you need to prove the bug is opposite-shaped. In run_safety, the bug is the printed mismatch itself. The five lines of output prove the program produced a state — balance=-83417 — that the source code's intent forbade. The trace is finite; it has already happened. Why: this is exactly the structure of every safety failure. There is a predicate INVARIANT(state) that should always hold (here: balance == 1000 - 1*total_decrements); the bug is a state where INVARIANT is false; the trace from the initial state to that state is the witness. Test runners, race detectors, and assertion failures all work this way.
In run_liveness, the bug is the silence. Neither thread prints "got both"; both join calls time out. Notice that the liveness-bug detection requires a timeout — you cannot tell the difference between "deadlocked" and "still working, almost done" by inspecting the program at a single moment. Why: a deadlock is the absence of a future event. To distinguish "absence" from "delay", you must wait, and pick a duration beyond which absence is treated as failure. This is why deadlock detectors use timeouts, why CI runs use --timeout, why every tokio::select! has a tokio::time::timeout partner. Liveness is the property that eventually the good thing happens; "eventually" is operationally a timeout.
In run_liveness there is no tsan-style automatic detector for the bug. The threading runtime is not in violation of any safety invariant when both threads are parked — both locks are correctly held, no field has been corrupted, the program could in principle still finish if one thread abandoned its lock (which it cannot). The bug is the protocol — acquire-A-then-B vs acquire-B-then-A — not the state. To find protocol bugs you need a model checker (TLA+, Alloy, loom) or a defensively-written runtime that watches for circular wait (Java's jstack deadlock detector).
Where the failure modes split, in production
KapitalKite's two pages, mapped onto the safety/liveness lens:
| Failure mode | Kind | Detector | Typical fix |
|---|---|---|---|
| Lost update on order qty | Safety | TSAN, asserts, reconciliation | Atomic CAS, mutex, or per-thread + aggregate |
| Two threads in critical section | Safety | Mutex (eliminates by construction); race detector | Use a mutex; verify the mutex is correctly placed |
| Torn read of 16-byte pair | Safety | TSAN flags it; manifests as nonsense values | `std::atomic<__int128>` or split into two fields |
| Deadlock (circular wait on locks) | Liveness | jstack, pstack, watchdog timeouts | Lock ordering, lock-free, or try_lock with backoff |
| Livelock (CAS retry storm) | Liveness | High CPU + zero progress; perf shows hot CAS | Backoff, ticket lock, helper-thread pattern |
| Starvation (one thread never gets in) | Liveness | Per-thread latency histograms with long tail | Fair lock, FIFO queue, ticket lock |
| Lost wake-up (notify fired before wait) | Liveness | Thread parked on cv, no signal in flight | Predicate-loop pattern, notify_all, monitor invariants |
| Async cancellation leak (await never returns) | Liveness | tokio-console / py-spy shows hung task | Structured concurrency, cancel tokens, select! timeouts |
The first triage at 03:00 is to ask: "did the system do something it shouldn't have, or not do something it should have?" The answer routes you to a different set of dashboards, a different set of tools, and a different shape of fix. Why: safety fixes typically constrain the program (add a lock, add an atomic, add a check) while liveness fixes typically liberate it (drop a lock you held too long, restructure to avoid circular wait, add a wake-up signal you forgot to send). A safety fix that restricts execution can introduce a liveness bug — a mutex prevents the lost update but creates the possibility of deadlock. A liveness fix that frees execution can introduce a safety bug — releasing a lock early means another thread might see an inconsistent intermediate state. The two are in tension, and senior concurrency engineers are the ones who can hold the tension without sacrificing one for the other.
Common confusions
- "Deadlock is a safety bug." No. Deadlock is the canonical liveness bug. The system in deadlock has not violated any invariant on the data — every lock is correctly held, every field is in a valid state. The bug is that progress has stopped. To detect a deadlock you must observe the absence of an event over time (a watchdog timeout, `jstack` showing all threads parked, a CI run hanging) — there is no single state that is "deadlock". Contrast with a lost update: a single state-snapshot reveals the corruption.
- "A race condition is a safety bug." Usually, but not always. Most race conditions (lost updates, torn reads, double-frees) are safety bugs because they leave the data in a forbidden state. But some race conditions — like a wake-up that races with the wait, missing the signal — are liveness bugs, because the data is fine; the bug is that one party is now waiting forever. The class "race condition" cuts across the safety/liveness divide.
- "Adding more locks fixes liveness bugs." Backwards. Locks make liveness worse — every lock is a new opportunity for deadlock, lock-ordering inversion, or priority inversion. Locks fix safety bugs (they prevent forbidden interleavings); they create the conditions under which liveness bugs become possible. The fix for a liveness bug is usually to remove a lock, restructure to avoid circular wait, or replace blocking with try-lock-and-back-off.
- "Liveness can be tested with unit tests." Only weakly. A unit test runs for a finite time; it can prove a safety bug exists (the test fails on a bad state) but cannot prove a liveness bug exists in the formal sense. What CI does is bound the wait — "if the test takes more than 10 seconds, treat it as a deadlock" — which catches obvious liveness failures but not subtle starvation where a thread gets in eventually but is delayed unfairly. Real liveness verification needs model checkers (TLA+, `loom`) or fairness-aware fuzzers.
- "Thread-safe means safety." "Thread-safe" is unspecified. State the property: atomic (each call indivisible — a safety property), linearizable (atomic plus real-time order — a safety property), deadlock-free (a liveness property), wait-free (every thread completes in bounded steps — a strong liveness property), internally synchronised (uses a mutex — composes badly across calls). A `Mutex<HashMap>` is internally synchronised — safe by itself, but a `get`-then-`put` pair is not atomic across the two calls.
- "If the trace is finite, it's a safety bug; if infinite, liveness." Close, but with a wrinkle. Lamport's definition is about witnesses: a safety bug always has a finite witness; a liveness bug never does. A specific run can be finite (the program crashed; the test timed out; the process was killed) without that finite run being a witness for either kind. The classification is about the property, not the run: "no two threads in the critical section" is a safety property because if it is violated, a finite prefix proves it. "Every request eventually returns" is a liveness property because no finite prefix can prove it.
Going deeper
Alpern–Schneider: every property is safety ∩ liveness
Bowen Alpern and Fred Schneider published Defining Liveness in 1985 and proved a remarkable theorem: every property of an infinite execution can be written as the intersection of a safety property and a liveness property. Concretely, given a property P, you can mechanically construct P_safe (the largest safety property implied by P) and P_live (a liveness property such that P = P_safe ∩ P_live). Mutual exclusion, for instance, decomposes into "at most one thread in the critical section at a time" (safety) and "every thread that requests entry eventually enters" (liveness). The decomposition is canonical, and it explains why proofs of correctness for concurrent algorithms always come in two parts: an invariant argument (no reachable state violates the invariant — safety) and a progress argument (every request is eventually served — liveness). Senior systems papers — Lamport's bakery algorithm, Herlihy's wait-free synchronisation, the Paxos invariants — every one is structured as safety-proof followed by liveness-proof.
Fairness: the assumption you need to prove liveness
You cannot prove liveness without saying something about the scheduler. Consider the simplest mutex protocol: a thread that wants the lock spins on compare_exchange. Under an adversarial scheduler, the scheduler can starve any single thread — pick the other thread every time, forever. Liveness fails. To get liveness, you assume fairness: every thread that is enabled (ready to make progress) is eventually scheduled. There are two flavours: weak fairness (a thread continuously enabled is eventually scheduled) and strong fairness (a thread enabled infinitely often is eventually scheduled). Linux's CFS gives you weak fairness in practice; real-time schedulers can give you stronger guarantees. Why: every TLA+ liveness specification ends with a WF_vars(action) or SF_vars(action) clause precisely because liveness without a fairness assumption is unprovable. The fairness clause says "assume the scheduler is not actively trying to starve us"; without it, even correctly-engineered code admits infinite traces in which the good thing never happens.
Why pthread_mutex gives you safety but not liveness
pthread_mutex_lock guarantees that two threads are never in the critical section together — that is the safety property, and it holds unconditionally. It does not guarantee that a thread waiting for the lock will eventually get in. Under POSIX, a default pthread_mutex is allowed to starve a waiting thread arbitrarily long if other threads keep barging in. To get a fairness guarantee — a liveness property — you need a fair mutex (a ticket lock, a queueing lock like MCS, or an OS-provided FIFO mutex). On Linux, pthread_mutex with PTHREAD_MUTEX_PRIO_INHERIT does not guarantee FIFO order; pthread_mutex_lock calls futex_wait, and the wake-up order is whatever the kernel decides. KapitalKite's order-matcher hit this in 2024 (hypothetically): a low-rate thread competing with a hot loop on a shared lock saw a 99.99th-percentile wait of 380 ms while the hot loop's median wait was 90 ns. Switching to an MCS queueing lock made the latencies symmetric — and slower on average. The trade-off between fairness and throughput is intrinsic.
Async cancellation as a liveness problem
Modern async runtimes (tokio, Python asyncio, .NET Task) have a uniquely subtle liveness bug class: cancellation leak. A task is cancelled, but the cancellation signal is dropped on the floor — the task continues to hold a connection, a lock, or memory. The future never completes, the task is never reaped, the resource accumulates. Tokio's JoinHandle::abort() only delivers a cancel-at-await-point; if the task is in a non-yielding loop, the abort waits. Python's asyncio.Task.cancel() raises CancelledError at the next await — code that holds a lock across the await may release it on cancel, code that does not is leaked. The fix is structured concurrency (Trio's nursery, tokio's JoinSet) which guarantees that no task outlives its scope. Why: cancellation leak is purely a liveness failure — the cancelled task is not corrupting any data (no safety violation), it is simply not making progress towards completion when it should. The detector is tokio-console or py-spy dump, which show the hung task's stack frame; the fix is structural, not localised.
Where this leads next
The next chapter (deadlock, livelock, starvation: the three liveness failures) zooms into the liveness side of this divide and gives you the diagnostic tree for which one you have. After Part 1 establishes the vocabulary, Part 2 plunges into the machine — and the safety/liveness distinction reappears at the hardware level (cache-coherence protocols guarantee safety; forward-progress guarantees on x86 give you liveness for lock instructions, but not on every architecture).
Part 5 (mutexes) and Part 7 (lock-free data structures) split this curriculum's middle along the same line. Mutex-based code prioritises safety and has known liveness pitfalls (deadlock, priority inversion). Lock-free code prioritises liveness — non-blocking guarantees that some thread always makes progress — and pays for it with subtler safety properties (linearizability proofs are harder, ABA bugs lurk). Part 15 (testing) returns to the divide: TSAN catches safety; loom and TLA+ catch liveness; you need both because no tool catches both.
References
- Proving the Correctness of Multiprocess Programs — Leslie Lamport (IEEE TSE 1977) — where the safety/liveness terminology was introduced.
- Defining Liveness — Bowen Alpern and Fred Schneider (Information Processing Letters 1985) — the formal definitions and the proof that every property = safety ∩ liveness.
- Recognizing Safety and Liveness — Bowen Alpern and Fred Schneider (Distributed Computing 1987) — topological characterisation of the two classes.
- Cooperating Sequential Processes — E. W. Dijkstra (EWD 123, 1965) — the original mutual-exclusion problem statement; safety and progress conditions stated in plain English.
- The Temporal Logic of Actions — Leslie Lamport (TOPLAS 1994) — TLA, the logic in which both safety and liveness are first-class, with the
WF/SFfairness operators. - Specifying Systems — Leslie Lamport (Addison-Wesley 2002) — the practitioner's guide to TLA+, which makes the safety/liveness distinction operational.
- Real-World Concurrency — Bryan Cantrill & Jeff Bonwick (CACM 2008) — production stories where the safety/liveness divide is the diagnostic frame.
- Internal — why-concurrency-is-hard-shared-state — the previous chapter, where the lost-update example that is now classified as a safety bug first appeared.
```shell
# Reproduce this on your laptop (Linux / macOS, Python 3.11+)
python3 safety_vs_liveness.py safety
python3 safety_vs_liveness.py liveness
# expect: safety run prints violation=YES with a different short balance each time
# expect: liveness run prints DEADLOCK after a 2-second timeout
```