Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Concurrency vs parallelism vs async

Riya, a backend engineer at PaySetu, opens her laptop on a Sunday because the order-validation service is doing something strange: it accepts 8 000 requests per second on a 16-core machine, but adding more cores does nothing — throughput sits at the same number on a 64-core host. The product manager calls it a "concurrency problem". Her tech lead calls it a "parallelism problem". The on-call SRE swears it is an "async problem". All three are using the words as if they are interchangeable, and they are not. The fix lives in exactly one of the three, and picking the wrong one wastes the next two weeks.

Concurrency is the property of structuring work so that multiple tasks make progress in overlapping time windows. Parallelism is the property of executing work simultaneously on multiple physical compute units. Async is one specific implementation technique — cooperative scheduling on a single thread, where tasks voluntarily yield at I/O boundaries. Concurrency and parallelism are independent of each other, and async is only one of several ways to get concurrency without parallelism; conflating the three is the source of more concurrency bugs than every memory-model subtlety combined.

The three words name three different things

Concurrency is about structure. A program is concurrent if it is organised as multiple independently-progressing tasks — even if there is exactly one CPU core that interleaves them a microsecond at a time. Erlang on a single core is concurrent. A Python program with two threading.Thread objects under the GIL is concurrent. A single-threaded asyncio event loop is concurrent. None of those three programs runs anything in parallel.

Parallelism is about execution. A program is parallel if at some moment in physical time, two or more pieces of its work are literally executing on two or more compute units at once — two CPU cores, two SIMD lanes inside one core, or a thousand CUDA threads on a GPU. A C program that runs a single for loop on a single core is not parallel, even on a 64-core box. A NumPy a + b over a million-element array is parallel — across SIMD lanes, even on one core.

Async is the narrowest of the three. It is a specific implementation technique for concurrency — cooperative scheduling on a single OS thread, where each task explicitly yields control at known points (an await in Python, an await in JavaScript, a .await in Rust, a yield in older Python generators). The runtime keeps a queue of ready tasks and picks the next one only when the current one yields. There is no preemption — if a coroutine spends 200 ms doing CPU work without an await, every other coroutine on the same loop is frozen for those 200 ms.
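A minimal sketch of that hand-off, using only the standard library: two coroutines on one thread, interleaving only at their await points.

import asyncio

async def task(name):
    for i in range(3):
        print(f"{name} step {i}")
        await asyncio.sleep(0)      # explicit yield: the loop may now run the other task

async def main():
    # One thread, one event loop, two tasks; the output interleaves A and B.
    await asyncio.gather(task("A"), task("B"))

asyncio.run(main())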

Figure: Concurrency, parallelism, async — three orthogonal axes. A grid of "concurrent?" against "parallel? (≥2 compute units executing simultaneously)", one set of example programs per cell; async sits inside the concurrent-but-not-parallel cell as a labelled subset.

  • Concurrent, not parallel: Python asyncio on 1 core; Python threads under the GIL; Erlang on 1 BEAM scheduler. Async lives here.
  • Concurrent and parallel: Java ForkJoinPool on 16 cores; Rust tokio multi-thread runtime; Go with GOMAXPROCS=N.
  • Sequential, not parallel: a plain Python script; a single-threaded C `for` loop.
  • Sequential, parallel: NumPy `a+b` (SIMD lanes); a single GPU kernel launch.

Illustrative — the cell labels are common cases, not exhaustive. A program can sit in different cells at different phases.
Concurrency and parallelism are independent axes. A program can be neither, either, or both. Async is a specific implementation of "concurrent but not parallel".

Why this matters: when Riya's order-validation service stops scaling at 16 cores, the right question is not "is it concurrent?" — every modern Python web server is concurrent — but "is it parallel, and if not, what is preventing parallel execution?" In her case the answer is the GIL: the workload is concurrent (multiple coroutines, multiple threads) but not parallel (only one thread holds the GIL at a time during pure-Python CPU work). Adding cores does not help. Switching to multiprocessing, or moving the hot loop to NumPy, or rewriting the validator in Rust does.

What the CPU is actually doing in each case

Watch a single physical CPU core for 100 milliseconds.

In a purely sequential program, the core runs your one thread's instructions back-to-back. Cache lines stay warm. Branch predictors learn your patterns. There is no scheduler overhead, no context switch, no synchronisation cost. The maximum throughput is "one core's worth of work per wall-clock second" and that is also the minimum.

In a concurrent-but-not-parallel program — say Python threading with the GIL, or asyncio on one event loop — the core still runs one thread of execution at any given microsecond, but the runtime swaps between tasks. With OS threads under the GIL, the swap happens every few milliseconds (CPython offered the GIL up every 100 bytecode instructions by default in 2.x; since 3.2 it uses a 5 ms switch interval). With asyncio, the swap happens only when a coroutine voluntarily awaits. Either way, no two tasks execute simultaneously on different cores. Why: under the GIL, only one thread of CPython bytecode executes at a time, regardless of how many cores you have. The other threads are scheduled by the OS but blocked on the GIL. This is why a CPU-bound Python program with threading does not get faster on a 64-core box — the bottleneck is not core count, it is a single global lock inside the interpreter.
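The hand-off cadence is visible from Python itself; a two-line check (sys.getswitchinterval and sys.setswitchinterval are the real CPython APIs, and 0.005 is the documented default):

import sys
print(sys.getswitchinterval())    # 0.005 on a default CPython 3.2+ build, i.e. 5 ms
sys.setswitchinterval(0.001)      # swap more often; still only one thread runs bytecode at a time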

In a truly parallel program — Java with 16 worker threads, Rust tokio with the multi-threaded runtime, Go with GOMAXPROCS=16, a NumPy operation that fans out to AVX-512 SIMD — the OS scheduler places different threads on different cores, and at the same wall-clock microsecond, two or more cores are executing your code's instructions. Cache lines may bounce between cores (see Part 2 — false sharing); atomics on shared state cost real cycles; but you get N× throughput up to the point where contention dominates.

An async program is the second case, with the interleaving done in user space rather than by the OS scheduler. There is one thread, one core's worth of CPU, and a queue of tasks. The event loop picks a task, runs it until the task hits await, and at that exact instant the loop saves the task's state and picks the next ready task. The await is not a function call to "wait"; it is the explicit hand-off back to the scheduler. Forgetting this is the most common async bug: a coroutine that does 50 ms of CPU work without an await blocks every other coroutine on the loop for 50 ms, which on a 1 000 RPS server means roughly 50 requests stalled behind it.
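A minimal sketch of that bug, with the fix left as a comment; the heartbeat task and the 50 ms figure are illustrative, not from PaySetu's code:

import asyncio, time

async def heartbeat():
    # Should tick every 10 ms; a larger gap means something blocked the loop.
    prev = time.perf_counter()
    for _ in range(20):
        await asyncio.sleep(0.01)
        now = time.perf_counter()
        if now - prev > 0.03:
            print(f"loop blocked for {now - prev:.3f}s")
        prev = now

async def handler():
    time.sleep(0.05)                             # 50 ms of CPU-style work, no await: the loop freezes
    # await asyncio.to_thread(time.sleep, 0.05)  # the fix: offload, the loop keeps ticking

async def main():
    await asyncio.gather(heartbeat(), *[handler() for _ in range(3)])

asyncio.run(main())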

Demonstrating all three with one Python program

Here is a runnable benchmark that shows the three cases on a CPU-bound and an I/O-bound workload. Save as cpvsa_demo.py and run with python3 cpvsa_demo.py.

import asyncio, multiprocessing, threading, time, math, urllib.request

# ---- the work ----
def cpu_work(n):
    """CPU-bound: prime check up to n. No I/O, no GIL release."""
    total = 0
    for i in range(2, n):
        if all(i % p for p in range(2, int(math.isqrt(i)) + 1)):
            total += 1
    return total

def io_work(url):
    """I/O-bound: HTTP fetch. Network wait dominates."""
    return len(urllib.request.urlopen(url, timeout=5).read())

# ---- the experiments ----
N = 200_000
URLS = ["http://example.com"] * 8

def sequential_cpu():
    t = time.perf_counter()
    [cpu_work(N) for _ in range(4)]
    return time.perf_counter() - t

def threaded_cpu():
    t = time.perf_counter()
    ts = [threading.Thread(target=cpu_work, args=(N,)) for _ in range(4)]
    [x.start() for x in ts]; [x.join() for x in ts]
    return time.perf_counter() - t

def multiprocessed_cpu():
    t = time.perf_counter()
    with multiprocessing.Pool(4) as p:
        p.map(cpu_work, [N] * 4)
    return time.perf_counter() - t

async def async_io():
    t = time.perf_counter()
    loop = asyncio.get_running_loop()
    # urllib is a blocking API, so each fetch is offloaded to the default
    # thread-pool executor; the event loop itself stays free while they wait.
    await asyncio.gather(*[loop.run_in_executor(None, io_work, u) for u in URLS])
    return time.perf_counter() - t

if __name__ == "__main__":
    print(f"sequential CPU x4 :  {sequential_cpu():.2f}s   (1 core, 1 task at a time)")
    print(f"threaded   CPU x4 :  {threaded_cpu():.2f}s   (concurrent, NOT parallel — GIL)")
    print(f"multi-proc CPU x4 :  {multiprocessed_cpu():.2f}s   (concurrent AND parallel)")
    print(f"async      I/O x8 :  {asyncio.run(async_io()):.2f}s   (concurrent, 1 thread, I/O parallel)")

Sample run on a 4-physical-core x86 laptop (Python 3.12, Linux):

sequential CPU x4 :  4.31s   (1 core, 1 task at a time)
threaded   CPU x4 :  4.42s   (concurrent, NOT parallel — GIL)
multi-proc CPU x4 :  1.18s   (concurrent AND parallel)
async      I/O x8 :  0.34s   (concurrent, I/O overlapped via thread pool)

Three observations make the chapter:

threaded_cpu is slower than sequential_cpu, by about 2.5%. Switching between threads under the GIL adds context-switch and lock-handoff overhead without any parallelism benefit. A reader who internalises this stops reaching for threading for CPU work in pure Python forever. Why: a thread must hold the GIL to execute CPython bytecode, and only one thread can hold it at a time. The OS will preempt one thread and schedule another, but the second thread immediately blocks on the GIL until the first releases it. The two threads never execute Python bytecode simultaneously; you pay all the synchronisation cost of multithreading and get none of the throughput.

multiprocessed_cpu is ≈3.7× faster because each process has its own interpreter and its own GIL. The OS schedules the four processes on four cores; they run in parallel. The cost is a fork/spawn at the start (≈40 ms here) and inter-process serialisation (pickle) for arguments and results. For long-running CPU work that cost amortises away.

async finishes 8 fetches in 0.34 s. Because urllib is a blocking API, the demo hands each fetch to the default thread-pool executor via run_in_executor; a natively non-blocking client (asyncio streams, or a library such as aiohttp) would do the same fan-out on a single thread. Either way, while the requests are in flight the kernel and the network are doing the work and the event loop stays unblocked, so all eight waits overlap. If you replaced io_work with cpu_work inside an async function (and dropped the run_in_executor), the eight tasks would run sequentially, taking ≈8.6 s — async does not magically parallelise CPU work.
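A small variation of the demo makes that last point concrete. This is a sketch meant to be pasted into cpvsa_demo.py; it reuses cpu_work and N from above, and the ≈8.6 s expectation is inferred from the sample run rather than measured:

async def async_cpu_naive():
    # Eight cpu_work calls on one event loop. Nothing inside ever awaits, so
    # each coroutine runs to completion before the next starts: expect roughly
    # the sequential time for eight tasks (≈8.6 s here), not an eighth of it.
    t = time.perf_counter()
    async def one():
        return cpu_work(N)          # pure CPU, never yields back to the loop
    await asyncio.gather(*[one() for _ in range(8)])
    return time.perf_counter() - t

# under the __main__ guard:
#     print(f"async      CPU x8 :  {asyncio.run(async_cpu_naive()):.2f}s   (loop blocked, sequential)")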

When to reach for which

The decision tree is short and the cost of getting it wrong is large.

Figure: Decision tree: which mechanism for which workload. A flowchart that picks the mechanism by what the work is bottlenecked on, starting from "Is the work CPU-bound or I/O-bound (network, disk, syscall wait)?":

  • CPU-bound, one core is enough: stay sequential, or vectorise with NumPy / SIMD on 1 core.
  • CPU-bound, need more than one core: multiprocessing, or a C extension that releases the GIL.
  • I/O-bound, few concurrent waits: threads, if blocking APIs are unavoidable.
  • I/O-bound, many small fan-outs: async / await, cooperative scheduling, on the order of 10⁵ tasks per thread.

Trap: never run CPU-bound code inside an async coroutine without offloading. A 50 ms `for` loop inside an `async def` blocks the entire event loop for 50 ms — every other coroutine waits.
The most expensive bug in this space: writing CPU-bound work inside an `async def` and discovering at 1 000 RPS that the p99 latency has crossed your error budget for the whole quarter.

For PaySetu's order-validation problem, the answer is on the right side of the tree. Validation is mostly cheap CPU work (parse JSON, schema check) plus some I/O (Redis lookup, downstream call to fraud check). The team had been using asyncio because the I/O dominates — a sensible default. The bug was that one validator step ran a regular-expression check against a large compiled pattern; under load the regex took 8 ms per request and ran inside an async def without an await or an asyncio.to_thread. Eight ms × 1 000 RPS on a single event loop = 8 000 ms of CPU work per second, which by definition cannot fit on one core. Replacing that one line with await asyncio.to_thread(big_regex.search, payload) moved the regex to the default thread pool and unblocked the loop. p99 dropped from 380 ms to 28 ms, throughput climbed from 8 000 to 22 000 RPS on the same hardware. They never needed more cores; they needed to stop blocking the loop.

Common confusions

  • "asyncio is faster than threads" Not in general — async is faster for I/O-bound fan-out on Python, where the per-task cost of an OS thread (kernel bookkeeping, a scheduler entry, and megabytes of reserved stack) dominates. For four CPU-bound tasks it makes no difference; for one I/O-bound task it makes no difference. The win shows up at thousands of concurrent waits.
  • "Multithreading gives me parallelism" In Python under the GIL, no. In Java, Rust, C++, Go, yes — each language has a different threading model. State the language before claiming parallelism.
  • "Async means no race conditions" Async on a single event loop avoids preemptive races (the runtime cannot interrupt you mid-statement), but it does not avoid interleaving races: any state you mutate before an await and read after an await can be modified by another coroutine that ran during the await. Race conditions are about shared mutable state and points where another task can run, not about whether a thread or a coroutine ran it (see the sketch after this list).
  • "Concurrency improves performance" Concurrency is about structure; it can improve responsiveness (the UI does not freeze, the next request gets served while the previous one is still computing) but if the work is CPU-bound and the resources are saturated, concurrency without parallelism just adds overhead. The threaded version of cpu_work above was slower than the sequential one.
  • "Thread-safe means safe to call from multiple threads" Thread-safe is not a single property — it can mean atomic (each call appears indivisible), linearizable (atomic plus a real-time order), race-free (no undefined behaviour from data races), or "uses a mutex internally" (and therefore safe in isolation but not necessarily safe to compose). State which one. A linearizable counter and a Mutex<HashMap> are both "thread-safe" but they offer different guarantees and different composability.
  • "Parallelism on one core via SIMD does not count" It counts. AVX-512 doing 16 single-precision multiplies in one instruction is parallel execution by any honest definition; the parallelism just lives inside the core rather than across cores. NumPy on a vector operation is the closest most Python programmers come to true parallelism without leaving the GIL behind.
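The interleaving race from the third bullet, as a minimal sketch; the balance example is hypothetical, and asyncio.sleep(0) stands in for any real await:

import asyncio

balance = 100                        # shared mutable state

async def withdraw(amount):
    global balance
    if balance >= amount:            # read ...
        await asyncio.sleep(0)       # ... any await: another coroutine may run here ...
        balance -= amount            # ... write based on a now-stale read

async def main():
    await asyncio.gather(withdraw(80), withdraw(80))
    print(balance)                   # -60: both checks passed before either write ran

asyncio.run(main())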

Going deeper

The papers and talks that named the problem

Robert Harper's lecture notes draw the line between concurrency and parallelism back to Carl Hewitt's actor formalism (1973) and Hoare's CSP (1978), but the modern consensus crystallised around Rob Pike's 2012 talk Concurrency Is Not Parallelism and Robert Harper's Parallelism is not concurrency essay (2014). Both arguments rest on the same observation: concurrency is a property of the program, parallelism is a property of the execution, and a single program can be concurrent on a uniprocessor and parallel on a multiprocessor without changing a line of source. Erlang takes this to its logical end — the language is concurrent everywhere, and the BEAM VM decides at runtime how many physical schedulers (one per core, by default) execute the program in parallel. The programmer never sees pthread_create or tokio::spawn; they see spawn(fun() -> ...) and trust the VM.

The GIL is not a bug — it is a contract

CPython's Global Interpreter Lock is regularly described as "a flaw to be removed". That description is a category error. The GIL is a contract — it guarantees that reference counts, dictionary internals, and most C extensions can assume single-threaded access without atomic operations. Removing the GIL (which Python 3.13 began under the --disable-gil build) requires every C extension that mutates Python objects to add atomic refcount operations, fine-grained locking on dict and list internals, and a memory model. The Python core team's measurement on the no-GIL build, before optimisations, was a 10–15% slowdown on single-threaded workloads — the cost of those atomics. Whether that cost amortises in production is the open question of the next five years. Until then, "the GIL is the price of CPython's compatibility surface" is a more accurate description than "the GIL is a mistake".

Async cancellation is not a free lunch

The hidden third axis under "async" is cancellation. Threads are preemptable: kill the thread (modulo all the well-known reasons not to) and the OS reclaims its stack. Coroutines are not — cancelling a coroutine raises an exception at the next await point, which means a tight CPU loop inside an async def is uncancellable. KapitalKite's order-router learned this on a Diwali deploy: a coroutine got stuck in a degenerate JSON-decoder branch that did 4 seconds of CPU work without an await; the orchestration layer issued a cancellation; the cancellation sat in the queue for those 4 seconds; the upstream timed out; the retry doubled the load; cascade. The fix was not "switch to threads" — it was a loop.call_later(timeout, task.cancel) plus a to_thread wrapper around the JSON parse. Async cancellation works, but it works at await points. Why: an asyncio.CancelledError is delivered by the runtime as the coroutine resumes from its current await. If the coroutine is not at an await, the cancellation cannot fire. This is a property of cooperative scheduling, not a flaw — preemptive cancellation requires the runtime to interrupt the thread mid-instruction, which Python deliberately does not do.
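A minimal sketch of that behaviour; the 2-second figure and the structure are illustrative, not KapitalKite's code:

import asyncio, time

async def stuck():
    time.sleep(2)                    # degenerate CPU-bound stretch: no await, cannot be cancelled here
    await asyncio.sleep(5)           # first await point: this is where CancelledError can land
    return "done"

async def main():
    task = asyncio.create_task(stuck())
    asyncio.get_running_loop().call_later(0.1, task.cancel)   # request cancellation after 100 ms
    t0 = time.perf_counter()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # Requested at ~0.1 s, delivered at ~2 s: the cancel had to wait for an await point.
    print(f"cancelled after {time.perf_counter() - t0:.1f}s")

asyncio.run(main())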

What "release the GIL" actually means

A C extension that does Py_BEGIN_ALLOW_THREADS is releasing the GIL — it tells CPython "I am about to spend time in C code that does not touch any Python object, so let other threads run". NumPy releases the GIL around large array operations (the +, *, FFT, BLAS calls); that is why a NumPy program with threading does see parallel speedup, even on CPython. The mental model is: the GIL covers interpreter state, not C state. Any code path that stays in C, holds no Python references, and does not call back into Python is free to run truly parallel. This is the loophole that the entire scientific-Python ecosystem (NumPy, SciPy, scikit-learn, PyTorch with torch.set_num_threads) exploits.
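A sketch of the loophole, assuming NumPy is installed. Whether the two-thread run actually halves the time depends on having free cores and on whether your BLAS already multithreads internally; pinning it (for example OPENBLAS_NUM_THREADS=1 or OMP_NUM_THREADS=1) isolates the effect of the Python threads:

import threading, time
import numpy as np

a = np.random.rand(1000, 1000)

def work():
    for _ in range(10):
        a @ a                        # BLAS matmul: C code that releases the GIL

t0 = time.perf_counter()
work(); work()
print(f"1 thread : {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
ts = [threading.Thread(target=work) for _ in range(2)]
[t.start() for t in ts]; [t.join() for t in ts]
print(f"2 threads: {time.perf_counter() - t0:.2f}s   # close to half, unlike pure-Python cpu_work")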

Where this leads next

The next chapter (why-concurrency-is-hard) catalogues the failure modes — race conditions, deadlocks, livelocks, starvation — that every concurrent program must defend against. The chapter after (safety-vs-liveness) draws the formal distinction between something bad never happens and something good eventually happens, which is the lens every later chapter uses to evaluate a synchronisation primitive.

Once Part 1 establishes the vocabulary, Part 2 plunges into the machine — cache lines, store buffers, MESI coherence — because concurrency without a machine model is folklore. Part 3 then layers the language-level memory model on top. By the end of Part 4 you will know exactly which one of "concurrent", "parallel", "async" PaySetu's order-validation needed, and which CPU instruction (lock cmpxchg, lock xadd, or none of the above) the fix compiled to.

Reproducing the numbers

# Reproduce this on your laptop (Linux / macOS, Python 3.10+)
python3 -c "import sys; assert sys.version_info >= (3, 10)"
curl -sO https://example.invalid/cpvsa_demo.py  # or save the code block above
python3 cpvsa_demo.py
# expect: threaded ≈ sequential, multi-proc ≈ N× faster, async ≈ I/O-fan-out fast