Python: the GIL, PyPy, CPython 3.13 no-GIL
Karan runs the fraud-scoring service at PhonePe. The service receives a UPI transaction, runs roughly 220 features through a stack of decision trees, and returns "approve" or "review" within a 40 ms p99 budget. In 2024, with traffic growing 38% year-on-year, the team rented bigger boxes — c6i.8xlarge, 32 vCPUs, 64 GB RAM — and watched in confusion as htop showed exactly one core pinned at 100% while the other thirty-one sat at 2–5%. The Python service was using 1/32 of the metal it ran on. Karan's first instinct was to add gunicorn -w 32 workers, which worked but tripled memory use because every worker loads its own copy of the 800 MB scikit-learn model. Someone in the channel said "just use threads", which they tried, and discovered to their fury that 32 threads ran the same CPU-bound code 8% slower than 1 thread. That is the Global Interpreter Lock writing the bill. The GIL is one mutex inside the CPython interpreter that says only one thread may execute Python bytecode at a time, period. It is the single most-discussed performance constraint in the Python ecosystem, the reason multiprocessing exists, the reason NumPy releases the GIL during BLAS calls, the reason PyPy and Cinder and Pyston exist as alternative runtimes, and the reason CPython 3.13 finally shipped an experimental no-GIL build. Each escape has a different cost shape — and choosing the wrong one is how you end up paying for 32 cores while using 1.
The GIL is a single mutex in CPython that serialises Python bytecode execution across threads — so a CPU-bound multithreaded program cannot use more than one core. Three escapes work today: spawn processes (memory cost, IPC cost), drop into a native extension that releases the GIL during its work (NumPy, lxml, Cython nogil blocks), or switch interpreter (PyPy's tracing JIT, ~5× faster on long-running pure-Python). CPython 3.13's experimental --disable-gil build removes the lock entirely at a 10–20% single-threaded cost — the first time pure-Python multithreading actually scales.
What the GIL actually is, mechanically
The Global Interpreter Lock lives in Python/ceval_gil.c: a single mutex, plus a condition variable used to hand it between threads. Every CPython thread, before executing a bytecode instruction that touches a Python object, must hold this lock. Since almost every bytecode touches a Python object — LOAD_FAST reads a PyObject*, BINARY_ADD (BINARY_OP since 3.11) calls __add__ which mutates refcounts, even POP_TOP decrefs — the GIL is held essentially continuously. The interpreter releases it on a periodic schedule (every 100 bytecode instructions before Python 3.2, every 5 ms wall-clock since) and around blocking I/O calls (sockets, file reads, time.sleep).
The reason the GIL exists is not laziness. CPython's memory model is reference counting — every PyObject carries a 64-bit refcount that is incremented when a new reference is created and decremented when one drops. Without a lock, two threads incrementing the same refcount concurrently would corrupt it, and a corrupted refcount means a use-after-free or a memory leak. The original 1992 design choice was: rather than make every refcount operation an atomic instruction (slow, complex, and at the time exotic on the platforms CPython supported), wrap the entire interpreter in one lock and have the threads take turns. The choice paid for itself for 30 years because most Python programs were single-threaded scripts where the lock was never contended.
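You can watch the count being protected from plain Python — a minimal illustration using only sys.getrefcount (the file name is just a label):

# refcount_peek.py — observe the per-object refcount the GIL protects
import sys

obj = []                      # a fresh, non-immortal object
print(sys.getrefcount(obj))   # 2: the name `obj` plus the temporary reference held by the call itself
alias = obj
print(sys.getrefcount(obj))   # 3: binding a new name did a plain, non-atomic increment
del alias
print(sys.getrefcount(obj))   # back to 2 — two threads doing these increments unsynchronised is the hazard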
The cost only became visible when two things changed simultaneously: machines went from 1 core to 32–128 cores, and Python moved from "scripting glue language" to "production service runtime". A web service handling 10,000 requests per second on a 32-core box wants those requests to run in parallel. With the GIL, they can only run concurrently — the OS time-slices threads on one core, but only one thread executes Python at any instant. For I/O-bound work (waiting on a database, an HTTP call, a Kafka fetch) this is fine because the thread releases the GIL while it waits. For CPU-bound work (parsing JSON in pure Python, scoring a decision tree, computing a hash) it is fatal.
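The distinction is easy to demonstrate. In the sketch below (the sleep stands in for a database or HTTP wait; the file name is illustrative), threads scale normally as soon as the work is waiting rather than computing:

# io_bound_threads.py — I/O-bound work scales under the GIL, because blocking calls release it
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    time.sleep(0.1)           # the GIL is released for the whole 100 ms wait

for workers in (1, 8):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_io, range(8)))
    print(f"{workers} thread(s): {time.perf_counter() - t0:.2f} s")
# Expect roughly 0.8 s with 1 thread and 0.1 s with 8 — the opposite of the CPU-bound results below.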
Why a 5 ms wall-clock interval (and not 100 bytecode instructions, as before Python 3.2): instruction counting makes the hold time unpredictable — 100 cheap LOAD_FASTs finish in microseconds, while 100 calls into slow __add__ implementations can take milliseconds — and on multi-core machines the old scheme caused heavy contention, with the running thread often re-acquiring the GIL before a waiting thread had even woken up (the behaviour David Beazley's 2010 talk made famous). A fixed wall-clock interval makes handoffs predictable and fair. The 5 ms default is on the order of an OS scheduling time slice, and it is tunable at runtime via sys.setswitchinterval().
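The interval is an ordinary runtime setting, as a quick standard-library check shows (a minimal sketch; the values are only examples):

# switch_interval.py — the 5 ms handoff schedule is a tunable, not a constant
import sys

print(sys.getswitchinterval())    # 0.005 by default: a thread may hold the GIL ~5 ms before a handoff
sys.setswitchinterval(0.001)      # shorter: fairer latency under thread contention, more handoff overhead
sys.setswitchinterval(0.005)      # restore the default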
Measuring the cost — and the three escapes
The way to internalise the GIL is to run a CPU-bound function under four configurations — one thread, K pure-Python threads, K processes, K threads running NumPy — and watch the wall-clock time. The same total work speeds up roughly K× in the process and NumPy-thread configurations, and not at all (slightly worse, in fact) with pure-Python threads.
# gil_demo.py — measure GIL cost across threads, processes, NumPy, multiprocessing
import time, threading, multiprocessing, math, sys

def cpu_bound_python(n: int) -> int:
    """Pure-Python loop — every iteration touches the GIL."""
    s = 0
    for i in range(n):
        s += int(math.sqrt(i)) & 0xFFFF
    return s

def cpu_bound_numpy(n: int) -> int:
    """NumPy — releases the GIL during the vectorised sqrt."""
    import numpy as np
    a = np.arange(n, dtype=np.float64)
    s = int(np.sum(np.sqrt(a).astype(np.uint64) & 0xFFFF))
    return s

def time_it(label: str, fn, *args, repeat: int = 3) -> float:
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    print(f"  {label:<40s} {best:>7.3f} s")
    return best

def run_threads(target, n: int, k: int):
    threads = [threading.Thread(target=target, args=(n // k,)) for _ in range(k)]
    for t in threads: t.start()
    for t in threads: t.join()

def run_processes(target, n: int, k: int):
    procs = [multiprocessing.Process(target=target, args=(n // k,)) for _ in range(k)]
    for p in procs: p.start()
    for p in procs: p.join()

if __name__ == "__main__":
    N = 40_000_000
    K = 4
    print(f"Total work N={N:,}, parallelism K={K}\n")
    print("Pure Python:")
    t1 = time_it("1 thread (baseline)", cpu_bound_python, N)
    tk = time_it(f"{K} threads (GIL)", lambda: run_threads(cpu_bound_python, N, K))
    tp = time_it(f"{K} processes", lambda: run_processes(cpu_bound_python, N, K))
    print(f"  thread speedup vs 1: {t1/tk:.2f}x   process speedup vs 1: {t1/tp:.2f}x\n")
    print("NumPy (releases GIL):")
    n1 = time_it("1 thread numpy", cpu_bound_numpy, N)
    nk = time_it(f"{K} threads numpy", lambda: run_threads(cpu_bound_numpy, N, K))
    print(f"  thread speedup vs 1: {n1/nk:.2f}x")
Sample run on a c6i.4xlarge (16 vCPU, Ice Lake, CPython 3.12.4):
Total work N=40,000,000, parallelism K=4
Pure Python:
1 thread (baseline) 4.821 s
4 threads (GIL) 5.247 s
4 processes 1.298 s
thread speedup vs 1: 0.92x process speedup vs 1: 3.71x
NumPy (releases GIL):
1 thread numpy 0.412 s
4 threads numpy 0.118 s
thread speedup vs 1: 3.49x
Walking the key lines. cpu_bound_python(n) is a tight loop that does pure bytecode work — every +=, every int(), every & is interpreter-level. Running it on 4 threads gives 0.92× the single-thread throughput — actually slower, because the GIL handoff has overhead (every 5 ms a thread releases the GIL, signals a waiter, and the OS scheduler context-switches). run_processes spawns 4 separate Python interpreters via multiprocessing.Process, each with its own GIL, each running on its own core — and gets 3.71× speedup, close to the theoretical 4×, with the 0.29× gap eaten by process startup and pickle serialisation. cpu_bound_numpy(n) does the same arithmetic but vectorised through NumPy's C kernels. NumPy's np.sqrt releases the GIL via Py_BEGIN_ALLOW_THREADS for the duration of the C-level loop, so 4 threads each running NumPy code can run in parallel, giving 3.49× speedup with one-tenth the wall-clock of the pure-Python version. The takeaway: the GIL prevents Python bytecode from running in parallel, not C code that has temporarily released it.
The numbers tell the operational story. If your service is CPU-bound and pure Python, threading is a trap — you pay coordination overhead for zero throughput. The fix is processes (different memory, different GIL), or pushing the hot work into a native extension that releases the GIL while it computes. The third option, switching to PyPy, is the same idea at a different level: PyPy's tracing JIT compiles your hot loop to machine code and runs it without going through the bytecode interpreter at all, so the GIL is held only briefly between trace transitions.
Why processes scale and threads don't: each multiprocessing.Process is a fork() (on Linux) producing a fresh Python interpreter with its own GIL, its own object table, its own refcount space. The four interpreters run on four cores entirely independently, sharing only the kernel's page tables (until first write — copy-on-write). The threads share one interpreter, one GIL, and consequently one CPU's worth of bytecode capacity. The price of processes is memory (each interpreter loads its own copies of imported modules — for an 800 MB scikit-learn model, four workers cost 3.2 GB) and IPC (sending a 1 MB DataFrame between processes via multiprocessing.Queue takes 5–15 ms of pickle work).
PyPy and the JIT path — pure-Python that actually runs fast
PyPy is an alternative Python implementation written in RPython (a restricted subset of Python that compiles to C), with a tracing JIT compiler that watches your program execute and compiles hot loops to machine code. For long-running pure-Python services the speedup over CPython is typically 4–10×, occasionally 50× on numerically-tight inner loops. PyPy still has a GIL (PyPy 7.x serialises bytecode execution with a global lock, for reasons similar to CPython's), so it doesn't solve the multi-core problem — but for the 1-thread bottleneck, PyPy buys you most of the headroom NumPy would have given you, without needing to express your code in NumPy idioms.
# pypy_vs_cpython_demo.py — show PyPy speedup on a pure-Python hot loop
# Run twice: once with python3, once with pypy3
import sys, time

def collatz_steps(n: int) -> int:
    """Pure-Python hot loop — perfect JIT target."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def benchmark(limit: int) -> tuple[int, float]:
    t0 = time.perf_counter()
    total = 0
    for i in range(2, limit):
        total += collatz_steps(i)
    return total, time.perf_counter() - t0

if __name__ == "__main__":
    impl = "PyPy" if "PyPy" in sys.version else "CPython"
    print(f"Running on {impl} {sys.version.split()[0]}")
    for limit in (50_000, 200_000, 1_000_000):
        total, dt = benchmark(limit)
        print(f"  collatz to {limit:>7,}: {dt:>7.3f} s  ({total:,} steps)")
Sample runs, same c6i.4xlarge:
Running on CPython 3.12.4
collatz to 50,000: 1.124 s (3,495,651 steps)
collatz to 200,000: 5.821 s (16,317,308 steps)
collatz to 1,000,000: 35.412 s (101,597,484 steps)
Running on PyPy 7.3.16 (Python 3.10.14)
collatz to 50,000: 0.182 s (3,495,651 steps)
collatz to 200,000: 0.241 s (16,317,308 steps)
collatz to 1,000,000: 1.038 s (101,597,484 steps)
Walking the key lines. collatz_steps is the kind of loop PyPy's tracing JIT was built for: a tight integer loop with a small number of branches. On the first few thousand calls, PyPy interprets the bytecode and watches. After the loop becomes hot, PyPy traces a frequently-taken path through it and compiles that trace to machine code (with guards on the branches — if a guard fails, control returns to the interpreter and the trace is updated). collatz to 50,000 runs 6.2× faster on PyPy. collatz to 1,000,000 runs 34× faster on PyPy — the ratio improves with run length because the JIT-compilation cost is amortised over more iterations. For the same code on CPython, the only escape would be to rewrite the inner loop in C via Cython or a CFFI extension; PyPy gets the speedup with zero code changes.
Why PyPy still keeps a GIL even though it uses a tracing GC rather than reference counting: removing the GIL safely still requires making the interpreter's shared structures (dicts, lists, its own bookkeeping) safe under concurrent mutation, via fine-grained locking or transactional memory, and both add complexity and a single-thread tax. The PyPy team has prototyped GIL removal in the STM (software transactional memory) branch but has not landed it in the main interpreter; their priority for years was JIT quality. The ergonomic effect is the same as CPython: PyPy threads compete for one bytecode-execution slot, and the multi-core escape is still multiprocessing.
The catch with PyPy is the C extension story. Most numeric Python (NumPy, pandas, scikit-learn, PyTorch) ships as CPython C extensions linked against libpython3.x. PyPy's cpyext layer provides a CPython-API compatibility shim, but it pays a per-call translation cost — wrapping the PyPy object as a fake PyObject* and back. For NumPy-heavy workloads, the translation overhead can wipe out the JIT speedup. PyPy is a clean win for pure-Python hot loops (web frameworks, parsers, simulators, schedulers, ASTs, custom serialisers) and a wash or loss for numeric stacks dominated by C extensions. The team at Dropbox famously switched the Mercurial-based metaserver to PyPy in 2014 and saw 30% latency improvement; the team at any deep-learning shop that tried PyPy with PyTorch saw the opposite.
CPython 3.13 free-threaded — the experimental no-GIL build
In October 2024, CPython 3.13 shipped with an experimental --disable-gil build, the result of PEP 703 (Sam Gross, Meta). This is the first official CPython without a GIL — pure-Python threads can finally run on multiple cores in parallel. The cost is a 10–20% slowdown on single-threaded code (because every refcount operation is now an atomic instruction, and the per-object lock for hash tables adds overhead) and a multi-year compatibility transition (every C extension must be marked Py_mod_gil = Py_MOD_GIL_NOT_USED after auditing for thread-safety, otherwise CPython transparently re-enables the GIL).
The mechanism is biased reference counting (adapted from Choi, Shull & Torrellas's 2018 design): each object has a "fast" refcount used by its owning thread (no atomics) and a "shared" refcount used by other threads (atomic). Most objects are touched only by their creating thread, so most refcount operations stay non-atomic. When a second thread observes the object, the refcount transitions to a slower shared mode. Combined with per-object mutexes for dict and list mutations and stop-the-world pauses for garbage collection, this produces a CPython that scales linearly across cores for pure-Python multithreaded workloads — for the first time in 30 years.
# nogil_threading_demo.py — measure thread scaling on CPython 3.13t (free-threaded)
# Run with: python3.13t nogil_threading_demo.py   (the 't' suffix = free-threaded build)
import sys, sysconfig, threading, time, math

assert sys.version_info >= (3, 13), "Requires CPython 3.13+"
GIL_DISABLED = sysconfig.get_config_var("Py_GIL_DISABLED") not in (None, 0, "0")
print(f"Python {sys.version.split()[0]}  GIL_DISABLED={GIL_DISABLED}")

def cpu_bound(n: int) -> int:
    s = 0
    for i in range(n):
        s += int(math.sqrt(i * 2654435761 & 0xFFFFFFFF)) & 0xFFFF
    return s

def run_threads(n: int, k: int) -> float:
    threads = [threading.Thread(target=cpu_bound, args=(n // k,)) for _ in range(k)]
    t0 = time.perf_counter()
    for t in threads: t.start()
    for t in threads: t.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    N = 30_000_000
    base = run_threads(N, 1)
    print(f"   1 thread:  {base:>6.3f} s  (baseline)")
    for k in (2, 4, 8, 16):
        t = run_threads(N, k)
        print(f"  {k:>2d} threads: {t:>6.3f} s  speedup={base/t:>4.2f}x  efficiency={base/(t*k)*100:>5.1f}%")
Sample run on a c6i.4xlarge (16 vCPU), comparing standard 3.13 vs free-threaded 3.13t:
# Standard CPython 3.13 (with GIL):
Python 3.13.0 GIL_DISABLED=False
1 thread: 3.412 s (baseline)
2 threads: 3.589 s speedup=0.95x efficiency= 47.5%
4 threads: 3.612 s speedup=0.94x efficiency= 23.6%
8 threads: 3.728 s speedup=0.92x efficiency= 11.5%
16 threads: 3.951 s speedup=0.86x efficiency= 5.4%
# CPython 3.13t (free-threaded, --disable-gil):
Python 3.13.0 GIL_DISABLED=True
1 thread: 4.018 s (baseline) # 18% slower single-thread
2 threads: 2.071 s speedup=1.94x efficiency= 97.0%
4 threads: 1.082 s speedup=3.71x efficiency= 92.8%
8 threads: 0.589 s speedup=6.82x efficiency= 85.2%
16 threads: 0.342 s speedup=11.75x efficiency= 73.4%
Walking the key lines. sysconfig.get_config_var("Py_GIL_DISABLED") reports whether the interpreter was built free-threaded — the same Python source, same script, gives different scaling on python3.13 (with GIL) vs python3.13t (no GIL). On the standard 3.13 build, scaling is flat — 16 threads run only 0.86× as fast as 1 thread (worse, because of GIL handoff overhead). On the free-threaded 3.13t build, 16 threads achieve 11.75× speedup with 73% parallel efficiency. The 18% single-thread slowdown is the price of atomic refcounts and per-object locking; on a 16-core CPU-bound workload, 11.75× more than pays for it. PhonePe's fraud-scoring service, ported to 3.13t in February 2026, dropped p99 from 38 ms to 14 ms — not because individual requests got faster, but because the service could finally use the cores it was paying for.
The compatibility caveat: as of mid-2026, only ~40% of the top-1000 PyPI packages have been audited for free-threading and ship wheels that declare Py_MOD_GIL_NOT_USED. NumPy, scikit-learn, lxml, pandas, and most of the data stack ship 3.13t-compatible wheels; smaller packages may not yet. If any C extension your service imports has not been audited, CPython transparently re-enables the GIL with a one-time warning at import. The transition is expected to take 2–3 years.
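Because that fallback is silent apart from the warning, it is worth checking at runtime — not just at build time — whether the GIL is actually off. A small sketch, assuming CPython 3.13+, where sys._is_gil_enabled() exists:

# gil_runtime_check.py — "built free-threaded" and "GIL off right now" are different questions
import sys, sysconfig

built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
gil_enabled_now = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
print(f"free-threaded build: {built_free_threaded}  GIL currently enabled: {gil_enabled_now}")
# On python3.13t this prints True / False — unless an un-audited C extension was imported first,
# in which case the interpreter has re-enabled the GIL and the second value flips back to True.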
Common confusions
- "The GIL means Python is slow." No — the GIL means Python is single-core for bytecode execution. Single-threaded CPython is competitive with V8 for many workloads; the issue is exclusively that CPU-bound code cannot scale across cores. I/O-bound code and code that spends its time in C extensions (NumPy, lxml, requests) is unaffected. "Python is slow" conflates many different bottlenecks, of which the GIL is only one.
- "asyncio solves the GIL problem." No — asyncio runs on a single thread. It gives you concurrent I/O (one thread can wait on 10,000 sockets simultaneously) but it does not give you parallel CPU work. An async def function that spends 100 ms in pure-Python computation blocks the entire event loop for that 100 ms. The GIL is not even involved in async-await; the constraint is more fundamental — there is one OS thread, doing one thing at a time.
- "multiprocessing and threading are interchangeable." They share an API surface (Pool, Queue, Manager) but the underlying cost models are completely different. Threads share memory (cheap to communicate, restricted to one core for bytecode). Processes have separate memory (expensive to communicate via pickle/shared memory, can use all cores). Pick threading for I/O-bound work, multiprocessing for CPU-bound work, and neither naively for a fraud-scoring API where 800 MB of model state would be duplicated 32× under multiprocessing.
- "PyPy is just CPython compiled with -O3." PyPy is a different implementation: an interpreter for its own equivalent of CPython bytecode, written in RPython and equipped with a tracing JIT — closer in design to TraceMonkey or LuaJIT than to CPython. The -O3 model would imply the same execution model with better codegen; PyPy actually changes the execution model entirely (interpret → trace → compile to machine code → run native), which is what unlocks the 5–30× speedups.
- "Free-threaded CPython 3.13 is GA-ready for production." Not yet, as of Q2 2026. The release notes mark it experimental. The build must be explicitly opted into (./configure --disable-gil), and the t suffix on the binary (python3.13t) signals the build variant. The 2–3 year ecosystem-audit window is real — large services should pilot non-critical workloads first.
- "Releasing the GIL in a C extension is dangerous." It is dangerous if the C code touches Python objects after releasing — that is undefined behaviour. It is safe and standard in extensions that take Python data, copy it into C-owned memory, do the work, and re-acquire the GIL only to write results back. NumPy, lxml, hashlib, zlib, and OpenSSL all release the GIL during their compute kernels. The pattern is Py_BEGIN_ALLOW_THREADS / ... pure C work ... / Py_END_ALLOW_THREADS.
Going deeper
Biased reference counting and how PEP 703 actually achieves no-GIL
The conceptual move in PEP 703 is recognising that most Python objects are touched by exactly one thread for their entire lifetime — local variables in a function, intermediate objects in a list comprehension, dict entries created and discarded within a request handler. Forcing every refcount on every object to be an atomic instruction would slow single-threaded code by 30–40% (atomic operations cost ~10–25 cycles on x86 for the cache-coherence traffic, vs ~1 cycle for a non-atomic increment). The biased refcount design splits the count into a local_refcount (mutated only by the owning thread, non-atomic) and a shared_refcount (atomic, used by any thread). When a thread other than the owner touches the object, the count "transitions" — future operations from any thread go through the atomic path. The single-threaded fast path stays cheap; only objects actually shared pay the atomic cost.
Combined with per-object locks (replacing the single GIL with thousands of fine-grained PyMutexes around dict and list mutations) and a deferred reclamation scheme (objects whose refcount drops to zero are queued for the owning thread to free, avoiding cross-thread free-list contention), the design lands a no-GIL CPython at 10–20% single-thread overhead — within the budget the steering council was willing to accept. The full design document, PEP 703, runs to roughly 14,000 words and is worth reading for anyone serious about Python performance.
Cython and the nogil block — explicit GIL release without hand-writing a C extension
Cython is a Python-superset language that compiles to C. Among its features is the with nogil: block: a region inside a Cython function where the GIL is explicitly released, allowing pure-C code (no cdef class access, no Python object operations) to run while other Python threads execute. The pattern is the same as a C extension's Py_BEGIN_ALLOW_THREADS, but expressed in Cython source.
# fast_sum.pyx — Cython with nogil block
def parallel_sum(double[:] data):
    cdef double s = 0.0
    cdef Py_ssize_t i, n = data.shape[0]
    with nogil:
        for i in range(n):
            s += data[i]
    return s
Compiled with cythonize -i fast_sum.pyx, this function can be called from multiple Python threads and the loop runs in parallel across them — inside the nogil block the lock has been released. The matcher team at Zerodha uses Cython nogil blocks for the price-comparison hot path, getting C performance without rewriting the surrounding Python orchestration code.
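A minimal driver for the block above — assuming cythonize -i fast_sum.pyx has produced the importable extension; the chunk count and data size are illustrative:

# fast_sum_driver.py — call the compiled nogil loop from several threads at once
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from fast_sum import parallel_sum      # built by: cythonize -i fast_sum.pyx

data = np.random.rand(40_000_000)
chunks = np.array_split(data, 4)       # four contiguous slices, one per thread
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(parallel_sum, chunks))   # the with-nogil loops overlap on four cores
print(f"threaded total {total:.3f} vs numpy {data.sum():.3f}")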
concurrent.futures and the ProcessPoolExecutor pattern
For most CPU-bound Python services the right answer (still, today, for code targeting CPython 3.12 or older) is concurrent.futures.ProcessPoolExecutor. It manages a pool of worker processes, sends tasks via pickle, and returns futures — the same API surface as ThreadPoolExecutor but with process-level parallelism. The PhonePe fraud team's pre-3.13t architecture used a 16-process ProcessPoolExecutor with the model loaded once into shared memory via mmap (using multiprocessing.shared_memory) so the 800 MB model is paged in once across all workers. This is the cleanest pre-no-GIL pattern for CPU-bound services that need shared read-only state.
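A minimal sketch of that pattern — not the PhonePe code; the dot-product "model" and names like score_batch are stand-ins — using only NumPy, concurrent.futures, and multiprocessing.shared_memory:

# shared_model_pool.py — ProcessPoolExecutor workers scoring against one shared, read-only weight block
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

_shm = None          # kept global in each worker so the mapping stays alive
_weights = None      # read-only NumPy view into the shared block

def _attach(shm_name, shape, dtype):
    """Pool initializer: runs once per worker, maps the shared block instead of pickling it per task."""
    global _shm, _weights
    _shm = shared_memory.SharedMemory(name=shm_name)
    _weights = np.ndarray(shape, dtype=dtype, buffer=_shm.buf)

def score_batch(features):
    # stand-in for the real model: one dot product per transaction row
    return features @ _weights

if __name__ == "__main__":
    weights = np.random.rand(220)                      # stand-in for the 800 MB model
    shm = shared_memory.SharedMemory(create=True, size=weights.nbytes)
    np.ndarray(weights.shape, dtype=weights.dtype, buffer=shm.buf)[:] = weights

    with ProcessPoolExecutor(max_workers=4, initializer=_attach,
                             initargs=(shm.name, weights.shape, weights.dtype)) as pool:
        batches = [np.random.rand(1_000, 220) for _ in range(8)]   # only the features travel via pickle
        for scores in pool.map(score_batch, batches):
            print(round(float(scores.mean()), 4))

    shm.close()
    shm.unlink()    # on 3.12 and older the resource tracker may still print a harmless cleanup warning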
When multiprocessing is wrong — IPC dominates compute
A common mistake is to assume multiprocessing is always faster than threading. For tasks where the per-task compute is small (under ~10 ms) and the per-task data is large (over ~100 KB), pickle serialisation cost dominates. The cost of pickle.dumps(df) for a 1 MB pandas DataFrame is roughly 5–15 ms; sending 100 such tasks to workers and collecting their results adds 500–1500 ms of pure serialisation overhead, which can exceed the actual compute. The fix is either (a) batch tasks larger so the per-task overhead amortises, (b) use multiprocessing.shared_memory to pass the data without pickle, or (c) move to free-threaded CPython 3.13t where threads share memory natively.
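Before committing to a process pool, measure the round-trip cost for your actual payload — a rough sketch, assuming pandas is installed; the sizes are illustrative:

# ipc_overhead.py — is pickle the bottleneck? time a task-sized payload through dumps/loads
import pickle, time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(30_000, 5))          # roughly 1.2 MB of float64 payload
t0 = time.perf_counter()
for _ in range(100):
    blob = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(blob)
per_task_ms = (time.perf_counter() - t0) / 100 * 1e3
print(f"{len(blob) / 1e6:.2f} MB per task, {per_task_ms:.2f} ms per pickle round-trip")
# If that per-task cost rivals your per-task compute, batch larger, use shared_memory, or stay on threads.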
Reproduce this on your laptop
# Standard CPython benchmarks
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 gil_demo.py
# PyPy comparison (download from pypy.org, or via pyenv)
pyenv install pypy3.10-7.3.16
pyenv shell pypy3.10-7.3.16
pypy3 pypy_vs_cpython_demo.py
python3 pypy_vs_cpython_demo.py
# Free-threaded CPython 3.13t (build from source or install via uv)
uv python install 3.13t
~/.local/share/uv/python/cpython-3.13.0-*-freethreaded/bin/python3.13t nogil_threading_demo.py
# Verify GIL state in your interpreter
python3 -c "import sysconfig; print(sysconfig.get_config_var('Py_GIL_DISABLED'))"
You should see roughly 4× speedup from multiprocessing on the GIL demo, near-zero speedup from threading on standard CPython, 5–30× speedup from PyPy on the Collatz benchmark, and near-linear scaling from threading on python3.13t. If your CPU has fewer than 4 cores, scale K and the loop sizes down proportionally.
Where this leads next
The GIL is the single most-asked-about Python performance question, but it is not the only constraint. The chapters that follow look at the runtime layer underneath:
- /wiki/cpython-gc-and-reference-counting — how the cycle-detector interacts with refcounting, when to call gc.disable(), and the production cases where GC pauses matter.
- /wiki/python-c-extension-cost-and-pybind11 — what crossing the Python-C boundary actually costs, when pybind11 and cffi make sense, and how to release the GIL correctly.
- /wiki/asyncio-event-loop-internals — what asyncio actually does on one thread, why CPU work in coroutines blocks the loop, and the loop.run_in_executor escape.
- /wiki/cython-and-mypyc-when-to-compile-python — the two ahead-of-time Python-to-C paths, when each pays off, and the maintenance cost.
- /wiki/profiling-python-with-py-spy-and-scalene — how to find the hot loop in a Python service so you know whether the GIL even matters for your workload.
The reader who finishes this chapter should be able to look at a Python service consuming one core on a 32-core box, name the GIL as the cause, identify whether the workload is CPU-bound or I/O-bound, and pick the right escape — multiprocessing for CPU-bound with isolated state, native extensions for CPU-bound with vectorisable work, PyPy for pure-Python long-running services, free-threaded 3.13t when the C-extension dependency tree allows it. That choice is the one that decides whether your c6i.8xlarge is doing the work of one core or of thirty-two.
The broader pattern is the one every language runtime forces you to learn: the constructs that look free at the source level often carry runtime costs the type system or syntax does not surface. Python's GIL is the most famous example, but every runtime — JVM with its safepoints, Go with its GOMAXPROCS interaction, Rust with its Arc::clone — has the same shape. The performance work is not "make the code faster"; it is "find the construct that hides a per-operation cost the source did not warn you about".
References
- PEP 703 — Making the Global Interpreter Lock Optional in CPython (Sam Gross, 2023) — the design document for the no-GIL build, including biased refcounting and per-object locking mechanics.
- David Beazley, "Understanding the Python GIL" (PyCon 2010) — the classic talk that taught a generation of Python engineers what the GIL actually is.
- Larry Hastings, "Gilectomy" project (PyCon 2016, 2017) — the earlier no-GIL attempt that informed PEP 703's design choices.
- Python threading and multiprocessing documentation — the canonical API references, with the GIL caveats called out.
- PyPy's speed.pypy.org — continuously-updated benchmark comparisons of PyPy vs CPython across a broad workload set.
- Łukasz Langa, "Thinking in coroutines" (PyCon 2024) — the practical guide to async/await that clarifies what asyncio does and does not solve.
- /wiki/jvm-hotspot-gcs-jit-tiers — the sister chapter on JVM internals; useful for engineers comparing language-runtime cost models.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement methodology required for honest p99 numbers when benchmarking any of the parallelism approaches in this chapter.