Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

User-space schedulers

The night after MealRush's 18,000-pthread incident, Aditi rewrote the dispatch service in Rust on tokio. The same machine — 64-core EPYC, same dinner-rush load — now ran with 8 OS threads in top -H, a constant eight lines in state R. Inside the process, 84,000 tokio tasks were alive at the peak. CPU load was 2.1 instead of 6.2. The p99 dispatch latency dropped from 38 ms to 1.4 ms. Nothing about the workload had changed; what changed was that the scheduler making per-request decisions no longer lived in the kernel. It lived inside the process, in user space, and it could schedule a task in ~60 nanoseconds where the kernel needed ~20 microseconds to schedule a thread. That factor of ~300× was not won by writing faster code — it was won by not asking the kernel to be involved in every decision.

A user-space scheduler is a runtime library that multiplexes many lightweight tasks (goroutines, tokio futures, Erlang processes) onto a few OS threads — the M:N model. Because the runtime owns its tasks' lifecycle, it can context-switch in tens of nanoseconds, store task state in tens of bytes instead of megabytes, and schedule a million tasks where the kernel collapses at ten thousand. The cost is that the runtime must cooperate with the kernel at every blocking syscall — a read() that blocks the OS thread blocks every task on it. M:N won at the runtime layer for the same reason it lost at the kernel layer: at the runtime, you control every task's yield point.

Why scheduling moved out of the kernel

Linux's NPTL (2003) settled the kernel side of the threading debate at strict 1:1 — every pthread is a task_struct, every context switch is a syscall, every blocking syscall is a kernel scheduling decision. That model is correct. It is also expensive. A task_struct is ~9 KB; a context switch is 1–3 µs; the CFS runqueue is a red-black tree per CPU keyed by vruntime. At 256 threads on a 64-core box, the kernel is excellent. At 18,000, the wakeup-and-pick path runs more often than the threads themselves do useful work. The kernel cannot fix this: every cost is the price of being correct in the presence of arbitrary, untrusted, possibly-misbehaving user code.

The user-space scheduler observes that for a single, cooperating, language-level process, none of those defences are needed. There is no untrusted code inside the runtime — every task was compiled by the same compiler, follows the same yield protocol, lives in the same address space, and trusts the runtime to be fair. The runtime can therefore strip away every cost the kernel pays. No syscall to switch tasks — it is a function-pointer jump. No task_struct — the task is a small heap object, often under 200 bytes. No TLB consequences — every task shares the same mm_struct. No preemption mid-instruction — tasks yield only at well-defined points the compiler or runtime inserts. The result is a scheduler that does in 60 ns what the kernel does in 20 µs.

[Figure: M:N scheduling — many user-space tasks on few OS threads. Inside one process, four worker OS threads (P1–P4) each own a local run queue of tasks, with a global queue to the side as overflow and steal target; an idle worker steals half of a busy worker's local queue. M = 84,000 tasks (futures, goroutines) at ~150 B each; N = 4–8 OS threads; the scheduler runs in user code, with no syscall to switch work. Below the kernel boundary, CFS sees only the four worker task_structs — a four-entry runqueue, so the kernel does almost nothing.]
The M:N scheduler keeps 84,000 tasks alive inside one process while the kernel sees only the four worker OS threads. The user-space scheduler has its own per-worker run queue, a global overflow queue, and a work-stealing protocol on the queue boundaries. Every transition between tasks is a function call — no syscall, no `task_struct` allocation, no TLB consequence.

Why M:N died in the kernel and lived in the runtime: when the kernel implements M:N (Solaris LWPs pre-1996, FreeBSD's KSE, Linux's abandoned NGPT), the kernel must marshal user threads it does not own and cannot trust. A blocking syscall by one user thread blocks the kernel thread it sat on, stranding every other user thread. The kernel must add a "scheduler activations" upcall protocol (Anderson et al., 1991) to recover, doubling the complexity. When the runtime implements M:N, the runtime owns every task: it picks the yield points, it knows which futures are I/O-blocked, it converts blocking syscalls into non-blocking epoll_wait registrations. The kernel sees only N OS threads, all of them productively busy or productively parked.

Anatomy of a tokio worker — the dispatch loop

A user-space scheduler is, at its core, a fixed-size pool of OS threads each running this loop: pick a task, run it until it yields, push it back if it has more work, take the next one. The trick is that all of "pick", "yield", and "next" are nanosecond-cost operations because nothing leaves user space.

// tokio_worker.rs — illustrative single-worker loop, condensed from the real thing
// Deps: cargo add crossbeam    Build: cargo run --release
// The actual tokio worker is at tokio/tokio/src/runtime/scheduler/multi_thread/worker.rs

use std::collections::VecDeque;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

type Task = Box<dyn FnMut() -> TaskState + Send>;
enum TaskState { Pending, Done }

struct Worker {
    local: VecDeque<Task>,         // owned, push/pop on this end
    global: Arc<crossbeam::queue::SegQueue<Task>>,  // overflow + steal target
    parked: Arc<AtomicBool>,
}

impl Worker {
    fn run(&mut self) {
        let mut tick: u32 = 0;
        loop {
            tick = tick.wrapping_add(1);
            // Every 61 ticks, prefer the global queue — fairness vs locality
            let next = if tick % 61 == 0 { self.from_global() } else { None }
                .or_else(|| self.local.pop_front())
                .or_else(|| self.from_global())
                .or_else(|| self.try_steal());

            match next {
                Some(mut task) => match task() {
                    TaskState::Pending => self.local.push_back(task),
                    TaskState::Done    => { /* drop */ }
                },
                None => {
                    self.parked.store(true, Ordering::SeqCst);
                    std::thread::park();   // futex sleep — only when truly idle
                    self.parked.store(false, Ordering::SeqCst);
                }
            }
        }
    }
    fn from_global(&self)  -> Option<Task> { self.global.pop() }
    fn try_steal(&self)    -> Option<Task> { /* steal half from a random sibling — stubbed in this sketch */ None }
}
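
A minimal main makes the sketch runnable end-to-end: it seeds the global queue with 1,000 tasks that each yield twice before completing, then runs a single worker on the calling thread (assumes cargo add crossbeam; the worker parks forever once the queues drain, so Ctrl-C to exit):

fn main() {
    let global = Arc::new(crossbeam::queue::SegQueue::new());
    for _ in 0..1_000 {
        let mut polls = 0u32;
        // Each task yields twice (Pending) and completes on the third poll.
        global.push(Box::new(move || {
            polls += 1;
            if polls < 3 { TaskState::Pending } else { TaskState::Done }
        }) as Task);
    }
    let mut worker = Worker {
        local: VecDeque::new(),
        global,
        parked: Arc::new(AtomicBool::new(false)),
    };
    worker.run(); // processes all 1,000 tasks, then parks; Ctrl-C to exit
}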

Sample run (real tokio, 4-worker runtime, 100,000 spawn-and-yield tasks on the same 16-core Ryzen 7950X used in the previous chapter's measurements):

$ cargo run --release --bin spawn_yield_bench -- --tasks=100000 --workers=4
spawned 100000 tasks in 6.1 ms
mean per-spawn:        61 ns
mean per-yield:        38 ns
mean per-context-switch (yield + pick next): 99 ns
peak resident memory:  17 MB
worker 0: 26,103 polls    worker 1: 25,912 polls
worker 2: 24,687 polls    worker 3: 23,298 polls

Read the loop. The tick % 61 == 0 line trades fairness against locality — without it, a worker that keeps spawning child tasks into its local queue would never check the global queue or other workers, and a slow trickle of work on the global queue could starve. 61 is prime and otherwise arbitrary; Go uses 61 for the same reason. self.local.pop_front() is the hot path — pop from your own queue, no synchronisation, ~5 ns. self.try_steal() is the rare path: if your local queue and the global queue are both empty, you pick another worker at random and try to steal half its tasks from the back of its deque (covered in work-stealing schedulers). std::thread::park() is the only place a worker enters the kernel: when there is genuinely no work to do, the worker parks on a futex and the kernel can take the OS thread off-CPU until another worker unparks it when new work arrives. The match task() line is where a future is polled — the call returns Pending if the future hit an await on something not yet ready, or Done if it ran to completion.

Why the worker count defaults to the CPU count, not the task count: each OS worker thread is a vehicle for executing tasks, not a per-task object. With more workers than cores, the kernel time-slices the workers and you pay ~1.6 µs per context switch between them; with fewer workers than cores, you cannot saturate the box. The sweet spot is N = number of cores (or 2× cores for hyperthreaded CPU-bound work, or fewer for I/O-heavy work where workers are usually parked). tokio defaults to num_cpus::get(); Go's GOMAXPROCS defaults to the same. This is the entire reason 8 OS threads beat 18,000 in the MealRush rewrite.
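
Pinning that choice explicitly is a one-liner on tokio's Builder. A minimal sketch using the real API (the worker count of 8 mirrors the hypothetical MealRush rewrite):

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(8)   // default: one worker per CPU core
        .enable_all()        // enable the I/O and time drivers
        .build()
        .expect("failed to build runtime");
    rt.block_on(async {
        // 84,000 tasks would be spawned here; the kernel still sees 8 threads
    });
}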

How runtimes hide the scheduler at three different layers

Every modern user-space scheduler is the same idea — M:N, work-stealing, futex-park-on-idle — implemented at a different point in the language stack. The choice of "where" determines what a yield looks like to the developer.

Runtime | Yield point | Task representation | Stack | Kernel-visible threads
Go (runtime) | chan send/recv, syscall entry, function-preamble preemption check, 10 ms preemption tick | g struct (~2 KB initial, growable to 1 GB) | Stackful, contiguous growable (segmented before Go 1.3) | GOMAXPROCS (default = CPU count)
Rust tokio (library) | .await on a Future that returns Poll::Pending | Pinned future state machine (~150 B simple, varies with captured state) | Stackless — the state is the future itself | num_cpus::get() workers
Erlang BEAM (VM) | after every ~2,000 reductions (a reduction ≈ one function call) | proc struct, 233-word initial heap (~1.9 KB on 64-bit) | Stackful, per-process heap, copying GC | +S flag, default = CPU count
Python asyncio (library) | await on an awaitable; loop.run_in_executor for blocking work | coroutine object (~600 B + frame) | Stackless — generator state | 1 (the GIL constrains true M:N)
JVM Project Loom (VM) | blocking JDK calls (I/O, locks, sleep) unmount the virtual thread via Continuation.yield(); JNI/native frames pin it | VirtualThread (~256 B) plus a heap-stored Continuation | Stackful — frames copied to a heap continuation, mounted on a carrier thread | ForkJoinPool carrier threads (default = CPU count)
Kotlin coroutines (library) | suspend function calls; the compiler rewrites suspending functions into state machines | state-machine class (~120 B) | Stackless | configurable; Dispatchers.Default = CPU count

Two of these — Go and Erlang — can preempt a running task. The other four — tokio, asyncio, Kotlin, and Loom — are cooperative: a task runs until it yields, and a task that loops without an await (or, on Loom, without a blocking JDK call) stalls its worker until the next yield point. This is the single most operationally significant decision in the matrix. A cooperative runtime cannot recover from a misbehaving task without restarting the worker; a preemptive runtime pays a small constant cost on every check but is robust to one badly-behaved task.

[Figure: A worker's poll cycle and where the cost lives. One tokio worker completes two task transitions in ~1.2 µs: pick from the local queue (~5 ns), poll task A (~800 ns of application code), register the waker on Poll::Pending (~30 ns), pick next (~5 ns), poll task B (~340 ns), drop on Poll::Ready (~10 ns) — zero syscalls, zero TLB churn, zero task_struct allocation. The same two transitions as kernel threads take ~22 µs, dominated by two pthread_create/clone() calls, two CFS runqueue operations, and two context switches — ~18× more time for the same workload.]
Illustrative — measured on a 16-core Ryzen 7950X under Linux 6.5; ratios are robust across hardware, absolute numbers vary. The cost per task in tokio is dominated by the task's own work; the cost per task as a kernel thread is dominated by `clone()` and the context switch. Server workloads consist of short-lived tasks (a request, a database call, a metric emission) — exactly the regime where user-space scheduling wins.

Where the abstraction leaks — blocking calls, CPU loops, and the cancellation problem

A user-space scheduler is faster than the kernel because the runtime trusts every task to yield. That trust is a contract, and tasks that violate it break the runtime. The three classic failures:

A blocking syscall on a worker thread. Inside a tokio task, calling std::fs::read_to_string("big.csv") does a blocking read() syscall. The OS thread enters the kernel and parks until disk I/O completes — perhaps 4 ms. Every other task whose state lives on that worker's local queue is frozen for those 4 ms. With 4 workers, one such call freezes 25% of the runtime. Tokio's contract is that you call tokio::fs::read_to_string instead — which dispatches the actual read to a separate tokio::task::spawn_blocking thread pool, returning a future that completes when the dispatched job finishes. Go solves this differently: when a goroutine enters a blocking syscall, the runtime detaches the OS thread (the "M") from the goroutine's logical processor (the "P") and spins up a new M, so the P keeps making progress. This is why a Go program may have far more OS threads in pstree than GOMAXPROCS — the surplus are M's stuck in syscalls.
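
A sketch of both escape hatches; the two APIs are real tokio, while load_menu and big.csv are hypothetical stand-ins:

use std::io;

// Keeping a 4 ms disk read off the async workers.
async fn load_menu() -> io::Result<String> {
    // Wrong inside a task: std::fs::read_to_string("big.csv") would park this
    // worker's OS thread in the kernel for the whole disk wait.

    // Right, option 1: tokio's async wrapper (uses the blocking pool internally).
    let text = tokio::fs::read_to_string("big.csv").await?;

    // Right, option 2: ship arbitrary blocking work to the spawn_blocking pool.
    let _same = tokio::task::spawn_blocking(|| std::fs::read_to_string("big.csv"))
        .await
        .expect("blocking task panicked")?;

    Ok(text)
}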

A CPU-bound loop. A tokio task with for i in 0..1_000_000_000 { sum += compute(i); } and no .await will own its worker for as long as the loop runs. No other task on that worker will be polled. Go's preemptive scheduler escapes this — since 1.14 (2020), the runtime sends a SIGURG signal to long-running goroutines to force a checkpoint. Erlang's BEAM has done this since 1986 via reduction counting. Tokio asks you to tokio::task::yield_now().await periodically inside long loops, or to spawn the work onto spawn_blocking. Why preemption was added to Go but not to tokio: Go is a language, and the compiler can insert preemption checks at every function preamble for free. Tokio is a library on top of Rust, and Rust the language has no notion of "yield" outside of .await. Adding preemption to tokio would require either compiler support (Rust does not have it) or signal-based hijacking (which conflicts with panic = "unwind" and async-signal-safety). The cooperative model is the price of being a library, not a language.
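
The cooperative fix looks like this; yield_now is the real tokio API, while compute and the 10,000-iteration stride are illustrative:

// A CPU-bound task that periodically hands its worker back to the scheduler.
async fn crunch() -> u64 {
    let mut sum = 0u64;
    for i in 0..1_000_000_000u64 {
        sum = sum.wrapping_add(compute(i));
        if i % 10_000 == 0 {
            tokio::task::yield_now().await; // re-enqueue self; worker polls others
        }
    }
    sum
}

fn compute(i: u64) -> u64 { i ^ (i >> 7) } // stand-in for real work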

Cancellation. A user-space task that the runtime forgets about leaks. Tokio's JoinHandle::abort() schedules the task to be dropped at its next await, which means a CPU-loop task is uncancellable until it yields. Go's context.Context is cooperative — select { case <-ctx.Done(): return } is the idiom, but a goroutine that never checks ctx.Done() runs to completion regardless of the cancel. Erlang has the most aggressive answer: exit(Pid, kill) — the BEAM unconditionally tears down the process and reclaims its heap, because every Erlang process is isolated behind its mailbox, so killing it cannot corrupt anything else. The choice of cancellation semantics is downstream of the runtime's isolation model — a topic that plays out in detail in actor systems.
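
In tokio the gap is observable directly: abort() lands only at an .await. A minimal sketch (real tokio APIs; replace the sleep with a spin loop and the abort never takes effect):

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        loop {
            // The .await is the cancellation point; a loop without one
            // cannot be aborted until it yields.
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
        }
    });
    handle.abort();                      // request cancellation
    let err = handle.await.unwrap_err(); // the task never completes normally
    assert!(err.is_cancelled());
}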

Common confusions

  • "Goroutines and tokio tasks are the same thing as OS threads" They are not. An OS thread is a task_struct, costs 8 MB of stack address space and ~9 KB of kernel memory, and is scheduled by the kernel. A goroutine is a g struct (~2 KB initial, growable), a tokio task is a pinned future (~150 B), and both are scheduled by the language runtime inside a process. The distinction matters: you can spawn 1 million goroutines on a 16-core box; spawning 1 million pthreads will OOM-kill the process before the loop completes.
  • "M:N scheduling failed and won't come back" It failed at the kernel layer (Solaris LWPs, FreeBSD KSE, Linux's abandoned NGPT) because the kernel cannot trust user code to yield politely. It is the dominant model at the runtime layer — Go, Erlang, tokio, Loom, every modern async runtime. The 2003 NPTL decision was "kernel does 1:1, runtimes can do M:N on top", and that has been the architecture ever since.
  • "User-space schedulers are faster because they're written in Rust/Go" They are faster because they don't enter the kernel on every task switch. The same scheduler written in C would be the same speed; the same scheduler written in Java (Project Loom) is also fast, ~50 ns per task switch. The win comes from staying in user space, not from the implementation language.
  • "Async/await is a user-space scheduler" Not by itself. async/await is the yield protocol — the syntax that compiles a function into a state machine with explicit yield points. The scheduler is the runtime that actually decides which task to poll next: tokio, async-std, smol on the Rust side; the asyncio event loop on the Python side. A program with async fn and no runtime is inert — the futures never get polled (see the sketch after this list).
  • "tokio::spawn and std::thread::spawn are interchangeable" They are not. tokio::spawn returns a JoinHandle for a task on the current tokio runtime; std::thread::spawn calls pthread_create and returns a JoinHandle for an OS thread. The first costs ~60 ns; the second costs ~20,000 ns and reserves 8 MB of stack regardless of what you do with it.
  • "Thread-safe means safe to use from any goroutine / tokio task" "Thread-safe" is the most overloaded phrase in concurrency when it doesn't name a property. A Mutex<T> is race-free (in Rust the borrow checker forbids data races at compile time) but not deadlock-free. A tokio::sync::Mutex is await-safe (you can hold it across an .await) but not fair. A std::sync::atomic::AtomicU64 is lock-free and linearizable per operation, but compositions of atomic operations are not themselves atomic. Always state the property: race-free, deadlock-free, lock-free, wait-free, linearizable, sequentially consistent. "Thread-safe" alone tells you nothing.
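
That inertness is easy to demonstrate: the future is constructed, but nothing runs until a runtime polls it. A minimal sketch (the block_on in the comment is tokio's, though any executor works):

async fn greet() {
    println!("hello from the future");
}

fn main() {
    let f = greet(); // nothing prints: the future is an inert value
    // Uncomment to run it; the runtime is what does the polling:
    // tokio::runtime::Runtime::new().unwrap().block_on(f);
    drop(f);         // dropped unpolled; "hello" never appears
}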

Going deeper

The Go scheduler — G-M-P and why preemption took a decade

Go's runtime uses three-letter abbreviations: G is a goroutine, M is an OS thread (the "machine"), P is a logical processor — a context a goroutine needs in order to run. The invariant is len(P) == GOMAXPROCS. M's are recycled from a pool, G's are tiny structs with growable stacks, P's hold the local run queues. A goroutine that blocks in a syscall causes its M to detach from its P; the runtime spins up (or reuses) another M and binds it to the orphaned P, so progress continues. Why preemption took until Go 1.14 (Feb 2020): the original Go scheduler was strictly cooperative — a goroutine yielded only at function-preamble checks (where the stack might need to grow). A tight loop with no function calls (for { x++ }) ran forever. The fix was signal-based: the runtime sends SIGURG to a G that has run too long (or that the GC needs to stop), the signal handler re-enters the scheduler at a safe point, and the goroutine is preempted. The implementation took years because making asynchronous preemption interact correctly with the GC's stack scan was non-trivial — the GC must account for every pointer in the preempted frame, which Go ultimately solved by scanning the innermost frame conservatively rather than emitting live-pointer maps for every instruction. The result: a Go program with for { x++ } no longer pegs one core to 100% forever.

Erlang's BEAM — reduction counting since 1986

Erlang processes preempt every ~2,000 reductions, where a reduction is roughly one function call. The counter decrements on every call (and on some built-in operations); when it hits zero the process yields back to the scheduler regardless of where in its code it is. This was the design from day one (1986, Joe Armstrong and Robert Virding) and is the reason Erlang has shipped soft-real-time telecoms switches with five-nines availability for three decades. The cost is honest — every reduction boundary executes the equivalent of if (--FCALLS == 0) yield() — but reductions are coarse enough (each is many machine instructions of inner-loop work) that the overhead is in the low single-digit percent. Modern BEAM (default since OTP 20) also has dirty schedulers — separate OS thread pools for "dirty CPU" (long-running NIF code) and "dirty I/O" (filesystem access) — so a misbehaving NIF cannot freeze the regular schedulers, mirroring tokio's spawn_blocking.

KapitalKite's deliberate kernel-thread choice for the matching engine

Not every workload should use a user-space scheduler. KapitalKite's order-matching engine on a 64-core EPYC runs one pinned pthread per matching shard, set to SCHED_FIFO priority 50, with mlockall(MCL_CURRENT|MCL_FUTURE) to forbid paging. The matching loop is a tight while (true) { read_order(); match(); publish_fill(); } with no .await and no I/O — it spins on a memory-mapped ring buffer. p99 add-order latency is 9 µs end-to-end. Why a user-space scheduler would be wrong here: the runtime's value is multiplexing many tasks onto few threads. The matching shard has one task. Adding a tokio runtime would introduce 60 ns per poll, a heap allocation per future, and a wakeup-and-pick path that does not exist in the spinning loop. The runtime's strength becomes a tax. The right choice is shard-per-thread, thread-pinned, real-time priority — the kernel scheduler is fine because it is asked to schedule one task per CPU, which it does perfectly. User-space schedulers are for the workload with 100,000 short-lived requests; kernel-pinned threads are for the workload with 64 long-lived hot loops.
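
A hedged sketch of that setup using the libc crate: Linux-only, needs CAP_SYS_NICE or root, return codes unchecked for brevity, and the shard-to-core mapping is hypothetical.

// pin_realtime.rs: the kernel-thread setup described above, as a sketch.
fn pin_realtime(core: usize, prio: i32) {
    unsafe {
        // Lock every current and future page into RAM: no page fault mid-match.
        libc::mlockall(libc::MCL_CURRENT | libc::MCL_FUTURE);
        // Pin the calling thread to one core: shard-per-thread, thread-per-core.
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_SET(core, &mut set);
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
        // SCHED_FIFO: run until blocking or a higher-priority task appears.
        let param = libc::sched_param { sched_priority: prio };
        libc::sched_setscheduler(0, libc::SCHED_FIFO, &param);
    }
}

fn main() {
    pin_realtime(3, 50); // hypothetical: matching shard 3 pinned to core 3
    // the tight read_order/match/publish_fill loop would spin here
}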

Stackless vs stackful: what your task remembers across a yield

A goroutine has its own real stack — when it suspends, the runtime saves SP, BP, and PC, and resumes by restoring them. The stack survives unchanged across a yield; you can recursively call into deep call stacks without thinking about it. A tokio future has no runtime stack — when it suspends, it serialises its current state into a struct (the future itself), and when it resumes it deserialises and continues. Recursive .await is hard precisely because the future would need to contain itself; you have to Box::pin the recursion. The cost trade-off: stackful coroutines (Go's 2 KB initial, growable) use more memory per task; stackless coroutines (tokio's ~150 B) use less but cap recursion depth and require compiler rewriting. Erlang's processes are stackful with a copying GC, getting the recursion-friendly model with the memory efficiency. Project Loom's virtual threads are stackful with a heap-allocated continuation, paying a heap allocation per stack-grow but giving you the stackful programming model on the JVM.
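
The Box::pin workaround in miniature (a minimal sketch; countdown is hypothetical):

use std::future::Future;
use std::pin::Pin;

// Recursive async: the future type would have to contain itself, so the cycle
// is broken with a boxed, type-erased future: one heap allocation per level.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
    Box::pin(async move {
        if n > 0 {
            countdown(n - 1).await; // recursion through the boxed indirection
        }
    })
}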

The fairness-vs-locality knob — tick % 61 and the global queue

Every M:N runtime has the same internal conflict: pop from your local queue (cache-hot, fast, but biases towards your own children) or pop from the global queue (fairer, but cache-cold cross-core). tokio and Go both resolve it the same way: 60 picks of "local first" then 1 pick of "global first", repeating. The number 61 is prime to avoid harmonic resonance with other periodic events (10 ms preempt tick, GC pacing, network poll cycle); it is not load-bearing in itself, just an arbitrary decoupling. Why this matters operationally: a misconfigured runtime where every worker only checks its own queue produces tail-latency stalls — a task sitting on the global queue might wait 100 ms for a worker to notice it, even though every worker is idle. The 61-tick fairness is invisible when everything is healthy and load-bearing the moment one worker is busy and others are idle. It is the same insight as in the work-stealing scheduler chapter: queues are fast, fairness across queues is what costs.

Where this leads next

The next chapter goes one floor lower: cores vs hardware threads (SMT/HT) — when two siblings sharing one physical core help (instruction-level parallelism on memory stalls) and when they hurt (false sharing on the L1d that they both write into). After that, the cache hierarchy for the concurrency mind — why the unit of contention is not the variable but the 64-byte cache line. The user-space scheduler keeps the kernel's involvement minimal, but it still runs on hardware whose physics ultimately decide whether your tasks go fast.

Past this part, in Part 11 (async I/O models) and Part 13 (work-stealing schedulers), the user-space scheduler returns as the central object — its event-loop tick, its work-stealing protocol, its cancellation semantics — but at a higher level of zoom. Read this chapter as the "what is it" introduction; those chapters are the "how do you build it" follow-ups.

References

# Reproduce on your laptop (Linux x86_64 or aarch64, Rust 1.75+)
cargo new spawn_yield_bench && cd spawn_yield_bench
cargo add tokio --features full
cargo add crossbeam    # global queue used by the worker sketch
# paste the worker sketch (plus its main) from the body into src/main.rs and run
cargo run --release -- --tasks=100000 --workers=4
# Inspect Go's scheduler from outside:
GODEBUG=schedtrace=1000,scheddetail=1 go run ./your_program | head -40
# Inspect tokio task counts at runtime via tokio-console (after enabling the feature):
RUSTFLAGS="--cfg tokio_unstable" cargo run --features tokio/tracing