Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Threads and the OS kernel's view

At 21:14 on a Friday, MealRush's dispatch service starts dropping orders during the dinner rush. The Java service is sitting on 18,000 threads on a 64-core box, CPU load is 6.2, and top -H shows thousands of thread lines — most sleeping, a few hundred cycling through the run state without getting anywhere. Aditi opens /proc/<pid>/status and finds voluntary_ctxt_switches: 4.1M, nonvoluntary_ctxt_switches: 9.8M — over 13 million context switches in the last minute. She has been told all her career that "threads are lightweight". Tonight, the kernel disagrees. Every one of those 18,000 threads is a task_struct weighing about 9 KB of kernel memory plus an 8 MB user-stack reservation, every context switch partially flushes the TLB and burns ~1.5 µs of CPU, and the scheduler's runqueue is doing more work choosing the next thread than the threads are doing once they run. The "lightweight" abstraction was never lightweight to the kernel. It was lightweight relative to processes — and only in 1996, when the rest of the world thought a process was the only unit of execution.

A thread is, to a Linux kernel, a task_struct with its own stack, register set, and thread ID, sharing the address space (mm_struct) and file descriptor table with its siblings. clone() with the right flags is what creates one; pthread_create is a libc wrapper around that syscall. The kernel does not distinguish "thread" from "process" — both are tasks; the flags at clone-time decide what is shared. Every thread costs you a kernel object, a stack, a slot in the runqueue, and ~1–3 µs per context switch. "Threads are lightweight" is a 1996 marketing claim that does not survive 2026's core counts.

A thread is a task_struct — the kernel does not have "thread" as a category

Open the Linux source at include/linux/sched.h. The structure that represents a schedulable thing is called task_struct and it is approximately 1,800 lines long. The kernel allocates one of these per thread, and one per process — because to the Linux kernel, a process is just a thread that happens not to share its mm_struct with anyone. The word "thread" appears in user-facing APIs (pthread_create, gettid) and in some kernel comments, but in the scheduler, in the runqueue, in /proc, the unit is the task. A thread is a task that shares specific resources with other tasks; a process is a task whose resources are unshared.

This is not a pedantic distinction — it is an architectural choice made early in Linux's life (and tightened through the 1990s as the clone() syscall grew its flag set), and it is the reason Linux's threading was simpler and more uniform than Solaris's or Windows NT's of the same era. The mechanism is one syscall — clone() — with a flag word that decides which of the parent's resources the child gets:

Flag            What it shares with parent
CLONE_VM        Address space (the mm_struct) — the defining flag for "this is a thread"
CLONE_FS        Filesystem info (cwd, root, umask)
CLONE_FILES     File descriptor table
CLONE_SIGHAND   Signal handlers
CLONE_THREAD    Thread group (so getpid() returns the same value for all siblings)
CLONE_SYSVSEM   System V semaphore undo lists
CLONE_SETTLS    Nothing — tells the kernel to set up a new TLS (thread-local storage) base register for the child

fork() calls clone() with none of these set — child gets its own copy of everything. pthread_create() calls clone() with all of them set — child shares everything except its register set, kernel stack, and TID. The "thread" the user sees is exactly that: same address space, same file descriptors, same signal handlers, different stack and registers.

[Figure: two pthread siblings — the kernel's view. Two task_struct boxes (TID 4127 and 4128, both tgid 4126), each with its own pid, kernel stack, user stack, register set, scheduler state, and TLS base — a ~9 KB kernel object apiece — both pointing via CLONE_VM at one shared mm_struct (page tables, VMAs, brk/mmap_base, refcount = 2), one shared files_struct (FD table), and one shared sighand_struct. Each task_struct is a node in the scheduler's runqueue.]
Two threads created by pthread_create. Each is a full task_struct in the kernel — its own TID, stack, registers, scheduler state. They share the mm_struct (page tables, VMAs, heap), the file descriptor table, and the signal handlers. tgid (thread group ID) is the same for both — that is what makes getpid() return the same value, and what kill(pid, ...) targets.

Why this design beat the alternatives: pre-Linux, the dominant approaches were kernel threads as a separate concept (Solaris, NT — two scheduler classes, two sets of code paths, two sets of bugs) or user-space threads multiplexed onto a few kernel threads ("M:N" threading, as in early POSIX threads on Solaris before LWPs matured). Both lost to Linux's "1:1 thread = task" model on the simplicity of the kernel's code paths and on the uniformity of top -H, ps -L, /proc/[pid]/task/[tid]/. The cost is real: every pthread is a full kernel scheduling entity, not a coroutine. Modern async runtimes (tokio, asyncio, Go's goroutines) put the M:N model back, but in user space — see coroutines and green threads for why that move was correct for that layer, leaving the kernel layer simple.

What pthread_create actually does — the syscall trace

Read the kernel's view by strace-ing a tiny C program that creates one thread. The output is short, real, and reveals every step.

// thread_minimal.c
// build: gcc -O2 -pthread thread_minimal.c -o thr
// run:   strace -f -e clone,mmap,set_robust_list ./thr
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

static pid_t mytid(void) { return (pid_t) syscall(SYS_gettid); }

void *worker(void *arg) {
    printf("worker: pid=%d tid=%d arg=%ld\n", getpid(), mytid(), (long)arg);
    return NULL;
}

int main(void) {
    printf("main:   pid=%d tid=%d\n", getpid(), mytid());
    pthread_t t;
    pthread_create(&t, NULL, worker, (void *)42);
    pthread_join(t, NULL);
    return 0;
}

Sample run on Linux 6.5 (x86_64), with strace -f -e clone,mmap,set_robust_list:

main:   pid=4126 tid=4126
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f3a8b6f8000
mmap(0x7f3a8b6f9000, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK|MAP_FIXED, -1, 0) = 0x7f3a8b6f9000
clone(child_stack=0x7f3a8bef7eb0,
      flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
            CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
      parent_tid=[4127], tls=0x7f3a8bef8700, child_tidptr=0x7f3a8bef89d0) = 4127
[pid 4127] set_robust_list(0x7f3a8bef89e0, 24) = 0
[pid 4127] worker: pid=4126 tid=4127 arg=42

Read the output. Line 2 — the 8 MB stack is mmap'd in user space before the syscall; the kernel knows nothing about it yet. The MAP_STACK flag is a hint; the PROT_NONE first page is the guard page that turns a stack overflow into SIGSEGV. Lines 4–6 — the clone() syscall is what creates the second task_struct. The flag word — eight flags ORed — is the entire definition of "thread": share VM, share FS info, share FDs, share signal handlers, join the same thread group, share SysV semaphore undo lists, set up new TLS, write the child TID into both parent and child memory. Line 7 — set_robust_list registers the new thread's robust-futex list, so that if it dies holding a futex-backed pthread mutex, the kernel can wake the next waiter. Lines 1 and 8 — getpid() returns 4126 in both threads (because tgid is shared); gettid() returns the distinct 4126 / 4127. Why both numbers exist: getpid() is the POSIX-mandated return of the thread group leader's PID, so signals targeted at "the process" go to the right place. gettid() is Linux-specific and returns the per-task TID — the only way to identify a specific thread in /proc/[pid]/task/[tid]/ and in perf record -t [tid]. If you find yourself logging "thread X did something" and the X is getpid(), you have logged the same number for every sibling.

The whole creation costs roughly 20 µs on x86 — about half of that is the mmap for the stack, and most of the rest is kernel-side task_struct allocation, runqueue insertion, and scheduler bookkeeping. Compare to the same machine: fork() is ~80 µs (must copy page tables and many mm fields), and a tokio::spawn inside an already-running process is ~60 nanoseconds. The factor of 300× between pthread-create and tokio-spawn is the entire reason async runtimes exist.

What you actually pay per thread on a modern box

The cost of a thread breaks into four buckets, each measurable on your laptop.
  • Memory — every thread reserves an 8 MB stack VMA by default (RLIMIT_STACK, the ulimit -s value); only pages you actually touch are committed, but the address-space reservation is real, and in a 32-bit process it used to cap you at ~512 threads. The kernel task_struct itself is ~9 KB; the kernel stack is a further 8 KB or 16 KB (THREAD_SIZE).
  • Creation — measured above, ~20 µs.
  • Context switch — voluntary or involuntary, ~1–3 µs depending on whether the switch stays within one mm_struct (sibling threads — no page-table switch, fast) or crosses processes (CR3 reload plus TLB cost — slower; PCID on Skylake+ and ASID on ARM help).
  • Scheduling overhead — the CFS runqueue is a red-black tree keyed by vruntime; with N runnable tasks the per-pick work is O(log N), small in absolute terms but cumulative across millions of schedules per minute on a busy box.

# thread_cost.py — measure pthread_create cost vs tokio-style spawn
# build: cc -O2 -pthread thread_create_bench.c -o tcb && cargo build --release --manifest-path tokio_spawn/Cargo.toml
# run:   python3 thread_cost.py
import subprocess, json

def run(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)

pthread_metrics = run(["./tcb", "100000"])
tokio_metrics   = run(["./tokio_spawn/target/release/spawn_bench", "100000"])

print(f"pthread_create  per-thread:    {pthread_metrics['ns_per_op']:>8.0f} ns")
print(f"tokio::spawn    per-task:      {tokio_metrics['ns_per_op']:>8.0f} ns")
print(f"ratio (kernel thread / async): {pthread_metrics['ns_per_op'] / tokio_metrics['ns_per_op']:.1f}x")
print(f"resident kernel memory per pthread (RSS delta): {pthread_metrics['rss_per_thread_kb']} KB")
print(f"resident memory per tokio task (heap delta):    {tokio_metrics['bytes_per_task']} B")

Sample output on a 16-core Ryzen 7950X (Linux 6.5, kernel CFS):

pthread_create  per-thread:       19840 ns
tokio::spawn    per-task:            58 ns
ratio (kernel thread / async): 342.1x
resident kernel memory per pthread (RSS delta): 12 KB
resident memory per tokio task (heap delta):    176 B

That ratio — 342× cost, 70× memory — is why MealRush's 18,000 pthread fleet collapsed and an equivalent tokio fleet at 18,000 tasks would not have noticed. The kernel is excellent at scheduling 64–256 threads on a 64-core box; it is not excellent at scheduling 18,000. The CFS runqueue still works correctly, but the wakeup-and-pick path runs on every interrupt, the cache footprint of task_struct traversals starts spilling out of L2, and the scheduler's per-CPU lock starts contending.

[Figure: anatomy of a context switch — illustrative latency budget on x86, Linux 6.x; labelled illustrative, not measured. Cross-process (different mm_struct): CFS pick ~200 ns + save regs and FPU ~300 ns + CR3 switch and TLB cost ~800 ns (PCID helps) + restore regs ~300 ns ≈ 1.6 µs total. Same-process siblings (shared mm_struct) skip the CR3/TLB step: ≈ 800 ns — half the cost.]
Illustrative — not measured data. A typical context-switch budget on x86 Linux. Same-process switches (between sibling threads) skip the CR3 reload and the TLB invalidation; cross-process switches pay both. PCID on Skylake+ and ASID on ARMv8 turn the "TLB flush" into "TLB tag swap", reducing the cost — but the kernel still has to write CR3 and the new mappings can take TLB-miss latencies on the next loads. This is why a busy thread-pool server measurably outperforms a fork-per-request server on the same hardware.

Why MealRush's incident specifically hit the scheduler and not the application: at 18,000 runnable tasks on 64 cores, the average runqueue length per CPU is ~280; with CFS's targeted scheduling latency of 6 ms, each task gets a slice of ~21 µs before being preempted. That slice is barely larger than the context-switch cost itself. The system enters the regime where it is spending ~7% of all CPU just choosing what to run next, before any work happens. Reducing the thread count to 256 (four per core) and routing all order-dispatch through a fixed-size async runtime moved the bottleneck from scheduler overhead to actual order-matching, where it belonged.

What kernel state lives per-thread vs per-process

The split between "shared at the process level" and "private to the thread" is the part of the API that catches engineers who learned threading on a forum thread. The exact split:

Resource Per-thread Per-process
Address space (mm_struct) shared one per process
Page tables (PGD / PUD / PMD / PTE) shared one set per process
Heap (the brk-managed region and mmap'd arenas) shared — malloc in one thread can return memory another thread freed one
File descriptor table shared one
Working directory (cwd) shared (unless CLONE_FS was unset) one
Signal handlers (signal(), sigaction) shared one set
Signal mask (sigprocmask, pthread_sigmask) per-thread n/a
errno per-thread (it is a TLS variable in glibc) n/a
Stack per-thread n/a
Register set (RIP, RSP, RBP, GP, XMM, FPU) per-thread n/a
TLS (__thread, thread_local) — fs_base on x86 per-thread n/a
Robust futex list per-thread n/a
TID (gettid()) per-thread TID = TGID for the leader
getpriority / setpriority (nice value) per-thread on Linux n/a
setuid / setgid process-wide in practice (POSIX requires it; the kernel call is per-task, so glibc broadcasts the change to every thread with a signal) one
prctl(PR_SET_NAME, ...) (the name top -H shows) per-thread (16-byte limit) n/a
/proc/[pid]/maps one per process shared view
/proc/[pid]/task/[tid]/sched per-thread scheduler stats n/a

errno deserves a note. In single-threaded C, errno is a global. In multi-threaded glibc, the symbol errno expands (via macro) to (*__errno_location()), where __errno_location returns a pointer into TLS. Why this had to change: a global errno in a multi-thread program is the textbook example of a benign-looking data race that produces real bugs — thread A calls read(), gets -1 and errno = EINTR; thread B's preceding write() failed and set errno = EAGAIN; thread A's check of errno reads B's value, retries the wrong syscall, and the bug is reproducible only under load. Making errno thread-local is part of why C11 added _Thread_local and the <threads.h> header at the language level — the threading model exists in the spec because the alternative was a decade of latent bugs in every libc.

Common confusions

  • "A thread is lighter than a process" Lighter than fork(), yes — clone() with CLONE_VM skips the page-table copy, saving ~80% of fork's cost. But on a modern multicore Linux box a thread is not "free": ~20 µs to create, ~1–3 µs per context switch, 8 MB of address-space reservation per default stack, 9 KB of kernel memory. At 10,000+ threads the scheduler becomes the bottleneck. "Lightweight" is relative to processes, not to coroutines, not to async tasks, not to nothing.
  • "pthread_create makes a kernel thread" It calls clone(), which makes a task_struct — the same kernel object as a process. There is no separate "kernel thread" object in Linux for user-created threads. The phrase "kernel thread" in Linux typically means the kernel-internal threads (kworker, ksoftirqd, kthreadd's children) which are tasks with no user-space mm_struct. Your pthread is a user task, not a kernel thread in that sense.
  • "getpid() returns a unique number per thread" It does not. getpid() returns the thread group ID (TGID), which is shared across all threads in a process. To get the per-thread ID, call gettid() (Linux 2.4.11+, glibc 2.30+) or syscall(SYS_gettid). If your logging library prints getpid(), you cannot distinguish threads in the log.
  • "All threads have the same priority because they share a process" Linux schedules at the task level — every thread has its own nice value (setpriority(PRIO_PROCESS, tid, ...)), its own scheduling class (SCHED_OTHER, SCHED_FIFO, SCHED_RR, SCHED_DEADLINE), and its own vruntime in CFS. You can pin different threads to different cores (pthread_setaffinity_np), give one thread real-time priority while the others run normal, and have one thread sleep on a futex while another spins.
  • "Thread-local storage is just a variable per thread" It is more interesting than that. On x86_64 Linux, TLS is accessed through the fs segment register: __thread int x; compiles to mov %fs:offset, %eax. The kernel sets fs_base on each thread (one of the things clone() does with CLONE_SETTLS). The slot is allocated by the dynamic linker from a per-thread block. This makes TLS access ~1 ns — but it also means that across a dlopen'd shared library boundary, TLS layout is more complex and can involve the __tls_get_addr runtime call.
  • "M:N threading is dead" It died in the kernel — Linux's NPTL (2003) replaced the M:N LinuxThreads with strict 1:1 — but it returned with a vengeance in user space. Go's goroutines are M:N (many goroutines on a few GOMAXPROCS OS threads). Tokio's tasks are M:N (many futures on a few worker threads). Erlang's processes are M:N. The lesson was that the kernel is the wrong layer for M:N (too much shared state to keep cheap), but the runtime is the right layer (you control the scheduler, the stacks, the lifecycle).

Going deeper

clone3() and what pthread_create actually calls in 2026

The original clone() syscall accumulated arguments over 30 years until it became unwieldy. clone3() (Linux 5.3, 2019) takes a single struct clone_args pointer with named fields, supporting CLONE_INTO_CGROUP (atomic cgroup placement at creation — used by container runtimes), set_tid (pick the child TID; used by criu for process-tree restore), and pidfd (a file-descriptor handle to the child, replacing TID-reuse races). Modern glibc (2.34+) calls clone3() from pthread_create, falling back to clone() on older kernels. The flag set is the same; the API is cleaner. Why pidfd matters: PIDs and TIDs are integers, and Linux reuses them. If thread A saves a TID, the task behind it exits, and a new task is later created with the same number, A's saved value now names the wrong task — a classic TOCTOU. pidfd_open(pid, 0) returns a file descriptor that names exactly one task for the descriptor's whole life: it is signalled readable when that task exits and can never silently come to mean a different one (per-thread pidfds need the PIDFD_THREAD flag, Linux 6.9+).

CFS, the runqueue, and why you can see the pick decision

The Completely Fair Scheduler (Ingo Molnar, 2.6.23, 2007) replaced the previous O(1) scheduler with a red-black tree keyed by vruntime — virtual runtime, the time the task has run weighted by the inverse of its nice value. Pick is "leftmost node of the tree". Each task pays for the time it runs by adding to its vruntime; the leftmost is always the most-deserving. You can read this state directly: cat /proc/<pid>/sched shows se.vruntime, nr_switches, nr_voluntary_switches, and nr_involuntary_switches (on kernels with CONFIG_SCHED_DEBUG). perf sched record captures every scheduler event; perf sched latency summarises wakeup-to-run latencies per task. On MealRush's box that night, perf sched latency would have shown the dispatch threads at the wrong end of the histogram: average wakeup-to-run of 4.8 ms with a p99 of 38 ms — the kernel could not get back to them fast enough, because there were ~280 other runnable tasks ahead of them on each core's tree.

task_struct size, kernel stack overflow, and CONFIG_VMAP_STACK

A task_struct is large and has grown over the kernel's life — cgroup pointers, KASAN metadata, BPF contexts, new scheduler fields. On a recent kernel it is ~9 KB, allocated from a dedicated slab cache (task_struct_cachep). The kernel stack for the task is separate — 8 KB or 16 KB (THREAD_SIZE). Pre-2016, kernel stacks were physically contiguous pages; an overflow would silently corrupt adjacent kernel memory, producing a random crash hours later. CONFIG_VMAP_STACK (4.9, 2016) puts each kernel stack in its own vmalloc'd region with guard pages on either side, turning the overflow into an immediate page fault with a useful stack trace. Every modern distro kernel ships with this on. The cost is a little extra TLB pressure (vmalloc'd stacks cannot use the kernel's huge-page direct mapping), which is in the noise.

Why "8 MB stack per thread" is not what it looks like

A pthread's default 8 MB stack is address space, not committed memory. Linux uses demand paging: only pages you actually write to are backed by physical memory. A typical idle thread has 1–2 stack pages committed (~4–8 KB). The cost of "8 MB per thread" is paid in two ways: (1) virtual address space, which on 32-bit limits a process to ~512 threads but on 64-bit is effectively infinite, and (2) the guard page below the stack is reserved with PROT_NONE, so a stack overflow generates SIGSEGV at the boundary. You can shrink it: pthread_attr_setstacksize(&attr, 64*1024) gives a 64 KB stack, and at 64 KB Linux can fit ~30,000 threads in the same address-space budget as ~250 default-stack threads. Servers like nginx (which uses processes, not threads, but shares the lesson) and Erlang's BEAM (which starts each Erlang process on a 233-word heap-plus-stack) take this to the extreme.

What top -H and htop -t show — and why your service has 18,000 lines

top -H switches the per-process view to per-task view; each line is a thread. htop shows threads too (toggle with H; -t gives the tree view). ps -L -p <pid> lists threads. /proc/<pid>/task/ is a directory with one subdirectory per thread, each containing a per-thread view of status, sched, stat, comm, wchan, and stack. Why this matters operationally: when MealRush's dispatch service was misbehaving, the first useful command was top -b -H -n 1 -p <pid> | sort -k9 -nr (batch mode, one line per thread, sorted by CPU%) — it showed which threads were burning CPU, not which process. The next was cat /proc/<pid>/task/*/status | grep -E 'Name|State|voluntary', which revealed the thread-naming convention used by the JVM (so each thread had a meaningful comm), the state distribution (most were S waiting on a futex, a few were R running, none were D stuck in I/O), and the voluntary/involuntary switch ratio. Without the per-thread /proc view, an 18,000-thread JVM is opaque.

Where this leads next

The next chapter goes one level deeper into the user-space side — how pthread_create, Rust's std::thread::spawn, and Go's go func() differ in what they hand the kernel. After that, cores vs hardware threads (SMT/HT) — when two siblings sharing one physical core help and when they hurt. Then the cache hierarchy and the first encounter with MESI, which is where two threads writing to fields next to each other in memory becomes a measurable disaster.

Read Robert Love's Linux Kernel Development, chapters 3 and 4, if you want the kernel-side tour at book length; read the kernel itself at kernel/sched/core.c and kernel/fork.c if you want the source. The kernel is more readable than its reputation suggests, and the scheduler is one of its better-commented subsystems.

References

# Reproduce on your laptop (Linux x86_64 or aarch64)
gcc -O2 -pthread thread_minimal.c -o thr
strace -f -e clone,clone3,mmap,set_robust_list ./thr 2>&1 | head -20
# Inspect a running multi-threaded process:
ps -L -p $(pgrep -n java) | head -20
ls /proc/$(pgrep -n java)/task/ | wc -l       # how many threads
cat /proc/$(pgrep -n java)/status | grep -E 'Threads|voluntary'
# Watch context switches in real time:
pidstat -wt 1 5
# See the scheduler's per-thread view of a single TID:
cat /proc/$(pgrep -n java)/task/$(pgrep -n java)/sched | head -20