Context switch cost

Karan runs the matchmaking service at Dream11 — the Python-fronted backend that pairs users for fantasy contests during the T20 toss window. On a normal afternoon the service runs at 18% CPU on a 32-vCPU c6i.8xlarge and serves p99 in 9 ms. During the IPL toss-to-first-ball window the load goes 22× and CPU climbs to 64%. The strange part is not the climb — it is that p99 climbs to 84 ms, almost 10× higher, while CPU is still nowhere near saturation. perf stat shows the right number of instructions per request. Flamegraphs show the same code paths in the same proportions. But IPC has dropped from 2.3 to 0.9, and pidstat -w 1 shows context switches per second jumping from 4,000 to 312,000. The CPU is busy because it is switching, not because it is computing — and every switch costs the next 50,000 cycles to a cold L1, a cold L2, and a freshly-flushed TLB, none of which appears under any frame in the flamegraph. This chapter is about that gap: the part of a context switch the kernel doesn't show you.

A context switch has two costs and your tooling shows you the smaller one. The direct cost — saving registers, loading the next thread's state, running the scheduler — is 1–5 microseconds and visible under __schedule in any profile. The indirect cost — the cold L1, L2, TLB, and branch predictor the resumed thread inherits — is 10–100× larger, paid by the application's own instructions, and attributed to the application's own frames. A service whose __schedule overhead is 4% can be losing 30% of its throughput to the post-switch warmup. The fix is rarely "make the switch faster"; it is "switch less often" — through CPU pinning, cooperative scheduling, larger work units, or removing the threads you don't need.

What gets swapped — the direct cost

A context switch is the kernel deciding that the currently-running thread should stop using this CPU and a different thread should start. The decision is reached inside __schedule() (in kernel/sched/core.c) and is triggered by a timer tick at the scheduling-quantum boundary, by the running thread blocking on I/O, by a higher-priority thread becoming runnable, or by an explicit yield()/sched_yield() call. Once __schedule() decides, the actual mechanics are mostly bookkeeping.

First, the kernel saves the outgoing thread's CPU state into its task_struct->thread field — the 16 general-purpose registers, the floating-point and SSE/AVX register file (the XSAVE area is 832 bytes with AVX and grows to roughly 2.5 KB once AVX-512 state is enabled), the segment registers, the FS_BASE and GS_BASE MSRs that hold thread-local-storage pointers, and the kernel stack pointer. On modern x86, the FPU/AVX save uses the XSAVE instruction with a kernel-managed bitmap that elides registers the thread hasn't touched — XSAVEOPT skips clean blocks, which is why FPU-heavy threads pay more per switch than scalar-only ones. Second, the kernel picks the incoming thread via the relevant scheduling class (fair_sched_class for SCHED_OTHER, rt_sched_class for SCHED_FIFO/SCHED_RR). The CFS path picks the leftmost node of a red-black tree keyed on virtual runtime — the pick itself is O(1) via a cached leftmost pointer, and re-enqueueing the outgoing thread is O(log N) in the runqueue size; the whole step takes ~200 cycles on a healthy box. Third, the kernel swaps the address space if the incoming thread belongs to a different process — write the new CR3 value, which on KPTI-enabled kernels also performs the kernel/user page-table flip. This is identical in cost to the syscall-path CR3 swap (~70 cycles on Ice Lake) plus the TLB consequences. Fourth, the kernel restores the incoming thread's CPU state — reverse of step one, ~150–800 cycles depending on FPU footprint. Fifth, __switch_to_asm jumps to the incoming thread's saved instruction pointer and the new thread resumes.

The wall-clock cost of all of this on Ice Lake at 3.2 GHz is 1.2 to 4.5 microseconds, depending on whether the switch is intra-process (no CR3) or cross-process (CR3 swap), and whether the FPU was dirty. perf stat -e context-switches counts the events; bpftrace -e 'kprobe:finish_task_switch { @[comm] = count(); }' attributes them to the incoming thread. This is the cost everyone measures and the cost most blog posts quote when they say "context switches are cheap, don't worry about them". The blog posts are not wrong about this number; they are wrong about which number is the bill.
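
If you want to put a number on the direct cost yourself before getting to the main experiment, a pipe ping-pong between two processes pinned to one CPU is the classic micro-benchmark. This is a sketch under stated assumptions — CPU 0 exists and is usable, and the result includes two small read/write syscalls per round trip, so it slightly overestimates the switch itself:

# switch_pingpong.py — estimate the direct context-switch cost by bouncing
# one byte between two processes pinned to the same CPU. Each round trip is
# two switches plus two tiny read/write syscalls (a few hundred ns of noise).
import os, time

ROUNDS = 200_000

def main() -> None:
    os.sched_setaffinity(0, {0})      # pin to CPU 0 so the two processes must alternate
    p2c_r, p2c_w = os.pipe()          # parent -> child
    c2p_r, c2p_w = os.pipe()          # child -> parent
    if os.fork() == 0:                # child: echo every byte straight back
        for _ in range(ROUNDS):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)
    t0 = time.perf_counter_ns()
    for _ in range(ROUNDS):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    dt = time.perf_counter_ns() - t0
    os.wait()
    # Each round trip is two context switches; syscall cost is not subtracted,
    # so this is an upper bound on the direct switch cost.
    print(f"{dt / ROUNDS / 2:8.0f} ns per switch (direct cost, upper bound)")

if __name__ == "__main__":
    main()

On recent x86 server cores this typically prints a figure in the low single-digit microseconds — the same 1–5 µs range quoted above.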

[Figure: Anatomy of a context switch — direct cost vs hidden warmup tax. Illustrative — not measured data.] A narrow stacked bar shows the direct cost, visible under `__schedule` in flamegraphs: save GPRs ~200 cycles, XSAVEOPT (FPU) ~300, CFS pick ~200, CR3 swap ~140 (KPTI), XRSTOR (FPU) ~300, restore ~200 — roughly 1,340 cycles ≈ 0.4 µs in total. A much wider bar shows the indirect cost, paid by the resumed thread's own instructions: L1d/L1i refill ~12k cycles, L2 refill ~24k, LLC refill ~80k, TLB refills ~30k, BTB scrub ~14k — roughly 160k cycles ≈ 50 µs in total, hidden inside application frames. The direct-to-indirect ratio is workload-dependent — small for cache-cold tasks, large (50–200×) for cache-warm hot paths.
The direct switch cost — the part `__schedule` and `finish_task_switch` consume — is roughly 1,340 cycles, or 0.4 µs of the kernel's own time on Ice Lake. The indirect cost — what the resumed thread pays as it warms its caches and TLB back up — is typically 50–200× larger for cache-warm workloads and shows up nowhere near a switch-related symbol in the profile. The first time you see this asymmetry it feels wrong; this is the asymmetry. Illustrative — not measured data.

Why the indirect cost is so much larger than the direct one: the direct cost is paid by the kernel running ~1,300 instructions of well-cached scheduler code with predictable branches. The indirect cost is paid by the application's first ~10,000 memory accesses after resume, every one of which now misses the L1 (32 KB), most of which miss the L2 (1 MB), and a meaningful fraction of which miss the LLC (24–32 MB on a typical Xeon) because the other thread that ran in between trampled both. A single LLC miss to DRAM is ~200 cycles; 400 LLC misses is 80,000 cycles, all attributed to the application code that issued the loads, not to __schedule.

The hidden warmup tax — what your profile won't show you

The instant __switch_to_asm returns and the new thread starts executing, four caches are working against it. The L1d (typically 32 KB on Intel client/server parts) holds zero of its data; the previous thread filled it with something else. The L1i (also 32 KB) holds zero of its instructions. The L2 (256 KB to 1 MB) holds a slowly-decaying mix of both threads' working sets. The dTLB (64 entries on Skylake, larger on Ice Lake) was either flushed by a CR3 swap (cross-process switch) or contains entries for memory the new thread doesn't touch (intra-process switch with different working sets). The BTB (branch target buffer, ~7,000 entries on modern parts) was trained by the previous thread's code path and has zero predictive value for the new thread's branches.

Each of these cold structures translates into a real cost the next time the application touches anything. An L1d miss that hits L2 costs ~12 cycles where an L1d hit costs 4. An L2 miss that hits L3 costs ~40 cycles. An L3 miss that goes to DRAM costs ~200 cycles. A dTLB miss costs ~40 cycles for the page-table walk and serialises the load behind it. A BTB miss costs 15–20 cycles per indirect call as the predictor falls back to static prediction. For a thread that touches a 256 KB working set on resume — a perfectly typical request handler — that is roughly 4,096 cache lines to refill (256 KB ÷ 64-byte lines); at 12 cycles per L2 refill that is already ~49,000 cycles, and with a realistic mix of L3 and DRAM refills the bill runs to 80,000–160,000 cycles — 25 to 50 µs at 3.2 GHz.
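
The back-of-the-envelope above fits in a few lines if you want to play with the assumptions; the per-level latencies are the illustrative figures from this paragraph, not measurements, and the refill split between levels is a guess:

# warmup_cost.py — back-of-the-envelope refill cost for a resumed thread,
# using the illustrative per-level latencies quoted above.
LINE = 64                        # bytes per cache line
GHZ = 3.2

def warmup_us(working_set: int, l2_frac: float, l3_frac: float, dram_frac: float) -> float:
    """Estimate post-switch warmup in microseconds for a given working set,
    split by which level each cache-line refill is served from."""
    lines = working_set // LINE
    cycles = lines * (l2_frac * 12 + l3_frac * 40 + dram_frac * 200)
    return cycles / (GHZ * 1e3)    # cycles -> ns -> µs

# 256 KB working set, mostly refilled from L2/L3 with a small DRAM tail:
print(f"{warmup_us(256 * 1024, 0.60, 0.35, 0.05):.0f} µs per switch")   # prints ~40 µs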

That number is alarming because it is one to two orders of magnitude larger than the direct switch cost everyone measures. It is also the number that explains the IPC collapse Karan saw at Dream11. With 312,000 switches/sec across 32 cores and roughly 25 µs of warmup per switch, that is about 7.8 seconds of CPU time per wall-second — roughly 24% of the 32-core budget. That CPU is doing instructions — they execute, they retire, they show up in the instructions counter — they just take roughly 2.5× as many cycles to retire because the cache and TLB are missing constantly. IPC drops; throughput drops; latency climbs; the flamegraph still looks normal because the same functions are executing in the same proportions, just slower.

The mental model worth carrying away: a context switch is not the moment from __schedule entry to __switch_to_asm exit. A context switch is the entire interval from when the old thread stopped executing to when the new thread's IPC has recovered to its pre-switch baseline. That interval is workload-dependent — cache-cold workloads (one-shot batch tasks) have negligible warmup; cache-warm workloads (steady-state request handlers with hot working sets) have warmup intervals in the 50–500 µs range. The right framing for capacity planning is: switch budget = (target switch rate) × (per-switch warmup cost in µs). If the result exceeds 30% of one core, you are paying more for switching than for computing, and the fix is to switch less.
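
As a worked example of the switch-budget framing — using the Dream11 numbers from the opening story and an assumed mid-range 25 µs warmup per switch, not a measured value:

# switch_budget.py — capacity-planning arithmetic: what fraction of the CPU
# budget is consumed by post-switch warmup at a given switch rate.
switch_rate = 312_000      # switches/sec, fleet-wide (from pidstat -w)
warmup_us   = 25           # assumed per-switch warmup, µs (workload-dependent)
cores       = 32

warmup_cpu_seconds = switch_rate * warmup_us / 1e6         # CPU-seconds per wall-second
print(f"warmup burns {warmup_cpu_seconds:.1f} CPU-s/s "
      f"= {100 * warmup_cpu_seconds / cores:.0f}% of a {cores}-core box")
# -> warmup burns 7.8 CPU-s/s = 24% of a 32-core box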

Measuring it with one Python script

The decomposition matters, but the only way to internalise it is to measure both halves on your own laptop. The script below uses two threads pinned to a shared CPU to force sched_yield-driven switches between them, runs a hot loop that walks a 2 MB working set, and uses perf stat to read out cycles, instructions, context switches, and the L1d, LLC, and dTLB miss counts for both a no-switch baseline and a high-switch run (IPC falls out of the first two).

# context_switch_warmup.py — measure the direct vs indirect cost of a switch
# by forcing two threads to ping-pong over a hot working set.
import ctypes, os, re, subprocess, sys, threading, time

WORKING_SET = 2 * 1024 * 1024     # 2 MB — bigger than L2, smaller than LLC
ITERS       = 4_000_000
LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
LIBC.sched_yield.restype = ctypes.c_int

def hot_loop(buf: bytearray, yield_every: int) -> int:
    """Walk a 2 MB buffer with stride=64 (one byte per cache line),
    optionally calling sched_yield() every `yield_every` iterations to
    force a context switch."""
    n, stride = len(buf), 64
    acc = 0
    for i in range(ITERS):
        acc += buf[(i * stride) % n]
        if yield_every and (i % yield_every) == 0:
            LIBC.sched_yield()
    return acc

def run(label: str, yield_every: int, threads: int) -> None:
    buf = bytearray(WORKING_SET)
    for k in range(0, WORKING_SET, 64):  # touch every line so it's hot
        buf[k] = (k & 0xFF)
    t0 = time.perf_counter_ns()
    workers = [threading.Thread(target=hot_loop, args=(buf, yield_every))
               for _ in range(threads)]
    for w in workers: w.start()
    for w in workers: w.join()
    dt = (time.perf_counter_ns() - t0) / 1e6
    print(f"{label:<28} {dt:7.1f} ms  ({ITERS*threads/dt/1e3:6.1f} kops/ms)")

if __name__ == "__main__":
    if "--inner" in sys.argv:
        mode = sys.argv[sys.argv.index("--inner") + 1]
        if mode == "no-switch":   run("no switches, 1 thread",   0,    1)
        elif mode == "switching": run("yield/100 iters, 2 threads", 100, 2)
        sys.exit(0)
    EVENTS = ("cycles,instructions,context-switches,"
              "L1-dcache-load-misses,LLC-load-misses,dTLB-load-misses")
    for mode in ("no-switch", "switching"):
        proc = subprocess.run(
            ["taskset", "-c", "0",          # pin both threads to one core
             "perf", "stat", "-e", EVENTS, "--",
             sys.executable, __file__, "--inner", mode],
            capture_output=True, text=True)
        print(f"\n=== {mode.upper()} ===")
        print(proc.stdout.strip())
        for line in proc.stderr.splitlines():
            m = re.search(r"^\s*([\d,.]+)\s+(\S+)", line)
            if m and m.group(2) in EVENTS.split(","):
                print(f"  {m.group(2):<26} {m.group(1):>18}")

Sample run on a c6i.4xlarge (Ice Lake, 3.2 GHz, kernel 6.5, KPTI on):

=== NO-SWITCH ===
no switches, 1 thread          1,820.5 ms  (   2.2 kops/ms)
  cycles                          5,824,160,310
  instructions                   12,802,114,907
  context-switches                            7
  L1-dcache-load-misses               4,213,108
  LLC-load-misses                         8,402
  dTLB-load-misses                       21,055

=== SWITCHING ===
yield/100 iters, 2 threads     5,394.2 ms  (   1.5 kops/ms)
  cycles                         17,261,300,994
  instructions                   25,610,118,403
  context-switches                       80,402
  L1-dcache-load-misses             142,830,710
  LLC-load-misses                       681,225
  dTLB-load-misses                    1,824,613

Two threads doing twice the work take 2.96× the wall time instead of 2× — the extra 0.96× is pure switching tax. Look at the counters. Instructions roughly doubled (25.6 G vs 12.8 G), as expected. Cycles tripled. IPC fell from 2.20 to 1.48. L1-dcache-load-misses jumped 34×, LLC-load-misses jumped 81×, dTLB-load-misses jumped 87× — even though the working set never changed and each thread touches the same 2 MB buffer the other one just touched. Why the L1d miss count multiplied so dramatically: each sched_yield triggers a switch to the sibling thread, which walks the same 2 MB buffer from its own offset. By the time the original thread resumes, the L1d (32 KB) has been completely overwritten and the L2 (1 MB on this part) has been mostly trampled. The original thread restarts from where it left off, finds nothing in its cache, and pays an L2 hit (~12 cycles) or an L3 hit (~40 cycles) for every load that used to be a 4-cycle L1 hit. Multiply the ~140 M extra L1d misses by a few tens of cycles each and you get roughly 5–6 G cycles — almost exactly the gap between the switching run's 17.3 G cycles and the ~11.6 G you would expect from simply doubling the baseline's work — with the kernel never showing up in any frame.

The 80,402 context switches in the run are visible. The roughly 5.6 G cycles of warmup tax they cost — about a third of the switching run's total cycle budget — are what the script exists to make visible. A flamegraph of the same workload would show hot_loop consuming 100% of CPU in both cases; only by looking at the cache and TLB miss counters can you see what the switches actually cost.

A useful variant of the experiment, worth running once you have the baseline: change WORKING_SET from 2 MB to 16 KB (smaller than L1d) and rerun. The switching run's overhead drops dramatically because the entire working set fits in L1d, so even after eviction the refill is fast and the L2 still holds the recently-evicted lines. Now change it to 64 MB (larger than the typical c6i.4xlarge LLC) and rerun — the switching run's overhead rises again because both threads now miss to DRAM constantly and the cost-per-load is ~200 cycles regardless of which thread last touched it. The takeaway is that switch tax is most damaging in the L1-warm, L2-cold-after-switch regime — the regime most modern request handlers operate in. Workloads at either extreme of the working-set range pay less per switch, which is the counter-intuitive reason that some "obviously cache-bound" services don't benefit from switch reduction while some "cache-friendly" services do.
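
A small driver makes the sweep repeatable; it assumes you have made the one-line change of reading WORKING_SET from an environment variable (shown in the comment below), and the size list is just the regimes discussed above plus a mid-point:

# ws_sweep.py — run context_switch_warmup.py across working-set sizes.
# Assumes the script has been modified to read its working set from the
# environment, e.g.:  WORKING_SET = int(os.environ.get("WS_BYTES", 2 * 1024 * 1024))
import os, subprocess, sys

SIZES = [16 * 1024, 256 * 1024, 2 * 1024 * 1024, 64 * 1024 * 1024]

for ws in SIZES:
    env = dict(os.environ, WS_BYTES=str(ws))
    print(f"\n##### working set = {ws // 1024} KB #####")
    subprocess.run([sys.executable, "context_switch_warmup.py"], env=env, check=False)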

Three implementation notes. First, the script pins both threads to CPU 0 with taskset -c 0 so that the OS is forced to time-share the threads on one core. Without the pin, CFS would simply spread the two threads across two cores and the "switching" run would not show the warmup tax at all — the threads would run in parallel without contention. Second, the working set is 2 MB to fit between L2 (1 MB) and L3 (32 MB) — large enough that L1 evictions show up dramatically, small enough that LLC isn't constantly missing to DRAM. Adjust this for your CPU's cache sizes (lscpu | grep cache). Third, sched_yield() is the cheapest way to force a switch from user space; in real production it is epoll_wait, read, write, and timer ticks that drive the switches, but they all hit the same kernel path and pay the same warmup tax.

A useful corollary: the warmup tax is non-additive across consecutive switches. Two switches in 10 µs do not cost 2× one switch's warmup, because the second switch leaves the cache in roughly the same state the first one did — there isn't much more to evict. But a switch followed 200 µs later by another switch costs nearly 2× because in those 200 µs the resumed thread had time to refill significant cache state, all of which gets trampled again. The implication for capacity planning is counter-intuitive: a workload with 100,000 switches/sec evenly spaced over 1 second is more expensive than a workload with 100,000 switches/sec arriving in two big bursts. The kernel's runqueue dynamics often produce one or the other shape depending on whether wakeups are correlated (timer-driven, e.g. all threads wake at every 10 ms tick) or independent (I/O-driven, e.g. one thread wakes per network packet). Aligning wakeups deliberately — by batching timer callbacks, draining multiple ready events per epoll_wait, or using EPOLLEXCLUSIVE to avoid thundering-herd wakeups — converts the spaced-out shape into the bursty shape and recovers a meaningful fraction of the warmup tax.

Three production stories where switch tax was the bottleneck

The pattern recurs in Indian production with different fingerprints. Three worth memorising.

Hotstar HLS encoder: the goroutine-per-segment storm. The HLS chunker that segments live IPL streams into 6-second .ts files used a goroutine per segment per stream. At 3.2M concurrent viewers across 8,000 streams, the box ran 64,000 goroutines on a 32-vCPU instance. Go's runtime scheduled them across the cores, but with so many runnable goroutines the per-goroutine timeslice was 80 µs — meaning 12,500 switches/sec/core. CPU showed 78%, IPC was 1.1 (down from 2.4 in single-stream tests), and segment-encode latency p99 was 340 ms instead of the single-stream 90 ms. Switch tax was eating roughly 28% of CPU silently. The fix: a worker-pool pattern with runtime.GOMAXPROCS / 2 workers, each pulling segments from a channel. Switch rate dropped to 1,200/sec/core; IPC recovered to 2.2; p99 dropped to 110 ms with no algorithmic change.

The deeper lesson is that "let the scheduler handle it" works at small concurrency and breaks at high concurrency in a way that looks like an application problem. The Go team's own benchmarks show GOMAXPROCS-bounded scheduling is fine up to roughly 10× GOMAXPROCS in runnable goroutines; beyond that, the timeslice shrinks toward the kernel's sched_min_granularity_ns (3 ms by default, tunable down to ~750 µs), and below that the per-switch warmup tax dominates. The same pattern shows up in Java (too many ForkJoinPool threads), Python (too many asyncio tasks scheduled to threads via run_in_executor), and Node (too many libuv worker threads). The diagnostic is identical — high pidstat -w switch rate, low IPC under perf stat, healthy-looking CPU on dashboards.
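
The Python analogue of the worker-pool fix — a sketch, not Hotstar's code; the names process_segment and work_q are invented for illustration — keeps the runnable-thread count fixed regardless of offered load:

# bounded_pool.py — replace task-per-item scheduling with a fixed worker pool.
# Names (process_segment, work_q) are illustrative, not from any real service.
import os, queue, threading

work_q: "queue.Queue[bytes]" = queue.Queue(maxsize=10_000)
WORKERS = max(1, (os.cpu_count() or 2) // 2)    # a handful of workers, not one per item

def process_segment(payload: bytes) -> None:
    ...                                         # the actual per-item work goes here

def worker() -> None:
    while True:
        item = work_q.get()                     # blocks: one voluntary switch per refill
        if item is None:                        # shutdown sentinel
            break
        process_segment(item)
        work_q.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(WORKERS)]
for t in threads:
    t.start()
# Producers call work_q.put(payload); the runnable-thread count stays at WORKERS
# instead of growing with offered load, so the kernel's switch rate stays flat.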

Razorpay payment-callback handler: the false-pinning story. A team running a Java service for UPI callback handling pinned each request handler to a specific CPU using JNI calls to sched_setaffinity, on the theory that pinning would reduce switch cost. It did the opposite. Each pinned handler got starved when its assigned CPU went into a long syscall (the JVM's GC paused for 8 ms periodically), so the kernel queued up 12,000 callbacks on that CPU's runqueue while 31 other cores idled. When GC released, all 12,000 callbacks ran sequentially with maximum cache contention because the cache was completely cold for each — the prior 11,999 callbacks had each filled and trampled the L1/L2 for the next one. p99 spiked from 22 ms to 1.4 seconds. The fix was to remove the affinity setting and let CFS schedule across cores, accepting some inter-core migration cost in exchange for runqueue load balancing. The lesson: pinning is the right tool when your workload's cache footprint is a sticky property of the thread, not a sticky property of one specific request.

A useful generalisation: pinning helps when the thread has a hot working set (e.g. a DPDK packet processor with per-thread connection state); pinning hurts when the request has a hot working set (e.g. a request handler that loads a customer's profile and deletes it after responding). Most web service handlers fall in the second category; most data-plane services fall in the first. Picking the wrong mental model leads to "I pinned everything and made it worse" — a recurring postmortem at Indian fintechs over the last three years.

Zerodha Kite order matcher: the noisy neighbour from observability. The order-matching engine ran on isolated cores via isolcpus=2-15 so that no other process could interrupt the matcher threads. After a 2024 upgrade to a Datadog agent that spawned a per-CPU collection thread, the matcher's CPU usage looked unchanged at 22% but p99 climbed from 800 µs to 4.2 ms. The Datadog threads were not on the isolated cores by config, but their parent collection thread woke them every 100 ms via pthread_cond_signal, which generated cross-CPU IPIs (inter-processor interrupts) that briefly preempted the isolated cores' matcher threads. Each preemption was 2 µs of direct cost and 60 µs of indirect cache warmup. At 10 IPIs/sec/core × 14 isolated cores × 62 µs total cost, that was ~8.7 ms of CPU/sec across the cores — barely visible in the CPU graph but devastating to p99 because each preemption fell inside a different request, lengthening that one's tail. The fix was to add nohz_full=2-15 and irqaffinity=0-1 to the kernel command line so that the isolated cores really were isolated from timer interrupts and IPIs. p99 dropped back to 850 µs.

The pattern across all three: the dashboard-visible CPU was healthy, the application's algorithmic complexity hadn't changed, and the bug was structural. The right diagnostic ladder is pidstat -w 1 (switch rate per process) → perf stat -e cycles,instructions,context-switches (IPC and switch correlation) → bpftrace -e 'tracepoint:sched:sched_switch { @[prev_comm, next_comm] = count(); }' (which threads are switching to which). If the switch rate per core exceeds 5,000/sec and IPC has dropped from baseline, switch tax is the diagnosis even if __schedule shows up at 4% in the flamegraph. The flamegraph is showing you the wrong cost.

[Figure: IPC decay during a switch storm — Dream11 IPL toss window. Illustrative — not measured data.] Two panels over 19:30–19:55 IST. Top: context switches per second per core climbing from 4,000 at 19:30 to 312,000 at 19:42 (the toss-to-first-ball window) and recovering to 4,500 by 19:55. Bottom: IPC dropping from 2.3 to 0.9 in inverse correlation, then recovering. Throughout, CPU utilisation stays in the 60–70% band and never looks saturated. Same instruction count per request; roughly 2.5× the cycles to retire them — throughput collapses while the CPU graph stays flat.
The signature of switch-tax saturation in production: a sharp climb in context-switches/sec/core, an inverse decay in IPC, and a CPU graph that never crosses 80%. Most monitoring stacks alert on CPU > 90%, so this regime sails under the radar until p99 latency starts breaching SLO. The diagnostic instinct that catches it early is correlating the IPC counter (`perf stat -a -e cycles,instructions sleep 5`) against the switch counter (`vmstat 5`) before reaching for the flamegraph. Illustrative — not measured data.

A subtler fourth story worth flagging because it generalises: the Swiggy delivery-partner location service ran into switch tax not from too many threads but from too many epoll-wakeups per request. Each delivery partner's GPS connection used a long-lived TCP socket; the service used one epoll_wait per partner per 5-second update window. With 1.8M active partners, that was 360,000 wakeups/sec across 64 cores, each producing one context switch from the kernel waker thread to a worker thread. Switch rate per core was ~5,600. IPC was 0.84 (down from 2.1 in single-tenant tests). The fix was to switch to EPOLLEXCLUSIVE (which prevents the thundering-herd wake of multiple workers) and to coalesce multiple partner updates into batches before the worker thread resumes. Switch rate dropped to ~600/sec/core; IPC recovered to 1.9. The diagnostic instinct that resolved it in 90 minutes — pidstat -w first, IPC second, only then jumping to flamegraphs — is the muscle this chapter is trying to build.

Four patterns that move switches off the hot path

When switch tax is your bottleneck, you have to switch less. Four production patterns dominate, each addressing a different cause.

CPU pinning with taskset or sched_setaffinity. Pinning ties a thread to a specific core (or set of cores), preventing the scheduler from migrating it. The benefit is that the L1/L2 caches and dTLB stay warm because the same thread keeps using them; the cost is that load balancing across cores is your job, not the kernel's. Pinning works for dedicated-purpose threads with stable working sets — DPDK packet processors, ScyllaDB shard threads, low-latency trading matchers, the per-CPU kworker threads the kernel uses internally. Pinning hurts for multiplexed worker threads that handle different requests with different working sets — the cache stays cold for each request anyway, and you've given up the kernel's load balancer for nothing. Indian production heuristic: pin when the thread's name describes one persistent role (e.g. tick-distributor-shard-3); do not pin when the thread's name is worker-thread-7 and its job rotates per request.
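
From Python, pinning needs neither taskset nor JNI: os.sched_setaffinity accepts a thread id on Linux. The sketch below pins the calling thread to one core — appropriate only for the persistent-role threads described above, and the CPU number is arbitrary:

# pin_example.py — pin the calling thread to one core with sched_setaffinity.
# Appropriate for threads with one persistent role and a stable working set;
# counterproductive for per-request worker threads (see the Razorpay story).
import os, threading

def pin_current_thread(cpu: int) -> None:
    tid = threading.get_native_id()      # on Linux a thread's TID is a valid pid here
    os.sched_setaffinity(tid, {cpu})
    print(f"thread {tid} pinned to CPUs {os.sched_getaffinity(tid)}")

def shard_worker(cpu: int) -> None:
    pin_current_thread(cpu)
    # ... long-lived, cache-hot work for this shard stays on this core ...

threading.Thread(target=shard_worker, args=(3,)).start()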

Cooperative scheduling: replace many threads with an event loop. A service that handles 50,000 concurrent connections with 50,000 threads pays switch tax on every I/O wakeup. The same service with a single-threaded epoll-based event loop (or one event loop per core, pinned) pays far less switch tax — the I/O multiplexing happens in user space, and the remaining kernel switches are the epoll_wait blocks when the loop has nothing ready plus timer ticks. Node.js's libuv, Python's asyncio, Rust's tokio, Go's runtime, Java's Project Loom, and Nginx's worker model are all variations on this pattern. The trade-off is that any blocking call in your handler stalls the entire event loop, so the discipline of "only async I/O on this thread, ever" is the price. Hotstar's edge-cache layer runs on Nginx for exactly this reason: a 32-vCPU instance handling 80,000 concurrent connections does it with 32 worker processes (one pinned per core), not 80,000.
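
A minimal cooperative-scheduling loop, for flavour — one thread multiplexing many connections through epoll via Python's selectors module; the port and buffer size are arbitrary, and a production loop would add error handling and write-buffering:

# event_loop.py — one thread multiplexing many sockets with epoll (via selectors).
# All "switching" between connections happens in user space; the only kernel
# switches left are the blocking select() calls when nothing is ready.
import selectors, socket

sel = selectors.DefaultSelector()            # epoll on Linux

def accept(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        conn.send(data)                      # a real loop would buffer partial writes
    else:
        sel.unregister(conn)
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 9000))
listener.listen(1024)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():              # one wakeup can drain many ready sockets
        key.data(key.fileobj)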

Larger work units — fewer wakeups. If a worker thread wakes 10,000 times per second to do 10 µs of work each time, the per-wakeup switch cost dwarfs the work itself. Batching the work — accumulating 1,000 events and processing them in one wakeup — reduces wakeup rate 1000× while doing the same total work. The trade-off is added latency (events wait in the batch); the right batch size is whatever the consumer's tolerance allows. Swiggy's geo-write story above is one instance; Kafka producers' linger.ms is another (default is 0, meaning send-on-every-message; production deployments routinely tune this to 5–20 ms, trading 5–20 ms of latency for 100×-fewer broker syscalls and switches).
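
The batching pattern in sketch form — the batch size and linger window are arbitrary placeholders (the Kafka analogue is linger.ms), to be tuned against the consumer's latency tolerance:

# batcher.py — coalesce many small events into one wakeup per batch.
# MAX_BATCH and LINGER_S are arbitrary placeholders, not tuned values.
import queue, threading, time

events: "queue.Queue[dict]" = queue.Queue()
MAX_BATCH = 1_000
LINGER_S  = 0.005

def drain_forever(handle_batch) -> None:
    while True:
        batch = [events.get()]                   # block until at least one event arrives
        deadline = time.monotonic() + LINGER_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(events.get(timeout=timeout))
            except queue.Empty:
                break
        handle_batch(batch)                      # one wakeup processed N events

threading.Thread(target=drain_forever,
                 args=(lambda b: print(f"processed {len(b)} events"),),
                 daemon=True).start()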

Remove threads you don't need. The cheapest way to eliminate a switch is to delete the thread that was waking up. Application servers that spawn one thread per request hit switch-tax bottlenecks at concurrency levels (10,000+ threads) where the threads themselves consume more CPU than the requests do. The fix is structural — fixed-size worker pools, async I/O, or a different runtime model. Java's Project Loom (virtual threads) addresses this by making thread creation cheap enough that you can have a million of them, but parking a virtual thread is still a context switch in the underlying carrier thread; the win is in memory and creation cost, not in switch tax.

Each pattern has a measurable signature in the diagnostic ladder. CPU pinning shows up as a near-zero migrations count in perf sched record output and per-thread cache-miss rates 40–80% lower than the unpinned baseline. Cooperative scheduling shows up as cswch/s dominating nvcswch/s by 100:1 — the threads are blocking on epoll, not being preempted. Larger work units show up as cswch/s per process dropping in proportion to the batch size while throughput stays constant. Removing threads shows up as the runqueue length (nr_running in /proc/sched_debug) dropping below GOMAXPROCS. A platform team that has internalised these signatures can verify a fix's mechanism — not just its effect — in the 30 seconds after deploy, which converts performance work from "did the dashboard recover" guesswork into a deterministic before/after comparison.

There is a subtler observation about pattern selection that production teams keep relearning: the right pattern depends on whether your workload is bursty or steady. A steady-state workload — Zerodha's tick distributor, ScyllaDB's per-shard reactor, an HFT order matcher — benefits from CPU pinning because the working set never moves. A bursty workload — Razorpay's payment callback handler, Hotstar's HLS encoder during ad-break, Swiggy's lunch-rush dispatch — benefits from cooperative scheduling because the runnable population varies dramatically and rigid pinning creates idle cores during the trough. Mixing patterns within one service (some pinned threads, some pool threads) is operationally complex and rarely worth it; pick one model per service and design accordingly. The teams that try to "have it both ways" by pinning some threads and pooling others tend to end up with the worst of both worlds — pinning's brittleness and pooling's switch tax — because the patterns assume contradictory things about thread lifecycle.

A fifth pattern worth naming because it is increasingly common: cgroup CPU bandwidth controls combined with workload-aware scheduling. Kubernetes's cpu.cfs_quota_us enforces a CPU bandwidth limit by throttling the cgroup's threads when they exceed the quota — and throttling looks exactly like a context switch, with the same warmup tax on resume. A pod with cpu: 500m (500 millicores) gets throttled in any 100 ms period (the default cpu.cfs_period_us) in which it tries to run for more than 50 ms, even on an otherwise-idle node, and each throttling event costs 50–200 µs of cache warmup on resume. For services where p99 matters more than throughput, removing the CPU limit (using only requests, not limits) often improves p99 by 30–60% with no other change. CRED's payment-scoring service made exactly this switch in 2024 and recovered 35% of p99 latency on the hot path. The common Kubernetes advice of "always set CPU limits for fairness" is, for latency-sensitive services, an anti-pattern.
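
Whether a workload is actually being throttled is visible in its cgroup's cpu.stat counters; a minimal reader, assuming cgroup v2 mounted at /sys/fs/cgroup (on v1 the same counters live under the cpu controller's cpu.stat):

# throttle_check.py — read CFS throttling counters for the current process's cgroup.
# Assumes cgroup v2 mounted at /sys/fs/cgroup; on v1 look for nr_throttled and
# throttled_time (ns) in the cpu controller's cpu.stat instead.
from pathlib import Path

def cpu_stat_v2() -> dict:
    # On cgroup v2, /proc/self/cgroup contains a line like "0::/kubepods/.../<pod-id>"
    line = next(l for l in Path("/proc/self/cgroup").read_text().splitlines()
                if l.startswith("0::"))
    stat = Path("/sys/fs/cgroup") / line[3:].lstrip("/") / "cpu.stat"
    return dict(l.split() for l in stat.read_text().splitlines())

if __name__ == "__main__":
    s = cpu_stat_v2()
    print(f"periods throttled : {s.get('nr_throttled', '0')}")
    print(f"time throttled    : {int(s.get('throttled_usec', 0)) / 1e6:.2f} s")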


Going deeper

Voluntary vs involuntary switches — what the ratio tells you

pidstat -w separates context switches into two columns: cswch/s (voluntary, the thread blocked on something) and nvcswch/s (involuntary, the kernel preempted it). The ratio between the two is one of the most diagnostic numbers in Linux performance engineering. A thread with cswch/s = 5,000 and nvcswch/s = 50 is doing I/O — that is an I/O-bound thread, not a CPU-bound one, and switch tax is probably not the issue. A thread with cswch/s = 50 and nvcswch/s = 5,000 is CPU-bound and being preempted by other runnable threads — that is the classic switch-tax signature, and the fix is to reduce the runnable thread count, not to optimise the thread's own code. A thread with both columns high (cswch/s = 5,000 and nvcswch/s = 5,000) is doing chatty I/O on a saturated box — both fixes apply.

The asymmetry is important because the two columns suggest different remediations. Voluntary switches are usually fine — they are the thread cooperatively giving up the CPU when it has nothing to do. Involuntary switches are the dangerous ones because they happen mid-computation, when the cache and TLB are warm and the thread is making progress. Every involuntary switch is the thread saying "I was about to retire the next 100,000 instructions in 50,000 cycles" and the kernel saying "no, you'll retire them in 200,000 cycles after warming back up". Counting involuntary switches separately is the single most informative addition you can make to a process-level dashboard for a latency-sensitive service.
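
The two counters are exported per process in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches — the same numbers pidstat -w derives its rates from. A minimal poller for putting the ratio on a dashboard, taking the PID as a command-line argument:

# switch_ratio.py — per-second voluntary vs involuntary switch rates for a PID,
# read from /proc/<pid>/status (the same counters pidstat -w reports).
import sys, time

def read_switches(pid: int) -> tuple[int, int]:
    vol = invol = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches"):
                vol = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches"):
                invol = int(line.split()[1])
    return vol, invol

pid = int(sys.argv[1])
prev = read_switches(pid)
while True:
    time.sleep(1)
    cur = read_switches(pid)
    print(f"cswch/s={cur[0] - prev[0]:>8}   nvcswch/s={cur[1] - prev[1]:>8}")
    prev = cur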

For deeper investigation, perf sched record -- sleep 5 followed by perf sched latency captures every scheduler event for the window and reports per-thread switch rates, maximum wakeup-to-run delays, and average runtimes. The "Maximum delay" column is the worst-case time a runnable thread waited before getting a CPU; for latency-sensitive services this number directly bounds your tail. perf sched is heavyweight (100+ MB of trace data in 5 seconds on a busy box) and requires root, so it is the second tool you reach for after pidstat flags a suspicious process. The combination — pidstat for the rate, perf sched for the per-thread breakdown, bpftrace switch graph for the bipartite "who is preempting whom" map — is the standard diagnostic ladder.

CFS internals — why the timeslice shrinks under load

The Completely Fair Scheduler (the default SCHED_OTHER policy since 2007) tracks each thread's vruntime — virtual runtime weighted by the thread's priority — and always picks the runnable thread with the smallest vruntime. The implication is that under high concurrency the timeslice each thread gets shrinks: if 100 threads are runnable against a target latency of 6 ms (sched_latency_ns), the naive per-thread slice would be 60 µs. CFS refuses to go that low — once the slice would fall below sched_min_granularity_ns (typically 3 ms once scaled for CPU count; set as kernel.sched_min_granularity_ns), it stops shrinking the timeslice and instead stretches the latency cycle, so 100 runnable threads × 3 ms = a 300 ms wait before each thread runs again. Why this matters for switch tax: if the effective run interval does approach the tens-of-microseconds end of the curve — because the granularity has been tuned down, or because the threads block and re-wake far faster than their slice — each switch costs 50–100 µs of warmup and warmup time exceeds run time; throughput literally cannot make progress. The min-granularity floor protects throughput only by collapsing per-thread response time instead, which is the 300 ms latency cycle. Tuning sched_min_granularity_ns upward (to e.g. 10 ms) for thread-heavy services trades fairness for throughput; this is what some Indian fintechs running heavy JVM service meshes do. (On kernels 6.6 and later, CFS has been replaced by EEVDF and the knob is the base slice, but the shrinking-slice dynamic is the same.)
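
The interaction between the two tunables is easier to see as arithmetic; a sketch using the default values quoted above (your kernel's actual values may differ, and newer kernels expose the knobs elsewhere):

# cfs_slice.py — how the per-thread timeslice and the latency period scale
# with the runnable-thread count, using the default tunables quoted above.
SCHED_LATENCY_MS   = 6     # kernel.sched_latency_ns / 1e6
MIN_GRANULARITY_MS = 3     # kernel.sched_min_granularity_ns / 1e6

def slice_and_period(runnable: int) -> tuple[float, float]:
    naive_slice = SCHED_LATENCY_MS / runnable
    if naive_slice >= MIN_GRANULARITY_MS:
        return naive_slice, SCHED_LATENCY_MS                      # fits inside the target latency
    return MIN_GRANULARITY_MS, MIN_GRANULARITY_MS * runnable      # period stretches instead

for n in (2, 8, 100, 1000):
    s, p = slice_and_period(n)
    print(f"{n:>5} runnable -> {s:5.2f} ms slice, {p:8.0f} ms until each thread runs again")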

CFS exposes its scheduling decisions through /proc/<pid>/sched, which shows the thread's vruntime, recent CPU time, voluntary vs involuntary switch counts, and migration count. Reading this for a "slow" thread often reveals that its nr_involuntary_switches is 50× higher than nr_voluntary_switches — the thread is being preempted by other runnable threads, not blocking on I/O. This is the unambiguous signal that switch tax (not I/O wait, not GC pauses, not lock contention) is the latency contributor.

Cache topology and the cost of cross-core migration

Not all switches are equal. A switch that resumes the thread on the same physical core preserves L1/L2 (assuming nothing else ran on that core in between, which is rare). A switch that resumes on a sibling SMT thread of the same physical core preserves L1/L2 but the L1 was being shared with the sibling. A switch that resumes on a different core in the same socket loses L1/L2 but retains L3 (LLC is shared per socket on Intel; per-CCD on AMD Zen). A switch that resumes on a different socket loses everything including L3, plus pays NUMA remote-access cost on every memory load until pages migrate. The cost ratio between these four cases is roughly 1× : 2× : 8× : 30×.
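
Which cores share which cache level is readable from sysfs, and that topology is the raw data behind the four-way cost ranking above. A short dump, assuming the standard Linux /sys layout:

# cache_topology.py — print which CPUs share each cache level, from sysfs.
# The shared_cpu_list files are the ground truth behind the "same core /
# same LLC / other socket" cost ranking above.
from pathlib import Path

for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    for idx in sorted((cpu / "cache").glob("index*")):
        level  = (idx / "level").read_text().strip()
        ctype  = (idx / "type").read_text().strip()
        size   = (idx / "size").read_text().strip()
        shared = (idx / "shared_cpu_list").read_text().strip()
        print(f"{cpu.name:>6}  L{level} {ctype:<12} {size:>8}  shared with CPUs {shared}")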

The kernel's wakeup-placement and load-balancing logic (select_task_rq_fair and load_balance in kernel/sched/fair.c) has heuristics to prefer same-core wakeups and avoid cross-socket migration when possible, implemented by the wake_affine_idle and wake_affine_weight heuristics. Production systems with NUMA topology should run with the numactl --hardware layout understood and applications either NUMA-aware (binding their own threads) or running under numactl --localalloc to keep memory local to the cores using it. A service that runs on a 2-socket box with no NUMA awareness can spend 8% of CPU on remote-NUMA accesses caused by cross-socket migrations alone — invisible in any profile that doesn't break out by socket.

What changes on aarch64 and Apple Silicon

The cache-warmup tax is overwhelmingly an artefact of cache geometry, not instruction-set architecture. ARM server cores differ from x86 mostly in proportions: Neoverse-based parts such as Graviton pair a larger L1d (64 KB) with a roughly 1 MB per-core L2 and a comparatively small shared-LLC slice per core, which makes each individual warmup longer in absolute cycles while pushing more of the refill traffic toward DRAM once the LLC share is exhausted. Apple Silicon's M-series performance cores have unusually large L1 caches (128 KB data) and a large L2 shared across the performance cluster (16 MB on recent parts), which makes per-switch warmup expensive when the L1 is trampled but lets recovery happen from the shared L2 rather than DRAM. Why this matters for cross-architecture deployment: a service's switch-tax sensitivity is not constant across hardware. Code that runs cleanly on a c6i.4xlarge (Ice Lake: 32 KB L1d, ~1 MB L2 per core, tens of MB of shared L3) can behave very differently on a c7g.4xlarge (Graviton 3: 64 KB L1d, 1 MB L2 per core, a much smaller shared-LLC slice per core) under the same workload, because refills that used to be absorbed by the LLC reach DRAM more often. Capacity planning has to be re-done per architecture; benchmarks from one don't transfer.

Apple Silicon additionally has an interesting wrinkle: its asymmetric performance/efficiency core architecture means a thread can be migrated from a P-core to an E-core mid-execution, and the IPC characteristics differ by ~3×. macOS's scheduler uses QoS classes to control this, but Linux on Apple Silicon (Asahi Linux) has to make assumptions that don't always match the hardware. For server workloads this is rarely relevant; for development laptops it is the explanation when "the same code is sometimes 3× slower for no apparent reason".

A practical implication for Indian teams running mixed-architecture fleets — increasingly common as Graviton (AWS), Ampere (GCP), and Axion (also GCP) ARM instances become cost-competitive with x86 — is that the switch-tax tuning that worked on Intel may not transfer. A Razorpay service tuned to GOMAXPROCS=16 on c6i.4xlarge with 32 KB L1d may need GOMAXPROCS=12 on c7g.4xlarge with 64 KB L1d because Graviton's larger L1 makes each switch's warmup longer in absolute cycles even though there are fewer of them. The right approach when migrating across architectures is to re-run the pidstat -w and IPC measurements from scratch, not to assume the tuning constants port. Teams that skip this step typically discover the mismatch only after a production incident.

When more switches is the right answer, and reproducing this on your laptop

The optimisation framing of this chapter — fewer switches is better — has the same counterexample as syscall optimisation. A thread that holds a lock or a kernel resource for too long blocks every other thread waiting on it, even if it itself is making fast progress. Releasing the lock — which involves a context switch as the next holder picks up — is correct even though it costs the warmup tax for the next thread. The right framing is: minimise switches that don't enable other progress; do not minimise switches at the cost of holding locks longer than needed. This is the same observation the syscall chapter made about madvise(MADV_FREE): local optimisations can be system-level pessimisations.

A related case is forced periodic preemption for fairness. A long-running batch job (e.g. a daily ETL pipeline) that holds a CPU for 200 ms without yielding is starving every other thread on that core during that interval. CFS's preemption ensures fairness at the cost of switch tax, and disabling that preemption (via SCHED_BATCH or SCHED_IDLE policies) trades one workload's throughput for another's responsiveness. The right policy depends on which thread's latency you care about. For services that mix latency-critical request handlers with background batch work, running the batch work in a separate cgroup with a low CPU share and SCHED_BATCH policy is the standard pattern.
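
Python exposes the policy change directly through os.sched_setscheduler; a sketch that demotes the calling process to SCHED_BATCH (pass a child's PID instead to demote an ETL worker), to be paired with a low-share cgroup as described above:

# batch_policy.py — run background/ETL work under SCHED_BATCH so CFS stops
# treating it as latency-sensitive; pair with a low cpu.weight / cpu.shares
# cgroup for the full pattern described above.
import os

def demote_to_batch(pid: int = 0) -> None:
    """pid=0 means the calling process; SCHED_BATCH requires priority 0."""
    os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
    print(f"policy now {os.sched_getscheduler(pid)} (SCHED_BATCH={os.SCHED_BATCH})")

if __name__ == "__main__":
    demote_to_batch()
    # ... long-running batch work here ...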

To reproduce the measurements in this chapter on your laptop:

sudo apt install linux-tools-common linux-tools-generic sysstat bpftrace
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Compare no-switch vs forced-switch with cache and TLB counters
python3 context_switch_warmup.py

# Per-process voluntary vs involuntary switch rate, 1-second buckets
pidstat -w 1

# Per-thread scheduling delays during a 5-second window
sudo perf sched record -- sleep 5 && sudo perf sched latency

# Which threads switch to which (the bipartite switch graph)
sudo bpftrace -e 'tracepoint:sched:sched_switch { @[args->prev_comm, args->next_comm] = count(); } interval:s:5 { print(@); clear(@); }'

You should see the no-switch run at IPC ~2.2 and the switching run at IPC ~1.5 with cache-miss rates 30–80× higher. The pidstat -w output gives you the cswch/s (voluntary) and nvcswch/s (involuntary) columns — anything over 5,000/sec for a single process is worth investigating with perf sched and the bpftrace switch graph.

A useful diagnostic exercise after reading this: pick a healthy production service on your fleet, run pidstat -w 1 for 60 seconds during peak, and rank the processes by nvcswch/s. The top entry is almost never the one your team thinks is the most expensive — and the gap between expectation and measurement is the gap this chapter exists to close. Most teams discover at least one service that is paying 20–35% of its CPU to switch tax without anyone having noticed.

A second exercise: take the same production service, run perf stat -a -e cycles,instructions sleep 30 during peak and during a quiet window, and compute IPC for each. If peak IPC is meaningfully lower than quiet IPC (say 30% lower), and the application's hot path hasn't algorithmically changed, the difference is almost certainly the switch tax — every other plausible explanation (memory bandwidth saturation, GC pressure, lock contention) shows up as a corresponding signal in another counter (offcore_response, GC logs, perf lock). The IPC delta gives you a number to put on the cost: if IPC dropped from 2.0 to 1.4, the service is wasting 30% of its cycle budget on switch warmup, and that 30% is what you have to recover by reducing switch rate. Putting this number on a Grafana panel (peak-IPC / quiet-IPC ratio) is one of the fastest-payback observability investments a platform team can make; it surfaces the regime this chapter exists to teach long before customer-visible latency degrades enough for anyone to file a ticket.

Where this leads next

This chapter is the second in Part 12 — the costs your code does not contain but does pay. The first chapter (/wiki/syscall-overhead) decomposed the boundary cost of syscall instructions; this one decomposed the boundary cost of switching whose code is running. The two costs share an underlying mechanism (privilege transition, register save, possible CR3 swap) and a hidden tax (cache and TLB warmup paid by application instructions) — but they show up under different symbols and respond to different fixes.

A senior engineer reading the next four chapters in order builds a complete map of "why is the kernel hot when my code looks fine?" The map's pieces are syscall overhead (Part 12 chapter 1), context switch tax (this chapter), TLB and page-fault cost (chapters 4–5), and cgroup throttling (chapter 6). Each piece has a distinct symbol footprint, a distinct diagnostic command, and a distinct fix catalogue. By the end of Part 12 the reader can look at a production flamegraph dominated by any of entry_SYSCALL_64, __schedule, __handle_mm_fault, or throttle_cfs_rq and name the application pattern that produced it.

A practical follow-up worth committing to muscle memory: when you next encounter a "service is slow but CPU looks fine" mystery, the diagnostic order is pidstat -w 1 (rate per process), perf stat -a -e cycles,instructions sleep 30 (IPC), bpftrace tracepoint:sched:sched_switch (which threads are switching to which), then the flamegraph last. Most engineers reach for the flamegraph first because it is the most familiar tool, and the flamegraph silently tells them everything is fine — the same functions executing in the same proportions. The switch-tax regime is one of the few production failure modes where the flamegraph is actively misleading rather than merely incomplete. Building the instinct to look at counters and rates before frames is the difference between a 30-minute investigation and a 3-day one.

A closing framing for the chapter: every wakeup your application does is implicitly making a decision about switch cost. The decision is invisible in code review — reviewers focus on correctness, not on whether await asyncio.sleep(0) produces a kernel-level switch under load. Building the habit of mentally tagging every thread-creation, every blocking call, every sched_yield, every condition.notify with its switch cost — and the warmup cost the resumed thread will pay — turns code review into performance review at no additional cost. The senior engineer who reads a diff and says "this connection-pool change will produce 50,000 extra switches/sec at peak" is doing the same diagnostic work as the engineer who reads the postmortem after the incident, but earlier and cheaper.

References