Scheduler latency
Aditi is on call for the order-acknowledgement service at Zerodha — the API that confirms a punched order before the matching engine has fully booked it. The service runs on a 32-vCPU c6i.8xlarge at 35% CPU; the dashboard is green. At 09:15:02 IST every weekday, when the equity market opens and order rate jumps from 200/sec to 38,000/sec in three seconds, p99 spikes from 1.8 ms to 14 ms for ninety seconds, then settles back to 2.2 ms. The flamegraph is identical in both regimes — same handler functions, same proportions. pidstat -w shows context switch rate climbing but well within sane bounds. Off-CPU profiling (offcputime from BCC) reveals it: the worker threads are spending up to 11 ms in schedule() between becoming runnable and actually executing — even though top shows multiple cores hovering near idle. The kernel knows the threads are runnable. It chose to make them wait. This chapter is about that choice — the latency the scheduler builds in, on purpose, and what to do when "on purpose" isn't acceptable.
Scheduler latency is the time between a thread becoming runnable (woken from sleep, returned from a syscall, signalled by a timer) and the kernel actually putting it onto a CPU. On a fully-loaded box this number is bounded by the runqueue length and timeslice, easy to reason about. On a partially-loaded box it is dominated by wakeup-affinity heuristics that prefer to wake a thread on a "warm" CPU even if that CPU is busy and other CPUs are idle — producing 5–15 ms of waiting time on cores that show up as idle in your monitoring. The fix is rarely "add more cores"; it is to either expose your tail-latency requirement to the kernel (SCHED_DEADLINE, SCHED_FIFO, nice-down the contenders) or to disable the affinity heuristic that produced the wait (/proc/sys/kernel/sched_migration_cost_ns, cgroup CPU placement).
What scheduler latency actually is
A thread becomes runnable at well-defined moments: a timer fires and wake_up_process() runs in interrupt context; a read() returns and the thread re-enters userspace; a mutex_unlock calls wake_up_q on a waiter; an epoll_wait receives an event; a network packet arrives and a kworker hands off to a userspace consumer. At that instant the kernel marks the thread TASK_RUNNING and adds it to some CPU's runqueue (one runqueue per CPU, structured as a red-black tree keyed on vruntime for SCHED_OTHER threads). Scheduler latency is the wall-clock time from that instant to the moment __switch_to_asm puts the thread's instruction pointer back on a physical core. Everything in between is the kernel deciding what to do.
What sits in that interval is a sequence of decisions, each individually small and each capable of contributing milliseconds in pathological cases. Step one — pick the destination CPU. The waker calls select_task_rq(), which under CFS calls select_task_rq_fair(). This function consults the wakeup-affinity heuristic: the cache topology (does the wakee share L2/L3 with the waker?), the load on candidate CPUs, the value of wake_affine_idle(), and the sched_migration_cost_ns parameter (default 500 µs — meaning "do not migrate a thread that ran in the last 500 µs"). The output is one CPU. Step two — enqueue the thread on that CPU's runqueue. This is enqueue_task_fair(), which inserts into the rb-tree by vruntime in O(log N). Step three — possibly preempt the currently-running thread on that CPU. This happens only if the new thread's vruntime is enough below the current thread's vruntime to justify preemption (check_preempt_wakeup_fair()). If the current thread does not get preempted, the new thread waits until the current thread either blocks, hits the end of its timeslice (sched_slice for that runqueue), or gets preempted by another wakeup. Step four — when the CPU eventually runs the scheduler loop (schedule()), it picks the leftmost (lowest-vruntime) thread in the rb-tree and switches to it.
The latency the user perceives is the sum of those four steps' worst cases. Step one is fast — sub-microsecond. Step two is fast — also sub-microsecond. Step three is the dangerous one because it can decide not to preempt, leaving the new thread waiting until step four runs naturally. And step four runs naturally only at the next scheduling boundary, which on a CPU-bound currently-running thread is up to one full sched_slice — typically 3–24 ms depending on how many threads are on that runqueue. Why CFS sometimes refuses to preempt a freshly-woken thread: preemption itself costs ~50 µs of cache warmup (see /wiki/context-switch-cost); the scheduler's heuristic is that this cost is worth paying only if the woken thread is meaningfully more deserving of CPU than the running one. "Meaningfully more deserving" is operationalised as vruntime deficit greater than wakeup_granularity_ns (default 1 ms). A freshly-woken thread that recently ran will have vruntime close to the current thread's, fail the test, and wait — even though no other CPU is busy. The heuristic optimises for throughput on average; the cost is variance in the tail.
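To make the slice arithmetic concrete, here is a small Python sketch that estimates the per-thread slice and the worst-case wakeup wait at a given runqueue depth. It assumes the CFS tunables are readable from /proc/sys/kernel/ or /sys/kernel/debug/sched/ (which of the two exists varies by kernel version) and ignores the CPU-count scaling CFS applies, so treat the numbers as approximations.
# cfs_wait_estimate.py — a rough sketch of the slice arithmetic described above.
# Assumes the tunables live under /proc/sys/kernel/ (older kernels) or
# /sys/kernel/debug/sched/ (5.13+); falls back to the documented defaults.
from pathlib import Path

def read_ns(name: str, default: int) -> int:
    for p in (Path("/proc/sys/kernel") / f"sched_{name}",
              Path("/sys/kernel/debug/sched") / name):
        try:
            return int(p.read_text())
        except (OSError, ValueError):
            continue
    return default

latency_ns = read_ns("latency_ns", 6_000_000)                  # one fairness round
min_gran_ns = read_ns("min_granularity_ns", 750_000)           # floor on one slice
wakeup_gran_ns = read_ns("wakeup_granularity_ns", 1_000_000)   # preemption threshold

for n_runnable in (1, 2, 4, 8, 16):
    slice_ns = max(latency_ns // n_runnable, min_gran_ns)
    worst_wait_ns = slice_ns * (n_runnable - 1)   # wakee queued behind everyone else
    print(f"{n_runnable:2d} runnable/CPU: slice ≈ {slice_ns / 1e6:.2f} ms, "
          f"worst-case wait ≈ {worst_wait_ns / 1e6:.2f} ms "
          f"(preempt only if vruntime deficit > {wakeup_gran_ns / 1e6:.2f} ms)")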
The framing worth carrying: scheduler latency is not the cost of running the scheduler. It is the cost of not running the scheduler — the time the kernel spends honouring the currently-running thread's timeslice instead of immediately yielding to a newly-runnable one. Most of the time this is the right call, because preempting frequently destroys cache locality. The cases where it is the wrong call — and the kernel cannot tell at decision time which case you are in — are exactly the cases that produce the p99 spikes on otherwise-idle boxes that drive the on-call pages.
Measuring scheduler latency with one Python script
The decomposition above is theory; the only way to internalise it is to measure scheduler latency on your own machine and watch it climb as runqueue contention grows. The script below uses BCC's runqlat (a Python tool from the bcc-tools package, written in Python on top of eBPF) to histogram the time threads spend in the runqueue waiting for a CPU, and pairs it with a synthetic load generator that drives runqueue contention up in a controlled way.
# runqlat_demo.py — drive runqueue contention and measure scheduler latency
# under increasing load. Pairs a synthetic CPU-burner pool with BCC's runqlat.
import multiprocessing, os, subprocess, time

DURATION_S = 10       # measurement window per regime
BURN_NS = 50_000      # each burst burns ~50 µs of CPU then yields

def burner(stop_evt):
    """A worker that burns CPU in 50 µs bursts with a yield in between.
    The yield forces a wakeup-and-reschedule cycle that runqlat catches."""
    end_target = time.perf_counter_ns()
    while not stop_evt.is_set():
        end_target += BURN_NS
        while time.perf_counter_ns() < end_target:
            pass                  # busy-burn for 50 µs
        os.sched_yield()          # voluntary yield triggers a reschedule

def run_regime(label: str, n_burners: int) -> None:
    print(f"\n=== {label}: {n_burners} burner threads on {os.cpu_count()} CPUs ===")
    stop_evt = multiprocessing.Event()
    workers = [multiprocessing.Process(target=burner, args=(stop_evt,))
               for _ in range(n_burners)]
    for w in workers:
        w.start()
    time.sleep(1.0)               # let the load stabilise
    # runqlat <interval> <count>: one histogram covering DURATION_S seconds,
    # reported in microseconds (the tool's default unit)
    proc = subprocess.run(
        ["sudo", "/usr/share/bcc/tools/runqlat", str(DURATION_S), "1"],
        capture_output=True, text=True, timeout=DURATION_S + 15)
    print(proc.stdout.strip())
    stop_evt.set()
    for w in workers:
        w.join()

if __name__ == "__main__":
    n_cpu = os.cpu_count()
    run_regime("LIGHT — half the CPUs", n_cpu // 2)
    run_regime("MATCHED — one per CPU", n_cpu)
    run_regime("OVERSUBSCRIBED — 4× CPUs", n_cpu * 4)
Sample run on a c6i.4xlarge (16 vCPU, kernel 6.5, BCC 0.29):
=== LIGHT — half the CPUs: 8 burner threads on 16 CPUs ===
usecs : count distribution
0 -> 1 : 412 |******** |
2 -> 3 : 1,920 |****************************************|
4 -> 7 : 824 |***************** |
8 -> 15 : 198 |**** |
16 -> 31 : 41 | |
32 -> 63 : 4 | |
=== MATCHED — one per CPU: 16 burner threads on 16 CPUs ===
usecs : count distribution
0 -> 1 : 124 |* |
2 -> 3 : 1,202 |*********** |
4 -> 7 : 4,310 |****************************************|
8 -> 15 : 2,802 |************************** |
16 -> 31 : 920 |******** |
32 -> 63 : 188 |* |
64 -> 127 : 22 | |
=== OVERSUBSCRIBED — 4× CPUs: 64 burner threads on 16 CPUs ===
usecs : count distribution
0 -> 1 : 84 | |
2 -> 3 : 612 |** |
4 -> 7 : 1,805 |******* |
8 -> 15 : 4,420 |****************** |
16 -> 31 : 9,840 |****************************************|
32 -> 63 : 6,302 |************************* |
64 -> 127 : 2,108 |******** |
128 -> 255 : 484 |* |
256 -> 511 : 120 | |
512 -> 1023 : 28 | |
1024 -> 2047 : 6 | |
The shape changes by orders of magnitude as the runqueue length grows. In the LIGHT regime (8 burners on 16 CPUs), most wakeups place onto an idle CPU and run within 2–7 µs — the cost of select_task_rq_fair, the runqueue insert, and __switch_to_asm. The tail at 32–63 µs is the wakeup-affinity heuristic occasionally placing a thread on a busy sibling CPU because of cache-warmth scoring. In the MATCHED regime, every CPU has one runnable thread; new wakeups have to wait for the running thread to either block or hit its timeslice. The mode shifts to 4–7 µs (the fast path), but a substantial population now sits in 8–63 µs — those are wakeups that happened mid-timeslice and waited for the running thread to yield. In the OVERSUBSCRIBED regime, the modal latency moves to 16–31 µs and there are real submillisecond and millisecond outliers — these are wakeups that hit a CPU whose runqueue had several threads ahead of them, each consuming part of their timeslice in turn. Why the tail extends past 1 ms even though the timeslice should be at most a few ms: with 4 runnable threads per CPU and sched_latency_ns = 6 ms, each thread gets a ~1.5 ms slice. A wakee at the back of the rb-tree waits for three slices = 4.5 ms. Add some scheduler-tick alignment and an occasional rebalance, and 1–2 ms latencies are perfectly normal in the oversubscribed regime — yet top would show every CPU pegged at 100% with no indication that the wait was avoidable.
The 0–1 µs bucket in every regime represents wakeups on a freshly-idle CPU where select_task_rq found an actually-empty runqueue and __switch_to_asm ran immediately. The fact that this bucket is small even in the LIGHT regime (8 of 16 CPUs idle) is the wakeup-affinity heuristic at work — even when half the CPUs are idle, the kernel often prefers to wake the thread on a "warm" sibling CPU rather than migrate it to a cold one. This is exactly the behaviour that produced Aditi's 11 ms p99 at Zerodha: the affinity heuristic placed wakers and wakees on the same CPUs even when other CPUs were idle, building up runqueue queues on the warm CPUs while cold CPUs sat unused. The diagnostic shape is unmistakable in runqlat output — a long tail in the histogram that grows even when total CPU utilisation is well below 100%.
Three implementation notes. First, the script uses multiprocessing.Process rather than threads to avoid Python's GIL serialising the burners; each is a separate process with its own runqueue presence. Second, os.sched_yield() is the cheapest way to force a reschedule from userspace — in real production the wakeups come from epoll_wait, read, recv, and timer expirations, but they all funnel through try_to_wake_up() and produce the same runqlat shape. Third, runqlat itself is a short BCC Python script (/usr/share/bcc/tools/runqlat) that instruments the scheduler's wakeup and switch paths (kprobes on ttwu_do_wakeup, wake_up_new_task, and finish_task_switch, or the equivalent raw tracepoints on kernels that support them) to record per-thread wait times into a BPF histogram map. It is the lowest-overhead way to measure scheduler latency in production — sub-1% CPU overhead on the box being measured, and the histogram is exact (not sampled).
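To see how little machinery that takes, here is a stripped-down sketch in the same BCC style. It is not the real runqlat (it ignores newly-created tasks and threads that were preempted while still runnable); it simply timestamps each sched_wakeup and histograms the gap to the following sched_switch that puts the thread on a CPU.
# mini_runqlat.py — an illustrative reduction of the runqlat idea. Requires root
# and the bcc Python bindings (python3-bpfcc).
from bcc import BPF
import time

bpf_text = r"""
BPF_HASH(woke_at, u32, u64);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(sched, sched_wakeup) {
    u32 pid = args->pid;
    u64 ts = bpf_ktime_get_ns();
    woke_at.update(&pid, &ts);
    return 0;
}

TRACEPOINT_PROBE(sched, sched_switch) {
    u32 pid = args->next_pid;
    u64 *tsp = woke_at.lookup(&pid);
    if (tsp) {
        u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
        dist.increment(bpf_log2l(delta_us));
        woke_at.delete(&pid);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Tracing wakeup-to-run latency for 10 s... Ctrl-C to stop early.")
try:
    time.sleep(10)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")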
A useful corollary worth reproducing on your own laptop: run the script with taskset -c 0 python3 runqlat_demo.py to pin everything to CPU 0. The MATCHED and OVERSUBSCRIBED regimes now produce even worse tails, sometimes pushing into the 10s of milliseconds for the 64-burner case, because every wakeup must contend for the same single CPU's runqueue. This is the worst-case shape and it is exactly the regime that shows up when a Kubernetes pod with cpu: 100m (100 millicores, i.e. 1/10th of a CPU) is throttled — the kernel CFS bandwidth controller restricts the cgroup to 10% of one core, effectively pinning all its threads to a 10ms-out-of-100ms slice and producing this same runqueue contention.
What CFS, FIFO, and DEADLINE actually trade
The SCHED_OTHER policy (the default for every userspace thread you create) is implemented by the Completely Fair Scheduler. CFS's design goal is fairness over time — every runnable thread should get an equal share of CPU on a long enough horizon, weighted by nice value. The mechanism is vruntime: each thread accumulates virtual runtime as it executes, weighted inversely by its nice-adjusted weight; the scheduler always picks the lowest-vruntime thread. The implication for scheduler latency is that CFS makes no latency guarantees — a thread that becomes runnable might wait an arbitrary amount of time if other threads have lower vruntime. The worst-case wait under steady-state load with N runnable threads per core is (N-1)/N × sched_latency_ns, which at N=4 and sched_latency_ns = 6 ms is 4.5 ms — exactly the OVERSUBSCRIBED-regime tail above.
The SCHED_FIFO policy abandons fairness entirely. A SCHED_FIFO thread runs until it voluntarily blocks or until a higher-priority SCHED_FIFO thread becomes runnable. Within the same priority, threads run in FIFO order — first to become runnable, first to execute. The latency a SCHED_FIFO thread sees is bounded by the time the highest-priority threads spend running and by the time the kernel takes to actually switch to it (~3 µs direct + warmup). Crucially, a SCHED_FIFO thread preempts any SCHED_OTHER thread immediately on wakeup, regardless of the wakeup-affinity heuristic. This is the right tool for hard-real-time-ish workloads: HFT order matchers, audio buffer fillers, packet processors. The cost is that SCHED_FIFO threads can starve the rest of the system if they fail to block — a SCHED_FIFO thread in an infinite loop will hang the box, which is why creating one requires CAP_SYS_NICE and an RLIMIT_RTPRIO allowance.
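For reference, switching a process to SCHED_FIFO needs no C: Python's os module exposes sched_setscheduler directly. A minimal sketch (requires CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, so run it as root to try it):
import os

# Promote the calling process to SCHED_FIFO at priority 50 (1 = lowest, 99 = highest).
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))   # 0 = this process
assert os.sched_getscheduler(0) == os.SCHED_FIFO
print("running as SCHED_FIFO prio", os.sched_getparam(0).sched_priority)
chrt -f 50 <command> does the same thing from the shell, which is what the exercise at the end of this chapter uses.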
The SCHED_DEADLINE policy (added in kernel 3.14, 2014) is the CFS-vs-FIFO compromise. A SCHED_DEADLINE thread declares a triple (runtime, deadline, period): it needs runtime nanoseconds of CPU within every period nanoseconds, and the kernel guarantees it will get those before deadline ns into the period. Internally the kernel uses an Earliest-Deadline-First scheduler over all SCHED_DEADLINE threads; admission control rejects new declarations that would violate existing guarantees. This is the right tool for soft-real-time workloads with known periodicity: video frame rendering (16.6 ms period), control loops (1 ms period), batch tick-distribution. The cost is that you have to characterise your workload precisely (sched_setattr with a sched_attr struct) and you cannot oversubscribe.
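There is no libc or os-module wrapper for sched_setattr, so declaring a deadline reservation from Python means issuing the raw syscall. The sketch below assumes x86_64 (where __NR_sched_setattr is 314) and the struct layout documented in sched(7); it requests 2 ms of CPU within every 16.6 ms period and raises if admission control rejects the reservation. Run as root.
import ctypes, os

SCHED_DEADLINE = 6
NR_sched_setattr = 314          # x86_64 only; look up the number for other arches

class sched_attr(ctypes.Structure):
    _fields_ = [("size", ctypes.c_uint32),
                ("sched_policy", ctypes.c_uint32),
                ("sched_flags", ctypes.c_uint64),
                ("sched_nice", ctypes.c_int32),
                ("sched_priority", ctypes.c_uint32),
                ("sched_runtime", ctypes.c_uint64),
                ("sched_deadline", ctypes.c_uint64),
                ("sched_period", ctypes.c_uint64)]

attr = sched_attr(size=ctypes.sizeof(sched_attr),
                  sched_policy=SCHED_DEADLINE,
                  sched_runtime=2_000_000,       # 2 ms of CPU ...
                  sched_deadline=16_600_000,     # ... delivered within 16.6 ms ...
                  sched_period=16_600_000)       # ... out of every 16.6 ms period

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.syscall(NR_sched_setattr, 0, ctypes.byref(attr), 0) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))          # EBUSY = admission control refused
print("admitted: this thread now has a 2 ms / 16.6 ms deadline reservation")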
For most production services the right policy is SCHED_OTHER with thoughtful application-level architecture. Switching policies is a large hammer that introduces operational complexity (capability requirements, monitoring blind spots, starvation risks) and rarely delivers more than a 2–3× p99 improvement; the same improvement is usually available by reducing thread count, removing contention, or adjusting sched_migration_cost_ns. The exception is the narrow class of services where a missed deadline is itself the failure mode — packet drops in a softirq processor, audio gaps in a media pipeline, missed market opens in a trading engine. Those are SCHED_FIFO (with strict CPU isolation) or SCHED_DEADLINE (with workload characterisation) territory.
A pattern Indian production teams rediscover every couple of years: CFS tunables move the curve, they do not reshape it. sched_latency_ns (default 6 ms — the target latency for one full round of every runnable thread), sched_min_granularity_ns (default 0.75 ms — the floor on per-thread timeslice once latency / N would fall below it), sched_wakeup_granularity_ns (default 1 ms — the vruntime deficit required for wakeup preemption), and sched_migration_cost_ns (default 0.5 ms — the post-execution window during which CFS won't migrate a thread to balance load) are all knobs that adjust the means and tails of scheduler latency by 20–40%. They do not turn CFS into FIFO. If your tail-latency requirement is 1 ms and CFS is giving you 11 ms p99, no combination of these tunables will get you there — you need a different policy, or a different application architecture, or both.
Three production stories where scheduler latency was the bottleneck
The pattern across Indian production has consistent fingerprints. Three cases worth memorising.
Zerodha order-ack: the wakeup-affinity story. Aditi's case from the lead. The order-ack service ran 32 worker threads on 32 vCPUs, each processing one order at a time. At market open the 38,000 orders/sec arriving in a burst woke up workers via epoll_wait. CFS's select_task_rq_fair placed each wakee on the same CPU as the waker (an Nginx ingress thread) for cache-affinity reasons. The result: every order woke a worker on one of 4 ingress CPUs, building 8-deep runqueues there while 28 worker CPUs sat idle. runqlat showed a clean bimodal distribution — most wakes at 3-5 µs, a tail at 8-11 ms. The fix was to disable wakeup-affinity for this workload by setting /proc/sys/kernel/sched_migration_cost_ns from 500,000 (500 µs) to 5,000 (5 µs), making CFS willing to migrate threads almost immediately after they ran. p99 dropped from 14 ms to 2.4 ms with no other change.
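The mitigation itself is a one-line write to the tunable. A sketch, assuming the pre-5.13 sysctl path with the 5.13+ debugfs path as a fallback (run as root; write 500000 back to revert):
# lower sched_migration_cost_ns so CFS will migrate a recently-run thread to an
# idle CPU almost immediately, instead of holding it near its "warm" CPU.
CANDIDATES = ["/proc/sys/kernel/sched_migration_cost_ns",
              "/sys/kernel/debug/sched/migration_cost_ns"]
for path in CANDIDATES:
    try:
        with open(path) as f:
            print(f"{path}: {f.read().strip()} ns before")
        with open(path, "w") as f:
            f.write("5000")          # 5 µs, the value used in the fix above
        print(f"{path}: now 5000 ns")
        break
    except FileNotFoundError:
        continue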
The deeper lesson is that wakeup-affinity is a heuristic optimised for continuous request workloads where the wakee's working set is hot in the waker's cache. For burst workloads where wakees run independently of wakers (every order is a separate transaction with its own working set), the heuristic does the wrong thing — it sticks wakees on busy CPUs to preserve cache that isn't there. The diagnostic instinct that catches it: a long tail in runqlat that does not scale with CPU utilisation is wakeup-affinity, not capacity. Tuning sched_migration_cost_ns is the cheapest fix; switching the application to a worker-pool model with a lock-free queue is the structural fix.
Hotstar HLS chunker: the runqueue starvation story. The HLS chunker that segments live IPL streams ran 200 encoder threads on a 32-vCPU instance during the IPL final at 25M concurrent viewers. CPU was 78%, p99 segment-encode latency was 340 ms — well within the 1.5 s SLO. But p99.9 was 4.2 seconds, and complaints were accumulating about specific streams stalling for full segments. runqlat -P (per-PID) revealed the cause: a small subset of encoder threads were spending 800 ms in the runqueue waiting for CPU, even though average per-CPU load was modest. The threads that waited longest were the threads with the highest vruntime — i.e. the threads that had used the most CPU recently — because CFS deprioritised them in favour of less-active threads. The intuition was correct (fairness) but the operational outcome was wrong (slow streams stayed slow because they got less CPU). The fix was to set sched_latency_ns from 6 ms to 24 ms, which gave each thread a longer continuous slice and reduced the per-thread wakeup-to-execution loop count by 4×, allowing the slow streams to catch up. p99.9 dropped from 4.2 s to 1.1 s.
A useful generalisation: CFS's fairness model penalises threads that are heavy users of CPU, which is exactly the wrong policy for workloads where CPU usage is correlated with the thing the user cares about. Video encoding, ML inference, query execution — all of these have the property that "this thread used a lot of CPU because the work was hard, not because it was greedy". Adjusting sched_latency_ns upward biases CFS toward larger continuous slices, partially undoing this penalty. The trade-off is a worse experience for short-lived threads (interactive shells, signal handlers) which now wait longer to get a slice. Most server workloads do not have many of these and the trade-off is favourable.
Razorpay payment-callback: the kthread interference story. A team running a Java service for UPI payment callbacks observed unexplained 8 ms p99 spikes every few seconds, perfectly correlated across all instances. runqlat showed the spikes were preceded by kworker threads (kernel work-queue threads) running on the application CPUs for 2-3 ms at a time. The kworkers were processing async network completion handlers from the NIC, woken by hardware interrupts. CFS had decided to schedule the kworkers — which are SCHED_OTHER like everything else — at the same priority as the application threads, and was preempting application work to run them. The fix was to renice the kworkers down (for pid in $(pgrep kworker); do renice -n 10 -p "$pid"; done) and to renice the application threads to -10 (giving them higher CFS weight). The 8 ms spikes disappeared because CFS now preferred application threads over kworkers when both were runnable.
The pattern across all three: the application's behaviour was unchanged across the diagnostic period, the box was not CPU-saturated, and the bug was in the scheduler's idea of "fair". The right diagnostic ladder is runqlat (overall scheduler-latency histogram), then runqlat -P (per-PID), then bpftrace -e 'tracepoint:sched:sched_wakeup { @ts[args->pid] = nsecs; } tracepoint:sched:sched_switch /@ts[args->next_pid]/ { @wait_us[args->next_comm] = hist((nsecs - @ts[args->next_pid]) / 1000); delete(@ts[args->next_pid]); }' (per-thread wakeup-to-run distribution, in microseconds), and only at the end the flamegraph and pidstat -w. The flamegraph cannot show off-CPU time; it cannot show a thread that wanted to run but didn't get a CPU. That is the entire space runqlat exists to make visible.
A subtler fourth pattern worth flagging because it generalises: the CRED rewards-engine kernel-tick story. The rewards engine ran a tight CFS-scheduled batch worker on isolated cores with isolcpus=8-15. During reward-distribution windows, p99 latency for the batch tasks degraded from 12 µs to 280 µs every 10 ms. The cause was the kernel scheduling tick — SCHED_OTHER threads on isolated CPUs still receive the periodic timer interrupt (1000 Hz on a CONFIG_HZ=1000 kernel), and the tick handler does enough bookkeeping (update_rq_clock, task_tick_fair, cpu_load_update) to consume ~80 µs each time it fires. The fix was to add nohz_full=8-15 to the kernel command line, which suppresses the tick on isolated CPUs whenever only one thread is runnable there. p99 dropped to 18 µs. The lesson: scheduler latency is not just the wait-to-run time; it is also the tick-induced jitter that affects already-running threads. runqlat catches the first; bpftrace -e 'tracepoint:irq:irq_handler_entry' catches jitter from device interrupts, and the timer tracepoints (or the osnoise tracer on newer kernels) catch the tick itself. A complete picture requires both.
A fifth pattern that is increasingly common with managed-Kubernetes deployments: the PhonePe UPI-callback CPU-manager story. The UPI-callback service ran on a 96-vCPU node with static CPU manager policy, expecting that pods with Guaranteed QoS and integer CPU requests would get exclusive cores. They did — but the kubelet itself, the CNI plugin daemonsets, and the node-exporter all ran without affinity restrictions and routinely scheduled onto the "exclusive" cores. The UPI-callback p99 had a 4 ms periodic spike every 30 seconds, perfectly aligned to Prometheus scrape intervals. The fix was to set kubeReserved and systemReserved with explicit CPU lists (--reserved-cpus=0,1) and to verify with taskset -pc <kubelet-pid> that the system services actually honoured the partition. p99 dropped from 11 ms to 1.8 ms. The pattern recurs because Kubernetes documentation describes CPU manager as if it gives "exclusive" cores, but exclusive only applies to other pods — the kubelet and node-level daemons share unless explicitly partitioned. The diagnostic instinct is to use runqlat -P -p $(pgrep upi-callback) to confirm the wait time, then pidstat -p ALL 1 during a spike to identify which process is co-located. This story makes the broader point that in managed environments, scheduler latency depends on configuration that is two abstraction layers removed from the application — and the only reliable defence is end-to-end measurement at the application's actual runqlat distribution.
Common confusions
- "Scheduler latency is the same as context switch cost." No. Context switch cost is what the kernel pays to actually perform the switch (~3 µs direct + ~50 µs warmup, see
/wiki/context-switch-cost). Scheduler latency is the wait-time before the switch happens — the time the wakee sits in the runqueue. The two add together to give wakeup-to-progress latency; on a contended box the scheduler-latency component is 10–100× larger. - "If CPU is below 100%, scheduler latency is zero." False. Wakeup-affinity heuristics can produce 1-15 ms scheduler latency on boxes with idle CPUs because the kernel chooses to enqueue wakees on busy "warm" CPUs rather than migrate them to cold ones. This is the most common cause of "p99 spikes on idle boxes" in Indian production.
- "
SCHED_FIFOmakes my service faster." It changes the latency distribution, not the average throughput. ASCHED_FIFOthread still pays the same context-switch cost, the same cache warmup, the same memory-bandwidth bill. What changes is that it preemptsSCHED_OTHERthreads immediately on wakeup, removing the runqueue wait. If runqueue wait was not your bottleneck,SCHED_FIFOwill not help — and may hurt by starving cooperating processes. - "
runqlatshows my service's request latency." No.runqlatshows scheduler latency — the wait between becoming runnable and getting a CPU. A request might involve dozens of wakeups (each I/O completion is one) and the request-level p99 is some convolution of all of them, not the maximum of one. Useoffcputimeor per-request tracing for request-level latency; userunqlatfor the scheduler component of it. - "More CPUs always reduces scheduler latency." Only when you are runqueue-bound. If you are wakeup-affinity-bound (the Zerodha case), more CPUs make no difference because the kernel still places wakees on the same warm CPUs. If you are tick-jitter-bound (the CRED case), more CPUs make no difference because each CPU independently receives the tick. Adding capacity is the right answer to capacity problems and the wrong answer to scheduler-policy problems.
- "Adjusting
sched_latency_nsis a low-risk tuning knob." It is a load-bearing knob for fairness; reducing it makes short-lived threads (interactive shells, monitoring agents) responsive at the cost of throughput-oriented threads. Increasing it does the reverse. Most distributions ship6 msbecause that is a good compromise for general-purpose servers. Touching it is appropriate when you have a specific workload analysis showing that the default is wrong; it is inappropriate as a "let's see if this helps" experiment.
Going deeper
CFS internals — vruntime, the rb-tree, and the wakeup decision
CFS represents each CPU's runqueue as a red-black tree (struct cfs_rq) keyed on vruntime. Every runnable thread has an entry; the leftmost entry is always the thread with the lowest vruntime, i.e. the next thread to run. When the current thread's slice ends (or it blocks, or it is preempted), the scheduler picks the leftmost entry in O(log N). Insertion and removal are also O(log N). The tree is per-CPU — there is no global tree — which is why scheduler decisions are mostly local and why migration between CPUs is comparatively expensive (you have to remove from one tree, recompute vruntime relative to the destination tree's min_vruntime, and insert into another).
vruntime accumulates per-thread as delta_exec * NICE_0_LOAD / weight, where weight comes from a static table indexed by nice. A nice 0 thread accumulates vruntime at the same rate as wall time; a nice -10 thread (higher priority) accumulates more slowly (so it stays at the leftmost position longer); a nice 10 thread accumulates faster. Newly-created threads start at min_vruntime + sched_latency_ns / 2 to avoid starving existing threads while still giving the new thread a fair shot.
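A worked example of that accumulation rule, using a handful of entries from the kernel's sched_prio_to_weight table:
# vruntime advances by delta_exec * NICE_0_LOAD / weight; the weights below are
# the kernel's sched_prio_to_weight entries for nice -20, -10, 0, +10 and +19.
NICE_0_LOAD = 1024
weights = {-20: 88761, -10: 9548, 0: 1024, 10: 110, 19: 15}

for nice, weight in weights.items():
    vruntime_ms = 10.0 * NICE_0_LOAD / weight   # 10 ms of wall-clock execution
    print(f"nice {nice:+3d}: 10 ms of CPU advances vruntime by {vruntime_ms:8.2f} ms")
# nice -10 ages ~9× slower than nice 0, so it keeps reclaiming the leftmost slot;
# nice +10 ages ~9× faster and quickly falls to the back of the tree.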
The wakeup decision (check_preempt_wakeup_fair) compares the wakee's vruntime to the running thread's vruntime. If the wakee's is lower by at least sched_wakeup_granularity_ns (default 1 ms), the running thread is preempted. Otherwise the wakee waits until the running thread's slice ends naturally. Why this matters for the wakeup-affinity story: when CFS places a wakee on a busy CPU, the wakee's vruntime is set to max(its_vruntime, target_rq->min_vruntime - sched_latency_ns / 2) to prevent it from starving the running thread. That lower bound means the wakee's vruntime is usually within 3 ms of the running thread's, failing the 1 ms wakeup-preemption test, and the wakee waits up to one full timeslice. The kernel chose to put the wakee on a busy CPU, then chose not to preempt — both decisions are individually defensible, but together they produce the 5-15 ms scheduler latency the user actually experiences.
Off-CPU profiling — making scheduler latency visible per-stack
The traditional CPU flamegraph is an on-CPU flamegraph — it samples threads that are actually running and shows where their CPU time goes. It cannot show a thread that wanted to run but didn't, because such a thread is not consuming CPU. The complementary tool is the off-CPU flamegraph, generated by tools like BCC's offcputime, which traces finish_task_switch events to record where threads went off-CPU and how long they stayed off, then aggregates by stack trace. The output is a flamegraph where width is wait-time, not CPU-time, and the user sees exactly which call sites are causing scheduler waits.
sudo /usr/share/bcc/tools/offcputime -df -p $(pgrep -f my-service) 30 \
> offcpu.folded
flamegraph.pl --color=io --countname=us offcpu.folded > offcpu.svg
Reading off-CPU flamegraphs requires care because they include all off-CPU time, not just runqueue wait. A thread blocked on epoll_wait for 100 ms shows up as 100 ms wide; that is correct kernel-side bookkeeping but misleading if the reader interprets it as "scheduler latency". The convention is to filter by stack trace: stacks ending in schedule_timeout, do_select, epoll_wait are voluntary I/O waits; stacks ending in __schedule without a sleep-related caller are runqueue waits caused by preemption. The two have different remediations and should be visualised on different flamegraphs (offcputime --state 0, where 0 is the numeric value of TASK_RUNNING, selects exactly the runqueue-wait subset: threads switched out while still runnable).
The combination — on-CPU flamegraph for "where does my code spend its CPU time", off-CPU flamegraph (TASK_RUNNING-filtered) for "where does my code wait for CPU" — gives a complete picture of latency. Production teams that build Grafana panels for both metrics catch scheduler-latency regressions at deploy time rather than at on-call time, which is the entire point of the off-CPU flamegraph existing.
Cgroup CPU bandwidth — throttling looks like preemption
Kubernetes's resources.limits.cpu: 500m translates to cpu.cfs_quota_us = 50000 and cpu.cfs_period_us = 100000 on the pod's cgroup — meaning "this cgroup's threads, summed, may run for 50 ms out of every 100 ms". When the cgroup exhausts its quota mid-period, the kernel throttles all the cgroup's runnable threads — they are removed from their CPU runqueues and placed on a per-cgroup throttled list until the period rolls over. From the application's perspective, this looks exactly like a long context-switch wait: the thread became runnable, then was prevented from running for up to 50 ms.
runqlat -P -p $(pgrep my-service) shows throttling-induced waits as a spike at 10-50 ms in the histogram, perfectly periodic at 100 ms intervals. The cpu.stat file in the cgroup directory reports nr_throttled and throttled_usec (nr_throttled and throttled_time, in nanoseconds, under cgroup v1) directly — these are the exact metrics monitoring should alert on for latency-sensitive services. Why throttling produces such large waits even when the absolute CPU usage is modest: the quota is summed across all threads in the cgroup, so a service with 16 threads on a 4-core node with limits.cpu: 1000m gets throttled the moment any combination of threads consumes 100 ms of CPU within a 100 ms period — which can happen in a 25 ms wall-clock burst (4 threads running at once × 25 ms each = 100 ms summed). The quota was sized for steady-state CPU; it does not handle bursty workloads at all. This is why the operational best practice for latency-sensitive services on Kubernetes is requests without limits — set the request high enough for fair scheduling, leave the limit unset to allow burst absorption.
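Alerting on those counters takes a few lines of polling. A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled; the service path below is illustrative, so substitute your own pod or slice:
# throttle_watch.py — print whenever the cgroup's nr_throttled counter advances.
import time

CGROUP = "/sys/fs/cgroup/system.slice/my-service.service"   # hypothetical path

def read_cpu_stat(path: str) -> dict:
    stats = {}
    with open(f"{path}/cpu.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

prev = read_cpu_stat(CGROUP)
while True:
    time.sleep(1)
    cur = read_cpu_stat(CGROUP)
    throttles = cur["nr_throttled"] - prev["nr_throttled"]
    if throttles:
        waited_ms = (cur["throttled_usec"] - prev["throttled_usec"]) / 1000
        print(f"throttled {throttles}x in the last second, "
              f"{waited_ms:.1f} ms of forced wait")
    prev = cur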
The cgroup-throttling failure mode is the single most common cause of "intermittent p99 spikes on services that look healthy on dashboards" in Indian fintech and gaming production. CRED, Razorpay, Hotstar, and Dream11 have all published postmortems removing CPU limits from latency-critical services after measuring 30–60% p99 improvement. The next chapter (/wiki/cgroup-throttling-cost) goes into the detail; for now the takeaway is: if your runqlat shows a clean bimodal distribution with the second mode at 10-50 ms perfectly aligned to your cgroup's cfs_period_us, you are throttled, and no scheduler tuning will help — the fix is to change the cgroup configuration.
NUMA, isolcpus, and the limits of partitioning
The most extreme way to remove scheduler latency is to remove the scheduler from the picture entirely — pin threads to specific CPUs, isolate those CPUs from CFS load balancing, suppress timer ticks on them, and let the threads spin without preemption. The kernel command line for this is isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15 irqaffinity=0-3, plus sched_setaffinity from userspace to actually pin the threads. Properly configured, an isolated CPU sees zero scheduler latency for the pinned thread because there is nothing else for the scheduler to do.
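The userspace half of that recipe, actually pinning the threads, is a single call. A sketch matching the isolcpus=4-15 example above:
import os

ISOLATED_CPUS = set(range(4, 16))        # must match the isolcpus=4-15 boot parameter
os.sched_setaffinity(0, ISOLATED_CPUS)   # 0 = calling process; new threads inherit it
print("pinned to CPUs:", sorted(os.sched_getaffinity(0)))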
The cost of this approach is operational. Isolated CPUs cannot run any other workload — they sit idle when the pinned thread blocks. Capacity planning becomes per-thread rather than per-box. Kubernetes-native deployment becomes complex (the node has to be partitioned via topology-manager and cpu-manager policies). Live migration breaks. Most teams that try the full isolation approach end up reverting most of it within a quarter, keeping only the parts that gave the largest p99 wins (typically nohz_full for tick-sensitive workloads and irqaffinity to keep network interrupts off the application CPUs).
A useful intermediate position is soft isolation via cgroup cpuset.cpus — restrict the application cgroup to a subset of CPUs without using isolcpus. This still allows kernel housekeeping on those CPUs (ticks, kworkers, IPIs), so it does not get you to the sub-microsecond regime, but it prevents SCHED_OTHER workloads from other cgroups from contending. For the 80% of services where scheduler latency is in the 1-15 ms range, soft isolation captures most of the win at a fraction of the operational cost.
What changes with PREEMPT_RT, EEVDF, and the post-CFS world
Mainline Linux replaced CFS with EEVDF (Earliest Eligible Virtual Deadline First) in kernel 6.6 (2023). EEVDF is closer to a soft-real-time scheduler than CFS was — it tracks per-thread "lag" (the gap between a thread's actual share and its fair share) and uses a virtual deadline to bound how long any runnable thread can be made to wait. The practical effect is that scheduler latency p99 under contention drops by roughly 30-50% on the same workload moving from kernel 6.5 to 6.6, with no application change. The downside is that workloads tuned aggressively against CFS's specific quirks (e.g. relying on sched_migration_cost_ns = 0 to bypass wakeup-affinity) may behave differently because EEVDF reaches the same outcomes through different mechanisms. The replacement was deliberately designed to be a drop-in for SCHED_OTHER, but not every CFS-era tunable survived the transition: sched_latency_ns and sched_wakeup_granularity_ns were removed, and sched_min_granularity_ns was replaced by base_slice_ns under /sys/kernel/debug/sched/.
PREEMPT_RT is the long-running patch series (merged piecemeal into mainline, mostly complete by kernel 6.12) that converts the entire kernel to preemptible mode — spinlocks become sleeping mutexes, interrupt handlers become threaded, and the maximum scheduler latency drops from "tens of milliseconds in pathological cases" to "tens of microseconds bounded". The cost is overall throughput: a PREEMPT_RT kernel is typically 5-15% slower on throughput-oriented workloads because of the additional locking and context-switching the preemption model requires. Most production server distributions do not enable PREEMPT_RT by default; it is opt-in via CONFIG_PREEMPT_RT=y at kernel build time. The right time to consider it is when you have a hard latency budget (audio, control loops, packet processing) that CFS cannot meet — and when you have measured that scheduler latency is the actual bottleneck, not application code or syscall overhead. Why PREEMPT_RT reduces tail latency so dramatically: the worst-case scheduler latency on a stock kernel is bounded by the longest non-preemptible kernel section — which can be 2-10 ms during things like memory reclamation, large mmap operations, or filesystem journal commits. With PREEMPT_RT, those sections are themselves preemptible, so the bound drops to the longest truly atomic kernel operation, typically a few microseconds. The mean does not change much; the tail collapses.
Reproducing scheduler-latency measurements on your laptop
To run the measurements in this chapter on your own machine:
# Install BCC and Python tooling
sudo apt install bpfcc-tools linux-headers-$(uname -r) python3-bpfcc sysstat
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# Run the runqueue-latency demo across LIGHT/MATCHED/OVERSUBSCRIBED regimes
python3 runqlat_demo.py
# Per-PID scheduler latency for a real service
sudo /usr/share/bcc/tools/runqlat -P -p $(pgrep my-service) 10
# Off-CPU flamegraph for a real service (--state 0 = TASK_RUNNING = runqueue wait)
sudo /usr/share/bcc/tools/offcputime --state 0 -df -p $(pgrep my-service) 30 \
> offcpu.folded
flamegraph.pl --color=io --countname=us offcpu.folded > offcpu.svg
# Inspect CFS tunables (sysctl location on pre-5.13 kernels, debugfs on 5.13+)
sysctl -a | grep '^kernel.sched_'
sudo ls /sys/kernel/debug/sched/ 2>/dev/null
# Inspect the per-thread vruntime and switch counts
cat /proc/<pid>/sched | grep -E 'vruntime|nr_(in)?voluntary'
You should see scheduler latency in the 2-7 µs range for a lightly-loaded laptop and submillisecond outliers under load. Anything systematically over 1 ms p99 on a non-overcommitted machine is worth investigating — it is almost always wakeup-affinity or cgroup throttling, both of which are configurable. The diagnostic sequence — runqlat first, cpu.stat second to rule out throttling, offcputime third for per-stack attribution — converges on the cause in under 10 minutes for every common case.
A useful exercise after the basic measurements: take the same service, run with chrt -f 50 my-service (which sets it to SCHED_FIFO priority 50) and re-measure. The runqlat distribution should collapse to sub-10 µs across all regimes because SCHED_FIFO threads preempt SCHED_OTHER immediately on wakeup. The exercise is not a recommendation to deploy SCHED_FIFO in production — it has its own failure modes — but it builds the intuition that scheduler latency is a property of the policy, not a fundamental cost of the kernel. Different policies give different latency distributions for the same workload, and the choice between them is a deliberate engineering decision that should be made with measurement, not assumed away.
Where this leads next
This chapter is the third in Part 12 — the costs your code does not contain but does pay. The first chapter (/wiki/syscall-overhead) covered the boundary cost of syscall. The second (/wiki/context-switch-cost) covered the cost of switching whose code is running. This one covers the wait before the switch — the time the kernel decides not to run your code yet. Together the three chapters describe the full cost stack of "running something other than your application code", and each has a distinct measurement tool and a distinct remediation playbook.
- /wiki/cgroup-throttling-cost — the next chapter. The kernel CFS bandwidth controller treats cpu.cfs_quota_us exhaustion as forced preemption; the resulting waits look exactly like scheduler latency in runqlat but have a different fix.
- /wiki/tlb-misses-and-huge-pages — the cost of address translation; relevant because every scheduler-induced wait is followed by TLB warmup on resume.
- /wiki/context-switch-cost — the previous chapter; the cost the wakee pays after the scheduler finally lets it run.
- /wiki/coordinated-omission-and-hdr-histograms — why measuring scheduler latency from inside the application (rather than via runqlat) almost always undercounts the tail by exactly the scheduler-latency amount.
- /wiki/off-cpu-analysis-the-other-half-of-the-flamegraph — the technique for visualising scheduler latency per call site, the off-CPU complement to traditional flamegraphs.
A senior engineer reading the next four chapters in order builds a complete map of "where did my latency budget go?" The on-CPU flamegraph shows what your code does; the off-CPU flamegraph and runqlat show what your code waited for. The two together let you account for every microsecond of a request's latency, attributing each to either application work, syscall overhead, context switch warmup, scheduler wait, cgroup throttling, or TLB miss tax. By the end of Part 12 every microsecond is named.
A practical follow-up worth committing to muscle memory: when you next encounter a "service is slow but CPU looks fine" mystery, the diagnostic order is runqlat -P -p <pid> first (per-PID scheduler latency), then cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled (cgroup throttling), then offcputime --state 0 (off-CPU runqueue-wait flamegraph), then pidstat -w (switch rate). Most engineers reach for the on-CPU flamegraph first; that flamegraph cannot show waiting time and silently confirms "everything is fine" while the actual problem sits in the kernel's runqueue. Building the instinct to look at off-CPU and runqueue tools before on-CPU tools is the single highest-leverage skill for diagnosing tail-latency problems in modern services.
A closing framing for the chapter: scheduler latency is the kernel's most opinionated number. Every other latency in the stack — syscall, context switch, page fault, TLB miss — is a consequence of mechanism. Scheduler latency is a consequence of policy: the kernel's idea of fair, the heuristic for affinity, the bound on preemption. When the policy matches your workload, the latency is invisible. When it does not, no amount of CPU, RAM, or NIC bandwidth will fix it — only changing the policy, the placement, or the architecture will. Recognising scheduler latency in production is the difference between buying more capacity (which won't help) and changing one sysctl (which will).
A second closing observation worth internalising: most monitoring stacks do not measure scheduler latency at all. The Prometheus node-exporter exposes node_schedstat_* counters that aggregate scheduling-event totals per CPU, but they cannot show per-process distribution and they cannot decompose the wait into its causes. Datadog, New Relic, and Grafana Cloud agents largely ignore scheduler latency because it does not show up in any of the standard system metrics — CPU, memory, disk, network — that monitoring evolved to surface.
The result is that scheduler latency is the single largest invisible contributor to p99 in modern production: present in every box, measurable only with eBPF-based tools, alerted on by no default dashboard. Building a per-service runqlat Prometheus exporter (the kindling-eBPF project provides one; rolling your own is ~150 lines of BCC Python) is the highest-payback observability investment a platform team can make. It surfaces problems that no other metric will, days or weeks before customer-visible latency degrades enough for someone to file a ticket. The teams that build this exporter early and put a single panel — "p99 scheduler latency per service" — on the on-call dashboard catch the entire class of problems this chapter described before the problems become incidents. The teams that don't, eventually do, after the second or third 2 AM page that turned out to be a wakeup-affinity heuristic or a cgroup throttle nobody knew was there.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), §6.5 "Scheduler" — the canonical decomposition of scheduler classes and the off-CPU analysis technique.
- Brendan Gregg, "Off-CPU Analysis" (blog post) — the original write-up of off-CPU flamegraphs and the
offcputimetool. - Lozi et al., "The Linux Scheduler: a Decade of Wasted Cores" (EuroSys 2016) — documented multiple CFS bugs leaving cores idle while threads waited; foundational reading for the wakeup-affinity story.
- Linux kernel documentation,
Documentation/scheduler/sched-design-CFS.rst— the maintained reference for CFS, its tunables, and the rb-tree implementation. - Linux kernel documentation,
Documentation/scheduler/sched-deadline.rst— theSCHED_DEADLINEpolicy and its admission control algorithm. - BCC
runqlatsource — the Python+eBPF implementation, ~250 lines, well worth reading to understand exactly what is being measured. - Daniel Bristot de Oliveira et al., "Demystifying the Real-Time Linux Scheduling Latency" (ECRTS 2020) — the formal model of scheduler latency on PREEMPT_RT kernels.
/wiki/context-switch-cost— the previous chapter; the cost paid after the scheduler decides to run your thread.