Scaling on heterogeneous hardware
Asha runs the fraud-scoring tier at PhonePe. The team migrates from c6i (16 vCPU Ice Lake, all identical cores) to c7i.metal-48xl (Sapphire Rapids, 96 cores, all still uniform) — easy lift, predictable cost. Then the platform team offers a 30% cheaper option: an Arm big.LITTLE-style instance with 8 high-perf cores and 24 efficiency cores, plus an attached inference accelerator. Same cumulative TFLOPS on paper. She schedules her existing Java service on it. P50 stays put. P99 explodes from 38 ms to 220 ms. CPU utilisation across the box averages 41% — most cores look idle. The flamegraph shows nothing pathological, just the usual JIT methods. The bug is not in her code. The bug is that the JVM sizes its thread pools as if all 32 cores were equal, the OS load-balancer migrates her latency-critical request handlers onto the 24 efficiency cores at random, and those cores run her hot path at 40% of the speed of the performance cores — at unpredictable moments, by request rather than by design.
This is the third physical scaling regime, after serial fraction (Amdahl) and bandwidth saturation. Heterogeneous hardware — performance cores plus efficiency cores plus accelerators — breaks the assumption every parallel-systems mental model has rested on for thirty years: that "a core is a core". When cores have different speeds, different cache hierarchies, different power-frequency curves, and different instruction sets, the question "which core do I run this on?" becomes a first-class scheduling decision the runtime gets wrong by default. This chapter is about how to recognise the regime, measure the asymmetry, and stop fighting the scheduler that thinks all your cores are interchangeable.
A modern server or laptop is not a pool of identical cores; it is a hierarchy of fast cores, slow cores, and accelerators with 2–4× speed asymmetries and entirely different cache, frequency, and ISA characteristics. Default OS schedulers and language runtimes assume homogeneity and migrate work onto whichever core looks idle, producing 2–10× tail-latency variance for no observable reason. The fix is to make placement explicit — pin latency-critical work to performance cores, batch-throughput work to efficiency cores, and matrix work to accelerators — and to measure the asymmetry directly with lscpu --extended and per-core benchmarks before you trust any aggregate number.
What heterogeneous hardware actually looks like in 2026
The uniform-core assumption was always a simplification, but until ~2020 it was a useful one — server CPUs shipped with N identical cores, the OS treated them symmetrically, and any worker placed on any core completed in roughly the same time. That world is gone. Three independent forces broke it:
Hybrid x86 (Intel since Alder Lake, 2021). Performance cores ("P-cores", Golden Cove / Raptor Cove / Lion Cove) and efficiency cores ("E-cores", Gracemont / Crestmont / Skymont) share the same package and instruction set, but P-cores are ~1.5–2× faster per cycle, support 2-way SMT (hyperthreading), and run at 5+ GHz; E-cores are slower (topping out around 4 GHz), single-threaded, and use roughly 1/4 the silicon area per core. A 14th-gen Core i9 has 8 P-cores + 16 E-cores; on the server side Intel splits the two tiers into separate product lines (all-P-core Granite Rapids, all-E-core Sierra Forest), so the same asymmetry shows up across a fleet rather than inside one package. Linux's CFS sees every core as an equivalent schedulable entity and migrates tasks freely.
ARM big.LITTLE (since 2011, dominant on phones, now in laptops and edge boxes). Apple silicon (M-series) and essentially every Android SoC ship a mix of high-performance and high-efficiency cores; ARM server parts such as Graviton and AmpereOne have so far stayed uniform, which is why they reappear in the homogeneous-fallback discussion at the end of this chapter. Apple's base M3 has 4 performance + 4 efficiency cores; the asymmetry is ~2.5× on integer workloads, ~2× on floating-point. macOS has QoS-class scheduling and Linux has EAS (Energy Aware Scheduling) plus hard affinity via sched_setaffinity, but the defaults still migrate.
Accelerators alongside the CPU. GPUs (CUDA, ROCm), NPUs (Apple ANE, Qualcomm Hexagon, Google TPU), DPUs (NVIDIA BlueField, AWS Nitro), and FPGAs (AWS F1, Azure NP). Each accelerator has its own memory space, its own programming model, and its own latency profile (typically 5–50 µs to dispatch a kernel, which erases the win unless the kernel is large enough to amortise the dispatch). The CPU is now the coordinator, not the executor, for any compute-heavy operation that fits an accelerator's wheelhouse.
The 2026 node thus speaks four cost languages: P-core nanoseconds, E-core nanoseconds (~2.5× slower), NPU/iGPU microseconds (with a small dispatch tax), and discrete-GPU microseconds (with a fat dispatch tax). A scheduler that pretends these are interchangeable is a scheduler that places latency-critical traffic on whatever happens to be idle — and in production, what happens to be idle is usually an E-core, because the scheduler fills the P-cores first and under load the only spare capacity left is E-core capacity. The result is exactly what Asha sees: latency variance with no flamegraph signal.
Measuring the per-core asymmetry on your own box
Before any optimisation, measure the asymmetry. The harness below pins identical work to each logical CPU in turn, measures wall time, and prints a per-core throughput table. The point is to see — on the hardware in front of you — that "a core is a core" is wrong.
# core_asymmetry.py — measure per-CPU throughput on a heterogeneous box
# Run: python3 core_asymmetry.py
import os, time, subprocess
# A pure-CPU kernel: integer inner loop, no memory traffic, no syscalls.
# We use ctypes to call into a tight C loop so Python interpreter overhead
# does not dominate the measurement.
import ctypes
KERNEL = r"""
#include <stdint.h>
uint64_t spin(uint64_t n) {
uint64_t x = 1;
for (uint64_t i = 0; i < n; i++) x = x * 1103515245u + 12345u;
return x;
}
"""
# Compile the kernel once.
import tempfile, pathlib
src = pathlib.Path(tempfile.gettempdir()) / "asym_kernel.c"
lib = pathlib.Path(tempfile.gettempdir()) / "asym_kernel.so"
src.write_text(KERNEL)
subprocess.run(["cc", "-O2", "-shared", "-fPIC", str(src), "-o", str(lib)], check=True)
k = ctypes.CDLL(str(lib))
k.spin.argtypes = [ctypes.c_uint64]; k.spin.restype = ctypes.c_uint64
ITERS = 2_000_000_000 # ~1 s on a fast P-core
ncpu = os.cpu_count()
def bench_on(cpu):
# Pin this process to the single logical CPU.
os.sched_setaffinity(0, {cpu})
# Warmup so frequency boost stabilises.
k.spin(50_000_000)
t0 = time.perf_counter()
k.spin(ITERS)
return time.perf_counter() - t0
results = []
for cpu in range(ncpu):
dt = bench_on(cpu)
rate = ITERS / dt / 1e9 # giga-iters per second
results.append((cpu, dt, rate))
# Print with the fastest core normalised to 1.0
fastest = max(r[2] for r in results)
print(f"{'CPU':>4} {'time(s)':>8} {'Giter/s':>8} {'rel':>5}")
for cpu, dt, rate in results:
rel = rate / fastest
tier = "P" if rel > 0.85 else ("E" if rel < 0.6 else "?")
print(f" {cpu:3d} {dt:8.3f} {rate:8.3f} {rel:5.2f} [{tier}]")
# Sample run on a 14th-gen Core i9-14900K (8 P-cores SMT-on, 16 E-cores), Bengaluru workstation
CPU time(s) Giter/s rel tier
0 1.026 1.949 1.00 [P]
1 1.041 1.921 0.99 [P]
2 1.029 1.943 1.00 [P]
3 1.038 1.927 0.99 [P]
4 1.034 1.934 0.99 [P]
5 1.041 1.921 0.99 [P]
6 1.044 1.916 0.98 [P]
7 1.057 1.892 0.97 [P]
8 1.038 1.927 0.99 [P] # SMT siblings of 0-7
9 1.042 1.919 0.98 [P]
...
16 2.553 0.783 0.40 [E]
17 2.561 0.781 0.40 [E]
...
31 2.598 0.770 0.39 [E]
Walk-through. os.sched_setaffinity(0, {cpu}) pins this process to a single logical CPU; this is the kernel's sched_setaffinity(2) syscall behind a Python wrapper, and it is the only honest way to measure per-core throughput on Linux because the default scheduler will migrate the process between cores during the benchmark. k.spin(ITERS) runs the C kernel via ctypes; we call into C because a Python-level inner loop would spend 95% of its time in interpreter dispatch, drowning the per-core signal. Warmup (k.spin(50_000_000)) is essential because P-cores boost from 3.0 to 5.5 GHz over ~100 ms; a benchmark that does not warm up spends part of its run at pre-boost frequency and reports a blend that matches neither steady state. The output tells the story. P-cores complete in ~1.03 s; E-cores in ~2.56 s. The asymmetry is 2.5×, roughly what Intel's own characterisation of the two core types suggests, but you only see it by measuring. Tools like lscpu --extended, or the per-type CPU lists in /sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus on hybrid Intel parts, give you the labels; the harness gives you the cost.
Why a single benchmark per core (rather than running them all in parallel): parallel benchmarks measure aggregate throughput, which is the right number for capacity planning but the wrong number for the question "is this core 2× slower than that core?". Per-core sequential measurement isolates the variable. The full picture comes from doing both — run the asymmetry harness first to know the per-core ratio, then run a parallel benchmark to know the aggregate ceiling. The two numbers together give you the placement decision: latency-critical to the fastest cores, batch to the slowest, and never the other way around.
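A minimal companion sketch for that aggregate half, assuming core_asymmetry.py has already been run in the same temp directory (so the compiled asym_kernel.so exists): run the same spin kernel on every available CPU roughly simultaneously and sum the per-CPU rates. Expect the aggregate to land below ncpu × the fastest single-core rate, because the package power budget is shared.
# core_aggregate.py — companion sketch: aggregate throughput with every CPU busy.
# Assumes core_asymmetry.py has already built asym_kernel.so in the temp dir.
import os, time, ctypes, tempfile, pathlib
from concurrent.futures import ProcessPoolExecutor

LIB = pathlib.Path(tempfile.gettempdir()) / "asym_kernel.so"
ITERS = 2_000_000_000

def bench(cpu):
    os.sched_setaffinity(0, {cpu})                 # pin this worker process to one CPU
    k = ctypes.CDLL(str(LIB))
    k.spin.argtypes = [ctypes.c_uint64]; k.spin.restype = ctypes.c_uint64
    k.spin(50_000_000)                             # warm up frequency boost
    t0 = time.perf_counter()
    k.spin(ITERS)
    return ITERS / (time.perf_counter() - t0) / 1e9  # Giter/s on this CPU

if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))
    with ProcessPoolExecutor(max_workers=len(cpus)) as pool:
        rates = list(pool.map(bench, cpus))
    print(f"aggregate: {sum(rates):.2f} Giter/s across {len(cpus)} CPUs")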
The diagnostic that answers "is the OS scheduler migrating my work across tiers?" is taskset -p <pid> plus pidstat -t -u 1 watched over a couple of minutes. If pidstat's CPU column for your worker thread oscillates between low-numbered (P) and high-numbered (E) CPU IDs, the scheduler is moving you and you are paying the asymmetry as variance. The fix is taskset -c 0-7 at process launch, or os.sched_setaffinity from inside the program — pin latency-critical threads to P-cores explicitly.
# pidstat sample — Asha's fraud-scoring service before pinning, on a hybrid x86 box
$ pidstat -t -u -p $(pgrep -f fraud-scorer) 1
14:02:01 UID TGID TID %usr %system %guest %wait %CPU CPU Command
14:02:02 1000 321789 - 88.00 2.00 0.00 1.00 90.00 3 fraud-scorer
14:02:03 1000 321789 - 89.00 1.00 0.00 1.00 90.00 19 fraud-scorer # E-core!
14:02:04 1000 321789 - 88.00 2.00 0.00 1.00 90.00 5 fraud-scorer
14:02:05 1000 321789 - 89.00 1.00 0.00 0.00 90.00 23 fraud-scorer # E-core
14:02:06 1000 321789 - 88.00 2.00 0.00 1.00 90.00 2 fraud-scorer
The thread bounces between CPUs 2, 3, 5 (P-cores) and 19, 23 (E-cores) every second. The %CPU column reads 90 — looking healthy — but the work is running at roughly 40% speed every other second. After taskset -cp 0-15 $(pgrep -f fraud-scorer), the CPU column stays in 0–15 and the p99 latency drops back to 38 ms.
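Pinning can also be done from inside the process, one thread at a time, which is what you want when only some threads are latency-critical. A minimal sketch, assuming the 0–15 / 16–31 P/E split of the box above (read the real split from sysfs or lscpu rather than hard-coding it):
# pin_threads.py — sketch: pin only the hot thread to P-cores, background work to E-cores.
import os, threading, time

P_CORES = set(range(0, 16))    # assumed P-core logical CPUs on the 8P(+SMT)+16E box above
E_CORES = set(range(16, 32))   # assumed E-core logical CPUs

def handler_loop():
    # On Linux, sched_setaffinity takes a thread id; pinning here affects only this thread.
    os.sched_setaffinity(threading.get_native_id(), P_CORES)
    while True:
        time.sleep(0.01)       # stand-in for request handling

def shipper_loop():
    os.sched_setaffinity(threading.get_native_id(), E_CORES)
    while True:
        time.sleep(1.0)        # stand-in for log shipping / metrics export

threading.Thread(target=handler_loop, name="handler", daemon=True).start()
threading.Thread(target=shipper_loop, name="shipper", daemon=True).start()
time.sleep(5)                  # let pidstat -t observe the per-thread CPU columns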
The placement decision — three tiers, three policies
Heterogeneous scheduling collapses to three placement decisions, each with a different heuristic:
Latency-critical work goes to P-cores, pinned. Anything in the request-response hot path — gRPC handlers, query parsers, JSON encoders, anything that contributes to user-facing p99 — runs on P-cores with explicit affinity. The cost of pinning is reduced load-balancing flexibility (if all P-cores are busy, this work waits rather than spilling); the benefit is predictable per-request latency. For Razorpay's UPI handlers (200 ms SLO), Zerodha's order-match worker (sub-millisecond budget), or Hotstar's edge cache lookup, this is the only sensible policy. The sleeper issue: when the OS schedules a system thread (kworker, ksoftirqd) onto a P-core, your pinned worker waits behind it; the response is isolcpus=0-7 at boot to remove those CPUs from the OS's general scheduler, then nohz_full=0-7 and rcu_nocbs=0-7 to silence ticks and RCU callbacks. This is "PREEMPT_RT-style" cordoning of the latency-critical core set; production trading systems at Indian exchanges run exactly this configuration. A sketch of this boot configuration follows the three policies.
Throughput-bound batch work goes to E-cores, also pinned. Background jobs, log parsers, batch ETL, GC threads — anything where total wall time matters more than per-task latency — runs on E-cores. The asymmetry actually helps here: E-cores have higher perf-per-watt, and at the same total chip power budget you can run more E-core threads in parallel than P-core threads. For a 500-task batch of independent log records, 16 E-cores at 0.4× throughput each delivers 6.4× total throughput, vs 8 P-cores at 1.0× each delivering 8× — superficially worse, but freeing the P-cores for latency-critical traffic. Net throughput is higher on the system as a whole because you stop time-slicing the latency tier with the batch tier.
Compute-dense kernels go to accelerators, asynchronously. Matrix multiplies, tensor convolutions, embedding lookups for ML inference, video transcoding, cryptographic hashing at scale — anything with enough arithmetic intensity to amortise a 5–50 µs PCIe dispatch latency goes on a GPU/NPU/ASIC. The rule of thumb: the kernel must run for ≥10× the dispatch latency to be a win, so a 50 µs dispatch tolerates kernels of ≥500 µs. Below that the CPU wins by avoiding the round trip; above it the accelerator wins by ~50–2000×. The hard part is the dispatch boundary — a poorly factored API that calls back to the CPU for control flow inside the kernel pays the dispatch cost on every call and never realises the speedup. Profile with nvprof or nsys to see the dispatch-vs-compute breakdown.
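A sketch of the cordoning configuration from the first policy, assuming CPUs 0–7 are the reserved P-cores and a Debian/Ubuntu-style GRUB setup; the fraud-scorer unit name is carried over from the pidstat capture above.
# /etc/default/grub — reserve CPUs 0-7 for pinned latency-critical threads
GRUB_CMDLINE_LINUX="isolcpus=0-7 nohz_full=0-7 rcu_nocbs=0-7"
$ sudo update-grub && sudo reboot
# fraud-scorer.service (systemd unit) — keep the pinned service on the reserved set
[Service]
CPUAffinity=0-7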
# placement_policy.py — concrete decision logic for hybrid x86 + accelerator
# Run: python3 placement_policy.py
import os
NPROC = os.cpu_count()
# Discover the P-core / E-core split. On hybrid x86 with a recent kernel the perf
# subsystem exposes one CPU list per core type: /sys/devices/cpu_core/cpus and
# /sys/devices/cpu_atom/cpus (cpulist format, e.g. "0-15"). On ARM big.LITTLE,
# compare /sys/devices/system/cpu/cpuN/cpu_capacity (0-1024) across CPUs instead.
def parse_cpulist(path):
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return []
    cpus = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus
p_cores = parse_cpulist("/sys/devices/cpu_core/cpus")
e_cores = parse_cpulist("/sys/devices/cpu_atom/cpus")
if not p_cores and not e_cores:
    p_cores = list(range(NPROC))  # uniform-core box: every CPU is effectively a P-core
print(f"P-cores: {p_cores}")
print(f"E-cores: {e_cores}")
def pin_latency_critical(thread_fn):
"""Run latency-critical work on the P-core set, no migration."""
os.sched_setaffinity(0, set(p_cores))
return thread_fn()
def pin_batch(thread_fn):
"""Run batch throughput work on the E-core set."""
os.sched_setaffinity(0, set(e_cores))
return thread_fn()
def should_dispatch_to_accelerator(kernel_size_flops, dispatch_us=50,
                                   accel_flops_per_us=2e6,  # ~2 TFLOPS accelerator
                                   cpu_flops_per_us=1e5):   # ~100 GFLOPS on the host cores
    """Below the break-even kernel size, CPU wins; above it, the accelerator wins."""
    cpu_time_us = kernel_size_flops / cpu_flops_per_us
    accel_time_us = kernel_size_flops / accel_flops_per_us + dispatch_us
    return accel_time_us < cpu_time_us, cpu_time_us, accel_time_us
# Print the break-even table for matmul-shaped kernels (2*N^3 FLOPs)
print(f"\n{'N':>5} {'FLOPs':>12} {'CPU(µs)':>10} {'GPU(µs)':>10} {'use':>4}")
for N in (16, 64, 256, 1024, 4096):
flops = 2 * N**3
use, ct, at = should_dispatch_to_accelerator(flops)
print(f" {N:4d} {flops:12d} {ct:10.1f} {at:10.1f} {'GPU' if use else 'CPU'}")
# Sample run on a 14900K (8 P + 16 E) without an attached GPU; numbers from the policy fn
P-cores: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
E-cores: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
N FLOPs CPU(µs) GPU(µs) use
16 8192 0.1 50.0 CPU
64 524288 5.2 50.3 CPU
256 33554432 335.5 66.8 GPU
1024 2147483648 21474.8 1124.0 GPU
4096 137438953472 1374389.5 68769.5 GPU
Walk-through. On hybrid x86 the kernel exposes one CPU list per core type (/sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus); parsing those lists is the programmatic equivalent of reading lscpu --extended. ARM big.LITTLE exposes /sys/devices/system/cpu/cpu*/cpu_capacity (a unitless 0–1024 rating) instead. pin_latency_critical and pin_batch are the two fundamental placement primitives; everything else is bookkeeping. should_dispatch_to_accelerator is the break-even formula: a kernel pays the fixed dispatch cost regardless of size, so small kernels lose to the CPU and large kernels win on the accelerator. The crossover for a matmul under these assumptions falls between N=64 and N=256 (around 5M FLOPs) — below that, CPU; above, GPU. Crucially, this is per-call; if your inference batch combines 16 small matmuls (each below break-even individually), batching them into one large matmul flips them all to the GPU side. Why the break-even formula matters more than raw TFLOPS comparisons: a vendor brochure says "GPU is 20× faster than CPU on FP16 matmul". This is true for the kernel itself but ignores dispatch. If you are running 1000 small kernels per second (50 µs dispatch each = 50 ms/sec wasted on dispatch alone), the brochure number is meaningless — you need the kernel size to amortise the dispatch, and below the break-even you are paying the GPU's overhead without realising any of its speed.
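A few lines make that batching flip concrete, using the same assumed constants as should_dispatch_to_accelerator above:
# batch_to_break_even.py — sketch: 16 matmuls of N=64 lose to the CPU one by one,
# but batched into a single dispatch they clear the break-even.
DISPATCH_US, ACCEL_FLOPS_PER_US, CPU_FLOPS_PER_US = 50, 2e6, 1e5  # same assumptions as above

def accel_wins(flops):
    return flops / ACCEL_FLOPS_PER_US + DISPATCH_US < flops / CPU_FLOPS_PER_US

one = 2 * 64**3                                                    # FLOPs in a single N=64 matmul
print("single N=64 matmul:", "GPU" if accel_wins(one) else "CPU")       # CPU
print("16 batched matmuls:", "GPU" if accel_wins(16 * one) else "CPU")  # GPU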
The break-even line is what the placement layer must compute on every dispatch decision. PyTorch's torch.compile and torch.jit do this implicitly when they fuse small ops into a single CUDA kernel — they raise the effective kernel size above break-even by batching neighbours that would otherwise individually fall below it. Production inference services that do not fuse pay the dispatch tax 100× per request; ones that do amortise it across the whole forward pass.
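In user code the change is small. A minimal sketch, assuming PyTorch 2.x is installed; the Sequential model is a stand-in for the real graph, and on a CPU-only box the default Inductor backend emits fused C++/OpenMP code rather than CUDA kernels:
# fuse_with_torch_compile.py — sketch, assuming PyTorch 2.x
import torch

model = torch.nn.Sequential(                # stand-in for the real multi-op inference graph
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 64),  torch.nn.ReLU(),
).eval()

fused = torch.compile(model)                # traces the graph and fuses eligible ops
x = torch.randn(32, 256)
with torch.no_grad():
    fused(x)                                # first call compiles; later calls reuse the fused kernels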
Common confusions
- "All cores on a server are equal — only laptops have hybrid silicon." Wrong since 2023. Sapphire Rapids, Sierra Forest, Granite Rapids, Bergamo, Graviton 4 — every major server CPU family has hybrid or per-core-frequency variance. Even "homogeneous" cores hit different boost frequencies based on thermal headroom; cores 0–3 may run at 5.0 GHz while cores 4–7 run at 4.5 GHz simply because the package power budget is uneven.
- "The OS scheduler handles heterogeneity automatically." Linux's CFS does not. Recent additions (
SCHED_DEADLINE,EASon ARM) help with batch-vs-latency separation on phones but on hybrid x86 servers the scheduler still treats P and E cores as fungible. macOS and Windows have stronger heuristics (Apple's QoS classes, Windows' Thread Director) but they still need the application to tag its work — untagged latency-critical threads end up on E-cores by default. - "GPUs are always faster for any parallel work." GPUs win on compute-dense kernels with high arithmetic intensity (matmul, conv, FFT). They lose on small-kernel, low-AI work (parsing, branching, pointer-chasing) because the dispatch overhead dwarfs the compute, and the GPU's per-thread compute is slower than a CPU core for serial code. A workload of 1000 parses-per-second is far better on E-cores than on a GPU.
- "Pinning to P-cores wastes E-core capacity." Only if the workload has nothing to put on E-cores. In practice every production system has GC threads, log shippers, metric exporters, batch jobs, cache warmers — all of which are throughput-bound. Move them to E-cores explicitly. The result is full-system utilisation with predictable latency, not idle E-cores.
- "big.LITTLE on phones is the same problem as hybrid x86 on servers." Same shape, different constraint. Phones optimise for energy; servers optimise for throughput at a latency SLO. The placement policy looks similar (latency-critical → P, batch → E) but the goals are opposite — phones want to keep cores in the lowest-power state that meets the user-facing requirement, servers want to keep cores in the highest-throughput state that meets the SLO.
- "Dispatch latency to the GPU is negligible because PCIe is fast." PCIe Gen5 is fast at bulk transfer (64 GB/s), but a kernel launch is dozens of round trips through the driver, the user-mode CUDA stack, and the GPU command queue. Measured dispatch is 5–50 µs depending on the API and driver. For sub-millisecond inference targets, this is fatal — the dispatch alone is 5% of your budget — and the response is to fuse kernels until the dispatch tax is amortised.
Going deeper
Energy-Aware Scheduling and the QoS-class abstraction
Linux's Energy-Aware Scheduling (EAS, mainlined in 5.1, default on most ARM platforms) is the kernel's first attempt at heterogeneity-aware placement. It builds a per-CPU capacity-and-power model from device-tree data (per-CPU capacity ratings plus an energy model) and tries to place each task on the most energy-efficient CPU that still satisfies its performance demand. EAS works well for foreground/background separation on phones; it works less well for server workloads because the kernel cannot infer "this thread is latency-critical, never put it on an E-core" from observation alone.
Apple's solution is the QoS class abstraction (QOS_CLASS_USER_INTERACTIVE, QOS_CLASS_BACKGROUND) — applications tag their work with a quality-of-service class, and the scheduler uses the tag as a strong placement hint. USER_INTERACTIVE work runs on P-cores at maximum frequency, BACKGROUND on E-cores at reduced frequency. Linux's analog is SCHED_DEADLINE (for hard deadlines) plus nice-value adjustments, but it requires application-side cooperation and most JVM / Python / Go runtimes do not propagate the hint correctly. Why QoS classes matter for production: when the OS knows the request priority, it can make the right placement decision without the application doing per-thread pinning. The cost is that every layer of the stack (the OS, the runtime, the framework, the application) must propagate the QoS tag — break the chain anywhere and the scheduler reverts to default placement. This is why pinning, even though crude, is more robust in mixed-runtime production than QoS hints.
Frequency scaling, turbo boost, and the per-core race
Even on "homogeneous" servers, cores are heterogeneous in the small. Modern CPUs run per-core dynamic voltage-frequency scaling — each core picks its own clock speed based on its power-budget allocation, thermal state, and SMT-sibling activity. A core running a single thread at low temperature can boost to 5.5 GHz; the same core under thermal pressure or shared with an SMT sibling may run at 3.8 GHz. The CPU exposes this through /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; check it in production and you will see frequency variance of ±20% across cores at any given moment.
The implication: a benchmark that runs on cold cores reports best-case numbers; the same code in production sees variable cores. The safest configuration for latency-sensitive work is to disable turbo (echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo) and pin to a fixed-frequency core set — slower peak, predictable floor. Zerodha Kite's market-data fanout runs exactly this way: turbo off, all cores locked at base frequency, latency-critical threads pinned. The peak throughput is lower; the p99.99 is rock-stable. For high-frequency trading workloads at NSE, predictability beats peak.
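The knobs themselves, as root on an intel_pstate machine (paths as named above; whether your distro exposes them writable is worth checking first):
# Lock frequencies for predictability: turbo off, fixed governor, then verify per-core clocks
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo            # 1 = turbo disabled
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
$ watch -n1 'grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq'   # should sit near base clock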
Accelerator dispatch — kernel fusion as the universal optimisation
When latency budgets push you to GPUs, the universal optimisation is kernel fusion: combining adjacent ops into a single dispatch so the dispatch cost is amortised across them. PyTorch's torch.compile, JAX's jit, TensorFlow XLA, and TVM all perform some form of fusion — they identify ops that share data flow and emit a single CUDA / ROCm kernel rather than N separate dispatches. The speedup is rarely from making any single op faster; it is from eliminating N-1 dispatches.
For a Razorpay-scale fraud inference service running 200K predictions/sec, the model graph might have 30 ops per inference. Naive PyTorch eager-mode execution dispatches 30 kernels per inference, paying 30 × 30 µs = 900 µs of dispatch alone, which by itself caps a single dispatching CPU thread at roughly 1,100 inferences per second. After torch.compile, the same graph is fused into 4 kernels per inference, paying 4 × 30 µs = 120 µs of dispatch — a 7.5× reduction in dispatch overhead, with no change to the model itself. The skill is recognising when fusion is the bottleneck (the symptom in nsys traces: a high rate of cudaLaunchKernel calls with only a small fraction of GPU time spent inside kernels) and applying the right tool.
Heterogeneous memory hierarchies and the NUMA-of-accelerators
The accelerator brings its own memory — HBM on a GPU, on-chip SRAM on an NPU — and those memories are not coherent with the CPU's DRAM. Every dispatch implies a transfer (or a pinned shared region with explicit synchronisation). The aggregate cost model is therefore NUMA-of-accelerators: each device has its own memory tier, with bandwidth and latency to other tiers. CXL 2.0 promises cache-coherent access across tiers, but in 2026 the production reality is still explicit cudaMemcpyAsync / hipMemcpy calls.
For ML inference, the dominant pattern is persistent residency: weights are loaded into GPU HBM once at startup and never moved; inputs are transferred per request via pinned host memory and DMA. The transfer cost is small (~5–20 µs for a 16 KB input), but it is non-zero, and at 200K req/sec even 5 µs per request adds up to a full second of transfer time per wall-clock second, saturating a single copy queue outright. Production inference servers (TensorRT-LLM, vLLM, TGI) all implement input batching specifically to amortise this transfer across multiple requests in flight: a 64-request batch moves in one transfer, paying the fixed setup cost once instead of 64 times, which cuts the per-request transfer cost from ~5 µs to a fraction of a microsecond.
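A short sketch of that amortisation, treating the per-copy setup cost and the PCIe rate as assumptions:
# transfer_amortisation.py — per-request host->device transfer cost vs batch size.
SETUP_US = 5.0                 # assumed fixed cost per cudaMemcpyAsync submission
BYTES_PER_REQ = 16 * 1024      # the 16 KB input from the text
PCIE_BYTES_PER_US = 64_000     # ~64 GB/s PCIe Gen5, assumption

def per_request_us(batch):
    total = SETUP_US + batch * BYTES_PER_REQ / PCIE_BYTES_PER_US
    return total / batch

for batch in (1, 8, 64):
    print(f"batch={batch:3d}  per-request transfer ~{per_request_us(batch):.2f} µs")
# batch=1 -> ~5.26 µs, batch=8 -> ~0.88 µs, batch=64 -> ~0.33 µs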
Heterogeneity in Indian production — three concrete patterns
Hotstar's IPL inference fleet runs ad-personalisation models on a hybrid CPU + Tensor accelerator topology. The request flow: incoming HTTP on a P-core Java handler (predictable 2 ms parse), feature lookup against an in-memory store on an E-core thread (5 ms, fine — throughput-bound), model inference dispatched to a TPU-v4 accelerator (~80 µs kernel + 30 µs dispatch = 110 µs total), response assembly back on the originating P-core (1 ms). Total p99: 8 ms. Without P-core pinning of the handler thread, the same path runs at 22 ms p99 — the parse stage migrates onto E-cores under load and inflates the budget. The fix was three lines of taskset in the systemd unit file, not a model change.
Flipkart's Big Billion Days catalogue ranker runs on c7i hybrid instances with embedding lookups dispatched to attached Inferentia accelerators. The naive approach (one inference per product score, 8M products to score) would dispatch 8M kernel launches per query — 8M × 30 µs = 240 seconds of dispatch alone. The actual implementation batches 1024 products per dispatch, so 8000 dispatches per query × 30 µs = 240 ms of dispatch. The per-batch kernel runs ~600 µs, so total wall is 8000 × 630 µs = 5 s — fast enough for the 10 s indexing budget. Without batching, the same hardware would not meet the SLO at all; the heterogeneity-aware fix is in the dispatch policy, not the model.
Zerodha Kite's order-match worker runs purely on isolated P-cores with isolcpus and nohz_full. There is no accelerator path because the workload is branch-heavy serial code (order matching is a tree traversal with conditional updates, not a matmul) — exactly the kind of work that loses on a GPU. The team measured the GPU-dispatch break-even and concluded that for kernels under 100 µs of compute, the CPU wins. The order match runs in 8 µs on a pinned P-core; on the GPU it would take 5 µs of compute plus 30 µs dispatch — a 4× regression. Heterogeneity-aware design sometimes means not using the accelerator.
When heterogeneity is the wrong abstraction — homogeneous fallbacks
For workloads that are themselves homogeneous and unbatchable — like a request handler that does a fixed sequence of 200 ns operations and serves a fixed-size response — the right hardware is sometimes a uniform-core processor (Bergamo all-Zen-4c, Graviton 3 all-Neoverse, AmpereOne all-A1) rather than a hybrid. The placement decision becomes trivial (any core is fine), the OS scheduler does the right thing by default, and the operational complexity drops to zero. This is a real choice in 2026 cloud architecture: AWS's mainstream compute families are still uniform-core (c7g with Graviton 3, c7i with Sapphire Rapids, c7a with Genoa Zen 4), and picking one of them over a hybrid or accelerator-attached option keeps the placement problem out of your stack entirely. For a payments microservice with no batch component, c7g often wins on operational simplicity even though other instance families post higher peak scores on synthetic benchmarks. The lesson: heterogeneity is a tool, not a default. Use it where the workload has multiple cost classes; skip it where it does not.
Where this leads next
The next chapter (/wiki/wall-i-o-is-where-systems-actually-block) shifts from the CPU side of scaling to the I/O side: the wall where queueing-theory ceilings, kernel block-layer overhead, and storage-device tail latency dominate the picture even when the cores are idle. The placement decisions of this chapter compose with the I/O policies of the next: a P-core blocked on a slow disk read is no faster than an E-core blocked on the same read.
The Part 9 thread (chapters 60-66) closes here. Each chapter named a different physical resource that ceiling-limits parallel scaling: serial fraction (Amdahl), coherence traffic, memory bandwidth, heterogeneous topology. The pattern across all four: the bottleneck is physical, not algorithmic; the fix is structural, not "add more cores"; and the diagnostic is direct measurement of the suspected resource (pcm-memory for bandwidth, perf c2c for coherence, the asymmetry harness above for heterogeneity). Adding more cores solves none of these problems. See /wiki/memory-bandwidth-as-the-real-ceiling for the bandwidth-side companion and /wiki/coherence-traffic-as-a-hidden-ceiling for the coherence-side companion.
The two operational habits this chapter adds to the Part 9 toolkit. First, measure per-core asymmetry on every new instance type before benchmarking aggregate numbers — a 10-minute run of core_asymmetry.py saves weeks of misattributed latency variance later. Second, design placement before designing the algorithm — decide which work goes on which tier (P-core, E-core, accelerator) at the architecture stage; retrofitting placement onto an existing service is harder than retrofitting most things, because the runtime's assumptions about uniform cores are baked into thread-pool sizing, GC heuristics, and queue depths.
Reproducibility footer
Both harnesses run on any Linux box with Python 3.11+ and a C compiler. The asymmetry harness needs a hybrid CPU to show the 2.5× spread (Alder Lake 12th-gen and newer, Apple silicon via Asahi Linux, or any ARM big.LITTLE board); on a uniform-core CPU the harness will show a flat 1.0× across all cores, which is itself useful confirmation. The placement-policy harness runs anywhere (it's pure logic, no GPU required).
# Reproduce this on your laptop, ~3 minutes for both
sudo apt install build-essential linux-tools-common
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip # no third-party packages needed; ctypes is stdlib
python3 core_asymmetry.py
python3 placement_policy.py
# To see scheduler migration, run the asymmetry harness without sched_setaffinity:
# comment out the os.sched_setaffinity line and watch numbers blur together.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), chapter 6 — CPUs — the practical treatment of CPU heterogeneity, scheduler classes, and per-core measurement.
- Intel® Hybrid Architecture Software Developer Guide (2024 rev.) — the official guide to P-core / E-core scheduling and Thread Director, the hardware feedback channel that informs the OS scheduler.
- Arm big.LITTLE Technology White Paper (2013, updated 2023) — the original heterogeneous-core architecture and its scheduling abstractions; still the clearest exposition of the model.
- Patrick Bellasi et al., "Energy-Aware Scheduling: a step beyond CPU frequency scaling" (Linux Kernel Summit, 2018) — the Linux EAS design and how it interacts with cpufreq.
- Sharon Maoz et al., "Heterogeneous Computing: Hardware and Software Perspectives" (ACM Computing Surveys, 2022) — the academic survey of accelerator dispatch costs and break-even analysis.
- NVIDIA, "Optimizing GPU kernel launch latency" (Developer Blog, 2023) — the canonical reference on kernel-launch overhead and CUDA Graphs as the fusion solution.
- /wiki/memory-bandwidth-as-the-real-ceiling — the bandwidth companion to this chapter; together they cover the two physical ceilings outside the cores themselves.
- /wiki/coherence-traffic-as-a-hidden-ceiling — the coherence companion; the third physical scaling ceiling alongside bandwidth and heterogeneity.