Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Cores, hyperthreads, and the pretend-ness of parallelism

KapitalKite's matching engine pinned eight matching shards to eight cores on a 16-core / 32-thread EPYC and saw p99 add-order at 9 µs. When Aditi pinned them to logical CPUs 0,1,2,3,4,5,6,7 instead — sequential numbering, "what could go wrong" — p99 jumped to 28 µs. The engine did not change. The cores did not change. What changed is that on this box the kernel enumerated SMT siblings adjacently: logical CPU 0 and logical CPU 1 were the two hardware threads of one physical core, sharing one set of execution units, and Aditi had unknowingly co-located four pairs of matching shards onto four cores while leaving four cores idle. The kernel reported 32 CPUs. The hardware had 16 cores. The difference between those two numbers is what this chapter is about.

A core has its own ALU, FPU, branch predictor, L1d, and L1i. A hardware thread (SMT / Intel hyperthread) shares all of those with its sibling thread on the same physical core — it is a second instruction stream multiplexed onto one core's execution resources. SMT helps when one stream stalls on a memory load (the other can issue ALU ops); it hurts when both streams compete for the same cache or vector unit. nproc reports hardware threads, not cores, because that is what most workloads can use. For latency-critical workloads, that count lies — pin to physical cores, leave the siblings idle, and verify the topology with lscpu -e.

What is on a core, and what is shared between siblings

Open lscpu -e on any modern x86 box and you will see a table where two rows share a CORE column. Those two rows are SMT siblings — two hardware-thread contexts that appear to the kernel as separate CPUs but live on one physical core. The core has one set of execution units (the ALUs, the FPU, the AGUs, the load/store unit, the L1d, the L1i, the branch predictor, the L2). The two hardware threads each have their own architectural register file, their own program counter (RIP), their own flags, and their own privilege state — the bookkeeping that makes them look like two CPUs to the OS — but they fight for everything that does the actual work.

The trick is that on any given cycle, a core's execution backend has more issue ports than a single instruction stream can usually keep busy. A modern Zen 4 or Golden Cove core has on the order of ten issue ports spread across its ALUs, AGUs, FPU/SIMD pipes, and load/store units, but most workloads average 1.5–3.0 IPC — the remaining capacity sits idle during memory stalls and branch mispredicts. SMT slots a second instruction stream onto those idle ports. When stream A is waiting for a 200-cycle DRAM load, stream B can issue ALU ops on otherwise-idle ports. When stream A and stream B both want the FPU, one of them stalls. The net effect is workload-dependent: SMT speeds up memory-bound workloads (database scans, web servers) by 15–30%; it speeds up branch-heavy code (compilers, game logic) by 5–15%; and it slows down compute-bound, cache-resident workloads (matrix multiply, AES, video encoding) by roughly 5–15% because the second stream evicts the first from L1d.

[Figure: one physical core with two SMT siblings — duplicated vs shared. Top: HT0 (logical CPU 0) and HT1 (logical CPU 16), each with its own architectural register file, program counter (RIP), flags, privilege state, and MSRs: ≈256 bytes of state duplicated per hardware thread. Below, shared by both threads: the front-end (fetch/decode/µop cache, branch predictor), the execution backend (8–10 issue ports, ROB, retirement), L1i and L1d (32 KB each, ~4-cycle L1d hit), and L2 (1 MB, ~12-cycle hit). Both threads then share L3 across the CCD: within-core sharing happens at L1/L2, across-core sharing at L3.]
One physical core with two SMT siblings. The duplicated state at the top is the minimum needed for the core to look like two CPUs to the OS — register files and the program counter. Everything below the line is one set of hardware that the two streams share. The kernel's `nproc` counts the boxes at the top; the throughput of your workload depends on how much it contends for the boxes below.

Why SMT exists at all: the average single-stream IPC on most server workloads is 1.5–2.5 against a backend that can retire 4 µops/cycle. The remaining 1.5–2.5 µops/cycle of capacity is idle, mostly waiting on L2/L3 misses (200–600 cycles each). Adding the architectural state for a second stream costs ~256 bytes per core — negligible silicon — and lets the backend issue useful work during the first stream's stalls. The intuition of the hardware thread as a stall-hider is the right one. SMT is not a magic 2× — Intel's own marketing has settled at "up to 30%" since Nehalem (2008), and that number is mostly correct on the workloads SMT is designed for.

When 32 CPUs are really 16 — measuring SMT's interference

The kernel reports 32 logical CPUs on a 16-core / 32-thread EPYC because it treats each hardware thread as a schedulable entity. Most workloads benefit from this — the kernel scheduler picks an idle HT regardless of which core it sits on, and oversubscribing each core with two streams smooths out memory stalls. For latency-critical workloads, the abstraction inverts: the kernel will happily schedule two latency-sensitive tasks onto two siblings of the same core, and they will fight for the L1d, the FPU, and the branch predictor in ways that destroy tail latency. The fix is to enumerate the topology and pin deliberately.
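
Before pinning anything, enumerate what the kernel actually knows. Here is a minimal sketch, assuming only Linux sysfs; the script name and helper are ours, not a standard tool:

# core_map.py — one line per physical core: which logical CPUs share it
import glob

def parse_cpu_list(s):
    # sysfs prints either "0,16" or "0-1" depending on the enumeration
    cpus = set()
    for part in s.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

cores = set()
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        cores.add(tuple(parse_cpu_list(f.read())))   # every sibling reports the same list

for siblings in sorted(cores):
    print(f"physical core: logical CPUs {siblings}")
print(f"{len(cores)} physical cores, {sum(len(s) for s in cores)} logical CPUs")

On the 16-core / 32-thread EPYC from the opening anecdote, the output makes the sibling pairs impossible to miss, whichever enumeration the firmware chose.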

Here is a minimal C benchmark that demonstrates the interference. Two threads each run a tight pointer-chase through their own 64 KB random-permutation graph (twice the 32 KB L1d, comfortably inside L2). When pinned to two different physical cores, the two threads see independent throughput. When pinned to the two siblings of one physical core, per-thread ns/hop rises by 30–55% because the threads evict each other from L1d and contend for the core's load ports.

// smt_interference.c — measure SMT sibling interference vs cross-core
// Compile: gcc -O2 -pthread smt_interference.c -o smti
// Usage:   ./smti <cpuA> <cpuB>
//   e.g.   ./smti 0 16   (siblings on this box — verify with lscpu -e)
//          ./smti 0 1    (two different physical cores on this box)
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (64 * 1024)
#define N_NODES   (BUF_BYTES / sizeof(size_t))
#define HOPS      (200u * 1000u * 1000u)

// One private graph per thread, so the two chases share no data:
// any sibling slowdown is contention, not constructive sharing.
static size_t graphs[2][N_NODES];

static void build_graph(size_t *graph, unsigned seed) {
    unsigned x = seed | 1;
    size_t idx[N_NODES];
    for (size_t i = 0; i < N_NODES; i++) idx[i] = i;
    for (size_t i = N_NODES - 1; i > 0; i--) {       // Fisher–Yates shuffle
        x = x * 1664525u + 1013904223u;
        size_t j = x % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N_NODES - 1; i++) graph[idx[i]] = idx[i + 1];
    graph[idx[N_NODES - 1]] = idx[0];                // close the cycle
}

static void pin(int cpu) {
    cpu_set_t s; CPU_ZERO(&s); CPU_SET(cpu, &s);
    if (pthread_setaffinity_np(pthread_self(), sizeof s, &s) != 0) {
        fprintf(stderr, "failed to pin to cpu %d\n", cpu);
        exit(1);                                     // unpinned results are meaningless
    }
}

struct task { int cpu; const size_t *graph; };

static void *worker(void *arg) {
    const struct task *t = arg;
    pin(t->cpu);
    size_t i = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned k = 0; k < HOPS; k++) i = t->graph[i]; // serially dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    fprintf(stderr, "cpu=%2d ns/hop=%.2f sink=%zu\n", t->cpu, ns / HOPS, i);
    return NULL;
}

int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s <cpuA> <cpuB>\n", argv[0]); return 2; }
    build_graph(graphs[0], 1);
    build_graph(graphs[1], 2);
    struct task ta = { atoi(argv[1]), graphs[0] };
    struct task tb = { atoi(argv[2]), graphs[1] };
    pthread_t a, b;
    pthread_create(&a, NULL, worker, &ta);
    pthread_create(&b, NULL, worker, &tb);
    pthread_join(a, NULL); pthread_join(b, NULL);
    return 0;
}

Sample output on a Ryzen 7950X (16 cores, 32 threads, Linux 6.5):

$ ./smti 0 1     # different physical cores
cpu= 0 ns/hop=4.18 sink=12731
cpu= 1 ns/hop=4.21 sink=12731

$ ./smti 0 16    # SMT siblings on the same physical core
cpu= 0 ns/hop=6.47 sink=12731
cpu=16 ns/hop=6.49 sink=12731

The first run pins to two distinct physical cores. Each thread chases its own 64 KB graph out of its own L1d (32 KB) and its own L2 (1 MB on Zen 4), with its own load/store unit; the threads compete only at L3 and the memory controller, neither of which a 128 KB combined working set stresses. Throughput is independent. The second run pins to two SMT siblings of one physical core. They now share a single 32 KB L1d and a single L2 against that same 128 KB of combined working set: each thread's loads evict the other thread's lines, hops that would have been L1d hits become L2 hits, and the per-thread ns/hop rises ~55%.

Why this pattern matters operationally: a pointer-chase is the worst-case for SMT — it cannot be vectorised, every load depends on the previous load's data, and the working set is sized to expose the L1d-vs-L2 cliff. A real matching engine that reads a std::map<price, level> looks similar: every node-walk is a load whose result determines the next load. SMT siblings sharing one core's L1d do interfere on this workload pattern, and the way to find out which logical CPUs are siblings is lscpu -e, not guesswork. The Linux kernel exposes /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for the same purpose.

When SMT helps, when it hurts — a topology-aware decision

The benchmark above is the cliff. The plateau is workloads where SMT actively helps. Here is the rough decision matrix, calibrated from public Folly / Mongo / Postgres benchmarks and from internal microbench data the matching-engine team would have on a 64-core EPYC:

[Chart: effect of enabling SMT (both siblings vs just one) by workload class — illustrative, not measured here]

    pointer-chase matching engine        −35%
    matmul, L1-hot                       −12%
    AES (L1-resident, vector unit)        −8%
    compiler / make -j                   +15%
    HTTP server + DB call                +22%
    DB table scan (memory-bound)         +28%

Numbers are illustrative — measure your workload with `perf stat -e cycles,instructions,cache-misses` before deciding.
Illustrative — not measured data. The pattern is: SMT helps when the primary bottleneck is memory latency (the second stream uses ports during the first's stall) and hurts when both streams compete for the same hot L1d / vector unit. Latency-critical pinned workloads sit firmly in the "hurts" half.
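
The caption's advice is mechanisable. Here is a hedged sketch that wraps perf stat's CSV mode around an arbitrary command and reports IPC: run it once per pinning configuration and compare. The script name is ours, and it assumes perf is installed and counters are readable:

# ipc_probe.py — run a command under perf stat, report instructions per cycle
import subprocess, sys

def ipc_of(cmd):
    # -x, switches perf stat to CSV; the counters land on stderr
    r = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "cycles,instructions", "--"] + cmd,
        capture_output=True, text=True,
    )
    counts = {}
    for line in r.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2:
            name = fields[2].split(":")[0]      # strip :u / :k modifiers if present
            if name in ("cycles", "instructions"):
                try:
                    counts[name] = float(fields[0])
                except ValueError:
                    pass                        # "<not counted>" and friends
    return counts["instructions"] / counts["cycles"]

if __name__ == "__main__":
    print(f"IPC: {ipc_of(sys.argv[1:]):.2f}")   # e.g. python ipc_probe.py ./smti 0 16

Low IPC with high cache-miss counts is the memory-bound profile where SMT helps; IPC near the machine width with an L1-resident working set is the profile where a sibling only gets in the way.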

The operational rules that fall out of this:

  1. Latency-critical pinned services (matching engines, HFT order books, real-time bidders) — disable SMT for those CPUs, or pin one shard per physical core, using thread_siblings_list to pick one sibling from each pair and leaving the other idle (a sketch follows this list).
  2. Throughput-critical mixed workloads (web servers, batch processors, build farms) — leave SMT on. The kernel's load balancer spreads runnable tasks across physical cores first and only doubles up on siblings once every core is busy.
  3. NUMA-aware databases — leave SMT on for the readers, pin writers to physical cores. Postgres's max_parallel_workers_per_gather workers and Mongo's WiredTiger threads benefit from SMT on scans.
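
A minimal sketch of rule 1's pinning step, assuming Linux sysfs and Python's os.sched_setaffinity; the script name is ours:

# pin_one_per_core.py — restrict this process to one hardware thread per core
import glob, os

def parse_cpu_list(s):
    # same helper as core_map.py: handles both "0,16" and "0-1" formats
    cpus = set()
    for part in s.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

primaries, seen = set(), set()
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        siblings = tuple(parse_cpu_list(f.read()))
    if siblings not in seen:
        seen.add(siblings)
        primaries.add(siblings[0])   # lowest-numbered sibling stands in for the core

os.sched_setaffinity(0, primaries)   # 0 = the calling process
print(f"pinned to one hardware thread per core: {sorted(primaries)}")

The sibling CPUs are left out of the mask entirely, so nothing this process schedules will ever share a core's execution units with itself.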

The harness that drives the C kernel and compares the two pinning strategies is short. This is the Python-driver-over-C-kernel pattern from §6 of the style guide:

# smt_compare.py — drive the C kernel across pinning configurations
import json, re, subprocess

def run(cpu_a, cpu_b):
    out = subprocess.run(
        ["./smti", str(cpu_a), str(cpu_b)],
        capture_output=True, text=True, check=True,
    )
    ns = [float(m.group(1)) for m in re.finditer(r"ns/hop=([\d.]+)", out.stderr)]
    return sum(ns) / len(ns)

def parse_cpu_list(s):
    # sysfs prints either "0,16" or "0-1" depending on the enumeration
    cpus = set()
    for part in s.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

# Discover one physical core with two SMT siblings
with open("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list") as f:
    sibs = parse_cpu_list(f.read())            # e.g. [0, 16] on a 16C/32T box
other_core = 1 if 1 not in sibs else 2         # any cpu outside cpu0's core

cross = run(sibs[0], other_core)               # different cores
sib_pair = run(sibs[0], sibs[1])               # SMT siblings
print(json.dumps({
    "cross_core_ns_per_hop":  round(cross, 2),
    "smt_sibling_ns_per_hop": round(sib_pair, 2),
    "smt_overhead_pct":       round(100 * (sib_pair - cross) / cross, 1),
    "siblings_detected":      sibs,
}, indent=2))

$ python smt_compare.py
{
  "cross_core_ns_per_hop": 4.20,
  "smt_sibling_ns_per_hop": 6.48,
  "smt_overhead_pct": 54.3,
  "siblings_detected": [0, 16]
}

Why the harness reads thread_siblings_list from sysfs instead of guessing: the mapping from logical-CPU number to physical core is not portable across Linux kernel versions, vendors, or BIOS settings. Some machines enumerate siblings adjacently — (N, N + 1), so 0 and 1 share a core — while others split-half — (N, N + ncores), so 0 and 16 share a core on a 16-core / 32-thread box — and the same CPU family can show up either way depending on firmware. The only correct way to discover the topology is to ask the kernel: lscpu -e, /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, or hwloc-ls. Pinning by guess is how Aditi's matching engine got 28 µs p99 instead of 9 µs.

Production patterns — pinning, isolation, and SMT-off

Three patterns dominate latency-critical Linux deployments:

Pin and isolcpus. Boot the kernel with isolcpus=8-15,24-31 nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31. The kernel will not schedule any task onto those CPUs unless explicitly pinned via sched_setaffinity or taskset. The matching engine pins one shard per isolated CPU; the kernel's general workload runs on 0-7,16-23. The CPUs that run shards see essentially no context switches, take no RCU softirqs (rcu_nocbs offloads the callbacks), and stop taking the timer tick while a single task runs (nohz_full) — the shard owns the core for the duration of the trading day. A minimal launch sketch follows.
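
A launch sketch for this pattern, assuming the isolcpus=8-15 boot line above; the shard binary and its flag are hypothetical:

# launch_shards.py — one shard per isolated CPU, pinned before exec
import os

ISOLATED = list(range(8, 16))        # matches isolcpus=8-15; siblings 24-31 stay idle

for cpu in ISOLATED:
    pid = os.fork()
    if pid == 0:                     # child: pin first, then become the shard
        os.sched_setaffinity(0, {cpu})
        os.execv("./matching_shard", ["./matching_shard", f"--shard={cpu - 8}"])
print(f"launched {len(ISOLATED)} shards on isolated CPUs {ISOLATED}")

Because the CPUs are isolated, the affinity call is the only way the shards ever land there; the scheduler will not place anything else beside them.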

SMT-off via BIOS or runtime. The most aggressive answer: turn SMT off entirely. echo off > /sys/devices/system/cpu/smt/control takes the second hardware thread of every core offline (which logical-CPU numbers disappear depends on the enumeration); the BIOS toggle is equivalent. The 16-core EPYC then reports 16 CPUs and the matching engine cannot accidentally co-locate two shards on one core. The cost: throughput-oriented workloads on the same box lose 15–25%. KapitalKite's choice would be to dedicate two boxes — one with SMT off for matching, one with SMT on for the write-ahead log and analytics — rather than mix.

Sibling-leave-idle. Compromise: leave SMT enabled, pin shards to one sibling from each core (which logical CPUs those are depends on the enumeration — read thread_siblings_list), and explicitly mark the remaining siblings unschedulable for everything except a tiny housekeeping process. The shards still own their L1d and L2 because the sibling is asleep. This avoids the BIOS reboot but requires a topology-aware orchestrator that understands thread_siblings_list. Why this works: an idle hardware thread genuinely does not consume execution-port cycles. The SMT cost is competition for backend ports and L1d capacity, neither of which a sleeping sibling exerts. The only residual cost is that when the housekeeping process runs on the sibling, the shard takes a small hit. Limit the housekeeping process's CPU budget with cgroups (cpu.max = 5000 100000 for 5%) and the residual is in the noise, as in the sketch below.
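
A sketch of that cgroup budget, assuming cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled, run as root; the group name and PID are hypothetical:

# housekeeping_budget.py — cap the housekeeping process at 5% of one CPU
import os

CG = "/sys/fs/cgroup/housekeeping"   # hypothetical group name
os.makedirs(CG, exist_ok=True)

# cpu.max format is "<quota_us> <period_us>": 5 ms of runtime per 100 ms = 5%
with open(os.path.join(CG, "cpu.max"), "w") as f:
    f.write("5000 100000")

housekeeping_pid = 4242              # hypothetical; take it from your supervisor
with open(os.path.join(CG, "cgroup.procs"), "w") as f:
    f.write(str(housekeeping_pid))

Combine this with an affinity mask that confines the housekeeping process to the sibling CPUs, and the shard's worst-case interference window shrinks to 5 ms per 100 ms period.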

Common confusions

  • "Hyperthreading doubles throughput" It does not. Intel's own published numbers since Nehalem (2008) describe SMT as "up to 30%" on benchmarks selected to favour it. Real-world server gains are 15–25% on memory-bound workloads, 0–10% on compute-bound, and negative on cache-resident hot-loop workloads. Treat SMT as a stall-hider, not a parallelism doubler.
  • "nproc returns the number of cores" It returns the number of online schedulable CPUs, which on an SMT-enabled box is twice the number of physical cores. lscpu -p=core | sort -u | wc -l (after stripping comments) returns the actual core count. getconf _NPROCESSORS_ONLN is the same as nproc. Always prefer lscpu -e for topology questions.
  • "SMT siblings are independent CPUs" They share L1d, L1i, L2, the front-end, the µop scheduler, the execution backend, and the branch predictor. The only architectural state that is not shared is the register file and program counter. Two threads on siblings interfere with each other in every way that matters for cache-sensitive workloads. The kernel hides this from top, but perf stat -e cache-misses,branch-misses --per-core reveals it instantly.
  • "Disabling SMT halves your CPU" It halves the number of logical CPUs the kernel sees, but the physical execution capacity drops by 0–15% on most workloads (because the second stream was contributing little) and rises by 5–15% on cache-resident workloads (because the sibling's interference is gone). For a latency-critical workload, SMT-off can be net throughput-positive.
  • "SMT and NUMA are the same kind of thing" They are different layers of pretend-ness. SMT is two streams sharing one core's execution units. NUMA is many cores sharing a memory controller, but accessing a different socket's memory pays a cross-socket hop. SMT contention shows up as mem_load_l1d_missmem_load_l2_hit; NUMA contention shows up as mem_load_remote_dram > 0 in perf stat. Confusing them leads to wrong fixes — you cannot solve NUMA latency by disabling SMT.
  • "Thread-safe code stays thread-safe under SMT" SMT does not change the language-level memory model — your code is still race-free if it was race-free before. SMT does change performance behaviour: an atomic CAS that ping-pongs between cross-core threads at 80 ns can drop to 12 ns between SMT siblings (they share the L1d, so the line never leaves the core), which sometimes hides bugs in benchmarks. State the property you actually depend on — race-free, linearizable, fair — and verify it with TSAN / loom, not with timing.

Going deeper

x86 SMT vs SPARC CMT vs IBM POWER SMT4/SMT8

x86 has implemented 2-way SMT since Pentium 4 Northwood (2002, "Hyper-Threading"). Sun's UltraSPARC T1 (Niagara, 2005) shipped 4-way fine-grained CMT — the core issues from a different ready thread nearly every cycle, hiding memory-latency stalls at a steep single-thread cost. T2 (2007) doubled to 8-way. IBM POWER7 (2010) shipped runtime-configurable 1/2/4-way SMT, and POWER8 (2014) added an 8-way mode for OLTP workloads where memory-bound stalls dominate. The x86 2-way design is a compromise: enough to hide L2/L3 stalls without gutting single-thread performance. POWER's 8-way mode is the right answer for a banking workload where a single transaction blocks on dozens of DRAM accesses; x86's 2-way is the right answer for a mixed cloud workload where some tenants are latency-sensitive. Why x86 didn't follow POWER to SMT8: x86's design point is mixed-workload servers where the kernel cannot predict per-tenant SMT preference. With 2-way SMT, the misalignment cost is small; with 8-way, a single latency-sensitive thread sharing a core with seven other streams can see an order-of-magnitude slowdown on its hot path. The choice is a workload-fit decision, not a "more is better" one.

CPU vulnerabilities — L1TF, MDS, and why some clouds disable SMT

L1 Terminal Fault (2018) and Microarchitectural Data Sampling (2019) are speculation side-channel vulnerabilities in which a malicious thread on one SMT sibling could read data the core had loaded into L1d on behalf of the other sibling. The mitigations require either flushing the L1d on security-domain transitions (for example on VM entry — expensive) or core scheduling — guaranteeing both siblings of a core run in the same security domain at all times. Some clouds (initially Google's GCE, with similar moves at others) chose to disable SMT on their hypervisor hosts entirely as a hard mitigation. For tenants paying for a vCPU, this means they get a real core, not a sibling — but it also means the host's total throughput drops 15–25%. The economic decision (charge less per vCPU but advertise no co-tenancy on SMT siblings) is one of the largest hardware-utilisation shifts of the post-Spectre era.

MealRush's Kubernetes node-pool with topology-aware scheduling

MealRush's order-dispatch service runs on a 32-core / 64-thread Kubernetes node pool. The dispatch pods are CPU-pinned via cpuManagerPolicy: static and topologyManagerPolicy: single-numa-node. Every dispatch pod requests cpu: 4, and the kubelet's CPU manager — with the full-pcpus-only policy option, so allocations are whole physical cores rather than arbitrary logical CPUs — reads /sys/devices/system/cpu/cpu*/topology/thread_siblings_list to place them. The siblings of those four cores go onto a separate "best-effort" CPU set that runs the metrics agent, the log shipper, and the sidecar — workloads where SMT contention does not hurt p99. The result: dispatch p99 latency stayed at 1.4 ms even when the node was at 95% CPU utilisation, because the dispatch pods' execution units were never shared with anything that ran a hot loop.

The "logical-vs-physical" gap on cloud vCPUs is its own can of worms

A "vCPU" on AWS EC2 (with the exception of *.metal instances), GCP Compute Engine, and Azure D-series is one hardware thread, not one core. A c7i.4xlarge advertises 16 vCPUs; underneath, that is 8 physical cores with SMT on. If you pin a real-time service to "vCPU 0 and vCPU 1", you may have just put it on the two siblings of one physical core. The fix on AWS is to use *.metal instances (no virtualisation, you see the real topology) or to use the AWS Nitro Enclaves topology hints which expose the logical-to-physical mapping. On GCP, --threads-per-core=1 at instance creation disables SMT for that VM. The lesson: cloud "vCPU" is a billing unit, not a topology unit. Read lscpu -e inside the VM and trust that, not the cloud console.

Where this leads next

The next chapter — the cache hierarchy for the concurrency mind — opens the L1d / L2 / L3 / DRAM ladder and explains why concurrent programs don't contend on variables, they contend on 64-byte cache lines. Once you have the cache hierarchy in your head, the false-sharing chapter is the obvious next step — two unrelated counters on the same cache line spend their lives bouncing between cores at 80 ns per flip. After that, the MESI protocol and store buffers complete Part 2's machine-model picture.

Past Part 2, this chapter's topology lessons resurface in work-stealing schedulers: rayon, ForkJoinPool, and tokio all leave their workers unpinned by default — a worker that lands on logical CPU 0 can have its sibling running an unrelated batch job, which destroys p99. The pattern of "pin to physical cores, leave siblings for housekeeping" recurs in every latency-critical chapter that follows.

References

# Reproduce on your laptop (Linux x86_64, root not required)
gcc -O2 -pthread smt_interference.c -o smti
# Discover sibling pairs first:
lscpu -e | head -20
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Compare cross-core vs sibling pinning:
./smti 0 1     # different physical cores
./smti 0 16    # SMT siblings (substitute the second number from thread_siblings_list)
# Drive both via the Python harness for a clean JSON report:
python smt_compare.py