Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Interconnects: QPI, UPI, Infinity Fabric

Aditi at SetuStream was profiling the IPL final's catalogue API on a 2-socket Sapphire Rapids box; single-socket benchmarks showed a clean p99 of 240 ms, but production p99 climbed to 1.4 s as load crossed 60 % with no flamegraph change and no CPU saturation. Two perf counters told the story: remote DRAM accesses were 38 % of all L3 misses, and the UPI link was saturated at 22 GB/s on a wire rated for 24. The CPUs were not slow — the wire between the sockets was full, and 14 hours of incident response went into discovering a counter the team had never read.

QPI, UPI, and Infinity Fabric are the wires that connect sockets, chiplets, and memory controllers when one die runs out of room. They are not transparent: every cross-socket access pays latency and consumes a finite per-link bandwidth budget that the cache-coherence protocol shares with your data. When the wire saturates — typically around 70 % of nameplate bandwidth — every cross-socket access starts queueing, p99 climbs, and your CPU dashboards show nothing because the CPUs are waiting on the link, not on themselves.

Why a wire exists between sockets at all

A modern server CPU does not fit on one die. The Intel Xeon Platinum 8592+ has 64 P-cores, 320 MB of L3, 8 DDR5 channels, and 80 PCIe-5 lanes — that integration limit hits a die-area wall around 600 mm² and a power wall around 350 W. Once you commit to two sockets, those two dies need to talk: a thread on socket 0 that misses its L3 must be able to fetch a cache line from socket 1's DRAM, and a write on socket 0 must be able to invalidate the copy of that line cached on socket 1. The wire that carries those messages is the interconnect.

Three families dominate production hardware in 2026:

Interconnect	Vendor	Generation	Per-link bandwidth (each direction)	Latency hop
QPI (QuickPath Interconnect)	Intel	Nehalem → Broadwell (2008–2015)	9.6–25.6 GB/s	~80–100 ns
UPI (Ultra Path Interconnect)	Intel	Skylake-SP → Sapphire Rapids → Emerald Rapids → Granite Rapids (2017–2026)	20.8 GB/s (UPI 1.0), 24 GB/s (UPI 2.0), 32 GB/s (UPI 3.0)	~70–110 ns
Infinity Fabric (xGMI inter-socket)	AMD	EPYC Naples → Rome → Milan → Genoa → Bergamo → Turin (2017–2026)	32 GB/s (Naples) → 36 GB/s (Genoa, xGMI3) → 48 GB/s (Turin, xGMI4)	~110–180 ns inter-socket; ~40–80 ns intra-socket between chiplets

The numbers move every generation, but the shape of the budget is stable: you get tens of GB/s per link, hundreds of nanoseconds of one-way latency, and the link's total capacity is shared between your application's data, the kernel's DMA, and the cache-coherence directory traffic. Why this budget feels small: a single DDR5-4800 channel delivers ~38 GB/s, and a Sapphire Rapids socket has 8 channels for ~300 GB/s of local DRAM bandwidth. The UPI 2.0 link to the other socket delivers 24 GB/s. The cross-socket pipe is 12× narrower than the local memory pipe. Any workload that treats remote memory as if it were local will hit the wire's ceiling long before it hits DRAM's ceiling, and the symptom — high cross-socket access fraction with a saturated UPI counter — looks nothing like a CPU-bound or memory-bound profile.

Intel routes every cross-socket request over a small number of fat UPI links from a monolithic die. AMD's chiplet design adds an intra-socket Infinity Fabric layer (CCD↔IOD) before any traffic ever reaches the inter-socket xGMI links. Illustrative — based on Intel and AMD optimisation manuals.

The two designs trade off differently. Intel's monolithic die keeps all cross-socket traffic on a single hierarchy: a core misses L3, the request goes to a single ring/mesh, then onto a UPI link. AMD's chiplet design has two hops even within one socket — a CCD that wants memory on its own socket goes through the IO die first; a CCD that wants memory on the other socket goes CCD → IOD → xGMI → IOD → memory controller — and the cumulative latency is higher (~140 ns vs ~110 ns for Intel) but the bandwidth scales because xGMI doesn't have to share with as many cores at once. Why production tuning differs by vendor: an EPYC service that pins to one CCD's nearest memory controller can avoid the CCD→IOD hop entirely and run with ~80 ns L3-miss-to-DRAM latency; the same trick on a Xeon costs you the wider mesh's spare bandwidth. The "right" pinning strategy is platform-specific, and the previous chapter's numactl --cpunodebind --membind is necessary but not sufficient — the chapter on NUMA-aware allocators covers the second half.

Coherence traffic eats the same bandwidth budget

The interconnect's per-link bandwidth number on the spec sheet is the total bandwidth — it carries everything: bulk data fetches for L3 misses, cache-coherence snoop messages, directory updates, and atomic-operation acknowledgements. A 24 GB/s UPI 2.0 link does not deliver 24 GB/s of your data plus another lane for coherence; the coherence is on the same lane, and on cache-line-bounce-heavy workloads it can be the majority of the traffic.

A cache line is 64 bytes. A coherence message — say, a RFO (Read-For-Ownership) when one core wants exclusive access to a line another core has — is roughly 16 bytes of header plus the 64-byte payload, or no payload if it's just an invalidate. If two cores on different sockets bounce a hot line back and forth a million times per second, that's 1M × ~80 bytes = 80 MB/s of coherence traffic — small. But if you have 1,000 such hot lines (a typical false-sharing scenario, see false sharing), it's 80 GB/s — well over a single UPI link's capacity, and the entire workload's cross-socket bandwidth budget is consumed by coherence chatter that carries no useful payload.

# measure_upi_saturation.py
# Drive a synthetic cross-socket workload, then read the UPI bandwidth
# perf counters to see how much of the link the workload is consuming.
# Works on Intel Skylake-SP and later. AMD users substitute the
# uncore_df_*/upi_* counters as documented in /sys/devices/uncore_df_*.
#
# Run as:
#   sudo python3 measure_upi_saturation.py
# Requires: numactl (apt install numactl), perf (linux-tools-common),
# and a 2-socket box. Single-socket boxes will report 0 GB/s — the
# workload runs but no cross-socket traffic is generated.

import os, re, subprocess, sys, time

def numa_node_count():
    if not os.path.isdir("/sys/devices/system/node"):
        return 1
    return sum(1 for d in os.listdir("/sys/devices/system/node")
               if d.startswith("node") and d[4:].isdigit())

def upi_event_names():
    """Find the right perf event names for this CPU's UPI/QPI counters."""
    out = subprocess.run(["perf", "list"], capture_output=True, text=True).stdout
    rx_evt = re.search(r"(uncore_upi.*?/(?:UNC_UPI_RxL_FLITS_G[A-Z._]+|RxL_FLITS\.ALL_DATA)/[^\s]*)", out)
    tx_evt = re.search(r"(uncore_upi.*?/(?:UNC_UPI_TxL_FLITS_G[A-Z._]+|TxL_FLITS\.ALL_DATA)/[^\s]*)", out)
    return (rx_evt.group(1) if rx_evt else "uncore_upi/event=0x3,umask=0xf/",
            tx_evt.group(1) if tx_evt else "uncore_upi/event=0x2,umask=0xf/")

# Cross-socket workload: thread on node 0 reads memory bound to node 1.
WORKLOAD = """
import numpy as np, ctypes, os, time
SIZE = 4 * 1024 * 1024 * 1024  # 4 GiB
buf = np.zeros(SIZE // 8, dtype=np.int64)
buf[::8] = 1                     # touch every cache line
t0 = time.perf_counter()
total = 0
for _ in range(10):
    total += int(buf.sum())      # streaming read across the buffer
elapsed = time.perf_counter() - t0
print(f"workload elapsed: {elapsed:.2f}s, sum={total}")
"""

if numa_node_count() < 2:
    print("Single-node box detected; UPI counters will read 0.")
    sys.exit(0)

rx_evt, tx_evt = upi_event_names()
print(f"Using rx event: {rx_evt}\n            tx: {tx_evt}")

# Run the workload pinned to node 0, with memory bound to node 1.
cmd = ["perf", "stat", "-e", f"{rx_evt},{tx_evt}", "--",
       "numactl", "--cpunodebind=0", "--membind=1",
       "python3", "-c", WORKLOAD]
result = subprocess.run(cmd, capture_output=True, text=True)
print("\n--- perf stat output ---")
print(result.stderr)

# UPI flits are 8 bytes each on UPI 2.0 (9 bytes wire, 8 bytes payload).
# Look for the count line and convert.
for line in result.stderr.splitlines():
    m = re.search(r"([\d,]+)\s+\S*upi.*FLITS", line, re.IGNORECASE)
    if m:
        flits = int(m.group(1).replace(",", ""))
        gb = flits * 8 / 1e9
        print(f"  -> {gb:.2f} GB transferred on {line.strip()[:60]}")

A sample run on a 2-socket Sapphire Rapids 8480+ (SetuStream's catalogue tier hardware):

$ sudo python3 measure_upi_saturation.py
Using rx event: uncore_upi/event=0x3,umask=0xf/
            tx: uncore_upi/event=0x2,umask=0xf/

--- perf stat output ---
workload elapsed: 4.83s, sum=53687091200

 Performance counter stats:

    14,389,221,887      uncore_upi/event=0x3,umask=0xf/      # rx flits
    14,401,003,442      uncore_upi/event=0x2,umask=0xf/      # tx flits

       4.831287241 seconds time elapsed

  -> 115.11 GB transferred on  14,389,221,887  uncore_upi/event=0x3,...
  -> 115.21 GB transferred on  14,401,003,442  uncore_upi/event=0x2,...

The walkthrough on the parts that matter most:

numactl --cpunodebind=0 --membind=1 is the worst-case incantation: every memory access from every thread is necessarily remote. We use it deliberately to make the UPI traffic visible — a real workload should never look like this.
uncore_upi/event=0x3,umask=0xf/ (RxL_FLITS_G2.ALL_DATA) counts incoming data flits on a UPI link. A flit is the link-layer unit (8 bytes of payload on UPI 2.0). Multiply by 8 to get bytes; divide by elapsed to get GB/s. Why we count flits not bytes: UPI is a packetised link with headers; the per-flit count is the kernel's most direct view. The _G2 (Granite/Generation 2) suffix differs by CPU family — perf list | grep -i upi on the target machine tells you which event name your kernel exposes. AMD machines use uncore_df events with similar semantics but different names (look for xgmi or df_remote_inout).
115 GB / 4.83 s = 23.8 GB/s of bidirectional traffic on each direction of one link. The Sapphire Rapids 8480+ has 4 UPI links per socket at 24 GB/s each; this single-thread workload saturates one direction of one link almost completely. A multi-threaded workload in production saturates them all.
The workload is not CPU-bound. The numpy.sum runs at memory speed; the bottleneck is the link capacity. perf stat -e cycles,instructions on the same workload would show IPC of ~0.3 — most cycles spent waiting on memory — but the cycles are not the cause, they're the symptom of the wire being full.

The headline lesson: when a workload reads bw / link_capacity close to 1.0 in either direction, every additional cross-socket access queues up. The link is now your latency-defining resource. Adding more cores does nothing — they will all wait on the same wire.

A second lesson hides in the symmetry of the rx and tx counts. A cross-socket read generates traffic in both directions: the request goes out (tx, ~16 bytes), the data comes back (rx, ~80 bytes). A cross-socket write is reversed: the data goes out (tx, ~80 bytes), the acknowledgement comes back (rx, ~16 bytes). When you see asymmetric saturation — say, tx at 95 % and rx at 30 % — your workload is dominated by writes leaving this socket; conversely, rx-heavy saturation means reads pulling data into this socket. The asymmetry tells you which socket is the producer and which is the consumer of the cross-socket data. PaisaBridge's payment matcher initially showed a 3:1 tx:rx imbalance on socket 0; the matchmaking results were being written to a socket-1-resident audit log. Moving the audit log to a per-socket sharded layout halved both directions of UPI traffic and dropped p99 by 1.4 ms.

Saturation symptoms and the queueing knee

Interconnects, like every other shared resource, follow the queueing curve covered in latency vs throughput: latency stays roughly flat from 0 % utilisation to about 70 %, then climbs steeply, and goes asymptotic as you approach 100 %. The knee for a UPI or xGMI link is consistently at 70 % of nameplate bandwidth in production traces.

That number is why SetuStream's incident at the top of this chapter showed 22 GB/s on a 24 GB/s link as a problem. 22 / 24 = 92 % utilisation; the link was deep in the queueing region, every cross-socket request was queueing for ~300 ns instead of the unloaded ~100 ns, and the cumulative effect across a request that touched 40 cache lines (typical for a JSON catalogue lookup) was 40 × 200 ns = 8 µs of extra latency per request — precisely the gap between the 240 ms p99 in the lab and the 1.4 s p99 in production.

UPI link latency vs utilisation. Below the 70 % knee, cross-socket latency is dominated by the wire's transit time; above it, the queue at the link's egress port dominates. Illustrative — based on M/M/1 queueing applied to UPI with measured saturation behaviour.

The 70 % number is not arbitrary. M/M/1 queueing theory predicts that average wait time is ρ / (1 - ρ) service times, where ρ is utilisation; at ρ = 0.7 the wait is 2.3 service times, at ρ = 0.85 it's 5.7, at ρ = 0.95 it's 19. Real interconnects are M/D/1 (deterministic service) at the link layer but M/M/1 in aggregate because the issuers — the cores — are stochastic, and the curve shape matches the predicted shape of M/M/1 with ρ measured at the link's egress port. Treat 70 % as the budget; use the headroom for traffic spikes; alert at 65 % to give yourself a 5-percentage-point buffer.

The signature of UPI saturation in perf stat is unmistakable once you know to look for it:

offcore_response.demand_data_rd.l3_miss.remote_dram rises as a fraction of LLC-load-misses. If more than 30 % of your L3 misses are going remote, you have a placement problem.
uncore_upi/event=0x2,umask=0xf/ (TxL_FLITS) / link capacity approaches 0.7 or higher. This is the bandwidth side.
offcore_response.demand_data_rd.l3_miss.remote_dram average latency (computed via cycle_activity.stalls_l3_miss / offcore_response.demand_data_rd.l3_miss) climbs from ~250 cycles unloaded to ~600 cycles saturated.
CPU utilisation looks fine. This is the cruel part. The cores are spending more time in STALL_LDM (stalled on load miss to memory) but the OS reports them as busy because they're not idle. Dashboards driven by top or by node_cpu_seconds_total{mode="user"} show no problem.

The third class of symptom — high latency with normal CPU and normal IPC and no obvious flamegraph hot spot — is what Aditi finally noticed at hour 14 of the IPL incident. The fix at her layer was operational: a numactl change and a code refactor to keep user-session data on one socket. The fix at the architecture layer would have been to replicate the catalogue cache on both sockets and accept the memory cost — SetuStream shipped that in the next quarter.

What you can do about it

Three classes of intervention reduce cross-socket traffic, each with a different cost. The order below reflects the order an SRE should reach for them: the cheapest fix first, the architecturally invasive fix last.

Pin smarter. Most cross-socket traffic exists because some thread reaches for memory it didn't know was on the other side. The previous chapter's numactl --cpunodebind --membind is the first move; the next is sharding. If your service has user sessions, route a user's requests consistently to one socket using a consistent-hash on user_id, so user A's session data lives on socket 0 and user B's on socket 1, and each request's working set is local. PaisaBridge's matcher does exactly this — every UPI VPA hashes to a fixed socket, and the matching state for that VPA lives on that socket's DRAM. Cross-socket traffic dropped from 41 % of L3 misses to 4 % after the rollout; p99 fell by 3.2 ms.

Replicate read-mostly state. A catalogue cache, a feature index, a routing table — anything that's read 1000× more than written — can be replicated on every socket. The cost is memory (2× on a 2-socket box, 4× on a 4-socket); the benefit is that every read is local. SetuStream's catalogue tier moved from a shared 80 GB cache (one copy, half remote on average) to per-socket 80 GB caches (two copies, all local) and the UPI saturation issue evaporated. Why this works for read-mostly: writes still cross sockets to update the other replica, but writes are 0.1 % of accesses; the 99.9 % of reads now run at local-DRAM latency. The bandwidth budget the writes consume is two orders of magnitude smaller than what the reads were consuming when they were remote, so the link saturation goes from 92 % to 4 %.

Reduce coherence chatter. False sharing is the highest-leverage fix. A single hot counter that two threads on different sockets both xadd to generates one cross-socket coherence transaction per increment. Padding the counter to its own cache line, sharding it per-CPU, or using a per-socket aggregator that aggregates locally and merges occasionally — these are micro-architectural fixes that show up in perf c2c (cache-to-cache) reports as the "shared lines" disappearing. ParakhTrade's order-matching engine had a single global orders_processed counter that was a top entry in perf c2c until they sharded it to per-CPU; the change cost 12 lines of code and dropped UPI traffic by 18 % during the 09:15 IST market open.

The fourth move — buy a single-socket machine — is rarely available but always worth thinking about. AMD's Bergamo and Turin SKUs ship with 128–192 cores on one socket, which means many workloads that used to need a 2-socket box now fit on one. The interconnect simply doesn't exist if you don't have two sockets, and zero is a much better number than 70 %. BharatBazaar's catalogue search tier moved to single-socket Bergamo nodes during the 2025 fleet refresh and the per-node throughput went up 22 % despite the per-node core count staying the same — the entire UPI-saturation tax disappeared because there was no UPI.

A subtler intervention worth naming: placement of the kernel's per-CPU data structures. Linux allocates per-CPU areas (ring buffers, slab caches, network queues) on whichever NUMA node the kernel decided to home that CPU on. For a CPU on socket 1, those structures should live on socket 1's DRAM. They usually do, but a few configurations get this wrong: VMs whose vCPUs are pinned to specific physical cores by the hypervisor but whose memory is allocated by the host kernel before the pinning takes effect; containers whose CPU topology is hidden by cpuset cgroups in a way that confuses the kernel's placement decisions. The symptom is numa_foreign rising on socket 1 even when application memory is correctly bound. The fix is at the orchestrator layer (Kubernetes Topology Manager, qemu's -numa flags) rather than in the application — but you have to know to look there, which most application developers do not.

Edge cases that bite in production

Three failure modes show up rarely enough that most teams haven't seen them, but with enough impact that one occurrence can ruin a launch.

The link-degraded silent failure. A UPI or xGMI link can negotiate down to a slower speed at boot if a connector is dirty, a board has a marginal trace, or thermal events triggered link retraining. The system boots, the OS reports its full core count, applications run — but one of three UPI links is operating at 12 GB/s instead of 24 GB/s. The kernel exposes the negotiated speed via /sys/devices/system/node/nodeN/numastat indirectly and via vendor-specific MSRs directly; dmesg | grep -i "upi" on Intel and dmesg | grep -i "xgmi" on AMD often shows the retrain event. DigiPaisa caught a degraded-link node in their fleet in 2024 by alerting on per-node numa_miss rate diverging from the fleet median; the node had been running at 50 % aggregate UPI bandwidth for three weeks and nobody noticed because the workload happened to fit at lower throughput.

The interrupt storm masquerading as UPI saturation. Network interfaces on multi-socket servers raise interrupts that, depending on irqbalance configuration, can land on a CPU on the wrong socket from where the NIC's PCIe root complex lives. Every interrupt then triggers a cross-socket DMA descriptor read, a coherence operation on the descriptor ring, and the actual packet data motion. At 10 Gbps line rate with 1500-byte frames, that's ~830K packets per second, each generating a few cross-socket transactions; the symptom looks identical to application UPI saturation in perf stat. The fix is set_irq_affinity.sh from the NIC driver pinning interrupts to the local socket; the diagnosis is cat /proc/interrupts showing rising counts on cores on the wrong socket. Why this is harder to spot than application saturation: the cross-socket traffic doesn't show up in your application's flamegraph at all — it happens entirely in kernel context during interrupt handling, before the kernel has even decided which application thread to wake. Tools that show only userspace activity miss it; perf top -k or bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' is what reveals it.

Memory hotplug and link-asymmetric topologies. Cloud instances increasingly expose only a subset of the underlying hardware's NUMA nodes — a 2-socket bare-metal box presented as a 1-socket VM, or a partial socket with one DDR channel removed for a CXL slot. The instance reports a coherent topology to the OS, but the underlying interconnect carries traffic the OS can't see. Performance tuning that relies on numactl --hardware output is fooled because that output is the VM's view, not the hardware's view. The mitigation is empirical: run a known-cross-socket workload and measure its bandwidth with the methods earlier in this chapter; if the measured ceiling is suspiciously low, the instance is sharing an interconnect link with another VM and your tuning is fighting an invisible neighbour. Cloud providers don't document this, and it's the most common reason a benchmark on a cloud instance fails to reproduce on bare metal.

Common confusions

"UPI bandwidth is just the link's spec sheet number." No. The spec sheet is the physical bandwidth. The usable bandwidth — what your data actually gets — is lower because the same link carries cache-coherence traffic, MSI-X interrupts, and on some platforms peripheral DMA. Plan for 70 % of the spec sheet as the saturation threshold and 50 % as the "you can run hot here" threshold.
"Infinity Fabric is just AMD's UPI." They serve the same role but the topology is different. Infinity Fabric runs inside a socket between chiplets (CCD ↔ IO die) as well as between sockets (xGMI). On a Genoa box, every memory access from a CCD already goes through Infinity Fabric to reach DRAM, even when "local". UPI on Intel only matters when crossing sockets. The implication: AMD chiplet-to-DRAM latency is higher than Intel monolithic die-to-DRAM, but AMD's bandwidth scales better because the IO die has more memory channels per socket.
"perf stat cycles tells me about UPI saturation." It does not — cycles only counts core cycles, and a stall on a remote DRAM access consumes core cycles whether the link is at 10 % or 90 %. The signal is offcore_response.*.remote_dram plus the uncore_upi/...FLITS counters. Without the uncore counters, you cannot distinguish "remote access at low utilisation" (110 ns) from "remote access at high utilisation" (600 ns), and the latter is the production incident.
"NUMA balancing (numa_balancing=1) prevents UPI saturation." It helps for long-lived workloads but reacts on a 100 ms+ horizon. A request-response service with a 4 ms p99 SLO experiences UPI saturation in tens of microseconds during a load spike; auto-balancing has not migrated a single page yet by the time the SLO is breached. Manual binding plus replication is the production answer; auto-balancing is the cleanup pass behind it.
"Adding more cores helps when UPI is saturated." It hurts. More cores means more concurrent cross-socket requests piling up at the link's egress queue. The link's bandwidth is fixed; the queue depth grows linearly with the number of issuers. Every additional core past saturation adds latency to all of them. The right move when UPI is saturated is to remove sources of cross-socket traffic, not add more cores.
"Modern interconnects fix this — UPI 3.0 and xGMI4 are fast enough." They raise the ceiling, not eliminate it. UPI 3.0 at 32 GB/s is 33 % more bandwidth than UPI 2.0; the workload's cross-socket needs grew faster than that on most production fleets between 2022 and 2026 (more cores per socket, larger working sets, more connections per server). The relative bottleneck — local memory bandwidth vs cross-socket bandwidth — has stayed roughly 12× across generations. The wire is structurally narrower than the local DRAM.

Going deeper

The directory protocol behind the scenes

Modern Intel and AMD CPUs use a directory-based coherence protocol for cross-socket coherence (the within-socket protocol is snoopy MESI/MOESI; see cache coherence MESI/MOESI). Each socket maintains a directory — a small SRAM table — that tracks which cache lines from this socket's DRAM are cached on other sockets. When a remote core wants line L, it asks its local directory; if L is exclusive on socket 0, the directory routes the request to socket 0, which downgrades L from M to S, sends the data, and updates its directory entry. The full transaction is 3–5 messages on the interconnect for the worst case (RFO with intervention from a third socket). This is why the per-line bandwidth cost of coherence traffic is much higher than the data payload alone — the protocol overhead is fixed-size headers that compound. Intel's Sapphire Rapids documents the directory's capacity (512K entries per CHA) and the eviction behaviour when the directory overflows (the line is "snooped broadcast" instead, which is much more expensive). Workloads with many sockets and many cached lines — e.g. an in-memory database with a 1 TB working set on a 4-socket box — can blow the directory and trigger broadcast snoop storms that look like UPI saturation but are caused by a different mechanism. The fix is the same (replicate, partition, reduce sharing); the diagnosis is perf stat -e LLC-load-misses plus the directory-overflow uncore event (uncore_cha.directory_lookup_state_NoSnoop and friends).

Sub-NUMA Clustering (SNC) and NPS

Intel exposes Sub-NUMA Clustering as a BIOS option that splits a single physical socket into 2 (SNC2) or 4 (SNC4) logical NUMA nodes — each pinned to a subset of the socket's L3 slices and memory channels. AMD's equivalent is NPS (Nodes Per Socket) with NPS=1, NPS=2, NPS=4 modes. The motivation is the same: the within-socket mesh is large enough that a thread on one corner of the die has measurably worse latency to memory channels on the opposite corner than to channels nearby. SNC/NPS exposes that asymmetry to the OS so the scheduler and allocator can place threads near "their" memory. The catch: SNC4 turns a 2-socket box into an 8-node box from the kernel's perspective, and numactl --hardware returns 8 nodes; software that wasn't tested in this configuration may pin to "node 0" thinking it's a whole socket and end up on a quarter of one. SetuStream's catalogue tier actually runs with SNC disabled (NPS=1 equivalent) because the catalogue cache benefits from full-die L3 visibility; their tail-tier (per-user feature index) runs with SNC4 because each user's index fits in a quarter-socket. The choice is per-workload; there is no global right answer.

CXL: the next interconnect

CXL (Compute Express Link) is the emerging interconnect that runs on PCIe-6 physical links and adds cache-coherent memory sharing across nodes. CXL 1.1 and 2.0 (shipping in 2024–2026 servers) let you attach a memory expansion module that the CPU treats as another NUMA node — coherent, but with ~150–250 ns latency vs ~100 ns for local DRAM. CXL 3.0 (2026+) adds peer-to-peer cache coherence across multiple hosts, which conceptually is "UPI between machines". The performance tuning lessons from this chapter transfer almost wholesale: a CXL-attached memory tier behaves like a slower NUMA node, and the same numactl/replication/partitioning strategies apply. Production deployments at scale are still rare in 2026; expect them to become normal in 2027–2028 as the latency penalty narrows. SetuStream, PaisaBridge, BharatBazaar's storage teams are all evaluating CXL today; production payment-path workloads will be the last to adopt it because the latency is still too high for sub-1 ms SLOs.

The interesting hybrid is CXL.mem with tiered memory — a server with 1 TB of local DDR5 plus 4 TB of CXL-attached memory presented to the OS as nodes 0–7 (DDR) and node 8 (CXL). Linux's numa_balancing daemon promotes hot pages from the slow tier to the fast tier and demotes cold pages the other way; the OS tries to give your hot working set local-DRAM latency while letting the cold long-tail live cheaply on CXL. Whether this works depends entirely on the application's working-set hit rate and the speed of the promotion/demotion cycle (typically 100 ms–1 s). The interconnect-saturation lesson generalises: CXL has its own bandwidth budget too, and a workload that streams through cold memory at line rate can saturate the CXL link the same way an unbinded workload saturates UPI today. The diagnostic counter is uncore_cxl/... on Sapphire Rapids and later; the symptom shape is identical.

Reading `perf c2c` for cross-socket coherence hot spots

perf c2c record captures cache-to-cache transfer events; perf c2c report ranks cache lines by how much cross-socket traffic they generated. The output is a table with one row per hot line, showing the address, the symbol, the load/store counts from each socket, and most importantly the HITM (Hit-Modified) count — how often a load on one socket hit a line that was modified on another socket and had to fetch it. A line with high HITM and high cross-socket access fraction is a coherence hot spot: a candidate for padding, sharding, or replicating. The ParakhTrade orders_processed counter from earlier showed up as the #1 line in their perf c2c report with 18 M HITM events per second; after sharding it the line vanished from the report and the next-worst contributor was 200K HITM/sec. Reading perf c2c reports is a learnable skill — the format is dense but every column is worth understanding, and man perf-c2c plus Brendan Gregg's blog post are the two references that make it click.

A useful workflow for triage: run perf c2c record for 30 seconds during a known-saturated period, then perf c2c report --stdio and look at the top 10 entries. If the cumulative HITM count of those 10 lines accounts for more than 60 % of total HITM, the problem is concentrated — the fix is targeted padding/sharding of those specific lines. If the top 10 only account for 20 % and the long tail is hundreds of lines each at 1–2 %, the problem is diffuse — the workload's data structures are inherently shared across sockets, and the fix is structural (replicate the data, partition the workload by socket). The two cases want different responses; reading the distribution before deciding saves the wasted effort of micro-optimising in the diffuse case.

A note on Arm-based instances, which Indian cloud workloads increasingly land on: AWS Graviton, Azure Cobalt, and Ampere Altra use Arm's CMN-700 mesh as their cross-die interconnect. The events are exposed under arm_cmn_* PMU names; the bandwidth budget is in the same order of magnitude as UPI/xGMI (50–100 GB/s per cross-die link); the saturation symptom is identical. PaisaBridge's UPI-payment authoriser runs partly on Graviton3 c7g and partly on Sapphire Rapids c7i; the team maintains two perf event lists for the two architectures and the same alerts work on both because queueing physics doesn't care about the brand on the chip.

Reproduce this on your laptop

# 2-socket box recommended; single-socket boxes will report 0 GB/s of UPI/xGMI.
sudo apt install numactl linux-tools-common linux-tools-generic
numactl --hardware                          # confirm 2+ nodes
perf list | grep -iE "upi|xgmi|df_remote"   # find the right counter names

python3 -m venv .venv && source .venv/bin/activate
pip install numpy
sudo python3 measure_upi_saturation.py      # the script from above

# For directory-overflow / coherence:
sudo perf c2c record -- numactl --cpunodebind=0 --membind=1 \
     python3 -c "import numpy as np; a=np.zeros((1<<26,)); [a.sum() for _ in range(20)]"
sudo perf c2c report --stdio | head -50

A single-socket laptop runs the workload but reports zero UPI traffic — there is no link to saturate. A 2-socket EPYC or Xeon workstation, or any cloud *.metal instance, exercises the real counters and shows the bandwidth-vs-utilisation knee in action.

If you don't have access to bare-metal hardware, AWS c6i.metal-32xl (Intel Ice Lake, 2-socket) or c7a.metal-48xl (AMD Genoa, 2-socket) on ap-south-1 rent for roughly ₹400–₹800 per hour and run the script unmodified.

A two-hour exploration session is enough to see the saturation knee, the perf c2c workflow, and the asymmetric tx/rx pattern from the PaisaBridge anecdote — for under the cost of a dinner. Spin up the instance, run the three scripts (measure_upi_saturation.py, the perf c2c record ladder, and a stress workload of your choice), terminate the instance.

The numbers you measure will not match the numbers in this chapter exactly — a Genoa box's xGMI counters report different units than a Sapphire Rapids box's UPI counters; a c6i.metal-32xl has different link counts than a c7i.metal-48xl. The shape of the result is what transfers: at low utilisation the latency is flat; somewhere past 70 % it bends; past 90 % it's vertical. Once you've seen that bend yourself, you'll never read a UPI counter the same way again.

Where this leads next

You can now see the wire and read its saturation. The next chapters turn that visibility into code that doesn't fight the topology, and into measurement disciplines that catch link saturation before it becomes a tail-latency incident:

/wiki/numa-aware-allocators-and-data-structures — once placement and link visibility are solved, the next layer is making malloc itself NUMA-aware: per-arena allocators, sharded hash tables, per-node freelists. The shape of code that fits the interconnect's budget instead of overflowing it.
/wiki/measuring-numa-effects-with-perf — turning interconnect counters into routine SRE dashboards: numa_hit, numa_miss, numa_foreign, plus the uncore UPI/xGMI events on a per-socket basis. The recipe PaisaBridge's NUMA-drift detector is built on.
/wiki/false-sharing-the-silent-killer — the most common production source of cross-socket coherence traffic; padding and sharding patterns that drop interconnect bandwidth use by 10–100×.

The deeper habit is to treat the interconnect as a first-class capacity-planning resource alongside CPU and RAM, with its own dashboard, its own alerts, and its own rollback criteria on deploy. SetuStream's IPL post-mortem after Aditi's incident added a single Prometheus alert: upi_link_utilisation > 0.65 for 60 seconds pages the on-call. The alert has fired twice in the year since; both times the on-call rolled back a deploy before SLO was breached. Three lines of node_exporter text-collector code, one PromQL expression, one PagerDuty rule. The cost of that alert is a fraction of one engineer-hour per quarter to tune. The cost of not having it was 14 hours of incident response.

References

Intel® 64 and IA-32 Architectures Optimization Reference Manual — Volume 1, the chapters on Sapphire Rapids and Granite Rapids cover the UPI link, the directory-based coherence protocol, and the uncore performance events.
AMD EPYC 9004 Series Processors Performance Tuning Guide — the canonical reference for Infinity Fabric / xGMI bandwidth, NPS modes, and the chiplet-to-IOD latency budget.
Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 7 (Memory) and Chapter 16 (Case Study: NUMA) for production NUMA debugging including UPI counters.
Linux kernel Documentation/admin-guide/perf/intel-uncore.rst — the uncore PMU documentation; tells you which UPI events your kernel exposes and how to interpret them.
Daniel Lemire, "Performance overhead of NUMA on modern CPUs" (2023) — a measurement-driven look at remote-access cost on contemporary hardware.
Hennessy & Patterson, Computer Architecture: A Quantitative Approach (6th ed.) — Chapter 5 on multiprocessor coherence; the foundational treatment of directory protocols on which UPI and Infinity Fabric are built.
Christoph Lameter, "NUMA: An Overview" (Linux Plumbers 2013) — the conceptual backbone of NUMA on Linux; still accurate on what the kernel does and does not do for you.
/wiki/numactl-and-memory-binding — the previous chapter; pin threads and pages before you measure interconnect traffic.