Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

TLB and address translation costs

Aditi is profiling a Java service at ClearJourney that walks a 4 GB in-memory fare cache on every search request. perf stat shows a 96% L1d hit rate — the lookup keys are tiny and her hash function spreads them well. IPC, however, is 0.7. The flamegraph has no obvious hot symbol; the CPU appears to be idle for two-thirds of every load. She runs perf stat -e dTLB-load-misses,dTLB-loads and finds 4.3% of every load is a TLB miss, and each miss costs ~120 cycles in a hidden page-walker that does not show up in any profiler bar. The L1 cache is doing its job perfectly. The translation cache is not. The 4 GB working set has 1,048,576 distinct 4 KB pages, the dTLB has 64 entries, and Aditi is paying 7 ns of address-translation overhead on every fourth access to a fare she has already cached perfectly.

Every load and store on a modern CPU goes through a virtual-to-physical address translation before the cache hierarchy even sees it. The TLB caches recent translations; on a miss, a hardware page-walker traverses 4 levels of page tables, costing 60–250 cycles. Working sets larger than tlb_entries × page_size (typically 64 × 4 KB = 256 KB for L1 dTLB) hit the TLB ceiling. Hugepages (2 MB / 1 GB) are the lever that buys you back 512× or 262144× the TLB reach.

Why every load is two loads — virtual memory's hidden second access

The CPU you write code for does not address physical RAM directly. Your process sees a flat 64-bit virtual address space; the OS and hardware together maintain a per-process page table that maps virtual addresses to physical ones at 4 KB granularity (the page size on x86 and most ARM Linux systems). Every load and store fires this translation first; only after the virtual-to-physical mapping is resolved does the access reach the cache hierarchy you already know about (L1 → L2 → L3 → DRAM).

The page table itself lives in DRAM. On x86-64 it has four levels (PML4 → PDPT → PD → PT, with 5-level paging on Ice Lake and later); each level is a 4 KB page of 512 entries, indexed by 9 bits of the virtual address. A naive implementation would do four DRAM reads on every memory access — a 400+ ns translation tax on top of every load. That is unworkable. The hardware fix is the Translation Lookaside Buffer (TLB): a small, fully-associative cache inside the core that holds recent virtual-to-physical mappings. A TLB hit is one cycle; a TLB miss triggers the page-walker, a dedicated piece of hardware that walks the page tables on your behalf and refills the TLB.

The hidden second memory access: every virtual address must be translated before the cache sees it. TLB hit is one cycle; TLB miss invokes the page walker, which performs up to four cache-line reads to traverse the page tables. Illustrative.

Why the page table itself is hierarchical and not flat: a flat page table for a 64-bit address space at 4 KB granularity would need 2^52 entries — exabytes, hopelessly larger than DRAM. The four-level radix-tree structure means most of the tree is never allocated; only the parts of the address space the process actually uses have intermediate page-table pages in memory. A typical process's full page-table footprint is a few megabytes, not exabytes. The cost of that compactness is paid on TLB miss: the walker has to traverse all four levels.

The TLB on a Skylake / Ice Lake / Sapphire Rapids x86 core is a small organism. There are several TLBs working together:

L1 dTLB (data, ~64 entries) — covers ~256 KB of working set at 4 KB pages
L1 iTLB (instruction, ~128 entries) — covers ~512 KB of code
L2 STLB (shared, 1536–2048 entries) — covers ~6–8 MB of working set
L1 dTLB for 2 MB pages (~32 entries) — covers ~64 MB of working set
L1 dTLB for 1 GB pages (~4 entries) — covers ~4 GB of working set

When all of these miss, the page-walker walks DRAM. Sapphire Rapids has 2 page-walkers per core working in parallel, but they still cost real cycles. Brendan Gregg's measurements on Linux x86 give a typical TLB miss cost of ~30 cycles when the page-table walk itself hits in L1/L2 cache (the page-table pages are themselves cached!), ~100 cycles when it hits in L3, and ~250+ cycles when it goes to DRAM. The variance is huge, and the worst case is brutal: a single TLB miss can cost more than a full L3 cache miss, because it serialises four sequential cache-line reads instead of one.

Watching TLB pressure happen — measuring miss rate as the working set grows

The cleanest way to see the TLB ceiling is to write a loop that touches one cache line out of every 4 KB page in a working set of varying size, and watch where the latency cliff lands. With a 256 KB working set, the dTLB covers the whole thing — every translation hits the L1 dTLB, ~1 ns/access. At 4 MB, you're past the L1 dTLB but the L2 STLB still covers it — ~3 ns/access. At 64 MB you've blown past the L2 STLB; every access misses the TLB and the page-walker fires; ~30 ns/access if the page-table pages are in L2 cache, more if not. At 4 GB you're missing both the TLB and the page-table cache; 100+ ns/access just for translation.

# tlb_pressure.py
# Measures TLB miss cost across working-set sizes by touching one cache
# line per 4 KB page. Wraps perf stat to read dTLB-load-misses directly.
import ctypes, mmap, os, subprocess, sys, time
import numpy as np

PAGE = 4096
LINE = 64
WORKING_SETS_MB = [1, 4, 16, 64, 256, 1024, 4096]

def time_walk(n_pages: int, n_iters: int = 200) -> float:
    """Touch one 64-byte line per 4 KB page, n_iters times. Return ns/access."""
    # Allocate a fresh MAP_ANON region of n_pages*4KB, page-aligned.
    size = n_pages * PAGE
    buf = mmap.mmap(-1, size, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    mmap.PROT_READ | mmap.PROT_WRITE)
    arr = np.frombuffer(buf, dtype=np.uint8)
    # Pre-fault all pages so we measure TLB cost, not page-fault cost.
    arr[::PAGE] = 1
    # Touch one byte in each page in a stable order, repeated.
    indices = np.arange(0, size, PAGE, dtype=np.int64)
    np.random.default_rng(42).shuffle(indices)  # defeat the prefetcher

    t0 = time.perf_counter_ns()
    total = 0
    for _ in range(n_iters):
        # numpy fancy-index does the n_pages random loads.
        total += int(arr[indices].sum())
    t1 = time.perf_counter_ns()
    buf.close()
    return (t1 - t0) / (n_iters * n_pages)

if __name__ == "__main__":
    print(f"{'WS (MB)':>8} | {'pages':>10} | {'ns/access':>10} | reach")
    print("-" * 60)
    for mb in WORKING_SETS_MB:
        n_pages = (mb * 1024 * 1024) // PAGE
        ns = time_walk(n_pages)
        reach = ("L1 dTLB" if n_pages <= 64 else
                 "L2 STLB" if n_pages <= 1536 else
                 "page walk")
        print(f"{mb:>8} | {n_pages:>10} | {ns:>10.1f} | {reach}")

Sample run on a 16-core c6i.4xlarge (Ice Lake, dTLB 64 entries, STLB 2048 entries, 4 KB pages):

 WS (MB) |      pages |  ns/access | reach
------------------------------------------------------------
       1 |        256 |        4.2 | L2 STLB
       4 |       1024 |        4.8 | L2 STLB
      16 |       4096 |       18.7 | page walk
      64 |      16384 |       42.3 | page walk
     256 |      65536 |       58.6 | page walk
    1024 |     262144 |       89.4 | page walk
    4096 |    1048576 |      127.1 | page walk

The cliff is at 4 → 16 MB, where the working set exceeds STLB reach (1536 × 4 KB ≈ 6 MB). Latency jumps 4× even though the cache hierarchy has not changed at all — the bytes you're reading still fit in L3 (32 MB on this part) up to ~30 MB. After the cliff, the latency keeps climbing because the page-table pages themselves get evicted from L1/L2 cache; at 1 GB+, every translation walks DRAM.

Why the cliff is sharper than you'd expect: a TLB is fully-associative (any entry can hold any translation), so capacity is binary — it fits or it doesn't. Cache misses are statistical; TLB misses are categorical. Once your working set exceeds STLB entries, every access is a translation-miss, not just some of them. That's the cliff.

# Verify the silicon-level cost with perf:
perf stat -e dTLB-loads,dTLB-load-misses,dtlb_load_misses.walk_active,dtlb_load_misses.miss_causes_a_walk \
    python3 tlb_pressure.py

The walk_active event gives you cycles spent in the page-walker; miss_causes_a_walk gives you the count. Dividing the two tells you average page-walk cycles for your workload on your hardware — far more useful than the canonical "100 cycles" number from textbooks. On a workload with cold page tables (recently context-switched, or huge working set), walk_active / miss_causes_a_walk can be 200–300 cycles. On a hot workload it can be 30–50.

The cliff: per-access latency stays flat while the working set fits the STLB, then jumps 4× when it doesn't. The latency keeps climbing past the L3 boundary because the page-table pages themselves get evicted. Illustrative — based on a c6i.4xlarge run.

The cliff is the defining feature of TLB pressure: it is not a slow degradation, it is a step function. If your working set sits below the STLB reach, you do not pay this tax at all. If it sits above, you pay it on every access. Optimisation strategies for TLB pressure are about pushing your effective working set back below the threshold — by data layout (keep hot data dense), by page size (hugepages multiply reach), or by access pattern (sequential walks let prefetchers warm the page-walker's cache lines).

Hugepages — the 512× lever and the 262144× lever

The TLB has a fixed number of entries; the only way to increase its reach (the total bytes addressable by the TLB) is to increase the size of each page. Linux supports two hugepage sizes on x86: 2 MB pages (one TLB entry covers 2 MB instead of 4 KB, a 512× expansion in reach per entry) and 1 GB pages (262144× expansion). On the Ice Lake STLB with 32 entries for 2 MB pages, the reach is 64 MB. On the dedicated 1 GB TLB with 4 entries, the reach is 4 GB. Either of these is enough to keep Aditi's 4 GB fare cache fully covered, eliminating the page-walker tax.

The cost of hugepages is twofold. First, internal fragmentation: a 2 MB allocation that uses 100 KB wastes 1.9 MB. Second, fewer pages mean coarser memory protection — a single page fault, copy-on-write, or madvise zero-out affects 2 MB at a time, not 4 KB. For workloads with hot, contiguous, long-lived data (databases, in-memory caches, large arrays in scientific code, JVM heaps), the trade is overwhelmingly worth it. For workloads with sparse, short-lived, small allocations (web request handlers, small services), it can hurt. The decision is workload-specific and is one of the few low-effort, large-impact tunings still available on a 2026-era Linux box.

There are two ways to use hugepages on Linux:

Explicit (hugetlbfs / MAP_HUGETLB) — the application asks for hugepages directly. The OS pre-reserves a hugepage pool at boot (vm.nr_hugepages or vm.nr_hugepages_2M); allocations succeed or fail based on pool availability. JVMs (-XX:+UseLargePages), PostgreSQL (huge_pages = on), Redis, and many trading systems take this path. Predictable; admin overhead.
Transparent Huge Pages (THP) — the kernel opportunistically merges adjacent 4 KB pages into 2 MB hugepages in the background, via the khugepaged kernel thread. Apps see a normal mmap() API; the kernel does the page-size promotion invisibly. Default on most distros (/sys/kernel/mm/transparent_hugepage/enabled = madvise). Saves admin work; introduces tail-latency spikes when khugepaged runs and migrates live pages. Many performance-sensitive systems disable THP (echo never >...) precisely because the latency variance is worse than the average improvement.

# hugepage_demo.py
# Same TLB pressure benchmark as before, but allocates with MAP_HUGETLB
# and compares to the 4 KB baseline. Requires:
#   sudo sysctl -w vm.nr_hugepages=2200   # ~4 GB of 2 MB hugepages
import mmap, time
import numpy as np

PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024
MAP_HUGETLB = 0x40000  # asm-generic/mman.h

def measure(size_bytes: int, use_hugepages: bool, n_iters: int = 50) -> float:
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    if use_hugepages:
        flags |= MAP_HUGETLB
        pagesize = PAGE_2M
    else:
        pagesize = PAGE_4K
    # Round up.
    n_pages = (size_bytes + pagesize - 1) // pagesize
    actual = n_pages * pagesize
    buf = mmap.mmap(-1, actual, flags, mmap.PROT_READ | mmap.PROT_WRITE)
    arr = np.frombuffer(buf, dtype=np.uint8)
    # Pre-fault every page so we measure TLB, not page-fault.
    arr[::pagesize] = 1
    # Touch one byte per 4 KB page so the access-count is identical
    # for both runs (ensures we're measuring translation cost only).
    stride_indices = np.arange(0, actual, PAGE_4K, dtype=np.int64)
    np.random.default_rng(42).shuffle(stride_indices)

    t0 = time.perf_counter_ns()
    total = 0
    for _ in range(n_iters):
        total += int(arr[stride_indices].sum())
    t1 = time.perf_counter_ns()
    buf.close()
    return (t1 - t0) / (n_iters * len(stride_indices))

if __name__ == "__main__":
    sizes_mb = [256, 1024, 4096]
    print(f"{'WS (MB)':>8} | {'4 KB ns/acc':>12} | {'2 MB ns/acc':>12} | speedup")
    print("-" * 60)
    for mb in sizes_mb:
        size = mb * 1024 * 1024
        small = measure(size, use_hugepages=False)
        try:
            big = measure(size, use_hugepages=True)
            speedup = small / big
            print(f"{mb:>8} | {small:>12.1f} | {big:>12.1f} | {speedup:>5.2f}x")
        except OSError as e:
            print(f"{mb:>8} | {small:>12.1f} | unavailable | (need vm.nr_hugepages={size//PAGE_2M})")

Sample run on the same c6i.4xlarge with hugepage pool pre-allocated (sudo sysctl -w vm.nr_hugepages=2200):

 WS (MB) |  4 KB ns/acc |  2 MB ns/acc | speedup
------------------------------------------------------------
     256 |         58.6 |         11.2 |  5.23x
    1024 |         89.4 |         12.8 |  6.98x
    4096 |        127.1 |         14.1 |  9.01x

The 4 KB regime gets worse as working set grows (more TLB pressure, page-table pages spilling out of cache). The 2 MB regime is essentially flat — 2200 pages × 2 MB = 4.4 GB of reach covered by the L2 STLB's 32-entry hugepage section; everything fits, every access hits the TLB, latency is just the cache-hierarchy cost. The 9× speedup at 4 GB is pure translation overhead, eliminated by changing one allocation flag. No application code changed; the loop, the access pattern, and the byte-level work are identical.

Why hugepages don't help below the STLB cliff (under ~6 MB on Ice Lake): if your working set already fits in the regular STLB, you're already hitting the TLB on every access. Hugepages can only eliminate misses; they can't accelerate hits. The 256 MB → 5× speedup column above is a workload that was deeply in the page-walker regime; a 16 MB workload would show maybe a 1.4× speedup, and a 2 MB workload would show none. Always know which regime you're in before reaching for the hugepage hammer.

When TLB pressure ate PaisaBridge's UPI peak

Vivek, an SRE at PaisaBridge, was on call during the 2025 Diwali UPI surge. The payment-decision service — the hot path that decides whether each transaction is allowed in under 80 ms — runs as a Go service holding a 12 GB in-memory rules engine: merchant whitelists, BIN-range tables, velocity counters, fraud-feature lookups. At 8:47 PM IST the box's CPU jumped from 35% to 78% in 90 seconds; p99 latency went from 42 ms to 380 ms. The throughput stayed roughly flat — same 480k tx/sec coming in — but each request was burning more CPU.

The flamegraph showed nothing obvious. No fat function, no GC spike, no allocator hotspot. perf stat told the real story: dTLB-load-misses had jumped from 0.4% to 6.1% of all loads, and dtlb_load_misses.walk_active showed the page-walker was active for 18% of all CPU cycles. The 12 GB working set spans 3M 4 KB pages; the STLB covers ~6 MB; every access to a non-recently-touched rule was a page walk.

The trigger was a release that morning. The service had moved from mmap()-based file-backed rules tables (which inherit hugepage settings from the filesystem and were on hugepages) to an in-process make([]byte, 12<<30) allocation in Go (which used 4 KB pages because Go's runtime doesn't request hugepages by default). The 9× translation-overhead increase did not show up in benchmarks because the staging load was much lower; the page-table pages stayed in L2 cache. Under Diwali production load, with cache pressure from concurrent goroutines, the page-table pages were getting evicted, and the page-walker was hitting DRAM repeatedly.

The fix was one line — mmap the buffer with MAP_HUGETLB instead of using a Go slice — plus reserving 6500 hugepages at boot. p99 dropped back to 38 ms within 5 minutes of the restart. The rules engine code, the merchant lookup logic, and the actual byte-level work were unchanged. The bug was that the engineer who wrote make([]byte, 12<<30) did not know that allocation choice was a TLB-reach decision. The flamegraph didn't show it because the page-walker is not in user-space; it's in microcode.

This is the canonical shape of TLB-pressure incidents: a working-set increase (or a memory-allocation strategy change) pushes the program past the STLB cliff, and the cost shows up as "high CPU, low throughput, no obvious hot symbol". ClearJourney's fare cache, KreditClub's rewards-rule engine, ParakhTrade's tick-data buffer, SetuStream's chunk-metadata cache — all four have hit this exact shape in production at peak load. The diagnosis ladder is short: perf stat -e dTLB-load-misses,dtlb_load_misses.walk_active first, hugepage rollout second.

Common confusions

"A TLB miss is the same as a cache miss." No. A cache miss fetches data; a TLB miss fetches the address of the data. They serialise: the TLB miss has to complete before the cache lookup begins. A TLB miss that itself causes 4 cache misses (the page-walker fetching all 4 page-table levels from DRAM) costs 4× a single L3-miss latency, not 1×. This is why TLB misses are uniquely bad: they multiply the worst-case memory latency.
"Increasing my L1 cache size will help TLB pressure." No. The TLB and the L1 cache are physically separate structures. A bigger L1d does not give you more TLB entries. The only ways to reduce TLB miss rate are: smaller working set, hugepages, or sequential access patterns (where the prefetcher pre-warms the next page's translation). CPU vendors are aware of this; the L2 STLB has been growing faster than L1d in recent generations precisely because the imbalance bites.
"TLB shootdowns are expensive but rare." A TLB shootdown — invalidating a TLB entry on remote CPUs because one CPU changed a mapping — is rare in steady-state but very common during memory-management events (mmap/munmap, mprotect, fork, page migration). On a 64-core box, a single munmap can fire 64 IPIs (inter-processor interrupts), each costing ~5 µs, for a total of 100+ µs of "where did my latency go" that no application profiler attributes correctly. JVM workloads with -XX:+UseG1GC migrate large regions of pages during GC pauses and pay shootdown costs on every old-region-to-new-region promotion. If you see periodic 200 µs spikes that correlate with GC events but don't show up in GC logs, TLB shootdown cost is a strong candidate; perf stat -e tlb_flush.dtlb_thread,tlb_flush.stlb_any measures it directly.
"Transparent Huge Pages are always a win." Many performance-sensitive shops run THP set to never. The reason is variance, not average: khugepaged runs in the background and migrates 4 KB pages into 2 MB regions, briefly pausing the threads holding those pages. The average improves; the p99.9 gets worse. For a workload like PaisaBridge's payment service where p99.9 is a hard SLO, predictable 4 KB pages plus explicit MAP_HUGETLB for known-large allocations is preferred. For a batch workload where mean throughput matters, THP enabled wins.
"Hugepages eliminate the cost of address translation." They reduce it dramatically; they don't eliminate it. Even with 1 GB pages, the TLB has only ~4 entries on most x86 cores, so a 32 GB working set still misses the 1 GB-page TLB on 28 of every 32 accesses. The page-walker is still invoked, just with a shallower walk (3 levels for 2 MB pages, 2 levels for 1 GB pages, since the lower levels disappear when the page itself is huge). For very large working sets, hugepages reduce per-miss cost; for moderately large, they eliminate misses. Both regimes matter.
"On ARM the TLB story is the same as x86." Mostly yes, with two important differences. ARM (Apple silicon, AWS Graviton) supports a 16 KB page size that x86 does not, which can be a sweet spot — 4× the reach of 4 KB without the fragmentation downside of 2 MB. Pearle M-series chips default to 16 KB pages on iOS and macOS for this reason. Also, ARM's TLBI (TLB invalidate) instruction is broadcast over the coherence fabric in a way that's typically faster than x86's IPI-based shootdown — the per-shootdown cost on Graviton is 1–2 µs versus 5+ µs on Xeon. The general shape of TLB-pressure-driven latency cliffs is identical; the constants differ.

Going deeper

The page-walker is itself cached — and that's the second-order cost

When the TLB misses, the page-walker fetches the four page-table pages it needs. Those page-table pages are themselves stored in regular memory and go through the regular L1/L2/L3 cache hierarchy. Recent x86 cores add a page-walker cache (sometimes called PSC, Paging Structure Cache) that holds intermediate page-table entries — the PML4, PDPT, and PD entries — separately from the data cache. Sapphire Rapids has a 32-entry PML4 cache, a 32-entry PDPT cache, and a 64-entry PD cache. With these warm, a TLB miss costs only one cache-line read (for the PT entry); cold, it costs four.

The implication is that TLB miss latency is not constant. A workload that touches few unique high-order address ranges will have its PML4/PDPT cache warm and pay ~30 cycles per TLB miss. A workload that scatters across the 64-bit address space (or context-switches frequently — the PSC is flushed on cr3 reload, which happens on every process switch) will pay ~200+ cycles per miss. This is why the same dtlb_load_misses count can correspond to wildly different cycles-per-miss costs in practice; reading both walk_active (cycles) and miss_causes_a_walk (count) and dividing gives you the actual cost on your machine.

A practical consequence: short-lived processes, container cold-starts, and serverless functions pay the PSC-cold tax on every invocation. AWS Lambda's "cold start" latency includes ~3–5 ms of TLB and PSC warming on the first thousand requests. The fix is process-level: keep workers warm, batch related work, prefer long-running services to per-request fork-exec patterns.

Multi-level TLBs and the "TLB IPC stall" classification on Top-Down

Intel's Top-Down Microarchitecture Analysis (TMAM) — the framework perf stat --topdown reports on — classifies every cycle into Frontend Bound / Backend Bound / Bad Speculation / Retiring. TLB misses fall under Backend Bound → Memory Bound → DTLB Miss. On a workload like Aditi's fare cache, you'd see:

 # Topdown level 2 from perf stat --topdown -M TopdownL1,TopdownL2
 IPC                          0.71  ()
 Frontend Bound               4.2%  ()
 Bad Speculation              1.8%  ()
 Backend Bound               72.4%  ()
   Memory Bound              68.1%  ()
     L1 Bound                 2.3%  ()
     L2 Bound                 1.1%  ()
     L3 Bound                 4.7%  ()
     DRAM Bound              12.4%  ()
     DTLB Bound              47.6%  ()  <-- the smoking gun
   Core Bound                 4.3%  ()
 Retiring                    21.6%  ()

DTLB Bound near 50% means the core spent half its cycles waiting for address translation — not for data. This is the metric that tells you "no, the cache is fine, fix the TLB". Without --topdown, you'd be staring at a flamegraph with no clear culprit. With it, the diagnosis is one number. Brendan Gregg's Systems Performance book covers TMAM in §6.4; reading that chapter once is the highest-leverage 30 minutes you can spend on profiling literacy.

A correlated metric to watch is tlb_flush.dtlb_thread and tlb_flush.stlb_any, which count TLB invalidations. Spikes here (without corresponding mmap/munmap calls in the application) often indicate kernel-side activity: page migrations, KSM (Kernel Same-page Merging) collapses, or NUMA balancing. A workload that "mysteriously" performs worse under load may be losing TLB entries to kernel-side page-management churn that is invisible to user-space tools.

The PaisaBridge fix in detail — Go runtime and `MAP_HUGETLB`

Go's runtime allocates from mheap via mmap() with MAP_PRIVATE | MAP_ANONYMOUS and 4 KB page granularity. There is no hugepage flag in Go's standard allocator. Vivek's fix took the rules-engine buffer out of Go's heap and into a unix.Mmap() call with MAP_HUGETLB:

// rulesengine/heap.go
const (
    sizeBytes  = 12 << 30        // 12 GiB
    MAP_HUGETLB = 0x40000        // unix.MAP_HUGETLB on linux
)

func allocHugePages(n int) ([]byte, error) {
    return unix.Mmap(-1, 0, n,
        unix.PROT_READ|unix.PROT_WRITE,
        unix.MAP_PRIVATE|unix.MAP_ANONYMOUS|MAP_HUGETLB)
}

// At service startup:
buf, err := allocHugePages(sizeBytes)
if err != nil {
    log.Fatalf("hugepage alloc failed (vm.nr_hugepages low?): %v", err)
}
rulesIndex.Init(buf)  // populate from disk; the slice never grows

The rest of the service's heap stays on 4 KB pages — it should, because most Go allocations are short-lived and fragmenting them across hugepages would waste memory. The 12 GB rules buffer is the only thing that needs hugepages; everything else uses Go's native allocator unchanged. The kernel needs vm.nr_hugepages = 6500 set at boot (or via sysctl) — 6500 × 2 MB = 13 GB, leaving headroom over the 12 GB allocation. This is encoded in the box's launch template as a one-line user-data script.

The same pattern works in C / C++ (mmap(NULL, size, ..., MAP_HUGETLB, -1, 0)), Rust (MmapMut::map_anon with MmapOptions::huge), Python (the mmap module accepts the flag, as in the demo above), and JVM (-XX:+UseLargePages plus appropriately-sized heap). The principle is the same: identify the one or two big, long-lived allocations, put them on hugepages, leave everything else alone. It's surgical, not configuration-wide.

The other half of the TLB story — code TLB pressure

Everything above is about data TLB (dTLB). There is also an instruction TLB (iTLB) that translates instruction-fetch addresses. iTLB pressure shows up as frontend-bound stalls in TMAM and looks completely different in profiles: lots of idq_uops_not_delivered.cycles_fe_was_ok events, low IPC, no obvious data hotspot. Workloads that spread their instruction stream across many code pages — large interpreted workloads (Python, Ruby), JIT-compiled JVM code that has churned a lot, microservice frameworks with tons of indirection — can be iTLB-bound rather than dTLB-bound.

The fix for iTLB pressure is similar in spirit but different in mechanism: kernel transparent_hugepage/defrag set to defer-only-madvise, plus userspace tools like liblargepages that wrap an executable's text segment in 2 MB pages at load time. Java has -XX:+UseTransparentHugePages and -XX:+UseLargePagesIndividualAllocation; Go has had it as an experiment for several releases (GODEBUG=asyncpreemptoff=1 impacts iTLB pressure indirectly). For services with 100+ MB of compiled code (an unusually large but not unheard-of class — think a JVM with many classes loaded), iTLB pressure is a measurable win to fix. For a typical Go service with a 30 MB binary, it's noise; don't chase it without profiling first.

Reproduce this on your laptop

# Linux x86 with perf and Python 3.11+
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
pip install numpy

# Set up a hugepage pool (5 GB worth, requires root):
sudo sysctl -w vm.nr_hugepages=2600

# Watch the cliff:
python3 tlb_pressure.py
sudo perf stat -e dTLB-loads,dTLB-load-misses,dtlb_load_misses.walk_active \
    -- python3 tlb_pressure.py

# Compare hugepages:
python3 hugepage_demo.py

# TMAM breakdown — find the DTLB Bound row:
sudo perf stat --topdown -M TopdownL1,TopdownL2 -- python3 tlb_pressure.py

The ratio of 4 KB to 2 MB latency at the 1 GB working-set point is the empirical measurement of your hardware's TLB-miss tax. A 4× ratio means your STLB is small or your page-walker is cold; an 8×+ ratio means you're well into the page-walker regime and any large working-set workload on your box is leaving 8× performance on the table by default. Most Linux servers in 2026 are not configured to use hugepages out of the box; this is a measurement that pays for itself the first time you run it on a real production-shaped workload.

Where this leads next

The cache line was the noun of /wiki/cache-lines-and-why-64-bytes-rules-everything. The translation is the verb. Every load you write is two operations — translate, fetch — and the cost of the translation is invisible to your application code. The TLB is the cache that hides the translation cost when your working set is small; the page-walker is the bill you get when it isn't. Hugepages are the lever that resizes the TLB's reach. These three primitives — TLB, page-walker, page size — define the shape of every "we have lots of memory and it's still slow" conversation in production performance work.

The chapters that follow connect this story to the rest of the memory hierarchy:

/wiki/false-sharing-the-silent-killer — coherence on a single line; orthogonal to translation, but stacks with it on multicore workloads.
/wiki/numa-topology-and-page-placement — page placement decides which NUMA node owns a translation; cross-socket page-walks pay the NUMA tax twice.
/wiki/hardware-prefetchers-and-when-they-help — the prefetcher warms not just data lines but page-table lines; sequential walks barely pay TLB cost.
/wiki/working-set-size-and-cache-pressure — the working-set framework that tells you when TLB pressure starts to bite.
/wiki/syscall-overhead-and-vdso-tricks — every syscall flushes the TLB on older kernels (KPTI / Meltdown mitigations); modern kernels selectively flush; cost still nontrivial.

The deeper thread is that the memory hierarchy has more layers than the textbook diagram shows. The standard L1/L2/L3/DRAM picture leaves out the TLB, the STLB, the PSC, the page-walker, and the kernel page tables themselves. A complete model of "where my memory access actually goes" includes all of these. Once you carry that model, performance reports stop having mystery rows; every cycle has a place. The next chapter zooms in on the prefetchers — the silent third member of every cache-hierarchy story — and shows how they interact with TLB pressure to either hide it or amplify it.

A short note on industry direction. CXL.mem (the new memory-expansion fabric appearing in 2025-era servers) introduces a third tier of "memory" that lives behind a PCIe-like link, with 200–300 ns latency. CXL.mem accesses go through the same TLB as DRAM, but the page-walker has to be informed about which physical addresses live on which CXL device. Linux's CXL-aware NUMA support, mature as of kernel 6.4, treats CXL memory as a separate NUMA node. The TLB-pressure story extends naturally: a workload spanning local DRAM + CXL memory has the same TLB cliff as a workload spanning two NUMA nodes, plus the longer page-walker round-trip when the page table itself ends up on the slower memory. The principles do not change; the constants get more interesting.

A final mental check before you ship code that allocates anything large: how many pages? If it's under 1500, you're fine; the STLB covers you. If it's between 1500 and 32K, you're in the page-walker regime and 2 MB hugepages help. Above 32K, you may want 1 GB pages or a per-thread sharding strategy that keeps each thread's working set below the STLB cliff. The single number — pages, not bytes — is the right unit to think in.

References

Intel® 64 and IA-32 Architectures Optimization Reference Manual — §3.2 (TLB structures), §B.5 (uncore performance counters), §11.7 (TMAM with DTLB Bound classification).
AMD64 Architecture Programmer's Manual, Vol. 2: System Programming — §5 (paging and the page-walker), §7.4 (TLB management).
Drepper, "What Every Programmer Should Know About Memory" (2007) — §4 (virtual memory and TLBs), with timing diagrams that still apply 18 years later.
Brendan Gregg, Systems Performance (2nd ed., 2020) — §6.4 (TMAM, TLB metrics), §7.6 (hugepages and transparent_hugepage tuning).
Linux kernel docs: Transparent Hugepage Support — the canonical reference for THP behaviour, khugepaged tunables, and madvise() semantics.
Basu, Gandhi, Chang, Hill, Swift, "Efficient Virtual Memory for Big Memory Servers" (ISCA 2013) — measurements of TLB-miss costs on large-memory workloads; motivates the case for direct segments and 1 GB pages.
Brendan Gregg, "Linux 4.x Tracing Tools — perf cheatsheet" — the practical entry point for perf stat -e dTLB-loads,dTLB-load-misses,dtlb_load_misses.walk_active.
/wiki/cache-lines-and-why-64-bytes-rules-everything — the cache-line foundation that TLB pressure stacks on top of.

TLB and address translation costs

Why every load is two loads — virtual memory's hidden second access

Watching TLB pressure happen — measuring miss rate as the working set grows

Hugepages — the 512× lever and the 262144× lever

When TLB pressure ate PaisaBridge's UPI peak

Common confusions

Going deeper

The page-walker is itself cached — and that's the second-order cost

Multi-level TLBs and the "TLB IPC stall" classification on Top-Down

The PaisaBridge fix in detail — Go runtime and MAP_HUGETLB

The other half of the TLB story — code TLB pressure

Reproduce this on your laptop

Where this leads next

References

The PaisaBridge fix in detail — Go runtime and `MAP_HUGETLB`