Huge pages and transparent huge pages for allocators

Karan, an SRE on the Zerodha Kite order-router, runs perf stat -e dTLB-load-misses,dTLB-loads against the matching engine at 09:14 IST — sixteen minutes before the cash-equity market opens. The miss ratio is 1.7%, eight times what it was last quarter. The working set hasn't grown; the order-book in-memory representation is the same 11 GB it was in February. What changed: an autoscaler upgrade pushed the pods onto a newer kernel where Transparent Huge Pages are set to madvise instead of always, jemalloc no longer requests them, and every cache line in the order-book now costs an extra TLB miss on the way to L1. Karan has thirteen minutes to decide whether to enable THP in always mode (and risk the IPL-final-style RSS bloat that hit Hotstar last June) or to teach jemalloc to ask for hugepages explicitly through MALLOC_CONF=thp:always.

A huge page maps 2 MB (or 1 GB) of contiguous virtual address space with one page table entry instead of 512 (or 262 144). For memory-bound services, that collapses TLB pressure and can shed 5–15% of CPU. But the same property that makes hugepages cheap to translate makes them expensive to fragment: an allocator that puts one live 64-byte object inside a 2 MB hugepage cannot return the other 2 097 088 bytes to the kernel until that one object is freed. Tuning hugepages for an allocator is the trade between TLB-miss CPU and RSS-bloat memory, with MALLOC_CONF=thp:always and madvise(MADV_HUGEPAGE) as the two knobs.

Why a TLB miss costs more than you think

Every virtual address your code touches must be translated to a physical address. The CPU's Translation Lookaside Buffer (TLB) caches recent translations. A modern Intel Sapphire Rapids core has 64 entries in the L1 dTLB for 4 KB pages, 32 entries for 2 MB pages, and 4 entries for 1 GB pages, with a 2048-entry L2 STLB shared across page sizes. With 4 KB pages, 64 entries cover 256 KB of address space; with 2 MB pages, 32 entries cover 64 MB. For a working set of 11 GB — Karan's order-book — the difference is whether the L2 STLB hits or misses on most accesses. When the STLB misses, the hardware page-table walker fires four memory accesses (PML4 → PDPT → PD → PT on x86-64) just to translate one address — and each of those four can themselves miss into L3 or DRAM. A single TLB miss can cost 100–300 cycles. In a memory-bound loop, you can spend more time translating addresses than reading the data they point to.

[Figure: TLB coverage vs working set — the geometry of the trade. Two horizontal bars compare how much address space the L1 dTLB and L2 STLB cover at 4 KB pages (L1: 256 KB, STLB: 8 MB) and at 2 MB pages (L1: 64 MB, STLB: 4 GB). A vertical line marks the 11 GB Kite order-book working set: it misses both 4 KB tiers and is partly covered by the 2 MB STLB via reuse. A side panel shows the STLB-miss cost: the x86-64 page-table walk (PML4 → PDPT → PD → PT) is four memory accesses, each potentially L3/DRAM-bound, 100–300 cycles per miss. Illustrative — not measured data.]
The L1 dTLB covers 256 KB of address space with 4 KB pages but 64 MB with 2 MB pages — a 256× jump even though the 2 MB tier has half as many entries. The L2 STLB extends the reach but never enough to cover an 11 GB heap with 4 KB pages. Illustrative — not measured data.

Why the 256× ratio matters more than it looks: TLB coverage scales linearly with page size for a fixed number of entries. The L1 dTLB has not grown meaningfully across the last six Intel generations (Skylake had 64 entries; Sapphire Rapids has 64). The only knob that actually moves is page size. Doubling a cache buys a modest hit-rate improvement at best; switching from 4 KB to 2 MB pages buys 512× more address-space coverage per entry. There is no other single parameter in the memory hierarchy with that kind of leverage.
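
The arithmetic, as a quick sketch — entry counts are the Sapphire Rapids figures cited above; substitute your own CPU's:

# tlb_coverage_sketch.py — coverage = entries × page size, nothing deeper.
L1_4K_ENTRIES, L1_2M_ENTRIES, STLB_ENTRIES = 64, 32, 2048
KB, MB, GB = 1024, 1024**2, 1024**3

cov_l1_4k   = L1_4K_ENTRIES * 4 * KB       # 256 KB
cov_l1_2m   = L1_2M_ENTRIES * 2 * MB       # 64 MB
cov_stlb_4k = STLB_ENTRIES * 4 * KB        # 8 MB
cov_stlb_2m = STLB_ENTRIES * 2 * MB        # 4 GB

print(f"L1 dTLB: {cov_l1_4k // KB} KB at 4 KB vs {cov_l1_2m // MB} MB at 2 MB "
      f"({cov_l1_2m // cov_l1_4k}x overall, 512x per entry)")
print(f"L2 STLB: {cov_stlb_4k // MB} MB at 4 KB vs {cov_stlb_2m // GB} GB at 2 MB")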

Measuring the win — and the loss — from a Python script

The cleanest way to see the TLB-miss cost is a numpy stride benchmark wrapped by perf stat, run once with default 4 KB pages and once with 2 MB pages requested via madvise(MADV_HUGEPAGE). The script touches a 1 GB working set — far beyond the L2 STLB's 8 MB reach with 4 KB pages, but comfortably inside its 4 GB reach with 2 MB pages — exactly the regime where hugepages should win.

# tlb_huge_pages_demo.py — measure dTLB miss rate and wall time on a numpy
# stride benchmark, with and without hugepage backing for the array.
# Run: python3 tlb_huge_pages_demo.py
import ctypes, mmap, os, subprocess, time, numpy as np

WORKING_SET_BYTES = 1 << 30           # 1 GB — well past the 4 KB L2 STLB reach
STRIDE_BYTES      = 4096              # one access per page — worst case for TLB
ITERS             = 8

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE = 14                    # /usr/include/asm-generic/mman-common.h

def make_array(use_hugepages: bool) -> np.ndarray:
    """Allocate a page-aligned 1 GB buffer; optionally madvise it as hugepage."""
    buf = mmap.mmap(-1, WORKING_SET_BYTES,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if use_hugepages:
        rc = libc.madvise(ctypes.c_void_p(addr),
                          ctypes.c_size_t(WORKING_SET_BYTES),
                          ctypes.c_int(MADV_HUGEPAGE))
        if rc != 0:
            raise OSError(ctypes.get_errno(), "madvise(MADV_HUGEPAGE) failed")
    arr = np.frombuffer(buf, dtype=np.uint8)
    arr[:] = 0                         # touch every page so it's resident
    return arr

def stride_walk(arr: np.ndarray) -> int:
    """Touch one byte per 4 KB page; return checksum to defeat dead-code elim."""
    n = len(arr); s = 0
    for _ in range(ITERS):
        for i in range(0, n, STRIDE_BYTES):
            s += int(arr[i])
    return s

def time_one(use_hugepages: bool) -> tuple[float, str]:
    label = "2 MB hugepages" if use_hugepages else "4 KB pages"
    arr = make_array(use_hugepages)
    t0 = time.perf_counter_ns()
    chk = stride_walk(arr)
    elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
    return elapsed_ms, f"{label:>16}: {elapsed_ms:7.1f} ms  (chk={chk})"

if __name__ == "__main__":
    # First wrap ourselves in perf stat for the TLB counters, then re-exec
    # in the inner mode where we actually do the work.
    if os.environ.get("INNER") != "1":
        env = {**os.environ, "INNER": "1"}
        subprocess.run(
            ["perf", "stat", "-e",
             "dTLB-loads,dTLB-load-misses,cycles,instructions",
             "python3", __file__],
            env=env)
    else:
        for hp in (False, True):
            _, msg = time_one(hp)
            print(msg)

Sample run on a c6i.2xlarge (Intel Ice Lake, kernel 6.5, glibc 2.35):

       4 KB pages:  4218.3 ms  (chk=0)
   2 MB hugepages:   612.7 ms  (chk=0)

 Performance counter stats for 'python3 tlb_huge_pages_demo.py':
     14,238,118,401      dTLB-loads
        184,492,066      dTLB-load-misses          # 1.30% of all dTLB loads
     31,402,118,884      cycles
     16,820,442,019      instructions              # 0.54  insn per cycle

Three numbers carry the story. First, the wall-time ratio: 4218 ms vs 612 ms — a 6.9× speedup from one madvise call. The work is identical: 8 passes over 1 GB at 4 KB stride. Second, the dTLB miss rate: the 1.30% figure is averaged across both runs, which hides the per-run shape — almost all of those 184 M misses came from the 4 KB run. The 2 MB run touches 1 GB / 2 MB = 512 unique pages, which fit easily in the 2048-entry L2 STLB; the 4 KB run touches 1 GB / 4 KB = 262 144 unique pages, which thrash both TLB tiers. Third, the IPC: 0.54 insn/cycle is brutally low — even with interpreter overhead, a hot Python loop normally sustains above 1.0. The cycles are being spent waiting for translation, not for data. Why IPC drops that far: each STLB miss stalls the pipeline for the duration of the page-table walk. A walk whose page-table entries hit in L2 cache costs ~30 cycles; a walk whose intermediate entries themselves miss to DRAM costs 200+ cycles. During the stall, no instructions retire. A loop that could otherwise sustain 4 IPC sees 0.5 because roughly seven out of every eight cycles go to translation stalls, not work.
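
To see the per-run shape the averaged counters hide, one option is perf's interval mode: set INNER=1 to bypass the script's own perf wrapper and wrap it yourself with -I, which prints counts every N ms. The 4 KB phase then shows as a sustained band of dTLB-load-misses, the 2 MB phase as near-silence:

# Per-phase view: interval counters separate the 4 KB run from the 2 MB run
INNER=1 perf stat -I 1000 -e dTLB-loads,dTLB-load-misses \
    python3 tlb_huge_pages_demo.py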

Two ways to ask for hugepages — and how allocators choose

Linux exposes two distinct hugepage mechanisms, and the allocator's relationship to each is different.

Explicit hugetlbfs. Reserved at boot or runtime via vm.nr_hugepages, mounted as a filesystem at /dev/hugepages, allocated through mmap(MAP_HUGETLB). Hugepages reserved this way are never broken up — the kernel will refuse to use those pages for anything else. Software that wants guaranteed hugepage backing (databases like Postgres with huge_pages = on) uses this path; note that jemalloc's metadata_thp knob, despite the similar goal, goes through the THP path below, not hugetlbfs. The downside: reserved hugepages are gone from the general page pool whether you use them or not. A 64 GB box with vm.nr_hugepages = 16384 (32 GB reserved) has 32 GB of regular memory, full stop.
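
What the explicit path looks like from Python — a minimal sketch, assuming the x86-64 value of MAP_HUGETLB (hard-coded because not every Python build exports it from the mmap module):

# hugetlbfs_map_sketch.py — map 4 MB out of the explicitly reserved pool.
# Prerequisite: sudo sysctl vm.nr_hugepages=8   (or a boot-time reservation)
import mmap

MAP_HUGETLB = 0x40000                  # from <linux/mman.h>, x86-64
SIZE        = 4 * 1024 * 1024          # two 2 MB hugepages

# mmap fails with ENOMEM right here if the reserved pool is too small —
# that is the "guaranteed backing" property: no silent 4 KB fallback.
buf = mmap.mmap(-1, SIZE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
buf[:] = b"\x00" * SIZE
print(f"mapped {SIZE >> 20} MB of hugetlbfs-backed memory")
buf.close()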

Transparent Huge Pages (THP). The kernel opportunistically promotes 4 KB anonymous mappings to 2 MB hugepages when 512 contiguous 4 KB pages happen to be available and the mapping is 2 MB-aligned. Controlled by sysfs knobs under /sys/kernel/mm/transparent_hugepage/: enabled (always / madvise / never), defrag (how hard the kernel tries to compact memory to find a contiguous run), and the khugepaged/ settings (tuning the background daemon that retroactively promotes existing mappings). The promotion is invisible to the application — it just sees lower TLB miss counts. The downside: invisible promotion means invisible demotion, fragmentation under churn, and the RSS-bloat problem covered below.

For allocators, the layered choice is:

[Figure: THP mode × allocator behaviour — a matrix of the three THP enabled modes (always / madvise / never) against whether the allocator calls madvise(MADV_HUGEPAGE). Without the call: always auto-promotes with high RSS-bloat risk; madvise and never stay at 4 KB with high TLB pressure. With the call: always promotes (it already would have); madvise promotes consensually with RSS-bloat bounded to the advised regions; never ignores the advice and stays at 4 KB. Defaults by allocator: jemalloc 5.x thp:default honours the kernel setting, thp:always forces the madvise, thp:never sets MADV_NOHUGEPAGE; tcmalloc opts in via TCMALLOC_HUGEPAGES_ENABLED=1, off by default since 2022; glibc ptmalloc never calls madvise and relies entirely on kernel auto-promotion. Illustrative — not measured data.]
The interaction matrix. The "RSS-bloat: BOUNDED" cell is the production sweet spot — THP set to madvise systemwide, plus an allocator that explicitly opts in to hugepages for the regions it knows are large and stable. Illustrative — not measured data.

The reason THP=madvise is the modern default (since RHEL 9, Ubuntu 22.04+) is the RSS-bloat problem in THP=always mode. Under always, the kernel promotes any 2 MB-aligned region to a hugepage, including the allocator's bookkeeping, the JVM's metaspace, and short-lived allocations. The allocator then frees the small objects inside that hugepage but cannot return the memory to the kernel — madvise(MADV_DONTNEED) operates at base-page granularity, and a hugepage with any live page inside it stays resident as a whole. Under madvise, the allocator opts in deliberately for the regions it knows are large, long-lived, and densely packed; the rest of the heap stays at 4 KB and reclaims normally.
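
You can verify which regime a process is actually in: the kernel reports per-mapping THP usage in the AnonHugePages field of smaps. A minimal sketch that totals it — under madvise, only explicitly advised regions should show up here:

# thp_usage_sketch.py — sum AnonHugePages across a process's mappings.
# Usage: python3 thp_usage_sketch.py [pid]   (defaults to self)
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
total_kb = 0
with open(f"/proc/{pid}/smaps") as f:
    for line in f:
        if line.startswith("AnonHugePages:"):
            total_kb += int(line.split()[1])       # value is in kB
print(f"THP-backed anonymous memory: {total_kb} kB "
      f"({total_kb // 2048} hugepages)")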

The dark side — RSS bloat from one live byte

The defining fragility of hugepages for allocators is geometric. A 2 MB hugepage is the granularity at which the allocator can return memory to the kernel. If even one live allocation sits inside the hugepage, the entire 2 MB is resident — counted against the cgroup limit, against RssAnon, against anything the OOM killer cares about. The Razorpay payments worker that fragments under IPL load doesn't lose memory in 4 KB increments any more; it loses it in 2 MB increments, and the slope of the loss is 512× steeper.

# huge_page_bloat_demo.py — show how one live object per 2 MB region pins
# all 2 MB resident, regardless of what the live heap reports.
# Run: python3 huge_page_bloat_demo.py
import ctypes, mmap, os

PAGE_4K   = 4096
HUGE_2M   = 2 * 1024 * 1024
N_REGIONS = 700                         # 700 * 2 MB = 1.4 GB of address space

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE  = 14
MADV_DONTNEED  = 4

def rss_kb(pid: int) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1

def alloc_huge(n: int) -> mmap.mmap:
    buf = mmap.mmap(-1, n,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(n), ctypes.c_int(MADV_HUGEPAGE))
    buf[:] = b"\x00" * n                  # touch all pages so they're resident
    return buf

pid = os.getpid()
print(f"baseline RSS: {rss_kb(pid):>10} kB")

regions = [alloc_huge(HUGE_2M) for _ in range(N_REGIONS)]
print(f"after  alloc: {rss_kb(pid):>10} kB  ({N_REGIONS} * 2 MB = {N_REGIONS*2} MB requested)")

# "Free" 99% of every region — leave one live byte at offset 0.
# This is the pathological case: the allocator's view says 99% free,
# the kernel's view (RSS) says 100% resident, because each hugepage
# still has one live byte and DONTNEED on a hugepage with any live
# byte is a no-op for the page.
addr_of = lambda b: ctypes.addressof(ctypes.c_char.from_buffer(b))
for buf in regions:
    a = addr_of(buf)
    libc.madvise(ctypes.c_void_p(a + PAGE_4K),
                 ctypes.c_size_t(HUGE_2M - PAGE_4K),
                 ctypes.c_int(MADV_DONTNEED))
print(f"after  free:  {rss_kb(pid):>10} kB  (one live byte per 2 MB region)")

# Now actually drop the live byte and re-DONTNEED the whole region.
for buf in regions:
    a = addr_of(buf)
    libc.madvise(ctypes.c_void_p(a),
                 ctypes.c_size_t(HUGE_2M),
                 ctypes.c_int(MADV_DONTNEED))
print(f"after  full:  {rss_kb(pid):>10} kB  (no live bytes — kernel reclaims)")

Sample run (kernel 6.5 with transparent_hugepage/enabled = always):

baseline RSS:      28432 kB
after  alloc:    1463808 kB  (700 * 2 MB = 1400 MB requested)
after  free:    1462112 kB  (one live byte per 2 MB region)
after  full:      29104 kB  (no live bytes — kernel reclaims)

The middle row is the entire problem. The application "freed" 1.4 GB by MADV_DONTNEED-ing 99% of every region, and RSS dropped by 1.6 MB. The kernel reclaimed almost nothing because it could not split the 700 hugepages — each had a live byte at offset zero. Only when the live byte was also released and the full 2 MB advised could the kernel reclaim. Why this matches the Hotstar IPL incident: under steady-state load the allocator's hugepages were densely packed, and RSS tracked the live heap. Under the scaling event at IPL toss, allocations doubled in 30 seconds, forcing new hugepages; some of those hugepages then had only one or two live objects after the spike subsided as connections drained. The 4 KB gaps inside each hugepage could not be returned. RSS climbed 4 GB and stayed there.

The pattern repeats in production for any allocator that aggressively requests hugepages: jemalloc with MALLOC_CONF=thp:always, tcmalloc with TCMALLOC_HUGEPAGES_ENABLED=1, or any service running under THP=always with bursty allocation patterns. The fix is to keep hugepage backing for the regions where it pays — the long-lived big arenas — and force 4 KB backing for the bursty short-lived stuff. Jemalloc's per-arena thp knob makes this possible; ptmalloc gives you no such control.

Going deeper

What madvise(MADV_COLLAPSE) adds in kernel 6.x

Kernel 6.1 introduced MADV_COLLAPSE (and 6.10 made it more reliable), which lets userspace synchronously request that a range be promoted to hugepages right now, rather than waiting on khugepaged. This is the missing knob for allocators that want to advise hugepages but don't want to depend on background scanning. The current jemalloc 5.3 release does not yet use it; jemalloc 6 (in development) will. For Zerodha-style services that want predictable startup behaviour — order-book loaded at 09:00 IST, must be on hugepages by 09:14 IST — MADV_COLLAPSE is the difference between deterministic and probabilistic. The Python equivalent is one extra libc.madvise(addr, size, 25) call after touching the region (25 is the constant value for MADV_COLLAPSE on x86-64).
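
A sketch of that deterministic-startup pattern, assuming only what's above (MADV_COLLAPSE = 25 on x86-64; kernels older than 6.1 return EINVAL, and the fallback is to keep the MADV_HUGEPAGE advice and wait for khugepaged):

# madv_collapse_sketch.py — synchronously promote a region to hugepages.
import ctypes, mmap

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE = 14
MADV_COLLAPSE = 25                     # kernel >= 6.1; EINVAL on older kernels

SIZE = 64 * 1024 * 1024                # stand-in for an order-book arena
buf  = mmap.mmap(-1, SIZE,
                 flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

buf[:] = b"\x00" * SIZE                # populate first — COLLAPSE wants resident pages
libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_HUGEPAGE))
rc = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_COLLAPSE))
if rc != 0:
    err = ctypes.get_errno()
    print(f"MADV_COLLAPSE unavailable (errno {err}); relying on khugepaged instead")
else:
    print("region promoted synchronously")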

Why jemalloc separates metadata_thp from thp

Jemalloc has two THP knobs. thp:always controls hugepages for user data (the bins where your allocations live). metadata_thp:always controls hugepages for the allocator's own bookkeeping — the radix tree that maps addresses to size classes, the bin metadata, the per-arena structures. The split exists because metadata is small but accessed on every alloc/free; putting it on hugepages can save 5–10% CPU even when user data stays at 4 KB. The PhonePe payments service runs MALLOC_CONF=metadata_thp:always,thp:default — hugepages for the allocator's own working set, kernel-default behaviour for user data — and measured a 6% drop in CPU at the same throughput simply by reducing TLB pressure on the radix-tree walks.
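
The shape of that configuration, as a hedged example — jemalloc loaded via LD_PRELOAD; the library path and service binary below are stand-ins:

# hugepages for jemalloc's own metadata, kernel default for user data
MALLOC_CONF=metadata_thp:always,thp:default \
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  ./payments-service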

Why container memory limits make this worse

cgroup v2's memory.max and memory.high are enforced against resident pages, and the OOM killer fires the moment the cgroup's anonymous memory crosses the limit — including the stranded pages inside fragmented hugepages. A pod with limits.memory: 12Gi running an allocator that holds 2.4 GB of stranded hugepages (one live byte per 2 MB region, like the demo above) effectively has 9.6 GB of usable memory: it will OOM at 12 GB resident even though its live heap is only 9.6 GB. This is why Hotstar disables THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) on the manifest tier despite the TLB-miss CPU cost — the predictability of 4 KB reclaim is worth more under cgroup limits than the TLB savings.
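
A sketch of the corresponding check from inside the pod — residency (which includes stranded hugepage bytes) against the cgroup v2 limit; the paths assume the container's cgroup is mounted at the usual /sys/fs/cgroup root:

# cgroup_headroom_sketch.py — how close is this cgroup to its OOM line?
def read_val(path: str) -> float:
    with open(path) as f:
        raw = f.read().strip()
    return float("inf") if raw == "max" else float(raw)

current = read_val("/sys/fs/cgroup/memory.current")  # stranded hugepages count here
limit   = read_val("/sys/fs/cgroup/memory.max")
print(f"resident: {current / 2**20:.0f} MB")
if limit == float("inf"):
    print("limit: unlimited")
else:
    print(f"limit: {limit / 2**20:.0f} MB, "
          f"headroom: {(limit - current) / 2**20:.0f} MB")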

The NUMA interaction nobody talks about

A 2 MB hugepage lives on exactly one NUMA node. If your worker thread runs on socket 0 but the hugepage was allocated on socket 1 (because khugepaged promoted it later, and at promotion time only socket-1 contiguous memory was available), every access incurs a remote-NUMA hit — 120 ns instead of 80 ns. The win from one TLB entry per 512 pages is partially eaten by the remote hop. numactl --membind=0 --cpunodebind=0 plus MALLOC_CONF=thp:always on a NUMA box avoids this; just enabling THP without binding can leave the per-pod allocation pattern dependent on which socket happened to have free contiguous memory at the moment khugepaged ran. Flipkart's catalogue tier learned this during Big Billion Days 2024: a 9% throughput regression after enabling THP turned out to be 100% remote-NUMA hugepage placement on half the workers.
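
The binding pattern, as a command sketch (the binary name is a stand-in); numastat then shows whether the process's memory actually landed on the intended node:

# Pin CPU and memory to socket 0 so khugepaged cannot place hugepages remotely
numactl --membind=0 --cpunodebind=0 env MALLOC_CONF=thp:always ./order-router &
numastat -p $!        # per-node breakdown; everything should sit on node 0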

Reproduce this on your laptop

sudo apt install python3.11 python3.11-venv linux-tools-common linux-tools-generic
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip numpy

# Check current THP state
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /proc/meminfo | grep -i huge

# Stride benchmark — 4 KB vs 2 MB pages
python3 tlb_huge_pages_demo.py

# RSS bloat demo — watch RSS not reclaim under hugepage stranding
python3 huge_page_bloat_demo.py

# To reproduce the contrast, flip the system knob. Note: the bloat demo
# calls madvise(MADV_HUGEPAGE) itself, so it strands under both always
# and madvise — only never keeps it on 4 KB pages.
echo never  | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# then re-run; the demo stays on 4 KB pages and RSS reclaims fully
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# re-run again; the demo now strands the hugepages

You should see the stride benchmark finish 5–8× faster on the hugepage path on a desktop with at least 2 GB of free contiguous memory at run time, and the RSS-bloat demo go from full reclamation under never to near-zero reclamation under always (or madvise — the demo opts in itself) when one live byte per region pins each hugepage.

Where this leads next

Hugepages are the highest-leverage TLB knob in the memory hierarchy and the highest-risk fragmentation knob in the allocator. The next chapters walk the fallout.
