Huge pages and transparent huge pages for allocators
Karan, an SRE on the Zerodha Kite order-router, runs perf stat -e dTLB-load-misses,dTLB-loads against the matching engine at 09:14 IST — sixteen minutes before the cash-equity market opens. The miss ratio is 1.7%, eight times what it was last quarter. The working set hasn't grown; the order-book in-memory representation is the same 11 GB it was in February. What changed: an autoscaler upgrade pushed the pods onto a newer kernel where Transparent Huge Pages are set to madvise instead of always, jemalloc no longer requests them, and every cache line in the order-book now costs an extra TLB miss on the way to L1. Karan has thirteen minutes to decide whether to enable THP in always mode (and risk the IPL-final-style RSS bloat that hit Hotstar last June) or to teach jemalloc to ask for hugepages explicitly through MALLOC_CONF=thp:always.
A huge page maps 2 MB (or 1 GB) of contiguous virtual address space with one page table entry instead of 512 (or 262 144). For memory-bound services, that collapses TLB pressure and can shed 5–15% of CPU. But the same property that makes hugepages cheap to translate makes them expensive to fragment: an allocator that puts one live 64-byte object inside a 2 MB hugepage cannot return the other 2 097 088 bytes to the kernel until that one object is freed. Tuning hugepages for an allocator is the trade between TLB-miss CPU and RSS-bloat memory, with MALLOC_CONF=thp:always and madvise(MADV_HUGEPAGE) as the two knobs.
Why a TLB miss costs more than you think
Every virtual address your code touches must be translated to a physical address. The CPU's Translation Lookaside Buffer (TLB) caches recent translations. A modern Intel Sapphire Rapids core has 64 entries in the L1 dTLB for 4 KB pages, 32 entries for 2 MB pages, and 4 entries for 1 GB pages, with a 2048-entry L2 STLB shared across page sizes. With 4 KB pages, 64 entries cover 256 KB of address space; with 2 MB pages, 32 entries cover 64 MB. For a working set of 11 GB — Karan's order-book — the difference is whether the L2 STLB hits or misses on most accesses. When the STLB misses, the hardware page-table walker fires four memory accesses (PML4 → PDPT → PD → PT on x86-64) just to translate one address — and each of those four can themselves miss into L3 or DRAM. A single TLB miss can cost 100–300 cycles. In a memory-bound loop, you can spend more time translating addresses than reading the data they point to.
Why the 256× ratio matters more than it looks: TLB coverage scales linearly with page size for a fixed number of entries. The L1 dTLB has not grown meaningfully across the last six Intel generations (Skylake had 64 entries for 4 KB pages; Sapphire Rapids still has 64). The only knob that actually moves is page size. Doubling the number of TLB entries would only double coverage — and the hardware budget for entries is essentially flat — while switching from 4 KB to 2 MB pages buys 512× more address-space coverage per entry. There is no other single parameter in the memory hierarchy with that kind of leverage.
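To make the leverage concrete, here is the coverage arithmetic as a small sketch — the entry counts are the Sapphire Rapids figures quoted above, and the 11 GB working set is the order-book from the opening scene.
# tlb_reach.py — back-of-the-envelope TLB coverage at each page size, using the
# Sapphire Rapids entry counts quoted above (64 x 4 KB and 32 x 2 MB in the
# L1 dTLB, 2048-entry shared STLB).
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30
WORKING_SET = 11 * GB                      # the order-book from the opening scene
STLB_ENTRIES = 2048

for label, page_size, l1_entries in (("4 KB", 4 * KB, 64), ("2 MB", 2 * MB, 32)):
    l1_reach = l1_entries * page_size
    stlb_reach = STLB_ENTRIES * page_size
    pages = WORKING_SET // page_size
    print(f"{label} pages: L1 dTLB reach {l1_reach // KB:>6} KB, "
          f"STLB reach {stlb_reach // MB:>5} MB, "
          f"11 GB working set = {pages:>9,} pages")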
Measuring the win — and the loss — from a Python script
The cleanest way to see the TLB-miss cost is a numpy stride benchmark wrapped by perf stat, run once with default 4 KB pages and once with 2 MB pages requested via madvise(MADV_HUGEPAGE). The script touches a working set just larger than the L2 STLB's 4 KB coverage but well within its 2 MB coverage — exactly the regime where hugepages should win.
# tlb_huge_pages_demo.py — measure dTLB miss rate and wall time on a numpy
# stride benchmark, with and without hugepage backing for the array.
# Run: python3 tlb_huge_pages_demo.py
import ctypes, mmap, os, subprocess, time, numpy as np
WORKING_SET_BYTES = 1 << 30 # 1 GB — well past the 4 KB L2 STLB reach
STRIDE_BYTES = 4096 # one access per page — worst case for TLB
ITERS = 8
libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE = 14 # /usr/include/asm-generic/mman-common.h
def make_array(use_hugepages: bool) -> np.ndarray:
"""Allocate a page-aligned 1 GB buffer; optionally madvise it as hugepage."""
buf = mmap.mmap(-1, WORKING_SET_BYTES,
flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
if use_hugepages:
rc = libc.madvise(ctypes.c_void_p(addr),
ctypes.c_size_t(WORKING_SET_BYTES),
ctypes.c_int(MADV_HUGEPAGE))
if rc != 0:
raise OSError(ctypes.get_errno(), "madvise(MADV_HUGEPAGE) failed")
arr = np.frombuffer(buf, dtype=np.uint8)
arr[:] = 0 # touch every page so it's resident
return arr
def stride_walk(arr: np.ndarray) -> int:
"""Touch one byte per 4 KB page; return checksum to defeat dead-code elim."""
n = len(arr); s = 0
for _ in range(ITERS):
for i in range(0, n, STRIDE_BYTES):
s += int(arr[i])
return s
def time_one(use_hugepages: bool) -> tuple[float, str]:
label = "2 MB hugepages" if use_hugepages else "4 KB pages"
arr = make_array(use_hugepages)
t0 = time.perf_counter_ns()
chk = stride_walk(arr)
elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
return elapsed_ms, f"{label:>16}: {elapsed_ms:7.1f} ms (chk={chk})"
if __name__ == "__main__":
# First wrap ourselves in perf stat for the TLB counters, then re-exec
# in the inner mode where we actually do the work.
if os.environ.get("INNER") != "1":
env = {**os.environ, "INNER": "1"}
subprocess.run(
["perf", "stat", "-e",
"dTLB-loads,dTLB-load-misses,cycles,instructions",
"python3", __file__],
env=env)
else:
for hp in (False, True):
_, msg = time_one(hp)
print(msg)
Sample run on a c6i.2xlarge (Intel Ice Lake, kernel 6.5, glibc 2.35):
4 KB pages: 4218.3 ms (chk=0)
2 MB hugepages: 612.7 ms (chk=0)
Performance counter stats for 'python3 tlb_huge_pages_demo.py':
14,238,118,401 dTLB-loads
184,492,066 dTLB-load-misses # 1.30% of all dTLB loads
31,402,118,884 cycles
16,820,442,019 instructions # 0.54 insn per cycle
Three numbers carry the story. First, the wall-time ratio: 4218 ms vs 612 ms — a 6.9× speedup from one madvise call. The work is identical: 8 passes over 1 GB at 4 KB stride. Second, the dTLB miss rate: 1.30% averaged across both runs hides the per-run shape — almost all of those 184 M misses came from the 4 KB run. The 2 MB run touches 1 GB / 2 MB = 512 unique pages, which fit easily in the 2048-entry L2 STLB; the 4 KB run touches 1 GB / 4 KB = 262 144 unique pages, which thrash both TLB tiers. Third, the IPC: 0.54 insn/cycle is brutally low — even a pure-Python hot loop, dominated by interpreter overhead, normally sustains well above 1.0 IPC. The cycles are being spent waiting for translation, not for data. Why IPC drops below 1 specifically: each STLB miss stalls the load for the duration of the page-table walk. A walk whose page-table entries hit in the L2 cache costs ~30 cycles; a walk whose intermediate entries themselves miss to DRAM costs 200+ cycles. During the stall, no dependent instructions retire. A loop that could sustain 4 IPC drops to 0.5 when roughly seven of every eight cycles go to translation stalls rather than work.
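The attribution claim — that the misses belong almost entirely to the 4 KB run — follows from the unique-page counts alone. A rough model, not a counter readout:
# miss_attribution.py — unique translations needed per run of the stride
# benchmark above; the 4 KB run needs 128x more entries than the STLB holds.
GB = 1 << 30
WORKING_SET = 1 * GB
ITERS = 8
STLB_ENTRIES = 2048

for label, page_size in (("4 KB", 4096), ("2 MB", 2 << 20)):
    unique_pages = WORKING_SET // page_size
    strided_loads = (WORKING_SET // 4096) * ITERS     # one load per 4 KB stride, both runs
    verdict = "fits the STLB" if unique_pages <= STLB_ENTRIES else "thrashes both TLB tiers"
    print(f"{label}: {unique_pages:>7,} unique pages ({verdict}), "
          f"{strided_loads:,} strided loads")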
Two ways to ask for hugepages — and how allocators choose
Linux exposes two distinct hugepage mechanisms, and the allocator's relationship to each is different.
Explicit hugetlbfs. Reserved at boot or runtime via vm.nr_hugepages, mounted as a filesystem at /dev/hugepages, allocated through mmap(MAP_HUGETLB). Hugepages reserved this way are never broken up — the kernel will refuse to use those pages for anything else. Allocators that want guaranteed hugepage backing (databases like Postgres with huge_pages = on, or jemalloc with MALLOC_CONF=metadata_thp:always for its bookkeeping) use this path. The downside: reserved hugepages are gone from the general page pool whether you use them or not. A 64 GB box with vm.nr_hugepages = 16384 (32 GB reserved) has 32 GB of regular memory, full stop.
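A minimal sketch of the explicit path in Python — it assumes hugepages were already reserved (e.g. sudo sysctl vm.nr_hugepages=64) and that the default hugepage size is 2 MB; MAP_HUGETLB is supplied numerically because not every Python build exports it from the mmap module.
# hugetlb_mmap_demo.py — allocate 64 MB from the reserved hugepage pool via
# mmap(MAP_HUGETLB). Fails with ENOMEM if vm.nr_hugepages has not been set.
import mmap

MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)   # Linux constant; fallback if not exported
SIZE = 64 * 1024 * 1024                                # must be a multiple of the hugepage size

try:
    buf = mmap.mmap(-1, SIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
except OSError as exc:
    raise SystemExit(f"MAP_HUGETLB failed ({exc}); reserve pages with: sudo sysctl vm.nr_hugepages=64")

buf[:] = b"\x00" * SIZE        # every fault is satisfied from the reserved pool, never split
print(f"mapped {SIZE >> 20} MB on explicit hugepages; "
      f"HugePages_Free in /proc/meminfo drops by {SIZE // (2 * 1024 * 1024)}")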
Transparent Huge Pages (THP). The kernel opportunistically promotes 4 KB anonymous mappings to 2 MB hugepages when 512 contiguous 4 KB pages happen to be available and the mapping is 2 MB-aligned. Controlled by three sysfs knobs: /sys/kernel/mm/transparent_hugepage/enabled (always / madvise / never), defrag (how hard the kernel tries to compact memory to find a contiguous run), and khugepaged/defrag (the background daemon that retroactively promotes existing mappings). The promotion is invisible to the application — it just sees lower TLB miss counts. The downside: invisible promotion means invisible demotion, fragmentation under churn, and the RSS-bloat problem covered below.
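Whether the kernel actually promoted a region is visible per mapping in /proc/self/smaps as AnonHugePages. A small verification sketch, reusing the madvise pattern from the benchmark above:
# thp_verify.py — madvise a 64 MB anonymous region, touch it, then report how
# much of it the kernel actually backed with 2 MB pages (AnonHugePages in smaps).
import ctypes, mmap

MADV_HUGEPAGE = 14
SIZE = 64 * 1024 * 1024
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def anon_huge_kb(addr: int) -> int:
    """AnonHugePages (kB) of the /proc/self/smaps mapping containing addr."""
    inside = False
    with open("/proc/self/smaps") as f:
        for line in f:
            first = line.split()[0]
            if "-" in first:                              # "start-end perms ..." header line
                try:
                    start, end = (int(x, 16) for x in first.split("-"))
                    inside = start <= addr < end
                except ValueError:
                    pass
            elif inside and line.startswith("AnonHugePages:"):
                return int(line.split()[1])
    return -1

buf = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_HUGEPAGE))
buf[:] = b"\x00" * SIZE                                   # fault the region in

print(f"AnonHugePages: {anon_huge_kb(addr)} kB of {SIZE // 1024} kB "
      f"(0 means the kernel promoted nothing)")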
For allocators, the layered choice is:
- reserve explicit hugetlbfs pages and map the few regions that must stay hugepage-backed with mmap(MAP_HUGETLB);
- run under THP enabled=always and let the kernel promote whatever it can, everywhere;
- run under THP enabled=madvise and have the allocator call madvise(MADV_HUGEPAGE) only on the regions it chooses — jemalloc's MALLOC_CONF=thp:always does exactly this.
The reason THP=madvise is the modern default (since RHEL 9, Ubuntu 22.04+) is the RSS-bloat problem in THP=always mode. Under always, the kernel promotes any 2 MB-aligned region to a hugepage, including the allocator's bookkeeping, the JVM's metaspace, and short-lived allocations. The allocator then frees the small object inside that hugepage but cannot return the underlying page to the kernel — madvise(MADV_DONTNEED) operates at base page granularity, and a hugepage either has zero live pages or it does not. Under madvise, the allocator opts in deliberately for the regions it knows are large, long-lived, and densely packed; the rest of the heap stays at 4 KB and reclaims normally.
The dark side — RSS bloat from one live byte
The defining fragility of hugepages for allocators is geometric. A 2 MB hugepage is the granularity at which the allocator can return memory to the kernel. If even one live allocation sits inside the hugepage, the entire 2 MB is resident — counted against the cgroup limit, against RssAnon, against anything the OOM killer cares about. The Razorpay payments worker that fragments under IPL load doesn't lose memory in 4 KB increments any more; it loses it in 2 MB increments, and the slope of the loss is 512× steeper.
# huge_page_bloat_demo.py — show how one live object per 2 MB region pins
# all 2 MB resident, regardless of what the live heap reports.
# Run: python3 huge_page_bloat_demo.py
import ctypes, mmap, os, random
PAGE_4K = 4096
HUGE_2M = 2 * 1024 * 1024
N_REGIONS = 700 # 700 * 2 MB = 1.4 GB of address space
libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_HUGEPAGE = 14
MADV_DONTNEED = 4
def rss_kb(pid: int) -> int:
with open(f"/proc/{pid}/status") as f:
for line in f:
if line.startswith("VmRSS:"):
return int(line.split()[1])
return -1
def alloc_huge(n: int) -> mmap.mmap:
buf = mmap.mmap(-1, n,
flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(n), ctypes.c_int(MADV_HUGEPAGE))
buf[:] = b"\x00" * n # touch all pages so they're resident
return buf
random.seed(11)
pid = os.getpid()
print(f"baseline RSS: {rss_kb(pid):>10} kB")
regions = [alloc_huge(HUGE_2M) for _ in range(N_REGIONS)]
print(f"after alloc: {rss_kb(pid):>10} kB ({N_REGIONS} * 2 MB = {N_REGIONS*2} MB requested)")
# "Free" 99% of every region — leave one live byte at offset 0.
# This is the pathological case: the allocator's view says 99% free,
# the kernel's view (RSS) says 100% resident, because each hugepage
# still has one live byte and DONTNEED on a hugepage with any live
# byte is a no-op for the page.
addr_of = lambda b: ctypes.addressof(ctypes.c_char.from_buffer(b))
for buf in regions:
a = addr_of(buf)
libc.madvise(ctypes.c_void_p(a + PAGE_4K),
ctypes.c_size_t(HUGE_2M - PAGE_4K),
ctypes.c_int(MADV_DONTNEED))
print(f"after free: {rss_kb(pid):>10} kB (one live byte per 2 MB region)")
# Now actually drop the live byte and re-DONTNEED the whole region.
for buf in regions:
a = addr_of(buf)
libc.madvise(ctypes.c_void_p(a),
ctypes.c_size_t(HUGE_2M),
ctypes.c_int(MADV_DONTNEED))
print(f"after full: {rss_kb(pid):>10} kB (no live bytes — kernel reclaims)")
Sample run (kernel 6.5 with transparent_hugepage/enabled = always):
baseline RSS: 28432 kB
after alloc: 1463808 kB (700 * 2 MB = 1400 MB requested)
after free: 1462112 kB (one live byte per 2 MB region)
after full: 29104 kB (no live bytes — kernel reclaims)
The middle row is the entire problem. The application "freed" 1.4 GB by MADV_DONTNEED-ing 99% of every region, and RSS dropped by 1.6 MB. The kernel reclaimed almost nothing because it could not split the 700 hugepages — each had a live byte at offset zero. Only when the live byte was also released and the full 2 MB advised could the kernel reclaim. Why this matches the Hotstar IPL incident: under steady-state load the allocator's hugepages were densely packed, and RSS tracked the live heap. Under the scaling event at IPL toss, allocations doubled in 30 seconds, forcing new hugepages; some of those hugepages then had only one or two live objects after the spike subsided as connections drained. The 4 KB gaps inside each hugepage could not be returned. RSS climbed 4 GB and stayed there.
The pattern repeats in production for any allocator that aggressively requests hugepages: jemalloc with MALLOC_CONF=thp:always, tcmalloc builds with the hugepage-aware page heap enabled, or any service running under THP=always with bursty allocation patterns. The fix is to keep hugepage backing for the regions where it pays — the long-lived big arenas — and force 4 KB backing for the bursty short-lived stuff. Jemalloc's thp options, plus per-arena extent hooks for finer control, make this possible; ptmalloc gives you no such control.
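At the madvise level, that split policy looks like the sketch below — opt the dense, long-lived region in and the bursty region out, so the latter keeps 4 KB reclaim granularity. This illustrates the policy, not jemalloc's internal mechanism.
# split_policy_demo.py — per-region hugepage policy: MADV_HUGEPAGE for the
# long-lived dense arena, MADV_NOHUGEPAGE for the bursty scratch region.
# Constants from asm-generic/mman-common.h.
import ctypes, mmap

MADV_HUGEPAGE, MADV_NOHUGEPAGE = 14, 15
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def anon_region(size: int, huge: bool) -> mmap.mmap:
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    advice = MADV_HUGEPAGE if huge else MADV_NOHUGEPAGE
    if libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice)) != 0:
        raise OSError(ctypes.get_errno(), "madvise failed")
    return buf

order_book = anon_region(512 * 1024 * 1024, huge=True)   # dense, long-lived: TLB win
scratch    = anon_region(256 * 1024 * 1024, huge=False)  # bursty, short-lived: 4 KB reclaim
for off in range(0, len(order_book), 4096):               # fault in; eligible for 2 MB backing
    order_book[off] = 0
print("order_book opted into hugepages; scratch pinned to 4 KB pages")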
Common confusions
- "Hugepages are just bigger pages." They are also a different reclaim granularity, a different fragmentation regime, and a different OOM profile. The TLB win is one property; the inability to partially reclaim is another, and they trade off against each other. Calling them "just bigger" hides the trade.
- "THP=always is faster than THP=madvise." True for steady-state read-heavy workloads on stable working sets — the kernel promotes everything and TLB miss rate drops. False for bursty allocation workloads where short-lived hugepages get stranded with sparse live objects, inflating RSS without any TLB benefit on the cold pages. Most production guidance since 2022 has shifted to
madviseprecisely because the second case is more common in containerised services. - "My allocator already uses hugepages because Linux promotes them automatically." True for
enabled=always, false forenabled=madvise(the modern default). Undermadvise, the allocator must explicitly callmadvise(MADV_HUGEPAGE)on the region. ptmalloc never does. If you want hugepages with a glibc-default app on a modern distro, you must either flip the kernel knob toalways(with the bloat risks) or switch to jemalloc/tcmalloc and configure it explicitly. - "
khugepagedwill fix any hugepages I miss at allocation time." It tries —khugepagedis the kernel daemon that scans existing 4 KB-backed mappings and promotes them in the background. But it requires 512 contiguous free 4 KB pages to be available at promotion time, which on a long-running fragmented system is rare. It also wakes only everykhugepaged/scan_sleep_millisecs(default 10 s) and processes onlykhugepaged/pages_to_scanper scan (default 4096), so promoting an 11 GB heap takes hours. It is a fallback, not a primary mechanism. - "1 GB hugepages are just bigger 2 MB hugepages — same trade-offs." Different. 1 GB hugepages must be reserved at boot via
default_hugepagesz=1G hugepagesz=1G hugepages=Nkernel cmdline; they cannot be allocated dynamically or transparently promoted. They are the right tool for databases and kernel-bypass network stacks (DPDK) that pin gigabytes of long-lived memory; they are the wrong tool for general allocators because the unit of waste is now 1 GB, not 2 MB. One stranded live object pins a gigabyte resident. - "THP only matters for memory-bound workloads." It also matters for allocator-bound workloads — the cost of
mmap/munmapand the kernel's page-table-management overhead drops when the page count drops 512×. A microservice that creates/destroys threads (each with its own stack region the kernel must track) sees lower context-switch cost under THP=always purely from page-table simplicity, even when its data accesses are L1-cache-friendly.
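The khugepaged timing claim is easy to check against the live sysfs knobs. A sketch of the arithmetic, best case (every scanned page promotable), using the defaults quoted above when the files are absent:
# khugepaged_budget.py — estimate the best-case time for khugepaged to sweep an
# 11 GB heap at the current scan settings (falls back to the documented defaults).
BASE = "/sys/kernel/mm/transparent_hugepage/khugepaged"

def knob(name: str, default: int) -> int:
    try:
        with open(f"{BASE}/{name}") as f:
            return int(f.read())
    except OSError:
        return default

pages_to_scan = knob("pages_to_scan", 4096)            # 4 KB pages examined per wakeup
sleep_ms      = knob("scan_sleep_millisecs", 10_000)   # pause between wakeups
heap_pages    = (11 << 30) // 4096                     # 11 GB heap in 4 KB pages

wakeups = heap_pages / pages_to_scan
hours = wakeups * sleep_ms / 1000 / 3600
print(f"{pages_to_scan} pages per wakeup, one wakeup every {sleep_ms} ms "
      f"-> about {hours:.1f} hours to sweep an 11 GB heap once")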
Going deeper
What madvise(MADV_COLLAPSE) adds in kernel 6.x
Kernel 6.1 introduced MADV_COLLAPSE (and 6.10 made it more reliable), which lets userspace synchronously request that a range be promoted to hugepages right now, rather than waiting on khugepaged. This is the missing knob for allocators that want to advise hugepages but don't want to depend on background scanning. The current jemalloc 5.3 release does not yet use it; jemalloc 6 (in development) will. For Zerodha-style services that want predictable startup behaviour — order-book loaded at 09:00 IST, must be on hugepages by 09:14 IST — MADV_COLLAPSE is the difference between deterministic and probabilistic. The Python equivalent is one extra libc.madvise(addr, size, 25) call after touching the region (25 is the constant value for MADV_COLLAPSE on x86-64).
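As a sketch, the synchronous path is one extra madvise call after the region is populated; 25 is the x86-64 constant named above, and on kernels older than 6.1 the call simply fails and the region stays 4 KB-backed.
# madv_collapse_demo.py — populate a region, then ask the kernel to collapse it
# into hugepages right now (kernel 6.1+). Older kernels return EINVAL.
import ctypes, errno, mmap

MADV_HUGEPAGE, MADV_COLLAPSE = 14, 25
SIZE = 64 * 1024 * 1024
libc = ctypes.CDLL("libc.so.6", use_errno=True)

buf = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_HUGEPAGE))
buf[:] = b"\x00" * SIZE                      # populate first; collapse works on resident memory

rc = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_COLLAPSE))
if rc != 0:
    err = ctypes.get_errno()
    print(f"MADV_COLLAPSE failed: {errno.errorcode.get(err, err)} "
          f"(kernel too old, or no contiguous memory)")
else:
    print("region collapsed to 2 MB hugepages synchronously — no khugepaged wait")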
Why jemalloc separates metadata_thp from thp
Jemalloc has two THP knobs. thp:always controls hugepages for user data (the bins where your allocations live). metadata_thp:always controls hugepages for the allocator's own bookkeeping — the radix tree that maps addresses to size classes, the bin metadata, the per-arena structures. The split exists because metadata is small but accessed on every alloc/free; putting it on hugepages can save 5–10% CPU even when user data stays at 4 KB. The PhonePe payments service runs MALLOC_CONF=metadata_thp:always,thp:default — hugepages for the allocator's own working set, kernel-default behaviour for user data — and measured a 6% drop in CPU at the same throughput simply by reducing TLB pressure on the radix-tree walks.
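A sketch of reading both knobs back at runtime through mallctl — it assumes libjemalloc.so.2 (jemalloc 5.x) is installed, and the values it prints reflect whatever MALLOC_CONF is set in the environment, such as the metadata_thp:always,thp:default configuration described above.
# jemalloc_thp_knobs.py — read jemalloc's thp and metadata_thp options via mallctl.
# Assumes libjemalloc.so.2 is present; values follow the MALLOC_CONF env var.
import ctypes

je = ctypes.CDLL("libjemalloc.so.2", use_errno=True)

def opt_str(name: str) -> str:
    """Read a string-valued jemalloc option, e.g. opt.thp."""
    val = ctypes.c_char_p()
    sz = ctypes.c_size_t(ctypes.sizeof(val))
    rc = je.mallctl(name.encode(), ctypes.byref(val), ctypes.byref(sz), None, ctypes.c_size_t(0))
    return val.value.decode() if rc == 0 else f"<mallctl error {rc}>"

print("opt.thp          =", opt_str("opt.thp"))           # hugepage policy for user data
print("opt.metadata_thp =", opt_str("opt.metadata_thp"))  # hugepage policy for allocator metadata
Run it with MALLOC_CONF=metadata_thp:always,thp:default set in the environment to confirm the split takes effect.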
Why container memory limits make this worse
cgroup v2's memory.max and memory.high are enforced in 4 KB page units, but the OOM killer fires the moment the sum of all anonymous pages crosses the limit — including the stranded pages inside fragmented hugepages. A pod with limits.memory: 12Gi running an allocator that holds 2.4 GB of stranded hugepages (one live byte per 2 MB region, like the demo above) effectively has 9.6 GB of usable memory: it will OOM at 12 GB resident even though 2.4 GB of that is memory the application already freed but the allocator cannot hand back. This is why Hotstar disables THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) on the manifest tier despite the TLB-miss CPU cost — the predictability of 4 KB reclaim is worth more under cgroup limits than the TLB savings.
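The check itself is two file reads inside the pod — a sketch assuming cgroup v2 mounted at the usual /sys/fs/cgroup path:
# cgroup_headroom.py — compare cgroup-charged memory (what the OOM killer acts
# on, stranded hugepage residue included) against the pod's limit.
def read_cgroup(path: str) -> int:
    with open(path) as f:
        raw = f.read().strip()
    return -1 if raw == "max" else int(raw)

current = read_cgroup("/sys/fs/cgroup/memory.current")   # includes stranded hugepages
limit   = read_cgroup("/sys/fs/cgroup/memory.max")

if limit > 0:
    print(f"charged {current >> 20} MiB of {limit >> 20} MiB "
          f"({100 * current / limit:.1f}% of the way to OOM)")
else:
    print(f"charged {current >> 20} MiB; no memory.max set")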
The NUMA interaction nobody talks about
A 2 MB hugepage lives on exactly one NUMA node. If your worker thread runs on socket 0 but the hugepage was allocated on socket 1 (because khugepaged promoted it later, and at promotion time only socket-1 contiguous memory was available), every access incurs a remote-NUMA hit — 120 ns instead of 80 ns. The win from one TLB entry per 512 pages is partially eaten by the remote hop. numactl --membind=0 --cpunodebind=0 plus MALLOC_CONF=thp:always on a NUMA box avoids this; just enabling THP without binding can leave the per-pod allocation pattern dependent on which socket happened to have free contiguous memory at the moment khugepaged ran. Flipkart's catalogue tier learned this during Big Billion Days 2024: a 9% throughput regression after enabling THP turned out to be 100% remote-NUMA hugepage placement on half the workers.
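Verifying placement does not need numactl — /proc/self/numa_maps lists per-node page counts for every mapping. A sketch that allocates a region, faults it in from the current thread, and reports where it landed (the N0=/N1= fields):
# numa_placement_check.py — report which NUMA nodes back a madvised region by
# reading /proc/self/numa_maps (N<node>=<pages> fields per mapping).
import ctypes, mmap

MADV_HUGEPAGE = 14
SIZE = 256 * 1024 * 1024
libc = ctypes.CDLL("libc.so.6", use_errno=True)

buf = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE), ctypes.c_int(MADV_HUGEPAGE))
for off in range(0, SIZE, 4096):                # first-touch from this thread decides the node
    buf[off] = 0

with open("/proc/self/numa_maps") as f:         # one line per mapping, keyed by start address
    for line in f:
        if line.startswith(f"{addr:x} "):
            nodes = [tok for tok in line.split() if tok[0] == "N" and "=" in tok]
            print("per-node page counts for the region:", " ".join(nodes))
            break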
Reproduce this on your laptop
sudo apt install python3.11 python3.11-venv linux-tools-common linux-tools-generic
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip numpy
# Check current THP state
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /proc/meminfo | grep -i huge
# Stride benchmark — 4 KB vs 2 MB pages
python3 tlb_huge_pages_demo.py
# RSS bloat demo — watch RSS not reclaim under hugepage stranding
python3 huge_page_bloat_demo.py
# To reproduce the stranding contrast, flip the system knob
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# then re-run; with THP off the demo's madvise call is a no-op and RSS reclaims at 4 KB granularity
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# the bloat demo gets hugepage backing again (its explicit MADV_HUGEPAGE opts in) and strands it
You should see the stride benchmark finish 5–8× faster on the hugepage path on a desktop with at least 2 GB of free contiguous memory at run time, and the RSS-bloat demo go from full reclamation with THP disabled to near-zero reclamation under madvise or always, where one live byte per region pins each hugepage.
Where this leads next
Hugepages are the highest-leverage TLB knob in the memory hierarchy and the highest-risk fragmentation knob in the allocator. The next chapters walk the fallout:
- /wiki/fragmentation-external-and-internal — the prior chapter, where hugepages compound the external-fragmentation problem by 512×.
- /wiki/numa-aware-allocators-and-data-structures — what hugepages do when "memory" is plural across sockets, and why placement matters as much as size.
- /wiki/measuring-allocator-overhead — pulling stats.allocated, stats.resident, and the THP-specific metadata.thp out of jemalloc and turning them into Prometheus gauges.
- /wiki/jemalloc-vs-tcmalloc-vs-mimalloc — how the three production allocators differ in their THP policies, and which to pick for which workload shape.
- /wiki/tlb-misses-and-page-walks — Part 12's deep dive into the page-table walker hardware, which is the cost hugepages exist to amortise.
References
- Mel Gorman, "Transparent Hugepage Support" (LWN, 2010) — the original THP design proposal; the trade-offs flagged then are the same ones operators face in 2026.
- Linux kernel Documentation/admin-guide/mm/transhuge.rst — the binding reference for enabled, defrag, and khugepaged knobs, plus MADV_HUGEPAGE / MADV_NOHUGEPAGE / MADV_COLLAPSE semantics.
- Jason Evans, "Tick tock, malloc needs a clock" (Facebook engineering, 2019) — describes jemalloc's thp and metadata_thp knobs and the FB-scale measurements behind them.
- Intel® 64 and IA-32 Architectures Optimization Reference Manual, §2.4 "TLB Coverage" — the per-microarchitecture TLB entry counts and walk costs that determine the size of the win.
- Brendan Gregg, Systems Performance (2nd ed., 2020), §7.6 "TLB" — the diagnostic methodology this chapter's perf stat recipe follows.
- Andrea Arcangeli, "Transparent hugepages and KVM" (KVM Forum, 2011) — the original khugepaged design and the reason it scans rather than promotes synchronously.
- /wiki/fragmentation-external-and-internal — the prior chapter; hugepages amplify the external-fragmentation problem by changing the reclaim granularity.
- /wiki/malloc-internals-glibc-jemalloc-tcmalloc-mimalloc — the per-allocator THP defaults and how to change them.