Wall: some overheads are invisible
Karan runs the rewards engine at CRED. The service does one job: take a transaction event from the Razorpay-lookalike rail it consumes, look up the user's reward state, decide whether the transaction earns coins, write the new state back. He benchmarked his Rust allocator on his MacBook last week — jemalloc at 28 ns per allocation, p99 at 84 ns, no surprises. He pushed the new build to the c6i.4xlarge production fleet on Tuesday at 14:30 IST. By 14:34 the dashboard p99 had climbed from 6 ms to 31 ms. CPU was unchanged. The allocator's own self-reported stats said 31 ns per allocation — effectively the laptop number. Karan opened a flamegraph and found 18% of CPU time inside the kernel: __handle_mm_fault, clear_page_erms, __rmqueue_pcplist. None of these names appeared in his microbenchmark. They had been there the whole time on the laptop too — but at the laptop's allocation rate of a few thousand per second, the per-call overhead was invisible. At production's 4.2 million allocations per second across 64 cores, the same per-call overhead became the single largest line item on the profile.
A microbenchmark measures what a function does; production measures what the function asks the kernel and the hardware to do on its behalf. The gap is the invisible overhead — TLB misses, first-touch page zeroing, VMA lock contention, NUMA round-trips, and the cache fills that profilers attribute to the wrong frame. Knowing the allocator is necessary, but knowing the costs it hides from you is what separates a microbenchmark from a production fix.
Five overheads your microbenchmark cannot see
A microbenchmark in a tight loop on one core, with a warm cache and a warm TLB, measures one thing: the cost of the allocator's own code path. A real service running at 4 million allocations per second across 64 cores measures something else: the sum of the allocator's code path plus every cost the allocator pushes downstream. The two numbers diverge by 50–500× in production, and the divergence is structural, not accidental.
TLB misses on freshly-touched pages. The allocator returns a pointer to a page the CPU has never accessed. The first instruction that touches the page generates a TLB miss — the page-walk takes 30–100 cycles even for a 4 KB page. With 2 MB huge pages disabled (the default on most distros), a 64 MB working set touches 16,384 pages, each a potential TLB miss. On Skylake-X the dTLB has 1,536 entries; you start thrashing it the moment your working set exceeds 6 MB.
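The coverage arithmetic is worth making explicit. A minimal sketch, using the entry count cited above as an assumption (it varies by CPU model):

```python
# Back-of-envelope TLB coverage: how much working set a fully-populated
# TLB can translate before it starts thrashing.
def tlb_coverage_mb(entries: int, page_size: int) -> float:
    """MB of working set covered when every TLB entry maps one page."""
    return entries * page_size / (1024 * 1024)

DTLB_ENTRIES = 1_536  # Skylake-X dTLB figure from the text (an assumption)

print(tlb_coverage_mb(DTLB_ENTRIES, 4096))             # 6.0  MB with 4 KB pages
print(tlb_coverage_mb(DTLB_ENTRIES, 2 * 1024 * 1024))  # 3072.0 MB with 2 MB pages

# The 64 MB working set from the text needs 16,384 distinct 4 KB pages —
# more than ten times the dTLB's capacity.
print(64 * 1024 * 1024 // 4096)                        # 16384
```

The 512× jump between the two coverage numbers is the entire case for huge pages in one division.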
First-touch page zeroing. When mmap(MAP_ANONYMOUS) returns a page, the kernel hasn't physically allocated anything — the page table entry points to the global zero page. The first write triggers a minor page fault: __handle_mm_fault allocates a real page, zeros it (clear_page_erms — a 4 KB streaming write), and updates the PTE. The cost is ~1.5 µs per page on a c6i. Your allocator, which "took 30 ns", just queued up 1.5 µs of kernel work that runs at the next memory access.
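First-touch faults are directly observable from a few lines of Python. A Linux-only sketch — it maps anonymous memory, touches one byte per page, and reads the process's minor-fault counter before and after; exact counts vary with interpreter noise and page size:

```python
# Count the minor faults that first-touch generates, then show that a second
# pass over the same (now-resident) pages generates almost none.
import mmap, resource

def minor_faults() -> int:
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

SZ = 16 * 1024 * 1024                                  # 16 MB = 4,096 4 KB pages
buf = mmap.mmap(-1, SZ, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

before = minor_faults()
for off in range(0, SZ, mmap.PAGESIZE):                # touch one byte per page
    buf[off] = 1                                       # first write: fault + zeroing
first_touch = minor_faults() - before

before = minor_faults()
for off in range(0, SZ, mmap.PAGESIZE):                # same pages, now resident
    buf[off] = 2
second_touch = minor_faults() - before

print(first_touch, second_touch)                       # roughly 4096 vs near 0
buf.close()
```

The gap between the two counts is the deferred kernel work the 30 ns malloc quietly queued up.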
VMA lock contention on mmap. Every process has a tree of virtual memory areas (VMAs) describing its address-space layout. mmap and munmap mutate this tree under mm->mmap_lock (formerly mm->mmap_sem). At low concurrency the lock is uncontended. At 64 threads all expanding their per-thread arenas, the lock serialises them. Glibc's ptmalloc releases memory back via munmap aggressively under fragmentation; the resulting lock storm shows up as kernel CPU at 30–50%, all of it spinning or sleeping on mmap_lock.
NUMA cross-socket round-trips. On a two-socket EPYC box, memory allocated on socket 0 and accessed from a thread on socket 1 costs ~190 ns per cache line vs ~78 ns local. The allocator's "first-touch" policy means whichever core wrote the page first owns the page's NUMA placement. If your service threads migrate across sockets — the default on most schedulers — half your accesses become remote. Your microbenchmark, pinned to one core on the laptop, never sees this.
Cache pollution attributed to the wrong frame. perf record samples on cycles. When the allocator returns a pointer and the caller writes 64 bytes to it, the cache miss that fetches the line is attributed to the caller's instruction pointer, not the allocator's. The flamegraph shows a fat frame in your business logic; the truth is that the allocator's policy of returning fresh, cold pages is what made that line cold. Fixing the business logic does nothing; pre-touching or warming the allocator pool removes the miss.
The five overheads share one structural property: each is an asynchronous deferred cost. The allocator's return statement looks atomic to the caller — a pointer comes back in 30 ns — but it has implicitly registered work that the kernel, the MMU, and the cache hierarchy will perform later, on the caller's own time. The cost isn't hidden because someone wanted to hide it. It is hidden because the abstraction the allocator presents — "give me memory" — does not have a vocabulary for "and please report the costs my asking will create over the next million cycles". Production engineering is largely the practice of restoring that vocabulary, one tool at a time.
Why these costs are described as "invisible" rather than "uncovered": every one of them is observable with the right tool. perf stat -e dTLB-load-misses,minor-faults,page-faults exposes TLB and zeroing. perf lock contention -- python3 service.py exposes mmap_lock. numastat -p $(pidof service) exposes NUMA placement. The reason they go uncounted is that the standard development feedback loop — write a function, microbenchmark it on your laptop, ship — never invokes those tools. The cost is hidden by the absence of the right measurement, not by any property of the underlying system. The wall is methodological, not physical.
Measuring the gap with one Python script
The clearest way to see the wall is to run the same allocation workload twice — once cold (fresh pages, cold TLB, no warmup), once warm (pool pre-populated, pages pre-touched, TLB warm) — and compare both to what perf stat reports about kernel work, page faults, and TLB misses. The Python driver wraps perf stat and parses its stderr.
# invisible_overhead_demo.py — show the gap between alloc cost and total cost
# by running cold vs warm against the same workload, with perf stat counters.
# Run: python3 invisible_overhead_demo.py
import ctypes, os, re, subprocess, sys, time

N = 200_000   # 200k allocations
SZ = 4096     # 4 KB each — one page; forces a touch per alloc
ITERS = 5

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]

def cold_run() -> tuple[float, list[int]]:
    """Each iteration allocates fresh — pages are cold, TLB cold."""
    t0 = time.perf_counter_ns()
    sizes = []
    for _ in range(ITERS):
        ptrs = [libc.malloc(SZ) for _ in range(N)]
        for p in ptrs:
            libc.memset(p, 1, SZ)  # first-touch: kernel must zero, then we write
        sizes.append(len(ptrs))
        for p in ptrs:
            libc.free(p)
    return (time.perf_counter_ns() - t0) / 1e9, sizes

def warm_run() -> tuple[float, list[int]]:
    """Pre-allocate and pre-touch a pool once; reuse across iterations."""
    pool = [libc.malloc(SZ) for _ in range(N)]
    for p in pool:                 # pre-touch — pay the cost up front
        libc.memset(p, 0, SZ)
    t0 = time.perf_counter_ns()
    sizes = []
    for _ in range(ITERS):
        for p in pool:
            libc.memset(p, 1, SZ)  # write into already-mapped pages
        sizes.append(len(pool))
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    for p in pool:
        libc.free(p)
    return elapsed, sizes

if __name__ == "__main__":
    if "--inner" in sys.argv:
        mode = sys.argv[sys.argv.index("--inner") + 1]
        if mode == "cold":
            t, _ = cold_run()
        else:
            t, _ = warm_run()
        print(f"INNER {mode}: {t:.3f} s ({(t*1e9)/(N*ITERS):.1f} ns/op)")
        sys.exit(0)
    # Outer driver: wrap each mode in perf stat and parse the kernel counters
    EVENTS = "dTLB-load-misses,minor-faults,page-faults,context-switches"
    for mode in ("cold", "warm"):
        proc = subprocess.run(
            ["perf", "stat", "-e", EVENTS, "--",
             sys.executable, __file__, "--inner", mode],
            capture_output=True, text=True)
        print(f"\n=== {mode.upper()} ===")
        print(proc.stdout.strip())
        # perf stat writes its counters to stderr
        for line in proc.stderr.splitlines():
            m = re.search(r"^\s*([\d,]+)\s+(\S+)", line)
            if m and m.group(2) in EVENTS.split(","):
                print(f"  {m.group(2):<22} {m.group(1):>15}")
Sample run on a c6i.4xlarge (Ice Lake, kernel 6.5, glibc 2.35):
=== COLD ===
INNER cold: 4.812 s (4812.0 ns/op)
dTLB-load-misses 18,492,310
minor-faults 1,001,184
page-faults 1,001,184
context-switches 42
=== WARM ===
INNER warm: 0.612 s (612.0 ns/op)
dTLB-load-misses 284,015
minor-faults 212
page-faults 212
context-switches 38
The wall sits in the gap between 612 ns/op and 4,812 ns/op — a 7.9× ratio for the same allocator, same workload, same machine. The only difference is whether the pages are fresh. Why minor-faults dropped from 1,001,184 to 212: in the cold run, every one of the 200,000 × 5 = 1,000,000 allocations triggered a first-touch page fault (the extra 1,184 are syscall trampolines and stdlib initialisation). In the warm run, the pool was pre-touched once before the timer started, so no faults occur during the timed region. The kernel work didn't disappear — it moved out of the measurement window. Production has no such window. The dTLB-miss collapse — 18.5M down to 284k — has the same root cause: the warm pool's pages are already mapped, so the page walks that do occur hit warm paging-structure caches, while the cold workload maps fresh pages on every iteration and pays a cold page-table walk alongside every fault.
The 4,812 ns figure includes ~30 ns of allocator code (the _int_malloc path), ~1,500 ns per page of zeroing in the kernel, ~80 ns for the TLB miss on first access, and the rest in cache fills, the syscall path, and mmap_lock traffic. The microbenchmark you write tomorrow morning will show 30 ns. Your service tomorrow afternoon will pay 4,812. Both are correct measurements of different things.
Two implementation details worth flagging in the script. First, the cold path uses memset(p, 1, SZ) rather than memset(p, 0, SZ). Writing a non-zero byte guards against zero-page tricks — same-page merging and allocator special-casing of all-zero memory can quietly remove work from one side of the comparison and make it lie. Second, although each mode runs as a separate child process under perf stat, the timer starts inside the child, after interpreter startup and ctypes setup — so Python's startup cost is excluded from the ns/op figures, and the interpreter overhead inside the timed loop is identical across modes and cancels out, isolating the kernel-side difference. The perf counters, by contrast, cover the whole child process, which is why even the warm run reports a couple of hundred minor faults from startup.
A useful exercise after running the script: re-run the cold path with MAP_POPULATE enabled (via mmap directly through ctypes, since malloc doesn't expose the flag) and watch the per-op time collapse to roughly the warm number while the mmap syscall itself becomes seconds-long. The total work done is identical; what changes is whether the cost falls inside or outside the timed region. Production has the equivalent decision to make at every layer — should the cost happen at boot time, at warmup, on the first request after deploy, or on every request forever? The answer is almost always "earlier than the request path", and "almost always" is the part the microbenchmark can't tell you.
What the kernel-work flamegraph looks like
Once you accept that the invisible costs sit below your allocator, the next question is what they look like in the place engineers actually spend most of their debugging time — the flamegraph. The shape is distinctive enough that once you've seen it, you recognise it instantly.
A subtler observation: a user-space frame — say, a write_to_buffer in the request path — includes the user instruction that triggered the page fault, but not the kernel work the fault induced. The kernel work shows up under a separate top-level frame (or under entry_SYSCALL_64 for syscalls). When kernel-space sampling is disabled — the default for unprivileged perf record — the kernel block is missing entirely from the flamegraph, and the engineer sees only the user-space sliver. The first time a junior engineer sees the same flamegraph captured with perf record --call-graph dwarf as root vs unprivileged, the contrast is striking: the unprivileged version says the service is mostly idle; the privileged version says the kernel is doing 78% of the work. The wall is not just methodological; it is also permissions-shaped. Many production environments restrict perf to root, and engineers who don't know to ask for the privilege spend weeks chasing user-space bottlenecks that aren't there.
The recurring kernel symbols are worth memorising. __handle_mm_fault is the top-level page-fault handler. do_anonymous_page is its branch for MAP_ANONYMOUS allocations (the case for malloc's underlying memory). __alloc_pages is the buddy allocator returning a physical page from the free list. clear_page_erms (or clear_page on older CPUs) is the actual zeroing — erms stands for Enhanced REP MOVSB, the modern x86 streaming-write path. __rmqueue_pcplist is the per-CPU page list the buddy allocator pulls from to avoid global lock contention. When these names dominate a flamegraph, the conversation isn't "tune the allocator" — it's "stop forcing the kernel to do this work on the request path".
Three production walls Indian engineers have hit, and what fixed them
Karan's CRED rewards story isn't unusual. The same wall shape recurs across Indian production with different fingerprints; the fix is always to pay the invisible cost upfront, in a controlled place, so the request path doesn't pay it under load.
Razorpay payment-status API: VMA lock contention at 8 cores → 64 cores. A team migrated from 8-core to 64-core hosts to handle a 7× traffic ramp during the GST filing deadline. Throughput grew 4×, not 7×, and p99 doubled. Flamegraph showed 31% in __down_write on mmap_lock. Glibc's per-thread arenas were producing mmap/munmap storms because each request's lifetime spanned the threshold where ptmalloc decides to release memory back to the kernel. The fix: switch to jemalloc via LD_PRELOAD (jemalloc holds memory longer in its bin caches and mmaps less often) and set MALLOC_ARENA_MAX=2 for the cases where ptmalloc had to stay. p99 dropped from 24 ms to 9 ms; the VMA lock disappeared from the profile. The kernel was correct the entire time — mmap_lock is a cathedral lock by design — but the application was asking it to serialise 64 threads.
Hotstar IPL ingest: first-touch zeroing during the toss-to-first-ball burst. During a 2025 IPL final, the manifest-rewrite service experienced a 90-second p99 spike from 18 ms to 240 ms exactly between toss and first ball — the moment when concurrent viewers jumped from 8M to 22M. CPU was unsaturated; the spike showed as kernel time in clear_page_erms. The service was scaling out new pods, each starting with an empty heap, all simultaneously calling mmap for fresh pages and triggering kernel zeroing on first write. The fix: add a warmup probe that performs a memset over the heap before the pod is added to the load balancer. The 5-second warmup shifted 240 ms of per-request zeroing into a controlled startup cost. p99 stabilised at 22 ms across the burst.
The same incident produced an internal runbook line that has aged well: "If MALLOC_ARENA_MAX was the fix, the bug was the kernel's lock, not glibc's design." Glibc's per-thread arenas are a correct answer to single-process concurrency on small core counts. They are the wrong answer to 64-core service hosts because the original design (Wolfram Gloger, 2006) targeted a hardware regime that no longer exists in production. The point is not that glibc is bad; the point is that the defaults of any decade-old system component need to be re-validated against this decade's hardware.
A common follow-up question is "could Razorpay have caught this in load testing?". The honest answer is yes, but only if the load test ran at production concurrency on production-shaped hardware for long enough to fragment glibc's arenas. A 5-minute wrk2 run on a single 8-core staging box would not have surfaced the lock storm. The wall is partly about microbenchmarking; it is also about short benchmarking — tests that don't run long enough for the allocator's steady-state behaviour to develop. Production walls of this shape often show up 30 minutes into a sustained load, never in the first 5 minutes. The remedy is the same shape as the chapter: the test harness must measure the distribution of behaviour over time, not the average across a short run.
Zerodha matching engine: NUMA misses after a kernel scheduler change. A kernel upgrade from 5.15 to 6.1 changed the default scheduler tunables; the matching-engine threads, previously sticky to socket 0, started migrating across sockets every few hundred milliseconds. Latency p99.9 went from 800 µs to 3.2 ms with no code change. numastat -p showed 47% remote accesses where it had been <2%. The fix: pin the threads explicitly with numactl --cpunodebind=0 --membind=0 and taskset in the systemd unit. Latency returned to baseline. The wall here was not the allocator at all — but the allocator's first-touch policy is what made NUMA placement matter for memory allocated months ago by threads that had never moved.
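The numastat check reduces to one division. A hedged helper — the per-node totals below are invented for illustration, not captured from a real host; on a live box they come from the Total row of numastat -p:

```python
# Remote-memory exposure from numastat-style per-node totals: the fraction
# of resident memory sitting on nodes other than the one the service is
# pinned to.
def remote_fraction(node_mb: list, home_node: int) -> float:
    """Fraction of RSS placed on nodes other than home_node."""
    total = sum(node_mb)
    return 0.0 if total == 0 else (total - node_mb[home_node]) / total

totals = [900.1, 800.4]                 # Node 0, Node 1 — hypothetical MB
frac = remote_fraction(totals, home_node=0)
print(f"{frac:.0%} of RSS is remote")   # ~47%, the Zerodha-shaped red flag
```

Against the 5% threshold in the diagnostic ladder, anything in this range says "fix placement before touching the allocator".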
The shared diagnostic pattern: when the allocator's reported numbers are healthy but the service is slow, the cost is in the layers below the allocator that the allocator's API can't expose. Move down the stack one level — kernel page faults, VMA lock waits, NUMA placement, TLB occupancy — and the missing time is always there.
A second pattern across the three stories: the fix in each case was operational, not algorithmic. Karan didn't change a data structure; he changed a kernel boot parameter and added a warmup script. The Razorpay team didn't change their request handling code; they added an LD_PRELOAD line to their systemd unit. The Hotstar team didn't rewrite their parser; they pre-touched the heap before adding the pod to the load balancer. None of these fixes would have been findable by reading the application's source code, because none of them are in the application's source code. They are in the seam between the application and the kernel — the seam the application's vocabulary does not name and that the application's tests do not exercise. Operational fixes for invisible costs are the highest-ROI work in production performance because the fix is small (often a few lines) and the impact is large (often 5–10× p99 reduction). The barrier is not engineering effort; it is the diagnostic skill to find the seam.
There is a recurring surprise in each of these stories: the engineer who eventually fixed the bug had to learn a tool they didn't use day-to-day. Karan had never read numastat output before the CRED incident; the Razorpay team had never used perf lock contention; the Hotstar team had never traced clear_page_erms in a flamegraph. Every invisible-cost wall expands the toolbox by one. After three or four such walls, the engineer stops being surprised — they have built the diagnostic instinct that says "the allocator looks fine; therefore the cost is below the allocator", and they reach for the right tool first. This instinct is what separates a senior performance engineer from a senior backend engineer who happens to know perf top. The walls are how the instinct gets built.
A fourth shape, less common but worth naming, shows up at PhonePe, Paytm, and any high-frequency UPI dispatcher: CPU steal time inside containers. Kubernetes nodes pack 30–60 pods onto a 64-vCPU host. The CFS bandwidth controller throttles a pod that exceeds its CPU quota by sleeping it for the rest of the period, even if the host has idle cores. To the application, this looks like a 10–80 ms pause that no in-process tool can explain. perf stat against the process shows clean counters; cat /sys/fs/cgroup/cpu/cpu.stat shows nr_throttled climbing. The fix is either to raise the quota, switch to cpuManagerPolicy=static for the pod, or accept the throttling as part of the SLO. The wall here isn't allocator-related at all — but the symptom is identical: a tail latency that no in-process measurement explains, because the cause sits in a layer the application can't see.
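The throttling check is a two-field read. A sketch that parses the cgroup v1 cpu.stat format named above — the field names are the kernel's, the values here are invented:

```python
# CFS throttling check from inside the container: cpu.stat is 'key value'
# per line; nr_throttled / nr_periods is the fraction of scheduling periods
# in which the quota put the pod to sleep.
def parse_cpu_stat(text: str) -> dict:
    return {k: int(v) for k, v in
            (line.split() for line in text.strip().splitlines())}

sample = """\
nr_periods 18000
nr_throttled 2400
throttled_time 96000000000
"""
stat = parse_cpu_stat(sample)
ratio = stat["nr_throttled"] / stat["nr_periods"]
print(f"throttled in {ratio:.0%} of periods")   # 13% — tail-latency poison
# Live pod: parse_cpu_stat(open("/sys/fs/cgroup/cpu/cpu.stat").read())
```

A non-trivial ratio here ends the allocator investigation immediately: the pauses are scheduler quota, not memory.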
A useful lens for the "operational fix" pattern: every invisible cost can be paid earlier, paid in a different place, or paid at a different cadence — but it cannot be made to vanish. Pre-touching pays the first-touch cost at startup instead of on the request. MALLOC_ARENA_MAX pays per-arena lock cost instead of mmap_lock cost. Pinning to NUMA nodes pays a flexibility cost (a thread can't migrate to an idle core on the other socket) instead of a remote-access cost. Each operational knob is a relocation, not an elimination. The skill is knowing which relocation your specific workload tolerates — which is a measurement question, not a design question.
What "invisible" means in practice — three calibration scenarios
Before the diagnostic ladder, it helps to have intuition for what "invisible cost is the bottleneck" looks like in the wild. Three calibration scenarios that recur often enough to memorise:
Scenario A — The benchmark looks great, production is slow. Microbenchmark on a developer laptop reports 30 ns per allocation, p99 within 2× of median. Production p99 is 50–500× slower than median, with no measurable allocator-internal contention. Diagnosis: the production environment exposes invisible costs the laptop cannot. Almost always first-touch zeroing or VMA-lock contention; sometimes NUMA. Fix: pre-touch on startup, cap arenas, pin to NUMA nodes — in that order of likelihood.
Scenario B — The benchmark and production both look slow, but for different reasons. Microbenchmark p99 is high (say 800 ns) and production p99 is also high (say 8 ms). Easy mistake: assume the production slowness scales with the benchmark slowness. It doesn't. The benchmark is paying allocator-internal cost (bin search, lock acquisition); production is paying additional cost on top from the kernel/MMU. Diagnosis: subtract the benchmark cost from the production cost; the residual is the invisible component, and the residual is what the diagnostic ladder targets.
Scenario C — The flamegraph is full of __libc_malloc but the fix isn't a faster allocator. This is the most common misdiagnosis. The flamegraph correctly shows __libc_malloc as fat, so the engineer reaches for jemalloc. p99 improves by 3% and the engineer concludes the wall is intractable. The truth: __libc_malloc was fat because it was being called too often, not because it was slow per call. The fix is reducing call frequency (pooling, arena-per-request) — which is a workload change, not an allocator change. Always check the call rate before changing the allocator.
There is also a fourth scenario worth flagging — the case where the wall is the allocator, but you can only see it because you have eliminated the other walls. This appears at the end of long optimisation projects: after pre-touching, NUMA pinning, MALLOC_ARENA_MAX tuning, and huge pages, the remaining 20% latency tail genuinely is allocator-internal. At that point a switch to mimalloc or a custom slab is the right fix. The rule is sequential: address invisible-cost walls first, then allocator walls. Reverse the order and the allocator change you ship looks ineffective because the kernel is still bottlenecking.
The triage rule that emerges: when production p99 is more than 10× the microbenchmark p99 on the same allocator, you are in scenario A or scenario C, and the answer is almost never "ship a faster allocator". When production p99 is 2–4× the microbenchmark p99, you are in scenario B, and the allocator's own internals genuinely matter — but only as one term in the sum. Calibrating against these three shapes before opening any tool turns the diagnostic ladder from a checklist into instinct, and is the difference between fixing the wall in 30 minutes and fixing it in three days.
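The triage rule is mechanical enough to write down. A sketch — the thresholds are this chapter's heuristics, not universal constants, and the band between 4× and 10× is genuinely ambiguous:

```python
# Classify the production-vs-microbenchmark gap before opening any tool.
def triage(bench_p99_ns: float, prod_p99_ns: float) -> str:
    ratio = prod_p99_ns / bench_p99_ns
    if ratio > 10:
        return "A/C: invisible cost or call frequency — don't swap allocators yet"
    if ratio >= 2:
        return "B: allocator internals matter as one term — subtract and compare"
    return "agree: benchmark and production match — no wall here"

print(triage(84, 31_000_000))   # Karan's numbers: firmly A/C territory
print(triage(800, 2_400))       # the scenario-B shape
```

Running production numbers through a function like this before opening perf is the calibration step the section describes.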
Common confusions
- "If perf stat shows the cycles, the cycles are visible." perf stat shows the count of events, not their attribution. A TLB miss takes 80 cycles, but those 80 cycles are charged to whatever instruction was retiring when the miss completed — usually the user-code load instruction, not the allocator that handed out the cold page. The cycles are visible in the totals; the cause of the cycles is hidden in the timeline. To recover the cause you need perf record -e dTLB-load-misses and a flamegraph that calls out the cold-page pattern, not just perf stat.
- "Pre-allocating buffers fixes everything." Pre-allocation fixes first-touch zeroing and TLB warming, but doesn't fix VMA lock contention (which is about mmap/munmap, not allocation), and doesn't fix NUMA placement if your threads migrate. Each invisible cost has its own fix. Treating "pre-allocate" as a universal cure misses the other walls. The diagnostic ladder is: perf stat -e minor-faults,dTLB-load-misses first, then numastat -p, then perf lock contention, then a flamegraph against do_page_fault and __handle_mm_fault.
- "Huge pages eliminate TLB misses." Huge pages reduce TLB pressure by 512× (a 2 MB page covers what 512 4 KB pages would), but they don't eliminate misses — and Transparent Huge Pages can introduce new invisible costs (khugepaged running in the background, fragmentation under memory pressure, write-amplification on copy-on-write forks). The decision is workload-dependent. For long-lived heaps with sequential access, huge pages help. For services with frequent fork-and-exec or short-lived address spaces, the kernel daemon's CPU can show up as another invisible cost.
- "Container limits hide the kernel from me." They don't — they constrain it. Cgroup memory limits force the kernel's reclaim path to run more aggressively inside your container, which means more direct_reclaim calls on the allocation path, which means longer allocation tails. Cgroup CPU limits introduce throttling that looks identical to a GC pause from inside the application. The kernel is more visible, not less, when you put a container around your service. The only way to confirm the kernel is innocent is to run the diagnostic ladder inside the container's cgroup, not on the bare host.
- "malloc is a function, so its cost is local." malloc is a function whose effects are global. It updates per-thread arenas (cache-line traffic across cores), occasionally calls mmap/brk (kernel and mmap_lock), reuses freed pages that may have been first-touched on another socket (NUMA remote), and returns memory whose first access will fault and zero (1.5 µs of kernel work). The "cost of malloc" is whatever you draw the box around. Drawing the box around the C function alone is the mistake the wall is named for.
- "Production overhead means I need a faster allocator." Sometimes — but more often you need the same allocator with one of: explicit thread pinning, an arena-per-request pattern, huge pages enabled, an mlock pre-touch step, or a MALLOC_ARENA_MAX cap. Switching from glibc to jemalloc fixes the lock-contention class of wall but does nothing for first-touch or NUMA. Choosing the fix requires diagnosing which invisible cost is paying the bill, not picking the allocator that scored best on the last microbenchmark you read about.
Going deeper
The diagnostic ladder when "the allocator looks fine"
When a service runs slow and your allocator's stats say it isn't the cause, work the ladder in this order. The mental model engineers usually bring is wrong: they associate "page fault" with "swapping", and their service has no swap configured, so they ignore the page-fault counter. Minor faults are not about disk — they are about the kernel doing the allocation work the application asked for implicitly. A service at steady state should have minor faults proportional to its allocation rate; when minor faults run at 1M/sec and your allocation rate is 50k/sec, the allocator is returning the same pages as fresh repeatedly because something in your workload is forcing them to be released and re-acquired.
1. perf stat -e minor-faults,major-faults,page-faults,dTLB-load-misses,context-switches -p $(pidof svc) sleep 30. If minor-faults divided by allocation rate is greater than 1, the service is touching fresh pages on every alloc — first-touch zeroing is the bill. If dTLB-load-misses exceeds 0.5% of dTLB-loads, the working set has overflowed the TLB and you need huge pages or a denser layout.
2. numastat -p $(pidof svc). Look at the Other_Node column. Anything above 5% means threads are reading remote memory — the fix is numactl pinning or NUMA-aware allocation, not the allocator itself.
3. perf record -F 99 -g -p $(pidof svc) -- sleep 30 && perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. Look for kernel frames — __handle_mm_fault, down_write, clear_page_erms. Their width is your invisible-cost budget.
4. bpftrace -e 'kprobe:__handle_mm_fault { @[comm] = count(); } interval:s:10 { print(@); clear(@); exit(); }'. Tells you exactly how many minor faults each process is causing per second.
5. /proc/<pid>/status — read VmRSS (resident bytes) and VmData (anonymous mapped). If VmData is 5× VmRSS, you have a lot of mapped-but-not-resident memory; first access will fault. Pre-touch or MAP_POPULATE removes the cliff.
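The first rung's ratio check is easy to automate. A sketch — the sample text mimics perf stat's stderr layout, and the allocation count would come from the service's own metrics:

```python
# Pull a named counter out of perf stat output and compute faults per
# allocation: the step-1 verdict of the diagnostic ladder.
import re

def counter(perf_stderr: str, event: str) -> int:
    """Return the value of one perf stat counter line, commas stripped."""
    for line in perf_stderr.splitlines():
        m = re.search(r"^\s*([\d,]+)\s+" + re.escape(event), line)
        if m:
            return int(m.group(1).replace(",", ""))
    raise KeyError(event)

sample = """\
       1,001,184      minor-faults
      18,492,310      dTLB-load-misses
"""
allocs = 1_000_000                    # known allocation count for the window
ratio = counter(sample, "minor-faults") / allocs
print(f"{ratio:.2f} faults per allocation")  # > 1 means first-touch is the bill
```

A ratio near zero sends you down the ladder to NUMA and lock contention; a ratio near one ends the search at rung one.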
Why the order is sequential, not parallel: each step rules out a class of cause. Skipping straight to the flamegraph is tempting but produces a flamegraph you cannot interpret without knowing which counter is hot. Start with perf stat; the answer is usually in the first 30 seconds, and the remaining steps refine the diagnosis rather than discover it.
The MAP_POPULATE escape hatch and its allocator-equivalent flags
Linux's mmap accepts a MAP_POPULATE flag that asks the kernel to pre-fault every page of the mapping at mmap time, instead of lazily on first access. This collapses the first-touch invisible cost into a single explicit cost paid at map-time. The trade-off: the mmap call itself blocks for the duration of the population — at ~1.5 µs per 4 KB page, that is 256 pages × 1.5 µs ≈ 0.4 ms per MB, so a 2 GB allocation blocks for close to a second. This is unacceptable on a request path but ideal at startup. The pattern in production C++ services: mmap(NULL, sz, ..., MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0) for arena buffers allocated during initialisation. Java's equivalent is -XX:+AlwaysPreTouch, which forces the JVM to write to every page of the heap at startup. Go's runtime has no first-class equivalent but can be approximated by allocating large buffers at startup and explicitly writing to each page. The cost of not using these flags is usually paid in the first few hundred requests after a deploy — exactly the requests that the load balancer is using to validate the new pod's health, which is why bad luck with first-touch faults often produces "the deploy looked healthy in canary, then p99 spiked when we promoted to 100%" stories.
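The AlwaysPreTouch pattern in portable form is a one-byte-per-page write loop over the arena at startup. A sketch — a real service would run this before registering with the load balancer:

```python
# Pre-touch a startup arena so no request ever pays a first-touch fault.
import mmap, time

def pretouch(buf: mmap.mmap) -> float:
    """Touch every page once; returns the seconds spent — the startup bill."""
    t0 = time.perf_counter()
    for off in range(0, len(buf), mmap.PAGESIZE):
        buf[off] = 1                       # non-zero, mirroring the demo script
    return time.perf_counter() - t0

arena = mmap.mmap(-1, 64 * 1024 * 1024)    # 64 MB startup arena
startup_cost = pretouch(arena)             # paid once, here, not per request
print(f"pre-touch took {startup_cost * 1e3:.1f} ms")
```

The printed cost is the same kernel work the Hotstar warmup probe moved out of the toss-to-first-ball burst — relocated, not eliminated.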
Why first-touch is the right default, and how it shifts when you switch allocators
It is tempting to ask "why does the kernel zero pages lazily? Why not zero them eagerly in the background?" The kernel runs a kswapd/zone-reclaim thread that pre-zeros some pages, but it can't pre-zero everything because (a) it doesn't know which process will need pages next, (b) zeroing dirties cache lines that may evict useful working-set data, and (c) memory pressure makes pre-zeroed pages a luxury. First-touch policy is the right default because it places memory close to the thread that will use it (NUMA-locally) and avoids paying for memory that was never written. The same defence applies to lazy TLB filling, the single global VMA tree, and NUMA first-touch — each is locally optimal; the wall is what happens when many locally-optimal defaults compose against a workload they were not tuned for.
Switching from glibc's ptmalloc to jemalloc or tcmalloc does not eliminate these costs — it redistributes them. Why the redistribution matters: jemalloc holds memory longer in bin caches, so mmap/munmap storms drop and VMA-lock contention disappears, but RSS climbs and the kernel's reclaim path runs more often. tcmalloc releases via madvise(MADV_DONTNEED) instead of munmap, keeping the VMA tree small but telling the kernel "you can drop these pages" — and the next access faults them back in, paying first-touch again. mimalloc's segments hit the kernel less often than either but are sensitive to access pattern: random access across the segment range thrashes the TLB worse than jemalloc's arenas. Each allocator is a different distribution of the same total cost across kernel, TLB, and cache. The benchmark that decides which is right for your service must measure all three layers simultaneously, which is exactly what the diagnostic ladder above is for.
When MALLOC_ARENA_MAX is the right knob, and when it isn't
Glibc's ptmalloc creates arenas on demand as threads contend for allocation locks, up to a default cap of 8× the core count on 64-bit systems. On a 64-core host that means up to 512 arenas, each with its own bin caches, each contending for mmap_lock when it needs to grow. Setting `MALLOC_ARENA_MAX=2` collapses the contention by forcing all threads through 2 shared arenas — at the cost of higher per-arena lock contention inside ptmalloc itself. This is the classic "trade contention at one layer for contention at another" knob, and the right value depends on the ratio of mmap rate to allocation rate. Why the right value isn't always 2: services with very high allocation rates and very few mmap calls (a steady-state JSON parser whose arena is already sized) are better off with the default — more arenas means less in-allocator contention. Services with bursty mmap patterns (anything that periodically grows or shrinks its working set, like a cache eviction storm) benefit from `MALLOC_ARENA_MAX=2` because the kernel-level mmap_lock is a bigger bottleneck than the per-arena lock. The decision is workload-dependent and changes with hardware; what worked on 16 cores often fails on 64. The Razorpay incident above is the bursty-mmap case; a steady-state Hotstar HLS parser with a stable working set is the opposite case. Reading a `MALLOC_ARENA_MAX` recommendation in a blog post and applying it without measuring is how you turn a working service into a slower one.
A related knob worth naming is io_uring's batched memory-advice path. The page fault itself is raised synchronously by the hardware and cannot be batched, but one adjacent cost can be: io_uring has supported madvise as a submission opcode (`IORING_OP_MADVISE`, Linux 5.6), so a storm of advice calls can be queued and flushed with a single submission instead of one syscall each; `mmap` and `mprotect` have no ring equivalent and stay synchronous. For services that prepare large buffer sets at startup (Hotstar's per-stream segment buffers, Zerodha's per-symbol order-book pages), batching the advice phase shrinks that part of the boot-time allocation work from many kernel round-trips to one. The first-touch zeroing still happens on first access — io_uring doesn't change physics — but the per-call syscall cost vanishes. This is part of a broader pattern in modern kernel APIs: the syscall is no longer where engineers should look for invisible overhead; the batched-submission queue is. Combine `MALLOC_ARENA_MAX` (fewer arenas → less mmap traffic) with batched submission (cheaper advice calls per trip into the kernel) and the kernel-side cost can drop severalfold on the same workload, no allocator change required.
Reproduce this on your laptop
```shell
sudo apt install linux-tools-common linux-tools-generic numactl
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Show the cold/warm gap with perf-stat counters
python3 invisible_overhead_demo.py

# Look at the page-fault rate of any service you have running
perf stat -e minor-faults,major-faults,dTLB-load-misses -p $(pidof <yourservice>) sleep 10

# Inspect NUMA placement of an existing process
numastat -p $(pidof <yourservice>)
```
You should see the cold run come in 5–15× slower than the warm run, and the minor-faults counter difference will explain almost all of it. On a laptop with only one NUMA node the NUMA column doesn't matter, but the TLB and first-touch costs are identical to production. The numbers vary by hardware (Apple Silicon zeroes pages faster because of dedicated streaming-store paths) but the ordering is invariant.
Where this leads next
This wall closes Part 11 — the allocator chapters. The cost lines that the allocator pushes downstream — TLB misses, page faults, syscalls, context switches — are the entire subject of Part 12. The transition is direct: every overhead this chapter named as "invisible" gets a chapter of its own next.
- `/wiki/syscall-overhead` — what `mmap`, `munmap`, `brk`, and the rest actually cost in cycles, and where the modern alternatives (io_uring, batched syscalls) move the cost.
- `/wiki/context-switch-cost` — the 1–10 µs the kernel spends moving a thread off a core, plus the 50–500 µs of cold-cache penalty after.
- `/wiki/scheduler-latency` — why CFS occasionally lets a thread wait 8 ms for the CPU even when CPU utilisation is at 60%.
- `/wiki/tlb-misses-and-huge-pages` — the invisible cost of address translation, and the trade-offs of 2 MB and 1 GB pages.
- `/wiki/page-fault-handling-minor-vs-major` — the kernel path through `__handle_mm_fault` and why minor faults can dominate a healthy service's CPU profile.
The progression from Part 11 to Part 12 mirrors the diagnostic ladder this chapter sketched: Part 11 explained what malloc does, what its arenas look like, why jemalloc and tcmalloc exist, and how regions and pools sidestep parts of the cost. Part 12 takes the invisible layer that Part 11 deliberately left aside — the kernel and hardware costs the allocator pushes downstream — and gives each one its own chapter. By the end of Part 12 the reader should be able to look at a flamegraph dominated by kernel symbols and name not just what the symbol is but which allocator decision created the work, what the typical magnitude is, and which of the four fix categories (allocator, kernel path, workload, hardware) applies.
The reader who finishes both parts has the full vocabulary to argue about memory in production. They can say "this service is paying for first-touch zeroing on the hot path; we should MAP_POPULATE at startup" instead of "the allocator is slow"; they can say "the VMA lock is contended because we have 64 threads all hitting mmap; we need MALLOC_ARENA_MAX=2" instead of "we need a faster allocator". The vocabulary is the difference between a fix that works and a fix that hopes.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), §7.5 "Memory Architecture" — the canonical reference for the methodology this chapter follows.
- Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) — §2 (cache hierarchy) and §4 (virtual memory) explain the hardware mechanisms behind the invisible costs.
- Linux kernel documentation, `Documentation/admin-guide/mm/transhuge.rst` — the official source on transparent huge pages and `khugepaged`.
- Jeff Bonwick, "The Slab Allocator: An Object-Caching Kernel Memory Allocator" (USENIX Summer 1994) — the foundational paper on caching constructed objects to avoid the per-allocation costs this chapter measures.
- Jason Evans, "A Scalable Concurrent malloc(3) Implementation for FreeBSD" (BSDCan 2006) — jemalloc's design paper; explains the per-arena strategy that fixes the VMA-lock wall.
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that defined coordinated omission; directly applicable to allocator microbenchmarks.
- Aleksey Shipilëv, "JVM Anatomy Quark #20: -XX:+AlwaysPreTouch" — the JVM equivalent of the `MAP_POPULATE` pattern, with measurement of the cold-page penalty.
- `/wiki/gc-vs-manual-vs-region-based-allocation` — the previous chapter; the regimes whose hidden costs this wall exposes.