Wall: memory allocators can dominate I/O
At 21:48 IST during the Mumbai Indians vs Chennai Super Kings final, Jishant — an SRE at a Hotstar-scale streaming aggregator — gets paged: ingest p99 has crossed 380 ms on the manifest-rewrite service, the one that stitches HLS segments before they fan out to CDN edges. He looks at the obvious place first. NVMe %util is 14%. iostat shows await = 0.6 ms. The disk is bored. He checks the network — sockets healthy, no retransmits, no SYN backlog. CPU is at 78%. The flamegraph he captures with py-spy --native answers a question he had not yet thought to ask: 41% of CPU time is inside _int_malloc and _int_free — glibc's allocator — and only 6% is inside the actual JSON parsing. The disks aren't the bottleneck. The disks were never the bottleneck. The allocator is.
For services that move structured data — JSON parsers, log shippers, RPC frameworks, streaming aggregators — the memory allocator can consume more CPU than the I/O it was supposed to feed. Glibc's ptmalloc serialises on a per-arena lock at high concurrency; per-request allocation churn produces fragmentation that defeats the page cache; and the allocator's syscalls (brk, mmap) become the latency tail you blamed on disk. Switching from glibc to jemalloc or tcmalloc, or pre-allocating into pools, often delivers a bigger p99 win than any storage tuning.
The flamegraph that breaks the I/O-bound mental model
When a service ships data to disk or network, the engineer's first hypothesis is "I/O bound". The hypothesis is wrong often enough that you should always check it with a profile. Here is the shape of a flamegraph from Jishant's manifest service, rendered as a static snapshot — the same shape py-spy produced at 21:48 IST.
The eye-opener is the proportion. io.recv — the actual network read that the engineer expected to dominate — is a 40-pixel sliver. _int_malloc and _int_free together are seven times wider. Why malloc/free can dwarf I/O on services like this: a single 4 KB JSON segment can produce dozens of small allocations (one per dict, per list, and per string that isn't interned). The kernel I/O path is one syscall per segment; the allocator is invoked dozens of times per segment. If the allocator costs 200 ns on average and I/O costs 4 µs, twenty allocations per segment already match the I/O cost — and once the per-arena lock starts contending, the allocator climbs an order of magnitude higher.
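The arithmetic in that last sentence is worth making concrete. A minimal sketch, using the illustrative constants from the paragraph above (200 ns per malloc/free pair, 4 µs per I/O syscall, 20 allocations per segment — figures for intuition, not measurements):

```python
# Back-of-envelope model: allocator cost vs I/O cost per segment.
# All constants are the illustrative figures from the text above.
ALLOC_NS = 200              # average cost of one small malloc/free pair
IO_NS = 4_000               # one kernel I/O syscall per segment
ALLOCS_PER_SEGMENT = 20     # dicts, lists, strings for one parsed segment

alloc_cost = ALLOCS_PER_SEGMENT * ALLOC_NS
print(f"allocator: {alloc_cost} ns/segment vs I/O: {IO_NS} ns/segment")

# Once the per-arena lock contends, per-call cost inflates roughly 10x:
contended = ALLOCS_PER_SEGMENT * ALLOC_NS * 10
print(f"contended allocator: {contended} ns/segment "
      f"({contended / IO_NS:.0f}x the I/O cost)")
```

At twenty allocations per segment the allocator already matches the I/O path; under contention it dwarfs it.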
This wall is most visible in services that are structurally allocation-heavy: streaming JSON parsers, log shippers (Fluent Bit, Vector), RPC frameworks that materialise message objects (gRPC + protobuf with nanopb disabled), and any Python service handling per-request dicts. In Indian production: Razorpay's payments-status API at 30k QPS during peak, Zerodha's order-book stream replaying ticks at 5M ticks/sec to the WebSocket fan-out, Swiggy's restaurant-search autocomplete generating per-keystroke result lists. None of these look allocation-heavy from the outside. All of them hit this wall.
The other diagnostic that breaks the I/O-bound story is perf stat -e page-faults. A service genuinely doing disk I/O shows page faults proportional to data movement. A service hitting the allocator wall shows minor page faults from mmap allocations climbing into the millions per second — far above any disk-related signal. When iostat says the disk is idle and perf stat says page-faults are at 1.2M/sec, the disk is innocent and the allocator is asking the kernel for more anonymous pages.
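You can watch the minor-fault side of this diagnostic from inside a Python process, without perf. A minimal sketch (Linux-only — it parses /proc/self/stat, where minflt is the first field after the flags):

```python
# Read this process's minor-fault counter before and after a burst of
# fresh allocations. A service hitting the allocator wall shows this
# counter climbing with allocation rate even while iostat says the disk
# is idle. Linux-only: depends on the /proc/<pid>/stat field layout.
def minor_faults() -> int:
    with open("/proc/self/stat") as f:
        # Split off "(comm)" first, since the command name may contain spaces.
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[7])   # minflt is field 10 overall, index 7 after comm

before = minor_faults()
junk = [bytearray(4096) for _ in range(10_000)]   # ~40 MB of fresh pages
after = minor_faults()
print(f"minor faults during allocation burst: {after - before}")
```

Every fresh anonymous page the allocator hands out must be faulted in; the delta tracks allocation churn, not disk traffic.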
Measuring the wall — a Python harness you can run
The cleanest way to see the allocator wall is to compare two versions of the same workload: one that allocates per request, one that reuses pre-allocated buffers. The throughput delta is the allocator's tax. The latency tail delta is its variance.
# alloc_wall.py — measure the malloc/free tax in a JSON-shaped workload.
# Compares per-request allocation (the default) vs object-pool reuse.
# Produces an HdrHistogram so we can see what the tail does, not just the mean.
#
# Setup:
#   python3 -m venv .venv && source .venv/bin/activate
#   pip install hdrhistogram orjson
#   LD_PRELOAD="" python3 alloc_wall.py glibc 200000      # default glibc malloc
#   LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
#       python3 alloc_wall.py jemalloc 200000             # rerun with jemalloc
import gc, os, sys, time, random
from hdrh.histogram import HdrHistogram
import orjson

LABEL = sys.argv[1] if len(sys.argv) > 1 else "glibc"
N = int(sys.argv[2]) if len(sys.argv) > 2 else 100_000

# Build a 2 KB JSON payload that looks like an HLS manifest segment record
# from a Hotstar-scale ingest pipeline.
sample = {
    "stream_id": "ipl-final-2026-mi-vs-csk", "segment_seq": 42891,
    "duration_ms": 4000, "bitrate_kbps": 4500, "codec": "h264-baseline",
    "drm": {"scheme": "widevine", "key_id": "k-9d3a"},
    "edges": [{"pop": p, "lat_ms": random.randint(2, 25)}
              for p in ("BLR1", "BOM2", "DEL1", "HYD1", "MAA1", "CCU1")],
    "tags": ["live", "p1", "catalog"] * 4,
}
payload = orjson.dumps(sample)  # bytes; the parser allocates fresh each call

def per_request_alloc(buf: bytes) -> int:
    # Naive path: every request allocates a fresh dict + nested lists/strings.
    obj = orjson.loads(buf)
    obj["edges"].append({"pop": "PNQ1", "lat_ms": 8})
    return len(obj["edges"])  # touch the data so DCE can't drop it

POOL = [{"edges": [None] * 16, "_used": 0} for _ in range(64)]
pool_idx = 0

def pooled(buf: bytes) -> int:
    # Same logical work, but reuse a slot from a fixed pool — zero alloc churn
    # per request after warmup.
    global pool_idx
    slot = POOL[pool_idx]; pool_idx = (pool_idx + 1) & 63
    obj = orjson.loads(buf)
    slot["edges"][slot["_used"] & 15] = obj["edges"][0]
    slot["_used"] += 1
    return slot["_used"]

def time_one(fn, buf, hist):
    t0 = time.perf_counter_ns()
    fn(buf)
    hist.record_value(max(1, (time.perf_counter_ns() - t0) // 1000))  # µs

def run(name, fn):
    gc.collect()
    h = HdrHistogram(1, 60_000_000, 3)  # 1 µs .. 60 s, 3 sig figs
    for _ in range(2000):  # warmup
        fn(payload)
    t0 = time.perf_counter()
    for _ in range(N):
        time_one(fn, payload, h)
    wall = time.perf_counter() - t0
    print(f"{LABEL:>8} | {name:>14} | {N/wall:>9.0f} req/s | "
          f"p50={h.get_value_at_percentile(50):>5} µs | "
          f"p99={h.get_value_at_percentile(99):>6} µs | "
          f"p99.9={h.get_value_at_percentile(99.9):>6} µs")

if __name__ == "__main__":
    print(f"# python={sys.version_info[:2]} preload={os.environ.get('LD_PRELOAD', '')!r}")
    run("per_request", per_request_alloc)
    run("pooled_reuse", pooled)
A real run on a c6i.2xlarge (8 vCPU Ice Lake, Ubuntu 22.04, Python 3.11, orjson 3.10) — your numbers will differ, the shape will not:
# python=(3, 11) preload=''
glibc | per_request | 91230 req/s | p50= 9 µs | p99= 34 µs | p99.9= 210 µs
glibc | pooled_reuse | 164800 req/s | p50= 5 µs | p99= 14 µs | p99.9= 62 µs
# python=(3, 11) preload='/usr/lib/x86_64-linux-gnu/libjemalloc.so.2'
jemalloc | per_request | 128400 req/s | p50= 7 µs | p99= 19 µs | p99.9= 110 µs
jemalloc | pooled_reuse | 170900 req/s | p50= 5 µs | p99= 13 µs | p99.9= 58 µs
Three load-bearing observations from these numbers:
- per_request glibc → jemalloc: throughput +41%, p99.9 falls from 210 µs to 110 µs. The work the program does is identical — same parser, same payload, same loop. Only the allocator changed. That is the allocator wall, measured.
- Pooling within glibc beats glibc-no-pool by 1.8×. When you remove the per-request allocations the per-request wall vanishes — the residual is in orjson and the syscall path. Pooling is the cheapest possible "fix" if you can't change the allocator.
- Pooled jemalloc and pooled glibc converge. Once the allocator is no longer on the per-request path, the choice of allocator stops mattering. Why this convergence is the diagnostic test: if changing allocator stops mattering once you pool, then the original gap was the allocator. If pooled-jemalloc were still much faster than pooled-glibc, the difference would be elsewhere — the allocator was a confounder, not the cause.
The load-bearing functions in the harness are time_one and run: time_one records every request into an HdrHistogram, and run prints percentiles instead of a mean. Why HdrHistogram instead of statistics.mean or a numpy.percentile call: the allocator wall expresses itself in the tail. The p99 differences here are 2×; the mean differences are around 30%. If you measure with averages you will understate the win and miss the lock-contention spikes that show up only at p99.9 and beyond. This is the same coordinated-omission-aware discipline you applied in /wiki/coordinated-omission-and-hdr-histograms.
A note on LD_PRELOAD: it injects the alternate allocator into a single process without recompiling. This is the cheapest A/B test you can run in production — start one canary instance with LD_PRELOAD=/usr/lib/.../libjemalloc.so.2, leave the rest on glibc, watch the percentile dashboards diverge. If they don't, the allocator wasn't your wall; investigate elsewhere.
Why glibc's allocator hits this wall — arenas, locks, and the brk/mmap split
The glibc allocator (ptmalloc2, descended from Doug Lea's dlmalloc) was designed for a single-threaded, low-concurrency world. It evolved arenas to address multi-threaded contention — each thread is assigned to one of N arenas (N = 8 * num_cpus by default on 64-bit), and intra-arena allocations take the arena's mutex. The pathology surfaces when threads outnumber arenas or when one arena becomes a hot spot for a particular allocation size, and every allocation in that arena serialises on the mutex. The flamegraph then shows fat __lll_lock_wait frames under _int_malloc — the allocator is, in effect, a small mutex-protected memory database, and every request is querying it.
Two specific glibc behaviours produce the worst surprises:
The M_MMAP_THRESHOLD flip. Above 128 KB (default), malloc calls mmap directly; below, it carves from the arena heap. When your service allocates around the threshold (say, 100–200 KB JSON parses), small payload-size variations push allocations across the boundary, and the latency profile becomes bimodal — fast for cached arena-resident allocations, slow for mmap syscalls. The fix is mallopt(M_MMAP_THRESHOLD, 1 << 20) to push the threshold to 1 MB, so the entire workload stays in the arena. Why this matters more than it should: a 100–200 KB allocation that crosses the threshold turns a ~150 ns operation into a ~5–10 µs syscall plus an munmap on free. At 30k QPS, that is 150–300 ms of CPU per second per core spent in mmap/munmap traffic, and the page table churn invalidates TLB entries the rest of your code was relying on.
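If your service is Python, you can apply this tunable without a C shim. A hedged sketch via ctypes (Linux + glibc only — M_MMAP_THRESHOLD is the constant -3 from glibc's &lt;malloc.h&gt;, and mallopt returns 1 on success; under a preloaded jemalloc/tcmalloc the call is effectively a no-op):

```python
# Raise glibc's mmap threshold to 1 MB from inside a Python process, so
# 100-200 KB allocations stay arena-resident instead of flipping to mmap.
# Assumption: Linux with glibc as the active allocator.
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)
M_MMAP_THRESHOLD = -3            # constant from glibc's <malloc.h>
ok = libc.mallopt(M_MMAP_THRESHOLD, 1 << 20)
print("mallopt(M_MMAP_THRESHOLD, 1 MB) succeeded:", bool(ok))
```

Note that setting the threshold explicitly also disables glibc's dynamic threshold adjustment, which is usually what you want when you know your allocation-size distribution.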
Free-list fragmentation pinned by the top chunk. glibc returns memory to the OS only by shrinking the top chunk via sbrk(-n), and the top chunk only shrinks if no allocation lives above it. A long-running process that ever held a large allocation high in the heap can have hundreds of MB of "free" memory the kernel never sees freed — RSS climbs forever. The classic Indian-context symptom: a Django + gunicorn worker running for 7 days that has not changed its real working set but whose RSS has grown from 200 MB to 4 GB, and the only reliable fix is gunicorn --max-requests=2000 (kill and restart workers). The allocator can't give back what it can't reach.
The third behaviour is tcache (per-thread caches, added in glibc 2.26). These help — they remove the arena mutex from the fast path for small allocations — but they have their own pathology: tcache is bounded to 7 entries per size class by default, so a workload that allocates and frees in batches of 100 short-lived objects per request still falls back to the arena path most of the time. glibc.malloc.tcache_count (set via GLIBC_TUNABLES=glibc.malloc.tcache_count=64) raises this, but it remains a band-aid on a deeper structural issue: the allocator was not designed for the allocation rates modern services produce.
What jemalloc and tcmalloc do differently
jemalloc (Facebook, originally FreeBSD) and tcmalloc (Google) rebuild the allocator around the assumption that every thread allocates concurrently and frequently. The shared structure is small; the per-thread structure is large.
Per-thread caches across fine-grained size classes. Both allocators give each thread a private cache of free blocks across many size classes (jemalloc: ~232 size classes in total; classic tcmalloc: ~88). A thread allocating a 96-byte object hits its private cache without touching any shared data structure. Only when the cache is empty (or full) does it touch a per-CPU or per-process structure. This is the single biggest reason the allocator wall flattens — the lock contention disappears from the fast path because there is no lock on the fast path.
Arena-per-CPU instead of arena-per-thread. jemalloc 5+ can assign arenas by CPU rather than by thread (opt.percpu_arena), which means a thread migrating cores keeps its arena cache locality. Google's modern tcmalloc has a similar per-CPU mode built on restartable sequences. This matters because thread migration is constant under Linux's CFS — timeslice expiry plus wakeups after I/O mean a busy thread can change cores hundreds of times per second.
Aggressive page return via madvise(MADV_DONTNEED). Where glibc holds onto memory until the top chunk is shrinkable, jemalloc periodically scans dirty pages and calls madvise(MADV_DONTNEED) on idle ones, telling the kernel "you can reclaim these without me losing data; the next read will fault them back as zero-filled". RSS shrinks back toward the working set within seconds of the workload tapering, instead of staying inflated forever. Why this is operationally important: a service whose RSS shrinks responsively can be packed denser onto a node — Razorpay can run 4 payment workers per pod instead of 2, because they can trust that an idle worker's RSS will fall. This is one of the largest hidden cost wins of switching allocators, larger than the latency improvement.
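The MADV_DONTNEED semantics are easy to see first-hand. A minimal sketch (Linux, Python 3.8+ for mmap.madvise; uses a private anonymous mapping, which is the case where the man page guarantees zero-fill on next access):

```python
# Demonstrate what MADV_DONTNEED means for a private anonymous mapping:
# the kernel may reclaim the pages immediately, and the next access
# faults them back zero-filled. This is the mechanism jemalloc uses to
# return idle dirty pages without needing the heap to shrink.
import mmap

m = mmap.mmap(-1, 4096 * 4,
              flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
m[:] = b"x" * len(m)              # dirty every page
before = m[0:4]                   # b"xxxx"
m.madvise(mmap.MADV_DONTNEED)     # tell the kernel it can drop the pages
after = m[0:4]                    # should fault back zero-filled
print(before, after)
m.close()
```

The data is gone but the mapping is still valid; an allocator can hand those pages out again with zero cost beyond the refault.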
There are trade-offs. jemalloc's per-thread caches add memory overhead — typically 5–10% more RSS at steady state because each thread has its own buffer. tcmalloc has historically shown higher latency for very large allocations because of how it handles spans. Neither is universally faster than glibc; both are dramatically better under the specific load pattern that creates the wall (high allocation rate, many threads, small-to-medium allocations).
The drop-in path is LD_PRELOAD. The compile-in path is to link against libjemalloc.so or libtcmalloc.so at build time. For Python, the LD_PRELOAD route works because CPython uses the system malloc for objects larger than 512 bytes (smaller objects use pymalloc's arena-based allocator, which is unaffected). For Go, the runtime has its own allocator; LD_PRELOAD does not change the picture and you optimise via GC tuning instead.
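After an LD_PRELOAD change, it is worth verifying which allocator is actually mapped into the process rather than trusting the deploy config. A small Linux-only sketch that greps /proc/self/maps for the usual allocator shared objects:

```python
# Operational check (Linux): which allocator library is mapped into this
# process? An LD_PRELOAD that is set but points at a missing .so fails
# silently, leaving you on glibc while believing you are on jemalloc.
def loaded_allocator() -> str:
    with open("/proc/self/maps") as f:
        maps = f.read()
    for name in ("jemalloc", "tcmalloc", "mimalloc"):
        if name in maps:
            return name
    return "glibc (no allocator preload found)"

print(loaded_allocator())
```

Run it inside the canary and inside a control instance; the two should disagree, or your A/B test is measuring nothing.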
The shape of the wall — throughput vs allocations per request
Below is the curve every team eventually plots when they realise the allocator might be the wall. On the x-axis: allocations per request. On the y-axis: maximum sustainable throughput. The shape is roughly hyperbolic — double the allocations per request and you halve your peak throughput, until you hit the lock-contention regime, where it falls off a cliff because arena mutexes start serialising threads against each other.
The two regimes need different fixes. Why this matters for the engineer's playbook: to the left of the knee (most services), switching allocator helps but is not transformative — you save 30–50% on the per-allocation cost, you do not change the scaling curve. To the right of the knee, switching allocator can be 5–10× because you are removing lock contention, not just per-call overhead. Knowing which regime you are in tells you whether to expect modest or dramatic gains, and saves you the embarrassment of promising a 5× win and delivering 1.4×.
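The shape of that curve can be sketched with a toy cost model — all constants below are illustrative, chosen only to reproduce the hyperbolic-then-cliff shape described above, not measured:

```python
# Toy model: per-request service time = fixed base cost + n_alloc
# allocator calls; past a contention knee, the per-call cost inflates
# because arena mutexes start serialising threads. Illustrative only.
def max_throughput(n_alloc: int, base_us: float = 5.0,
                   alloc_us: float = 0.2, knee: int = 32,
                   contended_us: float = 2.0) -> float:
    per_call = alloc_us if n_alloc <= knee else contended_us
    return 1e6 / (base_us + n_alloc * per_call)   # req/s on one core

for n in (1, 8, 16, 32, 64, 128):
    print(f"{n:>4} allocs/req -> {max_throughput(n):>8.0f} req/s")
```

Left of the knee, halving the per-call cost (switching allocator) moves the curve modestly; right of the knee, removing the contention term moves it by multiples — the two regimes the playbook distinguishes.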
The fastest way to locate yourself on this curve in production: count allocations per request. The Python way is to wrap your handler with tracemalloc:
import tracemalloc

tracemalloc.start()
snap1 = tracemalloc.take_snapshot()
handle_one_request(payload)   # your real code
snap2 = tracemalloc.take_snapshot()
diffs = snap2.compare_to(snap1, "lineno")
total_blocks = sum(d.count_diff for d in diffs if d.count_diff > 0)
print(f"allocations this request: {total_blocks}")
Run this on 100 representative requests, take the median. If it is under 10, you are far left of the knee and the allocator is unlikely to be your wall. If it is over 30, switch allocator first and re-measure before any other tuning.
Production patterns that flatten the wall
When swapping the allocator is not enough — or not allowed — three patterns reduce allocation pressure at the application layer:
Object pools. Pre-allocate N instances of frequently-allocated types at startup; check out and check in instead of new/delete. Java's Netty does this for ByteBufs; Go's sync.Pool does it for any object; Python's __slots__ plus a free-list does it for dataclasses. The harness above demonstrates the technique: POOL is 64 pre-allocated dicts, and pooled rotates through them. The throughput gain (1.8× even on glibc) is the elimination of the allocator from the per-request path.
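For the Python case, a checkout/checkin pool looks like this — a minimal, thread-unsafe sketch in the spirit of the POOL array in the harness above (names are illustrative):

```python
# A minimal free-list buffer pool: pay the allocation cost once at
# startup, then check buffers out and in instead of allocating per
# request. Thread-unsafe by design; use one pool per worker thread.
class BufferPool:
    def __init__(self, n: int, size: int):
        self._size = size
        self._free = [bytearray(size) for _ in range(n)]  # allocate once

    def checkout(self) -> bytearray:
        if self._free:
            return self._free.pop()       # fast path: no allocator call
        return bytearray(self._size)      # pool exhausted: degrade gracefully

    def checkin(self, buf: bytearray) -> None:
        buf[:] = bytes(self._size)        # scrub before reuse
        self._free.append(buf)

pool = BufferPool(n=64, size=4096)
buf = pool.checkout()
buf[0:5] = b"hello"                       # ... use for one request ...
pool.checkin(buf)
```

The graceful-degradation branch matters in production: a pool that blocks or raises when exhausted converts an allocation problem into an availability problem.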
Arena allocation per request. Allocate a single large block at the start of a request, bump-allocate sub-objects from it, free the entire block at the end. This is what Apache Arrow does internally; what Postgres does for query memory contexts; what Envoy does for per-connection state. The free becomes O(1) regardless of how many sub-objects were allocated, and the allocator is invoked twice per request (once to grab the arena, once to return it) instead of dozens of times.
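A per-request bump allocator can be sketched in a few lines over one pre-allocated bytearray — an illustrative sketch of the pattern, not a production allocator:

```python
# Per-request arena: one allocator call to create the slab, then every
# sub-allocation is a pointer bump, and "freeing" the whole request is
# resetting one offset -- O(1) no matter how many objects were carved.
class RequestArena:
    def __init__(self, capacity: int = 64 * 1024):
        self._slab = bytearray(capacity)  # one allocator call per arena
        self._top = 0

    def alloc(self, n: int) -> memoryview:
        if self._top + n > len(self._slab):
            raise MemoryError("arena exhausted")
        start = self._top
        self._top += n
        return memoryview(self._slab)[start:self._top]

    def reset(self) -> None:
        self._top = 0                     # frees every sub-object at once

arena = RequestArena()
hdr = arena.alloc(16)                     # carve request-scoped buffers
body = arena.alloc(1024)
arena.reset()                             # end of request: O(1) "free"
```

The failure mode to design for is a sub-object outliving its request: a view handed out of the arena becomes dangling-by-reuse after reset, which is the same discipline C arena allocators impose.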
Materialise less. A lazy parser like simdjson's On-Demand API returns views into the input bytes rather than constructing a full object tree. If your handler only needs obj["edges"][0]["pop"], you can read that field without allocating the dict, the list, or the inner dict — the allocation cost becomes proportional to fields-touched, not fields-present. (orjson still materialises the full tree, but with far fewer and cheaper allocations than the stdlib json module.) This is the cleanest fix when applicable, because it removes the allocations rather than making them cheaper.
The choice between these patterns depends on lifecycle: if objects are short-lived and per-request, arena allocation wins; if they are long-lived but reused, object pools win; if you can avoid materialising entirely, do that first. In production you usually combine all three — Hotstar's manifest service ended up using orjson (fewer, cheaper allocations per parse), a per-request slab allocator (arena-style), and LD_PRELOAD=jemalloc (drop-in fix). Each step removed roughly half the remaining allocator cost; together they collapsed p99 from 380 ms to 28 ms with no change to disk or network configuration.
Edge cases that bite in real production
Three failure modes recur often enough that they deserve naming. The first two are operational; the third is observability.
The midnight RSS climb. A service starts the day at 800 MB RSS, climbs to 3.5 GB by midnight, and gets OOM-killed during the next morning's traffic spike. There is no leak — valgrind --leak-check=full reports zero. The cause is glibc holding freed memory above an allocation it cannot reach to release. The classic Indian-context trigger is a daily reporting query that materialises a 200 MB result and then exits — but pinned a high-water-mark in the heap that prevents sbrk from shrinking. The fix is mallopt(M_TRIM_THRESHOLD, 64 * 1024) to trim aggressively, or switch to jemalloc whose madvise(MADV_DONTNEED) returns the pages without needing the heap to be contiguous.
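When you suspect this failure mode, glibc's malloc_trim lets you test it live: it walks the arenas and returns whatever free memory it can to the kernel, without waiting for the top chunk to become shrinkable (glibc 2.8+ can release free chunks in the middle of the heap via madvise). A hedged ctypes sketch, assuming Linux with glibc as the active allocator:

```python
# Force glibc to return free heap pages to the kernel right now.
# malloc_trim(0) returns 1 if any memory was released, 0 otherwise.
# If RSS drops sharply after this call, you have hoarding, not a leak.
import ctypes

libc = ctypes.CDLL("libc.so.6")
junk = [bytes(8192) for _ in range(8000)]   # inflate the heap by ~64 MB
del junk                                     # freed to glibc, not to the OS
released = libc.malloc_trim(0)
print("malloc_trim released memory:", bool(released))
```

A cron-driven malloc_trim is a crude but real mitigation some teams run between gunicorn restarts; it treats the symptom while you plan the allocator switch.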
The thundering retry. A service gets a 50% traffic surge from an upstream retry storm. CPU climbs, p99 climbs, but throughput barely moves. The flamegraph shows __lll_lock_wait taking 30% of CPU — every thread is contending on the arena mutex, and each blocked thread just adds to the contention. The wrong fix is to add more workers (it makes contention worse). The right fix is to reduce per-request allocations or switch to an allocator with per-thread caches. This is the failure mode behind the manifest-service page at 21:48 IST that opened the chapter — once load pushed allocator contention past the contention knee, latency went non-linear.
The lying RSS dashboard. A team switches from glibc to jemalloc and their RSS dashboard reports an immediate 1.5× increase. They roll back. In fact, jemalloc had the same live working set — but pages it had marked with MADV_FREE still count as resident until the kernel actually reclaims them, so the dashboard over-reported. The real memory pressure was unchanged; the dashboard was confused. Always cross-check RSS with pmap -X and /proc/<pid>/smaps_rollup after an allocator change, and treat single-day RSS deltas as suspect for at least the first week.
Common confusions
- "Allocators only matter in C/C++." Wrong. Python, Java, Go, Node — all of them eventually call into a system or runtime allocator. CPython falls through to `malloc` for objects above 512 bytes; the JVM allocates from the system heap when expanding generations; Go's runtime uses `mmap` directly but exhibits identical fragmentation patterns. The allocator wall is language-independent.
- "jemalloc is just faster than glibc." It is faster under the load patterns that create the wall — many threads, high allocation rate, mixed sizes. For a single-threaded program with a low allocation rate, the difference is within noise. Switching allocators on a workload that doesn't have the wall is a no-op at best and a 5–10% RSS regression at worst.
- "`top`'s RSS column tells me my memory pressure." RSS includes pages the allocator is holding for re-use, not just live data. A 4 GB RSS with only 800 MB of live objects means the allocator is hoarding 3.2 GB. Use `pmap -x <pid>` plus `/proc/<pid>/smaps` to see anonymous regions, or jemalloc's `mallctl` interface to ask the allocator directly.
- "Pre-allocating fixes everything." It removes the per-request allocation, but pools introduce their own contention if shared across threads. Per-thread pools or lock-free pools (like `crossbeam_queue::ArrayQueue`) avoid that, but you have moved the bottleneck rather than removed it. Always re-profile after the fix.
- "Allocator overhead is a Linux problem." No — Windows and macOS have analogous walls (the Low-Fragmentation Heap on Windows, libmalloc on macOS), each with its own quirks. The `LD_PRELOAD` mechanism is Linux-specific; the underlying performance question is universal.
- "GC pauses are the same as allocator overhead." Different. GC pauses are stop-the-world events during reclaim; allocator overhead is per-allocation cost during the mutator. You can have one without the other — Go has GC pauses but a fast allocator; C++ with `tcmalloc` has no GC pauses but can still hit allocator contention. See /wiki/garbage-collection-tradeoffs for the GC side.
Going deeper
The madvise and MADV_FREE saga
Linux added MADV_FREE in kernel 4.5 as a faster alternative to MADV_DONTNEED. Where MADV_DONTNEED zero-fills pages on next access, MADV_FREE lets the kernel reclaim or keep the page contents at its discretion — the application gets back either the original data or zeros. jemalloc adopted MADV_FREE enthusiastically and immediately confused everyone monitoring RSS, because pages marked MADV_FREE still count as resident until the kernel actually reclaims them. RSS dashboards stopped predicting OOM correctly. jemalloc 5's answer was the opt.dirty_decay_ms and opt.muzzy_decay_ms tunables controlling the dirty-to-muzzy-to-clean transition. The lesson: allocator and kernel optimisations can both individually be correct and jointly produce broken observability — and the operations team pays for it. Brendan Gregg's Systems Performance Ch. 7 ("Memory") covers madvise semantics; the jemalloc docs cover the tunables.
Allocator contention on NUMA
On a dual-socket NUMA system (a Razorpay primary database node, say — an EPYC Genoa box with 2 sockets and 96 cores per socket), the allocator must place allocations near the requesting CPU or you pay the full 200+ cycle remote-memory latency on every access. glibc's allocator is not NUMA-aware: large mmap allocations can land on whichever socket the kernel finds free pages on, regardless of who asked. jemalloc with opt.percpu_arena:percpu plus a numactl --cpunodebind --membind wrapper keeps allocations local. tcmalloc's per-CPU caches do the same. The performance delta on NUMA can be 1.5–2× on the affected workload. See /wiki/numa-aware-allocators-and-data-structures for the deep tour.
When the allocator is in the kernel — kmalloc and slab
Inside the kernel, the same wall exists at a different scale. kmalloc and the slab allocators (SLUB today; SLAB and SLOB historically) make analogous trade-offs about per-CPU caches, fragmentation, and reclamation. eBPF programs allocate from the BPF map allocator; networking stacks allocate sk_buffs from a slab pool; filesystems allocate inodes and dentries from named slabs. A system-wide kernel profile (perf record -a -e cpu-clock -- sleep 30 && perf script | stackcollapse-perf.pl | flamegraph.pl) shows the same shape — fat slab-allocator frames on services pushing high packet rates. The fix is symmetric: tune slab cache behaviour via /sys/kernel/slab/, or move to per-CPU data structures.
Telling the allocator wall apart from the GC wall
In garbage-collected runtimes (CPython's reference counting + cycle collector, JVM with G1/ZGC, Go's concurrent mark-sweep), it is easy to misread allocator overhead as GC overhead and vice-versa. The diagnostic distinction is timing: GC pauses are stop-the-world events that show up in off-CPU flamegraphs and in monotonic gaps in your handler latency. Allocator overhead is on-CPU and shows up as fat malloc/free frames in on-CPU flamegraphs. If your p99 spikes correlate with GC log entries (gc, pause: 12ms in Go's gctrace, [GC pause young, 0.0234567 secs] in JVM logs), you are looking at GC. If they correlate with allocation rate without any GC log activity, you are looking at the allocator. Both can be present at once on a JVM service; the on-CPU vs off-CPU split is the cleanest separator.
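In CPython specifically, you can instrument the cycle collector directly and correlate its runs against handler latency. A minimal sketch using the real gc.callbacks hook (each registered callable is invoked with "start"/"stop" around every collection):

```python
# Timestamp every CPython cycle-collector run via gc.callbacks, so p99
# spikes can be correlated (or not) with collector activity -- the
# pause-vs-on-CPU separation described above, applied to CPython.
import gc, time

pauses = []          # collection durations, in milliseconds
_t0 = [0.0]

def _gc_hook(phase, info):
    if phase == "start":
        _t0[0] = time.perf_counter()
    else:                                  # phase == "stop"
        pauses.append((time.perf_counter() - _t0[0]) * 1e3)

gc.callbacks.append(_gc_hook)
gc.collect()                               # force one run to demonstrate
gc.callbacks.remove(_gc_hook)
print(f"observed {len(pauses)} collection(s), last = {pauses[-1]:.3f} ms")
```

If your latency spikes line up with entries in `pauses`, you are looking at the collector; if they track allocation rate while `pauses` stays quiet, you are looking at the allocator.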
Reproduce this on your laptop
sudo apt install libjemalloc2 libgoogle-perftools4
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram orjson
LD_PRELOAD="" python3 alloc_wall.py glibc 200000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python3 alloc_wall.py jemalloc 200000
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python3 alloc_wall.py tcmalloc 200000
You should see throughput climb 1.3–1.5× and p99.9 fall 1.5–2× when switching to jemalloc or tcmalloc on the per_request path; the pooled_reuse path should be near-identical across all three. If the gap is larger or smaller than this, your CPU's branch predictor or cache effects are dominating — re-run with longer warmup and on an isolated CPU (taskset -c 3 python3 alloc_wall.py ...).
Where this leads next
This wall is the bridge from "I/O performance" thinking into the allocator chapters that follow. Once you accept that the allocator can dominate, the next questions are: what does a good general-purpose allocator look like internally, and when do you need a special-purpose one? The chapters ahead unpack the major modern allocators (jemalloc, tcmalloc, mimalloc), the design patterns that bypass malloc entirely (object pools, arena allocators), and the fragmentation diagnostics that tell you whether you are leaking, hoarding, or just allocating in a pattern your allocator dislikes.
The reading order, in roughly the order a debugging session needs them:
- /wiki/jemalloc-vs-tcmalloc-vs-mimalloc — the three modern allocators, side by side, with the production trade-offs.
- /wiki/object-pools-and-arena-allocators — when to bypass `malloc` entirely.
- /wiki/numa-aware-allocators-and-data-structures — what changes when "memory" is plural.
- /wiki/fragmentation-internal-vs-external — why your RSS grows even when your live-set doesn't.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement discipline this chapter relied on.
References
- jemalloc documentation — the canonical reference; read §"OPTIONS" for `opt.percpu_arena` and `opt.muzzy_decay_ms`.
- TCMalloc design notes — Google's per-thread cache design, the original blueprint copied by jemalloc and mimalloc.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Ch. 7 "Memory" — the binding reference for this chapter.
- Doug Lea, "A Memory Allocator" (1996) — `dlmalloc`, the ancestor of glibc's `ptmalloc`. Foundational reading on bins and the boundary-tag method.
- glibc malloc internals — the official walkthrough of arenas, fastbins, tcache, and the `M_MMAP_THRESHOLD` flip.
- Microsoft Research, "Mimalloc: Free List Sharding in Action" (2019) — the third major modern allocator, and the cleanest paper on what makes a fast malloc.
- /wiki/disk-i-o-observability-iostat-biolatency — the prior chapter; this one is the negative result that says "the disk is fine, look elsewhere".
- Bonwick & Adams, "Magazines and Vmem" (USENIX 2001) — the slab-and-magazine architecture that influenced jemalloc's and tcmalloc's per-thread caches.