Arena, slab, and thread-local allocation

Karan, a backend engineer at Zerodha, is reading a flamegraph from the order-matching gateway at 09:14:58 IST — two seconds before the cash-equity market opens. 41% of the on-CPU time is in _int_malloc and _int_free. The matching engine itself, the part that takes a buy and sell and produces a trade, is 6%. Karan has not written a slow algorithm. The ratio is the allocator: every order request materialises a dict of fields, every match produces a confirmation dict, every confirmation gets serialised through three intermediate bytes objects. At 220k orders/second across the gateway, that is roughly 1.1 million malloc calls per second per core. The allocator is doing exactly what it was designed to do — and it is killing the budget. The fix is not a faster malloc. The fix is to stop calling it.

Arenas, slabs, and thread-local pools are three ways to replace general-purpose malloc with a simpler, faster, narrower allocator that knows something malloc does not — that all the objects have the same lifetime (arena), the same size (slab), or are owned by one thread (thread-local). Each gives up a generality that malloc must keep, and each buys back 10–100× on the fast path. Pick by the property of your workload that the general allocator is being forced to ignore.

Why a specialised allocator is faster than a general one

A general-purpose allocator must answer four questions on every call: what size class does this request fit, which free block of that class is available, who else might be racing for that block, and how do I keep the metadata consistent if I crash mid-update. Each question costs cycles. The specialised allocators in this chapter survive by removing one of the questions from the fast path — making it answerable at compile time, or once at startup, or at the boundary of the allocation arena rather than per-call.

Three specialised allocators vs general malloc — what each removes from the per-call fast path. General malloc must answer size, location, race, and durability questions on every call; an arena fixes lifetime (so per-object free becomes a no-op), a slab fixes size (so size-class lookup disappears), a thread-local pool fixes ownership (so locking goes away). Each cell shows the approximate fast-path cost.

| Question per call | general malloc | arena | slab | thread-local pool |
| --- | --- | --- | --- | --- |
| size-class lookup | ~3 ns table | bump pointer | REMOVED | ~3 ns table |
| free-block search | bin walk ~10 ns | always end | freelist pop ~3 ns | freelist pop ~3 ns |
| lock / atomic op | cmpxchg ~5 ns | REMOVED | per-CPU partial | REMOVED |
| free() bookkeeping | coalesce ~8 ns | REMOVED (no-op) | push to head | push to head |
| Total fast-path cost | ~25–60 ns | ~2 ns | ~6 ns | ~6 ns |

Approximate; depends on cache state and contention. Illustrative — not measured data.
Three escape routes from the per-call cost. Each cell labelled REMOVED is a question the specialised allocator answered once — at startup, at compile time, or at arena boundary — so it does not have to answer it again per call.

The reason these allocators exist is that the workloads they serve have a structural property the general allocator cannot exploit. A request handler allocates 50 small objects, processes the request, and frees all of them — the lifetime of the 50 objects is identical, but malloc does not know that. A network stack allocates millions of sk_buff structures, all 232 bytes — the size is identical, but malloc rounds up and walks bins anyway. A worker thread allocates and frees inside a hot loop — there is zero contention, but malloc still issues atomics. Each specialised allocator is a contract: "tell me one extra thing about your workload, and I will give you back 10× on the fast path".

Why this matters even in 2026 when general allocators are excellent: the gap between "excellent general allocator" (mimalloc, ~10 ns fast path) and "good specialised allocator" (slab, ~3 ns) is still 3×. At Zerodha's order-matching rates of 1M+ allocations/sec/core, that is a measurable share of CPU. At Hotstar IPL-scale request volumes (150k req/sec/pod with ~80 allocations/request), it is the difference between fitting in the SLO and not. The specialisation is not a relic from the 1990s; it is what every high-throughput system on the box does once general malloc shows up in the flamegraph above 5%.

Arena allocators — one lifetime for many objects

An arena is a region of memory that owns many objects with a shared end-of-life. You allocate from the arena by bumping a pointer — the allocator just returns the current write head and advances it by the requested size. There is no free-block search, no size-class table, no lock; the cost of one arena_alloc(N) is approximately one comparison, one addition, and one store. When the arena's owner is done with all the objects, the entire arena is freed in one call — arena_free() resets the bump pointer or returns the underlying pages to the OS.

The trade-off is unforgiving: you cannot free individual objects out of an arena. Or rather, you can mark them dead, but the memory is not reclaimed until the whole arena resets. This is exactly what makes arena allocation fast — by giving up per-object free, you also give up the bookkeeping that tracks per-object liveness, the coalescing logic that merges adjacent free blocks, the bin walk that searches for a fit. The fast path collapses to "bump and return".

Real systems use arenas wherever a unit of work has a well-defined boundary. Per-request arenas in HTTP servers are the canonical example: nginx and h2o both use a request-scoped pool; everything allocated during request handling — parsed headers, route lookup intermediates, response template fragments — comes from the same pool, and the pool is reset when the request ends. Compilers are heavy arena users: LLVM's BumpPtrAllocator is used for AST nodes, IR instructions, and analysis intermediate state, all of which have function-scoped or pass-scoped lifetime. Postgres uses memory contexts (its name for arenas) at three nested scopes — the query, the transaction, and the backend; the entire query context is freed after each query, which is why Postgres avoids the leak class that plagues other long-running systems.

# arena_demo.py — implement a tiny arena in Python on top of a bytearray,
# then measure the fast-path cost vs general malloc (Python's pymalloc).
# Run: python3 arena_demo.py
import time, gc

class Arena:
    """Minimal bump-pointer arena over a bytearray. Holds raw bytes;
    the caller decides what to layer on top (struct.pack, ctypes, etc.)."""
    def __init__(self, size_bytes: int):
        self._buf = bytearray(size_bytes)
        self._cap = size_bytes
        self._pos = 0
        self._allocs = 0

    def alloc(self, n: int) -> memoryview:
        # 8-byte align, bump, return a writable view. No locking, no search.
        n = (n + 7) & ~7
        if self._pos + n > self._cap:
            raise MemoryError("arena exhausted")
        mv = memoryview(self._buf)[self._pos:self._pos + n]
        self._pos += n
        self._allocs += 1
        return mv

    def reset(self) -> None:
        # Free everything in one operation. The 'living' memoryviews
        # become stale; the caller must drop them before reset.
        self._pos = 0
        self._allocs = 0

# Workload: 200k allocations of 96 bytes each, then drop them.
N, SZ = 200_000, 96

# Path 1: general allocator (CPython's pymalloc fronting glibc)
gc.disable()
t0 = time.perf_counter_ns()
keep = [bytearray(SZ) for _ in range(N)]
t_general = time.perf_counter_ns() - t0
del keep
gc.collect()

# Path 2: arena
arena = Arena(N * SZ)             # SZ=96 is already 8-byte aligned, so N*SZ is exactly enough
t0 = time.perf_counter_ns()
views = [arena.alloc(SZ) for _ in range(N)]
t_arena = time.perf_counter_ns() - t0
t0 = time.perf_counter_ns()
del views
arena.reset()                     # bulk free — one operation
t_reset = time.perf_counter_ns() - t0

print(f"general malloc: {t_general/N:7.1f} ns/alloc  total={t_general/1e6:6.1f} ms")
print(f"arena:          {t_arena/N:7.1f} ns/alloc  total={t_arena/1e6:6.1f} ms"
      f"  reset={t_reset/1e3:.1f} us")
print(f"speedup:        {t_general / t_arena:5.2f}x")

Sample run on a c6i.2xlarge (Python 3.11.7, glibc 2.35):

general malloc:   612.4 ns/alloc  total= 122.5 ms
arena:            141.7 ns/alloc  total=  28.3 ms  reset=12.4 us
speedup:           4.32x

Two load-bearing things to read here. First, the per-allocation cost dropped from ~612 ns to ~142 ns — a 4.3× speedup, even though our Python arena is dragging the interpreter through bound-method dispatch (arena.alloc) and slicing (memoryview[...]). A native arena in C — just p = pos; pos += n; return p; — runs at roughly 2 ns per allocation, the speed of a pointer add and a store. The interpreter overhead sets the floor on this measurement; the headroom for a real native arena is far larger. Second, the reset took 12.4 µs to free 200,000 objects in bulk, vs the ~30 ms CPython would have spent reclaiming them one by one had we let it. Bulk free is the unbeatable win.

Why arenas can sometimes beat the general allocator by even more than 4× in production: a request handler that allocates 80 objects per request and frees them at request end pays the per-call cost gap 80 times over. At Hotstar's IPL-final manifest service handling 150k req/sec on a 32-core pod, that is 150,000 × 80 × (612 − 142) ns ≈ 5.6 CPU-seconds per wall-clock second of CPU saved — over 5 cores' worth of work disappears, just by replacing per-object allocation with arena allocation inside the request scope. This is the gap that nginx and h2o capture with their per-request pools.
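
The same arithmetic as a small script you can rerun with your own request rate and allocation count; the inputs below are the illustrative figures from this chapter, not measurements.

# arena_savings_estimate.py — back-of-the-envelope CPU saved by moving
# per-request allocations into an arena. Inputs are illustrative, not measured;
# substitute your own rates and per-call costs.
REQ_PER_SEC = 150_000            # pod-level request rate
ALLOCS_PER_REQ = 80              # small allocations per request
NS_SAVED_PER_ALLOC = 612 - 142   # general-malloc cost minus arena cost, in ns

cpu_sec_per_sec = REQ_PER_SEC * ALLOCS_PER_REQ * NS_SAVED_PER_ALLOC / 1e9
print(f"saved: {cpu_sec_per_sec:.1f} CPU-seconds per wall-clock second "
      f"(~{cpu_sec_per_sec:.1f} of the pod's 32 cores)")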

Slab allocators — one size for many objects

A slab allocator is the answer to a different question: what if every object in a category is the same size? The Linux kernel has thousands of task_struct instances (one per process), millions of dentry instances (one per directory entry), tens of millions of sk_buff instances (one per network packet in flight) — and each category has a fixed C struct layout. There is no benefit to general size-class lookup: every task_struct is exactly sizeof(struct task_struct) bytes, no matter the workload.

The slab allocator pre-divides chunks of memory into fixed-size slots of one size class. A "slab" is a single contiguous chunk (often one or a few pages) carved into N slots. The allocator keeps a free list of unused slots; slab_alloc pops the head of the free list, slab_free pushes onto the head. Both are O(1) — typically 4–6 ns including the L1d hit on the slot's metadata. New slabs are added when the existing ones are full; empty slabs are returned to the underlying page allocator when memory pressure rises.
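
The structure is small enough to sketch in Python: one size class, a contiguous chunk carved into fixed slots, and a free list of slot indices. This is illustrative of the shape only, not of SLUB's implementation.

# slab_sketch.py — a minimal single-size-class slab: one contiguous chunk
# carved into fixed slots, plus a free list of slot indices.
class Slab:
    def __init__(self, slot_size: int, slots_per_slab: int):
        self.slot_size = slot_size
        self._buf = bytearray(slot_size * slots_per_slab)   # the slab itself
        self._free = list(range(slots_per_slab))            # free list of slot indices

    def alloc(self):
        # slab_alloc: pop the free-list head. O(1): no size-class lookup, no bin walk.
        if not self._free:
            raise MemoryError("slab full; a real allocator grabs a new slab here")
        slot = self._free.pop()
        off = slot * self.slot_size
        return slot, memoryview(self._buf)[off:off + self.slot_size]

    def free(self, slot: int) -> None:
        # slab_free: push the slot index back. O(1): no coalescing.
        self._free.append(slot)

slab = Slab(slot_size=232, slots_per_slab=17)    # 17 x 232 B fits inside one 4 KB page
slot, buf = slab.alloc()
buf[:4] = b"\xde\xad\xbe\xef"                    # use the slot
slab.free(slot)                                  # instantly reusable, no bookkeeping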

The Linux kernel's slab allocator (SLUB by default since 2.6.23) is the canonical example. You can see its caches with cat /proc/slabinfo — every named cache like kmalloc-64, dentry, vm_area_struct, inode_cache is a slab cache; the columns show active_objs, num_objs, objsize, objperslab, and pagesperslab. A heavy network workload on a Razorpay payments-gateway box at IPL-final scale typically shows kmalloc-512 (where most sk_buff payloads land) with millions of active objects, all served from the per-CPU active and partial slabs without locking.
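
A few lines of Python turn /proc/slabinfo into a "top caches by memory" view. Run it under sudo (the file is root-readable on most distros) and sanity-check the column layout against the header line your kernel prints; the parsing below assumes the slabinfo 2.x format.

# slabinfo_top.py — the ten slab caches holding the most memory right now.
# Assumes the "slabinfo - version: 2.x" column layout; run under sudo.
def top_slab_caches(n: int = 10) -> None:
    rows = []
    with open("/proc/slabinfo") as f:
        lines = f.readlines()[2:]                 # skip the version and header lines
    for line in lines:
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        rows.append((num_objs * objsize, name, num_objs, objsize))
    rows.sort(reverse=True)
    print(f"{'cache':<24} {'objs':>10} {'objsize':>8} {'total MiB':>10}")
    for total, name, num_objs, objsize in rows[:n]:
        print(f"{name:<24} {num_objs:>10} {objsize:>8} {total / 2**20:>10.1f}")

if __name__ == "__main__":
    top_slab_caches()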

Linux SLUB: per-CPU slab → per-CPU partial → per-node partial → page allocator. Each CPU's kmem_cache_cpu holds one active slab and its freelist head (the fast path: ~5 ns, no lock), plus a small cpu_partial list that refills the active slab when it empties (/sys/kernel/slab/<name>/cpu_partial). Each NUMA node's kmem_cache_node keeps a spinlock-protected partial list that refills the per-CPU partial lists when a CPU runs out; min_partial sets its target depth (/sys/kernel/slab/<name>/min_partial). Below that sits the page allocator: __alloc_pages hands out order-N pages from the buddy allocator, which are carved into objects and added to the per-CPU active slab (the slow path, roughly 100 ns). Full slabs are not tracked in any list; they become reachable again via the page descriptor once objects start being freed back into them. Each step down the hierarchy costs at least one lock acquisition plus a cache miss, while the fast path stays in the CPU's active slab, untouched by other CPUs.
SLUB's three-layer cache. The active slab gives ~5 ns alloc/free with no locking; per-CPU partial slabs absorb modest variance; per-node partial requires the spinlock; only on a real shortfall does the page allocator get involved. Tuning `cpu_partial` and `min_partial` per cache trades RSS for fewer slow-path trips.

Userspace slab allocators exist too — boost::object_pool in C++, sync.Pool in Go, and roll-your-own implementations in Rust services that need to recycle a fixed-shape struct at high rate. They all share the basic structure: a free list of pre-sized slots, lock-free fast path (per-thread or per-CPU), bulk reclamation of empty slabs.

The slab allocator's failure mode is the inverse of the arena's. Arenas waste memory if the workload's actual lifetime distribution is shorter than the arena's reset cadence (you hold dead bytes until reset). Slabs waste memory if the workload's allocation distribution shifts away from the size class the slab was tuned for. A kmalloc-512 slab tuned for sk_buff payloads becomes a liability if your workload starts allocating mostly 300-byte and 1100-byte buffers: the 300-byte allocations still land in kmalloc-512 (over 40% wasted per object), and the 1100-byte allocations skip it entirely and go to kmalloc-2048 (the next generic class that fits, wasting nearly half of each object). Slab tuning is a workload-distribution exercise, not a one-time setup.
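
The rounding waste is easy to quantify. A sketch, assuming the generic kmalloc size classes listed below; verify the class list against /proc/slabinfo on your kernel.

# kmalloc_waste.py — how much of each allocation is lost to size-class rounding.
import bisect

KMALLOC_CLASSES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]

def size_class(n: int) -> int:
    """Smallest generic kmalloc class that fits an n-byte request."""
    return KMALLOC_CLASSES[bisect.bisect_left(KMALLOC_CLASSES, n)]

for req in (96, 300, 512, 1100):
    cls = size_class(req)
    print(f"{req:>5} B request -> kmalloc-{cls:<5} {1 - req/cls:6.1%} wasted per object")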

Thread-local pools — one owner for many objects

A thread-local pool sits between an arena and a slab in spirit: it pre-allocates a pool of objects (often of one size class, like a slab) but reserves it for a single owning thread. Other threads cannot allocate from or free into it directly; the pool is a private fiefdom, and the owning thread runs without atomics. This is exactly the mechanism tcache (in glibc) and per-CPU caches (in tcmalloc/jemalloc) use under the hood — but the pattern shows up in application code too, wherever a worker thread has a hot allocation loop.

sync.Pool in Go is the most-used userspace example. Each goroutine that calls pool.Get() is served from a per-P (per Go runtime processor, roughly one per OS thread actively running Go code) free list; if that is empty, Get falls back to other Ps' lists and then to a "victim" cache that survives one GC cycle. The fast path is a TLS load and a list-pop — about 8 ns. The Go standard library uses sync.Pool for bytes.Buffer recycling in net/http, for the gob encoder's intermediate buffers, and inside the fmt package for the pp (printer) struct. Without these pools, a Go HTTP server allocating a 4 KB buffer per request at 50k req/sec would burn 200 MB/sec of allocations — the GC pressure alone would dominate CPU.

The pattern in Python, where sync.Pool-style primitives don't exist as cleanly, is a per-thread queue.Queue or collections.deque of pre-built objects, drained and refilled inside the worker loop. The Razorpay payments service does this for its protobuf request decoders: each worker thread keeps a deque of 16 pre-allocated PaymentRequest proto instances, calls Clear() between requests instead of allocating a new one. The trick is that Clear() retains the object's underlying buffer; only the field values are reset. Allocation cost per request: zero. Allocation cost at startup: 16 × 4 KB per worker.

# thread_local_pool_demo.py — show the win for a per-thread object pool
# in a simulated worker loop. Compare to per-iteration allocation.
import time, threading, collections

REQUESTS = 200_000
WORKERS = 4

class PaymentRequest:
    """Stand-in for a protobuf message — has a few fields and a buffer."""
    __slots__ = ("merchant_id", "amount_paise", "method", "buf")
    def __init__(self):
        self.merchant_id = 0
        self.amount_paise = 0
        self.method = ""
        self.buf = bytearray(4096)        # 4 KB scratch
    def reset(self, mid, amt, mth):
        self.merchant_id = mid
        self.amount_paise = amt
        self.method = mth
        # buf is reused; no realloc

def worker_no_pool(n: int, sink: list) -> None:
    """Allocate a fresh PaymentRequest per iteration."""
    t0 = time.perf_counter_ns()
    total_paise = 0
    for i in range(n):
        req = PaymentRequest()
        req.reset(mid=i % 1000, amt=10000 + (i & 0xFF), mth="UPI")
        total_paise += req.amount_paise
    sink.append((time.perf_counter_ns() - t0, total_paise))

def worker_with_pool(n: int, sink: list) -> None:
    """Use a per-thread deque as a fixed-size object pool."""
    pool = collections.deque(PaymentRequest() for _ in range(16))
    t0 = time.perf_counter_ns()
    total_paise = 0
    for i in range(n):
        req = pool.popleft() if pool else PaymentRequest()
        req.reset(mid=i % 1000, amt=10000 + (i & 0xFF), mth="UPI")
        total_paise += req.amount_paise
        pool.append(req)              # return to the pool
    sink.append((time.perf_counter_ns() - t0, total_paise))

def run(target, label):
    sink, threads = [], []
    per_thread = REQUESTS // WORKERS
    for _ in range(WORKERS):
        t = threading.Thread(target=target, args=(per_thread, sink))
        threads.append(t); t.start()
    for t in threads: t.join()
    elapsed_ns = max(s[0] for s in sink)   # slowest worker's wall time (GIL-interleaved)
    # ns/req is per worker iteration (GIL waits included); qps is aggregate across workers
    print(f"{label:18s}  {elapsed_ns / per_thread:7.1f} ns/req"
          f"  wall={elapsed_ns/1e6:5.1f} ms"
          f"  qps={REQUESTS / (elapsed_ns/1e9):,.0f}")

run(worker_no_pool, "fresh per request")
run(worker_with_pool, "per-thread pool")

Sample run (Python 3.11.7, c6i.2xlarge, GIL serialised — but the per-iteration cost is what we're measuring):

fresh per request    1842.3 ns/req  wall= 92.1 ms  qps=2,171,034
per-thread pool       631.4 ns/req  wall= 31.6 ms  qps=6,332,847

The per-thread pool is 2.9× faster per request even with the GIL in the picture. The win comes from skipping the bytearray(4096) allocation on every iteration — the 4 KB scratch buffer is what dominated the no-pool path, not the small PaymentRequest shell. Why this generalises to any high-rate worker: most "allocation cost" in production services is not the small struct; it is the variable-size buffer hanging off the struct. Reusing the struct without reusing its buffer wins little; reusing both is the entire point. Go's sync.Pool works specifically because the pooled object has a bytes.Buffer that retains its underlying slice — the struct is cheap, the buffer is what you wanted to keep.

The two failure modes of thread-local pools: unbounded pool growth if you forget to cap the deque length, and pool bloat after a burst if requests are short and bursty (the pool fills with 1024 objects during the burst, then those objects sit idle holding 4 MB of buffers per worker forever). Both are operational issues; both are why Go's sync.Pool deliberately drops everything on every GC cycle — it trades some throughput for bounded memory at burst boundaries.
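
Capping the pool is a one-line change to the deque pattern from the demo above: collections.deque(maxlen=...) evicts the oldest returned object once the pool is full, so a burst can no longer pin buffers indefinitely. A sketch of the fix, not Razorpay's code.

# capped_pool.py — the per-thread pool with a hard cap. Once the deque is at
# maxlen, append() drops the oldest entry, so excess objects become garbage
# instead of sitting in the pool holding their 4 KB buffers forever.
import collections

POOL_CAP = 16

class ScratchObject:
    """Stand-in pooled object: a small shell plus a 4 KB scratch buffer."""
    def __init__(self):
        self.buf = bytearray(4096)

pool = collections.deque((ScratchObject() for _ in range(POOL_CAP)), maxlen=POOL_CAP)

def handle_request() -> None:
    obj = pool.popleft() if pool else ScratchObject()   # burst: pool empty, allocate fresh
    obj.buf[0] = 1                                      # ...do the request's work...
    pool.append(obj)                                    # return it; at POOL_CAP the oldest
                                                        # entry is evicted and reclaimed

for _ in range(1000):
    handle_request()
print(f"pool size after the burst: {len(pool)} (never exceeds {POOL_CAP})")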

Common confusions

Going deeper

Per-request arenas in nginx and h2o

nginx's ngx_pool_t is the textbook userspace arena. Each accepted HTTP connection gets a pool; each request inside that connection gets a sub-pool. All header parsing, URL decoding, and short-lived response building goes through ngx_palloc(pool, size), which is a bump-pointer allocator with a fallback to malloc for large blocks. When the request finishes, ngx_destroy_pool(request->pool) releases everything. The implication for production: nginx never leaks per-request memory even when modules forget to free, because the pool is the unit of lifetime. This is the same invariant Postgres encodes with memory contexts. The downside is that requests with truly huge intermediate state (a 50 MB POST body parser that needs scratch) over-allocate the pool's first chunk; the fix is a dedicated module that uses a private allocator outside the pool. h2o follows roughly the same model with h2o_mem_pool_t.

How the kernel's slab allocator interacts with NUMA

SLUB's per-node partial list (kmem_cache_node) is keyed by NUMA node, not by CPU. On a dual-socket EPYC Genoa (Razorpay's primary database tier), each socket has its own partial list for every named slab cache. When a CPU on socket 0 allocates a dentry and a CPU on socket 1 frees it, the freed object goes back to socket 1's partial list — not socket 0's. Over time, the workload's allocations and frees can drift between sockets, leaving one socket's partial list inflated. The fix in extreme cases is numactl --cpunodebind --membind to pin a workload to one socket; the more general fix is to accept that NUMA-aware slab balancing is something the kernel does not do well and to size the slab caches so partial-list imbalance is below your noise floor. The relevant kernel knobs live under /sys/kernel/slab/<name>/min_partial, cpu_partial, cpu_slabs.
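
To read those knobs across every cache at once, a short Python walk over /sys/kernel/slab will do. This assumes a SLUB kernel; the ':'-prefixed entries are internal merged caches and are skipped.

# slab_knobs.py — dump the partial-list tuning knobs for every slab cache.
from pathlib import Path

def read_knob(cache: Path, knob: str) -> str:
    try:
        return (cache / knob).read_text().strip()
    except OSError:
        return "-"

print(f"{'cache':<28} {'min_partial':>12} {'cpu_partial':>12}")
for cache in sorted(Path("/sys/kernel/slab").iterdir()):
    if cache.name.startswith(":"):                     # internal merged caches
        continue
    print(f"{cache.name:<28} {read_knob(cache, 'min_partial'):>12}"
          f" {read_knob(cache, 'cpu_partial'):>12}")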

When arenas leak — the silent failure

An arena leaks not by losing track of memory (it cannot — every byte is in the bump region) but by never resetting. A long-running nginx worker that holds a pool for the lifetime of a keep-alive connection accumulates per-request allocations until that connection closes. A keep-alive that lasts for hours can pin tens of MB of dead bytes. The fix is to use a per-request sub-pool, not the connection pool, for everything inside the request handler — and to reset the sub-pool at request end while keeping the connection pool intact. The discipline is structural; once it slips, the arena's memory grows unboundedly without any single allocation looking suspect in a heap profile. Postgres avoids this with its three-tier context hierarchy (backend / transaction / query) and by aggressively resetting the query context. Application code that adopts arenas without adopting Postgres's discipline tends to learn this lesson the hard way around the first 6-week-uptime incident.
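
The discipline fits in a short sketch, reusing the bump-pointer Arena shape from arena_demo.py; the request_scope helper below is illustrative, not nginx's or Postgres's API.

# subpool_sketch.py — the connection-pool / request-sub-pool discipline in miniature.
from contextlib import contextmanager

class Arena:
    """Same bump-pointer arena shape as arena_demo.py."""
    def __init__(self, size_bytes: int):
        self._buf = bytearray(size_bytes)
        self._pos = 0
    def alloc(self, n: int) -> memoryview:
        n = (n + 7) & ~7
        if self._pos + n > len(self._buf):
            raise MemoryError("arena exhausted")
        mv = memoryview(self._buf)[self._pos:self._pos + n]
        self._pos += n
        return mv
    def reset(self) -> None:
        self._pos = 0

@contextmanager
def request_scope(size_bytes: int = 16 * 1024):
    """Everything allocated from the yielded arena dies when the request ends."""
    sub = Arena(size_bytes)
    try:
        yield sub
    finally:
        sub.reset()                      # bulk free at request end; the
                                         # connection arena is untouched

conn_arena = Arena(64 * 1024)            # lives as long as the keep-alive connection
session = conn_arena.alloc(256)          # connection-scoped state belongs here

for _ in range(3):                       # three requests on one keep-alive connection
    with request_scope() as req:
        headers = req.alloc(512)         # request-scoped scratch dies at request end
        body_chunk = req.alloc(2048)
# conn_arena still holds only the 256-byte session allocation: no per-request growth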

Mixing the three patterns — the real production stack

Real services do not pick one pattern; they layer them. A typical Razorpay payments worker stack looks like: the JVM at the bottom (its own generational allocator), jemalloc serving the JVM's native code paths (Netty direct buffers, Snappy compression scratch), a per-thread ThreadLocal<ByteBuf> pool inside Netty for receive buffers (thread-local pool), and a per-request Recycler instance for protobuf decoders (slab-flavoured pool). Each layer answers a different invariant: the JVM exploits generational lifetime statistics, jemalloc handles the long-tail of unpredictable native allocations, the Netty pool exploits the fact that one I/O thread owns its buffers, and the Recycler exploits the fact that every request decoder is the same shape. None of these layers is redundant; removing any one of them would push the system back onto the layer below at 10–100× the per-allocation cost. The art of running a high-throughput Indian payments service is knowing which layer each allocation hits and why.

Reproduce this on your laptop

sudo apt install python3.11 python3.11-venv
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Arena vs general malloc
python3 arena_demo.py

# Per-thread object pool vs fresh allocation per request
python3 thread_local_pool_demo.py

# Inspect kernel slab caches (real numbers from your machine)
sudo cat /proc/slabinfo | head -20
sudo slabtop -o | head -20

# Watch a specific cache while you stress the system
sudo watch -n1 'cat /sys/kernel/slab/kmalloc-512/cpu_partial /sys/kernel/slab/kmalloc-512/min_partial'

You should see arena allocation 3–5× faster than general allocation for the small-object workload, and the pool path 2–3× faster than fresh allocation per request. slabtop will reveal which kernel caches your workload stresses — for a Python web service, it is usually kmalloc-cg-512 (cgroup-accounted 512-byte) and dentry; for an io_uring-heavy workload it is kiocb and io_kiocb. Knowing which cache is hot tells you what kernel work your application is implicitly causing.

Where this leads next

Arenas, slabs, and thread-local pools are three points on the same spectrum: each removes one degree of generality from the allocator in exchange for fast-path speed and predictability. The next chapters in this part go deeper on the diagnostic side — how to tell whether your allocator is healthy or hoarding, how fragmentation grows under real workloads, and what changes when memory is plural across NUMA sockets.

The reading order follows, roughly, the path a Razorpay or Zerodha engineer would walk through a production allocator problem:

References