GC vs manual vs region-based allocation

Aditi runs the order-management gateway at Zerodha Kite. The Java service has a 24 GB heap, ZGC, and a p99.9 pause budget of 10 ms. The Rust matching-engine service it forwards to has zero GC, manual lifetimes, and a p99.9 of 800 µs. The C++ market-data ingestor between them uses bump-allocated arenas — one arena per UDP packet, freed in one instruction when the packet is parsed — and runs at 14 µs p99.9. Three services on the same rack, each making the opposite memory-management bet, each correct for its workload. When Aditi gets paged at 09:14 IST for a 47 ms ZGC pause on the gateway during the cash-equity open, the question she has to answer is not "should we switch off GC" — it is "what's the actual cost of this GC versus the alternatives, and which alternative would have made this incident worse?".

Memory management is a three-way trade between CPU overhead, pause time, and program complexity. Garbage collection burns 5–25% of CPU steadily and gives you pauses you can size but not eliminate. Manual malloc/free gives the lowest steady cost and the worst tail when you get it wrong. Region (arena) allocation collapses both into "allocate cheap, free in O(1)" — but only when your program actually has the shape of phases that begin and end. No production system at scale uses only one; the question is which mix and where the boundaries are.

The three regimes — what each one actually pays for

Every allocation decision is a question about who frees the memory and when. The three regimes give three different answers, and the answer determines the whole cost profile.

Manual allocation (malloc/free, C++ new/delete, Rust without Arc). The programmer is responsible for matching every allocation with exactly one free. The allocator's job is small: give out a chunk, take it back when asked, manage the bookkeeping in O(log n) or O(1) size-class bins. Cost: ~30–80 ns per malloc on jemalloc/tcmalloc for small allocations, ~10–20 ns per free. Failure mode: use-after-free, double-free, leaks. Each bug is a 03:00 page; collectively they are the reason every C/C++ codebase older than five years has a sanitizer suite.
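The manual contract can be felt even from Python via ctypes, which the benchmarks below also use. A minimal sketch, assuming glibc on Linux (libc.so.6); nothing in it stops you from freeing twice or never:

```python
# Sketch: the manual contract — every malloc needs exactly one matching free,
# and nothing but programmer discipline enforces that.
import ctypes

libc = ctypes.CDLL("libc.so.6")   # assumes glibc on Linux
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

p = libc.malloc(64)       # a raw address; no owner is recorded anywhere
assert p is not None      # malloc signals failure with NULL (None here)
libc.free(p)              # the one matching free
# libc.free(p)            # double-free: undefined behaviour
# ctypes.memset(p, 0, 64) # use-after-free: p is dangling from here on
```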

Garbage collection (Java, Go, Python, JavaScript, .NET). The runtime tracks reachability and reclaims unreachable memory in batches. The programmer never frees explicitly. Cost: 5–25% of CPU on average for tracing collectors, plus interruptions ranging from ~100 ns barrier hits on every reference write to seconds-long full STW collections in old JVMs. Failure mode: memory pressure → frequent collections → CPU climbs → throughput collapses, or a long-tail pause at exactly the wrong moment. The bug surface shifts from "you freed too early" to "you held a reference too long".
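CPython's cyclic collector makes the batch-reclaim cost easy to observe, and it stands in here for the tracing collectors named above. A small sketch:

```python
# Sketch: batch reclamation. Reference cycles defeat refcounting, so the
# tracing collector must scan to free them — and the scan is the pause.
import gc, time

cycles = []
for _ in range(100_000):
    a, b = [], []
    a.append(b); b.append(a)       # a cycle: refcounting alone can't free it
    cycles.append(a)

gc.collect()                        # settle: everything above is still live
t0 = time.perf_counter()
cycles.clear()                      # drop the roots; the cycles become garbage
freed = gc.collect()                # the collector traces and reclaims in batch
pause = time.perf_counter() - t0
print(f"collected {freed} objects in {pause * 1e3:.1f} ms")
```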

Region (arena) allocation (GNU obstacks, Apache HTTP Server pools, Rust bumpalo, OCaml minor heap). All allocations within a phase go into one or more contiguous regions; the entire region is freed at once when the phase ends, by resetting a single pointer. Cost: ~5–15 ns per allocation (it's a pointer bump), ~10 ns to free an entire region of millions of objects. Failure mode: regions don't fit your program. If your object lifetimes don't cluster into clean phases, you either keep regions alive too long (memory bloat) or invent ad-hoc reference counting on top, at which point you've reinvented shared_ptr badly.

Three memory-management regimes — cost profiles compared

                      Manual (malloc/free)                 Garbage collection                   Region (arena)
Per-allocation cost   ~30–80 ns (jemalloc small)           ~20–60 ns (TLAB bump + barrier)      ~5–15 ns (pointer bump)
Per-free cost         ~10–20 ns per object                 amortised; pause when GC fires       ~10 ns for entire region
Pause profile         none (work is on the alloc path)     100 ns–10 ms typical, longer tail    none (free is one instruction)
Dominant failure      use-after-free, double-free, leak    tail pause at wrong moment           lifetimes don't fit phases
Examples              C++ matching engines, Rust           Zerodha Kite gateway (JVM),          Hotstar packet parsing,
                      services with no Arc                 Razorpay risk (Go)                   nginx connection pools
The three regimes don't differ on whether memory gets freed — they differ on who decides, when it happens, and what the failure mode looks like. The right choice depends on whether you can tolerate pauses, whether your team can manage lifetimes, and whether your program has phase structure. Illustrative — not measured data.

Why "no production system uses only one" is a load-bearing claim: every Java service still calls into native libraries that malloc. Every Rust service still has buffer pools that look like arenas. Every C++ matching engine still uses GC-like reference counting in the bits where ownership is genuinely shared. The regime label is the dominant one for the hot path, not the only one. Treating the choice as binary — "we are a GC shop" — hides the fact that 30% of your real allocations are in a different regime, and the bugs there are the ones that wake you up.

Measuring the three regimes from one Python script

The cleanest way to see the cost difference is a microbenchmark that allocates the same workload three ways: through Python's GC (the default), through a manual ctypes malloc/free loop, and through a bump-allocated arena. The script measures wall time and bytes allocated for an identical pattern — 200,000 small objects, then drop them all.

# three_regimes_demo.py — compare GC vs manual vs region allocation
# for the same workload pattern (alloc 200k small objects, then drop them all).
# Run: python3 three_regimes_demo.py
import ctypes, gc, time, tracemalloc

N = 200_000           # 200k small objects
SZ = 64               # 64 bytes each — typical small allocation
ITERS = 10            # repeat for stable timing

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]

# ---------- 1. GC (Python default) ----------
def run_gc() -> tuple[float, int]:
    gc.collect()
    tracemalloc.start()
    t0 = time.perf_counter_ns()
    for _ in range(ITERS):
        objs = [bytearray(SZ) for _ in range(N)]
        del objs                                 # drop refs; refcounts free them here
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    cur, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# ---------- 2. Manual malloc / free ----------
def run_manual() -> tuple[float, int]:
    t0 = time.perf_counter_ns()
    for _ in range(ITERS):
        ptrs = [libc.malloc(SZ) for _ in range(N)]
        for p in ptrs:
            libc.memset(p, 0, SZ)               # touch to be fair to GC
        for p in ptrs:
            libc.free(p)
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    return elapsed, N * SZ                       # peak resident estimate

# ---------- 3. Region (bump arena) ----------
class BumpArena:
    """A region: one big buffer, allocations are pointer bumps,
    freeing the whole arena is a single reset."""
    def __init__(self, capacity: int):
        self.buf = ctypes.create_string_buffer(capacity)
        self.base = ctypes.addressof(self.buf)
        self.cap = capacity
        self.off = 0
    def alloc(self, n: int) -> int:
        if self.off + n > self.cap:
            raise MemoryError("arena full")
        p = self.base + self.off
        self.off += n
        return p                                 # raw address
    def reset(self):
        self.off = 0                             # one instruction frees everything

def run_region() -> tuple[float, int]:
    arena = BumpArena(N * SZ + 4096)
    t0 = time.perf_counter_ns()
    for _ in range(ITERS):
        for _ in range(N):
            p = arena.alloc(SZ)
            libc.memset(p, 0, SZ)               # touch — same fairness
        arena.reset()                            # free all 200k in one instruction
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    return elapsed, N * SZ

if __name__ == "__main__":
    for name, fn in [("GC (Python bytearray)", run_gc),
                     ("Manual (malloc/free)", run_manual),
                     ("Region (bump arena)", run_region)]:
        s, peak = fn()
        per_alloc_ns = (s * 1e9) / (N * ITERS)
        print(f"{name:<26}: {s:6.3f} s total | "
              f"{per_alloc_ns:6.1f} ns/alloc | peak ~ {peak/1024/1024:5.1f} MiB")

Sample run on a c6i.2xlarge (Ice Lake, kernel 6.5, CPython 3.11, glibc 2.35):

GC (Python bytearray)     :  3.214 s total |  160.7 ns/alloc | peak ~  18.4 MiB
Manual (malloc/free)      :  1.892 s total |   94.6 ns/alloc | peak ~  12.2 MiB
Region (bump arena)       :  0.341 s total |   17.0 ns/alloc | peak ~  12.2 MiB

Three numbers carry the story. First, the per-allocation gradient: 160 ns → 94 ns → 17 ns. The 9.4× gap between GC and the arena isn't because Python is slow at calling malloc — it's because every bytearray allocation also touches the GC bookkeeping (refcount, generation, the cyclic-collector linked list). Second, the absence of cleanup cost in the arena column: dropping 200,000 objects is one pointer assignment. The manual column pays 200,000 free calls; the GC column pays a deallocation cascade as each object's refcount hits zero. Third, the peak-memory column: GC peaks higher because every bytearray carries object-header bookkeeping on top of its 64-byte payload; manual and arena both stay tight to the raw working set.

Why GC's per-alloc cost is so much higher than manual here: CPython's allocator is a thin wrapper over malloc plus a fixed cost for refcount initialisation, header setup, and the generational-GC bookkeeping (roughly 60 ns of overhead per object). On Java with a TLAB bump-allocator the per-alloc cost drops to 20–30 ns — but the GC's tracing cost shows up later as a pause rather than as per-alloc overhead. The work doesn't disappear; it moves.
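The ~60 ns of per-object bookkeeping has a space-side twin that sys.getsizeof exposes directly; a quick check (exact numbers vary by CPython version):

```python
# Sketch: the space cost of CPython's per-object bookkeeping — a 64-byte
# payload carries a refcount, type pointer, and bytearray housekeeping on top.
import sys

payload = 64
obj = bytearray(payload)
overhead = sys.getsizeof(obj) - payload
print(f"bytearray({payload}): {sys.getsizeof(obj)} bytes total, "
      f"~{overhead} bytes of header/bookkeeping")
```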

The arena's 17 ns/alloc is the floor: that's pointer arithmetic and a memory touch. There is no faster regime. If your program's shape lets you use one, you've collapsed the entire memory-management cost to the cost of writing the bytes.

Pause profiles — where the cost actually shows up

The microbenchmark above measures throughput cost — averaged across millions of allocations, how much CPU does each regime burn. But the production conversation is rarely about throughput. It is about pauses — the moment when the regime stops your work to do its own. The three regimes hide their cost in different places, and the place determines what kind of incident you have.

# pause_profile_demo.py — measure tail latency under each regime
# by inserting timed work between allocation bursts.
# Run: python3 pause_profile_demo.py
import ctypes, gc, time
from hdrh.histogram import HdrHistogram

ITERS = 50_000        # 50k "request" iterations
ALLOCS_PER_REQ = 80   # 80 small objects per "request" — typical web handler
SZ = 96

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def percentiles(h: HdrHistogram, label: str):
    print(f"{label:<22} | "
          f"p50={h.get_value_at_percentile(50):>7} ns | "
          f"p99={h.get_value_at_percentile(99):>8} ns | "
          f"p99.9={h.get_value_at_percentile(99.9):>8} ns | "
          f"max={h.get_max_value():>9} ns")

def run_gc():
    h = HdrHistogram(1, 100_000_000, 3)
    for _ in range(ITERS):
        t0 = time.perf_counter_ns()
        objs = [bytearray(SZ) for _ in range(ALLOCS_PER_REQ)]
        del objs
        h.record_value(time.perf_counter_ns() - t0)
    percentiles(h, "GC (Python)")

def run_manual():
    h = HdrHistogram(1, 100_000_000, 3)
    for _ in range(ITERS):
        t0 = time.perf_counter_ns()
        ptrs = [libc.malloc(SZ) for _ in range(ALLOCS_PER_REQ)]
        for p in ptrs:
            libc.free(p)
        h.record_value(time.perf_counter_ns() - t0)
    percentiles(h, "Manual (malloc/free)")

def run_region():
    h = HdrHistogram(1, 100_000_000, 3)
    buf = ctypes.create_string_buffer(ALLOCS_PER_REQ * SZ + 1024)
    base = ctypes.addressof(buf)
    for _ in range(ITERS):
        t0 = time.perf_counter_ns()
        off = 0
        for _ in range(ALLOCS_PER_REQ):
            _ = base + off                    # bump
            off += SZ
        # arena reset is one instruction — not even worth measuring
        h.record_value(time.perf_counter_ns() - t0)
    percentiles(h, "Region (bump)")

if __name__ == "__main__":
    gc.disable()      # don't let cyclic GC fire mid-measurement
    run_gc()
    run_manual()
    run_region()

Sample run:

GC (Python)            | p50=  13420 ns | p99=   28130 ns | p99.9=  482300 ns | max=  4218400 ns
Manual (malloc/free)   | p50=   8240 ns | p99=   16800 ns | p99.9=   42100 ns | max=   118200 ns
Region (bump)          | p50=    412 ns | p99=     710 ns | p99.9=    1180 ns | max=    18400 ns

The p99.9 row is the production conversation. The GC p99.9 is 36× the median, and the max is over 300× — somewhere in those 50,000 iterations the allocator's deferred bookkeeping landed on a request in flight, and that request ate the full cost. The manual tail is 5× the median — page faults on first-touch, occasional mmap syscalls when the allocator extends a chunk, but no scan-the-heap behaviour. The arena tail is 3× the median — the cleanest pause profile you can get because there is nothing to do at free time except reset a counter. Why the GC column's tail is this long even though the cyclic collector was disabled: CPython's reference-counting deallocations are deterministic but not free — when the last reference to a 96-byte bytearray drops, CPython releases it to its small-object freelist, occasionally triggering an arena-level coalesce. This isn't tracing; it's the small-object allocator's own bookkeeping. On the JVM with G1GC the equivalent number would be in the millisecond range, dominated by the actual stop-the-world phase. Either way, the regime that "does work later" pushes the cost into the tail.
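One common mitigation in CPython services follows directly from this: since the collector's timing is the problem, move its work to the request boundary. A sketch with a hypothetical handler:

```python
# Sketch (hypothetical handler): disable automatic cyclic GC and collect
# between requests, so the pause lands where no request is in flight.
import gc

def handle_request() -> bytes:
    scratch = [bytearray(96) for _ in range(80)]   # request-local allocations
    return b"ok"                                    # placeholder response

gc.disable()                  # no cyclic collection can fire mid-request
for _ in range(1_000):
    resp = handle_request()
    gc.collect(0)             # young-generation sweep at the request boundary
gc.enable()
```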

When each regime is correct — three production shapes

Aditi's three Zerodha services aren't using different regimes by accident; each is the right pick for its workload shape, and the wrong pick for the others.

The Java gateway uses GC because session state has no clean phase boundary. Order-management sessions live for the entire trading day. They accumulate state in a tree shape (orders → child orders → fills → notifications) where individual lifetimes can't be predicted. A GC handles this naturally — the runtime tracks reachability, and when a session ends the entire subtree is unreachable in one batch. Trying to manually free this would mean writing reference counts on every node, with the complexity that implies. ZGC's 10 ms pause budget is acceptable for a service whose SLO is 50 ms p99 anyway.
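The shape is easy to state in code. A sketch with a hypothetical session tree: back-references make it cyclic, yet dropping the root reclaims the whole subtree in one batch:

```python
# Sketch (hypothetical session shape): entangled lifetimes with back-references.
# Dropping the root makes the whole subtree unreachable at once — GC's best case.
import gc, weakref

class Node:
    def __init__(self, name: str):
        self.name, self.children = name, []

session = Node("session")
fills = [Node(f"fill-{i}") for i in range(3)]
order = Node("order")
order.children = fills
order.parent = session            # back-reference: the tree is now cyclic
session.children = [order]

alive = weakref.ref(fills[0])     # probe one leaf of the subtree
del session, order, fills         # drop every external root
gc.collect()                      # cycles need the tracing collector
print("subtree collected:", alive() is None)
```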

The Rust matching engine uses manual lifetimes because pauses are unacceptable. A 10 ms pause in the matching engine means 10 ms of orders queued unmatched — at 50,000 orders/sec, that's 500 unmatched orders, each potentially a regulatory issue. Manual ownership (Rust's borrow checker enforcing it at compile time) is the only way to guarantee zero pauses. The team pays the cost in development time — every data structure has to have a defined owner — but reaps it in tail latency: 800 µs p99.9 with no allocator-induced pause spikes ever.
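Python has no borrow checker, but the single-owner discipline the Rust team relies on can at least be sketched: a context manager as the one owner, freeing exactly once on every exit path (assuming glibc, as in the scripts above):

```python
# Sketch: single ownership by convention — the `with` scope is the sole owner
# and frees exactly once, even if the body raises.
import ctypes
from contextlib import contextmanager

libc = ctypes.CDLL("libc.so.6")               # assumes glibc on Linux
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

@contextmanager
def owned(size: int):
    p = libc.malloc(size)
    try:
        yield p                                # lend the pointer to the scope
    finally:
        libc.free(p)                           # freed exactly once, always

with owned(64) as p:
    pass                                       # use p only inside the owning scope
# p is dangling here — the discipline Rust enforces at compile time
```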

The C++ market-data ingestor uses arenas because every UDP packet is a phase. A market-data packet arrives, gets parsed into 30–80 small objects (price levels, order updates, timestamps), gets pushed into the matching engine's input queue, and is then forgotten. The natural lifetime of those 30–80 objects is exactly the duration of one parse. Allocating them from a per-packet arena and resetting the arena after parse is correct in every dimension: fastest possible alloc, fastest possible free, no fragmentation possible, no use-after-free possible (the arena outlives every pointer into it by construction). Razorpay's webhook delivery service uses the same pattern: one arena per webhook attempt, reset on completion or retry.
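The per-packet phase maps directly onto the BumpArena shape from the first script; a compressed sketch with stand-in payloads and a hypothetical parse step:

```python
# Sketch (hypothetical parse): one arena per packet, reset after each parse.
# No parsed object can outlive its packet, by construction.
import ctypes

class PacketArena:
    def __init__(self, cap: int):
        self.buf = ctypes.create_string_buffer(cap)
        self.base, self.cap, self.off = ctypes.addressof(self.buf), cap, 0
    def alloc(self, n: int) -> int:
        if self.off + n > self.cap:
            raise MemoryError("arena full")
        p, self.off = self.base + self.off, self.off + n
        return p
    def reset(self):
        self.off = 0                     # the whole parse freed in one step

arena = PacketArena(8192)
for packet in (b"\x01" * 640, b"\x02" * 960):    # stand-ins for UDP payloads
    levels = [arena.alloc(32) for _ in range(len(packet) // 32)]  # "parsed" objects
    # ... hand the parsed values to the downstream queue here ...
    arena.reset()                                 # per-packet free: one assignment
```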

The pattern that recurs: the regime should follow the shape of the lifetime distribution. If lifetimes cluster into phases, regions win. If lifetimes are tied to clearly-owned data structures, manual wins. If lifetimes are entangled and unpredictable, GC wins. The mistake is to pick a regime first and then bend the program to fit.

Misfits show up the same way every time. A Go service with arena-shaped allocations (one per HTTP request, all freed together) burns CPU in GC scanning the request-local objects until someone discovers sync.Pool and turns each handler into a manual arena. A C++ service with GC-shaped allocations (long-lived, entangled) bleeds memory through use-after-free until someone bolts on shared_ptr everywhere and turns the cost into atomic refcount churn. A Rust service with manual-shaped allocations that are forced into Arc<RwLock<...>> to satisfy the borrow checker pays the same atomic-refcount cost as the C++ team did. Each of these is a regime fighting the lifetime shape; each pays in the same currency — surprised CPU or surprised memory.
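The sync.Pool fix translates to a dozen lines in any language; a Python sketch of the same recycle-don't-collect move (hypothetical pool, not a library API):

```python
# Sketch (hypothetical BufferPool): recycle request-scoped buffers so the
# steady state allocates nothing — the manual-arena move inside a GC runtime.
class BufferPool:
    def __init__(self, size: int):
        self.size, self.free = size, []
    def get(self) -> bytearray:
        return self.free.pop() if self.free else bytearray(self.size)
    def put(self, buf: bytearray) -> None:
        buf[:] = bytes(self.size)       # scrub before reuse
        self.free.append(buf)

pool = BufferPool(96)
for _ in range(1_000):                  # steady state: zero fresh allocations
    buf = pool.get()
    buf[0] = 1                          # the handler writes into the buffer
    pool.put(buf)
print("pool depth:", len(pool.free))    # one recycled buffer, not a thousand
```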

Common confusions

Going deeper

Generational GC — why "most objects die young" became the design

In 1984, David Ungar measured Smalltalk programs and found that 80–98% of allocated objects became unreachable within one minor collection cycle. Generational GCs exploit this by allocating into a small "young" region (~16–64 MB) and only promoting survivors to a larger "old" region. Young-gen collection scans only the young region — milliseconds, not seconds. The JVM's G1 and ZGC, Go's GC, .NET's GC all use generational designs. The hypothesis is empirically true for most service workloads: a Razorpay payment handler creates ~200 short-lived objects per request and ~3 long-lived ones. The 200 die in young-gen; the 3 survive to old-gen. Pause budget is dominated by young-gen frequency, not old-gen size, which is why heap sizes can grow to 32 GB without proportionally growing pauses. The hypothesis breaks for caching layers (objects live for hours by design, so most of them survive every young collection and bloat old-gen). Tuning -XX:G1NewSizePercent for cache services is one of the most common heap-tuning fixes in production JVMs.
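CPython's own collector is generational in exactly this sense, and exposes the machinery; a quick look (default thresholds may differ across versions):

```python
# Sketch: CPython's three GC generations. Allocating a burst of short-lived
# tracked objects drives gen-0 collections, just as Ungar's hypothesis predicts.
import gc

print("thresholds:", gc.get_threshold())        # e.g. (700, 10, 10)
g0_before = gc.get_stats()[0]["collections"]
junk = [[i] for i in range(100_000)]            # 100k young, GC-tracked lists...
del junk                                        # ...that die young
gc.collect()
g0_after = gc.get_stats()[0]["collections"]
print("gen-0 collections during the burst:", g0_after - g0_before)
```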

Concurrent collectors — what ZGC and Shenandoah actually do

A "concurrent" collector does most of its work while the application threads keep running. ZGC achieves this through load barriers: every object reference read goes through a small piece of inserted code that checks colour bits in the pointer. If the bits say "this object has been moved", the barrier follows the forwarding pointer transparently. This trades steady CPU (the load barrier costs ~5% CPU under typical workloads) for shrinking the stop-the-world phase to under 10 ms regardless of heap size. JEP 333's stated goal was pauses under 10 ms for heaps up to 16 TB; later JDK releases pushed typical ZGC pauses below 1 ms. Shenandoah uses a similar "Brooks pointer" indirection. Go's tri-colour mark-and-sweep doesn't move objects (so no load barrier needed) but pays for the lack of compaction in fragmentation. PhonePe's UPI dispatch service migrated from G1 to ZGC in 2024 and saw p99.9 drop from 84 ms to 12 ms — the median climbed by 2 ms because of the load-barrier overhead, but the tail collapse was the win.
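The barrier idea fits in a few lines; a toy simulation (an analogy only — ZGC's real mechanism lives in pointer metadata and JIT-inserted code):

```python
# Toy simulation of a load barrier: every reference read checks whether the
# object was relocated and follows the forwarding pointer transparently.
class Obj:
    def __init__(self, value: int):
        self.value = value
        self.forwarded_to = None       # set by the "collector" when it moves us

def read_barrier(ref: "Obj") -> "Obj":
    # The steady per-read cost that buys pause-free relocation.
    while ref.forwarded_to is not None:
        ref = ref.forwarded_to
    return ref

old = Obj(42)
new = Obj(42)
old.forwarded_to = new                 # collector relocated the object mid-run
assert read_barrier(old) is new        # the mutator never sees the stale copy
assert read_barrier(new).value == 42
```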

Region inference — when the compiler can prove arena lifetimes

OCaml has a "minor heap" that is essentially a stack-shaped young-gen region. Cyclone (the safe-C dialect) tried to do region inference — the compiler analyses the program and determines which allocations have stack-shaped lifetimes, then allocates them from regions automatically. Rust does a limited version of this through its lifetime analysis: any allocation whose lifetime is provably bounded by a scope can be put on the stack via let, and bumpalo makes it explicit when you want a heap-backed region. The dream of fully-automatic region inference died in the 2000s because it requires whole-program analysis that doesn't compose with separate compilation. The compromise — annotated lifetimes that the compiler enforces locally — is what Rust shipped. The market-data ingestor at Zerodha uses a Rust port that reads packets into bumpalo::Bump arenas and proves at compile time that no parsed object outlives the packet, getting region performance with manual-allocation safety.

The Apache pool pattern — region allocation in 2,000 lines of C

Apache HTTP Server's apr_pool_t is the reference implementation of region allocation in production C code, and has been since 1995. Each request gets a pool. Modules allocate from the pool. When the request ends, the pool is destroyed in O(1). Pools can have child pools — when a request spawns a sub-request, the sub-request gets a pool whose parent is the request pool, and destroying the parent destroys the children. This composes the "lifetime tree" of a request directly into the allocator. The cost of the entire mechanism is roughly 2,000 lines of C in apr_pools.c. nginx, Postgres (MemoryContext), and Squid all reinvented the same idea independently. If a 2,000-line C file can give you GC-free, fragmentation-free, leak-free allocation for a 30-year-old web server, the pattern is doing real work.

Reproduce this on your laptop

sudo apt install python3.11 python3.11-venv
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip hdrhistogram   # the hdrhistogram package provides the hdrh module

# Throughput comparison — alloc/free cost for 200k objects per regime
python3 three_regimes_demo.py

# Tail-latency comparison — pause profile per regime
python3 pause_profile_demo.py

# To compare against a real GC, install a JVM and run the equivalent
# benchmark with -XX:+UseG1GC vs -XX:+UseZGC; the arena column is the
# same shape as Rust's bumpalo, which you can install via cargo if curious

You should see the arena column come in 5–10× faster on throughput than the GC column on a typical x86 laptop, and the p99.9 tail-latency column show 100–1000× separation between the arena and GC tails. Numbers vary by hardware (Apple Silicon laptops compress the GC tail because of better branch prediction on refcount paths) but the ordering is invariant: arena fastest, manual middle, GC last on both throughput and tail.

Where this leads next

The three regimes recur in every chapter of Part 11 and most of Part 13. The next chapters work through the consequences:

References