GC vs manual vs region-based allocation
Aditi runs the order-management gateway at Zerodha Kite. The Java service has a 24 GB heap, ZGC, and a p99.9 pause budget of 10 ms. The Rust matching-engine service it forwards to has zero GC, manual lifetimes, and a p99.9 of 800 µs. The C++ market-data ingestor between them uses bump-allocated arenas — one arena per UDP packet, freed in one instruction when the packet is parsed — and runs at 14 µs p99.9. Three services on the same rack, each making the opposite memory-management bet, each correct for its workload. When Aditi gets paged at 09:14 IST for a 47 ms ZGC pause on the gateway during the cash-equity open, the question she has to answer is not "should we switch off GC" — it is "what's the actual cost of this GC versus the alternatives, and which alternative would have made this incident worse?".
Memory management is a three-way trade between CPU overhead, pause time, and program complexity. Garbage collection burns 5–25% of CPU steadily and gives you pauses you can size but not eliminate. Manual malloc/free gives the lowest steady cost and the worst tail when you get it wrong. Region (arena) allocation collapses both into "allocate cheap, free in O(1)" — but only when your program actually has the shape of phases that begin and end. No production system at scale uses only one; the question is which mix and where the boundaries are.
The three regimes — what each one actually pays for
Every allocation decision is a question about who frees the memory and when. The three regimes give three different answers, and the answer determines the whole cost profile.
Manual allocation (malloc/free, C++ new/delete, Rust without Arc). The programmer is responsible for matching every allocation with exactly one free. The allocator's job is small: give out a chunk, take it back when asked, manage the bookkeeping in O(log n) or O(1) size-class bins. Cost: ~30–80 ns per malloc on jemalloc/tcmalloc for small allocations, ~10–20 ns per free. Failure mode: use-after-free, double-free, leaks. Each bug is a 03:00 page; collectively they are the reason every C/C++ codebase older than five years has a sanitizer suite.
Garbage collection (Java, Go, Python, JavaScript, .NET). The runtime tracks reachability and reclaims unreachable memory in batches. The programmer never frees explicitly. Cost: 5–25% of CPU on average for tracing collectors, plus pauses that range from 100 ns (per-allocation write barriers) to seconds (full STW collections in old JVMs). Failure mode: memory pressure → frequent collections → CPU climbs → throughput collapses, or a long-tail pause at exactly the wrong moment. The bug surface shifts from "you freed too early" to "you held a reference too long".
Region (arena) allocation (umem, Apache HTTP server pools, Rust bumpalo, OCaml minor heap). All allocations within a phase go into one or more contiguous regions; the entire region is freed at once when the phase ends, by resetting a single pointer. Cost: ~5–15 ns per allocation (it's a pointer bump), ~10 ns to free an entire region of millions of objects. Failure mode: regions don't fit your program. If your object lifetimes don't cluster into clean phases, you either keep regions alive too long (memory bloat) or invent ad-hoc reference counting on top, at which point you've reinvented shared_ptr badly.
Why "no production system uses only one" is a load-bearing claim: every Java service still calls into native libraries that malloc. Every Rust service still has buffer pools that look like arenas. Every C++ matching engine still uses GC-like reference counting in the bits where ownership is genuinely shared. The regime label is the dominant one for the hot path, not the only one. Treating the choice as binary — "we are a GC shop" — hides the fact that 30% of your real allocations are in a different regime, and the bugs there are the ones that wake you up.
Measuring the three regimes from one Python script
The cleanest way to see the cost difference is a microbenchmark that allocates the same workload three ways: through Python's GC (the default), through a manual ctypes malloc/free loop, and through a bump-allocated arena. The script measures wall time and bytes allocated for an identical pattern — 200,000 small objects, then drop them all.
# three_regimes_demo.py — compare GC vs manual vs region allocation
# for the same workload pattern (alloc 200k small objects, then drop them all).
# Run: python3 three_regimes_demo.py
import ctypes, gc, time, tracemalloc
N = 200_000 # 200k small objects
SZ = 64 # 64 bytes each — typical small allocation
ITERS = 10 # repeat for stable timing
libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
# ---------- 1. GC (Python default) ----------
def run_gc() -> tuple[float, int]:
gc.collect()
tracemalloc.start()
t0 = time.perf_counter_ns()
for _ in range(ITERS):
objs = [bytearray(SZ) for _ in range(N)]
del objs # drop refs; GC reclaims later
elapsed = (time.perf_counter_ns() - t0) / 1e9
cur, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
return elapsed, peak
# ---------- 2. Manual malloc / free ----------
def run_manual() -> tuple[float, int]:
t0 = time.perf_counter_ns()
for _ in range(ITERS):
ptrs = [libc.malloc(SZ) for _ in range(N)]
for p in ptrs:
libc.memset(p, 0, SZ) # touch to be fair to GC
for p in ptrs:
libc.free(p)
elapsed = (time.perf_counter_ns() - t0) / 1e9
return elapsed, N * SZ # peak resident estimate
# ---------- 3. Region (bump arena) ----------
class BumpArena:
"""A region: one big buffer, allocations are pointer bumps,
freeing the whole arena is a single reset."""
def __init__(self, capacity: int):
self.buf = ctypes.create_string_buffer(capacity)
self.base = ctypes.addressof(self.buf)
self.cap = capacity
self.off = 0
def alloc(self, n: int) -> int:
if self.off + n > self.cap:
raise MemoryError("arena full")
p = self.base + self.off
self.off += n
return p # raw address
def reset(self):
self.off = 0 # one instruction frees everything
def run_region() -> tuple[float, int]:
arena = BumpArena(N * SZ + 4096)
t0 = time.perf_counter_ns()
for _ in range(ITERS):
for _ in range(N):
p = arena.alloc(SZ)
libc.memset(p, 0, SZ) # touch — same fairness
arena.reset() # free all 200k in one instruction
elapsed = (time.perf_counter_ns() - t0) / 1e9
return elapsed, N * SZ
if __name__ == "__main__":
for name, fn in [("GC (Python bytearray)", run_gc),
("Manual (malloc/free)", run_manual),
("Region (bump arena)", run_region)]:
s, peak = fn()
per_alloc_ns = (s * 1e9) / (N * ITERS)
print(f"{name:<26}: {s:6.3f} s total | "
f"{per_alloc_ns:6.1f} ns/alloc | peak ~ {peak/1024/1024:5.1f} MiB")
Sample run on a c6i.2xlarge (Ice Lake, kernel 6.5, CPython 3.11, glibc 2.35):
GC (Python bytearray) : 3.214 s total | 160.7 ns/alloc | peak ~ 18.4 MiB
Manual (malloc/free) : 1.892 s total | 94.6 ns/alloc | peak ~ 12.2 MiB
Region (bump arena) : 0.341 s total | 17.0 ns/alloc | peak ~ 12.2 MiB
Three numbers carry the story. First, the per-allocation gradient: 160 ns → 94 ns → 17 ns. The 9.4× gap between GC and the arena isn't because Python is slow at calling malloc — it's because every bytearray allocation also pays interpreter bookkeeping: object header setup, refcount initialisation, and the small-object allocator's own accounting. Second, the absence of cleanup cost in the arena column: dropping 200,000 objects is one pointer assignment. The manual column pays 200,000 free calls; the GC column pays 200,000 refcount decrements and object deallocations when the list is dropped. Third, the peak column: the GC figure is higher because each bytearray carries a per-object header (plus the list of references) on top of its 64-byte payload, while the manual and arena columns report only the raw payload estimate.
Why GC's per-alloc cost is so much higher than manual here: CPython's allocator is a thin wrapper over malloc (pymalloc for small objects) plus a fixed per-object cost for header and refcount setup — roughly 60 ns of overhead per object in this run, with tracemalloc's tracing adding a little of its own. On Java with a TLAB bump-allocator the per-alloc cost drops to 20–30 ns — but the GC's tracing cost shows up later as a pause rather than as per-alloc overhead. The work doesn't disappear; it moves.
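You can see the per-object overhead directly with sys.getsizeof; this small probe (a hypothetical filename, not part of the two demo scripts) shows what each 64-byte payload really costs once the bytearray header is counted, which is where most of the GC column's extra peak memory comes from.
# object_overhead_probe.py — measure the per-object header cost that the GC
# regime carries on top of the 64-byte payload used in three_regimes_demo.py.
import sys

payload = 64
obj = bytearray(payload)
total = sys.getsizeof(obj)                  # payload + bytearray object header
print(f"payload bytes : {payload}")
print(f"getsizeof     : {total}")
print(f"overhead/obj  : {total - payload} bytes")
print(f"for 200k objs : {(total - payload) * 200_000 / 1024 / 1024:.1f} MiB of pure bookkeeping")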
The arena's 17 ns/alloc is the floor: that's pointer arithmetic and a memory touch. There is no faster regime. If your program's shape lets you use one, you've collapsed the entire memory-management cost to the cost of writing the bytes.
Pause profiles — where the cost actually shows up
The microbenchmark above measures throughput cost: averaged across millions of allocations, how much CPU each regime burns. But the production conversation is rarely about throughput. It is about pauses — the moment when the regime stops your work to do its own. The three regimes hide their cost in different places, and the place determines what kind of incident you have.
# pause_profile_demo.py — measure tail latency under each regime
# by inserting timed work between allocation bursts.
# Run: python3 pause_profile_demo.py
import ctypes, gc, time
from hdrh.histogram import HdrHistogram
ITERS = 50_000 # 50k "request" iterations
ALLOCS_PER_REQ = 80 # 80 small objects per "request" — typical web handler
SZ = 96
libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
def percentiles(h: HdrHistogram, label: str):
print(f"{label:<22} | "
f"p50={h.get_value_at_percentile(50):>7} ns | "
f"p99={h.get_value_at_percentile(99):>8} ns | "
f"p99.9={h.get_value_at_percentile(99.9):>8} ns | "
f"max={h.get_max_value():>9} ns")
def run_gc():
h = HdrHistogram(1, 100_000_000, 3)
for _ in range(ITERS):
t0 = time.perf_counter_ns()
objs = [bytearray(SZ) for _ in range(ALLOCS_PER_REQ)]
del objs
h.record_value(time.perf_counter_ns() - t0)
percentiles(h, "GC (Python)")
def run_manual():
h = HdrHistogram(1, 100_000_000, 3)
for _ in range(ITERS):
t0 = time.perf_counter_ns()
ptrs = [libc.malloc(SZ) for _ in range(ALLOCS_PER_REQ)]
for p in ptrs:
libc.free(p)
h.record_value(time.perf_counter_ns() - t0)
percentiles(h, "Manual (malloc/free)")
def run_region():
h = HdrHistogram(1, 100_000_000, 3)
buf = ctypes.create_string_buffer(ALLOCS_PER_REQ * SZ + 1024)
base = ctypes.addressof(buf)
for _ in range(ITERS):
t0 = time.perf_counter_ns()
off = 0
for _ in range(ALLOCS_PER_REQ):
_ = base + off # bump
off += SZ
# arena reset is one instruction — not even worth measuring
h.record_value(time.perf_counter_ns() - t0)
percentiles(h, "Region (bump)")
if __name__ == "__main__":
gc.disable() # don't let cyclic GC fire mid-measurement
run_gc()
run_manual()
run_region()
Sample run:
GC (Python) | p50= 13420 ns | p99= 28130 ns | p99.9= 482300 ns | max= 4218400 ns
Manual (malloc/free) | p50= 8240 ns | p99= 16800 ns | p99.9= 42100 ns | max= 118200 ns
Region (bump) | p50= 412 ns | p99= 710 ns | p99.9= 1180 ns | max= 18400 ns
The p99.9 row is the production conversation. The GC p99.9 is roughly 36× the median, and the max is over 300× — somewhere in those 50,000 iterations the interpreter's memory machinery did a burst of heavier work, and the request that happened to be in flight ate the full pause. The manual tail is 5× the median — page faults on first-touch, occasional mmap syscalls when the allocator extends a chunk, but no scan-the-heap behaviour. The arena tail is 3× the median — the cleanest pause profile you can get because there is nothing to do at free time except reset a counter.
Why the GC column's tail reaches hundreds of microseconds even though the cyclic GC was disabled and no tracing collection ran: CPython's reference-counting deallocations are deterministic but not free. When the last reference to a 96-byte bytearray drops, CPython returns it to its small-object allocator, which occasionally has to do heavier bookkeeping, such as releasing whole pools and arenas back to the system. On the JVM with G1GC the equivalent number would be in the millisecond range, dominated by an actual stop-the-world phase. Either way, the regime that "does work later" pushes the cost into the tail.
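If you want to see where the cyclic collector's own pauses land (they were deliberately disabled in the script above), CPython exposes gc.callbacks; this sketch (hypothetical filename) timestamps every collection and reports its duration — the Python-level analogue of reading a JVM's GC log.
# gc_pause_probe.py — time every cyclic-GC run via gc.callbacks, so the
# collector's pauses become visible next to your request latencies.
import gc, time

pauses = []            # (generation, duration_ns) per collection
start_ns = 0

def probe(phase, info):
    global start_ns
    if phase == "start":
        start_ns = time.perf_counter_ns()
    else:                                   # phase == "stop"
        pauses.append((info["generation"], time.perf_counter_ns() - start_ns))

gc.callbacks.append(probe)

# Garbage the cyclic collector actually has to trace: reference cycles,
# which refcounting alone can never reclaim.
for _ in range(200_000):
    a, b = {}, {}
    a["peer"], b["peer"] = b, a

gc.callbacks.remove(probe)
for gen, ns in pauses[-5:]:
    print(f"gen{gen} collection: {ns / 1000:.1f} µs")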
When each regime is correct — three production shapes
Aditi's three Zerodha services aren't using different regimes by accident; each is the right pick for its workload shape, and the wrong pick for the others.
The Java gateway uses GC because session state has no clean phase boundary. Order-management sessions live for the entire trading day. They accumulate state in a tree shape (orders → child orders → fills → notifications) where individual lifetimes can't be predicted. A GC handles this naturally — the runtime tracks reachability, and when a session ends the entire subtree is unreachable in one batch. Trying to manually free this would mean writing reference counts on every node, with the complexity that implies. ZGC's 10 ms pause budget is acceptable for a service whose SLO is 50 ms p99 anyway.
The Rust matching engine uses manual lifetimes because pauses are unacceptable. A 10 ms pause in the matching engine means 10 ms of orders queued unmatched — at 50,000 orders/sec, that's 500 unmatched orders, each potentially a regulatory issue. Manual ownership (Rust's borrow checker enforcing it at compile time) is how the team guarantees there are no collector pauses at all. The team pays the cost in development time — every data structure has to have a defined owner — but reaps it in tail latency: 800 µs p99.9 with no GC-induced pause spikes, ever.
The C++ market-data ingestor uses arenas because every UDP packet is a phase. A market-data packet arrives, gets parsed into 30–80 small objects (price levels, order updates, timestamps), gets pushed into the matching engine's input queue, and is then forgotten. The natural lifetime of those 30–80 objects is exactly the duration of one parse. Allocating them from a per-packet arena and resetting the arena after parse is correct in every dimension: fastest possible alloc, fastest possible free, no fragmentation possible, no use-after-free possible (the arena outlives every pointer into it by construction). Razorpay's webhook delivery service uses the same pattern: one arena per webhook attempt, reset on completion or retry.
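In Python terms the per-packet shape looks like this sketch (it reuses the BumpArena class from three_regimes_demo.py and assumes that file is importable; the handler and the sizes are made up): the arena is borrowed for exactly one parse and reset on the way out.
# per_packet_arena.py — one arena per packet, reset when the phase ends.
from contextlib import contextmanager
from three_regimes_demo import BumpArena    # assumes the demo script is on the path

arena = BumpArena(1 << 20)                  # 1 MiB is ample for one parsed packet

@contextmanager
def packet_scope(a: BumpArena):
    try:
        yield a                             # the parser allocates freely from the arena
    finally:
        a.reset()                           # O(1) "free everything" at end of the phase

def handle_packet(payload: bytes) -> None:
    with packet_scope(arena) as a:
        for _ in range(64):                 # e.g. 64 price levels parsed out of the payload
            a.alloc(96)                      # every allocation dies with this handler
        # ...fill the allocations from payload and push results onward...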
The pattern that recurs: the regime should follow the shape of the lifetime distribution. If lifetimes cluster into phases, regions win. If lifetimes are tied to clearly-owned data structures, manual wins. If lifetimes are entangled and unpredictable, GC wins. The mistake is to pick a regime first and then bend the program to fit.
Misfits show up the same way every time. A Go service with arena-shaped allocations (one per HTTP request, all freed together) burns CPU in GC scanning the request-local objects until someone discovers sync.Pool and turns each handler into a manual arena. A C++ service with GC-shaped allocations (long-lived, entangled) bleeds memory through use-after-free until someone bolts on shared_ptr everywhere and turns the cost into atomic refcount churn. A Rust service with manual-shaped allocations that are forced into Arc<RwLock<...>> to satisfy the borrow checker pays the same atomic-refcount cost as the C++ team did. Each of these is a regime fighting the lifetime shape; each pays in the same currency — surprised CPU or surprised memory.
Common confusions
- "GC pauses are a JVM problem; Go's GC is pauseless." Go's GC is concurrent, not pauseless. Go pauses for ~100 µs–1 ms even on a healthy heap (mostly for stack scanning and root marking). Under memory pressure, Go pauses lengthen and
GOGCtriggers more frequently, eating CPU. Discord famously rewrote their service from Go to Rust in 2020 specifically because Go's "low pause" GC still produced 2-second tail spikes under their workload. The shape "concurrent collector" reduces but does not eliminate the pause column in the trade table. - "Manual allocation is faster than GC." True at the per-allocation level (the Python benchmark above shows 94 ns vs 160 ns), false at the system level once you account for the cost of writing manual code. Most service code that uses
malloc/freeends up calling the allocator more often than it needs to, because programmers can't predict lifetimes accurately and over-allocate to be safe. A real comparison has to measure the allocation rate of the program a team actually ships, not the program a team would write if they had infinite time. - "Region allocation is just
alloca/stack allocation." No —allocais bounded by the stack frame and the stack size (typically 8 MB). Region allocation can hold gigabytes, lives across function boundaries, and can be passed between threads (with care). The relationship: a region is a heap-allocated stack. You can allocate from it like a stack (fast bump) and free it like a stack (one reset), but its lifetime is whatever you want, not the function call. - "You can't have GC and manual together — pick one." Every JVM service that calls into a native library through JNI has both — Java objects on the GC heap, native buffers on
malloc. The bug surface is the boundary: who frees the native buffer when the Java reference drops?Cleaner/PhantomReferenceexists exactly to bridge it. Every Go service that usescgohas the same problem. Every Python service that uses numpy has the same problem (numpy arrays own native memory, freed via the array's__del__). The boundaries between regimes are where the most-bugs-per-line live. - "
shared_ptris reference-counted GC." It is reference counting, but not what most people mean by GC. Reference counting can't reclaim cycles (without an explicit cycle collector), it pays an atomic increment/decrement on every assignment, and it concentrates the free cost at the moment the last reference drops — which can be a huge cascade if the last reference releases a tree of millions of objects, producing a pause indistinguishable from a GC pause. Tracing GCs reclaim cycles and amortise the work; refcounting trades cycle support for predictability. - "Region allocation only makes sense for parsing." The pattern shows up everywhere a request has a beginning and an end: HTTP request handlers (per-request arena), packet processing (per-packet arena), database query execution (per-query arena), compiler passes (per-pass arena). Apache HTTP Server has used pools since 1995. PostgreSQL's
MemoryContextis a tree of regions. Linux kernel slab allocators are a related idea (per-cache pools). The phrase "request-scoped allocation" is just region allocation with a more enterprise-friendly name.
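The cascade in the shared_ptr bullet is easy to reproduce with CPython's own reference counting; this sketch (hypothetical filename, sizes chosen only to make the pause visible) drops the last reference to a million-node chain and times how long the deterministic free takes.
# refcount_cascade_demo.py — freeing a large linked structure at the moment
# the last reference drops is deterministic, but it is not instantaneous.
import time

def build_chain(n: int):
    head = None
    for _ in range(n):
        head = [head, bytearray(64)]        # each node keeps the previous one alive
    return head

chain = build_chain(1_000_000)
t0 = time.perf_counter_ns()
del chain                                   # last reference: the whole chain is freed right here
print(f"cascade free took {(time.perf_counter_ns() - t0) / 1e6:.1f} ms")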
Going deeper
Generational GC — why "most objects die young" became the design
In 1984, David Ungar measured Smalltalk programs and found that 80–98% of allocated objects became unreachable within one minor collection cycle. Generational GCs exploit this by allocating into a small "young" region (~16–64 MB) and only promoting survivors to a larger "old" region. Young-gen collection scans only the young region — milliseconds, not seconds. The JVM's G1, ZGC (generational since JDK 21), and .NET's collector all use generational designs; Go's GC is a notable exception and does not. The hypothesis is empirically true for most service workloads: a Razorpay payment handler creates ~200 short-lived objects per request and ~3 long-lived ones. The 200 die in young-gen; the 3 survive to old-gen. Pause budget is dominated by young-gen frequency, not old-gen size, which is why heap sizes can grow to 32 GB without proportionally growing pauses. The hypothesis breaks for caching layers (objects live for hours by design, so most of them survive every young collection and bloat old-gen). Tuning -XX:G1NewSizePercent for cache services is one of the most common heap-tuning fixes in production JVMs.
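CPython's generations exist for the cycle detector rather than as a moving nursery, but they make the same die-young bet observable from a few lines of code; this sketch (hypothetical filename) prints the collection thresholds and runs a youngest-generation pass over freshly created, short-lived containers.
# generational_probe.py — inspect CPython's generational thresholds and run a
# youngest-generation collection over short-lived container objects.
import gc

print("thresholds   :", gc.get_threshold())   # (700, 10, 10) by default
print("counts before:", gc.get_count())       # tracked objects per generation

for _ in range(100_000):
    _ = {"k": [1, 2, 3]}                       # container dies immediately: "objects die young"

gc.collect(0)                                   # collect only the youngest generation
print("counts after :", gc.get_count())
print("gen0 runs    :", gc.get_stats()[0]["collections"])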
Concurrent collectors — what ZGC and Shenandoah actually do
A "concurrent" collector does most of its work while the application threads keep running. ZGC achieves this through load barriers: every object reference read goes through a small piece of inserted code that checks a colour bit on the pointer. If the bit says "this object has been moved", the barrier follows the forwarding pointer transparently. This trades steady CPU (the load barrier costs ~5% CPU under typical workloads) for shrinking the stop-the-world phase to under 10 ms regardless of heap size. The JEP-333 paper measured ZGC pauses under 1 ms on a 16 TB heap. Shenandoah uses a similar "Brooks pointer" indirection. Go's tri-colour mark-and-sweep doesn't move objects (so no load barrier needed) but pays for the lack of compaction in fragmentation. PhonePe's UPI dispatch service migrated from G1 to ZGC in 2024 and saw p99.9 drop from 84 ms to 12 ms — the median climbed by 2 ms because of the load-barrier overhead, but the tail collapse was the win.
Region inference — when the compiler can prove arena lifetimes
OCaml has a "minor heap" that is essentially a stack-shaped young-gen region. Cyclone (the safe-C dialect) tried to do region inference — the compiler analyses the program and determines which allocations have stack-shaped lifetimes, then allocates them from regions automatically. Rust does a limited version of this through its lifetime analysis: any allocation whose lifetime is provably bounded by a scope can be put on the stack via let, and bumpalo makes it explicit when you want a heap-backed region. The dream of fully-automatic region inference died in the 2000s because it requires whole-program analysis that doesn't compose with separate compilation. The compromise — annotated lifetimes that the compiler enforces locally — is what Rust shipped. The market-data ingestor at Zerodha uses a Rust port that reads packets into bumpalo::Bump arenas and proves at compile time that no parsed object outlives the packet, getting region performance with manual-allocation safety.
The Apache pool pattern — region allocation in 2,000 lines of C
Apache HTTP Server's apr_pool_t is the reference implementation of region allocation in production C code, and has been since 1995. Each request gets a pool. Modules allocate from the pool. When the request ends, the pool is destroyed in O(1). Pools can have child pools — when a request spawns a sub-request, the sub-request gets a pool whose parent is the request pool, and destroying the parent destroys the children. This composes the "lifetime tree" of a request directly into the allocator. The cost of the entire mechanism is roughly 2,000 lines of C in apr_pools.c. nginx, Postgres (MemoryContext), and Squid all reinvented the same idea independently. If a 2,000-line C file can give you GC-free, fragmentation-free, leak-free allocation for a 30-year-old web server, the pattern is doing real work.
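The pool-tree idea is small enough to sketch in a few lines of Python (this mimics the shape of apr_pool_t, not its API): child pools hang off a parent, and destroying the parent releases every descendant in one pass.
# pool_tree_sketch.py — the apr_pool_t shape: a tree of regions where
# destroying a parent pool destroys all of its children with it.
class Pool:
    def __init__(self, parent: "Pool | None" = None):
        self.chunks: list[bytearray] = []    # allocations owned by this pool
        self.children: list["Pool"] = []     # sub-pools that die with this one
        if parent is not None:
            parent.children.append(self)

    def alloc(self, n: int) -> bytearray:
        buf = bytearray(n)
        self.chunks.append(buf)              # the pool owns it; callers never free
        return buf

    def destroy(self) -> None:
        for child in self.children:          # parent destruction cascades downward
            child.destroy()
        self.children.clear()
        self.chunks.clear()                  # one pass releases everything the pool owned

request = Pool()                             # one pool per request
subrequest = Pool(parent=request)            # one child pool per sub-request
header = request.alloc(256)
body = subrequest.alloc(4096)
request.destroy()                            # frees header, body, and the sub-pool together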
Reproduce this on your laptop
sudo apt install python3.11 python3.11-venv
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip hdrh
# Throughput comparison — alloc/free cost for 200k objects per regime
python3 three_regimes_demo.py
# Tail-latency comparison — pause profile per regime
python3 pause_profile_demo.py
# To compare against a real GC, install a JVM and run the equivalent
# benchmark with -XX:+UseG1GC vs -XX:+UseZGC; the arena column is the
# same shape as Rust's bumpalo, which you can install via cargo if curious
You should see the arena column come in 5–10× faster on throughput than the GC column on a typical x86 laptop, and the p99.9 tail-latency column show 100–1000× separation between the arena and GC tails. Numbers vary by hardware and allocator (Apple Silicon laptops tend to show a tighter GC tail), but the ordering is invariant: arena fastest, manual middle, GC last on both throughput and tail.
Where this leads next
The three regimes recur in every chapter of Part 11 and most of Part 13. The next chapters work through the consequences:
- /wiki/measuring-allocator-overhead — turning tracemalloc, jemalloc's stats.allocated, and the JVM's -Xlog:gc* into Prometheus gauges so you can see which regime is paying what at runtime.
- /wiki/numa-aware-allocators-and-data-structures — what happens to each regime's cost profile when "memory" is plural across sockets.
- /wiki/jvm-gc-g1-zgc-shenandoah-comparison — picking a JVM collector once you've decided GC is the right regime.
- /wiki/golang-gc-internals-and-tuning — the same conversation for Go's concurrent collector.
- /wiki/rust-bumpalo-and-typed-arenas — region allocation in Rust with compile-time safety.
References
- Hans Boehm, "Bounding Space Usage of Conservative Garbage Collectors" (POPL 2002) — the foundational paper on GC space overhead bounds; explains why GC heaps run 1.5–3× the live set.
- Dan Grossman et al., "Region-Based Memory Management in Cyclone" (PLDI 2002) — the canonical paper on region-based memory management in a safe C dialect and its trade-offs against manual allocation and GC.
- Per-Åke Larson & Murali Krishnan, "Memory Allocation for Long-Running Server Applications" (ISMM 1998) — the SQL Server team's measurement of which allocator regime fits which subsystem; the pattern of "no system uses just one" comes from here.
- Hans Boehm, notes on GC vs. manual memory management (Boehm GC project pages, c. 1995) — the benchmark methodology this chapter's measurement shape derives from.
- Per Liden, "ZGC: A Scalable Low-Latency Garbage Collector" (JEP 333, 2018) — the design proposal for ZGC, including the load-barrier mechanism that makes sub-10-ms pauses possible on multi-TB heaps.
- Apache Portable Runtime, apr_pools.c — 2,000 lines of C that implement the canonical region-allocator pattern; the model copied by nginx, Postgres, and many others.
- Brendan Gregg, Systems Performance (2nd ed., 2020), §7.4 "Memory Methodology" — the diagnostic ladder for memory issues, including how to tell which regime is paying.
- /wiki/malloc-internals-glibc-jemalloc-tcmalloc-mimalloc — the previous chapter, where the manual-allocation column's per-alloc cost actually comes from.