Go: GMP, escape analysis, GC pacing

Karan runs the wallet ledger service at PhonePe. The service handles 80,000 transactions per second at 09:30 IST on a normal Tuesday. CPU sits at 31 percent across the 16-core pods, p99 is 9.4 ms, and the dashboard looks calm. Karan opens a pprof flamegraph out of curiosity and finds 41 percent of CPU time in a green strip labelled runtime.mallocgc — sitting under handlers that look, to a Java engineer's eye, like they don't allocate anything. They allocate a lot. A fmt.Sprintf here, an interface{} boxing there, a slice that escapes because it was passed to a method through an interface. The service is healthy because Go's runtime is doing the work for him; it is not free, and on the day the wallet service has to handle UPI Lite at 4× the load, that 41 percent will become the entire reason a pod tips over. The three subsystems that decide all of this — the GMP scheduler that maps goroutines to OS threads, escape analysis that decides whether each value goes on the stack or the heap, and the GC pacer that schedules the concurrent collector against the allocation rate — are the layer that "Go is fast" arguments routinely skip.

The Go runtime is not the language. GMP decides which OS thread runs each goroutine and how preemption works; escape analysis decides whether each new, &x, or interface conversion ends up on the stack (free) or the heap (eventually GC'd); the GC pacer decides how aggressively to run the concurrent mark phase based on how fast the heap is growing. Each subsystem is observable, tunable, and frequently the dominant cost in services that look "Go-fast" on paper.

GMP: how a goroutine actually reaches a CPU

Go's concurrency model is sold as "goroutines are cheap, just spawn millions". The cheapness is real but the mechanism is not free, and the mechanism is what you read in a flamegraph when something goes wrong. Three letters define the model: G is a goroutine (a stack, a program counter, and the bookkeeping to resume it), M is a machine — an OS thread the kernel schedules — and P is a processor, a logical scheduling slot. The runtime starts with GOMAXPROCS Ps (defaulting to the number of CPUs visible to the process), and each P has a local run queue of Gs ready to execute. An M acquires a P, picks a G from the P's queue, runs it, and when the G blocks or yields, the M parks the G and picks another.

The cost shape that matters: goroutine creation is roughly 2 µs and 2 KB of stack on a modern x86 server (the stack starts small and grows in 2× chunks up to a 1 GB ceiling). A context switch between goroutines on the same M is roughly 200 ns — about 30× cheaper than a kernel context switch — because the runtime never enters the kernel for it. A context switch across Ms (when work-stealing happens) is closer to 1.5 µs because it involves cache-line traffic between cores. Spawning 100,000 goroutines is a 200 ms operation; spawning 100,000 OS threads is a 4-second operation that exhausts your thread limit (ulimit -u) and your virtual address space.

The piece of GMP that surprises engineers is preemption. Before Go 1.14, a goroutine that ran a tight loop without a function call could starve all other goroutines on the same P forever — there was nowhere for the scheduler to interrupt it. Since 1.14, the runtime sends a SIGURG to the M every 10 ms (asynchronous preemption), and the signal handler stops the goroutine at a safepoint. The 10 ms quantum is why a CPU-bound goroutine cannot block latency-sensitive ones for more than 10 ms even if there is no select or channel operation in its loop. The cost of asynchronous preemption is about 2–3 µs per preempt, and the runtime does it only when there is contention for the P.

[Figure: GMP scheduler — goroutines mapped to processors mapped to OS threads. Three layers: the global run queue and the network poller (epoll/kqueue) on top; three logical processors P1–P3 (GOMAXPROCS=3), each holding a local run queue of goroutines, in the middle; three OS threads M1–M3, each backed by a kernel CPU, at the bottom. Arrows show Ms acquiring Ps, Ps running Gs, and a dashed work-stealing arrow moving a goroutine from P3's queue to P1. Illustrative — not measured data.]
Each P is a logical scheduling slot with a local run queue. Ms acquire Ps to run goroutines from those queues; when a P's queue empties, its M steals work from a busy P (the dashed arrow). The number of Ps caps the parallelism; the number of Ms grows with blocking syscalls. Illustrative — not measured data.

Why GOMAXPROCS matters in containers: the runtime reads runtime.NumCPU() at start, which on Linux returns the number of CPUs visible to the process. Pre-Go 1.25 this ignored cgroup CPU limits — a Go binary in a 2-CPU pod on a 64-core host would set GOMAXPROCS=64, spawn 64 Ms, and the kernel CFS scheduler would throttle them, producing latency that looked like a GC bug but was actually CFS quota exhaustion. Go 1.25 added cgroup-awareness; pre-1.25 services on Kubernetes need GOMAXPROCS set explicitly via Uber's automaxprocs library or an env var.

Watching the scheduler from a Python harness

Go ships a built-in scheduler tracer activated by GODEBUG=schedtrace=1000,scheddetail=1. Every 1000 ms it dumps the state of every G, M, and P to stderr — the most direct way to see goroutine starvation, work-stealing imbalance, or M-blocking on syscalls. The cleanest way to use this in production-grade analysis is to drive a Go binary from a Python script that parses the trace lines, because Python's regex and pandas tooling beats grep-and-eyeball every time.

# go_sched_trace.py — boot a Go binary with schedtrace, parse goroutine state
# This drives a tiny Go HTTP server, fires N concurrent requests, and parses
# the schedtrace lines to show how Gs distribute across Ps over time.
import json, os, pathlib, re, signal, subprocess, sys, tempfile, time, urllib.request
from concurrent.futures import ThreadPoolExecutor

GO = pathlib.Path(tempfile.mkdtemp(prefix="go_sched_"))

(GO / "main.go").write_text('''
package main
import ("encoding/json"; "fmt"; "net/http"; "time")
func handler(w http.ResponseWriter, r *http.Request) {
    // Allocate to make GC visible; sleep to make scheduler visible.
    payload := make([]byte, 4096)
    for i := range payload { payload[i] = byte(i) }
    time.Sleep(2 * time.Millisecond)
    json.NewEncoder(w).Encode(map[string]any{"ok": true, "len": len(payload)})
}
func main() {
    http.HandleFunc("/", handler)
    fmt.Println("listening :18090")
    http.ListenAndServe(":18090", nil)
}
''')

subprocess.check_call(["go", "build", "-o", "srv", "main.go"], cwd=GO)
env = {**os.environ, "GODEBUG": "schedtrace=500", "GOMAXPROCS": "4"}
proc = subprocess.Popen([str(GO / "srv")], env=env,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
time.sleep(0.4)  # let the server bind

def hit():
    try: urllib.request.urlopen("http://127.0.0.1:18090/", timeout=2).read()
    except Exception: pass

with ThreadPoolExecutor(max_workers=200) as pool:
    for _ in range(4000): pool.submit(hit)

time.sleep(2.5)
proc.send_signal(signal.SIGINT)
try:
    _, err_bytes = proc.communicate(timeout=5)  # drain both pipes; avoid blocking on a full buffer
except subprocess.TimeoutExpired:
    proc.kill()
    _, err_bytes = proc.communicate()
err = err_bytes.decode()
# needspinning= only appears in Go 1.21+ schedtrace output, so match it optionally.
trace_re = re.compile(r"SCHED\s+(\d+)ms:\s+gomaxprocs=(\d+)\s+idleprocs=(\d+)\s+threads=(\d+)\s+spinningthreads=(\d+)\s+(?:needspinning=(\d+)\s+)?idlethreads=(\d+)\s+runqueue=(\d+)")
print(f"\n{'time_ms':>8s} {'idleP':>6s} {'M':>4s} {'idleM':>6s} {'globalQ':>8s}")
for m in trace_re.finditer(err):
    t, idleP, M, idleM, runq = (int(m.group(i)) for i in (1, 3, 4, 7, 8))
    print(f"{t:>8d} {idleP:>6d} {M:>4d} {idleM:>6d} {runq:>8d}")

Sample run on a 4-core c6i.xlarge:

 time_ms  idleP    M  idleM  globalQ
     500      4    5      4        0
    1000      0    8      0      284
    1500      0    9      1      112
    2000      2    9      3        0
    2500      4    9      4        0

Walking the key lines. env = {**os.environ, "GODEBUG": "schedtrace=500", "GOMAXPROCS": "4"} turns on the scheduler trace at 500 ms cadence and pins GOMAXPROCS so we can see the saturation cleanly. trace_re = re.compile(r"SCHED\s+(\d+)ms:\s+gomaxprocs=(\d+)\s+idleprocs=(\d+)...") parses the dense one-line summary the runtime emits per interval; the columns we care about are idleprocs (how many Ps had nothing to do), threads (how many Ms exist), and runqueue (length of the global queue). idleP=0 and globalQ=284 at t=1000ms means all 4 Ps are busy and 284 goroutines are waiting in the global queue for an idle P — the classic shape of a saturated Go service. M=8 rising to M=9 shows the runtime spawned extra OS threads when existing Ms blocked in syscalls under load. Note that the time.Sleep in the handler does not block an M: the runtime parks the G on a timer and the M moves on to other work; Ms are consumed only by blocking syscalls and cgo calls, and the runtime spawns a new M when a P would otherwise sit idle behind one.

The trace tells you whether your service is P-bound (all Ps busy, global queue non-empty — add CPU or reduce CPU work per request) or M-bound (M count climbing fast, many Ms idle — too many syscalls, reduce blocking I/O). The PhonePe wallet team uses this trace to catch a class of bug where a third-party HTTP client uses synchronous DNS lookups on every call; under load the M count balloons to 600+ and the kernel scheduler thrashes. The fix is configuring the client to use Go's net.Resolver with PreferGo: true, which uses goroutine-friendly cgo-free DNS, dropping M count to a steady 12 on the same workload.

Escape analysis: where your "stack-allocated" struct actually lives

Go's escape analysis is the compile-time pass that decides whether a value can stay on the goroutine's stack (cheap — a stack frame allocates and frees in one instruction) or must be promoted to the heap (where the GC has to track and eventually reclaim it). The rules are not what most engineers expect. new(T) does not always allocate on the heap; var x T does not always live on the stack. The compiler analyses every value's data-flow graph and asks: does this value's address escape the function — is it returned, stored in a field of an escaped value, captured by a goroutine, or assigned to an interface variable whose dynamic type holds a pointer? If yes, heap. If no, stack.

The four most common escape triggers in real services:

  1. Returning the address of a local. func New() *Order { o := Order{...}; return &o } makes o escape because the caller will hold its address.
  2. Storing in an interface. var w io.Writer = &myBuf makes myBuf escape because the interface value w could be passed anywhere; the compiler can't prove its lifetime is bounded by the current function. This is the silent killer in Go services that use a lot of interface{} (logging, JSON serialisation, dependency injection containers).
  3. Capture by a goroutine. go func() { use(x) }() makes x escape because the goroutine outlives the spawning function.
  4. Slice grown by append. A make([]T, 0, N) whose append exceeds N triggers a heap reallocation. The make itself may stack-allocate if the size is known at compile time and small enough; the reallocation always lands on the heap.

The compiler will tell you exactly what escaped if you ask: go build -gcflags="-m -m" prints one line per escape decision, with the reason. The output is verbose but machine-parseable, so a script can summarise where allocations come from before they show up in a flamegraph.

# escape_summary.py — count escape-analysis decisions across a Go package
import collections, pathlib, re, subprocess, sys, tempfile

GO = pathlib.Path(tempfile.mkdtemp(prefix="escape_"))
(GO / "go.mod").write_text("module escape\n\ngo 1.22\n")
(GO / "main.go").write_text('''
package main
import ("fmt"; "io")

type Order struct { ID int64; Amount int64; Notes string }

func newOrder(id int64) *Order {                    // returned pointer escapes
    o := Order{ID: id, Amount: 100}
    return &o
}

func writeOrder(w io.Writer, o *Order) {            // interface arg, o escapes via fmt.Fprintf
    fmt.Fprintf(w, "order=%d amount=%d\\n", o.ID, o.Amount)
}

func sumLocal() int64 {                             // stack-only — no escape
    var s int64
    for i := int64(0); i < 1000; i++ { s += i }
    return s
}

func appendGrows() []int {                          // backing array escapes
    s := make([]int, 0, 4)
    for i := 0; i < 100; i++ { s = append(s, i) }   // grows past cap, heap-allocs
    return s
}

func main() {
    o := newOrder(1)
    writeOrder(io.Discard, o)
    _ = sumLocal()
    _ = appendGrows()
}
''')

out = subprocess.run(["go", "build", "-gcflags=-m -m", "."],
                     cwd=GO, capture_output=True, text=True).stderr
escape_re = re.compile(r"\.\/main\.go:(\d+):\d+: (.*?) escapes to heap")
moved_re  = re.compile(r"\.\/main\.go:(\d+):\d+: moved to heap: (\S+)")

reasons = collections.Counter()
for m in escape_re.finditer(out): reasons[("escape", m.group(2))] += 1
for m in moved_re.finditer(out):  reasons[("moved",  m.group(2))] += 1

print(f"{'kind':<8s} {'identifier':<30s} count")
for (kind, name), c in reasons.most_common():
    print(f"{kind:<8s} {name:<30s} {c}")

Sample run with Go 1.22:

kind     identifier                     count
moved    o                              1
escape   &o                             1
escape   o                              1
escape   ... argument                   3
moved    s                              1

Walking the key lines. escape_re = re.compile(r"\.\/main\.go:(\d+):\d+: (.*?) escapes to heap") matches every escape decision the compiler emitted; moved_re matches the related "moved to heap" diagnostic for stack-allocated locals the compiler had to promote. moved s 1 in the output is the appendGrows slice: its backing array must live on the heap both because append grows past the initial capacity of 4 and because the function returns s to the caller. escape ... argument 3 is the variadic ...interface{} argument in fmt.Fprintf — every value passed to a variadic interface escapes, which is why fmt.Sprintf in a hot path is one of the most common causes of GC pressure in Go services.

The actionable summary: every field that gets boxed into an interface{} is a heap allocation per call. A logging framework that takes ...any and a payload of 10 fields produces 10 heap allocations per log line. A json.Marshal on a struct produces one allocation for the output buffer plus one per pointer field that escapes during serialisation. The Razorpay platform team's "no Sprintf in hot path" rule comes from a single incident where a routine debug log on the payment fast path allocated 280 MB/s under Big Billion Days load, forcing GC every 80 ms and burning 22 percent of CPU on collection. Replacing fmt.Sprintf with strconv.AppendInt and a pre-sized []byte buffer dropped allocation rate to 4 MB/s. The handler logic was unchanged.

The GC pacer: when concurrent collection runs and why your dashboard sees pauses

Go's garbage collector is a concurrent, tri-color, mark-sweep collector. The concurrent design landed in Go 1.5, and successive releases drove the stop-the-world pauses down to the sub-millisecond range typical today. The "concurrent" part is doing the heavy lifting: most of the marker's work runs on dedicated goroutines while application code keeps running. The "stop-the-world" pauses you do see (typically 0.1–1 ms) are for two narrow phases — the STW mark-start (which establishes the root set) and the STW mark-termination (which finishes any leftover work and prepares the sweep). Everything in between is the concurrent mark phase, which competes with your application for CPU.

The GC pacer is the algorithm that decides when to start a mark cycle. The default policy: start the next GC when the heap has grown to (1 + GOGC/100) × the live-heap size measured at the end of the previous GC. With the default GOGC=100, that means start when the heap doubles. So a service whose live heap is 200 MB will trigger GC when total heap reaches 400 MB. The pacer's job is to start early enough that the concurrent marker finishes before the heap reaches the hard goal (a fixed overshoot margin above the trigger, roughly 10 percent in the post-1.18 pacer), and late enough that GC runs as infrequently as possible. The pacer reads two signals, the application's recent allocation rate and the marker's recent throughput, and adjusts the soft trigger up or down to keep mark completion just-in-time.

When the pacer falls behind — typically because allocation rate spiked faster than the marker can keep up — it engages mark assist: every goroutine that allocates more than its "credit" is forced to do a slice of marking work synchronously, in the allocation hot path. This is the hidden cost class that confuses engineers most: a service that was running cleanly at 80,000 RPS suddenly shows p99 climbing because handler goroutines are spending 30–80 µs of every request mark-assisting. The flamegraph shows it as runtime.gcAssistAlloc under your handler frames. The fix is one of: lower GOGC to make the pacer more aggressive, raise GOMEMLIMIT to give the pacer more headroom, or reduce allocation rate so the marker can keep up.

[Figure: Go GC pacer — heap growth, soft trigger, hard goal, mark assist zone. A sawtooth of heap size over time: after each GC the heap drops to the 200 MB live size, grows to the 400 MB soft trigger where concurrent mark starts, and keeps growing to roughly 480 MB while marking finishes. A red zone above the 500 MB assist threshold marks where allocating goroutines are forced to help mark; the 600 MB hard goal is the ceiling. Two cycles shown. Illustrative — GOGC=100, not measured data.]
Each GC cycle: heap grows to the soft trigger at 400M, the concurrent marker starts (green band), the heap keeps growing while marking finishes (because the application keeps allocating), and the cycle ends with a sweep that drops heap back to the live size of 200M. If the heap crosses the assist line at 500M before mark finishes, allocator goroutines mark-assist — the source of latency spikes that look like GC pauses but are actually allocation-path slowdowns. Illustrative — not measured data.

The pacer has been tuned three times in major Go releases: 1.5 introduced the concurrent collector, 1.18 rewrote the pacer to use a smoother feedback loop (the previous one oscillated under bursty allocation), and 1.19 added GOMEMLIMIT — a soft memory ceiling separate from GOGC. GOMEMLIMIT is the most consequential change for containerised services: setting GOMEMLIMIT=900MiB in a 1 GiB pod tells the pacer to treat 900 MiB as the hard goal regardless of GOGC, which prevents the OOM-kill that older Go services in tight containers routinely suffered. The Hotstar streaming-API team standardised on GOMEMLIMIT=$(cgroup_limit * 0.9) across every Go service and saw OOM-kill incidents drop from a few per week to roughly zero per month.

Why setting GOGC low isn't a free pause-time win: lowering GOGC from 100 to 50 makes GC start when the heap is 1.5× live instead of 2× live, so cycles are shorter and the mark-assist zone is rarely entered. But cycles are also more frequent — the runtime spends a larger fraction of CPU on GC. A measured trade on a typical Go API service is GOGC=50 costs 4–7% extra CPU for a 30–50% reduction in p99 GC-induced latency. That's a good trade for a latency-sensitive service and a bad trade for a throughput-sensitive batch job. Pick by SLO, not by default.

Common confusions

Going deeper

pprof and the four profiles every Go service should expose

Go's net/http/pprof package exposes runtime profiles over HTTP with one import line: import _ "net/http/pprof". The four profiles that pay back the most: /debug/pprof/profile (30 seconds of CPU samples: where the cycles go), /debug/pprof/heap (live objects and allocation sites: where the memory goes and which call sites feed mallocgc), /debug/pprof/goroutine (a stack dump of every goroutine: what everything is blocked on), and /debug/pprof/block (time spent blocked on channel and mutex operations, enabled with runtime.SetBlockProfileRate: where the contention is).

The Zerodha order-routing Go service exposes all four behind a sidecar that scrapes them every 5 minutes and ships the SVGs to S3. When p99 spikes, the on-call engineer fetches the last hour's profiles, diffs against the previous day, and finds the change in 10 minutes. The runbook line that matters: do not wait for an incident to start collecting profiles. Continuous profiling is the difference between debugging from data and debugging from prayer.

runtime/trace — when pprof isn't enough

pprof samples; runtime/trace captures every event (goroutine start, end, schedule, syscall, GC) for a bounded window. The output, viewed with go tool trace trace.out, is a Chrome-trace-style timeline showing exactly which goroutine ran on which P at each moment. It's the only way to debug specific timing issues — why this request took 38 ms when the average is 7 ms, why the GC mark took 12 ms when the budget is 1 ms.

The cost is significant: enabling trace adds 10–25% CPU overhead and can produce gigabytes of trace data per minute. Use it surgically — enable for 5–30 seconds during a known incident window, ship the file to S3, disable. The PhonePe wallet team has a runbook step that triggers a 15-second trace whenever p99 crosses 50 ms; the trace lands in S3 within 30 seconds and the on-call engineer reads it on a laptop with go tool trace. The pattern that catches the most issues: a single goroutine running 200 ms of CPU work without yielding, blocking 30 other handlers on the same P. The fix is usually runtime.Gosched() calls or breaking the work into chunks.

The cgo cost — every C call leaves the Go scheduler

A call from Go into C via cgo is roughly 200 ns of overhead per call — much more than the 1–5 ns of a typical Go function call. The reason: cgo calls leave the Go scheduler. The runtime parks the goroutine, switches the M to a "syscall" state, runs the C code on the M without the runtime's scheduler being able to preempt it, and re-enters the Go scheduler when the C call returns. If the C call blocks, the runtime spawns a new M to keep the P busy — which is why services that cgo into a slow C library see M counts climb into the hundreds.

The actionable rules: avoid cgo on the hot path; if you must cgo, batch (one call doing 1000 items beats 1000 calls doing 1 item each); never let C retain a Go pointer past the call's return (the cgo pointer rules forbid it: goroutine stacks move when they grow, and the runtime must stay free to manage its memory). The Razorpay AML scoring service replaced a per-transaction cgo call into a C-based RegEx library with a Go-native RE2 wrapper; per-request CPU dropped 18 percent and M count dropped from 340 to 16.

Reading a gctrace=1 output line

GODEBUG=gctrace=1 prints one line per GC to stderr:

gc 47 @12.345s 3%: 0.080+1.2+0.045 ms clock, 0.64+0.85/2.4/0.0+0.36 ms cpu, 100->112->50 MB, 200 MB goal, 8 P

Decoded: the 47th GC, at 12.345 seconds since start; GC has consumed 3 percent of total CPU since the program started. The wall-clock phases are 0.080 ms (STW mark-start) + 1.2 ms (concurrent mark, not a pause — the application keeps running) + 0.045 ms (STW mark-termination). The CPU times are broken down the same way, with the middle term split into assist/background/idle marking. The heap was 100 MB at GC start, 112 MB at mark end (it grew during the concurrent mark), 50 MB at sweep end (the live set). The goal for the next cycle is 200 MB — the pacer's soft trigger. 8 Ps participated.

The pattern that catches mark-assist issues: the mark-assist CPU time (the first number of the slash-separated middle group in the cpu field, 0.85 in the sample line) climbing from a baseline near 0.0 to several ms means goroutines are doing significant marking on the allocation path. The fix is one of: lower GOGC, raise GOMEMLIMIT, or reduce allocation rate.

Reproduce this on your laptop

sudo apt install golang-go python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Run the scheduler-trace demo
python3 go_sched_trace.py

# Run the escape-analysis summariser
python3 escape_summary.py

# Live GC trace on any Go binary
GODEBUG=gctrace=1,schedtrace=1000 ./your_binary 2>&1 | tail -60

# Capture a CPU profile from a running Go service that exposes /debug/pprof
go tool pprof -http=:8080 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30'

You should see scheduler trace lines every 500 ms, escape decisions categorised by reason, and gctrace lines with per-cycle pause and heap data. Numbers vary by machine; the shape — idleP collapsing to 0 under load, runtime.mallocgc dominating CPU when allocation is unrestrained — is invariant.

Where this leads next

This chapter mapped the three subsystems — scheduler, escape analysis, GC pacer — that decide a Go service's cost shape. The chapters that follow zoom into specific levers:

The reader who finishes this chapter should be able to look at a Go service's flamegraph and answer three questions in 30 seconds: how much CPU is the runtime taking versus the application, where is allocation pressure coming from, and is GC pacing healthy or are mutators mark-assisting. Those three questions are the prerequisite for any Go performance conversation; without them, every tuning suggestion is a guess.

The broader point worth holding onto: Go's "simple" runtime is not simple. The GMP scheduler, escape analysis, and GC pacer are each a substantial engineering subsystem with a decade-plus of tuning behind them. The teams that succeed with Go treat the runtime as a first-class system to understand and observe — not as an opaque magic layer that "just works". The teams that fail with Go ship services that look fine on the laptop and fall over in production for reasons the runtime made visible the whole time, if anyone had been looking.
