Node.js: V8, event loop, worker threads

Aditi runs the order-gateway at Zerodha Kite. The service sits between the Kite mobile app and the exchange's order-matching engine, validating roughly 38 fields per order and forwarding it to NSE. At 09:15:00 IST, when the cash-equity market opens, the gateway sees about 2.4 lakh orders in the first 90 seconds. On a normal Tuesday this is fine — p99 stays under 18 ms on a c6i.4xlarge. On the Tuesday the new "smart order router" feature shipped, p99 crossed 820 ms inside the first 30 seconds, the dashboard turned red, and Aditi watched in confusion as htop showed exactly one core pinned at 100% while the other fifteen sat between 1 and 4%. The c6i.4xlarge had 16 vCPUs. The Node service was using one of them. The new feature parsed a 240 KB JSON market-snapshot on every order, and JSON.parse of 240 KB takes about 70 ms of pure CPU on V8. Seventy milliseconds during which nothing else in the process can run — not the next order's HTTP read, not the Kafka producer's flush, not the 5 ms internal heartbeat that the SRE alerting watched. The single event loop is what makes Node simple. It is also what makes one slow JSON.parse a 70 ms outage for every other request in the process. To predict your Node service's behaviour at 9:15:00 IST, you need to model three layers: V8's optimising compiler pipeline, libuv's event loop phases, and the worker-thread escape that finally lets you off the one-core ceiling.

Node.js runs your JavaScript on a single OS thread driven by libuv's event loop, with V8 (the JIT) compiling hot functions to machine code through a four-tier pipeline (Ignition → Sparkplug → Maglev → TurboFan). Any synchronous work — JSON.parse, regex, hashing, template rendering — blocks every other request in the process for its full duration, because there is exactly one thread executing JavaScript. The two escapes are libuv's internal thread pool (used by fs, crypto, DNS — 4 threads by default) and worker_threads (real OS threads with isolated V8 isolates, communicating via MessagePort and SharedArrayBuffer). Choosing wrongly is how you pay for 16 cores while using one.

The architecture — V8, libuv, and the one thread that runs everything

Node.js is three pieces glued together. V8 is Google's JavaScript engine — a register-based bytecode interpreter (Ignition) plus a four-tier optimising compiler pipeline that turns hot JavaScript into native machine code. libuv is the event-driven I/O library originally extracted from Node.js — a cross-platform epoll/kqueue/IOCP wrapper that gives you a single thread that can wait on thousands of file descriptors and a small worker pool for operations the OS cannot do asynchronously. Node itself is the C++ layer that wires V8 and libuv together, plus the standard library (fs, http, crypto, net, worker_threads).

The architectural choice that defines the entire developer experience: one OS thread runs all your JavaScript. The same thread reads incoming HTTP requests off the socket, runs your app.post('/order', ...) handler, parses the JSON body, calls your validation logic, calls into the Postgres driver, runs the callback when the query returns, serialises the response JSON, writes it to the socket, and reads the next request. If any of those steps takes 70 ms of synchronous CPU work, the thread is unavailable for those 70 ms — every other connected client waits.

[Figure: Node.js architecture — V8, libuv, the single event loop, and the worker pool. One process, three layers: user JavaScript on the single event-loop thread; V8 (Ignition interpreter, Sparkplug/Maglev/TurboFan JIT tiers, heap and Orinoco GC) and libuv (the seven loop phases, thread pool of UV_THREADPOOL_SIZE = 4 used by fs, crypto, dns, zlib) beneath it; optional worker_threads as independent V8 isolates with their own event loops, communicating over MessagePort and SharedArrayBuffer; and the kernel I/O multiplexer (epoll on Linux, kqueue on macOS/BSD, IOCP on Windows) at the bottom.]
The Node.js process. The event loop owns the JavaScript thread; V8 compiles your code; libuv multiplexes I/O and runs a small worker pool for syscalls the OS cannot do asynchronously. The worker_threads block on the right is the only way to put more than one core's worth of JS work into the same process.

Why exactly one JS thread: V8's heap, garbage collector, and inline caches are not multi-thread-safe. Two threads concurrently mutating the same JS object would race on the shape pointer, the inline cache slots, and the marking bitmap during GC. Rather than introduce per-object locking (the cost the Python team is now paying with PEP 703), the Node design isolates each JS context behind a single thread — and provides multi-process (cluster) and multi-isolate (worker_threads) escapes for parallelism. The architectural simplicity is the feature; the single-core ceiling per isolate is the bill.

The default Node process is therefore exactly one JavaScript-running thread plus libuv's pool of 4 internal worker threads (configurable via UV_THREADPOOL_SIZE, used by fs, crypto.pbkdf2, crypto.randomBytes, DNS lookups via getaddrinfo, and zlib). The internal pool exists for one reason: not every I/O can be done with epoll. Filesystem reads on Linux, for example, do not have a stable async API for regular files — epoll only works on sockets, pipes, and certain character devices. fs.readFile therefore dispatches to the libuv pool, which calls pread() synchronously on a worker thread and posts the result back. The same pattern serves CPU-bound crypto operations like pbkdf2. None of this helps your JSON.parse: it runs on the event-loop thread, because it is a synchronous JavaScript built-in.
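To watch the pool ceiling directly, here is a minimal sketch; unlike the chapter's Python drivers it is a standalone Node script, and the iteration count is an arbitrary choice. Eight concurrent pbkdf2 calls against the default 4-thread pool complete in two waves; raise UV_THREADPOOL_SIZE (it is read once, at pool start-up) and they complete in one.

// threadpool_ceiling.js: 8 concurrent pbkdf2 jobs against the default 4-thread libuv pool.
// Run:  node threadpool_ceiling.js
//       UV_THREADPOOL_SIZE=8 node threadpool_ceiling.js
const crypto = require('crypto');

const t0 = process.hrtime.bigint();
for (let i = 0; i < 8; i++) {
    // pbkdf2 is dispatched to the libuv pool; the event-loop thread stays free.
    crypto.pbkdf2('secret', 'salt', 500_000, 64, 'sha512', () => {
        const ms = Number(process.hrtime.bigint() - t0) / 1e6;
        console.log(`pbkdf2 #${i} finished at ${ms.toFixed(0)} ms`);
    });
}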

The event loop's seven phases — and where your callback actually runs

When you write setTimeout(cb, 0), when does cb run? When you write setImmediate(cb), why does it sometimes run before and sometimes after setTimeout? Why does process.nextTick(cb) always win? The answers all live in the seven phases libuv cycles through on every loop iteration:

  1. Timers — fires setTimeout and setInterval callbacks whose deadlines have passed
  2. Pending callbacks — runs callbacks for some system errors deferred from the previous iteration (TCP errors, etc.)
  3. Idle, prepare — internal libuv use only
  4. Poll — waits on the kernel I/O multiplexer (epoll_wait on Linux); runs incoming socket data callbacks; this is where the loop spends most of its idle time
  5. Check — fires setImmediate callbacks
  6. Close callbacks — fires 'close' events on sockets and handles
  7. Microtasks + process.nextTick — drained after every macrotask in any phase, before moving on

The poll phase is the heart of the loop. If there is no work to do (no timers due, no setImmediate queued, no nextTick queued), the loop blocks in epoll_wait until either a file descriptor becomes ready or the next timer's deadline arrives. This is what makes Node "non-blocking": the process sleeps in the kernel, doing nothing, until an actual event arrives — no thread spinning, no polling.
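A minimal ordering sketch makes the phase model concrete (a standalone Node script, not one of the chapter's Python drivers): inside an I/O callback, the loop reaches the check phase before it wraps back around to timers, and the nextTick and microtask queues drain before either.

// phase_order.js: observe callback ordering from inside a poll-phase I/O callback.
const fs = require('fs');

fs.readFile(__filename, () => {
    // We are now inside an I/O callback, i.e. the poll phase.
    setTimeout(() => console.log('4. timers:    setTimeout(0)'), 0);
    setImmediate(() => console.log('3. check:     setImmediate'));
    Promise.resolve().then(() => console.log('2. microtask: promise reaction'));
    process.nextTick(() => console.log('1. nextTick'));
});
// Expected output order: 1, 2, 3, 4. The nextTick queue drains first, then the
// microtask queue, then the loop advances to the check phase (setImmediate), and
// only on the next pass through the timers phase does setTimeout(0) fire.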

#!/usr/bin/env python3
# event_loop_blocker_demo.py — measure how a synchronous CPU stretch in Node
# blocks every other connection in the process. We use Python as the load
# generator and parser, invoking node and wrk2 via subprocess.
import subprocess, json, os, sys, tempfile, textwrap, time, signal, urllib.request

NODE_SERVER = textwrap.dedent("""
    const http = require('http');
    const crypto = require('crypto');

    // A 240 KB JSON blob — roughly the size of one Zerodha market snapshot.
    const big = JSON.stringify({tick: Array.from({length: 6000}, (_, i) =>
        ({sym: 'NSE_'+i, ltp: 1000+i*0.5, vol: i*7})) });

    const server = http.createServer((req, res) => {
        if (req.url === '/fast') {
            res.end('ok');                          // ~50 microseconds of CPU
        } else if (req.url === '/slow') {
            JSON.parse(big);                        // ~70 ms of CPU on V8
            res.end('parsed');
        } else if (req.url === '/heartbeat') {
            res.end(String(Date.now()));            // measures loop responsiveness
        } else {
            res.end('?');
        }
    });
    server.listen(7531, () => console.log('listening 7531'));
""")

def run():
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False) as f:
        f.write(NODE_SERVER); path = f.name
    proc = subprocess.Popen(['node', path], preexec_fn=os.setsid,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(0.4)                                     # let server bind

    # Warmup so V8 has tiered up the handlers.
    for _ in range(50): urllib.request.urlopen('http://127.0.0.1:7531/fast').read()

    def heartbeat_latency_ms():
        t0 = time.perf_counter()
        urllib.request.urlopen('http://127.0.0.1:7531/heartbeat').read()
        return (time.perf_counter() - t0) * 1000

    print(f"  baseline heartbeat:        {heartbeat_latency_ms():>6.2f} ms")

    # Hammer /slow from one connection while measuring /heartbeat from another.
    blocker = subprocess.Popen(['bash', '-c',
        'while true; do curl -s http://127.0.0.1:7531/slow > /dev/null; done'])
    time.sleep(0.2)                                     # let the blocker start
    samples = [heartbeat_latency_ms() for _ in range(30)]
    blocker.terminate(); blocker.wait()

    samples.sort()
    print(f"  while /slow runs — p50:    {samples[15]:>6.2f} ms")
    print(f"                     p99:    {samples[-1]:>6.2f} ms")
    print(f"                     max:    {max(samples):>6.2f} ms")

    os.killpg(os.getpgid(proc.pid), signal.SIGTERM); proc.wait()

if __name__ == "__main__":
    run()

Sample run on a c6i.2xlarge (8 vCPU, Ice Lake, Node v22.10.0):

  baseline heartbeat:          0.74 ms
  while /slow runs — p50:     71.83 ms
                     p99:    138.91 ms
                     max:    142.06 ms

Walking the key lines. JSON.parse(big) is one synchronous JavaScript built-in. While V8 is parsing the 240 KB blob (about 70 ms on this CPU), no other JavaScript can run in the process — not the /heartbeat handler, not the /fast handler, not even a setImmediate callback. heartbeat_latency_ms() measures the round-trip time for a tiny endpoint that should respond in under 1 ms. The baseline is 0.74 ms; under the /slow hammer the p50 jumps to 71 ms — every heartbeat that arrives is queued behind one in-flight JSON.parse, waits for it to finish, then runs. The p99 of 139 ms corresponds to a heartbeat that arrived when one JSON.parse had just started and another one was already queued behind it — so the heartbeat waits for two full parses. UV_THREADPOOL_SIZE is irrelevant here: JSON.parse does not touch the libuv pool; it runs on the event-loop thread. The only fix is to move the parse off the thread (worker_threads) or do it less often (cache the parsed object).

Why the heartbeat p50 is roughly one full parse-time rather than half: when a synchronous task blocks the loop for 70 ms, every request that arrives during those 70 ms accumulates in the kernel's socket-receive buffer, and libuv only drains the ready FDs once the parse finishes and the loop returns to the poll phase. A heartbeat therefore waits for the remainder of the in-flight parse plus whatever queued work (usually the blocker's next /slow request) gets serviced ahead of it. Because the blocker keeps the queue replenished, that sum comes out near one full parse duration for the median sample, and near two full parses at the tail.

This is the model you carry around with you when reasoning about Node performance. Every synchronous millisecond on the event-loop thread is a millisecond of head-of-line blocking for every other connection. Database queries don't block — their callbacks fire asynchronously. HTTP outbound requests don't block. File reads don't block (they go to the libuv pool). But JSON.parse(big), JSON.stringify(big), Buffer.from(s, 'base64') on a 4 MB string, a regex with catastrophic backtracking, a deep object clone, a synchronous template render, a tight loop summing an array of 50,000 numbers — these all block.
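If you want this model as a production signal rather than a mental exercise, Node's perf_hooks module can measure the loop's own scheduling delay; a minimal sketch (the 20 ms threshold and 5 s reporting interval are arbitrary choices, not values from the demos above):

// loop_lag_monitor.js: log when the event loop's p99 scheduling delay spikes.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 10 });   // sample every 10 ms
histogram.enable();

setInterval(() => {
    const p99ms = histogram.percentile(99) / 1e6;              // histogram values are nanoseconds
    if (p99ms > 20) {
        console.warn(`event-loop p99 delay ${p99ms.toFixed(1)} ms: synchronous work is hogging the thread`);
    }
    histogram.reset();
}, 5_000);

This is the in-process counterpart of the /heartbeat probe in the demo above.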

V8's compilation pipeline — Ignition, Sparkplug, Maglev, TurboFan

The JavaScript you ship to production does not run as JavaScript. V8 compiles it through four tiers, each one slower to produce but faster to execute. Understanding the tiers is what lets you read a --prof profile or a --trace-opt log without panicking, and what lets you understand why your service's first 30 seconds after a deploy are 6× slower than the steady state.

The four tiers as of V8 12.x (Node 22):

  1. Ignition — register-based bytecode interpreter. Every function starts here. Cheap to "compile" (just lower JS to bytecode). Slow to execute — about 5–10× slower than optimised native.
  2. Sparkplug — non-optimising baseline JIT. Translates Ignition bytecode 1:1 to machine code with no inlining and minimal type assumptions. About 1.5× faster than Ignition. Tier-up triggers after a function has run a few times.
  3. Maglev — mid-tier optimising compiler (added in V8 11.7, default in Node 21+). Inlines small callees, specialises on observed types via inline-cache feedback. About 4–5× faster than Sparkplug. Tier-up after the function is genuinely hot.
  4. TurboFan — top-tier optimising compiler. Whole-function SSA optimisation, escape analysis, aggressive inlining, type speculation with deoptimisation guards. About 6–10× faster than Sparkplug. Tier-up after the function is very hot or has been Maglev-running for a while.

The tier-up logic uses call counts and execution counters maintained per function. A function called 1,000 times with consistent argument shapes will get to TurboFan within a few hundred milliseconds of becoming hot. The same function called with shape-shifting arguments (fn(1), then fn("two"), then fn({k: 3})) will be deoptimised — TurboFan throws away its compiled code and falls back to a lower tier — because its type speculation guards failed.

#!/usr/bin/env python3
# v8_tiering_demo.py — observe V8 tiering up a hot function via --trace-opt.
import subprocess, tempfile, textwrap, re, sys

NODE_PROG = textwrap.dedent("""
    function hotAdd(a, b) { return a + b; }
    let s = 0;
    // 250,000 calls with monomorphic int arguments — guaranteed tier-up.
    for (let i = 0; i < 250000; i++) s += hotAdd(i, i+1);
    console.log('sum=', s);
""")

def run():
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False) as f:
        f.write(NODE_PROG); path = f.name

    result = subprocess.run(
        ['node', '--trace-opt', '--trace-deopt', path],
        capture_output=True, text=True, timeout=20)
    # Trace lines can land on either stream depending on the Node build; scan both.
    out = result.stderr.splitlines() + result.stdout.splitlines()

    tier_events = []
    for line in out:
        m = re.search(r'\[(compiling|optimizing|completed (?:optimizing|compiling))[^]]*\][^a-zA-Z]*([a-zA-Z_$][\w$]*)', line)
        if m and 'hotAdd' in line:
            tier = re.search(r'using\s+(\w+)', line)
            tier_events.append((m.group(1), tier.group(1) if tier else '?'))
    seen = set(); ordered = []
    for ev in tier_events:
        if ev not in seen:
            seen.add(ev); ordered.append(ev)

    print('hotAdd compilation tier-ups:')
    for action, tier in ordered:
        print(f'  {action:<28s} tier={tier}')

if __name__ == "__main__":
    run()

Sample run on Node v22.10.0:

hotAdd compilation tier-ups:
  compiling                    tier=Sparkplug
  optimizing                   tier=Maglev
  completed optimizing         tier=Maglev
  optimizing                   tier=TurboFan
  completed optimizing         tier=TurboFan

Walking the key lines. hotAdd(a, b) is monomorphic — every call site passes two small integers (V8's Smi representation, 31 bits + tag). V8 sees this consistency in the inline-cache feedback collected during the Ignition and Sparkplug runs, and is willing to speculate that a + b will always be Smi + Smi. The first tier-up is to Sparkplug, the non-optimising baseline JIT, after a few hundred calls. The second tier-up is to Maglev, which inlines the + operator's Smi fast path directly. The third tier-up is to TurboFan, which performs full SSA optimisation, eliminating the function-call overhead entirely if hotAdd gets inlined into the loop body. The whole sequence happens in roughly 80 ms of wall-clock time. If you change the loop to s += hotAdd(i, "x") halfway through, you would see a [deoptimizing] line — TurboFan's Smi + Smi guard fails on the string, the compiled code is discarded, and execution falls back to Sparkplug while V8 collects new feedback.

Why this matters operationally: a freshly-deployed Node service spends the first 5–30 seconds in Ignition and Sparkplug for most code paths. Throughput during this window is 3–8× lower than steady state, and tail latency is dominated by tier-up pauses (50–500 µs each, hundreds of them per second on a busy service). Capacity-planning load tests that don't include a warmup phase systematically over-estimate steady-state throughput. The fix is either (a) replay realistic traffic to the new instance for 30 seconds before adding it to the load balancer, or (b) use V8's --no-opt flag in capacity tests, which disables the optimising tiers and gives you a worst-case baseline. Razorpay's payment-gateway team adds a 45-second --warmup-traffic step to their canary rollout, which dropped post-deploy 5xx spikes from 0.4% to 0.02%.

The deoptimisation trap is worth its own paragraph. TurboFan compiles your function under specific type assumptions: "the second argument is always a Smi", "this property access always finds the property on the prototype chain at depth 1", "this object always has shape {x: Smi, y: Smi}". When an assumption fails, the compiled code aborts — control jumps back to the interpreter, the optimised version is thrown away, and the function may or may not be re-optimised later (with new guards). A common pattern that triggers this: a hot function that handles a null value 1-in-10,000 calls. The 9,999 normal calls run in TurboFan; the 10,000th deopts; TurboFan recompiles 10 ms later; the cycle repeats; net throughput drops by 30%. The fix is if (x == null) return defaultValue; ... rest of hot path — make the null check the first line so V8 specialises on the post-check shape.
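The deopt and re-optimise cycle is easy to watch directly; the sketch below reuses the hotAdd shape from the tiering demo above (exact trace wording varies across Node versions):

// deopt_cycle.js: warm a function on integer arguments, then break its guard.
// Run: node --trace-opt --trace-deopt deopt_cycle.js
function hotAdd(a, b) { return a + b; }

let s = 0;
for (let i = 0; i < 200_000; i++) s += hotAdd(i, i + 1);   // monomorphic Smi feedback: tiers up
console.log(hotAdd(1, 'x'));                               // the Smi + Smi guard fails: deopt
for (let i = 0; i < 200_000; i++) s += hotAdd(i, i + 1);   // later re-optimised with wider guards
console.log('sum =', s);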

The worker_threads escape — real OS threads, isolated V8 contexts

worker_threads (added in Node 10.5, stable since 12) is the only way to run JavaScript on more than one core inside the same Node process. A worker is a real OS thread with its own V8 isolate (its own heap, its own GC, its own event loop), launched from a script file or an inline module. Workers communicate with the main thread via MessagePort (structured-clone serialisation) or SharedArrayBuffer (direct shared memory for typed arrays).

#!/usr/bin/env python3
# worker_threads_demo.py — show worker_threads scaling on a CPU-bound workload.
# We pick a deliberately CPU-bound task (1M sha256 hashes) and run it on the
# main thread, then sharded across a worker pool, then time both.
import subprocess, tempfile, textwrap, time, os, sys

WORKER_SCRIPT = textwrap.dedent("""
    const { parentPort, workerData } = require('worker_threads');
    const { createHash } = require('crypto');
    const { count, seed } = workerData;
    let acc = 0n;
    for (let i = 0; i < count; i++) {
        const h = createHash('sha256');
        h.update(`${seed}-${i}`);
        const buf = h.digest();
        acc += BigInt(buf[0]);
    }
    parentPort.postMessage(acc.toString());
""")

MAIN_SCRIPT = textwrap.dedent("""
    const { Worker } = require('worker_threads');
    const { createHash } = require('crypto');
    const path = process.argv[2];
    const totalWork = 1_000_000;
    const numWorkers = parseInt(process.argv[3], 10);

    function singleThread() {
        const t0 = process.hrtime.bigint();
        let acc = 0n;
        for (let i = 0; i < totalWork; i++) {
            const h = createHash('sha256'); h.update(`x-${i}`);
            acc += BigInt(h.digest()[0]);
        }
        const ms = Number(process.hrtime.bigint() - t0) / 1e6;
        console.log(`single-thread:        ${ms.toFixed(1)} ms   acc=${acc}`);
    }

    function workerPool(n) {
        return new Promise((resolve) => {
            const t0 = process.hrtime.bigint();
            const each = Math.floor(totalWork / n);
            let done = 0;
            for (let w = 0; w < n; w++) {
                const wk = new Worker(path, { workerData: { count: each, seed: `x-${w*each}` } });
                wk.on('message', () => { if (++done === n) {
                    const ms = Number(process.hrtime.bigint() - t0) / 1e6;
                    console.log(`workers=${n}:           ${ms.toFixed(1)} ms`);
                    resolve();
                } });
            }
        });
    }

    (async () => { singleThread(); await workerPool(numWorkers); })();
""")

def run():
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False) as f:
        f.write(WORKER_SCRIPT); worker_path = f.name
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False) as f:
        f.write(MAIN_SCRIPT); main_path = f.name
    for n in (2, 4, 8):
        print(f'\n=== {n} workers ===')
        subprocess.run(['node', main_path, worker_path, str(n)], check=True)

if __name__ == "__main__": run()

Sample run on a c6i.2xlarge (8 vCPU):

=== 2 workers ===
single-thread:        4128.4 ms   acc=...
workers=2:            2156.7 ms

=== 4 workers ===
single-thread:        4131.2 ms
workers=4:            1119.3 ms

=== 8 workers ===
single-thread:        4129.8 ms
workers=8:             637.4 ms

Walking the key lines. createHash('sha256') is a CPU-bound operation that, perhaps surprisingly, runs on the event-loop thread when called synchronously like this (the libuv-pool path is crypto.pbkdf2, not createHash). The single-thread baseline of 4.13 s therefore monopolises one core for 4 seconds. new Worker(path, {workerData}) spawns a real OS thread with its own V8 isolate and runs the script worker_path in it. Each worker computes its shard and posts the result back via parentPort.postMessage. With 2 workers we get 1.92× speedup, with 4 workers 3.69×, with 8 workers 6.48× — close to ideal linear scaling, with the gap eaten by worker startup (each new Worker costs about 8–15 ms to spawn — V8 isolate init, Node bootstrap), the structured-clone serialisation of the result message, and OS scheduling overhead. For tasks where per-shard compute is over a few hundred milliseconds, worker_threads is the right answer. For tasks under ~10 ms, the spawn cost dominates and you should pre-spawn a worker pool at startup (piscina is the canonical npm package for this).

Why workers are not threads in the C/Java sense: a worker_threads.Worker is a complete second V8 isolate. Each isolate has its own heap (typically 32 MB minimum overhead, often 100+ MB with libraries loaded), its own GC, its own JIT cache, its own event loop. (The libuv thread pool, by contrast, is a single process-wide pool shared by the main thread and every worker.) Posting a message between workers serialises the value through structured-clone — the data is copied. The two exceptions are Transferables (the buffer is moved, not copied — original becomes unusable) and SharedArrayBuffer (a contiguous byte buffer that genuinely shares memory across isolates, with Atomics for synchronisation). For sharing a 200 MB ML feature matrix across workers without copying, SharedArrayBuffer is the pattern; for posting a result object, structured-clone is fine.

[Figure: Worker thread scaling on a CPU-bound workload. worker_threads scaling for the sha256 demo on a c6i.2xlarge (8 vCPU): speedup vs worker count from 1 to 16 against an ideal-linear reference, rising near-linearly to about 6.5× at 8 workers and flattening to about 9.8× at 16 (diminishing returns past the physical core count due to hyperthreading and memory bandwidth). Illustrative, not measured data; per-isolate startup cost is roughly 10 ms.]
Worker scaling on the sha256 workload. Near-linear up to the physical core count (8 on this part), then flattening — the additional 8 hyperthreads share execution units with the first 8 and only deliver about 1.5× more, not 2×. Illustrative — not measured data; your numbers depend on workload characteristics, especially memory bandwidth pressure.
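The SharedArrayBuffer path mentioned above deserves a concrete shape. A minimal sketch, written as one self-re-entrant script; the sqrt fill and the worker count are placeholders, not the feature-matrix example from the text:

// sab_fill.js: share one Float64Array across 4 workers without copying.
const { Worker, isMainThread, workerData } = require('worker_threads');

const N = 1_000_000, WORKERS = 4;

if (isMainThread) {
    const sab = new SharedArrayBuffer(N * 8);        // one allocation, visible to every isolate
    const flags = new SharedArrayBuffer(4);
    const data = new Float64Array(sab);
    const done = new Int32Array(flags);
    let exited = 0;
    for (let w = 0; w < WORKERS; w++) {
        const worker = new Worker(__filename, { workerData: { sab, flags, w } });
        worker.on('exit', () => {
            if (++exited === WORKERS) {
                console.log(`workers finished: ${Atomics.load(done, 0)}, data[12345] = ${data[12345]}`);
            }
        });
    }
} else {
    const { sab, flags, w } = workerData;
    const data = new Float64Array(sab);              // same memory as the main thread, no copy
    const done = new Int32Array(flags);
    const lo = (N / WORKERS) * w, hi = lo + N / WORKERS;
    for (let i = lo; i < hi; i++) data[i] = Math.sqrt(i);
    Atomics.add(done, 0, 1);                         // atomic completion counter
}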

The operational trade-off: workers add real complexity. Each worker is a process-like boundary inside one process — separate require cache, separate console, separate global state. Posting messages costs serialisation. Sharing memory via SharedArrayBuffer requires manual Atomics.wait/Atomics.notify synchronisation, which is harder to get right than pthread_mutex_lock. For most Node services, the answer is "don't add worker_threads; just spawn more Node processes via cluster or your container orchestrator". Workers are the right answer specifically when (a) you have a clearly CPU-bound subtask (image transcoding, JSON canonicalisation, hashing, ML inference) that is occasional rather than constant, (b) the data to share is large enough that IPC between processes would dominate, and (c) you're willing to maintain the structured-clone or shared-buffer plumbing.
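A sketch of the pre-spawned pool shape with piscina; the task file name and the JSON.parse task are stand-ins for whatever CPU-bound work your handler offloads:

// gateway_pool.js: pay the worker spawn cost once at start-up, not per request.
const path = require('path');
const Piscina = require('piscina');

const pool = new Piscina({
    filename: path.resolve(__dirname, 'parse_snapshot.js'),   // hypothetical task file, see below
    maxThreads: 4,                                             // leave cores free for the event loop
});

async function handleOrder(rawSnapshot) {
    // The expensive parse runs on a pool thread; the event loop keeps serving other requests.
    const snapshot = await pool.run(rawSnapshot);
    return snapshot;
}

// parse_snapshot.js would contain just:
//   module.exports = (raw) => JSON.parse(raw);
// The parse still costs the same CPU somewhere, and the parsed object is structured-cloned
// back to the main thread; the win is that other requests no longer queue behind it.

module.exports = { handleOrder };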

Common confusions

  • "Node.js is single-threaded." Wrong on a technicality that hides the model. Node has one JavaScript thread per isolate, plus libuv's pool of 4 worker threads (for fs, DNS, pbkdf2, zlib), plus N user-spawned worker threads via worker_threads, plus V8's GC threads, plus V8's compiler threads (Maglev/TurboFan compile concurrently with execution). A typical Node process has 8–12 OS threads. The constraint is that only one of them runs your JavaScript.
  • "async/await makes my code parallel." No — async/await is sugar over Promises, which are sugar over callbacks, which all run on the same one event-loop thread. await fetch(url) lets the thread do other work while waiting on I/O; it does not run two computations in parallel. If both branches of Promise.all([heavy1(), heavy2()]) are CPU-bound, they take the sum of their times, not the max.
  • "setImmediate is the same as setTimeout(..., 0)." Different phases of the loop. setImmediate fires in the check phase; setTimeout(0) fires in the timers phase (with a minimum 1 ms delay enforced by libuv). Inside an I/O callback, setImmediate always runs before the next setTimeout(0); outside one, the order depends on whether the event loop has reached the timers phase yet — and is genuinely non-deterministic.
  • "process.nextTick is a microtask." Close but not quite. nextTick is a separate Node-specific queue drained before the microtask queue (Promise reactions). Both drain after every macrotask. The practical effect is that nextTick callbacks can starve the event loop if they queue more nextTick callbacks recursively — the loop never advances to the next phase. Promise microtasks have the same problem, but nextTick runs first and starves harder.
  • "Increasing UV_THREADPOOL_SIZE makes my service faster." Only if you are bottlenecked on operations that use the libuv pool (fs.readFile, crypto.pbkdf2, DNS via getaddrinfo, zlib). Bumping it from 4 to 64 will not help your JSON.parse problem at all, because JSON.parse doesn't use the pool. Bumping it past os.cpus().length is usually counterproductive — the threads contend for cores.
  • "V8's GC pauses are the reason for tail latency in Node." Sometimes, but rarely the dominant cause. V8's Orinoco GC is generational and largely concurrent — young-gen scavenges typically pause 0.5–2 ms, full mark-sweep 5–20 ms on a 500 MB heap. A single 70 ms JSON.parse blocks longer than 10 GC pauses combined. Profile before blaming GC.

Going deeper

V8's hidden classes and inline caches — why {x:1, y:2} and {y:2, x:1} are not the same shape

V8 represents every JavaScript object as an instance of a hidden class (called a Map in V8 source, unrelated to the JS Map type) — a runtime-built description of the object's shape. Two objects with the same property names added in the same order share the same hidden class. Two objects with the same property names added in different order have different hidden classes — and code that accesses them is megamorphic and falls off the fast path. Inline caches at each property-access site cache the hidden class → offset mapping; once a site has seen more than 4 different hidden classes it gives up and falls back to the generic dictionary lookup, which is 5–20× slower.

The performance pattern this enforces: always initialise object properties in the same order, ideally in the constructor, ideally with null placeholders for fields you'll set later. let o = {}; o.x = 1; o.y = 2; and let o = {x: 1, y: 2}; produce different hidden-class chains. delete o.x is the single most expensive thing you can do to an object — it forces a transition to dictionary mode that V8 will never undo. The Mongoose ODM team famously rewrote their model constructors in 2019 to enforce property order, dropping query response times by 18% on the same workload.
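You can check shape sharing on your own build with V8's natives syntax; %HaveSameMap is a debugging intrinsic, not a stable API, so treat the sketch below as version-dependent:

// shapes.js: run with  node --allow-natives-syntax shapes.js
const a = { x: 1, y: 2 };
const b = { x: 3, y: 4 };          // same property names, added in the same order
const c = { y: 2, x: 1 };          // same names, different order

console.log('a vs b:', %HaveSameMap(a, b));   // expected: true,  shared hidden class
console.log('a vs c:', %HaveSameMap(a, c));   // expected: false, order defines the shape

delete b.x;                        // transition to dictionary mode, never undone
console.log('a vs b after delete:', %HaveSameMap(a, b));   // expected: false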

libuv internals — the watcher model and the I/O multiplexer abstraction

libuv's API is a set of handles (long-lived objects: uv_tcp_t, uv_timer_t) and requests (one-shot operations: uv_write_t, uv_fs_t). Each handle registers with the event loop and gets a callback when its event fires. On Linux, libuv uses epoll (edge-triggered for sockets, level-triggered for some pipes); on macOS/BSD, kqueue; on Windows, IOCP (which has a fundamentally different async model — IOCP completes operations, while epoll signals readiness — libuv normalises this difference at considerable internal cost). The thread pool is implemented in uv_threadpool.c and is a textbook producer-consumer queue with a pthread_cond_wait for idle workers. Reading the libuv source (about 60K LOC of C) is the single best way to understand what your Node program is actually doing — the internal vocabulary translates 1:1 to the operational behaviour you observe in production.

cluster vs worker_threads vs child_process — which IPC primitive fits which problem

Node has three multi-threading-or-multi-processing escapes, each with a different cost shape. child_process.spawn forks a separate Node (or arbitrary binary) process — full isolation, IPC via pipes or sockets, no shared memory, ~30 ms startup. cluster is child_process.fork plus shared-port logic (the master process accepts connections and round-robins them to workers); same cost as child_process plus the extra Node startup. worker_threads is a thread inside the same process — ~10 ms startup, MessagePort or SharedArrayBuffer for IPC, shared file descriptors. The practical decision tree: for a long-running CPU-bound subprocess (image transcode worker, ML inference server), use child_process with a queue — failure isolation is worth the IPC cost. For HTTP serving with N processes per box, use cluster (or just N container replicas behind a load balancer — increasingly the answer at companies like Razorpay). For short-lived CPU bursts inside a request, use a pre-spawned piscina worker pool. Mixing all three is the architecture you regret in 18 months.
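For completeness, the cluster shape the decision tree refers to; a minimal sketch (the port and the fork-on-exit policy are placeholder choices):

// cluster_sketch.js: N processes behind one listening port, round-robined by the primary.
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isPrimary) {
    for (let i = 0; i < os.cpus().length; i++) cluster.fork();
    cluster.on('exit', (worker) => {
        console.log(`worker ${worker.process.pid} died, forking a replacement`);
        cluster.fork();            // crude self-healing; real deployments back off and alert
    });
} else {
    http.createServer((req, res) => {
        res.end(`handled by pid ${process.pid}\n`);   // each worker is a full Node process with its own loop
    }).listen(7532);
}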

Reading a --prof profile — what to look for first

Node's built-in profiler (node --prof app.js then node --prof-process isolate-*.log > out.txt) produces a tick-sampled CPU profile. The output is grouped by category: [Shared libraries] (libc, OpenSSL, libuv), [JavaScript] (your code, broken down by Optimized/Unoptimized/Builtin), [C++] (V8 internals), [GC] (garbage collector). The numbers to look at first: total [GC] percentage (if > 5%, you have a heap-pressure problem), the top three [JavaScript] Optimized entries (these are your hot code paths — confirm they're things you expected to be hot), and the top three Unoptimized entries (these are functions V8 wanted to optimise but couldn't — usually due to deoptimisation triggers or polymorphism). For production profiling without restarting, use 0x (npm package, samples a running pid via --prof-process machinery) or clinic.js flame.

Reproduce this on your laptop

# Install Node 22 and the Python packages used by the demo drivers
brew install node@22                        # macOS — or your OS's package manager
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip                   # the demos use only stdlib + curl

# The three artefacts in this article
python3 event_loop_blocker_demo.py          # heartbeat latency under load
python3 v8_tiering_demo.py                  # Sparkplug → Maglev → TurboFan trace
python3 worker_threads_demo.py              # speedup with N workers

# Inspect a real V8 profile
node --prof your_app.js                     # produces isolate-*.log
node --prof-process isolate-*.log > prof.txt
less prof.txt                               # look at [JavaScript], [GC] sections

You should see heartbeat p99 of 80–150 ms while /slow runs, a Sparkplug → Maglev → TurboFan tier-up sequence within ~80 ms of the loop starting, and worker_threads scaling near-linearly up to your machine's physical core count. If your machine has fewer than 4 cores, scale the worker counts down accordingly.

Where this leads next

The Node runtime is one of the three runtimes the curriculum profiles in detail. Comparing across them is how you build the cost model that lets you choose a runtime for a service rather than defaulting to "what the team knows".

The reader who finishes this chapter should be able to look at a Node service consuming one core on a 16-core box, identify the synchronous JavaScript bottleneck, decide whether to fix it in-process (cache the parsed result, switch to a streaming parser, move to worker_threads) or out-of-process (more replicas, cluster, container scale-out), and predict the cost of each option within an order of magnitude. That decision is what stands between "we paid for a 16-core box" and "we used a 16-core box".

The broader pattern that connects this chapter to the other runtime chapters: every runtime makes a trade between simplicity-of-mental-model and parallel scalability. Node picked simplicity-of-mental-model and ships an escape (worker_threads). Python picked the same and is finally shipping its escape (no-GIL 3.13t). The JVM picked parallelism and ships sophisticated tooling for the resulting complexity (G1, ZGC, JFR). Go picked parallelism and hides the complexity in the goroutine scheduler. None of these is the right answer for every workload; all of them are the wrong answer for some workload. Knowing which trade-off your runtime made is the first step in predicting where its performance cliffs will appear.

References