In short
Postgres uses process-per-connection: the postmaster fork()s a fresh backend process for every client that connects, the child spends its life serving that one client's queries, and it exits when the client disconnects. The model is conceptually clean — each backend is a single-threaded program with no concurrency to reason about, a segfault in one backend cannot corrupt another, and each can drop to a distinct OS user for privilege separation. It is also expensive: roughly 10 MB of resident memory and 20 ms of fork-and-initialise time per connection. At 100 simultaneous clients this is unremarkable. At 10,000 it is catastrophic — 100 GB of RAM burned on mostly-idle backends, process-table pressure, and ProcArrayLock contention that saturates the server before any real work gets done.
Event-loop servers flip the model. One process, one thread, thousands of open sockets. The thread sits in a tight loop: ask the kernel which sockets have data waiting, handle those, ask again. The primitive that makes this scale is epoll on Linux, kqueue on BSD and macOS, and io_uring on Linux 5.1+ — a kernel facility that maintains a watch-set of file descriptors and returns the ready subset in O(ready), not O(watched). PgBouncer, Redis, nginx, and HAProxy all use this pattern, and each routinely handles tens of thousands of concurrent clients per CPU core with per-connection overhead measured in hundreds of bytes.
Databases mostly do not, because the inside of query execution does blocking disk I/O. One slow SELECT that waits 5 ms for a buffer read would freeze the entire event loop and stall every other client. The pragmatic compromise is the hybrid: an event-loop front end accepts connections and parses wire-protocol frames, then hands work to a pool of worker threads that can block safely. This chapter walks both models, builds a toy event-loop TCP server in ~80 lines of Python using the standard-library selectors module, explains epoll / kqueue / io_uring at the syscall level, and compares the two architectures on a realistic 10,000-client web workload.
The C10K problem — and why the answer was a loop, not more threads
In 1999, Dan Kegel published The C10K Problem — a short web essay asking whether one commodity server could handle ten thousand concurrent clients. At the time the answer was effectively no. Every production server opened one thread or process per client, the kernel's select() syscall scaled linearly in the watched-socket count, and context-switch overhead above a few thousand threads destroyed throughput. Serving 10k clients meant a cluster of machines each handling a few hundred.
The essay enumerated workarounds and concluded that the real fix needed a new kernel primitive: a readiness-notification mechanism whose cost scaled with the number of active file descriptors, not the total watched. Over the next few years that primitive landed in every major kernel: kqueue in FreeBSD 4.1 (2000), epoll in Linux 2.6 (2003), IOCP in Windows NT, and eventually io_uring in Linux 5.1 (2019).
With those primitives in place the architectural shift was small — a single while True: loop around a readiness call — but the performance shift was four orders of magnitude. nginx, Redis, HAProxy, Node.js, and PgBouncer are all direct descendants of this shift. Postgres, notably, is not — and this chapter is largely about why.
The process-per-connection model (Postgres)
Open a connection to Postgres. The postmaster — a single supervisor listening on port 5432 — accepts the TCP connection, reads the StartupMessage, validates credentials, and calls fork(). The child inherits the socket, loads system catalogs into its private memory, and reaches its first ReadyForQuery. It then runs a single-threaded command loop — read wire-protocol message, plan and execute, write response, wait — until the client disconnects, at which point the backend exits.
This choice was made in the mid-1980s for Ingres (Postgres's ancestor). The reasons still make sense today.
Historical context. In 1986, fork() was the dominant UNIX concurrency primitive; pthreads did not arrive until 1995. A database that wanted concurrent clients on UNIX picked fork() almost by default. Why: the alternative — single-threaded multiplexing — required OS primitives (select() scaling, non-blocking disk I/O) that did not exist robustly for another fifteen years.
Fault isolation. A buggy backend that segfaults takes down exactly one connection. The postmaster notices, logs it, and in the paranoid case restarts the whole cluster (because shared memory may be corrupted). In a threaded model a single bad pointer dereference corrupts all clients' state.
Simplicity of the execution engine. Inside a backend, there is no concurrency. The executor, planner, and storage layer are all single-threaded. No locks on hot data structures, no atomics on counters. Every Postgres developer writing executor code can assume a sequential world — this alone has prevented an enormous class of bugs over forty years.
Security and privilege separation. Each backend can run with distinct OS credentials via SET SESSION AUTHORIZATION. Combined with process-level isolation, this enables multi-tenant deployments where a backend compromise contains the blast radius to one process.
The cost is concrete and large.
| Resource | Per-connection cost | 1,000 connections | 10,000 connections |
|---|---|---|---|
| Resident memory | ~10 MB | 10 GB | 100 GB |
| Fork + catalog warm-up | ~20 ms | — | — |
| Process table entry | 1 | 1,000 | 10,000 |
| File descriptors | ~10 | 10,000 | 100,000 |
| ProcArray slot (shared memory) | ~200 B | 200 KB | 2 MB + linear lock scan |
At 10,000 connections the RAM cost exceeds most server budgets, the process count presses against the default Linux pid_max (32,768), and snapshot acquisition becomes linear in backend count. Postgres becomes unusable above a few thousand connections, which is why connections must be pooled externally before they reach Postgres — the subject of chapter 63. This chapter is about what the pooler does that Postgres cannot.
The thread-per-connection model (MySQL, Oracle dedicated server)
Between fork()-per-connection and a full event loop sits an intermediate design: one OS thread per connection, sharing one process address space.
MySQL / InnoDB uses thread-per-connection by default. A thread is lighter than a process — creation takes microseconds, not 20 ms, and there is no catalog to reload. But each thread needs its own stack — 1 MB on Linux, 8 MB on macOS — because the execution path must tolerate deep call chains through parsers, optimisers, and storage engines. Why: you cannot shrink the stack arbitrarily because you do not know the worst-case depth of recursive operators like nested subqueries, and a stack overflow is a process-level crash.
| Scale | Thread count | Stack memory (1 MB × threads) |
|---|---|---|
| 100 clients | 100 threads | 100 MB |
| 1,000 clients | 1,000 threads | 1 GB |
| 10,000 clients | 10,000 threads | 10 GB |
Oracle Database in dedicated-server mode does the same. Oracle's shared-server mode (formerly MTS) is explicitly the hybrid we will describe later — a small pool of server processes plus a dispatcher. Postgres has never offered shared-server mode at all.
Thread-per-connection is cheaper than process-per-connection — perhaps 2-5× — but not the four-orders-of-magnitude improvement C10K requires. At 10,000 threads the scheduler itself becomes the bottleneck. Threads are not the answer to many clients. One event loop is.
The event-loop model (PgBouncer, Redis, nginx)
The event-loop architecture has three properties that together break the per-client cost model.
One process, one thread, many file descriptors. A single OS thread holds thousands of open sockets. No other threads, no fork() on connect. The process's memory footprint is roughly constant whether serving 100 or 100,000 clients — a connection is just a file descriptor plus a small state struct, typically 100-500 bytes.
Non-blocking I/O. Every socket is put into non-blocking mode with fcntl(fd, F_SETFL, O_NONBLOCK). A read() that would block returns -1 with errno == EAGAIN; a write() returns the number of bytes actually written. The thread never blocks on any one connection. Why: blocking on a single client would stall every other client sharing that thread — that is the defining constraint of event-loop design.
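The non-blocking semantics can be demonstrated in a few lines of standalone Python (not part of the chapter's server); a socketpair stands in for a TCP connection, and BlockingIOError is Python's surfacing of EAGAIN/EWOULDBLOCK:

```python
import socket

# A connected pair of sockets stands in for a client TCP connection.
a, b = socket.socketpair()
a.setblocking(False)        # equivalent to fcntl(fd, F_SETFL, O_NONBLOCK)

try:
    a.recv(4096)            # nothing queued yet
except BlockingIOError:     # EAGAIN: would have blocked
    print("would block")    # the event loop simply moves on to other fds

b.send(b"hello")
print(a.recv(4096))         # kernel has data now: returns immediately, prints b'hello'
a.close(); b.close()
```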
Kernel-assisted readiness notification. The thread does not busy-poll. It registers every socket with an epoll set (Linux) or kqueue (BSD/macOS) and calls epoll_wait(), which blocks until at least one watched socket has data or capacity. The kernel maintains the watch set internally and wakes the thread only when something actionable happens.
The canonical loop:
```
while True:
    ready_fds = epoll_wait(epfd, timeout)  # blocks until something is ready
    for fd in ready_fds:
        if fd is the listening socket:
            new_conn = accept(fd)
            set_nonblocking(new_conn)
            epoll_register(epfd, new_conn)
        elif fd has data to read:
            data = read(fd)                # returns immediately — kernel said ready
            handle_message(data)
        elif fd can accept a write:
            write_pending_buffer(fd)
```
The memory arithmetic at 10,000 connections is dramatic. Each socket costs roughly 100 bytes of kernel state plus 200-500 bytes of user-space state. Ten thousand connections consume about 5 MB of user-space memory total — small enough to fit in L2 cache. Compare to 100 GB for the Postgres equivalent.
A minimal event-loop server in Python
The Python standard library's selectors module is a portable wrapper around epoll, kqueue, and select() — it picks the best available primitive at import time. Here is a full echo server that handles thousands of clients on one thread. Run it with python server.py, then open many nc localhost 5000 connections.
```python
# server.py — event-loop TCP server in ~40 lines
import selectors, socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on macOS

def accept(listener):
    conn, addr = listener.accept()  # never blocks — kernel said ready
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle_client)
    print(f"accepted {addr}, total fds: {len(sel.get_map())}")

def handle_client(conn):
    try:
        data = conn.recv(4096)  # non-blocking, returns immediately
    except ConnectionResetError:
        data = b""
    if not data:
        sel.unregister(conn)
        conn.close()
        return
    conn.send(b"echo: " + data)  # non-blocking; assume small response

def main():
    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", 5000))
    listener.listen(128)
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, accept)
    while True:
        for key, _mask in sel.select(timeout=1):
            callback = key.data
            callback(key.fileobj)

if __name__ == "__main__":
    main()
```
Thirty-eight lines give you a server that accepts five thousand simultaneous TCP connections on a laptop. Per-connection memory is under a kilobyte. The loop spends almost all its time blocked inside sel.select() waiting for the kernel; idle CPU usage is effectively zero. The same architecture with one-thread-per-connection would cap at maybe a thousand clients; with fork(), a hundred.
Three properties are worth examining closely.
The while True: is the whole program. No other thread, no signal handler, no background worker. Every piece of I/O work happens in one call stack rooted in the select loop. Why: single-threaded means no locks, no atomics, no cache-line bouncing — you pay the cost of one CPU doing all the coordination, which is vastly cheaper than synchronising many.
Callbacks dispatch per-fd state. Each register() stores a data attribute — here, a function reference. Real servers store richer state: a per-connection buffer, a parsing state machine, a reference to a downstream server. This is how PgBouncer tracks which client is waiting on which server.
Non-blocking recv and send are load-bearing. Without setblocking(False), a recv on a socket registered-but-not-actually-ready could block and hang every other client. A write-side nuance: conn.send(data) can return fewer bytes than asked for if the kernel buffer is full. Real servers must register EVENT_WRITE, buffer the rest, and send on the next callback. The toy server above assumes small responses; production code does not.
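A minimal sketch of that write-side bookkeeping, using the same selectors module. The Conn class and helper names are illustrative, not from PgBouncer or any real server; a production loop would also handle BlockingIOError from send() itself:

```python
import selectors, socket

sel = selectors.DefaultSelector()

class Conn:
    """Per-fd state: the socket plus the outbound buffer the toy server omits."""
    def __init__(self, sock):
        self.sock = sock
        self.outbuf = b""

def queue_write(conn, data):
    """Buffer the data and ask the kernel to report when the socket is writable."""
    conn.outbuf += data
    sel.modify(conn.sock, selectors.EVENT_READ | selectors.EVENT_WRITE, conn)

def on_writable(conn):
    """Flush as much as the kernel will take; send() may accept fewer bytes."""
    sent = conn.sock.send(conn.outbuf)
    conn.outbuf = conn.outbuf[sent:]
    if not conn.outbuf:
        # Drained: stop watching for writability, or the loop spins hot
        # because a socket with buffer space is almost always "writable".
        sel.modify(conn.sock, selectors.EVENT_READ, conn)
```

The key design point is the last line: EVENT_WRITE is registered only while bytes are pending, then dropped, which is exactly how real event-loop servers avoid busy-waking on idle connections.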
The OS primitives — epoll, kqueue, io_uring, IOCP
Four generations of kernel design on the same question: "of these 10,000 file descriptors, which ones have data now?"
select() and poll() — the old guard
select() (1983) takes three bitmasks indexed by fd, returns the ready subset, and scales O(N) in mask size. Hard-limited to FD_SETSIZE (1024 on Linux). poll() removes the 1024 limit but not the O(N) scan. Both recopy the full watch set from userland to kernel on every call. Unusable past a thousand connections; what Kegel was writing against.
epoll — Linux 2.6 (2003)
epoll splits the watch set from the ready check.
```c
epfd = epoll_create1(0);                   /* make a kernel-side set  */
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, …); /* add once per socket     */
ready = epoll_wait(epfd, events, 64, -1);  /* returns only ready fds  */
```
The kernel maintains the watch set internally — added fds persist across calls. epoll_wait returns only the ready subset, scaling O(ready) rather than O(watched). Two readiness modes: level-triggered (default) and edge-triggered (notifies once per state change, requires draining to EAGAIN). Edge-triggered is faster but unforgiving — forget to drain and the socket silently stops notifying.
This is the primitive behind nginx, PgBouncer, Redis, and every modern Linux server.
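The drain-to-EAGAIN discipline that edge-triggered mode demands looks the same in any language. A portable Python sketch, where BlockingIOError is how Python surfaces EAGAIN (the `drain` helper is illustrative, not a real API):

```python
import socket

def drain(sock, chunk=4096):
    """Read until the kernel reports EAGAIN. Under edge-triggered epoll
    this is mandatory: the next wakeup only arrives on a fresh transition
    to readable, so any bytes left behind are silently stranded."""
    chunks = []
    while True:
        try:
            data = sock.recv(chunk)
        except BlockingIOError:   # EAGAIN / EWOULDBLOCK: buffer is empty
            break
        if not data:              # orderly EOF from the peer
            break
        chunks.append(data)
    return b"".join(chunks)
```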
kqueue — FreeBSD 4.1 (2000), also macOS
Older than epoll and more general. A single kqueue can watch fds, file-system changes, signals, process exits, timers, and user-defined events. The kevent() syscall both adds filters and retrieves events in one call — more ergonomic than epoll's split API. Functionally equivalent at socket scale.
io_uring — Linux 5.1 (2019)
The fourth generation. Instead of a syscall per operation, io_uring establishes a pair of ring buffers shared between userland and kernel: the submission queue where userland writes I/O requests, and the completion queue where the kernel posts results. At steady state, kernel and userland poll the rings; whole batches of I/O happen without a syscall.
Two wins over epoll. First, batched submission — a server receiving a burst of 100 requests can queue 100 reads in one syscall (or zero, if the kernel polls the SQ). Second, real async I/O on regular files — something Linux has historically lacked. This matters for databases: io_uring is the first Linux primitive that lets a single thread issue a disk read and keep serving other clients.
io_uring support for nginx exists as experimental patches rather than in mainline; published benchmarks show io_uring outperforming epoll by 10-20% at 10k+ connections. Postgres's AIO work (led by Andres Freund) uses io_uring for buffer-pool reads.
IOCP — Windows NT (1993)
Windows' answer, predating Linux epoll by a decade. I/O Completion Ports are proactive: instead of "which FDs are ready?" you say "do this read, notify me when done". Conceptually closer to io_uring than to epoll.
The cross-platform pattern is clear: every mature kernel now has an O(ready)-scaling primitive, and user-space libraries (libuv, Python's selectors, Java's NIO, Go's netpoll) abstract over them.
Why databases are mostly process-per-connection anyway
If event loops are so good, why has Postgres not adopted one? Because query execution is full of blocking I/O that would freeze the loop.
Consider SELECT * FROM users WHERE id = 42. The executor descends a B-tree and calls BufferAlloc(relation, page_7812). If the page is in shared buffers, the call returns in under a microsecond. If not, the buffer manager issues pread(datafile_fd, …), blocking the process for 5-10 ms while the kernel fetches the page.
In process-per-connection Postgres this is fine — the backend is the only client on this process. In an event-loop Postgres, blocking on pread would stall every other client on the same thread. Ten thousand clients behind one disk I/O is a 10-second stall for everyone.
Three solutions, each with tradeoffs.
Async I/O to the disk layer. Rewrite the buffer manager to issue io_uring reads and return control to the loop, resuming the query when the completion arrives. Conceptually clean, brutal to implement — query execution is written as deep recursive trees (ExecInitNode, ExecProcNode) that assume synchronous calls. ScyllaDB did this for Cassandra; the rewrite took years.
Thread pool for blocking work. Event loop at the socket layer, worker pool for disk-bound execution. MySQL's threadpool plugin and CockroachDB's gateway do this. You get connection multiplexing without rewriting execution — at the cost of an extra thread hop per query.
Just use process-per-connection. Accept that the database is not the right place for the event loop, and put the event loop in a separate tier in front. This is what Postgres does and why PgBouncer exists. The pooler handles thousands of TCP connections; Postgres handles the tens of backends actually doing work.
The event-loop model is perfect for workloads where between-callback work is short and CPU-bound — proxies, in-memory stores, message routers. It is uncomfortable for workloads that block on disk or locks — OLTP databases with real storage. That distinction, not ideology, determines the architecture.
The hybrid model — front-end loop, worker pool
Modern database-adjacent systems have converged on the hybrid: an event-loop front end for connection handling and wire-protocol framing, plus a worker pool for blocking execution.
Scylla runs one thread per core. Each thread is an event loop (built on Seastar, using io_uring), handling network, disk, and coordination. Work that would block is reshaped as continuations — every potentially-blocking call returns a future. The "worker pool" in Scylla is implicit in the scheduler.
CockroachDB gateways run Go's runtime, itself a hybrid: goroutines look synchronous but are multiplexed over a handful of OS threads by netpoll. A blocking disk I/O parks the goroutine without parking its thread.
Cassandra's Native Protocol handler uses Netty, parsing CQL frames on event-loop threads and dispatching to a worker thread pool for actual execution.
ProxySQL and RDS Proxy are event-loop-only — no query execution, no worker pool. Structurally identical to PgBouncer.
The common shape: the connection count scales with the event-loop front end, the concurrent-execution count scales with the worker pool, and the two are decoupled. This is the answer C10K was asking for — not that one thread should do all the work, but that one thread should own coordination while specialised backends own execution.
Measured scaling numbers
Concrete numbers from realistic deployments.
Postgres bare at 5,000 connections. Works if max_connections = 5000 and you have the RAM. Resident memory: ~50 GB. CPU usage climbs even when most connections are idle, because every snapshot scans 5,000 ProcArray entries. Throughput under uniform OLTP degrades by 20-40% compared to 500 connections. Feasible, painful.
Postgres at 10,000 connections. Most deployments fail to start. RAM budget would be ~100 GB with zero throughput gain over 200 connections. Nobody runs this.
Postgres behind PgBouncer at 10,000 clients, 50 server backends. The production pattern. Postgres uses ~500 MB; PgBouncer uses ~20-50 MB. Sustained throughput is indistinguishable from running 50 connections directly — the event loop adds perhaps 100 microseconds per query, invisible against query latency in milliseconds.
Redis at 10,000 connections. Trivial. Redis is single-threaded with one event loop. Memory at 10k clients: ~20 MB of buffers plus data. Throughput saturates at about 100,000 ops/sec on one core.
nginx at 100,000 connections. Handled on a single worker process with worker_connections = 100000. Memory: ~200 MB, mostly buffers. nginx advertises this in marketing, and it is real.
The pattern: event-loop servers scale to tens of thousands of connections per process as a matter of routine; process-per-connection caps at a few hundred to a few thousand. Two to three orders of magnitude apart for the same network workload.
Capacity calculation for a 10,000-client web app
Assume a Django deployment with steady-state 10,000 connected web workers (across pods, gunicorn instances, whatever) and a peak query concurrency of 100 (ninety-nine percent of the 10,000 connections are idle at any moment).
Option A — direct connections to Postgres. Each of the 10,000 web workers holds an open Postgres connection. Cost: 10,000 × 10 MB = 100 GB of Postgres backend memory, plus process-table pressure and ProcArrayLock contention that collapses throughput. Requires max_connections = 10000+ which most servers refuse to start at. Unworkable on any server smaller than 128 GB RAM, and a bad idea even on 256 GB servers.
Option B — PgBouncer in transaction mode, 100 server connections. Ten thousand clients connect to PgBouncer; PgBouncer maintains 100 real connections to Postgres and multiplexes. Postgres memory: 100 × 10 MB = 1 GB. PgBouncer memory: 10,000 × 500 B ≈ 5 MB of client state, plus 100 × 1 KB of server state, plus the PgBouncer executable ≈ 20 MB total resident. Query concurrency of 100 is exactly what the server pool supports; idle clients cost the pooler nothing. Sustains the workload on a 4 GB server.
Ratio: Option B uses 1/100 of the Postgres memory for the same client-facing capacity. The event loop inside PgBouncer is what enables this — without it, the pooler would itself fall over at 10,000 connections.
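The arithmetic is easy to check mechanically. A small sketch using the chapter's per-connection figures (function names and defaults are illustrative):

```python
MB = 1024 * 1024

def direct_memory(clients, per_backend_mb=10):
    """Option A: every client holds a real Postgres backend."""
    return clients * per_backend_mb * MB

def pooled_memory(clients, server_conns, per_backend_mb=10,
                  client_state_bytes=500, server_state_bytes=1024):
    """Option B: cheap pooler state per client, real backends only for the pool."""
    pooler = clients * client_state_bytes + server_conns * server_state_bytes
    postgres = server_conns * per_backend_mb * MB
    return pooler, postgres

print(direct_memory(10_000) // (1024 * MB))  # 97 GiB — the chapter's "100 GB"
pooler, postgres = pooled_memory(10_000, 100)
print(pooler // MB, postgres // MB)          # a few MB of pooler state, 1000 MB of backends
```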
This is the pattern every production Postgres deployment converges on. The event-loop front end carries the connection count; the process-per-connection backend carries the execution. Both architectures play to their strengths.
Common confusions
"Event loops are single-threaded therefore slow." Single-threaded for coordination, not for work. A server runs its event loop on one thread and hands CPU-heavy work to a pool on other threads — the event loop only has to be fast enough to read bytes and dispatch. That dispatch costs microseconds.
"epoll is the same as threads." No. Epoll tells you which sockets are ready; threads let you process multiple in parallel. A 16-core server running one event-loop thread uses one core. To use all 16 cores, run 16 event-loop processes (SO_REUSEPORT lets them share a listener) or pair one event loop with a 15-thread worker pool. Epoll does not parallelise — it concentrates.
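The SO_REUSEPORT pattern mentioned above is a few lines per process. A sketch of the listener setup each event-loop process would run (the port and helper name are illustrative; the kernel load-balances incoming connections across all sockets bound this way on Linux 3.9+):

```python
import socket

def reuseport_listener(port):
    """One listener per event-loop process, all bound to the same port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    s.setblocking(False)   # ready to register with a selector loop
    return s

# Sixteen cores: launch sixteen processes, each calling reuseport_listener(5000)
# and running its own independent selector loop over the socket it gets back.
```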
"Postgres could just switch to an event loop." Not easily. The executor, planner, and storage layer are written assuming synchronous, blocking calls. Rewriting them to suspend on every potential I/O point is a decade of work. Postgres's own AIO effort is moving slowly. The commitment to fork() runs deep because the payoff is deep.
"io_uring is faster than epoll always." Not at low connection counts. Below a few thousand connections, both perform similarly because the bottleneck is network latency, not syscall overhead. io_uring wins at 10k+ connections and wins much harder when disk I/O is involved (it offers real async file I/O, which epoll cannot).
"Thousands of connections is normal." Normal at the load-balancer and proxy tier. Not normal at the database tier. When a deployment has 10k database connections, someone has skipped pooling, and the fix is not to scale the database up but to put a pooler in front.
Going deeper
Postgres AIO (Andres Freund's patchset)
The asynchronous-I/O work, authored primarily by Andres Freund, began landing with streaming reads in Postgres 17; io_uring-backed AIO for buffer-pool reads ships in Postgres 18, with a worker-process fallback on platforms without io_uring. It allows a backend to issue multiple overlapping page reads without blocking on each. Initial wins are for sequential scans and maintenance operations (VACUUM, bulk loads) where prefetching is dramatic. Long-term it sets up machinery for a future where Postgres backends could serve multiple clients each — a precondition for moving off process-per-connection.
Kernel-bypass networking — DPDK and SPDK
For millions of concurrent connections with microsecond tail latencies, even epoll and io_uring are too slow because they still cross the kernel TCP stack. DPDK and SPDK let userspace programs poll network and storage hardware directly, bypassing the kernel. ScyllaDB uses DPDK for high-end deployments; Redpanda uses it by default. The cost is dedicating CPU cores to constant polling and losing OS process isolation. The payoff is per-packet latencies in hundreds of nanoseconds.
The cost of a TCP connection at the kernel level
Each open TCP connection consumes a struct sock, send and receive buffers, and an ephemeral-port slot. On Linux, each costs 20-40 KB of kernel memory. At 100k connections per server, TCP alone consumes 2-4 GB of kernel RAM.
Worse, closing connections leaves TIME_WAIT entries that block port reuse for 60 seconds. A server closing 10,000 connections per second accumulates 600,000 TIME_WAIT entries, which can exhaust the ephemeral-port range (28,232 ports by default) and cause new outbound connections to fail. The fixes are SO_REUSEADDR, a wider net.ipv4.ip_local_port_range, and net.ipv4.tcp_tw_reuse; the failure mode is baffling in production until you have seen it once.
Where this leads next
Build 8 covered the physical transport and operational layer of a DBMS — wire protocols, connection pooling, and the event-loop architecture that makes pooling scale. Build 9 begins on replication — what happens when one Postgres is not enough and you need a second (or a hundredth) machine that agrees on the data. The next chapters cover the Write-Ahead Log as the replication substrate, streaming and logical replication, read-replica lag, and the synchronous/asynchronous tradeoff. The event-loop pattern will return there too: replication streams are themselves long-lived TCP connections that a backend must multiplex without blocking primary workloads.
References
- Dan Kegel, The C10K Problem — the 1999 essay that framed the problem and surveyed the kernel primitives that grew up in response. Still hosted at kegel.com; still the best single-page introduction.
- Linux manual pages, epoll(7) — the canonical specification of Linux epoll, including level-triggered versus edge-triggered semantics and the caveats around EPOLLET.
- Redis documentation, Redis event library architecture — how the single-threaded Redis event loop works, including the I/O-thread extensions added in Redis 6.
- nginx, Inside nginx: how we designed for performance and scale — the nginx authors describing their worker-process / event-loop architecture, with explicit discussion of why threads were rejected.
- Jens Axboe, Efficient IO with io_uring — the original io_uring design paper, from the primary author. Explains ring buffers, submission / completion semantics, and the performance wins over epoll + aio.
- Andres Freund, Asynchronous I/O in PostgreSQL — conference slides covering the Postgres AIO patchset, io_uring integration, and the long-term implications for Postgres's concurrency model.