The single-threaded Redis lesson

Aditi runs the session-cache tier for a Bengaluru ride-hailing app. Her team's first instinct, when traffic doubled in a quarter, was to rewrite their Redis-backed sticky-session layer in a multi-threaded language because "Redis is single-threaded and that's a bottleneck". They spent eleven weeks building a Java rewrite on top of a thread-per-core in-memory store. On launch day the new service did 380k ops/s before it fell over on lock contention; the old Redis box, on the same hardware, was doing 1.42 million ops/s the morning they decommissioned it. The rewrite was scrapped. The lesson — that "single-threaded" is not a slur and "more threads" is not a strategy — is what this chapter is about.

Redis serves more than a million ops per second from one CPU core because it eliminated almost every cost that multi-threaded in-memory stores spend most of their cycles on: lock acquisitions, cache-line bouncing, context switches, NUMA hops. The single-threaded design is not a limitation that survived despite the team's best efforts; it is the deliberate choice that lets the hot path be a tight loop over an epoll-driven event queue with no synchronisation primitives at all. The lesson generalises beyond Redis — every time you reach for more threads, ask first whether you have measured what they will cost you.

Why one core can outrun sixteen — the hot-path arithmetic

A modern x86 server core at 3.0 GHz executes roughly 3 billion cycles per second. A typical Redis GET or SET against an in-memory hash takes between 800 nanoseconds and 1.4 microseconds end-to-end, including the kernel TCP/IP path, the epoll wakeup, the command parse, the dictionary lookup, and the reply formatting. At 1 microsecond per op, one core can serve 1,000,000 ops/s before it saturates — and that is what redis-benchmark reports against a stock Redis 7 build on a c6i.4xlarge host with kernel-bypass networking disabled and standard TCP loopback.
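The arithmetic is worth doing explicitly. A minimal sketch in Python, using only the figures quoted above (nothing here is measured):

# hot_path_budget.py — back-of-envelope for the single-threaded ceiling.
# Inputs are the illustrative figures from the text, not measurements.
CLOCK_HZ = 3.0e9                              # 3.0 GHz server core
for ns_per_op in (800, 1000, 1400):           # end-to-end GET/SET cost range
    cycles = ns_per_op * 1e-9 * CLOCK_HZ
    print(f"{ns_per_op:>5} ns/op = {cycles:>5.0f} cycles/op "
          f"-> {1e9 / ns_per_op:>9,.0f} ops/s on one core")
# 800 ns -> 1,250,000 ops/s; 1,000 ns -> 1,000,000; 1,400 ns -> 714,286 —
# bracketing the ~1M ops/s that redis-benchmark reports.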

The naive multi-threaded version of the same workload runs into a wall the single-threaded version simply does not have. Imagine sixteen worker threads, each pulling commands off a shared queue, each touching a shared hash table protected by a pthread_mutex_t. Every SET involves: (1) acquire mutex — at minimum, an atomic compare-and-swap that serialises across cores via the LLC, ~20–80 ns under low contention, hundreds of ns under high contention; (2) update the dictionary — a few cache-line writes, but those lines may be dirty in another core's cache and require an MESI bounce, ~80–200 ns; (3) release mutex — another atomic, ~10 ns; (4) reply — but if the worker thread that handled SET is not the one that owns the client's TCP connection, the reply has to be enqueued for that thread, which adds another bounce. The 1.4-microsecond Redis hot path is replaced with a 4–8-microsecond hot path that scales sub-linearly — adding cores past the point where coherence traffic saturates the LLC reduces throughput outright.

[Figure: Per-op cycle budget, single-threaded vs naive multi-threaded. Two horizontal stacked bars compare the per-operation budget. Redis (1 thread, 1 core): epoll, parse, dict lookup, reply; ≈ 1,400 ns. Hash store (16 worker threads, 1 mutex): epoll, mutex acquire (CAS), parse, cache-line bounce + lookup, release, handoff, reply; ≈ 6,200 ns under contention (4–6× slower per op, scaling sub-linearly). The dark segments are costs that exist only because the design is multi-threaded. Illustrative, calibrated to redis-benchmark numbers on a c6i.4xlarge; not a measured profile of one specific run.]
The cost the multi-threaded design pays on every operation — mutex acquisition, MESI cache-line traffic between cores, and inter-thread reply handoff — is the cost the single-threaded design eliminated by construction. Sixteen cores fighting each other is slower than one core not fighting anyone.

Why the multi-threaded version scales sub-linearly: the shared mutex protecting the hash table is a single cache line that must be in modified state on whichever core holds it. Every CAS from another core forces an MESI invalidate, a remote-cache fetch, and a writeback — a transaction that takes 80–200 cycles on a single-socket Skylake-X part and 250–600 cycles on a dual-socket NUMA configuration. At 16 threads each issuing a CAS every microsecond, the cache line is bouncing 16 million times per second. The LLC's coherence bandwidth is the bottleneck long before any core's compute capacity is. This is why "throw more threads at it" produces less, not more, throughput past about 4 contended threads on this workload shape.
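A toy model makes the sub-linear scaling visible. This is an illustrative cost model built from the cycle costs quoted above, with one simplifying assumption (the cost of the serialised critical section grows linearly with the number of contending cores); it is not a simulation of a real lock:

# contention_model.py — why throughput stops scaling past a few contended
# threads. Illustrative model using the cycle costs from the text; the
# linear-growth assumption for coherence traffic is a simplification.
CLOCK_GHZ = 3.0
WORK_NS = 1400                       # useful per-op work: parse, lookup, reply

def critical_section_ns(threads: int) -> float:
    base = 20 + 140 + 10             # acquire + update + release, in cycles
    bounce = 140 * (threads - 1)     # one MESI bounce (~140 cycles) per contender
    return (base + bounce) / CLOCK_GHZ

for n in (1, 2, 4, 8, 16):
    cs = critical_section_ns(n)
    serial_cap = 1e9 / cs                      # critical section is serialised
    per_thread = 1e9 / (WORK_NS + cs)          # each thread's own ceiling
    total = min(per_thread * n, serial_cap)
    print(f"{n:>2} threads: critical section {cs:6.0f} ns, "
          f"throughput cap {total:>12,.0f} ops/s")
# Throughput rises to a peak around 4-8 threads, then falls as the serialised
# section dominates. The model is generous: it ignores the futex slow path
# (5,000-15,000 cycles per descheduled acquisition; see Going deeper), which
# under real contention pushes 16 threads below one uncontended core.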

The single-threaded design pays none of these costs because there is nothing to synchronise. The hash table is not "locked"; there is exactly one writer and exactly one reader, and they are the same thread. The dictionary entries do not bounce between cores because they live entirely on the core Redis is pinned to. The CPU's caches do exactly what they were designed to do: keep the working set hot in L1 and L2, never having to invalidate a line because of another core's write.

What Redis actually does — the event loop and the command dispatch

A single-threaded server can only beat sixteen worker threads if the single thread is never blocked. Redis achieves this with a tight event loop built on top of epoll (Linux), kqueue (BSD/macOS), or evport (Solaris), with select as the portable fallback, wrapped in ae.c — Salvatore Sanfilippo's hand-written event library, originally written for the event loop of Jim (his Tcl interpreter) and roughly 600 lines of C. The loop's structure is:

# ae_loop_model.py — a faithful Python model of how Redis's ae.c event loop
# dispatches commands. Real Redis is C, but the control flow is identical and
# this Python version is something you can read in a few minutes and run on
# your laptop. It speaks just enough RESP (the Redis wire protocol) for
# redis-cli and redis-benchmark to talk to it, plus inline commands for nc.
# Run: python3 ae_loop_model.py
import collections, selectors, socket
from typing import Callable

class RedisLikeEventLoop:
    def __init__(self):
        self.sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD
        self.dict_: dict[bytes, bytes] = {}             # the one and only data store
        self.bufs: dict[socket.socket, bytes] = {}      # per-client input buffer
        self.cmds: dict[bytes, Callable] = {
            b"GET": lambda k: self.dict_.get(k),        # None encodes as RESP nil
            b"SET": lambda k, v: (self.dict_.__setitem__(k, v), b"+OK")[1],
            b"DEL": lambda k: 1 if self.dict_.pop(k, None) is not None else 0,
            b"PING": lambda: b"+PONG",
        }
        self.stats = collections.Counter()

    def serve(self, host: str = "127.0.0.1", port: int = 6390):
        srv = socket.socket(); srv.setblocking(False)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port)); srv.listen(2048)
        self.sel.register(srv, selectors.EVENT_READ, self._accept)
        print(f"redis-like loop on {host}:{port}, single thread, no locks")
        while True:
            for key, _ in self.sel.select(timeout=1.0):    # block until any fd ready
                key.data(key.fileobj)                       # dispatch handler
            self.stats["loops"] += 1

    def _accept(self, srv):
        client, _ = srv.accept(); client.setblocking(False)
        self.bufs[client] = b""
        self.sel.register(client, selectors.EVENT_READ, self._read)

    def _parse(self, buf: bytes):
        """One command off the buffer: (args, rest), or (None, buf) if incomplete.
        Assumes well-formed frames; a real server validates."""
        if not buf.startswith(b"*"):                   # inline command (nc/telnet)
            line, sep, rest = buf.partition(b"\n")
            return (line.strip().split(), rest) if sep else (None, buf)
        head, sep, rest = buf.partition(b"\r\n")       # RESP array header: *N
        if not sep: return None, buf
        args = []
        for _ in range(int(head[1:])):                 # then N of: $len\r\n<data>\r\n
            blen, sep, rest = rest.partition(b"\r\n")
            if not sep: return None, buf
            n = int(blen[1:])
            if len(rest) < n + 2: return None, buf
            args.append(rest[:n]); rest = rest[n + 2:]
        return args, rest

    def _reply(self, val) -> bytes:
        if val is None: return b"$-1\r\n"              # RESP nil
        if isinstance(val, int): return b":%d\r\n" % val
        if val[:1] in (b"+", b"-"): return val + b"\r\n"   # status / error line
        return b"$%d\r\n%s\r\n" % (len(val), val)      # bulk string

    def _read(self, client):
        try:
            data = client.recv(65536)
        except (ConnectionResetError, BlockingIOError):
            return self._close(client)
        if not data: return self._close(client)
        buf, out = self.bufs[client] + data, b""
        while True:
            args, buf = self._parse(buf)
            if args is None: break
            if not args: continue
            cmd = args[0].upper()
            handler = self.cmds.get(cmd)
            try:
                out += self._reply(handler(*args[1:]) if handler
                                   else b"-ERR unknown command '%s'" % cmd)
            except TypeError:                          # wrong argument count
                out += b"-ERR wrong number of arguments\r\n"
            self.stats[cmd] += 1
        self.bufs[client] = buf
        if out:
            try:
                client.sendall(out)   # real Redis buffers the tail, flushes on EVENT_WRITE
            except (BlockingIOError, BrokenPipeError):
                self._close(client)   # model shortcut; real Redis never drops a reply

    def _close(self, client):
        self.sel.unregister(client); self.bufs.pop(client, None); client.close()

if __name__ == "__main__":
    RedisLikeEventLoop().serve()
# Sample run, then a redis-cli session against the loop:
$ python3 ae_loop_model.py &
redis-like loop on 127.0.0.1:6390, single thread, no locks
$ redis-cli -p 6390
127.0.0.1:6390> SET upi:txn:8281749 SUCCESS
OK
127.0.0.1:6390> GET upi:txn:8281749
"SUCCESS"
127.0.0.1:6390> DEL upi:txn:8281749
(integer) 1
# Throughput: ~95k ops/s on a laptop in pure Python — an order of magnitude
# and more below real Redis on server hardware (Python interpreter overhead
# dominates), but the event-loop SHAPE is identical to what redis-server does in C.

Walk through the lines that carry the design:

  • self.sel = selectors.DefaultSelector(): on Linux this is epoll, and every connection's file descriptor is registered with it. Each iteration of the loop blocks in a single epoll_wait syscall until at least one fd has data ready. This is the cost of "doing nothing" in Redis — one syscall, plus the kernel's wait-queue bookkeeping. Why epoll instead of poll/select: epoll's wait time is O(number of ready fds), not O(total fds registered). With 50,000 idle connections and 200 active ones, poll examines all 50,000 every wakeup; epoll examines only the 200 that fired. At Razorpay-scale fanout, the difference is the loop spending 95% of its cycles on userspace work versus 95% in the kernel.
  • self.dict_: dict[bytes, bytes]: in real Redis, this is the dict C structure — a chained hash table with incremental rehashing. There is exactly one. No locks, no atomics, no memory barriers. The single-threaded discipline means this entire data structure is the property of the loop thread.
  • self.cmds: every command is a function pointer with a known argument count. Real Redis has ~250 commands; the dispatch table is a single hash lookup keyed by command name, then a function call. No virtual dispatch, no plugin architecture, no message bus.
  • handler(*args): the command runs synchronously to completion before the next event is processed. A command that takes 100 microseconds blocks every other client for 100 microseconds. This is the single-threaded design's biggest constraint — and the source of the most common Redis production foot-gun, which the next section unpacks.
  • client.sendall(...): the reply is written directly to the client's socket from the same thread that ran the command. There is no reply queue, no inter-thread handoff, no per-client output buffer manager. If the write would block (TCP send buffer full), real Redis appends the data to the client's output buffer and registers the fd for EVENT_WRITE; the loop flushes it on the next wakeup. This is the only place "non-blocking" semantics enter the design — and they are scoped to one syscall.

The entire design fits in a paragraph: one thread, one event loop, one hash table per database, every command is a synchronous function call. There is no thread pool, no work queue, no actor model, no async runtime. The "concurrency" is the multiplexing of thousands of TCP connections through one epoll fd; it is not concurrent execution at all.

When the single-threaded design loses — the long-tail commands

The design's strength is also its production foot-gun: any command that takes a long time to execute blocks every other client. A KEYS * against a 30-million-key keyspace takes 18 seconds; for those 18 seconds, every other client gets nothing — no GET, no SET, no PING, no INFO. The Razorpay payments-cache cluster ran into this exactly once, in 2022, when an on-call engineer ran KEYS upi:txn:* against the production primary at 09:42 IST during the morning UPI rush. The cluster's p99 went from 4 ms to 18 seconds until the command completed; the payments service shed roughly ₹1.4 crore of throughput in those 18 seconds, and the postmortem made KEYS a banned command at the operations level.

The list of commands that can starve the loop is the most important Redis production knowledge that doesn't show up in tutorials. The principal offenders, with rough cost models:

  • KEYS pattern: O(N) over the entire keyspace. At 1M keys, ~600 ms; at 30M keys, ~18 s. Use SCAN instead — cursor-based, returns in chunks of COUNT keys per call, lets the loop service other clients between chunks (see the sketch after this list).
  • SMEMBERS large_set / HGETALL large_hash / LRANGE list 0 -1: O(N) over the collection. A user-leaderboard set of 200,000 members serialises 200,000 entries into one reply; the loop is blocked for the duration of the encode plus the send syscall. Hotstar's IPL-leaderboard team learned this when their SMEMBERS leaderboard:final query, run from a dashboard refresh, blocked write traffic for 340 ms during a wicket spike.
  • Expensive Lua scripts: an EVAL that iterates over 100,000 elements blocks the loop for the script's full duration. Lua scripts are atomic by design (the single-threaded loop is what gives them atomicity), so they are useful — but a 200 ms script is a 200 ms outage for everyone else.
  • DEBUG SLEEP: literally a sleep in the command path, used in tests but disastrous in production.
  • SORT on a large list with BY patterns: the worst offender by cost-per-call I've seen. A SORT user_ids BY user:*->score LIMIT 0 100 against 50k user IDs does 50k hash lookups inside the command, totally synchronously.
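
The SCAN replacement from the KEYS bullet above, in practice. A minimal sketch using the redis-py client; the key pattern and COUNT hint are illustrative:

# scan_not_keys.py — the cursor-based replacement for KEYS, using redis-py.
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# BAD:  r.keys("upi:txn:*") is one O(N) command; the loop serves nobody
#       until the whole keyspace walk completes (the 18-second incident).
# GOOD: SCAN walks the keyspace in chunks; between chunks the event loop
#       serves every other client. scan_iter hides the cursor bookkeeping.
matched = 0
for key in r.scan_iter(match="upi:txn:*", count=500):   # ~500 keys per call
    matched += 1             # per-key work goes here: audit, migrate, expire
print(f"visited {matched} keys without ever monopolising the server")
# The trade: SCAN's guarantee is weaker than KEYS — keys created or deleted
# during the walk may be seen zero times or more than once. For operational
# tooling that is almost always the right trade.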

The discipline that good Redis operators converge on is: any command whose worst-case work is more than ~1 ms is a foot-gun. The bound is empirical — at 1 ms per blocking command, a queue of even 50 such commands per second (out of, say, 100k ops/s total) doubles tail latency for everyone else. The right operational stance is to monitor the slowlog (commands above a configured threshold, default 10 ms in real Redis), alert on it aggressively, and either rewrite expensive callers (SCAN instead of KEYS) or move them to a replica that doesn't take live writes.

Why a 1-ms threshold and not 10 ms or 100 µs: Little's Law connects concurrency, throughput, and response time: L = λW. For a single-threaded server, service is strictly serial (at most one command in service; everything else waits in the OS socket queue), so utilisation is ρ = λ × S, where S is the mean service time. At λ = 100,000 ops/s of steady-state offered load, S must stay below 8.5 µs to keep utilisation under the queueing knee at ρ ≈ 0.85; past the knee, waiting time, which grows like S/(1 − ρ) for an M/M/1-shaped queue, explodes. A single 1-ms blocker barely moves the mean, but it builds a backlog of roughly 100 commands (one millisecond of arrivals at 100k ops/s), and at ρ = 0.85 that backlog drains at only 15% of capacity, so every command arriving in the next ~7 ms inherits up to a millisecond of extra latency. The 1-ms bound is the empirical choice that keeps one blocker's backlog small enough to drain before the next blocker lands at typical Indian-fintech offered loads. It is not a number to memorise; it is a number derived from the offered load on your specific service via the same arithmetic.
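
The same arithmetic as runnable code, using the illustrative numbers from the paragraph above:

# blocker_backlog.py — the queueing arithmetic behind the ~1 ms rule.
# All inputs are the illustrative figures from the text.
LAMBDA = 100_000        # offered load, ops/s
S = 8.5e-6              # mean service time, s (rho = 0.85)
BLOCK = 1e-3            # one slow command: 1 ms

rho = LAMBDA * S
backlog_ops = LAMBDA * BLOCK                 # commands that arrive mid-block
drain_s = BLOCK / (1 - rho)                  # backlog clears at (1-rho) of capacity
print(f"utilisation rho        = {rho:.2f}")
print(f"backlog after blocker  = {backlog_ops:.0f} commands")
print(f"time to drain backlog  = {drain_s * 1e3:.1f} ms of elevated latency")
# rho = 0.85, backlog = 100 commands, drain ~ 6.7 ms. A 10 ms blocker at the
# same load builds a 1,000-command backlog that takes ~67 ms to drain — which
# is why the threshold is 1 ms and not 10.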

[Figure: What one KEYS * does to your latency histogram. Without the slow command, the latency histogram is concentrated below 2 ms with p99 at 1.4 ms; injecting one KEYS * grows a long tail out to 18 seconds and the p99 marker jumps to 18 s. Animated illustration; numbers calibrated to a 30M-key Razorpay-shaped cluster.]
Without the blocking command, the histogram lives below 2 ms; one `KEYS *` against a 30M-key keyspace pushes p99 from 1.4 ms to 18 seconds for every concurrently-issued command. The single-threaded design's atomicity is its strength when commands are fast and its weakness when they are not — there is no other thread to take over.

The right framing of "Redis is single-threaded" is therefore conditional. For workloads where every command is sub-millisecond and the working set fits in memory, single-threaded wins. For workloads where any meaningful fraction of commands are O(N) over large structures, single-threaded is a production trap — and the fix is not "make Redis multi-threaded" but "stop running expensive commands in the hot path", because the alternative (a multi-threaded design) would solve the symptom while introducing the synchronisation costs the design was specifically built to avoid.

A useful operational pattern from Hotstar's IPL-streaming team: every Redis primary in their fleet runs with slowlog-log-slower-than 5000 (5 milliseconds) and a 256-entry slowlog-max-len. The slowlog is scraped every 60 seconds by a Python sidecar that emits a Prometheus counter per command type, with cardinality bounded by the command name (not the full argument). When the counter for any command type increases by more than 10 entries per minute, an alert fires to the team's Slack channel — not paging, not waking anyone up, just visible. The discipline catches expensive commands as soon as they appear in production, before they become the cause of an incident. The cost of running this is essentially free (the slowlog is in-memory in Redis itself, the scrape is one SLOWLOG GET command per minute), and the value is enormous: most of the team's Redis-related learning since 2022 has come from following up on slowlog alerts and discovering what application-level callers were doing wrong. The slowlog is the cheapest production-feedback loop the single-threaded design provides; teams that ignore it spend their learning budget on incidents instead of alerts.
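
A minimal sketch of that sidecar pattern, assuming the redis-py and prometheus_client libraries; the metric name, port, and dedupe scheme are illustrative, not Hotstar's actual code:

# slowlog_sidecar.py — scrape SLOWLOG into a Prometheus counter once a minute.
# Sketch of the pattern described above; metric name and port are illustrative.
import time
import redis
from prometheus_client import Counter, start_http_server

SLOW_CMDS = Counter("redis_slowlog_entries_total",
                    "slowlog entries seen, by command name",
                    ["command"])                   # cardinality: command name only

def scrape(r: redis.Redis, seen_ids: set) -> None:
    for entry in r.slowlog_get(128):               # SLOWLOG GET 128
        if entry["id"] in seen_ids:                # slowlog is a ring buffer;
            continue                               # dedupe by entry id
        seen_ids.add(entry["id"])
        raw = entry["command"].split()[0]          # first token = command name
        name = raw.decode() if isinstance(raw, bytes) else raw
        SLOW_CMDS.labels(command=name.upper()).inc()

if __name__ == "__main__":
    start_http_server(9121)                        # expose /metrics
    client = redis.Redis(host="127.0.0.1", port=6379)
    seen: set = set()
    while True:
        scrape(client, seen)
        time.sleep(60)                             # one SLOWLOG GET per minute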

What Redis 6+ actually changed — and why it didn't change the lesson

Redis 6.0 (April 2020) introduced threaded I/O: separate threads for socket reads and writes, while the command-execution thread remains single. The motivation was that with very large client counts (10k+ concurrent connections, each pushing 1 KB+ requests), the read/write syscalls themselves were 30–50% of the loop's wall time; parallelising just the I/O lets the command thread spend more time on actual command execution. This is a careful, scoped optimisation — not a rewrite. The hash table is still single-threaded. The command dispatch is still serial. The only thing that became parallel is the syscall surface, and that change yields a 1.8–2.2× throughput improvement on the I/O-bound workloads it targets.

The background-thread story is older than the threaded I/O. Redis has offloaded AOF fsync to a background thread for most of its history, and Redis 4.0 (July 2017) added lazy free: asynchronous deletion of large objects (UNLINK, and the lazyfree-lazy-* options) on another background thread. Redis 7.0 (April 2022) reworked AOF persistence into a multi-part format without changing the threading model. These are all background-work threads. The hot path of GET and SET is single-threaded in Redis 7.2 (August 2023), exactly as it was in the first public release in 2009.

The reason this matters for the chapter's lesson: Redis's authors had every opportunity in fifteen years to "fix" the single-threaded "limitation" and chose, every time, to keep the command path single-threaded. The reasons they cited in the design notes (and Salvatore Sanfilippo gave talks on this through 2018) are: (1) the synchronisation cost of multi-threading the command path would erase most of the throughput gain on real workloads; (2) the reasoning about command atomicity — which Lua scripts, transactions (MULTI/EXEC), and Redis Streams consumer groups all depend on — is much harder to get right with concurrent execution; (3) the operational story (one process, one core, easy to reason about CPU usage) is a feature for users, not just for the implementation. The 2020 threaded-I/O patch was the careful exception that proves the rule: parallelise only the part of the work that does not touch the data structures.

The competitor that tried the opposite approach — KeyDB, a fork of Redis that multi-threaded the command path in 2019 — is instructive. KeyDB does outperform stock Redis on heavily-loaded large-value workloads, by roughly 2-3× on synthetic benchmarks. But on the typical Indian-fintech workload (small keys, small values, very high op rate), KeyDB's multi-threading overhead consumes the gain — Razorpay's 2023 internal benchmark showed Redis 7.2 doing 1.41M ops/s on a 16-core box and KeyDB doing 1.18M ops/s on the same box, because the synchronisation cost on small ops outweighed the parallelism benefit. KeyDB has a real niche (large-value workloads with mixed read/write); it does not eliminate the need for Redis. The two coexist because the design trade-offs they make are different, not because one is strictly better.

A subtler observation about the Redis 6 threaded-I/O patch: it ships disabled by default, and the documentation explicitly warns that turning it on without a measured benefit can hurt performance. The threading is only worthwhile when the I/O syscall surface is the bottleneck — which it is for workloads with large values or many tiny clients, but not for workloads with small values and a moderate client count. The default-off shipping is a careful design choice: the majority of Redis users are not in the workload regime where threaded-I/O helps, and turning it on for them adds the I/O thread coordination cost (cache-line bouncing between the I/O threads and the main thread) without the I/O parallelism benefit. The discipline of "ship the optimisation off by default; let the user opt in based on measurement" is the same discipline the rest of this chapter argues for, applied to the project's own additions.
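
For reference, the opt-in is a pair of redis.conf directives; the values below are illustrative, not a recommendation (measure first, as the paragraph above insists):

# redis.conf — threaded I/O ships off by default (io-threads 1)
io-threads 4                 # number of I/O threads; worthwhile only when the
                             # read/write syscall surface is the measured bottleneck
io-threads-do-reads yes      # parallelise reads as well as reply writes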

The pattern that holds across all of these changes — Redis 6 threaded I/O, the background AOF-fsync thread, lazy free — is that the new threads do not touch the data structures the command path mutates. Threaded I/O reads bytes from sockets into per-client input buffers; the main thread parses those buffers and runs the command. The fsync thread writes the AOF buffer to disk; the main thread filled the buffer beforehand. The lazy-free thread frees memory chunks that the main thread already removed from the live data structures. In every case, the new threads' work is downstream of the data-structure mutations, never concurrent with them. The single-threaded invariant on the hash table is preserved by construction; the new threads operate only on data the main thread has already finished with. This is the architectural pattern that lets a single-threaded core coexist with parallelism at the edges, without having to confront the synchronisation problem at all.

The operational discipline that protects the design

A single-threaded server gives you a beautiful hot path only if the operating environment cooperates. The Linux scheduler, the NUMA topology, the SMT siblings, and the kernel's TCP/IP stack each have ways of stealing the core from the redis-server thread that the user has to actively prevent. The operational discipline that holds 1.4M ops/s in production has six small parts, and each one of them shows up in postmortems when it is missing.

The first part is CPU pinning. A Redis primary running unpinned on a 16-core box is at the mercy of the Linux scheduler, which under load may migrate the redis-server thread from one core to another every few milliseconds. Each migration costs the L1 and L2 caches: the new core's L1 is cold for the working set, and the first 50,000 cache references after migration miss to L2 or further. On a 1M ops/s workload, a single migration during a peak second produces a 200 µs hiccup that shows up as a p99.99 spike. The fix is taskset -c <core> redis-server or the equivalent numactl --physcpubind invocation. Razorpay pins each Redis primary to a specific physical core — never a hyperthread sibling — because the SMT sibling competes for the L1 cache and the gain from SMT on a single-threaded server is negative.

The second part is NUMA-local memory allocation. On a dual-socket box, Redis allocated on socket 0 should run on a core on socket 0; cross-socket memory access is 2-3× slower than local-socket access, and a Redis instance whose memory is on socket 1 but whose thread is on socket 0 pays the remote-access cost on every operation. The numactl --cpunodebind=0 --membind=0 redis-server invocation is the standard wrapper.

The third part is disabling Transparent Huge Pages. Linux's THP feature collapses 2 MB regions of contiguous 4 KB pages into single huge pages in the background — a process in which the kernel's khugepaged thread walks the page tables, sometimes pausing the foreground application for tens of milliseconds. Redis's documentation has explicitly recommended disabling THP since 2014 because one THP collapse during a peak second produces the same failure mode as a slow command: a tail-latency spike for everyone. The fix is one line: echo never > /sys/kernel/mm/transparent_hugepage/enabled.

The fourth part is disabling swap. A Redis instance that swaps any of its memory to disk takes 10 ms of latency for every operation that touches a swapped-out page. The vm.swappiness=0 and vm.overcommit_memory=1 sysctls are the standard configuration, plus maxmemory set below the available physical memory so Redis evicts before the OS swaps.

The fifth part is kernel TCP buffer tuning. Redis's hot path involves syscalls to read and write on TCP sockets; the kernel buffer sizes (net.core.rmem_max, net.core.wmem_max) determine whether each syscall returns immediately or has to wait for the buffer to drain. The default buffer sizes (~200 KB) are enough for most workloads; for connections doing pipelined MGET of large values, raising them to 1-2 MB removes a measurable wait-loop in the kernel.

The sixth part is disabling the kernel's NUMA balancing. The numa_balancing feature periodically migrates pages between NUMA nodes to follow the threads accessing them — useful for general workloads, but for a CPU-pinned, memory-bound Redis it produces unpredictable latency spikes during the migration sweeps. The fix is echo 0 > /proc/sys/kernel/numa_balancing or the equivalent boot-time kernel parameter.

Each of these is a small change. None of them alone produces a measurable improvement in benchmarks. Together — and only together — they produce the reproducible 1.4M ops/s number on a c6i.4xlarge. Skipping any one of them turns the same workload into 800k-1M ops/s with high variance and unpredictable p99.99 spikes. Why all six matter and none alone is sufficient: the single-threaded design's variance is bounded only when nothing else can preempt the redis thread or move data away from it. CPU pinning prevents scheduler preemption; NUMA binding prevents remote-memory access; THP disabling prevents khugepaged pauses; swap disabling prevents disk-fault pauses; TCP tuning prevents syscall waits; numa-balancing disabling prevents page-migration sweeps. Each one is a different source of variance that the others do not address. Hold all six and the variance is bounded by the workload alone; relax any one and the variance from that source dominates. This is why production Redis at scale always ships with all six configured — not because any one of them is the magic bullet, but because the magic is the absence of any one source of variance.
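
A preflight check for the six settings is small enough to live in the deploy pipeline. A read-only sketch for Linux, assuming it is launched under the same taskset/numactl wrapper as redis-server (so the affinity check sees the same mask); the NUMA memory binding itself is asserted by the numactl wrapper and is not re-checked here:

# redis_preflight.py — verify the six variance sources are closed before
# starting redis-server. Read-only; exits non-zero if any check fails.
import os, sys

def read(path: str) -> str:
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def read_int(path: str) -> int:
    val = read(path)
    return int(val) if val.isdigit() else -1

checks = [
    ("pinned to one core", len(os.sched_getaffinity(0)) == 1,
     f"affinity={sorted(os.sched_getaffinity(0))}"),
    ("THP disabled", "[never]" in read("/sys/kernel/mm/transparent_hugepage/enabled"),
     read("/sys/kernel/mm/transparent_hugepage/enabled") or "<unreadable>"),
    ("vm.swappiness = 0", read("/proc/sys/vm/swappiness") == "0",
     "swappiness=" + read("/proc/sys/vm/swappiness")),
    ("vm.overcommit_memory = 1", read("/proc/sys/vm/overcommit_memory") == "1",
     "overcommit_memory=" + read("/proc/sys/vm/overcommit_memory")),
    ("rmem_max >= 1 MB", read_int("/proc/sys/net/core/rmem_max") >= 1 << 20,
     "rmem_max=" + read("/proc/sys/net/core/rmem_max")),
    ("NUMA balancing off", read("/proc/sys/kernel/numa_balancing") == "0",
     "numa_balancing=" + read("/proc/sys/kernel/numa_balancing")),
]
failed = False
for name, ok, detail in checks:
    print(f"{'OK  ' if ok else 'FAIL'} {name:28} {detail}")
    failed |= not ok
sys.exit(1 if failed else 0)     # wire into the deploy step: refuse to start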

What the lesson generalises to — beyond Redis

The single-threaded lesson is not really about Redis. It is about three habits of thinking that the Redis design encodes, and that apply far more broadly than to in-memory key-value stores. Each habit is the kind of thing senior engineers internalise after their second or third "we rewrote it for performance and the rewrite was slower" experience — but the habits can be learned from someone else's rewrite, which is the entire reason this curriculum has a case-studies section.

The first habit: measure what your synchronisation costs before you scale it out. The instinct that "16 cores is faster than 1 core" is right for compute-bound workloads where each thread has its own data and they don't fight each other. It is wrong for shared-state workloads where every operation touches the same cache line, because the coherence traffic is the bottleneck. The MESI protocol moves cache lines between cores at hundreds-of-cycles latency; sixteen threads each issuing a CAS per microsecond produces a coherence storm that no amount of additional cores can resolve. Before designing a multi-threaded version of any service, run the back-of-envelope: how many synchronisations per operation, at what cost each, against what total throughput target? If the answer is "more synchronisation cost than the work itself", the multi-threaded design is wrong. The Aditi-rewrite-disaster from this chapter's lead was exactly this oversight — eleven weeks of work that the back-of-envelope, run on day one, would have prevented.

The second habit: the absence of a feature is sometimes the feature. Redis's authors made many decisions over fifteen years about what not to add. No background compaction in the hot path (Redis's defrag is opt-in and synchronous within bounded windows). No worker thread pool for command execution. No pluggable storage backends in the open-source build. Each of these "missing" features is a design choice — something simpler in exchange for something that wasn't paying its way. The opposite tendency, what database papers call "kitchen-sink-itis", is the failure mode of every feature-driven project: every reasonable-sounding addition compounds the synchronisation surface, the testing matrix, the operational complexity. A senior engineer's most valuable instinct is the one that resists adding a feature even when the request is reasonable, because the feature's cost is not in shipping it but in carrying it forever.

The third habit: prefer a tight measurement loop to a strong opinion about architecture. Redis's single-threaded design was not chosen because Salvatore had a manifesto about single-threaded servers; it was chosen because the early benchmarks of redis-benchmark against the early prototypes showed that single-threaded outperformed every multi-threaded variant Salvatore tried, on the workloads Redis was being built to serve. The architecture is a consequence of the measurement, not an axiom. The teams that import "lessons" from successful systems often import the conclusion (single-threaded server) without the process (relentless benchmarking against the actual workload). The conclusion's correctness depends on the workload; the process's correctness is universal. Razorpay's payments-cache team, Zerodha's order-cache team, and Hotstar's session-cache team all run their own benchmarks against their own workloads on a quarterly cadence, and run Redis. The two facts are not contradictory: they trust Redis because their measurements keep agreeing with the design, not because they took the design on faith.

The three habits are reinforcing. The measurement habit (habit three) is what produces the synchronisation-cost numbers that drive the back-of-envelope (habit one), and the willingness to leave features out (habit two) is what keeps the measurement honest by ensuring there are not so many features in the system that the measurement becomes intractable. A team that has all three habits ships systems that hold their performance across years. A team that has only one — usually the architectural opinion, without the measurement to back it up — ships rewrites that are slower than what they replaced. Aditi's eleven-week rewrite at the start of this chapter was a team with the architectural opinion and not the measurement habit; the cost of that gap was eleven weeks of work, the credibility cost of a public rollback, and the cultural cost that comes from a team learning the hard way that "obviously faster" is not the same as "measurably faster". The Redis lesson, in one sentence: measure first, decide second, rewrite never unless the measurement says otherwise.

Common confusions

  • "Redis is single-threaded, so it can only use one core" Misleading. Redis-server is single-threaded for command execution, but a Redis cluster of 16 shards on the same 16-core host runs 16 single-threaded processes that collectively use all 16 cores. The "one core" framing applies to a single primary; production Redis at scale is almost always sharded, and the per-shard single-threaded design is what makes the sharding tractable (each shard's reasoning is independent). Use redis-cli --cluster to spread your keyspace across shards rather than fighting the per-shard limit.
  • "Multi-threading would always help if implemented correctly" It would not. Even a perfectly-implemented multi-threaded version of Redis pays synchronisation cost on every cross-thread operation — atomic CAS for shared counters, MESI traffic for shared cache lines, scheduler wakeups for queue handoffs. On the small-key/small-value workload Redis was built for, this cost is large enough to consume the parallelism benefit. KeyDB's existence proves multi-threading is implementable; KeyDB's benchmarks against stock Redis on small-op workloads prove it does not always win.
  • "You should never run a slow command in production Redis" The right rule is "never run a slow command on the primary that takes live traffic". Slow commands are sometimes necessary — BGSAVE for snapshotting, DEBUG OBJECT for diagnostic work, even KEYS for data-recovery tooling. The discipline is to run them on a replica (which can fall behind without affecting production traffic), or during a low-traffic window with explicit notice, never on a primary at peak hours. The Razorpay 2022 incident in this chapter's body was caused by violating exactly this rule.
  • "Redis Cluster removes the single-threaded limitation" Redis Cluster shards the keyspace across multiple primaries, each of which is single-threaded. So Redis Cluster scales out via sharding, not via multi-threading any one primary. The distinction matters because a hot key — a single key receiving a disproportionate fraction of traffic — still saturates one core regardless of how many shards the cluster has. Hot keys are a real production problem in Redis Cluster, solved by application-layer caching, key partitioning, or routing the hot key to a dedicated shard.
  • "Threaded I/O in Redis 6 means Redis is no longer single-threaded" It is more precise to say Redis 6 has a single-threaded command path and multi-threaded I/O path. The hash table is still touched by exactly one thread; what got parallelised is the work of reading bytes from sockets and writing replies to sockets. This is the careful, scoped exception — and the fact that it took eleven years and was implemented as a non-default io-threads setting that ships disabled tells you how seriously the project treats the single-threaded core invariant.
  • "Single-threaded servers don't scale" They scale by sharding — running many independent single-threaded processes, each owning a slice of the keyspace. This is the same scaling pattern Postgres uses for its per-connection backend processes, the same pattern nginx uses for its worker processes, and the same pattern HAProxy uses. The "scalability" of a system is rarely about one process using many cores; it is about the system as a whole using many cores via independent units that don't coordinate hot. Redis Cluster is a textbook implementation of this pattern.

Going deeper

The exact cost of a pthread_mutex_lock under contention

A pthread_mutex_lock on Linux is implemented on top of futex(2). In the uncontended case it is a single atomic compare-and-swap on a userspace word — about 15 cycles on a Skylake-X core. In the contended case (mutex held by another thread), the calling thread issues a futex(FUTEX_WAIT) syscall, which costs ~1,500 cycles minimum (syscall overhead), plus the cost of being descheduled and rescheduled when the lock is released — typically 5,000-15,000 cycles end-to-end. The relevant arithmetic for any server design is: at what fraction of contended acquisitions does the average mutex cost exceed the work the mutex protects? For a Redis-shaped command (1,400 ns of work, ~4,200 cycles at 3 GHz), a contended acquisition costing 5,000-15,000 cycles puts the break-even somewhere between roughly a quarter and four-fifths of acquisitions contended. A 16-thread design serving 1M ops/s through one mutex blows past that line immediately: with sixteen threads re-acquiring the same lock every microsecond or so, essentially every acquisition finds the lock held or its cache line on another core. This is the core arithmetic that made the single-threaded design win, and the arithmetic generalises to any shared-state hot-path design with sub-microsecond work units.
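
The break-even arithmetic, done explicitly with the numbers above:

# mutex_breakeven.py — when does the average lock cost exceed the work?
# Inputs are the cycle costs quoted above; illustrative, not measured.
WORK_CYCLES = 4200            # ~1,400 ns Redis-shaped command at 3 GHz
UNCONTENDED = 15              # single CAS, lock free
for contended in (5_000, 10_000, 15_000):   # futex slow path, end-to-end
    # solve f*C + (1-f)*U = W for f: the contended fraction at break-even
    f = (WORK_CYCLES - UNCONTENDED) / (contended - UNCONTENDED)
    print(f"contended cost {contended:>6} cycles: break-even at "
          f"{f:5.1%} of acquisitions contended")
# 5,000 cycles -> 84.0%; 10,000 -> 41.9%; 15,000 -> 27.9%. Sixteen threads
# hammering one mutex at 1M ops/s aggregate are past every one of these
# break-evens, so the lock costs more than the work it protects.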

Why the single-threaded design makes Lua scripting tractable

Redis supports Lua scripts via EVAL / EVALSHA, and the scripts have a strong atomicity guarantee: while a script runs, no other command from any client can run. This is a free consequence of single-threaded execution — there is no other thread that could run a command, so atomicity is the default rather than something to engineer. A multi-threaded Redis would have to implement script atomicity by acquiring a global lock for the script's duration, which would either serialise scripts (eliminating the multi-threading benefit during scripts) or risk deadlock when scripts touch keys held by other in-flight commands. The single-threaded design gets the atomicity for free, which is why production Redis users can rely on Lua scripts for compound operations (rate limiters, atomic counters with side effects, etc.) without elaborate concurrency reasoning. This is a concrete example of the "absence of a feature is the feature" principle — the absence of multi-threading produced atomicity as a side effect.
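
A concrete example of what that free atomicity buys: a fixed-window rate limiter whose INCR and EXPIRE can never be interleaved with any other command. A sketch using redis-py's register_script (which handles EVALSHA caching); the key scheme and limits are illustrative:

# lua_rate_limiter.py — atomic fixed-window rate limiter. The whole script
# runs as one unit on the single-threaded loop, so there is no race between
# the INCR and the EXPIRE, under any number of concurrent clients.
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])  -- first hit opens the window
end
if current > tonumber(ARGV[2]) then
  return 0                                -- over the limit
end
return 1
"""
allow = r.register_script(RATE_LIMIT_LUA)   # EVALSHA under the hood

def allowed(user_id: str, window_s: int = 1, limit: int = 100) -> bool:
    # illustrative key scheme: one counter per user per window
    return allow(keys=[f"rl:{user_id}"], args=[window_s, limit]) == 1

print(allowed("user:42"))   # True until the 101st call inside one window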

The connection limit and why Redis doesn't care about C10K-style fanout

Redis routinely handles tens of thousands of concurrent connections on one process, despite having only one thread of execution. The reason is the same epoll-based fanout that nginx and HAProxy use — connections that are idle cost almost nothing (kernel wait-queue bookkeeping, ~200 bytes per fd in kernel memory), and only the connections with data ready show up in the epoll wakeup. The maxclients setting defaults to 10,000 in stock Redis 7; production deployments at Hotstar and Swiggy run with maxclients 65535 and routinely peak at 40,000-50,000 concurrent connections during traffic spikes. The "connection cost" in a single-threaded server is genuinely O(1) per active connection per second of activity, not O(connections). This is what makes the single-threaded design work for high-fanout workloads — the threading model and the fanout model are independent dimensions.

What the design got wrong, and what later systems improved

The single-threaded design has known weaknesses that newer in-memory stores have addressed without falling into the synchronisation trap. Dragonfly (released 2022, multi-threaded shared-nothing architecture) shards the keyspace across cores within a single process, so each core owns a slice of the hash table and never coordinates on reads/writes. The design captures most of the single-threaded benefit (no synchronisation on the hot path, because each core owns its data) while letting one process use all cores — at the cost of cross-shard transactions becoming complicated. Garnet (Microsoft Research, 2024) takes a similar shared-nothing approach with a focus on tiered storage. Both prove that "single-threaded vs multi-threaded" is a false dichotomy — the right framing is "what do you share across threads, and at what coordination cost?". A shared-nothing multi-threaded design pays no coordination cost on per-key operations and only pays for cross-shard work, which is the right architecture for workloads where the per-shard hash table fits in the L2 of one core.

Reproduce this on your laptop

# Run the event-loop model from this chapter, and benchmark a real Redis next to it.
sudo apt install redis-server redis-tools
python3 -m venv .venv && source .venv/bin/activate
# (no pip deps for the model)

# Terminal 1 — start the model
python3 ae_loop_model.py
# Terminal 2 — start a real redis on a non-default port
redis-server --port 6391 --daemonize yes

# Terminal 3 — benchmark both
redis-benchmark -p 6390 -t set,get -n 100000 -c 50    # the python model
redis-benchmark -p 6391 -t set,get -n 100000 -c 50    # real redis-server

# Compare the two. The model will do ~80-100k ops/s (Python interpreter
# overhead). Real redis-server will do ~150-300k ops/s on a laptop, ~1M+ on
# a c6i.4xlarge. The shape of the latency histogram is identical — both have
# a tight peak under 2 ms and very low variance, because the design is the
# same; only the constant factors differ.

# Now break the design — add a slow command on the real redis:
redis-cli -p 6391 SET hot:key "$(python3 -c 'print("x"*1024*1024)')"
redis-cli -p 6391 DEBUG SLEEP 2
# At the same time, in another terminal, watch p99 spike:
redis-benchmark -p 6391 -t get -n 1000 -c 10
# The slow command starves the loop. This is the production trap.

This pair — the Python model and the real redis-server, benchmarked side by side, with a deliberate slow command injected — reproduces the chapter's three claims on a laptop in under five minutes. The numbers vary by hardware; the story (single-threaded loop, tight latency histogram until a slow command appears, then everyone waits) is hardware-independent.

Where this leads next

Part 16 — the case-studies part — uses this chapter as the template for the rest of its case studies. Each subsequent case study examines a real-world system whose design is a deliberate response to a particular performance constraint, and extracts the lesson that generalises beyond the specific system. The structure is consistent: a concrete production failure or surprising number, a walk through the mechanism that produces it, the design trade-off the team made, and the lesson that applies even to readers who will never run that specific system. The single-threaded Redis lesson is the prototype: a counter-intuitive design choice whose justification is measurement, not ideology, and whose generalisation is "measure your synchronisation costs before you scale out".

Readers who want to ground this chapter in their own systems should run the reproduce-this benchmark on a laptop, then re-run it on whatever the team's actual production hardware is. The shape of the latency histogram and the absolute throughput number will differ; the ratio of single-threaded throughput to a hypothetical multi-threaded version's throughput, on small ops, will be roughly the same. That ratio is the architectural number this chapter is really about — and once a team has measured it on their own workload, the conversation about "should we rewrite this" becomes a conversation about numbers rather than instinct.

The natural next reads are:

The chapter after this — Netflix on load shedding — picks up a different facet of the same theme: what to do when the offered load exceeds the system's capacity, regardless of whether the system is single or multi-threaded. The Redis lesson is about not creating coordination cost where there does not need to be any; the Netflix lesson is about gracefully refusing work the system cannot do without lying about it. Both are studies in the discipline of not pretending — Redis does not pretend its single-threaded design is multi-threaded, and Netflix does not pretend a saturated cluster can serve traffic it cannot.

A final pointer for readers who want to push deeper on the lesson rather than the system: the single-threaded versus multi-threaded debate recurs in every domain where shared mutable state meets high op rates — the single-threaded JavaScript event loop in Node.js, the per-process single-threaded actors of Erlang/Elixir, single-threaded shards in Vitess, single-threaded ingest paths in InfluxDB. Each of these is a different system that converged on the same answer for the same arithmetic reason, and tracing the design notes across them is its own education in why the answer is structural rather than incidental.
