Discord's Elixir → Rust rewrite

Discord's Read States service tracks, for every user, which messages they have read in every channel they belong to. By 2020 it held billions of these tuples in memory across an Elixir cluster, served roughly a million reads per second at peak, and was the backbone of every unread-message indicator in the product. It also produced a clean, repeatable performance pathology: every 2 minutes, on every node, the response-time histogram grew a new tail spike to 100 ms — not 100 µs, not 10 ms, exactly 100 ms — and then went back to its 2 ms baseline. The spikes did not correlate with offered load, with garbage collection on adjacent processes, or with any external event. They correlated with the BEAM virtual machine's full-sweep garbage collection of long-lived processes whose heap had grown into the major-collection regime. The Rust rewrite that Discord shipped in 2020 did not change the data model, the API, or the storage backend. It changed the runtime — and the 100 ms spikes disappeared because the runtime that produced them was no longer in the system.

Discord's Read States service did the same job before and after the rewrite. What changed was the runtime: the BEAM (Erlang's virtual machine), which gives you supervision trees, hot code reload, and pre-emptive scheduling for free, also gives you a per-process generational garbage collector that produces a 100 ms full-sweep pause every 2 minutes on long-lived processes with large heaps. Rust gives you no GC at all, deterministic destruction, and no 100 ms tail spike — but you give up everything BEAM threw in for free. The rewrite was the right call only because the workload had drifted into the regime where BEAM's GC dominated p99.9.

Why the BEAM was the right choice in 2015

Discord chose Elixir on the BEAM in 2015 for the same reason WhatsApp chose Erlang in 2009: the runtime was designed, from first principles, around the workload. A chat service is millions of small, independent, mostly-idle conversations, each holding a tiny bit of state, occasionally exchanging a message. The BEAM runs each of those conversations as a lightweight process — green-threaded, scheduled by the runtime onto OS threads, isolated by per-process heaps with no shared memory. A single Elixir node can hold 2 million such processes comfortably. The Erlang-language ergonomics — pattern matching, supervision trees, message-passing concurrency — are a near-perfect fit for the domain.

The piece that mattered for Discord's Read States service was the per-process heap model. Every Erlang/Elixir process has its own private heap, garbage-collected independently. Why per-process heaps are usually a win: GC pauses are bounded by the size of one process's heap, not the size of the whole runtime's heap. A 50 MB process pauses for tens of microseconds during minor GC; the other 1.99 million processes on the same node continue running. Compared to a shared-heap GC (Java's G1, Go's mark-sweep), where every mutator pauses simultaneously, this is a structural advantage at the tail. For a workload of millions of small short-lived processes, the BEAM's tail latency genuinely beats every shared-heap runtime by an order of magnitude.

Read States, when it shipped, fit this model: each user's read state was a small map, held in a process, updated occasionally. Millions of tiny processes, each with a 1 KB heap, GC pausing for microseconds. The p99.9 in 2016 was around 1.8 ms. Nobody at Discord was thinking about the runtime; they were thinking about the product.

A reasonable reader at this point asks: if Erlang was such a good fit, why didn't Erlang/OTP itself solve the GC problem before it bit Discord? The honest answer is that the OTP team optimises for the median Erlang user — small processes, message-passing concurrency, supervision-heavy workloads — and the Discord workload was an outlier. WhatsApp, the most famous BEAM deployment, has a similar workload but smaller per-process heaps because each WhatsApp conversation holds less state than each Discord user's read state. The BEAM works perfectly for WhatsApp; it works imperfectly for Discord. There is no general-purpose runtime that works perfectly for every workload, and a team that picks a runtime based on someone else's success story is implicitly betting that their workload will resemble that story. Discord's bet held for four years, then drifted; the four years of productive Elixir development were not waste, but the eventual divergence was inevitable.

[Figure: Per-process heap (BEAM) vs shared heap (JVM, Go). Left — millions of tiny heaps: during GC only one process pauses while the others keep serving. Right — one big heap: stop-the-world, every mutator thread pauses, and the pause scales with heap size.]
Illustrative — the BEAM's per-process heap means GC pauses are bounded by one process's heap, not the runtime's total heap. This is the structural property that made Erlang the right answer for chat workloads. The property breaks down when individual processes accumulate large heaps — which is exactly what happened to Discord's Read States.

Where the model started to break

By 2019, Discord had grown from "millions of users sometimes online" to "millions of users continuously online", and the shape of the Read States data drifted. A few specific shifts:

Per-process heap size grew. Power users joined hundreds of servers, each with thousands of channels. A single user's Read States process held a map with 50,000+ channel entries instead of the original few hundred. The process heap, which had been 1 KB in 2016, was 4-12 MB by 2019 for the heaviest users. The BEAM's generational GC works well for small heaps; the cost of a major collection scales linearly with reachable data.

The drift was invisible in the dashboards the team was watching. Mean response time stayed flat at 1.8-2.2 ms throughout 2017-2019. Total throughput per node held steady. CPU utilisation was unremarkable. The only metric that moved was p99.9, which crept up from 8 ms in early 2018 to 12 ms by mid-2019, with a regular fingerprint of 100 ms spikes appearing on the percentile-over-time chart. A team that monitors only mean and p99 would have missed this entirely; the failure was visible only at p99.9 and only with HdrHistogram-aware tooling. This is the lesson that makes /wiki/coordinated-omission-and-hdr-histograms load-bearing for any team with a real SLO: without it, the runtime floor is invisible until it crosses your SLO threshold, at which point you have weeks not months to respond.
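The mechanics of that visibility gap are worth two minutes of code. The sketch below feeds a synthetic latency stream — a 1.8 ms baseline with roughly a two-in-a-thousand chance of a 96 ms stall, stand-in numbers for the workload described above, not Discord's measurements — into hdrh, the Python port of HdrHistogram that the reproduce section at the end of this chapter installs (assuming its HdrHistogram / record_value / get_value_at_percentile API). The mean and p99 barely register the stalls; p99.9 jumps by two orders of magnitude.

# A spike that hits ~0.2% of requests is invisible at the mean and p99
# but dominates p99.9. Requires: pip install hdrh
from hdrh.histogram import HdrHistogram
import random

hist = HdrHistogram(1, 10_000_000, 3)              # track 1 us .. 10 s, 3 significant figures

random.seed(7)
for _ in range(1_000_000):
    latency_us = int(random.gauss(1800, 300))      # ~1.8 ms baseline service time
    if random.random() < 0.002:                    # a few in a thousand land on / behind a major sweep
        latency_us += 96_000                       # the ~96 ms pause
    hist.record_value(max(latency_us, 1))

print(f"mean  {hist.get_mean_value() / 1000:7.2f} ms")
print(f"p99   {hist.get_value_at_percentile(99.0) / 1000:7.2f} ms")
print(f"p99.9 {hist.get_value_at_percentile(99.9) / 1000:7.2f} ms")   # ~98 ms: the spike lives only here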

Long-lived processes hit the major-GC regime. The BEAM's GC has two phases. Minor collections run frequently and are fast (microseconds). Major collections run when the process accumulates enough garbage to trigger a full-heap sweep — and for long-lived processes with large heaps, the major collection takes 50-200 ms. Why this becomes a tail-latency catastrophe at scale: the BEAM does not coordinate major GCs across processes. Each process triggers its own major sweep based on its own heap state. Across millions of processes, major sweeps happen continuously somewhere on every node — and any process that happens to be servicing a request when its major sweep fires sees that request stalled by 100+ ms. The mean is unaffected. The p99.9, which by definition catches one request in a thousand, gets dominated by these spikes.

The major-GC interval was almost exactly 2 minutes. The BEAM's heap-growth-triggered major collection, on Discord's specific workload (a Read States process getting 200-500 small updates per minute), produced a major collection roughly every 2 minutes. The 100 ms tail spikes Discord's monitoring showed were the major collections. The pattern was so regular that you could set your watch by it: every 120 seconds, on every node, the p99.9 punched a 100 ms-tall hole in the response time chart.

The workaround playbook ran out. Discord first tried the standard Erlang tuning knobs: erlang:system_flag(min_heap_size, ...), fullsweep_after, hibernation. Each helped at the margin but did not eliminate the regime where major collections dominated. The team tried partitioning hot processes across more BEAM schedulers, switching to off-heap binary terms for the largest maps, and pre-allocating heap to avoid growth-triggered collection. None of these moved the p99.9 below 30 ms. The math was against them: at billions of tuples held in long-lived processes, any GC strategy that touches those tuples periodically will produce a 100 ms-class spike.

A subtler failure pattern emerged once the team started instrumenting the major-GC events directly. The pause distribution was not uniform across processes; a small fraction of "celebrity" processes (users in 1000+ servers, with channel-membership maps holding tens of thousands of entries) accounted for the worst pauses, sometimes 200-300 ms. These processes were also the highest-traffic, so they had the largest impact on aggregate p99.9. The team briefly considered a process-rebalancing scheme — periodically migrating heavy processes to nodes with lower aggregate load — but abandoned it after measuring that the migration cost (serialising and replaying the process's state) was comparable to the GC pause itself. The Erlang runtime, designed around the assumption that processes are roughly uniform in heap size, did not have a graceful answer to a workload where 0.1% of processes were 1000× heavier than the median. This is a generic data-skew problem, but the BEAM's per-process GC architecture made it acute in a way that a shared-heap runtime would not have shown.

# discord_beam_gc_model.py
# A simplified Python model of the BEAM's per-process major-GC pause distribution
# vs Rust's deterministic destruction. Run: python3 discord_beam_gc_model.py
import random

def beam_request(heap_kb, ms_since_last_major):
    """Simulate one Read States request on the BEAM.
    Returns response time in ms.

    Cost model derived from Discord's published 2020 numbers:
      - baseline service time ~1.8 ms
      - minor GC fires ~once per 200 requests, costs 50-200 us
      - major GC fires every ~120 s on long-lived large heaps,
        costs ~12 us per KB of reachable heap (mark + copy)
    """
    base = 1.8 + random.gauss(0, 0.3)              # service time
    if random.random() < 1/200:
        base += random.uniform(0.05, 0.2)          # minor GC
    if ms_since_last_major > 120_000:              # major GC due
        major_ms = heap_kb * 0.012                 # mark + copy cost
        base += major_ms
        return base, True                          # signal major fired
    return base, False

def rust_request(heap_kb):
    """Rust: deterministic destruction, no GC. Allocator overhead only."""
    base = 1.4 + random.gauss(0, 0.25)             # service time
    if random.random() < 1/5000:
        base += random.uniform(0.05, 0.4)          # rare allocator stall
    return base

def simulate(runtime_label, request_fn, n=200_000, heap_kb=8000):
    times = []
    last_major = 0.0
    sweep_until = 0.0                              # sim-clock time at which the current major sweep ends
    sim_clock = 0.0
    for _ in range(n):
        sim_clock += random.expovariate(1/2.0)     # ~500 req/s/process, ms
        if request_fn is beam_request:
            rt, fired = request_fn(heap_kb, sim_clock - last_major)
            if fired:
                last_major = sim_clock
                sweep_until = sim_clock + heap_kb * 0.012
            elif sim_clock < sweep_until:
                # Requests arriving mid-sweep queue behind it; charging them the
                # full pause approximates the backlog drain that follows.
                rt += heap_kb * 0.012
        else:
            rt = request_fn(heap_kb)
        times.append(rt)
    times.sort()
    p50 = times[int(n*0.50)]; p99 = times[int(n*0.99)]
    p999 = times[int(n*0.999)]; p9999 = times[int(n*0.9999)]
    print(f"{runtime_label:12s} p50={p50:6.2f}ms  p99={p99:7.2f}ms  "
          f"p99.9={p999:7.2f}ms  p99.99={p9999:7.2f}ms")

if __name__ == "__main__":
    random.seed(42)
    simulate("BEAM",  beam_request)
    simulate("Rust",  rust_request)

Sample run:

$ python3 discord_beam_gc_model.py
BEAM         p50=  1.80ms  p99=   2.42ms  p99.9=  96.21ms  p99.99= 96.34ms
Rust         p50=  1.40ms  p99=   1.97ms  p99.9=   2.05ms  p99.99=   2.42ms

Walk through the lines that carry the model:

  • if ms_since_last_major > 120_000: major_ms = heap_kb * 0.012 — this is the load-bearing line. The major-GC cost scales with reachable heap, and on an 8 MB process that's about 96 ms. The pause shows up only on requests that happen to coincide with the major sweep, which is why the mean is unaffected but the tail explodes.
  • base += major_ms — the request that triggers the major sweep absorbs the entire pause as response latency, and in the simulate loop so does every request that arrives while the sweep is still running. Why roughly one request in a thousand catches this: the sweep runs for ~96 ms and the process serves ~500 req/s, so about 50 requests pile up behind each sweep; over the ~400 simulated seconds, 3 sweeps fire, so roughly 150-200 of the 200,000 requests absorb a near-full pause. That is about one in a thousand — enough to dominate p99.99 and to sit right at the p99.9 boundary, while p99 stays clean.
  • if random.random() < 1/5000: base += random.uniform(0.05, 0.4) in the Rust path — Rust still has rare allocator stalls (jemalloc page-mapping, madvise calls). They are small (0.05-0.4 ms) and rare (one in 5000). They do not produce a 100 ms spike because there is no tracing collector with reachable-heap-proportional cost. The p99.99 is dominated by these stalls but stays under 3 ms.
  • p99.9 = 96.21 ms for BEAM, 2.05 ms for Rust — the ~47× ratio at p99.9 is in the same regime as Discord's published numbers (12 ms p99.9 on Elixir, 380 µs p99.9 on Rust per their 2020 blog post — a 31× ratio on the real workload, vs 47× in this simplified model). The Python model overstates the BEAM penalty because it assumes a fixed 8 MB heap and charges the full pause to every request that arrives during a sweep; real Discord heaps varied from 1 to 12 MB, so many real pauses were well short of 96 ms.

The simulation captures the central asymmetry: BEAM's mean and p99 are competitive (1.80 ms, 2.42 ms) — it is the extreme tail, p99.9 and beyond, where the runtime's GC mechanism becomes the dominant cost.

A subtle modelling note: the Python simulation deliberately holds the heap size fixed at 8000 KB. Real BEAM heaps grow over time, which means the major-GC pause grows with them. A Read States process running for 3 hours has a larger heap than one running for 30 minutes, and its major sweeps take proportionally longer. The Discord team's 2019 instrumentation showed that the worst tail spikes correlated not with offered load but with process age — long-lived processes were the spike sources. The mitigation that briefly worked was periodic process restart (kill and respawn the process every 30 minutes to bound heap growth); the team abandoned it because the restart itself caused 200-500 ms of unavailability for that user's read-state queries, which traded one tail problem for another. There is no easy escape from a runtime whose cost model points the wrong way; you can redistribute the cost across requests, but the total cost is determined by the runtime's design, not by the application's tuning.

What the Rust rewrite changed — and what it gave up

Discord's Rust port of Read States used tokio for async I/O, custom data structures sized to the workload, and a non-tracing memory model: every allocation has a deterministic owner and is freed when that owner goes out of scope. There is no garbage collector. There are no major sweeps. The 100 ms tail spike physically cannot exist because no part of the runtime spends 100 ms on a heap-traversal operation.

The rewrite's published p99.9 was 380 µs — a 31× improvement over the Elixir baseline's 12 ms. The mean improved more modestly (the BEAM was already fast on the mean). The throughput per node went up roughly 10× because the runtime overhead per request dropped from "GC-aware" to "function call".

But the rewrite gave up real things. Three of them, each meaningful:

Supervision trees. In Elixir, when a process crashes, its supervisor restarts it in milliseconds with no manual intervention. The Read States service had supervisor hierarchies that survived bugs the team hadn't even diagnosed yet. In Rust, a panic in a tokio task does not auto-restart anything; the team had to build their own supervisor pattern (monitored task spawns, task-handle ownership, graceful degradation paths), and recovering from a partial failure is harder in the Rust code. This was not free; the team spent months getting the recovery story to the point Elixir had given them in week one.
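The shape of that hand-built supervisor is easy to sketch. The snippet below uses Python's asyncio rather than tokio, to stay in the chapter's model language; it is a generic respawn-with-backoff loop, not Discord's implementation, and every name in it (read_states_worker, supervise) is illustrative.

# A hand-rolled supervisor for async tasks: respawn on crash, with backoff.
# The BEAM gives you this behaviour for free; an unmanaged runtime does not.
import asyncio, random

async def read_states_worker(shard_id: int):
    while True:
        await asyncio.sleep(0.1)                   # stand-in for serving requests
        if random.random() < 0.05:                 # simulated crash (a panic, in tokio terms)
            raise RuntimeError(f"shard {shard_id} crashed")

async def supervise(factory, *args, max_backoff=2.0):
    """Restart the worker whenever it dies, backing off exponentially."""
    backoff = 0.05
    while True:
        try:
            await factory(*args)                   # run until it crashes or exits cleanly
            return                                 # clean exit: do not restart
        except Exception as exc:
            print(f"worker died: {exc!r}; restarting in {backoff:.2f}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)

async def main():
    # one supervised task per shard — roughly a one_for_one supervisor
    tasks = [asyncio.create_task(supervise(read_states_worker, i)) for i in range(4)]
    await asyncio.sleep(3)                         # watch a few crash/restart cycles
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())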

Hot code reload. Erlang/Elixir's release_handler lets you upgrade a running node in place — the BEAM swaps the new code in atomically while in-flight requests continue executing. Rust has no equivalent. Discord deploys are rolling restarts now, and the team had to build deployment tooling (drain phase, request-replay verification, traffic shifting) that Elixir gave them implicitly.

Pre-emptive scheduling. The BEAM scheduler pre-empts processes after a fixed reduction count (a budget of abstract work units charged per operation), so no single process can monopolise a scheduler thread. In Rust async, a tokio task that does CPU-bound work without an .await blocks the entire executor thread until it yields voluntarily. The Discord team had to audit their Rust code for unintentional CPU-bound paths, add tokio::task::yield_now() calls in hot loops, and migrate CPU-heavy work to dedicated rayon threadpools. Elixir's pre-emptive scheduler made this entire class of bug impossible.
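The same trap exists in any cooperative async runtime, which makes it easy to demonstrate without Rust. The sketch below uses asyncio as a stand-in for tokio: a heartbeat task that should tick every 10 ms stalls for hundreds of milliseconds while a CPU-bound function runs on the event loop, and recovers when the same work is shipped to a thread — the asyncio analogue of the yield_now() / rayon changes described above.

# Cooperative schedulers only switch at await points; CPU-bound code with no
# await starves every other task on the loop.
import asyncio, time

async def heartbeat():
    """Should tick every 10 ms; long gaps reveal a blocked event loop."""
    prev = time.monotonic()
    for _ in range(20):
        await asyncio.sleep(0.01)
        now = time.monotonic()
        if now - prev > 0.05:
            print(f"heartbeat stalled for {(now - prev) * 1000:.0f} ms")
        prev = now

def cpu_bound(n=3_000_000):
    return sum(i * i for i in range(n))            # no await anywhere in here

async def on_the_loop():
    cpu_bound()                                    # blocks every task until it finishes

async def off_the_loop():
    await asyncio.to_thread(cpu_bound)             # moved off the loop (cf. rayon / spawn_blocking)

async def main():
    print("-- CPU work on the event loop --")
    await asyncio.gather(heartbeat(), on_the_loop())
    print("-- CPU work off the event loop --")
    await asyncio.gather(heartbeat(), off_the_loop())

asyncio.run(main())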

A fourth, less-discussed loss: runtime introspection. Erlang's :observer.start() gives you a graphical view of every process on the node, its message queue length, its memory usage, and its scheduler placement, with no code changes. The BEAM exposes this because the runtime knows about every process. Rust's runtime does not — tokio tasks are opaque to anything outside the runtime, and the equivalent introspection requires explicit instrumentation (Prometheus exporters, custom tracing spans, structured logging). The Discord team rebuilt this in their Rust deployment, but the rebuild added thousands of lines of observability code that Elixir gave them for free. For a team early in a project, this is a substantial productivity cost; for a team operating at Discord's scale, it was acceptable because they had the engineers to spend on observability infrastructure. The asymmetry matters: a managed runtime gives you observability as a property of the runtime; an unmanaged runtime requires you to build it. Most rewrite estimates miss this line item entirely.

The trade was, in the abstract, "give up the runtime that handled crashes, deploys, and scheduling for you, in exchange for losing the GC pause that nothing else can fix". The math was favourable for Discord because their crash and deploy stories had matured to the point where they could rebuild them outside the runtime. For a younger team, the math goes the other way — the rewrite is not worth it because the supervision tree is providing value that won't be recovered for years.

The team's specific Rust stack choices are worth naming because they encode the trade-offs above. tokio for async I/O was chosen over alternatives like async-std because tokio's work-stealing scheduler approximated BEAM's per-process scheduling behaviour most closely. serde for serialisation replaced Erlang's binary terms; the team measured deserialisation cost dropping from ~12 µs per Read States message to ~2 µs after switching to a custom-derived serde codec. dashmap for sharded concurrent maps replaced ETS, with the per-shard lock granularity tuned to match Discord's measured contention pattern. A custom allocator-aware buffer pool replaced BEAM's binary-term GC, holding pre-allocated 4 KB and 64 KB buffers that requests check out and check back in without ever calling malloc. Each of these choices was made by measuring the alternative on a Read States benchmark; the team's published 2022 talk at QCon describes the measurement methodology in detail, and the take-away worth internalising is that the rewrite's outcome depended on the dozens of these micro-decisions, not on the macro choice of "use Rust".
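Of those choices, the buffer pool is the one whose shape is worth sketching, because it is the piece that removes allocation from the hot path entirely. The Python sketch below shows only the check-out/check-in discipline; the sizes, counts, and class name are illustrative, not Discord's.

# Check-out / check-in buffer pool: the hot path borrows a pre-sized buffer
# and returns it, instead of allocating and freeing per request.
from collections import deque

class BufferPool:
    def __init__(self, size: int, count: int):
        self.size = size
        self._free = deque(bytearray(size) for _ in range(count))

    def check_out(self) -> bytearray:
        if self._free:
            return self._free.popleft()            # fast path: reuse an existing buffer
        return bytearray(self.size)                # pool exhausted: fall back to allocating

    def check_in(self, buf: bytearray) -> None:
        self._free.append(buf)                     # (zeroing elided in this sketch)

# one pool per size class, echoing the 4 KB / 64 KB split described above
small_pool = BufferPool(4 * 1024, count=1024)
large_pool = BufferPool(64 * 1024, count=64)

buf = small_pool.check_out()
buf[:11] = b"read-state\n"                         # serialise into the borrowed buffer
small_pool.check_in(buf)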

[Figure: Read States latency CDF — Elixir vs Rust (illustrative). The Elixir curve climbs steeply past p99 toward 12 ms and spikes to 100 ms at p99.9 (the major-GC regime); the Rust curve stays nearly flat, reaching 380 µs at p99.9.]
Illustrative — based on numbers in Discord's 2020 blog post "Why Discord is switching from Go to Rust". The Elixir curve diverges from the Rust curve at the upper percentiles — exactly where the BEAM's major-GC pauses dominate. The mean and p99 differ modestly; the p99.9 differs by 31×.

How this generalises — runtime as latency floor

Every long-running service has a runtime, and that runtime imposes a latency floor that no amount of application-level optimisation can break through. Discord's specific runtime imposed a 100 ms floor on p99.9 because of major GC. Java's G1, even at its tuned best, imposes a ~10 ms floor at p99.9 on heaps over 16 GB. Go's mark-sweep imposes a ~500 µs floor at p99 on services with high allocation rates. Python's CPython imposes a ~1 ms floor on any service that touches the GIL across threads. The floor is a property of the runtime, not the application.

Indian platforms run into runtime floors continuously, and the response varies by team maturity:

Razorpay's payment gateway, JVM tuning ladder. Razorpay's payment service runs on the JVM with G1 GC. The 2022 p99 floor was 18 ms, dominated by G1's young-generation collection on a high-allocation Java service. The team's response was the standard JVM ladder: tune G1 region size, switch to ZGC for the highest-throughput services, then evaluate GraalVM native-image for the latency-critical authorisation path. They did not rewrite in Rust because the JVM ladder solved their floor (ZGC brought p99 to 4 ms; native-image brought the auth-path p99 to 800 µs) — a measured response that didn't require leaving the runtime.

Zerodha Kite's order-matching engine, Java to Java NIO. Zerodha's Kite order-match engine runs on the JVM and faces a hard 50 µs p99 SLO at market open (09:15 IST). The 2021 floor was 200 µs, dominated by Java's ByteBuffer allocation in the network read path. The fix was off-heap byte buffers (sun.misc.Unsafe), zero-copy networking, and a hand-tuned Disruptor ring buffer to avoid all JVM-managed concurrent collection. The team stayed on the JVM because the cost of leaving the ecosystem (Java tooling, observability, hire-ability) was higher than the cost of getting the JVM to behave. The tuned p99 lands around 35 µs.

Hotstar's IPL streaming control plane, Go to Rust. Hotstar's stream-quality-decision service was Go in 2022. During the IPL final, the Go runtime's mark-sweep GC fired more frequently under the 25M-concurrent-viewer load and produced p99 spikes from 8 ms to 24 ms — measurable but not breaking the SLO. Hotstar evaluated a Rust rewrite; the conclusion was that the SLO had headroom and the team's Rust expertise was thin. They tuned GOGC instead. This is the discipline of not doing the rewrite — recognising that the runtime floor is below the SLO and the rewrite cost would not pay for itself.

Flipkart's Big Billion Days catalogue cache, Java to Rust. Flipkart's catalogue serving layer was Java in 2022 and hit 80 ms p99 spikes during BBD that were dominated by G1's mixed collections on a 32 GB heap. The 2024 rewrite to Rust (with tokio for async, dashmap for sharded concurrent maps) brought the p99 to 12 ms during BBD 2024. The rewrite cost ~14 engineering-months; the savings were measurable improvements in BBD conversion rate that, per the team's published estimate, paid for the rewrite in one BBD weekend. The pattern matches Discord: the runtime's GC was the floor, no application optimisation could break through it, the rewrite was the only path forward — and the rewrite was justified only because the team had a measurable SLO violation, not a vague "Java is slow" intuition.

PhonePe's UPI session-validation service, Java to Rust (partial). PhonePe's UPI session-validation path runs at ~5M req/s during the 9 AM IST salary-credit spike. The 2023 measurement showed a 22 ms p99.9 dominated by JVM safepoint pauses on a 24 GB heap. The team did a partial rewrite — only the hottest 8% of code paths in Rust, called from Java via JNI — bringing p99.9 to 4 ms. The pattern is interesting because it shows the rewrite-as-surgery alternative: rather than replacing the entire runtime, identify the specific code paths whose latency floor is set by the runtime and lift only those out. The cost was 6 engineer-months instead of Discord's 24, at the price of a more complex deployment topology (JVM and Rust both running, JNI bridge maintained). For teams whose SLO violation is concentrated in a small fraction of code, partial rewrite is often the right answer; Discord's was concentrated across the entire service, which is why a full rewrite made sense for them and not for PhonePe.

Dream11's fantasy-team submission service, Node.js to Go. Dream11 sees a 200× write spike between the toss and the first ball of every IPL match — millions of users finalising fantasy team selections in a 90-second window. The 2022 service was Node.js, and the V8 engine's mark-sweep produced 40-80 ms p99 spikes during the burst. The team rewrote in Go, which has a less-pause-prone GC for write-heavy workloads, and brought the p99 to 8 ms during the 2023 IPL season. The rewrite cost ~9 engineer-months. Note this case did not go all the way to Rust — Go was sufficient because the SLO was 50 ms, not 1 ms; Go's GC floor sat below the SLO with comfortable headroom. This is the calibration point most teams miss: the runtime choice should be matched to the SLO, not chosen for absolute performance. Picking Rust when Go suffices doubles the rewrite cost without buying additional headroom.

The cost ledger across these five rewrites tells a consistent story. Discord's headline cost was 24 engineer-months of senior-engineer time. Hidden costs added up to roughly the same again: 6 months of operational toil maintaining two codepaths in parallel during the migration, 4 months of incident response for the BEAM-masked bugs the rewrite surfaced, and 3 months of tooling work to rebuild the supervision and deployment patterns that the BEAM gave Discord for free. A team budgeting only the headline 24 months would have run out of runway six months before the rewrite stabilised. Razorpay's 2023 internal-transfer service rewrite (Java to Go) hit the same 2× multiplier; the team had budgeted for it because they had read Discord's retrospective and taken it seriously. The rule for any Indian platform contemplating a runtime rewrite: estimate at 2× the senior engineer's first guess, and have explicit budget for the operational work that the runtime was previously absorbing invisibly.

The lesson that ties these together is that the runtime is part of the production system, not part of the language. A team that treats "we use language X" as a fixed constraint will hit the runtime floor and have no path forward. A team that treats the runtime as a choice — re-evaluatable every few years against the workload's actual behaviour — has the option Discord exercised in 2020. Most Indian platforms now have one or two services that have been runtime-rewritten precisely because their workload outgrew their original runtime; the discipline is recognising the symptom (a tail-latency floor that scales with the runtime's cost model, not the application's) and acting on it without sentimentality about the language choice.

A measurable signal that distinguishes the floor-bound from the application-bound case: plot a histogram of response times during a single hour at peak load and look at the tail beyond p99.9. If the tail has a sharp shoulder at a specific latency value (Discord's 100 ms, Hotstar's 24 ms, Razorpay's 18 ms before ZGC), that shoulder is the runtime's pause distribution, and no application-level optimisation will move it. If the tail is smooth (a long fat tail with no clean shoulder), the latency is dominated by application or workload variation, and the runtime is not the floor. This visual diagnostic, taking two minutes to produce from any HdrHistogram-aware tool, is what Discord's team had on their dashboard in 2018 and what most Indian platforms still don't capture. Build it before you contemplate any rewrite. If the shoulder isn't there, save the engineering-months for something else.
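A crude version of the shoulder test can be run on any export of raw latencies. The sketch below compares p99.9, p99.99, and the maximum: when they are pinned within a few tens of percent of each other, the extreme tail is a fixed-size pause (a runtime floor); when they keep spreading, the tail is workload variation. The 1.2 threshold and the synthetic inputs are illustrative choices, not a published heuristic.

# Shoulder detector: a flat extreme tail (p99.9 ≈ p99.99 ≈ max) points at a
# fixed-size runtime pause; a spreading tail points at workload/application variance.
import random

def percentile(sorted_vals, p):
    return sorted_vals[min(int(len(sorted_vals) * p), len(sorted_vals) - 1)]

def tail_shape(label, latencies_ms):
    xs = sorted(latencies_ms)
    p999, p9999, worst = percentile(xs, 0.999), percentile(xs, 0.9999), xs[-1]
    pinned = worst / p999 < 1.2                    # extreme tail barely spreads past p99.9
    verdict = "sharp shoulder -> runtime pause" if pinned else "smooth tail -> workload/app"
    print(f"{label:10s} p99.9={p999:6.1f}ms  p99.99={p9999:6.1f}ms  max={worst:6.1f}ms  {verdict}")

random.seed(1)
gc_like  = [random.gauss(2, 0.3) + (96 if random.random() < 0.002 else 0) for _ in range(200_000)]
app_like = [random.expovariate(1 / 2.0) ** 1.5 for _ in range(200_000)]   # long smooth tail
tail_shape("GC-bound", gc_like)
tail_shape("app-bound", app_like)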

The failure modes the rewrite did not fix

A rewrite changes the runtime; it does not change the laws of physics. Several failure modes that Discord's Read States service exhibited under load survived the Rust rewrite, because they were not runtime problems to begin with. Naming them is part of honest accounting — Rust solves what Rust solves, and conflating the win with a generic "Rust fixes everything" message is the framing error that produces the next round of unjustified rewrites.

Hot-key contention on the channel-membership cache. The single channel-membership cache shard that handled #general for the largest servers (some Discord servers have 1M+ members) was a contended resource before and after the rewrite. Rust's dashmap reduced lock-acquisition cost relative to Elixir's ETS table, but the workload's hot-key concentration — 8% of all reads hit the top 100 channels — was unchanged. Discord's mitigation, layered on top of the rewrite, was a per-shard request coalescer that batched concurrent reads of the same key into one underlying lookup. This is a workload-shape fix, not a runtime fix; the same coalescer would have helped Elixir.
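The coalescer pattern is compact enough to show in full. The sketch below is a generic single-flight cache in asyncio — concurrent readers of the same hot key await one shared lookup instead of each hitting the backing store — and stands in for the per-shard coalescer described above; none of the names or numbers are Discord's.

# Single-flight read coalescing: N concurrent reads of one hot key become
# one backend fetch that every caller awaits.
import asyncio

class Coalescer:
    def __init__(self, fetch):
        self._fetch = fetch                        # the expensive backend lookup
        self._inflight: dict[str, asyncio.Task] = {}

    async def get(self, key: str):
        task = self._inflight.get(key)
        if task is None:                           # first reader starts the real fetch
            task = asyncio.create_task(self._fetch(key))
            self._inflight[key] = task
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task                          # later readers piggyback on it

fetch_count = 0

async def slow_backend(key):
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.004)                     # ~4 ms database round-trip
    return f"read state for {key}"

async def main():
    c = Coalescer(slow_backend)
    # 1,000 concurrent reads of the same hot channel collapse into one fetch
    await asyncio.gather(*(c.get("#general") for _ in range(1000)))
    print("backend fetches:", fetch_count)         # prints 1

asyncio.run(main())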

Cache coherence under network partition. When a Read States node lost network connectivity to the database for 200 ms (a not-uncommon event during cloud-provider regional incidents), the in-memory state on that node diverged from the authoritative database for the partition's duration. The rewrite did not change this; both Elixir and Rust read from a local cache populated by writes, and both saw the same divergence pattern. Why this is structural, not runtime-specific: the cache is on the read path for latency reasons (the database round-trip is 4 ms, the cache lookup is 200 µs). Any in-process cache that's allowed to serve reads will diverge from the source of truth during a partition. Eventual consistency is a workload-shape choice, not a runtime feature; choosing it imposes the divergence regardless of which language the cache is written in.

Cold-start latency on rolling restart. When Discord rolls a new deployment, each Rust node starts with an empty in-memory cache and must warm up by serving requests from the database directly, taking the slow path. The first 30-90 seconds after a node restart show p99 spikes to 30-80 ms — back into the regime the rewrite was supposed to eliminate. Elixir had the same problem and the same magnitude of cold-start spike. The fix, deployed in 2022, was a cache-warming protocol where the new node pre-fetches the top 50,000 hot keys from a peer node before accepting traffic. This is operational tooling, again unrelated to the runtime.

The pattern: the runtime sets the floor for the steady-state tail; it does not eliminate the failure modes that arise from workload shape, network reality, or operational events. A Rust rewrite that doesn't accompany request-coalescing, partition-tolerance, and cold-start tooling will see those failure modes immediately, because the rewrite removed the GC-pause tail spike that was previously masking them in the metrics. This is the second-order lesson Discord's team learned the hard way: removing one source of tail latency makes the next-largest source visible, and the next, until you've worked through the entire stack of latency contributors. Most teams stop after the first rewrite and declare victory; Discord's 2021 retrospective explicitly named the work as ongoing.

A useful framing for the residual floor: once the GC pause is removed, what becomes the new tail floor? For Discord's Rust service, the residual p99.9 of 380 µs is set by a stack of much smaller contributors — jemalloc's madvise(DONTNEED) calls returning pages to the OS (occasional 100-200 µs stall on free), Linux scheduler context switches when the worker thread is preempted (10-50 µs), kernel epoll_wait wakeups when many sockets are ready (30-80 µs), and TCP-level retransmits during transient network blips (rare but multi-millisecond when they fire). Each contributor is small; together they set a residual floor in the low hundreds of microseconds. Discord's published 2022 follow-up work focused on the next-largest contributor (epoll wakeup distribution under high connection counts) and brought the p99.9 to 280 µs. The pattern is recursive: every layer reveals the next layer's floor, and the work of optimisation is the patient peeling-back of each.

A practical note for any team operating an Elixir or Erlang service today: install recon (the standard Erlang diagnostic library) and run recon:proc_count(memory, 10) periodically to identify the heaviest processes on each node. If the top-10 list shows processes with multi-megabyte heaps that have been alive for hours, you are in or approaching the regime that bit Discord. The mitigation playbook (process restart cycles, off-heap binary terms, hibernation) buys time but does not eliminate the regime. The runtime decision becomes urgent at the point where mitigations stop scaling — typically when the heaviest 1% of processes account for >30% of node CPU and the team has already exhausted the standard tuning levers.

Common confusions

  • "Rust is faster than Elixir" Rust is faster than Elixir on the metrics that matter for Read States — p99.9 latency, throughput per core. It is not faster on the metrics Elixir was designed for: time-to-recover from a process crash, deployment without dropping connections, code that handles 2 million concurrent conversations with no manual scheduling. The right framing is not "X is faster than Y" but "X's cost model matches this workload better than Y's". Discord's workload had drifted into a regime where Elixir's cost model was unfavourable.
  • "The BEAM is bad at GC" The BEAM's per-process GC is, for the workloads it was designed for, structurally better than any shared-heap GC. The problem is specific: long-lived processes with multi-megabyte heaps, where the per-process major collection becomes a tail-latency event. For the original WhatsApp-style workload (small short-lived processes), the BEAM's GC is excellent. Discord's data drift moved them into the regime where the BEAM's design assumptions did not hold.
  • "Rewriting in Rust always wins" A Rust rewrite trades a managed runtime for an unmanaged one. The win is deterministic memory and no GC pauses. The loss is the runtime services (supervision, hot reload, pre-emptive scheduling) that you must rebuild in application code. Teams without the engineering capacity to rebuild those services in Rust ship outages. Discord's rewrite was justified by both a measured SLO violation and the team's accumulated capacity to operate without BEAM's safety net — both conditions are necessary.
  • "GC tuning could have fixed the BEAM" Discord tried the standard GC tuning levers (fullsweep_after, min_heap_size, hibernation). They moved the floor down by 20-30% but did not eliminate the regime. At the heap sizes their hottest processes had grown to (8-12 MB), no tuning short of "do not perform major collections" could bring p99.9 below 30 ms — and the BEAM does not offer a "no major collections" mode for processes that need them. The tuning ceiling was below their SLO requirement.
  • "The rewrite was about language preference" The rewrite was about the runtime's cost model. Elixir-the-language is a remarkable productivity story; the team did not rewrite because they disliked Elixir. They rewrote because the BEAM's GC was visible in their p99.9 chart and no Elixir-language change could remove it. Conflating language choice with runtime choice is the most common framing error in rewrite discussions, and it produces both unjustified rewrites (a team rewrites because the new language is cool) and unjustified holdouts (a team refuses to rewrite because they like the current language).
  • "All long-lived services should use Rust" Most long-lived services do not face a runtime-floor problem. Razorpay's payment gateway, Zerodha's order-match engine, and Hotstar's stream-quality service all stayed on their original runtimes (JVM, JVM, Go) because their SLOs had headroom over their respective runtime floors. The rewrite is justified only when the floor is the bottleneck. Most services are bottlenecked elsewhere — network, database, application logic — and switching runtimes does not move the bottleneck.

Going deeper

Why major GC fires every ~120 seconds on a long-lived BEAM process

The BEAM uses a generational, copy-collected per-process heap. Minor collections copy the live young generation when it fills up; major collections sweep the whole heap, old generation included, and are triggered either when the old generation needs to grow or after a fixed number of minor collections — the count controlled by fullsweep_after (a number of generational collections, not a time). For a Read States process taking 200-500 updates per minute on an 8 MB heap, the generational collections triggered roughly every 30 seconds, and a fullsweep_after of 4 means a major sweep every ~2 minutes. The 2-minute periodicity Discord saw is not coincidence; it is the runtime's tuning interacting with their workload's allocation rate. Tuning fullsweep_after higher reduces major-sweep frequency but increases each sweep's duration (more accumulated garbage to collect). Tuning lower reduces each sweep's duration but increases frequency. There is no setting that eliminates the regime; you can only redistribute the cost. Why redistribution is not enough at p99.9: tail latency is set by the worst-case events in the measurement window, and every request that coincides with a sweep still sees the full pause no matter how rare the sweeps become. Stretching the interval from 2 minutes to 10 minutes shrinks the number of affected requests, but the 100 ms magnitude is untouched — the spike slides out along the percentile axis, from around p99.9 towards p99.99, rather than shrinking, and an SLO written against the extreme tail still sees it. The only way to break through the floor is to eliminate the spike, not to make it rarer.
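The redistribution point can be made with a few lines of arithmetic, sketched below under the chapter's illustrative numbers (a ~96 ms pause, sweeps every 2, 5, or 10 minutes) and the simplifying assumption that a sweep stalls only the requests arriving during the pause, with no queue-drain spread. Stretching the interval moves the percentile at which the full pause first appears, but the pause itself — and any SLO written against the extreme tail — is untouched.

# How far out in the percentile curve the full major-GC pause appears, as a
# function of sweep interval. The pause magnitude never changes.
def first_percentile_hit(pause_ms, interval_s):
    affected = pause_ms / (interval_s * 1000.0)    # fraction of wall-clock time the process is paused
    return 100 * (1 - affected), affected

for interval_s in (120, 300, 600):                 # sweeps every 2, 5, 10 minutes
    pctl, frac = first_percentile_hit(96, interval_s)
    print(f"sweep every {interval_s:3d}s -> {frac * 100:.3f}% of requests stalled "
          f"-> the full ~96 ms pause first appears near p{pctl:.2f}")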

The async runtime trap — what Rust gave Discord that Go would not have

Discord's 2020 rewrite was specifically to Rust, not to Go (which they had used for other services and which has its own GC). Why Rust? Because Go's mark-sweep GC, while better than the BEAM's per-process major GC for long-lived large heaps, still imposes a ~500 µs p99 floor that scales with heap size. For a service with billions of tuples in memory, the Go GC's stop-the-world phases produce 1-3 ms pauses — enough to violate Discord's SLO. Rust's no-GC model, with tokio async, gave Discord the runtime floor at the hardware/allocator boundary (sub-100 µs) rather than at the GC mechanism's boundary. This is the second-order distinction in runtime selection: among managed runtimes, the GC mechanism sets the floor; among unmanaged runtimes, the allocator sets the floor. Choosing between them is choosing which floor you want to live with.

Operational discipline — the rewrite that almost wasn't

Discord's published timeline understates the rewrite's operational risk. The team ran Elixir and Rust in parallel for 18 months — the Rust port was deployed on a small fraction of traffic for a year before the Elixir version was retired. During that period, the Rust version exposed at least three bugs that the BEAM's process isolation had previously masked: a race condition in the channel-membership cache that caused stale unread counts (BEAM's process-per-conversation isolation had hidden it), a memory leak in a tokio task spawned per WebSocket connection (BEAM's per-process heap had garbage-collected the leak invisibly), and a panic in the deserialiser that brought down a tokio worker (BEAM's supervisor would have restarted the process in 50 ms). Each of these took weeks to root-cause and fix. The team's published 2021 retrospective noted that the rewrite's calendar time (24 months) was more than 2.5× the original engineering estimate (9 months), and that the largest source of slippage was the bugs that BEAM's runtime had been masking. The lesson generalises: a managed runtime hides bugs, and removing the runtime exposes them. A rewrite's true cost includes the surfacing of these previously-hidden bugs, not just the line-by-line port.

When NOT to rewrite — the Hotstar / Zerodha pattern

A useful counter-pattern: most teams should not rewrite. The check is straightforward — measure your service's p99.9, identify the dominant cost component (with perf or eBPF profiling /wiki/bpftrace-the-awk-of-production), and ask whether that component is the runtime or the application. If the dominant cost is a database query, a network round-trip, or an application algorithm, the runtime is not the bottleneck and a rewrite will not move p99.9. If the dominant cost is GC (visible as periodic pauses in GODEBUG=gctrace=1 logs, or as GC-attributed time in a CPU profile), the runtime is the floor and a rewrite is on the table. Hotstar's 2024 IPL streaming SLO had headroom; Zerodha's 2021 Java tuning bought them another 4× margin. Both teams correctly chose not to rewrite. Discord chose correctly to rewrite. The discipline is the same in both directions: measure, attribute, then act.

A note on the team that did the rewrite

The Discord team that drove the Read States rewrite was small — fewer than ten engineers across the design, implementation, and migration phases. The published 2021 retrospective names the lead engineers (Mark Smith, Stanislav Vishnevskiy on the architectural review side) and credits a "deep bench" of Rust expertise that the team had built over the two years preceding the rewrite. This is a precondition that gets understated: Discord could rewrite into Rust because they had Rust engineers; a team that decides to rewrite into Rust without already having Rust engineers is signing up for a 6-month learning curve before the rewrite produces anything. The teams at Indian platforms that have successfully rewritten — Flipkart's catalogue, PhonePe's session-validation, Razorpay's internal-transfer service — all spent 12-18 months building the language expertise before committing to the rewrite. The expertise-building phase is the invisible part of the project, but it is the precondition that makes the rest possible.

Reproduce this on your laptop

# Reproduce the BEAM-vs-Rust GC pause model from this chapter.
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh

# Compare BEAM-style major-GC pauses vs Rust deterministic destruction
python3 discord_beam_gc_model.py
# Expected: a ~96 ms spike in the BEAM's extreme tail (p99.9-p99.99); Rust stays around 2-3 ms.

# To see real BEAM GC behaviour, install Elixir and run:
sudo apt install elixir
iex -e "Process.spawn(fn -> :erlang.system_flag(:scheduler_wall_time, true);
   Stream.cycle([1]) |> Stream.map(&Map.put(%{}, &1, &1)) |> Enum.take(1_000_000) end, [])"
# Watch :observer.start() — the GC column shows major collection events.

# To see Rust's deterministic destruction, install Rust and run:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo new --bin gc_test && cd gc_test
# Add a tokio main loop and observe steady allocator behaviour with `perf stat`.

The Python model captures the shape of the BEAM's GC pause distribution but not the absolute numbers (real BEAM pauses are workload-specific). The point of running it on your laptop is to internalise the asymmetry: the mean is unaffected, the tail is dominated. This is the pattern any GC-managed service exhibits, and the pattern that distinguishes "the runtime is fine" from "the runtime is the floor".

Where this leads next

This chapter closes Part 16's runtime-rewrite case studies. The next chapter (/wiki/uber-marketplace-and-the-coordination-cliff) shifts from runtime to coordination, looking at Uber's marketplace where the bottleneck was not memory or GC but the agreement rate between supply (drivers) and demand (riders). The pattern continues: every system has a load-bearing internal mechanism, and when the workload changes, that mechanism becomes the bottleneck whether or not the team has noticed yet.

The natural next reads are:

A reader who has worked through Parts 13 (language runtime) and 7 (latency / tail latency) should now be able to read Discord's 2020 blog post directly and map every claim onto a mechanism they have seen — generational GC (Part 13), tail-latency dominance (Part 7), runtime-induced latency floors (Parts 13, 14). The case studies in Part 16 are not new mechanisms; they are demonstrations that the mechanisms from the earlier parts compose into systems that hold their behaviour at scale — or fail to.

A second take-away worth naming explicitly: the rewrite was a measurement-driven decision, not an aesthetic one. Discord's team did not wake up one morning and decide Rust was prettier. They watched p99.9 climb on a chart for 18 months, attributed every percentile climb to a measurable BEAM mechanism, exhausted the BEAM-tuning ladder, and only then committed to the rewrite. The 2020 blog post that announced the rewrite is the public surface of an internal process that started in 2018 with a single Datadog dashboard panel showing the 100 ms tail spike. The teams whose runtime rewrites have succeeded — Discord, Flipkart's catalogue, PhonePe's UPI session-validation — followed the same pattern; the ones that have failed (a notable Indian fintech rewrite in 2023 that was rolled back after eight months) skipped the measurement-and-attribution phase and rewrote against assumptions that turned out to be wrong about which layer was the floor. The discipline is the boring part; the rewrite is the consequence.

The deepest take-away is the discipline that Discord, Razorpay, Zerodha, Hotstar, and Flipkart all share when faced with a runtime floor: measure first, attribute the floor to the right layer, then act. Discord measured a 100 ms p99.9 spike, attributed it to BEAM major GC, and rewrote. Hotstar measured the same kind of spike, attributed it to Go's mark-sweep, but found their SLO had headroom and did not rewrite. Both decisions were correct because both were measurement-driven. The teams that get this wrong are the ones that rewrite without measuring (a sentimental rewrite, expensive and unjustified) or refuse to rewrite despite measurement (a sentimental hold, slow death by a runtime floor that nothing else can break). The arithmetic is the same in both cases; the discipline is in honestly applying it.

The chapter after this picks up Uber's marketplace coordination problem — a different domain, the same arithmetic of measured workload meeting designed mechanism. The case studies build a recurring observation: every system that holds its behaviour at scale does so because someone, at some point, refused to keep the runtime (or the cache, or the queue, or the scheduler) that worked at smaller scale. The refusal is the work; the rewrite is the consequence.

References