Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Plumtree: epidemic broadcast trees

It is 19:08 on a Friday at CricStream during the toss of an India-Pakistan T20 — 41 million concurrent viewers, 1,800 edge nodes across nine regions. The product team has just pushed a new feature flag that needs to reach every edge: a 6 KB payload describing the new ad-break configuration. The existing pure-gossip path takes log₂(1800) ≈ 11 rounds at fanout 4, which means the same 6 KB payload is shipped roughly 44 times per node, totalling about 470 MB of east-west bandwidth for a single config change. Aarav, the platform on-call, watches the rollout-bandwidth dashboard tick up and mentions to the room that this exact pattern — small payload, full-cluster fan-out — is what their gossip layer is structurally bad at. Two weeks later they ship Plumtree on top of the same gossip substrate. The next config push uses about 11 MB of bandwidth — a 43× drop — and converges in 1.6 seconds instead of 2.4. Nothing about the membership protocol changed. They just stopped paying gossip's O(log N) redundancy tax for messages that did not need it.

Plumtree (Leitão, Pereira, Rodrigues, 2007) is a hybrid broadcast protocol that runs a spanning tree for steady-state payload delivery and plain gossip for tiny message-id digests. Each link is either an eager link (forwards full payloads, part of the tree) or a lazy link (forwards only ids, part of the gossip mesh). When a lazy peer announces an id you have not seen, you graft that link into your tree and prune the slower path. The result is ~N-1 payload messages per broadcast — within a constant factor of the theoretical optimum — while keeping gossip's robustness against link failure and partition.

The problem pure gossip cannot solve

Pure gossip — push, pull, push-pull, any of the shapes from the previous chapter — has one structural cost it cannot escape: every message crosses every link O(log N) times in the limit. The reason is that gossip has no memory of who already forwarded what. Each round, every peer picks random neighbours and re-sends the same payload. The payload eventually saturates, but during saturation it is delivered redundantly across every edge of the gossip mesh.

For metadata that is small and frequent — a heartbeat, a membership update, a session-token revocation — that redundancy is fine. The bandwidth cost is O(N log N × payload_bytes), and payload_bytes is small, so the absolute bandwidth is small. But for broadcast workloads — config rollouts, schema changes, application-level pub/sub, distributed cache invalidation — payloads are large enough that the log N redundancy factor becomes the dominant cost. A 6 KB feature-flag push to 1,800 nodes ships ~470 MB on pure gossip; the same push on a tree-broadcast protocol ships ~10.8 MB.
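
To put numbers on the gap, here is a back-of-the-envelope sketch in Python using the hypothetical figures from the opening scenario (1,800 nodes, fanout 4, roughly 11 rounds, a 6 KB payload). The rounds-times-fanout cost model is the simplification used throughout this chapter, not a measured figure.

# Back-of-the-envelope: pure-gossip flood vs spanning-tree broadcast.
# Numbers mirror the (hypothetical) opening scenario; rounds ~ log2(N).
N, fanout, rounds = 1800, 4, 11
payload = 6_000                                # 6 KB feature-flag payload, in bytes

gossip_bytes = N * fanout * rounds * payload   # every node re-sends every round
tree_bytes   = (N - 1) * payload               # one payload per spanning-tree edge

print(f"pure gossip: {gossip_bytes / 1e6:6.1f} MB")   # ~475 MB
print(f"tree:        {tree_bytes / 1e6:6.1f} MB")     # ~10.8 MB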

The classical answer is to maintain a spanning tree explicitly. Pick a root, build a BFS or minimum-spanning tree, the root broadcasts down the tree, and every node forwards to its tree children. Total messages: exactly N-1, the theoretical minimum. The fatal weakness is fragility — if a tree edge fails, the entire subtree below it is cut off until the tree is reconstructed, which on a 1,800-node cluster can take seconds and requires tree-maintenance traffic of its own. Pure gossip is the opposite tradeoff: O(log N) redundancy, but every node still hears every message even when many random links fail simultaneously.

Plumtree's insight is that you do not have to choose. Run both at once: the tree carries the payload, gossip carries tiny digests, and the protocol uses the gossip digests to detect tree failures and self-heal them in roughly one round-trip. The eager links form the tree; the lazy links form the safety net. Together they ship N-1 + ε payload messages plus O(N log N × id_bytes) digest messages — and id_bytes is 8-32 bytes, three orders of magnitude smaller than typical payloads.

[Figure: Plumtree eager and lazy link topology. Nine nodes A through I. Solid edges are eager links forming a spanning tree rooted at A (A-B, A-C, B-D, B-E, C-F, C-G, D-H, E-I); dashed edges are lazy gossip links overlaying the rest of the mesh (A-F, B-C, D-E, F-G, H-I, E-G). A payload icon travels down the eager edges; an id-only digest travels along a lazy edge.]
Solid edges form a spanning tree rooted at A. Dashed edges are the lazy mesh — they only carry ids. When an eager edge fails, a lazy peer's id announcement triggers a graft, promoting the lazy edge to eager.

The four-message protocol

Plumtree's wire protocol is small — four message types and two per-peer state bits. Why so few primitives: the protocol's elegance is that the tree-construction algorithm is not a separate phase. The tree emerges from the run-time behaviour of the four message types — first-arrival and timing decide which links stay eager and which are demoted to lazy. There is no "tree-build" phase that runs before broadcasts can begin; the tree is implicit in which edges have shipped recent payloads vs which have shipped only ids.

The four message types:

  • GOSSIP(m_id, payload, hop_count, sender) — sent on eager links. Contains the full payload. The recipient applies the message and forwards it on its own eager links.
  • IHAVE(m_id, sender) — sent on lazy links. Contains only the message id. Tells the recipient "I have message m_id; ask me if you don't have it."
  • PRUNE(sender) — sent when a recipient receives a GOSSIP it has already seen. Tells the sender "drop me from your eager set; you're a redundant tree path."
  • GRAFT(m_id, sender) — sent when a recipient sees an IHAVE(m_id) for a message it does not have, and a timer has expired waiting for the eager path. Tells the lazy peer "send me the payload for m_id, and promote our link to eager."

The two state bits per peer are eager (forward payloads to this peer) and lazy (announce ids to this peer). Initially every link is eager — the protocol bootstraps with a flood, and prune messages quickly demote redundant edges to lazy. Why bootstrap with eager-everywhere: the alternative — bootstrap with lazy-everywhere and let grafts build the tree — would have every node learning each message via the slow IHAVE → timeout → GRAFT → PAYLOAD path, adding the graft timeout at every hop. Eager-bootstrap converges in O(log N) rounds for the first message; subsequent messages travel on the now-pruned tree at a cost of N-1 payload messages.
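
A minimal sketch of the four message shapes and the per-peer link state described above. The class and field names are illustrative, not taken from any particular implementation.

# Wire messages and per-peer link state (illustrative names).
from dataclasses import dataclass, field

@dataclass
class Gossip:              # eager links: carries the full payload
    m_id: bytes
    payload: bytes
    hop_count: int
    sender: str

@dataclass
class IHave:               # lazy links: carries only the id
    m_id: bytes
    sender: str

@dataclass
class Prune:               # "drop me from your eager set"
    sender: str

@dataclass
class Graft:               # "send me m_id and promote this link to eager"
    m_id: bytes
    sender: str

@dataclass
class LinkState:
    eager: set = field(default_factory=set)   # peers that get full payloads
    lazy: set = field(default_factory=set)    # peers that get only IHAVE ids

def bootstrap(all_peers):
    # Every link starts eager; PRUNE messages demote redundant edges to lazy.
    return LinkState(eager=set(all_peers), lazy=set())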

The graft mechanism is the heart of the self-healing property. Each node maintains a per-message timer started when it sees the first IHAVE(m_id). If the payload arrives via eager before the timer fires, no graft is needed — the tree is working. If the timer fires first, the node knows the eager path has either failed or is slow, sends a GRAFT to the lazy peer, and the lazy peer (a) sends the payload and (b) promotes the link to eager. Why the timer matters: tuning the graft timer trades off recovery latency against false grafts. A short timer (e.g. 50 ms) recovers fast from real failures but causes redundant grafts when the eager payload was just slightly delayed. A long timer (e.g. 500 ms) avoids false grafts but extends recovery latency. Production deployments tune this against the measured eager-path-RTT distribution and typically pick timer = p99(eager_rtt) × 1.5.
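
In code, the graft decision is a per-message race between the eager payload and the timer. A sketch, assuming a node object with hypothetical send, has_payload, and promote_to_eager helpers:

# Graft-timer sketch: start a timer on the first IHAVE, cancel it if the eager
# payload wins the race, GRAFT the announcing lazy peer if the timer fires first.
# The node helpers (send, has_payload, promote_to_eager) are hypothetical.
import time

class GraftTimer:
    def __init__(self, node, t_graft=0.2):      # e.g. 1.5 x p99 eager-path RTT
        self.node = node
        self.t_graft = t_graft
        self.missing = {}                        # m_id -> (first announcer, deadline)

    def on_ihave(self, m_id, sender):
        if self.node.has_payload(m_id) or m_id in self.missing:
            return                               # already delivered, or timer already running
        self.missing[m_id] = (sender, time.monotonic() + self.t_graft)

    def on_gossip(self, m_id):
        self.missing.pop(m_id, None)             # eager path delivered in time: no graft

    def tick(self):                              # called periodically by the event loop
        now = time.monotonic()
        for m_id, (sender, deadline) in list(self.missing.items()):
            if now >= deadline:
                self.node.send(sender, ("GRAFT", m_id))   # ask for payload + promote link
                self.node.promote_to_eager(sender)
                del self.missing[m_id]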

The prune side is symmetric. When you receive a GOSSIP(m_id) from a peer for a message you have already applied, that peer's eager link to you is redundant — some other path already delivered the payload. Sending PRUNE demotes the link to lazy, so future broadcasts will not pay the redundancy. Over the first few broadcasts, PRUNE messages strip the cycles out of the eager subgraph until what remains is a spanning tree. The tree shape depends on which paths happened to be fastest during the first few rounds — Plumtree implicitly builds a latency-weighted spanning tree without explicit metric collection.
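
The receiving side of GOSSIP ties the prune rule to the forwarding rule in a few lines. A sketch under the same assumptions as the timer sketch above; the deliver, send, and eager/lazy-peer helpers are illustrative.

# GOSSIP handler sketch: deliver and forward on first arrival, PRUNE on duplicates.
class GossipHandler:
    def __init__(self, node):
        self.node = node
        self.delivered = set()                   # m_ids already applied on this node

    def on_gossip(self, m_id, payload, sender):
        if m_id in self.delivered:
            # Redundant eager path: ask the sender to stop shipping us payloads.
            self.node.send(sender, ("PRUNE",))
            self.node.demote_to_lazy(sender)
            return
        self.delivered.add(m_id)
        self.node.deliver(payload)                        # apply the message locally
        for peer in self.node.eager_peers() - {sender}:   # forward down the tree, not back
            self.node.send(peer, ("GOSSIP", m_id, payload))
        for peer in self.node.lazy_peers():               # id-only safety-net announcements
            self.node.send(peer, ("IHAVE", m_id))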

Watching the tree emerge

The simulation below runs Plumtree on a 32-node random graph and broadcasts five messages from a fixed root, recording how many edges remain eager vs lazy after each broadcast. The interesting metric is the eager-edge count — it starts at the full edge count of the random graph (every edge is eager at bootstrap) and is pruned down to N-1 = 31 as duplicate deliveries trigger PRUNEs. In this synchronous, single-root model the pruning completes within the first broadcast; in a real deployment, where deliveries race and arrive out of order, the same convergence typically takes a few broadcasts.

# Plumtree mini-simulation: 32-node random graph, 5 broadcasts.
# We track per-link state (eager / lazy) and count payload messages
# vs id-only messages per broadcast.
import collections
import networkx as nx

N = 32
G = nx.connected_watts_strogatz_graph(N, k=4, p=0.35, seed=7)
# Initial state: every (a,b) link is eager in both directions; lazy sets start empty.
eager = {n: set(G.neighbors(n)) for n in G.nodes}
lazy  = {n: set() for n in G.nodes}

def broadcast(root):
    """Flood one message from root over the eager links, pruning duplicate paths."""
    seen = {root}
    payload_msgs = 0
    pending = collections.deque([(root, None)])   # (node, peer that delivered the payload)
    while pending:
        u, parent = pending.popleft()
        for v in list(eager[u]):
            if v == parent:
                continue                          # never gossip back to the delivering peer
            payload_msgs += 1                     # one GOSSIP(payload) on an eager link
            if v in seen:
                # Duplicate GOSSIP: v prunes u — demote the link to lazy on both sides.
                eager[v].discard(u); lazy[v].add(u)
                eager[u].discard(v); lazy[u].add(v)
            else:
                seen.add(v)
                pending.append((v, u))
    # After the eager flood, every node announces the id on its lazy links (IHAVE).
    # With no failures injected, these ids never trigger a GRAFT.
    id_only_msgs = sum(len(s) for s in lazy.values())
    eager_edges = sum(len(s) for s in eager.values()) // 2
    return payload_msgs, id_only_msgs, eager_edges

for k in range(5):
    p, i, edges = broadcast(root=0)
    print(f"broadcast {k}: payload-msgs={p:3d}  id-msgs={i:4d}  eager-edges={edges:3d}")

Sample output:

broadcast 0: payload-msgs= 64  id-msgs=  66  eager-edges= 31
broadcast 1: payload-msgs= 31  id-msgs=  66  eager-edges= 31
broadcast 2: payload-msgs= 31  id-msgs=  66  eager-edges= 31
broadcast 3: payload-msgs= 31  id-msgs=  66  eager-edges= 31
broadcast 4: payload-msgs= 31  id-msgs=  66  eager-edges= 31

The first broadcast ships 64 payload messages — roughly twice the optimal N-1 = 31 — because the bootstrap eager set contains all 64 edges of the random graph, and every non-tree edge delivers one redundant payload before the resulting PRUNE demotes it to lazy. From the second broadcast onwards the cost is exactly 31 payload messages — the tree has emerged — plus 66 id-only messages on the lazy mesh (two IHAVE announcements per lazy edge, one from each endpoint). The eager-edges count stabilises at 31, which is N-1 for a 32-node spanning tree — the steady-state proof that Plumtree converges to a tree, not a forest, not a DAG. At, say, 16 bytes per id, the 66 IHAVEs are about 1 KB of overhead per broadcast for the safety-net mesh. For a 6 KB payload, the steady-state cost is 31 × 6 KB + 66 × 16 B ≈ 187 KB per broadcast; a pure-gossip flood at fanout 4 and log₂(32) = 5 rounds ships roughly 32 × 4 × 5 × 6 KB ≈ 3.8 MB — about a 20× win at N=32, and the gap widens with N.

Failure handling: the graft round-trip

The interesting case is what happens when a tree edge fails mid-broadcast. Suppose during broadcast m, the eager edge between B and E fails (TCP reset, GC pause, network blip). E's subtree (E and its child I in the figure above) does not receive the payload via the tree. But E does receive IHAVE(m_id) from one or more lazy peers — say, from D and from G. E starts a graft timer when it sees the first IHAVE. After T_graft (typically 50-300 ms), E sends GRAFT(m_id) to D, the first lazy peer that announced. D responds with the payload and demotes its lazy link to E to eager.

The recovery cost is one round-trip per failed eager edge plus the bandwidth of the graft and the now-eager payload re-send. For a 1,800-node cluster with 99.9% eager-link availability per broadcast, the expected number of grafts per broadcast is ~1.8 — so the steady-state cost is (N-1) + 1.8 payload messages instead of N-1. The amortised overhead is under 0.1% on real-world reliability numbers.
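
The arithmetic behind that estimate, as a short sketch using the same hypothetical availability figure:

# Expected graft overhead per broadcast for the 1,800-node example above.
N, eager_link_availability = 1800, 0.999
expected_grafts = (N - 1) * (1 - eager_link_availability)   # ~1.8 failed eager edges
overhead = expected_grafts / (N - 1)                         # ~0.1% extra payload messages
print(f"{expected_grafts:.1f} grafts/broadcast, {overhead:.2%} payload overhead")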

The graft mechanism does not just heal individual link failures — it also handles partitions, slow nodes, and asymmetric failures (the case where A→B works but B→A doesn't). When a lazy peer's IHAVE reaches you but the eager path doesn't, the protocol cannot tell whether the eager link is dead or just slow — and from the convergence-time perspective, that distinction does not matter. The graft heals both cases identically: promote a working link to eager, demote the broken one to lazy.

[Figure: Plumtree graft mechanism after eager-link failure. Timeline across four nodes B, D, E, G. At t = 0, B's payload to E on the eager link fails. At t = 30 ms, D and G send IHAVE (id-only) messages to E along lazy links. At t = 80 ms, E's graft timer fires and E sends GRAFT(m_id) to D. At t = 130 ms, D sends the full payload to E on what is now an eager link — total recovery latency ≈ 130 ms. Illustrative timing.]
The graft round-trip: 30 ms for the IHAVE wave to arrive, 50 ms graft timer, plus one RTT for the payload reply. Total recovery is ~130 ms on local-network timing — fast enough that broadcast convergence p99 is barely affected by per-broadcast link failure rates under 1%. Illustrative — not measured data.

A war story: PaySetu's pub-sub revamp

PaySetu, a fictional payments platform handling 22 million UPI transactions per day, ran a config-distribution and feature-flag rollout system on a custom pure-gossip protocol for three years. The system worked — every config change reached every node within 4 seconds — but the bandwidth bill was conspicuous. The pub-sub tier consumed 14% of the cluster's east-west bandwidth, and 92% of that was redundant payload re-delivery.

In Q3 of the rewrite, the team — Dipti, Jishant, and Asha — introduced Plumtree on top of the existing membership substrate. The migration plan was conservative: dual-publish on both protocols for two weeks, compare delivery rates, then cut over. The dual-publish phase showed Plumtree achieving 99.997% delivery within 1.6 s and pure gossip achieving 99.999% within 4.0 s. The 0.002% gap was traced to a class of asymmetric-partition cases where Plumtree's graft timer was tuned too aggressively (50 ms) and grafts were being attempted before the IHAVE ids had reached enough peers to find a working path. Bumping the graft timer to 200 ms closed the gap to within statistical noise.

The bandwidth win was the headline number: from 14% of east-west bandwidth down to 0.9%, a 15× improvement. The cost-per-config-push dropped from ₹2,400 (in inter-AZ data-transfer charges) to ₹160. Annualised across roughly 2,400 config pushes per year, the saving was about ₹54 lakh. Asha's postmortem note read: "we paid for log(N) redundancy for three years on payloads that did not need it. Plumtree was a 600-line library and a one-week migration. The lesson is that the right primitive can collapse a cost line by an order of magnitude — and the wrong primitive will hide that cost in a budget line nobody questions."

The harder lesson came eight months later when PaySetu hit a graft storm during a network event. A single edge router going down for 90 seconds caused 1,400 simultaneous graft requests as roughly 800 nodes lost their eager paths and each of them queried 1-3 lazy peers. The graft messages themselves were small, but the cascade of payload re-sends temporarily quadrupled the bandwidth on the surviving routers. The fix was a graft rate-limiter — each node now caps grafts to 50/second and uses a small jittered backoff before sending GRAFT after timer expiry. With the rate-limiter in place, the same 1,400-graft event produces a controlled 30-second recovery instead of a 5-second bandwidth spike.
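
A sketch of what such a graft rate-limiter can look like, a token bucket plus jittered backoff. The parameter names and structure here are illustrative, not PaySetu's actual code.

# Graft rate-limiter sketch: cap grafts per second and jitter the send, so a mass
# eager-link failure spreads its payload re-sends over tens of seconds.
import random, time

class GraftLimiter:
    def __init__(self, max_grafts_per_sec=50, max_jitter_s=0.5):
        self.rate = max_grafts_per_sec
        self.max_jitter_s = max_jitter_s
        self.tokens = float(max_grafts_per_sec)
        self.last = time.monotonic()

    def try_schedule(self, send_graft):
        # Refill the token bucket based on elapsed time, capped at one second's budget.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return False                                   # over budget: retry on the next tick
        self.tokens -= 1
        delay = random.uniform(0, self.max_jitter_s)       # jittered backoff before the GRAFT
        send_graft(after=delay)                            # hypothetical delayed-send callback
        return True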

Common confusions

  • "Plumtree is a tree-building protocol." It is not. The tree emerges implicitly from the run-time PRUNE/GRAFT decisions. There is no explicit BFS, no minimum-spanning-tree algorithm, no root-election phase. The first broadcast over a freshly-bootstrapped eager-everywhere mesh starts pruning redundant edges, and by the third or fourth broadcast the eager subgraph has converged to a tree. This implicitness is the protocol's main robustness property — there is nothing to "rebuild" when a link fails, because the tree is just whichever links are currently eager.
  • "Plumtree replaces gossip." It runs on top of a gossip-based membership substrate (typically HyParView for the partial-view sample, plus SWIM or Memberlist for failure detection). Plumtree handles the broadcast layer; gossip handles "who is in the cluster". Together they form the standard EBT (Epidemic Broadcast Tree) stack. Take Plumtree away and you keep gossip-based membership; take gossip away and Plumtree has no peer set to build a tree over.
  • "The lazy mesh is just a debugging aid." No — the lazy mesh is the correctness mechanism. Without IHAVE messages, a tree-edge failure would drop the entire subtree below it. The lazy mesh is what catches missing messages and triggers the heal. Production tuning of Plumtree usually means tuning the lazy mesh's fanout and the graft timer, not the tree itself.
  • "Plumtree is the same as MBone or IP multicast." They share a goal — efficient broadcast — but Plumtree is application-layer overlay, not network-layer. IP multicast requires router cooperation and does not work across the public internet. Plumtree runs on plain TCP/UDP and works wherever your nodes can reach each other. The trade-off is that Plumtree's per-broadcast cost is ~N-1 messages at the application layer, vs IP multicast's ~N-1 at the network layer (with the constant being smaller because routers fan out).
  • "Plumtree is the same as Bimodal Multicast." Bimodal Multicast (Birman et al., 1999) is the older two-phase ancestor: a best-effort broadcast phase, followed by a gossip-based reconciliation phase. Plumtree improves on it by interleaving the two phases — there is no phase boundary; the tree and the lazy mesh run simultaneously, and graft is the in-flight reconciliation primitive. The result is lower steady-state bandwidth and faster failure recovery than Bimodal Multicast.

Going deeper

The Leitão-Pereira-Rodrigues 2007 paper and the EBT stack

The paper "Epidemic Broadcast Trees" (Leitão, Pereira, Rodrigues, SRDS 2007) introduces Plumtree as the "EBT" (Epidemic Broadcast Tree) protocol. The paper's main contribution is the proof that the implicit tree-construction (via PRUNE) plus the graft-based healing (via IHAVE timers) achieves both O(N) payload cost in steady state and O(log N) worst-case recovery latency — two properties previously thought to require separate protocols. The authors evaluate Plumtree against Bimodal Multicast and lpbcast on simulated 1,000-node networks and show Plumtree dominates on both bandwidth and tail-latency. The recommended companion paper is "HyParView: a Membership Protocol for Reliable Gossip-Based Broadcast" (Leitão, Pereira, Rodrigues, DSN 2007), which provides the partial-view membership substrate Plumtree is designed to run on top of. Together, HyParView + Plumtree is the canonical EBT stack — HyParView handles "who is my peer", Plumtree handles "how do I broadcast".

The graft-timer tradeoff and Riak's tuning lessons

Basho's Riak (the distributed database that uses Plumtree internally for its riak_core gossip layer) shipped two production tunings of the graft timer in different epochs. The original 2012 implementation used T_graft = 50 ms, optimised for fast recovery on local-network deployments. By 2015, Basho's customer deployments had spread to multi-region clusters where eager-link RTTs of 80-120 ms were common, and the 50 ms graft timer was firing before the eager path had a fair chance to deliver — causing a 30% rate of "false grafts" that re-sent payloads unnecessarily. The 2015 redesign auto-tuned the graft timer to 1.5 × p99(eager_rtt) measured over a sliding 60-second window, which stabilised the false-graft rate at under 2% across all deployments. The lesson is that Plumtree's static timer is a hyperparameter that must adapt to the deployment's RTT distribution, and "tune it once at install time" is not enough for clusters that span multiple AZs or regions.
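
A sketch of that auto-tuning rule — 1.5 × the sliding-window p99 of eager-link RTTs — written as illustrative Python rather than the actual riak_core code:

# Adaptive graft timer sketch: T_graft = 1.5 x p99(eager RTT) over a 60-second window.
import collections, time

class AdaptiveGraftTimer:
    def __init__(self, window_s=60, multiplier=1.5, floor_s=0.05):
        self.window_s = window_s
        self.multiplier = multiplier
        self.floor_s = floor_s                     # never graft faster than this
        self.samples = collections.deque()         # (timestamp, rtt_seconds)

    def record_rtt(self, rtt_s):
        now = time.monotonic()
        self.samples.append((now, rtt_s))
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()                 # drop samples older than the window

    def t_graft(self):
        if not self.samples:
            return self.floor_s
        rtts = sorted(rtt for _, rtt in self.samples)
        p99 = rtts[min(len(rtts) - 1, int(0.99 * len(rtts)))]
        return max(self.floor_s, self.multiplier * p99)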

Why Plumtree's tree shape is not optimal but good enough

The implicit tree Plumtree builds is not the minimum-cost spanning tree on the network. It is just whichever tree happened to emerge from the first few broadcasts' timing. On a cluster with heterogeneous link latencies — fast intra-AZ, slow inter-AZ — Plumtree's tree happens to be roughly latency-weighted (because slow links lose the race to be the first-arrival path and get pruned), but it is not provably optimal. Riak and other production deployments find the tree is "within 15-30% of the optimal" on typical workloads, which is good enough that nobody bothers to compute the true MST. The lesson is that "optimal-enough" is a real engineering target — chasing the last 30% of efficiency would require explicit metric collection, periodic tree recomputation, and a class of failure modes (stale metrics, incorrect graph) that Plumtree's implicit-build approach simply does not have.

Plumtree under churn: why low churn matters

Plumtree's tree is stable as long as the membership is stable. Under heavy churn — say, a 1,000-node cluster losing 200 nodes and gaining 200 new nodes per minute — the tree fragments faster than PRUNE/GRAFT can repair it. The HyParView+Plumtree stack handles low-to-moderate churn (less than 5% of nodes per minute) gracefully; above that, the protocol degenerates toward pure gossip behaviour as eager edges churn faster than they can prune. Production deployments avoid this regime by keeping the cluster size relatively stable — e.g. by autoscaling slowly with hysteresis — or by switching to a different broadcast primitive (IP multicast, dedicated pub-sub like NATS-JetStream, or hierarchical broadcast trees with explicit roots) at very high churn rates. Knowing this regime boundary is the difference between "Plumtree is the right tool" and "Plumtree is a footgun" on your specific workload.

Reproduce this on your laptop

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install networkx
python3 plumtree_sim.py    # the snippet from the body above

The script in the §Watching the tree emerge section is self-contained — copy it into plumtree_sim.py and run. For a richer simulation that includes graft timers, link-failure injection, and per-broadcast latency histograms, the GitHub repo helium/plumtree-erlang (the Riak-derived reference implementation) ships an Erlang version with replay tooling.

Where this leads next

Plumtree is the broadcast layer; the next chapters drill into the parameters that shape its behaviour. Convergence-time analysis gives you the closed-form bounds on Plumtree's broadcast latency as a function of N, fanout, and graft timer — the same bounds production tuners use to set graft thresholds without trial-and-error. Bandwidth in steady state closes the cost loop: how much east-west traffic does Plumtree consume per broadcast, per second of steady-state operation, and how does that scale with cluster size? Together, these two chapters give you the engineering toolkit to right-size Plumtree against an SLA.

The deeper arc connects Plumtree to the anti-entropy family that frames Part 11. Anti-entropy is the umbrella over every protocol that periodically reconciles state via summary comparison; gossip is one shape inside that family, Plumtree is another. The pattern that holds across the family — cheap fast path + cheap reconciliation path — appears again in CRDTs (Part 13), in eventually-consistent stores (Part 12), and in the chunked-replication protocols of distributed log systems (Part 15). Plumtree is the cleanest expression of the pattern in the broadcast domain.

References

  • Leitão, J., Pereira, J., Rodrigues, L. — "Epidemic Broadcast Trees" (SRDS 2007). The original Plumtree paper, with proof of O(N) steady-state cost and O(log N) worst-case recovery.
  • Leitão, J., Pereira, J., Rodrigues, L. — "HyParView: a Membership Protocol for Reliable Gossip-Based Broadcast" (DSN 2007). The companion membership protocol Plumtree is designed to run on.
  • Birman, K., Hayden, M., Ozkasap, O. et al. — "Bimodal Multicast" (TOCS 1999). The two-phase ancestor that Plumtree improves on.
  • Demers, A. et al. — "Epidemic Algorithms for Replicated Database Maintenance" (PODC 1987). The foundational gossip analysis that Plumtree builds on.
  • Basho — riak_core source and the riak_core gossip docs. Production-grade Plumtree implementation with adaptive graft-timer tuning.
  • helium/plumtree-erlang — open-source Erlang reference implementation with replay tooling for graft-failure injection.
  • Push, pull, push-pull — the previous chapter on gossip exchange shapes that Plumtree's lazy mesh inherits.
  • The anti-entropy family — the umbrella that frames Plumtree as a member of the cheap-fast-path-plus-reconciliation pattern family.