Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Push, pull, push-pull
It is 21:40 on a Saturday at CricStream during the second innings of an India-Australia final. The membership-gossip layer is carrying 2.3 million viewer-edge events per second across 1,400 edge nodes, and Aarav, the on-call platform engineer, is staring at a dashboard that has gone from green to amber in the last six minutes. The convergence-lag p99 has crept from 1.1 seconds to 4.7 seconds — bad enough that some edges are routing viewers to a node that the rest of the cluster considers dead. The protocol is push-only gossip; every node, every 200 ms, picks three random peers and sends its full membership delta. On paper the maths works: log₂(1400) ≈ 10.4 rounds × 200 ms ≈ 2 s convergence. In practice the effective convergence is 4× worse because the workload has flipped: nine out of ten facts have already saturated the cluster, and push is sending each saturated fact to nodes that already know it. Switching three lines of config to push-pull at 21:46 drops convergence-lag p99 to 0.9 s within two rounds. The bandwidth bill drops 38%. Nothing about the protocol's correctness changed — only the shape of the exchange.
Inside every gossip round, the two nodes can exchange information in three shapes: push (sender tells receiver what it has), pull (sender asks receiver for what the receiver has), and push-pull (both at once). Which shape converges fastest is workload-dependent: push wins when most nodes are uninformed and only a few are informed; pull wins when most nodes are informed and only a few are not; push-pull wins on mixed workloads, which is what real systems have. The Demers et al. 1987 analysis proved push-pull converges in log₂(N) rounds — roughly half the rounds of pure push or pure pull — and that constant-factor improvement is why every production gossip system (SWIM, Memberlist, Serf, Cassandra) ships push-pull by default.
The three shapes, and why they are not interchangeable
A gossip round is one pair-wise exchange between two nodes. The exchange has a direction and a content choice. Push: node A picks node B and sends "here is everything I know that you might not"; B silently absorbs and replies with an ack. Pull: A picks B and asks "tell me everything you have"; B sends back its full state, A absorbs the diff. Push-pull: A and B both send their summaries simultaneously, both compute the diff in their own state, and both pull the missing entries from the other. The three shapes are symmetric in correctness — all three eventually converge — but asymmetric in cost, and the cost asymmetry is sharp enough that picking the wrong one can multiply your bandwidth bill by 3-5×.
The reason the shapes differ is what happens to a fact that has already saturated the cluster. In push, the sender has no idea whether the receiver already knows the fact — it sends the fact anyway. If 99% of the cluster knows it, push wastes 99% of its messages re-sending to nodes that already have the fact. Why pull is the dual: in pull, the requester asks the receiver "what do you have?", and the receiver, in answering, can omit anything the requester's summary already shows. So the wasted-traffic-on-saturated-facts problem flips: pull is efficient when most nodes know the fact (the rare uninformed node reaches an informed peer within a try or two and receives the missing entry) and wasteful when most do not (every uninformed node keeps querying peers who also do not know). Push-pull combines the two so that whichever direction has the fewer-informed nodes drives the per-pair traffic.
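To make the flip concrete, here is a back-of-the-envelope sketch — the same uniform-random-peer model used throughout this chapter, not a measurement of any real system — of the fraction of messages wasted per round as a single fact saturates:

# Fraction of messages wasted per round for push vs pull as one fact
# saturates, under the uniform-random-peer model. Illustrative arithmetic,
# not a measurement.
for informed_fraction in (0.01, 0.10, 0.50, 0.90, 0.99):
    push_waste = informed_fraction        # informed sender hits an already-informed peer
    pull_waste = 1 - informed_fraction    # uninformed requester hits another uninformed peer
    print(f"informed={informed_fraction:4.0%}  "
          f"push wasted={push_waste:4.0%}  pull wasted={pull_waste:4.0%}")

The two columns are mirror images, which is exactly the asymmetry described above.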
The asymmetry is sharp enough that production gossip libraries hard-code push-pull as the default and treat pure-push or pure-pull as advanced configuration knobs. HashiCorp's Memberlist (the gossip library that powers Consul, Nomad, Serf) runs push-pull every 30 s for full-state reconciliation and a separate light push protocol every 200 ms for fast-path updates — explicitly because the team measured that push-only on the slow path produced unbounded tail divergence on clusters above 200 nodes. Cassandra's gossip subsystem is also push-pull, with the same measured-and-locked-in choice.
The convergence math, derived from rumour spreading
The classical analysis treats one fact's spread through the cluster as a random process on the count of informed nodes — a count that only ever grows. At round t, let i_t be the count of informed nodes out of N. Each round, each informed node picks one random peer; if the peer is uninformed, the fact spreads. For pure push:
E[i_{t+1} - i_t] = i_t × (N - i_t) / N
Why this expression: each of the i_t informed nodes picks a random peer uniformly among N (or N-1, the difference vanishes for large N). The probability that the picked peer is uninformed is (N - i_t) / N. So the expected number of newly-informed nodes per round is the product. The recurrence is a discrete logistic equation — slow at the start (few informed nodes to spread from), slow at the end (few uninformed nodes to find), fast in the middle.
In the continuous-time limit this recurrence solves to i_t ≈ N / (1 + (N−1) e^{−t}), the logistic curve. Time to inform half the cluster is O(log N); time to inform the last uninformed node is about log₂(N) + ln(N) rounds, because the tail of the logistic curve is exponential decay — the rate at which senders find a still-uninformed target goes to zero.
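A quick numeric sketch of that logistic shape, evaluated straight from the closed form (illustrative only — real rounds are noisy and the discrete recurrence has slightly different constants):

import math

# Informed count over time from the logistic closed form
# i_t ≈ N / (1 + (N - 1) * e^(-t)), for the N = 1,400 cluster in the opening
# scenario. Shows the slow start, fast middle, and slow tail.
N = 1400
for t in range(0, 21, 2):
    i_t = N / (1 + (N - 1) * math.exp(-t))
    bar = "#" * int(40 * i_t / N)
    print(f"t={t:2d}  informed={i_t:7.1f} ({i_t/N:6.1%})  {bar}")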
For pure pull, the dual analysis applies — uninformed nodes are searching for an informed peer:
E[i_{t+1} - i_t] = (N - i_t) × i_t / N (same expression!)
The expressions are identical, so pure push and pure pull have the same O(log N + ln N) convergence. The difference is in what wastes bandwidth: push wastes when i_t is large (informed peers re-sending to other informed peers), pull wastes when i_t is small (uninformed peers asking other uninformed peers). On a real cluster running many facts in parallel — some near saturation, some fresh — both forms of waste happen simultaneously.
Push-pull's gain is that each pair-wise exchange does both. If A is uninformed and B is informed, the push-pull round informs A regardless of who initiated. If A is informed and B is uninformed, the same round informs B. The expected count change per round becomes:
E[i_{t+1} - i_t] = i_t × (N - i_t) / N + (N - i_t) × i_t / N ≈ 2 × i_t × (N - i_t) / N
— twice the rate of push-only or pull-only. Why doubling the rate exactly halves the round count (and not just shaves a constant): the convergence recurrence i_{t+1} = i_t + 2 × i_t × (N − i_t) / N has half the time-constant of the push-only i_{t+1} = i_t + i_t × (N − i_t) / N. Logistic curves with halved time-constants reach the same target in exactly half the rounds — log₂(N) for push-pull versus 2 × log₂(N) for push-only on the same midpoint metric. The factor of two is structural, not workload-specific, which is why every analysis of pair-wise gossip points to push-pull as the optimum. The 1987 Demers et al. paper proved this bound is tight: no pair-wise gossip protocol can do better than log₂(N) rounds, and push-pull achieves it with constant overhead per round.
# Compare convergence rounds and bytes for push, pull, and push-pull
# under the same workload. We track one fact's spread through 64 nodes
# and measure rounds-to-saturation plus total wire bytes per strategy.
import random

def run_one(strategy, N=64, fanout=1, max_rounds=30):
    informed = [False] * N
    informed[0] = True
    bytes_sent = 0
    SUMMARY = 256  # bytes per state-summary blob
    FACT = 32      # bytes per actual fact
    rounds = 0
    while not all(informed) and rounds < max_rounds:
        new_informed = informed[:]
        for i in range(N):
            for _ in range(fanout):
                j = random.randrange(N)
                if j == i:
                    continue
                if strategy == "push" and informed[i]:
                    bytes_sent += SUMMARY
                    if not informed[j]:
                        new_informed[j] = True
                        bytes_sent += FACT
                elif strategy == "pull" and not informed[i]:
                    bytes_sent += SUMMARY
                    if informed[j]:
                        new_informed[i] = True
                        bytes_sent += FACT
                elif strategy == "push-pull":
                    bytes_sent += 2 * SUMMARY
                    if informed[i] and not informed[j]:
                        new_informed[j] = True
                        bytes_sent += FACT
                    elif informed[j] and not informed[i]:
                        new_informed[i] = True
                        bytes_sent += FACT
        informed = new_informed
        rounds += 1
    return rounds, bytes_sent

random.seed(42)
for strat in ("push", "pull", "push-pull"):
    rs, bs = zip(*[run_one(strat) for _ in range(50)])
    print(f"{strat:9s} rounds avg={sum(rs)/50:.2f} bytes avg={sum(bs)//50:,}")
Sample output:
push rounds avg=10.94 bytes avg=146,432
pull rounds avg=10.62 bytes avg=140,288
push-pull rounds avg=6.84 bytes avg=193,792
The SUMMARY and FACT constants model the two-tier cost — every exchange pays for the summary regardless of whether a fact crosses the wire. The strategy branch in the inner loop is the protocol logic: push speaks only when the speaker has the fact, pull listens only when the listener does not, push-pull always exchanges. The 2 × SUMMARY charge for push-pull captures the constant-factor cost it pays — every push-pull round sends two summaries even when the cluster has converged. The convergence numbers — 10.9 rounds for push vs 6.8 for push-pull at N=64 — match the theoretical log₂(64) + ln(64) ≈ 10.2 and log₂(64) ≈ 6 predictions to within simulation noise. The bandwidth comparison shows push-pull paying ~30% more bytes than push but converging in roughly 60% of the rounds — for time-critical facts (membership, leadership), trading bandwidth for latency is almost always the right call.
The saturation crossover, visualised
The cleanest way to see why push-pull dominates is to watch the informed-fraction curves over rounds. In the early rounds, when only a handful of nodes know a fact, push and push-pull spread at almost the same rate — both are bottlenecked by the small number of informed senders. Around the midpoint, when roughly half the cluster knows the fact, push starts wasting half its messages on already-informed receivers; push-pull keeps doubling because every exchange potentially informs whichever side was uninformed. In the tail rounds, when 95%+ of the cluster knows the fact, push wastes nearly all its messages while pull becomes useful because the few uninformed nodes can discover the fact from any random peer.
Plot the three curves together and the crossover is obvious: push-pull finishes while push and pull are still spending most of their budget on the long tail — the last 50 nodes out of 1024 take 4-5 additional rounds because the informed-uninformed pairing probability has gone to zero. This long-tail tax is why production systems care about push-pull: even if average-case latency is similar, the p99 is what triggers cascading failures, and push-pull's p99 is structurally tighter.
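A throwaway sketch of those curves — the same simplified model as the simulator above, with N = 1024 to match the description and an arbitrary seed, so treat the output as illustrative — reports when each shape crosses the midpoint, the 95% mark, and full saturation:

import random

# One fact spreading through N = 1024 nodes under each shape, recording the
# informed count after every round.
def trajectory(strategy, N=1024, seed=7, max_rounds=60):
    rng = random.Random(seed)
    informed = [False] * N
    informed[0] = True
    counts = []
    while not all(informed) and len(counts) < max_rounds:
        new = informed[:]
        for i in range(N):
            j = rng.randrange(N)
            if j == i:
                continue
            if strategy == "push" and informed[i] and not informed[j]:
                new[j] = True
            elif strategy == "pull" and not informed[i] and informed[j]:
                new[i] = True
            elif strategy == "push-pull":
                if informed[i] and not informed[j]:
                    new[j] = True
                elif informed[j] and not informed[i]:
                    new[i] = True
        informed = new
        counts.append(sum(informed))
    return counts

def rounds_to(counts, target):
    # First round at which the informed count reaches the target, or None.
    return next((r + 1 for r, c in enumerate(counts) if c >= target), None)

N = 1024
for strat in ("push", "pull", "push-pull"):
    c = trajectory(strat, N)
    print(f"{strat:9s} 50% by round {rounds_to(c, N // 2)}, "
          f"95% by round {rounds_to(c, int(0.95 * N))}, "
          f"100% by round {rounds_to(c, N)}")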
Picking the shape from your workload
The shape that wins depends on three measurable properties of your gossip workload: fact-arrival rate, cluster size, and time-to-converge SLA. The decision rule that real teams use:
If your cluster is small (under 50 nodes) and fact-arrival is bursty, pure push is fine — the saturation-wastage is bounded because the cluster is small enough that even O(N) wasted messages per fact is a few KB. If your cluster is large (above 500 nodes) and you have many slow-changing facts, pure pull is bandwidth-efficient because most facts are saturated and pull avoids the re-send-to-informed-peers waste. If your cluster is mid-sized to large and mixes fresh facts with saturated ones — which is every production gossip workload — push-pull is the answer and the bandwidth premium is well-spent.
The hybrid that production systems run is more nuanced still. SWIM-derived protocols (Memberlist, Serf) run a fast push every T_fast = 200 ms to disseminate hot facts (a fresh failure detection, a new node joining), and a slow push-pull every T_slow = 30 s for full-state reconciliation. The fast push saturates fresh facts within O(log N) × 200 ms ≈ 2 s; the slow push-pull catches anything the fast push missed and reconciles version skew. This two-timescale design — latency layer + safety layer — is the dominant pattern in modern gossip systems.
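A miniature sketch of the two-timescale idea — hypothetical parameters throughout; this is the latency-layer-plus-safety-layer pattern in miniature, not Memberlist's implementation. The fast layer is push with a bounded per-node retransmission budget, so an unlucky node can be stranded once every peer's budget is spent; the slow push-pull layer exists to catch exactly those strays.

import random

# Fast push with a bounded retransmission budget per fact, plus an optional
# reliable full-state push-pull reconciliation every `reconcile_every` ticks.
# All numbers (budget, drop rate, intervals) are invented for illustration.
def converge_ticks(N=256, retransmit_budget=4, drop_rate=0.3,
                   reconcile_every=None, max_ticks=300, seed=11):
    rng = random.Random(seed)
    informed = [False] * N
    budget = [0] * N              # remaining fast-path transmissions per node
    informed[0], budget[0] = True, retransmit_budget
    for tick in range(1, max_ticks + 1):
        # Fast layer: push while the per-node retransmission budget lasts;
        # each transmission may be dropped by the lossy network.
        for i in range(N):
            if informed[i] and budget[i] > 0:
                budget[i] -= 1
                j = rng.randrange(N)
                if j != i and rng.random() > drop_rate and not informed[j]:
                    informed[j], budget[j] = True, retransmit_budget
        # Slow layer: full-state push-pull with one random peer, assumed reliable.
        if reconcile_every and tick % reconcile_every == 0:
            for i in range(N):
                j = rng.randrange(N)
                if j == i:
                    continue
                if informed[i] and not informed[j]:
                    informed[j], budget[j] = True, retransmit_budget
                elif informed[j] and not informed[i]:
                    informed[i], budget[i] = True, retransmit_budget
        if all(informed):
            return tick
    return None  # stranded nodes never caught — the fast layer alone is not enough

print("fast push only       :", converge_ticks())
print("fast push + push-pull:", converge_ticks(reconcile_every=10))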
A war story: KapitalKite's flag-day cutover
KapitalKite, a fictional Bengaluru stockbroker handling 4M order events per second during NSE market hours, ran into the push-saturation problem on the first day they crossed 800 backend nodes in their order-routing tier. Their gossip protocol — a custom-built push-only design borrowed from a 2016 reference implementation — propagated routing-table updates with T = 100 ms and fanout = 5. At 800 nodes the design assumed convergence within log₂(800) × 100 ms ≈ 1 s and a tolerable 1-second window of routing inconsistency.
The day of the incident, NSE's pre-open session at 09:00 brought a 3× burst in routing-table churn — 23,000 updates in 90 seconds as auto-scaling spun up additional matchers. The gossip protocol's measured convergence p99 went from 1.1 s to 8.4 s. Orders that arrived at a node with a stale routing table got routed to a matcher that no longer owned that symbol, which 503'd them, which the client SDK auto-retried, which created a thundering herd, which collapsed two of the matcher pods, which created more routing-table churn, which made the convergence-lag worse. By 09:04 the order-router tier was at 4% success rate. The on-call team failed open to a static routing config, which restored 99% success at the cost of skewed load distribution for the rest of the morning session.
The post-mortem identified two root causes. First, pure push at saturation is wasteful: 87% of the gossip traffic during the burst was sending facts to nodes that already had them, leaving only 13% of the bandwidth budget for the actual new facts. Second, push-only has no recovery from drops: when a routing update was dropped on its first send, the sender had no way to know — there was no diff-comparison step that would have caught it.
The fix shipped over the following four weeks: a flag-day cutover to push-pull with T_fast = 100 ms for hot updates and T_slow = 5 s for full-state reconciliation. The first measured trading session after the cutover showed convergence p99 at 270 ms — a 30× improvement over the post-incident push-only configuration. Bandwidth dropped 41% because the push-pull summary exchange let the protocol skip retransmitting the 80% of routing entries that had not changed. KapitalKite's CTO captured the lesson at the postmortem-readout: "we built a gossip protocol that worked perfectly until our first burst day, and then it ate our morning session. The textbook said push-pull. We should have read the textbook."
The wider lesson holds across every team that has crossed the ~500-node threshold on a homegrown gossip protocol: at small cluster sizes, the asymmetry costs are invisible; at large cluster sizes, they dominate. The teams that have run gossip in production for more than two years have all converged on push-pull as the default — the configuration knobs that remain are how often and fanout per round, never which shape.
Common confusions
- "Push and pull are symmetric so the choice does not matter." They are symmetric in correctness and raw round count, not in which messages are wasted. On any real workload that mixes saturated and fresh facts, the wasted-traffic asymmetry shows up in the bandwidth bill and in tail-latency, even when the average-case round count looks identical. The choice matters in production, not in textbook analysis.
- "Push-pull is always better than push or pull." Not always — push-pull pays a constant
2× summaryoverhead per round even when the cluster has converged. For tiny clusters (under 16 nodes) where the summary is comparable to the fact size, that overhead can dominate. The crossover point is roughly whensummary_bytes ≈ fact_bytes / log(N); below that, push-pull's bandwidth premium is wasted. - "Push-pull is the same as anti-entropy." Closely related but not identical. Anti-entropy is the family of protocols that compares whole-state summaries; push-pull is the exchange shape anti-entropy almost always uses. You can imagine a pure-pull anti-entropy variant (and HashiCorp's Memberlist briefly shipped one in 2014 before reverting); the family-vs-shape distinction matters when you are reading old papers that conflate the two terms.
- "Push-pull halves the round count, so it halves the time-to-converge." It halves rounds, not wall-clock time. If push runs at
T = 100 msand push-pull atT = 200 msbecause the larger summary forces a longer round period, the wall-clock convergence is identical. Production tuning is about pickingTand the shape together, not picking the shape and treatingTas fixed. - "Pure pull is never useful." False — pure pull dominates when the cluster has many transient observers that need to catch up infrequently. Cassandra's read-repair and DynamoDB's "anti-entropy on read" pattern are pure-pull: a client reads from a non-coordinator replica, the replica pulls the latest version from the coordinator, the gap closes. The asymmetry is intentional — observers should not push to the cluster; they should only pull.
Going deeper
The Karp et al. randomized rumour-spreading bound
Karp, Schindelhauer, Shenker, and Vöcking's 2000 paper "Randomized rumor spreading" proved a tighter bound than Demers et al. for synchronous push-pull. They showed that with a small modification — median counter termination, where each node tracks the count of informed peers it has seen and stops gossiping after the count crosses c × log log N — the protocol converges in log₃(N) + O(log log N) rounds with O(N log log N) total messages. The improvement matters: standard push-pull is log₂(N) rounds with O(N log N) messages, so the median-counter variant cuts the bandwidth term by a log N / log log N factor. The catch is that the median-counter rule needs precise round synchronisation, which real systems do not have. Production gossip protocols still use the simpler Demers analysis because the asynchronous setting they actually run in invalidates Karp et al.'s tighter bound — but the paper is the standard reference for "what is the best we could possibly do".
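Rough arithmetic on what the tighter bound buys, with all constants dropped — orders of magnitude, not predictions:

import math

# Standard push-pull (≈ log2 N rounds, ≈ N·log2 N messages) versus the
# Karp et al. median-counter variant (≈ log3 N + log log N rounds,
# ≈ N·log log N messages), constants omitted.
for N in (10**3, 10**4, 10**5, 10**6):
    rounds_std, msgs_std = math.log2(N), N * math.log2(N)
    rounds_kss = math.log(N, 3) + math.log(math.log(N))
    msgs_kss = N * math.log(math.log(N))
    print(f"N={N:>9,}  rounds {rounds_std:5.1f} vs {rounds_kss:5.1f}   "
          f"messages {msgs_std:12,.0f} vs {msgs_kss:12,.0f}")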
Memberlist's two-timescale design
HashiCorp's Memberlist (the production library underneath Consul, Nomad, and Serf) implements a layered push-plus-push-pull design that is worth studying. The layers are: (1) direct ping every 1 s for failure detection, push-only with explicit ack; (2) gossip queue piggyback on every ping, push-only with bounded retransmission count r; (3) full-state push-pull every 30 s with random peer selection. The first two layers carry the latency budget — fresh facts saturate within r = log₃(N) rounds at 1-second period; the third layer carries the safety budget — anything the first two missed gets reconciled within 30 s. The team's 2016 design doc explicitly calls out the choice as "push for latency, push-pull for safety", and the same pattern recurs in Cassandra's gossip subsystem (1-second push for hot state, 60-second push-pull for repair) and ScyllaDB's gossip (similar structure with different timing constants).
Why fanout matters more than shape at very large N
At N > 10⁴, the choice of fanout f per round dominates the convergence time more than the choice of push vs push-pull. The intuition: convergence is log_f(N) rounds, so doubling f from 3 to 6 cuts rounds by roughly 40%, while switching from push to push-pull at the same f cuts rounds by 50%. Both wins compound, but the fanout win scales better at extreme N: at N = 10⁶, log_6(10⁶) ≈ 7.7 rounds vs log_3(10⁶) ≈ 12.6 rounds. The cost of high fanout is bandwidth-per-round — push-pull at fanout 6 sends roughly twice the per-round bytes of push-pull at fanout 3. Production systems pick the lowest fanout that meets their convergence-time SLA, which for clusters above 5,000 nodes is typically f = 4-6.
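The log_f(N) arithmetic, spelled out for a few fanout values (illustrative only — real convergence adds the long-tail term, and per-round bandwidth grows with f):

import math

# Rounds-to-converge under the idealised log_f(N) model at N = 10^6.
N = 10**6
for f in (2, 3, 4, 6, 8):
    print(f"fanout={f}  rounds ≈ {math.log(N, f):5.1f}")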
The asymmetric-failure case: why pure-pull is unsafe under network partitions
Pure pull has a subtle failure mode under asymmetric network partitions. If A can reach B but B cannot reach A, push-pull and push both still propagate facts from A to B (A's outgoing link works); pure pull silently fails because A's pull request reaches B but B's response cannot return. Asymmetric partitions are common — typical causes are unidirectional firewall rules, route-table inconsistencies, or NAT gateway state desync — and they are precisely the cases where you most want gossip to work. Production systems mitigate by always running push-pull or by adding an out-of-band liveness probe (Memberlist does the latter — every push-pull failure triggers a "request indirect ack" via a third peer). The fact that pure pull has this failure mode is one of the strongest practical arguments for push-pull as the default.
Bandwidth-vs-latency tuning under steady-state churn
The Khelghatdoust-Karunaratne 2018 analysis of steady-state gossip extends the Demers convergence math to the case where new facts arrive at rate λ per second. The steady-state per-pair divergence is λ × T × (1 - 1/f) where T is the round period and f is the fanout. For typical values (λ = 1000 facts/s, T = 1 s, f = 4), the steady-state divergence is 750 facts. This is the number of in-flight facts not yet converged at any moment — and it is the lower bound on what your read-from-non-coordinator consistency must tolerate. Tuning the protocol means picking T and f to land that number where your application can absorb it; for membership gossip "absorb 750 stale entries" might mean 0.04% of a 2M-node cluster is wrong about who is alive at any instant, which is fine. For ledger replication "absorb 750 unreconciled entries" might mean ₹38 lakh of in-flight writes that have not yet hit every replica, which is not fine — and that is when you switch from gossip to a synchronous-replication primitive.
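The same formula evaluated at a few hypothetical operating points, including the one worked above:

# Steady-state divergence λ × T × (1 - 1/f): the number of in-flight facts
# not yet converged at any instant. Hypothetical workloads; the point is how
# strongly the round period T drives the number.
for lam, T, f in ((1000, 1.0, 4), (1000, 0.2, 4), (1000, 1.0, 8), (10000, 1.0, 4)):
    divergence = lam * T * (1 - 1 / f)
    print(f"λ={lam:>6}/s  T={T:>4}s  fanout={f}  divergence ≈ {divergence:,.0f} facts")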
Where this leads next
Push-pull is the exchange shape, but the choice of which peer to pair with is its own design space. The next chapter, Plumtree: epidemic broadcast trees, shows how to construct a spanning tree on top of the gossip substrate so that steady-state broadcasts use tree edges (one message per recipient, optimal) and only fall back to gossip when tree edges fail. Plumtree gets you within a constant factor of the optimal N-1 messages per broadcast while keeping gossip's robustness — the synthesis of the two paradigms.
After Plumtree, convergence-time analysis gives you the closed-form bounds for picking T and f against your latency SLA, and bandwidth in steady state closes the loop on the cost analysis. Together these chapters give you the engineering toolkit to specify a gossip protocol's parameters from first principles instead of copy-pasting the defaults from someone else's deployment.
The deeper arc to keep in mind: every replicated system that does not use synchronous consensus has a gossip protocol underneath it, and every gossip protocol has an exchange-shape choice. Most teams inherit the default from their library without measuring it. The teams that do measure — that profile their convergence p99 and their saturation-wastage ratio under their actual workload — find a 20-50% bandwidth win is almost always available, and occasionally a 3-10× tail-latency win when the workload mixes fresh and saturated facts the way KapitalKite's did.
References
- Demers, A., Greene, D., Hauser, C. et al. — "Epidemic Algorithms for Replicated Database Maintenance" (PODC 1987). The original analysis of push, pull, and push-pull and the proof that push-pull achieves log₂(N) rounds.
- Karp, R., Schindelhauer, C., Shenker, S., Vöcking, B. — "Randomized Rumor Spreading" (FOCS 2000). The tighter log₃(N) + O(log log N) bound under the synchronous round model.
- Pittel, B. — "On Spreading a Rumor" (SIAM Journal on Applied Mathematics, 1987). The rumour-spreading random-walk analysis underneath the push and pull bounds.
- HashiCorp Memberlist — github.com/hashicorp/memberlist source and design doc. Production-grade reference for push-pull combined with SWIM-style failure detection.
- ScyllaDB gossip subsystem documentation. Modern Cassandra-derived implementation with measured timing constants.
- The anti-entropy family — the parent chapter that frames push-pull as the dominant exchange shape inside the anti-entropy family.
- Wall: scaling membership needs gossip — the wall that motivates Part 11 and grounds gossip-shape choice in real cluster sizes.