Byzantine consensus — PBFT and HotStuff

It is 03:42 IST and PaySetu's settlement-net team is in a war room. A misconfigured firmware update has pushed a corrupted RocksDB build to one of the five replicas of their interbank settlement ledger. The corrupted node is not down — that would be easy. It is answering every consensus probe with the right protocol headers, the right term numbers, the right log-index numbers, but the application-layer payloads it serves are nonsense: it acks AppendEntries for log entries it never persisted, it claims to have applied transactions it dropped, and it gossip-replies with bogus committed-index values. The Raft cluster cannot detect this because Raft assumes a fail-stop model — nodes either work correctly or stop responding entirely. A node that lies in protocol-conformant ways is outside Raft's threat model. Why this matters for the rest of the chapter: Raft's safety proof rests on the assumption that every message a node sends truthfully reflects its state — messages may be delayed or lost, but they never lie. Lies are not in the proof's universe. The moment one node sends an AppendEntriesResponse{success: true, matchIndex: 9999} for entries that were never appended, the leader's commitIndex advances based on a fraudulent quorum, and committed-but-not-actually-replicated entries become "lost" on a real failover. Byzantine consensus exists to handle exactly this case.

Byzantine consensus tolerates up to f actively malicious replicas out of 3f+1 total — the bound is tight, not a tuning knob. PBFT (1999, Castro–Liskov) achieves this with three message rounds (a pre-prepare from the primary, then all-to-all prepare and commit), giving O(n²) messages per decision and ~10ms latency on a LAN of 4 nodes. HotStuff (2019, Yin et al.) reduces communication to O(n) per round by chaining views through threshold signatures and a star topology, making consensus over 100+ nodes feasible — which is why modern proof-of-stake blockchains such as Diem, Aptos, and Sui run HotStuff variants.

Why fail-stop is not enough — the firmware-bug threat model

Classical consensus protocols (Paxos, Raft, Viewstamped Replication) all assume the fail-stop failure model: a node either follows the protocol exactly or stops responding. There is no in-between. The proofs of safety — "no two replicas decide different values for the same slot" — depend on this. Stopping is fine; lying is not.

In real systems, the fail-stop assumption breaks in three places. Firmware bugs can cause one machine to compute wrong checksums, corrupt RocksDB writes, or skip fsyncs while still acking them. Cosmic-ray bit-flips in unprotected RAM produce one-off message corruptions that look protocol-valid. Malicious actors — either external attackers who compromise a node, or insider threats — can deliberately send protocol-conformant lies. PaySetu's incident was firmware; for KapitalKite running multi-broker settlements with seven different counterparties, the threat model has to include "one of the seven counterparties is actively trying to commit a transaction the others did not agree to". Why this is qualitatively different from a crash: a crashed node contributes zero votes; the protocol works around its absence as long as a quorum can still form among the remaining healthy nodes (Raft needs a majority, ⌊n/2⌋ + 1, so it tolerates f = ⌊(n-1)/2⌋ crashes). A lying node contributes a wrong vote that counts toward quorum unless explicitly cross-checked. To distinguish lies from truth without out-of-band verification, you need a quorum of votes large enough that the lies cannot dominate — and that turns out to require 2f + 1 matching votes out of 3f + 1 replicas, the famous Byzantine bound.

The bound n ≥ 3f + 1 is provably tight. The argument is short: any decision must be made on a quorum of size q. To stay live when f nodes are crashed or unreachable, you need q ≤ n - f. To stay safe when f nodes are Byzantine, any two quorums must overlap in at least one honest node — otherwise two quorums could decide different values with only the liars in common. Two quorums of size q overlap in at least 2q - n replicas, so you need 2q - n ≥ f + 1. Combining the two inequalities (Lamport, Shostak, Pease 1982): 2(n - f) - n ≥ f + 1, i.e. n ≥ 3f + 1. So tolerating 1 Byzantine node needs 4 replicas; 2 needs 7; 100 (the typical PoS blockchain validator count) tolerates f = 33.

Figure: Why Byzantine consensus needs 3f + 1 replicas — n = 3f + 1 is the smallest cluster that can outvote the liars. Left: 4 replicas (R1–R3 honest, R4 lying), quorum size 2f + 1 = 3, so any 3-quorum contains at least 2 honest votes and any two quorums overlap in at least one honest node. Right: why 3 replicas would fail (n = 3, f = 1) — with quorum size 2, {A, B} can see value v while {C, B} sees v': split brain.
Illustrative — with 4 replicas tolerating 1 Byzantine, any 3-quorum contains at least 2 honest votes. Two quorums always overlap in at least 1 honest node, who refuses to vote for both v and v'. With only 3 replicas (right), the quorum size of 2 can include the liar entirely, so the system cannot distinguish two contradictory "majorities".

PBFT — three rounds, all-to-all

Practical Byzantine Fault Tolerance (PBFT, Castro and Liskov 1999) was the first protocol to make Byzantine consensus efficient enough to use. Before PBFT, Byzantine consensus protocols required exponential message complexity or randomised termination. PBFT achieves consensus in three communication rounds with O(n²) messages per decision — and crucially, no exponential blow-up.

A PBFT decision proceeds in views. Each view has a designated primary (rotates round-robin); the other replicas are backups. The primary's job is to order incoming requests; the backups' job is to verify the primary is not lying about that ordering. The three rounds are:

  1. Pre-prepare: the primary assigns a sequence number n to a client request m and broadcasts <PRE-PREPARE, view, n, m> signed with its private key. This round is O(n) messages — one from primary to every backup.

  2. Prepare: each backup that accepts the pre-prepare broadcasts <PREPARE, view, n, digest(m), my_id> to all other replicas. A replica considers the request "prepared" once it holds the pre-prepare plus 2f matching PREPARE messages from distinct replicas (its own included) — 2f + 1 replicas in total vouching for the same (view, n, digest). This round is O(n²) — every backup talks to every other backup.

  3. Commit: once prepared, each replica broadcasts <COMMIT, view, n, digest(m), my_id>. A replica considers the request "committed" once it has collected 2f + 1 matching COMMIT messages. After commit, the replica executes the request and sends the reply to the client. This round is also O(n²).

The client waits for f + 1 matching replies — the smallest set guaranteed to contain at least one honest reply, since the f malicious replicas cannot produce f + 1 matching replies on their own. Why three rounds, not two: pre-prepare alone is not enough because a malicious primary could send different pre-prepares to different backups (sequence 42, request A to backup B1 and sequence 42, request B to backup B2). The prepare round catches this by making every backup announce what it saw — once 2f + 1 replicas all vouch for the same (view, sequence, digest), no conflicting value can also gather a quorum. The commit round is what makes the ordering survive view changes: it ensures that if any honest replica committed m at sequence n, then every honest replica that commits sequence n in any later view also commits m. Without it, a view change that lands after prepare but before commit could re-order the slot to a different value. Roughly: pre-prepare and prepare pin the order within a view; commit pins it across views.
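
To make the two thresholds concrete, here is a minimal, illustrative sketch — not PBFT's wire protocol; the Slot class and its method names are invented for this example, and signature checks and retransmission are omitted — that tracks when a single (view, sequence) slot becomes prepared and then committed at one replica:

# pbft_slot_sketch.py — illustrative only: tracks PBFT's "prepared" and
# "committed" thresholds for one (view, sequence) slot at one replica.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Slot:
    n_replicas: int                                   # cluster size, n = 3f + 1
    digest: Optional[str] = None                      # digest from the accepted pre-prepare
    prepares: Set[int] = field(default_factory=set)   # replica ids with matching PREPAREs
    commits: Set[int] = field(default_factory=set)    # replica ids with matching COMMITs

    @property
    def f(self) -> int:
        return (self.n_replicas - 1) // 3

    def on_pre_prepare(self, digest: str) -> None:
        self.digest = digest                          # accept the primary's ordering

    def on_prepare(self, replica_id: int, digest: str) -> None:
        if digest == self.digest:
            self.prepares.add(replica_id)

    def on_commit(self, replica_id: int, digest: str) -> None:
        if digest == self.digest:
            self.commits.add(replica_id)

    @property
    def prepared(self) -> bool:
        # Pre-prepare plus 2f matching prepares from distinct replicas.
        return self.digest is not None and len(self.prepares) >= 2 * self.f

    @property
    def committed(self) -> bool:
        # Prepared plus 2f + 1 matching commits.
        return self.prepared and len(self.commits) >= 2 * self.f + 1

slot = Slot(n_replicas=4)                 # f = 1
slot.on_pre_prepare("d1")
for backup in (2, 3):                     # 2f = 2 matching prepares arrive
    slot.on_prepare(backup, "d1")
print("prepared:", slot.prepared)         # True
for replica in (1, 2, 3):                 # 2f + 1 = 3 matching commits arrive
    slot.on_commit(replica, "d1")
print("committed:", slot.committed)       # True

The point of the sketch is only the thresholds: 2f matching prepares on top of the pre-prepare, then 2f + 1 matching commits.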

The view change — what happens when the primary lies

If the primary stops sending pre-prepares (or sends bogus ones the backups reject), backups time out and trigger a view change. Each backup broadcasts a <VIEW-CHANGE, new_view, last_stable_checkpoint, prepared_set, my_id> carrying its set of "prepared but not committed" requests. The new primary collects 2f + 1 view-change messages, computes the union of prepared sets, and broadcasts a <NEW-VIEW> message carrying re-issued pre-prepares for every prepared request found in the union. The view change is O(n²) messages plus the inflight workload — it is the protocol's expensive step, and the reason PBFT does not scale beyond ~30 nodes in practice.

A working PBFT message-counting simulation

# pbft_message_complexity.py — counts messages and signatures per consensus
# decision under PBFT's three-round protocol.

def pbft_messages(n: int, requests: int = 1) -> dict:
    """Count PBFT messages for a steady-state run with the same primary."""
    f = (n - 1) // 3
    # Pre-prepare: primary -> all backups (n - 1 messages).
    pre_prepare = (n - 1) * requests
    # Prepare: each backup broadcasts to all replicas; (n - 1) backups, (n - 1) targets each.
    prepare = (n - 1) * (n - 1) * requests
    # Commit: every replica broadcasts to every other; n senders, (n - 1) targets each.
    commit = n * (n - 1) * requests
    # Client waits for f + 1 replies (cheap, ignored in protocol cost).
    total = pre_prepare + prepare + commit
    # Signatures: every message is signed by sender, verified by receiver.
    sigs_made = (n - 1) + (n - 1) + n          # per request
    sigs_made *= requests
    sigs_verified = pre_prepare + prepare + commit
    return {
        "n": n, "f": f, "requests": requests,
        "pre_prepare": pre_prepare, "prepare": prepare, "commit": commit,
        "total_messages": total,
        "messages_per_request": total // requests,
        "sigs_made": sigs_made, "sigs_verified": sigs_verified,
    }

def view_change_cost(n: int, prepared_inflight: int = 10) -> dict:
    f = (n - 1) // 3
    # Simplification: in real PBFT every replica multicasts VIEW-CHANGE to all
    # replicas (O(n²) messages); here we count only the n - 1 copies the new primary needs.
    # New primary collects 2f + 1 of them, broadcasts NEW-VIEW: 1 message * (n - 1) targets.
    # NEW-VIEW carries re-issued pre-prepares for prepared_inflight requests.
    view_change = n - 1
    new_view = (n - 1) * (1 + prepared_inflight)
    return {"n": n, "view_change_messages": view_change + new_view}

for n in [4, 7, 13, 31, 100]:
    r = pbft_messages(n, requests=1000)
    v = view_change_cost(n)
    print(f"n={n:3d} f={r['f']:2d}  msgs/req={r['messages_per_request']:6,d}  "
          f"sigs/req={r['sigs_made']//r['requests']:3d}/{r['sigs_verified']//r['requests']:5,d}  "
          f"view-change msgs={v['view_change_messages']:6,d}")

Sample run:

n=  4 f= 1  msgs/req=    24  sigs/req= 10/   24  view-change msgs=    36
n=  7 f= 2  msgs/req=    84  sigs/req= 19/   84  view-change msgs=    72
n= 13 f= 4  msgs/req=   312  sigs/req= 37/  312  view-change msgs=   144
n= 31 f=10  msgs/req= 1,860  sigs/req= 91/1,860  view-change msgs=   360
n=100 f=33  msgs/req=19,800  sigs/req=298/19,800  view-change msgs= 1,188

Per-line walkthrough. pre_prepare = (n - 1) * requests is the primary's per-request work — it scales linearly. prepare = (n - 1) * (n - 1) * requests is the all-to-all prepare round; this is the O(n²) term that dominates. commit = n * (n - 1) * requests is similarly O(n²). messages_per_request for n = 100 is 19,800 — that is why "PBFT does not scale". Every consensus decision in a 100-node PBFT cluster requires ~20k inter-replica messages, and at typical 1ms per message round-trip on a LAN, the cluster's per-second decision rate is bounded by network IO long before CPU. sigs_made vs sigs_verified: every broadcast is signed once by its sender, but its signature is verified separately by each of the n - 1 recipients, so signature verification is the real CPU bottleneck — n=100 requires ~20k verifications per decision, ~20µs each on a modern CPU using ECDSA, totaling ~400ms of pure crypto per decision. Why O(n²) messages destroys scalability: the prepare and commit rounds each require every replica to send its vote to every other replica. The receiver must process every incoming vote — verify the signature, check the digest matches, store the vote in its protocol state. CPU cost and bandwidth per decision both grow as O(n²); protocol-state memory per replica grows as O(n) (each replica tracks who voted). For Byzantine systems running thousands of decisions per second over hundreds of validators, this is the hard ceiling — and the reason HotStuff was invented.

HotStuff — linear messages via threshold signatures and a leader

HotStuff (Yin, Malkhi, Reiter, Gueta, Abraham, 2019) achieves Byzantine consensus with O(n) messages per decision instead of O(n²), by changing two things:

  1. Star topology, not all-to-all. Replicas send their votes only to the next leader, not to every other replica. The leader aggregates 2f + 1 votes into a single threshold signature (a BLS aggregate, or an equivalent), then broadcasts the aggregated signature to everyone. Each replica receives one constant-size certificate broadcast per round instead of O(n) separate vote messages.

  2. Pipelined three-chain commit. Instead of three separate rounds per decision (pre-prepare, prepare, commit) like PBFT, HotStuff pipelines: each "view" runs a single round of "leader proposes, replicas vote, leader aggregates" — but a value is only committed once it has been the parent of a chain of three consecutive proposals (the "three-chain" rule). This means three views' worth of one-round work, but the views are pipelined: view v is in its commit phase while view v+1 is in its prepare phase and view v+2 is in its propose phase. The result: one decision finalises per view, with O(n) messages per view.

The upshot for the message accountant: once the pipeline is full, a HotStuff cluster of n = 100 pays roughly 2(n - 1) ≈ 200 messages per decision (one proposal broadcast plus n - 1 votes per view, with one block finalising per view) instead of PBFT's ~2n² ≈ 20,000. The view change is also O(n) instead of O(n²), because the next leader is implicit (round-robin) and the proof-of-leadership is a single threshold-signed quorum certificate from the previous view. Why threshold signatures unlock the linearity: in PBFT, a "prepared" or "committed" certificate is 2f + 1 separate signatures from distinct replicas — every replica needs to see all of them to verify the certificate. With BLS threshold signatures (or equivalent aggregation schemes), the leader combines the 2f + 1 signatures into a single constant-size signature that any replica can verify against the cluster's public key. The certificate becomes O(1) to ship and O(1) to verify. The leader no longer relays n signatures to each of n - 1 recipients (O(n²) bytes in total); it relays one aggregate signature to n - 1 recipients (O(n) in total). That single change is what makes 100+ validator Byzantine consensus practical. The sketch below puts the two counting rules side by side.
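
A minimal back-of-envelope comparison, reusing the PBFT counting from the simulation above; the HotStuff side is an approximation under stated assumptions (one proposal broadcast plus n - 1 votes per view, round-robin leaders, a full pipeline so one block commits per view):

# pbft_vs_hotstuff_messages.py — back-of-envelope only. PBFT counts match
# pbft_message_complexity.py above; HotStuff counts assume one proposal
# broadcast + (n - 1) votes per view, pipelined so one commit finalises per view.
def pbft_per_decision(n: int) -> int:
    return (n - 1) + (n - 1) ** 2 + n * (n - 1)   # pre-prepare + prepare + commit

def hotstuff_per_view(n: int) -> int:
    return (n - 1) + (n - 1)                      # leader broadcast + votes to next leader

for n in (4, 7, 31, 100, 300):
    pbft = pbft_per_decision(n)
    hs_pipelined = hotstuff_per_view(n)           # amortised: one commit per view
    hs_cold = 3 * hotstuff_per_view(n)            # the first commit needs three views
    print(f"n={n:3d}  PBFT={pbft:7,d}  HotStuff/view={hs_pipelined:5,d}  "
          f"HotStuff first commit={hs_cold:5,d}")

At n = 100 this is roughly 19,800 messages against a couple of hundred — the two-orders-of-magnitude gap the rest of this section is about.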

The three-chain commit rule and why it's safe

HotStuff's commit rule says: a block B is committed once it heads a chain B → B' → B'' → B''' of blocks at consecutive heights, each certified by a 2f + 1 quorum certificate (QC), with each block carrying the QC for its predecessor. The first QC shows that 2f + 1 replicas voted for B (the prepare quorum). The second shows that 2f + 1 replicas voted for B' — and, because B' carries QC(B), that they saw proof of B's quorum. The third does the same one level up for B''. By the quorum-intersection argument, no two conflicting three-chains can exist at the same height — and the safety proof falls out of that.

The pipeline interpretation is what makes HotStuff feel like a steady-state protocol: at view v, the leader proposes a block that extends the block certified in view v-1 and carries that view's QC; replicas vote on it; their votes form the QC for view v, which the leader of view v+1 embeds in its own proposal. Three consecutive QCs form a three-chain; the oldest of the three blocks commits.

Figure: HotStuff three-chain commit and pipelining — four views in sequence, each with a leader proposing one block and collecting votes into a QC: view v proposes B and forms QC(B); v+1 proposes B' on QC(B); v+2 proposes B'' on QC(B'); v+3 proposes B''' on QC(B''), and once QC(B''') exists, B commits. Each view is one votes-to-leader plus leader-broadcast round, so a commit costs three views × O(n) messages; pipelined, one block finalises per view and steady-state finality is three view delays.
HotStuff pipeline. Each block becomes committed three views after it was proposed; the QC formed at view v+3 is what seals B's commitment. Because each view costs only O(n) messages and one block finalises per view once the pipeline is full, the protocol is genuinely O(n) per decision, in contrast to PBFT's O(n²).
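
A toy model of the happy path (one block per view, no view changes, no equivocation): the Block class and on_qc function are invented for this example, and the commit trigger follows the figure above — the fourth consecutive QC commits the oldest block of the chain (different presentations of chained HotStuff count the trigger QC slightly differently).

# three_chain_sketch.py — illustrative only: chained HotStuff's commit rule on
# the happy path. Each block points at the block certified by the QC it carries.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Block:
    view: int
    payload: str
    parent: Optional["Block"] = None   # block certified by the QC this block carries
    committed: bool = False

def on_qc(certified: Block, committed_log: List[Block]) -> None:
    """Called when 2f + 1 votes for `certified` have been aggregated into a QC.
    If the last four certified blocks sit at consecutive views, the oldest commits."""
    b3 = certified
    b2 = b3.parent
    b1 = b2.parent if b2 else None
    b0 = b1.parent if b1 else None
    consecutive = (b0 is not None
                   and b3.view == b2.view + 1
                   and b2.view == b1.view + 1
                   and b1.view == b0.view + 1)
    if consecutive and not b0.committed:
        b0.committed = True
        committed_log.append(b0)

genesis = Block(view=0, payload="genesis")
chain, log = [genesis], []
for view, name in enumerate(["B", "B'", "B''", "B'''"], start=1):
    block = Block(view=view, payload=name, parent=chain[-1])
    chain.append(block)
    on_qc(block, log)                      # pretend the QC for this block just formed
print([b.payload for b in log])            # ['genesis', 'B'] — B commits once QC(B''') forms

The consecutive-view check is what a view change interrupts: after a skipped view, three fresh consecutive QCs have to form before anything older commits, which is roughly how the rule stays safe across leader failures.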

Where production blockchains use this

Most modern proof-of-stake blockchains that run a known, bounded validator set use some HotStuff-family protocol: Diem (the Facebook-then-not-Facebook chain) shipped LibraBFT, a HotStuff implementation; Aptos runs a direct HotStuff descendant (Jolteon); Sui pairs the same quorum-certificate machinery with a DAG-based mempool (Narwhal/Bullshark); Celo's IBFT and similar protocols share the pipeline structure. The choice is driven by validator count: Bitcoin and Ethereum mainnet use Nakamoto-style longest-chain consensus precisely because they need to scale to thousands of permissionless participants, where a HotStuff-style 2f + 1 quorum becomes impractical to track. PBFT-style protocols saturate around 30 validators; HotStuff scales to ~200 comfortably and ~500 with engineering effort.

PaySetu's recovery from the firmware-bug incident converged on a Byzantine-tolerant settlement layer: five replicas (f = 1) running a HotStuff variant with each replica's RocksDB checksums independently verified before voting. The three-chain commit means every settlement waits ~150ms for finality (three view delays at 50ms each) — slower than the previous Raft-based system's 8ms, but the team accepted the latency in exchange for tolerating up to one actively-corrupted replica without manual intervention.

Common confusions

The three thresholds are easy to mix up: the client needs only f + 1 matching replies (any such set contains at least one honest replica); a replica needs 2f + 1 matching votes before it acts (so the f liars can never dominate a quorum); and the cluster needs 3f + 1 members in total (so a 2f + 1 quorum can still form while f nodes are silent). A second frequent mix-up: HotStuff's O(n) is about message count, not latency — a decision still waits three chained views, which is why PaySetu's finality moved from 8ms to ~150ms even as the per-decision message count fell by two orders of magnitude.

Going deeper

Why the bound is exactly 3f+1, derived

The full derivation uses two cluster runs that an honest replica cannot distinguish. Suppose n = 3f (one less than the bound). Partition the cluster into three groups of size f: A, B, C. Consider two runs. Run 1: group C is offline; A sees clients propose value v; B is Byzantine and votes for v. The honest A nodes see 2f votes (their own f plus B's f lies), reach the quorum of 2f, and decide v. Run 2: group A is offline; C sees clients propose v'; B is Byzantine and votes for v'. Honest C nodes see 2f votes and decide v'. B behaves differently in the two runs, but from A's perspective run 1 is indistinguishable from a run in which C is merely slow, and symmetrically for C in run 2 — so neither group can afford to wait for the missing partition, and the two groups decide different values: split brain. The argument goes through for any n ≤ 3f. Adding one more node (n = 3f + 1) raises the quorum to 2f + 1, so any two quorums overlap in at least f + 1 replicas — at least one of which is honest and will not vote for both v and v'.
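
A brute-force check of the same argument (illustrative; quorum size is taken as n - f, the largest a live protocol can wait for):

# quorum_overlap_check.py — numerically confirms the 3f + 1 bound. A quorum of
# size q = n - f keeps the protocol live with f nodes silent; safety needs any
# two quorums to overlap in at least f + 1 replicas, so the overlap always
# contains an honest node that refuses to vote for two conflicting values.
def byzantine_safe(n: int, f: int) -> bool:
    q = n - f                    # liveness: largest quorum reachable with f nodes down
    worst_overlap = 2 * q - n    # two quorums of size q overlap in at least this many
    return worst_overlap >= f + 1

for f in range(1, 6):
    smallest = next(n for n in range(f + 1, 10 * f + 2) if byzantine_safe(n, f))
    print(f"f={f}: smallest safe n = {smallest}   (3f + 1 = {3 * f + 1})")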

How HotStuff's responsiveness property works

HotStuff promises optimistic responsiveness: after GST in the partial-synchrony model, a correct leader can drive the protocol forward at the pace of actual network round-trips — there is no step that has to wait out a configured synchrony bound Δ. The contrast is with Tendermint-style protocols, where a newly rotated leader must wait the full Δ before proposing, to be sure it has heard the highest locked value from the previous round; PBFT's normal case is responsive too, but its view change costs O(n²), and HotStuff's contribution is getting a linear view change and responsiveness at the same time. For a 100ms-RTT geography with a defensively configured Δ of several hundred milliseconds, the difference is a per-view rate bounded by the real RTT versus one bounded by Δ on every rotation — as the back-of-envelope below illustrates.
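
Back-of-envelope arithmetic for that gap; the RTT and Δ values below are assumptions chosen only to illustrate the shape of the difference:

# responsiveness_envelope.py — illustrative arithmetic only. A responsive
# protocol advances as soon as 2f + 1 votes arrive (one RTT per view); a
# protocol that waits out a conservatively configured Δ pays that Δ
# regardless of how fast the network actually is.
rtt = 0.100     # seconds: measured inter-replica round-trip (assumed)
delta = 0.400   # seconds: configured synchrony bound (assumed, deliberately conservative)

responsive_views_per_s = 1 / rtt
delta_bound_views_per_s = 1 / (rtt + delta)

print(f"responsive:  {responsive_views_per_s:4.1f} views/s  (bounded by real RTT)")
print(f"Δ-waiting:   {delta_bound_views_per_s:4.1f} views/s  (bounded by the configured Δ)")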

Tendermint, Casper FFG, and the design space around HotStuff

Tendermint (2014–2018) is HotStuff-adjacent: it also rotates leaders every round and has a linear-cost leader replacement, but it uses a two-chain commit rule with explicit pre-vote / pre-commit rounds (and, as deployed, all-to-all vote gossip). Casper FFG (Ethereum's finality gadget) is a "checkpoint" BFT protocol layered atop a longest-chain protocol — it borrows the 2f + 1 quorum but applies it over epochs instead of single blocks. The design space is large: you can mix-and-match leader-rotation rules (round-robin vs reputation-weighted), commit rules (two-chain, three-chain, k-chain), and signature schemes (per-replica BLS, threshold BLS, Schnorr aggregates). Each combination has different throughput, finality-latency, and view-change-cost tradeoffs.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# save pbft_message_complexity.py from the article body
python3 pbft_message_complexity.py

# A real HotStuff implementation to read:
git clone https://github.com/hot-stuff/libhotstuff
cd libhotstuff && cat README.md
# or for production-grade Rust HotStuff (Aptos's BFT engine):
git clone https://github.com/aptos-labs/aptos-core
# the consensus logic lives in consensus/src/

Where this leads next

Byzantine consensus is the final piece of the consensus story — the chapter that explains what "consensus" means when nodes can lie, not just crash. From here, the curriculum branches into the operational realities of running consensus protocols: leader leases, failure detection, and how consensus integrates with replication and storage.

Part 8 is now closed: FLP impossibility set the theoretical limit; Paxos and Raft showed practical fail-stop consensus; Multi-Raft scaled it to thousands of groups; PBFT and HotStuff extended it to the Byzantine setting. Part 9 picks up with the layer just above consensus — leases and fencing — where the question shifts from "agree on a value" to "act on the agreement safely".

References

  1. Castro & Liskov, "Practical Byzantine Fault Tolerance" — OSDI 1999 — the original PBFT paper. Sections 4 (normal-case operation) and 5 (view changes) are load-bearing.
  2. Yin, Malkhi, Reiter, Gueta, Abraham, "HotStuff: BFT Consensus in the Lens of Blockchain" — PODC 2019 — the HotStuff paper. The three-chain commit and linear view change are in Section 5.
  3. Lamport, Shostak, Pease, "The Byzantine Generals Problem" — TOPLAS 1982 — the original n ≥ 3f + 1 impossibility result.
  4. Buchman, Kwon, Milosevic, "The latest gossip on BFT consensus" (Tendermint paper) — 2018 — Tendermint's two-chain BFT, contrast with HotStuff's three-chain.
  5. Baudet et al., "State Machine Replication in the Libra Blockchain" — Diem whitepaper 2019 — LibraBFT, the production HotStuff variant.
  6. hot-stuff/libhotstuff — C++ reference implementation — the canonical reading for "what HotStuff actually looks like in code".
  7. Multi-Raft and sharding consensus — chapter 54; the fail-stop predecessor protocol whose limits motivate Byzantine consensus.
  8. Raft in detail — chapter 52; the protocol whose threat model PBFT and HotStuff explicitly extend.