Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Pulsar's architecture

PaySetu's settlement team had a bad week in March. Their Kafka cluster — three brokers, 24 partitions per topic, replication-factor 3 — was the durable buffer between merchant payment events and the downstream ledger. On a Tuesday morning, broker-2 ran out of disk during a backfill. The on-call engineer, Riya, spent forty minutes shuffling segments to a fresh volume, then another two hours waiting for the partition reassignment to drain. Producers were not blocked, but consumer lag for settlement-events-v4 climbed to 18 minutes and the finance team's reconciliation job fired at midnight against a stale view. The post-mortem asked one question: why does adding storage capacity require touching the brokers at all? A week later they were prototyping on Apache Pulsar, and the answer surprised them — Pulsar is built around exactly that separation. The broker is stateless. Storage lives somewhere else, in a service called BookKeeper, and you can scale either layer without ever moving a byte of customer data.

This chapter is about Pulsar's two-layer architecture and the data model — segments, ledgers, bookies, brokers — that makes the separation work. It is not "Kafka vs Pulsar"; it is what Pulsar actually does on the wire and on disk, and why the operational story is different from what you've seen if you've only run Kafka. Pulsar is in Part 15 (messaging and streaming) for a reason: the log abstraction is the same, but the layering changes which failures are easy and which are hard.

Pulsar separates the message broker into two tiers — a stateless serving layer (brokers) and an independent durable storage layer (Apache BookKeeper). Brokers own topics, route producers and consumers, and cache recently written entries for tailing reads; bookies own bytes, replicate them with a quorum-write protocol, and never know about topic semantics. A topic's data is a sequence of ledgers; each ledger is striped across an ensemble of bookies with write-quorum and ack-quorum tunable independently. The result: scale storage by adding bookies, scale serving by adding brokers, and replace either kind of node with no data movement at the topic level.

Two layers, three components

Pulsar has three node types in production: brokers, bookies, and a metadata store — ZooKeeper, optionally replaced by oxia in newer versions, but conceptually the same role: hold small, strongly consistent metadata. Brokers handle the wire protocol — producer connections, consumer subscriptions, message routing, deduplication, schema validation, geo-replication. Bookies — the nodes that run BookKeeper — handle bytes on disk, nothing else. The metadata store holds the topic-to-broker assignment, the ledger-to-bookie mapping, and tenant configuration.

The trick is that a broker owns no persistent state for a topic. If broker-3 dies, its topics are reassigned to broker-1 and broker-7, the producers reconnect, and the new owner reads each topic's tail directly from BookKeeper. There is no partition replica to rebalance, no segment to copy, no ISR to repair. The handover is a metadata update.

Figure: Pulsar topology — producers (the PaySetu app) and a consumer (the ledger writer) connect to a row of three stateless brokers (broker-1/2/3, each owning roughly 400 topics), which read and write to a row of five bookies (about 3.3–3.6 TB each) in the BookKeeper layer below; ZooKeeper / oxia sits alongside holding metadata. Illustrative — capacity and topic counts are shown for shape, not measurement.
Brokers serve clients but hold no durable bytes. Each broker writes the topics it owns into BookKeeper, where a configurable subset of bookies stores each entry. Scale storage by adding bookies; scale fan-out by adding brokers. Illustrative.

Why the separation matters operationally: in a Kafka cluster, a broker is both an RPC server and a disk. If you need more disk you add brokers, which means new replicas to pull, which means partition reassignment, which means a controlled drain that can take hours. In Pulsar, more disk means more bookies — the bookie just registers itself with metadata and starts accepting new ledgers. Existing ledgers are not moved, because brokers can write the next ledger to the new bookie set. The cost of expansion drops from "rebalance the cluster" to "the next entry goes to the new node". This is the operational difference Riya saw on day one.

Segments, ledgers, entries — the data model

A Pulsar topic, internally, is a list of ledgers. A ledger is a finite, immutable, append-only sequence of entries (messages). When a topic is created, the broker opens ledger #1 and starts appending entries. When the ledger reaches a configured size (default 50,000 entries) or age (default 4 hours) — or when the broker holding the topic crashes — the broker closes ledger #1 and opens ledger #2. The topic is then the concatenation of ledger #1, ledger #2, ledger #3, etc., in order.
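
If you want to watch a rollover happen, here is a short sketch. It is illustrative only: the topic name is hypothetical, it assumes the local standalone from the "Reproduce on your laptop" section later in this chapter, and the unload endpoint shown matches recent Pulsar releases — check your version's admin REST reference.

# rollover_demo.py — watch a topic move to a new ledger after an unload.
# Sketch only: hypothetical topic name; assumes the local standalone described
# later in this chapter; the unload endpoint reflects recent Pulsar releases.
import pulsar, httpx

ADMIN = "http://localhost:8080"
TOPIC_PATH = "persistent/public/default/rollover-demo"

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/rollover-demo",
                                  batching_enabled=False)

first = producer.send(b"entry in the current ledger")
# Unloading the topic forces the broker to close its open ledger; whoever
# acquires the topic next (here, the same broker) opens a fresh ledger.
httpx.put(f"{ADMIN}/admin/v2/{TOPIC_PATH}/unload").raise_for_status()
second = producer.send(b"entry after the rollover")   # client reconnects transparently

print("ledger before unload:", first.ledger_id(), " after:", second.ledger_id())
client.close()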

Each ledger is striped across an ensemble of bookies. Three numbers control this striping:

  • E (ensemble size) — how many bookies hold the ledger. Default 3.
  • Wq (write quorum) — how many bookies each entry is written to. Default 3.
  • Aq (ack quorum) — how many of those Wq writes must be acked before the entry is durable. Default 2.

If E = Wq, every entry goes to every bookie in the ensemble (full replication). If E > Wq, entries are striped — entry 0 goes to bookies {1,2,3}, entry 1 to {2,3,4}, entry 2 to {3,4,5}, and so on, distributing load across the ensemble. The acknowledged entry is durable on Aq bookies, even though the broker tries to write to Wq.
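
To make the striping rule concrete, here is a small plain-Python sketch — no Pulsar dependency, and the bookie names are made up — of the round-robin placement described above: entry n goes to the Wq bookies starting at position n mod E in the ensemble.

# stripe_sketch.py — which Wq bookies out of an ensemble of E receive entry n.
# Illustrative only: bookie names are hypothetical, and this mirrors the
# round-robin striping described in the text rather than BookKeeper's code.

def stripe(entry_id: int, ensemble: list[str], write_quorum: int) -> list[str]:
    """Entry n is written to the write_quorum bookies starting at n mod E."""
    e = len(ensemble)
    return [ensemble[(entry_id + i) % e] for i in range(write_quorum)]

ensemble = ["bookie-1", "bookie-2", "bookie-3", "bookie-4", "bookie-5"]   # E = 5
for n in range(4):
    print(f"entry {n} -> {stripe(n, ensemble, write_quorum=3)}")          # Wq = 3

# entry 0 -> ['bookie-1', 'bookie-2', 'bookie-3']
# entry 1 -> ['bookie-2', 'bookie-3', 'bookie-4']
# entry 2 -> ['bookie-3', 'bookie-4', 'bookie-5']
# entry 3 -> ['bookie-4', 'bookie-5', 'bookie-1']   (the stripe wraps around)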

Figure: a topic as a sequence of ledgers — L1 closed at 50,000 entries, L2 open and currently being written, L3 opening when L2 closes; reads stream across L1 → L2 → L3 in order, transparently to the consumer. Ledger L2 is expanded to show E=5, Wq=3, Aq=2 striping across bookies A–E: entry 0 → A B C, entry 1 → B C D, entry 2 → C D E, the stripe rotating across the ensemble to distribute load. Illustrative.
The topic is a list of ledgers; each ledger is striped across a bookie ensemble. With E=5, Wq=3, the broker writes each entry to a rolling window of 3 bookies, distributing load across all 5. Illustrative.

Why E ≥ Wq matters for failure recovery: if a bookie in the ensemble dies mid-ledger, the broker triggers an ensemble change — it substitutes a healthy bookie for the dead one and keeps appending to the same ledger; only if no replacement is available does it close the ledger at the last fully-acked entry and open a new one. The reader of the topic sees no gap either way, because ensemble and ledger boundaries are invisible to consumers. In-flight entries stay safe: an entry is acked once Aq bookies hold it, so the surviving bookies in its stripe can complete it while the replacement is swapped in — and with E > Wq, many entries' stripes never touched the failed bookie in the first place. PaySetu runs E=5, Wq=3, Aq=2 on payment topics for exactly this reason.

The write path — quorum across racks

When a producer sends a message to a Pulsar topic, the broker picks an entry id (next slot in the open ledger) and dispatches it to Wq bookies in parallel. Each bookie receives the entry, appends it to its journal (a sequential write-ahead log on a fast disk, optimised for low-latency commit), and acks. Once Aq acks return, the broker acks the producer.
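
A small simulation makes the Aq-of-Wq behaviour visible. This is a plain-Python sketch with made-up latency numbers, not a measurement: the producer-visible commit latency of an entry is the Aq-th fastest of its Wq bookie acks, so with Aq < Wq one slow bookie rarely shows up at the tail.

# quorum_latency_sketch.py — the broker acks after the Aq-th fastest bookie ack.
# Hypothetical latencies, purely illustrative of why Aq < Wq hides slow bookies.
import random

def commit_latency_ms(wq: int, aq: int) -> float:
    # Made-up per-bookie journal-ack latencies: mostly ~3-4 ms, occasionally 40 ms.
    acks = sorted(random.choices([3.0, 3.5, 4.0, 40.0],
                                 weights=[60, 25, 10, 5], k=wq))
    return acks[aq - 1]             # the Aq-th ack to arrive completes the write

samples = sorted(commit_latency_ms(wq=3, aq=2) for _ in range(10_000))
print(f"p50={samples[5_000]} ms  p99={samples[9_900]} ms")
# With Aq=2 the occasional 40 ms bookie almost never reaches the p99;
# with Aq=3, any slow ack in the stripe would land on the producer.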

The journal is the durability commitment. Once the entry is in the journal of Aq bookies, the write is durable — even if all Aq bookies crash before flushing to the entry log (the long-term storage on slower bulk disk), recovery will replay the journal on restart and the entry survives. This split — fast journal disk for commit latency, bulk entry-log disk for capacity — is why bookies are typically deployed with one NVMe device for the journal and one or more HDDs (or larger SSDs) for the entry log.

A worked example: PaySetu's payment-events-v4 topic runs E=5, Wq=3, Aq=2 across five bookies in three racks (rack-A: bookies 1, 2; rack-B: bookies 3, 4; rack-C: bookie 5). The broker uses rack-aware placement — when picking the Wq=3 bookies for a stripe, it tries to spread across at least two racks. The result: any one rack can fail entirely and Aq=2 acks are still achievable from the other racks. The producer's p99 commit latency is 14 ms; the p999 is 38 ms (dominated by the rare GC pause on a slow bookie).
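
To apply the same layout to your own topics, E/Wq/Aq are a per-namespace persistence policy set through the admin API. A sketch against the standalone used in the walkthrough below — the namespace is the default one, and the endpoint and JSON field names reflect recent Pulsar releases, so verify against your version's admin REST reference.

# set_persistence.py — set E / Wq / Aq for a namespace via the admin REST API.
# Assumes the admin service at localhost:8080; endpoint and field names are
# taken from recent Pulsar releases — check your version's docs.
import httpx

ADMIN = "http://localhost:8080"
NS = "public/default"                     # tenant/namespace

policy = {
    "bookkeeperEnsemble": 5,              # E  — bookies in the ensemble
    "bookkeeperWriteQuorum": 3,           # Wq — bookies each entry is written to
    "bookkeeperAckQuorum": 2,             # Aq — acks needed before the entry is durable
    "managedLedgerMaxMarkDeleteRate": 0.0,
}
resp = httpx.post(f"{ADMIN}/admin/v2/namespaces/{NS}/persistence", json=policy)
resp.raise_for_status()
print(httpx.get(f"{ADMIN}/admin/v2/namespaces/{NS}/persistence").json())

Note that a single-node standalone runs only one bookie, so an E=5 policy cannot actually be satisfied there — new ledgers would fail to form an ensemble. Use it against a multi-bookie cluster, or drop the numbers to 1/1/1 when experimenting locally.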

Code walkthrough — produce, consume, and inspect ledgers

Here is a runnable Python example using pulsar-client that produces messages, consumes them, and inspects the underlying ledger metadata via the admin API. It assumes a single-node Pulsar standalone running locally.

# pulsar_demo.py — produce, consume, and inspect ledger metadata.
# Requires: pip install pulsar-client httpx
# Setup: docker run -d -p 6650:6650 -p 8080:8080 apachepulsar/pulsar:3.2.0 \
#        bin/pulsar standalone

import pulsar, json, httpx

BROKER = "pulsar://localhost:6650"
ADMIN = "http://localhost:8080"
TENANT, NS, TOPIC = "public", "default", "payment-events-demo"
FQ_TOPIC = f"persistent://{TENANT}/{NS}/{TOPIC}"

# 1. Producer — synchronous send, batching off so we see one entry per send.
client = pulsar.Client(BROKER)
producer = client.create_producer(FQ_TOPIC,
                                  block_if_queue_full=True,
                                  batching_enabled=False)

events = [
    {"merchant": "MR-2018", "amount_paise": 49900, "type": "CARD"},
    {"merchant": "MR-7341", "amount_paise": 12300, "type": "UPI"},
    {"merchant": "MR-2018", "amount_paise": 99900, "type": "UPI"},
    {"merchant": "MR-9027", "amount_paise":  4500, "type": "WALLET"},
    {"merchant": "MR-7341", "amount_paise": 78900, "type": "CARD"},
]
for e in events:
    msg_id = producer.send(json.dumps(e).encode("utf-8"))
    print(f"  sent  -> ledger={msg_id.ledger_id()} entry={msg_id.entry_id()}"
          f" merchant={e['merchant']}")

# 2. Inspect the topic's internal stats — shows the ledger list.
stats = httpx.get(f"{ADMIN}/admin/v2/persistent/{TENANT}/{NS}/{TOPIC}/internalStats")
js = stats.json()
print("\nledgers in topic:")
for led in js["ledgers"]:
    print(f"  ledgerId={led['ledgerId']}  entries={led['entries']}"
          f"  size={led['size']}B  offloaded={led.get('offloaded', False)}")

# 3. Consumer — read from the beginning, ack each entry.
consumer = client.subscribe(FQ_TOPIC,
                            subscription_name="demo-sub",
                            initial_position=pulsar.InitialPosition.Earliest,
                            consumer_type=pulsar.ConsumerType.Exclusive)
print("\nreading back:")
for _ in range(len(events)):
    msg = consumer.receive(timeout_millis=5000)
    payload = json.loads(msg.data())
    print(f"  got   -> ledger={msg.message_id().ledger_id()}"
          f" entry={msg.message_id().entry_id()}"
          f" merchant={payload['merchant']} amount={payload['amount_paise']}p")
    consumer.acknowledge(msg)

consumer.close()
producer.close()
client.close()

Sample run on a Pulsar 3.2 standalone:

  sent  -> ledger=12 entry=0 merchant=MR-2018
  sent  -> ledger=12 entry=1 merchant=MR-7341
  sent  -> ledger=12 entry=2 merchant=MR-2018
  sent  -> ledger=12 entry=3 merchant=MR-9027
  sent  -> ledger=12 entry=4 merchant=MR-7341

ledgers in topic:
  ledgerId=12  entries=5  size=412B  offloaded=False

reading back:
  got   -> ledger=12 entry=0 merchant=MR-2018 amount=49900p
  got   -> ledger=12 entry=1 merchant=MR-7341 amount=12300p
  got   -> ledger=12 entry=2 merchant=MR-2018 amount=99900p
  got   -> ledger=12 entry=3 merchant=MR-9027 amount=4500p
  got   -> ledger=12 entry=4 merchant=MR-7341 amount=78900p

Walkthrough of the load-bearing lines:

  • msg_id.ledger_id() / msg_id.entry_id() — every Pulsar MessageId is a (ledger, entry, partition, batch) tuple, not a monotonic offset. The ledger id changes whenever the topic rolls over to a new ledger; the entry id resets to 0 inside each ledger. This is why you cannot compare Pulsar message ids by integer subtraction the way you can with Kafka offsets — they are positions in a list of lists.
  • /internalStats — the admin API exposes the broker's view of the topic, including the list of ledgers and their entry counts. In production you watch this endpoint to see ledger rollovers happening. If a topic has 47 ledgers and you expected 3, your roll-time or roll-size is misconfigured.
  • InitialPosition.Earliest — like Kafka, Pulsar lets a new subscription start from the earliest or the latest position. In addition, an existing consumer can seek() to a specific message id or to a wall-clock timestamp, which the broker resolves against the publish times stored in message metadata.
  • ConsumerType.Exclusive — Pulsar has four subscription types (Exclusive, Shared, Failover, Key_Shared) that determine how messages are dispatched across consumers in the same subscription. This is the major model difference from Kafka, which has only consumer groups. Covered in the next chapter.
  • acknowledge(msg) — acks are per-message in Pulsar, not per-offset like Kafka. The broker tracks individual acknowledgements in a structure called the cursor, which is itself stored as a ledger in BookKeeper. This means consumers can ack out of order, redeliver only a single failed message, and the broker's bookkeeping is durable.

Why per-message acknowledgement is more than syntactic sugar: it changes the semantics of redelivery. In Kafka, a consumer that fails to process the message at offset 100 must rewind that partition's committed offset to 100, replaying messages 100..N on whichever consumer owns the partition. In Pulsar, only that single message is redelivered (subject to a redelivery delay). For a retry-heavy workload — payment retries, OTP sends, push notifications — this difference can be the difference between a 0.1% retry rate and a 5% replay rate. KapitalKite's order-routing service migrated from Kafka to Pulsar primarily for this property.
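
Here is a minimal sketch of that redelivery behaviour using the same pulsar-client setup as above. The topic and subscription names are hypothetical; it uses a Shared subscription (covered properly in the next chapter) and negative_acknowledge() to push back exactly one message.

# negative_ack_demo.py — redeliver one message, not the whole stream.
# Same standalone + pulsar-client as the main example; topic and subscription
# names are hypothetical, and the 1 s nack delay just keeps the demo fast.
import json, pulsar

client = pulsar.Client("pulsar://localhost:6650")
topic = "persistent://public/default/payment-retries-demo"

# Shared subscription: messages fan out across consumers, acks are per message.
consumer = client.subscribe(topic,
                            subscription_name="retry-demo-sub",
                            consumer_type=pulsar.ConsumerType.Shared,
                            negative_ack_redelivery_delay_ms=1000)

producer = client.create_producer(topic, batching_enabled=False)
for i in range(3):
    producer.send(json.dumps({"payment": f"PAY-{i}"}).encode("utf-8"))

for _ in range(3):
    msg = consumer.receive(timeout_millis=5000)
    payment = json.loads(msg.data())["payment"]
    if payment == "PAY-1":
        consumer.negative_acknowledge(msg)      # simulate a processing failure
        print("nacked ", payment)
    else:
        consumer.acknowledge(msg)
        print("acked  ", payment)

# Only PAY-1 comes back, roughly 1 s later; PAY-0 and PAY-2 stay acknowledged.
msg = consumer.receive(timeout_millis=10000)
print("redelivered", json.loads(msg.data())["payment"])
consumer.acknowledge(msg)
client.close()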

Common confusions

  • "Pulsar is just Kafka with extra steps." No — the data model is meaningfully different. Kafka partitions are tied to broker disks (the broker owns its data); Pulsar topics are tied to BookKeeper ledgers (the broker is interchangeable). Every operational story (broker replacement, capacity expansion, rack failure) plays out differently as a result.
  • "BookKeeper is a database." No — BookKeeper is a write-ahead log service. Each ledger is an append-only sequence of opaque byte arrays; bookies do not interpret the entries or maintain any index over them beyond (ledger, entry) → byte_position. Pulsar adds the topic / partition / subscription semantics on top.
  • "E, Wq, Aq are all the same thing." They are three independent knobs. E sets the pool of bookies for the ledger; Wq sets how many of those each entry is written to; Aq sets how many acks are required before the entry counts as durable. Lowering Aq below Wq buys commit latency and tolerance of slow bookies — the broker waits for the fastest Aq acks, not all Wq — at the price of a thinner durability margin on just-acknowledged entries.
  • "Per-message ack means messages are stored individually." No — per-message ack is a metadata distinction. The bytes are still stored as entries in ledgers; the cursor (the per-subscription progress marker) tracks which message ids have been acked, using a roaring-bitmap-like data structure to compactly represent gaps in the ack range.
  • "Tiered storage means infinite retention is free." Tiered storage offloads closed ledgers to S3 / GCS / HDFS, keeping only metadata in BookKeeper. Read latency for offloaded data is much higher (S3 GET vs local SSD). Use it for compliance / replay scenarios; do not use it as a default for hot consumer workloads.
  • "Brokers being stateless means restart is free." Restart is fast (no data movement), but every topic the broker owned has to be handed over: its open ledger must be closed — or, after an unclean crash, fenced and recovered by the new owner — before serving resumes, and each of those is a metadata update in ZooKeeper / oxia. Topic-dense brokers (>10,000 topics) take seconds, not milliseconds, to drain.

Going deeper

The fencing protocol — why split-brain is hard in Pulsar

When a broker dies mid-ledger, the open ledger is in an indeterminate state — some entries may be on Wq bookies, some on Aq, some on fewer. The recovery protocol, run by the new owner-broker, is:

  1. Fence the ledger. The new owner tells the bookies in the ensemble to reject any further appends to it, and marks it as in-recovery in metadata — even the dead broker, if it comes back, cannot get its writes accepted.
  2. Read the ledger's tail from the bookies, finding the last entry where Aq bookies have a copy. Entries beyond that are partial writes and are dropped.
  3. Close the ledger at that entry. Open a new ledger for the topic.

The fence is the load-bearing primitive — once the bookies refuse further appends, the old broker cannot resurrect and commit more entries, however confused it is about its own ownership. This is the "fencing token" pattern from Part 9, applied at ledger granularity. See /wiki/leader-election-and-leases for the general principle.
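
A schematic of the recovery flow, following the simplified three-step description above. This is a toy in-memory model — the Bookie class and helper functions are invented for illustration, not BookKeeper's client API.

# ledger_recovery_sketch.py — self-contained toy model of fence + tail recovery.
# Hypothetical classes and names; follows the simplified steps described above.

class Bookie:
    def __init__(self, name):
        self.name = name
        self.entries = {}        # (ledger_id, entry_id) -> payload
        self.fenced = set()      # ledger ids that reject further appends

    def add(self, ledger_id, entry_id, payload):
        if ledger_id in self.fenced:
            raise RuntimeError(f"{self.name}: ledger {ledger_id} is fenced")
        self.entries[(ledger_id, entry_id)] = payload

    def fence(self, ledger_id):
        self.fenced.add(ledger_id)

    def has(self, ledger_id, entry_id):
        return (ledger_id, entry_id) in self.entries


def recover(ledger_id, ensemble, ack_quorum, last_confirmed):
    for b in ensemble:                       # 1. fence on every bookie
        b.fence(ledger_id)
    last_durable, entry = last_confirmed, last_confirmed + 1
    while sum(b.has(ledger_id, entry) for b in ensemble) >= ack_quorum:
        last_durable, entry = entry, entry + 1    # 2. keep entries with >= Aq copies
    return last_durable                      # 3. close the ledger here, open a new one


# Three bookies, Aq=2: entries 0-2 fully written, entry 3 reached only one bookie.
bookies = [Bookie("A"), Bookie("B"), Bookie("C")]
for e in range(3):
    for b in bookies:
        b.add(ledger_id=7, entry_id=e, payload=b"...")
bookies[0].add(ledger_id=7, entry_id=3, payload=b"...")   # partial write

print("close ledger 7 at entry", recover(7, bookies, ack_quorum=2, last_confirmed=2))
# -> close ledger 7 at entry 2; the partial entry 3 is dropped, and any further
#    add() from the old owner raises because every bookie has fenced the ledger.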

Geo-replication — async per-cluster

Pulsar has built-in geo-replication: a topic can be configured to replicate asynchronously to other Pulsar clusters in other regions. The wire protocol uses Pulsar's own producer-consumer machinery — the source cluster has a system-internal consumer reading the topic and a producer pushing to the destination cluster. There is no synchronous cross-region commit; replication is best-effort, with replication-lag observable per-cluster. CricStream uses this for its match-events-v3 topic between Mumbai and Singapore data centres, with p99 cross-region replication lag of 80 ms (dominated by RTT). For synchronous geo-replication you need a wholly different design — see /wiki/spanner-style-txns-with-truetime for the consensus-across-regions pattern.
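
Replication state is visible per remote cluster in the topic's stats. A hedged sketch — the topic name is hypothetical, and the exact field names inside "replication" vary by Pulsar version, so treat them as assumptions and inspect the JSON your broker actually returns.

# replication_lag_check.py — peek at per-cluster replication backlog for a topic.
# Hypothetical topic; field names ("replication", "replicationBacklog",
# "connected") are assumptions based on recent releases.
import httpx

ADMIN = "http://localhost:8080"
TOPIC_PATH = "persistent/public/default/match-events-demo"

stats = httpx.get(f"{ADMIN}/admin/v2/{TOPIC_PATH}/stats").json()
for cluster, repl in stats.get("replication", {}).items():
    print(f"{cluster}: backlog={repl.get('replicationBacklog')} msgs "
          f"connected={repl.get('connected')}")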

Tiered storage and the cold-tier latency cliff

A closed ledger can be offloaded — copied to S3 or GCS, then deleted from bookies. The offload writes a manifest into the topic's metadata; subsequent reads consult the manifest, fetch the segment over HTTP, and reconstruct the entries in the broker's read path. Latency for offloaded data is 50–200 ms p99 vs sub-ms for hot bookie reads. The trade-off makes sense when retention is measured in months and the read pattern is "rare backfill" rather than "real-time consumer". PaySetu offloads ledgers older than 7 days for compliance retention, keeping the cold tier in S3 ap-south-1 to minimise egress.
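
Because offload happens per closed ledger, the same internalStats endpoint used in the walkthrough shows how much of a topic has gone cold. A small sketch — the topic name is hypothetical, and the offloaded flag may be absent when tiered storage is not configured, hence the .get default.

# offload_check.py — hot (bookie) vs cold (offloaded) bytes for one topic.
import httpx

ADMIN = "http://localhost:8080"
TOPIC_PATH = "persistent/public/default/payment-events-demo"

ledgers = httpx.get(f"{ADMIN}/admin/v2/{TOPIC_PATH}/internalStats").json()["ledgers"]
hot = sum(lg["size"] for lg in ledgers if not lg.get("offloaded", False))
cold = sum(lg["size"] for lg in ledgers if lg.get("offloaded", False))
print(f"{len(ledgers)} ledgers: {hot} B hot on bookies, {cold} B offloaded to the cold tier")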

The metadata layer — ZooKeeper, then oxia

Pulsar's metadata (broker-to-topic assignment, ledger-to-bookie mapping, namespace policies) historically lived in ZooKeeper. From Pulsar 3.0 onwards, oxia is available as an alternative metadata store built specifically for Pulsar, aiming at millions of topics where ZooKeeper struggles past roughly 100k. oxia shards the metadata key space and replicates each shard independently, so no single consensus group has to hold every key. The migration story matters because metadata-layer hot-spots — a top cause of broker startup timeouts in dense clusters — largely go away. The pluggable metadata interface that makes the swap possible is PIP-45, Pulsar's equivalent of a Kafka KIP.

Reproduce on your laptop

# Single-node Pulsar standalone (broker + bookie + ZK in one container).
docker run -d --name pulsar -p 6650:6650 -p 8080:8080 \
  apachepulsar/pulsar:3.2.0 bin/pulsar standalone
pip install pulsar-client httpx
python3 pulsar_demo.py
# Inspect the bookie's ledger directory:
docker exec pulsar ls -la /pulsar/data/bookkeeper/ledgers/current/
# List all ledgers:
docker exec pulsar bin/bookkeeper shell listledgers

Where this leads next

The next chapter in this part covers Pulsar subscription types — Exclusive, Shared, Failover, Key_Shared — and how they map onto the consumer-group model from Kafka, including the patterns where Pulsar's flexibility genuinely buys you something and the patterns where it just adds configuration to think about.

References