Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Downstream systems — search indexes, caches, fraud models, warehouses — do not want snapshots of your database; they want a stream of changes as they happen. Relational CDC bolts on a log reader like Debezium that tails the binlog out-of-band, with all the format-drift and operational cost that implies. MongoDB folds CDC into the protocol: a change stream is a tailing cursor on the replica set oplog that delivers typed events with resume tokens, so a crashed consumer picks up exactly where it left off — no external system required.

The CDC problem, stated cleanly

Every interesting data system that is not your primary database ultimately has the same job: stay in sync with that primary database. A search index needs to know when a product is added or its price changes. A cache needs to know when the source of truth invalidates. A data warehouse needs the day's new transactions. A fraud model needs the latest signals about user behaviour. A webhook integration needs to fire when an order moves to "shipped".

You have three crude ways to do this and one good one.

Polling is the crude default: every N seconds, run SELECT * FROM orders WHERE updated_at > :last_seen. It is simple, it works, it is also wrong in three ways at once. You miss deletes (a deleted row no longer satisfies the predicate), you miss intermediate states (two updates between polls collapse into one), and you put a constant scan load on your primary database that scales with both your data size and your polling frequency. Why polling rots: latency floor is your poll interval, deletes are invisible, and the load on the primary grows linearly with how fresh you want to be — exactly backwards from what you want.

Dual writes are worse. The application writes to MongoDB, then writes to Kafka, and you hope both succeed. They will not, eventually. A network blip after the first write and before the second leaves the systems permanently inconsistent. There is no atomic boundary that spans two distributed systems without a distributed transaction protocol you do not want to operate.

Triggers are better but rigid. Postgres AFTER INSERT OR UPDATE OR DELETE triggers can call a function that writes to a queue, but every change pays the trigger cost synchronously, and the queue write is inside the transaction. If the queue is down, your writes fail. If the queue is slow, your writes are slow.

The good way is log-based CDC. The database already writes a durable, ordered log of every change — that is how replication works. A consumer that can tail this log gets the perfect change stream: every event in commit order, no intermediate states lost, no extra load on the writer (the log was already being written), and the latency is bounded only by replication lag. Debezium is the canonical implementation of this idea for relational databases, parsing the MySQL binlog or Postgres WAL into Kafka records.

MongoDB's contribution is to skip the bolt-on layer. The replica set's oplog is already a structured, idempotent, ordered log of operations — and MongoDB exposes it through a first-class change stream API.

The MongoDB oplog

A MongoDB replica set is a primary plus one or more secondaries. The primary accepts writes; secondaries replay them in order. The mechanism is the oplog (local.oplog.rs), a capped collection that lives on every replica.

Every write to the primary appends an entry to the oplog. Each entry is itself a BSON document with roughly this shape:

{
  "ts": Timestamp(1714048000, 17),
  "t": 42,
  "h": NumberLong("..."),
  "v": 2,
  "op": "i",
  "ns": "shop.orders",
  "o": { "_id": ObjectId("..."), "user_id": "u_19", "amount": 4999 }
}

The fields that matter: ts is the cluster time (timestamp + ordinal counter, monotonically increasing), op is the operation type (i insert, u update, d delete, c command, n no-op), ns is the namespace (db.collection), and o is the operation payload — the inserted document, the update modifier, or the document key for a delete.

Secondaries tail this log over a regular query cursor with awaitData (the cursor blocks on the server until new data arrives, rather than returning empty and being polled). When a secondary applies an entry, it is doing the exact same logical operation the primary did. Why this matters for CDC: the oplog is the source of truth for replication. Every byte of it is durable, every ordering is preserved, every operation is idempotent (applying it twice produces the same result). Those are exactly the properties a change feed needs.

The oplog is capped — it has a fixed maximum size (default tuned to 5% of free disk, usually 10–50 GB on a production node). When it fills, oldest entries are overwritten. The window of time the oplog covers is the oplog window, and on a busy production cluster this might be 24–72 hours. A consumer that falls behind by more than the oplog window cannot resume from where it stopped — its resume token points to a deleted entry. We will come back to monitoring this.

Change stream architecture: a tailing cursor over the replica set oplogReplica SetPrimaryaccepts writesSecondary 1tails oplog from primarySecondary 2tails oplog from primarylocal.oplog.rs (capped)ordered by clusterTimemongod server logicchange stream cursor= aggregation pipelineover oplog with changeStream</text><text x="380" y="218" text-anchor="middle" font-size="10" fill="#374151">filters by ns, op, fields</text><text x="380" y="234" text-anchor="middle" font-size="10" fill="#374151">awaitData = blocks for new ops</text><text x="380" y="250" text-anchor="middle" font-size="10" fill="#374151">attaches resume token to event</text><text x="380" y="266" text-anchor="middle" font-size="10" fill="#15803d">reads from any replica</text><rect x="510" y="100" width="160" height="200" fill="#dcfce7" stroke="#15803d" rx="6"/><text x="590" y="120" text-anchor="middle" font-weight="bold" fill="#15803d">Application client</text><text x="590" y="142" text-anchor="middle" font-size="10">pymongo / Java driver</text><text x="590" y="158" text-anchor="middle" font-family="monospace" font-size="10">db.coll.watch(pipeline)</text><text x="590" y="180" text-anchor="middle" font-size="10" fill="#374151">long-running cursor</text><text x="590" y="196" text-anchor="middle" font-size="10" fill="#374151">receives ChangeEvents</text><text x="590" y="212" text-anchor="middle" font-size="10" fill="#374151">stores last resume token</text><text x="590" y="228" text-anchor="middle" font-size="10" fill="#374151">reconnects with token</text><text x="590" y="252" text-anchor="middle" font-size="10" fill="#15803d">writes to Kafka /</text><text x="590" y="266" text-anchor="middle" font-size="10" fill="#15803d">Elasticsearch / Redis</text><line x1="230" y1="180" x2="290" y2="180" stroke="#374151" stroke-width="1.5" marker-end="url(#arr1)"/><line x1="470" y1="180" x2="510" y2="180" stroke="#374151" stroke-width="1.5" marker-end="url(#arr1)"/><defs><marker id="arr1" markerWidth="8" markerHeight="8" refX="7" refY="4" orient="auto"><polygon points="0 0, 8 4, 0 8" fill="#374151"/></marker></defs><text x="260" y="172" text-anchor="middle" font-size="9" fill="#6b7280">tail</text><text x="490" y="172" text-anchor="middle" font-size="9" fill="#6b7280">stream</text><text x="350" y="335" text-anchor="middle" font-size="10" fill="#6b7280" font-style="italic">No external log reader. The "tailing cursor" is a normal MongoDB cursor with achangeStream stage thatthe server expands into reads against the oplog, formatted into typed events with resume tokens.

The change stream API is layered on top of this. When you call db.orders.watch(), the driver opens an aggregation pipeline cursor where the first stage is $changeStream. The server expands that into the appropriate oplog reads, applies your $match and $project stages, and streams events back. The cursor uses awaitData so it blocks server-side until new oplog entries arrive — the client is not polling.

What a change event looks like

Each event the change stream cursor returns is itself a BSON document with a fixed top-level shape, plus operation-specific fields. The Python driver gives you a dict; the canonical shape is:

{
  "_id": { "_data": "8265..." },                # resume token
  "operationType": "update",                    # insert | update | delete | replace | drop | rename | invalidate
  "clusterTime": Timestamp(1714048000, 17),
  "wallTime": datetime(2026, 4, 25, 14, 30, 0),
  "ns": { "db": "fintech", "coll": "transactions" },
  "documentKey": { "_id": ObjectId("...") },
  "updateDescription": {
    "updatedFields": { "status": "settled", "settled_at": ... },
    "removedFields": [],
    "truncatedArrays": []
  },
  "fullDocument": { "_id": ..., "user_id": "u_19", "amount": 4999, "status": "settled" }
}

The _id is the resume token — an opaque (but cluster-time-bearing) blob the driver stores so it can reopen the stream at exactly this position after a disconnect. The token is the primary durability primitive of the change stream protocol; we will come back to it.

The operationType discriminates events. The five document-level types you will handle 99% of the time are:

  • insert — a new document was created. fullDocument carries the inserted document. documentKey is { _id: ... }. No updateDescription.
  • update — an existing document was modified by a $set / $inc / $unset / array-update modifier. updateDescription shows exactly which fields changed (the new values for updatedFields, the names for removedFields). fullDocument is not included by default — set full_document='updateLookup' on the watch call to ask MongoDB to do a follow-up read and attach the post-image. There is also updateLookup's newer cousin, whenAvailable and required, plus fullDocumentBeforeWhenAvailable for pre-images (requires enabling changeStreamPreAndPostImages on the collection).
  • delete — a document was removed. documentKey carries the _id. fullDocument is unavailable unless you had pre-images enabled (in which case fullDocumentBeforeChange carries the deleted document).
  • replace — a replaceOne overwrote the document; semantically a delete-then-insert from the change stream's view. fullDocument carries the new document.
  • drop, rename, dropDatabase, invalidate — collection-level events. invalidate is special: it tells the consumer the stream is over (the watched collection was dropped or renamed) and the existing resume token is no longer valid.

Anatomy of a MongoDB change event_id: { _data: "8265..." }resume token — opaque blob, persist thisoperationType: "update"discriminator: insert | update | delete | replace | drop | rename | invalidateclusterTime: Timestamp(...)global ordering across the replica setns: { db, coll }namespace — useful when watching a whole databasedocumentKey: { _id: ... }always present — the affected document's _id (plus shard key on sharded clusters)updateDescription: {updatedFields: { "status": "settled", ... },removedFields: [...], truncatedArrays: [...]}update only — the diff in field-level detailfullDocument: { ...the post-image... }always for insert/replace; for update only when fullDocument='updateLookup'; never for delete (unless pre-images on)— a follow-up findOne, so it reflects the doc state at the time of the lookup, not necessarily the time of the update

The fullDocument: 'updateLookup' option is convenient but worth thinking about. The lookup is a separate read against the current state of the collection. If the same document is updated again between the original update and the lookup, the fullDocument you receive reflects the later state, not the state immediately after this particular update. If your consumer cares about exact post-images per update, you must enable pre-and-post-image collection storage (which doubles write amplification and oplog footprint, so it is opt-in).

Resume tokens: the durability primitive

A change stream is useless if it cannot survive consumer restarts, network partitions, or replica set failovers without losing or duplicating events. The resume token makes this work.

The protocol is: every event you receive carries an _id field that is a resume token. Your consumer commits the token alongside the work it did with the event (typically: write event to Kafka, then commit the token). When the consumer restarts, it reads the last committed token from durable storage and reopens the change stream with resume_after=token — the server seeks the oplog to the matching cluster time and resumes streaming from the next event.

stream = db.transactions.watch(pipeline, resume_after=last_token)

There are three resume modes the API supports:

  1. resume_after — start after the position of this token. Used for normal restarts. The token must still be inside the oplog window.
  2. start_after — like resume_after but tolerates invalidate events (lets you cross a collection-rename boundary, for example).
  3. start_at_operation_time — start at a specific cluster timestamp. Used for the very first run when you have no token, or for back-filling from a known point.

Why the resume-token model is robust where polling is not: the token is generated by the server from the actual oplog position, not from any application-level field like updated_at. There is no clock skew, no missed deletes, no "what if two writes share the same timestamp" — the cluster time is monotonic with an ordinal counter, and the server resolves ties.

Two failure modes the consumer must handle. Failover: when the primary changes, the existing cursor is killed mid-stream. The driver catches the disconnect, reopens with the last token against the new primary (or against any secondary, if you specified one), and continues. Because the oplog is replicated, the same entries exist everywhere, so the resume is seamless. Oplog overflow: if the consumer is slow enough that the oldest token it knows is older than the oldest oplog entry, the resume fails with ChangeStreamHistoryLost. The only recoveries are to start from the current time (losing the events in the gap) or to do a full snapshot of the collection plus a fresh stream — exactly the painful situation Debezium calls a "snapshot recovery". Monitoring oplog lag against the oplog window is therefore mandatory.

Filtering: aggregation stages on the stream

The change stream cursor is an aggregation pipeline. The first stage is implicitly $changeStream; you supply additional stages that run server-side, before the events cross the network.

pipeline = [
    { "$match": {
        "operationType": { "$in": ["insert", "update"] },
        "fullDocument.amount": { "$gt": 100000 }   # only large transactions
    }},
    { "$project": {
        "operationType": 1,
        "documentKey": 1,
        "user_id": "$fullDocument.user_id",
        "amount": "$fullDocument.amount",
        "clusterTime": 1
    }}
]
stream = db.transactions.watch(pipeline, full_document='updateLookup')

Filtering server-side is a meaningful cost saver when only a small fraction of writes are interesting to a particular consumer. If you have ten downstream services that each care about a different sliver of the change feed, give each of them its own change stream cursor with its own filter, rather than fanning out a firehose and filtering in the application. The server does the work either way; the network savings and consumer simplicity are the win.

You can watch at three scopes: a single collection (db.transactions.watch()), an entire database (db.watch() — events from any collection in this database), or the whole deployment (client.watch() — every collection in every database). Wider scopes have proportionally more events and (on sharded clusters) higher latency because the server has to merge ordered streams from every shard.

A real pipeline: fintech fraud signals

Time to build something. A Bengaluru-based UPI-and-cards fintech — call it PaisaPay — needs sub-second fraud signals. Every transaction lands in MongoDB (fintech.transactions). A fraud model needs per-user features in a serving cache (Redis): rolling-window transaction counts, velocity, total amount in the last hour, distinct-merchant count today. The feature pipeline reads change events, aggregates per user in Flink, and writes to Redis. The model itself reads from Redis at score time, returning a decision in under 50 ms.

PaisaPay: change stream → Kafka → Flink → Redis

The producer is a Python service that owns the change stream and forwards events into a Kafka topic. The consumer is a Flink job that does the actual stateful aggregation. Splitting them lets us scale the Flink layer independently of the MongoDB cursor (which is single-threaded per collection scope).

The producer:

import json
import os
import time
from pathlib import Path
from pymongo import MongoClient
from pymongo.errors import PyMongoError
from kafka import KafkaProducer
from bson import json_util

MONGO_URI = os.environ["MONGO_URI"]
TOKEN_PATH = Path("/var/lib/paisapay/cdc_token.json")

client = MongoClient(MONGO_URI)
db = client["fintech"]
producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BROKERS"],
    value_serializer=lambda v: json_util.dumps(v).encode("utf-8"),
    acks="all",
    enable_idempotence=True,
    linger_ms=5,
)

def load_token():
    if TOKEN_PATH.exists():
        return json.loads(TOKEN_PATH.read_text())
    return None

def save_token(token):
    tmp = TOKEN_PATH.with_suffix(".tmp")
    tmp.write_text(json_util.dumps(token))
    tmp.replace(TOKEN_PATH)   # atomic on POSIX

def stream_loop():
    pipeline = [
        {"$match": {
            "operationType": {"$in": ["insert", "update", "replace"]},
        }}
    ]
    while True:
        try:
            kwargs = {"full_document": "updateLookup"}
            token = load_token()
            if token:
                kwargs["resume_after"] = token
            with db.transactions.watch(pipeline, **kwargs) as stream:
                for event in stream:
                    msg = {
                        "op": event["operationType"],
                        "user_id": event["fullDocument"]["user_id"],
                        "amount": event["fullDocument"]["amount"],
                        "merchant": event["fullDocument"].get("merchant"),
                        "txn_id": str(event["documentKey"]["_id"]),
                        "cluster_time": event["clusterTime"].as_datetime().isoformat(),
                    }
                    future = producer.send("paisapay.transactions", msg)
                    future.get(timeout=10)        # block until Kafka acks
                    save_token(event["_id"])      # then commit the token
        except PyMongoError as e:
            print(f"stream error: {e}; sleeping then resuming")
            time.sleep(2)

if __name__ == "__main__":
    stream_loop()

Two ordering guarantees matter here. First, the producer.send(...).get() is synchronous per event — we wait for Kafka to acknowledge before saving the token. Why synchronous: if we saved the token before Kafka acked and the process crashed in between, we would lose the event on restart. Sync-then-commit gives us at-least-once delivery into Kafka. The Flink layer downstream is responsible for idempotency on txn_id. Second, the token is written to a temp file and atomically renamed; a partial write that loses the token would be much worse than a small batch of duplicates.

The Flink aggregation (sketched, full job is in Java/Scala in production):

# pyflink sketch
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource
from pyflink.datastream.window import TumblingProcessingTimeWindows
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()
source = KafkaSource.builder() \
    .set_bootstrap_servers("kafka:9092") \
    .set_topics("paisapay.transactions") \
    .set_group_id("fraud-features") \
    .build()

stream = env.from_source(source, ...)
features = (stream
    .map(parse_event)
    .key_by(lambda e: e["user_id"])
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(UserFeatureAggregator())   # count, sum, distinct merchants
)
features.add_sink(RedisSink())   # writes to user:{user_id}:features
env.execute("paisapay-fraud-features")

End-to-end latency on this pipeline, measured in production-shaped tests: MongoDB write → change stream event ~50 ms (oplog replication + cursor wake), Kafka publish ~5 ms, Flink window emit ~1 minute (the window size — for sub-second features, drop the window and use stream-stream aggregation). The fraud model reads from Redis with sub-millisecond latency.

The same architecture, drawn:

Change stream → Kafka → Flink → serving cache + warehouseMongoDBreplica setoplogchange streamProducerpymongo .watch()save resume tokenat-least-onceKafkapaisapay.transactionspartitioned by user_idretain 7 daysFlinkstateful aggregationper-user featuresexactly-once via ckptRedisfeature cacheS3 / IcebergwarehouseDashboardlive opsOne change stream → many consumers via Kafka. Producer stays simple; downstream scales independently.

Why insert Kafka in the middle rather than have Flink read the change stream directly? Three reasons. First, fan-out: the same transaction events feed the fraud features, the data warehouse, the merchant analytics dashboard, and the email-notification service. Reading the change stream four times duplicates load on MongoDB; reading Kafka four times is what Kafka is for. Second, decoupling: if Flink restarts or the warehouse loader is down for an hour, the producer keeps running and Kafka holds the backlog. Third, replay: Kafka retention lets you reprocess the last seven days of events when a feature definition changes, without touching MongoDB.

Production concerns

The pipeline above hides a handful of operational realities that bite at scale.

Resume token storage. The token must be persisted somewhere durable that survives consumer-process death and host failure. A local file is fine for a single-instance producer; for redundancy, store the token in a small Postgres row, in a dedicated MongoDB collection, or in Kafka itself (commit the token alongside the offset). The token is small (~50 bytes); the durability guarantee is what matters.

Oplog window monitoring. Run db.getReplicationInfo() on a primary to get the oplog window in hours. Set an alert when the consumer's last-committed token is more than half the oplog window in age. Why half: if your consumer is consistently lagging by a fraction of the window, a brief period of slowdown plus a brief period of higher write rate can compound into total loss of position. Half-window gives you the headroom to recover before ChangeStreamHistoryLost.

Sharded clusters. When the watched scope spans multiple shards, the mongos router merges streams from each shard's oplog and reorders them by clusterTime before sending to the client. This requires every shard to be reachable, and the latency floor becomes the slowest shard's commit-acknowledgement time. The ordering guarantee holds (events come out in cluster-time order across shards) but tail latency is higher than a single replica set. For very wide deployments, consider per-shard change streams with downstream merging if you can tolerate weaker ordering.

Backpressure. A slow consumer cannot slow down MongoDB — writes keep happening regardless of whether anyone is reading the change stream. The cursor falls behind, eventually past the oplog window, and dies. Treat consumer lag as a critical SLO. If your consumer cannot keep up steadily, the answer is to parallelise (multiple consumers each reading a $match-filtered slice, merging downstream) or to add a Kafka layer that absorbs bursts.

Schema drift and consumer evolution. Change events are BSON documents, so a new field appearing in your collection appears in the change stream the moment a write contains it. Downstream consumers must be tolerant of unknown fields (ignore them) or you must coordinate schema changes between writers and consumers. This is the same discipline as evolving a Kafka schema with a registry; the change stream itself does not give you it.

Pre-images and post-images cost. Enabling changeStreamPreAndPostImages on a collection asks MongoDB to write the pre-image of every modified document into a side collection so the change stream can deliver it. Storage doubles for that collection's writes, and there is meaningful CPU cost on the primary. Enable only on collections where you genuinely need pre-images (typical case: compliance audit logs).

How the rest of the world does CDC

MongoDB is not the only document database with built-in change streams, and not all CDC happens on document databases.

AWS DocumentDB ships a near-identical change stream API, since it is wire-compatible with MongoDB. Subtle differences in event ordering on aggressively-resharded clusters and in pre-image support; otherwise the pymongo code above runs unchanged.

Azure Cosmos DB Change Feed exposes a similar capability under a different name and a different API shape. Cosmos partitions data by partition key, and the change feed is per-partition. You read with a continuation token (Cosmos's analogue of the resume token) and process events in partition order. The Cosmos change feed gives only post-images by default; deletes appear as a TTL-set update or require enabling the all-versions-and-deletes mode.

Couchbase DCP (Database Change Protocol) is the most "infrastructure" of the lot. DCP is the protocol Couchbase uses internally for replication, XDCR (cross-data-centre replication), and view/index maintenance. It exposes a low-level cursor over the per-vbucket mutation log. You write a DCP client (or use the Kafka connector) and you receive every mutation. More raw than MongoDB change streams, more flexible, more manual.

Debezium for MongoDB exists too — a Kafka Connect source connector that internally just opens a MongoDB change stream and shovels events into Kafka with a known schema. If your downstream world already standardises on Debezium connectors for Postgres and MySQL, using Debezium for MongoDB lets you have one CDC story across heterogeneous primary databases. The trade-off is one more system to run; if you already have Kafka Connect, the marginal cost is small.

Debezium for relational (MySQL, Postgres, SQL Server, Oracle) is the canonical bolt-on log reader. It tails the binlog or WAL out-of-band, parses the format, and emits Kafka records keyed by primary key with before and after images. The protocol is an exemplar of careful CDC engineering — handling DDL, schema changes, snapshot consistency, and exactly-once semantics — but it is two systems where MongoDB is one. Why this matters strategically: if you are starting greenfield with a document model, getting CDC for free in the protocol is a meaningful operational simplification. If you have an existing relational primary, Debezium is the right answer; the bolt-on cost is real but unavoidable.

AWS DMS for MongoDB (and DocumentDB) is a managed service that internally consumes change streams and writes to other AWS targets (S3, Redshift, Kinesis). Useful when your downstream is fully AWS-native and you do not want to operate a Kafka cluster.

System CDC mechanism Resume primitive Notes
MongoDB db.coll.watch() change stream over oplog resume token (per event) First-party, aggregation-pipeline filter
AWS DocumentDB MongoDB-compatible change stream resume token Same API; minor pre-image gaps
Cosmos DB Change Feed per partition continuation token Partition-scoped; all-versions mode optional
Couchbase DCP (Database Change Protocol) sequence number per vbucket Lower-level, used by XDCR internally
MySQL / Postgres Debezium tails binlog / WAL LSN / GTID Bolt-on log reader; canonical relational CDC

When to reach for change streams (and when not)

Change streams are the right tool when you need a low-latency, ordered, exact stream of changes from MongoDB to one or more downstream systems, and you are willing to operate the consumer. The strongest cases are real-time search-index updates, cache invalidation, fraud and risk feature pipelines, webhooks, and replicating into a warehouse with Iceberg or ClickHouse for analytics.

They are the wrong tool when you need a one-off snapshot (use a mongoexport or a mongodump), when your downstream consumer is so slow that it cannot stay inside the oplog window (snapshot first, stream after), or when you want every consumer to be totally isolated from MongoDB's evolution (in which case put Kafka in the middle as we did above, and let consumers read Kafka).

The deeper lesson goes back to the BSON chapter: the document model wins not by being a better key-value store than relational tables, but by treating each record as a self-describing tree that the database can hand back, mutate, and now stream as a structured event. Change streams are the third leg of that argument — your data is not just stored as documents, it is delivered as documents, all the way to your fraud model.

Common confusions

  • "A change stream gives me exactly-once delivery into my downstream system." No. The change stream itself delivers each oplog entry exactly once to your cursor, but the moment you forward it to Kafka, write to Redis, or call an HTTP webhook, you are back in the at-least-once world. The resume-token-after-ack pattern in the producer above is at-least-once: if the process crashes between Kafka acknowledging and the token being saved, the next run replays the same event. Idempotent downstream consumers (keyed by documentKey._id or clusterTime) are mandatory; the change stream cannot do this for you.

  • "fullDocument is a snapshot of the document at the moment of the update." It is not. With full_document='updateLookup', MongoDB does a follow-up findOne after generating the change event. If another write modifies the same document between the two, the fullDocument you receive reflects the later state — possibly several updates ahead of the event you are processing. For exact post-images you must enable changeStreamPreAndPostImages on the collection, which doubles write amplification and oplog footprint.

  • "resume_after lets me rewind to any point in history." Only as far back as the oplog still goes. The oplog is a capped collection (typically 24–72 hours on a busy cluster). Once your token's cluster time is older than the oldest oplog entry, the resume fails with ChangeStreamHistoryLost and your only recovery is a full snapshot followed by a fresh stream. Treat oplog window vs. consumer lag as a critical SLO, not a soft one.

  • "Watching the whole deployment (client.watch()) is just a convenience — same performance as per-collection." It is not. On a sharded cluster, deployment-wide and database-wide change streams require mongos to merge ordered streams from every shard before delivering events. The latency floor becomes the slowest shard's commit-acknowledgement time, and a single unreachable shard freezes the entire stream. Watch the narrowest scope your consumer actually needs.

  • "Change streams are MongoDB's version of Kafka." They share the surface — an ordered append-only log of events — but the semantics differ. Kafka retains messages independent of any consumer (you set retention by time or size); the oplog retains entries until it fills its capped allocation, and a slow consumer that falls behind silently loses position. Kafka supports replaying an arbitrary offset; change streams replay only inside the oplog window. The right pattern is usually change stream → Kafka → many consumers, not change stream → many consumers.

  • "A change stream with $match only sees the events that match — so the cursor work is proportional to matches, not writes." The server still has to scan every oplog entry to evaluate the match, even if 99% are filtered. The savings are entirely on the network and on consumer-side processing, not on the primary's CPU. If your match rejects almost everything, that is fine; if your match is computationally heavy (regex on a deep field), it can become a hot loop on the primary.

Going deeper

Why an oplog and not a write-ahead log. Postgres replication tails the WAL — the same log the storage engine uses for crash recovery. MongoDB's oplog is a logically separate, idempotent record of operations. Each oplog entry is shaped so that applying it twice produces the same result (an update is rewritten to set absolute values rather than $inc deltas in some cases, deletes are by _id). Why idempotent: a secondary that crashes mid-apply must be safe to re-apply the entry on restart. The same property is what lets a change-stream consumer re-process an event after a token-save crash without corrupting downstream state. The cost is a small write amplification compared to a raw operation log; the benefit is that replication and CDC share one trustworthy primitive.

Resume tokens are versioned. The _data field of a resume token is a hex-encoded BSON blob whose internal layout MongoDB has changed across versions (v0 in 4.0, v1 from 4.2 onward, with further additions for sharded post-image support in 6.0+). The protocol is forward-compatible — a newer server can resume from an older token — but not backward-compatible: tokens minted by 6.0 with post-image support cannot be resumed against a 4.4 server. Practical consequence: rolling-downgrade scenarios may fail to resume, and your consumer must be prepared to fall back to start_at_operation_time from the most recent known cluster time.

Sharded ordering: the clusterTime merge. On a sharded cluster, each shard maintains its own oplog. When you open a change stream against mongos, the router opens an internal cursor on each shard and merges the streams in cluster-time order. The merge can only emit an event from shard A once it has received an event from shard B with cluster time strictly greater — otherwise it cannot guarantee ordering. Why this matters under partial failure: if one shard goes silent (network partition, slow primary), the merge stalls because mongos cannot prove no earlier event exists on that shard. A heartbeat-on-idle mechanism ($changeStream.allChangesForCluster) emits no-op markers so the merge can advance, but the latency floor is the slowest shard's heartbeat interval. This is the same problem Spanner solves with TrueTime: ordering across partitions requires either bounded clock skew or explicit coordination.

Pre-images: write amplification math. When you enable changeStreamPreAndPostImages on a collection with average document size 4 KB and 1,000 writes/sec, you add roughly 4 MB/sec of pre-image writes to a side collection plus the oplog entries describing them. Over a 24-hour day that is ~340 GB of additional storage, and the disk write rate doubles. On a PaisaBridge-scale cluster doing UPI transactions (~3000 tps at peak), this is the difference between "comfortable on NVMe" and "buying more SSDs". Enable per collection only when you genuinely need the before-image — fraud audit logs, GDPR delete-trail requirements, or downstream systems that reconcile diffs (Iceberg merge-into).

Couchbase DCP and Kafka's design. Couchbase's DCP protocol predates Kafka and influenced it indirectly: per-partition (vbucket) sequence numbers, server-side filtering, and a binary protocol designed to push at line rate to network-attached consumers. The MongoDB change stream is more user-friendly (typed events, BSON, an aggregation pipeline) but slower per byte than DCP because of that abstraction layer. If you measure raw throughput (events/sec/vCPU), DCP wins; if you measure developer-time-to-working-pipeline, change streams win by a wide margin.

The "snapshot + stream" recovery pattern. When a consumer falls outside the oplog window, the recovery is exactly the same shape as Debezium's "initial snapshot": (1) start a change stream cursor at the current cluster time and buffer events, (2) snapshot the collection (mongodump or a paginated find), (3) seed the downstream system with the snapshot, then (4) replay the buffered events on top, using _id as the idempotency key. Why this works: the snapshot has each document at some point in time after step (1). Any change events from before the snapshot's read of a given _id are replayed and idempotently overwrite the snapshot's stale value; events after are replayed and applied in order. The result is consistent. Get this procedure documented and tested before your first oplog overflow incident; it is the single most expensive operation you will run on the cluster.

Latency budget on a real fintech. PaisaPay's fraud loop has a hard 200 ms budget from "user taps Pay" to "model returns score". Roughly: write to MongoDB ~15 ms (including replica set commit), oplog → change stream event ~30–50 ms (oplog replication + cursor wake-up), Kafka publish ~5 ms, Flink window emit if you use windows ~window-size, Redis read ~1 ms, model inference ~30 ms. The Flink window is the budget killer. For sub-second features, drop tumbling windows in favour of stream-stream incremental updates: maintain the per-user counter in Flink state, increment on each event, snapshot to Redis on every Nth event or every 100 ms — whichever comes first. The math is the same as a RUNNING SUM in SQL streaming.

References

  1. MongoDB Change Streams documentation — official reference for the watch API, event types, resume tokens, and pre-/post-image options.
  2. The MongoDB Oplog and Replication — deep-dive into how the oplog is structured and how secondaries replay it.
  3. Debezium MongoDB connector — Kafka Connect source built on top of change streams; useful as a reference architecture and for heterogeneous Debezium deployments.
  4. Confluent: CDC patterns and best practices — design patterns for log-based CDC pipelines into Kafka.
  5. AWS DMS for MongoDB sources — managed CDC out of MongoDB / DocumentDB into AWS targets.
  6. Couchbase DCP protocol overview — comparison point for a different document database's lower-level change protocol.