Change Streams: CDC Built In

In short

Most production data pipelines do not want a snapshot of your database — they want a stream of changes. Refresh a search index every time a product is edited, invalidate a Redis cache the moment a user updates their profile, push every new transaction into a fraud-detection model within 200 ms of it landing. The pattern that makes this possible is Change Data Capture: turn the database into an event source.

The traditional way is to bolt a log reader on the side. Debezium for MySQL or Postgres tails the binlog or WAL out-of-band, parses an internal-ish format the database vendor never quite promises to keep stable, and emits Kafka records. It works, but you operate two systems, you carry the risk of binlog format drift, and you fight the vendor every time you upgrade a major version.

MongoDB built CDC into the protocol. A change stream is an aggregation-pipeline cursor on the replica set's oplog — the operation log every replica already maintains for replication. You open the cursor, you get a stream of typed events (insert, update, delete, replace, drop, invalidate), and each event carries a resume token so a crashed consumer can pick up exactly where it left off. No external log reader, no format-drift risk, no second deployment.

This chapter walks the protocol — how the oplog is structured, what each event type contains, how resume tokens survive failover — and builds a real pymongo pipeline for an Indian fintech that streams transactions into Kafka, into Flink, into a per-user feature cache that a fraud model reads at score time. We finish with the comparison table: MongoDB change streams, AWS DocumentDB change streams, Cosmos DB change feed, Couchbase DCP, and Debezium for the relational world.

The CDC problem, stated cleanly

Every interesting data system that is not your primary database ultimately has the same job: stay in sync with that primary database. A search index needs to know when a product is added or its price changes. A cache needs to know when the source of truth invalidates. A data warehouse needs the day's new transactions. A fraud model needs the latest signals about user behaviour. A webhook integration needs to fire when an order moves to "shipped".

You have three crude ways to do this and one good one.

Polling is the crude default: every N seconds, run SELECT * FROM orders WHERE updated_at > :last_seen. It is simple, it works, it is also wrong in three ways at once. You miss deletes (a deleted row no longer satisfies the predicate), you miss intermediate states (two updates between polls collapse into one), and you put a constant scan load on your primary database that scales with both your data size and your polling frequency. Why polling rots: latency floor is your poll interval, deletes are invisible, and the load on the primary grows linearly with how fresh you want to be — exactly backwards from what you want.

Dual writes are worse. The application writes to MongoDB, then writes to Kafka, and you hope both succeed. They will not, eventually. A network blip after the first write and before the second leaves the systems permanently inconsistent. There is no atomic boundary that spans two distributed systems without a distributed transaction protocol you do not want to operate.

Triggers are better but rigid. Postgres AFTER INSERT OR UPDATE OR DELETE triggers can call a function that writes to a queue, but every change pays the trigger cost synchronously, and the queue write is inside the transaction. If the queue is down, your writes fail. If the queue is slow, your writes are slow.

The good way is log-based CDC. The database already writes a durable, ordered log of every change — that is how replication works. A consumer that can tail this log gets the perfect change stream: every event in commit order, no intermediate states lost, no extra load on the writer (the log was already being written), and the latency is bounded only by replication lag. Debezium is the canonical implementation of this idea for relational databases, parsing the MySQL binlog or Postgres WAL into Kafka records.

MongoDB's contribution is to skip the bolt-on layer. The replica set's oplog is already a structured, idempotent, ordered log of operations — and MongoDB exposes it through a first-class change stream API.

The MongoDB oplog

A MongoDB replica set is a primary plus one or more secondaries. The primary accepts writes; secondaries replay them in order. The mechanism is the oplog (local.oplog.rs), a capped collection that lives on every replica.

Every write to the primary appends an entry to the oplog. Each entry is itself a BSON document with roughly this shape:

{
  "ts": Timestamp(1714048000, 17),
  "t": 42,
  "h": NumberLong("..."),
  "v": 2,
  "op": "i",
  "ns": "shop.orders",
  "o": { "_id": ObjectId("..."), "user_id": "u_19", "amount": 4999 }
}

The fields that matter: ts is the cluster time (timestamp + ordinal counter, monotonically increasing), op is the operation type (i insert, u update, d delete, c command, n no-op), ns is the namespace (db.collection), and o is the operation payload — the inserted document, the update modifier, or the document key for a delete.

Secondaries tail this log over a regular query cursor with awaitData (the cursor blocks on the server until new data arrives, rather than returning empty and being polled). When a secondary applies an entry, it is doing the exact same logical operation the primary did. Why this matters for CDC: the oplog is the source of truth for replication. Every byte of it is durable, every ordering is preserved, every operation is idempotent (applying it twice produces the same result). Those are exactly the properties a change feed needs.

The oplog is capped — it has a fixed maximum size (default tuned to 5% of free disk, usually 10–50 GB on a production node). When it fills, oldest entries are overwritten. The window of time the oplog covers is the oplog window, and on a busy production cluster this might be 24–72 hours. A consumer that falls behind by more than the oplog window cannot resume from where it stopped — its resume token points to a deleted entry. We will come back to monitoring this.

changeStream</text><text x="380" y="218" text-anchor="middle" font-size="10" fill="#374151">filters by ns, op, fields</text><text x="380" y="234" text-anchor="middle" font-size="10" fill="#374151">awaitData = blocks for new ops</text><text x="380" y="250" text-anchor="middle" font-size="10" fill="#374151">attaches resume token to event</text><text x="380" y="266" text-anchor="middle" font-size="10" fill="#15803d">reads from any replica</text><rect x="510" y="100" width="160" height="200" fill="#dcfce7" stroke="#15803d" rx="6"/><text x="590" y="120" text-anchor="middle" font-weight="bold" fill="#15803d">Application client</text><text x="590" y="142" text-anchor="middle" font-size="10">pymongo / Java driver</text><text x="590" y="158" text-anchor="middle" font-family="monospace" font-size="10">db.coll.watch(pipeline)</text><text x="590" y="180" text-anchor="middle" font-size="10" fill="#374151">long-running cursor</text><text x="590" y="196" text-anchor="middle" font-size="10" fill="#374151">receives ChangeEvents</text><text x="590" y="212" text-anchor="middle" font-size="10" fill="#374151">stores last resume token</text><text x="590" y="228" text-anchor="middle" font-size="10" fill="#374151">reconnects with token</text><text x="590" y="252" text-anchor="middle" font-size="10" fill="#15803d">writes to Kafka /</text><text x="590" y="266" text-anchor="middle" font-size="10" fill="#15803d">Elasticsearch / Redis</text><line x1="230" y1="180" x2="290" y2="180" stroke="#374151" stroke-width="1.5" marker-end="url(#arr1)"/><line x1="470" y1="180" x2="510" y2="180" stroke="#374151" stroke-width="1.5" marker-end="url(#arr1)"/><defs><marker id="arr1" markerWidth="8" markerHeight="8" refX="7" refY="4" orient="auto"><polygon points="0 0, 8 4, 0 8" fill="#374151"/></marker></defs><text x="260" y="172" text-anchor="middle" font-size="9" fill="#6b7280">tail</text><text x="490" y="172" text-anchor="middle" font-size="9" fill="#6b7280">stream</text><text x="350" y="335" text-anchor="middle" font-size="10" fill="#6b7280" font-style="italic">No external log reader. The "tailing cursor" is a normal MongoDB cursor with achangeStream stage thatthe server expands into reads against the oplog, formatted into typed events with resume tokens.

The change stream API is layered on top of this. When you call db.orders.watch(), the driver opens an aggregation pipeline cursor where the first stage is $changeStream. The server expands that into the appropriate oplog reads, applies your $match and $project stages, and streams events back. The cursor uses awaitData so it blocks server-side until new oplog entries arrive — the client is not polling.

What a change event looks like

Each event the change stream cursor returns is itself a BSON document with a fixed top-level shape, plus operation-specific fields. The Python driver gives you a dict; the canonical shape is:

{
  "_id": { "_data": "8265..." },                # resume token
  "operationType": "update",                    # insert | update | delete | replace | drop | rename | invalidate
  "clusterTime": Timestamp(1714048000, 17),
  "wallTime": datetime(2026, 4, 25, 14, 30, 0),
  "ns": { "db": "fintech", "coll": "transactions" },
  "documentKey": { "_id": ObjectId("...") },
  "updateDescription": {
    "updatedFields": { "status": "settled", "settled_at": ... },
    "removedFields": [],
    "truncatedArrays": []
  },
  "fullDocument": { "_id": ..., "user_id": "u_19", "amount": 4999, "status": "settled" }
}

The _id is the resume token — an opaque (but cluster-time-bearing) blob the driver stores so it can reopen the stream at exactly this position after a disconnect. The token is the primary durability primitive of the change stream protocol; we will come back to it.

The operationType discriminates events. The five document-level types you will handle 99% of the time are:

insert — a new document was created. fullDocument carries the inserted document. documentKey is { _id: ... }. No updateDescription.
update — an existing document was modified by a $set / $inc / $unset / array-update modifier. updateDescription shows exactly which fields changed (the new values for updatedFields, the names for removedFields). fullDocument is not included by default — set full_document='updateLookup' on the watch call to ask MongoDB to do a follow-up read and attach the post-image. There is also updateLookup's newer cousin, whenAvailable and required, plus fullDocumentBeforeWhenAvailable for pre-images (requires enabling changeStreamPreAndPostImages on the collection).
delete — a document was removed. documentKey carries the _id. fullDocument is unavailable unless you had pre-images enabled (in which case fullDocumentBeforeChange carries the deleted document).
replace — a replaceOne overwrote the document; semantically a delete-then-insert from the change stream's view. fullDocument carries the new document.
drop, rename, dropDatabase, invalidate — collection-level events. invalidate is special: it tells the consumer the stream is over (the watched collection was dropped or renamed) and the existing resume token is no longer valid.

The fullDocument: 'updateLookup' option is convenient but worth thinking about. The lookup is a separate read against the current state of the collection. If the same document is updated again between the original update and the lookup, the fullDocument you receive reflects the later state, not the state immediately after this particular update. If your consumer cares about exact post-images per update, you must enable pre-and-post-image collection storage (which doubles write amplification and oplog footprint, so it is opt-in).

Resume tokens: the durability primitive

A change stream is useless if it cannot survive consumer restarts, network partitions, or replica set failovers without losing or duplicating events. The resume token makes this work.

The protocol is: every event you receive carries an _id field that is a resume token. Your consumer commits the token alongside the work it did with the event (typically: write event to Kafka, then commit the token). When the consumer restarts, it reads the last committed token from durable storage and reopens the change stream with resume_after=token — the server seeks the oplog to the matching cluster time and resumes streaming from the next event.

stream = db.transactions.watch(pipeline, resume_after=last_token)

There are three resume modes the API supports:

resume_after — start after the position of this token. Used for normal restarts. The token must still be inside the oplog window.
start_after — like resume_after but tolerates invalidate events (lets you cross a collection-rename boundary, for example).
start_at_operation_time — start at a specific cluster timestamp. Used for the very first run when you have no token, or for back-filling from a known point.

Why the resume-token model is robust where polling is not: the token is generated by the server from the actual oplog position, not from any application-level field like updated_at. There is no clock skew, no missed deletes, no "what if two writes share the same timestamp" — the cluster time is monotonic with an ordinal counter, and the server resolves ties.

Two failure modes the consumer must handle. Failover: when the primary changes, the existing cursor is killed mid-stream. The driver catches the disconnect, reopens with the last token against the new primary (or against any secondary, if you specified one), and continues. Because the oplog is replicated, the same entries exist everywhere, so the resume is seamless. Oplog overflow: if the consumer is slow enough that the oldest token it knows is older than the oldest oplog entry, the resume fails with ChangeStreamHistoryLost. The only recoveries are to start from the current time (losing the events in the gap) or to do a full snapshot of the collection plus a fresh stream — exactly the painful situation Debezium calls a "snapshot recovery". Monitoring oplog lag against the oplog window is therefore mandatory.

Filtering: aggregation stages on the stream

The change stream cursor is an aggregation pipeline. The first stage is implicitly $changeStream; you supply additional stages that run server-side, before the events cross the network.

pipeline = [
    { "$match": {
        "operationType": { "$in": ["insert", "update"] },
        "fullDocument.amount": { "$gt": 100000 }   # only large transactions
    }},
    { "$project": {
        "operationType": 1,
        "documentKey": 1,
        "user_id": "$fullDocument.user_id",
        "amount": "$fullDocument.amount",
        "clusterTime": 1
    }}
]
stream = db.transactions.watch(pipeline, full_document='updateLookup')

Filtering server-side is a meaningful cost saver when only a small fraction of writes are interesting to a particular consumer. If you have ten downstream services that each care about a different sliver of the change feed, give each of them its own change stream cursor with its own filter, rather than fanning out a firehose and filtering in the application. The server does the work either way; the network savings and consumer simplicity are the win.

You can watch at three scopes: a single collection (db.transactions.watch()), an entire database (db.watch() — events from any collection in this database), or the whole deployment (client.watch() — every collection in every database). Wider scopes have proportionally more events and (on sharded clusters) higher latency because the server has to merge ordered streams from every shard.

A real pipeline: fintech fraud signals

Time to build something. A Bengaluru-based UPI-and-cards fintech — call it PaisaPay — needs sub-second fraud signals. Every transaction lands in MongoDB (fintech.transactions). A fraud model needs per-user features in a serving cache (Redis): rolling-window transaction counts, velocity, total amount in the last hour, distinct-merchant count today. The feature pipeline reads change events, aggregates per user in Flink, and writes to Redis. The model itself reads from Redis at score time, returning a decision in under 50 ms.

PaisaPay: change stream → Kafka → Flink → Redis

The producer is a Python service that owns the change stream and forwards events into a Kafka topic. The consumer is a Flink job that does the actual stateful aggregation. Splitting them lets us scale the Flink layer independently of the MongoDB cursor (which is single-threaded per collection scope).

The producer:

import json
import os
import time
from pathlib import Path
from pymongo import MongoClient
from pymongo.errors import PyMongoError
from kafka import KafkaProducer
from bson import json_util

MONGO_URI = os.environ["MONGO_URI"]
TOKEN_PATH = Path("/var/lib/paisapay/cdc_token.json")

client = MongoClient(MONGO_URI)
db = client["fintech"]
producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BROKERS"],
    value_serializer=lambda v: json_util.dumps(v).encode("utf-8"),
    acks="all",
    enable_idempotence=True,
    linger_ms=5,
)

def load_token():
    if TOKEN_PATH.exists():
        return json.loads(TOKEN_PATH.read_text())
    return None

def save_token(token):
    tmp = TOKEN_PATH.with_suffix(".tmp")
    tmp.write_text(json_util.dumps(token))
    tmp.replace(TOKEN_PATH)   # atomic on POSIX

def stream_loop():
    pipeline = [
        {"$match": {
            "operationType": {"$in": ["insert", "update", "replace"]},
        }}
    ]
    while True:
        try:
            kwargs = {"full_document": "updateLookup"}
            token = load_token()
            if token:
                kwargs["resume_after"] = token
            with db.transactions.watch(pipeline, **kwargs) as stream:
                for event in stream:
                    msg = {
                        "op": event["operationType"],
                        "user_id": event["fullDocument"]["user_id"],
                        "amount": event["fullDocument"]["amount"],
                        "merchant": event["fullDocument"].get("merchant"),
                        "txn_id": str(event["documentKey"]["_id"]),
                        "cluster_time": event["clusterTime"].as_datetime().isoformat(),
                    }
                    future = producer.send("paisapay.transactions", msg)
                    future.get(timeout=10)        # block until Kafka acks
                    save_token(event["_id"])      # then commit the token
        except PyMongoError as e:
            print(f"stream error: {e}; sleeping then resuming")
            time.sleep(2)

if __name__ == "__main__":
    stream_loop()

Two ordering guarantees matter here. First, the producer.send(...).get() is synchronous per event — we wait for Kafka to acknowledge before saving the token. Why synchronous: if we saved the token before Kafka acked and the process crashed in between, we would lose the event on restart. Sync-then-commit gives us at-least-once delivery into Kafka. The Flink layer downstream is responsible for idempotency on txn_id. Second, the token is written to a temp file and atomically renamed; a partial write that loses the token would be much worse than a small batch of duplicates.

The Flink aggregation (sketched, full job is in Java/Scala in production):

# pyflink sketch
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource
from pyflink.datastream.window import TumblingProcessingTimeWindows
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()
source = KafkaSource.builder() \
    .set_bootstrap_servers("kafka:9092") \
    .set_topics("paisapay.transactions") \
    .set_group_id("fraud-features") \
    .build()

stream = env.from_source(source, ...)
features = (stream
    .map(parse_event)
    .key_by(lambda e: e["user_id"])
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(UserFeatureAggregator())   # count, sum, distinct merchants
)
features.add_sink(RedisSink())   # writes to user:{user_id}:features
env.execute("paisapay-fraud-features")

End-to-end latency on this pipeline, measured in production-shaped tests: MongoDB write → change stream event ~50 ms (oplog replication + cursor wake), Kafka publish ~5 ms, Flink window emit ~1 minute (the window size — for sub-second features, drop the window and use stream-stream aggregation). The fraud model reads from Redis with sub-millisecond latency.

The same architecture, drawn:

Why insert Kafka in the middle rather than have Flink read the change stream directly? Three reasons. First, fan-out: the same transaction events feed the fraud features, the data warehouse, the merchant analytics dashboard, and the email-notification service. Reading the change stream four times duplicates load on MongoDB; reading Kafka four times is what Kafka is for. Second, decoupling: if Flink restarts or the warehouse loader is down for an hour, the producer keeps running and Kafka holds the backlog. Third, replay: Kafka retention lets you reprocess the last seven days of events when a feature definition changes, without touching MongoDB.

Production concerns

The pipeline above hides a handful of operational realities that bite at scale.

Resume token storage. The token must be persisted somewhere durable that survives consumer-process death and host failure. A local file is fine for a single-instance producer; for redundancy, store the token in a small Postgres row, in a dedicated MongoDB collection, or in Kafka itself (commit the token alongside the offset). The token is small (~50 bytes); the durability guarantee is what matters.

Oplog window monitoring. Run db.getReplicationInfo() on a primary to get the oplog window in hours. Set an alert when the consumer's last-committed token is more than half the oplog window in age. Why half: if your consumer is consistently lagging by a fraction of the window, a brief period of slowdown plus a brief period of higher write rate can compound into total loss of position. Half-window gives you the headroom to recover before ChangeStreamHistoryLost.

Sharded clusters. When the watched scope spans multiple shards, the mongos router merges streams from each shard's oplog and reorders them by clusterTime before sending to the client. This requires every shard to be reachable, and the latency floor becomes the slowest shard's commit-acknowledgement time. The ordering guarantee holds (events come out in cluster-time order across shards) but tail latency is higher than a single replica set. For very wide deployments, consider per-shard change streams with downstream merging if you can tolerate weaker ordering.

Backpressure. A slow consumer cannot slow down MongoDB — writes keep happening regardless of whether anyone is reading the change stream. The cursor falls behind, eventually past the oplog window, and dies. Treat consumer lag as a critical SLO. If your consumer cannot keep up steadily, the answer is to parallelise (multiple consumers each reading a $match-filtered slice, merging downstream) or to add a Kafka layer that absorbs bursts.

Schema drift and consumer evolution. Change events are BSON documents, so a new field appearing in your collection appears in the change stream the moment a write contains it. Downstream consumers must be tolerant of unknown fields (ignore them) or you must coordinate schema changes between writers and consumers. This is the same discipline as evolving a Kafka schema with a registry; the change stream itself does not give you it.

Pre-images and post-images cost. Enabling changeStreamPreAndPostImages on a collection asks MongoDB to write the pre-image of every modified document into a side collection so the change stream can deliver it. Storage doubles for that collection's writes, and there is meaningful CPU cost on the primary. Enable only on collections where you genuinely need pre-images (typical case: compliance audit logs).

How the rest of the world does CDC

MongoDB is not the only document database with built-in change streams, and not all CDC happens on document databases.

AWS DocumentDB ships a near-identical change stream API, since it is wire-compatible with MongoDB. Subtle differences in event ordering on aggressively-resharded clusters and in pre-image support; otherwise the pymongo code above runs unchanged.

Azure Cosmos DB Change Feed exposes a similar capability under a different name and a different API shape. Cosmos partitions data by partition key, and the change feed is per-partition. You read with a continuation token (Cosmos's analogue of the resume token) and process events in partition order. The Cosmos change feed gives only post-images by default; deletes appear as a TTL-set update or require enabling the all-versions-and-deletes mode.

Couchbase DCP (Database Change Protocol) is the most "infrastructure" of the lot. DCP is the protocol Couchbase uses internally for replication, XDCR (cross-data-centre replication), and view/index maintenance. It exposes a low-level cursor over the per-vbucket mutation log. You write a DCP client (or use the Kafka connector) and you receive every mutation. More raw than MongoDB change streams, more flexible, more manual.

Debezium for MongoDB exists too — a Kafka Connect source connector that internally just opens a MongoDB change stream and shovels events into Kafka with a known schema. If your downstream world already standardises on Debezium connectors for Postgres and MySQL, using Debezium for MongoDB lets you have one CDC story across heterogeneous primary databases. The trade-off is one more system to run; if you already have Kafka Connect, the marginal cost is small.

Debezium for relational (MySQL, Postgres, SQL Server, Oracle) is the canonical bolt-on log reader. It tails the binlog or WAL out-of-band, parses the format, and emits Kafka records keyed by primary key with before and after images. The protocol is an exemplar of careful CDC engineering — handling DDL, schema changes, snapshot consistency, and exactly-once semantics — but it is two systems where MongoDB is one. Why this matters strategically: if you are starting greenfield with a document model, getting CDC for free in the protocol is a meaningful operational simplification. If you have an existing relational primary, Debezium is the right answer; the bolt-on cost is real but unavoidable.

AWS DMS for MongoDB (and DocumentDB) is a managed service that internally consumes change streams and writes to other AWS targets (S3, Redshift, Kinesis). Useful when your downstream is fully AWS-native and you do not want to operate a Kafka cluster.

System	CDC mechanism	Resume primitive	Notes
MongoDB	`db.coll.watch()` change stream over oplog	resume token (per event)	First-party, aggregation-pipeline filter
AWS DocumentDB	MongoDB-compatible change stream	resume token	Same API; minor pre-image gaps
Cosmos DB	Change Feed per partition	continuation token	Partition-scoped; all-versions mode optional
Couchbase	DCP (Database Change Protocol)	sequence number per vbucket	Lower-level, used by XDCR internally
MySQL / Postgres	Debezium tails binlog / WAL	LSN / GTID	Bolt-on log reader; canonical relational CDC

When to reach for change streams (and when not)

Change streams are the right tool when you need a low-latency, ordered, exact stream of changes from MongoDB to one or more downstream systems, and you are willing to operate the consumer. The strongest cases are real-time search-index updates, cache invalidation, fraud and risk feature pipelines, webhooks, and replicating into a warehouse with Iceberg or ClickHouse for analytics.

They are the wrong tool when you need a one-off snapshot (use a mongoexport or a mongodump), when your downstream consumer is so slow that it cannot stay inside the oplog window (snapshot first, stream after), or when you want every consumer to be totally isolated from MongoDB's evolution (in which case put Kafka in the middle as we did above, and let consumers read Kafka).

The deeper lesson goes back to the BSON chapter: the document model wins not by being a better key-value store than relational tables, but by treating each record as a self-describing tree that the database can hand back, mutate, and now stream as a structured event. Change streams are the third leg of that argument — your data is not just stored as documents, it is delivered as documents, all the way to your fraud model.

References

MongoDB Change Streams documentation — official reference for the watch API, event types, resume tokens, and pre-/post-image options.
The MongoDB Oplog and Replication — deep-dive into how the oplog is structured and how secondaries replay it.
Debezium MongoDB connector — Kafka Connect source built on top of change streams; useful as a reference architecture and for heterogeneous Debezium deployments.
Confluent: CDC patterns and best practices — design patterns for log-based CDC pipelines into Kafka.
AWS DMS for MongoDB sources — managed CDC out of MongoDB / DocumentDB into AWS targets.
Couchbase DCP protocol overview — comparison point for a different document database's lower-level change protocol.