MySQL binlog format and replication protocol
The previous chapter built a Postgres logical-decoding consumer in 30 lines. MySQL has the same primitive but a different shape: instead of an in-process plugin emitting decoded events, MySQL writes a binary file called the binlog and asks consumers to pretend they are a replica. Every CDC tool you have heard of — Debezium MySQL, AWS DMS, Maxwell, Canal at Alibaba — is doing the same thing: opening a TCP connection, sending COM_BINLOG_DUMP_GTID, and parsing the bytes that come back. This chapter writes that connection by hand.
The MySQL binlog is an append-only file of typed events — WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT, plus TABLE_MAP_EVENT and QUERY_EVENT. A CDC consumer connects on the standard MySQL port, identifies as a replica, and the server streams events from a position (filename + offset, or a GTID set). The choice of binlog_format — STATEMENT, ROW, or MIXED — decides whether the events are SQL strings or before/after images. ROW is the only format that supports CDC correctly.
What the binlog actually is
MySQL has had a binlog since the 3.23 era (circa 2000). It exists for two original purposes — point-in-time recovery and replication to a replica — and one accidental purpose that the rest of the industry built a multi-crore market on top of: change data capture. The mechanism is the same in all three cases. Every transaction that commits writes a sequence of events to a file in datadir, named mysql-bin.000001, mysql-bin.000002, and so on. Each event is a typed binary record with a header (timestamp, type, server id, length) and a body whose layout depends on the type.
The events you care about for CDC are five:
- FORMAT_DESCRIPTION_EVENT — the first event in every binlog file; describes the format version and the layout of subsequent events. A consumer must read this to know how to parse anything else.
- TABLE_MAP_EVENT — emitted before every batch of row events; carries the table id, schema name, table name, and the column types. Essential, because the row events themselves only carry column values, not names — the consumer must remember the most recent TABLE_MAP for that table id.
- WRITE_ROWS_EVENT / UPDATE_ROWS_EVENT / DELETE_ROWS_EVENT — the actual change events. WRITE carries one image (the inserted row); UPDATE carries two (before and after); DELETE carries one (the deleted row).
- QUERY_EVENT — DDL goes here as a literal SQL string. ALTER TABLE orders ADD COLUMN refund_id BIGINT arrives as text; the consumer is responsible for understanding what changed about the schema.
- XID_EVENT — the commit marker. A row event without a subsequent XID_EVENT is part of an uncommitted transaction; honest consumers buffer until the XID arrives.
The crucial structural point: row events do not carry column names. They carry a table id and a bitmap of columns present in this image, plus the values, encoded according to the types declared in the most recent TABLE_MAP for that id. Why this matters for CDC correctness: if the consumer drops a TABLE_MAP for any reason — restart, network blip, a bug — the next row event arrives uninterpretable, because the type information has been lost. Consumers must replay binlog from the start of the transaction's containing event group, never from "the next row event". Debezium, Maxwell, and Canal all enforce this.
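The bookkeeping this paragraph implies can be sketched in a few lines. This is a toy illustration with hypothetical names (TableMapCache, on_table_map, resolve), not any real library's API: the consumer keeps the most recent TABLE_MAP per table id and fails loudly when a row event references an id it has never mapped.

```python
class TableMapCache:
    """Consumer-side state populated by TABLE_MAP events."""

    def __init__(self):
        self._maps = {}  # table_id -> (schema, table, column_types)

    def on_table_map(self, table_id, schema, table, column_types):
        # A new TABLE_MAP for the same id overwrites the old one:
        # MySQL reuses ids, and the latest map is the authoritative one.
        self._maps[table_id] = (schema, table, tuple(column_types))

    def resolve(self, table_id):
        # Failing loudly is the correct behaviour: a missing map means
        # the type information is lost and the row event is
        # uninterpretable. Replay the event group, never skip.
        if table_id not in self._maps:
            raise LookupError(f"no TABLE_MAP seen for table id {table_id}")
        return self._maps[table_id]
```

A row-event handler would call `resolve(event.table_id)` before decoding values, which is exactly what `fail_on_table_metadata_unavailable=True` does for you in the library used later in this chapter.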
STATEMENT, ROW, MIXED — the format choice that decides everything
binlog_format is a server-wide setting. Three values:
- STATEMENT — writes the SQL string. UPDATE orders SET status='paid' WHERE id=98421 arrives as text. Compact. Replicas re-execute the SQL. Useless for CDC because non-deterministic functions (NOW(), RAND(), UUID()), LIMIT without ORDER BY, and triggers all yield different results on the consumer than they did on the source.
- ROW — writes the before and after images of every row affected. UPDATE orders SET status='paid' WHERE merchant_id=88 AND amount > 5000 could touch 10,000 rows, and ROW writes 10,000 events of ~140 bytes each. Verbose. Deterministic. The only correct choice for CDC.
- MIXED — the server picks STATEMENT for "safe" queries and ROW for unsafe ones. Sounds sensible; in practice it means the consumer cannot tell which mode any given event was written in without per-event metadata, and any STATEMENT event in the stream is a CDC bug waiting to happen.
In production, every shop running CDC at scale runs binlog_format = ROW. Razorpay, Flipkart, Dream11, BookMyShow — all of them. The cost is real: ROW format binlogs are 5–20× larger than STATEMENT for the same workload. A high-write database that fit in 100 MB of binlog per hour under STATEMENT will produce 1–2 GB of binlog per hour under ROW. Why this trade-off is forced: the consumer needs deterministic, side-effect-free events. STATEMENT is fundamentally non-deterministic on the consumer because the consumer is not re-executing the original query plan with the original data. ROW pays size for correctness, and the size cost is paid by the source's disk, not the consumer's.
Two related settings the production checklist insists on:
- binlog_row_image = FULL — record every column in both before and after images, not just the changed ones. The default in MySQL 5.7+ is FULL; if your team has set MINIMAL to save space, the consumer cannot reconstruct the unchanged columns and downstream joins break.
- gtid_mode = ON plus enforce_gtid_consistency = ON — enables Global Transaction Identifiers. A GTID is a server-uuid + transaction-number pair that is unique across the cluster. The consumer can resume from a GTID set instead of a binlog file + offset, which means it survives source failovers without manual intervention. Without GTID, a failover from primary to replica forces the consumer to figure out the equivalent offset in the new primary's binlog — historically the leading cause of "we thought we had CDC but lost 6 hours of events during the failover" outages.
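Taken together, a representative my.cnf fragment for a CDC-ready source might look like this. The values are the ones discussed in this section; server_id and retention are illustrative and should be adapted to your environment:

```ini
[mysqld]
server_id                  = 1           # the source's own id
log_bin                    = mysql-bin   # enable the binlog
binlog_format              = ROW         # the only correct choice for CDC
binlog_row_image           = FULL        # full before/after images
gtid_mode                  = ON
enforce_gtid_consistency   = ON
binlog_expire_logs_seconds = 604800      # 7 days of retention
```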
The replication handshake — what's actually on the wire
A CDC consumer is, from the server's perspective, a replica. The handshake is the same as a real MySQL replica's:
- TCP connect to port 3306. Standard.
- Authentication handshake — same as any client. The consumer's user account needs REPLICATION SLAVE and REPLICATION CLIENT privileges. (MySQL 8.0.22+ also accepts REPLICATION_SLAVE_ADMIN for finer-grained control.)
- Server identification — the consumer sends COM_REGISTER_SLAVE declaring its server-id, which must be unique in the cluster. Two consumers sharing a server-id will collide and one will be disconnected.
- Dump request — the consumer sends COM_BINLOG_DUMP_GTID (modern) or COM_BINLOG_DUMP (legacy, by filename + offset). The packet carries the position the consumer wants to start from.
- Streaming — the server begins writing events to the socket. There is no acknowledgement protocol from the consumer back to the server for events; the TCP connection alone keeps the stream alive. If the consumer disconnects, the server stops sending; on reconnection, the consumer resumes from the next event after its last-processed GTID.
The asymmetry with Postgres is worth naming: Postgres requires the consumer to send send_feedback(flush_lsn=...) to advance the slot. MySQL has no such mechanism. The consumer's progress is its own state; if the consumer crashes mid-event, on restart it must look up its last durably-persisted GTID and ask the server to resume there. Why this matters for at-least-once delivery: between the consumer reading a row event and durably writing it to its sink, a crash means the consumer must replay from before the crash. Replays are inevitable. Every CDC consumer needs a sink that is idempotent on (server_uuid, gtid) or on (binlog_file, position) — the next chapter on dedup hashing in CDC pipelines covers the patterns.
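The idempotency discipline this paragraph demands can be sketched as a toy in-memory sink. The names (IdempotentSink, apply) are hypothetical; a production sink would record the applied GTID in the same transaction as the row write, so the key and the data commit or roll back together:

```python
class IdempotentSink:
    """Drops replayed events instead of applying them twice."""

    def __init__(self):
        self.applied = set()  # (server_uuid, txn_no) pairs already written
        self.rows = []        # stand-in for the real downstream store

    def apply(self, gtid: str, row: dict) -> bool:
        # A GTID is "server-uuid:txn-no"; the pair is the dedup key.
        server_uuid, txn_no = gtid.rsplit(":", 1)
        key = (server_uuid, int(txn_no))
        if key in self.applied:
            return False  # replay after a crash, already durable
        self.rows.append(row)
        self.applied.add(key)  # in production: same txn as the write above
        return True
```

With this shape, a crash-and-replay simply re-offers events whose keys are already present, and the sink converges to exactly-once effects on top of at-least-once delivery.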
Build it: a binlog consumer in Python
The library python-mysql-replication (pip install mysql-replication) speaks the wire protocol. The example below opens a binlog stream and prints events as they arrive — exactly what a one-host CDC pipeline at a Bengaluru SaaS startup would have run before they paid for Debezium.
import json
from datetime import datetime
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)
from pymysqlreplication.event import QueryEvent, GtidEvent, XidEvent
# Razorpay-shaped local dev: replication user has GRANTed
# REPLICATION SLAVE, REPLICATION CLIENT on *.* and SELECT on razorpay.*
SOURCE = {"host": "127.0.0.1", "port": 3306,
"user": "cdc_reader", "passwd": "dev-only"}
# server_id must be unique across the entire MySQL cluster, including
# real replicas. Picking 9001 to stay out of the way.
stream = BinLogStreamReader(
connection_settings=SOURCE,
server_id=9001,
only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
QueryEvent, GtidEvent, XidEvent],
only_schemas=["razorpay"],
only_tables=["orders", "payments"],
blocking=True, # wait for new events; do not exit at EOF
resume_stream=True, # honour saved GTID position on restart
auto_position=None, # set to a GTID set string to seek
fail_on_table_metadata_unavailable=True, # safer than silent skip
)
last_gtid = None
events_seen = 0
try:
for ev in stream:
if isinstance(ev, GtidEvent):
last_gtid = ev.gtid
continue
if isinstance(ev, XidEvent):
# Commit boundary — checkpoint here in production.
print(f" COMMIT xid={ev.xid:>6} last_gtid={last_gtid}")
continue
if isinstance(ev, QueryEvent):
sql = ev.query.strip()
if sql.upper() not in ("BEGIN", "COMMIT"):
print(f" DDL {sql[:80]!r}")
continue
# Row events — emit one logical event per row in the image.
for row in ev.rows:
if isinstance(ev, WriteRowsEvent):
action, image = "INSERT", row["values"]
elif isinstance(ev, DeleteRowsEvent):
action, image = "DELETE", row["values"]
else:
action = "UPDATE"
image = {"before": row["before_values"],
"after": row["after_values"]}
print(f" {action:<6} {ev.schema}.{ev.table} "
f"@{ev.packet.log_pos} {json.dumps(image, default=str)[:120]}")
events_seen += 1
finally:
stream.close()
print(f"stream closed; processed {events_seen} row events; last_gtid={last_gtid}")
A representative run, after a separate session does INSERT, UPDATE, DELETE against razorpay.orders:
INSERT razorpay.orders @98421 {"id": 98421, "merchant_id": "razorpay-test", "amount": 4999, "status": "created"}
COMMIT xid= 2841 last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:18
UPDATE razorpay.orders @98598 {"before": {"id": 98421, "status": "created"}, "after": {"id": 98421, "status": "paid"}}
COMMIT xid= 2842 last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:19
DELETE razorpay.orders @98775 {"id": 98421, "merchant_id": "razorpay-test", "amount": 4999, "status": "paid"}
COMMIT xid= 2843 last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:20
^C
stream closed; processed 3 row events; last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:20
A walkthrough of the lines that matter:
- server_id=9001 — must be unique across the cluster. Picking a number that collides with a real replica or another consumer causes one of them to be disconnected unpredictably. Production teams reserve a server-id range (9000–9099 for CDC consumers, say) the same way they reserve port ranges.
- only_schemas / only_tables — filtering happens in the library, before the expensive row-image decoding. The server still streams every event; without these filters the consumer fully parses every row event in the binlog, including tables it does not care about, which on a multi-tenant database wastes CPU on the consumer host.
- auto_position and GTID resumption — the right pattern in production is to durably persist last_gtid to your sink (or a sidecar Postgres / Redis), and on restart pass auto_position=<saved_gtid_set>. The server then streams from the next transaction, regardless of which physical binlog file that lives in. Without GTID, a binlog rotation or a primary failover loses your position.
- Buffering until XidEvent — the example prints row events as they arrive; a correct sink batches them per transaction and only flushes downstream after the XID. Otherwise a consumer crash mid-transaction leaks half-applied events into the sink.
- fail_on_table_metadata_unavailable=True — if the consumer encounters a row event with a table id whose TABLE_MAP it has never seen, the safe behaviour is to fail loudly. The default is to silently skip the event, which is the worst CDC failure mode: the source thinks the row was sent, the sink never records it, no alert fires. Always set this.
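The per-transaction buffering described in the walkthrough can be sketched as a small wrapper around the event loop. The names (TransactionBuffer, on_gtid, on_row, on_xid) are hypothetical, not part of python-mysql-replication; the point is that nothing reaches the sink until the commit marker arrives:

```python
class TransactionBuffer:
    """Holds row events until the XID (commit) event is seen."""

    def __init__(self, flush):
        self.flush = flush        # callable(gtid, rows) -> None
        self.pending = []
        self.current_gtid = None

    def on_gtid(self, gtid):
        # GtidEvent precedes the transaction's row events.
        self.current_gtid = gtid

    def on_row(self, event):
        # Buffer only; the transaction is not yet known to be committed.
        self.pending.append(event)

    def on_xid(self, xid):
        # Commit boundary: flush the whole transaction at once, then
        # checkpoint current_gtid durably (not shown here).
        self.flush(self.current_gtid, self.pending)
        self.pending = []
```

In the main loop from the listing above, the three isinstance branches would call on_gtid, on_row, and on_xid respectively, and the checkpoint write moves inside the flush callable.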
The code is ~50 lines. Debezium MySQL is ~400,000 lines. The difference is everything around the parsing — schema history (so the consumer can reconstruct the table layout at any point in the past, even after later DDL), reconnection / retry, Kafka transactional sinks, snapshot bootstrap. The wire protocol itself is small.
Operating MySQL CDC at scale — what breaks
The textbook issues are well-documented. The three that actually page on-call at Indian shops are:
1. Schema change (DDL) without coordination. A backend engineer at Cred runs ALTER TABLE rewards ADD COLUMN voucher_id BIGINT at 3 a.m. The CDC consumer parses the QUERY_EVENT carrying the SQL string and either understands the change (Debezium with schema history enabled) or silently produces wrong-shaped events for every subsequent row write. Mitigation: enable schema history, and have a contract that DDL goes through a process that notifies the CDC team. Several Bengaluru shops keep a migrations Slack channel that is mirrored to a dashboard the data platform team watches.
2. Long-running transactions as a latency floor. MySQL buffers a transaction's events in the binlog cache and writes them to the binlog only at commit. A 30-minute transaction therefore contributes nothing to the stream for 30 minutes, then arrives as one large burst that the consumer must buffer and apply atomically — a change made in its first minute reaches the sink half an hour late. Why this asymmetry with Postgres matters: Postgres logical decoding has the same commit-order behaviour, but Postgres 14+ added streaming = on to emit in-progress transactions. MySQL has no equivalent. Long transactions on a CDC source are a hard latency floor. The fix is application-side: keep transactions short on tables the pipeline captures.
3. Binlog purge running ahead of the consumer. MySQL deletes binlog files based on expire_logs_days (5.7) or binlog_expire_logs_seconds (8.0+), or when an operator runs PURGE BINARY LOGS by hand. If the consumer falls behind by more than the retention window, the binlog files it needs are gone, and the consumer must rebootstrap from a fresh snapshot of every table — a 10 crore-row snapshot on a payments database can take 6 hours. The standard production setting is binlog_expire_logs_seconds = 604800 (7 days), and an alert that fires when consumer lag exceeds 24 hours, well before the 7-day cliff.
A monitoring query for replica lag against the source's current position:
-- Run on the source. Compare against the consumer's last-processed GTID.
SELECT @@global.gtid_executed AS source_gtid_set,
@@global.binlog_expire_logs_seconds / 3600.0 AS retention_hours;
-- Approximate disk pressure: per-file binlog sizes, summed client-side.
-- (information_schema has no binlog table; SHOW BINARY LOGS is the interface.
-- Columns: Log_name, File_size in bytes, and Encrypted on 8.0+.)
SHOW BINARY LOGS;
A consumer one binlog file behind is fine. A consumer 100 MB of binlog behind needs investigation. A consumer 1 GB behind on a 10 GB-per-day database is an active incident.
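The alerting rule above needs a lag number. As a rough check, one can compare the top of the source's executed range with the consumer's last-applied transaction number. This is a naive single-uuid sketch with a hypothetical function name; real gtid_executed sets can contain several server uuids and gaps, so production code should use a proper GTID-set parser:

```python
def gtid_lag(source_gtid_executed: str, consumer_gtid: str) -> int:
    """Transactions the consumer has yet to apply, naive single-uuid case."""
    # Source side, e.g. "3E11FA47-...:1-20" -> highest committed txn = 20.
    uuid, rng = source_gtid_executed.split(":", 1)
    highest = int(rng.split("-")[-1])
    # Consumer side, e.g. "3E11FA47-...:18" -> last applied txn = 18.
    c_uuid, c_txn = consumer_gtid.rsplit(":", 1)
    if uuid.lower() != c_uuid.lower():
        raise ValueError("lag across different server uuids needs GTID-set math")
    return highest - int(c_txn)
```

Feed this the `@@global.gtid_executed` value from the monitoring query above and the consumer's checkpointed GTID, and alert when the gap grows monotonically.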
Common confusions
- "MySQL binlog and Postgres WAL are the same thing." They serve overlapping purposes (durability, replication, CDC) but the formats differ. Postgres WAL is physical (page offsets); the logical decoding plugin makes it row-shaped on the way out. MySQL binlog is already row-shaped on disk when binlog_format = ROW. The protocols differ correspondingly: Postgres uses replication slots with consumer-acknowledged LSNs; MySQL streams freely with no per-event ack.
- "binlog_format = MIXED is the safe default." ROW has been the server default since MySQL 5.7.7, but MIXED survives in plenty of older configs, and it is wrong for CDC. MIXED writes some events as STATEMENT (SQL strings) and the consumer cannot tell deterministically whether the resulting row image is correct. Production CDC requires binlog_format = ROW.
- "A CDC consumer is the same as a MySQL replica." It uses the same protocol but does different things with the events. A real replica re-executes the events to maintain its own copy. A CDC consumer parses them and writes to a different sink (Kafka, Iceberg, Snowflake). The server cannot tell the difference, which is why the consumer needs a unique server_id.
- "GTID and binlog file+offset are interchangeable." GTID is a logical identity; file+offset is a physical position. After a primary failover, the same GTID exists on the new primary at a different file+offset. Consumers using file+offset must resync manually after failover; GTID consumers do not.
- "I can run two CDC consumers with the same server_id." No. The first one connected will be disconnected when the second registers. Each consumer needs its own server_id; production teams maintain a server-id allocation registry the same way they maintain port and DNS allocation.
- "binlog_row_image = MINIMAL saves disk and is harmless." It saves ~30% of binlog size on update-heavy workloads. It also breaks any consumer that needs the values of unchanged columns — which is most of them, because downstream sinks usually want full row images for joins. Default to FULL.
Going deeper
How the binlog file format actually frames events
Every event on disk is a [19-byte header][variable-length body][optional 4-byte checksum]. The header carries: 4-byte timestamp (seconds since epoch), 1-byte event type, 4-byte server-id, 4-byte total event length, 4-byte next-event-position, 2-byte flags. The body varies by type. The checksum (CRC32, since 5.6) is enabled by binlog_checksum = CRC32 and detects corruption from disk-level errors — rare but high-impact when it happens. Disabling it is one of those "do not do this" foot-guns; the saved bytes are not worth the silent corruption risk.
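The fixed 19-byte header lends itself to a direct struct parse. A minimal sketch, assuming a v4-format buffer with the fields in the order listed above, all little-endian (parse_header is a hypothetical name, not a library function):

```python
import struct

# timestamp(4) type(1) server_id(4) event_length(4) next_position(2+2=4) flags(2)
HEADER = struct.Struct("<IBIIIH")  # 4+1+4+4+4+2 = 19 bytes
assert HEADER.size == 19

def parse_header(buf: bytes) -> dict:
    """Decode the common v4 binlog event header from the first 19 bytes."""
    ts, etype, server_id, length, next_pos, flags = HEADER.unpack(buf[:19])
    return {
        "timestamp": ts,          # seconds since epoch
        "type_code": etype,       # e.g. 19 = TABLE_MAP_EVENT in v4
        "server_id": server_id,   # which server wrote the event
        "event_length": length,   # header + body (+ checksum if enabled)
        "next_position": next_pos,  # offset of the next event in the file
        "flags": flags,
    }
```

The event_length field is what lets a reader skip events it does not understand, and next_position is what SHOW BINLOG EVENTS reports as End_log_pos.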
TABLE_MAP and the column-type encoding
TABLE_MAP carries column types as a byte string in MySQL's internal type-id format (MYSQL_TYPE_LONG = 3, MYSQL_TYPE_VARCHAR = 15, etc.). For variable-length types, additional metadata (max length for VARCHAR, precision/scale for DECIMAL, character set) is packed after the type bytes. The row events then encode values using these types — e.g. a VARCHAR is length-prefixed, an INT is little-endian fixed-width, a DATETIME is a packed bit format. A correct parser must mirror MySQL's serialisation logic exactly, which is why writing a binlog parser from scratch is a 6-month project; using python-mysql-replication or Debezium is what every production team does.
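To make the encoding concrete, here are toy decoders for two of the simplest cases mentioned above: a 4-byte little-endian signed INT and a length-prefixed VARCHAR whose prefix width depends on the declared maximum length. These are illustrative sketches under stated assumptions, not MySQL's full rules (NULL bitmaps, character sets, and most other types are omitted):

```python
def decode_long(buf: bytes):
    """MYSQL_TYPE_LONG: 4-byte little-endian signed int. Returns (value, rest)."""
    return int.from_bytes(buf[:4], "little", signed=True), buf[4:]

def decode_varchar(buf: bytes, max_len: int):
    """MYSQL_TYPE_VARCHAR: length-prefixed bytes. Returns (value, rest).

    The TABLE_MAP metadata carries max_len; when it fits in one byte the
    on-disk length prefix is 1 byte, otherwise 2 bytes (little-endian).
    """
    if max_len < 256:
        n = buf[0]
        return buf[1:1 + n].decode("utf-8"), buf[1 + n:]
    n = int.from_bytes(buf[:2], "little")
    return buf[2:2 + n].decode("utf-8"), buf[2 + n:]
```

Each decoder returns the remaining buffer so a row parser can chain them column by column, driven by the type bytes from the most recent TABLE_MAP.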
GTID vs file+position — the failover argument in detail
Razorpay published an architecture note in 2023 (referenced below) describing a primary-replica failover where the file+position-based CDC consumer pointed at a binlog file that didn't exist on the new primary. The consumer treated this as an invalid position and rebootstrapped from a snapshot — which took 4 hours, during which 4 hours of payment events accumulated unprocessed in Kafka. After the incident they migrated to GTID auto-position. The lesson is generic: any production MySQL ≥ 5.6.5 with replicas should run gtid_mode = ON, and any CDC consumer against it should use auto_position. The cost is one config change; the avoided cost is one outage.
Multi-source replication and CDC at SaaS scale
A common pattern at Bengaluru SaaS shops with one MySQL per tenant: the analytics warehouse needs CDC from 200 tenants. Running 200 separate consumers is operationally expensive; multi-source replication (MySQL 5.7.6+) lets one MySQL replica subscribe to many sources. The CDC consumer then reads from the aggregating replica's binlog instead of from each tenant. The trade-off is that the aggregator becomes a single point of failure and a per-tenant lag amplifier — a slow tenant slows everyone. Most teams settle on per-shard CDC with an automation layer rather than per-tenant; the right granularity is the same one your application sharding chose.
What changes between MySQL 5.7, 8.0, and 8.4
5.7 introduced binlog_row_image and made GTID production-ready. 8.0 added invisible columns (must be handled in TABLE_MAP parsing), JSON column improvements, and binlog_transaction_compression (8.0.20+) which can reduce binlog size 70% on text-heavy workloads. 8.4 (LTS, 2024) made GTID the default and removed several legacy replication options. For new CDC pipelines in 2026, target 8.0.34 or 8.4.0 minimum; avoid 5.7 because its end-of-life passed in October 2023.
Where this leads next
The next chapter, /wiki/snapshot-cdc-the-bootstrapping-problem, pairs binlog streaming with a consistent snapshot of every table at the binlog start position, so the consumer can join an existing source rather than starting from "everything that committed today". After that, /wiki/debezium-as-a-cdc-platform is the production wrapper around both Postgres and MySQL CDC primitives.
Two crosslinks worth noting now:
- /wiki/postgres-logical-decoding-from-scratch — the Postgres counterpart. Same problem, different protocol.
- /wiki/why-polling-breaks-past-a-certain-scale — the load curve that motivates moving to binlog CDC instead of
SELECT * FROM orders WHERE updated_at > ?.
By the end of Build 11, MySQL and Postgres are both first-class CDC sources. Whichever your application database is, the pipeline pattern is the same: identify as a replica, stream events, deduplicate downstream, persist position before sink commit.
References
- MySQL 8.0 reference manual: The Binary Log — the canonical reference for binlog formats, settings, and the replication protocol.
- MySQL Internals Manual: Replication Protocol — packet-level documentation of COM_BINLOG_DUMP_GTID and event headers.
- python-mysql-replication (julien-duponchelle/python-mysql-replication) — the BinLogStreamReader library used in this chapter; reading its source is the fastest way to learn the wire format.
- Debezium MySQL connector documentation — what wraps the wire protocol in production: schema history, snapshot mode, Kafka semantics.
- Razorpay engineering: lessons from running CDC at payments scale (2023) — the Indian-context reference for binlog operation, GTID failover, and consumer lag alerting.
- Alibaba Canal architecture (alibaba/canal) — an alternate MySQL CDC implementation; reading its design docs alongside Debezium's clarifies which choices are essential and which are taste.
- Maxwell's Daemon (zendesk/maxwell) — a single-node MySQL → Kafka CDC tool; its codebase is small enough to read end-to-end in an afternoon.
- /wiki/postgres-logical-decoding-from-scratch — the Postgres counterpart to this chapter; pair them when comparing CDC sources.