Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

MySQL binlog format and replication protocol

The previous chapter built a Postgres logical-decoding consumer in 30 lines. MySQL has the same primitive but a different shape: instead of an in-process plugin emitting decoded events, MySQL writes a binary file called the binlog and asks consumers to pretend they are a replica. Every CDC tool you have heard of — Debezium MySQL, AWS DMS, Maxwell, Canal at SilkBazaar — is doing the same thing: opening a TCP connection, sending COM_BINLOG_DUMP_GTID, and parsing the bytes that come back. This chapter writes that connection by hand.

The MySQL binlog is an append-only file of typed events — WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT, plus TABLE_MAP_EVENT and QUERY_EVENT. A CDC consumer connects on the standard MySQL port, identifies as a replica, and the server streams events from a position (filename + offset, or a GTID set). The choice of binlog_format — STATEMENT, ROW, or MIXED — decides whether the events are SQL strings or before/after images. ROW is the only format that supports CDC correctly.

What the binlog actually is

MySQL has had a binlog since 4.1 (2003). It exists for two original purposes — point-in-time recovery and replication to a replica — and one accidental purpose that the rest of the industry built a $200 crore market on top of: change data capture. The mechanism is the same in all three cases. Every transaction that commits writes a sequence of events to a file in datadir, named mysql-bin.000001, mysql-bin.000002, and so on. Each event is a typed binary record with a header (timestamp, type, server id, length) and a body whose layout depends on the type.

The events you care about for CDC are five:

  • FORMAT_DESCRIPTION_EVENT — the first event in every binlog file; describes the format version and the layout of subsequent events. A consumer must read this to know how to parse anything else.
  • TABLE_MAP_EVENT — emitted before every batch of row events; carries the table id, schema name, table name, and the column types. Essential, because the row events themselves only carry column values, not names — the consumer must remember the most recent TABLE_MAP for that table id.
  • WRITE_ROWS_EVENT / UPDATE_ROWS_EVENT / DELETE_ROWS_EVENT — the actual change events. WRITE carries one image (the inserted row); UPDATE carries two (before and after); DELETE carries one (the deleted row).
  • QUERY_EVENT — DDL goes here as a literal SQL string. ALTER TABLE orders ADD COLUMN refund_id BIGINT arrives as text; the consumer is responsible for understanding what changed about the schema.
  • XID_EVENT — the commit marker. A row event without a subsequent XID_EVENT is part of an uncommitted transaction; honest consumers buffer until the XID arrives.
Anatomy of a binlog transaction — the events on the wireA vertical event stream showing the sequence of binlog events for one UPDATE transaction: GTID, QUERY (BEGIN), TABLE_MAP, UPDATE_ROWS, XID (commit). Each event has a header showing timestamp, type, and length. One UPDATE transaction on the wire GTID_EVENT server_uuid:42 → assigns transaction GTID 65 bytes QUERY_EVENT "BEGIN" — opens transaction at LSN-equivalent 52 bytes TABLE_MAP_EVENT tbl_id=119, schema="razorpay", table="orders", cols=[BIGINT, VARCHAR, INT, ENUM] 88 bytes UPDATE_ROWS_EVENT tbl_id=119, before={id=98421, status="created", amt=4999} after ={id=98421, status="paid", amt=4999} 142 bytes XID_EVENT xid=2841 — commit; everything before this is now durable 31 bytes Total wire bytes: 378. The "row" you actually changed is 142 of them. A consumer sees these in order, parses each in turn, emits one logical UPDATE event after XID arrives.
Five events for one row change. The TABLE_MAP must be parsed before the UPDATE_ROWS — the row event references the table by id, not by name, and the column type metadata only lives in TABLE_MAP.

The crucial structural point: row events do not carry column names. They carry a table id and a bitmap of columns present in this image, plus the values, encoded according to the types declared in the most recent TABLE_MAP for that id. Why this matters for CDC correctness: if the consumer drops a TABLE_MAP for any reason — restart, network blip, a bug — the next row event arrives uninterpretable, because the type information has been lost. Consumers must replay binlog from the start of the transaction's containing event group, never from "the next row event". Debezium, Maxwell, and Canal all enforce this.

STATEMENT, ROW, MIXED — the format choice that decides everything

binlog_format is a server-wide setting. Three values:

  • STATEMENT — writes the SQL string. UPDATE orders SET status='paid' WHERE id=98421 arrives as text. Compact. Replicas re-execute the SQL. Useless for CDC because non-deterministic functions (NOW(), RAND(), UUID()), LIMIT without ORDER BY, and triggers all yield different results on the consumer than they did on the source.
  • ROW — writes the before and after images of every row affected. UPDATE orders SET status='paid' WHERE merchant_id=88 AND amount > 5000 could touch 10,000 rows, and ROW writes 10,000 events of ~140 bytes each. Verbose. Deterministic. The only correct choice for CDC.
  • MIXED — server picks STATEMENT for "safe" queries and ROW for unsafe ones. Sounds sensible; in practice means the consumer cannot tell which mode any given event was written in without per-event metadata, and any STATEMENT event in the stream is a CDC bug waiting to happen.

In production, every shop running CDC at scale runs binlog_format = ROW. PaisaBridge, BharatBazaar, KhelKing, NaatakBook — all of them. The cost is real: ROW format binlogs are 5–20× larger than STATEMENT for the same workload. A high-write database that fit in 100 MB of binlog per hour under STATEMENT will produce 1–2 GB of binlog per hour under ROW. Why this trade-off is forced: the consumer needs deterministic, side-effect-free events. STATEMENT is fundamentally non-deterministic on the consumer because the consumer is not re-executing the original query plan with the original data. ROW pays size for correctness, and the size cost is paid by the source's disk, not the consumer's.

Two related settings the production checklist insists on:

  • binlog_row_image = FULL — record every column in both before and after images, not just the changed ones. The default in MySQL 5.7+ is FULL; if your team has set MINIMAL to save space, the consumer cannot reconstruct the unchanged columns and downstream joins break.
  • gtid_mode = ON plus enforce_gtid_consistency = ON — enables Global Transaction Identifiers. A GTID is a server-uuid + transaction-number pair that is unique across the cluster. The consumer can resume from a GTID set instead of a binlog file + offset, which means it survives source failovers without manual intervention. Without GTID, a failover from primary to replica forces the consumer to figure out the equivalent offset in the new primary's binlog — historically the leading cause of "we thought we had CDC but lost 6 hours of events during the failover" outages.

The replication handshake — what's actually on the wire

A CDC consumer is, from the server's perspective, a replica. The handshake is the same as a real MySQL replica's:

  1. TCP connect to port 3306. Standard.
  2. Authentication handshake — same as any client. The consumer's user account needs REPLICATION SLAVE and REPLICATION CLIENT privileges. (MySQL 8.0.22+ also accepts REPLICATION_SLAVE_ADMIN for finer-grained control.)
  3. Server identification — the consumer sends COM_REGISTER_SLAVE declaring its server-id, which must be unique in the cluster. Two consumers sharing a server-id will collide and one will be disconnected.
  4. Dump request — the consumer sends COM_BINLOG_DUMP_GTID (modern) or COM_BINLOG_DUMP (legacy, by filename + offset). The packet carries the position the consumer wants to start from.
  5. Streaming — the server begins writing events to the socket. There is no acknowledgement protocol from the consumer back to the server for events; the TCP connection alone keeps the stream alive. If the consumer disconnects, the server stops sending; on reconnection, the consumer resumes from the next event after its last-processed GTID.

The asymmetry with Postgres is worth naming: Postgres requires the consumer to send send_feedback(flush_lsn=...) to advance the slot. MySQL has no such mechanism. The consumer's progress is its own state; if the consumer crashes mid-event, on restart it must look up its last durably-persisted GTID and ask the server to resume there. Why this matters for at-least-once delivery: between the consumer reading a row event and durably writing it to its sink, a crash means the consumer must replay from before the crash. Replays are inevitable. Every CDC consumer needs a sink that is idempotent on (server_uuid, gtid) or on (binlog_file, position) — the next chapter on dedup hashing in CDC pipelines covers the patterns.

Replica handshake — the four packets every CDC tool sendsA sequence diagram with two columns (consumer and MySQL server) and arrows showing the handshake packets: TCP connect, auth, COM_REGISTER_SLAVE, COM_BINLOG_DUMP_GTID, then a continuous event stream from server to consumer. CDC consumer pretends to be a MySQL replica Consumer (Debezium) MySQL server TCP connect to :3306 Auth (user with REPLICATION SLAVE) COM_REGISTER_SLAVE (server_id=42) COM_BINLOG_DUMP_GTID (gtid_set=...) Continuous binlog event stream → no per-event ack — TCP keeps the stream alive; consumer tracks GTID itself
The four packets above are exactly what a real MySQL replica sends. A CDC consumer differs only in what it does with the bytes after they arrive.

Build it: a binlog consumer in Python

The library python-mysql-replication (pip install mysql-replication) speaks the wire protocol. The example below opens a binlog stream and prints events as they arrive — exactly what a one-host CDC pipeline at a Bengaluru SaaS startup would have run before they paid for Debezium.

import json
from datetime import datetime
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)
from pymysqlreplication.event import QueryEvent, GtidEvent, XidEvent

# PaisaBridge-shaped local dev: replication user has GRANTed
#   REPLICATION SLAVE, REPLICATION CLIENT on *.* and SELECT on razorpay.*
SOURCE = {"host": "127.0.0.1", "port": 3306,
          "user": "cdc_reader", "passwd": "dev-only"}

# server_id must be unique across the entire MySQL cluster, including
# real replicas. Picking 9001 to stay out of the way.
stream = BinLogStreamReader(
    connection_settings=SOURCE,
    server_id=9001,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
                 QueryEvent, GtidEvent, XidEvent],
    only_schemas=["razorpay"],
    only_tables=["orders", "payments"],
    blocking=True,           # wait for new events; do not exit at EOF
    resume_stream=True,      # honour saved GTID position on restart
    auto_position=None,      # set to a GTID set string to seek
    fail_on_table_metadata_unavailable=True,  # safer than silent skip
)

last_gtid = None
events_seen = 0
try:
    for ev in stream:
        if isinstance(ev, GtidEvent):
            last_gtid = ev.gtid
            continue
        if isinstance(ev, XidEvent):
            # Commit boundary — checkpoint here in production.
            print(f"  COMMIT  xid={ev.xid:>6}  last_gtid={last_gtid}")
            continue
        if isinstance(ev, QueryEvent):
            sql = ev.query.strip()
            if sql.upper() not in ("BEGIN", "COMMIT"):
                print(f"  DDL     {sql[:80]!r}")
            continue
        # Row events — emit one logical event per row in the image.
        for row in ev.rows:
            if isinstance(ev, WriteRowsEvent):
                action, image = "INSERT", row["values"]
            elif isinstance(ev, DeleteRowsEvent):
                action, image = "DELETE", row["values"]
            else:
                action = "UPDATE"
                image = {"before": row["before_values"],
                         "after":  row["after_values"]}
            print(f"  {action:<6}  {ev.schema}.{ev.table}  "
                  f"@{ev.packet.log_pos}  {json.dumps(image, default=str)[:120]}")
            events_seen += 1
finally:
    stream.close()
    print(f"stream closed; processed {events_seen} row events; last_gtid={last_gtid}")

A representative run, after a separate session does INSERT, UPDATE, DELETE against razorpay.orders:

  INSERT  razorpay.orders  @98421  {"id": 98421, "merchant_id": "razorpay-test", "amount": 4999, "status": "created"}
  COMMIT  xid=  2841  last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:18
  UPDATE  razorpay.orders  @98598  {"before": {"id": 98421, "status": "created"}, "after": {"id": 98421, "status": "paid"}}
  COMMIT  xid=  2842  last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:19
  DELETE  razorpay.orders  @98775  {"id": 98421, "merchant_id": "razorpay-test", "amount": 4999, "status": "paid"}
  COMMIT  xid=  2843  last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:20
^C
stream closed; processed 3 row events; last_gtid=3E11FA47-71CA-11E1-9E33-C80AA9429562:20

A walkthrough of the lines that matter:

  • server_id=9001 — must be unique across the cluster. Picking a number that collides with a real replica or another consumer causes one of them to be disconnected unpredictably. Production teams reserve a server-id range (9000–9099 for CDC consumers, say) the same way they reserve port ranges.
  • only_schemas / only_tables — server-side filtering happens before parsing. Without these, the consumer parses every row event in the binlog including tables it does not care about, which on a multi-tenant database wastes CPU on the consumer host. The library can also push down predicates to the server in MySQL 8.0+ via binlog_row_event_max_size.
  • auto_position and GTID resumption — the right pattern in production is to durably persist last_gtid to your sink (or a sidecar Postgres / Redis), and on restart pass auto_position=<saved_gtid_set>. The server then streams from the next transaction, regardless of which physical binlog file that lives in. Without GTID, a binlog rotation or a primary failover loses your position.
  • Buffering until XidEvent — the example prints row events as they arrive; a correct sink batches them per transaction and only flushes downstream after the XID. Otherwise a transaction that aborts (rare in row-format binlog, but possible during recovery) leaks half-applied events.
  • fail_on_table_metadata_unavailable=True — if the consumer encounters a row event with a table id whose TABLE_MAP it has never seen, the safe behaviour is to fail loudly. The default is to silently skip the event, which is the worst CDC failure mode: the source thinks the row was sent, the sink never records it, no alert fires. Always set this.

The code is ~50 lines. Debezium MySQL is ~400,000 lines. The difference is everything around the parsing — schema history (so the consumer can reconstruct the table layout at any point in the past, even after later DDL), reconnection / retry, Kafka transactional sinks, snapshot bootstrap. The wire protocol itself is small.

Operating MySQL CDC at scale — what breaks

The textbook issues are well-documented. The three that actually page on-call at Indian shops are:

1. Schema change (DDL) without coordination. A backend engineer at KreditClub runs ALTER TABLE rewards ADD COLUMN voucher_id BIGINT at 3 a.m. The CDC consumer parses the QUERY_EVENT carrying the SQL string and either understands the change (Debezium with schema history enabled) or silently produces wrong-shaped events for every subsequent row write. Mitigation: enable schema history, and have a contract that DDL goes through a process that notifies the CDC team. Several Bengaluru shops keep a migrations Slack channel that is mirrored to a dashboard the data platform team watches.

2. Long-running transactions blocking checkpoint. A 30-minute transaction holds the binlog flush forward. Until it commits, no XID arrives, and a correct consumer cannot checkpoint past it — meaning if the consumer crashes during that 30 minutes, on restart it replays the entire transaction. Why this asymmetry with Postgres matters: Postgres logical decoding has the same problem (commit-order emit), but Postgres 14+ added streaming = on to emit in-progress transactions. MySQL has no equivalent. Long transactions on a CDC source are a hard latency floor. The fix is application-side: keep transactions short on tables that are in the publication.

3. purge_binary_logs running ahead of the consumer. MySQL deletes binlog files based on expire_logs_days or binlog_expire_logs_seconds (8.0+). If the consumer falls behind by more than the retention window, the binlog files it needs are gone, and the consumer must rebootstrap from a fresh snapshot of every table — a 10 crore-row snapshot on a payments database can take 6 hours. The standard production setting is binlog_expire_logs_seconds = 604800 (7 days), and an alert that fires when consumer lag exceeds 24 hours, well before the 7-day cliff.

A monitoring query for replica lag against the source's current position:

-- Run on the source. Compare against the consumer's last-processed GTID.
SELECT @@global.gtid_executed AS source_gtid_set,
       @@global.binlog_expire_logs_seconds / 3600.0 AS retention_hours;

-- Approximate disk pressure: total binlog size vs free disk.
SELECT SUM(file_size) / 1024 / 1024 AS binlog_mb
FROM information_schema.files
WHERE file_type = 'BINLOG';

A consumer one binlog file behind is fine. A consumer 100 MB of binlog behind needs investigation. A consumer 1 GB behind on a 10 GB-per-day database is an active incident.

Common confusions

  • "MySQL binlog and Postgres WAL are the same thing." They serve overlapping purposes (durability, replication, CDC) but the formats differ. Postgres WAL is physical (page offsets); the logical decoding plugin makes it row-shaped on the way out. MySQL binlog is already row-shaped on disk when binlog_format = ROW. The protocols differ correspondingly: Postgres uses replication slots with consumer-acknowledged LSNs; MySQL streams freely with no per-event ack.
  • "binlog_format = MIXED is the safe default." It is the default on many MySQL installs but it is wrong for CDC. MIXED writes some events as STATEMENT (SQL strings) and the consumer cannot tell deterministically whether the resulting row image is correct. Production CDC requires binlog_format = ROW.
  • "A CDC consumer is the same as a MySQL replica." It uses the same protocol but does different things with the events. A real replica re-executes the events to maintain its own copy. A CDC consumer parses them and writes to a different sink (Kafka, Iceberg, Snowflake). The server cannot tell the difference, which is why the consumer needs a unique server_id.
  • "GTID and binlog file+offset are interchangeable." GTID is a logical identity; file+offset is a physical position. After a primary failover, the same GTID exists on the new primary at a different file+offset. Consumers using file+offset must resync manually after failover; GTID consumers do not.
  • "I can run two CDC consumers with the same server_id." No. The first one connected will be disconnected when the second registers. Each consumer needs its own server_id; production teams maintain a server-id allocation registry the same way they maintain port and DNS allocation.
  • "row_image = MINIMAL saves disk and is harmless." It saves ~30% of binlog size on update-heavy workloads. It also breaks any consumer that needs to know the values of unchanged columns — which is most of them, because downstream sinks usually want full row images for joins. Default to FULL.

Going deeper

How the binlog file format actually frames events

Every event on disk is a [19-byte header][variable-length body][optional 4-byte checksum]. The header carries: 4-byte timestamp (seconds since epoch), 1-byte event type, 4-byte server-id, 4-byte total event length, 4-byte next-event-position, 2-byte flags. The body varies by type. The checksum (CRC32, since 5.6) is enabled by binlog_checksum = CRC32 and detects corruption from disk-level errors — rare but high-impact when it happens. Disabling it is one of those "do not do this" foot-guns; the saved bytes are not worth the silent corruption risk.

TABLE_MAP and the column-type encoding

TABLE_MAP carries column types as a byte string in MySQL's internal type-id format (MYSQL_TYPE_LONG = 3, MYSQL_TYPE_VARCHAR = 15, etc.). For variable-length types, additional metadata (max length for VARCHAR, precision/scale for DECIMAL, character set) is packed after the type bytes. The row events then encode values using these types — e.g. a VARCHAR is length-prefixed, an INT is little-endian fixed-width, a DATETIME is a packed bit format. A correct parser must mirror MySQL's serialisation logic exactly, which is why writing a binlog parser from scratch is a 6-month project; using python-mysql-replication or Debezium is what every production team does.

GTID vs file+position — the failover argument in detail

PaisaBridge published an architecture note in 2023 (referenced below) describing a primary-replica failover where the file+position-based CDC consumer pointed at a binlog file that didn't exist on the new primary. The consumer treated this as an invalid position and rebootstrapped from a snapshot — which took 4 hours, during which 4 hours of payment events accumulated unprocessed in Kafka. After the incident they migrated to GTID auto-position. The lesson is generic: any production MySQL ≥ 5.6.5 with replicas should run gtid_mode = ON, and any CDC consumer against it should use auto_position. The cost is one config change; the avoided cost is one outage.

Multi-source replication and CDC at SaaS scale

A common pattern at Bengaluru SaaS shops with one MySQL per tenant: the analytics warehouse needs CDC from 200 tenants. Running 200 separate consumers is operationally expensive; multi-source replication (MySQL 5.7.6+) lets one MySQL replica subscribe to many sources. The CDC consumer then reads from the aggregating replica's binlog instead of from each tenant. The trade-off is that the aggregator becomes a single point of failure and a per-tenant lag amplifier — a slow tenant slows everyone. Most teams settle on per-shard CDC with an automation layer rather than per-tenant; the right granularity is the same one your application sharding chose.

What changes between MySQL 5.7, 8.0, and 8.4

5.7 introduced binlog_row_image and made GTID production-ready. 8.0 added invisible columns (must be handled in TABLE_MAP parsing), JSON column improvements, and binlog_transaction_compression (8.0.20+) which can reduce binlog size 70% on text-heavy workloads. 8.4 (LTS, 2024) made GTID the default and removed several legacy replication options. For new CDC pipelines in 2026, target 8.0.34 or 8.4.0 minimum; avoid 5.7 because its end-of-life passed in October 2023.

Where this leads next

The next chapter, /wiki/snapshot-cdc-the-bootstrapping-problem, pairs binlog streaming with a consistent snapshot of every table at the binlog start position, so the consumer can join an existing source rather than starting from "everything that committed today". After that, /wiki/debezium-as-a-cdc-platform is the production wrapper around both Postgres and MySQL CDC primitives.

Two crosslinks worth noting now:

By the end of Build 11, MySQL and Postgres are both first-class CDC sources. Whichever your application database is, the pipeline pattern is the same: identify as a replica, stream events, deduplicate downstream, persist position before sink commit.

References