Why polling breaks past a certain scale
The first incremental pipeline you ever wrote was probably a cron entry that ran SELECT * FROM orders WHERE updated_at > :wm every five minutes and pushed the rows somewhere. It worked. It worked the next quarter, when the table was twice as big. It worked the quarter after that. Then the table crossed eight crore rows during a Diwali sale week, the cron interval got cut to 60 seconds because Finance wanted faster reporting, and the on-call engineer at a Mumbai e-commerce startup got paged at 3 a.m.: the source Postgres replica was at 92% CPU, p99 of every API call had quadrupled, and the freshness dashboard still showed a 14-minute lag. The polling pipeline had stopped being free.
Polling looks cheap because at small scale it is. Three forces — table size, freshness target, and the source's other obligations — multiply against each other faster than the pipeline's budget. Past a knee in the curve, every minute you shave off the polling interval costs the source database more than the entire warehouse query budget. CDC exists because that knee is not an edge case; it is where every successful product eventually arrives.
The four costs of one poll
Every polling cycle pays four costs. At small scale three of them are invisible. At scale all four show up in the same incident.
- Source-side query cost. The source database has to plan and execute SELECT ... WHERE updated_at > :wm. Even with an index on updated_at, the planner walks the index, fetches the rows, and pushes them over the wire. That is CPU and I/O on the source, paid out of the same budget that serves application traffic.
- Wire cost. Every row pulled crosses the network. At 8 crore rows times an average 1.4 KB per row, a full-table sweep is 112 GB. An "incremental" pull that catches 4% of the table still moves 4.5 GB per cycle.
- Coordination cost. Each cycle has overhead — connection setup, watermark read, watermark commit, retry logic. At 1 cycle per minute these are negligible. At 1 cycle every 5 seconds for 40 tables, the overhead alone is a thousand round-trips per minute.
- Freshness debt. Anything that committed at the source between the start of cycle N and the start of cycle N+1 is invisible until cycle N+1 reads it. The freshness floor of polling is one polling interval, full stop. Why this is a floor and not a target: even if the cycle itself takes zero time, the data committed at second 30 still has to wait until second 60. Halving the interval halves the floor, but at the price of doubling cost #1, #2, and #3.
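The wire-cost and freshness-debt numbers above can be checked with a few lines of arithmetic. This is a sketch using the chapter's own figures, nothing measured:

```python
# Back-of-envelope for costs #2 and #4 on the running example.
ROWS = 80_000_000        # 8 crore
ROW_BYTES = 1_400        # average row width

full_sweep_gb = ROWS * ROW_BYTES / 1e9     # cost #2: full-table sweep
incremental_gb = full_sweep_gb * 0.04      # cost #2: the 4% "incremental" pull
print(f"full sweep: {full_sweep_gb:.0f} GB, 4% pull: {incremental_gb:.1f} GB")

# Cost #4: a row that commits t seconds into cycle N is invisible until
# cycle N+1 starts -- on average half an interval old, worst case a full
# interval, even if the cycle itself takes zero time.
INTERVAL_S = 60
print(f"freshness floor at {INTERVAL_S}s: worst {INTERVAL_S}s, "
      f"average {INTERVAL_S / 2:.0f}s")
```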
The shape of the problem: the engineer who shipped the pipeline optimised for cost #4 without measuring costs #1–#3. The page arrives when those three become the constraint.
Walking the numbers on a real table
Pick the orders table at a Mumbai e-commerce startup. 8 crore rows. Average row width 1.4 KB. Roughly 4% of rows touched per day during normal traffic, 18% per day during a sale window. The pipeline polls every minute and pulls rows where updated_at > :wm. Walk the cost.
```python
# Quantifying the cost of a polling pipeline against a real-shape table.
# All numbers are intentionally Razorpay/Flipkart-shaped, not toy.
TABLE_ROWS = 80_000_000       # 8 crore
ROW_WIDTH_BYTES = 1_400       # average payload + indexes
DAILY_CHURN_NORMAL = 0.04     # 4% of rows touched in a normal day
DAILY_CHURN_SALE = 0.18       # 18% during Big Billion / Diwali week
POLL_INTERVAL_SECONDS = 60
SECONDS_PER_DAY = 86_400

def per_cycle_cost(table_rows, churn, interval_s):
    """Estimate one polling cycle's cost on the source database."""
    cycles_per_day = SECONDS_PER_DAY / interval_s
    rows_changed_per_day = table_rows * churn
    rows_per_cycle = rows_changed_per_day / cycles_per_day
    bytes_per_cycle = rows_per_cycle * ROW_WIDTH_BYTES
    # Index walk on updated_at: ~3-4 page reads per matched row on a hot index.
    # Each page read is ~8 KB. We model it conservatively at 4 reads/row.
    pages_per_cycle = rows_per_cycle * 4
    return rows_per_cycle, bytes_per_cycle, pages_per_cycle

# Normal weekday
r_n, b_n, p_n = per_cycle_cost(TABLE_ROWS, DAILY_CHURN_NORMAL, POLL_INTERVAL_SECONDS)
# Sale week -- same interval, 4.5x the work because churn is 4.5x.
r_s, b_s, p_s = per_cycle_cost(TABLE_ROWS, DAILY_CHURN_SALE, POLL_INTERVAL_SECONDS)

print(f"Normal weekday: {r_n:>10,.0f} rows/cycle {b_n/1e6:>7,.1f} MB {p_n:>12,.0f} page reads")
print(f"Sale week:      {r_s:>10,.0f} rows/cycle {b_s/1e6:>7,.1f} MB {p_s:>12,.0f} page reads")

# What if Finance asks for 30-second freshness?
print()
print("--- Cutting interval from 60s to 30s ---")
for label, churn in [("Normal", DAILY_CHURN_NORMAL), ("Sale", DAILY_CHURN_SALE)]:
    r60, _, _ = per_cycle_cost(TABLE_ROWS, churn, 60)
    r30, _, _ = per_cycle_cost(TABLE_ROWS, churn, 30)
    # Per-cycle row work halves and cycles per day double, so the daily row
    # total is identical. What doubles is the cycle rate -- and with it every
    # fixed per-cycle cost (planning, connection setup, watermark commit).
    print(f"{label:>7}: {r60:,.0f} rows/cycle at 60s -> {r30:,.0f} at 30s; "
          f"cycle rate doubles, so fixed per-cycle overhead doubles")
```

Sample run:

```
Normal weekday:      2,222 rows/cycle     3.1 MB        8,889 page reads
Sale week:          10,000 rows/cycle    14.0 MB       40,000 page reads

--- Cutting interval from 60s to 30s ---
 Normal: 2,222 rows/cycle at 60s -> 1,111 at 30s; cycle rate doubles, so fixed per-cycle overhead doubles
   Sale: 10,000 rows/cycle at 60s -> 5,000 at 30s; cycle rate doubles, so fixed per-cycle overhead doubles
```
A walkthrough of the lines that matter:
- rows_per_cycle = rows_changed_per_day / cycles_per_day — the per-cycle row count is what actually hits the source per query. On a sale week this is 10,000 rows per minute, which is fine. The same query, on the same table, ten years from now at 80 crore rows, is one lakh rows per minute and starts to compete with the application's own queries for buffer pool space.
- pages_per_cycle = rows_per_cycle * 4 — the deeper bite. Postgres B-tree index lookups touch roughly 3–4 pages per row on a hot index, more on a cold one. 40,000 page reads per minute during a sale, against a buffer pool already serving the application, evicts hot application pages and slows API latency. Why this is invisible on a Grafana CPU chart: the polling query looks fast (it is) and the per-cycle CPU is small. The damage is to the buffer pool's working set, which only shows up as application-side cache miss rate and p99 latency on unrelated queries.
- The 30-second-interval section — halving the interval does not halve the daily cost (the same rows still change), but it doubles the cycle rate, and every fixed per-cycle cost doubles with it. Source databases size their CPU for peak load, so the cycle rate is the number that matters for capacity planning. A polling pipeline that runs at 60s and uses 4% of the source's CPU during normal hours is, at 30s, using 8%. At 10s it is using 24%. Past 12% sustained, you are competing with the application.
- The wire cost is a hidden bottleneck. 14 MB per cycle on a 1 Gbps replica link is fine. 14 MB per cycle from each of 40 tables is 560 MB per cycle, or 9.3 MB/s sustained — a meaningful fraction of a typical inter-AZ link. Poll those 40 tables at 30-second intervals from a single replica and the replica's network pipe can become the binding constraint before its CPU does.
The sale-week numbers are the ones that take down the pipeline. The capacity plan was sized to the normal-weekday number; nobody re-ran it before Diwali.
The freshness vs. source-load curve
Every polling pipeline lives on a curve where freshness target and source load trade off. Plot it for the table above, and a clear knee appears.
The knee is real because peak load grows with the reciprocal of the interval, not linearly: each halving of the interval doubles the cycle rate, so the absolute jump in load gets bigger the shorter the interval already is. Halving from 5 min to 2.5 min adds little. Halving from 30 s to 15 s adds the same per-cycle work as the move from 5 min to 30 s did, but compressed into half the time, so the per-minute peak doubles. Halving again to 7.5 s puts you at the same peak load that the application's own write traffic hits during business hours. Below that, the application and the polling pipeline are competing for the same buffer pool pages, and the polling pipeline always loses, because the application's queries are hot in cache and the polling query is, by definition, scanning the cold tail of just-changed rows.

Why polling loses the cache fight: the application reads rows by primary key; those pages stay hot. The polling query reads rows ordered by updated_at, which scatters across the heap, and each cycle pulls in a fresh batch of pages that immediately get evicted before the next cycle. The pipeline's working set is larger than the buffer pool can hold, by design.
The knee position depends on three numbers: the table's row count, the source's spare CPU headroom, and the ratio of the table's hot working set to the buffer pool. There is no universal "polling stops working at 60 seconds" — for a 1 lakh-row table on a lightly-used Postgres, polling at 1-second intervals is fine. For a 100 crore-row table on a primary that handles 12k commits per second, polling at 5-minute intervals is already painful. The shape of the curve is the same; the position differs.
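The knee can be sketched numerically with the same cost model as the worked example. The fixed per-cycle cost and per-page cost below are assumed constants, chosen only to show the shape of the curve, not measurements from any real system:

```python
# Sketch of the freshness-vs-source-load curve for the 8-crore-row table.
# FIXED_MS_PER_CYCLE and MS_PER_PAGE_READ are illustrative assumptions.
TABLE_ROWS = 80_000_000
DAILY_CHURN = 0.18            # sale-week churn
PAGES_PER_ROW = 4
FIXED_MS_PER_CYCLE = 120.0    # planning + connection + watermark commit (assumed)
MS_PER_PAGE_READ = 0.05       # amortised cost of one 8 KB page read (assumed)

rows_per_second = TABLE_ROWS * DAILY_CHURN / 86_400

for interval_s in (300, 60, 30, 15, 7.5, 3):
    rows_per_cycle = rows_per_second * interval_s
    row_ms = rows_per_cycle * PAGES_PER_ROW * MS_PER_PAGE_READ
    total_ms = FIXED_MS_PER_CYCLE + row_ms
    load_pct = 100 * (total_ms / 1000) / interval_s   # share of one source core
    overhead_pct = 100 * FIXED_MS_PER_CYCLE / total_ms
    print(f"interval {interval_s:>5}s -> {load_pct:4.1f}% of a core, "
          f"{overhead_pct:4.1f}% of it fixed overhead")
```

With these assumed constants, the row work per minute stays flat while the fixed overhead share climbs as the interval shrinks, overtaking the row work somewhere in the single-digit seconds. Change the constants and the knee moves, exactly as the paragraph above says it should.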
What the source-side query plan actually does
A complete picture requires looking at the SQL the polling pipeline runs and what Postgres does with it. The naive query is:
SELECT * FROM orders WHERE updated_at > '2026-04-25 04:00:00+05:30' ORDER BY updated_at;
With an index on updated_at, the plan is:
Index Scan using orders_updated_at_idx on orders
Index Cond: (updated_at > '2026-04-25 04:00:00+05:30'::timestamptz)
Buffers: shared hit=4218 read=12407
Read those buffer numbers carefully. hit=4218 means 4,218 pages were already in Postgres's buffer pool — a free read. read=12407 means 12,407 pages had to be fetched from disk (or from the OS page cache, which is faster than disk but still spends the OS file-cache budget). Each read is ~8 KB, so this single polling cycle did roughly 99 MB of cold I/O on the source; on a sale-week minute those numbers are 4× higher. That I/O competes with the application's own page cache demands.

Why this is the worst kind of competition: Postgres's buffer replacement uses an approximation of LRU. The polling query touches a large set of pages once and never again until the next cycle. Those pages get marked as recently used, evicting application pages that are accessed many times per second but happen to have a slightly older last access. The polling pipeline systematically poisons the cache for the application.
Two mitigations exist and both have costs. A SET LOCAL synchronous_commit = off on the polling session reduces commit cost but only helps the polling job, not the application it is hurting. Pulling from a dedicated read replica isolates the buffer pool — but adds replication lag, which adds to the freshness debt; the replica might be 20 seconds behind the primary, so polling it on a 30-second cycle gives 50 seconds of effective freshness, not 30. Most teams discover this after they've already set up a replica for the polling pipeline.
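The replica arithmetic is worth making explicit. A minimal sketch, treating the cycle's own runtime as zero (the best case):

```python
# Worst-case effective freshness when polling a lagging replica:
# a row commits on the primary just after a cycle starts, takes
# replica_lag_s to appear on the replica, then waits out the next interval.
def effective_freshness(replica_lag_s, poll_interval_s, cycle_time_s=0):
    return replica_lag_s + poll_interval_s + cycle_time_s

# The chapter's example: a 20 s replica lag turns a 30 s polling cycle
# into 50 s of effective freshness.
print(effective_freshness(20, 30))   # -> 50
```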
A third mitigation, sometimes attempted: change the polling query from SELECT * to SELECT id, updated_at and then issue a second round of point lookups by primary key. The argument is that the index-only scan is cheap, and primary-key fetches hit the buffer pool's hot working set. In practice this works at small scale and stops working at exactly the same threshold the original query did, because the id list is itself thousands of rows long, the second round of fetches is thousands of point lookups, and the total cost is still O(rows-since-watermark), just split across two queries.

Why this mitigation feels like it should work but doesn't: the second round of point lookups is fast per query, but the total work is the same — every changed row is still being read once. The win is purely in cache locality, and that locality is destroyed at scale by the same buffer-pool eviction problem the bigger query had. Halving the constant in front of an O(N) cost does not move the asymptote. The pattern is appealing because it appears in many internal-blog walkthroughs, but it does not move the knee in the curve — it just shifts the cost between the two queries.
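The asymptotic argument can be made concrete. The per-row page counts below are illustrative assumptions (an index-only scan is nearly free per row; a primary-key point lookup touches roughly 3 pages):

```python
# Why the "id-list then point lookups" mitigation does not move the knee.
ROWS_SINCE_WM = 10_000                        # sale-week rows per cycle

single_query_pages = ROWS_SINCE_WM * 4        # heap-scattered index scan
index_only_pages = ROWS_SINCE_WM * 0.1        # phase 1: id + updated_at only (assumed)
pk_lookup_pages = ROWS_SINCE_WM * 3           # phase 2: one PK lookup per id (assumed)

print(f"single query: {single_query_pages:,.0f} pages")
print(f"two-phase:    {index_only_pages + pk_lookup_pages:,.0f} pages")
# Both grow linearly in rows-since-watermark; only the constant shrinks.
```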
When polling is still the right answer
It is worth being precise: polling is not always wrong. It is the right primitive when the table is small enough, the freshness target is loose enough, and the source has spare headroom. The line is roughly:
- Tables under 10 lakh rows, polled at 5-minute or longer intervals, against a database with 50%+ idle CPU — polling is fine and CDC is overkill.
- Tables that change rarely (config tables, lookup tables, slowly-changing dimensions) — polling is fine forever, because the per-cycle row count never grows.
- Sources you genuinely cannot get a replication slot on (legacy MySQL versions without ROW-format binlog, vendor-managed databases without log access, SaaS APIs that only expose ?updated_since= endpoints) — polling is the only option, and the discipline becomes "set the freshness target high enough that the cost stays in budget".
The chapters in Build 11 do not say "stop polling everything". They say "know which side of the knee your table is on, and have CDC ready when it crosses".
A real incident: 14 minutes of lag at 3 a.m.
The Mumbai e-commerce startup from the lead paragraph had a specific incident worth retelling in detail. It was Diwali sale week 2025. The orders table had grown to 8 crore rows. The polling pipeline was set to 60 seconds, with SELECT * FROM orders WHERE updated_at > :wm. At 02:47 IST, the freshness dashboard started climbing: 60 seconds became 90, became 4 minutes, became 14. The on-call engineer's first instinct was correct — the source replica was at 92% CPU. Their second instinct was wrong: they cut the polling interval from 60 to 30 seconds, hoping faster cycles would catch up. The replica went to 99% CPU and the application's payment-confirmation API started timing out. Rolling the interval back to 60 seconds did not help, because the buffer pool was already poisoned with thousands of cold pages. The eventual fix at 04:30 was to stop the polling pipeline entirely, let the buffer pool re-warm with application traffic, and rebuild the freshness gap from a snapshot. The lost data was reconstructed from S3 archives later that week.
The incident's published post-mortem listed three lessons. First, the freshness alert had been there, but the source-CPU alert had not — the team measured the symptom, not the cause. Second, the cycle-shortening reflex actively made things worse, because each cycle's per-row cost is constant and shrinking the interval just compresses the same total work into a tighter window. Third, the migration to CDC had been on the roadmap for two quarters but kept getting bumped. Diwali week was when the bill came due.
Common confusions
- "Just add an index on updated_at and polling scales fine." The index helps the planner skip rows, but it doesn't help the buffer pool. The pages of just-changed rows are still scattered across the heap; reading them still costs cold I/O. The index helps the polling query not hurt itself; it doesn't stop the polling query from hurting the application.
- "Polling against a read replica solves the load problem." It moves the load to the replica, which solves the application-interference problem, but adds replication lag to the freshness budget and now requires you to operate (and pay for) a replica dedicated to the pipeline. It does not solve costs #2 (wire) and #4 (freshness-debt floor).
- "If we batch many tables into one cycle we save coordination cost." Batching helps cost #3 (overhead), but it forces all tables to share the same interval — even tables that don't need 30-second freshness pay the cost. It also introduces a hidden coupling: a slow query on table X delays all other tables in the same batch.
- "CDC is just polling with a smaller interval." No. CDC is push-based: the source emits a change event when it commits, and the consumer reads from the WAL/binlog stream. There is no per-cycle scan, no buffer-pool poisoning, no freshness debt floor — the consumer sees changes within milliseconds of commit. The cost model is fundamentally different: it is O(changes) not O(rows × cycles).
- "Polling is fine if the table is small." True — until "small" stops being true. Pipelines are inherited, and the engineer who set the polling interval is rarely the one who watches it cross the knee three years later.
- "You can poll at 1 second by reading from a materialised view." A materialised view has to be refreshed, and the refresh is itself a polling-shaped operation against the source. The cost has been moved one layer deeper, not eliminated.
Going deeper
The SaaS multi-tenant version: 200 tables × 30 customers
A B2B SaaS company in Bengaluru hosts 30 customer Postgres schemas, each with 200 tables. They polled every 5 minutes. Each cycle issued 6,000 polling queries — 1,200 per minute, nearly 20 per second sustained — against a single primary. The primary's CPU sat at 38% before the application even started serving traffic. The fix wasn't faster polling; it was replacing the entire pull layer with a single Debezium connector that consumed the WAL once. Source CPU dropped to 4%. The lesson the team published: polling cost scales with tables × tenants × cycles, not with row count, and most pipelines are accidentally in the worst quadrant of that product.
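The multiplication that buried that primary is simple enough to check, using the numbers from the paragraph above:

```python
# Polling cost scales with tables x tenants x cycles, not row count.
tenants = 30
tables_per_tenant = 200
poll_interval_s = 300                          # 5-minute cycle

queries_per_cycle = tenants * tables_per_tenant
queries_per_minute = queries_per_cycle / (poll_interval_s / 60)
queries_per_second = queries_per_cycle / poll_interval_s

print(queries_per_cycle, queries_per_minute, queries_per_second)
# 6,000 queries per cycle, 1,200 per minute, 20 per second sustained,
# before the application issues a single query of its own.
```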
Why Stripe and Razorpay both publicly stopped polling around 2022
Both companies have published internal-engineering posts about the inflection point where they migrated from polling-based ETL (Stripe used Sigma-style queries, Razorpay used Airbyte's Postgres connector in incremental mode) to log-based CDC (Stripe's was internal; Razorpay's was Debezium on Kafka). The numbers reported are similar: at roughly 100M-row tables and sub-minute SLAs, polling cost became larger than the warehouse compute cost. The migration paid for itself in 6 months on database compute alone, before counting the freshness wins.
The interaction with Postgres VACUUM and autovacuum
A polling query that scans by updated_at interacts badly with autovacuum's choice of pages to scan. Autovacuum prioritises pages with the most dead tuples; a heavy polling pipeline keeps recently-changed pages hot in the buffer pool, which shifts autovacuum's behaviour and can delay the cleanup of pages the polling pipeline doesn't touch. The visible symptom is bloat on slowly-changing tables that are not in the polling pipeline. Most teams discover this when their pg_stat_user_tables shows unexpected dead-tuple counts on tables that haven't changed.
What "interval drift" looks like in practice
A cron-based polling pipeline runs every 60 seconds, and each cycle takes 4 seconds to execute. Day one, the cycles fire at :00, :01, :02 — cron is reliable about that. But if a single cycle takes 65 seconds because of a table-statistics update on the source, cron skips a slot, and now the pipeline is a full minute behind for as long as cycles keep overrunning. The freshness floor stops being one minute and starts being unbounded. Polling at the cron level has no backpressure; it just falls behind silently. CDC consumers, by contrast, expose a continuous lag metric (LSN-behind for Postgres, binlog-position-behind for MySQL) that is alertable and bounded.
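A toy simulation of the skip-a-slot behaviour. It assumes the cron job holds a lock, so an overlapping slot is skipped rather than stacked; the cycle durations are invented to trigger three overruns:

```python
# Interval drift: cron fires every 60s, overrunning cycles force skipped slots.
INTERVAL = 60
# Three cycles overrun (e.g. a table-statistics update on the source);
# the rest take the usual 4 seconds.
durations = [4, 4, 65, 65, 65, 4, 4, 4, 4, 4, 4]

busy_until = 0
skipped = 0
for slot in range(0, 601, INTERVAL):            # ten minutes of slots
    if slot < busy_until:
        skipped += 1
        print(f"t={slot:3d}s: slot skipped, pipeline a full interval behind")
        continue
    d = durations.pop(0)
    busy_until = slot + d
    print(f"t={slot:3d}s: cycle runs {d}s")

print(f"{skipped} slots skipped out of 11")
```

Each overrun silently costs a whole extra interval of freshness, and nothing in cron itself reports it.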
Why the answer isn't "just poll faster" — Little's Law on the polling pipeline
If the polling cycle takes T seconds and you want freshness F, you need cycles to start every F seconds. The cycles in flight at any moment are roughly T/F. Each cycle holds a connection, buffer pool pages, and (briefly) a watermark lock. As F shrinks toward T, the in-flight count grows, and the system tips into a regime where cycles overlap and contend. CDC sidesteps this entirely: there is one consumer, one continuous read, no per-cycle cost.
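A sketch of that T/F relationship for a cycle that takes 8 seconds (the cycle time is an assumed figure):

```python
# In-flight polling cycles ~= T / F: cycle time over target freshness.
CYCLE_TIME_S = 8.0   # T: how long one polling cycle takes (assumed)

for freshness_s in (60, 30, 15, 8, 4):
    in_flight = CYCLE_TIME_S / freshness_s
    note = "cycles overlap and contend" if in_flight > 1 else "ok"
    print(f"F={freshness_s:2d}s -> {in_flight:.2f} cycles in flight ({note})")
```

The moment F drops below T, more than one cycle is in flight at once, and the contention regime the paragraph describes begins.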
Three numbers that decide when CDC stops being optional
A useful rule of thumb that has held across Razorpay, Cred, Swiggy, and Zerodha public writeups: when (a) the table crosses 10 crore rows, (b) the freshness target drops below 60 seconds, and (c) the source's other obligations leave less than 20% CPU headroom — at least two of these three need to be true before polling becomes painful, but once two are true, the third follows within a quarter. Track these three numbers per table, and the migration from polling to CDC becomes a planned project instead of a 3 a.m. incident. The teams that suffer most are the ones who measure only freshness, treat source CPU as the platform team's problem, and discover at incident time that they had been past the knee for months.
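Tracking those three numbers per table is mechanical enough to automate. A sketch built from the chapter's rule of thumb — the function name and shape are hypothetical, not a standard API:

```python
# A per-table "which side of the knee" check using the chapter's thresholds.
def cdc_pressure(table_rows, freshness_target_s, cpu_headroom_pct):
    signals = {
        "rows > 10 crore": table_rows > 100_000_000,
        "freshness < 60s": freshness_target_s < 60,
        "headroom < 20%": cpu_headroom_pct < 20,
    }
    tripped = [name for name, hit in signals.items() if hit]
    verdict = ("plan the CDC migration now" if len(tripped) >= 2
               else "polling still in budget -- keep tracking")
    return tripped, verdict

# The chapter's orders table during sale week: two of three signals trip.
print(cdc_pressure(80_000_000, 30, 15))
```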
Where this leads next
The next chapters of Build 11 build the alternative, mechanism by mechanism:
- /wiki/postgres-logical-decoding-from-scratch — read a Postgres WAL stream by hand using pg_create_logical_replication_slot and pg_logical_slot_get_changes. The primitive that replaces polling.
- /wiki/mysql-binlog-format-and-replication-protocol — the equivalent on MySQL, with the binlog as the substrate.
- /wiki/snapshot-cdc-the-bootstrapping-problem — CDC removes polling once you are caught up, but how do you catch up in the first place against a 100M-row table whose WAL only goes back 4 hours?
By the end of Build 11, the freshness-vs-load curve from this chapter has a flat line drawn across it: CDC's cost is roughly constant in freshness target. The knee disappears.
References
- PostgreSQL EXPLAIN documentation: Buffer counters — how to read Buffers: shared hit=X read=Y and quantify cache pressure.
- The Log: What every software engineer should know (Jay Kreps, 2013) — the canonical argument for log-based capture over polling.
- Designing Data-Intensive Applications, Chapter 11 (Kleppmann, 2017) — the chapter on "Streams of changes" frames polling as the path of least resistance and CDC as the inevitable replacement.
- Razorpay engineering: how we moved 100B events/month to Debezium (2024) — the Indian-context reference for the migration economics.
- Stripe engineering blog: data pipelines (2022) — Stripe's path from query-based replication to log-based capture; the cost numbers are public.
- Postgres autovacuum tuning (Postgres docs) — context for the polling-vs-autovacuum interaction described above.
- /wiki/wall-oltp-databases-are-a-source-you-dont-control — the chapter immediately before this one; the structural reason polling is a leaky path even before the scale problem hits.
- /wiki/cursors-updated-at-columns-and-their-lies — the correctness failures of
updated_at-based polling, complementary to this chapter's load failures. - Confluent: streaming data integration (2023) — practitioner-level walkthrough of the polling-to-CDC migration.