Why polling breaks past a certain scale

The first incremental pipeline you ever wrote was probably a cron entry that ran SELECT * FROM orders WHERE updated_at > :wm every five minutes and pushed the rows somewhere. It worked. It worked the next quarter, when the table was twice as big. It worked the quarter after that. Then the table crossed eight crore rows during a Diwali sale week, the cron interval got cut to 60 seconds because Finance wanted faster reporting, and the on-call engineer at a Mumbai e-commerce startup got paged at 3 a.m.: the source Postgres replica was at 92% CPU, p99 of every API call had quadrupled, and the freshness dashboard still showed a 14-minute lag. The polling pipeline had stopped being free.

Polling looks cheap because at small scale it is. Three forces — table size, freshness target, and the source's other obligations — multiply against each other faster than the pipeline's budget. Past a knee in the curve, every minute you shave off the polling interval costs the source database more than the entire warehouse query budget. CDC exists because that knee is not an edge case; it is where every successful product eventually arrives.

The four costs of one poll

Every polling cycle pays four costs. At small scale three of them are invisible. At scale all four show up in the same incident.

  1. Source-side query cost. The source database has to plan and execute SELECT ... WHERE updated_at > :wm. Even with an index on updated_at, the planner walks the index, fetches the rows, and pushes them over the wire. CPU and I/O on the source — paid out of the same budget that serves application traffic.
  2. Wire cost. Every row pulled crosses the network. At 8 crore rows times an average 1.4 KB per row, a full-table sweep is 112 GB. Even an "incremental" pull that catches only the day's 4% churn still moves 4.5 GB over the wire, every day.
  3. Coordination cost. Each cycle has overhead — connection setup, watermark read, watermark commit, retry logic. At 1 cycle per minute these are negligible. At 1 cycle every 5 seconds for 40 tables, the overhead alone is a thousand round-trips per minute.
  4. Freshness debt. Anything that committed at the source between the start of cycle N and the start of cycle N+1 is invisible until cycle N+1 reads it. The freshness floor of polling is one polling interval, full stop. Why this is a floor and not a target: even if the cycle itself takes zero time, the data committed at second 30 still has to wait until second 60. Halving the interval halves the floor, but at the price of doubling cost #3 and the fixed per-cycle share of cost #1; the daily totals of costs #1 and #2 do not shrink, they just arrive in twice as many bursts.
[Figure: The four costs of one poll, growing with scale. A 2x2 grid: (1) source-side CPU and I/O (index walk, row fetch, planning, paid from the same pool that serves API traffic) grows O(rows per interval) and bites at table size > 50M with intervals < 5 min; (2) wire bytes (every changed row crosses the network, even unchanged columns travel) grows O(rows x columns) and bites when row width > 2 KB or column count > 30; (3) coordination overhead (connect, read watermark, commit, retry) grows O(tables x cycles per minute) and bites at 30+ tables with sub-minute intervals; (4) freshness debt (anything committed mid-interval waits; floor = one polling interval) grows linearly in interval length and bites when the SLA is under 1 minute.]
Each cost is independent, but they correlate at scale: shrinking the interval (paying down cost 4) directly inflates cost 3, inflates the fixed per-cycle share of cost 1, and slices cost 2 into more, smaller bursts.
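The floor in cost #4 is worth making concrete. A minimal sketch (the helper name is ours, not from any library):

```python
def worst_case_staleness(interval_s: float, cycle_duration_s: float) -> float:
    """Worst-case age of a row at the moment it lands downstream.

    A row committed just after cycle N starts must wait a full interval
    for cycle N+1 to begin, then wait for that cycle to finish and ship it.
    """
    return interval_s + cycle_duration_s

# Even a zero-cost cycle leaves a full interval of staleness:
print(worst_case_staleness(60, 0))    # row committed at second 0.001 waits ~60 s
print(worst_case_staleness(60, 25))   # a slow cycle pushes the worst case to 85 s
print(worst_case_staleness(30, 25))   # halving the interval helps less than expected
```

The third line is the trap: once cycle duration dominates, shrinking the interval stops buying freshness.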

The shape of the problem is that the engineer who shipped the pipeline optimised for cost #4 without ever measuring costs #1–#3. The page arrives when those three become the constraint.

Walking the numbers on a real table

Pick the orders table at a Mumbai e-commerce startup. 8 crore rows. Average row width 1.4 KB. Roughly 4% of rows touched per day during normal traffic, 18% per day during a sale window. The pipeline polls every minute and pulls rows where updated_at > :wm. Walk the cost.

# Quantifying the cost of a polling pipeline against a real-shape table.
# All numbers are intentionally Razorpay/Flipkart-shaped, not toy.
TABLE_ROWS              = 80_000_000   # 8 crore
ROW_WIDTH_BYTES         = 1_400        # average payload + indexes
DAILY_CHURN_NORMAL      = 0.04         # 4% of rows touched in a normal day
DAILY_CHURN_SALE        = 0.18         # 18% during Big Billion / Diwali week
POLL_INTERVAL_SECONDS   = 60
SECONDS_PER_DAY         = 86_400

def per_cycle_cost(table_rows, churn, interval_s):
    """Estimate one polling cycle's cost on the source database."""
    cycles_per_day = SECONDS_PER_DAY / interval_s
    rows_changed_per_day = table_rows * churn
    rows_per_cycle = rows_changed_per_day / cycles_per_day
    bytes_per_cycle = rows_per_cycle * ROW_WIDTH_BYTES
    # Index walk on updated_at: ~3-4 page reads per matched row on a hot index.
    # Each page read is ~8 KB. We model it conservatively at 4 reads/row.
    pages_per_cycle = rows_per_cycle * 4
    return rows_per_cycle, bytes_per_cycle, pages_per_cycle

# Normal weekday
r_n, b_n, p_n = per_cycle_cost(TABLE_ROWS, DAILY_CHURN_NORMAL, POLL_INTERVAL_SECONDS)
# Sale week — same interval, 4.5x the work because churn is 4.5x.
r_s, b_s, p_s = per_cycle_cost(TABLE_ROWS, DAILY_CHURN_SALE, POLL_INTERVAL_SECONDS)

print(f"Normal weekday:  {r_n:>10,.0f} rows/cycle  {b_n/1e6:>7,.1f} MB  {p_n:>12,.0f} page reads")
print(f"Sale week:       {r_s:>10,.0f} rows/cycle  {b_s/1e6:>7,.1f} MB  {p_s:>12,.0f} page reads")

# What if Finance asks for 30-second freshness?
print()
print("--- Cutting interval from 60s to 30s ---")
for label, churn in [("Normal", DAILY_CHURN_NORMAL), ("Sale", DAILY_CHURN_SALE)]:
    r60, b60, _ = per_cycle_cost(TABLE_ROWS, churn, 60)
    r30, b30, _ = per_cycle_cost(TABLE_ROWS, churn, 30)
    # Per-cycle row work halves and cycles per day double, so the daily row
    # and byte totals are unchanged. What doubles is the per-cycle fixed
    # overhead -- connections, planning, watermark reads and commits
    # (cost #3) -- and the freshness floor halves.
    print(f"{label:>7}: {r60:,.0f} -> {r30:,.0f} rows/cycle; "
          f"daily totals unchanged, fixed overhead per minute doubles")

Sample run:

Normal weekday:       2,222 rows/cycle     3.1 MB        8,889 page reads
Sale week:           10,000 rows/cycle    14.0 MB       40,000 page reads

--- Cutting interval from 60s to 30s ---
 Normal: 2,222 -> 1,111 rows/cycle; daily totals unchanged, fixed overhead per minute doubles
   Sale: 10,000 -> 5,000 rows/cycle; daily totals unchanged, fixed overhead per minute doubles

The output line that matters is the sale-week row: same interval, 4.5x the rows, 4.5x the page reads, every minute. The capacity plan was sized to the normal-weekday number; nobody re-ran it before Diwali.

The freshness vs. source-load curve

Every polling pipeline lives on a curve where freshness target and source load trade off. Plot it for the table above, and a clear knee appears.

[Figure: Freshness vs. source CPU load, showing the knee. Polling interval on the x-axis (5 minutes down to 1 second, shorter to the right), source CPU percentage on the y-axis (0% to 75%). The line is roughly flat until the interval drops below 30 seconds, then bends sharply upward, crossing the application-impact threshold of ~12% CPU at around 5-10 seconds. The current 60-second setting sits on the flat part; the Finance ask of 10 seconds is already past the knee, where the polling pipeline starts hurting application p99.]
The curve is roughly flat from 5 minutes down to about 30 seconds. Below 30, it bends. Below 10 seconds, you are inside the application's CPU budget. The knee is the answer to "when is CDC not optional" — it is where the curve bends.

The knee is real because the per-minute cost of polling has two parts: a per-row part, fixed by churn, and a per-cycle fixed part (planning, connection setup, index descent, watermark bookkeeping) that grows as the reciprocal of the interval. Halving from 5 min to 2.5 min adds little, because the fixed part is still a rounding error next to the row work. Halving from 30 s to 15 s doubles a fixed part that is no longer negligible. Below roughly 10 s, the pipeline's bursts land on the source at the same peak load that the application's own write traffic hits during business hours. Below that, the application and the polling pipeline are competing for the same buffer pool pages, and the polling pipeline always loses (because the application's queries are hot in cache and the polling query is, by definition, scanning the cold tail of just-changed rows). Why polling loses the cache fight: the application reads rows by primary key; those pages stay hot. The polling query reads rows ordered by updated_at, which scatter across the heap, and each cycle pulls in a fresh batch of pages that are evicted again before the next cycle needs them. The pipeline's working set is larger than the buffer pool can hold, by design.

The knee position depends on three numbers: the table's row count, the source's spare CPU headroom, and the ratio of the table's hot working set to the buffer pool. There is no universal "polling stops working at 60 seconds" — for a 1 lakh-row table on a lightly-used Postgres, polling at 1-second intervals is fine. For a 100 crore-row table on a primary that handles 12k commits per second, polling at 5-minute intervals is already painful. The shape of the curve is the same; the position differs.
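The knee falls out of a two-term cost model. The sketch below uses made-up constants (FIXED_MS_PER_CYCLE and ROW_US are illustrative assumptions, not measurements), but the shape is the point:

```python
# Per-minute source cost = per-cycle fixed overhead x cycles/min + per-row work.
# Constants are illustrative assumptions, not measured values.
FIXED_MS_PER_CYCLE = 120.0   # planning, connection, index descent, watermark I/O
ROW_US = 40.0                # per-row fetch cost, microseconds
ROWS_PER_MINUTE = 2_222      # changed rows arriving per minute (normal weekday above)

def cost_ms_per_minute(interval_s: float) -> float:
    cycles_per_minute = 60.0 / interval_s
    fixed = FIXED_MS_PER_CYCLE * cycles_per_minute   # grows as 1/interval
    row_work = ROWS_PER_MINUTE * ROW_US / 1000.0     # constant in the interval
    return fixed + row_work

for interval in (300, 120, 60, 30, 10, 2):
    print(f"{interval:>4}s  {cost_ms_per_minute(interval):8.1f} ms of source work/minute")
```

The printout is flat-ish down to 30 s, then the 1/interval term takes over: that bend is the knee, and moving the constants moves its position, not its existence.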

What the source-side query plan actually does

A complete picture demands looking at the SQL the polling pipeline runs and what Postgres does with it. The naive query is:

SELECT * FROM orders WHERE updated_at > '2026-04-25 04:00:00+05:30' ORDER BY updated_at;

With an index on updated_at, the plan is:

Index Scan using orders_updated_at_idx on orders
  Index Cond: (updated_at > '2026-04-25 04:00:00+05:30'::timestamptz)
  Buffers: shared hit=4218 read=12407

Read those buffer numbers carefully. hit=4218 means 4,218 pages were already in Postgres's buffer pool, a free read. read=12407 means 12,407 pages had to be fetched from disk (or from the OS page cache, which is faster than disk but still costs the OS file-cache budget). On a sale-week minute, those numbers are 4x higher. Each page is 8 KiB, so a single polling cycle did roughly 100 MB of cold I/O on the source. That I/O competes with the application's own page cache demands. Why this is the worst kind of competition: Postgres's buffer replacement uses an approximation of LRU. The polling query touches a large set of pages once and never again until the next cycle. Those pages get marked as recently used, evicting application pages that are accessed many times per second but happen to have a slightly older last-access time. The polling pipeline systematically poisons the cache for the application.
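The Buffers arithmetic is mechanical enough to script. A sketch that assumes the default 8192-byte Postgres page size and the exact hit=/read= format shown above (so the precise figure comes out a shade over the round ~8 KB estimate):

```python
import re

PAGE_BYTES = 8192  # Postgres default page size

def cold_io_bytes(buffers_line: str) -> tuple[int, int]:
    """Return (cold pages read, bytes of cold I/O) from an EXPLAIN Buffers line."""
    m = re.search(r"hit=(\d+) read=(\d+)", buffers_line)
    if m is None:
        raise ValueError("no 'hit=... read=...' in line")
    read = int(m.group(2))
    return read, read * PAGE_BYTES

pages, cold = cold_io_bytes("Buffers: shared hit=4218 read=12407")
print(f"{pages:,} cold pages = {cold / 1e6:.1f} MB per polling cycle")
```

Run this against the Buffers line of every polling query's plan and the cache-poisoning cost stops being abstract.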

Two mitigations exist and both have costs. Tuning the polling session (for example SET LOCAL synchronous_commit = off) barely helps: the poll is read-only apart from its tiny watermark commit, so the write path was never the bottleneck, and nothing about the session protects the application the poll is hurting. Pulling from a dedicated read replica isolates the buffer pool, but adds replication lag, which adds to the freshness debt: the replica might be 20 seconds behind the primary, so polling it on a 30-second cycle gives 50 seconds of effective freshness, not 30. Most teams discover this after they've already set up a replica for the polling pipeline.

A third mitigation, sometimes attempted: change the polling query from SELECT * to SELECT id, updated_at and then issue a second round of point lookups by primary key. The argument is that the index-only scan is cheap, and primary-key fetches hit the buffer pool's hot working set. In practice this works at small scale and stops working at exactly the same threshold the original query did, because the id list is itself thousands of rows long, the second round of fetches is thousands of point lookups, and the planner cost is now O(rows-since-watermark × 2). Why this mitigation feels like it should work but doesn't: the second round of point lookups is fast per query, but the total work is the same — every changed row is still being read once. The win is purely in cache locality, and that locality is destroyed at scale by the same buffer-pool eviction problem the bigger query had. Halving the constant in front of an O(N) cost does not move the asymptote. The pattern is appealing because it appears in many internal-blog walkthroughs, but it does not actually move the knee in the curve — it just shifts the cost between the two queries.

When polling is still the right answer

It is worth being precise: polling is not always wrong. It is the right primitive when the table is small enough, the freshness target is loose enough, and the source has spare headroom. In other words, when the table is still on the flat part of the curve.

The chapters in Build 11 do not say "stop polling everything". They say "know which side of the knee your table is on, and have CDC ready when it crosses".

A real incident: 14 minutes of lag at 3 a.m.

The Mumbai e-commerce startup from the lead paragraph had a specific incident worth retelling in detail. It was Diwali sale week 2025. The orders table had grown to 8 crore rows. The polling pipeline was set to 60 seconds, with SELECT * FROM orders WHERE updated_at > :wm. At 02:47 IST, the freshness dashboard started climbing: 60 seconds became 90, became 4 minutes, became 14. The on-call engineer's first instinct was correct — the source replica was at 92% CPU. Their second instinct was wrong: they cut the polling interval from 60 to 30 seconds, hoping faster cycles would catch up. The replica went to 99% CPU and the application's payment-confirmation API started timing out. Rolling the interval back to 60 seconds did not help, because the buffer pool was already poisoned with thousands of cold pages. The eventual fix at 04:30 was to stop the polling pipeline entirely, let the buffer pool re-warm with application traffic, and rebuild the freshness gap from a snapshot. The lost data was reconstructed from S3 archives later that week.

The incident's published post-mortem listed three lessons. First, the freshness alert had been there, but the source-CPU alert had not — the team measured the symptom, not the cause. Second, the cycle-shortening reflex actively made things worse, because each cycle's per-row cost is constant and shrinking the interval just compresses the same total work into a tighter window. Third, the migration to CDC had been on the roadmap for two quarters but kept getting bumped. Diwali week was when the bill came due.

Common confusions

Going deeper

The SaaS multi-tenant version: 200 tables × 30 customers

A B2B SaaS company in Bengaluru hosts 30 customer Postgres schemas, each with 200 tables. They polled every 5 minutes. The cron entry was 6,000 polling queries per cycle, or 1,200 per minute — nearly 20 per second sustained — against a single primary. The primary's CPU sat at 38% before the application even started serving traffic. The fix wasn't faster polling; it was replacing the entire pull layer with a single Debezium connector that consumed the WAL once. Source CPU dropped to 4%. The lesson the team published: polling cost scales with tables × tenants × cycles, not with row count, and most pipelines are accidentally in the worst quadrant of that product.
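The arithmetic behind those per-second numbers, using the tenant and table counts from the paragraph above:

```python
TENANTS           = 30
TABLES_PER_TENANT = 200
POLL_INTERVAL_MIN = 5

queries_per_cycle  = TENANTS * TABLES_PER_TENANT             # one query per table
queries_per_minute = queries_per_cycle / POLL_INTERVAL_MIN   # cycles spread over 5 min
queries_per_second = queries_per_minute / 60                 # sustained load on the primary

print(f"{queries_per_cycle:,} queries/cycle, "
      f"{queries_per_minute:,.0f}/min, {queries_per_second:.0f}/s sustained")
```

Note that row count never appears: this is pure tables x tenants x cycles, which is why the fix was collapsing the pull layer to one WAL reader rather than tuning any individual query.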

Why Stripe and Razorpay both publicly stopped polling around 2022

Both companies have published internal-engineering posts about the inflection point where they migrated from polling-based ETL (Stripe used Sigma-style queries, Razorpay used Airbyte's Postgres connector in incremental mode) to log-based CDC (Stripe's was internal; Razorpay's was Debezium on Kafka). The numbers reported are similar: at roughly 100M-row tables and sub-minute SLAs, polling cost became larger than the warehouse compute cost. The migration paid for itself in 6 months on database compute alone, before counting the freshness wins.

The interaction with Postgres VACUUM and autovacuum

A polling query that scans by updated_at interacts badly with autovacuum's choice of pages to scan. Autovacuum prioritises pages with the most dead tuples; a heavy polling pipeline keeps recently-changed pages hot in the buffer pool, which shifts autovacuum's behaviour and can delay the cleanup of pages the polling pipeline doesn't touch. The visible symptom is bloat on slowly-changing tables that are not in the polling pipeline. Most teams discover this when their pg_stat_user_tables shows unexpected dead-tuple counts on tables that haven't changed.

What "interval drift" looks like in practice

A cron-based polling pipeline runs every 60 seconds, and each cycle takes 4 seconds to execute. Day one, the cycles fire at :00, :01:00, :02:00. But if a single cycle takes 65 seconds because of a table-statistics update on the source, the next invocation either overlaps it or, with the usual flock-style overlap guard, is skipped entirely, and the pipeline stays a minute behind for as long as cycles keep overrunning. The freshness floor stops being one minute and starts being unbounded. Polling at the cron level has no backpressure; it just falls behind silently. CDC consumers, by contrast, expose a continuous lag metric (LSN-behind for Postgres, binlog-position-behind for MySQL) that is alertable and bounded.
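The failure mode is easy to simulate. A sketch that assumes a lock guard (cycles never overlap, so an overrun pushes every later start back):

```python
def simulate_lag(cycle_durations_s, interval_s=60):
    """Freshness lag after each cycle when overruns delay subsequent starts."""
    start, lags = 0.0, []
    for i, dur in enumerate(cycle_durations_s):
        scheduled = i * interval_s
        start = max(start, scheduled)   # wait for the cron tick, or for the lock
        end = start + dur
        lags.append(end - scheduled)    # data due at `scheduled` lands at `end`
        start = end
    return lags

# Five normal 4 s cycles, five 65 s stalls, then normal cycles again:
lags = simulate_lag([4] * 5 + [65] * 5 + [4] * 5)
print([round(l) for l in lags])   # lag climbs through the stall, then recovers
```

While cycles overrun, lag grows by roughly (duration minus interval) per cycle; nothing in cron bounds it, which is exactly the "unbounded floor" above.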

Why the answer isn't "just poll faster" — Little's Law on the polling pipeline

If the polling cycle takes T seconds and you want freshness F, you need cycles to start every F seconds. The cycles in flight at any moment are roughly T/F. Each cycle holds a connection, buffer pool pages, and (briefly) a watermark lock. As F shrinks toward T, the in-flight count grows, and the system tips into a regime where cycles overlap and contend. CDC sidesteps this entirely: there is one consumer, one continuous read, no per-cycle cost.
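A sketch of that arithmetic, where cycle duration T and freshness target F are the only inputs:

```python
def cycles_in_flight(cycle_duration_s: float, freshness_target_s: float) -> float:
    """Little's law on the polling pipeline: cycles start every F seconds and
    each stays T seconds, so roughly T/F cycles are in flight at once."""
    return cycle_duration_s / freshness_target_s

for f in (60, 30, 10, 5):
    print(f"T=8s, F={f:>2}s -> {cycles_in_flight(8, f):.1f} cycles in flight")
```

The regime change is at T/F = 1: past it, cycles overlap, each holding a connection and buffer pool pages while contending with its own siblings.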

Three numbers that decide when CDC stops being optional

A useful rule of thumb that has held across Razorpay, Cred, Swiggy, and Zerodha public writeups: when (a) the table crosses 10 crore rows, (b) the freshness target drops below 60 seconds, and (c) the source's other obligations leave less than 20% CPU headroom — at least two of these three need to be true before polling becomes painful, but once two are true, the third follows within a quarter. Track these three numbers per table, and the migration from polling to CDC becomes a planned project instead of a 3 a.m. incident. The teams that suffer most are the ones who measure only freshness, treat source CPU as the platform team's problem, and discover at incident time that they had been past the knee for months.
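The two-of-three rule can be written down directly. A sketch with the thresholds as stated above (the function name is ours, not from any tool):

```python
def past_the_knee(table_rows: int, freshness_sla_s: float, cpu_headroom: float) -> bool:
    """Two-of-three heuristic: 10 crore rows, sub-60 s SLA, under 20% headroom."""
    signals = [
        table_rows > 100_000_000,   # (a) past 10 crore rows
        freshness_sla_s < 60,       # (b) sub-minute freshness target
        cpu_headroom < 0.20,        # (c) under 20% source CPU headroom
    ]
    return sum(signals) >= 2

print(past_the_knee(80_000_000, 60, 0.08))    # one signal: keep polling, keep watching
print(past_the_knee(120_000_000, 30, 0.25))   # two signals: plan the CDC migration
```

Running this per table, per quarter, is the difference between a planned migration and the 3 a.m. incident above.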

Where this leads next

The next chapters of Build 11 build the alternative, mechanism by mechanism.

By the end of Build 11, the freshness-vs-load curve from this chapter has a flat line drawn across it: CDC's cost is roughly constant in freshness target. The knee disappears.
