Late-arriving data and the backfill problem

Dipti's pipeline runs every 5 minutes. At 14:05 it processes everything with event_time between 14:00 and 14:05, advances its watermark to 14:05, and ships the slice to the warehouse. Marketing's dashboard refreshes. The 14:00–14:05 bucket says 2,18,400 ad clicks. Three minutes later, 1,47,213 events with event_time between 14:00 and 14:05 finally land in the source bucket — they were buffering on a flaky 4G connection at the edge. The bucket Dipti already closed should now say 3,65,613. The dashboard is wrong by 40%, and Dipti has to decide between three bad options before the 14:30 standup.

Events have two timestamps — when they happened (event time) and when your system saw them (processing time) — and the gap between them is the entire reason backfills exist. You cannot eliminate late data; you can only choose how late is "too late", and your watermark, your retention, and your backfill cadence are three faces of that one decision. Backfills are not failures of the design; they are the design's accepted cost.

Event time vs processing time — the gap that makes everything hard

Two clocks tick on every event. Event time is when the thing actually happened in the world — the cricket fan tapped "buy ticket" on BookMyShow at 19:42:11.4 IST. Processing time is when your pipeline saw the event — the request landed at the API gateway at 19:42:11.5, the Kafka producer batched it for 200 ms, the topic replicated it across three brokers in 80 ms, the stream processor read it at 19:42:12.3, and your warehouse-write commit landed at 19:42:14.1.

For an in-memory benchmark on a single machine the gap is microseconds. For a real Indian-scale system — millions of mobile phones over varied 4G/Wi-Fi, regional ingest clusters in ap-south-1, replication to ap-southeast-1, retry queues, network blips during the IPL final — the gap is anywhere from 50 ms to 4 hours. Every analytics question about "what happened between 19:00 and 20:00 last night" is implicitly asking about event time. Every dashboard rendered at 20:01 is implicitly working with whatever processing time has delivered so far.

[Figure: Event time vs processing time, with a late event. Two parallel timelines — event time on top, processing time below. Three events at event-time 14:00, 14:02, and 14:03 land at processing time 14:02, 14:04, and 14:09 respectively; the third is drawn with a long dashed arrow marking it late, arriving after the watermark closed the window at 14:05.]
The first two events arrive within 2 seconds. The third — a buy from a phone on a flaky 4G tower in Kanpur — takes 6 minutes. The watermark closed the 14:00–14:05 window at 14:05. The third event's bucket has already shipped a wrong total.

Why this gap is irreducible: the ingest path is a chain of independent buffers (mobile network, gateway, queue, processor, sink). Each adds latency that depends on conditions you don't control — radio handover during a moving train, an ISP route flapping during a thunderstorm in Mumbai, a retry queue draining after the producer's broker leader changed. The 99th-percentile gap in well-run Indian systems is 30–90 seconds; the 99.99th-percentile gap during the IPL final or Diwali sale is 30+ minutes. You can lower the percentiles by spending money. You cannot drive them to zero, because the failures are not in the parts you control.

The naming convention matters in incident postmortems too. When a Razorpay incident says "the watermark fell behind by 14 minutes during the 19:30 spike", everyone in the room understands which clock the sentence refers to. When an early-stage team says "the pipeline was 14 minutes behind", half the room thinks event time and half thinks processing time, and the postmortem produces wrong remediation. Pick the vocabulary on day one — event time, processing time, watermark, lateness — and use it in code comments, dashboard titles, and PagerDuty alert names. Discipline at the vocabulary level prevents the conceptual confusion that produces the next bug.

Why backfills are unavoidable — three sources of lateness

The 14:09 event in the figure didn't fail. It just took longer than the other events in its bucket. Three structural reasons produce that gap, and any of them justifies a backfill.

Network-edge buffering. A Swiggy delivery partner's phone in Hyderabad records GPS pings every 5 seconds. The phone goes through a tunnel and accumulates 12 minutes of pings in its local SQLite buffer. When the network reconnects, the app flushes the buffer in one POST. Twelve minutes of event_time from 12 minutes ago land in the next 200 ms of processing time. The pipeline's choices: retroactively update the previously-closed buckets, or refuse the late events and live with the gap.

Upstream pipeline lag. A CDC pipeline reading the orders table's WAL hits a long-running transaction that holds replication slots open for 18 minutes. When the transaction commits, 200,000 row-events with timestamps spread across those 18 minutes flush at once. The downstream stream processor sees them all in the same processing-time window, but their event-time timestamps are 1–18 minutes old. Every event-time bucket in that window is now wrong until the pipeline reprocesses them.

Out-of-order arrival from partitioned producers. Kafka guarantees ordering within a partition, not across partitions. If your producer keys events on user_id, two users on different partitions can have their events interleaved arbitrarily on the consumer side. The consumer reading partition 0 might be 12 seconds ahead of partition 1; events with the same event-time window can arrive in different processing-time windows depending on which partition they hit. The watermark either waits for the slowest partition (introducing fixed latency) or advances optimistically (creating late events on the laggard partitions).
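The "wait for the slowest partition" behaviour can be sketched in a few lines. This is an illustrative helper (`operator_watermark` is not any engine's API), but the rule it encodes — the operator-level watermark is the minimum of the per-partition frontiers, with an opt-out for partitions marked idle, analogous to Flink's `withIdleness()` — is the real one:

```python
def operator_watermark(frontiers, idle=frozenset()):
    """Effective watermark = min across all upstream partitions.
    Partitions explicitly marked idle are ignored, so a silent
    partition doesn't hold the frontier back forever."""
    active = [wm for part, wm in frontiers.items() if part not in idle]
    return min(active)

frontiers = {
    "partition-0": 1_714_050_012,   # 12 s ahead
    "partition-1": 1_714_050_000,   # the laggard
}
print(operator_watermark(frontiers))                       # the slow partition wins
print(operator_watermark(frontiers, idle={"partition-1"}))  # idle override
```

The first call returns partition-1's frontier — the whole operator is pinned to the slowest channel. The second shows what marking the laggard idle buys you: the frontier jumps ahead, at the cost of any events that partition later produces arriving late.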

The pipeline that pretends none of this happens is the pipeline that needs a daily rebuild-everything backfill. The pipeline that admits all of it has to budget — explicitly — for late data and design a backfill mechanism that handles each shape.

A useful framing: every event has a latency distribution, not a latency. The 50th percentile is what your developers test with on their laptops. The 99th percentile is what fires during normal traffic peaks. The 99.99th percentile is what fires during the IPL final, the Diwali sale, the GSTN filing deadline, the IRCTC tatkal window. Designing for the 50th percentile and dealing with the rest "if it becomes a problem" is the trajectory that ends with a 03:00 incident and a finance reconciliation. Designing for the 99.99th percentile from the start — by sizing the watermark budget, the side-output retention, and the backfill cadence to match — is what production-grade data engineering looks like. The cost of doing this on day one is roughly one extra week of design work; the cost of doing it after the incident is roughly six weeks of cleanup plus the credibility cost with stakeholders.

The watermark contract and what it lets you ship

A watermark is your formal answer to "how late is too late". It is an event-time threshold W with a contract: the pipeline asserts that no event with event-time earlier than W will be accepted into its primary path. The watermark advances as new events arrive — typically tracking the maximum event-time seen, minus a configured allowed lateness. The window for [T, T+5min) closes at T + 5min + lateness, not at T + 5min.

The choice of lateness is a budget. Too small and your dashboards refresh fast but lie often, demanding constant backfills. Too large and your dashboards are correct but stale. Indian production systems converge on budgets in the tens of seconds for consumer-app sources and in the minutes for buffered ones.

Events later than the watermark hit one of three lanes — drop, side-output, or backfill — and the choice is the part of the design that determines what your backfill looks like.

A practical way to pick the budget: plot the lateness distribution of a week's worth of events and look for the elbow. For most Indian consumer-app sources the elbow is between 30 and 90 seconds; for IoT-style sources (delivery GPS, sensors) the elbow is between 2 and 10 minutes; for batch-uploaded sources (mobile-app analytics SDKs that flush hourly) the elbow is at the SDK's flush interval. Pick the budget at the elbow plus a safety margin. A 24-hour budget on a real-time dashboard means the dashboard is 24 hours stale by definition; a 5-second budget on a 4G-buffered source means your side-output rate is 30% and the warehouse chokes on corrections. The right budget catches the 99th percentile cleanly and routes the 99.99th percentile to the side-output — that's the cheapest place to spend it.
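The elbow-plus-margin rule can be mechanised. This is a hypothetical sketch (`lateness_budget` and the sample distribution are illustrative, not production data): take a week of observed `processing_time - event_time` gaps, read off a high percentile, and multiply by a safety factor:

```python
def lateness_budget(latenesses_sec, target_pct=99, safety_factor=1.5):
    """Budget = the target percentile of observed lateness, times a
    safety margin. Routes everything above the percentile to the
    side-output by construction."""
    xs = sorted(latenesses_sec)
    idx = min(len(xs) - 1, int(len(xs) * target_pct / 100))
    return xs[idx] * safety_factor

# Synthetic week: 90% arrive in 200 ms, 9% in 5 s, ~1% in 45 s,
# and one pathological 30-minute straggler (the 99.99th percentile).
samples = [0.2] * 900 + [5.0] * 90 + [45.0] * 9 + [1800.0]
print(lateness_budget(samples))   # 67.5 — p99 is 45 s, budget ~68 s
```

Note what the straggler does not do: it does not inflate the budget. The 30-minute event is exactly the one you want in the side-output, not the one you want every dashboard waiting for.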

# tumbling_window_with_late_lane.py
# A 5-minute tumbling window over click events, with a 60-second
# allowed lateness and a side-output for events later than that.

import collections, datetime as dt

WINDOW_SEC = 300         # 5 minutes
LATENESS_SEC = 60        # 1 minute allowed lateness

def window_start(ev_time: dt.datetime) -> dt.datetime:
    epoch_sec = int(ev_time.timestamp()) // WINDOW_SEC * WINDOW_SEC
    return dt.datetime.fromtimestamp(epoch_sec, tz=dt.timezone.utc)

class WindowedAggregator:
    def __init__(self):
        self.counts = collections.Counter()      # window_start -> count
        self.watermark = None                    # frontier; unset until first event
        self.late_lane = []

    def ingest(self, event):
        ev_time = dt.datetime.fromisoformat(event["event_time"])
        if (self.watermark is not None
                and ev_time < self.watermark - dt.timedelta(seconds=LATENESS_SEC)):
            self.late_lane.append(event)         # too late — backfill later
            return
        self.counts[window_start(ev_time)] += 1
        if self.watermark is None or ev_time > self.watermark:
            self.watermark = ev_time             # advance frontier

    def closed_windows(self):
        if self.watermark is None:
            return {}
        cutoff = self.watermark - dt.timedelta(seconds=LATENESS_SEC)
        return {ws: c for ws, c in self.counts.items()
                if ws + dt.timedelta(seconds=WINDOW_SEC) <= cutoff}

events = [
    {"event_time": "2026-04-25T14:00:30+00:00", "user": "u1"},
    {"event_time": "2026-04-25T14:02:11+00:00", "user": "u2"},
    {"event_time": "2026-04-25T14:04:50+00:00", "user": "u3"},
    {"event_time": "2026-04-25T14:05:02+00:00", "user": "u4"},
    {"event_time": "2026-04-25T14:03:01+00:00", "user": "u5"},   # 2 min late
    {"event_time": "2026-04-25T14:08:30+00:00", "user": "u6"},
    {"event_time": "2026-04-25T13:58:45+00:00", "user": "u7"},   # very late
]

agg = WindowedAggregator()
for e in events:
    agg.ingest(e)
print("watermark:", agg.watermark.isoformat())
print("closed:", {k.isoformat(): v for k, v in agg.closed_windows().items()})
print("late_lane:", [e["event_time"] for e in agg.late_lane])
# Output:
# watermark: 2026-04-25T14:08:30+00:00
# closed: {'2026-04-25T14:00:00+00:00': 3}
# late_lane: ['2026-04-25T14:03:01+00:00', '2026-04-25T13:58:45+00:00']

Three lines carry the whole design. The comparison ev_time < self.watermark - dt.timedelta(seconds=LATENESS_SEC) is the watermark contract — events earlier than the frontier minus the allowed lateness are explicitly diverted. Note that it diverts u5 as well as u7: a 2-minute-late event is "too late" under a 60-second budget, which is exactly why the budget is a design choice. self.late_lane.append(event) is the side-output that becomes a backfill's input — these rows are not lost, they are queued for later reconciliation. Advancing the watermark to the maximum event-time seen is the simplest advancement strategy: track the max, ignore restarts. Production systems use a percentile-based watermark instead (advance to the 99th-percentile event time among recent arrivals, not the max) to stop a single corrupt event from racing the frontier ahead of legitimate live events.

Why a percentile beats the max: a single event with a bug-corrupted timestamp 6 hours in the future will instantly close every window in the next 6 hours' worth of pipelines if you advance to the max. A percentile-based watermark — advance to p99(event_time) of the last 10,000 events — clamps the impact of any single bad event to roughly 1% of throughput. The cost is 60 seconds of warmup before the watermark stabilises after a restart; the win is that one bad event from a misconfigured app version doesn't poison the global frontier.
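A percentile watermark fits in a dozen lines. This is a sketch of the strategy just described, not any engine's actual API (`PercentileWatermark` and its method names are invented for illustration):

```python
import collections

class PercentileWatermark:
    """Advance the frontier to roughly the p99 event time of recent
    arrivals, so one corrupt future timestamp can't drag the
    watermark hours ahead."""

    def __init__(self, window=10_000, pct=99):
        self.recent = collections.deque(maxlen=window)  # rolling sample
        self.pct = pct

    def observe(self, event_time_sec):
        self.recent.append(event_time_sec)

    def current(self):
        xs = sorted(self.recent)
        idx = max(0, int(len(xs) * self.pct / 100) - 1)
        return xs[idx]

wm = PercentileWatermark(window=100)
for t in range(1000, 1100):        # 100 well-behaved events, 1 s apart
    wm.observe(t)
wm.observe(1000 + 6 * 3600)        # one timestamp ~6 hours in the future
print(wm.current())                # 1099 — the corrupt event is clamped out
```

A max-based watermark would jump to 22600 here and close six hours of windows on the strength of one bad timestamp; the percentile version stays at 1099, because a single outlier cannot occupy the top 1% of a 100-event sample on its own. The sort on every call is the naive part — production implementations keep a histogram or t-digest instead.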

A practical detail many tutorials skip: the watermark is per-partition in real production engines, not global. Flink tracks a watermark per input channel (per source split or Kafka partition), and an operator's effective watermark is the minimum across all upstream channels. This is why a single slow Kafka partition holds back the whole pipeline — the operator can't advance past the slowest channel without admitting late events from that channel. The fix in production is either to repartition (rebalance the load), to mark the slow channel as "idle" so the watermark ignores it (withIdleness() in Flink), or to accept the lateness and budget for it. Most teams discover this on day one of their first stream-join: the join holds state indefinitely because one side's watermark is stuck, and the operator's RocksDB state store grows until the JVM dies. The lesson is that watermarks are a distributed primitive, not a single number, and the engine's documentation on idleness handling is not optional reading.

The four backfill shapes

Once you accept that late data exists, the question becomes how you reconcile it. Production pipelines settle into one of four backfill shapes, and most data teams run two or three of them in parallel for different SLAs.

Shape 1: window re-emit on late event

The streaming primitive. When a late event lands in the side-output, re-aggregate the affected window and emit a corrected output. The destination treats the corrected output as an upsert keyed by (window_start, dimension_keys). Apache Beam, Flink, and Spark Structured Streaming all support this natively as "allowed lateness" plus "trigger on late firing".

The good: corrections happen automatically, near-real-time, no operator action needed. The bad: every late event is a write to the warehouse — at Indian fintech scale (50k events/sec with 0.5% lateness), that's 250 corrected windows per second of warehouse upserts on top of the primary load. The tipping point where this gets expensive is around 1% lateness on aggregations with high-cardinality keys.
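The upsert semantics are the whole trick. A minimal sketch — `warehouse` here is a stand-in dict keyed by window start (the real key would be (window_start, dimension_keys), and the real sink a MERGE/upsert into the warehouse):

```python
warehouse = {}        # window_start -> count; stand-in for the warehouse table
window_events = {}    # window_start -> events seen so far (operator state)

def emit(window_start):
    # Upsert: the latest firing for a window is the canonical answer,
    # so writing it twice produces the same warehouse state.
    warehouse[window_start] = len(window_events[window_start])

def on_event(event, window_start):
    # Same path for on-time and late events: re-aggregate the affected
    # window and overwrite the previous emit.
    window_events.setdefault(window_start, []).append(event)
    emit(window_start)

on_event({"user": "u1"}, "14:00")   # first firing: count = 1
on_event({"user": "u2"}, "14:00")   # late firing: corrected count = 2
print(warehouse)                    # {'14:00': 2}
```

Each late event costs one warehouse write — which is exactly the scaling problem the paragraph above describes once lateness crosses ~1% on high-cardinality keys.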

Shape 2: scheduled rebuild of recent partitions

The batch-flavoured fix. Every hour, rebuild the last 4 hours of warehouse partitions from the canonical event log (S3 or the Kafka topic with long retention). The rebuild is idempotent — the same input produces the same output, and the destination MERGE absorbs duplicates — so partial rebuilds and overlapping rebuilds are safe.

The good: a single nightly job catches everything that the streaming watermark missed, no per-event accounting needed. The bad: the warehouse spends a non-trivial fraction of its compute reprocessing the same data — ZerodhaDB-style end-of-day reconciliation jobs typically use 30–40% of the warehouse's nightly compute on rebuilds. The freshness window between "watermark closed" and "rebuild ran" is when dashboards are visibly wrong.

A subtle-but-common Shape-2 gotcha: the rebuild must read from the canonical event log (S3 archive, Kafka with extended retention, the source database with point-in-time recovery), not from the warehouse's own raw layer. If the warehouse's raw layer is itself the result of a streaming write, then it inherits whatever lateness the streaming layer had — rebuilding from raw means re-deriving from the same potentially-incomplete data, and the corrections never converge. Razorpay's runbook is explicit: the daily rebuild reads from the S3 event archive (written by a separate Kinesis Firehose with its own watermark), not from the ClickHouse events_raw table. The two diverge by 0.07% on a normal day and by several percent on incident days, and the S3 path is always the one that wins.
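The idempotence claim is easy to demonstrate in miniature. A sketch of a Shape-2 rebuild — `event_log` stands in for the canonical S3 archive, `warehouse` for the partitioned destination, and replacing the dict entry whole stands in for an atomic partition overwrite:

```python
import collections

def rebuild(event_log, warehouse, partitions):
    """Recompute the named partitions from the canonical log and
    overwrite them whole — never append. Same input, same output,
    so overlapping or repeated reruns are safe."""
    fresh = collections.Counter(
        e["dt"] for e in event_log if e["dt"] in partitions)
    for dt_key in partitions:
        warehouse[dt_key] = fresh[dt_key]

log = [{"dt": "2026-04-23"}] * 3 + [{"dt": "2026-04-24"}] * 5
wh = {"2026-04-23": 2, "2026-04-24": 5}   # 04-23 is stale: a late event is missing
rebuild(log, wh, {"2026-04-23", "2026-04-24"})
rebuild(log, wh, {"2026-04-23"})          # rerun: same result, no double count
print(wh)                                 # {'2026-04-23': 3, '2026-04-24': 5}
```

The second `rebuild` call is the fat-fingered rerun, and it changes nothing — that is the property an append-mode backfill loses, because each rerun would add another 3 rows to the 04-23 partition.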

Shape 3: full backfill on schema change or bug fix

The catastrophic fix. A bug in the ingestion code mis-classified payment_method = "upi" as "unknown" for 9 days. After the fix is deployed, every row from the last 9 days needs to be re-derived. The backfill job re-reads the event log from 9 days ago, re-runs the corrected logic, and overwrites the warehouse partitions for those 9 days.

The good: a clean reset that doesn't require deciding which rows are "correct" and which need fixing — everything in the affected range gets recomputed from source. The bad: it's expensive (9 days of compute), it needs careful coordination with downstream consumers (their dashboards will visibly flip during the rebuild), and it requires the source event log to be retained that far back. Kafka retention of 7 days breaks this pattern; S3-as-event-log with infinite retention enables it.

A real Shape-3 example: in 2023, Zerodha rolled a code change that mis-tagged a class of mutual fund order amendments as "new orders" instead of "amendments" for 6 days. The fix was a one-line change. The backfill that followed reprocessed 6 days of 2.4 crore order events from the S3 archive, rebuilt 6 partitions of the orders_daily Iceberg table, and ran in 4 hours on a 32-node Spark cluster. The total cost was ₹2.1 lakh in compute. The cost of not having the S3 archive — and instead trying to reconstruct from the live OLTP database — was estimated by the team at "weeks of work and probably impossible because the OLTP had been compacted". The S3 archive paid for itself the first time it was needed.

Shape 4: two-tier (lambda) — fast wrong + slow right

The architectural fix. Run two pipelines in parallel — a streaming pipeline that produces fast-but-approximate results into a "speed layer" table, and a batch pipeline that produces slow-but-correct results into a "batch layer" table. The serving layer reads from both and prefers batch results where they exist, falling back to speed where batch hasn't caught up.

The good: dashboards are always either fresh (speed layer) or correct (batch layer), and the union is what users want. The bad: you've built two pipelines with two codebases for one logical computation, and keeping them in sync is its own operational burden. The lambda architecture is what most Indian fintechs ran from 2016–2021 before consolidating onto unified streaming engines (Flink, Beam) that compute both views from one codebase. Build 10 covers the unified approach.

[Figure: Four backfill shapes side by side — a 2×2 grid, each cell with a small illustrative diagram and a one-line cost. Shape 1, window re-emit on late event (cost: a warehouse upsert per late event). Shape 2, scheduled rebuild (cost: ~30% of nightly warehouse compute). Shape 3, full backfill on bug fix, spanning "9 days ago" to "today" (cost: full reprocess; needs the S3 event log). Shape 4, two-tier lambda — speed layer (fast) plus batch layer (right) (cost: two codebases for one computation).]
The four shapes a backfill takes in production. Most teams run shapes 1 and 2 together; shape 3 fires only on bugs; shape 4 is the historical compromise that unified streaming has mostly replaced.

A subtler corollary: the four shapes are not mutually exclusive. Most production data platforms run shapes 1, 2, and 3 concurrently, with shape 4 as the historical layer that newer pipelines avoid. The reason teams reach for "just one backfill" is operational simplicity — a single nightly rebuild is easier to reason about than a streaming layer plus a daily rebuild plus an incident-driven full-replay. The reason that simplicity costs money is that a single backfill has to be conservative enough to catch every shape's failure mode, which means either expensive (nightly rebuild of the last 7 days, every night) or wrong (rebuild only the last day, and lose anything later). The mature engineering choice is to admit that each shape has a different cost-vs-correctness curve, and to compose them — let Shape 1 catch the 99% case cheaply, Shape 2 catch the trailing 0.9%, Shape 3 catch the bug-driven cliff, and the bitemporal model from Going Deeper catch the audit-trail requirement.

A backfill in motion: the BookMyShow IPL final

The numbers from the lead are not hypothetical. On 26 May 2024, during the IPL final, BookMyShow's ticketing pipeline saw a 22-minute window where 1.4 crore ticket-view events were buffered on phones across India because of a Cloudflare edge issue in ap-south-1. When the edge recovered, the events flushed in waves over 11 minutes — every event-time bucket from 19:32 to 19:54 IST received a 30–40% correction.

The pipeline's response, by design:

  1. The streaming layer (Flink) had lateness = 60 seconds. Events arriving more than 60 seconds late were diverted to a Kinesis side-stream — that captured 1.31 crore of the 1.4 crore.
  2. A standby backfill orchestrator subscribed to the side-stream, batched the events by event-time hour, and wrote partition-overwrite jobs to ClickHouse. The first overwrite landed at 19:55, 8 minutes after the spike started; the last at 20:14, 22 minutes later.
  3. The customer-facing seat-availability dashboard (which reads ClickHouse) showed correct numbers within 30 minutes of the original event time. Internal SRE dashboards showed the watermark lag spike as an explicit signal — not an alert, because the backfill was working — and the post-incident review noted that the design point was correct but the lateness budget could be widened to 5 minutes for the IPL window to reduce correction-emit churn.

The 9 lakh events that arrived later than the side-stream's retention (4 hours) were caught by the next morning's daily rebuild from the S3 event log — Shape 2 catching what Shape 1 missed. Net data loss: zero. Net dashboard wrongness: visible for 30 minutes during the incident, fully reconciled by the next day. The cost of the 30-minute visible-wrong window is the cost of choosing real-time dashboards over batch ones; the team's published view is that it's the right trade for the product.

Designing a backfill that doesn't double-count

Whichever shape you pick, the backfill must be idempotent with respect to the destination. Running it twice — because someone fat-fingered the trigger, or because the orchestrator retried after a transient failure — must produce the same warehouse state as running it once.

Three primitives make this work, all already covered in earlier chapters but worth restating in the backfill context:

Deterministic partitioning. The destination warehouse is partitioned by event-date or event-hour. The backfill writes to the partitions that correspond to its event-time range, and only those. A backfill for 2026-04-23 touches dt=2026-04-23 and nothing else.

Overwrite, not append. The backfill replaces the affected partitions atomically rather than inserting new rows. In Iceberg/Delta, this is INSERT OVERWRITE PARTITION (dt='2026-04-23'). In a vanilla Postgres warehouse, it's a transaction that deletes the old partition and inserts the new one. Append-mode backfills double-count silently and are the single most common source of "the dashboard is now showing 2× the real number" incidents.

Forward-only processing. The backfill consumes from the source event log in event-time order, and the source's retention is long enough to cover the backfill window. If your S3 event-log retention is 30 days, you can backfill the last 30 days; you cannot backfill day 31 without re-deriving it from a different source.

The combination — partitioned writes, atomic overwrite, source retention longer than the backfill window — is the contract that makes "rerun the job for last Tuesday" a single command rather than a 4-hour incident.

Why atomic overwrite is non-negotiable: a backfill that runs in two phases — DELETE FROM warehouse WHERE dt='2026-04-23' followed by INSERT INTO warehouse SELECT ... WHERE dt='2026-04-23' — has a window between the two statements where the partition is empty. If a dashboard query lands in that window, it returns zero rows. Worse, if the INSERT crashes halfway, you have a partition that's half-populated and no signal of what's missing. Atomic overwrite (Iceberg's INSERT OVERWRITE, Delta's replaceWhere, Postgres's transactional DELETE + INSERT inside a BEGIN/COMMIT) makes the swap invisible to readers — the partition is either the old version or the new version, never empty and never half-built. Backfills without atomic overwrite are why teams develop ritual fear of running them; backfills with atomic overwrite are a daily operation that nobody flinches at.
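The transactional DELETE + INSERT pattern can be shown end to end with sqlite3 from the standard library as a stand-in warehouse — a sketch, not a recommendation to run your warehouse on SQLite. The `with conn:` block is the BEGIN/COMMIT: both statements commit together or roll back together, so readers never observe the empty-partition window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (dt TEXT, user_id TEXT)")
conn.execute("INSERT INTO clicks VALUES ('2026-04-23', 'u1')")  # stale partition
conn.commit()

# The corrected partition contents, re-derived from the event log.
new_rows = [("2026-04-23", "u1"), ("2026-04-23", "u2")]

with conn:   # one transaction: DELETE and INSERT commit atomically
    conn.execute("DELETE FROM clicks WHERE dt = '2026-04-23'")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", new_rows)

count = conn.execute(
    "SELECT COUNT(*) FROM clicks WHERE dt = '2026-04-23'").fetchone()[0]
print(count)   # 2 — readers see either the old partition or this, never neither
```

If the `executemany` raises mid-way, the context manager rolls the whole transaction back and the old partition survives intact — the half-populated-partition failure mode described above cannot occur.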

The other contract that breaks if you skip overwrite: downstream consumers that compute their own aggregations from the partition. A marketing model that rolls up clicks_per_partner_per_day from the warehouse will compute "0 clicks for partner X on 2026-04-23" during the empty-partition window, write that to its own table, and now even after the backfill finishes the marketing model holds a wrong row that nothing will refresh until its next scheduled run. Atomic overwrite at the source keeps these chains honest; non-atomic overwrite at the source poisons the whole downstream tree, and the team that owns the source is the team that gets blamed even though the bug is structurally distributed across every consumer.

A final design note before the confusion list: the chapter's recommendations assume your event log retains the original event payloads and timestamps. If your "event log" is actually a stream of pre-aggregated metrics (you're emitting clicks_per_minute instead of individual click events), you've already lost the ability to backfill correctly — the source for a Shape-2 rebuild has to be the raw events, because pre-aggregations bake in the watermark choice that produced them. The architectural rule that follows: archive raw events to S3 before any aggregation, with retention measured in years, not days. The cost is storage; the cost of not doing it is that every bug in any aggregation layer becomes irrecoverable. Build 6's Parquet/Iceberg coverage is partly about making this archive cheap enough that no team has to argue against it.

A corollary: the schema of the archived events must include both event_time and the ingest's processing_time as separate columns, plus the producer's client_id and producer_seq (or whatever the source emits as its monotonic identifier). Without these, the backfill cannot reconstruct what was late and what was on time, can't deduplicate replays, and can't distinguish a real correction from a duplicate. Many teams archive only the application payload and lose the metadata; the right discipline is to archive the envelope — the application payload plus a fixed set of system fields — and to treat the envelope schema as part of the data contract. Build 5's lineage and contracts coverage formalises this; for now, the rule is: store enough metadata that every backfill is a deterministic function of the archive.

Common confusions

Going deeper

The Beam model: triggers, lateness, and accumulating panes

Apache Beam's documentation formalises late data as a triple: (window function, trigger, allowed lateness). The window function decides which window each event belongs to. The trigger decides when to emit results from a window — the default is "at watermark", but you can also trigger on processing-time delay, on count thresholds, or on speculative early firings. The allowed lateness decides how long after the watermark the window stays open for late updates. Each late event triggers a re-firing, and the destination chooses whether to receive accumulating results (each firing is the running total) or delta results (each firing is the additional contribution since the last firing).

Most production warehouse sinks want accumulating results plus a partition overwrite, because that's idempotent: the latest firing for a window is the canonical answer, and writing it twice produces the same state. Delta semantics are useful when the sink itself is a stream (another Kafka topic, another stream processor downstream) and the consumer aggregates over time.
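The two semantics are easiest to see side by side. A sketch (the function names are illustrative, not Beam's API) of the same three firings of one window, rendered both ways:

```python
firings = [3, 2, 1]           # contributions that arrive over time:
                              # on-time pane, then two late panes

def accumulating(contribs):
    """Each firing is the running total — the latest one is canonical."""
    total, out = 0, []
    for c in contribs:
        total += c
        out.append(total)
    return out

def delta(contribs):
    """Each firing is only the new contribution since the last firing."""
    return list(contribs)

print(accumulating(firings))  # [3, 5, 6] — overwrite the sink with the last value
print(delta(firings))         # [3, 2, 1] — the consumer must sum over time
```

With accumulating panes, replaying the last firing is harmless (write 6, write 6 again, still 6); with delta panes, a replayed firing double-counts unless the downstream consumer deduplicates — which is why the warehouse sink wants accumulating and a downstream stream can live with delta.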

Watermark heuristics in PyFlink and Spark

PyFlink's WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(60)) sets a heuristic that advances the watermark to max(event_time) - 60s. This is the simplest "fixed lateness budget" — adequate for most uniform-rate sources. The more nuanced strategies — a per-source override of the bounded-out-of-orderness duration, or for_monotonous_timestamps for sources you trust to never reorder — let you tighten the budget where the source is well-behaved.

Spark Structured Streaming's withWatermark("event_time", "1 minute") is the equivalent knob. The internal mechanics differ — Spark advances the watermark at micro-batch boundaries, Flink emits watermarks continuously as events flow — and so does late-data handling: both engines drop events behind the watermark by default, but Flink can hold windows open for corrections with an explicit allowed lateness (plus a side-output tag), while Spark simply discards them, leaving the late lane to be built by hand. The most common production bug across both is forgetting that the watermark only advances when events are arriving — a quiet hour in a nightly pipeline leaves windows open longer than the developer expected, holding state in RocksDB/state-store memory until traffic resumes.

The Razorpay nightly reconciliation: a real backfill cadence

Razorpay's payments analytics pipeline runs three concurrent backfill flavours in production. The streaming layer (Flink on Kinesis) handles 60-second-late events via Shape 1 — re-emit on late firing — and writes corrections to ClickHouse with (date, merchant_id, payment_method) upserts. A scheduled batch job (Shape 2) at 02:00 IST every day rebuilds the previous day's payments_daily partition from the S3 event log, catching anything later than the streaming watermark. A weekly bug-fix-driven backfill (Shape 3) is triggered manually by the platform team when a code change is deployed that affects historical interpretation; the platform team's runbook explicitly schedules these for Tuesday 02:00 IST to align with the lowest-load window.

The team's published numbers (2024 talk at Data Engineering Summit India): streaming layer corrects ~0.4% of records, the daily rebuild corrects an additional ~0.07%, and the bug-fix backfills account for the remaining ~0.001% — but those 0.001% are usually the most-financially-material rows because they correlate with logic bugs. The lesson the team published: build all three shapes from day one. Don't bolt on Shape 3 during the incident.

Monitoring lateness: the three signals that matter

Lateness is invisible until you measure it. Three signals let you see it before finance does.

Watermark lag: the gap between current wall-clock time and the watermark. If now() - watermark is steadily under 60 seconds, the pipeline is keeping up with its lateness budget. If it's growing, the source is producing events faster than the processor can advance the frontier — usually a backpressure or skew problem. Plot it as a 5-minute rolling average; alert if it exceeds the budget for more than 2 consecutive intervals.

Side-output rate: the number of events diverted to the late-lane per minute. A baseline of 0.1–1% of total throughput is normal in healthy pipelines. A sudden 10× spike means the source is emitting a flush of buffered events (the Swiggy-tunnel case), and the next batch backfill should expect more work. A sustained climb above 5% means the lateness budget is too tight; widen it or accept that backfills will dominate.

Correction-emit rate (Shape 1 only): the number of late firings per minute on closed windows. This is the warehouse's view of the streaming corrections. If correction-emit rate plus the daily rebuild's row count drifts up over weeks, the pipeline is silently shifting from "fast right" to "fast wrong, eventually right" — a sign the source's behaviour has changed and the design point needs revisiting.

Indian fintechs typically chart all three on one Grafana dashboard with side-output rate and watermark lag side by side, and a 30-day rolling correction-emit trend underneath. The dashboard's purpose is not to alert on every spike — that creates page fatigue. Its purpose is to make the slow drift visible, so the team adjusts the design point in a planning meeting rather than during an incident.
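The alert thresholds described above are simple enough to express directly. A hypothetical sketch — `watermark_lag_alert` and `side_output_pct` are invented names, and the thresholds mirror the text (alert when lag exceeds the budget for 2 consecutive intervals; 0.1–1% side-output is baseline):

```python
BUDGET_SEC = 60   # the pipeline's lateness budget

def watermark_lag_alert(lag_samples_sec, budget=BUDGET_SEC, consecutive=2):
    """Fire only when now() - watermark exceeds the budget for N
    consecutive intervals — a single spike is normal, a run is not."""
    run = 0
    for lag in lag_samples_sec:
        run = run + 1 if lag > budget else 0
        if run >= consecutive:
            return True
    return False

def side_output_pct(late_events, total_events):
    """Late-lane share of throughput, as a percentage."""
    return 100.0 * late_events / total_events

print(watermark_lag_alert([40, 70, 45, 80, 90]))  # True: two >60 s in a row
print(side_output_pct(500, 100_000))              # 0.5 — inside the 0.1–1% baseline
```

The "consecutive intervals" rule is what keeps this from paging on every flush of a buffered phone; the side-output percentage is the number you trend over 30 days rather than alert on.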

Why the bitemporal model belongs in this conversation

A row's (event_time, processing_time) pair is a bitemporal coordinate — valid_time (when it was true in the world) and transaction_time (when the system saw it). Late data is the unavoidable consequence of the two not being equal. A bitemporal warehouse stores both columns explicitly and lets queries answer either "what did the world look like at 14:00, given everything we know now?" (event-time slice) or "what did the system know at 14:05 about the world up to 14:00?" (processing-time-bounded query, useful for audits and reproducing yesterday's reports).

Most production warehouses store only event_time and pretend processing_time doesn't exist; they then run backfills to keep the event-time view current. A bitemporal warehouse stores both and runs no backfills at all — every query specifies which time it cares about, and corrections appear as additional rows with new processing_time values rather than overwrites. The cost is 2× storage and significant query complexity. The win is auditability: regulators (RBI for fintech, SEBI for trading) increasingly ask for "as-of" reports that bitemporal storage answers natively. Build 12 covers the storage formats (Iceberg, Delta) that make this practical.
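The two query styles can be made concrete with a toy in-memory table. The column names (event_time, processing_time) match the text; the Fact type, the helper functions, and the example rows (which reuse Dipti's 14:00–14:05 click counts) are invented for illustration — a real bitemporal warehouse would express the same filters in SQL over an append-only table.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    key: str
    value: int
    event_time: float        # valid_time: when it was true in the world
    processing_time: float   # transaction_time: when the system saw it

def event_time_slice(rows, t_event):
    """'What did the world look like up to t_event, given everything we
    know now?' -- the latest processing_time wins per key."""
    latest = {}
    for r in rows:
        if r.event_time <= t_event:
            cur = latest.get(r.key)
            if cur is None or r.processing_time > cur.processing_time:
                latest[r.key] = r
    return latest

def as_of(rows, t_event, t_known):
    """'What did the system know at t_known about the world up to t_event?'
    Corrections that arrived after t_known are invisible -- this is the
    query that reproduces yesterday's report for an audit."""
    visible = [r for r in rows if r.processing_time <= t_known]
    return event_time_slice(visible, t_event)

# The late correction appears as a new row, never an overwrite:
rows = [
    Fact("clicks_14:00", 218_400, event_time=14.05, processing_time=14.05),
    Fact("clicks_14:00", 365_613, event_time=14.05, processing_time=14.08),
]
```

event_time_slice(rows, 14.05) returns the corrected 365,613; as_of(rows, 14.05, 14.06) faithfully reproduces the dashboard's original, wrong 218,400 — which is precisely the auditability win.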

Where this leads next

Build 8 will revisit watermarks as a stream-processing primitive with formal triggers, accumulating vs delta semantics, and stateful joins across watermarked streams. Build 10 will revisit backfills as the unified-batch-stream problem — Apache Beam's framing is that batch and streaming are the same computation parameterised by lateness, and a single codebase can produce both views. The current chapter is the polling-pipeline operator's view of the same shape; later chapters generalise it to the streaming and unified worlds.

References

  1. Tyler Akidau et al., The Dataflow Model — the foundational paper that formalised event time, watermarks, triggers, and lateness for stream processing.
  2. Apache Beam: streaming concepts — the practitioner's reference for watermarks, allowed lateness, and accumulating panes.
  3. Apache Flink: watermark generation — Flink's implementation of the Beam model, including the percentile-based watermark heuristics referenced in this chapter.
  4. Spark Structured Streaming: watermarking — Spark's micro-batch view of the same primitives.
  5. Martin Kleppmann, Designing Data-Intensive Applications, Chapter 11 — the chapter on stream processing introduces watermarks as the unifying abstraction across batch and stream.
  6. Razorpay engineering blog: payments analytics at scale — the team's published account of the three-shape backfill strategy summarised in §"Going deeper".
  7. Cursors, updated_at columns, and their lies — the previous chapter, where the same event-time vs processing-time gap appears as the four "lies" of an updated_at cursor.
  8. Confluent: late data and stream processing trade-offs — vendor-leaning but technically clean treatment of the lateness budget choice.

The honest summary: late data exists because the world is a network of buffers and clocks that you don't control, and your pipeline's only choices are which lateness budget to commit to and which backfill shape to run when events miss it. There is no design that eliminates late data; there are only designs that make backfills cheap and predictable instead of expensive and surprising. The senior data engineer's job is to budget for late data on day one — pick the watermark, build the side-output lane, schedule the rebuild, retain the raw events in S3 long enough to recover from a bug. The team that does this ships a pipeline that survives Diwali sales, IPL finals, and quarterly reconciliations without 03:00 incidents. The team that doesn't will discover all four backfill shapes anyway, just one production failure at a time.

A practical exercise: pull up your team's main streaming pipeline and answer three questions. What is its allowed lateness? What happens to events later than that? When was the last full backfill, and how long did it take? If any of those answers is "I don't know", the chapter exists so the meeting where you find out is in design review, not at 03:00 with the dashboard wrong.

The follow-up exercise, once those three answers are clear: instrument the watermark-lag, side-output-rate, and correction-emit-rate dashboards from §"Monitoring lateness". Run them for a week. The team that does this discovers, almost without exception, that one of the three signals is already trending in a direction nobody has noticed — and that the design point set six months ago no longer matches the source's behaviour today. The dashboards exist not to flag incidents but to make the slow drift between design and reality visible early enough to fix it deliberately.