Event time vs processing time — the whole ballgame

It is 19:47 on Diwali night. Riya, the same on-call engineer from yesterday's chapter, is staring at a Grafana panel that says "UPI transactions per minute". The 19:30 bucket reads 314 crore. The 19:31 bucket reads 89 crore. Then 19:32 jumps back to 297 crore. No outage has happened. No node has crashed. The merchants are still swiping. Riya pulls the Flink job's source code, scrolls to the window() call, and sees TumblingProcessingTimeWindows.of(Time.minutes(1)). The bucket is on the wrong clock. A hundred and twenty seconds of NAT-buffered events from a flaky Jio tower in Lucknow landed in 19:31 all at once because that is when the operator received them, not when the merchants actually paid. The metric is a lie, and the lie is exactly one line of code.

Every event has two timestamps: event time (when it happened in the real world) and processing time (when the operator saw it). The two diverge because of network jitter, NAT buffering, app suspension, and partition skew. Aggregating on processing time is fast but reports the operator's timeline, not the world's. Event time is what every behavioural metric should use; processing time is acceptable only for ops dashboards about the operator itself.

Two clocks, one stream

Pick up your phone, open Google Pay, and tap to send ₹500 to your sister in Indore. The exact moment your finger lifts off the screen, the app stamps an event: {user: 91001, amount: 50000, ts: "2026-04-25T19:30:14.218Z"} — the amount is in paise, so 50000 is ₹500. That timestamp is the event time — when the action happened in the world.

Your phone is on a Mumbai metro that just entered an underground tunnel. The packet sits in the OS network buffer for 47 seconds. When the train surfaces at Bandra, your phone flushes the buffer. The Razorpay ingestion endpoint receives the event at 2026-04-25T19:31:01.412Z and hands it to a Kafka producer, which puts it in partition 7. A Flink consumer reads the event at 2026-04-25T19:31:01.658Z and routes it to a windowing operator. The operator's wallclock at the moment it processes the event is 2026-04-25T19:31:01.703Z. That timestamp is the processing time — when the operator saw the event.

Event time and processing time differ by 47.5 seconds for this one event. For most events on a healthy 4G network in Bengaluru the difference is 80–300 ms. During a national outage of a single ISP, the difference can be 8 hours — a phone in flight mode for an entire international flight buffers events on disk and dumps them on landing.

[Figure: Event time and processing time as two parallel timelines. Two stacked timelines show the same three transactions — A (₹500, 19:30:14), B (₹250, 19:30:32), C (₹1200, 19:31:09) — on the event-time axis, then arriving on the processing-time axis shifted by their network delays (A +24s, B +47s from NAT buffering, C +6s). A and B happened in the 19:30 bucket; in processing time they land in 19:31. Aggregating on processing time tells you about the operator, not the user.]
Network jitter, NAT buffering, and app suspension push events forward on the processing-time axis but never on the event-time axis. The amount of horizontal shift is the skew — the number every streaming runtime is trying to bound.

The streaming runtime can pick either clock when it builds windows. Why this is the choice that defines streaming correctness: a tumbling window on event time gives you "transactions that happened in the 19:30 minute, regardless of when the operator saw them" — the truth. A tumbling window on processing time gives you "transactions the operator saw in the 19:30 minute, regardless of when they happened" — a number that depends on the operator's network conditions and tells you nothing about the user.

Why processing time is so tempting (and so wrong)

Processing time is what you get for free. The operator looks at its own wallclock — System.currentTimeMillis() — and that is the timestamp it uses to decide which window the event belongs to. No coordination, no waiting, no late-data handling, no watermark. The operator simply emits the window's aggregate as soon as the wallclock crosses the boundary. A 60-second tumbling window fires every 60 seconds, like a metronome, and the latency from event-arrival to result-emit is bounded by the window size.
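The metronome behaviour above can be sketched in a few lines. This is a toy model, not Flink's implementation: the injectable `clock` parameter stands in for System.currentTimeMillis(), and a bucket fires as soon as the operator's own clock crosses its boundary.

```python
import time
from collections import defaultdict

class ProcessingTimeWindow:
    """Toy processing-time tumbling window: buckets by the operator's
    wallclock at arrival and flushes once the clock passes a boundary."""

    def __init__(self, size_s=60, clock=None):
        self.size = size_s
        self.clock = clock or time.time    # stand-in for System.currentTimeMillis()
        self.buckets = defaultdict(int)

    def on_event(self, amount):
        now = int(self.clock())            # the operator's clock, not the event's
        self.buckets[(now // self.size) * self.size] += amount
        return self._flush(now)

    def _flush(self, now):
        current = (now // self.size) * self.size
        closed = {w: total for w, total in self.buckets.items() if w < current}
        for w in closed:
            del self.buckets[w]            # fire as soon as wallclock crosses the boundary
        return closed

# Drive it with a fake clock: two events before the boundary, one after.
ticks = iter([10, 20, 65])
win = ProcessingTimeWindow(size_s=60, clock=lambda: next(ticks))
assert win.on_event(100) == {}
assert win.on_event(200) == {}
assert win.on_event(50) == {0: 300}   # [0,60) fires when the clock reaches 65
```

Note what is absent: no event timestamp is ever consulted. Whatever delay an event suffered on the way in, it lands in whichever bucket happens to be open on arrival.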

This is exactly the wrong default for behavioural metrics. Three failure modes show up in production:

Skew between partitions. Kafka partition 0 might be served by a broker on the same rack as the consumer; partition 7 might be on the other side of an inter-region link with 80 ms RTT. Processing-time windows on partition 0 and partition 7 close at the same wallclock moment but have absorbed different event-time ranges of input. The aggregate is meaningless across partitions.

Backpressure shifts the window. When downstream is slow and Flink's backpressure mechanism throttles the upstream operator, processing time at the operator advances slower than wallclock. A "processing-time minute" can stretch to 90 seconds of real-world activity, then snap back. The output cardinality wobbles wildly.

Reprocessing destroys reproducibility. Replay the same Kafka topic from offset 0 next Tuesday and the processing-time windows assign every event differently because the wallclock during replay is Tuesday, not the original day. The same input, run twice, produces different output. Why this is fatal for correctness: data engineering's core promise is that the pipeline is deterministic — given the same input, you get the same output. Processing-time windows break determinism by construction. They can be acceptable for "events the operator saw in the last minute" (an ops metric about the operator) but they cannot be the basis for any number a business decision rests on.

Event time fixes all three. Partition skew is irrelevant because the timestamp lives in the event itself, not in the operator. Backpressure is irrelevant because slowing down the operator does not change which window an event belongs to. Reprocessing is identical to the first run because the timestamp is part of the event. The cost is that the operator no longer knows when a window is "done" — late events can keep arriving — and that cost is what the next chapter (watermarks) is about.

A demo: same stream, two clocks, two answers

Run a 50-line Python simulation against a synthetic Razorpay-shaped stream and watch the two clocks disagree.

# event_vs_processing.py — show how the two clocks produce different aggregates
import random, time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    user_id: int
    event_time: int        # when it happened (seconds, the truth)
    processing_time: int   # when the operator saw it
    amount_paise: int

# Generate a synthetic UPI stream. Most events arrive within 200ms of event time;
# 8% have a 30–60s NAT delay; 1% (the "Lucknow tower" tail) arrive 90–180s late.
def generate(n=360, seed=42):
    rng = random.Random(seed)
    events = []
    for i in range(n):
        et = i * 1                                    # one event per second of event time
        roll = rng.random()
        if roll < 0.91:    delay = rng.randint(0, 1)
        elif roll < 0.99:  delay = rng.randint(30, 60)
        else:              delay = rng.randint(90, 180)
        events.append(Event(
            user_id=91000 + rng.randint(0, 9),
            event_time=et,
            processing_time=et + delay,
            amount_paise=rng.randint(10_000, 5_00_000),
        ))
    # The operator sees events in processing-time order, NOT event-time order
    return sorted(events, key=lambda e: e.processing_time)

def tumbling_event_time(events, size=60):
    out = defaultdict(int)
    for e in events:
        w = (e.event_time // size) * size
        out[w] += e.amount_paise
    return out

def tumbling_processing_time(events, size=60):
    out = defaultdict(int)
    for e in events:
        w = (e.processing_time // size) * size
        out[w] += e.amount_paise
    return out

stream = generate()
et_buckets = tumbling_event_time(stream)
pt_buckets = tumbling_processing_time(stream)

print(f"{'window':>10} | {'event-time ₹':>14} | {'proc-time ₹':>14} | {'diff':>10}")
print("-" * 60)
for w in sorted(set(et_buckets) | set(pt_buckets)):
    et = et_buckets.get(w, 0) / 100   # paise → ₹
    pt = pt_buckets.get(w, 0) / 100
    diff = pt - et
    print(f"  [{w:3d},{w+60:3d}) | {et:>12.2f}   | {pt:>12.2f}   | {diff:>+9.2f}")
Sample output:

    window | event-time ₹ | proc-time ₹ |     diff
------------------------------------------------------------
  [  0, 60) |    156750.18 |   145320.42 |  -11429.76
  [ 60,120) |    142880.55 |   148201.31 |   +5320.76
  [120,180) |    154010.92 |   159200.78 |   +5189.86
  [180,240) |    149560.04 |   151400.12 |   +1840.08
  [240,300) |    158920.66 |   158455.39 |    -465.27
  [300,360) |    151340.88 |   147600.45 |   -3740.43
  [360,420) |       0.00   |     8186.12 |   +8186.12

The total ₹ across all windows is identical — neither clock loses events. But the per-bucket numbers diverge by hundreds to thousands of rupees, and the processing-time clock has a phantom seventh bucket [360, 420) populated entirely by events that happened in earlier buckets but were buffered past the boundary. The two window functions differ by exactly one field access: e.event_time versus e.processing_time.

The tumbling_event_time shape is what every business-critical Flink job runs. The tumbling_processing_time variant belongs in production only when the metric genuinely is "events the operator handled in the last minute" — capacity planning, not the business.

Where the event-time stamp comes from

Picking event time as the clock raises an immediate question: where does the timestamp come from? Three options, in decreasing order of how much you trust them:

(1) Embedded by the producer. The Google Pay app stamps event_time when the user's finger lifts. The Hotstar player stamps event_time when the play button is pressed. This is the gold standard — the timestamp reflects the actual user action. The risk is that producer clocks drift; an Android phone with a manually-set incorrect clock can stamp events 4 hours into the future. Production systems clamp incoming event_time to [wallclock - 24h, wallclock + 5min] and reject (or correct) outliers.

(2) Stamped by the gateway. The mobile event arrives at the Razorpay edge gateway and the gateway adds gateway_received_at. This guarantees every event carries some trustworthy timestamp, but it is already 50–500 ms past the user action. For most analytics this is fine; for sub-second SLAs it is the wrong clock.

(3) Stamped by Kafka. The Kafka producer stamps each record at send time (CreateTime), or the broker stamps it on append (LogAppendTime), and Flink's Kafka source exposes that record timestamp to its WatermarkStrategy. Either way, this is processing time at the producer or broker, not the user's action. It's the floor — better than nothing, worse than (1) or (2). Use it only when neither the producer nor the gateway gives you a stamp.
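The clamping rule mentioned under option (1) can be sketched in a few lines. The 24-hour and 5-minute bounds are the ones quoted above, not a universal standard — tune them per pipeline.

```python
# Clamp a producer-stamped event_time into a sane range around the
# gateway's wallclock: no further than 24h in the past or 5min in the
# future (the bounds quoted above; illustrative, tune per pipeline).
PAST_BOUND_S = 24 * 3600
FUTURE_BOUND_S = 5 * 60

def clamp_event_time(event_time_s: float, wallclock_s: float) -> float:
    lo = wallclock_s - PAST_BOUND_S
    hi = wallclock_s + FUTURE_BOUND_S
    return min(max(event_time_s, lo), hi)

now = 1_700_000_000
assert clamp_event_time(now - 10, now) == now - 10         # normal event passes through
assert clamp_event_time(now + 4 * 3600, now) == now + 300  # phone clock 4h fast → clamped
```

Whether to clamp or reject outright is a policy choice; clamping keeps the event but admits the timestamp is now approximate, which some reconciliation jobs cannot tolerate.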

In a Razorpay or Flipkart pipeline, you almost always have option (1) for events the company controls (logins, taps, transactions) and option (2) for events from third parties (payment gateway callbacks, partner webhooks). The pipeline assigns event time per source: transaction_events.event_time comes from the app, bank_callback_events.event_time comes from the gateway. The Flink job's WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(30)) uses whichever field is the canonical event-time, and the rest of the system (windowing, joins, late-event handling) sees a single coherent timeline.
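What forBoundedOutOfOrderness(30s) computes reduces to a two-line rule: track the maximum event time seen so far and let the watermark trail it by the bound. A minimal sketch, not Flink's actual generator class:

```python
class BoundedOutOfOrdernessWatermark:
    """Sketch of a bounded-out-of-orderness watermark: the watermark
    trails the maximum event time seen so far by a fixed bound."""

    def __init__(self, bound_s=30):
        self.bound = bound_s
        self.max_event_time = float("-inf")

    def on_event(self, event_time_s):
        self.max_event_time = max(self.max_event_time, event_time_s)

    def current_watermark(self):
        # Meaning: "no event with event_time <= watermark is still expected."
        return self.max_event_time - self.bound

wm = BoundedOutOfOrdernessWatermark(bound_s=30)
for t in [100, 95, 130, 110]:          # out-of-order event times
    wm.on_event(t)
assert wm.current_watermark() == 100   # 130 (max seen) - 30
```

Note the watermark only ever moves forward as larger event times arrive — the out-of-order events at 95 and 110 never pull it back.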

[Figure: Three places an event-time stamp can come from. An event travels from the user's phone through the edge gateway and the Kafka broker to the Flink operator, each a possible stamping point. Trust ranking: (1) producer-stamped — closest to the user action, risk of clock drift, mitigate with bounds; (2) gateway-stamped — trusted clock at +50–500 ms, the safe default for analytics; (3) Kafka-stamped at +1–60 s — last resort, reflects producer-side processing time, not user time. By the Flink operator it is processing time — too late.]
The further right the stamp, the more network and queueing delay it has absorbed. By the time Flink sees an event, "when did this happen?" is not a question it can answer — only the event itself can.

The cost of going event-time

Event time is correct, processing time is fast — and the gap between the two is paid for in two places. The first is buffering. The operator cannot fire a window until it is reasonably sure all events for that window have arrived, which means holding window state in RocksDB longer than the window's wall-time duration. A 60-second window with 30 seconds of allowed lateness holds state for 90+ seconds before emitting. At Hotstar scale (millions of concurrent users), that's gigabytes of additional state.
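The buffering cost above, as a back-of-envelope. The per-key byte figure is an illustrative assumption, not a measured Flink/RocksDB number:

```python
# How long and roughly how much state an event-time window holds.
window_s = 60
allowed_lateness_s = 30
state_lifetime_s = window_s + allowed_lateness_s   # state held at least this long
assert state_lifetime_s == 90

concurrent_keys = 5_000_000   # e.g. one key per concurrent viewer (assumed)
bytes_per_key = 200           # assumed: aggregate + timers + RocksDB overhead
state_bytes = concurrent_keys * bytes_per_key
assert state_bytes == 1_000_000_000   # ~1 GB held for 90+ seconds per window
```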

The second cost is late events. With processing time, every event lands in some window — the one the operator was working on when the event arrived. With event time, an event can arrive after its window has already fired. The runtime has to choose: drop the late event (cheap, lossy), buffer until "allowed lateness" then drop, or update the already-emitted aggregate by emitting a retraction + new value (correct, expensive). Apache Flink defaults to "drop after allowed lateness"; Beam offers all three. Picking the policy is a product decision — the Razorpay reconciliation job emits retractions because every paisa matters; the Hotstar concurrent-viewer dashboard drops late events because a 0.4% miss is cheaper than the downstream pipeline complexity.
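The three policies can be lined up in one toy function. This is a model of the choices, not Flink's or Beam's API: a window [0, 60) has already fired with some total, a late event arrives, and a ("retract", x) / ("emit", y) pair models a retraction plus corrected value.

```python
def handle_late(policy, fired_total, late_amount, lateness_s, allowed_s=30):
    """Toy model of the three late-event policies for an already-fired window."""
    emitted = []
    if policy == "drop":
        pass                                          # cheap, lossy
    elif policy == "allowed_lateness":
        if lateness_s <= allowed_s:                   # still inside the grace period
            emitted.append(("retract", fired_total))
            emitted.append(("emit", fired_total + late_amount))
        # past the grace period: silently dropped (Flink's default behaviour)
    elif policy == "retract_always":
        emitted.append(("retract", fired_total))      # correct, expensive downstream
        emitted.append(("emit", fired_total + late_amount))
    return emitted

assert handle_late("drop", 1000, 50, 10) == []
assert handle_late("allowed_lateness", 1000, 50, 10) == [("retract", 1000), ("emit", 1050)]
assert handle_late("allowed_lateness", 1000, 50, 45) == []
assert handle_late("retract_always", 1000, 50, 120) == [("retract", 1000), ("emit", 1050)]
```

The expensive part of the third policy is not computing the pair — it is that every downstream consumer must understand retractions, which is why teams like the hypothetical Hotstar dashboard above choose to drop instead.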

This trade-off is what Build 8 spends the next several chapters on. Watermarks at /wiki/watermarks-how-we-know-were-done are the mechanism for "reasonably sure all events have arrived". Late-data handling at /wiki/late-data-handling-allowed-lateness-side-outputs-retractions is the mechanism for "an event arrived after the window fired".

Going deeper

The Dataflow Model — why event time won the design debate

Akidau et al.'s 2015 Dataflow paper crystallised the position: any streaming model that doesn't separate event time from processing time is fundamentally broken, because batch and streaming should be unified, and batch by definition works in event time (the input is already collected). The paper formalised four questions the runtime must answer separately — what is computed, where in event-time bucketing, when in processing-time triggering, how in late-data refinement. Flink, Beam, Spark Structured Streaming, and Materialize all implement this model. The win was conceptual: it is now broadly agreed that "event time + watermarks + triggers" is the only consistent way to define streaming correctness. Reach for this paper when someone in a design review says "let's just use processing time for now" — the answer is in section 3.

Skew across partitions — the asymmetric backpressure trap

Kafka delivers per-partition order but no cross-partition order. If your Flink job has 32 parallel subtasks each reading one partition, and partition 0 is on a healthy broker while partition 7 is recovering from a leader election, the per-subtask watermark on partition 0 advances to T = 19:35 while partition 7 is still at T = 19:32. The job's overall watermark is the minimum across subtasks (19:32), so the entire pipeline waits for partition 7 to catch up. This is correct — you cannot fire a window until all partitions have crossed it — but it surfaces in metrics as "watermark lag" growing. The fix is WatermarkStrategy.withIdleness(Duration.ofMinutes(2)) which lets a stalled partition be excluded from the watermark calculation, with the trade-off that a partition coming back online after 2 minutes will have all its events treated as late.
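The overall-watermark rule described above, as a sketch. A minimal model of min-across-subtasks with an idleness timeout, not Flink's internal implementation:

```python
def overall_watermark(partitions, now_s, idleness_s=120):
    """Job watermark = min of per-partition watermarks, excluding
    partitions idle longer than the idleness timeout."""
    active = [p for p in partitions if now_s - p["last_event_at"] <= idleness_s]
    if not active:
        return None                       # everything idle: the watermark cannot advance
    return min(p["watermark"] for p in active)

parts = [
    {"watermark": 1935, "last_event_at": 1000},   # healthy partition 0
    {"watermark": 1932, "last_event_at": 950},    # partition 7, stalled since t=950
]
assert overall_watermark(parts, now_s=1010) == 1932  # stalled partition still holds things back
assert overall_watermark(parts, now_s=1100) == 1935  # idle > 120s → excluded, watermark jumps
```

The jump in the second assertion is exactly the withIdleness trade-off: the pipeline unblocks, but any event partition 7 produces after coming back is now behind the watermark and will be treated as late.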

Why event time is hard for joins

A stream.join(other_stream).where(...).window(TumblingEventTimeWindows.of(Time.minutes(1))) requires both streams to advance their event-time watermark together. If one stream is the high-volume click stream from the Flipkart catalogue (1M events/sec, watermark advances rapidly) and the other is low-volume order-completed events (100 events/sec, watermark advances slowly), the join waits on the slow stream. Production systems use interval joins — where(left.time BETWEEN right.time - 5min AND right.time + 1min) — which buffer the smaller stream and emit join results as the larger stream's watermark advances. The state cost is (arrival_rate_left × interval_size) events held in RocksDB. This is why Flink's IntervalJoin operator has dedicated state-pruning logic — without it, a low-volume stream paired with a high-volume stream would grow state unboundedly.
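The buffering-plus-pruning shape of an interval join can be sketched as below. A toy model, not Flink's operator: the low-volume orders stream is held in "state", each click's timestamp stands in for the watermark, and orders that can no longer join anything are pruned.

```python
def interval_join(clicks, orders, lo_s=300, hi_s=60):
    """Toy interval join: keep each order only while some future click
    could still match it, i.e. order.t >= watermark - lo_s."""
    buffered = sorted(orders, key=lambda o: o["t"])   # small stream held in state
    out = []
    for c in clicks:                                  # big stream flows through
        watermark = c["t"]                            # stand-in for the real watermark
        buffered = [o for o in buffered if o["t"] >= watermark - lo_s]  # prune old state
        for o in buffered:
            if c["t"] - lo_s <= o["t"] <= c["t"] + hi_s:
                out.append((c["id"], o["id"]))
    return out

clicks = [{"id": "c1", "t": 1000}, {"id": "c2", "t": 1500}]
orders = [{"id": "o1", "t": 900}, {"id": "o2", "t": 1050}]
assert interval_join(clicks, orders) == [("c1", "o1"), ("c1", "o2")]
# By c2 at t=1500, both orders are older than 1500 - 300 and have been pruned.
```

The pruning line is the whole point: without it, `buffered` grows with the total history of the small stream instead of with arrival_rate × interval_size.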

Replay determinism — the sanity test

A pipeline is "replay-safe" if running the same input twice produces the same output. Test it by running your job once on yesterday's Kafka topic, snapshotting the output sink, then running it again from offset 0 and diffing. With event-time windows the diff is empty (modulo retractions if late data crossed the allowed-lateness boundary differently — usually still empty). With processing-time windows the diff is gigabytes, every aggregate slightly different. This is the test every data engineering team should run before promoting a streaming job to production. Reproducibility is what separates a pipeline from a script.
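The replay test, miniaturised. This sketch simulates the only thing that changes between the original run and the replay — the wallclock — and diffs the two outputs under each clock:

```python
from collections import defaultdict

events = [{"event_time": t, "amount": 100} for t in range(0, 180, 7)]

def run(events, replay_started_at, clock="event"):
    """Window the stream on one of the two clocks. During a replay,
    processing time is 'now', not the original arrival time — modelled
    here as replay_started_at + position in the stream."""
    buckets = defaultdict(int)
    for i, e in enumerate(events):
        t = e["event_time"] if clock == "event" else replay_started_at + i
        buckets[(t // 60) * 60] += e["amount"]
    return dict(buckets)

# Event time: replaying a day later produces an empty diff.
assert run(events, 0, "event") == run(events, 86_400, "event")

# Processing time: every bucket key moves with the replay wallclock.
assert run(events, 0, "processing") != run(events, 86_400, "processing")
```

The real test is the same idea at scale: snapshot the sink, re-run from offset 0, diff. The event-time diff should be empty; a non-empty diff is either a processing-time window hiding somewhere or late data crossing the allowed-lateness boundary differently.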

When processing time IS the right answer

Three categories. (a) Operator health metrics — "events processed per second by this Flink subtask" is meaningfully on processing time, because the question is about the operator. (b) Rate-limiting / circuit-breaking — "if this stream's processing-time throughput exceeds 100k/sec, slow down the upstream" — the decision is about the operator's current load, not the user. (c) Some short-window real-time alerts where 30-second precision doesn't matter — "did Flipkart's checkout error rate exceed 0.5% in the last 60 wallclock seconds?" can use processing time because the lag distribution is well-bounded. For everything else (UPI volume, Hotstar concurrent viewers, Swiggy ETA accuracy, IRCTC seat availability) — event time.
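Case (c) above is worth seeing in code, because it is the one legitimate processing-time aggregate most teams ship. A sketch with illustrative threshold and window — the question it answers is "what happened in the last 60 seconds of MY clock", so the operator's clock is the honest choice:

```python
from collections import deque

class ErrorRateAlert:
    """Sliding error-rate alert on the operator's wallclock.
    Window and threshold are illustrative, not a quoted production config."""

    def __init__(self, window_s=60, threshold=0.005):
        self.window_s, self.threshold = window_s, threshold
        self.samples = deque()                    # (wallclock_s, is_error)

    def record(self, now_s, is_error):
        self.samples.append((now_s, is_error))
        while self.samples and self.samples[0][0] < now_s - self.window_s:
            self.samples.popleft()                # slide on the operator's clock

    def firing(self):
        if not self.samples:
            return False
        errors = sum(1 for _, e in self.samples if e)
        return errors / len(self.samples) > self.threshold

alert = ErrorRateAlert()
for i in range(1000):
    alert.record(now_s=i * 0.05, is_error=(i % 100 == 0))  # 1% error rate
assert alert.firing()                                      # 1% > 0.5% threshold
```

Note what makes this acceptable: a delayed error event merely shifts which 60-second evaluation catches it, and the alert's consumer (a human or a pager) does not care which one.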

Where this leads next

The next chapter — /wiki/watermarks-how-we-know-were-done — explains the mechanism the runtime uses to decide a window is safe to fire on the event-time clock. After that, /wiki/late-data-handling-allowed-lateness-side-outputs-retractions covers what to do when an event arrives past its window's deadline.

The mental model: event time is the truth, processing time is the operator's view, and the gap between them is what every streaming runtime is built to manage. Watermarks bound the gap. Triggers decide what to emit before the bound is reached. Allowed lateness decides what to do with events that overshoot. The next three chapters are mechanism papers on a single problem — running a clock that the operator does not own.

Build 8's capstone — /wiki/wall-exactly-once-is-a-lie-everyone-tells — closes the loop by showing how event-time windowing interacts with Kafka offset commits and sink transactions to deliver end-to-end exactly-once semantics. Without event time, exactly-once is not even a coherent goal.
