Windowing: tumbling, sliding, session

It is 19:30 on the night of a Royal Challengers Bengaluru match. Riya, on the Hotstar streaming-platform team, has a Flink job that ingests "user pressed play" events at 4 lakh per minute. The product manager has asked a question that sounds simple — "how many concurrent viewers right now?" — and Riya has spent two hours discovering it is not simple at all. SQL's COUNT(*) works on a table that ends. Her event stream does not end. To get a number she has to slice the stream into finite pieces, and the slicing decision is the entire problem. Tumbling, sliding, session — these are the three slicing strategies every streaming engine ships, and the choice between them is the difference between "viewers per minute" and "viewers per session" being a number anyone can act on.

Streams are unbounded; aggregates need finite inputs. Tumbling windows chop the timeline into non-overlapping buckets of fixed size — clean for periodic reporting. Sliding windows overlap so every event contributes to several windows — smoother metrics but N× the state. Session windows close after a per-key gap of inactivity — variable size, naturally aligned with user behaviour. Pick by what the question is, not by what the engine defaults to.

Why a stream needs windows in the first place

Consider the question "how many UPI transactions did PhonePe see today?" against a static Postgres table — SELECT COUNT(*) FROM transactions WHERE day = '2026-04-25'. The query runs, returns 314 crore, exits. The query engine knew when to stop because the table ended.

Now consider the same question against a Kafka topic. The topic does not end. New transactions are appended every millisecond. A naive COUNT(*) over the topic stream would never return — there is always one more event arriving. The aggregate has to fire on some signal that the question's answer is "ready to emit", and that signal cannot be "the input ended" because the input never ends.

A window is that signal. It is a rule that splits the unbounded stream into bounded substreams, each of which has a clear start and a clear end. When the end fires, the aggregate over that substream's events is emitted, and a new window starts collecting events for the next interval. The three strategies — tumbling, sliding, session — differ in how they decide where the boundaries land.

[Figure: Tumbling, sliding, and session windows over the same event stream. Three stacked timelines show identical event dots. Top, tumbling (size = 60 s, no overlap): non-overlapping segments W1 [0,60), W2 [60,120), W3 [120,180), W4 [180,240). Middle, sliding (size = 60 s, slide = 30 s): windows overlap by 30 s, so every event lands in two of them. Bottom, session (gap = 30 s): variable-width segments closed by 30 s+ inactivity gaps; session A has 4 events, session B has 3, session C has 1. Same input, three different aggregations: the product question decides which one is correct.]
Tumbling fires once per fixed bucket. Sliding fires N times per event (N = window/slide). Session fires once per per-key burst, and the boundaries are data-driven.

The decision is not aesthetic. It cascades into state size, output cardinality, and how late events get handled — all of which the runtime has to budget for. Why this is "the entire stateful streaming problem" in disguise: every stateful operator in your job is, underneath, accumulating state for some window. The window definition determines how much state is live at once, when state can be evicted, and how often the operator emits results downstream. A wrong choice here multiplies into 10× the cluster cost three months later.

Tumbling: the simplest cut

A tumbling window has a fixed size and the next window starts exactly when the previous one ends — [0, 60), [60, 120), [120, 180), ... for a 60-second window. Every event lands in exactly one window. The aggregate fires once per window, on a clean boundary.

This is what every dashboard you've ever seen runs on. "UPI transactions per minute", "errors per 5 minutes", "page-load p95 per hour". The boundaries are the same for every key in the stream — user 91, user 92, and user 93 all close their [60, 120) window at the same moment. That alignment is what makes tumbling cheap: the runtime can collapse all per-key state into a single emit when the window closes, and the downstream consumer (a TimescaleDB sink, a Pinot table, a CloudWatch metric) sees a tidy point per minute per key.

The cost is rigidity. A user who started watching the cricket at 18:59:55 has their session sliced into one event in [18:59, 19:00) and however many in [19:00, 19:01) — the two halves are emitted as separate rows even though they belong to the same continuous behaviour. Tumbling is right when the question is temporal (what happened in this clock minute?), wrong when the question is behavioural (what did this user do in their burst of activity?).
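The split is easy to see in two lines of Python, assuming the usual snap-to-bucket rule (the helper name is illustrative):

```python
def bucket(ts: int, size: int = 60) -> int:
    """Snap an event-time (seconds since midnight here) to its tumbling-bucket start."""
    return (ts // size) * size

# Play pressed at 18:59:55 and again at 19:00:05 (68395 s and 68405 s):
print(bucket(68395))  # 68340 -> the [18:59, 19:00) bucket
print(bucket(68405))  # 68400 -> the [19:00, 19:01) bucket
```

Ten seconds apart, two separate rows downstream: that is the rigidity.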

Sliding: smoother but N× the state

A sliding window has a size and a slide — (size=60s, slide=10s) means a new window starts every 10 seconds, lasts 60 seconds, and overlaps the previous five. An event at second 35 lands in the windows [0,60), [10,70), [20,80), [30,90) — four windows (only four because the stream starts at zero; in steady state every event lands in size/slide = 6 windows). The aggregate fires every slide interval per key.
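A quick way to check the assignment rule is to list every window start that contains a given timestamp (a sketch; `window_starts` is not an engine API):

```python
def window_starts(ts: int, size: int, slide: int) -> list[int]:
    """All window starts w (multiples of slide, w >= 0) with w <= ts < w + size."""
    earliest = ((ts - size + slide) // slide) * slide  # first window still containing ts
    return list(range(max(earliest, 0), ts + 1, slide))

print(window_starts(35, size=60, slide=10))  # [0, 10, 20, 30] near the stream start
print(window_starts(95, size=60, slide=10))  # [40, 50, 60, 70, 80, 90] in steady state
```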

The reason to pay this cost is smoothness. A "rolling 5-minute error rate" computed via tumbling updates only once per bucket, so a spike that starts mid-bucket stays invisible until the bucket closes — up to 5 minutes later. The same metric via a 5-minute window with a 10-second slide updates every 10 seconds, smoothing the dashboard and cutting the alerting delay. For SLO-bound systems — a Razorpay payment gateway with a 99.95% target — the alerting delay matters more than the compute cost.

The state cost is size / slide copies of every event. For a (5min, 10s) sliding window, every event sits in 30 windows simultaneously. A naive implementation stores the event 30 times; a smart implementation stores it once and tracks which windows currently include it. Even the smart implementation has 30× the result-emit rate of tumbling — the output topic fills 30× faster, and the downstream sink has to handle the load. Why this is the most-misused window: teams reach for sliding because "smoother is better", forget the multiplication factor, and discover six weeks later that 70% of cluster CPU is spent emitting redundant aggregates that nobody reads. The fix is almost always to switch to tumbling with a smaller bucket (10s tumbling beats 60s/10s sliding for most dashboards).

Session: gap-defined and per-key

A session window has no fixed size. It opens when an event arrives for a key that does not currently have an open session, and it closes when no new event for that key has arrived for gap time. Two events 25 seconds apart with gap=30s are in the same session; an event 35 seconds after the previous one starts a new session. The window length is whatever the data is.

This is the one that maps cleanly to "user behaviour". A Hotstar viewer who watches for 7 minutes, gets a phone call, returns 4 minutes later for another 12 minutes is two sessions with gap=2min — exactly the right granularity for "average session length", "events per session", "session-level conversion". A Swiggy delivery rider whose location pings come every 5 seconds while on duty and stop completely between shifts has sessions that match shifts.

Session windows cost more state than tumbling because the boundaries are per-key, not aligned globally. The runtime cannot trigger "all keys close their session at 19:30" — it has to track each key's last-event time and close that key's session when its individual gap fires. State is managed in RocksDB (see /wiki/state-stores-why-rocksdb-is-in-every-streaming-engine) keyed by (user_id, session_start), with a per-key timer firing gap after each new event. There is also a tricky merge case: a late event that lands between two existing sessions of the same key — the runtime has to merge the two sessions into one, recomputing the aggregate.
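The per-key timer mechanics can be sketched in pure Python, assuming in-order arrival, with the event timestamps standing in for the watermark (illustrative, not an engine API; a real engine persists `last_seen` and the timer in the state backend):

```python
def session_with_timers(events, gap):
    """events: (key, ts) pairs in arrival order. Emit (key, start, end) for a
    session once the clock passes that key's last event by more than gap."""
    open_sessions = {}  # key -> [session_start, last_seen]
    for key, ts in events:
        # The advancing clock fires any per-key timer scheduled at last_seen + gap
        for k in list(open_sessions):
            start, last = open_sessions[k]
            if ts > last + gap:
                yield (k, start, last)        # k's inactivity gap elapsed: close it
                del open_sessions[k]
        if key in open_sessions:
            open_sessions[key][1] = ts        # extend the open session
        else:
            open_sessions[key] = [ts, ts]     # open a new one
    for k, (start, last) in open_sessions.items():
        yield (k, start, last)                # flush at end-of-stream

evts = [("u1", 5), ("u2", 10), ("u1", 20), ("u1", 90)]
print(list(session_with_timers(evts, gap=30)))
# [('u1', 5, 20), ('u2', 10, 10), ('u1', 90, 90)]
```

Note that u2's session closes only when u1's event at ts=90 advances the clock: in a real engine the watermark, not another key's traffic, drives the timers.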

Building the three windows from scratch in Python

Implementing all three in about 60 lines of pure Python, against a synthetic Razorpay-style UPI event stream, makes concrete what the runtime is doing.

# windows_demo.py — implement tumbling / sliding / session in pure Python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Event:
    user_id: int
    ts: int       # event-time, seconds since arbitrary epoch
    amount: int   # paise

def tumbling(events: Iterable[Event], size: int):
    """Fire (window_start, user_id) -> sum_amount once per closed bucket."""
    buckets = defaultdict(int)
    for e in events:
        w_start = (e.ts // size) * size      # snap event-time down to bucket start
        buckets[(w_start, e.user_id)] += e.amount
    return sorted(buckets.items())

def sliding(events: Iterable[Event], size: int, slide: int):
    """Every event contributes to (size/slide) windows."""
    buckets = defaultdict(int)
    for e in events:
        # Earliest window that still contains e.ts, then walk forward by slide
        first_start = ((e.ts - size + slide) // slide) * slide
        first_start = max(first_start, 0)
        w_start = first_start
        while w_start <= e.ts:
            buckets[(w_start, e.user_id)] += e.amount
            w_start += slide
    return sorted(buckets.items())

def session(events: Iterable[Event], gap: int):
    """Per-user gap-defined sessions. Events must arrive sorted by (user, ts)."""
    sessions = []      # (user_id, session_start, session_end, sum)
    open_session = {}  # user_id -> index in sessions
    for e in sorted(events, key=lambda x: (x.user_id, x.ts)):
        idx = open_session.get(e.user_id)
        if idx is not None and e.ts - sessions[idx][2] <= gap:
            u, s, _, total = sessions[idx]
            sessions[idx] = (u, s, e.ts, total + e.amount)
        else:
            open_session[e.user_id] = len(sessions)
            sessions.append((e.user_id, e.ts, e.ts, e.amount))
    return sessions

# Synthetic stream: 3 users paying merchants over a 4-minute window
stream = [
    Event(91_001,   5, 1240), Event(91_001,  20,  890),  Event(91_001,  35, 2100),
    Event(91_002,  10,  450), Event(91_002,  70, 1500),
    Event(91_001,  90, 3300), Event(91_001, 105,  500),  # session-2 starts after 55s gap
    Event(91_003, 130, 2199), Event(91_003, 145,  500),  Event(91_003, 200,  300),
]

print("\n-- TUMBLING (size=60s) --")
for (w, u), s in tumbling(stream, 60):
    print(f"window=[{w:3d},{w+60:3d})  user={u}  sum={s}")

print("\n-- SLIDING (size=60s, slide=30s) --")
for (w, u), s in sliding(stream, 60, 30):
    print(f"window=[{w:3d},{w+60:3d})  user={u}  sum={s}")

print("\n-- SESSION (gap=30s) --")
for u, st, en, s in session(stream, 30):
    print(f"user={u}  [{st:3d},{en:3d}]  ({en-st:3d}s)  sum={s}")
The output:

-- TUMBLING (size=60s) --
window=[  0, 60)  user=91001  sum=4230
window=[  0, 60)  user=91002  sum=450
window=[ 60,120)  user=91001  sum=3800
window=[ 60,120)  user=91002  sum=1500
window=[120,180)  user=91003  sum=2699
window=[180,240)  user=91003  sum=300

-- SLIDING (size=60s, slide=30s) --
window=[  0, 60)  user=91001  sum=4230
window=[  0, 60)  user=91002  sum=450
window=[ 30, 90)  user=91001  sum=2100
window=[ 30, 90)  user=91002  sum=1500
window=[ 60,120)  user=91001  sum=3800
window=[ 60,120)  user=91002  sum=1500
window=[ 90,150)  user=91001  sum=3800
window=[ 90,150)  user=91003  sum=2699
window=[120,180)  user=91003  sum=2699
window=[150,210)  user=91003  sum=300
window=[180,240)  user=91003  sum=300

-- SESSION (gap=30s) --
user=91001  [  5, 35]  ( 30s)  sum=4230
user=91001  [ 90,105]  ( 15s)  sum=3800
user=91002  [ 10, 10]  (  0s)  sum=450
user=91002  [ 70, 70]  (  0s)  sum=1500
user=91003  [130,145]  ( 15s)  sum=2699
user=91003  [200,200]  (  0s)  sum=300

Three lines decide the streaming mechanics: the bucket snap (e.ts // size) * size in tumbling, the first_start computation that walks each event into every overlapping window in sliding, and the e.ts - sessions[idx][2] <= gap check that extends or opens a session.

Run it and the latency sits in microseconds because everything lives in RAM. The same logic in Flink runs in microseconds per key plus the RocksDB read/write tax, which is why the production p99 lands in the 1–10 ms range under load.

Where windows live in real engines

Flink, Kafka Streams, Spark Structured Streaming, and Beam all expose the three windows with very similar APIs. The differences are in the state-store choice, the trigger flexibility, and the watermark integration:

# Flink (PyFlink) — the API every other engine borrows from
from pyflink.datastream.window import (
    TumblingEventTimeWindows, SlidingEventTimeWindows, EventTimeSessionWindows
)
from pyflink.common.time import Time

# stream is a KeyedStream of (user_id, amount, event_time)
stream.window(TumblingEventTimeWindows.of(Time.minutes(1))).reduce(lambda a,b: a+b)
stream.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30))).reduce(lambda a,b: a+b)
stream.window(EventTimeSessionWindows.with_gap(Time.minutes(2))).reduce(lambda a,b: a+b)

The naming shifts across engines — Spark calls it window(timeColumn, "60 seconds", "30 seconds"), Beam calls it FixedWindows/SlidingWindows/Sessions — but the semantics are identical. Pick the engine your team already runs; the windowing decision survives the migration.

Common confusions

"Sliding is just smoother tumbling": it is smoother, but at size/slide times the state and emit rate, and a smaller tumbling bucket usually answers the same question more cheaply. "Session windows have a size": they have a gap; the length is whatever the data makes it, and two sessions of the same key can merge when a late event bridges them. "A window fires when the clock passes its end": it fires when its trigger fires, which by default is when the watermark passes the window end, tracking event-time progress rather than the wallclock.

Going deeper

Why the runtime distinguishes event time from processing time

The window definitions above all assume the runtime can sort events into bucket-by-bucket order. In production, events arrive out of order — a 5G handoff in Bengaluru can delay a UPI ping by 8 seconds, an iOS app in flight mode can buffer 200 events and dump them at landing. The runtime gets to choose between event time (the timestamp embedded in the event when it was created) and processing time (the wallclock at the operator). Tumbling on event time gives "transactions in the 19:30 minute as they actually happened" — a Hotstar viewer who pressed play at 19:29:58 lands in [19:29, 19:30) even if the event reached the operator at 19:30:04. Tumbling on processing time gives "transactions the operator saw in the 19:30 minute" — easier to implement, less correct for behavioural metrics. The next chapter at /wiki/event-time-vs-processing-time-the-whole-ballgame walks through the consequences. For now, the rule of thumb: session windows are almost always event-time; tumbling and sliding are usually event-time but processing-time is acceptable for ops-style metrics.
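The two clocks put the same event in two different buckets; the numbers below (seconds since midnight) are illustrative:

```python
size = 60  # one-minute tumbling buckets

event_ts = 70198  # play pressed at 19:29:58 (event time)
proc_ts = 70204   # delivered to the operator at 19:30:04 (processing time)

print((event_ts // size) * size)  # 70140 -> the [19:29, 19:30) minute, where it happened
print((proc_ts // size) * size)   # 70200 -> the [19:30, 19:31) minute, where it was seen
```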

Triggers — when does the window actually fire?

A window definition specifies the boundaries; a trigger specifies when the aggregate gets emitted. The default trigger in every engine is "fire when the watermark crosses the window end" — meaning fire once, with the complete result, when the runtime is confident no more in-window events will arrive. But you can override. Early triggers fire periodically before the window closes, giving "the count so far" — useful for live dashboards. Late triggers fire when a late event arrives after the window closed — re-emitting an updated aggregate. Flink's Trigger API exposes onElement, onEventTime, onProcessingTime, and onMerge; Kafka Streams uses suppressors. The default is right for 90% of jobs; a custom trigger is right when you genuinely need progressive output (live counter updating every 2 seconds within a 1-minute window) or late-data correction.
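An early trigger is easy to mimic in pure Python: emit a provisional count every few events, then the final one when the window closes (a sketch of the idea, not any engine's Trigger API; assumes in-order input):

```python
def counts_with_early_trigger(timestamps, size, every):
    """Yield ('early'|'final', window_start, count) for a tumbling count."""
    current, count, since_fire = None, 0, 0
    for ts in timestamps:
        w = (ts // size) * size
        if current is not None and w != current:
            yield ("final", current, count)   # window closed: complete result
            count, since_fire = 0, 0
        current = w
        count += 1
        since_fire += 1
        if since_fire == every:
            yield ("early", current, count)   # provisional "count so far"
            since_fire = 0
    if current is not None:
        yield ("final", current, count)

print(list(counts_with_early_trigger([1, 5, 12, 40, 61, 70], size=60, every=2)))
# [('early', 0, 2), ('early', 0, 4), ('final', 0, 4), ('early', 60, 2), ('final', 60, 2)]
```

The downstream consumer has to tolerate the same window key arriving more than once, which is exactly the contract a real early trigger imposes.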

Window state in RocksDB — how it actually scales

For tumbling and sliding, the runtime stores a (window_start, key) → aggregate mapping in RocksDB. Tumbling keeps live windows + a fixed number of late-arrival windows (allowedLateness); sliding keeps size/slide simultaneous windows per key. For sessions, the storage is (key, session_start) → (session_end, aggregate). The trickiest case is sliding with a large multiplication factor — a (1h, 1s) sliding window means 3600 simultaneous windows per key. At 10M users, that is 36 billion (key, window_start) entries in RocksDB. The runtime mitigates this with incremental aggregation — instead of storing every event in the window, it stores only the running aggregate (sum, count, min, max). The event itself is dropped after contributing. This makes the per-window state O(1) bytes regardless of window size, which is the only reason multi-billion-row sliding windows fit on disk at all. The trick fails for non-mergeable aggregates like median or percentile-95 — those require either approximate sketches (t-digest for percentiles, HyperLogLog for distinct counts) or the full event list.
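The contrast fits in a dozen lines: a mergeable aggregate keeps O(1) state per window, a percentile keeps every event (illustrative sketch; the class names are made up):

```python
class SumAcc:
    """O(1) state: only the running total survives; each event is dropped on add."""
    def __init__(self):
        self.total = 0
    def add(self, v):
        self.total += v
    def merge(self, other):          # mergeable: two panes combine in O(1)
        self.total += other.total

class MedianAcc:
    """O(n) state: median needs the full event list (or a t-digest approximation)."""
    def __init__(self):
        self.values = []
    def add(self, v):
        self.values.append(v)
    def result(self):
        return sorted(self.values)[len(self.values) // 2]

a, b = SumAcc(), SumAcc()
a.add(1240); a.add(890)
b.add(2100)
a.merge(b)
print(a.total)  # 4230
```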

Session merging — the most subtle case

A late event with event_time = T arrives for a user who already has session A ending at T - 25 and session B starting at T + 10. With gap = 30, the event extends A and connects to B — A and B must merge. Flink's MergingWindowAssigner handles this by reopening A, swallowing B, and emitting one merged aggregate. The state-store mechanics: A's RocksDB row gets updated with the new end-time, B's row gets deleted, the per-key timer gets rescheduled. The downstream consumer sees three emits — A (open), B (closed), then A (re-emitted with merged result and a retraction of B). Most engines implement this; Spark Structured Streaming lacked session windows for years, with native support (including merging) landing only in the 3.2 line. If your job needs sessions, check the engine's merge story before committing.
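The interval logic behind a merge is small: treat the late event as a one-point session [T, T], then collapse any sessions of the same key whose gap-extended intervals touch (a sketch of the rule, not Flink's MergingWindowAssigner):

```python
def merge_sessions(sessions, gap):
    """sessions: (start, end, aggregate) tuples for one key, sorted by start.
    Merge neighbours that are within gap of each other."""
    merged = [sessions[0]]
    for start, end, agg in sessions[1:]:
        p_start, p_end, p_agg = merged[-1]
        if start - p_end <= gap:                          # bridged: swallow into previous
            merged[-1] = (p_start, max(p_end, end), p_agg + agg)
        else:
            merged.append((start, end, agg))
    return merged

# Session A ends at 100, B starts at 135; a late event at T = 125 bridges them.
a, late, b = (60, 100, 5), (125, 125, 1), (135, 160, 3)
print(merge_sessions([a, late, b], gap=30))  # [(60, 160, 9)]
```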

When not to use windows at all

Three workloads where windowing is the wrong tool. (a) Cumulative-from-start aggregates ("total UPI volume year-to-date") — use a Flink KeyedProcessFunction with a stateful counter, no window. (b) Stream-stream joins — those use interval joins (every left event matches right events within [T - delta, T + delta]), not windows. The semantics are similar but the implementation is a separate code path because the boundary is per-event-pair, not per-window. (c) Materialised-view-style aggregates that need to be queryable on demand — use a streaming KV store (Flink's KeyedBroadcastProcessFunction with snapshot output to Kafka, or a Materialize/RisingWave streaming database). Windows fire and emit; some questions need state that you can query, not just one that fires.
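Case (a) in miniature: a cumulative aggregate is keyed state that never evicts, which is why no window fits it (a pure-Python sketch of what the stateful counter reduces to):

```python
from collections import defaultdict

def cumulative_totals(events):
    """events: (user_id, amount) pairs. Emit the running per-key total on
    every event; no boundary ever closes, no state is ever evicted."""
    totals = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
        yield (user, totals[user])

payments = [(91_001, 1240), (91_001, 890), (91_002, 450)]
print(list(cumulative_totals(payments)))
# [(91001, 1240), (91001, 2130), (91002, 450)]
```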

Where this leads next

The next chapter — /wiki/event-time-vs-processing-time-the-whole-ballgame — explains the two clocks every windowed job has to reconcile, and why "wallclock" is almost always the wrong choice. After that, /wiki/watermarks-how-we-know-were-done shows the mechanism that lets the runtime decide when a window's aggregate is safe to emit. Together they close the loop: windows define what to aggregate, watermarks define when to fire, and event time decides which bucket an event belongs to.

The mental model to carry forward: a window is just a key-prefix on the state store. A tumbling window assigns the prefix (window_start, user_id). A sliding window assigns size/slide prefixes per event. A session window assigns (user_id, session_id) and reshuffles session_id on merge. Once you see windows as prefixes in RocksDB, the cost model becomes obvious — and the choice between tumbling, sliding, and session stops being "what does the dashboard look like?" and starts being "what does my state store look like at 3 a.m. on Diwali night?".

The capstone of Build 8 — /wiki/wall-exactly-once-is-a-lie-everyone-tells — covers how a window's emit interacts with the input log's offset commit. That is where exactly-once semantics either work or fall apart.
