Windowing: tumbling, sliding, session
It is 19:30 on the night of a Royal Challengers Bengaluru match. Riya, on the Hotstar streaming-platform team, has a Flink job that ingests "user pressed play" events at 4 lakh per minute. The product manager has asked a question that sounds simple — "how many concurrent viewers right now?" — and Riya has spent two hours discovering it is not simple at all. SQL's COUNT(*) works on a table that ends. Her event stream does not end. To get a number she has to slice the stream into finite pieces, and the slicing decision is the entire problem. Tumbling, sliding, session — these are the three slicing strategies every streaming engine ships, and the choice between them is the difference between "viewers per minute" and "viewers per session" being a number anyone can act on.
Streams are unbounded; aggregates need finite inputs. Tumbling windows chop the timeline into non-overlapping buckets of fixed size — clean for periodic reporting. Sliding windows overlap so every event contributes to several windows — smoother metrics but N× the state. Session windows close after a per-key gap of inactivity — variable size, naturally aligned with user behaviour. Pick by what the question is, not by what the engine defaults to.
Why a stream needs windows in the first place
Consider the question "how many UPI transactions did PhonePe see today?" against a static Postgres table — SELECT COUNT(*) FROM transactions WHERE day = '2026-04-25'. The query runs, returns 314 crore, exits. The query engine knew when to stop because the table ended.
Now consider the same question against a Kafka topic. The topic does not end. New transactions are appended every millisecond. A naive COUNT(*) over the topic stream would never return — there is always one more event arriving. The aggregate has to fire on some signal that the question's answer is "ready to emit", and that signal cannot be "the input ended" because the input never ends.
A window is that signal. It is a rule that splits the unbounded stream into bounded substreams, each of which has a clear start and a clear end. When the end fires, the aggregate over that substream's events is emitted, and a new window starts collecting events for the next interval. The three strategies — tumbling, sliding, session — differ in how they decide where the boundaries land.
The decision is not aesthetic. It cascades into state size, output cardinality, and how late events get handled — all of which the runtime has to budget for. Why this is "the entire stateful streaming problem" in disguise: every stateful operator in your job is, underneath, accumulating state for some window. The window definition determines how much state is live at once, when state can be evicted, and how often the operator emits results downstream. A wrong choice here multiplies into 10× the cluster cost three months later.
Tumbling: the simplest cut
A tumbling window has a fixed size and the next window starts exactly when the previous one ends — [0, 60), [60, 120), [120, 180), ... for a 60-second window. Every event lands in exactly one window. The aggregate fires once per window, on a clean boundary.
This is what every dashboard you've ever seen runs on. "UPI transactions per minute", "errors per 5 minutes", "page-load p95 per hour". The boundaries are the same for every key in the stream — user 91, user 92, and user 93 all close their [60, 120) window at the same moment. That alignment is what makes tumbling cheap: the runtime can collapse all per-key state into a single emit when the window closes, and the downstream consumer (a TimescaleDB sink, a Pinot table, a CloudWatch metric) sees a tidy point per minute per key.
The cost is rigidity. A user who started watching the cricket at 18:59:55 has their session sliced into one event in [18:59, 19:00) and however many in [19:00, 19:01) — the two halves are emitted as separate rows even though they belong to the same continuous behaviour. Tumbling is right when the question is temporal (what happened in this clock minute?), wrong when the question is behavioural (what did this user do in their burst of activity?).
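The split is visible in the bucket arithmetic itself. A minimal sketch, with timestamps as seconds since midnight (numbers chosen for illustration):

```python
def bucket(ts: int, size: int = 60) -> int:
    """Snap an event-time down to the start of its tumbling bucket."""
    return (ts // size) * size

# Three play events in one continuous burst: 18:59:55, 19:00:00, 19:00:05
play_events = [68395, 68400, 68405]  # seconds since midnight
starts = sorted({bucket(ts) for ts in play_events})
print(starts)  # [68340, 68400]: one burst, two buckets, two output rows
```

The burst straddles the 19:00:00 boundary, so a behavioural question about it gets two temporal answers.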
Sliding: smoother but N× the state
A sliding window has a size and a slide — (size=60s, slide=10s) means a new window starts every 10 seconds, lasts 60 seconds, and overlaps the previous five. An event at second 35 lands in the windows [0,60), [10,70), [20,80), [30,90) — four windows. The aggregate fires every slide interval per key.
The reason to pay this cost is smoothness. A "rolling 5-minute error rate" computed via tumbling jumps every 5 minutes, and a spike inside a bucket becomes visible only after that bucket closes. The same metric via a 5-minute window with a 10-second slide updates every 10 seconds, smoothing the dashboard and cutting the alerting delay. For SLO-bound systems — a Razorpay payment gateway with a 99.95% target — the alerting delay matters more than the compute cost.
The state cost is size / slide copies of every event. For a (5min, 10s) sliding window, every event sits in 30 windows simultaneously. A naive implementation stores the event 30 times; a smart implementation stores it once and tracks which windows currently include it. Even the smart implementation has 30× the result-emit rate of tumbling — the output topic fills 30× faster, and the downstream sink has to handle the load. Why this is the most-misused window: teams reach for sliding because "smoother is better", forget the multiplication factor, and discover six weeks later that 70% of cluster CPU is spent emitting redundant aggregates that nobody reads. The fix is almost always to switch to tumbling with a smaller bucket (10s tumbling beats 60s/10s sliding for most dashboards).
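One way to picture the "store once, track which windows" layout is a sketch with hypothetical names, not any engine's actual API:

```python
from collections import defaultdict

SIZE, SLIDE = 300, 10  # (5 min, 10 s): every event belongs to SIZE // SLIDE = 30 windows

events = {}                       # event_id -> payload, stored exactly once
window_index = defaultdict(list)  # window_start -> [event_id]; the 30x lives here

def ingest(event_id: int, ts: int, payload: dict) -> None:
    events[event_id] = payload
    first = max(((ts - SIZE + SLIDE) // SLIDE) * SLIDE, 0)
    for w in range(first, ts + 1, SLIDE):
        window_index[w].append(event_id)  # 30 small index entries, 1 payload

ingest(1, ts=600, payload={"amount": 1240})
print(len(events), len(window_index))  # 1 payload, 30 window references
```

The payload is deduplicated, but the 30 index entries and the 30 eventual emits remain; deduplication shrinks state, not output rate.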
Session: gap-defined and per-key
A session window has no fixed size. It opens when an event arrives for a key that does not currently have an open session, and it closes when no new event for that key has arrived for gap time. Two events 25 seconds apart with gap=30s are in the same session; an event 35 seconds after the previous one starts a new session. The window length is whatever the data is.
This is the one that maps cleanly to "user behaviour". A Hotstar viewer who watches for 7 minutes, gets a phone call, returns 4 minutes later for another 12 minutes is two sessions with gap=2min — exactly the right granularity for "average session length", "events per session", "session-level conversion". A Swiggy delivery rider whose location pings come every 5 seconds while on duty and stop completely between shifts has sessions that match shifts.
Session windows cost more state than tumbling because the boundaries are per-key, not aligned globally. The runtime cannot trigger "all keys close their session at 19:30" — it has to track each key's last-event time and close that key's session when its individual gap fires. State is managed in RocksDB (see /wiki/state-stores-why-rocksdb-is-in-every-streaming-engine) keyed by (user_id, session_start), with a per-key timer firing gap after each new event. There is also a tricky merge case: a late event that lands between two existing sessions of the same key — the runtime has to merge the two sessions into one, recomputing the aggregate.
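A minimal sketch of the per-key timer discipline, assuming a watermark-driven close signal; the two-dict layout and names are illustrative, not Flink's API:

```python
GAP = 30  # seconds of inactivity that closes a session

last_seen = {}  # key -> ts of the latest event
deadline = {}   # key -> when this key's open session closes

def on_event(key: str, ts: int) -> None:
    last_seen[key] = ts
    deadline[key] = ts + GAP  # cancel-and-reschedule on every event

def on_watermark(wm: int) -> list:
    """Close (emit) every session whose deadline the watermark has passed."""
    closed = [k for k, d in deadline.items() if d <= wm]
    for k in closed:
        deadline.pop(k)
        last_seen.pop(k)
    return closed

on_event("u91", 5)
on_event("u91", 20)  # u91's deadline moves from 35 to 50
on_event("u92", 10)  # u92's deadline is 40
print(on_watermark(45))  # ['u92']; u91 was rescheduled past 45
```

Every event touches that key's deadline, which is exactly the timer-cancel-and-reschedule traffic the paragraph above describes.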
Building the three windows from scratch in Python
Let's implement all three in roughly 60 lines of pure Python, against a synthetic Razorpay-style UPI-event stream, to internalise what the runtime is doing.
```python
# windows_demo.py — implement tumbling / sliding / session in pure Python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Event:
    user_id: int
    ts: int      # event-time, seconds since an arbitrary epoch
    amount: int  # paise


def tumbling(events: Iterable[Event], size: int):
    """Fire (window_start, user_id) -> sum_amount once per closed bucket."""
    buckets = defaultdict(int)
    for e in events:
        w_start = (e.ts // size) * size  # snap event-time down to bucket start
        buckets[(w_start, e.user_id)] += e.amount
    return sorted(buckets.items())


def sliding(events: Iterable[Event], size: int, slide: int):
    """Every event contributes to (size/slide) windows."""
    buckets = defaultdict(int)
    for e in events:
        # Earliest window that still contains e.ts, then walk forward by slide
        first_start = ((e.ts - size + slide) // slide) * slide
        first_start = max(first_start, 0)
        w_start = first_start
        while w_start <= e.ts:
            buckets[(w_start, e.user_id)] += e.amount
            w_start += slide
    return sorted(buckets.items())


def session(events: Iterable[Event], gap: int):
    """Per-user gap-defined sessions; sorts by (user, ts) to simulate in-order arrival."""
    sessions = []      # (user_id, session_start, session_end, sum)
    open_session = {}  # user_id -> index in sessions
    for e in sorted(events, key=lambda x: (x.user_id, x.ts)):
        idx = open_session.get(e.user_id)
        if idx is not None and e.ts - sessions[idx][2] <= gap:
            u, s, _, total = sessions[idx]
            sessions[idx] = (u, s, e.ts, total + e.amount)
        else:
            open_session[e.user_id] = len(sessions)
            sessions.append((e.user_id, e.ts, e.ts, e.amount))
    return sessions


# Synthetic stream: 3 users paying merchants over a 4-minute window
stream = [
    Event(91_001, 5, 1240), Event(91_001, 20, 890), Event(91_001, 35, 2100),
    Event(91_002, 10, 450), Event(91_002, 70, 1500),
    Event(91_001, 90, 3300), Event(91_001, 105, 500),  # session 2 starts after a 55s gap
    Event(91_003, 130, 2199), Event(91_003, 145, 500), Event(91_003, 200, 300),
]

print("\n-- TUMBLING (size=60s) --")
for (w, u), s in tumbling(stream, 60):
    print(f"window=[{w:3d},{w+60:3d}) user={u} sum={s}")

print("\n-- SLIDING (size=60s, slide=30s) --")
for (w, u), s in sliding(stream, 60, 30):
    print(f"window=[{w:3d},{w+60:3d}) user={u} sum={s}")

print("\n-- SESSION (gap=30s) --")
for u, st, en, s in session(stream, 30):
    print(f"user={u} [{st:3d},{en:3d}] ({en-st:3d}s) sum={s}")
```
```text
-- TUMBLING (size=60s) --
window=[  0, 60) user=91001 sum=4230
window=[  0, 60) user=91002 sum=450
window=[ 60,120) user=91001 sum=3800
window=[ 60,120) user=91002 sum=1500
window=[120,180) user=91003 sum=2699
window=[180,240) user=91003 sum=300

-- SLIDING (size=60s, slide=30s) --
window=[  0, 60) user=91001 sum=4230
window=[  0, 60) user=91002 sum=450
window=[ 30, 90) user=91001 sum=2100
window=[ 30, 90) user=91002 sum=1500
window=[ 60,120) user=91001 sum=3800
window=[ 60,120) user=91002 sum=1500
window=[ 90,150) user=91001 sum=3800
window=[ 90,150) user=91003 sum=2699
window=[120,180) user=91003 sum=2699
window=[150,210) user=91003 sum=300
window=[180,240) user=91003 sum=300

-- SESSION (gap=30s) --
user=91001 [  5, 35] ( 30s) sum=4230
user=91001 [ 90,105] ( 15s) sum=3800
user=91002 [ 10, 10] (  0s) sum=450
user=91002 [ 70, 70] (  0s) sum=1500
user=91003 [130,145] ( 15s) sum=2699
user=91003 [200,200] (  0s) sum=300
```
Six lines decide the streaming mechanics:

- `w_start = (e.ts // size) * size` — integer-snap to the bucket start. This is the entire tumbling-window assignment in one line. The runtime does the same thing: the window assigner is a pure function of event time, no state required to decide which bucket an event belongs to. This is why tumbling is the cheapest of the three.
- `first_start = ((e.ts - size + slide) // slide) * slide` — the earliest sliding window that still contains the event. The `while w_start <= e.ts: w_start += slide` loop then enumerates all `size/slide` windows the event lands in. The runtime stores `size/slide` window-IDs per event in MapState, which is where the multiplication factor shows up in your bill.
- `if e.ts - sessions[idx][2] <= gap:` — the session-extend rule. The current event extends the open session if it arrived within `gap` of the last event for that user. Otherwise a new session starts. Why this is harder than tumbling: the runtime cannot precompute the boundary. It maintains, per key, a "session end timer" that fires `gap` after the last event. When the timer fires, the session is emitted; if a new event arrives before the timer, the timer is rescheduled. This is why session windows have higher state-store traffic than tumbling — every event causes a timer-cancel-and-reschedule on the per-key state.
- `sessions[idx] = (u, s, e.ts, total + e.amount)` — the in-place merge. In a real runtime, this also has to handle the out-of-order case: a late event that lands between two sessions of the same key has to merge them. Apache Flink's `SessionWindows` assigner has explicit `mergeWindows` logic that the user code never sees but the state store has to support.
- The `sorted(events, key=lambda x: (x.user_id, x.ts))` line — this is what the watermark machinery (covered next chapter at /wiki/watermarks-how-we-know-were-done) does in production. Real streams are out-of-order, and the runtime cannot assume the sort. It uses watermarks — a timestamp it has been told "no more events with `ts < W` will arrive" — to decide when it is safe to fire a window's aggregate.
- The output cardinality difference — tumbling produced 6 rows, sliding produced 11, session produced 6. Sliding's row count for the same input is roughly `size/slide` × tumbling's. At a 5-minute window with 10-second slide, that is 30× the output rate. Plan capacity for that or pick a different window.
Run it and every result arrives in microseconds, because all the state lives in RAM. The same logic in Flink costs microseconds per key plus the RocksDB read/write tax, which is why the production p99 lands in the 1–10 ms range under load.
Where windows live in real engines
Flink, Kafka Streams, Spark Structured Streaming, and Beam all expose the three windows with very similar APIs. The differences are in the state-store choice, the trigger flexibility, and the watermark integration:
```python
# Flink (PyFlink) — the API every other engine borrows from
from pyflink.common.time import Time
from pyflink.datastream.window import (
    TumblingEventTimeWindows, SlidingEventTimeWindows, EventTimeSessionWindows
)

# stream is a KeyedStream of (user_id, amount, event_time) rows, keyed by user_id.
# The reduce sums the amount field (a bare `a + b` would concatenate the tuples).
sum_amounts = lambda a, b: (a[0], a[1] + b[1], b[2])

stream.window(TumblingEventTimeWindows.of(Time.minutes(1))).reduce(sum_amounts)
stream.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30))).reduce(sum_amounts)
stream.window(EventTimeSessionWindows.with_gap(Time.minutes(2))).reduce(sum_amounts)
```
The naming shifts across engines — Spark calls it window(timeColumn, "60 seconds", "30 seconds"), Beam calls it FixedWindows/SlidingWindows/Sessions — but the semantics are identical. Pick the engine your team already runs; the windowing decision survives the migration.
Common confusions
- "Tumbling and sliding with `size = slide` are the same." They are. A `SlidingWindow(size=60s, slide=60s)` is exactly a `TumblingWindow(60s)`. The runtimes recognise this and use the cheaper tumbling implementation. Don't write the verbose form when you mean tumbling.
- "Session windows fix the late-event problem." They don't. Sessions still need a watermark to decide when a session is "really" closed. A late event that lands after the timer fired but with `event_time` inside the session boundaries causes either a session merge (Flink's default) or a duplicate session emit (some engines), depending on the trigger configuration. Late data is a separate problem; sessions just give you a more behaviourally meaningful unit to attach late data to.
- "Bigger window = better aggregate." Bigger window means more state, longer recovery, more late-data ambiguity. A "lifetime user spend" computed as a 10-year tumbling window is a bug — that is what a database-backed materialised view is for, not a streaming window. Stream windows live at the seconds-to-hours scale; anything longer belongs in the warehouse.
- "Sliding windows are smoother than tumbling." Only at the cost of N× output. For dashboards, a 10-second tumbling window updated every 10 seconds is visually identical to a 60-second sliding window with 10-second slide, and costs 30× less. Reach for sliding only when the metric semantics genuinely require an N-second-trailing aggregate (e.g. "errors in the last 5 minutes" for SLO alerting).
- "Session windows are stateless because they don't store windows." They store more state than tumbling — every key has its own session start/end and an active timer. The state cost scales with the number of active keys, not with the number of windows, but in a typical Hotstar deployment with millions of active users that is millions of timers in RocksDB.
- "Window = time. There's no other dimension." Most engines support count windows (every 1000 events) and global windows with custom triggers (window when a flag bit changes). Time is the most common because the question is usually "what happened in this minute?", but the runtimes are agnostic to the trigger source.
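The first confusion above is easy to verify mechanically. A quick check, reusing the same snap arithmetic as the from-scratch demo, that a 60s/60s sliding window assigns every event to exactly its tumbling bucket:

```python
def tumbling_start(ts: int, size: int) -> int:
    return (ts // size) * size

def sliding_starts(ts: int, size: int, slide: int):
    first = max(((ts - size + slide) // slide) * slide, 0)
    return list(range(first, ts + 1, slide))

# With size == slide, every event lands in exactly one window: its tumbling bucket
for ts in [5, 59, 60, 125]:
    assert sliding_starts(ts, 60, 60) == [tumbling_start(ts, 60)]
print("sliding(size=60s, slide=60s) assigns exactly the tumbling bucket, every time")
```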
Going deeper
Why the runtime distinguishes event time from processing time
The window definitions above all assume the runtime can sort events into bucket-by-bucket order. In production, events arrive out of order — a 5G handoff in Bengaluru can delay a UPI ping by 8 seconds, an iOS app in flight mode can buffer 200 events and dump them at landing. The runtime gets to choose between event time (the timestamp embedded in the event when it was created) and processing time (the wallclock at the operator). Tumbling on event time gives "transactions in the 19:30 minute as they actually happened" — a Hotstar viewer who pressed play at 19:29:58 lands in [19:29, 19:30) even if the event reached the operator at 19:30:04. Tumbling on processing time gives "transactions the operator saw in the 19:30 minute" — easier to implement, less correct for behavioural metrics. The next chapter at /wiki/event-time-vs-processing-time-the-whole-ballgame walks through the consequences. For now, the rule of thumb: session windows are almost always event-time; tumbling and sliding are usually event-time but processing-time is acceptable for ops-style metrics.
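A toy illustration of the two clocks, with hand-picked (event_ts, arrival_ts) pairs in which one event is delayed 8 seconds across a minute boundary:

```python
from collections import Counter

SIZE = 60  # one-minute tumbling buckets, timestamps in seconds
# (event_ts, arrival_ts): the second event was delayed 8s across the 60s boundary
events = [(55, 55), (58, 66), (61, 61), (70, 70)]

by_event_time = Counter((e // SIZE) * SIZE for e, _ in events)
by_proc_time = Counter((a // SIZE) * SIZE for _, a in events)
print(dict(by_event_time))  # {0: 2, 60: 2}: the 58s press-play stays in its minute
print(dict(by_proc_time))   # {0: 1, 60: 3}: the delay shifts it into the next bucket
```

Same stream, two different per-minute counts; event time answers "what happened", processing time answers "what the operator saw".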
Triggers — when does the window actually fire?
A window definition specifies the boundaries; a trigger specifies when the aggregate gets emitted. The default trigger in every engine is "fire when the watermark crosses the window end" — meaning fire once, with the complete result, when the runtime is confident no more in-window events will arrive. But you can override. Early triggers fire periodically before the window closes, giving "the count so far" — useful for live dashboards. Late triggers fire when a late event arrives after the window closed — re-emitting an updated aggregate. Flink's Trigger API exposes onElement, onEventTime, onProcessingTime, and onMerge; Kafka Streams uses suppressors. The default is right for 90% of jobs; a custom trigger is right when you genuinely need progressive output (live counter updating every 2 seconds within a 1-minute window) or late-data correction.
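A pure-Python sketch of early firing over in-order events; this illustrates the idea, not Flink's Trigger API:

```python
def tumbling_with_early_fire(events, size, every):
    """events: sorted (ts, amount) pairs. Yields (window_start, partial_sum, is_final):
    an early fire every `every` events, plus a final fire when the window closes."""
    w_start, total, seen = None, 0, 0
    for ts, amount in events:
        start = (ts // size) * size
        if w_start is not None and start != w_start:
            yield (w_start, total, True)  # final fire: the window closed
            total, seen = 0, 0
        w_start = start
        total += amount
        seen += 1
        if seen % every == 0:
            yield (w_start, total, False)  # early fire: "the count so far"
    if w_start is not None:
        yield (w_start, total, True)

fires = list(tumbling_with_early_fire([(5, 10), (20, 10), (35, 10), (70, 10)], 60, 2))
print(fires)  # [(0, 20, False), (0, 30, True), (60, 10, True)]
```

Downstream consumers must be prepared for the non-final emits to be superseded, which is exactly why Kafka Streams offers suppression.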
Window state in RocksDB — how it actually scales
For tumbling and sliding, the runtime stores a (window_start, key) → aggregate mapping in RocksDB. Tumbling keeps live windows + a fixed number of late-arrival windows (allowedLateness); sliding keeps size/slide simultaneous windows per key. For sessions, the storage is (key, session_start) → (session_end, aggregate). The trickiest case is sliding with a large multiplication factor — a (1h, 1s) sliding window means 3600 simultaneous windows per key. At 10M users, that is 36 billion (key, window_start) entries in RocksDB. The runtime mitigates this with incremental aggregation — instead of storing every event in the window, it stores only the running aggregate (sum, count, min, max). The event itself is dropped after contributing. This makes the per-window state O(1) bytes regardless of window size, which is the only reason multi-billion-row sliding windows fit on disk at all. The trick fails for non-mergeable aggregates like median or percentile-95 — those require either approximate algorithms (HyperLogLog, t-digest) or the full event list.
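A sketch of why incremental aggregation keeps per-window state at O(1): sum and count merge cheaply, while an exact median would need the full event list:

```python
from dataclasses import dataclass

@dataclass
class SumCount:
    """Mergeable aggregate: fixed-size state no matter how many events it absorbs."""
    s: int = 0
    n: int = 0

    def add(self, x: int) -> None:  # the event contributes, then is dropped
        self.s += x
        self.n += 1

    def merge(self, other: "SumCount") -> "SumCount":  # what session-merge needs
        return SumCount(self.s + other.s, self.n + other.n)

a, b = SumCount(), SumCount()
for x in (1240, 890):
    a.add(x)
for x in (2100, 3300):
    b.add(x)
merged = a.merge(b)
print(merged.s, merged.n)  # 7530 4: two O(1) states merged into one O(1) state
```

No such `merge` exists for an exact median: knowing the medians of two halves does not determine the median of the union, hence the t-digest-style approximations.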
Session merging — the most subtle case
A late event with event_time = T arrives for a user who already has session A ending at T - 25 and session B starting at T + 10. With gap = 30, the event extends A and connects to B — A and B must merge. Flink's MergingWindowAssigner handles this by reopening A, swallowing B, and emitting one merged aggregate. The state-store mechanics: A's RocksDB row gets updated with the new end-time, B's row gets deleted, the per-key timer gets rescheduled. The downstream consumer sees three emits — A (open), B (closed), then A (re-emitted with merged result and a retraction of B). Most engines implement this; Spark Structured Streaming had a long-standing gap in session-merging support that was only filled in 3.5+. If your job needs sessions, check the engine's merge story before committing.
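The merge case can be sketched in a few lines, assuming one key's sessions are held as sorted (start, end, sum) tuples, a simplification of the real state-store layout:

```python
GAP = 30  # same gap as the from-scratch demo

def insert_and_merge(sessions, ts, amount):
    """sessions: sorted, non-overlapping (start, end, sum) tuples for one key.
    Insert a (possibly late) event, then stitch any sessions it bridges."""
    sessions = sorted(sessions + [(ts, ts, amount)])
    merged = [sessions[0]]
    for s, e, total in sessions[1:]:
        last_s, last_e, last_total = merged[-1]
        if s - last_e <= GAP:  # within gap of the previous session: merge
            merged[-1] = (last_s, max(last_e, e), last_total + total)
        else:
            merged.append((s, e, total))
    return merged

before = [(5, 35, 4230), (90, 105, 3800)]  # u91001's two sessions from the demo
after = insert_and_merge(before, ts=62, amount=700)  # late event bridges the 55s gap
print(after)  # [(5, 105, 8730)]: one session where there were two
```

Without the late event the 55-second gap keeps the sessions apart; with it, both gaps shrink below 30s and the state store must delete one row, widen the other, and retract the already-emitted aggregate.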
When not to use windows at all
Three workloads where windowing is the wrong tool. (a) Cumulative-from-start aggregates ("total UPI volume year-to-date") — use a Flink KeyedProcessFunction with a stateful counter, no window. (b) Stream-stream joins — those use interval joins (every left event matches right events within [T - delta, T + delta]), not windows. The semantics are similar but the implementation is a separate code path because the boundary is per-event-pair, not per-window. (c) Materialised-view-style aggregates that need to be queryable on demand — use a streaming KV store (Flink's KeyedBroadcastProcessFunction with snapshot output to Kafka, or a Materialize/RisingWave streaming database). Windows fire and emit; some questions need state that you can query, not just one that fires.
Where this leads next
The next chapter — /wiki/event-time-vs-processing-time-the-whole-ballgame — explains the two clocks every windowed job has to reconcile, and why "wallclock" is almost always the wrong choice. After that, /wiki/watermarks-how-we-know-were-done shows the mechanism that lets the runtime decide when a window's aggregate is safe to emit. Together they close the loop: windows define what to aggregate, watermarks define when to fire, and event time decides which bucket an event belongs to.
The mental model to carry forward: a window is just a key-prefix on the state store. A tumbling window assigns the prefix (window_start, user_id). A sliding window assigns size/slide prefixes per event. A session window assigns (user_id, session_id) and reshuffles session_id on merge. Once you see windows as prefixes in RocksDB, the cost model becomes obvious — and the choice between tumbling, sliding, and session stops being "what does the dashboard look like?" and starts being "what does my state store look like at 3 a.m. on Diwali night?".
The capstone of Build 8 — /wiki/wall-exactly-once-is-a-lie-everyone-tells — covers how a window's emit interacts with the input log's offset commit. That is where exactly-once semantics either work or fall apart.
References
- The Dataflow Model — Akidau et al., 2015 — the foundational paper that unified tumbling/sliding/session under one model. Read this once and the rest of streaming is footnotes.
- Apache Flink — Windows documentation — the canonical API, with the merge semantics spelled out in detail.
- Kafka Streams — Windowing — Confluent's take, including the suppressor pattern.
- Streaming Systems — Akidau, Chernyak, Lax (O'Reilly) — the textbook; chapters 2–4 cover windows in depth.
- Streamflix — windowing in Spark Structured Streaming — for the Spark-shaped reader.
- /wiki/state-stores-why-rocksdb-is-in-every-streaming-engine — the previous chapter; the storage layer that windows sit on top of.
- /wiki/watermarks-how-we-know-were-done — how the runtime decides a window is ready to fire.
- /wiki/wall-stateless-stream-processing-isnt-enough — the wall article that introduced why state matters in the first place.