Materialize and Differential Dataflow: the database view

The fraud team at Razorpay has a query that every Indian payment-platform engineer has written some version of: "for each merchant, give me the count and sum of failed UPI charges in the last 5 minutes, joined with the merchant's daily success-rate baseline, with an alert when the rolling failure rate exceeds 3× that baseline." On the warehouse it runs in 4 seconds and is re-run every minute by a cron — 60 seconds of staleness, ₹40,000 of unnoticed fraud per missed window. On Flink it runs continuously, but the SQL had to be rewritten as KeyedProcessFunctions with explicit state: four months of engineering, two outages from broken watermarks. Materialize lets the team paste the original warehouse SQL into a CREATE MATERIALIZED VIEW statement, point it at the Kafka topic, and start answering the query in single-digit milliseconds — every time a new event arrives, the view updates, and a SELECT * FROM the_view reads the maintained result like any Postgres query. This chapter is about how that trick is possible.

Materialize is a streaming database that exposes ordinary SQL but maintains every materialized view incrementally — when an input row changes, only the affected output rows recompute. Underneath is Differential Dataflow, a runtime developed by Frank McSherry and colleagues that represents data as timestamped multisets of differences and propagates those differences through dataflow operators. The result is sub-millisecond view updates with the operational surface of a database, not a pipeline framework.

What "the database view" actually means

A traditional materialized view in Postgres is a snapshot — you run REFRESH MATERIALIZED VIEW and the engine recomputes the whole thing. Useful, but if your view is a 3-table join over 200 GB of payments and you want fresh results every second, refresh-from-scratch is dead on arrival.

A streaming pipeline (Flink, Kafka Streams, Beam) goes the other way — it's incremental but you write it as a graph of stateful operators, not as SQL. The state is yours to manage. Joins, retractions, late data, and updates each get their own ad-hoc treatment.

Materialize claims a third option: a SQL-defined materialized view that's always up-to-date because the runtime knows how to maintain it incrementally for any SQL operator — joins, aggregations, recursive CTEs, window functions, even self-joins.

Three ways to keep a query result fresh (figure). Warehouse refresh recomputes the whole query on a clock; streaming pipelines hand-roll incremental state; Materialize uses Differential Dataflow to incrementally maintain SQL views automatically.

                 Warehouse refresh        Streaming pipeline       Materialize
  SQL            yes                      partial                  yes
  Incremental    no                       yes                      yes
  Latency        minutes                  ms–s                     sub-ms
  Examples       Snowflake, BigQuery,     Flink, Kafka Streams,    Differential Dataflow
                 Postgres MV + REFRESH    Beam, ksqlDB             underneath
  Approach       Recompute the whole      Hand-roll stateful       SQL stays SQL; the
                 answer on a clock.       operators in code.       runtime does the
                                                                   incremental work.
A warehouse refresh trades freshness for SQL. A streaming pipeline trades SQL for freshness. Materialize keeps both by pushing incrementality into the runtime, not the user code.

The pitch is more than a UX argument. The same query, written once, plays three roles: a one-shot batch query (SELECT ... FROM payments), a streaming subscription (SUBSCRIBE TO ...), and a maintained view (CREATE MATERIALIZED VIEW ... AS ...) that you can SELECT FROM at any time and get the current answer in a few milliseconds. From the application's perspective, Materialize looks like Postgres — it speaks the Postgres wire protocol, supports psql, integrates with every Postgres client library. From the system's perspective, it's a continuously running streaming engine.

Why this matters for the Razorpay fraud query specifically: the engineering team writes one SQL definition. The fraud-monitoring service SELECTs from the view at every charge attempt to decide whether to flag. The dashboard SUBSCRIBEs to the view for live merchant tiles. The audit pipeline reads the change history of the view for retrospective forensic analysis. Three consumers, one definition, no separate pipeline code.

How Differential Dataflow updates a join when one row changes

Differential Dataflow (DD) is the engine that makes the database view possible. It was built by Frank McSherry, Derek Murray, Michael Isard, Rebecca Isaacs, Paul Barham, and Martín Abadi at Microsoft Research circa 2013, building on Naiad's "Timely Dataflow" model. The 2013 paper is short and famously dense; the practical idea is simpler than the prose suggests.

In a normal dataflow system, data flows through operators as records. In Differential Dataflow, data flows as differences: each piece of data is a tuple (record, timestamp, multiplicity), where multiplicity is +1 when a row is added and -1 when it is removed. An "update" is just a -1 of the old row and a +1 of the new row.
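To make the representation concrete, here is a minimal Python sketch (the consolidate helper is our own, not DD's API): a collection is whatever has non-zero net multiplicity after summing its difference triples, and an update is a retraction plus an insertion.

```python
from collections import Counter

def consolidate(diffs):
    """Sum multiplicities per record; the collection is whatever nets non-zero."""
    counts = Counter()
    for record, _ts, mult in diffs:
        counts[record] += mult
    return {r: m for r, m in counts.items() if m != 0}

diffs = [
    (("m9", "Spice Garden"), 1, +1),             # merchant row added at t=1
    (("m9", "Spice Garden"), 3, -1),             # rename: retract the old value at t=3
    (("m9", "Spice Garden Restaurant"), 3, +1),  # ...and insert the new value
]

print(consolidate(diffs))
# {('m9', 'Spice Garden Restaurant'): 1}
```

The (-1, +1) pair for the rename nets out against the original insertion, leaving exactly the current row.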

Operators don't see input collections; they see streams of these (record, timestamp, multiplicity) triples. A Join operator is the most instructive case. Suppose you join payments with merchants on merchant_id. When a new payment row (charge_id=p123, merchant_id=m9, amount=2500) arrives at time t, the join operator:

  1. Looks up m9 in the merchants index. Finds one matching row (merchant_id=m9, name="Spice Garden") with multiplicity +1.
  2. Emits the join result (charge_id=p123, merchant_id=m9, amount=2500, name="Spice Garden") with multiplicity +1 and timestamp t.

When a merchant row updates — say Spice Garden renames to Spice Garden Restaurant — the merchants stream emits (m9, "Spice Garden", -1) followed by (m9, "Spice Garden Restaurant", +1). The join sees the -1, looks up all payments matching m9, emits the join results with multiplicity -1 (a retraction), then sees the +1 and emits the join results again with multiplicity +1 (the new value).

Differential Dataflow join with a retraction (figure). Two input streams flow into a join operator that indexes both inputs on merchant_id. The payments stream carries (p123, m9, 2500, t1, +1) and (p124, m9, 1800, t2, +1); the merchants stream carries (m9, "Spice Garden", t3, -1) followed by (m9, "Spice Garden Restaurant", t3, +1). At t3 the join emits a -1 diff pairing "Spice Garden" with every matching payment, then the corresponding +1 diffs pairing them with "Spice Garden Restaurant".
The merchant rename arrives as a (-1, +1) pair. The join propagates retractions for every previously-emitted output that matched the old value, then emits new results with the new value. The user never wrote retraction logic.
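The figure's behaviour can be simulated in a few lines of Python. This is an illustrative sketch of a streaming equi-join, not Materialize's implementation: each side indexes the diffs it has seen, and each incoming difference joins against the other side's index, multiplying multiplicities.

```python
from collections import defaultdict

class StreamingJoin:
    """Toy incremental equi-join: multiplicities multiply across the two sides."""
    def __init__(self):
        self.left = defaultdict(list)    # key -> [(payload, mult)]
        self.right = defaultdict(list)

    def push_left(self, key, payload, mult):
        self.left[key].append((payload, mult))
        # A new left diff joins against everything indexed on the right.
        return [(key, payload, rp, mult * rm) for rp, rm in self.right[key]]

    def push_right(self, key, payload, mult):
        self.right[key].append((payload, mult))
        return [(key, lp, payload, lm * mult) for lp, lm in self.left[key]]

j = StreamingJoin()
j.push_left("m9", ("p123", 2500), +1)            # no merchant indexed yet -> no output
j.push_right("m9", "Spice Garden", +1)           # emits the first join result, +1
j.push_left("m9", ("p124", 1800), +1)            # joins against the indexed merchant
out = j.push_right("m9", "Spice Garden", -1)     # rename step 1: retractions for all matches
out += j.push_right("m9", "Spice Garden Restaurant", +1)  # step 2: re-emissions
for diff in out:
    print(diff)
```

The -1 on the merchant side fans out to a retraction for every indexed payment, and the +1 fans out to the replacements; nobody wrote retraction logic in the operator.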

This is the magic. Updates and deletes are not special — they are just negative multiplicities. Aggregations, joins, recursive CTEs: every operator only needs to know how to combine differences. The system never recomputes the whole answer; it computes the difference between the old answer and the new answer, and adds that difference to the maintained output.
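The same arithmetic covers aggregation. A sketch of an incremental COUNT/SUM (our own toy, not the real operator): each diff adjusts the per-key state by its multiplicity-weighted contribution, and the input is never rescanned.

```python
from collections import defaultdict

def maintain_sum_count(diffs):
    """Maintain per-key (count, sum) from (key, amount, mult) difference triples."""
    state = defaultdict(lambda: [0, 0])
    for key, amount, mult in diffs:
        state[key][0] += mult            # count absorbs the +1/-1 directly
        state[key][1] += mult * amount   # sum absorbs the signed amount
    return {k: tuple(v) for k, v in state.items() if v != [0, 0]}

diffs = [
    ("m9", 2500, +1),
    ("m9", 1800, +1),
    ("m9", 2500, -1),   # payment p123 retracted (e.g. aged out of the window)
]
print(maintain_sum_count(diffs))
# {'m9': (1, 1800)}
```

A retraction is just a negative contribution; the maintained (count, sum) is always the fold of all diffs seen so far.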

Why timestamps matter even though we are talking about differences: an event arriving "late" needs to slot into the right point in the partial-order timeline, so the join can match it against the right version of the merchant row. DD's timestamps are partially-ordered (vectors), which is what makes recursive computations like graph reachability incrementally maintainable. For a streaming SQL view, the timestamps are just real time, and the partial-order machinery degenerates into a total order.

The full algorithm — including how to handle iterative computations like fixed-point graph algorithms — is in the 2013 CIDR paper. The Rust crate that implements DD is open source: differential-dataflow.

Materialize, the database wrapper

Materialize takes Differential Dataflow (the runtime) and wraps it in a Postgres-compatible front-end. The user types SQL; the planner compiles SQL to a Differential Dataflow program; the engine runs that program continuously. Indexes, query planning, transactions, and durable storage are added on top so it feels like a database, not a pipeline.

A worked example. Razorpay's fraud team writes the SQL once, and three Indian-context consumers read from it.

-- materialize-failure-rate.sql — runs continuously inside Materialize

-- 1. Define sources from Kafka. (Materialize can also ingest Postgres logical
--    replication; here we read the Kafka topics directly, with upsert semantics.)
CREATE SOURCE payments
  FROM KAFKA BROKER 'kafka.razorpay-prod:9092' TOPIC 'payments.events.v3'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
  ENVELOPE UPSERT;

CREATE SOURCE merchants
  FROM KAFKA BROKER 'kafka.razorpay-prod:9092' TOPIC 'merchants.dim.v1'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
  ENVELOPE UPSERT;

-- 2. Define the materialized view: per-merchant 5-minute failure rate,
--    flagged when 3x baseline.
CREATE MATERIALIZED VIEW fraud_alerts AS
  WITH recent AS (
    SELECT merchant_id,
           COUNT(*) FILTER (WHERE status = 'failed') AS failed_5m,
           COUNT(*) AS total_5m
    FROM payments
    WHERE method = 'upi'
      AND mz_now() <= event_time + INTERVAL '5 minutes'
    GROUP BY merchant_id
  )
  SELECT r.merchant_id,
         m.name,
         r.failed_5m,
         r.total_5m,
         (r.failed_5m::float / NULLIF(r.total_5m, 0)) AS rate_5m,
         m.baseline_failure_rate
  FROM recent r
  JOIN merchants m ON r.merchant_id = m.merchant_id
  WHERE (r.failed_5m::float / NULLIF(r.total_5m, 0)) > 3 * m.baseline_failure_rate;

-- 3. Three consumers, one view.
SELECT * FROM fraud_alerts WHERE merchant_id = 'mz_amazon_in';   -- point lookup, ~2ms
SUBSCRIBE TO fraud_alerts;                                       -- live tail for dashboard
COPY (SELECT * FROM fraud_alerts) TO 's3://razorpay-audit/...';  -- snapshot for audit
Sample run (after a 4-minute warm-up of the view from the Kafka backlog):

razorpay=> SELECT * FROM fraud_alerts WHERE merchant_id = 'mz_amazon_in';
 merchant_id   |     name      | failed_5m | total_5m | rate_5m | baseline_failure_rate
---------------+---------------+-----------+----------+---------+-----------------------
 mz_amazon_in  | Amazon India  |       412 |     6300 |  0.0654 |                  0.018
(1 row)

Time: 2.314 ms

Walk through the load-bearing lines:

  1. ENVELOPE UPSERT tells Materialize to treat the Kafka message key as a primary key, so a new value for an existing key arrives as a retraction of the old row plus an insertion of the new one.
  2. mz_now() <= event_time + INTERVAL '5 minutes' is a temporal filter: as logical time advances past a row's window, the row is automatically retracted from recent. The sliding window needs no timer code.
  3. COUNT(*) FILTER (WHERE status = 'failed') is maintained incrementally; each payment diff adjusts the per-merchant counts rather than rescanning five minutes of data.
  4. The JOIN against merchants means a change to a merchant's baseline_failure_rate retracts and re-derives that merchant's alert row, exactly as in the join example above.

What Materialize quietly does for you

A SQL definition that maintains itself sounds like magic, and the things Materialize handles silently are where the value sits. Each one would be a multi-week engineering effort in a hand-rolled Flink job.

Retractions for free

In Flink Table API or Kafka Streams, when an upstream row updates, your join's downstream needs to know that the old result is no longer valid. You handle this with retraction streams or KTables and tombstone semantics — and you have to be careful that downstream operators preserve retraction correctness. Materialize bakes retractions into the multiplicity arithmetic. Update the merchant row, and every dependent view automatically retracts the stale derivations and emits the new ones.

Self-joins, recursive CTEs, and graph queries

Streaming SQL engines have always struggled with recursive CTEs because incremental maintenance of a fixed-point computation is genuinely hard. Differential Dataflow was originally built for exactly this case (interactive graph computations). Materialize supports recursion natively through its WITH MUTUALLY RECURSIVE extension, which means use cases like "find all merchants in the same ring as a fraud-flagged merchant" (transitive closure over a graph of co-occurring devices) are SQL one-liners.
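To see why fixed points fit the model, here is a semi-naive transitive closure in Python (a sketch of the classic algorithm, not Materialize's planner): each round joins only the newly derived pairs against the edge relation, which is the batch analogue of propagating per-iteration differences.

```python
def transitive_closure(edges):
    """Semi-naive fixed point: each round extends only last round's new pairs."""
    reach = set(edges)
    frontier = set(edges)
    while frontier:
        # Join only the *new* facts against the edge relation.
        new = {(a, c) for (a, b) in frontier for (b2, c) in edges if b == b2}
        frontier = new - reach
        reach |= frontier
    return reach

# A fraud ring as a cycle in the co-occurring-device graph (names illustrative).
ring = {("m1", "m2"), ("m2", "m3"), ("m3", "m1")}
print(sorted(transitive_closure(ring)))
```

In DD the "only the new facts" discipline falls out of the difference representation: iteration i+1 receives exactly the diffs produced at iteration i.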

Strong consistency

Materialize promises strict serializability for queries against materialized views — every SELECT sees a consistent snapshot at some logical timestamp, and timestamps advance monotonically. The Kafka offsets ingested up to a logical timestamp form a consistent read of the view. This is far stronger than what most streaming SQL engines offer; in Flink Table API, two queries against the same stream can see two different cuts of the data.

A scheduling story you don't have to write

Materialize runs as a single-binary database with internal scheduling. Adding a new view is a SQL statement, not a deployment. The dataflow graphs across all your views share state where they overlap (e.g. two views that both filter UPI payments share the upstream filter operator's state). Compare this to the Flink alternative, where each "job" is a separate deployment, a separate JobGraph, a separate state backend.

Failure recovery

Materialize durably logs ingested updates and takes periodic state snapshots, and replays them to recover from a node crash. From the user's perspective, the view goes briefly unavailable and comes back at the same point. Internally this is the equivalent of Flink's checkpointing, but you didn't write a checkpointing config.

Where the abstraction leaks

Differential Dataflow and Materialize are not a free lunch. The leaks are real:

  1. Memory. Arrangements are resident state: the maintained answer and its intermediate indexes live in memory, so a view whose output or state runs to hundreds of GB blows the budget long before input volume does.
  2. Output size. The model shines when a large input maintains a small answer; when the answer itself is huge, incremental maintenance pays continuously the storage cost a warehouse pays only at query time.
  3. Plan sensitivity. Whether two views share an arrangement depends on how the planner aligns their sub-expressions, so a small rewrite of a view's SQL can change the memory bill; reviewing view plans becomes an operational habit.

Common confusions

  1. Materialize's materialized views are not Postgres materialized views. A Postgres MV is a snapshot you must REFRESH; a Materialize view is maintained continuously and is never stale relative to ingested input.
  2. Differential Dataflow is not Timely Dataflow. DD is the difference calculus (collections as timestamped diffs); Timely is the scheduling and progress-tracking substrate underneath it.
  3. A streaming database is not a warehouse replacement. Materialize maintains small, fresh answers over large inputs; ad-hoc scans over cold history still belong in the warehouse.

Going deeper

The timestamp algebra that makes recursive views possible

DD's deepest trick is its use of partially-ordered timestamps. A timestamp in DD is not just a clock value; it's a vector that records the iteration count at each level of nested fixed-point computation, plus the input timestamp. For a streaming SQL view, the timestamp degenerates to a single integer (logical time). For a graph reachability computation, the timestamp is (input_time, iteration) — and the runtime can incrementally maintain the result across both axes. When a new edge arrives at input_time t+1, the runtime knows which iteration's results are affected, retracts them, and re-emits without rerunning iterations that don't depend on the new edge. The full algebra is in McSherry's 2013 CIDR paper and in the Naiad SOSP 2013 paper that gave it the timestamp foundation.
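The product partial order on these two-level timestamps can be written down directly (a minimal sketch; the function names are ours, not DD's API): (1, 5) and (2, 3) are incomparable in both directions, which is exactly what lets a late input coexist with a deep iteration, and the least upper bound is where their effects must be reconciled.

```python
def leq(a, b):
    """Product partial order on (input_time, iteration) pairs."""
    return a[0] <= b[0] and a[1] <= b[1]

assert leq((1, 3), (2, 3))          # later input, same iteration: ordered
assert not leq((1, 5), (2, 3))      # incomparable...
assert not leq((2, 3), (1, 5))      # ...in both directions

def join_ts(a, b):
    """Least upper bound: the earliest timestamp at or after both inputs."""
    return (max(a[0], b[0]), max(a[1], b[1]))

print(join_ts((1, 5), (2, 3)))
# (2, 5)
```

For a plain streaming view the iteration coordinate is constant, and leq collapses to ordinary integer comparison — the "degenerates into a total order" remark above.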

The arrangement: how DD shares state across operators

When you write two views that both filter UPI payments, a naive runtime would store the filtered results twice. DD introduces the concept of an arrangement — an indexed, shared, append-only collection of difference triples that multiple operators can read. The compiler identifies opportunities to share an arrangement, pushing the storage cost from "per view" to "per logical sub-expression". This is part of why Materialize's memory footprint, while large, doesn't grow linearly with the number of views — adding a view that reuses an existing filter pays only for the new operator's state, not for re-storing the upstream input.
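A toy version of the sharing, under heavy simplification (a real DD arrangement is an indexed, time-organised batch structure; the class below captures only the accounting idea): one indexed store of diffs, read by several views without copying.

```python
from collections import defaultdict

class Arrangement:
    """One shared index of diffs: key -> [(record, mult)]."""
    def __init__(self):
        self.index = defaultdict(list)
    def insert(self, key, record, mult):
        self.index[key].append((record, mult))
    def lookup(self, key):
        return self.index[key]

# One arrangement of UPI payments (merchant and statuses illustrative)...
upi = Arrangement()
upi.insert("m9", ("p123", "failed"), +1)
upi.insert("m9", ("p124", "success"), +1)

# ...read by two independent "views" without duplicating the stored data.
failed_count = sum(m for (_, status), m in upi.lookup("m9") if status == "failed")
total_count = sum(m for _, m in upi.lookup("m9"))
print(failed_count, total_count)
# 1 2
```

Adding a third view that reads the same arrangement costs only that view's own operator state, which is the sub-linear memory growth described above.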

Why Materialize uses Timely Dataflow under DD

Differential Dataflow is built on top of Timely Dataflow, a low-level data-parallel scheduling framework also from Frank McSherry. Timely's job is to schedule operators across worker threads with progress tracking — it knows when all messages at timestamp t have been processed, so DD can safely declare a view consistent at t. The progress-tracking protocol is what gives Materialize its strong-consistency story; without it, you'd have to fall back on watermarks-as-best-effort like Flink does. Timely's progress is exact, not estimated.
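The exact-not-estimated claim can be sketched as reference counting per timestamp. This is a toy of the idea only; Timely's real protocol tracks counts per location in the dataflow graph across workers. A timestamp is declared complete exactly when its outstanding-message count reaches zero.

```python
from collections import Counter

class ProgressTracker:
    """Toy exact progress tracking: count outstanding messages per timestamp."""
    def __init__(self):
        self.outstanding = Counter()
    def sent(self, ts, n=1):
        self.outstanding[ts] += n
    def consumed(self, ts, n=1):
        self.outstanding[ts] -= n
    def frontier(self):
        """Earliest timestamp that may still produce output (None if all done)."""
        pending = [ts for ts, n in self.outstanding.items() if n > 0]
        return min(pending) if pending else None

pt = ProgressTracker()
pt.sent(ts=3, n=2); pt.sent(ts=4, n=1)
pt.consumed(ts=3)
print(pt.frontier())   # 3 -- one message at t=3 still outstanding
pt.consumed(ts=3)
print(pt.frontier())   # 4 -- t=3 is now provably complete
```

Once the frontier passes t, DD can declare every view consistent as of t; a watermark, by contrast, is a heuristic promise that may be violated by late data.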

Where Indian companies actually use Materialize in 2026

The Indian adoption pattern in 2026 is "high-value, low-volume" use cases — places where freshness matters more than throughput. PhonePe runs Materialize for its merchant-disbursement consistency dashboards (every disbursement reconciliation must reflect the latest in-flight transaction within 100ms). Cred uses it for its real-time rewards-eligibility view (rewards rules change daily; users see updated eligibility within seconds of a rule change). A few Bengaluru fintechs (Niyo, Jupiter Money) experimented with it for fraud rules but reverted to Flink because the volume was too high for DD's memory model at the time. The mental model: Materialize is great when your view's answer is small (thousands or millions of rows of output) even if the input is large; it struggles when the maintained answer is itself hundreds of GB. The break-even depends on hardware, but ₹50 lakh-a-month for a Materialize cluster maintaining 50 views is a typical reasonable budget at a series-B fintech.

The new pretenders: RisingWave, Feldera, and the rest

Materialize is not alone. RisingWave (an open-source streaming database, written in Rust, also built around differential-style incremental maintenance) gained traction in 2024–2025, particularly in China and increasingly in India. Feldera is a successor system that McSherry co-created, based on a related calculus called DBSP (Database Stream Processor), which its authors argue is even more general than DD. The open question for the 2026–2028 cycle is whether the streaming-database category consolidates around one runtime model or splits across DD, DBSP, and IVM-on-RDBMS variants. For an engineer choosing today, Materialize is the most operationally mature; RisingWave is the leading fully open alternative.

Where this leads next

Materialize is the endpoint of the unification arc that started with the Lambda architecture and ran through Kappa and Beam. Lambda separated batch and stream; Kappa unified them as one stream pipeline; Beam unified the SDK; Materialize unifies the query language with the runtime semantics of incremental maintenance. You write SQL — the same SQL you'd write for the warehouse — and the runtime makes it streaming.

The bigger pattern: every decade, streaming systems trade one form of operational complexity for another. The 2010s traded "batch latency" for "operator complexity in your code". The 2020s are trading "operator complexity in your code" for "memory cost of materialized state". The 2030s might trade memory cost for something else (maybe GPU-compiled differential dataflows, maybe persistent-memory-based arrangements). The job is not to pick a winner but to understand which trade is right for your workload today.

References

  McSherry, F., Murray, D. G., Isaacs, R., and Isard, M. "Differential Dataflow." CIDR 2013.
  Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. "Naiad: A Timely Dataflow System." SOSP 2013.
  Budiu, M., Chajed, T., McSherry, F., Ryzhyk, L., and Tannen, V. "DBSP: Automatic Incremental View Maintenance for Rich Query Languages." VLDB 2023.