Wall: consumers want the data shaped for their use case

Build 11 ended with a victory. The Razorpay payments Postgres now emits a faithful stream of every change — every insert, every update, every delete, every schema migration, every transaction boundary — into a Kafka topic. The pipeline that started this build with a 2:47 a.m. page about ₹40 lakh missing is now silent at 2:47 a.m. because the data is correct. Then on Tuesday afternoon, five different teams turn up at the data team's desk in Whitefield with five different requests, and every request is reasonable, and not a single one can be served by reading the Kafka topic directly. This is the wall that opens Build 12.

A faithful CDC stream is the right shape for one consumer: another database that wants to mirror the source. Every other consumer — the warehouse, the search index, the dashboard, the ML feature store, the audit ledger — wants the data ground, sliced, joined, or columnarised differently. Trying to spawn one pipeline per consumer scales as O(consumers × producers), and that is the wall the lakehouse exists to break.

The five Tuesday-afternoon requests

Sit at the data team's desk for one afternoon in a Bengaluru fintech and listen to what people ask for. The shapes are predictable across companies because the consumer roles are predictable.

Five reasonable asks, five different shapes. Finance wants a wide, joined, columnar batch table; Search wants a denormalised JSON document, one per row, indexed for full-text; Fraud wants per-key materialised views with rolling windows; ML wants temporally-correct training data plus an online cache; Audit wants every historical version with snapshot-isolated reads.

Five consumers, five required shapes for the same source data. [Diagram: one CDC stream (Kafka topic payments.public.merchants, ~12k events/sec) fans out to five differently-shaped sinks — Finance's warehouse (batch + columnarise + join: columnar, joined, daily 2 TB scans, 30 s OK), Search's Elasticsearch (denormalise + index: denormalised JSON, 30 s freshness), Fraud's KV cache (window + materialise: 3 ms point lookup), ML's feature store (temporal join + serve: offline training + 5 ms online, point-in-time correct), and Audit's history table (SCD2 + retain forever: every version, never GC'd, "as-of" queries).]
The same change stream has to land in five differently-shaped systems. Build 12 is about the substrate that lets all five reuse one storage layer.

Why this isn't just "denormalise and dump" — the shapes disagree on more than column layout. Finance wants snapshot-isolated reads of a 2 TB table; Search wants 30 s freshness on a single document; Fraud wants 3 ms point lookups; ML wants temporal correctness across a join; Audit wants every old version. The storage primitives that satisfy each one are different: columnar files for Finance, an inverted index for Search, an LSM-tree or hash table for Fraud, a KV + columnar pair for ML, an append-only log for Audit. No single storage layer is naively good at all five.

The naive "one pipeline per consumer" tax

The first instinct is the obvious one: stand up five Kafka consumer groups, each writing to its own sink. Search team writes a consumer that updates Elasticsearch on every change. Fraud team writes a consumer that updates Redis. Finance team writes a Spark job that reads the topic into Parquet. ML team writes a Flink job. Audit team writes another Flink job.

This works for one quarter. Then you count the cost.

Suppose the company has P producers (OLTP databases, message logs, third-party APIs, internal services) and C consumer shapes. The naive architecture has P × C pipelines: every producer fans out to every consumer's bespoke sink, with bespoke schema, bespoke retry logic, bespoke offset management, bespoke schema-evolution handling. A 30-service company with 4 consumer shapes is already 120 pipelines.

# pipelines_inventory.py — count the bespoke pipeline tax
# at a mid-stage Indian fintech.

producers = [
    "payments_postgres", "merchants_postgres", "ledger_oracle",
    "kyc_postgres", "settlements_postgres", "disputes_mysql",
    "auth_postgres", "events_kafka", "fraud_signals_kafka",
    "external_npci_api", "external_bank_recon_files",
    # ... ~30 source systems total
]

consumers = [
    ("warehouse",     "Snowflake / BigQuery"),
    ("search",        "Elasticsearch"),
    ("kv_cache",      "Redis"),
    ("feature_store", "Feast / DynamoDB"),
    ("audit_history", "Postgres + cold S3"),
]

# Each (producer, consumer) is a bespoke pipeline:
#  - schema mapping
#  - serialisation format choice (Avro / JSON / Protobuf)
#  - error handling, dead-letter queue
#  - schema-evolution handler
#  - on-call runbook
#  - cost attribution

n_pipelines = len(producers) * len(consumers)
print(f"Bespoke pipelines: {n_pipelines}")

# Engineer-hours per pipeline (industry rule of thumb):
hours_to_build = 80
hours_to_maintain_per_year = 40

print(f"Build cost      : {n_pipelines * hours_to_build} engineer-hours")
print(f"Yearly maintain : {n_pipelines * hours_to_maintain_per_year} engineer-hours")
print(f"At ₹2,500/hr (loaded): "
      f"₹{n_pipelines * (hours_to_build + hours_to_maintain_per_year) * 2500 / 1e7:.2f} crore "
      "first-year cost")

# Sample run:
# Bespoke pipelines: 55
# Build cost      : 4400 engineer-hours
# Yearly maintain : 2200 engineer-hours
# At ₹2,500/hr (loaded): ₹1.65 crore first-year cost

Walk through the lines that are doing real work. n_pipelines is a plain product — nothing in the naive architecture amortises across the matrix, because every (producer, consumer) pair carries its own schema mapping, serialisation choice, dead-letter queue, and runbook. The cost lines multiply that product by per-pipeline build and maintenance hours at a loaded ₹2,500/hr; dividing by 1e7 converts rupees to crore. Even with just the 11 producers listed, the matrix is 55 pipelines and ₹1.65 crore in the first year; at the full ~30 source systems it nearly triples.

This is the wall. Not "we can't build it" but "we can build any one of these and the whole inventory will eat us alive over three years".

Why each consumer's storage layer disagrees with the others

Drill one level down. The reason five teams want five shapes is not arbitrary preference; it's that each storage layer is optimised for a fundamentally different access pattern. The five shapes correspond to five distinct points on the storage-design space.

| Consumer | Access pattern | Optimal storage shape | Indexing primitive |
| --- | --- | --- | --- |
| Finance / warehouse | Wide scans across many rows, few columns at a time | Columnar with min/max stats per row group | Zone maps, bloom filters |
| Search | Token-based lookup across many documents | Inverted index | Posting lists |
| Fraud / KV | Single-key lookup, often with a 24-hour rolling window | LSM-tree or hash table + per-key state store | B-tree or hash on key |
| ML feature store | Temporal join: "the value of feature F for entity E at time T" | KV for online + columnar with event-time partition for offline | Bitemporal index |
| Audit | "Give me row R at time T", monotonically growing | Append-only log + snapshot index | Time-travel snapshot id |

The columnar+statistics layout that makes Finance's SUM(amount) WHERE date BETWEEN ... query scan 2 GB instead of 2 TB is exactly the wrong layout for Search's "find all merchants whose name contains 'cafe'" — that needs an inverted index, which a columnar file does not offer. The key-value store that gives Fraud a 3 ms point lookup cannot answer Finance's group-by because it does not know what the columns are. The append-only audit log makes "as-of" queries trivial but turns Finance's ad-hoc analytical query into a full historical replay.
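To make the "2 GB instead of 2 TB" claim concrete, here is a minimal sketch of zone-map pruning — the mechanism behind the per-row-group min/max statistics in the table above. The data and field names are hypothetical; real Parquet readers do this per column chunk.

```python
# Zone-map pruning sketch: each row group carries min/max stats for the
# date column, and a range predicate skips whole groups without reading them.
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_date: str   # ISO-format dates compare correctly as strings
    max_date: str
    size_bytes: int

groups = [
    RowGroup("2025-01-01", "2025-01-31", 512_000_000),
    RowGroup("2025-02-01", "2025-02-28", 498_000_000),
    RowGroup("2025-03-01", "2025-03-31", 530_000_000),
]

def groups_to_scan(groups, lo, hi):
    # A group is skipped iff its [min, max] range cannot overlap [lo, hi].
    return [g for g in groups if not (g.max_date < lo or g.min_date > hi)]

scan = groups_to_scan(groups, "2025-02-10", "2025-02-20")
print(len(scan), "of", len(groups), "row groups scanned")  # 1 of 3
```

The same trick cannot help Search's substring query: 'cafe' can appear in any row group, so every min/max range "might" match and nothing is pruned — which is why Search needs a different index entirely.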

Five storage shapes plotted against access-pattern axes. [Chart: the five consumers in the read-pattern (point lookup → wide scan) × freshness (daily → sub-second) plane — Fraud (Redis/DynamoDB) and the ML online store at the point-lookup, sub-second corner; Search (Elasticsearch) sub-second, mid-pattern; the Audit append-only history log middle-right; the Warehouse at the wide-scan, daily corner. No single storage covers this whole plane.]
The five consumers occupy different regions of the read-pattern × freshness plane. No single storage primitive is a good fit for all five. The lakehouse's claim is not "one storage that does everything", but "one storage layer that the others can sit on top of without replicating raw data".

Why a single storage primitive cannot win all five corners: the access patterns conflict at the physics level. Wide-scan analytical queries want data laid out by column with skip-indexes; point lookups want data laid out by row with hash or B-tree indexes. You can build hybrids (HTAP databases like TiDB or SingleStore try) but they pay a 2–5× cost penalty versus a single-purpose engine, which means at scale you still end up with specialised engines on the hot paths. The lakehouse accepts this and aims at a different goal: a shared storage substrate that the specialised engines borrow from, rather than each one ingesting raw events independently.

What "shaped for the consumer" actually means

The verb "shape" hides three different operations, and Build 12 needs all three.

Reorganise the row layout. Take a stream of (merchant_id, column, old_value, new_value, lsn, txn_id) events and produce a Parquet file where rows are grouped by date and columns are grouped by column-id, with min/max statistics per row group. This is what makes Finance's SUM query fast.
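A minimal, stdlib-only sketch of this reorganisation: row-shaped change events become date-grouped columnar batches with per-batch min/max statistics — the layout a Parquet writer would then persist. Event fields here (merchant_id, amount, event_date) are illustrative, not the real payments schema.

```python
# Reorganise: pivot row-shaped CDC events into date-partitioned columnar
# batches, computing the min/max stats a query engine would prune against.
from collections import defaultdict

events = [  # (merchant_id, amount, event_date) — hypothetical change events
    ("m1", 500, "2025-02-10"), ("m2", 120, "2025-02-10"),
    ("m1", 900, "2025-02-11"), ("m3",  40, "2025-02-11"),
]

batches = defaultdict(lambda: {"merchant_id": [], "amount": []})
for mid, amt, day in events:
    batches[day]["merchant_id"].append(mid)
    batches[day]["amount"].append(amt)

for day, cols in sorted(batches.items()):
    stats = (min(cols["amount"]), max(cols["amount"]))
    print(day, "rows:", len(cols["amount"]), "amount min/max:", stats)
```

The point is the pivot: the same four events, stored by column per partition instead of by row, let a SUM over one date touch one batch and one column.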

Materialise a derived view. Take the stream and compute, for every cardholder, a rolling 24-hour aggregate plus the last 50 transactions. Update the view as new events arrive. This is what gives Fraud a 3 ms lookup.
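A toy version of this materialisation, assuming single-threaded in-memory state (a real deployment would use Flink keyed state or a Redis hash; field names are illustrative):

```python
# Materialise: per-cardholder view holding a rolling 24-hour sum plus the
# last N transactions, updated incrementally as each event arrives.
from collections import defaultdict, deque

WINDOW = 24 * 3600  # seconds
LAST_N = 50

views = defaultdict(lambda: {"window": deque(), "recent": deque(maxlen=LAST_N)})

def apply_event(cardholder_id, amount, ts):
    v = views[cardholder_id]
    v["window"].append((ts, amount))
    v["recent"].append((ts, amount))
    # Evict events that have aged past the 24-hour window.
    while v["window"] and v["window"][0][0] < ts - WINDOW:
        v["window"].popleft()

def lookup(cardholder_id):
    # The point lookup: everything is precomputed, no scan at read time.
    v = views[cardholder_id]
    return sum(a for _, a in v["window"]), list(v["recent"])

apply_event("c1", 100, ts=0)
apply_event("c1", 250, ts=3600)
apply_event("c1", 75, ts=25 * 3600)   # the ts=0 event is now outside the window
total_24h, recent = lookup("c1")
print(total_24h)  # 325: the aged-out event no longer counts
```

The view is the contract: the stream is replayed into per-key state once, and every Fraud lookup after that is a hash read, not a scan.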

Preserve history with snapshot isolation. Take the stream and write it to a layout where every commit produces a new version-snapshot, queryable independently. Old versions are retained until explicitly compacted. This is what gives Audit "as-of" queries.
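An in-memory toy of commit-produces-a-snapshot semantics — the shape of what Iceberg and Delta do with manifest files, not a real table format:

```python
# Preserve history: each commit produces an immutable snapshot; a reader
# pins a snapshot id and sees a stable "as-of" view of the table.
snapshots = []   # list of immutable {key: value} states, one per commit
current = {}

def commit(changes):
    global current
    current = {**current, **changes}   # copy-on-write: prior states untouched
    snapshots.append(current)
    return len(snapshots) - 1          # snapshot id = the temporal cursor

def read_as_of(snapshot_id, key):
    return snapshots[snapshot_id].get(key)

s0 = commit({"merchant:m1": "active"})
s1 = commit({"merchant:m1": "suspended"})
print(read_as_of(s0, "merchant:m1"))  # active
print(read_as_of(s1, "merchant:m1"))  # suspended
```

Because commits never mutate earlier snapshots, Audit's "as-of" query is a lookup against a pinned snapshot id rather than a replay — the property the Cred case below relies on.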

The lakehouse story (Iceberg, Delta, Hudi) is precisely the substrate where all three operations can be expressed against the same files on S3, instead of three independent copies in three independent storage systems. Search and Fraud still live outside the lakehouse — those are point-shaped consumers — but Finance, ML offline, and Audit can share a single set of Parquet files, indexed differently for each query pattern, with snapshot-isolated commits coordinating the writers.

Edge cases that make the shape problem worse

The "five shapes" framing is clean. Production introduces five more dimensions that the diagram above doesn't show.


Going deeper

The Razorpay 2024 build-vs-buy tipping point

Razorpay's data engineering team published a 2024 retrospective on the moment their per-consumer-pipeline architecture became unsustainable. The trigger was 78 production pipelines, 14 source databases, 9 sink shapes, and an on-call rotation that was paged 18 times a week. The decision to consolidate around Iceberg-on-S3 + Trino as the shared analytical substrate cut the on-call page rate to 4 a week within two quarters, but it also forced a 6-month rewrite of every analytical pipeline and a one-time data migration of 380 TB. The cost was real; the alternative (continuing to scale linearly with consumer count) was estimated at ₹12 crore in extra headcount over three years.

Why HTAP databases (TiDB, SingleStore) didn't kill the lakehouse

The hybrid transactional-analytical processing pitch was: one database serves OLTP and OLAP workloads from the same storage, eliminating the need for CDC and a warehouse. The architecture is real (TiDB's TiFlash columnstore, SingleStore's row+column storage). What kills it at Indian-fintech scale is not technology but organisation: the OLTP database is owned by the platform team that signs an SLA on payment authorisation latency; analytical workloads are noisy neighbours that violate that SLA in ways that are hard to bound. CDC + lakehouse keeps the noisy workloads physically isolated from the critical path. The HTAP system technically works; the political economy of operating it doesn't.

The Cred history-table problem

Cred's compliance team needs to answer "what was a member's reward-eligibility status at any point in the last 5 years". A naive implementation writes every change to a member_status_history Postgres table, which by 2025 had grown to 2.1 billion rows, with month-end queries scanning 8% of the table. The fix Cred described in a 2025 blog was to mirror the audit shape into Iceberg: write the change stream as an append-only Iceberg table partitioned by event-date, use Iceberg's snapshot id as the temporal cursor, and let Trino answer "as-of" queries with snapshot expressions. Postgres no longer holds the history; Iceberg does. The Postgres member_status_history table now retains 90 days; everything older lives on S3 at 1/40th the storage cost.

Schema-on-read vs schema-on-write, in the consumer-shape lens

The classical "data lake" pitch was schema-on-read: dump every event raw, let each consumer parse what they need. This produced the data-swamp anti-pattern — every consumer wrote the same parsing logic, in subtly different ways, with no central authority on what customer_segment meant. The lakehouse is a deliberate retreat: the substrate is schema-on-write (Iceberg/Delta tables have committed schemas), but the schema can evolve cleanly with snapshot-isolated DDL. Consumers parse less; the substrate guarantees more. This is also why dbt and the semantic layer (Build 13) sit on top of the lakehouse — they need a consistent, evolving schema to define metrics over.

What this wall looks like at 1000-pipeline scale (Flipkart, Swiggy)

At Flipkart's scale, the producer count is in the low hundreds (one per microservice), the consumer-shape count is in the low tens (analytics, search, recommendations, fraud, observability, billing, finance). The naive matrix is in the thousands of pipelines; the actual pipeline count after lakehouse consolidation is in the low hundreds. The remaining pipelines are the consumer-specific tail — Search still has its own indexer, Fraud still has its own materialisation. But the analytical core — the shape that 70% of business questions are asked against — sits on a single Iceberg-on-S3 substrate with Trino as the query layer. That's the ratio Build 12 is aiming for.

Where this leads next

Build 12 takes this wall apart in four mechanism-shaped chapters.

The closing chapter of Build 12, /wiki/cdc-iceberg-the-real-world-pattern, brings Build 11's CDC stream and Build 12's lakehouse storage together: how do you actually land a Razorpay-scale CDC stream into Iceberg, with hot-key updates, schema evolution, and bounded latency? That is the chapter that retires this wall.
