Wall: consumers want the data shaped for their use case
Build 11 ended with a victory. The Razorpay payments Postgres now emits a faithful stream of every change — every insert, every update, every delete, every schema migration, every transaction boundary — into a Kafka topic. The pipeline that started this build with a 2:47 a.m. page about ₹40 lakh missing is now silent at 2:47 a.m. because the data is correct. Then on Tuesday afternoon, five different teams turn up at the data team's desk in Whitefield with five different requests, and every request is reasonable, and not a single one can be served by reading the Kafka topic directly. This is the wall that opens Build 12.
A faithful CDC stream is the right shape for one consumer: another database that wants to mirror the source. Every other consumer — the warehouse, the search index, the dashboard, the ML feature store, the audit ledger — wants the data ground, sliced, joined, or columnarised differently. Trying to spawn one pipeline per consumer scales as O(consumers × producers), and that is the wall the lakehouse exists to break.
The five Tuesday-afternoon requests
Sit at the data team's desk for one afternoon in a Bengaluru fintech and listen to what people ask for. The shapes are predictable across companies because the consumer roles are predictable.
- The Finance analyst (Aditi) wants a daily snapshot of `merchants` and `payouts` for month-end reconciliation. She runs full-table queries with joins across both tables and three years of history, in Looker, against the warehouse. She does not care that her query scans 2 TB; she cares that the answer matches the audit team's answer to the rupee.
- The Search team (Karan) wants every change to a merchant's `display_name`, `category`, or `pincode` indexed into Elasticsearch within 30 seconds. He doesn't want most columns. He wants near-real-time freshness and a denormalised document with the merchant joined to its parent organisation.
- The Fraud team (Riya) wants the last 50 transactions for every cardholder, plus aggregates over the last 24 hours, in a low-latency key-value store. She wants point lookups in 3 ms because the fraud check sits on the payment authorisation path.
- The ML team (Jishant) wants offline training data — every transaction with the merchant's then-current category and the cardholder's then-current risk score — and online serving features at 5 ms p99. He wants point-in-time correctness so the training set doesn't leak future labels.
- The Audit team (Kiran) wants an immutable, queryable history: "show me what `merchants.kyc_status` was for `id=8141` at 17:00 IST on 2026-04-24". He wants every old version, never garbage-collected.
Five reasonable asks. Five different shapes. Read them again — Finance wants a wide, joined, columnar batch table; Search wants a denormalised JSON document, one per row, indexed for full-text; Fraud wants per-key materialised views with rolling windows; ML wants temporally-correct training data plus an online cache; Audit wants every historical version with snapshot-isolated reads.
Why this isn't just "denormalise and dump" — the shapes disagree on more than column layout. Finance wants snapshot-isolated reads of a 2 TB table; Search wants 30 s freshness on a single document; Fraud wants 3 ms point lookups; ML wants temporal correctness across a join; Audit wants every old version. The storage primitives that satisfy each one are different: columnar files for Finance, an inverted index for Search, an LSM-tree or hash table for Fraud, a vector + KV pair for ML, an append-only log for Audit. No single storage layer is naively good at all five.
The naive "one pipeline per consumer" tax
The first instinct is the obvious one: stand up five Kafka consumer groups, each writing to its own sink. Search team writes a consumer that updates Elasticsearch on every change. Fraud team writes a consumer that updates Redis. Finance team writes a Spark job that reads the topic into Parquet. ML team writes a Flink job. Audit team writes another Flink job.
This works for one quarter. Then you count the cost.
Suppose the company has P producers (OLTP databases, message logs, third-party APIs, internal services) and C consumer shapes. The naive architecture has P × C pipelines: every producer fans out to every consumer's bespoke sink, with bespoke schema, bespoke retry logic, bespoke offset management, bespoke schema-evolution handling. A 30-service company with 4 consumer shapes is already 120 pipelines.
```python
# pipelines_inventory.py — count the bespoke pipeline tax
# at a mid-stage Indian fintech.
producers = [
    "payments_postgres", "merchants_postgres", "ledger_oracle",
    "kyc_postgres", "settlements_postgres", "disputes_mysql",
    "auth_postgres", "events_kafka", "fraud_signals_kafka",
    "external_npci_api", "external_bank_recon_files",
    # ... ~30 source systems total
]
consumers = [
    ("warehouse", "Snowflake / BigQuery"),
    ("search", "Elasticsearch"),
    ("kv_cache", "Redis"),
    ("feature_store", "Feast / DynamoDB"),
    ("audit_history", "Postgres + cold S3"),
]
# Each (producer, consumer) is a bespoke pipeline:
# - schema mapping
# - serialisation format choice (Avro / JSON / Protobuf)
# - error handling, dead-letter queue
# - schema-evolution handler
# - on-call runbook
# - cost attribution
n_pipelines = len(producers) * len(consumers)
print(f"Bespoke pipelines: {n_pipelines}")
# Engineer-hours per pipeline (industry rule of thumb):
hours_to_build = 80
hours_to_maintain_per_year = 40
print(f"Build cost : {n_pipelines * hours_to_build} engineer-hours")
print(f"Yearly maintain : {n_pipelines * hours_to_maintain_per_year} engineer-hours")
print(f"At ₹2,500/hr (loaded): "
      f"₹{n_pipelines * (hours_to_build + hours_to_maintain_per_year) * 2500 / 1e7:.2f} crore "
      "first-year cost")
# Sample run:
# Bespoke pipelines: 55
# Build cost : 4400 engineer-hours
# Yearly maintain : 2200 engineer-hours
# At ₹2,500/hr (loaded): ₹1.65 crore first-year cost
```
Walk through the lines that are doing real work:
- `producers` and `consumers` — the inventory. Even at this 11-source, 5-shape size, the matrix is 55 cells. Each cell is a separately-versioned pipeline. Why this is sub-linear in your head but linear in your inbox: a single engineer can carry the operational load of maybe 6–8 pipelines comfortably. At 55 pipelines, you have to staff a team of 8, and that team's first job becomes pipeline maintenance, not pipeline creation. The matrix shape is what determines the team's identity.
- `hours_to_maintain_per_year = 40` — half a working week per pipeline per year, just for routine maintenance: schema-evolution patches, library upgrades, on-call response, cost reviews. This is conservative; the realistic number for a streaming Flink pipeline against a moving Iceberg target is closer to 80 hours.
- The `₹1.65 crore first-year cost` — and the second year is ₹0.55 crore of just maintenance, ongoing. This is before you count the cost of the bugs the matrix produces — the Search index getting an old version of a row that the warehouse already updated, or the feature store and the warehouse disagreeing on what `customer_segment` was at 17:00 IST yesterday.
- The cost the script doesn't show: every new producer added to the list adds C new pipelines. Every new consumer adds P new pipelines. The matrix grows multiplicatively, while the team grows additively at best. Eventually the team is shipping zero new business value — they're treading water on the inventory itself.
This is the wall. Not "we can't build it" but "we can build any one of these and the whole inventory will eat us alive over three years".
Why the five storage layers disagree with each other
Drill one level down. The reason five teams want five shapes is not arbitrary preference; it's that each storage layer is optimised for a fundamentally different access pattern. The five shapes correspond to five distinct points on the storage-design space.
| Consumer | Access pattern | Optimal storage shape | Indexing primitive |
|---|---|---|---|
| Finance / warehouse | Wide scans across many rows, few columns at a time | Columnar with min/max stats per row group | Zone maps, bloom filters |
| Search | Token-based lookup across many documents | Inverted index | Posting lists |
| Fraud / KV | Single-key lookup, often with a 24-hour rolling window | LSM-tree or hash table + per-key state store | B-tree or hash on key |
| ML feature store | Temporal join: "the value of feature F for entity E at time T" | KV for online + columnar with event-time partition for offline | Bitemporal index |
| Audit | "Give me row R at time T", monotonically growing | Append-only log + snapshot index | Time-travel snapshot id |
The columnar+statistics layout that makes Finance's SUM(amount) WHERE date BETWEEN ... query scan 2 GB instead of 2 TB is exactly the wrong layout for Search's "find all merchants whose name contains 'cafe'" — that needs an inverted index, which a columnar file does not offer. The key-value store that gives Fraud a 3 ms point lookup cannot answer Finance's group-by because it does not know what columns are. The append-only audit log makes "as-of" queries trivial but turns Finance's ad-hoc analytical query into a full historical replay.
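The pruning arithmetic behind that "2 GB instead of 2 TB" claim can be sketched in a few lines. This is a toy model of zone maps — per-row-group min/max statistics letting a scan skip whole row groups — not any particular Parquet reader; the row-group sizes and date ranges are invented for illustration.

```python
# Toy model of zone-map pruning: per-row-group min/max statistics on the
# `date` column let an analytical scan skip row groups whose range cannot
# match the predicate. Sizes and dates are invented for illustration.
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_date: str          # min value of the `date` column in this group
    max_date: str          # max value
    size_gb: float         # bytes a scan of this group would read

def scan_cost(groups, lo, hi):
    """GB read for SUM(amount) WHERE date BETWEEN lo AND hi."""
    # A group is skipped iff its [min, max] range misses [lo, hi].
    # ISO date strings compare correctly as plain strings.
    return sum(g.size_gb for g in groups
               if not (g.max_date < lo or g.min_date > hi))

# 280 daily row groups of 2 GB each: a 560 GB table.
groups = [RowGroup(f"2026-{m:02d}-{d:02d}", f"2026-{m:02d}-{d:02d}", 2.0)
          for m in range(1, 11) for d in range(1, 29)]

full = sum(g.size_gb for g in groups)                    # naive full scan
pruned = scan_cost(groups, "2026-04-01", "2026-04-30")   # month-end query
print(f"full scan: {full:.0f} GB, pruned scan: {pruned:.0f} GB")
# → full scan: 560 GB, pruned scan: 56 GB
```

The same file, read by a token query ("name contains 'cafe'"), gets no help from these statistics at all — min/max ranges say nothing about substring membership, which is exactly why Search needs its own inverted index.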
Why a single storage primitive cannot win all five corners: the access patterns conflict at the physics level. Wide-scan analytical queries want data laid out by column with skip-indexes; point lookups want data laid out by row with hash or B-tree indexes. You can build hybrids (HTAP databases like TiDB or SingleStore try) but they pay a 2–5× cost penalty versus a single-purpose engine, which means at scale you still end up with specialised engines on the hot paths. The lakehouse accepts this and aims at a different goal: a shared storage substrate that the specialised engines borrow from, rather than each one ingesting raw events independently.
What "shaped for the consumer" actually means
The verb "shape" hides three different operations, and Build 12 needs all three.
Reorganise the row layout. Take a stream of (merchant_id, column, old_value, new_value, lsn, txn_id) events and produce a Parquet file where rows are grouped by date and columns are grouped by column-id, with min/max statistics per row group. This is what makes Finance's SUM query fast.
Materialise a derived view. Take the stream and compute, for every cardholder, a rolling 24-hour aggregate plus the last 50 transactions. Update the view as new events arrive. This is what gives Fraud a 3 ms lookup.
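As a sketch, that per-key materialisation might look like the following — a plain-Python stand-in for a keyed state store (Flink state, Redis), with the event shape and numbers invented for illustration.

```python
# Sketch of Fraud's per-key materialised view: for each cardholder, keep
# the last 50 transactions and a rolling 24-hour aggregate, updated as
# events arrive. A dict-of-deques stand-in for a real state store.
from collections import defaultdict, deque

WINDOW_S = 24 * 3600
state = defaultdict(deque)             # cardholder -> deque of (ts, amount)

def on_event(cardholder, ts, amount):
    """Update the view as a new transaction event arrives."""
    q = state[cardholder]
    q.append((ts, amount))
    while len(q) > 50:                     # retain only the last 50 txns
        q.popleft()
    while q and q[0][0] < ts - WINDOW_S:   # expire events older than 24 h
        q.popleft()

def lookup(cardholder, now):
    """The point lookup on the authorisation path: O(window) here,
    effectively O(1) in a real KV store with precomputed aggregates."""
    q = state[cardholder]
    recent = [(t, a) for t, a in q if t >= now - WINDOW_S]
    return {"txn_count_24h": len(recent),
            "amount_24h": sum(a for _, a in recent)}

for i in range(60):                        # 60 events, one per minute
    on_event("card_42", ts=i * 60, amount=100)
print(lookup("card_42", now=60 * 60))
# → {'txn_count_24h': 50, 'amount_24h': 5000}
```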
Preserve history with snapshot isolation. Take the stream and write it to a layout where every commit produces a new version-snapshot, queryable independently. Old versions are retained until explicitly compacted. This is what gives Audit "as-of" queries.
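A minimal sketch of the snapshot mechanism, assuming a toy in-memory table rather than Iceberg's manifest machinery (each commit here copies the full row map, which a real format avoids by sharing files); the merchant id echoes the Audit request above.

```python
# Toy snapshot-isolated history: every commit produces a new immutable
# snapshot, and an "as-of" query reads the snapshot that was current at
# the asked-for time. A stand-in for lakehouse time travel, not Iceberg.
import bisect

class SnapshotTable:
    def __init__(self):
        self.commit_times = []   # sorted commit timestamps
        self.snapshots = []      # full row-map per commit (toy: no sharing)

    def commit(self, ts, rows):
        """Apply a batch of upserts as one atomic, versioned commit."""
        current = dict(self.snapshots[-1]) if self.snapshots else {}
        current.update(rows)
        self.commit_times.append(ts)
        self.snapshots.append(current)

    def as_of(self, ts):
        """Read the snapshot that was current at time ts."""
        i = bisect.bisect_right(self.commit_times, ts) - 1
        return self.snapshots[i] if i >= 0 else {}

merchants = SnapshotTable()
merchants.commit(ts=100, rows={8141: {"kyc_status": "pending"}})
merchants.commit(ts=200, rows={8141: {"kyc_status": "verified"}})

print(merchants.as_of(150)[8141]["kyc_status"])
# → pending   (the "what was kyc_status at 17:00 IST" shape of query)
```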
The lakehouse story (Iceberg, Delta, Hudi) is precisely the substrate where all three operations can be expressed against the same files on S3, instead of three independent copies in three independent storage systems. Search and Fraud still live outside the lakehouse — those are point-shaped consumers — but Finance, ML offline, and Audit can share a single set of Parquet files, indexed differently for each query pattern, with snapshot-isolated commits coordinating the writers.
Edge cases that make the shape problem worse
The "five shapes" framing is clean. Production introduces five more dimensions that the diagram above doesn't show.
- Schema evolution propagates differently per consumer. A new column added to `merchants` in Postgres should appear in the warehouse (with NULLs for old rows), in Search (the new field is indexable from now on), but probably not in the Fraud cache (it doesn't care). Each consumer has its own schema-evolution policy, and the policy is rarely written down until something breaks.
- Late events break each shape differently. A transaction event that arrives 6 hours late lands in the warehouse as a backfill (acceptable), in Search as a stale-by-6-hours document (acceptable), in Fraud's 24-hour rolling window as a recomputation trigger (expensive), and in ML's training set as a temporal-correctness violation (silent and dangerous).
- Consumer-side state stores grow unbounded if not managed. Search's index size grows with the source's row count. Fraud's KV store grows with the cardholder count × the window. The audit log grows with time. Warehouse storage grows with retained partitions. Each one has its own retention discipline, and getting them mismatched is a source of ₹-shaped surprises on the cloud bill.
- The CDC stream is itself rate-limited by the slowest consumer. If Audit's append-only log writer falls behind, the Kafka consumer group's lag for that topic grows, and if that's measured against WAL retention on the source Postgres, the source's disk fills. The slowest consumer wedges the shared upstream — even though it's nominally read-only.
- Cost attribution is per-consumer but the substrate is shared. Finance pays for the warehouse compute. Search pays for Elasticsearch. Fraud pays for Redis. Who pays for the Kafka topic, the schema registry, the connector fleet? Without explicit attribution, the platform team eats the cost; with attribution, you spend a quarter just instrumenting it. This is the real preview of Build 16.
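The temporal-correctness hazard in the late-events bullet is worth making concrete, because it is the silent one. The following is a sketch of a point-in-time (as-of) join — the thing ML's training-set builder must do instead of joining against today's dimension values; the history and transactions are invented for illustration.

```python
# Sketch of a point-in-time (as-of) join: for each transaction, pick the
# merchant category that was current at the transaction's event time,
# never a later one. Data and column names are invented for illustration.
import bisect

# (effective_from_ts, category) history for one merchant, sorted by ts
category_history = [(0, "electronics"), (1000, "grocery"), (5000, "jewellery")]
times = [ts for ts, _ in category_history]

def category_as_of(ts):
    """The merchant's then-current category at event time ts."""
    i = bisect.bisect_right(times, ts) - 1
    return category_history[i][1]

transactions = [(500, 120.0), (1500, 80.0), (6000, 900.0)]

# Correct training rows: the feature is its value as of the event time.
training = [(ts, amt, category_as_of(ts)) for ts, amt in transactions]

# The leaky version: joining against today's category for every row.
leaky = [(ts, amt, category_history[-1][1]) for ts, amt in transactions]

print(training)   # then-current categories: electronics, grocery, jewellery
print(leaky)      # "jewellery" everywhere — the model trains on the future
```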
Common confusions
- "Just denormalise everything into one wide table and let consumers project." This works for two of the five consumers (Finance and ML offline) and fails for the other three. Search needs an inverted index that a wide table cannot offer; Fraud needs sub-second point lookups that a wide table cannot offer; Audit needs immutable history that a denormalised live table actively erases. The "one wide table" idea has been re-invented and re-failed every five years since the Inmon/Kimball debate of the 1990s.
- "The lakehouse replaces all five sinks." It doesn't. The lakehouse is the right substrate for analytical and ML-offline workloads. Search engines, KV caches, and audit logs are still separately deployed — they sit downstream of the lakehouse, sometimes fed by the same change stream, but with their own engines. The lakehouse claim is "share the storage substrate where it makes sense", not "one engine to rule them all".
- "Why not just point Trino at the OLTP database?" Because Trino doing a 2 TB analytical scan against the OLTP primary will starve the production payment-authorisation queries of CPU and IOPS. The whole reason the warehouse exists is to isolate analytical workloads from the OLTP critical path. The CDC stream is the bridge between them; the lakehouse is the analytical-side landing zone.
- "The CDC stream IS the audit log." Not by itself. The CDC stream has Kafka's retention (typically 7 days). An audit log needs years. You can configure infinite retention on a Kafka topic, but querying "what was `merchants.kyc_status` for `id=8141` at 17:00 IST yesterday" requires reading the entire topic from the beginning — the CDC stream is sequential, not indexed by primary key + time. Audit needs a bitemporal index that the lakehouse's snapshot-and-time-travel features provide cheaply.
- "Five consumers means five Kafka consumer groups, problem solved." Five consumer groups solve the read fan-out. They do not solve schema-evolution propagation, schema-registry coordination, the schema-mapping logic per sink, the dead-letter handling per sink, or the cost-attribution problem. The wall is not at "how do I read the topic five times". The wall is "how do I keep five differently-shaped sinks coherent over years".
Going deeper
The Razorpay 2024 build-vs-buy tipping point
Razorpay's data engineering team published a 2024 retrospective on the moment their per-consumer-pipeline architecture became unsustainable. The trigger was 78 production pipelines, 14 source databases, 9 sink shapes, and an on-call rotation that was paged 18 times a week. The decision to consolidate around Iceberg-on-S3 + Trino as the shared analytical substrate cut the on-call page rate to 4 a week within two quarters, but it also forced a 6-month rewrite of every analytical pipeline and a one-time data migration of 380 TB. The cost was real; the alternative (continuing to scale linearly with consumer count) was estimated at ₹12 crore in extra headcount over three years.
Why HTAP databases (TiDB, SingleStore) didn't kill the lakehouse
The hybrid transactional-analytical processing pitch was: one database serves OLTP and OLAP workloads from the same storage, eliminating the need for CDC and a warehouse. The architecture is real (TiDB's TiFlash columnstore, SingleStore's row+column storage). What kills it at Indian-fintech scale is not technology but organisation: the OLTP database is owned by the platform team that signs an SLA on payment authorisation latency; analytical workloads are noisy neighbours that violate that SLA in ways that are hard to bound. CDC + lakehouse keeps the noisy workloads physically isolated from the critical path. The HTAP system technically works; the political economy of operating it doesn't.
The Cred history-table problem
Cred's compliance team needs to answer "what was a member's reward-eligibility status at any point in the last 5 years". A naive implementation writes every change to a member_status_history Postgres table, which by 2025 had grown to 2.1 billion rows, with month-end queries scanning 8% of the table. The fix Cred described in a 2025 blog was to mirror the audit shape into Iceberg: write the change stream as an append-only Iceberg table partitioned by event-date, use Iceberg's snapshot id as the temporal cursor, and let Trino answer "as-of" queries with snapshot expressions. Postgres no longer holds the history; Iceberg does. The Postgres member_status_history table now retains 90 days; everything older lives on S3 at 1/40th the storage cost.
Schema-on-read vs schema-on-write, in the consumer-shape lens
The classical "data lake" pitch was schema-on-read: dump every event raw, let each consumer parse what they need. This produced the data-swamp anti-pattern — every consumer wrote the same parsing logic, in subtly different ways, with no central authority on what customer_segment meant. The lakehouse is a deliberate retreat: the substrate is schema-on-write (Iceberg/Delta tables have committed schemas), but the schema can evolve cleanly with snapshot-isolated DDL. Consumers parse less; the substrate guarantees more. This is also why dbt and the semantic layer (Build 13) sit on top of the lakehouse — they need a consistent, evolving schema to define metrics over.
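The difference between the two pitches fits in a few lines. This is a deliberately toy contrast, with invented field names: on the schema-on-read side every consumer improvises its own parse of the raw events, while on the schema-on-write side the table's committed schema is the single authority and old rows get NULLs for new columns.

```python
# Toy contrast of schema-on-read vs schema-on-write. Field names and
# events are invented for illustration.
import json

raw_events = ['{"id": 1, "amount": 100}',
              '{"id": 2, "amount": 250, "customer_segment": "prime"}']

# Schema-on-read: each consumer parses its own way — and they disagree.
finance_view = [json.loads(e).get("customer_segment", "unknown")
                for e in raw_events]                  # missing -> "unknown"
ml_view = [json.loads(e).get("customer_segment")
           for e in raw_events]                       # missing -> None

# Schema-on-write: one committed schema (evolved via a single DDL commit);
# the substrate fills missing columns with NULLs for pre-evolution rows.
schema = ["id", "amount", "customer_segment"]
table = [{col: json.loads(e).get(col) for col in schema} for e in raw_events]

print(finance_view, ml_view)   # ['unknown', 'prime'] [None, 'prime']
print(table)                   # one shape for every downstream reader
```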
What this wall looks like at 1000-pipeline scale (Flipkart, Swiggy)
At Flipkart's scale, the producer count is in the low hundreds (one per microservice), the consumer-shape count is in the low tens (analytics, search, recommendations, fraud, observability, billing, finance). The naive matrix is in the thousands of pipelines; the actual pipeline count after lakehouse consolidation is in the low hundreds. The remaining pipelines are the consumer-specific tail — Search still has its own indexer, Fraud still has its own materialisation. But the analytical core — the shape that 70% of business questions are asked against — sits on a single Iceberg-on-S3 substrate with Trino as the query layer. That's the ratio Build 12 is aiming for.
Where this leads next
Build 12 takes this wall apart in four mechanism-shaped chapters:
- /wiki/object-storage-as-a-primary-store-s3-as-a-database — the substrate. Why object storage is the right physical layer for the lakehouse, and what its consistency model gives and takes.
- /wiki/manifest-files-and-the-commit-protocol — how a directory of Parquet files becomes a transactionally-coherent table.
- /wiki/concurrent-writers-optimistic-concurrency-serializability — how five writers coordinate without a coordinator.
- /wiki/copy-on-write-vs-merge-on-read-iceberg-vs-hudi — the central design choice for CDC sinks on a lakehouse, and why each major table format made a different call.
The closing chapter of Build 12, /wiki/cdc-iceberg-the-real-world-pattern, brings Build 11's CDC stream and Build 12's lakehouse storage together: how do you actually land a Razorpay-scale CDC stream into Iceberg, with hot-key updates, schema evolution, and bounded latency? That is the chapter that retires this wall.
References
- Mike Stonebraker et al. — "One Size Fits All: An Idea Whose Time Has Come and Gone" (CIDR 2005) — the foundational argument that no single storage engine satisfies all access patterns. Twenty years on, still the right framing for the wall in this chapter.
- Michael Armbrust et al. — "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (CIDR 2021) — the Databricks paper that named the lakehouse. Section 2 is the "shape mismatch" argument from the consumer side.
- Razorpay engineering: "Consolidating 78 pipelines into one lakehouse" (2024) — the public retrospective on the build-vs-buy tipping point that motivates this chapter's framing.
- Cred engineering: "Five years of compliance history on Iceberg" (2025) — the audit-shape consolidation story used in Going Deeper.
- Designing Data-Intensive Applications, Chapter 11 (Kleppmann, 2017) — derives the producer/consumer shape mismatch from first principles.
- /wiki/snapshot-cdc-the-bootstrapping-problem — the upstream prerequisite. You need a bootstrapped CDC stream before this wall is even visible.
- /wiki/wall-oltp-databases-are-a-source-you-dont-control — the producer-side wall that Build 11 spent breaking, mirrored here on the consumer side.
- /wiki/iceberg-delta-hudi-from-the-producers-perspective — the table-format landscape this chapter lands the reader on the doorstep of.