Polyglot persistence: picking the right DB per workload

Open the architecture page of any company you have heard of — Razorpay, Swiggy, Zerodha, Meesho — and count the databases on it. Five is normal. Seven is common. One is a lie. This chapter is about why no single engine survives a real workload, and how to read the trade-off matrix you have been quietly building since Build 1.

A real system runs three to seven different databases at once because workloads have wildly different shapes — and every storage engine you have built in this track makes the opposite trade-off from at least one other engine. You do not pick a database. You pick a portfolio. The skill is reading a workload's access pattern, latency budget, and consistency need, and matching it to the engine whose trade-offs that pattern can afford.

The one-database fantasy

Class assignments have one database. Production has many. The reason is not laziness or hype — it is geometry. Every storage engine in this track sits on a trade-off triangle whose corners cannot all be reached at once.

Recall the walls. B-trees give fast point reads and reasonable range scans, at the cost of write amplification on update. LSM-trees flip that — fast writes, painful read amplification, and tail-latency spikes during compaction. Column stores like Parquet or ClickHouse make a billion-row aggregate fly, but want nothing to do with a single-row update. Inverted indexes answer "find me every doc containing the word refund" in milliseconds and answer "give me row 4892's email" by re-scanning. Hash-based caches serve a hundred thousand reads per second per core, then forget everything when the box reboots. Vector indexes find the ten closest embeddings to a query in 5 ms — and cannot tell you which row was updated yesterday.

Every engine wins at one workload by losing at another. That is the whole game.

[Figure: The trade-off space of database engines. Horizontal axis runs from read-fast (left) to write-fast (right); vertical axis from point-friendly (bottom) to scan-friendly (top). Six engines plotted: Redis cache (RAM hash, no durability) far left; B-tree / Postgres (balanced, journaled) near the point-friendly middle; LSM / Cassandra (write-fast, read-amplified) lower right; column store (ClickHouse / Parquet) upper left; inverted index (Elasticsearch) upper middle; vector index (HNSW / IVF) upper right.]
Six engines, four corners, no winner. Every engine sits where its data structure thrives. Picking a database is picking which corner your workload lives in.

Why a single engine cannot win every corner: the on-disk layout that makes a column store fast (sorted, compressed, per-column files of millions of values) is the same layout that makes a single-row update brutal — you have to rewrite the whole column chunk. The inverted-list layout that makes Elasticsearch fast at "every doc with refund" is the same layout that makes "give me row 4892" expensive. There is no neutral data structure. The shape of the bytes on disk is the trade-off.
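The column-store half of that claim is easy to demonstrate with a toy chunk. The sketch below is illustrative only — not any real engine's on-disk format — and stores a 100,000-value column as a single zlib-compressed blob. Changing one value forces a decompress-patch-recompress of the entire chunk, where a row store would touch a single page.

```python
import zlib
from array import array

# A toy column "chunk": 100k int64 values compressed as one blob, the way a
# column store keeps per-column files. (Illustrative sketch, not a real format.)
column = array("q", range(100_000))
chunk = zlib.compress(column.tobytes())

def update_one_value(chunk: bytes, index: int, new_value: int) -> bytes:
    """Changing one value means rewriting the whole chunk: decompress
    everything, patch 8 bytes, recompress everything."""
    col = array("q", zlib.decompress(chunk))
    col[index] = new_value
    return zlib.compress(col.tobytes())

new_chunk = update_one_value(chunk, 4892, -1)
patched = array("q", zlib.decompress(new_chunk))
print(patched[4892])  # -1: the update landed, at the cost of rewriting ~800 KB
```

The same asymmetry runs the other way: a B-tree page can absorb that 8-byte update in place, but summing the column means visiting every row's full record instead of one tightly packed file.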

Reading a workload before you pick

Before you pick a database, write down four numbers and one shape. Without these, the choice is religion.

  1. Read-to-write ratio. Is this 99% read and 1% write (a product catalogue), 50/50 (a chat app), or 10% read and 90% write (a metrics pipeline)?
  2. Latency budget. What is the p99 your application can tolerate? 1 ms (cache layer), 50 ms (user-facing API), 500 ms (back-office screen), 30 s (analyst dashboard)?
  3. Working-set size. How much hot data does this workload touch in the last hour? 1 GB (fits in RAM), 100 GB (fits in NVMe), 50 TB (one column-store cluster), 10 PB (only object storage scales)?
  4. Consistency need. Does the read have to see the last write? (Bank balance: yes. Trending products: no.) This collapses to the CAP/PACELC decision.
  5. Access shape. Point lookup by primary key? Range scan? Full-text? K-nearest-neighbour vector? Aggregate over millions of rows? Graph traversal of friend-of-friend?

The shape is the most important. If you cannot describe the access shape in one sentence, you do not yet understand the workload, and any database you pick is a coin flip.

Five workloads inside one Razorpay request

A merchant opens the Razorpay dashboard at 11am on the 1st of the month — settlement day. A single page load fans out into five very different queries:

| # | Query the page asks | Read shape | Engine that fits |
|---|---|---|---|
| 1 | "Show me my total volume yesterday" | aggregate of ~50k rows | column store / OLAP |
| 2 | "List my last 20 captured payments" | range scan by merchant_id, sorted by created_at | B-tree on Postgres |
| 3 | "Search for customer 'rahul@gmail.com'" | inverted index on email field | Elasticsearch |
| 4 | "What is my available balance right now?" | single-row lookup, must reflect the most recent write | Postgres OLTP |
| 5 | "Show me three suggested actions you think I'll click" | nearest-neighbour over a feature vector | vector index (HNSW) |

If you tried to answer all five from one Postgres instance: query 1 takes 8 seconds because the row store has to read every payment column to sum one of them; query 3 either does a LIKE '%rahul%' table scan or you bolt on pg_trgm and pray; query 5 needs pgvector, which works but reaches its limit before you finish 1 lakh merchants. By the third such page-load per second, the database is on fire.

The fix is not "buy a bigger Postgres." The fix is to send each query to the engine its shape was designed for, and to keep all of them in sync from the same source of truth.
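What "send each query to the engine its shape was designed for" looks like in application code — a sketch with every engine client stubbed out as an async function. All names, payloads, and numbers here are invented for illustration:

```python
import asyncio

# Hypothetical per-engine clients; each stub stands in for a real driver call.
async def clickhouse_volume(merchant_id):         # shape: big aggregate
    return {"volume_yday": 4_21_00_000}

async def postgres_recent_payments(merchant_id):  # shape: indexed range scan
    return [{"id": f"pay_{i}"} for i in range(20)]

async def es_search_customer(merchant_id, q):     # shape: inverted-index lookup
    return [{"email": q}]

async def postgres_balance(merchant_id):          # shape: strict point read
    return {"available": 1_50_000}

async def vector_suggestions(merchant_id):        # shape: k-NN over embeddings
    return ["enable_upi", "export_report", "add_webhook"]

async def dashboard(merchant_id):
    # One page load fans out to five engines concurrently: the page is as
    # slow as the slowest engine, not the sum of all five.
    return await asyncio.gather(
        clickhouse_volume(merchant_id),
        postgres_recent_payments(merchant_id),
        es_search_customer(merchant_id, "rahul@gmail.com"),
        postgres_balance(merchant_id),
        vector_suggestions(merchant_id),
    )

volume, payments, hits, balance, actions = asyncio.run(dashboard("m_123"))
print(len(payments), balance["available"])  # 20 150000
```

The fan-out is the easy half; the hard half — keeping five stores consistent with one source of truth — is what the rest of this chapter is about.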

The portfolio, not the pick

Once you accept that one engine cannot serve every workload, the architecture you end up with looks remarkably consistent across companies. There is an OLTP engine that owns the truth — usually Postgres or MySQL, sometimes a NewSQL like CockroachDB. Around it sits a halo of derived stores, each indexing the truth differently for one specific access shape.

This is the inside-out architecture from chapter 181, but viewed from the operations side. Build 23 told you how the log fans out to many systems. Build 24 starts with which systems, and why these and not others.

[Figure: A typical Razorpay-shaped polyglot stack. Centre: Postgres (OLTP), source of truth for payments, users, balances. Its WAL feeds a Kafka log via Debezium (payments-cdc topic, partitioned by merchant). Five derived stores subscribe to the log: Redis (dashboard cache), Elasticsearch (free-text search), ClickHouse (live analytics), S3 + Iceberg (lakehouse, seven-year archive), and pgvector / HNSW (recommendations). One log of facts → five views, each indexing it differently.]
The shape that almost every fintech, marketplace, and SaaS company in India converges on by Series B. One source of truth, one log to fan it out, and one engine per access shape.

The discipline is not to add stores randomly — it is to add a store only when an existing store cannot meet the workload's latency or shape. Each store earns its place by answering one question better than any incumbent. Add Elasticsearch only when free-text search becomes a real workload. Add ClickHouse only when analytics queries hit Postgres hard enough to threaten OLTP latency. Add the vector index only when "find similar items" is a product feature.

Worked example: Riya picks the stack for a small wallet app

Riya runs a four-engineer team in Pune building a wallet-style app — call it walletkart. They have 50,000 active users, growing at 20% a month, and a runway pressure that says do not buy a database you do not have to. She runs through the workloads on paper, in order of priority:

# walletkart_workloads.py — workload catalogue, in plain Python dicts
workloads = [
    {
        "name": "wallet balance lookup",
        "ops_per_sec": 800,
        "p99_latency_ms": 50,
        "consistency": "strict — must reflect last write",
        "shape": "point lookup by user_id",
        "fits": "OLTP B-tree (Postgres)",
    },
    {
        "name": "transaction history",
        "ops_per_sec": 200,
        "p99_latency_ms": 100,
        "consistency": "read-your-writes",
        "shape": "range scan: last 30 transactions for user_id, by created_at",
        "fits": "Postgres with index on (user_id, created_at)",
    },
    {
        "name": "merchant search",
        "ops_per_sec": 60,
        "p99_latency_ms": 200,
        "consistency": "stale by 30s OK",
        "shape": "free-text on merchant name + category",
        "fits": "Elasticsearch (or Postgres tsvector for now)",
    },
    {
        "name": "homepage carousel: 'recent merchants like yours'",
        "ops_per_sec": 800,
        "p99_latency_ms": 30,
        "consistency": "stale by hours OK",
        "shape": "point lookup of precomputed list, by user_id",
        "fits": "Redis (cache, recomputed nightly)",
    },
    {
        "name": "monthly settlement aggregates for ops team",
        "ops_per_sec": 0.01,  # 1 query every 100s during business hours
        "p99_latency_ms": 5000,
        "consistency": "yesterday's data is fine",
        "shape": "GROUP BY merchant SUM amount over 30M rows",
        "fits": "Postgres for now, ClickHouse if it gets slow",
    },
]

for w in workloads:
    # Truncate both name and shape so long entries keep the columns aligned.
    print(f"{w['name'][:40]:<40} {w['ops_per_sec']:>6} ops/s  p99 {w['p99_latency_ms']}ms  "
          f"shape={w['shape'][:40]}  →  {w['fits']}")

Output:

wallet balance lookup                      800 ops/s  p99 50ms  shape=point lookup by user_id  →  OLTP B-tree (Postgres)
transaction history                        200 ops/s  p99 100ms  shape=range scan: last 30 transactions for use  →  Postgres with index on (user_id, created_at)
merchant search                                60 ops/s  p99 200ms  shape=free-text on merchant name + category  →  Elasticsearch (or Postgres tsvector for now)
homepage carousel: 'recent merchants lik    800 ops/s  p99 30ms  shape=point lookup of precomputed list, by use  →  Redis (cache, recomputed nightly)
monthly settlement aggregates for ops te   0.01 ops/s  p99 5000ms  shape=GROUP BY merchant SUM amount over 30M ro  →  Postgres for now, ClickHouse if it gets slow

Riya's day-1 stack is Postgres + Redis, with three workloads sitting on Postgres and one on Redis. The merchant search is being answered by a Postgres tsvector column — slower than Elasticsearch but inside the latency budget, and one fewer system to operate. Settlement aggregates run against a read replica because the volume is currently fine.

Why she did not start with five databases: every additional engine costs an on-call rotation, a backup pipeline, a monitoring dashboard, and a CDC connector. A four-engineer team buying its sixth database is a team that has stopped shipping product. The right rule is "the smallest portfolio that answers the workloads inside their latency budgets, plus one engine of headroom for the workload most likely to outgrow its current home."

Six months later, search traffic crosses 500 ops/s and tsvector p99 climbs to 500 ms. Now Elasticsearch earns its place — it is solving a real problem that Postgres has stopped solving. Twelve months later, settlement queries take a minute and start blocking other replica reads. Now ClickHouse earns its place. The portfolio grows by demand, not by aspiration.

Why the order of additions matters: the first additional engine is almost always a cache (Redis), because cache misses on hot read paths are the cheapest workload to peel off. The second is almost always either a search engine or an analytics column store, depending on which workload outgrows Postgres first. The third is whatever the product roadmap demands next — vector index for ML features, graph store for fraud rings, time-series for metrics. Adding them in this order means each addition replaces a clearly visible bottleneck in the system you have, rather than answering a problem you do not yet have.

The picking matrix, made concrete

Builds 1 through 23 have given you the full inventory. This is the one-page lookup table that compresses what each build taught into a routing decision. Treat the rows as access shapes, not as products — the products listed are the canonical engines for that shape, and you should be willing to swap them for an equivalent.

| Access shape | Latency budget | Working set | Canonical engine | Where in this track |
|---|---|---|---|---|
| Point lookup, strict consistency | 5–50 ms | up to 10 TB / shard | Postgres / MySQL B-tree | Build 4 |
| Write-heavy point + range | 1–20 ms | TB to PB, shardable | Cassandra / RocksDB LSM | Build 3 |
| Sub-millisecond hot read | 0.2–2 ms | fits in RAM | Redis / Memcached | Build 21 |
| Free-text search | 10–200 ms | TB | Elasticsearch / OpenSearch | Build 17 |
| K-nearest-neighbour vectors | 5–50 ms | GB to TB | pgvector / Pinecone / Weaviate | Build 18 |
| Graph traversal | 10–500 ms | GB to TB | Neo4j / Memgraph | Build 19 |
| Time-series ingest + rollup | 10–100 ms | TB, append-mostly | Timescale / InfluxDB / VictoriaMetrics | Build 20 |
| Aggregate over millions | 1–10 s | TB to PB | ClickHouse / DuckDB | Build 16 |
| Lakehouse archive + ad-hoc | seconds to minutes | PB | S3 + Iceberg/Delta + Trino | Build 16 |
| Distributed strong-consistency OLTP | 10–100 ms | TB to PB | CockroachDB / Spanner / TiDB | Build 12 |

Two warnings about reading this table. First, the engines have meaningful overlap — Postgres can do point lookup, free-text, vector, and time-series, just worse than the specialist at high scale. The decision rule is "use the specialist when the cheaper general-purpose engine is no longer fast enough at this workload's scale," not "use the specialist on day one because the table says so." Second, latency budgets are p99 targets in steady state; tail latency under load is much worse and is the number you should actually engineer to.
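As a first pass, the routing column of the table can be encoded directly. This is a sketch: the shape keys and the 100 GB generalist threshold are chosen purely for illustration, and the overlap caveats above still apply.

```python
# First-pass router: access shape → canonical engine, mirroring the table.
# Purely illustrative — real decisions also weigh ops cost and team skills.
CANONICAL_ENGINE = {
    "point_strict":     "Postgres / MySQL B-tree",
    "write_heavy":      "Cassandra / RocksDB LSM",
    "hot_read":         "Redis / Memcached",
    "free_text":        "Elasticsearch / OpenSearch",
    "knn_vector":       "pgvector / Pinecone / Weaviate",
    "graph":            "Neo4j / Memgraph",
    "time_series":      "Timescale / InfluxDB / VictoriaMetrics",
    "big_aggregate":    "ClickHouse / DuckDB",
    "lakehouse":        "S3 + Iceberg/Delta + Trino",
    "distributed_oltp": "CockroachDB / Spanner / TiDB",
}

def route(shape: str, working_set_gb: float) -> str:
    # The generalist rule: below a modest working set, Postgres can usually
    # still fake these shapes well enough (threshold is illustrative).
    fakeable = {"free_text", "knn_vector", "time_series", "big_aggregate"}
    if working_set_gb < 100 and shape in fakeable:
        return "Postgres (specialist not yet earned)"
    return CANONICAL_ENGINE[shape]

print(route("free_text", working_set_gb=10))    # Postgres (specialist not yet earned)
print(route("free_text", working_set_gb=2000))  # Elasticsearch / OpenSearch
```

Note what the threshold encodes: the specialist engine is a function of scale, not of shape alone, which is exactly the first warning above.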

What forces a new engine into the stack

There are exactly four signals that justify adding a new database. If you cannot point to at least one of them for a proposed addition, the addition is premature.

  1. An access shape the existing engines do not natively support. Free-text search, k-nearest-neighbour vectors, graph traversals, time-series rollups. You can fake some of these in Postgres for a while; eventually the fake hits a wall and a specialised engine pays its keep.

  2. A latency budget the existing engines cannot meet. A 1 ms p99 read forces a cache. A 5-second analytics query becomes unacceptable when the dashboard is on the customer-facing path. The trade-off matrix says "this engine cannot meet that budget at this scale" — and that is when you reach for the next engine.

  3. A working set that has outgrown the existing engine's capacity. 50 TB of historical events does not belong in Postgres no matter how many indexes you add — it belongs in object storage with a lakehouse table format. Trying to keep it in the OLTP engine wastes IOPS and crowds out hot data.

  4. A workload whose consistency need is genuinely weaker than the OLTP engine, and where the throughput cost of synchronous OLTP is not affordable. This is the cache case (eventual is fine) and the analytics case (yesterday is fine). The point of the polyglot stack is that you spend strong consistency only where you need it.

If none of these four hold, you do not need a new engine. You need a better index on the one you have.
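The four signals make a natural design-review checklist. A sketch, with all field names invented:

```python
from dataclasses import dataclass

@dataclass
class EngineProposal:
    # The four signals from above; field names are illustrative.
    unsupported_shape: bool      # 1. shape no existing engine natively serves
    budget_unmeetable: bool      # 2. latency budget current engines can't hit
    outgrown_capacity: bool      # 3. working set too big for the incumbent
    weaker_consistency_ok: bool  # 4. eventual is fine AND sync OLTP too costly

def justified(p: EngineProposal) -> bool:
    """A new engine needs at least one of the four signals; zero signals
    means 'add a better index to the engine you have' instead."""
    return any([p.unsupported_shape, p.budget_unmeetable,
                p.outgrown_capacity, p.weaker_consistency_ok])

print(justified(EngineProposal(False, False, False, False)))  # False — premature
print(justified(EngineProposal(False, True, False, False)))   # True — the cache case
```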

Common confusions

Going deeper

The Fowler / Sadalage frame and where it came from

The phrase "polyglot persistence" was coined by Pramod Sadalage and Martin Fowler around 2011, in the run-up to the NoSQL Distilled book. Their case study was a fictional retail application broken into eight services, each picking a database from {RDBMS, key-value, document, column-family, graph} based on access shape. At the time it was a counter to two extremes — the classical "everything in Oracle" monolith, and the over-corrective "MongoDB for everything" early NoSQL fashion. The lasting contribution is the frame: stop asking "which database?" and start asking "which access shape, and what does it want?"

The frame survived because the trade-offs survived. Fifteen years later the engines are mostly different — RocksDB instead of Voldemort, ClickHouse instead of HBase, Pinecone instead of nothing — but the corners of the trade-off space have not moved. A scan-friendly engine is still bad at point updates; a write-fast engine still pays at read time; a cache is still volatile. The names change; the geometry does not.

CDC as the connective tissue

Polyglot persistence only works if the derived stores stay in sync with the OLTP engine. The naive approach — application code writes to Postgres, then writes to Elasticsearch, then writes to Redis — is the dual-write problem, and it is a bug factory. If the application crashes between the Postgres commit and the Elasticsearch write, you have permanent drift and no clean way to detect it.

The clean solution is change data capture. The application writes only to Postgres. Postgres's WAL emits a change record. Debezium (or equivalent) reads the WAL and publishes to Kafka. The other engines subscribe. Every derived store is now a function of the same log, and "which engines are in sync" is no longer a per-write question — it is "where is each subscriber's offset?" Drift becomes lag, lag becomes a number you can graph, and you can reset a derived store by deleting its data and replaying the topic from the beginning.

The CDC pattern is what makes polyglot survive at scale. Without it, polyglot is dual-write hell and most teams revert to "everything in one DB" within a year.
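A toy version of the CDC loop makes the "drift becomes lag" point concrete. In the sketch below a Python list stands in for the Kafka topic and plain integers for consumer offsets:

```python
# Toy CDC: one append-only log of change records, each derived store a
# subscriber tracking its own offset. Stands in for Kafka + Debezium.
log = []  # the payments-cdc "topic"

def emit(change):            # what the WAL/Debezium side does
    log.append(change)

class DerivedStore:
    def __init__(self):
        self.state, self.offset = {}, 0

    def catch_up(self):      # consume every record past our offset
        for change in log[self.offset:]:
            self.state[change["id"]] = change["balance"]
        self.offset = len(log)

    @property
    def lag(self):           # drift is now just a graphable number
        return len(log) - self.offset

cache, search = DerivedStore(), DerivedStore()
emit({"id": "u1", "balance": 500})
emit({"id": "u1", "balance": 450})
cache.catch_up()
print(cache.state["u1"], search.lag)   # 450 2 — cache is current, search lags

# Resetting a derived store = wipe it and replay the topic from offset 0.
search.state.clear(); search.offset = 0
search.catch_up()
print(search.state["u1"], search.lag)  # 450 0
```

Every guarantee in the real system maps onto a line here: sync means offset == len(log), drift means lag > 0, and rebuild means replay from zero.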

The reverse anti-pattern: too many engines

The Fowler frame is older than its current critique. Around 2017, several large engineering blogs (Airbnb, Uber, Stitch Fix) wrote post-mortems on polyglot stacks that had grown to fifteen, twenty, or thirty databases — every team picking its own — and the operational cost crushed the company. Each engine needed its own runbook, on-call rotation, backup pipeline, schema-evolution story, and security review. The teams that survived consolidated back down to a portfolio of five to seven engines, with strong defaults that every team had to justify deviating from.

The healthy rule is "right tool for the job, but the smallest set of tools that does all the jobs." Five Postgres clusters with different schemas is fine. Five different kinds of database, each new and unfamiliar, is fragile. Add a new engine when an existing one fails a workload, not when an engineer wants to learn it.

The CAP/PACELC layer underneath

The polyglot decision has a consistency dimension that the workload table does not always make explicit. Different stores in the same stack make different CAP/PACELC choices. The OLTP store is usually CP and PC/EC — strict consistency, accept latency cost. The cache is CP-shaped during normal operation but degrades to AP under partition (you serve stale, you keep going). The search index is AP, near-eventual. The lakehouse is asynchronous, hours-stale, no consistency guarantee at all.

This is fine — and being explicit about it is what keeps the polyglot stack honest. The application has to know which store to ask for which question. Asking the cache for an exact balance is a bug. Asking the OLTP store for a top-1000 aggregate is a different bug. The architecture diagram and the developer documentation must list each store's consistency level, so that engineers do not silently downgrade the system by routing the wrong question to the wrong store.
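One cheap way to enforce that discipline in code is to make every read path declare the consistency it needs, and fail loudly when it is routed to a store that cannot provide it. A sketch — the store names and levels mirror the stack described above, but the API itself is invented:

```python
# Each store declares its consistency level; asking the wrong store for the
# wrong question fails loudly instead of silently serving stale truth.
STORE_CONSISTENCY = {
    "postgres":      "strict",
    "redis_cache":   "eventual",
    "elasticsearch": "eventual",
    "lakehouse":     "hours_stale",
}

def read_from(store: str, needs: str) -> str:
    """Refuse reads that demand stronger consistency than the store gives."""
    rank = {"strict": 2, "eventual": 1, "hours_stale": 0}
    if rank[STORE_CONSISTENCY[store]] < rank[needs]:
        raise ValueError(f"{store} is {STORE_CONSISTENCY[store]}; "
                         f"this question needs {needs}")
    return f"ok: {store}"

print(read_from("postgres", needs="strict"))       # ok: postgres
print(read_from("redis_cache", needs="eventual"))  # ok: redis_cache
# read_from("redis_cache", needs="strict") raises — the balance-from-cache bug, caught
```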

When polyglot is wrong

A useful corrective: polyglot is overkill if your workload has only one shape. A pure event-sourced log analytics tool — say, a tiny startup whose only product is "ingest webhooks, run aggregations, expose a dashboard" — does not need OLTP, cache, or search. ClickHouse alone covers it. A pure transactional ledger app with no search and no analytics can run on Postgres alone for years.

The decision rule is honest: count distinct access shapes in the workload. One shape, one engine. Three shapes, two or three engines. Seven shapes, you are running Razorpay and you have a portfolio.

Where this leads next

References

  1. Pramod Sadalage and Martin Fowler, NoSQL Distilled, ch. 13 Polyglot Persistence (Addison-Wesley, 2012) — the book that introduced the term and its discipline. martinfowler.com/bliki/PolyglotPersistence.html.
  2. Martin Kleppmann, Designing Data-Intensive Applications, ch. 1 Reliable, Scalable, Maintainable, and ch. 11 Stream Processing (O'Reilly, 2017) — the modern treatment of "many specialised stores, one log of facts." dataintensive.net.
  3. Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — argues that CDC + a shared log is the architectural backbone of polyglot persistence. engineering.linkedin.com.
  4. Mark Cavage, There's Just No Getting around It: You're Building a Distributed System (ACM Queue, 2013) — on the operational cost of every additional store; the case against premature polyglot. queue.acm.org.
  5. Razorpay Engineering blog, How we handle 5000 TPS at peak (2023) — one public Indian-fintech walk-through of an OLTP + Kafka + ClickHouse + Elasticsearch + Redis stack. engineering.razorpay.com.
  6. Adrian Colyer, The morning paper on polyglot persistence and Lambda Architecture — multiple summaries of the foundational Fowler/Kleppmann arguments. blog.acolyer.org.
  7. padho-wiki: B-tree vs LSM: when to pick which and Lakehouse vs warehouse: the real tradeoffs — the per-engine trade-off chapters this picking framework is built on.