Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Active-active across regions
It is a Tuesday morning at PaySetu and the platform team is staring at a rollout plan that says "promote Frankfurt from passive to active". Tarun, the SRE on call, has spent the last two weeks building the failover runbook for an active-passive deployment — a single primary in Mumbai, Frankfurt as a warm read-replica, and a documented 11-minute RTO if Mumbai goes down. The product team has now decided merchant onboarding from Europe is enough volume to justify writes in Frankfurt. Tarun's reflex is to think of this as "two primaries instead of one". Within an hour of his first whiteboard sketch, he has a list of seventeen things that break the moment a second region accepts writes — duplicate auto-increment IDs, customer-record fights between the two regions, a foreign-key constraint that assumes a single source of truth, a "transactions today" counter that will go backwards. Active-active is not active-passive with the safety off. It is a different architecture.
The core promise of active-active is straightforward: every region accepts writes for every key, no region is special, and a region failure costs you nothing in the write path because the other regions were already serving traffic. The cost is that you must answer — for every piece of state — what happens when two regions write to the same thing at almost the same time. There is no "primary" to break the tie. The tie-breaking rule lives in your code, in your CRDT choice, in your sticky-routing layer, or in your willingness to lose one of the two writes. There is no fourth option.
Active-active across regions accepts writes in every region simultaneously and replicates asynchronously between them. It eliminates the cross-region RTT in the write path and removes failover from the operational model — but in exchange you must have an answer for every concurrent write conflict, either via sticky routing (each key has a home region), CRDTs (the data type itself converges), or a willingness to lose writes (last-writer-wins). The architecture is hybrid by necessity: financial state stays sticky, high-volume per-key state goes CRDT, and only after both choices are made do you get the latency and availability wins.
What "active-active" actually means
The shorthand "active-active" hides four distinct architectural patterns that all answer the question "what does my system do with concurrent cross-region writes?" — (1) sticky routing, where every region accepts connections but each key has a home region that serialises its writes; (2) CRDT convergence, where every region writes locally and the data type merges deterministically; (3) last-writer-wins, where every region writes locally and the highest timestamp silently overwrites; and (4) synchronous geo-consensus, where a write commits only after a cross-region quorum. Each one has a different operational profile, and the production discipline starts with knowing which pattern each slice of your data falls under.
PaySetu, in the rollout above, ends up using all four of these patterns simultaneously. The customer ledger goes through pattern 4 (Spanner-style synchronous geo-consensus, because money cannot be lost or duplicated). The merchant onboarding flow goes through pattern 1 (each merchant has a home region; the row is created in the home, replicated everywhere). The per-merchant transaction counter goes through pattern 2 (CRDT convergence). Pattern 3 — last-writer-wins — is reserved for data where the newest value genuinely supersedes the old: the fraud-score cache (the scoring stream is the single authority and its timestamps are monotonic per-stream) and "user UI preference last set across devices" (losing the older value is the desired semantics). Why: each data type has a different cost-of-inconsistency. Money-loss is unrecoverable; a counter being briefly off by ₹50 in the dashboard recovers in 200ms; a UI theme reverting after a clock skew is invisible to the user. Matching the consistency mechanism to the cost-of-inconsistency is the entire game.
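The per-data-type assignment is worth making explicit in code rather than leaving it in reviewers' heads. A minimal sketch of a policy registry — table names here are hypothetical — whose point is that an unmapped data type fails loudly instead of silently defaulting to last-writer-wins:

```python
from enum import Enum

class Pattern(Enum):
    STICKY_HOME = 1   # home region serialises writes
    CRDT = 2          # data type converges on merge
    LWW = 3           # last-writer-wins overwrite
    SYNC_QUORUM = 4   # cross-region consensus

# Hypothetical policy table: every data type gets an explicit pattern.
POLICY = {
    "customer_ledger": Pattern.SYNC_QUORUM,
    "merchant_profile": Pattern.STICKY_HOME,
    "txn_counter": Pattern.CRDT,
    "ui_theme": Pattern.LWW,
}

def pattern_for(data_type: str) -> Pattern:
    # An unmapped data type is a review failure, not a silent default.
    try:
        return POLICY[data_type]
    except KeyError:
        raise ValueError(f"no conflict-resolution rule declared for {data_type!r}")
```

The lookup-or-raise shape is the whole idea: the BharatBazaar incident below is exactly what happens when a data type is missing from this table and nothing complains.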
The conflict matrix: what concurrent writes can produce
Once two regions accept writes for the same key, the matrix of outcomes is fully specified by two axes — what the merge function does, and what the application requires. This matrix is the most useful single artefact when reviewing an active-active design.
| Concurrent writes to same key | Sticky routing | CRDT convergence | Last-writer-wins | Sync quorum |
|---|---|---|---|---|
| Both regions accept locally | Cannot happen (one routes to other) | Yes — both committed | Yes — both committed | Cannot happen (quorum serialises) |
| Final state is deterministic | Yes (single home) | Yes (semilattice join) | Yes (max timestamp) | Yes (linearisable) |
| Both writes preserved | Yes | Yes (additive types) / No (LWW-Reg) | No (one is overwritten) | Yes |
| Tolerates clock skew | N/A | Yes | No (skew = data loss) | Yes (within TrueTime ε) |
| Survives region partition | Local-home writes continue; cross-partition writes block | Yes (writes accepted, deferred merge) | Yes (writes accepted, deferred merge) | Reads only; writes block until quorum |
| Cross-region RTT in write path | Yes (to home) | No | No | Yes (to quorum) |
The cell that matters most is "both writes preserved". For a merchant_balance row, you want both writes preserved (a credit and a debit cannot drop one of them). For a user_preferences.theme row, you might be fine with overwrite. For a current_fraud_score row updated by a streaming pipeline, last-writer-wins is correct because the stream is the authority and the timestamp is monotonic per-stream. Why: the question is not "what is the safest pattern" but "what does the application semantically require?" An update that genuinely supersedes the previous value benefits from LWW. An update that genuinely combines with the previous value needs a CRDT or sync quorum. Forcing the wrong pattern onto data either loses information (combining data forced under LWW) or pays unnecessary RTT (overwrite-shaped data forced under sync quorum).
A runnable simulation: routing decisions and conflict outcomes
The simulator below stands up two regions accepting writes for a mixed workload — accounts (sticky to home), counters (CRDT), and theme preferences (LWW). It generates concurrent writes, runs the per-data-type merge logic, and reports the final state. The point is to see the three patterns running side by side and verify that each one's promise holds.
# active_active_router.py — Python 3.11+
import random, time
from collections import defaultdict
from dataclasses import dataclass, field
@dataclass
class Account:
home_region: str
balance_paise: int = 0 # source of truth lives in home_region only
@dataclass
class CounterCRDT:
counts: dict = field(default_factory=lambda: defaultdict(int))
def inc(self, region, n=1): self.counts[region] += n
def value(self): return sum(self.counts.values())
def merge(self, other):
for r, c in other.counts.items():
if c > self.counts[r]: self.counts[r] = c
@dataclass
class LWWRegister:
value: str = ""
ts_ns: int = 0
def write(self, value, ts_ns):
if ts_ns > self.ts_ns:
self.value, self.ts_ns = value, ts_ns
@dataclass
class Region:
name: str
accounts: dict = field(default_factory=dict) # acct_id -> Account
counters: dict = field(default_factory=dict) # key -> CounterCRDT
prefs: dict = field(default_factory=dict) # user -> LWWRegister
def write_account(self, acct_id, delta_paise, peers):
a = self.accounts[acct_id]
if a.home_region != self.name:
# Sticky routing: forward to home region.
return peers[a.home_region].write_account(acct_id, delta_paise, peers)
a.balance_paise += delta_paise
return ("local-home", a.balance_paise)
def write_counter(self, key, n=1):
self.counters.setdefault(key, CounterCRDT()).inc(self.name, n)
def write_pref(self, user, value, ts_ns):
self.prefs.setdefault(user, LWWRegister()).write(value, ts_ns)
def gossip(regions):
# Push counters and prefs both ways. Accounts are not gossiped — home is authoritative.
rs = list(regions.values())
for src in rs:
for dst in rs:
if src is dst: continue
for k, c in src.counters.items():
dst.counters.setdefault(k, CounterCRDT()).merge(c)
for u, p in src.prefs.items():
dst.prefs.setdefault(u, LWWRegister()).write(p.value, p.ts_ns)
def simulate():
random.seed(7)
regions = {n: Region(name=n) for n in ("mumbai", "frankfurt")}
# Account A1 is home in mumbai; A2 home in frankfurt.
for r in regions.values():
r.accounts["A1"] = Account(home_region="mumbai", balance_paise=50000)
r.accounts["A2"] = Account(home_region="frankfurt", balance_paise=70000)
# Concurrent writes from each region.
regions["mumbai"].write_account("A1", -2500, regions) # local home
regions["frankfurt"].write_account("A1", -1500, regions) # forwarded to mumbai
regions["mumbai"].write_counter("page_views", 1)
regions["frankfurt"].write_counter("page_views", 1)
t = time.time_ns()
regions["mumbai"].write_pref("u9", "dark", t)
regions["frankfurt"].write_pref("u9", "light", t + 1000) # later wins
gossip(regions)
print("ACCOUNTS (sticky to home):")
for r in regions.values():
print(f" {r.name}: A1={r.accounts['A1'].balance_paise} A2={r.accounts['A2'].balance_paise}")
print("COUNTERS (CRDT):")
for r in regions.values():
c = r.counters["page_views"]
print(f" {r.name}: total={c.value()} vector={dict(c.counts)}")
print("PREFS (LWW):")
for r in regions.values():
print(f" {r.name}: u9.theme={r.prefs['u9'].value}")
if __name__ == "__main__":
simulate()
Sample run:
ACCOUNTS (sticky to home):
mumbai: A1=46000 A2=70000
frankfurt: A1=50000 A2=70000
COUNTERS (CRDT):
mumbai: total=2 vector={'mumbai': 1, 'frankfurt': 1}
frankfurt: total=2 vector={'mumbai': 1, 'frankfurt': 1}
PREFS (LWW):
mumbai: u9.theme=light
frankfurt: u9.theme=light
The walkthrough of the load-bearing logic:
- `if a.home_region != self.name: return peers[a.home_region].write_account(...)` — sticky routing. The Frankfurt region looks up account A1, sees Mumbai is the home, and forwards the write. The result is identical to the user calling Mumbai directly, except the routing happened server-side. Why: the home-region rule eliminates the entire concurrent-write problem for accounts — there is exactly one writer per account at any time. The cost is the cross-region hop for non-home users, which is the per-account latency tax you accept for strong single-key consistency.
- `self.counters.setdefault(key, CounterCRDT()).inc(self.name, n)` — CRDT counter. Each region increments only its own slot. Concurrent increments in Mumbai and Frankfurt do not race because they touch different slots. The merge function takes the per-slot maximum, which is commutative and idempotent. The total is the sum across slots.
- `if ts_ns > self.ts_ns: self.value, self.ts_ns = value, ts_ns` — last-writer-wins. The simulation deliberately uses two timestamps 1 microsecond apart to ensure deterministic ordering; in production the timestamps come from a hybrid logical clock to bound the cross-region skew. The "loser" write is silently dropped — which is the right behaviour for a pure preference-overwrite, and the wrong behaviour for anything else.
- `gossip` does not propagate accounts — the home region's local view is the authoritative state for each account. Replicas exist for read-scaling, but they are read-only copies refreshed from the home over a separate channel, not modelled in this sketch (which is why Frankfurt's copy of A1 keeps its initial value). This separation is the discipline that makes the architecture coherent: data types with sync-quorum or sticky-home semantics live on a different replication channel from CRDTs and LWW data.
Run the simulator with contending writes to the same account from both regions (both forwarded to the same home) and you'll see the home region serialise them in arrival order. Run it with skewed clocks on the LWW prefs and you'll see the older write silently win — which is exactly why LWW is dangerous for any data where retention matters. Why: each pattern's failure mode is distinct, and the failure modes do not overlap. Sticky routing fails on home-region outage. CRDTs fail on global invariants. LWW fails on clock skew. A production system that mixes them must handle all three failure surfaces simultaneously — there is no single switch to throw.
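The clock-skew experiment can be isolated into a few lines. A minimal sketch reusing the simulator's LWWRegister shape: region A's clock runs two seconds fast, so its earlier real-time write carries the larger timestamp, and the genuinely newer write from region B is silently discarded.

```python
class LWWRegister:
    def __init__(self):
        self.value, self.ts_ns = "", 0
    def write(self, value, ts_ns):
        # Accept only writes with a strictly larger timestamp.
        if ts_ns > self.ts_ns:
            self.value, self.ts_ns = value, ts_ns

SKEW_NS = 2_000_000_000        # region A's clock runs 2 seconds fast
t0 = 1_000_000_000_000         # arbitrary wall-clock origin, nanoseconds

reg = LWWRegister()
reg.write("from-A", t0 + SKEW_NS)       # happens FIRST in real time
reg.write("from-B", t0 + 500_000_000)   # happens 0.5s LATER in real time
print(reg.value)  # prints "from-A" — the newer write from B was dropped
```

Two seconds of skew is not exotic: that is a single NTP misconfiguration, which is why production LWW deployments lean on hybrid logical clocks to bound it.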
When active-active goes wrong: the BharatBazaar inventory incident
In late 2024, BharatBazaar promoted its Singapore region from passive read-replica to active write-region. The product team's mental model was "now we have lower latency for Southeast Asian customers; the database handles the rest". The architecture team had migrated four data types to active-active using the patterns above. The fifth data type — product inventory — was missed in the migration review. It was running on plain async replication with no conflict-resolution rule, on the unstated assumption that Mumbai was the only writer. Within 36 hours of the cutover, the system was overselling SKUs.
The mechanism: a high-volume SKU had a quantity of 3 in stock. Mumbai received an order at 14:23:01.245, decremented to 2, and replicated. Singapore received an order at 14:23:01.260 (15ms later in wall-clock, but the replication had not yet arrived from Mumbai), saw quantity 3 in its local replica, decremented to 2, and replicated. The Mumbai replication of "decremented to 2" arrived in Singapore and was accepted as an overwrite. Singapore's "decremented to 2" arrived in Mumbai and was accepted. Final state on both sides: quantity 2. Two units had been sold. The stock was off by one. The reconciliation script flagged it the next morning, after seventeen similar incidents had piled up across the catalogue.
The fix had three parts, each illustrating a separate active-active discipline. First, the inventory data type was changed from a plain register to a PN-Counter CRDT — a counter that tracks decrements per-region and converges on the sum. The "quantity remaining" became initial_stock - sum(decrements_per_region), which is now commutative and lossless under concurrent writes. Second, the order-confirmation invariant was relaxed from "stock ≥ 0 at confirmation time" to "stock ≥ 0 in the merged state, with retroactive cancellation if the merge shows oversell". This required the order workflow to be able to cancel a confirmed order — a non-trivial change touching customer communication, refund logic, and merchant payouts. Third, the cutover process was updated to require a per-data-type migration plan, with an explicit conflict-resolution rule for every column. No more "promote to active-active and the database handles it". Why: there is no database that "just handles" active-active. Every multi-region database — Spanner, Cosmos DB, DynamoDB Global Tables, Cassandra — exposes the same four patterns, and each pattern is a per-data-type choice. The "database handles it" assumption almost always means "the database silently last-writer-wins everything" — which is the inventory bug, dressed up as a feature.
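The PN-Counter fix can be replayed directly against the incident's numbers. A sketch under the same two-region shape as the simulator: each region records its own decrements in its own slot, merge takes the per-slot maximum, and "quantity remaining" is derived — so neither sale is lost.

```python
from collections import defaultdict

class PNCounter:
    """Per-region increment and decrement slots; value = initial + inc - dec."""
    def __init__(self):
        self.inc = defaultdict(int)
        self.dec = defaultdict(int)
    def decrement(self, region, n=1):
        self.dec[region] += n
    def merge(self, other):
        # Per-slot max: commutative, associative, idempotent.
        for mine, theirs in ((self.inc, other.inc), (self.dec, other.dec)):
            for r, c in theirs.items():
                mine[r] = max(mine[r], c)
    def value(self, initial=0):
        return initial + sum(self.inc.values()) - sum(self.dec.values())

# Replay the incident: stock 3, concurrent sales in each region.
mumbai, singapore = PNCounter(), PNCounter()
mumbai.decrement("mumbai")        # order at 14:23:01.245
singapore.decrement("singapore")  # order at 14:23:01.260, pre-replication
mumbai.merge(singapore); singapore.merge(mumbai)
print(mumbai.value(initial=3), singapore.value(initial=3))  # 1 1 — both sales counted
```

Note what this buys and what it does not: the merged quantity is now correct (1, not 2), but nothing here stops it going negative under heavier contention — that is exactly why the second part of the fix, retroactive cancellation, was still required.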
Common confusions
- "Active-active means writes work in any region with no extra setup." No. Active-active means the database accepts writes in any region; what happens to those writes when they collide is a per-data-type design decision. A correctly-deployed active-active system has been audited row-by-row to choose the conflict-resolution rule for each column. If your team did not do that audit, you do not have active-active — you have last-writer-wins on everything, with all the data-loss that implies.
- "Active-active is just active-passive without the failover step." No. Active-passive has a single writer; concurrent-write conflicts cannot exist. Active-active has multiple concurrent writers; conflict resolution is mandatory. The promotion from passive to active is not a switch flip — it is a re-architecture of the data layer.
- "Sticky routing isn't really active-active because the write goes to the home region." Sticky routing is active-active for reads (every region serves reads locally) and for the read-write workload as a system (every region accepts client connections, no failover step on regional outage of a non-home region). It is not active-active per-key — but per-key active-active for strong-invariant data is not actually achievable without sync quorum or CRDTs, and sticky routing is the cheap, correct version for most account-shaped data.
- "Active-active improves write latency." Sometimes. For the local-home user with sticky routing, yes. For the cross-region user with sticky routing, no — they pay the cross-region RTT. For CRDT or LWW data, yes — every region's write is local. Net latency improvement depends on traffic patterns and which data types you are routing.
- "Two-region active-active is enough." Two-region active-active has a failure mode three-region does not: a network partition between the two regions leaves both with valid local state and no quorum-based way to detect "which side has the more recent merged state". Most production active-active deployments use three or more regions specifically to enable quorum-based reconciliation when a partition heals.
- "Active-active is a database feature." No. Active-active is a system design. The database provides the replication primitives; the routing layer (envoy / nginx / application code), the migration runbook, the per-data-type conflict rule, the regional quota for DR limits, and the operator on-call procedures are all part of the "active-active" feature. Treating it as a database checkbox is how teams ship the BharatBazaar incident.
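The two-versus-three-region point in the list above reduces to majority arithmetic, which is worth making explicit. A small sketch of the worst case — the most even two-way partition:

```python
def majority_side_exists(n_regions: int) -> bool:
    """After the most even two-way partition, does some side hold a strict majority?"""
    quorum = n_regions // 2 + 1          # strict majority
    larger_half = (n_regions + 1) // 2   # bigger side of the most even split
    return larger_half >= quorum

for n in (2, 3, 4, 5):
    print(n, majority_side_exists(n))
# 2 → False (1 vs 1: neither side reaches quorum 2)
# 3 → True  (2 vs 1: the pair can reconcile authoritatively)
# 4 → False (2 vs 2 in the worst case), 5 → True (3 vs 2)
```

The even-region results are the quiet lesson: four regions split 2-2 are no better than two regions for quorum purposes, which is why quorum-based deployments favour odd region counts.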
Going deeper
The N-region availability arithmetic
A single-region system with availability p has unavailability 1 - p. Two-region active-active with independent failures has unavailability (1-p)² for both regions to be down — but the per-key availability story is more subtle. For sticky-routing data, a key is unavailable when its home region is unavailable, which is 1-p. For CRDT data, a key is unavailable only when all regions hosting it are unavailable, which is (1-p)ᴺ for N regions. The implication: putting your most-critical data on sticky routing gives you single-region availability for that data even in a multi-region deployment. Mixing CRDTs in for the high-volume parts of your workload is what gets you to the "five-nines" headline. The arithmetic gets uglier with correlated failures (shared upstream provider, shared cable cut, shared-tenancy outage on the cloud), and bounding the correlation is the entire job of region-pair selection.
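The arithmetic is mechanical enough to sketch, under the stated assumption of independent region failures:

```python
def unavail_sticky(p: float) -> float:
    # A sticky-routed key is down exactly when its home region is down.
    return 1 - p

def unavail_crdt(p: float, n: int) -> float:
    # A CRDT key is down only when ALL n hosting regions are down.
    return (1 - p) ** n

p = 0.999  # "three nines" per region
print(f"sticky:    {unavail_sticky(p):.1e}")   # 1.0e-03
print(f"crdt, N=3: {unavail_crdt(p, 3):.1e}")  # 1.0e-09 — nine nines
```

Three-nines regions compound to nine nines for CRDT data across three regions — but only under the independence assumption, which correlated failures (shared provider, shared cable) break first.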
Read-your-writes across regions
Even with all conflict resolution sorted, a cross-region active-active deployment can give a user the experience of "I just made this change, why don't I see it?" because their write committed in region A and their next read came from region B before replication caught up. The fixes form a hierarchy. Sticky session routing (cookie or load-balancer affinity) keeps a single user's traffic to one region for the duration of a session — cheap and 90% effective. Causal tokens (HLC stamps returned with each write, sent on subsequent reads) let any region fulfil the read by waiting until its replica has caught up to the token — universal but pays a small read latency. Synchronous read-after-write (the read goes to the write's home region) is correct but expensive. Most production systems use sticky sessions + causal tokens layered together; pure synchronous reads are reserved for financial dashboards.
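The causal-token tier can be sketched as a token check on the read path. This is a hypothetical replica shape, not a particular database's API: the replica tracks the highest HLC timestamp it has applied, and a read carrying the client's last-write token either serves immediately or waits for replication to catch up.

```python
import time

class Replica:
    def __init__(self):
        self.applied_hlc = 0  # highest write timestamp applied locally

    def apply_replicated_write(self, hlc: int):
        self.applied_hlc = max(self.applied_hlc, hlc)

    def read(self, causal_token: int, timeout_s=0.5, poll_s=0.01):
        """Serve only once this replica has caught up to the client's last write."""
        deadline = time.monotonic() + timeout_s
        while self.applied_hlc < causal_token:
            if time.monotonic() > deadline:
                raise TimeoutError("replica lagging past causal token; retry elsewhere")
            time.sleep(poll_s)
        return "consistent-read"

replica = Replica()
replica.apply_replicated_write(hlc=42)
print(replica.read(causal_token=42))  # token already satisfied → serves immediately
```

The small read-latency cost mentioned above is visible in the loop: a lagging replica trades a few poll intervals for read-your-writes, and the timeout is the escape hatch to a less-lagged region.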
Write bandwidth in a fully-meshed N-region deployment
If every region replicates to every other region, the cross-region bandwidth scales as N². For three regions this is irrelevant; for ten regions it dominates the cost model. Production systems use regional cells — a few hub regions that fully mesh, with leaf regions that replicate only to one or two hubs. The leaf regions trade staleness (they are now one extra hop behind) for bandwidth (their cross-region traffic scales linearly, not quadratically). The hub topology is a per-deployment choice; it does not change the conflict-resolution semantics, but it changes the operational profile of the gossip/replication layer. See the delta-state CRDT paper for the mathematics on bandwidth optimisation in this regime.
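The bandwidth scaling is a link-count calculation. A sketch comparing a full mesh to a hypothetical hub-and-leaf layout (hubs fully meshed, each leaf attached to two hubs):

```python
def full_mesh_links(n: int) -> int:
    # Every region pair replicates directly: N(N-1)/2 bidirectional links.
    return n * (n - 1) // 2

def hub_leaf_links(hubs: int, leaves: int, hubs_per_leaf: int = 2) -> int:
    # Hubs fully mesh; each leaf connects only to a fixed number of hubs.
    return full_mesh_links(hubs) + leaves * hubs_per_leaf

print(full_mesh_links(10))   # 45 links for ten fully-meshed regions
print(hub_leaf_links(3, 7))  # 3 + 14 = 17 links for 3 hubs + 7 leaves
```

Same ten regions, under half the cross-region links — and the hub count stays fixed as leaves are added, which is the linear-versus-quadratic trade the paragraph above describes.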
Why CockroachDB and Cassandra take such different positions
CockroachDB defaults to sticky routing with strong per-key consistency (it shards by key and runs Raft per-shard, with each shard's leaseholder being its de facto home region). It will pay the cross-region RTT for non-home writes rather than expose the application to conflict resolution. Cassandra defaults to last-writer-wins with hinted handoff — every node accepts every write, and concurrent writes silently resolve by timestamp. The two systems represent the strong/loose ends of the active-active spectrum, and neither is "right" — they answer different application requirements. Picking the wrong one is what produces incidents like BharatBazaar's; picking the right one for your dominant data type lets the rest of the system be designed around its guarantees.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 active_active_router.py
# Try: forward writes for both A1 and A2 to one region only — observe sticky routing.
# Try: skew the LWW timestamps backwards in one region — observe data loss.
# Try: extend gossip() to model packet drops — counters still converge, prefs may regress.
Where this leads next
Active-active across regions is the structural architecture; the per-region content is filled in by conflict-free geo-replication (the CRDT mechanism), geo-partitioned data (the sticky-routing mechanism), and follower reads and bounded staleness (the read-side discipline). The next questions are about the operational runtime: what does a region failure actually look like in an active-active deployment (disaster recovery RPO and RTO), and how do you cap the blast radius of a misbehaving region without losing the availability story (cell-based architecture).
The thread to hold: active-active is not a deployment toggle, it is a design discipline applied per-data-type. Sticky routing for strong-invariant rows, CRDTs for high-volume convergent state, sync quorum for the financial ledger, last-writer-wins only for genuinely overwrite-shaped data. Skipping the per-data-type audit is the path to oversold inventory, lost preferences, and reconciliation scripts that flag discrepancies you cannot explain to the merchant.
References
- Corbett, J. C. et al. (2012). Spanner: Google's Globally-Distributed Database. OSDI '12. The synchronous-quorum end of the active-active spectrum.
- DeCandia, G. et al. (2007). Dynamo: Amazon's Highly Available Key-Value Store. SOSP '07. The last-writer-wins / vector-clock end of the spectrum.
- Shapiro, M. et al. (2011). Conflict-free Replicated Data Types. SSS '11. The mathematical foundation of CRDT-based active-active.
- Almeida, P. S. et al. (2018). Delta state replicated data types. JPDC. The bandwidth-efficient hybrid used in production.
- Cockroach Labs. CockroachDB Multi-Region Survival Goals. The sticky-routing approach with explicit RPO/RTO controls.
- Bailis, P. et al. (2014). Coordination Avoidance in Database Systems. VLDB. The formal treatment of which invariants need cross-region coordination.
- DataStax. Cassandra Multi-Datacenter Replication. The LWW deployment guide and its pitfalls.
- Internal: conflict-free geo-replication, geo-partitioned data, hybrid logical clocks, disaster recovery RPO RTO, follower reads and bounded staleness.