Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Geo-partitioned data

It is 09:42 on a Monday and Vikram, an SRE at YatriBook, is reading a complaint from a London-based customer who books a Mumbai-Delhi flight every fortnight. Each booking takes 1.4 seconds. The dashboard says the database round-trip alone is 1.1 seconds — three writes, each crossing London → Mumbai at 145ms one-way, with two extra hops for two-phase commit. The naive fix is "move the leader to London", but 92% of YatriBook's customers are in India, and moving the leader to London would slow down those 92%. The actual fix is to put this user's leader in London, while leaving every other user's leader in Mumbai. That is geo-partitioning: the database is one logical thing, but its leadership is sharded by who owns the row.

The honest framing: a globally-replicated database is a system that pays the price of geo-replication on every write. A geo-partitioned database is a system that pays that price only when a row's owner moves across regions — which almost never happens. The leverage is enormous, but only if you can express "this row belongs in region X" without losing the benefits of being one database.

Geo-partitioning is the technique of placing each shard's leader in the region most local to its data's owner — typically by adding a region column to the partition key. Reads and writes for a Bengaluru user stay in the Mumbai region; reads and writes for a London user stay in the eu-west region; the database remains one global system, but most queries become single-region. The engineering is in the placement policy, the cross-region access path, and the mechanism for moving a row's leadership when its owner relocates.

Why "replicate everywhere" stops working

The simplest multi-region database is one that replicates every shard to every region with a single global leader. CricStream tried this in their first global rollout — one Spanner-style cluster, leaders in Mumbai, replicas in Singapore, Frankfurt, and Virginia. Reads from Singapore were fast (followers were local), but every write paid Mumbai-Singapore RTT. For a viewer in Singapore who paused a match and the player's state had to be persisted, the write took 78ms minimum. During a cricket final at peak, with 25M concurrent viewers and a write rate of 180k/s globally, the leader-region datacentre's commit pipeline became the choke point — a single Mumbai datacentre's egress capacity was now serving the world's writes.

[Diagram: single-leader globally-replicated database — every write pays cross-region RTT. Mumbai leader at the centre (thicker outline marking the bottleneck), with Singapore, Frankfurt, Virginia, and London followers; write paths into Mumbai labelled 78ms (Singapore), 120ms (Frankfurt), 220ms (Virginia), and 145ms (London). Illustrative — not measured data.]
Every write — regardless of which region the user is in — pays the round-trip to Mumbai. A Singapore-local fast path for a Singapore-local user is architecturally impossible in this layout.

There are two distinct costs here, and conflating them is the most common modelling error. The first cost is latency: a write originating in Singapore for a Singapore user takes 78ms because the leader is in Mumbai. The second cost is capacity: every write across the planet flows through one region's commit pipeline, so total write throughput is bounded by the leader region's hardware. Geo-partitioning attacks both — the Singapore user's leader moves to Singapore (latency win) and the Singapore leader processes only Singapore writes in parallel with Mumbai's leader processing Mumbai writes (capacity win). Why: write throughput in a single-leader configuration is bounded by the leader's WAL fsync rate, which is per-shard but ultimately per-machine; partitioning leaders across regions lets each region's hardware contribute to global throughput linearly, so a 4-region geo-partitioned cluster can sustain ~4× the write rate of a single-leader-region cluster of the same total node count.
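
To make the ~4× claim concrete, here is the arithmetic as a runnable sketch. Every number is an assumption (12 nodes, 15k fsync-bound writes/s per node), and the model ignores replica fsyncs, 2PC overhead, and skew:

# capacity_envelope.py — back-of-envelope only; all numbers hypothetical.
NODES_TOTAL = 12          # assumed: 3 nodes in each of 4 regions
PER_NODE_WRITES = 15_000  # assumed: fsync-bound writes/s per node
REGIONS = 4

# Single leader region: only that region's nodes can lead shards.
single_leader = (NODES_TOTAL // REGIONS) * PER_NODE_WRITES
# Geo-partitioned: every region's nodes lead their own shards.
geo_partitioned = NODES_TOTAL * PER_NODE_WRITES

print(single_leader, geo_partitioned, geo_partitioned / single_leader)
# 45000 180000 4.0 — the ~4× from the paragraph above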

There is a third, often-forgotten cost: regulatory. Indian regulations (RBI's data localisation circular, 2018) require payment data to be stored in India. EU regulations (GDPR) restrict transferring certain personal data outside the EU without safeguards. A globally-replicated database that copies a Bengaluru user's payment record to a Frankfurt follower is a compliance violation by default, regardless of its latency story. Geo-partitioning provides a clean answer: the Bengaluru user's row is leadered in Mumbai with replicas in Hyderabad and Chennai (all within India), and never replicates to Frankfurt at all. The mechanism enforces the policy.

How the partition key carries region

The fundamental primitive is the region-prefixed partition key. Instead of partitioning on user_id alone, you partition on (region, user_id) where region is a small enum like IN, EU, US, SEA. The database's hash-or-range partitioner now lays out shards such that all IN rows form contiguous shard ranges, and the placement policy maps those ranges to nodes physically located in India. Insert a row with region='EU' and the storage engine routes it to a shard whose leader lives in Frankfurt; insert with region='IN' and it goes to a Mumbai-leadered shard.

CockroachDB exposes this via ALTER TABLE ... CONFIGURE ZONE USING constraints = '[+region=ap-south-1]', applied to a partition defined as PARTITION india VALUES IN (('IN')). Spanner exposes it via leader_options on a partitioned table. YugabyteDB calls them tablespaces with placement_info. The DDL surface is different per database; the underlying primitive is identical: the partition key carries an attribute the placement policy reads, and the placement policy converts that attribute into a physical location for the shard's leader.
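
Stripped of the DDL, the primitive fits in a dozen lines of Python. A minimal sketch — PLACEMENT and place_shard are illustrative names, and the leader/replica lists are assumptions, not any vendor's defaults:

# The placement policy as a pure function: locality attribute → physical
# placement. Each vendor's DDL above compiles down to a mapping like this.
PLACEMENT = {
    "IN":  {"leader": "ap-south-1",     "replicas": ["ap-south-1", "ap-south-2"]},
    "EU":  {"leader": "eu-west-1",      "replicas": ["eu-west-1", "eu-central-1"]},
    "SEA": {"leader": "ap-southeast-1", "replicas": ["ap-southeast-1", "ap-southeast-2"]},
}

def place_shard(partition_key: tuple) -> dict:
    # The leading element of the partition key is the locality attribute.
    return PLACEMENT[partition_key[0]]

print(place_shard(("EU", "user-42")))  # leader lives in eu-west-1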

[Diagram: geo-partitioned cluster — leaders co-located with their data's owners. India region (ap-south): Mumbai leader with Hyderabad and Chennai replicas holding region='IN'; a Bengaluru user commits at the Mumbai leader in 12ms. EU region (eu-west): Frankfurt leader with Dublin and Amsterdam replicas holding region='EU'; a London user commits at the Frankfurt leader in 22ms. SEA region (ap-southeast): Singapore leader with Jakarta and Sydney replicas holding region='SEA'; a Singapore user commits at the Singapore leader in 3ms. Local user traffic lands in the local region without crossing oceans. Illustrative — not measured data.]
Each region's nodes hold leaders only for rows whose region attribute matches. A Singapore user's writes never traverse the Indian Ocean; a Bengaluru user's writes never cross to Europe. Cross-region traffic happens only when a query touches rows from multiple regions — the rare case, by design.

The implication is that the schema decision is now a latency decision. Choosing whether region belongs in the partition key is choosing whether the row's owner can be served locally. PaySetu's wallet table partitions on (region, user_id); their global currency-rate table does not (it's a small reference table replicated to every region). YatriBook's bookings table partitions on (region, booking_id) where the region is the user's home region at booking time; their flight-inventory table partitions on (origin_airport, flight_id) because flight inventory is naturally regional too. Each schema choice picks a fast path; rows that don't fit the chosen axis pay cross-region cost on every access. Why: a partition key is the access-path declaration — queries that filter by the leading prefix of the key get single-shard performance, and queries that don't get scatter-gather across shards. When the leading prefix is region, queries that filter by region are single-region; queries that don't filter by region (a global aggregate) pay every region's RTT in the worst case.
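
The access-path claim as a runnable sketch (illustrative one-way latencies as seen from an IN client; a real router consults shard metadata, not a dict):

# Region-prefixed queries route to one region; queries without the prefix
# fan out to every region and wait for the slowest response.
LATENCY_MS = {"IN": 4, "EU": 145, "SEA": 38}  # one-way, from an IN client

def query_ms(region_filter=None):
    targets = [region_filter] if region_filter else list(LATENCY_MS)
    return max(LATENCY_MS[t] for t in targets)  # scatter-gather waits for all

print(query_ms("IN"))  # 4   — single-region fast path
print(query_ms())      # 145 — global scatter-gather pays the worst region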

A working geo-partitioner in Python

Here is a runnable simulation of a three-region geo-partitioned cluster. It demonstrates routing, leader placement, and the cost difference between a co-located write and a cross-region one. Save and run it:

# geo_partition_demo.py
# Three-region geo-partitioned cluster: routing, placement, latency accounting.

import random
from dataclasses import dataclass, field
from collections import defaultdict

# Ground truth: one-way network latency between regions, in ms
# (intra-region entries are node-to-node costs within a region).
RTT = {
    ("IN", "IN"): 4, ("IN", "EU"): 145, ("IN", "SEA"): 38,
    ("EU", "IN"): 145, ("EU", "EU"): 6, ("EU", "SEA"): 165,
    ("SEA", "IN"): 38, ("SEA", "EU"): 165, ("SEA", "SEA"): 3,
}

@dataclass
class Row:
    pk: tuple        # (region, entity_id)
    value: str

@dataclass
class Cluster:
    rows: dict = field(default_factory=dict)  # pk -> Row
    region_of: dict = field(default_factory=dict)  # pk -> region

    def insert(self, region, entity_id, value):
        pk = (region, entity_id)
        self.rows[pk] = Row(pk, value)
        self.region_of[pk] = region

    def write(self, client_region, region, entity_id, value):
        pk = (region, entity_id)
        if pk not in self.rows:
            self.insert(region, entity_id, value)
        leader_region = self.region_of[pk]
        # Cost model: one client→leader hop + one intra-region quorum hop.
        latency = RTT[(client_region, leader_region)] + RTT[(leader_region, leader_region)]
        self.rows[pk].value = value
        return latency, leader_region

    def read(self, client_region, region, entity_id, max_staleness_ms=200):
        pk = (region, entity_id)
        if pk not in self.rows:
            # Miss: answered by the local follower at intra-region latency.
            return None, RTT[(client_region, client_region)]
        # A follower exists in every region by design in this demo, so reads
        # are served locally regardless of where the leader is. The staleness
        # budget (max_staleness_ms) is what makes that legal — see the
        # follower-reads chapter; this toy does not model the lag itself.
        return self.rows[pk].value, RTT[(client_region, client_region)]

if __name__ == "__main__":
    random.seed(11)
    c = Cluster()
    # Pre-populate 3 users per region
    for r in ("IN", "EU", "SEA"):
        for i in range(3):
            c.insert(r, f"u{i}", "balance=1000")

    workloads = [
        ("IN",  "IN",  "u0", "Bengaluru user, IN home"),
        ("EU",  "EU",  "u1", "London user, EU home"),
        ("SEA", "SEA", "u2", "Singapore user, SEA home"),
        ("EU",  "IN",  "u0", "London user fetching IN-home user"),
        ("SEA", "EU",  "u1", "Singapore user fetching EU-home user"),
    ]

    print(f"{'client':<6} {'pk_region':<10} {'entity':<5} {'write_ms':>9} {'read_ms':>8}  notes")
    for client_r, pk_r, ent, note in workloads:
        w_lat, leader = c.write(client_r, pk_r, ent, value="balance=1500")
        r_val, r_lat = c.read(client_r, pk_r, ent)
        print(f"{client_r:<6} {pk_r:<10} {ent:<5} {w_lat:>9.1f} {r_lat:>8.1f}  {note}")

Sample run:

client pk_region  entity write_ms  read_ms  notes
IN     IN         u0          8.0      4.0  Bengaluru user, IN home
EU     EU         u1         12.0      6.0  London user, EU home
SEA    SEA        u2          6.0      3.0  Singapore user, SEA home
EU     IN         u0        149.0      6.0  London user fetching IN-home user
SEA    EU         u1        171.0      3.0  Singapore user fetching EU-home user

The walkthrough of the load-bearing logic:

  • pk = (region, entity_id) — the region is the first element of the partition key. The placement policy reads the first element to decide leader location; downstream code never has to ask "where does this row live", because the partition key answers it directly.
  • latency = RTT[(client_region, leader_region)] + RTT[(leader_region, leader_region)] — the write cost is one cross-region hop (client to leader) plus one intra-region quorum hop. When client_region == leader_region, the cross-region term collapses to a cheap intra-region hop — that is the entire latency win.
  • The last two rows — a London user writing to an India-home row pays full cross-region cost (149ms write; the read still serves locally at 6ms). The write cost is unavoidable: if the row's owner is in India, the leader is in India. Geo-partitioning gives you a fast path for most traffic; it does not eliminate cross-region cost for the case where it is genuinely needed.
  • read(...) returns at intra-region latency — followers exist in every region by replication, so reads can always be served locally (subject to bounded staleness, covered in the previous chapter). The read fast path is independent of the write fast path.

The actionable insight is the gap between the first three rows (3–12ms) and the last two (149–171ms). For a workload where 95% of writes are home-region writes, the average write latency is dominated by the 3–12ms numbers; for the 5% cross-region writes, the system gracefully degrades to the single-leader latency floor. The fast path is fast precisely because the slow path still works — geo-partitioning is not a workaround for cross-region access; it is a placement that makes most accesses not need to cross.
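
The same insight as arithmetic — the 95/5 mix is a hypothetical workload; the latencies are the demo's:

# Expected write latency for a 95% home-region / 5% cross-region mix.
home_ms, cross_ms, p_home = 8.0, 149.0, 0.95
avg = p_home * home_ms + (1 - p_home) * cross_ms
print(f"average write ≈ {avg:.1f}ms")  # ≈ 15.1ms — the fast path dominates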

Moving leadership when an owner relocates

A user moves from Bengaluru to London permanently. Their row's leader is in Mumbai; their writes now cross London→Mumbai and back on every booking (reads can still be served by a local follower). Three options exist, in increasing complexity.

Option 1: Do nothing. Most users don't move regions. The 0.1% who do pay 145ms cross-region cost forever. For a workload where this is rare, the operational simplicity is worth the latency tax. PaySetu chose this for their wallet table.

Option 2: Re-key the row. When the user updates their home region, the application performs INSERT INTO users (region='EU', user_id=...) ... ; DELETE FROM users WHERE region='IN' AND user_id=.... The row physically moves shards. This is a two-step operation that must be transactional, and any references to the user's (region, user_id) from other tables (orders, transactions) must update too — which is invasive. CockroachDB's REGIONAL BY ROW tables automate the move for the row itself (updating the row's crdb_region column relocates it); updating foreign references is still application work.
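
Option 2 against the demo cluster — a sketch that assumes geo_partition_demo.py is importable; move_user_home is an illustrative helper, and a real system would wrap both steps (plus every referencing table) in one transaction:

# rekey_demo.py — option 2: move the row by changing its partition key.
from geo_partition_demo import Cluster

def move_user_home(cluster, entity_id, old_region, new_region):
    old_pk = (old_region, entity_id)
    # INSERT under the new key + DELETE under the old key. In a real
    # database this pair — and the re-key of every referencing table —
    # must commit as a single transaction.
    row = cluster.rows.pop(old_pk)
    cluster.region_of.pop(old_pk)
    cluster.insert(new_region, entity_id, row.value)

c = Cluster()
c.insert("IN", "u9", "balance=700")
move_user_home(c, "u9", "IN", "EU")
print(c.write("EU", "EU", "u9", "balance=900"))  # (12, 'EU') — local again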

Option 3: Lease the leadership without re-keying. Keep the partition key stable; have the placement policy re-assign leadership of the user's shard to the new region. This is what Spanner's MOVE operation does, and what CockroachDB's leaseholder rebalancing does behind the scenes when access patterns shift. The shard's physical location moves; the partition key's logical region stays the same. The downside: the partition key now lies — region='IN' but the data is leadered in EU — and any query that uses the partition key to predict the region (for routing) gets it wrong. Why: the partition key is a static declaration; the actual leader location is dynamic. Code that assumes they coincide will break the moment a leader moves; production systems either commit to keeping them coincident (option 2) or commit to never assuming the partition key reflects physical location (option 3) — mixing the two is the bug.

A real production failure that bit BharatBazaar's order-fulfilment team: they used option 3 — leases moved automatically based on access patterns — but the warehouse-allocation service still assumed region='IN' meant the leader was in India. When the placement engine moved a hot shard's leader to Singapore (which had spare capacity during a Mumbai datacentre power dip), the warehouse service's "co-located write" path actually paid 38ms each instead of 4ms. It looked fine in metrics — every write succeeded — but order-fulfilment p99 jumped from 22ms to 65ms for that shard's customers, undetected for 11 days. The fix was to ship a small client-side service-discovery callback that asked the cluster "where is the leader for (region, user_id) right now" rather than inferring it from the key. Option 3 requires the application to never infer location from the key.
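
Option 3 and the BharatBazaar fix, sketched against the demo (again assuming geo_partition_demo.py is importable; LeasedCluster, move_lease, and leader_for are illustrative names, not any vendor's API). Leadership becomes a lease consulted at access time, and nothing infers location from the key:

# lease_demo.py — option 3: leadership is a lease, separate from the key.
from geo_partition_demo import RTT, Cluster

class LeasedCluster(Cluster):
    def __init__(self):
        super().__init__()
        self.lease = {}  # pk -> current leader region (overrides the key)

    def move_lease(self, pk, new_region):
        self.lease[pk] = new_region  # key unchanged — it now "lies"

    def leader_for(self, pk):
        # The BharatBazaar fix: ask the cluster, never infer from the key.
        return self.lease.get(pk, self.region_of[pk])

    def write(self, client_region, region, entity_id, value):
        pk = (region, entity_id)
        if pk not in self.rows:
            self.insert(region, entity_id, value)
        leader = self.leader_for(pk)
        latency = RTT[(client_region, leader)] + RTT[(leader, leader)]
        self.rows[pk].value = value
        return latency, leader

lc = LeasedCluster()
lc.insert("IN", "u0", "balance=1000")
lc.move_lease(("IN", "u0"), "SEA")      # placement engine moved the shard
print(lc.write("IN", "IN", "u0", "x"))  # (41, 'SEA') — key says IN, cost says SEA

The discipline is the point: every access path goes through leader_for; the moment any code path shortcuts back to the key, the BharatBazaar bug returns.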

Common confusions

  • "Geo-partitioning is the same as multi-region replication." No. Multi-region replication copies the same data to multiple regions; geo-partitioning splits the data so each region holds its own subset. A geo-partitioned cluster typically still replicates each shard within its home region for durability, but does not replicate shards across regions.
  • "The region in the partition key has to be a country code." No — it can be any locality attribute. AWS regions (ap-south-1, eu-west-1), legal jurisdictions (india-rbi, eu-gdpr), or even per-customer dedicated zones (tenant-acme-corp). The mechanism is identical; the labels are policy.
  • "Geo-partitioning eliminates cross-region traffic." No. Cross-region traffic still happens for queries that span regions (a London user reading an Indian seller's catalogue, a global aggregate report). Geo-partitioning eliminates cross-region traffic for queries with a region filter; everything else is unchanged.
  • "It works the same as Cassandra's LOCAL_QUORUM." No. LOCAL_QUORUM is a consistency level — it asks for a quorum within the local datacentre. Geo-partitioning is a placement policy — it ensures the relevant shard's leader is in the local region. They compose: a geo-partitioned cluster typically uses LOCAL_QUORUM reads against the shard's home region.
  • "Geo-partitioning gives me linearisability across regions." Only for writes that hit the same shard. A transaction touching (region='IN', user_id=A) and (region='EU', user_id=B) becomes a cross-region transaction, with full two-phase-commit cost across an oceanic RTT. Designs that need such transactions typically denormalise to avoid them.
  • "The placement engine handles everything; I don't need to think about keys." No. The placement engine works with what you give it; if your partition key doesn't expose the locality attribute, the placement engine cannot infer where the row "should" live. The schema is the placement contract.

Going deeper

REGIONAL BY ROW vs REGIONAL BY TABLE in CockroachDB

CockroachDB exposes three localities. GLOBAL replicates the table to every region (best for read-mostly reference data — currency rates, country codes). REGIONAL BY TABLE pins the entire table to a single region (best for region-scoped data where the application already knows the region — say, an eu_orders table). REGIONAL BY ROW is the geo-partitioning pattern from this chapter — each row carries a crdb_region column, and the placement engine routes that row's leader to the matching region. The choice is a per-table latency-vs-flexibility decision. A subtle gotcha: changing locality from REGIONAL BY TABLE to REGIONAL BY ROW is an online operation but requires a backfill of the crdb_region column, which can take hours on multi-TB tables.

Partial replication and the regulatory dimension

Pure geo-partitioning replicates each shard within its home region — but RBI's localisation rule allows certain payment metadata (not the payment itself) to be replicated outside India for processing. Implementations handle this with placement constraints rather than partition keys: the payments table is REGIONAL BY ROW with strict constraints (replicas only in India), while a derived payment_audit_summary table is GLOBAL with relaxed constraints. The schema designer makes the regulatory boundary visible in the locality declarations rather than hiding it in application code.

Cross-region indexes and the global-secondary-index problem

A geo-partitioned table is fast for queries on its partition key. A query that needs a non-partition-key index — say "find all orders with status='pending' globally" — needs an index that spans regions. Three options: a global index (replicated to every region, expensive to update), a local index per region (fast updates, scatter-gather reads), or a denormalised summary table updated by an asynchronous job. Each is a different point on the latency-vs-staleness Pareto curve. CockroachDB's GLOBAL table locality takes the first approach for reference data; Cassandra's local secondary indexes are the second; most production systems with geo-partitioned data use the third for any cross-region analytical query, because the freshness requirement is usually loose enough.
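
A sketch of the third option, since it is the one most production systems land on — PendingOrdersSummary is an illustrative name, and the refresh period is the staleness you are buying:

import time

# A denormalised summary refreshed by an async job: reads stay local in
# every region; the cross-region cost is paid once per refresh, amortised.
class PendingOrdersSummary:
    def __init__(self, refresh_period_s=300):
        self.refresh_period_s = refresh_period_s  # staleness budget
        self.counts = {}                          # region -> pending count
        self.last_refresh = 0.0

    def refresh(self, per_region_counts):
        # Background job: one scatter-gather across regions per period.
        self.counts = dict(per_region_counts)
        self.last_refresh = time.time()

    def read(self):
        staleness = time.time() - self.last_refresh
        return sum(self.counts.values()), staleness  # local read, bounded lag

s = PendingOrdersSummary()
s.refresh({"IN": 1200, "EU": 300, "SEA": 150})
total, lag = s.read()
print(total, f"stale by {lag:.3f}s")  # 1650, fresh just after the refresh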

What happens when a region is offline

When the EU region goes offline (rare but real — see the chapters on regional failure), the leaders for region='EU' rows are unreachable. Reads against those rows can be served from their replicas (with bounded staleness) but writes cannot proceed without quorum. A geo-partitioned cluster therefore needs a cross-region quorum mode for the catastrophic case — typically a configurable replica in a different region that is normally inactive but can be promoted to leader during a regional outage. This is the bridge to the disaster-recovery chapter; geo-partitioning is the steady-state placement, and DR mode is the failure-mode placement.
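
The failure-mode placement, sketched on the demo cluster (assumes geo_partition_demo.py is importable; the standby map is hypothetical). Promotion turns "down" into "degraded":

# dr_demo.py — regional outage: promote out-of-region standbys to leader.
from geo_partition_demo import Cluster

STANDBY = {"IN": "SEA", "EU": "IN", "SEA": "IN"}  # hypothetical standby map

def fail_region(cluster, dead_region):
    for pk, home in cluster.region_of.items():
        if home == dead_region:
            cluster.region_of[pk] = STANDBY[dead_region]  # promote standby

c = Cluster()
c.insert("EU", "u1", "balance=1000")
fail_region(c, "EU")
print(c.write("EU", "EU", "u1", "x"))  # (149, 'IN') — degraded, not down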

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 geo_partition_demo.py
# Try changing a user's region mid-script — observe the latency jump.
# Try inserting a row with region='ANTARCTICA' and add a corresponding
# RTT entry — the system has no concept of "wrong region", just placement.

Where this leads next

Geo-partitioning is the placement layer. Once each row's leader is in the right region, the next questions are: how do you handle a query that needs to span regions (cross-region transactions and 2PC), what do you do when a region goes dark (disaster recovery RPO/RTO), and how do you reconcile writes that did happen in two regions concurrently because the placement engine briefly disagreed about the leader (conflict-free geo-replication).

The thread to hold: geo-partitioning is the technique that turns a globally-replicated database into a federation of mostly-local ones. The "mostly" is doing all the work. The chapters that follow are each about a different shape of the cases that don't fit "mostly".

References

  • Taft, R. et al. (2020). CockroachDB: The Resilient Geo-Distributed SQL Database. SIGMOD '20. The reference implementation of REGIONAL BY ROW.
  • Corbett, J. et al. (2012). Spanner: Google's Globally-Distributed Database. OSDI '12. Per-row leader placement via directories.
  • Pavlo, A. et al. (2012). Skew-Aware Automatic Database Partitioning in Shared-Nothing Parallel OLTP Systems. SIGMOD '12. Foundational treatment of partition-key-as-locality.
  • Reserve Bank of India (2018). Storage of Payment System Data (RBI/2017-18/153). The regulatory pressure that forces geo-partitioning in Indian fintech.
  • YugabyteDB. Tablespaces and geo-partitioning (docs). The PostgreSQL-compatible take on the same primitive.
  • Microsoft. Cosmos DB: Geo-partitioned data with global distribution (docs). The NoSQL framing of the same idea.
  • Internal: follower reads and bounded staleness; one datacentre isn't enough; the append-only log.