In short

You shard your database. Each shard, taken on its own, is a perfectly good ACID system — atomicity, consistency, isolation, durability all hold for any transaction whose rows live on that one shard. The day you ship sharding to production this property survives unchanged for 99 percent of your traffic, because most transactions touch one user, one tenant, one device.

The remaining one percent is where the cluster breaks. The moment a transaction touches two shards — "transfer 100 rupees from Alice on shard 3 to Bob on shard 7" — local ACID is no longer enough. The debit on shard 3 and the credit on shard 7 are independent local transactions; if the first commits and the second fails, the money disappears. If the second commits and the first fails, money is created. There is no shard-local commit decision that can prevent either outcome, because neither shard knows what the other is doing.

Atomic commit across shards is a distinct distributed-systems problem with its own protocol family. The classical answer is two-phase commit (2PC): a coordinator asks every shard to prepare, then tells every shard to commit. It works, until the coordinator dies between phases — at which point the participants are stuck holding locks indefinitely. Newer designs replace the single coordinator with a Paxos- or Raft-replicated coordinator (Spanner, CockroachDB, YugabyteDB) so the protocol survives any single failure. Microservice architectures often skip atomic commit entirely and use the saga pattern — a chain of local transactions with compensating actions, accepting partial-state visibility for protocol simplicity.

This chapter is the problem statement. The next chapter shows why you cannot dodge it. The chapters after that are the solutions.

You ship sharding. Six months in, every dashboard is green. Then a support ticket arrives: "I sent 100 rupees to my brother. The money left my account. He never got it." Your reconciliation script confirms it — the debit is in accounts_shard_3, no matching credit in accounts_shard_7. The money is gone.

You stare at the application code:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE user = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE user = 'bob';
COMMIT;

In single-node Postgres this is atomic. The two updates either both happen or both don't; COMMIT is the moment of truth. You have written this code your entire career and it has always worked.

But Alice is on shard 3 and Bob is on shard 7. Your sharding proxy split the transaction into two local transactions — one per shard — and committed them independently. Most of the time both succeed. This time, shard 7 was failing over during a routine restart, the credit timed out, and the debit on shard 3 had already committed.
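The failure is easy to reproduce in miniature. A hedged sketch — the `Shard` class below is a toy stand-in for the proxy's per-shard connections, not any real proxy API — showing that when each shard commits independently, a failure on the second shard after the first has committed silently loses the money:

```python
class Shard:
    """Toy stand-in for one ACID shard: commits are local and independent."""
    def __init__(self, balances):
        self.balances = dict(balances)
        self.available = True

    def apply(self, user, delta):
        if not self.available:
            raise ConnectionError("shard unavailable")
        self.balances[user] += delta      # local transaction: commits immediately

def naive_transfer(src, dst, sender, recipient, amount):
    src.apply(sender, -amount)            # local txn on shard 3: commits
    dst.apply(recipient, +amount)         # local txn on shard 7: may fail after the debit

shard3 = Shard({"alice": 500})
shard7 = Shard({"bob": 200})
shard7.available = False                  # shard 7 is mid-failover

try:
    naive_transfer(shard3, shard7, "alice", "bob", 100)
except ConnectionError:
    pass                                  # the application sees an error, but...

print(shard3.balances["alice"], shard7.balances["bob"])  # 400 200 — 100 rupees gone
```

The debit cannot be un-committed from inside this code path; by the time the credit fails, shard 3's transaction is already durable.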

This chapter is the problem statement: why this happens, why it is fundamentally hard, and what the solution space looks like.

Why local ACID is not enough

Each shard runs its own ACID-compliant database. On shard 3, the debit is an atomic local transaction — it either commits or it doesn't, and Postgres or MySQL or whatever is running there guarantees that. Same on shard 7 for the credit. Two correct local transactions.

What is missing is a guarantee that both transactions reach the same outcome. A standalone ACID database gives you "atomicity within a transaction". A sharded system, by default, gives you "atomicity within each shard's piece of the transaction". These are not the same thing.

Why local ACID does not compose: each shard's transaction coordinator makes its commit decision independently, based only on what is happening on that one shard. The coordinator on shard 3 has no idea that shard 7 exists, let alone that shard 7 just failed. From shard 3's local perspective, the debit is a perfectly valid transaction that it should commit. From shard 7's perspective, the credit is a transaction it never received. Both shards are individually behaving correctly. The cluster as a whole is incoherent.

Three concrete failure modes follow from the lack of coordination:

Money disappears. Shard 3 commits the debit, shard 7 fails the credit. User's balance is 100 lower; recipient never sees the money. Without auditing, this surfaces as a customer-support ticket weeks later.

Money is created. Shard 7 commits the credit, then shard 3 crashes before its debit commit lands. Recipient sees 100 new rupees; sender's balance never decreased.

Constraints break across shards. A unique constraint on (user, day, idempotency_key) works per shard but not across shards — the same key can appear on two shards if a retry routes differently. Foreign-key references between sharded tables are no longer enforced when the tables live on different nodes.
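The constraint failure can be shown with a toy model — a set standing in for each shard's unique index, and a retry that the router (hypothetically) sends to a different shard:

```python
# Each shard enforces uniqueness locally; nothing enforces it across shards.
shards = [set(), set()]          # per-shard "unique index" on idempotency_key

def insert_payment(shard_id, idempotency_key):
    index = shards[shard_id]
    if idempotency_key in index:
        raise ValueError("duplicate key")   # the local constraint works fine
    index.add(idempotency_key)

insert_payment(0, "alice:2024-06-01:k42")   # first attempt lands on shard 0
insert_payment(1, "alice:2024-06-01:k42")   # retry routed to shard 1: accepted!

duplicates = shards[0] & shards[1]
print(duplicates)   # {'alice:2024-06-01:k42'} — the payment was applied twice
```

Each shard behaved correctly in isolation; the cross-shard invariant was never anyone's job to enforce.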

You cannot fix any of these by tightening local-transaction semantics. The shards are doing what they are configured to do. The fix lives above them — in a protocol that gets the shards to agree on a single outcome.

The two-phase commit (2PC) protocol

The classical answer, dating from Jim Gray's work in the 1970s, is two-phase commit. The idea is to add a coordinator — a process that runs the application's transaction — that drives all participating shards through a synchronised commit.

Phase 1 — Prepare. The coordinator sends a PREPARE message to every shard touched by the transaction. Each shard receives the writes, validates them locally (no constraint violations, no deadlock, enough disk space, no conflicting locks), writes a prepare record to its own write-ahead log, and replies either YES ("I can commit this") or NO ("I cannot").

The shard does not actually apply the writes yet. It holds locks on the affected rows so no other transaction can interfere.

Phase 2 — Commit or Abort. The coordinator collects the votes. If every shard voted YES, it writes a COMMIT record to its own log and broadcasts COMMIT; each shard applies the writes, releases its locks, and acknowledges. If any shard voted NO or timed out, the coordinator broadcasts ABORT and every prepared shard rolls back.

The coordinator's commit record is the source of truth. Once it is on disk, the transaction is committed even if every shard subsequently crashes — they will recover, see the prepare records, ask the coordinator for the outcome, and complete accordingly.

[Figure: Two-phase commit message flow. A coordinator and two participants (Shard A, Shard B) exchange messages in two phases — PREPARE and YES votes in phase 1, the coordinator's COMMIT log write, then COMMIT and ACK in phase 2.]
Two-phase commit for a money transfer. Phase 1: the coordinator asks both shards to prepare. Each writes a prepare record and locks the row. Phase 2: once both have voted YES, the coordinator logs its own COMMIT decision and broadcasts it; the shards apply the writes and release locks. The protocol works in the happy path; the failure modes are what make it interesting.

In the happy path this is correct. It is also expensive: every transaction now requires two synchronous round-trips per participant — one for prepare, one for commit — so latency roughly doubles, and the locks held during the gap reduce concurrency. You accept those costs because the alternative is corruption.
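The cost claim can be made concrete with a rough happy-path model. A sketch — the numbers are illustrative, and it assumes messages within a phase are sent in parallel:

```python
def two_pc_cost(participants, rtt_ms, fsync_ms):
    """Rough happy-path cost of 2PC: one PREPARE and one COMMIT round-trip
    per participant, plus the coordinator's decision fsync. Assumes the
    coordinator fans out to all participants in parallel within each phase."""
    messages = 2 * participants          # one prepare + one commit per participant
    latency = 2 * rtt_ms + fsync_ms      # two sequential phases, parallel fan-out
    return messages, latency

print(two_pc_cost(participants=2, rtt_ms=1.0, fsync_ms=2.0))  # (4, 4.0)
```

Against a single-shard commit of roughly one round-trip plus one fsync, the doubling is visible directly in the model.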

The 2PC failure modes

The interesting question is not "does 2PC work in the happy path?" but "what happens when something fails?". Walk through the cases:

Participant dies during PREPARE. Coordinator times out, treats the missing vote as NO, broadcasts ABORT. The dead shard recovers, sees no prepare record (or one with no decision), aborts independently. Clean.

Participant dies after voting YES. On recovery it finds a prepare record but no decision. It contacts the coordinator to ask "what happened to T?". The coordinator's log answers. Locks held until then; recovers.

Coordinator dies before writing COMMIT. Participants time out, no commit record anywhere, the protocol aborts. Clean.

Coordinator dies after writing COMMIT. Fine — the commit record is on disk. The recovered coordinator resumes from there.

Coordinator dies between writing COMMIT and sending it. Some shards committed, others did not. Those still in prepared state hold locks. They cannot commit on their own (no decision visible) and cannot abort (others already committed — aborting would diverge). They wait. This is the blocking case.

Network partition between coordinator and a participant. The prepared participant cannot reach the coordinator, cannot decide, holds locks, waits.

The first three failure modes are recoverable. The last two are the entire reason this chapter exists.
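The participant-side recovery rules above collapse into a single decision function. A sketch, with `prepare_record` and `coordinator_log` as hypothetical stand-ins for the shard's WAL and the coordinator's decision log:

```python
def recover_participant(prepare_record, coordinator_log):
    """Participant restart logic for one transaction, per the cases above.

    prepare_record: None if no PREPARE was logged before the crash.
    coordinator_log: dict of txn_id -> 'COMMIT'/'ABORT', or None when the
    coordinator is unreachable (the blocking case).
    """
    if prepare_record is None:
        return "abort"              # never voted YES: safe to abort unilaterally
    txn_id = prepare_record["txn_id"]
    if coordinator_log is None:
        return "wait"               # voted YES, no decision visible: must block
    decision = coordinator_log.get(txn_id)
    if decision == "COMMIT":
        return "commit"             # apply the writes, release locks
    return "abort"                  # ABORT recorded, or no record: presumed abort

print(recover_participant(None, None))                          # abort
print(recover_participant({"txn_id": "T1"}, {"T1": "COMMIT"}))  # commit
print(recover_participant({"txn_id": "T1"}, None))              # wait
```

The `"wait"` branch is the blocking property in four lines: a prepared participant with no reachable decision has no safe move except to hold its locks.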

The blocking problem

Two-phase commit is a blocking protocol: there exist failure scenarios under which a participant in the prepared state cannot make forward progress until it can reach the coordinator (or learn the coordinator's decision from another participant's stable storage). During that wait, it holds locks on every row the transaction touched.

Practical consequence: a coordinator crash with prepare records outstanding freezes the rows of all in-flight cross-shard transactions on every participating shard. Any other transaction that needs those rows queues behind the held locks. If the coordinator runs the application server and the application server is gone, the database administrator must manually find the prepare records, reach a decision (commit or abort) by some out-of-band reasoning, and force the resolution shard by shard. XA documentation calls these "in-doubt" transactions, and the tooling for resolving them — Oracle's COMMIT FORCE, MySQL's XA RECOVER — is explicitly manual and operator-driven.

Why this is fundamentally a 2PC problem, not an implementation bug: the participants must agree on the outcome, and they reach agreement by trusting the coordinator's recorded decision. If the only place that decision is recorded becomes unreachable, no participant has the authority to decide on its own — committing risks divergence (if the coordinator's record actually says ABORT), aborting risks divergence (if it says COMMIT). The safe behaviour is to wait. That waiting is the blocking property.

In production this manifests as: 2 AM page, one cross-shard transaction stuck in prepared state on five shards, twenty downstream transactions queued waiting on the locks. You cannot proceed until you know what the original transaction was supposed to do. If you have a coordinator log on disk, you read it. If the coordinator's machine is gone — disk failure, terminated cloud instance — you have a corruption recovery problem.

The blocking problem motivates everything that comes after: 3PC variants, Paxos commit, sagas, and the eventual conclusion that you should architect to avoid cross-shard transactions wherever possible.

Paxos-based atomic commit (3PC variants)

The blocking problem stems from the fact that the coordinator is a single point of failure for the commit decision. Replicate the coordinator and the problem softens.

Three-phase commit (3PC) added a PRE_COMMIT round between PREPARE and COMMIT, the idea being that participants who reach pre-commit state can recover by majority vote without the coordinator. 3PC works in synchronous networks but breaks under partitions, so it never saw real adoption.

Paxos commit, introduced by Gray and Lamport in 2006, is the production answer. Replace the single coordinator with a consensus-replicated coordinator: the coordinator's state machine (prepare records, commit decisions) is replicated across N nodes via Paxos or Raft, and any majority of those nodes can drive the protocol forward.

The mechanics: every "decision" the coordinator would make — "I have received YES from shard A", "I am committing transaction T" — is run through the consensus protocol before being acted on. A coordinator-replica failure is not fatal because the surviving majority still has the decision and can continue sending COMMIT messages. A network partition that isolates a minority of coordinator replicas is not fatal because the majority can elect a new leader and continue. The blocking case disappears: any single failure is recoverable.

This is what production NewSQL systems do. Spanner runs Paxos-replicated coordinator state across data centres; its 2PC-on-Paxos design is documented in the original 2012 paper. CockroachDB uses Raft groups for both data ranges and transaction records — every transaction's status row is itself Raft-replicated. YugabyteDB uses transaction-status-tablet (TST) replication for the same purpose. The price is higher commit latency — every coordinator step is now a Paxos round — but the property gained is that no single failure stalls the cluster.

The downside: the protocol is dramatically more complex. The coordinator is itself a distributed system; you cannot reason about it as a single process. Implementing Paxos commit correctly is a multi-year engineering investment, which is why most sharded systems either accept 2PC's blocking risk or skip atomic commit entirely.

Saga pattern — the non-ACID alternative

Sometimes the right answer is: don't do atomic cross-shard commit. Restructure the transaction.

A saga is a sequence of local transactions, each on one shard, where every step has a corresponding compensating action. If step k succeeds but step k+1 fails, you run the compensating actions for steps 1 through k in reverse order to undo their effects.

For the money transfer, step 1 is "debit 100 from Alice" on her shard, step 2 is "credit 100 to Bob" on his, and step 1's compensating action is "credit 100 back to Alice". Then:

If step 1 fails: stop. No money has moved.

If step 1 succeeds and step 2 fails: run step 1's compensation. Alice gets her money back. The end state is "no transfer happened", which is acceptable.

If both succeed: done.

What you give up: between step 1 and step 2, a third party reading the database sees Alice with a debited balance and Bob without the credit. The system has inconsistent intermediate state. Sagas accept this in exchange for protocol simplicity — every step is a normal local transaction with no distributed protocol on top.

What you gain: no blocking, no coordinator failure modes, no held locks, no Paxos. A saga is just an ordinary state machine running ordinary local transactions. You can implement it in a workflow engine (Temporal, Cadence, Camunda) or hand-write it in your application.
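A hand-written saga is small enough to show whole. A minimal sketch — the step functions and balances are illustrative, not any workflow engine's API:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, run the
    compensations of completed steps in reverse. True iff all steps committed."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo steps 1..k in reverse order
                comp()
            return False
    return True

balances = {"alice": 500, "bob": 200}

def debit_alice():  balances["alice"] -= 100          # step 1: local txn, shard A
def credit_bob():   raise ConnectionError("shard B")  # step 2: fails this run
def refund_alice(): balances["alice"] += 100          # step 1's compensation

ok = run_saga([(debit_alice, refund_alice), (credit_bob, lambda: None)])
print(ok, balances)  # False {'alice': 500, 'bob': 200} — "no transfer happened"
```

Note what is absent: no prepare phase, no coordinator log, no held locks. A production version adds durability for the saga's own state, so a crashed workflow can resume or compensate on restart.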

When sagas fit: every step has a meaningful compensating action, and the application can tolerate readers observing the intermediate states between steps.

When sagas do not fit: a step cannot be undone (an email sent, goods shipped, cash dispensed), or an intermediate state would violate an invariant that concurrent readers depend on — there you need genuine atomic commit.

Sagas trade atomicity for availability. They are the most common answer in microservice architectures because the alternative — Paxos commit across services — is too heavy.

Why atomic commit is fundamentally hard (the FLP theorem)

There is a result from 1985 that explains why every atomic-commit protocol you will ever see has the same shape and the same trade-offs.

Fischer–Lynch–Paterson (FLP) impossibility: in an asynchronous distributed system with even one process that may fail by crashing, no deterministic algorithm can guarantee that consensus is reached — any protocol that is always safe can be delayed forever.

Atomic commit requires consensus: every participant must agree on the same outcome (commit or abort). FLP therefore implies that no atomic-commit protocol can guarantee both safety (never disagree on the outcome) and liveness (always reach an outcome) in an asynchronous network where any participant may be slow or crashed.

Practical protocols choose: 2PC chooses safety and gives up liveness (it can block forever). Paxos commit chooses safety and approximates liveness using timeouts and leader elections — it almost always makes progress in practice, but a sufficiently adversarial network can stall it. Sagas avoid the consensus problem by giving up atomicity itself.

Why FLP matters operationally: any time you read a paper or vendor doc claiming "non-blocking atomic commit", you should immediately ask "under what failure model?". The answer is always "synchronous network" or "at most f failures out of 2f+1 nodes" or some other relaxation of pure async. In a real production network during a partition or a slow node, all of these protocols can degrade to 2PC's behaviour — locks held, transactions stuck, operators paged. FLP does not let you escape this; it only lets you push the bad cases into rarer corners.

Engineers who do not know FLP keep trying to invent non-blocking 2PC and discover, after months of work, that they have either reinvented Paxos or introduced a subtle correctness bug. Knowing the impossibility result up front saves that work.

Python sketch — 2PC with timeout

A minimal 2PC coordinator in Python. Real implementations add coordinator-side persistence, recovery on restart, and consensus replication; this sketch shows the protocol shape.

class TwoPhaseCommit:
    def __init__(self, log):
        self.log = log

    def execute(self, txn_id, participants, op):
        prepared = []
        # Phase 1 — prepare
        for p in participants:
            try:
                if p.prepare(txn_id, op, timeout=5.0):
                    prepared.append(p)
                else:
                    self.abort(txn_id, prepared)
                    return False
            except (TimeoutError, ConnectionError):
                self.abort(txn_id, prepared)
                return False
        # Decision: log first, then act
        self.log.write(("COMMIT", txn_id))
        # Phase 2 — commit
        for p in prepared:
            self._commit_with_retry(p, txn_id)
        return True

    def abort(self, txn_id, prepared):
        self.log.write(("ABORT", txn_id))
        for p in prepared:
            try:
                p.abort(txn_id)
            except Exception:
                pass  # retry on recovery

    def _commit_with_retry(self, p, txn_id, attempts=10):
        for _ in range(attempts):
            try:
                p.commit(txn_id)
                return
            except Exception:
                continue
        # Participant unreachable; commit record is in log.
        # Recovery thread will retry.

Three details worth noting. The decision is logged before sending COMMIT messages — that is the recovery anchor. Aborts are best-effort with retry-on-recovery semantics — if a participant cannot be reached during abort, the recovery thread reads the log on restart and retries. The coordinator log itself is the single point of failure that motivates Paxos commit; replace self.log.write with a Raft-replicated append and you have the skeleton of CockroachDB's transaction record.
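That replacement can itself be sketched. A toy majority-ack log — hypothetical, and deliberately not real Raft (no leader election, no log repair, no term numbers), just the quorum-durability property that matters here:

```python
class ReplicatedLog:
    """Toy stand-in for a consensus-replicated coordinator log: an append
    succeeds only once a majority of replicas have durably stored it."""
    def __init__(self, replicas):
        self.replicas = replicas              # list of per-node entry stores

    def write(self, entry):
        acks = 0
        for replica in self.replicas:
            try:
                replica.append(entry)         # stand-in for an fsync'd append RPC
                acks += 1
            except ConnectionError:
                continue                      # that replica is down; keep going
        if acks <= len(self.replicas) // 2:
            raise RuntimeError("no majority: decision is not durable")
        return True

nodes = [[], [], []]
log = ReplicatedLog(nodes)
log.write(("COMMIT", "T1"))
# Any majority of nodes now holds the decision; losing one node cannot lose it.
survivors = [n for n in nodes if ("COMMIT", "T1") in n]
print(len(survivors) >= 2)  # True
```

With this in place of `self.log.write`, a crashed coordinator replica no longer strands prepared participants: any surviving majority can answer "what happened to T?".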

Real-world choices

Different systems pick different points in the trade-off space:

Sharded SQL with classical 2PC (Vitess, Citus): 2PC is opt-in per transaction, with 2x round-trip latency and blocking risk. Production deployments usually keep cross-shard transactions disabled or rare; the application is structured to keep transactions shard-local.

Wide-column NoSQL (Cassandra, DynamoDB, Bigtable): no cross-partition transactions. The data model forces denormalisation or eventual consistency. DynamoDB later added TransactWriteItems (2PC under the hood) for narrow cases. Cassandra has Paxos lightweight transactions only for single-partition CAS.

NewSQL with Paxos commit (Spanner, CockroachDB, YugabyteDB, TiDB): atomic cross-shard transactions are transparent. Application sees ordinary SQL; the engine runs Paxos commit underneath. Latency is higher — Spanner reports ~10ms cross-region commit floor — but the application does not think about sharding.

Microservices with sagas: no atomic commit at all. Workflow engines (Temporal, Camunda, Step Functions) coordinate compensating chains. Each service's database is sovereign; cross-service consistency is application-level.

The pattern: as you move from classical SQL with 2PC toward event-sourced systems, you get more availability and less coordination at the price of relaxed atomicity. Pick the point your business actually requires — not what your database happens to support.

The Alice-to-Bob transfer, in detail

Alice has 500 rupees on shard A. Bob has 200 rupees on shard B. Alice transfers 100 to Bob. Walk three implementations.

Implementation 1 — naive (no atomic commit). The application sends the debit to shard A, gets a success, sends the credit to shard B. If both succeed: Alice 400, Bob 300. If shard B fails after shard A commits: Alice 400, Bob 200, money disappears. Broken.

Implementation 2 — 2PC. Coordinator sends PREPARE(debit 100, txn=T) to A, PREPARE(credit 100, txn=T) to B. Both shards lock the row, write a prepare record, vote YES. Coordinator logs COMMIT(T), sends COMMIT(T) to both. Both apply, release locks, ack. Alice 400, Bob 300. Works in happy path. Failure case: coordinator crashes between COMMIT(T) log write and sending the commit messages. Both shards still hold the prepare locks. They wait. Until the coordinator (or its log) recovers, no other transaction touching Alice's or Bob's row can proceed.

Implementation 3 — Saga. Step 1: workflow engine sends "debit 100 from Alice" to shard A. A commits locally; Alice now has 400. Step 2: workflow engine sends "credit 100 to Bob" to shard B. B commits locally; Bob now has 300. Done. Failure case: step 2 fails (shard B unreachable). Workflow engine runs the compensating action — "credit 100 back to Alice" on shard A. A commits locally; Alice back to 500. End state: no transfer happened. Window of inconsistency: between step 1 and step 2, a third party reading both balances sees Alice 400 and Bob 200 — total 600 instead of 700. The saga accepts this transient inconsistency.

[Figure: Money transfer under 2PC vs saga. Three parallel timelines — 2PC (PREPARE A, PREPARE B, both vote YES, log COMMIT, COMMIT A, COMMIT B, both ACK: Alice 400 / Bob 300); saga happy path (debit A locally, credit B locally: Alice 400 / Bob 300); saga failure path (debit A locally, credit B fails, compensate A: Alice 500 / Bob 200).]
Three flows. 2PC: synchronous, two round-trips, atomic outcome on success but blocking under coordinator failure. Saga happy path: two local commits, no coordinator, brief inconsistency window. Saga failure path: forward step succeeds, next step fails, compensating action restores Alice's balance — final state is "transfer never happened".

Common confusions

"ACID on each shard gives you ACID overall." No. Atomicity, the A, is the property that breaks first at shard boundaries. Two correctly-ACID shards committing independent local transactions is not the same as one ACID transaction spanning both. Local ACID is necessary but not sufficient.

"Two-phase commit is just two messages." Each phase is a synchronous round-trip per participant. With N participants that is 2N network operations plus the coordinator's log fsync. Latency typically more than doubles relative to a single-shard transaction.

"Distributed transactions are rare." They are not. Every join across shards, every cross-user transfer, every multi-service workflow is a distributed transaction — even if you don't call it that. The question is not "do you have them" but "how are you handling them" — implicitly with eventual-consistency bugs, or explicitly with a protocol.

"A saga is a failed atomic transaction." A saga is a different consistency model. It accepts observable intermediate states and provides recovery via compensation. The right question is "what intermediate states can a reader see, and is the application robust to them?".

"Use 2PC for every cross-shard transaction." Almost certainly not. 2PC's latency and blocking risk make it appropriate only when atomicity is genuinely required and the transaction rate is low. Otherwise, sagas or shard-local restructuring win.

"Paxos commit is just better 2PC." It removes the blocking failure mode but not the FLP liveness ceiling. More available, not consensus-free, and significantly more complex.

Going deeper

Gray and Lamport, Consensus on Transaction Commit (2006) — frames atomic commit as a consensus problem and shows how to replace the 2PC coordinator with a Paxos-replicated state machine. Reading it makes the NewSQL design space click.

Spanner's transaction architecture — the 2012 OSDI paper describes 2PC over Paxos groups using TrueTime. The key section is "Transactions over Paxos groups", which explicitly motivates Paxos commit by the blocking property of classical 2PC.

CockroachDB's transaction layer — open source and well-documented. Their architecture docs cover transaction records, parallel commits, and 1PC fast paths. CockroachDB pioneered "parallel commits", which collapse prepare and commit phases under specific conditions.

Saga pattern (Garcia-Molina and Salem, 1987) — the original paper, much older than the recent microservice rediscovery. Worth reading for the precise semantics of compensating transactions.

Implementation engines — Temporal, Eventuate, Axon. They implement the saga as a durable state machine: you write forward and compensating steps, they handle retry, recovery, and visibility.

Where this leads next

Chapter 98 — The wall: you need atomic commit across shards — argues that you cannot dodge this problem indefinitely. Every plausible attempt to avoid cross-shard transactions eventually collides with a business requirement that demands atomicity.

Build 13 covers the consensus protocols — Raft, Paxos, EPaxos — that the Paxos-commit answer relies on. Atomic commit reduces to consensus; consensus is the bottom of the stack. After that come the Spanner, CockroachDB, and YugabyteDB internals chapters, where you finally see Paxos commit deployed end-to-end. This chapter is the language you will use when reading those.

References

  1. Gray, The Transaction Concept: Virtues and Limitations, VLDB 1981 — the foundational paper that defined ACID and laid out the original 2PC protocol. Worth reading for the historical framing of why atomicity matters.
  2. Gray and Lamport, Consensus on Transaction Commit, ACM TODS 2006 — the paper that reframes atomic commit as consensus and introduces Paxos commit. Essential for understanding NewSQL transaction design.
  3. Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — the original Spanner paper with its 2PC-on-Paxos transaction architecture and TrueTime-based external consistency.
  4. Cockroach Labs, CockroachDB Transaction Architecture — the clearest open-source treatment of distributed transactions, including the parallel-commits optimisation.
  5. Kleppmann, Designing Data-Intensive Applications, Chapter 7 — Transactions, and Chapter 9 — Consistency and Consensus, O'Reilly 2017 — the textbook treatment of distributed transactions, 2PC, and the link to consensus.
  6. McCaffrey, Distributed Sagas: A Protocol for Coordinating Microservices, YOW! 2017 — the modern reframing of the saga pattern for microservice architectures, including precise semantics for compensating actions and durable workflow execution.