Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Sagas: forward and compensating

At 8:47pm on a Friday, Aarti — a senior engineer on MealRush's order pipeline — got paged because a customer's card had been charged ₹487 but no rider had been assigned and the restaurant's kitchen never received the ticket. Her debugger showed the order in a state called PENDING_RIDER for 11 minutes, then FAILED_NO_RIDER, then nothing. The card was still charged. The customer was still hungry. This was not a bug in any single service; this was the seam between three services — payment, restaurant-ticket, rider-assignment — and the seam was leaking. The team had picked a saga over 3PC two years earlier with eyes open: no synchrony assumption, no blocking on a coordinator, each step independently durable. What they had not fully internalised was that the price of "no atomic commit" is "you must ship the inverse of every forward step". This chapter is about that price: what a saga actually is, what compensating transactions look like in code, and the failure modes that you inherit by choosing this protocol.

A saga is a sequence of local transactions T₁, T₂, …, Tₙ, each in a different service, with an associated compensating transaction C₁, C₂, …, Cₙ₋₁ that semantically undoes its forward step. If Tₖ fails, the saga runs Cₖ₋₁, Cₖ₋₂, …, C₁ in reverse to leave the system in a consistent state. Sagas survive partition, do not block, and do not need a synchronous network — but they are not atomic: there is a window during execution where some forward steps have committed and later ones have not, and external observers can see partial state. Compensations must be idempotent, semantically meaningful (a refund, not a "delete row"), and ordered carefully when steps have side effects in the real world. The contract is different from 2PC's, not weaker — it suits long-running, multi-service, real-world workflows where atomicity was never realistic to begin with.

What a saga actually is — forward steps and their inverses

The original definition is from Garcia-Molina and Salem's 1987 paper Sagas, written about long-lived database transactions whose locks would otherwise hold for hours. The structure they proposed is so simple it sounds like a description of normal code: a saga is a sequence of local transactions, where each Tᵢ is paired with a compensating Cᵢ such that running Tᵢ followed by Cᵢ leaves the database in a state semantically equivalent to having done nothing.

The key word is semantically. Cᵢ is not the binary inverse of Tᵢ — it is the business-meaning inverse. If Tᵢ charges ₹487 to a card, Cᵢ refunds ₹487 to that card. The refund is not "delete the charge row"; it is a separate ledger entry that nets the original to zero. If Tᵢ reserves an inventory item, Cᵢ releases the reservation; the database row may not even be the same row that was inserted.

Why semantic compensation matters: between Tᵢ committing and Cᵢ running, other transactions may observe and act on Tᵢ's effects. A user may have seen "card charged" and screenshot it. A reconciliation script may have included the charge in a daily total. A binary undo would erase history; the system loses the audit trail. A semantic compensation preserves history (both the charge and the refund are recorded) and leaves the externally-observed state consistent. This is also why "compensating" is a different verb from "rolling back" — rollback presumes the change never happened; compensation acknowledges it happened and counteracts it.
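To make the distinction concrete, here is a minimal sketch of an append-only ledger. The `Ledger` class and its `charge`, `refund`, and `balance` methods are hypothetical names for illustration, not MealRush's payments API. The refund is a new entry with its own idempotency key, so a retried compensation does not refund twice and the history keeps both records.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical append-only ledger: entries are added, never deleted."""
    entries: list = field(default_factory=list)
    seen_keys: set = field(default_factory=set)

    def _append(self, kind, amount, key):
        # Idempotency: a repeated key is a no-op, so retries are safe.
        if key in self.seen_keys:
            return False
        self.seen_keys.add(key)
        self.entries.append((kind, amount, key))
        return True

    def charge(self, order_id, amount):
        return self._append("charge", amount, f"chg:{order_id}")

    def refund(self, order_id, amount):
        # Semantic compensation: a NEW entry with the opposite sign.
        return self._append("refund", -amount, f"ref:{order_id}")

    def balance(self):
        return sum(amount for _, amount, _ in self.entries)


ledger = Ledger()
ledger.charge("order-1", 487)
ledger.refund("order-1", 487)
ledger.refund("order-1", 487)   # retried compensation: ignored

print(ledger.balance())         # 0
print(len(ledger.entries))      # 2 (the charge and the refund both remain)
```

A binary undo would instead remove the charge entry, leaving nothing for the reconciliation script or the customer's screenshot to agree with.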

A saga has two execution shapes. Forward execution runs T₁ → T₂ → … → Tₙ. If every step succeeds, the saga commits. Compensating execution runs when some Tₖ fails (or times out, or is explicitly aborted): the saga runs Cₖ₋₁ → Cₖ₋₂ → … → C₁, undoing each completed forward step in reverse order. The order matters because later steps may depend on state created by earlier ones; compensating in forward order would tear down T₁'s state while T₂'s effects, which still reference it, are still in place.

[Figure: Saga forward and compensating execution. Happy path: T1 reserve → T2 charge → T3 assign → T4 deliver → DONE. Failure path: T1 reserve → T2 charge → T3 fails → unwind → C2 refund → C1 release; T4 never executes. In the window between T2 and C2 the card is charged with no rider, which is externally observable partial state. Illustrative.]
The saga is not atomic. Between T2 (charge card) and C2 (refund card) the customer's bank statement shows a charge with no corresponding order. The saga's correctness contract is "eventually reach a consistent end state", not "intermediate states are invisible".
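The two shapes can be sketched with in-memory step names only. This is a toy under stated assumptions: the `run` function, the `fail_at` parameter, and the step labels are illustrative and do no real work; the sketch only shows which steps run, and in what order, on each path.

```python
def run(steps, fail_at=None):
    """Run (forward_name, compensation_name) pairs; return the trace.

    `fail_at` is the name of the forward step that fails, or None.
    """
    trace, completed = [], []
    for name, comp in steps:
        if name == fail_at:
            trace.append(f"{name} FAILED")
            # Unwind: compensate completed steps in reverse order.
            for _, done_comp in reversed(completed):
                if done_comp is not None:
                    trace.append(done_comp)
            return trace
        trace.append(name)
        completed.append((name, comp))
    trace.append("DONE")
    return trace


STEPS = [("T1 reserve", "C1 release"),
         ("T2 charge", "C2 refund"),
         ("T3 assign", "C3 unassign"),
         ("T4 deliver", None)]

happy = run(STEPS)
failure = run(STEPS, fail_at="T3 assign")

print(happy)
print(failure)
```

On the failure path the trace ends with `C2 refund` followed by `C1 release`: the failed step itself is never compensated, because it never committed.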

Compensating transactions in code — the MealRush order pipeline

The forward steps and their compensations need to be designed together. Here is the MealRush order saga, written as it actually runs, with the compensations on the same level of attention as the forward steps. The orchestration pattern shown is central orchestrator — the alternative, choreography, has each service publish events and react to peers, and is covered in /wiki/choreography-vs-orchestration.

# MealRush order saga: orchestrator pattern, with compensations.
# The orchestrator writes a saga-log entry before and after every step, so it
# can resume after a crash. Compensations are idempotent.
# restaurant_svc, payments_svc, rider_svc, delivery_svc, saga_log and
# restaurant_has_stock are service clients and helpers defined elsewhere.
import enum

class SagaStepFailed(Exception):
    """Raised by a forward step that cannot complete; carries a reason code."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason

class Step(enum.Enum):
    RESERVE = "reserve_inventory"
    CHARGE  = "charge_card"
    ASSIGN  = "assign_rider"
    DELIVER = "deliver"

# --- forward steps (each is a local transaction in its own service) ---
# Every step takes (order, ctx) so the orchestrator can call them uniformly.
def reserve_inventory(order, ctx):
    if not restaurant_has_stock(order.items):
        raise SagaStepFailed("RESTAURANT_OUT_OF_STOCK")
    return restaurant_svc.reserve(order.id, order.items)         # returns reservation_id

def charge_card(order, ctx):
    return payments_svc.charge(order.id, order.amount,
                               idempotency_key=order.id)         # returns txn_id

def assign_rider(order, ctx):
    rider = rider_svc.find_nearby(order.pickup_loc, timeout_s=480)
    if rider is None:
        raise SagaStepFailed("NO_RIDER_AVAILABLE")
    return rider.id

def deliver(order, ctx):
    return delivery_svc.start(order.id, ctx[Step.ASSIGN])

# --- compensations (each is a local transaction with semantic inverse) ---
def release_reservation(order, ctx):
    restaurant_svc.release(ctx[Step.RESERVE], idempotency_key=f"rel:{order.id}")

def refund_card(order, ctx):
    payments_svc.refund(ctx[Step.CHARGE], order.amount,
                        idempotency_key=f"ref:{order.id}")       # NEW ledger entry

def unassign_rider(order, ctx):
    rider_svc.cancel(ctx[Step.ASSIGN], idempotency_key=f"can:{order.id}")

# --- orchestrator (durable saga log; recoverable across crashes) ---
def run_saga(order):
    log = saga_log.open(order.id)
    ctx, completed = {}, []
    plan = [(Step.RESERVE, reserve_inventory, release_reservation),
            (Step.CHARGE,  charge_card,        refund_card),
            (Step.ASSIGN,  assign_rider,       unassign_rider),
            (Step.DELIVER, deliver,            None)]
    # Look up each step's compensation once; DELIVER has none.
    compensations = {step: comp for step, _, comp in plan if comp}
    try:
        for step, forward, _ in plan:
            # Log step.value so entries read "reserve_inventory", not "RESERVE".
            log.append(("BEGIN", step.value))
            ctx[step] = forward(order, ctx)
            log.append(("OK", step.value, ctx[step]))
            completed.append(step)
        log.append(("COMMITTED",))
    except SagaStepFailed as e:
        log.append(("FAILED", e.reason))
        for step in reversed(completed):
            log.append(("COMPENSATING", step.value))
            compensations[step](order, ctx)       # idempotent, safe to retry
            log.append(("COMPENSATED", step.value))
        log.append(("ABORTED", e.reason))
A representative log for the 8:47pm incident Aarti debugged:

BEGIN reserve_inventory     → OK reserve_inventory r-9f2a (ctx)
BEGIN charge_card           → OK charge_card t-3b81 (₹487)
BEGIN assign_rider          → FAILED NO_RIDER_AVAILABLE
COMPENSATING charge_card    → COMPENSATED charge_card (refund t-3b81→0)
COMPENSATING reserve_inventory → COMPENSATED reserve_inventory (released r-9f2a)
ABORTED NO_RIDER_AVAILABLE

Per-line walkthrough:

  • reserve_inventory is the first forward step. It writes a reservation in the restaurant's database; if the restaurant is out of stock, it raises SagaStepFailed, which the orchestrator catches without triggering compensations because nothing has succeeded yet.
  • charge_card uses an idempotency key (order.id). If the orchestrator crashes after the payment service committed but before the saga log got the OK, the next attempt with the same key returns the already-committed result instead of double-charging. Idempotency keys are not optional — see /wiki/idempotency-keys-and-deduplication.
  • refund_card is the compensation for charge_card. It writes a new ledger entry that nets to zero against the original, not a delete. The customer's bank statement will show charge ₹487 followed by refund ₹487, which is auditable and externally truthful.
  • The orchestrator's saga log is the recovery anchor. After a crash, the orchestrator reads the log, identifies which steps committed, and either resumes forward execution (if the failure was transient and we want to retry) or runs compensations in reverse order. The log is the saga's single source of truth.
  • Reverse-order compensation (for step in reversed(completed)) matters because state created by T₁ may be referenced by T₂'s state. If you compensated in forward order, C₁ might run while T₂'s output still references T₁'s state, leading to dangling references or double-counting in observability dashboards.
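The recovery rule described in the walkthrough can be sketched as a pure function over log entries. This is a minimal sketch, assuming the tuple shapes used in the listing (`("BEGIN", step)`, `("OK", step, result)`, `("FAILED", reason)`, and so on); the `recover` function and its return values are hypothetical names, not part of MealRush's orchestrator.

```python
def recover(entries):
    """Decide what a restarted orchestrator should do, from the log alone.

    Returns (action, steps):
      ("done", [])           the saga already reached a terminal state
      ("compensate", [...])  steps still to compensate, in reverse order
      ("retry", [step])      a step was begun but its outcome is unknown
      ("continue", [])       every begun step finished; run the next one
    """
    kinds = [e[0] for e in entries]
    if "COMMITTED" in kinds or "ABORTED" in kinds:
        return ("done", [])

    ok = [e[1] for e in entries if e[0] == "OK"]
    compensated = {e[1] for e in entries if e[0] == "COMPENSATED"}

    if "FAILED" in kinds:
        # A COMPENSATING entry without COMPENSATED is simply re-run;
        # that is safe only because compensations are idempotent.
        pending = [s for s in reversed(ok) if s not in compensated]
        return ("compensate", pending)

    begun = [e[1] for e in entries if e[0] == "BEGIN"]
    in_flight = [s for s in begun if s not in ok]
    if in_flight:
        # Outcome unknown: retry with the same idempotency key.
        return ("retry", in_flight)
    return ("continue", [])


# Crash after the payment call was issued but before OK was logged.
crashed_mid_charge = [
    ("BEGIN", "reserve_inventory"), ("OK", "reserve_inventory", "r-9f2a"),
    ("BEGIN", "charge_card"),
]
# Crash halfway through unwinding.
crashed_mid_unwind = crashed_mid_charge + [
    ("OK", "charge_card", "t-3b81"), ("BEGIN", "assign_rider"),
    ("FAILED", "NO_RIDER_AVAILABLE"),
    ("COMPENSATING", "charge_card"), ("COMPENSATED", "charge_card"),
    ("COMPENSATING", "reserve_inventory"),
]

print(recover(crashed_mid_charge))
print(recover(crashed_mid_unwind))
```

The first case is why `charge_card` needs an idempotency key: the orchestrator cannot tell from the log whether the charge committed, so it must be able to ask again safely.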

What sagas guarantee — and what they do not

Sagas guarantee three properties under realistic network conditions: durability of each step (each Tᵢ and Cᵢ commits to a local database), forward progress modulo failures (the orchestrator either completes the saga or compensates fully), and eventual consistency (after a finite number of retries, the system reaches either the committed end state or the fully-compensated start state). The last two hold only if every compensation eventually succeeds when retried, which is why compensations must be idempotent and must not be allowed to fail for business reasons.
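The "finite number of retries" clause can be sketched as a retry loop around an idempotent compensation. This is a sketch under assumptions: `TransientError`, `compensate_with_retry`, and the retry budget are hypothetical, and a production orchestrator would persist the attempt count in the saga log rather than hold it in memory.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def compensate_with_retry(comp, max_attempts=5, base_delay_s=0.5, sleep=time.sleep):
    """Run an idempotent compensation until it succeeds or attempts run out.

    Returns the number of attempts used. Raises RuntimeError when the
    budget is exhausted, so the caller can park the saga for manual recovery.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            comp()
            return attempt
        except TransientError:
            if attempt == max_attempts:
                break
            sleep(base_delay_s * 2 ** (attempt - 1))   # exponential backoff
    raise RuntimeError("compensation still failing; escalate to manual recovery")


calls = {"n": 0}

def flaky_refund():
    # Simulated refund that times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("payments timeout")

attempts = compensate_with_retry(flaky_refund, sleep=lambda s: None)
print(attempts)   # 3
```

Retrying blindly is only safe because the compensation carries an idempotency key; without one, the second and third attempts could each issue a refund.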

What sagas do not guarantee is isolation. Other transactions can observe intermediate states. Between charge_card succeeding and assign_rider failing, the customer's bank account shows the charge. If the customer screenshots their bank app at that moment, they have evidence of a charge for a non-existent order. The compensation eventually fixes this — but "eventually" might mean 60 seconds for a transient rider shortage, or 11 minutes (Aarti's incident) for the rider-assignment timeout.

Why isolation matters even when atomicity does not: in a 2PC world, no other transaction sees the prepared state of a not-yet-committed transaction; locks make intermediate states invisible. In a saga world, each Tᵢ commits locally and releases its locks immediately; intermediate states are visible to anything that reads from the affected service. This is fine for many domains (a "preparing" status on a food-delivery app is acceptable) and catastrophic for others (a stock-trade saga that exposes a half-settled trade would be a regulatory disaster, which is why settlement systems with that requirement reach for an atomic commit protocol such as 2PC with a replicated coordinator rather than a saga).

There is also a subtle hazard called dirty reads on the saga's intermediate state. Suppose T₂ (charge card) commits, and immediately after commit, an unrelated service reads the customer's "lifetime spend" total from the payments database. That total now includes the ₹487. Five seconds later, T₃ fails and C₂ runs, refunding the ₹487. The lifetime-spend reader has a stale value. The saga is correct (eventually); the cached lifetime-spend is wrong (forever, until invalidated). This is why sagas need read-side reconciliation: any aggregate computed from saga-affected data must be eventually-recomputed, not point-in-time-cached.

[Figure: The isolation gap: who sees what, when. Three swim lanes: saga, bank app, analytics. The saga runs T1 ok, T2 ok, T3 fail, C2 ok, C1 ok. The bank app shows "₹487 charged" once T2 commits and "₹487 refund" only after C2. The analytics reader pulls totals after T3 fails and caches an aggregate with the charge included; that cache is never invalidated. Illustrative.]
Compensations restore the saga's own state, but not the state of every downstream reader. Dirty-read risk is a separate concern that needs read-side reconciliation, cache invalidation, or eventually-consistent aggregates.
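The difference between a point-in-time cache and a recomputed aggregate can be shown in a few lines. This is a toy sketch: the in-memory `ledger` list, the `lifetime_spend` function, and the customer id are illustrative assumptions, not the payments service's schema.

```python
ledger = []   # append-only (kind, customer, amount) entries

def lifetime_spend(customer):
    """Recompute the aggregate from the ledger; refunds are negative entries."""
    return sum(amount for _, c, amount in ledger if c == customer)

ledger.append(("charge", "cust-1", 487))     # T2 commits
cached_total = lifetime_spend("cust-1")      # analytics reads and caches
ledger.append(("refund", "cust-1", -487))    # C2 runs after T3 fails

stale = cached_total                 # point-in-time cache: still includes the charge
fresh = lifetime_spend("cust-1")     # recomputed from the ledger after compensation

print(stale, fresh)   # 487 0
```

Recomputing on every read is rarely affordable at scale; the same effect is usually achieved by invalidating or incrementally updating the aggregate when the refund entry is written.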

Production stories — sagas in the wild

MealRush order pipeline. The example threaded through this chapter is real (in the fictional sense the disclaimer covers). The team's design doc is explicit: "we are choosing a saga because the alternative — distributed transactions across payments, restaurant, and rider services — would require either a 2PC coordinator that blocks during partition or a Paxos-replicated coordinator we don't have time to build. The saga's failure mode is a small percentage of orders that show 'preparing' for several minutes before refunding; this is acceptable to the product team and survives all the partition scenarios our SRE team red-teamed". After two years in production, MealRush's saga handles ~3.4M orders/day; the compensation rate is ~0.7% (mostly rider-shortage timeouts during peak dinner rush) and the median compensation latency is 14 seconds. The 8:47pm incident Aarti debugged was in the long tail of the latency distribution; the post-mortem traced it to a stuck rider-assignment query that the timeout did not catch promptly.

BharatBazaar checkout. The e-commerce platform's checkout is a longer saga: verify_inventory → reserve_inventory → calculate_tax → calculate_shipping → charge_card → confirm_order → notify_warehouse. Each step is a separate service; the orchestrator runs in a Temporal workflow (Temporal is a workflow engine widely used for sagas — see references). Compensations include releasing inventory, voiding tax calculations (which can be subtle because tax rate may have changed between charge and compensation), refunding the card, and cancelling the warehouse notification. The team's hardest lesson: compensations have time-windows. If the warehouse has already started picking when the cancellation arrives, the warehouse cannot un-pick — the compensation becomes "ship the order anyway and chase the customer for payment". Real-world side-effects don't always have inverses; the saga architecture must encode which steps cross the point of no return.

Booking.com hotel reservations. Foreign company, real story (publicly documented). Their hotel-booking saga reserves a room, charges the customer, sends a confirmation email, and updates the hotel's PMS. Compensations include releasing the room, refunding the card, sending an apology email, and rolling back the PMS write. The architectural insight in their published post-mortem: "we model every step as having both a forward and a compensating action, and we treat the absence of a compensation as a design bug." This is the philosophical heart of saga design — every action must have a planned inverse, even if the inverse is "send a human to handle this".

Why KapitalKite does not use sagas. A counter-example. KapitalKite (the fictional stockbroker) is required by regulation to provide atomic settlement: either the trade settles entirely or none of it does. There is no "compensating transaction" for a half-settled trade because regulators do not accept "we eventually refunded" as compliance. KapitalKite uses 2PC + Raft-replicated coordinator for trade settlement. The saga pattern is the right tool when the domain tolerates intermediate states; it is the wrong tool when atomicity is a regulatory or contractual requirement.

Common confusions

  • "A saga is the same as a long-running transaction with retries." Retries assume the same forward step can be re-run; a saga assumes some forward steps cannot be re-run (the customer can't be re-charged after a refund) and explicitly designs an inverse. Retries plus compensations together form a saga; retries alone are not a saga.
  • "Compensations roll back the database." Compensations write new records that semantically counteract the originals. A refund is a new ledger entry, not a deletion of the charge. Rollback presumes the change never happened; compensation acknowledges it happened and balances it. The audit log shows both, which is usually what regulators and customers actually want.
  • "Sagas guarantee atomicity." They do not. Sagas guarantee eventual consistency between the start and end states; they explicitly permit observable intermediate states. If your domain requires that no observer ever sees a partial state, sagas are the wrong protocol — use 2PC, Paxos-commit, or rethink the domain boundary so the work fits in one local transaction.
  • "You can add a saga to an existing system without changing the services." Each forward step needs an idempotency key, a compensating action, and a saga-log integration. Most existing services have none of these. Retrofitting a saga onto a system designed for 2PC typically requires touching every service that participates — sagas are a system-level architectural pattern, not a library you import.
  • "Choreography is always better than orchestration." Choreography (event-driven, no central coordinator) scales differently than orchestration (central state machine) and has different failure modes. Choreography distributes the "what's the saga's current state" question across all services; orchestration centralises it in one place. Both are legitimate; the choice depends on whether you want one place that knows everything or many places that each know a piece. See /wiki/choreography-vs-orchestration.
  • "Compensations are always the inverse of forward steps." They are usually the semantic inverse but rarely the operational inverse. Forward step send_email has compensation send_apology_email — not "unsend the email", which is impossible. Forward step print_label has compensation print_void_sticker_and_dispatch_human. The compensation handles real-world consequences, not database state alone.

Going deeper

The original Garcia-Molina and Salem 1987 paper

The paper that introduced sagas was about long-running database transactions whose locks would otherwise block other work for hours. The original motivating example was a travel-booking transaction: book a flight, book a hotel, book a car. Holding locks across all three for the duration of a user session was unacceptable. The paper's contribution was to formalise the "split into local transactions, each with a compensation" pattern as a correctness model with provable properties. The model assumes compensations are commutative with concurrent forward steps from other sagas — Cᵢ for saga A and Tⱼ for saga B can run in any order and produce the same final state. This is not always true in practice (think of inventory: releasing one saga's reservation while another saga is reserving the same item creates a race), and modern saga implementations relax this with explicit serialisation or distributed locks.
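The inventory race mentioned above can be reproduced with a single unit of stock. In this sketch the `Stock` class and the `scenario` helper are hypothetical; saga A has already reserved the unit, and the two runs differ only in whether A's compensation or B's forward step executes first.

```python
class Stock:
    """Hypothetical single-item inventory with one unit available."""
    def __init__(self):
        self.holder = None          # which saga holds the unit, if any

    def reserve(self, saga):        # forward step T
        if self.holder is not None:
            return False            # out of stock: the saga's step fails
        self.holder = saga
        return True

    def release(self, saga):        # compensation C
        if self.holder == saga:
            self.holder = None


def scenario(order):
    stock = Stock()
    stock.reserve("A")              # saga A's forward step already committed
    b_ok = None
    for op in order:
        if op == "C_A":
            stock.release("A")
        elif op == "T_B":
            b_ok = stock.reserve("B")
    return stock.holder, b_ok


first = scenario(["C_A", "T_B"])    # A compensates, then B reserves
second = scenario(["T_B", "C_A"])   # B tries first, then A compensates

print(first)    # ('B', True)
print(second)   # (None, False)
```

The two interleavings end in different states, so C for saga A and T for saga B do not commute here; saga B's outcome depends on timing alone.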

Saga patterns: orchestration vs choreography

Orchestration has a central orchestrator (the run_saga function above) that calls each service in sequence and tracks state. Easy to reason about, easy to debug (one log to read), but the orchestrator is a single point of coordination and must itself be replicated for fault tolerance. Choreography has each service publish events to a message bus and react to events from peers — the saga's state is implicit in the message flow. Scales horizontally, no central coordinator, but very hard to debug because the saga's current state requires reading messages from N services and reasoning about their interleaving. Industry practice in 2026 leans toward orchestration with a workflow engine (Temporal, AWS Step Functions, Cadence) for anything beyond ~3 services; choreography survives mostly in event-sourced architectures where the audit log of events is itself the saga state.
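For contrast with the `run_saga` orchestrator, here is a minimal choreography sketch using an in-process event bus. Everything in it is an assumption for illustration: the `on`, `publish`, and `drain` helpers, the event names, and the hard-coded rider failure. A real system would use a durable message broker and each handler would live in a different service.

```python
from collections import defaultdict, deque

handlers = defaultdict(list)   # event name -> subscribed handlers
queue = deque()                # events waiting to be delivered
seen = []                      # every event published, in order

def on(event):
    def register(fn):
        handlers[event].append(fn)
        return fn
    return register

def publish(event, order_id):
    seen.append(event)
    queue.append((event, order_id))

def drain():
    while queue:
        event, order_id = queue.popleft()
        for fn in handlers[event]:
            fn(order_id)

# Each service knows only the events it reacts to; none sees the whole saga.
@on("order_placed")
def restaurant_reserve(order_id):
    publish("inventory_reserved", order_id)

@on("inventory_reserved")
def payments_charge(order_id):
    publish("card_charged", order_id)

@on("card_charged")
def rider_assign(order_id):
    publish("rider_unavailable", order_id)     # simulate the failure path

@on("rider_unavailable")
def payments_refund(order_id):
    publish("card_refunded", order_id)         # compensation for the charge

@on("card_refunded")
def restaurant_release(order_id):
    publish("inventory_released", order_id)    # compensation for the reservation

publish("order_placed", "order-1")
drain()
print(seen)
```

No function in this sketch contains the sequence "reserve, charge, assign, refund, release"; the order exists only in the event flow, which is the debugging cost the paragraph above describes.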

Why workflow engines matter

A naive saga orchestrator written in application code has two problems: it loses state on crash (unless it implements its own durable log) and it doesn't compose well across teams (each team writes their own saga library, badly). Workflow engines like Temporal and Cadence solve both: they provide a durable execution model where the orchestrator's code can pause for hours, the engine persists its position, and on restart the engine re-runs deterministic code to rebuild state. Modern saga implementations are usually workflows in one of these engines, not hand-written orchestrators. The conceptual content of this chapter still applies — forward steps, compensations, idempotency, reverse-order unwinding — but the runtime is borrowed from a battle-tested engine rather than written from scratch.
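The record-and-replay idea can be illustrated with a toy runtime. This is not Temporal's or Cadence's API; the `Replayer` class, its `activity` method, and the `crash_after` switch are hypothetical and exist only to show why re-running deterministic code against a persisted history rebuilds state without repeating side effects.

```python
class Replayer:
    """Toy durable-execution runtime. Results of side-effecting calls are
    recorded in `history`; on replay, recorded results are returned instead
    of calling the function again."""

    def __init__(self, history=None):
        self.history = list(history or [])   # persisted results, in call order
        self.cursor = 0
        self.executed = []                   # calls actually performed this run

    def activity(self, name, fn):
        if self.cursor < len(self.history):
            result = self.history[self.cursor]   # replay: skip the side effect
        else:
            result = fn()                        # first execution
            self.history.append(result)
            self.executed.append(name)
        self.cursor += 1
        return result


def order_workflow(rt, crash_after=None):
    reservation = rt.activity("reserve", lambda: "r-9f2a")
    if crash_after == "reserve":
        raise RuntimeError("orchestrator crashed")
    txn = rt.activity("charge", lambda: "t-3b81")
    return reservation, txn


first = Replayer()
try:
    order_workflow(first, crash_after="reserve")
except RuntimeError:
    pass

# Restart: a new runtime loads the persisted history and re-runs the code.
second = Replayer(history=first.history)
result = order_workflow(second)

print(first.executed)    # ['reserve']
print(second.executed)   # ['charge']
print(result)            # ('r-9f2a', 't-3b81')
```

The replay only works if the workflow code makes the same calls in the same order every time, which is why such engines require the orchestration code to be deterministic.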

When the compensation is impossible

Some forward steps cross a point of no return: an email sent, a package shipped, a stock trade settled. The compensation cannot literally undo the action; it can only mitigate. Saga design must identify these points and either (a) move them to the end of the saga so failure before them is the common case, or (b) treat failure after them as a manual-recovery exception. BharatBazaar's checkout puts notify_warehouse last for exactly this reason; if the saga is going to fail, it almost always fails before the warehouse is notified. Designing the saga order — which forward step happens when — is itself a design decision with operational consequences.
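One way to make the point of no return explicit is to mark it in the plan and check it when a step fails. The sketch below assumes a simplified three-step checkout; `StepSpec`, `first_point_of_no_return`, and `on_failure` are hypothetical names, and treating `notify_warehouse` as having no inverse is a simplification of the time-window problem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepSpec:
    name: str
    compensation: Optional[str]   # None means no inverse exists


def first_point_of_no_return(plan):
    """Index of the first step without a compensation, or None."""
    for i, step in enumerate(plan):
        if step.compensation is None:
            return i
    return None


def on_failure(plan, failed_index):
    """Classify a failure of the step at `failed_index`.

    "compensate" if every completed step has an inverse, otherwise
    "manual_recovery" because an irreversible step already ran.
    """
    pivot = first_point_of_no_return(plan)
    if pivot is not None and pivot < failed_index:
        return "manual_recovery"
    return "compensate"


reserve = StepSpec("reserve_inventory", "release_inventory")
charge = StepSpec("charge_card", "refund_card")
notify = StepSpec("notify_warehouse", None)

good_order = [reserve, charge, notify]   # irreversible step last
bad_order = [reserve, notify, charge]    # irreversible step before the charge

print(on_failure(good_order, failed_index=1))   # compensate
print(on_failure(bad_order, failed_index=2))    # manual_recovery
```

With the irreversible step last, a failed charge unwinds cleanly; with it earlier, the same failure leaves a warehouse already picking an unpaid order.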

Where this leads next

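Sagas open several follow-on topics that fill out the design space. Two are linked from this chapter: choreography versus orchestration (/wiki/choreography-vs-orchestration) and idempotency keys (/wiki/idempotency-keys-and-deduplication).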

The lesson from this chapter is the lesson MealRush's design doc captured in 2024: when atomic commit is too expensive or too brittle for your domain, sagas give you a different correctness contract: eventual consistency between durable end states, with explicit compensations as the price of admission. The contract gives up 2PC's atomicity and isolation, but it survives the network conditions 2PC cannot.

References

  • Garcia-Molina, H., & Salem, K. (1987). "Sagas". SIGMOD '87. The original paper. https://dl.acm.org/doi/10.1145/38713.38742
  • Richardson, C. Microservices Patterns (Manning, 2018). Chapter 4 covers sagas, orchestration vs choreography, and compensation design at production depth.
  • Temporal documentation. "Workflow as code: durable execution for distributed systems". https://docs.temporal.io/concepts
  • Pavlo, A. CMU 15-721 Advanced Database Systems, lecture on distributed transactions. Covers sagas in the context of transaction protocols. https://15721.courses.cs.cmu.edu/spring2023/
  • Newman, S. Building Microservices (O'Reilly, 2nd ed., 2021). Chapter on transactions across service boundaries; pragmatic discussion of saga trade-offs.
  • Booking.com engineering blog. "Booking the right saga". Public write-up of the hotel-booking saga design.
  • /wiki/3pc-and-why-it-doesnt-help — the protocol sagas are an alternative to.
  • /wiki/idempotency-keys-and-deduplication — the building block sagas depend on.