In short
Two-phase commit gives you atomicity across shards, but it does so by making every participant hold locks from PREPARE until COMMIT. That window is fine for a ₹500 bank transfer that closes in milliseconds. It is fatal for a business workflow that books a flight on IndiGo, a room at a Taj hotel, and a taxi from Ola, where each step can take seconds, the user can pause for a phone call, and the whole thing can span hours.
The saga pattern is the answer the industry settled on. A saga is a sequence of local transactions T_1, T_2, \ldots, T_n, each of which commits independently on its own service. Each T_i ships with a compensating transaction C_i that semantically undoes the business effect of T_i. If T_k fails, the saga executes C_{k-1}, C_{k-2}, \ldots, C_1 in reverse order to roll the workflow back. There is no global lock, no coordinator-blocking failure mode, and no "in-doubt" state across the system as a whole.
The price is real. Atomicity is gone — there are intermediate states a customer can observe ("flight booked, hotel pending, taxi not yet"). Compensations are business actions, not byte-level undos: cancel-flight is a real refund, not a database ROLLBACK. Compensations themselves can fail, and designing them to be idempotent and retryable — with a human-in-the-loop fallback queue for the cases that defeat automation — is the hard part of running sagas in production.
Sagas come in two coordination styles. Choreographed sagas have each service emit events (FlightBooked) and the next service listen; there is no central authority. They are simple to set up and notoriously hard to reason about as the workflow grows. Orchestrated sagas have a central orchestrator (Temporal, Cadence, AWS Step Functions, Netflix Conductor) drive the sequence; the workflow is one diagram you can read top-to-bottom, at the cost of one more component to operate.
The chapter develops the model, walks both styles with diagrams, builds a tiny Python Saga class with step(forward, compensate) registration and a run() method that walks forward and compensates on failure, and traces through a holiday-booking example where the hotel step fails and the flight is compensated.
In Build 14 you have spent ten chapters learning two-phase commit and the consensus-based variants that fix its blocking failure mode. All of them share a property: from the moment a participant votes YES, it holds locks until the coordinator delivers the verdict. On a fast LAN with a healthy coordinator the lock window is a few milliseconds. That is the regime 2PC was designed for.
Now the product manager walks in with a different problem. A user on the Make My Trip app is booking a four-day Goa holiday. The workflow has three steps: book a flight on IndiGo, book a room at a Taj hotel, book an airport taxi from Ola. Each call goes to a separate microservice over the public internet, hits a third-party API with its own latency budget, and may itself wait on inventory checks. A successful booking takes 3 to 8 seconds. A failed one might involve a 30-second timeout from the hotel API and a manual retry. The user might pause halfway to confirm with a partner over WhatsApp. The whole flow can take minutes.
If you tried to wrap this in 2PC, every participant would hold locks for that entire duration. The flight inventory row would be locked for minutes. The hotel inventory would be locked for minutes. Other users trying to book the same flight or hotel would queue behind your transaction. Throughput would collapse. And if your coordinator happened to crash in the middle, every prepared participant would block until you brought it back — exactly the failure mode chapter 110 dissected.
The industry's answer is to abandon global atomicity and replace it with a different correctness story: each step commits locally as soon as it can, and if a later step fails, the system runs explicit compensating actions to undo the earlier successful steps. This is the saga pattern, first described by Hector Garcia-Molina and Kenneth Salem in 1987 in a paper on long-lived database transactions, and resurrected wholesale in the 2010s as the canonical way to coordinate workflows across microservices.
The model
A saga is a sequence of n local transactions:

T_1, T_2, \ldots, T_n
Each T_i runs on a single service or shard, commits with its own local atomicity guarantees (its own WAL, its own ACID), and produces a result the next step depends on. There is no global transaction holding any of them together — once T_i commits, its effects are visible to everyone immediately.
Each T_i (except T_n — see below) is associated with a compensating transaction C_i. The contract on C_i is precise:
C_i semantically undoes the business effect of T_i, and is safe to invoke at any time after T_i has committed.
If every T_i succeeds, the saga commits — the workflow is done, no compensation runs. If some T_k fails, the saga runs

C_{k-1}, C_{k-2}, \ldots, C_1

in reverse order, undoing each successful predecessor in turn. The saga as a whole eventually either fully commits or fully compensates — never half-done.
The thing to internalise is that the saga has no global lock and no global rollback. The only way to undo T_2 after it has committed is to run C_2 as a brand-new transaction. Why this is the whole point: the trade is atomicity-now for availability-and-throughput. By giving up the global lock, the saga lets every service serve other customers during the workflow, lets the workflow span minutes or hours without holding state hostage, and tolerates any participant crashing and recovering without leaving anyone "in doubt". The cost paid is that an external observer can see intermediate states the saga has not yet finished — and that compensation is a real business action, not a ROLLBACK.
Compensations are semantic, not byte-level
The textbook example everyone reaches for is "cancel flight". It is also the example that most clearly illustrates the difference between database rollback and saga compensation.
If T_1 is "INSERT a row into IndiGo's bookings table reserving seat 14C on flight 6E-203", a database ROLLBACK of T_1 would simply remove the row — as if it had never happened. That is byte-level undo, available only inside the same uncommitted transaction.
The saga compensation C_1 runs after T_1 has committed and its effects are externally visible. By that point:
- The seat-availability counter has been decremented on the airline's inventory system.
- The user has been emailed a "booking confirmed" message.
- A row has been written to the audit log.
- A reservation fee may already have been transferred to the airline's bank account.
C_1 cannot un-send the email. It cannot un-write the audit log row. What it can do is run a new business action that produces the equivalent of "as if the booking never happened" from the customer's economic perspective: refund the fee, increment the seat-availability counter, mark the booking as CANCELLED, and send a "your booking has been cancelled" email. The audit log records both the original booking and the cancellation; the accounting system sees a payment and an offsetting refund.
This is what is meant by semantic compensation. Why this distinction matters operationally: every compensation is a real transaction, with its own failure modes, its own latency, and its own cost. The refund step talks to a payments processor that can itself be down. The cancellation email costs you a real outbound notification. You cannot reason about a saga as "atomic transaction with a fancy failure path" — you have to reason about it as a workflow where every undo is itself work.
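As a concrete sketch of the shape of such a compensation (the `payments`, `inventory`, and `mailer` clients and the dict fields here are hypothetical stand-ins, not a real airline API):

```python
def cancel_flight(booking: dict, payments, inventory, mailer) -> dict:
    """Semantic compensation for a committed flight booking.

    Every line is a new business action, not a rollback: the original
    booking row, the audit trail, and the confirmation email all remain.
    """
    payments.refund(booking["pnr"], booking["amount"])       # offsetting payment
    inventory.release_seat(booking["flight_no"], booking["seat"])
    booking["status"] = "CANCELLED"                          # row kept for audit
    mailer.send(booking["email"], "your booking has been cancelled")
    return booking
```

Note that each of these four calls can itself fail, which is exactly why the chapter treats compensation as real work with its own failure modes.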
A useful classification of compensations from Caitie McCaffrey's "Distributed Sagas" talk:
- Perfect compensation: C_i \circ T_i leaves the world indistinguishable from "T_i never ran". This is rare. Even cancel-flight leaves an audit trail.
- Semantic compensation: the customer is economically whole, but the world has a record of both events. This is the common case.
- Apologetic compensation: the action is irrevocable (already shipped the package, already sent the SMS), and the compensation is a customer-service action — refund plus apology email, possibly a discount voucher. Build sagas knowing some compensations are this.
Two coordination styles
Once you have the forward-and-compensate model, there are exactly two ways to wire it up across services. Both ship in production at scale.
Choreographed sagas
Each service in the chain emits a domain event when its local transaction commits, and each subsequent service subscribes to the events that should trigger its own step. There is no central authority. The "saga" exists only as the emergent behaviour of who-publishes-what and who-subscribes-to-what.
Concretely, on a Kafka or RabbitMQ-style bus:
- Booking service receives the user's request and writes a BookingRequested event.
- Flight service consumes BookingRequested, books the flight via the IndiGo API, commits locally, and emits FlightBooked.
- Hotel service consumes FlightBooked, books the room via the Taj API, commits locally, and emits HotelBooked.
- Taxi service consumes HotelBooked, books the taxi via the Ola API, commits locally, and emits BookingComplete.
For compensation, the same pattern runs in reverse with failure events. If the hotel service cannot book a room, it emits HotelBookingFailed; the flight service subscribes to that event and runs cancel-flight as C_1.
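The wiring can be sketched with a toy in-process event bus standing in for Kafka. The event names follow the chapter; the bus itself is a deliberately simplified, synchronous stand-in:

```python
from collections import defaultdict

class EventBus:
    """Toy in-process stand-in for Kafka/RabbitMQ: synchronous pub/sub."""
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)
    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
trace = []

def flight_service(evt):       # forward step T_1
    trace.append("flight booked")
    bus.publish("FlightBooked", evt)

def hotel_service(evt):        # T_2 fails: emit the failure event instead
    trace.append("hotel sold out")
    bus.publish("HotelBookingFailed", evt)

def flight_compensation(evt):  # C_1, triggered by the failure event
    trace.append("flight cancelled")

bus.subscribe("BookingRequested", flight_service)
bus.subscribe("FlightBooked", hotel_service)
bus.subscribe("HotelBookingFailed", flight_compensation)

bus.publish("BookingRequested", {"user": "u1"})
```

Notice that the "saga" appears nowhere in this code: it exists only in the subscription graph, which is precisely the weakness discussed next.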
Strengths. No central component to run, scale, or recover. Each service is independently deployable. Aligns naturally with event-driven microservice architectures.
Weaknesses. The workflow is invisible — to understand "what happens when a user books a holiday", you have to trace event subscriptions across every service's code. As the saga grows (n > 4 or 5 steps), this becomes spaghetti. There is no single place to ask "what step is saga #12345 currently on?". Cycles are easy to create accidentally if a compensation event matches the trigger of another step. Observability is poor; you usually end up bolting on a saga correlation ID and a tracing system to reconstruct the workflow after the fact.
Why choreography degenerates: the workflow is encoded between services, in the subscription graph, rather than in a single artefact. There is nothing to read top-to-bottom that says "this is the holiday booking saga". Adding a new step means changing two services' code (the emitter and the listener) and hoping the rest of the chain still works. Most teams that start with choreographed sagas migrate to orchestration once the saga has more than three or four steps.
Orchestrated sagas
A central orchestrator holds the workflow definition explicitly: a state machine, a Python function, a BPMN diagram, whatever the orchestrator's DSL allows. The orchestrator calls each service in turn, waits for the response, and decides the next step (or the compensation chain) based on the result. Crucially, the orchestrator persists its own state durably so that if the orchestrator process crashes mid-workflow, recovery picks up exactly where it left off — same workflow ID, same step, same in-flight call.
Modern orchestrators include Temporal (and its predecessor Cadence, originally built at Uber and now used by Coinbase, Snap, and others), AWS Step Functions, Netflix Conductor, and the BPMN-driven Camunda. They all share the same essential trick: the orchestrator's state is a durable replicated log; user code is written as ordinary functions (or workflows), but every external call is checkpointed to the log so it can be replayed deterministically on restart.
Strengths. The workflow is explicit and readable — one file, one diagram, one state machine. Observability comes for free (the orchestrator has a UI showing every saga's current step). Adding a step is editing the workflow definition, not re-wiring subscriptions. Failure handling, retries, timeouts, and compensation chains are first-class concepts in the DSL.
Weaknesses. One more component to operate, scale, and depend on. The orchestrator becomes a critical-path service — if Temporal is down, no saga makes progress. The workflow logic now lives in the orchestrator, which can pull business logic out of the services it calls (sometimes good, sometimes a god-object anti-pattern).
The choice between styles is mostly empirical. Choreography is fine for two- or three-step workflows that are stable and rarely changed. Orchestration becomes the right answer past about four steps, especially if the workflow is reviewed regularly by product or operations.
A tiny working Python Saga
Here is a stripped-down orchestrated Saga in around forty lines. It registers steps as (forward, compensate) pairs, runs them in order, and on failure walks back through the successful predecessors invoking their compensations.
# saga.py
from dataclasses import dataclass, field
from typing import Callable, Any
import logging

log = logging.getLogger("saga")


@dataclass
class Step:
    name: str
    forward: Callable[[dict], Any]
    compensate: Callable[[dict], Any] | None = None


@dataclass
class Saga:
    name: str
    steps: list[Step] = field(default_factory=list)

    def step(self, name, forward, compensate=None):
        self.steps.append(Step(name, forward, compensate))
        return self

    def run(self, ctx: dict) -> dict:
        completed: list[Step] = []
        try:
            for s in self.steps:
                log.info("FORWARD %s", s.name)
                result = s.forward(ctx)
                ctx[s.name] = result
                completed.append(s)
            return ctx
        except Exception as e:
            log.error("FAIL at %s: %s — compensating", s.name, e)
            self._compensate(ctx, completed)
            raise

    def _compensate(self, ctx: dict, completed: list[Step]):
        for s in reversed(completed):
            if s.compensate is None:
                continue
            for attempt in range(3):
                try:
                    log.info("COMPENSATE %s (attempt %d)", s.name, attempt + 1)
                    s.compensate(ctx)
                    break
                except Exception as ce:
                    log.warning("compensation %s failed: %s", s.name, ce)
            else:
                log.error("COMPENSATION EXHAUSTED for %s — needs manual intervention", s.name)
                _enqueue_for_human_review(self.name, s.name, ctx)


def _enqueue_for_human_review(saga, step, ctx):
    # Real implementation: write to a "needs operator" queue or PagerDuty
    pass
A few things this code gets right and several things it does not.
Right. Forward steps run in registration order. On any exception during a forward step, the framework walks completed in reverse and invokes each step's compensation. Compensations themselves are retried (three attempts) and, if they exhaust retries, the saga is escalated to a human-review queue rather than silently lost. The context dict carries data forward (so the hotel step can use the flight_id from the flight step, and the cancel-flight compensation can use the same flight_id).
Wrong, or at least missing. This is a single-process saga — if the orchestrating process dies mid-run, the saga is lost. A real saga framework like Temporal persists every step transition to a durable log so the workflow can resume on a different worker after crash. This implementation also assumes compensation can be called immediately; production sagas often need to wait for the original transaction's effects to settle (the refund cannot be issued until the charge has cleared on the payment processor). And it conflates retries with idempotency — three retries of a non-idempotent compensation is three duplicate refunds.
Why durability is the entire game in real saga frameworks: the saga's state — "which steps have completed, which are pending, which compensation is in flight" — must survive the orchestrator process crashing. If the orchestrator forgets after a crash that step 2 has committed, the recovered saga either re-runs step 2 (duplicate booking) or skips compensation (ghost booking). Temporal solves this by writing every event in the workflow's life to a durable log replicated by Cassandra; the workflow function is replayed deterministically on recovery, fast-forwarding through the events it has already seen. The user writes ordinary Python or Go; the framework provides crash-resumability through this replay trick.
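A minimal sketch of the journalling idea, with a local file standing in for the replicated log (the file format and function names here are invented for illustration):

```python
import json
import os

def run_with_journal(steps, journal_path, ctx):
    """Crash-resumable forward pass (sketch): each completed step is appended
    to a journal before the next step runs, so a restarted orchestrator skips
    work that already committed. `steps` is a list of (name, forward) pairs;
    a real framework also journals step results and compensation progress,
    and replicates the log instead of using a local file.
    """
    done = set()
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = {json.loads(line)["step"] for line in f}
    with open(journal_path, "a") as journal:
        for name, forward in steps:
            if name in done:
                continue                      # committed before the crash
            ctx[name] = forward(ctx)
            journal.write(json.dumps({"step": name}) + "\n")
            journal.flush()                   # durability point
    return ctx
```

Running this, crashing it between steps, and running it again executes each forward step exactly once — the property that prevents both the duplicate booking and the ghost booking.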
Worked example — the Goa holiday
The holiday booking, traced through a hotel failure
A user on Make My Trip wants to book: an IndiGo flight DEL→GOI, two nights at Taj Vivanta, and an Ola airport pickup. The orchestrator wires up the saga:
saga = Saga("book-goa-holiday")

saga.step("book_flight",
          forward=lambda c: indigo.book(c["flight_no"], c["pax"]),
          compensate=lambda c: indigo.cancel(c["book_flight"]["pnr"]))
saga.step("book_hotel",
          forward=lambda c: taj.reserve(c["hotel"], c["nights"]),
          compensate=lambda c: taj.release(c["book_hotel"]["res_id"]))
saga.step("book_taxi",
          forward=lambda c: ola.book_pickup(c["airport"], c["arrival"]),
          compensate=lambda c: ola.cancel(c["book_taxi"]["trip_id"]))

ctx = {"flight_no": "6E-203", "pax": 1, "hotel": "taj-goa",
       "nights": 2, "airport": "GOI", "arrival": "2026-05-12T15:30"}
saga.run(ctx)
Step-by-step trace:
t=0. Saga starts with ctx containing the user's selections. completed = [].
t=1. book_flight forward runs. The IndiGo API returns {"pnr": "ABC123", "amount": 8400}. The local transaction on IndiGo's booking database commits — seat 14C is now reserved, the user's card has been charged ₹8,400, and a booking-confirmation email is queued. ctx["book_flight"] = {"pnr": "ABC123", ...}. completed = [book_flight].
t=2. book_hotel forward runs. The Taj API replies HTTP 409 Conflict — "no availability for the requested dates". The forward function raises BookingFailed("hotel sold out").
t=3. The exception propagates into the saga's exception handler. The orchestrator logs FAIL at book_hotel: hotel sold out — compensating.
t=4. The compensation chain begins. _compensate walks completed in reverse — only one entry, book_flight. The orchestrator calls indigo.cancel(ctx["book_flight"]["pnr"]), which is indigo.cancel("ABC123").
t=5. The IndiGo cancellation API runs the real business cancellation: it marks PNR ABC123 as CANCELLED in the bookings table, increments the seat-availability counter for flight 6E-203 (releasing seat 14C), initiates a ₹8,400 refund through the payments processor (which itself may take 5–7 working days to clear on the user's card), and queues a "your booking has been cancelled" email. The compensation succeeds; the orchestrator records the compensation in its log and moves on.
t=6. No more entries in completed. The compensation chain is done. The orchestrator marks the saga as COMPENSATED and re-raises the original BookingFailed exception to the caller. The user sees a "we couldn't complete your booking — your flight has been refunded" message.
t=7 (T_3 was never reached). No taxi was booked, no book_taxi compensation needs to run, and no Ola API call is made.
The customer's economic state at the end: same as before they tried to book. The world's state: an audit trail shows a flight was booked and then cancelled, the airline's books show a charge and an offsetting refund, and an analytics pipeline somewhere increments a "booking-failed-at-hotel-step" metric that the product team will look at next morning.
Now consider a more painful variant. Suppose at t=5, the IndiGo cancel API itself fails — perhaps IndiGo's payment processor is down, perhaps the network call times out. The orchestrator retries the compensation three times (the for attempt in range(3) loop in the Saga class). All three fail. The orchestrator now invokes _enqueue_for_human_review, which writes a record to an operator queue: "Saga book-goa-holiday step book_flight needs manual compensation; PNR ABC123 should be cancelled". A human at Make My Trip's operations desk sees this in their queue the next morning, calls IndiGo's airline-relations contact, and runs the cancellation manually. The saga itself remains in a COMPENSATING state until the operator marks the compensation as resolved. This human-in-the-loop fallback is not optional in real saga deployments — it is the only way to handle the long tail of compensation failures that automation cannot cover.
Saga vs 2PC — the tradeoff in one paragraph
Two-phase commit gives you strict atomicity: every participant's state moves from "old" to "new" together, or nothing moves. The price is that participants hold locks from PREPARE to COMMIT, and the protocol blocks if the coordinator dies between phases. This is the right protocol when (a) the transaction's duration is short enough that holding locks is cheap, (b) the participants are inside a single trust domain (your own database shards, not third-party APIs), and (c) the system can tolerate a manual recovery for the rare blocking case. Bank transfers across shards of the same database fit this profile.
The saga gives you eventual consistency with explicit compensation: each step commits independently and immediately; failures are repaired by running compensating actions. The price is that intermediate states are externally observable, compensations are real business actions with their own failure modes, and the system has no atomic "all-or-nothing" instant — only a "will-eventually-be-all-or-nothing" trajectory. This is the right model when (a) the workflow is long-lived (seconds to days), (b) the participants are heterogeneous services with their own ownership and SLAs, and (c) the business has well-defined compensation actions for each step. Holiday bookings, employee onboarding, refund processing, multi-step purchase flows — all sagas.
The two are not in competition. A production system uses 2PC where it can (inside a single sharded database) and sagas where it must (across service boundaries and time horizons that exceed lock-holding budgets).
Real systems that ship saga orchestration
Temporal (and its predecessor Cadence, both originally from Uber's engineering team) is the most polished open-source saga orchestrator. Workflows are written as ordinary Go, Java, Python, or TypeScript functions; the framework checkpoints every external call ("activity") to a durable event-sourced log so workflows survive worker crashes. Coinbase, Snap, Stripe, and HashiCorp run mission-critical workflows on Temporal. Compensation is just "if an activity fails, call the compensation activity from the workflow function".
AWS Step Functions is the managed equivalent: workflows are defined in Amazon States Language (a JSON DSL), executed by AWS, and integrated with Lambda, ECS, and the AWS service catalogue. Step Functions offers a Catch clause on each state that maps naturally onto compensation handlers.
Netflix Conductor is a JSON-driven orchestrator originally built to run video-encoding pipelines that span dozens of internal Netflix services. Workflows are defined as DAGs in JSON; Conductor schedules tasks to a worker pool. Conductor is the saga workhorse behind much of Netflix's content-platform infrastructure.
Camunda is a BPMN-2.0–based workflow engine widely used in European financial services and insurance. Workflows are drawn graphically (BPMN diagrams), and the engine executes them. Compensation is a first-class BPMN concept (the "compensation event").
Apache Airflow is sometimes pressed into service as a saga orchestrator for batch workflows where the steps are hours apart — overnight ETL pipelines that need compensation if a downstream step fails. It is not really designed for low-latency online sagas, but the model is the same.
For the choreography style, the toolkit is whatever event bus you are running — Kafka, RabbitMQ, AWS EventBridge, Google Pub/Sub. The "saga framework" is just discipline: every service that participates in the workflow knows the events to emit on success and on failure, and the events to consume to trigger its forward step and its compensation.
Going deeper
Saga isolation, semantic locks, and the original Garcia-Molina paper
The original 1987 paper by Garcia-Molina and Salem introduced sagas in the context of long-lived database transactions — a single application running a long workflow inside one DBMS, where holding 2PL locks for the workflow's duration would starve every other transaction.
The paper's contribution was twofold: it formalised the forward + compensating-action structure, and it analysed what happens to isolation when you give up atomicity.
A saga's intermediate states are visible to other transactions. This means saga S_1 can read state that saga S_2 has written but not yet either committed or compensated. If S_2 later compensates, S_1 has read state that "never existed" — a classic anomaly the paper called a cascading abort in the saga setting. The paper's solution was semantic locks: each step takes business-level locks (not row locks) that prevent other sagas from observing the intermediate state. A "pending" booking is locked from being seen by report queries until the saga either commits or compensates.
Modern saga frameworks rarely implement semantic locks explicitly. Instead they rely on idempotency at the application layer (every operation is keyed by a saga ID so duplicates collapse) and business-level acceptance of the visibility window (the customer briefly sees "booking pending" instead of being shielded from it). This is a quiet weakening of Garcia-Molina's original model in the name of operational simplicity. The cost is that real production sagas have a class of bug — concurrent sagas observing each other mid-flight — that the original paper's semantic locking would have prevented.
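One lightweight approximation of a semantic lock that teams do sometimes implement is a status column that read paths filter on, so that reports and other sagas only ever see settled state (hypothetical schema):

```python
# Hypothetical schema: each booking row carries a status column.
bookings = [
    {"pnr": "ABC123", "status": "CONFIRMED"},
    {"pnr": "XYZ789", "status": "PENDING"},    # a saga is still in flight
    {"pnr": "QRS456", "status": "CANCELLED"},  # booked, then compensated
]

def visible_bookings(rows):
    """Report query: only settled, non-compensated rows are visible,
    so no reader observes state that may later be compensated away."""
    return [r["pnr"] for r in rows if r["status"] == "CONFIRMED"]
```

The PENDING row is the saga's intermediate state; filtering it out of read paths is the business-level lock, applied at query time rather than by the storage engine.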
Idempotency is the load-bearing property
Every forward step and every compensation in a real saga must be idempotent — calling it twice with the same arguments must have the same effect as calling it once. Without this, retries duplicate bookings, retries duplicate refunds, retries duplicate emails. The standard pattern is to attach a globally unique idempotency key (typically the saga ID plus the step name) to every external call; the receiving service deduplicates by checking whether it has already processed that key.
Stripe's API, for example, accepts an Idempotency-Key header on every request and guarantees that a given key produces exactly one charge no matter how many times the same request is retried. Every payment-processing saga uses this. Designing your own internal services to expose the same property is the unglamorous but necessary work of running sagas.
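A sketch of the receiving side of the idempotency-key pattern (in-memory storage here for illustration; a real service would back this with a database table carrying a unique constraint on the key):

```python
class IdempotentRefunds:
    """Deduplicate by idempotency key: a retried call returns the cached
    result instead of issuing a second refund."""
    def __init__(self):
        self._processed = {}     # idempotency_key -> cached result
        self.refunds_issued = 0  # the real side effect, counted for clarity

    def refund(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]   # duplicate: no-op replay
        self.refunds_issued += 1
        result = {"refunded": amount, "key": idempotency_key}
        self._processed[idempotency_key] = result
        return result
```

The key is typically the saga ID plus the step name (e.g. "book-goa-holiday-12345:book_flight"), so the three retries in the Saga class's compensation loop collapse to a single refund.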
Sagas across years — workflow engines and durable execution
The longest sagas in production are not minutes long but years long. An employee-onboarding workflow at a large company might span the time from "offer accepted" to "first day starts" (weeks), with compensation actions for "candidate withdraws". A KYC re-verification workflow at a bank might run for months waiting on documents. Temporal explicitly supports workflows that can be paused for arbitrary durations and resumed, with the workflow's full state preserved across the entire interval. The architectural insight is that the saga's state is a first-class persistent object, not an in-flight RPC. Once you accept that, the upper bound on saga duration is "as long as the durable store retains the workflow's history".
References
- Garcia-Molina and Salem, Sagas, ACM SIGMOD 1987 — the original paper that introduced the saga model for long-lived database transactions, including the formal treatment of forward + compensating actions and the analysis of saga isolation.
- Richardson, Pattern: Saga, microservices.io — the canonical modern reference on choreographed vs orchestrated sagas in a microservices context, with sequence diagrams and pattern critique.
- McCaffrey, Distributed Sagas: A Protocol for Coordinating Microservices, talk at YOW! 2017 — the practitioner's view from running sagas at Halo's matchmaking service, including the classification of compensation styles (perfect, semantic, apologetic) used in this chapter.
- Richardson, Microservices Patterns, Manning 2018 — Chapter 4 develops the saga pattern in detail with worked examples in Java, including the orchestrator vs choreography tradeoff and the implementation of compensating transactions.
- Temporal, Workflow Saga Pattern — the canonical implementation guide for sagas in Temporal, showing how compensation is implemented as a try/finally in the workflow function with the framework's durable-execution guarantees.
- Helland, Life Beyond Distributed Transactions, ACM Queue 2016 — Pat Helland's argument that distributed transactions in the 2PC sense are the wrong abstraction at internet scale, and that sagas (under the names "activities" and "compensations") are the right one. The intellectual companion to the saga pattern.