Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
The saga pattern revisited in workflows
It is a Saturday at PaySetu and the on-call SRE, Aarav, is staring at a Slack message from a senior engineer on the merchant onboarding team: "we have 11 merchants whose KYC was approved, whose payout account was created at the partner bank, whose welcome email went out — and whose row in our merchants table never got inserted because Postgres flapped at exactly the wrong moment." The KYC approval cannot just be retried; the partner bank has already created the account; the welcome email already links to a merchant_id that does not exist. Three downstream services believe a merchant exists; the source-of-truth database does not. Cleaning this up by hand will take the team most of Sunday.
This is the failure mode the saga pattern was invented to prevent. Read literally — "a long-lived transaction expressed as a sequence of local transactions, each with a paired compensating transaction" — the saga pattern is forty years old; Garcia-Molina and Salem published the original paper in 1987, decades before the word "microservice" existed. What is new is that workflow engines (Temporal, AWS Step Functions, Cadence, DBOS) make sagas a first-class language construct rather than a discipline. The engine remembers which steps succeeded, the engine pairs each success with its compensation, and the engine unwinds the saga when a later step fails. You write the forward path and the rollback path; the engine does the bookkeeping.
A saga is a sequence of forward steps with paired compensating steps; if step k fails, the engine runs the compensations for steps k-1 down to 1 in reverse order. In a workflow engine, the compensation stack is maintained automatically — every successful activity registers its rollback, and a workflow-level failure handler unwinds them in LIFO order. The win is not the pattern itself; it is that the engine durably tracks which compensations are pending across worker crashes, retries, and operator intervention.
What a saga actually is — the LIFO stack on top of activities
A saga has the shape of a stack-based undo. You push a compensation when a forward step succeeds; you pop and execute compensations when a later step fails. In ordinary code that pattern looks like this:
    compensations = []
    try:
        a_id = approve_kyc(merchant)
        compensations.append(lambda: revoke_kyc(a_id))
        acct = create_partner_account(merchant)
        compensations.append(lambda: close_partner_account(acct.id))
        send_welcome_email(merchant.email, acct.id)
        compensations.append(lambda: send_account_void_notice(merchant.email, acct.id))
        insert_merchant_row(merchant, acct.id)
        # success: no compensations run
    except Exception:
        for c in reversed(compensations):
            c()
        raise
Five lines of bookkeeping carry the entire idea. Each successful forward step appends its compensation; on failure, the for c in reversed(compensations) loop runs them in LIFO order. The pattern is correct, the code is readable, and it is a lie, because the moment the process running it crashes, the compensations list evaporates. A freshly started process cannot find out which forward steps succeeded, which compensations are pending, or where in the stack to resume. The saga becomes incoherent the instant durability is required, and durability is exactly what production demands.
A workflow engine fixes this by promoting the compensations list out of process memory and into the workflow history. Every time a forward activity completes, the engine records that completion in the durable history, which amounts to recording "compensation X is now pending". Every time a compensation runs, the engine records "compensation X was executed". If the worker crashes mid-saga, a replacement worker reads the history and reconstructs the exact compensation stack. Why this matters: the failure mode at the start of this article (three services believing a merchant exists, the source of truth disagreeing) happens because the bookkeeping was in process memory. When the database write failed, the process raised; when the process raised, the compensation list was also lost; when the process restarted, no one knew what to roll back. The workflow engine eliminates that entire failure mode by making the bookkeeping itself durable.
The shape resembles defer in Go or try/finally in Python, where the language gives you a primitive for "run this cleanup when the surrounding scope exits". The difference is that defer and finally run on every exit, while a saga's compensations run only on abnormal exit. The workflow engine gives you that primitive with the workflow as the scope, any activity failure that exhausts its retry policy as the abnormal exit, and a durable activity, retried under its own retry policy, as the cleanup. (See /wiki/retries-as-a-first-class-concept for the retry side; this chapter is the rollback side.)
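To make the "replacement worker reads the history" step concrete, here is a minimal sketch of rebuilding the pending compensation stack from a recorded event list. The event shapes and the function name pending_compensations are assumptions made for illustration; they are not the history format of Temporal, Step Functions, or any other engine.

```python
from typing import List, Tuple

# Hypothetical event shapes (illustrative only, not a real engine's format):
#   ("forward-ok", step_name)     a forward step completed
#   ("compensate-ok", step_name)  that step's compensation completed
Event = Tuple[str, str]


def pending_compensations(history: List[Event]) -> List[str]:
    """Return the steps that still need compensating, in LIFO run order."""
    stack: List[str] = []
    for kind, name in history:
        if kind == "forward-ok":
            stack.append(name)
        elif kind == "compensate-ok" and name in stack:
            stack.remove(name)
    return list(reversed(stack))


# History as a crashed worker left it: three forward steps had succeeded
# and the unwind had already compensated the most recent one.
history = [
    ("forward-ok", "approve_kyc"),
    ("forward-ok", "create_partner_account"),
    ("forward-ok", "send_welcome_email"),
    ("compensate-ok", "send_welcome_email"),
]

print(pending_compensations(history))
# prints ['create_partner_account', 'approve_kyc']
```

Because the stack is derived purely from the recorded events, any worker that can read the history arrives at the same answer, which is the property the in-memory list lacks.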
Compensations are not inverses — they are business reversals
The single biggest misconception about sagas is that the compensation for approve_kyc is "un-approve KYC" — i.e. that compensations are mathematical inverses of forward operations. They are not. The KYC approval may have triggered a credit check, a sanctions screening, a partner-side onboarding event, and an audit log entry. You cannot un-do those, and you usually do not want to. The compensation is whatever the business says reversal looks like: in this case, marking the merchant as kyc_revoked (a new state, not the absence of kyc_approved), keeping the audit log, and notifying the partner bank that the previously-approved KYC is now revoked. The compensation produces a forward effect that semantically negates the original; it does not pretend the original never happened.
This is why sagas are described as semantic rollbacks rather than transactional ones. A traditional database transaction can roll back atomically because the database holds all the state. A saga's state is spread across many services, and rolling back means informing each service of the reversal in its own model. A few common patterns:
- Compensating an idempotent insert — mark the row voided rather than deleting it. The audit history is preserved; downstream consumers that queried the row before voidance still get a consistent answer.
- Compensating a partner API call — call the partner's own reversal API. Most regulated partners (banks, payment networks, KYC providers) have one. Use it. Do not try to fake it by, e.g., overwriting their state.
- Compensating a notification — send a correction notification, not a deletion. The user already saw the original; you cannot un-send it. A "we made a mistake — please ignore the previous email" follow-up is the compensation.
- Compensating a money movement — initiate the reverse transfer through the same rails. Never net the two off internally; the regulator wants both legs visible. (PaisaCard learned this the hard way during a 2024 audit when a netted transfer reversal failed an RBI inspection.)
- Compensating a side effect that has no reversal — sometimes the answer is "log this as an inconsistency and page a human". A workflow that books a hotel night and then the saga fails after the hotel has irreversibly committed the room cannot un-book it; the compensation is a credit to the customer plus an internal ticket.
Why this matters: junior engineers often write compensations that try to delete rows, un-send messages, or otherwise pretend the forward step never happened. This produces inconsistent state in downstream systems that already saw the forward step. The right mental model: a compensation is another forward step that the business has classified as "the reversal action for X".
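A small sketch of "compensation as another forward step", using the kyc_revoked state described above. The MerchantKyc record and its field names are hypothetical; the point is only that the compensation adds a state transition and an audit entry instead of deleting anything.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MerchantKyc:
    """Hypothetical KYC record; field names are illustrative."""
    merchant_id: str
    status: str = "kyc_pending"
    audit_log: List[str] = field(default_factory=list)


def approve_kyc(rec: MerchantKyc) -> None:
    rec.status = "kyc_approved"
    rec.audit_log.append("kyc_approved")


def revoke_kyc(rec: MerchantKyc, reason: str) -> None:
    """Compensation: a new forward transition, never a delete."""
    if rec.status == "kyc_revoked":
        return  # already compensated, so a retry changes nothing
    rec.status = "kyc_revoked"
    rec.audit_log.append(f"kyc_revoked: {reason}")


rec = MerchantKyc("M-7731")
approve_kyc(rec)
revoke_kyc(rec, "merchant row insert failed")
revoke_kyc(rec, "merchant row insert failed")  # retried compensation is a no-op

print(rec.status)     # prints kyc_revoked
print(rec.audit_log)  # prints ['kyc_approved', 'kyc_revoked: merchant row insert failed']
```

The approval stays in the audit log after the reversal, so a downstream system that saw kyc_approved earlier can still explain what it saw.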
The compensation order — strictly LIFO, with one important caveat
Compensations run in reverse order of the forward steps that succeeded. If steps 1, 2, 3 succeeded and step 4 failed, the compensations run for 3, then 2, then 1. The reasoning is the same as defer in Go: later steps may depend on earlier ones, so undoing them in LIFO order respects those dependencies.
The caveat is independent compensations. If two forward steps were independent (neither depends on the other), their compensations can run in parallel. Workflow engines typically default to strict LIFO and let you opt into parallel compensation only when you assert independence. For the merchant-onboarding saga above, revoke_kyc and send_account_void_notice could be parallelised in principle, but the engine will run them sequentially unless you explicitly say otherwise. Defaults exist for a reason: a wrongly-parallel compensation can produce a final state that depends on the outcome of a race, which is exactly the bug sagas are trying to eliminate.
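One way to picture opt-in parallel compensation is a stack of groups: groups unwind in LIFO order, and the members of a group run concurrently. This is a sketch under assumptions, not any engine's API. In particular, which compensations share a group here is chosen purely for illustration; in a real saga that grouping is a business assertion someone has to stand behind.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

log: List[str] = []


def comp(name: str) -> Callable[[], None]:
    """Build a stand-in compensation that only records that it ran."""
    def run() -> None:
        log.append(name)
    return run


# Each inner list is a group whose members are ASSUMED independent.
# Groups are listed in forward order and unwound in reverse.
groups: List[List[Callable[[], None]]] = [
    [comp("revoke_kyc")],
    [comp("close_partner_account"), comp("send_account_void_notice")],
]


def unwind(groups: List[List[Callable[[], None]]]) -> None:
    with ThreadPoolExecutor() as pool:
        for group in reversed(groups):
            futures = [pool.submit(c) for c in group]
            for f in futures:
                f.result()  # wait for the whole group; re-raises any failure


unwind(groups)
print(log[-1])  # prints revoke_kyc
```

The order of the first two log entries is not deterministic, which is the point: only assert independence when either order leaves the same final state.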
A second subtlety: a compensation can itself fail. If close_partner_account returns a 503, the engine should retry it (per its own retry policy), not abandon the compensation. Compensations get the same first-class retry treatment as forward activities — same RetryPolicy, same history, same idempotency contract. Why this is non-obvious: the natural intuition is "we are already in failure mode, just give up". This is wrong. A compensation that fails leaves the saga in a worse state than no rollback at all — half the steps undone, half not. The engine's job is to drive every compensation to terminal success, even if that takes hours of retries. PaySetu's compensations have max_attempts=0 (retry forever) for exactly this reason; see /wiki/retries-as-a-first-class-concept.
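The "drive every compensation to terminal success" rule can be sketched as a retry loop with capped exponential backoff. The names run_until_success and TransientError, and the delay values, are assumptions for illustration; a real engine persists the attempt count and schedules the retries itself instead of looping in process.

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 503."""


def run_until_success(
    comp: Callable[[], None],
    base_delay: float = 0.5,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry comp until it succeeds; return the number of attempts used."""
    attempt = 0
    while True:
        attempt += 1
        try:
            comp()
            return attempt
        except TransientError:
            # Exponential backoff, capped so retries never stall for too long
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


calls = {"n": 0}


def close_partner_account() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 from partner bank")


delays: List[float] = []
# Inject a fake sleep so the example runs instantly and records the waits
attempts = run_until_success(close_partner_account, sleep=delays.append)
print(attempts, delays)  # prints 3 [0.5, 1.0]
```

Only TransientError is retried here; any other exception escapes the loop, which matches the idea that non-retryable failures need a different path (usually a human).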
A toy saga engine — runnable
The 60-line version below shows the compensation-stack pattern. It does not persist to disk (real engines do); it does not handle compensation retries (real engines do); but it captures the mechanism: forward steps push compensations, failures pop them in LIFO order.
    import random
    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    class SagaCompensated(Exception):
        """Raised after a failed forward step has been fully unwound."""

    @dataclass
    class Saga:
        history: List[Tuple] = field(default_factory=list)
        compensations: List[Tuple[str, Callable]] = field(default_factory=list)

        def step(self, name: str, forward: Callable, compensate: Callable):
            self.history.append(("forward-start", name))
            try:
                result = forward()
            except Exception as e:
                self.history.append(("forward-fail", name, repr(e)))
                self.unwind()
                raise SagaCompensated(name, repr(e)) from e
            self.history.append(("forward-ok", name, result))
            # Register the rollback only once the forward step has succeeded
            self.compensations.append((name, compensate))
            return result

        def unwind(self):
            # Pop in LIFO order: the most recent success is compensated first
            while self.compensations:
                name, comp = self.compensations.pop()
                self.history.append(("compensate-start", name))
                try:
                    comp()
                    self.history.append(("compensate-ok", name))
                except Exception as e:
                    self.history.append(("compensate-fail", name, repr(e)))
                    # real engine retries; we abort
                    raise

    # Activity bodies: no rollback awareness
    state = {"kyc": [], "accts": [], "rows": []}

    def approve_kyc(mid):
        state["kyc"].append(mid)
        return f"K-{mid}"

    def revoke_kyc(kid):
        # kid is "K-<merchant id>" and the merchant id itself contains a
        # hyphen, so split only on the first one
        state["kyc"].remove(kid.split("-", 1)[1])

    def create_acct(mid):
        a = f"A-{random.randint(1000, 9999)}"
        state["accts"].append(a)
        return a

    def close_acct(aid):
        state["accts"].remove(aid)

    def insert_row(mid, aid):
        if random.random() < 0.6:  # simulate flapping db
            raise RuntimeError("postgres connection reset")
        state["rows"].append((mid, aid))
        return True

    def delete_row(mid):
        state["rows"][:] = [r for r in state["rows"] if r[0] != mid]

    # The saga
    random.seed(7)
    mid = "M-7731"
    saga = Saga()
    try:
        # The compensation lambdas look up kid/aid only when they run,
        # which is after step() has returned and assigned them
        kid = saga.step("approve_kyc", lambda: approve_kyc(mid), lambda: revoke_kyc(kid))
        aid = saga.step("create_acct", lambda: create_acct(mid), lambda: close_acct(aid))
        saga.step("insert_row", lambda: insert_row(mid, aid), lambda: delete_row(mid))
        print("FORWARD COMPLETE")
    except SagaCompensated as sc:
        print(f"SAGA COMPENSATED at {sc.args[0]}: {sc.args[1]}")
    for ev in saga.history:
        print(ev)
    print("FINAL STATE:", state)
Sample run (with the seeded RNG making insert_row fail):
    SAGA COMPENSATED at insert_row: RuntimeError('postgres connection reset')
    ('forward-start', 'approve_kyc')
    ('forward-ok', 'approve_kyc', 'K-M-7731')
    ('forward-start', 'create_acct')
    ('forward-ok', 'create_acct', 'A-3923')
    ('forward-start', 'insert_row')
    ('forward-fail', 'insert_row', "RuntimeError('postgres connection reset')")
    ('compensate-start', 'create_acct')
    ('compensate-ok', 'create_acct')
    ('compensate-start', 'approve_kyc')
    ('compensate-ok', 'approve_kyc')
    FINAL STATE: {'kyc': [], 'accts': [], 'rows': []}
The forward path got two steps deep before the simulated database flap killed the third. The engine then unwound create_acct's compensation first, then approve_kyc's — strict LIFO. The final state matches the initial state across all three downstream services. No merchant believes they exist; no partner account is dangling; no KYC approval is left hanging.
The walkthrough, line by line:
- Saga.step(name, forward, compensate) is the engine's pairing primitive. Each call records a forward-start event, runs the forward function, and on success pushes the paired compensation onto the stack. On failure it calls unwind() and re-raises as SagaCompensated.
- Saga.unwind() pops compensations in LIFO order. Each compensation is itself a callable; in a real engine it would be a full activity with its own retry policy.
- The activity bodies (approve_kyc, create_acct, insert_row) and their compensations (revoke_kyc, close_acct, delete_row) are written as ordinary functions. The saga engine pairs them; it does not require them to know about each other.
- The history list stands in for the durable audit trail. In Temporal it is the workflow history; in Step Functions it is the execution event log; in DBOS it is rows in the Postgres system tables under the dbos schema (for example dbos.operation_outputs, which records step results).
Where this shows up in production
KapitalKite's "place options trade" workflow is a five-step saga: validate margin, lock collateral, route order to exchange, log to compliance ledger, send confirmation push. If the exchange route fails (the exchange returns a MARKET_HALT), the saga unwinds: the order is not logged, the collateral lock is released, the margin pre-validation is forgotten — but the user is sent a failure push (a forward operation, not a compensation, despite running during unwind). The compensation for the collateral lock is a release call to the in-house collateral service, idempotent on the lock id; the compensation for the margin validation is a no-op (the validation produced no durable state). The saga's correctness depends on each compensation being idempotent and re-runnable — KapitalKite's collateral service deduplicates release calls on the lock id at the database level, so retries during compensation are safe.
CricStream's "publish IPL highlight" workflow is a three-step saga: transcode the clip, push to CDN edges, update the homepage feed index. If the homepage feed index update fails, the CDN push is not compensated — instead the workflow records the inconsistency and the on-call team manually reconciles. Why this asymmetry: CDN edge purges are extremely expensive (purging a video file from 2,400 edge locations costs ₹3-4 in vendor fees and 30+ seconds of wall-clock), and the cost of leaving a stale clip on the CDN for 4 hours until the next reconcile cycle is approximately zero (no one will see it because the homepage index does not reference it). The saga's "rollback" is partial by design — a business decision encoded in the activity registration as compensable=false.
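A sketch of how a compensable=false registration might change the unwind: steps flagged as non-compensable are skipped and recorded for later reconciliation. Step, run_saga, and the event strings are hypothetical names for illustration, not the registration API of any real engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """Hypothetical step registration; compensable=False opts out of rollback."""
    name: str
    forward: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None
    compensable: bool = True


def run_saga(steps: List[Step]) -> List[str]:
    done: List[Step] = []
    events: List[str] = []
    for step in steps:
        try:
            step.forward()
            done.append(step)
        except Exception:
            events.append(f"failed:{step.name}")
            for prev in reversed(done):  # LIFO unwind
                if prev.compensable and prev.compensate is not None:
                    prev.compensate()
                    events.append(f"compensated:{prev.name}")
                else:
                    # Deliberately left in place; recorded for reconciliation
                    events.append(f"needs-reconcile:{prev.name}")
            return events
    events.append("completed")
    return events


def fail_index_update() -> None:
    raise RuntimeError("feed index unavailable")


steps = [
    Step("transcode", lambda: None, compensate=lambda: None),
    Step("push_to_cdn", lambda: None, compensable=False),
    Step("update_feed_index", fail_index_update),
]
events = run_saga(steps)
print(events)
# prints ['failed:update_feed_index', 'needs-reconcile:push_to_cdn', 'compensated:transcode']
```

Recording the skipped step explicitly is what keeps the partial rollback a decision and not an accident: the reconcile job has a list to work from.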
PaySetu has the most rigorous compensation discipline: every saga compensation must (a) be idempotent, (b) have its own retry policy with max_attempts=0 (retry forever, capped by 24-hour expiration), (c) be approved by a compliance reviewer at registration time, and (d) leave an audit row that an RBI inspector can query. The discipline is heavy, but the alternative — letting saga compensations be ad-hoc engineer code — failed an audit in 2023 and cost PaySetu its operating licence for a fortnight.
Common confusions
- "Sagas are the same as distributed transactions" — They are not. A distributed transaction (via two-phase commit) provides atomic all-or-nothing semantics across services, holding locks until every participant commits. A saga provides eventual all-or-nothing semantics through compensations, with no cross-service locks. The trade-off: sagas give up isolation, so partial intermediate states are visible (a downstream service can briefly see the state after step 2 before step 4 fails and triggers compensation), but they avoid the availability cost of cross-service locks.
- "A compensation should undo the forward step exactly" — A compensation should produce the business-meaningful reversal, which is rarely the literal inverse. See the partner API bullet under "Compensations are not inverses": the compensation is "close the account", not "rewrite history so the account was never created".
- "If compensation fails, give up and page a human" — The engine retries the compensation, the same way it retries forward steps. Paging a human is the last resort, after max_attempts is exhausted on the compensation's own retry policy.
- "Forward steps and compensations should be in the same activity" — They should be in separate activities. Separating them lets the engine retry each independently, version each independently, and assign independent timeouts. The pairing happens in the workflow definition, not in the activity code.
- "Sagas only matter for cross-service workflows" — They also matter inside a single service when forward steps have non-database side effects (sending an email, calling a webhook, writing to S3). The saga pattern is about reversing side effects, not about service boundaries.
- "You can convert any sequence of activities into a saga retroactively" — You cannot. Compensations have to be designed alongside the forward steps; bolting them on after the fact almost always produces compensations that are partial, incorrect, or non-idempotent. The system that ignores sagas at design time pays for them at incident time.
Going deeper
Saga vs. process manager — the orchestration question
The saga as described above is orchestrated: a central workflow function calls each step and tracks the compensation stack. The alternative is choreographed, where each service emits an event on completion and the next service reacts; compensations flow as reverse events. Choreographed sagas avoid a central orchestrator but make the saga's state implicit, because there is no single place that knows "we are in step 3 of saga X". Workflow engines have firmly chosen orchestration; choreography survives mainly in event-driven systems built around a message broker. (See /wiki/orchestration-vs-choreography for the trade-off in detail.)
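For contrast with the orchestrated toy engine, here is a minimal in-process sketch of a choreographed saga. The event names, the on/emit/drain helpers, and the two services are all hypothetical; a real system would put a message broker where the deque is.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List, Tuple

handlers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
queue: Deque[Tuple[str, str]] = deque()
trace: List[str] = []  # delivery order, kept only to inspect the run


def on(event: str):
    """Subscribe the decorated function to an event name."""
    def register(fn: Callable[[str], None]) -> Callable[[str], None]:
        handlers[event].append(fn)
        return fn
    return register


def emit(event: str, merchant_id: str) -> None:
    queue.append((event, merchant_id))


def drain() -> None:
    """Deliver queued events until no service has anything left to say."""
    while queue:
        event, merchant_id = queue.popleft()
        trace.append(event)
        for fn in handlers[event]:
            fn(merchant_id)


# KYC service: does its forward step, and listens for the reverse event
@on("OnboardingRequested")
def kyc_approve(merchant_id: str) -> None:
    emit("KycApproved", merchant_id)


@on("AccountCreationFailed")
def kyc_revoke(merchant_id: str) -> None:
    emit("KycRevoked", merchant_id)


# Account service: its forward step fails in this run
@on("KycApproved")
def account_create(merchant_id: str) -> None:
    emit("AccountCreationFailed", merchant_id)


emit("OnboardingRequested", "M-7731")
drain()
print(trace)
# prints ['OnboardingRequested', 'KycApproved', 'AccountCreationFailed', 'KycRevoked']
```

Notice that nothing in the code holds "the saga": the only way to learn its progress is to read the event trail, which is the implicit-state cost described above.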
Compensation idempotency — the contract that makes retry-during-rollback safe
Every compensation must be idempotent: calling it twice must have the same effect as calling it once. The reason is the same as for forward activities — the engine will retry it, and the engine cannot tell whether a previous attempt partially succeeded. Idempotency is achieved by either (a) carrying an idempotency key the downstream service deduplicates on, or (b) writing compensations that check current state and no-op if the rollback has already happened. The key-based approach is more reliable for partner APIs; the state-check approach is fine for internal services. (See /wiki/at-least-once-idempotency-in-practice.)
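The two idempotency approaches can be sketched side by side. PartnerBank, the idempotency-key format, and the lock table are hypothetical stand-ins; the sketch only shows that a second call in each style changes nothing.

```python
from typing import Dict, Set


class PartnerBank:
    """Hypothetical partner API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.open_accounts: Set[str] = {"A-1001"}
        self.seen_keys: Dict[str, str] = {}
        self.closes_applied = 0

    def close_account(self, account_id: str, idempotency_key: str) -> str:
        # (a) key-based: a repeated key returns the stored result untouched
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        self.open_accounts.discard(account_id)
        self.closes_applied += 1
        self.seen_keys[idempotency_key] = "closed"
        return "closed"


# Hypothetical internal collateral-lock table
locks: Dict[str, str] = {"L-42": "held"}


def release_collateral(lock_id: str) -> bool:
    # (b) state-check: no-op when the rollback has already happened
    if locks.get(lock_id) != "held":
        return False
    locks[lock_id] = "released"
    return True


bank = PartnerBank()
key = "saga-7731:close_partner_account"  # assumed format: saga id + step name
first = bank.close_account("A-1001", key)
second = bank.close_account("A-1001", key)  # engine retry after a lost response
print(first, second, bank.closes_applied)   # prints closed closed 1

print(release_collateral("L-42"), release_collateral("L-42"))  # prints True False
```

Deriving the key from the saga id and step name means every retry of the same compensation carries the same key without the workflow having to store one.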
The pivot transaction — when no compensation is possible past a point
Some sagas have a "pivot transaction" — a step past which compensation is impossible (the bank has irrevocably wired the money, the SMS has been sent to the user's phone, the contract has been signed). After the pivot, the saga can no longer roll back; it can only roll forward through whatever recovery actions remain. Workflow engines do not enforce pivot semantics — they will dutifully attempt to compensate any successful step — so the application has to encode the pivot explicitly: either by marking subsequent steps as compensable=false, or by having the post-pivot compensation be "page a human and stop".
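Encoding the pivot explicitly might look like the sketch below: a failure at or before the pivot index unwinds the saga, while a failure after it is retried forward and escalated if retries run out. run_with_pivot, the step names, and the attempt limit are assumptions for illustration, and retries of pre-pivot steps are omitted to keep it short.

```python
from typing import Callable, List, Tuple

# (name, forward, compensate)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_pivot(steps: List[Step], pivot: int, max_forward_attempts: int = 3) -> List[str]:
    """Run steps in order; steps[pivot] is the last step that may be rolled back."""
    events: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for i, (name, forward, compensate) in enumerate(steps):
        past_pivot = i > pivot
        attempts = max_forward_attempts if past_pivot else 1
        for attempt in range(1, attempts + 1):
            try:
                forward()
                events.append(f"ok:{name}")
                done.append((name, compensate))
                break
            except Exception:
                events.append(f"fail:{name}#{attempt}")
        else:  # every attempt failed
            if past_pivot:
                events.append(f"page-human:{name}")  # cannot roll back any more
            else:
                for prev_name, comp in reversed(done):
                    comp()
                    events.append(f"compensated:{prev_name}")
            return events
    events.append("completed")
    return events


def noop() -> None:
    pass


receipt_calls = {"n": 0}


def send_receipt() -> None:
    receipt_calls["n"] += 1
    if receipt_calls["n"] < 3:
        raise RuntimeError("notification service down")


steps: List[Step] = [
    ("reserve_funds", noop, noop),
    ("wire_money", noop, noop),  # the pivot: irrevocable once it succeeds
    ("send_receipt", send_receipt, noop),
]
events = run_with_pivot(steps, pivot=1)
print(events)
# prints ['ok:reserve_funds', 'ok:wire_money', 'fail:send_receipt#1', 'fail:send_receipt#2', 'ok:send_receipt', 'completed']
```

The money stays wired even though send_receipt failed twice: past the pivot, the only directions left are forward or a human.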
Saga history retention — the regulator's concern
In regulated domains (payments, healthcare, brokerages), saga histories must be retained for years. The workflow engine's history store becomes part of the audit trail. PaisaCard retains every saga history for 8 years to satisfy RBI's transaction-record retention rules; KapitalKite retains 10 years for SEBI compliance. The retention policy interacts with the engine's archival behaviour: Temporal archives histories to S3-compatible storage past a configurable age; Step Functions retains for up to 90 days in-engine and longer in CloudWatch Logs. Designing the retention story alongside the saga is part of the design, not an afterthought.
Where this leads next
Sagas are the third leg of the workflow tripod that began with retries and timeouts. The engine takes ownership of all three: retries handle transient failures within an activity, timeouts bound the wall-clock budget, and sagas handle multi-step rollback. Together they let the workflow function express only business intent — the resilience concerns are lifted into engine configuration.
- /wiki/retries-as-a-first-class-concept — the sibling concept; compensations are activities that get the same retry treatment as forward steps.
- /wiki/temporal-and-durable-execution — the engine that makes the saga's compensation stack durable.
- /wiki/at-least-once-idempotency-in-practice — the contract every compensation must satisfy.
- /wiki/orchestration-vs-choreography — the alternative shape sagas can take.
- /wiki/timeouts-and-deadline-propagation — the timeout leg of the tripod.
The lesson the rest of this section will reinforce: the workflow engine is a state machine over your business steps, and the saga is the half of that state machine that runs backwards. Once you accept that, you stop writing defensive code in your activities and start writing reversible activities — which is what the engine wanted from you all along.
References
- Hector Garcia-Molina, Kenneth Salem, "Sagas" (1987) — the foundational paper. Surprisingly readable; the core idea has not changed.
- Temporal — Sagas in Temporal — the canonical modern implementation of compensations.
- AWS Step Functions — Saga pattern — Amazon's variant, expressed via the State Language.
- Chris Richardson, "Microservices Patterns" (2018), chapter 4 — the modern reference text for saga in distributed systems.
- DBOS — Compensating workflows — the "compensations live in Postgres" approach.
- Caitie McCaffrey, "Building Scalable, Stateful Services" — the war-story version of why sagas matter.
- /wiki/retries-as-a-first-class-concept — the lower-level chapter this one builds on.
- /wiki/temporal-and-durable-execution — the chapter on the engine that makes durable sagas possible.