Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Step Functions, Cadence, DBOS

Riya is the platform tech-lead at PaySetu. The merchant-payouts team has just been bitten by a 6-hour outage where a Kubernetes node drain ate 412 in-flight payout state machines, and the post-mortem action item is "adopt a durable-execution engine". She opens a doc, types "Temporal", then pauses. Her cofounder pings her: "Why not just use AWS Step Functions? It's already in our account." Their best backend engineer pushes back: "DBOS keeps the state in Postgres — we don't need to run another stateful service." A vendor email lands offering a managed Cadence cluster with a 99.99% SLA. Four engines, each claiming to solve the same problem, each with a wildly different operational shape. Riya needs to pick one by Friday. This chapter is what she reads first — the comparison the previous chapter on Temporal implied but did not draw.

The shortest answer, before the long one: durable-execution engines differ on two axes — where the state lives (a dedicated event store vs your own database) and how the workflow is expressed (JSON state machine vs ordinary code). Step Functions is JSON-state-machine + AWS-managed-store. Cadence and Temporal are code-as-workflow + dedicated-store (Cassandra-backed). DBOS is code-as-workflow + your-own-Postgres. The right pick depends on whether your workflows are short and AWS-native, long and code-heavy, or transactional-with-Postgres-already.

Step Functions, Cadence, and DBOS are three takes on the same idea — make a workflow function survive crashes — that disagree on the storage backend and the authoring model. Step Functions stores state in AWS and asks you to write JSON-Amazon-States-Language; Cadence (and Temporal, its descendant) runs a separate Cassandra-backed cluster and lets you write Python or Go; DBOS uses your Postgres database and lets you decorate ordinary functions. Pick the one whose operational footprint matches your team — there is no winner that dominates on every axis.

The two axes — storage backend and authoring model

Every durable-execution engine has to answer two questions: where does the workflow's recorded history live, and how does the developer describe the workflow? These two answers are nearly orthogonal, and every engine in the field can be placed on the resulting 2x2.

[Figure: Durable-execution engines on the storage-vs-authoring 2x2. A 2x2 grid: the vertical axis is the authoring model (JSON state machine at the bottom, code-as-workflow at the top); the horizontal axis is the storage backend (AWS-managed service on the left, dedicated Cassandra-backed cluster in the middle, your own database on the right). Step Functions (ASL JSON, AWS-managed) plots bottom-left; Temporal/Cadence (Python/Go SDKs, Cassandra/SQL store) and Restate (Python/TS/Java SDK, embedded RocksDB log) plot top-middle; DBOS (Python/TS decorators, your Postgres) plots top-right; Argo Workflows (YAML DAG, k8s CRDs + etcd) plots bottom-middle; Inngest (JS/Python decorators, SaaS-managed store) also appears. Illustrative.]
The same primitive (replay-against-history) lives in all of these; what differs is whether you write JSON or code, and whether the engine owns its own storage cluster or borrows yours. Illustrative.

The diagonal from bottom-left to top-right is roughly the integration depth with your application. Step Functions is the most decoupled — your workflow lives in AWS's account, your code is invoked through Lambda. DBOS is the most coupled — the workflow state is literally a row in your application's Postgres, in the same transaction as your business writes. Why this matters for the choice: the more decoupled the engine, the easier it is to drop into an existing system without changing data ownership; the more coupled, the easier it is to give workflow steps the same transactional guarantees as ordinary database writes. PaySetu's payouts can either go in the engine's store (Temporal/Step Functions) and rely on idempotency to reconcile, or go in PaySetu's own Postgres (DBOS) and atomically commit the state-update plus the workflow-step in one transaction. The second is operationally simpler when you already have a Postgres-of-record; the first is simpler when your workflow spans services that don't share a database.

Step Functions — the JSON state-machine model

AWS Step Functions, launched in 2016, is the oldest of the three and the most architecturally distinct. The workflow is a JSON document in Amazon States Language (ASL) — a declarative DSL with Task, Choice, Parallel, Map, Wait, Pass, Succeed, Fail states. Each Task state names a Lambda function (or a service-integration ARN) and the engine takes care of invoking it, persisting the result, and transitioning to the next state.

{
  "Comment": "PaySetu merchant-payout state machine",
  "StartAt": "PullSettlement",
  "States": {
    "PullSettlement": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-south-1:1234:function:pull-settlement",
      "Retry": [{"ErrorEquals": ["BankApiTimeout"], "IntervalSeconds": 2,
                 "MaxAttempts": 3, "BackoffRate": 2.0}],
      "Next": "DebitPlatform"
    },
    "DebitPlatform": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-south-1:1234:function:debit-platform",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "FailPayout"}],
      "Next": "CreditMerchant"
    },
    "CreditMerchant": {"Type": "Task", "Resource": "arn:...", "Next": "Audit"},
    "Audit":          {"Type": "Task", "Resource": "arn:...", "Next": "Done"},
    "Done":           {"Type": "Succeed"},
    "FailPayout":     {"Type": "Fail", "Error": "PayoutFailed"}
  }
}

The strengths fall out of the model. The state machine is inspectable as data — the AWS console renders it as a diagram, and a 14-step workflow with three Choice branches is legible at a glance. Retry, catch, and timeout are first-class JSON fields, not buried in code. Service integrations (arn:aws:states:::dynamodb:putItem, arn:aws:states:::sqs:sendMessage, arn:aws:states:::sns:publish.waitForTaskToken) let the engine call AWS services directly, no Lambda needed. Why this matters operationally: PaySetu's "send a confirmation SMS via SNS" step in Step Functions is a 6-line JSON block, no code. The same step in Temporal is a Python activity with retry decoration, deployed in a worker pod. For purely-AWS workflows, Step Functions is the lower-friction option — half the boilerplate disappears.
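To make the "no code for AWS-native steps" claim concrete, here is roughly what a direct SNS service-integration state looks like in ASL — an illustrative fragment to drop into a state machine's `States` map, with a hypothetical TopicArn; the `Resource` and `Parameters` fields follow the documented `sns:publish` integration shape:

```json
"NotifyMerchant": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:ap-south-1:1234:payout-confirmations",
    "Message.$": "$.confirmationText"
  },
  "Next": "Audit"
}
```

No Lambda, no worker pod — the engine calls SNS directly and records the result in the execution history.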

The weaknesses fall out of the same model. Loops are clumsy — JSON has no for, only Map (parallel iteration over a list) and recursive Choice transitions. Branching logic is verbose — what is if amount > 50000: do_extra_kyc() in code becomes a Choice state with two Variable: "$.amount", NumericGreaterThan: 50000 rules. Versioning the JSON is manual — you publish a new state-machine version, in-flight executions keep running on the old version, and there is no automatic patching API. Lock-in is total — the state machine cannot leave AWS without a rewrite. Why teams that started on Step Functions sometimes migrate off: the moment the workflow grows past about 30 states with non-trivial branching, the JSON becomes harder to reason about than the equivalent Python code. The very property that made Step Functions easy to adopt — declarative JSON — becomes the property that makes it hard to maintain at scale. Several published engineering blogs from late-2010s SaaS companies describe exactly this trajectory.
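The branching verbosity is easy to see side by side. The one-line Python condition mentioned above becomes, in ASL, something like this (illustrative state names):

```json
"CheckAmount": {
  "Type": "Choice",
  "Choices": [
    {"Variable": "$.amount", "NumericGreaterThan": 50000, "Next": "ExtraKyc"}
  ],
  "Default": "CreditMerchant"
}
```

One `if` becomes a whole named state, and every branch target must be another named state — which is exactly why state machines past a few dozen states become harder to read than the equivalent code.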

There is a second flavour of Step Functions worth knowing: Express Workflows, introduced in 2019. Standard Workflows have a 1-year max duration and exactly-once workflow execution (as AWS describes it); Express Workflows have a 5-minute max duration, at-least-once (asynchronous) or at-most-once (synchronous) execution, and an order-of-magnitude cheaper per-state-transition price. Express Workflows suit high-throughput, short-lived patterns (event-driven decisioning, IoT pipelines, request enrichment); Standard suits long-running, low-throughput orchestrations (multi-day approvals, batch processing).

Cadence — Uber's open-source ancestor of Temporal

Cadence was open-sourced by Uber in 2017. Its lineage runs Amazon's Simple Workflow Service (SWF) → Microsoft's Durable Task Framework → Cadence → Temporal (a fork started in 2019 by Cadence's original authors). The model is the code-as-workflow model walked through in the previous chapter: you write the workflow as a Go or Java function, the engine replays it against a recorded history on every worker handoff, and activities are the only things allowed to talk to the outside world.

Architecturally, Cadence runs as a stateful cluster:

  • Frontend service — handles client RPCs, multiplexes onto matching service.
  • Matching service — task-queue dispatch. Workers poll matching, matching dispatches tasks to free workers.
  • History service — owns the workflow histories. Sharded by workflow id; each shard is a single-writer over a range of workflow ids.
  • Worker service — internal workers for system tasks (history archival, replication).
  • Cassandra (or MySQL/Postgres) — the persistence backend. Histories are append-only event lists keyed by (workflow_id, run_id).
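The "sharded by workflow id, single writer per shard" point can be sketched in a few lines — a hypothetical hash-based routing, not Cadence's actual implementation:

```python
import hashlib

NUM_SHARDS = 512  # hypothetical shard count; real clusters fix this at setup time


def shard_for(workflow_id: str) -> int:
    # Stable hash -> shard index. Every history event for one workflow routes
    # to the same shard, so that shard's single writer can order the events
    # without cross-node coordination.
    digest = hashlib.sha256(workflow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# The same workflow id always lands on the same shard:
assert shard_for("P-7731") == shard_for("P-7731")
# Different workflows spread across the shard space:
print(shard_for("P-7731"), shard_for("P-7732"))
```

The flip side of this design is the hot-spot risk mentioned below: if one workflow id range gets disproportionate traffic, its shard's single writer becomes the bottleneck.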

The same model lives on in Temporal with refinements: Temporal's storage layer is more configurable (you can run it on Cassandra, MySQL, or Postgres without recompiling), the SDK landscape is broader (Go, Java, Python, .NET, TypeScript, Ruby, PHP — Cadence is mostly Go and Java), and the API surface has been cleaned up. For most new deployments, Temporal is the right choice; Cadence still runs at Uber and a handful of other shops that adopted it pre-fork. Why the fork happened: a governance disagreement at Uber over the open-source roadmap led the original authors (Maxim Fateev, Samar Abbas) to start Temporal as an independent company in 2019. The technical core is the same — replay against an event history — but Temporal's funding and 24/7 maintenance have made it the de-facto choice in 2024–2026. If you read "Cadence" in a 2024+ engineering blog, the writer almost always means Temporal.

The honest operational picture: a Cadence or Temporal deployment is a stateful cluster that has to be operated. You run 4-7 frontend pods, 4-7 matching pods, 4-7 history pods, and a Cassandra ring (or a Postgres primary + replicas). At KapitalKite, the broker that runs ~80M order-related workflows per day, the Temporal cluster is itself a tier-0 service — a Cassandra OOM or a history-shard hot-spot is a P0 incident. The trade-off Riya needs to internalise: Temporal removes workflow-level operational pain at the cost of adding cluster-level operational pain.

DBOS — your Postgres is the workflow log

DBOS, founded in 2023 by Mike Stonebraker (Ingres, Postgres, Vertica) and Matei Zaharia (Spark), takes the radically different position that the application's own Postgres database should be the workflow log. There is no separate stateful cluster. The library decorates ordinary Python or TypeScript functions and writes the workflow's event history to tables in the developer's existing Postgres.

# DBOS-flavoured Python — illustrative, not a real-cluster run
from dbos import DBOS

@DBOS.workflow()
def merchant_payout(merchant_id: str, amount: int) -> str:
    settlement_id = pull_settlement(merchant_id)            # @Step
    debit_platform(amount)                                  # @Step
    credit_merchant(merchant_id, amount)                    # @Step
    write_audit(settlement_id, "credited")                  # @Step
    return f"paid {amount} to {merchant_id}"

@DBOS.step()
def debit_platform(amount: int) -> dict:
    # ordinary Postgres write — runs in the same transaction
    # that DBOS uses to record the step result
    return db.execute("UPDATE platform_balance SET ...").fetchone()

The clever part is the same-transaction commit. When debit_platform returns, DBOS records the step result and the application's own writes in a single Postgres transaction. If the transaction commits, the side-effect happened and the engine knows it. If the transaction aborts (worker crash, connection death), the side-effect did not happen and the engine will re-execute the step on replay. There is no window where the engine and the database disagree. Why this is a meaningful improvement over Temporal for transactional workloads: in Temporal, the activity does its work (a Postgres write) and then the engine records the activity result (in Cassandra). The window between those two events is small but non-zero — if the activity worker crashes after the Postgres write but before the engine records the success, replay will run the activity again, and only idempotency keys save you. In DBOS, the engine record and the Postgres write are the same transaction. There is no window. Idempotency keys remain useful for external side-effects (HTTP calls, email sends), but for in-database steps the story is much cleaner.

The trade-off is the same coin from the other side. DBOS requires that the steps' side-effects live in the same Postgres database as the workflow log. A workflow whose steps span six microservices each with its own database cannot use DBOS's same-transaction trick — the steps that talk to other services have to be marked as external, and external steps revert to the at-least-once-with-idempotency model. For a monolith-on-Postgres team, DBOS is dramatically simpler than Temporal. For a service mesh with N independent stores, DBOS's killer feature does not apply, and the comparison reduces to "another Python decorator vs a separate cluster".

A toy comparison — the same workflow in three styles

The cleanest way to feel the difference is to write the same merchant-payout workflow in three skeletons and watch the structural differences. The version below runs locally — it does not need a real engine, but it shows how each style records its history.

import json, sqlite3, contextlib, random

# ---------- Skeleton 1: Step-Functions style — execute a JSON state machine
asl = {
    "StartAt": "Debit",
    "States": {
        "Debit":  {"Type": "Task", "Fn": "debit",  "Next": "Credit"},
        "Credit": {"Type": "Task", "Fn": "credit", "Next": "Audit"},
        "Audit":  {"Type": "Task", "Fn": "audit",  "Next": "Done"},
        "Done":   {"Type": "Succeed"},
    }
}
def run_asl(machine, fn_table, history):
    state = machine["StartAt"]
    while True:
        s = machine["States"][state]
        if s["Type"] == "Succeed": return "DONE"
        if s["Type"] == "Task":
            if state not in history:                  # short-circuit on replay
                history[state] = fn_table[s["Fn"]]()  # run + record
            state = s["Next"]

# ---------- Skeleton 2: Temporal style — replay code against an event list
def run_temporal(workflow_fn, fn_table, history):
    class Ctx:
        def activity(self, name):
            if name in history:                       # cached: short-circuit
                return history[name]
            history[name] = fn_table[name]()          # run + record
            return history[name]
    return workflow_fn(Ctx())

def payout_workflow(ctx):
    ctx.activity("debit"); ctx.activity("credit"); ctx.activity("audit")
    return "DONE"

# ---------- Skeleton 3: DBOS style — record step + side-effect in one txn
@contextlib.contextmanager
def txn(db):
    try: yield db; db.commit()
    except Exception: db.rollback(); raise

def run_dbos(db, fn_table, wf_id):
    db.execute("CREATE TABLE IF NOT EXISTS hist(wf TEXT, step TEXT, result TEXT)")
    def step(name):
        cur = db.execute("SELECT result FROM hist WHERE wf=? AND step=?", (wf_id, name))
        r = cur.fetchone()
        if r: return json.loads(r[0])
        with txn(db):
            res = fn_table[name]()
            db.execute("INSERT INTO hist VALUES(?,?,?)",
                       (wf_id, name, json.dumps(res)))
        return res
    step("debit"); step("credit"); step("audit")
    return "DONE"

fns = {
    "debit":  lambda: {"txn": f"D{random.randint(1000,9999)}"},
    "credit": lambda: {"txn": f"C{random.randint(1000,9999)}"},
    "audit":  lambda: {"audit_id": random.randint(10000,99999)},
}
print("ASL:     ", run_asl(asl, fns, history={}))
print("TEMPORAL:", run_temporal(payout_workflow, fns, history={}))
db = sqlite3.connect(":memory:")
print("DBOS:    ", run_dbos(db, fns, wf_id="P-7731"))
print("DBOS hist:", db.execute("SELECT * FROM hist").fetchall())

Sample run:

ASL:      DONE
TEMPORAL: DONE
DBOS:     DONE
DBOS hist: [('P-7731', 'debit',  '{"txn": "D7283"}'),
            ('P-7731', 'credit', '{"txn": "C2916"}'),
            ('P-7731', 'audit',  '{"audit_id": 51894}')]

Walk through the structural differences. run_asl drives a JSON document — the workflow is data, and the engine interprets it. run_temporal calls a Python function that uses a context object to record activities — the workflow is code, replayed against a history. run_dbos is the same code-as-workflow shape, but the recording is an INSERT inside the same transaction as the side-effect. Why these three skeletons are not equivalent in failure mode: imagine a process crash between the side-effect and the recording. In run_temporal, the side-effect is a real Postgres write inside fn_table["debit"](), and the engine appends to history afterwards — if you crash in between, the restart sees no history entry, replays, and runs the side-effect twice. In run_dbos, the side-effect and the history append are wrapped in a single txn block — Postgres atomicity means both happen or neither happens. The 4-line txn context manager is the difference between "needs idempotency keys for in-database steps" and "the engine and the database cannot disagree".
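The crash-between-effect-and-record argument can be demonstrated with the DBOS-style skeleton. This standalone sketch runs a workflow, "crashes" it after the first step, restarts it against the same database, and counts how many times each side-effect actually ran — the history table suppresses re-execution on replay:

```python
import json, sqlite3, contextlib

calls = {"debit": 0, "credit": 0}   # counts real side-effect executions


def debit():
    calls["debit"] += 1
    return {"txn": "D1"}


def credit():
    calls["credit"] += 1
    return {"txn": "C1"}


fns = {"debit": debit, "credit": credit}


@contextlib.contextmanager
def txn(db):
    try:
        yield db
        db.commit()
    except Exception:
        db.rollback()
        raise


def make_step(db, wf_id):
    # One "process": step() checks the history table before running anything.
    def step(name):
        row = db.execute("SELECT result FROM hist WHERE wf=? AND step=?",
                         (wf_id, name)).fetchone()
        if row:
            return json.loads(row[0])        # replay: cached, no side-effect
        with txn(db):                        # side-effect + record, atomically
            res = fns[name]()
            db.execute("INSERT INTO hist VALUES(?,?,?)",
                       (wf_id, name, json.dumps(res)))
        return res
    return step


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hist(wf TEXT, step TEXT, result TEXT)")

step = make_step(db, "P-1")
step("debit")                # attempt 1 runs debit...
# ...and the process "crashes" here, before reaching credit.

step = make_step(db, "P-1")  # restart: new process, same database
step("debit")                # replayed from history — debit NOT re-run
step("credit")               # runs for the first time
print(calls)                 # {'debit': 1, 'credit': 1}
```

Each side-effect ran exactly once across the crash, with no idempotency key in sight — the point of the same-transaction design.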

Where this shows up in production

CricStream's video-encoding pipeline runs on Temporal — the workflow is multi-day-spanning (a 4-hour match recording, then 30-day retention checks, then archive-to-cold-storage), the steps span seven different services, and there is no single database to anchor the workflow to. Temporal's Cassandra-backed cluster is the right shape for this — each workflow has a 30-day-long event history, the cluster carries 800K active workflows at peak, and the workflow code is real Go talking to seven gRPC services.

PaySetu's settlement reconciliation runs on DBOS. Every settlement is a row in PaySetu's own Postgres; every step of the reconciliation (fetch bank statement, match transactions, debit/credit ledger, write audit row) is also a write to that same Postgres. Using DBOS here means the workflow's step record and the ledger update are one transaction. Reconciliation is mathematically simpler — there is no "did the step happen but the engine didn't record it" race. The cost is that PaySetu's auth-token refresh workflow (which talks to four external services, none in PaySetu's Postgres) is a worse fit for DBOS and has been left on a small Temporal cluster.

A small-scale e-commerce shop on the BharatBazaar platform runs the order-fulfilment workflow on Step Functions. The workflow has 11 states, no loops, calls 6 Lambda functions and 3 service integrations (SQS, SNS, DynamoDB). The total state-machine JSON is 240 lines. The team is 4 engineers; running a Cassandra cluster would be operationally absurd. Step Functions is the right answer because the workflow is short, branching is shallow, and AWS-native is genuinely an asset. Why this is the realistic picture: there is no engine that wins everywhere. The "right" engine is a function of (1) does your workflow live in one database or many, (2) how many engineers can you afford to dedicate to operating a stateful cluster, (3) how AWS-native is your stack already, (4) how often does the workflow logic change. PaySetu, CricStream, and the BharatBazaar shop ended up at three different engines because they have three different answers to those four questions.

[Figure: Where each engine stores the workflow history. Three side-by-side panels show the data path between the application and the history store. Panel 1 (Step Functions): the application's Lambda on the left; an arrow crosses an AWS-account boundary to the managed Step Functions history on the right — two systems, eventual sync, the store opaque to you. Panel 2 (Temporal/Cadence): the application's worker on the left; an arrow crosses a cluster boundary to a separate Temporal cluster (frontend/matching, history shards, Cassandra) — two systems, you operate both. Panel 3 (DBOS): the application's process talks to a single Postgres holding both the app tables and the dbos history, with one transaction wrapping both — one system, atomic, no boundary. Reading left-to-right, the workflow log moves from "in someone else's account" to "in your own cluster" to "in your own database, in the same row write"; each step removes one boundary at the cost of constraining where the workflow's side-effects can land. Illustrative.]
Three different answers to "where does the workflow history live?". Step Functions externalises it to AWS, Temporal externalises it to a cluster you operate, DBOS keeps it in the same database as your application — letting the step-record and the side-effect commit atomically. Illustrative.

Common confusions

  • "Step Functions and Temporal solve the same problem so they're interchangeable" — They solve overlapping problems. Step Functions is at home with short, AWS-native, branch-light workflows up to a year long; Temporal is at home with long, code-heavy, branch-rich workflows. Picking Step Functions for a 200-state workflow with deep branching is feasible but painful; picking Temporal for a 6-state AWS-only workflow is operational overkill.
  • "DBOS is just Temporal with a different database" — It is structurally different. Temporal's history is in a separate cluster the engine owns; DBOS's history is in the application's own database, in the same transaction as the application's writes. That same-transaction commit is the entire reason DBOS exists; reading it as a "different storage backend" misses the architectural point.
  • "Cadence is dead, ignore it" — Cadence still runs production workloads at Uber and a handful of post-fork holdouts. The protocol is alive. New deployments should pick Temporal (the actively-maintained fork by the original authors); existing Cadence deployments are not in immediate danger but are a migration target whenever the team has budget.
  • "Step Functions Express Workflows are just cheaper Standard Workflows" — They are functionally distinct. Express has a 5-minute hard cap, supports at-most-once mode, has a different SDK surface (no .waitForTaskToken-style synchronous integration with humans-in-the-loop), and uses an entirely different storage tier. Picking Express for a 30-minute workflow does not work — the engine kills the execution at 5 minutes.
  • "You can't do versioning in Step Functions" — You can publish a new state-machine version, and in-flight executions keep running on the version they were started with. What you cannot do is patch an in-flight execution to a new version mid-flight (Temporal's workflow.patched API). Step Functions makes you wait for the old version's executions to drain naturally.
  • "DBOS removes the need for idempotency keys" — Only for in-database steps. Steps that hit external services (HTTP, email, SMS, third-party APIs) still need idempotency keys for the same reason as Temporal — DBOS cannot wrap an SMS send in a Postgres transaction.
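The idempotency-key discipline for external side-effects, in miniature — the dedup store here stands in for a provider- or gateway-side table, and all names are illustrative:

```python
sent = []          # stands in for the SMS provider's actual side-effects
seen_keys = set()  # provider- or gateway-side deduplication store


def send_sms_idempotent(wf_id: str, step_name: str, payload: str) -> str:
    # The key is deterministic across retries and replays: same workflow,
    # same step -> same key, so a re-executed step cannot double-send.
    key = f"{wf_id}:{step_name}"
    if key in seen_keys:
        return "duplicate-suppressed"
    seen_keys.add(key)
    sent.append(payload)
    return "sent"


# The engine re-executes the step after a crash — same key, one SMS:
send_sms_idempotent("P-7731", "confirm-sms", "payout done")
send_sms_idempotent("P-7731", "confirm-sms", "payout done")
print(len(sent))  # 1
```

This is the model every engine in this chapter falls back to the moment a step leaves the database the engine can see.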

Going deeper

The Cadence paper's framing — why "code-as-workflow" beat "JSON-state-machine"

The Cadence design paper (Fateev & Abbas, 2018) explicitly argues against the JSON-state-machine model. Their case is that real workflows have arbitrary control flow — early returns, dynamic branching on activity output, conditionally-launched child workflows, fan-out-fan-in over runtime-determined collections. JSON state machines can express all of this, but the expressions become incomprehensible past about 50 states. A Python function can express the same logic in a tenth of the line count, with the language's own tooling (linters, debuggers, type-checkers) for free. Why this trade-off has shifted over time: in 2016 (Step Functions launch) the appeal of a declarative state machine was strong because the AWS console could render it visually and non-programmer stakeholders could read the workflow. In 2024+ that appeal is weaker — engineers are comfortable reading code, and tools like Temporal's web UI render the running workflow's history as well as Step Functions renders the state-machine definition. The pendulum has swung toward code.

Saga patterns and how the three engines support them

A saga needs forward steps and compensating steps, with the engine guaranteeing that if a forward step fails, all previously-executed forward steps' compensations run in reverse order. Step Functions does this via Catch clauses + manually-written compensation states (you write the compensation as another Task); the structure is explicit but verbose. Temporal does it via the Saga helper class — you append compensation activities to the saga as you go, and a try/except catch invokes them in reverse. DBOS does it via the same code-as-workflow style with the additional benefit that the compensation can be an in-database rollback inside the same transaction (when the saga is single-database). The cleanest code-shape for sagas is Temporal's; the cleanest semantics for single-database sagas is DBOS's.
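The compensation bookkeeping that Temporal's Saga helper performs (and that you hand-roll in the other two engines) can be sketched in a few lines — illustrative, not any engine's real API:

```python
class Saga:
    # Record an undo for each completed forward step; on failure,
    # run the undos in reverse order.
    def __init__(self):
        self.compensations = []

    def ran(self, undo):
        self.compensations.append(undo)

    def compensate(self):
        for undo in reversed(self.compensations):
            undo()


log = []
saga = Saga()
try:
    log.append("debit");  saga.ran(lambda: log.append("undo-debit"))
    log.append("credit"); saga.ran(lambda: log.append("undo-credit"))
    raise RuntimeError("audit step failed")   # third forward step fails
except RuntimeError:
    saga.compensate()

print(log)  # ['debit', 'credit', 'undo-credit', 'undo-debit']
```

The reverse order matters: the compensation for the most recent forward step runs first, unwinding the workflow the way a stack unwinds.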

Why "exactly-once execution" is a marketing claim everywhere

All three engines describe themselves as offering "exactly-once execution" or "exactly-once semantics". The honest version: they offer at-least-once execution of activities with exactly-once-recording of activity results. The activity itself can run more than once if a worker crashes mid-execution; the engine guarantees that the recorded result is exactly the one observed once. Activities that have external side-effects (HTTP, email) still need idempotency keys, and if you forget them, you get duplicate side-effects. The "exactly-once" badge sells the ideal, not the reality. See /wiki/rpc-semantics-at-most-once-at-least-once-exactly-once for the deeper treatment.
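The window the marketing elides can be shown with a Temporal-style toy: the side-effect happens first, the engine records it second, and a crash in between means the replay duplicates the effect. Illustrative, not any engine's real code path:

```python
effects = []   # external side-effects actually performed (e.g. emails sent)
history = {}   # engine-side record (Cassandra, in the real system)


def run_activity(name: str, crash_before_record: bool = False) -> str:
    if name in history:               # replay: result recorded, skip execution
        return history[name]
    effects.append(name)              # the side-effect happens...
    if crash_before_record:
        raise RuntimeError("worker died before the result was recorded")
    history[name] = "ok"              # ...and only then does the engine record it
    return "ok"


try:
    run_activity("send-email", crash_before_record=True)
except RuntimeError:
    pass                              # the engine notices the worker is gone...
run_activity("send-email")            # ...and replays the activity

print(effects)                        # the email went out twice
```

Exactly-once recording, at-least-once execution: the history ends up with one entry, but the mailbox ends up with two emails — unless the send carried an idempotency key.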

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# The toy three-style comparison above is self-contained — save it as
# durable_compare.py and run:
python3 durable_compare.py
# Expected: three "DONE" lines and the DBOS history showing three rows.

# To run a real Temporal cluster locally:
brew install temporal
temporal server start-dev
# Then in another terminal:
pip install temporalio
# and follow the Temporal Python SDK quickstart.

# To run a real DBOS demo locally:
pip install dbos
dbos init demo && cd demo && dbos start
# DBOS will spin up a Postgres or use an existing one.

Where this leads next

The three engines are different surface treatments of the same primitive — replay-against-history — covered in /wiki/temporal-and-durable-execution. The next chapters in this part build on this engine-level material.

The takeaway Riya carries into Friday's decision meeting: durable execution is one primitive with three operational shapes. Pick the shape that matches your team's existing operational footprint, not the shape with the best marketing page.
