Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: orchestration is its own layer
It is 02:14 on a Tuesday and Reema, the on-call at PaySetu, is staring at a Slack thread three hours deep. A merchant onboarding flow that used to be one function call has, over the past year, been "decomposed" into eleven asynchronous steps. KYC fetches a document. A consumer publishes to kyc-events. Another consumer enriches with a sanctions check. Another publishes to merchant-created. Another sends a welcome email. Another creates the gateway-side settlement account. Another seeds the dashboard. Each of those eleven consumers is, individually, a clean little service with retries and a DLQ. None of them is wrong. And yet 1,847 merchants signed up tonight and only 1,402 have a working dashboard. The remaining 445 are scattered across nine different DLQs in five different brokers, each in a different mid-state — KYC done but settlement account missing, settlement account created but dashboard unseeded, welcome email sent but profile row absent. There is no single place Reema can open to see "what is the state of merchant m-7731". The state is smeared across eleven topics, three databases, and four DLQs, and reconstructing it for one merchant takes 40 minutes of kafka-console-consumer and SQL joins. By 02:50 she gives up and writes a one-line message: "we have an orchestration problem".
This is the wall this part has been building toward. The previous seven chapters — queues, streams, exactly-once, at-least-once, Kafka, Pulsar, retries and DLQs — each made a single hop reliable. None of them made a multi-step business workflow reliable. The reliability of a workflow is not the product of the reliabilities of its hops. It is its own concern, with its own state, its own failure modes, and — eventually — its own dedicated layer in your stack. This chapter is about why that layer keeps emerging on its own if you don't plan for it, and what it means to plan for it.
A reliable hop is not a reliable workflow. When a business process is decomposed into N async steps, you have implicitly built an N-step state machine; the question is only whether that state machine lives in a tool you can introspect or is smeared across ten topics and four DLQs. Choreography (services react to events) scales to 3–5 steps; after that it becomes archaeologically opaque. Orchestration (a coordinator drives the steps) is the explicit answer: tools like Temporal, Cadence, AWS Step Functions, and Airflow exist precisely because hand-rolled multi-step async workflows fail in the same ways, and every team rediscovers those failures for itself.
The wall — when N hops stop composing
The seductive thing about messaging is that each hop is independently reasonable. Each consumer has a retry policy, a DLQ, an idempotency key, an at-least-once contract. You read the DLQ chapter and felt, correctly, that you understood how to make one hop reliable. Now decompose a business process into eleven such hops and ask three questions about it. None of them have local answers.
Question 1: where is the workflow right now? A merchant signed up at 21:14:02. It is 22:11. Where is m-7731? In a chunky monolith you would SELECT status FROM merchants WHERE id='m-7731' and get one row back. In an eleven-hop pipeline, "where it is" means which of the eleven consumers has and has not yet processed this merchant's event. That information lives in eleven separate consumer-group lag positions on three brokers, plus four DLQs. There is no single query for it. The first time a customer-success agent has to answer "is my merchant set up yet?" you discover this. Why: each consumer's offset is private state for that consumer; nothing in messaging requires consumers to publish their progress in a queryable form. The hop-level abstraction does not expose workflow-level state, by design — that is what the abstraction is buying you, and also what is now missing.
Question 2: when does a workflow time out? The settlement-account step normally takes 4 seconds. Tonight it is taking 90 seconds because the downstream is slow. Is that fine (the hop will eventually retry-and-succeed) or is that a workflow timeout (the customer is staring at a half-loaded dashboard and we should fail visibly rather than fail invisibly)? At the hop level, the answer is "it'll succeed eventually". At the workflow level, the customer left 60 seconds ago and the half-state will sit forever. There is no consumer in the chain that knows what the workflow SLA was, because no consumer was told.
Question 3: how do you compensate? The welcome email got sent. The settlement account creation now fails permanently with a sanctions match. To roll the workflow back, you need to un-send the welcome email — or, more accurately, send a "we're sorry, your account is on hold" email. But the consumer that sends the welcome email has no record that it sent one for m-7731 — or it does, in its own private store. To compensate, the system needs to know what was done. Sagas (see /wiki/saga-pattern-compensating-actions) formalise this with explicit compensating actions, but the saga itself is a workflow — and someone has to drive it.
The wall is that all three questions ("where is it", "when does it time out", "how do we compensate") have workflow-scoped answers and hop-scoped infrastructure. You have the wrong layer for the right question.
Choreography vs orchestration — the only architectural choice that matters here
There are exactly two ways to compose multiple async steps into a workflow. The community names are choreography and orchestration, and the choice between them is the choice this whole part has been deferring.
In choreography, each service publishes events when its work is done; downstream services subscribe and react. There is no central coordinator. The KYC service publishes kyc-completed; the sanctions service consumes it, does its work, publishes sanctions-cleared; the settlement service consumes that, does its work, publishes account-created; and so on. Each arrow in the diagram is local. The whole workflow is implicit in which services subscribe to which events. Why this is appealing: there is no single point of failure, no coordinator to scale, and each service knows only about its inputs and outputs. Adding a step is "deploy a new consumer"; nobody else has to know.
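To make that shape concrete, here is a minimal in-process sketch of choreography (hypothetical topic names, with a dict of handlers standing in for real broker subscriptions):

import collections

SUBSCRIBERS = collections.defaultdict(list)   # topic -> handlers; a stand-in for broker subscriptions

def subscribe(topic, handler):
    SUBSCRIBERS[topic].append(handler)

def publish(topic, event):
    for handler in SUBSCRIBERS[topic]:
        handler(event)

# Each service knows only its input and output topics; nothing owns the whole flow.
subscribe("merchant-signed-up", lambda e: publish("kyc-completed", {**e, "kyc": "ok"}))
subscribe("kyc-completed", lambda e: publish("sanctions-cleared", {**e, "sanctions": "ok"}))
subscribe("sanctions-cleared", lambda e: publish("account-created", {**e, "acct": "ACT-" + e["merchant_id"]}))
subscribe("account-created", lambda e: print("dashboard seeded for", e["merchant_id"]))

publish("merchant-signed-up", {"merchant_id": "m-7731"})
# The workflow appears nowhere in this code: it is implicit in the subscription
# graph, which is exactly what makes it hard to introspect later.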
In orchestration, a single coordinator owns the workflow. The coordinator calls KYC, waits for the result, calls sanctions, waits for the result, calls settlement, waits for the result. The coordinator has the workflow definition and the workflow state; the individual services are dumb workers that do one job and report back. Why this is appealing: the workflow is one piece of code, the state is one row in a database, and "where is m-7731?" is SELECT * FROM workflow_state WHERE id='m-7731'. The coordinator can timeout, retry, compensate, and report — all centrally.
The trade-off is real and has a sharp inflection. Choreography is genuinely simpler for 2–4 steps because the workflow is small enough to fit in one engineer's head, the failures are local enough to debug from one service's logs, and the coupling cost (which service subscribes to which event) is low. Past about 5 steps, choreography hits the wall described above: the workflow becomes archaeologically opaque, the state is irrecoverable without a custom tool, and every new step requires someone to mentally simulate the entire fan-out to predict what will happen. PaySetu's eleven-step merchant onboarding was choreographed; the team had not noticed they had passed the inflection until the night Reema was paged. Orchestration's overhead — running a workflow engine, modelling steps as activities — pays for itself the moment the workflow has a single owner who needs to answer "where is it now".
The wrong answer to this trade-off is "let's do choreography because it's decoupled" without measuring whether you have a workflow or a pipeline. A pipeline is a one-way fire-and-forget chain (analytics events landing in a warehouse). A workflow is a transaction with a customer-visible outcome (a merchant being onboarded). Pipelines stay choreographed forever. Workflows past 5 steps almost always end up needing orchestration.
Modelling a workflow as a state machine — runnable
The cleanest way to feel why orchestration is its own layer is to write a tiny state-machine-driven workflow coordinator and watch it absorb the concerns the messaging layer cannot.
import uuid, time, random
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    PENDING = "pending"
    KYC = "kyc"
    SANCTIONS = "sanctions"
    SETTLE = "settle"
    DASH = "dash"
    DONE = "done"
    FAILED = "failed"
    COMPENSATING = "compensating"

@dataclass
class Workflow:
    wid: str
    merchant_id: str
    state: State = State.PENDING
    history: list = field(default_factory=list)
    deadline: float = 0.0
    error: str = ""

# Activities — each is one "hop". They can fail; the coordinator handles that.
def kyc(wf):
    if random.random() < 0.05:
        raise TimeoutError("kyc-svc 503")       # transient: worth retrying
    return {"kyc_id": "KYC-" + wf.merchant_id}

def sanctions(wf, kyc_id):
    if wf.merchant_id.endswith("BAD"):
        raise ValueError("sanctions hit")       # terminal: never retry
    return {"cleared": True}

def settle(wf):
    return {"acct": "ACT-" + wf.merchant_id}

def dash(wf, acct):
    return {"dashboard": f"https://psu.app/{acct}"}

def compensate_settle(wf, acct):
    return {"closed": acct}

TRANSITIONS = [
    (State.PENDING,   State.KYC,       lambda wf, ctx: ctx.update(kyc(wf))),
    (State.KYC,       State.SANCTIONS, lambda wf, ctx: ctx.update(sanctions(wf, ctx["kyc_id"]))),
    (State.SANCTIONS, State.SETTLE,    lambda wf, ctx: ctx.update(settle(wf))),
    (State.SETTLE,    State.DASH,      lambda wf, ctx: ctx.update(dash(wf, ctx["acct"]))),
    (State.DASH,      State.DONE,      lambda wf, ctx: None),
]

def run(wf, ctx, deadline_s=10.0, max_retries_per_step=3):
    wf.deadline = time.time() + deadline_s      # workflow-level SLA, not hop-level
    for from_s, to_s, action in TRANSITIONS:
        if wf.state != from_s:
            continue
        for attempt in range(1, max_retries_per_step + 1):
            if time.time() > wf.deadline:
                wf.state = State.FAILED; wf.error = "workflow-deadline"; return
            try:
                action(wf, ctx)
                wf.state = to_s
                wf.history.append({"to": to_s.value, "attempt": attempt})
                break
            except (TimeoutError, ConnectionError) as e:    # transient: back off, retry
                wf.history.append({"err": str(e), "class": "transient", "attempt": attempt})
                time.sleep(0.05 * (2 ** (attempt - 1)))
            except Exception as e:                          # terminal: compensate, stop
                wf.history.append({"err": str(e), "class": "terminal"})
                wf.state = State.COMPENSATING
                if "acct" in ctx:       # saga style: undo the steps already done
                    compensate_settle(wf, ctx["acct"])
                wf.state = State.FAILED; wf.error = str(e); return
        else:   # all retries exhausted without a break: fail visibly, not silently
            wf.state = State.FAILED; wf.error = "retries-exhausted"; return

for mid in ["m-7731", "m-9001-BAD", "m-7732"]:
    wf = Workflow(wid=str(uuid.uuid4())[:8], merchant_id=mid)
    ctx = {}
    run(wf, ctx)
    print(f"{mid:14s} -> {wf.state.value:10s} steps={len(wf.history)} err={wf.error or '-'}")
Sample run:
m-7731         -> done       steps=5 err=-
m-9001-BAD     -> failed     steps=3 err=sanctions hit
m-7732         -> done       steps=6 err=-
The shape of this run is the whole point. m-7731 completes all five transitions cleanly. m-9001-BAD advances through KYC, hits a terminal sanctions match, the coordinator runs the saga's compensating action (closes the half-created account, if any), and the workflow ends in failed with a recorded error. m-7732 completes but with steps=6 — one transient failure was retried and recovered, all in one workflow record. Now answer the three questions from the wall section: where is it? — the state field. Has it timed out? — compare now() to deadline. How do we compensate? — the coordinator does it inline, with full context.
This small state machine is doing what would otherwise be smeared across four consumers, three DLQs, and a 40-minute SQL session. Why this works where choreography did not: the workflow is now a first-class object in the system. It has an id, a state, a history, a deadline, and an error. None of those are properties of any single message or any single consumer; they are properties of the workflow as a whole, and only a coordinator-shaped thing can hold them.
Why hand-rolled coordinators turn into Temporal
The state machine above is what every team writes the first time they hit the orchestration wall. It works for a week. Then it grows.
It needs persistence: when the coordinator process restarts mid-workflow, you want it to resume from the last completed transition. Now you are writing every state change to a database with idempotency keys. You are inventing event sourcing.
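A sketch of where that leads, with SQLite standing in for whichever database you would actually use (table and column names are hypothetical):

import sqlite3

# Persist every transition so a restarted coordinator can resume from the
# last completed step instead of starting over.
db = sqlite3.connect("workflows.db")
db.execute("""CREATE TABLE IF NOT EXISTS transitions (
    wid TEXT, seq INTEGER, state TEXT, ctx_json TEXT,
    PRIMARY KEY (wid, seq))""")     # (wid, seq) doubles as an idempotency key

def record(wid, seq, state, ctx_json):
    # INSERT OR IGNORE: writing the same transition twice is a no-op
    db.execute("INSERT OR IGNORE INTO transitions VALUES (?, ?, ?, ?)",
               (wid, seq, state, ctx_json))
    db.commit()

def last_state(wid):
    row = db.execute("SELECT state, ctx_json FROM transitions WHERE wid = ? "
                     "ORDER BY seq DESC LIMIT 1", (wid,)).fetchone()
    return row   # None means the workflow never started; resume from PENDING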
It needs timers: "wait 24 hours then send a follow-up email if the merchant hasn't logged in" is a step. You need durable timers that survive coordinator restarts. Now you are writing a timer-fired-at table with a polling worker. You are inventing a scheduler.
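Again a hypothetical sketch of the shape this takes: a fire-at table plus a polling worker that survives restarts.

import sqlite3, time

db = sqlite3.connect("timers.db")
db.execute("""CREATE TABLE IF NOT EXISTS timers (
    wid TEXT PRIMARY KEY, fire_at REAL, action TEXT, fired INTEGER DEFAULT 0)""")

def set_timer(wid, delay_s, action):
    # A durable timer is just a row with a timestamp; no in-process sleep().
    db.execute("INSERT OR REPLACE INTO timers VALUES (?, ?, ?, 0)",
               (wid, time.time() + delay_s, action))
    db.commit()

def poll_once():
    due = db.execute("SELECT wid, action FROM timers "
                     "WHERE fired = 0 AND fire_at <= ?", (time.time(),)).fetchall()
    for wid, action in due:
        print(f"firing {action} for {wid}")   # e.g. 'send-follow-up-email'
        db.execute("UPDATE timers SET fired = 1 WHERE wid = ?", (wid,))
    db.commit()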
It needs versioning: the workflow definition changes ("add an SMS step before email"), but there are 2,000 workflows currently in flight on the old definition. You can't just deploy the new code; you need to keep both versions running. You are inventing workflow versioning, which is harder than schema versioning.
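The usual shape of the fix, sketched here with hypothetical definition lists: pin every in-flight workflow to the version it started on.

TRANSITIONS_V1 = ["kyc", "sanctions", "settle", "dash"]          # the original definition
TRANSITIONS_V2 = ["kyc", "sanctions", "settle", "sms", "dash"]   # adds the SMS step

DEFINITIONS = {1: TRANSITIONS_V1, 2: TRANSITIONS_V2}

def transitions_for(started_version):
    # In-flight workflows keep the definition they started with; only
    # newly started workflows pick up version 2.
    return DEFINITIONS[started_version]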
It needs visibility: an SRE wants a UI to see all workflows in failed state in the last hour, grouped by error class. You build it. Now you are competing with Temporal's web UI.
It needs fan-out: the workflow has a step that processes a list of 10,000 sub-items in parallel. You build worker pools and child workflows. You are reinventing Temporal's child-workflow primitive.
This is the path every team walks. Temporal, Cadence, AWS Step Functions, Azure Durable Functions, and Airflow all exist because hand-rolled orchestration converges on the same set of features and the same set of bugs. The bugs are interesting because they are not random: they are always the same five (lost timers, double-fired activities, restart-resumption races, version skew, and fan-out cardinality blow-ups). When you hear about a workflow engine, what is being sold is "we already wrote and debugged the five bugs you are about to write".
Common confusions
- "Choreography vs orchestration is a religious war" — It is not; it is a step-count threshold. Below ~4 steps, choreography is fine. Above ~5, orchestration almost always wins. Above 10, orchestration is non-negotiable. Anyone arguing otherwise has not personally been Reema at 02:50.
- "Sagas are an alternative to orchestration" — No, sagas are a pattern for compensation that almost always requires orchestration to drive them. The compensation order matters, and the only thing that knows the order is the coordinator. See /wiki/saga-pattern-compensating-actions.
- "Airflow is a workflow engine like Temporal" — They are different shapes. Airflow is for batch ETL DAGs (run nightly, 1000s of tasks). Temporal/Cadence are for transactional business workflows (millions of in-flight workflows, each a few-to-many steps, with timers and external signals). Using Airflow for merchant onboarding is a category error and gets ugly fast.
- "Microservices imply choreography" — Microservices imply something coordinates the workflows; orchestration is the more often-correct choice for transactional flows. Netflix's famous "choreography over orchestration" position is from a 2014 era of mostly-pipeline workloads; their newer payment workflows are orchestrated with Conductor.
- "Orchestration is a SPOF" — A workflow coordinator is durable and highly available by construction; that is its job. The same coordinator runs for 1M workflows and survives node failures via consensus underneath (Cassandra in Cadence; configurable in Temporal). It is no more a SPOF than your database is.
- "Idempotency makes orchestration unnecessary" — Idempotency makes individual hops safe to retry. It says nothing about cross-hop state, deadlines, or compensation. The two solve different problems and stack.
Going deeper
Temporal's "history" model — why it is not a queue
Temporal does not represent workflows as messages on a queue; it represents them as an event-sourced history (activities scheduled and completed, timers fired, signals received). The workflow function is replayed deterministically from the history every time the worker picks it up. Why: replaying from a deterministic history is what makes "the worker process crashed mid-workflow and another worker resumed it" work without lost or double-executed activities — the new worker reaches the same line of code at the same point in the history, and the history tells it which activities have already been completed and what they returned. This is the same trick Raft uses for state-machine replication, applied to workflow execution. The constraint is that the workflow function must be deterministic — no reading the system clock, no random(), no direct I/O. All non-determinism is funneled through activities, which are the only things allowed to interact with the world. This constraint is the price you pay for crash-resumability, and it is what most "build our own Temporal" attempts do not anticipate.
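A toy illustration of the replay idea (not Temporal's API, just the mechanism): the workflow function is deterministic, and every side effect goes through an executor that consults the recorded history before doing real work.

class Replayer:
    def __init__(self, history):
        self.history = history      # list of recorded activity results
        self.pos = 0

    def execute(self, activity, *args):
        if self.pos < len(self.history):    # replay: use the recorded result
            result = self.history[self.pos]
        else:                               # first execution: run for real
            result = activity(*args)
            self.history.append(result)
        self.pos += 1
        return result

def onboarding(r, merchant_id):
    # Deterministic workflow code: no clocks, no randomness, no direct I/O.
    kyc_id = r.execute(lambda: "KYC-" + merchant_id)
    acct = r.execute(lambda: "ACT-" + merchant_id)
    return acct

history = []
onboarding(Replayer(history), "m-7731")   # original worker runs both activities
resumed = Replayer(history)               # crash; a new worker replays
print(onboarding(resumed, "m-7731"))      # reaches the same point, re-executes nothing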
Saga compensation in practice — why the order matters
A naive saga assumes "if step k fails, run compensations for steps k-1, k-2, ..., 1 in reverse". The reverse-order rule is right (undoing in reverse order respects the dependencies between steps), but production sagas also have to handle the fact that compensations themselves can fail. PaySetu's saga for failed onboardings has a rule: a failed compensation does not abort the saga — it tags the step as compensation-deferred and the saga continues compensating earlier steps. The deferred ones go to a human-driven retry queue. The reason is empirical: a sanctions-cleared compensation that fails because the sanctions API is down should not block the close-settlement-account compensation, because money is sitting in a half-funded account and that is the worse failure mode.
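A sketch of that rule (the step names and the injected failure are invented for illustration):

def compensate_all(completed_steps, compensations):
    deferred = []
    for step in reversed(completed_steps):      # undo in reverse order
        try:
            compensations[step]()
        except Exception as e:
            # Do not stop: a down sanctions API must not block closing
            # the half-funded settlement account.
            deferred.append((step, str(e)))
    return deferred      # goes to a human-driven retry queue

def fail(msg):
    raise ConnectionError(msg)

deferred = compensate_all(
    ["kyc", "sanctions", "settle"],
    {"kyc": lambda: print("kyc: nothing to undo"),
     "sanctions": lambda: fail("sanctions API down"),
     "settle": lambda: print("settle: account closed")},
)
print("compensation-deferred:", deferred)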
"Workflow as code" vs "workflow as YAML" vs "workflow as DAG"
Three modelling choices show up across engines.
- Workflow as code (Temporal, Cadence): the workflow is a function in a real programming language. Loops, conditionals, exceptions are first-class. Best for transactional workflows where the logic is genuinely conditional.
- Workflow as YAML (AWS Step Functions, GCP Workflows): the workflow is a declarative state machine in YAML/JSON. Easy to visualise, hard to express genuine logic in. Best for clearly-shaped workflows with simple branching.
- Workflow as DAG (Airflow, Prefect): the workflow is a directed acyclic graph of tasks. No loops, no conditional re-entry. Best for batch ETL where the shape is fixed and the data flows one-way.
A common mistake is using Airflow for transactional workflows because "we already have Airflow". The DAG model can express it clumsily, but the engine's assumptions (one run per schedule, tasks are stateless, retries reset the task) fight you on every page.
The "outbox" pattern as a pre-orchestration stepping stone
Before you adopt Temporal, there is a smaller pattern that solves the most common case: making sure a database write and a message publish happen atomically. The outbox: write the message into an outbox table in the same DB transaction as the business-data update; a separate worker reads the outbox and publishes to the broker. This guarantees "if the DB row exists, the message will eventually be published". It does not give you workflow state; it just removes the dual-write race that otherwise haunts choreographed pipelines. Many teams use the outbox as a bridge to orchestration — it removes the worst class of bugs while they evaluate whether they need a full coordinator.
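A minimal outbox sketch, with SQLite standing in for the service's database and a print standing in for the broker publish:

import sqlite3, json

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE merchants (id TEXT PRIMARY KEY, status TEXT)")
db.execute("""CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

def create_merchant(merchant_id):
    with db:   # one transaction: the business row and the outbox row commit together
        db.execute("INSERT INTO merchants VALUES (?, 'created')", (merchant_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("merchant-created", json.dumps({"merchant_id": merchant_id})))

def relay_once(publish):
    # A separate worker drains the outbox and publishes to the real broker.
    rows = db.execute("SELECT seq, topic, payload FROM outbox "
                      "WHERE published = 0").fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)     # at-least-once: a crash here means a re-publish
        db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    db.commit()

create_merchant("m-7731")
relay_once(lambda t, p: print("publish", t, p))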
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# Save the Workflow snippet as orch_demo.py and run:
python3 orch_demo.py
# Expected: m-9001-BAD fails with "sanctions hit"; m-7731 and m-7732 complete,
# sometimes with extra history entries from retried transient KYC failures
# (the simulated 5% timeout).
Where this leads next
Orchestration is the lid on Part 15 because it ties messaging back to the rest of the curriculum:
- /wiki/saga-pattern-compensating-actions — the canonical compensation pattern, which orchestration is the natural home for.
- /wiki/at-least-once-idempotency-in-practice — why every orchestrated step still needs idempotency keys at the activity level.
- /wiki/dead-letter-queues-and-retries — what the coordinator wraps; activities are still individual hops with retry policies.
- /wiki/leader-election-without-consensus — workflow coordinators themselves run multi-instance and need leader election under the hood.
The lesson the next part will inherit: a system whose state is observable in one place is operable; a system whose state is implicit across ten places is debuggable only by archaeology. Orchestration is the choice to make state observable. Most production teams arrive at it the night a Reema-style page makes the alternative untenable.
References
- Temporal — concepts and architecture — the canonical introduction to event-sourced workflow execution and replay.
- Cadence — design overview — the open-source predecessor to Temporal, built at Uber; the same core ideas in their original form.
- AWS Step Functions — state machine model — the YAML-state-machine-shaped engine and its tradeoffs.
- Netflix Conductor — Netflix's open-source orchestrator, used internally for content-pipeline workflows.
- Caitie McCaffrey, "Distributed sagas: a protocol for coordinating microservices" (2017) — the canonical saga talk; ties compensation to orchestration explicitly.
- Chris Richardson, "Pattern: Saga" (microservices.io) — the practitioner reference for the saga pattern with examples.
- /wiki/saga-pattern-compensating-actions — the next step beyond this chapter.
- Maxim Fateev (Temporal CEO), "Designing a workflow engine from first principles" (2020) — the article that explains why hand-rolled coordinators converge on the same five bugs.